
SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Haisong Liu Yao Teng Tao Lu Haiguang Wang Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University
Shanghai AI Lab
{liuhs, yaoteng, taolu, haiguangwang}@smail.nju.edu.cn, lmwang@nju.edu.cn

Abstract

Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other hand, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than their dense counterparts. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.

1. Introduction

Camera-based 3D object detection [13, 54, 31, 11, 24, 25, 40] has witnessed great progress over the past few years. Compared with the LiDAR-based counterparts [19, 56, 4, 36], camera-based approaches have lower deployment cost and can detect long-range objects.
Previous methods can be divided into two paradigms. BEV (Bird's Eye View)-based methods [13, 11, 25, 24, 40] follow a two-stage pipeline by first constructing an explicit dense BEV feature from multi-view features and then performing object detection in BEV space. Although achieving remarkable progress, they suffer from high computation cost and rely on complex view transformation operators. Another line of work [54, 31, 32] explores the sparse query-based paradigm by initializing a set of sparse reference points in 3D space. Specifically, DETR3D [54] links the queries to image features using 3D-to-2D projection. It has a simpler structure and faster speed, but its performance still lags far behind the dense ones. The PETR series [31, 32] uses dense global attention for the interaction between queries and image features, which is computationally expensive and buries the advantage of the sparse paradigm. Thus, a natural question arises: can fully sparse detectors achieve accuracy similar to that of the dense ones?

Figure 1: Performance comparison on the val split of nuScenes [1]. All methods use ResNet50 [10] as the image backbone. FPS is measured on a single RTX 3090 with the PyTorch fp32 backend. We balance accuracy and speed by reducing the number of decoder layers without re-training.
In this paper, we find that the key to obtaining high performance in sparse 3D object detection is the adaptability of the detector in both BEV and image space. In BEV space, the detector should be able to aggregate multi-scale features adaptively. Dense BEV-based detectors typically use a BEV encoder to encode multi-scale BEV features. It can be a stack of residual blocks with FPN [26] (e.g. BEVDet [13]), or a transformer encoder with multi-scale deformable attention [60] (e.g. BEVFormer [25]). For sparse detectors such as DETR3D, we argue that the multi-head self attention (MHSA) [49] among queries can play the role of the encoder, as queries are defined in BEV space. However, the vanilla MHSA has a global receptive field and lacks an explicit multi-scale design. In image space, the detector should be adaptive to different objects with different sizes and categories. This is because, although the objects have similar sizes in 3D space, they might vary greatly in size in images. However, the single-point sampling in DETR3D has a fixed local receptive field, and the sampled feature is processed by static linear layers, hindering its performance.
To this end, we present SparseBEV, a fully sparse 3D object detector that matches or even outperforms the dense counterparts. Our SparseBEV detector contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. We also propose to use pillars instead of reference points as the formulation of query, since pillars introduce better spatial priors.
We conduct comprehensive experiments on the nuScenes dataset. As shown in Fig. 1, our SparseBEV achieves 55.8 NDS at 23.5 FPS on the val split, surpassing all previous methods in both speed and accuracy. Besides, we can flexibly adjust the inference speed by reducing the number of decoder layers without re-training. On the test split, SparseBEV with the V2-99 [20] backbone achieves strong performance without using future frames or test-time augmentation. By further utilizing future frames, SparseBEV achieves 67.5 NDS, outperforming the previous state-of-the-art BEVFormerV2 [55] by 2.7 NDS.

2. Related Work

2.1. Query-based 2D Object Detection

Recently, the Transformer [49] with its attention blocks has been widely applied to computer vision tasks. In object detection, DETR [2] was the first model to predict objects based on learnable queries and treat detection as a set prediction problem. Many follow-up works [60, 45, 7] were then proposed to accelerate the convergence of DETR by using sampled features instead of global ones. For example, Deformable DETR [60] samples image features based on sparse reference points and then applies deformable attention on the features. Sparse R-CNN [45] uses RoIAlign [9] to obtain local features and then performs dynamic convolution. AdaMixer [7] combines the advantages of sampling points and dynamic convolution to further boost the convergence. DN-DETR [21] devises a denoising mechanism that feeds ground-truth bounding boxes with noise into the decoder and trains the model to reconstruct the original boxes. DINO [57] further optimizes DN-DETR [21] and DAB-DETR [30] by proposing a contrastive denoising training strategy and a mixed query selection method. Several methods are also designed to tackle the instability of the training procedure. Our work follows the query-based detection paradigm and extends it to 3D space with temporal information.

2.2. Monocular 3D Object Detection

Monocular 3D object detection takes a single image as input and outputs predicted 3D bounding boxes of objects. A significant challenge in monocular 3D object detection is how to transfer 2D features to 3D space. Several works [53, 43, 51, 39] incorporate depth information to deal with this problem. Pseudo-LiDAR [53] first estimates the depth of input images and constructs pseudo point clouds. The pseudo point clouds are then sent to a LiDAR-based detection module to predict the 3D boxes of the objects of interest. CaDDN [43] further proposes a fully differentiable end-to-end network which learns pixel-wise categorical depth distributions to predict appropriate depth intervals in 3D space. Inspired by FCOS [47], FCOS3D [51] projects 3D coordinates onto 2D images and decouples them as 2D attributes (centerness and classification) and 3D attributes (depths, sizes, and orientations).

2.3. 3D Object Detection in BEV

Bird's-eye-view (BEV) object detection [38, 42, 13, 54] aims at detecting objects in BEV space given either single-view or multi-view 2D images. Early works [44, 53, 43] typically transformed 2D features to BEV space based on single-view images and conducted monocular 3D object detection. LSS [42] takes six 2D images of different views as input and transforms them into 3D space based on depth estimation. Based on LSS, BEVDet [13] lifts 2D features to BEV space and uses a BEV encoder with residual blocks and FPN to further encode the BEV features. BEVFormer [25] proposes a spatiotemporal transformer encoder that projects multi-view and multi-timestamp input to BEV representations. To ease the optimization, BEVFormerV2 [55] introduces perspective view supervision and supervises monocular 3D object detection in parallel with BEV object detection. SOLOFusion [40] defines the localization potential for depth estimation and fuses short-term, high-resolution and long-term, low-resolution temporal stereo.
Figure 2: The overall architecture of SparseBEV, a fully-sparse camera-only 3D object detector. Queries are initialized to be a sparse set of pillars in BEV space. The scale-adaptive self attention further encodes the queries with adaptive receptive fields. Next, multi-view and multi-timestamp features are aggregated with adaptive spatio-temporal sampling and decoded by adaptive mixing. The decoder repeats $L$ times to produce the final predictions.
Inspired by DETR, another line of works [54, 31, 32] explores the sparse query-based paradigm. DETR3D [54] proposes a top-down framework starting from a learnable sparse set of reference points and refining them iteratively via 3D-to-2D queries. However, such 3D-to-2D projection hinders the receptive field of the query. To handle this, the PETR series [31, 32] uses global attention for the interaction between queries and image features, and introduces 3D positional embeddings to encode 2D features into 3D-aware representations without explicit projection. Although achieving notable improvements, the expensive dense global attention buries the advantages of the sparse paradigm and makes it difficult to utilize long-term temporal information efficiently. In contrast, we keep the fully sparse design of DETR3D and boost the performance by enhancing the adaptability of the detector.

3. SparseBEV

As shown in Fig. 2, SparseBEV is a query-based one-stage detector with $L$ decoder layers. The input multi-camera videos are processed frame-by-frame using an image backbone and FPN. Next, a set of sparse pillar queries are initialized in BEV space and aggregated by scale-adaptive self attention. These queries then interact with the image features via adaptive spatio-temporal sampling and adaptive mixing to make object predictions. We also propose a dual-branch version of SparseBEV to further enhance the temporal modeling.

3.1. Query Formulation

We first define a set of learnable queries, where each query is represented by its translation $(x, y, z)$, dimension $(w, l, h)$, rotation $\theta$ and velocity $(v_x, v_y)$. The queries are initialized to be pillars in BEV space, where $z$ is set to 0 and $h$ is set to a fixed pillar height. The initial velocity is set to 0. The other parameters are drawn from random Gaussian distributions. Following Sparse R-CNN [45], we attach a $D$-dimensional query feature to each query box to encode the rich instance characteristics.
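The query formulation above can be made concrete with a short sketch. Below is a minimal PyTorch illustration of pillar-query initialization; the query count, feature dimension, and pillar height are assumed values chosen for illustration, not the paper's exact settings.

```python
# A minimal sketch of pillar-query initialization (hypothetical hyper-parameters).
import torch
import torch.nn as nn

class PillarQueries(nn.Module):
    def __init__(self, num_queries=900, embed_dims=256, pillar_height=4.0):
        super().__init__()
        # Each query box: translation (x, y, z), dimension (w, l, h),
        # rotation theta, velocity (vx, vy)  ->  9 parameters.
        boxes = torch.randn(num_queries, 9)   # other parameters: random Gaussian init
        boxes[:, 2] = 0.0                     # z fixed to 0 (pillar sits on the BEV plane)
        boxes[:, 5] = pillar_height           # h set to a fixed pillar height (assumed value)
        boxes[:, 7:9] = 0.0                   # initial velocity (vx, vy) = 0
        self.query_boxes = nn.Parameter(boxes)
        # A D-dimensional feature attached to every query box (as in Sparse R-CNN).
        self.query_feats = nn.Parameter(torch.randn(num_queries, embed_dims))

    def forward(self):
        return self.query_boxes, self.query_feats

queries = PillarQueries()
boxes, feats = queries()   # boxes: [900, 9], feats: [900, 256]
```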

3.2. Scale-adaptive Self Attention

As mentioned above, dense BEV-based methods typically use a BEV encoder to encode multi-scale BEV features. However, in our method, since we do not explicitly build a BEV feature, how to aggregate multi-scale features in BEV space is a challenge.
In this work, we argue that the self attention can play the role of the BEV encoder, since queries are defined in BEV space. The vanilla multi-head self attention has a global receptive field and lacks the ability of local multi-scale aggregation. Thus, we propose scale-adaptive self attention (SASA), which learns appropriate receptive fields under the guidance of queries. First, we compute the all-pair distances $D \in \mathbb{R}^{N \times N}$ ($N$ is the number of queries) between the query centers in BEV space:

$$D_{ij} = \left\| c_i - c_j \right\|_2$$
Figure 3: Visualization of the learned BEV receptive field in SASA. We choose two samples from the validation split and visualize 4 heads for each sample. Queries are represented by their centers.
where $c_i$ denotes the BEV center of the $i$-th query. The attention considers not only the similarity between query features, but the distance between them as well. A toy example below shows how it works:

$$\mathrm{Attn}(Q) = \mathrm{Softmax}\!\left( \frac{Q Q^{\top}}{\sqrt{d}} - \tau D \right) Q$$
where $Q$ is the query feature itself and $d$ is the channel dimension. $\tau$ is a scalar that controls the receptive field for each query. When $\tau = 0$, it degrades to the vanilla self attention with a global receptive field. As $\tau$ grows, the attention weights for distant queries become smaller and the receptive field narrows.
In practice, the receptive field controller $\tau$ is adaptive to each query and specific to each head. Supposing there are $H$ heads, we use a linear transformation to generate head-specific $\tau_h$ from the given query feature $q$:

$$[\tau_1, \tau_2, \ldots, \tau_H] = \mathrm{Linear}(q)$$
where the weights are shared across different queries.
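A minimal PyTorch sketch of SASA is given below, assuming a single linear layer for $\tau$ and standard multi-head projections; shapes and module details beyond the formulas above are illustrative.

```python
# A minimal single-layer sketch of scale-adaptive self attention (SASA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAdaptiveSelfAttention(nn.Module):
    def __init__(self, embed_dims=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dims // num_heads
        self.qkv = nn.Linear(embed_dims, embed_dims * 3)
        self.tau = nn.Linear(embed_dims, num_heads)   # head-specific receptive-field controller
        self.proj = nn.Linear(embed_dims, embed_dims)

    def forward(self, query_feat, centers):
        # query_feat: [N, C] query features; centers: [N, 2] BEV centers (x, y)
        N, C = query_feat.shape
        dist = torch.cdist(centers, centers)                           # [N, N] all-pair BEV distances
        tau = self.tau(query_feat).permute(1, 0).unsqueeze(-1)         # [H, N, 1], one tau per query per head
        q, k, v = self.qkv(query_feat).chunk(3, dim=-1)
        q = q.view(N, self.num_heads, self.head_dim).transpose(0, 1)   # [H, N, d]
        k = k.view(N, self.num_heads, self.head_dim).transpose(0, 1)
        v = v.view(N, self.num_heads, self.head_dim).transpose(0, 1)
        attn = q @ k.transpose(-1, -2) / self.head_dim ** 0.5          # [H, N, N]
        attn = F.softmax(attn - tau * dist, dim=-1)                    # distance penalty narrows the receptive field
        out = (attn @ v).transpose(0, 1).reshape(N, C)
        return self.proj(out)

sasa = ScaleAdaptiveSelfAttention()
out = sasa(torch.randn(900, 256), torch.randn(900, 2))   # [900, 256]
```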
In our experiments, we surprisingly find that the $\tau$ for each head is learnt to distribute uniformly within a certain range regardless of the initialization. In Fig. 3, we sort the heads according to $\tau$ and visualize the attention weights for the distance part. As we can see, each head learns a receptive field different from the others, enabling the network to aggregate features in a multi-scale manner like FPN.
Scale-adaptive self attention (SASA) demonstrates the necessity of FPN, while being more flexible as it learns the scales adaptively from the query. We also find an interesting phenomenon that different categories of queries have different sizes of receptive field. For example, queries representing the bus have larger receptive field than those representing the pedestrians. More details can be found in the ablation studies.

3.3. Adaptive Spatio-temporal Sampling

For each frame, we use a linear layer to generate a set of sampling offsets $\{(\Delta x_i, \Delta y_i, \Delta z_i)\}_{i=1}^{P}$ adaptively from the query feature. These offsets are transformed to sampling points based on the query pillar:

$$P_i = \left( x + \Delta x_i \cdot w,\; y + \Delta y_i \cdot l,\; z + \Delta z_i \cdot h \right).$$
Compared with the deformable attention in BEVFormer, our sampling points are adaptive to both the query pillar and the query feature, thus better covering objects with varying sizes. Besides, these points are not restricted to lie inside the query pillar, since we do not limit the range of the sampling offsets.
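The following is a minimal sketch of the offset-to-point transformation, assuming the offsets are simply scaled by the pillar size and shifted to its center; the layer size and point count are illustrative.

```python
# A minimal sketch of generating adaptive sampling points from a query pillar.
import torch
import torch.nn as nn

class AdaptiveSampling(nn.Module):
    def __init__(self, embed_dims=256, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.offset_gen = nn.Linear(embed_dims, num_points * 3)   # (dx, dy, dz) per point

    def forward(self, query_feat, query_box):
        # query_feat: [N, C];  query_box: [N, 9] = (x, y, z, w, l, h, theta, vx, vy)
        N = query_feat.shape[0]
        offsets = self.offset_gen(query_feat).view(N, self.num_points, 3)
        center, size = query_box[:, 0:3], query_box[:, 3:6]
        # scale the offsets by the pillar dimensions, then shift to the pillar center
        points = center.unsqueeze(1) + offsets * size.unsqueeze(1)
        return points                                              # [N, P, 3] sampling points

sampler = AdaptiveSampling()
points = sampler(torch.randn(900, 256), torch.randn(900, 9))       # [900, 8, 3]
```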
Next, we perform temporal alignment by warping the sampling points according to motions. In autonomous driving, there are two types of motion: one is ego-motion and the other is object motion. Ego-motion describes the motion of the car from its own perspective as it navigates through the environment, while object motion refers to the movement of other objects in the environment as they move around the self-driving car.
Dealing with Object Motion. As mentioned above, the instantaneous velocity can be approximated by the average velocity over a short time window in self-driving. Thus, we adaptively warp the sampling points to previous timestamps using the velocity vector $(v_x, v_y)$ from the query:

$$\left[ x_t,\; y_t \right] = \left[ x_{t_0},\; y_{t_0} \right] + \left[ v_x,\; v_y \right] \cdot (t - t_0),$$
where $t$ denotes the timestamp of the previous frame and $t_0$ denotes the current timestamp. Note that $z_t$ is identical to $z_{t_0}$ because the velocity vector is defined in the $xy$ plane.
Dealing with Ego Motion. Next, we warp the sampling points based on the ego poses provided by the dataset. Points are first transformed to the global coordinate system and then to the local coordinate system of frame $t$:

$$P_t \leftarrow E_t^{-1} \, E_{t_0} \, P_t,$$
where $E_t$ denotes the ego pose of frame $t$, i.e., the transformation from the local coordinates of frame $t$ to the global coordinate system.
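A compact sketch of the two warping steps is shown below, assuming the ego poses are given as 4x4 local-to-global matrices; it is illustrative rather than the released implementation.

```python
# A minimal sketch of warping sampling points to a previous timestamp:
# first compensating object motion with the query velocity, then applying
# the ego-pose change (4x4 matrices assumed to map local -> global coordinates).
import torch

def warp_points(points, velocity, t_prev, t_cur, ego_cur, ego_prev):
    # points:   [N, P, 3] sampling points at the current timestamp t_cur
    # velocity: [N, 2]    per-query (vx, vy) in the BEV plane
    # ego_*:    [4, 4]    ego poses (local-to-global) of the two frames
    # --- object motion: move the points back in time along the velocity vector ---
    shift = velocity.unsqueeze(1) * (t_prev - t_cur)                        # [N, 1, 2]
    points = points.clone()
    points[..., :2] = points[..., :2] + shift                               # z stays unchanged
    # --- ego motion: current local -> global -> previous frame's local ---
    homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)    # [N, P, 4]
    homo = homo @ ego_cur.T                                                 # to global coordinates
    homo = homo @ torch.linalg.inv(ego_prev).T                              # to frame t_prev coordinates
    return homo[..., :3]

pts = warp_points(torch.randn(900, 8, 3), torch.randn(900, 2),
                  t_prev=0.0, t_cur=0.5, ego_cur=torch.eye(4), ego_prev=torch.eye(4))
```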
Sampling. For each timestamp, we project the warped sampling points to each view using the provided camera intrinsics and extrinsics. Since there are overlaps between adjacent views, a projected point may hit one or more views, which are termed $\mathcal{V}_{\text{hit}}$. For each hit view $v \in \mathcal{V}_{\text{hit}}$, we have $S$ multi-scale feature maps $\{ F_v^{j} \}_{j=1}^{S}$ from the image backbone.
| Method | Backbone | Epochs | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PETRv2 [32] | ResNet50 | 60 | 45.6 | 34.9 | 0.700 | 0.275 | 0.580 | 0.437 | 0.187 |
| BEVStereo [22] | ResNet50 | | 50.0 | 37.2 | 0.598 | 0.270 | 0.438 | 0.367 | 0.190 |
| BEVPoolv2 [12] | ResNet50 | | 52.6 | 40.6 | 0.572 | 0.275 | 0.463 | 0.275 | 0.188 |
| SOLOFusion [40] | ResNet50 | | 53.4 | 42.7 | 0.567 | 0.274 | 0.511 | 0.252 | 0.181 |
| Sparse4Dv2 [29] | ResNet50 | 100 | 53.9 | 43.9 | 0.598 | 0.270 | 0.475 | 0.282 | 0.179 |
| StreamPETR | ResNet50 | 60 | 55.0 | | 0.613 | 0.267 | 0.413 | 0.265 | 0.196 |
| SparseBEV | ResNet50 | 36 | 54.5 | 43.2 | 0.606 | 0.274 | 0.387 | 0.251 | 0.186 |
| SparseBEV | ResNet50 | 36 | 55.8 | 44.8 | 0.581 | 0.271 | 0.373 | 0.247 | 0.190 |
| DETR3D | ResNet101-DCN | 24 | 43.4 | 34.9 | 0.716 | 0.268 | 0.379 | 0.842 | 0.200 |
| BEVFormer | ResNet101-DCN | 24 | 51.7 | 41.6 | 0.673 | 0.274 | 0.372 | 0.394 | 0.198 |
| BEVDepth [24] | ResNet101 | | 53.5 | 41.2 | 0.565 | 0.266 | 0.358 | 0.331 | 0.190 |
| Sparse4D | ResNet101-DCN | 48 | 55.0 | 44.4 | 0.603 | 0.276 | 0.360 | 0.309 | 0.178 |
| SOLOFusion [40] | ResNet101 | | 58.2 | 48.3 | 0.503 | 0.264 | 0.381 | 0.246 | 0.207 |
| SparseBEV | ResNet101 | 24 | | | 0.562 | 0.265 | 0.321 | 0.243 | 0.195 |

Table 1: Performance comparison on the nuScenes val split. † benefits from perspective pretraining. Methods with CBGS [59] effectively elongate 1 epoch into 4.5 epochs.
These features are first sampled by bilinear interpolation in the image plane and then weighted over the scale axis:

$$f_i^{v} = \sum_{j=1}^{S} w_{ij} \cdot \mathrm{Bilinear}\!\left( F_v^{j},\, \mathrm{Proj}_v(P_i) \right),$$

where $S$ is the number of multi-scale feature maps, $\mathrm{Proj}_v$ is the projection function for view $v$, and $w_{ij}$ is the weight for the $i$-th point on the $j$-th feature map, generated from the query feature by a linear transformation.
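A single-view sketch of this projection-and-sampling step might look as follows, with the projection matrix, scale weights, and image size passed in as assumed inputs (behind-camera masking and multi-view aggregation are omitted).

```python
# A minimal single-view sketch: project 3D points with a camera matrix, bilinearly
# sample each feature-map level, and blend the levels with query-generated weights.
import torch
import torch.nn.functional as F

def sample_view(points, feats, proj_mat, scale_weights, img_size):
    # points: [N, P, 3]; feats: list of S tensors [C, H_s, W_s];
    # proj_mat: [3, 4] camera projection (intrinsics @ extrinsics);
    # scale_weights: [N, P, S] from a linear layer on the query feature;
    # img_size: (W, H) of the input images.
    N, P, _ = points.shape
    homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)   # [N, P, 4]
    uvd = homo @ proj_mat.T                                                # [N, P, 3]
    uv = uvd[..., :2] / uvd[..., 2:3].clamp(min=1e-5)                      # pixel coordinates
    grid = uv / torch.tensor(img_size) * 2 - 1                             # normalize to [-1, 1]
    sampled = []
    for lvl, feat in enumerate(feats):
        out = F.grid_sample(feat[None], grid.view(1, N, P, 2), align_corners=False)
        sampled.append(out[0].permute(1, 2, 0) * scale_weights[..., lvl:lvl + 1])
    return torch.stack(sampled).sum(0)                                     # [N, P, C]

feats = [torch.randn(256, 64, 176), torch.randn(256, 32, 88)]
out = sample_view(torch.randn(900, 8, 3), feats, torch.randn(3, 4),
                  torch.rand(900, 8, 2).softmax(-1), img_size=(704, 256))
```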

3.4. Adaptive Mixing

Given the features sampled from different timestamps and locations, the key is how to adaptively decode them under the guidance of queries. Inspired by AdaMixer [7], we introduce a simple but effective approach to decode and aggregate the spatio-temporal features with dynamic convolutions [15] and MLP-Mixer [48]. Supposing there are $T$ frames in total and $P$ sampling points per frame, we first stack them to a total of $T \times P$ points. Thus, the sampled features are organized as $f \in \mathbb{R}^{(T \cdot P) \times C}$, where $C$ is the channel dimension.
Channel Mixing. We first perform channel mixing on $f$ to enhance the object semantics. The dynamic weights are generated from the query feature $q$:

$$f_c = \mathrm{ReLU}\!\left( \mathrm{LayerNorm}\!\left( f\, W_c \right) \right), \qquad W_c = \mathrm{Linear}(q),$$
where $W_c \in \mathbb{R}^{C \times C}$ denotes the dynamic weights, which are shared across different frames and different sampling points.

Point Mixing. Next, we transpose the feature and apply the dynamic weights to its point dimension:

$$f_p = \mathrm{ReLU}\!\left( \mathrm{LayerNorm}\!\left( f_c^{\top} W_p \right) \right), \qquad W_p = \mathrm{Linear}(q),$$
where $W_p \in \mathbb{R}^{(T \cdot P) \times (T \cdot P)}$ denotes the dynamic weights, which are shared across different channels.
After channel mixing and point mixing, the spatiotemporal features are flattened and aggregated by a linear layer. The final regression and classification predictions are computed by two MLPs respectively.
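A minimal sketch of the adaptive mixing module is shown below, following the AdaMixer-style channel/point mixing described above; the exact normalization placement and layer sizes are assumptions.

```python
# A minimal sketch of adaptive mixing with query-conditioned dynamic weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMixing(nn.Module):
    def __init__(self, embed_dims=256, num_points=32):   # num_points = T frames x P points
        super().__init__()
        self.C, self.P = embed_dims, num_points
        self.gen_channel = nn.Linear(embed_dims, embed_dims * embed_dims)
        self.gen_point = nn.Linear(embed_dims, num_points * num_points)
        self.out_proj = nn.Linear(num_points * embed_dims, embed_dims)

    def forward(self, query_feat, sampled_feat):
        # query_feat: [N, C];  sampled_feat: [N, T*P, C]
        N = query_feat.shape[0]
        # channel mixing: dynamic CxC weights, shared over all sampled points
        w_c = self.gen_channel(query_feat).view(N, self.C, self.C)
        x = F.relu(F.layer_norm(sampled_feat @ w_c, [self.C]))
        # point mixing: dynamic (T*P)x(T*P) weights, shared over all channels
        w_p = self.gen_point(query_feat).view(N, self.P, self.P)
        x = F.relu(F.layer_norm((w_p @ x).transpose(1, 2), [self.P]))
        # flatten and aggregate back to the query feature dimension
        return self.out_proj(x.reshape(N, -1))

mixer = AdaptiveMixing()
out = mixer(torch.randn(900, 256), torch.randn(900, 32, 256))   # [900, 256]
```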

3.5. Dual-branch SparseBEV

Inspired by SlowFast [6], we further enhance the long-term temporal modeling by dividing the input video into a slow stream and a fast stream, resulting in a dual-branch SparseBEV. The slow stream is designed to capture fine-grained appearance details, and it operates at a low frame rate and high resolution. The fast stream is responsible for capturing long-term temporal stereo, and it operates at a high frame rate and low resolution. Sampling points are projected to the two streams respectively, and the sampled features are stacked before adaptive mixing.
In this way, we decouple the learning of static appearance and temporal motion, leading to better performance. Besides, the computation cost is significantly reduced since only a small fraction of frames needs to be processed at high resolution. However, since this dual-branch design makes the framework a little complex, we do not use it unless otherwise stated. The ablations of this part can be found in the supplementary material.
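A toy sketch of the slow/fast frame split is given below; the stride and down-scaling factor are assumed values, not the paper's settings.

```python
# A minimal sketch of the dual-branch frame split: a slow stream keeps a few
# high-resolution frames, a fast stream keeps all frames at low resolution.
import torch
import torch.nn.functional as F

def split_streams(frames, slow_stride=4, fast_scale=0.5):
    # frames: [T, V, 3, H, W] multi-camera video (T timestamps, V views)
    slow = frames[::slow_stride]                               # low frame rate, full resolution
    T, V, C, H, W = frames.shape
    fast = F.interpolate(frames.view(T * V, C, H, W),          # high frame rate, low resolution
                         scale_factor=fast_scale, mode='bilinear', align_corners=False)
    fast = fast.view(T, V, C, int(H * fast_scale), int(W * fast_scale))
    return slow, fast

slow, fast = split_streams(torch.randn(8, 6, 3, 256, 704))
# slow: [2, 6, 3, 256, 704], fast: [8, 6, 3, 128, 352]; each stream is encoded by the
# backbone separately and the sampled features are stacked before adaptive mixing.
```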
| Method | Backbone | Epochs | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DETR3D [54] | V2-99 | 24 | 47.9 | 41.2 | 0.641 | 0.255 | 0.394 | 0.845 | 0.133 |
| PETR [31] | V2-99 | 24 | 50.4 | 44.1 | 0.593 | 0.249 | 0.383 | 0.808 | 0.132 |
| UVTR [23] | V2-99 | 24 | 55.1 | 47.2 | 0.577 | 0.253 | 0.391 | 0.508 | 0.123 |
| BEVFormer [25] | V2-99 | 24 | 56.9 | 48.1 | 0.582 | 0.256 | 0.375 | 0.378 | 0.126 |
| BEVDet4D [11] | Swin-B [33] | | 56.9 | 45.1 | 0.511 | 0.241 | 0.386 | 0.301 | 0.121 |
| PolarFormer [16] | V2-99 | 24 | 57.2 | 49.3 | 0.556 | 0.256 | 0.364 | 0.440 | 0.127 |
| PETRv2 [32] | V2-99 | 24 | 59.1 | 50.8 | 0.543 | 0.241 | 0.360 | 0.367 | 0.118 |
| Sparse4D [28] | V2-99 | 48 | 59.5 | 51.1 | 0.533 | 0.263 | 0.369 | 0.317 | 0.124 |
| BEVDepth [24] | V2-99 | | 60.0 | 50.3 | 0.445 | 0.245 | 0.378 | 0.320 | 0.126 |
| BEVStereo [22] | V2-99 | | | | | | | | |