
SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Haisong Liu Yao Teng Tao Lu Haiguang Wang Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab
{liuhs, yaoteng, taolu, haiguangwang}@smail.nju.edu.cn, lmwang@nju.edu.cn

Abstract

Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other hand, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than their dense counterparts. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.

1. Introduction

Camera-based 3D Object Detection [13, 54, 31, 11, 24, 25, 40] has witnessed great progress over the past few years. Compared with the LiDAR-based counterparts [19, 56, 4, 36], camera-based approaches have lower deployment cost and can detect long-range objects.
Previous methods can be divided into two paradigms. BEV (Bird's Eye View)-based methods [13, 11, 25, 24, 40] follow a two-stage pipeline by first constructing an explicit
Figure 1: Performance comparison on the val split of nuScenes [1]. All methods use ResNet50 [10] as the image backbone with the same input size. FPS is measured on a single RTX 3090 with the PyTorch fp32 backend. We balance accuracy and speed by reducing the number of decoder layers without re-training.
dense BEV feature from multi-view features and then performing object detection in BEV space. Although achieving remarkable progress, they suffer from high computation cost and rely on complex view transformation operators. Another line of work [54, 31, 32] explores the sparse query-based paradigm by initializing a set of sparse reference points in 3D space. Specifically, DETR3D [54] links the queries to image features using 3D-to-2D projection. It has a simpler structure and faster speed, but its performance still lags far behind the dense ones. The PETR series [31, 32] uses dense global attention for the interaction between query and image features, which is computationally expensive and buries the advantage of the sparse paradigm. Thus, a natural question arises: can fully sparse detectors achieve accuracy comparable to the dense ones?
In this paper, we find that the key to obtain high performance in sparse 3D object detection is the adaptability of the detector in both BEV and image space. In BEV space,

the detector should be able to aggregate multi-scale features adaptively. Dense BEV-based detectors typically use a BEV encoder to encode multi-scale BEV features. It can be a stack of residual blocks with FPN [26] (e.g. BEVDet [13]), or a transformer encoder with multi-scale deformable attention [60] (e.g. BEVFormer [25]). For sparse detectors such as DETR3D, we argue that the multi-head self attention (MHSA) [49] among queries can play the role of the encoder, as queries are defined in BEV space. However, the vanilla MHSA has a global receptive field, lacking an explicit multi-scale design. In image space, the detector should be adaptive to different objects with different sizes and categories. This is because although the objects have similar sizes in 3D space, they might vary greatly in images. However, the single-point sampling in DETR3D has a fixed local receptive field and the sampled feature is processed by static linear layers, hindering its performance.
To this end, we present SparseBEV, a fully sparse 3D object detector that matches or even outperforms the dense counterparts. Our SparseBEV detector contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. We also propose to use pillars instead of reference points as the formulation of query, since pillars introduce better spatial priors.
We conduct comprehensive experiments on the nuScenes dataset. As shown in Fig. 1, our SparseBEV achieves 55.8 NDS at a speed of 23.5 FPS on the val split, surpassing all previous methods in both speed and accuracy. Besides, we can flexibly adjust the inference speed by reducing the number of decoder layers without re-training. On the test split, SparseBEV with the V2-99 [20] backbone performs strongly without using future frames or test-time augmentation. By further utilizing future frames, SparseBEV achieves 67.5 NDS, outperforming the previous state-of-the-art BEVFormerV2 [55] by 2.7 NDS.

2. Related Work

2.1. Query-based 2D Object Detection

Recently, Transformer [49] with its attention blocks has been widely applied to computer vision tasks. In object detection, DETR [2] was the first model to predict objects based on learnable queries and treat detection as a set prediction problem. A number of works (e.g., [60]) were then proposed to accelerate the convergence of DETR by using sampled features instead of global ones. For example, Deformable DETR [60] samples image features based on sparse reference points, and then applies deformable attention on the features. Sparse R-CNN [45] uses ROIAlign [9] to obtain local features and then performs dynamic convolution. AdaMixer [7] combines the advantages of sampling points and dynamic convolution to further boost convergence. DN-DETR [21] devises a denoising mechanism that feeds ground-truth bounding boxes with noise into the decoder and trains the model to reconstruct the original boxes. DINO [57] further optimizes DN-DETR [21] and DAB-DETR [30] by proposing a contrastive denoising training strategy and a mixed query selection method. There are also several methods designed to tackle the instability of the training procedure. Our work follows the query-based detection paradigm and extends it to 3D space with temporal information.

2.2. Monocular 3D Object Detection

Monocular 3D object detection takes a single image as input and outputs predicted 3D bounding boxes of objects. A significant challenge in monocular 3D object detection is how to transfer 2D features to 3D space. Several works [53, 43, 51, 39] incorporate depth information to deal with this problem. Pseudo-LiDAR [53] first estimates the depth of input images and constructs pseudo point clouds. The pseudo point clouds are then sent to a LiDAR-based detection module to predict the 3D boxes of objects of interest. CaDDN [43] further proposes a fully differentiable end-to-end network which learns pixel-wise categorical depth distributions to predict appropriate depth intervals in 3D space. Inspired by FCOS [47], FCOS3D [51] projects 3D coordinates onto 2D images and decouples them into 2D attributes (centerness and classification) and 3D attributes (depths, sizes, and orientations).

2.3. 3D Object Detection in BEV

Bird's-eye-view (BEV) object detection [38, 42, 13, 54] aims at detecting objects in BEV space given either single-view or multi-view 2D images. Early works [44, 53, 43] typically transformed 2D features to BEV space based on single-view images and conducted monocular 3D object detection. LSS [42] takes six 2D images of different views as input and transforms them into 3D space based on depth estimation. Based on LSS, BEVDet [13] lifts 2D features to BEV space and uses a BEV encoder with residual blocks and FPN to further encode the BEV features. BEVFormer [25] proposes a spatio-temporal transformer encoder that projects multi-view and multi-timestamp inputs to BEV representations. To ease the optimization, BEVFormerV2 [55] introduces perspective view supervision and supervises monocular 3D object detection in parallel with BEV object detection. SOLOFusion [40] defines the localization potential for depth estimation and fuses short-term, high-resolution and long-term, low-resolution temporal stereo.
Figure 2: The overall architecture of SparseBEV, a fully-sparse camera-only 3D object detector. Queries are initialized to be a sparse set of pillars in BEV space. The scale-adaptive self attention further encodes the queries with adaptive receptive fields. Next, multi-view and multi-timestamp features are aggregated with adaptive spatio-temporal sampling and decoded by adaptive mixing. The decoder repeats this process over its layers to produce the final predictions.
Inspired by DETR, another line of works [54, 31, 32] explores the sparse query-based paradigm. DETR3D [54] proposes a top-down framework starting from a learnable sparse set of reference points and refining them iteratively via 3D-to-2D queries. However, such 3D-to-2D projection hinders the receptive field of the query. To handle this, the PETR series [31, 32] uses global attention for the interaction between queries and image features, and introduces 3D positional embeddings to encode image features into 3D-aware representations without explicit projection. Although achieving notable improvements, the expensive dense global attention buries the advantages of the sparse paradigm and makes it difficult to utilize long-term temporal information efficiently. In contrast, we keep the fully sparse design of DETR3D and boost the performance by enhancing the adaptability of the detector.

3. SparseBEV

As shown in Fig. 2, SparseBEV is a query-based one-stage detector with a stack of decoder layers. The input multi-camera videos are processed frame-by-frame using an image backbone and FPN. Next, a set of sparse pillar queries are initialized in BEV space and aggregated by scale-adaptive self attention. These queries then interact with the image features via adaptive spatio-temporal sampling and adaptive mixing to make object predictions. We also propose a dual-branch version of SparseBEV to further enhance the temporal modeling.

3.1. Query Formulation

We first define a set of learnable queries, where each of them is represented by its translation $(x, y, z)$, dimension $(w, l, h)$, rotation $\theta$ and velocity $(v_x, v_y)$. The queries are initialized to be pillars in BEV space, where $z$ is set to 0 and $h$ is set to a fixed value. The initial velocity is set to 0. The other parameters are drawn from random gaussian distributions. Following Sparse R-CNN [45], we attach a $C$-dim query feature to each query box to encode the rich instance characteristics.
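A minimal PyTorch-style sketch of this query formulation is shown below. The module name, the 9-dim box layout $(x, y, z, w, l, h, \theta, v_x, v_y)$, and the fixed pillar height are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class PillarQueries(nn.Module):
    """Sketch: learnable pillar queries in BEV space plus a per-query feature vector."""
    def __init__(self, num_queries=900, embed_dims=256, pillar_height=4.0):
        super().__init__()
        # Box layout assumed as (x, y, z, w, l, h, theta, vx, vy)
        boxes = torch.randn(num_queries, 9)   # most parameters drawn from a gaussian
        boxes[:, 2] = 0.0                     # z is set to 0
        boxes[:, 5] = pillar_height           # h is set to a fixed pillar height (assumed value)
        boxes[:, 7:9] = 0.0                   # initial velocity is set to 0
        self.query_boxes = nn.Parameter(boxes)
        # C-dim query feature encoding instance characteristics (as in Sparse R-CNN)
        self.query_feats = nn.Parameter(torch.zeros(num_queries, embed_dims))

    def forward(self):
        return self.query_boxes, self.query_feats
```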

3.2. Scale-adaptive Self Attention

As mentioned above, dense BEV-based methods typically use a BEV encoder to encode multi-scale BEV features. However, in our method, since we do not explicitly build a BEV feature, how to aggregate multi-scale features in BEV space is a challenge.
In this work, we argue that the self attention can play the role of the BEV encoder, since queries are defined in BEV space. The vanilla multi-head self attention has a global receptive field and lacks the ability of local multi-scale aggregation. Thus, we propose scale-adaptive self attention (SASA), which learns appropriate receptive fields under the guidance of queries. First, we compute the all-pair distances $D \in \mathbb{R}^{N \times N}$ ($N$ is the number of queries) between the query centers in BEV space:

$$D_{ij} = \lVert c_i - c_j \rVert_2,$$
Figure 3: Visualization of the learned BEV receptive field in SASA. We choose two samples from the validation split and visualize 4 heads for each sample, sorted by their receptive field controllers $\tau$. Queries are represented by their centers.
where $c_i$ denotes the center of the $i$-th query. The attention considers not only the similarity between query features, but the distance between them as well. A toy example below shows how it works:

$$\mathrm{Attn}(Q, D) = \mathrm{Softmax}\!\left(\frac{Q Q^{\mathsf{T}}}{\sqrt{d}} - \tau D\right) Q,$$

where $Q$ is the query itself and $d$ is the channel dimension. $\tau$ is a scalar to control the receptive field for each query. When $\tau = 0$, it degrades to the vanilla self attention with a global receptive field. As $\tau$ grows, the attention weights for distant queries become smaller and the receptive field narrows.
In practice, the receptive field controller $\tau$ is adaptive to each query and specific to each head. Supposing there are $H$ heads, we use a linear transformation to generate head-specific $\tau_h$ from the given query feature $q$:

$$[\tau_1, \tau_2, \ldots, \tau_H] = q\, W_{\tau},$$

where the weights $W_{\tau}$ are shared across different queries.
In our experiments, we surprisingly find that the $\tau$ of each head is learnt to distribute uniformly in a certain range regardless of the initialization. In Fig. 3, we sort the heads according to $\tau$ and visualize the attention weights for the distance part. As we can see, each head learns a different receptive field from the others, enabling the network to aggregate features in a multi-scale manner like FPN.
Scale-adaptive self attention (SASA) demonstrates the necessity of FPN, while being more flexible as it learns the scales adaptively from the query. We also find an interesting phenomenon that different categories of queries have different sizes of receptive field. For example, queries representing the bus have larger receptive field than those representing the pedestrians. More details can be found in the ablation studies.
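Below is a self-contained sketch of how SASA can be implemented in PyTorch under the formulation above. The module layout, layer names, and per-head channel partitioning are assumptions; only the distance-biased softmax follows the equations in this section.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveSelfAttention(nn.Module):
    """Sketch of SASA: self attention biased by -tau * pairwise BEV distance."""
    def __init__(self, embed_dims=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dims = embed_dims // num_heads
        self.qkv = nn.Linear(embed_dims, embed_dims * 3)
        self.tau_gen = nn.Linear(embed_dims, num_heads)  # head-specific tau from each query
        self.proj = nn.Linear(embed_dims, embed_dims)

    def forward(self, query_feat, query_center):
        # query_feat: [B, N, C]; query_center: [B, N, 2] (x, y) in BEV space
        B, N, C = query_feat.shape
        q, k, v = self.qkv(query_feat).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dims).transpose(1, 2)  # [B, H, N, d]
        k = k.view(B, N, self.num_heads, self.head_dims).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dims).transpose(1, 2)

        # All-pair distances between query centers in BEV space: [B, 1, N, N]
        dist = torch.cdist(query_center, query_center).unsqueeze(1)
        # Adaptive receptive field controller tau, one per query and head: [B, H, N, 1]
        tau = self.tau_gen(query_feat).permute(0, 2, 1).unsqueeze(-1)

        attn = q @ k.transpose(-2, -1) / self.head_dims ** 0.5 - tau * dist
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```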

3.3. Adaptive Spatio-temporal Sampling

For each frame, we use a linear layer to generate a set of sampling offsets $\{(\Delta x_i, \Delta y_i, \Delta z_i)\}$ adaptively from the query feature. These offsets are then transformed into 3D sampling points $\{P_i\}$ based on the center and size of the query pillar.
Compared with the deformable attention in BEVFormer, our sampling points are adaptive to both query pillar and query feature, thus better covering objects with varying sizes. Besides, these points are not restricted to the query, since we do not limit the range of the sampling offsets.
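One plausible way to realize this offset generation is sketched below; scaling the offsets by the pillar size (and omitting the rotation) is a simplifying assumption, and the names are illustrative.

```python
import torch
import torch.nn as nn

class SamplingOffsetGenerator(nn.Module):
    """Sketch: generate per-query sampling offsets and lift them to 3D points around the query pillar."""
    def __init__(self, embed_dims=256, num_points=16):
        super().__init__()
        self.num_points = num_points
        self.offset_gen = nn.Linear(embed_dims, num_points * 3)

    def forward(self, query_feat, query_box):
        # query_feat: [B, N, D]; query_box: [B, N, 9] = (x, y, z, w, l, h, theta, vx, vy)
        B, N, _ = query_feat.shape
        offsets = self.offset_gen(query_feat).view(B, N, self.num_points, 3)
        center = query_box[..., 0:3].unsqueeze(2)   # [B, N, 1, 3]
        size = query_box[..., 3:6].unsqueeze(2)     # [B, N, 1, 3]
        # Offsets are scaled by the pillar size and shifted by its center (rotation omitted for brevity)
        return center + offsets * size              # [B, N, num_points, 3]
```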
Next, we perform temporal alignment by warping the sampling points according to motions. In autonomous driving, there are two types of motion: one is ego-motion and the other is object motion. Ego-motion describes the motion of the car from its own perspective as it navigates through the environment, while object motion refers to the movement of other objects in the environment as they move around the self-driving car.
Dealing with Object Motion. As mentioned above, the instantaneous velocity can be approximated by the average velocity within a short time window in self-driving. Thus, we adaptively warp the sampling points to previous timestamps using the velocity vector $(v_x, v_y)$ from the query:

$$(x_t, y_t, z_t) = (x, y, z) - (v_x, v_y, 0) \cdot (T - T_t),$$

where $(x, y, z)$ is a sampling point at the current timestamp $T$ and $T_t$ denotes the timestamp of the previous frame $t$. Note that $z_t$ is identical to $z$ because the velocity vector is defined in the $xy$ plane.
Dealing with Ego Motion. Next, we warp the sampling points based on the ego pose provided by the dataset. Points are first transformed to the global coordinate system and then to the local coordinate system of frame $t$:

$$P_t \leftarrow E_t^{-1}\, E\, P_t,$$

where $E$ and $E_t$ denote the ego poses (local-to-global transforms) of the current frame and of frame $t$, respectively.
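The two warping steps can be summarized in the following sketch. The constant-velocity warp and the ego-pose transform follow the equations above; the tensor layout and homogeneous-coordinate convention are assumptions.

```python
import torch

def warp_sampling_points(points, velocity, t_curr, t_prev, ego_pose_curr, ego_pose_prev):
    """Warp sampling points from the current frame to a previous frame.

    points:        [N, P, 3] sampling points (x, y, z) in the current ego frame
    velocity:      [N, 2]    per-query velocity (vx, vy) defined in the xy plane
    t_curr/t_prev: scalars   timestamps of the current / previous frame (seconds)
    ego_pose_*:    [4, 4]    ego-to-global transforms of the two frames
    """
    # 1) Object motion: constant-velocity model, z stays unchanged
    dt = t_curr - t_prev
    vel = torch.cat([velocity, velocity.new_zeros(velocity.shape[0], 1)], dim=-1)  # [N, 3]
    points = points - vel[:, None, :] * dt

    # 2) Ego motion: current ego frame -> global -> previous ego frame
    homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)           # [N, P, 4]
    transform = torch.linalg.inv(ego_pose_prev) @ ego_pose_curr                    # [4, 4]
    warped = homo @ transform.T
    return warped[..., :3]
```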
Sampling. For each timestamp, we project the warped sampling points to each view using the provided camera intrinsics and extrinsics. Since there are overlaps between adjacent views, a projected point may hit one or more views, which we term the hit views. For each hit view $v$, we have $S$ multi-scale feature maps $\{F_v^s\}_{s=1}^{S}$ from the image backbone.
| Method | Backbone | Input Size | Epochs | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|---|---|---|
| PETRv2 [32] | ResNet50 | | 60 | 45.6 | 34.9 | 0.700 | 0.275 | 0.580 | 0.437 | 0.187 |
| BEVStereo [22] | ResNet50 | | | 50.0 | 37.2 | 0.598 | 0.270 | 0.438 | 0.367 | 0.190 |
| BEVPoolv2 [12] | ResNet50 | | | 52.6 | 40.6 | 0.572 | 0.275 | 0.463 | 0.275 | 0.188 |
| SOLOFusion [40] | ResNet50 | | | 53.4 | 42.7 | 0.567 | 0.274 | 0.511 | 0.252 | 0.181 |
| Sparse4Dv2 [29] | ResNet50 | | 100 | 53.9 | 43.9 | 0.598 | 0.270 | 0.475 | 0.282 | 0.179 |
| StreamPETR | ResNet50 | | 60 | 55.0 | | 0.613 | 0.267 | 0.413 | 0.265 | 0.196 |
| SparseBEV | ResNet50 | | 36 | 54.5 | 43.2 | 0.606 | 0.274 | 0.387 | 0.251 | 0.186 |
| SparseBEV | ResNet50 | | 36 | | 44.8 | 0.581 | 0.271 | 0.373 | 0.247 | 0.190 |
| DETR3D | ResNet101-DCN | | 24 | 43.4 | 34.9 | 0.716 | 0.268 | 0.379 | 0.842 | 0.200 |
| BEVFormer | ResNet101-DCN | | 24 | 51.7 | 41.6 | 0.673 | 0.274 | 0.372 | 0.394 | 0.198 |
| BEVDepth [24] | ResNet101 | | | 53.5 | 41.2 | 0.565 | 0.266 | 0.358 | 0.331 | 0.190 |
| Sparse4D | ResNet101-DCN | | 48 | 55.0 | 44.4 | 0.603 | 0.276 | 0.360 | 0.309 | 0.178 |
| SOLOFusion [40] | ResNet101 | | | 58.2 | 48.3 | 0.503 | 0.264 | 0.381 | 0.246 | 0.207 |
| SparseBEV | ResNet101 | | 24 | | | 0.562 | 0.265 | 0.321 | 0.243 | 0.195 |
Table 1: Performance comparison on the nuScenes val split. † benefits from perspective pretraining. Methods trained with CBGS [59] effectively elongate 1 epoch into 4.5 epochs.
The features are first sampled by bilinear interpolation in the image plane and then weighted over the scale axis:

$$f_i = \sum_{s=1}^{S} w_{i,s} \cdot \mathrm{Bilinear}\bigl(F_v^s,\, \pi_v(P_i)\bigr),$$

where $S$ is the number of multi-scale feature maps, $\pi_v$ is the projection function for view $v$, and $w_{i,s}$ is the weight for the $i$-th point on the $s$-th feature map, generated from the query feature by linear transformation.
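A sketch of the per-view sampling step is shown below, assuming the projected points are already normalized to $[-1, 1]$ image coordinates and using F.grid_sample for the bilinear interpolation; the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def sample_multi_scale(feature_maps, points_2d, scale_weights):
    """Bilinearly sample a list of multi-scale feature maps and fuse over the scale axis.

    feature_maps:  list of S tensors, each [B, C, H_s, W_s] from the FPN of one view
    points_2d:     [B, P, 2] projected sampling points, normalized to [-1, 1]
    scale_weights: [B, P, S] per-point weights over scales, generated from the query feature
    returns:       [B, P, C] sampled features
    """
    sampled = []
    for feat in feature_maps:
        grid = points_2d.unsqueeze(2)                          # [B, P, 1, 2] grid for grid_sample
        out = F.grid_sample(feat, grid, align_corners=False)   # [B, C, P, 1] bilinear sampling
        sampled.append(out.squeeze(-1).permute(0, 2, 1))       # [B, P, C]
    sampled = torch.stack(sampled, dim=-1)                     # [B, P, C, S]
    return (sampled * scale_weights.unsqueeze(2)).sum(dim=-1)  # weighted sum over the scale axis
```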

3.4. Adaptive Mixing

Given the sampled features from different timestamps and locations, the key is how to adaptively decode them under the guidance of queries. Inspired by AdaMixer [7], we introduce a simple but effective approach to decode and aggregate the spatio-temporal features with dynamic convolutions [15] and MLP-Mixer [48]. Supposing there are $T$ frames in total and $P$ sampling points per frame, we first stack them to a total of $T \times P$ points. Thus, the sampled features are organized as $f \in \mathbb{R}^{(T \cdot P) \times C}$, where $C$ is the channel dimension.
Channel Mixing. We first perform channel mixing on $f$ to enhance object semantics. The dynamic weights are generated from the query feature $q$ and applied along the channel dimension of $f$; they are shared across different frames and different sampling points.

Point Mixing. Next, we transpose the feature and apply another set of dynamic weights, also generated from the query feature, to its point dimension; these weights are shared across different channels.
After channel mixing and point mixing, the spatiotemporal features are flattened and aggregated by a linear layer. The final regression and classification predictions are computed by two MLPs respectively.
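The two mixing steps can be sketched as follows. Using plain linear layers to generate the dynamic weights and omitting normalization/activation are simplifying assumptions; the einsum calls show where the channel and point dimensions are mixed.

```python
import torch
import torch.nn as nn

class AdaptiveMixing(nn.Module):
    """Sketch of adaptive channel mixing followed by point mixing with query-generated weights."""
    def __init__(self, embed_dims=256, feat_channels=256, num_points=128):
        super().__init__()
        self.feat_channels = feat_channels
        self.num_points = num_points  # total points = frames x points-per-frame
        # Dynamic weight generators (assumed to be simple linear layers)
        self.channel_weight_gen = nn.Linear(embed_dims, feat_channels * feat_channels)
        self.point_weight_gen = nn.Linear(embed_dims, num_points * num_points)
        self.out_proj = nn.Linear(feat_channels * num_points, embed_dims)

    def forward(self, sampled_feat, query_feat):
        # sampled_feat: [B, N, P, C]; query_feat: [B, N, D]
        B, N, P, C = sampled_feat.shape
        # Channel mixing: per-query CxC weights, shared across frames and sampling points
        w_c = self.channel_weight_gen(query_feat).view(B, N, C, C)
        mixed = torch.einsum('bnpc,bncd->bnpd', sampled_feat, w_c)
        # Point mixing: per-query PxP weights, shared across channels
        w_p = self.point_weight_gen(query_feat).view(B, N, P, P)
        mixed = torch.einsum('bnpc,bnpq->bnqc', mixed, w_p)
        # Flatten and aggregate back to the query feature dimension
        return self.out_proj(mixed.flatten(2))
```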

3.5. Dual-branch SparseBEV

Inspired by SlowFast [6], we further enhance the long-term temporal modeling by dividing the input video into a slow stream and a fast stream, resulting in a dual-branch SparseBEV. The slow stream is designed to capture fine-grained appearance details and it operates at low frame rates and high resolutions. The fast stream is responsible for capturing long-term temporal stereo and it operates at high frame rates and low resolutions. Sampling points are projected to the two streams respectively and the sampled features are stacked before adaptive mixing.
In this way, we decouple the learning of static appearance and temporal motion, leading to better performance. Besides, the computation cost is significantly reduced since only a small fraction of frames needs to be processed at high resolution. However, since this dual-branch design makes the framework a little complex, we do not use it unless otherwise stated. The ablations of this part can be found in the supplementary material.
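A minimal sketch of how the input clip could be split into the two streams is given below; the frame stride and downsampling ratio are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

def split_slow_fast(frames, slow_stride=4, fast_downsample=0.5):
    """Sketch: split a clip into a slow (low frame rate, high resolution) and a fast
    (high frame rate, low resolution) stream.

    frames: [T, N_cam, C, H, W] multi-camera video tensor
    """
    slow_stream = frames[::slow_stride]       # keep every slow_stride-th frame at full resolution
    T, N, C, H, W = frames.shape
    fast_stream = F.interpolate(
        frames.flatten(0, 1),                 # [T * N_cam, C, H, W]
        scale_factor=fast_downsample,
        mode='bilinear',
        align_corners=False,
    ).view(T, N, C, int(H * fast_downsample), int(W * fast_downsample))
    return slow_stream, fast_stream
```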
| Method | Backbone | Epochs | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|---|---|
| DETR3D [54] | V2-99 | 24 | 47.9 | 41.2 | 0.641 | 0.255 | 0.394 | 0.845 | 0.133 |
| PETR [31] | V2-99 | 24 | 50.4 | 44.1 | 0.593 | 0.249 | 0.383 | 0.808 | 0.132 |
| UVTR [23] | V2-99 | 24 | 55.1 | 47.2 | 0.577 | 0.253 | 0.391 | 0.508 | 0.123 |
| BEVFormer [25] | V2-99 | 24 | 56.9 | 48.1 | 0.582 | 0.256 | 0.375 | 0.378 | 0.126 |
| BEVDet4D [11] | Swin-B [33] | | 56.9 | 45.1 | 0.511 | 0.241 | 0.386 | 0.301 | 0.121 |
| PolarFormer [16] | V2-99 | 24 | 57.2 | 49.3 | 0.556 | 0.256 | 0.364 | 0.440 | 0.127 |
| PETRv2 [32] | V2-99 | 24 | 59.1 | 50.8 | 0.543 | 0.241 | 0.360 | 0.367 | 0.118 |
| Sparse4D [28] | V2-99 | 48 | 59.5 | 51.1 | 0.533 | 0.263 | 0.369 | 0.317 | 0.124 |
| BEVDepth [24] | V2-99 | | 60.0 | 50.3 | 0.445 | 0.245 | 0.378 | 0.320 | 0.126 |
| BEVStereo [22] | V2-99 | | 61.0 | 52.5 | 0.431 | 0.246 | 0.358 | 0.357 | 0.138 |
| SOLOFusion [40] | ConvNeXt-B [34] | | 61.9 | 54.0 | 0.453 | 0.257 | 0.376 | 0.276 | 0.148 |
| SparseBEV | V2-99 | 24 | | | 0.502 | 0.244 | 0.324 | 0.251 | 0.126 |
| SparseBEV (dual-branch) | V2-99 | 24 | | | 0.485 | 0.244 | 0.332 | 0.246 | 0.117 |
| BEVFormerV2 † [55] | InternImage-XL [52] | 24 | 64.8 | 58.0 | 0.448 | 0.262 | 0.342 | 0.238 | 0.128 |
| BEVDet-Gamma | Swin-B | | 66.4 | 58.6 | 0.375 | 0.243 | 0.377 | 0.174 | 0.123 |
| SparseBEV | V2-99 | 24 | | | 0.425 | 0.239 | 0.311 | 0.172 | 0.116 |
Table 2: Performance comparison on the nuScenes test split. † uses future frames. Methods trained with CBGS [59] effectively elongate 1 epoch into 4.5 epochs.

4. Experiments

4.1. Implementation Details

We implement our model using PyTorch [41]. Following previous methods, we adopt common image backbones including ResNet [10] and V2-99 [20]. The decoder consists of 6 layers, and the weights are shared across different layers. By default, we use 8 frames in total with a fixed interval between adjacent frames.
During training, we adopt the Hungarian algorithm [18] for label assignment between ground-truth objects and predictions. Focal loss [27] is used for classification and L1 loss is used for 3D bounding box regression. We use the AdamW [35] optimizer with a global batch size of 8 . The initial learning rate is set to and is decayed with cosine annealing policy.
在训练过程中,我们采用匈牙利算法 [18] 对真实对象和预测之间进行标签分配。分类使用 Focal loss [27],3D 边界框回归使用 L1 loss。我们使用全局批量大小为 8 的 AdamW [35] 优化器。初始学习率设置为 ,并采用余弦退火策略进行衰减。
Recently, we follow the training setting of the very recent work StreamPETR [50] and refresh our results for fair comparison. For the regression loss, we change the loss weight of and to 2.0 and leave the others to 1.0. Query denoising [21] is also introduced to stablize training and speedup convergence.
最近,我们遵循最近的工作 StreamPETR [50]的培训设置,并刷新我们的结果以进行公平比较。对于回归损失,我们将 的损失权重改为 2.0,其他保持为 1.0。还引入了查询去噪[21]来稳定训练并加速收敛。

4.2. Datasets and Metrics

We evaluate our model on the nuScenes dataset [1], which consists of large-scale multimodal data collected from 6 surround-view cameras, 1 lidar and 5 radars. The dataset has 1000 videos and is split into 700/150/150 videos for training/validation/testing. Each video has a duration of roughly 20 s and the key samples are annotated every 0.5 s.
For 3D object detection, the dataset provides annotated 3D bounding boxes of 10 classes. The official evaluation metrics include the well-known mean Average Precision (mAP) and five true positive (TP) metrics, namely ATE, ASE, AOE, AVE, and AAE, which measure translation, scale, orientation, velocity, and attribute errors respectively. The overall performance is measured by the nuScenes Detection Score (NDS), which is a composite of the above metrics.
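Concretely, the nuScenes benchmark defines this composite score as

$$\mathrm{NDS} = \frac{1}{10}\Bigl[\,5\,\mathrm{mAP} + \sum_{\mathrm{mTP}\in\mathbb{TP}} \bigl(1 - \min(1, \mathrm{mTP})\bigr)\Bigr],$$

where $\mathbb{TP} = \{\mathrm{mATE}, \mathrm{mASE}, \mathrm{mAOE}, \mathrm{mAVE}, \mathrm{mAAE}\}$.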

4.3. Comparison with the State-of-the-art Methods

nuScenes val split. In Tab. 1, we compare SparseBEV with previous state-of-the-art methods on the validation split of nuScenes. Unless otherwise stated, the image backbone is pretrained on ImageNet-1k [17] and the number of queries is set to 900. To keep the simplicity of our approach, the dual-branch design is not used here. When adopting ResNet50 as the backbone, SparseBEV outperforms the previous state-of-the-art method SOLOFusion by 1.1 NDS and 0.5 mAP. By further adopting nuImages [1] pretraining and reducing the number of queries to 400, SparseBEV sets a new record of 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS (corresponding to Fig. 1). Next, we upgrade the backbone to ResNet101 and scale up the input size. Under this setting, SparseBEV still surpasses SOLOFusion, demonstrating the scalability of our method.
nuScenes test split. We submit our method to the nuScenes evaluation server and report the leaderboard results in Tab. 2. Using the V2-99 [20] backbone pretrained by DD3D [39], SparseBEV already achieves strong results without bells and whistles.
| Query Formulation | NDS | mAP |
|---|---|---|
| 3D Reference Points | 55.1 | 44.0 |
| BEV Pillars | | |
Table 3: Ablations on query formulation. Compared with reference points, pillars introduce better spatial priors and lead to better performance.
| Self Attention | Distance Function | NDS | mAP |
|---|---|---|---|
| MHSA | - | 53.4 | 41.4 |
| SASA | | 54.3 | 43.8 |
| SASA | | 55.1 | 44.3 |
Table 4: Comparison between vanilla multi-head self attention (MHSA) and scale-adaptive self attention (SASA). Our SASA achieves significant improvements over the baseline.
Built on top of this, the dual-branch design gives us an extra performance bonus, hitting 63.6 NDS. We also follow the offline setting of BEVFormerV2 [55], which leverages both past and future frames to assist the detection. Remarkably, our method (single branch only) surpasses BEVFormerV2 by 2.7 NDS. In addition, BEVFormerV2 uses InternImage-XL [52] with over 300M parameters as the image backbone, while our V2-99 backbone is much more lightweight.

4.4. Ablation Studies

In this section, we conduct ablations on the validation split of nuScenes. For all experiments, we use ResNet50 pretrained on nuImages [1] as the image backbone. The input contains 8 frames. We use 900 queries in the decoder, and the model is trained for 24 epochs without CBGS [59].
Query Formulation. In Tab. 3, we ablate different formulations of the query. The top row uses a set of reference points distributed uniformly in 3D space (as in DETR3D and PETR). By replacing the reference points with pillars in BEV space, we observe clear performance improvements over the baseline. This is because pillars introduce better spatial priors than reference points.
Scale-adaptive Self Attention. We study the effect of scale-adaptive self attention (SASA) in Tab. 4. Compared with the vanilla multi-head self attention, SASA achieves a +2.2 NDS improvement over the baseline. We further ablate different distance functions and adopt the best-performing one. Besides, we also observe two interesting phenomena in SASA.
Figure 4: Averaged $\tau$ over all queries and all heads for each class in SASA. A larger $\tau$ indicates a smaller receptive field. We perform statistics on the val split of nuScenes and choose the queries with a confidence score over 0.3.

Figure 5: Ablations of adaptive spatio-temporal sampling. The performance continues to increase with the number of frames. For sampling points, we observe that 16 points per frame works best.
First, different heads learn different receptive fields (see Fig. 3), enabling the model to aggregate multi-scale features in BEV space. Second, queries representing larger objects tend to have larger receptive fields. In Fig. 4, we average the receptive field coefficient $\tau$ over all queries and all heads for each class. As we can see, the receptive field of large objects (such as bus and truck) is larger than that of small objects (such as pedestrian and traffic cone), demonstrating the effectiveness of our adaptive design.
Adaptive Spatio-temporal Sampling. In Fig. 5, we ablate the number of frames and sampling points. The performance continues to improve as the number of frames increases, proving that SparseBEV can benefit from the long-term history. Here, we use 8 frames for fair comparison with previous methods. As for the number of sampling points, we observe that dispatching 16 points for each frame leads to the best performance. We also provide the visualization of the sampling points from the last stage of the decoder in Fig. 6. Our sampling scheme has adaptive regions of interest and achieves good temporal alignment for both static and moving objects across different frames.
Figure 6: Visualization of adaptive spatio-temporal sampling over three consecutive frames. Different instances are distinguished by colors. Point size indicates depth: larger points are closer to the camera. Our sampling scheme has an adaptive receptive field and is well aligned across different timestamps.
| Ego Align | Obj. Align | NDS | mAP | mAVE |
|---|---|---|---|---|
| - | - | 44.4 | 34.1 | 0.510 |
| ✓ | - | 54.2 | 43.5 | 0.281 |
| ✓ | ✓ | | | |
Table 5: Ablations on temporal alignment. Aligning both ego and object motion leads to the best performance.
Temporal Alignment. In Tab. 5, we validate the necessity of temporal alignment in spatio-temporal sampling. In our method, ego motion is aligned with the provided ego pose, while object motion is aligned with a simple constant velocity motion model. As we can see from the table, both of them contribute to the performance.
Adaptive Mixing. In Tab. 6, we validate the design of the adaptive mixing. The top row is a baseline that uses attention weights to aggregate sampled features (as done in DETR3D). Static mixing and adaptive mixing improve the baseline by 2.7 NDS and 6.5 NDS respectively, demonstrating the necessity of the adaptive design. Next, we explore different combinations of channel and point mixing. Channel mixing followed by point mixing leads to better performance, proving that the object semantics should be enhanced before point mixing.

4.5. Limitations and Future Work

One limitation of SparseBEV is the heavy reliance on ego pose. As we can see from Tab. 5, the performance drops about 10 NDS without ego-based temporal alignment. However, in the real world, the ego pose provided by IMU may be unreliable and inaccurate, seriously affecting the robustness. Another limitation is that the inference latency increases linearly with the number of frames, since we stack the sampled features along the temporal dimension.
In the future, we will explore a more elegant and concise way of decoupling spatial appearance and temporal motion. We will also try to extend SparseBEV to other 3D perception tasks such as BEV segmentation, occupancy prediction and lane detection.
| Method | Details | NDS | mAP |
|---|---|---|---|
| W/o Mixing | - | 49.1 | 38.6 |
| Static Mixing | Channel → Point | 51.8 | 41.9 |
| Adaptive Mixing | Channel Only | 53.1 | 42.7 |
| Adaptive Mixing | Point Only | 53.6 | 43.3 |
| Adaptive Mixing | Point → Channel | 55.4 | 44.6 |
| Adaptive Mixing | Channel → Point | 55.6 | |
Table 6: Ablations on adaptive mixing. Channel mixing followed by point mixing leads to better performance.

5. Conclusion

In this paper, we have proposed a query-based one-stage 3D object detector, named SparseBEV, which enjoys the benefits of the BEV space without explicitly constructing a dense BEV feature. SparseBEV improves the adaptability of the decoder with three key modules: scale-adaptive self attention, adaptive spatio-temporal sampling and adaptive mixing. We further introduce a dual-branch design to enhance the long-term temporal modeling. Experiments show that SparseBEV achieves state-of-the-art performance on the nuScenes dataset in both speed and accuracy. We hope this exciting result will attract more attention to the fully sparse BEV detection paradigm.
Acknowledgements. This work is supported by National Key R&D Program of China (No. 2022ZD0160900), National Natural Science Foundation of China (No. 62076119, No. 61921006), Fundamental Research Funds for the Central Universities (No. 020214380091, No. 020214380099), and Collaborative Innovation Center of Novel Software Technology and Industrialization. Besides, the authors would like to thank Ziteng Gao, Zhiqi Li and Ruopeng Gao for their help.

Figure 7: Architecture of dual-branch SparseBEV. The input multi-camera videos are divided into a high-resolution "slow" stream and a low-resolution "fast" stream.

A. Details of Dual-branch SparseBEV

In this section, we provide detailed explanations and ablations on the dual branch design. As shown in Fig. 7, the input multi-camera videos are divided into a high-resolution "slow" stream and a low-resolution "fast" stream. Sampling points are projected to the two streams respectively and the sampled features are stacked before adaptive mixing. Experiments are conducted with a V2-99 backbone pretrained by FCOS3D [51] on the training split of nuScenes.
In Fig. 8, we compare our dual branch design with single branch baselines. If we use a single branch at 640 resolution (orange curve), adding more frames does not provide as much benefit as it does at the higher resolution (green curve). By using dual branches of the higher resolution and 640 resolution with a 1:2 ratio, we decouple spatial appearance and temporal motion, unlocking better performance. As we can see from the blue curve, the longer the frame sequence, the more gain the dual branch design brings.
In Tab. 7, we provide detailed quantitative results. Under the setting of 8 frames (4 seconds), our dual branch design with only two high-resolution (HR) frames surpasses the baseline with eight HR frames. By increasing the number of HR frames to 4, we further improve the performance. Moreover, increasing the resolution of the LR frames does not bring any improvement, which clearly demonstrates that appearance detail and temporal motion are decoupled into different branches.
Figure 8: Comparison between single-branch and dual-branch designs under different settings. The dual branch design brings more gain as the number of frames increases.
Since the dual-branch design also enlarges the receptive field (a smaller resolution provides a larger receptive field), which may itself improve performance, we further analyse where the improvement comes from in Tab. 8. The first row is our baseline, which takes 8 frames with a single high-resolution branch as input. We first try to enlarge the receptive field by adding an extra, coarser feature map (Row 2), and observe that the performance is slightly improved. This demonstrates that a larger receptive field is required for high-resolution and long-term inputs. However, the spatial appearance and temporal motion are still coupled, limiting the performance. By using dual branches with a 1:2 ratio (Row 3), we decouple spatial appearance and temporal motion, leading to better performance. Moreover, the training cost is also reduced. This experiment demonstrates that we not only need larger receptive fields, but also need to decouple spatial appearance and temporal motion.
| Method | Setting | mAP | NDS |
|---|---|---|---|
| Single branch | | 48.9 | 57.3 |
| Dual branch | | 49.4 | 57.9 |
| Dual branch | | | |
| Dual branch | | 50.1 | 58.0 |
Table 7: Ablations on the dual branch design. The setting column gives the number of frames together with the longer side of the input image in pixels.
| Method | Feature Maps | Train. Cost | mAP |
|---|---|---|---|
| Single branch | | | 48.9 |
| Single branch | | | 49.3 |
| Dual branch | | | |
Table 8: Detailed analyses on the dual-branch design. For single branch baselines, simply adding an extra feature map has limited effect. In contrast, our dual branch design can boost the performance significantly.
| Self Attention | Distance Function | NDS | mAP |
|---|---|---|---|
| SASA-beta | | 55.2 | 44.8 |
| SASA | | | |
Table 9: Compared with SASA-beta, SASA not only has the ability of multi-scale feature aggregation, but generates adaptive receptive field for each query as well.

B. Study on Scale-adaptive Self Attention

In this section, we describe how we arrived at scale-adaptive self attention (SASA). In the main paper, the receptive field coefficient $\tau$ is specific to each head and adaptive to each query. In the development of SASA, there was an intermediate version (dubbed SASA-beta for convenience): the $\tau$ of each head is simply a learnable parameter shared by all queries.
In Fig. 9, we take a closer look at how $\tau$ changes during training. We surprisingly find that regardless of the initialization, each head learns a $\tau$ different from the others, and all of them are distributed within a certain range, enabling the network to aggregate local and multi-scale features from multiple heads.

Figure 9: The change of $\tau$ for each head in SASA-beta during training. Regardless of the initialization, each head learns a different $\tau$, enabling local and multi-scale feature aggregation.
Next, we improve SASA-beta by generating the $\tau$ adaptively from the query, which corresponds to the version in the main paper. Compared with SASA-beta, SASA not only has the ability of multi-scale feature aggregation, but also generates an adaptive receptive field for each query. The quantitative comparison between SASA-beta and SASA is shown in Tab. 9.
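The difference between the two variants can be summarized in a few lines of PyTorch; the tensor shapes and variable names are illustrative.

```python
import torch
import torch.nn as nn

num_heads, embed_dims, B, N = 8, 256, 1, 900
query_feat = torch.randn(B, N, embed_dims)

# SASA-beta: one learnable tau per head, shared by all queries
tau_beta = nn.Parameter(torch.ones(num_heads))      # broadcast over every query

# SASA: head-specific tau generated adaptively from each query feature
tau_gen = nn.Linear(embed_dims, num_heads)
tau_adaptive = tau_gen(query_feat)                  # [B, N, num_heads], one tau per query and head
```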

C. More Visualizations

In Fig. 10, we provide more visualizations of the sampling points from different stages. In the initial stage, the sampling points have the shape of pillars. In later stages, they are refined to cover objects with different sizes.
(a) Sample 0005, stage 1
(b) Sample 0005, stage 2
(c) Sample 0005, stage 3
(d) Sample 0028, stage 1
(e) Sample 0028, stage 2
(f) Sample 0028, stage 3
Figure 10: Visualized sampling points from different stages. Different instances are distinguished by colors.

1. Corresponding author.
2. Note that the experiment setting used here is different from that in the main paper, since these experiments were conducted before the submission of ICCV 2023. After submission, we further improved our implementation to refresh our results. The conclusion is consistent between these different implementations.