
Fully Sparse 3D Occupancy Prediction

Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu,
Jia Zeng, Li Chen, Hongyang, Limin Wang
Nanjing University Shanghai AI Lab
https://github.com/MCG-NJU/SparseOcc

Abstract

Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs. To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from visual inputs and then predicts semantic/instance occupancy from the sparse 3D representation using sparse queries. A mask-guided sparse sampling is designed to enable sparse queries to interact with features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a ray-based evaluation metric, namely RayIoU, to solve the inconsistent penalty along depth that arises in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS, with 7 history frames as input. By incorporating more preceding frames, up to 15, SparseOcc further improves its performance to 35.1 RayIoU without bells and whistles.

1 Introduction

Vision-centric 3D occupancy prediction [1] focuses on partitioning 3D scenes into structured grids from visual images. Each grid cell is assigned a label indicating whether it is occupied. This task offers more geometric detail than 3D detection and produces an alternative representation to LiDAR.
Existing methods typically construct dense 3D features yet suffer from computational overhead (e.g. only a few FPS on a Tesla A100). However, dense representations are not necessary for occupancy prediction. Statistics in Fig. 1(a) reveal the geometric sparsity of driving scenes: the vast majority of voxels are empty. This leaves considerable room to accelerate occupancy prediction by exploiting sparsity. Some works explore the sparsity of 3D scenes, but they still rely on sparse-to-dense modules for dense predictions. This inspires us to seek a fully sparse occupancy network without any dense design.
In this paper, we propose SparseOcc, the first fully sparse occupancy network. As depicted in Fig. 1(b), SparseOcc includes two steps. First, it leverages a sparse voxel decoder to reconstruct the sparse geometry of a scene in a coarse-to-fine manner. This only models non-free regions, saving computational costs significantly. Second, we design a mask transformer with sparse semantic/instance queries to predict masks and labels of segments from the sparse space. The mask transformer not only improves performance on semantic occupancy but also paves the way for panoptic occupancy. A mask-guided sparse sampling is designed to achieve sparse cross-attention in the mask transformer.
Figure 1: (a) Geometry sparsity: we measure the sparsity of scene geometry and find that even the scene with the fewest empty voxels is still predominantly empty. (b) Overview of SparseOcc: a sparse voxel decoder reconstructs a sparse 3D representation from input images, and a set of sparse queries estimates the mask and label of each segment from the sparse 3D volume. (c) Performance comparison on the validation split of Occ3D-nus. FPS is measured on a single Tesla A100 with the PyTorch fp32 backend.
As such, our SparseOcc fully exploits the sparse properties, with a fully sparse architecture free of any dense design like dense 3D features, sparse-to-dense modules, and global attention.
Besides, we notice flaws in the popular voxel-level mean Intersection-over-Union (mIoU) metric for occupancy evaluation and design a ray-level metric, RayIoU, as the solution. The mIoU criterion is ill-posed given the ambiguous labeling of unscanned voxels. Previous methods [48] relieve this issue by only evaluating observed areas, but this introduces an inconsistent penalty along depth. RayIoU addresses both issues simultaneously. It evaluates the predicted 3D occupancy volume by retrieving depth and category predictions along designated rays. Specifically, RayIoU casts query rays into the predicted 3D volume and counts a ray as a true positive when the first occupied voxel it touches has the correct distance and class. This yields a fairer and more reasonable criterion.
Thanks to the sparsity design, SparseOcc achieves 34.0 RayIoU on Occ3D-nus [48] while maintaining a real-time inference speed of 17.3 FPS (Tesla A100, PyTorch fp32 backend), with 7 history frames as input. By incorporating more preceding frames, up to 15, SparseOcc further improves its performance to 35.1 RayIoU, achieving state-of-the-art performance without bells and whistles. The comparison between SparseOcc and previous methods in terms of performance and efficiency is shown in Fig. 1(c).
We summarize our contributions as follows:
  1. We propose SparseOcc, the first fully sparse occupancy network without any time-consuming dense designs. It achieves 34.0 RayIoU on the Occ3D-nus benchmark with an inference speed of 17.3 FPS.
  2. We present RayIoU, a ray-wise criterion for occupancy evaluation. By casting query rays into the 3D volume, it resolves the ambiguous penalty for unscanned free voxels and the inconsistent depth penalty in the mIoU metric.

2 Related Work

Camera-based 3D Occupancy Prediction. Occupancy Networks were originally proposed by Mescheder et al. [36, 41], with a focus on continuous object representations in 3D space. Recent variations of occupancy networks shift their attention towards reconstructing 3D space and predicting voxel-level semantic information from image inputs. MonoScene [4] achieves scene occupancy completion through a 2D and a 3D UNet [42] connected by a sight projection module. OccNet [44] applies universal occupancy features to various downstream tasks and introduces the OpenOcc benchmark. SurroundOcc [55] proposes a coarse-to-fine architecture. However, the computational burden of handling a large number of voxel queries remains a challenge. TPVFormer [17] proposes using tri-perspective view representations to supplement vertical structural information, but this inevitably leads to information loss. VoxFormer [25] initializes sparse queries based on monocular depth prediction. However, VoxFormer is not fully sparse, as it still requires a sparse-to-dense MAE [11] module to complete the scene. Some methods emerged in the CVPR 2023 occupancy challenge [27, 39, 9], but none of them exploits a fully sparse design. In this paper, we take the first step towards a fully sparse architecture for 3D occupancy prediction from camera inputs.
Figure 2: SparseOcc is a fully sparse architecture, since it neither relies on dense 3D features nor uses sparse-to-dense or global attention operations. The sparse voxel decoder reconstructs the sparse geometry of the scene as a set of voxels. The mask transformer then uses sparse queries to predict the mask and label of each segment. SparseOcc can be easily extended to panoptic occupancy by replacing the semantic queries with instance queries.
End-to-end 3D Reconstruction from Posed Images. As a related task to 3D occupancy prediction, reconstruction recovers the 3D geometry from multiple posed images. Recent methods focus on more compact and efficient end-to-end 3D reconstruction pipelines [38, 46, 2, 45, 10]. Atlas [38] extracts features from multi-view input images and maps them to space to construct the truncated signed distance function [8]. NeuralRecon [46] directly reconstructs local surfaces as sparse TSDF volumes and uses a GRU-based TSDF fusion module to fuse features from previous fragments. VoRTX [45] utilizes transformers to address occlusion issues in multi-view images. CVRecon [10] takes a different approach by avoiding occlusion-sensitive perspective mappings of 2D to 3D and directly uses cost volumes to establish 3D features from 2D image features.
Mask Transformer. Recently, unified segmentation models have been widely studied to handle semantic and instance segmentation concurrently. Cheng et al. first propose MaskFormer [7] for unified segmentation in terms of model architecture, loss functions, and training strategies. Mask2Former [6] then introduces masked attention, which restricts the receptive field to instance masks, for better performance. Later on, Mask3D [43] successfully extends the mask transformer to point cloud segmentation with state-of-the-art performance. OpenMask3D [47] further tackles open-vocabulary 3D instance segmentation and proposes a model for zero-shot 3D segmentation.

3 SparseOcc

SparseOcc is a vision-centric occupancy model that only requires camera inputs. As shown in Fig. 2, SparseOcc has three modules: an image encoder consisting of an image backbone and FPN [29] to extract 2D features from multi-view images; a sparse voxel decoder (Sec. 3.1) to predict sparse class-agnostic 3D occupancy with correlated embeddings from the image features; a mask transformer decoder (Sec 3.2) to distinguish semantics and instances in the sparse 3D space.

3.1 Sparse Voxel Decoder

Since 3D occupancy ground truths are dense 3D volumes, existing methods typically build a dense 3D feature volume but suffer from computational overhead. In this paper, we argue that such a dense representation is not necessary for occupancy prediction. In our statistics, we find that the vast majority of the voxels in the scene are free. This motivates us to explore a sparse 3D representation that only models the non-free areas of the scene, thereby saving computational resources.
Figure 3: The detailed architecture of the sparse voxel decoder, which follows a coarse-to-fine pipeline with three layers. Within each layer, we use a transformer-like architecture for 3D-2D interaction. At the end of each layer, it upsamples each voxel by 2× and estimates the probability of each voxel being occupied.
Overall architecture. Our sparse voxel decoder is shown in Fig. 3. In general, it follows a coarse-to-fine structure but only models the non-free regions. The decoder starts from a set of coarse voxel queries evenly distributed in 3D space. In each layer, we first upsample each voxel by 2× along each axis, i.e. one voxel is split into 8 voxels of half the side length. Next, we estimate an occupancy score for each voxel and prune useless voxel grids. There are two approaches to pruning: one is based on a threshold (only keeping voxels whose score exceeds it); the other is top-k selection. In our implementation, we simply keep the voxels with the top-k occupancy scores for training efficiency. Here k is a dataset-related parameter, obtained by counting the maximum number of non-free voxels in each sample at different resolutions. The voxel tokens after pruning serve as the input to the next layer.
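The upsample-then-prune step of one decoder layer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: voxels are represented by integer grid indices, and `toy_score` is a hypothetical stand-in for the learned occupancy head, not part of SparseOcc.

```python
from itertools import product


def upsample_voxels(coords):
    """Split each voxel (integer grid index) into its 8 children at 2x resolution."""
    children = []
    for x, y, z in coords:
        for dx, dy, dz in product((0, 1), repeat=3):
            children.append((2 * x + dx, 2 * y + dy, 2 * z + dz))
    return children


def prune_top_k(coords, scores, k):
    """Keep the k voxels with the highest occupancy scores."""
    order = sorted(range(len(coords)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])  # keep the original ordering among survivors
    return [coords[i] for i in keep], [scores[i] for i in keep]


def toy_score(coord):
    # Hypothetical occupancy score: voxels closer to z == 0 score higher.
    return 1.0 / (1.0 + coord[2])


coarse = [(0, 0, 0), (1, 0, 0)]
fine = upsample_voxels(coarse)              # 2 voxels -> 16 voxels
fine_scores = [toy_score(c) for c in fine]
kept, kept_scores = prune_top_k(fine, fine_scores, k=8)
```

Only the `kept` voxels would be passed to the next layer, so the number of tokens is bounded by k at every resolution instead of growing eightfold per layer.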
Detailed design. Within each layer, we use a transformer-like architecture to handle voxel queries. The concrete architecture is inspired by SparseBEV [32], a detection method using a sparse scheme. Specifically, each layer takes a set of voxel queries, each described by a 3D location and a content vector. We first use self-attention to aggregate local and global features for those query voxels. Then, a linear layer generates 3D sampling offsets for each voxel query from the associated content vector. These sampling offsets are applied to the voxel queries to obtain reference points in global coordinates. We finally project the sampled reference points into multi-view image space and integrate image features by adaptive mixing [49, 18].
Temporal modeling. Previous dense occupancy methods typically warp the history feature to the current timestamp and use deformable attention [62] or 3D convolutions to fuse temporal information. However, this approach is not directly applicable in our case due to the sparse nature of our 3D features. To handle this, we leverage the flexibility of the aforementioned global sampled reference points by warping them to previous timestamps to sample history multi-view image features. The sampled multi-frame features are then stacked and aggregated by adaptive mixing for temporal modeling.
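Warping reference points to a previous timestamp amounts to a change of coordinate frame. The sketch below assumes a static scene and that each frame provides a 4x4 rigid ego-to-global pose matrix; the function names and the pose convention are illustrative assumptions, not taken from the SparseOcc code.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a homogeneous 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][j] * p[j] for j in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def warp_points(points, ego_to_global_cur, ego_to_global_prev):
    """Express points given in the current ego frame in a previous ego frame."""
    global_to_prev = invert_rigid(ego_to_global_prev)
    warped = []
    for x, y, z in points:
        in_global = mat_vec(ego_to_global_cur, [x, y, z, 1.0])
        in_prev = mat_vec(global_to_prev, in_global)
        warped.append(tuple(in_prev[:3]))
    return warped


def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty], [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]


# The ego vehicle moved 2 m forward between the previous and the current frame.
cur_pose, prev_pose = translation(5.0, 0.0, 0.0), translation(3.0, 0.0, 0.0)
warped = warp_points([(10.0, 0.0, 0.0)], cur_pose, prev_pose)
```

The warped points would then be projected into the history images with that frame's camera parameters, so no dense history feature volume has to be stored or aligned.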
Supervision. We compute loss for sparse voxels from each layer. We use binary cross entropy (BCE) loss as the supervision, given that we are reconstructing a class-agnostic sparse occupancy space. Only the kept sparse voxels are supervised, while the discarded regions during pruning in earlier stages are ignored.
Moreover, due to the severe class imbalance, the model can easily be dominated by categories with a large proportion, such as the ground, thereby ignoring other important elements in the scene, such as cars and people. Therefore, voxels belonging to different classes are assigned different loss weights. For example, voxels belonging to class $c$ are assigned a loss weight of:
$$w_c = \frac{\sum_{i=1}^{C} M_i}{M_c},$$
where $M_i$ is the number of voxels belonging to the $i$-th class in the ground truth and $C$ is the number of classes.
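A minimal sketch of class-balanced loss weights, assuming an inverse-frequency rule in which each class weight is the total voxel count divided by that class's voxel count. The class names and counts below are made up for illustration.

```python
def class_balanced_weights(voxel_counts):
    """Map each class to total_voxels / class_voxels (inverse-frequency assumption)."""
    total = sum(voxel_counts.values())
    return {name: total / count for name, count in voxel_counts.items() if count > 0}


# Hypothetical ground-truth voxel counts for three classes.
counts = {"ground": 9000, "car": 900, "pedestrian": 100}
weights = class_balanced_weights(counts)
```

With these counts, rare classes such as `pedestrian` receive a much larger weight than `ground`, which counteracts the dominance of large categories in the BCE loss.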

3.2 Mask Transformer

Our mask transformer is inspired by Mask2Former [6], which uses sparse semantic/instance queries, each decoupled into a binary mask query and a content vector. The mask transformer consists of three steps: multi-head self-attention (MHSA), mask-guided sparse sampling, and adaptive mixing. MHSA handles the interaction between different queries, as is common practice. Mask-guided sparse sampling and adaptive mixing are responsible for the interaction between queries and 2D image features.
Mask-guided sparse sampling. A simple baseline for the mask transformer is to use the masked cross-attention module of Mask2Former. However, it attends to all positions of the key, which is computationally prohibitive. Here, we design a simple alternative. We first randomly select a set of 3D points within the mask predicted by the previous Transformer decoder layer. Then, we project those 3D points onto the multi-view images and extract their features by bilinear interpolation. Besides, our sparse sampling mechanism makes temporal modeling easier: we simply warp the sampling points, as done in the sparse voxel decoder.
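The three operations named above (random selection inside a mask, projection to an image, bilinear feature lookup) can be sketched as follows. This is a simplified, single-camera illustration: it assumes the 3D points are already expressed in the camera frame with z pointing forward, uses a plain pinhole model, and treats the feature map as a 2D grid of scalars. All function names are hypothetical.

```python
import random


def sample_points_in_mask(voxel_centers, mask, num_points, rng):
    """Randomly pick 3D points among the voxels whose mask value is True."""
    inside = [c for c, m in zip(voxel_centers, mask) if m]
    if not inside:
        return []
    return [rng.choice(inside) for _ in range(num_points)]


def project_pinhole(point, fx, fy, cx, cy):
    """Project a camera-frame point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def bilinear(feat, u, v):
    """Interpolate feat[row][col] at the continuous pixel (u = column, v = row)."""
    h, w = len(feat), len(feat[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    a, b = u - u0, v - v0
    top = feat[v0][u0] * (1 - a) + feat[v0][u1] * a
    bottom = feat[v1][u0] * (1 - a) + feat[v1][u1] * a
    return top * (1 - b) + bottom * b


rng = random.Random(0)
centers = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0)]
mask = [True, False, True]
points = sample_points_in_mask(centers, mask, num_points=5, rng=rng)
```

Because only a fixed number of points per query is sampled, the cost per query does not depend on the image resolution or on the number of voxels, unlike masked cross-attention over all key positions.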
Prediction. For class prediction, we apply a linear classifier with a sigmoid activation to the query embeddings. For mask prediction, the query embeddings are converted to mask embeddings by an MLP. The mask embeddings have the same shape as the query embeddings, and their dot product with the sparse voxel embeddings produces the mask predictions. Thus, the prediction space of our mask transformer is constrained to the sparsified 3D space from the sparse voxel decoder, rather than the full 3D scene. The mask predictions serve as the mask queries for the next layer.
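The dot-product mask prediction can be illustrated with a small pure-Python sketch. The embeddings below are toy values, and applying a sigmoid followed by a 0.5 threshold to obtain a binary mask is an assumption made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_masks(mask_embeds, voxel_embeds):
    """Return, for each query, a probability over the kept sparse voxels."""
    return [
        [sigmoid(sum(q * v for q, v in zip(query, voxel))) for voxel in voxel_embeds]
        for query in mask_embeds
    ]


# Two queries and three sparse voxels with 2-dimensional toy embeddings.
mask_embeds = [[1.0, 0.0], [0.0, 1.0]]
voxel_embeds = [[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]]
probs = predict_masks(mask_embeds, voxel_embeds)
binary = [[p > 0.5 for p in row] for row in probs]
```

The output has one row per query and one column per kept voxel, so its size scales with the number of sparse voxels rather than with the full 3D grid.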
Supervision. The reconstruction result from the sparse voxel decoder may not be reliable, as it may overlook or inaccurately detect certain elements. Thus, supervising the mask transformer presents certain challenges since its predictions are confined within this unreliable space. In cases of missed detection, where some ground truth segments are absent in the predicted sparse occupancy, we opt to discard these segments to prevent confusion. As for inaccurately detected elements, we simply categorize them as an additional "no object" category.
Loss Functions. Following MaskFormer [7], we match the ground truth with the predictions using Hungarian matching. Focal loss $\mathcal{L}_{focal}$ is used for classification, while a combination of DICE loss [37] $\mathcal{L}_{dice}$ and BCE mask loss $\mathcal{L}_{mask}$ is used for mask prediction. Thus, the total loss of SparseOcc is composed of:
$$\mathcal{L} = \mathcal{L}_{focal} + \mathcal{L}_{mask} + \mathcal{L}_{dice} + \mathcal{L}_{occ},$$
where $\mathcal{L}_{occ}$ is the loss of the sparse voxel decoder.

4 Ray-level mIoU

4.1 Revisiting the Voxel-level mIoU

The Occ3D dataset [48], along with its proposed evaluation metrics, is widely recognized as a benchmark in this field. The ground-truth occupancy is reconstructed from LiDAR measurements, and the mean Intersection over Union (mIoU) at the voxel level is employed to assess performance.
However, due to factors such as distance and occlusion, accumulated LiDAR point clouds are imperfect. Some areas unscanned by LiDAR are marked as free, resulting in fragmented instances, as shown in Fig. 4(a). This raises the problem of label inconsistency. Previous efforts to solve this evaluation issue, e.g. Occ3D, use a binary visible mask that indicates whether a voxel is observed in the current camera view. However, we find that calculating mIoU only on the observed voxel positions can still cause ambiguity and is easily gamed. As illustrated in Fig. 4, RenderOcc [39] generates a cracked but thick surface that is practically unusable for downstream tasks. Nevertheless, RenderOcc's masked mIoU is much higher than that of BEVFormer [26], whose prediction is more stable and clean.
The misalignment between qualitative and quantitative results is caused by the inconsistency along the depth direction of the visible mask. As shown in Fig. 5, this toy example reveals several issues with the current evaluation metrics:
Figure 4: Illustration of the discrepancy between the qualitative results and the quantitative metric of Occ3D. (a) Ground truth: some areas are unobserved and can be filtered out by the visible mask. (b) BEVFormer: the prediction, trained without the visible mask, has a thin surface. (c) RenderOcc: the prediction has a thick surface.
Figure 5: Consider a scenario where we have a wall in front of us, with a given ground-truth distance and thickness. When the prediction is much thicker than the ground truth, as in RenderOcc, the masked mIoU becomes inconsistent along depth. If the predicted wall is one wall-thickness farther than the ground truth, its IoU is zero, because none of the predicted voxels align with the actual wall. However, if the predicted wall is one wall-thickness closer than the ground truth, it still achieves an IoU of 0.5, since all voxels behind the surface are filled. Similarly, a wall predicted even closer still obtains a nonzero IoU, and so on.
  1. If the model fills all areas behind the surface, the metric penalizes depth errors inconsistently. The model can obtain a higher IoU by filling all areas behind the surface and predicting a closer depth. This thick surface issue is very common in models that use the visible mask or supervision.
  2. If the predicted occupancy is a thin surface, the penalty is overly strict, as a deviation of one voxel leads to an IoU of zero.
  3. The visible mask only considers the visible area at the current moment, thereby reducing occupancy prediction to a depth estimation task with categories and overlooking the crucial ability to complete scenes beyond the visible region.
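The toy wall example of Fig. 5 can be reproduced numerically along a single ray. The sketch below makes simplifying assumptions: the ground-truth wall is one voxel thick, voxels up to and including that wall are treated as visible, and everything behind it is masked out. The function name and indexing are illustrative.

```python
def masked_iou(pred_start, gt_index, thick):
    """IoU of the wall class along one ray, evaluated on visible voxels only.

    Voxels 0..gt_index count as visible (free space plus the first surface);
    voxels behind the ground-truth surface are excluded by the visible mask.
    """
    visible = range(gt_index + 1)
    gt = {gt_index}
    if thick:
        # A thick prediction fills every voxel from pred_start onwards.
        pred = {i for i in visible if i >= pred_start}
    else:
        # A thin prediction occupies a single voxel.
        pred = {pred_start} & set(visible)
    return len(gt & pred) / len(gt | pred)


gt_wall = 10
thick_farther = masked_iou(gt_wall + 1, gt_wall, thick=True)   # 0.0
thick_closer_1 = masked_iou(gt_wall - 1, gt_wall, thick=True)  # 0.5
thick_closer_2 = masked_iou(gt_wall - 2, gt_wall, thick=True)  # 1/3
thin_closer_1 = masked_iou(gt_wall - 1, gt_wall, thick=False)  # 0.0
```

The same one-voxel depth error yields an IoU of 0.5 for a thick prediction that errs towards the camera but 0 for a thin prediction or one that errs away from it, which is the inconsistency described in the list above.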

4.2 Mean IoU by Ray Casting

To address the above issues, we propose a new evaluation metric: Ray-level mIoU (RayIoU for short). In RayIoU, the elements of the set are now query rays, not voxels. We emulate LiDAR by projecting query rays into the predicted 3D occupancy volume. For each query ray, we compute the distance it travels before intersecting any surface and retrieve the corresponding class label. We then apply the same procedure to the ground-truth occupancy to obtain the ground-truth depth and class label. In case a ray doesn't intersect with any voxel present in the ground truth, it will be excluded from the evaluation process.
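As a rough illustration of the query-ray procedure, the sketch below marches along a ray with a fixed step and returns the depth and class of the first occupied voxel. The fixed-step march and the dictionary-based occupancy grid are simplifications chosen for brevity; they are not the ray casting implementation used for RayIoU.

```python
import math


def cast_ray(occupancy, origin, direction, voxel_size=1.0, step=0.05, max_range=50.0):
    """Return (depth, class) of the first occupied voxel hit by the ray, or None."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    for n in range(int(max_range / step) + 1):
        t = n * step
        key = tuple(math.floor((o + t * u) / voxel_size) for o, u in zip(origin, unit))
        if key in occupancy:
            return t, occupancy[key]
    return None


origin = (0.5, 0.5, 0.5)
thin_wall = {(5, 0, 0): "wall"}
thick_wall = {(5, 0, 0): "wall", (6, 0, 0): "wall", (7, 0, 0): "wall"}

hit_thin = cast_ray(thin_wall, origin, (1.0, 0.0, 0.0))
hit_thick = cast_ray(thick_wall, origin, (1.0, 0.0, 0.0))
miss = cast_ray(thin_wall, origin, (0.0, 1.0, 0.0))  # never meets the wall
```

Both walls return the same depth and label, because voxels behind the first hit are never reached by the ray. A ray such as `miss` that hits nothing in the ground truth would be excluded from the evaluation.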
As shown in Fig. 6(a), the original LiDAR rays in a real dataset tend to be unbalanced from near to far. Thus, we resample the LiDAR rays to balance the distribution across distances, as shown in Fig. 6(b). For the near field, we modify the LiDAR ray channels to achieve equidistant spacing when projected onto the ground plane. In the far field, we increase the angular resolution of the ray channels to ensure a more uniform data density across varying ranges. Moreover, our query rays can originate from the LiDAR position at the current, past, or future moments of the ego path. As shown in Fig. 6(c), temporal casting allows us to better evaluate scene completion performance while ensuring that the task remains well-posed.
Figure 6: Covered area of RayIoU. (a) Simulated LiDAR: the raw LiDAR ray samples are unbalanced across distances. (b) Equal-distant resampling: we resample the rays to balance the weight over distance in RayIoU. (c) Temporal casting: to investigate the performance of scene completion, we propose evaluating occupancy in the visible area over a wide time span by casting rays from visited waypoints.
A query ray is classified as a true positive (TP) if the class labels coincide and the L1 error between the ground-truth depth and the predicted depth is less than a certain threshold. Let $C$ be the number of classes,
$$\mathrm{RayIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c},$$
where $TP_c$, $FP_c$ and $FN_c$ correspond to the number of true positive, false positive, and false negative predictions for class $c$.
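Given per-ray depth and class for both prediction and ground truth, the metric reduces to per-class counting. The sketch below is one possible reading of the definition and rests on assumptions: a ray of class c with a correct label but wrong depth counts as both a false negative and a false positive, a predicted ray that hits nothing is passed as `None`, and classes that never appear are skipped when averaging.

```python
def ray_iou(pred, gt, classes, threshold):
    """Mean over classes of TP / (TP + FP + FN), computed on query rays.

    pred, gt: lists of (depth, class) per ray; rays with no ground-truth hit
    are assumed to have been removed beforehand.
    """
    ious = []
    for c in classes:
        tp = fp = fn = 0
        for p, (gt_depth, gt_cls) in zip(pred, gt):
            pred_depth, pred_cls = p if p is not None else (float("inf"), None)
            correct = pred_cls == gt_cls and abs(pred_depth - gt_depth) < threshold
            if gt_cls == c and correct:
                tp += 1
            else:
                fn += gt_cls == c
                fp += pred_cls == c
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious) if ious else 0.0


gt = [(10.0, "car"), (20.0, "road"), (5.0, "road"), (8.0, "car")]
pred = [(10.5, "car"), (20.2, "road"), (8.0, "road"), (8.1, "road")]
classes = ["car", "road"]

iou_1m = ray_iou(pred, gt, classes, threshold=1.0)
iou_4m = ray_iou(pred, gt, classes, threshold=4.0)
# Averaging over several thresholds, e.g. the 1, 2 and 4 meters used in Sec. 5.
final = sum(ray_iou(pred, gt, classes, t) for t in (1.0, 2.0, 4.0)) / 3
```

Loosening the threshold from 1 m to 4 m turns the third ray (3 m depth error) into a true positive, which illustrates how the distance threshold softens the all-or-nothing behaviour of voxel-level IoU.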
RayIoU addresses all three of the aforementioned problems:
  1. Since a query ray only measures the distance to the first voxel it touches, the model cannot obtain a higher IoU by filling the area behind the surface.
  2. RayIoU determines TP through the distance threshold, mitigating the overly harsh nature of voxel-level mIoU.
  3. The query ray can originate from any position in the scene, thereby accounting for the model's completion ability and preventing the reduction of occupancy prediction to depth estimation.

5 Experiments

We evaluate our model on the Occ3D-nus [48] dataset. Occ3D-nus is based on the nuScenes [3] dataset, which consists of large-scale multimodal data collected from 6 surround-view cameras, 1 LiDAR sensor and 5 radar sensors. The dataset has 1000 videos and is split into 700/150/150 videos for training/validation/testing. Each video is roughly 20 s long, and the key samples are annotated at regular intervals.
We use the proposed RayIoU to evaluate the semantic segmentation performance. The query rays originate from 8 LiDAR positions along the ego path. We calculate RayIoU under three distance thresholds: 1, 2 and 4 meters. The final ranking metric is averaged over these distance thresholds.

5.1 Implementation Details

We implement our model using PyTorch [40]. Following previous methods, we adopt ResNet-50 [12] as the image backbone. The mask transformer consists of 3 layers with weights shared across layers. In our main experiments, we employ semantic queries, where each query corresponds to a semantic class rather than an instance. The ray casting module in RayIoU is implemented based on the codebase of [19].
Table 1: 3D occupancy prediction performance on Occ3D-nuScenes [48]. We use RayIoU to compare our SparseOcc with other methods. "8f" and "16f" mean fusing temporal information from 8 or 16 frames. SparseOcc outperforms all existing methods under a weaker setting.
| Method | Backbone | Input Size | Epochs | RayIoU | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) | FPS |
| BEVFormer [26] | R101 | | 24 | 32.4 | 26.1 | 32.9 | 38.0 | 3.0 |
| RenderOcc [39] | Swin-B | | 12 | 19.5 | 13.4 | 19.6 | 25.5 | - |
| BEVDet-Occ [13] | R50 | | 90 | 29.6 | 23.6 | 30.0 | 35.1 | 2.6 |
| BEVDet-Occ-Long (8f) | R50 | | 90 | 32.6 | 26.6 | 33.1 | 38.2 | 0.8 |
| FB-Occ (16f) [27] | R50 | | 90 | 33.5 | 26.7 | 34.1 | 39.7 | 10.3 |
| SparseOcc (8f) | R50 | | 24 | 34.0 | 28.0 | 34.7 | 39.4 | 17.3 |
| SparseOcc (16f) | R50 | | 24 | 35.1 | | | | 12.5 |
Table 2: Sparse voxel decoder vs. dense voxel decoder. Our sparse voxel decoder achieves considerably faster inference speed than the dense counterparts.
Voxel Decoder RayIoU RayIoU (1m) RayIoU (2m) RayIoU (4m) FPS
Dense coarse-to-fine 30.4 6.3
Dense patch-based 25.8 20.4 26.0 30.9 7.8
Sparse coarse-to-fine 23.9 35.2
During training, we use the AdamW [35] optimizer with a global batch size of 8. The initial learning rate is set to and is decayed with a cosine annealing policy. For all experiments, we train our models for 24 epochs. FPS is measured on a Tesla A100 GPU with the PyTorch backend (batch size of 1).
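The cosine annealing policy mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' training code: `BASE_LR` is a hypothetical placeholder (not the value used in the paper), `MIN_LR` is assumed to be zero, and the schedule is stepped once per epoch for simplicity.

```python
import math

BASE_LR = 1e-4      # hypothetical placeholder, NOT the paper's learning rate
MIN_LR = 0.0        # assumed floor of the schedule
TOTAL_EPOCHS = 24   # from the text: all models are trained for 24 epochs


def cosine_annealed_lr(epoch: int,
                       base_lr: float = BASE_LR,
                       min_lr: float = MIN_LR,
                       total_epochs: int = TOTAL_EPOCHS) -> float:
    """Learning rate at the start of `epoch` under cosine annealing."""
    # Clamp so that out-of-range epochs map to the schedule's end points.
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 6, 12, 18, 24):
        print(f"epoch {epoch:2d}: lr = {cosine_annealed_lr(epoch):.2e}")
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to the floor by the final epoch. In a PyTorch setup the same curve would typically come from a built-in scheduler rather than a hand-written function.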

5.2 Main Results

In Tab. 1 and Fig. 1 (c), we compare SparseOcc with previous state-of-the-art methods on the validation split of Occ3D-nuScenes. Despite operating under a weaker setting (ResNet-50 [12], 8 history frames, and input image resolution of ), SparseOcc significantly outperforms previous methods, including FB-Occ, the winner of the CVPR 2023 occupancy challenge, which relies on many complicated designs such as forward-backward view transformation, a depth net, and joint depth and semantic pre-training. SparseOcc achieves better results (+1.6 RayIoU) while being faster and simpler than FB-Occ, which demonstrates the superiority of our solution.
We further provide qualitative results in Fig. 7. Both BEVDet-Occ and FB-Occ are dense methods and make many redundant predictions behind the surface. In contrast, SparseOcc discards over of voxels while still effectively modeling the geometry of the scene and capturing fine-grained details.
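As a sanity check on how the columns of Table 1 relate, the sketch below assumes that the headline RayIoU is the plain average of the three per-threshold scores; that averaging rule is an assumption here, not something stated in this section. The numbers are copied from the complete rows of Table 1.

```python
from typing import Dict, List, Tuple

# (headline RayIoU, [per-threshold RayIoU values]) copied from Table 1.
TABLE_1: Dict[str, Tuple[float, List[float]]] = {
    "BEVFormer": (32.4, [26.1, 32.9, 38.0]),
    "RenderOcc": (19.5, [13.4, 19.6, 25.5]),
    "BEVDet-Occ": (29.6, [23.6, 30.0, 35.1]),
    "BEVDet-Occ-Long (8f)": (32.6, [26.6, 33.1, 38.2]),
    "FB-Occ (16f)": (33.5, [26.7, 34.1, 39.7]),
    "SparseOcc (8f)": (34.0, [28.0, 34.7, 39.4]),
}


def mean_over_thresholds(per_threshold: List[float]) -> float:
    """Average of the per-threshold scores (assumed aggregation rule)."""
    return sum(per_threshold) / len(per_threshold)


def max_deviation(table: Dict[str, Tuple[float, List[float]]]) -> float:
    """Largest gap between a reported headline score and the recomputed mean."""
    return max(abs(headline - mean_over_thresholds(per))
               for headline, per in table.values())


if __name__ == "__main__":
    for name, (headline, per) in TABLE_1.items():
        print(f"{name:22s} reported {headline:.1f}  mean {mean_over_thresholds(per):.2f}")
    print(f"largest gap: {max_deviation(TABLE_1):.2f}")
```

Under this assumption every row agrees with its reported headline value to within 0.1, which is consistent with rounding. Note also that the +1.6 RayIoU margin over FB-Occ equals 35.1 - 33.5, so it appears to refer to the 16-frame model quoted in the abstract; the 8-frame model's margin is 34.0 - 33.5 = 0.5.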

5.3 Ablations

In this section, we conduct ablations on the validation split of Occ3D-nuScenes to confirm the effectiveness of each module. By default, we use the single frame version of SparseOcc as the baseline. The choice for our model is made bold.
Sparse voxel decoder vs. dense voxel decoder. In Tab. 2, we compare our sparse voxel decoder to its dense counterparts. Here, we implement two baselines, both of which output a dense feature map with shape as . The first baseline is a coarse-to-fine architecture without pruning empty voxels. In this baseline, we also replace self-attention with 3D convolution and use 3D deconvolution to upsample predictions. The other baseline is a patch-based architecture that divides the 3D space into a small number of patches, as PETRv2 [34] does for BEV segmentation. We use queries and each one of them corresponds to a specific patch of shape . A stack of deconvolution layers is used to lift the coarse queries to a full-resolution 3D volume.

Figure 7: Visualized comparison of semantic occupancy prediction. Despite discarding over of voxels, our SparseOcc effectively models the geometry of the scene and captures fine-grained details (e.g. the yellow-marked traffic cone in the bottom row).
As we can see from the table, the dense coarse-to-fine baseline achieves a good performance of 29.9 RayIoU but with a slow inference speed of 6.3 FPS. The patch-based baseline is slightly faster at 7.8 FPS but suffers a severe performance drop of 4.1 RayIoU. In contrast, our sparse voxel decoder produces sparse 3D features in the shape of (where ), achieving an inference speed that is nearly faster than the counterparts without compromising performance. This demonstrates the necessity and effectiveness of our sparse design.
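To make the difference between the dense and sparse coarse-to-fine designs concrete, the following toy sketch prunes a coarse grid to its top-scoring voxels before upsampling. It is an illustration under stated assumptions, not the authors' decoder: the occupancy scores are hand-made, the grid sizes are tiny, and the helper names `prune_top_k` and `subdivide` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def prune_top_k(scores: Dict[Voxel, float], k: int) -> List[Voxel]:
    """Keep the k voxels with the highest (predicted) occupancy score."""
    ranked = sorted(scores, key=lambda v: scores[v], reverse=True)
    return ranked[:k]


def subdivide(voxels: List[Voxel]) -> List[Voxel]:
    """Split every voxel into its 2x2x2 children at twice the resolution."""
    return [(2 * x + dx, 2 * y + dy, 2 * z + dz)
            for (x, y, z) in voxels
            for dx, dy, dz in product((0, 1), repeat=3)]


if __name__ == "__main__":
    # Toy 4x4x2 coarse grid whose "surface" lies on the x == y diagonal at z == 0.
    coarse = {(x, y, z): (0.9 if (x == y and z == 0) else 0.1)
              for x, y, z in product(range(4), range(4), range(2))}
    kept = prune_top_k(coarse, k=4)
    fine_sparse = subdivide(kept)            # sparse path: upsample survivors only
    fine_dense = subdivide(list(coarse))     # dense path: upsample everything
    print(len(fine_sparse), "sparse voxels vs", len(fine_dense), "dense voxels")
```

A dense decoder processes every child voxel at each level, whereas the sparse path only carries the children of the voxels that survived pruning, which is where the savings in computation come from.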
Mask Transformer. In Tab. 3, we first ablate the effectiveness of the mask transformer. The first row is a per-voxel baseline that directly predicts semantics from the sparse voxel decoder using a stack of MLPs. Introducing the mask transformer with vanilla dense cross attention (the common practice in MaskFormer and Mask3D) gives a performance boost of 1.7 RayIoU, but inevitably slows down inference. Therefore, to speed up the dense cross-attention pipeline, we adopt a sparse sampling mechanism, which brings a 50% reduction in inference time. Then, by further sampling sparse 3D points via predicted mask guidance, we finally achieve 29.2 RayIoU with 24 FPS.
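The point-sampling step of mask-guided sparse sampling can be sketched as follows. This is a simplified illustration, not the authors' implementation: it covers only the choice of 3D sample locations (not the subsequent feature lookup), and the function name, the 0.5 threshold, the uniform jitter inside a voxel, and the fallback for an empty mask are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def mask_guided_sample(voxel_centers: Sequence[Point],
                       mask_probs: Sequence[float],
                       num_points: int,
                       voxel_size: float,
                       threshold: float = 0.5,
                       seed: int = 0) -> List[Point]:
    """Draw 3D points only from voxels that one query's predicted mask covers."""
    rng = random.Random(seed)
    candidates = [c for c, p in zip(voxel_centers, mask_probs) if p > threshold]
    if not candidates:
        # Assumed fallback: with an empty mask, sample from all sparse voxels.
        candidates = list(voxel_centers)
    half = voxel_size / 2.0
    points: List[Point] = []
    for _ in range(num_points):
        cx, cy, cz = rng.choice(candidates)
        # Jitter uniformly inside the chosen voxel.
        points.append((cx + rng.uniform(-half, half),
                       cy + rng.uniform(-half, half),
                       cz + rng.uniform(-half, half)))
    return points


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    probs = [0.9, 0.1, 0.8]   # toy mask prediction for a single query
    for point in mask_guided_sample(centers, probs, num_points=4, voxel_size=1.0):
        print(tuple(round(c, 2) for c in point))
```

Because every sample lands inside a voxel that the query already claims, the query interacts with a handful of relevant locations instead of attending to all voxels, which is the intuition behind the speed-up over dense cross attention.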
Is a limited set of voxels sufficient to cover the scene? In this study, we delve deeper into the impact of voxel sparsity on final performance. To investigate this, we systematically ablate the value of k in Fig. 8 (a). Starting from a modest value of 16000, we observe that the optimal performance occurs when k is set to 32000, which is only 5% of the total number of dense voxels. Surprisingly, further increasing k does not yield any performance improvement; instead, it introduces noise. Thus, our findings suggest that a 5% sparsity level is sufficient, and keeping more voxels would be counterproductive.

Table 3: Ablation of mask transformer (MT) and the cross attention module in MT. Mask-guided sparse sampling is stronger and faster than the dense cross attention.

| MT | Cross Attention | RayIoU | RayIoU_1m | RayIoU_2m | RayIoU_4m | FPS |
|---|---|---|---|---|---|---|
| - | - | 27.0 | 20.3 | 27.5 | 33.1 | |
| ✓ | Dense cross attention | 28.7 | 22.9 | 29.3 | 33.8 | 16.2 |
| ✓ | Sparse sampling | 25.8 | 20.5 | 26.2 | 30.8 | 24.0 |
| ✓ | + Mask-guided | 29.2 | | | | 24.0 |

Figure 8: Ablations on voxel sparsity (panel a, top-k) and temporal modeling (panel b, number of frames). (a) The optimal performance occurs when k is set to 32000 (5% sparsity). (b) The performance continues to increase with the number of frames, but it starts to saturate after 12 frames.
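The 5% figure can be reproduced with a one-line calculation once a dense grid size is fixed. The sketch below assumes a dense grid of 200 x 200 x 16 voxels; that shape is an assumption for illustration and is not stated in this section.

```python
from typing import Tuple

GRID_SHAPE: Tuple[int, int, int] = (200, 200, 16)  # assumed dense grid shape


def sparsity_ratio(k: int, grid_shape: Tuple[int, int, int] = GRID_SHAPE) -> float:
    """Fraction of the dense grid retained when only k voxels are kept."""
    total = 1
    for dim in grid_shape:
        total *= dim
    return k / total


if __name__ == "__main__":
    for k in (16000, 32000, 64000):
        print(f"k = {k:6d}: {sparsity_ratio(k):.1%} of dense voxels kept")
```

Under this assumed grid, k = 32000 corresponds to keeping 5% of the dense voxels, and the starting value k = 16000 to 2.5%.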
Temporal modeling. In Fig. 8 (b), we validate the effectiveness of temporal fusion. We can see that the temporal modeling of SparseOcc is very effective, with performance steadily increasing as the number of frames increases. The performance peaks at 12 frames and then saturates.