
Fully Sparse 3D Occupancy Prediction

Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu, Jia Zeng, Li Chen, Hongyang, Limin Wang
Nanjing University, Shanghai AI Lab
https://github.com/MCG-NJU/SparseOcc

Abstract

Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and incurring high computational costs. To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from visual inputs and then predicts semantic/instance occupancy from this sparse 3D representation using sparse queries. A mask-guided sparse sampling is designed to let sparse queries interact with features in a fully sparse manner, thereby avoiding costly dense features or global attention. Additionally, we design a ray-based evaluation metric, RayIoU, to resolve the inconsistent penalty along depth that arises in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS with 7 history frames as input. With the number of preceding frames increased to 15, SparseOcc further improves its performance to 35.1 RayIoU without bells and whistles.

1 Introduction

Vision-centric 3D occupancy prediction [1] focuses on partitioning 3D scenes into structured grids from visual images. Each grid is assigned a label indicating whether it is occupied. This task offers more geometric details than 3D detection and produces an alternative representation to LiDAR.

Existing methods typically construct dense 3D features yet suffer from computational overhead (e.g. [...] FPS on Tesla A100). However, dense representations are not necessary for occupancy prediction. Statistics in Fig. 1(a) reveal the geometric sparsity: more than [...] of the voxels are empty. This leaves large room for accelerating occupancy prediction by exploiting the sparsity. Some works explore the sparsity of 3D scenes, but they still rely on sparse-to-dense modules for dense predictions. This inspires us to seek a fully sparse occupancy network without any dense design.

In this paper, we propose SparseOcc, the first fully sparse occupancy network. As depicted in Fig. 1(b), SparseOcc includes two steps. First, it leverages a sparse voxel decoder to reconstruct the sparse geometry of a scene in a coarse-to-fine manner. This only models non-free regions, saving computational costs significantly. Second, we design a mask transformer with sparse semantic/instance queries to predict masks and labels of segments from the sparse space. The mask transformer not only improves performance on semantic occupancy but also paves the way for panoptic occupancy. A mask-guided sparse sampling is designed to achieve sparse cross-attention in the mask transformer.

Figure 1: (a) Geometry sparsity. We compute statistics on geometry sparsity and find that even the scene with the fewest empty voxels still has over [...] empty voxels. (b) Overview of SparseOcc. SparseOcc reconstructs a sparse 3D representation from input images with a sparse voxel decoder, then uses a set of sparse queries to estimate the mask and label of each segment from the sparse 3D volume. (c) Performance comparison on the validation split of Occ3D-nus. FPS is measured on a single Tesla A100 with the PyTorch fp32 backend.

As such, our SparseOcc fully exploits the sparse properties, with a fully sparse architecture free of any dense design like dense 3D features, sparse-to-dense modules, and global attention.

Besides, we notice flaws in the popular voxel-level mean Intersection-over-Union (mIoU) metric for occupancy evaluation and design a ray-level evaluation, RayIoU, as the solution. The mIoU criterion is ill-posed given the ambiguous labeling of unscanned voxels. Previous methods [48] relieve this issue by evaluating only observed areas, but this introduces an inconsistent penalty along depth. RayIoU instead addresses both issues simultaneously. It evaluates the predicted 3D occupancy volume by retrieving the depth and category predictions of designated rays. Specifically, RayIoU casts query rays into the predicted 3D volume and counts a ray as a true positive if its first touched occupied voxel has the correct distance and class. This yields a fairer and more reasonable criterion.

Thanks to the sparsity design, SparseOcc achieves 34.0 RayIoU on Occ3D-nus [48] while maintaining a real-time inference speed of 17.3 FPS (Tesla A100, PyTorch fp32 backend) with 7 history frames as input. With the number of preceding frames increased to 15, SparseOcc further improves its performance to 35.1 RayIoU, achieving state-of-the-art performance without bells and whistles. The comparison between SparseOcc and previous methods in terms of performance and efficiency is shown in Fig. 1(c).

We summarize our contributions as follows:

We propose SparseOcc, the first fully sparse occupancy network without any time-consuming dense designs. It achieves 34.0 RayIoU on the Occ3D-nus benchmark with an inference speed of 17.3 FPS.
We present RayIoU, a ray-wise criterion for occupancy evaluation. By casting query rays into the 3D volume, it solves the ambiguous penalty issue for unscanned free voxels and the inconsistent depth penalty issue in the mIoU metric.

2 Related Work

Camera-based 3D Occupancy Prediction. Occupancy Networks were originally proposed by Mescheder et al. [36, 41], with a focus on continuous object representations in 3D space. Recent variations of occupancy networks shift their attention towards reconstructing 3D space and predicting voxel-level semantic information from image inputs. MonoScene [4] achieves scene occupancy completion through a 2D and a 3D UNet [42] connected by a sight projection module. OccNet [44] applies universal occupancy features to various downstream tasks and introduces the OpenOcc benchmark. SurroundOcc [55] proposes a coarse-to-fine architecture. However, the computational burden of handling a large number of voxel queries remains a challenge. TPVFormer [17] proposes using tri-perspective view representations to supplement vertical structural information, but this inevitably leads to information loss. VoxFormer [25] initializes sparse queries based on monocular depth prediction. However, VoxFormer is not fully sparse, as it still requires a sparse-to-dense MAE [11] module to complete the scene. Some methods emerged in the CVPR 2023 occupancy challenge [27, 39, 9], but none of them exploits a fully sparse design. In this paper, we take the first step towards exploring a fully sparse architecture for 3D occupancy prediction from camera inputs.

Figure 2: SparseOcc is a fully sparse architecture, since it neither relies on dense 3D features nor uses sparse-to-dense or global attention operations. The sparse voxel decoder reconstructs the sparse geometry of the scene as a sparse set of voxels. The mask transformer then uses sparse queries to predict the mask and label of each segment. SparseOcc can be easily extended to panoptic occupancy by replacing the semantic queries with instance queries.

End-to-end 3D Reconstruction from Posed Images. As a related task to 3D occupancy prediction, 3D reconstruction recovers the 3D geometry from multiple posed images. Recent methods focus on more compact and efficient end-to-end 3D reconstruction pipelines [38, 46, 2, 45, 10]. Atlas [38] extracts features from multi-view input images and maps them to 3D space to construct the truncated signed distance function (TSDF) [8]. NeuralRecon [46] directly reconstructs local surfaces as sparse TSDF volumes and uses a GRU-based TSDF fusion module to fuse features from previous fragments. VoRTX [45] utilizes transformers to address occlusion issues in multi-view images. CVRecon [10] takes a different approach by avoiding occlusion-sensitive perspective mappings from 2D to 3D and directly uses cost volumes to establish 3D features from 2D image features.

Mask Transformer. Recently, unified segmentation models have been widely studied to handle semantic and instance segmentation concurrently. Cheng et al. first propose MaskFormer [7] for unified segmentation in terms of model architecture, loss functions, and training strategies. Mask2Former [6] then introduces masked attention, with receptive fields restricted to instance masks, for better performance. Later on, Mask3D [43] successfully extends the mask transformer to point cloud segmentation with state-of-the-art performance. OpenMask3D [47] further tackles the open-vocabulary 3D instance segmentation task and proposes a model for zero-shot 3D segmentation.

3 SparseOcc

SparseOcc is a vision-centric occupancy model that only requires camera inputs. As shown in Fig. 2, SparseOcc has three modules: an image encoder consisting of an image backbone and FPN [29] to extract 2D features from multi-view images; a sparse voxel decoder (Sec. 3.1) to predict sparse class-agnostic 3D occupancy with correlated embeddings from the image features; and a mask transformer decoder (Sec. 3.2) to distinguish semantics and instances in the sparse 3D space.

3.1 Sparse Voxel Decoder

Since 3D occupancy ground truths are dense 3D volumes, existing methods typically build a dense 3D feature, but suffer from computational overhead. In this paper, we argue that such a dense representation is not necessary for occupancy prediction. In our statistics, we find that over [...] of the voxels in the scene are free. This motivates us to explore a sparse 3D representation that only models the non-free areas of the scene, thereby saving computational resources.

Figure 3: The detailed architecture of the sparse voxel decoder, which follows a coarse-to-fine pipeline with three layers. Within each layer, we use a transformer-like architecture for 3D-2D interaction. At the end of each layer, it upsamples each voxel by $2\times$ and estimates the probability of being occupied for each voxel.

Overall architecture. Our sparse voxel decoder is shown in Fig. 3. In general, it follows a coarse-to-fine structure but only models the non-free regions. The decoder starts from a set of coarse voxel queries equally distributed in the 3D space. In each layer, we first upsample each voxel by $2\times$, e.g. a voxel with size $d$ will be upsampled into 8 voxels with size $d/2$. Next, we estimate an occupancy score for each voxel and prune useless voxel grids. There are two approaches to pruning: one is based on a score threshold; the other is top-$k$ selection. In our implementation, we simply keep the voxels with the top-$k$ occupancy scores for training efficiency. Here $k$ is a dataset-related parameter, obtained by counting the maximum number of non-free voxels in each sample at different resolutions. The voxel tokens after pruning serve as the input to the next layer.
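The upsample-and-prune step described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: voxels are represented as integer grid indices, `toy_score` is a hypothetical stand-in for the learned occupancy head, and the value of `k` is chosen arbitrarily.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def upsample_2x(voxels: List[Voxel]) -> List[Voxel]:
    """Split every voxel into its 8 children on a grid with twice the resolution."""
    children = []
    for x, y, z in voxels:
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    children.append((2 * x + dx, 2 * y + dy, 2 * z + dz))
    return children


def prune_top_k(voxels: List[Voxel], scores: Dict[Voxel, float], k: int) -> List[Voxel]:
    """Keep the k voxels with the highest occupancy score."""
    ranked = sorted(voxels, key=lambda v: scores[v], reverse=True)
    return ranked[:k]


def toy_score(voxel: Voxel) -> float:
    """Hypothetical occupancy score; a real model would predict this value."""
    x, y, z = voxel
    return 1.0 / (1.0 + x + y + z)


coarse = [(0, 0, 0), (1, 0, 0)]           # two coarse voxel queries
fine = upsample_2x(coarse)                # 16 finer voxels
scores = {v: toy_score(v) for v in fine}
kept = prune_top_k(fine, scores, k=4)     # only these enter the next layer
```

Because the number of kept voxels is fixed to `k`, every sample in a batch has the same number of tokens, which is one plausible reason why top-$k$ selection is convenient for training.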

Detailed design. Within each layer, we use a transformer-like architecture to handle voxel queries. The concrete architecture is inspired by SparseBEV [32], a detection method using a sparse scheme. Specifically, each layer takes a set of voxel queries, each described by a 3D location and a content vector. We first use self-attention to aggregate local and global features for those query voxels. Then, a linear layer generates 3D sampling offsets for each voxel query from the associated content vector. These sampling offsets are applied to the voxel queries to obtain reference points in global coordinates. Finally, we project the sampled reference points into multi-view image space and integrate image features by adaptive mixing [49, 18].
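To make the 3D-to-2D interaction concrete, the sketch below adds sampling offsets to a voxel center and projects the resulting reference points into one camera with a pinhole model. It is an illustration under assumptions: the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the fixed offsets are hypothetical, the points are assumed to be expressed in camera coordinates already, and in the paper the offsets are predicted by a linear layer rather than hard-coded.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def reference_points(center: Point, offsets: Sequence[Point]) -> List[Point]:
    """Shift a voxel center by each sampling offset."""
    return [
        (center[0] + off[0], center[1] + off[1], center[2] + off[2])
        for off in offsets
    ]


def project_to_image(
    point: Point, intrinsics: Tuple[float, float, float, float]
) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point
    if z <= 0:
        return None  # the point lies behind the camera and cannot be sampled
    fx, fy, cx, cy = intrinsics
    return (fx * x / z + cx, fy * y / z + cy)


center = (0.0, 0.0, 10.0)
offsets = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, -20.0)]
intrinsics = (1000.0, 1000.0, 800.0, 450.0)  # hypothetical camera

points = reference_points(center, offsets)
pixels = [project_to_image(p, intrinsics) for p in points]
```

Image features would then be read at the returned pixel locations, with points that fall behind a camera simply skipped for that view.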

Temporal modeling. Previous dense occupancy methods typically warp the history feature to the current timestamp and use deformable attention [62] or 3D convolutions to fuse temporal information. However, this approach is not directly applicable in our case due to the sparse nature of our 3D features. To handle this, we leverage the flexibility of the aforementioned global sampled reference points by warping them to previous timestamps to sample history multi-view image features. The sampled multi-frame features are then stacked and aggregated by adaptive mixing for temporal modeling.
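Warping reference points to a previous timestamp amounts to applying the ego-motion transform between the two frames. The sketch below is a simplified illustration that assumes planar motion (a yaw rotation plus a translation) and static scene points; `warp_point` and its parameters are hypothetical names, and a full implementation would use a general 4x4 pose matrix.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def warp_point(point: Point, yaw: float, translation: Point) -> Point:
    """Rotate a point about the z axis by `yaw` (radians), then translate it."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)


# The ego vehicle drove 2 m straight ahead (+x) since the previous frame, so a
# static point 10 m ahead now was 12 m ahead in the previous ego frame.
current_point = (10.0, 0.0, 1.0)
previous_point = warp_point(current_point, yaw=0.0, translation=(2.0, 0.0, 0.0))

# A pure 90-degree yaw maps the x axis onto the y axis.
rotated = warp_point((1.0, 0.0, 0.0), yaw=math.pi / 2, translation=(0.0, 0.0, 0.0))
```

The warped points can then be projected into the history images in the same way as the current ones.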

Supervision. We compute loss for sparse voxels from each layer. We use binary cross entropy (BCE) loss as the supervision, given that we are reconstructing a class-agnostic sparse occupancy space. Only the kept sparse voxels are supervised, while the discarded regions during pruning in earlier stages are ignored.

Moreover, due to the severe class imbalance, the model can be easily dominated by categories with a large proportion, such as the ground, thereby ignoring other important elements in the scene, such as cars and people. Therefore, voxels belonging to different classes are assigned different loss weights, computed from the number of voxels belonging to each class in the ground truth.
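One common way to derive such weights is inverse class frequency. The sketch below assumes that form purely for illustration; the exact weighting used by SparseOcc is not recoverable from the text here, and the class counts, function names, and the choice of weighting each voxel by its ground-truth class are hypothetical.

```python
import math
from typing import Dict, List


def inverse_frequency_weights(class_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed weighting: total voxel count divided by the per-class voxel count."""
    total = sum(class_counts.values())
    return {name: total / count for name, count in class_counts.items() if count > 0}


def weighted_bce(probs: List[float], targets: List[int], weights: List[float]) -> float:
    """Mean binary cross entropy with one weight per voxel."""
    eps = 1e-7
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


class_counts = {"ground": 900, "car": 90, "pedestrian": 10}  # hypothetical counts
weights = inverse_frequency_weights(class_counts)

# Two occupied voxels predicted with the same confidence: the rare class
# contributes far more to the loss than the frequent one.
loss_ground = weighted_bce([0.5], [1], [weights["ground"]])
loss_pedestrian = weighted_bce([0.5], [1], [weights["pedestrian"]])
```

With these counts a pedestrian voxel weighs 90 times more than a ground voxel, which counteracts the dominance of large classes.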

3.2 Mask Transformer

Our mask transformer is inspired by Mask2Former [6], which uses sparse semantic/instance queries, each decoupled into a binary mask query and a content vector. The mask transformer consists of three steps: multi-head self-attention (MHSA), mask-guided sparse sampling, and adaptive mixing. As is common practice, MHSA handles the interaction between different queries. Mask-guided sparse sampling and adaptive mixing are responsible for the interaction between queries and 2D image features.

Mask-guided sparse sampling. A simple baseline for the mask transformer is to use the masked cross-attention module of Mask2Former. However, it attends to all positions of the key, which is computationally prohibitive. Here, we design a simple alternative. We first randomly select a set of 3D points within the mask predicted by the previous Transformer decoder layer. Then, we project those 3D points to multi-view images and extract their features by bilinear interpolation. Besides, our sparse sampling mechanism makes temporal modeling easier by simply warping the sampling points (as done in the sparse voxel decoder).
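The random selection of 3D points inside a predicted mask can be sketched as follows. This is a minimal illustration, not the authors' code: the binarization threshold of 0.5, sampling with replacement, and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sample_points_in_mask(
    voxel_centers: Sequence[Point],
    mask_probs: Sequence[float],
    num_points: int,
    threshold: float = 0.5,  # assumed binarization threshold
    seed: int = 0,
) -> List[Point]:
    """Randomly draw voxel centers whose mask probability exceeds the threshold."""
    rng = random.Random(seed)
    inside = [c for c, p in zip(voxel_centers, mask_probs) if p > threshold]
    if not inside:
        return []  # an empty mask yields no sampling points
    return [rng.choice(inside) for _ in range(num_points)]


centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
probs = [0.1, 0.9, 0.8, 0.2]  # mask predicted by the previous layer
samples = sample_points_in_mask(centers, probs, num_points=5)
```

Only the sampled points are projected to the images, so the cost per query depends on `num_points` rather than on the number of voxels or image positions.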

Prediction. For class prediction, we apply a linear classifier with a sigmoid activation to the query embeddings. For mask prediction, the query embeddings are converted to mask embeddings by an MLP. The mask embeddings have the same shape as the query embeddings and are combined with the sparse voxel embeddings via a dot product to produce mask predictions. Thus, the prediction space of our mask transformer is constrained to the sparsified 3D space from the sparse voxel decoder, rather than the full 3D scene. The mask predictions serve as the mask queries for the next layer.
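The mask prediction step reduces to a dot product between each mask embedding and each sparse voxel embedding. The sketch below shows that computation with tiny hand-written vectors; applying a sigmoid to the result and the specific embedding values are assumptions made for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_masks(
    mask_embeddings: List[List[float]], voxel_embeddings: List[List[float]]
) -> List[List[float]]:
    """Return one row per query with a mask probability for every sparse voxel."""
    return [
        [sigmoid(sum(q * v for q, v in zip(query, voxel))) for voxel in voxel_embeddings]
        for query in mask_embeddings
    ]


mask_embeddings = [[1.0, 0.0], [0.0, 1.0]]                 # 2 queries
voxel_embeddings = [[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]]  # 3 kept voxels
masks = predict_masks(mask_embeddings, voxel_embeddings)
```

The output has one entry per kept voxel only, which reflects the statement that predictions are confined to the sparsified space.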

Supervision. The reconstruction result from the sparse voxel decoder may not be reliable, as it may overlook or inaccurately detect certain elements. Thus, supervising the mask transformer presents certain challenges since its predictions are confined within this unreliable space. In cases of missed detection, where some ground truth segments are absent in the predicted sparse occupancy, we opt to discard these segments to prevent confusion. As for inaccurately detected elements, we simply categorize them as an additional "no object" category.

Loss Functions. Following MaskFormer [7], we match the ground truth with the predictions using Hungarian matching. Focal loss is used for classification, while a combination of DICE loss [37] and BCE mask loss is used for mask prediction. Thus, the total loss of SparseOcc is composed of the focal loss, the DICE loss, the BCE mask loss, and the loss of the sparse voxel decoder.
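For readers unfamiliar with the DICE term, the sketch below shows one widely used soft formulation on a flat list of mask probabilities. The smoothing constant and the function name are assumptions; the paper does not state which variant or loss weights it uses.

```python
from typing import Sequence


def dice_loss(probs: Sequence[float], targets: Sequence[float], smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - (2 * intersection + smooth) / (sum of both masks + smooth)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


target = [1.0, 0.0, 1.0]
perfect = dice_loss([1.0, 0.0, 1.0], target)   # prediction equals the target
inverted = dice_loss([0.0, 1.0, 0.0], target)  # prediction is the complement
```

Unlike per-voxel BCE, this loss is driven by the overlap ratio, so it is less sensitive to how large a segment is.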

4 Ray-level mIoU

4.1 Revisiting the Voxel-level mIoU

The Occ3D dataset [48], along with its proposed evaluation metrics, is widely recognized as a benchmark in this field. The ground-truth occupancy is reconstructed from LiDAR measurements, and the mean Intersection over Union (mIoU) at the voxel level is employed to assess performance.

However, due to factors such as distance and occlusion, accumulated LiDAR point clouds are imperfect. Some areas unscanned by LiDAR are marked as free, resulting in fragmented instances, as shown in Fig. 4(a). This raises problems of label inconsistency. Previous efforts to solve this evaluation issue, e.g. Occ3D, use a binary visible mask that indicates whether a voxel is observed in the current camera view. However, we found that calculating mIoU only on the observed voxel positions can still cause ambiguity and is easily gamed. As illustrated in Fig. 4, RenderOcc [39] generates a cracked but thick surface that is practically unusable for downstream tasks. Nevertheless, RenderOcc's masked mIoU is much higher than that of BEVFormer [26], whose prediction is more stable and clean.

The misalignment between qualitative and quantitative results is caused by the inconsistency of the visible mask along the depth direction. The toy example in Fig. 5 reveals several issues with the current evaluation metrics:

Figure 4: Illustration of the discrepancy between the qualitative results and the quantitative metric of Occ3D. (a) Ground truth: the GT has some unobserved areas, which can be filtered out by the visible mask; (b) BEVFormer: the prediction of BEVFormer, trained without the visible mask, has a thin surface; (c) the prediction of RenderOcc has a thick surface.

Figure 5: Consider a scenario where we have a wall in front of us, with a ground-truth distance of [...] and a thickness of [...]. When the prediction has a thickness of [...], as in RenderOcc, the inconsistency of masked mIoU along depth emerges. If the wall we predict is [...] farther than the ground truth ([...] in total), then its IoU will be zero, because none of the predicted voxels align with the actual wall. However, if the wall we predict is [...] closer than the ground truth ([...] in total), we will still achieve an IoU of 0.5, since all voxels behind the surface are filled. Similarly, if the predicted depth is [...], we will still have an IoU of [...], and so on.

If the model fills all areas behind the surface, the metric penalizes depth errors inconsistently. The model can obtain a higher IoU by filling all areas behind the surface and predicting a closer depth. This thick surface issue is very common in models that use the visible mask or supervision.
If the predicted occupancy is a thin surface, the penalty is overly strict, as a deviation of one voxel will lead to an IoU of zero.
The visible mask only considers the visible area at the current moment, thereby reducing occupancy prediction to a depth estimation task with categories and overlooking the crucial ability to complete scenes beyond the visible region.
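The asymmetry in this toy example can be reproduced on a single ray of voxels. The sketch below rests on explicit assumptions that are not taken from the paper: the wall is exactly one voxel thick, the visible mask covers the free voxels in front of the wall plus the wall voxel itself, and a "thick" prediction fills every voxel from its surface to the end of the ray. The function name and indices are hypothetical.

```python
def masked_iou(pred_start: int, gt_index: int, num_voxels: int, thick: bool) -> float:
    """IoU of one predicted wall against a one-voxel wall, inside the visible mask."""
    visible = set(range(gt_index + 1))  # free space in front plus the wall voxel
    gt = {gt_index}
    if thick:
        pred = set(range(pred_start, num_voxels))  # everything behind the surface is filled
    else:
        pred = {pred_start}                        # thin, one-voxel surface
    intersection = len(pred & gt & visible)
    union = len((pred | gt) & visible)
    return intersection / union


GT_INDEX, NUM_VOXELS = 10, 20
thick_farther = masked_iou(11, GT_INDEX, NUM_VOXELS, thick=True)   # one voxel too far
thick_closer = masked_iou(9, GT_INDEX, NUM_VOXELS, thick=True)     # one voxel too close
thick_closer2 = masked_iou(8, GT_INDEX, NUM_VOXELS, thick=True)    # two voxels too close
thin_closer = masked_iou(9, GT_INDEX, NUM_VOXELS, thick=False)     # thin, one voxel off
```

Under these assumptions a thick prediction that is one voxel too close still scores 0.5 and one that is two voxels too close scores one third, while the same error in the far direction, or any one-voxel error of a thin surface, scores zero.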

4.2 Mean IoU by Ray Casting

To address the above issues, we propose a new evaluation metric: Ray-level mIoU (RayIoU for short). In RayIoU, the elements of the set are query rays rather than voxels. We emulate LiDAR by projecting query rays into the predicted 3D occupancy volume. For each query ray, we compute the distance it travels before intersecting any surface and retrieve the corresponding class label. We then apply the same procedure to the ground-truth occupancy to obtain the ground-truth depth and class label. If a ray does not intersect any voxel present in the ground truth, it is excluded from the evaluation.
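A query ray can be emulated by marching along its direction until it enters an occupied voxel. The sketch below uses fixed-step marching over a dictionary of occupied voxels, which is a simplification chosen for readability; the function name, step size, and range are assumptions, and the actual evaluation code builds on an existing ray casting codebase.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Index = Tuple[int, int, int]


def cast_ray(
    occupied: Dict[Index, int],   # voxel index -> class label, free voxels omitted
    origin: Sequence[float],
    direction: Sequence[float],
    voxel_size: float = 1.0,
    step: float = 0.1,            # assumed marching step
    max_range: float = 50.0,      # assumed maximum ray length
) -> Optional[Tuple[float, int]]:
    """Return (distance, class) of the first occupied voxel hit, or None on a miss."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    num_steps = int(max_range / step)
    for i in range(num_steps + 1):
        t = i * step
        index = tuple(math.floor((o + t * u) / voxel_size) for o, u in zip(origin, unit))
        if index in occupied:
            return t, occupied[index]
    return None


occupied = {(5, 0, 0): 3}  # a single voxel of class 3
hit = cast_ray(occupied, origin=(0.5, 0.5, 0.5), direction=(1.0, 0.0, 0.0))
miss = cast_ray(occupied, origin=(0.5, 0.5, 0.5), direction=(0.0, 1.0, 0.0))
```

Running the same function on the predicted and the ground-truth volumes gives one (depth, class) pair per ray for each, and rays that miss in the ground truth are dropped.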

As shown in Fig. 6(a), the original LiDAR rays in a real dataset tend to be unbalanced from near to far. Thus, we resample the LiDAR rays to balance the distribution over distance, as shown in Fig. 6(b). For the near field, we modify the LiDAR ray channels to achieve equal-distance spacing when projected onto the ground plane. In the far field, we increase the angular resolution of the ray channels to ensure a more uniform data density across varying ranges. Moreover, our query rays can originate from the LiDAR position at the current, past, or future moments of the ego path. As shown in Fig. 6, temporal casting allows us to better evaluate scene completion performance while ensuring that the task remains well-posed.

Figure 6: Covered area of RayIoU. (a) Simulated LiDAR: raw LiDAR ray samples are unbalanced across distances. (b) Equal-distance resampling: we resample the rays to balance the weight over distance in RayIoU. (c) To investigate scene completion performance, we propose evaluating occupancy in the visible area over a wide time span by casting rays from visited waypoints.

A query ray is classified as a true positive (TP) if the class labels coincide and the L1 error between the ground-truth depth and the predicted depth is less than a certain threshold. Let $C$ be the number of classes,

$$\mathrm{RayIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c},$$

where $TP_c$, $FP_c$, and $FN_c$ correspond to the number of true positive, false positive, and false negative predictions for class $c$.
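The counting behind this definition can be sketched as below. The code is an illustration under assumptions: each ray is given as a (depth, class) pair, a non-matching ray counts as a false positive for the predicted class and a false negative for the ground-truth class, and classes that never appear are left out of the average. The example rays are made up.

```python
from typing import List, Optional, Sequence, Tuple

Ray = Optional[Tuple[float, int]]  # (depth, class) of the first hit, or None


def ray_iou(pred: Sequence[Ray], gt: Sequence[Ray], num_classes: int, threshold: float) -> float:
    """Mean per-class IoU over query rays at one depth threshold."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if g is None:
            continue  # rays that miss the ground truth are excluded
        g_depth, g_class = g
        if p is not None and p[1] == g_class and abs(p[0] - g_depth) < threshold:
            tp[g_class] += 1
        else:
            fn[g_class] += 1
            if p is not None:
                fp[p[1]] += 1
    ious: List[float] = []
    for c in range(num_classes):
        denominator = tp[c] + fp[c] + fn[c]
        if denominator > 0:  # assumption: skip classes that never occur
            ious.append(tp[c] / denominator)
    return sum(ious) / len(ious) if ious else 0.0


gt_rays = [(10.0, 1), (20.0, 2), (5.0, 1)]
pred_rays = [(10.5, 1), (23.0, 2), (5.2, 2)]  # good, 3 m too far, wrong class

per_threshold = [ray_iou(pred_rays, gt_rays, num_classes=3, threshold=t) for t in (1.0, 2.0, 4.0)]
final_score = sum(per_threshold) / len(per_threshold)
```

The second ray only becomes a true positive at the loosest threshold, which shows how averaging over several thresholds rewards accurate depth without zeroing out near misses.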

RayIoU addresses all three of the aforementioned problems:

Since each query ray only measures the distance to the first voxel it touches, the model cannot obtain a higher IoU by filling the area behind the surface.
RayIoU determines TP through the distance threshold, mitigating the overly harsh nature of voxel-level mIoU.
The query ray can originate from any position in the scene, thereby considering the model's completion ability and preventing the reduction of occupancy to depth estimation.

5 Experiments

We evaluate our model on the Occ3D-nus [48] dataset. Occ3D-nus is based on the nuScenes [3] dataset, which consists of large-scale multimodal data collected from 6 surround-view cameras, 1 LiDAR sensor, and 5 radar sensors. The dataset has 1000 videos and is split into 700/150/150 videos for training/validation/testing. Each video lasts roughly 20 s, and the key samples are annotated at intervals of 0.5 s.

We use the proposed RayIoU to evaluate the semantic segmentation performance. The query rays originate from 8 LiDAR positions along the ego path. We calculate RayIoU under three distance thresholds: 1, 2, and 4 meters. The final ranking metric is averaged over these distance thresholds.

5.1 Implementation Details

We implement our model using PyTorch [40]. Following previous methods, we adopt ResNet-50 [12] as the image backbone. The mask transformer consists of 3 layers with weights shared across layers. In our main experiments, we employ semantic queries, where each query corresponds to a semantic class rather than an instance. The ray casting module in RayIoU is implemented based on the codebase of [19].

Table 1: 3D occupancy prediction performance on Occ3D-nuScenes [48]. We use RayIoU to compare our SparseOcc with other methods. "8f" and "16f" mean fusing temporal information from 8 or 16 frames. SparseOcc outperforms all existing methods under a weaker setting.

Method	Backbone	Input Size	Epochs	RayIoU	RayIoU (1m)	RayIoU (2m)	RayIoU (4m)	FPS
BEVFormer [26]	R101		24	32.4	26.1	32.9	38.0	3.0
RenderOcc [39]	Swin-B		12	19.5	13.4	19.6	25.5	-
BEVDet-Occ [13]	R50		90	29.6	23.6	30.0	35.1	2.6
BEVDet-Occ-Long (8f)	R50		90	32.6	26.6	33.1	38.2	0.8
FB-Occ (16f) [27]	R50		90	33.5	26.7	34.1	39.7	10.3
SparseOcc (8f)	R50		24	34.0	28.0	34.7	39.4	17.3
SparseOcc (16f)	R50		24	35.1				12.5

Table 2: Sparse voxel decoder vs. dense voxel decoder. Our sparse voxel decoder achieves nearly [...] faster inference speed than the dense counterparts.

Voxel Decoder	RayIoU	RayIoU (1m)	RayIoU (2m)	RayIoU (4m)	FPS
Dense coarse-to-fine			30.4		6.3
Dense patch-based	25.8	20.4	26.0	30.9	7.8
Sparse coarse-to-fine		23.9		35.2

During training, we use the AdamW [35] optimizer with a global batch size of 8. The initial learning rate is decayed with a cosine annealing policy. For all experiments, we train our models for 24 epochs. FPS is measured on a Tesla A100 GPU with the PyTorch backend (batch size 1).
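As a rough illustration of the cosine annealing policy mentioned above, the following sketch computes a per-epoch learning rate. Only the 24-epoch training length comes from the text; `BASE_LR` and `MIN_LR` are hypothetical placeholder values, and whether the schedule is stepped per epoch or per iteration is an assumption.

```python
import math

EPOCHS = 24      # training length stated in the text
BASE_LR = 1e-3   # hypothetical placeholder, NOT the value used in the paper
MIN_LR = 0.0     # hypothetical floor of the schedule


def cosine_annealed_lr(epoch, total=EPOCHS, base_lr=BASE_LR, min_lr=MIN_LR):
    """Standard cosine annealing: decay from base_lr to min_lr over `total` epochs."""
    progress = min(max(epoch / total, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One learning-rate value per epoch boundary (0 .. EPOCHS).
schedule = [cosine_annealed_lr(e) for e in range(EPOCHS + 1)]
```

In PyTorch, the same shape of schedule is available through `torch.optim.lr_scheduler.CosineAnnealingLR`; the pure-Python version here is only meant to make the decay curve explicit.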

5.2 Main Results

In Tab. 1 and Fig. 1 (c), we compare SparseOcc with previous state-of-the-art methods on the validation split of Occ3D-nuScenes. Despite operating under a weaker setting in terms of backbone (ResNet-50 [12]), temporal context (8 history frames), and input image resolution, SparseOcc significantly outperforms previous methods, including FB-Occ, the winner of the CVPR 2023 occupancy challenge, which relies on many complicated designs such as forward-backward view transformation, a depth net, and joint depth and semantic pre-training. SparseOcc achieves better results (+1.6 RayIoU) while being faster and simpler than FB-Occ, which demonstrates the superiority of our solution.

We further provide qualitative results in Fig. 7. Both BEVDet-Occ and FB-Occ are dense methods and make many redundant predictions behind the surface. In contrast, SparseOcc discards over 90% of voxels while still effectively modeling the geometry of the scene and capturing fine-grained details.

5.3 Ablations

In this section, we conduct ablations on the validation split of Occ3D-nuScenes to confirm the effectiveness of each module. By default, we use the single frame version of SparseOcc as the baseline. The choice for our model is made bold.

Sparse voxel decoder vs. dense voxel decoder. In Tab. 2, we compare our sparse voxel decoder to its dense counterparts. We implement two baselines, both of which output a dense 3D feature map. The first baseline is a coarse-to-fine architecture without pruning empty voxels. In this baseline, we also replace self-attention with 3D convolution and use 3D deconvolution to upsample predictions. The other baseline is a patch-based architecture that divides the 3D space into a small number of patches, as PETRv2 [34] does for BEV segmentation. Each query corresponds to a specific patch, and a stack of deconvolution layers is used to lift the coarse queries to a full-resolution 3D volume.

Figure 7: Visualized comparison of semantic occupancy prediction. Despite discarding over 90% of voxels, our SparseOcc effectively models the geometry of the scene and captures fine-grained details (e.g., the yellow-marked traffic cone in the bottom row).

As we can see from the table, the dense coarse-to-fine baseline achieves a good performance of 29.9 RayIoU but with a slow inference speed of 6.3 FPS. The patch-based one is slightly faster at 7.8 FPS but suffers a severe performance drop of 4.1 RayIoU. In contrast, our sparse voxel decoder produces sparse 3D features only for the retained voxels, achieving an inference speed that is nearly 4× faster than the dense counterparts without compromising performance. This demonstrates the necessity and effectiveness of our sparse design.

Mask Transformer. In Tab. 3, we ablate the effectiveness of the mask transformer. The first row is a per-voxel baseline that directly predicts semantics from the sparse voxel decoder using a stack of MLPs. Introducing the mask transformer with vanilla dense cross attention (the common practice in MaskFormer and Mask3D) gives a performance boost of 1.7 RayIoU (27.0 to 28.7), but inevitably slows down inference. Therefore, to speed up the dense cross-attention pipeline, we adopt a sparse sampling mechanism, which brings a 50% reduction in inference time. Then, by further sampling sparse 3D points via predicted mask guidance, we finally achieve 29.2 RayIoU at 24 FPS.

Is a limited set of voxels sufficient to cover the scene? In this study, we delve deeper into the impact of voxel sparsity on final performance. To investigate this, we systematically ablate the value of $k$ in Fig. 8 (a). Starting from a modest value of 16000, we observe that the optimal performance

Table 3: Ablation of mask transformer (MT) and the cross attention module in MT. Mask-guided sparse sampling is stronger and faster than the dense cross attention.

MT	Cross Attention	RayIoU	RayIoU (1m)	RayIoU (2m)	RayIoU (4m)	FPS
-	-	27.0	20.3	27.5	33.1	
✓	Dense cross attention	28.7	22.9	29.3	33.8	16.2
✓	Sparse sampling	25.8	20.5	26.2	30.8	24.0
✓	+ Mask-guided	29.2				24.0

(a) Top-$k$
(b) Number of Frames

Figure 8: Ablations on voxel sparsity and temporal modeling. (a) The optimal performance occurs when $k$ is set to 32000 (about 5% sparsity). (b) The performance continues to increase with the number of frames, but it starts to saturate after 12 frames.

occurs when $k$ is set to 32000, which is only about 5% of the total number of dense voxels ($200 \times 200 \times 16 = 640000$). Surprisingly, further increasing $k$ does not yield any performance improvements; instead, it introduces noise. Thus, our findings suggest that a 5% sparsity level is sufficient, and retaining more voxels would be counterproductive.
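The arithmetic behind the 5% figure, and the kind of top-$k$ selection that this ablation varies, can be sketched as follows. The grid size 200 × 200 × 16 is the Occ3D-nuScenes voxel grid; the helper `top_k_indices`, the toy scores, and the assumption that pruning simply keeps the highest-scoring voxels are illustrative and not taken from the released code.

```python
import heapq
import random

GRID = (200, 200, 16)  # Occ3D-nuScenes voxel grid (X, Y, Z)
K = 32000              # number of voxels kept in the best-performing setting

total_voxels = GRID[0] * GRID[1] * GRID[2]  # 640000 dense voxels
kept_fraction = K / total_voxels            # fraction of voxels retained


def top_k_indices(scores, k):
    """Return the flat indices of the k highest-scoring voxels (hypothetical helper)."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Toy example: random occupancy scores on a 1000-voxel grid, keeping 5% of them.
random.seed(0)
toy_scores = [random.random() for _ in range(1000)]
kept = top_k_indices(toy_scores, 50)
```

Every voxel outside `kept` is dropped before later stages, which is why the cost of a sparse pipeline scales with $k$ rather than with the full grid size.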

Temporal modeling. In Fig. 8 (b), we validate the effectiveness of temporal fusion. We can see that the temporal modeling of SparseOcc is very effective, with performance steadily increasing as the number of frames increases. The performance peaks at 12 frames and then saturates.

5.4 Extensive Experiments

Panoptic occupancy. We then show that SparseOcc can be easily extended to panoptic occupancy prediction, a task derived from panoptic segmentation, which not only segments images into semantically meaningful regions but also detects and distinguishes individual instances. Compared to panoptic segmentation, panoptic occupancy prediction requires the model to be geometry-aware in order to construct the 3D scene for segmentation. By additionally introducing instance queries to the mask transformer, we seamlessly achieve the first panoptic occupancy prediction framework using camera inputs. In Fig. 9, we visualize the panoptic occupancy results of SparseOcc. As illustrated, semantic regions and individual objects, as well as their occupancy grids, are well predicted. For more details, please refer to the appendix.

Enhancing sparsity by removing the road surface. The majority of non-free occupancy data pertains to background geometry. In practice, the drivable surface occupancy can be effectively substituted with a High-Definition Map (HD Map) or online mapping techniques [5, 28, 50, 24]. This replacement not only increases sparsity but also enriches the semantic and structural understanding of roads. We conduct experiments to investigate the effect of removing the road surface in SparseOcc. The details can be found in the appendix.

Table 4: 3D occupancy prediction performance on Occ3D-nuScenes [48]. We use RayIoU to compare our SparseOcc with other methods. We also report the results on the old voxel-level mIoU metric. "8f" and "16f" mean fusing temporal information from 8 or 16 frames. SparseOcc outperforms all existing methods under a weaker setting.

Method	Backbone	Input Size	Epochs	Vis. Mask	RayIoU	mIoU	FPS
RenderOcc [39]	Swin-B		12	✓	19.5	24.4	-
BEVDet-Occ [13]	R50		90	✓	29.6	36.1	2.6
BEVDet-Occ-Long (8f)	R50		90	✓	32.6	39.3	0.8
BEVFormer [26]	R101		24	✓	32.4	39.3	3.0
FB-Occ (16f) [27]	R50		90	✓	33.5	39.1	10.3
BEVFormer [26]	R101		24	-	33.7	23.7	3.0
FB-Occ (16f) [27]	R50		90	-	35.6	27.9	10.3
SparseOcc (16f)	R50		24	-	35.1	30.6	12.5
SparseOcc (16f)	R50		48	-		30.9	

Table 5: To verify the effect of the visible mask, we provide the per-class RayIoU of BEVFormer and FB-Occ on the validation split of Occ3D-nuScenes. † uses the visible mask during training. We find that training with the visible mask hurts the performance of ground classes such as drivable surface, terrain, and sidewalk.

Method	Per-class RayIoU
BEVFormer	5.0	42.2	18.2	55.2	57.1	22.7	21.3	31.0	27.1	30.7	49.	58.4	30.4	29.4	31.7	36.3	26.5
BEVFormer	6.4	44.8	21.0	18.5	39.1	29.8
FB-Occ	35.6	10.5	44.	25.6	55.6	51.7	22.6	27.2	34.3	30.3	65.5	33.3	31.4	32.5	39.6	33.3
FB-Occ	140	291	44.4	22.4	21.5	19.5	39.3	31.1

A More Results

In this section, we present a more detailed comparison in Tab. 4, which includes performance metrics based on the traditional voxel-level mIoU. Our model, SparseOcc, does not utilize the visible mask during training. To ensure a fair comparison, we trained a variant of FB-Occ [27] that also does not use the visible mask during training. Despite operating under a weaker setting (48 vs. 90 epochs), our SparseOcc still surpasses FB-Occ in both accuracy and speed. Besides, FB-Occ employs many complicated designs, tricks, and data augmentations, while our SparseOcc is more concise and elegant.

Interestingly, we observed a peculiar phenomenon. Under the traditional voxel-level mIoU metric, methods can significantly benefit from disregarding the non-visible voxels during training. These non-visible voxels are indicated by a binary visible mask provided by the Occ3D-nuScenes dataset. However, this strategy actually impairs performance under our new RayIoU metric. For instance, we train two variants of BEVFormer [26]: one uses the visible mask during training, and the other does not. The former scores more than 15 points higher than the latter on the voxel-based mIoU (39.3 vs. 23.7), but it scores about 1 point lower on RayIoU (32.4 vs. 33.7). This phenomenon is also observed on FB-Occ.

To explore this phenomenon, we present the per-class RayIoU in Tab. 5. The table reveals that using the visible mask during training enhances performance for most foreground classes such as bus, bicycle, and truck. However, it negatively impacts background classes like drivable surface, terrain, and sidewalk.

This observation prompts a further question: why does performance deteriorate for background classes? To address this, we offer a visual comparison of depth errors and height maps of the predicted drivable surface from FB-Occ, both with and without the use of visible mask during training, in Fig. 10. The figure illustrates that training with visible mask results in a thicker and higher ground representation, leading to substantial depth errors in distant areas. Conversely, models trained without the visible mask predict depth with greater accuracy.

From these observations, we derive some valuable insights: ignoring non-visible voxels during training benefits foreground classes by resolving the issue of ambiguous labeling of unscanned voxels.

Figure 10: Why does the performance of background classes, such as drivable surfaces, deteriorate when using the visible mask during training? We provide a visualization of the drivable surface as predicted by FB-Occ [27]. Here, "FB w/ mask" and "FB wo/ mask" denote training with and without the visible mask, respectively. We observe that "FB w/ mask" tends to predict a higher and thicker road surface, resulting in significant depth errors along a ray. In contrast, "FB wo/ mask" predicts a road surface that is both accurate and consistent.

However, it also compromises the accuracy of depth estimation, as models tend to predict a thicker and closer surface. We hope that our findings will inform and benefit future research.

B Panoptic Occupancy Prediction

Thanks to the mask transformer, our SparseOcc can produce panoptic occupancy prediction by simply replacing the semantic queries with instance queries.

Ground Truth Preparation. To evaluate our method, we utilize the ground-truth object bounding boxes from the 3D detection task to generate the panoptic occupancy ground truth. First, we define eight instance categories (car, truck, construction vehicle, bus, trailer, motorcycle, bicycle, and pedestrian) and ten stuff categories (including terrain, manmade, vegetation, etc.). Next, we identify each instance segment by grouping the voxels inside each box, based on the existing semantic occupancy benchmark Occ3D-nuScenes [48].

However, we observe that using the original size of the box for grouping may cause some boundary voxels to be missed due to the compactness of the box, while enlarging the box (e.g., by 1.2×) leads to excessive overlap between boxes. To address these issues, we design a two-stage grouping scheme. In the first stage, we use the original size of the box for grouping. Then, each voxel that has not been assigned to a specific instance is assigned to its closest box. This scheme effectively resolves the problems of boundary omission and box overlap.
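The two-stage grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions: boxes are axis-aligned `(center, size)` tuples (the real detection boxes are oriented), "closest" is measured as Euclidean distance to the box center, and the input voxels are assumed to already belong to an instance category. The function names are hypothetical.

```python
from math import dist


def inside(point, box):
    """True if the point lies within an axis-aligned box given as (center, size)."""
    center, size = box
    return all(abs(p - c) <= s / 2 for p, c, s in zip(point, center, size))


def two_stage_grouping(voxels, boxes):
    """Return, for each voxel center, the index of the box (instance) it is grouped into."""
    assignment = []
    for v in voxels:
        # Stage 1: group by the original (unenlarged) box.
        hit = next((i for i, b in enumerate(boxes) if inside(v, b)), None)
        if hit is None:
            # Stage 2: leftover boundary voxels go to the closest box (by center distance).
            hit = min(range(len(boxes)), key=lambda i: dist(v, boxes[i][0]))
        assignment.append(hit)
    return assignment


# Two car-sized boxes and four voxel centers; the 2nd and 4th fall just outside a box.
boxes = [((0.0, 0.0, 0.0), (4.0, 2.0, 1.6)), ((10.0, 0.0, 0.0), (4.0, 2.0, 1.6))]
voxels = [(0.5, 0.2, 0.0), (2.2, 0.0, 0.0), (9.0, 0.5, 0.0), (7.6, 0.0, 0.0)]
groups = two_stage_grouping(voxels, boxes)
```

Because the second stage never enlarges a box, boundary voxels are recovered without two neighbouring boxes competing for voxels that lie inside one of them.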

Evaluation Metrics. We design RayPQ based on the well-known panoptic quality (PQ) [20] metric, which is defined as the product of segmentation quality (SQ) and recognition quality (RQ):

$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p, g) \in TP} \mathrm{IoU}(p, g)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\mathrm{RQ}},$$

where the definition of true positive (TP) is the same as that in RayIoU. The threshold of IoU between prediction $p$ and ground-truth $g$ is set to 0.5.
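As a small worked example, the standard PQ computation from [20] can be sketched in a few lines. The matching step that decides which prediction/ground-truth pairs count as true positives is abstracted away here: the function only receives the IoUs of already-matched pairs plus the false-positive and false-negative counts. The function name and the toy inputs are illustrative; the three per-threshold values are the RayPQ numbers reported for SparseOcc at 1, 2 and 4 meters.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from the IoUs of matched TP pairs and the FP/FN counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0                        # no segments at all
    sq = sum(matched_ious) / tp if tp else 0.0      # mean IoU over matched pairs
    rq = tp / denom                                 # F1-style recognition term
    return sq * rq, sq, rq


# Toy case: two matched segments, one spurious prediction, one missed ground truth.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)

# Averaging the per-threshold scores (1 m, 2 m, 4 m) gives the overall RayPQ.
ray_pq_per_threshold = {1: 10.2, 2: 14.5, 4: 17.6}
mean_ray_pq = sum(ray_pq_per_threshold.values()) / len(ray_pq_per_threshold)
```

The averaged value works out to about 14.1, matching the overall RayPQ stated in the results.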

Results. In Tab. 6, we report the performance of SparseOcc on the panoptic occupancy benchmark. Similar to RayIoU, we calculate RayPQ under three distance thresholds: 1, 2, and 4 meters. SparseOcc achieves an averaged RayPQ of 14.1. The visualizations are presented in the main paper (Fig. 9).

Table 6: Panoptic occupancy prediction performance on Occ3D-nuScenes [48].

Method	Backbone	Input Size	Epochs	RayPQ	RayPQ (1m)	RayPQ (2m)	RayPQ (4m)
SparseOcc	R50		24	14.1	10.2	14.5	17.6

Table 7: Experiments on enhancing sparsity by removing certain background categories. The RayIoU* metric is calculated only on categories that are not ignored. By enhancing sparsity, the inference speed of SparseOcc can be further improved with negligible performance loss.

Method	Backbone	Input Size	Epochs	Top-$k$	RayIoU*	FPS
SparseOcc	R50		24	32000		24.0
SparseOcc	R50		24	24000	29.8	24.8
SparseOcc	R50		24	16000	28.8
SparseOcc	R50		24	32000		24.0
SparseOcc	R50		24	24000	30.0	24.8
SparseOcc	R50		24	16000	29.4

C Enhancing Sparsity

As mentioned in the main paper, the majority of non-free occupancy data pertains to the background geometry, such as the road surface. In practice, the occupancy of road surface can be effectively substituted with High-Definition Map (HD Map) or online mapping techniques [5, 28, 50, 24]. Thus, the sparsity of the scene can be further enhanced by removing certain background categories, leading to faster inference speed with negligible performance loss. This is also an advantage of SparseOcc compared to the dense counterparts, because the dense methods will not speed up as the sparsity of the scene increases.

Settings. We train a variant of the model in which the voxels belonging to the drivable surface and terrain classes in the ground truth are treated as free during training (marked in Tab. 7). For a fair evaluation, all models are evaluated only on the categories that are not ignored.
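The label handling in this setting can be sketched as a simple remapping of the ground truth before training, plus a reduced class list at evaluation time. The class indices below (11 for drivable surface, 14 for terrain, 17 for free, 17 semantic classes in total) are assumptions made for this example, and the function name is hypothetical.

```python
NUM_SEMANTIC_CLASSES = 17  # assumed number of non-free classes
FREE = 17                  # assumed index of the "free" label
IGNORED = {11, 14}         # assumed indices of drivable surface and terrain


def treat_as_free(labels, ignored=IGNORED, free=FREE):
    """Relabel ignored background classes as free in a flat list of voxel labels."""
    return [free if label in ignored else label for label in labels]


# Training-time ground truth: the ignored classes disappear into free space.
labels = [4, 11, 14, 17, 13, 11]
remapped = treat_as_free(labels)

# Evaluation-time class list: RayIoU* only averages over the classes that remain.
eval_classes = [c for c in range(NUM_SEMANTIC_CLASSES) if c not in IGNORED]
```

With the ignored classes relabelled, fewer voxels are non-free, so a smaller top-$k$ budget can cover the remaining scene content.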

Results. As shown in Tab. 7, the performance of the baseline (modeling all categories) drops notably as top-$k$ decreases. This is reasonable, as the number of voxels is not enough to express the entire scene. In contrast, if we ignore certain background categories, the performance loss is negligible (only 0.7 RayIoU) even when top-$k$ is reduced by half. This means the inference speed of SparseOcc can be further improved by enhancing sparsity, which is not possible for the dense counterparts.

D Visualization of 3D Reconstruction

In Fig. 11, we visualize the reconstructed 3D geometry from sparse voxel decoder. SparseOcc can reconstruct fine-grained details from camera-only inputs.

Figure 11: Visualization of 3D reconstruction results from sparse voxel decoder.

*: Equal contribution. This work was done while Haisong Liu was an intern at Shanghai AI Lab.

: Corresponding author.
