
Unsupervised 3D Object Detection by Commonsense Clue

Journal: Transactions on Pattern Analysis and Machine Intelligence
Manuscript ID: TPAMI-2024-05-1261
Manuscript Type: Regular
Keywords: 3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition

Unsupervised 3D Object Detection by Commonsense Clue

Hai Wu, Shijia Zhao, Xun Huang, Qiming Xia, Chenglu Wen*, Senior Member, IEEE, Li Jiang, Xin Li, Senior Member, IEEE, and Cheng Wang, Senior Member, IEEE

Abstract

Traditional 3D object detectors, operating under fully, semi-, or weakly supervised paradigms, require extensive human annotations. In contrast, this paper endeavors to design an unsupervised 3D object detector that discerns object patterns automatically without relying on such annotations. To this end, we first introduce a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD initially constructs a Commonsense Prototype (CProto) to represent the geometry center and size of objects. Subsequently, CPD produces high-quality pseudo-labels and guides the detector convergence by the size and geometry priors from CProto. Based on CPD, we further propose CPD++, an enhanced version that boosts performance by motion clues. CPD++ learns localization from stationary objects and learns recognition from moving objects. It then facilitates the mutual transfer of localization and recognition knowledge between these two object types. Both CPD and CPD++ demonstrate superior performance over existing state-of-the-art unsupervised 3D detectors. Furthermore, by training CPD++ on WOD and testing on KITTI, CPD++ attains 89.25% 3D Average Precision (AP) on the moderate car class under a 0.5 IoU threshold. This result approximates 95.3% of the performance achieved by fully supervised counterparts, underscoring the advancement of our method.

Index Terms-3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition.

1 Introduction

AUTONOMOUS driving requires the reliable detection of 3D objects in urban scenes to ensure safe path planning and navigation. Numerous studies design 3D detection models through a fully supervised paradigm [4], [21], [43], [45], [46], [48]. These models heavily depend on human annotations from diverse scenes to guarantee their effectiveness across various scenarios. This data labeling process involves creating high-quality 3D bounding boxes for every object instance in each scene (see Fig. 1 (a)). It is laborious and time-consuming, limiting the broad deployment of detectors in practice [57]. Several studies have explored semi/weakly-supervised detection frameworks to mitigate the labeling need. For instance, semi-supervised methods [3], [39], [63] label a portion of the data. Weakly-supervised methods utilize efficient click annotations [24]. These methods noticeably decrease the labeling cost (see Fig. 1 (b)) and alleviate the labeling burden to a certain extent. Recently, inspired by the large-scale self-supervised pre-training of language models, several initiatives for self-supervised pre-training in 3D object detection have emerged [26], [52]. Self-supervised methods generally pre-train the models on large-scale unlabeled masked data (see Fig. 1 (c)) and then fine-tune the detector with human-provided labels. All the aforementioned methods require

Fig. 1. Different settings of 3D object detection. (a) Fully supervised methods annotate all objects in all scenes. (b) Semi/weakly supervised methods annotate a subset of data or use efficient click annotations. (c) Self-supervised methods annotate a subset of data for fine-tuning. (d) Unsupervised methods do not use human annotations.

supervision signals from human annotations for training.

However, objects in 3D data (e.g., vehicles in LiDAR point clouds) exhibit distinguishable geometry patterns. For example, these objects are typically situated on the ground with recognizable shapes, and their sizes remain consistent across different frames. Leveraging these patterns, the objects can be readily identified through certain commonsense reasoning. This insight enables the development of an unsupervised 3D object detector that operates without using human annotations (see Fig. 1 (d)). Note that, different from self-supervised pre-training, unsupervised 3D object detection can perform detection without fine-tuning, while self-supervised pre-training cannot.
This paper aims to design a modern unsupervised detector able to accurately detect 3D objects of interest in

Fig. 2. (a) Recognition challenge brought by similar shapes between foreground and background. (b) Localization challenge brought by a limited view of observation. (c) Stationary objects are well-captured in consecutive frames, which can generate accurate localization seed labels. (d) Moving objects produce displacement, which can generate precise recognition seed labels. (e) CPD key idea: integrate commonsense clues into a prototype to constrain the detector; CPD++ key idea: leverage the movability and stationarity of objects to enhance localization and recognition mutually.

autonomous driving scenarios, utilizing solely commonsense clues. We follow the recent fully/weakly supervised work [24], [43], where the 3D object of interest refers to mobile entities covering a wide range of traffic participants. Specifically, mobile objects, such as vehicles, pedestrians, and cyclists, are predominantly considered as foreground, while static structures, like buildings, roads, and light poles, are modeled as background. The term “detect” refers to the precise localization of a 3D bounding box and the accurate recognition of object categories.
Unfortunately, unsupervised 3D object detection presents exceptionally difficult challenges as follows: (1) Recognition challenge. Within the context of autonomous driving, point clouds often contain a mixture of objects. There are no geometric criteria with a clear decision boundary to segment the background and foreground objects. Numerous traditional methods leveraged ground removal [12] and clustering techniques [60] to differentiate the objects from their surroundings. However, background structures, such as buildings and fences, are inevitably misclassified as foreground objects (see Fig. 2 (a)). Recent sophisticated approaches build accurate seed labels through heuristics and identify common patterns by training a deep network iteratively. For example, MODEST [57] and DRIFT [23] formulate the seed labels by the Persistence Point Score (PPScore). Then, the pseudo-labels are utilized to train a regular detector. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs substantially. As of now, attaining unsupervised object recognition without any increase in labor costs remains a tough challenge. (2) Localization challenge. Almost all objects are self-occluded in the autonomous driving scenario. Besides, with the increase in scanning distance, the number of captured points decreases dramatically. Consequently, the objects in point clouds are mostly geometry-incomplete, posing substantial challenges for the accurate localization of 3D bounding boxes (see Fig. 2 (b)). Early methods leveraged a straightforward bounding box fitting algorithm [36] for object localization. However, these methods often struggle to deliver satisfactory performance due to the sparsity and occlusion of objects. Advanced methods generate initial pseudo-labels from fitted labels and use these to train deep networks

iteratively. They incorporate human knowledge to refine the position and size of the pseudo-labels iteratively. A subset of objects, denoted as complete objects, benefit from having at least one complete scan across the entire point cloud sequence, allowing their pseudo-labels to be refined through temporal consistency [59]. However, most objects, termed incomplete objects, lack full scan coverage and cannot be recovered by temporal consistency.
To tackle these issues, this paper first proposes a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection (see Fig. 2 (e)). CPD is built upon two key insights: (1) Objects of the same class have similar sizes and can be roughly classified by size differences. (2) Nearby stationary objects are complete in consecutive frames and can be localized accurately by shape and position measures. Our idea is to construct a Commonsense Prototype (CProto) set representing the geometry and size of objects to refine the pseudo-labels and constrain the detector training. To this end, we first design a Multi-Frame Clustering (MFC) method that yields multi-class pseudo-labels by size-based thresholding. Subsequently, we introduce an unsupervised Completeness and Size Similarity (CSS) score that selects high-quality labels to construct the CProto set. Furthermore, we design a CProto-constrained Box Regularization (CBR) that refines the sizes of pseudo-labels. In addition, we develop CProto-constrained Self-Training (CST) that improves the detection accuracy by the geometry knowledge from CProto.
Nevertheless, the size-based classification in CPD has limitations in differentiating foreground and background objects with similar shapes. In particular, the dimensions of incorrectly clustered fragments often closely resemble those of small foreground objects (e.g., pedestrians), leading to many false positive labels. Consequently, the detector erroneously assigns high confidence to small background objects. To tackle this issue, we further propose CPD++ (see Fig. 2 (e)), a stronger unsupervised 3D detector able to detect multi-class objects accurately. CPD++ is built upon three key insights: (1) Stationary objects are well-captured in consecutive frames, which allows for the creation of accurate localization seed labels (see Fig. 2 (c)). (2) Moving objects produce salient displacement, which can be leveraged to generate precise recognition seed labels (see

Fig. 3. Performance comparison. Both CPD and CPD++ achieve state-of-the-art unsupervised 3D object detection performance. CPD++ improves CPD by $2\times$ in terms of mAP.
Fig. 2 (d)). (3) The localization and recognition knowledge are complementary and mutually transferable. Our idea is to learn localization from stationary objects, learn recognition from moving objects, and transfer the knowledge between the two to boost detection performance. The core designs are a Motion-conditioned Label Refinement (MLR) module that produces high-quality pseudo-labels and a Motion-Weighted Training (MWT) scheme that jointly learns localization and recognition.
The effectiveness of our design is verified by experiments on the widely used Waymo Open Dataset (WOD) [37] and the KITTI dataset [8]. The main contributions include:
  • We propose CPD that integrates commonsense knowledge into prototypes for unsupervised 3D object detection. CPD adopts CProto-constrained Box Regularization (CBR) for accurate initial pseudo-label generation and CProto-constrained Self-Training (CST) for unsupervised learning.
  • We propose CPD++ that takes a step toward more accurately detecting multi-class 3D objects by incorporating motion clues. This is enabled by the designed Motion-conditioned Label Refinement (MLR) and a Motion-Weighted Training (MWT) scheme.
  • Both CPD and CPD++ outperform state-of-the-art unsupervised 3D detectors on the WOD and KITTI datasets. Notably, CPD++ enhances CPD by $2\times$ mAP (see Fig. 3). When CPD++ is trained on the WOD and tested on KITTI, it attains 89.25% moderate car 3D AP under a 0.5 IoU threshold. This performance matches 95.3% of its fully supervised counterpart, highlighting the advance of our work.

2 Related Work

2.1 Fully/Semi/Weakly-Supervised 3D Object Detection

Recent fully-supervised 3D detectors build single-stage [10], [13], [38], [56], [65], [66], two-stage [4], [31], [32], [33], [45], [46], [48], [53] or multiple stage [2], [43] deep networks for 3D object detection. Fully supervised 3D object detection methods have achieved significant performance, but the annotation cost is often unacceptable. To reduce the annotation cost, semi-supervised and weakly-supervised 3D object detection methods have gained widespread attention. Semi-supervised methods [18], [28], [39], [50], [54], [63] usually only annotate a portion of the scenarios and then apply teacher-student networks to generate pseudo-labels for unannotated scenarios. For example, the 3DIoUMatch [39] laid the groundwork in the domain of outdoor scenes

by pioneering the estimation of 3D Intersection over Union (IoU) as a metric for localization. Weakly supervised methods [25], [41], [58] reduce the annotation cost from the perspective of the annotation form. For example, WS3D [25] proposed point-level labels, which generate box-level pseudo-labels from instance-level click labels. Some other weakly-supervised methods [19], [49] only annotate a single instance per scene to reduce the annotation cost. Different from all of these methods, we aim to design a modern unsupervised detector that is able to accurately detect 3D objects without using any human-level annotations.

2.2 Self-supervised/Unsupervised 3D Detectors

Previous unsupervised pre-training methods discern latent patterns within unlabeled data by masked labels [52] or contrastive losses [17], [55]. But these methods require human labels for fine-tuning [5], [20]. Traditional methods [1], [30], [36] employ ground removal and clustering for 3D object detection without human labels, but suffer from poor detection performance. Some deep learning-based methods generate pseudo-labels by multiple traversals and use the pseudo-labels to train a 3D detector iteratively [23], [57]. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs remarkably. The recent OYSTER [59] generates labels from a single traversal, improves recognition by ignoring short tracklets, and improves localization by temporal consistency. However, most bounding boxes of incomplete objects cannot be recovered by temporal consistency. Besides, many small background objects also generate long tracklets, decreasing recognition precision. Our CPD addresses the localization problem by leveraging the geometry prior from CProto to refine the pseudo-labels and guide the network convergence. Our CPD++ addresses the recognition problem by learning the discriminative features of moving objects.

2.3 Prototype-based Methods

Prototype-based methods are widely used in 2D object detection [14], [15], [22], [42], [62] when novel classes are incorporated. Inspired by these methods, Prototypical VoteNet [64] constructs geometric prototypes learned from base classes for few-shot 3D object detection. GPA-3D [16] and CL3D [29] build geometric prototypes from a source-domain model for domain adaptive 3D detection. However, both learning from base classes and training on the source domain require high-quality annotations. Unlike these, we construct CProto using commonsense knowledge and detect 3D objects in a zero-shot manner without human-level annotations.

3 Problem Formulation

Given a set of input points $\boldsymbol{x}$, 3D object detection aims to design a function $\mathcal{F}$ that produces optimal detections $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})$, where each detection consists of $[x, y, z, l, w, h, \alpha, \beta]$, representing the position, width, length, height, azimuth angle, and class identity of an object, respectively.


Fig. 4. CPD framework. (a) Initial pseudo-labels are generated by multi-frame clustering. (b) The commonsense prototype (CProto) is constructed from high-quality pseudo-labels based on CSS score. The low-quality labels are further refined by the shape prior from CProto. (c) A prototype network fed with dense points from CProto produces high-quality features to guide the detection network convergence.

3.1 Fully Supervised 3D Object Detection

Since $\mathcal{F}$ cannot be obtained directly by some handcrafted rule, regular methods formulate the 3D object detection problem as a fully supervised learning problem. It requires a large-scale dataset with human labels. The dataset is conventionally divided into training $(\boldsymbol{x}^{\text{train}}, \boldsymbol{y}^{\text{train}})$, validation $(\boldsymbol{x}^{\text{val}}, \boldsymbol{y}^{\text{val}})$, and testing $(\boldsymbol{x}^{\text{test}}, \boldsymbol{y}^{\text{test}})$ datasets. During the design process, a detection model with learnable parameters $\theta$ and a loss function $\mathcal{L}^{\text{full}}$ are optimized from the training dataset:
$$\mathcal{F}_{\theta^{*}}=\underset{\theta}{\arg \min }\left(\mathcal{L}^{\text{full}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \boldsymbol{y}^{\text{train}}\right)\right)$$
The detection model $\mathcal{F}_{\theta^{*}}$ generally consists of some unlearnable hyperparameters $\eta$, which are optimized by the detection metrics $\mathcal{A}$ from the validation dataset:
$$\mathcal{F}_{\theta^{*}}^{\eta^{*}}=\underset{\eta}{\arg \max }\left(\mathcal{A}\left(\mathcal{F}_{\theta^{*}}^{\eta}\left(\boldsymbol{x}^{\text{val}}\right), \boldsymbol{y}^{\text{val}}\right)\right).$$
The testing dataset is only used to compare the performance with other methods and can’t be used to tune parameters.

3.2 Unsupervised 3D Object Detection

To decrease the labeling cost, unsupervised settings do not require human labels $\boldsymbol{y}^{\text{train}}$ of the training dataset. The labels $\boldsymbol{y}^{\text{val}}$ and $\boldsymbol{y}^{\text{test}}$ are required to choose the best hyperparameters and to compare with other methods, respectively. This paper solves the unsupervised problem with a pseudo-label-based framework. The main challenge lies in how to design a pseudo-label generation function $\mathcal{G}$ and a pseudo-label-based training target $\mathcal{L}^{\text{un}}$ to optimize the detector:
$$\mathcal{F}_{\theta^{*}}=\underset{\theta}{\arg \min }\left(\mathcal{L}^{\text{un}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \mathcal{G}\left(\boldsymbol{x}^{\text{train}}\right)\right)\right)$$
This paper leverages human commonsense knowledge to formulate the label generation function $\mathcal{G}$ and the training target $\mathcal{L}^{\text{un}}$. The commonsense knowledge is formulated as observations or insights. For example, stationary objects share the same position, size, and class across frames; background objects, such as buildings and trees, do not move. We ensure all these intuitions are shared among different datasets, easy to obtain, and not sensitive to a specific value.
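To make the formulation concrete, the following is a minimal PyTorch-style sketch of one pseudo-label-based training step; `detector`, `label_fn` (standing in for $\mathcal{G}$), and `loss_fn` (standing in for $\mathcal{L}^{\text{un}}$) are hypothetical callables, not the paper's implementation.

```python
import torch

def unsupervised_training_step(detector, optimizer, x_train, label_fn, loss_fn):
    """One pseudo-label-based training step: the commonsense label generator
    label_fn plays the role of G and replaces human annotations, and the
    detector F_theta is optimized against its output with the target L^un."""
    with torch.no_grad():
        pseudo_labels = label_fn(x_train)       # G(x_train), no human labels
    predictions = detector(x_train)             # F_theta(x_train)
    loss = loss_fn(predictions, pseudo_labels)  # L^un(F_theta(x), G(x))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```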

4 CPD: Commonsense Prototype for Unsupervised 3D Object Detection

This section introduces the Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. As shown in Fig. 4, CPD consists of three main parts: (1) initial label generation, (2) label refinement, and (3) self-training. CPD formulates the pseudo-label generation function $\mathcal{G}$ by Multi-Frame Clustering (MFC) and CProto-constrained Box Regularization (CBR), and formulates the training target $\mathcal{L}^{\text{un}}$ by the CSS-weighted detection loss and the geometry contrast loss. We detail these designs as follows.

4.1 Initial Label Generation

Recent unsupervised methods [57], [59] detect 3D objects in a class-agnostic way. How to classify objects (e.g., vehicles and pedestrians) without annotation is still an unsolved challenge. Our observations indicate that the sizes of different classes are significantly different. Therefore, the dense objects in consecutive frames (see Fig. 5 (b)) can be roughly classified by size thresholds. This motivates us to design a Multi-Frame Clustering (MFC) method to generate initial labels. MFC involves motion artifact removal, clustering, and post-processing.

Fig. 5. (a) The moving objects are removed by PPScore. (b) Nearby stationary objects have high completeness in consecutive frames. (c) Length absolute error with different numbers of frames. (d) Pseudo-labels of the complete object $\mathcal{T}$ are refined by temporal consistency. (e) Pseudo-labels of the incomplete object $\mathcal{J}$ fail to be refined by temporal consistency. (f) Many objects lack full scan coverage and generate inaccurate pseudo-labels. (g) Multi-level occupancy score. (h) Mean size error of initial labels.

4.1.1 Motion Artifact Removal (MAR)

Directly transforming and concatenating $2n+1$ consecutive frames $\{\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{n}\}$ (i.e., the past $n$, the future $n$, and the current frame) into a single point cloud $\boldsymbol{x}_{0}^{*}$ introduces motion artifacts from moving objects, leading to increased label errors as $n$ grows (see Fig. 5 (c)). To mitigate this issue, we first transform the consecutive frames into a global coordinate system and calculate the Persistence Point Score (PPScore) [57] across consecutive frames to identify the points in motion (see Fig. 5 (a)). We keep all the points from $\boldsymbol{x}_{0}$ and remove moving points from the other frames $\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{-1}, \boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}$. After this removal, we concatenate the frames to obtain dense points $\boldsymbol{x}_{0}^{*}$.
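A minimal sketch of this step is shown below, assuming the frames are already in a common global coordinate system and that a per-point PPScore in $[0,1]$ has been computed for each neighboring frame; the 0.7 threshold is illustrative rather than the paper's exact setting.

```python
import numpy as np

def concat_frames_without_motion(current, neighbors, neighbor_ppscores, thresh=0.7):
    """MAR sketch: keep every point of the current frame x_0 and drop points
    flagged as moving (low PPScore) from the 2n neighboring frames before
    concatenating them into the dense point cloud x_0*."""
    kept = [current]                             # all current-frame points
    for pts, scores in zip(neighbors, neighbor_ppscores):
        kept.append(pts[scores >= thresh])       # static points only
    return np.concatenate(kept, axis=0)
```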

4.1.2 Clustering and Post-processing

In line with a recent study [59], we apply ground removal [12], DBSCAN [6], and bounding box fitting [61] on $\boldsymbol{x}_{0}^{*}$ to obtain a set of class-agnostic bounding boxes $\hat{\boldsymbol{b}}$. We then perform a tracking process to associate the boxes across frames. Since objects of the same class typically have similar sizes in 3D space, we calculate class-specific size thresholds to classify $\hat{\boldsymbol{b}}$ into different categories. The size thresholds are calculated based on two observations: (1) all moving objects are foreground; (2) the size ratios of different classes are different (the vehicle is around 2:1:1, the pedestrian is around 1:1:2, and the cyclist is around 2:1:2). Therefore, for each point cloud sequence, one trajectory per class with the highest mean speed and the nearest size ratio to that class is automatically chosen. The size range of these tracklets defines the size thresholds for classification. This process results in a set of initial pseudo-labels $\boldsymbol{b}=\{b_{j}\}_{j}$, where $b_{j}=[x, y, z, l, w, h, \alpha, \beta, \tau]$ represents the position, width, length, height, azimuth angle, class identity, and tracking identity, respectively.
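The clustering and size-ratio-based class assignment can be sketched as follows, using scikit-learn's DBSCAN, an axis-aligned box fit as a stand-in for the L-shape fitting of [61], and the ratio templates listed above; the `eps`/`min_samples` values and the orientation-agnostic size estimate are simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

SIZE_RATIOS = {"vehicle": (2, 1, 1), "pedestrian": (1, 1, 2), "cyclist": (2, 1, 2)}

def cluster_and_classify(nonground_points, eps=0.7, min_samples=5):
    """Cluster ground-removed points, fit a crude box per cluster, and guess
    the class from the nearest normalized size ratio (ignoring orientation)."""
    ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(nonground_points[:, :3])
    boxes = []
    for cid in set(ids) - {-1}:                    # -1 marks DBSCAN noise
        pts = nonground_points[ids == cid, :3]
        lwh = pts.max(axis=0) - pts.min(axis=0)    # axis-aligned size estimate
        q = lwh / lwh.sum()
        diffs = {c: np.abs(q - np.array(r) / sum(r)).sum() for c, r in SIZE_RATIOS.items()}
        boxes.append({"center": pts.mean(axis=0), "lwh": lwh,
                      "cls": min(diffs, key=diffs.get)})
    return boxes
```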

4.2 CProto-constrained Box Regularization for Label Refinement

As presented in Fig. 5 (d)-(f), initial labels for incomplete objects often suffer from inaccuracies in sizes and positions. To tackle this issue, we introduce the CProto-constrained Box Regularization (CBR) method. The key idea is to construct a high-quality CProto set based on unsupervised scoring from

complete objects to refine the pseudo-labels of incomplete objects. Unlike OYSTER [59], which can only refine the pseudo-labels of objects with at least one complete scan, our CBR can refine pseudo-labels of all objects, significantly decreasing the overall size and position errors.

4.2.1 Completeness and Size Similarity (CSS) Scoring

Existing label scoring methods such as IoU scoring [31] are designed for fully supervised detectors. In contrast, we introduce an unsupervised Completeness and Size Similarity scoring (CSS) method. It aims to approximate the IoU score using commonsense knowledge alone.
Distance score. CSS first assesses the object completeness based on distance, assuming labels closer to the ego vehicle are likely to be more accurate. For an initial label $b_{j}$, we normalize its distance to the ego vehicle within the range $[0,1]$ to compute the distance score as
$$\psi^{1}\left(b_{j}\right)=1-\mathcal{N}\left(\left\|c_{j}\right\|\right)$$
where $\mathcal{N}$ is the normalization function and $c_{j}$ is the location of $b_{j}$. However, this distance-based approach has its limitations. For example, occluded objects near the ego vehicle, which should receive lower scores, are inadvertently assigned high scores due to their proximity. To mitigate this issue, we introduce a Multi-Level Occupancy (MLO) score, further detailed in Fig. 5 (g).
MLO score. Considering the diverse sizes of objects, we divide the bounding box of the initial label into multiple grids with different length and width resolutions. The MLO score is then calculated by determining the proportion of grids occupied by cluster points, via
$$\psi^{2}\left(b_{j}\right)=\frac{1}{N^{o}} \sum_{k} \frac{O^{k}}{\left(r^{k}\right)^{2}}$$
where $N^{o}$ denotes the number of resolutions, $O^{k}$ is the number of occupied grids under the $k$-th resolution, and $r^{k}$ is the grid number of the $k$-th resolution.
Size Similarity (SS) score. While the distance and MLO scores effectively evaluate the localization and size quality, they fall short in assessing classification quality. To bridge this gap, we introduce the SS score. We observe that the size ratios of different classes are different (the vehicle is around 2:1:1, pedestrian is around 1:1:2, and cyclist is around 2:1:2),

so we calculate the score by measuring the ratio difference. Specifically, we pre-define a set of size templates (simple size ratio or statistics from Wikipedia) for each class. Then, we calculate the size score by a truncated KL divergence [9].
$$\psi^{3}\left(b_{j}\right)=1-\min \left(0.05, \sum_{\sigma} q_{\sigma}^{b} \log \left(\frac{q_{\sigma}^{b}}{q_{\sigma}^{a}}\right)\right) / 0.05$$
where $q_{\sigma}^{a} \in\{l^{a}, w^{a}, h^{a}\}$ and $q_{\sigma}^{b} \in\{l^{b}, w^{b}, h^{b}\}$ refer to the normalized length, width, and height of the template and the label, respectively.
We linearly combine the three metrics, $\mathcal{S}(b_{j})=\sum_{i} \omega^{i} \psi^{i}(b_{j})$, to produce the final score, where $\omega^{i}$ is a weighting factor (in this study we adopt a simple average, $\omega^{i}=1/3$). For each $b_{j} \in \boldsymbol{b}$, we compute its CSS score $s_{j}^{\text{css}}=\mathcal{S}(b_{j})$ and obtain a set of scores $\boldsymbol{s}=\{s_{j}^{\text{css}}\}_{j}$.
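The three sub-scores and their average can be sketched as follows, assuming each box's inside points are given in a box-local frame; the normalization range, grid resolutions, and template ratios are illustrative rather than the paper's exact settings.

```python
import numpy as np

SIZE_TEMPLATES = {"vehicle": (2.0, 1.0, 1.0),     # (l, w, h) ratio templates
                  "pedestrian": (1.0, 1.0, 2.0),
                  "cyclist": (2.0, 1.0, 2.0)}

def distance_score(center, max_range=80.0):
    # psi^1: normalized distance to the ego vehicle; closer boxes score higher
    return 1.0 - min(np.linalg.norm(center) / max_range, 1.0)

def mlo_score(local_pts_xy, box_lw, resolutions=(2, 4, 8)):
    # psi^2: fraction of occupied grid cells, averaged over several resolutions
    l, w = box_lw
    scores = []
    for r in resolutions:
        gx = np.clip(((local_pts_xy[:, 0] / l + 0.5) * r).astype(int), 0, r - 1)
        gy = np.clip(((local_pts_xy[:, 1] / w + 0.5) * r).astype(int), 0, r - 1)
        scores.append(len(set(zip(gx.tolist(), gy.tolist()))) / r ** 2)
    return float(np.mean(scores))

def size_similarity_score(box_lwh, cls, trunc=0.05):
    # psi^3: truncated KL divergence between normalized size ratios
    q_b = np.asarray(box_lwh, dtype=float) / np.sum(box_lwh)
    q_a = np.asarray(SIZE_TEMPLATES[cls]) / np.sum(SIZE_TEMPLATES[cls])
    kl = float(np.sum(q_b * np.log(q_b / q_a)))
    return 1.0 - min(trunc, kl) / trunc

def css_score(center, local_pts_xy, box_lwh, cls):
    # simple average of the three sub-scores (omega_i = 1/3)
    return (distance_score(center) + mlo_score(local_pts_xy, box_lwh[:2])
            + size_similarity_score(box_lwh, cls)) / 3.0
```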

4.2.2 CProto Set Construction

Regular learnable prototype-based methods require annotations [16], [64], which are unavailable in the unsupervised problem. We construct a high-quality CProto set $\boldsymbol{P}=\{P_{k}\}_{k}$, representing geometry and size centers, based on the CSS score. Here, $P_{k}=\{x_{k}^{p}, b_{k}^{p}\}$, where $x_{k}^{p}$ indicates the inside points and $b_{k}^{p}$ refers to the bounding box. Specifically, we first categorize the initial labels $\boldsymbol{b}$ into different groups based on their tracking identity $\tau$. Within each group, we select the high-quality boxes and inside points that meet a high CSS score threshold $\eta$ (determined on the validation set, 0.8 in this study). Then, we transform all points and boxes into a local coordinate system, and obtain $b_{k}^{p}$ by averaging the boxes and $x_{k}^{p}$ by concatenating all the points.
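A compact sketch of this construction is given below, assuming each initial label is a dict carrying its size, class, and tracking identity, and that its inside points are already expressed in the object's local frame; the field names are illustrative.

```python
import numpy as np
from collections import defaultdict

def build_cproto_set(labels, css_scores, local_points, eta=0.8):
    """Group initial labels by tracking identity, keep members whose CSS score
    exceeds eta, then average the kept boxes and pool their local-frame points."""
    groups = defaultdict(list)
    for label, score, pts in zip(labels, css_scores, local_points):
        if score > eta:
            groups[label["track_id"]].append((label, pts))
    cprotos = []
    for track_id, members in groups.items():
        sizes = np.stack([[m[0]["l"], m[0]["w"], m[0]["h"]] for m in members])
        cprotos.append({"track_id": track_id,
                        "cls": members[0][0]["cls"],
                        "box_lwh": sizes.mean(axis=0),                       # averaged box size
                        "points": np.concatenate([m[1] for m in members])})  # pooled dense points
    return cprotos
```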

4.2.3 Box Regularization

We next regularize the initial labels by the size prior from CProto. Intuitively, the height of an initial label is relatively accurate compared to its length and width, because the tops of objects are generally not occluded. The statistics on the WOD validation set [37] confirm this intuition (see Fig. 5 (h)). Besides, intra-class 3D objects with the same height have similar lengths and widths. Therefore, for an initial label $b_{j}$ and a CProto $P_{k}$ with the same class identity, we perform an association by the minimum difference in box height. The initial pseudo-labels associated with the same $P_{k}$, having the same class identity, similar length, and similar width, are naturally classified into the same group. We then perform re-sizing and re-localization for each group to refine the pseudo-labels. (1) Re-size. We directly replace the size of $b_{j}$ with the length, width, and height of $b_{k}^{p} \in P_{k}$. (2) Re-location. Since points mainly lie on the object's surface and boundary, we divide the object into different bins and align the box boundary and orientation to the boundary point of the densest part (see Fig. 6). Finally, we obtain improved pseudo-labels $\boldsymbol{b}^{*}=\{b_{j}^{*}\}_{j}$.
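Under the same illustrative label/CProto layout as above, the re-size step can be sketched as follows (the re-location step, which snaps the box boundary and orientation to the densest bin, is omitted for brevity).

```python
def resize_labels(labels, cprotos):
    """CBR re-size sketch: associate each initial label with the same-class
    CProto of minimum height difference and copy that CProto's l, w, h."""
    for label in labels:
        candidates = [p for p in cprotos if p["cls"] == label["cls"]]
        if not candidates:
            continue  # no prototype of this class yet; keep the fitted size
        best = min(candidates, key=lambda p: abs(p["box_lwh"][2] - label["h"]))
        label["l"], label["w"], label["h"] = best["box_lwh"]
    return labels
```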

4.3 CProto-constrained Self-Training (CST)

Recent methods [57], [59] utilize pseudo-labels for training 3D detectors. However, even after refinement, some pseudo-labels remain inaccurate, diminishing the effectiveness of correct supervision and potentially misleading the training process. To tackle these issues, we propose two designs: (1) CSS-Weighted Detection Loss assigns different training

Fig. 6. The size of the initial label is replaced by the CProto box, and the position is also corrected.

weights based on label quality to suppress false supervision signals. (2) Geometry Contrast Loss aligns predictions of sparsely scanned points with the dense CProto, thereby improving feature consistency.

4.3.1 Network Architecture

We adopt a dense-sparse alignment architecture (Fig. 4 (c)), consisting of a prototype network $\mathcal{F}^{\text{pro}}$ and a detection network $\mathcal{F}^{\text{det}}$, both constructed from the two-stage CenterPoint [56]. During training, for each $b_{j}^{*}$, we add its corresponding points $x_{k}^{p}$ from CProto $P_{k}$ to the scene to obtain a dense point cloud $\boldsymbol{x}^{\text{pro}}$. We feed $\boldsymbol{x}^{\text{pro}}$ to $\mathcal{F}^{\text{pro}}$ to produce relatively good features and detections. We then feed randomly downsampled points $\boldsymbol{x}^{\text{det}}$ to $\mathcal{F}^{\text{det}}$. We align the features and detections from the two branches by the detection loss and the contrast loss. During testing, we feed points without downsampling to the detection network $\mathcal{F}^{\text{det}}$ to perform detection.
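The data flow of this dense-sparse alignment can be sketched as follows, assuming each branch takes a point tensor and returns (features, boxes) and that `downsample` is any random point-dropping augmentation; this is a schematic only, not the actual CenterPoint interface.

```python
import torch

def cst_forward(pro_net, det_net, scene_points, cproto_points, downsample):
    """Dense-sparse alignment pass: the prototype branch sees the scene
    densified with CProto points, the detection branch sees a randomly
    downsampled copy; the two outputs are later aligned by the detection
    and contrast losses."""
    x_pro = torch.cat([scene_points, cproto_points], dim=0)  # densified input
    x_det = downsample(scene_points)                          # sparse input
    feats_pro, boxes_pro = pro_net(x_pro)
    feats_det, boxes_det = det_net(x_det)
    return (feats_pro, boxes_pro), (feats_det, boxes_det)
```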

4.3.2 Training Loss

CSS Weight. Considering that false pseudo-labels may mislead the network convergence, we first calculate a loss weight based on label quality. Formally, we convert the CSS score $s_{i}^{\text{css}}$ of a pseudo-label to
$$\omega_{i}= \begin{cases}0 & s_{i}^{\text{css}}<S_{L} \\ \frac{s_{i}^{\text{css}}-S_{L}}{S_{H}-S_{L}} & S_{L}<s_{i}^{\text{css}}<S_{H} \\ 1 & s_{i}^{\text{css}}>S_{H}\end{cases}$$
where $S_{H}$ and $S_{L}$ are the high- and low-quality thresholds (we empirically set them to 0.7 and 0.4, respectively).
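This piecewise mapping can be written directly as a small helper; the default thresholds below are the values reported above.

```python
def css_weight(css_score, s_low=0.4, s_high=0.7):
    # Piecewise-linear weight: 0 below S_L, 1 above S_H, linear ramp in between.
    if css_score < s_low:
        return 0.0
    if css_score > s_high:
        return 1.0
    return (css_score - s_low) / (s_high - s_low)
```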
CSS-weighted Detection Loss. To decrease the influence of false labels, we formulate the CSS-weighted detection loss to refine $N$ proposals:
$$\mathcal{L}_{det}^{\text{css}}=\frac{1}{N} \sum_{i} \omega_{i}\left(\mathcal{L}_{i}^{\text{pro}}+\mathcal{L}_{i}^{\text{det}}\right)$$
where $\mathcal{L}_{i}^{\text{pro}}$ and $\mathcal{L}_{i}^{\text{det}}$ are detection losses [4] of $\mathcal{F}^{\text{pro}}$ and $\mathcal{F}^{\text{det}}$, respectively. The losses are calculated from the pseudo-labels $\boldsymbol{b}^{*}$ and the network predictions.
Geometry Contrast Loss. We formulate two contrast losses that minimize the feature and predicted-box differences between the prototype and detection networks. (1) Feature contrast loss. For a foreground RoI $r_{i}$ from the detection network, we extract features $\boldsymbol{f}_{i}^{p}$ from the prototype network by voxel set abstraction [4] and features $\boldsymbol{f}_{i}^{d}$ from the detection network. We then formulate the contrast loss by cosine distance:
$$\mathcal{L}_{feat}^{\text{css}}=-\frac{1}{N^{f}} \sum_{i} \omega_{i} \frac{\boldsymbol{f}_{i}^{d} \cdot \boldsymbol{f}_{i}^{p}}{\left\|\boldsymbol{f}_{i}^{d}\right\|\left\|\boldsymbol{f}_{i}^{p}\right\|}$$
Fig. 7. (a) A baseline based on Vanilla Iterative Training (VaIT). (b) Performance improvement of VaIT using different score thresholds. (c) True Positive (TP) and False Positive (FP) statistics of pseudo-labels. (d) Examples of TP and FP. (e) Position and size errors of pseudo-labels.

where $N^{f}$ is the number of foreground proposals. (2) Box contrast loss. For a box prediction $d_{i}^{p}$ from the prototype network and a box prediction $d_{i}^{d}$ from the detection network, we formulate the box contrast loss by IoU, location difference, and angle difference:
$$\mathcal{L}_{box}^{\text{css}}=\frac{1}{N^{f}} \sum_{i} \omega_{i}\left[1-I\left(d_{i}^{d}, d_{i}^{p}\right)+\left\|c_{i}^{d}-c_{i}^{p}\right\|+\left|\sin \left(\alpha_{i}^{d}-\alpha_{i}^{p}\right)\right|\right]$$
where $I$ denotes the IoU function; $c_{i}^{d}, \alpha_{i}^{d}$ refer to the position and angle of $d_{i}^{d}$, and $c_{i}^{p}, \alpha_{i}^{p}$ refer to the position and angle of $d_{i}^{p}$. We finally sum all losses to train the detector.
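The two contrast terms can be sketched in PyTorch as follows, assuming aligned foreground proposals from the two branches, boxes laid out as [x, y, z, l, w, h, alpha, ...], and a user-supplied rotated-IoU function; `iou_fn` is a placeholder, not a specific library call.

```python
import torch
import torch.nn.functional as F

def geometry_contrast_loss(feat_det, feat_pro, box_det, box_pro, weights, iou_fn):
    """CSS-weighted feature and box contrast between detection and prototype
    branches: negative cosine similarity for features; (1 - IoU) + center
    distance + |sin(angle difference)| for boxes."""
    n_f = max(feat_det.shape[0], 1)
    w = weights / n_f
    feat_loss = -(w * F.cosine_similarity(feat_det, feat_pro, dim=1)).sum()
    iou = iou_fn(box_det, box_pro)                              # (N,) rotated IoU
    center = (box_det[:, :3] - box_pro[:, :3]).norm(dim=1)
    angle = torch.sin(box_det[:, 6] - box_pro[:, 6]).abs()
    box_loss = (w * (1.0 - iou + center + angle)).sum()
    return feat_loss + box_loss
```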

5 CPD++: Enhancing CPD by Motion Clue

Although the proposed CPD framework achieves promising performance on large-scale vehicle objects, it suffers from low detection accuracy on small objects. We propose CPD++ for more accurate multi-class unsupervised 3D object detection. CPD++ formulates the pseudo-label generation function $\mathcal{G}$ by a Motion-conditioned Label Refinement (MLR) and formulates the training target $\mathcal{L}^{\text{un}}$ by a motion-weighted detection loss.

5.1 Pilot Experiments and Discussion

Prevalent methods apply iterative training to further boost the detection performance [57], [59]. By leveraging the generalization capabilities inherent in deep networks, iterative training can identify more object patterns. However, false labels with inaccurate classes and bounding boxes may mislead the training of the next iteration. Recent works apply score-based filtering [51] or temporal-based filtering [59] to diminish the false supervision signals. By employing these techniques, it is possible to further improve the multi-class performance of CPD. We begin our design by first constructing a strong baseline of the Vanilla Iterative Training (VaIT) method. As presented in Fig. 7 (a), VaIT utilizes the detection results from CPD as initial labels and improves the label quality by tracking, temporal filtering, and score thresholding. By default, VaIT applies Kalman filtering to perform tracking [44] and ignores short tracklets with fewer than six detections [59]. We conduct pilot experiments on the WOD and

train the VaIT for two iterations similar to OYSTER [59]. We present the detection performance improvement in Fig. 7 (b) (1) to (6), where different score thresholds ranging from 0.1 to 0.6 are used.
However, we observe that VaIT marginally enhances or even decreases the precision of pedestrian and cyclist detection no matter what score threshold is chosen. To delve into this issue, we present the statistics of True Positive (TP) and False Positive (FP) labels on the WOD validation set in Fig. 7 (c) (1). The statistics show that TP labels of pedestrians and cyclists only account for 25% and 24% respectively, while the FP labels account for 75% and 76% respectively. The main reason is that the threshold-based classification in CPD cannot distinguish foreground and background objects with similar sizes (see Fig. 7 (d)), leading to many false labels. As illustrated in Fig. 7 (d) (2) (3), the sizes of background objects generally resemble those of small foreground objects (Fig. 7 (d) (1)). Consequently, the network trained with the false labels tends to predict high scores for background objects and cannot attain better performance.