
Unsupervised 3D Object Detection by Commonsense Clue

Journal: Transactions on Pattern Analysis and Machine Intelligence
Manuscript ID: TPAMI-2024-05-1261
Manuscript Type: Regular
Keywords: 3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition

Unsupervised 3D Object Detection by Commonsense Clue

Hai Wu, Shijia Zhao, Xun Huang, Qiming Xia, Chenglu Wen*, Senior Member, IEEE, Li Jiang, Xin Li, Senior Member, IEEE, and Cheng Wang, Senior Member, IEEE

Abstract

Traditional 3D object detectors, operating under fully, semi-, or weakly supervised paradigms, require extensive human annotations. In contrast, this paper endeavors to design an unsupervised 3D object detector that discerns object patterns automatically without relying on such annotations. To this end, we first introduce a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD initially constructs a Commonsense Prototype (CProto) to represent the geometry center and size of objects. Subsequently, CPD produces high-quality pseudo-labels and guides the detector convergence by the size and geometry priors from CProto. Based on CPD, we further propose CPD++, an enhanced version that boosts performance by motion clues. CPD++ learns localization from stationary objects and learns recognition from moving objects. It then facilitates the mutual transfer of localization and recognition knowledge between these two object types. Both CPD and CPD++ demonstrate superior performance over existing state-of-the-art unsupervised 3D detectors. Furthermore, by training CPD++ on WOD and testing on KITTI, CPD++ attains 89.25% 3D Average Precision (AP) on the moderate car class under a 0.5 IoU threshold. This result approximates 95.3% of the performance achieved by fully supervised counterparts, underscoring the advancement of our method.

Index Terms-3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition.

1 Introduction

AUTONOMOUS driving requires the reliable detection of 3D objects in urban scenes to ensure safe path planning and navigation. Numerous studies design 3D detection models through a fully supervised paradigm [4], [21], [43], [45], [46], [48]. These models heavily depend on human annotations from diverse scenes to guarantee their effectiveness across various scenarios. This data labeling process involves creating high-quality 3D bounding boxes for every object instance in each scene (see Fig. 1 (a)). It is laborious and time-consuming, limiting the broad deployment of detectors in practice [57]. Several studies have explored semi/weakly-supervised detection frameworks to mitigate the labeling need. For instance, semi-supervised methods [3], [39], [63] label a portion of the data. Weakly-supervised methods utilize efficient click annotations [24]. These methods noticeably decrease the labeling cost (see Fig. 1 (b)) and alleviate the labeling burden to a certain extent. Recently, inspired by the large-scale self-supervised pre-training of language models, several initiatives for self-supervised pre-training in 3D object detection have emerged [26], [52]. Self-supervised methods generally pre-train the models on large-scale unlabeled masked data (see Fig. 1 (c)) and then fine-tune the detector with human-provided labels. All the aforementioned methods require

Fig. 1. Different settings of 3D object detection. (a) Fully supervised methods annotate all objects in all scenes. (b) Semi/weakly supervised methods annotate a subset of data or use efficient click annotations. (c) Self-supervised methods annotate a subset of data for fine-tuning. (d) Unsupervised methods do not use human annotations.

supervision signals from human annotations for training.

However, objects in 3D data (e.g., vehicles in LiDAR point clouds) exhibit distinguishable geometry patterns. For example, these objects are typically situated on the ground with recognizable shapes, and their sizes remain consistent across different frames. Leveraging these patterns, the objects can be readily identified through certain commonsense reasoning. This insight enables the development of an unsupervised 3D object detector that operates without using human annotations (see Fig. 1 (d)). Note that, different from self-supervised pre-training, unsupervised 3D object detection can perform detection without fine-tuning, while self-supervised pre-training cannot.
This paper aims to design a modern unsupervised detector able to accurately detect 3D objects of interest in

Fig. 2. (a) Recognition challenge brought by similar shapes between foreground and background. (b) Localization challenge brought by a limited view of observation. (c) Stationary objects are well-captured in consecutive frames, which can generate accurate localization seed labels. (d) Moving objects produce displacement, which can generate precise recognition seed labels. (e) CPD key idea: integrate commonsense clues into a prototype to constrain the detector; CPD++ key idea: leverage the movability and stationarity of objects to enhance localization and recognition mutually.

autonomous driving scenarios, utilizing solely commonsense clues. We follow the recent fully/weakly supervised work [24], [43], where the 3D object of interest refers to mobile entities covering a wide range of traffic participants. Specifically, mobile objects, such as vehicles, pedestrians, and cyclists, are predominantly considered as foreground, while static structures, like buildings, roads, and light poles, are modeled as background. The term “detect” refers to the precise localization of a 3D bounding box and the accurate recognition of object categories.
Unfortunately, unsupervised 3D object detection presents exceptionally difficult challenges as follows: (1) Recognition challenge. Within the context of autonomous driving, point clouds often contain a mixture of objects. There are no geometric criteria with a clear decision boundary to segment the background and foreground objects. Numerous traditional methods leveraged ground removal [12] and clustering techniques [60] to differentiate the objects from their surroundings. However, background structures, such as buildings and fences, are inevitably misclassified as foreground objects (see Fig. 2 (a)). Recent sophisticated approaches build accurate seed labels through heuristics and identify common patterns by training a deep network iteratively. For example, MODEST [57] and DRIFT [23] formulate the seed labels by the Persistence Point Score (PPScore). Then, the pseudo-labels are utilized to train a regular detector. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs substantially. As of now, attaining unsupervised object recognition without any increase in labor costs remains a tough challenge. (2) Localization challenge. Almost all objects are self-occluded in the autonomous driving scenario. Besides, with the increase in scanning distance, the number of captured points decreases dramatically. Consequently, the objects in point clouds are mostly geometry-incomplete, posing substantial challenges for the accurate localization of 3D bounding boxes (see Fig. 2 (b)). Early methods leveraged a straightforward bounding box fitting algorithm [36] for object localization. However, these methods often struggle to deliver satisfactory performance due to the sparsity and occlusion of objects. Advanced methods generate initial pseudo-labels from fitted labels and use these to train deep networks

iteratively. They incorporate human knowledge to refine the position and size of the pseudo-labels iteratively. A subset of objects, denoted as complete objects, benefit from having at least one complete scan across the entire point cloud sequence, allowing their pseudo-labels to be refined through temporal consistency [59]. However, most objects, termed incomplete objects, lack full scan coverage and cannot be recovered by temporal consistency.
To tackle these issues, this paper first proposes a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection (see Fig. 2 (e)). CPD is built upon two key insights: (1) Objects of the same class have similar sizes and can be roughly classified by size differences. (2) Nearby stationary objects are complete in consecutive frames and can be localized accurately by shape and position measures. Our idea is to construct a Commonsense Prototype (CProto) set representing the geometry and size of objects to refine the pseudo-labels and constrain the detector training. To this end, we first design a Multi-Frame Clustering (MFC) method that yields multi-class pseudo-labels by size-based thresholding. Subsequently, we introduce an unsupervised Completeness and Size Similarity (CSS) score that selects high-quality labels to construct the CProto set. Furthermore, we design a CProto-constrained Box Regularization (CBR) that refines the sizes of pseudo-labels. In addition, we develop CProto-constrained Self-Training (CST) that improves the detection accuracy by the geometry knowledge from CProto.
Nevertheless, the size-based classification in CPD has limitations in differentiating foreground and background objects with similar shapes. In particular, the dimensions of incorrectly clustered fragments often closely resemble those of small foreground objects (e.g., pedestrians), leading to many false positive labels. Consequently, the detector erroneously assigns high confidence to small background objects. To tackle this issue, we further propose CPD++ (see Fig. 2 (e)), a stronger unsupervised 3D detector able to detect multi-class objects accurately. CPD++ is built upon three key insights: (1) Stationary objects are well-captured in consecutive frames, which allows for the creation of accurate localization seed labels (see Fig. 2 (c)). (2) Moving objects produce salient displacement, which can be leveraged to generate precise recognition seed labels (see

Fig. 3. Performance comparison. Both CPD and CPD++ achieve state-of-the-art unsupervised 3D object detection performance. CPD++ improves CPD by $2\times$ in terms of mAP.
Fig. 2 (d)). (3) The localization and recognition knowledge are complementary and mutually transferable. Our idea is to learn localization from stationary objects, learn recognition from moving objects, and transfer the knowledge between the two to boost detection performance. The core designs are a Motion-conditioned Label Refinement (MLR) module that produces high-quality pseudo-labels and a Motion-Weighted Training (MWT) scheme that jointly learns localization and recognition.
The effectiveness of our design is verified by experiments on the widely used Waymo Open Dataset (WOD) [37] and the KITTI dataset [8]. The main contributions include:
  • We propose CPD that integrates commonsense knowledge into prototypes for unsupervised 3D object detection. CPD adopts CProto-constrained Box Regularization (CBR) for accurate initial pseudo-label generation and CProto-constrained Self-Training (CST) for unsupervised learning.
  • We propose CPD++ that takes a step toward more accurately detecting multi-class 3D objects by incorporating motion clues. This is enabled by the designed Motion-conditioned Label Refinement (MLR) and a Motion-Weighted Training (MWT) scheme.
  • Both CPD and CPD++ outperform state-of-the-art unsupervised 3D detectors on the WOD and KITTI datasets. Notably, CPD++ enhances CPD by $2\times$ mAP (see Fig. 3). When CPD++ is trained on the WOD and tested on KITTI, it attains 89.25% moderate car 3D AP under a 0.5 IoU threshold. This performance matches 95.3% of its fully supervised counterpart, highlighting the advance of our work.

2 Related Work

2.1 Fully/Semi/Weakly-Supervised 3D Object Detection

Recent fully-supervised 3D detectors build single-stage [10], [13], [38], [56], [65], [66], two-stage [4], [31], [32], [33], [45], [46], [48], [53] or multiple stage [2], [43] deep networks for 3D object detection. Fully supervised 3D object detection methods have achieved significant performance, but the annotation cost is often unacceptable. To reduce the annotation cost, semi-supervised and weakly-supervised 3D object detection methods have gained widespread attention. Semi-supervised methods [18], [28], [39], [50], [54], [63] usually only annotate a portion of the scenarios and then apply teacher-student networks to generate pseudo-labels for unannotated scenarios. For example, the 3DIoUMatch [39] laid the groundwork in the domain of outdoor scenes

by pioneering the estimation of 3D Intersection over Union (IoU) as a metric for localization. Weakly supervised methods [25], [41], [58] reduce the annotation cost from the perspective of the annotation form. For example, WS3D [25] proposed point-level labels, which generate box-level pseudo-labels from instance-level click labels. Some other weakly-supervised methods [19], [49] only annotate a single instance per scene to reduce the annotation cost. Different from all of these methods, we aim to design a modern unsupervised detector that is able to accurately detect 3D objects without using any human-level annotations.

2.2 Self-supervised/Unsupervised 3D Detectors

Previous unsupervised pre-training methods discern latent patterns within unlabeled data by masked labels [52] or contrastive losses [17], [55]. But these methods require human labels for fine-tuning [5], [20]. Traditional methods [1], [30], [36] employ ground removal and clustering for 3D object detection without human labels, but suffer from poor detection performance. Some deep learning-based methods generate pseudo-labels by multiple traversals and use the pseudo-labels to train a 3D detector iteratively [23], [57]. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs remarkably. The recent OYSTER [59] generates labels from a single traversal, improves recognition by ignoring short tracklets, and improves localization by temporal consistency. However, most bounding boxes of incomplete objects cannot be recovered by temporal consistency. Besides, many small background objects also generate long tracklets, decreasing recognition precision. Our CPD addresses the localization problem by leveraging the geometry prior from CProto to refine the pseudo-labels and guide the network convergence. Our CPD++ addresses the recognition problem by learning the discriminative features of moving objects.

2.3 Prototype-based Methods

Prototype-based methods are widely used in 2D object detection [14], [15], [22], [42], [62] when novel classes are incorporated. Inspired by these methods, Prototypical VoteNet [64] constructs geometric prototypes learned from base classes for few-shot 3D object detection. GPA-3D [16] and CL3D [29] build geometric prototypes from a source-domain model for domain adaptive 3D detection. However, both learning from base classes and training on the source domain require high-quality annotations. Unlike these, we construct CProto using commonsense knowledge and detect 3D objects in a zero-shot manner without human-level annotations.

3 Problem Formulation

Given a set of input points $\boldsymbol{x}$, 3D object detection aims to design a function $\mathcal{F}$ that produces optimal detections $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})$, where each detection consists of $[x, y, z, l, w, h, \alpha, \beta]$, representing the position, width, length, height, azimuth angle, and class identity of an object, respectively.


Fig. 4. CPD framework. (a) Initial pseudo-labels are generated by multi-frame clustering. (b) The commonsense prototype (CProto) is constructed from high-quality pseudo-labels based on CSS score. The low-quality labels are further refined by the shape prior from CProto. (c) A prototype network fed with dense points from CProto produces high-quality features to guide the detection network convergence.

3.1 Fully Supervised 3D Object Detection

Since $\mathcal{F}$ cannot be obtained directly by some handcrafted rule, regular methods formulate the 3D object detection problem as a fully supervised learning problem. It requires a large-scale dataset with human labels. The dataset is conventionally divided into training $(\boldsymbol{x}^{\text{train}}, \boldsymbol{y}^{\text{train}})$, validation $(\boldsymbol{x}^{\text{val}}, \boldsymbol{y}^{\text{val}})$, and testing $(\boldsymbol{x}^{\text{test}}, \boldsymbol{y}^{\text{test}})$ datasets. During the design process, a detection model with learnable parameters $\theta$ and a loss function $\mathcal{L}^{\text{full}}$ are optimized from the training dataset:
$$\mathcal{F}_{\theta^{*}}=\underset{\theta}{\arg \min }\left(\mathcal{L}^{\text{full}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \boldsymbol{y}^{\text{train}}\right)\right)$$
The detection model $\mathcal{F}_{\theta^{*}}$ generally consists of some unlearnable hyperparameters $\eta$, which are optimized by the detection metrics $\mathcal{A}$ from the validation dataset:
$$\mathcal{F}_{\theta^{*}}^{\eta^{*}}=\underset{\eta}{\arg \max }\left(\mathcal{A}\left(\mathcal{F}_{\theta^{*}}^{\eta}\left(\boldsymbol{x}^{\text{val}}\right), \boldsymbol{y}^{\text{val}}\right)\right).$$
The testing dataset is only used to compare the performance with other methods and can’t be used to tune parameters.

3.2 Unsupervised 3D Object Detection

To decrease the labeling cost, unsupervised settings do not require human labels $\boldsymbol{y}^{\text{train}}$ of the training dataset. The labels $\boldsymbol{y}^{\text{val}}$ and $\boldsymbol{y}^{\text{test}}$ are required to choose the best hyperparameters and to compare with other methods, respectively. This paper solves the unsupervised problem with a pseudo-label-based framework. The main challenge lies in how to design a pseudo-label generation function $\mathcal{G}$ and a pseudo-label-based training target $\mathcal{L}^{\text{un}}$ to optimize the detector:
$$\mathcal{F}_{\theta^{*}}=\underset{\theta}{\arg \min }\left(\mathcal{L}^{\text{un}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \mathcal{G}\left(\boldsymbol{x}^{\text{train}}\right)\right)\right)$$
This paper leverages human commonsense knowledge to formulate the label generation function $\mathcal{G}$ and the training target $\mathcal{L}^{\text{un}}$. The commonsense knowledge is formulated as observations or insights. For example, stationary objects share the same position, size, and class across frames; background objects, such as buildings and trees, do not move. We ensure all these intuitions are shared among different datasets, easy to obtain, and not sensitive to a specific value.
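To make the formulation concrete, the following is a minimal PyTorch-style sketch of one pseudo-label-based training step; `detector`, `label_fn` (standing in for $\mathcal{G}$), and `loss_fn` (standing in for $\mathcal{L}^{\text{un}}$) are hypothetical callables, not the paper's implementation.

```python
import torch

def unsupervised_training_step(detector, optimizer, x_train, label_fn, loss_fn):
    """One pseudo-label-based training step: the commonsense label generator
    label_fn plays the role of G and replaces human annotations, and the
    detector F_theta is optimized against its output with the target L^un."""
    with torch.no_grad():
        pseudo_labels = label_fn(x_train)       # G(x_train), no human labels
    predictions = detector(x_train)             # F_theta(x_train)
    loss = loss_fn(predictions, pseudo_labels)  # L^un(F_theta(x), G(x))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```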

4 CPD: Commonsense Prototype for Unsupervised 3D Object Detection

This section introduces the Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. As shown in Fig. 4, CPD consists of three main parts: (1) initial label generation, (2) label refinement, and (3) self-training. CPD formulates the pseudo-label generation function $\mathcal{G}$ by Multi-Frame Clustering (MFC) and CProto-constrained Box Regularization (CBR), and formulates the training target $\mathcal{L}^{\text{un}}$ by the CSS-weighted detection loss and the geometry contrast loss. We detail these designs as follows.

4.1 Initial Label Generation

Recent unsupervised methods [57], [59] detect 3D objects in a class-agnostic way. How to classify objects (e.g., vehicles and pedestrians) without annotation is still an unsolved challenge. Our observations indicate that the sizes of different classes are significantly different. Therefore, the dense objects in consecutive frames (see Fig. 5 (b)) can be roughly classified by size thresholds. This motivates us to design a Multi-Frame Clustering (MFC) method to generate initial labels. MFC involves motion artifact removal, clustering, and post-processing.

Fig. 5. (a) The moving objects are removed by PPScore. (b) Nearby stationary objects have high completeness in consecutive frames. (c) Length absolute error with different numbers of frames. (d) Pseudo-labels of the complete object $\mathcal{T}$ are refined by temporal consistency. (e) Pseudo-labels of the incomplete object $\mathcal{J}$ fail to be refined by temporal consistency. (f) Many objects lack full scan coverage and generate inaccurate pseudo-labels. (g) Multi-level occupancy score. (h) Mean size error of initial labels.

4.1.1 Motion Artifact Removal (MAR)

Directly transforming and concatenating $2n+1$ consecutive frames $\{\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{n}\}$ (i.e., the past $n$, the future $n$, and the current frame) into a single point cloud $\boldsymbol{x}_{0}^{*}$ introduces motion artifacts from moving objects, leading to increased label errors as $n$ grows (see Fig. 5 (c)). To mitigate this issue, we first transform the consecutive frames into a global coordinate system and calculate the Persistence Point Score (PPScore) [57] across consecutive frames to identify the points in motion (see Fig. 5 (a)). We keep all the points from $\boldsymbol{x}_{0}$ and remove moving points from the other frames $\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{-1}, \boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}$. After this removal, we concatenate the frames to obtain dense points $\boldsymbol{x}_{0}^{*}$.
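A minimal sketch of this step is shown below, assuming the frames are already in a common global coordinate system and that a per-point PPScore in $[0,1]$ has been computed for each neighboring frame; the 0.7 threshold is illustrative rather than the paper's exact setting.

```python
import numpy as np

def concat_frames_without_motion(current, neighbors, neighbor_ppscores, thresh=0.7):
    """MAR sketch: keep every point of the current frame x_0 and drop points
    flagged as moving (low PPScore) from the 2n neighboring frames before
    concatenating them into the dense point cloud x_0*."""
    kept = [current]                             # all current-frame points
    for pts, scores in zip(neighbors, neighbor_ppscores):
        kept.append(pts[scores >= thresh])       # static points only
    return np.concatenate(kept, axis=0)
```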

4.1.2 Clustering and Post-processing

In line with a recent study [59], we apply ground removal [12], DBSCAN [6], and bounding box fitting [61] on $\boldsymbol{x}_{0}^{*}$ to obtain a set of class-agnostic bounding boxes $\hat{\boldsymbol{b}}$. We then perform a tracking process to associate the boxes across frames. Since objects of the same class typically have similar sizes in 3D space, we calculate class-specific size thresholds to classify $\hat{\boldsymbol{b}}$ into different categories. The size thresholds are calculated based on two observations: (1) all moving objects are foreground; (2) the size ratios of different classes are different (the vehicle is around 2:1:1, the pedestrian is around 1:1:2, and the cyclist is around 2:1:2). Therefore, for each point cloud sequence, one trajectory per class with the highest mean speed and the nearest size ratio to that class is automatically chosen. The size range of these tracklets defines the size thresholds for classification. This process results in a set of initial pseudo-labels $\boldsymbol{b}=\{b_{j}\}_{j}$, where $b_{j}=[x, y, z, l, w, h, \alpha, \beta, \tau]$ represents the position, width, length, height, azimuth angle, class identity, and tracking identity, respectively.
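The clustering and size-ratio-based class assignment can be sketched as follows, using scikit-learn's DBSCAN, an axis-aligned box fit as a stand-in for the L-shape fitting of [61], and the ratio templates listed above; the `eps`/`min_samples` values and the orientation-agnostic size estimate are simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

SIZE_RATIOS = {"vehicle": (2, 1, 1), "pedestrian": (1, 1, 2), "cyclist": (2, 1, 2)}

def cluster_and_classify(nonground_points, eps=0.7, min_samples=5):
    """Cluster ground-removed points, fit a crude box per cluster, and guess
    the class from the nearest normalized size ratio (ignoring orientation)."""
    ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(nonground_points[:, :3])
    boxes = []
    for cid in set(ids) - {-1}:                    # -1 marks DBSCAN noise
        pts = nonground_points[ids == cid, :3]
        lwh = pts.max(axis=0) - pts.min(axis=0)    # axis-aligned size estimate
        q = lwh / lwh.sum()
        diffs = {c: np.abs(q - np.array(r) / sum(r)).sum() for c, r in SIZE_RATIOS.items()}
        boxes.append({"center": pts.mean(axis=0), "lwh": lwh,
                      "cls": min(diffs, key=diffs.get)})
    return boxes
```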

4.2 CProto-constrained Box Regularization for Label Refinement

As presented in Fig. 5 (d)-(f), initial labels for incomplete objects often suffer from inaccuracies in sizes and positions. To tackle this issue, we introduce the CProto-constrained Box Regularization (CBR) method. The key idea is to construct a high-quality CProto set based on unsupervised scoring from

complete objects to refine the pseudo-labels of incomplete objects. Unlike OYSTER [59], which can only refine the pseudo-labels of objects with at least one complete scan, our CBR can refine pseudo-labels of all objects, significantly decreasing the overall size and position errors.

4.2.1 Completeness and Size Similarity (CSS) Scoring

Existing label scoring methods such as IoU scoring [31] are designed for fully supervised detectors. In contrast, we introduce an unsupervised Completeness and Size Similarity scoring (CSS) method. It aims to approximate the IoU score using commonsense knowledge alone.
Distance score. CSS first assesses the object completeness based on distance, assuming labels closer to the ego vehicle are likely to be more accurate. For an initial label $b_{j}$, we normalize its distance to the ego vehicle within the range $[0,1]$ to compute the distance score as
$$\psi^{1}\left(b_{j}\right)=1-\mathcal{N}\left(\left\|c_{j}\right\|\right)$$
where $\mathcal{N}$ is the normalization function and $c_{j}$ is the location of $b_{j}$. However, this distance-based approach has its limitations. For example, occluded objects near the ego vehicle, which should receive lower scores, are inadvertently assigned high scores due to their proximity. To mitigate this issue, we introduce a Multi-Level Occupancy (MLO) score, further detailed in Fig. 5 (g).
MLO score. Considering the diverse sizes of objects, we divide the bounding box of the initial label into multiple grids with different length and width resolutions. The MLO score is then calculated by determining the proportion of grids occupied by cluster points, via
$$\psi^{2}\left(b_{j}\right)=\frac{1}{N^{o}} \sum_{k} \frac{O^{k}}{\left(r^{k}\right)^{2}}$$
where $N^{o}$ denotes the number of resolutions, $O^{k}$ is the number of occupied grids under the $k$-th resolution, and $r^{k}$ is the grid number of the $k$-th resolution.
Size Similarity (SS) score. While the distance and MLO scores effectively evaluate the localization and size quality, they fall short in assessing classification quality. To bridge this gap, we introduce the SS score. We observe that the size ratios of different classes are different (the vehicle is around 2:1:1, pedestrian is around 1:1:2, and cyclist is around 2:1:2),

so we calculate the score by measuring the ratio difference. Specifically, we pre-define a set of size templates (simple size ratio or statistics from Wikipedia) for each class. Then, we calculate the size score by a truncated KL divergence [9].
$$\psi^{3}\left(b_{j}\right)=1-\min \left(0.05, \sum_{\sigma} q_{\sigma}^{b} \log \left(\frac{q_{\sigma}^{b}}{q_{\sigma}^{a}}\right)\right) / 0.05$$
where $q_{\sigma}^{a} \in\{l^{a}, w^{a}, h^{a}\}$ and $q_{\sigma}^{b} \in\{l^{b}, w^{b}, h^{b}\}$ refer to the normalized length, width, and height of the template and the label, respectively.
We linearly combine the three metrics, $\mathcal{S}(b_{j})=\sum_{i} \omega^{i} \psi^{i}(b_{j})$, to produce the final score, where $\omega^{i}$ is a weighting factor (in this study we adopt a simple average, $\omega^{i}=1/3$). For each $b_{j} \in \boldsymbol{b}$, we compute its CSS score $s_{j}^{\text{css}}=\mathcal{S}(b_{j})$ and obtain a set of scores $\boldsymbol{s}=\{s_{j}^{\text{css}}\}_{j}$.
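The three sub-scores and their average can be sketched as follows, assuming each box's inside points are given in a box-local frame; the normalization range, grid resolutions, and template ratios are illustrative rather than the paper's exact settings.

```python
import numpy as np

SIZE_TEMPLATES = {"vehicle": (2.0, 1.0, 1.0),     # (l, w, h) ratio templates
                  "pedestrian": (1.0, 1.0, 2.0),
                  "cyclist": (2.0, 1.0, 2.0)}

def distance_score(center, max_range=80.0):
    # psi^1: normalized distance to the ego vehicle; closer boxes score higher
    return 1.0 - min(np.linalg.norm(center) / max_range, 1.0)

def mlo_score(local_pts_xy, box_lw, resolutions=(2, 4, 8)):
    # psi^2: fraction of occupied grid cells, averaged over several resolutions
    l, w = box_lw
    scores = []
    for r in resolutions:
        gx = np.clip(((local_pts_xy[:, 0] / l + 0.5) * r).astype(int), 0, r - 1)
        gy = np.clip(((local_pts_xy[:, 1] / w + 0.5) * r).astype(int), 0, r - 1)
        scores.append(len(set(zip(gx.tolist(), gy.tolist()))) / r ** 2)
    return float(np.mean(scores))

def size_similarity_score(box_lwh, cls, trunc=0.05):
    # psi^3: truncated KL divergence between normalized size ratios
    q_b = np.asarray(box_lwh, dtype=float) / np.sum(box_lwh)
    q_a = np.asarray(SIZE_TEMPLATES[cls]) / np.sum(SIZE_TEMPLATES[cls])
    kl = float(np.sum(q_b * np.log(q_b / q_a)))
    return 1.0 - min(trunc, kl) / trunc

def css_score(center, local_pts_xy, box_lwh, cls):
    # simple average of the three sub-scores (omega_i = 1/3)
    return (distance_score(center) + mlo_score(local_pts_xy, box_lwh[:2])
            + size_similarity_score(box_lwh, cls)) / 3.0
```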

4.2.2 CProto Set Construction

Regular learnable prototype-based methods require annotations [16], [64], which are unavailable in the unsupervised problem. We construct a high-quality CProto set $\boldsymbol{P}=\{P_{k}\}_{k}$, representing geometry and size centers, based on the CSS score. Here, $P_{k}=\{x_{k}^{p}, b_{k}^{p}\}$, where $x_{k}^{p}$ indicates the inside points and $b_{k}^{p}$ refers to the bounding box. Specifically, we first categorize the initial labels $\boldsymbol{b}$ into different groups based on their tracking identity $\tau$. Within each group, we select the high-quality boxes and inside points that meet a high CSS score threshold $\eta$ (determined on the validation set, 0.8 in this study). Then, we transform all points and boxes into a local coordinate system, and obtain $b_{k}^{p}$ by averaging the boxes and $x_{k}^{p}$ by concatenating all the points.
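A compact sketch of this construction is given below, assuming each initial label is a dict carrying its size, class, and tracking identity, and that its inside points are already expressed in the object's local frame; the field names are illustrative.

```python
import numpy as np
from collections import defaultdict

def build_cproto_set(labels, css_scores, local_points, eta=0.8):
    """Group initial labels by tracking identity, keep members whose CSS score
    exceeds eta, then average the kept boxes and pool their local-frame points."""
    groups = defaultdict(list)
    for label, score, pts in zip(labels, css_scores, local_points):
        if score > eta:
            groups[label["track_id"]].append((label, pts))
    cprotos = []
    for track_id, members in groups.items():
        sizes = np.stack([[m[0]["l"], m[0]["w"], m[0]["h"]] for m in members])
        cprotos.append({"track_id": track_id,
                        "cls": members[0][0]["cls"],
                        "box_lwh": sizes.mean(axis=0),                       # averaged box size
                        "points": np.concatenate([m[1] for m in members])})  # pooled dense points
    return cprotos
```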

4.2.3 Box Regularization

We next regularize the initial labels by the size prior from CProto. Intuitively, the height of an initial label is relatively accurate compared to its length and width, because the tops of objects are generally not occluded. The statistics on the WOD validation set [37] confirm this intuition (see Fig. 5 (h)). Besides, intra-class 3D objects with the same height have similar lengths and widths. Therefore, for an initial label $b_{j}$ and a CProto $P_{k}$ with the same class identity, we perform an association by the minimum difference in box height. The initial pseudo-labels associated with the same $P_{k}$, having the same class identity, similar length, and similar width, are naturally classified into the same group. We then perform re-sizing and re-localization for each group to refine the pseudo-labels. (1) Re-size. We directly replace the size of $b_{j}$ with the length, width, and height of $b_{k}^{p} \in P_{k}$. (2) Re-location. Since points mainly lie on the object's surface and boundary, we divide the object into different bins and align the box boundary and orientation to the boundary point of the densest part (see Fig. 6). Finally, we obtain improved pseudo-labels $\boldsymbol{b}^{*}=\{b_{j}^{*}\}_{j}$.
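Under the same illustrative label/CProto layout as above, the re-size step can be sketched as follows (the re-location step, which snaps the box boundary and orientation to the densest bin, is omitted for brevity).

```python
def resize_labels(labels, cprotos):
    """CBR re-size sketch: associate each initial label with the same-class
    CProto of minimum height difference and copy that CProto's l, w, h."""
    for label in labels:
        candidates = [p for p in cprotos if p["cls"] == label["cls"]]
        if not candidates:
            continue  # no prototype of this class yet; keep the fitted size
        best = min(candidates, key=lambda p: abs(p["box_lwh"][2] - label["h"]))
        label["l"], label["w"], label["h"] = best["box_lwh"]
    return labels
```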

4.3 CProto-constrained Self-Training (CST)

Recent methods [57], [59] utilize pseudo-labels for training 3D detectors. However, even after refinement, some pseudo-labels remain inaccurate, diminishing the effectiveness of correct supervision and potentially misleading the training process. To tackle these issues, we propose two designs: (1) CSS-Weighted Detection Loss assigns different training

Fig. 6. The size of the initial label is replaced by the CProto box, and the position is also corrected.

weights based on label quality to suppress false supervision signals. (2) Geometry Contrast Loss aligns predictions of sparsely scanned points with the dense CProto, thereby improving feature consistency.

4.3.1 Network Architecture

We adopt a dense-sparse alignment architecture (Fig. 4 (c)), consisting of a prototype network $\mathcal{F}^{\text{pro}}$ and a detection network $\mathcal{F}^{\text{det}}$, both constructed from the two-stage CenterPoint [56]. During training, for each $b_{j}^{*}$, we add its corresponding points $x_{k}^{p}$ from CProto $P_{k}$ to the scene to obtain a dense point cloud $\boldsymbol{x}^{\text{pro}}$. We feed $\boldsymbol{x}^{\text{pro}}$ to $\mathcal{F}^{\text{pro}}$ to produce relatively good features and detections. We then feed randomly downsampled points $\boldsymbol{x}^{\text{det}}$ to $\mathcal{F}^{\text{det}}$. We align the features and detections from the two branches by the detection loss and the contrast loss. During testing, we feed points without downsampling to the detection network $\mathcal{F}^{\text{det}}$ to perform detection.
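The data flow of this dense-sparse alignment can be sketched as follows, assuming each branch takes a point tensor and returns (features, boxes) and that `downsample` is any random point-dropping augmentation; this is a schematic only, not the actual CenterPoint interface.

```python
import torch

def cst_forward(pro_net, det_net, scene_points, cproto_points, downsample):
    """Dense-sparse alignment pass: the prototype branch sees the scene
    densified with CProto points, the detection branch sees a randomly
    downsampled copy; the two outputs are later aligned by the detection
    and contrast losses."""
    x_pro = torch.cat([scene_points, cproto_points], dim=0)  # densified input
    x_det = downsample(scene_points)                          # sparse input
    feats_pro, boxes_pro = pro_net(x_pro)
    feats_det, boxes_det = det_net(x_det)
    return (feats_pro, boxes_pro), (feats_det, boxes_det)
```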

4.3.2 Training Loss

CSS Weight. Considering that false pseudo-labels may mislead the network convergence, we first calculate a loss weight based on label quality. Formally, we convert the CSS score $s_{i}^{\text{css}}$ of a pseudo-label to
$$\omega_{i}= \begin{cases}0 & s_{i}^{\text{css}}<S_{L} \\ \frac{s_{i}^{\text{css}}-S_{L}}{S_{H}-S_{L}} & S_{L}<s_{i}^{\text{css}}<S_{H} \\ 1 & s_{i}^{\text{css}}>S_{H}\end{cases}$$
where $S_{H}$ and $S_{L}$ are the high- and low-quality thresholds (we empirically set them to 0.7 and 0.4, respectively).
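This piecewise mapping can be written directly as a small helper; the default thresholds below are the values reported above.

```python
def css_weight(css_score, s_low=0.4, s_high=0.7):
    # Piecewise-linear weight: 0 below S_L, 1 above S_H, linear ramp in between.
    if css_score < s_low:
        return 0.0
    if css_score > s_high:
        return 1.0
    return (css_score - s_low) / (s_high - s_low)
```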
CSS-weighted Detection Loss. To decrease the influence of false labels, we formulate the CSS-weighted detection loss to refine $N$ proposals:
$$\mathcal{L}_{det}^{\text{css}}=\frac{1}{N} \sum_{i} \omega_{i}\left(\mathcal{L}_{i}^{\text{pro}}+\mathcal{L}_{i}^{\text{det}}\right)$$
where $\mathcal{L}_{i}^{\text{pro}}$ and $\mathcal{L}_{i}^{\text{det}}$ are detection losses [4] of $\mathcal{F}^{\text{pro}}$ and $\mathcal{F}^{\text{det}}$, respectively. The losses are calculated from the pseudo-labels $\boldsymbol{b}^{*}$ and the network predictions.
Geometry Contrast Loss. We formulate two contrast losses that minimize the feature and predicted-box differences between the prototype and detection networks. (1) Feature contrast loss. For a foreground RoI $r_{i}$ from the detection network, we extract features $\boldsymbol{f}_{i}^{p}$ from the prototype network by voxel set abstraction [4] and features $\boldsymbol{f}_{i}^{d}$ from the detection network. We then formulate the contrast loss by cosine distance:
$$\mathcal{L}_{feat}^{\text{css}}=-\frac{1}{N^{f}} \sum_{i} \omega_{i} \frac{\boldsymbol{f}_{i}^{d} \cdot \boldsymbol{f}_{i}^{p}}{\left\|\boldsymbol{f}_{i}^{d}\right\|\left\|\boldsymbol{f}_{i}^{p}\right\|}$$
Fig. 7. (a) A baseline based on Vanilla Iterative Training (VaIT). (b) Performance improvement of VaIT using different score thresholds. (c) True Positive (TP) and False Positive (FP) statistics of pseudo-labels. (d) Examples of TP and FP. (e) Position and size errors of pseudo-labels.

where $N^{f}$ is the number of foreground proposals. (2) Box contrast loss. For a box prediction $d_{i}^{p}$ from the prototype network and a box prediction $d_{i}^{d}$ from the detection network, we formulate the box contrast loss by IoU, location difference, and angle difference:
$$\mathcal{L}_{box}^{\text{css}}=\frac{1}{N^{f}} \sum_{i} \omega_{i}\left[1-I\left(d_{i}^{d}, d_{i}^{p}\right)+\left\|c_{i}^{d}-c_{i}^{p}\right\|+\left|\sin \left(\alpha_{i}^{d}-\alpha_{i}^{p}\right)\right|\right]$$
where $I$ denotes the IoU function; $c_{i}^{d}, \alpha_{i}^{d}$ refer to the position and angle of $d_{i}^{d}$, and $c_{i}^{p}, \alpha_{i}^{p}$ refer to the position and angle of $d_{i}^{p}$. We finally sum all losses to train the detector.
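The two contrast terms can be sketched in PyTorch as follows, assuming aligned foreground proposals from the two branches, boxes laid out as [x, y, z, l, w, h, alpha, ...], and a user-supplied rotated-IoU function; `iou_fn` is a placeholder, not a specific library call.

```python
import torch
import torch.nn.functional as F

def geometry_contrast_loss(feat_det, feat_pro, box_det, box_pro, weights, iou_fn):
    """CSS-weighted feature and box contrast between detection and prototype
    branches: negative cosine similarity for features; (1 - IoU) + center
    distance + |sin(angle difference)| for boxes."""
    n_f = max(feat_det.shape[0], 1)
    w = weights / n_f
    feat_loss = -(w * F.cosine_similarity(feat_det, feat_pro, dim=1)).sum()
    iou = iou_fn(box_det, box_pro)                              # (N,) rotated IoU
    center = (box_det[:, :3] - box_pro[:, :3]).norm(dim=1)
    angle = torch.sin(box_det[:, 6] - box_pro[:, 6]).abs()
    box_loss = (w * (1.0 - iou + center + angle)).sum()
    return feat_loss + box_loss
```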

5 CPD++: Enhancing CPD by Motion Clue

Although the proposed CPD framework achieves promising performance on large-scale vehicle objects, it suffers from low detection accuracy on small objects. We propose CPD++ for more accurate multi-class unsupervised 3D object detection. CPD++ formulates the pseudo-label generation function $\mathcal{G}$ by a Motion-conditioned Label Refinement (MLR) and formulates the training target $\mathcal{L}^{\text{un}}$ by a motion-weighted detection loss.

5.1 Pilot Experiments and Discussion

Prevalent methods apply iterative training to further boost the detection performance [57], [59]. By leveraging the generalization capabilities inherent in deep networks, iterative training can identify more object patterns. However, false labels with inaccurate classes and bounding boxes may mislead the training of the next iteration. Recent works apply score-based filtering [51] or temporal-based filtering [59] to diminish the false supervision signals. By employing these techniques, it is possible to further improve the multi-class performance of CPD. We begin our design by first constructing a strong baseline of the Vanilla Iterative Training (VaIT) method. As presented in Fig. 7 (a), VaIT utilizes the detection results from CPD as initial labels and improves the label quality by tracking, temporal filtering, and score thresholding. By default, VaIT applies Kalman filtering to perform tracking [44] and ignores short tracklets with fewer than six detections [59]. We conduct pilot experiments on the WOD and

train the VaIT for two iterations similar to OYSTER [59]. We present the detection performance improvement in Fig. 7 (b) (1) to (6), where different score thresholds ranging from 0.1 to 0.6 are used.
However, we observe that VaIT marginally enhances or even decreases the precision of pedestrian and cyclist detection no matter what score threshold is chosen. To delve into this issue, we present the statistics of True Positive (TP) and False Positive (FP) labels on the WOD validation set in Fig. 7 (c) (1). The statistics show that TP labels of pedestrians and cyclists only account for 25% and 24% respectively, while the FP labels account for 75% and 76% respectively. The main reason is that the threshold-based classification in CPD cannot distinguish foreground and background objects with similar sizes (see Fig. 7 (d)), leading to many false labels. As illustrated in Fig. 7 (d) (2) (3), the sizes of background objects generally resemble those of small foreground objects (Fig. 7 (d) (1)). Consequently, the network trained with the false labels tends to predict high scores for background objects and cannot attain better performance.