Traditional 3D object detectors, operating under fully, semi, or weakly supervised paradigms, require extensive human annotations. In contrast, this paper endeavors to design an unsupervised 3D object detector that discerns object patterns automatically without relying on such annotations. To this end, we first introduce a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD initially constructs a Commonsense Prototype (CProto) set to represent the geometry center and size of objects. Subsequently, CPD produces high-quality pseudo-labels and guides detector convergence using the size and geometry priors from CProto. Building on CPD, we further propose CPD++, an enhanced version that boosts performance with motion cues. CPD++ learns localization from stationary objects and recognition from moving objects, and then facilitates the mutual transfer of localization and recognition knowledge between these two object types. Both CPD and CPD++ demonstrate superior performance over existing state-of-the-art unsupervised 3D detectors. Furthermore, when trained on WOD and tested on KITTI, CPD++ attains $89.25\%$ 3D Average Precision (AP) on the moderate car class under a 0.5 IoU threshold, approximating $95.3\%$ of the performance achieved by its fully supervised counterpart and underscoring the advancement of our method.
Index Terms—3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition.
Unfortunately, unsupervised 3D object detection presents exceptionally difficult challenges as follows: (1) Recognition challenge. Within the context of autonomous driving, point clouds often contain a mixture of objects. There are no geometric criteria with a clear decision boundary to separate background and foreground objects. Numerous traditional methods leveraged ground removal [12] and clustering techniques [60] to differentiate objects from their surroundings. However, background structures, such as buildings and fences, are inevitably misclassified as foreground objects (see Fig. 2 (a)). Recent sophisticated approaches build accurate seed labels through heuristics and identify common patterns by training a deep network iteratively. For example, MODEST [57] and DRIFT [23] formulate the seed labels by the Persistence Point Score (PPScore), and the resulting pseudo-labels are used to train a regular detector. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs substantially. As of now, attaining unsupervised object recognition without any increase in labor cost remains a tough challenge. (2) Localization challenge. Almost all objects are self-occluded in the autonomous driving scenario. Besides, as the scanning distance increases, the number of captured points decreases dramatically. Consequently, the objects in point clouds are mostly geometry-incomplete, posing substantial challenges for the accurate localization of 3D bounding boxes (see Fig. 2 (b)). Early methods leveraged a straightforward bounding box fitting algorithm [36] for object localization. However, these methods frequently struggle to deliver satisfactory performance due to the sparsity and occlusion of objects. Advanced methods generate initial pseudo-labels from fitted labels and use these to train deep networks
iteratively. They incorporate human knowledge to refine the position and size of the pseudo-labels at each iteration. A subset of objects, denoted complete objects, benefits from having at least one complete scan across the entire point cloud sequence, allowing their pseudo-labels to be refined through temporal consistency [59]. However, most objects, termed incomplete objects, lack full scan coverage and cannot be recovered by temporal consistency.
To tackle these issues, this paper first proposes a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection (see Fig. 2 (e)). CPD is built upon two key insights: (1) objects of the same class have similar sizes and can be roughly classified by size differences; (2) nearby stationary objects are complete in consecutive frames and can be localized accurately by shape and position measures. Our idea is to construct a Commonsense Prototype (CProto) set representing the geometry and size of objects to refine the pseudo-labels and constrain detector training. To this end, we first design a Multi-Frame Clustering (MFC) method that yields multi-class pseudo-labels by size-based thresholding. Subsequently, we introduce an unsupervised Completeness and Size Similarity (CSS) score that selects high-quality labels to construct the CProto set. Furthermore, we design a CProto-constrained Box Regularization (CBR) that refines the sizes of pseudo-labels. In addition, we develop CProto-constrained Self-Training (CST), which improves detection accuracy using the geometry knowledge from CProto.
Fig. 3. Performance comparison. Both CPD and CPD++ achieve state-of-the-art unsupervised 3D object detection performance. CPD++ improves over CPD by $2\times$ in terms of mAP.
Fig. 2 (d)). (3) The localization and recognition knowledge are complementary and mutually transferable. Our idea is to learn localization from stationary objects, learn recognition from moving objects, and transfer the knowledge between the two to boost detection performance. The core designs are a Motion-conditioned Label Refinement (MLR) module that produces high-quality pseudo-labels and a Motion-Weighted Training (MWT) scheme that mutually learns localization and recognition.
The effectiveness of our design is verified by experiments on the widely used Waymo Open Dataset (WOD) [37] and KITTI dataset [8]. The main contributions include:
We propose CPD, which integrates commonsense knowledge into prototypes for unsupervised 3D object detection. CPD adopts CProto-constrained Box Regularization (CBR) for accurate initial pseudo-label generation and CProto-constrained Self-Training (CST) for unsupervised learning.
We propose CPD++, which takes a step toward more accurately detecting multi-class 3D objects by incorporating motion clues. This is enabled by the designed Motion-conditioned Label Refinement (MLR) and a Motion-Weighted Training (MWT) scheme.
Both CPD and CPD++ outperform state-of-the-art unsupervised 3D detectors on the WOD and KITTI datasets. Notably, CPD++ enhances CPD by $2\times$ in mAP (see Fig. 3). When CPD++ is trained on WOD and tested on KITTI, it attains $89.25\%$ moderate car 3D AP under a 0.5 IoU threshold, matching $95.3\%$ of its fully supervised counterpart and highlighting the advance of our work.
2 Related Work
2.1 Fully/semi/weakly-supervised 3D Object Detection
Recent fully-supervised 3D detectors build single-stage [10], [13], [38], [56], [65], [66], two-stage [4], [31], [32], [33], [45], [46], [48], [53], or multi-stage [2], [43] deep networks for 3D object detection. Fully supervised methods have achieved significant performance, but the annotation cost is often unacceptable. To reduce the annotation cost, semi-supervised and weakly-supervised 3D object detection methods have gained widespread attention. Semi-supervised methods [18], [28], [39], [50], [54], [63] usually annotate only a portion of the scenarios and then apply teacher-student networks to generate pseudo-labels for unannotated scenarios. For example, 3DIoUMatch [39] laid the groundwork in the domain of outdoor scenes
by pioneering the estimation of 3D Intersection over Union (IoU) as a metric for localization. Weakly supervised methods [25], [41], [58] reduce the annotation cost from the perspective of the annotation form. For example, WS3D [25] proposed point-level labels, generating box-level pseudo-labels from instance-level click annotations. Some other weakly-supervised methods [19], [49] annotate only a single instance per scene to reduce the annotation cost. Different from all of these methods, we aim to design a modern unsupervised detector that is able to accurately detect 3D objects without using any human annotations.
2.2 Self-supervised/Unsupervised 3D Detectors
Previous unsupervised pre-training methods discern latent patterns within unlabeled data by masked labels [52] or contrastive losses [17], [55], but these methods require human labels for fine-tuning [5], [20]. Traditional methods [1], [30], [36] employ ground removal and clustering for 3D object detection without human labels, but suffer from poor detection performance. Some deep learning-based methods generate pseudo-labels from multiple traversals and use the pseudo-labels to train a 3D detector iteratively [23], [57]. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs remarkably. The recent OYSTER [59] generates labels from a single traversal, improves recognition by ignoring short tracklets, and improves localization by temporal consistency. However, most bounding boxes of incomplete objects cannot be recovered by temporal consistency. Besides, many small background objects also generate long tracklets, decreasing recognition precision. Our CPD addresses the localization problem by leveraging the geometry prior from CProto to refine the pseudo-labels and guide network convergence. Our CPD++ addresses the recognition problem by learning the discriminative features of moving objects.
2.3 Prototype-based Methods
Prototype-based methods are widely used in 2D object detection [14], [15], [22], [42], [62] when novel classes are incorporated. Inspired by these methods, Prototypical VoteNet [64] constructs geometric prototypes learned from base classes for few-shot 3D object detection. GPA-3D [16] and CL3D [29] build geometric prototypes from a source-domain model for domain-adaptive 3D detection. However, both learning from base classes and training on the source domain require high-quality annotations. Unlike these approaches, we construct CProto using commonsense knowledge and detect 3D objects in a zero-shot manner without human annotations.
3 Problem Formulation
Given a set of input points $\boldsymbol{x}$, 3D object detection aims to design a function $\mathcal{F}$ that produces optimal detections $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})$, where each detection consists of $[x, y, z, l, w, h, \alpha, \beta]$ representing the position, length, width, height, azimuth angle, and class identity of an object, respectively.
3.1 Fully Supervised 3D Object Detection
Since $\mathcal{F}$ cannot be obtained directly from handcrafted rules, regular methods formulate 3D object detection as a fully supervised learning problem. This requires a large-scale dataset with human labels. The dataset is conventionally divided into training $(\boldsymbol{x}^{\text{train}}, \boldsymbol{y}^{\text{train}})$, validation $(\boldsymbol{x}^{\text{val}}, \boldsymbol{y}^{\text{val}})$, and testing $(\boldsymbol{x}^{\text{test}}, \boldsymbol{y}^{\text{test}})$ splits. During the design process, a detection model with learnable parameters $\theta$ and a loss function $\mathcal{L}^{\text{full}}$ are optimized on the training dataset.
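In its standard form, consistent with the notation above, the learnable parameters are obtained as

$$\theta^{*}=\underset{\theta}{\arg \min }\; \mathcal{L}^{\text{full}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \boldsymbol{y}^{\text{train}}\right).$$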
The detection model $\mathcal{F}_{\theta^{*}}$ generally contains some non-learnable hyperparameters $\eta$, which are optimized using the detection metric $\mathcal{A}$ on the validation dataset:
$$\mathcal{F}_{\theta^{*}}^{\eta^{*}}=\underset{\eta}{\arg \max }\left(\mathcal{A}\left(\mathcal{F}_{\theta^{*}}^{\eta}\left(\boldsymbol{x}^{\text{val}}\right), \boldsymbol{y}^{\text{val}}\right)\right).$$
The testing dataset is used only to compare performance with other methods and cannot be used to tune parameters.
3.2 Unsupervised 3D Object Detection
To decrease labeling cost, the unsupervised setting does not require human labels $\boldsymbol{y}^{\text{train}}$ for the training dataset. The $\boldsymbol{y}^{\text{val}}$ and $\boldsymbol{y}^{\text{test}}$ are still required to choose the best hyperparameters and to compare with other methods, respectively. This paper solves the unsupervised problem with a pseudo-label-based framework. The main challenge lies in how to design a pseudo-label generation function $\mathcal{G}$ and a pseudo-label-based training target $\mathcal{L}^{\text{un}}$ to optimize the detector.
This paper leverages human commonsense knowledge to formulate the label generation function $\mathcal{G}$ and the training target $\mathcal{L}^{\text{un}}$. The commonsense knowledge is formulated as observations or insights. For example, stationary objects share the same position, size, and class across frames; background objects, such as buildings and trees, do not move. We ensure that all such intuitions are shared among different datasets, easy to obtain, and insensitive to specific parameter values.
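Under this framework, and with the same notation as Sec. 3.1, the detector is optimized against the generated pseudo-labels, for instance as

$$\theta^{*}=\underset{\theta}{\arg \min }\; \mathcal{L}^{\text{un}}\left(\mathcal{F}_{\theta}\left(\boldsymbol{x}^{\text{train}}\right), \mathcal{G}\left(\boldsymbol{x}^{\text{train}}\right)\right),$$

where the concrete forms of $\mathcal{G}$ and $\mathcal{L}^{\text{un}}$ are introduced in Secs. 4 and 5.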
4 CPD: Commonsense Prototype for Unsupervised 3D Object Detection
This section introduces the Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. As shown in Fig. 4, CPD consists of three main parts: (1) initial label generation, (2) label refinement, and (3) self-training. CPD formulates the pseudo-label generation function $\mathcal{G}$ by Multi-Frame Clustering (MFC) and CProto-constrained Box Regularization (CBR), and formulates the training target $\mathcal{L}^{\text{un}}$ by the CSS-weighted detection loss and the geometry contrast loss. We detail these designs as follows.
4.1 Initial Label Generation
Recent unsupervised methods [57], [59] detect 3D objects in a class-agnostic way. How to classify objects (e.g., vehicles and pedestrians) without annotation remains an unsolved challenge. Our observation is that the sizes of different classes differ significantly. Therefore, the dense objects accumulated over consecutive frames (see Fig. 5 (b)) can be roughly classified by size thresholds. This motivates us to design a Multi-Frame Clustering (MFC) method to generate initial labels. MFC involves motion artifact removal, clustering, and post-processing.
In line with a recent study [59], we apply ground removal [12], DBSCAN [6], and bounding box fitting [61] on $\boldsymbol{x}_{0}^{*}$ to obtain a set of class-agnostic bounding boxes $\hat{b}$. We then perform a tracking process to associate the boxes across frames. Since objects of the same class typically have similar sizes in 3D space, we calculate class-specific size thresholds to classify $\hat{b}$ into different categories. The size thresholds are derived from two observations: (1) all moving objects are foreground; (2) the size ratios of different classes differ (roughly 2:1:1 for vehicles, 1:1:2 for pedestrians, and 2:1:2 for cyclists). Therefore, for each point cloud sequence, one trajectory per class, with the highest mean speed and the size ratio nearest to that class, is chosen automatically; the size range of these tracklets defines the thresholds for classification (a sketch of this selection is given below). This process results in a set of initial pseudo-labels $\boldsymbol{b}=\{b_{j}\}_{j}$, where $b_{j}=[x, y, z, l, w, h, \alpha, \beta, \tau]$ represents the position, length, width, height, azimuth angle, class identity, and tracking identity, respectively.
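The following sketch illustrates how such size-ratio-based seeding could be implemented. The data layout (`tracklets` carrying `mean_speed` and per-frame `lwh` sizes), the 1 m/s motion cutoff, and the way speed and ratio distance are combined are illustrative assumptions rather than the exact released procedure.

```python
import numpy as np

# Canonical l:w:h ratios from the observation in Sec. 4.1.
CLASS_RATIOS = {"vehicle": (2, 1, 1), "pedestrian": (1, 1, 2), "cyclist": (2, 1, 2)}

def ratio_distance(lwh, ratio):
    """L1 distance between normalized box dimensions and a class ratio."""
    lwh = np.asarray(lwh, dtype=float) / np.sum(lwh)
    ref = np.asarray(ratio, dtype=float) / np.sum(ratio)
    return float(np.abs(lwh - ref).sum())

def seed_size_thresholds(tracklets):
    """Pick, per class, one moving tracklet whose mean size ratio best matches
    that class; its min/max box sizes define the classification thresholds.
    `tracklets` is a list of dicts with 'mean_speed' (m/s) and 'lwh' (N, 3)."""
    thresholds = {}
    moving = [t for t in tracklets if t["mean_speed"] > 1.0]  # moving => foreground
    for cls, ratio in CLASS_RATIOS.items():
        # Prefer fast tracklets whose shape best matches the class ratio.
        best = min(moving, key=lambda t: ratio_distance(t["lwh"].mean(0), ratio)
                   - 0.01 * t["mean_speed"])
        sizes = np.asarray(best["lwh"])
        thresholds[cls] = (sizes.min(0), sizes.max(0))  # per-dimension (min, max)
    return thresholds

def classify_box(lwh, thresholds):
    """Assign a clustered box to the first class whose size range contains it."""
    lwh = np.asarray(lwh, dtype=float)
    for cls, (lo, hi) in thresholds.items():
        if np.all(lwh >= lo) and np.all(lwh <= hi):
            return cls
    return "background"
```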
4.2 CProto-constrained Box Regularization for Label Refinement
As presented in Fig. 5 (d)-(f), initial labels for incomplete objects often suffer from inaccurate sizes and positions. To tackle this issue, we introduce the CProto-constrained Box Regularization (CBR) method. The key idea is to construct a high-quality CProto set, based on unsupervised scoring, from
complete objects, and to use it to refine the pseudo-labels of incomplete objects. Unlike OYSTER [59], which can only refine the pseudo-labels of objects with at least one complete scan, our CBR refines the pseudo-labels of all objects, significantly decreasing the overall size and position errors.
4.2.1 Completeness and Size Similarity (CSS) Scoring
Existing label scoring methods such as IoU scoring [31] are designed for fully supervised detectors. In contrast, we introduce an unsupervised Completeness and Size Similarity (CSS) scoring method, which aims to approximate the IoU score using commonsense knowledge alone.
Distance score. CSS first assesses object completeness based on distance, assuming that labels closer to the ego vehicle are likely to be more accurate. For an initial label $b_{j}$, we normalize its distance to the ego vehicle within the range $[0,1]$ to compute the distance score as
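a minimal form consistent with this description (the symbol $\psi^{dis}$ and the exact normalization are illustrative choices):

$$\psi^{dis}(b_{j})=1-\mathcal{N}\left(\left\|c_{j}\right\|_{2}\right),$$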
where $\mathcal{N}$ is the normalization function and $c_{j}$ is the location of $b_{j}$. However, this distance-based approach has limitations. For example, occluded objects near the ego vehicle, which should receive lower scores, are inadvertently assigned high scores due to their proximity. To mitigate this issue, we introduce a Multi-Level Occupancy (MLO) score, detailed further in Fig. 5 (g).
MLO score. Considering the diverse sizes of objects, we divide the bounding box of the initial label into multiple grids with different length and width resolutions. The MLO score is then calculated as the proportion of grids occupied by cluster points, via
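a minimal form consistent with this description (the symbol $\psi^{mlo}$ is an illustrative choice):

$$\psi^{mlo}(b_{j})=\frac{1}{N^{o}} \sum_{k=1}^{N^{o}} \frac{O^{k}}{r^{k}},$$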
where $N^{o}$ denotes the number of resolutions, $O^{k}$ is the number of occupied grids under the $k$-th resolution, and $r^{k}$ is the total grid number of the $k$-th resolution.
Size Similarity (SS) score. While the distance and MLO scores effectively evaluate localization and size quality, they fall short in assessing classification quality. To bridge this gap, we introduce the SS score. We observe that the size ratios of different classes differ (roughly 2:1:1 for vehicles, 1:1:2 for pedestrians, and 2:1:2 for cyclists),
so we calculate the score by measuring the ratio difference. Specifically, we pre-define a set of size templates (simple size ratios or statistics from Wikipedia) for each class. Then, we calculate the size score by a truncated KL divergence [9]:
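one plausible instantiation, with $\psi^{ss}$ as an illustrative symbol and the truncation realized by clamping the divergence to $[0,1]$, is

$$\psi^{ss}(b_{j})=1-\min \left(1,\; \sum_{\sigma} q_{\sigma}^{a} \log \frac{q_{\sigma}^{a}}{q_{\sigma}^{b}}\right),$$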
where $q_{\sigma}^{a} \in\left\{l^{a}, w^{a}, h^{a}\right\}$ and $q_{\sigma}^{b} \in\left\{l^{b}, w^{b}, h^{b}\right\}$ refer to the normalized length, width, and height of the template and the label, respectively.
We linearly combine the three metrics, $\mathcal{S}(b_{j})=\sum_{i} \omega^{i} \psi^{i}(b_{j})$, to produce the final score, where $\omega^{i}$ is a weighting factor (in this study we adopt a simple average, $\omega^{i}=1/3$). For each $b_{j} \in \boldsymbol{b}$, we compute its CSS score $s_{j}^{css}=\mathcal{S}(b_{j})$ and obtain a set of scores $\boldsymbol{s}=\{s_{j}^{css}\}_{j}$.
4.2.2 CProto Set Construction
Regular learnable prototype-based methods require annotations [16], [64], which are unavailable in the unsupervised setting. We instead construct a high-quality CProto set $\boldsymbol{P}=\{P_{k}\}_{k}$, representing geometry and size centers, based on the CSS score. Here, $P_{k}=\{x_{k}^{p}, b_{k}^{p}\}$, where $x_{k}^{p}$ denotes the inside points and $b_{k}^{p}$ the bounding box. Specifically, we first categorize the initial labels $\boldsymbol{b}$ into different groups based on their tracking identity $\tau$. Within each group, we select the high-quality boxes and inside points that meet a high CSS score threshold $\eta$ (determined on the validation set; 0.8 in this study). Then, we transform all points and boxes into a local coordinate system, and obtain $b_{k}^{p}$ by averaging the boxes and $x_{k}^{p}$ by concatenating all the points (see the sketch below).
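A minimal sketch of this construction follows. The per-label container keys (`box`, `points`, `track`, `css`) and the yaw-aligned local frame are illustrative assumptions, not the exact released implementation.

```python
import numpy as np
from collections import defaultdict

def build_cproto_set(labels, css_thresh=0.8):
    """Group initial labels by tracking identity, keep high-CSS members,
    and fuse them into one prototype per tracklet (Sec. 4.2.2).
    Each label is assumed to be a dict with keys:
      'box'    : (7,) array [x, y, z, l, w, h, yaw] in the world frame,
      'points' : (N, 3) array of inside points (world frame),
      'track'  : tracking identity, 'css' : CSS score."""
    groups = defaultdict(list)
    for lab in labels:
        if lab["css"] >= css_thresh:
            groups[lab["track"]].append(lab)

    prototypes = []
    for track_id, members in groups.items():
        boxes, local_pts = [], []
        for lab in members:
            x, y, z, l, w, h, yaw = lab["box"]
            # Transform inside points into the box-local coordinate system.
            pts = lab["points"] - np.array([x, y, z])
            c, s = np.cos(-yaw), np.sin(-yaw)
            rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
            local_pts.append(pts @ rot.T)
            boxes.append([l, w, h])
        prototypes.append({
            "track": track_id,
            "box": np.mean(boxes, axis=0),            # b_k^p: averaged size
            "points": np.concatenate(local_pts, 0),   # x_k^p: fused dense points
        })
    return prototypes
```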
4.2.3 Box Regularization
We next regularize the initial labels using the size prior from CProto. Intuitively, the height of an initial label is relatively accurate compared with its length and width, because the tops of objects are generally not occluded; statistics on the WOD validation set [37] confirm this intuition (see Fig. 5 (h)). Besides, intra-class 3D objects of the same height have similar lengths and widths. Therefore, for an initial label $b_{j}$ and a CProto $P_{k}$ with the same class identity, we associate them by the minimum difference in box height. Initial pseudo-labels sharing the same $P_{k}$, the same class identity, and similar lengths and widths are naturally grouped together. We then perform re-sizing and re-localization for each group to refine the pseudo-labels (a simplified sketch follows). (1) Re-size. We directly replace the size of $b_{j}$ with the length, width, and height of $b_{k}^{p} \in P_{k}$. (2) Re-localization. Since points lie mainly on the object's surface and boundary, we divide the object into bins and align the box boundary and orientation to the boundary point of the densest part (see Fig. 6). Finally, we obtain improved pseudo-labels $\boldsymbol{b}^{*}=\{b_{j}^{*}\}_{j}$.
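The sketch below illustrates the two steps under simplifying assumptions: `prototypes` is already restricted to the label's class (and follows the layout of the previous sketch), and the bin-based boundary alignment is reduced to a 1D alignment along the heading axis.

```python
import numpy as np

def regularize_box(label, prototypes):
    """CBR sketch (Sec. 4.2.3): replace the label size with that of the
    height-matched prototype, then shift the center so the box boundary
    stays anchored on the denser (better observed) side along the heading."""
    x, y, z, l, w, h, yaw = label["box"]

    # (1) Re-size: associate by minimum height difference within the class.
    proto = min(prototypes, key=lambda p: abs(p["box"][2] - h))
    new_l, new_w, new_h = proto["box"]

    # (2) Re-localization: project inside points onto the heading direction.
    heading = np.array([np.cos(yaw), np.sin(yaw)])
    offsets = (label["points"][:, :2] - np.array([x, y])) @ heading
    front, back = offsets.max(), offsets.min()

    # Anchor the resized box on the side with more observed points.
    if np.sum(offsets > 0) >= np.sum(offsets < 0):
        center_shift = front - new_l / 2.0   # keep the front face fixed
    else:
        center_shift = back + new_l / 2.0    # keep the rear face fixed
    x, y = np.array([x, y]) + center_shift * heading

    return {"box": np.array([x, y, z, new_l, new_w, new_h, yaw]),
            "points": label["points"]}
```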
Fig. 6. The size of the initial label is replaced by that of the CProto box, and the position is also corrected.
Recent methods [57], [59] utilize pseudo-labels to train 3D detectors. However, even after refinement, some pseudo-labels remain inaccurate, diminishing the effectiveness of correct supervision and potentially misleading the training process. To tackle these issues, we propose two designs: (1) a CSS-Weighted Detection Loss, which assigns different training weights based on label quality to suppress false supervision signals; and (2) a Geometry Contrast Loss, which aligns predictions from sparsely scanned points with the dense CProto, thereby improving feature consistency.
CSS Weight. Since false pseudo-labels may mislead network convergence, we first calculate a loss weight based on label quality. Formally, we convert the CSS score $s_{i}^{css}$ of a pseudo-label to a weight $\omega_{i}$ as
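a minimal form consistent with the thresholds defined below (the linear ramp between them is an assumed choice):

$$\omega_{i}= \begin{cases}1, & s_{i}^{css} \geq S_{H}, \\ \dfrac{s_{i}^{css}-S_{L}}{S_{H}-S_{L}}, & S_{L}<s_{i}^{css}<S_{H}, \\ 0, & s_{i}^{css} \leq S_{L},\end{cases}$$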
where $S_{H}$ and $S_{L}$ are the high- and low-quality thresholds (we empirically set 0.7 and 0.4, respectively).
CSS-weighted Detection Loss. To decrease the influence of false labels, we formulate the CSS-weighted detection loss over $N$ proposals as
$$\mathcal{L}_{det}^{css}=\frac{1}{N} \sum_{i} \omega_{i}\left(\mathcal{L}_{i}^{pro}+\mathcal{L}_{i}^{det}\right),$$
where $\mathcal{L}_{i}^{pro}$ and $\mathcal{L}_{i}^{det}$ are the detection losses [4] of $\mathcal{F}^{pro}$ and $\mathcal{F}^{det}$, respectively. The losses are calculated from the pseudo-labels $\boldsymbol{b}^{*}$ and the network predictions.
Geometry Contrast Loss. We formulate two contrast losses that minimize the feature and predicted-box differences between the prototype and detection networks. (1) Feature contrast loss. For a foreground RoI $r_{i}$ from the detection network, we extract features $\boldsymbol{f}_{i}^{p}$ from the prototype network by voxel set abstraction [4], and features $\boldsymbol{f}_{i}^{d}$ from the detection network. We then formulate the contrast loss by cosine distance:
$$\mathcal{L}_{feat}^{css}=-\frac{1}{N^{f}} \sum_{i} \omega_{i} \frac{\boldsymbol{f}_{i}^{d} \cdot \boldsymbol{f}_{i}^{p}}{\left\|\boldsymbol{f}_{i}^{d}\right\|\left\|\boldsymbol{f}_{i}^{p}\right\|},$$
where $\mathcal{I}$ denotes the IoU function; $c_{i}^{d}, \alpha_{i}^{d}$ refer to the position and angle of $d_{i}^{d}$; and $c_{i}^{p}, \alpha_{i}^{p}$ to the position and angle of $d_{i}^{p}$. We finally sum all losses to train the detector.
Although the proposed CPD framework achieves promising performance on large vehicle objects, it suffers from low detection accuracy on small objects. We propose CPD++ for more accurate multi-class unsupervised 3D object detection. CPD++ formulates the pseudo-label generation function $\mathcal{G}$ by Motion-conditioned Label Refinement (MLR) and the training target $\mathcal{L}^{\text{un}}$ by a motion-weighted detection loss.
5.1 Pilot Experiments and Discussion
Prevalent methods apply iterative training to further boost detection performance [57], [59]. By leveraging the generalization capability inherent in deep networks, iterative training can identify more object patterns. However, false labels with inaccurate classes and bounding boxes may mislead the training of the next iteration. Recent works apply score-based filtering [51] or temporal filtering [59] to diminish false supervision signals. By employing these techniques, it is possible to further improve the multi-class performance of CPD. We begin our design by constructing a strong baseline, the Vanilla Iterative Training (VaIT) method. As presented in Fig. 7 (a), VaIT utilizes the detection results from CPD as initial labels and improves label quality by tracking, temporal filtering, and score thresholding. By default, VaIT applies Kalman filtering for tracking [44] and ignores short tracklets with fewer than six detections [59]. We conduct pilot experiments on the WOD and
train VaIT for two iterations, similar to OYSTER [59]. We present the detection performance improvement in Fig. 7 (b) (1) to (6), where different score thresholds ranging from 0.1 to 0.6 are used (a sketch of the label-cleaning step follows).
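A minimal sketch of the VaIT label-cleaning step, assuming per-frame detections have already been associated into tracklets by a Kalman-filter tracker (the data layout and the default threshold value are illustrative):

```python
def filter_vait_labels(tracklets, score_thresh=0.3, min_length=6):
    """Keep only detections that (a) belong to a tracklet spanning at least
    `min_length` frames and (b) exceed the confidence threshold; the survivors
    become the pseudo-labels for the next training iteration (Sec. 5.1).
    Each tracklet is assumed to be a list of dicts with 'box' and 'score'."""
    pseudo_labels = []
    for track in tracklets:
        if len(track) < min_length:           # temporal filtering: drop short tracklets
            continue
        for det in track:
            if det["score"] >= score_thresh:  # score thresholding
                pseudo_labels.append(det)
    return pseudo_labels
```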