Traditional 3D object detectors, operating under fully, semi-, or weakly supervised paradigms, require extensive human annotations. In contrast, this paper endeavors to design an unsupervised 3D object detector that discerns object patterns automatically without relying on such annotations. To this end, we first introduce a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD initially constructs a Commonsense Prototype (CProto) set to represent the geometry center and size of objects. Subsequently, CPD produces high-quality pseudo-labels and guides detector convergence with the size and geometry priors from CProto. Based on CPD, we further propose CPD++, an enhanced version that boosts performance by exploiting motion clues. CPD++ learns localization from stationary objects and recognition from moving objects, and facilitates the mutual transfer of localization and recognition knowledge between these two object types. Both CPD and CPD++ demonstrate superior performance over existing state-of-the-art unsupervised 3D detectors. Furthermore, when trained on WOD and tested on KITTI, CPD++ attains 89.25% 3D Average Precision (AP) on the moderate car class under a 0.5 IoU threshold. This approximates 95.3% of the performance achieved by fully supervised counterparts, underscoring the advancement of our method.
Index Terms—3D object detection, unsupervised learning, LiDAR point clouds, pattern recognition.
Unfortunately, unsupervised 3D object detection presents exceptionally difficult challenges as follows: (1) Recognition challenge. Within the context of autonomous driving, point clouds often contain a mixture of objects. There are no geometric criteria with a clear decision boundary to segment background and foreground objects. Numerous traditional methods leveraged ground removal [12] and clustering techniques [60] to differentiate objects from their surroundings. However, background structures, such as buildings and fences, are inevitably misclassified as foreground objects (see Fig. 2 (a)). Recent sophisticated approaches build accurate seed labels through heuristics and identify common patterns by training a deep network iteratively. For example, MODEST [57] and DRIFT [23] formulate the seed labels by the Persistence Point Score (PPScore). Then, the pseudo-labels are utilized to train a regular detector. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs substantially. As of now, attaining unsupervised object recognition without any increase in labor costs remains a tough challenge. (2) Localization challenge. Almost all objects are self-occluded in autonomous driving scenarios. Besides, with increasing scanning distance, the number of captured points decreases dramatically. Consequently, the objects in point clouds are mostly geometry-incomplete, posing substantial challenges for the accurate localization of 3D bounding boxes (see Fig. 2 (b)). Early methods leveraged a straightforward bounding box fitting algorithm [36] for object localization. However, these methods frequently struggle to deliver satisfactory performance due to the sparsity and occlusion of objects. Advanced methods generate initial pseudo-labels from fitted labels and use these to train deep networks
iteratively. They incorporate human knowledge to refine the position and size of the pseudo-labels iteratively. A subset of objects, denoted as complete objects, benefit from having at least one complete scan across the entire point cloud sequence, allowing their pseudo-labels to be refined through temporal consistency [59]. However, most objects, termed incomplete objects, lack full scan coverage and cannot be recovered by temporal consistency.
To tackle these issues, this paper first proposes a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection (see Fig. 2 (e)). CPD is built upon two key insights: (1) Objects of the same class have similar sizes and can be roughly classified by size differences. (2) Nearby stationary objects are complete in consecutive frames and can be localized accurately by shape and position measures. Our idea is to construct a Commonsense Prototype (CProto) set representing the geometry and size of objects to refine the pseudo-labels and constrain the detector training. To this end, we first design a Multi-Frame Clustering (MFC) method that yields multi-class pseudo-labels by size-based thresholding. Subsequently, we introduce an unsupervised Completeness and Size Similarity (CSS) score that selects high-quality labels to construct the CProto set. Furthermore, we design a CProto-constrained Box Regularization (CBR) that refines the sizes of pseudo-labels. In addition, we develop CProto-constrained Self-Training (CST) that improves the detection accuracy by the geometry knowledge from CProto.
Fig. 3. Performance comparison. Both CPD and CPD++ achieve state-of-the-art unsupervised 3D object detection performance. CPD++ improves CPD by $2\times$ in terms of mAP.
Fig. 2 (d)). (3) The localization and recognition knowledge are complementary and transferable between each other. Our idea is to learn localization from stationary objects, learn recognition from moving objects, and transfer this knowledge between the two to boost detection performance. The core designs are a Motion-conditioned Label Refinement (MLR) module that produces high-quality pseudo-labels and a Motion-Weighted Training (MWT) scheme that learns localization and recognition mutually.
The effectiveness of our design is verified by experiments on the widely used Waymo Open Dataset (WOD) [37] and KITTI dataset [8]. The main contributions include:
We propose CPD that integrates commonsense knowledge into prototypes for unsupervised 3D object detection. CPD adopts CProto-constrained Box Regularization (CBR) for accurate initial pseudo-label generation and CProto-constrained Self-Training (CST) for unsupervised learning.
We propose CPD++ that takes a step toward more accurately detecting multi-class 3D objects by incorporating motion clues. This is enabled by the designed Motion-conditioned Label Refinement (MLR) and a Motion-Weighted Training (MWT) scheme.
Both CPD and CPD++ outperform state-of-the-art unsupervised 3D detectors on the WOD and KITTI datasets. Notably, CPD++ enhances CPD by $2\times$ mAP (see Fig. 3). When CPD++ is trained on WOD and tested on KITTI, it attains 89.25% moderate car 3D AP under a 0.5 IoU threshold. This performance matches 95.3% of the fully supervised counterpart, highlighting the advance of our work.
2 Related Work
2.1 Fully/semi/weakly-supervised 3D Object Detection
Recent fully-supervised 3D detectors build single-stage [10], [13], [38], [56], [65], [66], two-stage [4], [31], [32], [33], [45], [46], [48], [53], or multi-stage [2], [43] deep networks for 3D object detection. Fully supervised 3D object detection methods have achieved significant performance, but the annotation cost is often unacceptable. To reduce the annotation cost, semi-supervised and weakly-supervised 3D object detection methods have gained widespread attention. Semi-supervised methods [18], [28], [39], [50], [54], [63] usually annotate only a portion of the scenarios and then apply teacher-student networks to generate pseudo-labels for unannotated scenarios. For example, 3DIoUMatch [39] laid the groundwork in the domain of outdoor scenes
by pioneering the estimation of 3D Intersection over Union (IoU) as a metric for localization. Weakly supervised methods [25], [41], [58] reduce the annotation cost from the perspective of the annotation form. For example, WS3D [25] proposed point-level labels, which generate box-level pseudo-labels from instance-level click labels. Some other weakly-supervised methods [19], [49] only annotate a single instance per scene to reduce the annotation cost. Different from all of these methods, we aim to design a modern unsupervised detector that is able to accurately detect 3D objects without using any human-level annotations.
2.2 Self-supervised/Unsupervised 3D Detectors
Previous unsupervised pre-training methods discern latent patterns within unlabeled data by masked labels [52] or contrastive losses [17], [55]. But these methods require human labels for fine-tuning [5], [20]. Traditional methods [1], [30], [36] employ ground removal and clustering for 3D object detection without human labels, but suffer from poor detection performance. Some deep learning-based methods generate pseudo-labels by multiple traversals and use the pseudo-labels to train a 3D detector iteratively [23], [57]. Nonetheless, these methods require data collection from the same location multiple times, increasing labor costs remarkably. The recent OYSTER [59] generates labels from a single traversal, improves recognition by ignoring short tracklets, and improves localization by temporal consistency. However, most bounding boxes of incomplete objects cannot be recovered by temporal consistency. Besides, many small background objects also generate long tracklets, decreasing recognition precision. Our CPD addresses the localization problem by leveraging the geometry prior from CProto to refine the pseudo-labels and guide the network convergence. Our CPD++ addresses the recognition problem by learning the discriminative features of moving objects.
2.3 Prototype-based Methods
Prototype-based methods are widely used in 2D object detection [14], [15], [22], [42], [62] when novel classes are incorporated. Inspired by these methods, Prototypical VoteNet [64] constructs geometric prototypes learned from basic classes for few-shot 3D object detection. GPA-3D [16] and CL3D [29] build geometric prototypes from a source-domain model for domain-adaptive 3D detection. However, both learning from basic classes and training on the source domain require high-quality annotations. Unlike that, we construct CProto using commonsense knowledge and detect 3D objects in a zero-shot manner without human-level annotations.
3 Problem Formulation
Given a set of input points $\boldsymbol{x}$, 3D object detection aims to design a function $\mathcal{F}$ that produces optimal detections $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})$, where each detection consists of $[x, y, z, l, w, h, \alpha, \beta]$ representing the position, length, width, height, azimuth angle, and class identity of an object, respectively.
3.1 Fully Supervised 3D Object Detection
Since the F\mathcal{F} can’t be obtained directly by some handcrafted rule, regular methods formulate the 3D object detection problem as a fully supervised learning problem. It requires a large-scale dataset with human labels. The dataset is conventionally divided into training x^("train "),y^("train ")\boldsymbol{x}^{\text {train }}, \boldsymbol{y}^{\text {train }}, validation x^(val),y^(val)\boldsymbol{x}^{v a l}, \boldsymbol{y}^{v a l} and testing datasets x^("test "),y^("test ")\boldsymbol{x}^{\text {test }}, \boldsymbol{y}^{\text {test }}. During the designing process, a detection model with learnable parameters theta\theta and loss function L^("full ")\mathcal{L}^{\text {full }} are optimized from the training dataset. 由于 F\mathcal{F} 无法直接通过一些手工规则获得,常规方法将 3D 物体检测问题形式化为完全监督学习问题。它需要一个具有人工标签的大规模数据集。数据集通常划分为训练 x^("train "),y^("train ")\boldsymbol{x}^{\text {train }}, \boldsymbol{y}^{\text {train }} 、验证 x^(val),y^(val)\boldsymbol{x}^{v a l}, \boldsymbol{y}^{v a l} 和测试 x^("test "),y^("test ")\boldsymbol{x}^{\text {test }}, \boldsymbol{y}^{\text {test }} 数据集。在设计过程中,从训练数据集优化可学习参数 theta\theta 和损失函数 L^("full ")\mathcal{L}^{\text {full }} 的检测模型。
The detection model $\mathcal{F}_{\theta^{*}}$ generally involves some non-learnable hyperparameters $\eta$, which are optimized by the detection metric $\mathcal{A}$ on the validation dataset.
$$\mathcal{F}_{\theta^{*}}^{\eta^{*}}=\underset{\eta}{\arg \max }\left(\mathcal{A}\left(\mathcal{F}_{\theta^{*}}^{\eta}\left(\boldsymbol{x}^{\text{val}}\right), \boldsymbol{y}^{\text{val}}\right)\right).$$
The testing dataset is only used to compare performance with other methods and cannot be used to tune parameters.
3.2 Unsupervised 3D Object Detection
To decrease labeling cost, the unsupervised setting does not require human labels $\boldsymbol{y}^{\text{train}}$ for the training dataset. The $\boldsymbol{y}^{\text{val}}$ and $\boldsymbol{y}^{\text{test}}$ are still required to choose the best hyperparameters and to compare with other methods, respectively. This paper solves the unsupervised problem with a pseudo-label-based framework. The main challenge lies in how to design a pseudo-label generation function $\mathcal{G}$ and how to design a pseudo-label-based training target $\mathcal{L}^{\text{un}}$ to optimize the detector.
This paper leverages human commonsense knowledge to formulate the label generation function $\mathcal{G}$ and the training target $\mathcal{L}^{\text{un}}$. The commonsense knowledge is formulated as observations or insights. For example, stationary objects share the same position, size, and class across frames; all background objects, such as buildings and trees, do not move. We ensure all the intuitions are shared among different datasets, easy to obtain, and not sensitive to a specific value.
4 CPD: Commonsense Prototype for Unsupervised 3D Object Detection
This section introduces the Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. As shown in Fig. 4, CPD consists of three main parts: (1) initial label generation, (2) label refinement, and (3) self-training. CPD formulates the pseudo-label generation function $\mathcal{G}$ by Multi-Frame Clustering (MFC) and CProto-constrained Box Regularization (CBR), and formulates the training target $\mathcal{L}^{\text{un}}$ by the CSS-weighted detection loss and the geometry contrast loss. We detail the designs as follows.
4.1 Initial Label Generation
Recent unsupervised methods [57], [59] detect 3D objects in a class-agnostic way. How to classify objects (e.g., vehicles and pedestrians) without annotation is still an unsolved challenge. Our observations indicate that the sizes of different classes are significantly different. Therefore, the dense objects in consecutive frames (see Fig. 5 (b)) can be roughly classified by size thresholds. This motivates us to design a Multi-Frame Clustering (MFC) method to generate initial labels. MFC involves motion artifact removal, clustering, and post-processing.
In line with a recent study [59], we apply ground removal [12], DBSCAN [6], and bounding box fitting [61] on $\boldsymbol{x}_{0}^{*}$ to obtain a set of class-agnostic bounding boxes $\hat{b}$. We then perform a tracking process to associate the boxes across frames. Since objects of the same class typically have similar sizes in 3D space, we calculate class-specific size thresholds to classify $\hat{b}$ into different categories. The size threshold is calculated based on two observations. (1) All moving objects are foreground. (2) The size ratios of different classes are different (the vehicle is around 2:1:1, the pedestrian is around 1:1:2, and the cyclist is around 2:1:2). Therefore, for each point cloud sequence, one trajectory per class, namely the one with the highest mean speed and the size ratio nearest to that class, is automatically chosen. The size range of these tracklets defines the size thresholds for classification, as sketched below. This process results in a set of initial pseudo-labels $\boldsymbol{b}=\{b_{j}\}_{j}$, where $b_{j}=[x, y, z, l, w, h, \alpha, \beta, \tau]$ represents the position, length, width, height, azimuth angle, class identity, and tracking identity, respectively.
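A minimal sketch (not the authors' implementation) of this size-based class assignment is given below; the ratio templates follow the paper, while the box layout [x, y, z, l, w, h, yaw], the tracklet format, and the combined selection rule are illustrative assumptions.

import numpy as np

# Ratio templates from the paper: vehicle ~2:1:1, pedestrian ~1:1:2, cyclist ~2:1:2 (l:w:h).
RATIO_TEMPLATES = {
    "vehicle": np.array([2.0, 1.0, 1.0]),
    "pedestrian": np.array([1.0, 1.0, 2.0]),
    "cyclist": np.array([2.0, 1.0, 2.0]),
}

def ratio_distance(size_lwh, template):
    """L1 distance between normalized (l, w, h) ratios."""
    a = size_lwh / size_lwh.sum()
    b = template / template.sum()
    return float(np.abs(a - b).sum())

def pick_anchor_tracklets(tracklets):
    """For each class, pick the tracklet whose mean size ratio is closest to the class
    template, breaking ties by higher mean speed (one way to combine the paper's two
    criteria). Each tracklet is a list of boxes [x, y, z, l, w, h, yaw]."""
    anchors = {}
    for cls, tpl in RATIO_TEMPLATES.items():
        best_key, best = None, None
        for trk in tracklets:
            boxes = np.stack(trk)
            speed = (np.linalg.norm(np.diff(boxes[:, :3], axis=0), axis=1).mean()
                     if len(trk) > 1 else 0.0)
            key = (ratio_distance(boxes[:, 3:6].mean(0), tpl), -speed)
            if best_key is None or key < best_key:
                best_key, best = key, trk
        anchors[cls] = best
    return anchors

def size_thresholds(anchors):
    """Use the min/max size of each anchor tracklet as the per-class size range."""
    return {cls: (np.stack(trk)[:, 3:6].min(0), np.stack(trk)[:, 3:6].max(0))
            for cls, trk in anchors.items() if trk is not None}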
4.2 CProto-constrained Box Regularization for Label Refinement
As presented in Fig. 5 (d)-(f), initial labels for incomplete objects often suffer from inaccuracies in sizes and positions. To tackle this issue, we introduce the CProto-constrained Box Regularization (CBR) method. The key idea is to construct a high-quality CProto set based on unsupervised scoring from
complete objects to refine the pseudo-labels of incomplete objects. Unlike OYSTER [59], which can only refine the pseudo-labels of objects with at least one complete scan, our CBR can refine the pseudo-labels of all objects, significantly decreasing the overall size and position errors.
4.2.1 Completeness and Size Similarity (CSS) Scoring
Existing label scoring methods such as IoU scoring [31] are designed for fully supervised detectors. In contrast, we introduce an unsupervised Completeness and Size Similarity (CSS) scoring method. It aims to approximate the IoU score using commonsense knowledge alone.
Distance score. CSS first assesses object completeness based on distance, assuming labels closer to the ego vehicle are likely to be more accurate. For an initial label $b_{j}$, we normalize the distance to the ego vehicle within the range $[0,1]$ to compute the distance score as
where $\mathcal{N}$ is the normalization function and $c_{j}$ is the location of $b_{j}$. However, this distance-based approach has its limitations. For example, occluded objects near the ego vehicle, which should receive lower scores, are inadvertently assigned high scores due to their proximity. To mitigate this issue, we introduce a Multi-Level Occupancy (MLO) score, further detailed in Fig. 5 (g).
MLO score. Considering the diverse sizes of objects, we divide the bounding box of the initial label into multiple grids with different length and width resolutions. The MLO score is then calculated by determining the proportion of grids occupied by cluster points, via
where $N^{o}$ denotes the number of resolutions, $O^{k}$ is the number of occupied grids at the $k$-th resolution, and $r^{k}$ is the number of grids at the $k$-th resolution.
Size Similarity (SS) score. While the distance and MLO scores effectively evaluate the localization and size quality, they fall short in assessing classification quality. To bridge this gap, we introduce the SS score. We observe that the size ratios of different classes are different (the vehicle is around 2:1:1, the pedestrian is around 1:1:2, and the cyclist is around 2:1:2),
so we calculate the score by measuring the ratio difference. Specifically, we pre-define a set of size templates (simple size ratios or statistics from Wikipedia) for each class. Then, we calculate the size score by a truncated KL divergence [9].
where $q_{\sigma}^{a} \in\{l^{a}, w^{a}, h^{a}\}$ and $q_{\sigma}^{b} \in\{l^{b}, w^{b}, h^{b}\}$ refer to the normalized length, width, and height of the template and label.
We linearly combine the three metrics, $\mathcal{S}(b_{j})=\sum_{i} \omega^{i} \psi^{i}(b_{j})$, to produce the final score, where $\omega^{i}$ is the weighting factor (in this study we adopt a simple average, $\omega^{i}=1/3$). For each $b_{j} \in \boldsymbol{b}$, we compute its CSS score $s_{j}^{css}=\mathcal{S}(b_{j})$ and obtain a set of scores $s=\{s_{j}^{css}\}_{j}$.
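A minimal sketch of the three CSS terms and their average is given below (not the authors' implementation; the grid resolutions, the max-range normalization constant, and the exact truncation of the KL term are illustrative assumptions).

import numpy as np

def distance_score(center, max_range=75.2):
    """Closer objects are assumed more complete; normalize BEV distance into [0, 1]."""
    return 1.0 - min(np.linalg.norm(center[:2]) / max_range, 1.0)

def mlo_score(points_local, box_lwh, resolutions=(4, 8, 16)):
    """Fraction of BEV grid cells occupied by cluster points, averaged over several
    resolutions. `points_local` are points expressed in the box frame (centered)."""
    l, w = box_lwh[0], box_lwh[1]
    scores = []
    for r in resolutions:
        ix = np.clip(((points_local[:, 0] / l + 0.5) * r).astype(int), 0, r - 1)
        iy = np.clip(((points_local[:, 1] / w + 0.5) * r).astype(int), 0, r - 1)
        scores.append(len(set(zip(ix.tolist(), iy.tolist()))) / float(r * r))
    return float(np.mean(scores))

def ss_score(box_lwh, template_lwh, eps=1e-6):
    """Size similarity via a truncated KL divergence between normalized size ratios."""
    p = np.asarray(box_lwh, dtype=float)
    q = np.asarray(template_lwh, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    kl = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return max(1.0 - kl, 0.0)  # truncate into [0, 1]

def css_score(center, points_local, box_lwh, template_lwh):
    """Equal-weight average of the distance, MLO, and SS terms (omega^i = 1/3)."""
    return (distance_score(center) + mlo_score(points_local, box_lwh)
            + ss_score(box_lwh, template_lwh)) / 3.0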
4.2.2 CProto Set Construction
Regular learnable prototype-based methods require annotations [16], [64], which are unavailable in the unsupervised setting. We construct a high-quality CProto set $\boldsymbol{P}=\{P_{k}\}_{k}$, representing geometry and size centers, based on the CSS score. Here, $P_{k}=\{x_{k}^{p}, b_{k}^{p}\}$, where $x_{k}^{p}$ indicates the inside points and $b_{k}^{p}$ refers to the bounding box. Specifically, we first categorize the initial labels $\boldsymbol{b}$ into different groups based on their tracking identity $\tau$. Within each group, we select the high-quality boxes and inside points that meet a high CSS score threshold $\eta$ (determined on the validation set, using 0.8 in this study). Then, we transform all points and boxes into a local coordinate system, and obtain $b_{k}^{p}$ by averaging the boxes and $x_{k}^{p}$ by concatenating all the points.
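A minimal sketch of this construction (not the authors' code) is shown below; the label layout [x, y, z, l, w, h, yaw, class, track_id], the per-label point arrays, and the restriction of box averaging to the size dimensions in the local frame are illustrative assumptions.

import numpy as np
from collections import defaultdict

def build_cproto(labels, points_per_label, css_scores, eta=0.8):
    """Group pseudo-labels by tracking id, keep those with CSS score >= eta, then
    average their sizes and concatenate their box-frame points into one prototype."""
    groups = defaultdict(list)
    for j, box in enumerate(labels):
        if css_scores[j] >= eta:
            groups[int(box[8])].append(j)
    protos = []
    for track_id, idx in groups.items():
        sizes = np.stack([labels[j][3:6] for j in idx])                    # (M, 3) l, w, h
        pts = np.concatenate([points_per_label[j] for j in idx], axis=0)   # box-frame points
        protos.append({"track_id": track_id,
                       "class_id": int(labels[idx[0]][7]),
                       "size": sizes.mean(axis=0),
                       "points": pts})
    return protos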
4.2.3 Box Regularization
We next regularize the initial labels by the size prior from CProto. Intuitively, the height of an initial label is relatively correct compared to its length and width because the tops of objects are generally not occluded. The statistics on the WOD validation set [37] confirm this intuition (see Fig. 5 (h)). Besides, intra-class 3D objects with the same height have similar lengths and widths. Therefore, for an initial label $b_{j}$ and a CProto $P_{k}$ with the same class identity, we perform an association by the minimum difference in box height. The initial pseudo-labels with the same $P_{k}$, same class identity, similar length, and similar width are naturally classified into the same group. We then perform re-sizing and re-localization for each group to refine the pseudo-labels. (1) Re-size. We directly replace the size of $b_{j}$ with the length, width, and height of $b_{k}^{p} \in P_{k}$. (2) Re-location. Since points mainly lie on the object's surface and boundary, we divide the object into different bins and align the box boundary and orientation to the boundary point of the densest part (see Fig. 6). Finally, we obtain improved pseudo-labels $\boldsymbol{b}^{*}=\{b_{j}^{*}\}_{j}$.
Recent methods [57], [59] utilize pseudo-labels for training 3D detectors. However, even after refinement, some pseudo-labels remain inaccurate, diminishing the effectiveness of correct supervision and potentially misleading the training process. To tackle these issues, we propose two designs: (1) CSS-Weighted Detection Loss assigns different training
Fig. 6. The size of the initial label is replaced by the CProto box, and the position is also corrected.
weights based on label quality to suppress false supervision signals. (2) Geometry Contrast Loss, which aligns predictions of sparsely scanned points with the dense CProto, thereby improving feature consistency.
CSS Weight. Considering that false pseudo-labels may mislead the network convergence, we first calculate a loss weight based on different label qualities. Formally, we convert the CSS score $s_{i}^{css}$ of a pseudo-label to
where $S_{H}$ and $S_{L}$ are high/low-quality thresholds (we empirically set 0.7 and 0.4, respectively).
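A piecewise-linear ramp between the two thresholds is one natural form of this conversion; the sketch below assumes that form and is not necessarily the exact mapping used.

def css_weight(s_css, s_h=0.7, s_l=0.4):
    """Map a CSS score to a training weight (assumed piecewise-linear ramp between
    the low-quality and high-quality thresholds)."""
    if s_css >= s_h:
        return 1.0
    if s_css <= s_l:
        return 0.0
    return (s_css - s_l) / (s_h - s_l)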
CSS-weighted Detection Loss. To decrease the influence of false labels, we formulate the CSS-weighted detection loss to refine $N$ proposals
$$\mathcal{L}_{det}^{css}=\frac{1}{N} \sum_{i} \omega_{i}\left(\mathcal{L}_{i}^{pro}+\mathcal{L}_{i}^{det}\right)$$
where $\mathcal{L}_{i}^{pro}$ and $\mathcal{L}_{i}^{det}$ are the detection losses [4] of $\mathcal{F}^{pro}$ and $\mathcal{F}^{det}$, respectively. The losses are computed from the pseudo-labels $\boldsymbol{b}^{*}$ and the network predictions.
Geometry Contrast Loss. We formulate two contrast losses that minimize the feature and predicted box differences between the prototype and detection networks. (1) Feature contrast loss. For a foreground RoI $r_{i}$ from the detection network, we extract features $\boldsymbol{f}_{i}^{p}$ from the prototype network by voxel set abstraction [4], and extract features $\boldsymbol{f}_{i}^{d}$ from the detection network. We then formulate the contrast loss by cosine distance:
$$\mathcal{L}_{feat}^{css}=-\frac{1}{N^{f}} \sum_{i} \omega_{i} \frac{\boldsymbol{f}_{i}^{d} \cdot \boldsymbol{f}_{i}^{p}}{\left\|\boldsymbol{f}_{i}^{d}\right\|\left\|\boldsymbol{f}_{i}^{p}\right\|}$$
where $I$ denotes the IoU function; $c_{i}^{d}, \alpha_{i}^{d}$ refer to the position and angle of $d_{i}^{d}$, and $c_{i}^{p}, \alpha_{i}^{p}$ refer to the position and angle of $d_{i}^{p}$. We finally sum all losses to train the detector.
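The feature contrast term above can be implemented compactly; the sketch below is a minimal PyTorch rendering (assuming (N_f, C) RoI feature tensors and a per-RoI CSS weight vector), not the authors' code.

import torch
import torch.nn.functional as F

def feature_contrast_loss(f_det, f_proto, css_weights):
    """CSS-weighted cosine contrast between detection and prototype RoI features.
    f_det, f_proto: (N_f, C) tensors; css_weights: (N_f,) tensor."""
    cos = F.cosine_similarity(f_det, f_proto, dim=1)   # per-RoI cosine similarity
    return -(css_weights * cos).mean()                 # negative sign: maximize similarity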
Although the proposed CPD framework achieves promising performance on large-scale vehicle objects, it suffers from low detection accuracy on small objects. We propose CPD++ for more accurate multi-class unsupervised 3D object detection. CPD++ formulates the pseudo-label generation function $\mathcal{G}$ by a Motion-conditioned Label Refinement (MLR) and formulates the training target $\mathcal{L}^{\text{un}}$ by a motion-weighted detection loss.
5.1 Pilot Experiments and Discussion
Prevalent methods apply iterative training to further boost detection performance [57], [59]. By leveraging the generalization capabilities inherent in deep networks, iterative training can identify more object patterns. However, false labels with inaccurate classes and bounding boxes may mislead the training of the next iteration. Recent works apply score-based filtering [51] or temporal-based filtering [59] to diminish the false supervision signals. By employing these techniques, it is possible to further improve the multi-class performance of CPD. We begin our design by first constructing a strong baseline, the Vanilla Iterative Training (VaIT) method. As presented in Fig. 7 (a), VaIT utilizes the detection results from CPD as initial labels and improves the label quality by tracking, temporal filtering, and score thresholding. By default, VaIT applies Kalman Filtering to perform tracking [44] and ignores short tracklets with fewer than six detections [59]. We conduct pilot experiments on the WOD and
train VaIT for two iterations, similarly to OYSTER [59]. We present the detection performance improvement in Fig. 7 (b) (1) to (6), where different score thresholds ranging from 0.1 to 0.6 are used.
To tackle this issue, we further investigate stronger and more reliable commonsense clues from humans. (1) Stationarity. Foreground objects may stay somewhere for a while, and are then termed stationary objects. Intuitively, stationary objects scanned with dense points in consecutive frames share the same position and size. Therefore, the bounding boxes of their pseudo-labels can be accurately refined. Nevertheless, since background objects (e.g., lamp poles and walls) are also stationary and of similar shapes, their class cannot be judged correctly. (2) Movability. Foreground objects may move for a while, and are then named moving objects. Moving objects do not share the same position in consecutive frames. Therefore, the bounding boxes of their pseudo-labels cannot be estimated accurately, but their class can be accurately judged because no moving background objects exist. (3) Transferability and complementarity. During inference, the object patterns between stationary and moving objects are similar in the input data. By utilizing the equivariance of convolutional nets, the learned localization and recognition knowledge are transferable and complementary to each other.
Bearing this in mind, we design CPD++, a stronger unsupervised 3D object detector that accurately performs
We first design the Motion-conditioned Label Refinement (MLR) to produce high-quality pseudo-labels. The key idea is to classify the labels into two parts by motion state and to refine the labels based on the properties of stationarity and movability, respectively. Different from the tracklet ignoring in OYSTER [59] and the threshold-based classification [47] in CPD, which cannot select foreground labels correctly, our MLR identifies accurate recognition and localization seed labels from moving and stationary objects, respectively.
5.2.1 Pseudo Labels Generation
CPD++ first generates a set of pseudo-labels by the pre-trained detection model $\mathcal{F}^{det}$ from CPD. Inspired by recent works [21] that apply augmented testing [27] to improve perception performance, we employ this method here to improve the diversity of the initial pseudo-labels. Specifically, we apply $m$ rotation transformations to the $N^{x}$ points $\boldsymbol{x}_{n} \in \mathbb{R}^{N^{x}, 3}$ and obtain $m$ (6 by default) sets of detection results. Then we fuse the results into a single result $\boldsymbol{b}_{n}^{ini}$ by Weighted Boxes Fusion (WBF) [35], which can be denoted by
$$\boldsymbol{b}_{n}^{ini}=\operatorname{WBF}\left(\mathcal{F}^{det}\left(\boldsymbol{x}_{n} \boldsymbol{r}_{1}^{T}\right), \ldots, \mathcal{F}^{det}\left(\boldsymbol{x}_{n} \boldsymbol{r}_{m}^{T}\right)\right)$$
where $\boldsymbol{r}_{m} \in SO(3)$. After that, for each sequence, we obtain detections $\boldsymbol{B}^{ini}=\{\boldsymbol{b}_{n}^{ini}\}_{n}$.
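The rotate-detect-derotate loop can be sketched as follows (a minimal sketch; the detector output layout [x, y, z, l, w, h, yaw, score] and the 3D WBF routine `fuse_wbf` are assumptions, not part of the original description).

import numpy as np

def rotz(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def augmented_detection(points, detector, fuse_wbf, m=6):
    """Test-time rotation augmentation around the z-axis followed by box fusion.
    `detector(points)` returns a (K, 8) array of boxes [x, y, z, l, w, h, yaw, score];
    `fuse_wbf` is assumed to be a 3D Weighted Boxes Fusion routine."""
    results = []
    for k in range(m):
        theta = 2.0 * np.pi * k / m
        r = rotz(theta)
        boxes = detector(points @ r.T).copy()   # detect in the rotated frame (x r^T)
        boxes[:, :3] = boxes[:, :3] @ r         # map centers back to the original frame
        boxes[:, 6] -= theta                    # undo the yaw offset
        results.append(boxes)
    return fuse_wbf(results)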
5.2.2 Unsupervised Motion Classification
We next classify $\boldsymbol{B}^{ini}$ into different parts based on motion states. Specifically, we first apply a transformation $RT$ that transforms all boxes to the global coordinate system using IMU/GPS data. Then we perform class-agnostic tracking by employing a Kalman Filtering-based Tracker (KFTrack) [44]. It constructs the relationships of objects between frames by
$$\boldsymbol{B}^{tra}=\operatorname{KFTrack}\left(RT\left(\boldsymbol{B}^{ini}\right)\right),$$
where $\boldsymbol{B}^{tra}=\{\boldsymbol{b}_{\tau}^{tra}\}_{\tau}$ and $\boldsymbol{b}_{\tau}^{tra}$ refers to the $\tau$-th trajectory.
Velocity is a straightforward criterion to classify the motion states. However, pseudo-labels of stationary objects produce noisy velocities similar to those of slow-moving objects. Consequently, there is no clear decision boundary (as shown in Fig. 9 (a)) that is able to classify the motion states correctly. Instead, we propose the variance of position change to classify the motion states. The intuition behind this is that even if the speed is very small, it will produce a large displacement under long-term movement. Specifically, the coordinates of $\boldsymbol{b}_{\tau}^{tra}$ are denoted by $\boldsymbol{z}^{\tau} \in \mathbb{R}^{N^{\tau}, 3}$, where $N^{\tau}$ denotes the number of boxes inside this tracklet. Then the position change $\Delta^{\tau}$ is computed by
where $\boldsymbol{z}_{*}^{\tau} \in \mathbb{R}^{3}$ refers to the column mean of $\boldsymbol{z}^{\tau}$. After that, the variance of position change $\delta^{\tau}$ is obtained by
There is a clear decision boundary, as presented in Fig. 9 (b). We apply a threshold $\eta^{var}$ (0.8 by default) to classify the labels into a stationary label set $\boldsymbol{B}^{sta}=\{\boldsymbol{b}^{\tau} \mid \delta^{\tau} < \eta^{var}\}$ and a moving label set $\boldsymbol{B}^{mot}=\{\boldsymbol{b}^{\tau} \mid \delta^{\tau} \geq \eta^{var}\}$.
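A minimal sketch of this variance-based split is shown below (not the authors' code; the reduction of the per-frame deviations to a single scalar variance and the tracklet format are illustrative assumptions).

import numpy as np

def motion_variance(centers):
    """Variance of position change along a tracklet: deviations of the box centers
    from their column mean, reduced to one scalar. `centers`: (T, 3) global-frame array."""
    delta = centers - centers.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(delta ** 2, axis=1)))

def split_by_motion(tracklets, eta_var=0.8):
    """Classify tracklets into stationary/moving sets by thresholding the variance."""
    stationary, moving = [], []
    for trk in tracklets:
        centers = np.stack([box[:3] for box in trk])
        (moving if motion_variance(centers) >= eta_var else stationary).append(trk)
    return stationary, moving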
5.2.3 Label Propagation
As analyzed before, the labels $\boldsymbol{B}^{sta}$ can be further refined by the property of stationarity, i.e., a stationary object shares the same position, size, and class across frames. Therefore, we design a label propagation that replaces low-quality boxes with the most confident box. Specifically, we denote a stationary tracklet by $\boldsymbol{b}^{sta}=\{(b_{j}^{sta}, \epsilon_{j}^{sta})\}_{j=1}^{N^{sta}}$, where $b_{j}^{sta} \in \mathbb{R}^{8}$ denotes a 7-dimensional bounding box and a class identity, $\epsilon_{j}^{sta} \in \mathbb{R}$ refers to the detection confidence, and $N^{sta}$ is the number of labels inside the tracklet. Then, the labels are propagated by
$$j^{*}=\underset{j}{\arg \max }\left(\epsilon_{j}^{sta}\right), \quad b_{j}^{sta}=b_{j^{*}}^{sta}, \quad j=1, \ldots, N^{sta}$$
Similarly, the labels $\boldsymbol{B}^{mot}$ can be further refined by the property of movability, i.e., a moving object only shares the same size and class across frames. Therefore, we only propagate the size and class identity for the moving tracklets. Specifically, we denote a moving tracklet by $\boldsymbol{b}^{mot}=\{(a_{j}^{mot}, \epsilon_{j}^{mot})\}_{j=1}^{N^{mot}}$, where $a_{j}^{mot} \in \mathbb{R}^{4}$ denotes a 3-dimensional size and a class identity, $\epsilon_{j}^{mot} \in \mathbb{R}$ refers to the detection confidence, and $N^{mot}$ is the number of labels inside the tracklet. Then, the labels are propagated by
$$\hat{j}=\underset{j}{\arg \max }\left(\epsilon_{j}^{mot}\right), \quad a_{j}^{mot}=a_{\hat{j}}^{mot}, \quad j=1, \ldots, N^{mot}$$
After all these steps, we obtain the improved seed labels $\boldsymbol{B}^{sta} \cup \boldsymbol{B}^{mot}$, which are used to train the detector.
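The two propagation rules can be sketched as follows (a minimal sketch, assuming each tracklet element is a (box, score) pair with box = [x, y, z, l, w, h, yaw, class]; not the authors' implementation).

import numpy as np

def propagate_stationary(tracklet):
    """Stationary objects share position, size, and class: broadcast the most
    confident box to every frame of the tracklet."""
    j_star = int(np.argmax([score for _, score in tracklet]))
    best_box = tracklet[j_star][0]
    return [(best_box.copy(), score) for _, score in tracklet]

def propagate_moving(tracklet):
    """Moving objects share only size and class: keep each frame's position and yaw,
    overwrite size and class identity from the most confident box."""
    j_hat = int(np.argmax([score for _, score in tracklet]))
    best_box = tracklet[j_hat][0]
    out = []
    for box, score in tracklet:
        box = box.copy()
        box[3:6] = best_box[3:6]   # l, w, h
        box[7] = best_box[7]       # class identity
        out.append((box, score))
    return out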
5.3.2 Motion-weighted Detection Loss
To address these issues, we design the Motion-Weighted Training (MWT) scheme (see Fig. 8 (d)). MWT first computes different localization and classification weights based on the motion states of the pseudo-labels. Then MWT trains the regressor and classifier of a 3D object detector $\mathcal{F}^{det2}$ with a motion-weighted detection loss $\mathcal{L}_{mot}$. MWT employs a single detector that learns localization and recognition simultaneously, thereby attaining global optimization and avoiding additional training and inference time.
Based on the groups, we compute a weight mask that selects more stationary labels for regression (see Fig. 9 (d))
$$\omega_{i}^{reg}= \begin{cases}0 & \left(\delta_{i}, \epsilon_{i}\right) \in D2 \cup D4 \\ 1 & \left(\delta_{i}, \epsilon_{i}\right) \in D1 \cup D3\end{cases}$$
Meanwhile, we compute a weight mask that selects more moving labels for classification (see Fig. 9 (e))
$$\omega_{i}^{cls}= \begin{cases}0 & \left(\delta_{i}, \epsilon_{i}\right) \in D2 \cup D3 \\ 1 & \left(\delta_{i}, \epsilon_{i}\right) \in D1 \cup D4\end{cases}$$
TABLE 1
Unsupervised 3D object detection results on the WOD validation set. We report the 3D AP, 3D APH, and BEV AP using the official metric code of WOD with different IoU thresholds. * denotes initial training. (Columns: Method, Metric, Average AP L1, Average AP L2, Veh. L1/L2 at $IoU_{0.5/0.7}$, Ped. L1/L2 at $IoU_{0.3/0.5}$.)
Then the motion-weighted detection loss $\mathcal{L}^{mot}$ is formulated between the regression prediction $\widehat{\Delta_{i}^{box}}$, regression target $\Delta_{i}^{box}$, classification prediction $\widehat{\beta}_{i}$, and classification target $\beta_{i}$ by
$$\mathcal{L}^{mot}=\frac{1}{N^{f}} \sum_{i} \omega_{i}^{reg} \mathcal{L}^{reg}\left(\widehat{\Delta_{i}^{box}}, \Delta_{i}^{box}\right)+\frac{1}{N} \sum_{i} \mathcal{L}^{cls}\left(\widehat{\beta}_{i}, \omega_{i}^{cls} \beta_{i}\right)$$
where $N^{f}$ and $N$ refer to the number of foreground predictions and the number of all predictions, respectively. $\mathcal{L}^{reg}$ and $\mathcal{L}^{cls}$ are the Smooth L1 loss and the binary cross-entropy loss, respectively.
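A minimal PyTorch sketch of this loss is given below (tensor shapes and the exact normalization are assumptions; it is not the authors' training code).

import torch
import torch.nn.functional as F

def motion_weighted_loss(box_pred, box_target, cls_pred, cls_target, w_reg, w_cls):
    """Smooth-L1 regression weighted by the stationary-biased mask w_reg, plus binary
    cross-entropy whose targets are gated by the classification mask w_cls.
    Shapes: box_pred/box_target (N_f, 7), w_reg (N_f,), cls_pred/cls_target/w_cls (N,)."""
    reg = F.smooth_l1_loss(box_pred, box_target, reduction="none").sum(dim=1)
    reg_loss = (w_reg * reg).sum() / max(w_reg.numel(), 1)
    cls_loss = F.binary_cross_entropy_with_logits(cls_pred, w_cls * cls_target)
    return reg_loss + cls_loss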
5.3.3 Mutually Boosting by Iterative Training
As mentioned in Sec. 5.1, the learned localization and recognition knowledge can be mutually generalized to moving and stationary objects. Therefore, the predictions $\boldsymbol{B}^{*}$ generated by the second-round detector $\mathcal{F}^{det2}$ are much better than the output $\boldsymbol{B}^{ini}$ from CPD. The better results potentially further enhance the pseudo-label refinement process, leading to better performance. Inspired by this insight, we perform iterative training. Specifically, the second-round predictions replace the original pseudo-labels, $\boldsymbol{B}^{ini} \leftarrow \boldsymbol{B}^{*}$, and we rerun the overall MLR and MWT until the performance no longer improves on the validation set.
6 Experiments
6.1 Datasets
(1) Waymo Open Dataset (WOD). Due to its diverse scenes, we conducted extensive experiments on the WOD [37]. The WOD contains 798, 202, and 150 training, validation, and testing sequences, respectively. We adopted similar metrics as fully/weakly supervised methods [45], [49], including 3D AP/APH L1/L2 under 0.5/0.7 IoU thresholds and BEV AP L1/L2 under 0.5/0.7 IoU thresholds, where L1 and L2
denote the detection level, and APH is an AP weighted by heading. No annotations were used for training. (2) KITTI Dataset. Since the KITTI detection dataset [8] does not provide consecutive frames, we only tested our method on the 3,769-frame val split [4]. We used similar metrics (3D AP R40 with 0.5/0.7 IoU thresholds) as employed in weakly/fully supervised methods [46], [49].
6.2 Implementation Details
Network Details. $\mathcal{F}^{pro}$, $\mathcal{F}^{det}$, and $\mathcal{F}^{det2}$ adopt the same 3D backbone as CenterPoint [56] and a second-stage refinement as in [4]. For the WOD and KITTI datasets, we use the same detection range ([-75.2, 75.2] m for the X and Y axes and [-2, 4] m for the Z axis) and voxel size (0.1 m, 0.1 m, 0.15 m) as CenterPoint [56]. The road surface of KITTI is aligned to WOD during inference.
Training Details. We adopt the widely used global scaling, rotation, and GT sampling [4] data augmentations. We trained our CPD on 8 Tesla V100 GPUs with the ADAM optimizer. We used a learning rate of 0.003 with a one-cycle learning rate strategy. CPD is trained for 20 epochs. We trained our CPD++ on 4 RTX 3090 GPUs with the same learning rate. CPD++ is trained for three iterations, with 20 epochs per iteration.
Parameter Details. (1) Heuristic parameters. Some heuristic parameters, either empirically set or taken from Wikipedia, do not require tuning on the validation dataset. The template boxes in Eq. 6 are set to [5.1 m, 1.9 m, 1.5 m], [1.0 m, 1.0 m, 2.0 m], and [1.9 m, 0.9 m, 1.8 m] for vehicles, pedestrians, and cyclists, respectively. The sensitivity is analyzed in Sec. 6.6. (2) Hyperparameters. We introduced several hyperparameters that were chosen from the validation set. The CPD parameters $\eta=0.8$, $S_{H}=0.7$, and $S_{L}=0.4$ are determined by the experiments in Sec. 6.6. The CPD++ parameters $m=6$, $\eta^{var}=0.8$, $\eta^{low}=0.4$, and $\eta^{high}=0.9$ are determined by the experiments in Sec. 6.7.
TABLE 2
Unsupervised 3D object detection results on the WOD testing set. We report the 3D AP and 3D APH by submitting our results to the official online server of WOD. The results are also available on the WOD online leaderboard.
TABLE 3
Comparison with fully/weakly/un-supervised methods on the KITTI val set. We report the 3D AP R40 using the KITTI official metrics. $\dagger$ denotes that the model is trained on WOD and tested on KITTI.
6.3 Comparison with Unsupervised Detectors
WOD validation set. The results on the WOD validation set are presented in Table 1. All methods use identical size thresholds to define the object classes and use a single traversal. Our method significantly outperforms existing unsupervised methods. Notably, under 3D AP L2 with IoU thresholds of 0.7, 0.5, and 0.5, our CPD outperforms OYSTER [59] by 18.03%, 13.08%, and 4.55% on the Vehicle, Pedestrian, and Cyclist classes, respectively. These advancements come from our MFC, CBR, and CST designs, which yield superior pseudo-labels and enhanced detection accuracy. CPD also surpasses the Proto-vanilla method, which uses class-specific prototypes [34], as our CProto better models diverse intra-class object sizes. CPD++ further improves the detection performance of CPD significantly, by 16.46%, 14.09%, and 24.82% on the Vehicle, Pedestrian, and Cyclist classes, respectively. CPD++ mutually learns localization from stationary objects and recognition from moving objects through the MLR and MWT. Consequently, the precision has been greatly improved. Notably, the detection performance on cyclists has been boosted around $6\times$, as the MWT significantly suppresses the false positive supervision signals through the motion-weighted training loss.
WOD testing set. The results on the WOD testing set are presented in Table 2. Under 3D APH L2 with IoU thresholds of 0.7, 0.5, and 0.5, our CPD outperforms OYSTER [59] by 12.46%, 7.12%, and 3.55% on the Vehicle, Pedestrian, and Cyclist classes, respectively. In addition, CPD++ boosts CPD by around 2× in mAP: mAP L1 improves by 20.5% and mAP L2 by 18.95%. The improvement on the testing set is consistent with that on the validation set, further demonstrating the generality of our method.
6.4 Comparison with Fully/Weakly Supervised Detectors
Results on KITTI val set. To further validate our method, we pre-trained our CPD and CPD++, along with OYSTER [59] and MODEST [57], on WOD and tested them on the KITTI dataset using Statistical Normalization (SN) [40]. The detection results are in Table 3. We first compared our methods with the sparsely supervised CoIn [49] (annotated with 2% labels), which annotates a single instance per frame for training. Our unsupervised CPD and CPD++ outperform this sparsely supervised method by 0.25% and 15.09% 3D AP @ $IoU_{0.7}$ on the moderate car class, respectively. Additionally, our CPD++ attains 97.01% and 89.25% 3D AP for the easy and moderate car classes at a 0.5 IoU threshold. Notably, this reaches 99.6% and 95.3% of the performance of fully supervised CenterPoint [56], demonstrating the advancement of our method.
Results on WOD validation set. We compared our method with fully/weakly supervised methods on the WOD validation set [37]. The detection results are in Table 4. For vehicles, our unsupervised CPD outperforms the sparsely supervised method (2% annotation) by 4.42% and 3.31% in terms of 3D AP L1 and L2, respectively. Besides, our CPD++ outperforms a strong weakly supervised detector, CoIn [49] (2% annotation), by 6.77% and 3.52% in terms of vehicle AP L2 and pedestrian AP L2, respectively. CPD++ attains 76% (AP L2) of fully supervised CenterPoint [56] for large-scale vehicle object detection. These results show that our unsupervised method can outperform some weakly supervised methods and achieve performance similar to fully supervised methods on large-scale object detection.
6.5 As a Pre-trained Model
Our unsupervised method can serve as a strong pre-training model that greatly improves the detector when only a few labels are available. To demonstrate this property, we conducted a fine-tuning experiment on the WOD validation set. We compare our method with regular MAE-based [11] methods, such as Occupancy-MAE [26] and GD-MAE [52]. We show the results in Table 5, where labeling rates from 0% to 1% are used. Our unsupervised CPD++ outperforms the regular MAE-based methods in almost all
TABLE 4
Comparison with fully/weakly/un-supervised methods on the WOD validation set. We report 3D AP and APH using the official metrics of WOD.
[Table 4 columns: Method, Setting, Label rate, Veh. L1 AP/APH, Veh. L2 AP/APH, Ped. L1 AP/APH, Ped. L2 AP/APH, Cyc. L1 AP/APH.]
TABLE 5
Fine-tuning results on the WOD validation set. We report the 3D AP L2 results evaluated by the WOD official metrics.
Fig. 10. (a-c) IoU distribution between pseudo-labels and ground truth. (d-f) Mean absolute error of the size, position, and angle of pseudo-labels generated by different methods.
metrics. The reason is that the occupancy pre-training in regular MAE is task-agnostic and can therefore only initialize the backbone parameters. In contrast, our CPD++ pre-training is task-specific and can initialize the parameters of both the backbone and the detection head.
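A minimal sketch of this initialization difference is given below, assuming PyTorch-style checkpoints; the function and key names are illustrative, not our released code.

```python
import torch

def load_task_specific_pretraining(detector, ckpt_path):
    """Initialize a detector from a task-specific (CPD++-style) checkpoint.
    Because such a checkpoint contains both backbone and detection-head
    parameters, every weight whose name and shape match can be loaded; an
    occupancy/MAE checkpoint would only cover the backbone keys."""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    model_state = detector.state_dict()
    loadable = {k: v for k, v in pretrained.items()
                if k in model_state and v.shape == model_state[k].shape}
    model_state.update(loadable)          # backbone + head keys are overwritten
    detector.load_state_dict(model_state)
    return detector
```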
6.6 Diagnostic Experiment of CPD
We conducted several diagnostic experiments to examine the effectiveness of CPD on the WOD validation dataset.
Components Analysis of CPD. To evaluate the individual contributions of our designs, we incrementally added each component and assessed its impact on vehicle detection. The results are shown in Table 6. Our MFC method surpasses Single Frame Clustering (SFC) by 2.48% in 3D AP L2 at the 0.7 IoU threshold, attributed to the more complete point representation of objects across consecutive frames compared to a single frame. The CBR further enhances performance by 19.27% AP, as it reduces size and location errors in the pseudo-labels. The CST contributes an 8.09% increase in AP, demonstrating the effectiveness of the geometric features from CProto in detecting sparse objects.
TABLE 6
Vehicle detection results using different CPD components.
Frame Number of MFC. To examine the effect of frame count on initial pseudo-label quality, we experimented with different numbers of frames on the WOD validation set. The BEV results, shown in Fig. 11 (a)(b), indicate optimal performance with $[-5, 5]$ frames (five past, five future, and the current frame). Additional frames did not significantly improve performance.
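The sketch below illustrates how consecutive frames can be aggregated into the current frame's coordinate system with ego poses before clustering; it is an illustrative sketch of multi-frame aggregation under this $[-5, 5]$ window, not the paper's implementation, and it omits motion-artifact removal.

```python
import numpy as np

def aggregate_frames(points_per_frame, poses, center_idx, window=5):
    """Merge LiDAR frames in [center-window, center+window] into the center
    frame's coordinates. points_per_frame[i]: (N_i, 3) points in sensor
    coordinates; poses[i]: 4x4 sensor-to-world transform for frame i."""
    world_to_center = np.linalg.inv(poses[center_idx])
    merged = []
    lo = max(0, center_idx - window)
    hi = min(len(points_per_frame), center_idx + window + 1)
    for i in range(lo, hi):
        pts = points_per_frame[i]
        homo = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)  # (N, 4)
        in_center = (world_to_center @ poses[i] @ homo.T).T[:, :3]
        merged.append(in_center)
    return np.concatenate(merged, axis=0)
```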
Components Analysis of CBR. To evaluate the impact of re-sizing and re-localization in CBR, we conducted experiments and analyzed pseudo-label performance. As shown in Table 8, re-sizing yields 3.91% and 3.4% increases in BEV recall at the 0.5 and 0.7 IoU thresholds, respectively; re-localization further enhances recall by 12.68% and 6.43% at these thresholds, while also increasing precision. These results indicate the importance of both components, which effectively refine the pseudo-labels.
TABLE 8
Pseudo-label BEV recall and precision using different CBR components.
Components Analysis of CST. To assess the effectiveness of each component in CST, we established a baseline using only CBR-generated pseudo-labels to train a two-stage CenterPoint detector, then incrementally added our loss components and evaluated vehicle detection performance. As shown in Table 9, all loss components contribute to the performance improvement. Specifically, our $\mathcal{L}_{det}^{css}$ mitigates the influence of false pseudo-labels using the CSS weight and improves 3D AP L2 at $IoU_{0.7}$ by 4.79%. Our $\mathcal{L}_{feat}^{css}$ and $\mathcal{L}_{box}^{css}$ improve 3D AP L2 at $IoU_{0.7}$ by 0.75% and 2.55%,
Fig. 12. Vehicle detection results using different hyperparameters.
respectively, through leveraging geometric knowledge from the dense CProto for more effective sparse-object detection.
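To make the role of the CSS weight concrete, the sketch below shows one way a per-label confidence score can down-weight unreliable pseudo-labels in a detection loss; the exact form of $\mathcal{L}_{det}^{css}$ in the paper may differ, so treat this as a schematic.

```python
import torch

def css_weighted_detection_loss(per_label_loss, css_scores):
    """Down-weight the loss contribution of low-confidence pseudo-labels.
    per_label_loss: (N,) unreduced detection loss per assigned pseudo-label.
    css_scores: (N,) Completeness-and-Size-Similarity scores in [0, 1]."""
    weights = css_scores.detach()                       # scores act as fixed soft weights
    return (weights * per_label_loss).sum() / weights.sum().clamp(min=1e-6)
```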
TABLE 9
Vehicle detection results using different CST components.
Parameter Determination. In Eq. 6, we use box templates to calculate the size score $\psi^3(b_j)$. Here, we test the sensitivity by using simple commonsense $l{:}w{:}h$ ratios (2:1:1 for Vehicle, 1:1:2 for Pedestrian, 2:1:2 for Cyclist). The 3D AP results (32.13 vs. 31.87) in Table 10 demonstrate that $\psi^3(b_j)$ is not sensitive to the specific values. To select the best hyperparameters of CPD, we conducted a series of experiments on the 1/20 WOD dataset. We apply a simple grid search, i.e., testing different parameters in a specific range with a small stride and choosing the parameter with the best detection performance. We present the results (Vehicle 3D AP L2) in Fig. 12. We perform best when using $\eta=0.8$, $S_H=0.7$, $S_L=0.4$, and $\sigma=0.05$, where $\sigma$ is the divergence threshold in Eq. 6.
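A generic version of this grid search is sketched below; the `train_and_eval` callback and the candidate ranges are placeholders rather than the exact values we swept.

```python
import itertools

def grid_search(train_and_eval, grid):
    """Exhaustively evaluate all hyperparameter combinations and keep the best.
    train_and_eval(params) -> float returns, e.g., Vehicle 3D AP L2 on the
    1/20 WOD split; grid maps parameter names to candidate values."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = train_and_eval(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Example sweep around the reported optimum (candidate lists are assumptions):
# grid_search(run, {"eta": [0.7, 0.8, 0.9], "s_high": [0.6, 0.7, 0.8],
#                   "s_low": [0.3, 0.4, 0.5], "sigma": [0.01, 0.05, 0.1]})
```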
6.7 Diagnostic Experiment of CPD++
We then conducted diagnostic experiments to examine the effectiveness of CPD++; all reported results are from the WOD validation dataset.
TABLE 11
Training architecture comparison results. We report 3D AP L1 and L2 using the official metric code of WOD with different IoU thresholds.
Fig. 13. Training iterations and the corresponding performance of different methods. We report 3D AP L1 and L2 for different classes.
Fig. 14. Performance (3D mAP L2) comparison using different CPD++ hyperparameters.
between stationary and moving objects, attaining optimal performance.
Number of Iterations. To examine the effect of training iterations on detection performance, we experimented with different numbers of training iterations for CPD++. In addition, the other three baseline methods were tested for comparison. The results, shown in Fig. 13 (a)-(f), indicate optimal performance with three iterations. Additional iterations did not significantly improve AP. Consequently, we used three training iterations to train our CPD++. Moreover, our CPD++ shows significant improvement over the baselines in each iteration, thanks to our MLR and MWT designs.
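The iterative train-and-relabel loop referenced here can be summarized as below; `train_fn` and `relabel_fn` are placeholders for the actual training and pseudo-label regeneration steps of the pipeline.

```python
def self_training(train_fn, relabel_fn, initial_labels, iterations=3):
    """Schematic self-training loop: train a detector on the current pseudo-
    labels, then regenerate the labels with the trained model. Three iterations
    were found sufficient in Fig. 13.
    train_fn(labels) -> detector; relabel_fn(detector) -> labels."""
    labels = initial_labels
    detector = None
    for _ in range(iterations):
        detector = train_fn(labels)
        labels = relabel_fn(detector)
    return detector, labels
```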
TABLE 12
Performance (3D AP L2) comparison using different motion classifiers.
Motion Classifier. In Sec. 5.2.2, we proposed using the variance of position change to classify motion states. Here, we test an alternative based on velocity. The detection results are shown in Table 12, where our variance of position change shows superior performance. Therefore, we use the variance of position change in CPD++.
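A minimal sketch of such a classifier is given below; the array layout, the displacement statistic, and the interpretation of the threshold are assumptions for illustration and may differ from the exact formulation in Sec. 5.2.2.

```python
import numpy as np

def is_moving(track_centers, eta_var=0.8):
    """Classify a tracked object as moving if the variance of its frame-to-frame
    position change exceeds a threshold. track_centers: (T, 2) BEV box centers
    of one object over T consecutive frames."""
    deltas = np.diff(track_centers, axis=0)           # (T-1, 2) position changes
    variance = np.var(np.linalg.norm(deltas, axis=1)) # variance of displacement magnitude
    return variance > eta_var
```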
Component Analysis of CPD++. To evaluate the individual contributions of the components in CPD++, we incrementally added each component and assessed its impact on detection performance. The results are shown in Table 13. The augmented testing (AT), MLR, weight mask for regression $\omega_i^{reg}$, and weight mask for classification $\omega_i^{cls}$ improve vehicle detection by 2.25%, 5.1%, 4.4%, and 4.98%, respectively. We observe that the MLR and the regression mask $\omega_i^{reg}$ only marginally improve pedestrian and cyclist detection. A possible reason is that stationary pedestrians and cyclists are far fewer than moving ones. In contrast, the classification weight mask $\omega_i^{cls}$ improves pedestrian and cyclist detection by 16.73% and 29.13%, respectively, because $\omega_i^{cls}$ significantly reduces the false supervision signals for small objects through the motion pattern. The results demonstrate that every component is vital to the overall performance improvement.
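The sketch below gives one rough reading of how motion-conditioned weight masks could gate the two loss branches; this construction is an assumption for illustration only and is not the paper's exact formulation of $\omega_i^{reg}$ and $\omega_i^{cls}$.

```python
import torch

def motion_weight_masks(is_moving):
    """Rough sketch in the spirit of MWT: the regression loss trusts stationary
    objects (well-localized pseudo-boxes), while the classification loss trusts
    moving objects (reliable foreground). is_moving: (N,) bool per pseudo-label."""
    omega_reg = (~is_moving).float()   # localization learned from stationary objects
    omega_cls = is_moving.float()      # recognition learned from moving objects
    return omega_reg, omega_cls

def masked_loss(per_label_loss, omega):
    # Weighted average so that down-weighted labels do not dominate the loss.
    return (omega * per_label_loss).sum() / omega.sum().clamp(min=1e-6)
```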
TABLE 13
Performance (3D AP L2) comparison using different CPD++ components. AT refers to augmented testing.
Parameter Determination. To select the best hyperparameters for CPD++, we conducted a series of experiments on the 1/20 WOD dataset. Similar to CPD, we apply a simple grid search to choose the best parameters. We first select the best transformation number $m$ by evaluating the mAP of the pseudo-labels in Eq. 11 with different $m$; the mAP attains the best performance with six transformations. We next select the best motion threshold $\eta^{var}$ and score thresholds $\eta^{low}$ and $\eta^{high}$ by evaluating the final detection mAP of CPD++ with different parameters. As presented in Fig. 14, we attain the best performance when using $\eta^{var}=0.8$, $\eta^{low}=0.4$, and $\eta^{high}=0.9$.
6.8 Visualization Comparison
To provide a more intuitive understanding of how our method improves detection performance, we visually compare our CPD and CPD++ with OYSTER [59], as shown in Fig. 15. OYSTER inaccurately reports the sizes and
Fig. 15. Visualization comparison of different detection results on the WOD validation set. Each column shows the results of a different method; each row shows the detection results for a different scenario. Some objects are zoomed in for better comparison.
positions of objects (Fig. 15(1.1)). In contrast, CPD, using our CProto-based design, accurately predicts their sizes and positions (Fig. 15(2.1)). However, since threshold-based classification cannot distinguish foreground and background objects of similar size, OYSTER and CPD produce many false-positive predictions for small background objects (Fig. 15(1.3)(2.3)). Besides, some distant vehicles cannot be recognized correctly due to sparse points and shapes similar to the background (Fig. 15(1.2)(2.2)). CPD++ improves localization and recognition through the MLR and MWT designs. Therefore, the false positives for small objects decrease significantly (Fig. 15(3.3)), and the recognition of vehicles is also improved (Fig. 15(3.2)).
6.9 Failure Case Analysis and Limitations
At present, unsupervised 3D object detection remains a challenging problem. Some objects in corner cases cannot be detected correctly. One of the most typical situations is when pedestrians and cars are close together. In this case, cars and pedestrians cannot be separated by clustering, which generates false labels and misleads detector training ((1.4)(2.4)(3.4) in the second row of Fig. 15). Besides, the detection performance on small objects is clearly lower than that on large objects. As presented in Table 1, the pedestrian and cyclist classes only obtain 44.04% BEV AP L1 ($IoU_{0.3}$) and 34.15% BEV AP L1 ($IoU_{0.3}$), which is much lower than the vehicle detection performance of 80.64% BEV AP L1 ($IoU_{0.5}$). These results suggest that small-object 3D detection without annotation remains exceptionally challenging and requires further investigation in the future.
7 Conclusion
This paper explored the challenge of unsupervised 3D object detection, where no human annotations are used for either training or fine-tuning. We addressed this problem by incorporating human commonsense clues into a pseudo-label-based framework. We first presented a Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. First, we develop an MFC method to generate initial pseudo-labels and construct a CProto set using the CSS scoring. Then, we introduce a CBR method to refine these pseudo-labels. Lastly, we design a CST to enhance detection accuracy for sparse objects. Building upon CPD, we then proposed CPD++, an enhanced framework for more accurate multi-class unsupervised 3D object detection. First, we design an MLR that refines pseudo-labels by considering the motion properties of objects. We then presented the MWT, which learns localization from stationary objects, learns recognition from moving objects, and transfers knowledge between the two to mutually boost detection performance. Extensive experimental evaluations confirmed the efficacy of our proposed framework. Notably, when CPD++ is trained on WOD and tested on KITTI, it achieves performance comparable to fully supervised methods on car detection. These findings underscore the significance and advancement of our research.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 62171393).
References
[1] Asma Azim and Olivier Aycard. Detection, classification and tracking of moving objects in a 3D environment. In 2012 IEEE Intelligent Vehicles Symposium, pages 802-807. IEEE, 2012.
[2] Qi Cai, Yingwei Pan, Ting Yao, and Tao Mei. 3D Cascade RCNN: High quality object detection in point clouds. IEEE Transactions on Image Processing, 31:5706-5719, 2022.
[3] Zehui Chen, Zhenyu Li, Shuo Wang, Dengpan Fu, and Feng Zhao. Learning from noisy data for semi-supervised 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6929-6939, 2023.
[4] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel R-CNN: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996.
[7] Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Embracing single stride 3D object detector with sparse transformer. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8448-8458, 2022.
[8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354-3361, 2012.
[9] Goldberger, Gordon, and Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 487-493. IEEE, 2003.
[10] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11873-11882, 2020.
[11] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979-15988, 2022.
[12] Michael Himmelsbach, Felix V. Hundelshausen, and H.-J. Wuensche. Fast segmentation of 3D point clouds for ground vehicles. In 2010 IEEE Intelligent Vehicles Symposium, pages 560-565. IEEE, 2010.
[13] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12697-12705, 2019.
[14] Huifang Li, Yidong Li, Yuanzhouhan Cao, Yushan Han, Yi Jin, and Yunchao Wei. Weakly supervised object detection with class prototypical network. IEEE Transactions on Multimedia, 2022.
[15] Tianqin Li, Zijie Li, Harold Rockwell, Amir Farimani, and Tai Sing Lee. Prototype memory and attention mechanisms for few shot image generation. In Proceedings of the Eleventh International Conference on Learning Representations, volume 18, 2022.
[16] Ziyu Li, Jingming Guo, Tongtong Cao, Liu Bingbing, and Wankou Yang. GPA-3D: Geometry-aware prototype alignment for unsupervised domain adaptive 3D object detection from point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6394-6403, 2023.
[17] Hanxue Liang, Chenhan Jiang, Dapeng Feng, Xin Chen, Hang Xu, Xiaodan Liang, Wei Zhang, Zhenguo Li, and Luc Van Gool. Exploring geometry-aware contrast and clustering harmonization for self-supervised 3D object detection. In International Conference on Computer Vision (ICCV), pages 3273-3282, 2021.
[18] Chuandong Liu, Chenqiang Gao, Fangcen Liu, Pengcheng Li, Deyu Meng, and Xinbo Gao. Hierarchical supervision and shuffle data augmentation for 3D semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23819-23828, 2023.
[19] Chuandong Liu, Chenqiang Gao, Fangcen Liu, Jiang Liu, Deyu Meng, and Xinbo Gao. SS3D: Sparsely-supervised 3D object detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8428-8437, 2022.
[20] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012-10022, 2021.
[21] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. arXiv, 2022.
[22] Xiaonan Lu, Wenhui Diao, Yongqiang Mao, Junxi Li, Peijin Wang, Xian Sun, and Kun Fu. Breaking immutable: Information-coupled prototype elaboration for few-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1844-1852, 2023.
[23] Katie Luo, Zhenzhen Liu, Xiangyu Chen, Yurong You, Sagie Benaim, Cheng Perng Phoo, Mark Campbell, Wen Sun, Bharath Hariharan, and Kilian Q. Weinberger. Reward finetuning for faster and more accurate unsupervised object discovery. Advances in Neural Information Processing Systems, 36, 2024.
[24] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool. Towards a weakly supervised framework for 3D point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4454-4468, 2021.
[25] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool. Towards a weakly supervised framework for 3D point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4454-4468, 2022.
[26] Chen Min, Xinli Xu, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. Occupancy-MAE: Self-supervised pre-training large-scale LiDAR point clouds with masked occupancy autoencoders. IEEE Transactions on Intelligent Vehicles, 2022.
[27] Nikita Moshkov, Botond Mathe, Attila Kertesz-Farkas, Reka Hollandi, and Peter Horvath. Test-time augmentation for deep learning-based cell segmentation on microscopy images. Scientific Reports, 10(1):5068, 2020.
[28] Jinhyung Park, Chenfeng Xu, Yiyang Zhou, Masayoshi Tomizuka, and Wei Zhan. DetMatch: Two teachers are better than one for joint 2D and 3D semi-supervised object detection. In European Conference on Computer Vision, pages 370-389. Springer, 2022.
[29] Xidong Peng, Xinge Zhu, and Yuexin Ma. CL3D: Unsupervised domain adaptation for cross-LiDAR 3D detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2047-2055, 2023.
[30] Gheorghii Postica, Andrea Romanoni, and Matteo Matteucci. Robust moving objects detection in LiDAR data exploiting visual cues. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1093-1098. IEEE, 2016.
[31] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10526-10535, 2020.
[32] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-779, 2019.
[33] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43:2647-2664, 2021.
[34] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 2017.
[35] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, 107, 2021.
[36] Muhammad Sualeh and Gon-Woo Kim. Dynamic multi-LiDAR based multiple object detection and tracking. Sensors, 19(6):1474, 2019.
[37] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2443-2451, 2020.
[38] Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, and Liwei Wang. DSVT: Dynamic sparse voxel transformer with rotated sets. In CVPR, 2023.
[39] He Wang, Yezhen Cong, Or Litany, Yue Gao, and Leonidas J. Guibas. 3DIoUMatch: Leveraging IoU prediction for semi-supervised 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 14610-14619, 2021.
[40] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, and Wei-Lun Chao. Train in Germany, test in the USA: Making 3D object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713-11723, 2020.
[41] Yi Wei, Shang Su, Jiwen Lu, and Jie Zhou. FGR: Frustum-aware geometric reasoning for weakly supervised 3D vehicle detection. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4348-4354. IEEE, 2021.
[42] Aming Wu, Yahong Han, Linchao Zhu, and Yi Yang. Universal-prototype enhancing for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9567-9576, 2021.
[43] Hai Wu, Jinhao Deng, Chenglu Wen, Xin Li, and Cheng Wang. CasA: A cascade attention network for 3D object detection from LiDAR point clouds. IEEE Transactions on Geoscience and Remote Sensing, 2022.
[44] Hai Wu, Wenkai Han, Chenglu Wen, Xin Li, and Cheng Wang. 3D multi-object tracking in point clouds based on prediction confidence-guided data association. IEEE Transactions on Intelligent Transportation Systems, 2021.
[45] Hai Wu, Chenglu Wen, Wei Li, Ruigang Yang, and Cheng Wang. Transformation-equivariant 3D object detection for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[46] Hai Wu, Chenglu Wen, Shaoshuai Shi, Xin Li, and Cheng Wang. Virtual sparse convolution for multimodal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[47] Hai Wu, Shijia Zhao, Xun Huang, Chenglu Wen, Xin Li, and Cheng Wang. Commonsense prototype for outdoor unsupervised 3D object detection. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[48] Xiaopei Wu, Liang Peng, Honghui Yang, Liang Xie, Chenxi Huang, Chengqi Deng, Haifeng Liu, and Deng Cai. Sparse fuse dense: Towards high quality 3D detection with depth completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[49] Qiming Xia, Jinhao Deng, Chenglu Wen, Hai Wu, Shaoshuai Shi, Xin Li, and Cheng Wang. CoIn: Contrastive instance feature mining for outdoor 3D object detection with very limited annotations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[50] Hongyi Xu, Feng Liu, Qianyu Zhou, Jinkun Hao, Zhijie Cao, Zhengyang Feng, and Lizhuang Ma. Semi-supervised 3D object detection via adaptive pseudo-labeling. In 2021 IEEE International Conference on Image Processing (ICIP), pages 3183-3187, 2021.
[51] Jenny Xu and Steven L. Waslander. HyperMODEST: Self-supervised 3D object detection with confidence score filtering. In 2023 20th Conference on Robots and Vision (CRV), pages 217-224, 2023.
[52] Honghui Yang, Tong He, Jiaheng Liu, Huaguan Chen, Boxi Wu, Binbin Lin, Xiaofei He, and Wanli Ouyang. GD-MAE: Generative decoder for MAE pre-training on LiDAR point clouds. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9403-9414, 2023.
[53] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: Sparse-to-dense 3D object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1951-1960, 2019.
[54] Junbo Yin, Jin Fang, Dingfu Zhou, Liangjun Zhang, Cheng-Zhong Xu, Jianbing Shen, and Wenguan Wang. Semi-supervised 3D object detection with proficient teachers. In European Conference on Computer Vision, pages 727-743. Springer, 2022.
[55] Junbo Yin, Dingfu Zhou, Liangjun Zhang, Jin Fang, Cheng-Zhong Xu, Jianbing Shen, and Wenguan Wang. ProposalContrast: Unsupervised pre-training for LiDAR-based 3D object detection. In European Conference on Computer Vision, pages 17-33, 2022.
[56] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3D object detection and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[57] Yurong You, Katie Luo, Cheng Perng Phoo, Wei-Lun Chao, Wen Sun, Bharath Hariharan, Mark E. Campbell, and Kilian Q. Weinberger. Learning to detect mobile objects from LiDAR scans without labels. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[58] Dingyuan Zhang, Dingkang Liang, Zhikang Zou, Jingyu Li, Xiaoqing Ye, Zhe Liu, Xiao Tan, and Xiang Bai. A simple vision transformer for weakly semi-supervised 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8373-8383, 2023.
[59] Lunjun Zhang, Anqi Joyce Yang, Yuwen Xiong, Sergio Casas, Bin Yang, Mengye Ren, and Raquel Urtasun. Towards unsupervised object detection from LiDAR point clouds. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[60] Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, and Ryosuke Shibasaki. Unsupervised 3D category discovery and point labeling from a large urban environment. In 2013 IEEE International Conference on Robotics and Automation, pages 2685-2692. IEEE, 2013.
[61] Xiao Zhang, Wenda Xu, Chiyu Dong, and John M. Dolan. Efficient L-shape fitting for vehicle detection using laser scanners. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 54-59. IEEE, 2017.
[62] Yixin Zhang, Zilei Wang, and Yushi Mao. RPN prototype alignment for domain adaptive object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12425-12434, 2021.
[63] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. SESS: Self-ensembling semi-supervised 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[64] Shizhen Zhao and Xiaojuan Qi. Prototypical VoteNet for few-shot 3D point cloud object detection. Advances in Neural Information Processing Systems, 35:13838-13851, 2022.
[65] Wu Zheng, Weiliang Tang, Li Jiang, and Chi-Wing Fu. SE-SSD: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 14494-14503, 2021.
[66] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4490-4499, 2018.
Hai Wu received his B.Eng. degree in information computing and science from Sichuan University of Science and Technology, Sichuan, China, in 2018. He is currently working toward a Ph.D. degree in the School of Informatics at Xiamen University, Xiamen, China. His research interests include computer vision, machine learning, and 3D point cloud processing.
Shijia Zhao received the M.S. degree from the College of Marine Information Engineering, Jimei University, Xiamen, China, in 2023. He is currently pursuing a Ph.D. degree with the School of Informatics at Xiamen University, Xiamen, China. His research interests include 3D object detection, deep learning theory and its applications, and image processing.
Xun Huang received his B.S. degree in computer science from Jimei University, Xiamen, China, in 2003. He is currently pursuing a Ph.D. degree with the School of Informatics at Xiamen University, Xiamen, China. His research interests include 3D object detection and 3D vision.
Qiming Xia received the M.S. degree from Jimei University, Xiamen, China, in 2022. He is currently working toward a Ph.D. degree in the School of Informatics at Xiamen University, Xiamen, China. His research interests include computer vision, machine learning, and 3D object detection.
Chenglu Wen received her Ph.D. degree in mechanical engineering from China Agricultural University, Beijing, China, in 2009. She is currently a Professor with the School of Informatics, Xiamen University, Xiamen, China. Her research interests include 3D point cloud processing, 3D vision, and intelligent robots. She is currently an Associate Editor of IEEE GRSL and IEEE TITS.
Li Jiang received her Ph.D. degree from the Department of Computer Science and Engineering at The Chinese University of Hong Kong in 2021. She is currently an Assistant Professor at The Chinese University of Hong Kong, Shenzhen. Before that, she was a postdoctoral researcher at the Department of Computer Vision and Machine Learning of the Max Planck Institute for Informatics. Her research interests include computer vision and deep learning, particularly 3D scene understanding, representation learning, autonomous driving, and robotics.
Xin Li received his Ph.D. degree in computer science from Stony Brook University (SUNY) in 2008. He is currently a professor with the School of Electrical Engineering and Computer Science and the Center for Computation and Technology, Louisiana State University, USA. His research interests include geometric and visual data processing and analysis, computer graphics, and computer vision. For more details, please visit https://www.ece.lsu.edu/xinli/.
Cheng Wang received his Ph.D. degree in information and communication engineering from the National University of Defense Technology, Changsha, China, in 2002. He is currently a Professor at the School of Informatics, Xiamen University, China. His research interests include image processing, mobile LiDAR data analysis, and multi-sensor fusion. He is the Chair of the ISPRS Working Group I/6 on Multi-sensor Integration and Fusion (2016-2020).
Summary of Differences (Extensions) Compared to Conference Version
This submission is an extension of our previous paper (Commonsense Prototype for Outdoor Unsupervised 3D Object Detection) accepted by CVPR 2024. We provide an overview of the differences between this TPAMI submission and our previous paper. Our team has been working diligently to refine our approach and expand upon our initial designs.
(1) Addressing New Problems: While our CVPR paper, CPD, focused primarily on the challenge of accurate object localization in unsupervised 3D object detection, our TPAMI submission goes a step further. It not only tackles the issue of accurate localization but also addresses the difficulty of recognizing foreground and background objects. This dual focus is driven by the need to enhance the robustness and applicability of our detection system in real-world scenarios.
(2) New Contributions: Our TPAMI submission introduces a novel method, CPD++, which significantly surpasses the performance of the CPD presented at CVPR. This new approach sets a new state of the art in unsupervised 3D object detection, outperforming all current methods in terms of 3D AP.
(3) New Method Design: CPD++ incorporates more innovative techniques. We proposed Motion-conditioned Label Refinement (MLR) for generating high-quality pseudo-labels. Additionally, we developed a Motion-Weighted Training (MWT) scheme that learns localization from stationary objects and recognition from moving ones. CPD++ significantly enhances the performance of unsupervised 3D object detection.
(4) More Comprehensive Experimental Results: In comparison to our CVPR paper, our TPAMI submission provides more results under extensive evaluation metrics, including 3D APH and BEV AP. We have also conducted more in-depth ablation studies and provided detailed visual comparisons to illustrate the effectiveness of our approach. Furthermore, we have added experiments that use our unsupervised 3D object detection model as a pre-trained model, followed by fine-tuning with a small amount of labeled data, demonstrating the versatility and adaptability of our model.
Commonsense Prototype for Outdoor Unsupervised 3D Object Detection
Hai Wu¹  Shijia Zhao¹  Xun Huang¹  Chenglu Wen¹*  Xin Li²  Cheng Wang¹
¹ Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University
² Section of Visual Computing and Interactive Media, Texas A&M University
Abstract
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs a Commonsense Prototype (CProto) characterized by a high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on the Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on the easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at https://github.com/hailanyi/CPD.
1. Introduction
Autonomous driving requires reliable detection of 3D objects (e.g., vehicles and cyclists) in urban scenes for safe path planning and navigation. Thanks to the power of neural networks, numerous studies have developed high-performance 3D detectors through fully supervised approaches [4, 15, 30-33]. However, these models heavily depend on human annotations from diverse scenes to guarantee their effectiveness across various scenarios. This data labeling process is typically laborious and time-consuming, limiting the wide deployment of detectors in practice [40].
Figure 1. Illustration of commonsense prototypes for unsupervised 3D object detection in autonomous driving scenes.
Several studies have explored approaches to reduce labeling requirements through weakly supervised learning [3, 26, 46], decreasing the label cost by over 80%. Notably, the objects within a 3D scene exhibit distinguishable attributes and can be easily identified through certain commonsense reasoning (see Fig. 1). For example, the objects are usually located on the ground surface with a certain shape, and the object sizes are fixed across frames. This insight has prompted us to develop an unsupervised 3D detector that operates without using human annotations.
In recent years, traditional methods leveraged ground removal [9] and clustering techniques [42] for unsupervised 3D object detection. However, these methods often struggle to achieve satisfactory performance due to the sparsity and occlusion of objects in 3D scenes. Advanced methods create initial pseudo-labels from point cloud sequences by clustering and bootstrap a good detector by iteratively training a deep network [41]. Nevertheless, the sparse and view-limited nature of LiDAR scanning leads to pseudo-labels with inaccurate sizes and positions, misleading the network convergence and resulting in suboptimal detection performance. A subset of objects, denoted as complete objects $\mathcal{T}$,
To tackle this issue, this paper proposes a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD is built upon two key insights: (1) The ground truth of intra-class objects keeps a similar size (length, width, and height) distribution between incomplete objects and complete objects (see Fig. 2 (d)). (2) Nearby stationary objects are very complete in consecutive frames and can be recognized accurately by commonsense intuition (see Fig. 2 (f)(g)). Our idea is to construct a Commonsense Prototype (CProto) set representing accurate geometry and size from complete objects to refine the pseudo-labels of incomplete objects and improve the detection accuracy. To this end, we first design an unsupervised Multi-Frame Clustering (MFC) method that yields high-recall initial pseudo-labels. Subsequently, we introduce an unsupervised Completeness and Size Similarity (CSS) score that selects high-quality labels to construct the CProto set. Furthermore, we design a CProto-constrained Box Regularization (CBR) method to refine the pseudo-labels by incorporating the size prior from CProto. In addition, we develop CProto-constrained Self-Training (CST) that improves the detection accuracy of sparsely scanned objects by the geometry knowledge from CProto.
The effectiveness of our design is verified by experiments on widely used WOD [25], PandaSet [35], and KITTI dataset [6]. Besides, the individual components of our design are also verified by extensive experiments on WOD [25]. The main contributions of this work include: 我们的设计有效性通过对广泛使用的 WOD [25]、PandaSet [35]和 KITTI 数据集 [6]的实验验证。此外,我们设计的各个组件也通过对 WOD [25]的大量实验进行了验证。本文的主要贡献包括:
We propose a Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. CPD outperforms state-of-the-art unsupervised 3D detectors by a large margin. 我们提出了一种基于常识原型的检测器(CPD)用于无监督的 3D 目标检测。CPD 大幅优于最先进的无监督 3D 检测器。
We propose Multi-Frame Clustering (MFC) and CProtoconstrained Box Regularization (CBR) for pseudo-label generation and refinement, greatly improving the recall and precision of pseudo-label. 我们提出多帧聚类(MFC)和 CProtocol 受约束的框正则化(CBR),用于伪标签生成和改进,大幅提高了伪标签的召回率和精确度。
We propose CProto-constrained Self-Training (CST) for unsupervised 3D detection. It improves the recognition and localization accuracy of sparse objects, boosting detection performance significantly.
2. Related Work
Fully/weakly supervised 3D object detection. Recent fully supervised 3D detectors build single-stage [8, 10, 27, 39, 48, 49], two-stage [4, 20-22, 31-33, 37], or multi-stage [2, 30] deep networks for 3D object detection. However, these methods heavily rely on a large amount of precise annotations. Some weakly supervised methods replace box annotation with low-cost click annotation [17]. Other methods decrease the supervision by annotating only a part of the scenes [3, 26, 45, 46] or a part of the instances [34]. Unlike all of the above works, we aim to design a 3D detector that does not require human-level annotations.
Unsupervised 3D object detection. Previous unsupervised pre-training methods discern latent patterns within unlabeled data by masked labels [36] or contrastive losses [14, 38], but these methods require human labels for fine-tuning. Traditional methods [1, 19, 24] employ ground removal and clustering for 3D object detection without human labels, but suffer from poor detection performance. Some deep learning-based methods generate pseudo-labels by clustering and use the pseudo-labels to iteratively train a 3D detector [40]. The recent OYSTER [41] improves pseudo-label quality with temporal consistency. However, most pseudo-labels of incomplete objects cannot be recovered by temporal consistency. Our CPD addresses this problem by leveraging the geometry prior from CProto to refine the pseudo-labels and guide the network convergence.
Prototype-based methods. Prototype-based methods are widely used in 2D detection [11, 12, 16, 29, 44] when novel classes are incorporated. Inspired by these methods, we construct a commonsense prototype (CProto) set without any annotations and use it to refine pseudo-labels and guide detector training.
This paper introduces the Commonsense Prototype-based Detector (CPD), a novel approach for unsupervised 3D object detection. As shown in Fig. 3, CPD consists of three main parts: (1) initial label generation; (2) label refinement; and (3) self-training. We detail the designs as follows.
3.1. Initial Label Generation
Recent unsupervised methods [40, 41] detect 3D objects in a class-agnostic way. How to classify objects (e.g., vehicle and pedestrian) without annotation is still an unsolved challenge. Our observations indicate that some stationary objects in consecutive frames appear more complete (see Fig. 2 (f)) and can be classified by predefined sizes. This motivates us to design a Multi-Frame Clustering (MFC) method to generate initial labels. MFC involves motion artifact removal, clustering, and post-processing.
Motion Artifact Removal (MAR). Directly transforming and concatenating $2n+1$ consecutive frames $\{\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{n}\}$ (i.e., the past $n$, the future $n$, and the current frame) into a single point cloud $\boldsymbol{x}_{0}^{*}$ introduces motion artifacts from moving objects, leading to increased label errors as $n$ grows (see Fig. 4 (a)). To mitigate this issue, we first transform the consecutive frames into the global coordinate system and compute the Persistence Point Score (PPScore) [40] over consecutive frames to identify points in motion. We keep all the points from $\boldsymbol{x}_{0}$ and remove moving points from the other frames $\boldsymbol{x}_{-n}, \ldots, \boldsymbol{x}_{-1}, \boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}$. After this removal, we concatenate the frames to obtain the dense point cloud $\boldsymbol{x}_{0}^{*}$.
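As a concrete illustration, a minimal sketch of this step is given below; the function and variable names are ours, and a PPScore-style per-point motion mask is assumed to be available as input.

```python
import numpy as np

def remove_motion_artifacts(frames, poses, moving_masks, current_idx):
    """Sketch of Motion Artifact Removal (MAR).

    frames:       list of (N_i, 3) arrays, one LiDAR sweep each.
    poses:        list of (4, 4) ego poses mapping each sweep to the global frame.
    moving_masks: list of (N_i,) boolean arrays marking points judged to be in
                  motion (e.g., by a PPScore-style persistence test).
    current_idx:  index of the current frame x_0 within `frames`.
    """
    merged = []
    for i, (pts, pose, moving) in enumerate(zip(frames, poses, moving_masks)):
        # Keep every point of the current frame; drop moving points elsewhere.
        keep = np.ones(len(pts), dtype=bool) if i == current_idx else ~moving
        pts = pts[keep]
        # Transform the kept points into the shared global coordinate system.
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])
        merged.append((pts_h @ pose.T)[:, :3])
    return np.concatenate(merged, axis=0)  # the dense point cloud x_0^*
```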
Clustering and post-processing. In line with a recent study [41], we apply ground removal [9], DBSCAN [5], and bounding box fitting [43] on $\boldsymbol{x}_{0}^{*}$ to obtain a set of class-agnostic bounding boxes $\hat{\boldsymbol{b}}$. We observe that objects of the same class typically have similar sizes in 3D space. Therefore, we pre-define class-specific size thresholds (e.g., the length of a vehicle is generally larger than 0.5 m) based on human commonsense to classify $\hat{\boldsymbol{b}}$ into different categories. We then apply class-agnostic tracking to associate small background objects with foreground trajectories, and enhance the consistency of objects' sizes by using temporal coherency [41]. This process results in a set of initial pseudo-labels $\boldsymbol{b}=\{b_{j}\}_{j}$, where $b_{j}=[x, y, z, l, w, h, \alpha, \beta, \tau]$ represents the position, length, width, height, azimuth angle, class identity, and tracking identity, respectively.
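The clustering and commonsense size classification can be sketched as follows; the thresholds and helper names are illustrative assumptions, and an axis-aligned extent stands in for the box fitting of [43].

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical commonsense size rules (metres), for illustration only.
SIZE_RULES = {
    "Vehicle":    lambda l, w, h: l > 2.5,
    "Cyclist":    lambda l, w, h: 1.2 < l <= 2.5 and h > 1.0,
    "Pedestrian": lambda l, w, h: l <= 1.2 and h > 1.0,
}

def cluster_and_classify(non_ground_points, eps=0.7, min_samples=5):
    """Cluster the ground-removed point cloud and assign commonsense classes."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(non_ground_points[:, :2])
    boxes = []
    for cid in np.unique(labels):
        if cid < 0:                       # DBSCAN noise points
            continue
        pts = non_ground_points[labels == cid]
        mins, maxs = pts.min(axis=0), pts.max(axis=0)
        # Axis-aligned extent as a stand-in for L-shape box fitting.
        l, w = sorted(maxs[:2] - mins[:2], reverse=True)
        h = maxs[2] - mins[2]
        for cls, rule in SIZE_RULES.items():
            if rule(l, w, h):
                boxes.append({"center": (mins + maxs) / 2, "lwh": (l, w, h), "class": cls})
                break
    return boxes
```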
As noted in Section 1, initial labels for incomplete objects often suffer from inaccuracies in sizes and positions. To address this, we score the initial labels, build a CProto set from the high-quality ones, and use it to refine the remaining labels, as described below.
Completeness and Size Similarity (CSS) scoring. Existing label scoring methods such as IoU scoring [21] are designed for fully supervised detectors. In contrast, we introduce an unsupervised Completeness and Size Similarity (CSS) scoring method. It aims to approximate the IoU score using commonsense knowledge alone (see Fig. 5).
Distance score. CSS first assesses object completeness based on distance, assuming that labels closer to the ego vehicle are likely to be more accurate. For an initial label $b_{j}$, we normalize its distance to the ego vehicle within the range $[0,1]$ to compute the distance score as
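For concreteness, one formulation consistent with this description (our reconstruction; the exact normalization may differ) is

$$\psi^{dis}\left(b_{j}\right)=1-\mathcal{N}\left(\left\|c_{j}\right\|_{2}\right),$$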
where $\mathcal{N}$ is the normalization function and $c_{j}$ is the location of $b_{j}$. However, this distance-based approach has its limitations. For example, occluded objects near the ego vehicle, which should receive lower scores, are inadvertently assigned high scores due to their proximity. To mitigate this issue, we introduce a Multi-Level Occupancy (MLO) score, further detailed in Fig. 4 (b).
MLO score. Considering the diverse sizes of objects, we divide the bounding box of the initial label into multiple grids with different length and width resolutions. The MLO score is then calculated as the proportion of grids occupied by cluster points, via
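One form consistent with this description and the definitions below (our reconstruction) is

$$\psi^{mlo}\left(b_{j}\right)=\frac{1}{N^{o}} \sum_{k=1}^{N^{o}} \frac{O^{k}}{r^{k}},$$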
where $N^{o}$ denotes the number of resolutions, $O^{k}$ is the number of occupied grids at the $k$-th resolution, and $r^{k}$ is the number of grids at the $k$-th resolution.
Figure 5. Completeness and size similarity scoring.

Size Similarity (SS) score. While the distance and MLO scores effectively evaluate the localization and size quality, they fall short in assessing classification quality. To bridge this gap, we introduce the SS score. This score utilizes a class-specific template box $a$ (the average size of typical objects from Wikipedia) and calculates a truncated KL divergence [7]. Note that this score is determined by the ratio difference rather than the specific values. Simple commonsense $l, w, h$ ratios (2:1:1 for Vehicle, 1:1:2 for Pedestrian, 2:1:2 for Cyclist) can also be used here.
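One illustrative instance of such a truncated-KL-based score (the exact truncation used may differ) is

$$\psi^{ss}\left(b_{j}\right)=1-\min \left(1, \sum_{\sigma} q_{\sigma}^{a} \log \frac{q_{\sigma}^{a}}{q_{\sigma}^{b}}\right),$$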
where $q_{\sigma}^{a} \in\{l^{a}, w^{a}, h^{a}\}$ and $q_{\sigma}^{b} \in\{l^{b}, w^{b}, h^{b}\}$ refer to the normalized length, width, and height of the template and the label, respectively.
We linearly combine the three metrics, $\mathcal{S}(b_{j})=\sum_{i} \omega^{i} \psi^{i}(b_{j})$, to produce the final score, where $\omega^{i}$ is a weighting factor (in this study we adopt a simple average, $\omega^{i}=1/3$). For each $b_{j} \in \boldsymbol{b}$, we compute its CSS score $s_{j}^{css}=\mathcal{S}(b_{j})$ and obtain a set of scores $s=\{s_{j}^{css}\}_{j}$.
CProto set construction. Regular learnable prototype-based methods require annotations [13, 47], which are unavailable in the unsupervised setting. We construct a high-quality CProto set $\boldsymbol{P}=\{P_{k}\}_{k}$, representing geometry and size centers, based on the unsupervised CSS score. Here, $P_{k}=\{x_{k}^{p}, b_{k}^{p}\}$, where $x_{k}^{p}$ indicates the inside points and $b_{k}^{p}$ refers to the bounding box. Specifically, we first categorize the initial labels $\boldsymbol{b}$ into different groups based on their tracking identity $\tau$. Within each group, we select the high-quality boxes and inside points that meet a high CSS score threshold $\eta$ (determined on the validation set; 0.8 in this study). Then, we transform all points and boxes into a local coordinate system, and obtain $b_{k}^{p}$ by averaging the high-quality boxes and $x_{k}^{p}$ by concatenating all the points.
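A minimal sketch of this construction, assuming boxes and points have already been transformed into a shared local frame and using our own helper names, is shown below.

```python
import numpy as np

def build_cproto_set(labels, css_scores, eta=0.8):
    """Sketch of CProto set construction.

    labels:     list of dicts with keys 'box' (x, y, z, l, w, h, yaw in a shared
                local frame), 'points' (object points, local frame), and
                'track_id'; this layout is an assumption for illustration.
    css_scores: CSS scores aligned with `labels`.
    eta:        CSS threshold for selecting high-quality labels.
    """
    groups = {}
    for lab, score in zip(labels, css_scores):
        groups.setdefault(lab["track_id"], []).append((lab, score))

    cproto_set = []
    for track_id, members in groups.items():
        good = [lab for lab, score in members if score >= eta]
        if not good:
            continue
        # Size/geometry centre: average the high-quality boxes ...
        proto_box = np.mean([lab["box"] for lab in good], axis=0)
        # ... and concatenate their points into one dense geometric template.
        proto_points = np.concatenate([lab["points"] for lab in good], axis=0)
        cproto_set.append({"box": proto_box, "points": proto_points, "track_id": track_id})
    return cproto_set
```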
Figure 6. The size of the initial label is replaced by the CProto box, and the position is also corrected.
CProto-constrained Box Regularization (CBR). CBR refines the initial pseudo-labels in two steps. (1) Re-size. We directly replace the size of $b_{j}$ with the length, width, and height of $b_{k}^{p} \in P_{k}$. (2) Re-location. Since points mostly lie on the object's surface and boundary, we divide the object into different bins and align the box boundary and orientation to the boundary point of the densest part (see Fig. 6). Finally, we obtain the improved pseudo-labels $\boldsymbol{b}^{*}=\{b_{j}^{*}\}_{j}$.
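The two CBR steps can be sketched as follows; the bin handling is simplified to the length axis, and all names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def cbr_refine(label_box, local_points, proto_box, num_bins=8):
    """Sketch of CProto-constrained Box Regularization (CBR).

    label_box / proto_box: (x, y, z, l, w, h, yaw) arrays; local_points are the
    object's points expressed in the label box's local frame (x along length).
    """
    refined = label_box.copy()
    # (1) Re-size: directly adopt the CProto length, width, and height.
    refined[3:6] = proto_box[3:6]
    # (2) Re-location: locate the densest bin along the length axis and keep the
    # observed boundary of that side on the face of the (re-sized) box.
    hist, _ = np.histogram(local_points[:, 0], bins=num_bins)
    densest_on_front = np.argmax(hist) >= num_bins // 2
    boundary = local_points[:, 0].max() if densest_on_front else local_points[:, 0].min()
    half_len = refined[3] / 2.0
    offset_x = boundary - half_len if densest_on_front else boundary + half_len
    # Rotate the local-frame offset back into the world frame and shift the centre.
    yaw = label_box[6]
    refined[0] += offset_x * np.cos(yaw)
    refined[1] += offset_x * np.sin(yaw)
    return refined
```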
Recent methods [40, 41] utilize pseudo-labels for training 3D detectors. However, even after refinement, some pseudo-labels remain inaccurate, diminishing the effectiveness of correct supervision and potentially misleading the training process. To tackle these issues, we propose two designs: (1) a CSS-weighted detection loss, which assigns different training weights based on label quality to suppress false supervision signals; and (2) a geometry contrast loss, which aligns predictions of sparsely scanned points with the dense CProto, thereby improving feature consistency.
CSS weight. Considering that false pseudo-labels may mislead the network convergence, we first calculate a loss weight based on label quality. Formally, we convert the CSS score $s_{i}^{css}$ of a pseudo-label to
$$\omega_{i}= \begin{cases}0 & s_{i}^{css}<S_{L} \\ \dfrac{s_{i}^{css}-S_{L}}{S_{H}-S_{L}} & S_{L}<s_{i}^{css}<S_{H} \\ 1 & s_{i}^{css}>S_{H}\end{cases}$$
where $S_{H}$ and $S_{L}$ are the high/low-quality thresholds (we empirically set them to 0.7 and 0.4, respectively).
CSS-weighted detection loss. To decrease the influence of false labels, we formulate the CSS-weighted detection loss to refine the $N$ proposals:
$$\mathcal{L}_{det}^{css}=\frac{1}{N} \sum_{i} \omega_{i}\left(\mathcal{L}_{i}^{pro}+\mathcal{L}_{i}^{det}\right)$$
where $\mathcal{L}_{i}^{pro}$ and $\mathcal{L}_{i}^{det}$ are the detection losses [4] of $\mathcal{F}^{pro}$ and $\mathcal{F}^{det}$, respectively. The losses are calculated from the pseudo-labels $\boldsymbol{b}^{*}$ and the network predictions.
Geometry contrast loss. We formulate two contrast losses that minimize the feature and predicted-box differences between the prototype and detection networks. (1) Feature contrast loss. For a foreground RoI $r_{i}$ from the detection network, we extract features $\boldsymbol{f}_{i}^{p}$ from the prototype network by voxel set abstraction [4] and features $\boldsymbol{f}_{i}^{d}$ from the detection network. We then formulate the contrast loss by cosine distance:
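For illustration, a cosine-distance form consistent with this description (an assumed reconstruction) is

$$\mathcal{L}_{feat}=\frac{1}{N^{f}} \sum_{i}\left(1-\frac{\boldsymbol{f}_{i}^{p} \cdot \boldsymbol{f}_{i}^{d}}{\left\|\boldsymbol{f}_{i}^{p}\right\|\left\|\boldsymbol{f}_{i}^{d}\right\|}\right),$$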
where $N^{f}$ is the number of foreground proposals. (2) Box contrast loss. For a box prediction $d_{i}^{p}$ from the prototype network and a box prediction $d_{i}^{d}$ from the detection network, we formulate the box contrast loss by IoU, location difference, and angle difference:
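An illustrative combination of these three terms would be

$$\mathcal{L}_{box}=\frac{1}{N^{f}} \sum_{i}\left[\left(1-I\left(d_{i}^{p}, d_{i}^{d}\right)\right)+\left\|c_{i}^{p}-c_{i}^{d}\right\|+\left|\alpha_{i}^{p}-\alpha_{i}^{d}\right|\right],$$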
where $I$ denotes the IoU function, $c_{i}^{d}, \alpha_{i}^{d}$ refer to the position and angle of $d_{i}^{d}$, and $c_{i}^{p}, \alpha_{i}^{p}$ refer to the position and angle of $d_{i}^{p}$. We finally sum all losses to train the detector.
4. Experiments

4.1. Datasets
Waymo Open Dataset (WOD). We conducted extensive experiments on the WOD [25] due to its diverse scenes. The WOD contains 798, 202, and 150 sequences for training, validation, and testing, respectively. We adopted similar metrics (3D AP L1 and L2) as fully/weakly supervised methods [31, 34]. No annotations were used for training.
PandaSet dataset. To compare with recent unsupervised methods [41], we also conducted experiments on PandaSet [35]. Like [41], we split the dataset into 73 training and 30 validation snippets and use class-agnostic BEV AP and recall metrics with 0.3, 0.5, and 0.7 IoU thresholds.
KITTI dataset. Since the KITTI detection dataset [6] does not provide consecutive frames, we only tested our method on the 3769-frame val split [4]. We used similar metrics (Car 3D AP R40 with 0.5 and 0.7 IoU thresholds) as employed in fully/weakly supervised methods [32, 34].
Table 1. Unsupervised 3D object detection results on the WOD validation set, reported as 3D AP L1 and L2 ($IoU_{0.7, 0.5, 0.5}$) per method. The results of previous methods are reproduced by us.
Table 3. The class-agnostic comparison results on the PandaSet dataset, evaluated on the 0-80 m detection range.
4.2. Implementation Details
Network details. Both the prototype and detection networks adopt the same 3D backbone as CenterPoint [39] and the same RoI refinement network as Voxel-RCNN [4]. For the WOD and KITTI datasets, we use the same detection range and voxel size as CenterPoint [39]. For PandaSet, we use the same detection range as OYSTER [41].
Training details. We adopt the widely used global scaling and rotation data augmentation. We trained our network on 8 Tesla V100 GPUs with the ADAM optimizer, using a learning rate of 0.003 with a one-cycle learning rate schedule, and trained CPD for 20 epochs.
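For reference, the reported training hyper-parameters can be summarized in a small configuration sketch (the key names are ours, not from an actual configuration file):

```python
# Illustrative summary of the training setup described above.
TRAIN_CFG = {
    "gpus": 8,                      # Tesla V100
    "optimizer": "adam",
    "learning_rate": 0.003,
    "lr_schedule": "one-cycle",
    "epochs": 20,
    "augmentations": ["global_scaling", "global_rotation"],
}
```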
4.3. Comparison with Unsupervised Detectors
Results on WOD. The results on the WOD validation set and test set are presented in Table 1 and Table 2. All methods use identical size thresholds to define the object classes and use a single traversal. Our method significantly outperforms existing unsupervised methods. Notably, under 3D AP L2 with IoU thresholds of 0.7, 0.5, and 0.5, our CPD outperforms OYSTER [41] by 18.03%, 13.08%, and 4.55% on Vehicle, Pedestrian, and Cyclist, respectively. These advancements come from our MFC, CBR, and CST designs, which yield superior pseudo-labels and enhanced detection accuracy. CPD also surpasses the Proto-vanilla method, which uses class-specific prototypes [23].
Table 4. Car detection comparison with fully/weakly supervised detectors on the KITTI val set. The models are trained on WOD.
Table 5. Vehicle detection comparison with fully/weakly supervised detectors on the WOD validation set.
Results on PandaSet. The class-agnostic results on PandaSet are presented in Table 3. Our method outperforms OYSTER by 6.5% AP and 9.3% recall under the 0.7 IoU threshold. This improvement is largely due to our CPD's enhanced label quality. Unlike OYSTER, which suffers from the misleading effects of false labels during training, our CPD leverages the size prior from CProto to significantly improve these labels.
4.4. Comparison with Fully/Weakly Supervised Detectors
Results on KITTI dataset. To further validate our method, we pre-trained our CPD, along with OYSTER [41] and MODEST [40], on WOD and tested them on the KITTI dataset using Statistical Normalization (SN) [28]. The car detection results are in Table 4. We first compared our method with a sparsely supervised method (weakly supervised with 2% labels) [34] that annotates a single instance per frame for training. Our unsupervised CPD outperforms this sparsely supervised method by 23.52% 3D AP at $IoU_{0.7}$ on the moderate car class. Additionally, our method attains 90.85% and 81.01% 3D AP for the easy and moderate car classes at a 0.5 IoU threshold. Notably, this performance is comparable to that of the fully supervised CenterPoint [39], demonstrating the advancement of our method.

Figure 7. (a-c) IoU distribution between pseudo-labels and ground truth. (d-f) Mean absolute error associated with the size, position, and angle of pseudo-labels generated by different methods.
Results on WOD. We also compared our method with fully/weakly supervised methods on the WOD validation set [25]. The vehicle detection results are in Table 5. Our unsupervised CPD outperforms the sparsely supervised method (2% annotations) by 5.25% and 4.16% in terms of 3D AP L1 and L2, respectively.
Components analysis of CPD. To evaluate the individual contributions of our designs, we incrementally added each component and assessed their impact on vehicle detection using the WOD validation set. The results are shown in Table 7. Our MFC method surpasses Single-Frame Clustering (SFC) by 2.52% AP, attributed to the more complete point representation of objects across consecutive frames compared to a single frame. The CBR further enhances performance by 19.27% AP, as it reduces size and location errors in pseudo-labels. The CST contributes an 8.09% increase in AP, demonstrating the effectiveness of the geometric features from CProto in detecting sparse objects.
Frame number of MFC. To examine the effect of frame count on initial pseudo-label quality, we experimented with different numbers of past and future point cloud frames on the WOD validation set. The BEV results, shown in Fig. 8 (a)(b), indicate optimal performance with [-5, 5] frames (five past, five future, and the current frame). Additional frames did not significantly improve recall or precision. Consequently, we used 11 frames for initial pseudo-label generation in this study.
Component analysis of CSS scoring. To assess the effectiveness of our scoring system, we calculated the BEV AP of initial pseudo-labels under different scores.
Figure 9. Visualization comparison of different detection results on the WOD validation set.
Components analysis of CBR. To evaluate the impact of re-sizing and re-localization in CBR, we conducted experiments and analyzed pseudo-label performance. As shown in Table 9, re-sizing results in a 3.91% and 3.4% increase in BEV recall at the 0.5 and 0.7 IoU thresholds, respectively; re-localization further enhances recall by 12.68% and 6.43% at these thresholds, while also increasing precision. These results indicate the importance of both components, which effectively refine the pseudo-labels.
Components analysis of CST. To assess the effectiveness of each component in CST, we established a baseline using only CBR-generated pseudo-labels for training a two-stage CenterPoint detector, then incrementally added our loss components and evaluated vehicle detection performance on the WOD validation set. As shown in Table 10, all loss components contribute to performance improvement. Specifically, our $\mathcal{L}_{det}^{css}$ mitigates the influence of false pseudo-labels using the CSS weight and improves the 3D AP L2 at $IoU_{0.7}$ by 4.79%. Our $\mathcal{L}_{feat}^{css}$ and $\mathcal{L}_{box}^{css}$ improve the 3D AP L2 at $IoU_{0.7}$ by 0.75% and 2.55%, respectively, by leveraging geometric knowledge from the dense CProto for more effective sparse object detection.
4.7. Visualization Comparison
To provide a more intuitive understanding of how our method improves detection performance, we visually compare our results with those of MODEST [40] and OYSTER [41], as shown in Fig. 9. MODEST often misses distant, sparse objects (Fig. 9(1.1)), while OYSTER detects them but inaccurately reports their sizes and positions (Fig. 9(2.1)). In contrast, CPD, using our CProto-based design, not only recognizes these objects but also accurately predicts their sizes and positions (Fig. 9(3.1)). Furthermore, since our CST reduces the influence of false pseudo-labels, the false positives (Fig. 9(3.2)) are also much fewer than those of the previous methods (Fig. 9(1.2)(2.2)).

Table 10. CST component analysis results on the WOD validation set (columns: CST components, 3D AP L1, and 3D AP L2).
5. Conclusion
This paper presents the CPD framework, a novel approach for accurate unsupervised 3D object detection. First, we develop an MFC method to generate initial pseudo-labels. Then, a CProto set is constructed using CSS scoring. Next, we introduce a CBR method to refine these pseudo-labels. Lastly, a CST is designed to enhance detection accuracy for sparse objects. Extensive experiments have verified the effectiveness of our design. Notably, for the first time, our unsupervised CPD method surpasses some weakly supervised methods, demonstrating the advancement of our approach.
Limitations. One notable limitation of our work is the significantly lower Average Precision (AP) for minority classes, such as cyclists (Table 1), compared to more prevalent classes like vehicles. This disparity is largely due to the scarce instances of these minority classes within the dataset. Future efforts to collect such objects could be a promising avenue to tackle this issue.
Acknowledgements. This work was supported in part by the National Natural Science Foundation of China (No. 62171393) and the Fundamental Research Funds for the Central Universities (No. 20720220064).
References
[1] Asma Azim and Olivier Aycard. Detection, classification and tracking of moving objects in a 3d environment. In 2012 IEEE Intelligent Vehicles Symposium, pages 802-807. IEEE, 2012.
[2] Qi Cai, Yingwei Pan, Ting Yao, and Tao Mei. 3d cascade rcnn: High quality object detection in point clouds. IEEE Transactions on Image Processing, 31:5706-5719, 2022.
[3] Zehui Chen, Zhenyu Li, Shuo Wang, Dengpan Fu, and Feng Zhao. Learning from noisy data for semi-supervised 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6929-6939, 2023.
[4] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996.
[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354-3361, 2012.
[7] Goldberger, Gordon, and Greenspan. An efficient image similarity measure based on approximations of kl-divergence between two gaussian mixtures. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 487-493. IEEE, 2003.
[8] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11873-11882, 2020.
[9] Michael Himmelsbach, Felix V Hundelshausen, and H-J Wuensche. Fast segmentation of 3d point clouds for ground vehicles. In 2010 IEEE Intelligent Vehicles Symposium, pages 560-565. IEEE, 2010.
[10] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12697-12705, 2019.
[11] Huifang Li, Yidong Li, Yuanzhouhan Cao, Yushan Han, Yi Jin, and Yunchao Wei. Weakly supervised object detection with class prototypical network. IEEE Transactions on Multimedia, 2022.
[12] Tianqin Li, Zijie Li, Harold Rockwell, Amir Farimani, and Tai Sing Lee. Prototype memory and attention mechanisms for few shot image generation. In Proceedings of the Eleventh International Conference on Learning Representations, 2022.
[13] Ziyu Li, Jingming Guo, Tongtong Cao, Liu Bingbing, and Wankou Yang. Gpa-3d: Geometry-aware prototype alignment for unsupervised domain adaptive 3d object detection from point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6394-6403, 2023.
[14] Hanxue Liang, Chenhan Jiang, Dapeng Feng, Xin Chen, Hang Xu, Xiaodan Liang, Wei Zhang, Zhenguo Li, and Luc Van Gool. Exploring geometry-aware contrast and clustering harmonization for self-supervised 3d object detection. In International Conference on Computer Vision (ICCV), pages 3273-3282, 2021.
[15] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. arXiv, 2022.
[16] Xiaonan Lu, Wenhui Diao, Yongqiang Mao, Junxi Li, Peijin Wang, Xian Sun, and Kun Fu. Breaking immutable: Information-coupled prototype elaboration for few-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1844-1852, 2023.
[17] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool. Towards a weakly supervised framework for 3d point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4454-4468, 2022.
[18] Xidong Peng, Xinge Zhu, and Yuexin Ma. Cl3d: Unsupervised domain adaptation for cross-lidar 3d detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2047-2055, 2023.
[19] Gheorghii Postica, Andrea Romanoni, and Matteo Matteucci. Robust moving objects detection in lidar data exploiting visual cues. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1093-1098. IEEE, 2016.
[20] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-779, 2019.
[21] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10526-10535, 2020.
[22] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43:2647-2664, 2021.
[23] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 2017.
[24] Muhammad Sualeh and Gon-Woo Kim. Dynamic multi-lidar based multiple object detection and tracking. Sensors, 19(6):1474, 2019.
[25] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2443-2451, 2020.
[26] He Wang, Yezhen Cong, Or Litany, Yue Gao, and Leonidas J. Guibas. 3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 14610-14619, 2021.
[27] Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, and Liwei Wang. DSVT: Dynamic sparse voxel transformer with rotated sets. In CVPR, 2023.
[28] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713-11723, 2020.
[29] Aming Wu, Yahong Han, Linchao Zhu, and Yi Yang. Universal-prototype enhancing for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9567-9576, 2021.
[30] Hai Wu, Jinhao Deng, Chenglu Wen, Xin Li, and Cheng Wang. Casa: A cascade attention network for 3d object detection from lidar point clouds. IEEE Transactions on Geoscience and Remote Sensing, 2022.
[31] Hai Wu, Chenglu Wen, Wei Li, Ruigang Yang, and Cheng Wang. Learning transformation-equivariant features for 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[32] Hai Wu, Chenglu Wen, Shaoshuai Shi, Xin Li, and Cheng Wang. Virtual sparse convolution for multimodal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[33] Xiaopei Wu, Liang Peng, Honghui Yang, Liang Xie, Chenxi Huang, Chengqi Deng, Haifeng Liu, and Deng Cai. Sparse fuse dense: Towards high quality 3d detection with depth completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[34] Qiming Xia, Jinhao Deng, Chenglu Wen, Hai Wu, Shaoshuai Shi, Xin Li, and Cheng Wang. Coin: Contrastive instance feature mining for outdoor 3d object detection with very limited annotations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[35] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, and Diange Yang. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021.
[36] Honghui Yang, Tong He, Jiaheng Liu, Huaguan Chen, Boxi Wu, Binbin Lin, Xiaofei He, and Wanli Ouyang. Gd-mae: Generative decoder for mae pre-training on lidar point clouds. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9403-9414, 2023.
[37] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1951-1960, 2019.
[38] Junbo Yin, Dingfu Zhou, Liangjun Zhang, Jin Fang, Cheng-Zhong Xu, Jianbing Shen, and Wenguan Wang. Proposalcontrast: Unsupervised pre-training for lidar-based 3d object detection. In European Conference on Computer Vision, pages 17-33, 2022.
[39] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[40] Yurong You, Katie Luo, Cheng Perng Phoo, Wei-Lun Chao, Wen Sun, Bharath Hariharan, Mark E. Campbell, and Kilian Q. Weinberger. Learning to detect mobile objects from lidar scans without labels. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[41] Lunjun Zhang, Anqi Joyce Yang, Yuwen Xiong, Sergio Casas, Bin Yang, Mengye Ren, and Raquel Urtasun. Towards unsupervised object detection from lidar point clouds. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[42] Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, and Ryosuke Shibasaki. Unsupervised 3d category discovery and point labeling from a large urban environment. In 2013 IEEE International Conference on Robotics and Automation, pages 2685-2692. IEEE, 2013.
[43] Xiao Zhang, Wenda Xu, Chiyu Dong, and John M Dolan. Efficient l-shape fitting for vehicle detection using laser scanners. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 54-59. IEEE, 2017.
[44] Yixin Zhang, Zilei Wang, and Yushi Mao. Rpn prototype alignment for domain adaptive object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12425-12434, 2021.
[45] Zehan Zhang, Yang Ji, Wei Cui, Yulong Wang, Hao Li, Xian Zhao, Duo Li, Sanli Tang, Ming Yang, Wenming Tan, et al. Atf-3d: Semi-supervised 3d object detection with adaptive thresholds filtering based on confidence and distance. IEEE Robotics and Automation Letters, 7(4):10573-10580, 2022.
[46] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. Sess: Self-ensembling semi-supervised 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[47] Shizhen Zhao and Xiaojuan Qi. Prototypical votenet for few-shot 3d point cloud object detection. Advances in Neural Information Processing Systems, 35:13838-13851, 2022.
[48] Wu Zheng, Weiliang Tang, Li Jiang, and Chi-Wing Fu. Se-ssd: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 14494-14503, 2021.
[49] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4490-4499, 2018.
Hai Wu, Shijia Zhao, Qiming Xia, Chenglu Wen, and Cheng Wang were with the Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, 361005, P.R. China.
E-mail: clwen@xmu.edu.cn
Xin Li was with Visual Computing and Interactive Media, Texas A&M University.
Li Jiang was with the School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China.
* Corresponding author
Manuscript received April 19, 2005; revised August 26, 2015. 接收到的手稿于 2005 年 4 月 19 日;修订于 2015 年 8 月 26 日。