arXiv:2407.02394v2 [cs.CV]

Similarity Distance-Based Label Assignment for Tiny Object Detection

Shuohao Shi1, Qiang Fang1,∗, Tong Zhao1, Xin Xu
1National University of Defense Technology, Changsha, Hunan, China. e867cda2b@126.com, qiangfang@nudt.edu.cn, zhaotong@nudt.edu.cn, xinxu@nudt.edu.cn (∗corresponding author)
Abstract

Tiny object detection is becoming one of the most challenging tasks in computer vision because of the limited object size and lack of information. The label assignment strategy is a key factor affecting the accuracy of object detection. Although there are some effective label assignment strategies for tiny objects, most of them focus on reducing the sensitivity to the bounding boxes to increase the number of positive samples, and they require several fixed hyperparameters to be set. However, more positive samples do not necessarily lead to better detection results; in fact, excessive positive samples may lead to more false positives. In this paper, we introduce a simple but effective strategy named the Similarity Distance (SimD) to evaluate the similarity between bounding boxes. This proposed strategy not only considers both location and shape similarity but also learns hyperparameters adaptively, ensuring that it can adapt to different datasets and to the various object sizes within a dataset. Our approach can simply replace the IoU in common anchor-based detectors for label assignment and Non Maximum Suppression (NMS). Extensive experiments on four mainstream tiny object detection datasets demonstrate the superior performance of our method; in particular, it surpasses the state-of-the-art competitors on AI-TOD by 1.8 AP points overall and 4.1 AP points on very tiny objects. Code is available at: https://github.com/cszzshi/SimD.

I INTRODUCTION

With the popularization of drone technology and autonomous driving, applications of object detection are becoming increasingly widespread in daily life. General object detectors have achieved significant progress in both accuracy and detection speed. For example, the latest version of the YOLO series, YOLOv8, achieves a mean average precision (mAP) of 53.9 percent on the COCO detection dataset and takes only 3.53 ms to detect objects in an image when implemented on the NVIDIA A100 GPU using TensorRT. Nevertheless, despite this significant progress in general object detectors, when they are directly applied for tiny object detection tasks, their accuracy sharply decreases.

In a recent survey of small object detection, Cheng et al. [1] proposed dividing small objects into three categories (extremely, relatively and generally small) in accordance with their mean area. Two major challenges faced in tiny object detection are information loss and the lack of positive samples. There are many possible approaches to improve the accuracy of tiny object detection, such as feature fusion, data augmentation, and superresolution.

Figure 1: Comparison between traditional label assignment metrics and our SimD metric. The first row shows typical detection results achieved with these methods, and the second row presents diagrammatic sketches of these metrics. The Δw and Δh in SimD respectively represent the difference in width and height between the anchor and the ground truth. The green, blue and red boxes respectively denote true positive (TP), false positive (FP) and false negative (FN) predictions.

Because sufficiently numerous and high-quality positive samples are crucial for object detection, the label assignment strategy is a core factor affecting the final results. The smaller the bounding boxes are, the more sensitive the IoU metric becomes [2]; this is the main reason why tiny objects cannot be assigned as many positive samples as general objects. A simple comparison between traditional anchor-based and anchor-free metrics and our SimD metric is shown in Fig. 1.
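To make this sensitivity concrete, the short computation below compares the IoU of a 6×6 box and a 60×60 box, each matched against a copy of itself shifted by 3 pixels; the box sizes and the shift are illustrative values chosen for this sketch rather than taken from any dataset.

```python
def iou(b1, b2):
    # Boxes are (x1, y1, x2, y2); returns intersection over union.
    iw = max(0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 6, 6), (3, 0, 9, 6)))      # ~0.33: a 3-pixel shift is fatal for a 6x6 box
print(iou((0, 0, 60, 60), (3, 0, 63, 60)))  # ~0.90: the same shift barely matters at 60x60
```

With a typical positive threshold of 0.5, the tiny box would be rejected as a positive sample while the large box would easily pass.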

The current research on tiny object label assignment strategies mainly focuses on reducing the sensitivity to the bounding box size. From this perspective, Xu et al. [2] proposed using the Dot Distance (DotD) as the assignment metric in place of the IoU. Later, NWD [3] and RFLA [4] were proposed, which model the ground truth and anchor as Gaussian distributions and then use the distance between the two distributions to evaluate the two bounding boxes. These methods have indeed enabled considerable progress in label assignment, but they leave some problems unaddressed.

First, most of these methods focus on reducing the sensitivity to the bounding box size, thereby increasing the number of positive samples. However, as we know, excessive positive samples may have a detrimental impact on an object detector, leading to many false positives.

Figure 2: The processing flow of the SimD-based label assignment strategy. We first obtain the coordinates of the ground truth and anchors and then calculate the Similarity Distance (SimD) between the ground truth and each anchor. Subsequently, we follow the traditional label assignment strategy to obtain positive and negative samples in accordance with corresponding thresholds. For a ground truth that does not have any associated positive sample based on this strategy, we assign the anchor with the maximum SimD value as a positive sample, as long as this SimD value is larger than a minimum positive threshold.

Second, the essence of these evaluation metrics is to measure the similarity between bounding boxes. For anchor-based methods, this is the similarity between the ground truth and the anchor, which has two aspects: shape and location. However, some methods consider only the locations of the bounding boxes, while others consider both shape and location but introduce a hyperparameter that needs to be chosen.

Finally, although the object sizes in a tiny object detection dataset tend to be fairly similar, there are still differences in the scales of different objects. For example, the sizes of the objects in the AI-TOD dataset range from 2 to 64 pixels. The discrepancy is more pronounced in the VisDrone2019 dataset, which contains both small and general-sized objects. In fact, the smaller the size of object, the more difficult it is to obtain positive samples. Unfortunately, most of the existing methods may pay less attention to this problem.

In this paper, to solve these problems, we introduce a new evaluation metric to take the place of the traditional IoU; the processing flow of our method is shown in Fig. 2. The main contributions of this paper include the following:

•  We propose a simple but effective strategy named Similarity Distance (SimD) to evaluate the relationship between two bounding boxes. It not only considers both location and shape similarity but also can effectively adapt to different datasets and different object sizes in a dataset without the need to set any hyperparameter.

•  Extensive experiments prove the validity of our method. We use several general object detectors and simply replace the IoU-based assignment module with the proposed method based on our SimD metric; in this way, we achieve state-of-the-art performance on four mainstream tiny object detection datasets.

II RELATED WORK

In recent years, applications of object detection have become increasingly widespread in various industries. This technology offers considerable convenience. For example, rescue operations can be quickly carried out by identifying ground objects in remote sensing images. With the development of deep learning technology, especially the introduction of ResNet [5], the accuracy and speed of detection have significantly increased.

General object detectors can be divided into two categories: one-stage and two-stage detectors. A two-stage detector first generates a list of proposal regions and then determines the position and category of the object. Such algorithms include R-CNN [6], Fast R-CNN [7] and Faster R-CNN [8]. The structure of one-stage detectors is simpler. They can directly output the coordinates and category of an object from the input image. Some classic one-stage detectors include YOLO [9] and SSD [10].

II-A Tiny Object Detection

Despite the significant progress in object detection achieved with deep learning technology, the detection accuracy will sharply decrease when the objects to be detected are tiny. Small objects are usually defined as objects with sizes less than a certain threshold value. For example, in Microsoft COCO [11], if the area of an object is less than or equal to 1024, it is considered a small object. In many cases, however, the objects of interest are in fact much smaller than the above definition. For example, in the AI-TOD dataset, the average edge length of an object is only 12.8 pixels, far smaller than in other datasets.

As stated in a previous paper [1], due to the extremely small size of the objects of interest, there are three main challenges in tiny object detection. First, most object detectors use downsampling for feature extraction, which will result in a large amount of information loss for tiny objects. Second, due to the limited amount of valid information they contain, small objects are easily disturbed by noise. Finally, the smaller an object is, the more sensitive it is to changes in the bounding box [2]. Consequently, if we use traditional label assignment metrics, such as the IoU, GIoU [12], DIoU [13], and CIoU [13], for object detection, the number of positive samples obtained for tiny objects will be very small.

Many methods have been proposed to improve the accuracy and efficiency of tiny object detection. For example, from the perspective of data augmentation, Kisantal et al. [14] proposed increasing the number of training samples by copying tiny objects, randomly transforming the copies and then pasting the results into new positions in an image.

II-B Label Assignment Strategies

The label assignment strategy plays a significant role in object detection. Depending on whether each label is either strictly negative or strictly positive, such strategies can be divided into hard label assignment strategies and soft label assignment strategies. In a soft label assignment strategy, different weights are set for different samples based on the calculation results, examples include GFL [15], VFL [16], TOOD [17] and DW [18]. Hard label assignment strategies can be further divided into static and dynamic strategies depending on whether the thresholds for designating positive and negative samples are fixed. Static label assignment strategies include those based on the IoU and DotD [2] metrics as well as RFLA [4]. Examples of dynamic label assignment strategies include ATSS [19], PAA [20], OTA [21], and DSLA [22]. From another perspective, label assignment strategies can be divided into prediction-based and prediction-free strategies. A prediction-based method assigns a sample a positive/negative label based on the relationship between the ground-truth and predicted bounding boxes, while a prediction-free method assigns labels based only on the anchors or other existing information.

II-C Label Assignment Strategies for Tiny Objects

TABLE I: Comparison between existing label assignment methods for tiny objects and our method

Method | Formula | Insensitive | Comprehensive | Adaptive
DotD | $DotD = e^{-\frac{D}{S}}$, $D = \sqrt{(x_g - x_a)^2 + (y_g - y_a)^2}$, $S = \sqrt{\frac{\sum_{i=1}^{M}\sum_{j=1}^{N_i} w_{ij} \times h_{ij}}{\sum_{i=1}^{M} N_i}}$ | ✓ | ✗ | ✗
NWD | $NWD = e^{-\frac{W}{C}}$, $W = \sqrt{(x_g - x_a)^2 + (y_g - y_a)^2 + \left((w_g - w_a)^2 + (h_g - h_a)^2\right) \times 0.25}$ | ✓ | ✓ | ✗
RFLA | $RFD = \frac{1}{1 + RFDC}$, $RFDC = \frac{0.5\beta w_a^2}{w_g^2} + \frac{0.5\beta h_a^2}{h_g^2} + \frac{2(x_a - x_g)^2}{w_g^2} + \frac{2(y_a - y_g)^2}{h_g^2} + \ln\frac{w_g}{\beta w_a} + \ln\frac{h_g}{\beta h_a} - 1$ | ✓ | ✓ | ✗
SimD | $SimD = e^{-\left(sim_{location} + sim_{shape}\right)}$ | ✓ | ✓ | ✓

Although there have been many studies on label assignment strategies for object detection, most such strategies are designed for conventional datasets, with few specifically designed for tiny objects. When these traditional label assignment strategies are directly used for tiny object detection, they suffer a significant decrease in accuracy. To date, the label assignment strategies and metrics designed specifically for tiny objects mainly include S³FD [23], DotD [2], NWD-RKA [24] and RFLA [4].

In S³FD, the threshold value is first reduced (from 0.5 to 0.35) to obtain more positive samples for each ground truth and then is further lowered to 0.1 to obtain positive samples for those ground truths that were not addressed by the first threshold reduction. However, S³FD still uses the traditional IoU metric to compute the similarity between the ground truth and anchor. To overcome the weakness of the IoU metric, the novel DotD formula was introduced to reduce the sensitivity to the bounding box size. Based on this metric, more positive samples can be obtained for the ground truth. In NWD-RKA, a normalized Wasserstein distance is introduced as a replacement for the IoU, and a ranking-based strategy is used to assign the top-k samples as positive. RFLA explores the relationship between the ground truth and anchor from the perspective of the receptive field; on this basis, the ground truth and anchor are modeled as Gaussian distributions. Then, the distance between these two Gaussian distributions is calculated based on the Kullback-Leibler divergence (KLD), which is used in place of the IoU metric.

III Method

III-A Similarity Distance Between Bounding Boxes

One of the most important steps in label assignment is to calculate a value that reflects the similarity between different bounding boxes. Specifically, for an anchor-based label assignment strategy, the similarity between the anchors and ground truths must be quantified before assigning labels.

Common label assignment metrics, such as the IoU, GIoU [12], DIoU [13] and CIoU [13], are usually based on the overlap between the anchor and the ground truth. These metrics have the serious problem that if the overlap is zero, which is often the case for tiny objects, they may become invalid. Some more suitable methods use distance-based evaluation metrics or even use Gaussian distributions to model the ground truth and anchor, such as the DotD [2], NWD [3] and RFLA [4]. We present a simple comparison between the existing metrics and our SimD metric from three perspectives in Table I. For example, DotD only considers location similarity and may not adapt to different object sizes in a dataset, so it is neither comprehensive nor adaptive. NWD and RFLA are not adaptive because they have hyperparameters $C$ and $\beta$, respectively, that need to be set. Building on the existing approaches, we propose an adaptive method without any hyperparameters.

In this paper, we introduce a novel metric named Similarity Distance (SimD) to better reflect the similarity between different bounding boxes. The Similarity Distance is defined as follows:

$SimD = e^{-\left(sim_{location} + sim_{shape}\right)}$ (1)

$sim_{location} = \sqrt{\left(\frac{x_g - x_a}{\frac{1}{m}\times\left(w_g + w_a\right)}\right)^2 + \left(\frac{y_g - y_a}{\frac{1}{n}\times\left(h_g + h_a\right)}\right)^2}$ (2)

$sim_{shape} = \sqrt{\left(\frac{w_g - w_a}{\frac{1}{m}\times\left(w_g + w_a\right)}\right)^2 + \left(\frac{h_g - h_a}{\frac{1}{n}\times\left(h_g + h_a\right)}\right)^2}$ (3)

where $m$ and $n$ are defined as follows:

$m = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N_i}\sum_{k=1}^{Q_i}\frac{\left|x_{ij} - x_{ik}\right|}{w_{ij} + w_{ik}}}{\sum_{i=1}^{M} N_i \times Q_i}$ (4)

$n = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N_i}\sum_{k=1}^{Q_i}\frac{\left|y_{ij} - y_{ik}\right|}{h_{ij} + h_{ik}}}{\sum_{i=1}^{M} N_i \times Q_i}$ (5)

The SimD contains two parts: location similarity, $sim_{location}$, and shape similarity, $sim_{shape}$. As shown in (2), $(x_g, y_g)$ and $(x_a, y_a)$ respectively represent the center coordinates of the ground truth and the anchor, and $w_g$, $w_a$, $h_g$, $h_a$ represent the widths and heights of the ground truth and the anchor. The main part of the formula is the distance between the center points of the ground truth and the anchor, similar to the DotD formula [2]. The difference is that we divide by the sum of the widths (or heights) of the two bounding boxes, scaled by the corresponding parameter, to eliminate the differences between bounding boxes of different sizes. This process is similar to the idea of normalization, which is also why our metric can easily adapt to different object sizes in a dataset. The shape similarity in (3) is defined analogously to (2).

The definitions of the two normalization parameters are shown in (4) and (5). Because they are quite similar, we take the parameter $m$ in (4) as an example for further discussion. $m$ is the average ratio of the distance in the $x$-direction to the sum of the two widths over all ground truth and anchor pairs in each image of the whole train set. $M$ represents the number of images in the train set, and $N_i$ and $Q_i$ represent the number of ground truths and anchors, respectively, in the $i$-th image. $x_{ij}$ and $x_{ik}$ respectively represent the $x$-coordinates of the center points of the $j$-th ground truth and the $k$-th anchor in the $i$-th image, and $w_{ij}$ and $w_{ik}$ represent their widths. Because the two normalization parameters are computed based on the train set, our metric can also automatically adapt to different datasets.
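As a concrete illustration, the two normalization parameters can be estimated with a single pass over the training annotations. The sketch below is a minimal reading of (4) and (5), assuming that the ground truths and anchors of each training image are available as NumPy arrays in (cx, cy, w, h) layout; it is not the released implementation.

```python
import numpy as np

def normalization_params(gt_boxes_per_image, anchors_per_image):
    """Estimate m and n of Eqs. (4)-(5): the average, over every ground
    truth / anchor pair in every training image, of the center offset
    divided by the sum of the corresponding box extents."""
    sum_x, sum_y, num_pairs = 0.0, 0.0, 0
    for gts, anchors in zip(gt_boxes_per_image, anchors_per_image):
        if len(gts) == 0 or len(anchors) == 0:
            continue
        gx, gy, gw, gh = [gts[:, k][:, None] for k in range(4)]      # N_i x 1
        ax, ay, aw, ah = [anchors[:, k][None, :] for k in range(4)]  # 1 x Q_i
        sum_x += (np.abs(gx - ax) / (gw + aw)).sum()
        sum_y += (np.abs(gy - ay) / (gh + ah)).sum()
        num_pairs += len(gts) * len(anchors)
    return sum_x / num_pairs, sum_y / num_pairs
```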

To facilitate label assignment, we use an exponential function to scale the value of SimD to the range between zero and one. If the two bounding boxes are identical, the values under the square roots are zero, so SimD equals one. If the two bounding boxes differ greatly, these values become very large, so SimD approaches zero.
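Putting (1)-(3) together, a minimal NumPy sketch of the metric might look as follows; boxes are assumed to be in (cx, cy, w, h) layout, and m and n are assumed to come from a routine such as normalization_params above (the numeric values in the usage example are placeholders, not estimates from a real train set).

```python
import numpy as np

def simd(gt, anchors, m, n):
    """Similarity Distance (Eqs. (1)-(3)) between one ground truth
    gt = (cx, cy, w, h) and an (A, 4) array of anchors."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchors.T
    # Dividing by (w_g + w_a) / m is the same as multiplying by m / (w_g + w_a).
    sim_location = np.sqrt((m * (gx - ax) / (gw + aw)) ** 2 +
                           (n * (gy - ay) / (gh + ah)) ** 2)
    sim_shape = np.sqrt((m * (gw - aw) / (gw + aw)) ** 2 +
                        (n * (gh - ah) / (gh + ah)) ** 2)
    return np.exp(-(sim_location + sim_shape))

# Toy usage: one 8x8 ground truth against a nearby and a distant 8x8 anchor.
gt = np.array([10.0, 10.0, 8.0, 8.0])
anchors = np.array([[13.0, 10.0, 8.0, 8.0],
                    [40.0, 40.0, 8.0, 8.0]])
print(simd(gt, anchors, m=0.5, n=0.5))  # the nearby anchor scores much higher
```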

III-B Similarity Distance-based Detectors

The novel SimD metric defined in (1) can well reflect the relationship between two bounding boxes and is easy to compute. Therefore, it can be used in place of the IoU in scenarios where the similarity between two bounding boxes needs to be calculated.

SimD-based label assignment. In traditional object detectors, such as Faster R-CNN [8], Cascade R-CNN [25] and DetectoRS [26], the label assignment strategy for the RPN and R-CNN models is MaxIoUAssigner. MaxIoUAssigner considers three threshold values: a positive threshold, a negative threshold and a minimum positive threshold. Anchors for which the IoU with respect to the ground truth is higher than the positive threshold are positive samples, those for which the IoU is lower than the negative threshold are negative samples, and those for which the IoU lies between the two thresholds are ignored. For tiny object detection, Xu et al. introduced the RKA [24] and HLA [4] label assignment strategies, which do not use fixed thresholds to divide positive and negative samples. In RKA, the top-k anchors associated with a ground truth are simply selected as positive samples; this strategy can increase the number of positive samples because the assignment of positive labels is not limited by a positive threshold. However, introducing too many low-quality positive samples may cause the detection accuracy to degrade.

In this paper, we follow the traditional MaxIoUAssigner strategy and simply use SimD in place of the IoU. The positive threshold, negative threshold, and minimum positive threshold are set to 0.7, 0.3, and 0.3, respectively. Our label assignment strategy is named MaxSimDAssigner.
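A simplified sketch of this rule is given below. It takes a precomputed SimD matrix between all ground truths and anchors (for instance built with the simd function sketched in Section III-A) and mirrors the threshold logic described above and in Fig. 2; the assigner used in our MMDetection-based experiments handles more corner cases, so this is only an illustration.

```python
import numpy as np

def assign_labels(sim, pos_thr=0.7, neg_thr=0.3, min_pos_thr=0.3):
    """sim is a (num_gts, num_anchors) SimD matrix.  Returns one label per
    anchor: the index of the matched ground truth, -1 for negative samples
    and -2 for ignored anchors."""
    best_gt = sim.argmax(axis=0)
    best_sim = sim.max(axis=0)

    labels = np.full(sim.shape[1], -2, dtype=int)   # ignored by default
    labels[best_sim < neg_thr] = -1                 # negatives
    pos = best_sim >= pos_thr
    labels[pos] = best_gt[pos]                      # positives

    # Low-quality matching (Fig. 2): a ground truth that received no positive
    # anchor keeps its best anchor, provided that anchor's SimD exceeds the
    # minimum positive threshold.
    for g in range(sim.shape[0]):
        if (labels == g).any():
            continue
        a = sim[g].argmax()
        if sim[g, a] >= min_pos_thr:
            labels[a] = g
    return labels
```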

SimD-based NMS. Non Maximum Suppression (NMS) is one of the most important components of postprocessing. Its purpose is to eliminate predicted bounding boxes that are repeatedly detected by preserving only the best detection result. In the traditional NMS procedure, first, the IoUs between the bounding box with the highest score and all other bounding boxes are computed. Then, bounding boxes with an IoU higher than a certain threshold will be eliminated. Considering the advantages of SimD, we can simply use it as the metric for NMS in place of the traditional IoU metric.
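As an illustration of this idea, a greedy NMS that ranks suppression by SimD instead of the IoU could be sketched as follows; the suppression threshold of 0.5 is only a placeholder, and the experiments in Section IV keep IoU-based NMS.

```python
import numpy as np

def simd_nms(boxes, scores, m, n, thr=0.5):
    """Greedy NMS on (N, 4) boxes in (cx, cy, w, h) layout: a candidate is
    suppressed when its SimD with an already-kept box exceeds thr."""

    def simd_one_vs_many(b, others):
        dx = m * (b[0] - others[:, 0]) / (b[2] + others[:, 2])
        dy = n * (b[1] - others[:, 1]) / (b[3] + others[:, 3])
        dw = m * (b[2] - others[:, 2]) / (b[2] + others[:, 2])
        dh = n * (b[3] - others[:, 3]) / (b[3] + others[:, 3])
        return np.exp(-(np.sqrt(dx ** 2 + dy ** 2) + np.sqrt(dw ** 2 + dh ** 2)))

    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        sim = simd_one_vs_many(boxes[i], boxes[rest])
        order = rest[sim <= thr]
    return keep
```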

IV Experiments

To verify the reliability of our proposed method, we designed a series of experiments involving the application of traditional object detectors to several open-source tiny object detection datasets.

IV-A Datasets

There are two main types of tiny object detection datasets: one type contains only small objects, such as AI-TOD [27], AI-TODv2 [24], and SODA-D [1], while the other contains both small and medium-sized objects, such as VisDrone2019 [28] and TinyPerson [29].

AI-TOD. AI-TOD (Tiny Object Detection in Aerial Images) is an aerial remote sensing small object detection dataset collected to address the shortage of available datasets for aerial image object detection tasks. It contains 28,036 images and 700,621 object instances divided into eight categories with accurate annotations. Due to the extremely small size of its object instances (the mean size is only 12.8 pixels), it can be effectively used to test the capabilities of tiny object detectors.

SODA-D. The SODA (Small Object Detection dAtasets) series includes two datasets: SODA-A and SODA-D. SODA-D was collected from MVD [30], which consists of images captured from streets, highways and other similar scenes. There are 25,834 extremely small objects (with areas ranging from 0 to 144) in SODA-D [1], making it an excellent benchmark for tiny object detection tasks.

VisDrone2019. VisDrone2019 is a dataset from the VisDrone Object Detection in Images Challenge. For this competition, 10,209 static images were captured by an unmanned drone in different locations at different heights and angles. VisDrone2019 is also an excellent dataset for evaluating tiny object detectors because it contains not only extremely small objects but also normally sized objects.

TABLE II: Main results on AI-TOD. All models are trained on the trainval set and tested on the test set. APvt, APt, APs and APm respectively represent the average precision for very tiny (2 to 8 pixels), tiny (8 to 16 pixels), small (16 to 32 pixels) and medium (32 to 64 pixels) objects.

Method | Backbone | AP | AP0.5 | AP0.75 | APvt | APt | APs | APm
TridentNet [33] | ResNet-50 | 7.5 | 20.9 | 3.6 | 1.0 | 5.8 | 12.6 | 14.0
Faster R-CNN [8] | ResNet-50 | 11.1 | 26.3 | 7.6 | 0.0 | 7.2 | 23.3 | 33.6
Cascade R-CNN [25] | ResNet-50 | 13.8 | 30.8 | 10.5 | 0.0 | 10.6 | 25.5 | 36.6
DetectoRS [26] | ResNet-50 | 14.8 | 32.8 | 11.4 | 0.0 | 10.8 | 18.3 | 38.0
DotD [2] | ResNet-50 | 16.1 | 39.2 | 10.6 | 8.3 | 17.6 | 18.1 | 22.1
DetectoRS w/ NWD [3] | ResNet-50 | 20.8 | 49.3 | 14.3 | 6.4 | 19.7 | 29.6 | 38.3
DetectoRS w/ RFLA [4] | ResNet-50 | 24.8 | 55.2 | 18.5 | 9.3 | 24.8 | 30.3 | 38.2
SSD-512 [10] | ResNet-50 | 7.0 | 21.7 | 2.8 | 1.0 | 4.7 | 11.5 | 13.5
RetinaNet [34] | ResNet-50 | 8.7 | 22.3 | 4.8 | 2.4 | 8.9 | 12.2 | 16.0
ATSS [19] | ResNet-50 | 12.8 | 30.6 | 8.5 | 1.9 | 11.6 | 19.5 | 29.2
RepPoints [35] | ResNet-50 | 9.2 | 23.6 | 5.3 | 2.5 | 9.2 | 12.9 | 14.4
AutoAssign [36] | ResNet-50 | 12.2 | 32.0 | 6.8 | 3.4 | 13.7 | 16.0 | 19.1
FCOS [37] | ResNet-50 | 12.6 | 30.4 | 8.1 | 2.3 | 12.2 | 17.2 | 25.0
M-CenterNet [27] | DLA-34 | 14.5 | 40.7 | 6.4 | 6.1 | 15.0 | 19.4 | 20.4
Faster R-CNN w/ SimD | ResNet-50 | 23.9 | 54.5 | 17.6 | 11.9 | 24.8 | 28.0 | 33.5
Cascade R-CNN w/ SimD | ResNet-50 | 25.0 | 53.2 | 19.1 | 13.2 | 25.8 | 29.0 | 35.7
DetectoRS w/ SimD | ResNet-50 | 26.6 | 55.9 | 21.2 | 13.4 | 27.5 | 30.9 | 37.8

IV-B Experimental Settings

In the following series of experiments, we use a computer with one NVIDIA RTX A6000 GPU and implement various models based on the object detection framework MMDetection [31] and PyTorch [32]. We use general object detectors such as Faster R-CNN, Cascade R-CNN and DetectoRS as the base models and simply replace the MaxIoUAssigner module with our SimD assignment module. Our method can work with any backbone network and anchor-based detector. Following mainstream settings, for all of the models, ResNet-50-FPN pretrained on ImageNet is used as the backbone, and stochastic gradient descent (SGD) is used as the optimizer, with a momentum of 0.9 and a weight decay of 0.0001. The batch size is set to 2, and the initial learning rate is 0.005. The number of RPN proposals is 3000 in both the training and testing stages. For the VisDrone2019 dataset, the number of epochs for training is set to 12, and the learning rate decays at the 8th and 11th epochs. For AI-TOD, AI-TODv2 and SODA-D, the number of training epochs is 24, and the learning rate decays at the 20th and 23rd epochs. For NMS, we use the IoU metric, and the IoU threshold is set to 0.7 for RPN and 0.5 for R-CNN. Other aspects of the configuration, such as the data preprocessing and pipeline, follow the defaults in MMDetection.
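For reference, the assigner swap described above could be expressed in an MMDetection-style config roughly as follows. The class name 'SimDAssigner' and its argument names are assumptions made for this sketch and may differ from the released code linked in the abstract.

```python
# Hypothetical MMDetection-style config fragment, not the released code.
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'

model = dict(
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='SimDAssigner',   # replaces type='MaxIoUAssigner'
                pos_thr=0.7,           # positive threshold
                neg_thr=0.3,           # negative threshold
                min_pos_thr=0.3))))    # minimum positive threshold

optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
```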

TABLE III: Main results on AI-TODv2. All of the settings are the same as those for AI-TOD.

Method | Backbone | AP | AP0.5 | AP0.75 | APvt | APt | APs | APm
TridentNet [33] | ResNet-50 | 10.1 | 24.5 | 6.7 | 0.1 | 6.3 | 19.8 | 31.9
Faster R-CNN [8] | ResNet-50-FPN | 12.8 | 29.9 | 9.4 | 0.0 | 9.2 | 24.6 | 37.0
Cascade R-CNN [25] | ResNet-50-FPN | 15.1 | 34.2 | 11.2 | 0.1 | 11.5 | 26.7 | 38.5
DetectoRS [26] | ResNet-50-FPN | 16.1 | 35.5 | 12.5 | 0.1 | 12.6 | 28.3 | 40.0
DetectoRS w/ NWD-RKA [24] | ResNet-50-FPN | 24.7 | 57.4 | 17.1 | 9.7 | 24.2 | 29.8 | 39.3
YOLOv3 [38] | DarkNet-53 | 4.1 | 14.6 | 0.9 | 1.1 | 4.8 | 7.7 | 8.0
RetinaNet [34] | ResNet-50-FPN | 8.9 | 24.2 | 4.6 | 2.7 | 8.4 | 13.1 | 20.4
SSD-512 [10] | VGG-16 | 10.7 | 32.5 | 4.0 | 2.0 | 8.7 | 16.8 | 28.0
RepPoints [35] | ResNet-50-FPN | 9.3 | 23.6 | 5.4 | 2.8 | 10.0 | 12.3 | 18.9
FoveaBox [39] | ResNet-50-FPN | 11.3 | 28.1 | 7.4 | 1.4 | 8.6 | 17.8 | 32.2
FCOS [37] | ResNet-50-FPN | 12.0 | 30.2 | 7.3 | 2.2 | 11.1 | 16.6 | 26.9
Grid R-CNN [40] | ResNet-50-FPN | 14.3 | 31.1 | 11.0 | 0.1 | 11.0 | 25.7 | 36.7
Faster R-CNN w/ SimD | ResNet-50-FPN | 24.5 | 56.3 | 17.2 | 10.3 | 24.9 | 28.9 | 35.4
Cascade R-CNN w/ SimD | ResNet-50-FPN | 25.2 | 55.2 | 19.1 | 9.7 | 25.3 | 30.3 | 37.4
DetectoRS w/ SimD | ResNet-50-FPN | 26.5 | 57.7 | 20.5 | 11.6 | 26.7 | 30.9 | 38.4

To facilitate comparison with previous research results, during the testing stage we use the AI-TOD benchmark evaluation metrics, which comprise the Average Precision (AP), AP0.5, AP0.75, APvt, APt, APs and APm, for AI-TOD, AI-TODv2 and VisDrone2019. For the SODA-D dataset, we use the COCO evaluation metrics.

TABLE IV: Main results on VisDrone2019. All models are trained on the train set and tested on the val set.

Method | AP | AP0.5 | APvt | APt
Faster R-CNN | 22.3 | 38.0 | 0.1 | 6.2
Cascade R-CNN | 22.5 | 38.5 | 0.5 | 6.8
DetectoRS | 25.7 | 41.7 | 0.5 | 7.6
DetectoRS w/ NWD-RKA | 27.4 | 46.2 | 4.4 | 12.6
DetectoRS w/ RFLA | 27.4 | 45.3 | 4.5 | 12.9
Faster R-CNN w/ SimD | 26.5 | 47.9 | 7.5 | 16.3
Cascade R-CNN w/ SimD | 27.7 | 48.6 | 6.8 | 16.1
DetectoRS w/ SimD | 28.7 | 50.3 | 7.6 | 17.3

IV-C Results

We design four groups of experiments on the AI-TOD, AI-TODv2, VisDrone2019 and SODA-D datasets. In each group, we replace the IoU metric with our SimD metric in the RPN module and then apply this module in combination with traditional object detection models, including Faster R-CNN, Cascade R-CNN and DetectoRS.

The results on AI-TOD are shown in Table II, where we compare our method to several typical object detection methods. The detectors in the first seven rows are two-stage anchor-based detectors, those in the next three and four rows are one-stage anchor-based and anchor-free detectors, respectively, and the last three rows show the results of our method. Compared with Faster R-CNN, Cascade R-CNN and DetectoRS, we achieve AP improvements of 12.8, 11.2 and 11.8 points, respectively, by using SimD in place of the IoU in RPN. We also compare our method with some specialized detectors for tiny objects, namely, DotD, NWD and RFLA, relative to which our method improves the AP by 10.5, 5.8 and 1.8 points, respectively.

The performance of our method on tiny objects is worthy of special attention. Because of the extremely small size of these objects (very tiny refers to a size range from 2 to 8 pixels), the APvt of general object detectors is 0, whereas with the use of SimD, the APvt values of Faster R-CNN, Cascade R-CNN and DetectoRS are increased from 0 to 11.9, 13.2 and 13.4 points, respectively.

In addition to AI-TOD, our method also achieves the best performance on AI-TODv2, VisDrone2019 and SODA-D, as shown in Table III, Table IV and Table V, respectively. On AI-TODv2 and SODA-D, the AP of our method is 1.8 and 1.6 points higher, respectively, than that of its best competitor. On VisDrone2019, which contains both tiny and general-sized objects, our method also performs well; in particular, it achieves a 1.3-point improvement over RFLA. In Table V, AP0.5 is almost at the same level as RFLA, but AP0.75 is much higher, which may indicate that our method is better suited to tiny object detection. Some typical visual comparisons between the IoU metric and SimD are shown in Fig. 3, where the improvement in detection performance obtained with our method is evident.

TABLE V: Main results on SODA-D. All models are trained on the train set and tested on the test set.

Method | AP | AP0.5 | AP0.75
Faster R-CNN [8] | 28.9 | 59.7 | 24.2
Cascade R-CNN [25] | 31.2 | 59.9 | 27.8
RetinaNet [34] | 28.2 | 57.6 | 23.7
FCOS [37] | 23.9 | 49.5 | 19.9
RepPoints [35] | 28.0 | 55.6 | 24.7
ATSS [19] | 26.8 | 55.6 | 22.1
YOLOX [41] | 26.7 | 53.4 | 23.0
RFLA [4] | 29.7 | 60.2 | 25.2
Faster R-CNN w/ SimD | 31.1 | 59.8 | 28.1
Cascade R-CNN w/ SimD | 32.4 | 59.2 | 30.5
DetectoRS w/ SimD | 32.8 | 59.4 | 31.3
TABLE VI: Ablation study on AI-TOD based on Faster R-CNN

norm_width | norm_height | AP | AP0.5 | AP0.75
✗ | ✗ | 20.4 | 49.6 | 13.2
✓ | ✗ | 22.1 | 52.0 | 14.8
✗ | ✓ | 22.0 | 52.3 | 14.5
✓ | ✓ | 23.9 | 54.5 | 17.6
Figure 3: Comparison of detection results on the AI-TOD dataset between label assignment with the traditional IoU metric and with SimD. The first row shows the results of Faster R-CNN with the IoU metric, and the second row is also based on Faster R-CNN but with the SimD metric. The green, blue and red boxes respectively denote true positive (TP), false positive (FP) and false negative (FN) predictions. The improvement achieved with our method is obvious.

IV-D Ablation Study

In our proposed method, an important operation is normalization based on the widths and heights of the ground truth and anchor. To verify the effectiveness of this normalization, we conduct a set of ablation studies. As shown in Table VI, we respectively compare not normalizing, normalizing only the width, normalizing only the height, and normalizing both width and height. The experimental results show that the normalization operation achieves a 3.5-point improvement, mainly because of its ability to adapt to objects of different sizes in a dataset, and the normalization parameters $m$ and $n$ can be adaptively adjusted for different datasets.

IV-E Analysis

From the experimental results shown in Table II to Table V, we find that our method achieves the best AP on all four datasets. In addition, on the AI-TOD, AI-TODv2, and VisDrone2019 datasets, our method achieves the best results for very tiny, tiny and small objects. The strengths of our method can be summarized in three main points.

First, our method effectively solves the problem of low accuracy for tiny objects. The most fundamental reason is that our method fully accounts for the similarity between two bounding boxes, including both location and shape similarity; therefore, only the highest-quality anchors are chosen as positive samples when using the SimD metric. Compared with VisDrone2019, the performance improvements on AI-TOD and AI-TODv2 are more obvious because the objects in these two datasets are much smaller, which also reflects the effectiveness of our method for tiny object detection.

Second, our method can adapt well to different object sizes in a dataset. In Table IV, both the AP and APvt values of our method are the best and much higher than those of other methods. The main reason is that normalization is applied in the SimD metric when calculating the similarity between bounding boxes, so it can eliminate the differences arising from bounding boxes of different sizes. Some typical detection results are shown in Fig. 4.

Finally, our method achieves the state-of-the-art results on four different datasets. Although the characteristics of the objects in different datasets vary, we use the relationships between the ground truths and anchors in the train set when computing the normalization parameters, enabling our metric to automatically adapt to different datasets. In addition, there are no hyperparameters that need to be set in our formula.
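A hedged sketch of what such an adaptive computation could look like is given below: a size-related statistic is averaged over ground-truth/anchor pairs from the training set, so no hand-tuned constant is required. The particular statistic shown here (the mean per-axis ratio of ground-truth size to anchor size) is an assumption for illustration and is not necessarily the exact quantity used by SimD.

    def estimate_norm_params(train_gts, train_anchors):
        # train_gts, train_anchors: lists of (cx, cy, w, h) boxes taken from
        # the training set and from the detector's anchor layout.
        # Returns per-axis averages that can serve as the normalization
        # parameters m (width axis) and n (height axis).
        w_sum, h_sum, count = 0.0, 0.0, 0
        for _, _, gw, gh in train_gts:
            for _, _, aw, ah in train_anchors:
                w_sum += gw / aw
                h_sum += gh / ah
                count += 1
        return w_sum / count, h_sum / count

Because the parameters are recomputed from whatever training set is supplied, the same formula transfers across datasets with very different object-size distributions without manual tuning.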

Figure 4: Some typical detection results on the VisDrone2019 val set, which contains both tiny and general-sized objects.

V Conclusion

In this paper, we point out that most existing label assignment methods may not be able to automatically adapt to objects of different sizes and involve hyperparameters that need to be chosen manually. To this end, we propose a novel evaluation metric named the Similarity Distance (SimD), which not only considers both location and shape similarity but also automatically adapts to different datasets and to different object sizes within a dataset. In addition, there are no hyperparameters in our formula. Finally, we conduct extensive experiments on four classic tiny object detection datasets, on which our method achieves state-of-the-art results. Although our proposed SimD metric is adaptive, it is still built upon an existing label assignment strategy with fixed thresholds. In the future, we aim to further improve the effectiveness of label assignment for tiny object detection.
