Open Access Article

Dual-YOLO Architecture from Infrared and Visible Images for Object Detection

by Chun Bao, Jie Cao, Qun Hao, Yang Cheng, Yaqian Ning and Tianhua Zhao
1
Bionic Robot Key Laboratory of Ministry of Education, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
2
Yangtze Delta Region Academy, Beijing Institute of Technology, Jiaxing 314003, China
3
School of Opto-Electronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(6), 2934; https://doi.org/10.3390/s23062934
Submission received: 8 February 2023 / Revised: 2 March 2023 / Accepted: 7 March 2023 / Published: 8 March 2023
(This article belongs to the Section Intelligent Sensors)

Abstract

With the development of infrared detection technology and growing military remote sensing needs, infrared object detection networks with low false alarm rates and high detection accuracy have become a research focus. However, due to the lack of texture information, the false detection rate of infrared object detection is high, which reduces detection accuracy. To solve these problems, we propose an infrared object detection network named Dual-YOLO, which integrates visible image features. To ensure detection speed, we choose You Only Look Once v7 (YOLOv7) as the basic framework and design dual feature extraction channels for infrared and visible images. In addition, we develop attention fusion and fusion shuffle modules to reduce the detection error caused by redundant fused feature information. Moreover, we introduce the Inception and SE modules to enhance the complementary characteristics of infrared and visible images. Furthermore, we design a fusion loss function to make the network converge quickly during training. The experimental results show that the proposed Dual-YOLO network reaches 71.8% mean Average Precision (mAP) on the DroneVehicle remote sensing dataset and 73.2% mAP on the KAIST pedestrian dataset. The detection accuracy reaches 84.5% on the FLIR dataset. The proposed architecture is expected to be applied in the fields of military reconnaissance, unmanned driving, and public safety.
Keywords:
infrared object detection; dual-YOLO; attention fusion; fusion shuffle; fusion loss

1. Introduction

In recent years, infrared detection technology has been widely applied in military, remote sensing, civil, and other fields, such as infrared reconnaissance and early warning, infrared space detection, automotive navigation, and medical infrared detection, among many other application scenarios. As a critical technology in the field of infrared early warning detection, infrared object detection algorithms adapted to different complex scenes have been widely studied. Given that the spatial resolution of the optical system is difficult to improve further, it is of great significance to study infrared object detection algorithms with a low false alarm rate and strong adaptability to different scenes.
However, the detection of infrared images also faces many challenges. First, the object has few available features. Second, the signal-to-noise ratio of the image is low. Finally, the real-time performance of infrared image object detection is limited. These factors make it challenging to design an object detection network with high accuracy and good real-time performance for infrared images. Current research interest shows that the most popular object detection methods mainly focus on visible scenes, such as the Single Shot MultiBox Detector (SSD) [1], the You Only Look Once (YOLO) series [2,3], Fully Convolutional One-Stage (FCOS) object detection [4], and other single-stage object detection networks. There are also two-stage object detection algorithms such as Faster R-CNN [5], as well as Task-aligned One-stage Object Detection (TOOD) [6]. In addition, some object detection methods are built on anchor-free designs [7] or transformers [8]. These methods perform well on visible images, but their performance is limited on infrared images.
Although infrared object detection is challenging, many methods have been tried and have achieved relatively good results. For example, the YOLO-FIRI [9] algorithm improves on YOLOv5 [10] to propose a region-free infrared image object detection method and reaches an advanced level on the KAIST [11] and FLIR [12] datasets. The work of I-YOLO [13] is aimed explicitly at infrared object detection on the road. I-YOLO combines DRUNet [14] with YOLOv3 [2], enhancing the infrared image through DRUNet and then using YOLOv3 for accurate object recognition; this method offers advantages in both precision and speed. Within infrared object detection, air-to-ground detection is also a hot topic for single infrared images. In [15], Jiang et al. proposed a UAV object detection framework for infrared images and video. Features are extracted from ground objects, and an improved YOLOv5s is used for object recognition. This infrared recognition method achieves 88.69% recognition accuracy at 50 FPS. IARet [16] performs well in single infrared image object detection, and its Focus module is designed to improve detection speed. IARet is also lightweight, with the entire model occupying just 4.8 MB. As these examples show, many object detection methods operate only on a single infrared image. Although they have achieved good results, they share common problems: the object detection ability on a single infrared image is limited, object features are severely lost, and the false alarm rate is high.
As we all know, producing visible images requires compensation for external illumination when the illumination conditions are poor. Infrared cameras can produce infrared spectral images throughout the day, but infrared spectral images lack details such as texture and color. Moreover, in infrared images, the critical factor determining the object’s visibility is the temperature difference between the object and the environment. For example, a car object is brighter than the background [17,18]. However, non-object heat points can also lead to false detections. Therefore, infrared and visible images have their own advantages and are complementary in information distribution. Combining the unique benefits of visible images with infrared images can compensate for the precision loss of infrared-only object detection.
According to the above analysis, some researchers have begun to explore complementary detection between infrared and visible images. For example, MFFN [19] proposes a new multi-modal feature fusion network, which uses morphological features, infrared radiation, and motion features to compensate for the deficiency of single-modal detection of small infrared objects. MFFN also proposes a feature pyramid structure with skip-layer connections (SCFPN) for the morphological features. In addition, the network’s backbone integrates SCFPN and a dilated convolutional attention module into Resblock. This design gives the network a detection accuracy of 92.01% on the OEDD dataset. However, not all fusion features are helpful. Much research has also addressed the problems caused by fusion features, such as TIRNet [20]. To solve the problem of information redundancy in the fusion of infrared and visible images, RISNet [17] designed a new mutual information minimization module to reduce redundancy. In addition, RISNet proposed a classification method for light conditions based on histogram statistics. This method automatically classifies more detailed lighting conditions to facilitate the complementary fusion of infrared and RGB images. This design also makes RISNet better than state-of-the-art methods for infrared image detection, especially under insufficient illumination, complex backgrounds, and low contrast. In addition, PearlGAN [21] also promotes infrared and visible image fusion detection. PearlGAN designed a top-down guided attention module so that the corresponding attention loss reaches a hierarchical attention distribution, reducing local semantic ambiguity and using context information for image coding. Moreover, PearlGAN introduces a structured gradient alignment loss. This design performs well in the image translation task and provides a new idea for infrared object detection. Like PearlGAN’s constraint design on the loss function of infrared and visible image fusion detection, there are many other excellent works, such as CMPD [22].
Based on the above observations on visible image object detection and infrared-visible fusion detection, we propose the Dual-YOLO method. This method effectively addresses the problems of low accuracy, feature loss, excessive redundant fused features, and slow detection speed in infrared image object detection. Compared with general object detection, the proposed Dual-YOLO is better suited to object detection based on RGB UAV imagery. As can be seen from [23], object detection based on RGB UAV imagery is more challenging than general object detection. For targets with complex backgrounds, dense distribution, and small size, such as in crop quality detection, a detection method based on RGB UAV imagery can improve detection accuracy. In summary, the main contributions of this paper are as follows:
(1) Based on the current YOLOv7 [3] network, which has the highest accuracy in real-time object detection, we propose a dual-branch infrared and visible object detection network named Dual-YOLO. This method alleviates the problem of missing texture features in object detection on a single infrared image. Detection accuracy is improved by complementing the infrared and visible image feature information.
(2) We propose the attention fusion module, which adds the Inception module and the SE mutual attention module to the infrared and visible feature fusion process, so that infrared and visible images achieve the best feature complementarity and fusion effect without increasing the number of parameters.
(3) We propose the fusion shuffle module, which adds dilated convolution to the infrared and visible feature fusion process and increases the receptive field for feature extraction in the fusion module. We also add a channel shuffle module to make the infrared and visible features more uniform and to reduce redundant features. In addition, we design a feature fusion loss function to accelerate the convergence of Dual-YOLO.
(4) Our method achieves state-of-the-art results on the challenging KAIST multispectral pedestrian dataset and the DroneVehicle [24] remote sensing dataset. Moreover, experiments on the multispectral FLIR object detection dataset also demonstrate the effectiveness and versatility of our algorithm.
The rest of this paper is structured as follows: In Section 2, we describe the network structure and methods in detail. Section 3 gives the details of our work and experimental results and related comparison to verify the effectiveness of our method. Finally, we summarize the research content in Section 4.

2. Methods

2.1. Overall Network Architecture

The overall network structure we have designed is shown in Figure 1. For the base structure, we take reference from the design of YOLOv7. In the backbone of the object detection network Dual-YOLO, we use P1 to P6 for hierarchical identification. The P1 layer uses the TriConv structure. TriConv consists of a three-layer convolution structure, as shown in Equation (1), where $F_{C_i} \in \mathbb{R}^{C_{in} \times H_{in} \times W_{in}}$ denotes the input feature maps, $\mathrm{Conv}_{3 \times 3}$ denotes a convolution with a kernel size of 3 × 3 and stride 1, and $\mathrm{Conv}_{3 \times 2}$ denotes a convolution with a kernel size of 3 × 3 and stride 2. The P2 layer uses the ELAN1 structure of YOLOv7, as shown in Figure 2a. The P3 layer uses a combination of MPConv and ELAN1, which we denote MEConv. MEConv is calculated as shown in Equation (2), where the composition of MPConv is shown in Equation (3) and $\mathrm{Conv}_{1 \times 1}$ denotes a convolution with a kernel size of 1 × 1 and stride 1. The design of the P6 layer is derived from the SPPCSPC structure of YOLOv7, shown in Figure 2c.
$\mathrm{TriConv}(F_{C_i}) = \mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{3 \times 3}(F_{C_i})))$  (1)
$\mathrm{MEConv}(F_{C_i}) = \mathrm{MPConv}(\mathrm{ELAN1}(F_{C_i}))$  (2)
$\mathrm{MPConv}(F_{C_i}) = \mathrm{Concat}(\mathrm{Conv}_{1 \times 1}(\mathrm{Maxpool}(F_{C_i})),\ \mathrm{Conv}_{3 \times 2}(\mathrm{Conv}_{1 \times 1}(F_{C_i})))$  (3)
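For illustration, the backbone blocks in Equations (1)–(3) could be written in PyTorch roughly as follows. This is a minimal sketch: the channel widths are assumptions, and ELAN1 is assumed to be an existing YOLOv7-style module rather than re-implemented here.

import torch
import torch.nn as nn

class TriConv(nn.Module):
    """Equation (1): three stacked 3 x 3, stride-1 convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.block(x)

class MPConv(nn.Module):
    """Equation (3): a max-pool branch and a strided-convolution branch,
    concatenated along the channel dimension (assumes even spatial sizes)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, half, kernel_size=1, stride=1),
        )
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, half, kernel_size=1, stride=1),
            nn.Conv2d(half, half, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.cat([self.pool_branch(x), self.conv_branch(x)], dim=1)

class MEConv(nn.Module):
    """Equation (2): ELAN1 followed by MPConv."""
    def __init__(self, elan1, in_ch, out_ch):
        super().__init__()
        self.elan1 = elan1          # any ELAN1 implementation (e.g., from YOLOv7)
        self.mpconv = MPConv(in_ch, out_ch)

    def forward(self, x):
        return self.mpconv(self.elan1(x))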
Figure 1. The overall architecture of the proposed Dual-YOLO. The proposed network is mainly designed to detect weak infrared objects captured by UAVs. However, to compensate for the loss of features due to variations in light intensity, we add a visible image feature extraction branch to the network to reduce the probability of missing objects.
Figure 2. Structures of the feature extraction modules in the backbone and neck, where (a) is the structure of ELAN1, (b) is the structure of ELAN2, and (c) is the structure of SPPCSP. These structures remain consistent with the design in YOLOv7: ELAN2 has essentially the same components as ELAN1, but ELAN2 has more channels than ELAN1 in the feature aggregation part to ensure that multi-scale feature information is aggregated in the neck. For the maxpool structure in SPPCSP, the value of k is the downsampling ratio.
Thermal infrared images have strong feature contrast in environments with low light levels. However, visible images have unique texture features under normal light conditions. These textural features can compensate for the lack of recognizable detail in thermal infrared images. Therefore, we add a visible feature extraction branch to the backbone design. The structure of the visible feature extraction branch is the same as that of the infrared feature extraction branch. In the neck’s design, we extract feature map vectors from the backbone’s P3, P4, P5, and P6. This type of FPN structure already covers small, medium, and large objects in the infrared image, thus reducing the probability of missed detections. We design the novel Dual-Fusion (D-Fusion) module in the fusion features section; the structure and characteristics of D-Fusion are described in detail in Section 2.2. The D-Fusion module consists of two parts, Attention-Fusion and Fusion-Shuffle. The Attention-Fusion module weighs the visible features as the attention feature vector, under the attention mechanism, together with the infrared features. The inspiration for the attention fusion module came from our preliminary experiments, where we found a significant miss-detection rate when training and detecting on visible or infrared images alone.
In the design of the neck section, we refer to the structure of YOLOv7. Three up-sampling operations are performed in the neck to eliminate the problem of gradual feature loss due to convolution. At the same time, four detection heads are designed to preserve small-object features in the convolution, so that the detection heads cover all object sizes.

2.2. Information Fusion Module

The design of this module is derived from several sets of experiments we conducted on the effectiveness of network detection with a single data source. Before designing Dual-YOLO, we completed the following groups of experiments, as shown in Figure 3. For the model trained on single visible image data, as in Figure 3a(1), the bus class (blue box) is detected in daylight conditions, and the car class near the bush is also detected. In Figure 3a(2), however, classes such as cars are missed. Furthermore, comparing Figure 3c(1) and c(2), after training the model with single visible images at night when there is not enough light, most objects can be detected, although there are missed detections. However, for infrared images, there are many missed and false detections. For training on infrared images, as in Figure 3b(1), objects in the car category are submerged in the background due to the faint brightness of the overall image. This phenomenon also leads to a large number of objects being missed. In contrast, in Figure 3b(2), the car-class object differs significantly from the background features in the thermal infrared image. Therefore, the network has a strong recognition ability when trained with infrared images. Similarly, objects are detected in Figure 3d(2) that are not detected in the visible image case in Figure 3d(1). As a result, the ideal model we want to design is characterized by strong robustness and a low miss rate at different light intensities.
Figure 3. The effect of separate detection of infrared images and visible images. a(1), a(2), c(1), and c(2) are training and detection results for single visible data. b(1), b(2), d(1), and d(2) are training and detection results for single infrared data. This is a collection of images taken from a drone. The images are taken during the day and night. The drone flies at altitudes of 100 m and 200 m.

2.2.1. Attention Fusion Module

In the feature fusion module, we feed the visible and infrared images into a two-branch backbone and perform shared learning of features at the FPN layer. This architecture is used to fuse the mixed modal features of infrared and visible images. In the fusion module, we add a batch normalization (BN) operation to the features of the two branches to improve the network’s generalization ability. In addition, we add the SE attention module to the independent branches, which multiplies the attention feature vectors obtained from the two feature calculations with the corresponding branches. Moreover, we use depthwise separable convolution instead of the conventional 3 × 3 convolution to reduce the number of parameters in the network with little loss of performance. The structure of the feature fusion module we designed to incorporate the attention mechanism is shown in Figure 4.
Figure 4. The structure of the Attention fusion module. (a) shows the data flow structure of the attention fusion. (b) shows the Inception structure in (a), which mainly connects the 4 branches. (c) shows the detailed description of the convolution structure in (b).
We can understand the attention fusion structure intuitively from Figure 4, where Figure 4a shows the main structure of the attention fusion module. The attention fusion module is designed to enhance the information exchange between the infrared and visible channels as well as mutual feature enhancement. The Inception module is designed to obtain multi-scale features in both infrared and visible images. It can also reduce the computational overhead while ensuring the accuracy of the network, thus improving the efficiency of the feature extraction network. In the structure shown in Figure 4a, we add the SE attention module to enhance the infrared and visible features. In this case, we set the squeeze factor of the SE module to s = 4. In particular, we design the SE module to weight the feature vectors of the infrared images with the features extracted from the visible image, resulting in attention feature maps for the visible image channels. Similarly, the attention feature maps for the infrared image channel are obtained by weighting the features with the feature vectors derived from the visible image channels through SE calculations. The structure of the Inception module in Figure 4a is shown in Figure 4b, and we use the Inception structure from [25]. The composition of the convolution part in Figure 4b is shown in Figure 4c. For each convolution operation, we use the Leaky ReLU activation function, and at the end we add the BN operation. The enhanced feature maps calculated by the attention fusion module are more favorable for the later fusion.
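A possible PyTorch sketch of the cross-modal SE weighting described above is given below. The paper fixes only the squeeze factor s = 4; the channel counts, normalization placement, and the exact wiring of which branch weights which are interpretations of the text, not the released implementation.

import torch.nn as nn

class SEWeights(nn.Module):
    """Squeeze-and-excitation channel-attention vector with squeeze factor s = 4."""
    def __init__(self, channels, squeeze=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // squeeze, kernel_size=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels // squeeze, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)                      # (N, C, 1, 1) attention vector

class CrossAttentionFusion(nn.Module):
    """Cross weighting: the SE vector from one modality re-weights the other."""
    def __init__(self, channels):
        super().__init__()
        self.bn_ir = nn.BatchNorm2d(channels)
        self.bn_vis = nn.BatchNorm2d(channels)
        self.se_ir = SEWeights(channels)
        self.se_vis = SEWeights(channels)

    def forward(self, f_ir, f_vis):
        f_ir, f_vis = self.bn_ir(f_ir), self.bn_vis(f_vis)
        w_ir = self.se_ir(f_ir)                # vector derived from infrared features
        w_vis = self.se_vis(f_vis)             # vector derived from visible features
        ir_att = f_ir * w_vis                  # infrared maps weighted by the visible vector
        vis_att = f_vis * w_ir                 # visible maps weighted by the infrared vector
        return ir_att, vis_att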

2.2.2. Fusion Shuffle Module

After the infrared features are fused with the visible features, we add a fusion shuffle step. The purpose is to allow the network to learn more mixed features of the infrared and visible images, thus allowing the network to adapt to both feature modalities. We therefore take the feature-enhancement module design from [26] and improve it. The fusion shuffle module we designed is shown in Figure 5. As can be seen from the figure, after obtaining the infrared and visible features in the lower dimension, we concatenate the two features, doubling the number of feature channels. We then add multiple branches of convolution layers with different kernel sizes and follow each convolution with a dilated convolution with the corresponding dilation rate. Finally, we concatenate the output of the four branches and then shuffle these enhanced features to form the mixed enhancement.
Figure 5. The fusion shuffle module structure where the shuffle is performed after fusion.
In Figure 5, we first design a four-branch convolution layer (including 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutions). The 1 × 1 and 3 × 3 convolutions extract small object features from infrared and visible images, the 5 × 5 convolution extracts medium-scale object features, and the 7 × 7 convolution extracts large-scale object features. The four-branch convolution structure enhances the depth features of the infrared and visible images. To further extend the receptive field for feature extraction in both modalities, we introduce an additional dilated convolution in each branch. The dilated convolutions generate high-resolution, context-aware feature maps while also reducing computational cost. For the dilation rate settings, the branch with the 1 × 1 convolution uses a 3 × 3 dilated convolution with a dilation rate of 1; the 3 × 3 branch uses a dilation rate of 3; the 5 × 5 branch uses a dilation rate of 5; and the 7 × 7 branch uses a dilation rate of 7. The larger the dilation rate, the larger the receptive field. Dilated convolutions with different dilation rates make each branch focus more on enhancing features of a particular size. After enhancing the features, we concatenate the four branches of features and perform a shuffle operation. Finally, we use a 1 × 1 convolution to reshape the output of the fused features.
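A compact sketch of the fusion-shuffle path in Figure 5 is shown below. The kernel sizes and dilation rates follow the text; the per-branch channel width, activation placement, and number of shuffle groups are assumptions.

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle)."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class FusionShuffle(nn.Module):
    """Four branches (1x1/3x3/5x5/7x7 conv, each followed by a 3x3 dilated conv
    with dilation 1/3/5/7), then concatenation, channel shuffle, and a 1x1 reshape."""
    def __init__(self, ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 4
        cfg = [(1, 1), (3, 3), (5, 5), (7, 7)]     # (kernel size, dilation rate)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(2 * ch, branch_ch, kernel_size=k, padding=k // 2),
                nn.Conv2d(branch_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for k, d in cfg
        ])
        self.reshape = nn.Conv2d(4 * branch_ch, out_ch, kernel_size=1)

    def forward(self, f_ir, f_vis):
        x = torch.cat([f_ir, f_vis], dim=1)        # fuse the two modalities (2 * ch channels)
        x = torch.cat([b(x) for b in self.branches], dim=1)
        x = channel_shuffle(x, groups=4)           # mix features across the four branches
        return self.reshape(x)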

2.3. Loss Function

In the design of the loss function, we divide the loss of Dual-YOLO into four parts. The first is for the D-fusion module. In the overall structure of the network, we design four fusion modules for visible and infrared images based on the feature pyramid structure. Furthermore, the corresponding fusion is carried out according to deep and shallow features. Assuming that the feature matrix of the visible image is $Z_{vis}$ and the feature matrix of the infrared image is $Z_{inf}$, the feature entropies of the two images, $H_i(Z_{vis})$ and $H_i(Z_{inf})$, are calculated as shown in Equations (4) and (5).
$H_i(Z_{inf}) = C_i(Z_{vis}, Z_{inf}) - D_i(Z_{vis} \,\|\, Z_{inf})$  (4)
$H_i(Z_{vis}) = C_i(Z_{inf}, Z_{vis}) - D_i(Z_{inf} \,\|\, Z_{vis})$  (5)
where $C_i(Z_{vis}, Z_{inf})$ is the cross-entropy of the low-dimensional feature vectors $Z_{vis}$ and $Z_{inf}$ of the i-th D-fusion module, and $D_i(Z_{inf} \,\|\, Z_{vis})$ is the relative entropy of $Z_{vis}$ and $Z_{inf}$. For the loss of the D-fusion module, we add up the losses of the four different scales of the module and obtain the fusion-module loss $L_{fusion}$, as shown in Equation (6).
$L_{fusion} = \sum_{i=1}^{4} \left( H_i(Z_{inf}) + H_i(Z_{vis}) \right) = \sum_{i=1}^{4} \left( C_i(Z_{vis}, Z_{inf}) + C_i(Z_{inf}, Z_{vis}) - D_i(Z_{vis} \,\|\, Z_{inf}) - D_i(Z_{inf} \,\|\, Z_{vis}) \right)$  (6)
The value of $L_{fusion}$ represents the number of pseudo-features in the visible image. By optimizing $L_{fusion}$, the parameters of the feature extraction network can be optimized. It is also possible to eliminate redundant image features, thus improving the network’s generalization ability and facilitating rapid convergence. For the coordinate position error, we choose the Complete IoU (CIoU) loss as the loss function, making the box regression more stable, as shown in Equation (7).
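To make Equations (4)–(6) concrete, the sketch below evaluates the fusion loss over the four pairs of D-Fusion feature maps. How the paper turns the feature matrices into probability distributions is not stated, so the softmax over flattened activations used here is an assumption.

import torch.nn.functional as F

def feature_cross_entropy(p_feat, q_feat):
    """Cross-entropy C(p, q) between two feature maps, each normalized into a
    distribution by a softmax over its flattened activations (assumed)."""
    p = F.softmax(p_feat.flatten(1), dim=1)
    log_q = F.log_softmax(q_feat.flatten(1), dim=1)
    return -(p * log_q).sum(dim=1).mean()

def feature_kl(p_feat, q_feat):
    """Relative entropy D(p || q) under the same normalization assumption."""
    p = F.softmax(p_feat.flatten(1), dim=1)
    log_p = F.log_softmax(p_feat.flatten(1), dim=1)
    log_q = F.log_softmax(q_feat.flatten(1), dim=1)
    return (p * (log_p - log_q)).sum(dim=1).mean()

def fusion_loss(vis_feats, inf_feats):
    """Equation (6): sum H_i over the four D-Fusion scales.
    vis_feats / inf_feats are lists of the four visible / infrared feature maps."""
    loss = 0.0
    for z_vis, z_inf in zip(vis_feats, inf_feats):
        h_inf = feature_cross_entropy(z_vis, z_inf) - feature_kl(z_vis, z_inf)   # Equation (4)
        h_vis = feature_cross_entropy(z_inf, z_vis) - feature_kl(z_inf, z_vis)   # Equation (5)
        loss = loss + h_inf + h_vis
    return loss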
$L_{box} = L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha \nu$  (7)
where IoU is the intersection-over-union of the predicted bounding box and the ground truth (GT) bounding box.
$IoU = \frac{|b \cap b^{gt}|}{|b \cup b^{gt}|}$  (8)
$\nu = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2$  (9)
where $b$ is the predicted box, $b^{gt}$ is the GT box, $\rho$ is the distance between the centroids of the predicted box and the GT box, $c$ is the diagonal length of the smallest rectangle enclosing the predicted box and the GT box, $\nu$ measures the consistency of the aspect ratios of the predicted box and the GT box, and $\alpha$ is the influence factor of $\nu$.
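For reference, Equation (7) can be evaluated as in the sketch below. The weighting factor α follows the standard CIoU definition α = ν / (1 − IoU + ν), which the paper does not spell out explicitly; the (x1, y1, x2, y2) box format is also an assumption.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Complete IoU loss of Equation (7) for boxes given as (x1, y1, x2, y2),
    with pred and target of shape (N, 4)."""
    # intersection and union areas for the IoU term (Equation (8))
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared centroid distance rho^2 and enclosing-box diagonal c^2
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency nu (Equation (9)) and its influence factor alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()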
For the object confidence error, we choose the Smooth Binary Cross Entropy (Smooth BCE) loss with a logits function to increase numerical stability, which is calculated as shown in Equation (10).
$L_{obj} = -\frac{1}{n} \sum_{i}^{n} \left[ y_i \cdot \log(\sigma(x_i)) + (1 - y_i) \cdot \log(1 - \sigma(x_i)) \right]$  (10)
$\sigma(x_i) = \frac{1}{1 + \exp(-x_i)}$  (11)
For the object classification loss, we choose the Focal loss, as shown in Equation (12).
$L_{cls} = -\sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{i,j}^{obj} \sum_{c \in \mathrm{class}} \left[ p_i(c) \log(\hat{p}_i(c)) + (1 - p_i(c)) \log(1 - \hat{p}_i(c)) \right]$  (12)
where $\hat{p}_i(c)$ and $p_i(c)$ represent the predicted and true class probabilities, respectively. $S^2$ is the number of cells of the input image, and $B$ is the number of bounding boxes predicted for each cell. The value of $\mathbb{1}_{i,j}^{obj}$ is 1 if there is a detection object in the $j$-th bounding box of the $i$-th cell and 0 otherwise. For the total loss function, we add the loss of the head part to the loss of the D-fusion module. The total loss value of the network is finally obtained, as shown in Equation (13), where $\lambda$ is the correction factor of the fusion loss.
$L_{total} = \lambda L_{fusion} + L_{box} + L_{obj} + L_{cls}$  (13)

3. Experiment and Analysis

To test the performance of the infrared image object detection model Dual-YOLO proposed in this paper, we use the public DroneVehicle, KAIST, and FLIR infrared datasets.

3.1. Dataset Introduction

3.1.1. DroneVehicle Dataset

The DroneVehicle dataset [24] is a large annotated UAV aerial vehicle dataset, used for tasks such as vehicle detection and vehicle counting. The images are taken in environments ranging from day to night and contain both infrared and visible images. The entire annotated dataset has 15,532 pairs (31,064 images) and 441,642 annotated instances. Moreover, it contains realistic environment occlusion and scale variation.

3.1.2. KAIST Dataset

The KAIST dataset [11] is a multispectral detection dataset constructed by Hwang et al. in 2015 with the primary aim of addressing the lack of pedestrian detection data in nighttime environments. The dataset is divided into 12 subsets, where set00∼set05 are training data (set00∼set02 are daytime scenes; set03∼set05 are nighttime scenes) and set06∼set11 are test data (set06∼set08 are daytime scenes; set09∼set11 are nighttime scenes). The image resolution is 640 × 512, and the dataset contains a total of 95,328 image pairs, each consisting of a visible and an infrared image. The KAIST dataset captures several regular traffic scenes, including campus, street, and countryside, during daytime and nighttime, and contains 103,108 dense annotations.

3.1.3. FLIR Dataset

The FLIR dataset [12] contains more than 10K pairs of 8-bit infrared images and 24-bit visible images, including people, vehicles, bicycles, and other objects in the daytime and nighttime scenes. The infrared images’ resolution is 640 × 512, while the corresponding resolution of visible images varies from 720 × 480 to 2048 × 1536. We resize each visible image to 640 × 512 in our experiments. The default FLIR training dataset is used as our training dataset, and 20 color-thermal pairs from the FLIR validation set are randomly selected as the testing dataset. The dataset information we used for training and testing is summarized in Table 1.
Table 1. Dataset information we used in this paper.

3.2. Implementation Details

We utilize the YOLOv7 network as the main framework. Each image is randomly horizontally flipped with a probability of 0.5 to increase diversity. The whole network is optimized by a stochastic gradient descent (SGD) optimizer for 300 epochs with a learning rate of 0.005 and a batch size of 16. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We implement our code with the PyTorch framework and conduct experiments on a workstation with two NVIDIA RTX 3090 GPUs. We summarize the experimental environment and parameter settings in Table 2. The hyper-parameters of the datasets used in this article are shown in Table 3. There are equal numbers of infrared and visible images; while using these datasets for network training and testing, we perform data cleaning operations.
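The optimizer settings reported above translate directly into PyTorch; in the snippet below, the tiny stand-in module exists only so the code runs, and the paired infrared/visible data pipeline is omitted.

import torch
import torch.nn as nn

# Training configuration from Section 3.2; substitute the Dual-YOLO network
# and the paired infrared/visible data loader for the stand-ins.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.LeakyReLU(0.1))
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.005,               # initial learning rate
    momentum=0.9,
    weight_decay=0.0001,
)
num_epochs = 300
batch_size = 16
flip_probability = 0.5      # random horizontal flip applied to each training image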
Table 2. Environment and parameter setting for the experiment setup.
Table 3. The hyper-parameters of the dataset we used in this manuscript. test-val means that the test set used in this article is the same as the validation set.

3.3. Evaluation Metrics

Precision, Recall, and mean Average Precision (mAP) are used to evaluate the detection performance of different methods. In the experiments of this paper, we mainly use the values of precision and recall to measure the network’s performance, which are calculated as shown in Equations (14) and (15).
$\mathrm{precision} = \frac{TP}{TP + FP}$  (14)
$\mathrm{recall} = \frac{TP}{TP + FN}$  (15)
For example, in the FLIR dataset for detecting persons and cars, TP (True Positive) represents the number of cars (or persons) correctly recognized as cars (or persons). FP (False Positives) means the number of samples that identified non-car instances (or non-person instances) as cars (or persons), and FN (False Negatives) indicates the number of samples that identified cars (or persons) as non-car instances (or non-person instances).
Average Precision (AP) refers to the area under the precision-recall (P-R) curve. The closer the AP value is to 1, the better the detection effect of the algorithm. The calculation of AP can be summarized as follows:
$AP = \int P(R) \, dR$  (16)
The mAP is the mean of the per-class AP values, which is used to fairly measure the performance of multi-class object detection tasks. Therefore, mAP is also adopted to evaluate the detection accuracy in our experiments. The mAP measures the quality of bounding box predictions on the test set. Following [27], a prediction is considered a true positive if the IoU between the prediction and its nearest ground-truth annotation is greater than 0.5. The IoU is calculated as shown in Equation (8).
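As a small illustration of Equations (14)–(16), the sketch below computes precision and recall from raw counts and approximates AP by numerically integrating the precision-recall curve; the interpolation convention (prepending the point recall = 0, precision = 1) is an assumption. mAP at IoU ≥ 0.5 is then the mean of the per-class AP values.

import numpy as np

def precision_recall(tp, fp, fn):
    """Equations (14) and (15) from raw true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """Equation (16): area under the precision-recall curve, approximated numerically.
    precisions and recalls are obtained by sweeping the confidence threshold."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP at IoU >= 0.5 is the mean of the per-class AP values:
# mAP = sum(ap_per_class.values()) / len(ap_per_class)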

3.4. Analysis of Results

3.4.1. Experiments on the DroneVehicle Remote Sensing Dataset

To verify the detection effectiveness of our proposed Dual-YOLO method on small infrared objects, we conduct a series of experiments on the DroneVehicle dataset. The experimental results are shown in Table 4. Based on our observations, the freight car and van classes are very similar in shape in the DroneVehicle dataset. Therefore, many popular detection methods merge these two classes into the other three classes when conducting experiments on the DroneVehicle dataset, to eliminate the error caused by fine-grained classification. However, we use the complete DroneVehicle dataset in our experiments. In addition, we compare the performance with currently popular object detection methods; the performance comparison is shown in Table 4.
Table 4. Evaluation results on the DroneVehicle dataset. All values are in %. The top results are marked in green.
In Table 4, we divide the modality of the data into visible and infrared. Table 4 shows that when only visible data is used for training, popular networks such as RetinaNet and Mask R-CNN achieve an accuracy of at most 47.9%. The algorithm that achieves the highest accuracy when trained on infrared data is YOLOv7; therefore, we choose YOLOv7 as the basic framework for Dual-YOLO. The highest accuracy YOLOv7 can achieve is 66.7%. The Dual-YOLO algorithm proposed in this paper reaches 71.8% accuracy on the DroneVehicle dataset. It is worth noting that when we test the Dual-YOLO algorithm, the test set is the infrared image test set. Our proposed model also has the highest detection accuracy, 52.9% and 46.6%, for the two hard-to-detect categories of freight car and van. This result shows that the Dual-YOLO design is very robust and that it also performs strongly on small objects.

3.4.2. Experiments on the KAIST Pedestrian Dataset

To further verify the effectiveness and robustness of our proposed Dual-YOLO, we conduct experiments on the challenging KAIST dataset. Our experimental results, compared with some popular methods, are shown in Table 5. Here, we mainly compare with PearlGAN, whose design idea is similar to ours in that it uses infrared and visible image fusion information. However, unlike the proposed Dual-YOLO, PearlGAN does not integrate infrared and visible features in its design; instead, the two information sources are constrained by the loss function. As can also be seen from Table 5, our approach of fusing features before detection and adding a loss constraint performs better on the KAIST dataset.
Table 5. Pedestrian detection results of the synthesized images obtained by different translation methods on the KAIST dataset computed at a single IoU of 0.5. All values are in %. The top results are marked in green.
From Table 5, we can see that the accuracy of the other methods is low compared to the accuracy of our proposed network on the KAIST dataset. According to our analysis, this is due to the presence of many cluttered labels in the KAIST dataset, which lowers the accuracy of other methods. However, we perform data cleaning on the dataset to remove pseudo-labels as well as incorrect labels before training the network. It can also be seen from the final results that our method performs effectively on small infrared objects. Figure 6 shows the results of our tests and a visualization of some data from the KAIST dataset. As can be seen from the figure, our network is highly robust to both changes in scale and changes in image brightness.
Figure 6. Visualization of Dual-YOLO detection results on the KAIST pedestrian dataset.

3.4.3. Experiments on the FLIR Dataset

We also conduct a series of experiments on the FLIR dataset to prove the effectiveness of the proposed method. Furthermore, we compare the performance with some popular methods, such as SSD, RetinaNet, YOLOv5s, and YOLOF. The final experimental results are shown in Table 6.
Table 6. Object detection results of the synthesized images obtained by different translation methods on the FLIR dataset, computed at a single IoU of 0.5. All values are in %. The top results are marked in green.
Table 6 shows that our proposed method has the highest mAP value compared with the other methods. The Dual-YOLO structure we use performs the shuffle before fusion, which is explained in detail in Section 3.5. From Table 6, we can also see that for small objects such as bicycles, most methods, such as SSD and RetinaNet, have limited detection accuracy. The proposed Dual-YOLO has a strong detection effect for small and medium-sized objects such as persons. According to the data, the detection accuracy of our Dual-YOLO is 20.3% higher than that of YOLOv5s in the person category. We believe this improvement is not only due to the advancement of the YOLOv7 architecture; it shows that our idea of infrared and visible image fusion detection is reasonable. It is worth noting that the detection accuracy of the proposed network reaches 93.0% on the car class, which is 13.0% higher than YOLOv5s and 7.5% higher than TermalNet in Table 6. According to our analysis, the visible image channel added in the proposed Dual-YOLO allows the network to better recognize texture features, and the enhancement of texture features makes the overall detection effect more optimized. It is worth mentioning that our proposed Dual-YOLO method improves the overall mAP by 4.5% compared with YOLO-FIR. YOLO-FIR is also designed based on the fusion of infrared and visible images. However, we design the attention fusion module and fusion shuffle module in the fusion process, which further increases our detection accuracy.
Figure 7 shows some visualization results of the object detection effect on the FLIR dataset. From the fourth scene in the first row and the first scene in the third row of Figure 7, we can see that for objects with overlapping and occluded areas, our Dual-YOLO can fully detect cars. In the second scene of the second row, our detector can accurately detect overlapping objects and recognize objects at different scales. In this scenario, cars can be large or small, and our detector detects them accurately. In the second scenario of the third row, our network also performs well in detecting small-sized objects such as bicycles. In infrared images, the surrounding scene easily drowns out bicycle features, so it is challenging to detect this kind of object. A model complexity and runtime comparison of Dual-YOLO and the plain counterparts is shown in Table 7.
Figure 7. Visualization of Dual-YOLO detection results on the FLIR dataset.
Table 7. Model complexity and runtime comparison of Dual-YOLO and the plain counterparts.

3.5. Ablation Study

3.5.1. Position of the Shuffle

In the structure shown in Figure 5, we use a channel shuffle strategy in the design of the fusion module. This strategy increases the exchange of feature information between different channels. Nevertheless, we have also considered whether the shuffle should be used before or after fusion. As shown in Figure 8, we place the shuffle operation before the convolution fusion module to obtain more blended features; in this way, the effective blending of the infrared and visible images is obtained before the convolutional fusion is performed. Therefore, we conduct a set of experiments for validation. On the FLIR dataset, we carry out three different types of experiments.
Figure 8. The fusion shuffle module structure where the shuffle is performed before fusion.
The experimental results obtained according to the position of the shuffle are shown in Table 8. The first row of Table 8 is the experiment without the shuffle fusion module; the resulting detection accuracy is 81.1%. The second row shows the experiment with shuffle fusion added and the shuffle operation placed after the convolutional fusion, resulting in an accuracy of 83.2%. The last row is where we add the shuffle fusion module and place the shuffle operation before the convolutional fusion, resulting in an accuracy of 84.5%. Compared to the model without shuffle fusion, the accuracy of the network improves by 3.4% with the addition of this module. Regarding the shuffle position, we can also conclude from Table 8 that there is a 1.3% improvement in accuracy when the shuffle operation is performed before the convolutional fusion.
Table 8. On the FLIR dataset, object detection results at a single IoU of 0.50 when the shuffle is placed in different positions of Dual-YOLO. All values are in %. The top results are marked in green.

3.5.2. Functions of the Components in the Attention Fusion Module

We conduct the following ablation study to test the function of the attention fusion module proposed in Section 2.2 and its components. It is worth noting that there are four D-Fusion modules in our proposed Dual-YOLO network. In this ablation experiment, we apply the same configuration to the attention fusion module in each D-Fusion module; that is, the configuration of the four D-Fusions is exactly the same. The results obtained are shown in Table 9, and the training curves of the proposed algorithms are shown in Figure 9 and Figure 10. In Table 9, we test the accuracy of Dual-YOLO on the FLIR dataset with and without the Inception and SE modules. To eliminate the influence of different IoU settings on the experiment, we use not only mAP@0.5 but also mAP@0.5:0.95 as the evaluation standard.
Figure 9. The mAP@0.5:0.95 performance curve of Dual-YOLO during training. From the curves, we can see that Dual-YOLO has the highest accuracy when it adds Inception and the SE module together.
Figure 10. The mAP@0.5 performance curve of Dual-YOLO during training. From the curves, we can see that Dual-YOLO has the highest accuracy when it adds Inception and the SE module together.
Table 9. Object detection results of the synthesized images obtained with different modules in the attention fusion module on the FLIR dataset. These results are computed at a single IoU of 0.50 and at IoU between 0.50 and 0.95. All values are in %. The top results are marked in green.
From Table 9, we can see that when we add the Inception and SE modules to Dual-YOLO, the highest mAP is achieved on the FLIR dataset. After adding Inception and SE, mAP@0.5 improves by 4.8% over not adding these two modules. We can also see that the mAP@0.5 of the model increases by 1.4% when only SE is added, and by 2.8% with Inception only. We achieve the highest accuracy when we use the Inception and SE modules together.

4. Conclusions

To overcome the accuracy loss caused by the missing texture features of infrared objects, we propose Dual-YOLO, an object detection network based on YOLOv7 that fuses infrared and visible images. For feature extraction, we design an infrared and visible image feature fusion module named D-Fusion. Through the attention fusion and fusion shuffle designs, the network retains simplified and useful fusion information during feature extraction, which reduces the impact of redundant information on detection accuracy. Finally, we design a fusion loss function for the training process to accelerate the network’s convergence. Experimental verification on the DroneVehicle, KAIST, and FLIR datasets demonstrates the effectiveness of Dual-YOLO in improving the accuracy of infrared object detection. The proposed method is expected to be applied in military reconnaissance, unmanned driving, agricultural fruit detection, and public safety. Future work will extend infrared and visible image fusion to semantic segmentation and infrared object tracking, and will further optimize the model in terms of parameter compression and acceleration so that the proposed Dual-YOLO detector becomes more suitable for embedded platforms.
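As an illustration of the fusion shuffle summarized above, and of the "shuffle after fusion" arrangement that scored highest in Table 8, the following minimal PyTorch sketch interleaves the infrared and visible channel groups after concatenation. The function names and the choice of two channel groups are our assumptions for this sketch rather than the paper's released code.

import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups (here: the infrared group and the visible group)."""
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap the group and channel axes
    return x.view(b, c, h, w)                 # flatten back: channels are now interleaved

def shuffle_after_fusion(f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
    fused = torch.cat([f_ir, f_vis], dim=1)   # fuse the two modalities first ...
    return channel_shuffle(fused, groups=2)   # ... then shuffle, as in the best row of Table 8

# Example: two 256-channel feature maps from the infrared and visible branches.
f_ir = torch.randn(1, 256, 40, 40)
f_vis = torch.randn(1, 256, 40, 40)
print(shuffle_after_fusion(f_ir, f_vis).shape)  # torch.Size([1, 512, 40, 40])

Moving the channel_shuffle call onto each modality before the concatenation would give the "shuffle before fusion" variant compared in Table 8.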

Author Contributions

Conceptualization, C.B.; methodology, C.B., J.C. and Y.N.; software, C.B. and T.Z.; validation, C.B. and Y.N.; investigation, Y.C.; resources, T.Z.; data curation, T.Z.; writing—original draft preparation, C.B. and J.C.; writing—review and editing, C.B., Y.C. and Q.H.; visualization, C.B., Y.N., and T.Z.; supervision, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 62275022 and No. 62275017), the Beijing Municipal Natural Science Foundation (Grant No. 4222017), and the Foundation Enhancement Program (Grant No. 2020-JCJQ-JJ-030).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DroneVehicle remote sensing dataset is obtained from https://github.com/VisDrone/DroneVehicle, accessed on 29 December 2021. The KAIST pedestrian dataset is obtained from https://github.com/SoonminHwang/rgbt-ped-detection/tree/master/data, accessed on 12 November 2021. The FLIR dataset is obtained from https://www.flir.com/oem/adas/adas-dataset-form/, accessed on 19 January 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
  2. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  3. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  4. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar] [CrossRef] [Green Version]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE Computer Society: Washington, DC, USA, 2021; pp. 3490–3499. [Google Scholar]
  7. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  8. Zhao, C.; Wang, J.; Su, N.; Yan, Y.; Xing, X. Low contrast infrared target detection method based on residual thermal backbone network and weighting loss function. Remote Sens. 2022, 14, 177. [Google Scholar] [CrossRef]
  9. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  10. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 May 2022).
  11. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar] [CrossRef]
  12. Available online: https://www.flir.com/oem/adas/adas-dataset-form (accessed on 19 January 2022).
  13. Sun, M.; Zhang, H.; Huang, Z.; Luo, Y.; Li, Y. Road infrared target detection with I-YOLO. IET Image Process. 2022, 16, 92–101. [Google Scholar] [CrossRef]
  14. Devalla, S.K.; Renukanand, P.K.; Sreedhar, B.K.; Subramanian, G.; Zhang, L.; Perera, S.; Mari, J.M.; Chin, K.S.; Tun, T.A.; Strouthidis, N.G. DRUNET: A dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images. Biomed. Opt. Express 2018, 9, 3244–3265. [Google Scholar] [CrossRef] [Green Version]
  15. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912. [Google Scholar] [CrossRef]
  16. Jiang, X.; Cai, W.; Yang, Z.; Xu, P.; Jiang, B. IARet: A Lightweight Multiscale Infrared Aerocraft Recognition Algorithm. Arab. J. Sci. Eng. 2022, 47, 2289–2303. [Google Scholar] [CrossRef]
  17. Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
  18. Yuan, M.; Wang, Y.; Wei, X. Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection. arXiv 2022, arXiv:2209.13801. [Google Scholar]
  19. Wu, D.; Cao, L.; Zhou, P.; Li, N.; Li, Y.; Wang, D. Infrared Small-Target Detection Based on Radiation Characteristics with a Multimodal Feature Fusion Network. Remote Sens. 2022, 14, 3570. [Google Scholar] [CrossRef]
  20. Dai, X.; Yuan, X.; Wei, X. TIRNet: Object detection in thermal infrared images for autonomous driving. Appl. Intell. 2021, 51, 1244–1261. [Google Scholar] [CrossRef]
  21. Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal infrared image colorization for nighttime driving scenes with top-down guided attention. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15808–15823. [Google Scholar] [CrossRef]
  22. Li, Q.; Zhang, C.; Hu, Q.; Fu, H.; Zhu, P. Confidence-aware Fusion using Dempster-Shafer Theory for Multispectral Pedestrian Detection. IEEE Trans. Multimed. 2022. [Google Scholar] [CrossRef]
  23. Liu, W.; Quijano, K.; Crawford, M.M. YOLOv5-Tassel: Detecting Tassels in RGB UAV Imagery with Improved YOLOv5 Based on Transfer Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8085–8094. [Google Scholar] [CrossRef]
  24. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  25. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. Proc. Aaai Conf. Artif. Intell. 2017, 31, 4278–4284. [Google Scholar] [CrossRef]
  26. Xiao, X.; Wang, B.; Miao, L.; Li, L.; Zhou, Z.; Ma, J.; Dong, D. Infrared and visible image object detection via focused feature enhancement and cascaded semantic extension. Remote Sens. 2021, 13, 2538. [Google Scholar] [CrossRef]
  27. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2844–2853. [Google Scholar] [CrossRef]
  28. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef] [Green Version]
  29. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef] [Green Version]
  30. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  31. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar] [CrossRef] [Green Version]
  32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef] [Green Version]
  34. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 701–709. [Google Scholar]
  35. Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal Unsupervised Image-to-Image Translation. In Computer Vision—ECCV 2018; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11207, pp. 179–196. [Google Scholar] [CrossRef] [Green Version]
  36. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Gool, L.V. Night-to-day image translation for retrieval-based localization. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964. [Google Scholar] [CrossRef] [Green Version]
  37. Kim, J.; Kim, M.; Kang, H.; Lee, K. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv 2019, arXiv:1907.10830. [Google Scholar]
  38. Lee, H.Y.; Tseng, H.Y.; Mao, Q.; Huang, J.B.; Lu, Y.D.; Singh, M.; Yang, M.H. Drit++: Diverse image-to-image translation via disentangled representations. Int. J. Comput. Vis. 2020, 128, 2402–2417. [Google Scholar] [CrossRef] [Green Version]
  39. Zheng, Z.; Wu, Y.; Han, X.; Shi, J. ForkGAN: Seeing into the Rainy Night. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12348, pp. 155–170. [Google Scholar] [CrossRef]
  40. Devaguptapu, C.; Akolekar, N.; Sharma, M.M.; Balasubramanian, V.N. Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1029–1038. [Google Scholar] [CrossRef] [Green Version]
  41. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar] [CrossRef] [Green Version]
  42. Cao, Y.; Zhou, T.; Zhu, X.; Su, Y. Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; pp. 1965–1969. [Google Scholar] [CrossRef]
  43. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-level Feature. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13034–13043. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the proposed Dual-YOLO. The proposed network is mainly designed to detect weak infrared objects captured by UAVs. However, to compensate for the loss of features due to variations in light intensity, we add a visible image feature extraction branch to the network to reduce the probability of missing objects.
Figure 2. Structures of the feature extraction modules in the backbone and neck: (a) the structure of ELAN1; (b) the structure of ELAN2; (c) the structure of SPPCSP. These structures remain consistent with the design in YOLOv7. ELAN2 has the same essential components as ELAN1 but more channels in the feature aggregation part, which ensures that multi-scale feature information is aggregated in the neck. For the maxpool structure in SPPCSP, the value of k is the downsampling ratio.
Figure 3. The effect of detecting infrared and visible images separately. Panels a(1), a(2), c(1), and c(2) show training and detection results on visible data only; panels b(1), b(2), d(1), and d(2) show training and detection results on infrared data only. The images were captured by a drone during the day and at night, at flight altitudes of 100 m and 200 m.
Figure 4. The structure of the attention fusion module: (a) the data flow of the attention fusion; (b) the Inception structure in (a), which connects the four branches; (c) the detailed convolution structure in (b).
Figure 5. The fusion shuffle module structure where the shuffle is performed after fusion.
Figure 6. Visualization of Dual-YOLO detection results on the KAIST pedestrian dataset.
Figure 7. Visualization of Dual-YOLO detection results on the FLIR dataset.
Figure 8. The fusion shuffle module structure where the shuffle is performed before fusion.
Table 1. Dataset information we used in this paper.
Hyper-Parameter | DroneVehicle Dataset | KAIST Dataset | FLIR Dataset
Scenario | drone | pedestrian | adas
Modality | R + I | R + I | R + I
#Images | 56,878 | 95,328 | 14,000
Categories | 5 | 3 | 4
#Labels | 190.6 K | 103.1 K | 14.5 K
Resolution | 840 × 712 | 640 × 480 | 1600 × 1800
Year | 2021 | 2015 | 2018
Table 2. Environment and parameter settings for the experimental setup.
Category | Parameter
CPU | Intel i9-10920X
GPU | RTX 3090 × 2
System | Ubuntu 18.04
Python | 3.7
PyTorch | 1.10
Training Epochs | 300
Learning Rate | 0.005
Weight Decay | 0.0001
Momentum | 0.9
Table 3. The hyper-parameters of the datasets used in this manuscript. "test-val" means that the test set used in this article is the same as the validation set.
Hyper-Parameter | DroneVehicle Dataset | KAIST Dataset | FLIR Dataset
Visible Image Size | 640 × 512 | 640 × 512 | 640 × 512
Infrared Image Size | 640 × 512 | 640 × 512 | 640 × 512
#Visible Images | 10,000 | 9853 | 10,228
#Infrared Images | 10,000 | 9853 | 10,228
#Training Set | 9000 | 7601 | 8862
#Validation Set | 500 | 2252 | 1366
#Testing Set | 500 | 2252 (test-val) | 1366 (test-val)
Table 4. Evaluation results on the DroneVehicle dataset. All values are in %. The top results are marked in green.
Method | Modality | Car | Freight Car | Truck | Bus | Van | mAP
RetinaNet(OBB) [28] | Visible | 67.5 | 13.7 | 28.2 | 62.1 | 19.3 | 38.2
Faster R-CNN(OBB) [29] | Visible | 67.9 | 26.3 | 38.6 | 67.0 | 23.2 | 44.6
Faster R-CNN(Dpool) [28] | Visible | 68.2 | 26.4 | 38.7 | 69.1 | 26.4 | 45.8
Mask R-CNN [30] | Visible | 68.5 | 26.8 | 39.8 | 66.8 | 25.4 | 45.5
Cascade Mask R-CNN [31] | Visible | 68.0 | 27.3 | 44.7 | 69.3 | 29.8 | 47.8
RoITransformer [27] | Visible | 68.1 | 29.1 | 44.2 | 70.6 | 27.6 | 47.9
YOLOv7 [3] | Visible | 98.2 | 41.4 | 70.5 | 97.8 | 44.7 | 68.5
RetinaNet(OBB) [32] | Infrared | 79.9 | 28.1 | 32.8 | 67.3 | 16.4 | 44.9
Faster R-CNN(OBB) [29] | Infrared | 88.6 | 35.2 | 42.5 | 77.9 | 28.5 | 54.6
Faster R-CNN(Dpool) [28] | Infrared | 88.9 | 36.8 | 47.9 | 78.3 | 32.8 | 56.9
Mask R-CNN [30] | Infrared | 88.8 | 36.6 | 48.9 | 78.4 | 32.2 | 57.0
Cascade Mask R-CNN [31] | Infrared | 81.0 | 39.0 | 47.2 | 79.3 | 33.0 | 55.9
RoITransformer [27] | Infrared | 88.9 | 41.5 | 51.5 | 79.5 | 34.4 | 59.2
YOLOv7 [3] | Infrared | 98.0 | 31.9 | 65.0 | 95.8 | 43.0 | 66.7
UA-CMDet [24] | Visible + Infrared | 87.5 | 46.8 | 60.7 | 87.1 | 38.0 | 64.0
Dual-YOLO (Ours) | Visible + Infrared | 98.1 | 52.9 | 65.7 | 95.8 | 46.6 | 71.8
Table 5. Pedestrian detection results of the synthesized images obtained by different translation methods on the KAIST dataset computed at a single IoU of 0.5. All values are in %. The top results are marked in green.
Method | Precision | Recall | mAP
CycleGAN [33] | 4.7 | 2.8 | 1.1
UNIT [34] | 26.7 | 14.5 | 11.0
MUNIT [35] | 2.1 | 1.6 | 0.3
ToDayGAN [36] | 11.4 | 14.9 | 5.0
UGATIT [37] | 13.3 | 7.6 | 3.2
DRIT++ [38] | 7.9 | 4.1 | 1.2
ForkGAN [39] | 33.9 | 4.6 | 4.9
PearlGAN [21] | 21.0 | 39.8 | 25.8
Dual-YOLO (Ours) | 75.1 | 66.7 | 73.2
Table 6. Object detection results of the synthesized images obtained by different translation methods on the FLIR dataset, computed at a single IoU of 0.5. All values are in %. The top results are marked in green.
Method | Person | Bicycle | Car | mAP
Faster R-CNN [40] | 39.6 | 54.7 | 67.6 | 53.9
SSD [1] | 40.9 | 43.6 | 61.6 | 48.7
RetinaNet [32] | 52.3 | 61.3 | 71.5 | 61.7
FCOS [4] | 69.7 | 67.4 | 79.7 | 72.3
MMTOD-UNIT [40] | 49.4 | 64.4 | 70.7 | 61.5
MMTOD-CG [40] | 50.3 | 63.3 | 70.6 | 61.4
RefineDet [41] | 77.2 | 57.2 | 84.5 | 72.9
ThermalDet [42] | 78.2 | 60.0 | 85.5 | 74.6
YOLO-FIR [9] | 85.2 | 70.7 | 84.3 | 80.1
YOLOv3-tiny [16] | 67.1 | 50.3 | 81.2 | 66.2
IARet [16] | 77.2 | 48.7 | 85.8 | 70.7
CMPD [22] | 69.6 | 59.8 | 78.1 | 69.3
PearlGAN [21] | 54.0 | 23.0 | 75.5 | 50.8
Cascade R-CNN [31] | 77.3 | 84.3 | 79.8 | 80.5
YOLOv5s [10] | 68.3 | 67.1 | 80.0 | 71.8
YOLOF [43] | 67.8 | 68.1 | 79.4 | 71.8
Dual-YOLO (Ours) | 88.6 | 66.7 | 93.0 | 84.5
Table 7. Model complexity and runtime comparison of Dual-YOLO and the plain counterparts.
Method | Dataset | #Params | Runtime (fps)
Faster R-CNN (OBB) | DroneVehicle | 58.3 M | 5.3
Faster R-CNN (Dpool) | DroneVehicle | 59.9 M | 4.3
Mask R-CNN | DroneVehicle | 242.0 M | 13.5
RetinaNet | DroneVehicle | 145.0 M | 15.0
Cascade Mask R-CNN | DroneVehicle | 368.0 M | 9.8
RoITransformer | DroneVehicle | 273.0 M | 7.1
YOLOv7 | DroneVehicle | 72.1 M | 161.0
SSD | FLIR | 131.0 M | 43.7
FCOS | FLIR | 123.0 M | 22.9
RefineDet | FLIR | 128.0 M | 24.1
YOLO-FIR | FLIR | 7.1 M | 83.3
YOLOv3-tiny | FLIR | 17.0 M | 50.0
Cascade R-CNN | FLIR | 165.0 M | 16.1
YOLOv5s | FLIR | 14.0 M | 41.0
YOLOF | FLIR | 44.0 M | 32.0
Dual-YOLO | DroneVehicle/FLIR | 175.1 M | 62.0
Table 8. On the FLIR dataset, object detection results at a single IoU of 0.50 when the shuffle is placed in different positions of Dual-YOLO. All values are in %. The top results are marked in green.
Method | Person | Bicycle | Car | mAP
without shuffle | 87.2 | 63.6 | 92.6 | 81.1
shuffle before fusion | 88.0 | 68.6 | 92.9 | 83.2
shuffle after fusion | 88.6 | 66.7 | 93.0 | 84.5
Table 9. Object detection results on the FLIR dataset for different combinations of modules in the attention fusion module. These results are computed at a single IoU of 0.50 and over IoU from 0.50 to 0.95. All values are in %. The top results are marked in green. (A check mark indicates that the module is enabled; the assignment of rows to module combinations follows the improvements reported in the text.)
Inception | SE | Person | Bicycle | Car | mAP@0.5 | mAP@0.5:0.95
– | – | 85.1 | 64.5 | 89.4 | 79.7 | 41.6
✓ | – | 86.9 | 69.0 | 91.6 | 82.5 | 44.3
– | ✓ | 86.2 | 65.7 | 91.4 | 81.1 | 43.3
✓ | ✓ | 88.6 | 66.7 | 93.0 | 84.5 | 46.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

