License: arXiv.org perpetual non-exclusive license
arXiv:2404.09146v1 [cs.CV] 14 Apr 2024

Fusion-Mamba for Cross-modality Object Detection

Wenhao Dong1, Haodong Zhu1, Shaohui Lin2, Xiaoyan Luo1, Yunhang Shen3,
Xuhui Liu1, Juan Zhang1, Guodong Guo4, Baochang Zhang1

1Beihang University, Beijing, China  2East China Normal University, Shanghai, China
3Tencent Youtu Lab, Shanghai, China  4Eastern Institute of Technology, Ningbo, China

These authors contributed equally. Corresponding Author: shaohuilin007@gmail.com.
Abstract

Cross-modality fusion of complementary information from different modalities effectively improves object detection performance, making detectors more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on mAP by 5.9% on $M^3FD$ and by 4.9% on the FLIR-Aligned dataset, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and to establish a new baseline for cross-modality object detection.

Figure 1: Heatmap visualization.  (a) and (b) show the initial RGB and IR input images.  (c) and (d) show heatmaps generated from single-modality using YOLOv8.  (e) shows the heatmap of YOLO-MS with a CNN-based fusion module.  (f) and (g) show heatmaps of ICAFusion and CFT with a transformer-based fusion module.  (h) shows the heatmap of our FMB, which achieves better localization.

1 Introduction

With the swift development of multi-modality sensor technology, multi-modality images have been used in many different areas. Among them, paired infrared (IR) and visible images have been widely used, since these two modalities provide complementary information. For example, infrared images show a clear thermal structure of objects without being affected by luminance, but they lack the texture details of the target. In contrast, visible images capture rich object texture and scene information, but lighting conditions severely affect image quality. Thus, many studies focus on infrared and visible feature fusion to improve the perceptibility and robustness of downstream high-level image and scene understanding tasks, e.g., object detection and image segmentation.

Figure 2: The architecture of the proposed Fusion-Mamba method. The detection network comprises a dual-stream feature extraction network and three Fusion-Mamba blocks (FMB), with the same neck and head as YOLOv8. The top is our detection framework; $\phi_i$ and $\varphi_i$ are the convolutional modules of the RGB and IR branches, which are used to generate the features $F_{R_i}$ and $F_{IR_i}$, respectively. $\hat{F}_{R_i}$ and $\hat{F}_{IR_i}$ are the enhanced feature maps produced by our FMB. $P_3$, $P_4$, and $P_5$ are the summation outputs of the enhanced feature maps, serving as the feature pyramid inputs for the neck at the last three stages. The bottom shows the design details of our FMB.

Existing multi-spectral fusion approaches generally employ deep convolutional neural networks (CNNs) [27, 29, 10] or Transformers [32, 5] to fuse cross-modality features. A halfway fusion is introduced to integrate two-branch middle-level features from RGB and IR images for multispectral pedestrian detection [19]. GFD-SSD [46] uses Gated Fusion Units to build a two-stream middle fusion detector, which achieves higher performance than a single modality. In this light, YOLO-MS is introduced with two CNN-based fusion modules to fuse the adjacent branches of the YOLOv5 backbone for real-time object detection [35]. Albeit with great success for cross-modality fusion based on CNNs with local receptive fields, Transformer-based methods [32, 5] have been proposed to effectively learn long-range dependencies for cross-modality feature fusion. CFT [6] is the first study to explore a Transformer for middle-level feature fusion, which can improve YOLOv5 performance. ICAFusion [26], with a double cross-attention Transformer, can successfully model global features and capture complementary information among modalities. However, these cross-modality fusion methods fail to consider modality disparities, which adversely affects cross-modality feature fusion. As shown in Fig. 1(e)(f)(g), the heatmaps of the YOLO-MS, ICAFusion and CFT fusion features show that they cannot effectively fuse features from different modalities or model the correlations between cross-modality objects, as the modalities have clearly different representations. This prompts a rethinking: can we build an effective cross-modality interactive space that reduces modality disparities for a consistent representation, and thus benefits from the cross-modality relationship for feature enhancement? Moreover, Transformer-based cross-modality fusion is compute-intensive, with quadratic time and space complexity.

In this paper, we propose a Fusion-Mamba method, aiming to fuse features in a hidden state space, which may open up a new paradigm for cross-modality feature fusion. We are inspired by Mamba [8, 21, 47], whose linear complexity allows building a hidden state space, which we further improve with a gating mechanism to enable deeper and more complex fusion. Our Fusion-Mamba method lies in the innovative Fusion-Mamba block (FMB), as illustrated in Fig. 2. In FMB, we design a State Space Channel Swapping (SSCS) module for shallow feature fusion to improve the interaction ability of cross-modality features, and a Dual State Space Fusion (DSSF) module to build a hidden state space for cross-modality feature association and complementarity. These two modules help reduce the modality disparities during fusion, as shown in Fig. 1(h). The heatmap shows that our method fuses features more effectively and makes the detector focus more on the target. This work makes the following contributions:

1) The proposed Fusion-Mamba method explores the potential of Mamba for cross-modal fusion, which enhances the representation consistency of fused features. We build a hidden state space for cross-modality interaction to reduce disparities between cross-modality features, based on Mamba improved with a gating mechanism.

2) We design a Fusion-Mamba block with two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space.

3) Extensive experiments on three public RGB-IR object detection datasets demonstrate that our method achieves state-of-the-art performance, offering a new baseline for cross-modality object detection.

2 Related Works

Multi-modality Object Detection. With the rapid development of single-modality detectors such as the YOLO series [23] and Transformer-based detectors [carion2020end, liu2021swin, guo2022cmt], multi-modal object detectors have emerged to make good use of images from different modalities. So far, research on multi-modality object detection has focused on two main directions: pixel-level fusion and feature-level fusion. Pixel-level fusion merges multi-modal input images, and the fused image is fed into the detector; these methods focus on reconstructing fused images from multi-modal input image information [li2018densefuse, ma2019fusiongan, creswell2018generative]. Feature-level fusion joins the outputs of a detector at a certain stage, such as the early and later features extracted by the backbone (early and middle fusion [33, 2]) and the detection output (late fusion [16, 4]). Feature-level fusion can integrate the fusion operation into the detection network as a unified end-to-end CNN or Transformer framework [2, 33, 35, 6, 26]. These fusion methods can effectively improve single-modality object detection performance. However, they remain limited in modeling modality disparity and fusion complexity.

Mamba. Since Mamba [8] was proposed for linear-time sequence modeling in the NLP field, it has been rapidly extended to various computer vision tasks. VMamba [21] introduces a four-way scanning algorithm based on the characteristics of images and constructs a Mamba-based vision backbone that outperforms Swin Transformer in object detection, object segmentation, and object tracking. VM-UNet [25] performs well in medical image segmentation based on the UNet framework and Mamba blocks. After that, many Mamba-based deep networks [22, 34, 41] have been proposed for accurate medical image segmentation. Video Mamba [3] expands the original 2D scan into different bidirectional 3D scans and designs a Mamba framework for video understanding.

Different from previous methods, our work is the first to exploit Mamba for multi-modality feature fusion. We introduce a carefully designed Mamba-based structure to integrate the cross-modality features in a hidden state space.

Figure 3: Illustration of the 2D Selective Scan (SS2D) on an RGB image. Initially, the image undergoes scan expansion, resulting in four distinct feature sequences. Subsequently, each of these sequences is independently processed through the S6 block. Finally, the outputs of the S6 block are combined through scan merging to generate the final 2D feature map.

3 Method

3.1 Preliminaries

State Space Models. State Space Models (SSMs) are frequently used to represent linear time-invariant systems, which process a one-dimensional input sequence $x(t) \in \mathcal{R}$ by passing it through intermediate implicit states $h(t) \in \mathcal{R}^{N}$ to produce an output $y(t) \in \mathcal{R}$. Mathematically, SSMs are often formulated as linear ordinary differential equations (ODEs):

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t),$$ (1)

where the system's behavior is defined by a set of parameters, including the state transition matrix $A \in \mathcal{R}^{N \times N}$, the projection parameters $B, C \in \mathcal{R}^{N \times 1}$, and the skip connection $D \in \mathcal{R}$. The term $Dx(t)$ can be easily removed by setting $D = 0$ for exposition.

Discretization. The continuous-time nature of SSMs in Eq. 1 poses significant challenges when applied in deep learning scenarios. To address this issue, it is necessary to discretize the ODEs, converting them into discrete functions. This is essential for aligning the model with the sampling rate of the underlying signals in the input data and for enabling efficient computation [15]. Considering the input $x_k \in \mathcal{R}^{L \times D}$, a sampled vector within the signal flow of length $L$ following [40], the introduction of a timescale parameter $\Delta$ allows the transition from the continuous parameters $A$ and $B$ to their discrete counterparts $\overline{A}$ and $\overline{B}$, adhering to the zeroth-order hold (ZOH) principle. Consequently, Eq. 1 is discretized as follows:

$$h_k = \overline{A} h_{k-1} + \overline{B} x_k, \qquad y_k = \overline{C} h_k + D x_k,$$
$$\overline{A} = e^{\Delta A}, \qquad \overline{B} = (\Delta A)^{-1}(e^{\Delta A} - I)\Delta B, \qquad \overline{C} = C,$$ (2)

where $B, C \in \mathcal{R}^{D}$ and $I$ is an identity matrix. After discretization, SSMs are computed by a global convolution with a structured convolutional kernel $\bar{K} \in \mathcal{R}^{D}$:

$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\ C\bar{A}\bar{B},\ \cdots,\ C\bar{A}^{L-1}\bar{B}\big).$$ (3)

Based on Eq. 2 and Eq. 3, Mamba [8] designs a simple selection mechanism to parameterize the SSM parameters $\Delta$, $A$, $B$, and $C$ depending on the input $x$, which selectively propagates or forgets information along the sequence length dimension for 1D language sequence modeling.
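To make the recurrence concrete, below is a minimal PyTorch sketch (not the authors' released code) of a selective SSM applied to a batch of 1D sequences: it discretizes $A$ and $B$ with an input-dependent $\Delta$ (using the common first-order approximation of the ZOH formula for $\overline{B}$) and unrolls Eq. 2 along the sequence dimension. All tensor shapes and the way $\Delta$, $B$, $C$ are supplied are illustrative assumptions.

```python
import torch

def selective_ssm(x, A, B, C, delta):
    """Unrolled selective SSM recurrence (Eq. 2), with D = 0 as in Sec. 3.1.

    x:     (Bsz, L, D)  input sequence
    A:     (D, N)       state transition parameters
    B, C:  (Bsz, L, N)  input-dependent projections (the "selection")
    delta: (Bsz, L, D)  input-dependent timescale
    """
    Bsz, L, D = x.shape
    # ZOH discretization; B_bar uses the common first-order approximation delta * B
    A_bar = torch.exp(delta.unsqueeze(-1) * A)            # (Bsz, L, D, N)
    B_bar = delta.unsqueeze(-1) * B.unsqueeze(2)          # (Bsz, L, D, N)
    h = torch.zeros(Bsz, D, A.shape[-1], device=x.device)
    ys = []
    for k in range(L):
        h = A_bar[:, k] * h + B_bar[:, k] * x[:, k].unsqueeze(-1)   # h_k
        ys.append((h * C[:, k].unsqueeze(1)).sum(-1))               # y_k = C h_k
    return torch.stack(ys, dim=1)                         # (Bsz, L, D)

# toy usage with random parameters
x = torch.randn(2, 16, 8)
A = -torch.rand(8, 4)                                     # negative values keep the state stable
B, C = torch.randn(2, 16, 4), torch.randn(2, 16, 4)
delta = torch.rand(2, 16, 8)
print(selective_ssm(x, A, B, C, delta).shape)             # torch.Size([2, 16, 8])
```

In practice Mamba replaces this Python loop with a hardware-aware parallel scan; the loop above is only meant to expose the state transition.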

2D-Selective-Scan Mechanism. The incompatibility between 2D visual data and 1D language sequences renders the direct application of Mamba to vision tasks inappropriate. For example, while 2D spatial information plays a crucial role in vision-related tasks, it takes a secondary role in 1D sequence modeling. This discrepancy leads to limited receptive fields that fail to capture potential correlations with unexplored patches. The 2D selective scan (SS2D) mechanism is introduced in [21] to address the above challenge. An overview of SS2D is depicted in Fig. 3. SS2D first scan-expands image patches along four distinct directions to generate four independent sequences. This quad-directional scanning methodology guarantees that every element within the feature map incorporates information from all other positions across various directions. Consequently, it establishes a comprehensive global receptive field without necessitating a linear increase in computational complexity. Subsequently, each feature sequence is processed by the selective scan state space sequential model (S6) [8]. Finally, the feature sequences are aggregated to reconstruct the 2D feature map. SS2D serves as the core element of the Visual State Space (VSS) block, which is illustrated in Fig. 2 and will be used to build a hidden state space for cross-modality feature fusion.
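As an illustration of the scan expansion and scan merging around the S6 blocks, here is a minimal sketch; the exact traversal orders used by SS2D are an assumption here, and the S6 blocks are stubbed out with the identity:

```python
import torch

def scan_expand(x):
    """Expand a feature map (B, C, H, W) into four 1D sequences of length H*W."""
    seq_hw = x.flatten(2)                                  # row-major traversal
    seq_wh = x.transpose(2, 3).flatten(2)                  # column-major traversal
    return [seq_hw, seq_hw.flip(-1), seq_wh, seq_wh.flip(-1)]

def scan_merge(seqs, H, W):
    """Fold the four (processed) sequences back to (B, C, H, W) and sum them."""
    B, C, _ = seqs[0].shape
    y_hw = seqs[0] + seqs[1].flip(-1)                      # undo the reversed row-major scan
    y_wh = (seqs[2] + seqs[3].flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return (y_hw + y_wh).view(B, C, H, W)

x = torch.randn(1, 4, 8, 8)
seqs = scan_expand(x)
# each sequence would be processed by its own S6 block here; identity used as a stand-in
out = scan_merge(seqs, 8, 8)
print(out.shape)   # torch.Size([1, 4, 8, 8]); equals 4*x when the S6 blocks are the identity
```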

3.2 Fusion-Mamba

3.2.1 Architecture

The architecture of our model is depicted in Fig. 2. Its detection backbone comprises a dual-stream feature extraction network and three Fusion-Mamba blocks (FMB), while the detection network contains the neck and head for cross-modality object detection. The feature extraction network extracts local features from the RGB and IR images, denoted by $F_{R_i}$ and $F_{IR_i}$, respectively. We then input these two features into the FMB, which associates cross-modal features in a hidden state space, reducing disparities between cross-modal features and enhancing the representation consistency of fused features. Specifically, the two local features first pass through the State Space Channel Swapping ($SSCS$) module for shallow feature fusion to obtain the interactive features $\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$. Then, we feed these interactive features into the Dual State Space Fusion ($DSSF$) module for deep feature fusion in the hidden state space, which generates the corresponding complementary features $\overline{F}_{R_i}$ and $\overline{F}_{IR_i}$. The local features are enhanced to generate $\hat{F}_{R_i}$ and $\hat{F}_{IR_i}$ by adding the original features $F_{R_i}$ and $F_{IR_i}$ to the complementary features $\overline{F}_{R_i}$ and $\overline{F}_{IR_i}$, respectively. Subsequently, the enhanced features $\hat{F}_{R_i}$ and $\hat{F}_{IR_i}$ are directly added to generate the fused feature $P_i$. In this paper, FMB is only added to the last three stages to generate the fused features $P_3$, $P_4$ and $P_5$ (if not specified), which are the inputs for the neck and head of YOLOv8 to generate the final detection results (as shown in Fig. 4).

Figure 4: Illustration of the neck and head following YOLOv8.
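A minimal sketch of how the dual-stream backbone and the three FMBs could be wired is given below. The stage layout, channel widths, and the decision to propagate the enhanced features to the next stage are placeholders rather than the released implementation, and `fmb_cls` stands in for the Fusion-Mamba block described in Sec. 3.2.2.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Stand-in for one backbone stage (phi_i / varphi_i): a stride-2 conv block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.SiLU())
    def forward(self, x):
        return self.block(x)

class DualStreamBackbone(nn.Module):
    """Dual-stream feature extraction with FMB fusion at the last three stages."""
    def __init__(self, fmb_cls, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.rgb_stages = nn.ModuleList([ConvStage(chans[i], chans[i + 1]) for i in range(5)])
        self.ir_stages = nn.ModuleList([ConvStage(chans[i], chans[i + 1]) for i in range(5)])
        self.fmbs = nn.ModuleList([fmb_cls(widths[i]) for i in (2, 3, 4)])  # one FMB per fused stage

    def forward(self, rgb, ir):
        pyramid = []
        for i in range(5):
            rgb, ir = self.rgb_stages[i](rgb), self.ir_stages[i](ir)
            if i >= 2:                                      # last three stages -> P3, P4, P5
                rgb_hat, ir_hat = self.fmbs[i - 2](rgb, ir)  # enhanced features (Eq. 12)
                pyramid.append(rgb_hat + ir_hat)             # P_i, fed to the YOLOv8 neck/head
                rgb, ir = rgb_hat, ir_hat                    # assumption: enhanced features feed the next stage
        return pyramid                                       # [P3, P4, P5]
```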

3.2.2 Key components

Given the input RGB image $I_R$ and infrared image $I_{IR}$, we feed them into a series of convolutional blocks to extract their local features:

$$F_{R_i} = \phi_i(\cdots\phi_2(\phi_1(I_R))), \qquad F_{IR_i} = \varphi_i(\cdots\varphi_2(\varphi_1(I_{IR}))),$$ (4)

where $\phi_i$ and $\varphi_i$ represent the convolutional blocks of the RGB and IR branches at the $i$-th stage, respectively.

To implement cross-modality feature fusion, existing methods [9, 6, 26] primarily emphasize the integration of spatial features, yet they inadequately consider the feature disparity between modalities. Consequently, the fused model fails to effectively model the correlations between targets of different modalities, diminishing the model's representation capacity. Motivated by Mamba [8], with its strong sequence modeling ability in the state space, we design a Fusion-Mamba block (FMB) to construct a hidden state space for cross-modal feature interaction and association. The effectiveness of FMB lies in two key modules, the State Space Channel Swapping (SSCS) module and the Dual State Space Fusion (DSSF) module, which reduce disparities between cross-modality features to enhance the representation consistency of fused features. Alg. 1 provides the computation process of the SSCS and DSSF modules.

SSCS module. This module aims to enhance cross-modality feature interaction for shallow feature fusion through the channel swapping operation and a VSS block. Cross-modal feature correlation is constructed by integrating information from distinct channels, which enriches the diversity of channel features to improve fusion performance. First, we employ the channel swapping operation to generate new local features for RGB, $T_{R_i}$, and IR, $T_{IR_i}$, which can be formulated as:

$$T_{R_i} = CS(F_{R_i}, F_{IR_i}), \qquad T_{IR_i} = CS(F_{IR_i}, F_{R_i}),$$ (5)

where $CS(\cdot,\cdot)$ is the channel swapping operation, which is easy to implement by channel splitting and concatenation. First, both local features $F_{R_i}$ and $F_{IR_i}$ are divided into four equal parts along the channel dimension. Subsequently, we select the first and third parts from $F_{R_i}$ and the second and fourth parts from $F_{IR_i}$, and concatenate them in their original part order to generate the new local RGB features $T_{R_i}$; a code sketch is given after Eq. 6. Correspondingly, we generate the new local IR features $T_{IR_i}$. After that, one VSS block is applied to $T_{R_i}$ and $T_{IR_i}$, which enhances cross-modality interaction from shallow features:

$$\tilde{F}_{R_i} = VSS(T_{R_i}), \qquad \tilde{F}_{IR_i} = VSS(T_{IR_i}),$$ (6)

where $VSS(\cdot)$ denotes the VSS block [21] depicted in Fig. 2. $\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$ are the outputs of shallow fused features from the RGB and IR modality, respectively.
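A minimal sketch of the channel swapping operation $CS(\cdot,\cdot)$ described above, using a four-way channel split and concatenation (the interleaving order follows the text; everything else is an assumption):

```python
import torch

def channel_swap(f_a, f_b):
    """CS(f_a, f_b): keep parts 1 and 3 of f_a and take parts 2 and 4 from f_b."""
    a1, _, a3, _ = f_a.chunk(4, dim=1)        # four equal channel groups
    _, b2, _, b4 = f_b.chunk(4, dim=1)
    return torch.cat([a1, b2, a3, b4], dim=1)

f_rgb, f_ir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
t_rgb = channel_swap(f_rgb, f_ir)             # T_{R_i} in Eq. 5
t_ir = channel_swap(f_ir, f_rgb)              # T_{IR_i} in Eq. 5
print(t_rgb.shape)                            # torch.Size([2, 64, 32, 32])
```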

DSSF module. To further reduce modality disparities, we build a hidden state space for cross-modality feature association and complementarity. DSSF is proposed to model cross-modality object correlation to facilitate feature fusion. Specifically, we employ the VSS block to project features from both modalities into a hidden state space, and a gating mechanism is utilized to construct hidden state transitions dually for cross-modality deep feature fusion.

Formally, after obtaining the shallow fused features $\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$, we first project them into the hidden state space through a VSS block without gating as:

$$y_{R_i} = P_{in}(\tilde{F}_{R_i}), \qquad y_{IR_i} = P_{in}(\tilde{F}_{IR_i}),$$ (7)

where $P_{in}(\cdot)$ denotes an operation for projecting features into the hidden state space; the detailed implementation is described in lines 13-17 of Alg. 1. $y_{R_i}$ and $y_{IR_i}$ denote the hidden state features. We also project $\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$ to obtain the gating parameters $z_{R_i}$ and $z_{IR_i}$:

$$z_{R_i} = f_{\theta_i}(\tilde{F}_{R_i}), \qquad z_{IR_i} = g_{\omega_i}(\tilde{F}_{IR_i}),$$ (8)

where $f_{\theta_i}(\cdot)$ and $g_{\omega_i}(\cdot)$ represent the gating operations with parameters $\theta_i$ and $\omega_i$ in a dual stream, respectively. After that, we employ the gating outputs $z_{R_i}$ and $z_{IR_i}$ in Eq. 8 to modulate $y_{R_i}$ and $y_{IR_i}$, and implement the hidden state feature fusion as:

$$y'_{R_i} = y_{R_i} \cdot z_{R_i} + z_{R_i} \cdot y_{IR_i},$$ (9)
$$y'_{IR_i} = y_{IR_i} \cdot z_{IR_i} + z_{IR_i} \cdot y_{R_i},$$ (10)

where $y'_{R_i}$ and $y'_{IR_i}$ represent the hidden state features of RGB and IR after feature interaction, respectively, and $\cdot$ is an element-wise product. Eqs. 9 and 10 build the cross-modality fusion in a hidden state space based on the gating mechanism, and dual attention is fully used for cross-branch information complementarity.

Subsequently, we project $y'_{R_i}$ and $y'_{IR_i}$ back to the original space and pass them through a residual connection to obtain the complementary features $\overline{F}_{R_i}$ and $\overline{F}_{IR_i}$:

$$\overline{F}_{R_i} = P_{out}(y'_{R_i}) + \tilde{F}_{R_i}, \qquad \overline{F}_{IR_i} = P_{out}(y'_{IR_i}) + \tilde{F}_{IR_i},$$ (11)

where $P_{out}(\cdot)$ denotes the projection operation with a linear transformation.
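The following is a minimal PyTorch sketch of one DSSF step (Eqs. 7-11). $P_{in}$ is approximated by the Norm-Linear-DWConv-SiLU-SS2D pipeline of Alg. 1 (lines 13-17), with SS2D left as an injectable module (identity by default); whether $P_{out}$ and the normalizations are shared across branches is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class DSSF(nn.Module):
    """One Dual State Space Fusion step (Eqs. 7-11); an illustrative sketch."""
    def __init__(self, channels, hidden=None, ss2d=None):
        super().__init__()
        hidden = hidden or 2 * channels
        self.norm = nn.LayerNorm(channels)
        self.proj_in = nn.Linear(channels, hidden)                  # part of P_in
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.ss2d = ss2d if ss2d is not None else nn.Identity()     # stand-in for SS2D
        self.gate_rgb = nn.Linear(channels, hidden)                 # f_theta
        self.gate_ir = nn.Linear(channels, hidden)                  # g_omega
        self.proj_out = nn.Linear(hidden, channels)                 # P_out
        self.act = nn.SiLU()

    def _p_in(self, f):
        # P_in: Norm -> Linear -> DWConv -> SiLU -> SS2D (Alg. 1, lines 13-17)
        x = self.proj_in(self.norm(f.permute(0, 2, 3, 1)))          # (B, H, W, P)
        x = self.act(self.dwconv(x.permute(0, 3, 1, 2)))            # (B, P, H, W)
        return self.ss2d(x)

    def _gate(self, f, layer):
        return self.act(layer(self.norm(f.permute(0, 2, 3, 1)))).permute(0, 3, 1, 2)

    def _p_out(self, y):
        return self.proj_out(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, f_rgb, f_ir):
        y_rgb, y_ir = self._p_in(f_rgb), self._p_in(f_ir)                               # Eq. 7
        z_rgb, z_ir = self._gate(f_rgb, self.gate_rgb), self._gate(f_ir, self.gate_ir)  # Eq. 8
        y_rgb_f = y_rgb * z_rgb + z_rgb * y_ir                                          # Eq. 9
        y_ir_f = y_ir * z_ir + z_ir * y_rgb                                             # Eq. 10
        return self._p_out(y_rgb_f) + f_rgb, self._p_out(y_ir_f) + f_ir                 # Eq. 11

f_rgb, f_ir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
out_rgb, out_ir = DSSF(64)(f_rgb, f_ir)
print(out_rgb.shape, out_ir.shape)           # torch.Size([2, 64, 32, 32]) each
```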

In practice, we stack several DSSF modules (i.e., Eq. 7 to Eq. 11) to obtain much deeper feature fusion, which achieves better results. However, performance saturates once the number of DSSF modules exceeds a certain value, which is further evaluated in our experiments. Finally, we merge the complementary features into the local features to enhance the feature representation by an addition operation:

$$\hat{F}_{R_i} = F_{R_i} + \overline{F}_{R_i}, \qquad \hat{F}_{IR_i} = F_{IR_i} + \overline{F}_{IR_i}.$$ (12)
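Putting the pieces together, a possible FMB wiring that combines the `channel_swap` and `DSSF` sketches above with the residual enhancement of Eq. 12 might look like the following (again a sketch, not the released code; `vss_cls` is a placeholder for a VSS block implementation such as VMamba's, and the paper's own pseudocode is given in Alg. 1 below):

```python
import torch.nn as nn

class FusionMambaBlock(nn.Module):
    """SSCS shallow fusion, N stacked DSSF steps, then residual enhancement (Eq. 12)."""
    def __init__(self, channels, num_dssf=2, vss_cls=nn.Identity):
        super().__init__()
        self.vss_rgb = vss_cls()                   # VSS block applied to T_{R_i} (Eq. 6)
        self.vss_ir = vss_cls()                    # VSS block applied to T_{IR_i} (Eq. 6)
        self.dssf = nn.ModuleList([DSSF(channels) for _ in range(num_dssf)])

    def forward(self, f_rgb, f_ir):
        # SSCS: channel swapping (Eq. 5) followed by the VSS block (Eq. 6)
        fr = self.vss_rgb(channel_swap(f_rgb, f_ir))
        fi = self.vss_ir(channel_swap(f_ir, f_rgb))
        # DSSF: stacked deep fusion in the hidden state space (Eqs. 7-11, Alg. 1 loop)
        for blk in self.dssf:
            fr, fi = blk(fr, fi)
        # Eq. 12: enhance the original branch features with the complementary features
        return f_rgb + fr, f_ir + fi
```

With this signature, `FusionMambaBlock` can be plugged into the dual-stream backbone sketch of Sec. 3.2.1 as `fmb_cls`.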
Algorithm 1 Algorithm of Fusion-Mamba block (FMB)
Input:  F_{R_i}: (B, C_i, H_i, W_i), F_{IR_i}: (B, C_i, H_i, W_i)
Output: \hat{F}_{R_i}: (B, C_i, H_i, W_i), \hat{F}_{IR_i}: (B, C_i, H_i, W_i)
1:  /* State Space Channel Swapping (SSCS) module */
2:  /* channel swapping with shallow feature fusion */
3:  T_{R_i}: (B, C_i, H_i, W_i) ← CS(F_{R_i}, F_{IR_i})
4:  T_{IR_i}: (B, C_i, H_i, W_i) ← CS(F_{IR_i}, F_{R_i})
5:  \tilde{F}_{R_i}: (B, C_i, H_i, W_i) ← VSS(T_{R_i})
6:  \tilde{F}_{IR_i}: (B, C_i, H_i, W_i) ← VSS(T_{IR_i})
7:  /* Dual State Space Fusion (DSSF) module */
8:  for k = 1 to N do
9:      if k != 1 then
10:         \tilde{F}_{R_i} ← \overline{F}_{R_i}
11:         \tilde{F}_{IR_i} ← \overline{F}_{IR_i}
12:     end if
13:     for m in {R_i, IR_i} do
14:         /* Project the input into the hidden state space */
15:         x_m: (B, P_i, H_i, W_i) ← Linear(Norm(\tilde{F}_m))
16:         x'_m: (B, P_i, H_i, W_i) ← SiLU(DWConv(x_m))
17:         y_m: (B, P_i, H_i, W_i) ← Norm(SS2D(x'_m))
18:     end for
19:     /* Obtain the gating outputs z_{R_i}, z_{IR_i} */
20:     z_{R_i}: (B, P_i, H_i, W_i) ← SiLU(Linear(Norm(\tilde{F}_{R_i})))
21:     z_{IR_i}: (B, P_i, H_i, W_i) ← SiLU(Linear(Norm(\tilde{F}_{IR_i})))
22:     /* Transition of hidden states */
23:     y'_{R_i}: (B, P_i, H_i, W_i) ← y_{R_i} · SiLU(z_{R_i}) + SiLU(z_{R_i}) · y_{IR_i}
24:     y'_{IR_i}: (B, P_i, H_i, W_i) ← y_{IR_i} · SiLU(z_{IR_i}) + SiLU(z_{IR_i}) · y_{R_i}
25:     /* Project the hidden state into the original space */
26:     F¯Ri:(B,Ci,Hi,Wi)Linear(yRi)+F~Ri:subscript¯𝐹subscript𝑅𝑖𝐵subscript𝐶𝑖subscript𝐻𝑖subscript𝑊𝑖𝐿𝑖𝑛𝑒𝑎𝑟subscriptsuperscript𝑦subscript𝑅𝑖subscript~𝐹subscript𝑅𝑖\overline{F}_{R_{i}}:{\color[rgb]{0,0.8,0}\definecolor[named]{pgfstrokecolor}{% rgb}{0,0.8,0}(B,C_{i},H_{i},W_{i})}\leftarrow Linear(y^{\prime}_{R_{i}})+% \tilde{F}_{R_{i}}over¯ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : ( italic_B , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ← italic_L italic_i italic_n italic_e italic_a italic_r ( italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) + over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT
27:     F¯IRi:(B,Ci,Hi,Wi)Linear(yIRi)+F~IRi:subscript¯𝐹𝐼subscript𝑅𝑖𝐵subscript𝐶𝑖subscript𝐻𝑖subscript𝑊𝑖𝐿𝑖𝑛𝑒𝑎𝑟subscriptsuperscript𝑦𝐼subscript𝑅𝑖subscript~𝐹𝐼subscript𝑅𝑖\overline{F}_{IR_{i}}:{\color[rgb]{0,0.8,0}\definecolor[named]{pgfstrokecolor}% {rgb}{0,0.8,0}(B,C_{i},H_{i},W_{i})}\leftarrow Linear(y^{\prime}_{IR_{i}})+% \tilde{F}_{IR_{i}}over¯ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : ( italic_B , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ← italic_L italic_i italic_n italic_e italic_a italic_r ( italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) + over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT
28:  end for
29:  /Enhancefeaturerepresentation/{\color[rgb]{0.5,0.5,0.5}\definecolor[named]{pgfstrokecolor}{rgb}{0.5,0.5,0.5}% \pgfsys@color@gray@stroke{0.5}\pgfsys@color@gray@fill{0.5}/*~{}Enhance~{}% feature~{}representation~{}*/}/ ∗ italic_E italic_n italic_h italic_a italic_n italic_c italic_e italic_f italic_e italic_a italic_t italic_u italic_r italic_e italic_r italic_e italic_p italic_r italic_e italic_s italic_e italic_n italic_t italic_a italic_t italic_i italic_o italic_n ∗ /
30:  F^Ri:(B,Ci,Hi,Wi)FRi+F¯Ri:subscript^𝐹subscript𝑅𝑖𝐵subscript𝐶𝑖subscript𝐻𝑖subscript𝑊𝑖subscript𝐹subscript𝑅𝑖subscript¯𝐹subscript𝑅𝑖\hat{F}_{R_{i}}:{\color[rgb]{0,0.8,0}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,0.8,0}(B,C_{i},H_{i},W_{i})}\leftarrow F_{R_{i}}+\overline{F}_{R_{i}}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : ( italic_B , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ← italic_F start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT + over¯ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT
31:  F^IRi:(B,Ci,Hi,Wi)FIRi+F¯IRi:subscript^𝐹𝐼subscript𝑅𝑖𝐵subscript𝐶𝑖subscript𝐻𝑖subscript𝑊𝑖subscript𝐹𝐼subscript𝑅𝑖subscript¯𝐹𝐼subscript𝑅𝑖\hat{F}_{IR_{i}}:{\color[rgb]{0,0.8,0}\definecolor[named]{pgfstrokecolor}{rgb}% {0,0.8,0}(B,C_{i},H_{i},W_{i})}\leftarrow F_{IR_{i}}+\overline{F}_{IR_{i}}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT : ( italic_B , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ← italic_F start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT + over¯ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT
32:  return  F^Risubscript^𝐹subscript𝑅𝑖\hat{F}_{R_{i}}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT,F^IRisubscript^𝐹𝐼subscript𝑅𝑖\hat{F}_{IR_{i}}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_I italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT
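
For intuition, a minimal PyTorch sketch of the gated hidden-state fusion in steps 16-31 is given below. The ss2d method is only a placeholder for the 2D selective-scan (SS2D) operator of VMamba, and the layer names, the shared input projection, and the folded residual connections are simplifying assumptions for exposition, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGatedFusion(nn.Module):
    """Illustrative sketch of the gated DSSF fusion (Algorithm steps 16-31).
    ss2d is a placeholder for the SS2D selective scan; names and shapes are assumptions."""
    def __init__(self, c: int, p: int):
        super().__init__()
        self.in_proj = nn.Conv2d(c, p, kernel_size=1)        # C -> hidden dim P (shared here for brevity)
        self.dwconv = nn.Conv2d(p, p, kernel_size=3, padding=1, groups=p)
        self.gate_r = nn.Conv2d(c, p, kernel_size=1)          # produces z_R from the RGB feature
        self.gate_ir = nn.Conv2d(c, p, kernel_size=1)         # produces z_IR from the IR feature
        self.out_proj = nn.Conv2d(p, c, kernel_size=1)        # hidden dim P -> C
        self.norm = nn.GroupNorm(1, p)                        # stand-in for the Norm in steps 17/20/21

    def ss2d(self, x: torch.Tensor) -> torch.Tensor:
        return x                                              # placeholder for the four-direction selective scan

    def forward(self, f_r: torch.Tensor, f_ir: torch.Tensor):
        # Steps 16-17: map both modalities into the hidden space and scan them.
        y_r = self.norm(self.ss2d(F.silu(self.dwconv(self.in_proj(f_r)))))
        y_ir = self.norm(self.ss2d(F.silu(self.dwconv(self.in_proj(f_ir)))))
        # Steps 20-21: gating outputs z_R and z_IR.
        z_r = F.silu(self.gate_r(f_r))
        z_ir = F.silu(self.gate_ir(f_ir))
        # Steps 23-24: dual-direction gated transition of the hidden states.
        y_r_fused = y_r * F.silu(z_r) + F.silu(z_r) * y_ir
        y_ir_fused = y_r * F.silu(z_ir) + F.silu(z_ir) * y_ir
        # Steps 26-31: project back to the original space; the two residual
        # additions of the algorithm are folded into a single one for brevity.
        return f_r + self.out_proj(y_r_fused), f_ir + self.out_proj(y_ir_fused)

For example, DualGatedFusion(c=512, p=256) would fuse two (B, 512, 20, 20) feature maps at the P5 stage; the actual block additionally keeps the intermediate features for the separate residual paths in steps 26-31.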

3.2.3 Loss Function

After the FMB, the enhanced RGB and IR features (i.e., $\hat{F}_{R_i}$ and $\hat{F}_{IR_i}$ in Eq. 12) are summed to generate the fused feature $P_i$, which is fed to the neck to improve detection performance. Following [12, 13], the total loss function is constructed as:

$$\mathcal{L}=\lambda_{\text{coord}}\,\mathcal{L}_{\text{coord}}+\mathcal{L}_{\text{conf}}+\mathcal{L}_{\text{class}},\qquad(13)$$

where $\lambda_{\text{coord}}$ is a hyperparameter that adjusts the weight of the localization loss $\mathcal{L}_{\text{coord}}$, $\mathcal{L}_{\text{conf}}$ is the confidence loss, and $\mathcal{L}_{\text{class}}$ is the classification loss. More details on the individual loss terms $\mathcal{L}_{\text{coord}}$, $\mathcal{L}_{\text{conf}}$, and $\mathcal{L}_{\text{class}}$ are given in [12].
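
As a minimal sketch, assuming the three YOLO loss terms have already been computed by the detection head, Eq. 13 amounts to the following weighted sum (illustrative code, not the actual training script):

import torch

def total_loss(loss_coord: torch.Tensor,
               loss_conf: torch.Tensor,
               loss_class: torch.Tensor,
               lambda_coord: float = 7.5) -> torch.Tensor:
    # Eq. 13: weighted localization loss plus confidence and classification losses.
    # lambda_coord = 7.5 follows the value reported in the implementation details (Sec. 4.1).
    return lambda_coord * loss_coord + loss_conf + loss_class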

Table 1: Comparison results with SOTA methods on LLVIP dataset. The best and second results are highlighted in red and green, respectively.
Methods Modality Backbone mAP50 mAP
(NIPS'15) Faster R-CNN IR ResNet50 92.6 50.7
(CVPR'18) Cascade R-CNN IR ResNet50 95.0 56.8
(CVPR'23) DDQ DETR IR ResNet50 93.9 58.6
(Zenodo'22) YOLOv5-l IR YOLOv5 94.6 61.9
(2023) YOLOv8-l IR YOLOv8 95.2 62.1
Faster R-CNN [24] RGB ResNet50 88.8 47.5
Cascade R-CNN [1] RGB ResNet50 88.3 47.0
DDQ DETR [42] RGB ResNet50 86.1 46.7
YOLOv5-l [12] RGB YOLOv5 90.8 50.0
YOLOv8-l [13] RGB YOLOv8 91.9 54.0
(2016) Halfway fusion [19] IR+RGB VGG16 91.4 55.1
(WACV’21) GAFF [39] IR+RGB Resnet18 94.0 55.8
(ECCV’22) ProEn [4] IR+RGB ResNet50 93.4 51.5
(CVPR’23) CSAA [2] IR+RGB ResNet50 94.3 59.2
     (2024) RSDet [43] IR+RGB ResNet50 95.8 61.3
(IF’23) DIVFusion [31] IR+RGB YOLOv5 89.8 52.0
Fusion-Mamba(Ours) IR+RGB YOLOv5 96.8 62.8
Fusion-Mamba(Ours) IR+RGB YOLOv8 97.0 64.3

3.3 Comparison with Transformer-based Fusion

Existing Transformer-based cross-modality fusion methods [6, 26] flatten the features and concatenate them with convolution to generate intermediate fused features, which are then fused by multi-head cross-attention to produce the final fused features. Relying on spatial interaction alone, they cannot effectively reduce modality disparities, because object correlations are difficult to model directly from cross-modality features. In contrast, our FMB scans features in four directions to obtain four sets of patches, effectively preserving the local information of the features. These patches are then mapped into a hidden state space for feature fusion. This mapping-based deep fusion reduces spatial disparities through dual-direction gated attention, which further suppresses redundant features and captures complementary information across modalities. As a result, the proposed FMB reduces disparities between cross-modality features and enhances the representation consistency of the fused features.

In addition, the time complexity of the Transformer's global attention is $O(N^2)$, while Mamba's is only $O(N)$, where $N$ is the sequence length. Empirically, with the same detection architecture, replacing the Transformer-based fusion module with our Fusion-Mamba block saves 7-19 ms of inference time per image pair. More details are discussed in our experiments.
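
To make the asymptotic gap concrete, the back-of-the-envelope sketch below compares the quadratic growth of pairwise interactions in global attention with the linear cost of a selective scan; the spatial sizes are assumed P5/P4/P3 map sizes for a 640×640 input (strides 32/16/8), and absolute runtimes depend on constants and hardware, so this is a scaling illustration only.

# Scaling illustration only: global attention is O(N^2) in the sequence length N,
# while a Mamba-style selective scan is O(N).
for h, w in [(20, 20), (40, 40), (80, 80)]:    # assumed P5/P4/P3 map sizes at 640x640
    n = h * w                                   # sequence length after flattening the feature map
    print(f"N={n:5d}  attention ~ {n * n:12,d} pair interactions   scan ~ {n:6,d} steps")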

4 Experiments

4.1 Experimental Setups

Datasets. Our Fusion-Mamba method is evaluated on three widely used visible-infrared benchmark datasets: LLVIP [11], $M^3FD$ [20], and FLIR [7].

LLVIP is an aligned visible and infrared (IR) dataset collected in low-light environments for pedestrian detection, which contains 15,488 RGB-IR image pairs. Following the official split, we use 12,025 pairs for training and 3,463 pairs for testing.

$M^3FD$ contains 4,200 pairs of aligned RGB and IR images collected in various environments, including different lighting, seasons, and weather conditions. It has six categories that commonly appear in autonomous driving and road monitoring. Since there is no official dataset partition, we use the train/test splits provided by [18].

FLIR is collected in day and night scenes and has five categories: people, car, bike, dog, and other cars. Following [38], we use the FLIR-Aligned dataset with 4,129 pairs for training and 1,013 pairs for testing.

Evaluation metrics. We use the two most common evaluation metrics, mAP and mAP50. mAP50 denotes the mean AP at an IoU threshold of 0.50, and mAP denotes the mean AP averaged over IoU thresholds from 0.50 to 0.95 with a stride of 0.05 [43]. Higher values of both metrics indicate better model performance. We also report the average inference time of our method on one A800 GPU over 5 runs with an input size of 640×640.
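
For completeness, the relation between the two metrics can be sketched as follows, assuming a hypothetical helper average_precision(iou_thr) that returns the class-averaged AP at a single IoU threshold:

import numpy as np

def map50_and_map(average_precision):
    # average_precision(iou_thr) -> class-averaged AP at one IoU threshold (hypothetical helper).
    thresholds = np.arange(0.50, 0.96, 0.05)        # 0.50, 0.55, ..., 0.95
    ap_per_thr = [average_precision(t) for t in thresholds]
    map50 = ap_per_thr[0]                            # mean AP at IoU = 0.50
    map_coco = float(np.mean(ap_per_thr))            # mean AP averaged over IoU 0.50:0.05:0.95
    return map50, map_coco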

Implementation details. All experiments are implemented in the two-stream framework [6] on a single A800 GPU. The backbone, neck, and head structures of our Fusion-Mamba are by default the same as those in YOLOv5-l or YOLOv8-l. During training, we set the batch size to 4 and use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.001. The input image size is 640×640 for all three datasets, and training runs for 150 epochs with an initial learning rate of 0.01. The numbers of SSCS and DSSF modules in the FMB are set to 1 and 8 by default, respectively. $\lambda_{\text{coord}}$ is set to 7.5. Other training hyper-parameters are the same as in YOLOv8.
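
The settings above can be summarized in the configuration sketch below; the dictionary keys are illustrative names (loosely following Ultralytics-style conventions), while the values are taken from the reported setup.

# Hyper-parameters reported in Sec. 4.1; the key names are illustrative only.
train_config = {
    "img_size": 640,
    "batch_size": 4,
    "epochs": 150,
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.9,
    "weight_decay": 0.001,
    "lambda_coord": 7.5,    # weight of the localization loss in Eq. 13
    "num_sscs": 1,          # SSCS modules per FMB
    "num_dssf": 8,          # DSSF modules per FMB
}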

Table 2: Comparison results with eight SOTA methods on the $M^3FD$ dataset. The best results are highlighted in bold.
Methods Backbone mAP50 mAP People Bus Car Motorcycle Lamp Truck
(2020) DIDFuse [44] YOLOv5 78.9 52.6 79.6 79.6 92.5 68.7 84.7 68.7
(IJCV’21) SDNet [37] YOLOv5 79.0 52.9 79.4 81.4 92.3 67.4 84.1 69.3
(CVPR’22) RFNet [36] YOLOv5 79.4 53.2 79.4 78.2 91.1 72.8 85.0 69.0
(CVPR’22) TarDAL [20] YOLOv5 80.5 54.1 81.5 81.3 94.8 69.3 87.1 68.7
(MM’22) DeFusion [28] YOLOv5 80.8 53.8 80.8 83.0 92.5 69.4 87.8 71.4
(CVPR’23) CDDFuse [45] YOLOv5 81.1 54.3 81.6 82.6 92.5 71.6 86.9 71.5
(MM’23) IGNet [17] YOLOv5 81.5 54.5 81.6 82.4 92.8 73.0 86.9 72.1
(JAS’22) SuperFusion [30] YOLOv7 83.5 56.0 83.7 93.2 91.0 77.4 70.0 85.8
Fusion-Mamba (ours) YOLOv5 85.0 57.5 80.3 92.8 91.9 73.0 84.8 87.1
YOLOv8l-IR YOLOv8 79.5 53.1 82.9 90.9 90.0 64.6 63.0 85.9
YOLOv8l-RGB [13] YOLOv8 80.9 52.5 70.6 92.9 91.2 69.6 75.3 86.0
Fusion-Mamba (ours) YOLOv8 88.0 61.9 84.3 94.2 92.9 80.5 87.5 88.8

4.2 Comparison with SOTA Methods

To verify the effectiveness of our Fusion-Mamba method, we employ two backbones based on YOLOv5 and YOLOv8 to make a fair comparison with SOTA methods.
LLVIP Dataset. The results of different methods on LLVIP are summarized in Tab. 1. We compare the proposed Fusion-Mamba method, with two different backbones, against 6 SOTA multispectral object detection methods and 5 single-modality detection methods. For single-modality detection, using only IR images performs better than using only RGB images, owing to the low-light conditions. After fusing RGB and IR features, the mAP performance improves for ResNet-based backbones, outperforming single IR-modality detection; for example, RSDet with a ResNet50 backbone outperforms Cascade R-CNN using only the IR modality by 4.5% mAP. However, fusion is not always effective on the YOLOv5 backbone: a plain YOLOv5 detector with IR input achieves 61.9% mAP, significantly outperforming the fusion method DIVFusion by 9.9% mAP. With the same YOLOv5 backbone, our Fusion-Mamba method gains 0.9% mAP over the IR-only YOLOv5 detector and outperforms the previous best fusion method, RSDet, by 1.5% mAP. This is because our SSCS and DSSF effectively reduce modality disparities and improve the representation consistency of the fused features. Our method is also effective with the YOLOv8 backbone, achieving state-of-the-art performance with 97.0% mAP50 and 64.3% mAP.

$M^3FD$ Dataset. We compare our method with 7 SOTA detectors based on YOLOv5 and 1 SOTA detector based on YOLOv7. As shown in Tab. 2, our Fusion-Mamba achieves the best overall mAP50 and mAP among SOTA methods with the same YOLOv5 backbone, and our method with the YOLOv8 backbone sets new SOTA results on the People, Bus, Motorcycle, and Truck categories while further increasing mAP50 and mAP by 3% and 4.4%, respectively. In addition, our method with the YOLOv5 backbone also outperforms SuperFusion based on YOLOv7 by 1.5% in both mAP50 and mAP, despite YOLOv5's lower feature representation ability compared with YOLOv7. This is due to the effectiveness of our FMB, which better exploits the inherent complementarity of cross-modality features.

Table 3: Comparison results with SOTA methods on FLIR-Aligned Dataset. The best results are highlighted in bold.

Methods Backbone mAP50 mAP Param. Time(ms)
(2021) CFT [6] YOLOv5 78.7 40.2 206.0M 68
(PRL'24) CrossFormer [14] YOLOv5 79.3 42.1 340.0M 80
(2024) RSDet [43] ResNet50 81.1 41.4 - -
Fusion-Mamba (ours) YOLOv5 84.3 44.4 244.6M 61
YOLOv8l-IR YOLOv8 72.9 38.3 76.7M 22
YOLOv8l-RGB YOLOv8 66.3 28.2 76.7M 22
Fusion-Mamba (ours) YOLOv8 84.9 47.0 287.6M 78

Figure 5: Heatmap visualization of various cross-modality object detection methods on the LLVIP, $M^3FD$, and FLIR-Aligned datasets.

FLIR-Aligned Dataset. As shown in Tab. 3, Fusion-Mamba also performs best on the FLIR-Aligned dataset. Compared to CrossFormer based on the two-stream YOLOv5 backbone, our methods based on YOLOv8 and YOLOv5 surpass it by 5.6% and 5.0% mAP50, and by 4.9% and 2.3% mAP, respectively. We also outperform RSDet by 3.8% mAP50 and 5.6% mAP. In terms of speed, our method with YOLOv5 is the fastest, saving 7 ms and 19 ms per image pair compared to the Transformer-based CFT and CrossFormer methods. Regarding parameters, our method based on YOLOv5 saves about 100M compared to CrossFormer. Although our YOLOv8-based method has about 40M more parameters than the YOLOv5-based one, its mAP is higher by a significant 2.6%. These results indicate that our hidden-space modeling better integrates features across modalities, suppressing modality disparities to enhance the representation ability of fused features with the best trade-off between performance and computational cost.

Visualization of heatmaps. To visually demonstrate the strong performance of our model, we randomly select one image pair from each of the three experimental datasets, visualize the $P_5$ heatmaps, and compare our method with other fusion methods. As shown in Fig. 5, compared with other methods, our model focuses more on the targets rather than dispersing attention to unrelated regions. More examples are presented in the supplementary materials. We also visualize object detection results to evaluate the effectiveness of our method in the supplementary materials.

4.3 Ablation Study

We use the FLIR-Aligned dataset for the ablation study to separately verify the effectiveness of the SSCS and DSSF modules, and to further explore the influence of the number of DSSF modules and the position of the FMB. In particular, we also evaluate the effect of the dual attention in the DSSF module. All of these experiments are conducted with the YOLOv8 backbone.

Effects of SSCS and DSSF modules. The results of removing SSCS and DSSF from the FMB are summarized in Tab. 4. After removing the SSCS module (second row in Tab. 4), detector performance decreases by 2% mAP50 and 1.1% mAP. Without the initial exchange of the two modal features and shallow mapping fusion, the feature disparity is not well reduced during the subsequent deep fusion. Meanwhile, without DSSF (third row in Tab. 4), shallow fusion interaction alone cannot effectively suppress redundant features and activate effective ones during feature fusion, leading to drops of 2.5% mAP50 and 2.4% mAP. When both SSCS and DSSF are removed and the fused features are obtained by directly adding the two modality features (fourth row in Tab. 4), performance is significantly reduced by 4.8% mAP50 and 7.6% mAP. These results demonstrate that both components of the FMB are effective for cross-modality object detection.

Table 4: Effects of SSCS and DSSF on FLIR-Aligned Dataset.
Methods mAP50 mAP75 mAP Param. Time(ms)
Fusion-Mamba 84.9 45.9 47.0 287.6M 78
Removing SSCS 82.9 42.3 45.9 266.8M 69
Removing DSSF 82.4 42.0 44.6 138.0M 57
Removing SSCS & DSSF 80.1 36.3 39.4 117.2M 48
Table 5: Effect of FMB position on FLIR-Aligned Dataset.
Positions mAP50 mAP75 mAP Param. Time(ms)
{P2, P3, P5} 83.9 44.4 46.7 256.8M 72
{P2, P4, P5} 76.1 39.8 42.3 281.1M 75
{P3, P4, P5} 84.9 45.9 47.0 287.6M 78

Effect of the FMB position. Following [6, 14], we also use three FMBs for feature fusion. Here, we further explore the impact of the FMB position, i.e., at which stages the FMB should be inserted. We select three groups of multi-level features for the ablation study: $\{P_2,P_3,P_5\}$, $\{P_2,P_4,P_5\}$, and $\{P_3,P_4,P_5\}$, where $P_i$ is the fused feature at the $i$-th stage using the FMB. As presented in Tab. 5, the position $\{P_3,P_4,P_5\}$ achieves the best trade-off between performance and computational complexity, so we select it by default in our experiments.

Effect of the number of DSSF modules. We have validated the effectiveness of DSSF in Tab. 4. Here, we further evaluate the effect of the number of DSSF modules, as summarized in Tab. 6. We test four settings (i.e., 2, 4, 8, 16), keeping all other model settings consistent with the experiments above. Using 8 DSSF modules achieves the best performance. Performance saturates at 8 modules; further increasing the number causes drift of the complementary features, which degrades fusion performance.

Table 6: Effect of the number of DSSF modules on FLIR-Aligned Dataset.
Number mAP50 mAP75 mAP Param. Time(ms)
2 82.3 42.8 45.5 175.3M 50
4 82.6 43.6 45.9 212.7M 65
8 84.9 45.9 47.0 287.6M 78
16 83.4 44.2 46.3 437.2M 148

Effect of the dual attention in the DSSF module. To further explore the effectiveness of our gating mechanism with and without the dual attention of the DSSF module, we separately remove the IR attention (i.e., $z_{R_i}\cdot y_{IR_i}$ in Eq. 9) in the RGB branch, the RGB attention (i.e., $z_{IR_i}\cdot y_{R_i}$ in Eq. 10) in the IR branch, and both attention terms. The results are shown in Tab. 7. After removing the IR attention or the RGB attention, mAP50 decreases by 1.6% or 1.1%, respectively, owing to the reduced attention interaction between the two features. When both attention terms are removed, the DSSF module becomes a stack of VSS blocks, and mAP50 degrades by 2%. Note that the IR and RGB attention branches share weights with the other branches, so keeping the dual attention only adds activation functions and feature-addition operations compared with removing it. Thus, the dual attention has no significant impact on model parameters and runtime, while it significantly improves detection performance.
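
To make the ablated variants explicit, the gated transition of Algorithm steps 23-24 can be written with optional cross-modal terms, as in the hypothetical sketch below; the two flags correspond to the second and third rows of Tab. 7, and this refactoring is for illustration only.

import torch.nn.functional as F

def gated_transition(y_r, y_ir, z_r, z_ir, keep_ir_attn=True, keep_rgb_attn=True):
    # Full model: y'_R = y_R*SiLU(z_R) + SiLU(z_R)*y_IR and y'_IR = y_R*SiLU(z_IR) + SiLU(z_IR)*y_IR.
    y_r_fused = y_r * F.silu(z_r)
    if keep_ir_attn:                 # cross-modal term in the RGB branch (Eq. 9)
        y_r_fused = y_r_fused + F.silu(z_r) * y_ir
    y_ir_fused = F.silu(z_ir) * y_ir
    if keep_rgb_attn:                # cross-modal term in the IR branch (Eq. 10)
        y_ir_fused = y_ir_fused + y_r * F.silu(z_ir)
    # With both flags set to False, the module degenerates to a stack of VSS-style self-gated scans.
    return y_r_fused, y_ir_fused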

Table 7: Ablation study of dual attention in the RGB and IR branches of DSSF module on FLIR-Aligned Dataset.
Methods mAP50 mAP75 mAP Param. Time(ms)
Fusion-Mamba 84.9 45.9 47.0 287.6M 78
Removing $z_{R_i}\cdot y_{IR_i}$ in Eq. 9 83.3 42.8 45.3 287.6M 77
Removing $z_{IR_i}\cdot y_{R_i}$ in Eq. 10 83.8 43.9 46.2 287.6M 77
Removing both dual attention 82.9 41.7 44.8 287.6M 76

5 Conclusion

In this paper, we propose a novel Fusion-Mamba method with well-designed SSCS and DSSF modules for multi-modal feature fusion. In particular, SSCS exchanges infrared and visible channel features for shallow feature fusion. DSSF is then designed for deeper multi-modal feature interaction in a hidden state space based on Mamba, where a gated attention suppresses redundant features to enhance the effectiveness of feature fusion. Extensive experiments on three public RGB-IR datasets demonstrate that our method achieves new state-of-the-art performance with higher inference efficiency than Transformer-based fusion. Our work confirms the potential of Mamba for cross-modal fusion, and we believe it can inspire more research on the application of Mamba to cross-modal tasks.

References

  • Cai and Vasconcelos [2018] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6154–6162. Computer Vision Foundation / IEEE Computer Society, 2018.
  • Cao et al. [2023] Yue Cao, Junchi Bin, Jozsef Hamari, Erik Blasch, and Zheng Liu. Multimodal object detection by channel switching and spatial attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Workshops, Vancouver, BC, Canada, June 17-24, 2023, pages 403–411. IEEE, 2023.
  • Chen et al. [2024] Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang. Video mamba suite: State space model as a versatile alternative for video understanding. CoRR, abs/2403.09626, 2024.
  • Chen et al. [2022] Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, and Shu Kong. Multimodal object detection via probabilistic ensembling. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX, pages 139–158. Springer, 2022.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Fang et al. [2021] Qingyun Fang, Dapeng Han, and Zhaokui Wang. Cross-modality fusion transformer for multispectral object detection. CoRR, abs/2111.00273, 2021.
  • FLIR [2024] TELEDYNE FLIR. Free teledyne flir thermal dataset for algorithm training. Online, 2024.
  • Gu and Dao [2023] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023.
  • Guan et al. [2019] Dayan Guan, Yanpeng Cao, Jiangxin Yang, Yanlong Cao, and Michael Ying Yang. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion, 50:148–157, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.
  • Jia et al. [2021] Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. LLVIP: A visible-infrared paired dataset for low-light vision. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021, pages 3489–3497. IEEE, 2021.
  • Jocher et al. [2022] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, Yonghye Kwon, Kalen Michael, Jiacong Fang, Colin Wong, Zeng Yifu, Diego Montes, et al. ultralytics/yolov5: v6. 2-yolov5 classification models, apple m1, reproducibility, clearml and deci. ai integrations. Zenodo, 2022.
  • Jocher et al. [2023] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLO, 2023.
  • Lee et al. [2024] Seungik Lee, Jaehyeong Park, and Jinsun Park. Crossformer: Cross-guided attention for multi-modal object detection. Pattern Recognition Letters, 2024.
  • Li et al. [2023a] Boyang Li, Chao Xiao, Longguang Wang, Yingqian Wang, Zaiping Lin, Miao Li, Wei An, and Yulan Guo. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process., 32:1745–1758, 2023a.
  • Li et al. [2019] Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit., 85:161–171, 2019.
  • Li et al. [2023b] Jiawei Li, Jiansheng Chen, Jinyuan Liu, and Huimin Ma. Learning a graph neural network with cross modality interaction for image fusion. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 4471–4479. ACM, 2023b.
  • Liang et al. [2023] Mingjian Liang, Junjie Hu, Chenyu Bao, Hua Feng, Fuqin Deng, and Tin Lun Lam. Explicit attention-enhanced fusion for rgb-thermal perception tasks. IEEE Robotics Autom. Lett., 8(7):4060–4067, 2023.
  • Liu et al. [2016] Jingjing Liu, Shaoting Zhang, Shu Wang, and Dimitris N. Metaxas. Multispectral deep neural networks for pedestrian detection. 2016.
  • Liu et al. [2022] Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5811, 2022.
  • Liu et al. [2024] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. CoRR, abs/2401.10166, 2024.
  • Ma et al. [2024] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. CoRR, abs/2401.04722, 2024.
  • Redmon et al. [2016] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788. IEEE Computer Society, 2016.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. pages 91–99, 2015.
  • Ruan and Xiang [2024] Jiacheng Ruan and Suncheng Xiang. Vm-unet: Vision mamba unet for medical image segmentation. CoRR, abs/2402.02491, 2024.
  • Shen et al. [2024] Jifeng Shen, Yifei Chen, Yue Liu, Xin Zuo, Heng Fan, and Wankou Yang. Icafusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit., 145:109913, 2024.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.
  • Sun et al. [2022] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Detfusion: A detection-driven infrared and visible image fusion network. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, pages 4003–4011. ACM, 2022.
  • Szegedy et al. [2014] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2014.
  • Tang et al. [2022] Linfeng Tang, Yuxin Deng, Yong Ma, Jun Huang, and Jiayi Ma. Superfusion: A versatile image registration and fusion network with semantic awareness. IEEE CAA J. Autom. Sinica, 9(12):2121–2137, 2022.
  • Tang et al. [2023] Linfeng Tang, Xinyu Xiang, Hao Zhang, Meiqi Gong, and Jiayi Ma. Divfusion: Darkness-free infrared and visible image fusion. Inf. Fusion, 91:477–493, 2023.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. pages 5998–6008, 2017.
  • Wang et al. [2022] Qingwang Wang, Yongke Chi, Tao Shen, Jian Song, Zifeng Zhang, and Yan Zhu. Improving rgb-infrared object detection by reducing cross-modality redundancy. Remote. Sens., 14(9):2020, 2022.
  • Wang et al. [2024] Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, and Lei Li. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. CoRR, abs/2402.05079, 2024.
  • Xie et al. [2023] Yumin Xie, Langwen Zhang, Xiaoyuan Yu, and Wei Xie. YOLO-MS: multispectral object detection via feature interaction and self-attention guided fusion. IEEE Trans. Cogn. Dev. Syst., 15(4):2132–2143, 2023.
  • Xu et al. [2022] Han Xu, Jiayi Ma, Jiteng Yuan, Zhuliang Le, and Wei Liu. Rfnet: Unsupervised network for mutually reinforcing multi-modal image registration and fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 19647–19656. IEEE, 2022.
  • Zhang and Ma [2021] Hao Zhang and Jiayi Ma. Sdnet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis., 129(10):2761–2785, 2021.
  • Zhang et al. [2020] Heng Zhang, Élisa Fromont, Sébastien Lefèvre, and Bruno Avignon. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In IEEE International Conference on Image Processing, ICIP 2020, Abu Dhabi, United Arab Emirates, October 25-28, 2020, pages 276–280. IEEE, 2020.
  • Zhang et al. [2021] Heng Zhang, Élisa Fromont, Sébastien Lefèvre, and Bruno Avignon. Guided attentive feature fusion for multispectral pedestrian detection. In IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, January 3-8, 2021, pages 72–80. IEEE, 2021.
  • Zhang et al. [2022] Mingjin Zhang, Rui Zhang, Yuxiang Yang, Haichen Bai, Jing Zhang, and Jie Guo. Isnet: Shape matters for infrared small target detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 867–876. IEEE, 2022.
  • Zhang et al. [2024] Mingya Zhang, Yue Yu, Limei Gu, Tingsheng Lin, and Xianping Tao. Vm-unet-v2 rethinking vision mamba unet for medical image segmentation. arXiv preprint arXiv:2403.09157, 2024.
  • Zhang et al. [2023] Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, and Kai Chen. Dense distinct query for end-to-end object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 7329–7338. IEEE, 2023.
  • Zhao et al. [2024] Tianyi Zhao, Maoxun Yuan, and Xingxing Wei. Removal and selection: Improving rgb-infrared object detection via coarse-to-fine fusion. CoRR, abs/2401.10731, 2024.
  • Zhao et al. [2020] Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang, and Pengfei Li. Didfuse: Deep image decomposition for infrared and visible image fusion. pages 970–976, 2020.
  • Zhao et al. [2023] Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 5906–5916. IEEE, 2023.
  • Zheng et al. [2019] Yang Zheng, Izzat H. Izzat, and Shahrzad Ziaee. GFD-SSD: gated fusion double SSD for multispectral pedestrian detection. CoRR, abs/1903.06999, 2019.
  • Zhu et al. [2024] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. CoRR, abs/2401.09417, 2024.

Supplementary Material

6 More Heatmap Visualization Results

We randomly select images from the LLVIP, $M^3FD$, and FLIR-Aligned datasets and visualize the $P_5$ heatmaps of different fusion methods. As shown in Fig. 6, Fig. 7, and Fig. 8, the visualization examples demonstrate that our Fusion-Mamba method, which fuses cross-modality features in the hidden state space, focuses more on the targets than CNN- and Transformer-based fusion methods. This also indicates that our method can effectively model the correlations between targets of different modalities.

7 Visualization of Object Detection

We also randomly select images from the LLVIP, $M^3FD$, and FLIR-Aligned datasets and show the bounding-box detection results of different fusion methods. As shown in Fig. 9, Fig. 10, and Fig. 11, compared with several SOTA methods, our method significantly reduces missed detections and thus improves mAP.

For example, in Fig. 9, under insufficient lighting and severe occlusion, our method detects more target objects than other methods, as its hidden-space interaction effectively integrates information from the IR modality. As shown in Fig. 10, in harsh weather and when target objects are small and distant, our model detects a more diverse set of targets than other methods, since our gated attention effectively combines the two modalities and suppresses redundant features for better cross-modality object detection. In Fig. 11, in target-dense areas, our model better distinguishes and detects the targets, because our shallow-to-deep modal interaction better preserves detailed information, whereas other SOTA methods tend to miss densely packed objects.

Therefore, our Fusion-Mamba method achieves the best detection performance and improves perceptibility and robustness in complex scenes with challenges such as low lighting, adverse weather, heavy occlusion, dense targets, and small targets.

Figure 6: Comparison of heatmap visualization on LLVIP dataset.
Figure 7: Comparison of heatmap visualization on the $M^3FD$ dataset.
Figure 8: Comparison of heatmap visualization on FLIR-Aligned dataset.
Figure 9: Detection results of three methods in LLVIP dataset.
Figure 10: Detection results of three methods in the $M^3FD$ dataset.
Figure 11: Detection results of three methods in FLIR-Aligned dataset.