Fusion-Mamba for Cross-modality Object Detection
Abstract
Fusing complementary information from different modalities effectively improves object detection performance, making detectors more useful and robust for a wider range of applications.
Existing fusion strategies combine different types of images or merge different backbone features through elaborately designed neural network modules.
However, these methods neglect that modality disparities affect cross-modality fusion performance, as images captured with different camera focal lengths, placements, and angles are hard to fuse.
In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features.
FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space.
Through extensive experiments on public datasets, our proposed approach outperforms state-of-the-art methods by 5.9% $m$AP on $M^{3}FD$ and 4.9% on the FLIR-Aligned dataset, demonstrating superior object detection performance.
To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
1 Introduction
With the swift development of multi-modality sensor technology, multi-modality images have been used in many different areas.
Among them, paired infrared (IR) and visible images have been utilized widely, since these two image modalities provide complementary information. For example,
infrared images show a clear thermal structure of objects without being affected by luminance, while they lack the texture details of the target.
In contrast, visible images capture rich object texture and scene information, but lighting conditions severely affect image quality.
Thus, many studies focus on infrared and visible feature fusion to improve the perceptibility and robustness for downstream high-level image and scene understanding tasks, e.g., object detection, and image segmentation.
Existing multi-spectral fusion approaches generally employ deep convolutional neural networks (CNNs) [27, 29, 10] or Transformers [32, 5] to fuse the cross-modality features.
A halfway fusion is introduced to integrate two-branch middle-level features from RGB and IR images for multispectral pedestrian detection [19].
GFD-SSD [46] uses Gated Fusion Units to build a two-stream middle fusion detector, which achieves higher performance than single-modality detection.
In this light, YOLO-MS introduces two CNN-based fusion modules to fuse the adjacent branches of the YOLOv5 backbone for real-time object detection [35].
Albeit with great success for cross-modality fusion based on CNNs with local receptive fields,
Transformer-based methods [32, 5] have been proposed to effectively learn long-range dependencies for cross-modality feature fusion.
CFT [6] is the first study to explore a Transformer for middle-level feature fusion, which can improve YOLOv5 performance.
ICAFusion [26] with a double cross-attention Transformer can successfully model global features and capture complementary information among modalities. However, these cross-modality fusion methods fail to consider modality disparities, which adversely affects cross-modality feature fusion.
As shown in Fig. 1(e)(f)(g), the heatmaps of YOLO-MS, ICAFusion and CFT fusion features show that they cannot effectively fuse features from different modalities or model the correlations between cross-modality objects, as they have clearly different modality representations.
This prompts us to rethink: can we build an effective cross-modality interactive space that reduces modality disparities toward a consistent representation, and thus exploits cross-modality relationships for feature enhancement?
Moreover, Transformer-based cross-modality fusion is compute-intensive, with quadratic time and space complexity.
In this paper, we propose a Fusion-Mamba method that fuses features in a hidden state space, which may open up a new paradigm for cross-modality feature fusion. We are inspired by Mamba [8, 21, 47], whose linear complexity allows building a hidden state space, which we further improve with a gating mechanism to enable deeper and more complex fusion. The core of our method is the Fusion-Mamba block (FMB), as illustrated in Fig. 2. In FMB, we design a State Space Channel Swapping (SSCS) module for shallow feature fusion to improve the interaction ability of cross-modality features, and a Dual State Space Fusion (DSSF) module to build a hidden state space for cross-modality feature association and complementarity. These two modules help reduce modality disparities during fusion, as shown in Fig. 1(h): the heatmap shows that our method fuses features more effectively and makes the detector focus more on the target. This work makes the following contributions:
1) The proposed Fusion-Mamba method explores the potential of Mamba for cross-modal fusion, which enhances the representation consistency of fused features. We build a hidden state space for cross-modality interaction to reduce disparities between cross-modality features based on an improved Mamba by gating mechanisms.
2) We design a Fusion-Mamba block with two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space.
3) Extensive experiments on three public RGB-IR object detection datasets demonstrate that our method achieves state-of-the-art performance, offering a new baseline for cross-modality object detection.
2 Related Works
Multi-modality Object Detection. With the rapid development of single-modality detectors such as the YOLO series [23] and Transformer-based detectors [carion2020end, liu2021swin, guo2022cmt], multi-modal object detectors have emerged to make good use of images from different modalities. So far, research on multi-modality object detection has focused on two main directions: pixel-level fusion and feature-level fusion. Pixel-level fusion merges multi-modal input images, and the fused image is fed into the detector. These methods focus on reconstructing fused images from multi-modal input image information [li2018densefuse, ma2019fusiongan, creswell2018generative]. Feature-level fusion joins the output of a detector at a certain stage, such as the early and later features extracted by the backbone (early and middle fusion [33, 2]) or the detection output (late fusion [16, 4]). Feature-level fusion can integrate the fusion operation into the detection network as a unified end-to-end CNN or Transformer framework [2, 33, 35, 6, 26]. These fusion methods can effectively improve over single-modality object detection performance. However, they are still limited in modeling modality disparity and fusion complexity.
Mamba. Since Mamba [8] was proposed for linear-time sequence modeling in NLP, it has been rapidly extended to various computer vision tasks. VMamba [21] introduces a four-way scanning algorithm suited to the characteristics of images and constructs a Mamba-based vision backbone with better performance than Swin Transformer in object detection, segmentation, and tracking. VM-UNet [25] performs well on medical image segmentation by combining the UNet framework with Mamba blocks. Subsequently, many Mamba-based deep networks [22, 34, 41] have been proposed for accurate medical image segmentation. Video Mamba [3] expands the original 2D scan to different bidirectional 3D scans and designs a Mamba framework for video understanding.
Different from previous methods, our work is the first to exploit Mamba for multi-modality feature fusion. We introduce a carefully designed Mamba-based structure to integrate the cross-modality features in a hidden state space.
3 Method
3.1 Preliminaries
State Space Models. State Space Models (SSMs) are frequently used to represent linear time-invariant systems, which process a one-dimensional input sequence $x(t)\in\mathcal{R}$ by passing it through intermediate implicit states $h(t)\in\mathcal{R}^{N}$ to produce an output $y(t)\in\mathcal{R}$. Mathematically, SSMs are often formulated as linear ordinary differential equations (ODEs):
$$\begin{aligned} h^{\prime}(t)&=Ah(t)+Bx(t),\\ y(t)&=Ch(t)+Dx(t), \end{aligned} \tag{1}$$
where the system’s behavior is defined by a set of parameters, including the state transition matrix $A\in\mathcal{R}^{N\times N}$, the projection parameters $B,C\in\mathcal{R}^{N\times 1}$, and the skip connection $D\in\mathcal{R}$. $Dx(t)$ can be easily removed by setting $D=0$ for exposition.
Discretization. The continuous-time nature of SSMs in Eq. 1 poses significant challenges for deep learning scenarios. To address this, the ODEs must be discretized into discrete-time functions, which aligns the model with the sampling rate of the underlying signals in the input data and facilitates efficient computation [15]. Considering the input $x_{k}\in\mathcal{R}^{L\times D}$, a sampled vector within a signal flow of length $L$ following [40], the introduction of a timescale parameter $\Delta$ allows the transition from continuous parameters $A$ and $B$ to their discrete counterparts $\overline{A}$ and $\overline{B}$, adhering to the zeroth-order hold (ZOH) principle. Consequently, Eq. 1 is discretized as follows:
$$\begin{aligned} h_{k}&=\overline{A}h_{k-1}+\overline{B}x_{k},\\ y_{k}&=\overline{C}h_{k}+Dx_{k},\\ \overline{A}&=e^{\Delta A},\\ \overline{B}&=(\Delta A)^{-1}(e^{\Delta A}-I)\Delta B,\\ \overline{C}&=C, \end{aligned} \tag{2}$$
where $B,C\in\mathcal{R}^{D}$ and $I$ is an identity matrix. After discretization, SSMs can be computed by a global convolution with a structured convolutional kernel $\bar{K}\in\mathcal{R}^{L}$:
$$y=x*\bar{K},\quad\bar{K}=\big(C\bar{B},\,C\bar{A}\bar{B},\,\cdots,\,C\bar{A}^{L-1}\bar{B}\big). \tag{3}$$
Based on Eq. 2 and Eq. 3, Mamba [8] designs a simple selection mechanism to parameterize the SSM parameters of $\Delta$, $A$, $B$, and $C$ depending on the input $x$, which selectively propagates or forgets information along the sequence length dimension for 1D language sequence modeling.
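The equivalence between the recurrent form (Eq. 2) and the convolutional form (Eq. 3) can be checked numerically. Below is a minimal NumPy sketch; the toy sizes, the diagonal stable $A$ (chosen so that $e^{\Delta A}$ is easy to compute), and $D=0$ are all illustrative assumptions, not details from the paper:

```python
import numpy as np

# Toy sizes (illustrative): N hidden states, sequence length L.
rng = np.random.default_rng(0)
N, L = 4, 8
A = -np.diag(rng.uniform(0.5, 1.5, size=N))  # stable, diagonal for an easy matrix exponential
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)
delta = 0.1                                  # timescale parameter

# Zeroth-order hold (ZOH) discretization, as in Eq. 2
A_bar = np.diag(np.exp(delta * np.diag(A)))                        # e^{Delta A}
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)

# Recurrent form: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k (D = 0)
h = np.zeros((N, 1))
y_rec = []
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec.append((C @ h).item())

# Convolutional form (Eq. 3): y = x * K_bar, K_bar = (CB, CAB, ..., CA^{L-1}B)
K_bar = [(C @ np.linalg.matrix_power(A_bar, j) @ B_bar).item() for j in range(L)]
y_conv = [sum(K_bar[j] * x[k - j] for j in range(k + 1)) for k in range(L)]

print(np.allclose(y_rec, y_conv))  # the two forms agree
```

The recurrent form runs in $O(L)$ sequential steps, while the convolutional form exposes the same computation as a single length-$L$ convolution, which is what makes parallel training practical.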
2D-Selective-Scan Mechanism. The incompatibility between 2D visual data and 1D language sequences renders the direct application of Mamba to vision tasks inappropriate. For example, while 2D spatial information plays a crucial role in vision tasks, it plays only a secondary role in 1D sequence modeling. This discrepancy leads to limited receptive fields that fail to capture potential correlations with unexplored patches. The 2D selective scan (SS2D) mechanism is introduced in [21] to address this challenge. An overview of SS2D is depicted in Fig. 3. SS2D first scan-expands image patches along four distinct directions to generate four independent sequences. This quad-directional scanning guarantees that every element within the feature map incorporates information from all other positions across the various directions, establishing a comprehensive global receptive field while keeping computational complexity linear. Subsequently, each feature sequence is processed by the selective scan state space sequential model (S6) [8]. Finally, the feature sequences are aggregated to reconstruct the 2D feature map. SS2D serves as the core element of the Visual State Space (VSS) block, which is illustrated in Fig. 2 and will be used to build a hidden state space for cross-modality feature fusion.
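The scan-expand and scan-merge bookkeeping of SS2D can be sketched in NumPy. The per-sequence S6 processing is replaced by the identity here (an assumption for brevity), so the sketch only illustrates the four traversal orders and how each is inverted before aggregation:

```python
import numpy as np

H, W = 4, 4
x = np.arange(H * W).reshape(H, W)   # toy single-channel 2-D feature map

# Scan-expand: flatten the map into four 1-D sequences, one per direction
seqs = [
    x.reshape(-1),          # row-major, forward
    x.reshape(-1)[::-1],    # row-major, backward
    x.T.reshape(-1),        # column-major, forward
    x.T.reshape(-1)[::-1],  # column-major, backward
]

# Each sequence would be processed by an independent S6 model here;
# the identity is used as a stand-in.
outs = seqs

# Scan-merge: undo each traversal order and sum, so every position
# aggregates information propagated along all four directions.
merged = (outs[0].reshape(H, W)
          + outs[1][::-1].reshape(H, W)
          + outs[2].reshape(W, H).T
          + outs[3][::-1].reshape(W, H).T)
```

With the identity stand-in, each position simply receives four copies of itself; with real S6 blocks, each directional sequence instead carries recurrent context from all preceding positions along its traversal.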
3.2 Fusion-Mamba
3.2.1 Architecture
The architecture of our model is depicted in Fig. 2. Its detection backbone comprises a dual-stream feature extraction network and three Fusion-Mamba blocks (FMB), while the detection network contains the neck and head for cross-modality object detection. The feature extraction network extracts local features from RGB and IR images, denoted by $F_{R_{i}}$ and $F_{IR_{i}}$, respectively. We then feed these two features into FMB, which associates cross-modal features in a hidden state space, reducing disparities between cross-modal features and enhancing the representation consistency of fused features. Specifically, the two local features first pass through the State Space Channel Swapping ($SSCS$) module for shallow feature fusion, yielding interactive features $\tilde{F}_{R_{i}}$ and $\tilde{F}_{IR_{i}}$. Then, we feed these interactive features into the Dual State Space Fusion ($DSSF$) module for deep feature fusion in the hidden state space, which generates the corresponding complementary features $\overline{F}_{R_{i}}$ and $\overline{F}_{IR_{i}}$. The local features are enhanced to $\hat{F}_{R_{i}}$ and $\hat{F}_{IR_{i}}$ by adding the original features $F_{R_{i}}$ and $F_{IR_{i}}$ to the complementary features $\overline{F}_{R_{i}}$ and $\overline{F}_{IR_{i}}$, respectively. Subsequently, the enhanced features $\hat{F}_{R_{i}}$ and $\hat{F}_{IR_{i}}$ are directly added to generate the fused feature $P_{i}$. In this paper, FMB is only added to the last three stages to generate fused features $P_{3},P_{4}$ and $P_{5}$ (if not specified), which are the inputs to the neck and head of YOLOv8 to generate the final detection results (as shown in Fig. 4).
3.2.2 Key components
Given the input RGB image $I_{R}$ and infrared image $I_{IR}$, we feed them into a series of convolutional blocks to extract their local features:
$$F_{R_{i}}=\phi_{i}(\cdots\phi_{2}(\phi_{1}(I_{R}))),\quad F_{IR_{i}}=\varphi_{i}(\cdots\varphi_{2}(\varphi_{1}(I_{IR}))), \tag{4}$$
where $\phi_{i}$ and $\varphi_{i}$ represent the convolutional blocks of RGB and IR branches at the $i$-th stage, respectively.
To implement cross-modality feature fusion, existing methods [9, 6, 26] primarily emphasize the integration of spatial features, yet they inadequately consider the feature disparity between modalities. Consequently, the fused model fails to effectively model the correlations between targets of different modalities, diminishing the model’s representation capacity. Motivated by Mamba [8] with strong sequence modeling ability on the state space, we design a Fusion-Mamba block (FMB) to construct a hidden state space for cross-modal feature interaction and association. The effectiveness of FMB lies in two key modules, the State Space Channel Swapping (SSCS) module and the Dual State Space Fusion (DSSF) module, which can reduce disparities between cross-modality features to enhance the representation consistency of fused features. Alg. 1 provides the computation process of SSCS and DSSF modules.
SSCS module. This module aims to enhance cross-modality feature interaction for shallow feature fusion through the channel swapping operation and VSS block. Cross-modal feature correlation is constructed by integrating information from distinct channels, which enriches the diversity of channel features to improve fusion performance. First, we employ the channel swapping operation to generate new local features of RGB $T_{R_{i}}$ and IR $T_{IR_{i}}$, which can be formulated as:
$$T_{R_{i}}=CS(F_{R_{i}},F_{IR_{i}}),\quad T_{IR_{i}}=CS(F_{IR_{i}},F_{R_{i}}), \tag{5}$$
where $CS(\cdot,\cdot)$ is the channel swapping operation, which is easy to implement by channel splitting and concatenation. First, both local features $F_{R_{i}}$ and $F_{IR_{i}}$ are divided into four equal parts along the channel dimension. Subsequently, we select the first and third parts from $F_{R_{i}}$ and the second and fourth parts from $F_{IR_{i}}$, and concatenate them in part order to generate the new local RGB features $T_{R_{i}}$. The new local IR features $T_{IR_{i}}$ are generated correspondingly. After that, one VSS block is applied to $T_{R_{i}}$ and $T_{IR_{i}}$, which enhances cross-modality interaction from shallow features:
$$\tilde{F}_{R_{i}}=VSS(T_{R_{i}}),\quad\tilde{F}_{IR_{i}}=VSS(T_{IR_{i}}), \tag{6}$$
where $VSS(\cdot)$ denotes the VSS block [21] depicted in Fig. 2. $\tilde{F}_{R_{i}}$ and $\tilde{F}_{IR_{i}}$ are the outputs of shallow fused features from RGB and IR modality, respectively.
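The $CS(\cdot,\cdot)$ operation in Eq. 5 can be sketched in a few lines of NumPy, assuming channel-first feature maps and a channel count divisible by four (the toy shapes and constant-valued features are illustrative):

```python
import numpy as np

def channel_swap(a, b):
    """CS(a, b): keep channel groups 1 and 3 of `a`, take groups 2 and 4
    from `b`, and concatenate the four groups in part order."""
    a1, a2, a3, a4 = np.split(a, 4, axis=0)
    b1, b2, b3, b4 = np.split(b, 4, axis=0)
    return np.concatenate([a1, b2, a3, b4], axis=0)

C, H, W = 8, 2, 2
F_R = np.zeros((C, H, W))        # toy RGB local features
F_IR = np.ones((C, H, W))        # toy IR local features

T_R = channel_swap(F_R, F_IR)    # Eq. 5: T_R = CS(F_R, F_IR)
T_IR = channel_swap(F_IR, F_R)   # Eq. 5: T_IR = CS(F_IR, F_R)
```

Each output now interleaves channel groups from both modalities, so the subsequent VSS block already sees mixed-modality channels during the shallow fusion stage.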
DSSF module. To further reduce modality disparities, we build a hidden state space for cross-modality feature association and complementarity. DSSF is proposed to model cross-modality object correlation to facilitate feature fusion. Specifically, we employ the VSS block to project features from both modalities into a hidden state space, and a gating mechanism is utilized to construct hidden state transitions dually for cross-modality deep feature fusion.
Formally, after obtaining the shallow fused features $\tilde{F}_{R_{i}}$ and $\tilde{F}_{IR_{i}}$, we first project them into the hidden state space through a VSS block without gating as:
$$y_{R_{i}}=P_{in}(\tilde{F}_{R_{i}}),\quad y_{IR_{i}}=P_{in}(\tilde{F}_{IR_{i}}), \tag{7}$$
where $P_{in}(\cdot)$ denotes an operation for projecting features to the hidden state space. The detailed implementation is described in lines 13-17 of Alg. 1. $y_{R_{i}}$ and $y_{IR_{i}}$ denote the hidden state features. We also project $\tilde{F}_{R_{i}}$ and $\tilde{F}_{IR_{i}}$ to obtain gating parameters $z_{R_{i}}$ and $z_{IR_{i}}$:
$$z_{R_{i}}=f_{\theta_{i}}(\tilde{F}_{R_{i}}),\quad z_{IR_{i}}=g_{\omega_{i}}(\tilde{F}_{IR_{i}}), \tag{8}$$
where $f_{\theta_{i}}(\cdot)$ and $g_{\omega_{i}}(\cdot)$ represent the gating operation with parameters $\theta_{i}$ and $\omega_{i}$ in a dual stream, respectively. After that, we employ the gating outputs of $z_{R_{i}}$ and $z_{IR_{i}}$ in Eq. 8 to modulate $y_{R_{i}}$ and $y_{IR_{i}}$, and implement the hidden state feature fusion as:
$$y^{\prime}_{R_{i}}=y_{R_{i}}\cdot z_{R_{i}}+z_{R_{i}}\cdot y_{IR_{i}}, \tag{9}$$
$$y^{\prime}_{IR_{i}}=y_{IR_{i}}\cdot z_{IR_{i}}+z_{IR_{i}}\cdot y_{R_{i}}, \tag{10}$$
where $y^{\prime}_{R_{i}}$ and $y^{\prime}_{IR_{i}}$ represent the hidden state features of RGB and IR after feature interaction, respectively, and $\cdot$ is the element-wise product. In effect, Eqs. 9 and 10 perform cross-modality fusion in a hidden state space via the gating mechanism, and dual attention is fully exploited for cross-branch information complementarity.
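The gated fusion of Eqs. 9-10 is element-wise and can be sketched directly in NumPy. Sigmoid gates stand in for $f_{\theta_i}$ and $g_{\omega_i}$; the concrete gating form and the toy tensor sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
y_R, y_IR = rng.standard_normal(6), rng.standard_normal(6)  # hidden states (Eq. 7)
z_R = sigmoid(rng.standard_normal(6))   # gate from the RGB branch (Eq. 8)
z_IR = sigmoid(rng.standard_normal(6))  # gate from the IR branch (Eq. 8)

y_R_fused = y_R * z_R + z_R * y_IR      # Eq. 9
y_IR_fused = y_IR * z_IR + z_IR * y_R   # Eq. 10
```

Note that Eq. 9 factors as $y^{\prime}_{R}=z_{R}\cdot(y_{R}+y_{IR})$: each branch's learned gate weights the combined cross-modality hidden state, which is how the dual gating controls how much of the other modality flows into each branch.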
Subsequently, we project $y^{\prime}_{R_{i}}$ and $y^{\prime}_{IR_{i}}$ back to the original space and pass them through a residual connection to obtain the complementary features $\overline{F}_{R_{i}}$ and $\overline{F}_{IR_{i}}$:
$$\overline{F}_{R_{i}}=P_{out}(y^{\prime}_{R_{i}})+\tilde{F}_{R_{i}},\quad\overline{F}_{IR_{i}}=P_{out}(y^{\prime}_{IR_{i}})+\tilde{F}_{IR_{i}}, \tag{11}$$
where $P_{\text{out}}(\cdot)$ denotes the projection operation with a linear transformation.
In practice, we stack several DSSF modules (i.e., Eq. 7 to Eq. 11) to achieve deeper feature fusion, which yields better results; however, performance saturates beyond a certain number of DSSF modules, as further evaluated in our experiments. Finally, we merge the complementary features into the local features by addition to enhance the feature representation:
$$\hat{F}_{R_{i}}=F_{R_{i}}+\overline{F}_{R_{i}},\quad\hat{F}_{IR_{i}}=F_{IR_{i}}+\overline{F}_{IR_{i}}. \tag{12}$$
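The residual enhancement in Eq. 12 and the final fusion described in Sec. 3.2.1 are plain element-wise additions, sketched here with toy tensors (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
shape = (4, 2, 2)  # toy (channels, height, width)
F_R, F_IR = rng.standard_normal(shape), rng.standard_normal(shape)        # local features
Fbar_R, Fbar_IR = rng.standard_normal(shape), rng.standard_normal(shape)  # DSSF outputs (Eq. 11)

# Eq. 12: residual enhancement of the local features
Fhat_R = F_R + Fbar_R
Fhat_IR = F_IR + Fbar_IR

# Fused feature fed to the detection neck (Sec. 3.2.1)
P_i = Fhat_R + Fhat_IR
```

Because every step is additive, the fused feature $P_i$ keeps both original local features intact while layering the complementary DSSF information on top.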
3.2.3 Loss Function
After FMB, the enhanced features from RGB and IR (i.e., $\hat{F}_{R_{i}}$ and $\hat{F}_{IR_{i}}$ in Eq. 12) are further added to generate fused feature $P_{i}$ as the input of the neck to improve the detection performance. Following [12, 13], the total loss function can be constructed as:
$$\mathcal{L}=\lambda_{\text{coord}}\mathcal{L}_{\text{coord}}+\mathcal{L}_{\text{conf}}+\mathcal{L}_{\text{class}}, \tag{13}$$
where $\lambda_{\text{coord}}$ is a hyperparameter that adjusts the weight of the localization loss $\mathcal{L}_{\text{coord}}$, $\mathcal{L}_{\text{conf}}$ is the confidence loss, and $\mathcal{L}_{\text{class}}$ is the classification loss. More details of the individual loss terms are described in [jocher2022ultralytics].
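A one-line sketch of Eq. 13, using the $\lambda_{\text{coord}}=7.5$ value reported in Sec. 4.1; the individual loss values are illustrative placeholders, not measured quantities:

```python
# Total detection loss (Eq. 13); the loss values below are toy placeholders.
lambda_coord = 7.5                          # weight reported in Sec. 4.1
L_coord, L_conf, L_class = 0.8, 0.3, 0.5    # localization, confidence, classification

L_total = lambda_coord * L_coord + L_conf + L_class
```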
Table 1: Comparison with state-of-the-art single-modality and fusion methods on the LLVIP dataset.

Methods | Modality | Backbone | $m$AP_{50} | $m$AP
---|---|---|---|---
(NIPS’16) Faster R-CNN | IR | ResNet50 | 92.6 | 50.7 |
(CVPR’18) Cascade R-CNN | IR | ResNet50 | 95.0 | 56.8 |
(CVPR’23) DDQ DETR | IR | ResNet50 | 93.9 | 58.6 |
(Zenodo’22) YOLOv5-l | IR | YOLOv5 | 94.6 | 61.9 |
(2023) YOLOv8-l | IR | YOLOv8 | 95.2 | 62.1 |
Faster R-CNN [24] | RGB | ResNet50 | 88.8 | 47.5 |
Cascade R-CNN [1] | RGB | ResNet50 | 88.3 | 47.0 |
DDQ DETR [42] | RGB | ResNet50 | 86.1 | 46.7 |
YOLOv5-l [12] | RGB | YOLOv5 | 90.8 | 50.0 |
YOLOv8-l [13] | RGB | YOLOv8 | 91.9 | 54.0 |
(2016) Halfway fusion [19] | IR+RGB | VGG16 | 91.4 | 55.1 |
(WACV’21) GAFF [39] | IR+RGB | ResNet18 | 94.0 | 55.8
(ECCV’22) ProEn [4] | IR+RGB | ResNet50 | 93.4 | 51.5 |
(CVPR’23) CSAA [2] | IR+RGB | ResNet50 | 94.3 | 59.2 |
(2024) RSDet [43] | IR+RGB | ResNet50 | 95.8 | 61.3 |
(IF’23) DIVFusion [31] | IR+RGB | YOLOv5 | 89.8 | 52.0 |
Fusion-Mamba(Ours) | IR+RGB | YOLOv5 | 96.8 | 62.8 |
Fusion-Mamba(Ours) | IR+RGB | YOLOv8 | 97.0 | 64.3 |
3.3 Comparison with Transformer-based fusion
Existing Transformer-based cross-modality fusion methods [6, 26] flatten and concatenate the features with convolution to generate intermediate fused features, which are further fused by multi-head cross-attention to generate the final fused features. Relying only on spatial interaction, they cannot effectively reduce modality disparities, since object correlations across cross-modality features are difficult to model in that space. Our FMB scans features in four directions to obtain four sets of patches, effectively preserving the local information of the features. In addition, these patches are mapped into a hidden space for feature fusion. This mapping-based deep feature fusion effectively reduces spatial disparities through dual-direction gated attention, which further suppresses redundant features and captures complementary information among modalities. As such, the proposed FMB reduces disparities between cross-modality features and enhances the representation consistency of fused features.
In addition, the time complexity of the Transformer’s global attention is $O(N^{2})$, while Mamba’s time complexity is only $O(N)$, where $N$ is the sequence length. Empirically, using the same detection model architecture, replacing the Transformer-based fusion module with the Fusion-Mamba block saves $7$-$19$ ms of inference time per image pair. More details are discussed in our experiments.
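The complexity gap is easy to make concrete. Under the schematic counts below (constants and feature dimensions omitted, an intentional simplification), the attention-to-scan operation ratio grows linearly with sequence length $N$:

```python
# Schematic operation counts: global attention is O(N^2) in sequence
# length, the selective scan is O(N). Constants are omitted on purpose.
seq_lens = [1_000, 4_000, 16_000]
attn_ops = [n * n for n in seq_lens]  # pairwise token interactions
scan_ops = [n for n in seq_lens]      # one recurrent step per token

# The advantage of the linear scan grows with N
speedups = [a // s for a, s in zip(attn_ops, scan_ops)]
```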
4 Experiments
4.1 Experimental Setups
Datasets. Our Fusion-Mamba method is evaluated on three widely-used visible-infrared benchmark datasets, LLVIP [11], $M^{3}FD$ [20] and FLIR [7].
LLVIP is an aligned visible and infrared (IR) dataset collected in low-light environments for pedestrian detection, which contains $15,488$ RGB-IR image pairs. Following the official standards, we use $12,025$ pairs for training and $3,463$ pairs for testing.
$M^{3}FD$ contains $4,200$ RGB and IR-aligned image pairs collected in various environments including different lighting, seasons, and weather scenarios. It has six categories usually appearing in autonomous driving and road monitoring. Since there is no official dataset partitioning method, we use train/test splits provided by [18].
FLIR is collected in day and night scenes with five categories: people, car, bike, dog, and other cars. Following [38], we use FLIR-Aligned with $4,129$ pairs for training and $1,013$ pairs for testing.
Evaluation metrics. We use the most common evaluation metrics, $m$AP and $m$AP_{50}. $m$AP_{50} is the mean AP at IoU $0.50$, and $m$AP is the mean AP averaged over IoU thresholds from $0.50$ to $0.95$ with a stride of $0.05$ [43]. Higher values of both metrics indicate better model performance. We also report the average inference time of our method evaluated on one A800 GPU over $5$ runs with an input size of $640\times 640$.
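The two metrics can be sketched as follows; the per-threshold AP values are illustrative placeholders, and only the threshold grid follows the definition above:

```python
import numpy as np

# IoU thresholds 0.50:0.95 with stride 0.05 (10 thresholds in total)
thresholds = np.arange(0.50, 0.951, 0.05)

# Toy per-threshold AP values (placeholders; AP typically drops as IoU rises)
ap_per_iou = np.linspace(0.90, 0.40, len(thresholds))

mAP_50 = ap_per_iou[0]     # AP at IoU = 0.50
mAP = ap_per_iou.mean()    # averaged over the 10 thresholds
```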
Implementation details. All experiments are implemented in the two-stream framework [6] on a single A800 GPU. The backbone, neck, and head structures of our Fusion-Mamba are set by default to those of YOLOv5-l or YOLOv8-l. During training, we set the batch size to $4$ and use the SGD optimizer with a momentum of $0.9$ and a weight decay of $0.001$. The input image size is $640\times 640$ for all three datasets, and the training epoch is set to $150$ with an initial learning rate of $0.01$. The numbers of SSCS and DSSF modules in FMB are set to $1$ and $8$ by default, respectively. $\lambda_{\text{coord}}$ is set to $7.5$. Other training hyper-parameters are the same as YOLOv8.
Table 2: Comparison with state-of-the-art methods on the $M^{3}FD$ dataset.

Methods | Backbone | $m$AP_{50} | $m$AP | People | Bus | Car | Motorcycle | Lamp | Truck
---|---|---|---|---|---|---|---|---|---
(2020) DIDFuse [44] | YOLOv5 | 78.9 | 52.6 | 79.6 | 79.6 | 92.5 | 68.7 | 84.7 | 68.7 |
(IJCV’21) SDNet [37] | YOLOv5 | 79.0 | 52.9 | 79.4 | 81.4 | 92.3 | 67.4 | 84.1 | 69.3 |
(CVPR’22) RFNet [36] | YOLOv5 | 79.4 | 53.2 | 79.4 | 78.2 | 91.1 | 72.8 | 85.0 | 69.0 |
(CVPR’22) TarDAL [20] | YOLOv5 | 80.5 | 54.1 | 81.5 | 81.3 | 94.8 | 69.3 | 87.1 | 68.7 |
(MM’22) DeFusion [28] | YOLOv5 | 80.8 | 53.8 | 80.8 | 83.0 | 92.5 | 69.4 | 87.8 | 71.4 |
(CVPR’23) CDDFuse [45] | YOLOv5 | 81.1 | 54.3 | 81.6 | 82.6 | 92.5 | 71.6 | 86.9 | 71.5 |
(MM’23) IGNet [17] | YOLOv5 | 81.5 | 54.5 | 81.6 | 82.4 | 92.8 | 73.0 | 86.9 | 72.1 |
(JAS’22) SuperFusion [30] | YOLOv7 | 83.5 | 56.0 | 83.7 | 93.2 | 91.0 | 77.4 | 70.0 | 85.8 |
Fusion-Mamba (ours) | YOLOv5 | 85.0 | 57.5 | 80.3 | 92.8 | 91.9 | 73.0 | 84.8 | 87.1 |
YOLOv8l-IR | YOLOv8 | 79.5 | 53.1 | 82.9 | 90.9 | 90.0 | 64.6 | 63.0 | 85.9 |
YOLOv8l-RGB [13] | YOLOv8 | 80.9 | 52.5 | 70.6 | 92.9 | 91.2 | 69.6 | 75.3 | 86.0 |
Fusion-Mamba (ours) | YOLOv8 | 88.0 | 61.9 | 84.3 | 94.2 | 92.9 | 80.5 | 87.5 | 88.8 |
4.2 Comparison with SOTA Methods
To verify the effectiveness of our Fusion-Mamba method, we employ two backbones based on YOLOv5 and YOLOv8 to make a fair comparison with SOTA methods.
LLVIP Dataset.
The results of different methods on LLVIP are summarized in Tab. 1.
We compare the proposed Fusion-Mamba method using two different backbones with $6$ SOTA multi-spectral object detection methods and $5$ single-modality detection methods.
For single-modality detection, the detection performance only using IR images is better than that only using RGB images, which is due to the effect of low light conditions.
After feature fusion with RGB and IR, the $m$AP performance is improved based on ResNet backbones, outperforming the single IR modality detection.
For example, RSDet with the ResNet50 backbone outperforms Cascade R-CNN only using IR modality by $4.5\%$ $m$AP.
Note, however, that fusion is not always effective on the YOLOv5 backbone: a simple YOLOv5 detection framework with IR input alone achieves $61.9\%$ $m$AP, significantly outperforming the fusion method DIVFusion by $9.9\%$ $m$AP. With the same YOLOv5 backbone,
our Fusion-Mamba method achieves a $0.9\%$ $m$AP gain over the IR-only YOLOv5 detection framework, and also outperforms the best previous fusion method, RSDet, by $1.5\%$ $m$AP. To explain, our SSCS and DSSF effectively reduce modality disparities to improve the representation consistency of fused features.
Our method is also effective with the YOLOv8 backbone, achieving state-of-the-art performance with $97.0\%$ $m$AP_{50} and $64.3\%$ $m$AP.
$M^{3}FD$ Dataset. We compare our method with $7$ SOTA detectors based on YOLOv5 and $1$ SOTA detector based on YOLOv7. As shown in Tab. 2, our Fusion-Mamba performs best on all categories in both $m$AP_{50} and $m$AP compared with SOTA methods using the same YOLOv5 backbone, and our method based on the YOLOv8 backbone achieves new SOTA results on the People, Bus, Motorcycle, and Truck categories, while $m$AP_{50} and $m$AP further increase by $3\%$ and $4.4\%$, respectively. In addition, our method using the YOLOv5 backbone also outperforms SuperFusion based on YOLOv7 by $1.5\%$ in both $m$AP and $m$AP_{50}, despite the lower feature representation ability of YOLOv5 compared with YOLOv7. This is due to the effectiveness of our FMB, which improves the inherent complementarity of cross-modality features.
FLIR-Aligned Dataset. As shown in Tab. 3, Fusion-Mamba also performs best on the FLIR-Aligned dataset. Compared to CrossFormer based on the two-stream YOLOv5 backbone, our methods based on YOLOv8 and YOLOv5 surpass it by $5.6\%$ and $5\%$ on $m$AP_{50}, and $4.9\%$ and $2.3\%$ on $m$AP, respectively. We also outperform RSDet by $3.8\%$ $m$AP_{50} and $5.6\%$ $m$AP. In terms of speed, our method with YOLOv5 achieves the fastest inference, saving $7$ ms and $19$ ms per image pair compared to the Transformer-based CFT and CrossFormer methods, respectively. For parameters, our method based on YOLOv5 saves about $100$M compared to CrossFormer. Although our method based on YOLOv8 adds about $40$M parameters over the YOLOv5 version, its $m$AP is significantly increased by $2.6\%$. These results indicate that our hidden-space modeling better integrates features between different modalities, suppressing modality disparities to enhance the representation ability of fused features with the best trade-off between performance and computational cost.
Visualization of heatmaps. To visually demonstrate the performance of our model, we randomly select one image pair from each of the three experimental datasets to visualize the $P_{5}$ heatmap, and compare our method with other fusion methods. As shown in Fig. 5, compared to other methods, our model focuses more on the targets rather than dispersing attention or focusing on unrelated parts. More examples, as well as visualized object detection results, are presented in the supplementary materials.
4.3 Ablation Study
We use the FLIR-Aligned dataset for the ablation study to separately verify the effectiveness of the SSCS and DSSF modules, and to further explore the influence of the number of DSSF modules and the position of FMB. In particular, we also evaluate the effect of the dual attention in the DSSF module. All of these experiments are conducted with the YOLOv8 backbone.
Effects of SSCS and DSSF modules. The results of removing SSCS and DSSF from the FMB are summarized in Tab. 4. After removing the SSCS module (second row in Tab. 4), detection performance decreases by $2\%$ and $1.1\%$ on $m$AP_{50} and $m$AP, respectively. This is because, without the initial exchange of the two modal features and the shallow mapping fusion, the feature disparity is not well reduced before the subsequent deep fusion. Meanwhile, without DSSF (third row in Tab. 4), shallow fusion interaction alone cannot effectively suppress redundant features and activate effective ones during feature fusion, decreasing detection performance by $2.5\%$ and $2.4\%$ on $m$AP_{50} and $m$AP, respectively. When both SSCS and DSSF are removed and the fused features are obtained directly by adding the two local modality features (fourth row in Tab. 4), performance drops markedly, by $4.8\%$ and $7.6\%$ on $m$AP_{50} and $m$AP, respectively. These results demonstrate that both components of the FMB are effective for cross-modality object detection.
| Methods | $m$AP_{50} | $m$AP_{75} | $m$AP | Param. | Time (ms) |
|---|---|---|---|---|---|
| Fusion-Mamba | 84.9 | 45.9 | 47.0 | 287.6M | 78 |
| Removing SSCS | 82.9 | 42.3 | 45.9 | 266.8M | 69 |
| Removing DSSF | 82.4 | 42.0 | 44.6 | 138.0M | 57 |
| Removing SSCS & DSSF | 80.1 | 36.3 | 39.4 | 117.2M | 48 |
| Positions | $m$AP_{50} | $m$AP_{75} | $m$AP | Param. | Time (ms) |
|---|---|---|---|---|---|
| $\{P_{2},P_{3},P_{5}\}$ | 83.9 | 44.4 | 46.7 | 256.8M | 72 |
| $\{P_{2},P_{4},P_{5}\}$ | 76.1 | 39.8 | 42.3 | 281.1M | 75 |
| $\{P_{3},P_{4},P_{5}\}$ | 84.9 | 45.9 | 47.0 | 287.6M | 78 |
Effect of the position of FMB. Following the works [6, 14], we also use three FMBs for feature fusion. Here, we further explore the impact of FMB position, i.e., at which stages FMB should be added. We select three groups of multi-level features, $\{P_{2},P_{3},P_{5}\}$, $\{P_{2},P_{4},P_{5}\}$ and $\{P_{3},P_{4},P_{5}\}$, for the ablation study, where $P_{i}$ is the fused feature at the $i$-th stage using FMB. As presented in Tab. 5, the position $\{P_{3},P_{4},P_{5}\}$ achieves the best trade-off between performance and computational complexity. Thus, we select this position by default in our experiments.
Effect of the number of DSSF modules. We validated the effectiveness of DSSF in Tab. 4. Here, we further evaluate the effect of the number of DSSF modules, as summarized in Tab. 6. We test four settings (i.e., $2$, $4$, $8$, $16$), keeping all other model settings consistent with the above experiments. Setting the number of blocks to $8$ achieves the best performance; at $8$ DSSF modules the fusion becomes saturated, and further increasing the number causes drift of complementary features, which decreases fusion performance.
| Number | $m$AP_{50} | $m$AP_{75} | $m$AP | Param. | Time (ms) |
|---|---|---|---|---|---|
| 2 | 82.3 | 42.8 | 45.5 | 175.3M | 50 |
| 4 | 82.6 | 43.6 | 45.9 | 212.7M | 65 |
| 8 | 84.9 | 45.9 | 47.0 | 287.6M | 78 |
| 16 | 83.4 | 44.2 | 46.3 | 437.2M | 148 |
Effect of the dual attention of the DSSF module. To further explore the effectiveness of our gating mechanism with and without the dual attention of the DSSF module, we separately remove the IR attention (i.e., $z_{IR_{i}}\cdot y_{R_{i}}$ in Eq. 9) in the RGB branch, the RGB attention (i.e., $z_{R_{i}}\cdot y_{IR_{i}}$ in Eq. 10) in the IR branch, and both. The results are shown in Tab. 7. After removing the IR attention or the RGB attention, $m$AP_{50} decreases by $1.6\%$ or $1.1\%$, respectively, due to the reduced attention interaction between the two features. When both attentions are removed, the DSSF module becomes a stack of VSS blocks, and $m$AP_{50} degrades by $2\%$. Note that the IR and RGB attention branches share weights with their counterpart branches, so compared with removing the dual attention, they only add activation functions and feature-addition operations. Thus, the dual attention has little impact on model parameters and runtime, while significantly improving detection performance.
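Since Eqs. 9 and 10 are not reproduced in this section, the following numpy sketch only illustrates the cross-gating idea described above: each branch's SSM output $y$ is modulated both by its own gate and by the other modality's gate $z$. The SiLU activation and the additive combination of the two gated terms are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation, the usual Mamba gating nonlinearity."""
    return x / (1.0 + np.exp(-x))

def dual_gated_attention(y_rgb, z_rgb, y_ir, z_ir):
    """Each branch is modulated by its own gate and by the other
    modality's gate (the 'dual attention')."""
    out_rgb = silu(z_rgb) * y_rgb + silu(z_ir) * y_rgb  # IR attention term in the RGB branch
    out_ir = silu(z_ir) * y_ir + silu(z_rgb) * y_ir     # RGB attention term in the IR branch
    return out_rgb, out_ir

shape = (8, 16)  # (sequence length, hidden dim), illustrative sizes
y_r, z_r, y_i, z_i = (np.random.randn(*shape) for _ in range(4))
f_r, f_i = dual_gated_attention(y_r, z_r, y_i, z_i)
```

Removing a cross term recovers plain single-branch Mamba gating, which matches the observation above that the DSSF module without dual attention degenerates into a stack of VSS blocks.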
5 Conclusion
In this paper, we propose a novel Fusion-Mamba method with well-designed SSCS and DSSF modules for multi-modal feature fusion. In particular, SSCS exchanges infrared and visible channel features for shallow feature fusion. DSSF is then designed for deeper multi-modal feature interaction in a hidden state space based on Mamba, where a gated attention suppresses redundant features to enhance the effectiveness of feature fusion. Extensive experiments on three public RGB-IR datasets demonstrate that our method achieves new state-of-the-art performance with higher inference efficiency than Transformers. Our work confirms the potential of Mamba for cross-modal fusion, and we believe it can inspire more research on applying Mamba to cross-modal tasks.
References
- Cai and Vasconcelos [2018] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6154–6162. Computer Vision Foundation / IEEE Computer Society, 2018.
- Cao et al. [2023] Yue Cao, Junchi Bin, Jozsef Hamari, Erik Blasch, and Zheng Liu. Multimodal object detection by channel switching and spatial attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Workshops, Vancouver, BC, Canada, June 17-24, 2023, pages 403–411. IEEE, 2023.
- Chen et al. [2024] Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang. Video mamba suite: State space model as a versatile alternative for video understanding. CoRR, abs/2403.09626, 2024.
- Chen et al. [2022] Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, and Shu Kong. Multimodal object detection via probabilistic ensembling. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX, pages 139–158. Springer, 2022.
- Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- Fang et al. [2021] Qingyun Fang, Dapeng Han, and Zhaokui Wang. Cross-modality fusion transformer for multispectral object detection. CoRR, abs/2111.00273, 2021.
- FLIR [2024] TELEDYNE FLIR. Free teledyne flir thermal dataset for algorithm training. Online, 2024.
- Gu and Dao [2023] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023.
- Guan et al. [2019] Dayan Guan, Yanpeng Cao, Jiangxin Yang, Yanlong Cao, and Michael Ying Yang. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion, 50:148–157, 2019.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.
- Jia et al. [2021] Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. LLVIP: A visible-infrared paired dataset for low-light vision. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021, pages 3489–3497. IEEE, 2021.
- Jocher et al. [2022] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, Yonghye Kwon, Kalen Michael, Jiacong Fang, Colin Wong, Zeng Yifu, Diego Montes, et al. ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. Zenodo, 2022.
- Jocher et al. [2023] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLO, 2023.
- Lee et al. [2024] Seungik Lee, Jaehyeong Park, and Jinsun Park. Crossformer: Cross-guided attention for multi-modal object detection. Pattern Recognition Letters, 2024.
- Li et al. [2023a] Boyang Li, Chao Xiao, Longguang Wang, Yingqian Wang, Zaiping Lin, Miao Li, Wei An, and Yulan Guo. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process., 32:1745–1758, 2023a.
- Li et al. [2019] Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit., 85:161–171, 2019.
- Li et al. [2023b] Jiawei Li, Jiansheng Chen, Jinyuan Liu, and Huimin Ma. Learning a graph neural network with cross modality interaction for image fusion. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 4471–4479. ACM, 2023b.
- Liang et al. [2023] Mingjian Liang, Junjie Hu, Chenyu Bao, Hua Feng, Fuqin Deng, and Tin Lun Lam. Explicit attention-enhanced fusion for rgb-thermal perception tasks. IEEE Robotics Autom. Lett., 8(7):4060–4067, 2023.
- Liu et al. [2016] Jingjing Liu, Shaoting Zhang, Shu Wang, and Dimitris N. Metaxas. Multispectral deep neural networks for pedestrian detection. 2016.
- Liu et al. [2022] Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5811, 2022.