
MCFNet: Multiscale Cross-Domain Fusion Network for HSI and LiDAR Data Joint Classification

Qiya Song, Feng Mo, Kexin Ding, Lin Xiao, Renwei Dian, Member, IEEE, Xudong Kang, Senior Member, IEEE, and Shutao Li, Fellow, IEEE

Abstract

The hyperspectral image (HSI) encompasses abundant spatial and spectral details, while light detection and ranging (LiDAR) delivers precise elevation data. The amalgamation of HSI and LiDAR data significantly improves the precision of image classification. However, most methods focus solely on spatial features while neglecting frequency-domain information, limiting the ability of deep models to characterize land cover. Furthermore, how to establish sufficient interaction between different modalities is also an important issue. In this article, we propose a novel multiscale cross-domain fusion network (MCFNet) for joint classification of HSI and LiDAR data. The main idea is that the wavelet transform can provide details at different resolutions simultaneously, supplementing spatial-domain information and enriching feature representation. In addition, a multimodal fusion module (MFM) guided by HSI and a cross-domain fusion module (CDFM) are developed to integrate features from diverse modalities and domains, respectively. Specifically, frequency-domain features are extracted by the discrete wavelet transform (DWT), and spatial-domain features of the image are captured through a set of convolutional operations. Then, interactive fusion is performed by the MFM and CDFM, and finally, the integrated features are categorized by a classification module. Extensive experiments on three widely used HSI and LiDAR datasets indicate that MCFNet outperforms state-of-the-art (SOTA) methods. The code will be available at: https://github.com/MSFLabX/MCFNet

Index Terms-Hyperspectral images (HSIs), image classification, light detection and ranging (LiDAR) data, wavelet transform.

I. Introduction

HYPERSPECTRAL image (HSI), due to its rich spectral information content, finds widespread applications across multiple domains, including change detection [1], [2], [3] and land cover classification. HSI-based land cover classification typically encompasses both conventional unimodal methods and multisource image-based methods [4], [5], [6],
[7]. Among them, Tu et al. [5] performed classification using a network based on three branches of superpixel, pixel, and subpixel, while Tu et al. [6] addressed the vulnerability of the model to adversarial samples by modeling the global contextual relationships of pixels. Bruni et al. [7] developed a wavelet-adaptive band selection framework that intelligently identifies and selects optimal spectral bands by analyzing class-specific spectral signatures in HSI. Wang et al. [8] proposed the S2Mamba model, which employs dual scanning mechanisms with a selective state space model to capture long-range dependencies at linear complexity, replacing traditional self-attention. However, spectral information alone is not sufficient to distinguish objects of similar categories; hence, multisource image fusion classification methods have been explored. The elevation information from light detection and ranging (LiDAR) can complement the HSI to improve the effectiveness of classification. With the full fusion of the two modalities, ground objects can be categorized more accurately in scenarios including agricultural monitoring, marine monitoring, and mineral detection [9], [10].
In recent years, numerous fusion classification methods for HSI and LiDAR data have emerged, focusing on the extraction of rich features from images and the design of robust fusion methods [11], [12]. These methods typically fuse the spatial-domain information of the two modalities through early-, middle-, or late-fusion approaches. For example, in [12], a dual-tributary architecture is developed to capture the features of HSI and LiDAR images, combining heterogeneous features during the fusion stage by employing feature-level and decision-level fusion strategies. Moreover, techniques such as the Transformer have been widely used in various fields, proving the effectiveness of their feature representation [13], [14], [15], [16], [17]. For instance, Ding et al. [9] combined the Transformer and convolutional neural network (CNN) to model intricate local and global spatial-spectral relationships for joint HSI and LiDAR data classification. Yao et al. [15] proposed ExViT, which is capable of handling multimodal remote sensing image patches with position-shared ViT parallel branches, extending the separable convolution module. Deng et al. [16] captured spatial features from HSI and SAR by combining a two-branch GCN with a Transformer. To improve feature discrimination, Lu et al. [17] leveraged the spectral reflectance characteristics of images and critical attributes of attention mechanisms. However, the aforementioned methods are based solely on the spatial domain of images, with minimal investigation of the frequency-domain features of ground object images.

Fig. 1. Feature decomposition principle of the Daubechies wavelet transform. $H_{\text{low}}$ denotes low-pass filtering of the original data, producing low-frequency approximation coefficients, while $H_{\text{high}}$ denotes high-pass filtering, producing high-frequency detail coefficients. The low- and high-frequency components of the image are extracted by convolutional operations, and four subbands are constructed to obtain one low-frequency feature and three high-frequency features. In this way, the different frequency information of the image is decomposed and processed.

This constraint restricts the comprehensive utilization of deep models' representational potential in classification tasks.
With the progression of deep learning (DL), the Fourier transform (FT) has become increasingly significant in the realm of image processing, including image super-resolution [18], [19], image dehazing [20], mirror detection [21], and hyperspectral anomaly detection [22], [23]. Zhao et al. [24] proposed a Transformer based on the FT, which first performs pixel-level fusion of HSI and LiDAR images and then extracts global contextual and sequential features through the image Transformer, effectively achieving classification. Zeng et al. [25] introduced a method for adaptive fusion of cross-modal frequency-domain features, which separately fuses the amplitude and phase information of different modalities, alleviating the issue of modality differences. However, methods that use the FT for frequency-domain analysis have a limitation: the FT can only perform global frequency-domain analysis. In contrast, the wavelet transform can simultaneously provide both frequency- and time-domain details, thereby complementing and enriching the local details of the image [26], [27]. The wavelet transform can extract features in a multiscale and multiresolution manner. Fig. 1 shows the process of the Daubechies wavelet transform, where $H_{\text{low}}$ is the low-pass filter and $H_{\text{high}}$ is the high-pass filter. In the first decomposition stage, the larger dimension between the length and width of the feature is halved, and low-pass and high-pass filter matrices are constructed to perform the 2-D wavelet transform; in the second decomposition, the same operation is performed on the smaller dimension. The components of the image in the low- and high-frequency bands are extracted by convolutional operations. Wavelet decomposition yields four distinct components: A, H, D, and V correspond to the low-frequency, horizontal high-frequency, diagonal high-frequency, and vertical high-frequency elements, respectively. This method effectively captures structural details. Therefore,

recent studies have attempted to use the wavelet transform for joint HSI and LiDAR classification. For example, Ni et al. [28] proposed a method that combines CNN, Transformer, and the discrete wavelet transform (DWT), effectively improving the classification performance. The method first applies 2-D and 3-D DWTs to HSI data and a 2-D DWT to LiDAR data to extract both high- and low-frequency features from each modality. It then fuses the multimodal features and finally designs a frequency-domain-based global feature encoder for classification. Nevertheless, the above method still lacks exploration of the spatial domain, which suppresses its capacity to express spatial features.
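To make the decomposition in Fig. 1 concrete, the following minimal sketch (assuming the PyWavelets library; the band and patch size are illustrative) produces the one low-frequency and three high-frequency subbands of a single band:

```python
import numpy as np
import pywt

# One spectral band of an HSI patch (illustrative data).
band = np.random.rand(16, 16).astype(np.float32)

# Single-level 2-D DWT with a Daubechies wavelet (db2).
# cA is the low-frequency approximation (component A in Fig. 1);
# cH, cV, and cD are the horizontal, vertical, and diagonal
# high-frequency details (components H, V, and D).
cA, (cH, cV, cD) = pywt.dwt2(band, 'db2')

print(cA.shape, cH.shape, cV.shape, cD.shape)  # each roughly (H/2, W/2)
```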
By addressing the aforementioned challenges, this article proposes a novel multiscale cross-domain fusion network (MCFNet). The primary motivation for proposing MCFNet is to harness the robust spatial feature extraction (SFE) capabilities of CNNs and the strengths of wavelet transforms in capturing temporal and frequency-domain texture details, achieving superior global and local feature extraction. On the one hand, CNNs are employed to capture spatial-domain features from the presegmented multiscale inputs. On the other hand, the inputs are divided into one low-frequency group and three high-frequency groups, thereby capturing the texture and edge information of the image through the wavelet transform. Subsequently, the multimodal fusion module (MFM) and the cross-domain fusion module (CDFM) are introduced to integrate features across different scales and domains. Our method demonstrates that the fusion of the spatial and wavelet domains can significantly enhance classification performance.
The main contributions of our work are summarized as follows.
  1. A framework for joint classification of HSI and LiDAR data is proposed by fusing spatial and wavelet domain features. This method effectively mines spatial- and frequency-domain information from images of different modalities and improves the overall classification performance.
  2. To effectively capture multiscale features and complementary information across different modalities, we design CDFM and MFM. The CDFM is used to fuse spatial domain with wavelet domain information, while MFM exploits the spectral features of HSI data and the spatial structural details of LiDAR data.
  3. Comprehensive experimental results validate the effectiveness of the proposed method. Furthermore, when evaluating the proposed method, the average results over ten trials in all comparative experiments outperform other state-of-the-art (SOTA) methods.

    We structure the remainder of this article as follows. Section II reviews relevant literature on wavelet transform and DL-based fusion methods. Section III provides a detailed description of the proposed MCFNet. Section IV compares the MCFNet with SOTA methods and conducts comprehensive ablation studies to validate its effectiveness and evaluate its performance. Finally, Section V concludes this article with a summary of the key contributions and findings.
II. Related Work

This section provides a detailed review of significant progress in image classification, broadly divided into frequency-domain-based approaches and multimodal fusion methods.

A. Frequency-Domain-Based Methods

Frequency-domain methods first decompose the input image into four frequency components: one low-frequency group and three high-frequency groups. These low- and high-frequency components are then fused for the downstream task. Zhao et al. [24] introduced supervised and adversarial losses designed in the Fourier space; by effectively combining complementary details from the frequency and spatial domains, this enhances the high-frequency detail recovery capability and perceptual quality of low-complexity generators in image super-resolution tasks. In [29], a wavelet transform-based frequency division interactive CNN (WFDI-CNN) is proposed to reduce spatial redundancy by decomposing the common tensor into low- and high-frequency complementary tensors. Li et al. [30] suppressed aliasing effects and improved noise robustness by replacing traditional downsampling operations, such as max pooling and strided convolution, with the DWT.
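As a hedged illustration of the replacement described in [30], the sketch below downsamples a feature map by keeping only the low-frequency (LL) subband of a DWT; this is a simplification under our own assumptions, since the detail subbands may also be retained in practice:

```python
import numpy as np
import pywt

def dwt_downsample(x, wavelet='haar'):
    """2x anti-aliased downsampling via DWT instead of max pooling
    or strided convolution: keep only the low-frequency (LL) subband.

    x: (C, H, W) feature map; returns an array of shape (C, H/2, W/2).
    """
    return np.stack([pywt.dwt2(channel, wavelet)[0] for channel in x])

feat = np.random.rand(8, 32, 32)
print(dwt_downsample(feat).shape)  # (8, 16, 16)
```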
Xiong et al. [31] developed a semi-supervised domain adaptation framework featuring a wavelet Siamese network that enables spatial-frequency interaction across bitemporal HSIs, significantly improving the change boundary detection accuracy. Xing et al. [32] proposed a frequency-enhanced Mamba for remote sensing change detection, developing a novel DCT-assisted Mamba decoder to achieve simultaneous feature decoding and refinement. Ni et al. [28] transformed the input data into different frequency intervals by the DWT while downsampling to obtain low- and high-frequency components, and learned deep features by 3-D and 2-D DWTs. Although these methods successfully harness frequency-domain information, they fail to establish a synergistic relationship between frequency components and their spatially correlated semantic patterns.

B. Multimodal Fusion Methods

With the advancement of DL, an increasing number of studies have focused on incorporating DL methods into multimodal tasks, facilitating remarkable progress in the domain of multimodal learning [33], [34]. Multimodal fusion classification typically involves three key stages: feature extraction, fusion, and classification. HSI mainly emphasizes spectral resolution, while LiDAR focuses more on elevation information. Existing studies fuse these two types of features by designing various fusion methods. Hang et al. [12] selected the better modality features through decision-level fusion, but the weights have a large impact, and inappropriate weights can interfere with the classification results. In [15], the classification tokens of all modalities are fused together by a multimodal attention and token fusion strategy before the head of the multilayer perceptron (MLP)

to acquire the final classification distribution. Zhang et al. [35] introduced an oriented attention fusion (OAF) method, which extracts cross-modal shallow local shape features and integrates them along horizontal and vertical dimensions to enhance feature consistency. Lu et al. [10] proposed a multilevel fusion method, which combines features at different levels to generate the final classification results through an adaptive probabilistic fusion strategy. The above methods tend to overlook the hierarchical interdependence between the two modalities. In our experiments, we verify the dominant role of HSI in the fusion process and design a feature fusion method guided by HSI.
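One common way to realize such modality-guided fusion is cross-attention in which the HSI features form the query; the following minimal PyTorch sketch reflects this general idea under our own illustrative assumptions, not necessarily the exact design of the proposed MFM:

```python
import torch
import torch.nn as nn

class HSIGuidedFusion(nn.Module):
    """Illustrative HSI-guided cross-attention fusion (a sketch, not the paper's MFM).

    The HSI tokens act as the query, while the LiDAR tokens supply keys
    and values, so spectral content steers which elevation cues are
    emphasized during fusion.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_hsi, f_lidar):
        # f_hsi, f_lidar: (B, N, dim) token sequences from the two branches.
        fused, _ = self.attn(query=f_hsi, key=f_lidar, value=f_lidar)
        return self.norm(f_hsi + fused)   # residual connection keeps HSI dominant

f_h = torch.randn(2, 64, 128)   # hypothetical HSI tokens
f_l = torch.randn(2, 64, 128)   # hypothetical LiDAR tokens
print(HSIGuidedFusion(128)(f_h, f_l).shape)  # torch.Size([2, 64, 128])
```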

III. Proposed Method

In this article, we propose an MCFNet for HSI and LiDAR data, which integrates multiscale local spatial-domain details with multiscale global wavelet domain details to enhance classification accuracy. As illustrated in Fig. 2, the proposed MCFNet comprises three core parts: spatial-frequency feature extraction (SFFE), cross-domain fusion, and classification and loss function. This section elaborates on the design details of each component.

A. Spatial-Frequency Feature Extraction

Fig. 2 portrays the structure of SFFE. In general, a single-spatial-domain approach is deficient in extracting detailed texture features, while a tandem spatial-frequency approach lacks the representation of complementary information. For this reason, we propose a multiscale, spatial-frequency-based feature extraction framework tailored to the two kinds of data. The SFFE module comprises two key components: SFE and wavelet feature extraction (WFE). Briefly, the SFE acquires the local spatial details of images, and the WFE acquires detailed texture features through the 2-D DWT. We then capitalize on local spatial details and global texture features at multiple scales.
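As an orientation before the formal definitions, a minimal PyTorch sketch of this two-branch idea is given below; the module name, channel sizes, and the use of a Haar (db1) transform implemented by strided slicing are our own illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

def haar_dwt2(x):
    """Single-level 2-D Haar DWT on a (B, C, H, W) tensor with even H and W.
    Returns (LL, LH, HL, HH), each of shape (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency approximation
    lh = (a - b + c - d) / 2   # detail along the width
    hl = (a + b - c - d) / 2   # detail along the height
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

class SpatialFrequencyExtractor(nn.Module):
    """Illustrative two-branch extractor: a conv branch (SFE) and a wavelet branch (WFE)."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.sfe = nn.Sequential(                    # spatial branch
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU())
        self.wfe = nn.Conv2d(4 * in_ch, out_ch, 1)   # encode the four subbands

    def forward(self, x):
        spatial = self.sfe(x)                         # (B, out_ch, H, W)
        subbands = torch.cat(haar_dwt2(x), dim=1)     # (B, 4*in_ch, H/2, W/2)
        freq = self.wfe(subbands)                     # (B, out_ch, H/2, W/2)
        return spatial, freq

x = torch.randn(2, 30, 16, 16)   # e.g., a reduced-band HSI patch
s, f = SpatialFrequencyExtractor(30)(x)
print(s.shape, f.shape)          # torch.Size([2, 64, 16, 16]) torch.Size([2, 64, 8, 8])
```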
In the multimodal fusion classification task, there is a complex process of feature extraction and fusion. Let the two modalities, HSI and LiDAR, be denoted as $X_h \in \mathbb{R}^{m \times n \times d_h}$ and $X_l \in \mathbb{R}^{m \times n \times d_l}$, respectively, where $m$ and $n$ are the height and width of the HSI and LiDAR data, and $d_h$ and $d_l$ denote their respective numbers of spectral bands. Our task is to learn reliable feature representations from these modalities for classification.
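The WFE step described next extracts a $p \times p$ cube around every pixel at two scales; for concreteness, a minimal NumPy sketch of this extraction is shown below (the reflection padding mode is our own assumption):

```python
import numpy as np

def extract_cube(img, i, j, p):
    """Extract a p x p x d cube around pixel (i, j).

    img: an (m, n, d) HSI or LiDAR raster; the edges are padded
    so that every pixel yields a full-size cube.
    """
    half = p // 2
    padded = np.pad(img, ((half, half), (half, half), (0, 0)), mode='reflect')
    return padded[i:i + p, j:j + p, :]

X_h = np.random.rand(100, 120, 144)      # HSI: m x n x d_h
cube8 = extract_cube(X_h, 0, 0, 8)       # the 8 x 8 scale
cube16 = extract_cube(X_h, 0, 0, 16)     # the 16 x 16 scale
print(cube8.shape, cube16.shape)         # (8, 8, 144) (16, 16, 144)
```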
  1. Wavelet Feature Extraction: The network composition of WFE is shown in Fig. 2(a). WFE comprises three main steps: wavelet transform, convolutional coding, and frequency-domain feature alignment (FDFA). First, we segment the input image into patches of two sizes, $8 \times 8$ and $16 \times 16$. After padding the edges of the image, a patch is extracted for each pixel to generate HSI cubes $X_h^p \in \mathbb{R}^{p \times p \times d_h}$ and LiDAR cubes $X_l^p \in \mathbb{R}^{p \times p \times d_l}$, where $p$ is the spatial size of the cubes. Then, the image is decomposed into three high-frequency components ($X_h^{HL}$, $X_h^{LH}$, $X_h^{HH}$) and one low-frequency component ($X_h^{LL}$) by the wavelet transform; this operation allows the model to learn detailed features from the high-frequency components. Subsequently, we will get

Received 5 March 2025; revised 24 April 2025; accepted 2 May 2025. Date of publication 6 May 2025; date of current version 21 May 2025. This work was supported in part by the National Natural Science Foundation of China under Grant 62401204 and Grant 61866013 and in part by the Open Project of Fujian Key Laboratory of Spatial Information Perception and Intelligent Processing under Grant FKLSIPIP1001. (Corresponding author: Feng Mo.)
Qiya Song, Feng Mo, and Lin Xiao are with the School of Information Science and Engineering, Hunan Normal University, Changsha, Hunan 410081, China (e-mail: Mofeng711@163.com).
Kexin Ding, Xudong Kang, and Shutao Li are with the School of Robotics, Hunan University, Changsha, Hunan 410082, China.
Renwei Dian is with the School of Robotics, Hunan University, Changsha, Hunan 410082, China, and also with Fujian Key Laboratory of Spatial Information Perception and Intelligent Processing, Yango University, Fuzhou, Fujian 350015, China.
Digital Object Identifier 10.1109/TGRS.2025.3567297