
Hyperspectral Image Classification Using 3D Attention Mechanism in Collaboration with Transformer

Yubing Wang, School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, China, 911190425@qq.com
Ye Zhang, School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, China, 13637057794@163.com
Kaifeng Duan, School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, China, 1687845289@qq.com

DOI: https://doi.org/10.1145/3641584.3641609
AIPR 2023: 2023 6th International Conference on Artificial Intelligence and Pattern Recognition (AIPR), Xiamen, China, September 2023

With continuous innovation in deep learning, introducing deep learning techniques into hyperspectral image classification to improve its accuracy has become a major research direction. Convolutional Neural Networks (CNNs) are among the most widely used deep learning methods for visual data and, owing to their strong contextual modeling capability, are widely applied to hyperspectral image (HSI) classification. Because HSI classification performance depends heavily on both spatial and spectral information, and because current CNN-based classification models extract spatial-spectral features insufficiently and fail to mine and represent the sequence properties of spectral features, this paper proposes an HSI classification method that uses a 3D attention mechanism in collaboration with a Transformer. On top of a hybrid model of improved 3D-CNN and 2D-CNN, we introduce a variant Transformer model: complementary spatial-spectral and spectral information is combined through 3D and 2D convolutions, a variant attention module is added to strengthen spatial texture features, and a grouped-transfer Transformer with skip connections enables the lower layers to better learn upper-layer features. First, a variant channel attention mechanism is introduced into the 3D-CNN to enhance its acquisition of the spectral information of image features. Second, a variant spatial attention mechanism is introduced so that the 3D-CNN better captures the spatial information of hyperspectral images; the acquired spatial and spectral features are then passed to a 2D-CNN to better extract local feature information. Finally, the extracted features are passed to the variant Transformer model, which compensates for the fact that CNNs capture HSI features only in local contexts and thus acquires global information over feature sequences. Experiments on two hyperspectral datasets, Indian Pines and Pavia University, show that the overall classification accuracy (OA), average classification accuracy (AA), and Kappa coefficient reach 99.59%, 99.31%, and 99.45%, respectively, on the PU dataset, improving classification accuracy over current state-of-the-art techniques.

KEYWORDS: hyperspectral image classification, attention mechanism, convolutional neural networks, variant Transformer

ACM Reference Format:
Yubing Wang, Ye Zhang and Kaifeng Duan. 2023. Hyperspectral Image Classification Using 3D Attention Mechanism in Collaboration with Transformer. In 2023 6th International Conference on Artificial Intelligence and Pattern Recognition (AIPR) (AIPR 2023), September 22-24, 2023, Xiamen, China. ACM, New York, NY, USA, 12 Pages. https://doi.org/10.1145/3641584.3641609

1 INTRODUCTION

Hyperspectral imaging is a technique for acquiring spectral information about the reflectance of an object over a wide range of wavelengths. A conventional camera captures information in only three bands (red, green, and blue), whereas a hyperspectral camera acquires many more bands and thus provides a far more detailed spectral characterization of the object. A hyperspectral image (HSI) contains hundreds of channels, providing extremely rich channel information and detailed spatial texture. In recent years, with the development of remote sensing imaging technology and computer vision, learning from the multidimensional information of HSIs has been widely applied in geological survey [1], precision agriculture [2], environmental monitoring [3], and other fields. Hyperspectral Image Classification (HSIC) based on HSI has accordingly received much attention in remote sensing applications and research.

Although HSI has many advantages, several problems remain for HSIC research. Early work focused mainly on extracting spectral features, using methods such as principal component analysis, independent component analysis, and linear discriminant analysis, followed by a classifier; the most common classifiers are the K-nearest neighbor algorithm [4, 5] and the support vector machine [6, 7]. These methods do not take spatial information into account, so the extracted features are not comprehensive and the classification accuracy is relatively low. Moreover, HSI classification suffers from the phenomena of "same object, different spectra" and "different objects, same spectrum" [8]: the same class of material may exhibit different spectral profiles due to interference from the external physical environment and other factors, while different materials may present identical spectral features in certain bands.

Compared with traditional machine learning, deep learning can extract features automatically. With its application to remote sensing imagery, convolutional neural networks have become widely used in hyperspectral image classification. Wang Hao et al. [9] proposed LA 3D-CNN, a network based on 3D convolution with a joint attention mechanism, which processes features with a long short-term memory (LSTM) network and adds an attention mechanism to the 3D convolutional network, achieving good results. However, training a 3D-CNN requires a large amount of sample data, and its classification results are less satisfactory when training samples are scarce. Roy et al. [10] proposed the hybrid spectral network HybridSN, which extracts spectral and spatial features with three consecutive 3D-CNN layers and then extracts spatial features with two consecutive 2D-CNN layers, demonstrating that hybrid convolution can achieve better results in HSI classification; its ability to characterize spectral sequence information, especially subtle differences along the spectral dimension, is nonetheless still insufficient. Hong et al. [11] rethought HSI classification from a sequential perspective and proposed a new backbone network, SpectralFormer, which learns local spectral sequence information from the neighboring bands of HS images to produce grouped spectral embeddings.

In response to this state of research, this paper proposes a method that classifies hyperspectral images using a 3D attention mechanism in collaboration with a grouped Transformer. Through the collaboration of CNN and Transformer, effective classification of hyperspectral images is achieved and classification accuracy is improved.

The main contributions of this paper are as follows:

1. A novel 3D-CNN channel attention mechanism is proposed to obtain the spectral feature information of hyperspectral images, exploiting the 3D-CNN's ability to acquire spectral features simultaneously with spatial ones to obtain detailed image features.

2. A novel 3D-CNN spatial attention mechanism is proposed to obtain the spatial information of image features.

3. A Transformer model is introduced to address the shortcoming that a CNN obtains only local feature information; the Transformer's internal structure transforms the features into a global sequence and captures global contextual information, adapting the network to the characteristics of hyperspectral images.

2 METHODOLOGY

The 3D-Attention Mechanism Synergetic Transformer Network (3D-AMSTN) proposed in this paper for hyperspectral image classification is shown in Figure 1. The framework has three parts: the 3D-CNN variant spatial and channel attention mechanisms, the 2D-CNN, and the grouped Transformer model. The collaboration between the CNN convolutions and the Transformer enhances the network's ability to classify hyperspectral images. The details of the network are described in Sections 2.2 and 2.3.

Figure 1: Structure of 3D-AMSTN

2.1 Hyperspectral Image Pre-processing

Hyperspectral images are high-dimensional along the spectral axis, and the data in different bands are correlated and highly redundant. To eliminate this redundancy, Principal Component Analysis (PCA) is used to reduce the number of spectral bands while keeping the original spatial dimensions intact, so the spatial information is preserved. The feature maps are first cropped, PCA dimensionality reduction is applied to the dataset, and the reduced feature maps are fed into the network.

The main idea of PCA is to map n-dimensional features onto k dimensions; these new orthogonal features, called principal components, are reconstructed from the original n-dimensional features. PCA works by sequentially finding a set of mutually orthogonal axes in the original space, where the choice of each new axis depends on the data itself: the first axis points in the direction of largest variance in the original data, the second axis maximizes variance in the plane orthogonal to the first, the third maximizes variance in the subspace orthogonal to the first two, and so on until n axes are obtained. Most of the variance is contained in the first k axes, while the remaining axes contain almost none, so only the first k axes are retained. In effect, this keeps only the dimensions that carry most of the variance and discards those with near-zero variance, achieving dimensionality reduction of the data features.

Take the Indian Pines dataset as an example: its size is 145 × 145 × 200, where 145 × 145 is the height and width of the image and 200 is the spectral dimension. Assuming an original input image of size M × N × C, where M is the height, N the width, and C the number of spectral bands, retaining the first B principal components with PCA changes the image size to M × N × B, eliminating spectral redundancy while retaining spatial information.
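As a concrete illustration, the following is a minimal sketch of this band-reduction step, assuming an input cube X of shape (M, N, C); the choice of B = 30 and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of PCA band reduction for an (M, N, C) hyperspectral cube.
# B = 30 and the sklearn API are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(X: np.ndarray, B: int = 30) -> np.ndarray:
    """Reduce an (M, N, C) cube to (M, N, B) with the first B principal components."""
    M, N, C = X.shape
    flat = X.reshape(-1, C)          # each pixel's spectrum becomes one sample
    pca = PCA(n_components=B)        # keep the B axes carrying most of the variance
    reduced = pca.fit_transform(flat)
    return reduced.reshape(M, N, B)  # spatial dimensions are unchanged
```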

Assuming that all pixels in the p-neighborhood of each pixel are selected as one sample, the sample after neighborhood extraction can be expressed as:

\begin{equation} Z_n = \left[ \begin{array}{ccccc} y_{(i-p)(j-p)} & \cdots & y_{(i-p)j} & \cdots & y_{(i-p)(j+p)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ y_{i(j-p)} & \cdots & y_{ij} & \cdots & y_{i(j+p)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ y_{(i+p)(j-p)} & \cdots & y_{(i+p)j} & \cdots & y_{(i+p)(j+p)} \end{array} \right] \end{equation}
(1)
where $Z_n$ denotes the $n$th sample and $y_{ij}$ is a single pixel of size 1 × 1 × B after PCA dimensionality reduction.

When a pixel lies at the edge of the image, its neighborhood extends beyond the image boundary, and the excess is zero-filled. The label of the center pixel is used as the true label of the sample. Let K = 2p + 1; the size of each sample then becomes K × K × B. After neighborhood extraction, the 3D hyperspectral data cube is divided into small overlapping 3D image blocks (patches); the spectral dimension of each patch remains the reduced size, and only the spatial size changes.
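The neighborhood extraction of Eq. (1) can be sketched as follows, assuming the PCA-reduced cube X of shape (M, N, B); p = 4 (so K = 9) matches the 9 × 9 × 30 input described below, and the zero-fill at the edges follows the text.

```python
# Sketch of Eq. (1): overlapping K x K x B patches with zero-padded edges,
# each labeled by its center pixel. p = 4 gives the 9 x 9 spatial size used later.
import numpy as np

def extract_patches(X: np.ndarray, labels: np.ndarray, p: int = 4):
    """Yield ((K, K, B) patch, center-pixel label) pairs, with K = 2p + 1."""
    M, N, B = X.shape
    padded = np.pad(X, ((p, p), (p, p), (0, 0)), mode="constant")  # zero-fill edges
    K = 2 * p + 1
    for i in range(M):
        for j in range(N):
            yield padded[i:i + K, j:j + K, :], labels[i, j]
```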

The divided data are fed into the network as 9 × 9 × 30 patches. First, the cascaded 3D-CNN uses 3 × 3 × C convolution kernels, where C decreases as the network deepens, and separate spatial and channel paths are used for the 3D-CNNs serving SAM and CAM; the specific network parameters of the channel and spatial attention mechanisms are shown in Table 1. A 2D-CNN is then added after the 3D-CNN so that the network acquires the spatial information of the feature maps again. Using a single 2D-CNN layer both avoids overfitting and ensures that the network can still learn the spatial feature information of the hyperspectral image.
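A minimal PyTorch sketch of this hybrid 3D-to-2D stem is given below, assuming a 9 × 9 × 30 input patch; the channel widths and the spectral kernel depths are assumptions, not the paper's exact configuration.

```python
# Sketch of the hybrid 3D->2D convolutional stem: three 3D convolutions with
# 3x3 spatial kernels whose spectral depth shrinks with network depth, then a
# single 2D layer that re-extracts spatial features. Widths are assumptions.
import torch
import torch.nn as nn

class HybridStem(nn.Module):
    def __init__(self, bands: int = 30):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, (7, 3, 3), padding=(0, 1, 1)), nn.BatchNorm3d(8), nn.ReLU(),
            nn.Conv3d(8, 16, (5, 3, 3), padding=(0, 1, 1)), nn.BatchNorm3d(16), nn.ReLU(),
            nn.Conv3d(16, 32, (3, 3, 3), padding=(0, 1, 1)), nn.BatchNorm3d(32), nn.ReLU(),
        )
        remaining = bands - 6 - 4 - 2           # spectral size left after the 3D stack
        self.conv2d = nn.Sequential(            # one 2D layer avoids overfitting
            nn.Conv2d(32 * remaining, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )

    def forward(self, x):                       # x: (batch, 1, 30, 9, 9)
        f = self.conv3d(x)                      # (batch, 32, remaining, 9, 9)
        f = f.flatten(1, 2)                     # fold the spectral dim into channels
        return self.conv2d(f)                   # (batch, 64, 9, 9)
```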

Table 1: Number of pixels per feature category in IP

#    Label                          Total   Train
1    Alfalfa                        46      5
2    Corn-notill                    1428    143
3    Corn-mintill                   830     83
4    Corn                           237     24
5    Grass-pasture                  483     49
6    Grass-trees                    730     73
7    Grass-pasture-mowed            28      3
8    Hay-windrowed                  478     48
9    Oats                           20      2
10   Soybean-notill                 972     98
11   Soybean-mintill                2455    246
12   Soybean-clean                  593     60
13   Wheat                          205     21
14   Woods                          1265    127
15   Buildings-Grass-Trees-Drives   386     39
16   Stone-Steel-Towers             93      10
     Total                          10249   1031
Table 2: Number of pixels per feature category in PU

#   Label                  Total   Train
1   Asphalt                6631    332
2   Meadows                18649   933
3   Gravel                 2099    105
4   Trees                  3064    154
5   Painted metal sheets   1345    68
6   Bare Soil              5029    252
7   Bitumen                1330    67
8   Self-Blocking Bricks   3682    185
9   Shadows                947     48
    Total                  42776   2144

2.2 Attention Mechanism Module

The Spatial Attention Mechanism (SAM) aims to enhance the feature representation of key regions. In essence, it transforms the spatial information of the original image into another space while preserving the key information: a spatial transformation module generates a weight mask for each location and weights the output accordingly, enhancing the target regions of interest while suppressing irrelevant background regions.

The Channel Attention Mechanism (CAM) models the correlation between channels: the network automatically learns the importance of each feature channel and assigns each channel a weight factor, reinforcing important features and suppressing unimportant ones.

On their own, 3D-CNNs and 2D-CNNs do not achieve good classification accuracy when processing hyperspectral images with little available data. Although a 3D-CNN can directly acquire the spatial-spectral information of image features, the acquired features reveal nothing about the internal structure of the feature map. To address this, SAM and CAM are introduced into the 3D-CNN: CAM makes the network focus on the more discriminative channels while suppressing unnecessary channel information, and likewise SAM makes the network focus more on spatial texture information. In this paper, we modify the original spatial attention mechanism and the traditional channel attention mechanism; the comparison is shown in Figure 2.

Figure 2: SAM/CAM of the variant residual structure

A single attention branch containing only one or two convolutions learns channel and spatial information less well than one with more convolutions, but because of how convolution operates, using more convolutions for image learning makes overfitting more likely. This paper therefore uses three 3D-CNNs for channel and spatial learning of the feature maps, and adds a BN layer after each 3D-CNN to prevent over- and under-learning of the feature information.
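The following sketch shows one way to realize the variant branches just described: three 3D convolutions, each followed by BN, producing a mask that re-weights the input through a residual connection as suggested by Figure 2. The kernel shapes and the sigmoid gating are assumptions.

```python
# Sketch of the variant SAM/CAM branches: 3 x (Conv3d + BN) and a residual
# re-weighting. SAM uses spatial kernels, CAM spectral ones (an assumption).
import torch
import torch.nn as nn

class Attention3D(nn.Module):
    """Shared skeleton for the variant CAM/SAM branches of Figure 2."""
    def __init__(self, channels: int, spatial: bool):
        super().__init__()
        k = (1, 3, 3) if spatial else (3, 1, 1)       # SAM: spatial; CAM: spectral
        pad = (0, 1, 1) if spatial else (1, 0, 0)
        self.branch = nn.Sequential(
            nn.Conv3d(channels, channels, k, padding=pad), nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, k, padding=pad), nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, k, padding=pad), nn.BatchNorm3d(channels),
        )

    def forward(self, x):                             # x: (batch, C, D, H, W)
        mask = torch.sigmoid(self.branch(x))          # per-position / per-channel weights
        return x * mask + x                           # variant residual re-weighting
```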

2.3 Variant Transformer Model

The conventional Transformer processes image features as a one-dimensional sequence. To process a 2D image, it reshapes the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_P \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each patch, and $N = HW/P^2$ is the number of patches, which also serves as the effective input sequence length of the Transformer. The Transformer uses a constant latent vector size D throughout its layers; each patch is flattened and mapped to D dimensions with a trainable linear projection (Eq. 2). The output of this projection is referred to as the patch embedding in this paper.

The Transformer prepends a learnable embedding to the sequence of embedded patches ($z_0^0 = x_{class}$), whose state at the output of the encoder ($z_L^0$) serves as the image representation y (Eq. 5). A classification head is attached to $z_L^0$: during pre-training it is implemented as an MLP with one hidden layer, and during fine-tuning as a single linear layer. Position information is added to the patch embeddings to preserve the positions of the feature map, and the class token is prepended to the feature sequence before the embedding sequence is fed to the encoder.

The Transformer encoder consists of alternating Multihead Self-Attention (MSA) [12] and MLP (FFN) blocks (Eqs. 3, 4). Layernorm (LN) is applied before every MSA and FFN block, and a residual connection after every block [13].

The MLP contains two layers with GELU nonlinearity.

\begin{equation} z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D} \end{equation}
(2)
\begin{equation} z'_\ell = MSA(LN(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L \end{equation}
(3)
\begin{equation} z_\ell = MLP(LN(z'_\ell)) + z'_\ell, \quad \ell = 1 \ldots L \end{equation}
(4)
\begin{equation} y = LN(z_L^0) \end{equation}
(5)
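A minimal pre-LN encoder block matching Eqs. (3)-(5) can be sketched as follows, using PyTorch's built-in nn.MultiheadAttention as a stand-in for the MSA of Eq. (9); the width, head count, and MLP ratio are illustrative.

```python
# Pre-LN Transformer encoder block: LN before each sub-block (Eqs. 3-4),
# residual connections after each, two-layer GELU MLP.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d: int = 64, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d),
        )

    def forward(self, z):                                  # z: (batch, N + 1, d)
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # Eq. (3)
        return self.mlp(self.ln2(z)) + z                   # Eq. (4)
```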

Self-attention involves three elements: query q, key k, and value v. For each element of the input sequence $z \in \mathbb{R}^{N \times D}$, a weighted sum over all values v in the sequence is computed. The attention weight $A_{ij}$ is based on the pairwise similarity between two elements of the sequence, i.e., between their query $q^i$ and key $k^j$ representations.

\begin{equation} [q, k, v] = z U_{qkv}, \quad U_{qkv} \in \mathbb{R}^{D \times 3 D_h} \end{equation}
(6)
\begin{equation} A = \mathrm{softmax}\left(q k^T / \sqrt{D_h}\right), \quad A \in \mathbb{R}^{N \times N} \end{equation}
(7)
\begin{equation} SA(z) = A v \end{equation}
(8)

Multihead self-attention (MSA) is an extension of SA that runs k self-attention operations, called "heads", in parallel and projects their concatenated outputs. To keep the computation and parameter count constant when k changes, $D_h$ (Eq. 7) is usually set to D/k.

\begin{equation} MSA(z) = [SA_1(z);\, SA_2(z);\, \cdots;\, SA_k(z)]\, U_{msa}, \quad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D} \end{equation}
(9)
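For completeness, Eqs. (6)-(9) translate almost line by line into code; the sketch below assumes a batched sequence z of shape (batch, N, D) with $D_h = D/k$.

```python
# Direct transcription of Eqs. (6)-(9): single-matrix qkv projection, scaled
# dot-product attention per head, then concatenation and output projection.
import torch
import torch.nn as nn

class MSA(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_h = heads, dim // heads
        self.U_qkv = nn.Linear(dim, 3 * dim, bias=False)   # Eq. (6)
        self.U_msa = nn.Linear(dim, dim, bias=False)       # Eq. (9)

    def forward(self, z):                                  # z: (batch, N, D)
        B, N, D = z.shape
        qkv = self.U_qkv(z).reshape(B, N, 3, self.heads, self.d_h)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, heads, N, d_h)
        A = torch.softmax(q @ k.transpose(-2, -1) / self.d_h ** 0.5, dim=-1)  # Eq. (7)
        out = (A @ v).transpose(1, 2).reshape(B, N, D)     # Eq. (8), heads concatenated
        return self.U_msa(out)
```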

To give the Transformer better learning ability, this paper improves the information fed into the ViT module before the sequences are passed to the Transformer. In a natural image, neighboring sequence blocks tend to have similar characteristics; in a hyperspectral image, however, the spatial-spectral features at neighboring positions carry different information, so a sequence block and its neighbors are not strongly correlated. This paper therefore adopts a grouped-transfer method that passes the feature sequences into the Transformer gradually, as shown in Figure 3.

Figure 3: Grouped transfer structure diagram
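One possible reading of the grouped transfer in Figure 3, detailed in the next paragraph (windows of four adjacent sequence blocks, advanced one block at a time), is sketched here; the window size of four comes from the text, while the tensor layout is an assumption.

```python
# Sketch of the grouped transfer of Figure 3: sliding windows of four adjacent
# sequence blocks, shifted by one block each step, so every pair of neighbors
# appears together in some group.
import torch

def grouped_transfer(seq: torch.Tensor, group: int = 4) -> torch.Tensor:
    """(batch, N, D) -> (batch, N - group + 1, group, D) overlapping groups."""
    return seq.unfold(1, group, 1).permute(0, 1, 3, 2)
```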

In this paper, four adjacent blocks are passed into the Transformer at a time, and the subsequent sequences are passed in progressively, shifting by one sequence block each time, so that every adjacent block can be learned within the Transformer model. Given a spectral feature (a pixel of the HS image) $y = [y_1, y_2, \cdots, y_m] \in \mathbb{R}^{1 \times m}$, the feature embeddings A obtained by a classic Transformer are formulated as

\begin{equation} A = \omega y \end{equation}
(10)
where $\omega \in \mathbb{R}^{d \times 1}$ denotes a linear transformation applied identically to all bands of the spectral signature, and $A \in \mathbb{R}^{d \times m}$ collects the output features. The proposed grouped-sequence block instead learns feature embeddings from neighboring bands, so the module can be modeled as
\begin{equation} \dot{A} = W Y = W h(y) \end{equation}
(11)
where $W \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{n \times m}$ are the group representations of the variables W and Y, respectively, and n denotes the number of adjacent bands. W can simply be regarded as one layer of the network, optimized by updating the whole network. The function h(·) denotes the overlapping grouping operation on the variables:
\begin{equation} Y = h(y) = [{y}_1, \cdots ,{y}_q, \cdots ,{y}_m] \end{equation}
(12)
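Eqs. (10)-(12) can be sketched as follows, assuming n neighboring bands per overlapping group and zero-padding at the spectral edges; the group size and the padding scheme are assumptions.

```python
# Sketch of Eqs. (10)-(12): h(.) forms overlapping groups of n neighboring
# bands, and W (one layer of the network) maps each group to a d-dim embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedSpectralEmbedding(nn.Module):
    def __init__(self, n: int = 4, d: int = 64):
        super().__init__()
        self.n = n
        self.W = nn.Linear(n, d, bias=False)           # W in R^{d x n}

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        """y: (batch, m) spectral vector -> (batch, m, d) grouped embeddings."""
        half = self.n // 2
        padded = F.pad(y, (half, self.n - 1 - half))   # zero-pad the band edges
        Y = padded.unfold(1, self.n, 1)                # (batch, m, n) groups, Eq. (12)
        return self.W(Y)                               # Eq. (11)
```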

3 EXPERIMENTAL ANALYSIS

3.1 Hardware Environment

All experiments in this paper were run on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz with an NVIDIA GeForce RTX 3060 and 16 GB RAM. The implementation is based on Windows 10, using the PyTorch framework with Python 3.6.14.

3.2 Data and Evaluation Standards

To verify the effectiveness of the proposed method, experiments are conducted on two publicly available hyperspectral image datasets: Indian Pines (IP) and Pavia University (PU).

The IP dataset is the earliest test dataset for hyperspectral image classification. It was imaged in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over an Indian pine site in Indiana, USA, and a 145 × 145 crop was annotated for classification testing. The spatial resolution is 17 m/pixel, with 200 spectral bands from 400 nm to 2500 nm after removing 24 noise bands. It contains 10249 classifiable pixels covering 16 land-cover classes such as forest, corn, oats, and soybean. Because the amount of data is insufficient, the class boundaries in the IP dataset are unclear and cannot be mapped accurately.

The PU dataset was collected with a spectral imager over the University of Pavia, Italy. It has a spatial size of 610 × 340 pixels at 1.3 m/pixel resolution, 103 spectral bands between 430 nm and 860 nm, and 42776 classifiable pixels covering nine land-cover classes such as trees, bricks, and gravel. The high spatial resolution of scattering-prone objects, such as trees and sidewalks, makes CNN-based feature learning difficult.

Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient are used as evaluation criteria to measure the classification performance of the model.
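These three metrics can be computed from a confusion matrix as in the sketch below; the use of scikit-learn is an illustrative choice.

```python
# OA, AA, and Kappa from true and predicted labels (sketch using sklearn).
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean of per-class accuracies
    kappa = cohen_kappa_score(y_true, y_pred)   # chance-corrected agreement
    return oa, aa, kappa
```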

10% of the IP dataset is extracted as the training set; the number of pixels per class is shown in Table 1.

5% of the PU dataset is extracted as the training set; the number of pixels per class is shown in Table 2.

3.3 Experimental Results and Analysis

Several common network models, namely 2D-CNN, ContextualNet, PyResNet, ResNet, SPRVIT, and Transformer, were selected for comparative analysis to verify the effectiveness of the proposed method. To ensure fairness, every experiment was conducted in the same environment with identical parameter settings and input sizes. The experimental results on the IP and PU datasets are listed in Table 3 and Table 4.

Table 3: Classification accuracy of different classification models on the IP dataset

Accuracy  2D-CNN  ContextualNet  PyResNet  ResNet  SPRVIT  Transformer  3D-AMSTN
OA        0.9099  0.9393         0.9227    0.8964  0.9827  0.5946       0.9916
AA        0.8910  0.9253         0.9327    0.8199  0.9768  0.5776       0.9933
Kappa     0.8969  0.9304         0.9116    0.8818  0.9803  0.5306       0.9904
Table 4: Classification accuracy of different classification models on the PU dataset

Accuracy  2D-CNN  ContextualNet  PyResNet  ResNet  SPRVIT  Transformer  3D-AMSTN
OA        0.8743  0.8430         0.8212    0.9617  0.9670  0.7713       0.9959
AA        0.8271  0.7750         0.8458    0.9578  0.9302  0.7071       0.9931
Kappa     0.8324  0.7896         0.7535    0.9489  0.9559  0.6952       0.9945

The overall results on the two datasets show that 3D-AMSTN achieves high classification accuracy, with overall accuracy above 99% on both. On the IP dataset, the proposed 3D-AMSTN is 39.70 percentage points higher in OA, 41.57 in AA, and 45.98 in Kappa than the traditional Transformer network model: the traditional Transformer has insufficient capacity to acquire local information, whereas this paper improves the information fed into the ViT module so that the Transformer learns better. Compared with the spatially driven 2D-CNN, OA is 8.17 points higher, AA 10.23 points, and Kappa 9.35 points: a CNN captures hyperspectral image features only in a local context, while the variant Transformer model proposed here lets the network obtain global feature information over the feature sequences, improving classification accuracy. Compared with SPRVIT, the best-performing of the other classification models, OA, AA, and Kappa improve by 0.89, 1.65, and 1.01 points, respectively. On the PU dataset, the proposed 3D-AMSTN also performs well: OA, AA, and Kappa improve by 12.16, 16.60, and 16.21 points over the spatially driven 2D-CNN, and by 2.89, 6.29, and 3.86 points over the best-performing SPRVIT. The proposed 3D-AMSTN outperforms the other classification models on all performance metrics, fully demonstrating its effectiveness in improving classification accuracy.

Figure 4 and Figure 5 show the classification maps of the proposed method and the comparison methods on the IP and PU datasets, respectively. It is evident from both figures that the proposed method has fewer misclassified points. The 2D-CNN and Transformer-based methods deviate most from the reference maps, and their results are clearly inferior to those of the proposed method; the remaining methods classify better, but their results are still slightly inferior to those of the proposed method.

Figure 4: Classification results of different models on the IP dataset
Figure 5: Classification results of different models on the PU dataset

In addition, when the land-cover classes in the images are easy to distinguish, the classification results of the proposed method are comparable to those of the other deep learning methods; for example, the Oats and Grass-trees classes in the IP dataset and the Painted metal sheets and Shadows classes in the PU dataset are easy to separate, and all methods classify them well. For classes with similar features that are prone to misclassification, the accuracy of the proposed method is also improved.

3.4 Ablation Experiment

To verify the rationality of each module of 3D-AMSTN, the effectiveness of enabling different modules is demonstrated experimentally, taking the IP dataset as an example; the comparison results are shown in Table 5.

Table 5: Comparison of classification accuracy of 3D-AMSTN modules on the IP dataset

Dataset        SAM  CAM  Transformer  OA      AA      Kappa
Indian Pines   ✓    ×    ×            0.9805  0.9673  0.9742
               ×    ✓    ×            0.9880  0.9854  0.9868
               ×    ×    ✓            0.9759  0.9777  0.9677

As the table shows, adding the channel attention mechanism and the spatial attention mechanism each brings a significant gain for hyperspectral classification, and adding the Transformer lets the network classify hyperspectral images better in terms of global information; with the three methods collaborating, the classification accuracy of the network exceeds 99%.

4 CONCLUSION

Current CNN-based hyperspectral image classification models suffer from insufficient extraction of spatial-spectral features and fail to mine and represent the sequence attributes of spectral features well. This paper introduces a variant Transformer model on top of an improved 3D-CNN and 2D-CNN hybrid: within the CNN, complementary spatial-spectral and spectral information is combined through 3D and 2D convolutions, and an attention module is added to strengthen spatial texture features. Combined with the Transformer skip connections, the lower layers can better learn the features of the upper layers.

The experimental results show that the overall classification accuracy, average classification accuracy, and Kappa coefficient of the proposed method reach 99.16%, 99.33%, and 99.04% on Indian Pines and 99.59%, 99.31%, and 99.45% on Pavia University, respectively, better than the other classification methods, which confirms the effectiveness of the method.

REFERENCES

  • Kim J, Kawamura Y, Nishikawa O, et al. A system of the granite weathering degree assessment using hyperspectral image and CNN[J]. International Journal of Mining, Reclamation and Environment, 2022, 36(5): 368-380.
  • Tyler N, Gabriel D P, David J M, et al. The influence of aerial hyperspectral image processing workflow on Nitrogen uptake prediction accuracy in maize[J]. Remote Sensing, 2022, 14(1): 132.
  • Zhu C, Ding J, Zhang Z, et al. Exploring the potential of UAV hyperspectral image for estimating soil salinity: Effects of optimal band combination algorithm and random forest[J]. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 2022, 279: 121416.
  • Jayapriya S, MM R P, Hemashri R. Hyperspectral image classification by using K-nearest neighbor algorithm[J]. International Journal of Psychosocial Rehabilitation, 24(5): 5068-5074.
  • Song W, Li S, Kang X, et al. Hyperspectral image classification based on KNN sparse representation[C]//2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016: 2411-2414.
  • Harikiran J J H. Hyperspectral image classification using support vector machines[J]. IAES International Journal of Artificial Intelligence, 2020, 9(4): 684.
  • Pathak D K, Kalita S K, Bhattacharya D K. Hyperspectral image classification using support vector machine: a spectral spatial feature based approach[J]. Evolutionary Intelligence, 2021: 1-15.
  • Ye Z, Bai L, He M Y. Review of spatial-spectral feature extraction for hyperspectral image[J]. Journal of Image and Graphics, 2021, 26(8): 1737-1763. (in Chinese)
  • Wang H, Zhang J J, Li Y Y, et al. Hyperspectral image classification based on 3D convolution joint attention mechanism[J]. Infrared Technology, 2020, 42(3): 264-271. (in Chinese)
  • Roy S K, Krishna G, Dubey S R, et al. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification[J]. IEEE Geoscience and Remote Sensing Letters, 2019, 17(2): 277-281.
  • Hong D, Han Z, Yao J, et al. SpectralFormer: Rethinking hyperspectral image classification with transformers[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 1-15.
  • Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
  • Wang Q, Li B, Xiao T, et al. Learning deep transformer models for machine translation[J]. arXiv preprint arXiv:1906.01787, 2019.

FOOTNOTE

Address all correspondence to 911190425@qq.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

AIPR 2023, September 22–24, 2023, Xiamen, China

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 979-8-4007-0767-4/23/09…$15.00.

DOI: https://doi.org/10.1145/3641584.3641609