Hyperspectral image accurate classification method based on improved multi-scale attention and transformer network

By improving the multi-scale attention and Transformer network, and combining the dynamic spatial attention unit, multi-kernel fusion attention module, and cross-attention Swin Transformer module, the problems of feature redundancy and information dispersion in hyperspectral image classification are solved, multi-scale feature extraction and deep fusion are realized, and classification accuracy is improved.

CN120047816B · Active · Publication Date: 2026-06-23 · HAINAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HAINAN UNIV
Filing Date
2024-12-30
Publication Date
2026-06-23


Abstract

The application discloses a method for accurate classification of hyperspectral images based on improved multi-scale attention and a Transformer network. First, spatial information is extracted by a dynamic spatial attention unit; then, multi-scale features are extracted by a multi-kernel fusion attention module; finally, information fusion is enhanced by a cross-attention Swin Transformer module. Specifically, the features processed by multi-scale attention are input into LayerNorm; a multi-head attention mechanism is then introduced so that the network can process features of different scales; multi-head cross-attention is introduced so that the network can fuse features of different scales; and an MLP is added after the multi-head attention mechanism to integrate and extract features. The application addresses the problem that the multi-head self-attention mechanism in the prior art may cause feature redundancy and excessive information dispersion.

Description

Technical Field

[0001] This invention belongs to the field of deep learning in remote sensing image classification technology, specifically involving a method for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks.

Background Technology

[0002] Hyperspectral remote sensing technology has been widely applied in recent years in fields such as earth science, environmental monitoring, agriculture, and urban planning. Compared with traditional multispectral images, hyperspectral images (HSI) can provide hundreds of consecutive bands, offering richer information for more detailed object identification and classification. In HSI, each pixel represents the reflectance spectral characteristics of the target surface across multiple bands, giving HSI classification (HSIC) significant advantages in many applications, particularly in land cover classification and environmental monitoring.

[0003] HSIC (hyperspectral image classification) is a technique for accurately classifying ground features in images based on the spectral characteristics of pixels. Because HSI contains a large amount of band information, traditional image classification methods often suffer from the curse of dimensionality and redundant information when processing HSI, resulting in poor classification accuracy. In recent years, with the rapid development of machine learning and deep learning technologies, more and more advanced algorithms have been introduced into HSIC. For example, Support Vector Machines (SVM), Random Forests (RF), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM) have all brought significant progress in the accuracy and efficiency of HSI classification. Deep learning methods in particular, with their feature extraction and end-to-end training capabilities, have become an important tool for solving the HSIC problem.

[0004] Huang et al. proposed a method for classifying and identifying HSI of textile fibers using a one-dimensional convolutional neural network (1D-CNN). Zhao et al. used a two-dimensional convolutional neural network (2D-CNN) to extract high-level spatial features of HSI and stacked spatial and spectral features to achieve HSIC. Kanthi et al. proposed a 3D deep feature extraction CNN model that can simultaneously utilize spectral and spatial information in HSI for HSIC and achieve good classification performance. Ge et al. combined 2D-CNN and 3D-CNN to design an HSI classification model and achieved good results on four publicly available HSI datasets.

[0005] The Transformer model was originally developed for Natural Language Processing (NLP) tasks, where its self-attention mechanism and excellent long-range dependency modeling capabilities brought it great success. The self-attention mechanism dynamically focuses on key components of the data, demonstrating powerful feature extraction capabilities in NLP, Computer Vision (CV), and other fields. With the introduction of the Vision Transformer (ViT), the Transformer model began to be applied to image classification tasks, exhibiting superior performance. Unlike traditional CNNs, the Transformer can directly capture global features of an image without relying on convolutional kernels, making it particularly suitable for processing high-dimensional data. In HSIC, the introduction of the Transformer provided a new approach to solving the spatial-spectral feature fusion problem.

[0006] Mei et al. proposed a Group-Aware Hierarchical Transformer (GAHT) for HSIC, addressing the problem of over-dispersion in feature extraction from multi-head self-attention. Yang et al. proposed embedding convolutional operations into the Transformer structure to capture subtle spectral differences and convey local spatial context information, improving classification performance. To address the difficulty of CNNs in extracting deep semantic features, Sun et al. proposed a spectral-spatial feature tokenization transformer (SSFTT) model for extracting spectral-spatial features and high-level semantic features from HSI, achieving good classification results on three standard datasets. Zhang et al. proposed an HSIC model combining multiple attention mechanisms with a Transformer, addressing the issue that networks are easily influenced by surrounding redundant information during training, which leads to inaccurate feature extraction and poor model generalization. They first used spatial attention (SA) and channel attention (CA) to focus on the more important information, then used a tokenizer module to build semantic-level representations of different land-cover categories, and finally used a Transformer encoder module for deep semantic feature extraction. The results show that the model performs well in extracting spatial-spectral features of HSI and in capturing deep semantics. To improve the performance of traditional HSIC tasks, Huang et al. proposed a Transformer based on a spectral-spatial vision model (SS-VFMT). Building upon SS-VFMT, they further proposed a Transformer based on spectral-spatial visual language (SS-VLFMT) to address the generalized zero-shot classification task, providing a new approach to HSI zero-shot classification. Guo et al. further utilized the multi-attention mechanism in the Swin Transformer to fully leverage rich discriminative information, designing an end-to-end network that further enhanced HSI classification performance.

[0007] While these methods have further improved the performance of the models, some difficult challenges remain, as follows.

[0008] 1. Hyperspectral images typically contain features at different scales. Traditional CNNs excel in multi-scale feature extraction, while the native Transformer has limited performance in this area. Although the Swin Transformer achieves multi-scale modeling to some extent through its hierarchical window attention mechanism, it still falls short in multi-scale feature fusion for hyperspectral images.

2. Hyperspectral images contain rich spatial and spectral information, which are highly correlated. However, achieving deep fusion of spatial and spectral features within the Transformer architecture remains a challenge.

3. Multi-head self-attention mechanisms introduce powerful feature representation capabilities in hyperspectral image classification, but too many attention heads can lead to feature redundancy and excessive information dispersion, affecting the model's discriminative ability. For example, GAHT has made improvements to address this issue, but feature dispersion and redundancy still exist.

Summary of the Invention

[0009] The purpose of this invention is to provide a method for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks, which solves the problem that multi-head self-attention mechanisms in the prior art may lead to feature redundancy and excessive information dispersion.

[0010] The technical solution adopted in this invention is a hyperspectral image accurate classification method based on improved multi-scale attention and Transformer networks, specifically following these steps:

[0011] Step 1: Spatial information extraction using the dynamic spatial attention unit;

[0012] Step 2: Multi-scale feature extraction using the multi-kernel fusion attention module;

[0013] Step 3: Information fusion enhancement using the cross-attention Swin Transformer module.

[0014] The invention is further characterized in that,

[0015] Step 1 is specifically performed according to the following steps:

[0016] Step 1.1: Normalize the input;

[0017] Step 1.2: Extract features from the normalized data using convolution;

[0018] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0019] Step 1.3 is specifically performed as follows:

[0020] First, the input features x ∈ R^{B×C×L} are given, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively. The input features are then standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0021] z = Conv1(LayerNorm(x)) (1)

[0022] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0023] a,b=chunk(z,2) (2)

[0024] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0025] Where w represents dynamic weights, b is subjected to weighted convolution operation, and spatial features are extracted by combining a, as shown in formula (4):

[0026] y=b⊙DWConv1(a⊙w) (4)

[0027] Where y represents the features extracted after dynamic weighting, ⊙ represents the element-wise dot product operation, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0028] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0029] y′=Conv2(y)⊙scale+x (5)

[0030] Where y' represents the extracted spatial features.

[0031] Step 2 is specifically performed as follows:

[0032] Step 2.1: Normalize the input;

[0033] Step 2.2: Extract features from the normalized data using convolutional kernels of different scales;

[0034] Step 2.3: Integrate features at different scales.

[0035] Step 2.1 is specifically performed as follows:

[0036] First, perform layer normalization on the input x, and calculate as shown in formula (6).

[0037] x′=LayerNorm(x) (6)

[0038] Where x' represents the normalized result;

[0039] Step 2.2 is specifically performed as follows:

[0040] Parallel multi-scale feature extraction is performed through convolution with kernels of different scales, and the specific calculation is shown in formula (7):

[0041] x_k = Conv_k(x′), k ∈ {3, 5, 7} (7)

[0042] Where x_k represents the features extracted by the convolution kernel of size k.

[0043] All the features extracted above are concatenated, and channel fusion is then performed. The specific calculation is shown in formula (8):

[0044] x_{fused} = Fusion(concat(x_3, x_5, x_7)) (8)

[0045] Where x_{fused} represents the fused features;

[0046] Finally, the spectral features are obtained by combining the scaling factor and the residual connection, as shown in formula (9):

[0047] y = x_{fused} ⊙ scale + x (9)

[0048] Where y represents the extracted spectral features.

[0049] Step 2.3 is specifically performed as follows:

[0050] The Hybrid Scale Attention Module (HSA) extracts multi-scale spatial-spectral features through the Dynamic Spatial Attention Unit (DSAU) and the Multi-Kernel Fusion Attention Module (MKFA), respectively, capturing information at different scales, as shown in Equation (10):

[0051] y_{spectral} = DSAU(x), y_{spatial} = MKFA(x) (10)

[0052] Where y_{spectral} represents the spectral features extracted by the DSAU module, and y_{spatial} represents the spatial features extracted by the MKFA module;

[0053] Then, the two features are weighted and fused to output the final feature, as shown in formula (11):

[0054] y = y_{spectral} + y_{spatial} (11)

[0055] Where y represents the final extracted feature.

[0056] Step 3 is specifically performed according to the following steps:

[0057] Step 3.1: Input the features processed by multi-scale attention into LayerNorm;

[0058] Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales;

[0059] Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales.

[0060] Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features;

[0061] Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

[0062] Step 3.1 is specifically performed as follows:

[0063] First, the input feature x after multi-scale attention is standardized so that the feature distribution is similar in each channel, which facilitates subsequent feature extraction operations, as shown in formula (12):

[0064] x′ = LayerNorm(x) (12)

[0065] Where x′ denotes the standardized features, x ∈ R^{L×B×D}, L represents the length of the feature sequence, B represents the batch size, and D is the feature dimension. The standardized features have the same dimensions as the input x.

[0066] Step 3.2 is specifically performed as follows:

[0067] In the multi-head self-attention module, the long-range dependencies of the input features are calculated through the self-attention mechanism. The calculation process of self-attention is shown in formula (13):

[0068] Attention(Q, K, V) = Softmax(QK^T / √d_k) V (13)

[0069] Where Q, K, and V represent the query, key, and value vectors, respectively, which are obtained through the linear projection matrices W^Q, W^K, and W^V, and d_k is the feature dimension of each attention head, calculated as shown in Equation (14):

[0070] d_k = D / num_heads (14)

[0071] After MHSA, the output is calculated as shown in formula (15):

[0072] MHSA(x) = Concat(head_1, …, head_h) W^O (15)

[0073] Where head_i is the result of formula (13) for the i-th attention head, h = num_heads, and W^O ∈ R^{D×D} is the output projection matrix. Finally, the feature representation capability is enhanced through a residual connection, as shown in Equation (16):

[0074] x1=x+MHSA(LayerNorm(x)) (16).

[0075] Step 3.3 is specifically performed according to the following steps:

[0076] The proposed cross-attention module aims to further integrate information between different features. By applying the cross-attention mechanism to the standardized input features, the calculation is shown in Equation (17):

[0077] x2=x1+CrossAttention(LayerNorm(x1),LayerNorm(x1),LayerNorm(x1)) (17)

[0078] x1 represents the features extracted using the multi-head attention mechanism, and x2 represents the features extracted using the cross-attention mechanism.

[0079] CrossAttention is calculated in the same way as MHSA, but it is used to capture the interaction between features at different levels, thereby improving the fusion of spatial-spectral information of features.

[0080] Step 3.4 is specifically performed as follows:

[0081] The MLP module is used to further model the nonlinear relationships of features, and the calculation process is shown in formula (18):

[0082] MLP(x2)=W2·Dropout(GELU(W1·x2+b1))+b2 (18)

[0083] MLP(x2) represents the features extracted by MLP;

[0084] Where W1 ∈ R^{D×mlp_dim} and W2 ∈ R^{mlp_dim×D} are weight matrices, b1 ∈ R^{mlp_dim} and b2 ∈ R^{D} are bias terms, GELU is the activation function, and Dropout is a random deactivation operation to prevent overfitting. Finally, the output of the MLP is shown in Equation (19):

[0085] y=x2+MLP(LayerNorm(x2)) (19).

[0086] y represents the spectral-spatial features of the final extracted hyperspectral image, used for accurate classification of hyperspectral images.

[0087] The beneficial effects of this invention are as follows. Based on the method for accurate hyperspectral image classification using improved multi-scale attention and Transformer networks, the MSA2T-Net framework is proposed. This is the first hyperspectral image classification model that combines multi-scale attention mechanisms, dynamic spatial attention, and cross-attention, enabling simultaneous multi-scale modeling, deep fusion of spatial and spectral features, and feature redundancy optimization. Extensive experiments were conducted on four representative hyperspectral datasets (Pavia, PaviaU, Houston2013, and Salinas), and the proposed MSA2T-Net significantly outperforms existing state-of-the-art methods in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient.

Attached Figure Description

[0088] Figure 1 shows the MSA2T-Net network model, which combines multi-scale attention with an improved Transformer;

[0089] Figure 2 shows the DSAU model structure;

[0090] Figure 3 shows the MKFA model structure;

[0091] Figure 4 shows the CASTA model structure;

[0092] Figure 5 shows the Pavia dataset;

[0093] Figure 6 shows the Houston2013 dataset;

[0094] Figure 7 shows the PaviaU dataset;

[0095] Figure 8 shows the Salinas dataset;

[0096] Figure 9 shows the F1 scores on different datasets;

[0097] Figure 10 shows the results for different training-to-test ratios.

Detailed Implementation

[0098] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0099] With reference to Figure 1, the method of this invention for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks is specifically performed according to the following steps:

[0100] Step 1: Spatial information extraction using the Dynamic Spatial Attention Unit (DSAU);

[0101] With reference to Figure 2, Step 1 is specifically performed according to the following steps:

[0102] Step 1.1: Normalize the input;

[0103] Step 1.2: Extract features from the normalized data using convolution;

[0104] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0105] Step 1.3 is specifically performed as follows:

[0106] Dynamic Spatial Attention Unit (DSAU)

[0107] First, the input features x ∈ R^{B×C×L} are given, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively. The input features are then standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0108] z = Conv1(LayerNorm(x)) (1)

[0109] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0110] a,b=chunk(z,2) (2)

[0111] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0112] Where w represents dynamic weights, b is subjected to weighted convolution operation, and spatial features are extracted by combining a, as shown in formula (4):

[0113] y=b⊙DWConv1(a⊙w) (4)

[0114] Where y represents the features extracted after dynamic weighting, ⊙ represents the element-wise dot product operation, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0115] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0116] y′=Conv2(y)⊙scale+x (5)

[0117] Where y' represents the extracted spatial features.
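The computation in formulas (1) to (5) can be illustrated with a minimal NumPy sketch. It is not the patented implementation. It assumes that Conv1 and Conv2 are kernel-size-1 convolutions, that Conv1 in formula (1) doubles the channel count so that chunk returns two C-channel halves, that LayerNorm normalizes over the channel axis without learnable parameters, that AdaptiveAvgPool pools to length 1, that Softmax acts over channels, and that DWConv1 is a depthwise convolution with kernel size 3. All helper names, weight shapes, and the random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumption: normalize over the channel axis of a (B, C, L) tensor.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pointwise_conv(x, weight):
    # Kernel-size-1 convolution: weight (C_out, C_in), x (B, C_in, L).
    return np.einsum("oc,bcl->bol", weight, x)

def depthwise_conv(x, kernels):
    # One 1D filter per channel, zero padded so the length is preserved.
    # kernels: (C, K) with odd K; x: (B, C, L).
    length = x.shape[2]
    k_size = kernels.shape[1]
    pad = k_size // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for k in range(k_size):
        out += kernels[:, k][None, :, None] * xp[:, :, k:k + length]
    return out

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dsau(x, w_expand, w_gate, dw_kernels, w_out, scale):
    z = pointwise_conv(layer_norm(x), w_expand)        # formula (1): (B, 2C, L)
    a, b = np.split(z, 2, axis=1)                       # formula (2)
    pooled = a.mean(axis=2, keepdims=True)              # AdaptiveAvgPool to length 1
    gate = np.maximum(pointwise_conv(pooled, w_gate), 0.0)
    w = softmax(gate, axis=1)                           # formula (3)
    y = b * depthwise_conv(a * w, dw_kernels)           # formula (4)
    return pointwise_conv(y, w_out) * scale + x         # formula (5)

rng = np.random.default_rng(0)
B, C, L = 2, 4, 9
x = rng.standard_normal((B, C, L))
params = {
    "w_expand": rng.standard_normal((2 * C, C)) * 0.1,
    "w_gate": rng.standard_normal((C, C)) * 0.1,
    "dw_kernels": rng.standard_normal((C, 3)) * 0.1,
    "w_out": rng.standard_normal((C, C)) * 0.1,
    "scale": np.ones((1, C, 1)),
}
out = dsau(x, **params)
print(out.shape)  # (2, 4, 9)
```

Because of the residual connection in formula (5), setting scale to zero in this sketch makes the unit return its input unchanged, which is a convenient sanity check.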

[0118] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0119] With reference to Figure 3, Step 2 is specifically performed according to the following steps:

[0120] Step 2.1: Normalize the input;

[0121] Step 2.1 is specifically performed as follows:

[0122] First, perform layer normalization on the input x, and calculate as shown in formula (6).

[0123] x′=LayerNorm(x) (6)

[0124] Where x' represents the normalized result;

[0125] Step 2.2: Extract features from the normalized data using convolutional kernels of different scales;

[0126] Step 2.2 is specifically performed as follows:

[0127] Parallel multi-scale feature extraction is performed through convolution with kernels of different scales, and the specific calculation is shown in formula (7):

[0128] x_k = Conv_k(x′), k ∈ {3, 5, 7} (7)

[0129] Where x_k represents the features extracted by the convolution kernel of size k.

[0130] All the features extracted above are concatenated, and channel fusion is then performed. The specific calculation is shown in formula (8):

[0131] x_{fused} = Fusion(concat(x_3, x_5, x_7)) (8)

[0132] Where x_{fused} represents the fused features;

[0133] Finally, the spectral features are obtained by combining the scaling factor and the residual connection, as shown in formula (9):

[0134] y = x_{fused} ⊙ scale + x (9)

[0135] Where y represents the extracted spectral features.
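Formulas (6) to (9) can likewise be sketched in NumPy. This is an illustration under stated assumptions rather than the implementation of the invention: each Conv_k is taken to be an ordinary 1D convolution with zero padding that preserves the feature length, Fusion is assumed to be a kernel-size-1 convolution that maps the 3C concatenated channels back to C, and LayerNorm is assumed to act over the channel axis without learnable parameters. The function names and random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumption: normalize over the channel axis of a (B, C, L) tensor.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1d_same(x, weight):
    # 1D convolution (cross-correlation) that keeps the length unchanged.
    # x: (B, C_in, L); weight: (C_out, C_in, K) with odd K.
    batch, _, length = x.shape
    c_out, _, k_size = weight.shape
    pad = k_size // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros((batch, c_out, length))
    for k in range(k_size):
        out += np.einsum("oc,bcl->bol", weight[:, :, k], xp[:, :, k:k + length])
    return out

def mkfa(x, branch_weights, w_fusion, scale):
    x_norm = layer_norm(x)                                       # formula (6)
    branches = [conv1d_same(x_norm, branch_weights[k])           # formula (7)
                for k in (3, 5, 7)]
    fused = conv1d_same(np.concatenate(branches, axis=1), w_fusion)  # formula (8)
    return fused * scale + x                                     # formula (9)

rng = np.random.default_rng(0)
B, C, L = 2, 4, 11
x = rng.standard_normal((B, C, L))
branch_weights = {k: rng.standard_normal((C, C, k)) * 0.1 for k in (3, 5, 7)}
w_fusion = rng.standard_normal((C, 3 * C, 1)) * 0.1
scale = np.ones((1, C, 1))

y = mkfa(x, branch_weights, w_fusion, scale)
print(y.shape)  # (2, 4, 11)
```

The three branches see the same normalized input but different receptive fields (3, 5, and 7 positions), and the fusion step mixes them channel by channel before the residual is added.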

[0136] Step 2.3: Integrate features at different scales.

[0137] Step 2.3 is specifically performed as follows:

[0138] To effectively integrate spatial and spectral information, a novel attention mechanism called HSA is proposed. By combining the advantages of MKFA and DSAU, a balance can be achieved between multi-scale feature extraction and dynamic spatial attention mechanisms, thereby effectively enhancing the spatial-spectral feature representation capability.

[0139] The Hybrid Scale Attention Module (HSA) extracts multi-scale spatial-spectral features through the Dynamic Spatial Attention Unit (DSAU) and the Multi-Kernel Fusion Attention Module (MKFA), respectively, capturing information at different scales, as shown in Equation (10):

[0140] y_{spectral} = DSAU(x), y_{spatial} = MKFA(x) (10)

[0141] Where y_{spectral} represents the spectral features extracted by the DSAU module, and y_{spatial} represents the spatial features extracted by the MKFA module;

[0142] Then, the two features are weighted and fused to output the final feature, as shown in formula (11):

[0143] y = y_{spectral} + y_{spatial} (11)

[0144] Where y represents the final extracted feature.

[0145] Step 3: Information fusion enhancement using the cross-attention Swin Transformer module (CASTB);

[0146] With reference to Figure 4, Step 3 is specifically performed according to the following steps:

[0147] Step 3.1: Input the features processed by multi-scale attention into LayerNorm;

[0148] Step 3.1 is specifically performed as follows:

[0149] Layer normalization

[0150] First, the input feature x after multi-scale attention is standardized so that the feature distribution is similar in each channel, which facilitates subsequent feature extraction operations, as shown in formula (12):

[0151] x′ = LayerNorm(x) (12)

[0152] Where x′ denotes the standardized features, x ∈ R^{L×B×D}, L represents the length of the feature sequence, B represents the batch size, and D is the feature dimension. The standardized features have the same dimensions as the input x.

[0153] Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales;

[0154] Step 3.2 is specifically performed as follows:

[0155] Multi-head self-attention

[0156] In the multi-head self-attention module, the long-range dependencies of the input features are calculated through the self-attention mechanism. The calculation process of self-attention is shown in formula (13):

[0157] Attention(Q, K, V) = Softmax(QK^T / √d_k) V (13)

[0158] Where Q, K, and V represent the query, key, and value vectors, respectively, which are obtained through the linear projection matrices W^Q, W^K, and W^V, and d_k is the feature dimension of each attention head, calculated as shown in Equation (14):

[0159] d_k = D / num_heads (14)

[0160] After MHSA, the output is calculated as shown in formula (15):

[0161] MHSA(x) = Concat(head_1, …, head_h) W^O (15)

[0162] Where head_i is the result of formula (13) for the i-th attention head, h = num_heads, and W^O ∈ R^{D×D} is the output projection matrix. Finally, the feature representation capability is enhanced through a residual connection, as shown in Equation (16):

[0163] x1=x+MHSA(LayerNorm(x)) (16).
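The multi-head self-attention step with its residual connection, formulas (13) to (16), can be sketched as follows. The sketch assumes the standard scaled dot-product form for formula (13) and head concatenation followed by an output projection for formula (15); it also assumes a LayerNorm over the feature axis without learnable parameters and omits any window partitioning. The function names, the variable names w_q, w_k, w_v, w_o, and the random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis D of an (L, B, D) tensor.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (L, B, D); every projection matrix: (D, D).
    seq_len, batch, dim = x.shape
    d_k = dim // num_heads                                   # formula (14)

    def split_heads(t):
        # (L, B, D) -> (B, heads, L, d_k)
        return t.reshape(seq_len, batch, num_heads, d_k).transpose(1, 2, 0, 3)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)      # (B, heads, L, L)
    heads = softmax(scores, axis=-1) @ v                     # formula (13)
    concat = heads.transpose(2, 0, 1, 3).reshape(seq_len, batch, dim)
    return concat @ w_o                                      # formula (15)

rng = np.random.default_rng(0)
L, B, D, num_heads = 5, 2, 8, 2
x = rng.standard_normal((L, B, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

x1 = x + mhsa(layer_norm(x), w_q, w_k, w_v, w_o, num_heads)  # formula (16)
print(x1.shape)  # (5, 2, 8)
```

With D = 8 and num_heads = 2, formula (14) gives d_k = 4, so each head attends over a 4-dimensional slice of the projected features.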

[0164] Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales.

[0165] Step 3.3 is specifically performed according to the following steps:

[0166] Multi-head cross attention

[0167] The proposed cross-attention module aims to further integrate information between different features. By applying the cross-attention mechanism to the standardized input features, the calculation is shown in Equation (17):

[0168] x2=x1+CrossAttention(LayerNorm(x1),LayerNorm(x1),LayerNorm(x1)) (17)

[0169] x1 represents the features extracted using the multi-head attention mechanism, and x2 represents the features extracted using the cross-attention mechanism.

[0170] CrossAttention is calculated in the same way as MHSA, but it is used to capture the interaction between features at different levels, thereby improving the fusion of spatial-spectral information of features.

[0171] Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features;

[0172] Step 3.4 is specifically performed as follows:

[0173] Multilayer Perceptron (MLP)

[0174] The MLP module is used to further model the nonlinear relationships of features, and the calculation process is shown in formula (18):

[0175] MLP(x2)=W2·Dropout(GELU(W1·x2+b1))+b2 (18)

[0176] MLP(x2) represents the features extracted by MLP;

[0177] Where W1 ∈ R^{D×mlp_dim} and W2 ∈ R^{mlp_dim×D} are weight matrices, b1 ∈ R^{mlp_dim} and b2 ∈ R^{D} are bias terms, GELU is the activation function, and Dropout is a random deactivation operation to prevent overfitting. Finally, the output of the MLP is shown in Equation (19):

[0178] y=x2+MLP(LayerNorm(x2)) (19).

[0179] y represents the spectral-spatial features of the final extracted hyperspectral image, used for accurate classification of hyperspectral images.
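Formulas (18) and (19) can be sketched in NumPy as below. The sketch makes several assumptions that the text does not fix: features are treated as row vectors, so the products are written x·W1 and h·W2 to match the stated shapes W1 ∈ R^{D×mlp_dim} and W2 ∈ R^{mlp_dim×D}; GELU is computed with the common tanh approximation; Dropout uses inverted scaling and is disabled by default, as at inference time. The function names, mlp_dim = 16, and the random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis D of an (L, B, D) tensor.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU (an assumption; the exact form uses erf).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def dropout(x, p, rng, training):
    # Inverted dropout: active only in training mode.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def mlp(x, w1, b1, w2, b2, p=0.1, rng=None, training=False):
    hidden = dropout(gelu(x @ w1 + b1), p, rng, training)
    return hidden @ w2 + b2                                  # formula (18)

rng = np.random.default_rng(0)
L, B, D, mlp_dim = 5, 2, 8, 16
x2 = rng.standard_normal((L, B, D))
w1 = rng.standard_normal((D, mlp_dim)) * 0.1
b1 = np.zeros(mlp_dim)
w2 = rng.standard_normal((mlp_dim, D)) * 0.1
b2 = np.zeros(D)

y = x2 + mlp(layer_norm(x2), w1, b1, w2, b2)                 # formula (19)
print(y.shape)  # (5, 2, 8)
```

The hidden layer widens each feature vector from D to mlp_dim before projecting back to D, so the residual addition in formula (19) is shape-compatible with x2.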

[0180] Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

[0181] Example 1

[0182] With reference to Figure 1, the method of this invention for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks is specifically performed according to the following steps:

[0183] Step 1: Spatial information extraction using the Dynamic Spatial Attention Unit (DSAU);

[0184] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0185] Step 3: Information fusion enhancement using the cross-attention Swin Transformer module (CASTB).

[0186] Example 2

[0187] With reference to Figure 1, the method of this invention for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks is specifically performed according to the following steps:

[0188] Step 1: Spatial information extraction using the Dynamic Spatial Attention Unit (DSAU);

[0189] With reference to Figure 2, Step 1 is specifically performed according to the following steps:

[0190] Step 1.1: Normalize the input;

[0191] Step 1.2: Extract features from the normalized data using convolution;

[0192] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0193] Step 1.3 is specifically performed as follows:

[0194] Dynamic Spatial Attention Unit (DSAU)

[0195] First, the input features x ∈ R^{B×C×L} are given, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively. The input features are then standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0196] z = Conv1(LayerNorm(x)) (1)

[0197] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0198] a,b=chunk(z,2) (2)

[0199] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0200] Where w represents dynamic weights, b is subjected to weighted convolution operation, and spatial features are extracted by combining a, as shown in formula (4):

[0201] y=b⊙DWConv1(a⊙w) (4)

[0202] Where y represents the features extracted after dynamic weighting, ⊙ represents the element-wise dot product operation, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0203] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0204] y′=Conv2(y)⊙scale+x (5)

[0205] Where y' represents the extracted spatial features.

[0206] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0207] Step 3: Information fusion enhancement using the cross-attention Swin Transformer module (CASTB).

[0208] Example 3

[0209] With reference to Figure 1, the method of this invention for accurate classification of hyperspectral images based on improved multi-scale attention and Transformer networks is specifically performed according to the following steps:

[0210] Step 1: Spatial information extraction using the Dynamic Spatial Attention Unit (DSAU);

[0211] With reference to Figure 2, Step 1 is specifically performed according to the following steps:

[0212] Step 1.1: Normalize the input;

[0213] Step 1.2: Extract features from the normalized data using convolution;

[0214] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0215] Step 1.3 is specifically performed as follows:

[0216] Dynamic Spatial Attention Unit (DSAU)

[0217] First, the input features x ∈ R^{B×C×L} are given, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively. The input features are then standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0218] z = Conv1(LayerNorm(x)) (1)

[0219] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0220] a,b=chunk(z,2) (2)

[0221] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0222] Where w represents dynamic weights, b is subjected to weighted convolution operation, and spatial features are extracted by combining a, as shown in formula (4):

[0223] y=b⊙DWConv1(a⊙w) (4)

[0224] Where y represents the features extracted after dynamic weighting, ⊙ represents the element-wise dot product operation, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0225] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0226] y′=Conv2(y)⊙scale+x (5)

[0227] Where y' represents the extracted spatial features.

[0228] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0229] With reference to Figure 3, Step 2 is specifically performed according to the following steps:

[0230] Step 2.1: Normalize the input;

[0231] Step 2.1 is specifically performed as follows:

[0232] First, perform layer normalization on the input x, and calculate as shown in formula (6).

[0233] x′=LayerNorm(x) (6)

[0234] Where x' represents the normalized result;

[0235] Step 2.2: Extract features from the normalized data using convolutional kernels of different scales;

[0236] Step 2.2 is specifically performed as follows:

[0237] Parallel multi-scale feature extraction is performed through convolution with kernels of different scales, and the specific calculation is shown in formula (7):

[0238] $x_k = \mathrm{Conv}_k(x'),\quad k \in \{3, 5, 7\}$ (7)

[0239] Where $x_k$ represents the features extracted by the convolution kernel of size k.

[0240] All the features extracted above are concatenated and then channel fusion is performed. The specific calculation is shown in formula (8):

[0241] $x_{fused} = \mathrm{Fusion}(\mathrm{concat}(x_3, x_5, x_7))$ (8)

[0242] Where $x_{fused}$ represents the fused features;

[0243] Finally, the spectral features are obtained by combining the scaling factor and residual connection, as shown in formula (9):

[0244] $y = x_{fused} \odot \mathrm{scale} + x$ (9)

[0245] Where y represents the extracted spectral features.
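The multi-kernel branch in formulas (6) to (9) can be illustrated with a short plain-Python sketch for one sample of shape C×L. It assumes, beyond what the text states, that each Conv_k is an ordinary 1D convolution with zero padding that preserves the length, that Fusion is a kernel-size-1 convolution mapping 3C channels back to C, and that scale is one factor per channel. The names `mkfa`, `branch_weights`, and `fusion_weight` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each position across channels; x is a C x L nested list."""
    channels, length = len(x), len(x[0])
    out = [[0.0] * length for _ in range(channels)]
    for t in range(length):
        column = [x[c][t] for c in range(channels)]
        mean = sum(column) / channels
        var = sum((v - mean) ** 2 for v in column) / channels
        for c in range(channels):
            out[c][t] = (x[c][t] - mean) / math.sqrt(var + eps)
    return out


def conv1d_same(x, weight):
    """weight[o][c] is an odd-length kernel; zero padding keeps the length."""
    k = len(weight[0][0])
    pad, length = k // 2, len(x[0])
    padded = [[0.0] * pad + list(row) + [0.0] * pad for row in x]
    return [[sum(weight[o][c][j] * padded[c][t + j]
                 for c in range(len(x)) for j in range(k))
             for t in range(length)]
            for o in range(len(weight))]


def mkfa(x, branch_weights, fusion_weight, scale):
    """Formulas (6)-(9); branch_weights maps kernel size k to a C x C x k weight."""
    channels, length = len(x), len(x[0])
    x_norm = layer_norm(x)                                       # (6)
    branches = [conv1d_same(x_norm, branch_weights[k])
                for k in (3, 5, 7)]                              # (7)
    stacked = [row for branch in branches for row in branch]     # concat: 3C x L
    fused = conv1d_same(stacked, fusion_weight)                  # (8): 3C -> C
    return [[fused[c][t] * scale[c] + x[c][t] for t in range(length)]
            for c in range(channels)]                            # (9)
```

Because every branch keeps the sequence length, the three outputs can be stacked along the channel axis without cropping, which is what allows the residual addition in formula (9).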

[0246] Step 2.3: Integrate features at different scales.

[0247] Step 2.3 is specifically performed as follows:

[0248] To effectively integrate spatial and spectral information, a novel attention mechanism called HSA is proposed. By combining the advantages of MKFA and DSAU, a balance can be achieved between multi-scale feature extraction and dynamic spatial attention mechanisms, thereby effectively enhancing the spatial-spectral feature representation capability.

[0249] The Hybrid Scale Attention Module (HSA) extracts multi-scale spatial-spectral features through the Dynamic Spatial Attention Unit (DSAU) and the Multi-Kernel Fusion Attention Module (MKFA), respectively, capturing information at different scales, as shown in Equation (10):

[0250] $y_{spectral} = \mathrm{DSAU}(x),\quad y_{spatial} = \mathrm{MKFA}(x)$ (10)

[0251] Where $y_{spectral}$ represents the spectral features extracted by the DSAU module, and $y_{spatial}$ represents the spatial features extracted by the MKFA module;

[0252] Then, the two features are fused by element-wise addition to output the final feature, as shown in formula (11):

[0253] $y = y_{spectral} + y_{spatial}$ (11)

[0254] Where y represents the final extracted feature.

[0255] Step 3: Information fusion enhancement by the cross-attention Swin Transformer module (CASTB);

[0256] With reference to Figure 4, Step 3 is specifically performed according to the following steps:

[0257] Step 3.1: Input the features processed by multi-scale attention into LayerNorm;

[0258] Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales;

[0259] Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales.

[0260] Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features;

[0261] Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

[0262] Example 4

[0263] The hyperspectral image accurate classification method of this invention, based on improved multi-scale attention and a Transformer network, is performed, with reference to Figure 1, according to the following steps:

[0264] Step 1: Spatial information extraction by the Dynamic Spatial Attention Unit (DSAU);

[0265] With reference to Figure 2, Step 1 is specifically performed according to the following steps:

[0266] Step 1.1: Normalize the input;

[0267] Step 1.2: Extract features from the normalized data using convolution;

[0268] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0269] Step 1.3 is specifically performed as follows:

[0270] Dynamic Spatial Attention Unit (DSAU)

[0271] First, given the input features $x \in \mathbb{R}^{B \times C \times L}$, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively, the input features are standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0272] z = Conv1(LayerNorm(x)) (1)

[0273] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0274] a,b=chunk(z,2) (2)

[0275] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0276] Where w represents the dynamic weights. The sub-feature a is weighted by w and passed through a depthwise convolution, and the result is combined with b by element-wise multiplication to extract spatial features, as shown in formula (4):

[0277] y=b⊙DWConv1(a⊙w) (4)

[0278] Where y represents the features extracted after dynamic weighting, ⊙ represents element-wise multiplication, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0279] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0280] y′=Conv2(y)⊙scale+x (5)

[0281] Where y' represents the extracted spatial features.

[0282] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0283] With reference to Figure 3, Step 2 is specifically performed according to the following steps:

[0284] Step 2.1: Normalize the input;

[0285] Step 2.1 is specifically performed as follows:

[0286] First, perform layer normalization on the input x, and calculate as shown in formula (6).

[0287] x′=LayerNorm(x) (6)

[0288] Where x' represents the normalized result;

[0289] Step 2.2: Extract features from the normalized data using convolutional kernels of different scales;

[0290] Step 2.2 is specifically performed as follows:

[0291] Parallel multi-scale feature extraction is performed through convolution with kernels of different scales, and the specific calculation is shown in formula (7):

[0292] $x_k = \mathrm{Conv}_k(x'),\quad k \in \{3, 5, 7\}$ (7)

[0293] Where $x_k$ represents the features extracted by the convolution kernel of size k.

[0294] All the features extracted above are concatenated and then channel fusion is performed. The specific calculation is shown in formula (8):

[0295] $x_{fused} = \mathrm{Fusion}(\mathrm{concat}(x_3, x_5, x_7))$ (8)

[0296] Where $x_{fused}$ represents the fused features;

[0297] Finally, the spectral features are obtained by combining the scaling factor and residual connection, as shown in formula (9):

[0298] $y = x_{fused} \odot \mathrm{scale} + x$ (9)

[0299] Where y represents the extracted spectral features.

[0300] Step 2.3: Integrate features at different scales.

[0301] Step 2.3 is specifically performed as follows:

[0302] To effectively integrate spatial and spectral information, a novel attention mechanism called HSA is proposed. By combining the advantages of MKFA and DSAU, a balance can be achieved between multi-scale feature extraction and dynamic spatial attention mechanisms, thereby effectively enhancing the spatial-spectral feature representation capability.

[0303] The Hybrid Scale Attention Module (HSA) extracts multi-scale spatial-spectral features through the Dynamic Spatial Attention Unit (DSAU) and the Multi-Kernel Fusion Attention Module (MKFA), respectively, capturing information at different scales, as shown in Equation (10):

[0304] $y_{spectral} = \mathrm{DSAU}(x),\quad y_{spatial} = \mathrm{MKFA}(x)$ (10)

[0305] Where $y_{spectral}$ represents the spectral features extracted by the DSAU module, and $y_{spatial}$ represents the spatial features extracted by the MKFA module;

[0306] Then, the two features are fused by element-wise addition to output the final feature, as shown in formula (11):

[0307] $y = y_{spectral} + y_{spatial}$ (11)

[0308] Where y represents the final extracted feature.

[0309] Step 3: Information fusion enhancement by the cross-attention Swin Transformer module (CASTB);

[0310] With reference to Figure 4, Step 3 is specifically performed according to the following steps:

[0311] Step 3.1: Input the features processed by multi-scale attention into LayerNorm;

[0312] Step 3.1 is specifically performed as follows:

[0313] Layer normalization

[0314] First, the input feature x after multi-scale attention is standardized to make the feature distribution similar in each channel, which facilitates subsequent feature extraction operations, as shown in formula (12):

[0315] $\hat{x} = \mathrm{LayerNorm}(x)$ (12)

[0316] Where $\hat{x}$ represents the standardized features, $x \in \mathbb{R}^{L \times B \times D}$, L represents the length of the feature sequence, B represents the batch size, and D is the feature dimension. The feature dimension obtained after standardization is the same as that of the input x.

[0317] Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales;

[0318] Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales.

[0319] Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features;

[0320] Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

[0321] Example 5

[0322] The hyperspectral image accurate classification method of this invention, based on improved multi-scale attention and a Transformer network, is performed, with reference to Figure 1, according to the following steps:

[0323] Step 1: Spatial information extraction by the Dynamic Spatial Attention Unit (DSAU);

[0324] With reference to Figure 2, Step 1 is specifically performed according to the following steps:

[0325] Step 1.1: Normalize the input;

[0326] Step 1.2: Extract features from the normalized data using convolution;

[0327] Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

[0328] Step 1.3 is specifically performed as follows:

[0329] Dynamic Spatial Attention Unit (DSAU)

[0330] First, given the input features $x \in \mathbb{R}^{B \times C \times L}$, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively, the input features are standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1):

[0331] z = Conv1(LayerNorm(x)) (1)

[0332] Where z represents the feature extracted by convolution, z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3).

[0333] a,b=chunk(z,2) (2)

[0334] w=Softmax(ReLU(Conv1(AdaptiveAvgPool(a)))) (3)

[0335] Where w represents the dynamic weights. The sub-feature a is weighted by w and passed through a depthwise convolution, and the result is combined with b by element-wise multiplication to extract spatial features, as shown in formula (4):

[0336] y=b⊙DWConv1(a⊙w) (4)

[0337] Where y represents the features extracted after dynamic weighting, ⊙ represents element-wise multiplication, and DWConv1 represents depthwise separable convolution, which is used to capture spatial information;

[0338] Finally, the extracted spatial features are output through 1D convolution and residual connection, as shown in Equation (5):

[0339] y′=Conv2(y)⊙scale+x (5)

[0340] Where y' represents the extracted spatial features.

[0341] Step 2: Multi-scale feature extraction using the Multi-Kernel Fusion Attention Module (MKFA);

[0342] With reference to Figure 3, Step 2 is specifically performed according to the following steps:

[0343] Step 2.1: Normalize the input;

[0344] Step 2.1 is specifically performed as follows:

[0345] First, perform layer normalization on the input x, and calculate as shown in formula (6).

[0346] x′=LayerNorm(x) (6)

[0347] Where x' represents the normalized result;

[0348] Step 2.2: Extract features from the normalized data using convolutional kernels of different scales;

[0349] Step 2.3: Integrate features at different scales.

[0350] Step 3: Information fusion enhancement by the cross-attention Swin Transformer module (CASTB);

[0351] With reference to Figure 4, Step 3 is specifically performed according to the following steps:

[0352] Step 3.1: Input the features processed by multi-scale attention into LayerNorm;

[0353] Step 3.1 is specifically performed as follows:

[0354] Layer normalization

[0355] First, the input feature x after multi-scale attention is standardized to make the feature distribution similar in each channel, which facilitates subsequent feature extraction operations, as shown in formula (12):

[0356] $\hat{x} = \mathrm{LayerNorm}(x)$ (12)

[0357] Where $\hat{x}$ represents the standardized features, $x \in \mathbb{R}^{L \times B \times D}$, L represents the length of the feature sequence, B represents the batch size, and D is the feature dimension. The feature dimension obtained after standardization is the same as that of the input x.

[0358] Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales;

[0359] Step 3.2 is specifically performed as follows:

[0360] Multi-Head Self-Attention (MHSA)

[0361] In the multi-head self-attention module, the long-range dependencies of the input features are calculated through the self-attention mechanism. The calculation process of self-attention is shown in formula (13):

[0362] $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (13)

[0363] Where Q, K, and V represent the query, key, and value vectors, respectively, which are obtained through the linear projection matrices $W_Q$, $W_K$, and $W_V$; $d_k$ is the feature dimension of each attention head, calculated as shown in Equation (14):

[0364] $d_k = D / \mathrm{num\_heads}$ (14)

[0365] After MHSA, the output is calculated as shown in formula (15):

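[0366] $\mathrm{MHSA}(x) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O$ (15)

[0367] Where $W_O \in \mathbb{R}^{D \times D}$ is the output projection matrix and $\mathrm{head}_i$ is the output of the i-th attention head. Finally, the feature representation capability is enhanced through residual connections, and the calculation is shown in Equation (16):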

[0368] x1=x+MHSA(LayerNorm(x)) (16).
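The attention block described by formulas (13) to (16) can be sketched in plain Python for a single sample laid out as L tokens of dimension D (the batch axis is dropped for brevity). This is an illustrative sketch under assumptions: the projections carry no bias terms, the heads are formed by slicing the feature axis into num_heads equal parts, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm_rows(x, eps=1e-5):
    """Normalize every token (row) over its D features."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def attention(q, k, v):
    """Scaled dot-product attention, formula (13)."""
    d_k = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])        # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def mhsa(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention, formulas (14) and (15)."""
    dim = len(x[0])
    assert dim % num_heads == 0, "D must be divisible by num_heads"
    d_k = dim // num_heads                                    # (14)
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = []
    for h in range(num_heads):
        part = slice(h * d_k, (h + 1) * d_k)
        heads.append(attention([r[part] for r in q],
                               [r[part] for r in k],
                               [r[part] for r in v]))
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)                                # (15)


def attention_block(x, w_q, w_k, w_v, w_o, num_heads):
    """Pre-norm residual block, formula (16): x1 = x + MHSA(LayerNorm(x))."""
    out = mhsa(layer_norm_rows(x), w_q, w_k, w_v, w_o, num_heads)
    return [[a + b for a, b in zip(rx, ro)] for rx, ro in zip(x, out)]
```

For example, with D = 4 and num_heads = 2, formula (14) gives d_k = 2, so each head attends over a two-dimensional slice of the projected features.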

[0369] Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales.

[0370] Step 3.3 is specifically performed according to the following steps:

[0371] Multi-head cross attention

[0372] The proposed cross-attention module aims to further integrate information between different features. By applying the cross-attention mechanism to the standardized input features, the calculation is shown in Equation (17):

[0373] x2=x1+CrossAttention(LayerNorm(x1),LayerNorm(x1),LayerNorm(x1)) (17)

[0374] x1 represents the features extracted using the multi-head attention mechanism, and x2 represents the features extracted using the cross-attention mechanism.

[0375] CrossAttention is calculated in the same way as MHSA, but it is used to capture the interaction between features at different levels, thereby improving the fusion of spatial-spectral information of features.

[0376] Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features;

[0377] Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

[0378] Example 6

[0379] Figure 1 shows a schematic diagram of the proposed method. The method is evaluated experimentally as follows.

[0380] The experiments used four standard publicly available hyperspectral datasets: Pavia, Houston2013, PaviaU, and Salinas. Figure 5 shows a true-color map (left) and ground truth (right) of the Pavia dataset. Figure 6 shows a true-color map (top) and ground truth (bottom) of the Houston2013 dataset. Figure 7 shows a true-color map (left) and ground truth (right) of the PaviaU dataset. Figure 8 shows a true-color map (left) and ground truth (right) of the Salinas dataset.

[0381] The proposed algorithm was implemented using Python 3.12 and PyTorch (torch) 2.4.1. Training was performed on an NVIDIA GeForce RTX 3060 Ti GPU under 64-bit Windows 11.

[0382] To compare the performance of the classification algorithms, three evaluation metrics commonly used in HSI classification tasks were employed: overall accuracy (OA), average accuracy (AA), and the kappa coefficient. In addition to OA, AA, and kappa, the F1-Score was used to assess the effectiveness of the proposed method. The methods were further compared under different numbers of training samples to evaluate whether the proposed method remains accurate when training samples are scarce or only when more training samples are available. The samples of each class in a dataset were split into 10% for training and 90% for testing.
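For reference, the three metrics can be computed from a confusion matrix as in the following plain-Python sketch. It uses the standard definitions of OA, AA (mean per-class recall), and Cohen's kappa; the function name and the matrix layout (rows are true classes, columns are predicted classes) are illustrative choices, not taken from the patent.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for confusion[i][j] = true class i predicted as j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total                                   # overall accuracy

    # Average accuracy: mean per-class recall over classes that have samples.
    recalls = [confusion[i][i] / sum(confusion[i])
               for i in range(n) if sum(confusion[i]) > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa compares OA with the agreement expected by chance.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (oa - expected) / (1 - expected)               # undefined if expected == 1
    return oa, aa, kappa


if __name__ == "__main__":
    oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
    print(f"OA={oa:.3f} AA={aa:.3f} kappa={kappa:.3f}")
```

Kappa is lower than OA whenever chance agreement is nonzero, which is why the two values differ in the tables even for the same method.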

[0383] This experiment compared the proposed method with mainstream classification methods (SVM, KNN, RF, 2D-CNN, SACNet, SSFCN, ViT, SF, and MF) on four hyperspectral datasets (Pavia, Houston2013, PaviaU, and Salinas), as shown in Tables 1-4. The results show that the proposed method has advantages in classification performance. On the Pavia dataset, the overall accuracy (OA) of the proposed method reaches 0.988, higher than traditional methods (such as SVM at 0.978 and RF at 0.982) and deep learning methods (such as SACNet at 0.921); it also achieves the best classification accuracy for categories 7 and 8, reaching 0.9986 and 1.000, respectively. On the Houston2013 dataset, the OA of the proposed method is 0.902, an improvement over RF (0.877) and 2D-CNN (0.864), especially in the classification of complex land cover such as categories 3 and 14, where the accuracy is 0.996 and 0.988, respectively. On the PaviaU dataset, the proposed method achieves an OA of 0.919, with AA and Kappa values of 0.892 and 0.893, respectively; for categories 1 and 8, the accuracy is 0.952 and 0.999, outperforming the other methods. On the Salinas dataset, the proposed method achieves an OA of 0.915, with AA and Kappa values of 0.950 and 0.905, respectively, and performs particularly well on difficult-to-classify categories (such as categories 4 and 9), achieving accuracies of 0.989 and 0.965, which match or exceed those of the other compared methods. In summary, the proposed HSIC method performs well in OA, AA, and Kappa, especially in complex land cover classification and in cases with uneven category distribution, demonstrating the algorithm's effectiveness and stability.

[0384] Table 1. Quantitative results of the Pavia dataset

[0385]

| Class No. | SVM | KNN | RF | 2D-CNN | SACNet | SSFCN | ViT | SF | MF | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 0.995 | 0.957 | 0.983 | 0.875 | 0.828 | 0.854 | 1.000 |
| 1 | 0.967 | 0.962 | 0.966 | 0.885 | 0.852 | 0.924 | 0.935 | 0.832 | 0.975 | 0.956 |
| 2 | 0.818 | 0.825 | 0.852 | 0.937 | 0.834 | 0.923 | 0.731 | 0.474 | 0.393 | 0.916 |
| 3 | 0.808 | 0.782 | 0.806 | 0.982 | 0.954 | 0.961 | 0.950 | 0.854 | 0.867 | 0.873 |
| 4 | 0.923 | 0.950 | 0.951 | 0.973 | 0.660 | 0.795 | 0.981 | 0.957 | 0.940 | 0.978 |
| 5 | 0.923 | 0.928 | 0.950 | 0.931 | 0.843 | 0.965 | 0.893 | 0.774 | 0.615 | 0.959 |
| 6 | 0.925 | 0.933 | 0.955 | 0.967 | 0.904 | 0.900 | 0.846 | 0.576 | 0.563 | 0.973 |
| 7 | 0.996 | 0.997 | 0.995 | 0.967 | 0.948 | 0.979 | 0.820 | 0.691 | 0.673 | 0.998 |
| 8 | 1.000 | 1.000 | 0.999 | 0.896 | 0.794 | 0.907 | 0.936 | 0.990 | 0.957 | 1.000 |
| OA | 0.978 | 0.979 | 0.982 | 0.972 | 0.921 | 0.962 | 0.902 | 0.812 | 0.837 | 0.988 |
| AA | 0.927 | 0.940 | 0.948 | 0.948 | 0.861 | 0.926 | 0.885 | 0.775 | 0.760 | 0.964 |
| Kappa | 0.969 | 0.970 | 0.975 | 0.951 | 0.863 | 0.910 | 0.869 | 0.743 | 0.780 | 0.983 |

[0386] Table 2. Quantitative results of the Houston 2013 dataset

[0387]

[0388]

[0389] Table 3. Quantitative results of the PaviaU dataset

[0390]

[0391]

[0392] Table 4. Quantitative results of the Salinas dataset

[0393]

| Class No. | SVM | KNN | RF | 2D-CNN | SACNet | SSFCN | ViT | SF | MF | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.998 | 0.988 | 0.989 | 0.985 | 0.855 | 0.932 | 0.759 | 0.993 |
| 1 | 0.991 | 0.989 | 0.996 | 0.993 | 0.997 | 0.941 | 0.926 | 0.874 | 1.000 | 0.993 |
| 2 | 0.907 | 0.931 | 0.967 | 1.000 | 0.966 | 0.958 | 0.882 | 0.812 | 0.923 | 0.924 |
| 3 | 0.975 | 0.982 | 0.985 | 0.993 | 0.964 | 0.957 | 0.888 | 0.985 | 0.986 | 0.982 |
| 4 | 0.981 | 0.984 | 0.989 | 0.944 | 0.946 | 0.822 | 0.880 | 0.720 | 0.901 | 0.989 |
| 5 | 1.000 | 1.000 | 0.999 | 0.991 | 0.979 | 0.993 | 0.937 | 1.000 | 0.983 | 0.998 |
| 6 | 0.984 | 0.994 | 1.000 | 0.996 | 0.979 | 0.985 | 0.896 | 0.999 | 0.995 | 1.000 |
| 7 | 0.697 | 0.727 | 0.793 | 0.765 | 0.639 | 0.605 | 0.910 | 0.856 | 0.789 | 0.794 |
| 8 | 0.988 | 0.988 | 0.986 | 0.994 | 0.985 | 0.989 | 0.895 | 1.000 | 1.000 | 0.992 |
| 9 | 0.899 | 0.911 | 0.926 | 0.960 | 0.928 | 0.893 | 0.925 | 0.941 | 0.951 | 0.965 |
| 10 | 0.885 | 0.923 | 0.943 | 0.997 | 1.000 | 0.999 | 0.957 | 0.980 | 0.998 | 0.942 |
| 11 | 0.957 | 0.954 | 0.972 | 0.958 | 0.951 | 0.472 | 0.949 | 1.000 | 1.000 | 0.971 |
| 12 | 0.911 | 0.935 | 0.959 | 0.971 | 0.957 | 0.737 | 0.949 | 0.997 | 0.987 | 0.954 |
| 13 | 0.979 | 0.961 | 0.964 | 0.971 | 0.973 | 0.974 | 0.908 | 0.973 | 0.959 | 0.956 |
| 14 | 0.838 | 0.647 | 0.791 | 0.678 | 0.795 | 0.493 | 0.751 | 0.305 | 0.635 | 0.771 |
| 15 | 0.989 | 0.991 | 0.983 | 0.983 | 0.965 | 0.969 | 0.948 | 0.976 | 0.955 | 0.987 |
| OA | 0.886 | 0.881 | 0.917 | 0.918 | 0.878 | 0.799 | 0.902 | 0.836 | 0.883 | 0.915 |
| AA | 0.928 | 0.935 | 0.951 | 0.961 | 0.938 | 0.861 | 0.922 | 0.897 | 0.9226 | 0.950 |
| Kappa | 0.872 | 0.868 | 0.907 | 0.951 | 0.928 | 0.841 | 0.940 | 0.815 | 0.869 | 0.905 |

[0394] To comprehensively evaluate the reliability of the proposed model, the F1-Scores on the four datasets were analyzed, as shown in Figure 9. In the Salinas dataset, the F1-Score for class 11 is greater than 0.65, and the F1-Score for each of the other classes is greater than 0.8. In the PaviaU dataset, except for class 2, the F1-Score for all classes is greater than 0.8. In the Pavia dataset, almost every class has an F1-Score greater than 0.9, demonstrating the effectiveness of the algorithm. In the Houston2013 dataset, the F1-Score for class 12 is greater than 0.6, and the F1-Score for the other classes is greater than 0.8. In summary, this algorithm has good classification performance and strong generalization ability for HSIC.
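The per-class F1-Score discussed above follows from the same confusion matrix used for OA and AA. The sketch below applies the standard definition F1 = 2TP / (2TP + FP + FN); the function name and the convention that rows are true classes are illustrative assumptions.

```python
def per_class_f1(confusion):
    """Return one F1-Score per class for confusion[i][j] = true i predicted as j."""
    n = len(confusion)
    scores = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                          # missed samples of class i
        fp = sum(confusion[r][i] for r in range(n)) - tp     # others labelled as class i
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)      # 0.0 for an empty class
    return scores
```

Because F1 penalizes both false positives and false negatives, a class can show high accuracy in the tables yet a lower F1-Score when other classes are frequently misassigned to it.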

[0395] To evaluate the performance of the proposed method under different training-test sample ratios, five training-test ratios (0.1-0.9, 0.15-0.85, 0.2-0.8, 0.25-0.75, 0.3-0.7) were selected for experiments. As shown in Figure 9, as the training sample ratio increases, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient of the proposed method all rise steadily, showing that more training data helps to improve classification performance. At the lowest training ratio (0.1-0.9), the OA, AA, and Kappa coefficient of the proposed method are better than those of the comparative methods, indicating robustness when training samples are scarce. At the highest training ratio (0.3-0.7), the classification performance tends to saturate, which is consistent with the generalization ability of the method. Across the different training-test ratios, the proposed method exhibits stable performance and smooth changes, demonstrating its applicability to hyperspectral classification tasks.
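One way to produce such training-test partitions is a per-class (stratified) random split, sketched below in plain Python. The patent does not describe the sampling procedure, so the stratification, the fixed seed, and the rule of keeping at least one training sample per class are assumptions; `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio, seed=0):
    """Split sample indices per class into sorted train and test index lists."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)               # fixed seed for a repeatable split
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = max(1, round(train_ratio * len(indices)))   # at least one per class
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    labels = [0] * 100 + [1] * 50
    for ratio in (0.1, 0.15, 0.2, 0.25, 0.3):
        train, test = stratified_split(labels, ratio)
        print(ratio, len(train), len(test))
```

Splitting within each class keeps rare classes represented in the training set even at the 0.1 ratio, which matters for datasets with uneven category distribution.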

Claims

1. A method for accurate classification of hyperspectral images based on improved multi-scale attention and a Transformer network, characterized in that the method is performed according to the following steps: Step 1: spatial information extraction by the dynamic spatial attention unit; Step 2: multi-scale feature extraction by the multi-kernel fusion attention module, including spectral feature extraction and the fusion of spatial and spectral features; Step 2 is specifically performed according to the following steps: Step 2.1: normalize the input; Step 2.1 is specifically performed as follows: first, layer normalization is performed on the input features x, as shown in formula (6): $x' = \mathrm{LayerNorm}(x)$ (6), where x' represents the normalized result; Step 2.2: extract features from the normalized data using convolutional kernels of different scales; Step 2.2 is specifically performed as follows: parallel multi-scale feature extraction is performed through convolution with kernels of different scales, as shown in formula (7): $x_k = \mathrm{Conv}_k(x'),\quad k \in \{3, 5, 7\}$ (7), where $x_k$ represents the features extracted by the different convolution kernels and k indicates the kernel size; all the features extracted above are concatenated and then channel fusion is performed, as shown in formula (8): $x_{fused} = \mathrm{Fusion}(\mathrm{concat}(x_3, x_5, x_7))$ (8), where $x_{fused}$ represents the fused features; finally, the features are obtained by combining the scaling factor and the residual connection, as shown in formula (9): $y = x_{fused} \odot \mathrm{scale} + x$ (9), where y represents the extracted features; Step 2.3: integrate features at different scales; Step 2.3 is specifically performed as follows: the Hybrid Scale Attention Module (HSA) extracts multi-scale spatial-spectral features through the Dynamic Spatial Attention Unit (DSAU) and the Multi-Kernel Fusion Attention Module (MKFA), respectively, capturing information at different scales, as shown in formula (10): $y_{spectral} = \mathrm{DSAU}(x),\quad y_{spatial} = \mathrm{MKFA}(x)$ (10), where $y_{spectral}$ represents the features extracted by the DSAU module and $y_{spatial}$ represents the features extracted by the MKFA module; then the two features are fused to output the final feature, as shown in formula (11): $y = y_{spectral} + y_{spatial}$ (11), where y represents the final extracted features; Step 3: information fusion enhancement by the cross-attention Swin Transformer module.

2. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 1, characterized in that, Step 1 is specifically performed according to the following steps: Step 1.1: Normalize the input; Step 1.2: Extract features from the normalized data using convolution; Step 1.3: Use dynamic weights and different convolutions to complete spatial feature extraction.

3. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 2, characterized in that Step 1.3 is specifically performed as follows: first, given the input features $x \in \mathbb{R}^{B \times C \times L}$, where B, C, and L represent the batch size, number of channels, and feature length of the input hyperspectral image, respectively, the input features are standardized and channel-expanded to obtain the expanded feature representation, as shown in formula (1): $z = \mathrm{Conv1}(\mathrm{LayerNorm}(x))$ (1), where z represents the feature extracted by convolution; z is divided into two sub-features a and b along the channel dimension, and the dynamic weights are calculated as shown in formulas (2) and (3): $a, b = \mathrm{chunk}(z, 2)$ (2), $w = \mathrm{Softmax}(\mathrm{ReLU}(\mathrm{Conv1}(\mathrm{AdaptiveAvgPool}(a))))$ (3), where w represents the dynamic weights; the sub-features a and b are combined through a weighted depthwise convolution to extract spatial features, as shown in formula (4): $y = b \odot \mathrm{DWConv1}(a \odot w)$ (4), where y represents the features extracted after dynamic weighting, $\odot$ represents element-wise multiplication, and DWConv1 represents depthwise separable convolution, used to capture spatial information; finally, the spatial features are obtained through 1D convolution and a residual connection with the input feature x, as shown in formula (5): $y' = \mathrm{Conv2}(y) \odot \mathrm{scale} + x$ (5), where y' represents the extracted spatial features.

4. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 3, characterized in that, Step 3 is specifically performed according to the following steps: Step 3.1: Input the features processed by multi-scale attention into LayerNorm; Step 3.2: Introduce a multi-head attention mechanism to enable the network to process features at different scales; Step 3.3: Introduce multi-head cross-attention to enable the network to better integrate features at different scales. Step 3.4: Add an MLP after the multi-head attention mechanism to integrate and extract features; Step 3.5: After MLP, residual connections are introduced to reduce the computational complexity of the model.

5. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 4, characterized in that Step 3.1 is specifically performed as follows: first, the features x that have undergone multi-scale attention are standardized to make the distribution of features similar across each channel, which facilitates subsequent feature extraction operations, as shown in formula (12): $\hat{x} = \mathrm{LayerNorm}(x)$ (12), where $\hat{x}$ represents the standardized features, $x \in \mathbb{R}^{L \times B \times D}$, L represents the length of the feature sequence, B represents the batch size, and D is the feature dimension; the standardized feature dimension is the same as that of the input x; Step 3.2 is specifically performed as follows: in the multi-head self-attention module, the long-range dependencies of the input features are calculated through the self-attention mechanism, and the calculation process of self-attention is shown in formula (13): $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (13), where Q, K, and V represent the query, key, and value vectors, respectively, which are obtained through the linear projection matrices $W_Q$, $W_K$, and $W_V$; $d_k$ is the feature dimension of each attention head, calculated as shown in formula (14): $d_k = D / \mathrm{num\_heads}$ (14); after MHSA, the output is calculated as shown in formula (15): $\mathrm{MHSA}(x) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O$ (15), where $W_O \in \mathbb{R}^{D \times D}$ is the output projection matrix; finally, the feature representation capability is enhanced through residual connections, as shown in formula (16): $x_1 = x + \mathrm{MHSA}(\mathrm{LayerNorm}(x))$ (16).

6. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 5, characterized in that Step 3.3 is specifically performed as follows: the proposed cross-attention module aims to further integrate information between different features; the cross-attention mechanism is applied to the standardized input features, as shown in formula (17): $x_2 = x_1 + \mathrm{CrossAttention}(\mathrm{LayerNorm}(x_1), \mathrm{LayerNorm}(x_1), \mathrm{LayerNorm}(x_1))$ (17), where $x_1$ represents the features extracted through the multi-head attention mechanism and $x_2$ represents the features extracted through the cross-attention mechanism; CrossAttention is calculated in the same way as MHSA, but it is used to capture the interaction between features at different levels, thereby improving the fusion of spatial-spectral information of features.

7. The hyperspectral image accurate classification method based on improved multi-scale attention and Transformer network according to claim 6, characterized in that Step 3.4 is specifically performed as follows: the MLP module is used to further model the nonlinear relationships of features, and the calculation process is shown in formula (18), whose result represents the features extracted by the MLP; in formula (18), the weight matrices and biases are those of the linear layers, GELU is the activation function, and Dropout is a random deactivation operation used to prevent overfitting; finally, the output of the MLP is shown in formula (19), whose result represents the final extracted spectral-spatial features of the hyperspectral image, used for accurate classification of hyperspectral images.