A hyperspectral image classification method based on spectral attention and enhanced second-order pooling
By combining an autoencoder and a learnable spectral attention mechanism with diagonal-enhanced second-order pooling, the problems of high computational complexity and neglect of spatial information in hyperspectral image classification are solved, and efficient hyperspectral image classification is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI INSTITUTE OF TECHNICAL PHYSICS CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-23
AI Technical Summary
Existing hyperspectral image classification methods incur high computational complexity when processing high-dimensional spectral data. Self-attention mechanisms add substantial overhead and often ignore spatial context information, and such models are prone to overfitting, particularly when training samples are limited, which degrades performance.
An autoencoder is used for unsupervised dimensionality reduction. Combined with a learnable nonlinear spectral attention mechanism and diagonally enhanced second-order pooling, spatial information is extracted through two-dimensional convolutional layers, reducing spectral dimensionality while retaining discriminative information, thereby improving classification accuracy.
It effectively reduces spectral redundancy, retains category discrimination information, improves classification accuracy, maintains robust performance especially in scenarios with scarce samples, and reduces computational complexity.
Figure CN122265705A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of hyperspectral remote sensing, specifically to a hyperspectral image classification method based on spectral attention and enhanced second-order pooling.
Background Technology
[0002] Compared to traditional optical images, hyperspectral images (HSIs) offer significantly higher spectral resolution, providing rich spectral information for each pixel. Leveraging this characteristic, hyperspectral image classification has been widely applied in fields such as geological exploration, precision agriculture, and military reconnaissance. In recent years, the rapid development of remote sensing technology has significantly improved spatial resolution, further enhancing the ability of hyperspectral images to distinguish different land cover types. However, hyperspectral image classification still faces several challenges. HSIs typically contain hundreds of spectral bands, resulting in extremely high data dimensionality. With limited training samples, this not only increases computational complexity but also often leads to the Hughes phenomenon, thus reducing classification performance. Furthermore, adjacent bands exhibit high correlation and contain a large amount of redundant information; sensor noise and atmospheric radiative transfer further interfere with the spectral characteristics of land cover. These problems are particularly pronounced when different categories have similar spectral characteristics. In addition, hyperspectral data acquisition is costly, and the annotation process is time-consuming and laborious; therefore, most publicly available datasets contain only a small number of labeled samples, which significantly limits the performance of classification methods.
[0003] To address the problem of high-dimensional redundancy, traditional hyperspectral classification typically employs dimensionality reduction methods for preprocessing, including principal component analysis (PCA), independent component analysis (ICA), and supervised PCA. However, these linear dimensionality reduction methods struggle to fully capture complex spectral variations. Subsequently, researchers have manually designed and extracted features, using classifiers such as Support Vector Machines (SVM) and k-nearest neighbors (KNN) for classification. However, these methods rely heavily on spectral information, often neglecting spatial structural features in the image. To overcome this limitation, researchers have proposed various methods that fuse spectral and spatial information, such as Markov Random Fields (MRF) and sparse representation techniques to improve classification accuracy. Despite these advancements, traditional models relying on handcrafted features are still prone to misclassification in complex scenes, thus limiting their practical application effectiveness.
[0004] In recent years, deep learning methods have been able to automatically extract more discriminative and high-level semantic features from data, demonstrating stronger adaptability and generalization capabilities. Stacked autoencoders (SAEs) were first introduced into the field of remote sensing for hyperspectral image classification. Subsequently, deep belief networks (DBNs) were also applied to hyperspectral classification for feature extraction and classification tasks. However, both SAEs and DBNs are fully connected networks, requiring a large number of parameters during training; furthermore, due to their one-dimensional input structure, they inevitably lose some spatial information. Subsequently, convolutional neural networks (CNNs) have become an important research direction for high-precision hyperspectral image classification, especially hybrid networks combining 2D and 3D convolutions. Chinese invention application CN112801204A discloses a hyperspectral classification method based on automatic neural networks with lifelong learning capabilities, using both 3D and 2D convolutions, whose combination can more effectively utilize spatial and spectral features. The paper "Hyperspectral image classification using mixed convolutions and covariance pooling" presents a hybrid CNN combining covariance pooling for hyperspectral image classification, effectively extracting second-order spectral-spatial information. However, 3D convolutional networks typically contain a large number of parameters, significantly increasing computational complexity. The Transformer architecture is widely used in hyperspectral image classification due to its superior context modeling capabilities. Chinese invention application CN115439679A discloses a hyperspectral image classification method combining multi-attention and Transformer, achieving good classification performance using spatial and channel attention mechanisms and a Transformer Encoder structure. 
The paper "Spectral–spatial morphological attention transformer for hyperspectral image classification" combines morphological operators with self-attention mechanisms, proposing MorphFormer to enhance structural feature extraction and improve classification performance.
[0005] Despite significant progress in Transformer-based hyperspectral image classification methods, several limitations remain: First, the self-attention mechanism has quadratic computational complexity, leading to high computational costs when processing high-dimensional spectral data. Second, many existing models, such as SpectralFormer, primarily focus on spectral sequence modeling while neglecting spatial context information. Furthermore, the Transformer architecture is prone to overfitting when training samples are limited.
[0006] In summary, there is an urgent need for a hyperspectral image classification method that can effectively reduce spectral redundancy, fully integrate high-order spectral-spatial discriminative features, and remain robust in scenarios with scarce samples.
Summary of the Invention
[0007] The purpose of this invention is to provide a hyperspectral image classification method based on spectral attention and enhanced second-order pooling. The method effectively reduces the spectral dimensionality and retains class-discriminative information while suppressing redundancy and noise, and it jointly uses spectral and spatial information to extract more discriminative higher-order features. This significantly improves the classification accuracy of hyperspectral images at relatively low computational complexity. It addresses three problems of existing technologies: the high computational cost of self-attention mechanisms on high-dimensional spectral data, caused by their quadratic computational complexity; the fact that many existing models, such as SpectralFormer, focus primarily on spectral sequence modeling while neglecting spatial context information; and the tendency of Transformer architectures to overfit when training samples are limited.
[0008] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0009] A hyperspectral image classification method based on spectral attention and enhanced second-order pooling is used to identify land cover categories in hyperspectral images, comprising the following steps:
[0010] Step S100: Unsupervised dimensionality reduction is performed on the input hyperspectral image data, and an autoencoder is used to reduce the spectral dimension while retaining discriminative information.
[0011] Step S100 includes the following sub-steps:
[0012] Step S101: Input the hyperspectral data X of size C×H×W, where C is the number of spectral bands and H and W are the height and width of the hyperspectral image in pixels. Apply Z-score normalization to each band:
[0013] $x' = \dfrac{x - \mu}{\sigma}$
[0014] where x is the original spectral pixel value, μ is the mean pixel value of the band, σ is its standard deviation, and x' is the standardized spectral pixel value. After standardization, each band satisfies:
[0015] $\mu' = 0,\quad \sigma' = 1$
[0016] Step S102: The low-dimensional feature $h$ produced by the encoder is expressed as:
[0017] $h = \mathrm{SatLin}(W_e s + b_e)$, where s is the spectral vector input to the autoencoder, $W_e$ is the encoder weight matrix, $b_e$ is the encoder bias vector, and SatLin is the nonlinear activation function. SatLin is a piecewise function: the output is 0 when the input is less than or equal to 0, equals the input when the input is between 0 and 1, and is 1 when the input is greater than or equal to 1.
[0018] Step S103: Train the network by minimizing the loss function of the autoencoder so that the encoder retains discriminative information while reducing the spectral dimension. When the loss function converges, take the m-dimensional output of the encoder as the effective data after dimensionality reduction.
[0019] Step S200: Input the dimensionality-reduced data into a deep classification network, extract spatial information through a two-dimensional convolutional layer, and adaptively acquire spectral features by combining a learnable nonlinear spectral attention mechanism.
[0020] Step S200 includes the following sub-steps:
[0021] S201: Spatial features of the hyperspectral image are extracted by three successive convolutional layers. The sizes of the three convolutions are 30×64×3×3, 64×128×3×3, and 128×256×3×3, respectively. The output feature maps are denoted F1, F2, and F3, and are computed as follows:
[0022] $F_1 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(X_r))$
[0023] $F_2 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_1))$
[0024] $F_3 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_2))$
[0025] where $X_r$ is the dimensionality-reduced input data and ReLU is the activation function;
[0026] S202: F3 is input into the learnable nonlinear spectral attention module, which further exploits the spectral features of the hyperspectral image to obtain the feature map F with enhanced spectral information;
[0027] S203: Finally, F1 is joined with F through a residual connection to obtain the feature map $F_{out}$, which integrates spatial and spectral information; this avoids the loss of initial information and improves training stability;
[0028] Step S300: Diagonal-enhanced second-order pooling (SOP) is applied to the fused feature map to obtain higher-order, more discriminative features of the hyperspectral image.
[0029] The diagonal-enhanced second-order pooling operation is defined as follows:
[0030] $Y = F_{out}F_{out}^{\top} + \epsilon I$
[0031] where Y is the output feature map obtained after diagonal-enhanced second-order pooling, $F_{out}$ is the input to the second-order pooling, I is the identity matrix, and ε is an introduced bias that enhances the diagonal of the second-order matrix, thereby increasing the contribution of low-variance channels and avoiding over-reliance on high-variance channels.
[0032] Step S400: The feature map Y produced by the diagonal-enhanced second-order pooling module is input into three linear fully connected layers to output the final hyperspectral image classification result.
[0033] In step S103, the joint loss function L is defined as follows:
[0034] $L = \dfrac{1}{N}\sum_{n=1}^{N}\lVert s_n - \hat{s}_n\rVert_2^2 + \lambda\left(\lVert W_e\rVert_F^2 + \lVert W_d\rVert_F^2\right) + \beta\sum_{j}\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j)$
[0035] $\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j) = \rho\log\dfrac{\rho}{\hat{\rho}_j} + (1-\rho)\log\dfrac{1-\rho}{1-\hat{\rho}_j}$
[0036] The total loss L of the autoencoder is the weighted sum of the mean squared error (MSE), an L2-norm regularization term, and a sparsity constraint. The MSE is the average, over all input samples, of the squared error between each original spectral vector and its reconstructed spectral vector; it measures the difference between the original input and the decoder's reconstructed output. The L2 regularization term constrains the Frobenius norms of the encoder and decoder weight matrices. Here $s_n$ is the original input spectral vector, $\hat{s}_n$ is the reconstructed output generated by the decoder, and N = H × W is the total number of pixels in the input feature map. $W_e$ and $W_d$ are the weight matrices of the encoder and decoder, respectively; λ is the L2 regularization coefficient; β is the sparsity constraint weight; ρ is the target sparsity level; $\hat{\rho}_j$ is the average activation value of the j-th neuron; and KL is the Kullback-Leibler divergence.
[0037] In step S202, the specific operations of the learnable nonlinear spectral attention mechanism include:
[0038] S2021: A Gaussian kernel function is used to compute the similarity $S_{ij}$ between different spectral channels, which effectively captures the nonlinear correlation between spectral channels:
[0039] $S_{ij} = \exp\left(-\dfrac{\lVert f_i - f_j\rVert^2}{2\sigma^2}\right)$
[0040] where $f_i$ and $f_j$ are the spectral vectors of any two spectral channels i and j, respectively, and σ is a bandwidth parameter that controls the rate at which similarity decays with distance; it is set to 1 here.
[0041] S2022: A learnable query weight vector q is constructed through a 1×1 convolution and average pooling (AvgPool) to characterize the preference of the ideal reference spectral direction for each channel. The specific process is as follows:
[0042] $q = \mathrm{AvgPool}(\mathrm{Conv}_{1\times 1}(F_3))$
[0043] S2023: Global inter-channel similarity information and channel-level information are aggregated to generate reference vectors. For each spectral channel p, the attention weight is obtained from the inner product of the p-th row vector $S_p$ of S and q. The specific calculation formula is as follows:
[0044] $w_p = \mathrm{sigmoid}\left(k\,\langle S_p, q\rangle + b\right)$
[0045] where k and b are the learnable weight parameter and bias term, respectively, and the sigmoid function normalizes the attention weights.
[0046] S2024: The output feature map F of the learnable nonlinear spectral attention module is computed as follows:
[0047]
[0048] Where w = [w1, w2, ..., wr] is the weight vector, and r is the number of channels after 3 convolutional layers. F1 is the unweighted feature map, and F2 is the input vector of the spectral attention module.
[0049] In view of the above technical features, the present invention has the following beneficial effects:
1. The present invention adopts unsupervised autoencoder dimensionality reduction, which reduces the hyperspectral dimension while retaining the most effective information by minimizing the reconstruction loss function; a two-dimensional residual network is then established that includes a learnable nonlinear spectral attention mechanism and diagonal-enhanced second-order pooling, making effective use of both spectral and spatial information. The diagonal-enhanced second-order pooling enables the use of high-order discriminative features.
2. The present invention effectively addresses information redundancy and achieves better classification accuracy with lower computational complexity, with a particularly significant improvement in average accuracy; at the same time, it enables the model to maintain robust classification performance on imbalanced datasets with uneven class distribution.
Attached Figure Description
[0050] Figure 1 is a flowchart of the hyperspectral image classification method based on spectral attention and enhanced second-order pooling according to the present invention;
[0051] Figure 2 is a schematic diagram of the overall framework for hyperspectral classification of the present invention;
[0052] Figure 3 is a schematic diagram of the autoencoder for unsupervised dimensionality reduction according to the present invention;
[0053] Figure 4 is a schematic diagram of the learnable nonlinear attention mechanism of the present invention;
[0054] Figure 5 is a schematic diagram of the second-order pooling of the present invention;
[0055] Figure 6 is a schematic diagram of the classification results of the present invention on the Indian Pines dataset;
[0056] Figure 7 is the ground-truth map of Indian Pines;
[0057] Figure 8 shows the SSRN classification results;
[0058] Figure 9 shows the DPRN classification results;
[0059] Figure 10 shows the HybridSN classification results;
[0060] Figure 11 shows the MCNN-CP classification results;
[0061] Figure 12 shows the SpectralFormer classification results;
[0062] Figure 13 shows the SSFTT classification results;
[0063] Figure 14 shows the MorphFormer classification results.
Detailed Implementation
[0064] The present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be understood that these drawings and embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that some components well-known to those skilled in the art but not related to the main content of the present invention may be omitted in the drawings or description. Additionally, for ease of description, some components in the drawings may be omitted, enlarged, or reduced, but this does not represent the actual size or complete structure of the product.
[0065] A hyperspectral image classification method based on spectral attention and enhanced second-order pooling is proposed for identifying land cover categories in hyperspectral images. As shown in Figures 1 and 2, it includes the following steps:
[0066] Step S100: Unsupervised dimensionality reduction is performed on the input hyperspectral image data, and an autoencoder is used to reduce the spectral dimension while retaining discriminative information.
[0067] Step S100 includes the following sub-steps:
[0068] Step S101: Input the hyperspectral data X of size C×H×W, where C is the number of spectral bands and H and W are the height and width of the hyperspectral image in pixels. Apply Z-score normalization to each band:
[0069] $x' = \dfrac{x - \mu}{\sigma}$
[0070] where x is the original spectral pixel value, μ is the mean pixel value of the band, σ is its standard deviation, and x' is the standardized spectral pixel value. After standardization, each band satisfies:
[0071] $\mu' = 0,\quad \sigma' = 1$
[0072] Step S102: The low-dimensional feature $h$ produced by the encoder is expressed as:
[0073] $h = \mathrm{SatLin}(W_e s + b_e)$, where s is the spectral vector input to the autoencoder, $W_e$ is the encoder weight matrix, $b_e$ is the encoder bias vector, and SatLin is the nonlinear activation function. SatLin is a piecewise function: the output is 0 when the input is less than or equal to 0, equals the input when the input is between 0 and 1, and is 1 when the input is greater than or equal to 1.
[0074] Step S103: Train the network by minimizing the loss function of the autoencoder so that the encoder retains discriminative information while reducing the spectral dimension. When the loss function converges, take the m-dimensional output of the encoder as the effective data after dimensionality reduction.
[0075] The joint loss function L is defined as:
[0076] $L = \dfrac{1}{N}\sum_{n=1}^{N}\lVert s_n - \hat{s}_n\rVert_2^2 + \lambda\left(\lVert W_e\rVert_F^2 + \lVert W_d\rVert_F^2\right) + \beta\sum_{j}\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j)$
[0077] $\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j) = \rho\log\dfrac{\rho}{\hat{\rho}_j} + (1-\rho)\log\dfrac{1-\rho}{1-\hat{\rho}_j}$
[0078] The total loss L of the autoencoder is the weighted sum of the mean squared error (MSE), an L2-norm regularization term, and a sparsity constraint. The MSE is the average, over all input samples, of the squared error between each original spectral vector and its reconstructed spectral vector; it measures the difference between the original input and the decoder's reconstructed output. The L2 regularization term constrains the Frobenius norms of the encoder and decoder weight matrices. Here $s_n$ is the original input spectral vector, $\hat{s}_n$ is the reconstructed output generated by the decoder, and N = H × W is the total number of pixels in the input feature map. $W_e$ and $W_d$ are the weight matrices of the encoder and decoder, respectively; λ is the L2 regularization coefficient; β is the sparsity constraint weight; ρ is the target sparsity level; $\hat{\rho}_j$ is the average activation value of the j-th neuron; and KL is the Kullback-Leibler divergence.
[0079] Step S200: Input the dimensionality-reduced data into a deep classification network, extract spatial information through a two-dimensional convolutional layer, and adaptively acquire spectral features by combining a learnable nonlinear spectral attention mechanism.
[0080] Step S200 includes the following sub-steps:
[0081] S201: Spatial features of the hyperspectral image are extracted by three successive convolutional layers. The sizes of the three convolutions are 30×64×3×3, 64×128×3×3, and 128×256×3×3, respectively. The output feature maps are denoted F1, F2, and F3, and are computed as follows:
[0082] $F_1 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(X_r))$
[0083] $F_2 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_1))$
[0084] $F_3 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_2))$
[0085] where $X_r$ is the dimensionality-reduced input data and ReLU is the activation function;
[0086] S202: F3 is input into the learnable nonlinear spectral attention module, which further exploits the spectral features of the hyperspectral image to obtain the feature map F with enhanced spectral information;
[0087] The specific operations of the learnable nonlinear spectral attention mechanism include:
[0088] S2021: A Gaussian kernel function is used to compute the similarity $S_{ij}$ between different spectral channels, which effectively captures the nonlinear correlation between spectral channels:
[0089] $S_{ij} = \exp\left(-\dfrac{\lVert f_i - f_j\rVert^2}{2\sigma^2}\right)$
[0090] where $f_i$ and $f_j$ are the spectral vectors of any two spectral channels i and j, respectively, and σ is a bandwidth parameter that controls the rate at which similarity decays with distance; it is set to 1 here.
[0091] S2022: A learnable query weight vector q is constructed through a 1×1 convolution and average pooling (AvgPool) to characterize the preference of the ideal reference spectral direction for each channel. The specific process is as follows:
[0092] $q = \mathrm{AvgPool}(\mathrm{Conv}_{1\times 1}(F_3))$
[0093] S2023: Global inter-channel similarity information and channel-level information are aggregated to generate reference vectors. For each spectral channel p, the attention weight is obtained from the inner product of the p-th row vector $S_p$ of S and q. The specific calculation formula is as follows:
[0094] $w_p = \mathrm{sigmoid}\left(k\,\langle S_p, q\rangle + b\right)$
[0095] where k and b are the learnable weight parameter and bias term, respectively, and the sigmoid function normalizes the attention weights.
[0096] S2024: The output feature map F of the learnable nonlinear spectral attention module is computed as follows:
[0097]
[0098] Where w = [w1, w2, ..., wr] is the weight vector, and r is the number of channels after 3 convolutional layers. F1 is the unweighted feature map, and F2 is the input vector of the spectral attention module.
[0099] S203: Finally, F1 is joined with F through a residual connection to obtain the feature map $F_{out}$, which integrates spatial and spectral information; this avoids the loss of initial information and improves training stability;
[0100] Step S300: Diagonal-enhanced second-order pooling (SOP) is applied to the fused feature map to obtain higher-order, more discriminative features of the hyperspectral image.
[0101] The diagonal-enhanced second-order pooling operation is defined as follows:
[0102] $Y = F_{out}F_{out}^{\top} + \epsilon I$
[0103] where Y is the output feature map obtained after diagonal-enhanced second-order pooling, $F_{out}$ is the input to the second-order pooling, I is the identity matrix, and ε is an introduced bias that enhances the diagonal of the second-order matrix, thereby increasing the contribution of low-variance channels and avoiding over-reliance on high-variance channels.
[0104] Step S400: The feature map Y produced by the diagonal-enhanced second-order pooling module is input into three linear fully connected layers to output the final hyperspectral image classification result.
[0105] Taking a hyperspectral image as an example, the specific implementation steps of a preferred embodiment of the present invention for image classification are described.
[0106] Step S100: Unsupervised dimensionality reduction is performed on the input hyperspectral image data, and an autoencoder is used to reduce the spectral dimension while retaining discriminative information.
[0107] The Indian Pines dataset was selected for testing. Acquired by the Airborne Visible / Infrared Imaging Spectroradiometer (AVIRIS) sensor, this dataset contains a single 145×145 pixel image with a spatial resolution of approximately 20 meters. Spectrally, the dataset initially contained 224 spectral bands, covering a wavelength range of 400 nm to 2500 nm. Due to severe noise and water vapor absorption in some bands, only 200 effective bands were retained after preprocessing (C=200, H=145, W=145). The dataset contains 16 land cover categories.
[0108] The original hyperspectral image is input into an autoencoder for unsupervised dimensionality reduction. Once the reconstruction loss has fallen to a low value, the features output by the encoder are taken as the effective data after dimensionality reduction. Here, m=30.
[0109] As shown in Figure 3, unsupervised dimensionality reduction of hyperspectral images using an autoencoder includes the following specific steps:
[0110] Step S101, data standardization;
[0111] Z-score normalization is performed on each of the 200 spectral bands (channels) of the Indian Pines dataset:
[0112] $x' = \dfrac{x - \mu}{\sigma}$
[0113] where x is the original spectral pixel value, μ is the mean pixel value of the band, σ is its standard deviation, and x' is the standardized spectral pixel value. After standardization, each band satisfies:
[0114] $\mu' = 0,\quad \sigma' = 1$
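The per-band standardization of step S101 can be illustrated with a short NumPy sketch. The function name `zscore_per_band`, the small stabilizing constant `eps`, and the random cube standing in for the 200×145×145 Indian Pines data are assumptions made for illustration; they do not come from the patent text.

```python
import numpy as np

def zscore_per_band(cube, eps=1e-12):
    """Standardize each band of a (C, H, W) cube to zero mean and unit std.

    eps is a hypothetical safeguard against constant bands (sigma == 0).
    """
    mu = cube.mean(axis=(1, 2), keepdims=True)      # per-band mean
    sigma = cube.std(axis=(1, 2), keepdims=True)    # per-band standard deviation
    return (cube - mu) / (sigma + eps)

# Synthetic stand-in with the Indian Pines dimensions (C=200, H=145, W=145).
rng = np.random.default_rng(0)
cube = rng.normal(loc=5.0, scale=3.0, size=(200, 145, 145))
cube_std = zscore_per_band(cube)
```

After the call, every band of `cube_std` has mean approximately 0 and standard deviation approximately 1, which is the condition stated for the standardized data.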
[0115] Step S102: The low-dimensional feature $h$ produced by the encoder is expressed as:
[0116] $h = \mathrm{SatLin}(W_e s + b_e)$, where s is the spectral vector input to the autoencoder, $W_e$ is the encoder weight matrix, $b_e$ is the encoder bias vector, and SatLin is the nonlinear activation function. SatLin is a piecewise function: the output is 0 when the input is less than or equal to 0, equals the input when the input is between 0 and 1, and is 1 when the input is greater than or equal to 1.
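As a rough illustration of step S102, the sketch below implements SatLin exactly as described (0 below 0, identity on [0, 1], 1 above 1) and applies a single encoder layer to a few spectral vectors. The symbols `W_e` and `b_e`, the random initial weights, and the single-layer form of the encoder are assumptions; the patent text does not state the encoder depth or initialization.

```python
import numpy as np

def satlin(z):
    """Saturating linear activation: 0 for z <= 0, z on (0, 1), 1 for z >= 1."""
    return np.clip(z, 0.0, 1.0)

def encode(S, W_e, b_e):
    """Map spectral vectors S of shape (N, C) to features of shape (N, m)."""
    return satlin(S @ W_e.T + b_e)

rng = np.random.default_rng(0)
C, m = 200, 30                                  # 200 bands reduced to m = 30
S = rng.standard_normal((5, C))                 # five standardized pixels (synthetic)
W_e = 0.05 * rng.standard_normal((m, C))        # hypothetical untrained weights
b_e = np.zeros(m)
H = encode(S, W_e, b_e)                         # shape (5, 30), values in [0, 1]
```

Because SatLin bounds every activation to [0, 1], the average activation of each hidden neuron can be compared directly with a target sparsity level in the loss of step S103.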
[0117] Step S103: The network is trained by minimizing the loss function of the autoencoder, so that the encoder retains discriminative information while reducing the spectral dimension. When the loss function converges, the 30-dimensional output of the encoder is used as the effective data after dimensionality reduction.
[0118] Furthermore, in step S103, the autoencoder achieves unsupervised dimensionality reduction by reducing the reconstruction loss, thereby obtaining effective information. The overall loss L of the autoencoder is defined as:
[0119] $L = \dfrac{1}{N}\sum_{n=1}^{N}\lVert s_n - \hat{s}_n\rVert_2^2 + \lambda\left(\lVert W_e\rVert_F^2 + \lVert W_d\rVert_F^2\right) + \beta\sum_{j}\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j)$
[0120] $\mathrm{KL}(\rho\,\Vert\,\hat{\rho}_j) = \rho\log\dfrac{\rho}{\hat{\rho}_j} + (1-\rho)\log\dfrac{1-\rho}{1-\hat{\rho}_j}$
[0121] The overall loss L of the autoencoder is the weighted sum of the mean squared error (MSE), an L2-norm regularization term, and a sparsity constraint. The MSE is the average, over all input samples, of the squared error between each original spectral vector and its reconstructed spectral vector; it measures the difference between the original input and the decoder's reconstructed output. The L2 regularization term constrains the Frobenius norms of the encoder and decoder weight matrices. Here $s_n$ is the original input spectral vector, $\hat{s}_n$ is the reconstructed output generated by the decoder, and N = H × W is the total number of pixels in the input feature map. $W_e$ and $W_d$ are the weight matrices of the encoder and decoder, respectively; λ is the L2 regularization coefficient; β is the sparsity constraint weight; ρ is the target sparsity level; $\hat{\rho}_j$ is the average activation value of the j-th neuron; and KL is the Kullback-Leibler divergence.
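The three-term loss described above (reconstruction MSE, L2 weight penalty, KL sparsity penalty) can be sketched as follows. This is a minimal sketch under stated assumptions: the Bernoulli form of the KL term is the one commonly used for sparse autoencoders, the exact scaling of each term is not recoverable from the text, and the argument names `lam`, `beta`, and `rho` as well as the clipping constant `tiny` are hypothetical. The source does not give numerical values for the coefficients.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def autoencoder_loss(S, S_hat, H, W_e, W_d, lam, beta, rho, tiny=1e-8):
    """MSE + lam * L2 weight penalty + beta * sparsity penalty.

    S, S_hat: (N, C) inputs and reconstructions; H: (N, m) encoder activations.
    """
    mse = np.mean(np.sum((S - S_hat) ** 2, axis=1))        # mean squared error per sample
    l2 = np.sum(W_e ** 2) + np.sum(W_d ** 2)               # squared Frobenius norms
    rho_hat = np.clip(H.mean(axis=0), tiny, 1 - tiny)      # average activation per neuron
    sparsity = np.sum(kl_bernoulli(rho, rho_hat))
    return mse + lam * l2 + beta * sparsity

# Degenerate check: perfect reconstruction, zero weights, activations at the target level.
S = np.ones((4, 3))
H_target = np.full((4, 2), 0.05)
zero_w = np.zeros((2, 3))
loss_zero = autoencoder_loss(S, S, H_target, zero_w, zero_w.T, lam=1e-3, beta=1.0, rho=0.05)
loss_bad = autoencoder_loss(S, S + 1.0, H_target, zero_w, zero_w.T, lam=1e-3, beta=1.0, rho=0.05)
```

In the degenerate case every term vanishes, so `loss_zero` is 0; shifting the reconstruction by 1 in each of the 3 bands raises the MSE term to 3.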
[0122] Step S200: Input the dimensionality-reduced data into a deep classification network, extract spatial information through a two-dimensional convolutional layer, and adaptively acquire spectral features by combining a learnable nonlinear spectral attention mechanism.
[0123] 5% of the data was randomly selected as the training data, and the rest was used as the test set. The network hyperparameters were set as follows: batch size was set to 64, learning rate was set to 0.001 based on existing research experience, and the number of training epochs was fixed at 200.
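The 5% random training split can be sketched as below. The text does not say whether sampling is stratified by class or how unlabeled pixels are handled, so this sketch assumes a plain random draw over labeled pixels and the common convention that label 0 marks unlabeled background; the label map itself is synthetic and the function name `random_split` is hypothetical.

```python
import numpy as np

def random_split(labels, train_fraction=0.05, seed=0):
    """Return flat indices of training and test pixels among labeled pixels."""
    labeled = np.flatnonzero(labels.ravel() > 0)     # assumption: 0 = unlabeled
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(labeled)
    n_train = int(round(train_fraction * shuffled.size))
    return shuffled[:n_train], shuffled[n_train:]

# Synthetic 145x145 label map with 16 classes plus background (label 0).
labels = np.random.default_rng(1).integers(0, 17, size=(145, 145))
train_idx, test_idx = random_split(labels)
```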
[0124] The input training image has size 30×145×145. Spatial features are obtained through 2D convolutions. The sizes of the three convolutions are 30×64×3×3, 64×128×3×3, and 128×256×3×3, respectively, where the first two dimensions are the number of input channels and the number of convolution kernels, and 3×3 is the kernel size used for spatial feature extraction. Spectral features are then obtained through the learnable nonlinear spectral attention mechanism. Finally, these features are fused with the feature map from the first convolutional layer to obtain the final feature map $F_{out}$.
[0125] The specific steps are as follows:
[0126] Step S201: Spatial features of the hyperspectral image are extracted by three successive convolutional layers. The sizes of the three convolutions are 30×64×3×3, 64×128×3×3, and 128×256×3×3, respectively. The output feature maps are denoted F1, F2, and F3, and are computed as follows:
[0127] $F_1 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(X_r))$
[0128] $F_2 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_1))$
[0129] $F_3 = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(F_2))$
[0130] where $X_r$ is the dimensionality-reduced input data and ReLU is the activation function;
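To make the channel progression 30 → 64 → 128 → 256 concrete, the following NumPy sketch chains three 3×3 convolutions with ReLU. Several details are assumptions not stated in the text: zero padding of 1 and stride 1 (so the spatial size is preserved), the absence of bias and normalization layers, the 9×9 patch size used to keep the example fast, and the weight layout (out_channels, in_channels, 3, 3), which reorders the patent's in×out×3×3 notation. The weights are random and untrained.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3_relu(x, weight):
    """3x3 convolution (cross-correlation, zero padding 1, stride 1) followed by ReLU.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3); returns (C_out, H, W).
    """
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))                    # zero padding
    windows = sliding_window_view(padded, (3, 3), axis=(1, 2))      # (C_in, H, W, 3, 3)
    out = np.einsum("chwij,ocij->ohw", windows, weight)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
X_r = rng.standard_normal((30, 9, 9))           # hypothetical 9x9 patch, 30 reduced bands
W1 = 0.05 * rng.standard_normal((64, 30, 3, 3))
W2 = 0.05 * rng.standard_normal((128, 64, 3, 3))
W3 = 0.05 * rng.standard_normal((256, 128, 3, 3))

F1 = conv3x3_relu(X_r, W1)                      # (64, 9, 9)
F2 = conv3x3_relu(F1, W2)                       # (128, 9, 9)
F3 = conv3x3_relu(F2, W3)                       # (256, 9, 9)
```

Note that F1 has 64 channels while F3 has 256, so any later fusion of F1 with the attention output has to reconcile the channel counts, for example by concatenation along the channel axis.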
[0131] S202, the feature map F3 obtained by the convolutional layer is input into the learnable nonlinear spectral attention module, and the spectral features of the hyperspectral image are further utilized to obtain the feature map F after the spectral information is enhanced;
[0132] like Figure 4 The specific operations for obtaining the feature map F include:
[0133] S2021 uses a Gaussian kernel function to calculate the similarity between different spectral channels, which can effectively capture the nonlinear correlation between spectral channels. :
[0134]
[0135] in, and Let i and j represent the spectral vectors of any two spectral channels i and j, respectively. This is a bandwidth parameter used to control the rate at which similarity decays with distance; it is set to 1 here.
[0136] S2022, a learnable query weight vector q is constructed through 1×1 convolution and average pooling (AvgPool) to characterize the preference of the ideal reference spectral direction for each channel. The specific process is as follows:
[0137]
[0138] S2023, aggregates global inter-channel similarity information and channel-level information to generate reference vectors. For each spectral channel p, use The p-th row vector and The attention weights are obtained by calculating the inner product. The specific calculation formula is as follows:
[0139]
[0140] Where k and b represent the learnable weight parameters and bias terms, respectively, and the sigmoid function normalizes the attention weights.
[0141] Step S2024: The output feature map F of the learnable nonlinear spectral attention module is calculated as follows:
[0142]
[0143] Where w = [w1, w2, ..., wr] is the attention weight vector, and r is the number of channels after the three convolutional layers (256 here). F1 is the unweighted feature map, and F2 is the input vector of the spectral attention module.
[0144] Step S203: Finally, F1 is concatenated with F through a residual connection to obtain the feature map Fout, which integrates spatial and spectral information. This avoids the loss of initial information and improves the stability of training.
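As a non-limiting illustration, steps S2024 and S203 may be sketched as follows. The output formula is not reproduced in the text above, so channel-wise scaling of F3 by the attention weights is an assumption, as is reading the fusion as concatenation along the channel axis; the function names are hypothetical.

```python
import numpy as np

def reweight_channels(f3, w):
    """Scale channel p of f3 by its attention weight w[p] (assumed form of F)."""
    return f3 * w[:, None, None]

def fuse(f1, f):
    """Concatenate the first-layer map and the attention output along channels."""
    return np.concatenate([f1, f], axis=0)

f1 = np.ones((64, 5, 5))            # placeholder for the first conv output
f3 = np.ones((256, 5, 5))           # placeholder for the third conv output
w = np.linspace(0.0, 1.0, 256)      # placeholder attention weights
f_out = fuse(f1, reweight_channels(f3, w))
print(f_out.shape)
```

With the channel counts given above, concatenation yields 64 + 256 = 320 channels. An element-wise residual sum would instead require F1 and F to have the same number of channels, for example through an additional projection.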
[0145] Step S300: As shown in Figure 5, high-order discriminative features are obtained using diagonal-enhanced second-order pooling. The specific operation is as follows:
[0146]
[0147] Here, the left-hand side is the output feature map obtained after diagonal-enhanced second-order pooling; the input to the second-order pooling is Fout (the result of fusing the output feature map of the preceding spectral attention layer with the feature map of the first convolutional layer); and I is the identity matrix. The introduced bias value enhances the diagonal of the second-order matrix, thereby increasing the contribution of low-variance channels and avoiding over-reliance on high-variance channels. Here, it is set to 0.01.
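As a non-limiting illustration, diagonal-enhanced second-order pooling may be sketched as follows. The formula is not reproduced in the text above, so the covariance-style form (1/N) * X * X^T + eps * I, the optional mean-centering, and the name `diag_enhanced_sop` are assumptions; only the identity matrix I and the bias value 0.01 come from the description.

```python
import numpy as np

def diag_enhanced_sop(f_out, eps=0.01, center=True):
    """Second-order pooling with a diagonal bias.

    f_out: (c, H, W) fused feature map.
    Returns the (c, c) matrix (1/N) * X @ X.T + eps * I, where X is the
    (c, N) matrix of per-pixel feature vectors and N = H * W.
    Mean-centering (center=True) and the 1/N scaling are assumptions.
    """
    c = f_out.shape[0]
    x = f_out.reshape(c, -1)
    if center:
        x = x - x.mean(axis=1, keepdims=True)
    n = x.shape[1]
    return x @ x.T / n + eps * np.eye(c)
```

Adding eps * I raises every eigenvalue of the second-order matrix by eps, so the result stays positive definite even when the number of pixels N is smaller than the number of channels c, and a channel with zero variance still contributes eps on the diagonal.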
[0148] Step S400: The output feature map of the diagonal-enhanced second-order pooling is input into three fully connected (FC) linear layers to obtain the final hyperspectral classification results for the Indian Pines dataset.
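As a non-limiting illustration, the classification head of step S400 may be sketched as follows. Flattening the pooled matrix, the ReLU activations between layers, the softmax output, the hidden-layer widths, and the names `init_layers` and `classify` are assumptions; only the use of three fully connected layers comes from the description, and the 16 outputs correspond to the 16 Indian Pines classes listed in Table 1.

```python
import numpy as np

def init_layers(rng, dims):
    """Random placeholder weights for consecutive linear layers of sizes dims."""
    return [
        (rng.normal(0.0, 0.1, size=(d_out, d_in)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def classify(sop_matrix, layers):
    """Flatten the pooled matrix, apply the linear layers, return class probabilities."""
    h = sop_matrix.reshape(-1)
    for i, (weight, bias) in enumerate(layers):
        h = weight @ h + bias
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU between layers (assumed)
    h = h - h.max()                 # stabilise the softmax
    exp_h = np.exp(h)
    return exp_h / exp_h.sum()

rng = np.random.default_rng(0)
pooled = rng.normal(size=(8, 8))                  # toy 8x8 second-order matrix
layers = init_layers(rng, (64, 32, 16, 16))       # hypothetical layer widths
probs = classify(pooled, layers)
predicted_class = int(np.argmax(probs))
```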
[0149] To verify the performance of the method of this invention, the above classification results were compared with those of seven representative classification models: SSRN, DPRN, HybridSN, MCNN-CP, SpectralFormer, SSFTT, and MorphFormer. The results are shown in Figures 6-14 and Table 1.
[0150] The evaluation metrics were Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. To minimize the influence of random factors, all experiments were independently repeated five times. This invention outperformed all compared methods, achieving the highest OA, AA, and Kappa. The proposed method achieved 100% classification accuracy in classes 7, 8, 9, 14, and 16. It also achieved the highest classification accuracy among all methods in classes 2, 3, and 6, reaching 98.08%, 98.48%, and 99.72%, respectively. Compared to the second-ranked HybridSN method, the proposed method improved OA and Kappa by 0.73 and 0.83 percentage points, respectively. Notably, its AA was 4.44 percentage points higher than that of HybridSN, indicating that this invention achieves more balanced classification performance across classes. The visualization results show that the present invention produces the most accurate classification maps: its predictions are highly consistent with the ground-truth annotations, with clear category boundaries, complete regional structures, and few noisy pixels.
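As a non-limiting illustration, the three evaluation metrics may be computed from a confusion matrix as sketched below. The function name and the toy confusion matrix are hypothetical, and the sketch assumes every class appears at least once in the test set.

```python
import numpy as np

def classification_metrics(confusion):
    """OA, AA and Cohen's kappa from a confusion matrix.

    confusion[i, j] = number of test pixels of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    oa = np.trace(confusion) / total                        # fraction of correct pixels
    per_class = np.diag(confusion) / confusion.sum(axis=1)  # per-class accuracy
    aa = per_class.mean()                                   # unweighted mean over classes
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)              # agreement beyond chance
    return oa, aa, kappa, per_class

# Toy two-class example with a small minority class (values are illustrative only).
oa, aa, kappa, per_class = classification_metrics([[90, 10], [5, 5]])
```

In the toy example the minority class is only 50% correct, which lowers AA to 0.70 while OA stays near 0.86. This is why a gain in AA is read as more balanced performance across classes.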
[0151] Table 1. Classification results of Indian Pines
[0152]
| Category | SSRN | DPRN | HybridSN | MCNN-CP | SpectralFormer | SSFTT | MorphFormer | This invention |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 100.00 | 93.18 | 84.09 | 0 | 12.20 | 100.00 | 87.80 | 97.78 |
| 2 | 69.66 | 91.82 | 94.03 | 97.18 | 76.73 | 91.13 | 94.86 | 98.08 |
| 3 | 76.36 | 97.21 | 91.89 | 95.24 | 65.06 | 87.68 | 96.52 | 98.48 |
| 4 | 95.79 | 72.00 | 99.56 | 83.33 | 64.32 | 81.22 | 91.08 | 98.25 |
| 5 | 92.02 | 99.56 | 96.73 | 100.00 | 81.84 | 98.16 | 96.55 | 91.83 |
| 6 | 98.82 | 98.41 | 98.56 | 97.22 | 94.82 | 99.70 | 98.33 | 99.72 |
| 7 | 0 | 48.15 | 70.37 | 0 | 12.00 | 100.00 | 84.00 | 100.00 |
| 8 | 95.66 | 100.00 | 99.56 | 95.83 | 97.91 | 99.07 | 99.77 | 100.00 |
| 9 | 0 | 52.63 | 36.84 | 100.00 | 0 | 66.67 | 16.67 | 100.00 |
| 10 | 71.40 | 91.55 | 93.61 | 93.88 | 79.09 | 84.57 | 93.49 | 90.66 |
| 11 | 98.90 | 96.70 | 95.54 | 97.56 | 87.78 | 98.28 | 98.37 | 96.38 |
| 12 | 96.13 | 86.50 | 93.07 | 100.00 | 51.69 | 85.39 | 85.39 | 96.78 |
| 13 | 97.41 | 95.90 | 100.00 | 90.00 | 98.38 | 99.46 | 98.38 | 99.49 |
| 14 | 97.98 | 97.17 | 99.83 | 96.83 | 96.75 | 96.66 | 97.98 | 100.00 |
| 15 | 97.42 | 92.10 | 99.18 | 84.21 | 70.61 | 79.83 | 86.74 | 86.36 |
| 16 | 79.44 | 97.73 | 98.86 | 100.00 | 100.00 | 79.76 | 100.00 | 100.00 |
| OA (%) | 94.18±0.65 | 94.61±0.81 | 95.68±0.48 | 94.96±0.46 | 80.05±1.13 | 92.80±0.96 | 95.53±0.45 | 96.41±0.23 |
| AA (%) | 89.42±0.53 | 89.62±1.99 | 92.13±1.16 | 83.19±0.78 | 68.33±2.11 | 89.86±1.64 | 89.04±1.08 | 96.57±0.53 |
| Kappa × 100 | 93.38±0.24 | 93.85±0.83 | 95.08±0.49 | 94.25±0.48 | 77.17±1.15 | 91.77±0.98 | 94.91±0.46 | 95.91±0.24 |
[0153] Ablation experiments were conducted to understand the internal mechanisms and key factors of the model. Three variants were constructed: Model 1, with the autoencoder-based unsupervised dimensionality reduction module removed; Model 2, with the spectral attention module removed; and Model 3, with the second-order pooling (SOP) module removed. The ablation experiments were performed on the Indian Pines dataset under the same experimental conditions, and the results are shown in Table 2. Removing the SOP module caused the largest drop: OA, AA, and Kappa decreased to 81.57%, 71.49%, and 78.80%, respectively. This result indicates that the SOP module significantly enhances the model's ability to extract discriminative features, thereby effectively improving classification performance. Removing the autoencoder-based unsupervised dimensionality reduction module resulted in the second-largest performance drop, with OA, AA, and Kappa decreasing by 3.20, 6.81, and 3.66 percentage points, respectively. The spectral attention mechanism also contributed to classification accuracy: when this module was removed, OA, AA, and Kappa decreased to 95.87%, 95.63%, and 95.30%, respectively (drops of 0.80, 1.56, and 0.91 percentage points). These results demonstrate that the learnable spectral attention mechanism captures discriminative spectral information.
[0154] Table 2 Ablation Experiment Results of the Model of the Invention
[0155]
| Model | OA (%) | AA (%) | Kappa × 100 |
| --- | --- | --- | --- |
| This invention | 96.67 | 97.19 | 96.21 |
| Model 1 | 93.47 | 90.38 | 92.55 |
| Model 2 | 95.87 | 95.63 | 95.30 |
| Model 3 | 81.57 | 71.49 | 78.80 |
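As a simple arithmetic check, the performance drops of each ablated variant relative to the full model can be recomputed from the values in Table 2. The dictionary keys and the helper name `drops` are chosen for this sketch only.

```python
# (OA, AA, Kappa x 100) copied from Table 2
results = {
    "This invention": (96.67, 97.19, 96.21),
    "Model 1 (no autoencoder)": (93.47, 90.38, 92.55),
    "Model 2 (no spectral attention)": (95.87, 95.63, 95.30),
    "Model 3 (no second-order pooling)": (81.57, 71.49, 78.80),
}

def drops(results, baseline="This invention"):
    """Percentage-point drop of every variant relative to the baseline."""
    base = results[baseline]
    return {
        name: tuple(round(b - v, 2) for b, v in zip(base, vals))
        for name, vals in results.items()
        if name != baseline
    }

for name, (d_oa, d_aa, d_kappa) in drops(results).items():
    print(f"{name}: OA -{d_oa}, AA -{d_aa}, Kappa -{d_kappa}")
```

In every variant the drop in AA exceeds the drop in OA, which suggests that each module matters most for the classes that are harder to classify.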
[0156] In the above embodiments, this invention utilizes an autoencoder-based unsupervised dimensionality reduction method to reduce the dimensionality of hyperspectral data, preserving effective information. It then employs a two-dimensional convolutional residual network incorporating a learnable nonlinear spectral attention mechanism and diagonally enhanced second-order pooling for hyperspectral image classification. This invention effectively improves classification accuracy, especially in cases of imbalanced sample class distribution, significantly enhancing the recognition performance of minority classes and improving the model's robustness and generalization ability.
[0157] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of the invention. All equivalent changes and modifications made within the scope of the claims of this invention should be considered within the technical scope of this invention.
Claims
1. A hyperspectral image classification method based on spectral attention and enhanced second-order pooling, for identifying land cover categories in hyperspectral images, characterized in that it comprises the following steps:
Step S100: Unsupervised dimensionality reduction is performed on the input hyperspectral image data, using an autoencoder to reduce the spectral dimension while retaining discriminative information. This step includes the following sub-steps:
Step S101: Input hyperspectral data X of size C×H×W, where C is the number of bands in the hyperspectral image, and H and W are its height and width in pixels, respectively. Perform Z-score normalization for each band: x' = (x − μ) / σ, where x is the original spectral pixel value, μ is the pixel mean of the band, σ is its standard deviation, and x' is the standardized spectral pixel value. After standardization, each band has zero mean and unit standard deviation.
Step S102: The low-dimensional feature output by the encoder is obtained by multiplying the spectral vector s input to the autoencoder by the encoder weight matrix, adding the encoder bias vector, and applying the SatLin nonlinear activation function. SatLin is a piecewise function: when the input is less than or equal to 0, the output is 0; when the input is between 0 and 1, the output is the input value itself; when the input is greater than or equal to 1, the output is 1.
Step S103: The network is trained by minimizing the loss function of the autoencoder so that the encoder retains discriminative information while reducing the spectral dimension. When the loss function converges, the m-dimensional output of the encoder is taken as the effective data after dimensionality reduction.
Step S200: The dimensionality-reduced data is input into a deep classification network, spatial information is extracted through two-dimensional convolutional layers, and spectral features are adaptively acquired by combining a learnable nonlinear spectral attention mechanism. This step includes the following sub-steps:
Step S201: Spatial features of the hyperspectral image are extracted sequentially through three convolutional layers with sizes 30×64×3×3, 64×128×3×3, and 128×256×3×3, whose output feature maps are denoted F1, F2, and F3, respectively, with ReLU as the activation function;
Step S202: F3 is input into the learnable nonlinear spectral attention module, which further exploits the spectral features of the hyperspectral image to obtain the feature map F with enhanced spectral information;
Step S203: Finally, F1 is concatenated with F through a residual connection to obtain the feature map Fout, which integrates spatial and spectral information; this avoids the loss of initial information and improves the stability of training.
Step S300: A diagonal-enhanced second-order pooling (SOP) operation is applied to the fused feature map to obtain higher-order, more discriminative features of the hyperspectral image. In the definition of this operation, the output is the feature map obtained after diagonal-enhanced second-order pooling, the input to the second-order pooling is Fout, and I is the identity matrix. The introduced bias enhances the diagonal of the second-order matrix, thereby increasing the contribution of low-variance channels and avoiding over-reliance on high-variance channels.
Step S400: The feature map output by the diagonal-enhanced second-order pooling module is input into three fully connected linear layers to output the final hyperspectral image classification result.
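As an illustrative sketch only, the per-band Z-score normalization of step S101 and the SatLin encoder of step S102 may be written as follows. The array layouts, the random placeholder weights, the reduced dimension m = 3, and the function names are assumptions; a band with zero variance would need separate handling.

```python
import numpy as np

def zscore_per_band(cube):
    """Standardise each band of a (C, H, W) cube to zero mean and unit std."""
    mean = cube.mean(axis=(1, 2), keepdims=True)
    std = cube.std(axis=(1, 2), keepdims=True)
    return (cube - mean) / std

def satlin(z):
    """Saturating linear activation: 0 for z <= 0, z on (0, 1), 1 for z >= 1."""
    return np.clip(z, 0.0, 1.0)

def encode(pixels, weight, bias):
    """pixels: (N, C) spectral vectors; weight: (m, C); bias: (m,). Returns (N, m)."""
    return satlin(pixels @ weight.T + bias)

rng = np.random.default_rng(0)
cube = 50.0 + 10.0 * rng.normal(size=(5, 4, 4))   # toy cube: C=5 bands, 4x4 pixels
normed = zscore_per_band(cube)
pixels = normed.reshape(5, -1).T                  # one spectral vector per pixel
codes = encode(pixels, rng.normal(size=(3, 5)), np.zeros(3))  # m=3 placeholder
```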
2. The hyperspectral image classification method as described in claim 1, characterized in that: in step S103, the joint loss function L of the autoencoder is the weighted sum of the mean squared error (MSE), an L2-norm regularization term, and a sparsity constraint. The MSE is the average, over all input samples, of the squared error between the original input spectral vector and the corresponding reconstructed spectral vector generated by the decoder, and measures the difference between the original input and the decoder's reconstructed output; N = H × W is the total number of pixels in the input feature map. The L2 regularization term constrains the Frobenius norms of the weight matrices of the encoder and the decoder and is weighted by the L2 regularization coefficient. The sparsity constraint is weighted by the sparsity constraint weight and uses the Kullback-Leibler (KL) divergence between the target sparsity level and the average activation value of the j-th neuron.
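As an illustrative sketch only, a joint loss of this kind may be written as follows. The formula is not reproduced in the text above, so the summation of the KL term over hidden neurons, the use of squared Frobenius norms, and the placeholder values of `lam`, `beta`, and `rho` are assumptions, as are the function names.

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_j) for Bernoulli distributions."""
    return np.sum(
        rho * np.log(rho / rho_hat)
        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    )

def autoencoder_loss(x, x_rec, hidden, w_enc, w_dec, lam=1e-3, beta=1.0, rho=0.05):
    """MSE + lam * L2 weight penalty + beta * KL sparsity penalty.

    x, x_rec: (N, C) inputs and reconstructions; hidden: (N, m) encoder outputs.
    lam, beta and rho are placeholder values, not taken from the patent.
    """
    mse = np.mean(np.sum((x - x_rec) ** 2, axis=1))       # average squared error per sample
    l2 = np.sum(w_enc ** 2) + np.sum(w_dec ** 2)          # squared Frobenius norms
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1.0 - 1e-8)  # average activation, avoids log(0)
    return mse + lam * l2 + beta * kl_sparsity(rho, rho_hat)
```

The KL term is zero only when every neuron's average activation equals the target sparsity level, so increasing `beta` pushes the SatLin outputs toward sparse codes.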
3. The hyperspectral image classification method as described in claim 1, characterized in that: in step S202, the specific operations of the learnable nonlinear spectral attention mechanism include:
Step S2021: A Gaussian kernel function is used to calculate the similarity between different spectral channels, which captures the nonlinear correlation between spectral channels; the inputs to the kernel are the spectral vectors of any two spectral channels i and j, and the bandwidth parameter, which controls the rate at which similarity decays with distance, is set to 1;
Step S2022: A learnable query weight vector q is constructed through a 1×1 convolution and average pooling (AvgPool) to characterize the preference of the ideal reference spectral direction for each channel;
Step S2023: Global inter-channel similarity information and channel-level information are aggregated to generate reference vectors. For each spectral channel p, the attention weight is obtained by calculating the inner product of the p-th row vector of the similarity matrix and the query vector q, where k and b are the learnable weight and bias term, respectively, and the sigmoid function normalizes the attention weights;
Step S2024: The output feature map F of the learnable nonlinear spectral attention module is calculated using the weight vector w = [w1, w2, ..., wr], where r is the number of channels after the three convolutional layers. F1 is the unweighted feature map, and F2 is the input vector of the spectral attention module.