A two-stage multi-scale hyperspectral snapshot compressive imaging image reconstruction method
By using a two-stage multi-scale Transformer network and a dual-window multi-scale multi-head self-attention mechanism, the problems of computational cost and insufficient feature capture in hyperspectral image reconstruction are solved, achieving efficient image reconstruction and detail restoration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV
- Filing Date
- 2024-03-04
- Publication Date
- 2026-06-16
AI Technical Summary
Existing hyperspectral image reconstruction methods are insufficient in terms of computational cost and capturing long-range dependencies and multi-scale features of images, making it difficult to achieve efficient and high-quality reconstruction.
A two-stage, multi-scale Transformer network architecture is adopted, which combines a dual-window, multi-scale, multi-head self-attention mechanism and conditional absolute position embedding. Through coarse feature extraction and fine pixel refinement stages, it captures long-distance dependencies and multi-scale features of the image.
It significantly improves the quality and efficiency of hyperspectral image reconstruction, enhances the robustness and generalization ability of the model, and is able to better recover image details and reduce artifacts.
Figure CN117974909B_ABST
Description
Technical Field
[0001] This invention belongs to the field of deep learning technology and relates to a two-stage, multi-scale hyperspectral snapshot compressive imaging image reconstruction method.
Background Technology
[0002] Snapshot compressive imaging systems reconstruct three-dimensional hyperspectral images from two-dimensional measurement data using computational spectral imaging algorithms, which effectively reduces imaging costs. Snapshot compressive imaging can prevent motion artifacts, significantly improve imaging efficiency, and facilitate the capture of dynamic scenes, and has therefore attracted widespread attention and research.
[0003] Among these snapshot compressive imaging systems, coded aperture snapshot spectral imaging is a highly efficient technique and a promising research direction. It modulates hyperspectral signals using a coded aperture and a disperser, then integrates and compresses the modulated signals into two-dimensional measurements. Coded aperture snapshot spectral imaging systems are efficient and simple to build. Furthermore, imaging quality can be significantly improved and adapted to different application scenarios by flexibly designing the image reconstruction algorithm. The core task is to solve the ill-posed inverse problem of recovering three-dimensional hyperspectral signals from two-dimensional measurements, which requires an efficient hyperspectral image reconstruction algorithm.
[0004] Traditional reconstruction algorithms rely on predefined hand-crafted priors, such as sparsity, total variation (TV), and low-rank decomposition, to regularize the reconstruction process, and obtain the result by solving a prior-regularized optimization problem. However, these methods often suffer from low reconstruction efficiency, poor generalization, and limited reconstruction quality. In recent years, deep learning-based methods have been widely applied. These methods achieve reconstruction by learning a mapping from compressed measurements to hyperspectral images. However, methods based on convolutional neural networks struggle to capture long-range dependencies in images and to reveal global properties. To address this issue, researchers have applied Transformers to hyperspectral image reconstruction, which significantly improves reconstruction quality. However, global Transformers have relatively high computational costs, with complexity that grows quadratically with the spatial size of the image, while local Transformers have limited ability to capture long-range dependencies. Therefore, keeping the computational cost reasonable while capturing both local information and long-range dependencies is a challenging problem. In addition, existing Transformer-based hyperspectral image reconstruction methods need further improvement in their ability to integrate features at different scales, because the key to improving reconstruction quality lies in more fully integrating features from different stages and scales.
Summary of the Invention
[0005] In view of this, the purpose of this invention is to provide a two-stage, multi-scale hyperspectral snapshot compressive imaging image reconstruction method. It employs an integrated framework consisting of two stages, coarse feature extraction and fine pixel refinement, and uses a novel attention mechanism, Dual-Window Multi-Scale Multi-Head Self-Attention (DWM-MSA), which is divided into four branches. Two branches compute self-attention within non-overlapping windows of two different sizes, while the other two branches compute self-attention within windows of the same two sizes after pixel shuffling. This captures long-range dependencies within local regions of two different window sizes, balances global characteristics against computational cost through local computation, and effectively captures features at different scales. The method can better capture long-range dependencies and multi-scale features of the image, recover image details more finely, and has advantages over other methods in avoiding artifacts.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] A two-stage, multi-scale hyperspectral snapshot compressed imaging image reconstruction method includes the following steps: S1: acquiring a two-dimensional compressed image of the target scene; S2: preprocessing the hyperspectral images used as the training set to obtain training sample data;
[0008] S3: Input the samples into the two-stage multi-scale image reconstruction network for training; S4: After training, perform image reconstruction on the test samples to obtain the results.
[0009] Furthermore, in step S1, a drone platform equipped with a hyperspectral camera is used to acquire hyperspectral remote sensing images. Specifically, this includes: S11, selecting the spectral range, i.e., setting the working wavelength to 450nm-650nm, with a total of 28 channels;
[0010] S12. Design a coded aperture to introduce different optical codes at each pixel; S13. Set the parameters of the coded aperture snapshot spectral imaging system, including but not limited to the number of dispersive elements and the dispersive displacement step size; S14. Capture a compressed measurement image of the entire scene using a corresponding optical sensor.
[0011] Further, in step S2, the hyperspectral images used as the training set are preprocessed. First, the hyperspectral images are divided into two equal parts. The first part is cropped using a 256×256 non-overlapping window to generate samples, and the second part is divided using a 128×128 non-overlapping window. These are then randomly combined into several 256×256 samples. Next, data augmentation operations are performed on the two parts, including random rotation and flipping. Finally, through simulation of the imaging system, i.e., through modulation using a pre-designed physical mask and then cropping and summing, the hyperspectral images are processed into two-dimensional compressed measurements. The data-augmented hyperspectral images and their corresponding two-dimensional compressed measurements constitute training sample pairs.
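The simulated measurement step described above can be illustrated with a minimal NumPy sketch. It assumes the convention commonly used when simulating coded aperture systems: each spectral channel is multiplied by the mask, displaced along the width by a fixed number of pixels per channel, and the displaced channels are summed on the sensor. The function name `simulate_measurement`, the dispersion step of 2 pixels, and the random binary mask are illustrative assumptions, not values stated in the patent.

```python
import numpy as np

def simulate_measurement(cube, mask, step=2):
    """Simulate a 2-D compressed measurement (assumed shift-and-sum model).

    cube: (H, W, C) hyperspectral image; mask: (H, W) coded aperture.
    Returns an array of shape (H, W + step * (C - 1)).
    """
    H, W, C = cube.shape
    measurement = np.zeros((H, W + step * (C - 1)), dtype=cube.dtype)
    for c in range(C):
        # Modulate channel c with the mask and add it at its dispersed offset.
        measurement[:, c * step : c * step + W] += cube[:, :, c] * mask
    return measurement

rng = np.random.default_rng(0)
cube = rng.random((256, 256, 28))                         # one 256x256 sample, 28 channels
mask = (rng.random((256, 256)) > 0.5).astype(cube.dtype)  # hypothetical binary mask
y = simulate_measurement(cube, mask)
print(y.shape)  # (256, 310)
```

The pair `(cube, y)` then plays the role of one training sample pair: the hyperspectral image and its two-dimensional compressed measurement.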
[0012] Further, in step S3, the training sample pairs are initialized. First, by performing sliding extraction along the channel dimension, each column in the compressed measurement is restored to a specific position in the output hyperspectral image, thereby converting the original two-dimensional data into a coarse three-dimensional hyperspectral image with higher dimensions. Then, the physical mask of the imaging system is stitched with this hyperspectral image along the channel dimension, and then a 1×1 convolution is performed to restore the number of channels to the number of channels in the original hyperspectral image, obtaining the initial input of the network, as shown in the following formula:
[0013] $I(X) = \mathrm{Conv}_{1\times1}([X, M])$
[0014] Where X is the input feature map, M is the physical mask, and I(X) is the obtained initial input. $\mathrm{Conv}_{N\times N}$ represents a convolution operation with a filter size of N×N, and "[]" represents a concatenation operation along the channel dimension.
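The initialization can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the sliding extraction uses the same per-channel offset (`step=2`) as the assumed forward model, the mask is repeated once per channel before concatenation, and the 1×1 convolution weights are random stand-ins for learned parameters. The helper names `shift_back`, `conv1x1`, and `initial_input` are hypothetical.

```python
import numpy as np

def shift_back(measurement, num_channels, step=2):
    """Slide a W-wide window along the measurement, one offset per channel.

    measurement: (H, W + step * (num_channels - 1)) -> (H, W, num_channels).
    """
    width = measurement.shape[1] - step * (num_channels - 1)
    return np.stack(
        [measurement[:, c * step : c * step + width] for c in range(num_channels)],
        axis=-1,
    )

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map: (H, W, C_in) @ (C_in, C_out)."""
    return x @ weight

def initial_input(measurement, mask, weight, step=2):
    num_channels = weight.shape[1]
    coarse = shift_back(measurement, num_channels, step)           # (H, W, C)
    mask_cube = np.repeat(mask[:, :, None], num_channels, axis=2)  # (H, W, C)
    stacked = np.concatenate([coarse, mask_cube], axis=-1)         # (H, W, 2C)
    return conv1x1(stacked, weight)                                # (H, W, C)

rng = np.random.default_rng(0)
C, H, W = 28, 64, 64
measurement = rng.random((H, W + 2 * (C - 1)))
mask = (rng.random((H, W)) > 0.5).astype(float)
weight = rng.standard_normal((2 * C, C)) * 0.1  # stands in for learned weights
x0 = initial_input(measurement, mask, weight)
print(x0.shape)  # (64, 64, 28)
```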
[0015] Next, the processed image is input into DSMT. DSMT consists of two stages: coarse-grained feature extraction and fine-tuning of pixels. The coarse-grained feature extraction stage contains two identical U-shaped networks. The input to the first network is the initial input, and the input to the second network is the concatenation of the output of the first network and the initial input along the channel dimension, followed by a 1×1 convolution operation. The formula is as follows:
[0016] $I_2(X) = \mathrm{Conv}_{1\times1}([X_{out}, X])$
[0017] Where X is the initial input, $X_{out}$ is the output of the first U-shaped network, and $I_2(X)$ is the input feature map of the second network.
[0018] The U-shaped network consists of three stages: upsampling, intermediate layers, and downsampling. The upsampling module comprises three operations: a 1×1 convolution, a 3×3 depthwise separable convolution, and a 1×1 convolution, as shown in the following formula:
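[0019] $U(X) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(\mathrm{Conv}_{1\times1}(X)))$
[0020] Where X is the input feature map, and U(X) is the feature map obtained after upsampling. $\mathrm{DWConv}_{N\times N}$ represents a depthwise separable convolution operation with a filter size of N×N.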
[0021] The intermediate layer is an Atrous Spatial Pyramid Pooling (ASPP) layer, which processes the feature maps using dilated convolutions with different dilation rates. The ASPP layer concatenates the resulting feature maps along the channel dimension for information fusion, as shown in the following formula:
[0022]
[0023] Where X is the input feature map, and A(X) is the feature map obtained after passing through the ASPP layer. The dilated-convolution operator in the formula denotes a dilated convolution with a filter size of N×N and a dilation rate of n. AvgPool() represents the average pooling operation, and Bilinear() represents the bilinear interpolation operation.
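Since the ASPP formula itself is not reproduced here, the following NumPy sketch only illustrates the general idea of such a layer. Everything specific in it is an assumption: the dilation rates (1, 2, 4), the 3×3 kernels, the use of global average pooling for the pooling branch, and the final 1×1 fusion convolution. For a globally pooled 1×1 map, bilinear interpolation back to H×W yields a constant map, which is why the sketch uses a broadcast.

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """3x3 dilated convolution with zero padding that keeps the H x W size.

    x: (H, W, C_in); kernel: (3, 3, C_in, C_out); rate: dilation rate.
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((rate, rate), (rate, rate), (0, 0)))
    out = np.zeros((H, W, kernel.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Kernel taps are spaced `rate` pixels apart.
            patch = padded[i * rate : i * rate + H, j * rate : j * rate + W, :]
            out += patch @ kernel[i, j]
    return out

def aspp(x, kernels, rates, pool_weight, fuse_weight):
    """Parallel dilated branches plus a pooled branch, concatenated and fused."""
    H, W, _ = x.shape
    branches = [dilated_conv3x3(x, k, r) for k, r in zip(kernels, rates)]
    pooled = x.mean(axis=(0, 1), keepdims=True) @ pool_weight      # (1, 1, C_out)
    # Upsampling a 1x1 map bilinearly gives a constant H x W map.
    branches.append(np.broadcast_to(pooled, (H, W, pooled.shape[-1])))
    return np.concatenate(branches, axis=-1) @ fuse_weight         # 1x1 fusion conv

rng = np.random.default_rng(0)
C = 8
rates = (1, 2, 4)                                   # hypothetical dilation rates
x = rng.random((16, 16, C))
kernels = [rng.standard_normal((3, 3, C, C)) * 0.1 for _ in rates]
pool_weight = rng.standard_normal((C, C)) * 0.1
fuse_weight = rng.standard_normal((C * (len(rates) + 1), C)) * 0.1
a = aspp(x, kernels, rates, pool_weight, fuse_weight)
print(a.shape)  # (16, 16, 8)
```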
[0024] The downsampling module consists of four operations: 2×2 transposed convolution, 1×1 convolution, 3×3 depthwise separable convolution, and 1×1 convolution, as shown in the following formula:
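[0025] $D(X) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(\mathrm{Conv}_{1\times1}(\mathrm{TConv}_{2\times2}(X))))$
[0026] Where X is the input feature map, and D(X) is the feature map obtained after downsampling. $\mathrm{TConv}_{N\times N}$ represents a transposed convolution operation with a filter size of N×N.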
[0027] After two downsampling operations, through an intermediate layer, and then two upsampling operations, the feature map output by the U-shaped network is obtained. Therefore, the coarse-grained feature extraction stage has two output feature maps.
[0028] Furthermore, after the coarse-grained feature extraction stage, the two output feature maps are input into the pixel fine-tuning network, which is a U-shaped network with a dual-branch encoder. The main component of this U-shaped network is the Dual-Window Multi-Scale Attention Module (DWMAB), whose main component is the Dual-Window Multi-Scale Multi-Head Self-Attention Mechanism (DWM-MSA). DWM-MSA first transforms the input feature map into queries, keys, and values through three linear layers, as shown in the following formula:
[0029] $\mathrm{query} = XW_q,\ \mathrm{key} = XW_k,\ \mathrm{value} = XW_v$
[0030] Where X represents the input feature map; query, key, and value represent the obtained queries, keys, and values, respectively; and $W_q$, $W_k$, and $W_v$ represent the corresponding projection weights.
[0031] Then, the queries, keys, and values are divided into four equal parts along the channel dimension, and the four parts, which have the same number of channels, are processed by four different branches. In the first two branches, the input is divided into h heads along the channel dimension, and windowed multi-head self-attention is computed within non-overlapping 8×8 windows. The input to the first branch is the original input, and the input to the second branch is the feature map obtained by shuffling the pixel order of the original input. Taking the first branch as an example, the formula is as follows:
[0032] $A^{n}_{w1} = \mathrm{softmax}\left(\frac{Q^{n}_{w1}(K^{n}_{w1})^{T}}{\sqrt{d}}\right)V^{n}_{w1}$
[0033] Where $Q^{n}_{w1}$, $K^{n}_{w1}$, and $V^{n}_{w1}$ represent the query, key, and value of the n-th head, computed using a window of size w1×w1; $A^{n}_{w1}$ represents the self-attention of the n-th head computed using a window of size w1×w1; softmax() represents the activation function that maps values to the interval [0,1]; and d represents the dimension of each head.
[0034] Then, the calculated attention is concatenated along the channel dimension, as shown in the following formula:
[0035] $A_{w1} = [A^{1}_{w1}, A^{2}_{w1}, \ldots, A^{h}_{w1}]$
[0036] Where $A_{w1}$ represents the self-attention obtained by the first branch.
[0037] Then, for the second branch, the same operation is performed on the feature map after the pixel order has been shuffled to obtain the self-attention $A'_{w1}$ of the second branch. For the third and fourth branches, a non-overlapping window of size 16×16 is used for self-attention calculation, and the operations are the same as for the first and second branches, respectively, yielding the self-attention $A_{w2}$ and $A'_{w2}$. Finally, the self-attention outputs of the four branches are concatenated along the channel dimension, and a linear layer is applied to obtain the final output of DWM-MSA, as shown in the following formula:
[0038] $A = \mathrm{Linear}([A_{w1}, A'_{w1}, A_{w2}, A'_{w2}])$
[0039] Where A is the self-attention value obtained by DWM-MSA. DWM-MSA enhances the model's ability to perceive multi-scale information and can capture long-range dependencies of images.
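The four-branch computation can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation. It assumes that pixel shuffling is a fixed random permutation of pixel positions, that the permutation is undone after attention so the branch outputs align spatially, and that two heads are used per branch. All weights are random stand-ins for learned parameters, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    """(H, W, C) -> (num_windows, w*w, C) with non-overlapping w x w windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_merge(windows, w, H, W):
    """Inverse of window_partition."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def window_msa(q, k, v, w, heads):
    """Multi-head self-attention computed independently inside each window."""
    H, W, C = q.shape
    d = C // heads

    def split(t):  # -> (num_windows, heads, w*w, d)
        return window_partition(t, w).reshape(-1, w * w, heads, d).transpose(0, 2, 1, 3)

    qh, kh, vh = split(q), split(k), split(v)
    attn = softmax(qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d))
    out = (attn @ vh).transpose(0, 2, 1, 3).reshape(-1, w * w, C)  # concatenate heads
    return window_merge(out, w, H, W)

def shuffle_pixels(x, perm):
    """Reorder pixel positions with a permutation of range(H * W)."""
    H, W, C = x.shape
    return x.reshape(H * W, C)[perm].reshape(H, W, C)

def dwm_msa(x, Wq, Wk, Wv, Wo, perm, w1=8, w2=16, heads=2):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    inverse = np.argsort(perm)
    parts = zip(np.split(q, 4, axis=-1), np.split(k, 4, axis=-1), np.split(v, 4, axis=-1))
    outputs = []
    for idx, (qi, ki, vi) in enumerate(parts):
        w = w1 if idx < 2 else w2          # branches 1-2 use w1, branches 3-4 use w2
        if idx % 2 == 1:                   # branches 2 and 4 see shuffled pixels
            qi, ki, vi = (shuffle_pixels(t, perm) for t in (qi, ki, vi))
            outputs.append(shuffle_pixels(window_msa(qi, ki, vi, w, heads), inverse))
        else:
            outputs.append(window_msa(qi, ki, vi, w, heads))
    return np.concatenate(outputs, axis=-1) @ Wo   # final linear layer

rng = np.random.default_rng(0)
H, W, C = 32, 32, 16
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv, Wo = (rng.standard_normal((C, C)) * 0.1 for _ in range(4))
perm = rng.permutation(H * W)
out = dwm_msa(x, Wq, Wk, Wv, Wo, perm)
print(out.shape)  # (32, 32, 16)
```

Because attention is only computed inside windows of 64 or 256 pixels, the cost grows linearly with the number of windows, while the shuffled branches let distant pixels land in the same window.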
[0040] Furthermore, in DWMAB, the feature map is first processed using a 3×3 depthwise separable convolution to obtain a residual, which is then added to the input feature map to perform conditional position encoding, as shown in the following formula:
[0041] $\mathrm{CPE}(X) = X + \mathrm{DWConv}_{3\times3}(X)$
[0042] Where X is the input feature map, and CPE(X) is the feature map after conditional position encoding.
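A minimal NumPy sketch of this position encoding step is given below. It assumes that the 3×3 depthwise separable convolution reduces to a per-channel (depthwise) 3×3 filter with zero padding and stride 1; the kernel values are random stand-ins for learned parameters, and the function names are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding.

    x: (H, W, C); kernels: (3, 3, C), one 3x3 filter per channel.
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W, :] * kernels[i, j]
    return out

def conditional_position_encoding(x, kernels):
    """Add the depthwise-convolution residual to the input feature map."""
    return x + depthwise_conv3x3(x, kernels)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 4))
kernels = rng.standard_normal((3, 3, 4)) * 0.1
encoded = conditional_position_encoding(x, kernels)
print(encoded.shape)  # (8, 8, 4)
```

The encoding is conditional in the sense that the added term depends on the local neighbourhood of each pixel in the input, so it adapts to the feature map instead of being a fixed table.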
[0043] Then, layer normalization is applied to the position-encoded feature map, followed by DWM-MSA processing to obtain the residual, which is then added to the input feature map, as shown in the following formula:
[0044] DWM(X) = X + LayerNorm(MSA(X))
[0045] Where X is the input feature map, DWM() represents the feature map after passing through the DWMAB module, LayerNorm() represents layer normalization, and MSA() represents the dual-window multi-scale multi-head self-attention mechanism (DWM-MSA).
[0046] Finally, the result of the self-attention computation undergoes another layer normalization operation, followed by a feedforward neural network. The feedforward neural network contains a 1×1 convolution, a 3×3 depthwise separable convolution, and a 1×1 convolution, with a GELU activation applied between each pair of consecutive convolutions, as shown in the following formula:
[0047]
[0048] Where X is the input feature map, and F(X) represents the feature map after DWMAB processing.
[0049] Furthermore, in the pixel fine-tuning stage, each upsampling in the encoder includes several DWMABs and 4×4 convolutions, as shown in the following formula:
[0050]
[0051] Where X is the input feature map, and FU(X) represents the feature map obtained after one upsampling in the pixel fine adjustment stage.
[0052] After both branches of the encoder have finished encoding, the encoded features are fed into the Cross-Attention Fusion (CAFM) module for feature fusion. This module includes channel attention and spatial attention modules. First, after obtaining the feature maps from the outputs of the two branches, global max pooling and average pooling are performed on them. Then, they are fed into a two-layer multilayer perceptron network, and the two are added together, as shown in the following formula:
[0053] W(X)=MLP(AvgPool(X))+MLP(MaxPool(X))
[0054] Where X is the input feature map, AvgPool() and MaxPool() represent average pooling and max pooling operations, respectively, and MLP represents a two-layer multilayer perceptron.
[0055] Using the method described above, the channel weighting coefficients W1 and W2 of the two branches are calculated. These coefficients are then multiplied to obtain the cross matrix M, as shown in the following formula:
[0056] $M = W_1 (W_2)^{T}$
[0057] Where M represents the obtained cross-attention matrix.
[0058] The features obtained after channel attention and cross-matrix processing can be represented as:
[0059] $F_s = \mathrm{softmax}(M)\,F_c$
[0060] Where $F_s$ represents the feature map obtained after channel attention and cross-attention processing, softmax represents the normalized exponential function, $F_c$ represents the initial input of a branch of CAFM, and M represents the cross-attention matrix.
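The channel-attention part of CAFM can be sketched in NumPy as follows. The sketch makes several assumptions that the text does not settle: the two branches share one MLP, the MLP uses a ReLU between its two layers, the pooling is global over the spatial dimensions, and the softmax is taken over each row of M. The weights are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp(v, w_in, w_out):
    """Two-layer perceptron; the ReLU between the layers is an assumption."""
    return np.maximum(v @ w_in, 0.0) @ w_out

def channel_weights(x, w_in, w_out):
    """W(X) = MLP(AvgPool(X)) + MLP(MaxPool(X)), pooling over H and W."""
    return mlp(x.mean(axis=(0, 1)), w_in, w_out) + mlp(x.max(axis=(0, 1)), w_in, w_out)

def cross_channel_attention(f1, f2, w_in, w_out):
    """Return F_s for the first branch together with the cross matrix M."""
    h, w, c = f1.shape
    w1 = channel_weights(f1, w_in, w_out)   # (C,)
    w2 = channel_weights(f2, w_in, w_out)   # (C,)
    m = np.outer(w1, w2)                    # M = W1 (W2)^T, shape (C, C)
    f_c = f1.reshape(h * w, c).T            # (C, H*W)
    f_s = softmax(m, axis=-1) @ f_c         # mix channels with the normalized M
    return f_s.T.reshape(h, w, c), m

rng = np.random.default_rng(0)
C, hidden = 8, 4
f1 = rng.random((16, 16, C))
f2 = rng.random((16, 16, C))
w_in = rng.standard_normal((C, hidden)) * 0.5
w_out = rng.standard_normal((hidden, C)) * 0.5
f_s, m = cross_channel_attention(f1, f2, w_in, w_out)
print(f_s.shape, m.shape)  # (16, 16, 8) (8, 8)
```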
[0061] The obtained feature maps are then subjected to average pooling and max pooling operations, and then concatenated along the channel dimension. Finally, a weight coefficient is calculated using a 3×3 convolution, as shown in the following formula:
[0062]
[0063] Where $F_s$ represents the feature map processed by channel attention and the cross matrix, and W represents the obtained weight coefficients.
[0064] Finally, the spatial weighting coefficients are multiplied by the input features to obtain the fused features. The fused features of the two branches are then concatenated along the channel dimension and subjected to a 3×3 convolution operation to obtain the final CAFM output, which effectively fuses the features of the two branches. The formula is as follows:
[0065]
[0066] Where the two input symbols in the formula represent the initial inputs of the first and second branches, respectively, and W represents the weight coefficients.
[0067] After fusing features from the two branches of the encoder using CAFM, the data is processed through an intermediate layer containing six DWMABs. Then, a downsampling stage is performed. Each downsampling stage contains a 2×2 transposed convolution and several DWMABs to calculate the residual. This residual is added to the input of the first branch of the encoder, and then subjected to a 3×3 convolution operation to obtain the final residual. This final residual is then added to the initial input of the entire network to obtain the final reconstruction result, as shown in the formula below:
[0068]
[0069] Where X represents the initial input feature map, X1 and X2 represent the inputs of the first and second branches of the encoder in the pixel fine-tuning stage, respectively, Fine() represents the feature map obtained in the pixel fine-tuning stage, and DWMT(X) represents the final reconstruction result obtained.
[0070] Furthermore, the loss function for the overall network is set as follows:
[0071] $\mathrm{Loss} = \lVert X - Y \rVert_{2}$
[0072] Where X represents the reconstruction result and Y represents the real hyperspectral image. The loss is computed, and the model parameters of the hyperspectral image reconstruction framework are optimized based on this loss function through backpropagation. After training, the trained framework is used to reconstruct the input samples and output the reconstructed hyperspectral images.
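As a small illustration of the loss, the NumPy sketch below computes the L2 norm of the reconstruction error over all pixels and channels. Whether the norm is additionally squared or averaged over the number of elements is not stated in the text, so the plain norm used here is an assumption, and `l2_loss` is a hypothetical name.

```python
import numpy as np

def l2_loss(reconstruction, target):
    """L2 norm of the reconstruction error over all pixels and channels."""
    diff = reconstruction - target
    return float(np.sqrt(np.sum(diff ** 2)))

target = np.zeros((4, 4, 28))              # stand-in for the real hyperspectral image Y
reconstruction = np.full((4, 4, 28), 0.5)  # stand-in for the reconstruction X
loss = l2_loss(reconstruction, target)
print(loss)
```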
[0073] The beneficial effects of this invention are as follows:
[0074] This invention proposes an end-to-end hyperspectral image reconstruction algorithm, a two-stage multi-scale Transformer (DSMT), which consists of two stages: coarse feature extraction and fine pixel refinement. This promotes the gradual learning and optimization of the network, helps capture details and contextual information in the data, enhances the network's robustness and generalization ability, and significantly improves reconstruction quality. This invention also proposes a novel U-shaped network architecture with a dual-branch encoder, which allows features from two different networks to be used simultaneously and provides richer and more diverse feature representations. Furthermore, this invention proposes a novel attention mechanism, the Dual-Window Multi-Scale Multi-Head Self-Attention Mechanism (DWM-MSA), which can simultaneously capture local information, long-range dependency information, and multi-scale information. Finally, this invention proposes a novel position embedding method, Conditional Absolute Position Embedding (CAPE), which effectively improves the accuracy of position information within the image. Experimental results on three hyperspectral image datasets demonstrate that the proposed DSMT outperforms current state-of-the-art hyperspectral image reconstruction methods.
[0075] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.
Attached Figure Description
[0076] To make the objectives, technical solutions, and advantages of the present invention clearer, the preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, wherein:
[0077] Figure 1 is a flowchart of the method of the present invention;
[0078] Figure 2 is a schematic diagram of the coded aperture snapshot spectral imaging (CASSI) system;
[0079] Figure 3 is a structural diagram of the two-stage multi-scale Transformer (DSMT) used for hyperspectral image reconstruction;
[0080] Figure 4 is a structural diagram of the dual-window multi-scale attention module (DWMAB) of the present invention;
[0081] Figure 5 is a structural diagram of the dual-window multi-scale multi-head self-attention mechanism (DWM-MSA) of the present invention;
[0082] Figure 6 is a structural diagram of the Cross-Attention Fusion Module (CAFM) of the present invention;
[0083] Figure 7 shows a scene from the KAIST hyperspectral image dataset, where (a) is the reconstruction result of the method of this invention on four channels, and (b) is the ground truth;
[0084] Figure 8 shows the visualization results of different methods on four channels of a scene from the KAIST hyperspectral image dataset, where (a) is TSA-Net, (b) is HDNet, (c) is MST, (d) is CST, (e) is DSMT, and (f) is the ground truth.
Detailed Implementation
[0085] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings.
[0086] Figure 1 is a flowchart of the method of the present invention. The present invention provides a two-stage, multi-scale hyperspectral snapshot compressive imaging image reconstruction method. As shown in the figure, in the image acquisition stage, a coded aperture snapshot spectral imaging system is used to acquire two-dimensional compressed images. The deep learning network used for hyperspectral image reconstruction is shown in Figure 3. It captures details and contextual information in the data, as well as long-range dependencies of hyperspectral images, to obtain high-quality reconstructed images. The network consists of two stages: coarse-grained feature extraction and pixel fine-tuning. The pixel fine-tuning stage includes a dual-window multi-scale attention module (DWMAB). First, the hyperspectral image fused with the physical mask is input into the coarse-grained feature extraction network, and a coarse hyperspectral image reconstruction result is obtained after feature extraction by the U-shaped networks. Then, the two outputs of the coarse-grained feature extraction network are input into the pixel fine-tuning network, where the DWMAB captures the long-range dependencies and multi-scale features of the image. The dual-branch encoder U-shaped network designed in this invention provides richer and more diverse feature representations. The dual-window multi-scale multi-head self-attention mechanism (DWM-MSA) designed in this invention captures the long-range dependencies and multi-scale features of hyperspectral images. The position embedding method CAPE designed in this invention improves the accuracy of position information within the image and the generalization ability of the model. Specifically, the technical solution of this invention includes the following:
[0087] 1. Data Acquisition: Snapshot compressed image acquisition is achieved using a coded aperture snapshot spectral imaging system. First, the spectral range is selected, with the working wavelength set to 450nm-650nm, for a total of 28 channels. Then, a coded aperture is designed to introduce different optical codes at each pixel. Next, the parameters of the coded aperture snapshot spectral imaging system are set, including but not limited to the number of dispersive elements and the dispersion shift step size. Finally, a compressed measurement image of the entire scene is captured using a corresponding optical sensor. The schematic diagram of the imaging system is shown in Figure 2.
[0088] 2. Data Preprocessing: The hyperspectral images in the training set are preprocessed. The first half is cropped using 256×256 non-overlapping windows to generate samples, and the second half is divided into 128×128 non-overlapping windows, which are then randomly combined into several 256×256 samples. Data augmentation operations are then performed, including random rotation and flipping. Finally, through simulation of the imaging system, i.e., modulation using a pre-designed physical mask followed by cropping, and then integration on a single channel, the hyperspectral images are processed into two-dimensional compressed measurements. The data-augmented hyperspectral images and their corresponding two-dimensional compressed measurements constitute training sample pairs.
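The cropping, recombination, and augmentation steps can be sketched in NumPy as follows. The sketch assumes that four randomly chosen 128×128 tiles are stitched into a 2×2 mosaic to form each 256×256 sample, that rotations are multiples of 90 degrees, and that the source images are 512×512 with 28 channels. These choices, the two random stand-in images, and the function names are illustrative assumptions.

```python
import numpy as np

def crop_tiles(image, size):
    """Cut an (H, W, C) image into non-overlapping size x size tiles."""
    h, w, _ = image.shape
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def combine_tiles(tiles, rng):
    """Stitch four randomly chosen 128x128 tiles into one 256x256 sample."""
    idx = rng.choice(len(tiles), size=4, replace=False)
    top = np.concatenate([tiles[idx[0]], tiles[idx[1]]], axis=1)
    bottom = np.concatenate([tiles[idx[2]], tiles[idx[3]]], axis=1)
    return np.concatenate([top, bottom], axis=0)

def augment(sample, rng):
    """Random rotation by a multiple of 90 degrees, then random flips."""
    sample = np.rot90(sample, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:
        sample = sample[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        sample = sample[::-1]      # vertical flip
    return sample

rng = np.random.default_rng(0)
image_a = rng.random((512, 512, 28))   # stands in for the first half of the training set
image_b = rng.random((512, 512, 28))   # stands in for the second half
samples = crop_tiles(image_a, 256)
small_tiles = crop_tiles(image_b, 128)
samples += [combine_tiles(small_tiles, rng) for _ in range(4)]
samples = [augment(s, rng) for s in samples]
print(len(samples), samples[0].shape)  # 8 (256, 256, 28)
```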
[0089] 3. Initialize the training sample pairs, as shown in Figure 3. First, sliding extraction is performed along the channel dimension to restore each column in the compressed measurement to a specific position in the output hyperspectral image, thereby transforming the original two-dimensional data into a coarse three-dimensional hyperspectral image with higher dimensions. Then, the physical mask of the imaging system is concatenated with this hyperspectral image along the channel dimension, and a 1×1 convolution is performed to restore the number of channels to that of the original hyperspectral image, obtaining the initial input of the network, as shown in the following formula:
[0090] $I(X) = \mathrm{Conv}_{1\times1}([X, M])$
[0091] Where X is the input feature map, M is the physical mask, and I(X) is the obtained initial input. $\mathrm{Conv}_{N\times N}$ represents a convolution operation with a filter size of N×N, and "[]" represents a concatenation operation along the channel dimension.
[0092] 4. Input the processed image into the two-stage multi-scale Transformer (DSMT). As shown in Figure 3, DSMT consists of two stages: coarse-grained feature extraction and fine-grained pixel adjustment. The coarse-grained feature extraction stage contains two identical U-shaped networks. The input to the first network is the initial input, and the input to the second network is the concatenation of the output of the first network and the initial input along the channel dimension, followed by a 1×1 convolution operation. The formula is as follows:
[0093] $I_2(X) = \mathrm{Conv}_{1\times1}([X_{out}, X])$
[0094] Where X is the initial input, $X_{out}$ is the output of the first U-shaped network, and $I_2(X)$ is the input feature map of the second network.
[0095] The U-shaped network consists of three stages: upsampling, intermediate layers, and downsampling. The upsampling module comprises three operations: a 1×1 convolution, a 3×3 depthwise separable convolution, and a 1×1 convolution, as shown in the following formula:
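[0096] $U(X) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(\mathrm{Conv}_{1\times1}(X)))$
[0097] Where X is the input feature map, and U(X) is the feature map obtained after upsampling. $\mathrm{DWConv}_{N\times N}$ represents a depthwise separable convolution operation with a filter size of N×N.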
[0098] The intermediate layer is an Atrous Spatial Pyramid Pooling (ASPP) layer, which processes the feature maps using dilated convolutions with different dilation rates. The ASPP layer concatenates the resulting feature maps along the channel dimension for information fusion, as shown in the following formula:
[0099]
[0100] Where X is the input feature map, and A(X) is the feature map obtained after passing through the ASPP layer. The dilated-convolution operator in the formula denotes a dilated convolution with a filter size of N×N and a dilation rate of n. AvgPool() represents the average pooling operation, and Bilinear() represents the bilinear interpolation operation.
[0101] The downsampling module consists of four operations: 2×2 transposed convolution, 1×1 convolution, 3×3 depthwise separable convolution, and 1×1 convolution, as shown in the following formula:
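[0102] $D(X) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(\mathrm{Conv}_{1\times1}(\mathrm{TConv}_{2\times2}(X))))$
[0103] Where X is the input feature map, and D(X) is the feature map obtained after downsampling. $\mathrm{TConv}_{N\times N}$ represents a transposed convolution operation with a filter size of N×N.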
[0104] After two downsampling operations, through the intermediate layer, and then after two upsampling operations, the feature map output by the U-shaped network is obtained; thus, two output feature maps are obtained in the coarse-grained feature extraction stage.
[0105] 5. After the coarse-grained feature extraction stage, the two output feature maps are input into the pixel fine-tuning network, which is a U-shaped network with a dual-branch encoder. The main component of this U-shaped network is the dual-window multi-scale attention module (DWMAB), shown in Figure 4. The main component of this module is the dual-window multi-scale multi-head self-attention mechanism (DWM-MSA), shown in Figure 5. DWM-MSA first transforms the input feature map into queries, keys, and values through three linear layers, as shown in the following formula:
[0106] $\mathrm{query} = XW_q,\ \mathrm{key} = XW_k,\ \mathrm{value} = XW_v$
[0107] Where X represents the input feature map; query, key, and value represent the obtained queries, keys, and values, respectively; and $W_q$, $W_k$, and $W_v$ represent the corresponding projection weights.
[0108] Then, the queries, keys, and values are divided into four equal parts along the channel dimension, and the four parts, which have the same number of channels, are processed by four different branches. In the first two branches, the input is divided into h heads along the channel dimension, and windowed multi-head self-attention is computed within non-overlapping 8×8 windows. The input to the first branch is the original input, and the input to the second branch is the feature map obtained by shuffling the pixel order of the original input. Taking the first branch as an example, the formula is as follows:
[0109] $A^{n}_{w1} = \mathrm{softmax}\left(\frac{Q^{n}_{w1}(K^{n}_{w1})^{T}}{\sqrt{d}}\right)V^{n}_{w1}$
[0110] Where $Q^{n}_{w1}$, $K^{n}_{w1}$, and $V^{n}_{w1}$ represent the query, key, and value of the n-th head, computed using a window of size w1×w1; $A^{n}_{w1}$ represents the self-attention of the n-th head computed using a window of size w1×w1; softmax() represents the activation function that maps values to the interval [0,1]; and d represents the dimension of each head.
[0111] Then, the calculated attention is concatenated along the channel dimension, as shown in the following formula:
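[0112] $A_{w1} = [A^{1}_{w1}, A^{2}_{w1}, \ldots, A^{h}_{w1}]$
[0113] Where $A^{n}_{w1}$ is the self-attention of the nth head, "[]" indicates concatenation along the channel dimension, and $A_{w1}$ represents the resulting self-attention of the first branch.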
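As an illustration of the first branch, the following NumPy sketch computes multi-head self-attention inside non-overlapping windows and concatenates the heads along the channel dimension. It assumes the usual scaled dot-product form (division by the square root of the head dimension d); the helper names `window_partition`, `window_merge`, and `window_msa` and all sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    """(H, W, C) -> (num_windows, w*w, C) using non-overlapping w x w windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_merge(win, w, H, W):
    """Inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)."""
    C = win.shape[-1]
    x = win.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def window_msa(q, k, v, w, heads):
    """Multi-head self-attention restricted to w x w windows; q, k, v are (H, W, C)."""
    H, W, C = q.shape
    d = C // heads  # dimension of each head

    def split_heads(t):
        t = window_partition(t, w)                      # (nW, w*w, C)
        t = t.reshape(t.shape[0], w * w, heads, d)
        return t.transpose(0, 2, 1, 3)                  # (nW, heads, w*w, d)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d))
    out = attn @ vh                                     # (nW, heads, w*w, d)
    out = out.transpose(0, 2, 1, 3).reshape(-1, w * w, C)  # concatenate heads
    return window_merge(out, w, H, W)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 8))  # assumed toy feature map
out = window_msa(x, x, x, w=8, heads=2)
```

Since attention never crosses a window boundary, changing pixels in one window leaves the output of every other window unchanged, which is what keeps the cost below that of global attention.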
[0114] Then, for the second branch, the same operation is performed on the feature map whose pixel order has been shuffled to obtain the self-attention $A'_{w1}$ of the second branch. For the third and fourth branches, non-overlapping windows of size 16×16 are used for self-attention calculation, with the same operations as in the first and second branches, respectively, to obtain the self-attention $A_{w2}$ and $A'_{w2}$. Finally, the self-attention outputs of the four branches are concatenated along the channel dimension, and a linear layer is applied to obtain the final output of DWM-MSA, as shown in the following formula:
[0115] $A = \mathrm{Linear}([A_{w1}, A'_{w1}, A_{w2}, A'_{w2}])$
[0116] Where A is the self-attention value obtained by DWM-MSA. DWM-MSA enhances the model's ability to perceive multi-scale information and can capture long-range dependencies of images.
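The four-branch structure can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the shuffle is modeled as a fixed random permutation of pixel positions, the permutation is inverted after attention so that outputs line up with the unshuffled branches (the text does not say whether this is done), and all sizes, weights, and the names `dwm_msa` and `perm` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_msa(q, k, v, w, heads):
    """Multi-head self-attention inside non-overlapping w x w windows; inputs are (H, W, C)."""
    H, W, C = q.shape
    d = C // heads

    def split(t):  # (H, W, C) -> (num_windows, heads, w*w, d)
        t = t.reshape(H // w, w, W // w, w, heads, d)
        return t.transpose(0, 2, 4, 1, 3, 5).reshape(-1, heads, w * w, d)

    qh, kh, vh = split(q), split(k), split(v)
    out = softmax(qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d)) @ vh
    out = out.reshape(H // w, W // w, heads, w, w, d).transpose(0, 3, 1, 4, 2, 5)
    return out.reshape(H, W, C)  # heads concatenated along the channel dimension

def dwm_msa(x, w_q, w_k, w_v, w_out, perm, heads=2, windows=(8, 16)):
    H, W, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    inv = np.argsort(perm)  # inverse permutation (assumption: outputs are unshuffled)

    def shuffle(t):
        return t.reshape(H * W, -1)[perm].reshape(H, W, -1)

    def unshuffle(t):
        return t.reshape(H * W, -1)[inv].reshape(H, W, -1)

    # Four equal channel chunks of query, key and value, one per branch.
    qs, ks, vs = (np.split(t, 4, axis=-1) for t in (q, k, v))
    plan = [(windows[0], False), (windows[0], True),
            (windows[1], False), (windows[1], True)]
    branches = []
    for i, (w, shuffled) in enumerate(plan):
        if shuffled:
            a = unshuffle(window_msa(shuffle(qs[i]), shuffle(ks[i]), shuffle(vs[i]), w, heads))
        else:
            a = window_msa(qs[i], ks[i], vs[i], w, heads)
        branches.append(a)
    return np.concatenate(branches, axis=-1) @ w_out  # final linear layer

rng = np.random.default_rng(0)
H = W = 16
C = 16  # assumed sizes; C must be divisible by 4 * heads
x = rng.standard_normal((H, W, C))
w_q, w_k, w_v, w_out = (0.1 * rng.standard_normal((C, C)) for _ in range(4))
perm = rng.permutation(H * W)
a = dwm_msa(x, w_q, w_k, w_v, w_out, perm)
```

In the shuffled branches, pixels that are far apart in the image can fall into the same window, which is one way to read the claim that the mechanism captures long-range dependencies while still using windowed attention.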
[0117] 6. In DWMAB, the feature map is first processed by a 3×3 depthwise separable convolution to obtain a residual, which is then added to the input feature map to perform conditional position encoding. The formula is as follows:
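[0118] $\mathrm{CPE}(X) = X + \mathrm{DWConv}_{3\times 3}(X)$
[0119] Where X is the input feature map, $\mathrm{DWConv}_{3\times 3}$ denotes the 3×3 depthwise separable convolution, and CPE(X) is the feature map after conditional position encoding.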
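The position-encoding step described above (a 3×3 depthwise separable convolution whose output is added back to the input) can be sketched in NumPy. Zero padding, the absence of bias terms, and the function names are assumptions made for illustration.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 filtering. x: (H, W, C); kernels: (3, 3, C). Zero padding keeps H and W."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * kernels[i, j, :]
    return out

def cpe(x, dw_kernels, pw_weights):
    """Depthwise 3x3 followed by pointwise 1x1 (together: depthwise separable), plus residual."""
    residual = depthwise_conv3x3(x, dw_kernels) @ pw_weights
    return x + residual

rng = np.random.default_rng(0)
C = 4  # assumed channel count
x = rng.standard_normal((8, 8, C))
encoded = cpe(x, rng.standard_normal((3, 3, C)), rng.standard_normal((C, C)))
```

Because the encoding is computed from the feature map itself, it adapts to the input size, unlike a fixed table of learned position vectors.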
[0120] Then, layer normalization is applied to the position-encoded feature map, followed by DWM-MSA processing to obtain the residual, which is then added to the input feature map, as shown in the following formula:
[0121] DWM(X) = X + LayerNorm(MSA(X))
[0122] Where X is the input feature map, DWM() represents the feature map after passing through the DWMAB module, LayerNorm() represents layer normalization, and MSA() represents the dual-window multi-scale multi-head self-attention mechanism (DWM-MSA).
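A minimal sketch of this residual step follows. It implements the formula as printed, DWM(X) = X + LayerNorm(MSA(X)); the layer normalization here has no learnable scale or shift, and `msa` is a placeholder callable standing in for DWM-MSA, both of which are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each pixel's channel vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dwm_residual(x, msa):
    """X + LayerNorm(MSA(X)); `msa` is any callable mapping (H, W, C) -> (H, W, C)."""
    return x + layer_norm(msa(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = dwm_residual(x, msa=lambda t: 2.0 * t)  # toy stand-in for DWM-MSA
```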
[0123] Finally, the result after self-attention computation undergoes another layer normalization operation, followed by a feedforward neural network. The feedforward neural network contains 1×1 convolutions, 3×3 depthwise separable convolutions, and 1×1 convolutions. GELU activation is applied before every two convolutions, as shown in the following formula:
[0124]
[0125] Where X is the input feature map, and F(X) represents the feature map after DWMAB processing.
[0126] 7. In the pixel fine-tuning stage, each upsampling in the encoder involves several DWMAB and 4×4 convolutions, as shown in the following formula:
[0127]
[0128] Where X is the input feature map, and FU(X) represents the feature map obtained after one upsampling in the pixel fine-tuning stage.
[0129] After both branches of the encoder have finished encoding, the encoded features are input into the Cross-Attention Fusion Module (CAFM) for feature fusion. As shown schematically in Figure 6, this module includes two parts: channel attention and spatial attention. First, global max pooling and global average pooling are applied to the feature map output by each of the two branches. The pooled results are then each passed through a two-layer multilayer perceptron, and the two outputs are added together, as shown in the following formula:
[0130] W(X)=MLP(AvgPool(X))+MLP(MaxPool(X))
[0131] Where X is the input feature map, AvgPool() and MaxPool() represent average pooling and max pooling operations, respectively, and MLP represents a two-layer multilayer perceptron.
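The channel weighting formula can be sketched as follows. The ReLU between the two perceptron layers, the hidden width, and the sharing of one MLP between both pooled vectors are assumptions (the formula uses a single MLP symbol, which suggests sharing); weights are random stand-ins.

```python
import numpy as np

def mlp(v, w_hidden, w_out):
    """Two-layer perceptron; the ReLU between layers is an assumption."""
    return np.maximum(v @ w_hidden, 0.0) @ w_out

def channel_weights(x, w_hidden, w_out):
    """W(X) = MLP(AvgPool(X)) + MLP(MaxPool(X)) with global pooling over H and W."""
    avg = x.mean(axis=(0, 1))  # (C,)
    mx = x.max(axis=(0, 1))    # (C,)
    return mlp(avg, w_hidden, w_out) + mlp(mx, w_hidden, w_out)

rng = np.random.default_rng(0)
C = 8  # assumed channel count
x = rng.standard_normal((16, 16, C))
w_hidden = rng.standard_normal((C, C // 2))  # assumed bottleneck width C/2
w_out = rng.standard_normal((C // 2, C))
w = channel_weights(x, w_hidden, w_out)
```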
[0132] Using the method described above, the channel weighting coefficients W1 and W2 of the two branches are calculated. These coefficients are then multiplied to obtain the cross matrix M, as shown in the following formula:
[0133] $M = W_1 W_2^{T}$
[0134] Where M represents the obtained cross-attention matrix.
[0135] The features obtained after channel attention and cross-matrix processing can be represented as:
[0136] $F_s = \mathrm{softmax}(M)\,F_c$
[0137] Where $F_s$ represents the feature map obtained after channel attention and cross-attention processing, softmax represents the normalized exponential function, $F_c$ represents the initial input of one branch of CAFM, and M represents the cross-attention matrix.
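The cross matrix and its application to a branch can be sketched as below. The sketch assumes W1 and W2 are length-C column vectors (so M is C×C), that softmax is taken along each row of M, and that the feature map is flattened to C × (H·W) for the matrix product; none of these layout details are stated in the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(w1, w2, f_c):
    """F_s = softmax(M) F_c with M = W1 W2^T. w1, w2: (C,); f_c: (H, W, C)."""
    m = np.outer(w1, w2)                 # (C, C) cross matrix
    H, W, C = f_c.shape
    flat = f_c.reshape(H * W, C).T       # (C, H*W)
    f_s = softmax(m) @ flat              # mix channels with row-normalized weights
    return f_s.T.reshape(H, W, C)

rng = np.random.default_rng(0)
C = 4
w1, w2 = rng.standard_normal(C), rng.standard_normal(C)
f_c = rng.standard_normal((8, 8, C))
f_s = cross_attention(w1, w2, f_c)
```

Each output channel is a convex combination of the input channels, with mixing weights determined jointly by the channel statistics of both branches.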
[0138] The obtained feature maps are then subjected to average pooling and max pooling operations, and then concatenated along the channel dimension. Finally, a weight coefficient is calculated using a 3×3 convolution, as shown in the following formula:
[0139]
[0140] Where $F_s$ represents the feature map processed by channel attention and the cross matrix, and W represents the obtained weight coefficients.
[0141] Finally, the spatial weighting coefficients are multiplied by the input features to obtain the fused features. The fused features of the two branches are then concatenated along the channel dimension and subjected to a 3×3 convolution operation to obtain the final CAFM output, which effectively fuses the features of the two branches. The formula is as follows:
[0142]
[0143] Where the two input terms represent the initial inputs of the first and second branches, respectively, and W represents the weight coefficients.
[0144] After fusing features from the two branches of the encoder using CAFM, the data is processed through an intermediate layer containing six DWMABs. Then, a downsampling stage is performed. Each downsampling stage contains a 2×2 transposed convolution and several DWMABs to calculate the residual. This residual is added to the input of the first branch of the encoder, and then subjected to a 3×3 convolution operation to obtain the final residual. This final residual is then added to the initial input of the entire network to obtain the final reconstruction result, as shown in the formula below:
[0145]
[0146] Where X represents the initial input feature map, X1 and X2 represent the inputs of the first and second branches of the encoder in the pixel fine-tuning stage, respectively, Fine() represents the feature map obtained in the pixel fine-tuning stage, and DWMT(X) represents the final reconstruction result obtained.
[0147] 8. The overall network loss function is set as $\mathrm{Loss} = \|X - Y\|^{2}$, where X represents the reconstruction result and Y represents the real hyperspectral image. The loss is calculated, and the model parameters of the hyperspectral image reconstruction framework are optimized based on the loss function and backpropagation. After training, the trained hyperspectral image reconstruction framework is obtained. The input samples are reconstructed using the trained framework, and the hyperspectral image reconstruction result is output.
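The loss can be written as a short NumPy sketch. It assumes the norm is the squared L2 norm summed over all pixels and bands (no averaging); the gradient with respect to the reconstruction is included because it is what backpropagation starts from.

```python
import numpy as np

def l2_loss(x, y):
    """Squared L2 distance between reconstruction x and ground truth y (summed, an assumption)."""
    return float(np.sum((x - y) ** 2))

def l2_loss_grad(x, y):
    """Gradient of the loss with respect to the reconstruction x."""
    return 2.0 * (x - y)

recon = np.ones((2, 2, 3))   # toy reconstruction
truth = np.zeros((2, 2, 3))  # toy ground truth
loss = l2_loss(recon, truth)
```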
[0148] As shown in Figure 7, the experimental results of the DSMT hyperspectral image reconstruction network described in this invention on an open-source hyperspectral natural scene show that the scene details are well restored. The reconstruction effect of this invention can be further illustrated through comparative experiments. The method of this invention was compared with existing methods, namely GAP-TV, DeSCI, TSA-Net, HDNet, MST, and CST, on a hyperspectral natural scene dataset. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were calculated for each method. A higher PSNR indicates less distortion and higher hyperspectral image quality; a higher SSIM indicates that the image structure, brightness, and contrast are more similar to the real image, and thus higher image quality. Table 1 shows the average reconstruction results of the different methods on ten scenes selected from the KAIST test set.
[0149] Table 1. Comparison of DSMT with various methods on the KAIST hyperspectral image dataset (average of ten scenes).
[0150]
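For reference, PSNR between a reconstruction and the ground truth can be computed as in the sketch below. It assumes images normalized to [0, 1] (`max_val=1.0`) and a single mean squared error over the whole cube; the evaluation protocol actually used for Table 1 (for example per-band averaging) is not stated in the text.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB; max_val is the assumed peak intensity."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

truth = np.zeros((4, 4, 3))
recon = np.full((4, 4, 3), 0.1)  # uniform error of 0.1 gives MSE = 0.01
value = psnr(recon, truth)
```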
[0151] The method of this invention achieves the best accuracy on this dataset. Figure 8 shows the visualized results of several deep learning-based methods on one scene. It can be seen that the method described in this invention outperforms the other hyperspectral image reconstruction methods: it recovers image details finely and has advantages over the other methods in avoiding artifacts.
[0152] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications should be covered within the scope of the claims of the present invention.
Claims
1. A two-stage, multi-scale Transformer-based hyperspectral snapshot compressed imaging image reconstruction method, characterized in that: The method includes the following steps: S1: Acquire a two-dimensional compressed image of the target scene; S2: Preprocess the hyperspectral images used as the training set to obtain training sample data; S3: Input the samples into the two-stage multi-scale image reconstruction network for training; S4: After training is completed, image reconstruction is performed on the test samples to obtain the results; In step S3, the two-dimensional compressed measurement training samples processed by the simulated imaging system are initialized. First, by sliding extraction along the channel dimension, each column in the compressed measurement is restored to its specific position in the output hyperspectral image, thereby converting the original two-dimensional data into a coarse three-dimensional hyperspectral image with a higher dimension. Then, the physical mask of the imaging system is concatenated with this hyperspectral image along the channel dimension, and a 1×1 convolution is performed to restore the number of channels to the number of channels in the original hyperspectral image, obtaining the initial input of the network, as shown in the following formula: Where X is the input feature map, M is the physical mask, I(X) is the obtained initial input, the convolution operator in the formula denotes a convolution operation with a filter size of N×N, and "[]" indicates a concatenation operation along the channel dimension; The processed image is input into a two-stage multi-scale Transformer (DSMT). DSMT consists of two stages: coarse-grained feature extraction and pixel fine-tuning. The coarse-grained feature extraction stage contains two identical U-shaped networks. 
The input to the first network is the initial input, and the input to the second network is the concatenation of the output of the first network and the initial input along the channel dimension, followed by a 1×1 convolution operation. The structure is shown in the following formula: Where X is the initial input, $X_{out}$ is the output of the first U-shaped network, and I2(X) is the input feature map of the second network. The U-shaped network consists of three stages: upsampling, intermediate layers, and downsampling. The upsampling module includes three operations: a 1×1 convolution, a 3×3 depthwise separable convolution, and a 1×1 convolution, as shown in the following formula: Where X is the input feature map, U(X) is the feature map obtained after upsampling, and the depthwise convolution operator in the formula denotes a depthwise separable convolution operation with a filter size of N×N; The intermediate layer is an Atrous Spatial Pyramid Pooling (ASPP) layer, which uses dilated convolutions with different dilation rates to process the feature maps. The ASPP layer concatenates these feature maps from different levels along the channel dimension for information fusion, as shown in the following formula: Where X is the input feature map, A(X) is the feature map obtained after passing through the ASPP layer, the dilated convolution operator in the formula denotes a dilated convolution operation with a filter size of N×N and a dilation rate of n, AvgPool() represents an average pooling operation, and Bilinear() represents a bilinear interpolation operation. The downsampling module consists of four operations: 2×2 transposed convolution, 1×1 convolution, 3×3 depthwise separable convolution, and 1×1 convolution, as shown in the following formula: Where X is the input feature map, and D(X) is the feature map obtained after downsampling. 
The transposed convolution operator in the formula denotes a transposed convolution operation with a filter size of N×N; After two downsampling operations, the intermediate layer, and two upsampling operations, the feature map output by the U-shaped network is obtained; therefore, the coarse-grained feature extraction stage has two output feature maps. After the coarse-grained feature extraction stage, the two output feature maps are input into the pixel fine-tuning network, a U-shaped network with a dual-branch encoder. The main component of this U-shaped network is the dual-window multi-scale attention module (DWMAB), whose core is the dual-window multi-scale multi-head self-attention mechanism (DWM-MSA). DWM-MSA first transforms the input feature maps into queries, keys, and values through three linear layers, as shown in the following formula: $\mathrm{query} = XW_q,\ \mathrm{key} = XW_k,\ \mathrm{value} = XW_v$; Where X represents the input feature map; query, key, and value represent the resulting query, key, and value, respectively; and $W_q$, $W_k$, and $W_v$ represent the projection weights of the query, key, and value, respectively. Then, the query, key, and value are each divided into four equal parts along the channel dimension, and the four parts, which have the same number of channels, are processed by four different branches. In the first two branches, the input is divided into h heads along the channel dimension, and window multi-head self-attention is computed within non-overlapping 8×8 windows. The input to the first branch is the original input, and the input to the second branch is the feature map obtained by shuffling the pixel order of the original input. Taking the first branch as an example, the formula is as follows: Where the three input terms represent the query, key, and value of the nth head, computed within a window of size w1×w1. 
The output term represents the self-attention of the nth head computed within a window of size w1×w1, softmax() represents the activation function that maps values to the interval [0,1], and d represents the dimension of each head; Then, the calculated attention is concatenated along the channel dimension, as shown in the following formula: Where $A_{w1}$ represents the resulting self-attention of the first branch; Then, for the second branch, the same operation is performed on the feature map whose pixel order has been shuffled to obtain the self-attention $A'_{w1}$ of the second branch. For the third and fourth branches, self-attention is calculated using non-overlapping windows of size 16×16, with the same operations as in the first and second branches, respectively, to obtain the self-attention $A_{w2}$ and $A'_{w2}$. Finally, the self-attention outputs of the four branches are concatenated along the channel dimension, and a linear layer is applied to obtain the final output of DWM-MSA, as shown in the following formula: $A = \mathrm{Linear}([A_{w1}, A'_{w1}, A_{w2}, A'_{w2}])$; Where A is the self-attention value obtained by DWM-MSA; DWM-MSA enhances the model's ability to perceive multi-scale information and can capture long-distance dependencies of images.
2. The method for reconstructing a two-stage, multi-scale, Transformer-based hyperspectral snapshot compressed imaging image according to claim 1, characterized in that: In step S1, a coded aperture snapshot spectral imaging system is used to acquire a two-dimensional compressed image. Specifically, this includes: S11, selecting the spectral range, i.e., setting the working wavelength to 450nm-650nm, with a total of 28 channels; S12, designing a coded aperture to introduce different optical codes at each pixel; S13, setting the parameters of the coded aperture snapshot spectral imaging system, including but not limited to the number of dispersive elements and the dispersive shift step size; and S14, capturing a compressed measurement image of the entire scene using a corresponding optical sensor.
3. The method for reconstructing a two-stage, multi-scale, Transformer-based hyperspectral snapshot compressed imaging image according to claim 2, characterized in that: In step S2, the hyperspectral images used as the training set are preprocessed. First, the hyperspectral images are divided into two equal parts. The first part is divided into samples using a 256×256 non-overlapping window, and the second part is divided into images using a 128×128 non-overlapping window. These are then randomly combined into several 256×256 samples. Next, data augmentation operations are performed on the two parts, including random rotation and flipping. Finally, by simulating the imaging system, i.e., by modulating with a pre-designed physical mask and then cropping and summing, the hyperspectral images are processed into two-dimensional compressed measurements. The data-augmented hyperspectral images and their corresponding two-dimensional compressed measurements constitute training sample pairs.
4. The two-stage, multi-scale Transformer-based hyperspectral snapshot compressed imaging image reconstruction method according to claim 3, characterized in that: In DWMAB, the feature map is first processed by a 3×3 depthwise separable convolution to obtain a residual, which is then added to the input feature map to perform conditional position encoding, as shown in the following formula: Where X is the input feature map, and CPE(X) is the feature map after conditional position encoding; Then, layer normalization is applied to the position-encoded feature map, followed by DWM-MSA processing to obtain the residual, which is then added to the input feature map, as shown in the following formula: DWM(X) = X + LayerNorm(MSA(X)); Where X is the input feature map, DWM() represents the feature map after passing through the DWMAB module, LayerNorm() represents layer normalization, and MSA() represents the dual-window multi-scale multi-head self-attention mechanism DWM-MSA. Finally, the result after self-attention computation undergoes another layer normalization operation, followed by a feedforward neural network. This feedforward neural network contains 1×1 convolutions, 3×3 depthwise separable convolutions, and 1×1 convolutions, with GELU activation applied before each convolution, as shown in the following formula: Where X is the input feature map, and F(X) represents the feature map after DWMAB processing.
5. The two-stage, multi-scale Transformer-based hyperspectral snapshot compressed imaging image reconstruction method according to claim 4, characterized in that: During the pixel fine-tuning stage, each upsampling in the encoder involves several DWMABs and 4×4 convolutions, as shown in the following formula: Where X is the input feature map, and FU(X) represents the feature map obtained after one upsampling in the pixel fine-tuning stage; After both branches of the encoder have finished encoding, the encoded features are input into the Cross-Attention Fusion Module (CAFM) for feature fusion. This module includes channel attention and spatial attention. First, global max pooling and global average pooling are applied to the feature map output by each of the two branches. The pooled results are then each passed through a two-layer multilayer perceptron, and the two outputs are added together, as shown in the following formula: W(X)=MLP(AvgPool(X))+MLP(MaxPool(X)); Where X is the input feature map, AvgPool() and MaxPool() represent average pooling and max pooling operations respectively, and MLP represents a two-layer multilayer perceptron. Using the method described above, the channel weighting coefficients W1 and W2 of the two branches are calculated; then these coefficients are multiplied together to obtain the cross matrix M, as shown in the following formula: $M = W_1 W_2^{T}$; Where M represents the obtained cross-attention matrix; The features obtained after channel attention and cross-matrix processing can be represented as: $F_s = \mathrm{softmax}(M)\,F_c$; Where $F_s$ represents the feature map obtained after channel attention and cross-attention processing, softmax represents the normalized exponential function, $F_c$ represents the initial input of one branch of CAFM, and M represents the cross-attention matrix; The obtained feature maps are then subjected to average pooling and max pooling operations, and then concatenated along the channel dimension. 
Finally, a weight coefficient is calculated using a 3×3 convolution, as shown in the following formula: Where $F_s$ represents the feature map processed by channel attention and the cross matrix, and W represents the obtained weight coefficients. Finally, the spatial weighting coefficients are multiplied by the input features to obtain the fused features. The fused features of the two branches are then concatenated along the channel dimension and subjected to a 3×3 convolution operation to obtain the final CAFM output, which effectively fuses the features of the two branches. The formula is as follows: Where the two input terms represent the initial inputs of the first and second branches, respectively, and W represents the weight coefficients. After fusing features from the two branches of the encoder using CAFM, the data is processed through an intermediate layer containing six DWMABs. Then, a downsampling stage is performed; each downsampling stage contains a 2×2 transposed convolution and several DWMABs to calculate the residual. This residual is added to the input of the first branch of the encoder, and then subjected to a 3×3 convolution operation to obtain the final residual. This final residual is then added to the initial input of the entire network to obtain the final reconstruction result, as shown in the formula below: Where X represents the initial input feature map, X1 and X2 represent the inputs of the first and second branches of the encoder in the pixel fine-tuning stage, respectively, Fine() represents the feature map obtained in the pixel fine-tuning stage, and DWMT(X) represents the final reconstruction result obtained.
6. The two-stage, multi-scale Transformer-based hyperspectral snapshot compressed imaging image reconstruction method according to claim 5, characterized in that: The loss function for the overall network is set as follows: $\mathrm{Loss} = \|X - Y\|^{2}$; Where X represents the reconstruction result and Y represents the real hyperspectral image. The loss function is calculated, and the model parameters of the hyperspectral image reconstruction framework are optimized based on the loss function and the backpropagation process. After training, the trained hyperspectral image reconstruction framework is obtained; the input sample is reconstructed using the trained hyperspectral image reconstruction framework, and the hyperspectral image reconstruction result is output.