Generative self-supervised hyperspectral image target detection method based on spectral mask
By combining spatial-spectral masking and generative self-supervised learning with a lightweight Transformer encoder, the method addresses the heavy demand for labeled samples and the poor generalization ability of hyperspectral target detection, achieving efficient detection with improved accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF GEOSCIENCES (WUHAN)
- Filing Date
- 2023-12-08
- Publication Date
- 2026-06-23
Figure CN117746235B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of remote sensing image processing technology, and particularly relates to a generative self-supervised hyperspectral image target detection method based on spatial-spectral masking.
Background Technology
[0002] Hyperspectral remote sensing images typically contain hundreds of spectral bands, providing rich spectral information. With their high spectral resolution, hyperspectral remote sensing images can identify ground features with subtle spectral differences, offering a unique advantage for target detection. Hyperspectral target detection can be viewed as the process of locating and identifying specific target pixels in a hyperspectral remote sensing image, aiming to separate the target of interest from various backgrounds with minimal prior knowledge of the target's spectrum. The development of hyperspectral imaging technology has led to the widespread application of hyperspectral target detection in various fields, such as crop quality assessment, disease progression monitoring, geological hazard detection, and military target detection.
[0003] In recent years, target detection methods have primarily been based on deep learning. These methods do not rely on predefined model assumptions; they automatically learn feature representations from large-scale data through multi-layer neural networks, handle high-dimensional data effectively, and are robust during training. However, deep learning architectures typically require a large number of labeled samples, complex networks, and a vast number of parameters. Hyperspectral target detection applications often have few target pixels and only a small number of known target spectra, which cannot meet the demand for a large number of labeled samples. Furthermore, deep learning-based hyperspectral target detection methods are usually trained and tested on the same dataset so that the network learns discriminative information applicable to a specific scene and task. The features learned in this way cannot be transferred effectively to other scenes and target detection tasks, so these methods generalize poorly and are computationally inefficient.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a high-precision and efficient generative self-supervised hyperspectral image target detection method based on spatial-spectral masking. Self-supervised learning makes it possible to learn meaningful feature representations from unlabeled data through a proxy task and to transfer the learned representations to various downstream tasks. The method learns general feature representations from unlabeled hyperspectral data and uses the prior spectral data to obtain a small number of training samples for fine-tuning different target detection networks, thereby achieving target detection. It significantly reduces the need for labeled samples, generalizes well, and better accomplishes hyperspectral image target detection tasks.
[0005] A generative self-supervised hyperspectral image target detection method based on spatial-spectral masking is proposed. A hyperspectral image is a data cube that can be represented by a tensor; the vector at each point of the tensor holds the pixel radiance values of the corresponding bands. The method performs the following operations on the pre-training hyperspectral remote sensing images $X_k$ and the hyperspectral remote sensing image $X$ to be detected:
[0006] S1: Acquire the pre-training hyperspectral remote sensing images $X_k$, the hyperspectral remote sensing image $X$ to be detected, and the prior target spectrum $D$;
[0007] S2: Perform masking simultaneously in the spatial and spectral dimensions on the pre-training hyperspectral remote sensing images $X_k$, and divide them into pixel blocks.
[0008] S3: Input the unmasked pixel blocks processed in step S2 into the encoder to learn feature representations and obtain encoded feature vectors. Then, embed the spatially masked pixel blocks into the encoded feature vectors and input them into the decoder for image reconstruction. After multiple iterations, a pre-trained feature extraction model is obtained.
[0009] S4: Transfer the pre-trained feature extraction model from step S3 to the downstream task, and add a detector consisting of a fully connected layer and a Sigmoid layer to form the target detection network for the detection task.
[0010] S5: Using the prior target spectrum D and the background samples BA obtained by K-means clustering as labeled samples, and combining the weighted binary cross-entropy loss function, the target detection network constructed in step S4 is trained. The encoder parameters are completely frozen, and only the detector is fine-tuned. After multiple iterations, the final target detection network is obtained, and the target is detected in the hyperspectral remote sensing image X.
[0011] Furthermore, in step S2, masking is performed on the pre-training hyperspectral remote sensing image $X_k$ in the spectral dimension: certain bands are randomly masked for each pixel, and the masked band values are filled with 0, which enhances the model's sensitivity to the spectral variability of hyperspectral remote sensing images.
[0012] Masking is performed on the pre-training hyperspectral remote sensing image $X_k$ in the spatial dimension: certain pixels are randomly masked across the entire image, and all band values of the spatially masked pixels are filled with 0;
[0013] The masked hyperspectral remote sensing image is divided into unmasked pixel blocks and spatially masked pixel blocks, denoted $X_{k\_unmasked}$ and $X_{k\_masked}$.
[0014] Further, in step S3, the encoder consists of a 3-layer Transformer encoder, which is composed of alternating layers of multi-head self-attention, multi-layer perceptron, and layer normalization; the process of extracting the encoded feature vector of the unmasked pixel block is as follows:
[0015] (1) Obtain encoder input based on pre-trained hyperspectral remote sensing images:
[0016] The pre-training hyperspectral remote sensing image is mapped to a low-dimensional representation space through a fully connected layer to obtain the initial feature vector $X'_k$:
[0017] $X'_k = \mathrm{FC}(X_k)$
[0018] where $X_k$ denotes the pre-training hyperspectral remote sensing image, FC(·) denotes a fully connected layer, and $X'_k$ denotes the initial feature vector in the low-dimensional representation space;
[0019] From the initial feature vector $X'_k$ in the low-dimensional representation space, the initial feature vector $X'_{k\_unmasked}$ of the unmasked pixel blocks is obtained and used as the encoder input;
[0020] (2) Based on the initial input feature vector, and by combining normalization, multi-head self-attention, and alternating calculations of the multilayer perceptron, the initial encoded feature vector F of the Transformer encoder is obtained:
[0021]
[0022]
[0023] where F denotes the initial encoded feature vector, MSA(·) denotes the mapping function of multi-head self-attention, LN(·) denotes the normalization function, MLP(·) denotes the multilayer perceptron, $f^{(l)}$ and $f^{(l-1)}$ denote the outputs of the current layer and the previous layer, respectively, and the intermediate term denotes the result of the multi-head self-attention computation.
[0024] Multi-head self-attention computation: the input of the current layer is linearly transformed with three different weight matrices $W^Q$, $W^K$ and $W^V$ to obtain the query, key and value matrices; each individual self-attention is computed, and the results of the different linear transformations are then combined into the multi-head self-attention:
[0025] $\mathrm{MSA} = \mathrm{Concat}(\mathrm{SA}_1, \ldots, \mathrm{SA}_h)\,W^O$
[0026]
[0027] where MSA denotes the multi-head self-attention output, h is the number of self-attention heads, Concat(·) denotes the concatenation operation, $W^O$ is the linear-transformation weight matrix that combines the representations of the different self-attention heads, $\mathrm{SA}_i$ denotes the result of the i-th self-attention computation, $Q_i$, $K_i$ and $V_i$ denote the query, key and value matrices of the i-th self-attention, $K_i^T$ denotes the transpose of $K_i$, $d_n$ denotes the scaling factor, n is the dimension of Q and K, and softmax(·) is the activation function;
[0028] (3) Using the encoder, which consists of a 3-layer Transformer encoder, the feature representation is learned to obtain the encoded feature vector $F_{en}$:
[0029] $F_{en} = F_{depth=3}(X'_{k\_unmasked})$
[0030] where $F_{depth=3}(\cdot)$ denotes the encoder, depth denotes the number of Transformer encoder layers, $X'_{k\_unmasked}$ denotes the initial feature vector of the unmasked pixel blocks, and $F_{en}$ denotes the encoded feature vector.
[0031] Further, in step S3, the decoder consists of a single Transformer encoder layer. The initial feature vectors of the spatially masked pixel blocks are obtained through the fully connected layer mapping, embedded into the encoded feature vectors, and input into the decoder for reconstruction, yielding the reconstructed image $F_{de}$:
[0032]
[0033] where $F_{depth=1}(\cdot)$ denotes the decoder, depth denotes the number of Transformer encoder layers, $X'_{k\_masked}$ denotes the initial feature vector of the spatially masked pixel blocks, the operator joining the two inputs denotes the embedding operation, $F_{en}$ denotes the encoded feature vector, and $F_{de}$ denotes the reconstructed image.
[0034] Further, in step S3, the reconstruction loss is calculated using the mean squared error loss function to constrain the feature extraction model. The formula for calculating the mean squared error loss function is as follows:
[0035]
[0036] where $L_{MSE}$ denotes the reconstruction loss; N is a positive integer giving the number of samples in a batch; and the two terms compared are the i-th original image pixel block covered by the spatial mask in the batch and the corresponding i-th reconstructed spatially masked pixel block.
[0037] Furthermore, in step S4, the trained encoder is transferred to the downstream task, and a detector consisting of a fully connected layer and a sigmoid layer is added to form the target detection network for the detection task:
[0038]
[0039] where σ denotes the Sigmoid activation function, FC(·) denotes a fully connected layer, and $F_{en}$ denotes the encoded feature vector. The output is the probability value obtained from the target detection network; the higher the probability value, the more likely the pixel is a target.
[0040] Furthermore, in step S5, the process of obtaining labeled samples is as follows:
[0041] The prior target spectrum $D$ is used as the target samples. K-means clustering is performed on the hyperspectral remote sensing image $X$ to be detected, and the cluster centers are used as the background samples, denoted $BA$. The prior target spectrum $D$ and the background samples $BA$ together constitute the final labeled sample set $S = (D \cup BA)$. The label of a target sample is 1 and the label of a background sample is 0; $y_i$ denotes the true label of the i-th sample, where i is a positive integer, L is the number of prior spectra, and C is the number of background samples.
[0042] Furthermore, in step S5, to balance the target prior spectrum D and the number of background samples BA, a weighted binary cross-entropy loss function is introduced as follows:
[0043]
[0044] where r is the ratio of the number of target samples to the total number of samples, $y_i$ denotes the true label of the i-th sample, the network output for the i-th sample is its predicted probability value, and $L_{WBCE}$ denotes the weighted binary cross-entropy loss function.
[0045] A storage device that stores instructions and data for implementing a generative self-supervised hyperspectral image target detection method based on spatial spectrum masks.
[0046] A generative self-supervised hyperspectral image target detection device based on spatial spectrum masking includes: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement a generative self-supervised hyperspectral image target detection method based on spatial spectrum masking.
[0047] The beneficial effects of the technical solution provided by this invention are as follows: the invention encodes and decodes features of hyperspectral remote sensing data using a generative model and a spatial-spectral masking strategy, learns general feature representations without labeled samples, and constructs a transfer learning model from a lightweight Transformer encoder, a fully connected layer, and a sigmoid layer. With a small number of training samples and the weighted binary cross-entropy loss, only the detector, composed of a single fully connected layer and a sigmoid layer, is fine-tuned. The transfer learning model can therefore be trained rapidly, and the pre-trained model transfers effectively to different target detection tasks. As a result, the method significantly reduces the need for labeled samples, generalizes well, and runs faster at inference than the compared deep learning models, and thus better accomplishes hyperspectral image target detection tasks.
Attached Figure Description
[0048] Figure 1 is a flowchart of a generative self-supervised hyperspectral image target detection method based on spatial-spectral masking according to an embodiment of the present invention.
[0049] Figure 2 is a framework diagram of the generative self-supervised hyperspectral image target detection method based on spatial-spectral masking in this embodiment of the invention.
[0050] Figure 3 is a schematic diagram of the hardware device in operation in an embodiment of the present invention.
Detailed Implementation
[0051] Specific embodiments of the present invention are described in further detail below with reference to Figure 1.
[0052] The core of the invention is a generative self-supervised hyperspectral image target detection framework based on spatial-spectral masking. The method first applies spatial-spectral masking to the pre-training hyperspectral remote sensing images and then reconstructs the masked portion. The pre-training network consists of an encoder and a decoder, both built from Transformer encoder layers, which capture global dependencies in the input data and better extract spatial-spectral features from hyperspectral remote sensing data. The pre-trained encoder is then transferred to the downstream target detection task, and the detector is fine-tuned with a small number of samples, so that the pre-trained model transfers effectively to different target detection tasks. Furthermore, a weighted binary cross-entropy loss is introduced in the downstream task to balance the samples and further improve detection performance.
[0053] This invention is implemented in Python with the deep learning framework PyTorch, building on Python remote sensing image read/write functions. It calls the data processing libraries NumPy, SciPy, and Spectral; given the filename of the remote sensing image, the image is read into a tensor, in which each element is the pixel radiance value of the corresponding band. The Python remote sensing image read/write functions are well-known in this field.
[0054] As shown in Figures 1 and 2, a generative self-supervised hyperspectral image target detection method based on spatial-spectral masking includes the following steps:
[0055] S1: Acquire the pre-training hyperspectral remote sensing images $X_k$, the hyperspectral remote sensing image $X$ to be detected, and the prior target spectrum $D$; where k is the total number of pre-training images; $H_k$, $W_k$ and $B_k$ denote the length, width and number of bands of the pre-training hyperspectral remote sensing image, respectively; H, W and B denote the length, width and number of bands of the hyperspectral remote sensing image to be detected, respectively; and L denotes the number of prior spectra.
[0056] S2: Perform masking simultaneously in the spatial and spectral dimensions on the pre-training hyperspectral remote sensing images $X_k$ and divide them into pixel blocks; masking here means randomly masking part of the pixels rather than all of the pixels.
[0057] S3: Input the unmasked pixel blocks processed in step S2 into the encoder to learn feature representations and obtain encoded feature vectors. Then, embed the spatially masked pixel blocks into the encoded feature vectors and input them into the decoder for image reconstruction. After multiple iterations, a pre-trained feature extraction model is obtained. Some bands in the unmasked pixel blocks are assigned a value of 0 (i.e., spectral masking is superimposed).
[0058] S4: Transfer the pre-trained feature extraction model from step S3 to the downstream task, and add a detector consisting of a fully connected layer and a Sigmoid layer to form the target detection network for the detection task.
[0059] S5: Using the prior target spectrum D and the background samples BA obtained by K-means clustering as labeled samples, and combining the weighted binary cross-entropy loss function, the target detection network constructed in step S4 is trained. The encoder parameters are completely frozen, and only the detector is fine-tuned. After multiple iterations, the final target detection network is obtained, and the target is detected in the hyperspectral remote sensing image X.
[0060] The specific steps are as follows:
[0061] (1) Hyperspectral remote sensing images contain rich spatial-spectral information. Therefore, the pre-training hyperspectral remote sensing images are randomly masked in the spatial and spectral dimensions simultaneously. The masked images are then divided into unmasked pixel blocks and spatially masked pixel blocks. Specifically:
[0062] Masking is performed on the pre-training hyperspectral remote sensing image $X_k$ along the spectral dimension: certain bands are randomly masked for each pixel, and the masked band values are filled with 0. Spectral masking enhances the model's sensitivity to the spectral variability of hyperspectral remote sensing images.
[0063] Masking is performed on the pre-training hyperspectral remote sensing image $X_k$ in the spatial dimension: certain pixels are randomly masked across the entire image, and all band values of the spatially masked pixels are filled with 0. Spatial masking primarily serves the reconstruction task during pre-training.
[0064] The masked hyperspectral remote sensing image is divided into unmasked pixel blocks and spatially masked pixel blocks, denoted $X_{k\_unmasked}$ and $X_{k\_masked}$.
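The masking step can be illustrated with a minimal NumPy sketch. It is not the implementation of the invention: the mask ratios (`spatial_ratio`, `spectral_ratio`), the function name, and the treatment of each pixel as one pixel block are assumptions made only for illustration.

```python
import numpy as np

def spatial_spectral_mask(cube, spatial_ratio=0.5, spectral_ratio=0.2, seed=0):
    """Zero-fill random bands per pixel, then zero-fill random whole pixels.

    cube: array of shape (H, W, B). Both ratios are hypothetical values.
    Returns the masked pixels as an (H*W, B) array and a boolean vector that
    is True for spatially masked pixels.
    """
    rng = np.random.default_rng(seed)
    h, w, b = cube.shape
    pixels = cube.reshape(h * w, b).astype(float)  # astype returns a copy

    # Spectral mask: an independent random subset of bands for every pixel.
    band_mask = rng.random((h * w, b)) < spectral_ratio
    pixels[band_mask] = 0.0

    # Spatial mask: a random subset of pixels loses all of its band values.
    n_masked = int(round(spatial_ratio * h * w))
    masked_idx = rng.choice(h * w, size=n_masked, replace=False)
    is_masked = np.zeros(h * w, dtype=bool)
    is_masked[masked_idx] = True
    pixels[is_masked] = 0.0
    return pixels, is_masked

cube = np.random.default_rng(1).random((8, 8, 16)) + 0.1  # strictly positive toy data
pixels, is_masked = spatial_spectral_mask(cube)
x_unmasked = pixels[~is_masked]   # spectrally masked only; goes to the encoder
x_masked = pixels[is_masked]      # fully zeroed; reconstructed by the decoder
```

Because the spectral mask is applied before the split, the unmasked pixel blocks still contain zero-filled bands, which matches the remark in step S3 that spectral masking is superimposed on them.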
[0065] (2) Obtain the encoder input from the pre-training hyperspectral remote sensing images. The pre-training hyperspectral remote sensing image is mapped to a low-dimensional representation space through a fully connected layer to obtain the initial feature vector $X'_k$:
[0066] $X'_k = \mathrm{FC}(X_k)$
[0067] where $X_k$ denotes the pre-training hyperspectral remote sensing image, FC(·) denotes a fully connected layer, and $X'_k$ denotes the initial feature vector in the low-dimensional representation space;
[0068] From the initial feature vector $X'_k$ in the low-dimensional representation space, the initial feature vector $X'_{k\_unmasked}$ of the unmasked pixel blocks is obtained and used as the encoder input;
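As a small sketch of this projection, the fully connected layer can be written as an affine map applied to every pixel, after which the unmasked rows are selected. The sizes, the random weights, and the variable names below are hypothetical and stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_bands, dim = 64, 16, 8          # hypothetical sizes

x_k = rng.random((n_pixels, n_bands))       # flattened (already masked) pixels
is_masked = rng.random(n_pixels) < 0.5      # spatial mask flag per pixel

# Fully connected layer FC(.): an affine map from band space to `dim` features.
w_fc = rng.normal(scale=0.1, size=(n_bands, dim))
b_fc = np.zeros(dim)
x_prime = x_k @ w_fc + b_fc                 # X'_k

x_prime_unmasked = x_prime[~is_masked]      # encoder input
x_prime_masked = x_prime[is_masked]         # kept for the decoder
```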
[0069] (3) Based on the input feature vector, and by combining normalization, multi-head self-attention, and alternating calculations of the multilayer perceptron, the initial encoding feature vector F of the Transformer encoder is obtained:
[0070]
[0071]
[0072] where F denotes the initial encoded feature vector, MSA(·) denotes the mapping function of multi-head self-attention, LN(·) denotes the normalization function, MLP(·) denotes the multilayer perceptron, $f^{(l)}$ and $f^{(l-1)}$ denote the outputs of the current layer and the previous layer, respectively, and the intermediate term denotes the result of the multi-head self-attention computation.
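For reference, one common arrangement consistent with the symbols defined above is the pre-normalization residual form used in Vision Transformer style encoders. Whether the invention places the normalization before or after each sub-layer is not stated here, so the following is an assumption, and the symbol $\hat{f}^{(l)}$ for the attention result is hypothetical:

```latex
\hat{f}^{(l)} = \mathrm{MSA}\big(\mathrm{LN}(f^{(l-1)})\big) + f^{(l-1)}, \qquad
f^{(l)} = \mathrm{MLP}\big(\mathrm{LN}(\hat{f}^{(l)})\big) + \hat{f}^{(l)}
```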
[0073] Multi-head self-attention is a combination of multiple self-attentions, each containing one attention head.
[0074] Multi-head self-attention computation: the input of the current layer is linearly transformed with three different weight matrices $W^Q$, $W^K$ and $W^V$ to obtain the query, key and value matrices; each individual self-attention is computed, and the results of the different linear transformations are then combined into the multi-head self-attention:
[0075] $\mathrm{MSA} = \mathrm{Concat}(\mathrm{SA}_1, \ldots, \mathrm{SA}_h)\,W^O$
[0076]
[0077] where MSA denotes the multi-head self-attention output, h is the number of self-attention heads, Concat(·) denotes the concatenation operation, $W^O$ is the linear-transformation weight matrix that combines the representations of the different self-attention heads, $\mathrm{SA}_i$ denotes the result of the i-th self-attention computation, $Q_i$, $K_i$ and $V_i$ denote the query, key and value matrices of the i-th self-attention, $K_i^T$ denotes the transpose of $K_i$, $d_n$ denotes the scaling factor, n is the dimension of Q and K, and softmax(·) is the activation function.
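The multi-head computation can be sketched in NumPy as follows. The sketch assumes the standard scaled dot-product form, $\mathrm{SA}_i = \mathrm{softmax}(Q_i K_i^T / \sqrt{d_n})\,V_i$, with $d_n$ taken as the per-head dimension; the square root and the way the heads are split from one projection are assumptions, and all weights are random placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, h):
    """x: (tokens, dim); w_q, w_k, w_v, w_o: (dim, dim); h: number of heads."""
    n_tokens, dim = x.shape
    d_n = dim // h                             # per-head dimension of Q and K
    # Linear transformations, then split the feature axis into h heads.
    q = (x @ w_q).reshape(n_tokens, h, d_n).transpose(1, 0, 2)
    k = (x @ w_k).reshape(n_tokens, h, d_n).transpose(1, 0, 2)
    v = (x @ w_v).reshape(n_tokens, h, d_n).transpose(1, 0, 2)
    # Assumed form: SA_i = softmax(Q_i K_i^T / sqrt(d_n)) V_i for every head i.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_n)
    sa = softmax(scores) @ v                   # (h, tokens, d_n)
    # Concat(SA_1, ..., SA_h) W^O
    concat = sa.transpose(1, 0, 2).reshape(n_tokens, dim)
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.3, size=(8, 8)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, h=2)
```

Each token attends to every other token, which is how the encoder captures global dependencies across the unmasked pixel blocks.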
[0078] (4) Using the encoder, which consists of a 3-layer Transformer encoder, the feature representation is learned to obtain the encoded feature vector $F_{en}$:
[0079] $F_{en} = F_{depth=3}(X'_{k\_unmasked})$
[0080] where $F_{depth=3}(\cdot)$ denotes the encoder, depth denotes the number of Transformer encoder layers, $X'_{k\_unmasked}$ denotes the initial feature vector of the unmasked pixel blocks, $F_{en}$ denotes the encoded feature vector, and $F_{de}$ denotes the reconstructed image.
[0081] (5) The initial feature vectors of the spatially masked pixel blocks are obtained through the fully connected layer mapping, embedded into the encoded feature vectors, and input into the decoder for reconstruction to obtain the reconstructed image $F_{de}$. The decoder consists of a single Transformer encoder layer, and the reconstructed feature vector is:
[0082]
[0083] where $F_{depth=1}(\cdot)$ denotes the decoder, depth denotes the number of Transformer encoder layers, $X'_{k\_masked}$ denotes the initial feature vector of the spatially masked pixel blocks, the operator joining the two inputs denotes the embedding operation, and $F_{en}$ denotes the encoded feature vector.
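One plausible reading of the embedding operation is that the masked-block features are placed back at their original pixel positions among the encoded features, so that the decoder receives a single full-length sequence. The sketch below illustrates only that reading; it is an assumption, and the arrays are stand-in values rather than real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, dim = 12, 4
is_masked = np.zeros(n_pixels, dtype=bool)
is_masked[rng.choice(n_pixels, size=5, replace=False)] = True

f_en = rng.normal(size=(int((~is_masked).sum()), dim))   # encoder output (stand-in values)
x_prime_masked = np.zeros((int(is_masked.sum()), dim))   # FC features of masked pixels (stand-in)

# Embedding step (assumed): restore the original pixel order so the decoder
# sees one sequence containing both encoded and masked positions.
decoder_input = np.empty((n_pixels, dim))
decoder_input[~is_masked] = f_en
decoder_input[is_masked] = x_prime_masked
```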
[0084] (6) To reduce the reconstruction error of the pre-trained model, the mean squared error loss is used to constrain the pre-training iterations, and the loss is computed only between the reconstructed spatially masked pixel blocks $F_{de\_masked}$ and the original image pixel blocks $X_{k\_masked}$:
[0085]
[0086] where N is the number of samples in a batch, and the two terms compared are the i-th original image pixel block covered by the spatial mask in the batch and the i-th reconstructed spatially masked pixel block.
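A minimal sketch of a reconstruction loss restricted to the spatially masked blocks is shown below. Averaging over both the bands of each block and the N blocks of the batch is an assumed normalization; the function name is hypothetical.

```python
import numpy as np

def masked_mse(x_masked, f_de_masked):
    """Mean over the batch of the per-block mean squared error."""
    x_masked = np.asarray(x_masked, dtype=float)
    f_de_masked = np.asarray(f_de_masked, dtype=float)
    per_block = ((x_masked - f_de_masked) ** 2).mean(axis=1)
    return per_block.mean()

x = np.array([[1.0, 2.0], [3.0, 4.0]])     # original masked blocks
rec = np.array([[1.0, 0.0], [3.0, 4.0]])   # reconstructed blocks
loss = masked_mse(x, rec)                  # (4/2 + 0) / 2 = 1.0
```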
[0087] (7) The trained encoder is transferred to the downstream target detection task, and a detector consisting of a fully connected layer and a sigmoid layer is added to form the target detection network for the detection task:
[0088]
[0089] where σ denotes the Sigmoid activation function, FC(·) denotes a fully connected layer, and $F_{en}$ denotes the encoded feature vector. The output is the probability value obtained from the detection network; the higher the probability value, the more likely the pixel is a target.
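The detector head described above, a fully connected layer followed by a sigmoid applied to the encoded features, can be sketched as follows. The feature values and weights are random placeholders, and the names `detect`, `w` and `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detect(f_en, w, b):
    """Detector head: sigma(FC(F_en)); one probability per encoded pixel."""
    return sigmoid(f_en @ w + b)

rng = np.random.default_rng(0)
f_en = rng.normal(size=(5, 8))       # stand-in for frozen encoder features
w = rng.normal(scale=0.1, size=8)    # the detector weights are the only trainable parameters
b = 0.0
p = detect(f_en, w, b)
```

Since the encoder is frozen, fine-tuning only has to update `w` and `b`, which is why the downstream training is lightweight.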
[0090] (8) To train the target detection network, labeled samples are obtained. The prior target spectrum $D$ is used as the target samples. K-means clustering is performed on the hyperspectral remote sensing image $X$ to be detected, and the cluster centers are used as the background samples, denoted $BA$. The prior target spectrum $D$ and the background samples $BA$ together constitute the final labeled sample set $S = (D \cup BA)$. The label of a target sample is 1 and the label of a background sample is 0; $y_i$ denotes the true label of the i-th sample, where i is a positive integer, L is the number of prior spectra, and C is the number of background samples.
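The construction of the labeled sample set can be sketched with a plain Lloyd-style K-means written in NumPy. The number of clusters, the iteration count, the random initialization, and the toy data are assumptions; any standard K-means implementation could be substituted.

```python
import numpy as np

def kmeans_centers(pixels, c, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns c cluster centers of shape (c, B)."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=c, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each pixel to its nearest center (squared Euclidean distance).
        dist = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(c):
            members = pixels[assign == j]
            if len(members):               # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
image = rng.random((20, 20, 6))            # toy image X to be detected
d = rng.random((3, 6))                     # L = 3 prior target spectra (toy values)

ba = kmeans_centers(image.reshape(-1, 6), c=5)     # C = 5 background samples
s = np.vstack([d, ba])                             # S = D ∪ BA
y = np.concatenate([np.ones(len(d)), np.zeros(len(ba))])   # 1 = target, 0 = background
```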
[0091] (9) To balance the target prior spectrum D and the number of background samples BA, a weighted binary cross-entropy loss function is used to train the target detection network:
[0092]
[0093] where r is the ratio of the number of target samples to the total number of samples, $y_i$ denotes the true label of the i-th sample, the detection network output for the i-th sample is its predicted probability value, and $L_{WBCE}$ denotes the weighted binary cross-entropy loss function.
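One way to realize such a balance is to weight the target term by $(1 - r)$ and the background term by $r$, so that the rarer class receives the larger weight. This particular weighting is an assumption chosen for illustration and may differ from the formula of the invention; the function name is hypothetical.

```python
import numpy as np

def weighted_bce(y, p, eps=1e-12):
    """Weighted binary cross-entropy with class weights derived from r."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    r = y.mean()                          # target samples / total samples
    # Assumed weighting: the rarer class receives the larger weight.
    loss = -((1.0 - r) * y * np.log(p) + r * (1.0 - y) * np.log(1.0 - p))
    return loss.mean()

y = np.array([1, 0, 0, 0])                     # one target, three background samples
p_good = np.array([0.9, 0.1, 0.1, 0.1])        # confident and correct
p_bad = np.array([0.1, 0.9, 0.9, 0.9])         # confident and wrong
```

With this weighting, one target sample and three background samples predicted at the same confidence contribute equally to the total loss, which is the balancing effect the text describes.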
[0094] The following comparative experiments will verify the beneficial effects of the present invention.
[0095] This embodiment uses the Hyperspectral Anomaly Detection 100 dataset as the pre-training dataset for the model, and four hyperspectral datasets (San Diego, Bay Park, Wuhan University Rocks, and Grand Island) as detection datasets to verify the model's effectiveness. The Hyperspectral Anomaly Detection 100 dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor and uniformly cropped to an image size of 64×64. The AVIRIS sensor collects spectral data in 224 bands ranging from 400 to 2500 nm. This dataset is used for pre-training because of its rich diversity of ground features, such as vehicles, ships, buildings, grasslands, forests, farmland, deserts, rivers, and coastlines. The San Diego dataset was acquired by the AVIRIS sensor over San Diego Airport in California, USA, with an image size of 400×400. After removing the water vapor absorption bands and noise bands, 189 bands remain. The targets of interest in this image are three aircraft, comprising 134 pixels in total. The target pixels account for 1.340% of the total image pixels, and there are three prior target spectra. The Bay Park dataset was acquired using a compact airborne spectral imager (CASI)-1500 sensor, with an image size of 325×220. It is located in Long Beach, Mississippi, USA, and covers part of the Bay Park campus of the University of Southern Mississippi in Hattiesburg. The dataset includes 64 bands with a wavelength range of 367.7–1043.4 nm. The targets of interest in this image are four slabs, comprising 269 pixels in total. The target pixels account for 0.414% of the total image pixels, and there are two prior target spectra. The Wuhan University Rocks dataset was acquired using a reflection-free imaging NanoVIP ultrafast camera (Nuance Cri), with an image size of 400×400. It is located on the lawn of Wuhan University in Wuhan, Hubei Province, China. The dataset includes 46 bands with a wavelength range of 650–1100 nm. The targets of interest in this image are ten rocks, comprising 1254 pixels in total. The target pixels account for 0.783% of the total image pixels, and there are two prior target spectra. The Grand Island dataset was collected by the AVIRIS sensor and has an image size of 200×380. It is located along the Grand Island Bay coast in Jefferson Parish, Louisiana, USA. The dataset includes 224 bands with a wavelength range of 366–2496 nm. The targets of interest in this image are several man-made features, comprising 279 pixels in total. The target pixels account for 0.367% of the total image pixels, and there are three prior target spectra.
[0096] To demonstrate its effectiveness, the method of the present invention is compared with the classical constrained energy minimization (Method 1), adaptive cosine estimator (Method 2), decomposition model of learning background dictionary (Method 3), tree structure coding model (Method 4), background learning method for suppressing the target (Method 5), auxiliary generative adversarial network (Method 6), two-stream convolutional network (Method 7), and interpretable representation network (Method 8).
[0097] Target detection evaluation metrics: two quantitative evaluation metrics are used, namely the area under the receiver operating characteristic curve (AUC) and the running time. They are defined as follows:
[0098] (1) AUC value:
[0099] AUC is a widely used evaluation metric for target detection. The most typical variant is $\mathrm{AUC}_{(P_f,P_d)}$: the detection probability ($P_d$) and the false alarm rate ($P_f$) are computed at different thresholds, the false alarm rate is plotted on the x-axis and the detection probability on the y-axis, and the area under the resulting curve is taken. It measures overall detection performance; a higher value indicates better overall detection results. In addition, with the threshold (τ) on the x-axis and either $P_d$ or $P_f$ on the y-axis, the areas $\mathrm{AUC}_{(\tau,P_d)}$ and $\mathrm{AUC}_{(\tau,P_f)}$ are computed to evaluate target saliency and background suppression. The lower $\mathrm{AUC}_{(\tau,P_f)}$ is, the better the background suppression; the higher $\mathrm{AUC}_{(\tau,P_d)}$ is, the stronger the target detection capability. The three metrics are calculated as follows:
[0100]
[0101]
[0102]
[0103]
[0104]
[0105] where $N_d$ is the number of correctly detected target pixels, $N_t$ is the actual number of target pixels, $N_f$ is the number of falsely detected pixels, and $N_{all}$ is the total number of pixels.
[0106] Based on the above metrics, three further AUC variants are used to characterize the detector comprehensively; for each, a higher value indicates stronger capability. They measure target detection (TD) capability, background suppression (BS) capability, and overall detection probability (ODP), denoted $\mathrm{AUC}_{TD}$, $\mathrm{AUC}_{BS}$ and $\mathrm{AUC}_{ODP}$, and are calculated as follows:
[0107] $\mathrm{AUC}_{TD} = \mathrm{AUC}_{(P_f,P_d)} + \mathrm{AUC}_{(\tau,P_d)}$
[0108] $\mathrm{AUC}_{BS} = \mathrm{AUC}_{(P_f,P_d)} - \mathrm{AUC}_{(\tau,P_f)}$
[0109] $\mathrm{AUC}_{ODP} = \mathrm{AUC}_{(\tau,P_d)} + (1 - \mathrm{AUC}_{(\tau,P_f)})$
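The metrics above can be sketched in NumPy as follows. The sketch assumes $P_d = N_d / N_t$ and $P_f = N_f / N_{all}$, inferred from the symbol definitions, a uniform threshold grid on [0, 1], and trapezoidal integration; the function names and dictionary keys are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule area under y(x); x must be non-decreasing."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def auc_metrics(scores, truth, n_thresholds=1001):
    """scores: detector outputs in [0, 1]; truth: True for target pixels."""
    scores = np.asarray(scores, dtype=float).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()
    taus = np.linspace(0.0, 1.0, n_thresholds)
    detected = scores[None, :] > taus[:, None]             # (thresholds, pixels)
    p_d = (detected & truth).sum(axis=1) / truth.sum()     # assumed N_d / N_t
    p_f = (detected & ~truth).sum(axis=1) / truth.size     # assumed N_f / N_all
    # P_d and P_f fall as tau rises, so reversing gives a non-decreasing P_f axis.
    m = {"pf_pd": trapezoid(p_d[::-1], p_f[::-1]),
         "tau_pd": trapezoid(p_d, taus),
         "tau_pf": trapezoid(p_f, taus)}
    m["td"] = m["pf_pd"] + m["tau_pd"]
    m["bs"] = m["pf_pd"] - m["tau_pf"]
    m["odp"] = m["tau_pd"] + (1.0 - m["tau_pf"])
    return m

truth = np.zeros(100, dtype=bool)
truth[:10] = True                          # 10 target pixels out of 100
scores = np.where(truth, 0.9, 0.1)         # a well-separated toy detector
metrics = auc_metrics(scores, truth)
```

The threshold-by-pixel comparison matrix is convenient for a toy example; for a full-size image, looping over thresholds would use less memory.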
[0110] (2) Running time:
[0111] Model runtime is a key metric for evaluating computational efficiency and speed. In real-time applications, an efficient runtime is especially important. The shorter the runtime, the more efficient the model.
[0112] In this experiment, the AUC value was used to evaluate the detection capability of methods 1-8 and the method of the present invention, and the running time was used to evaluate the computational efficiency of methods 5-8 and the method of the present invention. The AUC values are shown in Table 1, and the running time is shown in Table 2.
[0113] Table 1 Comparison of experimental results
[0114]
[0115]
[0116]
[0117] As can be seen from Table 1, the method of the present invention achieves higher $\mathrm{AUC}_{(P_f,P_d)}$, $\mathrm{AUC}_{BS}$ and $\mathrm{AUC}_{ODP}$ values, and more than half of the indicators reach the optimal level, indicating that the method has better overall detection performance and better background suppression.
[0118] Table 2 Comparison of experimental results
[0119]
[0120] Table 2 shows the training and testing times of the deep learning-based methods (methods 5-8) and the method of this invention on four datasets; for the method of this invention, the reported training time is the fine-tuning time on the downstream task. The method of this invention achieves the shortest training and testing times, and its advantage is most pronounced in training time. Compared with recent deep learning methods, the method of this invention therefore has higher runtime efficiency.
[0121] Referring to Figure 3, which is a schematic diagram of the hardware device in operation according to an embodiment of the present invention, the hardware device specifically includes: a generative self-supervised hyperspectral image target detection device 301 based on spatial spectrum mask, a processor 302, and a storage device 303.
[0122] A generative self-supervised hyperspectral image target detection device 301 based on spatial spectrum mask: The generative self-supervised hyperspectral image target detection device 301 based on spatial spectrum mask implements the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask.
[0123] Processor 302: The processor 302 loads and executes the instructions and data in the storage device 303 to implement the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask.
[0124] Storage device 303: The storage device 303 stores instructions and data; the storage device 303 is used to implement the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask.
[0125] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
[0126] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
Claims
1. A generative self-supervised hyperspectral image target detection method based on spatial spectral masking, comprising the following steps:
S1: obtaining a pre-trained hyperspectral remote sensing image, a to-be-detected hyperspectral remote sensing image, and a prior target spectrum;
S2: performing masking simultaneously in the spatial and spectral dimensions on the pre-trained hyperspectral remote sensing image, and dividing it into pixel blocks;
S3: inputting the unmasked pixel blocks obtained in step S2 into the encoder to learn feature representations and obtain encoded feature vectors; then embedding the spatially masked pixel blocks into the encoded feature vectors and inputting them into the decoder for image reconstruction; after multiple iterations, a pre-trained feature extraction model is obtained;
S4: transferring the pre-trained feature extraction model from step S3 to the downstream task, and adding a detector consisting of a fully connected layer and a Sigmoid layer to form the target detection network for the detection task;
S5: using the prior target spectrum and the background samples obtained by K-means clustering as labeled samples, training the target detection network constructed in step S4 with a weighted binary cross-entropy loss function, wherein the encoder parameters are completely frozen and only the detector parameters are fine-tuned; after multiple iterations, the final target detection network is obtained and applied to the to-be-detected hyperspectral remote sensing image to perform target detection;
in step S3, the encoder consists of a 3-layer Transformer encoder, which is composed of alternating layers of multi-head self-attention, multi-layer perceptron, and layer normalization;
the process of extracting the encoded feature vector of the unmasked pixel blocks is as follows:
(1) obtaining the encoder input based on the pre-trained hyperspectral remote sensing image: the pre-trained hyperspectral remote sensing image is mapped to a low-dimensional representation space through a fully connected layer to obtain the initial feature vector, where the terms of the mapping formula denote the pre-trained hyperspectral remote sensing image, the fully connected layer, and the initial feature vector in the low-dimensional representation space; from the initial feature vector in the low-dimensional representation space, the initial feature vector of the unmasked pixel blocks is obtained and used as the encoder input;
(2) based on the initial input feature vector, the initial encoded feature vector F of the Transformer encoder is obtained by alternating normalization, multi-head self-attention, and multilayer perceptron computations, where F represents the initial encoded feature vector and the remaining terms denote the mapping function of multi-head self-attention, the normalization function, the computation function of the multilayer perceptron, the output of the current layer, the output of the previous layer, and the calculation result of multi-head self-attention;
multi-head self-attention computation: the input of the current layer is linearly transformed by three different weight matrices to obtain the query, key, and value matrices; each individual self-attention is calculated, and the results of the different linear transformations are then combined to perform the multi-head self-attention calculation, where MSA denotes the multi-head self-attention computation with concatenation, and the remaining terms denote the number of self-attention heads, the concatenation operation, and the weight matrix of the linear transformation that combines the representations of the different self-attention heads; $SA_i$ represents the calculation result of the i-th self-attention, $Q_i$ represents the query matrix of the i-th self-attention, $K_i$ represents the key matrix of the i-th self-attention, $K_i^{T}$ represents the transpose of the key matrix $K_i$, and $V_i$ represents the value matrix of the i-th self-attention; the scaling factor is defined in terms of n, the dimension of Q and K; and an activation function is applied;
(3) based on the encoder consisting of a 3-layer Transformer encoder, the feature representation is learned to obtain the encoded feature vector, where the terms of the formula denote the encoder, depth (the number of layers in the Transformer encoder), the initial feature vector of the pixel blocks without spatial masking, and the encoded feature vector.
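The encoder computation recited in claim 1 can be illustrated with a short NumPy sketch. It is not the claimed implementation. It assumes the standard scaled dot-product form, in which each head computes softmax(Q_i K_i^T / sqrt(n)) V_i and the heads are concatenated and multiplied by an output weight matrix. The pre-norm residual layout, the ReLU activation in the multilayer perceptron, the parameter names (`wq`, `wk`, `wv`, `wo`, `w1`, `w2`), and all dimensions are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Normalise each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_head_self_attention(x, p, num_heads):
    """x: (tokens, dim). Returns the MSA output and the attention weights."""
    tokens, dim = x.shape
    head_dim = dim // num_heads  # n: dimension of Q and K per head

    def split(m):  # (tokens, dim) -> (heads, tokens, head_dim)
        return m.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    heads = weights @ v                                    # SA_i for every head
    concat = heads.transpose(1, 0, 2).reshape(tokens, dim) # concatenate the heads
    return concat @ p["wo"], weights


def encoder_layer(x, p, num_heads):
    # Assumption: pre-norm residual layout (normalise, transform, add back).
    attn_out, _ = multi_head_self_attention(layer_norm(x), p, num_heads)
    x = x + attn_out
    hidden = np.maximum(layer_norm(x) @ p["w1"], 0.0)  # ReLU MLP (assumption)
    return x + hidden @ p["w2"]


def init_layer(rng, dim, mlp_dim):
    """Small random weights for one layer (hypothetical initialisation)."""
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, mlp_dim), "w2": (mlp_dim, dim)}
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}


def encode(x, layers, num_heads):
    """Apply the stacked encoder layers; three layers mirrors the claim."""
    for p in layers:
        x = encoder_layer(x, p, num_heads)
    return x
```

For example, with `rng = np.random.default_rng(0)`, ten unmasked pixel-block features of width 32 can be encoded by `encode(rng.normal(size=(10, 32)), [init_layer(rng, 32, 64) for _ in range(3)], num_heads=4)`, which returns an array of the same shape.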
2. In the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask as described in claim 1, in step S2, the process of performing masking on the pre-trained hyperspectral remote sensing image simultaneously in the spatial and spectral dimensions is as follows: masking in the spectral dimension: certain bands are randomly masked for each pixel, and the masked band values are filled with 0; masking in the spatial dimension: certain pixels are randomly masked across the entire image, and all band values of the spatially masked pixels are filled with 0; the masked hyperspectral remote sensing image is divided into unmasked pixel blocks and spatially masked pixel blocks.
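The masking recited in claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each pixel is treated as one pixel block, the masking ratios are hypothetical parameters, and the function name and return values are chosen for this example only.

```python
import numpy as np


def spatial_spectral_mask(image, spectral_ratio=0.3, spatial_ratio=0.5, seed=0):
    """Mask an (H, W, B) image in the spectral and spatial dimensions.

    Returns the masked image, the unmasked (visible) pixel blocks, the
    original values of the spatially masked pixels, and the spatial mask.
    """
    rng = np.random.default_rng(seed)
    h, w, b = image.shape
    masked = image.astype(float)  # astype returns a copy by default

    # Spectral masking: for each pixel, zero a random subset of bands.
    n_bands = int(round(spectral_ratio * b))
    band_idx = np.argsort(rng.random((h, w, b)), axis=-1)[..., :n_bands]
    spectral_mask = np.zeros((h, w, b), dtype=bool)
    np.put_along_axis(spectral_mask, band_idx, True, axis=-1)
    masked[spectral_mask] = 0.0

    # Spatial masking: zero every band of randomly chosen pixels.
    n_pixels = int(round(spatial_ratio * h * w))
    flat_mask = np.zeros(h * w, dtype=bool)
    flat_mask[rng.choice(h * w, size=n_pixels, replace=False)] = True
    spatial_mask = flat_mask.reshape(h, w)
    masked[spatial_mask] = 0.0

    # Split into visible blocks (encoder input) and hidden originals
    # (reconstruction targets). One pixel per block is an assumption.
    visible_blocks = masked[~spatial_mask]
    hidden_originals = image[spatial_mask].astype(float)
    return masked, visible_blocks, hidden_originals, spatial_mask
```

In this sketch the visible blocks still carry their spectral masking, so the encoder sees pixels with some bands zeroed, while the spatially masked pixels are withheld entirely and kept only as reconstruction targets.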
3. In the generative self-supervised hyperspectral image target detection method based on spatial spectral mask as described in claim 1, in step S3, the decoder consists of a single-layer Transformer encoder; the initial feature vector of the spatially masked pixel blocks is obtained through the mapping of the fully connected layer, embedded into the encoded feature vector, and input into the decoder for reconstruction to obtain the reconstructed image, where the terms of the reconstruction formula denote the decoder, depth (the number of layers in the Transformer encoder), the initial feature vector of the spatially masked pixel blocks, the embedding operation, the encoded feature vector, and the reconstructed image.
4. In the generative self-supervised hyperspectral image target detection method based on spatial spectral mask as described in claim 1, in step S3, the reconstruction loss is calculated using the mean square error loss function, where the terms of the loss formula denote the reconstruction loss; the number of samples in a batch, which is a positive integer; the original image pixel block of the i-th spatial mask in a batch; and the pixel block of the i-th spatial mask in a batch as reconstructed by the encoder.
5. In the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask as described in claim 1, in step S4, the trained encoder is transferred to the downstream task, and a detector consisting of a fully connected layer and a Sigmoid layer is added to jointly constitute the target detection network of the detection task, where the terms of the network formula denote the Sigmoid activation function, the fully connected layer, the encoded feature vector, and the probability value obtained after passing through the target detection network; the higher the probability value, the more likely the pixel is to be a target.
6. In the generative self-supervised hyperspectral image target detection method based on spatial spectral mask as described in claim 1, the process of obtaining the labeled samples in step S5 is as follows: the prior target spectrum is used as the target sample; K-means clustering is performed on the to-be-detected hyperspectral remote sensing image, and the cluster centers are used as background samples; the prior target spectrum and the background samples together constitute the final labeled samples, with the target samples labeled 1 and the background samples labeled 0; in the label formula, $y_i$ denotes the true label of the i-th sample, i is a positive integer, and the two count terms denote the number of prior target spectra and the number of background samples, respectively.
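The construction of labeled samples recited in claim 6 can be sketched with a small NumPy implementation of Lloyd's K-means. It is a sketch under stated assumptions rather than the claimed procedure: the number of clusters `k`, the iteration count, the random initialisation, and the function names are hypothetical, and a library K-means could be substituted.

```python
import numpy as np


def kmeans_centers(pixels, k, iters=20, seed=0):
    """Plain Lloyd's K-means on (N, B) spectra; returns (k, B) cluster centers."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers from k distinct random pixels.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center.
        dist = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = pixels[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def build_labeled_samples(image, target_spectra, k):
    """Stack prior target spectra (label 1) and cluster centers (label 0)."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    targets = np.asarray(target_spectra, dtype=float)
    background = kmeans_centers(pixels, k)
    samples = np.vstack([targets, background])
    labels = np.concatenate([np.ones(len(targets)), np.zeros(len(background))])
    return samples, labels
```

The pairwise distance array has shape (N, k, B), which is acceptable for a small illustration; for a full hyperspectral scene it would need to be computed in chunks.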
7. In the generative self-supervised hyperspectral image target detection method based on spatial spectral mask as described in claim 1, in step S5, in order to balance the number of prior target spectra and the number of background samples, a weighted binary cross-entropy loss function is introduced, where the weight is the ratio of the number of target samples to the total number of samples, $y_i$ denotes the true label of the i-th sample, and the remaining terms denote the probability value predicted by the target detection network for the i-th sample and the weighted binary cross-entropy loss.
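One possible form of the weighted binary cross-entropy loss is sketched below. The exact formula is not reproduced in the text above, so the weighting is an assumption: with w the ratio of target samples to all samples, the target term is weighted by (1 - w) and the background term by w, so that the rarer class contributes more per sample. The function name and the clipping constant `eps` are hypothetical.

```python
import numpy as np


def weighted_bce(y_true, y_prob, eps=1e-7):
    """Weighted binary cross-entropy averaged over the samples.

    Assumed form: -mean((1 - w) * y * log(p) + w * (1 - y) * log(1 - p)),
    where w is the fraction of samples labeled as target.
    """
    y = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log() never receives 0.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    w = y.mean()  # ratio of target samples to all samples
    per_sample = -((1.0 - w) * y * np.log(p) + w * (1.0 - y) * np.log(1.0 - p))
    return float(per_sample.mean())
```

Under this assumed weighting, with one target among four samples, missing the target is penalised more heavily than raising a single false alarm on a background sample, which is the balancing effect the claim describes.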
8. A storage device, characterized in that: the storage device stores instructions and data for implementing the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask as described in any one of claims 1 to 7.
9. A generative self-supervised hyperspectral image target detection device based on spatial spectral masking, characterized in that it comprises: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement the generative self-supervised hyperspectral image target detection method based on spatial spectrum mask as described in any one of claims 1 to 7.