Contrastive self-supervised hyperspectral image target detection method based on dual-path network
By employing a contrastive self-supervised learning method with a dual-path network, this approach addresses the issue of insufficient transferability of hyperspectral target detection methods across different datasets. It achieves efficient model training and improved generalization capabilities, making it suitable for hyperspectral image target detection tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF GEOSCIENCES (WUHAN)
- Filing Date
- 2024-01-15
- Publication Date
- 2026-06-23
Figure CN117911673B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of remote sensing image processing technology, and particularly relates to a contrastive self-supervised hyperspectral image target detection method based on a dual-path network.
Background Technology
[0002] Hyperspectral remote sensing images are three-dimensional cubes providing rich spectral and spatial information. They typically contain hundreds of spectral bands, offering abundant spectral information. With their extremely high spectral resolution, hyperspectral remote sensing images help distinguish subtle spectral differences between different ground features, providing a unique advantage for target detection tasks. Target detection aims to separate targets of interest from the background using very little prior target spectral information. Essentially, target detection can be viewed as a binary classification task, except that the number of targets is typically small, much smaller than the number of background objects. Therefore, target detection is considered a fundamental task and is widely used in various scenarios such as disaster monitoring, medicine, agriculture, and the military.
[0003] Since the advent of deep learning, numerous deep learning architectures have been introduced to learn the rich and deep features in hyperspectral remote sensing images. Many deep learning methods have successfully overcome the sample barrier between deep learning architectures and hyperspectral target detection. However, existing methods focus on generating large amounts of labeled samples to extract spectral features for training deep learning architectures for target detection tasks, with limited exploration of models that require fewer labeled samples but possess strong spatial-spectral feature extraction capabilities. Furthermore, deep learning-based hyperspectral target detection methods are typically trained and tested on the same dataset, so they learn only knowledge specific to one target detection task and cannot transfer effectively to other target detection tasks, resulting in poor model generalization ability and low computational efficiency. Therefore, a deep learning framework that can effectively transfer trained models to different target detection tasks is needed.
Summary of the Invention
[0004] To address the limited transferability of existing hyperspectral target detection methods, this invention provides a high-precision and efficient contrastive self-supervised hyperspectral image target detection method based on a dual-path network. The method addresses sample selection and deep learning model design from three aspects: feature extraction, model transferability, and time consumption. Self-supervised learning trains the model on a large amount of unlabeled data through contrastive tasks to learn general feature representations. Then, using a small number of samples, the target detection network is fine-tuned for each target detection task, allowing the pre-trained model to be transferred to various downstream tasks and thereby achieving target detection. This method significantly reduces the need for labeled samples, has high model generalization ability, and can better accomplish hyperspectral image target detection tasks.
[0005] A contrastive self-supervised hyperspectral image target detection method based on a dual-path network, specifically including:
[0006] S1: Acquire the pre-training hyperspectral remote sensing image $X_{pre}$, the hyperspectral remote sensing image $X$ to be detected, and the prior target spectrum $D$;
[0007] S2: Divide the pre-training hyperspectral remote sensing image $X_{pre}$ into blocks, perform data augmentation on each block in the spatial dimension, and combine each augmented block with the spectrum of its center pixel to obtain positive and negative sample pairs;
[0008] S3: Input the sample pairs obtained in step S2 into the designed dual-path network to learn the feature representation, obtain the encoded feature vector, and then perform dimensionality reduction on the encoded feature vector through the projection head. Use the normalized temperature cross-entropy loss function to constrain the training of the dual-path network to bring positive sample pairs closer and push negative sample pairs further apart. After multiple iterations, the pre-trained dual-path network is obtained.
[0009] S4: Transfer the pre-trained dual-path network from step S3 to the downstream task and add a detector consisting of a fully connected layer and a sigmoid layer to form the target detection network for the detection task.
[0010] S5: Based on the prior target spectrum D, the target sample T and background sample B obtained by constrained energy minimization and superpixel segmentation are used as labeled samples. Combined with the binary cross-entropy loss function, the target detection network constructed in step S4 is fine-tuned. After multiple iterations, the final target detection network is obtained, and the target is detected in the hyperspectral remote sensing image X.
[0011] Furthermore, step S2 specifically includes:
[0012] First, the border of the pre-training hyperspectral remote sensing image $X_{pre}$ is padded with zeros to obtain a new padded pre-training image $X'_{pre}$. Centered on each non-zero pixel of $X'_{pre}$, image blocks of size $w$ are extracted to obtain $X_{pre\_patch}$;
[0013] Then, data augmentation is applied to the image blocks $X_{pre\_patch}$: each image block is randomly flipped and randomly masked, with all pixel spectral values inside the mask set to 0, yielding two augmented views. Each augmented view $X_{aug}$, together with the center pixel spectrum $X_{spectra}$, constitutes one sample, and all samples together form $X_{pre\_sample}$;
[0014] Samples obtained by augmenting the same image block are positive samples of each other, and samples obtained by augmenting different image blocks are negative samples of each other.
[0015] Furthermore, in step S3, the dual-path network consists of a spectral path and a spatial path:
[0016] The spectral path focuses on the continuity between spectral dimensions and consists of six convolutional layers, each followed by a ReLU activation function. First, a single convolutional layer reduces the dimensionality of the input spectrum and removes redundant information. Then, four middle convolutional layers with long skip connections extract features from the dimensionality-reduced spectrum. Finally, a single convolutional layer reduces the spectral dimension to one so that the result can be concatenated with the output of the spatial path.
[0017] $h_1 = \sigma(W_1 * X_{spectra} + b_1)$
[0018] $h_l = \sigma(W_l * \mathrm{concat}(h_1, \ldots, h_{l-1}) + b_l), \quad l \in \{2, 3, 4, 5\}$
[0019] $h_6 = \sigma(W_6 * h_5 + b_6)$
[0020] where $h_1$, $h_l$, $h_{l-1}$, and $h_6$ denote the outputs of layers 1, $l$, $l-1$, and 6, respectively; $W_1$, $W_l$, $W_{l-1}$, and $W_6$ denote the convolution kernels of layers 1, $l$, $l-1$, and 6, respectively; $b_1$, $b_l$, $b_{l-1}$, and $b_6$ denote the biases of layers 1, $l$, $l-1$, and 6, respectively; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; $X_{spectra}$ denotes the spectrum of the center pixel of the image block; and concat(·) denotes the concatenation operation;
[0021] The spatial path focuses on the spatial correlation within image patches and consists of 5 convolutional layers and 1 global average pooling layer, with a ReLU activation function following each convolutional layer. First, a single convolutional layer reduces the spectral dimension of the input image patch to 1 so that the path focuses only on spatial features. Then, 4 convolutional layers with short skip connections perform spatial feature extraction. Finally, a global average pooling layer produces the output that is concatenated with the output of the spectral path.
[0022] $h'_1 = \sigma(W'_1 * X_{aug} + b'_1)$
[0023] $h'_2 = \sigma(W'_2 * h'_1 + b'_2)$
[0024] $h'_l = \sigma(W'_l * \mathrm{concat}(h'_{l-2}, h'_{l-1}) + b'_l), \quad l \in \{3, 4, 5\}$
[0025] $h'_6 = \mathrm{GAP}(h'_5)$
[0026] where $h'_1$, $h'_2$, $h'_l$, $h'_{l-1}$, $h'_5$, and $h'_6$ denote the outputs of layers 1, 2, $l$, $l-1$, 5, and 6, respectively; $W'_1$, $W'_2$, $W'_l$, and $W'_{l-1}$ denote the convolution kernels of layers 1, 2, $l$, and $l-1$, respectively; $b'_1$, $b'_2$, $b'_l$, and $b'_{l-1}$ denote the biases of layers 1, 2, $l$, and $l-1$, respectively; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; $X_{aug}$ denotes the augmented image block; concat(·) denotes the concatenation operation; and GAP(·) denotes the global average pooling operation.
[0027] The outputs of the two paths are concatenated and then normalized to obtain the final feature vector h:
[0028] h = BN(concat(h6,h'6))
[0029] Where h represents the output feature vector of the dual-path network, BN(·) represents the normalization layer, concat(·) represents the concatenation operation, h6 represents the output of the 6th layer of the spectral path, and h'6 represents the output of the 6th layer of the spatial path.
[0030] Furthermore, in step S3, the projection head consists of two fully connected layers and one ReLU activation function. The projection head reduces the dimensionality of the feature vector h and projects it into the contrastive loss space.
[0031] Z = FC2(σ(FC1(h)))
[0032] Where Z is the output vector of the projection head, FC1(·) and FC2(·) represent two different fully connected mapping functions, σ represents the ReLU activation function, and h represents the output feature vector of the dual-path network.
[0033] Furthermore, in step S3, the normalized temperature cross-entropy loss function is:
[0034] $\mathrm{sim}(z_i, z_j) = \dfrac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$
[0035] $L(i, j) = -\log \dfrac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$
[0036] $L_{NT\text{-}Xent} = \dfrac{1}{2N} \sum_{K=1}^{N} \left[ L(2K-1, 2K) + L(2K, 2K-1) \right]$
[0037] where sim(·) denotes the similarity function; $L(i, j)$ denotes the loss for a sample pair; $z_i$ and $z_j$ denote the projection head output vectors of a positive sample pair; $z_i$ and $z_k$ denote the projection head output vectors of a negative sample pair; $\tau$ denotes the temperature coefficient; $N$ denotes the batch size; $L_{NT\text{-}Xent}$ denotes the normalized temperature cross-entropy loss function; $K$ indexes the K-th sample pair in a batch; and $2K-1$ and $2K$ denote the two samples of the K-th pair, obtained by two different augmentations.
[0038] Furthermore, in step S4, the target detection network is:
[0039] $\hat{y} = \sigma'(\mathrm{FC}_3(h))$
[0040] where $\hat{y}$ denotes the probability value output by the target detection network, $\sigma'$ denotes the sigmoid function, $\mathrm{FC}_3(\cdot)$ denotes the fully connected layer, and $h$ denotes the output feature vector of the dual-path network.
[0041] Furthermore, in step S5, the process of obtaining labeled samples is as follows:
[0042] The constrained energy minimization method is used to perform initial detection on the image to be detected. The initial detection results are sorted in descending order, and the first M pixels and the surrounding image block of size w are taken as the target sample T.
[0043] Using SLIC segmentation, centroid pixels and their corresponding image blocks are selected from superpixels as background samples B, and centroid pixels and their corresponding image blocks that are similar to the spectrum of the prior target are removed from B using spectral angular distance.
[0044] The final labeled sample set $S = (T \cup B)$ is composed of the target samples and the background samples. Target samples are labeled 1 and background samples are labeled 0, that is, $y_i = 1$ for $1 \le i \le M$ and $y_i = 0$ for $M < i \le M + C$, where $y_i$ denotes the true label of the i-th sample, $i$ is a positive integer, $M$ denotes the number of target samples, and $C$ denotes the number of background samples.
[0045] Furthermore, in step S5, the fine-tuning process of the target detection network is as follows:
[0046] In the dual-path network, the layers related to the number of spectral bands (i.e., the 6th layer of the spectral path and the 1st layer of the spatial path) are retrained through random initialization, and the detector parameters also need to be randomly initialized:
[0047] $\tilde{h}_6 = \sigma(\tilde{W}_6 * h_5 + \tilde{b}_6)$
[0048] $\tilde{h}'_1 = \sigma(\tilde{W}'_1 * X_{aug} + \tilde{b}'_1)$
[0049] where $\tilde{h}_6$ and $\tilde{h}'_1$ denote the outputs after fine-tuning of the 6th layer of the spectral path and the 1st layer of the spatial path, respectively; $h_5$ denotes the output of the 5th layer of the spectral path; $X_{aug}$ denotes the augmented image block; $\tilde{W}_6$ and $\tilde{W}'_1$ denote the randomly initialized convolution kernels of the 6th layer of the spectral path and the 1st layer of the spatial path, respectively; $\tilde{b}_6$ and $\tilde{b}'_1$ denote the corresponding randomly initialized biases; $\sigma$ denotes the ReLU activation function; and $*$ denotes the convolution operation.
[0050] Next, because the dual-path network has a hierarchical structure and the low-level features obtained from its shallow layers are general, the parameters of the first 4 layers of the spectral path and of the 2nd, 3rd, and 4th layers of the spatial path are frozen and transferred directly to the downstream task.
[0051] Because the number of hyperspectral remote sensing images available for pre-training is limited, the pre-trained model cannot capture all ground features, and the deep features it extracts are usually incomplete. Therefore, the last few layers of the network (i.e., the 5th layer of the spectral path, the 5th layer of the spatial path, the global average pooling layer, and the normalization layer) continue to update their parameters on the images from the downstream task.
[0052] $\bar{h}_5 = \sigma(\bar{W}_5 * \mathrm{concat}(h_1, \ldots, h_4) + \bar{b}_5)$
[0053] $\bar{h}'_5 = \sigma(\bar{W}'_5 * \mathrm{concat}(h'_3, h'_4) + \bar{b}'_5)$
[0054] where $\bar{h}_5$ and $\bar{h}'_5$ denote the outputs after fine-tuning of the 5th layer of the spectral path and the 5th layer of the spatial path, respectively; $h_1, \ldots, h_4$ denote the outputs of the first four layers of the spectral path; $h'_3$ and $h'_4$ denote the outputs of the 3rd and 4th layers of the spatial path; $\bar{W}_5$ and $\bar{W}'_5$ denote the convolution kernels of the 5th layer of the spectral path and the 5th layer of the spatial path, respectively, which continue training from their pre-trained initialization; $\bar{b}_5$ and $\bar{b}'_5$ denote the corresponding biases, which likewise continue training from their pre-trained initialization; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; and concat(·) denotes the concatenation operation.
[0055] Furthermore, in step S5, the formula for the binary cross-entropy loss function is:
[0056] $L_{BCE} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
[0057] where $L_{BCE}$ denotes the binary cross-entropy loss function, $N$ denotes the batch size, $y_i$ denotes the true label of the i-th sample, and $\hat{y}_i$ denotes the probability value predicted by the target detection network for the i-th sample.
[0058] A contrastive self-supervised hyperspectral image target detection device based on a dual-path network includes: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement a contrastive self-supervised hyperspectral image target detection method based on a dual-path network.
[0059] The beneficial effects of the technical solution provided by this invention are as follows: This invention performs feature encoding on hyperspectral remote sensing data through a contrastive task and a dual-path network, learns a general feature representation without labeled samples, constructs a target detection model from the pre-trained dual-path network, a fully connected layer, and a sigmoid layer, and fine-tunes the target detection model with a small number of training samples to achieve rapid model training and effectively transfer the pre-trained model to different target detection tasks. Ultimately, it can significantly reduce the need for labeled samples, has high model generalization ability, and significantly surpasses current deep learning models in inference speed, thus better completing hyperspectral image target detection tasks.
Description of the Drawings
[0060] Figure 1 is a flowchart of a contrastive self-supervised hyperspectral image target detection method based on a dual-path network, according to an embodiment of the present invention.
[0061] Figure 2 is a framework diagram of the contrastive self-supervised hyperspectral image target detection method based on a dual-path network in this embodiment of the invention.
[0062] Figure 3 is a schematic diagram of the hardware device in operation in an embodiment of the present invention.
Detailed Implementation
[0063] To provide a clearer understanding of the technical features, objectives, and effects of the present invention, specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0064] This invention proposes a contrastive self-supervised hyperspectral image target detection method based on a dual-path network. A hyperspectral image is a data cube that can be represented by a tensor, in which the vector at each spatial position holds the pixel radiance values of all bands. The method first divides the pre-training hyperspectral remote sensing image into image patches and applies data augmentation to create positive and negative samples. The designed dual-path network then extracts features from these samples, and a projection head maps the features into a contrastive loss space, where positive samples are pulled closer and negative samples are pushed apart. The pre-trained dual-path network is then transferred to a downstream target detection task, and the target detection model is fine-tuned with a small number of samples, so that the pre-trained model transfers effectively to different target detection tasks.
[0065] This invention is implemented in Python with the deep learning framework PyTorch, building on Python remote sensing image read/write functions. It calls the data processing libraries NumPy, SciPy, and Spectral: given the filename of the remote sensing image, the image is read into a tensor in which each element is the pixel radiance value of the corresponding band. The Python remote sensing image read/write functions are well-known technologies in this field.
[0066] As shown in Figures 1 and 2, a contrastive self-supervised hyperspectral image target detection method based on a dual-path network includes the following steps:
[0067] S1: Acquire the pre-training hyperspectral remote sensing image $X_{pre} \in \mathbb{R}^{H' \times W' \times b'}$, the hyperspectral remote sensing image to be detected $X \in \mathbb{R}^{H \times W \times b}$, and the prior target spectrum $D$, where $H'$, $W'$, and $b'$ denote the length, width, and number of bands of the pre-training hyperspectral remote sensing image, respectively; $H$, $W$, and $b$ denote the length, width, and number of bands of the hyperspectral remote sensing image to be detected, respectively; and $L$ denotes the number of prior spectra in $D$.
[0068] S2: Divide the pre-training hyperspectral remote sensing image $X_{pre}$ into blocks, augment each block in the spatial dimension to obtain the augmented image block $X_{aug}$, and combine it with its center pixel spectrum $X_{spectra}$ to obtain positive and negative sample pairs;
[0069] The specific operation of step S2 is as follows:
[0070] The first key step in contrastive self-supervised pre-training is obtaining positive and negative samples through image augmentation of hyperspectral remote sensing images. To utilize the spatial and spectral information of all pixels, edge padding is performed on the original image to obtain a new padded pre-training image $X'_{pre}$. Centered on each non-zero pixel of $X'_{pre}$, image blocks of size $w$ are extracted to obtain $X_{pre\_patch}$. In this embodiment, $w$ is set to 5.
[0071] Then, data augmentation is applied to the image blocks $X_{pre\_patch}$: each image block is randomly flipped (vertically and horizontally) and randomly masked (multi-pixel mask and block mask), with all pixel spectral values inside the mask set to 0, yielding two augmented views. Each augmented view $X_{aug}$, together with the center pixel spectrum $X_{spectra}$, constitutes one sample, and all samples together are denoted $X_{pre\_sample}$. Samples obtained by augmenting the same image block are positive samples of each other, while samples obtained by augmenting different image blocks are negative samples of each other.
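The patch extraction and augmentation of step S2 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names, the 0.5 flip probability, and the `mask_prob` value are assumptions chosen for illustration, and only the multi-pixel mask is shown (the block mask is omitted).

```python
import numpy as np

def extract_patches(image, w=5):
    """Zero-pad an (H, W, B) cube and cut one w x w patch around every original pixel."""
    r = w // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="constant")
    H, W, _ = image.shape
    return np.stack([padded[i:i + w, j:j + w, :]
                     for i in range(H) for j in range(W)])  # (H*W, w, w, B)

def augment(patch, rng, mask_prob=0.2):
    """Random vertical/horizontal flip, then a random multi-pixel mask (assumed rates)."""
    out = patch
    if rng.random() < 0.5:
        out = out[::-1, :, :]          # vertical flip
    if rng.random() < 0.5:
        out = out[:, ::-1, :]          # horizontal flip
    out = out.copy()
    mask = rng.random(out.shape[:2]) < mask_prob
    out[mask] = 0.0                    # zero every band of each masked pixel
    return out

def make_positive_pair(patch, rng):
    """Two augmented views of one patch, each paired with the centre-pixel spectrum."""
    c = patch.shape[0] // 2
    spectrum = patch[c, c, :].copy()   # taken before masking
    return (augment(patch, rng), spectrum), (augment(patch, rng), spectrum)
```

Views produced from the same patch form a positive pair; views from any two different patches in a batch act as negatives.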
[0072] S3: Input the sample pairs obtained in step S2 into the designed dual-path network to learn the feature representation, obtain the encoded feature vector, and then reduce the dimensionality of the encoded feature vector through the projection head. Use the normalized temperature cross-entropy loss function to constrain the network training to bring positive sample pairs closer and push negative sample pairs further apart. After multiple iterations, a pre-trained feature extraction model is obtained.
[0073] To fully extract the spectral and spatial features of hyperspectral remote sensing images, a dual-path network was designed for feature extraction. The dual-path network consists of a spectral path and a spatial path.
[0074] The spectral path focuses on the continuity between spectral dimensions and consists of six convolutional layers, each followed by a ReLU activation function. First, a single convolutional layer is used to reduce the dimensionality of the input spectrum and remove redundant information. Then, four convolutional layers and long skip connections are used to extract features from the dimensionality-reduced spectrum. Finally, a single convolutional layer reduces the spectral dimension to one dimension to concatenate the output of the spatial path.
[0075] $h_1 = \sigma(W_1 * X_{spectra} + b_1)$
[0076] $h_l = \sigma(W_l * \mathrm{concat}(h_1, \ldots, h_{l-1}) + b_l), \quad l \in \{2, 3, 4, 5\}$
[0077] $h_6 = \sigma(W_6 * h_5 + b_6)$
[0078] where $h_1$, $h_l$, $h_{l-1}$, and $h_6$ denote the outputs of layers 1, $l$, $l-1$, and 6, respectively; $W_1$, $W_l$, $W_{l-1}$, and $W_6$ denote the convolution kernels of layers 1, $l$, $l-1$, and 6, respectively; $b_1$, $b_l$, $b_{l-1}$, and $b_6$ denote the biases of layers 1, $l$, $l-1$, and 6, respectively; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; $X_{spectra}$ denotes the spectrum of the center pixel of the image block; and concat(·) denotes the concatenation operation;
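The layer wiring of the spectral path can be sketched in NumPy as follows. The text does not give kernel sizes, strides, or channel counts, so the values here (kernel size 3, stride 2 in layer 1, 4 channels, an 8-dimensional output, and a final kernel spanning the remaining spectral length) are assumptions used only to make the long skip connections concrete.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, W, b, stride=1, pad=0):
    """x: (C_in, L), W: (C_out, C_in, k), b: (C_out,). Returns (C_out, L_out)."""
    x = np.pad(x, ((0, 0), (pad, pad)))
    k = W.shape[2]
    L_out = (x.shape[1] - k) // stride + 1
    out = np.empty((W.shape[0], L_out))
    for t in range(L_out):
        window = x[:, t * stride:t * stride + k]
        out[:, t] = np.tensordot(W, window, axes=([1, 2], [0, 1])) + b
    return out

def init_spectral_params(n_bands, c=4, k=3, out_dim=8, seed=0):
    """Hypothetical shapes: layer l (2..5) sees the outputs of all earlier layers."""
    rng = np.random.default_rng(seed)
    L1 = (n_bands - k) // 2 + 1                       # length after strided layer 1
    shapes = [(c, 1, k)] + [(c, c * l, k) for l in range(1, 5)] + [(out_dim, c, L1)]
    return [(rng.normal(0.0, 0.1, s), np.zeros(s[0])) for s in shapes]

def spectral_path(x_spectra, params):
    x = x_spectra[None, :]                            # (1, n_bands)
    W, b = params[0]
    outs = [relu(conv1d(x, W, b, stride=2))]          # h1: reduce spectral length
    for W, b in params[1:5]:                          # h2..h5: long skip connections
        inp = np.concatenate(outs, axis=0)            # concat(h1, ..., h_{l-1})
        outs.append(relu(conv1d(inp, W, b, pad=W.shape[2] // 2)))
    W, b = params[5]
    h6 = relu(conv1d(outs[-1], W, b))                 # kernel spans the full length
    return h6[:, 0]                                   # one value per output channel
```

Because the last kernel spans the whole reduced spectrum, its shape depends on the number of bands, which is consistent with layer 6 being the band-dependent layer of this path.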
[0079] The spatial path focuses on the spatial correlation within image patches and consists of five convolutional layers and one global average pooling layer, with a ReLU activation function following each convolutional layer. First, a single convolutional layer reduces the spectral dimension of the input image patch to one so that the path focuses only on spatial features. Then, four convolutional layers with short skip connections perform spatial feature extraction. Finally, a global average pooling layer produces the output that is concatenated with the output of the spectral path.
[0080] $h'_1 = \sigma(W'_1 * X_{aug} + b'_1)$
[0081] $h'_2 = \sigma(W'_2 * h'_1 + b'_2)$
[0082] $h'_l = \sigma(W'_l * \mathrm{concat}(h'_{l-2}, h'_{l-1}) + b'_l), \quad l \in \{3, 4, 5\}$
[0083] $h'_6 = \mathrm{GAP}(h'_5)$
[0084] where $h'_1$, $h'_2$, $h'_l$, $h'_{l-1}$, $h'_5$, and $h'_6$ denote the outputs of layers 1, 2, $l$, $l-1$, 5, and 6, respectively; $W'_1$, $W'_2$, $W'_l$, and $W'_{l-1}$ denote the convolution kernels of layers 1, 2, $l$, and $l-1$, respectively; $b'_1$, $b'_2$, $b'_l$, and $b'_{l-1}$ denote the biases of layers 1, 2, $l$, and $l-1$, respectively; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; $X_{aug}$ denotes the augmented image block; concat(·) denotes the concatenation operation; and GAP(·) denotes the global average pooling operation.
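A comparable NumPy sketch of the spatial path is given below. As before, it only illustrates the wiring: the 1×1 kernel in layer 1, the 3×3 kernels elsewhere, and the channel count of 4 are assumptions, since the text does not specify them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_same(x, W, b):
    """x: (C_in, H, W), W: (C_out, C_in, k, k) with odd k; output keeps H and W."""
    k = W.shape[2]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, Wd = x.shape[1:]
    out = np.empty((W.shape[0], H, Wd))
    for i in range(H):
        for j in range(Wd):
            out[:, i, j] = np.tensordot(W, xp[:, i:i + k, j:j + k], axes=3) + b
    return out

def init_spatial_params(n_bands, c=4, seed=0):
    """Hypothetical shapes; layer 1 collapses the n_bands channels to a single map."""
    rng = np.random.default_rng(seed)
    shapes = [(1, n_bands, 1, 1), (c, 1, 3, 3), (c, 1 + c, 3, 3),
              (c, 2 * c, 3, 3), (c, 2 * c, 3, 3)]
    return [(rng.normal(0.0, 0.1, s), np.zeros(s[0])) for s in shapes]

def spatial_path(x_aug, params):
    x = np.transpose(x_aug, (2, 0, 1))                 # (B, w, w)
    h = [relu(conv2d_same(x, *params[0]))]             # h'1: spectral dimension -> 1
    h.append(relu(conv2d_same(h[0], *params[1])))      # h'2
    for W, b in params[2:5]:                           # h'3..h'5: short skip connections
        inp = np.concatenate([h[-2], h[-1]], axis=0)   # concat(h'_{l-2}, h'_{l-1})
        h.append(relu(conv2d_same(inp, W, b)))
    return h[-1].mean(axis=(1, 2))                     # h'6 = GAP(h'5)
```

Only the first kernel depends on the number of bands, which matches the later statement that layer 1 of the spatial path is the band-dependent layer of this path.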
[0085] The outputs of the two paths are concatenated and then normalized to obtain the final feature vector h:
[0086] h = BN(concat(h6,h'6))
[0087] Where h represents the output feature vector of the dual-path network, BN(·) represents the normalization layer, concat(·) represents the concatenation operation, h6 represents the output of the 6th layer of the spectral path, and h'6 represents the output of the 6th layer of the spatial path.
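The fusion step can be sketched as a concatenation followed by a per-feature normalization over the batch. The text only says "normalization layer", so treating BN(·) as batch normalization without learnable scale and shift is an assumption, as is the `eps` value.

```python
import numpy as np

def fuse(h6_batch, h6_prime_batch, eps=1e-5):
    """Concatenate spectral and spatial features, then normalize each feature over the batch.

    h6_batch: (N, d_spectral), h6_prime_batch: (N, d_spatial).
    """
    h = np.concatenate([h6_batch, h6_prime_batch], axis=1)
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)
```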
[0088] The projection head consists of two fully connected layers and a ReLU activation function:
[0089] Z = FC2(σ(FC1(h)))
[0090] Where Z is the output vector of the projection head, FC1(·) and FC2(·) represent two different fully connected mapping functions, σ represents the ReLU activation function, and h represents the output feature vector of the dual-path network.
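As a small illustration of the projection head, the sketch below applies two fully connected layers with a ReLU in between. The weight shapes are hypothetical; the text does not state the hidden or output dimensions.

```python
import numpy as np

def projection_head(h, W1, b1, W2, b2):
    """Z = FC2(ReLU(FC1(h))) for a batch h of shape (N, d_in)."""
    hidden = np.maximum(h @ W1 + b1, 0.0)   # FC1 followed by ReLU
    return hidden @ W2 + b2                 # FC2 maps into the contrastive space
```

For example, with a 12-dimensional feature vector, `W1` of shape (12, 16) and `W2` of shape (16, 6) would yield a 6-dimensional `Z` per sample.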
[0091] The normalized temperature cross-entropy loss function is used to constrain the pre-trained network, bringing positive samples closer together and pushing negative samples further away. The formula for the normalized temperature cross-entropy loss function is:
[0092] $\mathrm{sim}(z_i, z_j) = \dfrac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$
[0093] $L(i, j) = -\log \dfrac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$
[0094] $L_{NT\text{-}Xent} = \dfrac{1}{2N} \sum_{K=1}^{N} \left[ L(2K-1, 2K) + L(2K, 2K-1) \right]$
[0095] where sim(·) denotes the similarity function; $L(i, j)$ denotes the loss for a sample pair; $z_i$ and $z_j$ denote the projection head output vectors of a positive sample pair; $z_i$ and $z_k$ denote the projection head output vectors of a negative sample pair; $\tau$ denotes the temperature coefficient; $N$ denotes the batch size; $L_{NT\text{-}Xent}$ denotes the normalized temperature cross-entropy loss function; $K$ indexes the K-th sample pair in a batch; and $2K-1$ and $2K$ denote the two samples of the K-th pair, obtained by two different augmentations.
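A minimal NumPy sketch of the normalized temperature cross-entropy loss follows. It assumes cosine similarity for sim(·) and a temperature of 0.5, neither of which is stated in the text, and it expects the two views of each pair in adjacent rows.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """z: (2N, d); rows 2k and 2k+1 (0-indexed) are the two views of pair k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity (assumed)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude k == i from the sum
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                  # 0<->1, 2<->3, ...
    losses = log_denominator - sim[idx, partner]       # L(i, j) for each row i
    return float(losses.mean())                        # average over all 2N rows

# Matching views give a much lower loss than mismatched ones.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
shuffled = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
assert nt_xent(aligned) < nt_xent(shuffled)
```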
[0096] S4: Transfer the pre-trained dual-path network from step S3 to the downstream task, and add a detector consisting of a fully connected layer and a sigmoid layer to form the target detection network for the detection task:
[0097] $\hat{y} = \sigma'(\mathrm{FC}_3(h))$
[0098] where $\hat{y}$ denotes the probability value output by the target detection network, $\sigma'$ denotes the sigmoid function, $\mathrm{FC}_3(\cdot)$ denotes the fully connected layer, and $h$ denotes the output feature vector of the dual-path network.
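The detector added in step S4 can be sketched as a single fully connected layer followed by a sigmoid. The weight vector `W3` and bias `b3` are hypothetical names for the parameters of that layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect(h, W3, b3):
    """Map dual-path features h of shape (N, d) to one target probability per sample."""
    return sigmoid(h @ W3 + b3)
```

With all-zero weights and bias every sample scores 0.5, so any deviation from 0.5 comes from what the layer learns during fine-tuning.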
[0099] S5: Based on the prior target spectrum D, target sample T and background sample B obtained by constrained energy minimization and superpixel segmentation (SLIC) are used as labeled samples. Combined with the binary cross-entropy loss function, the target detection network constructed in step S4 is fine-tuned. After multiple iterations, the final target detection network is obtained. The hyperspectral remote sensing image X to be detected is divided into image blocks, and the target is detected by combining the central pixel spectrum of the image block.
[0100] To train the target detection network, labeled samples need to be obtained. The process is as follows:
[0101] The constrained energy minimization method is used to perform initial detection on the image to be detected. The initial detection results are sorted in descending order, and the first M pixels and the surrounding image block of size w are taken as the target sample T.
[0102] Using SLIC segmentation, centroid pixels and their corresponding image blocks are selected from superpixels as background samples B. Considering that centroid pixels may be target pixels, centroid pixels and their corresponding image blocks that are similar to the spectrum of the prior target are removed from B using spectral angular distance.
[0103] The target samples and background samples obtained in this way together constitute the final labeled samples $S = (T \cup B)$. Target samples are labeled 1 and background samples are labeled 0, that is, $y_i = 1$ for $1 \le i \le M$ and $y_i = 0$ for $M < i \le M + C$, where $y_i$ denotes the true label of the i-th sample, $i$ is a positive integer, $M$ denotes the number of target samples, and $C$ denotes the number of background samples.
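The labeled-sample construction can be sketched in NumPy under a few stated assumptions: pixels are flattened to an (n_pixels, bands) array, a single prior spectrum `d` is used, the superpixel centroid indices are supplied by a separate SLIC step (not shown), and `min_angle` is a hypothetical threshold for the spectral angular distance.

```python
import numpy as np

def cem_scores(X, d):
    """Constrained energy minimization: w = R^-1 d / (d^T R^-1 d), score = w^T x."""
    R = X.T @ X / X.shape[0]                 # sample autocorrelation matrix
    Rinv_d = np.linalg.solve(R, d)
    w = Rinv_d / (d @ Rinv_d)
    return X @ w

def spectral_angle(S, d):
    """Spectral angle (radians) between each row of S and the prior spectrum d."""
    cos = S @ d / (np.linalg.norm(S, axis=1) * np.linalg.norm(d))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def build_labeled_samples(X, d, centroid_idx, M, min_angle=0.1):
    """Return pixel indices and labels: top-M CEM pixels (1), filtered centroids (0)."""
    scores = cem_scores(X, d)
    target_idx = np.argsort(scores)[::-1][:M]          # descending order, first M
    centroid_idx = np.asarray(centroid_idx)
    keep = spectral_angle(X[centroid_idx], d) > min_angle
    background_idx = centroid_idx[keep]                # drop target-like centroids
    idx = np.concatenate([target_idx, background_idx])
    y = np.concatenate([np.ones(len(target_idx)), np.zeros(len(background_idx))])
    return idx, y
```

The image block of size w around each returned index would then be cut out in the same way as during pre-training.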
[0104] The fine-tuning process for the target detection network is as follows:
[0105] The fine-tuning strategy consists of three parts: freezing some layer parameters, randomly initializing and retraining some layer parameters, and continuing to train some layer parameters using the pre-trained initialization. The images from the pre-training task and the downstream task may not have been collected by the same sensor, leading to a mismatch in spectral dimensions. Therefore, layers in the dual-path network related to the number of spectral bands (i.e., layer 6 of the spectral path and layer 1 of the spatial path) are retrained through random initialization. Since the detector is primarily used for the downstream task, its parameters also need to be randomly initialized.
[0106] $\tilde{h}_6 = \sigma(\tilde{W}_6 * h_5 + \tilde{b}_6)$
[0107] $\tilde{h}'_1 = \sigma(\tilde{W}'_1 * X_{aug} + \tilde{b}'_1)$
[0108] where $\tilde{h}_6$ and $\tilde{h}'_1$ denote the outputs after fine-tuning of the 6th layer of the spectral path and the 1st layer of the spatial path, respectively; $h_5$ denotes the output of the 5th layer of the spectral path; $X_{aug}$ denotes the augmented image block; $\tilde{W}_6$ and $\tilde{W}'_1$ denote the randomly initialized convolution kernels of the 6th layer of the spectral path and the 1st layer of the spatial path, respectively; $\tilde{b}_6$ and $\tilde{b}'_1$ denote the corresponding randomly initialized biases; $\sigma$ denotes the ReLU activation function; and $*$ denotes the convolution operation.
[0109] Next, because the dual-path network has a hierarchical structure and the low-level features obtained from its shallow layers are general, the parameters of the first 4 layers of the spectral path and of the 2nd, 3rd, and 4th layers of the spatial path are frozen and transferred directly to the downstream task.
[0110] Because the number of hyperspectral remote sensing images available for pre-training is limited, the pre-trained model cannot capture all ground features, and the deep features it extracts are usually incomplete. Therefore, the last few layers of the network (i.e., the 5th layer of the spectral path, the 5th layer of the spatial path, the global average pooling layer, and the normalization layer) continue to update their parameters on the images from the downstream task.
[0111] $\bar{h}_5 = \sigma(\bar{W}_5 * \mathrm{concat}(h_1, \ldots, h_4) + \bar{b}_5)$
[0112] $\bar{h}'_5 = \sigma(\bar{W}'_5 * \mathrm{concat}(h'_3, h'_4) + \bar{b}'_5)$
[0113] where $\bar{h}_5$ and $\bar{h}'_5$ denote the outputs after fine-tuning of the 5th layer of the spectral path and the 5th layer of the spatial path, respectively; $h_1, \ldots, h_4$ denote the outputs of the first four layers of the spectral path; $h'_3$ and $h'_4$ denote the outputs of the 3rd and 4th layers of the spatial path; $\bar{W}_5$ and $\bar{W}'_5$ denote the convolution kernels of the 5th layer of the spectral path and the 5th layer of the spatial path, respectively, which continue training from their pre-trained initialization; $\bar{b}_5$ and $\bar{b}'_5$ denote the corresponding biases, which likewise continue training from their pre-trained initialization; $\sigma$ denotes the ReLU activation function; $*$ denotes the convolution operation; and concat(·) denotes the concatenation operation.
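The three-part fine-tuning strategy can be summarized as a per-layer plan. The sketch below uses plain Python with hypothetical layer names and flat parameter lists; in a PyTorch implementation the same plan would typically be realized by re-creating the re-initialized layers and disabling gradients on the frozen ones.

```python
import random

# Hypothetical layer names; the modes follow the strategy described in the text.
FINE_TUNE_PLAN = {
    "spectral.conv1": "freeze", "spectral.conv2": "freeze",
    "spectral.conv3": "freeze", "spectral.conv4": "freeze",
    "spectral.conv5": "continue",   # keeps pre-trained weights, stays trainable
    "spectral.conv6": "reinit",     # band-dependent: random init, retrained
    "spatial.conv1": "reinit",      # band-dependent: random init, retrained
    "spatial.conv2": "freeze", "spatial.conv3": "freeze", "spatial.conv4": "freeze",
    "spatial.conv5": "continue",
    "fusion.norm": "continue",
    "detector.fc3": "reinit",       # new detector head for the downstream task
}

def prepare_for_fine_tuning(pretrained, plan, reinit_sizes, seed=0):
    """Return (parameters, names of trainable layers) for the downstream task."""
    rng = random.Random(seed)
    params, trainable = {}, set()
    for name, mode in plan.items():
        if mode == "reinit":
            # New size may differ from pre-training (different number of bands).
            params[name] = [rng.gauss(0.0, 0.01) for _ in range(reinit_sizes[name])]
            trainable.add(name)
        else:
            params[name] = list(pretrained[name])      # copy pre-trained weights
            if mode == "continue":
                trainable.add(name)
    return params, trainable
```

The global average pooling layer has no parameters, so it does not appear in the plan.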
[0114] The target detection network is constrained using the binary cross-entropy loss function, and its formula is as follows:
[0115] $L_{BCE} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
[0116] where $L_{BCE}$ denotes the binary cross-entropy loss function, $N$ denotes the batch size, $y_i$ denotes the true label of the i-th sample, and $\hat{y}_i$ denotes the probability value predicted by the target detection network for the i-th sample.
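For reference, the binary cross-entropy loss can be computed as in the following NumPy sketch. The clipping constant `eps` is an implementation assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Confident correct predictions give a lower loss than confident wrong ones.
assert bce_loss([1, 0], [0.9, 0.1]) < bce_loss([1, 0], [0.1, 0.9])
```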
[0117] The following comparative experiments will verify the beneficial effects of the present invention.
[0118] This embodiment uses the San Diego dataset as the pre-training dataset for the model, and four hyperspectral datasets (the San Diego subset, Bay Park, Wuda Rocks, and Grand Island) as detection datasets to verify the model's effectiveness. The San Diego dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor, with a wavelength range of 400 to 2500 nm. After removing water vapor absorption bands and noisy bands, 189 bands remain. The image is 400×400 pixels in size and covers San Diego Airport in California, USA. Because this image contains diverse types of ground objects, such as rooftops, bare soil, grass, roads, airport runways, and shadows, it is used for pre-training. A 100×100 pixel subset cropped from the San Diego dataset forms the first downstream task dataset, the San Diego subset; the targets of interest are three aircraft comprising 134 pixels in total. The Bay Park dataset was acquired by a Compact Airborne Spectrographic Imager (CASI)-1500 sensor and includes 64 bands with a wavelength range of 367.7 to 1043.4 nm. The image is 325×220 pixels in size and covers part of the Bay Park campus of the University of Southern Mississippi (Hattiesburg), located in Long Beach, Mississippi, USA; the targets of interest are several slabs comprising 269 pixels in total. The Wuda Rocks dataset, 400×400 pixels in size, was acquired using the NanoVIP Imaging Ultrafast Camera (Nuance Cri) with reflectionless imaging and includes 46 bands with wavelengths ranging from 650 to 1100 nm. It covers a lawn at Wuhan University in Wuhan, Hubei Province, China; the targets of interest are ten rocks comprising 1254 pixels in total. The Grand Island dataset, 150×85 pixels in size, was acquired by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor; after noise removal, 162 bands remain. It covers a suburban residential area of Fort Hood, Texas, USA; the targets of interest are vehicles comprising 21 pixels in total.
[0119] To demonstrate its effectiveness, the method of the present invention is compared with the classical constrained energy minimization (Method 1), the adaptive cosine estimator (Method 2), the decomposition model with a learned background dictionary (Method 3), the tree-structured encoding model (Method 4), the target-suppressing background learning method (Method 5), the deep convolutional neural network (Method 6), and the two-stream convolutional network (Method 7).
[0120] Target detection evaluation metrics: Two quantitative evaluation metrics are used, namely the area under the receiver operating characteristic curve (AUC) and the running time, as follows:
[0121] (1) AUC value:
[0122] AUC is an authoritative evaluation metric for target detection problems. Different thresholds are applied to the detection result to obtain pairs of detection probability (Pd) and false alarm rate (Pf); plotting the false alarm rate on the x-axis and the detection probability on the y-axis gives the receiver operating characteristic curve, and the area under this curve is AUC(Pf,Pd). This value measures the overall detection performance; a higher value indicates better overall detection results. It is calculated as follows:
[0123] $$P_d = \frac{N_d}{N_t}$$
[0124] $$P_f = \frac{N_f}{N_{all}}$$
[0125] $$AUC_{(P_f,P_d)} = \int P_d \,\mathrm{d}P_f$$
[0126] Where $N_d$ is the number of target pixels that were correctly detected, $N_t$ is the actual number of target pixels, $N_f$ is the number of pixels falsely detected as targets, and $N_{all}$ is the total number of pixels.
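The threshold sweep behind AUC(Pf,Pd) can be sketched in NumPy as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the function names are hypothetical, Pf is normalized by the total number of pixels as defined above, and the area is accumulated with the trapezoidal rule. With this normalization the curve ends at Pf = 1 − N_t/N_all, which is close to 1 when targets are rare.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a threshold over the detection scores and return (Pf, Pd) arrays."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    n_t = labels.sum()        # actual number of target pixels
    n_all = labels.size       # total number of pixels
    p_f, p_d = [0.0], [0.0]   # start with a threshold above every score
    for thr in np.unique(scores)[::-1]:       # thresholds in descending order
        detected = scores >= thr
        n_d = np.sum(detected & labels)       # correctly detected target pixels
        n_f = np.sum(detected & ~labels)      # background pixels detected as target
        p_d.append(n_d / n_t)
        p_f.append(n_f / n_all)
    return np.array(p_f), np.array(p_d)

def auc_pf_pd(scores, labels):
    """Area under the (Pf, Pd) curve, integrated with the trapezoidal rule."""
    p_f, p_d = roc_points(scores, labels)
    return float(np.sum(np.diff(p_f) * (p_d[1:] + p_d[:-1]) / 2.0))
```

For a detection map `scores` of shape (H, W) and a ground-truth mask of the same shape, `auc_pf_pd(scores, mask)` returns a single value that increases as targets are ranked above background pixels.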
[0127] (2) Running time:
[0128] Model runtime is a key metric for evaluating computational efficiency and speed. In real-time applications, an efficient runtime is especially important. The shorter the runtime, the more efficient the model.
[0129] In this experiment, the AUC(Pf,Pd) value is used to evaluate the detection capability of Methods 1-7 and the method of the present invention, and the running time is used to evaluate the computational efficiency of Methods 5-7 and the method of the present invention. The AUC(Pf,Pd) values are shown in Table 1, and the running times are shown in Table 2. To reduce the effect of randomness, every algorithm was run 5 times consecutively, and the median value was taken as the final result.
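The "run 5 times and take the median" protocol can be expressed with the Python standard library. This is only a sketch under assumptions: `fn` stands for any callable that trains or tests a detector, and both helper names are hypothetical.

```python
import statistics
import time

def median_runtime(fn, runs=5):
    """Run fn() `runs` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def median_metric(values):
    """Median of a metric (for example AUC) collected over repeated runs."""
    return statistics.median(values)
```

The median is less sensitive than the mean to a single unusually slow or unusually poor run, which is a common reason for reporting it.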
[0130] Table 1 Comparison of experimental results
[0131]
[0132] As can be seen from Table 1, the method of the present invention achieves a higher AUC(Pf,Pd) value, which indicates that it has a better overall detection effect.
[0133] Table 2 Comparison of experimental results
[0134]
[0135] Table 2 shows the training and testing times of the deep learning-based methods (Methods 5-7) and of the method of this invention on the four datasets; for the method of this invention, the reported training time is the fine-tuning time on the downstream task. The method of this invention achieves the shortest overall running time, especially in terms of training time. Compared with recent deep learning methods, it therefore has higher runtime efficiency.
[0136] Referring to Figure 3, Figure 3 is a schematic diagram of the operation of the hardware device according to an embodiment of the present invention. The hardware device specifically includes: a contrastive self-supervised hyperspectral image target detection device 301 based on a dual-path network, a processor 302, and a storage device 303.
[0137] Contrastive self-supervised hyperspectral image target detection device 301 based on a dual-path network: The device 301 implements the contrastive self-supervised hyperspectral image target detection method based on a dual-path network.
[0138] Processor 302: The processor 302 loads and executes the instructions and data in the storage device 303 to implement the contrastive self-supervised hyperspectral image target detection method based on a dual-path network.
[0139] Storage device 303: The storage device 303 stores the instructions and data used to implement the contrastive self-supervised hyperspectral image target detection method based on a dual-path network.
[0140] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
[0141] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
Claims
1. A contrastive self-supervised hyperspectral image target detection method based on a dual-path network, characterized in that it comprises the following steps:
S1: Acquire a hyperspectral remote sensing image for pre-training, a hyperspectral remote sensing image to be detected, and a prior target spectrum;
S2: Divide the pre-training hyperspectral remote sensing image into image blocks, perform data augmentation on each block in the spatial dimension, and combine each augmented block with the spectrum of its center pixel to obtain positive and negative sample pairs;
S3: Input the sample pairs obtained in step S2 into the designed dual-path network to learn feature representations and obtain encoded feature vectors; reduce the dimensionality of the encoded feature vectors through the projection head; use the normalized temperature cross-entropy loss function to constrain the training of the dual-path network; and obtain the pre-trained dual-path network after multiple iterations;
S4: Transfer the pre-trained dual-path network from step S3 to the downstream task and add a detector consisting of a fully connected layer and a sigmoid layer to form the target detection network for the detection task;
S5: Based on the prior target spectrum, obtain target samples and background samples using the constrained energy minimization and superpixel segmentation methods; use them as labeled samples, combined with the binary cross-entropy loss function, to fine-tune the target detection network constructed in step S4; after multiple iterations, obtain the final target detection network, which performs target detection on the hyperspectral remote sensing image to be detected;
In step S3, the dual-path network consists of a spectral path and a spatial path:
The spectral path focuses on the continuity between spectral dimensions and consists of multiple convolutional layers, each followed by a ReLU activation function. First, six convolutional layers are used to reduce the dimensionality of the input spectrum and remove redundant information. Then, four middle convolutional layers and long skip connections are used to extract features from the dimensionality-reduced spectrum. Finally, a single convolutional layer reduces the spectral dimension to one dimension so that the result can be concatenated with the output of the spatial path. In the corresponding formulas, the terms denote, respectively: the outputs of layer 1, layer l, layer (l-1), and layer 6; the convolution kernels of layers 1, l, (l-1), and 6; the biases of layers 1, l, (l-1), and 6; the ReLU activation function; the convolution operation; the spectrum of the center pixel of the image block; and the concatenation operation;
The spatial path focuses on the spatial correlation between image patches and consists of 5 convolutional layers and 1 global average pooling layer, with a ReLU activation function following each convolutional layer. First, a single convolutional layer reduces the spectral dimension of the input image patch to 1 dimension, so that only spatial features are considered. Then, 4 convolutional layers and short skip connections are used for spatial feature extraction. Finally, a global average pooling layer produces the output that is concatenated with the output of the spectral path. In the corresponding formulas, the terms denote, respectively: the outputs of layers 1, 2, l, (l-1), 5, and 6; the convolution kernels of layers 1, 2, l, and (l-1); the biases of layers 1, 2, l, and (l-1); the ReLU activation function; the convolution operation; the augmented image block; the concatenation operation; and the global average pooling operation;
The outputs of the two paths are concatenated and then normalized to obtain the final feature vector, that is, the output feature vector of the dual-path network is the result of applying the normalization layer to the concatenation of the 6th-layer output of the spectral path and the 6th-layer output of the spatial path.
2. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that step S2 is as follows: First, the surrounding area of the pre-training hyperspectral remote sensing image is padded with zeros to obtain a new pre-training remote sensing image, and, centered on each non-zero pixel, the image is divided into image blocks of a preset size; Then data augmentation is performed on the image blocks: each image block is randomly flipped and randomly masked, with the spectral values of all pixels in the mask set to 0, resulting in two augmented views; each view is combined with the spectrum of the block's center pixel to form a sample, finally giving all samples; Samples obtained by augmenting the same image block are positive samples of each other, and samples obtained by augmenting different image blocks are negative samples of each other.
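The sample construction in claim 2 can be sketched with NumPy as follows. This is a minimal sketch under assumptions, not the claimed implementation: the patch size `w` is assumed odd, the masking ratio `mask_ratio` and the 50% flip probability are hypothetical parameters, and all function names are illustrative.

```python
import numpy as np

def extract_patch(cube, row, col, w):
    """Zero-pad an (H, W, B) cube and return the w x w patch centred on (row, col)."""
    r = w // 2  # w is assumed to be odd
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="constant")
    return padded[row:row + w, col:col + w, :]

def augment(patch, rng, mask_ratio=0.2):
    """Random flips followed by a random pixel mask whose spectra are set to 0."""
    view = patch
    if rng.random() < 0.5:
        view = view[::-1, :, :]      # vertical flip
    if rng.random() < 0.5:
        view = view[:, ::-1, :]      # horizontal flip
    view = view.copy()
    mask = rng.random(view.shape[:2]) < mask_ratio
    view[mask] = 0.0                 # zero every band of each masked pixel
    return view

def make_positive_pair(cube, row, col, w, rng):
    """Two augmented views of one patch, each paired with the centre-pixel spectrum."""
    patch = extract_patch(cube, row, col, w)
    centre_spectrum = cube[row, col, :].copy()
    return [(augment(patch, rng), centre_spectrum),
            (augment(patch, rng), centre_spectrum)]
```

Samples returned by one call to `make_positive_pair` are positives of each other; samples produced from different (row, col) centres act as negatives.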
3. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S3, the projection head consists of two fully connected layers and one ReLU activation function; the projection head reduces the dimensionality of the feature vector h and projects it into the contrastive loss space: $z = FC_2(\sigma(FC_1(h)))$, where $z$ is the output vector of the projection head, $FC_1$ and $FC_2$ represent two different fully connected mapping functions, $\sigma$ represents the ReLU activation function, and $h$ represents the output feature vector of the dual-path network.
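A projection head of this kind can be sketched in a few lines of NumPy. The ordering (fully connected, ReLU, fully connected) is an assumption based on common contrastive learning practice, and the weight and bias arguments are hypothetical placeholders for learned parameters.

```python
import numpy as np

def projection_head(h, w1, b1, w2, b2):
    """Map feature vectors h of shape (n, d_in) to projections of shape (n, d_out).

    w1, b1: weights and bias of the first fully connected layer (hypothetical).
    w2, b2: weights and bias of the second fully connected layer (hypothetical).
    """
    hidden = np.maximum(h @ w1 + b1, 0.0)   # first fully connected layer + ReLU
    return hidden @ w2 + b2                 # second fully connected layer
```

The projection is only used while computing the contrastive loss; the detector added in step S4 operates on the feature vector h, as stated in claim 5.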
4. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S3, the normalized temperature cross-entropy loss function is: $$\ell(i,j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1,\,k \neq i}^{2N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$ $$L = \frac{1}{2N}\sum_{K=1}^{N}\left[\ell(2K-1, 2K) + \ell(2K, 2K-1)\right]$$ where $\mathrm{sim}(\cdot,\cdot)$ represents the similarity mapping function, $\ell(i,j)$ represents the loss for a single sample pair, $z_i$ and $z_j$ represent the projection head output vectors of a positive sample pair, $z_i$ and $z_k$ represent the projection head output vectors of a negative sample pair, $i$, $j$, and $k$ denote the i-th, j-th, and k-th samples, $\tau$ is the temperature coefficient, $N$ is the batch size, $L$ represents the normalized temperature cross-entropy loss function, and $K$ denotes the K-th sample in a batch, with $2K-1$ and $2K$ denoting the two samples obtained from the K-th sample by two different augmentations.
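The normalized temperature cross-entropy loss can be illustrated with the following NumPy sketch. It assumes cosine similarity as the similarity function and the common formulation in which the two views of each sample occupy adjacent rows; the function name and the default temperature are hypothetical.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Normalized temperature cross-entropy loss.

    z:   (2N, d) projection head outputs; rows 2k and 2k+1 (0-based) are the
         two augmented views of the same sample.
    tau: temperature coefficient (the default value is an assumption).
    """
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                  # 0<->1, 2<->3, ...
    return float(-log_prob[idx, partner].mean())
```

The loss decreases when the two views of a sample are close to each other and far from the views of other samples in the batch.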
5. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S4, the target detection network is: $\hat{y} = \mathrm{Sigmoid}(FC(h))$, where $\hat{y}$ represents the probability value obtained after passing through the target detection network, $\mathrm{Sigmoid}(\cdot)$ represents the Sigmoid activation function, $FC(\cdot)$ represents a fully connected layer, and $h$ represents the output feature vector of the dual-path network.
6. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S5, the labeled samples are obtained as follows: The constrained energy minimization method is used to perform an initial detection on the image to be detected; the initial detection results are sorted in descending order, and the top M pixels, together with their surrounding image patches of size w, are selected as target samples T; Using SLIC superpixel segmentation, the centroid pixels of the superpixels and their corresponding image patches are selected as background samples B, and the spectral angle distance is used to remove from B the centroid pixels that are similar to the prior target spectrum, together with their corresponding image blocks; The final labeled samples consist of the target samples and the background samples, with target samples labeled 1 and background samples labeled 0, that is, $y_i = 1$ if the i-th sample belongs to T and $y_i = 0$ if it belongs to B; where $y_i$ represents the true label of the i-th sample, $i$ is a positive integer, $M$ represents the number of target samples, and $C$ represents the number of background samples.
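The pseudo-labeling in claim 6 can be sketched as below. This is a simplified sketch under stated assumptions: it works on pixel spectra only (patch extraction is omitted), the SLIC centroid indices are supplied by the caller as `background_idx` instead of being computed, and the ridge term and the threshold `sad_threshold` are hypothetical parameters.

```python
import numpy as np

def cem_scores(pixels, target, ridge=1e-6):
    """Constrained energy minimization output for pixels (n, B) and target (B,)."""
    x = np.asarray(pixels, dtype=float)
    d = np.asarray(target, dtype=float)
    r = x.T @ x / x.shape[0]                                  # sample correlation matrix
    r = r + ridge * np.trace(r) / r.shape[0] * np.eye(r.shape[0])  # small ridge (assumption)
    r_inv_d = np.linalg.solve(r, d)
    return x @ r_inv_d / (d @ r_inv_d)                        # filter response is 1 on the target

def spectral_angle(pixels, target):
    """Spectral angle (radians) between each pixel spectrum and the target spectrum."""
    x = np.asarray(pixels, dtype=float)
    d = np.asarray(target, dtype=float)
    cos = x @ d / (np.linalg.norm(x, axis=1) * np.linalg.norm(d) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def select_samples(pixels, target, background_idx, m, sad_threshold):
    """Return target indices, kept background indices, and labels (1 = target, 0 = background)."""
    pixels = np.asarray(pixels, dtype=float)
    scores = cem_scores(pixels, target)
    target_idx = np.argsort(scores)[::-1][:m]                 # top-M initial detections
    background_idx = np.asarray(background_idx)
    angles = spectral_angle(pixels[background_idx], target)
    kept = background_idx[angles > sad_threshold]             # drop target-like centroids
    labels = np.concatenate([np.ones(len(target_idx)), np.zeros(len(kept))])
    return target_idx, kept, labels
```

In practice the background candidates would come from a superpixel segmentation of the image to be detected, and each selected index would be expanded into its surrounding image patch before fine-tuning.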
7. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S5, the fine-tuning process of the target detection network is as follows: In the dual-path network, the layers related to the number of spectral bands, namely the 6th layer of the spectral path and the 1st layer of the spatial path, are randomly initialized and retrained; the detector parameters are also randomly initialized. In the corresponding formulas, the terms denote, respectively: the outputs of the 6th layer of the spectral path and the 1st layer of the spatial path after fine-tuning; the output of the 5th layer of the spectral path; the augmented image block; the randomly initialized convolution kernels of the 6th layer of the spectral path and the 1st layer of the spatial path; the randomly initialized biases of the 6th layer of the spectral path and the 1st layer of the spatial path; the ReLU activation function; and the convolution operation; The dual-path network has a hierarchical structure, and the parameters of the first four layers of the spectral path and of the second, third, and fourth layers of the spatial path are frozen and directly transferred to the downstream task; The fifth layer of the spectral path, the fifth layer of the spatial path, the global average pooling layer, and the normalization layer continue to update their parameters based on the downstream task images. In the corresponding formulas, the terms denote, respectively: the outputs of the 5th layer of the spectral path and the 5th layer of the spatial path after fine-tuning; the output of the first four layers of the spectral path; the outputs of the 3rd and 4th layers of the spatial path; the convolution kernels of the 5th layer of the spectral path and the 5th layer of the spatial path, which continue training from the pre-trained initialization; the biases of the 5th layer of the spectral path and the 5th layer of the spatial path, which continue training from the pre-trained initialization; the ReLU activation function; the convolution operation; and the concatenation operation.
8. The contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in claim 1, characterized in that: In step S5, the binary cross-entropy loss function is: $$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$ where $L_{BCE}$ represents the binary cross-entropy loss function, $N$ represents the batch size, $y_i$ represents the true label of the i-th sample, and $\hat{y}_i$ represents the probability value predicted by the target detection network for the i-th sample.
9. A contrastive self-supervised hyperspectral image target detection device based on a dual-path network, characterized in that it comprises: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement the contrastive self-supervised hyperspectral image target detection method based on a dual-path network as described in any one of claims 1 to 8.