Hyperspectral target tracking method, system, medium and device based on large model segmentation

By constructing a hyperspectral target tracking method using segmentation models and Siamese networks, combined with genetic algorithms and knowledge distillation techniques, the problem of distinguishing targets from backgrounds in hyperspectral videos is solved, improving tracking accuracy and generalization ability, and achieving more accurate substance identification.

CN117911697B · Active · Publication Date: 2026-06-19 · JIANGNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: JIANGNAN UNIV
Filing Date: 2024-01-18
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing hyperspectral video target tracking methods cannot effectively distinguish between the target and the background during feature extraction, resulting in insufficient tracking accuracy and limited generalization ability.

Method used

A hyperspectral target tracking method based on large model segmentation is adopted. The method constructs a tracking network model comprising a segmentation model and a Siamese network, selects spectral bands with a genetic algorithm, and trains a student model by knowledge distillation. The segmentation model distinguishes targets from the background, and the Siamese network learns spectral features to improve recognition accuracy.

Benefits of technology

It effectively distinguishes between the target and the background, improves the accuracy and generalization ability of hyperspectral target tracking, reduces the risk of overfitting due to insufficient sample data, and enhances the ability to identify targets of different substances.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of image processing technology, and discloses a hyperspectral target tracking method, system, medium, and device based on large model segmentation. The method includes: constructing a tracking network model comprising a segmentation model and a Siamese network; acquiring existing hyperspectral video data, preprocessing it, and training the tracking network model to obtain a teacher model; acquiring hyperspectral video data of the object to be measured, preprocessing it, and dividing it into training and testing sets; using the tracking network model as a student model; training the student model using the teacher model and the training set to obtain a prediction model; during student model training, inputting the preprocessed image into the segmentation model for target and background segmentation; weighting the segmentation results with the background and inputting the weighted values into the Siamese network to obtain a feature map; and performing target tracking based on the feature map. This invention can effectively distinguish between the target and the background, improving the accuracy of target tracking.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a hyperspectral target tracking method, system, medium, and device based on large model segmentation.

Background Technology

[0002] Currently, hyperspectral video acquisition methods are becoming increasingly sophisticated, acquisition costs are decreasing, and obtaining hyperspectral video is becoming easier. Hyperspectral video is mostly used in fields such as autonomous driving and military guidance because it contains information across many spectral bands, including spatial and spectral information, making it more robust in target tracking than visible light. The application scenarios for hyperspectral video are rapidly evolving, and the demand for hyperspectral video in target tracking is constantly increasing.

[0003] To improve target tracking performance in hyperspectral video, existing technologies combine hyperspectral imaging with image processing techniques. For example, there are feature extraction methods based on the material information of the target object. These methods combine hyperspectral data with spectral-spatial histograms of multidimensional gradients (SSHMG) to describe the local spectral-spatial structure information in the hyperspectral image (HSI), encode material distribution information in the scene based on the abundance features obtained from hyperspectral unmixing, and then embed the extracted features into a filter framework to implement the target tracking algorithm. Most of these existing target tracking methods use the reliability of features to dynamically adjust feature weights and update model parameters online. However, these models are often limited by the size of the dataset and have insufficient generalization ability, and their feature extraction does not distinguish between the target and the background, which significantly reduces the accuracy of target tracking.

Summary of the Invention

[0004] Therefore, the technical problem to be solved by the present invention is to overcome the shortcomings of the prior art and provide a hyperspectral target tracking method, system, medium and device based on large model segmentation, which can effectively distinguish between the target and the background and improve the accuracy of target tracking.

[0005] To address the aforementioned technical problems, this invention provides a hyperspectral target tracking method based on large model segmentation, comprising:

[0006] Construct a tracking network model, which includes a segmentation model and a Siamese network;

[0007] Existing hyperspectral video data is acquired, preprocessed, and used to train the tracking network model. The trained tracking network model is then used as the teacher model.

[0008] The hyperspectral video data of the object to be tested is acquired, preprocessed, and divided into training set and test set. The tracking network model is used as the student model. The student model is trained using the teacher model and training set. The trained student model is used as the prediction model.

[0009] Training the student model includes:

[0010] The first frame of the preprocessed hyperspectral image sequence is used as the template frame image, and the T-th frame image in the preprocessed hyperspectral image sequence is extracted as the detection frame image. The preprocessed detection frame image is input into the segmentation model to segment the target and the background. The segmentation result is weighted with the background and then input into the Siamese network to obtain the feature map. The response map is obtained based on the feature map of the template frame image and the feature map of the detection frame image. The response map is input into the classification model to obtain the predicted target box.

[0011] The frame following the T-th frame in the preprocessed hyperspectral image sequence is then extracted as the detection frame image, and the above operation is performed to obtain its predicted target box. This process is repeated until all frame images in the hyperspectral image sequence have been traversed. All predicted target boxes obtained at that point are taken as candidate target boxes, and the final target tracking result is obtained based on the candidate target boxes.

[0012] Preferably, the preprocessing includes:

[0013] The hyperspectral video data is arranged in a sequential time sequence to obtain a hyperspectral image sequence, and each frame of the hyperspectral image sequence is used as the initial frame image.

[0014] A genetic algorithm is used to select the 'a' bands with the largest joint entropy in the initial frame image, and these 'a' bands are combined to form a new frame image.

[0015] Calculate the spectral response weighting coefficient w for hyperspectral video data:

[0016]

[0017] Where $R_{tj}$ represents the average spectral response curve of all pixels within the target image region in the j-th spectral band; $R_{bj}$ represents the average spectral response curve of all pixels within the background image region in the j-th spectral band; n represents the total number of spectral bands in the image; $\mu_b$ is the average value of the spectral response of the background region; $\sigma_b$ is the standard deviation of the spectral response of the background region; $\mu_t$ is the average value of the spectral response of the target region; $\sigma_t$ is the standard deviation of the spectral response of the target region; $d_j$ is the attenuation factor; and $S_j$ is a spatial consistency parameter;

[0018] The center coordinates, width, and height of the target are calculated based on the label of the new frame image, and a tracking bounding box is formed based on the center coordinates, width, and height of the target; the tracking bounding box is used as the target image region to be tracked, and the target image region to be tracked is used as the initial position of the target;

[0019] The tracking box is scaled and cropped, and the portion of the tracking box that exceeds the search area is filled with the average value of the global image pixels. The image in the cropped and filled tracking box is then used as the preprocessed frame image.

[0020] Preferably, the step of inputting the preprocessed detection frame image into the segmentation model for target and background segmentation, and weighting the segmentation result with the background, includes:

[0021] The preprocessed detection frame image is input into the segmentation model, and the image is encoded using the pre-trained parameter model of the segmentation model to obtain the mask result and the mask quality score vector.

[0022] The mask matrix is obtained by selecting the mask result based on the mask quality score vector, and the target and background are distinguished based on the values of the mask matrix;

[0023] The mask result is weighted using the aforementioned spectral response weighting coefficient:

[0024]

[0025] Where $X_{i,j}$ is the pixel value in the i-th row and j-th column of the current detection frame image; $X'_{i,j}$ is the weighted pixel value corresponding to $X_{i,j}$; $M_{ij}$ represents the mask matrix; α and β are coefficients that adjust the contribution of the target pixel value to its local neighboring pixel values; $\gamma_{k,l}$ is the contribution weight of a neighboring pixel (k,l), taken from the set of neighboring pixels of pixel (i,j), to the center pixel (i,j); δ is the coefficient that adjusts the contribution of the background pixel value to its local neighboring pixel values; and $\eta_{k,l}$ is the contribution weight of the neighboring pixel (k,l) to the background pixel (i,j).

[0026] Preferably, the mask matrix is ​​obtained by selecting the mask result based on the mask quality score vector, specifically as follows:

[0027] $\mathrm{TopMasks} = \{ M[i] \mid i \in I_{sorted}[0:k] \}$,

[0028] Where TopMasks is the mask matrix, $I_{sorted}$ represents the index vector obtained by sorting the values of the mask quality score vector from high to low, and M[i] represents the i-th mask result.

[0029] Preferably, the step of obtaining a response map based on the feature map of the template frame image and the feature map of the detection frame image, and inputting the response map into the classification model to obtain the predicted target box, includes:

[0030] The feature maps of the template frame image and the detection frame image are cross-correlated channel by channel to obtain the response map. The response map is then input into the feature extraction model to obtain the final response map. The response map R is calculated as follows:

[0031]

[0032] Where X represents the detection frame image and Z represents the template frame image; R is computed from the feature map of the detection frame image, whose input elements are the weighted pixel values $X'_{i,j}$, and the feature map of the template frame image; * indicates a convolution operation.

[0033] The classification model of the tracking network model includes a classification branch and a regression branch. The classification branch includes a center branch. The final response map is input into the classification model to obtain the predicted target box.

[0034] Preferably, obtaining the final target tracking result based on the candidate target bounding boxes includes:

[0035] The candidate bounding boxes are scored using a scale change penalty, and the top n predicted bounding boxes are selected. Multiple neighboring predicted bounding boxes are selected from the top n predicted bounding boxes and a weighted average is performed. The result of the weighted average is used as the final target tracking result.

[0036] Preferably, the method for scoring the candidate target boxes using scale variation penalty is as follows:

[0037] $S = (1-\lambda_d)\,cls_{i,j} \times p_{ij} \times \lambda_d H$,

[0038] Where $\lambda_d$ is the balancing weight, $cls_{i,j}$ represents the corresponding category label at position (i,j) in the response map, $p_{ij}$ represents the penalty coefficient for scale change at position (i,j) in the response map, and H is the cosine window.
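The selection and fusion of candidate boxes described in [0035] can be illustrated with a minimal sketch. It assumes each candidate box is a tuple (cx, cy, w, h) that already carries a score S, and that the weighted average uses the scores themselves as weights; the text does not state the weighting scheme, so that choice and the function name `fuse_top_boxes` are hypothetical.

```python
def fuse_top_boxes(boxes, scores, n=3):
    """Keep the n best-scored boxes and return their score-weighted average.

    boxes:  list of (cx, cy, w, h) tuples (assumed box format).
    scores: one score S per box, e.g. from the scale-change penalty scoring.
    """
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and equally long")
    # Rank candidates by score, highest first, and keep the top n.
    ranked = sorted(zip(scores, boxes), key=lambda sb: sb[0], reverse=True)[:n]
    total = sum(score for score, _ in ranked)
    if total <= 0:
        raise ValueError("scores of the selected boxes must sum to a positive value")
    # Assumption: the weights of the average are the scores themselves.
    return tuple(
        sum(score * box[d] for score, box in ranked) / total for d in range(4)
    )
```

With two equally scored neighbouring boxes and one distant low-score box, `n=2` averages only the two neighbours and ignores the outlier.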

[0039] This invention also provides a hyperspectral target tracking system based on large model segmentation, comprising:

[0040] A tracking network model building module is used to build a tracking network model, which includes a segmentation model and a Siamese network;

[0041] The teacher model construction module is used to acquire existing hyperspectral video data, preprocess it, and train the tracking network model, using the trained tracking network model as the teacher model.

[0042] The prediction model building module is used to acquire hyperspectral video data of the object to be tested, preprocess it and divide it into training set and test set, use the tracking network model as student model, use the teacher model and training set to train the student model, and use the trained student model as prediction model.

[0043] Training the student model includes: using the first frame of the preprocessed hyperspectral image sequence as the template frame image, and extracting the T-th frame image from the preprocessed hyperspectral image sequence as the detection frame image; inputting the preprocessed detection frame image into a segmentation model to segment the target and background, and inputting the segmentation result and background weighted into a Siamese network to obtain a feature map; obtaining a response map based on the feature map of the template frame image and the feature map of the detection frame image, and inputting the response map into a classification model to obtain the predicted target box;

[0044] The tracking and prediction module repeatedly extracts the frame image of the next frame after the Tth frame in the preprocessed hyperspectral image sequence as the detection frame image, performs the above operation to obtain the predicted target box corresponding to the frame image of the next frame after the Tth frame, until all frame images in the hyperspectral image sequence have been traversed, and all predicted target boxes at this time are taken as candidate target boxes, and the final target tracking result is obtained based on the candidate target boxes.

[0045] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the hyperspectral target tracking method based on large model segmentation.

[0046] The present invention also provides a hyperspectral target tracking device based on large model segmentation, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the hyperspectral target tracking method based on large model segmentation.

[0047] Compared with the prior art, the above-described technical solution of the present invention has the following advantages:

[0048] This invention distinguishes targets from background using a segmentation model, and then utilizes a Siamese network to learn the spectral features of hyperspectral data. This enables the tracking network model to more accurately identify and differentiate targets of different substances, improving its generalization ability and the accuracy of its identification. At the same time, this invention uses knowledge distillation when training the tracking network model, reducing the impact of insufficient sample data and improving the generalization ability of the tracking network model, thereby further improving the accuracy of its identification.

Attached Figure Description

[0049] To make the content of this invention easier to understand, the invention will be further described in detail below with reference to specific embodiments and accompanying drawings, wherein:

[0050] Figure 1 is a flowchart of the method of the present invention.

[0051] Figure 2 is a flowchart of the method of the present invention.

[0052] Figure 3 is a schematic diagram of the structure of the Siamese network (CAR) model in this invention.

[0053] Figure 4 is a schematic diagram of the first frame of the hyperspectral sequence in an embodiment of the present invention.

[0054] Figure 5 is a schematic diagram of the result of band selection processing on a hyperspectral target image in an embodiment of the present invention.

[0055] Figure 6 is a schematic diagram of the result after the hyperspectral target image is processed by the SAM module in an embodiment of the present invention.

[0056] Figure 7 is a schematic diagram of the labels and prediction boxes after the hyperspectral image sequence has been tracked in an embodiment of the present invention.

[0057] Explanation of the markings in the accompanying drawings: 1. Actual location; 2. Predicted location.

Detailed Implementation

[0058] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention. However, the embodiments described are not intended to limit the present invention.

[0059] Example 1

[0060] As shown in Figures 1-2, this invention discloses a hyperspectral target tracking method based on large model segmentation, comprising the following steps:

[0061] S1: Construct the tracking network model shown in Figure 3, which includes a segmentation model and a Siamese network (CAR). In this embodiment, the segmentation model used is the pre-trained ViT-B SAM model (source: https://doi.org/10.48550/arXiv.2304.02643).

[0062] S2: Obtain existing hyperspectral video data, preprocess it, and train the tracking network model. Use the trained tracking network model as the teacher model. In this embodiment, the existing hyperspectral video data used can be the HOT2022 dataset (source: https://www.hsitracking.com). The method for training the tracking network model can be the same as the method for training the student model in S3, or conventional training methods can be used.

[0063] S3: Obtain hyperspectral video data of the object to be tested, preprocess it, and divide it into training set and test set. Use the tracking network model as the student model, train the student model using the teacher model and training set, and use the trained student model as the prediction model. This invention reduces the overfitting problem caused by insufficient training samples through knowledge distillation technology, improves the generalization ability of the student model, and further improves the tracking effect of hyperspectral images.

[0064] S3-1: Acquire hyperspectral image data of the object under test and perform preprocessing.

[0065] S3-1-1: Arrange the hyperspectral image data in time sequence to obtain a hyperspectral image sequence, and use each frame image in the hyperspectral image sequence as the initial frame image; in this embodiment, the hyperspectral image sequence is a single channel, so the size of the frame image is M×N×1, where M×N is the size of the image, which is 256*256 in this example.

[0066] S3-1-2: Use a genetic algorithm to select the 'a' bands with the highest joint entropy from the initial frame image, and combine these 'a' bands to form a new frame image; the number of bands 'a' is adjusted according to the actual situation. In this embodiment, a = 3, that is, 3 suitable bands are selected from the 16 bands of the HOT2022 dataset. The band selection module combines a genetic algorithm with the maximum joint entropy criterion to select valuable bands and eliminate information redundancy in the hyperspectral video, which preserves physical information and improves tracking speed.
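The band selection in S3-1-2 can be sketched as follows. This is only an illustration under stated assumptions: pixel values are normalised to [0, 1) and quantised into 8 levels to estimate the joint entropy, and the genetic operators (elitist selection, crossover by mixing the parents' bands, single-band mutation) together with all hyperparameters are hypothetical, since the text does not specify them.

```python
import math
import random
from collections import Counter


def joint_entropy(bands, selected, levels=8):
    """Shannon joint entropy (in bits) of the selected bands.

    bands: list of bands, each a flat list of pixel values in [0, 1).
    """
    n_pixels = len(bands[0])
    # Each pixel becomes a tuple of quantised values, one per selected band.
    symbols = Counter(
        tuple(min(int(bands[b][p] * levels), levels - 1) for b in selected)
        for p in range(n_pixels)
    )
    return -sum((c / n_pixels) * math.log2(c / n_pixels) for c in symbols.values())


def select_bands(bands, a=3, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Search for 'a' band indices with large joint entropy using a simple GA."""
    rng = random.Random(seed)
    n_bands = len(bands)

    def fitness(individual):
        return joint_entropy(bands, individual)

    population = [tuple(sorted(rng.sample(range(n_bands), a))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            pool = sorted(set(p1) | set(p2))  # crossover: mix the parents' bands
            child = set(rng.sample(pool, a))
            if rng.random() < mutation_rate:  # mutation: replace one band
                child.discard(rng.choice(sorted(child)))
                while len(child) < a:
                    child.add(rng.randrange(n_bands))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=fitness)
```

For a 16-band frame flattened into 16 lists of pixel values, `select_bands(bands, a=3)` returns a sorted tuple of three band indices, which can then be stacked to form the new frame image.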

[0067] S3-1-3: Calculate the spectral response weighting coefficient w for hyperspectral video data:

[0068]

[0069] Where $R_{tj}$ represents the average spectral response curve of all pixels within the target image region in the j-th spectral band; $R_{bj}$ represents the average spectral response curve of all pixels within the background image region in the j-th spectral band; and n represents the total number of spectral bands in the image. $\mu_b$ and $\sigma_b$ are the mean and standard deviation of the spectral response of the background region, respectively, used to standardize the background signal. $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the spectral response of the target region, respectively, used to standardize the target signal. $d_j$ is an attenuation factor that takes into account the signal attenuation that may be introduced in the j-th band due to factors such as changes in equipment sensitivity. $S_j$ is a spatial consistency parameter that measures the spatial correlation between pixels within the j-th band; high spatial correlation implies lower noise and sharper target boundaries. The specific settings of $d_j$ and $S_j$ should be adjusted according to the actual situation.

[0070] S3-1-4: Calculate the center coordinates, width, and height of the target to be tracked based on the label of the new frame image, and form a tracking box based on the center coordinates, width, and height of the target to be tracked; use the tracking box as the target image region to be tracked, and use the target image region to be tracked as the initial position of the target to be tracked.

[0071] S3-1-5: The tracking box is scaled and cropped. The portion of the tracking box that exceeds the search area is filled using the average pixel value of the global image. The image within the cropped and filled tracking box is then used as the preprocessed frame image. Specifically, in this embodiment, the process is as follows: based on the target label determined from the template frame image, and considering the target's size and movement speed, the search area is chosen to be four times the area of the target region to be tracked, i.e., the width and height of the search area are each twice those of the target region. Therefore, the width and height of the tracking box are doubled before template cropping. Because the target may be located at an edge of the image, the corresponding box may exceed the search area; the portion exceeding the search area is filled with the average pixel value of the global image. The cropped and filled template frame image is then input into the tracking network model for training and testing.
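A minimal sketch of the crop-and-fill step in S3-1-5 is given below, assuming a single-channel frame stored as a 2D list and a tracking box given by its center (cx, cy), width w and height h. The function name and the rounding of the window origin are illustrative choices, not taken from the text.

```python
def crop_with_mean_padding(image, cx, cy, w, h, scale=2):
    """Crop a (scale*w) x (scale*h) window centred on (cx, cy).

    image: 2D list of pixel values (rows of equal length).
    Window pixels that fall outside the image are filled with the mean
    value of the whole image.
    """
    rows, cols = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (rows * cols)
    out_w, out_h = int(scale * w), int(scale * h)
    # Top-left corner of the enlarged window (may be negative near edges).
    x0 = int(round(cx - out_w / 2))
    y0 = int(round(cy - out_h / 2))
    crop = []
    for y in range(y0, y0 + out_h):
        crop.append([
            image[y][x] if 0 <= y < rows and 0 <= x < cols else mean
            for x in range(x0, x0 + out_w)
        ])
    return crop
```

With `scale=2` the window has twice the width and height of the tracking box, which corresponds to the fourfold search area of this embodiment.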

[0072] S3-2: Divide the preprocessed hyperspectral image sequence into a training set and a test set. Use the tracking network model as a student model. Train the student model using the teacher model and the training set. Use the trained student model as a prediction model.

[0073] Knowledge distillation is a transfer learning technique used to improve the performance and generalization ability of a student model by transferring knowledge from a teacher model. In this invention, a tracking network model trained using an existing dataset is used as the teacher model. When using hyperspectral image data of the object under test as the training set, the original tracking network model is used as the student model for knowledge distillation, thereby improving the performance and generalization ability of the student model by transferring knowledge from the teacher model. When training the student model using the teacher model and the training set, the classification results of the teacher model are used as soft labels to guide the training of the student model. A temperature parameter T is set to soften the classification results so that they carry more information. The knowledge distillation loss function $L_{cls}$ used when training the student model with the teacher model and the training set is:

[0074] $L_{cls} = T^2 \times \mathrm{KLdiv}(C_s, C_t)$,

[0075] Where T is the temperature parameter of the distillation model, KLdiv() is the KL divergence, and $\mathrm{KLdiv}(C_s, C_t) = \sum\left(C_t \log(C_t / C_s)\right)$; $C_t$ represents the soft label of the teacher model, $C_t = \mathrm{softmax}(z_t / T)$, where softmax() is the softmax function operation and $z_t$ represents the classification output of the teacher model; $C_s$ represents the soft label of the student model, $C_s = \mathrm{softmax}(z_s / T)$, where $z_s$ represents the classification output of the student model.
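The distillation loss defined above can be written out directly. The sketch below follows the stated definitions (temperature-softened softmax for teacher and student, KL divergence summed as C_t·log(C_t / C_s), scaled by T²); the function names and the default temperature of 2.0 are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """L_cls = T^2 * sum(C_t * log(C_t / C_s)) with softened soft labels."""
    c_s = softmax(student_logits, temperature)  # student soft labels C_s
    c_t = softmax(teacher_logits, temperature)  # teacher soft labels C_t
    kl = sum(t * math.log(t / s) for t, s in zip(c_t, c_s) if t > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.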

[0076] When training the student model using the teacher model and the training set, the total loss function L is:

[0077] $L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg}$,

[0078] Where $L_{cls}$ is the loss function for the knowledge distillation, $L_{cen}$ is the loss function of the central branch, $L_{reg}$ is the loss function of the regression branch, and $\lambda_1$ and $\lambda_2$ are weighting coefficients; in this embodiment, $\lambda_1 = 2$ and $\lambda_2 = 3$.

[0079] The regression branch uses IOU loss, and the loss function of the regression branch, $L_{reg}$, is calculated as follows:

[0080]

[0081] Where (i,j) represents each position in the response map R, and (x,y) represents the position in the tracking box to which point (i,j) maps back. The IOU term is the IOU loss function value between the actual bounding box and the predicted bounding box at point (i,j). The regression target is the distance from the point (x,y) to the four sides of the ground truth bounding box. The indicator term takes the value 0 or 1: when a point in the feature map does not belong to the manually defined visible bounding box in the first frame, its value is 0; otherwise it is 1. $A_{reg}(i,j)$ represents the position of the predicted bounding box, that is, the distances between the point in the tracking box corresponding to (i,j) and the four sides of the bounding box. $L_{IOU}()$ represents the IOU loss function operation.

[0082] The regression branch includes four channels. The calculation method is as follows:

[0083]

[0084] Where the feature maps for the four channels of the regression branch are as follows:

[0085]

[0086]

[0087]

[0088]

[0089] Where the four channels represent, in order, the distances from the predicted center point to the left, upper, right, and lower boundaries of the tracking box; (x0, y0) represents the coordinates of the upper left corner of the tracking box, and (x1, y1) represents the coordinates of the lower right corner of the tracking box.
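The four regression channels can be illustrated with a short sketch. It assumes the usual image convention (x grows to the right, y grows downward), so that the distances from a point (x, y) to the box sides are x − x0, y − y0, x1 − x and y1 − y; the exact expressions are not legible in the text above, so these forms and the helper `inside_box` are assumptions.

```python
def regression_targets(x, y, x0, y0, x1, y1):
    """Distances from point (x, y) to the left, upper, right and lower sides.

    (x0, y0) is the upper left corner and (x1, y1) the lower right corner
    of the tracking box.
    """
    left = x - x0
    top = y - y0
    right = x1 - x
    bottom = y1 - y
    return left, top, right, bottom


def inside_box(x, y, x0, y0, x1, y1):
    """True when all four distances are positive, i.e. the point lies in the box."""
    return all(d > 0 for d in regression_targets(x, y, x0, y0, x1, y1))
```

A point with any non-positive distance lies on or outside the box and would receive an indicator value of 0 in the regression loss.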

[0090] The IOU loss function $L_{IOU}$ is calculated as follows:

[0091]

[0092] Where I and U are the intersection and union of the true center point and the predicted center point, respectively, and I and U are calculated as follows:

[0093]

[0094]

[0095] Where l represents the distance from the ground truth center point to the left boundary of the tracking box, t represents the distance from the ground truth center point to the top boundary of the tracking box, r represents the distance from the ground truth center point to the right boundary of the tracking box, and b represents the distance from the ground truth center point to the bottom boundary of the tracking box.
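As an illustration of how an intersection I and a union U can be obtained from such side distances, the sketch below uses the formulation common in box-regression losses: both boxes are described by (l, t, r, b) distances measured from the same point. The negative-logarithm form of the loss is an assumption borrowed from that common formulation, since the expressions for $L_{IOU}$, I and U are not legible in the text above.

```python
import math


def iou_from_distances(pred, truth):
    """IOU of two boxes given as (l, t, r, b) distances from a shared point."""
    l_p, t_p, r_p, b_p = pred
    l, t, r, b = truth
    # Overlap extent along each axis, measured through the shared point.
    inter_w = min(l, l_p) + min(r, r_p)
    inter_h = min(t, t_p) + min(b, b_p)
    inter = inter_w * inter_h
    union = (l + r) * (t + b) + (l_p + r_p) * (t_p + b_p) - inter
    return inter / union


def iou_loss(pred, truth):
    """Assumed loss form: -ln(IOU), which is zero for a perfect prediction."""
    return -math.log(iou_from_distances(pred, truth))
```

A prediction that matches the ground truth gives an IOU of 1 and a loss of 0; a predicted box of half the side length centred on the same point gives an IOU of 0.25.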

[0096] The loss function $L_{cen}$ of the central branch is:

[0097]

[0098] Where C(i,j) is the centrality score. For a point (i,j) in the feature map output by the central branch, C(i,j) is calculated as follows:

[0099]

[0100] The centrality score C(i,j) represents the degree to which the current pixel deviates from the center point of the real target. The smaller the value of C(i,j), the greater the deviation of the current pixel.
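One widely used definition with exactly this behaviour is the centerness score of anchor-free detectors, C = sqrt(min(l, r)/max(l, r) × min(t, b)/max(t, b)). The sketch below uses that definition as an assumption; the expression for C(i,j) is not legible in the text above and may differ.

```python
import math


def centerness(l, t, r, b):
    """Assumed centerness score in [0, 1] from side distances (l, t, r, b).

    Equals 1 at the exact box center and decreases toward 0 as the point
    approaches any side of the box. All distances must be positive.
    """
    if min(l, t, r, b) <= 0:
        raise ValueError("point must lie strictly inside the box")
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

A point at the center of the box scores 1, and the score shrinks as the point moves toward a boundary, matching the statement that a smaller C(i,j) indicates a larger deviation.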

[0101] S4: Input the training set and test set into the prediction model to obtain the target tracking result.

[0102] S4-1: Extract the first frame image from the preprocessed hyperspectral image sequence as the template frame image. In this embodiment, the extracted first frame image is shown in Figure 4. The T-th frame image in the preprocessed hyperspectral image sequence is extracted as the detection frame image, where T is an integer greater than 1.

[0103] S4-2: The preprocessed detection frame image is input into the segmentation model for target and background segmentation. The segmentation result is weighted with the background to highlight the distinction between the target and the background, and then input into the Siamese network to obtain the feature map. The backbone network of the tracking network model is a deep learning neural network, and the deep learning neural network used in this embodiment is ResNet50. The feature map of the template frame image is extracted using the deep learning neural network ResNet50, and the feature map of the detection frame image is extracted using the Siamese network.

[0104] S4-2-1: Input the preprocessed detection frame image into the segmentation model, use the pre-trained parameter model of the segmentation model to perform image encoding, and obtain the mask result and the mask quality score vector Q;

[0105] S4-2-2: Obtain the mask matrix by selecting the mask result based on the mask quality score vector Q:

[0106] $\mathrm{TopMasks} = \{ M[i] \mid i \in I_{sorted}[0:k] \}$,

[0107] Where TopMasks is the mask matrix, representing the mask results corresponding to the k highest quality scores in the mask quality score vector Q; $I_{sorted}$ represents the index vector obtained by sorting the mask quality score vector Q from high to low, and M[i] represents the i-th mask result.

[0108] Based on the values of the mask matrix, the elements are divided into target and background. In this embodiment, elements with a value of True at position (i,j) in the mask matrix are determined as the target, and elements with a value of False are determined as the background.
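The mask selection in S4-2-2 amounts to sorting the quality scores and keeping the best k masks. A minimal sketch, assuming masks are 2D lists of booleans and Q is a list of floats (the function names are illustrative):

```python
def select_top_masks(masks, quality_scores, k=1):
    """Return the k masks with the highest quality scores (TopMasks)."""
    # I_sorted: indices of Q ordered from the highest score to the lowest.
    i_sorted = sorted(
        range(len(quality_scores)),
        key=lambda i: quality_scores[i],
        reverse=True,
    )
    return [masks[i] for i in i_sorted[:k]]


def split_target_background(mask):
    """Collect (row, col) positions of target (True) and background (False)."""
    target, background = [], []
    for i, row in enumerate(mask):
        for j, value in enumerate(row):
            (target if value else background).append((i, j))
    return target, background
```

With `k=1`, the single best mask is kept and then split into target and background positions for the subsequent weighting step.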

[0109] S4-2-3: For each type of video, it is assumed that the distinction between the target and the background is different; therefore, the spectral response weighting coefficient w is used to weight the masking result.

[0110]

[0111] Where $X_{i,j}$ is the pixel value in the i-th row and j-th column of the current detection frame image; $X'_{i,j}$ is the weighted pixel value corresponding to $X_{i,j}$; $M_{ij}$ represents the mask matrix, where a value of 1 indicates a target pixel and a value of 0 indicates a background pixel; α and β are coefficients that adjust the contribution of the target pixel value to its local neighboring pixel values; $\gamma_{k,l}$ is the contribution weight of a neighboring pixel (k,l), taken from the set of neighboring pixels of pixel (i,j), to the center pixel (i,j); δ is a coefficient that adjusts the contribution of the background pixel value to its local neighboring pixel values; and $\eta_{k,l}$ is the contribution weight of the neighboring pixel (k,l) to the background pixel (i,j). The specific values of α, β, δ, $\gamma_{k,l}$ and $\eta_{k,l}$ should be adjusted according to the actual situation.

[0112] S4-3: Perform a channel-wise cross-correlation operation on the feature maps of the template frame image and the detection frame image to obtain a response map, and input the response map into the feature extraction model to obtain the final response map. In this embodiment, the feature extraction model is a hybrid attention mechanism (PSA module). Before entering the PSA module, the response map first passes through a pyramid convolution, which uses convolution kernels of different scales and depths to extract multi-scale information, thereby capturing more important information.

[0113] The response map R is calculated as follows:

[0114] R = φ(X) * φ(Z),

[0115] where X represents the detection frame image and Z represents the template frame image; φ(X) represents the feature map of the detection frame image, in which each element is X'_{i,j}; φ(Z) represents the feature map of the template frame image; * denotes the convolution operation, i.e., cross-correlation; Cat() represents the concatenation operation; and F3(X), F4(X), and F5(X) are the features extracted from the last three residual blocks of the ResNet50 deep neural network, respectively.

[0116] In this embodiment, F3(X), F4(X), and F5(X) each contain 256 channels, so the concatenated feature φ(X) contains 256×3 channels. When the feature map is input into the hybrid attention mechanism PSA module, a 1×1 convolution is first performed, followed by a two-layer pyramid convolution that captures different local details at both the 5×5 and 3×3 scales. Then, a 1×1 convolution is applied to combine the information extracted by the different kernels, and the fused features are grouped and reordered along the channel dimension. Channel rearrangement units are used to integrate channel attention and spatial attention into each group, and finally all features are aggregated to form the final response map.
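The concatenation and channel-wise cross-correlation of S4-3 can be sketched in NumPy as follows. The random arrays stand in for the ResNet50 features F3, F4, and F5, and the 9×9 and 3×3 spatial sizes are assumptions chosen to keep the example small; `depthwise_xcorr` is a hypothetical helper name.

```python
import numpy as np

def depthwise_xcorr(search_feat, template_feat):
    """Cross-correlate (C, Hx, Wx) search features with (C, Hz, Wz) template features, channel by channel."""
    c, hx, wx = search_feat.shape
    cz, hz, wz = template_feat.shape
    assert c == cz, "both feature maps need the same number of channels"
    out = np.zeros((c, hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search_feat[:, i:i + hz, j:j + wz]
            # Each channel of the template acts as a kernel for the same channel of the search features.
            out[:, i, j] = np.sum(window * template_feat, axis=(1, 2))
    return out

rng = np.random.default_rng(0)
# Stand-ins for F3, F4, F5 of the detection frame X and template frame Z (256 channels each).
f_x = [rng.standard_normal((256, 9, 9)) for _ in range(3)]
f_z = [rng.standard_normal((256, 3, 3)) for _ in range(3)]
phi_x = np.concatenate(f_x, axis=0)   # Cat(F3(X), F4(X), F5(X)) -> 768 channels
phi_z = np.concatenate(f_z, axis=0)   # Cat(F3(Z), F4(Z), F5(Z)) -> 768 channels
response = depthwise_xcorr(phi_x, phi_z)
```

The response keeps one channel per input channel (256×3 = 768 here), which is what allows the later attention module to reweight channels individually.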

[0117] S4-4: The classification model of the tracking network model includes a classification branch and a regression branch, and the classification branch includes a center branch. The classification branch, regression branch, and center branch output three feature maps with different channel sizes. In this embodiment, the feature maps output by the classification branch, the regression branch, and the center branch are labeled cls, reg, and cen, respectively, and w and h represent the width and height of the feature map. The final response map is input into the classification model to obtain the predicted target bounding box.

[0118] The cross-correlation operation yields features with two different channel counts. In this embodiment, the features with 2K channels are processed by the classification branch and the center branch, and the features with 4K channels are used for bounding box offset regression, where K is an integer representing the number of anchors. The predicted bounding boxes are corrected, and the final target bounding box is obtained through the regression branch. The center offset and size offset for the next frame are then updated.

[0119] S4-5: Repeatedly extract the frame image of the next frame after the Tth frame in the preprocessed hyperspectral image sequence as the detection frame image, and perform the above S4-2 to S4-4 operations to obtain the predicted target box corresponding to the frame image of the next frame after the Tth frame, until all frame images in the preprocessed hyperspectral image sequence have been traversed; take all the predicted target boxes at this time as candidate target boxes.

[0120] S4-6: Use scale change penalty to score the candidate target boxes and select the n predicted target boxes corresponding to the top n scores. Select multiple nearby predicted target boxes in the vicinity of the n predicted target boxes corresponding to the top n scores and perform a weighted average. Use the weighted average result as the final target tracking result.

[0121] The method for scoring the candidate bounding boxes using scale variation penalty is as follows:

[0122]

[0123]

[0124] S = (1 - λ_d) cls_{i,j} × p_{ij} + λ_d H;

[0125] where λ_d is the balancing weight (0.3 in this embodiment); cls_{i,j} represents the category label at position (i,j) in the response map; r represents the ratio of the height to the width of the predicted bounding box at position (i,j) in the response map, i.e., r = h / w; r' represents the same ratio for the template frame; s is the overall scale of the predicted bounding box; s' is the overall scale of the target's width and height in the template frame image; p_{ij} represents the penalty coefficient for scale change at position (i,j) in the response map; a1 is the penalty coefficient weight (0.04 in this embodiment); H is the cosine window; b1 is the window coefficient (0.5 in this embodiment); M is the window length (25 in this embodiment); n is an integer sequence increasing from 1-M to M-1, which in this embodiment runs from -24 to 24 with a step size of 2; and the window formula uses the outer product of two vectors.
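The scoring in S4-6 can be sketched as below. The score S and the parameter values follow the text; the exact formulas for p and H are not reproduced in this text, so the penalty uses the form common in Siamese trackers such as SiamRPN, exp(-a1 · (max(r/r', r'/r) · max(s/s', s'/s) - 1)), and the window uses b1 + (1 - b1) · cos(πn / (M - 1)). Both forms, the definition of the overall scale in `overall_scale`, and all function names are assumptions.

```python
import numpy as np

def cosine_window(m=25, b1=0.5):
    """2-D cosine window built as the outer product of a 1-D window with itself."""
    n = np.arange(1 - m, m, 2)                        # -24, -22, ..., 24 for m = 25
    w = b1 + (1.0 - b1) * np.cos(np.pi * n / (m - 1))
    return np.outer(w, w)

def overall_scale(w, h):
    """Assumed overall scale of a box: sqrt((w + pad) * (h + pad)) with pad = (w + h) / 2."""
    pad = (w + h) / 2.0
    return np.sqrt((w + pad) * (h + pad))

def scale_penalty(w, h, w_t, h_t, a1=0.04):
    """Assumed penalty p: equals 1 for no change and decays as ratio or scale deviates."""
    r, r_t = h / w, h_t / w_t
    s, s_t = overall_scale(w, h), overall_scale(w_t, h_t)
    change = np.maximum(r / r_t, r_t / r) * np.maximum(s / s_t, s_t / s)
    return np.exp(-(change - 1.0) * a1)

def score_map(cls, w, h, w_t, h_t, lam=0.3, a1=0.04, b1=0.5):
    """S = (1 - lam) * cls * p + lam * H, evaluated at every response-map position."""
    p = scale_penalty(w, h, w_t, h_t, a1)
    window = cosine_window(cls.shape[0], b1)
    return (1.0 - lam) * cls * p + lam * window

cls = np.full((25, 25), 0.5)                 # toy classification response
w_pred = np.full((25, 25), 40.0)             # predicted widths per position
h_pred = np.full((25, 25), 40.0)             # predicted heights per position
S = score_map(cls, w_pred, h_pred, w_t=40.0, h_t=40.0)
```

With b1 = 0.5 the 1-D window coincides with NumPy's `np.hanning(25)`, so the window term favors positions near the center of the response map, where the target was located in the previous frame.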

[0126] The value of n is determined according to the actual situation. In this embodiment, n=3, that is, the three predicted target boxes corresponding to the three highest scores S are obtained. Eight neighboring predicted target boxes are selected around the three predicted target boxes and a weighted average is taken. The result of the weighted average is taken as the final target tracking result.
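A possible reading of the top-n selection and neighbor averaging in S4-6 is sketched below. It assumes that the candidates lie on the response-map grid, that the eight neighbors are the grid positions surrounding each selected position, that the selected positions themselves are included, and that the scores serve as averaging weights; none of these details is fixed by the text, and `fuse_top_n` is a hypothetical name.

```python
import numpy as np

def fuse_top_n(scores, boxes, n=3):
    """scores: (H, W); boxes: (H, W, 4) as (cx, cy, w, h). Returns one fused box."""
    h, w = scores.shape
    flat = np.argsort(scores, axis=None)[::-1][:n]        # positions of the n highest scores
    rows, cols = np.unravel_index(flat, scores.shape)
    picked = set()
    for r, c in zip(rows, cols):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:           # ignore neighbors outside the map
                    picked.add((int(rr), int(cc)))
    idx = np.array(sorted(picked))
    weights = scores[idx[:, 0], idx[:, 1]]                # assumption: score-weighted average
    candidates = boxes[idx[:, 0], idx[:, 1]]
    return np.average(candidates, axis=0, weights=weights)

rng = np.random.default_rng(1)
scores = rng.random((5, 5)) + 0.1                         # strictly positive toy scores
boxes = np.tile(np.array([10.0, 20.0, 30.0, 40.0]), (5, 5, 1))
fused = fuse_top_n(scores, boxes, n=3)
```

Averaging over a small neighborhood rather than taking the single best box tends to smooth frame-to-frame jitter in the predicted position and size.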

[0127] After target tracking is completed in the current detection frame, the initial width and height for the next frame, as well as the target position information for the next frame, can be updated using the learning rate. The final position is obtained by applying the offset to the coordinates of the best prediction box and then adjusting for scale. Similarly, the width and height are fine-tuned using the width, height, and scale deviation of the previous frame to obtain the final size. Finally, the position coordinates and size are updated for reference when predicting the position and scale in the next detection frame.
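The learning-rate update described above is commonly implemented as a linear interpolation between the previous and the newly predicted size. The sketch below assumes that form; the function name `update_state`, the fixed learning rate, and the numeric values are hypothetical, and the text does not state how the learning rate is chosen.

```python
def update_state(prev_center, prev_size, offset, pred_size, lr):
    """Shift the center by the predicted offset and blend the size with learning rate lr."""
    cx = prev_center[0] + offset[0]
    cy = prev_center[1] + offset[1]
    # lr = 0 keeps the previous size, lr = 1 adopts the new prediction outright.
    width = (1.0 - lr) * prev_size[0] + lr * pred_size[0]
    height = (1.0 - lr) * prev_size[1] + lr * pred_size[1]
    return (cx, cy), (width, height)

center, size = update_state(prev_center=(100.0, 80.0), prev_size=(40.0, 20.0),
                            offset=(3.0, -2.0), pred_size=(50.0, 30.0), lr=0.4)
```

A small learning rate makes the tracked size change slowly, which limits the effect of a single poor prediction on later frames.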

[0128] Example 2

[0129] The present invention also discloses a hyperspectral target tracking system based on large model segmentation, including a tracking network model construction module, a teacher model construction module, a prediction model construction module, and a tracking prediction module.

[0130] The tracking network model construction module is used to construct a tracking network model, which includes a segmentation model and a Siamese network;

[0131] The teacher model construction module is used to acquire existing hyperspectral video data, preprocess it, and train the tracking network model, using the trained tracking network model as the teacher model.

[0132] The prediction model construction module is used to acquire hyperspectral video data of the object to be tested, preprocess it and divide it into a training set and a test set, use the tracking network model as the student model, train the student model using the teacher model and the training set, and use the trained student model as the prediction model.

[0133] Training the student model includes: using the first frame of the preprocessed hyperspectral image sequence as the template frame image, and extracting the T-th frame image from the preprocessed hyperspectral image sequence as the detection frame image; inputting the preprocessed detection frame image into the segmentation model to segment the target and background, weighting the segmentation result with the background, and inputting the weighted result into the Siamese network to obtain a feature map; obtaining a response map based on the feature map of the template frame image and the feature map of the detection frame image, and inputting the response map into the classification model to obtain the predicted target box;

[0134] The tracking and prediction module repeatedly extracts the frame image of the next frame after the Tth frame in the preprocessed hyperspectral image sequence as the detection frame image, performs the above operation to obtain the predicted target box corresponding to the frame image of the next frame after the Tth frame, until all frame images in the hyperspectral image sequence have been traversed, and all predicted target boxes at this time are taken as candidate target boxes, and the final target tracking result is obtained based on the candidate target boxes.

[0135] Example 3

[0136] The present invention also discloses a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the hyperspectral target tracking method based on large model segmentation in Embodiment 1.

[0137] Example 4

[0138] The present invention also discloses a hyperspectral target tracking device based on large model segmentation, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the hyperspectral target tracking method based on large model segmentation in Embodiment 1.

[0139] SiamCAR (Siamese Fully Convolutional Classification and Regression) proposes a Siamese-network-based classification and regression framework that decomposes the visual tracking task into two sub-problems: a classification problem and a regression problem. This framework predicts the target class and bounding box at each pixel location without using anchors or region proposals, thus avoiding complex parameter tuning and human intervention. Furthermore, a simple yet effective classification-regression sub-network is designed to decode the target's location and scale information from the multi-channel response map. This sub-network leverages the target's semantic and centerness information, improving the accuracy and robustness of bounding box regression.

[0140] This invention improves the generalization ability and accuracy of the algorithm based on the Segment Anything Model (SAM). SAM is a prompt-based model trained on over 1 billion masks across 11 million images, achieving strong zero-shot generalization. However, SAM's performance on video is not ideal. Therefore, SAM is combined with a tracker to address the problem of insufficient generalization, making this invention applicable to the feature extraction portion of any hyperspectral video target tracking method.

[0141] SAM is an innovative image segmentation model. Its innovation and significance are mainly reflected in the following aspects:

[0142] 1. The Prompt mechanism has been added: Unlike traditional semantic segmentation methods, SAM incorporates a Prompt mechanism, which can use text, coordinate points, and bounding boxes as auxiliary information to optimize the segmentation results. This increases the flexibility of interaction and is also a beneficial attempt to solve the scale problem in image segmentation.

[0143] 2. Generate multiple valid masks: When encountering uncertainty in identifying the object to be segmented, SAM can generate multiple valid masks.

[0144] 3. Automatic Segmentation Mode: SAM's automatic segmentation mode can identify all potential objects in an image and generate masks.

[0145] 4. It contributed the largest semantic segmentation dataset in the world to date: the training dataset of SAM, which is 6 times larger than the previous largest dataset.

[0146] 5. High versatility: SAM is a general-purpose model for image segmentation tasks. Unlike previous image segmentation models that could only handle certain types of images, SAM can handle all types of images.

[0147] 6. Reduced requirements for specific scene modeling knowledge, training computation, and data labeling: SAM establishes a general model for image segmentation, which is expected to complete image segmentation tasks under a unified framework.

[0148] 7. Broad Application Prospects: SAM will not only play a role in the aforementioned cutting-edge fields, but may also be used in people's daily lives. For example, in the field of medical imaging diagnosis, SAM may lead to more accurate medical imaging models, improving medical standards; in the process of taking pictures, the addition of SAM may enable faster and smarter facial recognition.

[0149] This invention extracts multi-scale information by using convolutional kernels of different scales and depths, and then uses a hybrid attention approach to capture important information, thereby enhancing the model's ability to identify similar objects, capturing more important information, and improving the accuracy and robustness of tracking.

[0150] This invention distinguishes the target from the background by segmentation model, and then uses a Siamese network to learn the spectral features of hyperspectral data, enabling the tracking network model to more accurately identify and distinguish targets of different substances, thereby improving the accuracy of the tracking network model's identification.

[0151] This invention uses the SAM model concept when training the tracking network model, directly using the pre-trained parameter model to process the task, thereby further improving the recognition performance of the tracking network model.

[0152] This invention uses the concept of knowledge distillation when training the tracking network model. The output of the teacher model is used as a soft label to guide the training of the student model. This solves the problem of difficulty in training deep neural networks due to limited sample data, reduces the risk of overfitting during training, and further improves the recognition performance of the tracking network model.
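The use of teacher outputs as soft labels can be illustrated with a generic distillation loss. The text does not specify the loss, so the temperature-softened KL term, the hard-label cross-entropy term, the mixing weight, and the function names below are assumptions taken from common knowledge-distillation practice rather than from the invention.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, weight=0.5):
    """Weighted sum of a soft-label KL term (teacher -> student) and a hard-label cross-entropy term."""
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)      # soft labels from the teacher
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    p_hard = softmax(student_logits)
    labels = np.asarray(hard_labels)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + eps)
    # temperature ** 2 keeps the soft term on a scale comparable to the hard term.
    return float(np.mean(weight * temperature ** 2 * kl + (1.0 - weight) * ce))

teacher = np.array([[2.0, 0.0], [0.0, 3.0]])   # toy target/background logits
labels = [0, 1]
loss_match = distillation_loss(teacher, teacher, labels)
loss_mismatch = distillation_loss(np.array([[0.0, 2.0], [3.0, 0.0]]), teacher, labels)
```

When the student reproduces the teacher's logits the soft term vanishes and only the hard-label term remains, so `loss_match` is smaller than `loss_mismatch`.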

[0153] This invention selects three bands with the highest joint entropy from hyperspectral data by using a genetic algorithm-based band selection method, thereby reducing information redundancy in hyperspectral data while extracting effective features.
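Band selection by joint entropy with a genetic algorithm can be sketched as follows. The quantization into 8 levels, the population size, the crossover and mutation operators, and the function names are all assumptions for illustration; the text only states that a genetic algorithm selects the bands with the largest joint entropy.

```python
import numpy as np

def joint_entropy(cube, bands, bins=8):
    """Joint entropy (bits) of the selected bands of an (H, W, B) cube with values in [0, 1]."""
    q = np.clip((cube[:, :, list(bands)] * bins).astype(int), 0, bins - 1)
    codes = np.zeros(q.shape[:2], dtype=np.int64)
    for b in range(q.shape[2]):
        codes = codes * bins + q[:, :, b]          # one integer code per pixel
    counts = np.bincount(codes.ravel())
    p = counts[counts > 0] / codes.size
    return float(-np.sum(p * np.log2(p)))

def select_bands(cube, a=3, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Toy genetic algorithm: individuals are sets of a band indices, fitness is joint entropy."""
    rng = np.random.default_rng(seed)
    n_bands = cube.shape[2]
    pop = [tuple(sorted(int(b) for b in rng.choice(n_bands, a, replace=False)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: joint_entropy(cube, ind), reverse=True)
        parents = pop[: pop_size // 2]             # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), 2, replace=False)
            genes = sorted(set(parents[i]) | set(parents[j]))   # crossover: pool parent bands
            child = [int(b) for b in rng.choice(genes, a, replace=False)]
            if rng.random() < mutation_rate:       # mutation: swap in an unused band
                unused = [b for b in range(n_bands) if b not in child]
                child[int(rng.integers(a))] = int(rng.choice(unused))
            children.append(tuple(sorted(child)))
        pop = parents + children
    return max(pop, key=lambda ind: joint_entropy(cube, ind))

rng = np.random.default_rng(42)
cube = rng.random((16, 16, 6))
cube[:, :, :3] = 0.5                               # bands 0-2 are constant and carry no information
best = select_bands(cube, a=3)
```

Because the fitter half of each generation is carried over unchanged, the best band set found so far is never lost, and constant bands contribute nothing to the joint entropy.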

[0154] To further illustrate the beneficial effects of the present invention, a simulation experiment was conducted using the method of the present invention in this embodiment. Figure 5 is a schematic diagram of the result after band selection processing. As can be seen from Figure 5, band selection removes redundant information and noise, which helps highlight targets in hyperspectral images. Figure 6 is a schematic diagram of a hyperspectral target image after SAM processing. As can be seen from Figure 6, the image processed by SAM better distinguishes the target from the background. Figure 7 is a schematic diagram of the labels and predicted bounding boxes after the hyperspectral image sequence has been tracked. In Figure 7, identifier 1 represents the label obtained from the template frame image, i.e., the true location of the target, while identifier 2 represents the location predicted using the method of this invention. As can be seen from Figure 7, the prediction box obtained by the present invention contains the hyperspectral target to be tracked and overlaps largely with the label, indicating a good prediction effect and thus demonstrating the beneficial effect of the present invention.

[0155] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0156] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0157] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0158] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0159] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.

Claims

1. A hyperspectral target tracking method based on large model segmentation, characterized in that it comprises: constructing a tracking network model, which includes a segmentation model and a Siamese network; acquiring existing hyperspectral video data, preprocessing it, and training the tracking network model, the trained tracking network model being used as the teacher model; acquiring hyperspectral video data of the object to be tested, preprocessing it, and dividing it into a training set and a test set; using the tracking network model as the student model, training the student model using the teacher model and the training set, and using the trained student model as the prediction model; wherein training the student model includes: using the first frame of the preprocessed hyperspectral image sequence as the template frame image, and extracting the T-th frame image in the preprocessed hyperspectral image sequence as the detection frame image; inputting the preprocessed detection frame image into the segmentation model to segment the target and the background, weighting the segmentation result with the background, and inputting the weighted result into the Siamese network to obtain the feature map; obtaining the response map based on the feature map of the template frame image and the feature map of the detection frame image, and inputting the response map into the classification model to obtain the predicted target box; repeatedly extracting the frame image of the next frame after the T-th frame in the preprocessed hyperspectral image sequence as the detection frame image and performing the above operation to obtain the predicted target box corresponding to the frame image of the next frame after the T-th frame, until all frame images in the hyperspectral image sequence have been traversed; taking all predicted target boxes at this time as candidate target boxes; and obtaining the final target tracking result based on the candidate target boxes;
wherein the step of inputting the preprocessed detection frame image into the segmentation model for target and background segmentation, and weighting the segmentation result with the background, includes: inputting the preprocessed detection frame image into the segmentation model, and encoding the image using the pre-trained parameter model of the segmentation model to obtain the mask result and the mask quality score vector; obtaining the mask matrix by selecting the mask result based on the mask quality score vector, and distinguishing the target and the background based on the values of the mask matrix; and weighting the mask result using the spectral response weighting coefficient: , where w is the spectral response weighting coefficient; X_{i,j} is the pixel value in the i-th row and j-th column of the current detection frame image; X'_{i,j} is the weighted pixel value corresponding to X_{i,j}; M_{ij} represents the value in the i-th row and j-th column of the mask matrix; α and β are coefficients that adjust the contribution of the target pixel value to its local neighboring pixel values; γ_{k,l} is the contribution weight of the neighboring pixel (k,l) to the center pixel (i,j), where (k,l) ranges over the set of neighboring pixels of pixel (i,j); δ is a coefficient that adjusts the contribution of the background pixel value to its local neighboring pixel values; and η_{k,l} is the contribution weight of the neighboring pixel (k,l) to the background pixel (i,j);
the spectral response weighting coefficient of the hyperspectral video data is calculated as follows: , where R_{tj} represents the average spectral response curve of all pixels within the target image region in the j-th spectral band; R_{bj} represents the average spectral response curve of all pixels in the background image region in the j-th spectral band; n represents the total number of spectral bands in the image; and the remaining symbols of the formula denote, in order, the average value of the spectral response of the background region, the standard deviation of the spectral response of the background region, the average value of the spectral response of the target region, the standard deviation of the spectral response of the target region, the attenuation factor, and the spatial consistency parameter.

2. The hyperspectral target tracking method based on large model segmentation according to claim 1, characterized in that the preprocessing includes: arranging the hyperspectral video data in chronological order to obtain a hyperspectral image sequence, and using each frame of the hyperspectral image sequence as an initial frame image; using a genetic algorithm to select the a bands with the largest joint entropy in the initial frame image, and combining these a bands to form a new frame image; calculating the spectral response weighting coefficient of the hyperspectral video data; calculating the center coordinates, width, and height of the target based on the label of the new frame image, and forming a tracking bounding box based on the center coordinates, width, and height of the target; using the tracking bounding box as the target image region to be tracked, and using the target image region to be tracked as the initial position of the target; and scaling and cropping the tracking box, filling the portion of the tracking box that exceeds the search area with the average value of the global image pixels, and using the image in the cropped and filled tracking box as the preprocessed frame image.

3. The hyperspectral target tracking method based on large model segmentation according to claim 1, characterized in that the mask matrix is obtained by selecting the mask result based on the mask quality score vector, specifically: TopMasks = {M[i] | i ∈ I_sorted[0:k]}, where TopMasks is the mask matrix, I_sorted represents the index vector obtained by sorting the values of the mask quality score vector from highest to lowest, and M[i] represents the i-th mask result.

4. The hyperspectral target tracking method based on large model segmentation according to claim 1, characterized in that the step of obtaining a response map based on the feature maps of the template frame image and the detection frame image, and inputting the response map into the classification model to obtain the predicted target box, includes: cross-correlating the feature maps of the template frame image and the detection frame image channel by channel to obtain the response map, and inputting the response map into the feature extraction model to obtain the final response map, the response map R being calculated as follows: R = φ(X) * φ(Z), wherein X represents the detection frame image, Z represents the template frame image, φ(X) represents the feature map of the detection frame image, the elements in the feature map of the detection frame image are the X'_{i,j}, φ(Z) represents the feature map of the template frame image, and * represents the convolution operation; the classification model of the tracking network model includes a classification branch and a regression branch, and the classification branch includes a center branch; and the final response map is input into the classification model to obtain the predicted target box.

5. The hyperspectral target tracking method based on large model segmentation according to any one of claims 1-4, characterized in that: The step of obtaining the final target tracking result based on the candidate target bounding boxes includes: The candidate bounding boxes are scored using a scale change penalty, and the top n predicted bounding boxes are selected. Multiple neighboring predicted bounding boxes are selected from the top n predicted bounding boxes and a weighted average is performed. The result of the weighted average is used as the final target tracking result.

6. The hyperspectral target tracking method based on large model segmentation according to claim 5, characterized in that the candidate bounding boxes are scored using the scale change penalty as follows: S = (1 - λ_d) cls_{i,j} × p_{ij} + λ_d H, where λ_d is the balancing weight, cls_{i,j} represents the category label at position (i,j) in the response map, p_{ij} represents the penalty coefficient for scale change at position (i,j) in the response map, and H is the cosine window.

7. A hyperspectral target tracking system based on large model segmentation, characterized in that it comprises: a tracking network model construction module, used to construct a tracking network model, which includes a segmentation model and a Siamese network; a teacher model construction module, used to acquire existing hyperspectral video data, preprocess it, and train the tracking network model, using the trained tracking network model as the teacher model; a prediction model construction module, used to acquire hyperspectral video data of the object to be tested, preprocess it and divide it into a training set and a test set, use the tracking network model as the student model, train the student model using the teacher model and the training set, and use the trained student model as the prediction model; wherein training the student model includes: using the first frame of the preprocessed hyperspectral image sequence as the template frame image, and extracting the T-th frame image from the preprocessed hyperspectral image sequence as the detection frame image; inputting the preprocessed detection frame image into the segmentation model to segment the target and background, weighting the segmentation result with the background, and inputting the weighted result into the Siamese network to obtain a feature map; obtaining a response map based on the feature map of the template frame image and the feature map of the detection frame image, and inputting the response map into the classification model to obtain the predicted target box; and a tracking and prediction module, which repeatedly extracts the frame image of the next frame after the T-th frame in the preprocessed hyperspectral image sequence as the detection frame image and performs the above operation to obtain the predicted target box corresponding to the frame image of the next frame after the T-th frame, until all frame images in the hyperspectral image sequence have been traversed, takes all predicted target boxes at this time as candidate target boxes, and obtains the final target tracking result based on the candidate target boxes;
wherein the step of inputting the preprocessed detection frame image into the segmentation model for target and background segmentation, and weighting the segmentation result with the background, includes: inputting the preprocessed detection frame image into the segmentation model, and encoding the image using the pre-trained parameter model of the segmentation model to obtain the mask result and the mask quality score vector; obtaining the mask matrix by selecting the mask result based on the mask quality score vector, and distinguishing the target and the background based on the values of the mask matrix; and weighting the mask result using the spectral response weighting coefficient: , where w is the spectral response weighting coefficient; X_{i,j} is the pixel value in the i-th row and j-th column of the current detection frame image; X'_{i,j} is the weighted pixel value corresponding to X_{i,j}; M_{ij} represents the value in the i-th row and j-th column of the mask matrix; α and β are coefficients that adjust the contribution of the target pixel value to its local neighboring pixel values; γ_{k,l} is the contribution weight of the neighboring pixel (k,l) to the center pixel (i,j), where (k,l) ranges over the set of neighboring pixels of pixel (i,j); δ is a coefficient that adjusts the contribution of the background pixel value to its local neighboring pixel values; and η_{k,l} is the contribution weight of the neighboring pixel (k,l) to the background pixel (i,j);
the spectral response weighting coefficient of the hyperspectral video data is calculated as follows: , where R_{tj} represents the average spectral response curve of all pixels within the target image region in the j-th spectral band; R_{bj} represents the average spectral response curve of all pixels in the background image region in the j-th spectral band; n represents the total number of spectral bands in the image; and the remaining symbols of the formula denote, in order, the average value of the spectral response of the background region, the standard deviation of the spectral response of the background region, the average value of the spectral response of the target region, the standard deviation of the spectral response of the target region, the attenuation factor, and the spatial consistency parameter.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that: When executed by a processor, the computer program implements the hyperspectral target tracking method based on large model segmentation as described in any one of claims 1-6.

9. A hyperspectral target tracking device based on large model segmentation, characterized in that: It includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the hyperspectral target tracking method based on large model segmentation as described in any one of claims 1-6.