An RGBT target tracking method based on modal difference compensation

By constructing a dual-stream CNN backbone network and a modality difference compensation module, differential feature weights are generated. Combining cross-modal and single-modal features, attention mechanisms are used to select highly discriminative features. Furthermore, a quadratic regression network for boundary localization is used to solve the problem of insufficient information utilization caused by modality differences in RGBT target tracking, thereby achieving higher robustness and accuracy.

CN115205337B · Active · Publication Date: 2026-06-23 · XIAN THERMAL POWER RES INST CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIAN THERMAL POWER RES INST CO LTD
Filing Date
2022-07-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing RGBT target tracking methods ignore the modal differences between visible light and infrared images, resulting in insufficient utilization of cross-modal information and degraded tracking performance. In particular, under extreme conditions, fused cross-modal features may be less reliable than single-modal features.

Method used

A modality difference compensation-based approach is adopted. By constructing a dual-stream CNN backbone network and a modality difference compensation module, difference feature weights are generated. Combining cross-modal and single-modal features, attention mechanisms are used to select highly discriminative features, and a quadratic regression network for boundary localization is used to improve tracking accuracy.

Benefits of technology

By fully utilizing the differences between visible light and infrared image information, the robustness and accuracy of RGBT target tracking are improved, enabling it to maintain high-efficiency target tracking performance in complex environments.


Abstract

The application discloses an RGBT target tracking method based on modal difference compensation, which comprises: (1) a dual-stream CNN backbone network for single-modal image feature extraction, in which one stream processes RGB images and the other processes infrared images; (2) a modal difference compensation module that compensates for the difference information between the single-modal RGB and infrared images and fuses them into cross-modal RGB-T features; (3) a feature selection module based on an attention mechanism that selects highly discriminative features for RGB-T tracking; and (4) a prediction head, composed of a discriminative model prediction tracking network and a secondary regression network based on boundary localization, that predicts accurate target boxes. The application makes full use of the differences between visible light and infrared image information, designs a modal difference compensation module to obtain robust cross-modal features, and performs RGB-T tracking by considering cross-modal features and single-modal (RGB and infrared) features simultaneously, thereby improving RGB-T tracking performance.

Description

Technical Field

[0001] This invention belongs to the field of computer vision, specifically relating to an RGBT target tracking method based on modal difference compensation.

Background Technology

[0002] Target tracking refers to the process of identifying a target of interest in the first frame of an image and then marking its position and scale information frame by frame in a subsequent video sequence, typically using bounding boxes. As an important task and research branch in the field of computer vision, target tracking technology plays a vital role and has significant value in civilian fields such as autonomous driving, community security, environmental monitoring, and intelligent traffic monitoring, as well as military fields such as battlefield dynamic analysis, precision guidance of military weapons, UAV reconnaissance, and anti-missile systems.

[0003] With the rapid development of deep neural networks, target tracking technology based on visible light has achieved significant breakthroughs. Visible light images can capture rich target information, such as color, texture, and boundaries. However, due to limitations in imaging principles, visible light images are easily affected by the environment, resulting in low robustness of visible light-based target tracking methods in scenarios with low visibility, complex lighting, and adverse weather conditions.

[0004] In recent years, to improve target tracking performance in complex scenes, some researchers have attempted to apply multimodal data to target tracking, such as registered visible-infrared (RGB-T) images and visible-depth (RGB-D) images. Infrared images, which are formed from the thermal radiation of objects, lack information on target color, texture, and shape, and face their own challenges such as thermal crossover. However, they are insensitive to changes in lighting and penetrate fog and haze well, which makes them strongly complementary to visible light images. Therefore, RGB-T target tracking has received increasing attention.

[0005] Compared to RGB tracking, RGBT trackers achieve robust tracking performance in challenging environments. To date, researchers have proposed numerous RGBT trackers. Early RGBT trackers were based on handcrafted features. These methods did not adapt well to challenging environments such as dramatic appearance changes, cluttered backgrounds, rapid target movement, and occlusion. Inspired by the successful application of Convolutional Neural Networks (CNNs) in RGB tracking, recent work has tended to use CNNs to improve the performance of RGBT trackers. Due to the powerful feature extraction and representation capabilities of deep CNNs, these state-of-the-art RGB-T trackers typically outperform traditional trackers significantly. Modern RGB-T trackers often use a two-stream network structure to learn features for each modality and fuse visible and infrared features through fusion strategies such as cascading, pixel-level addition, and modality weights to obtain a more robust target representation. Other trackers extract different feature representations using three adapters—a general adapter, a modal adapter, and an instance adapter—to fully leverage the complementary advantages of RGB and infrared modalities.

[0006] While these algorithms have achieved great success in RGBT tracking, they ignore the modal differences between RGB and infrared images caused by different imaging mechanisms. This leads to insufficient utilization of cross-modal complementary information, thus affecting subsequent tracking performance. Furthermore, these tracking methods typically use fused RGBT cross-modal features to predict the final result. Because RGB and infrared data have strong complementary advantages, fused cross-modal features can generally provide better predictions than single-modal features (such as RGB or infrared features). However, under extreme conditions such as thermal crossover or strong illumination, fused cross-modal features may be less reliable than single-modal features.

Summary of the Invention

[0007] To address the problem of utilizing cross-modal information in target tracking using visible light and infrared images, this invention provides an RGBT target tracking method based on modal difference compensation. This method utilizes the difference information between modes to achieve the interaction and fusion of complementary information between modes, thereby fully leveraging the complementary advantages of the two modes. Furthermore, it combines single-modal features to improve the accuracy and robustness of the target tracking algorithm.

[0008] The present invention is achieved using the following technical solution:

[0009] An RGBT target tracking method based on modal difference compensation includes the following steps:

[0010] Step 1: Construct a two-stream CNN backbone network;

[0011] Step 2: Construct a modal difference compensation module;

[0012] Step 2.1: The modal difference compensation module adopts a compensation and re-fusion strategy. It first compensates for the difference information of the two modalities separately, and then fuses the compensated RGB and compensated infrared features by element-wise summation.

[0013] Step 2.2: Taking the difference features $F_{r-t}$ and $F_{t-r}$ as inputs, two weight generation networks with the same structure are used to generate the difference feature weights $W_{r-t}$ and $W_{t-r}$. The weight generation network uses both a spatial weight map and a channel weight map to obtain more compensation information.

[0014] Step 2.3: After the difference feature weight maps $W_{r-t}$ ($W_{t-r}$) are obtained, the compensated RGB feature $\hat{F}_{rgb}$ and the compensated infrared feature $\hat{F}_{t}$ are obtained through cross-modal residual connections, i.e.:

[0015]

[0016]

[0017] As shown in formula (6), the compensated RGB feature $\hat{F}_{rgb}$ contains infrared modality-specific features in addition to the original single-modal RGB feature $F_{rgb}$; similarly, as shown in formula (7), the compensated infrared feature $\hat{F}_{t}$ contains RGB modality-specific features together with the original single-modal infrared feature $F_{t}$. The compensated RGB feature and the compensated infrared feature are then added element-wise to obtain the final fused cross-modal RGBT feature $F_{rgbt} \in \mathbb{R}^{C\times H\times W}$, i.e.:

[0018] $F_{rgbt} = \hat{F}_{rgb} + \hat{F}_{t}$ (8)

[0019] Step 3: Construct the feature selection module;

[0020] The attention-based feature selection module further adaptively selects highly discriminative cross-modal and single-modal features to improve RGBT tracking performance. The feature selection module fully selects highly discriminative features of all modal features through three steps, including cross-modal RGBT features, original single-modal RGB features, and original single-modal infrared features.

[0021] Step 3.1: Fusion of all modal features, the purpose of which is to obtain more information from cross-modal RGBT features, original single-modal RGB features, and original single-modal infrared features;

[0022] Step 3.2: Channel-level feature selection, which aims to enhance features of categories relevant to the target and suppress useless features;

[0023] Step 3.3: Spatial feature selection, which aims to enhance the weight of the target's spatial location and suppress locations far from the target;

Step 4: Construct the discriminative model prediction tracking network;

[0024] The discriminative model prediction tracking network takes the template image features and the target image features as inputs and produces two outputs: a classification score map and an IoU prediction. The classification score map is obtained by applying a filter f to the target image features. The filter f is produced by a model initializer, consisting of a convolutional layer and a precise RoI pooling (PrPool) layer, and a model optimizer that solves for the final model using steepest descent; the model initializer uses multiple samples from the initial training set to solve for f. The IoU prediction estimates the IoU between the deep features of the image and the candidate bounding box features, and the bounding box is then estimated by maximizing the predicted IoU.

[0025] Step 5: Construct a quadratic regression network based on boundary localization:

[0026] The purpose of the boundary-based quadratic regression network is to perform a quadratic regression on the IoU prediction results obtained in step 4 to obtain a more accurate target box. The boundary-based quadratic regression network can effectively refine the initial estimate of the IoUNet predictor and significantly improve tracking performance.

[0027] The boundary-localized quadratic regression network uses the following two steps to further improve target tracking performance;

[0028] Step 5.1: The quadratic regression network based on boundary localization uses the cross-correlation between the target feature representation and the search feature representation to readjust the center of the bounding box estimate, yielding a readjusted bounding box estimate in which the target is centered;

[0029] Step 5.2: For the search feature $F_{p}$ obtained in step 5.1, a boundary localization scheme is adopted to improve localization accuracy. A classifier locates each boundary separately, while a regressor predicts the offset between the target boundary position and the ground truth.

[0030] Step 6: Two-stage training;

[0031] Step 6.1: On the training dataset, the discriminative model prediction tracking network from Step 4 is trained using a supervised learning mechanism by minimizing the discriminative learning loss function and the mean squared error loss function, respectively, to obtain the model parameters of the discriminative model prediction tracking network.

[0032] Step 6.2: Load and freeze the network parameters from Step 6.1, and perform supervised learning on the boundary-based quadratic regression network from Step 5 separately on the training dataset. By minimizing the mean squared error loss function, the trained network is obtained.

[0033] Step 7: Target Tracking: Integrate the network into an online tracker to track visible light and infrared video data;

[0034] Step 7.1: Given a first frame with annotations, perform data augmentation on the first frame by translation, rotation and blurring to obtain an initial training set containing 15 samples;

[0035] Step 7.2: Using the initial training set image and the next frame image as input, the template fusion feature and the fusion feature to be detected are obtained by using the dual-stream CNN backbone network in step 1, the modality difference compensation module in step 2, and the feature selection module in step 3, respectively.

[0036] Step 7.3: Using the template fusion features and the fusion features to be detected as input, the initial bounding box is obtained through the discriminative model prediction tracking network in Step 4;

[0037] Step 7.4: Using the template fusion features, the fusion features to be detected, and the expanded initial bounding box as input, perform secondary regression on the initial bounding box through the boundary-based quadratic regression network in Step 5 to obtain a more accurate target rectangle bounding box.

[0038] Step 7.5: Repeat steps 7.2-7.4 to iteratively calculate the target position and bounding box in the image frame by frame to achieve continuous RGBT target tracking;

[0039] Step 7.6: Update the initial training set every 20 frames to obtain new template features, and then continue to step 7.5.
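
The online procedure in Steps 7.1 to 7.6 can be sketched as the control loop below. This is only a minimal illustration under stated assumptions: `augment`, `extract`, `predict_box`, and `refine_box` are hypothetical callables standing in for the first-frame augmentation, the feature pipeline of Steps 1 to 3, the network of Step 4, and the network of Step 5. How the training set is rebuilt at each update is also an assumption, since the text only states that it is updated every 20 frames.

```python
def track(frames, init_box, augment, extract, predict_box, refine_box,
          num_init_samples=15, update_interval=20):
    """Run the Step 7 loop; frames[0] is the annotated first frame."""
    # Step 7.1: build the initial training set from the annotated first frame.
    train_set = augment(frames[0], init_box, num_init_samples)
    template_feat = extract(train_set)           # template fusion features
    boxes = [init_box]
    updates = 0
    for idx in range(1, len(frames)):
        # Step 7.2: fusion features of the frame to be detected.
        search_feat = extract([frames[idx]])
        # Step 7.3: initial bounding box from the Step 4 network.
        box = predict_box(template_feat, search_feat)
        # Step 7.4: second regression of the box by the Step 5 network.
        box = refine_box(template_feat, search_feat, box)
        boxes.append(box)
        # Step 7.6: refresh the training set every `update_interval` frames.
        if idx % update_interval == 0:
            # Assumption: re-augment the current frame with the current box.
            train_set = augment(frames[idx], box, num_init_samples)
            template_feat = extract(train_set)
            updates += 1
    return boxes, updates


# Toy stand-ins (hypothetical) so the loop can be exercised end to end.
def _augment(frame, box, n):
    return [frame] * n

def _extract(images):
    return float(sum(images)) / len(images)

def _predict_box(template_feat, search_feat):
    return (search_feat, search_feat, 10.0, 10.0)

def _refine_box(template_feat, search_feat, box):
    return box

demo_boxes, demo_updates = track(list(range(45)), (0.0, 0.0, 10.0, 10.0),
                                 _augment, _extract, _predict_box, _refine_box)
```

With 45 toy frames, the loop produces one box per frame and refreshes the template at frames 20 and 40.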

[0040] A further improvement of this invention is that, in step 1, the dual-stream CNN backbone network uses two ResNet50 networks with identical structures but different parameters. The two ResNet50 backbone networks take RGB images and infrared images as inputs, respectively, and output the RGB single-modal feature $F_{rgb}$ and the infrared single-modal feature $F_{t}$. The dual-stream CNN backbone network uses a Siamese (twin) structure to extract the dual-stream features of the template image and the dual-stream features of the image to be detected, respectively.

[0041] A further improvement of this invention is that, in step 2.1, the modal difference compensation module subtracts the single-modal infrared feature $F_{t} \in \mathbb{R}^{C\times H\times W}$ from the single-modal RGB feature $F_{rgb} \in \mathbb{R}^{C\times H\times W}$ to obtain the difference feature $F_{r-t} \in \mathbb{R}^{C\times H\times W}$; likewise, it subtracts $F_{rgb}$ from $F_{t}$ to obtain the difference feature $F_{t-r} \in \mathbb{R}^{C\times H\times W}$, i.e.:

[0042] $F_{r-t} = F_{rgb} - F_{t}$ (1)

[0043] $F_{t-r} = F_{t} - F_{rgb}$ (2)

[0044] Here, the difference feature $F_{r-t} \in \mathbb{R}^{C\times H\times W}$ represents the RGB modality-specific features, and the difference feature $F_{t-r} \in \mathbb{R}^{C\times H\times W}$ represents the infrared modality-specific features.

[0045] A further improvement of the present invention is that, in step 2.2, the spatial weight map $S_{r-t} \in \mathbb{R}^{1\times H\times W}$ is generated through a convolutional layer and a sigmoid function to reflect the spatial differences between the RGB and infrared modalities; the convolutional layer here consists of a 3×3 convolution, a batch normalization layer, and a ReLU activation function. The spatial weight maps $S_{r-t} \in \mathbb{R}^{1\times H\times W}$ and $S_{t-r} \in \mathbb{R}^{1\times H\times W}$ reflect the spatial locations of the RGB modality-specific features and the infrared modality-specific features, respectively. Meanwhile, the channel weight map $C_{r-t} \in \mathbb{R}^{C\times 1\times 1}$ ($C_{t-r} \in \mathbb{R}^{C\times 1\times 1}$) is generated by a pooling layer, consisting of global average pooling and global max pooling, followed by a sigmoid function, to reflect the differences between the RGB and infrared modalities in the target category; the channel weight maps $C_{r-t}$ and $C_{t-r}$ reflect the target categories of the RGB modality-specific features and the infrared modality-specific features, respectively. Finally, the difference feature weight map $W_{r-t}$ ($W_{t-r}$) is generated by element-wise multiplication of the spatial weight map $S_{r-t}$ ($S_{t-r}$) and the channel weight map $C_{r-t}$ ($C_{t-r}$). Taking the difference feature weight map $W_{r-t}$ as an example, the weight generation process is:

[0046] $S_{r-t} = \sigma(\mathrm{conv}(F_{r-t}))$ (3)

[0047] $C_{r-t} = \sigma(\mathrm{GAP}(F_{r-t}) + \mathrm{GMP}(F_{r-t}))$ (4)

[0048] $W_{r-t} = S_{r-t} \otimes C_{r-t}$ (5)

[0049] Here, conv(*) denotes a convolutional layer consisting of a 3×3 kernel, a batch normalization layer (BatchNorm), and a ReLU activation function; σ(*) denotes the sigmoid function; GAP(*) denotes global average pooling; GMP(*) denotes global max pooling; and $\otimes$ denotes element-wise multiplication. The difference feature weight maps $W_{r-t}$ and $W_{t-r}$ reflect the spatial location and target category of the RGB modality-specific features and the infrared modality-specific features, respectively.

[0050] A further improvement of this invention is that, in step 3.1, the cross-modal RGBT feature $F_{rgbt}$, the original single-modal RGB feature $F_{rgb}$, and the original single-modal infrared feature $F_{t}$ are concatenated and passed through a convolution to obtain the fused feature $F_{c} \in \mathbb{R}^{2C\times H\times W}$, i.e.:

[0051] $F_{c} = \mathrm{conv}(\mathrm{cat}(F_{rgbt}, F_{rgb}, F_{t}))$ (9)

[0052] Here, cat(*) denotes the concatenation operation, and conv(*) denotes a convolution with a kernel size of 1×1.

[0053] A further improvement of the present invention is that, in step 3.2, given the fused feature $F_{c} \in \mathbb{R}^{2C\times H\times W}$ as input, global average pooling and global max pooling are first applied simultaneously to obtain a more refined feature descriptor containing the global information of each channel; the channel weights $W_{c} \in \mathbb{R}^{2C\times 1\times 1}$ are then generated by a fast one-dimensional convolution with a kernel size of 3 followed by a sigmoid function. The channel attention is computed as:

[0054] $W_{c} = \sigma(\mathrm{C1D}(\mathrm{GAP}(F_{c}) + \mathrm{GMP}(F_{c})))$ (10)

[0055] Here, C1D(*) denotes the one-dimensional convolution. After the channel weights $W_{c}$ are obtained, the fused feature $F_{c}$ is multiplied by $W_{c}$ to give the channel-level feature selection output $F_{cc} \in \mathbb{R}^{2C\times H\times W}$, i.e.:

[0056] $F_{cc} = W_{c} \otimes F_{c}$ (11)

[0057] A further improvement of this invention is that, in step 3.3, after the channel-level feature selection output $F_{cc} \in \mathbb{R}^{2C\times H\times W}$ is obtained, average pooling and max pooling operations are applied along the channel axis to generate an effective feature descriptor; a spatial attention map $W_{s} \in \mathbb{R}^{1\times H\times W}$ is then generated through concatenation, convolution, and the sigmoid function. The spatial attention is computed as:

[0058] $W_{s} = \sigma(\mathrm{conv}(\mathrm{cat}(\mathrm{Avgpool}(F_{cc}), \mathrm{Maxpool}(F_{cc}))))$ (12)

[0059] Here, Avgpool(*) denotes the average pooling operation along the channel axis, and Maxpool(*) denotes the max pooling operation along the channel axis. After the spatial weights $W_{s}$ are obtained, the channel-level feature selection output $F_{cc}$ is multiplied by $W_{s}$ to give the spatial feature selection output $F_{cs} \in \mathbb{R}^{2C\times H\times W}$, i.e.:

[0060] $F_{cs} = W_{s} \otimes F_{cc}$ (13)

[0061] A further improvement of this invention is that, in step 5.1, the target feature representation and the search feature representation are first extracted using a reference branch and a test branch, respectively. The inputs of the reference branch are the reference frame features output by the feature selection module and the target bounding box annotation $B_0$; this branch consists of a convolutional layer and a PrPool and returns the RoI target features. The test branch extracts the RoI search features from the features of the frame to be detected output by the feature selection module and the bounding box estimate $B = (c_x, c_y, \lambda w, \lambda h)$, where $(c_x, c_y)$ are the center coordinates of the bounding box, $w$ and $h$ are the estimated width and height of the bounding box, and $\lambda > 1$ is a scaling factor used to expand the candidate region so that it covers the entire target. After the RoI target features and the RoI search features are obtained, the quadratic regression network based on boundary localization uses cross-correlation to adjust the bounding box estimate $B$: the cross-correlation takes the RoI target features and the RoI search features as inputs and returns a score map reflecting the similarity between the target features and the search features. Taking the 2D position with the highest score as the center and combining it with the width and height of the bounding box estimate $B$, a new bounding box $B_1$ is generated. The new bounding box $B_1$ and the search features after two convolutional layers are then input into PrPool to obtain a new search feature $F_{p} \in \mathbb{R}^{256\times 7\times 7}$. The new search feature $F_{p}$ contains the target features, with the target located at the center of $F_{p}$.

[0062] A further improvement of the present invention is that, in step 5.2, the feature $F_{p}$ is first aggregated along the x-axis and the y-axis separately and further refined through 1×3 and 3×1 convolutional layers to extract the horizontal and vertical features $F_{x} \in \mathbb{R}^{1\times 7}$ and $F_{y} \in \mathbb{R}^{7\times 1}$. The horizontal and vertical features are then upsampled and each split evenly into two parts to obtain the boundary features $F_{l} \in \mathbb{R}^{1\times 7}$, $F_{r} \in \mathbb{R}^{1\times 7}$, $F_{t} \in \mathbb{R}^{7\times 1}$, and $F_{d} \in \mathbb{R}^{7\times 1}$. For each boundary feature, a classifier and a regressor are applied simultaneously: the classifier takes the boundary feature as input and outputs a confidence map over the boundary response locations, and the regressor takes the boundary feature as input and outputs the offset between the target boundary location and the ground truth, which refines the predicted bounding box location.
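
The re-centering in step 5.1 can be illustrated with a small NumPy sketch. It covers only the cross-correlation score map and the construction of the new box $B_1$; the convolutional layers and PrPool are omitted. The function name `recenter_box`, the default `lam=2.0`, and the rule that maps a score-map position back to image coordinates are assumptions made for illustration, since the text does not specify them.

```python
import numpy as np

def recenter_box(target_feat, search_feat, box, lam=2.0):
    """Re-center box = (cx, cy, w, h) using a cross-correlation score map.

    target_feat: (C, k, k) RoI target features.
    search_feat: (C, S, S) RoI search features, assumed to be pooled from
                 the region of size (lam * w, lam * h) centered at (cx, cy).
    """
    _, k, _ = target_feat.shape
    _, s, _ = search_feat.shape
    n = s - k + 1                              # number of valid positions per axis
    score = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Similarity between the target and one window of the search region.
            score[i, j] = np.sum(target_feat * search_feat[:, i:i + k, j:j + k])
    # 2D position with the highest score.
    i_best, j_best = np.unravel_index(np.argmax(score), score.shape)
    cx, cy, w, h = box
    # Assumption: one search-feature cell spans (lam*w/S, lam*h/S) image units,
    # and the middle of the score map corresponds to the old center.
    cell_w, cell_h = lam * w / s, lam * h / s
    mid = (n - 1) / 2.0
    new_cx = cx + (j_best - mid) * cell_w
    new_cy = cy + (i_best - mid) * cell_h
    # B1 keeps the width and height of B and only moves the center.
    return (new_cx, new_cy, w, h), score
```

In this sketch the width and height of $B$ are carried over unchanged, which matches the description that $B_1$ combines the highest-scoring position with the width and height of the estimate $B$.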

[0063] The RGBT target tracking method based on modal difference compensation disclosed in this invention has the following advantages compared with the prior art:

[0064] 1) This invention fully utilizes the differences between visible light and infrared image information, designs a modal difference compensation module to obtain robust cross-modal features, and improves RGBT tracking performance by simultaneously considering cross-modal features and single-modal (RGB and infrared) characteristics. Extensive experimental results demonstrate that the tracking method of this invention achieves superior performance compared to existing tracking methods.

[0065] 2) This invention proposes a modal difference compensation module, which effectively captures cross-modal information from RGB and infrared images through a compensation and refusion strategy.

[0066] 3) Based on channel and spatial attention mechanisms, this invention designs a feature selection module that adaptively selects cross-modal and single-modal features with strong discriminative power for more accurate tracking.

[0067] 4) This invention proposes a simple yet effective boundary-based quadratic regression module. After readjusting the initial bounding box to ensure the target is centered within it, a dedicated network branch is used for boundary localization on each edge. This module allows the tracker to obtain more accurate box estimates and can handle some tracking failures in the first-stage regression.

Attached Figure Description

[0068] Figure 1 is the overall network framework of the RGBT target tracking method based on modal difference compensation disclosed in this invention;

[0069] Figure 2 is a schematic diagram of the modal difference compensation module in the tracking method disclosed in this invention;

[0070] Figure 3 is a schematic diagram of the feature selection module in the tracking method disclosed in this invention;

[0071] Figure 4 is a schematic diagram of the boundary-based quadratic regression network in the tracking method disclosed in this invention;

[0072] Figure 5 is a schematic diagram illustrating the qualitative tracking results of the tracking method disclosed in this invention.

Detailed Implementation

[0073] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings.

[0074] Referring to the framework diagram and the schematic diagrams of each module of this invention (Figure 1, Figure 2, Figure 3, Figure 4), an RGBT target tracking method based on modal difference compensation includes the following steps:

[0075] Step 1: Construct a two-stream CNN backbone network:

[0076] The backbone network extracts deep feature representations for the subsequent modules. Here, the two streams of the CNN backbone use ResNet50 networks with the same structure but different parameters. The two ResNet50 backbone networks take RGB and infrared images as inputs, respectively, and output the RGB single-modal feature $F_{rgb}$ and the infrared single-modal feature $F_{t}$. The dual-stream CNN backbone network employs a Siamese (twin) structure to extract the dual-stream features of the template image and the dual-stream features of the image to be detected separately.

[0077] Step 2: Construct the modal difference compensation module:

[0078] As shown in Figure 3, the modal difference compensation module employs a compensation and re-fusion strategy. First, it compensates for the difference information of the two modalities separately. Then, it fuses the compensated RGB and infrared features through element-wise summation. Specifically, the modal difference compensation module subtracts the single-modal infrared feature $F_{t} \in \mathbb{R}^{C\times H\times W}$ from the single-modal RGB feature $F_{rgb} \in \mathbb{R}^{C\times H\times W}$ to obtain the difference feature $F_{r-t} \in \mathbb{R}^{C\times H\times W}$; likewise, it subtracts $F_{rgb}$ from $F_{t}$ to obtain the difference feature $F_{t-r} \in \mathbb{R}^{C\times H\times W}$, i.e.:

[0079] $F_{r-t} = F_{rgb} - F_{t}$ (1)

[0080] $F_{t-r} = F_{t} - F_{rgb}$ (2)

[0081] Here, the difference feature $F_{r-t} \in \mathbb{R}^{C\times H\times W}$ is the RGB modality-specific feature representation, and the difference feature $F_{t-r} \in \mathbb{R}^{C\times H\times W}$ is the infrared modality-specific feature representation.
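
Equations (1) and (2) can be illustrated with a short NumPy sketch. The function name `difference_features` and the toy feature shapes are assumptions chosen for illustration; in the method the inputs would be the ResNet50 feature maps.

```python
import numpy as np

def difference_features(f_rgb, f_t):
    """Return (F_{r-t}, F_{t-r}) for two feature maps of shape (C, H, W)."""
    f_rt = f_rgb - f_t   # Eq. (1): RGB modality-specific features
    f_tr = f_t - f_rgb   # Eq. (2): infrared modality-specific features
    return f_rt, f_tr

# Toy features standing in for the backbone outputs.
rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((4, 8, 8))
f_t = rng.standard_normal((4, 8, 8))
f_rt, f_tr = difference_features(f_rgb, f_t)
```

Since $F_{t-r} = -F_{r-t}$, the two difference features carry the same magnitudes with opposite signs; any asymmetry between the two branches comes from the weight generation step that follows.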

[0082] Then, taking the difference features $F_{r-t}$ and $F_{t-r}$ as inputs, two weight generation networks with the same structure are used to generate the difference feature weights $W_{r-t}$ and $W_{t-r}$. Unlike previous weight generation methods that use only a spatial weight map or only a channel weight map, the weight generation network uses both to obtain more compensation information. Specifically, the spatial weight map $S_{r-t} \in \mathbb{R}^{1\times H\times W}$ is generated through a convolutional layer and a sigmoid function to reflect the difference between the RGB and infrared modalities in spatial position. The convolutional layer here consists of a convolution with a 3×3 kernel, a batch normalization layer, and a ReLU activation function. Like the difference features $F_{r-t}$ and $F_{t-r}$, the spatial weight maps $S_{r-t} \in \mathbb{R}^{1\times H\times W}$ and $S_{t-r} \in \mathbb{R}^{1\times H\times W}$ reflect the spatial positions of the RGB modality-specific features and the infrared modality-specific features, respectively. At the same time, the channel weight map $C_{r-t} \in \mathbb{R}^{C\times 1\times 1}$ ($C_{t-r} \in \mathbb{R}^{C\times 1\times 1}$) is generated through a pooling layer, composed of global average pooling and global max pooling, followed by a sigmoid function, to reflect the difference between the RGB and infrared modalities in the target category. Similarly, the channel weight maps $C_{r-t}$ and $C_{t-r}$ reflect the target categories of the RGB modality-specific features and the infrared modality-specific features, respectively. Finally, the difference feature weight map $W_{r-t}$ ($W_{t-r}$) is generated by element-wise multiplication of the spatial weight map $S_{r-t}$ ($S_{t-r}$) and the channel weight map $C_{r-t}$ ($C_{t-r}$). Taking the difference feature weight map $W_{r-t}$ as an example, the weight generation process can be expressed as:

[0083] $S_{r-t} = \sigma(\mathrm{conv}(F_{r-t}))$ (3)

[0084] $C_{r-t} = \sigma(\mathrm{GAP}(F_{r-t}) + \mathrm{GMP}(F_{r-t}))$ (4)

[0085] $W_{r-t} = S_{r-t} \otimes C_{r-t}$ (5)

[0086] Here, conv(*) denotes a convolutional layer consisting of a 3×3 kernel, a batch normalization layer (BatchNorm), and a ReLU activation function; σ(*) denotes the sigmoid function; GAP(*) denotes global average pooling; GMP(*) denotes global max pooling; and $\otimes$ denotes element-wise multiplication. The difference feature weight maps $W_{r-t}$ and $W_{t-r}$ reflect the spatial location and target category of the RGB modality-specific features and the infrared modality-specific features, respectively.
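
The weight generation described above can be sketched in NumPy as follows. This is an illustrative reading, not the patented implementation: the helper names are hypothetical, the convolution kernel is passed in rather than learned, and batch normalization is omitted for brevity. The element-wise product of the spatial map (1×H×W) and the channel map (C×1×1) is realized through broadcasting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_to_one(x, kernel, bias=0.0):
    """3x3 convolution with zero padding, mapping (C, H, W) to (1, H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.full((h, w), bias, dtype=float)
    for dy in range(3):
        for dx in range(3):
            # Channel-wise weighted sum of the input shifted by (dy, dx).
            out += np.tensordot(kernel[:, dy, dx],
                                padded[:, dy:dy + h, dx:dx + w], axes=1)
    return out[None]

def difference_weight(f_diff, kernel):
    """Return (S, C, W) for one difference feature of shape (C, H, W)."""
    # Spatial weight map: conv -> ReLU -> sigmoid (BatchNorm omitted here).
    s = sigmoid(np.maximum(conv3x3_to_one(f_diff, kernel), 0.0))   # (1, H, W)
    # Channel weight map: sigmoid(GAP + GMP).
    gap = f_diff.mean(axis=(1, 2), keepdims=True)
    gmp = f_diff.max(axis=(1, 2), keepdims=True)
    c = sigmoid(gap + gmp)                                         # (C, 1, 1)
    # Difference feature weight map: element-wise product via broadcasting.
    return s, c, s * c                                             # (C, H, W)
```

Under this literal reading, the ReLU precedes the sigmoid, so the spatial weights in the sketch always fall in the range [0.5, 1].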

[0087] After the difference feature weight maps $W_{r-t}$ ($W_{t-r}$) are obtained, the compensated RGB feature $\hat{F}_{rgb}$ and the compensated infrared feature $\hat{F}_{t}$ are obtained through cross-modal residual connections, i.e.:

[0088]

[0089]

[0090] As shown in formula (6), the compensated RGB feature $\hat{F}_{rgb}$ contains infrared modality-specific features in addition to the original single-modal RGB feature $F_{rgb}$. Similarly, as shown in formula (7), the compensated infrared feature $\hat{F}_{t}$ contains RGB modality-specific features together with the original single-modal infrared feature $F_{t}$. The compensated RGB feature and the compensated infrared feature are then fused by simple addition to obtain the final fused cross-modal RGBT feature $F_{rgbt} \in \mathbb{R}^{C\times H\times W}$, i.e.:

[0091] $F_{rgbt} = \hat{F}_{rgb} + \hat{F}_{t}$ (8)

[0092] As shown in Equation (8), the final fused feature is obtained from the compensated features rather than from the original single-modal features, which improves the discriminative ability of RGBT tracking in the subsequent stages.
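
The compensation and re-fusion strategy can be sketched as below. The exact form of the cross-modal residual connection is an assumption made for illustration: each modality's feature is summed with the other modality's specific (difference) feature scaled by the corresponding weight map. Only the final addition follows directly from the description of the fused feature; the function name is hypothetical.

```python
import numpy as np

def compensate_and_fuse(f_rgb, f_t, w_rt, w_tr):
    """Return (compensated RGB, compensated infrared, fused RGBT) features."""
    f_rt = f_rgb - f_t                 # RGB modality-specific features
    f_tr = f_t - f_rgb                 # infrared modality-specific features
    # ASSUMED form: RGB feature plus weighted infrared-specific features.
    f_rgb_hat = f_rgb + w_tr * f_tr
    # ASSUMED form: infrared feature plus weighted RGB-specific features.
    f_t_hat = f_t + w_rt * f_rt
    # Fusion by element-wise addition of the two compensated features.
    f_rgbt = f_rgb_hat + f_t_hat
    return f_rgb_hat, f_t_hat, f_rgbt
```

Under this assumed form, zero weights reduce the fusion to plain addition of the two single-modal features, so the weight maps control how much cross-modal information each branch receives.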

[0093] Step 3: Construct the feature selection module:

[0094] The attention-based feature selection module further adaptively selects highly discriminative cross-modal and single-modal features to improve RGBT tracking performance. As shown in Figure 4, the feature selection module selects highly discriminative features from all modal features (cross-modal RGBT features, original single-modal RGB features, and original single-modal infrared features) through three steps.

[0095] Step 3.1: All modal features are fused in order to extract more information from the cross-modal RGBT features, the original single-modal RGB features, and the original single-modal infrared features. Specifically, the cross-modal RGBT feature $F_{rgbt}$, the original single-modal RGB feature $F_{rgb}$, and the original single-modal infrared feature $F_{t}$ are concatenated and passed through a convolution to obtain the fused feature $F_{c} \in \mathbb{R}^{2C\times H\times W}$, i.e.:

[0096] $F_{c} = \mathrm{conv}(\mathrm{cat}(F_{rgbt}, F_{rgb}, F_{t}))$ (9)

[0097] Here, cat(*) denotes the concatenation operation, and conv(*) denotes a convolution with a kernel size of 1×1.
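
As an illustration of Step 3.1, the sketch below concatenates the three features along the channel axis (3C channels) and applies a 1×1 convolution that reduces them to 2C channels. A 1×1 convolution is a per-pixel linear map across channels, so it is written here with `np.einsum`. The function name, the explicit `weight` and `bias` arguments, and the presence of a bias term are assumptions for illustration.

```python
import numpy as np

def fuse_all_modalities(f_rgbt, f_rgb, f_t, weight, bias):
    """Concatenate three (C, H, W) features and apply a 1x1 convolution.

    weight: (2C, 3C) matrix of the 1x1 convolution; bias: (2C,) vector.
    """
    stacked = np.concatenate([f_rgbt, f_rgb, f_t], axis=0)       # (3C, H, W)
    # 1x1 convolution: the same channel-mixing matrix at every pixel.
    return np.einsum('oc,chw->ohw', weight, stacked) + bias[:, None, None]
```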

[0098] Step 3.2: Channel-level feature selection, which aims to enhance features of the categories relevant to the target while suppressing useless features. Specifically, given the fused feature $F_c \in \mathbb{R}^{2C \times H \times W}$ as input, global average pooling and global max pooling are first applied simultaneously to obtain a more refined feature descriptor containing the global information of each channel. The channel weights $W_c \in \mathbb{R}^{2C \times 1 \times 1}$ are then generated by a fast one-dimensional convolution with a kernel size of 3 followed by a sigmoid function. In short, channel attention is computed as follows:

[0099] $W_c = \sigma(\mathrm{C1D}(\mathrm{GAP}(F_c) + \mathrm{GMP}(F_c)))$ (10)

[0100] Here, C1D(*) denotes the one-dimensional convolution. After the channel weights $W_c$ are obtained, the fused feature $F_c$ is multiplied by the weights $W_c$ to yield the channel-level feature selection output $F_{cc} \in \mathbb{R}^{2C \times H \times W}$, namely:

[0101] $F_{cc} = F_c \otimes W_c$ (11)
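The channel-level selection of formulas (10) and (11) can be sketched in NumPy as follows. The 3-tap `kernel` stands in for the learned one-dimensional convolution, and zero padding at the two ends of the channel axis is an assumption; both the function names and the padding choice are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_select(f_c, kernel):
    """Channel attention: GAP + GMP, 1-D conv (size 3) over channels, sigmoid.

    f_c    : fused feature of shape (C2, H, W)
    kernel : 3 taps of the 1-D convolution (assumed given)
    """
    gap = f_c.mean(axis=(1, 2))            # global average pooling, (C2,)
    gmp = f_c.max(axis=(1, 2))             # global max pooling, (C2,)
    desc = gap + gmp                       # per-channel descriptor
    padded = np.pad(desc, 1)               # zero-pad so the length is preserved
    conv = np.array([padded[i:i + 3] @ kernel for i in range(desc.size)])
    w_c = sigmoid(conv)                    # channel weights in (0, 1)
    return f_c * w_c[:, None, None], w_c   # reweighted feature and weights
```

Because the convolution slides over neighbouring channels only, the number of parameters stays at three regardless of the channel count, which is what makes this form of channel attention cheap.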

[0102] Step 3.3: Spatial-level feature selection, which aims to increase the weight of the target's spatial location and suppress locations far from the target. Specifically, after the channel-level feature selection output $F_{cc} \in \mathbb{R}^{2C \times H \times W}$ is obtained, average pooling and max pooling are applied along the channel dimension to generate an effective feature descriptor. A spatial attention map $W_s \in \mathbb{R}^{1 \times H \times W}$ is then generated through concatenation, convolution, and the sigmoid function. In short, spatial attention is computed as follows:

[0103] $W_s = \sigma(\mathrm{conv}(\mathrm{cat}(\mathrm{Avgpool}(F_{cc}) + \mathrm{Maxpool}(F_{cc}))))$ (12)

[0104] Here, Avgpool(*) denotes the average pooling operation along the channel dimension, and Maxpool(*) denotes the max pooling operation along the channel dimension. After the spatial weights $W_s$ are obtained, the channel-level feature selection output $F_{cc}$ is multiplied by the spatial weights $W_s$ to yield the spatial-level feature selection output $F_{cs} \in \mathbb{R}^{2C \times H \times W}$, namely:

[0105] $F_{cs} = F_{cc} \otimes W_s$ (13)
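A minimal NumPy sketch of the spatial-level selection is given below. It follows the prose description (pool along channels, concatenate, convolve, apply the sigmoid). The kernel size of this convolution is not stated in the text, so a 1×1 convolution that maps the two pooled maps to one is assumed here; `conv_w` and `conv_b` are hypothetical parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_select(f_cc, conv_w, conv_b=0.0):
    """Spatial attention over a (C2, H, W) feature.

    conv_w : 2 weights of an assumed 1x1 convolution (avg map, max map)
    conv_b : scalar bias of that convolution
    """
    avg = f_cc.mean(axis=0, keepdims=True)     # average pooling along channels, (1, H, W)
    mx = f_cc.max(axis=0, keepdims=True)       # max pooling along channels, (1, H, W)
    desc = np.concatenate([avg, mx], axis=0)   # (2, H, W) spatial descriptor
    logits = np.tensordot(conv_w, desc, axes=(0, 0)) + conv_b  # (H, W)
    w_s = sigmoid(logits)[None]                # spatial weights, (1, H, W)
    return f_cc * w_s, w_s                     # broadcast over all channels
```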

[0106] Through the above three steps, the feature selection module simultaneously utilizes fused cross-modal RGBT features and single-modal (RGB and infrared) features to adaptively select highly discriminative features in both spatial and channel dimensions.

[0107] Step 4: Construct the discriminative model prediction tracking network:

[0108] The discriminative model prediction tracking network takes the template image features and the target image features as inputs and produces two outputs: a classification score map and an IoU prediction. The classification score map is computed from the target image features and a filter f. The filter f is produced by a model initializer, consisting of a convolutional layer and precise ROI pooling, and a model optimizer that solves for the final model using steepest descent (SD). The model initializer solves for the model filter f using multiple samples from the initial training set. The IoU predictor predicts the IoU between the deep features of the image and the candidate bounding box features, and the bounding box is then estimated by maximizing the predicted IoU.
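To make the steepest-descent step concrete, the sketch below optimizes a linear filter with an exact line-search step size. The text only names steepest descent and does not give the loss, so a ridge-regularized least-squares loss is swapped in here purely as a stand-in for the discriminative learning loss; the function name and all parameters are hypothetical.

```python
import numpy as np

def steepest_descent_filter(x, y, reg=0.1, steps=5):
    """Minimise ||x f - y||^2 + reg * ||f||^2 by steepest descent.

    x : (N, D) training sample features, y : (N,) target scores.
    The step size is the exact minimiser along the negative gradient.
    """
    dim = x.shape[1]
    a = x.T @ x + reg * np.eye(dim)   # half of the (constant) Hessian
    b = x.T @ y
    f = np.zeros(dim)                 # stand-in for the initializer output
    for _ in range(steps):
        grad = a @ f - b              # half of the gradient
        denom = grad @ a @ grad
        if denom <= 1e-30:            # already at the optimum
            break
        alpha = (grad @ grad) / denom # optimal step length for a quadratic
        f = f - alpha * grad
    return f
```

For a quadratic loss this step length is optimal along the gradient direction, so a handful of iterations already gives a usable filter, which is the practical appeal of steepest descent over a fixed learning rate.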

[0109] Step 5: Construct a quadratic regression network based on boundary localization:

[0110] The boundary-localized quadratic regression network performs a second regression on the IoU prediction result obtained in step (4) to obtain a more accurate target box. It can effectively refine the initial estimate of the IoUNet predictor and significantly improve tracking performance. As shown in Figure 5, the network improves target tracking performance in two steps.

[0111] Step 5.1: The quadratic regression network based on boundary localization uses the cross-correlation between the target feature representation and the search feature representation to readjust the center of the bounding box estimate, producing a readjusted bounding box estimate in which the target lies at the center. Specifically, a reference branch and a test branch extract the target feature representation and the search feature representation, respectively. The reference branch takes as input the reference frame features output by the feature selection module and the target bounding box annotation B0, and returns the RoI target features; this branch consists of a convolutional layer and a PrPool. The test branch takes the features of the frame to be detected, output by the feature selection module, together with the bounding box estimate $B = (c_x, c_y, \lambda w, \lambda h)$, and extracts the RoI search features. Here, $(c_x, c_y)$ are the center coordinates of the bounding box, w and h are the estimated width and height of the bounding box, and λ (λ > 1) is a scaling factor used to expand the candidate region boundary so that it covers the entire target. Because the test branch extracts features for boundary prediction, which is a more complex task, it uses more layers and a higher pooling resolution than the reference branch. After the RoI target features and the RoI search features are obtained, the network applies cross-correlation to adjust the bounding box estimate B. The cross-correlation takes the RoI target features and the RoI search features as input and returns a score map that reflects the similarity between the target features and the search features. A readjusted bounding box B1 is then generated, centered on the two-dimensional location with the highest score and using the width and height of the bounding box estimate B.
Then, the readjusted bounding box B1 and the search features, after passing through two convolutional layers, are input into PrPool to obtain a new search feature $F_p \in \mathbb{R}^{256 \times 7 \times 7}$. The new search feature $F_p$ contains the target features, with the target located at the center of $F_p$.
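The recentering step can be sketched as a plain sliding-window cross-correlation. This is a simplified illustration only: the box is expressed in feature-map coordinates, the RoI features are taken as given arrays, and the function name `recenter_box` is hypothetical.

```python
import numpy as np

def recenter_box(target_feat, search_feat, box):
    """Move the box center to the peak of the cross-correlation score map.

    target_feat : (C, h, w) RoI target features
    search_feat : (C, H, W) RoI search features, with H >= h and W >= w
    box         : (cx, cy, width, height); only the center is changed
    """
    _, h, w = target_feat.shape
    _, H, W = search_feat.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            scores[i, j] = np.sum(window * target_feat)  # similarity at (i, j)
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    cx = j + (w - 1) / 2.0   # center of the best-matching window (x)
    cy = i + (h - 1) / 2.0   # center of the best-matching window (y)
    return (cx, cy, box[2], box[3]), scores
```

The width and height are carried over unchanged, matching the description that only the center of the estimate B is readjusted at this stage.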

[0112] Step 5.2: A boundary localization scheme is applied to the search feature $F_p$ obtained in step 5.1 to improve localization accuracy. The scheme uses a simple classifier to locate each boundary and a simple regressor to predict the offset between the target boundary position and the ground truth. Specifically, the feature $F_p$ is first aggregated along the x-axis and the y-axis, respectively, and further refined by 1×3 and 3×1 convolutional layers to extract the horizontal and vertical features $F_x \in \mathbb{R}^{1 \times 7}$ and $F_y \in \mathbb{R}^{7 \times 1}$. The horizontal and vertical features $F_x$ and $F_y$ are then upsampled and each split evenly into two parts to obtain the boundary features $F_l \in \mathbb{R}^{1 \times 7}$, $F_r \in \mathbb{R}^{1 \times 7}$, $F_t \in \mathbb{R}^{7 \times 1}$, and $F_d \in \mathbb{R}^{7 \times 1}$. A simple classifier and a simple regressor are applied to each boundary feature simultaneously. The classifier takes the boundary feature as input and outputs a confidence map of the response location of each boundary, while the regressor takes each boundary feature as input and outputs the offset between the target boundary location and the ground truth to refine the bounding box location prediction.
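The shape bookkeeping of this step can be sketched in NumPy. The sketch is deliberately simplified and rests on several assumptions: aggregation is done by averaging, the 1×3 and 3×1 refinement convolutions are omitted, channels are collapsed, and upsampling is nearest-neighbour by a factor of 2. The decoding rule in `decode_boundary` (most confident bin plus its regressed offset) is likewise an illustrative assumption, and both function names are hypothetical.

```python
import numpy as np

def boundary_features(f_p):
    """Split a (C, 7, 7) search feature into four 1-D boundary features."""
    f_x = f_p.mean(axis=(0, 1))    # aggregate over channels and y: horizontal profile, (7,)
    f_y = f_p.mean(axis=(0, 2))    # aggregate over channels and x: vertical profile, (7,)
    up_x = np.repeat(f_x, 2)       # nearest-neighbour upsampling to length 14
    up_y = np.repeat(f_y, 2)
    left, right = up_x[:7], up_x[7:]   # even split into left / right halves
    top, bottom = up_y[:7], up_y[7:]   # even split into top / bottom halves
    return left, right, top, bottom

def decode_boundary(confidence, offsets):
    """Boundary position = most confident bin + offset predicted for that bin."""
    k = int(np.argmax(confidence))
    return k + offsets[k]
```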

[0113] Step 6: Two-stage training:

[0114] Step 6.1: On the training dataset, the discriminative model prediction tracking network in step (4) is trained by a supervised learning mechanism by minimizing the discriminative learning loss function and the mean squared error loss function, respectively, to obtain the model parameters of the discriminative model prediction tracking network.

[0115] Step 6.2: Load and freeze the network parameters from Step 6.1, and perform supervised learning on the boundary-based quadratic regression network from Step (5) separately on the training dataset. By minimizing the mean squared error loss function, the trained network is obtained.

[0116] Step 7: Target tracking: Integrate the network into an online tracker to track visible light and infrared video data.

[0117] Step 7.1: Given a first frame with annotations, perform data augmentation on the first frame by translation, rotation and blurring to obtain an initial training set containing 15 samples.

[0118] Step 7.2: Using the initial training set image and the next frame image as input, the template fusion feature and the fusion feature to be detected are obtained by using the dual-stream CNN backbone network in step (1), the modality difference compensation module in step (2), and the feature selection module in step (3).

[0119] Step 7.3: Using the template fusion features and the fusion features to be detected as input, the initial bounding box is obtained through the discriminative model prediction tracking network in step (4).

[0120] Step 7.4: Using the template fusion features, the fusion features to be detected, and the expanded initial bounding box as input, perform secondary regression on the initial bounding box through the boundary-based secondary regression network in step (5) to obtain a more accurate target rectangular bounding box.

[0121] Step 7.5: Repeat steps 7.2-7.4 to iteratively calculate the target position and bounding box in the image frame by frame to achieve continuous RGBT target tracking.

[0122] Step 7.6: Update the initial training set every 20 frames to obtain new template features, and then continue to step 7.5.
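The control flow of steps 7.1 to 7.6 can be summarized in a short Python skeleton. The four callables (`augment`, `extract`, `predict`, `refine`) are hypothetical placeholders for the networks described in steps 1 to 5, and how the training set is refreshed every 20 frames is not specified in the text; re-augmenting the current frame with the latest box is an assumption made only for this sketch.

```python
def track_sequence(frames, init_box, extract, predict, refine, augment,
                   update_interval=20):
    """Frame-by-frame tracking loop with periodic template updates."""
    train_set = augment(frames[0], init_box)        # step 7.1: initial training set
    template = extract(train_set)                   # template fusion features
    boxes = [init_box]
    for idx in range(1, len(frames)):
        search = extract([frames[idx]])             # step 7.2: features to be detected
        coarse = predict(template, search)          # step 7.3: initial bounding box
        box = refine(template, search, coarse)      # step 7.4: second regression
        boxes.append(box)                           # step 7.5: continue frame by frame
        if idx % update_interval == 0:              # step 7.6: refresh every 20 frames
            train_set = augment(frames[idx], box)   # assumed update rule
            template = extract(train_set)
    return boxes
```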

[0123] The technical effects of the present invention will be further explained below with reference to simulation experiments:

[0124] 1. Simulation conditions: All simulation experiments were conducted on the operating system Ubuntu 16.04.5, the hardware environment was Nvidia GeForce GTX1080Ti GPU, and the PyTorch deep learning framework was used.

[0125] 2. Simulation Content and Result Analysis

[0126] As described in the above implementation scheme, the objective metrics and performance of the model were tested on the RGBT target tracking dataset RGBT234, and compared with nine other tracking algorithms. The quantitative comparison of its attributes and overall accuracy and success rate metrics is shown in Table 1. Among them:

[0127] SR denotes the success rate of target tracking and PR denotes the precision rate of target tracking. No occlusion (NO), partial occlusion (PO), heavy occlusion (HO), low illumination (LI), low resolution (LR), thermal crossover (TC), deformation (Def), fast motion (FM), scale variation (SV), motion blur (MB), camera movement (CM), and background clutter (BC) are the challenge attributes of the RGBT234 dataset. The red, green, and blue numbers in the table mark the best, second-best, and third-best tracking results, respectively.
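For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them. The text does not state the thresholds used, so the 20-pixel center-error threshold for PR and the 21 evenly spaced overlap thresholds averaged for SR are assumptions taken from common tracking-benchmark practice; boxes are assumed to be (x, y, width, height) with a top-left origin.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within the threshold (pixels)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    err = np.linalg.norm(pred_c - gt_c, axis=1)
    return float(np.mean(err <= threshold))

def success_rate(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean over thresholds of the fraction of frames with IoU above each one."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred, gt)])
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))
```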

[0128] Table 1 shows the quantitative tracking results on the test dataset RGBT234 and the comparison results with other advanced trackers.

[0129]

[0130] As shown in Table 1, on the RGBT234 dataset the present invention significantly outperforms the compared state-of-the-art trackers (including RGB and RGBT trackers) in most cases and achieves the best overall tracking performance. This indicates that the present invention can fully exploit the complementary information between the two modalities to handle a variety of complex situations. The qualitative comparison results are shown in Figure 5. In sequence child1, the present invention locates the target accurately and copes better with occlusion and motion blur than the other methods. Sequence dog1 shows that the present invention handles occlusion and background clutter better. Sequence kite2 shows that the present invention effectively suppresses the interference caused by camera movement, ensuring more robust target localization. In sequence elecbikewithlight1, the initial target in the RGB modality contains strong illumination, which dominates most algorithms; when the illumination returns to normal, most of them suffer from model drift and lose the target. The present invention effectively suppresses this noisy information and therefore localizes the target more accurately, which shows that it can fully exploit the potential of modal differences and single-modal features. In summary, the above comparisons show that the present invention makes better use of the information from both modalities to handle complex challenges.

[0131] Although the present invention has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to it, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims

1. An RGBT target tracking method based on modal difference compensation, characterized in that it includes the following steps:
Step 1: Construct a dual-stream CNN backbone network;
Step 2: Construct a modal difference compensation module;
Step 2.1: The modal difference compensation module adopts a compensate-then-fuse strategy: it first compensates the difference features of the RGB modality and the infrared modality respectively, and then fuses the compensated RGB and infrared features by element-wise summation. The module subtracts the single-modal infrared features from the single-modal RGB features to obtain one difference feature, and subtracts the single-modal RGB features from the single-modal infrared features to obtain the other difference feature, as given by formulas (1) and (2); the former difference feature represents the RGB modality-specific features, and the latter represents the infrared modality-specific features;
Step 2.2: Taking the two difference features as input, two identical weight generation networks generate the corresponding difference feature weight maps. The weight generation network uses both a spatial weight map and a channel weight map to obtain more compensation information;
Step 2.3: After the difference feature weight maps are obtained, the compensated RGB features and the compensated infrared features are obtained through cross-modal residual connections, as given by formulas (6) and (7). The compensated RGB features contain infrared modality-specific features in addition to the original single-modal RGB features; similarly, as shown in formula (7), the compensated infrared features contain RGB modality-specific features in addition to the original single-modal infrared features. The compensated RGB features and the compensated infrared features are added together and fused to obtain the final fused cross-modal RGBT features, as given by formula (8);
Step 3: Construct the feature selection module. The attention-based feature selection module further adaptively selects highly discriminative cross-modal and single-modal features to improve RGBT tracking performance. The feature selection module selects highly discriminative features from all modal features, including the cross-modal RGBT features, the original single-modal RGB features, and the original single-modal infrared features, in three steps;
Step 3.1: Fusion of all modal features, which aims to gather more information from the cross-modal RGBT features, the original single-modal RGB features, and the original single-modal infrared features;
Step 3.2: Channel-level feature selection, which aims to enhance features of the categories relevant to the target and suppress useless features;
Step 3.3: Spatial-level feature selection, which aims to increase the weight of the target's spatial location and suppress locations far from the target;
Step 4: Construct the discriminative model prediction tracking network. The discriminative model prediction tracking network takes the template image features and the target image features as inputs and produces two outputs: a classification score map and an IoU prediction. The classification score map is computed from the target image features and a filter f. The filter f is produced by a model initializer, consisting of a convolutional layer and precise ROI pooling, and a model optimizer that solves for the final model. The model initializer solves for the model filter f using multiple samples from the initial training set. The IoU prediction is the predicted IoU between the deep features of the image and the candidate bounding box features, and the bounding box is then estimated by maximizing the predicted IoU;
Step 5: Construct a quadratic regression network based on boundary localization. The boundary-localized quadratic regression network performs a second regression on the IoU prediction result obtained in step 4 to obtain a more accurate target box. It can effectively refine the initial estimate of the IoUNet predictor and significantly improve tracking performance, and it further improves target tracking performance in the following two steps;
Step 5.1: The quadratic regression network based on boundary localization uses the cross-correlation between the target feature representation and the search feature representation to readjust the center of the bounding box estimate, obtaining a readjusted bounding box estimate in which the target lies at the center;
Step 5.2: A boundary localization scheme is applied to the search features obtained in step 5.1 to improve localization accuracy. A classifier is used to locate each boundary separately, and a regressor is used to predict the offset between the target boundary position and the ground truth;
Step 6: Two-stage training;
Step 6.1: On the training dataset, the discriminative model prediction tracking network from step 4 is trained with supervised learning by minimizing the discriminative learning loss function and the mean squared error loss function, respectively, to obtain the model parameters of the discriminative model prediction tracking network;
Step 6.2: Load and freeze the network parameters from step 6.1, and train the boundary-localized quadratic regression network from step 5 separately on the training dataset with supervised learning. The trained network is obtained by minimizing the mean squared error loss function;
Step 7: Target tracking: Integrate the network into an online tracker to track visible light and infrared video data;
Step 7.1: Given an annotated first frame, perform data augmentation on the first frame by translation, rotation, and blurring to obtain an initial training set containing 15 samples;
Step 7.2: Using the initial training set images and the next frame image as input, obtain the template fusion features and the fusion features to be detected, respectively, through the dual-stream CNN backbone network in step 1, the modal difference compensation module in step 2, and the feature selection module in step 3;
Step 7.3: Using the template fusion features and the fusion features to be detected as input, obtain the initial bounding box through the discriminative model prediction tracking network in step 4;
Step 7.4: Using the template fusion features, the fusion features to be detected, and the expanded initial bounding box as input, perform a second regression on the initial bounding box through the boundary-localized quadratic regression network in step 5 to obtain a more accurate rectangular target bounding box;
Step 7.5: Repeat steps 7.2-7.4 to iteratively calculate the target position and bounding box frame by frame, achieving continuous RGBT target tracking;
Step 7.6: Update the initial training set every 20 frames to obtain new template features, and then continue to step 7.5.

2. The RGBT target tracking method based on modal difference compensation according to claim 1, characterized in that, in step 1, the dual-stream CNN backbone network uses two ResNet50 networks with identical structures but different parameters. The two ResNet50 backbone networks take RGB images and infrared images as inputs, respectively, and output single-modal RGB features and single-modal infrared features. The dual-stream CNN backbone network uses a twin (Siamese) structure to extract the dual-stream features of the template image and the dual-stream features of the image to be detected.

3. The RGBT target tracking method based on modal difference compensation according to claim 1, characterized in that, in step 2.2, the spatial weight map is generated by a convolutional layer and a sigmoid function to reflect the spatial differences between the RGB and infrared modalities; the convolutional layer here consists of a 3×3 convolution operation, a batch normalization layer, and a ReLU activation function. The two spatial weight maps reflect the spatial locations of the RGB modality-specific features and the infrared modality-specific features, respectively. The channel weight map is generated by a pooling layer, consisting of global average pooling and global max pooling, and a sigmoid function to reflect the differences between the RGB and infrared modalities in target category; the two channel weight maps reflect the target categories of the RGB modality-specific features and the infrared modality-specific features, respectively. Finally, the difference feature weight map is generated by element-wise multiplication of the spatial weight map and the channel weight map. The weight generation process is given by formulas (3), (4), and (5), whose operators are: a convolutional layer consisting of a 3×3 kernel, a batch normalization layer (BatchNorm), and a ReLU activation function; the sigmoid function; the global average pooling operation; the global max pooling operation; and element-wise multiplication. The two difference feature weight maps reflect the spatial location and target category of the RGB modality-specific features and the infrared modality-specific features, respectively.

4. The RGBT target tracking method based on modal difference compensation according to claim 3, characterized in that, in step 3.1, the cross-modal RGBT features $F_{rgbt}$, the original single-modal RGB features $F_{rgb}$, and the original single-modal infrared features $F_t$ are concatenated and convolved to obtain the fused features $F_c$, namely: $F_c = \mathrm{conv}(\mathrm{cat}(F_{rgbt}, F_{rgb}, F_t))$ (9), where cat(*) denotes the concatenation operation and conv(*) denotes a convolution operation with a kernel size of 1×1.

5. The RGBT target tracking method based on modal difference compensation according to claim 4, characterized in that, in step 3.2, given the fused features $F_c$ as input, global average pooling and global max pooling are first applied simultaneously to obtain a more refined feature descriptor containing the global information of each channel; the channel weights $W_c$ are then generated by a fast one-dimensional convolution with a kernel size of 3 followed by a sigmoid function. Channel attention is computed as follows: $W_c = \sigma(\mathrm{C1D}(\mathrm{GAP}(F_c) + \mathrm{GMP}(F_c)))$ (10), where C1D(*) denotes the one-dimensional convolution. After the channel weights $W_c$ are obtained, the fused features $F_c$ are multiplied by the weights $W_c$ to obtain the channel-level feature selection output $F_{cc}$, namely: $F_{cc} = F_c \otimes W_c$ (11).

6. The RGBT target tracking method based on modal difference compensation according to claim 5, characterized in that, in step 3.3, after the channel-level feature selection output $F_{cc}$ is obtained, average pooling and max pooling are applied along the channel dimension to generate an effective feature descriptor; a spatial attention map $W_s$ is then generated through concatenation, convolution, and the sigmoid function. Spatial attention is computed as follows: $W_s = \sigma(\mathrm{conv}(\mathrm{cat}(\mathrm{Avgpool}(F_{cc}) + \mathrm{Maxpool}(F_{cc}))))$ (12), where Avgpool(*) denotes the average pooling operation along the channel dimension and Maxpool(*) denotes the max pooling operation along the channel dimension. After the spatial weights $W_s$ are obtained, the channel-level feature selection output $F_{cc}$ is multiplied by the spatial weights $W_s$ to obtain the spatial-level feature selection output $F_{cs}$, namely: $F_{cs} = F_{cc} \otimes W_s$ (13).

7. The RGBT target tracking method based on modal difference compensation according to claim 6, characterized in that, in step 5.1, the target feature representation and the search feature representation are first extracted using a reference branch and a test branch, respectively. The reference branch takes as input the reference frame features output by the feature selection module and the target bounding box annotation, and returns the RoI target features; this branch consists of a convolutional layer and a PrPool. The test branch takes the features of the frame to be detected, output by the feature selection module, together with the bounding box estimate $B = (c_x, c_y, \lambda w, \lambda h)$, and extracts the RoI search features, where $(c_x, c_y)$ are the center coordinates of the bounding box, w and h are the width and height of the bounding box estimate, and λ is a scaling factor used to expand the candidate region boundary to cover the entire target, with λ > 1. After the RoI target features and the RoI search features are obtained, the boundary-localized quadratic regression network applies cross-correlation to adjust the bounding box estimate. The cross-correlation takes the RoI target features and the RoI search features as input and returns a score map that reflects the similarity between the target features and the search features. A readjusted bounding box is generated, centered on the two-dimensional location with the highest score and using the width and height of the bounding box estimate B. The readjusted bounding box and the search features, after passing through two convolutional layers, are then input into PrPool to obtain new search features $F_p$. The new search features $F_p$ contain the target features, with the target located at the center of $F_p$.

8. The RGBT target tracking method based on modal difference compensation according to claim 7, characterized in that, in step 5.2, the features $F_p$ are first aggregated along the x-axis and the y-axis, respectively, and further refined by 1×3 and 3×1 convolutional layers to extract the horizontal and vertical features $F_x$ and $F_y$. The horizontal and vertical features $F_x$ and $F_y$ are then upsampled and each split evenly into two parts to obtain the boundary features $F_l$, $F_r$, $F_t$, and $F_d$. A classifier and a regressor are applied to each boundary feature simultaneously. The classifier takes the boundary feature as input and outputs a confidence map of the response location of each boundary. The regressor takes each boundary feature as input and outputs the offset between the target boundary location and the ground truth to refine the bounding box location prediction.