Visual single target tracking method based on multi-modal information

By converting RGB-TIR images into patches of RGB and TIR modes and constructing a feature modulation module, the problem of insufficient information utilization in multimodal tracking methods under complex environments is solved, and high-precision single-target tracking is achieved.

CN122200513A · Pending · Publication Date: 2026-06-12 · XIAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIAN UNIV OF TECH
Filing Date
2026-03-25
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing multimodal tracking methods struggle to fully utilize the complementary relationship between visible light and thermal infrared information in complex environments, resulting in limited tracking performance, especially target drift or tracking failure when illumination changes.

Method used

RGB-TIR images are converted into RGB-modality and TIR-modality patches through patch embedding, and a feature modulation module is constructed and embedded in the Transformer encoder. The module exchanges feature cues between the modalities, and the prediction network is optimized with a total loss function, achieving adaptive fusion of multimodal information.

Benefits of technology

It effectively integrates multimodal information, improves tracking robustness in complex environments and all-weather operation capability, and achieves high-precision single-target tracking.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a visual single-target tracking method based on multimodal information. The method inputs an RGB-TIR image of the first frame of a video and determines a target region, converts the target region into patches of the RGB modality and the TIR modality through patch embedding, and extracts features from the patches. A feature modulation module is constructed and embedded into a Transformer encoder, and the patches of the two modalities are input to obtain the feature vectors of the two modal branches. The feature vectors are added and input into a prediction network, which predicts the target bounding box. A total loss is calculated through a total loss function and backpropagated through the prediction network iteratively until it converges, which optimizes the prediction network; the optimized network is then used for single-target tracking. The method addresses the prior-art problem that image features cannot be extracted effectively when the imaging quality of one modality is poor, which degrades the tracking result.

Description

Technical Field

[0001] This invention belongs to the field of computer vision technology and specifically relates to a visual single-target tracking method based on multimodal information.

Background Technology

[0002] The core of visual target tracking is to continuously estimate the target's spatial location, scale, and other key information in subsequent frames of a video sequence, given the target information already known in the first frame. Due to its significant application value in practical scenarios such as intelligent security and autonomous driving, this field has long received extensive research attention. However, due to the limitations of visible light imaging principles, visible light (RGB)-based tracking methods are easily affected by external factors in complex open environments. For example, performance degrades significantly when illumination changes drastically, leading to target drift or even tracking failure.

[0003] Unlike visible light cameras, which rely on reflected light for imaging, thermal infrared sensors capture the thermal energy radiated by the target itself. RGB images typically contain rich texture and color details under sufficient lighting, but they are unstable in low-light or no-light environments. In contrast, thermal infrared (TIR) images offer stronger contrast for high-temperature targets and remain effective in darkness, although their spatial resolution is lower and their texture detail is limited. Given the limitations of single-modal sensing, there is a growing trend towards multimodal target tracking. Fusing visible light and thermal infrared information lets the two complement each other, improving the system's robustness in complex environments and its all-weather operation capability.

[0004] Currently, existing multimodal tracking tasks face two challenges: First, the high cost of data annotation limits the size of available datasets, directly hindering the development of robust multimodal tracking algorithms; second, the complementary relationship between modalities is highly uncertain, as different imaging devices have varying sensitivities to environmental factors, causing the dominant position of each modality in feature representation to change with scene fluctuations.

[0005] Because pure RGB sequences are easier to obtain than RGB-T (RGB-TIR) sequence pairs, some multimodal trackers address the first limitation by pre-training on RGB sequences and then transferring to multimodal scenes with full fine-tuning. Besides full fine-tuning, some approaches introduce parameter-efficient prompt tuning into multimodal tracking by freezing the backbone network parameters and attaching a set of learnable parameters. These methods typically treat one modality (usually RGB) as the dominant modality and the other as the auxiliary modality. However, they ignore the dynamic dominance correlations of multimodal data, which makes it difficult to fully exploit complementary multimodal information in complex scenes and limits tracking performance.

Summary of the Invention

[0006] The purpose of this invention is to provide a visual single-target tracking method based on multimodal information, which solves the problem in the prior art that when the imaging quality of a certain modality image is poor, it is impossible to effectively extract image features, thereby affecting the target tracking results.

[0007] The technical solution adopted in this invention is as follows: a visual single-target tracking method based on multimodal information. The method inputs an RGB-TIR image of the first frame of a video and determines the target region. It then converts the RGB-TIR image into RGB and TIR modal patches through patch embedding, extracts features from these patches, constructs a feature modulation module, and embeds it into a Transformer encoder. The input of the RGB and TIR modal patches yields feature vectors for the two modal branches. These vectors are summed and input into a prediction network to predict the target bounding box. A total loss function is used to calculate the loss. The calculated total loss is input into the prediction network for backpropagation and iteration, minimizing the total loss until convergence. The prediction network is then optimized, and finally, single-target tracking is achieved through the optimized prediction network.

[0008] The invention is further characterized by:

[0009] Furthermore, the visual single-target tracking method based on multimodal information specifically includes the following steps:

Step 1: Input the RGB-TIR image of the first frame of the video and determine the target region. Given a template frame and a search frame for the RGB modality and the TIR modality respectively, convert them into patches through patch embedding. Then concatenate the template patches and the search patches of the same modality to obtain the combined patches of the RGB modality and the TIR modality.

Step 2: Input the combined RGB and TIR patches obtained in Step 1 into the attention feature extraction network for feature extraction, obtaining feature tokens for the RGB and TIR modalities respectively.

Step 3: Construct a feature modulation module and embed it into the attention feature extraction network; input the feature tokens of the RGB and TIR modalities to obtain the feature vectors of the two modal branches.

Step 4: Add the feature vectors of the two modal branches, reshape the sum into a two-dimensional spatial feature map, and input it into the prediction network to predict the target, obtaining the predicted bounding boxes on the RGB image and the TIR image and thus the target bounding box.

Step 5: Calculate the total loss of the input image using the total loss function, then backpropagate the total loss through the prediction network to optimize the training of the entire network. Finally, achieve single-target tracking through the optimized prediction network.

[0010] Furthermore, step 1 is detailed as follows: Step 1.1: Obtain strictly spatially aligned RGB and TIR tracking videos and manually select the target region on the first frame of the RGB tracking video. For each frame, take the coordinates of the center point of the target region and the width and height of the target region. Then, centered on that point, extract a square region whose side length is given by formula (1):

[0011] where [symbol missing] denotes the padding amount. If the square region extends beyond the image, the excess portion is padded with the image mean. The square region is then scaled to [size missing] to obtain the target region for each frame. Since the RGB and TIR tracking videos are spatially aligned, the coordinate parameters of the target region obtained on the first frame of the RGB tracking video are consistent with those of the target region on the first frame of the TIR tracking video, and the target region of each modality is used as its search frame.
Step 1.2: Given a template frame for each of the RGB and TIR modalities and a search frame obtained from step 1.1, where H is the height of the image and W is the width of the image, convert them into patches through patch embedding, where P is the length and width of a patch and the template frame and the search frame each yield their own number of patches. A trainable linear projection layer then projects the patches into the latent space, giving the template patches and search patches of the corresponding modality.
Step 1.3: Add learnable 1D position embeddings to the template patches and search patches of RGB and TIR respectively to obtain the template token patches and the search token patches, as shown in formulas (2), (3), (4), and (5):

[0012] where the projection is a trainable linear projection layer; the position embeddings are learnable 1D position embeddings for the RGB and TIR template patches and search patches, respectively; the latent dimension is the output dimension of the trainable linear projection layer; and the patch counts are the numbers of patches of the template frame and the search frame, respectively. Subsequently, the template patches and the search patches of the same modality are concatenated according to formula (6) to obtain the combined RGB and TIR patches:

[0013] where the two operands are the template token patches of RGB and TIR and the search token patches of RGB and TIR, respectively.
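Steps 1.2 and 1.3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame sizes, patch size, latent dimension, the 3-channel TIR input, and the choice to share one projection between the template and search frames are all assumptions, and the names `patchify` and `embed` are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)    # (N, P*P*C)

def embed(image, patch, proj, pos):
    """Linear projection into the latent space plus learnable position embedding."""
    return patchify(image, patch) @ proj + pos

rng = np.random.default_rng(0)
P, C, D = 16, 3, 32                       # assumed patch size, channels, latent dim
n_z, n_x = (128 // P) ** 2, (256 // P) ** 2   # patch counts for assumed frame sizes

proj = rng.normal(0.0, 0.02, (P * P * C, D))  # stands in for the trainable projection
pos_z = rng.normal(0.0, 0.02, (n_z, D))       # template position embedding
pos_x = rng.normal(0.0, 0.02, (n_x, D))       # search position embedding

combined = {}
for modality in ("rgb", "tir"):
    template = rng.random((128, 128, C))      # placeholder template frame
    search = rng.random((256, 256, C))        # placeholder search frame
    tokens_z = embed(template, P, proj, pos_z)
    tokens_x = embed(search, P, proj, pos_x)
    # Concatenate template and search tokens of the same modality.
    combined[modality] = np.concatenate([tokens_z, tokens_x], axis=0)
```

Each entry of `combined` has one row per patch (template rows first, then search rows) and one column per latent dimension.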

[0014] Furthermore, in step 2, the combined RGB and TIR patches obtained in step 1 are fed into the attention feature extraction network for feature extraction. The network is a two-branch, weight-sharing, independent ViT network with the following structure: it consists of 12 encoder layers, each made up of two sub-layer connection structures. The first contains a multi-head attention sub-layer, which learns information from different representation subspaces simultaneously and so captures long-range dependencies between features. It is followed by a normalization layer, which stabilizes training and accelerates convergence, and a residual connection that adds the sub-layer's input directly to its output. The second contains a feed-forward network (FFN) sub-layer, which applies a non-linear transformation to the features to further enhance their expressive power; it is likewise followed by a normalization layer and a residual connection. The components of the feature extraction network are Layer Normalization, Multi-Head Attention, Dropout, Layer Normalization, MLP Block, and Dropout. The multi-head attention sub-layer is implemented as follows. First, the input feature matrix of the layer is multiplied by the weight matrices corresponding to the query Q, key K, and value V to give Q, K, and V; the values of these weight matrices are learned and updated continuously, as shown in formula (7).

[0015] where the scaling term is a dimension parameter. The input first passes through a linear transformation layer and then enters the Scaled Dot-Product Attention layer, where Q and K are matrix-multiplied to obtain the attention scores. The Scale layer then performs the dimensional scaling shown in the formula above. Finally, the weight matrix produced by softmax is multiplied by V to obtain the attention output. The same operation is performed three times before the final Linear output, which completes the multi-head attention layer, as shown in formula (8).

[0016] where the left-hand side is the output computed by a given attention head; Q, K, and V are the input query, key, and value matrices; and the three weight matrices are the learnable matrices that this attention head uses to linearly project Q, K, and V. The final output of the multi-head attention layer is shown in formula (9).

[0017] where the concatenation operation joins the outputs of all individual attention heads along the feature dimension; its arguments are the output matrices of the first through the last attention head; the head count is the total number of heads in the multi-head attention mechanism; and the output weight matrix is a learnable matrix that applies the final linear transformation to the concatenated multi-head features. The output of the multi-head attention layer is then processed by the feedforward layer, which has a similar residual connection and layer normalization. The feedforward layer consists of two fully connected layers with a ReLU activation function and a Dropout layer inserted between them, as shown in formula (10):

[0018] where the input to the feedforward network is usually the output of the previous sub-layer after its residual connection and layer normalization; the two weight matrices are the learnable weights of the first and second fully connected layers, respectively; the two bias terms are the learnable biases of those layers; and the remaining expression is the ReLU activation function. The feature tokens of the RGB and TIR modalities extracted by a given encoder layer are as shown in formula (11):

[0019] where the index denotes the encoder layer number and the operator is the Transformer encoder layer of the ViT feature extraction network.
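Formulas (7) to (10) are not reproduced in this text. Assuming they follow the standard Transformer definitions, which match the symbol descriptions in step 2, they would take the form below. The symbol names d_k, h, W_i^Q, W_i^K, W_i^V, W^O, W_1, W_2, b_1, and b_2 are the conventional ones and are not confirmed by the source; Dropout is omitted from the feedforward expression.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{7} \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \tag{8} \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \tag{9} \\
\mathrm{FFN}(x) &= \max(0,\; x W_1 + b_1)\, W_2 + b_2 \tag{10}
\end{align}
```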

[0020] Furthermore, step 3 is detailed as follows. The feature modulation module has the following architecture: from bottom to top, it consists of a down-projection (dimensionality reduction) layer, a linear projection layer, and an up-projection layer. A given feature token is first reduced to [dimensionality value missing] dimensions by the down-projection layer, then passed through the linear projection layer, and then projected back to the original dimension by the up-projection layer. The result is fed as a feature cue to the Transformer encoder layer of the other modality. Through this simple structure, the feature modulation module exchanges feature cues between the RGB and TIR modalities for multimodal tracking. The feature modulation module is embedded in the layers of the Transformer encoder, spanning the encoders of both modalities. In each encoder, a given layer integrates the modality-specific features of the preceding layer with complementary information from the other modality, so each encoder learns feature cues from the other modality layer by layer. The feature modulation module adopts a modular design and is embedded in the multi-head attention sub-layer and the feedforward layer respectively. Its working principle is explained for the case of the TIR branch providing auxiliary information to the RGB branch; the same principle applies when the RGB branch provides auxiliary information to the TIR branch. A given layer of the RGB branch integrates auxiliary information from the TIR branch through the feature modulation module, as shown in formulas (12) and (13):

[0021] where the two operators are the multi-head self-attention block and the feature modulation module, respectively, and the output of the feature modulation module is the feature cue extracted from the TIR modality. The RGB tokens are layer-normalized and fed into the multi-head attention sub-layer, and the result is combined with the feature cue to obtain the intermediate tokens. In the next stage, the intermediate tokens are fed into the feedforward layer, and the result is added to the feature cue to obtain the output of that layer of the RGB encoder, as shown in formulas (14) and (15):

[0022] where the left-hand side is the output of the given layer of the RGB encoder, the feature cue is extracted from the TIR modality, and the remaining symbol is the number of patches. After the RGB and TIR patches pass through the 12 Transformer encoder layers with the feature modulation module, the feature vectors are obtained, as shown in formula (16):

[0023] where the inputs are the template patches and search patches obtained in step 1, and the operator is the 12-layer Transformer encoder feature extraction network with the feature modulation module.
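The bottleneck structure and the cross-modal exchange described in step 3 can be sketched as follows. Formulas (12) to (15) are not reproduced in this text, so the exact way the cue is combined is an assumption: here each sub-layer output is the sum of the residual input, the sub-layer result, and a cue computed from the other branch's tokens. The class `FeatureModulation`, the hidden size, the stand-in `attn` and `ffn` callables, and the reuse of the same two modules in both directions are all hypothetical.

```python
import numpy as np

class FeatureModulation:
    """Down-projection -> linear projection -> up-projection (biases omitted)."""

    def __init__(self, dim, hidden, rng):
        self.down = rng.normal(0.0, 0.02, (dim, hidden))
        self.mid = rng.normal(0.0, 0.02, (hidden, hidden))
        self.up = rng.normal(0.0, 0.02, (hidden, dim))

    def __call__(self, tokens):
        # Reduce the dimension, transform, then project back to the original size.
        return tokens @ self.down @ self.mid @ self.up


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder_layer(x_own, x_other, attn, ffn, mod_attn, mod_ffn):
    """One encoder layer of a branch that receives cues from the other modality."""
    # Attention stage: residual + attention output + cue (assumed combination).
    mid = x_own + attn(layer_norm(x_own)) + mod_attn(x_other)
    # Feedforward stage: residual + feedforward output + cue (assumed combination).
    return mid + ffn(layer_norm(mid)) + mod_ffn(x_other)


rng = np.random.default_rng(0)
dim, hidden, n_tokens = 16, 4, 6          # assumed sizes for illustration
attn = lambda t: 0.5 * t                  # stand-in for multi-head attention
ffn = lambda t: np.maximum(0.0, t)        # stand-in for the feedforward block
mod_attn = FeatureModulation(dim, hidden, rng)
mod_ffn = FeatureModulation(dim, hidden, rng)

x_rgb = rng.normal(size=(n_tokens, dim))
x_tir = rng.normal(size=(n_tokens, dim))
rgb_out = encoder_layer(x_rgb, x_tir, attn, ffn, mod_attn, mod_ffn)  # TIR cues RGB
tir_out = encoder_layer(x_tir, x_rgb, attn, ffn, mod_attn, mod_ffn)  # RGB cues TIR
```

Because the module is a small bottleneck, it adds few parameters relative to the encoder layer it is attached to, which is consistent with the "simple structure" the description emphasizes.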

[0024] Furthermore, step 4 is as follows: the features of the two modal branches are summed and fed into the classification and regression network for target prediction, giving the predicted bounding boxes on the RGB and TIR images, as shown in formula (17):

[0025] where the output is the predicted bounding boxes on the RGB and TIR images, the operator is the prediction network, the inputs are the features of the two modal branches, and the remaining symbol is the number of patches.

[0026] Furthermore, in step 4, the prediction network consists of two parts: a box prediction head and an IoU prediction head. (1) Box prediction head: a fully convolutional localization head predicts the probability distributions of the box corners. The fused features are passed through several Conv-BN-ReLU layers to generate the probability distributions of the top-left and bottom-right corners, and the expected value of each corner's distribution is taken as the predicted bounding box coordinates. (2) IoU prediction head: the IoU prediction head supports online template updates and handles target deformation during tracking. It learns the IoU score between the ground-truth box and the predicted box; during inference, the tracker selects a reliable online template based on the predicted IoU score. The sum of the feature vectors of the two modal branches is reshaped into a two-dimensional spatial feature map and fed into a fully convolutional network (FCN) made up of stacked Conv-BN-ReLU layers for each output. The outputs of the FCN are a target classification score map, a local offset that compensates for the discretization error caused by the reduced resolution, and the normalized bounding box size, i.e., width and height. The position with the highest classification score is taken as the target position, and the final target bounding box is obtained as shown in formula (18):

[0027] where the four quantities are the x-coordinate of the top-left corner of the bounding box, the y-coordinate of the top-left corner of the bounding box, the width of the bounding box, and the height of the bounding box.
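The decoding described in step 4 (pick the highest-scoring cell, correct it with the local offset, attach the normalized size) can be sketched in plain Python. Formula (18) is not reproduced in this text, so two points are assumptions: that the corrected peak marks the box center, and that the top-left corner is obtained by subtracting half the width and height. The function name `decode_box` and the nested-list map layout are hypothetical.

```python
def decode_box(score, offset_x, offset_y, size_w, size_h):
    """Return (x, y, w, h) normalized to [0, 1], with (x, y) the top-left corner.

    All inputs are 2-D nested lists of the same shape (rows x cols).
    """
    rows, cols = len(score), len(score[0])
    # Position with the highest classification score.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: score[rc[0]][rc[1]],
    )
    # The local offset compensates for the discretization of the score map.
    cx = (best_c + offset_x[best_r][best_c]) / cols
    cy = (best_r + offset_y[best_r][best_c]) / rows
    w = size_w[best_r][best_c]
    h = size_h[best_r][best_c]
    # Assumption: the peak is the box center, so shift by half the size.
    return (cx - w / 2, cy - h / 2, w, h)


# Usage on a 4 x 4 map whose peak sits at row 1, column 2.
score = [[0.0] * 4 for _ in range(4)]
score[1][2] = 0.9
half = [[0.5] * 4 for _ in range(4)]
width = [[0.25] * 4 for _ in range(4)]
height = [[0.5] * 4 for _ in range(4)]
box = decode_box(score, half, half, width, height)
```

Multiplying the normalized result by the search-region size would map the box back to pixel coordinates.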

[0028] Furthermore, step 5 is detailed as follows. The classification loss is a weighted focal loss. For each ground-truth target center and its corresponding low-resolution equivalent point, a Gaussian kernel is used to generate the ground-truth heatmap, as shown in formula (19):

[0029] where the left-hand side is the ground-truth heatmap value at a spatial location, representing the probability or confidence score that the target center lies at that pixel, ranging from [values missing]; the location is the coordinates of a pixel on the feature map; the center is the coordinate of the target's true center point mapped onto the low-resolution feature map; and the standard deviation adapts to the target size. The Gaussian-weighted focal loss is calculated as shown in formula (20):

[0030] where the left-hand side is the classification loss based on center-point prediction; the predicted term is the value of the heatmap predicted by the network at a given location; the target term is the ground-truth heatmap value calculated by formula (19); and the two remaining symbols are hyperparameters of the focal loss, set to [values missing]. In addition, to regress the target size and bounding box accurately, the prediction network uses an L1 loss and a generalized intersection-over-union (GIoU) loss. The L1 loss measures the absolute error between the attributes of the predicted bounding box and the ground-truth label, as shown in formula (21):

[0031] where the count is the number of patches, the predicted terms are the attribute values predicted by the network, and the target terms are the corresponding ground-truth label values. To constrain the spatial overlap between the predicted and ground-truth bounding boxes more accurately, and to avoid the vanishing gradient that occurs when the two boxes do not intersect, the GIoU loss is introduced, as shown in formula (22):

[0032] where the IoU term is the intersection over union of the predicted bounding box and the ground-truth bounding box; the enclosing region is the smallest rectangle that contains both boxes; and the remaining term is the area of the enclosing region that belongs to neither box. Combining the three losses above, the total loss function is shown in formula (23):

[0033] where the weighting coefficients are set to [values missing]; the left-hand side is the total loss used to optimize the entire tracking network; the classification term is the center-point classification loss calculated by formula (20); the GIoU term is the generalized intersection-over-union loss calculated by formula (22); the L1 term is the mean absolute error calculated by formula (21); and the weighting coefficients balance the three loss terms, which differ in magnitude. Finally, the gradient of the loss with respect to all learnable parameters in the prediction network is computed by backpropagation, and the AdamW optimizer, combined with a learning-rate decay strategy, iteratively updates the weights of the prediction network. Iterating over the training set minimizes the total loss until the model converges, so that the prediction network learns a robust RGB-T multimodal feature representation and achieves high-precision single-target tracking.
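The loss terms of step 5 can be sketched in plain Python. Formulas (19) to (22) and the total loss are not reproduced in this text, so the forms below are the widely used ones (Gaussian heatmap, Gaussian-weighted focal loss, mean absolute error, GIoU) and are assumptions about the patent's exact expressions. The focal-loss hyperparameters and the weighting coefficients are left as arguments because their values are missing from the text; boxes are given as corner coordinates (x1, y1, x2, y2) for simplicity, and the two-coefficient weighting in `total_loss` is also an assumption.

```python
import math

def gaussian_heatmap(rows, cols, cx, cy, sigma):
    """Ground-truth heatmap: a Gaussian centered on the target center."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(cols)] for y in range(rows)]

def focal_loss(pred, target, alpha, beta, eps=1e-12):
    """Gaussian-weighted focal loss over a predicted and a ground-truth heatmap."""
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if t == 1.0:   # the exact center cell
                total += (1 - p) ** alpha * math.log(p + eps)
            else:          # all other cells, down-weighted near the center
                total += (1 - t) ** beta * p ** alpha * math.log(1 - p + eps)
    return -total

def l1_loss(pred_box, true_box):
    """Mean absolute error between predicted and ground-truth box attributes."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(pred_box)

def giou_loss(a, b):
    """1 - GIoU for two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest rectangle that encloses both boxes.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

def total_loss(cls, giou, l1, w_giou, w_l1):
    """Weighted sum of the three terms (assumed weighting scheme)."""
    return cls + w_giou * giou + w_l1 * l1
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes: for two unit squares separated by a unit gap, `giou_loss` returns 4/3 rather than saturating at 1, which is the vanishing-gradient point the description raises.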

[0034] The beneficial effects of this invention are: 1. The present invention proposes a visual cueing framework based on feature modulation module for multimodal tracking. This model can perceive the dynamic changes of the dominant modality in an open scene and effectively fuse multimodal information in an adaptive manner.

[0035] 2. The method of this invention proposes a general feature modulation module for the base tracking model. Through a simple and efficient structure, it provides cross-modal prompts for multimodal tracking, enabling robust multimodal tracking in open scenes.

Attached Figure Description

[0036] Figure 1 is a network structure diagram of the visual single-target tracking method based on multimodal information of the present invention; Figure 2 is an internal structure diagram of the Transformer module of the present invention; Figure 3 is an internal structure diagram of the feature modulation module proposed in this invention; Figure 4 shows the frame-by-frame IoU results on the Stroller video in Embodiment 8 of the present invention.

Detailed Implementation

[0037] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0038] This invention provides a visual single-target tracking method based on multimodal information, as shown in Figure 1. The specific steps are as follows. Step 1: Acquire spatially aligned RGB and TIR tracking videos and manually select the target region on the first frame of the RGB tracking video. For each frame, take the coordinates of the center point of the target region and the width and height of the target region. Then, centered on that point, extract a square region whose side length is given by formula (1):

[0039] where [symbol missing] denotes the padding amount. If the square region extends beyond the image, the excess portion is padded with the image mean. The square region is then scaled to [size missing] to obtain the target region for each frame. Since the RGB and TIR tracking videos are spatially aligned, the coordinate parameters of the target region obtained on the first frame of the RGB tracking video are consistent with those of the target region on the first frame of the TIR tracking video, and the target region of each modality is used as its search frame. Compared with RGB tracking, RGB-T tracking must handle two modalities, RGB and TIR, so the network is a two-branch framework, as shown in Figure 1. Given a template frame and a search frame for each of RGB and TIR, they are converted into patches of size [size missing] through patch embedding, with the template frame and the search frame each yielding their own number of patches. A trainable linear projection layer then projects the patches into the latent space; the output of this projection is called a patch embedding. Learnable 1D position embeddings are added to the template and search-region patch embeddings of the visible-light and infrared modalities, respectively, to generate the template token embeddings and the search-region token embeddings, as shown in formulas (2), (3), (4), and (5):

[0040] where the projection is a trainable linear projection layer; the position embeddings are learnable 1D position embeddings for the RGB and TIR template patches and search patches, respectively; the latent dimension is the output dimension of the trainable linear projection layer; and the patch counts are the numbers of patches of the template frame and the search frame, respectively. Subsequently, the RGB and TIR patch embeddings are obtained by concatenating the template patch embeddings and search patch embeddings of the same modality, as shown in formula (6):

[0041] where the two operands are the template token patches of RGB and TIR and the search token patches of RGB and TIR, respectively.
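The cropping part of Step 1 (extract a square around the target center, pad out-of-image areas with the image mean, then rescale) can be illustrated with NumPy. Formula (1) is not reproduced in this text, so the side length is passed in as an argument rather than computed. The function name `crop_square` and the nearest-neighbour resize are assumptions chosen to keep the sketch dependency-free.

```python
import numpy as np

def crop_square(image, cx, cy, side, out_size):
    """Crop a side x side square centered on (cx, cy) and resize it.

    Areas of the square that fall outside the image are filled with the
    image mean. `side` and `out_size` are integers in pixels.
    """
    h, w = image.shape[:2]
    x0 = int(round(cx - side / 2))
    y0 = int(round(cy - side / 2))

    # Start from a canvas filled with the (per-channel) image mean.
    patch = np.empty((side, side) + image.shape[2:], dtype=np.float64)
    patch[...] = image.mean(axis=(0, 1))

    # Copy the part of the square that overlaps the image.
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x0 + side, w), min(y0 + side, h)
    if sx1 > sx0 and sy1 > sy0:
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]

    # Nearest-neighbour resize to out_size x out_size.
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return patch[idx][:, idx]


# Because the two videos are spatially aligned, the same crop parameters
# can be applied to the RGB frame and the TIR frame.
frame = np.arange(16.0).reshape(4, 4)          # toy single-channel frame
corner_crop = crop_square(frame, cx=0, cy=0, side=2, out_size=2)
upscaled = crop_square(frame, cx=2, cy=2, side=2, out_size=4)
```

In `corner_crop`, three of the four cells lie outside the frame and therefore hold the frame mean.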

[0042] Step 2: The RGB and TIR tokens obtained in Step 1 are fed separately into the feature extraction network for feature extraction. The feature extraction network is a dual-branch, weight-sharing, independent ViT (Vision Transformer) network consisting mainly of 12 encoder layers, each composed of two connected sub-layers.

[0043] The first sub-layer connection structure includes a multi-head attention sub-layer, as shown in Figure 2. This sub-layer learns information from different representation subspaces simultaneously, thereby capturing long-range dependencies between features. It is followed by a layer normalization layer, which stabilizes training and accelerates convergence, and a residual connection that adds the sub-layer's input directly to its output, helping to address the vanishing gradient problem in deep networks. The second sub-layer connection structure includes a feed-forward network (FFN) sub-layer that applies a non-linear transformation to the features, further enhancing their expressive power.

[0044] Table 1: Feature extraction network parameters

[0045] Similarly, this sub-layer is followed by a normalization layer and a residual connection. The parameters of the feature extraction network are shown in Table 1 and include Layer Normalization, Multi-Head Attention, Dropout, Layer Normalization, MLP Block, and Dropout.

[0046] The multi-head attention sublayer, or self-attention mechanism, is a core component of the Transformer architecture. It allows the model to dynamically assign different attention weights between different input locations, thereby capturing long-distance dependencies in sequential data.

[0047] The specific implementation steps of the multi-head attention sublayer are shown in formula (7).

[0048] where the scaling term is a dimension parameter. The input is multiplied by the corresponding weight matrices to give Q, K, and V; the values of these weight matrices are learned and updated continuously. The input first passes through a linear transformation layer and then enters the Scaled Dot-Product Attention layer, where Q and K are matrix-multiplied (MatMul) to obtain the attention scores. The Scale layer then performs the dimensional scaling shown in the formula above. Finally, the weight matrix produced by softmax is multiplied by V to obtain the attention output. The same operation is performed three times before the final Linear output, which completes the multi-head attention layer, as shown in formula (8).

[0049] where the left-hand side is the output computed by a given attention head; Q, K, and V are the input query, key, and value matrices; and the three weight matrices are the learnable matrices that this attention head uses to linearly project Q, K, and V. The final output of the multi-head attention layer is shown in formula (9).

[0050] where the concatenation operation joins the outputs of all individual attention heads along the feature dimension; its arguments are the output matrices of the first through the last attention head; the head count is the total number of heads in the multi-head attention mechanism; and the output weight matrix is a learnable matrix that applies the final linear transformation to the concatenated multi-head features. The output of the multi-head attention layer is then processed by the feedforward layer, which has a similar residual connection and layer normalization. The feedforward layer consists of two fully connected layers with a ReLU activation function and a Dropout layer inserted between them, as shown in formula (10):

[0051] where the input to the feedforward network is usually the output of the previous sub-layer after its residual connection and layer normalization; the two weight matrices are the learnable weights of the first and second fully connected layers, respectively; the two bias terms are the learnable biases of those layers; and the remaining expression is the ReLU activation function. The feature tokens of RGB and TIR extracted by a given encoder layer are as shown in formula (11):

[0052] where l denotes the index of the encoder layer, and each encoder is a Transformer Encoder layer of the ViT feature extraction network.
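The feedforward sub-layer described above (two fully connected layers with ReLU between them) can be sketched as follows. This is a minimal sketch under stated assumptions: Dropout is omitted, as it would be at inference time, and the helper names, layer sizes, and weight values are hypothetical.

```python
def linear(x, w, b):
    """Apply y = xW + b, where w is a list of rows (in_dim x out_dim)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, value) for value in x]


def feed_forward(x, w1, b1, w2, b2):
    """Two fully connected layers with ReLU in between (Dropout omitted)."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical sizes: input 2, hidden 3, output 2.
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

result = feed_forward([1.0, -2.0], w1, b1, w2, b2)
```

In this toy case the hidden activations are [1, -2, 3], ReLU zeroes the negative entry, and the second layer maps the result to [4.5, 2.5].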

[0053] Step 3: Construct the feature modulation module. From bottom to top, the module consists of a down-projection layer, a linear projection layer, and an up-projection layer. A given feature token is input and its dimensionality is reduced to [dimensionality value missing] by the down-projection layer; it then passes through the linear projection layer and is projected back to the original dimension by the up-projection layer. The result is fed as a feature cue to the Transformer encoder layer of the other modality. With this simple structure, the feature modulation module effectively passes feature cues between the RGB and TIR modalities for multimodal tracking. The constructed module is embedded in the l-th layer of the Transformer encoders of both modalities. The l-th layer of each encoder integrates the modality-specific features from the previous layer with complementary information from the other modality, so each encoder learns feature cues from the other modality layer by layer. The module has a modular design and is embedded in both the multi-head self-attention stage and the MLP stage. Its working principle is explained in detail below for the case in which the TIR branch provides auxiliary information to the RGB branch; the case in which the RGB branch assists the TIR branch is analogous. The l-th layer of the RGB branch integrates auxiliary information from the TIR branch through the feature modulation module, as shown in formulas (12) and (13):

[0054] where the two operators denote the multi-head self-attention block and the feature modulation module, respectively, and the output of the feature modulation module is the feature cue extracted from the TIR modality. After layer normalization, the RGB tokens are fed into the multi-head attention sub-layer, and the result is combined with the feature cue to obtain the intermediate RGB feature. In the next stage, the intermediate feature is fed into a multilayer perceptron (MLP), and the MLP output is added to the intermediate feature and the feature cue to obtain the output of the l-th layer of the RGB encoder, as shown in formulas (14) and (15):

[0055] where the first term is the output of the l-th layer of the RGB encoder, the second is the feature cue extracted from the TIR modality, and the remaining symbol denotes the number of patches. The detailed architecture of the feature modulation module is shown in Figure 3; its purpose is to transfer feature cues from one modality to the other. The input token of the module is first reduced in dimension by a down-projection layer, then passed through a linear projection layer, and finally projected back up to the original dimension, serving as the feature cue fed to the Transformer encoder layer of the other modality. With this simple structure, the feature modulation module effectively passes feature cues between the RGB and TIR modalities for multimodal tracking.
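The three-stage structure of the feature modulation module (down-projection, linear projection, up-projection) can be illustrated with a small sketch. Everything concrete here is an assumption for illustration: the token dimension 4, the reduced dimension 2, the weight values, and in particular the additive way the TIR cue is merged into the RGB token, since formulas (12) to (15) are not reproduced in the text.

```python
def linear(x, w, b):
    """Apply y = xW + b, where w is a list of rows (in_dim x out_dim)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def feature_modulation(token, down, mid, up):
    """Down-projection -> linear projection -> up-projection.

    Each argument after `token` is a (weights, bias) pair.
    """
    hidden = linear(token, *down)   # reduce the dimension
    hidden = linear(hidden, *mid)   # linear projection in the reduced space
    return linear(hidden, *up)      # project back to the original dimension


# Hypothetical sizes: token dimension 4, reduced dimension 2.
down = ([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
mid = ([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0])
up = ([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], [0.0, 0.0, 0.0, 0.0])

tir_token = [1.0, 2.0, 3.0, 4.0]
cue = feature_modulation(tir_token, down, mid, up)

# Assumed merge: RGB token + its own attention output + the TIR cue.
rgb_token = [0.1, 0.2, 0.3, 0.4]
rgb_attention_out = [1.0, 1.0, 1.0, 1.0]  # stand-in for the attention output
rgb_updated = [r + a + c for r, a, c in zip(rgb_token, rgb_attention_out, cue)]
```

The cue has the same dimension as the token it modulates, which is what lets it be added inside the other branch's encoder layer without changing that layer's shape.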

[0056] After passing through the 12 Transformer encoder layers with feature modulation modules, the RGB and TIR patch embeddings yield the feature vectors of the two branches, as shown in formula (16):

[0057] where the inputs are the template patches and search patches obtained in step 1, and the mapping is the 12-layer Transformer Encoder feature extraction network with the feature modulation module.

[0058] Step 4: The multimodal features of the tracked target are extracted progressively and dynamically across the 12 Transformer encoder layers of the base model. The features of the two modal branches are then summed and fed into the classification and regression network for target prediction, yielding the predicted bounding boxes on the RGB and TIR images, as shown in formula (17):

[0059] where the outputs are the predicted bounding boxes on the RGB and TIR images, the mapping is the prediction network, its inputs are the features of the two modal branches, and the remaining symbol denotes the number of patches.

[0060] The prediction network consists of two parts: the box prediction head and the IoU prediction head.

[0061] 1) Box prediction head: RGB-T image pairs are not always precisely aligned in every frame, and the position of the target in the infrared image may differ from its position in the visible-light image, so it is difficult for the tracker to regress the target coordinates directly from the RGB-T image pair. To address this, a fully convolutional localization head predicts the probability distributions of the bounding box corners. The fused features are sent through multiple Conv-BN-ReLU layers to generate the probability distributions of the top-left and bottom-right corners, and the expectation of each corner's probability distribution is taken as the predicted bounding box coordinate.
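Taking the expectation of a corner's probability distribution can be sketched as follows. This is a minimal illustration assuming each corner is represented by a 2D probability map stored as rows of probabilities, with coordinates measured in grid cells; the function name and the toy maps are hypothetical.

```python
def corner_expectation(prob_map):
    """Return the expected (x, y) position under a 2D probability map."""
    total = sum(sum(row) for row in prob_map)
    expected_x = sum(p * x for row in prob_map for x, p in enumerate(row)) / total
    expected_y = sum(p * y for y, row in enumerate(prob_map) for p in row) / total
    return expected_x, expected_y


# Toy 3x3 maps for the top-left and bottom-right corners.
top_left_map = [
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
]
bottom_right_map = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

x1, y1 = corner_expectation(top_left_map)
x2, y2 = corner_expectation(bottom_right_map)
box = (x1, y1, x2, y2)
```

Because the expectation is a weighted average, the predicted corner can fall between grid cells, as the top-left x-coordinate of 1.5 does here.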

[0062] 2) IoU prediction head: To handle target deformation during tracking, an IoU prediction head is designed for online template updates. During training, the IoU head learns the IoU score between the ground-truth box and the predicted box; during inference, the tracker selects a reliable online template based on the predicted IoU score.

[0063] First, the sum of the features of the two modal branches is reshaped into a two-dimensional spatial feature map and fed into a fully convolutional network (FCN), which consists of stacked Conv-BN-ReLU layers for each output. The outputs of the FCN are the target classification score map, the local offset used to compensate for the discretization error caused by the reduced resolution, and the normalized bounding box size (i.e., width and height). The position with the highest classification score is taken as the target position, and the final target bounding box is obtained as shown in formula (18):

[0064] where the four components are, in order, the x-coordinate of the top-left corner of the bounding box, the y-coordinate of the top-left corner of the bounding box, the width of the bounding box, and the height of the bounding box.
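Decoding a box from the three FCN outputs can be sketched as follows. Since formula (18) is not reproduced in the text, the decoding rule here is an assumption: the peak cell plus its local offset is treated as the box center in normalized coordinates, and the top-left corner is derived by subtracting half the predicted size. All names and values are hypothetical.

```python
def decode_box(score, offset_x, offset_y, width, height):
    """Return (x_top_left, y_top_left, w, h) in normalized units.

    Assumes the peak cell plus its offset is the box center.
    """
    rows, cols = len(score), len(score[0])
    cells = ((r, c) for r in range(rows) for c in range(cols))
    peak_row, peak_col = max(cells, key=lambda rc: score[rc[0]][rc[1]])
    center_x = (peak_col + offset_x[peak_row][peak_col]) / cols
    center_y = (peak_row + offset_y[peak_row][peak_col]) / rows
    w = width[peak_row][peak_col]
    h = height[peak_row][peak_col]
    return center_x - w / 2, center_y - h / 2, w, h


def constant_map(value, rows=4, cols=4):
    """Build a rows x cols map filled with one value."""
    return [[value] * cols for _ in range(rows)]


score_map = constant_map(0.0)
score_map[1][2] = 0.9  # highest classification score at row 1, column 2

box = decode_box(
    score_map,
    constant_map(0.5),   # x offsets within a cell
    constant_map(0.5),   # y offsets within a cell
    constant_map(0.25),  # normalized widths
    constant_map(0.5),   # normalized heights
)
```

The offset maps recover the sub-cell precision lost when the score map is computed at a lower resolution than the input image.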

[0065] Step 5: The classification loss is computed with a weighted focal loss. Specifically, for each ground-truth target center and its corresponding low-resolution equivalent point, a ground-truth heatmap is generated with a Gaussian kernel, as shown in formula (19):

[0066] where the left-hand side is the ground-truth heatmap value at a given spatial location, representing the "probability" or "confidence score" that the target center lies at that pixel, with values between 0 and 1; the location is the coordinate of a pixel on the feature map; the mapped point is the position of the ground-truth target center on the low-resolution feature map; and the standard deviation adapts to the target size. The Gaussian weighted focal loss is calculated as shown in formula (20):
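Generating a ground-truth heatmap with a Gaussian kernel can be sketched as follows. Formula (19) is not reproduced in the text, so this assumes the common isotropic form exp(-((x - cx)^2 + (y - cy)^2) / (2 sigma^2)); the function name, map size, center, and sigma are hypothetical, and the rule that adapts sigma to the target size is left out.

```python
import math


def gaussian_heatmap(rows, cols, center, sigma):
    """Return a rows x cols heatmap that peaks at center = (cx, cy)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(cols)]
        for y in range(rows)
    ]


# Toy 5x7 feature map with the target center at column 3, row 2.
heatmap = gaussian_heatmap(rows=5, cols=7, center=(3, 2), sigma=1.0)
```

The value is exactly 1 at the center and decays smoothly with distance, so cells near the true center are penalized less than distant ones in the loss that follows.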

[0067] where the left-hand side is the classification loss based on center-point prediction; the predicted term is the value of the heatmap predicted by the prediction network at a given location; the target term is the ground-truth heatmap value calculated by formula (19); and the two exponents are the hyperparameters of the focal loss, each set to a fixed value. In addition, to regress the target size and bounding box accurately, the prediction network employs an L1 loss and a generalized intersection-over-union (GIoU) loss. The L1 loss measures the absolute error between the attributes of the predicted bounding box and the ground-truth label, as shown in Equation (21):
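A weighted focal loss over a predicted heatmap can be sketched as follows. Formula (20) and its hyperparameter values are not reproduced in the text, so this sketch assumes the widely used center-point form, and the defaults `alpha=2.0` and `beta=4.0` are placeholder assumptions rather than the patent's settings.

```python
import math


def weighted_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Sum the focal loss over all cells of a predicted heatmap.

    Cells where target == 1 are positives; all others are negatives
    down-weighted by (1 - target) ** beta.
    """
    loss = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                loss -= (1.0 - p) ** alpha * math.log(p)
            else:
                loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return loss


target = [
    [0.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 0.0],
]
good_prediction = [row[:] for row in target]
bad_prediction = [
    [0.9, 0.9, 0.9],
    [0.9, 0.1, 0.9],
    [0.9, 0.9, 0.9],
]

good_loss = weighted_focal_loss(good_prediction, target)
bad_loss = weighted_focal_loss(bad_prediction, target)
```

A prediction that matches the ground-truth heatmap gives a much smaller loss than one that places high confidence away from the true center.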

[0068] where the normalizing term is the number of patches, the predicted terms are the attribute values predicted by the network, and the target terms are the corresponding ground-truth label values. To constrain the spatial overlap between the predicted bounding box and the ground-truth bounding box more accurately, and to solve the vanishing-gradient problem when the two boxes do not intersect, the GIoU loss is introduced, as shown in Equation (22):

[0069] where IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; the enclosing region is the smallest rectangle that contains both boxes; and the final term is the area of the enclosing region that belongs to neither box. Finally, the total loss function is shown in Equation (23):
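The GIoU loss described above can be sketched as follows, using the standard definition GIoU = IoU - (enclosing area - union area) / enclosing area and loss = 1 - GIoU. As an assumption for the example, boxes are given as corner coordinates (x1, y1, x2, y2) with positive area rather than the top-left-plus-size form used elsewhere in the text.

```python
def giou_loss(box_a, box_b):
    """Return 1 - GIoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    iou = intersection / union

    # Smallest rectangle enclosing both boxes.
    enclosing = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    giou = iou - (enclosing - union) / enclosing
    return 1.0 - giou


identical = giou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
disjoint = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
overlapping = giou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

For the disjoint pair the IoU is zero, yet the loss still depends on how far apart the boxes are through the enclosing-area term, which is why GIoU keeps providing a gradient when the boxes do not intersect.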

[0070] where the two weighting coefficients are set to fixed values; the left-hand side is the total loss used to optimize the entire tracking network model; the classification term is the center-point classification loss calculated by formula (20); the GIoU term is the generalized intersection-over-union loss calculated by formula (22); the L1 term is the mean absolute error loss calculated by formula (21); and the weighting coefficients balance the three loss terms, which differ in magnitude. Finally, the gradients of the loss function with respect to all learnable parameters in the prediction network are calculated by backpropagation, and the AdamW optimizer, together with a preset learning-rate decay strategy, iteratively updates the weights of the prediction network. Repeated iterations over the training set minimize the total loss until the model converges, so that the prediction network learns a robust RGB-T multimodal feature representation and achieves high-precision single-target tracking.

[0071] The technical solution of the present invention will be further illustrated below through embodiments.

[0072] Example 1 A visual single-target tracking method based on multimodal information takes the RGB-TIR image of the first frame of a video as input and determines the target region. The target region is converted into RGB and TIR modality patches via patch embedding, and features are extracted from these patches. A feature modulation module is constructed and embedded into a Transformer encoder. The RGB and TIR patches are input to obtain the feature vectors of the two modal branches. These vectors are summed and input into a prediction network to predict the target bounding box. The total loss is calculated with a total loss function and backpropagated through the prediction network, and the iterations continue until the total loss is minimized and converges. Single-target tracking is then performed with the optimized prediction network.

[0073] Example 2 The visual single-target tracking method based on multimodal information specifically includes the following steps. Step 1: Input the RGB-TIR image of the first frame of the video and determine the target region. Given a template frame and a search frame for each of the RGB and TIR modalities, convert them into patches through patch embedding, then concatenate the template patches and search patches of the same modality to obtain the combined RGB and TIR patches. Step 2: Input the combined RGB and TIR patches obtained in Step 1 into the attention feature extraction network to obtain the feature tokens of the RGB and TIR modalities. Step 3: Construct a feature modulation module, embed it into the attention feature extraction network, and input the RGB and TIR feature tokens to obtain the feature vectors of the two modal branches. Step 4: Add the feature vectors of the two modal branches, reshape the sum into a two-dimensional spatial feature map, and input it into the prediction network to obtain the predicted bounding boxes on the RGB and TIR images, and thus the target bounding box. Step 5: Calculate the total loss of the input image with the total loss function and backpropagate it through the prediction network to optimize the training of the entire network. Finally, perform single-target tracking with the optimized prediction network.

[0074] Example 3 Based on Example 2, step 1 is as follows. Step 1.1: Obtain RGB and TIR tracking videos that are strictly aligned in space, and manually select the target region on the first frame of the RGB tracking video. Given the center-point coordinates of the target region in each frame and the width and height of the target region in each frame, a square region centered on that point, with a side length of [value missing], is extracted as shown in formula (1):

[0075] where the additional term denotes the padding amount. If the square region exceeds the image boundary, the excess portion is padded with the image mean. The square region is then rescaled to a fixed size to obtain the target region of each frame. Because the RGB and TIR tracking videos are spatially aligned, the coordinate parameters of the target region obtained on the first frame of the RGB tracking video are identical to those of the target region on the first frame of the TIR tracking video, and the target region of each modality is used as its search frame. Step 1.2: Given a template frame for each of the RGB and TIR modalities and the search frame obtained in step 1.1, where H is the height of the image and W is the width of the image, convert them by patch embedding into patches of size P × P, where P is the length and width of a patch; the template frame and the search frame each yield their own number of patches. A trainable linear projection layer then projects the template and search patches into a D-dimensional latent space, giving the template patches and search patches of the corresponding modality. Step 1.3: Add the learnable 1D position embeddings to the template patches and search patches of RGB and TIR, respectively, to obtain the template token patches and search token patches, as shown in formulas (2), (3), (4), and (5):

[0076] where the projection is a trainable linear projection layer; the position terms are the learnable 1D position embeddings for the RGB and TIR template patches and search patches, respectively; D is the dimension of the latent space produced by the trainable linear projection layer; and the two counts are the numbers of patches for the template frame and the search frame, respectively. The template patches and search patches of the same modality are then concatenated according to formula (6) to obtain the combined RGB and TIR patches:

[0077] where the first pair of terms are the template token patches for RGB and TIR, and the second pair are the search token patches for RGB and TIR.
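The patch bookkeeping in step 1 can be illustrated with a short calculation. The frame sizes and patch size below (a 128 × 128 template, a 256 × 256 search frame, P = 16, three channels) are hypothetical values chosen for the example, since the patent's own sizes are not reproduced in the text; the relation number of patches = (H / P) × (W / P) is the standard one for non-overlapping patches.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in a height x width frame."""
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


def flattened_patch_length(patch, channels=3):
    """Length of one flattened patch before the linear projection."""
    return patch * patch * channels


PATCH = 16                                      # hypothetical P
template_tokens = num_patches(128, 128, PATCH)  # hypothetical template frame
search_tokens = num_patches(256, 256, PATCH)    # hypothetical search frame

# Concatenating template and search tokens of one modality (formula (6)).
combined_tokens = template_tokens + search_tokens
patch_length = flattened_patch_length(PATCH)
```

With these assumed sizes each modality contributes 64 template tokens and 256 search tokens, so the encoder of each branch processes a sequence of 320 tokens.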

[0078] Example 4 Based on Example 3, step 2 is as follows. The combined RGB and TIR patches obtained in step 1 are each fed into the attention feature extraction network for feature extraction. The network is a two-branch, weight-sharing, independent ViT network with 12 encoder layers, each consisting of two sub-layer connection structures. The first structure contains a multi-head attention sub-layer, which learns information from different representation subspaces simultaneously and thereby captures long-range dependencies between features; it is followed by a normalization layer, which stabilizes training and accelerates convergence, and a residual connection that adds the input of the sub-layer directly to its output. The second structure contains a feedforward fully connected (FFN) sub-layer, which applies a non-linear transformation to the features to further enhance their expressive power; it is likewise followed by a normalization layer and a residual connection. The components of the feature extraction network are Layer Normalization, Multi-Head Attention, Dropout, Layer Normalization, MLP Block, and Dropout. The multi-head attention sub-layer is implemented as follows. First, the input feature matrix of the layer is multiplied by the weight matrices corresponding to the query, key, and value to obtain Q, K, and V, where the values of the weight matrices are learned and updated continuously, as shown in formula (7).

[0079] where d_k is a dimension parameter. The input first passes through a linear transformation layer and then enters the Scaled Dot-Product Attention layer, where the matrix multiplication of Q and K yields the attention scores. The Scale layer then rescales the scores, as in the formula above, and the weight matrix produced by softmax is multiplied by V to obtain the attention output. Performing the same operation three times before the final Linear output completes the pass through the multi-head attention layer, as shown in formula (8).

[0080] where head_i is the output computed by the i-th attention head; Q, K, and V are the input query, key, and value matrices; and W_i^Q, W_i^K, and W_i^V are the learnable weight matrices that the i-th attention head uses to linearly project Q, K, and V. The final output of the Multi-Head Attention layer is shown in Equation (9).

[0081] where Concat denotes the concatenation operation, which joins the outputs of all attention heads along the feature dimension; head_1 through head_h are the output matrices of the first through h-th attention heads; h is the total number of heads in the multi-head attention mechanism; and W^O is the learnable output weight matrix that applies the final linear transformation to the concatenated multi-head features. After a residual connection and layer normalization, the output of the multi-head attention layer is processed by a feedforward layer, which is followed by a similar residual connection and layer normalization. The feedforward layer consists of two fully connected layers with a ReLU activation function and a Dropout layer between them, as shown in Equation (10):

[0082] where x is the input to the feedforward network, usually the output of the previous sub-layer after the residual connection and layer normalization; W_1 and W_2 are the learnable weight matrices of the first and second fully connected layers of the feedforward network; b_1 and b_2 are the learnable bias terms of the two fully connected layers; and max(0, x) is the mathematical expression of the ReLU activation function. The feature tokens of the RGB and TIR modalities extracted by the l-th encoder layer are given by formula (11):

[0083] where l denotes the index of the encoder layer, and each encoder is a Transformer Encoder layer of the ViT feature extraction network.

[0084] Example 5 Based on Example 4, step 3 is as follows. The feature modulation module is constructed with the following architecture: from bottom to top, it consists of a down-projection layer, a linear projection layer, and an up-projection layer. A given feature token is input and its dimensionality is reduced to [dimensionality value missing] by the down-projection layer; it then passes through the linear projection layer and is projected back to the original dimension by the up-projection layer. The result is fed as a feature cue to the Transformer encoder layer of the other modality. With this simple structure, the feature modulation module effectively passes feature cues between the RGB and TIR modalities for multimodal tracking. The module is embedded in the l-th layer of the Transformer encoders of both modalities. The l-th layer of each encoder integrates the modality-specific features from the previous layer with complementary information from the other modality, so each encoder learns feature cues from the other modality layer by layer. The module has a modular design and is embedded in both the multi-head attention sub-layer and the feedforward layer. Its working principle is explained in detail below for the case in which the TIR branch provides auxiliary information to the RGB branch; the case in which the RGB branch assists the TIR branch follows the same principle. The l-th layer of the RGB branch integrates auxiliary information from the TIR branch through the feature modulation module, as shown in formulas (12) and (13):

[0085] where the two operators denote the multi-head self-attention block and the feature modulation module, respectively, and the output of the feature modulation module is the feature cue extracted from the TIR modality. After layer normalization, the RGB tokens are fed into the multi-head attention sub-layer, and the result is combined with the feature cue to obtain the intermediate RGB feature. In the next stage, the intermediate feature is fed into the feedforward layer, and the feedforward output is added to the intermediate feature and the feature cue to obtain the output of the l-th layer of the RGB encoder, as shown in formulas (14) and (15):

[0086] where the first term is the output of the l-th layer of the RGB encoder, the second is the feature cue extracted from the TIR modality, and the remaining symbol denotes the number of patches. After passing through the 12 Transformer encoder layers with feature modulation modules, the RGB and TIR patches yield the feature vectors of the two branches, as shown in formula (16):

[0087] where the inputs are the template patches and search patches obtained in step 1, and the mapping is the 12-layer Transformer Encoder feature extraction network with the feature modulation module.

[0088] Example 6 Based on Example 5, step 4 is as follows. The features of the two modal branches are summed and fed into the classification and regression network for target prediction, yielding the predicted bounding boxes on the RGB and TIR images, as shown in formula (17):

[0089] where the outputs are the predicted bounding boxes on the RGB and TIR images, the mapping is the prediction network, its inputs are the features of the two modal branches, and the remaining symbol denotes the number of patches.

[0090] The prediction network in step 4 consists of two parts: the box prediction head and the IoU prediction head. (1) Box prediction head: a fully convolutional localization head predicts the probability distributions of the box corners. The fused features are sent through multiple Conv-BN-ReLU layers to generate the probability distributions of the top-left and bottom-right corners, and the expected value of each corner's probability distribution is taken as the predicted bounding box coordinate. (2) IoU prediction head: the IoU prediction head is used for online template updates to handle target deformation during tracking. The IoU head learns the IoU score between the ground-truth box and the predicted box, and during inference the tracker selects a reliable online template based on the predicted IoU score. The sum of the feature vectors of the two modal branches is reshaped into a two-dimensional spatial feature map and fed into a fully convolutional network (FCN), which consists of stacked Conv-BN-ReLU layers for each output. The outputs of the FCN are the target classification score map, the local offset used to compensate for the discretization error caused by the reduced resolution, and the normalized bounding box size, i.e., width and height. The position with the highest classification score is taken as the target position, and the final target bounding box is obtained as shown in formula (18):

[0091] where the four components are, in order, the x-coordinate of the top-left corner of the bounding box, the y-coordinate of the top-left corner of the bounding box, the width of the bounding box, and the height of the bounding box.

[0092] Example 7 Based on Example 6, step 5 is as follows. The classification loss is computed with a weighted focal loss: for each ground-truth target center and its corresponding low-resolution equivalent point, a ground-truth heatmap is generated with a Gaussian kernel, as shown in formula (19):

[0093] where the left-hand side is the ground-truth heatmap value at a given spatial location, representing the "probability" or "confidence score" that the target center lies at that pixel, with values between 0 and 1; the location is the coordinate of a pixel on the feature map; the mapped point is the position of the ground-truth target center on the low-resolution feature map; and the standard deviation adapts to the target size. The Gaussian weighted focal loss is calculated as shown in formula (20):

[0094] where the left-hand side is the classification loss based on center-point prediction; the predicted term is the value of the heatmap predicted by the prediction network at a given location; the target term is the ground-truth heatmap value calculated by formula (19); and the two exponents are the hyperparameters of the focal loss, each set to a fixed value. In addition, to regress the target size and bounding box accurately, the prediction network employs an L1 loss and a generalized intersection-over-union (GIoU) loss. The L1 loss measures the absolute error between the attributes of the predicted bounding box and the ground-truth label, as shown in Equation (21):

[0095] where the normalizing term is the number of patches, the predicted terms are the attribute values predicted by the network, and the target terms are the corresponding ground-truth label values. To constrain the spatial overlap between the predicted bounding box and the ground-truth bounding box more accurately, and to solve the vanishing-gradient problem when the two boxes do not intersect, the GIoU loss is introduced, as shown in Equation (22):

[0096] where IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; the enclosing region is the smallest rectangle that contains both boxes; and the final term is the area of the enclosing region that belongs to neither box. Combining the three losses above, the total loss function is shown in formula (23):

[0097] where the two weighting coefficients are set to fixed values; the left-hand side is the total loss used to optimize the entire tracking network model; the classification term is the center-point classification loss calculated by formula (20); the GIoU term is the generalized intersection-over-union loss calculated by formula (22); the L1 term is the mean absolute error loss calculated by formula (21); and the weighting coefficients balance the three loss terms, which differ in magnitude. Finally, the gradients of the loss function with respect to all learnable parameters in the prediction network are calculated by backpropagation, and the AdamW optimizer, together with a preset learning-rate decay strategy, iteratively updates the weights of the prediction network. Repeated iterations over the training set minimize the total loss until the model converges, so that the prediction network learns a robust RGB-T multimodal feature representation and achieves high-precision single-target tracking.

[0098] Example 8 Perform steps 1 through 5. Step 1: Acquire RGB and TIR tracking videos and manually select the target region on the first frame of the RGB tracking video. Given the center-point coordinates of the target region in each frame and the width and height of the target region in each frame, a square region centered on that point is extracted, with its side length given by formula (1):

[0099] where the additional term denotes the padding amount. If the square region exceeds the image boundary, the excess portion is padded with the image mean. The square region is then rescaled to a fixed size to obtain the target region of each frame. Because the RGB and TIR tracking videos are spatially aligned, the coordinate parameters of the target region obtained on the first frame of the RGB tracking video are identical to those of the target region on the first frame of the TIR tracking video, and the target region of each modality is used as its search frame.

[0100] Given a template frame and a search frame for each of RGB and TIR, they are converted by patch embedding into patches of size [size missing]. A trainable linear projection layer then projects the template and search patches into a D-dimensional latent space. The learnable 1D position embeddings are added to the template and search-region patch embeddings of the visible-light and infrared modalities, respectively, to generate the template token embeddings and the search-region token embeddings, as shown in formulas (2), (3), (4), and (5):

[0101] where the projection is a trainable linear projection layer; the position terms are the learnable 1D position embeddings for the RGB and TIR template patches and search patches, respectively; D is the dimension of the latent space produced by the trainable linear projection layer; and the two counts are the numbers of patches for the template frame and the search frame, respectively. The sizes of the four token embeddings follow from these patch counts and the latent dimension.

[0102] The template patch embeddings and search patch embeddings of the same modality are then concatenated according to formula (6) to obtain the RGB and TIR patch embeddings, whose sizes follow from the combined patch count and the latent dimension.

[0103]

[0104] where the first pair of terms are the template token patches for RGB and TIR, and the second pair are the search token patches for RGB and TIR.

[0105] Step 2: The RGB and TIR patch embeddings obtained in Step 1 are each fed into the feature extraction network for feature extraction. The feature extraction network is a dual-branch, weight-sharing, independent ViT (Vision Transformer) network consisting mainly of 12 encoder layers, each composed of two connected sub-layers.

[0106] The ViT network pre-trained on the LaSOT dataset was selected as the feature extraction network for the Transformer tracker. The parameters of the feature extraction network are shown in Table 2.

[0107] Table 2: Parameters of the feature extraction network

[0108] Formulas (7)-(11) then yield the final deep features, in whose dimensions the leading term is the batch size.

[0109] Step 3: Construct the feature modulation module. From bottom to top, the module consists of a down-projection layer, a linear projection layer, and an up-projection layer. A given feature token is input and its dimensionality is reduced to [dimensionality value missing] by the down-projection layer; it then passes through the linear projection layer and is projected back to the original dimension by the up-projection layer. The result is fed as a feature cue to the Transformer encoder layer of the other modality. With this simple structure, the feature modulation module effectively passes feature cues between the RGB and TIR modalities for multimodal tracking. The proposed module is embedded in the l-th layer of the Transformer encoders of both modalities. The l-th layer of each encoder integrates the modality-specific features from the previous layer with complementary information from the other modality, so each encoder learns feature cues from the other modality layer by layer. The module has a modular design and is embedded in both the multi-head self-attention stage and the MLP stage. Taking the processing of the RGB branch as an example, the working principle of the module is as follows. The l-th layer of the RGB branch integrates auxiliary information from the TIR branch through the feature modulation module, as shown in formulas (12) and (13):

[0110] where the two operators denote the multi-head self-attention block and the feature modulation module, respectively, and the output of the feature modulation module is the feature cue extracted from the TIR modality. After layer normalization, the RGB tokens are fed into the multi-head attention sub-layer, and the result is combined with the feature cue to obtain the intermediate RGB feature. In the next stage, the intermediate feature is fed into a multilayer perceptron (MLP), and the MLP output is added to the intermediate feature and the feature cue to obtain the output of the l-th layer of the RGB encoder, as shown in formulas (14) and (15):

[0111] where [symbol missing] is the output of layer [layer index missing] of the RGB encoder, [symbol missing] denotes the feature cues extracted from the TIR modality, and [symbol missing] is the number of patches. After the RGB and TIR patch embeddings [symbols missing] pass through the 12 Transformer encoder layers with the feature modulation module, the feature vectors [symbols missing] are obtained according to formula (16); their sizes are [size missing] and [size missing].

[0112]

[0113] where the first symbols [symbols missing] denote the template patch and the search patch obtained in Step 1, and [symbol missing] denotes the 12-layer Transformer encoder feature extraction network with the feature modulation module.
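The bottleneck structure of Step 3 can be illustrated with a minimal sketch. Because formulas (12) to (15) are not recoverable from the text, the way the cue is combined here (residual input plus sub-layer output plus the cue from the other modality) is an assumption, as are all names (`FeatureModulation`, `modulated_sublayer`), the random initialization, and the omission of biases and layer normalization.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FeatureModulation:
    """Down-projection -> linear projection -> up-projection (hypothetical sketch)."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.down = rand_matrix(dim, hidden, rng)    # dim -> hidden
        self.mid = rand_matrix(hidden, hidden, rng)  # hidden -> hidden
        self.up = rand_matrix(hidden, dim, rng)      # hidden -> dim

    def __call__(self, tokens):
        # tokens: N x dim; the returned cue has the same N x dim shape.
        return matmul(matmul(matmul(tokens, self.down), self.mid), self.up)


def modulated_sublayer(x_rgb, x_tir, sublayer, fmm):
    """Assumed combination: residual input + sub-layer output + TIR cue."""
    cue = fmm(x_tir)
    return add(add(x_rgb, sublayer(x_rgb)), cue)


# Usage: 4 tokens of dimension 8, bottleneck width 2, identity stand-in sub-layer.
rgb = [[1.0] * 8 for _ in range(4)]
tir = [[0.5] * 8 for _ in range(4)]
fmm = FeatureModulation(dim=8, hidden=2)
out = modulated_sublayer(rgb, tir, lambda x: x, fmm)
```

The same function would be called once for the attention stage and once for the MLP stage of each layer, and symmetrically with the roles of the two branches swapped.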

[0114] Step 4: The multimodal features of the tracked target are extracted progressively and dynamically across the 12 Transformer encoder layers of the base model. The features of the two modal branches are then summed, and the sum is fed into the classification and regression network for target prediction, yielding the predicted bounding boxes on the RGB and TIR images, as shown in formula (17):

[0115] where the symbols denote, in order, the predicted bounding boxes on the RGB and TIR images, the prediction network, the features of the two modal branches, and the number of patches.

[0116] The prediction network [symbol missing] consists of two parts: the box prediction head and the IoU prediction head. (1) Box prediction head: a fully convolutional localization head predicts the probability distributions of the box corners. The fused features are passed through multiple Conv-BN-ReLU layers to generate the probability distributions of the top-left and bottom-right corners, and the expected value of each corner's distribution is taken as the predicted bounding box coordinates. (2) IoU prediction head: the IoU prediction head is used for online template updates and for handling target deformation during tracking. It learns the IoU score between the ground-truth box and the predicted box; during inference, the tracker selects a reliable online template based on the predicted IoU score. The sum of the feature vectors of the two modal branches is reshaped from a sequence into a two-dimensional spatial feature map and fed into a fully convolutional network (FCN), which consists of [number missing] stacked Conv-BN-ReLU layers for each output. The outputs of the FCN are a target classification score map [symbol missing], a local offset [symbol missing] that compensates for the discretization error caused by the reduced resolution, and the normalized bounding box size [symbol missing], i.e., width and height. The position with the highest classification score is taken as the target position, i.e. [formula missing]. The final target bounding box is obtained as shown in formula (18):

[0117] where the four components denote, in order, the x-coordinate of the top-left corner of the bounding box, the y-coordinate of the top-left corner of the bounding box, the width of the bounding box, and the height of the bounding box.
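The decoding step described above (take the highest-scoring position, add the local offset, read off the normalized size) can be sketched as follows. Formula (18) is not recoverable from the text, so the normalization by the map size and the function name `decode_box` are assumptions. The sketch returns the decoded position together with the size and does not decide whether that position is a corner or a center.

```python
def decode_box(score, offset_x, offset_y, size_w, size_h):
    """Decode one box from per-cell maps given as lists of rows.

    Returns (x, y, w, h) in normalized [0, 1] units, where (x, y) is the
    decoded position of the best cell (assumed convention, see lead-in).
    """
    rows, cols = len(score), len(score[0])
    # Position with the highest classification score.
    r, c = max(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda rc: score[rc[0]][rc[1]],
    )
    # The local offset compensates for the discretization of the coarse map.
    x = (c + offset_x[r][c]) / cols
    y = (r + offset_y[r][c]) / rows
    return (x, y, size_w[r][c], size_h[r][c])


# Usage on a 2 x 2 map whose best cell is row 1, column 0.
score = [[0.1, 0.2], [0.9, 0.3]]
half = [[0.5, 0.5], [0.5, 0.5]]
box = decode_box(score, half, half, half, half)
```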

[0118] The Stroller video sequence contains 354 frames, most of which are affected by strong light interference. Strong light falls on the target between frames 163 and 166, and between frames 92 and 105 illumination and motion blur pose additional challenges. Once the object moves out of these conditions, the algorithm re-matches the initial template frame against the search image, re-locates the object, and resumes tracking. The scene in this sequence is not complex, but the strong light interference directly alters the appearance of the tracked target. The algorithm has strong modeling capability: it fully mines the semantic information of the template features and uses different neural networks to extract features at different depths, which allows it to keep tracking the target under these appearance changes. The intersection over union (IoU) measures the overlap between the predicted tracking box and the manually labeled (ground-truth) box; it is defined as the ratio of the area of the intersection of the predicted box and the ground-truth box to the area of their union.

[0119] $\mathrm{IoU} = \dfrac{S(B_{p} \cap B_{g})}{S(B_{p} \cup B_{g})}$

[0120] where $B_{p}$ denotes the predicted box, $B_{g}$ denotes the ground-truth annotation box, and $S(\cdot)$ denotes area. A larger IoU value indicates higher accuracy of the tracking algorithm, and its value lies in $[0, 1]$. A success threshold [symbol missing] is typically introduced: tracking of a given frame is considered successful when the IoU of that frame meets the threshold. The accuracy metric for a video sequence is defined as the proportion of successfully tracked frames among all frames in the sequence.
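The evaluation metric can be sketched in a few lines. The box format `(x, y, w, h)` with a top-left corner follows the bounding box description above. The function names and the strict comparison against the threshold are assumptions, since the threshold condition itself is missing from the text.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(pred_boxes, gt_boxes, threshold):
    """Fraction of frames whose IoU exceeds the threshold (assumed strict)."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) > threshold)
    return hits / len(pred_boxes)


# Usage: one perfectly tracked frame and one frame with partial overlap.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
truth = [(0, 0, 2, 2), (1, 1, 2, 2)]
rate = success_rate(preds, truth, threshold=0.5)
```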

[0121] The frame-by-frame IoU results for the Stroller video are shown in Figure 4, which plots the IoU value of each frame in the video sequence. With the threshold taken as [threshold value missing], the figure shows that values [condition missing] occur relatively frequently, and the accuracy on the Stroller video is 72%.

[0122] On the LasHeR dataset, with the threshold taken as [threshold value missing], the tracker accuracy is 60%, so the method of the present invention also performs well on the dataset.

Claims

1. A visual single-target tracking method based on multimodal information, characterized in that: an RGB-TIR image of the first frame of a video is input and the target region is determined; the RGB-TIR image is converted into patches of the RGB and TIR modalities by patch embedding, and features are extracted from the patches; a feature modulation module is constructed and embedded into the Transformer encoder; the RGB and TIR patches are input to obtain the feature vectors of the two modal branches; the feature vectors are added and input into the prediction network, which predicts the target bounding box; the total loss is calculated with the total loss function and input into the prediction network for backpropagation and iteration, minimizing the total loss until convergence and thereby optimizing the prediction network; and single-target tracking is finally achieved with the optimized prediction network.

2. The visual single-target tracking method based on multimodal information according to claim 1, characterized in that it specifically comprises the following steps: Step 1: Input the RGB-TIR image of the first frame of the video and determine the target region. Given a template frame and a search frame for each of the RGB and TIR modalities, convert them into patches by patch embedding; then concatenate the template patch and the search patch of the same modality to obtain the combined patches of the RGB and TIR modalities. Step 2: Input the combined RGB and TIR patches obtained in Step 1 into the attention feature extraction network for feature extraction, obtaining the feature tokens [symbols missing] of the RGB and TIR modalities, respectively. Step 3: Construct a feature modulation module and embed it into the attention feature extraction network; input the feature tokens [symbols missing] of the RGB and TIR modalities to obtain the feature vectors [symbols missing] of the two modal branches. Step 4: Add the feature vectors of the two modal branches, reshape the sum into a two-dimensional spatial feature map, and input it into the prediction network to predict the target, obtaining the predicted bounding boxes on the RGB and TIR images and thereby the target bounding box. Step 5: Calculate the total loss for the input images with the total loss function, input the total loss into the prediction network for backpropagation to optimize the training of the entire network, and finally achieve single-target tracking with the optimized prediction network.

3. The visual single-target tracking method based on multimodal information according to claim 2, characterized in that Step 1 is specifically as follows: Step 1.1: Obtain spatially strictly aligned RGB and TIR tracking videos, and manually select the target region on the first frame of the RGB tracking video. Let [symbol missing] be the coordinates of the center point of the target region in each frame, and let [symbol missing] and [symbol missing] be the width and height of the target region in each frame. Then, centered on that center point, a square region of side length [value missing] is extracted, as shown in formula (1), where [symbol missing] denotes the padding amount. If the square region extends beyond the image, the excess portion is padded with the image mean. The square region of side length [value missing] is then scaled to size [size missing] to obtain the target region of each frame. Since the RGB and TIR tracking videos are spatially aligned, the coordinate position parameters of the target region obtained on the first frame of the RGB tracking video are consistent with those of the target region on the first frame of the TIR tracking video, and the target region of each modality is used as its search frame. Step 1.2: Given, for each of the RGB and TIR modalities, a template frame [symbol missing] and a search frame [symbol missing] obtained from Step 1.1, where H is the height of the image and W is the width of the image, convert them by patch embedding into patches of size [size missing], namely [symbol missing] and [symbol missing], where [symbol missing] and [symbol missing] are the numbers of patches of the template frame and the search frame, respectively, and P denotes the length and width of a patch. A trainable linear projection layer with parameters [symbol missing] then projects [symbol missing] and [symbol missing] into a latent space of dimension [symbol missing], yielding the patches of the corresponding modality, namely the template patch and the search patch. Step 1.3: The learnable 1D position embeddings [symbol missing] and [symbol missing] are added to the template patch and the search patch of RGB and TIR, respectively, to obtain the template token patch [symbol missing] and the search token patch [symbol missing], as shown in formulas (2), (3), (4), and (5), where [symbol missing] is the trainable linear projection layer; [symbol missing] and [symbol missing] are the learnable 1D position embeddings of the RGB and TIR template patches and search patches, respectively; [symbol missing] is the dimension of the latent space after processing by the trainable linear projection layer with parameters [symbol missing]; and [symbol missing] and [symbol missing] are the numbers of patches of the template frame and the search frame, respectively. The template patch and the search patch of the same modality are then concatenated according to formula (6) to obtain the combined RGB and TIR patches [symbol missing], where [symbol missing] is the template token patch of RGB and TIR and [symbol missing] is the search token patch of RGB and TIR.
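As an illustration of Step 1.2 and of the concatenation in formula (6), the following sketch splits a frame into non-overlapping P x P patches and joins the template and search sequences. It assumes a single-channel image whose sides are multiples of P, and it leaves out the trainable linear projection and the position embeddings. The function name `patchify` is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks.

    Returns H * W / patch**2 patches in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return patches


# Usage: a 4 x 4 template and an 8 x 8 search frame with 2 x 2 patches.
template = [[r * 4 + c for c in range(4)] for r in range(4)]
search = [[0] * 8 for _ in range(8)]
template_patches = patchify(template, 2)  # 4 patches
search_patches = patchify(search, 2)      # 16 patches
combined = template_patches + search_patches  # template and search joined per modality
```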

4. The visual single-target tracking method based on multimodal information according to claim 3, characterized in that, in Step 2, the combined RGB and TIR patches [symbol missing] obtained in Step 1 are input into the attention feature extraction network [symbol missing] for feature extraction. The attention feature extraction network is a two-branch, weight-shared ViT network whose branches operate independently, with the following structure: it consists of 12 encoder layers, each composed of two sub-layer connection structures. The first sub-layer connection structure includes a multi-head attention sub-layer, which learns information from different representation subspaces simultaneously and thereby captures long-range dependencies between features; it is followed by a normalization layer, which stabilizes training and accelerates convergence, and a residual connection, which adds the input of the sub-layer directly to its output. The second sub-layer connection structure includes a feedforward fully connected (FFN) sub-layer, which applies a non-linear transformation to the features to further enhance their expressive power; it is likewise followed by a normalization layer and a residual connection. The components of the feature extraction network [symbol missing] are Layer Normalization, Multi-Head Attention, Dropout, Layer Normalization, MLP Block, and Dropout. The multi-head attention sub-layer is implemented as follows. First, the input feature matrix [symbol missing] of the network layer is multiplied by the weight matrices [symbols missing] corresponding to the query, key, and value to obtain [symbols missing]; the values of these weight matrices are learned and updated continuously, as shown in formula (7), where [symbol missing] is a dimension parameter. After the inputs [symbols missing] pass through a linear transformation layer, they enter the Scaled Dot-Product Attention layer, where the attention score [symbol missing] is obtained by matrix multiplication of [symbol missing] and [symbol missing]. The Scale layer then performs dimensional scaling, as shown in the formula above. Finally, the weight matrix obtained by softmax is multiplied by V to obtain [symbol missing]. The same operation is performed three times before the final Linear output, which completes the multi-head attention layer, as shown in formula (8), where [symbol missing] is the output of the [index missing]-th attention head; [symbols missing] are the input query, key, and value matrices; and [symbols missing] are the learnable weight matrices used for linear projection in the [index missing]-th attention head. The final output of the Multi-Head Attention layer is shown in formula (9), where [symbol missing] denotes the concatenation operation, which joins the outputs of all individual attention heads along the feature dimension; [symbols missing] are the output matrices of the first through the last attention heads; [symbol missing] is the total number of heads in the multi-head attention mechanism; and [symbol missing] is the learnable output weight matrix that applies the final linear transformation to the concatenated multi-head features. The output of the multi-head attention layer is then processed by the feedforward layer, which has a similar residual connection and layer normalization; the feedforward layer consists of two fully connected layers with a ReLU activation function and a Dropout layer between them, as shown in formula (10), where [symbol missing] is the input to the feedforward network, usually the output of the previous layer after the residual connection and layer normalization; [symbols missing] are the learnable weight matrices of the first and second fully connected layers, respectively; [symbols missing] are the learnable bias terms of the two fully connected layers; and [symbol missing] is the mathematical expression of the ReLU activation function. The feature tokens of the RGB and TIR modalities extracted by the [index missing]-th encoder layer are [symbols missing], as shown in formula (11), where [symbol missing] denotes the [index missing]-th encoder and [symbol missing] is the Transformer encoder layer of the ViT feature extraction network.
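The Scaled Dot-Product Attention step named in this claim has a standard form: the query-key scores are scaled by the square root of the key dimension, passed through softmax, and used to weight the values. A minimal single-head sketch is given below. Formulas (7) to (9) are not recoverable from the text, so the learned projection matrices and the multi-head concatenation are left out, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    d_k = len(k[0])
    out = []
    for query in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k) for key in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append(
            [sum(w * val[c] for w, val in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


# Usage: a zero query attends uniformly, so the output is the mean of the values.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]])
```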

5. The visual single-target tracking method based on multimodal information according to claim 4, characterized in that Step 3 is specifically as follows: the feature modulation module is constructed with the following architecture: from bottom to top, it consists of a dimensionality-reduction (down-projection) layer, a linear projection layer, and a dimensionality-raising (up-projection) layer. A given feature token is input and reduced to [dimensionality value missing] dimensions by the down-projection layer, passed through the linear projection layer, and then projected back to its original dimension by the up-projection layer; the result is fed, as a feature cue, to the Transformer encoder layer of the other modality. With this simple structure, the feature modulation module effectively exchanges feature cues between the two modalities for multimodal tracking. The feature modulation module is embedded into layer [layer index missing] of the Transformer encoder, spanning the encoders of the two modalities. For each encoder, layer [layer index missing] integrates the modality-specific features of the previous layer with complementary information from the other modality, so that each encoder learns feature cues from the other modality layer by layer. The feature modulation module has a modular design and is embedded in the multi-head attention sub-layer and the feedforward layer, respectively. The working principle of the feature modulation module is explained by taking as an example the case in which the TIR branch provides auxiliary information to the RGB branch; the same principle applies when the RGB branch provides auxiliary information to the TIR branch. Layer [layer index missing] of the RGB branch integrates auxiliary information from the TIR branch through the feature modulation module, as shown in formulas (12) and (13), where [symbol missing] and [symbol missing] denote the multi-head self-attention block and the feature modulation module, respectively; [symbol missing] denotes the feature cues output by the feature modulation module; and [symbol missing] denotes the feature cues extracted from the TIR modality. After layer normalization, [symbol missing] is fed into the multi-head attention sub-layer, and the result is combined with [symbol missing] and [symbol missing] to obtain [symbol missing]. In the next stage, [symbol missing] is fed into the feedforward layer, and the result is added to the feature cues [symbol missing] and [symbol missing] to obtain the output [symbol missing] of layer [layer index missing] of the RGB encoder, as shown in formulas (14) and (15), where [symbol missing] is the output of layer [layer index missing] of the RGB encoder, [symbol missing] denotes the feature cues extracted from the TIR modality, and [symbol missing] is the number of patches. After the RGB and TIR patches [symbols missing] pass through the 12 Transformer encoder layers with the feature modulation module, the feature vectors [symbols missing] are obtained, as shown in formula (16), where [symbols missing] are the template patch and the search patch obtained in Step 1, and [symbol missing] is the 12-layer Transformer encoder feature extraction network with the feature modulation module.

6. The visual single-target tracking method based on multimodal information according to claim 5, characterized in that Step 4 is specifically as follows: the features [symbols missing] of the two modal branches are summed, and the sum is fed into the classification and regression network for target prediction, yielding the predicted bounding boxes on the RGB and TIR images, as shown in formula (17), where the symbols denote, in order, the predicted bounding boxes on the RGB and TIR images, the prediction network, the features of the two modal branches, and the number of patches.

7. The visual single-target tracking method based on multimodal information according to claim 6, characterized in that the prediction network [symbol missing] in Step 4 consists of two parts: the box prediction head and the IoU prediction head. (1) Box prediction head: a fully convolutional localization head predicts the probability distributions of the box corners. The fused features are passed through multiple Conv-BN-ReLU layers to generate the probability distributions of the top-left and bottom-right corners, and the expected value of each corner's distribution is taken as the predicted bounding box coordinates. (2) IoU prediction head: the IoU prediction head is used for online template updates and for handling target deformation during tracking. It learns the IoU score between the ground-truth box and the predicted box; during inference, the tracker selects a reliable online template based on the predicted IoU score. The sum of the feature vectors of the two modal branches is reshaped from a sequence into a two-dimensional spatial feature map and fed into a fully convolutional network (FCN), which consists of [number missing] stacked Conv-BN-ReLU layers for each output. The outputs of the FCN are a target classification score map [symbol missing], a local offset [symbol missing] that compensates for the discretization error caused by the reduced resolution, and the normalized bounding box size [symbol missing], i.e., width and height. The position with the highest classification score is taken as the target position, i.e. [formula missing]. The final target bounding box is obtained as shown in formula (18), where the four components denote, in order, the x-coordinate of the top-left corner of the bounding box, the y-coordinate of the top-left corner of the bounding box, the width of the bounding box, and the height of the bounding box.

8. The visual single-target tracking method based on multimodal information according to claim 7, characterized in that Step 5 is specifically as follows: the classification loss is a weighted focal loss. For each ground-truth target center [symbol missing] and its corresponding low-resolution equivalent point [symbol missing], a Gaussian kernel is used to generate the ground-truth heatmap, as shown in formula (19), where [symbol missing] is the ground-truth heatmap value at spatial location [symbol missing], representing the probability or confidence score that the target center lies at that pixel, with values in the range [range missing]; [symbol missing] are the coordinates of a pixel on the feature map; [symbol missing] is the position of the target's true center point mapped onto the low-resolution feature map; and [symbol missing] is the standard deviation adapted to the target size. The Gaussian-weighted focal loss is calculated as shown in formula (20), where [symbol missing] is the classification loss based on center-point prediction; [symbol missing] is the value of the heatmap predicted by the prediction network at location [symbol missing]; [symbol missing] is the ground-truth heatmap value calculated by formula (19); and [symbol missing] and [symbol missing] are the hyperparameters of the focal loss, set to [value missing] and [value missing]. In addition, to regress the target size and bounding box accurately, the prediction network employs the L1 loss and the generalized IoU (GIoU) loss. The L1 loss measures the absolute error between the attributes of the predicted bounding box and the ground-truth label, and is calculated as shown in formula (21), where [symbol missing] is the number of patches, [symbol missing] are the attribute values predicted by the network, and [symbol missing] are the corresponding ground-truth label values. To constrain the spatial overlap between the predicted bounding box and the ground-truth bounding box more accurately, and to solve the vanishing-gradient problem when the two do not intersect, the GIoU loss is introduced, calculated as shown in formula (22), where [symbol missing] is the intersection over union of the predicted bounding box [symbol missing] and the ground-truth bounding box [symbol missing]; [symbol missing] is the smallest enclosing rectangle that contains both [symbol missing] and [symbol missing]; and [symbol missing] is the area within the enclosing region that belongs to neither [symbol missing] nor [symbol missing]. Combining the three losses, the total loss function is shown in formula (23), where [symbol missing] = [value missing] and [symbol missing] = [value missing]; [symbol missing] is the total loss used to optimize the entire tracking network model; [symbol missing] is the center-point classification loss calculated by formula (20); [symbol missing] is the generalized IoU loss calculated by formula (22); [symbol missing] is the mean absolute error loss calculated by formula (21); and [symbol missing] and [symbol missing] are the weighting coefficients of the loss function, used to balance the three loss terms of different magnitudes. Finally, the gradient of the loss function with respect to all learnable parameters in the prediction network is calculated by the backpropagation algorithm. On this basis, the AdamW optimizer is used together with a set learning-rate decay strategy to iteratively update the weights of the prediction network. Through multiple iterations over the training set, the total loss is minimized until the model converges, so that the prediction network learns a robust RGB-T multimodal feature representation and achieves high-precision single-target tracking.
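The regression terms of the loss can be illustrated with the standard definitions of GIoU and mean absolute error. The sketch below is not a reconstruction of formulas (21) to (23): the corner box format `(x1, y1, x2, y2)`, the function names, the weights passed to `total_loss`, and the choice to leave the classification term unweighted are all assumptions. Boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned rectangle enclosing both boxes.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # IoU minus the share of the enclosing rectangle covered by neither box.
    return inter / union - (hull - union) / hull


def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth box attributes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(cls_loss, pred, target, w_giou, w_l1):
    """Weighted sum of the three terms; the weights are hypothetical inputs."""
    return cls_loss + w_giou * (1.0 - giou(pred, target)) + w_l1 * l1_loss(pred, target)


# Usage: two disjoint unit boxes still yield a finite, informative GIoU.
disjoint = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, which is zero for every pair of disjoint boxes, the GIoU value changes with the distance between them, which is why it keeps providing a gradient in that case.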