Adaptive perception-based general multi-modal target tracking system, training method and application

By using an adaptive perception-based multimodal target tracking system with a dual-stream embedding layer and a modality-independent and modality-specific feature representation module, the problem of poor scalability in multimodal tracking research is addressed, and accurate, robust target tracking is achieved in different environments.

CN118674945B · Active · Publication Date: 2026-06-12 · ANHUI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ANHUI UNIV
Filing Date
2024-07-01
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing multimodal tracking methods have poor scalability across different subtasks and cannot effectively utilize modality-independent and modality-specific information, leading to decreased tracking accuracy and parameter redundancy when the environment changes.

Method used

A general-purpose multimodal target tracking system with adaptive perception is adopted. Through a dual-stream embedding layer module, a modal perception module, and a modality-independent feature representation module, features of visible light and auxiliary modalities are extracted and fused respectively. The modality-independent and specific feature representation modules are used to improve the adaptability and robustness of the model.

Benefits of technology

It achieves universal tracking in various multimodal tasks, improves the model's ability to perceive and analyze complex environments, and enhances tracking accuracy and robustness.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a general multimodal target tracking system based on adaptive perception, together with a training method and an application. Specifically, at the model input layer, considering that the visible light modality carries richer semantic information than the infrared, depth, and event modalities, a separate embedding layer is set for the visible light modality to better preserve this information, and a shared embedding layer is set for the infrared, depth, and event modalities. This setting also preserves the flexibility of the input layer. To enable adaptive perception of the input modalities, a simple and effective modality perception module is designed, which simultaneously performs feature extraction, feature interaction, and modality perception. In multimodal tracking, each modality contains some modality-independent information, such as the target's shape, motion, and context. This information helps capture the semantic information shared between different modalities, thus helping the model understand the overall context of the target. In addition, modality-specific features, which carry the unique perspective and information of each modality, are also crucial, as they strengthen the model's ability to understand and process the overall information. By fully utilizing both modality-independent and modality-specific features, the model's perception and analysis of complex multimodal data can be improved, achieving more accurate and robust task execution.

Description

Technical Field

[0001] This invention relates to target tracking technology, specifically a general-purpose multimodal target tracking system, training method, and application based on adaptive perception.

Background Technology

[0002] In complex real-world scenarios, relying solely on RGB for single-object tracking has proven insufficient. Especially in situations with low illumination, rapid target movement, and background interference, the tracking accuracy of these methods drops significantly. Therefore, researchers have begun exploring how to extract discriminative features from other modal data to supplement RGB data, thereby enhancing the accuracy and robustness of trackers in real-world, complex environments. Based on different modal combinations, multimodal tracking can be divided into subtasks such as RGB-T (visible and infrared modal) tracking, RGB-D (visible and depth modal) tracking, and RGB-E (visible and event modal) tracking.

[0003] Current multimodal tracking research is mainly divided into two categories. (1) The first category designs model structures and trains model parameters for specific subtasks. Since thermal infrared information is easier to obtain than depth information and event information, RGB-T tracking is the most widely and deeply studied subtask. Early RGB-T tracking work (e.g., Real-time grayscale-thermal tracking via laplacian sparse representation, Fusion tracking in color and infrared images using joint sparse representation, and Multiple source data fusion via sparse representation for robust visual tracking) uses sparse representation models to suppress feature noise and model multimodal features. However, these methods fall short in both real-time performance and robustness. Some works (e.g., Multi-adapter rgbt tracking, Rgbt tracking via multi-adapter network with hierarchical divergence loss, and Siamese infrared and visible light fusion network for rgb-t tracking) design specialized branch structures to extract modality-shared and modality-specific information, enhancing the representation ability of multimodal features and achieving robust visual tracking. However, fusing features from two modalities carries a risk of mutual inhibition between the modalities. To address this issue, Lu et al. proposed a duality-gated mutual condition network in the paper Duality-gated mutual condition network for RGBT tracking. This network uses duality-gated modules to extract the discriminative features of one modality and guide the feature learning of the other modality, while filtering redundant and noisy information through the gating mechanism. Recent studies (e.g., Bridging search region interaction with template for RGB-T tracking, Revisiting color-event based tracking: A unified network, dataset, and metric, and Unified single-stage transformer network for efficient RGB-T tracking) are influenced by the Transformer and attempt to unify feature extraction and interaction in a single network, covering the RGB-T and RGB-E tracking fields. Research on RGB-D tracking (e.g., DepthTrack: Unveiling the power of RGBD tracking and RGBD1k: A large-scale dataset and benchmark for RGB-D object tracking) mostly extends RGB-only trackers (such as ATOM, DiMP, and STARK). (2) The second category uses the same model structure for various multimodal tracking subtasks but trains the model parameters for each specific tracking task. Inspired by prompt learning, the paper Prompting for multi-modal tracking was the first to attempt this on multimodal tracking tasks. It treats the auxiliary modality as a kind of prompt and combines it with the RGB image by a weighted sum to form a new three-channel image, which is used as input to the pre-trained tracking model. However, to achieve optimal performance across different subtasks, the modality weights need to be adjusted manually. Zhu et al., in the paper Visual prompt multi-modal tracking, provided a set of learnable Modality-Complementary Prompters for the base model, which generate appropriate prompts at each stage of forward propagation. However, these prompts must be learned separately for each modality.

[0004] These studies scale poorly because, in every case, either the model structure is designed for, or the model parameters are trained on, one particular multimodal tracking subtask. If the tracker's operating environment changes and it must switch to another, more reliable auxiliary modality, its accuracy drops significantly. Furthermore, training separate model parameters for each multimodal tracking subtask results in substantial parameter redundancy.

[0005] Current multimodal tracking research decomposes the multimodal tracking task into specific subtasks. This division helps researchers develop specialized solutions for specific application scenarios, thereby reducing technical complexity and promoting the rapid development of the various branches of multimodal tracking. However, relying on a single auxiliary modality to assist RGB tracking still has significant limitations. For example, while introducing the infrared modality can improve tracking accuracy in low-light environments, it may struggle to provide discriminative information when the temperature contrast between the target object and the background is small. Event sensors offer high temporal resolution and largely avoid motion blur, but they rely on changes in ambient light for imaging, so they may be inaccurate in low-light or no-light conditions. The depth modality provides depth information and helps with occlusion, but because of its limited imaging distance it may not provide accurate information for distant targets. Furthermore, a model designed for one subtask often struggles to track targets accurately when switched to another subtask unless the model structure is redesigned or the model parameters are retrained, resulting in poor scalability.

[0006] Although the modality types of different subtasks differ, some modality-independent information is always shared between different modalities, such as the target's position, size, and motion. This information can be extracted using the same model structure and parameters, but current research has ignored this, resulting in a large amount of parameter redundancy.

Summary of the Invention

[0007] The technical problem to be solved by this invention is how to achieve adaptive perception-based general multimodal target tracking for multimodal data input.

[0008] The present invention solves the above-mentioned technical problems through the following technical means:

[0009] A general-purpose multimodal target tracking system based on adaptive perception, including a multimodal target tracking model, which includes:

[0010] The dual-stream embedding layer module is configured to tokenize the visible light modality and one auxiliary modality in turn, obtaining the template tokens and search region tokens of the visible light modality as well as the template tokens and search region tokens of the auxiliary modality; for each modality, the template tokens and search region tokens are concatenated with a classification token, thus obtaining the token sequences of the two modalities.

[0011] The modality perception module is configured to receive the token sequences of the two modalities. Within the modality perception module, the two token sequences undergo intra-modal feature extraction and template-search region interaction, while the added classification token captures the modality information of the corresponding modality data. The module outputs the modality probability; the token sequence with the classification token removed is defined as the input token.

[0012] The modality-independent and modality-specific feature representation module is configured to accept the modality probability and the input token. The module uses one modality-independent branch to extract modality-independent features, uses four modality-specific branches to extract the corresponding modality-specific features, and weights the output features of the four modality-specific branches by the modality probability and sums them, so that the correct modality-specific features are retained. Finally, the search region tokens of the two modalities output by the module are merged and sent to the tracking head.

[0013] Furthermore, the dual-stream embedding layer module includes a visible light embedding layer and an auxiliary modal embedding layer; the visible light modal data is input from the visible light embedding layer, and one of the modal data—infrared modality, depth modality, or event modality—is input from the auxiliary modal embedding layer.
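The routing described above can be illustrated with a minimal sketch. It is not the patented implementation: the single-channel image, the 2x2 patch size, the random projection weights, and the names `make_linear`, `patchify`, `embed`, and `dual_stream_embed` are all assumptions introduced for illustration. The only behaviour taken from the text is that visible light data uses its own embedding layer while infrared, depth, and event data share one.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Random (out_dim x in_dim) matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def patchify(image, patch):
    """Split an H x W single-channel image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def embed(patches, weight):
    """Project every flattened patch to a token with the given weight matrix."""
    return [[sum(w * v for w, v in zip(row, p)) for row in weight] for p in patches]

PATCH, DIM = 2, 4  # hypothetical sizes, chosen only to keep the example small
EMBEDDINGS = {
    "visible": make_linear(PATCH * PATCH, DIM, seed=0),    # dedicated visible light layer
    "auxiliary": make_linear(PATCH * PATCH, DIM, seed=1),  # shared by infrared, depth, event
}

def dual_stream_embed(image, modality):
    """Route visible light input to its own layer, any auxiliary modality to the shared one."""
    key = "visible" if modality == "visible" else "auxiliary"
    return embed(patchify(image, PATCH), EMBEDDINGS[key])

frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
rgb_tokens = dual_stream_embed(frame, "visible")
depth_tokens = dual_stream_embed(frame, "depth")
event_tokens = dual_stream_embed(frame, "event")
```

Because depth and event inputs pass through the same shared layer, identical pixel data yields identical tokens for both, whereas the visible light layer produces a different projection.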

[0014] Furthermore, the modality perception module includes a modality perception layer and a modality classifier;

[0015] The modality perception layer comprises multiple ordinary vision Transformer blocks. The token sequences of the two modalities are input into the modality perception layer for intra-modal feature extraction and template-search region interaction as follows:

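[0016] In each ordinary vision Transformer block, three linear layers first map the input token sequence to the query Q, key K, and value V, on which self-attention is then performed:

$$A = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V = \mathrm{Softmax}\left(\frac{[Q_c; Q_z; Q_x]\,[K_c; K_z; K_x]^{\top}}{\sqrt{d_k}}\right)[V_c; V_z; V_x]$$

[0017] where the subscripts $c$, $z$, and $x$ denote the submatrices belonging to the classification token, the template tokens, and the search region tokens, respectively; the attention weights in the above formula can be further written in the following form:

$$\mathrm{Softmax}\left(\frac{[Q_c; Q_z; Q_x]\,[K_c; K_z; K_x]^{\top}}{\sqrt{d_k}}\right) \triangleq \begin{bmatrix} W_{cc} & W_{cz} & W_{cx} \\ W_{zc} & W_{zz} & W_{zx} \\ W_{xc} & W_{xz} & W_{xx} \end{bmatrix}$$

[0018] where $W_{zx}$ represents the similarity measure between the template tokens and the search region tokens, and the other submatrices are interpreted in the same way; ultimately, the output A of the self-attention can be further written as:

$$A = \begin{bmatrix} W_{cc}V_c + W_{cz}V_z + W_{cx}V_x \\ W_{zc}V_c + W_{zz}V_z + W_{zx}V_x \\ W_{xc}V_c + W_{xz}V_z + W_{xx}V_x \end{bmatrix}$$

[0019] The above equation shows that the template tokens and the search region tokens simultaneously achieve their own feature extraction and template-search region feature interaction in a single self-attention operation, and the terms $W_{cz}V_z$ and $W_{cx}V_x$ indicate that the classification token also interacts with the template tokens and the search region tokens, aggregating modality information.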
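The block structure of this self-attention can be checked numerically with a small sketch. It assumes standard scaled dot-product attention and, for brevity, uses identity projections in place of the three learned linear layers; the token counts and values are arbitrary. The sketch rebuilds the classification-token output from its three weight sub-blocks (classification, template, search) and compares it with the full attention output.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(q, k, v):
    """Scaled dot-product attention; returns (attention weights, output)."""
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)

# Token layout: 1 classification token, 2 template tokens, 3 search tokens.
n_cls, n_tmpl, n_srch = 1, 2, 3
total = n_cls + n_tmpl + n_srch
tokens = [[0.1 * (i + 1) * (j + 1) for j in range(4)] for i in range(total)]

# Identity projections (Q = K = V = tokens) stand in for the learned layers.
weights, out = self_attention(tokens, tokens, tokens)

# Rebuild the classification-token output block by block.
blocks = {
    "cls": (0, n_cls),
    "template": (n_cls, n_cls + n_tmpl),
    "search": (n_cls + n_tmpl, total),
}
rebuilt = [0.0] * 4
for start, stop in blocks.values():
    for idx in range(start, stop):
        for dim in range(4):
            rebuilt[dim] += weights[0][idx] * tokens[idx][dim]
```

The template and search contributions to `rebuilt` are the terms through which the classification token aggregates information from the rest of the sequence.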

[0020] Furthermore, the modality classifier in the modality perception module is configured to have modality classification capability; it receives the classification token output by each ordinary vision Transformer block and performs modality prediction. The prediction process is as follows:

[0021] In the formula, fine-grained weights are assigned to the classification tokens, and the prediction result of the modality classifier indicates the probability that the input data belongs to a certain modality.
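One possible reading of this prediction step is sketched below. The exact formula is not recoverable from the text, so the following are assumptions: the fine-grained weights are one scalar per block normalised with a softmax, the weighted classification tokens are summed, and a single linear head followed by a softmax yields four modality probabilities. The function `classify_modality` and all numeric values are hypothetical.

```python
import math

MODALITIES = ("visible", "infrared", "depth", "event")

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify_modality(cls_tokens, block_logits, head_weight, head_bias):
    """cls_tokens holds one classification-token vector per Transformer block."""
    block_weights = softmax(block_logits)  # assumed normalisation of per-block weights
    dim = len(cls_tokens[0])
    fused = [sum(w * tok[d] for w, tok in zip(block_weights, cls_tokens))
             for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(head_weight, head_bias)]
    return dict(zip(MODALITIES, softmax(logits)))

# Toy values: three blocks, three-dimensional classification tokens.
cls_tokens = [[0.2, 0.1, 0.0], [0.4, 0.3, 0.1], [0.9, 0.2, 0.4]]
block_logits = [0.0, 0.5, 1.0]
head_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
head_bias = [0.0, 0.0, 0.0, 0.0]
probs = classify_modality(cls_tokens, block_logits, head_weight, head_bias)
```

The returned dictionary plays the role of the modality probability that later weights the modality-specific branches.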

[0022] Furthermore, the modality-independent and modality-specific feature representation module includes multiple panoramic vision Transformer blocks. The modality probability output by the modality classifier is denoted as $p = \{p_r, p_t, p_d, p_e\}$, where the subscripts denote the corresponding modalities, namely the visible light, infrared, depth, and event modality probabilities, respectively. The input tokens are fed into each panoramic vision Transformer block. In each block, the input tokens undergo layer normalization, multi-head self-attention, and a residual connection to generate a set of tokens consisting of template tokens and search region tokens; then, modality-independent and modality-specific features are captured as follows:

[0023]

[0024]

[0025] where each modality-specific feedforward network layer handles one particular modality, the two resulting sets of tokens are the modality-independent tokens and the modality-specific tokens, respectively, and the temperature coefficient is learnable. Finally, the tokens generated by the self-attention stage, the modality-independent tokens, and the modality-specific tokens are summed directly to form the output tokens of the block.
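A minimal sketch of one such block's feed-forward stage follows. The formulas themselves are missing from the text, so several points are assumptions: the feedforward networks are replaced by toy element-wise maps, the learnable temperature is assumed to sharpen the modality probabilities through a softmax over log-probabilities, and the output is assumed to be the sum of the incoming token, the modality-independent token, and the probability-weighted modality-specific token. The `hard_modality` argument mimics selecting a branch directly from a known modality type.

```python
import math

MODALITIES = ("visible", "infrared", "depth", "event")

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def make_ffn(scale, shift):
    """Toy stand-in for a feedforward network: element-wise affine map plus ReLU."""
    return lambda token: [max(0.0, scale * v + shift) for v in token]

shared_ffn = make_ffn(1.0, 0.0)  # modality-independent branch
specific_ffn = {m: make_ffn(0.5 * (i + 1), 0.1 * i)  # one branch per modality
                for i, m in enumerate(MODALITIES)}

def block_output(token, modality_probs, temperature=1.0, hard_modality=None):
    shared = shared_ffn(token)
    if hard_modality is not None:
        # Branch chosen directly from the known modality type.
        weights = {m: 1.0 if m == hard_modality else 0.0 for m in MODALITIES}
    else:
        # Assumed use of the temperature: sharpen or flatten the probabilities.
        logits = [math.log(modality_probs[m] + 1e-12) / temperature for m in MODALITIES]
        weights = dict(zip(MODALITIES, softmax(logits)))
    specific = [0.0] * len(token)
    for m in MODALITIES:
        branch = specific_ffn[m](token)
        specific = [s + weights[m] * b for s, b in zip(specific, branch)]
    # Residual sum of the token, the shared features, and the specific features.
    return [t + a + s for t, a, s in zip(token, shared, specific)]

token = [1.0, -2.0]
probs = {"visible": 0.0, "infrared": 1.0, "depth": 0.0, "event": 0.0}
hard = block_output(token, probs, hard_modality="infrared")
soft = block_output(token, probs)
```

When the classifier is confident, the probability-weighted sum and the direct branch selection give practically the same output, which is why a confident classifier lets the weighted sum retain only the correct modality-specific features.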

[0026] Furthermore, there are 3 ordinary vision Transformer blocks and 9 panoramic vision Transformer blocks.

[0027] This invention also provides a training method for a general multimodal target tracking model based on adaptive perception, comprising the following steps:

[0028] Phase 1:

[0029] Step 1: Randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, and sample N pairs of template-search frames from these sequences. Crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain three batches of training data $\{(Z_r^1, X_r^1, Z_t^1, X_t^1), (Z_r^2, X_r^2, Z_d^2, X_d^2), (Z_r^3, X_r^3, Z_e^3, X_e^3)\}$, where Z represents the template, X represents the search area, the superscripts 1, 2, and 3 indicate that Z and X belong to the data in LasHeR, DepthTrack, and VisEvent respectively, and the subscripts r, t, d, and e represent the visible light, infrared, depth, and event modalities respectively.

[0030] Step 2: Feed $Z_r^1, X_r^1, Z_t^1, X_t^1$ into the two input layers of the dual-stream embedding layer module and perform forward propagation. During forward propagation, the modality-independent and modality-specific feature representation module directly selects the corresponding modality branch based on the modality type information, ensuring that each modality branch acquires the ability to extract modality-specific features.

[0031] Step 3: Calculate the loss based on the outputs of the modality classifier and the tracking head. The total loss combines the focal loss used for classification, the L1 and GIoU losses used for bounding box regression, and the cross-entropy loss used to supervise the modality classifier, with manually set weights on the loss terms. Then, backpropagation is performed to calculate the gradients of the parameters, but no parameter update is performed.

[0032] Step 4: Take $Z_r^2, X_r^2, Z_d^2, X_d^2$ as input data and execute Steps 2 and 3;

[0033] Step 5: Take $Z_r^3, X_r^3, Z_e^3, X_e^3$ as input data and execute Steps 2 and 3;

[0034] Step 6: The gradients generated in Steps 3 to 5 are accumulated automatically. At this point, the parameters are updated together according to the AdamW optimization algorithm.

[0035] Step 7: Repeat steps 1 through 6 until the model converges;
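The accumulate-then-update schedule of this phase can be sketched on a toy problem. Everything below is illustrative: the model is a single scalar weight fitted to y = 2x, the three batches merely stand in for the RGB-T, RGB-D, and RGB-E data, and the `AdamW` class is a hand-written scalar version of the optimizer with hypothetical hyperparameters. The point is only the control flow: three backward passes whose gradients are summed, followed by one parameter update.

```python
import math

def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

class AdamW:
    """Minimal scalar AdamW with decoupled weight decay, for illustration only."""
    def __init__(self, lr=0.1, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
        self.lr, self.betas, self.eps, self.wd = lr, betas, eps, weight_decay
        self.m = 0.0
        self.v = 0.0
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        b1, b2 = self.betas
        self.m = b1 * self.m + (1 - b1) * grad
        self.v = b2 * self.v + (1 - b2) * grad * grad
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        w = w * (1 - self.lr * self.wd)  # decoupled weight decay
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

# Three toy batches standing in for the three datasets.
batches = {
    "rgb_t": [(1.0, 2.0), (2.0, 4.0)],
    "rgb_d": [(0.5, 1.0), (1.5, 3.0)],
    "rgb_e": [(3.0, 6.0)],
}

w, opt = 0.0, AdamW()
history = []
for _ in range(50):
    accumulated = 0.0
    for name in ("rgb_t", "rgb_d", "rgb_e"):
        # Backward pass per batch: gradients are summed, parameters untouched.
        accumulated += grad_mse(w, batches[name])
    w = opt.step(w, accumulated)  # single update after all three batches
    history.append(w)
```

Updating only after all three batches means each step reflects all three auxiliary modalities, rather than being pulled toward whichever subtask was seen last.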

[0036] Phase 2:

[0037] Step 1: Randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, and sample N pairs of template-search frames. Crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain a batch of training data $\{(Z_r, X_r, Z_x, X_x)\}$, where Z represents the template, X represents the search area, the subscript r represents the visible light modality, and the subscript x represents the auxiliary modality, which is one of the infrared, depth, or event modalities;

[0038] Step 2: Feed $Z_r, X_r, Z_x, X_x$ into the two input layers of the dual-stream embedding layer module and perform forward propagation; during this training phase, the modality-independent and modality-specific feature representation module weights the output features of the four modality branches by the modality probabilities output by the modality classifier.

[0039] Step 3: Calculate the loss based on the output of the tracking head. Each loss term has the same meaning as in Phase 1; then backpropagation is performed to calculate the parameter gradients;

[0040] Step 4: Update the parameters of the modality-independent and modality-specific feature representation module and of the tracking head according to the AdamW optimization algorithm, keeping all other parameters frozen.
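The selective update in this phase amounts to partitioning the parameters into a trainable set and a frozen set. The sketch below shows one simple way to do that by name prefix; the parameter names and the helper `split_trainable` are hypothetical and not taken from the patent.

```python
def split_trainable(param_names, trainable_prefixes):
    """Partition parameter names into (trainable, frozen) by name prefix."""
    trainable = [n for n in param_names if n.startswith(trainable_prefixes)]
    frozen = [n for n in param_names if not n.startswith(trainable_prefixes)]
    return trainable, frozen

# Hypothetical parameter names for the four parts of the model.
PARAMS = [
    "embed.visible.weight",
    "embed.auxiliary.weight",
    "perception.block0.attn.weight",
    "classifier.head.weight",
    "representation.block0.shared_ffn.weight",
    "representation.block0.infrared_ffn.weight",
    "tracking_head.center.weight",
]

# Only the feature representation module and the tracking head are updated.
trainable, frozen = split_trainable(PARAMS, ("representation.", "tracking_head."))
```

In a deep learning framework, the frozen names would typically be excluded from the optimizer or have gradient tracking disabled, so the modality classifier trained in the first phase is left unchanged.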

[0041] The present invention also provides an application of a general multimodal target tracking model based on adaptive perception.

[0042] The present invention also provides a processing device, including at least one processor and at least one memory communicatively connected to the processor, wherein: the memory stores program instructions executable by the processor, and the processor can execute the above-described method by calling the program instructions.

[0043] The present invention also provides a computer-readable storage medium storing computer instructions that cause a computer to perform the above-described method.

[0044] The advantages of this invention are:

[0045] Current multimodal tracking methods can only solve one type of multimodal tracking task, resulting in poor scalability and an inability to handle more complex scenarios. To address this issue, a universal tracker for multiple multimodal tracking tasks is proposed. At the model input layer, considering that the visible light modality carries richer semantic information than the infrared, depth, and event modalities, a separate embedding layer is set up for the visible light modality to better preserve this information, while a shared embedding layer is used for the infrared, depth, and event modalities. This setup also ensures flexibility in the input layer. The beneficial effects of this dual-stream embedding layer module are shown in Table 6. To enable adaptive perception of the input modality, a simple and effective modality perception module is designed, capable of simultaneously performing feature extraction, feature interaction, and modality perception. In multimodal tracking, each modality contains some modality-independent information, such as the target's shape, motion, and context. This information helps capture the semantic information shared between different modalities, thereby helping the model understand the overall context of the target. Furthermore, modality-specific features, which carry the unique perspective and information of each modality, are also crucial, as they enhance the model's ability to understand and process the overall information. By fully utilizing modality-independent and modality-specific features, the model's perception and analysis of complex multimodal data can be improved, leading to more accurate and robust task execution. To this end, a modality-independent and modality-specific feature representation module is proposed. The beneficial effects of this module are shown in detail by the visualized features and scores in Figure 3.

[0046] Tables 1 to 5 compare the method of this invention with other methods on five multimodal tracking datasets, showing that the method of this invention performs at the highest level among the compared methods.

Attached Figure Description

[0047] Figure 1 is a framework diagram of the general multimodal tracking model in Embodiment 1 of this invention;

[0048] Figure 2 is a comparison between the general multimodal tracking model framework in Embodiment 1 of this invention and other tracking models;

[0049] Figure 3 is a visualization of some of the tracking results in Embodiment 1 of this invention.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0051] Embodiment 1

[0052] This embodiment describes a general-purpose multimodal target tracking system based on adaptive perception for single-target tracking. It includes a multimodal target tracking model, which comprises:

[0053] The dual-stream embedding layer module is configured to tokenize the visible light modality and the auxiliary modality in turn, obtaining template tokens and search region tokens for the visible light modality and for the auxiliary modality; for each modality, the template tokens and search region tokens are concatenated with a classification token, thus obtaining the token sequences of the two modalities.

[0054] The modality perception module is configured to receive the token sequences of the two modalities. Within the modality perception module, the two token sequences undergo intra-modal feature extraction and template-search region interaction, while the added classification token captures the modality information of the corresponding modality data. The module outputs the modality probability; the token sequence with the classification token removed is defined as the input token.

[0055] The modality-independent and modality-specific feature representation module is configured to accept the modality probability and the input token. The module uses one modality-independent branch to extract modality-independent features, uses four modality-specific branches to extract the corresponding modality-specific features, and weights the output features of the four modality-specific branches by the modality probability and sums them, so that the correct modality-specific features are retained. Finally, the search region tokens of the two modalities output by the module are merged and sent to the tracking head.

[0056] The model in this embodiment is described in detail below with reference to Figure 1:

[0057] The dual-stream embedding layer module includes a visible light embedding layer and an auxiliary modality embedding layer. Visible light modality data is input through the visible light embedding layer, while infrared, depth, or event modality data is input through the auxiliary modality embedding layer. The video data is first split into frames, with each frame containing the visible light modality and one auxiliary modality. The visible light modality is input first, followed by the auxiliary modality, so the two are input sequentially.

[0058] Before input, a visible light template and an auxiliary template are assumed to be defined from the first frame of data. The search regions of the two modalities are then selected, and after token embedding, template tokens and search region tokens are obtained for the two modalities. For the visible light modality, the template token sequence, the search region token sequence, and a classification token are concatenated and then added to a set of position codes to form the visible light token sequence; the auxiliary modality token sequence is formed in the same way. The two token sequences are then fed into a shared modality perception module composed of several ordinary vision Transformer blocks for intra-modal feature extraction and template-search region interaction. Simultaneously, the added classification tokens gradually capture the modality information of the input data. Since the subsequent operations are the same for both the RGB and X modalities, this embodiment uses one modality as an example to illustrate the calculation process, and the modality indices are omitted.
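Assembling one modality's token sequence can be sketched as follows. The ordering (classification token first, then template tokens, then search tokens) and the choice to add a position code to every token, including the classification token, are assumptions; the text states only that the three parts are combined and added to a set of position codes. The function `build_sequence` is hypothetical.

```python
def build_sequence(cls_token, template_tokens, search_tokens, position_codes):
    """Concatenate [cls; template; search] and add position codes element-wise."""
    sequence = [cls_token] + template_tokens + search_tokens
    if len(position_codes) != len(sequence):
        raise ValueError("one position code is required per token")
    return [[v + p for v, p in zip(tok, pos)]
            for tok, pos in zip(sequence, position_codes)]

# Toy two-dimensional tokens: 1 classification, 2 template, 1 search.
cls_token = [1.0, 1.0]
template_tokens = [[0.0, 0.0], [2.0, 2.0]]
search_tokens = [[3.0, 3.0]]
position_codes = [[0.5, 0.5] for _ in range(4)]

sequence = build_sequence(cls_token, template_tokens, search_tokens, position_codes)
```

The same routine would be applied once to the visible light tokens and once to the auxiliary modality tokens before both sequences enter the shared modality perception module.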

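[0059] In each ordinary vision Transformer block, three linear layers first map the input token sequence to the query Q, key K, and value V, on which self-attention is then performed:

$$A = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V = \mathrm{Softmax}\left(\frac{[Q_c; Q_z; Q_x]\,[K_c; K_z; K_x]^{\top}}{\sqrt{d_k}}\right)[V_c; V_z; V_x]$$

[0060] where the subscripts $c$, $z$, and $x$ denote the submatrices belonging to the classification token, the template, and the search region, respectively. The attention weights in the above formula can be further written in the following form:

$$\mathrm{Softmax}\left(\frac{[Q_c; Q_z; Q_x]\,[K_c; K_z; K_x]^{\top}}{\sqrt{d_k}}\right) \triangleq \begin{bmatrix} W_{cc} & W_{cz} & W_{cx} \\ W_{zc} & W_{zz} & W_{zx} \\ W_{xc} & W_{xz} & W_{xx} \end{bmatrix}$$

[0061] where $W_{zx}$ represents the similarity measure between the tokens in the template and the search region; the remaining submatrices follow the same logic. Finally, the output A of the self-attention can be further written as:

$$A = \begin{bmatrix} W_{cc}V_c + W_{cz}V_z + W_{cx}V_x \\ W_{zc}V_c + W_{zz}V_z + W_{zx}V_x \\ W_{xc}V_c + W_{xz}V_z + W_{xx}V_x \end{bmatrix}$$

[0062] The above formula shows that the template tokens and the search region tokens simultaneously achieve their own feature extraction and template-search region feature interaction in a single self-attention operation. Furthermore, the terms $W_{cz}V_z$ and $W_{cx}V_x$ indicate that the classification token also interacts with the template and search region tokens, aggregating modality information.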

[0063] The classification token output by each block in the modality perception layer is retained and then fed into the modality classifier for modality prediction. This prediction process can be represented as:

[0064] Here, fine-grained weights are assigned to the classification tokens, and the prediction result of the modality classifier indicates the probability that the input data belongs to a certain modality.

[0065] Modality-independent and modality-specific feature representation module:

[0066] After the modality perception layer, this embodiment removes the classification token, and the remaining tokens serve as the input tokens of the modality-independent and modality-specific feature representation module. The modality probability output by the modality classifier is denoted as $p = \{p_r, p_t, p_d, p_e\}$, where the subscripts denote the corresponding modalities. The input tokens are fed into each panoramic vision Transformer block. Within each block, the input tokens undergo layer normalization, self-attention, and a residual connection to generate a set of tokens. Then, modality-independent and modality-specific features are captured as follows:

[0067] where each modality-specific feedforward network layer handles one particular modality, the two resulting sets of tokens are the modality-independent tokens and the modality-specific tokens, respectively, and the temperature coefficient is learnable. Finally, the tokens generated by the self-attention stage, the modality-independent tokens, and the modality-specific tokens are summed directly to form the output tokens of the block.

[0068] After both the visible light modality and the auxiliary modality have passed through the modality-independent and modality-specific feature representation module, their search region tokens are merged and sent to the tracking head for target localization.

[0069] The target's bounding box coordinates are calculated from the target center point score map P, the center point local offset map O, and the target bounding box size map S output by the tracking head. The point with the highest score in P gives the center point position; the bounding box coordinates are then obtained by correcting this position with the local offset read from O and taking the box size read from S at the same position.
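A common way to decode a box from such maps is sketched below. The exact decoding formula is not given in the text, so this follows the usual center-based convention as an assumption: the offset map has two channels (x and y), the size map has two channels (width and height), and coordinates are normalised by the map resolution. The function `decode_box` is hypothetical.

```python
def decode_box(score_map, offset_map, size_map):
    """Decode (cx, cy, w, h) in normalised coordinates.

    score_map: H x W scores; offset_map and size_map: 2 x H x W.
    """
    h, w = len(score_map), len(score_map[0])
    # Position of the highest score in P.
    row, col = max(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: score_map[rc[0]][rc[1]])
    # Refine the coarse grid position with the local offset from O.
    cx = (col + offset_map[0][row][col]) / w
    cy = (row + offset_map[1][row][col]) / h
    # Read the box size from S at the same position.
    bw, bh = size_map[0][row][col], size_map[1][row][col]
    return cx, cy, bw, bh

P = [[0.1, 0.2],
     [0.9, 0.3]]
O = [[[0.0, 0.0], [0.5, 0.0]],    # x offsets
     [[0.0, 0.0], [0.25, 0.0]]]   # y offsets
S = [[[0.0, 0.0], [0.4, 0.0]],    # widths
     [[0.0, 0.0], [0.6, 0.0]]]    # heights
box = decode_box(P, O, S)
```

The offset compensates for the quantisation introduced by the coarse score map, so the predicted center is not restricted to grid positions.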

[0070] In this embodiment, a total of 12 ViT blocks are used in the model, namely L1+L2=12.

[0071] Figure 2 illustrates a comparison between the general multimodal tracking method of this embodiment and current multimodal tracking methods. Different graph topologies represent different model frameworks, and different node colors represent different model parameters.

[0072] This embodiment addresses the limitation that current multimodal tracking methods can only solve one type of multimodal tracking task, which gives them poor scalability and leaves them unable to handle more complex scenarios. To resolve this issue, a universal tracker for multiple multimodal tracking tasks is proposed. At the model input layer, considering that the visible light modality carries richer semantic information than the infrared, depth, and event modalities, a separate embedding layer is set up for the visible light modality to better preserve this information, and a shared embedding layer is used for the infrared, depth, and event modalities. This setup also ensures flexibility in the input layer. The beneficial effects of this dual-stream embedding layer module are shown in Table 6. To enable adaptive perception of the input modality, a simple and effective modality perception module is designed, capable of simultaneously performing feature extraction, feature interaction, and modality perception. In multimodal tracking, each modality contains modality-independent information, such as the target's shape, motion, and context. This information helps capture the semantic information shared between different modalities, thereby helping the model understand the overall context of the target. Furthermore, modality-specific features, which carry the unique perspective and information of each modality, are also crucial, as they enhance the model's ability to understand and process the overall information. By fully utilizing modality-independent and modality-specific features, the model's perception and analysis of complex multimodal data can be improved, achieving more accurate and robust task execution. To this end, a modality-independent and modality-specific feature representation module is proposed. The beneficial effects of this module are shown in detail by the visualized features and scores in Figure 3.

[0073] Embodiment 2

[0074] For the general multimodal target tracking model based on adaptive perception in Embodiment 1, this embodiment provides a training method for the model, comprising the following steps:

[0075] Phase 1: Training the modality classifier and the modality branches

[0076] Step 1: Randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, and sample N pairs of template-search frames from these sequences. Crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain three batches of training data $\{(Z_r^1, Z_t^1, X_r^1, X_t^1), (Z_r^2, Z_d^2, X_r^2, X_d^2), (Z_r^3, Z_e^3, X_r^3, X_e^3)\}$, where Z represents the template, X represents the search area, the superscripts 1, 2, and 3 indicate that Z and X belong to the data in LasHeR, DepthTrack, and VisEvent respectively, and the subscripts r, t, d, and e represent the visible light, infrared, depth, and event modalities respectively.

[0077] Step 2: Feed $Z_r^1, Z_t^1, X_r^1, X_t^1$ into the two input layers of the dual-stream embedding layer module and perform forward propagation. During forward propagation, the modality-independent and modality-specific feature representation module directly selects the corresponding modality branch based on the modality type information, which ensures that each modality branch learns to extract modality-specific features.

[0078] Step 3: Calculate the loss based on the outputs of the modality classifier and the tracking head. The total loss is a weighted sum of the focal loss used for classification, the L1 and GIoU losses used for bounding box regression, and the cross-entropy loss used to supervise the modality classifier, where the weights of the loss terms are set manually. Then, backpropagation is performed to compute the parameter gradients, but no parameter update is performed yet.

[0079] Step 4: Take $Z_r^2, Z_d^2, X_r^2, X_d^2$ as input data and execute Steps 2 and 3;

[0080] Step 5: Take $Z_r^3, Z_e^3, X_r^3, X_e^3$ as input data and execute Steps 2 and 3;

[0081] Step 6: The gradients generated in Steps 3 to 5 are accumulated automatically. The parameters are then updated in a single step using the AdamW optimization algorithm.

[0082] Step 7: Repeat steps 1 through 6 until the model converges;
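The accumulate-then-update pattern of Steps 2 to 6 can be sketched as follows. This is a toy under stated assumptions rather than the training code of the embodiment: the model is reduced to a single scalar parameter, each loss term is a hypothetical quadratic stand-in, and the loss weights in `LOSS_WEIGHTS` are illustrative values, since the embodiment only states that the weights are set manually.

```python
import math

# Hypothetical hand-set weights for the four loss terms.
LOSS_WEIGHTS = {"focal": 1.0, "l1": 5.0, "giou": 2.0, "modality_ce": 1.0}


def loss_and_grad(w, batch):
    """Toy weighted loss: each term is (w - target)^2 with a per-term target."""
    loss, grad = 0.0, 0.0
    for name, weight in LOSS_WEIGHTS.items():
        diff = w - batch[name]
        loss += weight * diff * diff
        grad += weight * 2.0 * diff
    return loss, grad


def adamw_step(w, grad, state, lr=1e-2, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update with decoupled weight decay for a scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)


def phase_one_iteration(w, state, batches):
    """Backpropagate each batch in turn, then apply a single update."""
    accumulated_grad = 0.0
    for batch in batches:
        _, grad = loss_and_grad(w, batch)  # gradients only, no update yet
        accumulated_grad += grad
    return adamw_step(w, accumulated_grad, state)


# Toy targets standing in for the LasHeR, DepthTrack, and VisEvent batches.
batches = [
    {"focal": 1.0, "l1": 1.0, "giou": 1.0, "modality_ce": 1.0},
    {"focal": 2.0, "l1": 2.0, "giou": 2.0, "modality_ce": 2.0},
    {"focal": 3.0, "l1": 3.0, "giou": 3.0, "modality_ce": 3.0},
]
state = {"t": 0, "m": 0.0, "v": 0.0}
w_new = phase_one_iteration(0.0, state, batches)
```

Accumulating before updating means that one optimizer step reflects all three modality combinations, so no single dataset dominates an individual update.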

[0083] Phase 2: Training the modality-independent and modality-specific feature representation module and the tracking head, based on the pre-trained modality classifier

[0084] Step 1: Randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, and sample N pairs of template-search frames. Crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain a batch of training data $\{(Z_r, Z_x, X_r, X_x)\}$, where Z represents the template, X represents the search area, the subscript r represents the visible light modality, and the subscript x represents the auxiliary modality, which is one of the infrared, depth, or event modalities;

[0085] Step 2: Feed $Z_r, Z_x, X_r, X_x$ into the two input layers of the dual-stream embedding layer module and perform forward propagation; during this training phase, the modality-independent and modality-specific feature representation module performs feature selection based on the output of the modality classifier.

[0086] Step 3: Calculate the loss based on the output of the tracking head, where each loss term has the same meaning as in Phase 1; then backpropagation is performed to compute the parameter gradients;

[0087] Step 4: Update the parameters of the modality-independent and modality-specific feature representation module and of the tracking head using the AdamW optimization algorithm, and keep all other parameters frozen.
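The difference between the two phases in how the modality branches are combined, and the parameter freezing in Phase 2, can be sketched as follows. This is a minimal illustration: the branch outputs are toy vectors, the function names are hypothetical, and the parameter name prefixes `representation.` and `tracking_head.` are assumed for the example only.

```python
MODALITIES = ("visible", "infrared", "depth", "event")


def one_hot(modality):
    """Phase 1: the known modality type selects exactly one branch."""
    return {m: 1.0 if m == modality else 0.0 for m in MODALITIES}


def mix_branches(branch_tokens, weights):
    """Weighted sum of the per-modality branch outputs."""
    dim = len(next(iter(branch_tokens.values())))
    return [sum(weights[m] * branch_tokens[m][i] for m in MODALITIES)
            for i in range(dim)]


def trainable_in_phase_two(param_name):
    """Phase 2 updates only the representation module and the tracking head."""
    return param_name.startswith(("representation.", "tracking_head."))


# Toy outputs of the four modality-specific branches for one token.
branch_tokens = {
    "visible": [1.0, 0.0],
    "infrared": [0.0, 1.0],
    "depth": [2.0, 2.0],
    "event": [3.0, 3.0],
}

# Phase 1: hard selection from the ground-truth modality type.
phase_one_tokens = mix_branches(branch_tokens, one_hot("depth"))

# Phase 2: soft weighting by (hypothetical) modality classifier probabilities.
classifier_probs = {"visible": 0.0, "infrared": 0.5, "depth": 0.5, "event": 0.0}
phase_two_tokens = mix_branches(branch_tokens, classifier_probs)
```

With a confident classifier the Phase 2 weights approach a one-hot vector, so the soft weighting reduces to the hard selection used in Phase 1.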

[0088] To further illustrate the performance of the model in this embodiment, the following is provided:

[0089] Tables 1 to 5 compare the performance of the model in this embodiment with other models on different datasets. To evaluate the method more comprehensively, this embodiment modifies several current advanced trackers into multimodal trackers and trains them on mixed RGBT, RGBD, and RGBE data to obtain general trackers applicable to multiple modality combinations, namely the trackers listed under "Unified Structure and Parameters" in the tables.

[0090] Table 1 Performance comparison on the LasHeR test set data

[0091]

[0092] Table 2 Performance comparison on the RGBT234 dataset

[0093]

[0094] Table 3 Performance comparison on the DepthTrack test set data

[0095]

[0096] Table 4 Performance comparison on the VOT-RGBD22 dataset

[0097]

[0098] Table 5 Performance comparison on the VisEvent test set data

[0099] Table 6 Ablation experiments on the embedding layer

[0100]

[0101] To verify whether the dual-stream embedding layer module truly delivers a substantial performance improvement, this embodiment trains a model with a single-stream embedding layer for comparison, where all modal data share a single embedding layer. As shown in Table 6, the single-embedding-layer model exhibits a 0.8% decrease in success rate on RGBT234, a 1.2% decrease on VisEvent, and a significant 2.9% decrease in F-score on DepthTrack. These experimental results demonstrate that each modality possesses specific states and distributions within the embedding space. Therefore, using separate embedding layers for RGB and other modalities better preserves their respective features. Conversely, using a single embedding layer to map all modal features to the same space may lead to feature confusion or information loss between different modalities.

[0102] In this ablation study, this embodiment configures different numbers of ordinary ViT blocks in the modality perception layer to investigate their impact on model performance. Specifically, five models are trained with 1, 2, 3, 4, and 5 ordinary ViT blocks stacked in the modality perception layer, respectively, and evaluated on all datasets. As shown in Table 7, the model achieves the highest overall performance when the modality perception layer contains 3 ordinary ViT blocks, at which point the modality-independent and modality-specific feature representation module contains 9 panoramic ViT blocks:

[0103] Table 7 Ablation experiments on the number of ordinary ViT blocks in the modality perception layer

[0104]

[0105] To evaluate the effectiveness of the proposed panoramic ViT block in extracting modality-specific features, as shown in Figure 3, this embodiment visualizes the feature map output by the last panoramic ViT block and the center point score map generated by the tracking head, and compares them with the baseline method of this embodiment (OSTrack with unified structure and parameters). The figure shows that the feature heatmap generated by the method of this embodiment has a more concentrated and pronounced high-intensity region at the target location, and the predicted center point is more stable and concentrated. In contrast, the feature heatmap of the baseline method is easily affected by background interference: its response near the target is scattered and weaker, and in some cases clearly wrong, which causes a large shift in the predicted center point and greater uncertainty. This indicates that the panoramic ViT block proposed in this embodiment effectively extracts modality-specific features, so that the fused feature representation makes better use of the complementarity between modalities, thereby improving tracking performance.

[0106] Example 3

[0107] This embodiment is an application of the general multimodal target tracking model based on adaptive perception in Example 1, which can be used for single target tracking in various scenarios.

[0108] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A general-purpose multimodal target tracking system based on adaptive perception, characterized in that it includes a multimodal target tracking model, wherein the multimodal target tracking model includes: a dual-stream embedding layer module, configured to tokenize the visible light modality and the auxiliary modality separately to obtain template tokens and search region tokens for each of the two modalities, and then to concatenate a classification token with the template tokens and search region tokens of each modality, thereby obtaining the token sequences of the two modalities; a modality perception module, configured to receive the token sequences of the two modalities, wherein the two token sequences undergo intra-modal feature extraction and template-search region interaction within the modality perception module while the added classification token captures the modality information of the corresponding modality data, and the module outputs the modality probability and the token sequence with the classification token removed, the latter being defined as the input token; a modality-independent and modality-specific feature representation module, configured to receive the modality probability and the input token, wherein the module uses one modality-independent branch to extract modality-independent features, uses four modality-specific branches to extract the corresponding modality-specific features, and sums the output features of the four modality-specific branches weighted by the modality probability to retain the correct modality-specific features; finally, the search region tokens of the two modalities output by the module are merged and sent to the tracking head; the modality perception module includes a modality perception layer and a modality classifier; the modality perception layer comprises multiple ordinary vision Transformer blocks; the token sequences of the two modalities are input into the modality perception layer for intra-modal feature extraction and template-search region interaction as follows: in each ordinary vision Transformer block, the token sequence is first mapped by three linear layers to the query Q, key K, and value V, and self-attention is then performed on them:
$$A = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V = \mathrm{Softmax}\left(\frac{[Q_c; Q_z; Q_x]\,[K_c; K_z; K_x]^{\top}}{\sqrt{d_k}}\right)[V_c; V_z; V_x],$$
where $d_k$ is the key dimension, $[\cdot\,;\cdot]$ denotes stacking along the token dimension, and the subscripts c, z, and x denote the submatrices belonging to the classification token, the template tokens, and the search region tokens, respectively; the attention weights in the above formula can be further written in the following form:
$$\mathrm{Softmax}\left(\frac{1}{\sqrt{d_k}}\begin{bmatrix} Q_cK_c^{\top} & Q_cK_z^{\top} & Q_cK_x^{\top} \\ Q_zK_c^{\top} & Q_zK_z^{\top} & Q_zK_x^{\top} \\ Q_xK_c^{\top} & Q_xK_z^{\top} & Q_xK_x^{\top} \end{bmatrix}\right) \triangleq \begin{bmatrix} W_{cc} & W_{cz} & W_{cx} \\ W_{zc} & W_{zz} & W_{zx} \\ W_{xc} & W_{xz} & W_{xx} \end{bmatrix},$$
where $W_{zx}$ is the similarity measure between the template tokens and the search region tokens, and the other submatrices are defined in the same way; ultimately, the output A of the self-attention can be further written as:
$$A = \begin{bmatrix} W_{cc}V_c + W_{cz}V_z + W_{cx}V_x \\ W_{zc}V_c + W_{zz}V_z + W_{zx}V_x \\ W_{xc}V_c + W_{xz}V_z + W_{xx}V_x \end{bmatrix}.$$
Through the above equation, the template tokens and the search region tokens simultaneously achieve their own feature extraction and template-search region feature interaction in a single self-attention operation, and the terms $W_{cz}V_z$ and $W_{cx}V_x$ indicate that the classification token also interacts with the template tokens and the search region tokens, aggregating modality information.

2. The general-purpose multimodal target tracking system based on adaptive perception according to claim 1, characterized in that the dual-stream embedding layer module includes a visible light embedding layer and an auxiliary modality embedding layer; the visible light modality data is input through the visible light embedding layer, and data of one of the infrared, depth, or event modalities is input through the auxiliary modality embedding layer.

3. The general-purpose multimodal target tracking system based on adaptive perception according to claim 1, characterized in that the modality classifier of the modality perception module is configured to perform modality classification; the modality classifier receives the classification token output by each ordinary vision Transformer block and performs modality prediction, wherein each classification token is assigned a fine-grained weight, and the prediction result of the modality classifier is the probability that the input data belongs to a certain modality.

4. The general-purpose multimodal target tracking system based on adaptive perception according to claim 1, characterized in that the modality-independent and modality-specific feature representation module includes multiple panoramic vision Transformer blocks; the modality probability output by the modality classifier is denoted as $p = [p_r, p_t, p_d, p_e]$, where the subscripts r, t, d, and e represent the corresponding modalities, so that the entries are the visible light, infrared, depth, and event modality probabilities, respectively; the input token is sent to each panoramic vision Transformer block; in each block, the input tokens undergo layer normalization, self-attention, and a residual connection to generate a set of tokens, abbreviated as $T'$; then modality-independent tokens $T_{ind}$ are obtained from the modality-independent branch, and modality-specific tokens $T_{spe}$ are obtained as the sum of the outputs of the feedforward network layers $\mathrm{FFN}_m$, $m \in \{r, t, d, e\}$, each of which specifically handles one modality, weighted according to the modality probabilities, wherein the temperature coefficient $\tau$ is learnable; finally, $T'$, $T_{ind}$, and $T_{spe}$ are summed directly to form the output token of the block.

5. The general-purpose multimodal target tracking system based on adaptive perception according to claim 1, characterized in that there are 3 ordinary vision Transformer blocks and 9 panoramic vision Transformer blocks.

6. A training method for the multimodal target tracking model according to any one of claims 1 to 5, characterized in that it includes the following steps: Phase 1: training the modality classifier and the modality branches; Step 1: randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, sample N pairs of template-search frames from these sequences, and crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain three batches of training data $\{(Z_r^1, Z_t^1, X_r^1, X_t^1), (Z_r^2, Z_d^2, X_r^2, X_d^2), (Z_r^3, Z_e^3, X_r^3, X_e^3)\}$, where Z represents the template, X represents the search area, the superscripts 1, 2, and 3 indicate that Z and X belong to the data in LasHeR, DepthTrack, and VisEvent respectively, and the subscripts r, t, d, and e represent the visible light, infrared, depth, and event modalities respectively; Step 2: feed $Z_r^1, Z_t^1, X_r^1, X_t^1$ into the two input layers of the dual-stream embedding layer module and perform forward propagation, during which the modality-independent and modality-specific feature representation module directly selects the corresponding modality branch based on the modality type information, to ensure that each modality branch has the ability to extract modality-specific features; Step 3: calculate the loss based on the outputs of the modality classifier and the tracking head, the total loss being a weighted sum of the focal loss used for classification, the L1 and GIoU losses used for bounding box regression, and the cross-entropy loss used to supervise the modality classifier, where the weights of the loss terms are set manually; then perform backpropagation to compute the parameter gradients, without updating the parameters; Step 4: take $Z_r^2, Z_d^2, X_r^2, X_d^2$ as input data and execute Steps 2 and 3; Step 5: take $Z_r^3, Z_e^3, X_r^3, X_e^3$ as input data and execute Steps 2 and 3; Step 6: the gradients generated in Steps 3 to 5 are accumulated automatically, and the parameters are then updated in a single step using the AdamW optimization algorithm; Step 7: repeat Steps 1 through 6 until the model converges; Phase 2: training the modality-independent and modality-specific feature representation module and the tracking head, based on the pre-trained modality classifier; Step 1: randomly select N video sequences from the training datasets of LasHeR, DepthTrack, and VisEvent, sample N pairs of template-search frames, and crop and resize the template frame and search frame regions according to the initialized bounding boxes to obtain a batch of training data $\{(Z_r, Z_x, X_r, X_x)\}$, where Z represents the template, X represents the search area, the subscript r represents the visible light modality, and the subscript x represents the auxiliary modality, which is one of the infrared, depth, or event modalities; Step 2: feed $Z_r, Z_x, X_r, X_x$ into the two input layers of the dual-stream embedding layer module and perform forward propagation, during which the modality-independent and modality-specific feature representation module performs feature selection based on the output of the modality classifier; Step 3: calculate the loss based on the output of the tracking head, where each loss term has the same meaning as in Phase 1, and then perform backpropagation to compute the parameter gradients; Step 4: update the parameters of the modality-independent and modality-specific feature representation module and of the tracking head using the AdamW optimization algorithm, and keep all other parameters frozen.

7. A processing device, characterized in that it includes at least one processor and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor can execute the method of claim 6 by invoking the program instructions.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that cause a computer to perform the method of claim 6.