A high-performance video tracking method and system based on a 3D twin convolutional network

By combining 3D twin convolutional networks and multi-template matching modules, the problem of failing to effectively utilize spatiotemporal correlation information in existing technologies is solved, achieving high-precision and stable video target tracking results.

CN115908500B · Active · Publication Date: 2026-06-26 · CHANGSHA UNIVERSITY OF SCIENCE AND TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHANGSHA UNIVERSITY OF SCIENCE AND TECHNOLOGY
Filing Date
2022-12-30
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing video target tracking methods based on Siamese networks fail to effectively utilize cross-temporal and spatial correlation information, resulting in low target tracking accuracy in complex scenarios and the problem of error accumulation.

Method used

A 3D twin convolutional network is used to extract spatiotemporal correlation features. Combined with a multi-template matching module and a quality assessment branch, target localization and scale estimation are performed through multi-channel correlation filtering features. A confidence search region estimation strategy is introduced to reduce errors.

Benefits of technology

It improves the accuracy and stability of video target tracking, effectively captures spatiotemporal correlations in video sequences, achieves more discriminative search features, and ensures the accuracy and speed of target tracking.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115908500B_ABST

Abstract

The application discloses a high-performance video tracking method based on a 3D twin convolutional network. First, a spatiotemporal feature extractor based on a 3D twin network is designed to extract the spatiotemporal features of the template sequence and the search sequence. Second, a multi-template matching module is designed, which passes template features to the search features to strengthen the target features in the search block; the module includes a template feature conversion submodule and a spatiotemporal feature matching submodule. The template feature conversion submodule passes the appearance and motion information in the template frames to the search branch; the spatiotemporal feature matching submodule consists of two deep correlation branches, used for classification and regression respectively. Then, a target box prediction module, including classification, quality assessment, and regression branches, is adopted to accurately predict the target position and bounding box in each video frame of the search sequence. Finally, the video tracking model is optimized by minimizing the defined joint loss and is then used for target tracking prediction on video sequences.

Description

Technical Field

[0001] This invention relates to the field of computer vision, and more specifically, to a high-performance video tracking method and system based on 3D twin convolutional networks.

Background Technology

[0002] Video target tracking refers to a technique that uses contextual information from video or image sequences to model the appearance and motion information of a target, thereby predicting the target's motion state and pinpointing its location. Typically, based on a target specified in the first frame of a video, the technique continuously tracks that specific target in subsequent video frames to achieve target localization and scale estimation. Video target tracking has wide-ranging applications, including video surveillance, autonomous driving, and precision guidance.

[0003] With the rapid development of deep learning and convolutional networks, more and more video object trackers based on convolutional networks have emerged. Researchers favor trackers based on Siamese networks, which offer an advantage in tracking speed while also achieving good accuracy. This type of tracker treats visual tracking as a similarity matching problem. In 2016, Bertinetto et al. first proposed the SiamFC tracker for visual tracking (Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr: Fully-Convolutional Siamese Networks for Object Tracking. ECCV Workshops (2) 2016: 850-865.), which uses a Siamese network to extract template and search features and uses correlation filtering to calculate the cross-correlation between the target template and the search region.

[0004] In recent years, some Siamese network-based methods have attempted to update templates online to address changes in the appearance of targets during continuous movement. Guo et al. proposed a dynamic Siamese network (Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. ICCV 2017: 1781-1789.), employing a transformation learning model for adaptive online learning. However, these methods use historical predictions for model updates and do not explicitly model the relationship between time and space. Furthermore, some methods have introduced attention mechanisms and Transformers into Siamese networks to improve the discriminability of targets in complex scenes. Chen et al. designed a feature fusion network using Transformers (Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, Huchuan Lu. Transformer Tracking. CVPR 2021: 8126-8135.), fusing template and search region features through cross-attention. Du et al. designed a dual correlation-guided attention module (Fei Du, Peng Liu, Wei Zhao, and Xianglong Tang. Correlation-guided attention for corner detection based visual tracking. CVPR 2020: 6835-6844.) to integrate target information. These attention-based trackers improve tracking accuracy, but they rely solely on spatial information and ignore the rich contextual information contained in the video. Siamese network-based visual tracking should make effective use of spatiotemporal information across frames, both to learn better spatiotemporal features for target appearance modeling and to obtain more discriminative search features for more accurate tracking and localization.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a high-performance video tracking method and system based on a 3D twin convolutional network. The method accepts template and search sequences as inputs and extracts spatiotemporally correlated feature information through a 3D twin network. A multi-template matching module is developed, utilizing an attention mechanism to pass template features to search features to enhance target features in the search block, resulting in more discriminative search features. Furthermore, a quality evaluation branch is introduced to avoid ambiguity in classification branch localization, further improving tracking accuracy. This tracker, which processes video targets sequentially and defines a multi-template matching mechanism to enhance search features by aggregating the spatiotemporal features of different targets in the sequence frames, achieves a balance between speed and accuracy, significantly improving the performance of the video target tracker.

[0006] To achieve the above objectives, the present invention provides a high-performance video tracking method based on a 3D twin convolutional network, comprising the following steps:

[0007] S1. Construct the network architecture, which consists of a spatiotemporal feature extractor, a multi-template matching module, and a target prediction module;

[0008] S2. Given template sequence video frames and search sequence video frames respectively, cut them into template sequence blocks and search sequence blocks as input to the entire network architecture;

[0009] S3. Construct a spatiotemporal feature extractor. This module is a 3D twin fully convolutional network, including a template branch and a search branch, using a 3D fully convolutional network as the base network and sharing weights. Taking template sequence blocks and search sequence blocks as input, the spatiotemporal feature extractor extracts template spatiotemporal features and search spatiotemporal features from them.

[0010] S4. Construct a multi-template matching module, including a template feature conversion submodule and a spatiotemporal feature matching submodule. The template feature conversion submodule is used to pass the appearance and motion information in the template frame to the search branch to obtain more discriminative search features. The spatiotemporal feature matching submodule consists of two deep correlation branches, which are used for classification and regression respectively. It takes the template spatiotemporal features and the enhanced search features as input to obtain multi-channel correlation filter features.

[0011] S5. Construct the target prediction module, which mainly consists of a classification head, a quality assessment head, and a regression head. The multi-channel correlation filter features output from the classification branch are used as inputs to the classification head and the quality assessment head to obtain the classification score map and the quality assessment score map; the multi-channel correlation filter features output from the regression branch are used as inputs to the regression head to obtain the regression map.

[0012] S6. Use the classification score map and quality assessment score map to locate the position of the target in each video frame of the sequence; estimate the target scale of each video frame in the sequence based on the regression map; finally obtain the target prediction box for each video frame in the search sequence.

[0013] S7. The network model is optimized by minimizing the joint loss, which includes the cross-entropy losses for classification and quality assessment and the intersection-over-union (IoU) loss for regression, ultimately resulting in a high-performance video target tracker model.

[0014] S8. Using the trained network model as a visual tracker, perform target tracking one video sequence at a time. To ensure stable and accurate tracking, a confidence search region estimation strategy is defined. The search region for the next sequence is cropped based on the different target states in the current video sequence to reduce error accumulation and accurately locate the target in each video frame of the search sequence.

[0015] This invention provides an end-to-end trainable neural network architecture and system for video target tracking, including: a video sequence input unit for cropping template sequence blocks and search sequence blocks; a model training unit for training a high-performance video target tracker by minimizing the joint loss, which includes the cross-entropy losses of the classification and quality assessment heads and the intersection-over-union (IoU) loss of the regression head, so that tracking is ultimately performed one video sequence at a time; and a video target tracking unit that uses the classification score map, quality assessment score map, and regression map output by the model to estimate the target state and predict the target scale in the video frames of the search sequence, and calculates the target prediction boxes in the search sequence. Using the target prediction boxes of the current video sequence, the unit calculates the confidence search region of the next video sequence and inputs it into the search branch for target tracking of subsequent video sequences.

[0016] Compared with existing technologies, it has the following beneficial effects:

[0017] This invention utilizes a 3D twin fully convolutional network to extract template spatiotemporal features and search spatiotemporal features, learning rich spatiotemporal information across multiple consecutive video frames. The extracted template spatiotemporal features and search spatiotemporal features are input into a multi-template matching sub-network to obtain multi-channel correlation-filtered features. A classification head and a quality assessment head are used to process the multi-channel correlation-filtered features of the classification branch to predict target localization; a regression head is used to process the multi-channel correlation-filtered features of the regression branch to estimate the target scale. In the target tracking stage, to obtain a more accurate search sequence region, a confidence search region estimation strategy is defined. The next search sequence region is estimated based on the different states of the target in the current video sequence, ensuring the stability and accuracy of target tracking. This method processes video target tracking as a video sequence, ensuring the speed of video tracking and capturing the spatiotemporal correlation between video frames; simultaneously, the introduction of a multi-template matching mechanism yields more discriminative search features, and the addition of a quality assessment branch in the classification process improves the accuracy of video target tracking.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is the overall network structure diagram of the present invention.

[0020] Figure 2 is a schematic diagram of the template sequence block and the search sequence block of the present invention.

[0021] Figure 3 is a schematic diagram of the spatiotemporal feature extractor structure of the present invention.

[0022] Figure 4 is a schematic diagram of the multi-template matching module structure of the present invention.

[0023] Figure 5 is a diagram of the confidence search region estimation of the present invention.

[0024] Figure 6 is a schematic diagram of some video frames of the present invention.

[0025] Figure 7 is a schematic diagram of the video target tracking result of the present invention.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other. The invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0027] A video target tracking method with deep spatiotemporal correlation includes steps S1 to S8.

[0028] S1. Construct the network architecture, which consists of a spatiotemporal feature extractor, a multi-template matching module, and a target prediction module;

[0029] S2. Given template sequence video frames and search sequence video frames respectively, cut them into template sequence blocks and search sequence blocks as input to the entire network architecture;

[0030] S3. Construct a spatiotemporal feature extractor. This module is a 3D twin fully convolutional network, including a template branch and a search branch, using a 3D fully convolutional network as the base network and sharing weights. Taking template sequence blocks and search sequence blocks as input, the spatiotemporal feature extractor extracts template spatiotemporal features and search spatiotemporal features from them.

[0031] S4. Construct a multi-template matching module, including a template feature conversion submodule and a spatiotemporal feature matching submodule. The template feature conversion submodule is used to pass the appearance and motion information in the template frame to the search branch to obtain more discriminative search features. The spatiotemporal feature matching submodule consists of two deep correlation branches, which are used for classification and regression respectively. It takes the template spatiotemporal features and the enhanced search features as input to obtain multi-channel correlation filter features.

[0032] S5. Construct the target prediction module, which mainly consists of a classification head, a quality assessment head, and a regression head. The multi-channel correlation filter features output from the classification branch are used as inputs to the classification head and the quality assessment head to obtain the classification score map and the quality assessment score map; the multi-channel correlation filter features output from the regression branch are used as inputs to the regression head to obtain the regression map.

[0033] S6. Use the classification score map and quality assessment score map to locate the position of the target in each video frame of the sequence; estimate the target scale of each video frame in the sequence based on the regression map; finally obtain the target prediction box for each video frame in the search sequence.

[0034] S7. The network model is optimized by minimizing the joint loss, which includes the cross-entropy losses for classification and quality assessment and the intersection-over-union (IoU) loss for regression, ultimately resulting in a high-performance video target tracker model.

[0035] S8. Using the trained network model as a visual tracker, perform target tracking one video sequence at a time. To ensure stable and accurate tracking, a confidence search region estimation strategy is defined. The search region for the next sequence is cropped based on the different target states in the current video sequence to reduce error accumulation and accurately locate the target in each video frame of the search sequence.

[0036] In step S1, the network architecture is constructed. As shown in Figure 1, the network consists of a spatiotemporal feature extractor, a multi-template matching module, and a target prediction module. The specific steps are as follows:

[0037] S11. Construct a spatiotemporal feature extractor based on a 3D twin network, including a template branch and a search branch, using a 3D fully convolutional neural network as the base network with shared weights, to extract template spatiotemporal features and search spatiotemporal features from the input video sequence blocks.

[0038] S12. Construct a multi-template matching module, including a template feature conversion submodule and a spatiotemporal feature matching submodule. The template feature conversion submodule is used to pass the appearance and motion information in the template frame to the search branch to obtain more discriminative search features. The spatiotemporal feature matching submodule consists of two deep correlation branches, which are used for classification and regression respectively. The template spatiotemporal features and the enhanced search features are used as inputs to obtain multi-channel correlation filter features.

[0039] S13. The target prediction module includes a classification head, a quality assessment head, and a regression head. It takes multi-channel correlation filter features as input and obtains the classification score map, the quality assessment score map, and the regression map through the classification head, the quality assessment head, and the regression head, respectively.

[0040] In step S2, the given template sequence video frames and search sequence video frames are cropped into template sequence blocks and search sequence blocks, as shown in Figure 2, which serve as the input to the overall network architecture. The specific steps are as follows:

[0041] S21. Given a template sequence, obtain the target's center position, width, and height in each video frame of the template sequence from the ground-truth annotation of that frame.

[0042] S211. Based on the ground-truth target box given in S21, calculate the expansion values for the width and height of the target box, and calculate the scaling factor used to rescale the expanded target region. If the expanded target region extends beyond the boundary of the video frame, the missing area is filled with the average RGB value of the current video frame. Finally, each video frame in the template sequence is cropped to a template block of a specified size.

[0043] S212. After each video frame in the template sequence is cropped, a set of template blocks is obtained, one per frame, with the count equal to the total number of video frames in the template sequence.

[0044] S22. Given a search sequence, obtain the target's center position, width, and height from the ground-truth information of the target in the first video frame of the template sequence.

[0045] S221. Based on the ground-truth target bounding box given in S22, calculate the expansion values for the width and height of the target bounding box, and calculate the scaling factor used to rescale the expanded target region. If the expanded target region extends beyond the boundary of the video frame, the missing area is filled with the average RGB value of the current video frame. Finally, each video frame in the search sequence is cropped to a search block of a specified size.

[0046] S222. After each video frame in the search sequence is cropped, a set of search blocks is obtained, one per frame, with the count equal to the total number of video frames in the search sequence.
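The crop geometry in S211 and S221 can be sketched as follows. The expansion formula itself is not given in this text, so the sketch assumes the margin convention popularized by SiamFC (a margin p proportional to w + h and a square region of side sqrt((w + p)(h + p))); the function names, the `context` parameter, and the output size are hypothetical.

```python
import math


def crop_geometry(cx, cy, w, h, frame_w, frame_h, out_size, context=0.5):
    """Return the square crop side, the scaling factor, and per-side padding."""
    # Assumed SiamFC-style margin added around the target box.
    p = context * (w + h)
    # Side of the square region covering the box plus its margin.
    side = math.sqrt((w + p) * (h + p))
    # Factor that rescales the cropped region to the network input size.
    scale = out_size / side
    x1, y1 = cx - side / 2, cy - side / 2
    x2, y2 = cx + side / 2, cy + side / 2
    # Part of the region that falls outside the frame on each side;
    # this area would be filled with the frame's average RGB value.
    pad = {
        "left": max(0.0, -x1),
        "top": max(0.0, -y1),
        "right": max(0.0, x2 - frame_w),
        "bottom": max(0.0, y2 - frame_h),
    }
    return side, scale, pad


def mean_rgb(pixels):
    """Average colour of a frame given as a flat list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(px[c] for px in pixels) / n for c in range(3))
```

For a 40 x 40 target centered at (20, 30) in a 100 x 100 frame, the region has side 80 and extends 20 pixels past the left edge and 10 pixels past the top edge, so those strips would be padded with the mean colour.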

[0047] In step S3, the spatiotemporal feature extractor is a 3D twin fully convolutional network, including a template branch and a search branch, using a 3D fully convolutional network as the base network and sharing weights. Taking the template sequence block and the search sequence block as input, the spatiotemporal feature extractor extracts the template spatiotemporal features and the search spatiotemporal features from them. The specific steps are as follows:

[0048] S31. Construct the feature extraction network. As shown in Figure 3, each branch is a Res2+1D network consisting of five residual blocks.

[0049] S32. Modify the padding setting of the first residual block of Res2+1D, set the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and remove the downsampling and the final classification layer of the fifth residual block. As a result, the output spatiotemporal features have the same duration as the input video sequence.

[0050] S33. Input the template blocks and search blocks obtained in steps S212 and S222 into the spatiotemporal feature extractor to obtain the template spatiotemporal features and the search spatiotemporal features, respectively.
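Why the padding and downsampling changes in S32 preserve the temporal length can be illustrated with the standard convolution output-size formula. This is a sketch only: the kernel, stride, and padding values below are assumptions for illustration, not the settings of the patented network.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def temporal_length(frames, layers):
    """Apply a stack of (kernel, stride, padding) temporal convolutions."""
    for kernel, stride, padding in layers:
        frames = conv_out_len(frames, kernel, stride, padding)
    return frames
```

With a temporal kernel of 3, stride 1, and padding 1, every layer keeps the number of frames unchanged; a stride of 2 or missing padding shortens the sequence.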

[0051] In step S4, a multi-template matching module is constructed, including a template feature transformation submodule and a spatiotemporal feature matching submodule. The template feature transformation submodule is used to pass the appearance and motion information in the template frame to the search branch to obtain more discriminative search features. The spatiotemporal feature matching submodule consists of two deep correlation branches, used for classification and regression respectively, and takes the template spatiotemporal features and the enhanced search features as input to obtain multi-channel correlation filtering features. The specific steps are as follows:

[0052] S41. As shown in Figure 4, the template feature transformation submodule establishes a connection between the template features and the search features using a cross-attention (interactive attention) mechanism. First, the template features are reshaped into a spatiotemporal matrix, and the search features are likewise reshaped into a spatiotemporal matrix.

[0053] S411. Calculate the cross-attention matrix using the attention mechanism. The details are as follows:

[0054] (1)

[0055] Here, one operator in Equation (1) denotes a linear transformation and the other denotes a normalization operation.
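A minimal sketch of a cross-attention matrix between search and template positions is given below. It assumes that the normalization is a row-wise softmax and leaves out the linear transformation (treated as the identity); both choices, along with the function names, are assumptions and not the exact form of Equation (1).

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search, template):
    """search: N_s x C, template: N_t x C (lists of lists).

    Returns an N_s x N_t matrix whose row i holds the attention weights
    of search position i over all template positions.
    """
    scores = [
        [sum(s * t for s, t in zip(s_vec, t_vec)) for t_vec in template]
        for s_vec in search
    ]
    return [softmax(row) for row in scores]
```

Each row sums to 1, so it can be used directly as a set of weights for passing template information to the corresponding search position.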

[0056] S412. Given the template spatiotemporal features, calculate a Gaussian mask for each feature map, defined over the pixel positions of that feature map and centered on the ground-truth center position of the target in the corresponding frame. The resulting set of Gaussian masks is then reshaped to match the dimensions of the spatiotemporal matrix.
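A Gaussian mask of this kind can be sketched as below. The isotropic form and the `sigma` parameter are assumptions, since the mask formula is not reproduced in this text.

```python
import math


def gaussian_mask(height, width, cy, cx, sigma=1.0):
    """Gaussian mask over a height x width grid, peaking at (cy, cx)."""
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

The mask equals 1 at the target center and decays with distance, which is what lets it emphasize the target and attenuate the background.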

[0057] S413. Using the cross-attention matrix as attention weights, calculate the transferred mask and multiply it element-wise with the search features to obtain the masked search features:

[0058] (2)

[0059] Here, one operator in Equation (2) denotes element-wise multiplication and the other denotes instance normalization. The masked search features make it possible to find the potential location of the target in the search area more accurately.
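The mask transfer in S413 can be sketched as an attention-weighted sum of the flattened template mask followed by an element-wise product with the search features. The sketch omits the instance normalization mentioned above, and the function names and data layout are assumptions.

```python
def transfer_mask(attention, template_mask):
    """attention: N_s x N_t weights, template_mask: N_t values.

    Returns one mask value per search position.
    """
    return [sum(a * g for a, g in zip(row, template_mask)) for row in attention]


def mask_search_features(search, transferred_mask):
    """Scale each search feature vector (N_s x C) by its mask value."""
    return [[v * m for v in vec] for vec, m in zip(search, transferred_mask)]
```

A search position that attends mostly to template positions near the target center receives a mask value close to 1 and keeps its features, while positions that attend to background are attenuated.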

[0060] S414. At the same time, the contextual information in the template sequence is encoded and passed to the search features. The masked template features are calculated from the Gaussian mask set in order to highlight the target location in the search area and weaken background interference; they are then passed to the search branch to obtain the transferred masked template features;

[0061] S415. The transferred masked template features are then added to the search features to obtain the transferred search features. The specific calculation is as follows:

[0062] (3)

[0063] S416. The masked search features and the transferred search features are added element by element to obtain the enhanced search features, which are then reshaped back to the original feature dimensions.

[0064] S42. The template features obtained in S3 are fed into the classification branch and the regression branch of the spatiotemporal feature matching submodule to obtain the classification template features and the regression template features. Similarly, the enhanced search features from S41 are fed into the classification branch and the regression branch to obtain the classification search features and the regression search features, as shown in Figure 4.

[0065] S421. To make effective use of the template spatiotemporal features, the classification and regression template features are averaged along the time dimension, and the averaged features are then replicated. This yields the best classification template features and the best regression template features, which are used as inputs to the classification branch and the regression branch, respectively.
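The averaging and replication in S421 can be sketched as follows, with each frame's features flattened to a plain list. The number of copies is not stated in this text; the sketch takes it as a parameter, and a natural assumption is one copy per search frame.

```python
def best_template(features, copies):
    """features: list over time of equally sized flat feature vectors.

    Returns `copies` independent copies of the temporal mean.
    """
    t = len(features)
    mean = [sum(frame[i] for frame in features) / t for i in range(len(features[0]))]
    return [list(mean) for _ in range(copies)]
```

Averaging pools appearance information from all template frames into a single template, and replicating it lets the same template be matched against every frame of the search sequence.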

[0066] S422. Perform a deep correlation filtering operation on the best template features in the classification and regression branches and the enhanced search features. The specific calculation is as follows:

[0067] (4)

[0068] (5)

[0069] Here, one symbol denotes the classification branch, another denotes the regression branch, and the operator denotes deep (depth-wise) correlation filtering.

[0070] S43. The classification branch and the regression branch each output multi-channel correlation filter features.
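The correlation in S422 is read here as depth-wise cross-correlation, in which each template channel slides over the search channel with the same index, so the output keeps one response map per channel. This reading and the pure-Python layout are assumptions; Equations (4) and (5) are not reproduced in this text.

```python
def depthwise_xcorr(search, kernel):
    """search: C x H x W, kernel: C x h x w (nested lists).

    Returns C response maps of size (H - h + 1) x (W - w + 1).
    """
    out = []
    for s_ch, k_ch in zip(search, kernel):
        H, W = len(s_ch), len(s_ch[0])
        h, w = len(k_ch), len(k_ch[0])
        channel = []
        for i in range(H - h + 1):
            row = []
            for j in range(W - w + 1):
                # Inner product of the kernel with the window at (i, j).
                row.append(
                    sum(
                        s_ch[i + a][j + b] * k_ch[a][b]
                        for a in range(h)
                        for b in range(w)
                    )
                )
            channel.append(row)
        out.append(channel)
    return out
```

Because channels are not summed together, the result is a multi-channel feature that the classification, quality assessment, and regression heads can process further.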

[0071] In step S5, a target prediction module is constructed, mainly composed of a classification head, a quality assessment head, and a regression head. The multi-channel correlation filter features output from the classification branch are used as inputs to the classification head and the quality assessment head to obtain the classification score map and the quality assessment score map; the multi-channel correlation filter features output from the regression branch are used as inputs to the regression head to obtain the regression map. The specific steps are as follows:

[0072] S51. The classification head consists of a convolutional layer. It takes the multi-channel correlation filter features output from the classification branch in S42 as input and outputs the classification score map.

[0073] S52. The quality assessment head consists of a convolutional layer. It takes the multi-channel correlation filter features output from the classification branch in S42 as input and outputs the quality assessment score map.

[0074] S53. The regression head consists of a convolutional layer. It takes the multi-channel correlation filter features output from the regression branch in S42 as input and outputs the regression map.

[0075] In step S6, the position of the target in each video frame of the sequence is located using the classification score map and the quality assessment score map; the target scale in each video frame of the sequence is estimated based on the regression map; finally, the target prediction box for each video frame in the search sequence is obtained. The specific steps are as follows:

[0076] S61. Multiply the quality assessment score map and the classification score map at corresponding positions to obtain the confidence classification score map. For each frame of the video sequence, find the point with the largest response value in the confidence classification score map and map it back to the original video frame using the total stride of the entire network.

[0077] S62. The regression map has four channels. For each frame of the video sequence, the four channel values give the offsets of the regressed target, and the predicted target bounding box coordinates can be expressed as:

[0078] (6)

[0079] Here, the two coordinate pairs in Equation (6) are the top-left and bottom-right corners of the target prediction box.
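Steps S61 and S62 can be sketched as follows. The mapping from score-map indices to frame coordinates (index times stride plus an offset) and the interpretation of the four regression channels as distances to the left, top, right, and bottom sides are assumptions in the style of anchor-free trackers; the function names are hypothetical.

```python
def locate_target(cls_map, quality_map, stride, offset=0):
    """Return ((x, y), score) of the best position in frame coordinates."""
    best, best_pos = float("-inf"), (0, 0)
    for i, (c_row, q_row) in enumerate(zip(cls_map, quality_map)):
        for j, (c, q) in enumerate(zip(c_row, q_row)):
            # Confidence score: classification score weighted by quality.
            score = c * q
            if score > best:
                best, best_pos = score, (i, j)
    i, j = best_pos
    return (offset + j * stride, offset + i * stride), best


def decode_box(x, y, left, top, right, bottom):
    """Turn a position and its four side distances into (x1, y1, x2, y2)."""
    return (x - left, y - top, x + right, y + bottom)
```

Weighting by the quality score means a position with a high classification score but far from the target center can lose to a better-centered position, which is the ambiguity the quality assessment branch is meant to resolve.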

[0080] In step S7, the network model is optimized by minimizing the joint loss, which includes the cross-entropy losses for classification and quality assessment and the intersection-over-union (IoU) loss for regression, ultimately yielding a high-performance video object tracker model. The specific steps are as follows:

[0081] S71. The total training loss is defined as:

[0082] (7)

[0083] Here, the total loss is composed of the loss of each search frame, taken over the total number of classification score maps (equivalently, quality assessment score maps or regression maps). For each position in each search block, the network outputs the probability that the position belongs to the target, the probability that the position is close to the center of the target, and, in the regression map, the offsets from the position to the four sides of the bounding box.

[0084] S72. The training loss of each search frame, which combines the cross-entropy losses for classification and quality assessment and the IoU loss for regression, is defined as follows:

[0085] (8)

[0086] Here, the indicator function marks whether a position belongs to the target: it is assigned 1 if the position belongs to the target and 0 otherwise. The three loss terms are the cross-entropy loss for classification, the cross-entropy loss for quality assessment, and the IoU loss for regression. The classification label is 1 if the current position is a positive sample, meaning that the position belongs to the target, and 0 if it is a negative sample. The quality assessment label is the probability that a position in a search block is close to the center of the ground-truth target box: the closer the position is to the center, the higher the value, and the farther away, the lower the value. The regression label is the offset from a position in a search block to the four sides of the ground-truth bounding box. The loss is normalized by the total number of positive samples, and each loss term has its own weight.
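A per-frame joint loss of the kind described can be sketched as below. The exact form of Equations (7) and (8) is not reproduced in this text, so the sketch assumes binary cross-entropy for classification over all positions, binary cross-entropy for quality assessment and a negative-log IoU loss for regression over positive positions only, normalization by the number of positive samples, and hypothetical weights `w_quality` and `w_reg`.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between probability p and label y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def iou_loss(pred, target):
    """pred/target: positive (left, top, right, bottom) distances from one position."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    inter = (min(pl, tl) + min(pr, tr)) * (min(pt, tt) + min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -math.log(inter / union)


def frame_loss(samples, w_quality=1.0, w_reg=1.0):
    """samples: dicts with cls_p and label; positives also carry q_p, q_t, reg_p, reg_t."""
    n_pos = max(1, sum(s["label"] for s in samples))
    cls = sum(bce(s["cls_p"], s["label"]) for s in samples)
    pos = [s for s in samples if s["label"] == 1]
    qa = sum(bce(s["q_p"], s["q_t"]) for s in pos)
    reg = sum(iou_loss(s["reg_p"], s["reg_t"]) for s in pos)
    return (cls + w_quality * qa + w_reg * reg) / n_pos
```

The loss for a whole search sequence would then be the sum of `frame_loss` over its frames. Restricting the quality and regression terms to positive positions mirrors the role of the indicator function in the description.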

[0087] In step 8, the trained network model is used as a visual tracker that tracks the target in a given video one video sequence at a time. To ensure stable and accurate tracking, a confidence search region estimation strategy is defined. This strategy crops the search region for the next sequence according to the target states in the current video sequence, which reduces error accumulation and allows the target to be located accurately in each video frame of the search sequence. The specific steps are as follows:

[0088] S81. Since the target may change position significantly within a video sequence, the minimum bounding box is calculated from the predicted bounding boxes of the current search sequence, one box per frame, using the top-left and bottom-right corner coordinates of each predicted box, as shown in Figure 4.

[0089] S82. The minimum bounding box is expanded to give the search region used for cropping the next set of video sequences. This ensures that the search region covers the target in every video frame of the search sequence. The video target tracking results are shown in Figure 6.
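Steps S81 and S82 can be sketched as follows. This is an illustration under stated assumptions: boxes are stored as (x1, y1, x2, y2), the expansion is a scaling about the box center, and the factor `scale` is hypothetical because the text does not state how far the box is expanded.

```python
import numpy as np

def min_enclosing_box(boxes):
    """Smallest box (x1, y1, x2, y2) that contains every predicted box.
    boxes: array-like of shape (N, 4), one predicted box per frame."""
    boxes = np.asarray(boxes, dtype=float)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()   # extreme top-left corner
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()   # extreme bottom-right corner
    return np.array([x1, y1, x2, y2])

def expand_box(box, scale):
    """Scale a box about its center. `scale` is an assumed parameter."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    w, h = (box[2] - box[0]) * scale, (box[3] - box[1]) * scale
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# Two frames in which the target moved to the right and upward.
predicted = [[10, 20, 50, 60], [30, 10, 70, 40]]
region = expand_box(min_enclosing_box(predicted), scale=2.0)
```

An expanded region may extend past the frame boundary; the cropping described for the search blocks pads such areas with the average RGB value of the frame.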

[0090] According to another aspect of this application, a deep spatiotemporal correlation video target tracking system is also provided, comprising the following units:

[0091] Video sequence input unit: given a set of template sequence video frames and search sequence video frames, crops them into template sequence blocks and search sequence blocks of the specified sizes in the manner described in S2.

[0092] Model training unit: used to train a video target tracker based on a 3D Siamese network, which includes a spatiotemporal feature extractor module, a multi-template matching module, and a target prediction module. The spatiotemporal feature extractor takes the template sequence blocks and search sequence blocks as input, extracts the template spatiotemporal features and search spatiotemporal features from them, and passes these features to the multi-template matching module. The multi-template matching module includes a feature transformation submodule and a spatiotemporal feature matching submodule. In the feature transformation submodule, an interactive attention mechanism passes the template spatiotemporal features to the search features to obtain more discriminative search features. The spatiotemporal feature matching submodule calculates the optimal template features and feeds them, together with the enhanced search features, into the classification branch and the regression branch. In each branch, deep correlation filtering matches the optimal template features against the enhanced search features by similarity in a high-dimensional feature space, producing multi-channel correlation filter features. The output of the classification branch is fed into the classification head and the quality assessment head of the target prediction module to obtain the classification score map and the quality assessment score map; the output of the regression branch is fed into the regression head to obtain the regression map. The tracker is trained by minimizing the cross-entropy losses for classification and quality assessment together with the intersection-over-union (IoU) loss for regression.

[0093] Video target tracking unit: in the testing phase, combines the classification score map and the quality assessment score map output by the model to obtain a confidence classification map. The confidence classification map is used to estimate the target's position, and the regression map output by the model is used to predict the target's scale, which yields the target prediction boxes in the search sequence. From this set of target prediction boxes, a set of confidence search regions is obtained and input into the search branch for target tracking in subsequent video sequences.
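The test-time decoding performed by the tracking unit can be sketched for a single frame as below. Several details are assumptions rather than the formulas of the method: the regression channels are taken in the order (left, top, right, bottom), a feature-map cell is mapped to the frame as `index * stride + offset`, and the values of `stride` and `offset` are placeholders.

```python
import numpy as np

def decode_frame(cls_map, qua_map, reg_map, stride=8, offset=0):
    """Decode one frame.
    cls_map, qua_map: score maps of shape (H, W).
    reg_map: four-channel regression map of shape (4, H, W).
    Returns the predicted box (x1, y1, x2, y2) and its confidence."""
    conf = cls_map * qua_map                          # confidence classification map
    row, col = np.unravel_index(np.argmax(conf), conf.shape)
    x = col * stride + offset                         # position in the cropped frame
    y = row * stride + offset
    l, t, r, b = reg_map[:, row, col]                 # offsets to the four box sides
    box = np.array([x - l, y - t, x + r, y + b], dtype=float)
    return box, float(conf[row, col])

# Toy maps: the peak response sits at row 1, column 2.
cls_map = np.full((3, 3), 0.1); cls_map[1, 2] = 0.9
qua_map = np.full((3, 3), 0.5); qua_map[1, 2] = 0.8
reg_map = np.full((4, 3, 3), 5.0)
box, score = decode_frame(cls_map, qua_map, reg_map)
```

Multiplying the two maps before taking the maximum suppresses locations that score well for classification but lie far from the target center.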

[0094] This system is used to implement the functions of the methods in the above embodiments. The specific implementation steps of the methods involved in the system module have been described in the methods and will not be repeated here.

[0095] In this embodiment, first, a spatiotemporal feature extractor based on a 3D twin network is designed to extract spatiotemporal features from the template sequences and search sequences, which improves the tracker's discriminative ability. Second, a multi-template matching module is designed to enhance the target features in the search block by passing template features to the search features. This module includes a template feature transformation submodule and a spatiotemporal feature matching submodule. The template feature transformation submodule passes appearance and motion information from the template frames to the search branch, resulting in more discriminative search features. The spatiotemporal feature matching submodule consists of two deep correlation branches, used for classification and regression respectively, to achieve more efficient information association. Then, a target prediction module with classification, quality assessment, and regression branches is employed to accurately predict the target position and bounding box in each video frame of the search sequence. Finally, the video tracking model is optimized by minimizing a defined joint loss, and is then used to predict the target one video sequence at a time. In the target tracking test, a confidence search region estimation strategy is defined that calculates the search region of the next video sequence from the tracking result of the current video sequence, which reduces error accumulation and maintains robust and accurate target tracking in the video sequence.
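The matching step in the two deep correlation branches can be illustrated with a small NumPy sketch. It assumes that "deep correlation filtering" refers to depth-wise cross-correlation, in which each template channel is slid over the same channel of the search feature, and that the optimal template feature is the average of the template features over time, as the claims describe. The function names are hypothetical, and the loops favor clarity over speed.

```python
import numpy as np

def best_template(template_seq):
    """Average template features over the time axis.
    template_seq: shape (T, C, H, W). Returns shape (C, H, W)."""
    return template_seq.mean(axis=0)

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation without padding.
    search: (C, Hs, Ws); template: (C, Ht, Wt).
    Returns (C, Hs - Ht + 1, Ws - Wt + 1), one response map per channel."""
    c, hs, ws = search.shape
    ct, ht, wt = template.shape
    if c != ct:
        raise ValueError("search and template need the same channel count")
    ho, wo = hs - ht + 1, ws - wt + 1
    out = np.zeros((c, ho, wo))
    for i in range(ho):
        for j in range(wo):
            window = search[:, i:i + ht, j:j + wt]
            # Each channel is correlated only with its own template channel.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out
```

For a search sequence, the same averaged template would be correlated with the enhanced search feature of each frame, giving one multi-channel response per frame.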

[0096] The above description is merely a preferred embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural transformations made using the contents of the present invention's specification and drawings under the inventive concept of the present invention, or direct / indirect applications in other related technical fields, are included within the patent protection scope of the present invention.

Claims

1. A high-performance video tracking method based on 3D twin convolutional networks, characterized in that the method is executed by a computer and includes the following steps:
S1. Construct a network architecture consisting of a spatiotemporal feature extractor, a multi-template matching module, and a target prediction module; this network architecture improves the model's detection and localization capabilities, resulting in more accurate video target tracking results, and includes:
S11. The spatiotemporal feature extractor based on a 3D twin network, which includes a template branch and a search branch that share weights and use a 3D fully convolutional neural network as the base network; it is used to extract template spatiotemporal features and search spatiotemporal features from the input template sequence block and search sequence block;
S12. The multi-template matching module, which includes a template feature conversion submodule and a spatiotemporal feature matching submodule; the template feature conversion submodule establishes a connection between template features and search features through an interactive attention mechanism, passing the appearance and motion information in the template frames to the search branch to obtain more discriminative search features; the spatiotemporal feature matching submodule consists of two deep correlation branches, used for classification and regression respectively, and takes the template spatiotemporal features and the enhanced search features as input to obtain multi-channel correlation filter features;
S13. The target prediction module, which includes a classification head, a quality assessment head, and a regression head, each composed of … convolutional layers that take the multi-channel correlation filter features as input; the heads are used to obtain the classification score map, the quality assessment score map, and the regression map, respectively;
S2. Given template sequence video frames and search sequence video frames, crop them into template sequence blocks and search sequence blocks as input to the entire network architecture;
S3. Construct the spatiotemporal feature extractor; this module is a 3D twin fully convolutional network that includes a template branch and a search branch, which share weights and use a 3D fully convolutional network as the base network; taking the template sequence block and the search sequence block as input, the spatiotemporal feature extractor extracts the template spatiotemporal features and the search spatiotemporal features from them;
S4. Construct the multi-template matching module, including a template feature conversion submodule and a spatiotemporal feature matching submodule; the template feature conversion submodule passes the appearance and motion information in the template frames to the search branch to obtain more discriminative search features; the spatiotemporal feature matching submodule consists of two deep correlation branches, used for classification and regression respectively, and takes the template spatiotemporal features and the enhanced search features as input to obtain multi-channel correlation filter features;
S5. Construct the target prediction module, consisting of a classification head, a quality assessment head, and a regression head; use the multi-channel correlation filter features output by the classification branch as input to the classification head and the quality assessment head to obtain the classification score map and the quality assessment score map; use the multi-channel correlation filter features output by the regression branch as input to the regression head to obtain the regression map;
S6. Use the classification score map and the quality assessment score map to locate the target in each video frame of the sequence; estimate the target scale in each video frame of the sequence from the regression map; finally obtain the target prediction box for each video frame in the search sequence;
S7. Optimize the network model by minimizing the joint loss, which comprises the cross-entropy losses for classification and quality assessment and the intersection-over-union (IoU) loss for regression, ultimately obtaining a high-performance video target tracker model;
S8. Using the trained network model as a visual tracker, perform target tracking on a given video one video sequence at a time; to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, in which the search region of the next sequence is cropped according to the target states in the current video sequence, so as to reduce error accumulation and accurately locate the target in each video frame of the search sequence.

2. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the template sequence blocks and search sequence blocks are constructed as follows:
S21. Given a template sequence, obtain the center position, width, and height of the target from the ground-truth information of the target in each video frame of the template sequence;
S211. Based on the ground-truth target box given in S21, calculate the padding values for the width and height of the target box, and calculate the scaling factor used to scale the expanded target box region; if the target box region, after the padding values are added, exceeds the boundary of the video frame, the excess area is filled with the average RGB value of the current video frame; finally, each video frame in the template sequence is cropped into a template block of the specified size;
S212. Cropping each video frame in the template sequence yields the template blocks, one per frame, up to the total number of video frames in the template sequence;
S22. Given a search sequence, obtain the center position, width, and height of the target from the ground-truth information of the target in the first video frame of the template sequence;
S221. Based on the ground-truth target box given in S22, calculate the padding values for the width and height of the target box, and calculate the scaling factor used to scale the expanded target box region; if the target box region, after the padding values are added, exceeds the boundary of the video frame, the excess area is filled with the average RGB value of the current video frame; finally, each video frame in the search sequence is cropped into a search block of the specified size;
S222. Cropping each video frame in the search sequence yields the search blocks, one per frame, up to the total number of video frames in the search sequence.

3. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the spatiotemporal feature extractor is constructed as follows:
S31. Construct a spatiotemporal feature extraction network in which both the template branch and the search branch are Res2+1D networks composed of five residual blocks;
S32. Modify the padding property of the first residual block of Res2+1D to …; set both the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and remove the downsampling of the fifth residual block and the final classification layer, so that the output spatiotemporal features have the same temporal length as the input video sequence;
S33. Input the template blocks and search blocks obtained in steps S212 and S222 into the spatiotemporal feature extractor to obtain the template spatiotemporal features and the search spatiotemporal features, respectively.

4. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the multi-template matching module is constructed as follows:
S41. First, convert the template features into a spatiotemporal matrix and the search features into a spatiotemporal matrix, each of its respective dimensions;
S411. Calculate the cross-attention matrix using the attention mechanism according to Equation (1), which applies linear transformation operations followed by a normalization operation;
S412. Given the template spatiotemporal features, calculate a Gaussian mask for each feature map, defined over the pixel positions of the feature map and centered on the true center position of the target in each frame; the resulting Gaussian mask set is then converted to the matching dimensions;
S413. Using the cross-attention matrix as attention weights, calculate the transferred mask and multiply it element-wise with the search features to obtain the masked search features according to Equation (2), which uses element-wise multiplication and an instance normalization operation; this makes it possible to find the potential location of the target in the search region more accurately;
S414. At the same time, encode the contextual information in the template sequence and pass it to the search features: based on the Gaussian mask set, calculate the masked template features, so as to highlight the target location in the search region and reduce background interference; these features are then passed to the search branch to obtain the transferred masked template features;
S415. Add the transferred masked template features to the search features to obtain the transferred search features, according to Equation (3);
S416. Add the masked search features and the transferred search features element by element to obtain the enhanced search features, and convert them back to the original feature dimensions;
S42. Feed the template features obtained in S3 into the classification and regression branches of spatiotemporal feature matching to obtain the classification template features and the regression template features; similarly, feed the enhanced search features obtained in S4 into the classification and regression branches of spatiotemporal feature matching to obtain the classification search features and the regression search features;
S421. To make effective use of the template spatiotemporal features, average the classification and regression template features along the time dimension and replicate the averaged features, finally obtaining the optimal classification template features and the optimal regression template features, which serve as inputs to the classification and regression branches, respectively;
S422. Perform a deep correlation filtering operation between the optimal template features and the enhanced search features in the classification and regression branches, according to Equations (4) and (5), whose symbols denote the classification branch, the regression branch, and deep correlation filtering;
S43. The classification branch and the regression branch respectively output multi-channel correlation filter features.

5. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the video sequence target prediction module is constructed as follows:
S51. The classification head takes the multi-channel correlation filter features output by the classification branch in S42 as input and outputs the classification score map;
S52. The quality assessment head takes the multi-channel correlation filter features output by the classification branch in S42 as input and outputs the quality assessment score map;
S53. The regression head takes the multi-channel correlation filter features output by the regression branch in S42 as input and outputs the regression map.

6. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the target location is predicted and the bounding box scale is estimated as follows:
S61. Given the sizes of the classification score map and the quality assessment score map, multiply the quality assessment score map and the classification score map at corresponding positions to obtain the confidence classification map; for the i-th frame of the video sequence, find the point with the largest response value in the confidence classification score map and express its position in the original video frame using the total stride of the entire network;
S62. The regression map is a four-channel vector; its four channel values denote the offsets of the regressed target in the i-th frame of the video sequence, and the predicted target bounding box is given by Equation (6) in terms of the coordinates of the top-left and bottom-right corners of the target prediction box.

7. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the visual tracking model is trained as follows:
S71. The total training loss is defined by Equation (7), whose terms denote, in order: the loss of the i-th search frame; the total number of classification score maps, the numbers of quality assessment score maps and regression maps each being the same as the number of classification score maps; the probability that a given location in the i-th search block belongs to the target; the probability that this location is close to the center of the target; and the offsets from a given position in the i-th regression map to the boundaries of the bounding box;
S72. The training loss, comprising the cross-entropy losses for classification and quality assessment and the intersection-over-union (IoU) loss for regression, is defined by Equation (8), in which the indicator function indicates whether a position belongs to the target, taking the value 1 if it does and 0 otherwise; the three loss terms are the cross-entropy loss for classification, the cross-entropy loss for quality assessment, and the IoU loss for regression; the classification label of the current position is set to 1 if the position is a positive sample, meaning that it belongs to the target, and to 0 if it is a negative sample; the quality assessment label is the probability that a position in the i-th search block is close to the center of the real target box, where the closer the position is to the center of the target box, the higher the value, and the farther from the center, the lower the value; the regression label is the offset from the target position in the i-th search block to the boundaries of the real bounding box; the remaining symbols denote the total number of positive samples and the weight of each loss term.

8. The high-performance video tracking method based on 3D twin convolutional networks as described in claim 1, characterized in that the confidence search region is estimated as follows:
S81. Since the target may change position significantly within a video sequence, calculate the minimum bounding box from the predicted bounding boxes of the current search sequence, one box per frame, using the top-left and bottom-right corner coordinates of each predicted box;
S82. Expand the minimum bounding box to give the search region used for cropping the next set of video sequences, which ensures that the search region covers the target in every video frame of the search sequence.

9. A high-performance video tracking system based on a 3D twin convolutional network, characterized in that it includes the following units:
Video sequence input unit: given a set of template sequence video frames and search sequence video frames, crops them into template sequence blocks and search sequence blocks of the specified sizes in the manner described in S2;
Model training unit: used to train a video target tracker based on a 3D twin network, which includes a spatiotemporal feature extractor module, a multi-template matching module, and a target prediction module; the spatiotemporal feature extractor takes the template sequence blocks and search sequence blocks as input, extracts the template spatiotemporal features and search spatiotemporal features from them, and passes these features to the multi-template matching module; the multi-template matching module includes a feature transformation submodule and a spatiotemporal feature matching submodule; in the feature transformation submodule, an interactive attention mechanism passes the template spatiotemporal features to the search features to obtain more discriminative search features; the spatiotemporal feature matching submodule calculates the optimal template features and feeds them, together with the enhanced search features, into the classification branch and the regression branch; in each branch, deep correlation filtering matches the optimal template features against the enhanced search features by similarity in a high-dimensional feature space to obtain multi-channel correlation filter features; the output of the classification branch is fed into the classification head and the quality assessment head of the target prediction module to obtain the classification score map and the quality assessment score map; the output of the regression branch is fed into the regression head to obtain the regression map; the tracker is trained by minimizing the cross-entropy losses for classification and quality assessment together with the intersection-over-union (IoU) loss for regression;
Video target tracking unit: in the testing phase, combines the classification score map and the quality assessment score map output by the model to obtain a confidence classification map; the confidence classification map is used to estimate the target's position, and the regression map output by the model is used to predict the target's scale, which yields the target prediction boxes in the search sequence; from this set of target prediction boxes, a set of confidence search regions is obtained and input into the search branch for target tracking in subsequent video sequences.