Method for unsupervised object tracking based on sparse attention updating template features

By using a sparse attention-based method to update template features, this approach addresses the tracking difficulties that deep learning target trackers face when the target's appearance changes significantly. It achieves efficient template feature updates and robust tracking, adapting to target rotation and complex deformations.

CN116310971B · Active · Publication Date: 2026-06-19 · CHANGCHUN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHANGCHUN UNIV OF SCI & TECH
Filing Date: 2023-03-03
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing deep learning target trackers struggle to track objects with significant changes in appearance, and traditional template feature updates cause information to decay exponentially over time, affecting tracking performance.

Method used

A sparse attention-based method is adopted to update template features. A Siamese network, a region proposal network, a region masking module, and a template update network that combines sparse multi-head self-attention with multi-head cross-attention are used to generate updated template features. The loss is optimized through unsupervised learning to achieve tracking.

Benefits of technology

It effectively updates template features, maintains the integrity and position information of the target object, improves the robustness and running speed of the tracker, and adapts to the rotation and complex deformation of the target.

✦ Generated by Eureka AI based on patent content.


Abstract

This unsupervised target tracking method based on sparse attention-based template feature updating belongs to the fields of target tracking and deep learning. The SiameseRPN's region proposal network output features and initial template features are used as input to a template update network to generate new template features. Subsequent frames use these updated template features for tracking. Using the proposed template update network avoids the exponential decay of information over time caused by the linear combination of the current template and the cumulative template from the previous frame. The sparse attention in the template update network addresses issues such as background distraction, thus achieving template feature updating. A sparse Transformer is used as the main body of the template update network to generate fused features containing rich semantic information, preserving reliable information from the initial template. This allows attention to be focused on salient points in the features, maintaining sufficient robustness and speed even under target rotation and complex deformation.

Description

Technical Field

[0001] This invention belongs to the field of target tracking and deep learning, and more specifically, relates to an unsupervised target tracking method based on sparse attention updating template features.

Background Technology

[0002] The target tracking task involves analyzing a given video image sequence: the bounding box of the target of interest is marked in the first frame, candidate target regions detected in subsequent frames are matched against it, and the coordinates of the target in each image are located, yielding the continuous trajectory of the object that matches the target marked in the first frame.

[0003] Visual object tracking is a fundamental component of various tasks in computer vision, including visual analysis, autonomous driving, and pose estimation. Illumination, deformation, occlusion, background clutter, and large appearance changes caused by motion all have a significant impact on object tracking. Furthermore, time efficiency is also crucial in practical applications.

[0004] In recent years, with the tremendous leap in computing power, deep learning has developed rapidly, playing an unprecedented and significant role in many fields of artificial intelligence, achieving substantial performance improvements compared to many traditional methods. Visual object tracking technology, as one of the important research directions in artificial intelligence and computer vision, has also begun to widely utilize deep learning methods and has achieved good results.

[0005] Most popular existing deep trackers employ a Siamese network architecture, tracking by calculating the similarity between a template and the search region in the current frame. In the original Siamese tracker, the object template was initialized in the first frame and then remained fixed for the remainder of the video. However, appearance changes are often significant, and failure to update the template can cause the tracker to fail prematurely. Updating the template with a linear combination of the appearance template extracted from the current frame and the accumulated template from the previous frame does achieve template updates, but it also leads to an exponential decay of feature information over time. The cross-correlation similarity metric used in traditional template tracking easily loses semantic information about template features because it is a single-level linear computation. In particular, the presence of objects with a similar appearance (distractors) in the context makes the tracking task even more challenging.

[0006] Transformers possess a powerful cross-attention mechanism for inference between patches. Using Transformers instead of cross-correlation similarity metrics addresses issues such as the loss of semantic information from template features. Specifically, while Transformer trackers enhance and fuse features of the target and tracked objects by introducing an attention mechanism, they simply apply pixel-level attention to the flattened features of the template and search region. Each pixel of one flattened feature (Query) matches all pixels of another flattened feature (Key) in a complete but disordered manner. This pixel-level attention compromises the integrity of the target object, leading to the loss of relative positional information between pixels.

Summary of the Invention

[0007] The purpose of this invention is to address the problem that using traditional methods to update templates in Siamese networks leads to an exponential decay of feature information over time, making it difficult for trackers to track targets with drastically changing shapes or even causing tracking failures. Therefore, this invention proposes an unsupervised target tracking method based on updating template features with sparse attention.

[0008] The technical solution adopted by this invention to achieve the above objectives is: an unsupervised target tracking method based on sparse attention-updated template features. This invention is based on four modules: a Siamese Network, a Region Proposal Network (RPN), a region masking module, and a template update network. The method includes the following steps:

[0009] 1) In a given video sequence, select three frames in order, namely frame 1, frame 2, and frame 3. These frames form a palindrome structure with frame 3 as the center frame. Extract the target template region Z from the initially selected frame 1, then determine the target search regions X_t within the palindrome structure. The target search regions X_t are stored in chronological order as the sequence X_t = {X2, X3, X2, X1}, that is, t = 2, 3, 2, 1, where X2, X3, and X1 represent the target search regions at frame 2, frame 3, and frame 1, respectively. During tracking, a loop is defined as the span from the selected frame 1 to the last frame of the palindrome. The tracking process that starts from frame 1 and returns to frame 1 is called loop tracking, the process that starts from frame 1 and ends at frame 3 is called self tracking, and the process that starts from frame 3 and ends at frame 1 is called backward tracking. The main module of this invention is the template update network, which only plays a significant role when more than 2 frames are selected. To limit the tracking time, this invention selects 3 frames for loop tracking.

[0010] 2) Input the target template region Z and the target search regions X_t from step 1) into a Siamese Network for feature extraction, yielding the features T1 of the target template region Z (T1 is used as the initial template feature during tracking) and the features S_t of the target search regions X_t. The Siamese Network is divided into a template branch and a search branch, which share parameters to extract deep features;

[0011] 3) Input the features T1 of the target template region Z and the features S_t of the target search region X_t obtained in step 2) into the classification and regression branches of the Region Proposal Network (RPN) to obtain the classification confidence scores and the regression results of the target prediction boxes;

[0012] 4) The output of step 3) undergoes a region masking operation in the region masking module. Differentiable region masks are used to select features and implicitly penalize tracking errors in intermediate frames. Assuming the regression branch output of step 3) contains K predicted bounding boxes, the region masking module first calculates the overlap rate of the k-th predicted bounding box with each cell of a fixed grid. It retains the classification confidence scores from step 3) for boxes with overlap rates greater than 0 and sets the remaining classification confidence scores to 0, which selects the region features with target confidence greater than 0. Finally, the k-th grid value is multiplied by the classification confidence score, and the results are summed over the K predicted bounding boxes to obtain the region mask feature M_t. In subsequent steps, the region mask feature M_t and the initial template feature T1 are input into the template update network;

[0013] 5) The features S_t of the target search region X_t are passed through the Region Proposal Network (RPN) and the region masking module to obtain the region mask feature M_t. The region mask feature M_t and the initial template feature T1 are input into the template update network to obtain the updated template feature T_t;

[0014] Specifically, the template update network consists of an encoder and a decoder. The encoder uses a sparse attention mechanism instead of a standard self-attention mechanism: each pixel value of the sparse attention feature computed from the region mask feature M_t is determined by only the S most similar pixel values, which focuses attention on the foreground object and makes its edge regions more discriminative. The output of the template update network is the updated template feature T_t;

[0015] 6) Following step 5), track back to frame 1, use an unsupervised optical flow model to generate the initial pseudo-label for frame 1, calculate the loss between this pseudo-label and the tracking result obtained when tracking back to frame 1, and optimize the loss with the SGD optimizer so that the tracker learns to track.
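The frame ordering in step 1) can be sketched in a few lines of Python. This is only an illustration: the function name `palindrome_schedule` and the use of plain integer frame indices are assumptions, not part of the patent text.

```python
def palindrome_schedule(frames):
    """Return the order in which search regions are visited in one loop.

    frames: frame indices in temporal order, e.g. [1, 2, 3]. The first
    frame supplies the template Z, so it is only searched on the way back.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to form a loop")
    forward = list(frames[1:])       # self tracking: 2, 3
    backward = list(frames[-2::-1])  # backward tracking: 2, 1
    return forward + backward


schedule = palindrome_schedule([1, 2, 3])
print(schedule)  # [2, 3, 2, 1]
```

With three frames the schedule reproduces t = 2, 3, 2, 1 from step 1); with only two frames it collapses to 2, 1, which leaves no intermediate frame for the template update network to act on.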

[0016] In step 1), the target template region Z and the target search region X_t are cropped to the following sizes:

[0017] ① Select a video frame containing a complete and clear target as the target template frame. The template region is cropped to 127×127 pixels, centered on the target.

[0018] ② During training, three video frames are selected as target search frames. The search region is cropped to 255×255 pixels based on the input target center and the target scale estimated in the previous frame.

[0019] In step 2), the Siamese Network is divided into two branches: a template branch and a search branch. The two branches share parameters, and both use a ResNet50 convolutional neural network as the backbone for feature extraction. ResNet50 adds residual blocks and batch normalization (BN) layers to a conventional convolutional network and drops Dropout, which addresses the degradation problems, such as vanishing and exploding gradients, caused by increasing the depth of convolutional networks. Either the initial target template feature T1 extracted by the Siamese Network or the template feature T_t updated by the template update network is sent, together with the target search feature S_t, to the Region Proposal Network (RPN).

[0020] In step 3), the Region Proposal Network (RPN) is essentially a sliding-window, class-agnostic object detector. When the second frame is first tracked, the initial target template feature T1 and the target search feature S_t are the input to the RPN; after that, the template feature T_t updated by the template update network and the target search feature S_t are the input. Through an anchor mechanism, c candidate boxes (called anchor boxes) that may contain the target are generated for each pixel in the target search region. The RPN then uses the features extracted by the Siamese Network to provide predicted bounding boxes for the tracked target. Specifically, the RPN first uses two convolutional layers to expand the channels of the target template feature T_t into two branches, [T_t]_{cls} and [T_t]_{reg}, which have 2c and 4c channel vectors, respectively. While keeping the number of channels constant, the target search feature S_t is also split by two convolutional layers into two branches, [S_t]_{cls} and [S_t]_{reg}. [T_t]_{cls} and [T_t]_{reg} serve, group by group, as the correlation kernels of [S_t]; that is, the number of channels in one group of [T_t] is the same as the total number of channels of [S_t]. The RPN outputs 2c channels for classification and 4c channels for regression. In both the classification and regression branches, the correlation between the target search feature S_t and the target template feature T_t is computed by a convolution operation, giving the classification confidence score P_{cls} and the regression result P_{reg} of the target prediction box (the specific operation of the region proposal network is existing technology and is not described in detail in this invention):

[0021] P_{cls} = [S_t]_{cls} * [T_t]_{cls}

[0022] P_{reg} = [S_t]_{reg} * [T_t]_{reg}

[0023] Here, [T_t]_{cls} and [T_t]_{reg} represent the classification and regression branches of the target template feature T_t, [S_t]_{cls} and [S_t]_{reg} represent the classification and regression branches of the target search feature S_t, * denotes the convolution operation, P_{cls} represents the classification confidence score, and P_{reg} represents the regression result of the target prediction box.
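The * operation in the two formulas above uses the template feature as a kernel that slides over the search feature. A minimal single-channel sketch in pure Python follows; the function name `cross_correlate` and the toy inputs are assumptions for illustration, and a real RPN head applies this per channel group on learned feature maps.

```python
def cross_correlate(search, kernel):
    """Valid-mode 2D cross-correlation of one search map with one kernel."""
    sh, sw = len(search), len(search[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(sh - kh + 1):
        row = []
        for j in range(sw - kw + 1):
            acc = 0.0
            # Dot product between the kernel and the window at (i, j).
            for u in range(kh):
                for v in range(kw):
                    acc += search[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


# Toy example: the template pattern sits in the middle of the search map.
search = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
template = [
    [1, 2],
    [3, 4],
]
response = cross_correlate(search, template)
peak = max(max(row) for row in response)
print(response[1][1], peak)  # 30.0 30.0
```

The response peaks where the search window matches the template, which is the position the classification branch is meant to score highest.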

[0024] In step 4), the region masking module selects target features in the search region from all predicted bounding boxes output by the Region Proposal Network (RPN) to obtain the output region mask feature M_t. The region masking module first calculates the overlap rate (grid value) of the k-th target prediction box with a fixed grid, retains the classification confidence scores from step 3) for boxes with overlap rates greater than 0, and sets the remaining classification confidence scores to 0. This selects the region features with target confidence greater than 0. Finally, the k-th grid value is multiplied by the classification confidence score, and the results are summed over the K prediction boxes to obtain the region mask feature M_t. Unlike traditional feature selection operations, the region masking operation used in this framework is not only differentiable with respect to the coordinates of the candidate prediction boxes, but also passes the gradient to all candidate boxes, which avoids ill-posed penalties during iterative training.

[0025] In step 4), the region mask feature M_t is obtained as follows:

[0026] The outputs P_{cls} and P_{reg} of the classification and regression branches of the region proposal network are the input to the region mask. Let the output of the regression branch P_{reg} contain K predicted boxes in total. For the k-th predicted box, the grid map G_k has, at row i and column j, a grid value equal to the area of overlap between the predicted bounding box and the cell in the i-th row and j-th column of the fixed grid. The grid value thus represents the overlap rate between the fixed grid and the predicted bounding box.

[0027] All grid values in the search area form a set {G_k}. The set {G_k} is aggregated into a single-channel region mask, and a new grid map is generated using the grid values of the predicted bounding boxes with confidence scores greater than 0. The region mask M_t is obtained by multiplying each grid map G_k by its classification confidence P_{cls,k} and summing over the K predicted boxes, where P_{cls,k} is the classification confidence of the k-th grid map. A 4×4 grid is used in this region masking module.
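The region masking computation described above can be sketched as follows. This is a simplified illustration under stated assumptions: boxes are (x1, y1, x2, y2) in coordinates normalized to the search region, the grid value is taken as overlap area divided by cell area, and the names `grid_overlap` and `region_mask` are hypothetical.

```python
def grid_overlap(box, grid_size=4, extent=1.0):
    """Overlap rate of one box with every cell of a fixed grid_size x grid_size grid."""
    cell = extent / grid_size
    x1, y1, x2, y2 = box
    grid = []
    for i in range(grid_size):      # rows, y direction
        row = []
        for j in range(grid_size):  # columns, x direction
            cx1, cy1 = j * cell, i * cell
            w = max(0.0, min(x2, cx1 + cell) - max(x1, cx1))
            h = max(0.0, min(y2, cy1 + cell) - max(y1, cy1))
            row.append(w * h / (cell * cell))
        grid.append(row)
    return grid


def region_mask(boxes, scores, grid_size=4, extent=1.0):
    """Confidence-weighted sum of grid maps over all predicted boxes."""
    mask = [[0.0] * grid_size for _ in range(grid_size)]
    for box, score in zip(boxes, scores):
        if score <= 0.0:
            continue  # boxes without positive confidence contribute nothing
        g = grid_overlap(box, grid_size, extent)
        for i in range(grid_size):
            for j in range(grid_size):
                mask[i][j] += score * g[i][j]
    return mask


boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)]
scores = [0.8, 0.0]
mask = region_mask(boxes, scores)
print(mask[0][0], mask[2][2])
```

Because the grid values are built only from `min`, `max`, and products of box coordinates, the same arithmetic written in an automatic differentiation framework would pass gradients back to every box, which is the property the text relies on.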

[0028] In step 5), the template update network specifically refers to:

[0029] The template update network consists of an encoder and a decoder. A 2D sinusoidal positional encoding P is added to the target features before they are input to the encoder and decoder, which lets the template update network perceive the spatial position of the target and makes template updating easier.

[0030] The encoder consists of N encoder layers, and the region mask feature M_t of the t-th frame is the input to the encoder:

[0031]

[0032] Here, M_t represents the region mask feature; P_{enc} denotes the 2D sinusoidal encoding added to provide positional cues; the remaining symbols denote the i-th encoder layer and the output of the i-th encoder layer; and N represents the number of encoder layers;

[0033] Unlike other encoders, each encoder layer of this encoder first uses Sparse Multi-Head Self-Attention (SMSA) to compute the self-attention of the region mask feature M_t, then applies a residual connection and layer normalization, and finally feeds the result into a feedforward neural network, whose purpose is to strengthen the fitting ability of the sparse multi-head self-attention layer, to obtain the output of each encoder layer.

[0034] The decoder consists of M decoder layers. Each decoder layer takes not only the initial template features T1 encoded by spatial location and the output of its preceding decoder layer as input, but also the encoded target template features output by the encoder, as shown below:

[0035]

[0036] Here, T1 represents the initial template feature; P_{dec} denotes the 2D sinusoidal encoding added to provide positional cues; the remaining symbols denote the encoded target template feature output by the encoder, the i-th decoder layer, and the output of the (i-1)-th decoder layer; and M represents the number of decoder layers.

[0037] In step 6), gradient descent is used to optimize the parameters, minimizing the loss to achieve accurate tracking. The total loss for iterative tracking is:

[0038] L = (1 - λ_c)(α L_{cls} + β L_{reg}) + λ_c L_c

[0039] where L_{cls} and L_{reg} represent the classification loss and regression loss of self-tracking, respectively, L_c represents the cyclic loss for looping back to frame 1, and α, β, and λ_c are weighting coefficients.
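As a quick numeric check of the total loss formula, the weighting can be written as a small function. The default coefficient values below are placeholders chosen for illustration; the text does not state the values of α, β, or λ_c.

```python
def total_loss(l_cls, l_reg, l_cyc, alpha=1.0, beta=1.0, lambda_c=0.5):
    """L = (1 - lambda_c) * (alpha * L_cls + beta * L_reg) + lambda_c * L_c."""
    return (1.0 - lambda_c) * (alpha * l_cls + beta * l_reg) + lambda_c * l_cyc


# Hypothetical loss values for one training step.
loss = total_loss(l_cls=0.2, l_reg=0.4, l_cyc=0.1)
print(round(loss, 6))  # 0.35
```

Setting `lambda_c` to 0 leaves only the self-tracking terms, and setting it to 1 leaves only the cyclic term, so λ_c trades off the two sources of supervision.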

[0040] The formula for calculating the stochastic gradient descent (SGD) method is as follows:

[0041]

[0042] The formula for calculating the stochastic gradient descent (SGD) method is as follows:

[0043]

[0044] where the quantity being solved for is the optimal set of parameters, Z represents the target template region, X represents the target search region, y represents the label, and the network output for Z and X is the predicted result;

[0045] After 20 iterations of training, the final total loss L of the cyclic tracking can be stabilized below 0.1.

[0046] Through the above design scheme, this invention can bring the following beneficial effects: The output features of the SiameseRPN Region Proposal Network (RPN) and the initial template features are used as inputs to the template update network to generate new template features. Subsequent frames use the updated template features for tracking. The proposed template update network module avoids the exponential decay of information over time caused by the linear combination of the current template and the cumulative template of the previous frame. The sparse attention in the template update network solves the problem of background distraction, thus achieving template feature updates. Using a sparse Transformer as the main body of the template update network generates fused features containing rich semantic information, retaining reliable information from the initial template, and allowing attention to be focused on salient points in the features. For situations such as target rotation and complex deformation, the tracker maintains sufficient robustness and running speed.

Attached Figure Description

[0047] Figure 1 is a framework diagram of the unsupervised target tracking method based on sparse attention updating template features according to the present invention;

[0048] Figure 2 is a schematic diagram of the template update network in this invention.

Detailed Implementation

[0049] To make the objectives, features, and advantages of this invention more apparent and understandable, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, this invention is not limited to the following embodiments, and specific implementation methods can be determined according to the technical solutions of this invention and actual circumstances. To avoid obscuring the essence of this invention, well-known methods, processes, and procedures are not described in detail.

[0050] This invention proposes an unsupervised target tracking method based on sparse attention-based template feature updating. Following the Siamese network concept, features of the target template region and the search region are extracted separately and input together into a Region Proposal Network (RPN) to obtain classification confidence scores and target bounding box regression results. These scores and results are then fed into a region masking module to obtain mask features. For each frame, the template update network generates new template features T_t from the initial template features T1 and the mask features M_t. The method employs unsupervised cyclic tracking (with frames 1, 2, 3, 2, 1 as one cycle) back to frame 1. The tracking result obtained on returning to frame 1 is compared with the pseudo-label of frame 1 at the beginning of the cycle to calculate the loss. Here, this invention uses an unsupervised optical flow model to generate the pseudo-label for frame 1, which serves as the ground truth box. The numerical difference between the predicted target box and the ground truth box on returning to frame 1 is the loss of the tracking process. The entire cyclic tracking process includes the classification loss and regression loss for self-tracking, as well as the cyclic loss for tracking back to frame 1. During training, the SGD optimizer is used to minimize the tracking loss to achieve accurate tracking. Sparse attention is used in the template update network to focus attention on salient points, ensuring sufficient robustness and speed for tracking under conditions such as target rotation and complex deformation.

[0051] Figure 1 illustrates the framework of the unsupervised target tracking method based on sparse attention-updated template features according to the present invention. The specific implementation process of this method includes the following steps:

[0052] A. In a given video sequence, select one frame, denote it as frame 1, and extract the target template region Z. Together with the subsequent second and third frames, it forms a palindrome that provides the target search regions X_t (t = 2, 3, 2, 1);

[0053] Specifically, a clear target template frame containing the complete target is selected, and the template region is cropped to 127×127 pixels around the target. During training, the search region is cropped to 255×255 pixels based on the input target center and the target size estimated in the previous frame.

[0054] An unsupervised tracking method is used, where a target object labeled in frame 1 is used to track the target object forward in subsequent video frames. When tracking backward, the predicted position in the last frame of the target search frame is used as the initial target label, and tracking continues back to frame 1. The consistency loss is calculated between the target position obtained by backward tracking back to frame 1 and the initial pseudo-label. After measuring the difference between the forward and backward target trajectories, the overall cyclic tracking network of this invention is trained in an unsupervised manner to optimize the loss to ensure consistency between the predicted and ground truth bounding box trajectories.
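The forward-backward consistency check can be illustrated with a small sketch. It assumes boxes in (x1, y1, x2, y2) form and takes the loss as 1 − IoU, which is one common form of IoU loss; the text names an IoU loss for the cyclic term but does not give its exact form. The function names `iou` and `cycle_loss` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def cycle_loss(pred_box, pseudo_label_box):
    """Consistency loss between the box tracked back to frame 1 and its pseudo-label."""
    return 1.0 - iou(pred_box, pseudo_label_box)


pseudo = (0.0, 0.0, 2.0, 2.0)
perfect = cycle_loss((0.0, 0.0, 2.0, 2.0), pseudo)
drifted = cycle_loss((1.0, 1.0, 3.0, 3.0), pseudo)
print(perfect, drifted)
```

A tracker that returns exactly to the pseudo-label incurs zero loss, while any drift accumulated on the way out and back increases it.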

[0055] B. Input the target template region Z and the target search region X2 of the second frame into the Siamese Network to obtain the feature T1 (i.e. the initial template feature T1) of the target template region Z and the feature S2 of the target search region X2. The Siamese Network is divided into a template branch and a search branch. These two branches share parameters to extract deep features.

[0056] Specifically, both branches of the Siamese Network use the ResNet50 convolutional neural network as the backbone for feature extraction. ResNet50 adds residual blocks and BN (batch normalization) layers to a conventional convolutional network and drops Dropout, which addresses the network degradation, vanishing gradient, and exploding gradient problems caused by increasing the depth of the convolutional network.

[0057] C. The target template feature T_t and the target search feature S_t extracted by the Siamese Network are fed into the classification and regression branches of the Region Proposal Network (RPN);

[0058] Specifically, for each pixel of the feature corresponding to the target search region, c anchor boxes that may contain the target are generated. The Region Proposal Network (RPN) then outputs 2c channels for classification and 4c channels for regression. In both the classification and regression branches, the correlation between the target search feature S_t and the target template feature T_t is computed by a convolution operation, giving the classification confidence score P_{cls} and the regression result P_{reg} of the target prediction box:

[0059] P_{cls} = [S_t]_{cls} * [T_t]_{cls}

[0060] P_{reg} = [S_t]_{reg} * [T_t]_{reg}

[0061] Here, [T_t]_{cls} and [T_t]_{reg} represent the classification and regression branches of the target template feature T_t, [S_t]_{cls} and [S_t]_{reg} represent the classification and regression branches of the target search feature S_t, * denotes the convolution operation, P_{cls} represents the classification confidence score, and P_{reg} represents the regression result of the target prediction box;

[0062] The Region Proposal Network (RPN) uses multiple anchor box ratios, which makes it easier to predict target shapes with large aspect ratios. To prevent overfitting, the anchor ratios of the RPN are set to [0.33, 0.5, 1, 2, 3], and the anchor value is set to 8.
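To make the anchor setting concrete, here is a hedged sketch of how one anchor shape per ratio might be generated. Several things are assumptions rather than statements from the text: the "anchor value" of 8 is read as the anchor scale, a feature stride of 8 is used for the base area, the ratio is taken as height divided by width, and the function name `make_anchor_shapes` is hypothetical.

```python
import math


def make_anchor_shapes(ratios=(0.33, 0.5, 1, 2, 3), scale=8, stride=8):
    """Return one (width, height) pair per ratio, all with the same area."""
    base_area = float(stride * stride)
    shapes = []
    for r in ratios:
        w = math.sqrt(base_area / r)  # width before scaling
        h = w * r                     # height before scaling, ratio = h / w
        shapes.append((w * scale, h * scale))
    return shapes


shapes = make_anchor_shapes()
for w, h in shapes:
    print(round(w, 1), round(h, 1))
```

With five ratios, c = 5 anchors are produced per location, so under these assumptions the RPN heads would output 2c = 10 classification channels and 4c = 20 regression channels.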

[0063] D. The output of the Region Proposal Network (RPN) is processed by the region masking module to obtain the region mask feature M2. In subsequent steps, M2 and the initial template feature T1 are input into the template update network to obtain the updated template feature T2.

[0064] Specifically, feature selection is performed by the region masking module to penalize tracking errors in intermediate frames. In the initial prediction phase, the first n predictions of the classification and regression branches of the Region Proposal Network (RPN) may not be very accurate, and the new template features generated from the first n target prediction boxes may not contain any features of the target object. Using such a template to track back to the initial target position leads to ill-posed penalties (that is, tracking errors in intermediate frames are not penalized in this pipeline). Traditional feature selection operations, such as RoI-Align, are not differentiable with respect to the bounding box coordinates, so tracking errors in intermediate frames cannot be penalized. A feature selection operation that is differentiable with respect to the bounding box coordinates is therefore needed. The outputs P_{cls} and P_{reg} of the classification and regression branches of the RPN are the input to the region mask. Assume the output of the regression branch P_{reg} contains K predicted boxes in total. For the k-th predicted box, the grid map G_k has, at row i and column j, a grid value equal to the area of overlap between the predicted bounding box and the cell in the i-th row and j-th column of the fixed grid; the grid value thus represents the overlap rate between the fixed grid and the predicted bounding box.

[0065] All grid values in the search area form a set {G_k}. The set {G_k} is aggregated into a single-channel region mask using only the grid values of predicted boxes with confidence scores greater than 0; the grid values of the other predicted boxes are set to 0, which generates a new grid map. The region mask M_t is then obtained by multiplying each grid map G_k by its classification confidence P_{cls,k} and summing over the K predicted boxes, where P_{cls,k} is the classification confidence of the k-th grid map. A 4×4 grid is used in this region masking module.

[0066] E. The features S_t of the target search region X_t are passed through the Region Proposal Network (RPN) and the region masking module to obtain the region mask feature M_t. M_t and the initial template feature T1 are input into the template update network to obtain the updated template feature T_t.

[0067] Specifically, during cyclic tracking, the predicted bounding boxes in subsequent frames are significantly influenced by the predicted bounding boxes in the current frame. If the predicted bounding boxes in the current frame are inaccurate, it becomes difficult to generate accurate predicted bounding boxes in subsequent frames. The template update network ensures consistent propagation between frames, preventing the tracker from losing the target object.

[0068] The template update network consists of an encoder and a decoder. The region mask feature M_t of frame t is the input to the encoder, where the most salient points are selected through sparse attention. The decoder's input T1 is the initial template feature, and its output T_t is the newly generated target template feature, which retains most of the reliable information from T1.

[0069] The encoder consists of N encoder layers. The region mask feature M2, with the spatial positional encoding P added, is the input to the encoder. Each encoder layer first uses Sparse Multi-Head Self-Attention (SMSA) to compute the self-attention of the input features, then applies a residual connection and layer normalization (Norm), and finally feeds the result into a feedforward neural network (FFN) followed by another residual connection and layer normalization to obtain the output of the encoder layer, as shown in the following expressions:

[0070]

[0071]

[0072] where the result is the output of the i-th encoder layer, SMSA represents the sparse multi-head self-attention operation, Norm represents the residual connection and layer normalization operation, and FFN represents the feedforward neural network.
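Read literally, the encoder layer described in [0069] matches the usual post-norm Transformer layer with SMSA in place of standard self-attention. One way to write it is sketched below in LaTeX. The symbols E^{i} and \hat{E}^{i} are assumed names introduced here for illustration, not the patent's own notation, and placing the residual sum inside Norm is an interpretation of "residual connection layer normalization".

```latex
\[
\begin{aligned}
E^{0}       &= M_t + P_{enc}, \\
\hat{E}^{i} &= \mathrm{Norm}\big(\mathrm{SMSA}(E^{i-1}) + E^{i-1}\big), \\
E^{i}       &= \mathrm{Norm}\big(\mathrm{FFN}(\hat{E}^{i}) + \hat{E}^{i}\big),
\qquad i = 1, \dots, N.
\end{aligned}
\]
```

Under this reading, E^{N} is the encoded feature that the encoder passes on to the decoder.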

[0073] Sparse attention applies the softmax function only to the S largest elements in each row of the similarity matrix. All other elements are replaced with 0, which discards the smaller similarity weights. Finally, the similarity matrix is multiplied by the values to obtain the result. Selecting the top S elements in this way turns standard attention into sparse attention. S, the sparsity of the sparse attention, is set to 32.
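The top-S selection can be sketched for a single attention head in pure Python. This is a minimal illustration, not the patent's implementation: the function name `sparse_attention` is hypothetical, the scores are scaled by the square root of the feature dimension as in standard scaled dot-product attention (an assumption, since the text does not mention scaling), and the learned query, key, and value projections of a real multi-head layer are omitted.

```python
import math


def sparse_attention(q, k, v, top_s):
    """Attention where each query attends only to its top_s most similar keys.

    q: n x d, k: m x d, v: m x dv, all as lists of lists of floats.
    """
    if top_s < 1:
        raise ValueError("top_s must be at least 1")
    d = len(q[0])
    dv = len(v[0])
    out = []
    for qi in q:
        # One row of the similarity matrix.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Indices of the top_s largest scores; all other weights stay 0.
        keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_s]
        m = max(scores[j] for j in keep)
        exps = {j: math.exp(scores[j] - m) for j in keep}
        z = sum(exps.values())
        row = [0.0] * dv
        for j, e in exps.items():
            weight = e / z  # softmax over the kept entries only
            for c in range(dv):
                row[c] += weight * v[j][c]
        out.append(row)
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(sparse_attention(q, k, v, top_s=1))  # [[1.0, 0.0]]
```

With `top_s=1` the dissimilar third key, which carries a large value vector, is ignored entirely; with dense attention it would still leak into the output. This is the mechanism by which the text says background distraction is suppressed.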

[0074] The decoder consists of M decoder layers. Each decoder layer takes as input not only the position-encoded initial template features $T_1$ and the output of the previous decoder layer, but also the encoded features output by the encoder. First, multi-head self-attention (MSA) computes the self-attention of the initial template features $T_1$, followed by a residual connection and layer normalization. Next, multi-head cross-attention (MCA) is computed between this result and the encoded features output by the encoder. Finally, the result is fed into the feedforward neural network (FFN) and passed through another residual connection and layer normalization to obtain the output of the decoder layer, as shown in the following expressions:

[0075]

[0076]

[0077]

[0078] Here the expressions give the output of each decoder layer; MSA denotes multi-head self-attention, MCA denotes multi-head cross-attention, Norm denotes the residual connection followed by layer normalization, and FFN denotes the feedforward neural network.
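The decoder layer described above can be sketched in NumPy as follows. This is a minimal illustration, assuming the standard post-norm Transformer decoder layout with a residual connection and normalization after each of the three sub-blocks, a single attention head, and a two-layer ReLU feedforward network. The function names, weight names, and shapes are hypothetical and do not come from the patent.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization; learnable scale and shift are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    # Dense single-head scaled dot-product attention.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoder_layer(x, memory, p):
    # 1) MSA over the template tokens, residual + Norm.
    h = layer_norm(x + attention(x, x, *p["self"]))
    # 2) MCA: template tokens query the encoder output, residual + Norm.
    h = layer_norm(h + attention(h, memory, *p["cross"]))
    # 3) FFN, residual + Norm.
    ffn = np.maximum(h @ p["w_1"], 0.0) @ p["w_2"]
    return layer_norm(h + ffn)

rng = np.random.default_rng(0)
d = 16  # hypothetical feature width

def rand():
    return 0.1 * rng.standard_normal((d, d))

layers = [
    {"self": (rand(), rand(), rand()),
     "cross": (rand(), rand(), rand()),
     "w_1": rand(), "w_2": rand()}
    for _ in range(2)  # two decoder layers, as in the text
]

template = rng.standard_normal((36, d))  # placeholder for T_1 plus position code
memory = rng.standard_normal((64, d))    # placeholder for the encoder output

updated = template
for p in layers:
    updated = decoder_layer(updated, memory, p)
print(updated.shape)  # (36, 16)
```

The output keeps the shape of the template tokens, so it can replace the template in the next tracking step, while the cross-attention is the only place where information from the current frame enters.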

[0079] Increasing the number of encoder and decoder layers improves performance, but too many layers can lead to overfitting during training. Therefore, the number of encoder layers N and the number of decoder layers M are both set to 2.

[0080] F. The tracking process iterates back to the first frame of the loop. The tracking result generated for the first frame is compared with the pseudo-label of the original first frame to calculate the loss. This invention uses an unsupervised optical flow model to generate the pseudo-label of the first frame, which serves as the ground-truth bounding box. The numerical difference between the predicted target bounding box and the ground-truth bounding box in the first frame is the loss of the tracking process. The entire cyclic tracking process includes the classification loss $L_{cls}$ and the regression loss $L_{reg}$ for self-tracking, and the cyclic loss $L_c$ for looping back to frame 1. The classification loss $L_{cls}$ is the cross-entropy loss, the regression loss $L_{reg}$ is the Smooth L1 loss on normalized coordinates, and the cyclic loss $L_c$ is the IoU loss. The loss is optimized during training with the SGD optimizer to achieve accurate tracking. The total loss for cyclic tracking is:

[0081] $L_{total} = (1 - \lambda_c)(\alpha L_{cls} + \beta L_{reg}) + \lambda_c L_c$

[0082] where $L_{cls}$ and $L_{reg}$ are the classification loss and regression loss of self-tracking, respectively, $L_c$ is the cyclic loss for looping back to frame 1, and $\alpha$, $\beta$, and $\lambda_c$ are weighting coefficients. In this invention, the parameters are set to $\alpha = 10$, $\beta = 1.2$, and $\lambda_c = 0.5$.
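The weighting of the three loss terms can be illustrated with a short Python sketch. The function names and the sample loss values are hypothetical and chosen only to show the arithmetic. The text states that the cyclic loss is an IoU loss without giving its exact form, so the `1 - IoU` form below is an assumption, and boxes are assumed to be `(x1, y1, x2, y2)` with positive area.

```python
def iou_loss(pred, target):
    # 1 - IoU between two boxes given as (x1, y1, x2, y2); assumed form of the IoU loss.
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    return 1.0 - inter / (area_p + area_t - inter)

def total_loss(l_cls, l_reg, l_c, alpha=10.0, beta=1.2, lambda_c=0.5):
    # L_total = (1 - lambda_c) * (alpha * L_cls + beta * L_reg) + lambda_c * L_c
    self_tracking = alpha * l_cls + beta * l_reg
    return (1.0 - lambda_c) * self_tracking + lambda_c * l_c

# Hypothetical values: frame-1 prediction versus the optical-flow pseudo-label.
l_c = iou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # 1 - 1/7
print(total_loss(l_cls=0.02, l_reg=0.05, l_c=l_c))
```

With $\lambda_c = 0.5$ the self-tracking terms and the cyclic term contribute equally, and inside the self-tracking part the classification loss is weighted roughly eight times more heavily than the regression loss.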

[0083] Table 1 compares the analysis results of the method of this invention with those of other methods on the OTB2015 dataset.

[0084]

[0085] This invention is based on the premise of unsupervised target tracking. Currently, labeled data constitutes a relatively small proportion of real-world scenarios, making learning from unlabeled videos a promising approach. Unsupervised target tracking does not require a large number of annotated real labels. Furthermore, the template update network proposed in this invention avoids the exponential decay of information over time caused by the linear combination of the current frame template and the cumulative template of the previous frame. The sparse attention in the network addresses issues such as background distraction, enabling template feature updates. Using a sparse Transformer to update the template kernel generates fused features containing rich semantic information, preserving reliable information from the initial template and focusing attention on salient points within the features. This ensures sufficient robustness and speed for tracking even with target rotation and complex deformations.

Claims

1. An unsupervised target tracking method based on sparse attention-based template feature updating, characterized in that the network architecture on which the method is based includes a Siamese network, a region proposal network, a region masking module, and a template update network, and the method includes the following steps:
1) In a given video sequence, select three frames in order, namely frame 1, frame 2, and frame 3, and form a palindrome structure with frame 3 as the center frame. Extract the target template region from the initially selected frame 1, and determine the target search regions within the palindrome structure. The target search regions are stored in chronological order as a sequence whose elements are the target search region of frame 2, the target search region of frame 3, and the target search region of frame 1. During tracking, a loop frame is defined as the period from the selected first frame to the last frame of the palindrome. Tracking that starts from the first frame and returns to the first frame is called cyclic tracking; tracking that starts from the first frame and ends at the third frame is called self tracking; tracking that starts from the third frame and ends at the first frame is called backward tracking;
2) The target template region and the target search regions from step 1) are fed into the Siamese network for feature extraction, yielding the target template region features and the target search region features. During tracking, the target template region features serve as the initial template features. The Siamese network is divided into a template branch and a search branch, which share parameters to extract deep features;
3) The target template region features and the target search region features obtained in step 2) are fed into the classification and regression branches of the region proposal network to obtain the classification confidence scores and the regression results of the target prediction boxes;
4) The output of step 3) is masked by the region masking module. Differentiable region masks are used to select features and to implicitly penalize tracking errors in intermediate frames. Assuming the regression branch output of step 3) contains K predicted bounding boxes, the region masking module first calculates the overlap rate of the k-th predicted bounding box with the fixed grid. It retains the classification confidence scores from step 3) of the boxes whose overlap rate is greater than 0 and sets the remaining classification confidence scores to 0, thereby selecting region features with target confidence greater than 0. Finally, the grid values of the k-th box are multiplied by its classification confidence score, and the results for the K predicted bounding boxes are summed to obtain the region mask features;
5) The region mask features are obtained from the target search region features through the region proposal network and the region masking module. The region mask features and the initial template features are fed into the template update network to obtain the updated target template features. Specifically, the template update network consists of an encoder and a decoder. The encoder applies a sparse attention mechanism to the region mask features, in which each pixel value of the sparse attention feature is determined only by the S most similar pixel values; the output of the template update network is the target template features;
6) Following step 5), track back to frame 1, use the unsupervised optical flow model to generate the initial pseudo-label for frame 1, calculate the loss by combining the pseudo-label with the tracking result generated when tracking back to frame 1, and optimize the loss with the SGD optimizer so that the tracker achieves tracking;
wherein the template update network in step 5) is as follows: the template update network consists of an encoder and a decoder. A 2D sinusoidal position code is added to the target features before they are input to the encoder and the decoder, enabling the template update network to perceive the spatial position information P of the target. The encoder consists of N encoder layers and takes the region mask features of the t-th frame as input; in the corresponding expression, the symbols denote the region mask features, the addition of the 2D sinusoidal code, the i-th encoder layer, and the output of the i-th encoder layer, and N is the number of encoder layers. Each encoder layer first computes the self-attention of the region mask features using sparse multi-head self-attention, then applies a residual connection and layer normalization, and finally feeds the result into the feedforward neural network to obtain the output of the encoder layer. The decoder consists of M decoder layers. Each decoder layer takes as input the position-encoded initial template features and the output of the previous decoder layer, together with the encoded target template features output by the encoder; in the corresponding expression, the symbols denote the initial template features, the addition of the 2D sinusoidal code, the encoded target template features output by the encoder, a decoder layer, and the output of a decoder layer, and M is the number of decoder layers.

2. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1, characterized in that: in step 1), the target template region and the target search region are cropped as follows: ① a video frame containing a complete and clear target is selected as the target template frame, and the template region is cropped to 127mm×127mm around the target; ② during training, the target search region is cropped to 255mm×255mm based on the input target center and the target scale estimated in the previous frame.

3. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1, characterized in that: In step 2), both branches of the Siamese network use ResNet50 convolutional neural network as the backbone for feature extraction.

4. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1, characterized in that: in step 3), when the initial second frame is tracked, the initial template features and the target search features are the input to the region proposal network; after the initial second frame has been tracked, the target template features updated by the template update network and the target search features are the input to the region proposal network. Through the anchor mechanism, c candidate boxes that may contain the target are generated in the target search region for each pixel. The region proposal network then uses the features extracted by the Siamese network to give the predicted box of the tracked target. The output of the region proposal network is obtained as follows: first, two convolutional layers expand the channels of the target template features into two branches, the classification branch and the regression branch of the target template features, with 2c and 4c channel vectors respectively. With the number of channels kept unchanged, two convolutional layers divide the target search features into two branches, the classification branch and the regression branch of the target search features. The region proposal network outputs 2c channels for classification and 4c channels for regression. In both the classification branch and the regression branch, the correlation between the target search features and the target template features is computed by a convolution operation, giving the classification confidence score and the regression result of the target prediction box; in the corresponding expressions, the symbols denote the classification branch and the regression branch of the target template features, the classification branch and the regression branch of the target search features, the convolution, the classification confidence score, and the regression result of the target prediction box.

5. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1, characterized in that: in step 4), the region masking module is used to select target features from the last search region, based on the target prediction boxes of all output results of the region proposal network, to obtain the output region mask features.

6. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1 or 5, characterized in that: in step 4), the region mask features are obtained as follows: the outputs of the classification and regression branches of the region proposal network are the input to the region mask. Let the regression branch output contain a total of K predicted boxes. For the k-th predicted box, the grid value at the i-th row and j-th column of its grid map is defined by the area of overlap between the predicted bounding box and the cell at the i-th row and j-th column of the fixed grid, and represents the overlap rate between the fixed grid and the predicted bounding box. All grid values in the search region form a set, which is aggregated into a single-channel region mask, and the grid values of the predicted bounding boxes with confidence scores greater than 0 are used to generate a new grid map. The region mask features are calculated by weighting each grid map with the classification confidence of the k-th grid map, and a 4×4 grid size is used in this region masking module.

7. The unsupervised target tracking method based on sparse attention-based template feature updating according to claim 1, characterized in that: in step 6), gradient descent is used to optimize the parameters, minimizing the loss to achieve accurate tracking. The total loss for cyclic tracking is: $L_{total} = (1 - \lambda_c)(\alpha L_{cls} + \beta L_{reg}) + \lambda_c L_c$; where $L_{cls}$ and $L_{reg}$ are the classification loss and regression loss for self-tracking, respectively, $L_c$ is the cyclic loss for looping back to frame 1, and $\alpha$, $\beta$, and $\lambda_c$ are weighting coefficients. In the stochastic gradient descent (SGD) calculation, the symbols denote the optimal parameters to be obtained, the target template region, the target search region, the labels, and the predicted results. After 20 iterations of training, the final total loss of cyclic tracking stabilizes below 0.1.
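The region masking recited in claims 1 and 6 can be illustrated with the following NumPy sketch. It is a minimal reading of the claim language under stated assumptions, not the patented implementation: boxes are `(x1, y1, x2, y2)` in the same units as the search region, the grid value is taken to be the fraction of each cell covered by the box, and the default region size of 255 follows the search region size in claim 2. The function and parameter names are hypothetical.

```python
import numpy as np

def box_grid_overlap(box, region_size=255.0, grid=4):
    # Fraction of each fixed-grid cell covered by the box (x1, y1, x2, y2).
    cell = region_size / grid
    edges = np.arange(grid + 1) * cell
    lo, hi = edges[:-1], edges[1:]
    # 1-D overlap lengths between the box and each column / row of cells.
    ox = np.clip(np.minimum(hi, box[2]) - np.maximum(lo, box[0]), 0.0, None)
    oy = np.clip(np.minimum(hi, box[3]) - np.maximum(lo, box[1]), 0.0, None)
    # Outer product gives overlap areas; rows index y (i), columns index x (j).
    return np.outer(oy, ox) / (cell * cell)

def region_mask(boxes, scores, region_size=255.0, grid=4):
    # Confidence-weighted sum of the grid maps of the K predicted boxes.
    mask = np.zeros((grid, grid))
    for box, score in zip(boxes, scores):
        if score > 0:  # boxes with zero confidence contribute nothing
            mask += score * box_grid_overlap(box, region_size, grid)
    return mask

# Two hypothetical predicted boxes with their classification confidences.
boxes = [(0.0, 0.0, 127.5, 127.5), (63.75, 63.75, 191.25, 191.25)]
scores = [0.5, 0.25]
print(region_mask(boxes, scores))
```

Because the mask is a sum of products of confidences and overlap fractions, it remains differentiable with respect to both the scores and the box coordinates wherever the overlaps are nonzero, which is consistent with the differentiable region masks described in step 4).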

Citation Information

Patent Citations

  • Twin neural network moving target tracking method based on full-connection attention module

    CN113744311A

  • Target object representation point estimation-based visual tracking method

    WO2023273136A1