Single-target tracking method and device based on Gaussian attention and adaptive focusing
By enhancing the center information of the template image through Gaussian attention and adaptive focusing modules, the robustness and accuracy problems of existing single-target tracking methods under occlusion and interference from similar objects are solved, achieving a more efficient target tracking effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA AGRICULTURAL UNIVERSITY
- Filing Date
- 2023-10-07
- Publication Date
- 2026-06-30
AI Technical Summary
Existing single-target tracking methods struggle to achieve robust and accurate target tracking under occlusion and interference from similar objects. Self-attention cannot capture the absolute positional relationships of elements in an image, and cross-attention has difficulty distinguishing between the target and similar objects.
We employ Gaussian attention to enhance the semantic information modeling of template features, utilize the Gaussian prior distribution matrix to enhance the center information of the template image, and combine an adaptive focusing module to introduce the shape and position information of the target from the previous frame. We then capture the positional relationships of elements in the image and focus on the target through a Gaussian Transformer and an adaptive focusing module.
It significantly improves the robustness and accuracy of the tracker, reduces background information interference under complex conditions, clearly distinguishes the target from similar objects, and achieves more accurate target tracking.
Figure CN117541784B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the technical field of computer vision, and specifically relates to a single-target tracking method and apparatus based on Gaussian attention and adaptive focusing.
Background Technology
[0002] Single-object tracking is a fundamental task in computer vision: it predicts the state of a target in a video sequence given its initial state, and it has been widely applied in areas such as visual surveillance and autonomous driving. However, despite significant efforts in recent years, developing a robust and accurate tracker remains challenging because of difficulties frequently encountered in tracking, such as occlusion and interference from similar objects. Existing methods introduce the Transformer, which includes two modules: self-attention and cross-attention. First, self-attention with relative position embeddings struggles to guide target modeling. Self-attention only considers similarities or dependencies between elements and fails to capture the positional relationships between elements in an image. Position embeddings are introduced to address this issue, but they only consider the relative positions of elements and do not pay more attention to elements located at the absolute center of the image. Second, cross-attention based on appearance similarity struggles to distinguish the target from similar objects. Cross-attention strengthens the target's regional features through the similarity between the template features and the search region features. In visual tracking, it assigns equal weight to similar objects at arbitrary locations, which makes them difficult to distinguish from the target and reduces tracking accuracy. These problems prevent the features from focusing on the target and allow attention to be diverted by similar objects.
Summary of the Invention
[0003] The main objective of this invention is to overcome the shortcomings and deficiencies of the prior art and provide a single-target tracking method and apparatus based on Gaussian attention and adaptive focusing. Gaussian attention addresses the inability of self-attention to capture the positional relationships of elements in an image, which enhances the robustness of the template features. At the same time, the shape and position information of the target in the previous frame is introduced into the adaptive focusing module to increase the contrast between the target and similar objects, making the target features more prominent and thereby achieving robust and accurate target tracking.
[0004] To achieve the above objectives, the present invention adopts the following technical solution:
[0005] In a first aspect, the present invention provides a single-target tracking method based on Gaussian attention and adaptive focusing, comprising the following steps:
[0006] The labeled single-object tracking dataset is cropped, and then the cropped single-object tracking dataset is combined to form a training set, which includes template images and search images.
[0007] A deep neural network equipped with a Swin Transformer is used to extract features from the training set, obtaining search image features and template image features.
[0008] A Gaussian Transformer is constructed, which includes a Gaussian attention module and an adaptive focusing module. The training set is input and the Gaussian Transformer is trained. Specifically, sparse attention is established on the search image features, and then the most relevant region in the search image is obtained. Gaussian attention is used to enhance the template image features, capturing the positional relationships of elements in the image and generating a semantic template with robust target features, thus obtaining the core feature information in the template image. The adaptive focusing module is used to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target. The Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, where the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and weaken the importance of surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features.
[0009] The region most focused on the target is input into the dual-head predictor to obtain the target tracking region.
[0010] As a preferred technical solution, the cropping operation on the labeled single-target tracking dataset specifically involves:
[0011] Template images and search images are cropped from the same labeled video sequence containing the target region to be tracked. The template image is cropped from the first frame of the video sequence, and then the remaining frames of the video sequence are cropped to form multiple search images. Each search image contains a relatively large search region that includes the target to be tracked.
[0012] As a preferred technical solution, the Gaussian attention specifically refers to:
[0013] $F_{output}(Q_z, K_z, V_z) = S[T(G_z + F_{input}(Q_z, K_z))]\,V_z$
[0014] Where $F_{input}$ and $F_{output}$ denote the input and output of the Gaussian attention (GA) function, respectively; $G_z$ denotes the Gaussian prior distribution matrix; $T$ denotes the Top-k module; $S$ denotes the normalization operation module; $Q_z$ denotes the generated Query; $K_z$ denotes the generated Key; and $V_z$ denotes the generated Value.
[0015] As a preferred technical solution, the method of enhancing template image features using Gaussian attention, capturing the positional relationships of elements in the image, generating a semantic template with robust target features, and obtaining core feature information in the template image includes the following steps:
[0016] By utilizing a Gaussian prior distribution matrix based on a two-dimensional Gaussian probability density function, the weight of features in the central region of the template image is increased, while the weight of features in the surrounding region is decreased to enhance the tracker's focus on the central region of the template image. The Gaussian prior distribution matrix is as follows:
[0017]
[0018] Where u and v represent the coordinates of elements in the template, μ1 and μ2 represent the coordinates of the center element of the template, σ1 represents the standard deviation of the coordinates μ1 of the center element of the template, and σ2 represents the standard deviation of the coordinates μ2 of the center element of the template.
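A minimal Python sketch of such a prior, assuming the standard uncorrelated two-dimensional Gaussian density written in the symbols defined above (u, v, μ1, μ2, σ1, σ2). The function name `gaussian_prior`, the use of the geometric center of the grid as (μ1, μ2), and the example σ values are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

def gaussian_prior(height, width, sigma1, sigma2):
    """Return a (height, width) map of an uncorrelated 2-D Gaussian density
    centred on the centre element of the template (assumed form)."""
    mu1, mu2 = (height - 1) / 2.0, (width - 1) / 2.0   # centre coordinates (assumption)
    u = np.arange(height)[:, None]                      # row coordinate of each element
    v = np.arange(width)[None, :]                       # column coordinate of each element
    exponent = -0.5 * (((u - mu1) / sigma1) ** 2 + ((v - mu2) / sigma2) ** 2)
    return np.exp(exponent) / (2.0 * np.pi * sigma1 * sigma2)

# Example: a 7x7 template grid; the largest weight sits on the centre element
# and the weights decay towards the border.
G = gaussian_prior(7, 7, sigma1=2.0, sigma2=2.0)
```

Larger σ values flatten the prior and weaken the emphasis on the center; smaller values concentrate the weight on a few central elements.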
[0019] As a preferred technical solution, the method of enhancing template image features using Gaussian attention, capturing the positional relationships of elements in the image, generating a semantic template with robust target features, and obtaining core feature information in the template image, further includes the following steps:
[0020] The Top-k module is used to construct a mask matrix and extract the top k highest values to highlight the most relevant features in the template. Specifically, first, each row of the mask matrix is sorted and the top k values are extracted, and their feature values are retained. Then, the other elements of the mask matrix are replaced with negative infinity (-inf). Finally, the Softmax module is used to normalize the k largest elements in the similarity matrix.
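The Top-k step followed by Softmax can be sketched in Python as below. This is a sketch under assumptions: `topk_softmax` is a hypothetical name, and NumPy's `argsort` stands in for the row-wise sort described in the text.

```python
import numpy as np

def topk_softmax(scores, k):
    """Keep the k largest entries of each row, set the rest to -inf,
    then apply a row-wise softmax."""
    scores = np.asarray(scores, dtype=float)
    # Sort each row and take the indices of its k largest values.
    idx = np.argsort(scores, axis=-1)[..., -k:]
    # Start from an all -inf matrix and copy back only the retained values.
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    # Numerically stable softmax; exp(-inf) is 0, so masked entries get zero weight.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: with k = 2 only the entries 3.0 and 2.0 receive non-zero weight.
weights = topk_softmax([[1.0, 3.0, 2.0, 0.0]], k=2)
```

Replacing the discarded entries with -inf before the Softmax, rather than zeroing them afterwards, keeps each row of weights summing to one.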
[0021] As a preferred technical solution, the adaptive focusing module performs target focusing on the most relevant region in the search image and the core feature information in the template image to obtain the region most focused on the target, specifically as follows:
[0022] The adaptive focusing module (AFM) is defined as follows:
[0023]
[0024] Where AG represents the Adaptive Gaussian Process module, whose argument represents the scale and location information of the target in the previous frame; T represents the Top-k module; S represents the normalization module; GS represents the Gaussian similarity matrix; and Value represents the value generated from the template features;
[0025] The similarity matrix C is obtained by using the features of the search region as the query value Q, the features of the template image as the key value K, and the numerical value V through cross-attention calculation. The similarity matrix C is defined as follows:
[0026]
[0027] Where $d_k$ denotes the dimension of the key; Query denotes the query generated from the search image features; Key denotes the key generated from the template image features; and $\top$ denotes matrix transpose;
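A short Python sketch of this computation, assuming the standard scaled dot-product form that the definitions of the key dimension and the transpose suggest, that is, C = Q·Kᵀ / sqrt(d_k). The projection matrices `w_q` and `w_k` and the function name `cross_similarity` are hypothetical:

```python
import numpy as np

def cross_similarity(search_feats, template_feats, w_q, w_k):
    """search_feats: (Ns, d), template_feats: (Nt, d).
    Returns the scaled cross-similarity matrix C with shape (Ns, Nt)."""
    q = search_feats @ w_q          # Query generated from the search image features
    k = template_feats @ w_k        # Key generated from the template image features
    d_k = k.shape[-1]               # dimension of the key
    return q @ k.T / np.sqrt(d_k)   # assumed scaled dot-product form

# Example with random features: 9 search positions, 4 template positions.
rng = np.random.default_rng(0)
C = cross_similarity(rng.normal(size=(9, 4)), rng.normal(size=(4, 4)),
                     np.eye(4), np.eye(4))
```

Each row of C then holds the similarity of one search position to every template position.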
[0028] An Adaptive Gaussian Process (AG) module is used to construct a Gaussian mask over the search region, with the center of the target location in the previous frame as the origin. The variance of the Gaussian function is dynamically adjusted based on the target scale in the current frame. This ensures the distribution of the Gaussian mask adaptively fits the current target scale. Furthermore, an adaptive Gaussian mask and a scaled cross-similarity matrix are added to obtain a Gaussian similarity matrix, GS, defined as follows:
[0029]
[0030] The Top-k module is used to retain only the k largest values in each row of the Gaussian similarity matrix and to replace the other elements with negative infinity (-inf). The Softmax module is then used to normalize the k retained elements in each row.
[0031] As a preferred technical solution, the dual-head predictor includes a fully connected head and a convolutional head. The fully connected head uses fully connected layers to distinguish between the target and the background, and the convolutional head uses multi-layer convolution operations to locate the target's coordinate position.
[0032] In a second aspect, the present invention also provides a single-target tracking system based on Gaussian attention and adaptive focusing, applied to the single-target tracking method based on Gaussian attention and adaptive focusing, including a dataset preprocessing module, a feature extraction module, a feature processing module, and a dual-head predictor module;
[0033] The dataset preprocessing module is used to perform cropping operations on the labeled single-object tracking dataset, and then combine the cropped single-object tracking dataset to form a training set, which includes template images and search images.
[0034] The feature extraction module is used to extract features from the training set using a deep neural network equipped with a Swin Transformer, thereby obtaining search image features and template image features.
[0035] The feature processing module is used to construct a Gaussian Transformer, which includes a Gaussian attention module and an adaptive focusing module. The training set is input and the Gaussian Transformer is trained, specifically by: establishing sparse attention on the search image features, and then obtaining the most relevant region in the search image; using Gaussian attention to enhance the template image features, capturing the positional relationships of elements in the image and generating a semantic template with robust target features, thus obtaining the core feature information in the template image; and using the adaptive focusing module to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target. The Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, where the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and weaken the importance of surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features.
[0036] The dual-head predictor module is used to input the region most focused on the target into the dual-head predictor to obtain the target tracking region.
[0037] In a third aspect, the present invention provides an electronic device, the electronic device comprising:
[0038] At least one processor; and,
[0039] A memory communicatively connected to the at least one processor; wherein,
[0040] The memory stores computer program instructions that can be executed by the at least one processor to enable the at least one processor to perform the single-target tracking method based on Gaussian attention and adaptive focusing.
[0041] In a fourth aspect, the present invention provides a computer-readable storage medium storing a program that, when executed by a processor, implements the single-target tracking method based on Gaussian attention and adaptive focusing.
[0042] Compared with the prior art, the present invention has the following advantages and beneficial effects:
[0043] (1) This invention uses Gaussian attention, guided by a Gaussian prior, to model the semantic information of the template features, and at the same time uses the strong central correlation of the Gaussian prior to extract the core information at the template center, which significantly enhances the expressiveness of the target features and improves the robustness of the tracker.
[0044] (2) The present invention utilizes an adaptive focusing module to focus on the most relevant region in the search image and the core feature information in the template image, thereby achieving clear distinction between the target and similar objects, making the target more prominent, and thus improving the accuracy of the tracker.
Description of the Drawings
[0045] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0046] Figure 1 is a flowchart of a single-target tracking method based on Gaussian attention and adaptive focusing according to an embodiment of the present invention;
[0047] Figure 2 is a schematic diagram of a single-target tracking framework based on Gaussian attention and adaptive focusing according to an embodiment of the present invention;
[0048] Figure 3 is a schematic diagram of Gaussian attention in an embodiment of the present invention;
[0049] Figure 4 is a comparison diagram of the adaptive focusing module and the original cross-attention in an embodiment of the present invention;
[0050] Figure 5 is a visual comparison of Gaussian masks for different target scales in embodiments of the present invention;
[0051] Figure 6 is a comparison diagram of the effects of the adaptive focusing module and of cross-attention in an embodiment of the present invention;
[0052] Figure 7 is a comparison diagram of self-attention with Gaussian attention, and of cross-attention with the adaptive focusing module, in the embodiments of the present invention;
[0053] Figure 8 is a schematic diagram of the structure of a single-target tracking system based on Gaussian attention and adaptive focusing according to an embodiment of the present invention;
[0054] Figure 9 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0055] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort are within the scope of protection of the present application.
[0056] In this application, the reference to "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a mutually exclusive, independent, or alternative embodiment. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described in this application can be combined with other embodiments.
[0057] Referring to Figure 1, this embodiment provides a single-target tracking method based on Gaussian attention and adaptive focusing, which includes the following steps:
[0058] S1. Perform a cropping operation on the labeled single-object tracking dataset, and then combine the cropped single-object tracking datasets to form a training set, which includes template images and search images;
[0059] Preferably, the cropping operation on the labeled single-target tracking dataset described in step S1 specifically involves:
[0060] Template images and search images are cropped from the same labeled video sequence containing the target region to be tracked. The template image is cropped from the first frame of the video sequence, and then the remaining frames of the video sequence are cropped to form multiple search images. Each search image contains a relatively large search region that includes the target to be tracked.
[0061] During cropping, the video sequence is arranged according to the default playback time order. The first frame of the video sequence is the first frame of the selected video when it is played in the default playback time order. The remaining frames of the video sequence are all the frames of the selected video when it is played in the default playback time order, except for the first frame.
[0062] In this embodiment, the cropped template image is 127×127 and the cropped search image is 289×289.
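The pairing of one template crop with several search crops can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `crop_centered` and `make_pairs` are hypothetical, each crop is a fixed-size window centered on the annotated target center, areas outside the frame are zero-padded, and no rescaling by target size is performed. The text does not specify any of these details.

```python
import numpy as np

def crop_centered(frame, cx, cy, size):
    """Cut a size x size window centred on (cx, cy); parts of the window
    that fall outside the frame are zero-padded (assumption)."""
    half = size // 2
    out = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half   # window top-left in frame coordinates
    # Overlap between the window and the frame.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + size, frame.shape[1]), min(y0 + size, frame.shape[0])
    if fx1 > fx0 and fy1 > fy0:
        out[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return out

def make_pairs(frames, centers, template_size=127, search_size=289):
    """Template from the first frame; one search image per remaining frame."""
    template = crop_centered(frames[0], *centers[0], template_size)
    searches = [crop_centered(f, *c, search_size)
                for f, c in zip(frames[1:], centers[1:])]
    return template, searches

# Example: three dummy 100x120 RGB frames with the target centre at (60, 50).
frames = [np.ones((100, 120, 3), dtype=np.uint8) for _ in range(3)]
template, searches = make_pairs(frames, [(60, 50)] * 3)
```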
[0063] S2. Use a deep neural network equipped with a Swin Transformer to extract features from the training set to obtain search image features and template image features;
[0064] Step S2 primarily utilizes a trained deep neural network to extract features from the template image and the search image. In this embodiment, a backbone network is used for feature extraction, specifically, the Swin Transformer is used to extract features from both the template image and the search image, resulting in template image features and search image features respectively. The Swin Transformer is a Transformer-based image classification model that divides the image into blocks, extracts features, and then concatenates them to form global features. The main purpose of this model is to achieve the effect of a visual Transformer, while employing a block-based approach similar to CNNs for image feature extraction.
[0065] The Swin Transformer employs a hierarchical design, dividing the image into multiple stages, each containing multiple Patch Merging and Block operations. Patch Merging reduces image resolution, while Blocks extract features. Within each Block, the Swin Transformer uses Shift-Windowed Patch Embedding to embed each patch into a vector and leverages the Transformer structure for feature extraction. Unlike traditional Transformers, the Swin Transformer uses a CNN-like approach at input, dividing the image into blocks and extracting features. Simultaneously, it employs a Transformer-like structure for feature extraction, rather than the convolutional operations of a CNN. This design allows the Swin Transformer to perform block-based feature extraction like a CNN, while also enabling global feature extraction like a Transformer.
[0066] Referring to Figure 2, the two trapezoids in the feature extraction network are identical, each representing a pre-trained deep neural network with a Swin Transformer. Image features are extracted in blocks by the Swin Transformer through hierarchical and grouping methods. The pre-trained deep neural network reshapes the extracted feature image blocks through fully connected layers, obtaining template image features and search image features respectively, which are then input into a Gaussian Transformer. In this embodiment, by utilizing the Swin Transformer, the backbone network achieves efficient image feature extraction through hierarchical and grouping methods, resulting in a feature extraction network with superior computational and parameter efficiency.
[0067] It is worth noting that the shared weights in Figure 2 mean that the parameters of the two pre-trained deep neural networks are shared and need to be fine-tuned during feature extraction from the template image and the search image. Sharing weights helps the networks better learn the common features of the template image and the search image.
[0068] S3. Construct a Gaussian Transformer, which includes a Gaussian attention module and an adaptive focusing module. Establish sparse attention on the search image features, then obtain the most relevant region in the search image. Use Gaussian attention to enhance the template image features, capture the positional relationships of elements in the image, and generate a semantic template with robust target features to obtain the core feature information in the template image. Use the adaptive focusing module to perform target focusing on the most relevant region in the search image and the core feature information in the template image to obtain the region most focused on the target. The Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module. The Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and weaken the importance of surrounding information. The Top-k module is used to highlight the most relevant feature information in the template features. See Figures 2-4;
[0069] Preferably, the Gaussian attention described in step S3 is as shown in Figure 3, specifically:
[0070] $F_{output}(Q_z, K_z, V_z) = S[T(G_z + F_{input}(Q_z, K_z))]\,V_z$
[0071] Where $F_{input}$ and $F_{output}$ denote the input and output of the Gaussian attention (GA) function, respectively; $G_z$ denotes the Gaussian prior distribution matrix; $T$ denotes the Top-k module; $S$ denotes the normalization operation module; $Q_z$ denotes the generated Query; $K_z$ denotes the generated Key; and $V_z$ denotes the generated Value.
[0072] Preferably, referring to Figure 3, in step S3 a Gaussian prior distribution matrix based on a two-dimensional Gaussian probability density function is used to increase the weight of features in the central region of the template image and decrease the weight of features in the surrounding region, so as to increase the tracker's focus on the central region of the template image. The Gaussian prior distribution matrix is as follows:
[0073]
[0074] Where u and v represent the coordinates of elements in the template, μ1 and μ2 represent the coordinates of the center element of the template, σ1 represents the standard deviation of the coordinates μ1 of the center element of the template, and σ2 represents the standard deviation of the coordinates μ2 of the center element of the template.
[0075] Preferably, referring to Figure 3, in step S3 the Top-k module is used to construct a mask matrix and take the top k highest values to highlight the most important feature information in the template features. Specifically, first, each row of the mask matrix is sorted, the top k values are taken, and their feature values are retained; then, the other elements of the mask matrix are replaced with negative infinity (-inf); after that, the Softmax module is used to normalize the k largest elements in the similarity matrix.
[0076] In Figure 3, the input is the template image feature map, shown as the first square block on the left. The template image feature map is passed to the Gaussian attention module for processing. In Gaussian attention, the template image feature map undergoes a linear transformation to obtain the query, key, and value. The query and key are multiplied to obtain a similarity matrix, which is then scaled to generate a mask. After the Gaussian prior probability is added, the Gaussian attention matrix is obtained. The Top-k module then selects the k highest values to highlight the most relevant features in the template image. Finally, after normalization, the matrix is multiplied by the value to output the feature map that highlights the core information in the template image.
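The data flow just described (linear projections, scaled similarity, added Gaussian prior, Top-k, normalization, multiplication by the value) can be sketched in Python as below. Several points are assumptions made for illustration and are not stated in the text: the similarity is scaled by the square root of the key dimension, the flattened prior is added to every row so that it biases the key positions, and the names `gaussian_attention`, `w_q`, `w_k`, and `w_v` are hypothetical.

```python
import numpy as np

def gaussian_attention(x, w_q, w_k, w_v, prior, k):
    """x: (N, d) flattened template features; prior: (N,) flattened Gaussian prior.
    Returns the enhanced features (N, d) and the attention weights (N, N)."""
    q, key, v = x @ w_q, x @ w_k, x @ w_v              # linear projections
    sim = q @ key.T / np.sqrt(key.shape[-1])           # scaled similarity (assumed scaling)
    scores = sim + prior[None, :]                      # assumed: prior biases key positions
    kth = np.sort(scores, axis=-1)[:, -k][:, None]     # k-th largest value in each row
    # Top-k: everything below the k-th value becomes -inf
    # (exact ties at the k-th value would keep more than k entries).
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the retained entries
    return weights @ v, weights

# Example on a 4x4 template grid with 8-dimensional features.
rng = np.random.default_rng(0)
H = W = 4
d = 8
x = rng.normal(size=(H * W, d))
w_q, w_k, w_v = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
u, v = np.mgrid[0:H, 0:W]
prior = np.exp(-0.5 * ((u - 1.5) ** 2 + (v - 1.5) ** 2)).ravel()
out, weights = gaussian_attention(x, w_q, w_k, w_v, prior, k=5)
```

Adding the prior along the key axis means that, for every query position, central template elements receive a bonus before the Top-k selection. A prior that was constant within each row would cancel in the row-wise Softmax.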
[0077] This embodiment is trained on an RTX 3090 for 20 epochs, with an initial step size of 0.0001 and 8 heads in the multi-head self-attention. Training uses 600,000 image pairs, and the cropped images are 127×127 for the template image and 289×289 for the search image.
[0078] Preferably, referring to Figure 4, in step S3 the adaptive focusing module is used to focus on the most relevant region in the search image and the core feature information in the template image to obtain the region most focused on the target. Specifically:
[0079] The adaptive focusing module (AFM) is defined as follows:
[0080]
[0081] Where AG represents the Adaptive Gaussian Process module, whose argument represents the scale and location information of the target in the previous frame; T represents the Top-k module; S represents the normalization module; GS represents the Gaussian similarity matrix; and Value represents the value generated from the template features;
[0082] The similarity matrix C is obtained by using the features of the search region as the query value Q, the features of the template image as the key value K, and the numerical value V through cross-attention calculation. The similarity matrix C is defined as follows:
[0083]
[0084] Where $d_k$ denotes the dimension of the key; Query denotes the query generated from the search image features; Key denotes the key generated from the template image features; and $\top$ denotes matrix transpose;
[0085] An Adaptive Gaussian Process (AG) module is used to construct a Gaussian mask over the search region, with the center of the target location in the previous frame as the origin. The variance of the Gaussian function is dynamically adjusted based on the target scale in the current frame. This ensures the distribution of the Gaussian mask adaptively fits the current target scale. Furthermore, an adaptive Gaussian mask and a scaled cross-similarity matrix are added to obtain a Gaussian similarity matrix, GS, defined as follows:
[0086]
[0087] The Top-k module is used to retain only the k largest values in each row of the Gaussian similarity matrix and to replace the other elements with negative infinity (-inf). The Softmax module is then used to normalize the k retained elements in each row.
[0088] Figure 4 is a comparison diagram between the adaptive focusing module and the original cross-attention module in this embodiment of the invention. The left side of Figure 4 illustrates the original cross-attention method in the prior art. It takes a template image and the current frame image as input, passes them to the cross-attention module, and outputs the result. The resulting focus points are disorganized, with points of interest appearing on both the target and the background. The right side illustrates the adaptive focusing module proposed in this invention. Like the original cross-attention method, it takes a template image and the current frame image as input, but it adds a new input: the image of the previous frame. This makes the tracker focus more on the target. Finally, a Top-k module further refines the focus points.
[0089] Figure 5 shows a visual comparison of Gaussian masks for different target scales in embodiments of the present invention. Parts (1), (2), and (3) of Figure 5 show the Gaussian mask matrices when the aspect ratios of the target are 1:1, 5:3, and 3:5, respectively; the masks adaptively fit the scale of the current target.
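A Python sketch of masks with these three aspect ratios is given below. It assumes that the standard deviation along each axis is proportional to the target's width or height (the factor `scale` is a hypothetical parameter), that the ratios are read as width:height, and that the mask is an unnormalized Gaussian centered on the previous target position. The function name `adaptive_gaussian_mask` and the grid size are likewise illustrative.

```python
import numpy as np

def adaptive_gaussian_mask(height, width, center, target_wh, scale=0.5):
    """Gaussian mask over a (height, width) search grid.
    center:    (cx, cy) of the target in the previous frame, in grid units.
    target_wh: (w, h) of the target in grid units; sigma = scale * size (assumed rule)."""
    cx, cy = center
    sigma_x, sigma_y = scale * target_wh[0], scale * target_wh[1]
    ys = np.arange(height)[:, None]   # row (y) coordinate of each grid cell
    xs = np.arange(width)[None, :]    # column (x) coordinate of each grid cell
    return np.exp(-0.5 * (((xs - cx) / sigma_x) ** 2 + ((ys - cy) / sigma_y) ** 2))

# Masks for targets with width:height ratios 1:1, 5:3 and 3:5 on a 19x19 grid.
sizes = {"1:1": (6, 6), "5:3": (10, 6), "3:5": (6, 10)}
masks = {ratio: adaptive_gaussian_mask(19, 19, (9, 9), wh) for ratio, wh in sizes.items()}
```

With this rule a wide target produces a mask that decays more slowly in the horizontal direction, and a tall target produces one that decays more slowly in the vertical direction.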
[0090] Figure 6 shows a comparison of the effects of the adaptive focusing module and of cross-attention in the embodiments of the present invention. The first column is the ground-truth annotation of the target search area; the second column is the feature map obtained by cross-attention processing in the prior art, which subjectively appears cluttered and disordered; the third column is the feature map obtained by the adaptive focusing module, which subjectively appears focused on the target itself and is almost identical to the ground-truth annotation.
[0091] The method of the present invention can track scenes more effectively under complex conditions. It can not only reduce the interference of background information, but also eliminate the influence of non-target factors, so that the tracker can focus more on tracking the target itself without being disturbed by background information or similar targets.
[0092] Figure 7 shows a comparison of self-attention with Gaussian attention, and of cross-attention with the adaptive focusing module, in embodiments of the present invention. In Figure 7, sections (1) and (3) on the left compare self-attention and Gaussian attention. Self-attention computes attention independently of the element positions in the template, so all elements are considered, whereas Gaussian attention focuses more on the central region of the template. Sections (2) and (4) on the right compare ordinary cross-attention and the adaptive focusing module. Ordinary cross-attention is prone to identifying similar objects as the tracking target, which reduces tracking accuracy, while the adaptive focusing module yields results closer to the ground truth.
[0093] S4. Input the region most focused on the target into the dual-head predictor to obtain the target tracking region.
[0094] The dual-head predictor includes a fully connected head and a convolutional head. The fully connected head uses a fully connected layer to distinguish between the target and the background, and the convolutional head uses multiple convolutional operations to locate the target's coordinates, thereby achieving robust and accurate target tracking.
[0095] The objective evaluation results of the method of the present invention are shown in Table 1. On the LaSOT and TrackingNet datasets, the present invention uses as evaluation metrics the IoU value (AUC) between the ground-truth bounding box and the bounding box generated by the tracker, the normalized precision ($P_{Norm}$), and the precision (P). The average overlap (AO), the proportion of successfully tracked frames with overlap exceeding 0.5 ($SR_{0.5}$), and the proportion of successfully tracked frames with overlap exceeding 0.75 ($SR_{0.75}$) are also used as metrics. For each metric, the higher the value, the more accurately the tracker tracks the target.
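The overlap-based metrics above can be sketched in a few lines of Python. The example assumes boxes in `(x, y, w, h)` format and uses four made-up frames; it covers only the average overlap and the two success rates, not AUC or normalized precision, and the function names are illustrative rather than taken from any benchmark toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(overlaps):
    """Mean IoU over all frames (AO)."""
    return sum(overlaps) / len(overlaps)

def success_rate(overlaps, threshold):
    """Fraction of frames whose IoU exceeds `threshold` (SR)."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)

# Hypothetical ground-truth and predicted boxes for four frames.
ground_truth = [(0, 0, 10, 10)] * 4
predicted = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 6), (20, 20, 10, 10)]

overlaps = [iou(g, p) for g, p in zip(ground_truth, predicted)]
ao = average_overlap(overlaps)
sr_50 = success_rate(overlaps, 0.5)
sr_75 = success_rate(overlaps, 0.75)
```

In this toy sequence the last frame misses the target entirely, which lowers the average overlap and both success rates.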
[0096] Table 1
[0097]
[0098] It should be noted that, for the sake of simplicity, the aforementioned method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, because according to the present invention, some steps can be performed in other orders or simultaneously.
[0099] Based on the same idea as the single-target tracking method based on Gaussian attention and adaptive focusing in the above embodiments, the present invention also provides a single-target tracking system based on Gaussian attention and adaptive focusing, which can be used to execute the above-described single-target tracking method based on Gaussian attention and adaptive focusing. For ease of explanation, the structural diagram of the embodiment of the single-target tracking system based on Gaussian attention and adaptive focusing only shows the parts relevant to the embodiments of the present invention. Those skilled in the art will understand that the illustrated structure does not constitute a limitation on the device, and it may include more or fewer components than illustrated, or combine certain components, or have different component arrangements.
[0100] Please see Figure 8. In another embodiment of this application, a single-target tracking system 100 based on Gaussian attention and adaptive focusing is provided. The system includes a dataset preprocessing module 101, a feature extraction module 102, a feature processing module 103, and a dual-head predictor module 104.
[0101] The dataset preprocessing module 101 is used to perform cropping operations on the labeled single-target tracking dataset, and then combine the cropped single-target tracking dataset to form a training set, which includes template images and search images.
[0102] The feature extraction module 102 is used to extract features from the training set using a deep neural network equipped with a Swin Transformer to obtain search image features and template image features.
[0103] The feature processing module 103 is used to construct a Gaussian Transformer, which includes a Gaussian attention module and an adaptive focusing module. The training set is input and the Gaussian Transformer is trained, specifically by: establishing sparse attention on the search image features, and then obtaining the most relevant region in the search image; using Gaussian attention to enhance the template image features, capturing the positional relationships of elements in the image and generating a semantic template with robust target features, thereby obtaining the core feature information in the template image; and using the adaptive focusing module to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target. The Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, wherein the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and weaken the importance of surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features.
[0104] The dual-head predictor module 104 is used to input the region most focused on the target into the dual-head predictor to obtain the target tracking region.
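The Top-k step used in the feature processing module can be sketched for a single row of attention scores. The sketch below keeps the k largest scores, replaces the others with negative infinity, and then normalizes. It assumes that the normalization operation is a softmax, which the text does not state explicitly, and the names `top_k_mask` and `normalize` are hypothetical.

```python
import math

def top_k_mask(row, k):
    """Keep the k largest entries of `row`; replace the rest with -inf."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else -math.inf for i, v in enumerate(row)]

def normalize(row):
    """Softmax (an assumption); -inf entries receive exactly zero weight."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one template element against five others.
scores = [0.2, 1.5, -0.3, 0.9, 1.1]
masked = top_k_mask(scores, k=3)
weights = normalize(masked)
```

Because the masked entries become zero after normalization, only the k most relevant elements contribute to the output, which is how the sparsification highlights the most relevant feature information.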
[0105] It should be noted that the single-target tracking system based on Gaussian attention and adaptive focusing of the present invention corresponds one-to-one with the single-target tracking method based on Gaussian attention and adaptive focusing of the present invention. The technical features and beneficial effects described in the embodiments of the single-target tracking method based on Gaussian attention and adaptive focusing described above are applicable to the embodiments of the single-target tracking system based on Gaussian attention and adaptive focusing. For details, please refer to the description in the embodiments of the method of the present invention, which will not be repeated here.
[0106] Furthermore, in the implementation of the single-target tracking system based on Gaussian attention and adaptive focusing in the above embodiments, the logical division of each program module is only an example. In actual applications, the above functions can be assigned to different program modules as needed, for example, for the configuration requirements of the corresponding hardware or for the convenience of software implementation. That is, the internal structure of the single-target tracking system based on Gaussian attention and adaptive focusing is divided into different program modules to complete all or part of the functions described above.
[0107] Please see Figure 9. In one embodiment, an electronic device is provided for implementing the single-target tracking method based on Gaussian attention and adaptive focusing. The electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may also include a computer program stored in the first memory 202 and executable on the first processor 201, such as a single-target tracking program 203 based on Gaussian attention and adaptive focusing.
[0108] The first memory 202 includes at least one type of readable storage medium, including flash memory, portable hard drive, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the first memory 202 can be an internal storage unit of the electronic device 200, such as the portable hard drive of the electronic device 200. In other embodiments, the first memory 202 can also be an external storage device of the electronic device 200, such as a plug-in portable hard drive, smart media card (SMC), secure digital card (SD), flash card, etc., equipped on the electronic device 200. Furthermore, the first memory 202 can include both internal storage units and external storage devices of the electronic device 200. The first memory 202 can be used not only to store application software and various types of data installed on the electronic device 200, such as the code of a single-target tracking program 203 based on Gaussian attention and adaptive focusing, but also to temporarily store data that has been output or will be output.
[0109] In some embodiments, the first processor 201 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The first processor 201 is the control unit of the electronic device, connecting various components of the entire electronic device through various interfaces and lines. It executes programs or modules stored in the first memory 202 and calls data stored in the first memory 202 to perform various functions of the electronic device 200 and process data.
[0110] Figure 9 shows only an electronic device having certain components; those skilled in the art will understand that the structure shown in Figure 9 does not constitute a limitation on the electronic device 200, which may include fewer or more components than shown, combine certain components, or have a different arrangement of components.
[0111] The single-target tracking program 203 based on Gaussian attention and adaptive focusing, stored in the first memory 202 of the electronic device 200, is a combination of multiple instructions. When run in the first processor 201, it can achieve the following:
[0112] The labeled single-object tracking dataset is cropped, and then the cropped single-object tracking dataset is combined to form a training set, which includes template images and search images.
[0113] Features are extracted from the training set using a deep neural network equipped with a Swin Transformer to obtain search image features and template image features.
[0114] A Gaussian Transformer is constructed, which includes a Gaussian attention module and an adaptive focusing module. The training set is input and the Gaussian Transformer is trained. Specifically, sparse attention is established on the search image features, and then the most relevant region in the search image is obtained. Gaussian attention is used to enhance the template image features, capturing the positional relationships of elements in the image and generating a semantic template with robust target features, thus obtaining the core feature information in the template image. The adaptive focusing module is used to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target. The Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, where the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and weaken the importance of surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features.
[0115] The region most focused on the target is input into the dual-head predictor to obtain the target tracking region.
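The cropping step at the start of this procedure can be sketched as a computation of crop boxes around an annotated target. The sketch assumes `(x, y, w, h)` boxes, square crops, and area factors of 4 for the template and 16 for the search image. These factors, the box values, and the helper name `crop_box` are hypothetical and serve only to show a search region that is larger than the template region.

```python
import math

def crop_box(target_box, area_factor):
    """Square crop centred on the target, covering `area_factor` times its area.

    `target_box` is (x, y, w, h); the format and square shape are assumptions.
    """
    x, y, w, h = target_box
    side = math.sqrt(w * h * area_factor)
    cx, cy = x + w / 2, y + h / 2
    return (cx - side / 2, cy - side / 2, side, side)

# Hypothetical annotations from the first frame and a later frame.
first_frame_box = (100, 80, 40, 20)
later_frame_box = (110, 85, 40, 20)

template_crop = crop_box(first_frame_box, area_factor=4)
search_crop = crop_box(later_frame_box, area_factor=16)
```

A real pipeline would also clip or pad crops that extend beyond the image border and resize them to fixed input sizes; those steps are omitted here.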
[0116] Furthermore, if the modules / units integrated in the electronic device 200 are implemented as software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
[0117] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0118] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0119] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A single-target tracking method based on Gaussian attention and adaptive focusing, characterized in that it includes the following steps:
cropping the labeled single-target tracking dataset, and then combining the cropped single-target tracking dataset to form a training set, which includes template images and search images;
extracting features from the training set using a deep neural network equipped with a Swin Transformer to obtain search image features and template image features;
constructing a Gaussian Transformer comprising Gaussian attention and an adaptive focusing module, inputting the training set, and training the Gaussian Transformer, specifically: sparse attention is established on the search image features, and then the most relevant region in the search image is obtained; Gaussian attention is used to enhance the template image features, capturing the positional relationships of elements in the image and generating a semantic template with robust target features, thus obtaining the core feature information in the template image; the adaptive focusing module is used to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target; the Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, wherein the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and reduce the importance of the surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features;
inputting the region most focused on the target into the dual-head predictor to obtain the target tracking region;
wherein using the adaptive focusing module to perform target focusing on the most relevant region in the search image and the core feature information in the template image to obtain the region most focused on the target is specifically as follows:
the adaptive focusing module (AFM) is defined as:
$$\mathrm{AFM}(Q, K, V) = \mathrm{Norm}\big(\mathrm{Top}\text{-}k(G_s)\big)\,V, \qquad G_s = \mathrm{AG}(B) + C$$
where $\mathrm{AG}$ denotes the adaptive Gaussian process module, $B$ denotes the scale and location information of the target in the previous frame, $\mathrm{Top}\text{-}k$ denotes the Top-k module, $\mathrm{Norm}$ denotes the normalization operation module, $G_s$ denotes the Gaussian similarity matrix, and $V$ (Value) denotes the values generated from the template features;
using the search region features as the query $Q$ and the template image features as the key $K$ and the value $V$, cross-attention is computed to obtain the similarity matrix $C$, defined as:
$$C = \frac{Q K^{T}}{\sqrt{d_k}}$$
where $d_k$ denotes the dimension of the key, $Q$ (Query) denotes the query generated from the search image features, $K$ (Key) denotes the key generated from the template image features, and $T$ denotes matrix transposition;
the adaptive Gaussian process module $\mathrm{AG}$ takes the center of the target location in the previous frame as the origin and constructs a Gaussian mask over the search region, and the variance of the Gaussian function is dynamically adjusted according to the target scale in the current frame, so that the distribution of the Gaussian mask adaptively fits the current target scale; the adaptive Gaussian mask and the scaled cross-similarity matrix are then added to obtain the Gaussian similarity matrix $G_s$, defined as:
$$G_s = \mathrm{AG}(B) + C$$
the Top-k module retains only the $k$ largest values in each row of the Gaussian similarity matrix and masks the rest; the Norm module normalizes the $k$ largest elements in each row of the similarity matrix, and the other elements of the similarity matrix are replaced with negative infinity.
2. The single-target tracking method based on Gaussian attention and adaptive focusing according to claim 1, characterized in that the cropping operation on the labeled single-target tracking dataset specifically involves: template images and search images are cropped from the same labeled video sequence containing the target region to be tracked; the template image is cropped from the first frame of the video sequence, and the remaining frames of the video sequence are then cropped to form multiple search images; each search image contains a relatively large search region that includes the target to be tracked.
3. The single-target tracking method based on Gaussian attention and adaptive focusing according to claim 1, characterized in that the Gaussian attention (GA) is specifically defined by a formula whose terms denote, respectively: the input and the output of the Gaussian attention; the Gaussian prior distribution matrix; the Top-k module; the normalization operation module; the generated query (Query); the generated key (Key); and the generated value (Value).
4. The single-target tracking method based on Gaussian attention and adaptive focusing according to claim 1, characterized in that enhancing the template image features using Gaussian attention, capturing the positional relationships of elements in the image, and generating a semantic template with robust target features to obtain the core feature information in the template image includes the following step: using a Gaussian prior distribution matrix based on a two-dimensional Gaussian probability density function, the weight of features in the central region of the template image is increased while the weight of features in the surrounding region is decreased, so as to enhance the tracker's focus on the central region of the template image. The Gaussian prior distribution matrix is as follows:
$$G(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\left(\frac{(x-\mu_x)^2}{2\sigma_x^2} + \frac{(y-\mu_y)^2}{2\sigma_y^2}\right)\right)$$
where $x$ and $y$ denote the coordinates of an element in the template, $(\mu_x, \mu_y)$ denotes the coordinates of the center element of the template, and $\sigma_x$ and $\sigma_y$ denote the standard deviations associated with the center coordinates $\mu_x$ and $\mu_y$, respectively.
5. The single-target tracking method based on Gaussian attention and adaptive focusing according to claim 1, characterized in that enhancing the template image features using Gaussian attention, capturing the positional relationships of elements in the image, generating a semantic template with robust target features, and obtaining the core feature information in the template image further includes the following step: the Top-k module is used to take the $k$ highest values of the mask matrix in order to highlight the most relevant feature information in the template features. Specifically, each row of the mask matrix is first sorted and its top $k$ values are taken, retaining their feature values; the other elements of the mask matrix are then replaced with negative infinity; finally, the normalization operation module normalizes the $k$ largest elements of the similarity matrix.
6. The single-target tracking method based on Gaussian attention and adaptive focusing according to claim 1, wherein the dual-head predictor comprises a fully connected head and a convolutional head, characterized in that the fully connected head uses a fully connected layer to distinguish the target from the background, and the convolutional head uses multiple convolutional operations to locate the target's coordinates.
7. A single-target tracking system based on Gaussian attention and adaptive focusing, characterized in that it is applied to the single-target tracking method based on Gaussian attention and adaptive focusing according to any one of claims 1-6 and includes a dataset preprocessing module, a feature extraction module, a feature processing module, and a dual-head predictor module;
the dataset preprocessing module is used to perform cropping operations on the labeled single-target tracking dataset, and then combine the cropped single-target tracking dataset to form a training set, which includes template images and search images;
the feature extraction module is used to extract features from the training set using a deep neural network equipped with a Swin Transformer, thereby obtaining search image features and template image features;
the feature processing module is used to construct a Gaussian Transformer, which includes a Gaussian attention module and an adaptive focusing module; the training set is input and the Gaussian Transformer is trained, specifically by: establishing sparse attention on the search image features, and then obtaining the most relevant region in the search image; enhancing the template image features using Gaussian attention to capture the positional relationships of elements in the image and generate a semantic template with robust target features, obtaining the core feature information in the template image; and using the adaptive focusing module to perform target focusing on the most relevant region in the search image and the core feature information in the template image, obtaining the region most focused on the target; the Gaussian attention includes a Gaussian prior distribution matrix and a Top-k module, wherein the Gaussian prior distribution matrix is used to enhance the importance of the center information of the template image and reduce the importance of the surrounding information, and the Top-k module is used to highlight the most relevant feature information in the template features;
the dual-head predictor module is used to input the region most focused on the target into the dual-head predictor to obtain the target tracking region.
8. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores computer program instructions that can be executed by the at least one processor to enable the at least one processor to perform the single-target tracking method based on Gaussian attention and adaptive focusing as described in any one of claims 1-6.
9. A computer-readable storage medium storing a program, characterized in that, when the program is executed by a processor, it implements the single-target tracking method based on Gaussian attention and adaptive focusing as described in any one of claims 1-6.