A visual target tracking system and method based on point set diffusion
The visual target tracking system based on point set diffusion combines a ViT-Base encoder with a denoising diffusion decoder that localizes the target over multiple iterations, addressing target deformation and occlusion by interfering objects in dynamic scenes and making target tracking more efficient.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2024-04-22
- Publication Date
- 2026-06-16
AI Technical Summary
Existing deep learning-based visual target tracking algorithms are insufficient in terms of processing speed and self-correction capabilities, especially in dynamic scenes where they struggle to handle target deformation and occlusion by interfering objects.
A visual target tracking system employing point set diffusion utilizes a ViT-Base encoder and a denoising diffusion-based decoder. By processing image features through multiple iterations of denoising diffusion layers, it achieves multiple target localization and self-correction, making it suitable for target tracking in dynamic scenes.
It improves the robustness and accuracy of visual target tracking algorithms in dynamic scenes, effectively handles target deformation and occlusion by interference objects, and enhances the flexibility and processing speed of the algorithm.
Figure CN118334081B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision, and specifically to a visual target tracking system and method based on point set diffusion.
Background Technology
[0002] Visual target tracking technology has a wide range of applications, including security monitoring, human-computer interaction, and medical imaging. Visual target tracking involves detecting moving targets, extracting features, classifying and recognizing them, tracking filtering, and recognizing behaviors in continuous image sequences to obtain accurate target motion parameters, such as position and velocity. These parameters are then processed and analyzed to achieve an understanding of the target's behavior.
[0003] Many factors influence the performance of tracking algorithms, including changes in lighting, occlusion, and background clutter, and there is no single algorithm that can successfully handle all scenarios. Therefore, developing more robust and accurate high-performance tracking algorithms is of great significance for industrial applications in various scenarios.
[0004] To address these challenges, numerous new visual target tracking algorithms have been proposed in recent years. Early popular tracking algorithms were detection-based frameworks, which typically framed the tracking problem as a detection problem. First, the search region was detected and scored to obtain a large number of samples. Then, these samples were classified, and finally, the target location was determined. Detection-based tracking frameworks generally consist of three stages: feature extraction, tracking model construction, and target region classification. These three stages work together to locate the final tracked target. However, early tracking models were limited by their complex algorithms, and their processing speed was insufficient, which significantly restricted their application in practical engineering.
[0005] Since 2016, deep learning-based target tracking algorithms have gradually become mainstream. The earliest deep learning-based target tracking algorithms introduced deep neural networks into feature extraction for target tracking. By using deep neural networks as feature extractors, large training datasets were incorporated into the target tracking algorithms, significantly improving their robustness and accuracy. Among these, Siamese network-based tracking algorithms are the most popular in the deep learning era. They utilize single forward pass inference to directly locate the target, greatly improving processing speed. Single forward pass inference refers to a complete computation process performed on new input data through a trained neural network model to obtain the model's prediction or classification result. In this process, the input data starts from the input layer of the neural network and propagates forward layer by layer, undergoing computation and transformation at each layer before finally reaching the output layer. The result produced by the output layer is the model's prediction or classification of the input data. This complete computation process constitutes one forward pass inference. Unlike the training process, forward pass inference does not involve backpropagation or parameter updates. During training, the model iterates through forward and backward propagation multiple times to optimize model parameters and improve performance. However, during inference or deployment, the model parameters are fixed, and only forward inference is needed. Nevertheless, the training and inference patterns of Siamese networks are heavily constrained by the training dataset, and their single-pass forward inference means they cannot self-correct after tracking failures.
[0006] Therefore, the key to developing high-performance tracking algorithms lies in how to innovate the single-pass forward inference form of tracking algorithms while taking into account both processing speed and algorithm simplicity, so as to enable the target tracking algorithm to have self-correction and multiple localization tracking capabilities.
Summary of the Invention
[0007] To address the shortcomings of existing technologies, the purpose of this invention is to provide a visual target tracking scheme based on point set diffusion. By using point sets as the target representation, the diffusion process of random noise to the target can be realized. This transforms the target localization method that uses single forward propagation in the traditional tracking algorithm into a generative multi-iterative target localization method, enabling the tracking model to handle target deformation and interference occlusion, and making it more suitable for target tracking in dynamic scenes.
[0008] To achieve the above-mentioned objectives, in a first aspect, the present invention provides a visual target tracking system based on point set diffusion, comprising a visual target tracking unit, wherein the visual target tracking unit includes a ViT-Base encoder and a decoder based on denoising diffusion. The ViT-Base encoder has N1 encoding layers, each encoding layer being used to extract features from a template image and a search image respectively, and the features of the two images are interacted and readjusted into search image features of a two-dimensional structure, and output to the decoder. The decoder initializes N2 point sets randomly distributed in the search image features of the two-dimensional structure. The decoder has N denoising diffusion layers, wherein t adjacent denoising diffusion layers sequentially denoise the N2 point sets, and the output of the (t-1)th denoising diffusion layer is used as the input of the tth denoising diffusion layer. Each denoising process yields N2 target candidate boxes corresponding one-to-one with the N2 point sets, 1≤t≤T, where T is the total number of iteration steps and t is the current iteration step. When the confidence score of any target candidate box is greater than a preset threshold, the target candidate box is determined as a target.
[0009] Preferably, the ViT-Base encoder includes 12 transformer layers, wherein the first four layers encode the frame features of the template image and the frame features of the search image independently, and the last eight layers encode the two jointly.
[0010] Preferably, the frame size of the template image is an integer multiple of 128*128, the frame size of the search image is an integer multiple of 128*128, and each image is downsampled by a factor of 4 before being input into the transformer layers, with the number of channels increased from 3 to 768.
[0011] Preferably, the denoising diffusion layer includes a global instance interaction layer, a dynamic convolutional layer, and a prediction fine-tuning layer. The global instance interaction layer extracts instance features from the N2 point sets through RoI pooling. The dynamic convolutional layer performs global attention pooling and dynamic convolution processing on each instance feature. The prediction fine-tuning layer predicts the boundary of the target candidate box based on the dynamic convolution processing and sends the prediction result to the next denoising diffusion layer for fine-tuning.
[0012] Preferably, the denoising diffusion layer is a generative tracking model based on Markov probability, which is suitable for generating and determining the target from random noise, and obtaining the accurate position and size of the target through multi-step diffusion iteration of the random noise.
[0013] Preferably, the input to the training phase of the generative tracking model is constructed by mixing randomly distributed noise points and labels together.
[0014] Preferably, the noise point is Gaussian noise with a signal-to-noise ratio of 1.0-2.0.
[0015] Preferably, the noise level of each iteration step is controlled by a predefined monotonically decreasing cosine function.
[0016] Preferably, the training loss of the generative tracking model adopts the ensemble prediction loss on the set of N3 predictions, and the top 5 scores are selected for supervision based on the IoU score.
[0017] Preferably, multi-layer supervision is used to train the generative tracking model.
[0018] Preferably, an integrated prediction mechanism is used in the inference process of the generative tracking model. On the one hand, target candidate boxes with confidence scores below a first threshold are replaced with random point sets, while target candidate boxes with confidence scores above a second threshold are retained. On the other hand, a voting strategy is used to filter out interfering objects from the multiple target candidate boxes with confidence scores above the second threshold.
[0019] Preferably, the preset threshold is 0.7.
[0020] In a second aspect, the present invention provides a visual target tracking method based on point set diffusion, employing the visual target tracking system based on point set diffusion described in any of the technical solutions of the first aspect, comprising the following steps: S100: extracting features from the template image and the search image respectively, and readjusting the features of the two to form a two-dimensional search image feature; S110: initializing N2 point sets randomly distributed in the two-dimensional search image feature; S120: sequentially denoising the N2 point sets with t adjacent denoising diffusion layers, using the output of the (t-1)th denoising diffusion layer as the input of the tth denoising diffusion layer, and obtaining N2 target candidate boxes corresponding one-to-one with the N2 point sets in each denoising process, 1≤t≤T, where T is the total number of iterations and t is the current number of iterations; S130: comparing the confidence score of the target candidate boxes with a preset threshold, and determining the target candidate box as the target when the confidence score of any target candidate box is greater than the preset threshold.
[0021] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0022] 1. The encoder of the diffusion model (i.e., the generative tracking model) is the ViT-Base encoder, but instead of the ViT-Base MLP head, the decoder has N denoising diffusion layers, which realize the diffusion process from random noise to the target.
[0023] 2. The target is represented by point sets, which enables the tracking model to handle target deformation and occlusion by interfering objects.
[0024] 3. It adopts a Markov chain-based training and inference method, which enables the target tracking model to fine-tune the target localization multiple times.
Attached Figure Description
[0025] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0026] Figure 1 is a schematic diagram illustrating the principle of one embodiment of the system of the present invention;
[0027] Figure 2 is a diagram illustrating the overall tracking framework of the generative tracking model in one embodiment of the system of the present invention;
[0028] Figure 3 is a schematic diagram of the denoising diffusion layer in one embodiment of the system of the present invention.
Detailed Implementation
[0029] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention. These all fall within the scope of protection of the present invention.
[0030] As shown in Figures 1-3, an embodiment of the visual target tracking system based on point set diffusion of the present invention includes a visual target tracking unit. The visual target tracking unit includes a ViT-Base encoder and a decoder based on denoising diffusion. The ViT-Base encoder has N1 encoding layers. Each encoding layer is used to extract features from the template image and the search image respectively. After the features of the two images have interacted, they are readjusted into two-dimensional search image features and output to the decoder. The decoder initializes N2 point sets randomly distributed in the two-dimensional search image features. The decoder has N denoising diffusion layers, where t adjacent denoising diffusion layers sequentially denoise the N2 point sets. The output of the (t-1)th denoising diffusion layer is used as the input of the tth denoising diffusion layer (this is the reverse diffusion process). Each denoising process yields N2 target candidate boxes corresponding one-to-one with the N2 point sets, 1≤t≤T, where T is the total number of iterations and t is the current iteration number. When the confidence score of any target candidate box is greater than a preset threshold, the target candidate box is determined as the target.
[0031] In this embodiment, a running race involving multiple athletes is used as the application scenario. A running scene consists of multiple video frames. The target to be tracked in the first frame is Usain Bolt, an athlete in lane three. The goal is to track or locate Bolt's position and size in each subsequent frame. As shown in Figures 1 and 2, a template image z is obtained by cropping the image of Bolt from the first frame, and the next frame is the search image x. Both are input into a visual target tracking unit based on tracking model f. Tracking model f includes a feature encoder and a decoder. The encoder is a ViT-Base encoder, and the decoder is a denoising diffusion-based decoder. Tracking model f constructs the entire target tracking and localization process as a denoising diffusion process, with the number of iterations ranging from 0 to T. y represents the localization result corresponding to each step, i.e., the predicted set of points.
[0032] In this embodiment, the visual target tracking problem is constructed as a point-set-based diffusion process. A diffusion model is a generative model that accomplishes generative tasks by modeling the generation process from noise to the target as a probabilistic model. Drawing on this idea, this generative model is introduced into a discriminative perception task. Specifically, during tracking, multiple point sets are used to estimate the target state, including the target's position and size, where each set of points corresponds to a target candidate box in the search area. The goal of the noise-to-target tracking paradigm is to learn a tracking model that can gradually obtain the accurate position and size of the tracked target through multiple diffusion iterations. The diffusion process during training describes the change in the target estimate from a completely random state to a final deterministic state. Therefore, during inference, the diffusion model-based tracker can fine-tune the target estimate multiple times by estimating the current iteration step. The point set initially consists of random noise scattered across the image features. After multiple layers of decoding, the point set gradually converges towards the target. Image features are input to the decoder, which outputs the movement direction and step size of each point set, causing the point set to move towards the target and thus realizing the diffusion process from random noise to the target. In addition, the point set representation of the target allows the tracking model to handle target deformation and occlusion by interfering objects.
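As an illustration only, the noise-to-target idea can be shown with a toy sketch. It assumes a hypothetical stand-in for one decoder layer, `predict_offsets`, that moves every point a fixed fraction of the way towards a known target centre; in the embodiment the movement direction and step size are predicted from image features, which this sketch does not model.

```python
import random

def predict_offsets(points, target, rate=0.5):
    """Hypothetical stand-in for one decoder layer: one (dx, dy) per point."""
    tx, ty = target
    return [((tx - x) * rate, (ty - y) * rate) for x, y in points]

def refine(points, target, steps):
    """Apply the predicted offsets for a number of iterations."""
    for _ in range(steps):
        offsets = predict_offsets(points, target)
        points = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]
    return points

def mean_distance(points, target):
    """Average Euclidean distance from the points to the target centre."""
    tx, ty = target
    return sum(((x - tx) ** 2 + (y - ty) ** 2) ** 0.5 for x, y in points) / len(points)

rng = random.Random(0)                       # fixed seed for repeatability
target = (40.0, 24.0)                        # assumed target centre on a 64x64 map
noise_points = [(rng.uniform(0, 64), rng.uniform(0, 64)) for _ in range(9)]
refined_points = refine(noise_points, target, steps=3)
```

With this fixed rate of 0.5, each iteration halves every point's distance to the target, so three iterations shrink the mean distance to one eighth of its initial value.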
[0033] In this embodiment, the inference process has three properties: dynamic inference, ensemble prediction, and mid-process termination, described as follows:
1. Dynamic inference: Two dynamic settings are introduced during inference: the first is an arbitrary number of target estimates N4, and the second is the number of iterations. In contrast, traditional detection-based trackers predict targets only in a single forward pass. The inference method in this embodiment brings great flexibility and dynamism. For each iteration step, the previous target prediction is sent to the diffusion denoising layer to generate the input for the next step. The input noise to be sent to the next iteration is obtained using the reverse diffusion process (diffusion goes from ground truth to random noise, and reverse diffusion goes from random noise to ground truth). This makes the most of the multiple diffusion iterations seen in training, thereby optimizing the results.
2. Ensemble prediction: During the inference process at each decoder layer and each evaluation step, the predicted boxes can be roughly divided into two categories: predictions with high confidence scores and predictions with low scores. High-score predictions are usually correctly located on the corresponding target, while low-score predictions mainly fall on the background. Inspired by these observations, an update strategy is adopted to replace low-scoring estimates with random point sets and retain high-scoring predictions, thereby improving predictions for the next layer or the next iteration. Furthermore, for multiple high-scoring estimates, a voting strategy is used to filter out interfering objects for more accurate target localization.
3. Mid-process termination: Since each layer in the decoder based on the diffusion denoising process can generate predictions independently, the model's inference can be accelerated by exiting the inference process early. Here, a simple threshold termination strategy is used to stop inference; that is, when the maximum confidence score is greater than a preset threshold, the forward inference process is terminated early, i.e., at a current iteration step t smaller than the total number of iteration steps T.
The value of T can be flexibly determined, which is precisely the advantage of the denoising diffusion model, allowing for flexible adjustment based on the difficulty of the tracking scene. Generally, a simple threshold judgment strategy is used: when the confidence is high, the iteration is terminated; when the confidence is low, another iteration is performed. Generally, a maximum of two complete iterations are performed. Experiments show that more than three iterations do not significantly improve the results. Each fine-tuning of the prediction result constitutes one iteration. After all denoising diffusion layers have completed one iteration, the result can still be input into the earliest denoising diffusion layer for cyclic iteration, so there is no upper limit to the maximum number of iterations. In actual inference, good results are generally obtained in the first three denoising diffusion layers, so iteration can be terminated early. The threshold judgment strategy allows the model to determine the number of iterations itself, so iteration can be terminated early in relatively easy-to-track scenarios, thereby improving inference speed. In other cases, T can also be set to the total number of denoising diffusion layers, which can yield a relatively better prediction result. The choice depends on whether the focus is on speed or performance, so flexibility is also a major advantage of this model.
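The threshold termination strategy can be sketched as a short loop. This is a minimal sketch under stated assumptions: each denoising diffusion layer is modeled as a hypothetical callable returning refined point sets and confidence scores, `make_stub_layer` returns fixed scores purely for demonstration, and `max_passes=2` reflects the "two complete iterations" mentioned above.

```python
def run_decoder(layers, point_sets, threshold=0.7, max_passes=2):
    """Run denoising layers in order and stop once a score exceeds `threshold`.

    Each layer maps point sets to (refined_point_sets, confidence_scores).
    Returns the best candidate, its score, and the number of layers evaluated.
    """
    steps = 0
    best, best_score = None, float("-inf")
    for _ in range(max_passes):              # outputs can be fed back to the first layer
        for layer in layers:
            point_sets, scores = layer(point_sets)
            steps += 1
            top = max(range(len(scores)), key=scores.__getitem__)
            if scores[top] > best_score:
                best, best_score = point_sets[top], scores[top]
            if best_score > threshold:       # mid-process termination
                return best, best_score, steps
    return best, best_score, steps

def make_stub_layer(scores):
    """Hypothetical layer: leaves the point sets unchanged, returns fixed scores."""
    def layer(point_sets):
        return point_sets, list(scores)
    return layer

layers = [make_stub_layer(s) for s in ([0.3, 0.4], [0.5, 0.65], [0.6, 0.82], [0.7, 0.9])]
best, score, steps = run_decoder(layers, ["set_a", "set_b"])
```

In this toy run the third layer is the first to exceed 0.7, so the fourth layer is never evaluated.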
[0034] In one embodiment of the visual target tracking system of the present invention, the ViT-Base encoder includes 12 transformer layers, each containing an attention layer and a perceptron layer. The first four layers independently encode the frame features of the template image and the search image, while the last eight layers jointly encode both. In this embodiment, the ViT-Base encoder is a feature extraction network pre-trained on the ImageNet dataset, performing joint feature extraction on the search image and the template image and learning the feature association between the two frames. In this embodiment, the backbone network of ViT-Base is optimized, mainly by independently encoding the features of the two frames in the first four layers and jointly encoding the features of the two frames in the last eight layers. A trade-off between tracking accuracy and speed is achieved through comparative experiments, which improves the reliability of the algorithm.
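The 4 + 8 layer split can be illustrated with a structural sketch. It assumes tokens are held in plain lists and uses a hypothetical `stub_layer` that only records the sequence length it sees; no attention or perceptron computation is modeled.

```python
def encode(template_tokens, search_tokens, layers, split=4):
    """Run `split` layers on each token list separately, then the rest jointly."""
    z, x = list(template_tokens), list(search_tokens)
    for layer in layers[:split]:
        z, x = layer(z), layer(x)            # independent encoding
    joint = z + x                            # concatenate along the token axis
    for layer in layers[split:]:
        joint = layer(joint)                 # joint encoding
    return joint[len(z):]                    # keep only the search-image tokens

def to_grid(tokens, width):
    """Readjust a flat token list into a two-dimensional, row-major structure."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]

seen_lengths = []

def stub_layer(tokens):
    """Hypothetical layer: records the sequence length, returns tokens unchanged."""
    seen_lengths.append(len(tokens))
    return tokens

layers = [stub_layer] * 12                   # 12 transformer layers, as in the text
search_out = encode(["z"] * 4, ["x"] * 16, layers)
grid = to_grid(search_out, width=4)
```

The recorded lengths show the first four layers seeing the template and search sequences separately and the last eight seeing their concatenation.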
[0035] In one embodiment of the visual target tracking system of the present invention, the frame size of the template image is an integer multiple of 128*128, and the frame size of the search image is an integer multiple of 128*128. Before being input into the transformer layers, the images undergo 4x downsampling, and the number of channels is increased from 3 to 768. In this embodiment, the frame size of each image can be 128*128, 256*256, 384*384, etc., and the specific value can be determined by a trade-off between tracking speed and accuracy. Larger frame sizes result in slower tracking speed but higher tracking accuracy, while smaller frame sizes result in faster tracking speed but lower tracking accuracy. After multiple experiments, the preferred frame size for the template image is 128*128, and the preferred frame size for the search image is 256*256. The 4x downsampling significantly compresses the data volume and improves tracking speed. Increasing the number of channels improves image detail and tracking accuracy.
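As a quick check of the sizes involved, the following sketch computes the feature-map shape for a given frame size. It assumes the 4x downsampling applies to each spatial dimension and that frames are square; `feature_shape` is a hypothetical helper name.

```python
def feature_shape(frame_size, downsample=4, channels=768):
    """Return (height, width, channels) of the features fed to the transformer."""
    if frame_size <= 0 or frame_size % 128 != 0:
        raise ValueError("frame size must be a positive multiple of 128")
    side = frame_size // downsample
    return side, side, channels

template_shape = feature_shape(128)          # preferred template frame size
search_shape = feature_shape(256)            # preferred search frame size
template_positions = template_shape[0] * template_shape[1]
search_positions = search_shape[0] * search_shape[1]
```

Under these assumptions the preferred 128*128 template gives a 32*32 grid and the 256*256 search image a 64*64 grid, i.e. one sixteenth of the original number of spatial positions in each case.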
[0036] In one embodiment of the visual target tracking system of the present invention, as shown in Figure 3, the denoising diffusion layer includes a global instance interaction layer, a dynamic convolutional layer, and a prediction fine-tuning layer. The global instance interaction layer extracts instance features from the N2 point sets through RoI pooling. The dynamic convolutional layer performs global attention pooling and dynamic convolution processing on each instance feature. The prediction fine-tuning layer predicts the boundary of the target candidate box based on the dynamic convolution processing and sends the prediction result to the next denoising diffusion layer for fine-tuning. In this embodiment, each denoising diffusion layer obtains a total of N2 point sets from the previous diffusion step t-1, extracts instance features through RoI pooling, and then models global relationships through a simplified attention layer. The simplified attention layer removes the multilayer perceptron of the standard attention layer, which reduces the number of parameters, and it can model relationships between the multiple instance features obtained from the point set representation. In the prediction fine-tuning layer, the sampling step is embedded into a feature vector through a convolutional layer, and then classification prediction and bounding box fine-tuning are performed on the instance features of each point set representation. Specifically, in the dynamic convolutional layer, each instance feature is first subjected to global attention pooling, and the result is then used as a dynamic convolutional kernel acting on the corresponding instance features, thereby predicting the corresponding result. After the classification result and bounding box fine-tuning prediction of each instance feature are obtained, the result is sent to the next denoising diffusion layer for further fine-tuning.
The denoising diffusion model in this embodiment differs significantly from traditional detection-based models in the following ways:
1. Object detection requires predicting the target's category label, while the decoder of the denoising diffusion layer only needs to perform a binary classification task, i.e., predicting whether a candidate is the tracked target;
2. The detector's decoder uses a fixed number of layers, while the decoder of the denoising diffusion layer can terminate prediction in an earlier decoding layer to speed up inference;
3. The decoder of the denoising diffusion layer gradually locates the target by progressively optimizing the point set, while the detection model predicts the bounding box and category label of the object in the image.
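The pooling-then-dynamic-kernel step can be illustrated in miniature. This is a simplified sketch, not the layer used in the embodiment: it assumes an instance feature is a list of spatial positions with C channel values each, that the attention weights are a softmax over each position's mean activation, and that the pooled vector is used directly as a 1x1 kernel. A trained layer would typically involve learned projections, which are omitted here.

```python
import math

def attention_pool(instance_feature):
    """Pool over spatial positions with softmax weights (assumed scoring rule)."""
    logits = [sum(pos) / len(pos) for pos in instance_feature]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(instance_feature[0])
    return [sum(w * pos[c] for w, pos in zip(weights, instance_feature))
            for c in range(channels)]

def dynamic_conv(instance_feature):
    """Use the pooled vector as a 1x1 kernel applied to every spatial position."""
    kernel = attention_pool(instance_feature)
    return [sum(k * v for k, v in zip(kernel, pos)) for pos in instance_feature]

# Four spatial positions with two channels each (toy values).
uniform_feature = [[1.0, 2.0]] * 4
responses = dynamic_conv(uniform_feature)
```

The point of the sketch is that the kernel is computed from the instance feature itself rather than being a fixed learned weight, so each candidate is filtered by its own pooled summary.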
[0037] In one embodiment of the visual target tracking system of the present invention, the denoising diffusion layer is a generative tracking model based on Markov probability, suitable for generating and determining the target from random noise, and obtaining the accurate position and size of the target through multi-step diffusion iteration of the random noise. In this embodiment, based on the training and inference form of Markov chains, the point set can be gradually converged onto the target by adjusting the movement direction and step size of the point set multiple times, thereby obtaining the final target tracking result.
[0038] In one embodiment of the visual target tracking system of the present invention, randomly distributed noise points and labels are mixed together to construct the input for the training phase of the generative tracking model. In this embodiment, the noise points and labels are mixed together, and their ratio is adjusted to represent iterations in the denoising process, which can simulate multiple target localization attempts during the tracking process. This allows the tracking model to be trained to localize from coarse to fine.
[0039] In one embodiment of the visual target tracking system of the present invention, the noise point is Gaussian noise with a signal-to-noise ratio (SNR) of 1.0-2.0. In this embodiment, repeated experiments have shown that, for diffusion-based tracking, Gaussian noise with a relatively large SNR yields the best results. An SNR that is too low, such as below 1.0, prevents the model from learning to predict the true label, because the noise interferes too strongly with the model. Conversely, an SNR that is too high, such as above 2.0, leads to overfitting, so that the model cannot be trained to eliminate strong interference and localize accurately.
[0040] In one embodiment of the visual target tracking system of the present invention, the noise level of each iteration step is controlled by a predefined monotonically decreasing cosine function. In this embodiment, compared with a linearly decreasing function, the cosine function decreases more slowly at the initial and final steps, which effectively preserves the noise and helps train the model more effectively.
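One way to picture the mixing of labels and Gaussian noise is the standard forward step of a denoising diffusion model. The sketch below is an assumption-laden illustration: it treats the SNR as a signal scaling factor (`signal_scale`), assumes label coordinates normalized to [-1, 1], and uses a hypothetical `alpha_bar` in [0, 1] to set the label-to-noise ratio for a given step. The text does not give the exact formula.

```python
import math
import random

def noisy_points(label_points, alpha_bar, signal_scale=2.0, rng=None):
    """Mix ground-truth points with Gaussian noise for one training step.

    alpha_bar = 1.0 keeps only the (scaled) labels; alpha_bar = 0.0 is pure noise.
    """
    rng = rng or random.Random()
    keep = math.sqrt(alpha_bar) * signal_scale      # weight of the label signal
    spread = math.sqrt(1.0 - alpha_bar)             # weight of the Gaussian noise
    return [(keep * x + spread * rng.gauss(0.0, 1.0),
             keep * y + spread * rng.gauss(0.0, 1.0))
            for x, y in label_points]

labels = [(0.5, -0.25), (0.1, 0.3)]
late_step = noisy_points(labels, alpha_bar=0.1, rng=random.Random(0))   # mostly noise
early_step = noisy_points(labels, alpha_bar=0.9, rng=random.Random(0))  # mostly label
```

Sweeping `alpha_bar` from near 0 to near 1 produces inputs ranging from almost pure noise to almost clean labels, which matches the coarse-to-fine localization attempts described above.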
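The text does not state the exact cosine function. As a sketch, the widely used cosine schedule for diffusion models is assumed here, with a hypothetical small offset `s`; it returns the fraction of label signal kept at step t, which decreases monotonically from 1 towards 0.

```python
import math

def cosine_alpha_bar(t, total_steps, s=0.008):
    """Fraction of signal kept at step t under an assumed cosine schedule."""
    def f(u):
        return math.cos((u / total_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 10
schedule = [cosine_alpha_bar(t, T) for t in range(T + 1)]
drops = [a - b for a, b in zip(schedule, schedule[1:])]   # per-step decrease
```

With this schedule the per-step decrease is smallest at the first and last steps and largest in the middle, consistent with the comparison to a linear schedule made above.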
[0041] In one embodiment of the visual target tracking system of the present invention, the training loss of the generative tracking model adopts ensemble prediction loss on N3 sets of predictions, and the top 5 are selected for supervision based on the IoU score. In this embodiment, ensemble prediction can supervise multiple sets of prediction points, making the supervision signal stronger. Compared with supervising only the single best result, this is more conducive to improving the model's ability. The hyperparameter is set so that the five sets of predictions with the highest scores are selected, and these five sets are then trained against the final ground-truth labels. This uses the supervision signal effectively and avoids the great learning difficulty that supervising all sets of prediction points would bring to the model.
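The top-5 selection by IoU can be sketched as follows. The box format `(x1, y1, x2, y2)` and the helper names `iou` and `top_k_by_iou` are assumptions for illustration; the loss itself is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def top_k_by_iou(pred_boxes, gt_box, k=5):
    """Indices of the k predictions that overlap the ground truth most."""
    order = sorted(range(len(pred_boxes)),
                   key=lambda i: iou(pred_boxes[i], gt_box), reverse=True)
    return order[:k]

gt_box = (0.0, 0.0, 10.0, 10.0)
# Toy predictions: the ground-truth box shifted right by different amounts.
predictions = [(float(s), 0.0, 10.0 + s, 10.0) for s in (6, 1, 4, 0, 7, 2, 5, 3)]
supervised = top_k_by_iou(predictions, gt_box)
```

Only the predictions at the returned indices would receive the supervision signal; the remaining ones are left out of the loss for that step.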
[0042] In one embodiment of the visual target tracking system of the present invention, multi-layer supervision is employed to train the generative tracking model. In this embodiment, multi-layer supervision trains the decoder at each layer to achieve the ability to track and locate targets, thereby training the model's ability to perform multiple localizations and to judge targets from coarse to fine.
[0043] In one embodiment of the visual target tracking system of the present invention, an integrated prediction mechanism is employed during the inference process of the generative tracking model. On the one hand, target candidate boxes with confidence scores below a first threshold are replaced with random point sets, while target candidate boxes with confidence scores above a second threshold are retained. On the other hand, for multiple target candidate boxes with confidence scores above the second threshold, a voting strategy is used to filter out interfering objects. In this embodiment, the multi-layer decoder prediction can make multiple judgments on interfering objects and targets, thus improving the target tracking accuracy and robustness. Simultaneously, the prediction results of the previous layer decoder can be reused, which helps the prediction of the next layer decoder. For example, to locate a specific person in a scene containing two people and a car, the first layer decoder can easily filter out the car, but has high confidence scores on both people. The second layer decoder inputs the high-confidence point sets of the two people, plus randomly initialized surrounding semantic point sets, and then makes a judgment, thus quickly locating the person being tracked. This process demonstrates that the tracker has self-correction capabilities, with the ability to move from coarse localization to fine-grained judgment.
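The replace-and-retain part of this mechanism can be sketched as below. The threshold values 0.3 and 0.7, the helper names, and the decision to leave candidates between the two thresholds unchanged are all assumptions; the voting strategy is not specified in the text and is not implemented here.

```python
import random

def update_candidates(point_sets, scores, low, high, new_random_set):
    """Replace low-score candidates with random point sets and keep the rest.

    Returns the updated candidates and the indices scoring above `high`,
    i.e. the ones that would go on to the voting step.
    """
    updated, confident = [], []
    for idx, (points, score) in enumerate(zip(point_sets, scores)):
        if score < low:
            updated.append(new_random_set())     # re-initialize from noise
        else:
            updated.append(points)               # reuse the previous layer's output
            if score > high:
                confident.append(idx)
    return updated, confident

rng = random.Random(0)

def random_set(n_points=9, size=64.0):
    """Hypothetical sampler: n_points uniform points on a size x size map."""
    return [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_points)]

candidates = [[(1.0, 1.0)], [(20.0, 20.0)], [(40.0, 24.0)]]
scores = [0.1, 0.5, 0.9]
updated, confident = update_candidates(candidates, scores, 0.3, 0.7, random_set)
```

In the two-people-and-a-car example above, the car's candidate would fall below the first threshold and be re-sampled, while both people's candidates would be passed on for the next layer to separate.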
[0044] In one embodiment of the visual target tracking system of the present invention, the preset threshold is 0.7. In this embodiment, the preset threshold was obtained through metrics verification on a large number of test datasets. A preset threshold higher than 0.7 will result in too few candidate boxes, thus losing the correct target; a threshold lower than 0.7 will result in too many distracting boxes, reducing the model's inference speed.
[0045] An embodiment of the visual target tracking method based on point set diffusion of the present invention, employing the visual target tracking system based on point set diffusion described in any of the above embodiments of the visual target tracking system based on point set diffusion, includes the following steps:
[0046] S100: Extract features from the template image and the search image respectively, and then readjust the features of the two images into a two-dimensional structure of search image features.
[0047] S110: Initialize a set of N2 points randomly distributed in the search image features of the two-dimensional structure;
[0048] S120: Denoise the N2 point set in t adjacent denoising diffusion layers in sequence. The output of the (t-1)th denoising diffusion layer is used as the input of the tth denoising diffusion layer. Each denoising process yields N2 target candidate boxes that correspond one-to-one with the N2 point set. 1≤t≤T, where T is the total number of iterations and t is the current iteration number.
[0049] S130: Compare the confidence score of the target candidate box with a preset threshold. If the confidence score of any target candidate box is greater than the preset threshold, then the target candidate box is determined as the target.
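Steps S110 to S130 can be sketched as a short control loop. The sketch below is only an illustration of the data flow: feature extraction (S100) is stubbed with a hypothetical `features` dictionary, and `toy_layer` is a stand-in that simply pulls points towards a known location, not the denoising diffusion layer of the invention. All function and parameter names are assumptions.

```python
import math
import random


def init_point_sets(n2, k, rng):
    """S110: n2 point sets of k random points each, in the unit square."""
    return [[(rng.random(), rng.random()) for _ in range(k)] for _ in range(n2)]


def toy_layer(features, point_sets):
    """Stand-in for one denoising diffusion layer (NOT the patented layer).

    Moves every point halfway towards features["target"] and scores each
    set by how close its centroid lies to that target.
    """
    tx, ty = features["target"]
    moved, boxes, scores = [], [], []
    for pts in point_sets:
        new = [((x + tx) / 2, (y + ty) / 2) for x, y in pts]
        xs, ys = [p[0] for p in new], [p[1] for p in new]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        moved.append(new)
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
        scores.append(max(0.0, 1.0 - math.hypot(cx - tx, cy - ty)))
    return moved, boxes, scores


def track(features, layers, n2, k, threshold, rng):
    point_sets = init_point_sets(n2, k, rng)  # S110
    boxes, scores = [], []
    for layer in layers:  # S120
        # The output of layer t-1 becomes the input of layer t.
        point_sets, boxes, scores = layer(features, point_sets)
    # S130: keep candidates whose confidence exceeds the preset threshold.
    return [(b, s) for b, s in zip(boxes, scores) if s > threshold]


features = {"target": (0.6, 0.4)}  # S100 is stubbed: hypothetical features
targets = track(features, [toy_layer] * 4, n2=8, k=4, threshold=0.7,
                rng=random.Random(0))
```

Each pass yields one candidate box per point set, so the one-to-one correspondence between the N2 point sets and the N2 candidate boxes is preserved across iterations.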
[0050] In this embodiment, as shown in Figures 1-3, a template image z is obtained by cropping the image of Bolt from the first frame, and the next frame is the search image x. Both are input into a visual target tracking unit based on tracking model f. Tracking model f includes a feature encoder and a decoder: the encoder is a ViT-Base encoder, and the decoder is a denoising diffusion-based decoder. Tracking model f formulates the entire target tracking and localization process as a denoising diffusion process, with the iteration index ranging from 0 to T. y denotes the localization result at each step, i.e., the predicted point set. In Figure 2, t denotes the current iteration of the denoising diffusion process, Δt denotes the step size, and N denotes the number of denoising diffusion layers in the decoder. In Figure 3, (x, y) are the coordinates of each point. In Figure 1, each yellow dot in the image within the dashed box on the right can be transformed into a bounding box, similar to locating the final target. The output of each decoding layer includes a confidence score for each point set, i.e., a classification score for whether it is the target, and the step size by which each point moves along the x and y directions.
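As an illustration of the decoder-layer output described above, the following sketch moves each point by its predicted step in x and y and then converts the point set into a bounding box. Taking the axis-aligned minimum and maximum of the points is an assumption: the source states that points can be transformed into a box but does not give the conversion rule. The function names and sample values are hypothetical.

```python
def move_points(points, steps):
    """Shift every (x, y) point by its predicted (dx, dy) step."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, steps)]


def point_set_to_box(points):
    """Assumed conversion: the tightest axis-aligned box around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


points = [(1.0, 2.0), (3.0, 1.0), (2.0, 4.0)]
steps = [(0.5, 0.0), (-0.5, 0.5), (0.0, -1.0)]  # hypothetical layer output
refined = move_points(points, steps)
box = point_set_to_box(refined)
print(refined)  # [(1.5, 2.0), (2.5, 1.5), (2.0, 3.0)]
print(box)      # (1.5, 1.5, 2.5, 3.0)
```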
[0051] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the essence of the present invention. The above preferred features can be used in any combination without conflict.
Claims
1. A visual target tracking system based on point set diffusion, characterized in that it includes a visual target tracking unit, which comprises a ViT-Base encoder and a denoising diffusion-based decoder. The ViT-Base encoder has N1 encoding layers. Each encoding layer is used to extract features from the template image and the search image respectively. The features of the two images are interacted and readjusted into two-dimensional search image features, which are then output to the decoder. The decoder initializes N2 point sets randomly distributed in the search image features of the two-dimensional structure. The decoder has N denoising diffusion layers, where t adjacent denoising diffusion layers sequentially denoise the N2 point sets. The output of the (t-1)th denoising diffusion layer is used as the input of the tth denoising diffusion layer. Each denoising process yields N2 target candidate boxes corresponding one-to-one with the N2 point sets, 1≤t≤T, where T is the total number of iterations and t is the current iteration number. If the confidence score of any target candidate box is greater than a preset threshold, then the target candidate box is identified as the target. The denoising diffusion layer includes a global instance interaction layer, a dynamic convolutional layer, and a prediction fine-tuning layer. The global instance interaction layer extracts instance features from the N2 point sets through RoI pooling. The dynamic convolutional layer performs global attention pooling and dynamic convolution processing on each instance feature. The prediction fine-tuning layer predicts the boundary of the target candidate box based on the dynamic convolution processing and sends the prediction result to the next denoising diffusion layer for fine-tuning.
The denoising diffusion layer is a generative tracking model based on Markov probability, which is suitable for generating and determining targets from random noise, and obtaining the accurate position and size of the target through multi-step diffusion iteration of random noise. In the inference process of the generative tracking model, an integrated prediction mechanism is adopted. On the one hand, a random point set is used to replace the target candidate boxes with confidence scores below the first limit, while retaining the target candidate boxes with confidence scores above the second limit. On the other hand, for multiple target candidate boxes with confidence scores above the second limit, a voting strategy is used to filter out interference objects.
2. The visual target tracking system based on point set diffusion according to claim 1, characterized in that, The ViT-Base encoder includes 12 transformer layers, where the first four layers encode the frame features of the template image and the frame features of the search image independently, and the last eight layers encode the two together.
3. The visual target tracking system based on point set diffusion according to claim 2, characterized in that, The frame size of the template image is 128×128 and the frame size of the search image is 128×128; both are downsampled by a factor of 4 before being input to the transformer layer, increasing the number of channels from 3 dimensions to 768 dimensions.
4. The visual target tracking system based on point set diffusion according to claim 1, characterized in that, The random distribution of noise points and labels is mixed together to construct the input for the training phase of the generative tracking model.
5. The visual target tracking system based on point set diffusion according to claim 4, characterized in that, The noise point is Gaussian noise with a signal-to-noise ratio of 1.0-2.0.
6. The visual target tracking system based on point set diffusion according to claim 1, characterized in that, The noise level of each iteration step is controlled by a predefined monotonically decreasing cosine function.
7. The visual target tracking system based on point set diffusion according to claim 1, characterized in that, The training loss of the generative tracking model is the ensemble prediction loss on a set of N3 predictions, and the top 5 scores are selected for supervision based on the IoU score.
8. The visual target tracking system based on point set diffusion according to claim 7, characterized in that, The generative tracking model is trained using multi-layer supervision.
9. The visual target tracking system based on point set diffusion according to claim 1, characterized in that, The preset threshold is 0.7.
10. A visual target tracking method based on point set diffusion, characterized in that it employs the visual target tracking system based on point set diffusion as described in any one of claims 1-9 and includes the following steps: S100: Extract features from the template image and the search image respectively, and then readjust the features of the two images into a two-dimensional structure of search image features. S110: Initialize a set of N2 points randomly distributed in the search image features of the two-dimensional structure. S120: Denoise the N2 point set in t adjacent denoising diffusion layers in sequence. The output of the (t-1)th denoising diffusion layer is used as the input of the tth denoising diffusion layer. Each denoising process yields N2 target candidate boxes that correspond one-to-one with the N2 point set, 1≤t≤T, where T is the total number of iterations and t is the current iteration number. S130: Compare the confidence score of the target candidate box with a preset threshold. If the confidence score of any target candidate box is greater than the preset threshold, then the target candidate box is determined as the target.
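Claims 4 and 6 describe building the training input by mixing noise points with labels under a cosine-controlled noise level. A minimal sketch of one way this could look is given below. It assumes the cosine schedule commonly used in diffusion models, with a hypothetical offset s = 0.008, and the usual forward-noising formula; the source does not state the exact function, so both are assumptions, and the signal-to-noise range of claim 5 is not modelled here.

```python
import math
import random


def alpha_bar(t, total_steps, s=0.008):
    """Assumed cosine schedule: fraction of signal kept at step t.

    Equals 1 at t = 0 and decreases monotonically towards 0 at
    t = total_steps.
    """
    def f(u):
        return math.cos((u / total_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noise_points(label_points, t, total_steps, rng):
    """Mix label points with Gaussian noise according to the schedule."""
    a = alpha_bar(t, total_steps)
    keep = math.sqrt(a)
    noise = math.sqrt(max(0.0, 1.0 - a))
    return [(keep * x + noise * rng.gauss(0.0, 1.0),
             keep * y + noise * rng.gauss(0.0, 1.0))
            for x, y in label_points]


rng = random.Random(0)
labels = [(0.4, 0.5), (0.6, 0.5)]  # hypothetical ground-truth points
schedule = [alpha_bar(t, 10) for t in range(11)]
clean = noise_points(labels, 0, 10, rng)   # t = 0: labels unchanged
noisy = noise_points(labels, 10, 10, rng)  # t = T: almost pure noise
```

At t = 0 the labels pass through untouched, and at t = T the output is essentially Gaussian noise, which corresponds to the randomly initialized point sets used at inference time.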