A multi-target tracking method based on a diffusion model of a learnable motion condition representation

By compressing the motion information of the target trajectory into a learnable motion condition representation and constraining and guiding it in a diffusion-based iterative update network, the problems of target mismatch and identity switching in multi-target tracking are solved, achieving efficient tracking results in complex scenarios.

CN122066735B · Active · Publication Date: 2026-06-23 · CHANGCHUN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHANGCHUN UNIV OF SCI & TECH
Filing Date: 2026-04-15
Publication Date: 2026-06-23

Abstract

The application discloses a diffusion-model multi-target tracking method based on a learnable motion condition representation, and belongs to the technical field of computer vision and intelligent video analysis. The method first acquires adjacent frame images of a video sequence and performs feature extraction. Motion priors are then constructed from the historical target box of each trajectory in the previous frame and its predicted target box in the current frame, and the motion priors are compressed into motion condition representations. The motion condition representations are aligned with the candidate target boxes of the current frame through intersection-over-union (IoU) matching to obtain the corresponding motion condition features. In a diffusion iterative update network, the motion conditions participate in the diffusion denoising prediction and reverse diffusion update processes, and association scores between candidate target boxes of adjacent frames are output. Finally, data association and trajectory state updating are completed according to the association scores, and the multi-target tracking result is output. The method improves the accuracy of target position updating and the stability of trajectory association, and enhances multi-target tracking capability in complex motion scenes.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and intelligent video analysis technology, specifically to a multi-target tracking method, and more particularly to a diffusion model multi-target tracking method that compresses the motion information of the target trajectory to form a learnable motion condition representation, and introduces the motion condition representation as condition information into a diffusion iterative update network to achieve iterative update of candidate target boxes and data association.

Background Technology

[0002] Multi-object tracking (MOT) aims to continuously estimate the positions of multiple targets in a video sequence across consecutive frames and assign a consistent identity to the same target across different frames. A typical MOT workflow includes generating candidate bounding boxes or detection results in each frame, combining motion information, appearance information, or a combination of both for cross-frame data association, and using trajectory management strategies to initialize, update, remove, and delete trajectories, ultimately outputting a tracking result containing the target's location and identity. However, real-world scenarios often involve target occlusion, dense crowds, similar appearances, scale changes, rapid motion, and camera movement, which can easily lead to mismatches and identity switching between candidate bounding boxes across frames, thus affecting the stability and accuracy of MOT results.

[0003] In existing multi-object tracking methods, motion information is typically used to predict the possible position of a trajectory in the current frame using motion models such as Kalman filtering. Gating constraints, such as distance to candidate bounding boxes and Intersection over Union (IoU), are then used to narrow down the search range. Appearance information is usually generated through feature extraction networks to produce appearance descriptors of the target, used to distinguish similar-looking targets and enhance re-identification capabilities under occlusion. While these methods improve the robustness of multi-object tracking to some extent, they still have the following shortcomings: First, motion information in existing methods is usually only used as a gating condition or a simple distance term in the association process, making it difficult to fully characterize the local motion patterns and cross-frame variation of the trajectory. In complex scenarios such as similar appearances, intersecting motions, and dense occlusion, target mismatch and identity switching can still easily occur. Second, to improve matching stability, some methods introduce more complex association networks or multi-stage matching modules. While this can improve tracking performance in certain scenarios, it also increases model structure complexity, training difficulty, and inference computation overhead, hindering efficient implementation and application deployment.

[0004] In recent years, diffusion models, as an iterative generation and estimation framework, have been able to gradually approximate more reasonable prediction results in the candidate state space through multi-step denoising iterations under a pre-defined noise schedule. Introducing a diffusion-based iterative update mechanism into multi-target tracking tasks can progressively refine candidate bounding boxes and output matching metrics for cross-frame association, thereby improving the quality of candidate bounding boxes and the stability of data association to some extent. However, the diffusion-based iterative update process is highly dependent on conditional information. Without effective motion constraints, in dense scenes, fast-moving scenes, or complex interaction scenarios, the diffusion denoising process is prone to producing unstable candidate update results, leading to a decrease in the separability between candidate bounding boxes and affecting the accuracy of subsequent data association. Conversely, directly introducing complex association structures to provide conditional constraints to the diffusion network can easily introduce additional computational burdens and structural coupling, weakening the advantages of the diffusion framework in terms of generality, lightweight design, and scalability.

[0005] Therefore, a technical problem that urgently needs to be solved in this field is how to encode the motion information of the target trajectory into a compact, learnable motion condition representation without introducing a heavy association module, and how to align it to the representation space of the diffusion iterative update network through a learnable mapping and feature modulation mechanism. The motion condition representation should form effective constraints and guidance in the diffusion denoising prediction and reverse diffusion update processes, thereby improving the accuracy of candidate target box updates, enhancing the stability of data association, and reducing mismatches and identity switching, while keeping implementation complexity and computational overhead in check.

Summary of the Invention

[0006] This invention proposes a multi-target tracking method based on a learnable motion condition representation using a diffusion model. This method compresses the motion information of target trajectories into a compact motion condition representation and aligns it to the representation space of the diffusion head through learnable mapping and feature modulation mechanisms. This constrains or guides the denoising prediction and backsampling path of candidate bounding boxes during the diffusion-based iterative update process, improving the separability and matching stability of paired candidates and reducing mismatches and identity switching in occlusion, cross-movement, and dense scenes. The method first acquires adjacent frame images from a video sequence and extracts multi-scale features, generating a set of candidate bounding boxes for target localization in the current frame. Then, it performs motion prediction and trajectory reliability screening on the established target trajectory set, constructing a motion prior using the historical bounding boxes of the trajectory in the previous frame and the current predicted bounding box, and compressing this motion prior to obtain the motion condition representation. Furthermore, alignment is performed based on the overlap metric between historical bounding boxes and candidate bounding boxes, writing the trajectory-level motion condition representation into the candidate-level motion condition features corresponding one-to-one with the candidate bounding boxes.

[0007] Based on this, the candidate target box state, adjacent frame image features, and candidate-level motion condition features are jointly input into the diffusion iterative update network. The motion condition features are first subjected to learnable projection through a motion mapping network to obtain motion embeddings, which are then modulated and fused with the feature representations of the candidate target boxes. This allows the motion embeddings to participate as conditional information in the denoising prediction process, iteratively outputting noise-free candidate box position predictions and noise estimates over multiple sampling time steps, achieving gradual correction of candidate box positions. Simultaneously, after at least one iteration of diffusion backsampling and / or at the last iteration, motion-guided updates can be further applied to the current frame's candidate target boxes based on the motion condition features to correct the sampling direction and enhance trajectory motion consistency. While completing the candidate box iterative update, the diffusion iterative update network outputs a correlation metric (matching score) for candidate pairs. The upper-level tracking management module performs data association and trajectory state updates based on this correlation metric, ultimately obtaining the multi-target tracking result.

[0008] The multi-target tracking method provided by this invention can not only make full use of trajectory motion information without introducing a heavy association network, but also integrate motion conditions into the diffusion-based iterative update process in a learnable manner, thereby enhancing the robustness and stability of candidate matching. It is also applicable to various video surveillance and intelligent analysis tasks such as crowded scenes, long-term occlusion, similar appearance targets, and complex camera movements, and has good versatility and engineering application value.

[0009] The present invention provides a multi-target tracking method based on a learnable motion condition representation diffusion model, comprising the following steps:

[0010] S1. Obtain adjacent frame images of the video sequence, including the current frame image and the previous frame image;

[0011] S2. Perform feature extraction on the adjacent frame images obtained in S1, and generate a set of candidate target boxes for the current frame from the extracted features, where j is the candidate target index running over the number of candidate boxes generated in the current frame, and each candidate target box includes at least position parameters;

[0012] S3. For the established set of target trajectories, where i is the trajectory index running over the number of target trajectories maintained at the current moment and the i-th element is the i-th historical target trajectory, construct motion priors from the historical target box of each trajectory in the previous frame and its motion-predicted target box in the current frame, and compress them into motion condition representations. The motion condition representations express the center displacement information or corner displacement information of the target trajectory;

[0013] S4. Align the motion condition representations obtained in step S3 with the candidate target box set obtained in step S2 to obtain motion condition features that correspond one-to-one with the candidate target boxes;

[0014] S5. Input the candidate target box set obtained in step S2, the features of the adjacent frames, and the motion condition features obtained in step S4 into the diffusion iterative update network. The motion condition features are aligned to the representation space of the diffusion head through a lightweight learnable mapping and modulate the features of the candidate target boxes, so that the motion conditions constrain or guide the iterative update process. The network outputs the updated position parameters of the candidate target boxes and the association scores between candidate target boxes;

[0015] S6. Based on the association scores and the updated candidate target boxes, perform data association and trajectory state update, and output the multi-target tracking result;

[0016] S7. During the model training phase, a dataset containing the real labeled target boxes of adjacent frames is constructed to train the diffusion-based iterative update network. A velocity consistency constraint can be introduced to improve the stability of target motion prediction in adjacent frames.

[0017] Beneficial effects

[0018] Compared with the prior art, the present invention has at least the following beneficial effects:

[0019] (1) An innovative multi-target tracking technology framework based on a diffusion model of motion condition representation is proposed. Unlike existing methods that only use motion information for simple position prediction or association gating, this invention compresses trajectory motion information into learnable motion condition representations and introduces them as condition information into the diffusion candidate box iterative update process, so that motion information can directly participate in the optimization of candidate target boxes, thereby forming a new multi-target tracking method that deeply integrates motion information and diffusion update.

[0020] (2) A learnable mapping and feature modulation mechanism for aligning motion conditions to the diffusion feature space is proposed. By constructing a motion mapping network, motion conditions are projected onto the feature space of candidate target boxes, and feature modulation is used to participate in denoising prediction, so that motion information can effectively constrain the diffusion update direction and improve the accuracy and stability of candidate target box position update.

[0021] (3) A motion-guided update mechanism for the diffusion iteration process is proposed. By applying motion guidance during the diffusion backsampling process, the candidate target box can be corrected along the actual motion trend of the target during the iteration process, thereby reducing the randomness in the diffusion update process and improving the target positioning accuracy and trajectory motion consistency.

[0022] (4) An injection strategy is proposed to accurately align trajectory motion conditions with candidate target boxes. Through trajectory reliability screening and a Top-k alignment mechanism based on intersection over union (IoU), trajectory-level motion information is mapped to the candidate target box level, so that motion conditions act on the most relevant candidate regions, improving the reliability of candidate matching and reducing the risk of mismatch.

[0023] (5) Improve the overall performance of multi-target tracking without introducing complex association networks. This invention achieves collaborative modeling of candidate box position optimization and association scoring through motion condition modulation and diffusion update mechanism. While ensuring the model is lightweight, it significantly improves the tracking robustness in complex scenarios, reduces the probability of identity switching, and has good engineering application value.

Description of the Drawings

[0024] Figure 1 is a block diagram of the principle structure of the present invention;

[0025] Figure 2 is a flowchart of the method of the present invention;

[0026] Figure 3 is a schematic diagram of the construction of the motion condition representation of the present invention;

[0027] Figure 4 is a schematic diagram of the conditional diffusion iteration mechanism of the present invention;

[0028] Figure 5 is a schematic diagram of the motion guidance effect of the present invention;

[0029] Figure 6 is a comparison chart of candidate box iteration convergence between the method of the present invention and the baseline method on the DanceTrack dataset;

[0030] Figure 7 is a visualization of the tracking results of the method of the present invention on the DanceTrack dataset.

Detailed Implementation

[0031] To make the technical solution of the present invention clearer, the present invention will be further described below with reference to the accompanying drawings and embodiments.

[0032] The overall technical process of this invention is shown in Figure 1, which is a block diagram of the principle structure of the multi-target tracking method based on a learnable motion condition representation diffusion model according to the present invention. It comprises the following five main stages, with the modules together forming a complete closed-loop tracking process:

[0033] (I) Feature extraction of adjacent frames and generation of candidate targets

[0034] In this embodiment, feature extraction is first performed on adjacent frames of the input video sequence: the previous frame image and the current frame image are input into the feature extraction module to obtain the corresponding image feature representations. Specifically, the feature extraction module can use a convolutional neural network such as ResNet-101 as the backbone, encoding the previous frame image and the current frame image into multi-scale feature representations respectively. The features can be further fused using a Feature Pyramid Network (FPN) to enhance the representation of targets at different scales. These features are used together with the candidate target box set generated for the current frame. The candidate boxes can come from either an independent detector or an end-to-end detection and tracking framework; this invention does not limit their source.

[0035] (II) Construction of trajectory motion priors and generation of motion condition representations

[0036] The tracking results of the previous frame form a trajectory set. Each trajectory includes its historical target box, trajectory state, update frame number, motion state information, and so on. Based on the trajectory state of the previous frame, the current predicted box of each trajectory is obtained through Kalman prediction. To improve the reliability of motion information, this embodiment preferably filters the trajectories. For example, motion conditions are constructed only for trajectories that meet the following conditions: the number of frames for which the trajectory has existed is greater than a threshold, the trajectory has been successfully updated recently, and its motion amplitude is greater than a threshold. This prevents short-lived noisy trajectories from interfering with motion modeling.

[0037] Based on this, the displacement relationship between the historical target bounding box and the predicted target bounding box of the trajectory is calculated. Specifically, to distinguish different motion states, the relationship between center displacement and corner displacement can be constructed. In practice, the corner displacement prior method can more finely characterize the deformation trend of the target bounding box, thus exhibiting higher stability in scenarios with significant scale changes. Furthermore, the motion information of the trajectory is compressed into motion condition representations, thereby obtaining motion features that can characterize the direction and amplitude of the target's motion.

[0038] (III) Alignment of motion conditions with candidate target boxes

[0039] Since the number of trajectories and the number of candidate boxes are usually different, trajectory-level motion information must be mapped to the candidate box level. This invention therefore aligns the motion condition representations with the candidate box set of the current frame using Top-k matching based on intersection over union (IoU), so that each candidate box obtains its corresponding motion condition feature. This design keeps motion information from affecting irrelevant candidates, improves the accuracy of condition injection, reduces the propagation of erroneous updates, and is more stable than directly broadcasting motion features.

[0040] (IV) Motion mapping network and conditional diffusion update

[0041] To enable low-dimensional motion vectors to participate in high-dimensional feature computation, this invention designs a lightweight, learnable motion mapping network. The motion condition features are mapped through this network, specifically a Linear-ReLU-Linear neural network, to align them with the feature representation space of the diffusion head. The resulting motion embeddings are then fused with, and used to modulate, the feature representations of the candidate target boxes to form the conditional features for the diffusion update. Compared with concatenation, this modulation is more stable in computation, does not increase dimensionality, and is easy to train. The diffusion update module then follows: in the diffusion iterative update network, the diffusion target box representation is first constructed, and conditional denoising prediction and reverse diffusion updates are performed in sequence at each iteration.
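
The mapping and modulation described above can be illustrated with a minimal standard-library Python sketch. The Linear-ReLU-Linear structure comes from the text; everything else is an assumption made for illustration: the names `MotionMapper` and `modulate`, the layer sizes, the random weights standing in for trained parameters, and in particular the element-wise scaling form of the modulation, which the text does not specify.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


class MotionMapper:
    """Linear-ReLU-Linear projection of a motion condition feature (hypothetical name)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # random weights stand in for trained parameters
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, motion_feature: Vector) -> Vector:
        hidden = relu(linear(motion_feature, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


def modulate(box_feature: Vector, motion_embedding: Vector) -> Vector:
    """Assumed modulation: element-wise scaling by (1 + embedding); dimension is unchanged."""
    return [f * (1.0 + e) for f, e in zip(box_feature, motion_embedding)]


mapper = MotionMapper(in_dim=2, hidden_dim=8, out_dim=4)
box_feature = [1.0, 2.0, 3.0, 4.0]
conditioned = modulate(box_feature, mapper([0.3, -0.2]))
```

In this sketch the conditioned feature keeps the dimension of the box feature, which matches the stated advantage over concatenation. With the zero biases chosen here, a zero motion feature leaves the box feature unchanged; a trained network need not have this property.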

[0042] To further improve stability, this invention innovatively introduces motion-guided updates after diffusion iterations, enabling candidate bounding boxes to gradually converge towards the true target location during the diffusion sampling process. This reduces random drift, enhances motion consistency, and improves positioning accuracy, making it more stable and effective than methods relying solely on network prediction. Through multiple iterations until a termination condition is met, the updated candidate bounding boxes and their associated scores are finally obtained.

[0043] (V) Candidate association and trajectory update

[0044] Finally, based on the updated candidate target boxes and associated scores, data association and trajectory state updates are performed to obtain a new trajectory set, and the multi-target tracking results for the current frame are output. By introducing motion condition features during the diffusion update process, this invention can utilize the historical motion information of the target to constrain and guide the candidate target box update process, thereby improving target positioning accuracy and trajectory association stability.

[0045] Figure 2 shows the flowchart of the multi-target tracking method based on a learnable motion condition representation diffusion model according to the present invention.

[0046] As an example, the method includes the following specific steps:

[0047] S1. Obtain adjacent frame images of the video sequence, including the current frame image and the previous frame image;

[0048] S2. Perform feature extraction on the adjacent frame images obtained in S1, and generate a set of candidate target boxes for the current frame from the extracted features, where j is the candidate target index running over the number of candidate boxes generated in the current frame, and each candidate target box includes at least position parameters;

[0049] S3. For the established set of target trajectories, where i is the trajectory index running over the number of target trajectories maintained at the current moment and the i-th element is the i-th historical target trajectory, construct motion priors from the historical target box of each trajectory in the previous frame and its motion-predicted target box in the current frame, and compress them into motion condition representations. The motion condition representations express the center displacement information or corner displacement information of the target trajectory;

[0050] S4. Align the motion condition representations obtained in step S3 with the candidate target box set obtained in step S2 to obtain motion condition features that correspond one-to-one with the candidate target boxes;

[0051] S5. Input the candidate target box set obtained in step S2, the features of the adjacent frames, and the motion condition features obtained in step S4 into the diffusion iterative update network. The motion condition features are aligned to the representation space of the diffusion head through a lightweight learnable mapping and modulate the features of the candidate target boxes, so that the motion conditions constrain or guide the iterative update process. The network outputs the updated position parameters of the candidate target boxes and the association scores between candidate target boxes;

[0052] S6. Based on the association scores and the updated candidate target boxes, perform data association and trajectory state update, and output the multi-target tracking result;

[0053] S7. During the model training phase, a dataset containing the real labeled target boxes of adjacent frames is constructed to train the diffusion-based iterative update network. A velocity consistency constraint can be introduced to improve the stability of target motion prediction in adjacent frames.

[0054] Simulation experiment:

[0055] The experimental simulation environment for this invention is: GPU NVIDIA RTX4090, CPU Intel i9-13900K, Ubuntu 20.04, CUDA 12.0, and PyTorch 2.7.0. The publicly available dataset DanceTrack was selected for simulation, and the dancetrack0004 video sequence from the validation set was used for evaluation.

[0056] Furthermore, Figure 3 illustrates how the trajectory motion information is compressed into the motion condition representation in step S3, specifically as follows:

[0057] For the i-th trajectory, a Cartesian coordinate system is established in the image plane with the top-left corner of the image as the origin, the horizontal direction to the right as the positive x-axis, and the vertical direction downward as the positive y-axis. In this system, the historical target box is denoted $b_i^{h} = (x_{1,i}^{h}, y_{1,i}^{h}, x_{2,i}^{h}, y_{2,i}^{h})$ and the current predicted box is denoted $b_i^{p} = (x_{1,i}^{p}, y_{1,i}^{p}, x_{2,i}^{p}, y_{2,i}^{p})$, where $(x_{1,i}^{h}, y_{1,i}^{h})$ and $(x_{1,i}^{p}, y_{1,i}^{p})$ are the top-left corner coordinates of the historical target box and the current predicted box, respectively, $(x_{2,i}^{h}, y_{2,i}^{h})$ and $(x_{2,i}^{p}, y_{2,i}^{p})$ are the corresponding bottom-right corner coordinates, the superscript $h$ indicates the historical target box, and the superscript $p$ indicates the current predicted box. The center of the historical target box is $(c_{x,i}^{h}, c_{y,i}^{h}) = \left(\tfrac{x_{1,i}^{h} + x_{2,i}^{h}}{2}, \tfrac{y_{1,i}^{h} + y_{2,i}^{h}}{2}\right)$; its width $w_i$ and height $h_i$ are computed from its corner coordinates together with a small constant $\varepsilon$ that prevents division by zero; the center $(c_{x,i}^{p}, c_{y,i}^{p})$ of the current predicted box is computed in the same way as that of the historical target box. The normalized displacement of the center of the current predicted box relative to the historical target box is defined as $\Delta x_i = (c_{x,i}^{p} - c_{x,i}^{h}) / w_i$ and $\Delta y_i = (c_{y,i}^{p} - c_{y,i}^{h}) / h_i$, and the motion condition representation of the i-th trajectory under center displacement normalization is defined as $m_i = [\Delta x_i, \Delta y_i]^{T}$, where $m_i$ is the trajectory-level motion condition representation vector and the superscript $T$ denotes the vector transpose.
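
The center displacement normalization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the value of the small constant, and the choice to add that constant to the box extent (rather than, for example, clamping the extent) are all assumptions, because the original formulas did not survive in this text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left and bottom-right corners

EPS = 1e-6  # assumed value of the small constant that prevents division by zero


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center (cx, cy) of a corner-format box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def center_motion_condition(hist_box: Box, pred_box: Box, eps: float = EPS) -> Tuple[float, float]:
    """Normalized center displacement of the predicted box relative to the historical box."""
    x1, y1, x2, y2 = hist_box
    width = (x2 - x1) + eps    # assumption: eps is added to the historical box width
    height = (y2 - y1) + eps   # assumption: eps is added to the historical box height
    hist_cx, hist_cy = box_center(hist_box)
    pred_cx, pred_cy = box_center(pred_box)
    return (pred_cx - hist_cx) / width, (pred_cy - hist_cy) / height


# A 10x20 box whose center moves by (5, 10) pixels gives roughly (0.5, 0.5).
dx, dy = center_motion_condition((0.0, 0.0, 10.0, 20.0), (5.0, 10.0, 15.0, 30.0))
```

Normalizing by the historical box size makes the representation comparable across targets of different scales, which is presumably why the text normalizes rather than using raw pixel displacements.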

[0058] Furthermore, the motion condition representation described in step S3 can also be constructed using a corner displacement normalization representation method, and the construction process is as follows:

[0059] On the basis of the historical target box, the current predicted box, and the width and height of the historical target box defined above, the normalized corner displacement components between the current predicted box and the historical target box are defined as follows:

[0060]

[0061]

[0062]

[0063]

[0064] Here, the components are the normalized horizontal and vertical displacements at the four corner points between the historical target box and the current predicted box, namely the top-left, top-right, bottom-left, and bottom-right corners, and they characterize the short-term motion trend of the target. The motion condition representation of the i-th trajectory under corner displacement normalization is the trajectory-level corner motion condition representation vector formed from these components. This representation captures the displacement of every corner of the target box, and can therefore express scale changes and non-rigid motion of the target, giving higher representation accuracy for nonlinear motion.
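
As a hedged illustration of the corner displacement variant, the sketch below computes a horizontal and a vertical normalized displacement for each of the four corners. The corner order (top-left, top-right, bottom-left, bottom-right) follows the text; the function name, the interleaved (dx, dy) layout of the output vector, and the way the small constant enters the width and height are assumptions, since the original formulas are missing here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left and bottom-right corners


def corner_motion_condition(hist_box: Box, pred_box: Box, eps: float = 1e-6) -> List[float]:
    """Normalized corner displacements, ordered tl, tr, bl, br, as (dx, dy) pairs."""
    hx1, hy1, hx2, hy2 = hist_box
    px1, py1, px2, py2 = pred_box
    width = (hx2 - hx1) + eps    # assumption: eps is added to the historical box width
    height = (hy2 - hy1) + eps   # assumption: eps is added to the historical box height
    hist_corners = [(hx1, hy1), (hx2, hy1), (hx1, hy2), (hx2, hy2)]
    pred_corners = [(px1, py1), (px2, py1), (px1, py2), (px2, py2)]
    representation: List[float] = []
    for (hx, hy), (px, py) in zip(hist_corners, pred_corners):
        representation.append((px - hx) / width)
        representation.append((py - hy) / height)
    return representation


# A box that doubles in size while its top-left corner stays fixed.
rep = corner_motion_condition((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 20.0, 20.0))
```

In this example the top-left components are zero while the bottom-right components are close to one, so a pure scale change is visible in the representation, whereas the center displacement form would only report the shift of the center.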

[0065] Furthermore, in step S3, before the motion condition representation is constructed, reliability screening is performed on the trajectory set. The screening criteria include at least the following:

[0066] (1) Trajectory maintenance time condition: when the duration of the i-th trajectory is below the preset minimum maintenance time threshold, no motion condition representation is constructed for the trajectory, where the duration is the cumulative number of frames the trajectory has existed from its start to the current frame;

[0067] (2) Update interval condition: when the unupdated frame count of the i-th trajectory exceeds the preset maximum update interval threshold, no motion condition representation is constructed for the trajectory, where the unupdated frame count is the number of frames since the last successful trajectory update;

[0068] (3) Velocity threshold condition: when the motion velocity amplitude of the i-th trajectory is below the preset minimum speed threshold, no motion condition representation is constructed for the trajectory, where the velocity amplitude is the L2 norm of the displacement between target box centers, and the center of a target box is computed from the coordinates of its top-left and bottom-right corners.

[0069] By using the above screening, we can avoid constructing motion condition representations for near-static or unstable trajectories, thereby improving the reliability of motion priors and reducing the interference of noise motion on the diffusion generation process.
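
The three screening conditions can be sketched as a single predicate. This is an illustrative sketch only: the `Track` fields, the threshold values, the strictness of each comparison, and the choice to measure the velocity amplitude between the historical box center and the predicted box center are assumptions, as the exact formulas and thresholds are not recoverable from this text.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left and bottom-right corners


@dataclass
class Track:
    """Hypothetical minimal trajectory record used only for this sketch."""
    age: int                  # cumulative frames the trajectory has existed
    frames_since_update: int  # frames since the last successful update
    hist_box: Box             # historical target box in the previous frame
    pred_box: Box             # motion-predicted target box in the current frame


def box_center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def is_reliable(track: Track, min_age: int = 3, max_gap: int = 2, min_speed: float = 1.0) -> bool:
    """Return True only if a motion condition representation should be built for the track."""
    if track.age < min_age:                  # (1) trajectory maintenance time condition
        return False
    if track.frames_since_update > max_gap:  # (2) update interval condition
        return False
    (hx, hy), (px, py) = box_center(track.hist_box), box_center(track.pred_box)
    speed = math.hypot(px - hx, py - hy)     # (3) L2 norm of the center displacement
    return speed >= min_speed


moving = Track(age=5, frames_since_update=0, hist_box=(0, 0, 10, 10), pred_box=(3, 4, 13, 14))
```

Here `is_reliable(moving)` is true because the track is old enough, was just updated, and its center moves by 5 pixels; a track failing any one of the three checks would be skipped.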

[0070] Further, the alignment in step S4 includes: performing Top-k matching based on the intersection over union (IoU) between the historical target box in the previous frame and the candidate target boxes in the current frame, and writing motion condition features only for candidate boxes that meet the IoU threshold, specifically:

[0071] Let the j-th candidate target box be $d_j = (x_{1,j}, y_{1,j}, x_{2,j}, y_{2,j})$ and the historical target box of the i-th trajectory be $b_i^{h} = (x_{1,i}^{h}, y_{1,i}^{h}, x_{2,i}^{h}, y_{2,i}^{h})$, where:

[0072] $(x_{1,j}, y_{1,j})$ are the coordinates of the top-left corner of the j-th candidate box;

[0073] $(x_{2,j}, y_{2,j})$ are the coordinates of the bottom-right corner of the j-th candidate box;

[0074] $(x_{1,i}^{h}, y_{1,i}^{h})$ are the coordinates of the top-left corner of the historical target box of the i-th trajectory in the previous frame;

[0075] $(x_{2,i}^{h}, y_{2,i}^{h})$ are the coordinates of the bottom-right corner of the historical target box of the i-th trajectory in the previous frame.

[0076] The intersection over union (IoU) between a candidate box and a historical trajectory box is defined from the area of the historical target box $A_i^{h} = (x_{2,i}^{h} - x_{1,i}^{h})(y_{2,i}^{h} - y_{1,i}^{h})$, the area of the candidate target box $A_j = (x_{2,j} - x_{1,j})(y_{2,j} - y_{1,j})$, and the intersection area $A_{ij}^{\cap}$ of the two boxes:

[0077] $A_{ij}^{\cap} = \max\big(0, \min(x_{2,i}^{h}, x_{2,j}) - \max(x_{1,i}^{h}, x_{1,j})\big) \cdot \max\big(0, \min(y_{2,i}^{h}, y_{2,j}) - \max(y_{1,i}^{h}, y_{1,j})\big)$, $\quad \mathrm{IoU}_{ij} = \dfrac{A_{ij}^{\cap}}{A_i^{h} + A_j - A_{ij}^{\cap}}$

[0078] where $\min(\cdot)$ takes the smaller value, $\max(\cdot)$ takes the larger value, and the outer $\max(0, \cdot)$ avoids a negative area when the boxes do not overlap. Let $\tau$ be the IoU threshold. For each trajectory $i$, select the index set $\mathcal{J}_i$ of the $k$ candidate boxes with the largest $\mathrm{IoU}_{ij}$, where $k$ is a preset integer. When $j \in \mathcal{J}_i$ and $\mathrm{IoU}_{ij}$ satisfies the threshold $\tau$, the motion condition representation $m_i$ of the trajectory is written into the motion condition feature $f_j$ of the corresponding candidate box, scaled by the diffusion scale coefficient $s$, i.e.

[0079] $f_j = s \cdot m_i$.

[0080] When the above conditions are not met, set $f_j = \mathbf{0}$, where $\mathbf{0}$ is the zero vector. The write is performed only on the set of motion condition features corresponding to the candidate target boxes of the current frame.

[0081] By using the above alignment method, the motion condition representation corresponding to the historical trajectory can be accurately assigned to the candidate target box that matches the current frame spatial position. This reduces the misintroduction of motion priors by irrelevant candidate boxes, improves the effectiveness of motion condition injection, thereby enhancing the diffusion model's ability to constrain target position updates and improving the stability of target association and identity preservation in scenarios such as occlusion and dense interaction.
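The Top-K IoU alignment described above can be sketched as follows. The function names, default values, and the rule that a later trajectory overwrites an earlier one on the same candidate are assumptions made for illustration; the source does not state how such conflicts are resolved.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # clamp to avoid negative overlap
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def align_motion_conditions(hist_boxes, track_reprs, cand_boxes,
                            k=1, iou_thr=0.5, scale=1.0):
    """Write each trajectory's motion representation into its Top-K candidates.

    Candidates that match no trajectory keep a zero vector.
    """
    dim = len(track_reprs[0]) if track_reprs else 0
    feats = [[0.0] * dim for _ in cand_boxes]
    for hist_box, rep in zip(hist_boxes, track_reprs):
        ious = [iou(hist_box, c) for c in cand_boxes]
        top_k = sorted(range(len(cand_boxes)), key=lambda j: ious[j], reverse=True)[:k]
        for j in top_k:
            if ious[j] >= iou_thr:  # write only when the IoU threshold is met
                feats[j] = [scale * v for v in rep]  # assumed: last writer wins
    return feats
```

Restricting the write to the Top-K set and the threshold at the same time keeps a trajectory from conditioning distant candidates even when `k` is larger than the number of plausible matches.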

[0082] Furthermore, in step S5, the learnable mapping of the motion condition features is implemented by a motion mapping network. The motion mapping network comprises at least two linear transformation layers and one nonlinear activation layer, and satisfies the following: for the motion condition feature $\mathbf{m}_j \in \mathbb{R}^{d_m}$ of candidate target box $j$, it outputs the projected motion embedding $\mathbf{e}_j \in \mathbb{R}^{d_f}$, with the mapping relationship:

[0083] $$\mathbf{e}_j = W_2\, \phi\!\left(W_1 \mathbf{m}_j + \mathbf{b}_1\right) + \mathbf{b}_2$$

[0084] where $W_1$ and $W_2$ are mapping matrices, $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors, and $\phi(\cdot)$ is a nonlinear activation function; $d_m$ is the dimension of the motion condition features and $d_f$ is the feature dimension of the candidate target boxes;

[0085] In step S5, the denoising prediction of the diffusion-based iterative update network at time step $t$ uses the motion embedding as one of its conditional inputs, and the conditional denoising prediction satisfies:

[0086] $$\left(\hat{B}_0^{(t)},\ \hat{\epsilon}^{(t)}\right) = f_\theta\!\left(B_t,\ t,\ F^{\text{prev}},\ F^{\text{cur}},\ C\right)$$

[0087] where:

[0088] $B_t$ is the set of candidate box states at time step $t$, with $B_t \in \mathbb{R}^{N_c \times 4}$, where $N_c$ is the number of candidate target boxes and each row corresponds to the four-dimensional position parameters of one candidate target box;

[0089] $\hat{B}_0^{(t)}$ is the set of noise-free candidate box position parameters predicted at time step $t$;

[0090] $\hat{\epsilon}^{(t)}$ is the noise set predicted at time step $t$;

[0091] $F^{\text{prev}}$ and $F^{\text{cur}}$ are the image features of the previous frame and the current frame, respectively;

[0092] $C$ is the set of conditional features that contains the candidate box features fused with the motion embedding $\mathbf{e}_j$;

[0093] $f_\theta$ is the denoising prediction function and $\theta$ denotes its parameters.

[0094] Furthermore, the modulation of the candidate target box features in step S5 includes element-wise additive fusion of the projected motion embedding $\mathbf{e}_j$ with the feature representation of the candidate target box. Specifically, let the feature of candidate target box $j$ be $\mathbf{f}_j$; the modulated feature is then

[0095] $$\tilde{\mathbf{f}}_j = \mathbf{f}_j + \mathbf{e}_j$$

[0096] where $\tilde{\mathbf{f}}_j$ is the modulated candidate target box feature representation, and the additive fusion is performed before the candidate target box features enter the subsequent feature interaction module.
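The motion mapping network and the additive fusion above can be illustrated with a small two-layer mapping in plain Python. ReLU is used here only as one possible nonlinear activation, and all function names, weights, and dimensions are made up for the example; none of them come from the source.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU, standing in for the unspecified nonlinear activation."""
    return [max(0.0, v) for v in x]


def project_motion(m, w1, b1, w2, b2):
    """Two linear layers with one activation in between: e = W2 * act(W1 m + b1) + b2."""
    return linear(w2, b2, relu(linear(w1, b1, m)))


def modulate(feature, embedding):
    """Element-wise additive fusion of a candidate feature and its motion embedding."""
    return [f + e for f, e in zip(feature, embedding)]


# Toy example: a 2-dimensional motion condition mapped to a 3-dimensional feature space.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.5]
w2 = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
b2 = [0.0, 0.0, 1.0]
embedding = project_motion([1.0, -2.0], w1, b1, w2, b2)
fused = modulate([1.0, 1.0, 1.0], embedding)
```

Because a zero motion condition feature maps to a constant embedding (determined by the biases), unmatched candidates all receive the same offset rather than a trajectory-specific one.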

[0097] Furthermore, as shown in Figure 4, in step S5 the diffusion-based iterative update network applies motion guidance to the position parameters of the candidate target boxes in the current frame during the iterative sampling process. The motion guidance is performed after at least one iterative update and/or at the last iterative update, specifically satisfying the following:

[0098] (1) Reverse diffusion iterative update:

[0099] Suppose that at time step $t$ of the reverse diffusion sampling, the normalized position parameters of the candidate target boxes in the current frame are $B_t$. Let $t$ and $t'$ be two adjacent reverse sampling time steps in the noise schedule satisfying $t' < t$, let their cumulative coefficients be $\bar{\alpha}_t$ and $\bar{\alpha}_{t'}$, and let the sampling randomness coefficient be $\eta$. The denoising prediction function then gives $\hat{B}_0$ and $\hat{\epsilon}$, and the candidate box state is updated as follows:

[0100] $$B_{t'} = \sqrt{\bar{\alpha}_{t'}}\,\hat{B}_0 + c_t\,\hat{\epsilon} + \sigma_t\, z, \qquad z \sim \mathcal{N}(\mathbf{0}, I),$$

[0101] where $z$ is Gaussian noise of the same dimension as $B_t$, $\mathbf{0}$ is a zero vector, and $I$ is the identity matrix;

[0102] $$\sigma_t = \eta \sqrt{\frac{1 - \bar{\alpha}_{t'}}{1 - \bar{\alpha}_t}} \sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}};$$

[0103] where $c_t$ is the coefficient of the noise prediction term, $c_t = \sqrt{1 - \bar{\alpha}_{t'} - \sigma_t^2}$;

[0104] $\eta$ is the random sampling intensity control parameter, used to adjust the degree of randomness during diffusion sampling. When $\eta = 0$, the update is a deterministic sampling update;

[0105] (2) Motion guidance update:

[0106] Let $b_j$ denote the position parameters of candidate target box $j$ at the point where motion guidance is applied, that is, after the $n$-th iteration or after the last iteration; let the corresponding motion condition feature be $\mathbf{m}_j$ and the guiding strength coefficient be $\lambda$. Then:

[0107] Under center displacement type motion conditions, update as follows:

[0108]

[0109] where $\mathbf{m}_j^{(1:4)}$ denotes the first four components of $\mathbf{m}_j$;

[0110] When corner displacement type motion conditions are adopted and $\mathbf{m}_j$ contains at least eight components, a guide box $b_j^{g}$ is constructed based on the corner displacement, and the update is as follows:

[0111]

[0112] where $b_j^{g}$ denotes the guiding position parameters recovered from the corner displacement and the current candidate target box size. After being guided by the above motion conditions, the candidate target box can gradually approach the true position along the target motion direction during the iterative update process, as shown in Figure 5.
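The reverse diffusion iterative update in item (1) above can be sketched with the standard DDIM sampling step. Treating the update as the standard DDIM form is an assumption made for this illustration, and the function and argument names (`ddim_step`, `alpha_bar_t`, `alpha_bar_next`) are hypothetical. The motion guidance update of item (2) is not sketched, because its update equations are not available in the text.

```python
import math
import random


def ddim_step(x0_pred, eps_pred, alpha_bar_t, alpha_bar_next, eta=0.0, rng=None):
    """One DDIM reverse step from time step t to the adjacent, less noisy step.

    x0_pred, eps_pred: per-box lists of predicted clean parameters and predicted noise.
    alpha_bar_t, alpha_bar_next: cumulative noise-schedule coefficients, with
    alpha_bar_t < alpha_bar_next < 1 assumed.
    eta: sampling randomness coefficient; eta = 0 gives a deterministic update.
    """
    rng = rng or random.Random()
    sigma = (eta
             * math.sqrt((1.0 - alpha_bar_next) / (1.0 - alpha_bar_t))
             * math.sqrt(1.0 - alpha_bar_t / alpha_bar_next))
    # Coefficient of the noise prediction term, clamped against rounding error.
    coeff = math.sqrt(max(0.0, 1.0 - alpha_bar_next - sigma ** 2))
    scale = math.sqrt(alpha_bar_next)
    updated = []
    for row0, row_eps in zip(x0_pred, eps_pred):
        new_row = []
        for x0, eps in zip(row0, row_eps):
            noise = rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0
            new_row.append(scale * x0 + coeff * eps + sigma * noise)
        updated.append(new_row)
    return updated
```

With `eta=0.0` the step adds no fresh noise, so repeated runs from the same initialization produce the same box trajectory, which is convenient when comparing configurations step by step.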

[0113] Furthermore, the association score between candidate target boxes in step S5 is obtained by concatenating the features of the candidate target boxes in adjacent frames, passing them through a scoring network, and normalizing the result with the Sigmoid function before output. Specifically, for the same candidate target index $j$, let the candidate box features of the previous frame and the current frame be $\mathbf{f}_j^{\text{prev}}$ and $\mathbf{f}_j^{\text{cur}}$, respectively, and construct the concatenated feature $\mathbf{h}_j = [\mathbf{f}_j^{\text{prev}} ; \mathbf{f}_j^{\text{cur}}]$, where $[\cdot\,;\cdot]$ denotes vector concatenation.

[0114] The concatenated feature $\mathbf{h}_j$ is input to the scoring network $g_\psi$ to obtain the association logit $\ell_j = g_\psi(\mathbf{h}_j)$, where $\ell_j \in \mathbb{R}$ and $\psi$ denotes the parameters of the scoring network; the association score is output as $s_j = \sigma(\ell_j) = \frac{1}{1 + \exp(-\ell_j)}$, where $s_j$ is the matching confidence of the candidate box pair, $\sigma(\cdot)$ is the Sigmoid activation function, used to map the association logit to the interval $(0, 1)$, and $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base.
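The association scoring above can be illustrated as follows. The scoring network is reduced to a single linear layer purely for brevity; this is an assumption, as are the function names and the example weights.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def association_score(f_prev, f_cur, weights, bias=0.0):
    """Concatenate the two frame features, score them, and squash with a sigmoid.

    A single linear layer stands in for the scoring network (assumed for brevity).
    """
    h = list(f_prev) + list(f_cur)  # vector concatenation
    logit = sum(w * v for w, v in zip(weights, h)) + bias
    return sigmoid(logit)


score = association_score([0.2, 0.1], [0.3, 0.4], weights=[1.0, 1.0, 1.0, 1.0])
```

A score near 1 indicates a confident match between the two candidate boxes, and a score near 0 indicates that the pair should not be associated.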

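[0115] Furthermore, a velocity consistency loss $L_{\text{vel}}$ is introduced during the training phase. Here, the predicted target boxes and the ground-truth boxes are defined specifically for the training phase and describe the supervision relationship between training samples in adjacent frames. The velocity consistency loss $L_{\text{vel}}$ is obtained by applying a robust distance metric to the normalized velocity vectors of the predicted target boxes and the ground-truth boxes of adjacent frames. Specifically:

[0116] For the $n$-th sample involved in the calculation, the predicted boxes of the previous and current frames are $\hat{b}_n^{\text{prev}} = (\hat{x}_{1,n}^{\text{prev}}, \hat{y}_{1,n}^{\text{prev}}, \hat{x}_{2,n}^{\text{prev}}, \hat{y}_{2,n}^{\text{prev}})$ and $\hat{b}_n^{\text{cur}} = (\hat{x}_{1,n}^{\text{cur}}, \hat{y}_{1,n}^{\text{cur}}, \hat{x}_{2,n}^{\text{cur}}, \hat{y}_{2,n}^{\text{cur}})$, and the ground-truth boxes of the previous and current frames are $b_n^{\text{prev}}$ and $b_n^{\text{cur}}$, written in the same way without the hat. Here $(\hat{x}_{1,n}^{\text{prev}}, \hat{y}_{1,n}^{\text{prev}})$ and $(\hat{x}_{2,n}^{\text{prev}}, \hat{y}_{2,n}^{\text{prev}})$ are the coordinates of the top-left and bottom-right corners of the predicted box in the previous frame for the $n$-th sample, and $(\hat{x}_{1,n}^{\text{cur}}, \hat{y}_{1,n}^{\text{cur}})$ and $(\hat{x}_{2,n}^{\text{cur}}, \hat{y}_{2,n}^{\text{cur}})$ are the coordinates of the top-left and bottom-right corners of the predicted box in the current frame. Define the predicted center velocity vector $\hat{\mathbf{v}}_n$ as $\hat{\mathbf{v}}_n = \left[ \frac{\hat{c}_{x,n}^{\text{cur}} - \hat{c}_{x,n}^{\text{prev}}}{\hat{w}_n^{\text{prev}}},\ \frac{\hat{c}_{y,n}^{\text{cur}} - \hat{c}_{y,n}^{\text{prev}}}{\hat{h}_n^{\text{prev}}} \right]^{\top}$, where $\hat{c}_{x,n}^{\text{prev}}$ and $\hat{c}_{y,n}^{\text{prev}}$ are the x and y coordinates of the center point of the predicted box in the previous frame for the $n$-th sample, calculated from the coordinates of its top-left and bottom-right corners:

[0117] $$\hat{c}_{x,n}^{\text{prev}} = \frac{\hat{x}_{1,n}^{\text{prev}} + \hat{x}_{2,n}^{\text{prev}}}{2}, \qquad \hat{c}_{y,n}^{\text{prev}} = \frac{\hat{y}_{1,n}^{\text{prev}} + \hat{y}_{2,n}^{\text{prev}}}{2}$$

[0118] $\hat{c}_{x,n}^{\text{cur}}$ and $\hat{c}_{y,n}^{\text{cur}}$ are the x and y coordinates of the center point of the predicted box in the current frame for the $n$-th sample, calculated from the coordinates of its top-left and bottom-right corners:

[0119] $$\hat{c}_{x,n}^{\text{cur}} = \frac{\hat{x}_{1,n}^{\text{cur}} + \hat{x}_{2,n}^{\text{cur}}}{2}, \qquad \hat{c}_{y,n}^{\text{cur}} = \frac{\hat{y}_{1,n}^{\text{cur}} + \hat{y}_{2,n}^{\text{cur}}}{2}$$

[0120] where $\hat{w}_n^{\text{prev}}$ and $\hat{h}_n^{\text{prev}}$ are the width and height of the predicted box in the previous frame, used as scale factors for velocity normalization and calculated as follows:

[0121] $$\hat{w}_n^{\text{prev}} = \max\!\left(\hat{x}_{2,n}^{\text{prev}} - \hat{x}_{1,n}^{\text{prev}},\ \varepsilon\right), \qquad \hat{h}_n^{\text{prev}} = \max\!\left(\hat{y}_{2,n}^{\text{prev}} - \hat{y}_{1,n}^{\text{prev}},\ \varepsilon\right)$$

[0122] Using the width and height of the target box in the previous frame as the normalization scale reduces the impact of velocity differences between targets of different scales on the loss calculation, thereby improving the applicability and consistency of the loss across targets of different sizes. The constant $\varepsilon$ prevents division by zero; it is used only for numerical stability and does not represent a physical quantity.

[0123] The ground-truth center velocity vector $\mathbf{v}_n$ is defined as:

[0124] $$\mathbf{v}_n = \left[ \frac{c_{x,n}^{\text{cur}} - c_{x,n}^{\text{prev}}}{w_n^{\text{prev}}},\ \frac{c_{y,n}^{\text{cur}} - c_{y,n}^{\text{prev}}}{h_n^{\text{prev}}} \right]^{\top}$$

[0125] where:

[0126] $$c_{x,n}^{\text{prev}} = \frac{x_{1,n}^{\text{prev}} + x_{2,n}^{\text{prev}}}{2}, \quad c_{y,n}^{\text{prev}} = \frac{y_{1,n}^{\text{prev}} + y_{2,n}^{\text{prev}}}{2}, \quad w_n^{\text{prev}} = \max\!\left(x_{2,n}^{\text{prev}} - x_{1,n}^{\text{prev}},\ \varepsilon\right), \quad h_n^{\text{prev}} = \max\!\left(y_{2,n}^{\text{prev}} - y_{1,n}^{\text{prev}},\ \varepsilon\right)$$

[0127] where:

[0128] $c_{x,n}^{\text{prev}}$ and $c_{y,n}^{\text{prev}}$ are the x and y coordinates of the center point of the ground-truth box in the previous frame for the $n$-th sample, and $w_n^{\text{prev}}$ and $h_n^{\text{prev}}$ are the width and height of the ground-truth box in the previous frame; the center of the ground-truth box in the current frame, $(c_{x,n}^{\text{cur}}, c_{y,n}^{\text{cur}})$, is calculated in the same way from its corner coordinates;

[0129] Then the velocity consistency loss $L_{\text{vel}}$ is defined as

[0130] $$L_{\text{vel}} = \frac{w_v}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \rho\!\left(\hat{v}_{n,d} - v_{n,d}\right)$$

[0131] where $N$ is the number of samples and $D$ is the velocity vector dimension, with $D = 2$ when the center velocity is used and $D = 8$ when the corner velocity is used; $\hat{v}_{n,d}$ and $v_{n,d}$ are the $d$-th components of the predicted and ground-truth velocity vectors, respectively; $w_v$ is the scaling factor of the velocity loss; and $\rho(\cdot)$ is a robust distance function, which in specific implementations is the smooth $L_1$ function:

[0132] $$\rho(x) = \begin{cases} \dfrac{0.5\, x^2}{\beta}, & |x| < \beta \\[4pt] |x| - 0.5\,\beta, & \text{otherwise} \end{cases}$$

[0133] where $x = \hat{v}_{n,d} - v_{n,d}$ and $\beta$ is the smoothing parameter;

[0134] The robust distance function can also be the Charbonnier function $\rho(x) = \sqrt{x^2 + \varepsilon_c^2}$, where $\varepsilon_c$ is a stability constant whose meaning differs from that of $\varepsilon$.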

[0135] By introducing the aforementioned velocity consistency loss, the normalized motion changes of target boxes in adjacent frames are explicitly supervised. This enables the model to learn cross-frame positional change relationships that conform to real motion patterns, reducing unreasonable jumps, abrupt changes in direction, or amplitude distortions in predicted target boxes between adjacent frames. Simultaneously, using a robust distance function to measure velocity differences helps suppress the excessive influence of abnormal biases on loss calculation, thereby improving model training stability and enhancing the trajectory continuity, correlation stability, and identity preservation capabilities of multi-target tracking in complex scenes.
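The velocity consistency loss can be sketched for the center-velocity case as follows. This is an illustration under assumptions rather than the patented implementation: the loss is averaged over all samples and components, the width and height are clamped from below by a small constant, and the names `center_velocity`, `smooth_l1`, `weight`, and `beta` are chosen for the example.

```python
def center_velocity(prev_box, cur_box, eps=1e-6):
    """Center displacement between two (x1, y1, x2, y2) boxes,
    normalized by the previous-frame width and height."""
    px1, py1, px2, py2 = prev_box
    cx1, cy1, cx2, cy2 = cur_box
    width = max(px2 - px1, eps)   # eps only guards against division by zero
    height = max(py2 - py1, eps)
    dx = ((cx1 + cx2) - (px1 + px2)) / 2.0
    dy = ((cy1 + cy2) - (py1 + py2)) / 2.0
    return (dx / width, dy / height)


def smooth_l1(x, beta=1.0):
    """Smooth L1 distance: quadratic near zero, linear for large residuals."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def velocity_consistency_loss(pred_pairs, gt_pairs, weight=1.0, beta=1.0):
    """Mean robust distance between predicted and ground-truth normalized velocities.

    pred_pairs, gt_pairs: lists of (previous_box, current_box) tuples.
    """
    total, count = 0.0, 0
    for (pred_prev, pred_cur), (gt_prev, gt_cur) in zip(pred_pairs, gt_pairs):
        v_pred = center_velocity(pred_prev, pred_cur)
        v_gt = center_velocity(gt_prev, gt_cur)
        for a, b in zip(v_pred, v_gt):
            total += smooth_l1(a - b, beta)
            count += 1
    return weight * total / count if count else 0.0
```

Because the residual is measured on normalized velocities, a two-pixel velocity error on a small box is penalized more than the same error on a large box.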

[0136] In this embodiment, to verify the improvement effect of the proposed motion prior on the convergence process of candidate boxes in diffusion-based multi-target tracking, target samples from the DanceTrack validation set sequence dancetrack0004 were selected for simulation experiments. Specifically, using target ID=3 in frame 313 as the analysis object, under the same network structure, the same number of diffusion iterations, and the same initialization conditions, the baseline method without motion prior and the method of this invention with motion prior were executed respectively, and the position changes of the candidate boxes in each diffusion iteration step were recorded.

[0137] To more accurately characterize the positional and scale deviations between candidate boxes and ground truth boxes, this implementation uses the "average distance error between the four corner points" as an evaluation metric. This involves calculating the Euclidean distances between the four corresponding corner points of the candidate box and the ground truth box, and taking the average as the error for the current step. Considering the inherent random fluctuations in the diffusion sampling process, to highlight the overall trend of candidate boxes gradually approaching the ground truth boxes, the figure shows the "cumulative optimal error convergence curve," which represents the minimum error value reached up to the current diffusion step.
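The evaluation metric and the cumulative optimal error curve described above can be computed as in the following sketch. The function names are illustrative, and boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates.

```python
import math


def corners(box):
    """Four corner points of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]


def four_corner_error(box, gt_box):
    """Average Euclidean distance between the four corresponding corner points."""
    pairs = zip(corners(box), corners(gt_box))
    return sum(math.dist(p, q) for p, q in pairs) / 4.0


def cumulative_best(errors):
    """Running minimum: the lowest error reached up to each diffusion step."""
    best, curve = float("inf"), []
    for err in errors:
        best = min(best, err)
        curve.append(best)
    return curve
```

The running minimum is monotonically non-increasing by construction, which is why the plotted curve shows the convergence trend without the step-to-step fluctuations of stochastic sampling.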

[0138] As shown in Figure 6, the vertical axis represents the cumulative optimal four-corner average distance error, and the horizontal axis represents the number of diffusion iteration steps. A magnified view is used to show the error difference in the final convergence stage. In the early stages of diffusion iteration, the error reduction trends of the two methods are essentially the same. As iterations continue, the method with motion priors converges better in the later stages, especially in the final iteration stage, where its candidate boxes approach the ground-truth boxes more closely. The magnified view shows that in the final convergence stage the method with motion priors has a lower error than the baseline method without motion priors. The final error of the baseline method is approximately 2.7364, while the final error with motion priors is approximately 1.8132, a decrease of approximately 33.7%.

[0139] Therefore, this invention, by introducing motion prior information during the diffusion candidate box update process, can provide more stable spatial constraints for candidate boxes even when the target is occluded, interacts with, or undergoes significant positional changes. This results in more accurate candidate box update directions in the later stages of diffusion sampling, thereby improving the convergence accuracy and final localization quality of the candidate boxes. These results demonstrate that this invention not only improves the stability of target association in multi-target tracking but also enhances the iterative convergence performance of diffusion candidate boxes.

[0140] Referring to Figure 7, in this embodiment the dancetrack0007 sequence from the DanceTrack validation set is selected as the test sample, and a continuous video segment with significant occlusion and dense target interaction is extracted from frames 31 to 55 for visual verification. The figure shows three representative frames in chronological order to demonstrate the tracking effect of the method of this invention, with the trajectory prior introduced, under conditions in which multiple people move closely side by side, partially occlude one another, and cross paths.

[0141] In this embodiment, multiple pedestrian targets are densely distributed horizontally in the scene, and there is continuous overlap and occlusion between the middle target and adjacent targets. This is a typical difficult scenario in multi-target tracking in which identity switching is prone to occur. The method of the present invention introduces trajectory prior information into the diffusion candidate box update and target association process, and uses the position change trend of the target in historical frames to constrain the candidate boxes of the current frame, thereby making the candidate box update direction more stable and improving identity preservation.

[0142] As can be seen from Figure 7, in the selected consecutive frames the tracking bounding box of the key target remains continuous while it is occluded by, crosses, and separates from neighboring targets, and the target number does not change. This indicates that the method of the present invention can maintain stable identity consistency under strong occlusion. At the same time, the trajectories of multiple adjacent targets also maintain good continuity, with no obvious trajectory interruptions or erroneous identity switches. These results indicate that the proposed trajectory prior mechanism can improve the stability of multi-target tracking in complex and congested scenes, reduce the risk of identity switching, and enhance the continuity and reliability of tracking results.

[0143] Furthermore, the segment selected in this embodiment belongs to a difficult sample with a high degree of target overlap, but the method of the present invention can still continuously and stably track the key target, indicating that the method has strong robustness in occluded scenarios and is suitable for multi-target tracking application scenarios with dense pedestrians and strong mutual interference.

Claims

1. A multi-target tracking method based on a diffusion model with a learnable motion condition representation, characterized in that it comprises the following steps: S1. Obtain adjacent frame images of a video sequence, including the current frame image and the previous frame image; S2. Perform feature extraction on the adjacent frame images obtained in S1, and generate a set of candidate target boxes $\mathcal{B} = \{b_j\}_{j=1}^{N_c}$ for the current frame based on the extracted features, where $N_c$ is the number of candidate target boxes generated in the current frame, $j$ is the candidate target index, and each candidate target box $b_j$ includes at least position parameters; S3. For the established set of target trajectories $\mathcal{T} = \{T_i\}_{i=1}^{M}$, where $T_i$ is the $i$-th historical target trajectory, $M$ is the number of target trajectories maintained at the current moment, and $i$ is the trajectory index, construct a motion prior based on the historical target box of the trajectory in the previous frame and the motion-predicted target box in the current frame, and compress the constructed motion prior into a motion condition representation, which represents the displacement information or corner displacement information of the target trajectory; S4. Align the motion condition representation obtained in step S3 with the candidate target box set $\mathcal{B}$ obtained in step S2 to obtain motion condition features in one-to-one correspondence with the candidate target boxes; S5. Input the candidate target box set $\mathcal{B}$ obtained in step S2, the features of the adjacent frames, and the motion condition features obtained in step S4 into the diffusion-based iterative update network, in which the motion condition features are aligned to the representation space of the diffusion head through a lightweight learnable mapping and modulate the features of the candidate target boxes, so that the motion conditions constrain or guide the iterative update process; the network outputs the updated position parameters of the candidate target boxes and the association scores between the candidate target boxes; S6. Based on the association scores and the updated candidate target boxes, perform data association and trajectory state update, and output the multi-target tracking result; S7. During the model training phase, construct a dataset containing the ground-truth annotated target boxes of adjacent frames to train the diffusion-based iterative update network, wherein a velocity consistency constraint can be introduced to improve the stability of target motion prediction in adjacent frames.

2. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 1, characterized in that the constructed motion prior is compressed into the motion condition representation in step S3 as follows: for the $i$-th trajectory $T_i$, establish a Cartesian coordinate system in the image plane with the top-left corner of the image as the origin, the horizontal direction to the right as the positive $x$ axis, and the vertical direction downward as the positive $y$ axis. In this system, the historical target box is denoted $b_i^{h} = (x_{1,i}^{h}, y_{1,i}^{h}, x_{2,i}^{h}, y_{2,i}^{h})$ and the current predicted box is denoted $b_i^{p} = (x_{1,i}^{p}, y_{1,i}^{p}, x_{2,i}^{p}, y_{2,i}^{p})$, where $(x_{1,i}^{h}, y_{1,i}^{h})$ and $(x_{1,i}^{p}, y_{1,i}^{p})$ are the coordinates of the top-left corners of the historical target box and the current predicted box, respectively, $(x_{2,i}^{h}, y_{2,i}^{h})$ and $(x_{2,i}^{p}, y_{2,i}^{p})$ are the coordinates of their bottom-right corners, superscript $h$ indicates the historical target box, and superscript $p$ indicates the current predicted box; the center coordinates of the historical target box are $c_{x,i}^{h} = (x_{1,i}^{h} + x_{2,i}^{h})/2$ and $c_{y,i}^{h} = (y_{1,i}^{h} + y_{2,i}^{h})/2$, the width of the historical target box is $w_i^{h} = \max(x_{2,i}^{h} - x_{1,i}^{h}, \varepsilon)$, and its height is $h_i^{h} = \max(y_{2,i}^{h} - y_{1,i}^{h}, \varepsilon)$, where $\varepsilon$ is a constant that prevents division by zero; the center coordinates of the current predicted box are $c_{x,i}^{p} = (x_{1,i}^{p} + x_{2,i}^{p})/2$ and $c_{y,i}^{p} = (y_{1,i}^{p} + y_{2,i}^{p})/2$; the normalized displacement of the center of the current predicted box relative to the historical target box is defined as $\Delta x_i = (c_{x,i}^{p} - c_{x,i}^{h}) / w_i^{h}$ and $\Delta y_i = (c_{y,i}^{p} - c_{y,i}^{h}) / h_i^{h}$; and the motion condition representation of the $i$-th trajectory is defined from this normalized center displacement as the trajectory-level motion condition representation vector $\mathbf{r}_i$, a column vector written with the superscript $\top$, which denotes the vector transpose operation, and whose entries are given by the normalized center displacement.

3. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 2, characterized in that the motion condition representation in step S3 can also be constructed using a normalized corner displacement representation, as follows: based on the defined historical target box $b_i^{h}$, current predicted box $b_i^{p}$, historical target box width $w_i^{h}$, and historical target box height $h_i^{h}$, define the normalized components of the corner displacements between the current predicted box and the historical target box as $\Delta x_i^{k} = (x_i^{p,k} - x_i^{h,k}) / w_i^{h}$ and $\Delta y_i^{k} = (y_i^{p,k} - y_i^{h,k}) / h_i^{h}$ for $k \in \{\text{tl}, \text{tr}, \text{bl}, \text{br}\}$, where $(x_i^{h,k}, y_i^{h,k})$ and $(x_i^{p,k}, y_i^{p,k})$ are the coordinates of corner $k$ of the historical target box and of the current predicted box; each component represents the normalized lateral or longitudinal displacement at one of the four corner points between the historical target box and the current predicted box and characterizes the short-term movement trend of the target, and tl, tr, bl, and br denote the top-left, top-right, bottom-left, and bottom-right corners, respectively; the motion condition representation of the $i$-th trajectory is defined from the normalized corner displacements as $\mathbf{r}_i = [\Delta x_i^{\text{tl}}, \Delta y_i^{\text{tl}}, \Delta x_i^{\text{tr}}, \Delta y_i^{\text{tr}}, \Delta x_i^{\text{bl}}, \Delta y_i^{\text{bl}}, \Delta x_i^{\text{br}}, \Delta y_i^{\text{br}}]^{\top}$, which is the trajectory-level corner motion condition representation vector.

4. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 2 or 3, characterized in that, in step S3, before the motion condition representation is constructed, reliability screening is performed on the trajectory set $\mathcal{T}$, and the screening criteria include at least the following: (1) Trajectory maintenance time condition: when the duration of the $i$-th trajectory satisfies $a_i < \tau_{\text{age}}$, no motion condition representation is constructed for the trajectory, where $a_i$ is the cumulative number of frames the trajectory has existed from its start to the current frame and $\tau_{\text{age}}$ is the preset minimum maintenance time threshold; (2) Update interval condition: when the unupdated frame count of the $i$-th trajectory satisfies $u_i > \tau_{\text{gap}}$, no motion condition representation is constructed for the trajectory, where $u_i$ is the number of frames since the last successful trajectory update and $\tau_{\text{gap}}$ is the preset maximum update interval threshold; (3) Velocity threshold condition: when the motion velocity magnitude of the $i$-th trajectory satisfies $v_i < \tau_v$, no motion condition representation is constructed for the trajectory, where $v_i = \lVert \mathrm{ctr}(b_i^{p}) - \mathrm{ctr}(b_i^{h}) \rVert_2$, $\mathrm{ctr}(\cdot)$ is the function for calculating the center coordinates of a target bounding box, $\lVert \cdot \rVert_2$ is the L2 norm, $\mathrm{ctr}(b) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$, $(x_1, y_1)$ are the coordinates of the top-left corner of the target box, $(x_2, y_2)$ are the coordinates of the bottom-right corner of the target box, and $\tau_v$ is the preset minimum speed threshold.

5. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 1, characterized in that the alignment in step S4 includes: performing Top-K matching based on the intersection over union (IoU) of the historical target box in the previous frame and the candidate target box in the current frame, and writing motion condition features only for candidate boxes that meet the IoU threshold, specifically: let the candidate target box be $b_j = (x_{1,j}, y_{1,j}, x_{2,j}, y_{2,j})$ and the historical target box be $b_i^{h} = (x_{1,i}^{h}, y_{1,i}^{h}, x_{2,i}^{h}, y_{2,i}^{h})$, where $(x_{1,j}, y_{1,j})$ are the coordinates of the top-left corner of the $j$-th candidate box; $(x_{2,j}, y_{2,j})$ are the coordinates of the bottom-right corner of the $j$-th candidate box; $(x_{1,i}^{h}, y_{1,i}^{h})$ are the coordinates of the top-left corner of the historical target box of the $i$-th trajectory in the previous frame; and $(x_{2,i}^{h}, y_{2,i}^{h})$ are the coordinates of the bottom-right corner of the historical target box of the $i$-th trajectory in the previous frame; define the IoU between a candidate box and a historical trajectory box as $\mathrm{IoU}_{ij}$, the area of the historical target box as $A_i^{h} = (x_{2,i}^{h} - x_{1,i}^{h})(y_{2,i}^{h} - y_{1,i}^{h})$, the area of the candidate target box as $A_j = (x_{2,j} - x_{1,j})(y_{2,j} - y_{1,j})$, and the intersection area of the historical target box and the candidate target box as $I_{ij}$: $I_{ij} = \max(0, \min(x_{2,i}^{h}, x_{2,j}) - \max(x_{1,i}^{h}, x_{1,j})) \cdot \max(0, \min(y_{2,i}^{h}, y_{2,j}) - \max(y_{1,i}^{h}, y_{1,j}))$ and $\mathrm{IoU}_{ij} = I_{ij} / (A_i^{h} + A_j - I_{ij})$, where $\min(\cdot)$ takes the smaller value, $\max(\cdot)$ takes the larger value, and the outer $\max(0, \cdot)$ avoids a negative area when there is no overlap; let $\tau_{\text{IoU}}$ be the IoU threshold, and for each trajectory $i$ select the index set $\mathcal{K}_i$ of the $K$ candidate boxes with the largest $\mathrm{IoU}_{ij}$, where $K$ is a preset integer; when $j \in \mathcal{K}_i$ and $\mathrm{IoU}_{ij} \ge \tau_{\text{IoU}}$, the motion condition representation $\mathbf{r}_i$ of the trajectory is written into the motion condition feature $\mathbf{m}_j$ of the corresponding candidate box and scaled by the diffusion scale coefficient $\gamma$, i.e. $\mathbf{m}_j = \gamma\, \mathbf{r}_i$; when the above conditions are not met, set $\mathbf{m}_j = \mathbf{0}$, where $\mathbf{0}$ is a zero vector; and the writing is performed only on the set of motion condition features corresponding to the candidate target boxes in the current frame.

6. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 5, characterized in that, in step S5, the learnable mapping of the motion condition features is implemented by a motion mapping network, which comprises at least two linear transformation layers and one nonlinear activation layer and satisfies the following: for the motion condition feature $\mathbf{m}_j \in \mathbb{R}^{d_m}$ of candidate target box $j$, it outputs the projected motion embedding $\mathbf{e}_j \in \mathbb{R}^{d_f}$ with the mapping relationship $\mathbf{e}_j = W_2\, \phi(W_1 \mathbf{m}_j + \mathbf{b}_1) + \mathbf{b}_2$, where $W_1$ and $W_2$ are mapping matrices, $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors, $\phi(\cdot)$ is a nonlinear activation function, $d_m$ is the dimension of the motion condition features, and $d_f$ is the feature dimension of the candidate target boxes; in step S5, the denoising prediction of the diffusion-based iterative update network at time step $t$ uses the motion embedding as one of its conditional inputs, and the conditional denoising prediction satisfies $(\hat{B}_0^{(t)}, \hat{\epsilon}^{(t)}) = f_\theta(B_t, t, F^{\text{prev}}, F^{\text{cur}}, C)$, where: $B_t$ is the set of candidate box states at time step $t$, with $B_t \in \mathbb{R}^{N_c \times 4}$ and each row corresponding to the four-dimensional position parameters of one candidate target box; $\hat{B}_0^{(t)}$ is the set of noise-free candidate box position parameters predicted at time step $t$; $\hat{\epsilon}^{(t)}$ is the noise set predicted at time step $t$; $F^{\text{prev}}$ and $F^{\text{cur}}$ are the image features of the previous frame and the current frame, respectively; $C$ is the set of conditional features that contains the candidate box features fused with the motion embedding $\mathbf{e}_j$; and $f_\theta$ is the denoising prediction function with parameters $\theta$.

7. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 6, characterized in that the modulation of the candidate target box features in step S5 includes element-wise additive fusion of the projected motion embedding $\mathbf{e}_j$ with the feature representation of the candidate target box, specifically: let the feature of candidate target box $j$ be $\mathbf{f}_j$; the modulated feature is then $\tilde{\mathbf{f}}_j = \mathbf{f}_j + \mathbf{e}_j$, where $\tilde{\mathbf{f}}_j$ is the modulated candidate target box feature representation, and the additive fusion is performed before the candidate target box features enter the subsequent feature interaction module.

8. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 6, characterized in that, in step S5, the diffusion-based iterative update network applies motion guidance to the position parameters of the candidate target boxes in the current frame during the iterative sampling process, the motion guidance being performed after at least one iterative update and/or at the last iterative update, specifically satisfying the following: (1) Reverse diffusion iterative update: suppose that at time step $t$ of the reverse diffusion sampling, the normalized position parameters of the candidate target boxes in the current frame are $B_t$; let $t$ and $t'$ be two adjacent reverse sampling time steps in the noise schedule satisfying $t' < t$, let their cumulative coefficients be $\bar{\alpha}_t$ and $\bar{\alpha}_{t'}$, and let the sampling randomness coefficient be $\eta$; the denoising prediction function then gives $\hat{B}_0$ and $\hat{\epsilon}$, and the candidate box state is updated as $B_{t'} = \sqrt{\bar{\alpha}_{t'}}\,\hat{B}_0 + c_t\,\hat{\epsilon} + \sigma_t\, z$ with $z \sim \mathcal{N}(\mathbf{0}, I)$, where $z$ is Gaussian noise of the same dimension as $B_t$, $\mathbf{0}$ is a zero vector, and $I$ is the identity matrix; $\sigma_t = \eta \sqrt{(1 - \bar{\alpha}_{t'}) / (1 - \bar{\alpha}_t)} \sqrt{1 - \bar{\alpha}_t / \bar{\alpha}_{t'}}$; $c_t$ is the coefficient of the noise prediction term, $c_t = \sqrt{1 - \bar{\alpha}_{t'} - \sigma_t^2}$; $\eta$ is the random sampling intensity control parameter, used to adjust the degree of randomness during diffusion sampling, and when $\eta = 0$ the update is a deterministic sampling update; (2) Motion guidance update: let $b_j$ denote the position parameters of candidate target box $j$ at the point where motion guidance is applied, that is, after the $n$-th iteration or after the last iteration, let the corresponding motion condition feature be $\mathbf{m}_j$, and let the guiding strength coefficient be $\lambda$; then, under center displacement type motion conditions, the position parameters are updated using $\mathbf{m}_j^{(1:4)}$, which denotes the first four components of $\mathbf{m}_j$; when corner displacement type motion conditions are adopted and $\mathbf{m}_j$ contains at least eight components, a guide box $b_j^{g}$ is constructed based on the corner displacement and used for the update, where $b_j^{g}$ denotes the guiding position parameters recovered from the corner displacement and the current candidate target box size.

9. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 1, characterized in that the association score between candidate target boxes in step S5 is obtained by concatenating the features of the candidate target boxes in adjacent frames, passing them through a scoring network, and normalizing the result with the Sigmoid function, specifically: for the same candidate target index $j$, let the candidate box features of the previous frame and the current frame be $\mathbf{f}_j^{\text{prev}}$ and $\mathbf{f}_j^{\text{cur}}$, respectively, and construct the concatenated feature $\mathbf{h}_j = [\mathbf{f}_j^{\text{prev}} ; \mathbf{f}_j^{\text{cur}}]$, where $[\cdot\,;\cdot]$ denotes vector concatenation; the concatenated feature $\mathbf{h}_j$ is input to the scoring network $g_\psi$ to obtain the association logit $\ell_j = g_\psi(\mathbf{h}_j)$, where $\ell_j \in \mathbb{R}$ and $\psi$ denotes the parameters of the scoring network; and the association score is output as $s_j = \sigma(\ell_j) = 1 / (1 + \exp(-\ell_j))$, where $s_j$ is the matching confidence of the candidate box pair, $\sigma(\cdot)$ is the Sigmoid activation function, used to map the association logit to the interval $(0, 1)$, and $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base.

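10. The multi-target tracking method based on a diffusion model with a learnable motion condition representation according to claim 1, characterized in that, in the training phase described in step S7, a velocity consistency loss $L_{\text{vel}}$ is introduced, which is obtained by applying a robust distance metric to the normalized velocity vectors of the predicted target boxes and the ground-truth boxes of adjacent frames, specifically: for the $n$-th sample involved in the calculation, the predicted boxes of the previous and current frames are $\hat{b}_n^{\text{prev}} = (\hat{x}_{1,n}^{\text{prev}}, \hat{y}_{1,n}^{\text{prev}}, \hat{x}_{2,n}^{\text{prev}}, \hat{y}_{2,n}^{\text{prev}})$ and $\hat{b}_n^{\text{cur}} = (\hat{x}_{1,n}^{\text{cur}}, \hat{y}_{1,n}^{\text{cur}}, \hat{x}_{2,n}^{\text{cur}}, \hat{y}_{2,n}^{\text{cur}})$, and the ground-truth boxes of the previous and current frames are $b_n^{\text{prev}}$ and $b_n^{\text{cur}}$, written in the same way without the hat; $(\hat{x}_{1,n}^{\text{prev}}, \hat{y}_{1,n}^{\text{prev}})$ and $(\hat{x}_{2,n}^{\text{prev}}, \hat{y}_{2,n}^{\text{prev}})$ are the coordinates of the top-left and bottom-right corners of the predicted box in the previous frame for the $n$-th sample, and $(\hat{x}_{1,n}^{\text{cur}}, \hat{y}_{1,n}^{\text{cur}})$ and $(\hat{x}_{2,n}^{\text{cur}}, \hat{y}_{2,n}^{\text{cur}})$ are the coordinates of the top-left and bottom-right corners of the predicted box in the current frame; define the predicted center velocity vector as $\hat{\mathbf{v}}_n = [(\hat{c}_{x,n}^{\text{cur}} - \hat{c}_{x,n}^{\text{prev}}) / \hat{w}_n^{\text{prev}},\ (\hat{c}_{y,n}^{\text{cur}} - \hat{c}_{y,n}^{\text{prev}}) / \hat{h}_n^{\text{prev}}]^{\top}$, where $\hat{c}_{x,n}^{\text{prev}} = (\hat{x}_{1,n}^{\text{prev}} + \hat{x}_{2,n}^{\text{prev}})/2$ and $\hat{c}_{y,n}^{\text{prev}} = (\hat{y}_{1,n}^{\text{prev}} + \hat{y}_{2,n}^{\text{prev}})/2$ are the x and y coordinates of the center point of the predicted box in the previous frame, calculated from the coordinates of its top-left and bottom-right corners, and $\hat{c}_{x,n}^{\text{cur}} = (\hat{x}_{1,n}^{\text{cur}} + \hat{x}_{2,n}^{\text{cur}})/2$ and $\hat{c}_{y,n}^{\text{cur}} = (\hat{y}_{1,n}^{\text{cur}} + \hat{y}_{2,n}^{\text{cur}})/2$ are the x and y coordinates of the center point of the predicted box in the current frame, calculated in the same way; $\hat{w}_n^{\text{prev}}$ and $\hat{h}_n^{\text{prev}}$ are the width and height of the predicted box in the previous frame, used as scale factors for velocity normalization and calculated as $\hat{w}_n^{\text{prev}} = \max(\hat{x}_{2,n}^{\text{prev}} - \hat{x}_{1,n}^{\text{prev}}, \varepsilon)$ and $\hat{h}_n^{\text{prev}} = \max(\hat{y}_{2,n}^{\text{prev}} - \hat{y}_{1,n}^{\text{prev}}, \varepsilon)$, where the constant $\varepsilon$ prevents division by zero, is used only for numerical stability, and does not represent a physical quantity; the ground-truth center velocity vector is defined as $\mathbf{v}_n = [(c_{x,n}^{\text{cur}} - c_{x,n}^{\text{prev}}) / w_n^{\text{prev}},\ (c_{y,n}^{\text{cur}} - c_{y,n}^{\text{prev}}) / h_n^{\text{prev}}]^{\top}$, where $c_{x,n}^{\text{prev}}$ and $c_{y,n}^{\text{prev}}$ are the x and y coordinates of the center point of the ground-truth box in the previous frame for the $n$-th sample, $w_n^{\text{prev}}$ and $h_n^{\text{prev}}$ are the width and height of the ground-truth box in the previous frame, and all of these are calculated from the ground-truth corner coordinates in the same way as for the predicted boxes; the velocity consistency loss is then defined as $L_{\text{vel}} = \frac{w_v}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \rho(\hat{v}_{n,d} - v_{n,d})$, where $N$ is the number of samples, $D$ is the velocity vector dimension, with $D = 2$ when the center velocity is used and $D = 8$ when the corner velocity is used; $\hat{v}_{n,d}$ and $v_{n,d}$ are the $d$-th components of the predicted and ground-truth velocity vectors, respectively; $w_v$ is the scaling factor of the velocity loss; and $\rho(\cdot)$ is the robust distance function.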