Two-way feedback collaborative optimization method for pedestrian detection and tracking, medium and device
By constructing a two-way feedback mechanism for detection and tracking in a multi-target tracking system, generating a virtual detection box and verifying its appearance features, the problem of information isolation between the detection and tracking modules is solved, and the continuity and accuracy of the trajectory in occluded scenarios are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
AI Technical Summary
In existing multi-target tracking systems, the lack of information feedback between the detection and tracking modules leads to error accumulation, false detection propagation, and trajectory fragmentation, especially in complex scenarios such as occlusion and changes in lighting conditions, resulting in performance degradation.
A bidirectional feedback collaborative optimization method is adopted, which generates virtual detection boxes through trajectory guidance, verifies the detection boxes by combining them with a lightweight Re-ID network, and feeds back the virtual detection boxes and high-confidence detection boxes to the detector and tracker for optimization, thus constructing a closed-loop feedback mechanism between detection and tracking.
The method maintains trajectory continuity and data association accuracy under occlusion, improves the performance of the detector and the tracker, reduces trajectory interruptions and ID switches, and enhances the robustness and adaptability of the system.
Figure CN122244801A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and more specifically, to a bidirectional feedback collaborative optimization method, medium, and device for pedestrian detection and tracking.
Background Technology
[0002] Existing multi-target tracking (MOT) systems commonly employ a unidirectional serial architecture of "tracking-by-detection." However, this architecture suffers from three major drawbacks in practical applications:
[0003] (i) Error accumulation: Specifically, when the target is occluded, the detector is prone to missing detections, which in turn causes the tracker to be unable to associate with the target, resulting in trajectory interruption. The direct consequence is that the generated trajectory is fragmented and lacks continuity;
[0004] (ii) False Positives Propagation: This manifests as false positives generated by the detector being incorrectly associated by the tracker, thus generating false tracks. This leads to frequent switching of the target's ID (identity identifier), severely impacting tracking accuracy.
[0005] (iii) Module fragmentation: There is a lack of effective information exchange between the detection module and the tracking module, and they operate independently. This prevents the two modules from coordinating and optimizing, limiting the improvement of the overall system performance.
[0006] In summary, the core problem with existing technologies is that information flows only one way between the detection and tracking modules, with no feedback mechanism. The system cannot fully utilize the spatiotemporal consistency information in the video sequence to optimize the two modules, resulting in a sharp performance drop in complex scenarios such as occlusion and changes in lighting.
Summary of the Invention
[0007] To overcome the shortcomings and deficiencies of the existing technology, the present invention aims to provide a bidirectional feedback collaborative optimization method, medium and device for pedestrian detection and tracking; the method can solve the problems of error accumulation, false detection propagation and trajectory fragmentation caused by the unidirectional serial connection between the detector and the tracker in the existing MOT system.
[0008] To achieve the above objectives, the present invention provides a bidirectional feedback collaborative optimization method for pedestrian detection and tracking, comprising the following steps:
[0009] S1. Run the detector on the current frame image $I_t$ to generate an initial set of detection boxes $D_t$; generate virtual detection boxes through trajectory-guided detection completion to obtain an enhanced set of detection boxes $D_t'$; and obtain the validated set of detection boxes $D_t''$ after similarity verification;
[0010] S2. Jointly optimize the detector and the tracker through bidirectional feedback, including:
[0011] feeding the virtual detection boxes $b^v$ generated in step S1 back to the detector as hard positive samples for detector optimization;
[0012] and feeding the high-confidence real detection boxes from the validated set of detection boxes $D_t''$ back to the tracker for tracker updates;
[0013] S3. Using the validated set of detection boxes $D_t''$ and the tracker optimized through bidirectional feedback, perform a second data association, and finally output a continuous and stable set of pedestrian trajectories $T_t$.
[0014] Preferably, step S1 includes:
[0015] Input: the set of tracking trajectories from the previous frame $T_{t-1}$; the current frame image $I_t$; the initial set of detection boxes $D_t$ generated by the detector for the current frame; and the historical appearance feature library of each trajectory. Every box in the initial set $D_t$ is marked as a real detection box;
[0016] Trajectory prediction: for each trajectory $T_i$ in the previous-frame trajectory set $T_{t-1}$, the tracker's Kalman filter is used to predict its position in the current frame, generating a set of prediction boxes $P_t$;
[0017] Missed detection determination and virtual detection box generation: compute the intersection over union (IoU) between each prediction box $p_i$ in the prediction set $P_t$ and the boxes in the initial detection set $D_t$, and compare it with a set first threshold $\tau_1$: if IoU ≤ $\tau_1$, the prediction box $p_i$ is determined to match no real detection box, that is, trajectory $T_i$ from the previous-frame trajectory set has a missed detection; a virtual detection box $b_i^v$ is generated and the enhanced set of detection boxes is $D_t' = D_t \cup \{b_i^v\}$; otherwise $D_t' = D_t$;
[0018] Feature extraction: for each detection box $b_j$ in the enhanced set of detection boxes $D_t'$, use a Re-ID network to extract its appearance feature $f_j$;
[0019] Similarity calculation and false detection filtering: for each detection box, find the potentially associated historical trajectory based on spatial location and obtain its feature library $F_i$ of the most recent $K$ frames; calculate the maximum cosine similarity $s_j$ between $f_j$ and the features in $F_i$; if $s_j$ is below the set second threshold, the box is considered a false detection and is removed;
[0020] Output: the validated set of detection boxes $D_t''$.
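The missed-detection determination and virtual-box generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Box` class, the function names, and the default values `iou_thresh=0.3` and `alpha=0.8` are assumptions chosen for the example, and the virtual-box confidence is taken as a coefficient times the trajectory's historical average confidence.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    kind: str = "real"  # "real" (from the detector) or "virtual" (from a trajectory)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def complete_detections(predictions, detections, mean_scores,
                        iou_thresh=0.3, alpha=0.8):
    """Return the detections plus one virtual box per unmatched prediction.

    predictions: one predicted Box per trajectory (from the motion model).
    mean_scores: historical average confidence of each trajectory.
    iou_thresh and alpha are illustrative values, not taken from the source.
    """
    augmented = list(detections)
    for pred, mean_score in zip(predictions, mean_scores):
        best = max((iou(pred, d) for d in detections), default=0.0)
        if best <= iou_thresh:  # no real box overlaps enough: missed detection
            augmented.append(Box(pred.x1, pred.y1, pred.x2, pred.y2,
                                 alpha * mean_score, "virtual"))
    return augmented
```

Only the original detector output is used when testing for a miss, so a virtual box added for one trajectory never suppresses the virtual box of another.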
[0021] Preferably, in the missed detection determination and virtual detection box generation, the virtual detection box $b_i^v$ is generated as follows: the position and size of $b_i^v$ are taken from the prediction box $p_i$ in the set of prediction boxes $P_t$, and its confidence is assigned as:
[0022] $c_i^v = \alpha \cdot \bar{c}_i$, where $\bar{c}_i$ is the historical average confidence of the trajectory and $\alpha$ is a coefficient;
[0023] In the feature extraction, the appearance feature is $f_j = \Phi(\mathrm{crop}(I_t, b_j))$;
[0024] where $\Phi$ denotes the Re-ID network and $\mathrm{crop}(\cdot)$ denotes the cropping function;
[0025] In the similarity calculation and false detection filtering, different second thresholds are set for real detection boxes and virtual detection boxes, with the second threshold for real detection boxes $\tau_2^{real}$ greater than the second threshold for virtual detection boxes $\tau_2^{virt}$.
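A short sketch of the appearance verification with differentiated thresholds follows. The function names and the threshold values `0.6` and `0.4` are assumptions for illustration; the source only requires that the threshold for real boxes be higher than the one for virtual boxes. Keeping a box when its trajectory has no stored features is also an assumption, since the source does not cover that case.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def keep_detection(feature, gallery, kind,
                   thresh_real=0.6, thresh_virtual=0.4):
    """Return True if the box passes verification against a trajectory's
    recent feature library (gallery).

    kind is "real" or "virtual"; the two thresholds are illustrative values.
    """
    if not gallery:
        return True  # assumption: nothing to compare against, so keep the box
    s_max = max(cosine(feature, g) for g in gallery)
    thresh = thresh_real if kind == "real" else thresh_virtual
    return s_max >= thresh  # below the threshold counts as a false detection
```

With these example thresholds, a box whose best similarity falls between 0.4 and 0.6 is rejected if it came from the detector but accepted if it is a virtual box.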
[0026] Preferably, in step S2, the detector optimization includes the following methods:
[0027] Online hard example mining: during the current training, the classification loss weight of the image regions corresponding to the virtual detection boxes $b^v$ is increased and the detector is updated online, with the loss function adjusted as follows:
[0028] $L_{total} = L_{base} + \lambda \sum_{b^v} L_{cls}(b^v)$, where $L_{total}$ is the total loss function; $L_{base}$ is the basic loss function; $L_{cls}(b^v)$ is the classification loss of a single virtual detection box $b^v$; and $\lambda$ is the balance coefficient;
[0029] Pseudo-label accumulation: the image regions corresponding to high-confidence virtual detection boxes $b^v$ are saved to a hard example library. The hard example library is added to the training set in subsequent offline training to improve the recall rate of the detector in occluded scenes.
[0030] Attention guidance: trajectory confidence is used as a spatial attention weight to guide the detector to perform multi-scale feature fusion in specific regions: $F_{att} = \mathcal{F}(F_{base},\, c_{traj} \cdot F_{ROI})$, where $\mathcal{F}$ is the feature fusion operation; $F_{att}$ is the feature map after attention enhancement; $F_{base}$ is the basic feature map; $F_{ROI}$ is the feature map of the ROI (Region of Interest); and $c_{traj}$ is the trajectory confidence. The trajectory confidence $c_{traj}$ is calculated from the confidence scores of the detection boxes associated with the trajectory in its history, using either a historical average or an exponential moving average, to quantify the stability and reliability of the trajectory; the exponential moving average is:
[0031] $c_{traj}^{t} = \beta\, c_{traj}^{t-1} + (1 - \beta)\, c_{det}^{t}$;
[0032] where $\beta$ is the momentum coefficient and $c_{det}^{t}$ is the confidence of the detection box associated with the trajectory in the current frame.
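The two ways of computing trajectory confidence mentioned above (historical average and exponential moving average) can be illustrated as below. The function names and the default momentum `0.9` are assumptions; the update form used here puts the momentum on the previous value, which is the usual convention.

```python
def historical_average(confidences):
    """Plain mean of the confidences of all detections linked to a trajectory."""
    return sum(confidences) / len(confidences)


def ema_confidence(prev_conf, det_conf, momentum=0.9):
    """One exponential-moving-average step.

    prev_conf: trajectory confidence after the previous frame.
    det_conf: confidence of the detection associated in the current frame.
    momentum: illustrative value; higher means slower adaptation.
    """
    return momentum * prev_conf + (1.0 - momentum) * det_conf


def run_ema(confidences, momentum=0.9):
    """Fold a confidence history into a single EMA value,
    initialised with the first observation."""
    conf = confidences[0]
    for c in confidences[1:]:
        conf = ema_confidence(conf, c, momentum)
    return conf
```

The moving average reacts to a recent drop in detection confidence (for example at the onset of occlusion) faster than the plain mean over a long history, while still smoothing single-frame noise.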
[0033] Preferably, in step S2, the tracker update includes appearance model update and motion model optimization;
[0034] The appearance model update refers to: adding the feature vector $f_{new}$ of a high-confidence real detection box to the appearance feature library of the corresponding trajectory, and updating the trajectory's template feature $f_{tmpl}$ using an exponential moving average:
[0035] $f_{tmpl} \leftarrow \mu\, f_{tmpl} + (1 - \mu)\, f_{new}$, where $\mu$ is the momentum coefficient;
[0036] The motion model optimization refers to: using the true position of the high-confidence detection box to correct the Kalman filter parameters, including adaptively adjusting the process noise covariance matrix $Q$ based on the prediction error over multiple consecutive frames.
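The appearance model update can be sketched as a small class that keeps a bounded feature library and an exponentially averaged template. The class name, the default momentum `0.9`, and the library length `30` are assumptions made for the example.

```python
from collections import deque


class AppearanceModel:
    """Per-trajectory appearance state: recent feature library plus EMA template."""

    def __init__(self, feature, momentum=0.9, max_len=30):
        self.momentum = momentum                    # illustrative value
        self.template = list(feature)               # EMA template feature
        self.gallery = deque([list(feature)], maxlen=max_len)  # most recent features

    def update(self, feature):
        """Fold the feature of a high-confidence real detection into the model."""
        self.gallery.append(list(feature))  # oldest entry is dropped when full
        m = self.momentum
        self.template = [m * t + (1.0 - m) * f
                         for t, f in zip(self.template, feature)]
```

In this sketch only features from high-confidence real detections should be passed to `update`; virtual boxes carry no new appearance evidence and are left out.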
[0037] Preferably, in step S3, the strategy for the second data association is as follows: high-confidence real detection boxes are matched first; medium-confidence real detection boxes are matched normally; virtual detection boxes are used only when there are no other matches and the intersection-over-union (IoU) or similarity conditions are met.
[0038] The final output set of pedestrian trajectories $T_t$ includes, for each trajectory, the trajectory ID, current location, historical trajectory points, appearance feature template, and motion state estimation.
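The tiered matching order of the second data association can be sketched as follows. This is a simplified illustration under stated assumptions: matching within each tier is greedy by IoU (the source does not name a matching algorithm), only the IoU condition is checked for virtual boxes, and the values `high=0.7` and `iou_min=0.3` as well as the dictionary layout of a detection are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, high=0.7, iou_min=0.3):
    """Match tracks to detections in three priority tiers.

    tracks: dict mapping integer track id -> predicted box.
    detections: list of dicts with keys "box", "score", "kind".
    Returns a dict mapping track id -> matched detection.
    """
    tiers = [
        [d for d in detections if d["kind"] == "real" and d["score"] >= high],
        [d for d in detections if d["kind"] == "real" and d["score"] < high],
        [d for d in detections if d["kind"] == "virtual"],  # last resort
    ]
    matches, free = {}, set(tracks)
    for tier in tiers:
        pairs = sorted(
            ((iou(tracks[t], d["box"]), t, i)
             for t in free for i, d in enumerate(tier)),
            reverse=True,
        )
        used = set()
        for overlap, t, i in pairs:
            if overlap < iou_min:
                break  # remaining pairs overlap even less
            if t in free and i not in used:
                matches[t] = tier[i]
                free.discard(t)
                used.add(i)
    return matches
```

Because virtual boxes are only offered to tracks that remain unmatched after both real tiers, a real observation always takes precedence over a motion-model guess for the same track.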
[0039] A readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the bidirectional feedback collaborative optimization method for pedestrian detection and tracking.
[0040] A computer device includes a processor and a memory for storing a processor-executable program, wherein when the processor executes the program stored in the memory, it implements the bidirectional feedback collaborative optimization method for pedestrian detection and tracking.
[0041] Compared with existing technologies, this invention constructs a two-way information feedback closed loop between detection and tracking, which has the following advantages and beneficial effects:
[0042] 1. A trajectory-guided detection completion and verification mechanism to resolve issues of missed detections due to occlusion and false detections due to environmental factors:
[0043] This mechanism integrates completion and verification. First, it uses the tracker to generate "virtual detection boxes" based on trajectory positions predicted by the motion model to complete the detection of pedestrians missed due to occlusion. Then, a lightweight Re-ID network (pedestrian re-identification network) is introduced to extract appearance features from all detection boxes (including real and virtual boxes) and compare them with the historical feature database of the corresponding trajectory to eliminate false detections. The key aspects are: the rule for determining the timing of virtual detection box generation (based on IoU between predicted and detected boxes to determine missed detections), the confidence assignment method (based on historical confidence decay of the trajectory), and the differentiated appearance similarity thresholds for real and virtual detection boxes. Through this mechanism, the continuity of the trajectory under occlusion is maintained while ensuring the accuracy of data association.
[0044] 2. A two-way feedback co-evolutionary mechanism for detection and tracking to achieve closed-loop optimization:
[0045] This is the core of the invention, specifically encompassing feedback in two directions:
[0046] Feedback from tracking to detection: the validated "virtual detection boxes" are fed back to the detector as "hard examples". The detector can be optimized online through one or a combination of online hard example mining, pseudo-label accumulation, and attention guidance, so that it performs better in occluded scenes.
[0047] Feedback from detection to tracking: high-confidence real detection boxes are fed back to the tracker to update the appearance feature template of the trajectory (by exponential moving average) and to adaptively correct the motion model parameters (for example, adjusting the noise covariance matrix of the Kalman filter according to the prediction error), thereby improving the accuracy and adaptability of tracking.
[0048] Through the aforementioned closed-loop mechanism, this invention enables the detector and tracker to mutually promote and co-evolve during the online inference process, fundamentally solving the shortcomings of the traditional unidirectional architecture.
Attached Figure Description
[0049] Figure 1 is a flowchart of the bidirectional feedback collaborative optimization method for pedestrian detection and tracking of the present invention;
[0050] Figure 2 is a schematic diagram of the detector optimization process of the present invention;
[0051] Figure 3 is a comparison chart of the experimental results of the present invention and the DeepSORT method.
Detailed Implementation
[0052] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.
[0053] Embodiment 1
[0054] This embodiment presents a bidirectional feedback collaborative optimization method for pedestrian detection and tracking which, as shown in Figure 1, includes the following steps:
[0055] S1. Run the detector on the current frame image $I_t$ to generate an initial set of detection boxes $D_t$; generate virtual detection boxes through trajectory-guided detection completion to obtain an enhanced set of detection boxes $D_t'$; and obtain the validated set of detection boxes $D_t''$ after similarity verification.
[0056] Specifically, it includes:
[0057] Input: the set of tracking trajectories from the previous frame $T_{t-1}$; the current frame image $I_t$; the initial set of detection boxes $D_t$ generated by the detector for the current frame; and the historical appearance feature library of each trajectory. Every box in the initial set $D_t$ is marked as a real detection box;
[0058] Trajectory prediction: for each trajectory $T_i$ in the previous-frame trajectory set $T_{t-1}$, the Kalman filter of the tracker is used to predict its position in the current frame, generating a set of prediction boxes $P_t$. The state vector is defined as $x = [c_x, c_y, w, h, \dot{c}_x, \dot{c}_y, \dot{w}, \dot{h}]^T$, where $(c_x, c_y)$ are the coordinates of the center of the target bounding box, $w$ and $h$ are the width and height of the target bounding box, and $\dot{c}_x, \dot{c}_y, \dot{w}, \dot{h}$ are the corresponding velocity components. The prediction step is $\hat{x}_{t|t-1} = F\, \hat{x}_{t-1|t-1}$, where $F$ is the state transition matrix of a constant-velocity (uniform motion) model, $\hat{x}_{t-1|t-1}$ is the posterior state estimate of the previous frame, and $\hat{x}_{t|t-1}$ is the prior state estimate of the current frame obtained from the prediction;
[0059] Missed detection determination and virtual detection box generation: compute the intersection over union (IoU) between each prediction box $p_i$ in the prediction set $P_t$ and the boxes in the initial detection set $D_t$, and compare it with a set first threshold $\tau_1$:
[0060] If IoU ≤ $\tau_1$, the prediction box $p_i$ in the prediction set $P_t$ is determined to match no real detection box, that is, trajectory $T_i$ from the previous-frame trajectory set has a missed detection; a virtual detection box $b_i^v$ is generated and the enhanced set of detection boxes is $D_t' = D_t \cup \{b_i^v\}$. The virtual detection box $b_i^v$ is generated as follows: its position and size are taken from the prediction box $p_i$, its confidence is assigned as $c_i^v = \alpha \cdot \bar{c}_i$ (where $\alpha$ is a coefficient and $\bar{c}_i$ is the historical average confidence of the trajectory), and its type is marked as "virtual"; if IoU > $\tau_1$, then $D_t' = D_t$;
[0061] Feature extraction: for each detection box $b_j$ in the enhanced set of detection boxes $D_t'$, extract a 256-dimensional appearance feature using the lightweight OSNet Re-ID network: $f_j = \Phi(\mathrm{crop}(I_t, b_j))$, where $\Phi$ denotes the Re-ID network and $\mathrm{crop}(\cdot)$ denotes the cropping function;
[0062] Similarity calculation and false detection filtering: for each detection box, find the potentially associated historical trajectory based on spatial location and obtain its feature library $F_i$ of the most recent $K$ frames; calculate the maximum cosine similarity $s_j$ between $f_j$ and the features in $F_i$;
[0063] Set the second thresholds: $\tau_2^{real}$ for real detection boxes and $\tau_2^{virt}$ for virtual detection boxes; if $s_j$ is below the corresponding second threshold, the box is considered a false detection and is removed;
[0064] Output: the validated set of detection boxes $D_t''$.
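The trajectory prediction sub-step of S1 can be illustrated with a constant-velocity transition applied to an 8-dimensional state (center, width, height, and their velocities). This sketch covers only the state prediction; the covariance prediction of a full Kalman filter is omitted. The function names and the unit time step are assumptions.

```python
def transition_matrix(dt=1.0, dim=4):
    """State transition matrix of a constant-velocity model for a state
    laid out as [cx, cy, w, h, vcx, vcy, vw, vh]."""
    n = 2 * dim
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(dim):
        F[i][i + dim] = dt  # each position component advances by velocity * dt
    return F


def predict(F, x):
    """Prior state estimate for the current frame: F times the previous posterior."""
    return [sum(F[i][j] * x[j] for j in range(len(x))) for i in range(len(F))]


def state_to_box(x):
    """Convert the (cx, cy, w, h) part of a state to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = x[:4]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Usage: a target at (100, 50), 20 x 40 pixels, moving 2 px right and 1 px up per frame.
F = transition_matrix()
prior = predict(F, [100.0, 50.0, 20.0, 40.0, 2.0, -1.0, 0.0, 0.0])
box = state_to_box(prior)
```

The resulting box is what gets compared against the detector output by IoU, and what a virtual detection box inherits its position and size from when no detection matches.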
[0065] S2. Jointly optimize the detector and the tracker through bidirectional feedback, including:
[0066] feeding the virtual detection boxes $b^v$ generated in step S1 back to the detector as hard positive samples for detector optimization;
[0067] and feeding the high-confidence real detection boxes from the validated set of detection boxes $D_t''$ back to the tracker for tracker updates.
[0068] Detector optimization, as shown in Figure 2, includes the following methods:
[0069] Online hard example mining: during the current training, the classification loss weight of the image regions corresponding to the virtual detection boxes $b^v$ is increased and the detector is updated online, with the loss function adjusted as follows:
[0070] $L_{total} = L_{base} + \lambda \sum_{b^v} L_{cls}(b^v)$, where $L_{total}$ is the total loss function; $L_{base}$ is the basic loss function; $L_{cls}(b^v)$ is the classification loss of a single virtual detection box $b^v$; and $\lambda$ is the balance coefficient;
[0071] Pseudo-label accumulation: the image regions corresponding to high-confidence virtual detection boxes $b^v$ are saved to a hard example library. The hard example library is added to the training set in subsequent offline training to improve the recall rate of the detector in occluded scenes.
[0072] Attention guidance: trajectory confidence is used as a spatial attention weight to guide the detector to perform multi-scale feature fusion in specific regions: $F_{att} = \mathcal{F}(F_{base},\, c_{traj} \cdot F_{ROI})$, where $\mathcal{F}$ is the feature fusion operation; $F_{att}$ is the feature map after attention enhancement; $F_{base}$ is the basic feature map; $F_{ROI}$ is the feature map of the ROI (Region of Interest); and $c_{traj}$ is the trajectory confidence. The trajectory confidence $c_{traj}$ is calculated from the confidence scores of the detection boxes associated with the trajectory in its history, using either a historical average or an exponential moving average, to quantify the stability and reliability of the trajectory. Specifically, the exponential moving average update is:
[0073] $c_{traj}^{t} = \beta\, c_{traj}^{t-1} + (1 - \beta)\, c_{det}^{t}$;
[0074] where $\beta$ is the momentum coefficient and $c_{det}^{t}$ is the confidence of the detection box associated with the trajectory in the current frame.
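The loss adjustment used for online hard example mining can be sketched numerically as below. The binary cross-entropy form of the per-box classification loss, the function names, and the default balance coefficient `0.5` are assumptions for illustration; the source only states that a weighted classification term for virtual boxes is added to the basic loss.

```python
import math


def positive_cross_entropy(p_pedestrian):
    """Classification loss of one virtual box treated as a positive sample,
    given the detector's predicted pedestrian probability for that region."""
    return -math.log(max(p_pedestrian, 1e-12))  # clamp to avoid log(0)


def total_loss(base_loss, virtual_probs, balance=0.5):
    """Basic detector loss plus a weighted classification term summed over
    the virtual boxes (hard positives) of the current frame.

    balance is an illustrative value for the balance coefficient.
    """
    extra = sum(positive_cross_entropy(p) for p in virtual_probs)
    return base_loss + balance * extra
```

A virtual box that the detector already scores near 1 adds almost nothing, while an occluded pedestrian scored near 0 contributes a large term, which is what steers the update toward the regions the detector currently misses.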
[0075] Tracker updates include appearance model updates and motion model optimizations;
[0076] The appearance model update refers to: adding the feature vector $f_{new}$ of a high-confidence real detection box to the appearance feature library of the corresponding trajectory, and updating the trajectory's template feature $f_{tmpl}$ using an exponential moving average (EMA):
[0077] $f_{tmpl} \leftarrow \mu\, f_{tmpl} + (1 - \mu)\, f_{new}$, where $\mu$ is the momentum coefficient;
[0078] The motion model optimization refers to: using the true position of the high-confidence detection box to correct the Kalman filter parameters, including adaptively adjusting the process noise covariance matrix $Q$ based on the prediction error over multiple consecutive frames. If the prediction error is small for multiple consecutive frames (e.g., less than 5 pixels for 3 consecutive frames), $Q$ is reduced (increasing confidence in the motion model); if the prediction error is large for multiple consecutive frames (e.g., greater than 20 pixels for 3 consecutive frames), $Q$ is increased (increasing the uncertainty of the motion model so that the filter relies more on observations).
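The adaptive process-noise rule can be sketched as a scalar multiplier on a base process noise covariance. The 3-frame window and the 5 and 20 pixel limits follow the examples in the text; the class name, the multiplicative factors `0.9` and `1.1`, the use of center distance as the prediction error, and clearing the window after each adjustment are assumptions.

```python
import math
from collections import deque


class AdaptiveProcessNoise:
    """Tracks recent prediction errors and rescales the process noise."""

    def __init__(self, window=3, low=5.0, high=20.0, shrink=0.9, grow=1.1):
        self.q_scale = 1.0          # multiplier applied to the base covariance
        self.low, self.high = low, high
        self.shrink, self.grow = shrink, grow   # illustrative factors
        self.errors = deque(maxlen=window)

    def update(self, predicted_center, observed_center):
        """Record one prediction error (in pixels) and adjust the scale."""
        err = math.hypot(predicted_center[0] - observed_center[0],
                         predicted_center[1] - observed_center[1])
        self.errors.append(err)
        if len(self.errors) == self.errors.maxlen:
            if all(e < self.low for e in self.errors):
                self.q_scale *= self.shrink   # trust the motion model more
                self.errors.clear()
            elif all(e > self.high for e in self.errors):
                self.q_scale *= self.grow     # rely more on observations
                self.errors.clear()
        return self.q_scale
```

Only high-confidence real detections should supply `observed_center`, since the rule treats the observation as the true position.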
[0079] S3. Using the validated set of detection boxes $D_t''$ and the tracker optimized through bidirectional feedback, perform a second data association. The strategy for this second data association is: high-confidence real detection boxes are matched first; medium-confidence real detection boxes are matched normally; virtual detection boxes are used only when there are no other matches and the intersection over union (IoU) or similarity conditions are met. The final output is a continuous and stable set of pedestrian trajectories $T_t$, which includes, for each trajectory, the trajectory ID, current location, historical trajectory points, appearance feature template, and motion state estimation.
[0080] Compared with the prior art, the present invention has the following significant advantages and positive effects:
[0081] (1) A trajectory-guided detection completion and verification mechanism that resolves occlusion-induced missed detections and environment-induced false detections: by generating "virtual detection boxes" to actively complete detections missed due to occlusion, and by using Re-ID verification to eliminate false detections, trajectory interruptions and ID switches are effectively reduced. As shown in Figure 3, tests on the occlusion subset of the MOT17 dataset show that the number of trajectory interruptions decreased from 127 for the baseline DeepSORT to 89 (a reduction of 29.9%), and trajectory integrity improved from 68.3% to 85.7% (a relative improvement of 25.5%).
[0082] (2) A two-way feedback co-evolution mechanism of detection and tracking that achieves closed-loop optimization: this invention realizes for the first time the co-evolution of detector and tracker during online inference. The tracker supplies occlusion hard examples to the detector (which can be optimized through several feedback methods in parallel), making the detector "smarter with use"; the reliable output of the detector, in turn, makes the tracker "more accurate with use", forming a virtuous cycle of continuous self-optimization that significantly enhances the overall robustness of the system.
[0083] (3) High versatility: The method of the present invention does not depend on a specific detector or tracker and has good versatility and portability.
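The two percentages quoted in advantage (1) can be checked directly from the reported figures. The check below also makes explicit that the 25.5% figure is a gain relative to the baseline value, not a difference in percentage points (which would be 17.4 points).

$$\frac{127 - 89}{127} = \frac{38}{127} \approx 29.9\%, \qquad \frac{85.7 - 68.3}{68.3} = \frac{17.4}{68.3} \approx 25.5\%$$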
[0084] Embodiment 2
[0085] This embodiment provides a readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the bidirectional feedback collaborative optimization method for pedestrian detection and tracking as described in Embodiment 1.
[0086] Embodiment 3
[0087] This embodiment discloses a computer device, including a processor and a memory for storing processor-executable programs. When the processor executes the program stored in the memory, it implements the bidirectional feedback collaborative optimization method for pedestrian detection and tracking described in Embodiment 1.
[0088] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A bidirectional feedback collaborative optimization method for pedestrian detection and tracking, characterized in that it includes the following steps: S1. running the detector on the current frame image $I_t$ to generate an initial set of detection boxes $D_t$; generating virtual detection boxes through trajectory-guided detection completion to obtain an enhanced set of detection boxes $D_t'$; and obtaining the validated set of detection boxes $D_t''$ after similarity verification; S2. jointly optimizing the detector and the tracker through bidirectional feedback, including: feeding the virtual detection boxes $b^v$ generated in step S1 back to the detector as hard positive samples for detector optimization; and feeding the high-confidence real detection boxes from the validated set of detection boxes $D_t''$ back to the tracker for tracker updates; S3. using the validated set of detection boxes $D_t''$ and the tracker optimized through bidirectional feedback, performing a second data association, and finally outputting a continuous and stable set of pedestrian trajectories $T_t$.
2. The bidirectional feedback collaborative optimization method for pedestrian detection and tracking according to claim 1, characterized in that step S1 includes: Input: the set of tracking trajectories from the previous frame $T_{t-1}$; the current frame image $I_t$; the initial set of detection boxes $D_t$ generated by the detector for the current frame; and the historical appearance feature library of each trajectory; every box in the initial set $D_t$ is marked as a real detection box; Trajectory prediction: for each trajectory $T_i$ in the previous-frame trajectory set $T_{t-1}$, the tracker's Kalman filter is used to predict its position in the current frame, generating a set of prediction boxes $P_t$; Missed detection determination and virtual detection box generation: computing the intersection over union (IoU) between each prediction box $p_i$ in the prediction set $P_t$ and the boxes in the initial detection set $D_t$, and comparing it with a set first threshold $\tau_1$: if IoU ≤ $\tau_1$, the prediction box $p_i$ is determined to match no real detection box, that is, trajectory $T_i$ from the previous-frame trajectory set has a missed detection; a virtual detection box $b_i^v$ is generated and the enhanced set of detection boxes is $D_t' = D_t \cup \{b_i^v\}$; otherwise $D_t' = D_t$; Feature extraction: for each detection box $b_j$ in the enhanced set of detection boxes $D_t'$, using a Re-ID network to extract its appearance feature $f_j$; Similarity calculation and false detection filtering: for each detection box, finding the potentially associated historical trajectory based on spatial location and obtaining its feature library $F_i$ of the most recent $K$ frames; calculating the maximum cosine similarity $s_j$ between $f_j$ and the features in $F_i$; if $s_j$ is below the set second threshold, the box is considered a false detection and is removed; Output: the validated set of detection boxes $D_t''$.
3. The bidirectional feedback collaborative optimization method for pedestrian detection and tracking according to claim 2, characterized in that: in the missed detection determination and virtual detection box generation, the virtual detection box $b_i^v$ is generated as follows: the position and size of $b_i^v$ are taken from the prediction box $p_i$ in the set of prediction boxes $P_t$, and its confidence is assigned as $c_i^v = \alpha \cdot \bar{c}_i$, where $\bar{c}_i$ is the historical average confidence of the trajectory and $\alpha$ is a coefficient; in the feature extraction, the appearance feature is $f_j = \Phi(\mathrm{crop}(I_t, b_j))$, where $\Phi$ denotes the Re-ID network and $\mathrm{crop}(\cdot)$ denotes the cropping function; in the similarity calculation and false detection filtering, different second thresholds are set for real detection boxes and virtual detection boxes, with the second threshold for real detection boxes $\tau_2^{real}$ greater than the second threshold for virtual detection boxes $\tau_2^{virt}$.
4. The bidirectional feedback collaborative optimization method for pedestrian detection and tracking according to claim 1, characterized in that: in step S2, detector optimization includes the following methods: Online hard example mining: during the current training, the classification loss weight of the image regions corresponding to the virtual detection boxes $b^v$ is increased and the detector is updated online, with the loss function adjusted as $L_{total} = L_{base} + \lambda \sum_{b^v} L_{cls}(b^v)$, where $L_{total}$ is the total loss function; $L_{base}$ is the basic loss function; $L_{cls}(b^v)$ is the classification loss of a single virtual detection box $b^v$; and $\lambda$ is the balance coefficient; Pseudo-label accumulation: the image regions corresponding to high-confidence virtual detection boxes $b^v$ are saved to a hard example library, and the hard example library is added to the training set in subsequent offline training to improve the recall rate of the detector in occluded scenes; Attention guidance: trajectory confidence is used as a spatial attention weight to guide the detector to perform multi-scale feature fusion in specific regions: $F_{att} = \mathcal{F}(F_{base},\, c_{traj} \cdot F_{ROI})$, where $\mathcal{F}$ is the feature fusion operation; $F_{att}$ is the feature map after attention enhancement; $F_{base}$ is the basic feature map; $F_{ROI}$ is the ROI feature map; and $c_{traj}$ is the trajectory confidence; the trajectory confidence $c_{traj}$ is calculated from the confidence scores of the detection boxes associated with the trajectory in its history, using either a historical average or an exponential moving average, to quantify the stability and reliability of the trajectory; the exponential moving average is $c_{traj}^{t} = \beta\, c_{traj}^{t-1} + (1 - \beta)\, c_{det}^{t}$, where $\beta$ is the momentum coefficient and $c_{det}^{t}$ is the confidence of the detection box associated with the trajectory in the current frame.
5. The bidirectional feedback collaborative optimization method for pedestrian detection and tracking according to claim 2, characterized in that: in step S2, the tracker update includes appearance model update and motion model optimization; the appearance model update refers to: adding the feature vector $f_{new}$ of a high-confidence real detection box to the appearance feature library of the corresponding trajectory, and updating the trajectory's template feature $f_{tmpl}$ using an exponential moving average: $f_{tmpl} \leftarrow \mu\, f_{tmpl} + (1 - \mu)\, f_{new}$, where $\mu$ is the momentum coefficient; the motion model optimization refers to: using the true position of the high-confidence detection box to correct the Kalman filter parameters, including adaptively adjusting the process noise covariance matrix $Q$ based on the prediction error over multiple consecutive frames.
6. The bidirectional feedback collaborative optimization method for pedestrian detection and tracking according to claim 1, characterized in that: in step S3, the strategy for the second data association is as follows: high-confidence real detection boxes are matched first; medium-confidence real detection boxes are matched normally; virtual detection boxes are used only when there are no other matches and the intersection over union (IoU) or similarity conditions are met; the final output set of pedestrian trajectories $T_t$ includes, for each trajectory, the trajectory ID, current location, historical trajectory points, appearance feature template, and motion state estimation.
7. A readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, causes the processor to perform the bidirectional feedback collaborative optimization method for pedestrian detection and tracking as described in any one of claims 1-6.
8. A computer device comprising a processor and a memory for storing a processor-executable program, characterized in that, when the processor executes the program stored in the memory, it implements the bidirectional feedback collaborative optimization method for pedestrian detection and tracking as described in any one of claims 1-6.