A station dense pedestrian tracking system and method based on weak spatial cues
The station dense pedestrian tracking system based on spatial weak cues utilizes a lightweight spatiotemporal attention mechanism and a modified Kalman filter module, combined with weak cues, to achieve efficient pedestrian tracking, solving the detection and tracking challenges in dense pedestrian environments and realizing high-precision and fast pedestrian analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2024-02-01
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional methods struggle to efficiently track pedestrians in densely populated environments like subway stations, especially when occlusion and clustering occur, leading to detection and tracking failures. Existing technologies lack effective utilization of cross-frame information and spatial weak cue enhancement.
A station-based dense pedestrian tracking system based on weak spatial cues is adopted. It extracts target features through a lightweight spatiotemporal attention mechanism, and combines modified Kalman filtering and trajectory management modules. It uses weak cues such as trajectory confidence, mixed intersection-over-union ratio, and velocity direction to perform high-confidence and low-confidence correlation to achieve accurate pedestrian tracking.
It improves the accuracy and speed of pedestrian tracking, enabling real-time detection and analysis of dense crowds in complex environments, and enhances the monitoring capabilities for subway station safety.
Figure CN118037774B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of multi-target tracking (MOT) technology, specifically relating to a dense pedestrian tracking system and method for stations based on weak spatial cues.
Background Technology
[0002] Target tracking is a crucial research area in computer vision, particularly in surveillance video and security systems, where it plays a key role. Video contains a wealth of information, making the extraction of effective features essential. Subway stations handle massive daily passenger flows, characterized by high population mobility and density, frequently leading to safety hazards such as pushing and trampling, potentially causing personal injury and property damage. Traditional methods, relying on visual inspection, manual labor, or mechanical means to count pedestrians, struggle to create an efficient system.
[0003] Subway stations present a unique real-world scenario characterized by extremely dense pedestrian traffic. Relying solely on manual observation or mechanical methods for pedestrian counting has significant limitations. Furthermore, when high detection accuracy is required, manual and mechanical methods alone cannot efficiently accomplish the task of tracking dense pedestrians. Recent methods often employ a combination of deep learning and mathematical algorithms to address this task. However, when occlusion, clustering, or motion blur occur in the video sequence, the number of missing detections and extremely low-scoring detections increases, and target tracking predictions may fail. Therefore, utilizing cross-frame information to enhance video detection performance and incorporating weak spatial cues during tracking are essential.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention proposes a system and method for tracking dense pedestrians at stations based on weak spatial cues. By connecting weak spatial cues and adding a lightweight spatiotemporal attention mechanism, the spatial motion features of the target are extracted, enabling the tracking and analysis of dense pedestrians at stations. This method is simple and efficient.
[0005] A dense pedestrian tracking system for stations based on weak spatial cues includes a target detection module, a high-confidence association module, a low-confidence association module, and a trajectory management module. The target detection module includes a feature extraction layer, a feature fusion layer, and an output layer. The high-low confidence matching module includes a high-low confidence association module, a modified Kalman filter module, and a trajectory management module.
[0006] First, the image frame F_t is passed through the target detection module, in which a backbone network composed of LSA blocks replaces the output of the original YOLOX backbone network, to obtain the model's predicted detection box set Z_t. Then Z_t is divided into two classes using a preset high threshold τ_high and low threshold τ_low, giving the high-confidence detection boxes and the low-confidence detection boxes. Next, the high-confidence association module uses a modified Kalman filter to predict the current-frame tracking boxes from the tracking boxes in the previous frame's trajectory set Γ_{t-1}, performs the first spatial similarity match between the predicted tracking boxes and the high-confidence detection boxes, and calculates the association cost. This yields the first set of successfully matched trajectories, the trajectories that failed to match, and the detection boxes that did not match any trajectory. The low-confidence association module then uses the modified Kalman filter to predict the current-frame tracking boxes for the remaining trajectory set, performs a second spatial similarity match between these tracking boxes and the low-confidence detection boxes, and calculates the association cost. The second set of successfully matched trajectories is merged with the first. Finally, the trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectories, completing the tracking of dense pedestrian traffic at the station.
[0007] A method for tracking dense pedestrians at train stations based on weak spatial cues is proposed, which is implemented based on the aforementioned system for tracking dense pedestrians at train stations based on weak spatial cues. The specific steps are as follows:
[0008] Step 1: Acquire each pedestrian image frame from the surveillance video. For each input pedestrian image frame F_t (t = 1, ..., T), the target detection module produces the set of predicted detection boxes Z_t.
[0009] The object detection module introduces a lightweight spatiotemporal attention (LSA) block into the YOLOX model; the replaced backbone network is named LSA. It uses temporal information to compensate for insufficient pedestrian feature extraction caused by occlusion in the current frame, so that the feature network in YOLOX extracts more discriminative multi-scale feature maps. The resulting pedestrian bounding boxes are more accurate, which benefits tracking. The backbone network composed of LSA blocks replaces the output part of the original YOLOX backbone network to obtain the model's predicted detection box set Z_t. Step 1 proceeds as follows:
[0010] Step 1.1: The image frame F_t is passed through the backbone network composed of LSA blocks to obtain feature maps at multiple scales. Each feature map at a given scale is divided into l image patch feature embedding vectors. A spatiotemporal cross-attention operation is performed between the image patch embeddings of the current frame and the image embeddings at the same scale from the previous n frames, giving the spatiotemporally enhanced l-th spatiotemporal information. This spatiotemporal information is normalized with LN. For the i-th attention head head_i (i = 1, 2, ..., h) of LSA, the normalized result is multiplied by the query weight matrix W_i^Q, the key weight matrix W_i^K, and the value weight matrix W_i^V, respectively, to obtain the corresponding matrices Q, K, and V; these matrices are then used to compute each attention head head_i. The formulas are as follows:
[0011]
[0012]
[0013]
[0014] ε is the relative position offset matrix;
[0015] The attention heads head_i (i = 1, 2, ..., h) are concatenated to obtain the multi-head self-attention features; the formula also contains a scaling factor;
[0016]
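The attention computation described in Step 1.1 can be sketched in a few lines of plain Python. The formulas themselves are not reproduced in the text, so this sketch assumes the common scaled dot-product form softmax(Q K^T / sqrt(d) + ε) V, with ε as an additive relative position term and 1 / sqrt(d) as the scaling factor. The function names attention_head and concat_heads are hypothetical, and the patent's actual formulas may differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Q, K, V, bias):
    """One head: softmax(Q K^T / sqrt(d) + bias) V (assumed form).

    Q, K, V are lists of row vectors; bias[i][j] plays the role of the
    relative position offset between query i and key j.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)  # assumed scaling factor
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) * scale + bias[i][j]
                  for j, k in enumerate(K)]
        weights = softmax(scores)
        out.append([sum(weights[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def concat_heads(heads):
    """Concatenate per-head outputs along the feature dimension."""
    return [sum((h[i] for h in heads), []) for i in range(len(heads[0]))]
```

In a cross-frame setting, Q would come from the current frame's patch embeddings and K, V from the embeddings of earlier frames; that wiring is also an assumption here.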
[0017] W-MSA and SW-MSA are combined with existing techniques such as MLP and LN to transform the (l-1)-th spatiotemporal information of the current t-th frame into the (l+1)-th spatiotemporal information of the current frame. Specifically:
[0018] After obtaining multi-head self-attention features, to further enhance spatial features while avoiding excessive computation, a cascaded window multi-head self-attention (W-MSA) and sliding window multi-head self-attention (SW-MSA) method is adopted. Compared with the global self-attention method (MSA), window multi-head self-attention (W-MSA) reduces computational complexity. However, since each window does not overlap, information cannot be exchanged between adjacent windows. Therefore, the sliding window multi-head self-attention (SW-MSA) method is proposed, which enables information exchange between two adjacent windows and cross-window association between upper and lower layers, thereby indirectly achieving the ability of global spatial modeling. Furthermore, combined with multilayer perceptron (MLP) and normalized processing (LN), the specific formula for LSA is as follows:
[0019]
[0020]
[0021]
[0022]
[0023] The output of the LSA block is the output feature at the corresponding downsampling multiple.
[0024] Step 1.2: High-level features are obtained through the LSA feature extraction layer. The output of the original YOLOX backbone network is replaced with LSA, and the three effective feature layers of LSA, downsampled by factors of 8, 16, and 32, are extracted. These correspond to the three effective feature layers of the original backbone network, which are likewise downsampled by factors of 8, 16, and 32, completing the extraction of high-level features.
[0025] Step 1.3: The high-level features B_t pass through the Path Aggregation Feature Pyramid Network (PAFPN) feature fusion layer and are further processed by a fully connected layer and the SiLU activation function. B_t is fused through upsampling and downsampling to obtain the predicted feature maps, and the layer finally outputs a tuple D_t consisting of three feature layers. The PAFPN feature layers are analogous to the levels of an image pyramid, so RoIs of different scales are mapped to the corresponding feature layers.
[0026] Step 1.4: The output layer uses the anchor-free decoupled head of the YOLOX model. The feature map tuple D_t passes through the output head to give the final prediction result Z_t, which contains three parts of prediction information: cls_output, which predicts the category and score of the target box; obj_output, which determines whether the target box is foreground or background; and reg_output, which predicts the coordinates of the target box. The loss function is calculated as follows:
[0027]
[0028] L_cls is the classification loss, L_obj is the objectness loss, L_reg is the localization loss, N_pos is the number of positive samples, and μ is the balance coefficient of the localization loss.
[0029] Step 2: Using a preset high threshold τ_high and low threshold τ_low, divide Z_t into two classes to obtain the high-confidence detection boxes and the low-confidence detection boxes.
[0030] Z_t is divided into two classes using the preset high threshold τ_high and low threshold τ_low. A detection box in Z_t whose confidence is greater than τ_high is a high-confidence detection box. A detection box in Z_t whose confidence is greater than τ_low and less than τ_high is selected as a low-confidence detection box when unmatched remaining detection boxes exist. T is the number of image frames and N is the number of detection boxes;
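The two-threshold split in Step 2 can be sketched as follows. The dictionary layout of a detection, the function name split_by_confidence, and the example threshold values are assumptions made for illustration. The text does not say how a score exactly equal to τ_high is handled, so this sketch places it in the low-confidence group.

```python
def split_by_confidence(detections, tau_high, tau_low):
    """Split detections into high- and low-confidence groups.

    detections: list of dicts with a "score" key (hypothetical layout).
    Scores at or below tau_low are discarded.
    """
    high = [d for d in detections if d["score"] > tau_high]
    # Assumption: a score equal to tau_high falls in the low group.
    low = [d for d in detections if tau_low < d["score"] <= tau_high]
    return high, low

# Illustrative thresholds only; real values depend on the scene.
dets = [{"score": 0.9}, {"score": 0.6}, {"score": 0.3}, {"score": 0.1}]
high_boxes, low_boxes = split_by_confidence(dets, tau_high=0.6, tau_low=0.2)
```

Keeping the low-confidence group instead of discarding it is what allows the second association pass to recover partially occluded pedestrians.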
[0031] Step 3: The high-confidence association module uses the modified Kalman filter to predict the current-frame tracking boxes from the tracking boxes in the previous frame's trajectory set Γ_{t-1}. The first spatial similarity match is performed between the predicted tracking boxes and the high-confidence detection boxes, and the association cost is calculated, giving the first set of successfully matched trajectories, the trajectories that failed to match, and the detection boxes that did not match any trajectory.
[0032] The high-confidence association module introduces three weak spatial cues, namely the trajectory confidence state, the mixed intersection-over-union ratio (MIoU), and the velocity direction, to perform the first spatial similarity matching and thereby handle high-confidence association during target tracking. A Modified Kalman Filter (MKF) model performs state prediction on the tracking boxes from the previous frame to obtain the tracking boxes for the current frame. The first spatial similarity matching is then performed between these tracking boxes and the high-confidence detection boxes using the weak spatial cues, and the Hungarian algorithm is used for linear assignment to find the optimal association cost, which is combined with the MKF for linear assignment in the next frame. The successfully matched trajectories, the trajectories that failed to match, and the detection boxes not assigned to any trajectory are each recorded as separate sets.
[0033] Step 3.1: Use the MKF model to perform state prediction on the previous-frame tracking boxes and obtain the tracking boxes for the current frame. To better reflect the temporal continuity of the confidence state of the same trajectory, a weak cue is introduced into the MKF, namely the trajectory confidence state c(t) and its velocity component c′(t).
[0034] The trajectory confidence is denoted c_trk, and c(t) is the state that represents c_trk; a higher value of c(t) indicates greater confidence. When the MKF is used, the trajectory confidence state cost C_Conf is calculated as the absolute difference between the estimated trajectory confidence and the detection confidence c_det. The formula for linear modeling of the trajectory confidence is as follows:
[0035]
[0036] To make the detection box include a more complete human image, the original aspect ratio component of the Kalman filter is modified to include width w and height h components, and c(t) is introduced; the 10-tuple state vector, 5-tuple measurement vector, and noise parameters of the MKF are as follows:
[0037] x_t = [u(t), v(t), w(t), h(t), c(t), u′(t), v′(t), w′(t), h′(t), c′(t)]^T
[0038] y_t = [y_u(t), y_v(t), y_w(t), y_h(t), y_c(t)]^T
[0039]
[0040]
[0041] Q_t is the process noise covariance matrix and R_t is the measurement noise covariance matrix. The noise parameters use the estimate of c(t) and the estimates of w and h, and σ_p, σ_m, and σ_s denote the corresponding noise variances;
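To make the 10-element state and 5-element measurement concrete, the sketch below propagates only the state mean by one step under a constant-velocity assumption, which matches the uniform linear motion model that a standard Kalman filter uses. The function names and the time step dt = 1 frame are assumptions; covariance propagation and the noise matrices Q_t and R_t are left out.

```python
STATE_SIZE = 10  # [u, v, w, h, c, u', v', w', h', c']

def predict_state(x, dt=1.0):
    """Constant-velocity prediction of the state mean (assumed model).

    Each of u, v, w, h, c advances by dt times its velocity component;
    the velocity components themselves are carried over unchanged.
    """
    if len(x) != STATE_SIZE:
        raise ValueError("expected a 10-element state vector")
    values, rates = x[:5], x[5:]
    predicted = [val + dt * rate for val, rate in zip(values, rates)]
    return predicted + list(rates)

def measure(x):
    """Project the state onto the 5-element measurement [u, v, w, h, c]."""
    return list(x[:5])

state = [10.0, 20.0, 4.0, 8.0, 0.5, 1.0, -2.0, 0.0, 0.5, 0.25]
next_state = predict_state(state)
```

Because the confidence c(t) sits in the state next to the box geometry, its predicted value is available at association time in the same way as the predicted box.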
[0042] Since the Kalman Filter (KF) is a uniform linear motion model, it becomes unsuitable for tracking nonlinear motion. Therefore, motion compensation (MC) is introduced into the MKF. Using global motion compensation techniques from OpenCV, key points in the image are extracted, followed by sparse optical flow feature tracking. The affine transformation matrix A is calculated, and finally, the predicted state vector is obtained through the MKF. The MC correction formula is as follows:
[0043]
[0044]
[0045]
[0046]
[0047] M ∈ R^{2×2} contains the rotation and scaling parts of the affine matrix A, and T is the translation vector containing the translation part. The prior state obtained after inserting MC and the corresponding prior prediction covariance matrix P′_{t|t-1} are used to calculate the posterior state and the corresponding posterior covariance matrix P_{t|t}, which gives the tracking box for the current frame. The formula for updating the MKF using MC is as follows:
[0048]
[0049]
[0050] P_{t|t} = (I − K_t H_t) P′_{t|t-1}
[0051] where K_t is the Kalman gain and H_t is the observation matrix;
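As a rough illustration of the update step, the scalar sketch below applies the standard Kalman update to a single state component. The reduction to one dimension and the function name kalman_update_1d are assumptions made for readability; the MKF in the text operates on the full 10-element state with matrix-valued K_t and H_t.

```python
def kalman_update_1d(x_prior, p_prior, y, h, r):
    """Scalar Kalman update (illustrative simplification).

    x_prior, p_prior: prior state and prior variance
    y: measurement, h: observation coefficient, r: measurement noise variance
    """
    s = h * p_prior * h + r              # innovation variance
    k = p_prior * h / s                  # gain K_t
    x_post = x_prior + k * (y - h * x_prior)
    p_post = (1.0 - k * h) * p_prior     # P_{t|t} = (I - K_t H_t) P'_{t|t-1}
    return x_post, p_post, k

x_post, p_post, gain = kalman_update_1d(0.0, 1.0, 2.0, 1.0, 1.0)
```

With equal prior and measurement variances the gain is one half, so the posterior lands midway between the prediction and the measurement and the variance is halved.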
[0052] Step 3.2: When spatial similarity matching is performed between the current-frame tracking boxes and the high-confidence detection boxes, another weak cue, the Highest Intersection over Union (HIoU), is introduced to compensate for the shortcomings of the strong cue IoU under severe target occlusion and clustering. The combination is denoted the Hybrid Intersection over Union (MIoU). The detection box and the tracking box are each described by corner coordinates, where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the areas of the two boxes are defined as α and β. MIoU is used for the spatial similarity matching to determine the degree of overlap between the detection box and the tracking box, and thereby to decide whether to connect the tracking box to the trajectory of the previous frame. The MIoU calculation formula is as follows:
[0053]
[0054]
[0055]
[0056]
[0057] MIoU = HIoU·IoU
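The product MIoU = HIoU·IoU can be sketched as below. The IoU part is the usual intersection area over union area, with α and β as the two box areas. The text does not reproduce the HIoU formula, so this sketch assumes it is the overlap ratio along the vertical axis only (shared height divided by total vertical extent); that definition and all function names are assumptions for illustration.

```python
def iou(a, b):
    """Standard IoU for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    alpha = (a[2] - a[0]) * (a[3] - a[1])  # area of box a
    beta = (b[2] - b[0]) * (b[3] - b[1])   # area of box b
    union = alpha + beta - inter
    return inter / union if union > 0 else 0.0

def hiou(a, b):
    """Assumed HIoU: overlap ratio of the two boxes along the y axis."""
    overlap = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    extent = max(a[3], b[3]) - min(a[1], b[1])
    return overlap / extent if extent > 0 else 0.0

def miou(a, b):
    """MIoU = HIoU * IoU, as stated in the text."""
    return hiou(a, b) * iou(a, b)
```

Under this assumed definition, two boxes that overlap in area but differ strongly in vertical extent score lower than plain IoU would give, which is one way a height cue can separate pedestrians standing in front of one another.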
[0058] In spatial similarity matching, the velocity direction is also an effective weak spatial cue. The cost metric for velocity consistency association is the absolute difference between the tracking velocity direction θ_t and the detection velocity direction θ_d, expressed as Δθ = |θ_t − θ_d|. The velocity direction cost C_Vel is calculated by extending the center-point cost to the costs at the four corners of the bounding box. Given the two center points (x1, y1) and (x2, y2) of the tracking box and the detection box, the velocity direction θ and C_Vel are calculated as follows:
[0059]
[0060]
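The center-point version of the velocity direction cue can be sketched as follows. The direction formula is not reproduced in the text, so this sketch assumes the angle of the displacement between two points, computed with atan2; the cost follows the stated Δθ = |θ_t − θ_d|. The four-corner extension is not shown, and the function names are hypothetical.

```python
import math

def box_center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def direction(p_from, p_to):
    """Assumed velocity direction: angle of the displacement, in radians."""
    return math.atan2(p_to[1] - p_from[1], p_to[0] - p_from[0])

def direction_cost(theta_t, theta_d):
    """Delta theta = |theta_t - theta_d| (no angle wrapping applied)."""
    return abs(theta_t - theta_d)
```

A fuller version would repeat the same computation for each of the four box corners and combine the results, as the text describes.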
[0061] The high-confidence matching association cost consists of multiple association terms: MIoU association, velocity direction association, and trajectory confidence association. The association cost is therefore calculated as follows:
[0062]
[0063] γ1 and γ2 are weighting coefficients;
[0064] The Hungarian algorithm is then used for linear assignment of trajectories to obtain the optimal association cost, which is combined with the MKF for further linear assignment. The first spatial similarity matching yields three sets: the successfully matched trajectories, the trajectories that failed to match, and the detection boxes not assigned to any trajectory.
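The association step can be sketched as a combined cost followed by an optimal one-to-one assignment. Two things here are assumptions: the combined cost formula is not reproduced in the text, so the form (1 − MIoU) + γ1·C_Vel + γ2·C_Conf is a guess at how the three cues are weighted; and the assignment uses a brute-force search over permutations as a stand-in for the Hungarian algorithm, which returns the same optimum on small inputs but scales poorly. In practice a routine such as scipy.optimize.linear_sum_assignment would normally be used. The max_cost gate is also hypothetical.

```python
from itertools import permutations

def combine_cost(miou, c_vel, c_conf, gamma1, gamma2):
    """Assumed combined cost; the patent's exact formula may differ."""
    return (1.0 - miou) + gamma1 * c_vel + gamma2 * c_conf

def assign(cost, max_cost):
    """Optimal one-to-one assignment by brute force (Hungarian stand-in).

    cost[i][j] is the cost of pairing track i with detection j.
    Returns (matches, unmatched_tracks, unmatched_detections).
    With no tracks, the detection count cannot be read from the matrix.
    """
    n_t = len(cost)
    n_d = len(cost[0]) if n_t else 0
    if n_t == 0 or n_d == 0:
        return [], list(range(n_t)), list(range(n_d))
    if n_t <= n_d:
        candidates = ([(i, p[i]) for i in range(n_t)]
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(p[j], j) for j in range(n_d)]
                      for p in permutations(range(n_t), n_d))
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))
    matches = [(i, j) for i, j in best if cost[i][j] <= max_cost]
    used_t = {i for i, _ in matches}
    used_d = {j for _, j in matches}
    return (matches,
            [i for i in range(n_t) if i not in used_t],
            [j for j in range(n_d) if j not in used_d])
```

The three returned lists correspond to the three sets named in the text: matched trajectories, trajectories that failed to match, and detection boxes not assigned to any trajectory.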
[0065] Step 4: The low-confidence association module uses Kalman filtering to predict the current-frame tracking boxes for the remaining trajectory set. A second spatial similarity match is performed between these tracking boxes and the low-confidence detection boxes, and the association cost is calculated. The second set of successfully matched trajectories is merged with the set matched in Step 3.
[0066] The low-confidence association module introduces two weak cues (trajectory confidence state and mixed intersection-over-union ratio) to perform a second spatial similarity matching, thereby handling low-confidence associations in the target tracking process. Step 4 specifically involves:
[0067] The current-frame tracking boxes of the low-confidence matching module are obtained from the remaining trajectories. After the second spatial similarity matching between these tracking boxes and the low-confidence detection boxes, the Hungarian algorithm is used for linear assignment to find the optimal association cost, combined with the MKF. The velocity direction cost C_Vel is not introduced into the low-confidence cost C′_t, because doing so would cause overfitting; C′_t is therefore calculated as follows:
[0068]
[0069] γ3 is the weighting coefficient;
[0070] The second spatial similarity matching likewise yields three sets: the successfully matched trajectories, the trajectories that failed to match, and the detection boxes not assigned to any trajectory.
[0071] The trajectory set is composed of the position information of the predicted tracking box or detection box. Each trajectory is assigned a different number to count the number of pedestrians.
[0072] Step 5: Use the trajectory management module to create, delete, and update the trajectories that were successfully matched in Steps 3 and 4, as well as the detection box trajectories that were not successfully matched in Steps 3 and 4, to obtain the final output trajectory.
[0073] The trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectory. Step 5 specifically involves:
[0074] Define the total trajectory set Γ_t, the set of new target trajectories, the set of detection boxes that failed to match in Steps 3 and 4, and the expiration threshold t_expire;
[0075] Creating a trajectory: new target trajectories are generated from the set of detection boxes that failed to match in Steps 3 and 4.
[0076] Deleting a trajectory: for each trajectory in the total trajectory set Γ_t, if its number of untracked frames Γ_t.N_untracked is greater than the expiration threshold t_expire, the trajectory is deleted. This removes target trajectories that have not been detected for a long time.
[0077] Updating a trajectory: for each target trajectory in the trajectory set, the Kalman filter parameters that estimate the target state, KF.parameters, are updated with an online recursive method based on the detection results of the current frame and the target state of the previous frame.
[0078] Finally, the new target trajectories and the successfully matched target trajectories are combined by weighting to form the final output trajectory set, and post-processing operations are performed on this set to further improve the quality of target tracking.
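The create, delete, and update rules of Step 5 can be sketched as a small bookkeeping class. The class and field names are hypothetical, the update simply overwrites the stored box rather than running a Kalman filter update, and the weighting and post-processing of the final output set are not modelled.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Track:
    track_id: int
    box: tuple
    n_untracked: int = 0  # consecutive frames without a match

class TrackManager:
    """Minimal trajectory bookkeeping (illustrative only)."""

    def __init__(self, t_expire):
        self.t_expire = t_expire
        self.tracks = {}
        self._ids = count(1)  # each trajectory gets a distinct number

    def step(self, matched, unmatched_detections):
        """matched: dict of track_id -> new box for this frame."""
        for tid, trk in list(self.tracks.items()):
            if tid in matched:
                trk.box = matched[tid]   # update (stand-in for a KF update)
                trk.n_untracked = 0
            else:
                trk.n_untracked += 1
                if trk.n_untracked > self.t_expire:
                    del self.tracks[tid]  # delete an expired trajectory
        for box in unmatched_detections:
            tid = next(self._ids)
            self.tracks[tid] = Track(tid, box)  # create a new trajectory
        return sorted(self.tracks)
```

Since every trajectory receives its own number, the highest identifier issued so far gives a running count of distinct pedestrians seen, as the text notes.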
[0079] Beneficial technical effects of the present invention:
[0080] This invention introduces a lightweight spatiotemporal attention mechanism to replace the convolutional neural backbone network in the target detection module, which more effectively corrects the problems of false detection and missed detection of a single network in dense crowds. In target tracking, the high-cost state of appearance matching is abandoned, and the obtained detection boxes are classified according to thresholds. When matching high and low confidence, three weak cues are introduced: trajectory confidence, mixed intersection-over-union ratio, and velocity direction, in order to make up for the insufficient anti-occlusion ability of strong spatial and appearance cues when dealing with dense crowds.
[0081] Compared with existing technologies, this invention proposes a more accurate and faster method for tracking dense pedestrian traffic in subway stations based on weak spatial cues. This invention exhibits high recognition accuracy in complex environments and can detect and analyze dense pedestrian flows in subway stations in real time, thus possessing significant research value and importance for maintaining public safety.
Attached Figure Description
[0082] Figure 1 is an overall framework diagram of the station dense pedestrian tracking method based on weak spatial cues according to an embodiment of the present invention;
[0083] Figure 2 is a schematic diagram of the target detection module structure according to an embodiment of the present invention;
[0084] Figure 3 is a schematic diagram of the high and low confidence matching module structure according to an embodiment of the present invention;
[0085] Figure 4 is a schematic diagram showing the results of the station dense pedestrian tracking method based on weak spatial cues according to an embodiment of the present invention.
Detailed Implementation
[0086] The present invention will be further described below with reference to the accompanying drawings and embodiments;
[0087] A dense pedestrian tracking system for train stations based on weak spatial cues includes a target detection module, a high-confidence association module, a low-confidence association module, and a trajectory management module. As shown in the schematic diagram in Figure 2, the target detection module includes a feature extraction layer, a feature fusion layer, and an output layer. As shown in the schematic diagram in Figure 3, the high- and low-confidence matching module includes a high- and low-confidence association module, a modified Kalman filter module, and a trajectory management module;
[0088] First, the image frame F_t is passed through the target detection module, in which a backbone network composed of LSA blocks replaces the output of the original YOLOX backbone network, to obtain the model's predicted detection box set Z_t. Then Z_t is divided into two classes using a preset high threshold τ_high and low threshold τ_low, giving the high-confidence detection boxes and the low-confidence detection boxes. Next, the high-confidence association module uses a modified Kalman filter to predict the current-frame tracking boxes from the tracking boxes in the previous frame's trajectory set Γ_{t-1}, performs the first spatial similarity match between the predicted tracking boxes and the high-confidence detection boxes, and calculates the association cost. This yields the first set of successfully matched trajectories, the trajectories that failed to match, and the detection boxes that did not match any trajectory. The low-confidence association module then uses the modified Kalman filter to predict the current-frame tracking boxes for the remaining trajectory set, performs a second spatial similarity match between these tracking boxes and the low-confidence detection boxes, and calculates the association cost. The second set of successfully matched trajectories is merged with the first. Finally, the trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectories, completing the tracking of dense pedestrian traffic at the station.
[0089] A method for tracking dense pedestrians at train stations based on weak spatial cues is proposed, implemented on the aforementioned system for tracking dense pedestrians at train stations based on weak spatial cues. As shown in Figure 1, the specific steps are as follows:
[0090] Step 1: Acquire each pedestrian image frame from the surveillance video. For each input pedestrian image frame F_t (t = 1, ..., T), the target detection module produces the set of predicted detection boxes Z_t.
[0091] The object detection module introduces a lightweight spatiotemporal attention (LSA) block into the YOLOX model; the replaced backbone network is named LSA. It uses temporal information to compensate for insufficient pedestrian feature extraction caused by occlusion in the current frame, so that the feature network in YOLOX extracts more discriminative multi-scale feature maps. The resulting pedestrian bounding boxes are more accurate, which benefits tracking. The backbone network composed of LSA blocks replaces the output part of the original YOLOX backbone network to obtain the model's predicted detection box set Z_t. Step 1 proceeds as follows:
[0092] Step 1.1: The image frame F_t is passed through the backbone network composed of LSA blocks to obtain feature maps at multiple scales. Each feature map at a given scale is divided into l image patch feature embedding vectors. A spatiotemporal cross-attention operation is performed between the image patch embeddings of the current frame and the image embeddings at the same scale from the previous n frames (in this method, n = 3), giving the spatiotemporally enhanced l-th spatiotemporal information. This spatiotemporal information is normalized with LN. For the i-th attention head head_i (i = 1, 2, ..., h) of LSA, the normalized result is multiplied by the query weight matrix W_i^Q, the key weight matrix W_i^K, and the value weight matrix W_i^V, respectively, to obtain the corresponding matrices Q, K, and V; these matrices are then used to compute each attention head head_i. The formulas are as follows:
[0093]
[0094]
[0095]
[0096] ε is the relative position offset matrix;
[0097] The attention heads head_i (i = 1, 2, ..., h) are concatenated to obtain the multi-head self-attention features; the formula also contains a scaling factor;
[0098]
[0099] W-MSA and SW-MSA are combined with existing techniques such as MLP and LN to transform the (l-1)-th spatiotemporal information of the current t-th frame into the (l+1)-th spatiotemporal information of the current frame. Specifically:
[0100] After obtaining multi-head self-attention features, to further enhance spatial features while avoiding excessive computation, a cascaded window multi-head self-attention (W-MSA) and sliding window multi-head self-attention (SW-MSA) method is adopted. Compared with the global self-attention method (MSA), window multi-head self-attention (W-MSA) reduces computational complexity. However, since each window does not overlap, information cannot be exchanged between adjacent windows. Therefore, the sliding window multi-head self-attention (SW-MSA) method is proposed, which enables information exchange between two adjacent windows and cross-window association between upper and lower layers, thereby indirectly achieving the ability of global spatial modeling. Furthermore, combined with multilayer perceptron (MLP) and normalized processing (LN), the specific formula for LSA is as follows:
[0101]
[0102]
[0103]
[0104]
[0105] The output of the LSA block is the output feature at the corresponding downsampling multiple.
[0106] Step 1.2: High-level features are obtained through the LSA feature extraction layer. The output of the original YOLOX backbone network is replaced with LSA, and the three effective feature layers of LSA, downsampled by factors of 8, 16, and 32, are extracted. These correspond to the three effective feature layers of the original backbone network, which are likewise downsampled by factors of 8, 16, and 32, completing the extraction of high-level features.
[0107] Step 1.3: The high-level features B_t pass through the Path Aggregation Feature Pyramid Network (PAFPN) feature fusion layer and are further processed by a fully connected layer and the SiLU activation function. B_t is fused through upsampling and downsampling to obtain the predicted feature maps, and the layer finally outputs a tuple D_t consisting of three feature layers. The PAFPN feature layers are analogous to the levels of an image pyramid, so RoIs of different scales are mapped to the corresponding feature layers.
[0108] Step 1.4: The output layer uses the anchor-free decoupled head of the YOLOX model. The feature map tuple D_t passes through the output head to give the final prediction result Z_t, which contains three parts of prediction information: cls_output, which predicts the category and score of the target box; obj_output, which determines whether the target box is foreground or background; and reg_output, which predicts the coordinates of the target box. The loss function is calculated as follows:
[0109]
[0110] L_cls is the classification loss, L_obj is the objectness loss, L_reg is the localization loss, N_pos is the number of positive samples, and μ is the balance coefficient of the localization loss, which is set to 5.0.
[0111] Step 2: Using a preset high threshold τ_high and low threshold τ_low, divide Z_t into two classes to obtain the high-confidence detection boxes and the low-confidence detection boxes.
[0112] Depending on the actual density of pedestrian traffic in the scenario, appropriate high and low thresholds are preset, with 0.55 ≤ τ_high ≤ 0.85 and 0.15 ≤ τ_low ≤ 0.35. Z_t is divided into two classes using the preset high threshold τ_high and low threshold τ_low. A detection box in Z_t whose confidence is greater than τ_high is a high-confidence detection box. A detection box in Z_t whose confidence is greater than τ_low and less than τ_high is selected as a low-confidence detection box when unmatched remaining detection boxes exist. T is the number of image frames and N is the number of detection boxes;
[0113] Step 3: The high-confidence association module uses the modified Kalman filter to predict the current-frame tracking boxes from the tracking boxes in the previous frame's trajectory set Γ_{t-1}. The first spatial similarity match is performed between the predicted tracking boxes and the high-confidence detection boxes, and the association cost is calculated, giving the first set of successfully matched trajectories, the trajectories that failed to match, and the detection boxes that did not match any trajectory.
[0114] The high-confidence association module introduces three weak spatial cues, namely the trajectory confidence state, the mixed intersection-over-union ratio (MIoU), and the velocity direction, to perform the first spatial similarity matching and thereby handle high-confidence association during target tracking. A Modified Kalman Filter (MKF) model performs state prediction on the tracking boxes from the previous frame to obtain the tracking boxes for the current frame. The first spatial similarity matching is then performed between these tracking boxes and the high-confidence detection boxes using the weak spatial cues, and the Hungarian algorithm is used for linear assignment to find the optimal association cost, which is combined with the MKF for linear assignment in the next frame. The successfully matched trajectories, the trajectories that failed to match, and the detection boxes not assigned to any trajectory are each recorded as separate sets.
[0115] Step 3.1: Use the MKF model to perform state prediction on the previous-frame tracking boxes and obtain the tracking boxes for the current frame. To better reflect the temporal continuity of the confidence state of the same trajectory, a weak cue is introduced into the MKF, namely the trajectory confidence state c(t) and its velocity component c′(t).
[0116] The trajectory confidence is denoted c_trk, and c(t) is the state that represents c_trk; a higher value of c(t) indicates greater confidence. When the MKF is used, the trajectory confidence state cost C_Conf is calculated as the absolute difference between the estimated trajectory confidence and the detection confidence c_det. The formula for linear modeling of the trajectory confidence is as follows:
[0117]
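The confidence cue can be illustrated with a small sketch. It assumes a simple linear extrapolation from the last two confidence values of a trajectory, which is one common way to model confidence linearly and is not confirmed by the text; the clamp to [0, 1] and all names are hypothetical additions.

```python
def predict_confidence(history):
    """Linearly extrapolate the next trajectory confidence.

    history: past confidences of one trajectory, oldest first (non-empty).
    """
    if len(history) >= 2:
        est = history[-1] + (history[-1] - history[-2])
    else:
        est = history[-1]
    return min(1.0, max(0.0, est))  # clamp is an added safeguard


def confidence_cost(history, c_det):
    """C_Conf as the absolute difference between estimate and detection."""
    return abs(predict_confidence(history) - c_det)
```

A trajectory whose confidence is steadily dropping (for example, a pedestrian becoming occluded) then matches a lower-scoring detection at lower cost than a fresh high-scoring one.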
[0118] To make the detection box include a more complete human image, the original aspect ratio component of the Kalman filter is modified to include width w and height h components, and c(t) is introduced; the 10-tuple state vector, 5-tuple measurement vector, and noise parameters of the MKF are as follows:
[0119] x_t = [u(t), v(t), w(t), h(t), c(t), u′(t), v′(t), w′(t), h′(t), c′(t)]^T
[0120] y_t = [y_u(t), y_v(t), y_w(t), y_h(t), y_c(t)]^T
[0121]
[0122]
[0123] Q_t is the process noise covariance matrix and R_t is the measurement noise covariance matrix. The formulas use the estimate of c(t) and the estimates of w and h; σ_p = σ_m = 0.055 and σ_s = 0.00675 are the corresponding noise variances;
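The prediction step on the 10-tuple state can be sketched as below, assuming the standard constant-velocity transition of a Kalman filter (each component advances by its own velocity). Covariance propagation is omitted, and the function name is hypothetical.

```python
def predict_state(x, dt=1.0):
    """Constant-velocity prediction for the MKF state.

    x = [u, v, w, h, c, u', v', w', h', c'] as in the state vector above.
    Returns the predicted state with unchanged velocities.
    """
    if len(x) != 10:
        raise ValueError("expected a 10-component state")
    pos, vel = x[:5], x[5:]
    return [p + dt * v for p, v in zip(pos, vel)] + list(vel)
```

Because c(t) sits in the state next to u, v, w and h, the confidence is predicted by the same mechanism as the box geometry.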
[0124] Since the Kalman Filter (KF) is a uniform linear motion model, it becomes unsuitable for tracking nonlinear motion. Therefore, motion compensation (MC) is introduced into the MKF. Using global motion compensation techniques from OpenCV, key points in the image are extracted, followed by sparse optical flow feature tracking. The affine transformation matrix A is calculated, and finally, the predicted state vector is obtained through the MKF. The MC correction formula is as follows:
[0125]
[0126]
[0127]
[0128]
[0129] M ∈ R^{2×2} contains the rotation and scaling part of the affine matrix A, and T is the translation vector containing the translation part. From the prior state obtained after applying MC and the corresponding prior prediction covariance matrix P′_{t|t-1}, the posterior state and the corresponding posterior covariance matrix P_{t|t} are calculated, which gives the tracking box of the current frame. The formula for updating the MKF using MC is as follows:
[0130]
[0131]
[0132] P_{t|t} = (I − K_t H_t) P′_{t|t-1}
[0133] where K_t is the Kalman gain and H_t is the observation matrix;
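The two operations described here can be sketched in scalar form. The first function applies the affine correction [M | T] to a box center only; how the correction extends to the remaining state components is not shown and would be an assumption. The second is the textbook Kalman update for a one-dimensional state, used to show the gain and the (I − K_t H_t) covariance update. All names are hypothetical.

```python
def compensate_center(u, v, M, T):
    """Apply the affine motion correction [M | T] to a box center (u, v)."""
    u_new = M[0][0] * u + M[0][1] * v + T[0]
    v_new = M[1][0] * u + M[1][1] * v + T[1]
    return u_new, v_new


def kalman_update_1d(x_prior, p_prior, z, r, h=1.0):
    """Scalar Kalman update: returns (posterior state, posterior variance)."""
    k = p_prior * h / (h * p_prior * h + r)   # gain K_t
    x_post = x_prior + k * (z - h * x_prior)  # correct with the innovation
    p_post = (1.0 - k * h) * p_prior          # P_t|t = (I - K_t H_t) P'_t|t-1
    return x_post, p_post
```

With equal prior and measurement variances the gain is 0.5, so the posterior lies halfway between prediction and measurement and its variance is halved.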
[0134] Step 3.2: When the predicted tracking boxes and the high-confidence detection boxes undergo spatial similarity matching, another weak cue, the height intersection-over-union (HIoU), is introduced to compensate for the weakness of the strong cue IoU under severe target occlusion and clustering. The combination is denoted hybrid intersection-over-union (MIoU). The detection box and the tracking box are each defined by a top-left corner (x1, y1) and a bottom-right corner (x2, y2), and the areas of the two boxes are defined as α and β. MIoU is used for spatial similarity matching between the tracking boxes and the detection boxes: it measures the degree of overlap between a detection box and a tracking box, and so decides whether the detection is connected to the trajectory of the previous frame. The MIoU calculation formula is as follows:
[0135]
[0136]
[0137]
[0138]
[0139] MIoU = HIoU·IoU
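A minimal sketch of the product MIoU = HIoU·IoU follows. It assumes HIoU is the overlap ratio of the two boxes' vertical extents, which is how a height-based IoU is commonly defined; that definition and the function names are assumptions, while α, β, and the final product follow the text.

```python
def iou(a, b):
    """Standard IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    alpha = (a[2] - a[0]) * (a[3] - a[1])  # area of the first box
    beta = (b[2] - b[0]) * (b[3] - b[1])   # area of the second box
    union = alpha + beta - inter
    return inter / union if union > 0 else 0.0


def hiou(a, b):
    """Assumed height IoU: overlap of vertical extents over their span."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    span = max(a[3], b[3]) - min(a[1], b[1])
    return max(0.0, overlap) / span if span > 0 else 0.0


def miou(a, b):
    """Hybrid IoU as the product of HIoU and IoU."""
    return hiou(a, b) * iou(a, b)
```

Two boxes with the same IoU are separated by the product: a sideways shift keeps HIoU at 1, while a vertical shift lowers it, which helps distinguish overlapping pedestrians of different heights.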
[0140] Velocity direction is also an effective weak spatial cue in spatial similarity matching. The cost metric for velocity-consistency association is the absolute difference between the tracking velocity direction θ_t and the detection velocity direction θ_d, expressed as Δθ = |θ_t − θ_d|. The velocity direction cost C_Vel extends the cost computed at the box center to the costs computed at the four corners of the box. Given the center points (x1, y1) and (x2, y2) of the tracking box and the detection box, the velocity direction θ and C_Vel are calculated as:
[0141]
[0142]
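The velocity-direction cue can be sketched as below. The direction between two points is taken with `math.atan2`, and the corner-based cost is assumed to be the mean of Δθ over the four box corners; the averaging, the use of the previous box as the common start point, and all names are assumptions. Angle wrap-around is not handled, since the text defines Δθ as a plain absolute difference.

```python
import math


def direction(p_from, p_to):
    """Direction angle (radians) of the displacement p_from -> p_to."""
    return math.atan2(p_to[1] - p_from[1], p_to[0] - p_from[0])


def corners(box):
    """Four corners of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]


def velocity_cost(prev_box, track_box, det_box):
    """Mean |theta_t - theta_d| over the four corners (assumed form)."""
    total = 0.0
    for p, t, d in zip(corners(prev_box), corners(track_box), corners(det_box)):
        theta_t = direction(p, t)  # direction implied by the predicted box
        theta_d = direction(p, d)  # direction implied by the detection
        total += abs(theta_t - theta_d)
    return total / 4.0
```

A detection that would require the pedestrian to turn by 90 degrees relative to the predicted motion gets a cost of π/2, while one that continues in the predicted direction gets 0.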
[0143] The high-confidence matching association cost is composed of several association terms: the MIoU association, the velocity direction association, and the trajectory confidence association. The association cost is therefore calculated as follows:
[0144]
[0145] γ1 and γ2 are weighting coefficients;
[0146] The Hungarian algorithm then performs linear assignment on the trajectories to obtain the optimal association cost, which is combined with the MKF for further linear assignment. The first spatial similarity matching yields the set of successfully matched trajectories, the set of trajectories that failed to match, and the set of detection boxes not assigned to any trajectory.
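How the three terms might be combined and assigned can be sketched as follows. Two assumptions are made loudly: the combined cost is taken as (1 − MIoU) + γ1·Δθ + γ2·C_Conf so that lower is better, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same optimum but only scales to a handful of boxes). The default weights and all names are hypothetical.

```python
from itertools import permutations


def association_cost(miou, d_theta, d_conf, gamma1=0.2, gamma2=0.1):
    """Assumed high-confidence cost: lower means a better match."""
    return (1.0 - miou) + gamma1 * d_theta + gamma2 * d_conf


def assign(cost):
    """Minimum-cost assignment by exhaustive search.

    cost[r][c] is the cost of matching trajectory r to detection c.
    Requires rows <= columns. Returns (pairs, total_cost).
    """
    if not cost:
        return [], 0.0
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows > n_cols:
        raise ValueError("sketch expects no more trajectories than detections")
    best_pairs, best_total = [], float("inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_pairs, best_total = list(enumerate(cols)), total
    return best_pairs, best_total
```

In practice a polynomial-time solver would replace `assign`, and pairs whose cost exceeds a gating threshold would be moved to the unmatched sets.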
[0147] Step 4: The low-confidence association module takes the tracking boxes in the remaining trajectory set and predicts the tracking boxes of the current frame by Kalman filtering. The predicted tracking boxes and the low-confidence detection boxes undergo a second spatial similarity matching, and the association cost is calculated. The trajectories successfully matched in this second pass are obtained and merged into the set of successfully matched trajectories.
[0148] The low-confidence association module introduces two weak cues (trajectory confidence state and hybrid intersection-over-union) to perform the second spatial similarity matching, thereby handling low-confidence association in the target tracking process. Step 4 specifically involves:
[0149] The current-frame tracking boxes of the low-confidence matching module are obtained from the remaining trajectories. The tracking boxes and the low-confidence detection boxes undergo the second spatial similarity matching, after which the Hungarian algorithm performs linear assignment to find the optimal association cost, which is combined with the MKF for linear assignment. The low-confidence cost C′_t does not include C_Vel, because including it would cause overfitting; C′_t is therefore calculated as follows:
[0150]
[0151] γ3 is the weighting coefficient;
[0152] The second spatial similarity matching likewise yields a set of successfully matched trajectories, a set of trajectories that failed to match, and a set of detection boxes not assigned to any trajectory.
[0153] The trajectory set is composed of the position information of the predicted tracking box or detection box. Each trajectory is assigned a different number to count the number of pedestrians.
[0154] Step 5: Use the trajectory management module to create, delete, and update the trajectories that were successfully matched in Steps 3 and 4, as well as the detection box trajectories that were not successfully matched in Steps 3 and 4, to obtain the final output trajectory.
[0155] The trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectory. Step 5 specifically involves:
[0156] Define the total set of trajectories as Γ_t and the expiration threshold as t_expire; the new target trajectories and the set of detection boxes that failed to match in Steps 3 and 4 are defined likewise;
[0157] Creating a trajectory: new target trajectories are generated from the set of detection boxes that failed to match in Steps 3 and 4.
[0158] Deleting a trajectory: for each trajectory in the total trajectory set Γ_t, if its number of untracked frames Γ_t.N_untracked is greater than the expiration threshold t_expire, the trajectory is deleted. This removes target trajectories that have not been detected for a long time.
[0159] Updating a trajectory: for each successfully matched target trajectory, the Kalman filter parameters KF.parameters, which estimate the target state, are updated online and recursively from the detection result of the current frame and the target state of the previous frame.
[0160] Finally, the new target trajectories and the successfully matched target trajectories are weighted to form the final output trajectory set.
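The create, delete, and update rules of Step 5 can be sketched as a small bookkeeping class. It is a sketch under assumptions: the update simply overwrites the stored box instead of updating Kalman filter parameters, and the class names, the `step` interface, and the default t_expire are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    box: tuple            # (x1, y1, x2, y2)
    n_untracked: int = 0  # frames since the last successful match


class TrackManager:
    def __init__(self, t_expire=30):
        self.t_expire = t_expire
        self.tracks = {}
        self._ids = count(1)  # each trajectory gets a unique number

    def step(self, matched, unmatched_dets):
        """matched: {track_id: box}; unmatched_dets: list of boxes."""
        for tid, trk in list(self.tracks.items()):
            if tid in matched:               # update
                trk.box = matched[tid]
                trk.n_untracked = 0
            else:
                trk.n_untracked += 1
                if trk.n_untracked > self.t_expire:  # delete
                    del self.tracks[tid]
        for box in unmatched_dets:           # create
            tid = next(self._ids)
            self.tracks[tid] = Track(tid, box)
        return list(self.tracks.values())
```

Since IDs are never reused, the highest ID issued so far gives a running count of distinct pedestrians seen.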
[0161] Optional post-processing, such as filtering out abnormal trajectory points or optimizing trajectory associations, can be applied to the final output trajectory set to further improve the quality of target tracking. A schematic diagram of the tracking results obtained with this method is shown in Figure 4.
Claims
1. A station dense pedestrian tracking system based on weak spatial cues, characterized in that it comprises an object detection module, a high-confidence association module, a low-confidence association module, a modified Kalman filter module, and a trajectory management module; the object detection module comprises a feature extraction layer, a feature fusion layer, and an output layer; the high-confidence association module, the low-confidence association module, and the trajectory management module interact through the modified Kalman filter module;
the object detection module takes an input image frame and obtains the model's predicted detection box set Z_t by replacing the output part of the original YOLOX backbone network with a backbone network composed of LSA blocks; using a preset high threshold τ_high and a preset low threshold τ_low, Z_t is split into high-confidence detection boxes and low-confidence detection boxes; the high-confidence association module then takes the tracking boxes in the trajectory set Γ_{t-1} of the previous frame, predicts the tracking boxes of the current frame with the modified Kalman filter module, performs the first spatial similarity matching between the predicted tracking boxes and the high-confidence detection boxes, and calculates the association cost, obtaining the first set of successfully matched trajectories, with the detection boxes that did not match any trajectory merged into the remaining detection boxes; the low-confidence association module takes the tracking boxes in the remaining trajectory set, predicts the tracking boxes of the current frame with the modified Kalman filter module, performs a second spatial similarity matching between the predicted tracking boxes and the low-confidence detection boxes, and calculates the association cost, obtaining the second set of successfully matched trajectories, which is merged into the set of successfully matched trajectories; finally, the trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectories, completing the tracking of dense pedestrians at the station;
the object detection module introduces a lightweight spatiotemporal attention (LSA) block into the YOLOX model, and the replaced backbone network is named LSA; it uses temporal information to compensate for insufficient pedestrian feature extraction caused by occlusion in the current frame, so that the feature network in YOLOX extracts more discriminative multi-scale feature maps and the final pedestrian bounding boxes are more accurate, which benefits tracking; the backbone network composed of LSA blocks replaces the output part of the original YOLOX backbone network to obtain the model's predicted detection box set Z_t;
the high-confidence association module introduces three weak spatial cues, namely trajectory confidence state, hybrid intersection-over-union, and velocity direction, to perform the first spatial similarity matching, thereby handling high-confidence association in the target tracking process; the modified Kalman filter module predicts the current-frame state of each tracking box from the previous frame to obtain the tracking boxes of the current frame; using the weak spatial cues, the predicted tracking boxes and the high-confidence detection boxes undergo the first spatial similarity matching, and the Hungarian algorithm then performs linear assignment to find the optimal association cost, which is combined with the MKF for linear assignment in the next frame; the matching yields the set of successfully matched trajectories, the set of trajectories that failed to match, and the set of detection boxes not assigned to any trajectory.
2. A method for tracking dense pedestrians at stations based on weak spatial cues, implemented with the station dense pedestrian tracking system based on weak spatial cues according to claim 1, characterized in that the specific steps are as follows:
Step 1: Acquire each pedestrian image frame from the surveillance video; for each input pedestrian image frame, the object detection module obtains the predicted detection box set Z_t;
Step 2: Using a preset high threshold τ_high and a preset low threshold τ_low, split Z_t into high-confidence detection boxes and low-confidence detection boxes;
Step 3: The high-confidence association module takes the tracking boxes in the trajectory set Γ_{t-1} of the previous frame and predicts the tracking boxes of the current frame with the modified Kalman filter; the predicted tracking boxes and the high-confidence detection boxes undergo the first spatial similarity matching and the association cost is calculated, obtaining the first set of successfully matched trajectories, with the detection boxes that did not match any trajectory merged into the remaining detection boxes;
Step 4: The low-confidence association module takes the tracking boxes in the remaining trajectory set and predicts the tracking boxes of the current frame by Kalman filtering; the predicted tracking boxes and the low-confidence detection boxes undergo a second spatial similarity matching and the association cost is calculated, obtaining the second set of successfully matched trajectories, which is merged into the set of successfully matched trajectories;
Step 5: The trajectory management module creates, deletes, and updates the trajectories that were successfully matched in Steps 3 and 4, as well as the trajectories of detection boxes that were not successfully matched in Steps 3 and 4, to obtain the final output trajectories.
3. A method for tracking dense pedestrians at stations based on weak spatial cues according to claim 2, characterized in that Step 1 is specifically as follows:
Step 1.1: A backbone network composed of LSA blocks produces feature maps at multiple scales, and the feature map at each scale is divided into image patch feature embedding vectors; the patch embeddings of the current frame and the patch embeddings at the same scale from the previous n frames undergo a spatiotemporal cross-attention operation, giving spatiotemporally enhanced information, which is then normalized; for each attention head of the LSA, the query weight matrix, the key weight matrix, and the value weight matrix are each multiplied with the normalized spatiotemporal information to obtain the corresponding query, key, and value matrices; these matrices are then used to calculate each attention head, using a relative position bias matrix and a scaling factor; the attention heads are concatenated to obtain the multi-head self-attention features; W-MSA and SW-MSA are then applied to the spatiotemporal information of the current t-th frame, combined with existing techniques such as MLP and LN, to obtain the spatiotemporal information of the current frame. Specifically: after the multi-head self-attention features are obtained, a cascade of window multi-head self-attention (W-MSA) and sliding-window multi-head self-attention (SW-MSA) is adopted to further enhance spatial features while avoiding excessive computation. Compared with global multi-head self-attention (MSA), W-MSA reduces computational complexity, but because the windows do not overlap, information cannot be exchanged between adjacent windows. SW-MSA is therefore used to enable information exchange between adjacent windows and cross-window connections between successive layers, indirectly achieving global spatial modeling. Combined with the multilayer perceptron (MLP) and layer normalization (LN), the output of the LSA block is the output feature at the corresponding downsampling multiple;
Step 1.2: High-level features are obtained through the LSA feature extraction layer. The original backbone network output of YOLOX is replaced with LSA. The three effective feature layers of LSA are extracted as feature layers downsampled by 8, 16, and 32 times respectively, which correspond to the three effective feature layers downsampled by 8, 16, and 32 times of the original backbone network, thus completing the extraction of high-level features;
Step 1.3: The high-level features pass through the Path Aggregation Feature Pyramid Network (PAFPN) feature fusion layer and are further processed by a fully connected layer and the SiLU activation function; the predicted feature maps are obtained by passing and fusing features through upsampling and downsampling, and the layer finally outputs a tuple consisting of three feature layers. The PAFPN feature layers are analogous to the levels of an image pyramid, so that RoIs of different scales are mapped to the corresponding feature layers;
Step 1.4: The output layer uses the anchor-free decoupled head of the YOLOX model; the feature map tuple D_t passes through the output head to give the final prediction result Z_t, which contains three parts of prediction information: cls_output, which predicts the category and score of the target box; obj_output, which determines whether the target box is foreground or background; and reg_output, which predicts the coordinates of the target box. The loss function is calculated from the classification loss L_cls, the objectness (foreground/background) loss L_obj, the localization loss L_reg, the number of positive samples N_pos, and the balance coefficient μ of the localization loss.
4. A method for tracking dense pedestrians at stations based on weak spatial cues according to claim 2, characterized in that Step 2 is specifically as follows: using a preset high threshold τ_high and a preset low threshold τ_low, the detection box set Z_t is split into high-confidence detection boxes and low-confidence detection boxes; when the confidence of a detection box in Z_t is greater than τ_high, a high-confidence detection box is obtained; when the confidence is greater than τ_low and less than τ_high, and unmatched detection boxes remain, the low-confidence detection boxes are selected from Z_t; T is the number of image frames and N is the number of detection boxes.
5. A method for tracking dense pedestrians at stations based on weak spatial cues according to claim 2, characterized in that Step 3 is specifically as follows:
Step 3.1: The MKF model predicts the current-frame state of each tracking box from the previous frame to obtain the tracking boxes of the current frame; to better reflect the temporal continuity of the confidence state along a single trajectory, a weak cue is introduced into the MKF, namely the trajectory confidence state c(t) and its velocity component c′(t); the trajectory confidence is denoted c_trk, c(t) is the state that represents c_trk, and a higher c(t) means a more reliable c_trk; with the MKF, the trajectory confidence cost C_Conf is calculated as the absolute difference between the estimated trajectory confidence and the detection confidence c_det, the trajectory confidence being modeled linearly; to make the detection box include a more complete human image, the original aspect ratio component of the Kalman filter is modified to include width w and height h components, and c(t) is introduced; the 10-tuple state vector and 5-tuple measurement vector of the MKF are x_t = [u(t), v(t), w(t), h(t), c(t), u′(t), v′(t), w′(t), h′(t), c′(t)]^T and y_t = [y_u(t), y_v(t), y_w(t), y_h(t), y_c(t)]^T; Q_t is the process noise covariance matrix and R_t is the measurement noise covariance matrix, defined using the estimate of c(t), the estimates of w and h, and the corresponding noise variances; since the Kalman filter (KF) is a uniform linear motion model, it is unsuitable for tracking nonlinear motion, so motion compensation (MC) is introduced into the MKF: using the global motion compensation technique from OpenCV, key points in the image are extracted, sparse optical flow feature tracking is performed, the affine transformation matrix A is calculated, and finally the predicted state vector is obtained through the MKF; in the MC correction, M ∈ R^{2×2} contains the rotation and scaling part of the affine matrix A, and T is the translation vector containing the translation part; from the prior state obtained after applying MC and the corresponding prior prediction covariance matrix P′_{t|t-1}, the posterior state and the corresponding posterior covariance matrix P_{t|t} = (I − K_t H_t) P′_{t|t-1} are calculated, which gives the tracking box of the current frame, where K_t is the Kalman gain and H_t is the observation matrix;
Step 3.2: When the predicted tracking boxes and the high-confidence detection boxes undergo spatial similarity matching, another weak cue, the height intersection-over-union (HIoU), is introduced to compensate for the weakness of the strong cue IoU under severe target occlusion and clustering; the combination is denoted hybrid intersection-over-union (MIoU); the detection box and the tracking box are each defined by a top-left corner (x1, y1) and a bottom-right corner (x2, y2), and the areas of the two boxes are defined as α and β; MIoU is used for spatial similarity matching between the tracking boxes and the detection boxes, measuring the degree of overlap between a detection box and a tracking box and so deciding whether the detection is connected to the trajectory of the previous frame, with MIoU = HIoU·IoU; velocity direction is also an effective weak spatial cue in spatial similarity matching; the cost metric for velocity-consistency association is the absolute difference between the tracking velocity direction θ_t and the detection velocity direction θ_d, expressed as Δθ = |θ_t − θ_d|; the velocity direction cost C_Vel extends the cost computed at the box center to the costs computed at the four corners of the box, and the velocity direction θ and C_Vel are calculated from the center points (x1, y1) and (x2, y2) of the tracking box and the detection box; the high-confidence matching association cost is composed of the MIoU association, the velocity direction association, and the trajectory confidence association, combined with the weighting coefficients γ1 and γ2; the Hungarian algorithm then performs linear assignment on the trajectories to obtain the optimal association cost, which is combined with the MKF for further linear assignment; the first spatial similarity matching yields the set of successfully matched trajectories, the set of trajectories that failed to match, and the set of detection boxes not assigned to any trajectory.
6. A method for tracking dense pedestrians at stations based on weak spatial cues according to claim 2, characterized in that in Step 4 the low-confidence association module introduces two weak cues (trajectory confidence state and hybrid intersection-over-union) to perform a second spatial similarity matching, thereby handling low-confidence association in the target tracking process. Step 4 is specifically as follows: the current-frame tracking boxes of the low-confidence matching module are obtained from the remaining trajectories; the tracking boxes and the low-confidence detection boxes undergo the second spatial similarity matching, after which the Hungarian algorithm performs linear assignment to find the optimal association cost, which is combined with the MKF for linear assignment; the low-confidence cost C′_t does not include C_Vel, because including it would cause overfitting, and C′_t is calculated with the weighting coefficient γ3; the second spatial similarity matching yields the set of successfully matched trajectories, the set of trajectories that failed to match, and the set of detection boxes not assigned to any trajectory; the trajectory set is composed of the position information of the predicted tracking boxes or detection boxes, and each trajectory is assigned a different number to count the number of pedestrians.
7. A method for tracking dense pedestrians at stations based on weak spatial cues according to claim 2, characterized in that in Step 5 the trajectory management module creates, deletes, and updates target trajectories to obtain the final output trajectories. Step 5 is specifically as follows: define the total trajectory set as Γ_t and the expiration threshold as t_expire, with the new target trajectories and the set of detection boxes that failed to match in Steps 3 and 4 defined likewise; creating a trajectory: new target trajectories are generated from the set of detection boxes that failed to match in Steps 3 and 4; deleting a trajectory: for each trajectory in the total trajectory set Γ_t, if its number of untracked frames Γ_t.N_untracked is greater than the expiration threshold t_expire, the trajectory is deleted, which removes target trajectories that have not been detected for a long time; updating a trajectory: for each successfully matched target trajectory, the Kalman filter parameters KF.parameters, which estimate the target state, are updated online and recursively from the detection result of the current frame and the target state of the previous frame; finally, the new target trajectories and the successfully matched target trajectories are weighted to form the final output trajectory set; post-processing operations are performed on the final output trajectory set to further improve the quality of target tracking.
Citation Information
Patent Citations
Pedestrian multi-target tracking method and device and computer readable storage medium
CN115240130A
Multi-target tracking method in dense scene
CN116883452A