A vehicle and pedestrian online detection and tracking method based on improved ByteTrack
By improving the ByteTrack algorithm, incorporating ReID appearance features and optimizing the motion model, and combining it with model compression tools, the limitations of ByteTrack in detection and tracking in urban traffic scenarios have been overcome. This has enabled efficient and real-time vehicle and pedestrian detection and tracking, improving detection accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2023-06-30
- Publication Date
- 2026-06-30
AI Technical Summary
The existing ByteTrack algorithm has limitations in urban public transportation road scenarios, such as strong dependence on detector performance and poor performance in handling nonlinear motion patterns, making it difficult to meet the needs of real-time and efficient vehicle and pedestrian detection and tracking.
By improving the ByteTrack algorithm, adding a ReID appearance feature extraction module, optimizing motion models and data association methods, and combining model compression acceleration tools, the algorithm performance is improved to adapt to complex and ever-changing traffic scenarios.
It improves the adaptability and tracking accuracy of vehicle and pedestrian detection, enhances the robustness and real-time performance of the algorithm, and improves the MOTA, IDF1 and HOTA indicators, thus meeting the real-time detection and tracking needs of urban traffic monitoring systems.
[Figure: abstract drawing of CN116682078B]
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to an online vehicle and pedestrian detection and tracking method based on an improved ByteTrack.
Background Technology
[0002] Object detection is one of the four basic tasks (classification, localization, detection, and segmentation) in the field of computer vision (CV). Its basic implementation is to select objects of interest in an image or video using a rectangular bounding box (localization), and then identify and classify the objects within the bounding box, thus solving the "localization + classification" problem.
[0003] Multiple Object Tracking (MOT, also called Multiple Target Tracking, MTT) detects and identifies multiple objects of interest in each frame of a video without any prior knowledge of the number of targets. Each object is assigned a unique ID, and the same target is associated across frames to obtain the complete motion trajectories of all targets in the video, and even to predict their trajectories.
[0004] Most detection-based MOT algorithms, namely those following the Tracking-By-Detection (TBD) paradigm, consist of two steps: detection and tracking. The tracking step typically involves two tasks: 1) motion modeling and state estimation, which predicts and updates the bounding boxes of trajectories in subsequent frames; the Kalman filter (KF) is the mainstream choice for this task; 2) associating the detections of a new frame with the current set of trajectories. Two main cues are used for this association: ① target localization (spatial similarity), primarily the Intersection over Union (IoU) between the predicted trajectory bounding boxes and the detected bounding boxes; ② the target appearance model (appearance similarity), i.e., appearance features extracted using ReID. Both cues are usually quantified as distances, and the association is solved as a global assignment problem with the Hungarian algorithm.
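As an illustration of the spatial-similarity cue described above, the following minimal Python sketch computes the IoU of two boxes and the corresponding IoU distance matrix between predicted trajectory boxes and detected boxes. The corner format `(x1, y1, x2, y2)` and the function names are assumptions made for this example; they are not taken from the patent text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_distance_matrix(track_boxes, det_boxes):
    """Entry [i][j] is 1 - IoU between predicted track box i and detection j."""
    return [[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes]


tracks = [(0, 0, 2, 2)]
detections = [(1, 1, 3, 3), (10, 10, 12, 12)]
distances = iou_distance_matrix(tracks, detections)
```

A distance of 0 means the boxes coincide and a distance of 1 means they do not overlap, so the matrix can be passed directly to an assignment solver as a cost.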
[0005] With rapid advances in computing performance, the widespread adoption of high-performance camera terminals, the growing demand for video analysis, the application of deep models such as Convolutional Neural Networks (CNNs), and the leap in computational efficiency brought by GPU devices, target tracking technology has benefited from more robust and generalizable feature representations and from end-to-end model training. The application scope of target tracking algorithms keeps broadening, and the demand for practical deployment keeps growing. Intelligent traffic monitoring systems are large-scale systems in which computer-vision-based vehicle and pedestrian detection and tracking is a crucial component. First, real-world traffic conditions are often complex, with adverse weather and dense vehicle and pedestrian flows, which places high demands on the accuracy and speed of detection and tracking. Second, detection and tracking systems typically need to react rapidly to changes in target movement, which requires real-time performance. Third, the deployment environment (such as mobile devices) is limited in computational load and storage, so the system's size and power consumption must be considered. Real-time, lightweight, fast, and high-precision operation therefore remains a constant theme in target detection and tracking.
[0006] ByteTrack is a simple and efficient MOT algorithm in the SORT family. Other recent algorithms in the SORT family include StrongSORT and OC-SORT. StrongSORT is an enhanced version of Deep SORT, but its speed and accuracy are slightly inferior to those of ByteTrack. OC-SORT is based on the ByteTrack codebase. Although OC-SORT generally performs better than ByteTrack, ByteTrack achieves a higher MOTA (Multiple Object Tracking Accuracy) than OC-SORT on the MOT17 and MOT20 datasets, and its framework is simpler. ByteTrack uses the YOLOX detector to obtain detection information. Its proposed BYTE data association method has no ReID branch (appearance features) and uses only simple motion cues: it separates high-score and low-score detection boxes and matches them in successive rounds, which handles occlusion and association problems effectively. It therefore has great potential for solving the occlusion caused by congestion in urban public transportation road scenes. However, it also suffers from some limitations of the SORT family of algorithms, such as strong dependence on detector performance and poor handling of nonlinear motion patterns.
Summary of the Invention
[0007] The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing an online detection and tracking method for vehicles and pedestrians based on an improved ByteTrack. The method improves and optimizes the ByteTrack algorithm model to the greatest extent possible from both detection and tracking perspectives, making it lightweight, real-time, and efficient, thereby enhancing its performance and making it better suited for urban public transportation road scenarios.
[0008] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0009] A method for online detection and tracking of vehicles and pedestrians based on an improved ByteTrack, with the following specific steps:
[0010] Step 1: Optimize and improve the original ByteTrack algorithm model from both detection and tracking perspectives;
[0011] Step 2: Add the ReID appearance feature extraction module to the original model to adapt to the complexity and variability of the scene;
[0012] Step 3: Use model compression acceleration tools to alleviate the drawbacks of SDE models and meet online real-time requirements;
[0013] Step 4: Design and develop the system's front-end UI and implement the ByteTrack improved model on the system;
[0014] Step 5: Combine the ByteTrack improved model to implement and improve the system backend functions.
[0015] Furthermore, the detection part in step 1 upgrades and improves the YOLOX algorithm used in the original ByteTrack algorithm model. Specifically, in the Backbone and Neck parts of the model, the CSP2_1 module at the end is replaced with a Transformer encoder block (TEB). Each TEB includes two sub-layers: the first is a multi-head attention layer, and the second (MLP) is a fully connected layer. Residual connections are used between the sub-layers. In addition, an ACmix attention module is added to the Neck part. In the Head part of the model, a new branch for predicting small targets is added for the low-level, high-resolution feature maps. The new branch uses a Coupled Head structure, while the original three branches still use a Decoupled Head structure. Each Decoupled Head structure has three branches before Concat: cls_output, obj_output, and reg_output, which predict the category, foreground or background, and position coordinates (x, y, w, h) of the target box, respectively. The loss function of the foreground / background prediction branch is changed from BCELoss to FocalLoss.
[0016] Furthermore, the tracking part in step 1 includes improvements and optimizations in the two subtasks of motion model (state estimation) and data association;
[0017] The improvement to the motion model (state estimation) task is as follows: an Unscented Kalman Filter (UKF) is used for state estimation, and the aspect ratio (width / height) in the UKF state vector is replaced with the width. The state vector of the tracked target is therefore the 8-dimensional vector $x = [x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]^T$;
[0018] The optimization and improvement of the data association task are as follows: The improved Volgenant-Jonker algorithm, VJ-IMP, is used instead of the Hungarian algorithm; when setting the matching cost matrix, the simple weighted sum of appearance and motion metrics is abandoned. Instead, spatial similarity and appearance similarity are fused together, incorporating cosine distance to eliminate incorrect matches. A loss matrix combining motion and appearance information is created, as detailed below:
[0019]
[0020] where $C_{i,j}$ is the element in the i-th row and j-th column of the matching cost matrix; $d^{iou}_{i,j}$ is the IoU distance between the predicted bounding box of the i-th trajectory segment and the j-th detected bounding box, representing the motion loss; $d^{cos}_{i,j}$ is the cosine distance between the appearance descriptor of trajectory segment i and the descriptor of new detection j, representing the appearance loss; $\theta_{iou}$ is the proximity threshold, set to 0.5, used to discard pairs of trajectory segments and detections that are unlikely to match; and $\theta_{emb}$ is the appearance threshold, set to 0.5, used to separate positive from negative associations between the appearance state of a trajectory segment and a detection embedding vector;
[0021] For targets that are similar in appearance and IoU, a smaller loss is applied; if the appearance similarity exceeds the threshold but the IoU similarity does not, the appearance loss is used as the determining factor for the loss, and vice versa; otherwise, the loss is set to 1; the elements in the matching cost matrix are updated according to this rule.
[0022] Furthermore, in step 1, a lightweight interpolation algorithm, namely Gaussian smooth interpolation (GSI), is used in the post-processing part of the ByteTrack model to fill the trajectory gaps caused by missing detection. This GSI algorithm uses Gaussian process regression to simulate nonlinear motion.
[0023] Furthermore, in step 2, ReID is added to the BYTE detection association method of ByteTrack to extract the appearance features of pedestrians and vehicles and measure their distance, and the appearance model is combined with the motion model UKF; the unsupervised method Cluster Contrast ReID is selected to extract ReID features; data related to vehicles and pedestrians from various public datasets are integrated, and two data augmentation methods, Mosaic and MixUp, are used.
[0024] Furthermore, in step 3, once ReID features are added, the MOT algorithm becomes an SDE (Separate Detection and Embedding) type algorithm. The original ByteTrack model is built on the PyTorch framework; ONNX is used to convert the model between frameworks, and the model compression and acceleration tools PocketFlow, TVM, and TensorRT are then applied to meet the online, real-time requirements of traffic road scenarios.
[0025] Furthermore, in step 4, the system front-end GUI interface is designed and developed using PyQt5; in the UI interface, the ByteTrack improved and optimized model and other comparative model parameters are specifically selected to detect and track the target. If no specific model is selected, the default model is used.
[0026] Furthermore, the backend functions of the system in step 5 include: single-lens tracking, multi-category tracking, trajectory drawing, and pedestrian / vehicle traffic statistics; trajectory drawing, that is, drawing the trajectory curve of the tracked target based on the center position of the moving object window outline, and using different colored curves to distinguish the targets; pedestrian / vehicle traffic statistics, that is, realizing real-time deduplication counting of dynamic pedestrian / vehicle traffic, and real-time monitoring of pedestrian / vehicle traffic on traffic roads and checkpoints; detecting the target source of the tracking count by selecting video or image files, or processing the images captured by the camera connected to the device in real time online; specifically as follows:
[0027] When selecting a video file for detection and tracking: Clicking the video button on the left will bring up a window to select a video file. If you exit the video file selection window without selecting a video file, the text box next to the video button on the left will display "Real-time video not selected". Selecting an MP4 or AVI video file will display the video screen, and the text box next to the video button on the left will display the name of the selected video file. The target is marked in the middle frame, and the right side displays the time taken, number of targets, confidence level, and location coordinates. If you want to specify a target for tracking, you can select it from the target drop-down selection box on the right. The screen will pause while selecting and wait for the selection to complete. The mark box on the screen will then be positioned on the selected target. In addition, you can switch between target detection, tracking, and counting functions. Selecting the option in the lower left corner will switch between detection, tracking, and counting functions. Selecting "Target Detection" will mark the category and confidence level on the target. Selecting "Target Tracking" will mark the category and count the target. Selecting "Tracking Counting" will mark the motion trajectory on the target and count the target. The target detection box and the target's trajectory curve are distinguished by different colored curves and rectangles.
[0028] When selecting an image for object detection: Click the image selection button on the left. Similar to selecting a video for object tracking, a pop-up image file selection interface will appear. Select an image for detection. The text box next to the image button on the left will display the name of the selected file. You can select a specific object for focused detection. The "Object Tracking", "Object Detection", and "Tracking Count" functions are the same as for videos. Due to the static nature of images, switching between "Tracking Count" and "Object Tracking" has a similar effect. "Tracking Count" will display the starting point of the trajectory in the middle of the object detection box.
[0029] When using a camera for detection and tracking: Clicking the camera button on the left will automatically open the currently connected camera device, and the detection and tracking marker information will also be displayed on the interface. Other functions and usage methods are the same as those for selecting video for target detection.
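The deduplicated pedestrian / vehicle traffic count described for the backend can be illustrated with a short Python sketch: because every tracked target carries a unique ID, a target seen in many frames is counted once per category. The class name `TrafficCounter` and the `(track_id, category)` input format are hypothetical and chosen only for this example.

```python
from collections import defaultdict


class TrafficCounter:
    """Counts each tracked target once per category, based on its track ID."""

    def __init__(self):
        # category name -> set of track IDs already counted
        self._seen = defaultdict(set)

    def update(self, tracks):
        """tracks: iterable of (track_id, category) pairs for the current frame."""
        for track_id, category in tracks:
            self._seen[category].add(track_id)

    def counts(self):
        """Return the running deduplicated count per category."""
        return {category: len(ids) for category, ids in self._seen.items()}


counter = TrafficCounter()
counter.update([(1, "car"), (2, "pedestrian")])               # frame 1
counter.update([(1, "car"), (3, "car"), (2, "pedestrian")])   # frame 2
totals = counter.counts()
```

Here the car with ID 1 appears in both frames but contributes only once to the car count.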
[0030] The beneficial effects of adopting the above technical solution are as follows: The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack provided by this invention proposes a new tracking and detection model based on the ByteTrack model. According to the characteristics of the TBD paradigm, ByteTrack is comprehensively and multi-layeredly optimized and improved, resulting in an overall improvement in its performance in urban public transportation road scenarios. First, this invention combines pedestrian and vehicle detection and tracking, improving the model's adaptability to vehicle tracking. Simultaneously, the tracked vehicles are categorized into multiple types, such as bicycles, cars, trucks, buses, and tricycles, and pedestrian and vehicle traffic statistics are implemented for different categories, expanding the application scope of ByteTrack. Second, this invention filters vehicle and pedestrian-related data from several major public datasets, including COCO, MOT17, MOT20, VisDrone, and VeRi, selecting data more suitable for traffic road scenarios to train, validate, and test the algorithm model. Furthermore, it employs two data augmentation methods, Mosaic and MixUp, to prevent overfitting of the algorithm model and enhance its robustness and generalization. Finally, by improving YOLOX, a higher MOTA index is obtained; by extracting appearance features using the unsupervised method Cluster Contrast ReID, a higher IDF1 index is achieved; by predicting and updating the bounding boxes of trajectories in subsequent frames using UKF and by using the VJ-IMP algorithm and improving its cost matrix, a higher HOTA index is obtained; and by using model compression acceleration tools PocketFlow, TVM, and TensorRT, the computational cost of the algorithm is reduced without affecting accuracy, resulting in a higher FPS index. Combining these improvement strategies, the overall performance of the algorithm model is improved. 
Attached Figure Description
[0031] Figure 1 is a schematic diagram of the improvements to the online tracking part of the ByteTrack model provided in an embodiment of the present invention;
[0032] Figure 2 is a diagram of the overall structure of the improved YOLOX-s model provided in an embodiment of the present invention;
[0033] Figure 3 is a structural diagram of the Backbone modules of the improved YOLOX-s model provided in an embodiment of the present invention;
[0034] Figure 4 is a structural diagram of the Neck part of the improved YOLOX-s model provided in an embodiment of the present invention;
[0035] Figure 5 is an expanded structural diagram of the Head part of the improved YOLOX-s model provided in an embodiment of the present invention;
[0036] Figure 6 is a flowchart of the UKF algorithm as applied to MOT, provided in an embodiment of the present invention;
[0037] Figure 7 is a schematic diagram of the improvements to the post-processing part of the ByteTrack model provided in an embodiment of the present invention;
[0038] Figure 8 is a schematic diagram of the overall front-end UI of the system provided in an embodiment of the present invention;
[0039] Figure 9 is a schematic diagram of the video file selection interface provided in an embodiment of the present invention;
[0040] Figure 10 shows the effect of selecting a video file for tracking and counting, provided in an embodiment of the present invention;
[0041] Figure 11 shows the effect of tracking a specific target when a video file is selected for detection and tracking, provided in an embodiment of the present invention;
[0042] Figure 12 shows the effect of switching to "target detection" when a video file is selected for detection and tracking, provided in an embodiment of the present invention;
[0043] Figure 13 shows the effect of switching to "target tracking" when a video file is selected for detection and tracking, provided in an embodiment of the present invention;
[0044] Figure 14 shows the effect of switching to "target detection" when an image file is selected for detection and tracking, provided in an embodiment of the present invention;
[0045] Figure 15 shows the effect of switching to "target tracking" when an image file is selected for detection and tracking, provided in an embodiment of the present invention;
[0046] Figure 16 shows the effect of switching to "tracking count" when an image file is selected for detection and tracking, provided in an embodiment of the present invention;
[0047] Figure 17 shows the effect of detecting a specific target when an image file is selected for detection and tracking, provided in an embodiment of the present invention;
[0048] Figure 18 is a schematic diagram of detection and tracking performed by opening the camera, provided in an embodiment of the present invention.
Detailed Implementation
[0049] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.
[0050] The application scenarios of this invention include, but are not limited to, urban traffic roads and intelligent traffic monitoring systems of relevant regulatory departments, which call the functional modules of this invention to process the captured monitoring videos or images.
[0051] The basic assumptions of this embodiment are as follows: the method of the present invention has been successfully integrated with the intelligent traffic monitoring system of the relevant regulatory authorities, the relevant hardware equipment can be used normally, and the system environment configuration is complete.
[0052] The state equation and detection equation of the target motion tracking system are as follows:
[0053] $$X_k = f(X_{k-1}) + Q_k, \qquad Y_k = h(X_k) + R_k$$
[0054] Assume that the effect of the process noise $Q_k$ on the system state transition process is linear, and that the effect of the detection noise $R_k$ on the system detection process is also linear. The process noise $Q_k$ and the detection noise $R_k$ therefore appear in the state equation and the detection equation as linearly additive terms, so $Q_k$ has the same dimension as $X_k$, and $R_k$ has the same dimension as $Y_k$.
[0055] Suppose we have the following known conditions:
[0056] A Gaussian state random vector $X$ of dimension $n_X$, denoted $n_X = n$;
[0057] The posterior mean and posterior covariance of $X$ at time $k-1$: $\hat{\mu}_{k-1}$ and $\hat{\Sigma}_{k-1}$;
[0058] Gaussian process noise random vector: $Q \sim N(0, \Sigma_Q)$;
[0059] State transition function: $f(X_{k-1})$;
[0060] A Gaussian detection random vector $Z$ of dimension $n_Z$;
[0061] The detection value of $Z$ at time $k$: $z_k$;
[0062] Gaussian detection noise random vector: $R \sim N(0, \Sigma_R)$;
[0063] Detection function: $h(X_k)$.
[0064] The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack provided in this embodiment is as follows.
[0065] Step 1: Optimize and improve the original ByteTrack algorithm model from the perspectives of detection and tracking.
[0066] The overall optimization and improvement strategy for online tracking is shown in Figure 1 and is explained below.
[0067] In the detection section, the YOLOX algorithm used in the original ByteTrack algorithm model is upgraded and improved.
[0068] The performance of the target detector is a key factor for MOT models based on the TBD paradigm. This invention is based on the YOLOX-s model. The overall structure of the improved model is shown in Figure 2, and the CBS, Focus, CSP1_X, and SimCSPSPPF modules are detailed in Figure 3. In the Backbone and Neck parts of the model, the CSP2_1 modules at the ends are replaced with Transformer encoder blocks (TEBs), as shown in Figure 4. Each TEB contains two sub-layers: the first is a multi-head attention layer, and the second (MLP) is a fully connected layer. Residual connections are used between the sub-layers. In addition, an ACmix attention module is added to the Neck part; its structure is also shown in Figure 4. In the Head part of the model, a new branch for predicting small targets is added for the low-level, high-resolution feature maps, which allows better adaptation to drastic changes in target scale. To reduce computational and storage costs, the new branch uses a Coupled Head structure, as shown in Figure 5, while the original three branches still use the Decoupled Head structure. Each Decoupled Head structure has three branches before Concat: cls_output, obj_output, and reg_output, which predict the category, foreground or background, and position coordinates (x, y, w, h) of the bounding box, respectively. Changing the loss function of the foreground / background prediction branch from BCELoss to FocalLoss can accelerate convergence.
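To make the change from BCELoss to FocalLoss on the foreground / background branch concrete, the following Python sketch compares the two losses for a single predicted probability. It is only an illustration: the values `gamma=2.0` and `alpha=0.25` are the commonly used focal loss defaults and are an assumption here, since the text does not state which values are used.

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed values, not taken from the patent text.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confidently correct foreground prediction
hard = focal_loss(0.1, 1)   # badly misclassified foreground prediction
```

The factor $(1 - p_t)^\gamma$ shrinks the loss of easy examples much more than that of hard ones, so training focuses on the hard foreground / background decisions.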
[0069] The tracking component includes improvements and optimizations in two subtasks: motion model (state estimation) and data association.
[0070] The improvements to the motion model (state estimation) task are as follows:
[0071] When a Kalman filter (KF) is used for multi-target tracking, two pieces of information about each trajectory must be estimated: the mean and the covariance. (1) Mean: the position information of the target, i.e., the KF state representation. ByteTrack estimates the aspect ratio of the bounding box; however, if the predicted bounding box size fits the detected bounding box size closely, IoU matching can be very robust. Therefore, the aspect ratio (width / height) in the UKF state vector is replaced with the width, and the state vector of the tracked target is the 8-dimensional vector $x = [x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]^T$. (2) Covariance: the covariance matrix of the target trajectory, an 8×8 diagonal matrix. The larger the values in the matrix, the greater the uncertainty.
[0072] This embodiment employs an unscented Kalman filter (UKF). The main idea is that approximating the probability distribution of a nonlinear function is easier than approximating the nonlinear function itself. The computational complexity of the UKF is comparable to that of the extended Kalman filter (EKF), but the UKF offers higher estimation accuracy and is simpler to implement. As shown in Figure 6, the specific steps are as follows.
[0073] Step 1.1: Initialization.
[0074] Step 1.1.1: Select initial filter values;
[0075] Calculate the initial mean of the corresponding state variables based on the target detection result z0 and the measurement matrix H. Set initial values for the covariance of the state variables.
[0076] Step 1.1.2: Select the unscented transform parameters;
[0077] Define the values of the scaled unscented transform parameters α, β, κ, and λ, where the parameter λ satisfies:
[0078] $$\lambda = \alpha^2 (n + \kappa) - n \tag{2}$$
[0079] Here, α and κ are scaling parameters that determine how far the 2n+1 sigma points are spread from the mean. α satisfies $10^{-4} \le \alpha \le 1$ and is usually chosen to be small to avoid nonlocal effects in strongly nonlinear systems; κ satisfies κ ≥ 0 and is usually taken as κ = 3 − n or κ = 0. The parameter β introduces higher-order moment information about the probability distribution of the random variable. When the distribution is exactly Gaussian, β = 2 is the optimal choice.
[0080] Step 1.1.3: Calculate the sigma point weights;
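[0081] Calculate the weight of each of the 2n+1 sigma points from the values of the unscented transform parameters, using the following formulas:
[0082] $$W_m^{(i)} = \begin{cases} \dfrac{\lambda}{n+\lambda}, & i = 0 \\ \dfrac{1}{2(n+\lambda)}, & i = 1, \dots, 2n \end{cases} \tag{3}$$
[0083] $$W_c^{(i)} = \begin{cases} \dfrac{\lambda}{n+\lambda} + (1 - \alpha^2 + \beta), & i = 0 \\ \dfrac{1}{2(n+\lambda)}, & i = 1, \dots, 2n \end{cases} \tag{4}$$
[0084] where $W_m^{(i)}$ is the weight of the i-th sigma point when calculating the approximate mean, and $W_c^{(i)}$ is the weight of the i-th sigma point when calculating the approximate covariance.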
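The parameter and weight initialization of step 1.1 can be sketched in a few lines of Python. The sketch assumes the standard scaled unscented transform weights (center weight λ/(n+λ), with 1 − α² + β added for the covariance weight, and 1/(2(n+λ)) for the remaining 2n points); the function name `ukf_weights` and the default parameter values are illustrative choices only.

```python
def ukf_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    """Return (lam, w_m, w_c) for an n-dimensional state.

    lam follows formula (2); w_m and w_c hold the 2n+1 mean and
    covariance weights of the scaled unscented transform.
    """
    lam = alpha ** 2 * (n + kappa) - n
    center = lam / (n + lam)
    others = 1.0 / (2.0 * (n + lam))
    w_m = [center] + [others] * (2 * n)
    w_c = [center + (1.0 - alpha ** 2 + beta)] + [others] * (2 * n)
    return lam, w_m, w_c


# Example for the 8-dimensional tracking state with alpha = 0.5.
lam, w_m, w_c = ukf_weights(8, alpha=0.5, beta=2.0, kappa=0.0)
```

The mean weights always sum to 1, while the center weight can be negative when α is small; this is expected behavior of the scaled transform.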
[0085] For k = 1, 2, 3, ..., repeat steps 1.2-1.5:
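[0086] Step 1.2: Perform sigma sampling on the posterior probability distribution of the state variable $X_{k-1}$ at time k-1; that is, obtain 2n+1 sigma points according to formula (5), forming an n×(2n+1) point-set matrix.
[0087] $$\chi_{k-1}^{(i)} = \begin{cases} \hat{\mu}_{k-1}, & i = 0 \\ \hat{\mu}_{k-1} + \left(\sqrt{(n+\lambda)\,\hat{\Sigma}_{k-1}}\right)_{i}, & i = 1, \dots, n \\ \hat{\mu}_{k-1} - \left(\sqrt{(n+\lambda)\,\hat{\Sigma}_{k-1}}\right)_{i-n}, & i = n+1, \dots, 2n \end{cases} \tag{5}$$
[0088] where $\hat{\mu}_{k-1}$ is the mean of the trajectory at time k-1, i.e., the mean of $X_{k-1}$; $\left(\sqrt{(n+\lambda)\,\hat{\Sigma}_{k-1}}\right)_{i}$ denotes the (i-1)-th column, counting from zero, of the lower triangular matrix obtained by Cholesky decomposition of the matrix $(n+\lambda)\,\hat{\Sigma}_{k-1}$; and $\left(\sqrt{(n+\lambda)\,\hat{\Sigma}_{k-1}}\right)_{i-n}$ is defined similarly;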
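A minimal Python sketch of the sigma sampling in step 1.2 follows. It assumes the usual symmetric construction: one point at the mean plus and minus each column of the Cholesky factor of (n+λ) times the covariance. Points are returned as a list of 2n+1 vectors (the columns of the point-set matrix), and the helper names are hypothetical.

```python
import math


def cholesky(a):
    """Lower triangular L with L * L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sigma_points(mean, cov, lam):
    """Return 2n+1 sigma points; requires n + lam > 0."""
    n = len(mean)
    scaled = [[(n + lam) * cov[r][c] for c in range(n)] for r in range(n)]
    low = cholesky(scaled)
    columns = [[low[r][i] for r in range(n)] for i in range(n)]
    points = [list(mean)]
    points += [[m + c for m, c in zip(mean, col)] for col in columns]
    points += [[m - c for m, c in zip(mean, col)] for col in columns]
    return points


mean = [1.0, 2.0]
cov = [[4.0, 2.0], [2.0, 3.0]]
points = sigma_points(mean, cov, lam=1.0)
```

Because the points are symmetric around the mean, their plain average equals the mean, and the outer products of the first n offsets add up to (n+λ) times the covariance.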
[0089] Step 1.3: Prediction phase, which is to predict the state of the trajectory at time k based on the state of the trajectory at time k-1.
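[0090] Step 1.3.1: Nonlinear transformation by the state transition function:
[0091] $$\chi_{k|k-1}^{(i)} = f\left(\chi_{k-1}^{(i)}\right) = F\,\chi_{k-1}^{(i)}, \quad i = 0, \dots, 2n \tag{6}$$
[0092] In the formula, $\chi_{k-1}^{(i)}$ are the sigma points sampled around the mean state of the trajectory at time k-1, and F is the state transition matrix. Formula (6) predicts the state at time k. Written out for a single state vector, the transformation has the following matrix form:
[0093] $$\begin{bmatrix} x_c' \\ y_c' \\ w' \\ h' \\ \dot{x}_c' \\ \dot{y}_c' \\ \dot{w}' \\ \dot{h}' \end{bmatrix} = \begin{bmatrix} I_4 & dt \cdot I_4 \\ 0 & I_4 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ w \\ h \\ \dot{x}_c \\ \dot{y}_c \\ \dot{w} \\ \dot{h} \end{bmatrix}$$
[0094] where the 8-dimensional vector representing the target position information consists of the center coordinates $(x_c, y_c)$ of the target box, the width w (used here in place of the aspect ratio), the height h, and their respective rates of change; $I_4$ is the 4×4 identity matrix. dt in matrix F is the frame difference between the current frame and the previous frame. Expanding the matrix multiplication on the right-hand side of the equals sign gives $x_c' = x_c + \dot{x}_c \cdot dt$, and likewise for the other components. That is, the filter here uses a constant-velocity model.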
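The constant-velocity state transition can be sketched in Python as follows. The state layout `[x_c, y_c, w, h, vx, vy, vw, vh]` follows the text's replacement of the aspect ratio by the width and is otherwise an assumption of this example, as are the function names.

```python
def transition_matrix(dt=1.0, dim=4):
    """Build F = [[I, dt*I], [0, I]] for `dim` position and `dim` velocity terms."""
    n = 2 * dim
    f = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for r in range(dim):
        f[r][r + dim] = dt  # each position term advances by velocity * dt
    return f


def apply_transition(f, state):
    """Matrix-vector product F * state."""
    return [sum(row[c] * state[c] for c in range(len(state))) for row in f]


# State: center (10, 20), width 4, height 8, followed by their velocities.
state = [10.0, 20.0, 4.0, 8.0, 1.0, -1.0, 0.5, 0.0]
predicted = apply_transition(transition_matrix(dt=1.0), state)
```

With `dt = 1` frame, each position component is advanced by its velocity and the velocities stay unchanged, which is exactly the constant-velocity assumption.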
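[0095] Step 1.3.2: Compute by weighting the prior probability distribution of the state variable $X_k$ at time k, i.e., the approximate mean $\bar{\mu}_k$ and the approximate covariance $\bar{\Sigma}_k$:
[0096] $$\bar{\mu}_k = \sum_{i=0}^{2n} W_m^{(i)}\, \chi_{k|k-1}^{(i)}$$
[0097] $$\bar{\Sigma}_k = \sum_{i=0}^{2n} W_c^{(i)} \left(\chi_{k|k-1}^{(i)} - \bar{\mu}_k\right)\left(\chi_{k|k-1}^{(i)} - \bar{\mu}_k\right)^T + \Sigma_Q$$
[0098] In the formula, $\Sigma_Q$ is the process noise matrix, which reflects the reliability of the motion system as a whole and is initialized to very small values.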
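The weighted mean and covariance of step 1.3.2 can be sketched as below, assuming the standard unscented transform: the mean is the weighted sum of the transformed sigma points, and the covariance is the weighted sum of outer products of the deviations plus the additive noise matrix. The function name and the tiny one-dimensional example are illustrative only.

```python
def unscented_moments(points, w_m, w_c, noise_cov):
    """Approximate mean and covariance from transformed sigma points.

    points: list of 2n+1 vectors; w_m, w_c: mean and covariance weights;
    noise_cov: additive noise covariance (n x n list of lists).
    """
    n = len(points[0])
    mean = [sum(w * p[d] for w, p in zip(w_m, points)) for d in range(n)]
    cov = [[noise_cov[r][c] for c in range(n)] for r in range(n)]
    for w, p in zip(w_c, points):
        diff = [p[d] - mean[d] for d in range(n)]
        for r in range(n):
            for c in range(n):
                cov[r][c] += w * diff[r] * diff[c]
    return mean, cov


# One-dimensional example: three sigma points around 0 with unit spread.
points_1d = [[0.0], [1.0], [-1.0]]
mean_1d, cov_1d = unscented_moments(
    points_1d, w_m=[0.0, 0.5, 0.5], w_c=[2.0, 0.5, 0.5], noise_cov=[[0.1]]
)
```

In this example the center point coincides with the mean, so its covariance weight has no effect, and the result is the unit variance of the outer points plus the noise term 0.1.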
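[0099] Step 1.4: Perform sigma sampling on the prior probability distribution of the state variable $X_k$ at time k:
[0100] $$\chi_{k}^{(i)} = \begin{cases} \bar{\mu}_{k}, & i = 0 \\ \bar{\mu}_{k} + \left(\sqrt{(n+\lambda)\,\bar{\Sigma}_{k}}\right)_{i}, & i = 1, \dots, n \\ \bar{\mu}_{k} - \left(\sqrt{(n+\lambda)\,\bar{\Sigma}_{k}}\right)_{i-n}, & i = n+1, \dots, 2n \end{cases}$$
[0101] To reduce the computational load, step 1.4 can be omitted and the sigma point set from step 1.3 used directly in the next step, at the cost of some accuracy.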
[0102] Step 1.5: Update phase, which is to adjust the position and size information of the trajectory box associated with the detector at time k based on the detection box obtained by the detector at time k.
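[0103] Step 1.5.1: Nonlinear transformation by the detection function:
[0104] $$\zeta_k^{(i)} = h\left(\chi_k^{(i)}\right) = H\,\chi_k^{(i)}, \quad i = 0, \dots, 2n$$
[0105] In the formula, H is the measurement matrix, i.e., the matrix that transforms the mean vector of the trajectory bounding box into the space of the detection mean vector $z = [x_c, y_c, w, h]^T$.
[0106] Step 1.5.2: Compute by weighting the probability distribution of the detection quantity $Z_k$ at time k, i.e., the approximate mean $\hat{z}_k$ and the approximate covariance $\Sigma_{zz,k}$:
[0107] $$\hat{z}_k = \sum_{i=0}^{2n} W_m^{(i)}\, \zeta_k^{(i)}, \qquad \Sigma_{zz,k} = \sum_{i=0}^{2n} W_c^{(i)} \left(\zeta_k^{(i)} - \hat{z}_k\right)\left(\zeta_k^{(i)} - \hat{z}_k\right)^T + \Sigma_R$$
[0108] In the formula, $\Sigma_R$ is the noise matrix of the detector, a 4×4 diagonal matrix whose diagonal values are the noise of the center point coordinates and of the width and height (x, y, w, h) of the detection box. During initialization, the noise of the center point should be smaller than the noise of the width and height.
[0109] Step 1.5.3: Calculate the cross-covariance $\Sigma_{xz,k}$ between the state variable and the detection quantity:
[0110] $$\Sigma_{xz,k} = \sum_{i=0}^{2n} W_c^{(i)} \left(\chi_k^{(i)} - \bar{\mu}_k\right)\left(\zeta_k^{(i)} - \hat{z}_k\right)^T$$
[0111] Step 1.5.4: Calculate the Kalman gain $K_k$:
[0112] $$K_k = \Sigma_{xz,k}\, \Sigma_{zz,k}^{-1}$$
[0113] Step 1.5.5: Calculate the posterior probability distribution of the state variable $X_k$ at time k, i.e., the updated approximate mean $\hat{\mu}_k$ and approximate covariance $\hat{\Sigma}_k$, where $\bar{\mu}_k$ and $\bar{\Sigma}_k$ are the prior mean and covariance from the prediction phase and $z_k$ is the detection value at time k:
[0114] $$\hat{\mu}_k = \bar{\mu}_k + K_k \left(z_k - \hat{z}_k\right)$$
[0115] $$\hat{\Sigma}_k = \bar{\Sigma}_k - K_k\, \Sigma_{zz,k}\, K_k^T$$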
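Steps 1.2 to 1.5 can be illustrated end to end with a one-dimensional worked example in Python, where every matrix reduces to a scalar. This is a sketch under stated assumptions, not the embodiment's 8-dimensional filter: it uses the standard scaled unscented transform, the function name `ukf_step_1d` is hypothetical, and the parameters α = 1, β = 2, κ = 3 − n = 2 are chosen only to keep the numbers simple.

```python
import math


def ukf_step_1d(mean, var, z, f, h, q, r, alpha=1.0, beta=2.0, kappa=2.0):
    """One predict + update cycle of a scalar UKF with additive noise q and r."""
    n = 1
    lam = alpha ** 2 * (n + kappa) - n
    w_m = [lam / (n + lam), 0.5 / (n + lam), 0.5 / (n + lam)]
    w_c = [w_m[0] + 1.0 - alpha ** 2 + beta, w_m[1], w_m[2]]

    def sigmas(m, v):
        spread = math.sqrt((n + lam) * v)
        return [m, m + spread, m - spread]

    # Prediction: propagate sigma points through the state transition f.
    chi = [f(x) for x in sigmas(mean, var)]
    m_prior = sum(w * x for w, x in zip(w_m, chi))
    v_prior = sum(w * (x - m_prior) ** 2 for w, x in zip(w_c, chi)) + q

    # Resample sigma points from the prior, then map them through h.
    chi = sigmas(m_prior, v_prior)
    zeta = [h(x) for x in chi]
    z_mean = sum(w * y for w, y in zip(w_m, zeta))
    s_zz = sum(w * (y - z_mean) ** 2 for w, y in zip(w_c, zeta)) + r
    s_xz = sum(w * (x - m_prior) * (y - z_mean) for w, x, y in zip(w_c, chi, zeta))

    # Update: Kalman gain, posterior mean and posterior variance.
    gain = s_xz / s_zz
    m_post = m_prior + gain * (z - z_mean)
    v_post = v_prior - gain * s_zz * gain
    return m_post, v_post


m_post, v_post = ukf_step_1d(0.0, 1.0, z=1.0, f=lambda x: x, h=lambda x: x, q=0.0, r=1.0)
```

With identity transition and detection functions, the sketch reproduces the ordinary Kalman filter result: equal prior and detection variances give a gain of 0.5, so the posterior mean lies halfway between prior and detection.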
[0116] The data association task has been optimized and improved as follows:
[0117] Data association is a crucial step in the MOT task, completing the matching of multiple target (ID) pairs between frames. This embodiment uses the improved Volgenant-Jonker algorithm, VJ-IMP, instead of the Hungarian algorithm to improve matching speed and reduce memory usage. When setting the matching cost matrix, the simple weighted sum of appearance and motion metrics is abandoned. Instead, spatial similarity and appearance similarity are fused, incorporating cosine distance to eliminate incorrect matches. A loss matrix combining motion and appearance information is created, as detailed below:
[0118]
[0119] In the formula, the left-hand side is the element in the i-th row and j-th column of the matching cost matrix; the first distance term is the IoU distance between the predicted bounding box of the i-th trajectory segment and the j-th detected bounding box, representing the motion loss; the second distance term is the cosine distance between the appearance descriptor of trajectory segment i and the descriptor of new detection j, representing the appearance loss; the proximity threshold, set to 0.5, is used to discard trajectory segment and detection pairs that are unlikely to match; the appearance threshold, set to 0.5, is used to separate positive from negative associations between the appearance state of the trajectory segment and the detection embedding vector. For pairs that are similar in both appearance and IoU, the smaller loss is applied; if the appearance similarity passes its threshold but the IoU similarity does not, the appearance loss determines the loss; if the IoU similarity passes its threshold but the appearance similarity does not, the motion loss determines the loss; otherwise, the loss is set to 1. The elements of the matching cost matrix are updated according to this rule.
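The fusion rule described above can be sketched in a few lines of Python. This is one possible reading of the text, not the original formula: the names `fuse_cost`, `cost_matrix`, `theta_iou`, and `theta_emb` are hypothetical, "passes its threshold" is interpreted as the distance being below 0.5, and "the smaller loss" is interpreted as the minimum of the two distances. The assignment solver (VJ-IMP) is not sketched here.

```python
from typing import List


def fuse_cost(d_iou: float, d_cos: float,
              theta_iou: float = 0.5, theta_emb: float = 0.5) -> float:
    """Fuse motion (IoU) and appearance (cosine) distances for one pair.

    Assumption: a pair "passes" a threshold when its distance is below it.
    """
    iou_ok = d_iou < theta_iou   # boxes are spatially close enough
    emb_ok = d_cos < theta_emb   # embeddings are similar enough
    if iou_ok and emb_ok:
        return min(d_iou, d_cos)  # both agree: take the smaller loss
    if emb_ok:
        return d_cos              # only appearance passes
    if iou_ok:
        return d_iou              # only motion passes
    return 1.0                    # neither passes: unlikely match


def cost_matrix(iou_dist: List[List[float]],
                cos_dist: List[List[float]]) -> List[List[float]]:
    """Build the fused cost matrix (rows: tracks, columns: detections)."""
    return [[fuse_cost(a, b) for a, b in zip(row_iou, row_cos)]
            for row_iou, row_cos in zip(iou_dist, cos_dist)]


if __name__ == "__main__":
    # Two tracks, two detections, with made-up distances.
    print(cost_matrix([[0.2, 0.8], [0.7, 0.3]],
                      [[0.4, 0.3], [0.6, 0.9]]))
```

The resulting matrix can be passed to any linear assignment solver; pairs whose fused cost is 1 are the ones the rule marks as unlikely to match.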
[0120] In the post-processing stage, all sequence trajectories obtained from online tracking are interpolated to fill in the trajectory gaps caused by missing detections, thus obtaining the trajectories of all targets in the video. Linear interpolation is popular for this purpose due to its simplicity; however, its accuracy is limited because it does not use motion information. This embodiment employs a lightweight interpolation algorithm, namely Gaussian smooth interpolation (GSI), as shown in Figure 7. The GSI algorithm uses Gaussian process regression to model nonlinear motion, resulting in more accurate positioning and achieving a good trade-off between accuracy and efficiency.
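As an illustration of gap filling with Gaussian process regression, the following minimal sketch fits a zero-mean GP with an RBF kernel to one box coordinate over the observed frames and predicts it at the missing frames. It is not the embodiment's GSI implementation: the function names, the fixed `length_scale`, and the `noise` value are assumptions chosen for the example, and a full version would apply this to each of x, y, w, h.

```python
import math
from typing import List, Sequence


def rbf(a: float, b: float, length_scale: float) -> float:
    """RBF (squared-exponential) kernel between two frame indices."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_fill(frames: Sequence[float], values: Sequence[float],
            query_frames: Sequence[float],
            length_scale: float = 10.0, noise: float = 1e-2) -> List[float]:
    """Predict one coordinate at query_frames from observed (frame, value) pairs."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]  # center the data so a zero-mean GP applies
    n = len(frames)
    K = [[rbf(frames[i], frames[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)             # alpha = (K + noise * I)^-1 y
    return [mean + sum(rbf(q, frames[i], length_scale) * alpha[i]
                       for i in range(n))
            for q in query_frames]


if __name__ == "__main__":
    # The x-center of a box was observed on frames 0-2 and 5-7; frames 3-4 are missing.
    observed_frames = [0, 1, 2, 5, 6, 7]
    observed_x = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]
    print(gp_fill(observed_frames, observed_x, [3, 4]))
```

Unlike linear interpolation between the two nearest observations, the GP prediction depends on all observed frames, which is what lets it follow curved motion.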
[0121] Step 2: Add the ReID appearance feature extraction module to the original model to adapt to the complexity and variability of the scene.
[0122] ByteTrack's BYTE detection and association method effectively mitigates target loss caused by occlusion by distinguishing between high-score and low-score detection boxes and matching them in separate passes. It is simple and efficient, and has great potential for solving occlusion problems caused by congestion in urban public transportation scenarios. It has achieved high performance on the MOT17 and MOT20 datasets, where motion patterns are relatively simple. However, the motion patterns of vehicles and pedestrians in real-world traffic scenarios are generally more complex. Therefore, this embodiment incorporates ReID to extract the appearance features of pedestrians and vehicles and measure the distance between them, combining the appearance model with the motion model (UKF) to improve tracking accuracy. Specifically, the unsupervised method Cluster Contrast ReID is chosen for ReID feature extraction. Its accuracy surpasses that of many supervised algorithms and unsupervised domain-adaptive pedestrian re-identification methods, its framework is very simple, and it has also proved effective on vehicle re-identification datasets. This embodiment integrates vehicle and pedestrian-related data from major public datasets such as COCO, MOT17, MOT20, VisDrone, and VeRi, and employs two data augmentation methods, Mosaic and MixUp, to prevent overfitting of the algorithm model, enhance its robustness and generalization, and better reflect real-world traffic and road scenarios.
[0123] Step 3: Use model compression acceleration tools to alleviate the drawbacks of SDE models and meet online real-time requirements.
[0124] After incorporating ReID features, an MOT algorithm can follow one of two designs: SDE (Separate Detection and Embedding) or JDE (Joint Detection and Embedding). Real-world urban public transportation roads are complex and variable, labeled tracking video data is limited, and data for detection and ReID is available in large volume and is cheaper to label; this embodiment therefore adopts the SDE type algorithm. Although SDE-type MOT models have a high computational cost and slow inference speed, existing model compression and acceleration techniques can improve inference efficiency without sacrificing performance. The original ByteTrack model is based on the PyTorch framework. To compress and accelerate the improved YOLOX-s model and the Cluster Contrast ReID model, ONNX is used to convert models between frameworks. The model compression and acceleration tools PocketFlow, TVM, and TensorRT are then used to meet the online, real-time requirements of traffic road scenarios and to enable deployment on mobile devices with limited computing resources.
[0125] Step 4: Design and develop the system's front-end UI and implement the ByteTrack improved model on the system.
[0126] Step 5: Combine the ByteTrack improved model to implement and improve the system backend functions.
[0127] This embodiment uses PyQt5 to design and develop the system's front-end GUI. The overall interface is shown in Figure 8. It covers functions such as single-lens tracking, multi-category tracking, trajectory drawing, and pedestrian / vehicle traffic statistics. In the UI, users can choose between the improved and optimized ByteTrack model and other comparison models and parameters to detect and track targets. If no specific model is selected, the default model is used. Trajectory drawing plots the trajectory curve of each tracked target based on the center position of the moving object's bounding window, using curves of different colors to distinguish targets. Pedestrian / vehicle traffic statistics provide real-time, deduplicated counting of dynamic pedestrian / vehicle traffic, enabling real-time monitoring of pedestrian / vehicle traffic on roads and at checkpoints. The sources for detection, tracking, and counting can be video and image files, or footage captured by a connected camera and processed online in real time, as detailed below:
[0128] (1) Select a video file for detection and tracking: Clicking the video button on the left opens a window for selecting a video file, as shown in Figure 9. If the window is closed without a video file being selected, the text box next to the video button displays "Live video not selected". If an MP4 or AVI video file is selected for display, the text box next to the video button displays the name of the selected file, as shown in Figure 10; the targets are labeled in the middle frame, and the right side displays the time taken, number of targets, confidence level, and location coordinates, as shown in Figure 11. To track a specific target, select it from the target drop-down menu on the right; the picture pauses during selection, and once the selection is complete the marker box is positioned over the selected target. The options in the lower left corner switch between the target detection, tracking, and counting functions: selecting "Target Detection" marks each target with its category and confidence level, as shown in Figure 12; selecting "Target Tracking" categorizes and counts the targets, as shown in Figure 13; selecting "Tracking Count" marks the motion trajectory on each target and counts the targets, as shown in Figure 10. Target detection boxes and trajectory curves are distinguished by rectangles and curves of different colors.
[0129] (2) Select an image for target detection: Click the image selection button on the left. As with selecting a video for target tracking, an image file selection window appears; select an image for detection. The text box next to the image button on the left displays the name of the selected file. A specific object can be selected for focused detection. The "Target Tracking," "Target Detection," and "Tracking Count" functions are the same as for videos. Because images are static, "Tracking Count" and "Target Tracking" produce similar results; "Tracking Count" displays the starting point of the trajectory in the middle of the target detection box, as shown in Figures 14, 15, 16, and 17.
[0130] (3) Detection and tracking using a camera: Clicking the camera button on the left automatically opens the currently connected camera device, and the detection and tracking marker information is displayed on the interface, as shown in Figure 18. Other functions and usage methods are the same as those for selecting a video for object detection.
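The deduplicated counting and trajectory drawing described for the interface can be sketched as follows. This is an illustrative assumption rather than the system's actual backend: the class `TrackStats` and its methods are hypothetical names, and each tracker output is assumed to be a tuple of track ID, category, and box (x_c, y_c, w, h). Counting unique track IDs per category is what keeps a target from being counted again on every frame.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x_c, y_c, w, h)


class TrackStats:
    """Accumulates per-category unique IDs and per-track center trajectories."""

    def __init__(self) -> None:
        self.ids_by_category: Dict[str, Set[int]] = defaultdict(set)
        self.trajectories: Dict[int, List[Tuple[float, float]]] = defaultdict(list)

    def update(self, tracks: List[Tuple[int, str, Box]]) -> None:
        """Consume the tracker output for one frame."""
        for track_id, category, (xc, yc, _w, _h) in tracks:
            # A set ignores IDs already seen, so each target counts once.
            self.ids_by_category[category].add(track_id)
            # The box center is the next point of this target's trajectory curve.
            self.trajectories[track_id].append((xc, yc))

    def counts(self) -> Dict[str, int]:
        """Number of distinct targets seen so far, per category."""
        return {c: len(ids) for c, ids in self.ids_by_category.items()}


if __name__ == "__main__":
    stats = TrackStats()
    stats.update([(1, "car", (100.0, 50.0, 40.0, 20.0)),
                  (2, "pedestrian", (30.0, 80.0, 10.0, 25.0))])
    stats.update([(1, "car", (104.0, 50.0, 40.0, 20.0)),
                  (3, "car", (200.0, 60.0, 42.0, 21.0))])
    print(stats.counts())
    print(stats.trajectories[1])
```

A GUI layer would then draw each list in `trajectories` as a polyline in a color derived from the track ID.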
[0131] This invention proposes a new tracking and detection model based on the ByteTrack model. According to the characteristics of the TBD paradigm, it comprehensively and multi-dimensionally optimizes and improves ByteTrack, enhancing its overall performance in urban public transportation road scenarios. The overall improvement is divided into two aspects: detection and tracking. 1) In the detection part, the YOLOX-s model used by ByteTrack is upgraded and improved. 2) In the tracking part, improvements and optimizations are made in two sub-tasks: motion state estimation and association matching. ① In the motion state estimation task, the classic Kalman filter algorithm used to predict and update the target's position in the next frame is upgraded and improved. ② In the data association task, the Hungarian algorithm used to match the target IDs of the previous and current frames is upgraded and improved, and the corresponding cost matrix (involving IoU distance and ReID similarity) calculation method is modified and optimized.
[0132] Firstly, while research on detection and tracking technologies is extensive both domestically and internationally, it is somewhat one-sided and limited in its application to urban public transportation scenarios. Currently, most algorithms in the MOT (Multiple Object Tracking) field focus on pedestrians as the primary tracking target, and vehicle tracking research is mostly concentrated in the field of autonomous driving, resulting in low adaptability to traffic monitoring scenarios. ByteTrack primarily studies pedestrian tracking, and its tracking performance decreases significantly when applied to vehicle tracking. This invention combines pedestrian and vehicle detection and tracking to improve the model's adaptability to vehicle tracking. Furthermore, it categorizes tracked vehicles into various types, such as bicycles, cars, trucks, buses, and tricycles, and implements pedestrian and vehicle traffic statistics for the different categories, expanding the application scope of ByteTrack.
[0133] Secondly, in the field of MOT and related research, to demonstrate the superior performance of proposed algorithm models, the academic community generally uses the same single dataset with relatively simple motion patterns, and typically only focuses on pedestrians or vehicles. Datasets for vehicle tracking mainly come from the field of autonomous driving and consist of video data taken from the driver's perspective, which differs from the shooting angle of traffic monitoring systems. This invention filters vehicle and pedestrian-related data from several major public datasets, including COCO, MOT17, MOT20, VisDrone, and VeRi, selecting data that better fits traffic and road scenarios for training, validating, and testing the algorithm model. Furthermore, it employs two data augmentation methods, Mosaic and MixUp, to prevent overfitting and enhance the robustness and generalization of the algorithm model.
[0134] Finally, by improving and optimizing the ByteTrack model, higher evaluation metrics are achieved: a higher MOTA metric is obtained through improvements to YOLOX; a higher IDF1 metric is achieved by extracting appearance features using the unsupervised method Cluster Contrast ReID; a higher HOTA metric is obtained by predicting and updating the bounding boxes of trajectories in subsequent frames using UKF and by using the VJ-IMP algorithm and improving its cost matrix; and a higher FPS metric is achieved by using model compression acceleration tools PocketFlow, TVM, and TensorRT, which reduce the computational cost of the algorithm without affecting accuracy. By combining these improvement strategies, the overall performance of the algorithm model is improved.
[0135] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.
Claims
1. A method for online detection and tracking of vehicles and pedestrians based on an improved ByteTrack, characterized in that it includes the following steps: Step 1: Optimize and improve the original ByteTrack algorithm model from both detection and tracking perspectives; Regarding the detection aspect, the YOLOX algorithm used in the original ByteTrack algorithm model is upgraded and improved. Specifically, in the model's Backbone and Neck parts, the CSP2_1 module at the end is replaced with Transformer encoder blocks (TEBs). Each TEB includes two sub-layers: the first layer is a multi-head attention layer, and the second layer is an MLP, which is a fully connected layer. Residual connections are used between the sub-layers. At the same time, an attention mechanism ACmix module is added to the Neck part. In the model's Head part, a new branch for predicting small targets is added for the low-level, high-resolution feature maps. The new branch uses a Coupled Head structure, while the original three branches still use a Decoupled Head structure. Each Decoupled Head structure has three branches before Concat: cls_output, obj_output, and reg_output, which are used to predict the category, foreground or background, and position coordinates (x, y, w, h) of the target box, respectively. The loss function of the foreground / background prediction branch is changed from BCELoss to FocalLoss. The tracking aspect includes improvements and optimizations in two subtasks: motion modeling, i.e., state estimation, and data association. The improvement to the motion model, i.e., the state estimation task, is as follows: an Unscented Kalman Filter (UKF) is used for state estimation. The aspect ratio (width / height) in the UKF state vector is changed to width, represented by an 8-dimensional vector, that is, the state vector of the tracking target; Step 2: Add the ReID appearance feature extraction module to the original model to adapt to the complexity and variability of the scene; Step 3: Use model compression acceleration tools to alleviate the drawbacks of SDE models and meet online real-time requirements; Step 4: Design and develop the system's front-end UI and implement the ByteTrack improved model on the system; Step 5: Combine the ByteTrack improved model to implement and improve the system backend functions.
2. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack as described in claim 1, characterized in that: In the tracking aspect of step 1, the optimization and improvement of the data association task is as follows: the improved Volgenant-Jonker algorithm VJ-IMP is used instead of the Hungarian algorithm. When setting the matching cost matrix, the simple weighted summation of appearance and motion metrics is abandoned. Instead, spatial similarity and appearance similarity are fused, incorporating cosine distance to eliminate incorrect matches. This process creates a loss matrix that combines motion and appearance information, as given by formula (18); in the formula, the left-hand side is the element in the i-th row and j-th column of the matching cost matrix; the first distance term is the IoU distance between the predicted bounding box of the i-th trajectory segment and the j-th detected bounding box, representing the motion loss; the second distance term is the cosine distance between the appearance descriptor of trajectory segment i and the descriptor of new detection j, representing the appearance loss; the proximity threshold, set to 0.5, is used to discard trajectory segment and detection pairs that are unlikely to match; the appearance threshold, set to 0.5, is used to separate positive from negative associations between the appearance state of the trajectory segment and the detection embedding vector; for targets that are similar in both appearance and IoU, the smaller loss is applied; if the appearance similarity exceeds the threshold but the IoU similarity does not, the appearance loss is used as the determining factor for the loss, and vice versa; otherwise, the loss is set to 1; the elements in the matching cost matrix are updated according to this rule.
3. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack as described in claim 1, characterized in that: In step 1, a lightweight interpolation algorithm, Gaussian smooth interpolation (GSI), is used in the post-processing part of the ByteTrack model to fill the trajectory gaps caused by missing detection. The GSI algorithm uses Gaussian process regression to simulate nonlinear motion.
4. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack according to claim 1, characterized in that: In step 2, ReID is added to the BYTE detection association method of ByteTrack to extract the appearance features of pedestrians and vehicles and measure their distance. The appearance model is combined with UKF in the motion model. The unsupervised method Cluster Contrast ReID is selected to extract ReID features. Data related to vehicles and pedestrians from various public datasets are integrated and two data augmentation methods, Mosaic and MixUp, are used.
5. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack according to claim 4, characterized in that: In step 3, after adding ReID features, the MOT algorithm adopts an SDE-type algorithm; the ByteTrack original model is based on the PyTorch architecture, and ONNX is used to realize the mutual conversion between different frameworks. Then, the model compression acceleration tools PocketFlow, TVM and TensorRT are used to meet the online and real-time requirements of traffic and road scenarios.
6. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack according to claim 1, characterized in that: In step 4, PyQt5 is used to design and develop the system's front-end GUI; in the UI, the user can select the ByteTrack improved and optimized model and other comparison model parameters to detect and track the target, and if no specific model is selected, the default model is used.
7. The online vehicle and pedestrian detection and tracking method based on the improved ByteTrack according to claim 1, characterized in that: The system backend functions in step 5 include: single-lens tracking, multi-category tracking, trajectory drawing, and pedestrian / vehicle traffic statistics; trajectory drawing, which is to draw the running trajectory curve of the tracked target based on the center position of the moving object window outline, and use different colored curves to distinguish the target; pedestrian / vehicle traffic statistics, which is to realize the real-time deduplication counting of dynamic pedestrian / vehicle traffic, and monitor the pedestrian / vehicle traffic on traffic roads and checkpoints in real time; the target source for detection and tracking counting is selected as video or image files, or the images captured by the camera connected to the device are processed online in real time.