A pose tracking method and system based on an improved DeepSort algorithm

By improving the DeepSort algorithm with the YOLOX detection model and the BlazePose pose estimator, the invention addresses poor target association and inaccurate target regions in firefighter pose tracking, and achieves more accurate tracking and pose estimation results.

CN115937987B · Active · Publication Date: 2026-06-30 · CHENGDU PANORAMIC STAR TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHENGDU PANORAMIC STAR TECH CO LTD
Filing Date
2023-01-03
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

The existing DeepSort algorithm suffers from poor target association and inaccurate target region in firefighter attitude tracking tasks, resulting in poor tracking and attitude estimation performance.

Method used

The YOLOX model is used for target detection, and the BlazePose pose estimator is used to extract pose information. The DeepSort algorithm is improved by pose feature association and target region correction algorithms. The Mahalanobis distance, apparent cosine distance and pose cosine distance are used for target matching and correction.

Benefits of technology

It improves the accuracy of firefighters' tracking and the ability to extract attitude information, solves the problems of similar target appearance and inaccurate regional information, and achieves better target tracking and attitude estimation results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an improved DeepSort algorithm for pose tracking and a system, relating to the field of computer vision technology. The steps are as follows: acquiring the image to be detected; using the YOLOX model to perform target detection on the image to be detected to obtain candidate boxes, and then filtering them using a non-maximum suppression algorithm to obtain detection boxes; based on the results of the previous frame, using Kalman filtering to predict the target region of the current frame; calculating the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection boxes and the predicted target regions; using the Hungarian algorithm and cascaded matching to compare the similarity of targets between two consecutive frames to perform target matching and obtain the tracking result; and correcting the tracking result using the target's pose information. This invention optimizes the DeepSORT algorithm, resulting in more accurate human tracking and pose information extraction, and solves the problem of poor target tracking performance caused by similar target appearances and inaccurate target regions.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and more specifically to a pose tracking method and system based on an improved DeepSort algorithm.

Background Technology

[0002] Fire brigades are special forces undertaking urgent, difficult, and dangerous tasks such as firefighting and rescue. They bear the important responsibility of preventing and mitigating major safety risks and responding to various disasters and accidents. Daily training for fire and rescue teams is a crucial way to improve their combat effectiveness. Currently, methods for tracking firefighter posture mainly include sensor tracking, electromagnetic tracking, and sound source tracking. Most methods rely on wearable sensors to acquire firefighter posture information for tracking. However, due to the high intensity of firefighting and training movements, wearable devices have a high damage rate, leading to increased manufacturing and maintenance costs and failing to adequately meet the conditions for conducting fire and rescue training. With the rapid development of video surveillance technology, monitoring and analyzing fire and rescue operations through video surveillance offers convenience and efficiency. Target tracking and human posture estimation are key technologies for analyzing fire and rescue operations using video. Convolutional neural networks (CNNs) are increasingly being applied in human posture estimation research, and methods for extracting human posture information using CNNs are already quite mature.

[0003] In multi-target tracking algorithms, existing technologies extract target features and use filtering and matching algorithms to correlate targets in consecutive frames. However, for targets with similar appearances, ordinary appearance feature extraction cannot achieve good results, and the position and size of the target window depend entirely on the target detector, which is not conducive to further processing after tracking. The DeepSORT algorithm has excellent performance in target tracking and has achieved good results in areas such as pedestrian and vehicle tracking, but it has the following problems for firefighter pose tracking tasks:

[0004] (1) Using appearance features for target association performs well in pedestrian and vehicle tracking tasks, but its direct application in the field of firefighter tracking has limitations. Firefighters often wear uniforms and have similar appearance features in fire rescue and training, which makes the appearance features of different targets close together, resulting in poor target association.

[0005] (2) The algorithm uses the detection box obtained by the target detector as the target region, which makes the effectiveness of the target location information heavily dependent on the quality of the target detector. Previous multi-target tracking tasks did not have high requirements for the accuracy of the target location, but this method will have serious problems when transferred to the task of firefighter attitude tracking. Inaccurate target regions will make subsequent attitude estimation tasks difficult. Therefore, how to overcome the above defects and perform more accurate firefighter tracking and attitude information extraction is an urgent problem to be solved by those skilled in the art.

Summary of the Invention

[0006] In view of this, the present invention provides a pose tracking method and system based on an improved DeepSort algorithm to solve the problems described in the background art.

[0007] To achieve the above objectives, the present invention adopts the following technical solution: a firefighter posture tracking method based on an improved DeepSort algorithm, the specific steps of which include the following:

[0008] Acquire the image to be detected;

[0009] The YOLOX model is used to perform target detection on the image to be detected to obtain candidate boxes, and the non-maximum suppression algorithm is used to filter and obtain detection boxes;

[0010] Based on the results of the previous frame, the target region of the current frame is predicted using Kalman filtering;

[0011] The Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region are calculated. The Hungarian algorithm and cascaded matching are used to compare the similarity of the target between two consecutive frames to perform target matching and obtain the tracking result.

[0012] The tracking results are corrected using the target's pose information, and the corrected results are used to update the parameters of the YOLOX model and the Kalman filter.
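The prediction step in [0010] can be illustrated with a minimal sketch. It assumes the constant-velocity motion model of the original DeepSORT, with a state made of the box centre x, y, aspect ratio a, height h, and their velocities; the patent text does not spell out the state layout. The helper names and the process-noise value `q` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]


def constant_velocity_matrix(dim=4, dt=1.0):
    """Transition matrix F for the assumed state [x, y, a, h, vx, vy, va, vh]."""
    n = 2 * dim
    f = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(dim):
        f[i][dim + i] = dt  # position += velocity * dt
    return f


def kalman_predict(mean, cov, q=1e-2, dt=1.0):
    """Return the predicted (mean, cov): mean' = F mean, cov' = F cov F^T + Q."""
    n = len(mean)
    f = constant_velocity_matrix(n // 2, dt)
    new_mean = [sum(f[i][k] * mean[k] for k in range(n)) for i in range(n)]
    new_cov = matmul(matmul(f, cov), transpose(f))
    for i in range(n):
        new_cov[i][i] += q  # assumed isotropic process noise Q = q * I
    return new_mean, new_cov


# Track from the previous frame: centre (100, 200), aspect ratio 0.5, height 80,
# moving 2 px/frame to the right and 1 px/frame upward.
mean = [100.0, 200.0, 0.5, 80.0, 2.0, -1.0, 0.0, 0.0]
cov = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
pred_mean, pred_cov = kalman_predict(mean, cov)
print(pred_mean[:4])
```

The predicted covariance grows with each prediction, which is what makes the Mahalanobis gate wider for tracks that have not been matched recently.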

[0013] By adopting the above technical solution, the following beneficial technical effects are achieved: The accuracy of the target detector has a significant impact on the target tracking performance. To obtain a firefighter detector with high accuracy, this invention selects YOLOX as the target detection model. Since traditional anchor-based methods suffer from poor versatility and model complexity, YOLOX uses anchor-free methods to determine the prior boxes, reducing parameters, resulting in faster detection speed and higher accuracy.

[0014] Optionally, the pose information of the target is extracted using the BlazePose pose estimator, and the pose information is used to associate pose features of the target. Specifically, key point data of the human body is obtained through BlazePose, and the key point data is preprocessed to obtain the pose feature vector of the target. The cosine distance between the pose feature vectors is calculated to determine the target correlation degree. Finally, Mahalanobis distance and apparent feature cosine distance are combined, weight coefficients are set, and the three measurement methods are linearly weighted to construct a distance matrix for target matching.

[0015] By adopting the above technical solution, the following beneficial technical effects can be achieved: by adding the BlazePose attitude estimator to extract the attitude information of the target and using the attitude information to associate the attitude features of the target, the problem of similar appearance of the target can be solved, and the appearance features of the target can be extracted more effectively for target association.

[0016] Optionally, the specific steps for correcting the tracking results using the target's pose information are as follows:

[0017] The minimum area of the human body is obtained using the key point data of the human body.

[0018] The minimum region is used as the correction threshold for the target region. The target bounding box is then corrected based on this correction threshold, using the following formula:

[0019] $\mathrm{pose}_{(x,y,w,h)} = \varepsilon(x_{min},\, y_{min},\, x_{max} - x_{min},\, y_{max} - y_{min})$;

[0020] $T_{(x,y,w,h)} = f(\mathrm{box}_{(x,y,w,h)},\, \mathrm{pose}_{(x,y,w,h)})$;

[0021] where $x_{max}$, $y_{max}$, $x_{min}$, and $y_{min}$ are the maximum x-coordinate, maximum y-coordinate, minimum x-coordinate, and minimum y-coordinate of the key points on the human body; $\mathrm{pose}_{(x,y,w,h)}$ is the target region obtained based on pose information; and $\mathrm{box}_{(x,y,w,h)}$ is the target region obtained by the YOLOX model.

[0022] By adopting the above technical solution, the following beneficial technical effects are achieved: The target region obtained by the DeepSORT algorithm is difficult to adapt to the human pose estimation task. To obtain a more accurate target region, this invention proposes a target region correction algorithm. Since the human keypoints obtained by the pose estimator can effectively reflect the position information of the target, and the human poses of consecutive frames have strong similarity, using the human pose information of the previous frame to correct the target box obtained by the detector in the current frame can yield a higher quality target region, thereby improving the effect of human keypoint extraction.

[0023] Optionally, the backbone network of the YOLOX model uses CSPDarknet53, outputting feature maps at three scales from large to small, and a decoupling head is added to the YOLOX model to predict the classification and regression tasks separately.

[0024] Optionally, the target correlation can be determined by calculating the cosine distance between the attitude feature vectors:

[0025] $p = (x_1, y_1, x_2, y_2, \ldots, x_j, y_j)$;

[0026]

[0027] where $P_j$ is the pose feature vector of the j-th target, $P^{(i)}$ is the pose feature vector stored by the i-th tracker, and $x_i$, $y_i$ are the coordinates of key points on the human body.
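The formula of paragraph [0026] is not reproduced in this text. As an assumption consistent with the symbols defined above, a cosine distance between the pose feature vector of the j-th target and the one stored by the i-th tracker would take the standard form below. The label $d^{(3)}$ is borrowed from paragraphs [0057] and [0058], where it denotes the pose feature cosine distance; the original formula may differ, for example by taking a minimum over several stored vectors.

```latex
d^{(3)}(i,j) = 1 - \frac{P_j \cdot P^{(i)}}{\lVert P_j \rVert \,\lVert P^{(i)} \rVert}
```

Under this form the distance is 0 when the two pose vectors point in the same direction and grows as the poses diverge.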

[0028] On the other hand, an improved DeepSort algorithm-based firefighter posture tracking system is provided. This system utilizes the improved DeepSort algorithm for posture tracking and includes an image acquisition module, a target detection module, a prediction module, a matching module, and a correction module.

[0029] The image acquisition module is used to acquire the image to be detected;

[0030] The target detection module is used to perform target detection on the image to be detected using the YOLOX model to obtain candidate boxes, and to filter them using a non-maximum suppression algorithm to obtain detection boxes.

[0031] The prediction module is used to predict the target region of the current frame based on the result of the previous frame using Kalman filtering.

[0032] The matching module is used to calculate the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region, and to perform target matching by comparing the similarity of the target between two consecutive frames using the Hungarian algorithm and cascaded matching to obtain the tracking result.

[0033] The correction module is used to correct the tracking results using the target's pose information, and to update the parameters of the YOLOX model and the Kalman filter using the correction results.

[0034] As can be seen from the above technical solutions, compared with the prior art, the present invention discloses an improved DeepSort algorithm for attitude tracking and a system, which has the following beneficial technical effects: fully utilizing the powerful target detection capability of the detector, combined with the improved DeepSORT algorithm, it tracks and estimates the attitude of multiple targets of firefighters in fire training scenarios, and has a more accurate ability to track firefighters and extract attitude information. It successfully solves the problem of poor target tracking and attitude extraction effect caused by similar target appearance and inaccurate target area, and can better adapt to the task of firefighter attitude tracking. It provides a general solution for human attitude tracking, which can be used for other similar tasks and applications.

Attached Figure Description

[0035] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0036] Figure 1 is a flowchart of the method of the present invention;

[0037] Figure 2 is a structural diagram of the YOLOX model of the present invention;

[0038] Figure 3 is a flowchart of the pose cosine distance calculation method of the present invention;

[0039] Figure 4 is a flowchart of the target matching process combining the pose cosine distance according to the present invention;

[0040] Figure 5 is a flowchart of the target region correction process based on pose information according to the present invention;

[0041] Figure 6 is a graph showing the mAP variation of the present invention.

Detailed Implementation

[0042] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0043] This invention discloses a firefighter pose tracking method based on an improved DeepSort algorithm. As shown in Figure 1, the specific steps are as follows:

[0044] Step 1: Obtain the image to be detected;

[0045] Step 2: Use the YOLOX model to perform target detection on the image to be detected to obtain candidate boxes, and then use the non-maximum suppression algorithm to filter and obtain the detection boxes;

[0046] Step 3: Based on the results of the previous frame, use Kalman filtering to predict the target region of the current frame;

[0047] Step 4: Calculate the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region. Use the Hungarian algorithm and cascaded matching to compare the similarity of the target between two consecutive frames to perform target matching and obtain the tracking results.

[0048] Step 5: Correct the tracking results using the target's pose information, and use the corrected results to update the parameters of the YOLOX model and the Kalman filter.
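The filtering in Step 2 can be sketched as greedy non-maximum suppression. This is a minimal illustration, not the YOLOX implementation: the corner box format `(x1, y1, x2, y2)`, the function names, and the IoU threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate candidates for one firefighter and one separate target.
candidates = [(10, 10, 110, 210), (12, 14, 112, 208), (300, 50, 380, 230)]
scores = [0.92, 0.85, 0.78]
kept = nms(candidates, scores)
print(kept)
```

The indices returned refer to the candidate list, so the surviving detection boxes are `[candidates[i] for i in kept]`.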

[0049] This invention optimizes the DeepSORT algorithm as follows: (1) The target detector is replaced with the YOLOX model to improve the target detection accuracy. (2) The BlazePose pose estimator is added to extract the target's pose information, and the pose information is used to associate the target's pose features to solve the problem of similar target appearance. (3) Combining the constraint relationship between human pose and human body region, a target region correction algorithm is proposed to solve the problem that the target region is difficult to adapt to the pose estimation task.

[0050] Furthermore, the backbone network of the YOLOX model still uses CSPDarknet53 and outputs feature maps at three scales from large to small. On this basis, an FPN structure is adopted to exchange information between feature maps of different sizes and to enhance feature extraction. Classification and regression are different problems in object detection and focus on different features, and earlier methods coupled the two in a single detection head; YOLOX therefore adds a decoupled head that predicts the two tasks separately, improving the accuracy of object detection. The specific structure is shown in Figure 2.

[0051] Target association combined with attitude estimation:

[0052] Most multi-target tracking algorithms associate targets by extracting their features, such as positional and appearance features, and measuring the similarity of these features. In firefighter pose estimation tasks, targets often appear similar, making appearance feature measurement ineffective. To better extract appearance features for effective target association, this invention introduces a human pose feature metric based on the similarity of poses for the same target in adjacent frames and the high discriminative power of pose features for different targets. The metric calculates the pose feature distance for target association. This invention selects BlazePose as the pose estimator, obtaining 33 keypoint data points of the human body. The data is then filtered, flattened, and normalized to obtain the target's pose feature vector. The cosine distance between these pose feature vectors is calculated to determine the target association degree.

[0053] $p = (x_1, y_1, x_2, y_2, \ldots, x_j, y_j)$;

[0054]

[0055] where $P_j$ is the pose feature vector of the j-th target, $P^{(i)}$ is the pose feature vector stored by the i-th tracker, and $x_i$, $y_i$ are the coordinates of key points on the human body. The specific process is shown in Figure 3.
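A minimal sketch of the pose feature pipeline described in [0052] follows. The text states only that the keypoints are filtered, flattened, and normalized, so the concrete choices here are assumptions: filtering keeps a fixed subset of keypoint indices, and normalization shifts the points to their own bounding box and divides by its longer side. The function names are hypothetical.

```python
import math


def pose_feature(keypoints, keep=None):
    """Flatten (x, y) keypoints into a normalized feature vector.

    keypoints: list of (x, y) pixel coordinates, e.g. the 33 BlazePose points.
    keep: optional indices of keypoints to retain (assumed filtering rule).
    """
    pts = [keypoints[i] for i in keep] if keep is not None else list(keypoints)
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    x0, y0 = min(xs), min(ys)
    scale = max(max(xs) - x0, max(ys) - y0) or 1.0  # avoid division by zero
    # Assumed normalization: translate to the keypoint box, scale by its longer side.
    return [v for x, y in pts for v in ((x - x0) / scale, (y - y0) / scale)]


def cosine_distance(p, q):
    """1 - cosine similarity; close to 0 when the vectors share a direction."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm > 0 else 1.0


# Toy 4-keypoint poses; real input would carry 33 points per person.
standing = [(50, 10), (50, 60), (40, 110), (60, 110)]
standing_shifted = [(x + 200, y + 30) for x, y in standing]  # same pose elsewhere
crouching = [(50, 60), (70, 80), (40, 110), (90, 110)]

d_same = cosine_distance(pose_feature(standing), pose_feature(standing_shifted))
d_diff = cosine_distance(pose_feature(standing), pose_feature(crouching))
print(d_same, d_diff)
```

With this normalization the distance ignores where the person stands in the frame and responds only to the shape of the pose, which is the property the association step relies on when appearances are similar.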

[0056] Finally, the pose cosine distance is combined with the Mahalanobis distance and the apparent feature cosine distance: weighting coefficients are set, the three measures are linearly weighted, and a distance matrix is constructed for target matching.

[0057] $C_{ij} = \lambda\, d^{(1)}(i,j) + \mu\, d^{(3)}(i,j) + (1 - \lambda - \mu)\, d^{(2)}(i,j)$;

[0058] where $d^{(1)}(i,j)$, $d^{(2)}(i,j)$, and $d^{(3)}(i,j)$ represent the Mahalanobis distance, the apparent feature cosine distance, and the pose feature cosine distance, respectively. The complete process is shown in Figure 4.
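The weighted cost of [0057] and the subsequent matching can be sketched as follows. The weights `lam` and `mu` are hypothetical values, the three input matrices are assumed to be already scaled to comparable ranges, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same minimum-cost matching on small square matrices). Cascaded matching and gating are omitted.

```python
from itertools import permutations


def combined_cost(d_maha, d_app, d_pose, lam=0.2, mu=0.4):
    """C_ij = lam * d1 + mu * d3 + (1 - lam - mu) * d2 for tracker i, detection j."""
    rows, cols = len(d_maha), len(d_maha[0])
    return [[lam * d_maha[i][j] + mu * d_pose[i][j]
             + (1.0 - lam - mu) * d_app[i][j]
             for j in range(cols)] for i in range(rows)]


def min_cost_assignment(cost):
    """Exhaustive minimum-cost matching for a small square cost matrix.

    Stand-in for the Hungarian algorithm, which reaches the same optimum
    in polynomial time.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))


# Two firefighters in identical uniforms: motion and appearance are ambiguous
# (appearance even slightly favours the wrong pairing), pose is discriminative.
d_maha = [[0.50, 0.50], [0.50, 0.50]]
d_app = [[0.10, 0.09], [0.09, 0.10]]
d_pose = [[0.05, 0.60], [0.55, 0.04]]

cost = combined_cost(d_maha, d_app, d_pose)
matches = min_cost_assignment(cost)
print(matches)
```

In this toy case the pose term decides the matching, pairing each tracker with the detection whose pose is closest.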

[0059] Target region correction based on attitude information:

[0060] The target regions obtained by the DeepSORT algorithm are not well-suited for human pose estimation tasks. To obtain more accurate target regions, this invention proposes a target region correction algorithm. Since the human keypoints obtained by the pose estimator effectively reflect the target's position information, and the human poses of consecutive frames have strong similarities, using the human pose information from the previous frame to correct the target bounding box obtained by the detector in the current frame can yield a higher-quality target region, thereby improving the effect of human keypoint extraction. To fully explore the relationship between human keypoints and human regions, the minimum human region is obtained using the coordinates of the human keypoints and used as the correction threshold for the target region. Based on this threshold, the target bounding box is corrected using the following formula:

[0061] $\mathrm{pose}_{(x,y,w,h)} = \varepsilon(x_{min},\, y_{min},\, x_{max} - x_{min},\, y_{max} - y_{min})$;

[0062] $T_{(x,y,w,h)} = f(\mathrm{box}_{(x,y,w,h)},\, \mathrm{pose}_{(x,y,w,h)})$;

[0063] where $x_{max}$, $y_{max}$, $x_{min}$, and $y_{min}$ are the maximum x-coordinate, maximum y-coordinate, minimum x-coordinate, and minimum y-coordinate of the key points on the human body; $\mathrm{pose}_{(x,y,w,h)}$ is the target region obtained based on pose information; and $\mathrm{box}_{(x,y,w,h)}$ is the target region obtained by the YOLOX model. The specific process is shown in Figure 5.
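The correction in [0061] and [0062] can be illustrated with a short sketch. The text does not define ε or f, so both are assumptions here: ε expands the tight keypoint box by a relative margin, and f returns the smallest box that covers both the detector box and the pose box, so the corrected region never excludes the minimum body region. Boxes are taken as `(x, y, w, h)` with `(x, y)` the top-left corner, and the function names are hypothetical.

```python
def pose_region(keypoints, margin=0.1):
    """Tight keypoint box (x, y, w, h), expanded by an assumed relative margin."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    x_min, y_min, x_max, y_max = min(xs), min(ys), max(xs), max(ys)
    w, h = x_max - x_min, y_max - y_min
    return (x_min - margin * w, y_min - margin * h,
            w * (1 + 2 * margin), h * (1 + 2 * margin))


def correct_box(box, pose):
    """Assumed f: the smallest (x, y, w, h) box covering both inputs."""
    x1 = min(box[0], pose[0])
    y1 = min(box[1], pose[1])
    x2 = max(box[0] + box[2], pose[0] + pose[2])
    y2 = max(box[1] + box[3], pose[1] + pose[3])
    return (x1, y1, x2 - x1, y2 - y1)


# The detector box ends at x = 160 and cuts off an outstretched arm at x = 170.
detector_box = (100.0, 50.0, 60.0, 160.0)
prev_keypoints = [(120.0, 60.0), (130.0, 120.0), (170.0, 100.0), (125.0, 200.0)]

pose_box = pose_region(prev_keypoints, margin=0.0)
corrected = correct_box(detector_box, pose_box)
print(corrected)
```

A union is only one plausible choice for f; a weighted blend of the two boxes would also fit the description, at the cost of possibly clipping keypoints.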

[0064] In addition, experimental tests were conducted to verify the effectiveness of the algorithm of the present invention.

[0065] A total of 6,602 images were extracted from firefighter training videos and labeled using LabelImg to create a VOC-format dataset, which serves as the object detection dataset. Five training videos were selected and labeled to obtain the firefighter multi-target tracking dataset. Detailed information about the firefighter tracking dataset is shown in Table 1.

[0066] Table 1

[0067]

[0068] Training was performed for 300 epochs, and the mAP curve during training is shown in Figure 6. mAP increases rapidly with iteration and then stabilizes at around 250 epochs, indicating fast overall convergence. Furthermore, the YOLOX model benefited from disabling data augmentation in the last 15 epochs, as evidenced by a significant increase in mAP after epoch 285.

[0069] On the test dataset, AP and mAP were selected as evaluation metrics to assess its accuracy. The detection APs for the two types of firefighters were 97.8% and 96.7%, respectively, indicating high detection accuracy. Furthermore, the model's parameter count is only 3.52M, making it well-adaptable to various devices. To demonstrate the detector's performance in firefighter detection tasks, it was tested under different fire training scenarios. The detection results from the physical training scenario show that even when the target exhibits significant movements or body contortions, the detector can still effectively identify the target and filter out surrounding non-firefighter objects. Observation of the detection results from the collaborative training scenario shows that even with numerous targets, some of which are occluded, or the targets are small, the detector can still obtain relatively accurate bounding boxes. Experimental results indicate that the target detector has high accuracy and can adapt to complex fire training scenarios, effectively completing the firefighter detection task.

[0070] The algorithm of this invention is an improvement on the DeepSORT algorithm. To verify the performance of the algorithm in firefighter tracking tasks and to compare its performance with that of the DeepSORT algorithm, DarkLabel was used to annotate the collected videos to create a firefighter tracking dataset. Four metrics, namely ID switch, MOTA, MOTP, and IDF1, were selected to evaluate the tracker's performance on this dataset. The test results of the algorithm of this invention in various video sequences are shown in Table 2, and the comparison results with the DeepSORT algorithm are shown in Table 3.

[0071] Table 2

[0072]
Video sequence   IDF1 / %↑   IDP / %↑   IDR / %↑   IDS↓   MOTA / %↑   MOTP↓
fireman-1        95.78       99.36      92.45      0      91.85       0.23
fireman-2        65.14       65.64      64.65      2      95.45       0.21
fireman-3        78.24       84.38      72.94      3      65.10       0.17
fireman-4        75.70       75.39      75.84      16     79.65       0.28
fireman-5        95.42       83.22      81.56      3      95.93       0.24

[0073] Table 3

[0074]
Algorithm                     IDF1 / %↑   IDP / %↑   IDR / %↑   IDS↓   MOTA / %↑   MOTP↓
DeepSORT                      74.77       78.07      71.66      27     74.72       0.29
Algorithm of this invention   82.36       83.22      81.56      24     82.96       0.25

[0075] As shown in Table 3, compared with the benchmark DeepSORT algorithm, the multi-target tracking accuracy (MOTA) of the algorithm of this invention is 82.96%, an improvement of 8.24 percentage points; the number of target label changes (IDS) is 24, a decrease of 11.1%; the IDF1 is 82.36%, an improvement of 7.59 percentage points; and the multi-target tracking positioning error (MOTP) is 0.25, a decrease of 13.79%. This demonstrates that the algorithm of this invention has good tracking accuracy and positioning precision. The IDP and IDS indices are also improved, showing that the algorithm can effectively mitigate trajectory loss and ID changes and can perform accurate target tracking.
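The improvement figures quoted from Table 3 can be checked with a few lines of arithmetic. The sketch below uses only the Table 3 values; it treats the gains in MOTA and IDF1 as absolute differences in percentage points and the reductions in IDS and MOTP as relative changes against the DeepSORT baseline, which is the reading that reproduces the quoted numbers.

```python
deepsort = {"IDF1": 74.77, "IDS": 27, "MOTA": 74.72, "MOTP": 0.29}
improved = {"IDF1": 82.36, "IDS": 24, "MOTA": 82.96, "MOTP": 0.25}

# Percentage-valued metrics: absolute difference in percentage points.
mota_gain = improved["MOTA"] - deepsort["MOTA"]
idf1_gain = improved["IDF1"] - deepsort["IDF1"]

# Count- and error-valued metrics: relative reduction versus the baseline.
ids_drop = (deepsort["IDS"] - improved["IDS"]) / deepsort["IDS"] * 100
motp_drop = (deepsort["MOTP"] - improved["MOTP"]) / deepsort["MOTP"] * 100

print(f"MOTA +{mota_gain:.2f} points, IDF1 +{idf1_gain:.2f} points")
print(f"IDS -{ids_drop:.1f}%, MOTP -{motp_drop:.2f}%")
```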

[0076] Compared to the benchmark DeepSORT, the target bounding box obtained by the algorithm of this invention more accurately includes the human body region, and the extraction of firefighter posture information is also more complete. The above experimental results show that the algorithm of this invention has relatively accurate target tracking ability and firefighter human posture information extraction ability, and performs well in firefighter tracking tasks.

[0077] Embodiment 2 of the present invention provides a firefighter posture tracking system based on an improved DeepSort algorithm. The system utilizes an improved DeepSort algorithm for firefighter posture tracking, and includes an image acquisition module, a target detection module, a prediction module, a matching module, and a correction module.

[0078] The image acquisition module is used to acquire the image to be detected;

[0079] The object detection module is used to perform object detection on the image to be detected using the YOLOX model to obtain candidate boxes and to filter them using a non-maximum suppression algorithm to obtain detection boxes;

[0080] The prediction module is used to predict the target region of the current frame based on the results of the previous frame using Kalman filtering.

[0081] The matching module is used to calculate the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region. It uses the Hungarian algorithm and cascaded matching to compare the similarity of the target between two consecutive frames to perform target matching and obtain the tracking result.

[0082] The correction module is used to correct the tracking results using the target's pose information, and to update the parameters of the YOLOX model and the Kalman filter using the correction results.

[0083] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to the method section.

[0084] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A firefighter pose tracking method based on an improved DeepSort algorithm, characterized in that the specific steps include the following:
acquiring the image to be detected;
using the YOLOX model to perform target detection on the image to be detected to obtain candidate boxes, and filtering them using the non-maximum suppression algorithm to obtain detection boxes;
based on the results of the previous frame, predicting the target region of the current frame using Kalman filtering;
calculating the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region, and using the Hungarian algorithm and cascaded matching to compare the similarity of the targets between two consecutive frames to perform target matching and obtain the tracking result;
correcting the tracking result using the target's pose information, and updating the parameters of the YOLOX model and the Kalman filter using the corrected result;
wherein the BlazePose pose estimator extracts the target's pose information and uses this information to associate pose features with the target; specifically, BlazePose obtains key point data of the human body, the key point data is preprocessed to obtain the target's pose feature vector, and the cosine distance between the pose feature vectors is calculated to determine the target correlation degree; finally, the Mahalanobis distance and the apparent feature cosine distance are combined, weight coefficients are set, and the three measurement methods are linearly weighted to construct a distance matrix for target matching;
wherein the specific steps for correcting the tracking result using the target's pose information are as follows: the minimum area of the human body is obtained using the key point data of the human body; the minimum region is used as the correction threshold for the target region; and the target bounding box is corrected based on this correction threshold, using the following formulas:
$\mathrm{pose}_{(x,y,w,h)} = \varepsilon(x_{min},\, y_{min},\, x_{max} - x_{min},\, y_{max} - y_{min})$;
$T_{(x,y,w,h)} = f(\mathrm{box}_{(x,y,w,h)},\, \mathrm{pose}_{(x,y,w,h)})$;
where $x_{max}$, $y_{max}$, $x_{min}$, and $y_{min}$ are the maximum x-coordinate, maximum y-coordinate, minimum x-coordinate, and minimum y-coordinate of the key points on the human body; $\mathrm{pose}_{(x,y,w,h)}$ is the target region obtained based on pose information; and $\mathrm{box}_{(x,y,w,h)}$ is the target region obtained by the YOLOX model.

2. The firefighter pose tracking method based on the improved DeepSort algorithm according to claim 1, characterized in that the backbone network of the YOLOX model uses CSPDarknet53, which outputs feature maps at three scales from large to small, and a decoupled head is added to the YOLOX model to predict the classification and regression tasks separately.

3. The firefighter pose tracking method based on the improved DeepSort algorithm according to claim 1, characterized in that the target correlation is determined by calculating the cosine distance between the pose feature vectors, the pose feature vector being $p = (x_1, y_1, x_2, y_2, \ldots, x_j, y_j)$; where $P_j$ is the pose feature vector of the j-th target, $P^{(i)}$ is the pose feature vector stored by the i-th tracker, and $x_i$, $y_i$ are the coordinates of key points on the human body.

4. A firefighter pose tracking system based on an improved DeepSort algorithm, characterized in that the system applies the firefighter pose tracking method based on the improved DeepSort algorithm according to any one of claims 1-3 and includes an image acquisition module, a target detection module, a prediction module, a matching module, and a correction module; wherein the image acquisition module is used to acquire the image to be detected; the target detection module is used to perform target detection on the image to be detected using the YOLOX model to obtain candidate boxes, and to filter them using a non-maximum suppression algorithm to obtain detection boxes; the prediction module is used to predict the target region of the current frame based on the result of the previous frame using Kalman filtering; the matching module is used to calculate the Mahalanobis distance, apparent cosine distance, and pose cosine distance between the detection box and the predicted target region, and to perform target matching by comparing the similarity of the targets between two consecutive frames using the Hungarian algorithm and cascaded matching to obtain the tracking result; and the correction module is used to correct the tracking result using the target's pose information, and to update the parameters of the YOLOX model and the Kalman filter using the correction result.