Target tracking method, device, equipment and storage medium
By obtaining fused features of video frames, matching features across frames, and combining historical features with the Kalman filter algorithm, the method addresses inaccurate target tracking in traditional video tracking algorithms and achieves accurate tracking of multiple targets.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MIGU CO LTD
- Filing Date
- 2023-02-15
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional video tracking algorithms extract feature vectors from the feature map position corresponding to the center point of the target in the video frame as ReID information, which cannot accurately track the target.
By acquiring the fused features of video frames, identifying visible feature points, and performing feature matching, combined with historical feature vectors and the Kalman filter algorithm, accurate positioning and tracking of target objects can be achieved.
It improves the accuracy and comprehensiveness of multi-target tracking in videos, enabling better extraction of target feature information and achieving precise tracking of multiple targets.
Figure CN116012419B_ABST
Description
Technical Field
[0001] This invention relates to the field of video processing technology, and in particular to a target tracking method, apparatus, device, and storage medium.
Background Technology
[0002] Video multi-object tracking uses a tracking algorithm to detect the locations of targets, associates the same target across different frames of a video, and assigns it the same ID to complete the target tracking task. Currently, traditional tracking algorithms extract the feature vector at the feature map position corresponding to the center point of the target in each video frame as ReID (person re-identification) information, and then track the target based on that ReID information. However, since the ReID information covers only the target's center point and lacks other feature information, the target cannot be tracked accurately from the ReID information alone.
Summary of the Invention
[0003] This invention provides a target tracking method, apparatus, device, and storage medium, aiming to solve the technical problem that traditional tracking algorithms extract feature vectors corresponding to the feature map positions of the center point of the target in a video frame as ReID information, but cannot accurately track the target based on ReID information.
[0004] This invention provides a target tracking method, the target tracking method comprising:
[0005] The fusion features of each video frame corresponding to the video to be processed are obtained. The fusion features are obtained by concatenating the image features and optical flow features of the video frames.
[0006] Based on the fusion features, the visible feature points in the video frame are determined;
[0007] The visible feature points of the target object in each of the video frames are matched to obtain the target detection result of the target object.
[0008] Based on the target detection results, the target tracking results of the video to be processed are determined.
[0009] Optionally, the step of performing feature matching on the visible feature points of the target object in each of the video frames to obtain the target detection result of the target object includes:
[0010] Obtain the historical feature vectors corresponding to the video frames to be processed, and construct the historical feature space of each target object based on the historical feature vectors;
[0011] Obtain the visible feature vector of each visible feature point of the target object detected in the current video frame, and construct the feature space of each target object in the current video frame based on the visible feature vector;
[0012] Based on the feature space of each target object in the current video frame and the corresponding historical feature space, each target object in the current video frame is matched and associated with the target objects in the historical feature space to obtain a first-level matching result. The first-level matching result includes target objects that are successfully matched and target objects that are not matched.
[0013] The target detection result is determined based on the first-level matching result of each video frame.
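The first-level matching described in [0010] to [0013] can be illustrated with a minimal sketch. It assumes that each target's feature space is summarized by the mean of its feature vectors, that similarity is measured by cosine similarity, and that assignment is greedy. The similarity measure, the threshold `min_sim`, and all function names are illustrative assumptions; the text does not specify them.

```python
import math

def mean_vector(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def first_level_match(current, history, min_sim=0.8):
    """Greedily associate current detections with historical targets.

    current: detection key -> list of visible feature vectors (current frame)
    history: track id -> list of historical feature vectors
    Returns (matches, unmatched): matches maps detection key -> track id,
    unmatched lists detection keys left for later matching levels.
    min_sim is an assumed threshold, not a value from the source.
    """
    pairs = []
    for det, vecs in current.items():
        for tid, hvecs in history.items():
            sim = cosine(mean_vector(vecs), mean_vector(hvecs))
            pairs.append((sim, det, tid))
    pairs.sort(key=lambda p: p[0], reverse=True)

    matches, used = {}, set()
    for sim, det, tid in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if det in matches or tid in used:
            continue
        matches[det] = tid
        used.add(tid)
    unmatched = [d for d in current if d not in matches]
    return matches, unmatched
```

Detections returned in `unmatched` correspond to the "target objects that are not matched" in the first-level matching result.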
[0014] Optionally, the step of obtaining the historical feature vector corresponding to the video frame to be processed, and constructing the historical feature space of each target object based on the historical feature vector includes:
[0015] If the current video frame is not the first or second video frame, then obtain the visible feature vector of each visible feature point of the target object in the previous video frame.
[0016] Based on the visible feature vector and the historical feature vector corresponding to the second video frame before the current video frame, construct the historical feature space of each target object in the current video frame;
[0017] If the current video frame is the second video frame, then obtain the visible feature vector of each visible feature point of the target object in the first video frame.
[0018] Based on the visible feature vectors, construct the historical feature space of each target object in the current video frame.
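One possible reading of [0014] to [0018] is that the historical feature space accumulates frame by frame: for the second frame it holds only the first frame's visible feature vectors, and for later frames it combines the previous frame's visible feature vectors with the history that was used for that previous frame. The sketch below follows this reading; the function name `build_history` and the dictionary layout are hypothetical.

```python
def build_history(frame_index, prev_visible, prev_history=None):
    """Build the historical feature space for the current frame.

    frame_index:  1-based index of the current video frame
    prev_visible: track id -> visible feature vectors from the previous frame
    prev_history: track id -> historical vectors used for the previous frame
                  (ignored for the first and second frames)
    Returns a new dict: track id -> list of historical feature vectors.
    """
    if frame_index < 2:
        return {}  # the first frame has no history
    if frame_index == 2:
        # History consists only of the first frame's visible vectors.
        return {tid: list(vecs) for tid, vecs in prev_visible.items()}
    # Later frames: earlier history plus the previous frame's visible vectors.
    history = {tid: list(vecs) for tid, vecs in (prev_history or {}).items()}
    for tid, vecs in prev_visible.items():
        history.setdefault(tid, []).extend(vecs)
    return history
```

The function copies its inputs, so the history of an earlier frame is not modified when a later one is built.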
[0019] Optionally, after the step of matching and associating each target object in the current video frame with the target objects in the historical feature space based on the feature space of each target object in the current video frame and the corresponding historical feature space to obtain a first-level matching result, the method further includes:
[0020] Obtain the target objects that failed to match in the first-level matching results corresponding to the video to be processed;
[0021] Obtain the location information of the feature points of the unmatched target objects in each video frame of the video to be processed, and store the location information of the feature points of the same target object in a queue;
[0022] If the queue is full, the target objects corresponding to each frame in the queue are matched and associated to obtain a second-level matching result;
[0023] The target detection result is determined based on the first-level matching result and the second-level matching result corresponding to each video frame.
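The queue handling in [0020] to [0023] can be sketched with a fixed-length queue per unmatched target. The capacity of 5 frames and the class name `PositionQueue` are assumptions for illustration; the text does not state a queue length.

```python
from collections import deque

class PositionQueue:
    """Per-target fixed-length queues of feature point positions."""

    def __init__(self, capacity=5):
        self.capacity = capacity  # assumed queue length, not from the source
        self.queues = {}          # target id -> deque of per-frame keypoints

    def push(self, target_id, keypoints):
        """Store one frame's keypoint positions for a target.

        Returns True when that target's queue is full, i.e. when the
        queued frames can be matched and associated with each other.
        """
        q = self.queues.setdefault(target_id, deque(maxlen=self.capacity))
        q.append(keypoints)
        return len(q) == self.capacity
```

While `push` returns False the queue is not yet full, which corresponds to the alternative branch in which positions are predicted instead.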
[0024] Optionally, after the step of storing the position information of the feature points of the target object in a queue based on the position information of the feature points of the same target object in each video frame of the video to be processed, the method further includes:
[0025] If the queue is not full, then obtain each visible feature point of the target object in the current video frame, and determine the predicted position information of each visible feature point in the video frame based on the optical flow feature and the position information of the visible feature points in the previous video frame.
[0026] Obtain each invisible feature point of the target object in the current video frame, and determine the predicted position information of the invisible feature points in the video frame according to the Kalman filter algorithm;
[0027] Based on the actual position information of the feature points of each target object in the current video frame and the predicted position information, the target objects in the current video frame are matched and associated with the target objects in the previous video frame to obtain a second-level matching result.
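The two prediction paths in [0025] and [0026] can be sketched as follows: a visible feature point is advanced by the optical flow vector at its previous position, while an invisible one is predicted by a Kalman filter. The constant-velocity motion model, the noise parameters `q` and `r`, and the names used here are assumptions; the text only states that the Kalman filter algorithm is used.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state (position, velocity)."""

    def __init__(self, position, q=1e-2, r=1.0):
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # assumed process/measurement noise

    def predict(self, dt=1.0):
        """Advance the state by dt and return the predicted position."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r            # innovation covariance
        k0, k1 = a / s, c / s     # Kalman gain
        y = z - self.p            # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

def predict_keypoint(prev_xy, visible, flow_xy=None, filters=None):
    """Predict a feature point's position in the current frame.

    visible points:   previous position plus the optical flow vector
    invisible points: prediction from per-coordinate Kalman filters
    """
    if visible:
        return (prev_xy[0] + flow_xy[0], prev_xy[1] + flow_xy[1])
    fx, fy = filters
    return (fx.predict(), fy.predict())
```

In this sketch each feature point would keep its own pair of filters, updated whenever the point is observed, so that a prediction is available once the point becomes occluded or leaves the frame.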
[0028] Optionally, after obtaining the second-level matching result, the method further includes:
[0029] Obtain the target objects in the video to be processed that remain unmatched after the first-level and second-level matching;
[0030] Identify the bounding boxes corresponding to unmatched target objects;
[0031] When the center feature point of the detection box is a visible feature point, the predicted box position information of the detection box in the current video frame is determined based on the optical flow feature and the position information of the detection box in the previous video frame.
[0032] When the center feature point of the detection box is an invisible feature point, the predicted box position information of the detection box in the current video frame is determined according to the Kalman filter algorithm.
[0033] Based on the box position information of the unmatched target object in the current video frame and the predicted box position information, the target object in the current video frame and the target object in the previous video frame are matched and associated to obtain a third-level matching result.
[0034] The target detection result is determined based on the first-level matching result, the second-level matching result, and the third-level matching result corresponding to each video frame.
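The third-level matching in [0029] to [0034] compares predicted and actual detection box positions. The sketch below assumes intersection over union (IoU) as the comparison measure and greedy assignment; the text does not name a measure, so IoU, the threshold `min_iou`, and the function names are illustrative choices. Only the optical flow path for a visible center point is shown via `shift_box`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def shift_box(box, flow_xy):
    """Predict a box by translating it with an optical flow vector."""
    dx, dy = flow_xy
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def third_level_match(detections, predicted, min_iou=0.3):
    """Greedily match current boxes to predicted boxes by IoU.

    detections: detection key -> box in the current frame
    predicted:  track id -> predicted box for the current frame
    Returns (matches, unmatched detection keys).
    """
    pairs = sorted(
        ((iou(box, pbox), det, tid)
         for det, box in detections.items()
         for tid, pbox in predicted.items()),
        key=lambda p: p[0], reverse=True)
    matches, used = {}, set()
    for score, det, tid in pairs:
        if score < min_iou:
            break
        if det in matches or tid in used:
            continue
        matches[det] = tid
        used.add(tid)
    return matches, [d for d in detections if d not in matches]
```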
[0035] Optionally, the step of determining visible feature points in the video frame based on the fusion features includes:
[0036] The fused features are decoded and restored based on a pre-trained decoder to obtain an optical flow map;
[0037] Based on the optical flow map, visible and invisible feature points are extracted from the video frame.
[0038] Furthermore, to achieve the above objectives, the present invention also provides a target tracking device, the target tracking device comprising:
[0039] The feature extraction module is used to obtain the fusion features of each video frame corresponding to the video to be processed. The fusion features are obtained by concatenating the image features and optical flow features of the video frames.
[0040] A feature determination module is used to determine visible feature points in the video frame based on the fused features;
[0041] The result detection module is used to perform feature matching on the visible feature points of the target object in each video frame to obtain the target detection result of the target object.
[0042] The result determination module is used to determine the target tracking result of the video to be processed based on the target detection result.
[0043] In addition, to achieve the above objectives, the present invention also provides a terminal device, the terminal device comprising: a memory, a processor, and a target tracking program stored in the memory and executable on the processor, wherein the target tracking program, when executed by the processor, implements the steps of the target tracking method described above.
[0044] In addition, to achieve the above objectives, the present invention also provides a storage medium storing a target tracking program thereon, which, when executed by a processor, implements the steps of the target tracking method described above.
[0045] The target tracking method, apparatus, device, and storage medium provided in this embodiment of the invention have at least the following technical effects or advantages:
[0046] This invention obtains the fused features of each video frame corresponding to the video to be processed. The fused features are obtained by concatenating the image features and optical flow features of the video frames. Based on the fused features, visible feature points in the video frames are determined. Feature matching is performed on the visible feature points of the target objects in each video frame to obtain the target detection results. Based on the target detection results, the target tracking results of the video to be processed are determined. This solves the technical problem that traditional tracking algorithms extract feature vectors corresponding to the feature map positions of the center points of the targets in the video frames as ReID information, but cannot accurately track targets based on ReID information. When tracking multiple targets in a video, this invention can extract sufficient target feature information through the fused image features and optical flow features, which is beneficial for achieving comprehensive and accurate tracking of multiple targets in a video.
Attached Figure Description
[0047] Figure 1 is a schematic diagram of the hardware operating environment involved in the embodiments of the present invention;
[0048] Figure 2 is a flowchart illustrating an embodiment of the target tracking method of the present invention;
[0049] Figure 3 is a schematic diagram of the target tracking model of the present invention;
[0050] Figure 4 is a schematic diagram of the skeletal point annotation in the present invention;
[0051] Figure 5 is a schematic diagram of the outer contour point annotation in the present invention;
[0052] Figure 6 is a schematic diagram of the first-level matching process of the target tracking method of the present invention;
[0053] Figure 7 is a schematic diagram of the feature space;
[0054] Figure 8 is a schematic diagram of the second-level matching process of the target tracking method of the present invention;
[0055] Figure 9 is a flowchart illustrating the third-level matching process of the target tracking method of the present invention;
[0056] Figure 10 is a functional block diagram of the target tracking device of the present invention.
Detailed Implementation
[0057] To better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present invention and to fully convey the scope of the invention to those skilled in the art.
[0058] Figure 1 is a schematic diagram of the hardware operating environment involved in the embodiments of the present invention.
[0059] It should be noted that Figure 1 can be a schematic diagram of the hardware operating environment of the terminal device.
[0060] As one implementation, as shown in Figure 1, the embodiment of the present invention relates to a terminal device, which includes: a processor 1001, such as a CPU, a memory 1002, and a communication bus 1003. The communication bus 1003 is used to enable communication between these components.
[0061] Memory 1002 can be high-speed RAM or stable non-volatile memory, such as disk storage. As shown in Figure 1, the memory 1002, which serves as a storage medium, may include a target tracking program; and the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0062] The fusion features of each video frame corresponding to the video to be processed are obtained. The fusion features are obtained by concatenating the image features and optical flow features of the video frames.
[0063] Based on the fusion features, the visible feature points in the video frame are determined;
[0064] The visible feature points of the target object in each of the video frames are matched to obtain the target detection result of the target object.
[0065] Based on the target detection results, the target tracking results of the video to be processed are determined.
[0066] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0067] Obtain the historical feature vectors corresponding to the video frames to be processed, and construct the historical feature space of each target object based on the historical feature vectors;
[0068] Obtain the visible feature vector of each visible feature point of the target object detected in the current video frame, and construct the feature space of each target object in the current video frame based on the visible feature vector;
[0069] Based on the feature space of each target object in the current video frame and the corresponding historical feature space, each target object in the current video frame is matched and associated with the target objects in the historical feature space to obtain a first-level matching result. The first-level matching result includes target objects that are successfully matched and target objects that are not matched.
[0070] The target detection result is determined based on the first-level matching result of each video frame.
[0071] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0072] If the current video frame is not the first or second video frame, then obtain the visible feature vector of each visible feature point of the target object in the previous video frame.
[0073] Based on the visible feature vector and the historical feature vector corresponding to the second video frame before the current video frame, construct the historical feature space of each target object in the current video frame;
[0074] If the current video frame is the second video frame, then obtain the visible feature vector of each visible feature point of the target object in the first video frame.
[0075] Based on the visible feature vectors, construct the historical feature space of each target object in the current video frame.
[0076] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0077] Obtain the target objects that failed to match in the first-level matching results corresponding to the video to be processed;
[0078] Obtain the location information of the feature points of the unmatched target objects in each video frame of the video to be processed, and store the location information of the feature points of the same target object in a queue;
[0079] If the queue is full, the target objects corresponding to each frame in the queue are matched and associated to obtain a second-level matching result;
[0080] The target detection result is determined based on the first-level matching result and the second-level matching result corresponding to each video frame.
[0081] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0082] If the queue is not full, then obtain each visible feature point of the target object in the current video frame, and determine the predicted position information of each visible feature point in the video frame based on the optical flow feature and the position information of the visible feature points in the previous video frame.
[0083] Obtain each invisible feature point of the target object in the current video frame, and determine the predicted position information of the invisible feature points in the video frame according to the Kalman filter algorithm;
[0084] Based on the actual position information of the feature points of each target object in the current video frame and the predicted position information, the target objects in the current video frame are matched and associated with the target objects in the previous video frame to obtain a second-level matching result.
[0085] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0086] Obtain the target objects in the video to be processed that remain unmatched after the first-level and second-level matching;
[0087] Identify the bounding boxes corresponding to unmatched target objects;
[0088] When the center feature point of the detection box is a visible feature point, the predicted box position information of the detection box in the current video frame is determined based on the optical flow feature and the position information of the detection box in the previous video frame.
[0089] When the center feature point of the detection box is an invisible feature point, the predicted box position information of the detection box in the current video frame is determined according to the Kalman filter algorithm.
[0090] Based on the box position information of the unmatched target object in the current video frame and the predicted box position information, the target object in the current video frame and the target object in the previous video frame are matched and associated to obtain a third-level matching result.
[0091] The target detection result is determined based on the first-level matching result, the second-level matching result, and the third-level matching result corresponding to each video frame.
[0092] Furthermore, the processor 1001 can be used to call the target tracking program stored in the memory 1002 and perform the following operations:
[0093] The fused features are decoded and restored based on a pre-trained decoder to obtain an optical flow map;
[0094] Based on the optical flow map, visible and invisible feature points are extracted from the video frame.
[0095] This invention provides an embodiment of a target tracking method. It should be noted that although the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than that shown here.
[0096] This invention pre-trains a target tracking model, which is then used to track multiple targets in the video to be processed. As shown in Figure 3, the target tracking model includes a detector, a re-identification (ReID) feature extraction module, and a tracker, which are connected in sequence. The detector includes an image feature extraction module, an optical flow feature extraction module, a feature fusion module, a detection feature enhancement module, a visible feature point extraction module, and at least two detection heads. The image feature extraction module and the optical flow feature extraction module are connected to the feature fusion module, which in turn is connected to both the detection feature enhancement module and the visible feature point extraction module. The detection feature enhancement module is connected to one detection head, and the visible feature point extraction module is connected to the other detection head. The detection feature enhancement module and both detection heads are connected to the re-identification feature extraction module. The visible feature point extraction module is a pre-trained decoder, and each detection head can be understood as a pre-trained classifier.
[0097] The process before training the target tracking model includes: acquiring the video to be used for target object detection, decoding the video into a video frame sequence, and then annotating the video frames. The annotation content is as follows:
[0098] Label the detection bounding box (bbox) and identity (ID) number of the target object;
[0099] The location information of the feature points of the target object is labeled, and the feature points are numbered starting from 0. Each label must indicate one of three cases: visible (labeled 0), occluded (labeled 1), or outside the frame (labeled 2). Taking human tracking as an example, the feature points can be the human center point, outer contour points, and skeletal points. Skeletal point labeling is shown in Figure 4, and outer contour point labeling is shown in Figure 5. The number and position of the feature points representing the object (such as skeletal points and outer contour points) can be defined as needed.
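The annotation content in [0098] and [0099] can be represented, for illustration, by a small data structure. The visibility codes 0, 1, and 2 come from the text; the class names, field names, and the `(x, y, width, height)` box layout are assumptions.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Visibility(IntEnum):
    VISIBLE = 0       # position can be determined in the frame
    OCCLUDED = 1      # hidden by another object
    OUT_OF_FRAME = 2  # lies outside the video frame

@dataclass
class FeaturePoint:
    index: int               # numbering starts from 0
    x: float
    y: float
    visibility: Visibility

@dataclass
class TargetAnnotation:
    target_id: int           # identity (ID) number
    bbox: tuple              # assumed layout: (x, y, width, height)
    points: list = field(default_factory=list)

    def visible_points(self):
        """Feature points labeled visible (0)."""
        return [p for p in self.points
                if p.visibility == Visibility.VISIBLE]

    def invisible_points(self):
        """Feature points labeled occluded (1) or outside the frame (2)."""
        return [p for p in self.points
                if p.visibility != Visibility.VISIBLE]
```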
[0100] After annotation, the initial tracking model is trained based on the annotated video frames until it converges. Training of the initial tracking model is then stopped, and the stopped initial tracking model is saved as the target tracking model, signifying the completion of target tracking model training. The target tracking model is then used to track targets in the video. These targets can be people, animals, or other objects; this invention uses a human body as an example for illustration.
[0101] As shown in Figure 2, in one embodiment of the present invention, the target tracking method of the present invention includes the following steps:
[0102] Step S10: Obtain the fusion features of each video frame corresponding to the video to be processed. The fusion features are obtained by concatenating the image features and optical flow features of the video frames.
[0103] Optionally, the video to be processed is acquired and the target object to be tracked, such as a human body or a face, is determined. The video to be processed is then split into a video frame sequence, each video frame is obtained from the sequence, and the fusion feature corresponding to each video frame is determined. Optionally, the fusion feature is obtained by concatenating image features and optical flow features.
[0104] Optionally, each video frame is input into an image feature extraction module to extract image features from the video frames. Optionally, the image feature extraction module is a neural network. Optionally, each video frame is input into an optical flow feature extraction module to extract optical flow features.
[0105] Optionally, for the first video frame of the video sequence, the forward optical flow information from the first video frame to the second video frame and the backward optical flow information from the second video frame to the first video frame are extracted and then concatenated to obtain the optical flow features. Step S10 corresponds to step 2 in Figure 3.
[0106] Optionally, for each video frame a2 other than the first frame of the video sequence, the backward optical flow information between video frame a2 and the previous video frame a1, and the forward optical flow information between the previous video frame a1 and video frame a2, are extracted. The forward and backward optical flow information are then concatenated to obtain the optical flow features of video frame a2. That is, the optical flow features include the backward optical flow information between the video frame and its previous video frame, and the forward optical flow information between the previous video frame and the video frame. Step S10 corresponds to step 2 in Figure 3.
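The frame pairing described in [0105] and [0106] can be sketched as follows: the first frame is paired with the second, and every other frame is paired with its predecessor; the forward and backward flow fields of the pair are then concatenated per pixel. The function names and the nested-list flow layout (rows of `(u, v)` tuples) are assumptions, and the flow estimation itself is left out.

```python
def flow_pair(frame_index, num_frames):
    """Return (earlier, later) 0-based frame indices used for a frame.

    Forward flow runs earlier -> later, backward flow later -> earlier.
    The first frame is paired with the second; any other frame is
    paired with its previous frame.
    """
    if num_frames < 2:
        raise ValueError("optical flow needs at least two frames")
    if not 0 <= frame_index < num_frames:
        raise IndexError("frame_index out of range")
    if frame_index == 0:
        return (0, 1)
    return (frame_index - 1, frame_index)

def flow_feature(forward, backward):
    """Concatenate forward and backward flow per pixel.

    forward, backward: H x W nested lists of (u, v) tuples.
    Returns an H x W nested list of (u_f, v_f, u_b, v_b) tuples.
    """
    return [[f + b for f, b in zip(frow, brow)]
            for frow, brow in zip(forward, backward)]
```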
[0107] Optionally, optical flow information can be extracted based on optical flow extraction algorithms such as the Lucas-Kanade algorithm, or optical flow information can be extracted based on deep learning optical flow extraction networks such as the RAFT network.
[0108] Optionally, feature extraction can be performed on image features and optical flow information based on a two-stream architecture. This can be achieved using convolutional networks such as InceptionNet or ResNet.
[0109] Optionally, after obtaining the image features and optical flow features, the image features and optical flow features are concatenated to obtain fused features. Optionally, the image features and optical flow features are input into a feature fusion module, which concatenates them to obtain the fused features. Optionally, concatenating the image features and optical flow features can be understood as concatenating or superimposing them along their respective channel dimensions. Step S20 corresponds to step 3 in Figure 3.
[0110] Step S20: Determine the visible feature points in the video frame based on the fusion features.
[0111] Optionally, the feature points are the feature points of the target object. Taking human body tracking as an example, the feature points can be the center point of the human body, the outer contour points, and the skeletal points. For example, skeletal point annotation is shown in Figure 4, and outer contour point annotation is shown in Figure 5. The number and location of the feature points representing an object can be defined according to requirements.
[0112] Optionally, the feature points of the target object are labeled. The labeling includes the target object's detection bounding box (bbox) and identification number, as well as the coordinates of the feature points of the target object, which are numbered starting from 0, as shown in Figure 4 or Figure 5. Each feature point is labeled with one of three cases: visible (0), occluded (1), or outside the video frame (2).
[0113] Optionally, visible feature points are feature points whose position information can be determined in the video frame, while invisible feature points can be feature points that are occluded or feature points outside the video frame.
[0114] Optionally, the feature points include key feature points, the size information of the detection box, and offsets, where the size information of the detection box includes width and height, and the key feature points are contour points and/or skeletal points of the human body, such as those shown in Figure 4 and Figure 5. The offset includes the skeletal point offset and/or the contour point offset.
[0115] Optionally, the fused features are decoded and restored based on a pre-trained decoder to obtain an optical flow map; based on the optical flow map, visible and invisible feature points in the video frame are extracted.
[0116] Understandably, the visible feature point extraction module is a pre-trained decoder. The fused features are input into this module, which then decodes and reconstructs the fused features to obtain the optical flow map. The optical flow map determines which feature points are visible and which are invisible; in other words, it extracts the visibility of each feature point on the target, distinguishing between visible and invisible features based on visibility.
[0117] Optionally, corresponding to step 5 in Figure 3, after the fused features are obtained, they are input into the visible feature point extraction module, which outputs the features of the visible feature points on the target.
[0118] Optionally, corresponding to step 7 in Figure 3, after the visible feature point extraction module outputs the visible feature point features on the target, these features are input into the detection head. The detection head outputs the visibility of each feature point on the target. Visibility characterizes whether a feature point is occluded or extends beyond the frame; in other words, it identifies whether a feature point is visible or invisible. Invisible feature points include occluded feature points and feature points outside the frame.
[0119] Step S30: Perform feature matching on the visible feature points of the target object in each video frame to obtain the target detection result of the target object.
[0120] Optionally, corresponding to step 4 in Figure 3, after the fused features are obtained, they are input into the detection feature enhancement module, which outputs a detection feature map containing the detection features. The detection features are used for subsequent feature point classification, object detection, and key point detection.
[0121] Optionally, corresponding to step 6 in Figure 3, the detection feature map is input into the detection head, which outputs feature point information. This information includes feature point categories, target detection box size information, and feature point offsets. Optionally, the feature point category can be based on the body part to which the feature point belongs on the target object, such as left foot, right foot, torso, etc. Optionally, the target detection box size information includes width, height, and center point. Optionally, the feature point offset represents the offset of the target feature point relative to the target center feature point.
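Because the detection head output described above expresses each feature point as an offset from the target center point and the detection box as width, height, and center point, absolute positions can be recovered with simple arithmetic. The sketch below illustrates this; the function names and the `(x1, y1, x2, y2)` corner layout of the returned box are assumptions.

```python
def decode_keypoints(center, offsets):
    """Recover absolute feature point positions from center-relative offsets.

    center:  (cx, cy) of the target center feature point
    offsets: list of (dx, dy) offsets, one per feature point
    """
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]

def box_from_center(center, width, height):
    """Build a detection box (x1, y1, x2, y2) from center point and size."""
    cx, cy = center
    return (cx - width / 2, cy - height / 2,
            cx + width / 2, cy + height / 2)
```

For example, a center of `(100.0, 50.0)` with an offset of `(5.0, 10.0)` places that feature point at `(105.0, 60.0)`.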
[0122] Optionally, when the target object is occluded, optical flow information cannot be extracted from the occluded part. Optical flow features can provide good occlusion judgment information. After the visible feature points extracted in step 5, a detection head is connected to output the visibility features of each feature point of the target object in the current image (visibility features can characterize whether the feature points are occluded or beyond the frame).
[0123] Optionally, corresponding to Figure 3 After obtaining the target detection results corresponding to the video frame in steps 8 and 9, the first-level matching is performed on the target in the video frame. The first-level matching is to track the target corresponding to the visible feature point after determining that the feature point is a visible feature point. That is, based on the feature vector of the visible feature point of the target, the Reid feature matching method is used to perform target matching, and the target tracking result corresponding to the first-level matching is obtained, that is, the first-level matching result.
[0124] Optionally, corresponding to Figure 3 Step 10 involves performing a second-level matching process for targets that failed to match using the REID feature matching method. This means applying second-level matching to targets missed by the first-level matching in the video frame. The second-level matching uses the actual position information of feature points for deformation matching. Specifically, the second-level matching predicts the predicted position information of key feature points (contour points and / or skeleton points) in the video frame based on their offsets. Then, it matches the predicted position information with the actual position information of the key feature points in the video frame to obtain the target tracking result corresponding to the second-level matching, i.e., the second-level matching result.
[0125] Optionally, corresponding to Figure 3 Step 11: After the first and second levels of matching, the number of unmatched targets in the video frame is very small. For targets that still do not match, a third level of matching is used, which refers to motion matching. Third-level matching is employed to match targets that still do not match in the video frame, thus achieving target tracking. Specifically, third-level matching predicts the predicted bounding box position in the video frame based on the actual center position information of the detection box's center feature point and the size information of the detection box. Then, it matches the actual bounding box position information with the predicted bounding box position information to obtain the target tracking result corresponding to the third-level matching, i.e., the third-level matching result.
[0126] Optionally, multiple targets in the video can be tracked through the above-mentioned first-level matching.
[0127] Optionally, the above-mentioned two-level matching can be used to track multiple targets in the video. That is, the first-level matching can be used to track targets corresponding to visible feature points in the video, and the second-level matching can be used to track targets corresponding to occluded feature points in the video. In other words, the first-level matching and the second-level matching are complementary, thereby enabling accurate tracking of multiple targets in the video.
[0128] Optionally, the above three-level matching can be used to track multiple targets in the video. Specifically, the first-level matching can track targets corresponding to visible feature points in the video, the second-level matching can track targets corresponding to occluded feature points in the video, and the third-level matching can track targets corresponding to feature points outside the frame in the video. In other words, the first-level matching, the second-level matching, and the third-level matching complement each other, thereby enabling more accurate tracking of multiple targets in the video.
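For illustration only, the way the three matching levels complement each other can be sketched as a cascade in which each level only sees the targets left unmatched by the previous level. The function name `cascade_match` and the `Matcher` callable signature are assumptions made for this sketch and do not appear in the description; the individual matchers (ReID, deformation, motion) are passed in as plain functions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: a matcher receives the still-free track IDs and
# detection IDs and returns a mapping {detection_id: track_id}.
Matcher = Callable[[List[int], List[int]], Dict[int, int]]


def cascade_match(
    track_ids: List[int],
    detection_ids: List[int],
    matchers: List[Matcher],
) -> Tuple[Dict[int, int], List[int]]:
    """Run the matchers in order; each level handles what earlier levels missed."""
    assigned: Dict[int, int] = {}
    free_tracks = list(track_ids)
    free_dets = list(detection_ids)
    for matcher in matchers:
        if not free_tracks or not free_dets:
            break
        result = matcher(free_tracks, free_dets)
        assigned.update(result)
        used_tracks = set(result.values())
        # Remove everything this level matched before calling the next level.
        free_tracks = [t for t in free_tracks if t not in used_tracks]
        free_dets = [d for d in free_dets if d not in result]
    return assigned, free_dets
```

Passing a list with one, two, or three matchers corresponds to the one-level, two-level, and three-level variants described above; detections still unmatched after the last level are returned separately.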
[0129] Step S40: Determine the target tracking result of the video to be processed based on the target detection result.
[0130] Optionally, the target tracking result includes the position information and labeling of the target object in the video frame. Optionally, the target object can be labeled in the video frame using detection boxes. The target detection result includes the position information of multiple target objects in the video, etc.
[0131] Optionally, the target tracking result includes at least each feature point on the target object, the center feature point and size information of the detection box corresponding to the target object, and the offset of each key feature point on the target object relative to the center feature point.
[0132] In the technical solution of this embodiment, sufficient feature information of the target object can be extracted by fusing image features and optical flow features, thereby tracking the target object, which is beneficial to comprehensively and accurately tracking multiple target objects in the video.
[0133] Referring to Figure 6, in a second embodiment of the target tracking method of the present invention, based on the first embodiment, step S30 includes:
[0134] Step S31: Obtain the historical feature vector corresponding to the video frame to be processed, and construct the historical feature space of each target object based on the historical feature vector.
[0135] Step S32: Obtain the visible feature vector of the visible feature points of each target object detected in the current video frame, and construct the feature space of each target object in the current video frame based on the visible feature vector;
[0136] Step S33: Based on the feature space of each target object in the current video frame and the corresponding historical feature space, match and associate each target object in the current video frame with the target objects in the historical feature space to obtain a first-level matching result. The first-level matching result includes the target objects that are successfully matched and the target objects that are not matched.
[0137] Step S34: Determine the target detection result based on the first-level matching result of each video frame.
[0138] It should be understood that steps S31 to S34 constitute the first-level matching, corresponding to steps 8 and 9 in Figure 3: the first-level matching of the target objects in the video frame uses the ReID feature matching method to track the targets corresponding to the visible feature points.
[0139] The first-level matching uses the feature space composed of the feature vectors of the visible feature points to perform ReID association matching and obtain the matching results. For example, among N targets in the video frame, m1 targets may be successfully matched and m2 targets may fail to match. Step 6 outputs the feature points and their position information in the detection feature map (e.g., the center feature point: [100, 300]), from which the position information of all feature points on the target can be obtained in sequence, i.e.:
[0140] (feature_point_1:[x1,y1],feature_point_2:[x2,y2],...,feature_point_n:[xn,yn]).
[0141] Optionally, to construct the feature space corresponding to the feature points, the first actual position information of each feature point in the video frame is determined; the feature vector corresponding to each feature point is obtained from the detection feature map based on the first actual position information; and the feature space is constructed based on the feature vectors of the feature points, as shown in Figure 7.
[0142] It should be understood that after each video frame is acquired, the first actual position information of each feature point in that video frame can be calculated. The feature points include key feature points and center feature points. That is, the first actual position information includes the second actual position information of the key feature points and the actual center position information of the center feature points.
[0143] Optionally, if the current video frame is not the first or second video frame, then the visible feature vector of the visible feature points of each target object in the previous video frame is obtained; based on the visible feature vector and the historical feature vector corresponding to the second video frame before the current video frame, the historical feature space of each target object in the current video frame is constructed.
[0144] Optionally, if the current video frame is the second video frame, then based on the visible feature vectors of the visible feature points of each target object in the first video frame, a historical feature space for each target object in the current video frame is constructed.
[0145] Optionally, the location information of the feature points is three-dimensional coordinates. By mapping the location information of all feature points on the target onto the detection feature map output in step 4 (where the feature map dimension of the detection feature map is [W, H, C]), the feature vector [1, 1, C] corresponding to each feature point can be extracted. Based on the feature vectors corresponding to all feature points on the detection feature map, a feature space is constructed, as shown in Figure 7.
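For illustration only, the lookup of per-point feature vectors from a detection feature map of dimension [W, H, C] might be sketched as follows. The function name `extract_feature_space`, the nested-list representation of the feature map, and the use of point names as dictionary keys are assumptions of this sketch; points that fall outside the map receive a zero vector, in line with the default described for feature points outside the frame.

```python
from typing import Dict, List, Tuple

# Assumed layout: feature_map[x][y] is the C-dimensional vector at (x, y).
FeatureMap = List[List[List[float]]]


def extract_feature_space(
    feature_map: FeatureMap,
    points: Dict[str, Tuple[int, int]],
) -> Dict[str, List[float]]:
    """Collect one C-dimensional feature vector per feature point."""
    width = len(feature_map)
    height = len(feature_map[0])
    channels = len(feature_map[0][0])
    space: Dict[str, List[float]] = {}
    for name, (x, y) in points.items():
        if 0 <= x < width and 0 <= y < height:
            space[name] = list(feature_map[x][y])
        else:
            # Point lies outside the frame: use a zero vector as a placeholder.
            space[name] = [0.0] * channels
    return space
```

For example, with a point recorded as [-1, -1] because it could not be detected, the returned entry is a zero vector of length C.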
[0146] In a video, a feature point on a target falls into one of three categories: 0: visible, 1: occluded, and 2: outside the frame. Due to camera or target movement, any feature point on the target may pass through all three categories over time. For feature points outside the frame, the feature vectors in the feature space default to a zero vector of [1, 1, N]. Feature points outside the frame cannot be detected, and the feature information extracted at occluded feature points may originate mostly from the objects occluding the target; therefore, the feature vectors corresponding to these two types of feature points cannot characterize the target. Accordingly, based on the visibility output in step 7, this invention also sets the feature vectors of occluded feature points to a zero vector of [1, 1, N]. After the tracking task begins, the visible feature vectors of all targets detected in the first video frame are extracted and used to form a feature space set, feature_set_1 = {object_1_feature space 1, object_2_feature space 1, ..., object_M_feature space 1}.
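For illustration only, the zeroing of feature vectors for occluded and out-of-frame feature points can be sketched as below. The category codes 0, 1, and 2 follow the description; the function name `mask_invisible` and the dictionary layout are assumptions of this sketch, as is treating a point with no visibility entry as outside the frame.

```python
from typing import Dict, List

VISIBLE, OCCLUDED, OUT_OF_FRAME = 0, 1, 2


def mask_invisible(
    feature_space: Dict[str, List[float]],
    visibility: Dict[str, int],
) -> Dict[str, List[float]]:
    """Keep vectors of visible points; replace all others with zero vectors."""
    masked: Dict[str, List[float]] = {}
    for name, vector in feature_space.items():
        # Assumption: a point without a visibility entry counts as out of frame.
        if visibility.get(name, OUT_OF_FRAME) == VISIBLE:
            masked[name] = list(vector)
        else:
            masked[name] = [0.0] * len(vector)
    return masked
```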
[0147] Furthermore, different IDs are assigned to the detected targets, and the constructed feature space is used as the history feature space history_feature_set, history_feature_set=feature_set_1; starting from the j-th frame (j>=2), the visible feature vectors of the detected targets are extracted to form the feature space set feature_set_j = {object_1_feature space j, object_2_feature space j, ..., object_M_feature space j}, that is, the historical feature vectors of each feature point can be extracted from the historical feature space.
[0148] Specifically, after the feature space is constructed, the visible feature points are selected from among all feature points based on visibility, and the corresponding visible feature vectors are determined from the feature space. Next, the historical feature vectors of the target's historical visible feature points are obtained from the historical feature space. Historical visible feature points are the visible feature points of the same target in previous video frames. Here, the visible feature points belong to the feature space, and the historical visible feature points belong to the historical feature space.
[0149] After the visible feature vector and the historical feature vector are obtained, the cosine distance between them is calculated. This cosine distance is then input into a preset matching algorithm to match and associate visible feature points with historical visible feature points, thereby obtaining the first-level matching result and enabling tracking of the target corresponding to the visible feature points. The preset matching algorithm can be either the Hungarian algorithm or the Kuhn-Munkres (KM) matching algorithm.
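For illustration only, the cosine-distance matching can be sketched as follows. To stay dependency-free, this sketch replaces the Hungarian or KM algorithm with an exhaustive search over permutations, which finds the same minimum-cost assignment but is only practical for a handful of targets. The names `cosine_distance` and `optimal_assignment` and the gating threshold `max_cost` are assumptions of this sketch and are not specified in the description.

```python
import math
from itertools import permutations
from typing import Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # masked (zeroed) vectors carry no identity information
    return 1.0 - dot / (norm_a * norm_b)


def optimal_assignment(cost: List[List[float]], max_cost: float) -> Dict[int, int]:
    """cost[i][j] is the distance between detection i and track j.

    Returns {detection_index: track_index} for the minimum-total-cost
    assignment, dropping pairs whose cost exceeds max_cost.
    """
    n_det = len(cost)
    n_trk = len(cost[0]) if n_det else 0
    if n_det == 0 or n_trk == 0:
        return {}
    if n_det <= n_trk:
        candidates = (list(enumerate(p)) for p in permutations(range(n_trk), n_det))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n_det), n_trk))
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))
    return {i: j for i, j in best if cost[i][j] <= max_cost}
```

Detections missing from the returned mapping are the targets that failed first-level matching and would be handed to the next matching level.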
[0150] It is worth noting that after target matching is completed in each video frame, the historical feature space is updated using the feature space set in the j-th frame. The update method is as follows:
[0151] history_feature_set[object_i_feature_space] = γ * history_feature_set[object_i_feature_space] + (1 -γ) * feature_set_j[object_i_feature_space], and so on for subsequent video frames.
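For illustration only, the update rule above is an exponential moving average with weight γ on the history. A minimal sketch follows, assuming each target's feature space is flattened into a single list of floats and that a target seen for the first time is initialized from the current frame; both the function name `update_history` and these simplifications are assumptions of this sketch.

```python
from typing import Dict, List


def update_history(
    history: Dict[int, List[float]],
    current: Dict[int, List[float]],
    gamma: float,
) -> Dict[int, List[float]]:
    """Blend each matched target's current features into its history."""
    updated = {obj_id: list(vec) for obj_id, vec in history.items()}
    for obj_id, cur_vec in current.items():
        old_vec = history.get(obj_id)
        if old_vec is None:
            # Assumption: a newly seen target starts from its current features.
            updated[obj_id] = list(cur_vec)
        else:
            updated[obj_id] = [
                gamma * h + (1.0 - gamma) * c for h, c in zip(old_vec, cur_vec)
            ]
    return updated
```

A larger γ makes the historical feature space change more slowly, which favors stability over responsiveness to appearance changes.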
[0152] In the technical solution of this embodiment, the first-level matching result is determined by comparing the feature vectors of the feature space and the historical feature space, and the target detection result is determined based on the first-level matching result to achieve the tracking of the target object.
[0153] Referring to Figure 8, in a third embodiment of the target tracking method of the present invention, based on any one of the first to second embodiments, after step S33, the method further includes:
[0154] Step S35: Obtain the target objects that failed to match in the first-level matching results corresponding to the video to be processed;
[0155] Step S36: Obtain the location information of the feature points of the unmatched target objects in each video frame of the video to be processed, and store the location information of the feature points of the same target object in a queue.
[0156] Step S37: If the queue is full, then match and associate the target objects corresponding to each frame in the queue to obtain a second-level matching result;
[0157] Step S38: Determine the target detection result based on the first-level matching result and the second-level matching result corresponding to each video frame.
[0158] Optionally, the actual position information of each of the key feature points in the video frame is determined; when each of the key feature points is determined to be a visible feature point based on the visibility, the predicted position information of each of the key feature points in the video frame is determined based on the optical flow feature and the offset; when each of the key feature points is determined to be an invisible feature point based on the visibility, the historical position information corresponding to each of the key feature points in the previous video frame is obtained; the predicted position information of the invisible feature point in the video frame is determined based on the historical position information and the offset; the Euclidean distance between the actual position information and the predicted position information is determined; and the actual position information and the predicted position information are matched and associated based on the Euclidean distance to obtain a secondary matching result.
[0159] Optionally, corresponding to step 10 in Figure 3, visibility is used to determine whether each feature point is visible or invisible, with invisible feature points further categorized as either occluded or outside the frame. Because the feature points include the key feature points (contour points and/or skeleton points), determining the visibility of each feature point also gives the visibility of each key feature point on the target, and the actual position information of each key feature point and its offset relative to the center feature point can likewise be obtained.
[0160] Optionally, for targets missed by the first-level matching, second-level matching is used. Step 6 in Figure 3 outputs the actual position information of the feature points of each target, namely:
[0161] object_m=(feature_point_0,feature_point_1,feature_point_2,...,feature_point_i,...), feature_point_i=[x, y].
[0162] Optionally, if a feature point of a target cannot be detected or is outside the frame, its actual position information defaults to [-1, -1]. A history deformation queue, `history_shape_list`, is then maintained for each target to store the historical position information of that target's feature points. The queue size can be set manually. The ReID features from step 8 are used to match identical targets appearing in each video frame, and the historical position information of the feature points of identical targets is placed in the same queue. A full queue indicates that the target object is in a tracked state. If an unmatched target appears in step 8, it is necessary to determine whether the key points of the unmatched target are visible or invisible feature points.
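For illustration only, a fixed-size per-target queue such as `history_shape_list` can be sketched with a bounded `collections.deque`, which discards the oldest entry automatically once the configured size is reached. The class name `ShapeHistory` and its methods are assumptions of this sketch.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Point = Tuple[float, float]


class ShapeHistory:
    """Per-target bounded queues of feature point positions."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self.queues: Dict[int, Deque[List[Point]]] = {}

    def push(self, track_id: int, points: List[Point]) -> None:
        """Append one frame's feature point positions for a target."""
        queue = self.queues.setdefault(track_id, deque(maxlen=self.max_len))
        queue.append(list(points))

    def is_full(self, track_id: int) -> bool:
        """A full queue indicates the target is in a tracked state."""
        queue = self.queues.get(track_id)
        return queue is not None and len(queue) == self.max_len
```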
[0163] Optionally, key feature points may be visible or invisible, with invisible feature points specifically referring to occluded feature points. During the second-level matching process, based on the first actual position information of each feature point in the video frame, the second actual position information of each key feature point in that video frame is obtained. If visible feature points are included in the key feature points determined by visibility, the predicted position information of the visible feature points in the video frame is calculated based on the offset and forward optical flow information in the optical flow features. If invisible feature points are included in the key feature points determined by visibility, the historical position information of each key feature point in the previous video frame is obtained from the history_shape_list queue. Then, the historical position information and offset are input into the Kalman filter algorithm to obtain the predicted position information of the invisible feature points in the video frame. Key points with historical position information of [-1, -1] are not used as input to the Kalman filter algorithm.
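For illustration only, the two prediction paths can be sketched as follows: a visible key point is advanced by its forward optical flow vector, while an invisible key point is extrapolated from its queued history with a Kalman filter. The description does not specify the filter's state model or how the offset enters the prediction, so this sketch assumes an independent constant-velocity filter per coordinate with illustrative noise values, and omits the offset. All names here (`ConstantVelocityKalman1D`, `predict_visible`, `predict_invisible`) are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
MISSING = (-1.0, -1.0)  # placeholder for undetected or out-of-frame points


class ConstantVelocityKalman1D:
    """Kalman filter over state [position, velocity] with unit time step."""

    def __init__(self, position: float, process_var: float = 1e-2,
                 measurement_var: float = 1.0) -> None:
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = process_var, measurement_var

    def predict(self) -> float:
        (a, b), (c, d) = self.P
        self.p += self.v
        # P <- F P F^T + Q with F = [[1, 1], [0, 1]]
        self.P = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def update(self, z: float) -> None:
        (a, b), (c, d) = self.P
        s = a + self.r            # innovation covariance (H = [1, 0])
        k0, k1 = a / s, c / s     # Kalman gain
        y = z - self.p            # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def predict_visible(previous: Point, forward_flow: Point) -> Point:
    """Visible point: shift the previous position by the forward flow."""
    return (previous[0] + forward_flow[0], previous[1] + forward_flow[1])


def predict_invisible(history: List[Point]) -> Optional[Point]:
    """Invisible point: extrapolate from history, never feeding MISSING entries."""
    start = next((i for i, p in enumerate(history) if tuple(p) != MISSING), None)
    if start is None:
        return None
    kx = ConstantVelocityKalman1D(history[start][0])
    ky = ConstantVelocityKalman1D(history[start][1])
    for point in history[start + 1:]:
        kx.predict()
        ky.predict()
        if tuple(point) != MISSING:
            kx.update(point[0])
            ky.update(point[1])
    return (kx.predict(), ky.predict())
```

Frames in which the point is recorded as [-1, -1] advance the filter without a measurement update, which is one way to honor the rule that such entries are not used as Kalman filter input.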
[0164] Optionally, a second-level matching is used for targets missed by the first-level matching. Specifically, the Euclidean distance between the second actual position information and the predicted position information of key feature points on the target is calculated. After obtaining the Euclidean distance, it is input into the Hungarian algorithm or the KM matching algorithm to match and associate the second actual position information with the predicted position information, thereby obtaining the second-level matching result. This enables tracking of targets corresponding to occluded feature points, meaning that the second-level matching achieves the tracking of targets missed by the first-level matching.
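For illustration only, the distance fed to the assignment step might be computed as below. The description states only that a Euclidean distance between actual and predicted key point positions is used; averaging over the key points that are available on both sides, and the function name `keypoint_distance`, are assumptions of this sketch.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
MISSING = (-1, -1)  # placeholder for undetected or out-of-frame points


def keypoint_distance(
    actual: List[Point],
    predicted: List[Optional[Point]],
) -> float:
    """Mean Euclidean distance over key points present in both lists."""
    distances = [
        math.dist(a, p)
        for a, p in zip(actual, predicted)
        if tuple(a) != MISSING and p is not None and tuple(p) != MISSING
    ]
    # No comparable key points: return infinity so the pair is never matched.
    return sum(distances) / len(distances) if distances else float("inf")
```

Evaluating this function for every pair of unmatched detection and track yields the cost matrix passed to the Hungarian or KM algorithm.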
[0165] In the technical solution of this embodiment, by introducing multi-level matching, targets that cannot be tracked can be supplemented, and multiple targets in the video can be accurately tracked. This reduces the problem of losing targets due to severe target occlusion, and avoids or improves the phenomenon of target switching under complex motion conditions.
[0166] Referring to Figure 9, in a fourth embodiment of the target tracking method of the present invention, based on any one of the first to third embodiments, after step S37, the method further includes:
[0167] Step S39: Obtain the target objects for which the first-level matching result and the second-level matching result failed to match the video to be processed;
[0168] Step S310: Determine the detection box corresponding to the unmatched target object;
[0169] Step S311: When the center feature point of the detection box is a visible feature point, determine the predicted frame position information of the detection box in the current video frame based on the optical flow feature and the position information of the detection box in the previous video frame.
[0170] Step S312: When the center feature point of the detection box is an invisible feature point, the predicted box position information of the detection box in the current video frame is determined according to the Kalman filter algorithm.
[0171] Step S313: Based on the frame position information of the unmatched target object in the current frame and the predicted frame position information, match and associate the target object in the current video frame with the target object in the previous video frame to obtain a third-level matching result.
[0172] Step S314: Determine the target detection result based on the first-level matching result, the second-level matching result, and the third-level matching result corresponding to each video frame.
[0173] Optionally, the actual center position information of the central feature point in the video frame is determined; the actual frame position information of the detection box in the video frame is determined based on the actual center position information and the size information; optionally, when the central feature point is determined to be a visible feature point based on the visibility, the predicted frame position information of the detection box in the video frame is determined based on the optical flow feature and the size information; optionally, when the central feature point is determined to be an invisible feature point based on the visibility, the historical center position information of the central feature point in the previous video frame is obtained, and the predicted frame position information of the detection box in the video frame is determined based on the historical center position information and the size information. The intersection over union (IoU) between the actual frame position information and the predicted frame position information is determined; the actual frame position information and the predicted frame position information are matched and associated based on the IoU to obtain a third-level matching result.
[0174] Optionally, during the third-level matching process, based on the first actual position information of each feature point in the video frame, the actual center position information of the center feature point of the detection box corresponding to the target in the video frame is obtained. After obtaining the center feature point of the detection box corresponding to the target in the video frame, the actual center position information of the center feature point, and the size information of the detection box, the actual box position information of the detection box in the video frame can be calculated using the actual center position information and the size information. Since each feature point includes the center feature point, the visibility of the center feature point of the detection box can also be obtained after the visibility of each feature point is obtained.
[0175] Optionally, after obtaining the visibility of the central feature point, if the central feature point is determined to be a visible feature point based on its visibility, then the predicted box position information of the detection box in the video frame is calculated based on the size information and the forward optical flow information in the optical flow features. If the central feature point is determined to be an invisible feature point based on its visibility, that is, the central feature point is a feature point that extends beyond the frame, then the historical center position information of the central feature point in the previous video frame is obtained. This historical center position information is then input into the Kalman filter algorithm, and based on the size information, the predicted box position information of the detection box in the video frame is predicted.
[0176] Optionally, after obtaining the actual and predicted bounding box positions in the video frame, the intersection-over-union (IoU) ratio between the actual and predicted bounding box positions is calculated. The IoU ratio is then input into either the Hungarian algorithm or the KM matching algorithm to match and associate the actual and predicted bounding box positions, thus obtaining the third-level matching result. This enables tracking of targets corresponding to feature points outside the frame, meaning that the third-level matching tracks targets missed by the first and second-level matching.
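For illustration only, the box construction from center and size and the IoU computation can be sketched as follows. The corner-coordinate box layout (x1, y1, x2, y2) and the function names `box_from_center` and `iou` are assumptions of this sketch.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_from_center(cx: float, cy: float, width: float, height: float) -> Box:
    """Build a box from its center feature point position and size."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Because assignment algorithms such as the Hungarian algorithm minimize cost, a common convention is to pass `1 - iou(actual, predicted)` as the cost for each pair.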
[0177] In the technical solution of this embodiment, by introducing multi-level matching, targets that cannot be tracked can be supplemented, and multiple targets in the video can be accurately tracked. This reduces the problem of target loss due to severe target occlusion, and avoids or improves the phenomenon of target ID switching under complex motion conditions.
[0178] As shown in Figure 10, the present invention provides a target tracking device, the target tracking device comprising:
[0179] The feature extraction module 100 is used to obtain the fusion features of each video frame corresponding to the video to be processed. The fusion features are obtained by concatenating the image features and optical flow features of the video frames.
[0180] The feature determination module 200 is used to determine visible feature points in the video frame based on the fused features;
[0181] The result detection module 300 is used to perform feature matching on the visible feature points of the target object in each video frame to obtain the target detection result of the target object.
[0182] The result determination module 400 is used to determine the target tracking result of the video to be processed based on the target detection result.
[0183] Optionally, the step of performing feature matching on the visible feature points of the target object in each of the video frames to obtain the target detection result of the target object includes:
[0184] Obtain the historical feature vectors corresponding to the video frames to be processed, and construct the historical feature space of each target object based on the historical feature vectors;
[0185] Obtain the visible feature vector of each visible feature point of the target object detected in the current video frame, and construct the feature space of each target object in the current video frame based on the visible feature vector;
[0186] Based on the feature space of each target object in the current video frame and the corresponding historical feature space, each target object in the current video frame is matched and associated with the target objects in the historical feature space to obtain a first-level matching result. The first-level matching result includes target objects that are successfully matched and target objects that are not matched.
[0187] The target detection result is determined based on the first-level matching result of each video frame.
[0188] Optionally, the step of obtaining the historical feature vector corresponding to the video frame to be processed, and constructing the historical feature space of each target object based on the historical feature vector includes:
[0189] If the current video frame is not the first or second video frame, then obtain the visible feature vector of each visible feature point of the target object in the previous video frame.
[0190] Based on the visible feature vector and the historical feature vector corresponding to the second video frame before the current video frame, construct the historical feature space of each target object in the current video frame;
[0191] If the current video frame is the second video frame, then the visible feature vector of each visible feature point of the target object in the first video frame is used.
[0192] Based on the visible feature vectors, construct the historical feature space of each target object in the current video frame.
[0193] Optionally, after the step of matching and associating each target object in the current video frame with the target objects in the historical feature space based on the feature space of each target object in the current video frame and the corresponding historical feature space to obtain a first-level matching result, the method further includes:
[0194] Obtain the target objects that failed to match in the first-level matching results corresponding to the video to be processed;
[0195] Obtain the location information of the feature points of the unmatched target objects in each video frame of the video to be processed, and store the location information of the feature points of the same target object in a queue;
[0196] If the queue is full, the target objects corresponding to each frame in the queue are matched and associated to obtain a second-level matching result;
[0197] The target detection result is determined based on the first-level matching result and the second-level matching result corresponding to each video frame.
[0198] Optionally, after the step of storing the position information of the feature points of the target object in a queue based on the position information of the feature points of the same target object in each video frame of the video to be processed, the method further includes:
[0199] If the queue is not full, then obtain each visible feature point of the target object in the current video frame, and determine the predicted position information of each visible feature point in the video frame based on the optical flow feature and the position information of the visible feature points in the previous video frame.
[0200] Obtain each invisible feature point of the target object in the current video frame, and determine the predicted position information of the invisible feature points in the video frame according to the Kalman filter algorithm;
[0201] Based on the actual position information of the feature points of each target object in the current video frame and the predicted position information, the target objects in the current video frame are matched and associated with the target objects in the previous video frame to obtain a second-level matching result.
[0202] Optionally, after obtaining the second-level matching result, the method further includes:
[0203] Obtain the target objects for which the first-level and second-level matching results failed to match the video to be processed;
[0204] Identify the bounding boxes corresponding to unmatched target objects;
[0205] When the center feature point of the detection box is a visible feature point, the predicted box position information of the detection box in the current video frame is determined based on the optical flow feature and the position information of the detection box in the previous video frame.
[0206] When the center feature point of the detection box is an invisible feature point, the predicted box position information of the detection box in the current video frame is determined according to the Kalman filter algorithm.
[0207] Based on the frame position information of the unmatched target object in the current frame and the predicted frame position information, the target object in the current video frame and the target object in the previous video frame are matched and associated to obtain a third-level matching result.
[0208] The target detection result is determined based on the first-level matching result, the second-level matching result, and the third-level matching result corresponding to each video frame.
[0209] Optionally, the step of determining visible feature points in the video frame based on the fusion features includes:
[0210] The fused features are decoded and restored based on a pre-trained decoder to obtain an optical flow map;
[0211] Based on the optical flow map, visible and invisible feature points are extracted from the video frame.
[0212] The specific implementation of the target tracking device of the present invention is basically the same as the various embodiments of the target tracking method described above, and will not be repeated here.
[0213] In addition, to achieve the above objectives, the present invention also provides a terminal device, the terminal device comprising: a memory, a processor, and a target tracking program stored in the memory and executable on the processor, wherein the target tracking program, when executed by the processor, implements the steps of the target tracking method described above.
[0214] In addition, to achieve the above objectives, the present invention also provides a storage medium storing a target tracking program thereon, which, when executed by a processor, implements the steps of the target tracking method described above.
[0215] The above are merely preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent structural or procedural transformations made based on the description and drawings of the present invention, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of the present invention.
Claims
1. A target tracking method, characterized in that the target tracking method includes: obtaining the fusion features of each video frame corresponding to the video to be processed, wherein the fusion features are obtained by concatenating the image features and optical flow features of the video frames; for the first video frame of the video sequence, the optical flow features include forward optical flow information between the first video frame and the second video frame, and backward optical flow information between the second video frame and the first video frame; for non-first video frames of the video sequence, the optical flow features include backward optical flow information between the current video frame and the previous video frame, and forward optical flow information between the previous video frame and the current video frame; determining, based on the fusion features, visible feature points in the video frame, wherein the visible feature points are feature points whose position information can be determined in the video frame, and the invisible feature points are feature points that are occluded or feature points that are outside the video frame; performing feature matching on the visible feature points of the target object in each of the video frames to obtain the target detection result of the target object; and determining, based on the target detection result, the target tracking result of the video to be processed; wherein the step of performing feature matching on the visible feature points of the target object in each of the video frames to obtain the target detection result includes: obtaining the historical feature vectors corresponding to the video frames to be processed, and constructing the historical feature space of each target object based on the historical feature vectors; obtaining the visible feature vector of each visible feature point of the target object detected in the current video frame, and constructing the feature space of each target object in the current video frame based on the visible feature vector; matching and associating, based on the feature space of each target object in the current video frame and the corresponding historical feature space, each target object in the current video frame with the target objects in the historical feature space to obtain a first-level matching result, the first-level matching result including target objects that are successfully matched and target objects that are not matched; and determining the target detection result based on the first-level matching result of each video frame.
2. The target tracking method as described in claim 1, characterized in that the step of obtaining the historical feature vectors corresponding to the video frames to be processed and constructing the historical feature space of each target object based on the historical feature vectors includes: if the current video frame is neither the first nor the second video frame, obtaining the visible feature vector of each visible feature point of the target object in the previous video frame, and constructing the historical feature space of each target object in the current video frame based on the visible feature vector and the historical feature vector corresponding to the second video frame before the current video frame; if the current video frame is the second video frame, obtaining the visible feature vector of each visible feature point of the target object in the first video frame, and constructing the historical feature space of each target object in the current video frame based on the visible feature vectors.
3. The target tracking method as described in claim 1, characterized in that, after the step of matching and associating each target object in the current video frame with the target objects in the historical feature space based on the feature space of each target object in the current video frame and the corresponding historical feature space to obtain a first-level matching result, the method further includes: obtaining the target objects that failed to match in the first-level matching result corresponding to the video to be processed; obtaining the position information of the feature points of the unmatched target objects in each video frame of the video to be processed, and storing the position information of the feature points of the same target object in a queue; if the queue is full, matching and associating the target objects corresponding to each frame in the queue to obtain a second-level matching result; and determining the target detection result based on the first-level matching result and the second-level matching result corresponding to each video frame.
4. The target tracking method as described in claim 3, characterized in that, after the step of obtaining the position information of the feature points of the unmatched target objects in each video frame of the video to be processed and storing the position information of the feature points of the same target object in a queue, the method further includes: if the queue is not full, obtaining each visible feature point of the target object in the current video frame, and determining the predicted position information of each visible feature point in the video frame based on the optical flow feature and the position information of the visible feature points in the previous video frame; obtaining each invisible feature point of the target object in the current video frame, and determining the predicted position information of the invisible feature points in the video frame according to the Kalman filter algorithm; and matching and associating, based on the actual position information of the feature points of each target object in the current video frame and the predicted position information, the target objects in the current video frame with the target objects in the previous video frame to obtain a second-level matching result.
5. The target tracking method as described in claim 3 or 4, characterized in that, after obtaining the second-level matching result, the method further includes: obtaining the target objects that remain unmatched in both the first-level matching result and the second-level matching result corresponding to the video to be processed; determining the detection boxes corresponding to the unmatched target objects; when the center feature point of a detection box is a visible feature point, determining the predicted box position information of the detection box in the current video frame based on the optical flow feature and the position information of the detection box in the previous video frame; when the center feature point of a detection box is an invisible feature point, determining the predicted box position information of the detection box in the current video frame according to the Kalman filter algorithm; matching and associating, based on the box position information of the unmatched target objects in the current video frame and the predicted box position information, the target objects in the current video frame with the target objects in the previous video frame to obtain a third-level matching result; and determining the target detection result based on the first-level matching result, the second-level matching result, and the third-level matching result corresponding to each video frame.
6. The target tracking method as described in claim 1, characterized in that the step of determining the visible feature points in the video frame based on the fusion features includes: decoding the fusion features with a pre-trained decoder to recover an optical flow map; and extracting, based on the optical flow map, visible and invisible feature points from the video frame.
7. A target tracking device, characterized in that the target tracking device includes: a feature extraction module, used to obtain the fusion features of each video frame corresponding to the video to be processed, wherein the fusion features are obtained by concatenating the image features and optical flow features of the video frames; for the first video frame of the video sequence, the optical flow features include forward optical flow information between the first video frame and the second video frame, and backward optical flow information between the second video frame and the first video frame; for non-first video frames of the video sequence, the optical flow features include backward optical flow information between the current video frame and the previous video frame, and forward optical flow information between the previous video frame and the current video frame; a feature determination module, used to determine visible feature points in the video frame based on the fusion features, wherein the visible feature points are feature points whose position information can be determined in the video frame, and the invisible feature points are feature points that are occluded or feature points that are outside the video frame; a result detection module, used to perform feature matching on the visible feature points of target objects in each of the video frames to obtain the target detection result of the target objects, wherein the historical feature vector corresponding to the video frame to be processed is obtained, and a historical feature space of each target object is constructed based on the historical feature vector; the visible feature vector of the visible feature points of each target object detected in the current video frame is obtained, and a feature space of each target object in the current video frame is constructed based on the visible feature vector; according to the feature space of each target object in the current video frame and the corresponding historical feature space, each target object in the current video frame is matched and associated with the target objects in the historical feature space to obtain a first-level matching result, the first-level matching result including the target objects that are successfully matched and the target objects that are not matched; and the target detection result is determined based on the first-level matching result of each video frame; and a result determination module, used to determine the target tracking result of the video to be processed based on the target detection result.
8. A terminal device, characterized in that, The terminal device includes: a memory, a processor, and a target tracking program stored in the memory and executable on the processor, wherein the target tracking program, when executed by the processor, implements the steps of the target tracking method as described in any one of claims 1-6.
9. A storage medium, characterized in that, It stores a target tracking program, which, when executed by a processor, implements the steps of the target tracking method according to any one of claims 1-6.