Target tracking method and device, electronic equipment and computer readable storage medium
By using a similarity calculation method based on target detection networks and motion state information, the problems of slow target tracking speed and low accuracy in existing technologies are solved, achieving high-speed and high-accuracy target tracking.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CRSC COMM & INFORMATION GRP CO LTD
- Filing Date
- 2022-10-14
- Publication Date
- 2026-06-30
Figure CN115908481B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a target tracking method, apparatus, electronic device, and computer-readable storage medium.
Background Technology
[0002] In fields such as video surveillance, trajectory tracking, and industrial management, it is often necessary to detect and track targets in videos or images, such as people, animals, and vehicles. In practical applications, target tracking must be both fast and accurate before it can be applied more widely in production and daily life.
Summary of the Invention
[0003] This disclosure provides a target tracking method, apparatus, electronic device, and computer-readable storage medium.
[0004] In a first aspect, this disclosure provides a target tracking method, which includes:
[0005] Performing feature extraction and target detection on the current frame image of the video to be detected through a target detection network, to obtain the features of the predetermined target extracted from the current frame image and the target detection results;
[0006] Based on the target detection results, performing reverse derivation to obtain the reverse derivation features of the predetermined target in the current frame image;
[0007] Calculating the motion state information of the predetermined target in the previous frame image, and predicting the position of the predetermined target in the current frame image based on the motion state information to obtain the predicted position coordinates;
[0008] Based on the position coordinates obtained from the target detection results, the predicted position coordinates, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image, calculating the similarity information of the predetermined target in the current frame image;
[0009] Performing target tracking based on the calculated similarity information to obtain the tracking result of the predetermined target.
[0010] In a second aspect, this disclosure provides a target tracking device, which includes: a target detection module, used to perform feature extraction and target detection on a current frame image in a video to be detected through a target detection network, so as to obtain features of a predetermined target extracted from the current frame image and target detection results;
[0011] A feature calculation module, used to perform reverse derivation based on the target detection results to obtain the reverse derivation features of the predetermined target in the current frame image;
[0012] A position prediction module, used to calculate the motion state information of the predetermined target in the previous frame image, and predict the position of the predetermined target in the current frame image based on the motion state information to obtain the predicted position coordinates;
[0013] A similarity calculation module, used to calculate the similarity information of the predetermined target in the current frame image based on the position coordinates obtained from the target detection results, the predicted position coordinates, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image;
[0014] A tracking result determination module, used to perform target tracking based on the calculated similarity information and obtain the tracking result of the predetermined target.
[0015] In a third aspect, this disclosure provides an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the target tracking method described above.
[0016] In a fourth aspect, this disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program implements the target tracking method described above when executed by a processor or processor core.
[0017] In the embodiments provided in this disclosure, the features obtained by reverse derivation from the target detection results represent the target in the current frame image more accurately, which makes the calculated similarity information more accurate. Furthermore, target tracking can be performed based on the similarity information calculated during the detection process, so that the detection process enhances the target tracking results. The target tracking method of this disclosure can therefore improve the accuracy of the tracking results and reduce the complexity of target tracking, which improves both the speed and the accuracy of target tracking and further reduces the number of ID switches.
[0018] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0019] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the embodiments of the present disclosure to explain the disclosure and do not constitute a limitation thereof. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the accompanying drawings, in which:
[0020] Figure 1 is a flowchart illustrating target tracking in the related technologies provided in the embodiments of this disclosure;
[0021] Figure 2 is a flowchart of a target tracking method provided in an embodiment of this disclosure;
[0022] Figure 3 is a flowchart illustrating a target tracking method of an exemplary embodiment of this disclosure;
[0023] Figure 4 is a block diagram of a target tracking device provided in an embodiment of this disclosure;
[0024] Figure 5 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Implementation
[0025] To enable those skilled in the art to better understand the technical solutions of this disclosure, exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of this disclosure to aid understanding. These should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
[0026] Where there is no conflict, the various embodiments of this disclosure and the features thereof in the embodiments may be combined with each other.
[0027] As used herein, the term “and / or” includes any and all combinations of one or more related enumerated entries.
[0028] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that when the terms “comprising” and/or “made of” are used in this specification, they specify the presence of features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Words such as “connected” or “linked” are not limited to physical or mechanical connections but can include electrical connections, whether direct or indirect.
[0029] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.
[0030] In this embodiment of the disclosure, the target being tracked can be an object of interest in a video frame or image. For example, in traffic monitoring, it is often necessary to track specific targets such as people or vehicles in motion; in logistics scenarios, it is often necessary to track specific targets such as unmanned delivery vehicles or conveyor belt packages; and in production environments, it is often necessary to track certain specific products such as components.
[0031] Target tracking can be understood as locating a specific object within a series of video frames. In other words, it's the process of finding a target of interest defined in the current frame within subsequent frames of a video sequence. For example, if a specific target (a target of interest) is found in the current frame of a video, then the target's location needs to be found in subsequent frames. The current frame can be any frame in the video; typically, target tracking begins from the second frame, with the first frame used to mark the target's initial position.
[0032] Figure 1 is a flowchart illustrating target tracking in the related technology provided in embodiments of this disclosure. As shown in Figure 1, target tracking in related technologies includes the following steps: S101, acquiring video data; S102, extracting features from the current frame image in the video data to obtain the extracted features of the current frame image; S103, performing target detection based on the extracted features to obtain the detection result of the target; S104, estimating the motion state of the target based on the detection result of the current frame and the detection results of the previous several frames; S105, performing target matching in the current frame image based on the target detection result and the estimated motion state of the target to obtain the matching result; S106, determining and outputting the target tracking result based on the matching result.
[0033] The target tracking process of Figure 1 can be implemented using a variety of related techniques. These techniques include at least one of the following: Simple Online and Realtime Tracking (Sort), Simple Online and Realtime Tracking with a Deep Association Metric (DeepSort), Fully-Convolutional Siamese Networks for Object Tracking (SiamFC), and Template Update (UpdateNet).
[0034] In the Sort technique, target detection is performed on the current frame using Sort. The current position is predicted using a Kalman filter, and the predicted position is matched with the actual detected position. Similarity calculation (considering only motion information) is used to calculate the matching results between the current and previous frames. If the matching result indicates that the position of a target in the current frame is the same as the position of a target in a previous frame, then that target in the current frame is considered the same as that target in a previous frame. The Hungarian algorithm is used for data association, assigning a target ID to each object. Because the Sort technique tracks targets based on their motion trends, the accuracy of the tracking results is very low when the frame image is occluded, resulting in a high number of ID switching operations.
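The motion-only association described for Sort can be sketched as follows. This is a minimal illustration and not the Sort implementation: it assumes boxes in (x1, y1, x2, y2) form, uses intersection over union (IoU) as the motion-only similarity, and replaces the Hungarian algorithm with a brute-force search over assignments, which finds the same optimum but is only practical for a handful of targets. The function names and the `min_iou` threshold are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detected, min_iou=0.3):
    """Map track index -> detection index so that the total IoU is maximal.

    Brute-force stand-in for the Hungarian algorithm (assumption: few targets).
    """
    if not predicted or not detected:
        return {}
    # Permute the longer list over the shorter one so every pairing is tried.
    swap = len(predicted) > len(detected)
    small, large = (detected, predicted) if swap else (predicted, detected)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_pairs, best_score = list(enumerate(perm)), score
    matches = {}
    for i, j in best_pairs:
        track, det = (j, i) if swap else (i, j)
        # Pairs that barely overlap are treated as unmatched.
        if iou(predicted[track], detected[det]) >= min_iou:
            matches[track] = det
    return matches
```

Because only box overlap is compared, two targets that cross or occlude each other can swap matches, which is the ID-switch weakness noted above.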
[0035] DeepSort adds a neural network to Sort to extract the appearance features of the target, which greatly reduces ID-Switch. However, due to the introduction of the neural network, target tracking requires the cooperation of multiple models, which increases the complexity and computational cost, resulting in extremely slow running speed when there are many targets.
[0036] SiamFC is a template matching task. Specifically, the SiamFC network structure includes a Siamese network with two inputs: a baseline template (the target template) and candidate samples to be selected. In single-object tracking, the baseline template is the object to be tracked, typically the target object in the first frame of a video sequence. The candidate samples are the search images in each subsequent frame. The Siamese network's task is to find the candidate region in each subsequent frame that is most similar to the template in the first frame; this is the target in that frame, allowing for target tracking. However, because SiamFC does not update the target template and network weights during tracking, significant deformation of the target can cause large differences between the candidate bounding box and the target template, leading to tracking failure. Furthermore, the lack of updated network weights requires using the same network structure and parameters to adapt to all tracking scenarios, which is difficult to achieve.
[0037] UpdateNet improves upon the templates used by SiamFC. Using a convolutional neural network, UpdateNet generates the optimal template for the next frame using previously accumulated templates and the template from the current frame. However, because UpdateNet uses a fixed-rate moving average method to perform simple linear updates to the target template, this approach is often insufficient to handle constantly changing update requirements or cover all possible scenarios. For example, in the event of occlusion, only a portion of the template needs updating, but this does not allow for local updates, potentially leading to serious errors. Furthermore, UpdateNet's over-reliance on the initial template can result in severe drift problems, making it unable to recover from tracking failures.
[0038] The target tracking method of this disclosure can improve the speed and accuracy of target tracking and reduce the number of ID switching of tracking results.
[0039] The target tracking method according to embodiments of this disclosure can be executed by electronic devices such as terminal devices or servers. Terminal devices can be in-vehicle devices, user equipment (UE), mobile devices, user terminals, terminals, cellular phones, cordless phones, personal digital assistants (PDAs), handheld devices, computing devices, wearable devices, etc. The method can be implemented by a processor calling computer-readable program instructions stored in memory. Servers can include independent physical servers, server clusters consisting of multiple servers, or cloud servers capable of cloud computing.
[0040] Figure 2 is a flowchart illustrating a target tracking method provided in an embodiment of this disclosure. Referring to Figure 2, the target tracking method may include the following steps.
[0041] S210, the target detection network performs feature extraction and target detection on the current frame image in the video to be detected, and obtains the features of the predetermined target extracted from the current frame image and the target detection result.
[0042] In this step, the video to be detected may include multiple image frames, and the current frame image can be any frame image after the first frame image.
[0043] In some embodiments, the object detection network is a Convolutional Neural Network (CNN), comprising a feature extraction part and an object detection part. The feature extraction part is used to extract features from a predetermined target (hereinafter referred to as a specific target or target of interest) in the current frame image, obtaining the image feature extraction result of the target. Exemplarily, commonly used image features include at least one of the following features: color features, shape features, spatial features, texture features, and depth features obtained through a convolutional neural network in deep learning. The object detection part is used to perform target detection based on the extracted image features of the target; if the predetermined target is detected, the object detection result of the predetermined target is obtained. Exemplarily, the object detection part can be implemented using a fully convolutional network model from the YOLO series, which can be any YOLO model, such as the YOLO V3 model or the YOLO V5 model.
[0044] In some embodiments, the target detection result in S210 includes the coordinates, confidence score, and class probability of the predetermined object in the image; wherein, the coordinates of the predetermined object in the current frame image are used to indicate the position of the predetermined object in the current frame image, such as: center point coordinates (x, y) and the width and height (w, h) of the bounding box; the confidence score is a value in the range [0, 1], used to indicate the confidence of the coordinates of the predetermined object in the image, that is, the confidence of the positioning of the predetermined object in the current frame image; the class probability is a set of values in the range [0, 1], representing the confidence of each class corresponding to the predetermined object.
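As an illustration of the detection result described above, the following sketch stores the center coordinates, bounding-box size, confidence score, and class probabilities in one record. The class name `Detection` and its methods are hypothetical and chosen for this example only; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detection result for the predetermined target (illustrative layout)."""
    cx: float                 # center x of the bounding box
    cy: float                 # center y of the bounding box
    w: float                  # bounding-box width
    h: float                  # bounding-box height
    confidence: float         # localization confidence in [0, 1]
    class_probs: List[float]  # one value in [0, 1] per class

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if any(not 0.0 <= p <= 1.0 for p in self.class_probs):
            raise ValueError("class probabilities must lie in [0, 1]")

    def corners(self) -> Tuple[float, float, float, float]:
        """Convert (cx, cy, w, h) to corner form (x1, y1, x2, y2)."""
        half_w, half_h = self.w / 2.0, self.h / 2.0
        return (self.cx - half_w, self.cy - half_h,
                self.cx + half_w, self.cy + half_h)

    def best_class(self) -> int:
        """Index of the class with the highest probability."""
        return max(range(len(self.class_probs)), key=self.class_probs.__getitem__)
```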
[0045] S220, based on the target detection results, reverse derivation is performed to obtain the reverse derivation features of the predetermined target in the current frame image.
[0046] In this step, the multidimensional features of the predetermined target can be computed from the target detection results. For example, after features are extracted from the current frame image, the extracted features pass through linear transformation operations (such as dimensionality changes and feature mapping) to produce the detection results. The features corresponding to the target in the feature map of the current frame image can therefore be recovered by reversing these linear transformation operations, and they serve as the reverse derivation features of the predetermined target.
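One way to picture the reverse derivation of a dimensionality change is sketched below. It assumes, purely for illustration, that the detection head flattens an H x W grid of feature vectors in row-major order and that each detection is tied to a single grid cell; neither assumption is stated in this disclosure, and the function names are hypothetical.

```python
def flatten_cells(feature_map):
    """Forward step: flatten an H x W grid of feature vectors in row-major order."""
    return [cell for row in feature_map for cell in row]


def reverse_derive(feature_map, flat_index):
    """Reverse step: map a flat detection index back to its grid cell.

    Returns the (row, col) location and the feature vector stored there.
    """
    width = len(feature_map[0])
    row, col = divmod(flat_index, width)
    return (row, col), feature_map[row][col]
```

Given the index of a detection in the flattened output, `reverse_derive` returns the cell of the feature map that produced it, so the feature vector at that cell can be used as the target's feature.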
[0047] S230, calculate the motion state information of the predetermined target in the previous frame image, and predict the position of the predetermined target in the current frame image based on the motion state information to obtain the predicted position coordinates.
[0048] For example, motion state information includes at least one of motion velocity, acceleration, and angular velocity.
[0049] In some embodiments, based on the features of the predetermined target in frame N-2 and the features in frame N-1, the motion state information in frame N-1 can be calculated. The motion state information in frame N-1 can be used to predict the position of the predetermined target in frame N, thus obtaining the predicted position of the predetermined target in the current frame. Here, N is an integer greater than or equal to 3.
[0050] Taking motion speed as an example of the motion state, the speed of any feature point of the predetermined target in the current frame image can be calculated as follows. First, calculate the distance difference between the position of the feature point in the previous frame image and its position in the current frame image. Second, obtain the speed of the feature point as the ratio of the distance difference to the frame duration. Since the frame duration in this embodiment is a unit duration, the speed of the feature point is determined by the distance difference alone.
[0051] A convolution operation is performed on the features of the feature point in the previous frame and the features of the feature point in the current frame. When a weight used in the convolution operation is negative, the weighted summation in the convolution yields the distance difference of the feature point, from which the velocity of the feature point can be obtained. Furthermore, when at least two previous frames are available, the acceleration of any feature point of the predetermined target can be calculated from the rate of change of velocity with respect to time.
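The role of the negative weight can be illustrated with a small sketch. It assumes a one-dimensional position per frame, a unit frame duration, and fixed weights; in the disclosed network the weights would be learned, so the values (-1, 1) and (1, -2, 1) are assumptions chosen to make the arithmetic visible. The operation is the sliding weighted sum that deep learning frameworks call convolution (strictly, cross-correlation).

```python
def temporal_conv(sequence, weights):
    """Slide `weights` over `sequence` and return the weighted sums (valid mode)."""
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]


# Positions of one feature point in three consecutive frames (illustrative values).
positions = [2.0, 5.0, 9.0]

# Weights (-1, 1): current minus previous, i.e. velocity per unit frame duration.
velocities = temporal_conv(positions, [-1.0, 1.0])

# Weights (1, -2, 1): difference of consecutive velocities, i.e. acceleration.
accelerations = temporal_conv(positions, [1.0, -2.0, 1.0])
```

With these values the velocities are 3.0 and 4.0 and the acceleration is 1.0, which matches the difference of the two velocities.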
[0052] S240, using the calculated motion state of the predetermined target in the current frame image, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image, calculate the similarity of the predetermined target in the current frame image.
[0053] In this step, the reverse derivation features of the predetermined target in at least the previous frame image and the reverse derivation features in the current frame image are used to determine the similarity in appearance features between the predetermined target in the previous frame image and the predetermined target in the current frame image; the motion state of the target in the current frame image is used to determine the similarity in motion state of the predetermined target between the previous frame and the current frame; based on the similarity in appearance features and the similarity in motion state, the similarity information of the target in the current frame image is determined.
[0054] S250, Based on the similarity information of the predetermined target in the current frame image, the predetermined target is tracked to obtain the tracking result of the predetermined target.
[0055] In the target tracking method of this disclosure, the features derived from the target detection results can more accurately represent the features corresponding to the target in the current frame image, making the calculation results of similarity information more accurate. Furthermore, target tracking can be performed based on the similarity information calculated during the detection process, thereby enhancing the target tracking results. Compared with the DeepSort model in the prior art, which uses multiple models (for example, three or four) to perform target tracking, the target tracking method of this disclosure reduces the complexity and the amount of computation of target tracking while improving the accuracy of the tracking results. This improves the speed and accuracy of target tracking and further reduces the number of ID switches.
[0056] In some embodiments, step S210 may specifically include:
[0057] S11, extract the features of the predetermined target from the current frame image through the target detection network.
[0058] S12, based on the pre-stored region-of-interest (ROI) enhanced features of the predetermined target in at least the previous two frames of images, perform ROI enhancement on the features of the predetermined target extracted from the current frame image, and obtain and store the ROI enhanced features of the predetermined target in the current frame image.
[0059] In the embodiments of this disclosure, in machine vision and image processing, the region to be processed is delineated from the frame image being processed using methods such as boxes, circles, ellipses, and irregular polygons, and is called the region of interest (ROI). The ROI is used to determine the position of a predetermined target in a certain frame image. Enhancing the ROI can reduce the positioning error of the ROI and improve the positioning accuracy.
[0060] In this step, assuming the current frame image is the Nth frame image, the two frames preceding it are the (N-2)th frame image and the (N-1)th frame image. Based on the enhanced features of the predetermined target in the (N-2)th frame image and the enhanced features of the predetermined target in the (N-1)th frame image, the motion state of the predetermined target, such as its motion speed and direction in the (N-1)th frame image, can be determined.
[0061] S13, Perform target detection processing based on the enhanced features of the region of interest of the predetermined target in the current frame image, and generate the target detection result of the predetermined target in the current frame image.
[0062] In this embodiment, based on the region-of-interest (ROI) enhanced features of the predetermined target detected in at least the previous two frames, the features of the predetermined target in the current frame are subjected to ROI enhancement processing, and the target detection is performed based on the ROI enhanced features in the current frame, which can improve the accuracy of the detection results.
[0063] In some embodiments, step S12 may specifically include: S21, using the region-of-interest enhanced features of the predetermined target in at least the previous two frames of images as the predetermined features; S22, weighting the features of the predetermined target extracted from the current frame image according to the region-of-interest weights that correspond to the predetermined features and were obtained through pre-training, to obtain the region-of-interest enhanced features of the predetermined target in the current frame image.
[0064] In step S22, the weights of the region of interest can be obtained in advance through training the object detection network.
[0065] In this embodiment, the corresponding region-of-interest weight information is determined from the region-of-interest enhanced features of the predetermined target in at least the previous two frames, the features of the predetermined target extracted from the current frame image are enhanced with these weights, and target detection is performed based on the region-of-interest enhanced features in the current frame, which can improve the localization accuracy of the detection results.
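Steps S21 and S22 can be pictured with the sketch below. The disclosure does not give the exact form of the weighting, so this example assumes a simple weighted blend of the current features with the stored enhanced features of the previous frames; the function name, the flat-list feature layout, and the weight values (stand-ins for pre-trained weights) are all hypothetical.

```python
def roi_enhance(current, previous_enhanced, weights):
    """Blend current-frame features with stored ROI-enhanced features.

    current:            feature values extracted from the current frame
    previous_enhanced:  list of enhanced feature lists, oldest frame first
    weights:            one weight per previous frame, then one for the current
                        frame (stand-ins for weights learned in pre-training)
    """
    if len(weights) != len(previous_enhanced) + 1:
        raise ValueError("need one weight per previous frame plus one for the current frame")
    # Start from the weighted current-frame features.
    enhanced = [weights[-1] * value for value in current]
    # Add the weighted contribution of each previous frame.
    for weight, frame in zip(weights, previous_enhanced):
        enhanced = [acc + weight * value for acc, value in zip(enhanced, frame)]
    return enhanced
```

The returned list would then be stored so that it can serve as a previous-frame input when the next frame is processed.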
[0066] In some embodiments, step S230 may specifically include:
[0067] S31, the region-of-interest enhanced features of the predetermined target in at least two specified frames are combined into features of at least two channels; wherein the at least two specified frames include the previous frame and at least one frame located before the previous frame.
[0068] In this step, the features of each frame of an image correspond to a channel. Feature channels can be synthesized for two frames of images to obtain features of two channels; or feature channels can be synthesized for multiple frames of images to obtain features of multiple channels. Specifically, the frames preceding the current frame can be selected according to actual needs.
[0069] S32, convolve the features of at least two channels to obtain the motion state information of the predetermined target in the previous frame image.
[0070] S33, based on the motion state of the predetermined target in the previous frame image, predict the position coordinates of the predetermined target in the current frame image to obtain the predicted position coordinates.
[0071] In this embodiment, the motion state information of the predetermined target in the previous frame image is calculated through steps S31-S33, thereby predicting the specific position of the predetermined target in the current frame based on the motion state information of the previous frame image, providing a data basis for subsequent comparison of the similarity between the predicted position coordinates and the detected position coordinates.
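Steps S31 to S33 can be sketched end to end as follows. For readability the sketch uses the target's center coordinates in each frame as the per-frame feature and fixed weights (-1, 1) across the frame channels; in the disclosure the channels hold region-of-interest enhanced features and the convolution weights belong to the network, so both choices are assumptions, as are the function names.

```python
def stack_channels(*frames):
    """S31: place the feature of each specified frame in its own channel."""
    return [tuple(frame) for frame in frames]


def motion_state(channels, weights):
    """S32: weighted sum across the frame channels for every feature component."""
    size = len(channels[0])
    return tuple(
        sum(w * channel[i] for w, channel in zip(weights, channels))
        for i in range(size)
    )


def predict_position(last_position, velocity):
    """S33: advance the last known position by one unit frame duration."""
    return tuple(p + v for p, v in zip(last_position, velocity))


# Centers of the target in frames N-2 and N-1 (illustrative values).
channels = stack_channels((10.0, 20.0), (13.0, 24.0))
velocity = motion_state(channels, (-1.0, 1.0))
predicted = predict_position(channels[-1], velocity)
```

Here the target moved by (3.0, 4.0) between the two frames, so the predicted center in frame N is (16.0, 28.0).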
[0072] In some embodiments, step S240 may specifically include:
[0073] S41, the position coordinates obtained from the target detection results are used as the detected position coordinates, and the similarity between the detected position coordinates and the predicted position coordinates is calculated to obtain position similarity information.
[0074] In this step, positional similarity can be calculated using spatial distance. Positional similarity is inversely related to spatial distance: the smaller the spatial distance, the higher the positional similarity. For example, if the spatial distance is less than or equal to a predetermined value, the detected position coordinates and the predicted position coordinates can be considered to be the same location. That is, the predicted position coordinates of the predetermined target in the current frame can be considered to be the same as the detected position coordinates of the predetermined target in the current frame.
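A minimal sketch of a distance-based positional similarity follows. The mapping 1 / (1 + d) is an assumption: the disclosure only requires that similarity fall as distance grows, and any decreasing function would serve. The function names and the `max_distance` parameter are hypothetical.

```python
import math


def position_similarity(detected, predicted):
    """Similarity in (0, 1] that decreases as the Euclidean distance grows."""
    distance = math.hypot(detected[0] - predicted[0], detected[1] - predicted[1])
    return 1.0 / (1.0 + distance)


def same_location(detected, predicted, max_distance):
    """Treat the two positions as one location when they are close enough."""
    distance = math.hypot(detected[0] - predicted[0], detected[1] - predicted[1])
    return distance <= max_distance
```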
[0075] S42, calculate the similarity between the reverse derivation features of the predetermined target in at least the previous frame image and the reverse derivation features of the predetermined target in the current frame image to obtain feature similarity information.
[0076] In some embodiments, step S42 may specifically include: combining the reverse derivation features of the predetermined target in at least the previous frame image into a convolution kernel, convolving the reverse derivation features of the predetermined target in the current frame image with this kernel, and using the result as the feature similarity information of the predetermined target corresponding to the current frame image.
[0077] In this embodiment, features from the previous frame, or features from multiple previous frames, are used as convolution kernels to perform convolution operations on features in the current frame. This convolution operation is equivalent to calculating the cosine distance between the two, thereby obtaining feature similarity information.
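The link between the convolution and a cosine measure can be sketched as follows. When the kernel covers the whole input, the convolution reduces to a single dot product, and dividing by the two vector norms gives the cosine similarity (from which a cosine distance follows as one minus this value). The explicit normalization and the flat-list feature layout are assumptions of this sketch, and the function name is hypothetical.

```python
import math


def cosine_similarity(kernel, feature):
    """Dot product of the previous-frame kernel and the current feature,
    scaled by both vector norms. Returns 0.0 if either vector is all zeros."""
    dot = sum(k * f for k, f in zip(kernel, feature))
    norm = math.sqrt(sum(k * k for k in kernel)) * math.sqrt(sum(f * f for f in feature))
    return dot / norm if norm > 0 else 0.0
```

A value near 1 means the appearance features in the two frames point in the same direction, and a value near 0 means they are unrelated.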
[0078] S43, a weighted summation operation is performed based on positional similarity information and feature similarity information to obtain the similarity between the prediction result and the detection result of the predetermined target in the current frame image.
[0079] In some embodiments, the calculated similarity information can be normalized, and the normalized similarity can be used as the final similarity information.
[0080] In some embodiments, when performing a weighted summation operation, the weights corresponding to the position similarity information and the feature similarity information can be customized according to the actual situation.
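The weighted summation of step S43 and the optional normalization can be sketched as below. The default weights of 0.5 and the sum-to-one normalization across candidates are assumptions; the disclosure leaves both to the implementer. The function names are hypothetical.

```python
def combined_similarity(position_sim, feature_sim, w_position=0.5, w_feature=0.5):
    """Weighted sum of positional and feature similarity (weights are user-chosen)."""
    return w_position * position_sim + w_feature * feature_sim


def normalize(scores):
    """Scale a list of non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        return [0.0 for _ in scores]
    return [score / total for score in scores]
```

For example, raising `w_feature` relative to `w_position` makes the match depend more on appearance, which may help when motion is erratic.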
[0081] In this embodiment, the similarity between the prediction result and the detection result of the predetermined target in the current frame can be compared based on the position information corresponding to the motion state and the feature information corresponding to the appearance. The final detection result can then be matched accordingly based on the similarity comparison result, thereby improving the accuracy and reliability of the final detection result.
[0082] In some embodiments, if no corresponding target is detected in the current frame within a predetermined image range (which can be defined in advance), the predicted target of the current frame can be obtained from the motion state information (e.g., velocity) of the predetermined target in the previous frame image and used as the detected target in the current frame. The position similarity and feature similarity comparisons described above are then performed. This allows missed detections to be handled flexibly, so that the method can cope with more complex applications and be applied in more scenarios.
[0083] In some embodiments, step S250 may specifically include:
[0084] S51, the position coordinates obtained from the target detection result are used as the detection position coordinates, and the detection position coordinates are transformed according to the similarity information to obtain the transformed position coordinates.
[0085] In this step, the transformation based on similarity information includes: multiplying the features of each channel corresponding to the position coordinates in the detection result with the normalized similarity information, and adding the features of each channel obtained by the multiplication operation to obtain the transformed position coordinates.
[0086] S52, normalize the calculated similarity information to obtain the similarity value.
[0087] S53, if the similarity value is greater than or equal to a predetermined threshold, then obtain the tracking identifier of the predetermined target in the previous frame and use it as the tracking identifier of the predetermined target in the current frame.
[0088] S54. If the similarity value is less than the predetermined threshold, then the tracking identifier of the predetermined target is reassigned in the current frame.
[0089] Through the above steps S51-S54, IDs can be distributed based on similarity information. If the similarity reaches the threshold, there is no need to reassign IDs.
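Steps S52 to S54 amount to a threshold test on the normalized similarity value. The sketch below is illustrative only: the min-max normalization, the default threshold of 0.5, and the `IdAllocator` helper are assumptions, since the disclosure does not specify them.

```python
import itertools

class IdAllocator:
    """Hands out fresh tracking identifiers (hypothetical helper)."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def new_id(self):
        return next(self._counter)

def normalize(score, lo, hi):
    """Min-max normalize a raw similarity score into [0, 1] (step S52)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))

def assign_track_id(similarity_value, prev_track_id, allocator, threshold=0.5):
    """Keep the previous ID if similarity >= threshold (S53), else reassign (S54)."""
    if similarity_value >= threshold:
        return prev_track_id
    return allocator.new_id()
```

Because the comparison is "greater than or equal to", a similarity value exactly at the threshold keeps the previous identifier.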
[0090] According to the target tracking method of this disclosure, the features of a predetermined target extracted in the current frame image can be enhanced based on the enhanced features of the regions of interest (ROIs) of the target detected in the previous two frames, thereby enhancing the current detection process with the prior tracking results. Furthermore, target tracking can be performed based on similarity information calculated during the detection process, enhancing the target tracking results with the detection process, thus achieving bidirectional enhancement of both detection and target tracking. Moreover, the features derived from the target detection results more accurately represent the features corresponding to the target in the current frame image, making the calculation results of the similarity information more accurate.
[0091] Compared with the DeepSORT model in the prior art, which coordinates multiple models (for example, three or four) to perform target tracking, the target tracking method of this disclosure can improve the accuracy of the tracking results while reducing the complexity and computational load of target tracking, thereby improving tracking speed and accuracy and reducing the number of ID switches.
[0092] Figure 3 shows a flowchart of a target tracking method according to an exemplary embodiment of this disclosure. As shown in Figure 3, the target tracking method includes the following steps:
[0093] S301, as shown in the "Video Frames" block of Figure 3, the original video frame to be detected is input.
[0094] S302, as shown in the "Feature Extraction and Feature Detection" block of Figure 3, the video frame is input into the target detection network to obtain the targets detected in the current frame.
[0095] In this step, the target detection network includes a feature extraction model and an open-source YOLOv5 model. The YOLOv5 model processes the features extracted by the feature extraction model to detect the target.
[0096] S303, as shown in the "Feature Reverse Derivation" block of Figure 3, the multidimensional features of a specific target in the current frame are calculated based on the detection results.
[0097] In this step, the reverse derivation features of the target in the current frame image can be calculated based on the detection results of the current frame, and used as the reverse-calculated features of the current frame.
[0098] S304, as shown in the "Region of Interest Weight Enhancement" block of Figure 3, the features of the current frame are enhanced based on the region-of-interest-enhanced target features of the two previous frames.
[0099] In some embodiments, step S304 can be implemented by the feature enhancement layer of the target detection network. Specifically, the feature enhancement layer can perform the corresponding steps of steps S210 and S12-S13 described above.
[0100] S305, as shown in the "Multidimensional Features (Previous Frame)" block of Figure 3, the multidimensional features of a specific target in the frame before the current frame are calculated in reverse.
[0101] S306, as shown in the "Multidimensional Features (Next Frame)" block of Figure 3, the multidimensional features of a specific target in the frame after the current frame are calculated in reverse.
[0102] The inverse calculation operations in steps S305 and S306 have the same meaning as the inverse calculation operation in step S303.
[0103] S307, as shown in the "Convolution (Velocity)" block of Figure 3, the features of the two frames before the current frame are combined into two channels, and the features of these two channels are convolved to obtain the motion state information of the target.
[0104] In this step, the features of the two channels can be convolved, or multiple channels can be synthesized based on the target features from multiple frames of images before the current frame, and the features of these multiple channels can be convolved to obtain the target's motion state information, such as the target's speed.
[0105] In some embodiments, step S307 can be implemented through the first convolutional layer of the object detection network.
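As a rough illustration of step S307, stacking per-frame features as channels and convolving across the channel dimension reduces, in the simplest case, to a weighted sum of the frames. In the sketch below the per-frame feature is just the target center (x, y), and the kernel weights (-1, +1) are hand-picked so that the output is the displacement per frame; both choices are assumptions, since in the disclosure the convolution weights belong to a trained layer of the network.

```python
def channel_convolution(channels, weights):
    """Weighted sum across channels at each feature position.

    `channels` is a list of equal-length feature vectors (one per frame);
    `weights` holds one kernel weight per channel.
    """
    if len(channels) != len(weights):
        raise ValueError("one weight per channel is required")
    length = len(channels[0])
    return [sum(w * ch[i] for w, ch in zip(weights, channels))
            for i in range(length)]

# Features of the two frames before the current frame (oldest first).
frame_t_minus_2 = [10.0, 20.0]   # target center (x, y), illustrative values
frame_t_minus_1 = [13.0, 18.0]

# Kernel (-1, +1) yields the displacement per frame, i.e. a velocity estimate.
velocity = channel_convolution([frame_t_minus_2, frame_t_minus_1], [-1.0, 1.0])
```

With more than two frames, additional channels and weights can be supplied in the same way, which corresponds to the multi-frame variant mentioned in this step.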
[0106] S308, as shown in the "Multidimensional Features (Current Frame)" block of Figure 3, the multidimensional features of a specific target in the current frame are obtained.
[0107] S309, as shown in the "Convolution (Similarity)" block of Figure 3, the features of the previous frame or of multiple previous frames are combined into a convolution kernel, and the multidimensional features of the current frame are convolved with this kernel to calculate the similarity information.
[0108] In some embodiments, step S309 can be implemented through the second convolutional layer of the object detection network.
[0109] In this step, the calculation of similarity information can refer to step S240 above, and will not be repeated in this embodiment.
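The relationship between this convolution and cosine similarity can be sketched in one dimension. The example assumes that both the kernel (previous-frame features) and each window of the current-frame features are L2-normalized; under that assumption the sliding dot product equals the cosine similarity at each position. As in most neural network libraries, the "convolution" here is a cross-correlation (the kernel is not flipped). The function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)

def correlate_similarity(current, kernel):
    """Slide the previous-frame kernel over the current-frame features.

    Returns one cosine similarity value per window position.
    """
    k = l2_normalize(kernel)
    n = len(kernel)
    sims = []
    for start in range(len(current) - n + 1):
        window = l2_normalize(current[start:start + n])
        sims.append(sum(a * b for a, b in zip(window, k)))
    return sims

# Illustrative values: the previous-frame pattern reappears at offset 2.
previous_features = [3.0, 4.0]
current_features = [0.0, 0.0, 3.0, 4.0, 0.0]
similarities = correlate_similarity(current_features, previous_features)
best_offset = max(range(len(similarities)), key=similarities.__getitem__)
```

The position with the highest response is where the current frame most resembles the stored features of the previous frame.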
[0110] S310, as shown in the "fully connected (1*1 convolution)" block of Figure 3, the similarity information is transformed to output the similarity and coordinate information.
[0111] In this step, for the specific processing that transforms the similarity information and outputs the similarity and coordinate information, refer to steps S250 and S51-S54 above; it is not repeated in this embodiment.
[0112] In some embodiments, step S310 can be implemented by a fully connected layer, which includes a 1*1 convolutional kernel.
[0113] The position coordinates obtained from the target detection results are used as the detection position coordinates. The detection position coordinates are then transformed based on similarity information to obtain the transformed position coordinates.
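One possible reading of this transformation, following the description of step S51, is a similarity-weighted sum over channels. The sketch below assumes that each channel contributes one candidate (x, y) coordinate and that the similarity values are normalized by dividing by their sum; neither detail is stated in the disclosure, and the function names are hypothetical.

```python
def normalize_weights(similarities):
    """Normalize non-negative similarity values so that they sum to 1."""
    total = sum(similarities)
    if total <= 0.0:
        raise ValueError("similarities must have a positive sum")
    return [s / total for s in similarities]

def transform_coordinates(channel_coords, similarities):
    """Multiply each channel's coordinates by its normalized similarity and add."""
    weights = normalize_weights(similarities)
    x = sum(w * c[0] for w, c in zip(weights, channel_coords))
    y = sum(w * c[1] for w, c in zip(weights, channel_coords))
    return (x, y)

# Two channels with illustrative coordinates; the second is three times as similar.
transformed = transform_coordinates([(10.0, 10.0), (20.0, 30.0)], [1.0, 3.0])
```

A per-position weighted sum across channels of this kind is also what a 1*1 convolution computes, which matches the layer named in step S310.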
[0114] S311, as shown in the "Identifier Distribution" block of Figure 3, IDs are assigned based on similarity. If the similarity reaches the threshold, the ID is not reassigned.
[0115] In this step, IDs need to be reassigned only if the similarity does not reach the threshold.
[0116] According to the target tracking method of embodiments of this disclosure, the target detection model includes a feature extraction layer, a target detection layer, a feature enhancement layer, a first convolutional layer, a second convolutional layer, and a fully connected layer. According to this method, based on the enhanced features of the regions of interest (ROIs) of the target detected in the previous two frames of the current frame, the features of the predetermined target extracted in the current frame image are enhanced using ROIs, thereby enhancing the current detection process with the previous tracking results. Furthermore, target tracking can be performed based on similarity information calculated during the detection process, enhancing the target tracking results with the detection process, thus achieving bidirectional enhancement of both detection and target tracking. Moreover, the features derived from the target detection results more accurately represent the features corresponding to the target in the current frame image, making the calculation results of the similarity information more accurate.
[0117] In some scenarios, the target tracking method of this disclosure can detect targets in images and temporarily store their features; estimate the motion state of the target based on the target detection results; and reverse-derive the temporarily stored features based on the target detection results. Based on the stored features and the motion state estimation results, optimal matching is performed through heuristic search to obtain the best output result. Compared with the DeepSORT model in the prior art, which uses multiple models (for example, three or four) to perform target tracking, the method of this disclosure improves the accuracy of the tracking results while reducing the complexity and computational load of target tracking. This helps to reduce target tracking latency, yields a more accurate target detection and tracking model, improves the speed and accuracy of target tracking, and reduces the number of ID switches.
[0118] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.
[0119] In addition, this disclosure also provides a target tracking device, an electronic device, and a computer-readable storage medium, all of which can be used to implement any of the target tracking methods provided in this disclosure. The corresponding technical solutions and descriptions are described in the relevant section on methods and will not be repeated here.
[0120] Figure 4 is a block diagram of a target tracking device provided in an embodiment of the present disclosure. Referring to Figure 4, this disclosure provides a target tracking device, which may include the following modules:
[0121] The target detection module 410 is used to extract features and detect targets in the current frame image of the video to be detected through the target detection network, so as to obtain the features of the predetermined target extracted from the current frame image and the target detection result;
[0122] The feature calculation module 420 is used to deduce the reverse derivation features of the predetermined target in the current frame image based on the target detection results.
[0123] The position prediction module 430 is used to calculate the motion state information of the previous frame image of the predetermined target, and predict the position of the predetermined target in the current frame image based on the motion state information to obtain the predicted position coordinates.
[0124] The similarity calculation module 440 is used to calculate the similarity information of the predetermined target in the current frame image based on the position coordinates obtained from the target detection result, the predicted position coordinates, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image.
[0125] The tracking result determination module 450 is used to perform target tracking based on the calculated similarity information and obtain the tracking result of the predetermined target.
[0126] In some embodiments, the target detection module 410 is specifically configured to extract features of a predetermined target from the current frame image through a target detection network; enhance the features of the predetermined target extracted from the current frame image according to the enhanced features of the region of interest of the predetermined target in at least the previous two frames image, and obtain and store the enhanced features of the region of interest of the predetermined target in the current frame image; perform target detection processing according to the enhanced features of the region of interest of the predetermined target in the current frame image, and generate a target detection result of the predetermined target in the current frame image.
[0127] In some embodiments, when enhancing the features of the predetermined target extracted from the current frame image based on the pre-stored region-of-interest-enhanced features of the predetermined target in at least the previous two frames of images, the target detection module 410 is specifically used to: use the region-of-interest-enhanced features of the predetermined target in at least the previous two frames of images as predetermined features; and weight the features of the predetermined target extracted from the current frame image according to the pre-trained region of interest weights corresponding to the predetermined features, to obtain the region-of-interest-enhanced features of the predetermined target in the current frame image.
[0128] In some embodiments, the position prediction module 430 is specifically used to: synthesize the region-of-interest-enhanced features of the predetermined target in at least two specified frames into features of at least two channels, wherein the at least two specified frames include the previous frame and at least one frame preceding the previous frame; convolve the features of the at least two channels to obtain the motion state information of the predetermined target in the previous frame; and predict the position coordinates of the predetermined target in the current frame based on the motion state of the predetermined target in the previous frame to obtain the predicted position coordinates.
[0129] In some embodiments, the similarity calculation module 440 is specifically used to: use the position coordinates obtained from the target detection result as the detection position coordinates; calculate the similarity between the detection position coordinates and the predicted position coordinates to obtain position similarity information; calculate the feature similarity of the reverse derivation features of the predetermined target in the current frame image, based on the reverse derivation features of the predetermined target in at least the previous frame image, to obtain feature similarity information; and perform a weighted summation operation based on the position similarity information and the feature similarity information to obtain the similarity information between the prediction result and the detection result of the predetermined target in the current frame image.
[0130] In some embodiments, when calculating the feature similarity of the reverse derivation features of the predetermined target in the current frame image based on the reverse derivation features of the predetermined target in at least the previous frame image, the similarity calculation module 440 is specifically used to: combine the reverse derivation features of the predetermined target in at least the previous frame image into a convolution kernel; convolve the reverse derivation features of the predetermined target in the current frame image with this kernel to calculate feature similarity; and use the calculated similarity as the feature similarity information of the predetermined target corresponding to the current frame image.
[0131] In some embodiments, the tracking result determination module 450 is specifically used to: use the position coordinates obtained from the target detection result as the detection position coordinates, transform the detection position coordinates according to the similarity information to obtain the transformed position coordinates; normalize the calculated similarity information to obtain a similarity value; if the similarity value is greater than or equal to a predetermined threshold, obtain the tracking identifier corresponding to the predetermined target in the previous frame as the tracking identifier of the predetermined target in the current frame; if the similarity value is less than the predetermined threshold, reassign the tracking identifier of the predetermined target in the current frame.
[0132] According to the target tracking apparatus of this application, the features of a predetermined target extracted from the current frame image can be enhanced based on the enhanced features of the regions of interest (ROIs) of the target detected in the previous two frames, thereby enhancing the current detection process with the prior tracking results. Furthermore, target tracking can be performed based on similarity information calculated during the detection process, enhancing the target tracking results with the detection process, thus achieving bidirectional enhancement of both detection and target tracking. Moreover, the features derived from the target detection results more accurately represent the features corresponding to the target in the current frame image, making the calculation results of the similarity information more accurate.
[0133] It should be clarified that the present invention is not limited to the specific configurations and processes described in the above embodiments and shown in the figures. For the sake of convenience and brevity, detailed descriptions of known methods are omitted here, and the specific working processes of the systems, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, which will not be repeated here.
[0134] Figure 5 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
[0135] Referring to Figure 5, this disclosure provides an electronic device, which includes: at least one processor 501; at least one memory 502; and one or more I/O interfaces 503 connected between the processor 501 and the memory 502; wherein the memory 502 stores one or more computer programs that can be executed by the at least one processor 501, and the one or more computer programs are executed by the at least one processor 501 to enable the at least one processor 501 to perform the target tracking method described above.
[0136] This disclosure also provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor or processor core, implements the target tracking method described above. The computer-readable storage medium may be volatile or non-volatile.
[0137] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in the processor of an electronic device, the processor in the electronic device executes the above-described target tracking method.
[0138] Those skilled in the art will understand that all or some of the steps, systems, and apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software can be distributed on a computer-readable storage medium, which may include computer storage media (or non-transitory media) and communication media (or transient media).
[0139] As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable program instructions, data structures, program modules, or other data). Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, it is known to those skilled in the art that communication media typically contain computer-readable program instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0140] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0141] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the status information of the computer-readable program instructions to implement various aspects of this disclosure.
[0142] The computer program product described herein can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.
[0143] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
[0144] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0145] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.
[0146] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0147] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for illustrative purposes only and should be construed as such, and is not intended to be limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in connection with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in connection with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this disclosure as set forth by the appended claims.
Claims
1. A target tracking method, characterized in that it includes: Feature extraction and target detection are performed on the current frame image of the video to be detected through a target detection network, to obtain the features of the predetermined target extracted from the current frame image and the target detection result. Based on the target detection result, the reverse derivation features of the predetermined target in the current frame image are obtained; the reverse derivation features are obtained, after feature extraction of the current frame image, by performing a mathematical operation on the extracted features through a linear transformation. The motion state information of the predetermined target in the previous frame image is calculated, and the position of the predetermined target in the current frame image is predicted based on the motion state information to obtain the predicted position coordinates, including: synthesizing the region-of-interest-enhanced features of the predetermined target in at least two specified frames into features of at least two channels, wherein the at least two specified frames include the previous frame image and at least one frame image preceding the previous frame image; convolving the features of the at least two channels to obtain the motion state information of the predetermined target in the previous frame image; and predicting the position coordinates of the predetermined target in the current frame image based on the motion state of the predetermined target in the previous frame image to obtain the predicted position coordinates.
Based on the position coordinates obtained from the target detection result, the predicted position coordinates, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image, the similarity information of the predetermined target in the current frame image is calculated, including: using the position coordinates obtained from the target detection result as the detection position coordinates, and calculating the similarity between the detection position coordinates and the predicted position coordinates to obtain position similarity information; calculating feature similarity based on the reverse derivation features of the predetermined target in at least the previous frame image to obtain feature similarity information; and performing a weighted summation operation based on the position similarity information and the feature similarity information to obtain the similarity information between the prediction result and the detection result of the predetermined target in the current frame image. Target tracking is performed based on the calculated similarity information to obtain the tracking result of the predetermined target.
2. The method according to claim 1, characterized in that the step of using a target detection network to extract features and detect targets in the current frame image of the video to be detected, and obtaining the features of the predetermined target extracted from the current frame image and the target detection result, includes: The features of the predetermined target are extracted from the current frame image using the target detection network. Based on the pre-stored region-of-interest-enhanced features of the predetermined target in at least the previous two frames of images, region of interest enhancement is performed on the features of the predetermined target extracted from the current frame image, and the region-of-interest-enhanced features of the predetermined target in the current frame image are obtained and stored. Target detection processing is performed based on the region-of-interest-enhanced features of the predetermined target in the current frame image to generate the target detection result of the predetermined target in the current frame image.
3. The method according to claim 2, characterized in that the step of enhancing the features of the predetermined target extracted from the current frame image based on the region-of-interest-enhanced features of the predetermined target in at least the previous two frames of images, and obtaining and storing the region-of-interest-enhanced features of the predetermined target in the current frame image, includes: The region-of-interest-enhanced features of the predetermined target in at least the previous two frames of images are used as the predetermined features; the features of the predetermined target extracted from the current frame image are weighted according to the pre-trained region of interest weights corresponding to the predetermined features, to obtain the region-of-interest-enhanced features of the predetermined target in the current frame image.
4. The method according to claim 1, characterized in that the step of calculating feature similarity for the reverse derivation features of the predetermined target in the current frame image, based on the reverse derivation features of the predetermined target in at least the previous frame image, to obtain feature similarity information, includes: combining the reverse derivation features of the predetermined target in at least the previous frame image into a convolution kernel; performing the feature similarity calculation on the reverse derivation features of the predetermined target in the current frame image using the convolution kernel; and using the calculated similarity information as the feature similarity information between the predetermined target and the current frame image.
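The kernel-based similarity calculation can be sketched as follows, under stated assumptions: the earlier-frame features are combined into a kernel by averaging, the kernel is slid over the current-frame feature map as a cross-correlation (the operation that deep learning frameworks usually call convolution), and the peak response is taken as the similarity. None of these choices is specified by the claim, and the names `build_kernel`, `correlate_valid`, and `feature_similarity` are hypothetical.

```python
def build_kernel(prev_feature_maps):
    """Average same-sized 2-D feature maps from earlier frames into one kernel (assumption)."""
    count = len(prev_feature_maps)
    rows = len(prev_feature_maps[0])
    cols = len(prev_feature_maps[0][0])
    return [[sum(m[r][c] for m in prev_feature_maps) / count for c in range(cols)]
            for r in range(rows)]


def correlate_valid(feature_map, kernel):
    """Slide the kernel over the feature map without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    response = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(feature_map[i + r][j + c] * kernel[r][c]
                           for r in range(kh) for c in range(kw)))
        response.append(row)
    return response


def feature_similarity(prev_feature_maps, curr_feature_map):
    """Peak correlation response, used here as the feature similarity (assumption)."""
    kernel = build_kernel(prev_feature_maps)
    response = correlate_valid(curr_feature_map, kernel)
    return max(max(row) for row in response)
```

The location of the peak in the response map also indicates where the current-frame features best match the earlier-frame features, which is why correlation is a common choice for this kind of matching.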
5. The method according to claim 1, characterized in that the step of performing target tracking based on the calculated similarity information to obtain the tracking result of the predetermined target includes: using the position coordinates obtained from the target detection result as the detected position coordinates, and transforming the detected position coordinates according to the similarity information to obtain transformed position coordinates; normalizing the calculated similarity information to obtain a similarity value; if the similarity value is greater than or equal to a predetermined threshold, obtaining the tracking identifier of the predetermined target in the previous frame and using it as the tracking identifier of the predetermined target in the current frame; and if the similarity value is less than the predetermined threshold, assigning a new tracking identifier to the predetermined target in the current frame.
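The normalization and threshold decision in this claim can be sketched as below. The claim does not name the normalization function or the threshold value, so the sketch assumes a sigmoid and a default threshold of 0.5; the coordinate transformation step is not covered. The names `normalize` and `assign_track_id` and the `new_ids` generator are hypothetical.

```python
import math
from itertools import count


def normalize(raw_similarity):
    """Map a raw similarity score into (0, 1) with a sigmoid (assumed normalization)."""
    return 1.0 / (1.0 + math.exp(-raw_similarity))


def assign_track_id(raw_similarity, previous_id, new_ids, threshold=0.5):
    """Keep the previous tracking identifier or issue a new one.

    new_ids: any iterator that yields unused identifiers (hypothetical helper).
    threshold: illustrative value, not taken from the patent.
    """
    similarity_value = normalize(raw_similarity)
    if similarity_value >= threshold:
        return previous_id   # same target as in the previous frame
    return next(new_ids)     # treated as a different target


# Usage: a high score keeps identifier 7, a low score draws a fresh identifier.
ids = count(100)
kept = assign_track_id(2.0, previous_id=7, new_ids=ids, threshold=0.6)
fresh = assign_track_id(-2.0, previous_id=7, new_ids=ids, threshold=0.6)
```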
6. A target tracking device, characterized in that it comprises: a target detection module, used to extract features and detect targets in the current frame image of the video to be detected through a target detection network, so as to obtain the features of the predetermined target extracted from the current frame image and the target detection results; a feature calculation module, used to obtain the reverse derivation features of the predetermined target in the current frame image based on the target detection results, wherein the reverse derivation features are obtained by performing mathematical operations, through linear transformation, on the features extracted from the current frame image; and a position prediction module, used to calculate the motion state information of the predetermined target in the previous frame image, and to predict the position of the predetermined target in the current frame image based on the motion state information to obtain the predicted position coordinates, which includes: synthesizing the region-of-interest enhanced features of the predetermined target in at least two specified frames of images into features of at least two channels, wherein the at least two specified frames include the previous frame image and at least one frame image preceding the previous frame image; convolving the features of the at least two channels to obtain the motion state information of the predetermined target in the previous frame image; and predicting the position coordinates of the predetermined target in the current frame image based on the motion state information of the predetermined target in the previous frame image to obtain the predicted position coordinates.
The similarity calculation module is used to calculate the similarity information of the predetermined target in the current frame image based on the position coordinates obtained from the target detection result, the predicted position coordinates, the reverse derivation features of the predetermined target in at least the previous frame image, and the reverse derivation features in the current frame image, which includes: using the position coordinates obtained from the target detection result as the detected position coordinates, and calculating the similarity between the detected position coordinates and the predicted position coordinates to obtain position similarity information; calculating feature similarity by performing a weighted summation on the reverse derivation features of the predetermined target in the current frame image, based on the reverse derivation features of the predetermined target in at least the previous frame image, to obtain feature similarity information; and performing a weighted summation of the position similarity information and the feature similarity information to obtain the similarity information between the predicted result and the detected result of the predetermined target in the current frame image. The tracking result determination module is used to perform target tracking based on the calculated similarity information and obtain the tracking result of the predetermined target.
7. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs, when executed by the at least one processor, enable the at least one processor to perform the target tracking method according to any one of claims 1-5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the target tracking method according to any one of claims 1-5.