Target object tracking method and device, electronic equipment and storage medium

By using a cropped, lightweight detection model for target tracking in short video shooting, the problem of high computational complexity is solved, and stable tracking is achieved on devices with weak computing performance.

CN112102364B · Active · Publication Date: 2026-06-16 · GUANGZHOU FANGSI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GUANGZHOU FANGSI INFORMATION TECH CO LTD
Filing Date: 2020-09-22
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing hand or face tracking models involve complex calculations and require a huge amount of computation, making them difficult to apply to devices with weak computing power, which leads to easy loss of tracking.

Method used

A cropped lightweight detection model is used for target tracking. The target detection box is obtained from the specified frame image, then the intersection over union (IoU) is calculated and smoothing is performed, which reduces the amount of computation and ensures the continuity of tracking.

Benefits of technology

It reduces the computational complexity of the target tracking bounding box prediction process, ensures the continuity and accuracy of tracking, and is suitable for devices with weaker computing performance.


Smart Images

  • Figure CN112102364B_ABST
Patent Text Reader

Abstract

The application discloses a target tracking method and device, electronic equipment and storage medium. The method comprises the following steps: obtaining a first target detection box from a first designated frame image of a to-be-processed video image; obtaining a to-be-tracked target image corresponding to the first to-be-tracked video image based on the first target detection box; inputting the to-be-tracked target image into a designated detection model; obtaining a target tracking box output by the designated detection model; obtaining a second target detection box from a second designated frame image; obtaining an intersection-over-union of the second target detection box and at least one target tracking box; performing smoothing processing on the second target detection box based on the first target tracking box corresponding to the intersection-over-union with a value greater than or equal to a preset threshold, to obtain a reference target detection box; and tracking the second to-be-tracked video image corresponding to the first target tracking box based on the reference target detection box. The application reduces the calculation complexity of the target tracking box in the tracking process and improves the continuity of tracking.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and more specifically, to a target tracking method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Short videos, also known as short films, are a form of internet content dissemination, generally consisting of videos under 5 minutes in length and shared on new media platforms. With the widespread adoption of mobile devices and faster internet speeds, short, fast-paced, high-traffic content has gained favor with major platforms, fans, and investors. To enhance the entertainment value of short videos, special effects can be added during filming, such as those controlled by hands or faces. To ensure users can accurately control these effects with their hands or faces, the user's hands or face and their changes must be tracked during filming. However, existing hand or face tracking models are computationally complex and require a large amount of computation, which easily leads to loss of tracking and makes them difficult to apply to devices with limited computing power.

Summary of the Invention

[0003] In view of the above problems, this application proposes a target tracking method, apparatus, electronic device and storage medium to address them.

[0004] In a first aspect, embodiments of this application provide a target object tracking method applied to an electronic device. The method includes: obtaining a first target object detection box from a first designated frame image of a video image to be processed; obtaining a target object image to be tracked corresponding to a first video image to be tracked based on the first target object detection box, wherein the first video image to be tracked is an image following the first designated frame image; inputting the target object image to be tracked into a designated detection model to obtain a target tracking box output by the designated detection model, wherein the target tracking box includes at least one target object tracking box, and the designated detection model is a lightweight detection model obtained by cropping; obtaining a second target object detection box from a second designated frame image, wherein the second designated frame image is the next frame image adjacent to the last frame image in the first video image to be tracked; obtaining the intersection over union (IoU) of the second target object detection box and the at least one target object tracking box; smoothing the second target object detection box based on the first target object tracking box corresponding to an IoU value greater than or equal to a preset threshold to obtain a reference target object detection box; and tracking a second video image to be tracked corresponding to the first target object tracking box based on the reference target object detection box, wherein the second video image to be tracked is an image following the second designated frame image.

[0005] In a second aspect, embodiments of this application provide a target object tracking device, operating in an electronic device, the device comprising: a first acquisition module, configured to acquire a first target object detection box from a first specified frame image of a video image to be processed; a second acquisition module, configured to acquire a target object image to be tracked corresponding to a first video image to be tracked based on the first target object detection box, wherein the first video image to be tracked is an image following the first specified frame image; a third acquisition module, configured to input the target object image to be tracked into a specified detection model and acquire a target tracking box output by the specified detection model, wherein the target tracking box includes at least one target object tracking box, and the specified detection model is a lightweight detection model obtained by cropping; a fourth acquisition module, configured to acquire a second target object detection box from a second specified frame image, wherein the second specified frame image is the next frame image adjacent to the last frame image in the first video image to be tracked; a fifth acquisition module, configured to acquire the intersection over union (IoU) of the second target object detection box and the at least one target object tracking box; a processing module, configured to smooth the second target object detection box based on a first target object tracking box corresponding to an IoU value greater than or equal to a preset threshold, thereby obtaining a reference target object detection box; and a tracking module, configured to track a second video image to be tracked corresponding to the first target object tracking box based on the reference target object detection box, wherein the second video image to be tracked is an image following the second specified frame image.

[0006] In a third aspect, embodiments of this application provide an electronic device, including a memory and one or more processors; one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to perform the method described in the first aspect above.

[0007] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing program code, wherein the method described in the first aspect is executed when the program code is run by a processor.

[0008] This application provides a target object tracking method, apparatus, electronic device, and storage medium. The method obtains a first target object detection box from a first specified frame image of a video image to be processed, then obtains a target object image to be tracked corresponding to the first video image to be tracked based on the first target object detection box. The target object image is input into a specified detection model to obtain the target tracking box output by that model. A second target object detection box is then obtained from a second specified frame image, and the intersection over union (IoU) of the second target object detection box with at least one target object tracking box is obtained. The second target object detection box is smoothed based on the first target object tracking box corresponding to an IoU value greater than or equal to a preset threshold, which yields a reference target object detection box. Finally, the second video image to be tracked corresponding to the first target object tracking box is tracked based on the reference target object detection box. This method uses a lightweight, cropped detection model to detect target object tracking boxes in the target object image to be tracked, and uses the detection result as the prediction result for the target object tracking box, without relying on the calculation result of a large target object tracking model, thus reducing the computational load of predicting the target object tracking box. Smoothing the second target object detection box makes the transition between it and the target object tracking box in the previous frame more natural. Acquiring a target object detection box again at intervals of a specified number of video frames reduces the computational complexity of obtaining the target object tracking box while ensuring the continuity of tracking.

Brief Description of the Drawings

[0009] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0010] Figure 1 shows a schematic diagram of an application environment provided by an embodiment of this application.

[0011] Figure 2 shows a flowchart of a target tracking method according to an embodiment of this application.

[0012] Figure 3 shows a schematic diagram of the matching relationship between the hands in the current frame image and the hands in the previous frame image, provided in an embodiment of this application.

[0013] Figure 4 illustrates the IoU calculation between the hand tracking box and the hand detection box provided in the embodiments of this application.

[0014] Figure 5 shows a flowchart of a target tracking method according to another embodiment of this application.

[0015] Figure 6 shows a flowchart of a target tracking method according to yet another embodiment of this application.

[0016] Figure 7 shows a structural block diagram of a target tracking device provided in an embodiment of this application.

[0017] Figure 8 shows a structural block diagram of an electronic device provided in an embodiment of this application.

[0018] Figure 9 shows a storage unit, according to an embodiment of this application, for storing or carrying program code that implements the target tracking method.

Detailed Implementation

[0019] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

[0020] In recent years, with the rapid development of internet technology, live streaming has become increasingly popular among users. Users can share their lives, work, and travel experiences by shooting short videos and uploading them to live streaming platforms. To enhance the entertainment value of these short videos, special effects can be added during filming, such as those controlled by hands or faces. However, to ensure that users can accurately control these effects with their hands or faces, it is necessary to track the user's hands or face and their changes during filming.

[0021] One approach is to extract features from the cropped image of the previous frame's hand detection bounding box to obtain a feature map. Then, segmentation is performed on the feature map to obtain the hand image. New detection bounding boxes are then generated based on the feature map and the hand image, serving as the input for the next frame, thus enabling tracking of the next frame. However, this tracking method requires hand image segmentation and skin color detection, and the computational process is complex and computationally intensive, easily leading to hand or face tracking loss. Furthermore, it is difficult to apply to devices with limited computing power.

[0022] To address the aforementioned problems, the inventors, through long-term research, arrived at the following method: obtain a first target object detection box from a first specified frame image of the video image to be processed; obtain a target object image to be tracked corresponding to the first video image to be tracked based on the first target object detection box; input the target object image into a specified detection model to obtain the target tracking box output by that model; obtain a second target object detection box from a second specified frame image; obtain the intersection over union (IoU) of the second target object detection box with at least one target object tracking box; and smooth the second target object detection box based on the first target object tracking box corresponding to an IoU value greater than or equal to a preset threshold, which yields a reference target object detection box. Finally, the second video image to be tracked corresponding to the first target object tracking box is tracked based on the reference target object detection box. This method uses a cropped, lighter-weight detection model to detect target object tracking boxes in the target object image to be tracked, and uses the detection result as the prediction result for the target object tracking box, without relying on the calculation result of a large target object tracking model, thus reducing the computational load of predicting the target object tracking box. Smoothing the second target object detection box makes the transition between it and the target object tracking box in the previous frame more natural. Acquiring a target object detection box again at intervals of a specified number of video frames reduces the computational complexity of the tracking process while ensuring the continuity of tracking. Therefore, this application proposes a target object tracking method, apparatus, electronic device, and storage medium according to its embodiments.

[0023] To facilitate a detailed explanation of the present application, an application environment in one embodiment of the present application will be described below with reference to the accompanying drawings.

[0024] Figure 1 is a schematic diagram illustrating the application environment of a target tracking method provided in an embodiment of this application. As shown in Figure 1, the application environment can be understood as a network system 10 provided in this application embodiment. The network system 10 includes a user terminal 11 and a server 12. Optionally, the user terminal 11 can be any device with communication and storage functions, including but not limited to a PC (personal computer), a PDA (personal digital assistant), a tablet PC, a smart TV, a smartphone, a smart wearable device, or another smart communication device with a network connection function. The server 12 can be a single server (network access server), a server cluster composed of several servers (cloud server), or a cloud computing center (database server).

[0025] In this embodiment, the user terminal 11 can be used to record or shoot short videos and track the user's hand or face during the video recording or shooting process. In order to improve the calculation speed of the tracking position corresponding to the user's hand or face, the user terminal 11 can send the tracking result to the server 12 for storage through the network, so as to reduce the storage space occupied by the user terminal 11 and improve the calculation speed of the tracking position of the target object, so that the target object tracking method of this application can be implemented in devices with weak computing performance.

[0026] The embodiments of this application will now be described in detail with reference to the accompanying drawings.

[0027] Figure 2 shows a flowchart of a target tracking method according to an embodiment of this application. This embodiment provides a target tracking method that can be applied to electronic devices. The method includes:

[0028] Step S110: Obtain the first target object detection box from the first specified frame image of the video image to be processed.

[0029] Optionally, the video image to be processed in this embodiment can be a video image including specific video effects. For example, the video image to be processed can be a video image including effects controlled by a human hand, or a video image including effects controlled by a human face, or a video image including effects controlled by both a human hand and a human face simultaneously. Optionally, the video image to be processed can be a video image during real-time shooting. The video image to be processed can include multiple frames. The target object in this embodiment can be a human hand or a human face. It is understood that when tracking gestures (or faces) in video images including effects controlled by the user through gestures (or faces), in order to ensure the continuity of tracking, it is necessary to detect and track gestures (or faces) in each frame of video image. When the video file is large, it will bring huge computational pressure to the user terminal, thereby affecting the tracking effect and efficiency.

[0030] As a way to improve the above-mentioned problems, the electronic device can obtain a first target object detection box from a first specified frame image of the video image to be processed, that is, identify the first target object as the tracking object and obtain the detection box of the first target object in the first specified frame image. Here, the first specified frame image can be the first frame image of the video image to be processed (i.e., the starting frame image, in which case the first specified frame image includes the image of the first target object, for example, if the first target object is a gesture, then the first specified frame image includes the gesture image), or it can be any video frame image that first includes the image of the first target object (for example, assuming that the 20th frame image of the video image to be processed first includes the image of the first target object, then the first specified frame image is the 20th frame image).

[0031] In one implementation, if the first target object is a gesture, the first target object detection box can be obtained from the first specified frame image of the video image to be processed based on a preset hand detection model. Optionally, the preset hand detection model can be a model trained by pruning channels of a RetinaNet model based on ResNet18 as the backbone network.

[0032] Step S120: Obtain the target image to be tracked corresponding to the first video image to be tracked based on the first target detection box.

[0033] The first video image to be tracked is the image following the first specified frame image of the video image to be processed. Optionally, the first video image to be tracked may include multiple frames of video images. As one approach, the target object image to be tracked corresponding to the first video image to be tracked can be obtained based on the first target object detection box. For example, if the first specified frame image is the first frame of the video image to be processed, and the first video image to be tracked includes 9 frames, namely frames 2-10 of the video image to be processed, then the target object images corresponding to frames 2-10 can be obtained sequentially, starting from the first target object detection box in the first frame image. Specifically, the position of the first target object detection box in the first frame image is used to locate the target object image to be tracked in the second frame image. The position of the tracking box in the second frame image is then used to locate the target object image in the third frame image. This process continues until the position of the target object in the last frame of the first video image to be tracked is obtained.
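The frame-to-frame propagation described in [0033] can be sketched as follows. This is only an illustration under stated assumptions: the patent text does not say how the previous box is turned into a crop region, so the relative `margin` used to enlarge the box, the `(x1, y1, x2, y2)` box layout, and the function names `expand_box` and `crop` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; layout is an assumption


def expand_box(box: Box, width: int, height: int, margin: float = 0.2) -> Box:
    """Enlarge the previous frame's box by a relative margin (an assumption,
    meant to tolerate small motion) and clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    dx = int((x2 - x1) * margin)
    dy = int((y2 - y1) * margin)
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(width, x2 + dx), min(height, y2 + dy))


def crop(frame: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the target object image out of a frame stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


# Usage: the box found in frame k selects the crop fed to the detector for frame k+1.
frame = [[r * 10 + c for c in range(10)] for r in range(10)]
region = expand_box((4, 4, 8, 8), width=10, height=10, margin=0.25)
patch = crop(frame, region)
```

Feeding only this small patch to the lightweight model, rather than the full frame, is what keeps the per-frame cost low in this sketch.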

[0034] Step S130: Input the image of the target object to be tracked into the specified detection model and obtain the target tracking box output by the specified detection model.

[0035] Optionally, after obtaining the target object images corresponding to each frame of the first video image to be tracked, these target object images can be input into a specified detection model, thereby obtaining the target tracking box output by the specified detection model. The target tracking box includes at least one target object tracking box, that is, the target tracking box can be understood as the tracking box position of the target object image in each frame of the first video image to be tracked, as output (predicted) by the specified detection model.

[0036] Optionally, the specified detection model in this embodiment is a lightweight detection model obtained by pruning (cropping). For example, the specified detection model can be a lighter detection model obtained by pruning a model based on the MobilenetV2+FPN framework, and the detection result of the specified detection model can be used as the tracking result for the first target object detection box.

[0037] As one approach, after acquiring the target tracking box, center point smoothing can be performed on the aforementioned at least one target object tracking box according to a first smoothing processing rule, so that the intersection over union (IoU) of the second target object detection box and the at least one center-smoothed target object tracking box can subsequently be obtained. The first smoothing processing rule can be expressed as:

[0038]

[0039]

[0040] Here, X_mean represents the X coordinate of the center point of the target object tracking box after center point smoothing, Y_mean represents the Y coordinate of the center point of the target object tracking box after center point smoothing, N represents the number of the at least one target object tracking box (for example, 3), x_i represents the x-coordinate of the center point of the i-th target object tracking box, y_i represents the y-coordinate of the center point of the i-th target object tracking box, and λ_i is a weighting parameter for the center point coordinates of the i-th target object tracking box. Optionally, the specific value of λ_i can be set according to the actual situation. For example, in this embodiment, the values can be λ_1 = 0.05, λ_2 = 0.25, λ_3 = 0.70.
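The formulas of the first smoothing processing rule did not survive in this text, so the sketch below is a guess at their form rather than the patent's own equations. It assumes a plain weighted sum, X_mean = Σ λ_i · x_i and Y_mean = Σ λ_i · y_i for i = 1..N, which is consistent with the example weights summing to 1. It also assumes the centers are ordered oldest first, so the most recent box receives the largest weight. The function name `smooth_center` is hypothetical.

```python
from typing import Sequence, Tuple


def smooth_center(
    centers: Sequence[Tuple[float, float]],
    weights: Sequence[float] = (0.05, 0.25, 0.70),  # example values from [0040]
) -> Tuple[float, float]:
    """Weighted sum of the last N tracking-box centers (assumed rule)."""
    if len(centers) != len(weights):
        raise ValueError("need exactly one weight per center point")
    x_mean = sum(w * x for w, (x, _) in zip(weights, centers))
    y_mean = sum(w * y for w, (_, y) in zip(weights, centers))
    return x_mean, y_mean


# Usage: three consecutive centers, oldest first (ordering is an assumption).
x_mean, y_mean = smooth_center([(100.0, 50.0), (104.0, 52.0), (110.0, 56.0)])
```

With these weights the smoothed center stays close to the newest observation while damping single-frame jitter.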

[0041] Step S140: Obtain the second target object detection box from the second specified frame image.

[0042] The second specified frame image is the next frame image adjacent to the last frame image in the first video image to be tracked. The second specified frame image and the first specified frame image are spaced apart by a fixed number of video frames. Optionally, the specific number of fixed video frames can be set according to actual needs. For example, the fixed video frames can be 10 frames, or 5 to 15 frames. The specific value or range is not limited.

[0043] One approach is to obtain a second target detection box from a second specified frame image of the video image to be processed, so as to enable continuous tracking of the detected target. The second target detection box and the first target detection box can correspond to the same detection (tracking) object. Optionally, the principle and specific process of obtaining the second target detection box from the second specified frame image can be found in step S110, and will not be repeated here.

[0044] Step S150: Obtain the intersection over union (IoU) of the second target object detection box and the at least one target object tracking box.

[0045] Optionally, the second target detection box may include detection boxes for multiple targets. In this approach, the second target detection box and the first target detection box may not correspond to the same detection (tracking) object. For example, the second target detection box may include detection boxes corresponding to user A's left hand and right hand, while the first target detection box may only include the detection box corresponding to user A's left hand. In this approach, to ensure the accuracy and continuity of tracking, the electronic device may obtain the intersection-over-union (IoU) ratio between the second target detection box and at least one target tracking box.

[0046] Continuing the aforementioned example, the IoU between the detection box of each target object in the second target object detection box and the tracking boxes of the target object images corresponding to frames 2-10 of the video image to be processed can be obtained, so as to determine whether the tracking is effective based on the calculated IoU. Optionally, if the calculated IoU is greater than or equal to a preset threshold, the tracking can be determined to be effective; if the calculated IoU is less than the preset threshold, the tracking can be determined to be ineffective. Optionally, the specific value of the preset threshold is not limited; for example, the preset threshold can be any value from 0.3 to 0.6.

[0047] For example, in some specific implementations, as shown in Figure 3, assume the tracking boxes containing the two hands generated in the previous frame are a1 and a2, and the detection boxes for the two hands generated in the current frame are b1 and b2. For each detection box, such as b1, the intersection over union (IoU) value between b1 and each hand tracking box a1 and a2 in the previous frame can be calculated. Optionally, if the IoU value between b1 and a1 is the largest, then it can be determined that the detection box b1 in the current frame matches the tracking box a1 in the previous frame; similarly, the detection box b2 in the current frame matches the tracking box a2 in the previous frame.
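The matching in [0047] can be sketched as a per-detection argmax over precomputed IoU values. This is a minimal illustration: the function name `match_detections` and the row/column layout (rows are current-frame detection boxes, columns are previous-frame tracking boxes) are assumptions, and the IoU numbers in the usage example are made up.

```python
from typing import List, Optional, Sequence


def match_detections(iou_matrix: Sequence[Sequence[float]]) -> List[Optional[int]]:
    """For each detection box (row), return the index of the tracking box
    (column) with the largest IoU, or None if there are no tracking boxes."""
    matches: List[Optional[int]] = []
    for row in iou_matrix:
        if not row:
            matches.append(None)
            continue
        best = max(range(len(row)), key=lambda j: row[j])
        matches.append(best)
    return matches


# Usage: rows are b1, b2; columns are a1, a2 (illustrative values only).
matches = match_detections([[0.80, 0.10],
                            [0.05, 0.70]])
```

Note that this sketch does not enforce a one-to-one assignment, so two detection boxes could pick the same tracking box. The text does not say how such ties are resolved.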

[0048] As a specific implementation of the IoU calculation, assume that the target object is a human hand, that the last frame of the first video image to be tracked has m hand tracking boxes t_1, t_2, ..., t_m, and that a hand detection model generates n hand detection boxes d_1, d_2, ..., d_n from the second specified frame image. The IoU of each hand detection box and each hand tracking box can then be calculated. The specific calculation formula is as follows:

[0049] IoU = area / (area1 + area2 - area).

[0050] Here, area is the area of the overlapping region between the hand detection rectangle d_i (i = 1, 2, ..., n) and the hand tracking rectangle t_j (j = 1, 2, ..., m), and area1 and area2 are the areas of the hand detection box and the hand tracking box, respectively. The IoU matrix of all hand detection boxes and all hand tracking boxes can then be calculated as follows:

[0051]
$$
\mathrm{IoU} =
\begin{bmatrix}
\mathrm{IoU}_{11} & \cdots & \mathrm{IoU}_{1m} \\
\vdots & \ddots & \vdots \\
\mathrm{IoU}_{n1} & \cdots & \mathrm{IoU}_{nm}
\end{bmatrix}
$$

[0052] Here, IoU_ij denotes the IoU between hand detection box d_i and hand tracking box t_j, with i = 1, 2, ..., n and j = 1, 2, ..., m. Optionally, a diagram illustrating the IoU calculation can be found in Figure 4.
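The formula in [0049] and the n × m matrix of [0051] can be sketched in a few lines. The `(x1, y1, x2, y2)` corner layout and the function names `iou` and `iou_matrix` are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); layout is an assumption


def iou(det: Box, trk: Box) -> float:
    """IoU = area / (area1 + area2 - area), with area the overlap region."""
    ix1, iy1 = max(det[0], trk[0]), max(det[1], trk[1])
    ix2, iy2 = min(det[2], trk[2]), min(det[3], trk[3])
    area = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when boxes are disjoint
    area1 = (det[2] - det[0]) * (det[3] - det[1])
    area2 = (trk[2] - trk[0]) * (trk[3] - trk[1])
    union = area1 + area2 - area
    return area / union if union > 0 else 0.0


def iou_matrix(dets: Sequence[Box], trks: Sequence[Box]) -> List[List[float]]:
    """Row i, column j holds the IoU of detection box d_i and tracking box t_j."""
    return [[iou(d, t) for t in trks] for d in dets]


# Usage: two detection boxes against one tracking box gives a 2 x 1 matrix.
m = iou_matrix([(0, 0, 10, 10), (20, 20, 30, 30)], [(5, 0, 15, 10)])
```

For the first pair the overlap is 50 and the union is 100 + 100 - 50 = 150, so the IoU is 1/3. The second pair does not overlap and yields 0.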

[0053] Step S160: Based on the first target object tracking box corresponding to an IoU value greater than or equal to a preset threshold, smooth the second target object detection box to obtain a reference target object detection box.

[0054] As one approach, if the calculated IoU is greater than or equal to the preset threshold, then to ensure the accuracy and continuity of tracking, the second target object detection box can be smoothed based on the first target object tracking box corresponding to that IoU value. This makes the position of the second target object detection box more consistent with the position of the first target object tracking box, so that the special effects controlled by the target object are more stable and accurate. Optionally, the detection box obtained after smoothing the second target object detection box can be used as the reference target object detection box.

[0055] For example, in a specific application scenario, suppose frames 2-10 of the video image to be processed correspond to a first target object tracking box, and frame 11 corresponds to a second target object detection box. The IoU between the second target object detection box in frame 11 and the first target object tracking boxes in frames 2-10 can be calculated. It can be understood that the second target object detection box can include detection boxes corresponding to multiple objects. For example, it can include a detection box corresponding to user A's left hand and a detection box corresponding to user A's right hand. Assume the first target object tracking box is the tracking box corresponding to user A's left hand. Then, among the calculated IoU values, the IoU between the detection box corresponding to user A's left hand and the first target object tracking box in frames 2-10 may be greater than or equal to the preset threshold, while the IoU between the detection box corresponding to user A's right hand and the first target object tracking box in frames 2-10 may be less than the preset threshold. In this case, to ensure the continuity of tracking user A's left hand, the detection box corresponding to user A's left hand can be smoothed based on the first target object tracking box in frames 2-10, and the detection box obtained after smoothing can be used as the reference target object detection box.

[0056] Optionally, in some possible implementations, the second target object detection box can be smoothed based on the first target object tracking box corresponding to the largest IoU, and the smoothed detection box can be used as the reference target object detection box. The specific calculation principle and process of the IoU can be found in the above description and will not be repeated here.
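Step S160 can be sketched as "pick the tracking box with the largest IoU, check the threshold, then blend". The passage does not give the smoothing rule for the detection box, so the linear blend with factor `alpha` below is purely an assumption, as are the function name `smooth_detection` and the default threshold of 0.5 (chosen from the 0.3 to 0.6 range mentioned in [0046]).

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); layout is an assumption


def smooth_detection(
    det: Box,
    tracks: Sequence[Box],
    ious: Sequence[float],
    threshold: float = 0.5,
    alpha: float = 0.5,  # weight of the new detection; hypothetical parameter
) -> Optional[Box]:
    """Blend a detection box with its best-matching tracking box.

    Returns None when no tracking box reaches the IoU threshold, which the
    text treats as ineffective tracking."""
    if not tracks:
        return None
    best = max(range(len(tracks)), key=lambda j: ious[j])
    if ious[best] < threshold:
        return None
    trk = tracks[best]
    x1, y1, x2, y2 = (alpha * d + (1.0 - alpha) * t for d, t in zip(det, trk))
    return (x1, y1, x2, y2)


# Usage: the second tracking box has the largest IoU and passes the threshold.
reference = smooth_detection(
    det=(10.0, 10.0, 30.0, 30.0),
    tracks=[(0.0, 0.0, 20.0, 20.0), (12.0, 12.0, 32.0, 32.0)],
    ious=[0.1, 0.7],
)
```

A smaller `alpha` keeps the reference box closer to the previous tracking box, which trades responsiveness for visual stability of the effect.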

[0057] Step S170: Track the second video image to be tracked based on the reference target detection box and the first target tracking box.

[0058] The second video image to be tracked consists of the images following the second specified frame image of the video image to be processed, and it includes the target object corresponding to the first target object tracking box. As one approach, the second video image to be tracked can be tracked based on the reference target object detection box, so that the target object corresponding to the first target object tracking box is tracked continuously. Optionally, for the specific principle and process of tracking the second video image to be tracked, refer to the description of steps S120-S130 above. For example, the target object image corresponding to the second video image to be tracked can be obtained based on the reference target object detection box, that image can be input into the aforementioned specified detection model, and the target tracking box output by the specified detection model can then be obtained. After the second video image to be tracked has been tracked, a third target object detection box can be obtained, and steps S150 to S170 above can be repeated until target object tracking has been completed for the entire video image to be processed.
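
The overall loop of detecting at specified frames and tracking in between can be sketched as follows. The callables `detect`, `track`, and `reconcile` are hypothetical stand-ins for the full detector, the lightweight specified detection model, and the IoU-plus-smoothing step; the interval of 10 frames is also only an illustrative value.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def track_video(
    frames: Iterable[object],
    detect: Callable[[object], Box],
    track: Callable[[object, Box], Box],
    reconcile: Callable[[Box, List[Box]], Box],
    interval: int = 10,
) -> List[Box]:
    """Return one box per frame, re-detecting every `interval` frames."""
    results: List[Box] = []
    history: List[Box] = []  # tracking boxes since the last detection
    box = None
    for index, frame in enumerate(frames):
        if index % interval == 0:
            detection = detect(frame)
            # The first detection is used as is; later ones are reconciled
            # with the tracking boxes collected since the previous detection.
            box = reconcile(detection, history) if history else detection
            history = []
        else:
            box = track(frame, box)
            history.append(box)
        results.append(box)
    return results
```

Only the frames at the specified interval pay the cost of a fresh detection; all other frames reuse the previous box as the starting point for the lightweight model.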

[0059] The target object tracking method provided in this embodiment obtains a first target object detection box from a first specified frame image of a video image to be processed; obtains, based on the first target object detection box, the target object image to be tracked corresponding to the first video image to be tracked; inputs the target object image to be tracked into a specified detection model and obtains the target tracking box output by the specified detection model; obtains a second target object detection box from a second specified frame image; obtains the intersection-over-union (IoU) of the second target object detection box and at least one target object tracking box; smooths the second target object detection box based on the first target object tracking box whose IoU value is greater than or equal to a preset threshold, obtaining a reference target object detection box; and then tracks, based on the reference target object detection box, the second video image to be tracked corresponding to the first target object tracking box. The method thus uses a cropped, lighter detection model to detect the target object tracking box in the target object image to be tracked and takes the detection result as the prediction of the target object tracking box, without relying on the calculation result of a large target object tracking model, which reduces the computational load of predicting the target object tracking box. Smoothing the second target object detection box makes the transition between the second target object detection box and the target object tracking box of the previous frame more natural. Re-acquiring the target object detection box at a specified video frame interval reduces the computational complexity of the tracking process while ensuring the continuity of tracking.

[0060] Please refer to Figure 5, which shows a flowchart of a target tracking method according to another embodiment of this application. This embodiment provides a target tracking method that can be applied to electronic devices. The method includes:

[0061] Step S210: Obtain the first target object detection box from the first specified frame image of the video image to be processed.

[0062] Step S220: Obtain the target image to be tracked corresponding to the first video image to be tracked based on the first target detection box.

[0063] Step S230: Input the image of the target object to be tracked into the specified detection model and obtain the target tracking box output by the specified detection model.

[0064] Step S240: Perform center point smoothing on the at least one target object tracking box according to the first smoothing rule, and perform width and height smoothing on the at least one target object tracking box according to the second smoothing rule.

[0065] Optionally, after the target tracking box is obtained, in addition to performing center point smoothing on the at least one target object tracking box according to the first smoothing rule as described in the foregoing embodiments, width and height smoothing can also be performed, according to the second smoothing rule, on the at least one target object tracking box after center point smoothing. This facilitates the subsequent acquisition of the intersection-over-union (IoU) between the second target object detection box and the at least one target object tracking box after center point smoothing and width and height smoothing. The second smoothing rule can be expressed as:

[0066]

[0067]

[0068] Here, $W_{mean}$ represents the width of the target object tracking box after center point smoothing, $h_{mean}$ represents the height of the target object tracking box after center point smoothing, $N$ represents the number of the at least one target object tracking box, $w_i$ represents the width of the at least one target object tracking box, $h_i$ represents the height of the at least one target object tracking box, and $\sigma_i$ represents the weights of the width and height of the at least one target object tracking box. Optionally, the specific values of $\sigma_i$ can be set according to the actual situation. For example, in this embodiment the values of $\sigma_i$ may be: $\sigma_1 = 0.25$, $\sigma_2 = 0.35$, $\sigma_3 = 0.40$.
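
The formula images for the second smoothing rule are not reproduced in this text. The sketch below therefore assumes a plain weighted sum over the N tracking boxes, which is consistent with the example weights summing to 1 but is not confirmed by the source. The function name `smooth_size` and the pairing of weights with boxes by position are likewise assumptions.

```python
from typing import Sequence, Tuple


def smooth_size(
    sizes: Sequence[Tuple[float, float]],
    weights: Sequence[float] = (0.25, 0.35, 0.40),
) -> Tuple[float, float]:
    """Weighted width and height over N tracking boxes.

    sizes holds (w_i, h_i) pairs in the same order as the weights sigma_i.
    The weighted-sum form is an assumption; the source formula is missing.
    """
    if len(sizes) != len(weights):
        raise ValueError("need exactly one weight per tracking box")
    w_mean = sum(s * w for s, (w, _) in zip(weights, sizes))
    h_mean = sum(s * h for s, (_, h) in zip(weights, sizes))
    return w_mean, h_mean
```

For three boxes of sizes (40, 20), (60, 40), and (80, 60), this gives a smoothed size of roughly (63, 43), so the result leans toward the boxes with the larger weights.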

[0069] Step S250: Obtain the second target object detection box from the second specified frame image.

[0070] Step S260: Obtain the intersection-union ratio of the second target detection box and the at least one target tracking box after center point smoothing and width and height smoothing.

[0071] As described above, this embodiment can obtain the intersection-over-union (IoU) of the second target object detection box and the at least one target object tracking box after center point smoothing and width and height smoothing. For the principle and calculation process of the IoU, refer to the description of the foregoing embodiment; they are not repeated here.

[0072] Step S270: Based on the first target object tracking box with an intersection-union ratio greater than or equal to a preset threshold, smooth the second target object detection box to obtain a reference target object detection box.

[0073] Step S280: Track the second video image to be tracked based on the reference target detection box and the first target tracking box.

[0074] Optionally, the second video image to be tracked is an image following the second specified frame image.

[0075] Optionally, in this embodiment, a correspondence between the target object and different special effect functions can be established in advance, so that the target object can be used to control the movement of the special effect of the corresponding special effect function. For example, if the target object is a human hand, various hand-controlled special effect functions can be established, allowing the user to control the movement of the related special effect (e.g., the "raindrop" effect) through gestures and thereby complete the shooting of short videos and other video content. Similarly, if the target object is a human face, various face-controlled special effect functions can be established, allowing the user to control the movement of the related special effect (e.g., the "raindrop" effect) with the face and thereby complete the shooting of short videos and other video content. It should be noted that the target object in this embodiment can include both a human hand and a human face. In this way, if the special effect function corresponding to the target object is detected to be in the active state, the movement direction of the target object can be obtained first, and the specified special effect can then be controlled to move, synchronously or with a delay, according to the movement direction of the target object. Optionally, the specified special effect can be a special effect controlled by gestures, a special effect controlled by the face, or a special effect that can be controlled by both gestures and the face. The specific content of the specified special effect is not limited; for example, it can be a "raindrop effect", a "falling leaf effect", or a "sunflower effect".

[0076] For example, in a specific application scenario, assume the specified special effect is a "cloud and rain effect" and the target object is a gesture. When the user's gesture, facing the camera of the electronic device, moves toward the left side of the screen, the "cloud and rain effect" can be controlled to move to the left as well. If the user's gesture, facing the camera of the electronic device, moves toward the right side of the screen, the "cloud and rain effect" can be controlled to move to the right as well.
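
One way to derive a movement direction from consecutive tracking boxes and apply it to an effect is sketched below. The helper names, the `(x, y, w, h)` layout in screen coordinates, the minimum shift of 1 pixel, and the step of 5 pixels are all hypothetical; the text does not specify how the direction is computed.

```python
def box_center(box):
    """Center point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def horizontal_direction(prev_box, curr_box, min_shift=1.0):
    """Return "left", "right", or "none" from the shift of the box center."""
    dx = box_center(curr_box)[0] - box_center(prev_box)[0]
    if dx <= -min_shift:
        return "left"
    if dx >= min_shift:
        return "right"
    return "none"


def move_effect(effect_x, direction, step=5.0):
    """Shift the effect's x position on screen in the given direction."""
    if direction == "left":
        return effect_x - step
    if direction == "right":
        return effect_x + step
    return effect_x
```

A small dead zone (`min_shift`) keeps the effect from jittering when the tracked hand is nearly still.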

[0077] Optionally, if the specified effect is one that can be controlled by the user through gestures or facial expressions, the user can alternately use gestures or facial expressions to control the effect during the recording process. The order in which gestures or facial expressions appear is not limited. By tracking the target object and then controlling the movement of the specified effect based on the tracking results, the fun and interactivity of video shooting are enhanced.

[0078] The target object tracking method provided in this embodiment uses a lightweight detection model (after cropping) to detect target object bounding boxes in the image of the target object to be tracked. The detection result is used as the prediction result for the target object bounding box, eliminating the need to rely on the computational results of a large target object tracking model, thus reducing the computational load in the target object bounding box prediction process. Smoothing the target object bounding boxes ensures the continuity of target object tracking. Smoothing the second target object detection box makes the transition between the second and previous frame's target object bounding boxes more natural. Re-obtaining the target object detection boxes at specified intervals between video frames reduces the computational complexity of the target object bounding box tracking process while ensuring tracking continuity. By tracking the target object and then controlling the movement of specified special effects based on the tracking results, the fun and interactivity of video shooting are enhanced, thereby improving the user experience.

[0079] Please refer to Figure 6, which shows a flowchart of a target tracking method according to another embodiment of this application. This embodiment provides a target tracking method that can be applied to electronic devices. The method includes:

[0080] Step S310: Obtain the first target object detection box from the first specified frame image of the video image to be processed.

[0081] Step S320: Obtain the target image to be tracked corresponding to the first video image to be tracked based on the first target detection box.

[0082] Step S330: Input the image of the target object to be tracked into the specified detection model, and obtain the target tracking box output by the specified detection model.

[0083] Step S340: Obtain the second target object detection box from the second specified frame image.

[0084] Step S350: Obtain the intersection-over-union (IoU) of the second target detection box and the at least one target tracking box.

[0085] Step S361: Based on the first target object tracking box with an intersection-union ratio greater than or equal to a preset threshold, smooth the second target object detection box to obtain a reference target object detection box.

[0086] Step S362: Track the second video image to be tracked based on the reference target detection box and the first target tracking box.

[0087] Step S371: The target object tracking box with an intersection-union ratio value less than the preset threshold is used as the second target object tracking box.

[0088] Optionally, in some video effects recording scenarios, the user may first extend one hand to control the movement of the effect, and then extend the other hand to control the effect after a certain period of time (e.g., 5 seconds, 10 seconds, etc., the specific value is not limited). At this time, the effect can be controlled by both hands at the same time, or the first hand can be retracted when the other hand is extended. There are no restrictions on the specifics.

[0089] As one approach, if a calculated intersection-over-union (IoU) is less than the preset threshold, the second target object detection box may include detection boxes corresponding to multiple target objects. In this case, the target object tracking box corresponding to the IoU value that is less than the preset threshold can be used as the second target object tracking box (which can be understood as a new tracking object). It should be noted that the target object corresponding to the second target object tracking box can differ from the target object corresponding to the first target object tracking box. For example, the target object corresponding to the first target object tracking box can be the user's left hand, while the target object corresponding to the second target object tracking box can be the user's right hand. Optionally, the left hand and right hand here can belong to the same user or to different users; that is, they can be the left hand of user A and the right hand of user B.
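
The split between continuing the existing track and starting a new one can be sketched as follows. The function name `split_detections`, the `(x, y, w, h)` tuple layout, and the threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes; 0.0 when they do not overlap."""
    inter_w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    inter_h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def split_detections(detections, tracked_box, threshold=0.5):
    """Separate detections that continue the existing track from new objects."""
    continuing, new_objects = [], []
    for det in detections:
        if iou(det, tracked_box) >= threshold:
            continuing.append(det)   # same object as the first tracking box
        else:
            new_objects.append(det)  # treated as a new tracking object
    return continuing, new_objects
```

Applied to a tracked left hand and a newly extended right hand, the right-hand detection falls below the threshold and becomes the start of a second track.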

[0090] Step S372: Track the second video image to be tracked based on the second target object detection box and the second target object tracking box.

[0091] Optionally, based on the above description, the second target object tracking box can be used as the new tracking object, and the second video image to be tracked corresponding to the second target object tracking box can then be tracked based on the second target object detection box. For the specific tracking principle and process, refer to the relevant descriptions in the foregoing embodiments; they are not repeated here.

[0092] The target object tracking method provided in this embodiment uses a lightweight detection model (after cropping) to detect target object bounding boxes in the image of the target object to be tracked. The detection result is used as the prediction result of the target object bounding box, without relying on the calculation result of a large target object tracking model, thus reducing the computational load in the target object bounding box prediction process. Smoothing the target object bounding boxes ensures the continuity of target object tracking. Smoothing the second target object detection box makes the transition between the second target object detection box and the target object bounding box in the previous frame more natural. By obtaining the target object detection box again at intervals of specified video frames, the computational complexity in the target object bounding box tracking process can be reduced, while ensuring the continuity of tracking. By using the second target object bounding box as the new tracking object, and then tracking the second video image to be tracked based on the second target object detection box, it is possible to continuously track multiple targets (gestures or faces) simultaneously.

[0093] Please refer to Figure 7, which is a structural block diagram of a target tracking device provided in an embodiment of this application. This embodiment provides a target tracking device 400, which can operate in an electronic device. The device 400 includes: a first acquisition module 410, a second acquisition module 420, a third acquisition module 430, a fourth acquisition module 440, a fifth acquisition module 450, a processing module 460, and a tracking module 470.

[0094] The first acquisition module 410 is used to acquire a first target object detection box from a first specified frame image of the video image to be processed.

[0095] Optionally, the target object in this embodiment can be a human hand, a human face, or the like. During video recording, the user can control the video effects with the hand or face, for example by making the video effects move synchronously with the movement of the user's hand or face, to enrich the video content.

[0096] The second acquisition module 420 is used to acquire a target image to be tracked corresponding to the first video image to be tracked based on the first target detection frame, wherein the first video image to be tracked is an image after the first specified frame image.

[0097] The third acquisition module 430 is used to input the image of the target object to be tracked into a specified detection model and acquire the target tracking box output by the specified detection model. The target tracking box includes at least one target object tracking box, and the specified detection model is a lightweight detection model obtained by cropping.

[0098] The fourth acquisition module 440 is used to acquire a second target detection box from a second specified frame image, wherein the second specified frame image is the next frame image adjacent to the last frame image in the first video image to be tracked.

[0099] The fifth acquisition module 450 is used to acquire the intersection-union ratio of the second target detection box and the at least one target tracking box.

[0100] Optionally, the device 400 may further include a smoothing module, which can be used to perform center point smoothing on the at least one target object tracking frame according to a first smoothing rule. Optionally, the first smoothing rule may be:

[0101]

[0102]

[0103] Here, $X_{mean}$ represents the X coordinate of the center point of the target object tracking box after center point smoothing, $Y_{mean}$ represents the Y coordinate of the center point of the target object tracking box after center point smoothing, $N$ represents the number of the at least one target object tracking box, $x_i$ represents the x-coordinate of the center point of the at least one target object tracking box, $y_i$ represents the y-coordinate of the center point of the at least one target object tracking box, and $\lambda_i$ represents the weight parameter of the center point coordinates of the at least one target object tracking box.
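
The formula images for the first smoothing rule are not reproduced in this text. One form consistent with the symbol definitions in paragraph [0103], assuming (this is not confirmed by the source) that the weights $\lambda_i$ sum to 1 so that no further normalization is needed, would be:

```latex
\[
X_{mean} = \sum_{i=1}^{N} \lambda_i \, x_i , \qquad
Y_{mean} = \sum_{i=1}^{N} \lambda_i \, y_i , \qquad
\sum_{i=1}^{N} \lambda_i = 1 .
\]
```

If the weights did not sum to 1, each sum would need to be divided by $\sum_{i=1}^{N} \lambda_i$ to remain a weighted average of the center coordinates.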

[0104] In this manner, the fifth acquisition module 450 can be used to acquire the intersection-union ratio of the second target detection box and the at least one target tracking box after center point smoothing.

[0105] Optionally, in some embodiments, the smoothing module described above can be used to perform center point smoothing on the at least one target object tracking frame according to a first smoothing rule, and to perform width and height smoothing on the at least one target object tracking frame according to a second smoothing rule; in this way, the fifth acquisition module 450 can be used to acquire the intersection-over-union ratio of the second target object detection frame and the at least one target object tracking frame after center point smoothing and width and height smoothing.

[0106] In this embodiment, the second smoothing rule can be:

[0107]

[0108]

[0109] Here, $W_{mean}$ represents the width of the target object tracking box after center point smoothing, $h_{mean}$ represents the height of the target object tracking box after center point smoothing, $N$ represents the number of the at least one target object tracking box, $w_i$ represents the width of the at least one target object tracking box, $h_i$ represents the height of the at least one target object tracking box, and $\sigma_i$ represents the weights of the width and height of the at least one target object tracking box.
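
The formula images for the second smoothing rule are likewise not reproduced. A form consistent with the symbol definitions in paragraph [0109], offered as an assumption rather than as the rule from the source, is a weighted sum over the $N$ tracking boxes. The example weights given for the method embodiment (0.25, 0.35, and 0.40) sum to 1, which fits this form:

```latex
\[
W_{mean} = \sum_{i=1}^{N} \sigma_i \, w_i , \qquad
h_{mean} = \sum_{i=1}^{N} \sigma_i \, h_i , \qquad
\sum_{i=1}^{N} \sigma_i = 1 .
\]
```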

[0110] Optionally, in some possible implementations, the smoothing module described above can also be used to perform width and height smoothing on the at least one target object tracking frame only according to the second smoothing rule. In this case, the fifth acquisition module 450 can be used to acquire the intersection-over-union ratio (IoU) of the second target object detection frame and the at least one target object tracking frame after width and height smoothing. The specific smoothing method used is not limited.

[0111] The processing module 460 is used to smooth the second target object detection box based on the first target object tracking box corresponding to an intersection-over-union (IoU) value greater than or equal to a preset threshold, so as to obtain a reference target object detection box.

[0112] The tracking module 470 is used to track a second video image to be tracked corresponding to the first target object tracking frame based on the reference target object detection frame, wherein the second video image to be tracked is an image after the second specified frame image.

[0113] Optionally, the processing module 460 can also be used to use the target object tracking box corresponding to the intersection-union ratio with a value less than the preset threshold as the second target object tracking box (i.e., track it as a new target object). In this way, the tracking module 470 can be used to track the second video image to be tracked based on the second target object detection box and the second target object tracking box.

[0114] Optionally, the device 400 may further include a special effects control module, which is used to obtain the movement direction of the target object if it is detected that the special effect function corresponding to the target object is in an activated state, where the target object is used to control the movement of the special effect of the corresponding special effect function, and to control the specified special effect to move according to the movement direction.

[0115] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the above-described device and module can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0116] In the several embodiments provided in this application, the mutual coupling, direct coupling, or communication connection shown or discussed between modules may be an indirect coupling or communication connection through some interface, device, or module, and may be electrical, mechanical, or of another form.

[0117] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules described above can be implemented in hardware or as software functional modules.

[0118] Please refer to Figure 8. Based on the aforementioned target tracking method and device, this application also provides an electronic device 100 capable of executing the aforementioned target tracking method. The electronic device 100 includes a memory 102 and one or more (only one shown in the figure) processors 104 coupled to each other, with communication lines connecting the memory 102 and the processors 104. The memory 102 stores programs capable of executing the contents of the aforementioned embodiments, and the processors 104 can execute the programs stored in the memory 102.

[0119] The processor 104 may include one or more processing cores. The processor 104 connects to various parts within the electronic device 100 using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 102, and by calling data stored in the memory 102. Optionally, the processor 104 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 104 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 104 and may be implemented separately using a communication chip.

[0120] The memory 102 may include random access memory (RAM) or read-only memory (ROM). The memory 102 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 102 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the aforementioned embodiments. The data storage area may also store data created by the electronic device 100 during use (such as phonebook data, audio and video data, chat log data, etc.).

[0121] Please refer to Figure 9, which shows a structural block diagram of a computer-readable storage medium provided in an embodiment of this application. The computer-readable storage medium 500 stores program code that can be called by a processor to execute the methods described in the above method embodiments.

[0122] The computer-readable storage medium 500 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 500 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 500 has storage space for program code 510 that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code 510 may, for example, be compressed in a suitable form.

[0123] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0124] In summary, the target object tracking method, apparatus, electronic device, and storage medium provided in this application obtain a first target object detection box from a first specified frame image of a video image to be processed; obtain, based on the first target object detection box, the target object image to be tracked corresponding to the first video image to be tracked; input the target object image to be tracked into a specified detection model to obtain the target tracking box output by the specified detection model; obtain a second target object detection box from a second specified frame image; obtain the intersection-over-union (IoU) of the second target object detection box and at least one target object tracking box; smooth the second target object detection box based on the first target object tracking box whose IoU value is greater than or equal to a preset threshold to obtain a reference target object detection box; and then track, based on the reference target object detection box, the second video image to be tracked corresponding to the first target object tracking box. The method thus uses a cropped, lighter-weight detection model to detect the target object tracking box in the target object image to be tracked and takes the detection result as the prediction of the target object tracking box, without relying on the calculation result of a large target object tracking model, which reduces the computational load of predicting the target object tracking box. Smoothing the second target object detection box makes the transition between the second target object detection box and the target object tracking box of the previous frame more natural. Re-acquiring the target object detection box at a specified video frame interval reduces the computational complexity of the tracking process while ensuring the continuity of tracking.

[0125] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A target tracking method, characterized in that the method comprises: obtaining a first target object detection box from a first specified frame image of a video image to be processed; obtaining, based on the first target object detection box, a target object image to be tracked corresponding to a first video image to be tracked, wherein the first video image to be tracked is a series of frames after the first specified frame image, and wherein the step of obtaining the target object image to be tracked corresponding to the first video image to be tracked based on the first target object detection box comprises: for each frame of the first video image to be tracked, using the position of the tracking box of the target object image to be tracked in the previous frame image to calibrate the position of the target object image in the next frame image, so as to obtain the position of the target object image to be tracked in the next frame image; inputting the target object image to be tracked into a specified detection model to obtain the target tracking box output by the specified detection model, wherein the target tracking box includes at least one target object tracking box, the target object tracking box represents the position, predicted by the specified detection model, of the tracking box of the target object image in each frame of the first video image to be tracked, and the specified detection model is a lightweight detection model obtained by cropping; obtaining a second target object detection box from a second specified frame image, wherein the second specified frame image is the next frame image adjacent to the last frame image in the first video image to be tracked, and the second specified frame image and the first specified frame image are spaced at a fixed video frame interval; obtaining the intersection-over-union (IoU) of the second target object detection box and the at least one target object tracking box; smoothing the second target object detection box based on the first target object tracking box whose IoU value is greater than or equal to a preset threshold, to obtain a reference target object detection box; and tracking, based on the reference target object detection box, a second video image to be tracked corresponding to the first target object tracking box, wherein the second video image to be tracked is a series of frames following the second specified frame image.

2. The method according to claim 1, characterized in that the method further comprises: smoothing the center point of the at least one target object tracking box according to a first smoothing rule; and the step of obtaining the intersection-over-union (IoU) of the second target object detection box and the at least one target object tracking box comprises: obtaining the IoU of the second target object detection box and the at least one target object tracking box after center point smoothing.

3. The method according to claim 2, characterized in that the first smoothing rule is: [formula omitted in the source text] wherein x_mean represents the x-coordinate of the center point of the target object tracking box after center point smoothing, y_mean represents the y-coordinate of the center point of the target object tracking box after center point smoothing, N represents the number of the at least one target object tracking box, x_i represents the x-coordinate of the center point of the i-th target object tracking box, y_i represents the y-coordinate of the center point of the i-th target object tracking box, and λ_i represents the weight parameter of the center point coordinates of the i-th target object tracking box.
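The formula of the first smoothing rule does not survive in this text, so the following is only a sketch under an assumed form: a normalized weighted mean, x_mean = Σ λ_i·x_i / Σ λ_i, and likewise for y_mean. The normalization by the sum of the weights and the function name `smooth_center` are assumptions, not taken from the claim.

```python
from typing import Sequence, Tuple


def smooth_center(
    centers: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Weighted mean of N box centers (x_i, y_i) with weights lambda_i.

    ASSUMPTION: the weights are normalized by their sum; the patent text
    reproduced here does not show the actual formula.
    """
    if not centers or len(centers) != len(weights):
        raise ValueError("centers and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    x_mean = sum(w * x for w, (x, _) in zip(weights, centers)) / total
    y_mean = sum(w * y for w, (_, y) in zip(weights, centers)) / total
    return x_mean, y_mean
```

Giving larger weights to the most recent frames would make the smoothed center follow the target more closely, while equal weights reduce the rule to a plain average over the N tracking boxes.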

4. The method according to claim 3, characterized in that the method further comprises: smoothing the width and height of the at least one target object tracking box according to a second smoothing rule; and the step of obtaining the intersection-over-union (IoU) of the second target object detection box and the at least one target object tracking box comprises: obtaining the IoU of the second target object detection box and the at least one target object tracking box after center point smoothing and width and height smoothing.

5. The method according to claim 4, characterized in that the second smoothing rule is: [formula omitted in the source text] wherein w_mean represents the width of the target object tracking box after smoothing, h_mean represents the height of the target object tracking box after smoothing, N represents the number of the at least one target object tracking box, w_i represents the width of the i-th target object tracking box, h_i represents the height of the i-th target object tracking box, and σ_i represents the weight parameter of the width and height of the i-th target object tracking box.
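As with the center point, the formula of the second smoothing rule is not reproduced in this text. The sketch below assumes a normalized weighted mean over the N widths and heights with weights σ_i, and then rebuilds a corner-format box from a smoothed center and size. The helper names `smooth_size` and `box_from_center` and the `(x_min, y_min, x_max, y_max)` layout are hypothetical.

```python
from typing import Sequence, Tuple


def smooth_size(
    sizes: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Weighted mean of N box sizes (w_i, h_i) with weights sigma_i.

    ASSUMPTION: normalized weighted mean; the actual rule is not shown
    in the source text.
    """
    if not sizes or len(sizes) != len(weights):
        raise ValueError("sizes and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    w_mean = sum(s * w for s, (w, _) in zip(weights, sizes)) / total
    h_mean = sum(s * h for s, (_, h) in zip(weights, sizes)) / total
    return w_mean, h_mean


def box_from_center(
    center: Tuple[float, float],
    size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Convert (center, size) to an (x_min, y_min, x_max, y_max) box."""
    (cx, cy), (w, h) = center, size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

The rebuilt box is what would then be compared against the second target object detection box when computing the IoU after both smoothing steps.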

6. The method according to claim 1, characterized in that the method further comprises: taking a target object tracking box whose IoU value is less than the preset threshold as a second target object tracking box; and tracking, based on the second target object detection box, the second video image to be tracked that corresponds to the second target object tracking box.
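Claims 1 and 6 together describe a two-way routing by IoU: tracking boxes at or above the threshold are used to smooth the new detection box, while for those below it the detection box is used as-is. A minimal sketch follows. The claims do not state how the detection box is smoothed, so the blend in `reference_box` (mixing the detection with the mean of the matched tracking boxes using a factor `alpha`) is purely an assumption, as are all names and default values.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def split_by_iou(
    detection: Box, tracking_boxes: Sequence[Box], threshold: float
) -> Tuple[List[Box], List[Box]]:
    """Return (first, second): boxes with IoU >= threshold, and the rest."""
    first: List[Box] = []
    second: List[Box] = []
    for box in tracking_boxes:
        (first if iou(detection, box) >= threshold else second).append(box)
    return first, second


def reference_box(detection: Box, first: Sequence[Box], alpha: float = 0.5) -> Box:
    """Blend the detection with the mean of the matched tracking boxes.

    ASSUMPTION: linear blend with factor alpha; the source does not
    specify the smoothing. With no matched box, the detection is
    returned unchanged, mirroring claim 6.
    """
    if not first:
        return detection
    n = len(first)
    mean = [sum(b[k] for b in first) / n for k in range(4)]
    blended = [alpha * detection[k] + (1 - alpha) * mean[k] for k in range(4)]
    return (blended[0], blended[1], blended[2], blended[3])
```

For example, with a hypothetical threshold of 0.5, a tracking box shifted by one pixel from a 10×10 detection is matched and pulls the reference box halfway toward itself, whereas a distant tracking box falls into the second group and leaves the detection untouched.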

7. The method according to any one of claims 1-6, characterized in that the method further comprises: if the special effect function corresponding to the target object is detected to be in the enabled state, obtaining the movement direction of the target object, and controlling the specified special effect corresponding to the special effect function to move in the specified direction according to the movement direction of the target object.
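One plausible way to obtain the movement direction in claim 7 is from the displacement of the tracked box center between two frames. The sketch below is an illustration under that assumption; the function names, the `speed` parameter, and the `effect_enabled` flag are hypothetical and do not come from the claim.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


def center(box: Box) -> Point:
    """Center point of a corner-format box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def movement_direction(prev_box: Box, curr_box: Box, min_shift: float = 1e-6) -> Point:
    """Unit vector from the previous box center to the current one.

    Returns (0.0, 0.0) when the target has effectively not moved.
    """
    (px, py), (cx, cy) = center(prev_box), center(curr_box)
    dx, dy = cx - px, cy - py
    norm = math.hypot(dx, dy)
    if norm < min_shift:
        return (0.0, 0.0)
    return (dx / norm, dy / norm)


def move_effect(pos: Point, direction: Point, speed: float, effect_enabled: bool) -> Point:
    """Advance the effect along the direction only while the effect is enabled."""
    if not effect_enabled:
        return pos
    return (pos[0] + speed * direction[0], pos[1] + speed * direction[1])
```

Checking the enabled flag before moving the effect mirrors the condition in the claim that the special effect function must be detected as enabled.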

8. A target tracking device, characterized in that the device comprises:
a first acquisition module, configured to acquire a first target object detection box from a first specified frame image of a video image to be processed;
a second acquisition module, configured to acquire, based on the first target object detection box, a target object image to be tracked corresponding to a first video image to be tracked, wherein the first video image to be tracked is a series of frames after the first specified frame image, and wherein acquiring the target object image to be tracked comprises: for each frame of the first video image to be tracked, using the position of the tracking box of the target object image to be tracked in the previous frame image to calibrate the position of the target object image in the next frame image, thereby obtaining the position of the target object image to be tracked in the next frame image;
a third acquisition module, configured to input the target object image to be tracked into a specified detection model and acquire the target tracking box output by the specified detection model, wherein the target tracking box includes at least one target object tracking box, each target object tracking box represents the position, predicted by the specified detection model, of the tracking box of the target object image in each frame of the first video image to be tracked, and the specified detection model is a lightweight detection model obtained by cropping;
a fourth acquisition module, configured to acquire a second target object detection box from a second specified frame image, wherein the second specified frame image is the frame immediately following the last frame of the first video image to be tracked, and the second specified frame image and the first specified frame image are separated by a fixed video frame interval;
a fifth acquisition module, configured to acquire the intersection-over-union (IoU) of the second target object detection box and the at least one target object tracking box;
a processing module, configured to smooth the second target object detection box based on a first target object tracking box, namely a target object tracking box whose IoU value is greater than or equal to a preset threshold, to obtain a reference target object detection box; and
a tracking module, configured to track, based on the reference target object detection box, a second video image to be tracked that corresponds to the first target object tracking box, wherein the second video image to be tracked is a series of frames following the second specified frame image.
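The frame layout on which these modules operate (a specified detection frame, followed by a run of tracked frames, repeating at a fixed frame interval) can be sketched as below. The function name `segment_frames`, the zero-based indexing, and the choice of frame 0 as the first specified frame are assumptions for illustration.

```python
from typing import List, Tuple


def segment_frames(num_frames: int, interval: int) -> List[Tuple[int, List[int]]]:
    """Split a video into (detection_frame, tracked_frames) segments.

    Every `interval`-th frame is a specified (detection) frame; the frames
    between two detection frames form the video image to be tracked.
    ASSUMPTION: the first specified frame is frame 0.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")

    segments: List[Tuple[int, List[int]]] = []
    for det in range(0, num_frames, interval):
        # Tracked frames run up to, but not including, the next detection frame.
        tracked = list(range(det + 1, min(det + interval, num_frames)))
        segments.append((det, tracked))
    return segments
```

With 7 frames and an interval of 3, frames 0, 3 and 6 are detection frames, and each detection frame immediately follows the last tracked frame of the previous segment, as the fourth acquisition module requires.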

9. An electronic device, characterized in that it comprises one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores program code, wherein the program code, when executed by a processor, performs the method according to any one of claims 1-7.