Methods, apparatus, devices and storage media for generating sample datasets

By segmenting the video stream and detecting objects, a sample dataset is automatically generated, solving the problem of high cost and low efficiency in manually generating datasets in existing technologies, and realizing efficient training and recognition of pose recognition models.

CN116310952B · Active · Publication Date: 2026-06-30 · GREAT WALL MOTOR CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GREAT WALL MOTOR CO LTD
Filing Date
2023-02-17
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, generating datasets for training neural network models through manual processing increases labor costs and has low dataset production efficiency, making it difficult to effectively monitor employee behavior within the factory.

Method used

By dividing the video stream into video segments, object detection and pose recognition are performed, and a sample dataset is automatically generated for training the pose recognition model.

Benefits of technology

Reduces labor costs, improves the efficiency of dataset production, and improves the recognition efficiency of pose recognition models in single-object scenarios.

Abstract

This application discloses a method, apparatus, device, and storage medium for generating sample datasets, belonging to the field of computer technology. It includes: dividing a video stream into video segments to obtain multiple candidate video segments, each candidate video segment containing at least one object; performing object detection and pose recognition on the multiple candidate video segments to obtain at least one target video segment, each target video segment containing one object whose pose conforms to a target pose condition; and generating a sample dataset based on the at least one target video segment and its annotations, where the annotations indicate whether the object in the corresponding target video segment is in the target pose, and the sample dataset is used to train a pose recognition model. This allows for the automatic generation of sample datasets for training pose recognition models, thereby reducing manual labor costs and improving the efficiency of sample dataset production.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, device and storage medium for generating sample datasets.

Background Technology

[0002] Currently, in industrial parks or factories, to enhance employee alertness and improve operational accuracy, certain specific actions are mandated to improve safety during work. For example, some factories require employees to perform finger-pointing and calling gestures to help them concentrate. However, performing these actions relies on the employee's voluntary will; employees with poor self-discipline may not perform them, thus requiring supervision of each employee's behavior during work.

[0003] In related technologies, the monitoring of employee behavior is achieved through automation. That is, a neural network model is used to identify the employee's actions during the work process, thereby determining whether the employee has performed the required actions.

[0004] However, the above methods generally involve manually processing the datasets used to train neural network models, which increases labor costs and reduces the efficiency of dataset creation.

Summary of the Invention

[0005] This application provides a method, apparatus, device, and storage medium for generating sample datasets, which can automatically generate sample datasets, reduce labor costs, and improve the efficiency of dataset production. The technical solution is as follows:

[0006] In a first aspect, a method for generating a sample dataset is provided, the method comprising:

[0007] The video stream is divided into video segments to obtain multiple candidate video segments, wherein each candidate video segment is a video segment in the video stream that contains at least one object;

[0008] Target detection and pose recognition are performed on the plurality of candidate video segments to obtain at least one target video segment among the plurality of candidate video segments. The target video segment is a candidate video segment containing one object, and the pose of the object in the target video segment meets the target pose condition.

[0009] A sample dataset is generated based on at least one target video segment and at least one annotation of the target video segment. The annotation is used to indicate whether an object in the corresponding target video segment is in a target pose. The sample dataset is used to train a pose recognition model.

[0010] In this application, the video stream is segmented to obtain multiple candidate video segments, specifically multiple candidate video segments containing at least one object. Then, object detection and pose recognition are performed on these candidate video segments to obtain at least one target video segment. This means selecting a candidate video segment containing only one object from the multiple candidate video segments, where the pose of this object conforms to the target pose condition. Next, a sample dataset is generated based on the at least one target video segment and its annotations. The annotations indicate whether the object in the corresponding target video segment is in the target pose. The generated sample dataset can be used to train a pose recognition model. This allows for the automatic generation of a sample dataset for training the pose recognition model, reducing manual costs and improving the efficiency of sample dataset creation. Furthermore, this application automatically obtains target video segments containing only one object, so the sample dataset is suitable for training pose recognition models in single-object scenarios, thus improving the recognition efficiency of pose recognition models in single-object scenarios.

[0011] Optionally, the video stream is divided into video segments to obtain multiple candidate video segments, including any one of the following:

[0012] For any object in the video stream, target detection is performed on the video stream to obtain multiple target video frames, all of which contain the object; the multiple target video frames are aggregated to obtain a candidate video segment corresponding to the object.

[0013] Alternatively, for any object in the video stream, target detection is performed on the video stream to obtain a reference video frame, which contains the object; target tracking is then performed in the video stream starting from the reference video frame to obtain a candidate video segment corresponding to the object.

[0014] Optionally, before performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments, the method further includes:

[0015] For any one of the multiple candidate video segments, if the candidate video segment does not meet the preset conditions, the candidate video segment is deleted to obtain multiple reference video segments. The preset conditions are that the number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, where the first preset number of frames is less than the second preset number of frames.
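The frame-count condition in [0015] can be illustrated with a minimal Python sketch. It assumes a candidate video segment is represented by the indices of its frames. The Segment class, the function name, and the bounds 10 and 300 are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical container: a candidate segment and the indices of its frames."""
    segment_id: str
    frame_indices: List[int]


def filter_by_frame_count(
    candidates: List[Segment], min_frames: int, max_frames: int
) -> List[Segment]:
    """Keep segments whose frame count lies in [min_frames, max_frames]."""
    if min_frames >= max_frames:
        # The text requires the first preset number to be less than the second.
        raise ValueError("min_frames must be less than max_frames")
    return [
        seg for seg in candidates
        if min_frames <= len(seg.frame_indices) <= max_frames
    ]


# Illustrative data: one segment too short, one acceptable, one too long.
candidates = [
    Segment("a", list(range(5))),
    Segment("b", list(range(30))),
    Segment("c", list(range(500))),
]
reference_segments = filter_by_frame_count(candidates, min_frames=10, max_frames=300)
```

Both bounds are inclusive, matching "greater than or equal to" and "less than or equal to" in the preset condition.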

[0016] Optionally, the step of performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments includes:

[0017] Target detection and pose recognition are performed on the plurality of reference video segments to obtain at least one target video segment among the plurality of reference video segments.

[0018] Optionally, the step of performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments includes:

[0019] For any one of the multiple candidate video segments, a candidate bounding box is slid across the candidate video segment.

[0020] During the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions of the candidate video segment;

[0021] Based on the region classification results of the multiple regions, at least one detection box is determined for the candidate video segment, and the at least one detection box surrounds at least one object in the candidate video segment;

[0022] Based on at least one detection box among the plurality of candidate video segments, a plurality of single-object video segments are obtained by filtering from the plurality of candidate video segments, wherein the single-object video segment is a candidate video segment containing one object;

[0023] Pose recognition is performed on the plurality of single-object video segments to obtain at least one target video segment.

[0024] Optionally, the step of filtering multiple single-object video segments from the multiple candidate video segments based on at least one detection box among the multiple candidate video segments includes:

[0025] For any one of the multiple candidate video segments, if multiple detection boxes are detected in the candidate video segment, the candidate video segment is deleted.

[0026] If exactly one detection box is detected in the candidate video segment, the candidate video segment is determined as the single-object video segment.
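The filtering rule in [0025]-[0026] can be sketched as follows. This sketch assumes the detection boxes are available per frame and treats a segment as single-object when no frame has more than one box and at least one frame has a box. That per-frame interpretation, the box layout, and all names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def is_single_object(boxes_per_frame: List[List[Box]]) -> bool:
    """True if no frame has more than one box and at least one frame has a box."""
    counts = [len(boxes) for boxes in boxes_per_frame]
    return bool(counts) and max(counts) == 1


def keep_single_object_segments(
    detections: Dict[str, List[List[Box]]]
) -> List[str]:
    """Return the IDs of segments that pass the single-object check."""
    return [seg_id for seg_id, boxes in detections.items() if is_single_object(boxes)]


# Illustrative detections for three candidate segments.
detections = {
    "seg_a": [[(0, 0, 10, 20)], [(1, 0, 10, 20)]],                    # one box per frame
    "seg_b": [[(0, 0, 10, 20), (50, 0, 10, 20)], [(0, 0, 10, 20)]],   # two boxes in a frame
    "seg_c": [[], []],                                                # nothing detected
}
single_object_ids = keep_single_object_segments(detections)
```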

[0027] Optionally, the step of performing pose recognition on the plurality of single-object video segments to obtain at least one target video segment includes:

[0028] For any one of the multiple single-object video segments, key points are extracted from the object in the single-object video segment to obtain multiple key points of the object in the single-object video segment.

[0029] Determine whether the multiple key points meet the target pose condition;

[0030] If the multiple key points meet the target pose condition, the single-object video segment is determined as the target video segment.
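A key-point check of the kind described in [0028]-[0030] might look like the following sketch. This passage does not define the target pose condition, so the sketch uses a purely hypothetical one: the arm counts as extended when the angle at the elbow is at least 150 degrees. The key-point names, the threshold, and the functions are all assumptions.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle at b, in degrees, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two key points coincide
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def meets_target_pose(keypoints: Dict[str, Point], min_elbow_angle: float = 150.0) -> bool:
    """Hypothetical condition: the arm is (nearly) straight at the elbow."""
    required = ("shoulder", "elbow", "wrist")
    if any(name not in keypoints for name in required):
        return False
    angle = joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
    return angle >= min_elbow_angle


straight_arm = {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (2.0, 0.0)}
bent_arm = {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (1.0, 1.0)}
```

A real condition would likely combine several key points and several frames; the single-angle test is only meant to show the shape of the decision.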

[0031] Optionally, the method further includes:

[0032] Based on the sample dataset, determine multiple key points of an object in at least one of the target video segments;

[0033] The pose recognition model is trained based on multiple key points of an object in at least one of the target video segments.

[0034] Optionally, training the pose recognition model based on multiple key points of an object in at least one of the target video segments includes:

[0035] For any one of the target video segments, input multiple key points of the object in the target video segment into the pose recognition model, perform pose recognition on the multiple key points of the object in the target video segment through the pose recognition model, and output the predicted recognition result;

[0036] Based on the difference information between the predicted recognition result and the annotation corresponding to the target video segment, the model parameters of the pose recognition model are adjusted.
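The training step in [0035]-[0036] can be illustrated with a deliberately small stand-in model. The sketch below assumes the key points are flattened into a feature vector and uses logistic regression with a plain gradient step. This application does not specify the model architecture or loss, so the model, the learning rate, and the toy samples are assumptions.

```python
import math
from typing import List, Tuple


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Predicted probability that the object is in the target pose."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_step(
    weights: List[float], bias: float, features: List[float], label: int, lr: float = 0.1
) -> Tuple[List[float], float]:
    """One gradient step on binary cross-entropy; returns updated (weights, bias)."""
    # Difference between the predicted result and the annotation.
    error = predict(weights, bias, features) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Toy samples: flattened (x, y) key points with annotation 1 (in pose) or 0 (not in pose).
samples = [
    ([0.0, 0.0, 1.0, 0.0, 2.0, 0.0], 1),
    ([0.0, 0.0, 1.0, 0.0, 1.0, 1.0], 0),
]
weights, bias = [0.0] * 6, 0.0
for _ in range(200):
    for features, label in samples:
        weights, bias = train_step(weights, bias, features, label)
```

The update direction is driven entirely by the difference between the prediction and the annotation, which is the adjustment described in [0036].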

[0037] In a second aspect, a sample dataset generation apparatus is provided, the apparatus comprising:

[0038] The first processing module is used to divide the video stream into video segments to obtain multiple candidate video segments, wherein the multiple candidate video segments are video segments in the video stream that contain at least one object.

[0039] The second processing module is used to perform target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment among the plurality of candidate video segments. The target video segment is a candidate video segment containing one object, and the pose of the object in the target video segment meets the target pose condition.

[0040] A generation module is used to generate a sample dataset based on at least one target video segment and at least one annotation of the target video segment, wherein the annotation is used to indicate whether an object in the corresponding target video segment is in a target pose, and the sample dataset is used to train a pose recognition model.

[0041] Optionally, the first processing module is used to:

[0042] For any object in the video stream, target detection is performed on the video stream to obtain multiple target video frames, all of which contain the object; the multiple target video frames are aggregated to obtain a candidate video segment corresponding to the object.

[0043] Alternatively, for any object in the video stream, target detection is performed on the video stream to obtain a reference video frame, which contains the object; target tracking is then performed in the video stream starting from the reference video frame to obtain a candidate video segment corresponding to the object.

[0044] Optionally, the device further includes:

[0045] The first filtering module is used to delete any candidate video segment among the plurality of candidate video segments if the candidate video segment does not meet a preset condition, so as to obtain a plurality of reference video segments. The preset condition is that the number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, wherein the first preset number of frames is less than the second preset number of frames.

[0046] Optionally, the second processing module is used for:

[0047] Target detection and pose recognition are performed on the plurality of reference video segments to obtain at least one target video segment among the plurality of reference video segments.

[0048] Optionally, the second processing module is used for:

[0049] For any one of the multiple candidate video segments, a candidate bounding box is slid across the candidate video segment.

[0050] During the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions of the candidate video segment;

[0051] Based on the region classification results of the multiple regions, at least one detection box is determined for the candidate video segment, and the at least one detection box surrounds at least one object in the candidate video segment;

[0052] Based on at least one detection box among the plurality of candidate video segments, a plurality of single-object video segments are obtained by filtering from the plurality of candidate video segments, wherein the single-object video segment is a candidate video segment containing one object;

[0053] Pose recognition is performed on the plurality of single-object video segments to obtain at least one target video segment.

[0054] Optionally, the second processing module is used for:

[0055] For any one of the multiple candidate video segments, if multiple detection boxes are detected in the candidate video segment, the candidate video segment is deleted.

[0056] If exactly one detection box is detected in the candidate video segment, the candidate video segment is determined as the single-object video segment.

[0057] Optionally, the second processing module is used for:

[0058] For any one of the multiple single-object video segments, key points are extracted from the object in the single-object video segment to obtain multiple key points of the object in the single-object video segment.

[0059] Determine whether the multiple key points meet the target pose condition;

[0060] If the multiple key points meet the target pose condition, the single-object video segment is determined as the target video segment.

[0061] Optionally, the device further includes:

[0062] The determination module is used to determine, based on the sample dataset, multiple key points of an object in at least one of the target video segments;

[0063] The training module is used to train the pose recognition model based on multiple key points of an object in at least one of the target video segments.

[0064] Optionally, the training module is used for:

[0065] For any one of the target video segments, input multiple key points of the object in the target video segment into the pose recognition model, perform pose recognition on the multiple key points of the object in the target video segment through the pose recognition model, and output the predicted recognition result;

[0066] Based on the difference information between the predicted recognition result and the annotation corresponding to the target video segment, the model parameters of the pose recognition model are adjusted.

[0067] In a third aspect, a computer device is provided, the computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above-described sample dataset generation method.

[0068] In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, which, when executed by a processor, implements the above-described sample dataset generation method.

[0069] In a fifth aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to perform the steps of the sample dataset generation method described above.

[0070] It is understood that the beneficial effects of the second, third, fourth, and fifth aspects mentioned above can be found in the relevant descriptions in the first aspect above, and will not be repeated here.

Attached Figure Description

[0071] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0072] Figure 1 is a flowchart of a sample dataset generation method provided in an embodiment of this application;

[0073] Figure 2 is a flowchart of another sample dataset generation method provided in the embodiments of this application;

[0074] Figure 3 is a schematic diagram of the structure of a sample dataset generation device provided in an embodiment of this application;

[0075] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0076] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0077] It should be understood that "multiple" as mentioned in this application refers to two or more. In the description of this application, unless otherwise stated, " / " indicates "or," for example, A / B can mean A or B; "and / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist, for example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Furthermore, to facilitate a clear description of the technical solutions of this application, the terms "first," "second," etc., are used to distinguish identical or similar items with essentially the same function and effect. Those skilled in the art will understand that the terms "first," "second," etc., do not limit the quantity or execution order, and that "first," "second," etc., do not necessarily imply differences.

[0078] Before providing a detailed explanation of the embodiments of this application, the application scenarios of these embodiments will be described first.

[0079] Currently, to enhance employee awareness and safety during work, certain specific actions are mandated. However, employees with poor self-discipline may not perform these actions, necessitating supervision of their behavior during work processes.

[0080] Generally, employee behavior can be actively monitored during work processes, meaning employees are manually judged to determine whether they are performing unusual actions. However, this method requires a significant amount of manpower, increasing unnecessary labor costs. Furthermore, human supervision may result in omissions or misjudgments, thus reducing the efficiency of monitoring employee behavior.

[0081] With the development of technology, the monitoring of employee behavior can also be achieved through automation. This is generally done by using neural network models to identify the employee's actions during the work process, thereby determining whether the employee has performed any unusual actions.

[0082] Furthermore, the dataset used to train the aforementioned neural network model is typically obtained from historical video information within the factory. Specifically, historical video information of the work area, captured by cameras within the factory, is acquired first. Since this historical video information contains redundant content, such as footage in which no person is present, post-processing is necessary. Generally, this is done manually, retaining only the parts of the historical video information in which employees are present, thus obtaining the dataset for training the neural network model. However, obtaining the dataset through manual processing is labor-intensive, which increases labor costs and keeps dataset creation efficiency relatively low.

[0083] To address this, this application provides a method for generating a sample dataset, applicable to scenarios involving the creation of sample datasets for training pose recognition models. Specifically, the video stream data is first divided into video segments, resulting in multiple video segments containing at least one object. These video segments are then filtered to obtain at least one video segment containing only one object, where the object's pose satisfies specific conditions. Finally, a sample dataset is generated based on the obtained at least one video segment containing only one object and its corresponding annotations. This sample dataset can be used to train a pose recognition model. This method automatically generates sample datasets for training pose recognition models, reducing manual labor costs and improving dataset creation efficiency. Furthermore, since this application embodiment automatically obtains video segments containing only one object, the sample dataset is suitable for training pose recognition models in single-object scenarios, thus improving the recognition efficiency of pose recognition models in single-object scenarios.
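The overall flow described in [0083] can be summarized in a short sketch. The three callables stand in for object detection, pose recognition, and annotation. How each of them works is not fixed by this paragraph, so they are passed in as hypothetical parameters, and the toy segments are illustrative only.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # whatever type represents a video segment


def build_sample_dataset(
    candidate_segments: List[S],
    count_objects: Callable[[S], int],
    meets_pose_condition: Callable[[S], bool],
    annotate: Callable[[S], bool],
) -> List[Tuple[S, bool]]:
    """Filter candidates down to target segments and pair each with its annotation."""
    dataset = []
    for segment in candidate_segments:
        if count_objects(segment) != 1:
            continue  # keep only segments containing exactly one object
        if not meets_pose_condition(segment):
            continue  # keep only segments whose pose satisfies the condition
        dataset.append((segment, annotate(segment)))
    return dataset


# Toy segments: plain dicts whose fields stand in for real detection results.
toy_segments = [
    {"name": "s1", "objects": 1, "pose_ok": True, "in_target_pose": True},
    {"name": "s2", "objects": 2, "pose_ok": True, "in_target_pose": True},
    {"name": "s3", "objects": 1, "pose_ok": False, "in_target_pose": False},
]
dataset = build_sample_dataset(
    toy_segments,
    count_objects=lambda s: s["objects"],
    meets_pose_condition=lambda s: s["pose_ok"],
    annotate=lambda s: s["in_target_pose"],
)
```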

[0084] The sample dataset generation method provided in the embodiments of this application will be explained in detail below.

[0085] Figure 1 is a flowchart illustrating a method for generating a sample dataset according to an embodiment of this application. This method can be applied to a computer device. Referring to Figure 1, the method includes the following steps.

[0086] Step 101: The computer device divides the video stream into video segments to obtain multiple candidate video segments.

[0087] The video stream is video data that includes multiple objects. For example, when generating a pointing-and-calling dataset, the video stream could be video data of a work area captured by a camera in a factory, containing multiple workers in the work area.

[0088] The multiple candidate video segments are video segments in the video stream that contain at least one object, and each candidate video segment is the video segment corresponding to an object in the video stream. That is, each candidate video segment is a video fragment obtained from the video stream, and this video fragment contains at least one object.

[0089] Specifically, the operation in step 101 can be implemented in the following two possible ways.

[0090] The first possible approach may include the following steps (1)-(2).

[0091] (1) For any object in the video stream, the computer device performs target detection on the video stream to obtain multiple target video frames.

[0092] Object detection is used to detect multiple objects in a video stream. Specifically, it involves performing object detection on multiple video frames of the video stream to detect objects in each of those frames. As an example, after a computer device performs object detection on a video frame in the video stream, that video frame is tagged with the object's ID (identifier), which is used to uniquely identify the object.

[0093] Each of the multiple target video frames contains this object. As an example, after a computer device performs object detection on multiple video frames of the video stream, it can identify the object in each of the multiple video frames. Subsequently, the video frames containing this object are retained, thus obtaining the multiple target video frames corresponding to this object.

[0094] In this case, the computer device can obtain all video frames in the video stream that contain the object, that is, obtain multiple target video frames, and these multiple target video frames are the video frames corresponding to the object. Subsequently, based on these multiple target video frames, the candidate video segments corresponding to the object can be obtained.

[0095] Optionally, step (1) can be performed as follows: the computer device slides a candidate box on the video stream; during the sliding of the candidate box, the computer device classifies the area covered by the candidate box to obtain the region classification results of multiple regions of the video stream; based on the region classification results of multiple regions of the video stream, the computer device determines multiple target video frames.

[0096] The region classification result is used to indicate whether the region covered by the candidate box contains the object, and the region classification result is the probability value that a region contains the object.

[0097] Specifically, the computer device slides a candidate box across multiple video frames of the video stream; during the sliding of the candidate box, the computer device classifies the area covered by the candidate box to obtain the region classification results of multiple regions on multiple video frames of the video stream; based on the region classification results of multiple regions on multiple video frames of the video stream, the computer device determines the multiple target video frames.

[0098] As an example, for any one of the multiple video frames in the video stream, a candidate bounding box is slid across the frame, and the region covered by the box is classified to determine the probability that the object is present within that region. This determines the region classification result for one area within the frame. Then, based on the region classification results of the multiple regions covered by the candidate bounding box during its sliding motion, it can be determined whether the object exists in that frame. By sliding the candidate bounding box across each frame, the region classification results of the multiple regions covered by the box can be obtained for each frame. Subsequently, based on these region classification results, it can be determined which of the multiple video frames contain the object. If the object exists in one video frame, that frame is retained, thus obtaining the target video frame. Therefore, if the object exists in some of the multiple video frames, those frames are retained, resulting in multiple target video frames corresponding to the object.
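The per-frame procedure in [0098] can be sketched with a toy frame and a toy region classifier. The frame is a small grid of 0/1 values, and the score of a region is simply the fraction of its cells equal to 1. A real system would use a trained classifier, so toy_score, the threshold 0.5, and the function names are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Frame = List[List[int]]  # toy frame: rows of 0/1 cells


def slide_and_classify(
    frame: Frame, box_h: int, box_w: int, stride: int,
    score_region: Callable[[Frame], float],
) -> List[Tuple[Tuple[int, int], float]]:
    """Slide a box over the frame and return ((top, left), score) per position."""
    height, width = len(frame), len(frame[0])
    results = []
    for top in range(0, height - box_h + 1, stride):
        for left in range(0, width - box_w + 1, stride):
            region = [row[left:left + box_w] for row in frame[top:top + box_h]]
            results.append(((top, left), score_region(region)))
    return results


def frame_contains_object(results, threshold: float = 0.5) -> bool:
    """The frame is retained if any covered region scores at or above the threshold."""
    return any(score >= threshold for _, score in results)


def toy_score(region: Frame) -> float:
    """Stand-in classifier: fraction of cells in the region that are 1."""
    cells = [value for row in region for value in row]
    return sum(cells) / len(cells)


frame_with_object = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
empty_frame = [[0] * 4 for _ in range(4)]
results = slide_and_classify(frame_with_object, box_h=2, box_w=2, stride=1, score_region=toy_score)
```

Frames for which frame_contains_object returns True would be the ones retained as target video frames.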

[0099] Optionally, the step size for sliding the candidate box can be preset, and this step size can be set to be relatively small.

[0100] In this way, when the candidate box slides on each video frame of the video stream according to the step size, every position on each video frame can be covered by the candidate box. Then, the area covered by the candidate box is classified to determine whether the object is contained in each position covered by the candidate box, thus more accurately determining whether the object is contained in the video stream.

[0101] Optionally, the computer device can generate multiple candidate bounding boxes, which can have different sizes. Then, for each of these candidate bounding boxes, the box can be slid across each video frame of the video stream to obtain a region classification result for the multiple areas covered by the candidate bounding box in each video frame. By applying the above operation to each of the multiple candidate bounding boxes, a region classification result for the multiple areas covered by the candidate bounding boxes in each video frame of the video stream can be obtained. Then, based on the region classification results of the multiple areas covered by the candidate bounding boxes in each video frame, the multiple target video frames can be determined.

[0102] In this case, since the multiple candidate boxes can cover areas of different sizes, and the computer device does not know the size of the object in the video stream, by sliding candidate boxes of different sizes on each video frame of the video stream, the object in each video frame can be covered as much as possible. In this way, it can be determined more accurately whether the video stream contains the object.
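The use of candidate boxes of several sizes, as described in [0101]-[0102], can be sketched in the same toy setting. For each box size, the sketch records the best region score found in a frame. The scoring function (fraction of cells equal to 1) and the chosen sizes are hypothetical and only show how the result depends on the box size.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]  # toy frame: rows of 0/1 cells


def toy_score(region: Frame) -> float:
    """Stand-in classifier: fraction of cells in the region that are 1."""
    cells = [value for row in region for value in row]
    return sum(cells) / len(cells)


def best_score_per_box_size(
    frame: Frame, box_sizes: List[Tuple[int, int]], stride: int = 1
) -> Dict[Tuple[int, int], float]:
    """For each (height, width) box size, return the highest region score in the frame."""
    height, width = len(frame), len(frame[0])
    best: Dict[Tuple[int, int], float] = {}
    for box_h, box_w in box_sizes:
        scores = [
            toy_score([row[left:left + box_w] for row in frame[top:top + box_h]])
            for top in range(0, height - box_h + 1, stride)
            for left in range(0, width - box_w + 1, stride)
        ]
        best[(box_h, box_w)] = max(scores) if scores else 0.0
    return best


# A 6x6 toy frame with a 2x2 object in the middle.
frame = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        frame[r][c] = 1

scores = best_score_per_box_size(frame, box_sizes=[(2, 2), (4, 4), (6, 6)])
```

In this toy case only the box whose size matches the object reaches a high score, which is why trying several sizes helps when the object size is unknown.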

[0103] Optionally, the computer device can also perform object detection on the video stream using an object detection algorithm. For example, the object detection algorithm can be the YOLO (You Only Look Once) algorithm, the Fast R-CNN (Fast Region-based Convolutional Network) algorithm, etc. The embodiments of this application do not limit the object detection algorithm.

[0104] (2) The computer device aggregates the multiple target video frames to obtain the candidate video segment corresponding to the object.

[0105] Since each of the multiple target video frames contains this object, aggregating these target video frames yields a video segment of the video stream that contains this object, that is, the candidate video segment corresponding to this object. Furthermore, because the computer device retains every video frame that contains this object, a retained video frame may contain other objects in addition to this object. Therefore, the multiple target video frames corresponding to this object may contain other objects, and the candidate video segment obtained by aggregating them may also contain other objects; that is, the candidate video segment contains at least one object.
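The aggregation in step (2) can be sketched as grouping frame indices by object ID. The sketch assumes object detection has already produced, for each frame, the set of IDs of the objects it contains. The function name and the IDs "w1" and "w2" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def aggregate_frames_by_object(ids_per_frame: List[Set[str]]) -> Dict[str, List[int]]:
    """Map each object ID to the ordered indices of the frames that contain it."""
    segments: Dict[str, List[int]] = defaultdict(list)
    for frame_index, object_ids in enumerate(ids_per_frame):
        for object_id in sorted(object_ids):
            segments[object_id].append(frame_index)
    return dict(segments)


# Per-frame detection results for a five-frame stream with two workers.
ids_per_frame = [{"w1"}, {"w1", "w2"}, {"w2"}, set(), {"w1"}]
candidate_segments = aggregate_frames_by_object(ids_per_frame)
# candidate_segments == {"w1": [0, 1, 4], "w2": [1, 2]}
```

Frame 1 belongs to both candidate segments, which shows how a segment built for one object can still contain another object.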

[0106] In addition, at least two objects in the video stream may exhibit identical motion, so that the computer device obtains the same multiple target video frames for each of these objects, and these target video frames contain all of these objects. The candidate video segments obtained by aggregating these target video frames are therefore also identical, so the at least two objects can correspond to the same candidate video segment, which contains all of these objects.

[0107] It is worth noting that for each of the multiple objects in the video stream, the computer device can obtain the candidate video segment corresponding to each of the multiple objects through the above steps (1)-(2).

[0108] In this case, the candidate video segments corresponding to each object can respectively reflect the action trajectory of each object in the video stream, so that a sample dataset with rich action can be generated based on the candidate video segments corresponding to each object among multiple objects in the video stream.

[0109] The second possible approach may include the following steps (1)-(2).

[0110] (1) For any object in the video stream, the computer device performs object detection on the video stream to obtain a reference video frame.

[0111] The reference video frame contains the object. Optionally, the reference video frame can be the first video frame in a video stream that contains the object. As an example, the computer device performs object detection on the multiple video frames sequentially, starting from the first video frame of the video stream, until the object is detected for the first time. The video frame in which the object is first detected is then the reference video frame. For example, if the computer device first performs object detection on the first video frame of the video stream and finds that the object is not present in the first video frame, then the computer device performs object detection on the second video frame and finds that the object is present in the second video frame. In this case, the second video frame in the video stream is the reference video frame.

[0112] In this scenario, the computer device can identify the first video frame in the video stream where the object appears. Therefore, it can subsequently retrieve candidate video segments corresponding to this object from the video stream, starting from the first video frame in which the object appears.
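The sequential search for the reference video frame can be sketched as below. This is a minimal sketch, assuming a per-frame detector is available as a callable; `find_reference_frame` and `contains_object` are hypothetical names, and the frames are stand-ins (sets of object names) rather than real image data.

```python
from typing import Callable, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def find_reference_frame(
    frames: Sequence[Frame],
    contains_object: Callable[[Frame], bool],
) -> Optional[int]:
    """Scan frames in order and return the index of the first frame containing the object."""
    for index, frame in enumerate(frames):
        if contains_object(frame):
            return index
    return None  # the object never appears in the stream


# Stand-in frames: each frame is represented by the set of objects visible in it.
frames = [set(), {"worker"}, {"worker"}]
print(find_reference_frame(frames, lambda frame: "worker" in frame))  # 1
```

The result is a zero-based index, so `1` corresponds to the second video frame, as in the example above.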

[0113] Optionally, step (1) can be performed as follows: the computer device slides a candidate box sequentially across multiple video frames of the video stream; during the process of the candidate box sliding sequentially across multiple video frames, the computer device classifies the area covered by the candidate box to obtain the region classification results of multiple regions on a video frame; based on the region classification results of multiple regions on this video frame, the computer device determines whether the object exists on this video frame; if the object exists on this video frame, this video frame is determined as the reference video frame.

[0114] As an example, the computer device first slides a candidate bounding box across the first video frame of the video stream. While the candidate bounding box is sliding across the first video frame, the computer device classifies the area covered by the candidate bounding box, obtaining region classification results for multiple regions on the first video frame. Then, based on the region classification results of the multiple regions on the first video frame, the computer device determines that the object does not exist in this video frame, thus determining that the first video frame is not a reference video frame. Next, the computer device can slide the candidate bounding box across the second video frame. While the candidate bounding box is sliding across the second video frame, the computer device classifies the area covered by the candidate bounding box, obtaining region classification results for multiple regions on the second video frame. Then, based on the region classification results of the multiple regions on the second video frame, the computer device determines that the object exists in the second video frame, thus determining that the second video frame is a reference video frame.

[0115] (2) In the video stream, target tracking is performed starting from the reference video frame to obtain the candidate video segment corresponding to the object.

[0116] Object tracking is used to track objects in the video stream. For any given object, object tracking can reveal its entire motion from appearance to disappearance within the video stream. Furthermore, consider a scenario where at least two objects in the video stream exhibit identical motion. In this case, the at least two objects can correspond to the same candidate video segment, and that candidate video segment contains the at least two objects.

[0117] The reference video frame obtained through object detection is the video frame in which the object first appears in the video stream. In this case, by tracking the object starting from the reference video frame, the entire process from the appearance to the disappearance of the object can be obtained, which means obtaining the candidate video segment corresponding to the object.

[0118] It is worth noting that by performing the above steps (1)-(2) on each of the multiple objects in the video stream, the entire process from the appearance to the disappearance of each object can be obtained, that is, multiple candidate video segments corresponding to each of the multiple objects can be obtained.

[0119] Optionally, step (2) can be performed as follows: the computer device retains the reference video frame in the video stream and extracts the first feature of the reference video frame; for any one of the multiple video frames after the reference video frame in the video stream, the computer device extracts multiple second features of the video frame; if there is a second feature among the multiple second features whose similarity with the first feature meets the similarity threshold, the computer device retains the video frame; if the similarity between each of the multiple second features and the first feature does not meet the similarity threshold, the computer device deletes the video frame and all video frames after it; finally, the computer device aggregates all retained video frames of the video stream to obtain the candidate video segment corresponding to the object.

[0120] The first feature is the characteristic of this object in the reference video frame. The multiple second features are the characteristics of all objects in a video frame.

[0121] The similarity threshold can be preset, and it can be set relatively high. In this case, if the similarity between a second feature and the first feature meets the similarity threshold, the second feature is close to the first feature, which indicates that the object to which the second feature belongs is this object. Therefore, this object exists in this video frame, and this video frame in the video stream can be retained. If the similarity between each of the multiple second features and the first feature does not meet the similarity threshold, none of the second features is close enough to the first feature, which indicates that none of the objects to which the second features belong is this object. This object therefore may not exist in this video frame, nor in any subsequent video frame, so this video frame and all subsequent video frames in the video stream can be deleted.

[0122] In this way, we can determine which video frames contain the object from its appearance to its disappearance in the video stream. The video segment composed of these video frames is the candidate video segment corresponding to this object. Then, by performing the above operation on each of these multiple objects, we can obtain the candidate video segment corresponding to each object, that is, obtain the multiple candidate video segments.
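The similarity-based tracking described above can be sketched as follows. The embodiment does not name a similarity measure, so cosine similarity is used here purely as an assumption; the function names, the feature vectors, and the threshold value of 0.9 are likewise hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def track_segment(
    reference_index: int,
    first_feature: Vector,
    frame_features: List[List[Vector]],
    threshold: float = 0.9,  # assumed value; the embodiment only says "relatively high"
) -> List[int]:
    """Keep frames after the reference frame until no second feature matches the first feature."""
    kept = [reference_index]
    for index in range(reference_index + 1, len(frame_features)):
        second_features = frame_features[index]
        if any(cosine_similarity(first_feature, f) >= threshold for f in second_features):
            kept.append(index)
        else:
            break  # this frame and every later frame are dropped
    return kept


# One list of second features per frame; frame 1 is the reference frame.
frame_features = [[], [[1.0, 0.0]], [[0.99, 0.1], [0.0, 1.0]], [[0.0, 1.0]], [[1.0, 0.0]]]
print(track_segment(1, [1.0, 0.0], frame_features))  # [1, 2]
```

Frame 4 matches the first feature again but is still excluded, because the procedure deletes all frames after the first frame in which the object is not found.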

[0123] Optionally, the computer device can also perform target tracking on the video stream using a target tracking algorithm. For example, the target tracking algorithm can be the Deep SORT (Deep Simple Online and Realtime Tracking) algorithm, the StrongSORT algorithm (an enhanced version of Deep SORT), etc. The embodiments of this application do not limit the target tracking algorithm.

[0124] Step 102: The computer device performs object detection and pose recognition on the multiple candidate video segments to obtain at least one target video segment among the multiple candidate video segments. A target video segment is a candidate video segment that contains one object whose pose meets the target pose condition.

[0125] Pose recognition is used to identify the pose of objects in multiple candidate video segments in order to determine whether the pose of the object meets the target pose conditions.

[0126] The target pose condition corresponds to the target pose. The condition can be preset based on the target pose. For example, if the target pose is a finger-to-call gesture, the target pose condition can be a condition on the angle of the object's arm in the candidate video segment. As another example, if the target pose is a high knee raise in sports training, the target pose condition can be a condition on the angle of the object's leg in the candidate video segment.

[0127] In this scenario, the computer device can select at least one target video segment from the multiple candidate video segments. Specifically, it can select at least one video segment containing a single object whose pose satisfies the target pose condition. Thus, the computer device obtains data that can be used to train the pose recognition model.

[0128] Optionally, before performing object detection and pose recognition on the multiple candidate video segments to obtain at least one target video segment, the computer device may also filter the multiple candidate video segments to retain the candidate video segments with valid information.

[0129] Specifically, for any one of the multiple candidate video segments, if the candidate video segment does not meet the preset condition, it is deleted; the candidate video segments that remain are the multiple reference video segments.

[0130] The preset condition is that the number of frames in a candidate video segment is greater than or equal to the first preset number of frames and less than or equal to the second preset number of frames. The first and second preset number of frames can be preset by technicians according to actual needs, and the first preset number of frames is less than the second preset number of frames.

[0131] As an example, the first preset frame count can be set to 5 frames, and the second preset frame count can be set to 50 frames. The preset condition is then that the number of frames in a candidate video segment is greater than or equal to 5 frames and less than or equal to 50 frames. Subsequently, video segments with fewer than 5 frames or more than 50 frames can be deleted; at a frame rate of 5 frames per second, this means that video segments shorter than 1 second or longer than 10 seconds are deleted from the multiple candidate video segments. In this way, video segments with insufficient information or too much redundant information can be eliminated from the multiple candidate video segments, thus retaining candidate video segments with valid information.

[0132] In this scenario, if a candidate video segment does not meet the preset conditions, it indicates that the segment lacks sufficient information or contains too much redundant information, and thus it can be deleted. If a candidate video segment meets the preset conditions, it indicates that the segment contains valid information, and thus it can be retained. By performing the above filtering operation on multiple candidate video segments, multiple candidate video segments with valid information can be obtained, which are the multiple reference video segments.
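The frame-count filter can be sketched in a few lines. This is a minimal sketch in which each candidate video segment is represented simply as a list of frames; the function name `filter_segments` is hypothetical, and the defaults of 5 and 50 frames are the example values given above.

```python
from typing import List, Sequence


def filter_segments(
    candidate_segments: List[Sequence],
    min_frames: int = 5,   # first preset number of frames (example value)
    max_frames: int = 50,  # second preset number of frames (example value)
) -> List[Sequence]:
    """Keep only segments whose frame count lies in [min_frames, max_frames]."""
    if min_frames >= max_frames:
        raise ValueError("the first preset number must be less than the second")
    return [seg for seg in candidate_segments if min_frames <= len(seg) <= max_frames]


# Four stand-in segments with 4, 5, 50, and 51 frames.
candidates = [list(range(n)) for n in (4, 5, 50, 51)]
reference_segments = filter_segments(candidates)
print([len(seg) for seg in reference_segments])  # [5, 50]
```

Both bounds are inclusive, matching the "greater than or equal to" and "less than or equal to" wording of the preset condition.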

[0133] In this case, step 102 can be performed as follows: the computer device performs object detection and pose recognition on the multiple reference video segments to obtain at least one target video segment among the multiple reference video segments.

[0134] Thus, by performing object detection and pose recognition on the multiple reference video segments obtained after filtering, the effective information in the reference video segments can be fully utilized, thereby more accurately obtaining at least one target video segment from the multiple reference video segments.

[0135] Optionally, the operation in step 102 can be implemented by the following steps (1)-(3).

[0136] (1) For any one of the multiple candidate video segments, the computer device slides the candidate box on the candidate video segment; during the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions of the candidate video segment; based on the region classification results of the multiple regions, at least one detection box of the candidate video segment is determined.

[0137] The region classification results of multiple regions in this candidate video segment are used to indicate whether an object exists in the multiple regions covered by the candidate box.

[0138] Specifically, step (1) can be performed as follows: for any one of the multiple candidate video segments, the computer device slides the candidate box across multiple video frames of the candidate video segment; during the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions on multiple video frames of the candidate video segment; based on the region classification results of multiple regions on multiple video frames, at least one detection box of the candidate video segment is determined.

[0139] As an example, for any one of the multiple video frames in this candidate video segment, a candidate bounding box is slid across the video frame, and the region covered by the candidate bounding box is classified to determine the probability that an object is contained within that region. This determines the region classification result for one region within that video frame. Then, based on the region classification results of the multiple regions covered by the candidate bounding box during its sliding across the video frame, it can be determined which regions in that video frame contain objects. By sliding the candidate bounding box across each video frame, the region classification results of the multiple regions covered by the candidate bounding box in each video frame can be obtained. Subsequently, based on the region classification results of the multiple regions covered by the candidate bounding box in each video frame, it can be determined which regions in each video frame contain objects. If an object is present in a region, the candidate bounding box covering that region is retained. Thus, when an object is present in a region covered by at least one candidate bounding box, at least one candidate bounding box can be retained, i.e., at least one detection box is obtained.

[0140] In this way, the computer device can exhaustively search for regions where objects may exist in each video frame of the candidate video segment, classify all regions covered by the candidate box, and then obtain at least one detection box for the candidate video segment based on the region classification results of all regions, thereby knowing which objects exist in the candidate video segment.

[0141] Optionally, the step size for sliding the candidate box can be preset, and this step size can be set to be relatively small.

[0142] In this way, when the candidate box slides on each video frame of the candidate video segment according to the step size, every position in each video frame of the candidate video segment can be fully covered by the candidate box. Then, the area covered by the candidate box is classified, so as to determine whether an object is contained in each position covered by the candidate box, thereby more accurately determining at least one detection box of the candidate video segment.

[0143] Optionally, the computer device can generate multiple candidate boxes, which can have different sizes. Then, for each of these candidate boxes, the candidate box can be slid across each video frame of the candidate video segment to obtain the region classification results for the multiple regions covered by the candidate box in each video frame of the candidate video segment. By performing the above operation on each candidate box, the region classification results for the multiple regions covered by the multiple candidate boxes in each video frame of the candidate video segment can be obtained. Then, based on the region classification results for the multiple regions covered by the multiple candidate boxes in each video frame of the candidate video segment, at least one detection box in the candidate video segment can be determined.

[0144] In this case, by sliding candidate boxes of different sizes across each video frame of the candidate video segment, the objects in each video frame of the candidate video segment can be covered as much as possible. In this way, at least one detection box of the candidate video segment can be determined more accurately, that is, at least one object contained in the candidate video segment can be determined more accurately.
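The sliding of candidate boxes of several sizes over one video frame can be sketched as follows. This is a sketch under stated assumptions: the region classifier is passed in as a callable that returns the probability that the covered region contains an object, and the names `slide_boxes`, `detect_boxes`, `covers_point`, the stride, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def slide_boxes(frame_w: int, frame_h: int,
                box_sizes: Sequence[Tuple[int, int]], stride: int) -> Iterator[Box]:
    """Yield every position of every candidate box size, row by row."""
    for w, h in box_sizes:
        for y in range(0, frame_h - h + 1, stride):
            for x in range(0, frame_w - w + 1, stride):
                yield (x, y, w, h)


def detect_boxes(frame_w: int, frame_h: int,
                 box_sizes: Sequence[Tuple[int, int]], stride: int,
                 classify: Callable[[Box], float],
                 threshold: float = 0.5) -> List[Box]:
    """Retain the candidate boxes whose covered region is classified as containing an object."""
    return [box for box in slide_boxes(frame_w, frame_h, box_sizes, stride)
            if classify(box) >= threshold]


def covers_point(box: Box) -> float:
    """Stand-in classifier: pretend an object sits at pixel (3, 3)."""
    x, y, w, h = box
    return 1.0 if x <= 3 < x + w and y <= 3 < y + h else 0.0


print(detect_boxes(4, 4, [(2, 2), (4, 4)], 2, covers_point))
# [(2, 2, 2, 2), (0, 0, 4, 4)]
```

In this toy run two boxes of different sizes are retained for the same object. A pipeline that counts detection boxes to decide whether a segment holds a single object would presumably need to merge such overlapping boxes first; the embodiment does not describe that step, so it is left out here.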

[0145] (2) The computer device selects multiple single-object video segments from the multiple candidate video segments based on at least one detection box among the multiple candidate video segments.

[0146] A single-object video segment is a candidate video segment that contains a single object.

[0147] In this case, the computer device can obtain the candidate video segments that contain a single object from the multiple candidate video segments. Thus, the multiple single-object video segments can be used to generate a training dataset for a pose recognition model in a single-object scene, which can then be used to handle pose recognition tasks in any single-object scene.

[0148] Optionally, when the computer device selects multiple single-object video segments from the multiple candidate video segments, it can also output the coordinate values of the objects in the multiple single-object video segments, so as to know the position of each object in the multiple single-object video segments.

[0149] Specifically, step (2) can be performed as follows: if the computer device detects multiple detection boxes in any one of the multiple candidate video segments, it deletes the candidate video segment; if it detects one detection box in the candidate video segment, it determines the candidate video segment as a single object video segment.

[0150] In this embodiment, the focus is on target pose recognition in single-object scenarios, so it is necessary to obtain multiple single-object video segments from the multiple candidate video segments.

[0151] In this scenario, if multiple detection boxes are detected in the candidate video segment, it indicates that the candidate video segment contains multiple objects, and the candidate video segment can be deleted. If only one detection box is detected in the candidate video segment, it indicates that the candidate video segment contains one object, which matches the single-object scenario, and the candidate video segment can be determined as a single-object video segment.
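The single-object selection rule can be sketched as below. This sketch assumes the detection boxes of each candidate video segment are already available in a dictionary keyed by a segment identifier; the names are hypothetical. The text only covers the cases of one box and of multiple boxes, so treating a segment with no detection box as "not selected" is an assumption made here.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def select_single_object_segments(detections: Dict[str, List[Box]]) -> List[str]:
    """Return the ids of segments with exactly one detection box.

    Segments with several boxes are dropped as multi-object segments.
    Segments with no box are also dropped (assumption, not stated in the text).
    """
    return [segment_id for segment_id, boxes in detections.items() if len(boxes) == 1]


detections = {
    "seg_1": [(10, 20, 50, 120)],
    "seg_2": [(10, 20, 50, 120), (200, 30, 55, 118)],
    "seg_3": [],
}
print(select_single_object_segments(detections))  # ['seg_1']
```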

[0152] (3) The computer device performs pose recognition on the multiple single-object video segments to obtain at least one target video segment.

[0153] Specifically, the operation of step (3) may include the following steps a-c.

[0154] a. For any one of the multiple single-object video segments, extract key points from the object in the single-object video segment to obtain multiple key points of the object in the single-object video segment.

[0155] This key point can be a skeletal key point of the object. If the object in the video segment is a human figure, this key point can be a human body key point. For example, these multiple key points can be: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.

[0156] Specifically, the computer device can input this single-object video segment into the key point extraction model, and use the key point extraction model to extract key points of the object in the single-object video segment, and output multiple key points of the object in the single-object video segment.

[0157] This key point extraction model is used to extract key points from an object.

[0158] It is worth noting that the keypoint extraction model needs to be trained before the computer device inputs this single-object video segment into the keypoint extraction model.

[0159] Specifically, the computer device can acquire multiple training samples and use these samples to train the neural network model to obtain the keypoint extraction model. For example, these multiple training samples can be the COCO dataset.

[0160] These multiple training samples can be pre-set. Each training sample includes sample data and sample labels. The sample data is video information containing sample objects, and the sample labels are multiple key points of the sample objects in the sample data. That is, the input data of each training sample is sample data containing sample objects, and the sample labels are multiple key points of the sample objects in the sample data.

[0161] This neural network model can include multiple network layers, including an input layer, multiple hidden layers, and an output layer. The input layer is responsible for receiving input data; the output layer is responsible for outputting processed data; the multiple hidden layers are located between the input and output layers and are responsible for processing the data. These hidden layers are not visible to the outside world. Optionally, this neural network model can be a deep neural network, and specifically a convolutional neural network, such as an HRNet (High-Resolution Network) model.

[0162] In this process, when a computer device trains a neural network model using multiple training samples, for each training sample, the input data from that training sample is input into the neural network model to obtain output data. A loss function is then used to determine the loss value between the output data and the sample labels in that training sample. The parameters in the neural network model are then adjusted based on this loss value. After adjusting the parameters of the neural network model based on each of the multiple training samples, the neural network model with adjusted parameters becomes the keypoint extraction model.

[0163] The operation of adjusting the parameters in the neural network model based on the loss value by the computer device can refer to relevant technologies, and will not be described in detail in the embodiments of this application.

[0164] For example, the computer device can adjust any parameter in the neural network model using the formula $\hat{W} = W - \alpha \cdot dw$.

[0165] Here, $\hat{W}$ is the adjusted parameter; $W$ is the parameter before adjustment; $\alpha$ is the learning rate, which can be preset, for example to 0.001 or 0.000001, and this embodiment does not limit it to a single value; $dw$ is the derivative of the loss function with respect to $W$, which can be obtained from the loss value.
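The adjustment described above, in which the derivative of the loss function with respect to a parameter is scaled by the learning rate, matches the usual gradient descent step. A minimal sketch, assuming that reading; the function name `gradient_step` and the quadratic toy loss are illustrative only and do not come from the embodiment.

```python
from typing import List


def gradient_step(weights: List[float], grads: List[float],
                  learning_rate: float = 0.001) -> List[float]:
    """Subtract learning_rate * derivative from each parameter."""
    return [w - learning_rate * dw for w, dw in zip(weights, grads)]


def toy_loss_grad(w: float) -> float:
    """Derivative of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)


# One step from w = 0 with learning rate 0.001: 0 - 0.001 * (-6) = 0.006.
w = gradient_step([0.0], [toy_loss_grad(0.0)])[0]
print(round(w, 6))  # 0.006

# Repeating the step with a larger learning rate moves w toward the minimum at 3.
w = 0.0
for _ in range(100):
    w = gradient_step([w], [toy_loss_grad(w)], learning_rate=0.1)[0]
print(round(w, 3))  # 3.0
```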

[0166] b. The computer device determines whether the multiple key points meet the target pose condition.

[0167] As an example, when the target pose is a finger-to-call gesture, the pose recognition model is a finger-to-call gesture recognition model. The target pose condition can be set as the arm of the object in the video segment being bent. In this case, the computer device can first obtain the arm-related key points from the multiple key points, that is, the left shoulder, right shoulder, left elbow, right elbow, left wrist, and right wrist. Then, it calculates the angle formed by the left shoulder, left elbow, and left wrist key points and the angle formed by the right shoulder, right elbow, and right wrist key points, and thereby determines whether the left arm or the right arm is bent. If either the left or right arm is bent, it can be determined that the multiple key points meet the target pose condition; if neither the left nor the right arm is bent, it can be determined that the multiple key points do not meet the target pose condition.

[0168] c. When the multiple key points meet the target pose condition, the computer device determines this single-object video segment as a target video segment.

[0169] In the above example, if the multiple key points satisfy the target pose condition, it indicates that the object in this single-object video segment has a bent-arm pose. This arm pose may be the finger-to-call gesture, or it may be some other arm pose. Since the finger-to-call gesture is itself a bent-arm pose, training the finger-to-call gesture recognition model on video segments with bent-arm poses can make the model more accurate. Therefore, this single-object video segment can be determined as a target video segment, and a sample dataset can be generated based on it. If the multiple key points do not satisfy the target pose condition, it indicates that the object in this single-object video segment does not have a bent-arm pose. In this case, it can be directly determined that the object's pose in this single-object video segment is not the finger-to-call gesture. Thus, this single-object video segment is not useful for model training and can be deleted.

[0170] In this case, if the multiple key points meet the target pose condition, this single-object video segment can be determined as a target video segment. This allows the pose recognition model to be trained using the target video segment, resulting in a higher accuracy of the trained pose recognition model.
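The arm check in steps b and c can be sketched as an elbow-angle computation. This is a minimal sketch under assumptions: key points are 2D pixel coordinates stored under hypothetical names such as `left_elbow`, and an arm counts as bent when the elbow angle is below 160 degrees. The embodiment does not give a numeric threshold, so that value is illustrative only.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle at b, in degrees, between the rays b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 180.0  # coinciding key points: treat the arm as straight (assumption)
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def meets_arm_condition(keypoints: Dict[str, Point], bent_below_deg: float = 160.0) -> bool:
    """True if the left or the right arm is bent at the elbow."""
    for side in ("left", "right"):
        angle = joint_angle(keypoints[f"{side}_shoulder"],
                            keypoints[f"{side}_elbow"],
                            keypoints[f"{side}_wrist"])
        if angle < bent_below_deg:
            return True
    return False


straight = {"left_shoulder": (0, 0), "left_elbow": (0, 1), "left_wrist": (0, 2),
            "right_shoulder": (5, 0), "right_elbow": (5, 1), "right_wrist": (5, 2)}
bent = dict(straight, left_wrist=(1, 1))  # left forearm turned 90 degrees

print(meets_arm_condition(straight))  # False
print(meets_arm_condition(bent))      # True
```

A segment-level decision would still need a rule for combining per-frame results (for example, any frame or a minimum number of frames); the text does not specify one, so the sketch stops at a single set of key points.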

[0171] It is worth noting that after the computer device performs the above steps a-c on each of the multiple single-object video segments, it can obtain at least one target video segment, and then a sample dataset can be generated based on the at least one target video segment, that is, to continue to execute step 103.

[0172] Step 103: The computer device generates a sample dataset based on the at least one target video segment and the annotations of the at least one target video segment. The annotations are used to indicate whether the object in the corresponding target video segment is in the target pose. The sample dataset is used to train the pose recognition model.

[0173] Optionally, the annotation of the at least one target video segment can be obtained through manual classification. Specifically, for each of the at least one target video segments, each target video segment can be manually observed to determine whether the object in the target video segment is in the target pose, thereby classifying each target video segment and obtaining the classification result (annotation) for each target video segment.

[0174] In this scenario, the computer device can automatically generate sample datasets for training pose recognition models. These datasets can be used to train pose recognition models in single-object scenarios, thereby reducing labor costs, improving the efficiency of dataset production, and resulting in higher accuracy of pose recognition models in single-object scenarios.
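Step 103 pairs each target video segment with its annotation. A minimal sketch of that pairing, assuming segments are referred to by identifiers and annotations are booleans; the `Sample` class and `build_sample_dataset` function are hypothetical names, and the embodiment does not prescribe a storage layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    segment_id: str       # identifies one target video segment
    in_target_pose: bool  # annotation: is the object in the target pose?


def build_sample_dataset(target_segments: List[str],
                         annotations: Dict[str, bool]) -> List[Sample]:
    """Pair every target video segment with its annotation."""
    missing = [seg for seg in target_segments if seg not in annotations]
    if missing:
        raise ValueError(f"segments without annotation: {missing}")
    return [Sample(seg, annotations[seg]) for seg in target_segments]


dataset = build_sample_dataset(["seg_1", "seg_4"], {"seg_1": True, "seg_4": False})
print(dataset)
```

Raising an error for an unannotated segment is a design choice of this sketch: it keeps a segment from entering the dataset without an annotation.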

[0175] It is worth noting that after generating the sample dataset, the computer device can also train a pose recognition model based on the sample dataset; the pose recognition model is used to identify the target pose.

[0176] Specifically, the operation of training the pose recognition model by the computer device may include the following steps (1)-(2).

[0177] (1) The computer device determines multiple key points of an object in the at least one target video segment based on the sample dataset.

[0178] The multiple key points of the object in the at least one target video segment are the multiple key points of the object in each video frame of each target video segment in the at least one target video segment.

[0179] Optionally, for any one of the at least one target video segments, the computer device can input the target video segment into the key point extraction model, extract multiple key points of the object in each video frame of the target video segment through the key point extraction model, and output multiple key points of the object in each video frame of the target video segment.

[0180] Specifically, the details of the key point extraction model have been elaborated in step 102 above, and will not be repeated here.

[0181] Optionally, for any one of the at least one target video segments, before inputting the target video segment into the keypoint extraction model, the computer device can first obtain the coordinate values of the object in each video frame of the target video segment. Subsequently, the coordinate values in each video frame and the target video segment can be input into the keypoint extraction model. The keypoint extraction model first determines the position of the object in each video frame; then, the keypoint extraction model extracts multiple keypoints of the object indicated by that position in each video frame, and outputs multiple keypoints of the object in each video frame of the target video segment.

[0182] In this case, the key point extraction model can accurately find the object in each video frame. Subsequently, key points can be directly extracted from the object indicated by the coordinate value in each video frame, thereby improving the accuracy of extracting key points of the object in each video frame of the target video segment.

[0183] Optionally, when the computer device outputs multiple key points of an object in each video frame of the target video segment through the key point extraction model, it can also output the confidence level of multiple key points of an object in each video frame of the target video segment through the key point extraction model.

[0184] This confidence level indicates the accuracy of the multiple keypoints of an object in each video frame. The higher the confidence level of these keypoints, the more accurate the keypoints output by the key point extraction model for the object in each video frame; the lower the confidence level, the less accurate those keypoints are.

[0185] Optionally, after obtaining multiple key points of an object and the confidence level of each key point in each video frame of the target video segment, the computer device may also save the multiple key points of an object and the confidence level of each key point in each video frame of the target video segment.

[0186] For example, the computer device can save the multiple keypoints of the object in each video frame of the target video segment, along with the confidence level of each keypoint, in a pkl file (a file produced by Python's pickle serialization).
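Saving keypoints and confidence levels to a pkl file can be sketched with Python's standard `pickle` module. The data layout used here (one dictionary per video frame, mapping a keypoint name to an `(x, y, confidence)` tuple) and the function names are assumptions made for illustration; the embodiment does not specify a layout.

```python
import os
import pickle
import tempfile
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)
FrameKeypoints = Dict[str, Keypoint]   # keypoint name -> keypoint


def save_keypoints(path: str, per_frame: List[FrameKeypoints]) -> None:
    """Serialize the per-frame keypoints of one target video segment to a pkl file."""
    with open(path, "wb") as fh:
        pickle.dump(per_frame, fh)


def load_keypoints(path: str) -> List[FrameKeypoints]:
    """Read the per-frame keypoints back from a pkl file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Two frames of made-up keypoints for one segment.
segment_keypoints = [
    {"left_elbow": (120.0, 210.5, 0.93), "left_wrist": (140.0, 180.0, 0.88)},
    {"left_elbow": (121.0, 209.0, 0.95), "left_wrist": (150.0, 170.0, 0.91)},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "seg_1.pkl")
    save_keypoints(pkl_path, segment_keypoints)
    restored = load_keypoints(pkl_path)

print(restored == segment_keypoints)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.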

[0187] It is worth noting that by performing the above operation on each of the at least one target video segment, the computer device can obtain multiple key points of the object in each video frame of each target video segment. Then, the computer device can train the pose recognition model based on the multiple key points of the object in each video frame of each target video segment, that is, continue to perform the following step (2).

[0188] (2) The computer device trains the pose recognition model based on multiple key points of the object in the at least one target video segment.

[0189] In this case, the pose recognition model can be trained, and subsequently used to identify the target pose. Furthermore, this pose recognition model can be applied to single-object scenarios, meaning it can accurately identify the target pose in any single-object scenario.

[0190] Specifically, step (2) can be performed as follows: for any one of the at least one target video segments, input multiple key points of the object in the target video segment into the pose recognition model, perform pose recognition on the multiple key points of the object in the target video segment through the pose recognition model, and output the predicted recognition result; based on the difference information between the predicted recognition result and the annotation corresponding to the target video segment, adjust the model parameters of the pose recognition model.

[0191] The predicted recognition result indicates whether the pose recognition model predicts that the object in the target video segment is in the target pose. The annotation corresponding to the target video segment indicates whether the object in the target video segment is actually in the target pose.

[0192] This difference information can be the loss value between the predicted recognition result and the corresponding annotation, as determined by a loss function.

[0193] In this case, the pose recognition model first predicts whether the object in the target video segment is in the target pose. Then, based on the difference between the predicted recognition result and the actual result (annotation) of the target video segment, the model parameters of the pose recognition model are adjusted to improve the recognition accuracy of the pose recognition model.

[0194] Specifically, the entire training process can be as follows: the computer device acquires the multiple key points of the object in each of the at least one target video segment and the annotation corresponding to that target video segment. The multiple key points of the object in the target video segment serve as the sample data, that is, the input data of the pose recognition model, and the annotation corresponding to the target video segment serves as the sample label.

[0195] The pose recognition model can include multiple network layers: an input layer, multiple hidden layers, and an output layer. The input layer receives the input data; the output layer outputs the processed data; the hidden layers sit between the input and output layers, process the data, and are not visible externally. Optionally, the pose recognition model can be a deep neural network, specifically a convolutional neural network, such as a PoseC3D (pose 3D convolution) model based on the PySKL environment.

[0196] When training the pose recognition model, the computer device can input the input data into the pose recognition model, process it to obtain a two-dimensional vector, and then classify this two-dimensional vector using a classification function to obtain output data (predicted recognition result). A loss function is used to determine the loss value between the output data (predicted recognition result) and the sample labels. The model parameters in the pose recognition model are adjusted based on this loss value. After adjusting the parameters in the pose recognition model based on multiple key points in each of the at least one target video segment and the corresponding annotations of the target video segment, the trained pose recognition model is obtained. The classification function can be an argmax function, a softmax function, etc., and this embodiment does not limit the specific function used.
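
The training loop described above can be sketched as follows, under strong simplifying assumptions: a single linear layer stands in for the pose recognition model (it is not PoseC3D), softmax is the classification function, cross-entropy is the loss function, and plain gradient descent adjusts the parameters. The feature values, learning rate, and function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert a two-dimensional score vector into class probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, bias, features):
    """Linear stand-in for the model: features -> two-dimensional vector."""
    return [sum(w * x for w, x in zip(weights[k], features)) + bias[k]
            for k in range(2)]

def train_step(weights, bias, features, label, lr=0.1):
    """One gradient step on the cross-entropy loss; returns the loss value."""
    probs = softmax(forward(weights, bias, features))
    loss = -math.log(probs[label])
    for k in range(2):
        # Gradient of cross-entropy with respect to logit k is (p_k - y_k).
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(features):
            weights[k][j] -= lr * grad * x
        bias[k] -= lr * grad
    return loss

def predict(weights, bias, features):
    """Predicted recognition result: 1 = in the target pose, 0 = not."""
    probs = softmax(forward(weights, bias, features))
    return max(range(2), key=lambda k: probs[k])

# Hypothetical flattened key points (normalized x, y pairs) with annotations.
samples = [([0.5, 0.1, 0.5, 0.4], 1), ([0.5, 0.6, 0.5, 0.4], 0)]
weights = [[0.0] * 4 for _ in range(2)]
bias = [0.0, 0.0]
losses = []
for _ in range(200):
    losses.append(sum(train_step(weights, bias, f, y) for f, y in samples))
```

Here softmax yields the probabilities used by the loss, while the argmax in `predict` turns them into the final yes/no recognition result.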

[0197] Optionally, step (2) can also be performed as follows: for any one of the at least one target video segments, input multiple key points of the object in the target video segment and the confidence of the multiple key points into the pose recognition model, and use the pose recognition model to perform pose recognition on the multiple key points of the object in the target video segment based on the confidence of the multiple key points, and output the predicted recognition result; adjust the model parameters of the pose recognition model based on the difference information between the predicted recognition result and the annotation corresponding to the target video segment.

[0198] Since the confidence level of the multiple key points indicates their accuracy, the pose recognition model can readily identify the key points with high confidence levels, which improves the accuracy of the pose recognition model during training.
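
One possible way to feed confidence levels to the model alongside the key points is sketched below. Packing each key point as an (x, y, confidence) triple and zeroing key points below a threshold are both assumptions made for illustration; the embodiment does not state how the model consumes the confidence levels, and `min_conf` is a hypothetical parameter.

```python
def build_input(keypoints, confidences, min_conf=0.3):
    """Pack each key point as (x, y, confidence).

    Key points whose confidence is below min_conf (an assumed threshold)
    are zeroed so that they contribute nothing to later layers.
    """
    if len(keypoints) != len(confidences):
        raise ValueError("one confidence value is required per key point")
    packed = []
    for (x, y), c in zip(keypoints, confidences):
        if c < min_conf:
            packed.append((0.0, 0.0, 0.0))
        else:
            packed.append((x, y, c))
    return packed

# Two key points of one frame; the second has low confidence.
frame_input = build_input([(120.0, 80.0), (118.0, 140.0)], [0.97, 0.12])
```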

[0199] It is worth noting that the computer device can also first divide the sample dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the pose recognition model, the validation set is used to validate the trained pose recognition model, and the test set is used to test the accuracy of the pose recognition model.

[0200] After training the pose recognition model, a validation set can be used to validate the model, and the pose recognition model that performs best on the validation set can be tested on the test set.
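
The split and selection described in the two paragraphs above can be sketched as follows. The 70/15/15 percentages, the fixed shuffle seed, and the two stand-in "models" (simple rules rather than trained networks) are assumptions made for illustration.

```python
import random

def split_dataset(samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle and split samples into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def accuracy(predict_fn, samples):
    """Fraction of samples whose predicted label matches the annotation."""
    if not samples:
        return 0.0
    correct = sum(1 for feature, label in samples if predict_fn(feature) == label)
    return correct / len(samples)

def select_best(candidates, val_set):
    """Return the candidate with the highest validation accuracy."""
    return max(candidates, key=lambda fn: accuracy(fn, val_set))

def threshold_rule(x):
    """Stand-in model that predicts the target pose when x > 0.5."""
    return 1 if x > 0.5 else 0

def always_positive(x):
    """Stand-in model that always predicts the target pose."""
    return 1

# Hypothetical samples: (feature, annotation), annotated 1 when feature > 0.5.
samples = [(i / 20, 1 if i / 20 > 0.5 else 0) for i in range(20)]
train_set, val_set, test_set = split_dataset(samples)
best = select_best([threshold_rule, always_positive], val_set)
test_accuracy = accuracy(best, test_set)
```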

[0201] Therefore, validating and testing the pose recognition model after it has been trained can improve its recognition accuracy.

[0202] In this embodiment, a sample dataset can be generated through steps 101-103 described above, and a pose recognition model trained on this sample dataset can be used for recognition in single-object scenarios. Subsequently, when a target pose needs to be recognized, this pose recognition model can be used to identify it, thereby improving the recognition accuracy in single-object scenarios.

[0203] To facilitate understanding, the following takes pointing and calling as the target pose and the generation of a pointing-and-calling dataset as an example, and illustrates the sample dataset generation method provided in the embodiments of this application with reference to Figure 2.

[0204] Referring to Figure 2, the process of generating the pointing-and-calling dataset involves a video stream 201, multiple candidate video segments 202, ID coordinate values 203, single-person video segments 204, target video segments 205, and a pointing-and-calling dataset 206.

[0205] Referring to Figure 2, the entire process of generating the pointing-and-calling dataset includes the following steps (1)-(4).

[0206] (1) The computer device first divides the video stream 201 into video segments to obtain multiple candidate video segments 202. The multiple candidate video segments 202 are the video segments corresponding to the multiple human figures in the video stream 201.

[0207] (2) The computer device performs target detection on each of the multiple candidate video segments 202 and selects the candidate video segments 202 that contain only one human figure, so as to obtain multiple single-person video segments 204. In addition, the computer device can also obtain the ID coordinate value 203 of this human figure in the candidate video segment 202, that is, the position of this human figure in the candidate video segment 202.

[0208] (3) The computer device performs pose recognition on each of the multiple single-person video segments 204 and retains the single-person video segments 204 in which the person makes arm movements, thereby obtaining at least one target video segment 205.

[0209] (4) Each of the at least one target video segment 205 is first manually classified to determine whether the human figure in the target video segment 205 is in the pointing-and-calling pose, thereby obtaining the annotation of each target video segment 205. Then, the computer device generates the pointing-and-calling dataset 206 based on the at least one target video segment 205 and the annotation corresponding to each target video segment 205.
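
Step (4) can be sketched as pairing each target video segment with its manual annotation. The segment IDs, the record layout, and the choice to fail on a missing annotation are assumptions made for illustration.

```python
def generate_dataset(target_segments, annotations):
    """Pair each target video segment with its manual annotation.

    annotations maps a segment ID to True (in the target pose) or False.
    """
    missing = [seg_id for seg_id in target_segments if seg_id not in annotations]
    if missing:
        raise KeyError(f"segments without annotation: {missing}")
    return [{"segment": seg_id, "label": int(annotations[seg_id])}
            for seg_id in target_segments]

# Hypothetical target video segments and their manually assigned annotations.
target_segments = ["seg_003", "seg_007", "seg_012"]
annotations = {"seg_003": True, "seg_007": False, "seg_012": True}
dataset = generate_dataset(target_segments, annotations)
```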

[0210] It is worth noting that after the pointing-and-calling dataset 206 is generated, it can also be used to train the pointing-and-calling recognition model. Referring to Figure 2, the model training process involves a key point extraction model 207, multiple key points and key point confidences 208, and a pointing-and-calling recognition model 209.

[0211] Referring to Figure 2, the entire model training process includes the following steps (1)-(3).

[0212] (1) The computer device inputs the pointing-and-calling dataset 206, together with the ID coordinate values 203 output while that dataset was generated, into the key point extraction model 207.

[0213] (2) The computer device extracts multiple key points of the object in the target video segment 205 through the key point extraction model 207, and obtains multiple key points and key point confidence 208.

[0214] (3) The computer device uses the multiple key points and key point confidences 208 of the object in the at least one target video segment in the pointing-and-calling dataset to train, validate, and test the pointing-and-calling recognition model 209, so as to obtain the final pointing-and-calling recognition model 209 that can be used for recognition.

[0215] In this embodiment, a computer device divides a video stream into video segments to obtain multiple candidate video segments, specifically multiple candidate video segments containing at least one object. Then, object detection and pose recognition are performed on these candidate video segments to obtain at least one target video segment. This means selecting a candidate video segment containing only one object from the multiple candidate video segments, where the pose of this object conforms to the target pose condition. Next, a sample dataset is generated based on the at least one target video segment and its annotations. The annotations indicate whether the object in the corresponding target video segment is in the target pose. The generated sample dataset can be used to train a pose recognition model. This allows for the automatic generation of a sample dataset for training the pose recognition model, reducing manual costs and improving the efficiency of sample dataset creation. Furthermore, this application automatically obtains target video segments containing only one object, so the sample dataset is suitable for training pose recognition models in single-object scenarios, further improving the recognition efficiency of pose recognition models in single-object scenarios.

[0216] Figure 3 is a schematic diagram of the structure of a sample dataset generation device provided in an embodiment of this application. The sample dataset generation device can be implemented as part or all of a computer device by software, hardware, or a combination of both. This computer device can be the computer device shown in Figure 4 below. Referring to Figure 3, the device includes: a first processing module 301, a second processing module 302, and a generation module 303.

[0217] The first processing module 301 is used to divide the video stream into video segments to obtain multiple candidate video segments, wherein the multiple candidate video segments are video segments in the video stream that contain at least one object.

[0218] The second processing module 302 is used to perform target detection and pose recognition on the multiple candidate video segments to obtain at least one target video segment among the multiple candidate video segments. The target video segment is a candidate video segment containing an object, and the pose of the object in the target video segment meets the target pose condition.

[0219] The generation module 303 is used to generate a sample dataset based on the at least one target video segment and the annotation of the at least one target video segment. The annotation is used to indicate whether the object in the corresponding target video segment is in the target pose. The sample dataset is used to train the pose recognition model.

[0220] Optionally, the first processing module 301 is used for:

[0221] For any object in the video stream, target detection is performed on the video stream to obtain multiple target video frames, all of which contain the object; these multiple target video frames are then aggregated to obtain the candidate video segment corresponding to the object.

[0222] Alternatively, for any object in the video stream, target detection is performed on the video stream to obtain a reference video frame containing the object; target tracking is then performed in the video stream starting from the reference video frame to obtain a candidate video segment corresponding to the object.
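
The first option above (collecting every frame that contains a given object) can be sketched as follows. The sketch assumes a detector that already assigns a stable ID to each object across frames; that detector, the IDs, and the function name are hypothetical.

```python
from collections import defaultdict

def aggregate_segments(frame_detections):
    """Map each object ID to the ordered list of frame indices containing it."""
    segments = defaultdict(list)
    for frame_index, object_ids in enumerate(frame_detections):
        for object_id in object_ids:
            segments[object_id].append(frame_index)
    return dict(segments)

# Hypothetical detector output: the object IDs visible in each video frame.
frame_detections = [{"A"}, {"A", "B"}, {"B"}, set(), {"A"}]
candidate_segments = aggregate_segments(frame_detections)
```

Each value in `candidate_segments` lists the target video frames that make up the candidate video segment for that object.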

[0223] Optionally, the device further includes:

[0224] The first filtering module is used to delete any candidate video segment among the multiple candidate video segments if the candidate video segment does not meet the preset conditions, so as to obtain multiple reference video segments. The preset conditions are that the number of frames of the candidate video segment is greater than or equal to the first preset number of frames and less than or equal to the second preset number of frames, and the first preset number of frames is less than the second preset number of frames.
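
The preset condition amounts to a closed-interval check on the frame count, as in this minimal sketch. The concrete bounds (16 and 300 frames) and the representation of a segment as a list of frame indices are assumptions made for illustration.

```python
def filter_by_frame_count(segments, min_frames, max_frames):
    """Keep segments whose frame count lies in [min_frames, max_frames]."""
    if min_frames >= max_frames:
        raise ValueError("min_frames must be less than max_frames")
    return [seg for seg in segments if min_frames <= len(seg) <= max_frames]

# Hypothetical candidate segments, each represented as a list of frame indices.
candidates = [list(range(5)), list(range(48)), list(range(400))]
reference_segments = filter_by_frame_count(candidates, min_frames=16, max_frames=300)
```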

[0225] Optionally, the second processing module 302 is used for:

[0226] Target detection and pose recognition are performed on the multiple reference video segments to obtain at least one target video segment from the multiple reference video segments.

[0227] Optionally, the second processing module 302 is used for:

[0228] For any one of the multiple candidate video segments, a candidate bounding box is slid over that candidate video segment.

[0229] During the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions of the candidate video segment;

[0230] Based on the region classification results of the multiple regions, at least one detection box is determined for the candidate video segment, and the at least one detection box surrounds at least one object in the candidate video segment;

[0231] Based on at least one detection box among the multiple candidate video segments, multiple single-object video segments are obtained by filtering from the multiple candidate video segments. The single-object video segment is a candidate video segment containing one object.

[0232] Pose recognition is performed on the multiple single-object video segments to obtain at least one target video segment.
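
The sliding of the candidate box and the derivation of detection boxes can be sketched on a single frame as follows. The region classifier here is a toy stand-in (the fraction of object pixels in the region), and the probability threshold and overlap suppression are assumptions; the embodiment states only that each region classification result is a probability that the region contains an object.

```python
def slide_and_classify(frame, box_h, box_w, stride, classify):
    """Slide a candidate box over the frame and score each covered region.

    Returns a list of ((top, left, bottom, right), probability) pairs.
    """
    rows, cols = len(frame), len(frame[0])
    results = []
    for top in range(0, rows - box_h + 1, stride):
        for left in range(0, cols - box_w + 1, stride):
            region = [row[left:left + box_w] for row in frame[top:top + box_h]]
            box = (top, left, top + box_h, left + box_w)
            results.append((box, classify(region)))
    return results

def iou(a, b):
    """Intersection over union of two (top, left, bottom, right) boxes."""
    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def detection_boxes(results, min_prob=0.9, max_iou=0.3):
    """Keep high-probability regions, suppressing overlapping duplicates."""
    kept = []
    for box, prob in sorted(results, key=lambda r: r[1], reverse=True):
        if prob >= min_prob and all(iou(box, k) <= max_iou for k in kept):
            kept.append(box)
    return kept

def object_fraction(region):
    """Toy classifier: fraction of object pixels in the region."""
    cells = [v for row in region for v in row]
    return sum(cells) / len(cells)

# Hypothetical 6x6 binary frame with one 2x2 object at rows 2-3, columns 2-3.
frame = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        frame[r][c] = 1

results = slide_and_classify(frame, box_h=2, box_w=2, stride=1, classify=object_fraction)
boxes = detection_boxes(results)
```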

[0233] Optionally, the second processing module 302 is used for:

[0234] For any candidate video segment among the multiple candidate video segments, if multiple detection boxes are detected in the candidate video segment, the candidate video segment is deleted.

[0235] If a detection box is detected in the candidate video segment, the candidate video segment is identified as the single object video segment.
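
This filtering rule reduces to counting detection boxes per candidate video segment, as in the sketch below. The segment IDs and box coordinates are hypothetical. Segments with no detection box are also dropped here, a case the embodiment does not address.

```python
def filter_single_object(segment_boxes):
    """Keep segments with exactly one detection box; drop the rest."""
    return [seg_id for seg_id, boxes in segment_boxes.items() if len(boxes) == 1]

# Hypothetical detection boxes (top, left, bottom, right) per candidate segment.
segment_boxes = {
    "seg_a": [(10, 10, 50, 90), (60, 12, 100, 95)],  # two objects: deleted
    "seg_b": [(20, 15, 70, 110)],                    # one object: kept
    "seg_c": [],                                     # nothing detected: dropped
}
single_object_segments = filter_single_object(segment_boxes)
```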

[0236] Optionally, the second processing module 302 is used for:

[0237] For any single object video segment among multiple single object video segments, extract key points of the object in the single object video segment to obtain multiple key points of the object in the single object video segment.

[0238] Determine whether these multiple key points meet the target pose condition;

[0239] If these multiple key points meet the target pose condition, this single-object video segment is identified as the target video segment.
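
A check of this kind can be sketched as follows. The condition used here (the wrist is above the shoulder in at least half of the frames) is purely an illustrative assumption, as are the key point names and the `min_ratio` parameter; this passage does not define the target pose condition.

```python
def arm_raised(keypoints):
    """True when the wrist is above the shoulder in image coordinates
    (a smaller y value means higher in the frame)."""
    return keypoints["wrist"][1] < keypoints["shoulder"][1]

def meets_target_pose(segment_keypoints, min_ratio=0.5):
    """Segment qualifies when enough of its frames show a raised arm."""
    if not segment_keypoints:
        return False
    raised = sum(1 for kp in segment_keypoints if arm_raised(kp))
    return raised / len(segment_keypoints) >= min_ratio

# Hypothetical key points (x, y) for three frames of one single-object segment.
segment = [
    {"shoulder": (100.0, 200.0), "wrist": (130.0, 150.0)},  # raised
    {"shoulder": (100.0, 200.0), "wrist": (135.0, 140.0)},  # raised
    {"shoulder": (100.0, 200.0), "wrist": (110.0, 260.0)},  # lowered
]
is_target = meets_target_pose(segment)
```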

[0240] Optionally, the device further includes:

[0241] The determination module is used to determine multiple key points of an object in at least one target video segment based on the sample dataset;

[0242] The training module is used to train the pose recognition model based on multiple key points of an object in at least one target video segment.

[0243] Optionally, this training module is used for:

[0244] For any one of the at least one target video segments, input multiple key points of the object in the target video segment into the pose recognition model, perform pose recognition on the multiple key points of the object in the target video segment through the pose recognition model, and output the predicted recognition result.

[0245] Based on the difference between the predicted recognition result and the corresponding annotation of the target video segment, the model parameters of the pose recognition model are adjusted.

[0246] In this embodiment, the video stream is segmented to obtain multiple candidate video segments, specifically multiple candidate video segments containing at least one object. Then, object detection and pose recognition are performed on these candidate video segments to obtain at least one target video segment. This means selecting a candidate video segment containing only one object from the multiple candidate video segments, where the pose of this object conforms to the target pose condition. Next, a sample dataset is generated based on the at least one target video segment and its annotations. The annotations indicate whether the object in the corresponding target video segment is in the target pose. The generated sample dataset can be used to train a pose recognition model. This allows for the automatic generation of a sample dataset for training the pose recognition model, reducing manual costs and improving the efficiency of sample dataset creation. Furthermore, this application automatically obtains target video segments containing only one object, so the sample dataset is suitable for training pose recognition models in single-object scenarios, thus improving the recognition efficiency of pose recognition models in single-object scenarios.

[0247] It should be noted that the sample dataset generation device provided in the above embodiments is only illustrated by the division of the above functional modules when generating sample datasets. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0248] The functional units and modules in the above embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of the embodiments of this application.

[0249] The sample dataset generation device and sample dataset generation method provided in the above embodiments belong to the same concept. The specific working process and technical effects of the units and modules in the above embodiments can be found in the method embodiments section, and will not be repeated here.

[0250] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 4, the computer device 4 includes a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When the processor 40 executes the computer program 42, it implements the steps in the sample dataset generation method in the above embodiments.

[0251] Computer device 4 can be a general-purpose computer device or a special-purpose computer device. In specific implementations, computer device 4 can be a desktop computer, portable computer, handheld computer, mobile phone, tablet computer, or other terminal or network server. This application embodiment does not limit the type of computer device 4. Those skilled in the art will understand that Figure 4 is merely an example of the computer device 4 and does not constitute a limitation on it. The computer device 4 may include more or fewer components than shown in the figure, or combine certain components, or use different components; for example, it may also include input/output devices, network access devices, etc.

[0252] Processor 40 can be a Central Processing Unit (CPU), or it can be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0253] In some embodiments, memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., provided on the computer device 4. Furthermore, memory 41 may include both internal storage units and external storage devices of the computer device 4. Memory 41 is used to store the operating system, applications, boot loader, data, and other programs. Memory 41 may also be used to temporarily store data that has been output or will be output.

[0254] This application also provides a computer device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor executes the computer program to implement the steps in any of the above method embodiments.

[0255] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the steps in the various method embodiments described above.

[0256] This application provides a computer program product that, when run on a computer, causes the computer to perform the steps described in the various method embodiments above.

[0257] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or some intermediate form. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, and optical data storage devices. The computer-readable storage medium mentioned in this application can be a non-volatile storage medium; in other words, it can be a non-transitory storage medium.

[0258] It should be understood that all or part of the steps of the above embodiments can be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented in whole or in part as a computer program product. The computer program product includes one or more computer instructions. The computer instructions can be stored in the above-described computer-readable storage medium.

[0259] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0260] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0261] In the embodiments provided in this application, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0262] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0263] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A method of generating a sample data set, the method comprising: The video stream is divided into video segments to obtain multiple candidate video segments, wherein each candidate video segment is a video segment in the video stream that contains at least one object; Target detection and pose recognition are performed on the plurality of candidate video segments to obtain at least one target video segment among the plurality of candidate video segments. The target video segment is a candidate video segment containing an object, and the pose of the object in the target video segment meets the target pose condition. The at least one target video segment is determined by classifying the area covered by the candidate box as the candidate box slides on any one of the plurality of candidate video segments, obtaining the region classification results of multiple regions of the candidate video segment, and determining the candidate video segment containing an object and whose pose meets the target pose condition based on the region classification results of the multiple regions of the plurality of candidate video segments. The region classification result is the probability value that the area covered by the candidate box contains an object. A sample dataset is generated based on at least one target video segment and at least one annotation of the target video segment. The annotation is used to indicate whether an object in the corresponding target video segment is in a target pose. The sample dataset is used to train a pose recognition model.

2. The method of claim 1, wherein, The video stream is segmented to obtain multiple candidate video segments, including any one of the following: For any object in the video stream, target detection is performed on the video stream to obtain multiple target video frames, all of which contain the object; the multiple target video frames are aggregated to obtain a candidate video segment corresponding to the object. Alternatively, for any object in the video stream, target detection is performed on the video stream to obtain a reference video frame, which contains the object; target tracking is then performed in the video stream starting from the reference video frame to obtain a candidate video segment corresponding to the object.

3. The method as described in claim 1, characterized in that, Before performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments, the method further includes: For any one of the multiple candidate video segments, if the candidate video segment does not meet the preset conditions, the candidate video segment is deleted to obtain multiple reference video segments. The preset conditions are that the number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, where the first preset number of frames is less than the second preset number of frames. The step of performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments includes: Target detection and pose recognition are performed on the plurality of reference video segments to obtain at least one target video segment among the plurality of reference video segments.

4. The method as described in claim 1, characterized in that, The step of performing target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment from the plurality of candidate video segments includes: For any one of the multiple candidate video segments, a candidate bounding box is slid across the candidate video segment. During the sliding of the candidate box, the area covered by the candidate box is classified to obtain the region classification results of multiple regions of the candidate video segment. The region classification result is the probability value that the area covered by the candidate box contains an object. Based on the region classification results of the multiple regions, at least one detection box is determined for the candidate video segment, and the at least one detection box surrounds at least one object in the candidate video segment; Based on at least one detection box among the plurality of candidate video segments, a plurality of single-object video segments are obtained by filtering from the plurality of candidate video segments, wherein the single-object video segment is a candidate video segment containing one object; Pose recognition is performed on the plurality of single-object video segments to obtain at least one target video segment.

5. The method as described in claim 4, characterized in that, The step of filtering multiple single-object video segments from the multiple candidate video segments based on at least one detection box from the multiple candidate video segments includes: For any one of the multiple candidate video segments, if multiple detection boxes are detected in the candidate video segment, the candidate video segment is deleted. If a detection box is detected in the candidate video segment, the candidate video segment is determined as the single-object video segment.

6. The method as described in claim 4, characterized in that, the step of performing pose recognition on the plurality of single-object video segments to obtain at least one target video segment includes: for any one of the plurality of single-object video segments, extracting key points from the object in the single-object video segment to obtain multiple key points of the object in the single-object video segment; determining whether the multiple key points meet the target pose condition; and if the multiple key points meet the target pose condition, determining the single-object video segment as the target video segment.
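
The key-point check in claim 6 can be illustrated as follows. The claim does not define the target pose condition or the key-point layout, so everything concrete here is hypothetical: the key-point names, the `hand_raised` rule (right wrist above right shoulder), and the assumption that key points have already been extracted by some upstream estimator.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: key-point name -> (x, y) in image coordinates.
Keypoints = Dict[str, Tuple[float, float]]


def hand_raised(keypoints: Keypoints) -> bool:
    """Illustrative pose condition: right wrist is above the right shoulder.

    Image y grows downward, so "above" means a smaller y value.
    """
    return keypoints["right_wrist"][1] < keypoints["right_shoulder"][1]


def select_target_segments(
    segment_keypoints: Dict[str, Keypoints],
    condition: Callable[[Keypoints], bool],
) -> List[str]:
    """Return ids of single-object segments whose key points meet the condition."""
    return [seg_id for seg_id, kp in segment_keypoints.items() if condition(kp)]


segment_keypoints = {
    "seg_a": {"right_wrist": (50.0, 20.0), "right_shoulder": (48.0, 60.0)},
    "seg_b": {"right_wrist": (50.0, 90.0), "right_shoulder": (48.0, 60.0)},
}
print(select_target_segments(segment_keypoints, hand_raised))  # ['seg_a']
```

Passing the condition as a callable keeps the selection step independent of whichever pose rule an implementation actually uses.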

7. The method as described in claim 1, characterized in that, the method further includes: based on the sample dataset, determining multiple key points of the object in the at least one target video segment; and training the pose recognition model based on the multiple key points of the object in the at least one target video segment.

8. The method as described in claim 7, characterized in that, the step of training the pose recognition model based on the multiple key points of the object in the at least one target video segment includes: for any one of the target video segments, inputting the multiple key points of the object in the target video segment into the pose recognition model, performing pose recognition on the multiple key points through the pose recognition model, and outputting a predicted recognition result; and adjusting the model parameters of the pose recognition model based on difference information between the predicted recognition result and the annotation corresponding to the target video segment.
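
The parameter adjustment in claim 8 can be sketched with a deliberately simple stand-in model. The claim does not specify the model architecture or the loss, so the logistic-regression model, the flattened key-point feature vector, the learning rate, and the function names below are all assumptions for illustration.

```python
import math
from typing import List, Tuple


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Predicted probability that the object is in the target pose."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_step(
    weights: List[float],
    bias: float,
    features: List[float],
    label: int,
    lr: float = 0.1,
) -> Tuple[List[float], float]:
    """One gradient step on the log loss for a single annotated segment.

    `error` is the difference between the predicted result and the
    annotation (1 = in target pose, 0 = not in target pose).
    """
    error = predict(weights, bias, features) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Flattened key-point coordinates of one segment (assumed feature layout).
features = [1.0, 2.0]
weights, bias = [0.0, 0.0], 0.0

before = predict(weights, bias, features)
weights, bias = train_step(weights, bias, features, label=1)
after = predict(weights, bias, features)
print(before < after)  # True: the prediction moved toward the annotation
```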

9. A sample dataset generation device, characterized in that, the device includes: a first processing module, used to divide a video stream into video segments to obtain a plurality of candidate video segments, wherein the plurality of candidate video segments are video segments in the video stream that contain at least one object; a second processing module, used to perform target detection and pose recognition on the plurality of candidate video segments to obtain at least one target video segment among the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the pose of the object in the target video segment meets a target pose condition; the at least one target video segment is determined by, for any one of the plurality of candidate video segments, classifying the area covered by a candidate box while the candidate box slides across the candidate video segment to obtain region classification results of multiple regions of the candidate video segment, and determining, based on the region classification results of the multiple regions of the plurality of candidate video segments, the candidate video segments that contain one object whose pose meets the target pose condition, wherein the region classification result is the probability value that the area covered by the candidate box contains an object; and a generation module, used to generate a sample dataset based on the at least one target video segment and the annotation of the at least one target video segment, wherein the annotation is used to indicate whether the object in the corresponding target video segment is in a target pose, and the sample dataset is used to train a pose recognition model.

10. A computer device, characterized in that, the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method as described in any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that, the computer-readable storage medium stores a computer program that, when executed, implements the method as described in any one of claims 1 to 8.