Action timing counting method and apparatus, electronic device, and machine-readable storage medium

By using a pose estimation model to extract skeletal key points and querying standard action features from a standard action library, the method addresses the low counting accuracy and poor functional scalability of existing methods and achieves flexible, accurate action timing and counting.

CN116434333B · Active · Publication Date: 2026-06-23 · HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD
Filing Date
2023-03-27
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing motion counting methods suffer from low counting accuracy, high labor costs, poor functional scalability, and difficulty in posture correction. In particular, sensor-based and vision-based methods each have their own strengths but also limitations.

Method used

By acquiring the video to be detected, the skeletal key points are extracted using a trained pose estimation model to determine the spatial vector and motion vector. The target standard action features are then queried from the standard action database. Based on the query results, the action timing and counting are performed, and the coordinate information of the skeletal key points is used as a general feature template for action timing and counting.

Benefits of technology

It improves the expandability of action timing and counting functions, reduces the difficulty of function reuse, and enhances the accuracy and flexibility of counting.


Abstract

This application provides an action timing and counting method and apparatus, an electronic device, and a machine-readable storage medium. The method comprises: obtaining a target video; using a human pose estimation model to extract skeletal key point information of the target; converting the skeletal key point information into space-time vector features and matching them against the features in a standard action library to obtain the action state labels of the standard action; and performing logical analysis on the matched action state labels to obtain the counting and timing result for the standard action. The method is efficient and highly extensible, and can use general logic to obtain counting and timing results for different actions.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to an action timing and counting method, apparatus, electronic device, and machine-readable storage medium.

Background Technology

[0002] With the rapid development of the internet, especially mobile internet and artificial intelligence technologies, in recent years, information technology has permeated all areas of society and every aspect of people's daily lives. For example, in areas such as school sports and student physical fitness tests, it is possible to automatically count the number of student assessment items, such as rope skipping and pull-ups.

[0003] Conventional motion counting schemes mainly include sensor-based motion counting methods and vision-based motion counting methods. Sensor-based motion counting methods have the advantage of high counting accuracy, but they are limited by the scenario or tools, have high manual costs and low efficiency, and the results of motion training and assessment cannot be reviewed, nor can posture correction be performed, making the assessment process susceptible to cheating. Vision-based motion counting methods can review results and perform motion correction, but they require training different feature extraction models for each motion to classify different motions, resulting in poor functional scalability and difficulty in function reuse.

Summary of the Invention

[0004] In view of this, this application provides an action timing and counting method, apparatus, electronic device, and machine-readable storage medium.

[0005] Specifically, this application is implemented through the following technical solution:

[0006] According to a first aspect of the embodiments of this application, an action timing and counting method is provided, including:

[0007] Acquire the video to be detected, and perform target detection and tracking on the video to be detected;

[0008] Using a trained pose estimation model, skeletal key points of the detected target in the video frame are extracted, and the spatial vector and motion vector corresponding to the video frame are determined based on the coordinate information of the extracted skeletal key points; wherein, the spatial vector is used to determine the action pose, and the motion vector is used to describe the action motion state.

[0009] Based on the spatial vector and motion vector corresponding to the video frame, the target standard motion feature matching the video frame is queried from the standard motion database; wherein, the standard motion database stores standard motion features and labels for the standard motion features, and the standard motion features include spatial vectors and motion vectors of different states of the standard motion;

[0010] Determine the tag corresponding to the video frame based on the query results;

[0011] Based on the tag sequence corresponding to the video to be detected, action timing and counting are performed.

[0012] According to a second aspect of the embodiments of this application, an action timing and counting device is provided, comprising:

[0013] The acquisition unit is used to acquire the video to be detected;

[0014] The detection unit is used to perform target detection and tracking on the video to be detected;

[0015] The extraction unit is used to extract the skeletal key points of the detected target in the video frame using a trained pose estimation model.

[0016] The first determining unit is used to determine the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points; wherein, the spatial vector is used to determine the action posture, and the motion vector is used to describe the action motion state.

[0017] The query unit is used to query the target standard action feature that matches the video frame from the standard action base library based on the spatial vector and motion vector corresponding to the video frame; wherein, the standard action base library stores standard action features and labels for the standard action features, and the standard action features include spatial vectors and motion vectors of different states of the standard action;

[0018] The second determining unit is used to determine the tag corresponding to the video frame based on the query results;

[0019] The timing and counting unit is used to perform action timing and counting based on the tag sequence corresponding to the video to be detected.

[0020] According to a third aspect of the present application, an electronic device is provided, including a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor being configured to execute the machine-executable instructions to implement the method provided in the first aspect.

[0021] According to a fourth aspect of the embodiments of this application, a machine-readable storage medium is provided, wherein machine-executable instructions are stored therein, and when the machine-executable instructions are executed by a processor, the method provided in the first aspect is implemented.

[0022] The technical solution provided in this application can bring at least the following beneficial effects:

[0023] Target detection and tracking are performed on the video to be detected, and a trained pose estimation model is used to extract the skeletal key points of the detected target in the video frames. Based on the coordinate information of the extracted skeletal key points, the spatial vector and motion vector corresponding to the video frame are determined. Then, based on the spatial vector and motion vector corresponding to the video frame, the target standard action features matching the video frame are queried from the standard action database, and the label corresponding to the video frame is determined from the query results. Finally, action timing and counting are performed based on the label sequence corresponding to the video to be detected. Because the spatial vector and motion vector are determined from the coordinate information of the skeletal key points, the spatial vectors and motion vectors of the different states of the standard action serve as a general feature template for action timing and counting. Matching against this standard action feature template yields general-purpose action timing and counting, which improves the functional scalability of the solution and reduces the difficulty of function reuse.

Attached Figure Description

[0024] Figure 1 is a flowchart illustrating an action timing and counting method according to an exemplary embodiment of this application;

[0025] Figure 2 is a schematic diagram illustrating skeletal key points according to an exemplary embodiment of this application;

[0026] Figure 3 is a schematic diagram illustrating a human pose estimation model training process according to an exemplary embodiment of this application;

[0027] Figure 4 is a schematic diagram illustrating a base library construction process according to an exemplary embodiment of this application;

[0028] Figure 5 is a schematic diagram of an action timing and counting process according to an exemplary embodiment of this application;

[0029] Figure 6A is a schematic diagram illustrating the breakdown of a pull-up movement according to an exemplary embodiment of this application;

[0030] Figure 6B is a schematic diagram of a tag sequence for counting pull-up movements according to an exemplary embodiment of this application;

[0031] Figure 7 is a schematic diagram of the structure of an action timing and counting device according to an exemplary embodiment of this application;

[0032] Figure 8 is a schematic diagram of the structure of another action timing and counting device according to an exemplary embodiment of this application;

[0033] Figure 9 is a schematic diagram of the hardware structure of an electronic device according to an exemplary embodiment of this application.

Detailed Implementation

[0034] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0035] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “said,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0036] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, and to make the above-mentioned objectives, features and advantages of the embodiments of this application more apparent and understandable, the technical solutions in the embodiments of this application will be further described in detail below with reference to the accompanying drawings.

[0037] Please refer to Figure 1, which is a flowchart illustrating an action timing and counting method provided in an embodiment of this application. As shown in Figure 1, the action timing and counting method may include the following steps:

[0038] Step S100: Obtain the video to be detected and perform target detection and tracking on the video to be detected.

[0039] For example, the video to be detected may include real-time acquired video data or recorded video data.

[0040] Taking real-time acquisition as an example, the terminal device that uses the action timing and counting scheme provided in the embodiments of this application to perform action timing and counting can acquire the video to be detected through the video acquisition device built into the device or an external video acquisition device.

[0041] For example, the terminal device may include, but is not limited to, smartphones or tablets. The aforementioned video capture device may be the camera of a smartphone or tablet.

[0042] In this embodiment of the application, target detection can be performed on the acquired video to be detected, and the detected target can be tracked to determine the position of the same detected target in different video frames.

[0043] Step S110: Using the trained pose estimation model, extract the skeletal key points of the detected target in the video frame, and determine the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points.

[0044] In this embodiment of the application, when a target is detected in a video frame, a trained pose estimation model can be used to extract the skeletal key points of the detected target in the video frame, and the spatial vector and motion vector corresponding to the video frame can be determined based on the coordinate information of the extracted skeletal key points.

[0045] Spatial vectors provide static information for determining motion posture. For example, spatial vectors can be determined based on the coordinates of key points on the skeleton in a single video frame.

[0046] Motion vectors provide dynamic information to describe the motion state of an action. For example, motion vectors can be determined based on the coordinates of skeletal keypoints at the same location in a video frame and adjacent video frames.

[0047] Taking pull-ups as an example, spatial vectors can be used to determine whether the posture is with arms outstretched or with the chin over the bar; motion vectors can be used to determine whether the body is in an upward or downward motion.

[0048] For example, skeletal keypoints may include the coordinates of skeletal joints in a video image.

[0049] The coordinate information of skeletal key points can be the image coordinate information of the skeletal key points (such as pixel coordinate information).

[0050] It should be noted that, for scenarios where video capture equipment is fixedly deployed, the coordinate information of the skeletal key points can also be the camera coordinate information of the skeletal key points (i.e., the coordinate information in the coordinate system of the video capture equipment).

[0051] Step S120: Based on the spatial vector and motion vector corresponding to the video frame, query the target standard action feature that matches the video frame from the standard action base library; wherein, the standard action base library stores standard action features and labels for standard action features, and the standard action features include spatial vectors and motion vectors of different states of standard actions.

[0052] In this embodiment of the application, a standard action base library can be constructed based on video data of standard actions that require timing and counting. The standard action base library can store standard action features and labels for the standard action features.

[0053] The standard action features can include spatial vectors and motion vectors of different states of the standard action.

[0054] Taking the pull-up as an example of a standard action that requires timing and counting, the different states of the standard pull-up can include the state with arms extended, the state with the chin over the bar, etc.

[0055] Based on video data of different states of a standard action, the spatial vectors and motion vectors of different states of the standard action can be determined, and corresponding standard action features can be generated.

[0056] For example, labels for standard action features are used to identify the state of a standard action.

[0057] For example, the label for the state of arms being fully extended during a pull-up can be 0, while the label for the state of the chin passing over the bar can be 1.

[0058] For example, different standard action libraries can be built for different types of actions.

[0059] For example, different standard movement libraries can be built for rope skipping and pull-ups.
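One way to picture the base library described above is as a mapping from action type to a list of labeled features. The following is a minimal sketch under stated assumptions: the class name `StandardActionFeature`, the field names, the dictionary keys, and all numeric values are hypothetical placeholders and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StandardActionFeature:
    """One entry of a standard action library (field names are illustrative)."""
    spatial: Tuple[float, ...]  # concatenated spatial vectors for one state
    motion: Tuple[float, ...]   # concatenated motion vectors for one state
    label: int                  # state label, e.g. 0 = arms extended, 1 = chin over bar


# One library per action type; keys and numbers are made-up placeholders.
LIBRARIES: Dict[str, List[StandardActionFeature]] = {
    "pull_up": [
        StandardActionFeature(spatial=(0.0, 1.0), motion=(0.0, 0.0), label=0),
        StandardActionFeature(spatial=(1.0, 0.0), motion=(0.0, 0.0), label=1),
    ],
    "rope_skipping": [],
}


def library_for(action_type: str) -> List[StandardActionFeature]:
    """Select the library that corresponds to the chosen action type."""
    return LIBRARIES[action_type]
```

Keeping one list per action type means that supporting a new action only requires adding entries, not retraining a model, which is the scalability argument the application makes.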

[0060] In this embodiment of the application, after determining the spatial vector and motion vector corresponding to the video frame in the manner described above, the standard motion features (which can be referred to as target standard motion features) that match the video frame can be queried from the standard motion database based on the spatial vector and motion vector corresponding to the video frame.

[0061] For example, when querying the target standard action features, the target standard action features can be queried from the standard action base library corresponding to the current action type based on the action type of the current action.

[0062] For example, the action type can be determined based on a detected action type selection instruction.

[0063] For example, a terminal device that uses the action timing and counting scheme provided in the embodiments of this application to perform action timing and counting can provide an action type selection interface so that the user can select an action type in the selection interface and determine the selected action type according to the user's selection instruction.

[0064] Step S130: Determine the tag corresponding to the video frame based on the query results.

[0065] In this embodiment of the application, the tag corresponding to the video frame can be determined based on the query results of the target standard action features.

[0066] Step S140: Based on the tag sequence corresponding to the video to be detected, perform action timing and counting.

[0067] In this embodiment of the application, for any video frame in the video to be detected, the tag corresponding to the video frame can be determined in the manner described in steps S110 to S130. Then, action timing and counting (timing and / or counting) can be performed based on the tag sequence corresponding to the video to be detected.

[0068] It should be noted that, in the embodiments of this application, when a video frame contains multiple detected targets, action timing and counting can be performed for each detected target in the manner described in the above embodiments. Alternatively, the detected target that requires action timing and counting can be selected from the multiple detected targets, and action timing and counting can be performed for that target in the manner described in the above embodiments. For example, the target can be selected based on a received selection instruction; the specific implementation is not limited.

[0069] It can be seen that, in the method illustrated in Figure 1, target detection and tracking are performed on the video to be detected. Using a trained pose estimation model, the skeletal key points of the detected target in the video frame are extracted. Based on the coordinate information of the extracted skeletal key points, the spatial vector and motion vector corresponding to the video frame are determined. Then, based on the spatial vector and motion vector corresponding to the video frame, the target standard action features matching the video frame are queried from the standard action database, and the label corresponding to the video frame is determined from the query results. Finally, action timing and counting are performed based on the label sequence corresponding to the video to be detected. Because the spatial vector and motion vector are determined from the coordinate information of the skeletal key points, the spatial vectors and motion vectors of the different states of the standard action serve as a general feature template for action timing and counting. Matching against this standard action feature template yields general-purpose action timing and counting, which improves the functional scalability of the solution and reduces the difficulty of function reuse.

[0070] In some embodiments, determining the spatial vector and motion vector corresponding to the video frame based on the extracted coordinate information of the skeletal key points may include:

[0071] Based on the coordinate information of adjacent bone points in the skeletal keypoints extracted from the video frame, the spatial vector corresponding to the video frame is determined; and,

[0072] Based on the skeletal keypoints extracted from the video frame, and the coordinate information of the skeletal keypoints at the same location extracted from the skeletal keypoints in the adjacent video frames, the motion vector corresponding to the video frame is determined.

[0073] For example, the spatial vector corresponding to a video frame can be determined based on the coordinate information of adjacent bone points in the skeletal keypoints extracted from the video frame.

[0074] Taking the skeletal key points shown in Figure 2 as an example, the spatial vector corresponding to the video frame can be determined based on the coordinate information of adjacent skeletal points. For example, a spatial vector determined based on the coordinate information of key point 0 and key point 1, a spatial vector determined based on the coordinate information of key point 1 and key point 2, ..., a spatial vector determined based on the coordinate information of key point 11 and key point 13, etc.

[0075] In this case, assuming the coordinates of key point 2 are k2(x2, y2) and the coordinates of key point 4 are k4(x4, y4), the spatial vectors corresponding to key points 2 and 4 can be represented by v = (x4-x2, y4-y2).
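The spatial vector computation above can be sketched in a few lines. This is an illustration only: the function name `spatial_vectors`, the edge list `EXAMPLE_EDGES`, and the sample coordinates are assumptions, and the real set of adjacent pairs depends on the skeleton layout of Figure 2.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical adjacency list: pairs of key point indices treated as adjacent.
EXAMPLE_EDGES: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 4)]


def spatial_vectors(keypoints: Dict[int, Point],
                    edges: List[Tuple[int, int]]) -> List[Point]:
    """Return one 2D vector per adjacent pair: end point minus start point."""
    vectors = []
    for start, end in edges:
        x_s, y_s = keypoints[start]
        x_e, y_e = keypoints[end]
        vectors.append((x_e - x_s, y_e - y_s))
    return vectors


# Made-up pixel coordinates for a single frame.
frame_keypoints = {0: (100.0, 50.0), 1: (100.0, 80.0),
                   2: (80.0, 85.0), 4: (70.0, 120.0)}
print(spatial_vectors(frame_keypoints, EXAMPLE_EDGES))
# [(0.0, 30.0), (-20.0, 5.0), (-10.0, 35.0)]
```

The last entry corresponds to the pair (key point 2, key point 4) and follows the form v = (x4-x2, y4-y2) given in the text.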

[0076] For example, the motion vector corresponding to a video frame can be determined based on the skeletal keypoints extracted from the video frame, and the coordinate information of the skeletal points at the same location in the skeletal keypoints of adjacent video frames.

[0077] For example, suppose the current video frame is frame i, the left-shoulder key point in this video frame is k_i(x_i, y_i), the left-shoulder key point in frame i-1 is k_{i-1}(x_{i-1}, y_{i-1}), and the left-shoulder key point in frame i+1 is k_{i+1}(x_{i+1}, y_{i+1}). The motion vector corresponding to the left shoulder in the current video frame can then be expressed by (x_i - x_{i-1}, y_i - y_{i-1}) and (x_{i+1} - x_i, y_{i+1} - y_i).
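A minimal sketch of a per-key-point motion vector follows. It assumes the motion vector is the displacement of the same key point between adjacent frames (previous to current, and current to next); that sign convention, the function name `motion_vectors`, the key point index 5 for the left shoulder, and the coordinates are all illustrative assumptions.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def motion_vectors(prev_kp: Dict[int, Point],
                   cur_kp: Dict[int, Point],
                   next_kp: Dict[int, Point]) -> Dict[int, Tuple[Point, Point]]:
    """For each key point, return (displacement from frame i-1 to i,
    displacement from frame i to i+1)."""
    result = {}
    for idx, (x, y) in cur_kp.items():
        px, py = prev_kp[idx]
        nx, ny = next_kp[idx]
        result[idx] = ((x - px, y - py), (nx - x, ny - y))
    return result


LEFT_SHOULDER = 5  # hypothetical index
mv = motion_vectors({LEFT_SHOULDER: (200.0, 310.0)},
                    {LEFT_SHOULDER: (200.0, 300.0)},
                    {LEFT_SHOULDER: (200.0, 288.0)})
print(mv[LEFT_SHOULDER])  # ((0.0, -10.0), (0.0, -12.0))
```

In pixel coordinates the y axis usually points downward, so the negative y components here would indicate that the shoulder is moving upward in the image.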

[0078] For example, when querying the target standard features, the spatial vectors corresponding to all skeletal key points in the video frame can be concatenated, and the motion vectors corresponding to all skeletal key points can likewise be concatenated, before the target standard features are queried.

[0079] In some embodiments, the above-mentioned querying of target standard motion features matching the video frame from the standard motion database based on the spatial vector and motion vector corresponding to the video frame may include:

[0080] For any standard motion feature, determine the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature.

[0081] If the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature is less than a first distance threshold, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature is less than a second distance threshold, then the standard motion feature is determined to be the target standard motion feature that matches the video frame.

[0082] For example, when querying target standard motion features, the distance (such as the cosine distance) between the spatial vector corresponding to the video frame and the spatial vector included in each standard motion feature can be determined, as can the distance between the motion vector corresponding to the video frame and the motion vector included in each standard motion feature.

[0083] For example, it is possible to traverse each standard motion feature in the standard motion base library. For each traversed standard motion feature, the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature can be determined, as well as the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature.

[0084] If the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature is less than a distance threshold (which can be called the first distance threshold), and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature is less than a distance threshold (which can be called the second distance threshold), then the standard motion feature is determined to be the target standard motion feature that matches the video frame.
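The two-threshold matching rule can be sketched as follows, using cosine distance as the example metric mentioned in the text. The function names, the tuple layout of a library entry, the threshold values, and the treatment of zero-length vectors are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (spatial vector, motion vector, label): a hypothetical library entry layout.
Feature = Tuple[Sequence[float], Sequence[float], int]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; zero-length vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: the source does not say how to handle this case
    return 1.0 - dot / (norm_a * norm_b)


def match_features(spatial: Sequence[float], motion: Sequence[float],
                   library: List[Feature],
                   first_threshold: float, second_threshold: float) -> List[int]:
    """Traverse the library and return the labels of all matching features."""
    matched = []
    for lib_spatial, lib_motion, label in library:
        if (cosine_distance(spatial, lib_spatial) < first_threshold
                and cosine_distance(motion, lib_motion) < second_threshold):
            matched.append(label)
    return matched


library = [([0.0, 1.0, 0.0, 1.0], [0.0, -1.0], 0),
           ([1.0, 0.0, 1.0, 0.0], [0.0, 1.0], 1)]
print(match_features([0.0, 2.0, 0.0, 2.0], [0.0, -3.0], library, 0.1, 0.1))  # [0]
```

Because cosine distance ignores vector length, the query above matches the first entry even though its components are scaled, which is one reason such a metric can tolerate differences in body size or camera distance.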

[0085] In some embodiments, determining the tag corresponding to the video frame based on the query result includes:

[0086] If a target standard motion feature matching the video frame is found, the label corresponding to the video frame is determined based on the label of the target standard motion feature matching the video frame.

[0087] If no target standard action feature matching the video frame is found, the label corresponding to the video frame is determined to be an intermediate state label.

[0088] For example, the standard motion features stored in the standard motion base library usually cover only some of the states of a standard motion.

[0089] Taking the pull-up as an example, when storing the features of the standard movement, the features of the arms-extended state (such as the spatial vector and motion vector mentioned above) and the features of the chin-over-bar state are usually saved. The complete pull-up movement also includes intermediate states between these two states, and the features of an intermediate state may differ significantly from the features of both.

[0090] Therefore, when querying the standard motion feature that matches the video frame from the standard motion base in the manner described above, the query result may include either finding a target standard motion feature that matches the video frame or not finding a target standard motion feature that matches the video frame.

[0091] If a target standard motion feature matching the video frame is found, the corresponding label for the video frame can be determined based on the label of the target standard motion feature matching the video frame.

[0092] If no target standard action feature matching the video frame is found, the label corresponding to the video frame can be determined to be an intermediate state (also known as a background state) label.

[0093] Taking pull-ups as an example, assuming the label for the state with arms outstretched is 0 and the label for the state with the chin over the bar is 1, then the label for the intermediate state can be a label other than 0 and 1, such as -1.

[0094] In one example, determining the label corresponding to the video frame based on the labels of the target standard action features that match the retrieved video frame may include:

[0095] Based on the tags of the target standard action features that match the video frame, determine the number of matches for each tag, and identify the tag with the most matches as the tag corresponding to the video frame.

[0096] For example, considering that different targets may have certain differences when performing the same action, in order to improve the accuracy of action timing and counting, when building a standard action base library, for the same action, the standard action features of the same label can include the standard action features of multiple different targets when performing the action.

[0097] Accordingly, the standard action base library can include multiple standard action features with the same label.

[0098] Taking the pull-up exercise as an example again, there can be multiple standard movement features labeled 1 and multiple standard movement features labeled 0. For example, the number of standard movement features labeled 1 and the number labeled 0 in the standard movement base library can both be 50.

[0099] Furthermore, the match against any single standard motion feature may be erroneous. To improve the accuracy of the label determined for a video frame, the spatial vector and motion vector corresponding to the video frame can be compared with every standard motion feature in the standard motion base library to determine the target standard motion features. If target standard motion features exist, the number of matches for each label is determined from the labels of the target standard motion features that match the video frame, and the label with the most matches is taken as the label corresponding to the video frame.
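The label decision for a single frame, combining the intermediate-state rule with the majority vote, might look like the sketch below. The function name `frame_label` and the constant `INTERMEDIATE_LABEL` are hypothetical, and the value -1 follows the pull-up example in the text. Tie-breaking is not specified in the source; here a tie goes to the label that was matched first.

```python
from collections import Counter
from typing import List

INTERMEDIATE_LABEL = -1  # example value from the text for the background state


def frame_label(matched_labels: List[int]) -> int:
    """Pick the most frequently matched label, or the intermediate label if none matched."""
    if not matched_labels:
        return INTERMEDIATE_LABEL
    return Counter(matched_labels).most_common(1)[0][0]


print(frame_label([1, 0, 1]))  # 1
print(frame_label([]))         # -1
```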

[0100] In some embodiments, the above-mentioned action timing and counting based on the tag sequence corresponding to the video to be detected may include:

[0101] Based on the label sequence corresponding to the video to be detected, and after determining that all states of the standard action have occurred, the action timing or counting is performed.

[0102] For example, if the tag sequence corresponding to the video to be detected is determined in the manner described above, it is possible to determine whether all states of the standard action (states recorded in the standard action database) have occurred based on the determined tag sequence.

[0103] Once it is determined that all states of a standard action have occurred, action timing or counting can be performed.

[0104] In one example, after determining that all states of a standard action have occurred based on the tag sequence corresponding to the video to be detected, action timing or counting is performed, including:

[0105] For a standard action that includes multiple different states, the action count is incremented by 1 if the label of each of the multiple different states appears at least N1 consecutive times in the label sequence; N1≥2.

[0106] For a standard action consisting of a single state, timing begins when the label for that state appears at least N2 consecutive times in the label sequence, and stops when the label for that state is absent at least N3 consecutive times; N2≥2, N3≥2.

[0107] For example, for a standard action that includes multiple different states (which can be called a dynamic action), such as rope skipping or pull-ups, the action can be determined to have occurred once if the labels of each of the multiple different states appear in sequence in the label sequence.

[0108] For example, considering real-world scenarios, the same state may appear in multiple consecutive frames during an action. For instance, for a pull-up, the arms may be outstretched for multiple consecutive frames.

[0109] Therefore, in order to avoid errors in action counting due to incorrect label determination and to improve the accuracy of action counting, for standard actions that include multiple different states, the action count is incremented by 1 if the labels of each of the multiple different states appear consecutively at least N1 times in the label sequence; N1≥2.
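
The counting rule for dynamic actions can be illustrated with a minimal Python sketch. It assumes the states of one repetition are supplied in their expected order (the hypothetical parameter `state_order`), collapses the label sequence into runs, ignores any run shorter than N1 as label noise, and increments the count each time the confirmed runs pass through every state in order. The function name, the parameter names, and the restart rule applied when a repetition is interrupted are assumptions made for illustration, not details taken from this application.

```python
from itertools import groupby
from typing import Hashable, Iterable, Sequence


def count_repetitions(labels: Iterable[Hashable],
                      state_order: Sequence[Hashable],
                      n1: int = 2) -> int:
    """Count repetitions of a multi-state action from per-frame labels."""
    if n1 < 2:
        raise ValueError("n1 must be at least 2")
    if len(state_order) < 2:
        raise ValueError("a dynamic action needs at least two states")

    count = 0
    position = 0  # index in state_order of the state we are waiting for
    for label, run in groupby(labels):
        run_length = sum(1 for _ in run)
        if run_length < n1:
            continue  # run too short to trust; treat it as label noise
        if label == state_order[position]:
            position += 1
            if position == len(state_order):
                count += 1      # every state was seen in order
                position = 0
        elif label == state_order[0]:
            position = 1        # assumed rule: an interrupted cycle restarts
        # any other confirmed label (e.g. an intermediate state) is ignored
    return count
```

With labels 1 (chin over the bar) and 0 (arms outstretched) and `state_order=(1, 0)`, a sequence such as `1, 1, 1, -1, 0, 0` yields one repetition, while single-frame flickers between 1 and 0 yield none.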

[0110] For example, for standard actions that include actions in a single state (which may be called static actions), such as planking, timing can begin when the label for that state appears in the label sequence and stop when the label for that state does not appear.

[0111] To avoid timing errors caused by incorrect label determination and to improve the accuracy of action timing, for a standard action that includes a single state, timing begins when the label for that state appears at least N2 consecutive times in the label sequence, and stops when the label for that state has been absent at least N3 consecutive times; N2≥2, N3≥2.
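
The timing rule for static actions can be sketched in the same way. The following Python function is a minimal illustration under stated assumptions: the frame rate `fps` is known and constant, the timer counts frames from the frame on which the N2-th consecutive label is observed, and frames inside a short gap (fewer than N3 consecutive absences) still count as held time. These choices, along with all names, are hypothetical and not specified by this application.

```python
from typing import Hashable, Iterable


def time_static_action(labels: Iterable[Hashable],
                       state_label: Hashable,
                       fps: float,
                       n2: int = 2,
                       n3: int = 2) -> float:
    """Return the total timed duration, in seconds, of a single-state action."""
    if n2 < 2 or n3 < 2:
        raise ValueError("n2 and n3 must be at least 2")

    running = False
    present_run = 0   # consecutive frames carrying state_label
    absent_run = 0    # consecutive frames not carrying state_label
    timed_frames = 0
    for label in labels:
        if label == state_label:
            present_run += 1
            absent_run = 0
        else:
            absent_run += 1
            present_run = 0

        if not running and present_run >= n2:
            running = True    # label confirmed: start timing
        elif running and absent_run >= n3:
            running = False   # label absent long enough: stop timing

        if running:
            timed_frames += 1
    return timed_frames / fps
```

Requiring N2 and N3 consecutive observations acts as a debounce: one mislabeled frame neither starts nor stops the timer.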

[0112] In some embodiments, the above-mentioned standard action library can be constructed in the following ways:

[0113] The standard action to be timed and counted is decomposed into states to obtain at least one state of the standard action;

[0114] For the video frame corresponding to any state of the standard action, the skeletal key points are extracted using a trained pose estimation model.

[0115] Based on the coordinate information of adjacent skeletal keypoints in the video frame, determine the spatial vector of this state; and,

[0116] Based on the video frame and the coordinate information of the skeletal key points of the same part in the adjacent video frames, the motion vector of this state is determined.

[0117] The spatial vector and motion vector of the state are used as the standard action features of the state, and the standard action features are bound to the label corresponding to the state and stored in the standard action base library.

[0118] For example, in order to build a standard movement library, for movements to be timed and counted, such as pull-ups, rope skipping, or planks, the standard movements can be broken down into states.

[0119] For example, pull-ups can include the state with arms outstretched and the state with the chin over the bar; rope skipping can include the state of jumping upwards and the state of falling downwards.

[0120] For example, a standard action can include one or more states.

[0121] For example, for dynamic movements such as pull-ups and rope skipping, the standard movement can include multiple states; for static movements such as planks, the standard movement can include one state.

[0122] For any state of a standard action, the corresponding video frame can be obtained, and the skeletal key points can be extracted using a trained pose estimation model.

[0123] On the one hand, the spatial vector of the state can be determined based on the coordinate information of adjacent skeletal key points in the video frame.

[0124] On the other hand, the motion vector of the state can be determined based on the coordinate information of the skeletal key points of the same part in the video frame and the adjacent video frames of the video frame.

[0125] Furthermore, the spatial vector and motion vector of this state can be used as the standard action features of this state, and these standard action features can be bound to the label corresponding to this state and stored in the standard action library.

[0126] In one example, the action timing and counting method provided in this application embodiment may further include:

[0127] Based on the distance between the spatial vectors of states with the same label in the standard action base library, determine the distance threshold corresponding to the spatial vector of the state of that label; and,

[0128] Based on the distance between the motion vectors of states with the same label in the standard action base library, determine the distance threshold corresponding to the motion vector of the state of that label.

[0129] For example, in order to improve the accuracy and rationality of action state recognition, and thus improve the accuracy of action timing and counting, a distance threshold (such as the first distance threshold mentioned above) corresponding to the spatial vector of the state of the same label can be determined based on the distance (such as cosine distance) between the spatial vectors of the states of the same label in the standard feature base library.

[0130] For example, assuming that the standard action feature of label 1 includes standard action features 1 to 3 (that is, standard action features 1 to 3 correspond to the same state of the same standard action), then the distance between the spatial vector of standard action feature 1 and the spatial vector of standard action feature 2, the distance between the spatial vector of standard action feature 1 and the spatial vector of standard action feature 3, and the distance between the spatial vector of standard action feature 2 and the spatial vector of standard action feature 3 can be calculated respectively. Based on the distance between the spatial vectors of each standard action feature, the distance threshold corresponding to the spatial vector of label 1 can be determined.

[0131] It should be noted that a given state usually has multiple spatial vectors. Taking the skeletal key points shown in Figure 2 as an example, the spatial vectors in any state can include the spatial vectors corresponding to key point 0 and key point 1, key point 1 and key point 2, key point 1 and key point 3, ..., key point 11 and key point 13, key point 10 and key point 12, etc. When calculating the distance between the spatial vectors of states with the same label, the distance must therefore be calculated separately for each corresponding pair of spatial vectors. For example, to calculate the distance between the spatial vector of standard action feature 1 and the spatial vector of standard action feature 2, calculate the distance between the spatial vector corresponding to key points 0 and 1 in standard action feature 1 and the spatial vector corresponding to key points 0 and 1 in standard action feature 2, the distance between the spatial vector corresponding to key points 1 and 2 in standard action feature 1 and the spatial vector corresponding to key points 1 and 2 in standard action feature 2, ..., and the distance between the spatial vector corresponding to key points 10 and 12 in standard action feature 1 and the spatial vector corresponding to key points 10 and 12 in standard action feature 2. The sum of all these distances is the distance between the spatial vectors of the two states with the same label.
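
The per-vector summation described above can be sketched in Python with NumPy. This is a minimal illustration, assuming that "cosine distance" means one minus the cosine similarity (the application does not spell out the formula) and that both features list their spatial vectors in the same key point pair order. The function names are hypothetical, and zero-length vectors are rejected because their cosine distance is undefined.

```python
import numpy as np


def cosine_distance(u, v) -> float:
    """Cosine distance between two vectors, taken here as 1 - cosine similarity."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero-length vector")
    return 1.0 - float(np.dot(u, v) / denom)


def spatial_distance(vectors_a, vectors_b) -> float:
    """Sum the cosine distances of corresponding spatial vectors of two features."""
    if len(vectors_a) != len(vectors_b):
        raise ValueError("both features must contain the same number of vectors")
    return float(sum(cosine_distance(u, v) for u, v in zip(vectors_a, vectors_b)))
```

The same function can be applied to the motion vectors of two features, provided the vectors are paired by key point.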

[0132] Similarly, the distance threshold corresponding to the motion vector of the state of a given label (such as the second distance threshold mentioned above) can be determined based on the distance between the motion vectors of the states with that label in the standard action base library.

[0133] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, the technical solutions provided in the embodiments of this application are described below in conjunction with specific embodiments.

[0134] In this embodiment, human motion timing and counting are taken as an example.

[0135] In this embodiment, the motion timing and counting process can include three parts: human pose estimation model training, base database construction (standard motion base database construction), and motion timing and counting.

[0136] The following sections will explain each part separately.

[0137] I. Training of Human Pose Estimation Model

[0138] For example, a convolutional neural network such as HRNet can be trained as the human pose estimation model.

[0139] For example, as shown in Figure 3, the training process for the human pose estimation model may include:

[0140] S300. Annotate the key points in the human skeleton data.

[0141] For example, a schematic diagram of the key point annotations on the human skeleton is shown in Figure 2.

[0142] S310. Using the annotated data, train the initial model to obtain a trained human pose estimation model.

[0143] For example, during model training, the model can be optimized based on the coordinate differences between the predicted human skeletal key points and the annotated human skeletal key points, until the model converges.

[0144] II. Base Library Construction

[0145] As shown in Figure 4, the base library construction process may include:

[0146] S400. Decompose the standard action to be timed and counted into states.

[0147] For example, the pull-up process can be broken down into two states: arms extended and chin over the bar. A schematic diagram is shown in Figure 6A.

[0148] S410. Use a human pose estimation model to extract human skeletal key points in different states of a standard action, determine the spatial vectors and motion vectors in different states of a standard action, and use the spatial vectors and motion vectors in different states as features of the standard action, bind them with the labels corresponding to the states, and store them in the base database.

[0149] For example, the coordinates of adjacent skeletal keypoints in a video frame can be converted into spatial vectors to represent the current action posture; the coordinates of skeletal keypoints in the same part of the video frame and adjacent video frames can be converted into motion vectors to describe the action motion state. The spatial vectors and motion vectors (which can be called spatiotemporal vectors) are used as standard action features of the current state and are bound to tags and stored in the database.

[0150] For example, assuming the coordinates of the left-shoulder skeletal key point are k2(x2, y2) and the coordinates of the left-elbow skeletal key point are k4(x4, y4), the segment from the left shoulder to the left elbow can be represented by the spatial vector v = (x4 - x2, y4 - y2). Similarly, given the left-shoulder skeletal key point in frame i, in frame i-1, and in frame i+1, the motion information of the left shoulder in the current state can be represented by two motion vectors, one formed with frame i-1 and one formed with frame i+1. Concatenating the spatial vectors and motion vectors of all skeletal key points yields the feature template used for matching.
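
The construction of spatial and motion vectors can be sketched as follows in Python with NumPy. The key point pairs in `BONES` reuse indices mentioned in this description (0-1, 1-2, 1-3, and left shoulder 2 to left elbow 4), but the list is only a partial, illustrative skeleton. The direction of each motion vector (later frame minus earlier frame), the array layout, and all function names are assumptions, since the application does not fix them.

```python
import numpy as np

# Pairs of adjacent key point indices (illustrative subset of a skeleton).
BONES = [(0, 1), (1, 2), (1, 3), (2, 4)]


def spatial_vectors(keypoints: np.ndarray, bones=BONES) -> np.ndarray:
    """One vector per bone: end key point minus start key point, shape (B, 2)."""
    starts = keypoints[[a for a, _ in bones]]
    ends = keypoints[[b for _, b in bones]]
    return ends - starts


def motion_vectors(prev_kp: np.ndarray,
                   curr_kp: np.ndarray,
                   next_kp: np.ndarray) -> np.ndarray:
    """Two vectors per key point, shape (K, 2, 2).

    Index [k, 0] is the displacement from frame i-1 to frame i and
    index [k, 1] is the displacement from frame i to frame i+1
    (assumed direction: later frame minus earlier frame).
    """
    return np.stack([curr_kp - prev_kp, next_kp - curr_kp], axis=1)


def frame_feature(prev_kp, curr_kp, next_kp, bones=BONES):
    """Spatial and motion vectors that together describe frame i."""
    return spatial_vectors(curr_kp, bones), motion_vectors(prev_kp, curr_kp, next_kp)
```

Here each `*_kp` argument is an array of shape (K, 2) holding the (x, y) coordinates of the K key points of one frame.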

[0151] S420. Calculate the cosine distance of the same label features in the base database, and take the average value as the threshold of the current label.

[0152] For example, after extracting the spatial vectors and motion vectors of all arms-extended states during pull-ups, the cosine distance is calculated between each pair of these states, and the mean of all the cosine distances is used as the threshold for the arms-extended state.

[0153] For example, assuming that video frame 1 and video frame 2 are both video frames corresponding to the state of outstretched arms, and the spatial vectors in the state of outstretched arms include spatial vectors 1 to 3, then we can calculate the cosine distance between spatial vector 1 in video frame 1 and spatial vector 1 in video frame 2 (assumed to be cosine distance 1), the cosine distance between spatial vector 2 in video frame 1 and spatial vector 2 in video frame 2 (assumed to be cosine distance 2), and the cosine distance between spatial vector 3 in video frame 1 and spatial vector 3 in video frame 2 (assumed to be cosine distance 3). The sum of cosine distance 1, cosine distance 2, and cosine distance 3 is determined as the cosine distance between the spatial vectors of video frame 1 and the spatial vectors of video frame 2 (assumed to be cosine distance 12).

[0154] If the video frame corresponding to the outstretched arms state also includes video frame 3, then the cosine distance 13 (the cosine distance between the spatial vector of video frame 1 and the spatial vector of video frame 3) and the cosine distance 23 (the cosine distance between the spatial vector of video frame 2 and the spatial vector of video frame 3) can be calculated in the same way. The average value of the cosine distance 12, cosine distance 13 and cosine distance 23 is used as the distance threshold corresponding to the spatial vector in the outstretched arms state (such as the first distance threshold mentioned above).
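
The threshold computation in the example above (pairwise distances 12, 13, and 23, then their average) can be sketched in Python. This is a minimal illustration, assuming cosine distance is one minus cosine similarity, that every sample of a label stores its vectors as an array of shape (n, 2) in the same order, and that no vector has zero length. `summed_cosine_distance` and `label_threshold` are hypothetical names.

```python
from itertools import combinations

import numpy as np


def summed_cosine_distance(a, b) -> float:
    """Sum of row-wise cosine distances between two (n, 2) arrays of vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    if np.any(norms == 0.0):
        raise ValueError("cosine distance is undefined for a zero-length vector")
    cosines = np.sum(a * b, axis=1) / norms
    return float(np.sum(1.0 - cosines))


def label_threshold(samples) -> float:
    """Mean of the pairwise distances between all samples that share one label."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("at least two samples are needed to derive a threshold")
    return float(np.mean([summed_cosine_distance(a, b) for a, b in pairs]))
```

Calling `label_threshold` once on the spatial vectors and once on the motion vectors of a label gives the two per-label thresholds.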

[0155] The distance threshold corresponding to the motion vector can be obtained similarly.

[0156] III. Action Timing and Counting

[0157] As shown in Figure 5, the action timing and counting process may include:

[0158] S500. Acquire the video to be detected and perform target detection and tracking on the video frames to be detected.

[0159] S510. Using the trained human pose estimation model, extract the skeletal key points of the detected target in the video frame, and determine the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points.

[0160] S520. Based on the spatial vector and motion vector corresponding to the video frame, match them with the standard motion features in the base database, vote on the matched tags, and take the tag with the most votes as the tag of the video frame.

[0161] For example, for any standard motion feature, if the distance between the spatial vector included in the standard motion feature and the spatial vector corresponding to the video frame is less than a distance threshold (such as the first distance threshold mentioned above), and the distance between the motion vector included in the standard motion feature and the motion vector corresponding to the video frame is less than a distance threshold (such as the second distance threshold mentioned above), then the video frame is determined to match the standard motion feature, and the matching vote of the standard motion feature is incremented by 1.
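
Step S520 can be sketched in Python as threshold matching followed by a majority vote. The sketch assumes per-label thresholds stored in dictionaries, cosine distance defined as one minus cosine similarity and summed over corresponding vectors, and a hypothetical constant `INTERMEDIATE` for frames that match nothing. The class and function names are illustrative, and the tie-breaking behavior (first label reaching the top count wins) is an assumption because the application does not address ties.

```python
from collections import Counter
from dataclasses import dataclass

import numpy as np

INTERMEDIATE = -1  # hypothetical label for frames that match no standard feature


@dataclass
class StandardFeature:
    label: int
    spatial: np.ndarray  # shape (B, 2)
    motion: np.ndarray   # shape (M, 2)


def summed_cosine_distance(a, b) -> float:
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    if np.any(norms == 0.0):
        raise ValueError("cosine distance is undefined for a zero-length vector")
    return float(np.sum(1.0 - np.sum(a * b, axis=1) / norms))


def label_frame(spatial, motion, library, spatial_thr, motion_thr) -> int:
    """Vote over all library features whose two distances are below threshold."""
    votes = Counter()
    for feature in library:
        d_spatial = summed_cosine_distance(spatial, feature.spatial)
        d_motion = summed_cosine_distance(motion, feature.motion)
        if d_spatial < spatial_thr[feature.label] and d_motion < motion_thr[feature.label]:
            votes[feature.label] += 1
    if not votes:
        return INTERMEDIATE  # no match: intermediate state
    return votes.most_common(1)[0][0]
```

Voting over every matching feature, rather than accepting the first match, reduces the effect of a single erroneous library entry on the frame label.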

[0162] S530. Based on the tag sequence corresponding to the video to be detected, determine whether all states of the standard action have occurred and count or time them.

[0163] For example, if the label for the pull-up with arms outstretched is 0 and the label for the chin-over-bar state is 1, then the pull-up movement can be counted based on whether state 0 and state 1 occur in sequence.

[0164] As shown in Figure 6B, for pull-ups, if the label sequence first shows at least two consecutive "1"s and then at least two consecutive "0"s, the pull-up count can be incremented by 1.

[0165] The method provided in this application has been described above. The apparatus provided in this application is described below:

[0166] Figure 7 is a schematic diagram of the structure of an action timing and counting device provided in an embodiment of this application. As shown in Figure 7, the action timing and counting device may include:

[0167] Acquisition unit 710 is used to acquire the video to be detected;

[0168] Detection unit 720 is used to perform target detection and tracking on the video to be detected;

[0169] Extraction unit 730 is used to extract the skeletal key points of the detected target in the video frame using a trained pose estimation model;

[0170] The first determining unit 740 is used to determine the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points; wherein, the spatial vector is used to determine the action posture, and the motion vector is used to describe the action motion state.

[0171] The query unit 750 is used to query the target standard action feature that matches the video frame from the standard action base library based on the spatial vector and motion vector corresponding to the video frame; wherein, the standard action base library stores standard action features and labels for standard action features, and the standard action features include spatial vectors and motion vectors of different states of standard actions.

[0172] The second determining unit 760 is used to determine the tag corresponding to the video frame based on the query result;

[0173] The timing and counting unit 770 is used to perform action timing and counting based on the tag sequence corresponding to the video to be detected.
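
The way units 710 to 770 hand data to one another can be sketched as a small Python pipeline. This is only a structural illustration: each unit is modeled as a pluggable callable with a hypothetical signature, key points are collected for the whole video before vectors are built so that adjacent frames are available for motion vectors, and none of the names come from this application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class ActionTimingCountingDevice:
    acquire: Callable[[], Iterable[Any]]             # acquisition unit 710
    detect_and_track: Callable[[Any], Any]           # detection unit 720
    extract_keypoints: Callable[[Any, Any], Any]     # extraction unit 730
    to_vectors: Callable[[List[Any], int], Any]      # first determining unit 740
    query: Callable[[Any], List[Any]]                # query unit 750
    decide_label: Callable[[List[Any]], Any]         # second determining unit 760
    time_and_count: Callable[[List[Any]], Any]       # timing and counting unit 770

    def run(self) -> Any:
        # Extract key points for every frame of the tracked target.
        keypoints = [
            self.extract_keypoints(frame, self.detect_and_track(frame))
            for frame in self.acquire()
        ]
        # Label each frame, then time or count over the label sequence.
        labels = []
        for index in range(len(keypoints)):
            vectors = self.to_vectors(keypoints, index)
            matches = self.query(vectors)
            labels.append(self.decide_label(matches))
        return self.time_and_count(labels)
```

Keeping each unit behind a callable means a new standard action only requires new library entries for the query unit, while the remaining units stay unchanged.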

[0174] In some embodiments, the first determining unit 740 determines the spatial vector and motion vector corresponding to the video frame based on the extracted coordinate information of the skeletal key points, including:

[0175] Based on the coordinate information of adjacent skeletal key points among the skeletal key points extracted from the video frame, the spatial vector corresponding to the video frame is determined; and,

[0176] Based on the coordinate information of the skeletal key points of the same body part extracted from the video frame and from its adjacent video frames, the motion vector corresponding to the video frame is determined.

[0177] In some embodiments, the query unit 750 queries a standard motion feature matching the video frame from a standard motion database based on the spatial vector and motion vector corresponding to the video frame, including:

[0178] For any standard motion feature, determine the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature.

[0179] If the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard motion feature is less than a first distance threshold, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard motion feature is less than a second distance threshold, then the standard motion feature is determined to be the target standard motion feature that matches the video frame.

[0180] In some embodiments, the second determining unit 760 determines the tag corresponding to the video frame based on the query result, including:

[0181] If a target standard motion feature matching the video frame is found, the label corresponding to the video frame is determined based on the label of the target standard motion feature matching the video frame.

[0182] If no target standard action feature matching the video frame is found, the label corresponding to the video frame is determined to be an intermediate state label.

[0183] In some embodiments, the second determining unit 760 determines the tag corresponding to the video frame based on the tag of the target standard motion feature that matches the video frame, including:

[0184] Based on the tags of the target standard action features that match the video frame, determine the number of matches for each tag, and identify the tag with the most matches as the tag corresponding to the video frame.

[0185] In some embodiments, the timing and counting unit 770 performs action timing and counting based on the tag sequence corresponding to the video to be detected, including:

[0186] Based on the tag sequence corresponding to the video to be detected, if all states of the standard action have occurred, the action timing or counting is performed.

[0187] In some embodiments, the timing and counting unit 770 performs action timing or counting when it determines that all states of the standard action have occurred based on the tag sequence corresponding to the video to be detected, including:

[0188] For a standard action that includes multiple different states, the action count is incremented by 1 if the labels of each of the multiple different states appear consecutively at least N1 times in the label sequence; N1≥2.

[0189] For a standard action that includes a single state, timing begins when the label for that state appears at least N2 consecutive times in the label sequence, and stops when the label for that state has been absent at least N3 consecutive times; N2≥2, N3≥2.

[0190] In some embodiments, as shown in Figure 8, the device further includes:

[0191] Construction unit 780 is used to build the standard action base library in the following ways:

[0192] The standard action to be timed and counted is decomposed into states to obtain at least one state of the standard action;

[0193] For the video frame corresponding to any state of the standard action, the skeletal key points are extracted using a trained pose estimation model.

[0194] Based on the coordinate information of adjacent skeletal keypoints in the video frame, determine the spatial vector of this state; and,

[0195] Based on the video frame and the coordinate information of the skeletal key points of the same part in the adjacent video frames, the motion vector of this state is determined.

[0196] The spatial vector and motion vector of the state are used as the standard action features of the state, and the standard action features are bound to the label corresponding to the state and stored in the standard action base library.

[0197] In some embodiments, the construction unit 780 is further configured to determine a distance threshold corresponding to the spatial vector of the state of a label based on the distance between the spatial vectors of states with the same label in the standard action base library; and

[0198] Based on the distance between the motion vectors of states with the same label in the standard action base library, determine the distance threshold corresponding to the motion vector of the state of that label.

[0199] This application provides an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor executes the machine-executable instructions to implement the action timing and counting method described above.

[0200] Figure 9 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application. The electronic device may include a processor 901 and a memory 902 storing machine-executable instructions. The processor 901 and the memory 902 can communicate via a system bus 903. Furthermore, by reading and executing the machine-executable instructions corresponding to the action timing and counting logic in the memory 902, the processor 901 can execute the action timing and counting method described above.

[0201] The memory 902 mentioned in this document can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For example, machine-readable storage media can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.

[0202] In some embodiments, a machine-readable storage medium is also provided, such as the memory 902 in Figure 9. The machine-readable storage medium stores machine-executable instructions that, when executed by a processor, implement the action timing and counting method described above. For example, the storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, or optical data storage device.

[0203] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0204] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A method for timing and counting actions, characterized in that the method includes:
acquiring the video to be detected, and performing target detection and tracking on the video to be detected;
using a trained pose estimation model, extracting the skeletal key points of the detected target in the video frame, and determining the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points; wherein the spatial vector is used to determine the action pose, and the motion vector is used to describe the action motion state;
based on the spatial vector and motion vector corresponding to the video frame, querying the target standard action features matching the video frame from the standard action base library; wherein the standard action base library stores standard action features and the labels of the standard action features, and the standard action features include spatial vectors and motion vectors of different states of the standard action; the standard action base library is constructed in the following way: decomposing the standard action to be counted into states; using a human pose estimation model to extract human skeletal key points in different states of the standard action; determining the spatial vectors and motion vectors in different states of the standard action; and binding the spatial vectors and motion vectors of the different states, as standard action features, with the corresponding labels and storing them in the base library;
determining the label corresponding to the video frame based on the query result;
for a standard action that includes multiple different states, incrementing the action count by 1 if the labels of each of the multiple different states appear consecutively at least N1 times in the label sequence; N1≥2;
for a standard action that includes a single state, beginning timing when the label for that state appears at least N2 consecutive times in the label sequence, and stopping timing when the label for that state has been absent at least N3 consecutive times; N2≥2, N3≥2.

2. The method according to claim 1, characterized in that determining the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points includes:
determining the spatial vector corresponding to the video frame based on the coordinate information of adjacent skeletal key points among the skeletal key points extracted from the video frame; and
determining the motion vector corresponding to the video frame based on the coordinate information of the skeletal key points of the same body part extracted from the video frame and from its adjacent video frames.

3. The method according to claim 1, characterized in that querying the target standard action features matching the video frame from the standard action base library based on the spatial vector and motion vector corresponding to the video frame includes:
for any standard action feature, determining the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard action feature, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard action feature;
if the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard action feature is less than a first distance threshold, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard action feature is less than a second distance threshold, determining the standard action feature to be a target standard action feature that matches the video frame.

4. The method according to claim 1, characterized in that determining the label corresponding to the video frame based on the query result includes:
if a target standard action feature matching the video frame is found, determining the label corresponding to the video frame based on the label of the target standard action feature matching the video frame;
if no target standard action feature matching the video frame is found, determining the label corresponding to the video frame to be an intermediate state label.

5. The method according to claim 4, characterized in that determining the label corresponding to the video frame based on the label of the target standard action feature matching the video frame includes:
based on the labels of the target standard action features that match the video frame, determining the number of matches for each label, and determining the label with the most matches as the label corresponding to the video frame.

6. The method according to any one of claims 1-5, characterized in that decomposing the standard action to be counted into states, using a human pose estimation model to extract human skeletal key points in different states of the standard action, determining the spatial vectors and motion vectors in different states of the standard action, and binding the spatial vectors and motion vectors of the different states, as standard action features, with the corresponding labels and storing them in the base library includes:
decomposing the standard action to be timed and counted into states to obtain at least one state of the standard action;
for the video frame corresponding to any state of the standard action, extracting the skeletal key points using a trained pose estimation model;
determining the spatial vector of the state based on the coordinate information of adjacent skeletal key points in the video frame; and
determining the motion vector of the state based on the coordinate information of the skeletal key points of the same body part in the video frame and in its adjacent video frames;
using the spatial vector and motion vector of the state as the standard action features of the state, binding the standard action features to the label corresponding to the state, and storing them in the standard action base library.

7. The method according to claim 6, characterized in that the method further includes:
based on the distance between the spatial vectors of states with the same label in the standard action base library, determining the distance threshold corresponding to the spatial vector of the state of that label; and
based on the distance between the motion vectors of states with the same label in the standard action base library, determining the distance threshold corresponding to the motion vector of the state of that label.

8. An action timing and counting device, characterized in that it includes: an acquisition unit, used to acquire the video to be detected; a detection unit, used to perform target detection and tracking on the video to be detected; an extraction unit, used to extract the skeletal key points of the detected target in the video frame using a trained pose estimation model; a first determining unit, used to determine the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points, wherein the spatial vector is used to determine the action posture and the motion vector is used to describe the action motion state; a query unit, used to query the target standard action features that match the video frame from the standard action base library based on the spatial vector and motion vector corresponding to the video frame, wherein the standard action base library stores standard action features and the labels of these features, the standard action features include spatial vectors and motion vectors for different states of the standard action, and the standard action base library is constructed as follows: decomposing the standard action to be counted into states; using a human pose estimation model to extract human skeletal key points for the different states of the standard action; determining the spatial vectors and motion vectors for the different states of the standard action; and binding the spatial vectors and motion vectors of the different states, as standard action features, to the corresponding labels and storing them in the base library;
a second determining unit, used to determine the label corresponding to the video frame based on the query results; and a timing and counting unit, used to: for a standard action that includes multiple different states, increment the action count by 1 when the label of each of the multiple different states appears consecutively at least N1 times in the label sequence, where N1≥2; and, for a standard action that includes a single state, start timing when the label of that state appears consecutively at least N2 times in the label sequence, and stop timing when the label of that state has not appeared for at least N3 consecutive times, where N2≥2 and N3≥2.
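
The counting and timing rules of the timing and counting unit can be sketched over a per-frame label sequence as follows. This is one possible reading of the claim, not the patented implementation: the function names are hypothetical, the order in which the states occur is ignored, each run of a label is credited once at the moment it reaches N1 frames, and timing intervals are reported as frame indices.

```python
def count_actions(labels, states, n1=2):
    """Count repetitions of a multi-state action from a label sequence."""
    count, run_label, run_len, seen = 0, None, 0, set()
    for label in labels:
        run_len = run_len + 1 if label == run_label else 1
        run_label = label
        # Credit a state once per run, when the run reaches exactly n1 frames.
        if label in states and run_len == n1:
            seen.add(label)
            if seen == set(states):  # every state confirmed: one repetition
                count += 1
                seen.clear()
    return count


def time_action(labels, state, n2=2, n3=2):
    """Return (start, stop) frame-index intervals for a single-state action."""
    intervals, start, present, absent = [], None, 0, 0
    for i, label in enumerate(labels):
        if label == state:
            present, absent = present + 1, 0
            if start is None and present >= n2:
                start = i  # label seen n2 times in a row: start timing
        else:
            present, absent = 0, absent + 1
            if start is not None and absent >= n3:
                intervals.append((start, i))  # label absent n3 times: stop
                start = None
    if start is not None:  # still timing when the sequence ends
        intervals.append((start, len(labels)))
    return intervals


squat = ["up", "up", "mid", "down", "down", "down", "mid",
         "up", "up", "down", "up", "down", "down"]
print(count_actions(squat, {"up", "down"}, n1=2))

plank = ["mid", "plank", "plank", "plank", "mid", "plank", "mid", "mid", "plank"]
print(time_action(plank, "plank", n2=2, n3=2))
```

Requiring N1, N2 and N3 consecutive frames suppresses single-frame label flicker; dividing the interval lengths by the video frame rate would convert them to seconds.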

9. The device according to claim 8, characterized in that the first determining unit determining the spatial vector and motion vector corresponding to the video frame based on the coordinate information of the extracted skeletal key points includes: determining the spatial vector corresponding to the video frame based on the coordinate information of adjacent skeletal key points among the skeletal key points extracted from the video frame; and determining the motion vector corresponding to the video frame based on the coordinate information of the skeletal key points extracted from the video frame and of the skeletal key points of the same part extracted from the adjacent video frames; and/or, the query unit querying the target standard action features that match the video frame from the standard action base library based on the spatial vector and motion vector corresponding to the video frame includes: for any standard action feature, determining the distance between the spatial vector corresponding to the video frame and the spatial vector included in the standard action feature, and the distance between the motion vector corresponding to the video frame and the motion vector included in the standard action feature; and, if the distance between the spatial vectors is less than a first distance threshold and the distance between the motion vectors is less than a second distance threshold, determining the standard action feature to be a target standard action feature that matches the video frame;
and/or, the second determining unit determining the label corresponding to the video frame based on the query result includes: if a target standard action feature matching the video frame is found, determining the label corresponding to the video frame based on the label of the target standard action feature matching the video frame; and, if no target standard action feature matching the video frame is found, determining the label corresponding to the video frame to be an intermediate state label; wherein the second determining unit determining the label corresponding to the video frame based on the labels of the target standard action features that match the video frame includes: determining the number of matches for each label among the labels of the target standard action features that match the video frame, and determining the label with the most matches as the label corresponding to the video frame; and/or, the device further includes a construction unit, used to construct the standard action base library as follows: decomposing the standard action to be timed and counted into states to obtain at least one state of the standard action; for a video frame corresponding to any state of the standard action, extracting the skeletal key points using a trained pose estimation model; determining the spatial vector of the state based on the coordinate information of adjacent skeletal key points in the video frame; determining the motion vector of the state based on the coordinate information of the skeletal key points of the same part in the video frame and in the adjacent video frames; and using the spatial vector and motion vector of the state as the standard action features of the state, binding the standard action features to the label corresponding to the state, and storing them in the standard action base library;
the construction unit being further configured to determine, based on the distance between the spatial vectors of states with the same label in the standard action base library, the distance threshold corresponding to the spatial vector of the state of that label; and to determine, based on the distance between the motion vectors of states with the same label in the standard action base library, the distance threshold corresponding to the motion vector of the state of that label.
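
The two-threshold query performed by the query unit in claim 9 might look like the following sketch. Euclidean distance, the per-label threshold dictionaries, the function name `query_labels`, and the library layout are assumptions of this example; the claim only requires that both distances fall below their respective thresholds.

```python
import math


def query_labels(spatial, motion, library, spatial_thr, motion_thr):
    """Return the labels of all library features that match one frame.

    A feature matches only if BOTH the spatial-vector distance and the
    motion-vector distance are below the thresholds for its label.
    """
    matched = []
    for feature in library:
        label = feature["label"]
        d_spatial = math.dist(spatial, feature["spatial"])
        d_motion = math.dist(motion, feature["motion"])
        if d_spatial < spatial_thr[label] and d_motion < motion_thr[label]:
            matched.append(label)
    return matched


library = [
    {"label": "up", "spatial": [0, 0], "motion": [0, 0]},
    {"label": "down", "spatial": [10, 0], "motion": [0, 0]},
]
spatial_thr = {"up": 2.0, "down": 2.0}
motion_thr = {"up": 2.0, "down": 2.0}

print(query_labels([1, 0], [0, 1], library, spatial_thr, motion_thr))
# An empty result corresponds to the intermediate state label in the claim.
print(query_labels([5, 0], [0, 0], library, spatial_thr, motion_thr))
```

Checking the motion vector alongside the spatial vector lets the query separate frames that share a posture but differ in direction of movement.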

10. An electronic device, characterized in that the electronic device includes a processor and a memory, the memory storing machine-executable instructions executable by the processor, and the processor executing the machine-executable instructions to implement the method as described in any one of claims 1-7.

11. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions which, when executed by a processor, implement the method as described in any one of claims 1-7.