A taijiquan action scoring method based on deep learning, a storage medium and an electronic device
Tai Chi video stream data is processed with a 3D human body keypoint detection model and an ST-GCN backbone network, and an adjacency matrix is constructed for feature extraction. This addresses the problem of inaccurate Tai Chi movement scoring and yields a score evaluation of the rhythm and posture of Tai Chi movements.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 恒鸿达(福建)体育科技有限公司
- Filing Date
- 2024-02-05
- Publication Date
- 2026-06-26
AI Technical Summary
Existing Tai Chi recognition models cannot accurately score the rhythm and posture of each movement in Tai Chi, and there are problems with inaccurate score evaluation.
A 3D human keypoint detection model is used to process Tai Chi movement video stream data to obtain the coordinates of key points of the human skeleton. The adjacency relationship of the human skeleton is constructed through time series normalization and an adjacency matrix. The ST-GCN backbone network is used for feature extraction, and the feature map is input into the classification head and regression head to realize action classification and score evaluation.
It provides a more accurate score assessment of human posture in sports, so that the rhythm and posture of each movement in Tai Chi have a corresponding score evaluation, which improves the accuracy and robustness of movement scoring.
Figure CN118053201B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of video stream motion recognition, and more particularly to a deep learning-based method for scoring Tai Chi movements, as well as a storage medium and electronic device.
Background Technology
[0002] With the deepening integration of sports and AI, demand has emerged for AI-powered motion scoring models that serve applications across multiple fields. This demand stems from the need for accurate, objective, and real-time assessment of movement posture and rhythm. In Tai Chi, motion scoring AI models can provide crucial feedback to teachers and students by analyzing motor skills and postures, helping to optimize training programs and improve athletic performance.
[0003] Existing technologies include: encoding dynamic human motion into a single image using motion energy images (MEI) and motion history images (MHI); extending the Harris corner detector to the spatiotemporal domain using spatiotemporal interest point methods; applying spatiotemporally separable Gaussian kernels to videos to obtain response functions, which are used to find large motion changes in the spatial and temporal dimensions; and, in recent years, using deep learning-based spatiotemporal graph convolutional neural networks to evaluate the category of actions by judging the number of sub-actions in a complete motion and evaluating scores.
[0004] MEI and MHI are highly sensitive to changes in viewpoint, and spatiotemporal interest points can only capture information over a short period of time, not over a long one. Various action recognition models based on ST-GCN (Spatio-Temporal Graph Convolutional Network, a neural network model for processing spatiotemporal graph data) have improved the ability to assess posture. However, existing methods do not truly score the rhythm and posture of an action, and their score assessment is inaccurate.
Summary of the Invention
[0005] Therefore, a deep learning-based Tai Chi movement scoring method is needed to address the problem that existing Tai Chi recognition models cannot score the rhythm and posture of each movement in Tai Chi.
[0006] To achieve the above objectives, this invention provides a deep learning-based method for scoring Tai Chi movements, comprising the following steps:
[0007] Collect Tai Chi movements and obtain video stream data of Tai Chi movements;
[0008] The video stream data of Tai Chi Chuan is imported into the human body 3D key point detection model for processing to obtain the coordinates of the key points of the human skeleton and their corresponding time, and generate a human skeleton dataset.
[0009] The human skeleton dataset is normalized in the time dimension according to the time series of human skeleton key points to obtain the time series tensor p[K, T, 3] of human skeleton key points; where T is the time length, K is the number of human skeleton key points, and 3 represents the dimension of the coordinates (x, y, z) of each human skeleton key point.
[0010] Construct an adjacency matrix for the human skeleton based on its adjacency relationships.
[0011] The time series tensor p[K, T, 3] is combined with the adjacency matrix A = [A0, A1, …, A_skip] of the human skeleton to obtain the pose graph 3Dposeseq[K, T, A, 3] of the human skeleton key points;
[0012] The pose graph 3Dposeseq[K, T, A, 3] is input into the keypoint pose evaluation model, which includes an ST-GCN backbone network, a classification head, and a regression head. The pose graph 3Dposeseq[K, T, A, 3] is input into the ST-GCN backbone network to obtain a feature map. After dimensionality reduction, the feature map is input into the classification head and the regression head respectively. The classification head outputs the one-hot code representing the Tai Chi movement classification, and the regression head outputs the score value of the Tai Chi movement.
[0013] Furthermore, the key points of the human skeleton include the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, upper spine, central spine, right hip, right knee, right ankle, right toe, left hip, left knee, left ankle, left toe, central pelvis, left thumb tip, left middle finger tip, left little finger tip, right thumb tip, right middle finger tip, and right little finger tip.
[0014] Furthermore, the step of constructing the adjacency matrix of the human skeleton based on the adjacency relationship of the human skeleton includes the following steps:
[0015] The adjacency matrix of the human skeleton, constructed based on the adjacency relationships of the human skeleton, is a K×K matrix [A0, A1,…], where the values in the matrix represent adjacency weights, A0 represents adjacency relationships with a distance of 0, and A1 represents adjacency relationships with a distance of 1.
[0016] Define the left hand, right hand, left foot, and right foot as adjacent bone points with a distance of 1, and obtain the cross-node adjacency matrix A_skip; this yields the adjacency matrix of the human skeleton, A = [A0, A1, …, A_skip].
[0017] Furthermore, the classification head includes a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a softmax layer connected in sequence; the first fully connected layer has an input dimension of 512 and an output dimension of 512; the first activation layer uses the SiLU function, with an input dimension of 512 and an output dimension of 512; the second fully connected layer has an input dimension of 512 and an output dimension of 256; the second activation layer uses the SiLU function, with an input dimension of 256 and an output dimension of 256; the third fully connected layer has an input dimension of 256 and an output dimension of 24; and the softmax layer has an input dimension of 24 and an output dimension of 24.
[0018] Furthermore, the regression head comprises a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a Sigmoid layer connected in sequence; the first fully connected layer has an input dimension of 512 and an output dimension of 512; the first activation layer uses the SiLU function, with an input dimension of 512 and an output dimension of 512; the second fully connected layer has an input dimension of 512 and an output dimension of 256; the second activation layer uses the SiLU function, with an input dimension of 256 and an output dimension of 256; the third fully connected layer has an input dimension of 256 and an output dimension of 1; and the Sigmoid layer has an input dimension of 1 and an output dimension of 1.
[0019] Furthermore, after dimensionality reduction, the feature maps are input into the classification head and regression head respectively. The classification head outputs a one-hot code representing the Tai Chi movement category, and the regression head outputs the score value of the Tai Chi movement. The following steps are also included:
[0020] The feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method. Map0 is then fed into the classification head after dimensionality reduction, and the classification head outputs the one-hot code representing the Tai Chi movement category. Map1 is then fed into the regression head after dimensionality reduction, and the regression head outputs the score value of the Tai Chi movement.
[0021] Furthermore, after dimensionality reduction, the feature maps are input into the classification head and regression head respectively. The classification head outputs a one-hot code representing the Tai Chi movement category, and the regression head outputs the score value of the Tai Chi movement. The following steps are also included:
[0022] The feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method. Map0 is then dimensionality-reduced and fed into the classification head, which outputs a one-hot code representing the Tai Chi movement classification. The features output from the second fully connected layer in the classification head are then extracted and embedded into the dimensionality-reduced Map1. This feature is then fed into the regression head, which outputs the score of the Tai Chi movement.
[0023] Furthermore, the step of inputting the pose graph 3Dposeseq[K, T, A, 3] into the ST-GCN backbone network to obtain the feature map includes the following:
[0024] In the input pose graph 3Dposeseq[K, T, A, 3], the coordinates (x, y, z) of the human skeleton key points are projected onto the three bases of the coordinate axes to form the features of the three bases.
[0025] The features of the three bases are convolved independently and then weighted using the weights B[b1, b2, b3, b4, b5, b6] of the channel attention mechanism to obtain the weighted features.
[0026] The three weighted features are concatenated along the channel dimension to obtain a feature map.
[0027] A storage medium storing a computer program, which, when executed by a processor, implements the steps of the deep learning-based Tai Chi movement scoring method described above.
[0028] An electronic device includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, it implements the steps of the above-described deep learning-based Tai Chi movement scoring method.
[0029] Unlike existing technologies, the above technical solution uses a 3D human keypoint detection model to process video stream data of Tai Chi movements, obtaining the coordinates of key points on the human skeleton and providing accurate human posture information, which lays the foundation for subsequent action classification and scoring. The coordinates of the key points are then normalized along the time series, so that the key point coordinates at each time form the human skeleton at that time. Adjacency matrices are used to construct the adjacency relationships of the key points, better capturing the posture of Tai Chi movements. Finally, the data is input into an ST-GCN backbone network, where graph convolution operations extract features from the pose graph. These features are then input into a classification head and a regression head, respectively. The classification head classifies the rhythm and posture of Tai Chi movements based on the feature map and outputs the Tai Chi movement classification. The regression head maps the features on the feature map to a continuous bounded score interval and outputs the corresponding score value for the Tai Chi movement. In other words, by transforming the action scoring task into a classification and regression problem, a more accurate evaluation of human posture scores in sports is provided, ensuring that the rhythm and posture of each Tai Chi movement have a corresponding score evaluation.
Attached Figure Description
[0030] Figure 1 is a schematic diagram illustrating the adjacency relationships of the human skeleton according to the present invention;
[0031] Figure 2 is a diagram of the adjacency matrix A0 of the present invention;
[0032] Figure 3 is a diagram of the adjacency matrix A1 of the present invention;
[0033] Figure 4 is a diagram of the cross-node adjacency matrix A_skip of the present invention;
[0034] Figure 5 is a flowchart of inputting the pose graph into the keypoint pose evaluation model of the present invention;
[0035] Figure 6 is a flowchart of inputting the pose graph into the keypoint pose evaluation model of the present invention;
[0036] Figure 7 is a flowchart of inputting the pose graph into the keypoint pose evaluation model of the present invention;
[0037] Figure 8 is a schematic diagram of the convolutional structure of the ST-GCN backbone network of the present invention.
[0038] Explanation of reference numerals in the attached figures:
[0039] 1. Central pelvis; 2. Central spine; 3. Neck; 4. Head; 5. Right shoulder; 6. Right elbow; 7. Right wrist; 8. Right thumb tip; 9. Left shoulder; 10. Left elbow; 11. Left wrist; 12. Left thumb tip; 13. Right hip; 14. Right knee; 15. Right ankle; 16. Right toe; 17. Left hip; 18. Left knee; 19. Left ankle; 20. Left toe; 21. Upper spine; 22. Right little finger tip; 23. Right middle finger tip; 24. Left little finger tip; 25. Left middle finger tip.
Detailed Implementation
[0040] To explain in detail the technical content, structural features, objectives, and effects of the technical solution, the following description is provided in conjunction with specific embodiments and accompanying drawings.
[0041] In this document, the term "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The term "embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment, nor does it specifically limit its independence or connection with other embodiments. In principle, in this application, as long as there are no technical contradictions or conflicts, the technical features mentioned in each embodiment can be combined in any way to form corresponding implementable technical solutions.
[0042] Unless otherwise defined, the technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the use of related terms herein is merely for the purpose of describing particular embodiments and is not intended to limit this application.
[0043] In the description of this application, the term "and / or" is used to describe the logical relationship between objects, indicating that three relationships can exist. For example, A and / or B means: A exists, B exists, and A and B exist simultaneously. Additionally, the character " / " in this document generally indicates that the preceding and following objects have an "or" logical relationship.
[0044] In this application, terms such as “first” and “second” are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual quantity, hierarchy or order relationship between these entities or operations.
[0045] Unless otherwise specified, the use of terms such as “comprising,” “including,” “having,” or other similar expressions in this application is intended to cover non-exclusive inclusion, which does not exclude the presence of additional elements in a process, method, or product that includes the stated elements, such that a process, method, or product that includes a list of elements may include not only those defined elements but also other elements not expressly listed, or elements inherent to such a process, method, or product.
[0046] Similar to the interpretation in the Patent Examination Guidelines, in this application, expressions such as "greater than," "less than," and "exceeding" are understood to exclude the stated number; expressions such as "above," "below," and "within" are understood to include the stated number. Furthermore, in the description of the embodiments in this application, "multiple" means two or more (including two), and similar expressions related to "multiple" are also interpreted in this way, such as "multiple groups" and "multiple times," unless otherwise explicitly specified.
[0047] In the description of the embodiments of this application, the space-related expressions used, such as "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "vertical," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," and "circumferential," indicate the orientation or positional relationship based on the orientation or positional relationship shown in the specific embodiments or drawings. They are only for the purpose of describing the specific embodiments of this application or for the reader's understanding, and do not indicate or imply that the device or component referred to must have a specific position, a specific orientation, or be constructed or operated in a specific orientation. Therefore, they should not be construed as limitations on the embodiments of this application.
[0048] Unless otherwise expressly specified or limited, the terms "installation," "connection," "linking," "fixing," and "setting," as used in the description of the embodiments of this application, should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral setting; it can be a mechanical connection, an electrical connection, or a communication connection; it can be a direct connection or an indirect connection through an intermediate medium; it can be the internal connection of two components or the interaction between two components. For those skilled in the art to which this application pertains, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.
[0049] As shown in Figures 1-8, this invention provides a deep learning-based method for scoring Tai Chi movements. It employs a 3D human keypoint detection model to process video stream data of Tai Chi movements, obtaining the coordinates of key points on the human skeleton and providing accurate posture information, which lays the foundation for subsequent movement classification and scoring. It then normalizes the coordinates of the key points along the time series, so that the key point coordinates at each time form the human skeleton at that time. It constructs the adjacency relationships of the key points through an adjacency matrix to better capture the posture of Tai Chi movements. Finally, it inputs the data into an ST-GCN backbone network, using graph convolution operations to extract features from the pose graph. These features are then input into a classification head and a regression head, respectively. The classification head classifies the rhythm and posture of Tai Chi movements based on the feature map and outputs the Tai Chi movement classification. The regression head maps the features on the feature map to a continuous bounded score interval and outputs the corresponding score value for the Tai Chi movement. In other words, by transforming the movement scoring task into a classification and regression problem, it provides a more accurate evaluation of human posture scores in sports, ensuring that the rhythm and posture of each Tai Chi movement have a corresponding score evaluation.
[0050] The following is a detailed description of the deep learning-based Tai Chi movement scoring method provided by the present invention. As shown in Figure 5, the method includes the following steps:
[0051] Collect Tai Chi movements and obtain video stream data of Tai Chi movements;
[0052] The video stream data of Tai Chi Chuan is imported into the human body 3D key point detection model for processing to obtain the coordinates of the key points of the human skeleton and their corresponding time, and generate a human skeleton dataset.
[0053] The human skeleton dataset is normalized in the time dimension according to the time series of human skeleton key points to obtain the time series tensor p[K, T, 3] of human skeleton key points; where T is the time length, K is the number of human skeleton key points, and 3 represents the dimension of the coordinates (x, y, z) of each human skeleton key point.
[0054] Construct an adjacency matrix for the human skeleton based on its adjacency relationships.
[0055] The time series tensor p[K, T, 3] is combined with the adjacency matrix A = [A0, A1, …, A_skip] of the human skeleton to obtain the pose graph 3Dposeseq[K, T, A, 3] of the human skeleton key points;
[0056] The pose graph 3Dposeseq[K, T, A, 3] is input into the keypoint pose evaluation model, which includes an ST-GCN backbone network, a classification head, and a regression head. The pose graph 3Dposeseq[K, T, A, 3] is input into the ST-GCN backbone network to obtain a feature map. After dimensionality reduction, the feature map is input into the classification head and the regression head respectively. The classification head outputs the one-hot code representing the Tai Chi movement classification, and the regression head outputs the score value of the Tai Chi movement.
[0057] The aforementioned acquisition of Tai Chi movement video stream data can be achieved by fixing the camera in a suitable position to ensure stability and prevent shaking while recording the entire Tai Chi movement process. The camera refers to any device suitable for recording Tai Chi movement video, such as a camcorder or smartphone. The Tai Chi movement can be recorded from a single angle (front or back) or from multiple angles (e.g., four sides) to obtain the video stream data. Importing the Tai Chi movement video stream data into the human 3D keypoint detection model involves decomposing the video stream data into a series of frames, obtaining the coordinates of the human skeleton key points in each frame, and storing the detected coordinates and their corresponding times in a human skeleton dataset. The data can be stored in CSV, JSON, or other common data formats. Before storage, the coordinates detected in each frame can also be preprocessed, for example by handling outliers, missing values, or noise in the dataset, to ensure data quality and accuracy. The human 3D keypoint detection model can be OpenPose, AlphaPose, MMPose, etc. Normalizing the human skeleton dataset in the time dimension to obtain the time series tensor p[K, T, 3] means normalizing the human skeleton key points according to their time series. In some embodiments, this normalization uses an interpolation algorithm: for each skeleton key point, the earliest and latest time points of that key point in the dataset are found; interpolation (such as linear interpolation or polynomial interpolation) is then performed along the time series to calculate the position of the key point at each missing time point; once the positions at all missing time points have been filled in, the time series tensor p[K, T, 3] is generated according to the time series. As shown in Figure 1, K represents the number of key points in the human skeleton, preferably K=25, namely: central pelvis 1, central spine 2, neck 3, head 4, right shoulder 5, right elbow 6, right wrist 7, right thumb tip 8, left shoulder 9, left elbow 10, left wrist 11, left thumb tip 12, right hip 13, right knee 14, right ankle 15, right toe 16, left hip 17, left knee 18, left ankle 19, left toe 20, upper spine 21, right little finger tip 22, right middle finger tip 23, left little finger tip 24, and left middle finger tip 25. Constructing the adjacency matrix of the human skeleton based on its adjacency relationships refers to using the positional information of the key points in the human skeleton to express the adjacency relationships between them. The dimensionality reduction mentioned above, where the feature map is input into the classification and regression heads after dimensionality reduction, refers to flattening the feature map. The ST-GCN backbone network outputs a feature map with dimensions of 1×1×512; the dimensionality reduction layer reads the data along each dimension and reduces it to one dimension, giving an output dimension of 512. The dimensionality reduction layer can be either a flattening layer or a squeezing layer. The ST-GCN backbone network described above is the backbone network of ST-GCN (Spatio-Temporal Graph Convolutional Network, a neural network model for processing spatio-temporal graph data); applied to the pose graph 3Dposeseq of the human skeleton key points, it extracts the features in 3Dposeseq through convolution operations and encodes them into a feature map. The classification head classifies the rhythm and posture of Tai Chi movements using this feature map and outputs the Tai Chi movement classification. The regression head maps the features on the feature map to a continuous bounded score interval and outputs the corresponding score value for the Tai Chi movement. In other words, by transforming the movement scoring task into a classification and regression problem, it provides a more accurate assessment of human posture scores in sports, ensuring that the rhythm and posture of each Tai Chi movement have a corresponding score evaluation.
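By way of illustration only, the interpolation-based time normalization of paragraph [0057] may be sketched as follows. The function name normalize_time, the uniform resampling grid, the use of NaN to mark undetected key points, and the choice of linear interpolation are assumptions made for this sketch and are not specified by the embodiment; the sketch also assumes every key point is detected in at least one frame.

```python
import numpy as np

def normalize_time(times, coords, num_steps):
    """Resample one skeleton sequence onto a uniform time grid (hypothetical helper).

    times:  shape (N,), strictly increasing capture times
    coords: shape (N, K, 3), NaN where a key point was not detected
    returns p with shape (K, num_steps, 3)
    """
    times = np.asarray(times, dtype=float)
    coords = np.asarray(coords, dtype=float)
    _, k, _ = coords.shape
    grid = np.linspace(times[0], times[-1], num_steps)
    p = np.empty((k, num_steps, 3))
    for joint in range(k):
        for axis in range(3):
            series = coords[:, joint, axis]
            valid = ~np.isnan(series)
            # Linear interpolation fills the missing time points; outside the
            # observed range np.interp holds the first/last observed value.
            p[joint, :, axis] = np.interp(grid, times[valid], series[valid])
    return p

# Toy example: 2 key points, 4 frames, one missed detection in frame 1.
times = np.array([0.0, 1.0, 2.0, 3.0])
coords = np.zeros((4, 2, 3))
coords[:, 0, 0] = times          # key point 0 moves along x at unit speed
coords[1, 0, 0] = np.nan         # simulated missing detection
p = normalize_time(times, coords, num_steps=7)
```

Here p has the layout [K, T, 3] used in the text, with T equal to num_steps.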
[0058] As shown in Figures 1-4, an adjacency matrix of the human skeleton focuses on the connections of the human skeleton and cannot express the adjacency relationships of key points of the hands and feet. In Tai Chi, the coordination of the hands and feet also needs to be used as a standard for scoring. Therefore, the step of constructing the adjacency matrix of the human skeleton based on the adjacency relationships of the human skeleton includes the following steps:
[0059] The adjacency matrix of the human skeleton, constructed based on the adjacency relationships, is a K×K matrix [A0, A1, …], where the values in the matrix represent adjacency weights, A0 represents adjacency relationships with a distance of 0 (as shown in Figure 2), and A1 represents adjacency relationships with a distance of 1 (as shown in Figure 3); for example, if the head is connected to the neck, then the adjacency relationship between the head and the neck belongs to A0.
[0060] Define the left hand, right hand, left foot, and right foot as adjacent bone points with a distance of 1, and obtain the cross-node adjacency matrix A_skip (as shown in Figure 4); this yields the adjacency matrix A = [A0, A1, …, A_skip].
[0061] The adjacency matrix of the human skeleton can be used to describe the adjacency relationship and adjacency weight between key points, and to make the coordination of the hands and feet a standard for score evaluation.
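By way of illustration only, the construction of A0, A1 and A_skip described in paragraphs [0059] to [0061] may be sketched as follows. The bone list BONES, the choice of wrists and ankles to stand for the hands and feet, the reading of "distance 1" as "separated by one intermediate key point", and the use of 0/1 weights are all assumptions made for this sketch; the actual connections are those of Figure 1 and the actual weights are not given in the text.

```python
import itertools
import numpy as np

K = 25  # key points numbered 1..25 as in the reference numerals

# ASSUMPTION: hypothetical bone list loosely following a common 25-joint layout.
BONES = [
    (1, 2), (2, 21), (21, 3), (3, 4),                          # pelvis, spine, neck, head
    (21, 5), (5, 6), (6, 7), (7, 8), (7, 22), (7, 23),         # right arm and fingertips
    (21, 9), (9, 10), (10, 11), (11, 12), (11, 24), (11, 25),  # left arm and fingertips
    (1, 13), (13, 14), (14, 15), (15, 16),                     # right leg
    (1, 17), (17, 18), (18, 19), (19, 20),                     # left leg
]

def symmetric(pairs, k=K):
    """Build a symmetric 0/1 matrix from 1-based key point pairs."""
    a = np.zeros((k, k), dtype=np.float32)
    for i, j in pairs:
        a[i - 1, j - 1] = 1.0
        a[j - 1, i - 1] = 1.0
    return a

# A0: directly connected key points (e.g. head and neck).
A0 = symmetric(BONES)

# A1 (assumed reading): pairs reachable in two bone steps but not directly connected.
two_step = (A0 @ A0) > 0
A1 = (two_step & (A0 == 0) & ~np.eye(K, dtype=bool)).astype(np.float32)

# A_skip: cross-node links between the hands and feet.
# ASSUMPTION: wrists (7, 11) and ankles (15, 19) represent the hands and feet.
EXTREMITIES = [7, 11, 15, 19]
A_skip = symmetric(itertools.combinations(EXTREMITIES, 2))

A = np.stack([A0, A1, A_skip])  # shape (3, K, K)
```

Stacking the three K×K matrices gives one array per adjacency type, which is one plausible way to hold A = [A0, A1, …, A_skip].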
[0062] The aforementioned classification head (ClsHead) comprises a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a softmax layer connected in sequence. The first fully connected layer has an input dimension of 512 and an output dimension of 512. The first activation layer uses the SiLU function as its activation function, with an input dimension of 512 and an output dimension of 512. The second fully connected layer has an input dimension of 512 and an output dimension of 256. The second activation layer also uses the SiLU function as its activation function, with an input dimension of 256 and an output dimension of 256. The third fully connected layer has an input dimension of 256 and an output dimension of 24. The softmax layer has an input dimension of 24 and an output dimension of 24.
[0063] The aforementioned classification head extracts features from multiple levels through a combination of fully connected layers and activation functions. It then uses a softmax layer to output a one-hot code representing the Tai Chi movement category, improving the model's expressive and learning capabilities, thus enhancing the accuracy and robustness of the classification task. Furthermore, using the SiLU function as the activation function accelerates the model's convergence and training process. The structure of the classification head (ClsHead) is shown in Table 1.
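[0064]
Layer | Input dimension | Output dimension
First fully connected layer | 512 | 512
First activation layer (SiLU) | 512 | 512
Second fully connected layer | 512 | 256
Second activation layer (SiLU) | 256 | 256
Third fully connected layer | 256 | 24
Softmax layer | 24 | 24
[0065] Table 1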
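By way of illustration only, the forward pass of the classification head with the layer dimensions given in paragraph [0062] may be sketched in NumPy as follows. The random weights, the helper names init_head and cls_head, and the use of argmax to turn the softmax output into a class index are assumptions made for this sketch; a trained model would use learned weights.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def init_head(rng, dims):
    """Random (weight, bias) pairs for consecutive fully connected layers."""
    return [(rng.normal(0.0, 0.02, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def cls_head(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h = silu(x @ w1 + b1)         # first FC 512 -> 512, SiLU
    h = silu(h @ w2 + b2)         # second FC 512 -> 256, SiLU
    return softmax(h @ w3 + b3)   # third FC 256 -> 24, softmax

rng = np.random.default_rng(0)
params = init_head(rng, [512, 512, 256, 24])

feature_map = rng.normal(size=(1, 1, 512))   # stand-in for the backbone output
flat = feature_map.reshape(-1)               # dimensionality reduction (flatten)
probs = cls_head(flat, params)
predicted_class = int(np.argmax(probs))
one_hot = np.eye(24)[predicted_class]        # one-hot code of the predicted class
```

The softmax output is a probability vector over the 24 categories; the one-hot code is taken here as the argmax of that vector.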
[0066] The aforementioned Reg Head comprises a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a Sigmoid layer connected in sequence. The first fully connected layer has an input dimension of 512 and an output dimension of 512. The first activation layer uses the SiLU function, with an input dimension of 512 and an output dimension of 512. The second fully connected layer has an input dimension of 512 and an output dimension of 256. The second activation layer uses the SiLU function, with an input dimension of 256 and an output dimension of 256. The third fully connected layer has an input dimension of 256 and an output dimension of 1. The Sigmoid layer has an input dimension of 1 and an output dimension of 1.
[0067] The aforementioned regression head extracts features from multiple levels through a combination of fully connected layers and activation functions. It then performs regression through a Sigmoid layer, mapping the features to a continuous, bounded score interval and outputting the corresponding score for Tai Chi movements. This improves the model's expressive and learning capabilities, contributing to increased accuracy and robustness in regression tasks. Furthermore, using the SiLU function as the activation function accelerates model convergence and training. See Table 2 for the structure of the regression head (Reg Head).
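[0068]
Layer | Input dimension | Output dimension
First fully connected layer | 512 | 512
First activation layer (SiLU) | 512 | 512
Second fully connected layer | 512 | 256
Second activation layer (SiLU) | 256 | 256
Third fully connected layer | 256 | 1
Sigmoid layer | 1 | 1
[0069] Table 2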
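By way of illustration only, the regression head of paragraph [0066] may be sketched in the same manner. The random weights and the helper names are assumptions made for this sketch; the only property relied upon is that the final Sigmoid layer bounds the output to the open interval (0, 1), which can then be rescaled to whatever score range an application uses.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_head(rng, dims):
    """Random (weight, bias) pairs for consecutive fully connected layers."""
    return [(rng.normal(0.0, 0.02, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def reg_head(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h = silu(x @ w1 + b1)         # first FC 512 -> 512, SiLU
    h = silu(h @ w2 + b2)         # second FC 512 -> 256, SiLU
    return sigmoid(h @ w3 + b3)   # third FC 256 -> 1, Sigmoid

rng = np.random.default_rng(0)
params = init_head(rng, [512, 512, 256, 1])

flat = rng.normal(size=(1, 1, 512)).reshape(-1)  # flattened stand-in feature map
score = float(reg_head(flat, params)[0])         # bounded score in (0, 1)
```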
[0070] As shown in Figure 6, as a further improvement to the above embodiment, when the feature map is input, after dimensionality reduction, into the classification head and the regression head respectively, so that the classification head outputs a one-hot code representing the Tai Chi movement category and the regression head outputs the score value of the Tai Chi movement, the following steps are also included:
[0071] The feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method. Map0 is then fed into the classification head after dimensionality reduction, and the classification head outputs the one-hot code representing the Tai Chi movement category. Map1 is then fed into the regression head after dimensionality reduction, and the regression head outputs the score value of the Tai Chi movement.
[0072] Map0 takes the first 256 channels of the third dimension of the feature map, and Map1 takes the last 256 channels of the third dimension of the feature map, so both Map0 and Map1 have a dimension of 1×1×256. The particular way in which Map0 and Map1 are split does not matter. Under the chain rule, the gradient of the classification loss function flows back only through the input of the Cls Head, so the first 256 channels of the third dimension are trained as the feature map used for classification. Likewise, the gradient of the regression loss function flows back only through the input of the Reg Head, so the last 256 channels of the third dimension are trained as the feature map used for regression. By splitting the feature map and feeding the two parts separately into the classification and regression heads, the information in the feature map can be fully utilized, and specialized processing can be performed according to the requirements of each task. This improves the flexibility and diversity of the model while fully leveraging the information from different channels in the feature map, thereby improving the accuracy and robustness of the Tai Chi movement classification and regression tasks.
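The channel split described above can be sketched as follows. This is a minimal NumPy illustration, assuming a 1×1×512 feature map whose last axis is the channel dimension. The function name `split_feature_map` and the placeholder input values are hypothetical.

```python
import numpy as np

def split_feature_map(feature_map):
    # Split the channel (last) axis into two equal halves:
    # Map0 = first 256 channels, Map1 = last 256 channels.
    map0, map1 = np.split(feature_map, 2, axis=-1)
    return map0, map1

# Placeholder feature map with channel values 0..511 so the halves are easy to inspect.
feature_map = np.arange(512, dtype=np.float32).reshape(1, 1, 512)
map0, map1 = split_feature_map(feature_map)
```

Since the two halves share no channels, each head's loss back-propagates through its own half only at the point of the split, while both losses still reach the shared backbone.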
[0073] As shown in Figure 7, actions of the same category are similar and therefore have similar data distributions, while the data distributions of different categories differ significantly. The above embodiment is therefore further improved: the step of inputting the dimensionality-reduced feature map into the classification head and the regression head respectively, in which the classification head outputs a one-hot code representing the Tai Chi movement category and the regression head outputs the score value of the Tai Chi movement, further includes the following step:
[0074] The feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method. Map0 is then dimensionality-reduced and fed into the classification head, which outputs a one-hot code representing the Tai Chi movement classification. The features output from the second fully connected layer in the classification head are then extracted and embedded into the dimensionality-reduced Map1. This feature is then fed into the regression head, which outputs the score of the Tai Chi movement.
[0075] The above process extracts features from the hidden layer of the classification head and adds these features to the regression head, i.e., embedding them into the Map1 feature. The hidden features (second fully connected layer) in the classification head have a dimension of 256, and Map1, after dimensionality reduction, also has a dimension of 256. The two features are then added together before being input into the regression head. By embedding the features output from the second fully connected layer of the classification head into Map1, richer and more accurate feature information can be provided, which helps improve the accuracy and robustness of the regression task and the accuracy of the Tai Chi movement score.
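The embedding step described above can be sketched as an element-wise addition of two 256-dimensional vectors. The following NumPy example is a sketch under stated assumptions, not the patented implementation. The function name `embed_class_features`, the random weights, and the choice to take the second fully connected layer's output before its activation are assumptions. The input width of the first layer is simply taken from Map0, because this section lists 512 as the head input dimension and 256 as the dimension of the split maps, and the sketch does not try to reconcile the two.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def embed_class_features(map0, map1, w1, b1, w2, b2):
    # Run Map0 through the first two fully connected layers of the classification head.
    h1 = silu(map0 @ w1 + b1)
    hidden = h1 @ w2 + b2          # assumed: output of the second FC layer, 256-d
    # Embed the category-related hidden feature into Map1 by element-wise addition.
    return map1 + hidden, hidden

rng = np.random.default_rng(0)
map0 = rng.normal(size=256)        # assumed dimensionality-reduced Map0
map1 = rng.normal(size=256)        # assumed dimensionality-reduced Map1
w1, b1 = rng.normal(0.0, 0.02, size=(256, 512)), np.zeros(512)
w2, b2 = rng.normal(0.0, 0.02, size=(512, 256)), np.zeros(256)

fused, hidden = embed_class_features(map0, map1, w1, b1, w2, b2)
```

The fused vector keeps the 256-dimensional shape of Map1, so it can be passed to the regression head in place of Map1.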
[0076] Traditional ST-GCN backbone networks borrow the structure of image convolutional neural networks for graph convolution. Because the three channels of an image represent features of the same pixel that are artificially separated into RGB channels, the information in the channel dimension of a convolutional neural network is completely cross-interactive, and multiple convolution kernels are used to achieve multi-channel output. Applied to skeleton coordinates, this traditional method fuses the features of the three basis vectors. However, in linear algebra and statistics, the correlation between two orthogonal basis vectors is zero, so this unnecessary fusion increases model complexity and reduces model accuracy. For this purpose, as shown in Figure 8, the above embodiment is further improved: the step of inputting the pose graph 3Dposeseq[K, T, A, 3] into the ST-GCN backbone network to obtain the feature map includes the following steps:
[0077] For the input pose graph 3Dposeseq[K, T, A, 3], the coordinates (x, y, z) of the key points of the human skeleton are projected onto the three bases of the coordinate axes to form the features of the three bases.
[0078] The features of the three bases are convolved independently and then weighted using the weights B[b1, b2, b3, b4, b5, b6] of the channel attention mechanism to obtain the weighted features.
[0079] The three weighted features are concatenated along the channel dimension to obtain a feature map.
[0080] The above method improves the ST-GCN backbone network by projecting the coordinates of the key points of the human skeleton onto the three bases of the coordinate axes, performing convolution operations on the features of the three bases independently, and weighting the results with the weights B of the channel attention mechanism. This reduces the computational load of the algorithm while improving the performance of the model in classification and regression.
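The three steps above (per-axis projection, independent convolution with attention weighting, and channel concatenation) can be sketched as follows. This is a simplified NumPy illustration and not the patented network: it performs only a spatial graph aggregation over the adjacency subsets and omits temporal convolution. The function name `per_axis_graph_conv`, the tensor shapes, the random adjacency values, and the use of one scalar attention weight per axis are assumptions. This section lists six weights b1 to b6 without stating how they map onto the three features, so the sketch does not reproduce that detail.

```python
import numpy as np

def per_axis_graph_conv(pose, adj, kernels, attn):
    """pose: (K, T, 3), adj: (S, K, K), kernels: (3, S, C), attn: (3,)."""
    feats = []
    for axis in range(3):
        x = pose[:, :, axis]                                # projection onto one basis, (K, T)
        agg = np.einsum("skj,jt->skt", adj, x)              # neighbor aggregation per adjacency subset
        out = np.einsum("skt,sc->ktc", agg, kernels[axis])  # axis-specific kernels, no cross-axis mixing
        feats.append(attn[axis] * out)                      # assumed scalar attention weight per axis
    return np.concatenate(feats, axis=-1)                   # concatenate along channels, (K, T, 3C)

rng = np.random.default_rng(0)
K, T, S, C = 25, 8, 3, 4                 # 25 key points; T, S, C are illustrative sizes
pose = rng.normal(size=(K, T, 3))
adj = rng.random(size=(S, K, K))         # placeholder for [A0, A1, Askip]
kernels = rng.normal(size=(3, S, C))
attn = np.array([0.5, 0.3, 0.2])
features = per_axis_graph_conv(pose, adj, kernels, attn)
```

Because each axis has its own kernels, changing only the y coordinates alters only the middle block of output channels, which is the independence property the paragraph describes.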
[0081] The present invention also provides a storage medium storing a computer program, which, when executed by a processor, implements the steps of the above-described method. In this embodiment, the storage medium can be a storage medium disposed in an electronic device, allowing the electronic device to read the contents of the storage medium and achieve the effects of the present invention. Alternatively, the storage medium can be a separate storage medium connected to an electronic device, enabling the electronic device to read the contents of the storage medium and implement the method steps of the present invention.
[0082] The storage media include, but are not limited to, magnetic disks, magnetic tapes, magnetic cards, floppy disks, flash memory, optical disks, optical cards, read-only memory (ROM), random access memory (RAM), erasable programmable ROM (EPROM), and electrically erasable programmable ROM (EEPROM), etc., as well as other biological, physical, or chemical structures that can achieve the same or equivalent functions as the storage media listed above, such as DNA, RNA, proteins, and other units with information storage capabilities. In specific embodiments, the storage media may be one of the above-mentioned media types or a combination of the above-mentioned media types. In different embodiments, the computer program involved in the embodiments may be centrally stored in a single medium or distributed across multiple media. The memory containing computer device readable storage media may be non-volatile memory or random access memory. These computer device readable storage media may be built into the device or connected to the device involved in the embodiments as an external device or part of an external device. In some embodiments, the memory having a computer device readable storage medium is deployed locally; in other embodiments, the memory may be deployed remotely from the processor, for example, as a network-attached memory accessed via RF circuitry or an external port and a communication network, wherein the communication network may be the Internet, one or more intranets, a local area network (LAN), a wide area network (WAN), a storage area network (SAN), or a suitable combination thereof, as long as computer device access to the memory is enabled.
Furthermore, the computer program involved in the embodiments may be stored in plaintext or ciphertext form, or it may be designed as training data, integrated and recombined through model training and implicitly stored in the parameter states of a deep neural network or other machine learning model. The processor described in the embodiments of this application can be implemented by hardware, firmware, software, or a combination thereof. It can be at least one of the following: circuit, single or multiple application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), central processing units (CPUs), controllers, microcontrollers, and microprocessors. It also includes other physical, biological, or chemical structures that can implement the same or equivalent functions as the processors listed above, such as biological neurons, quantum computing units, DNA computing units, etc., so that the processor can execute some or all of the steps in the computer program or method involved in the various embodiments of this application, or any combination of the steps mentioned therein.
[0083] Data was collected for 24 Tai Chi movements, with 50 sets of data for each movement, totaling 1200 sets of data. These 1200 sets were manually labeled with scores ranging from 0 to 1. An ablation experiment was conducted on the improved ST-GCN backbone network (using the classification head and regression head proposed in this patent), covering score calculation, splitting of the feature map, embedding of category information, and the attention mechanism with weights B, as shown in Table 3. The classification accuracy is the proportion of samples correctly classified among the 24 movements, and the average error is the mean absolute error of the score, where the maximum score is 1.
[0084]
[0085] Table 3
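The two metrics used in the ablation experiment, classification accuracy and mean absolute error on a 0 to 1 score scale, can be computed as in the following sketch. The function names and all sample labels and scores are made up for illustration and are not results from the experiment.

```python
import numpy as np

def classification_accuracy(pred_labels, true_labels):
    # Fraction of samples whose predicted movement category matches the label.
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float(np.mean(pred == true))

def mean_absolute_error(pred_scores, true_scores):
    # Average absolute difference between predicted and labeled scores (maximum score 1).
    pred = np.asarray(pred_scores, dtype=float)
    true = np.asarray(true_scores, dtype=float)
    return float(np.mean(np.abs(pred - true)))

# Illustrative values only: category indices in 0..23 and scores in [0, 1].
pred_labels = [0, 5, 23, 7]
true_labels = [0, 5, 22, 7]
pred_scores = [0.80, 0.55, 0.90, 0.40]
true_scores = [0.75, 0.60, 0.70, 0.40]

acc = classification_accuracy(pred_labels, true_labels)  # 3 of 4 correct
mae = mean_absolute_error(pred_scores, true_scores)
```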
[0086] As shown in Table 3, the ablation experiment demonstrates that the motion scoring model proposed in this patent achieves higher accuracy in Tai Chi movement classification than the existing ST-GCN backbone network. It also provides accurate and effective evaluation of Tai Chi movement scores, quantifies the results of the movements more clearly, and provides richer movement feedback for teachers and students.
[0087] It should be noted that although the above embodiments have been described herein, this does not limit the scope of patent protection of the present invention. Therefore, any changes and modifications made to the embodiments described herein based on the innovative concept of the present invention, or equivalent structural or procedural transformations made using the content of the present invention's specification and drawings, directly or indirectly applying the above technical solutions to other related technical fields, are all included within the scope of patent protection of the present invention.
Claims
1. A deep learning-based method for scoring Tai Chi movements, characterized in that it includes the following steps:
Collect Tai Chi movements and obtain video stream data of the Tai Chi movements;
The video stream data of the Tai Chi movements is imported into the human body 3D key point detection model for processing to obtain the coordinates of the key points of the human skeleton and their corresponding times, and a human skeleton dataset is generated;
The human skeleton dataset is normalized in the time dimension according to the time series of human skeleton key points to obtain the time series tensor p[K,T,3] of human skeleton key points, where T is the time length, K is the number of human skeleton key points, and 3 represents the dimension of the coordinates (x,y,z) of each human skeleton key point;
Construct an adjacency matrix of the human skeleton based on its adjacency relationships, and combine the time series tensor p[K,T,3] with the adjacency matrix A=[A0,A1,…,Askip] of the human skeleton to construct the pose graph 3Dposeseq[K,T,A,3] of the key points of the human skeleton;
The pose graph 3Dposeseq[K,T,A,3] is input into the keypoint pose evaluation model, which includes an ST-GCN backbone network, a classification head, and a regression head. The pose graph 3Dposeseq[K,T,A,3] is input into the ST-GCN backbone network to obtain a feature map. After dimensionality reduction, the feature map is input into the classification head and the regression head respectively. The classification head outputs a one-hot code representing the Tai Chi movement category, and the regression head outputs the score value of the Tai Chi movement;
The step of constructing an adjacency matrix of the human skeleton based on the adjacency relationships of the human skeleton includes the following steps: the adjacency matrix of the human skeleton, constructed based on the adjacency relationships of the human skeleton, is a K×K matrix [A0,A1,…], where the values in the matrix represent adjacency weights, A0 represents adjacency relationships with a distance of 0, and A1 represents adjacency relationships with a distance of 1; the left hand, right hand, left foot, and right foot are defined as adjacent bone points with a distance of 1 to obtain the cross-node adjacency matrix Askip, that is, to obtain the adjacency matrix A = [A0, A1, ..., Askip] of the human skeleton;
The step in which, after dimensionality reduction, the feature map is input into the classification head and the regression head respectively, the classification head outputs a one-hot code representing the Tai Chi movement category, and the regression head outputs the score value of the Tai Chi movement, further includes the following steps: the feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method; Map0 is dimensionality-reduced and fed into the classification head, which outputs a one-hot code representing the Tai Chi movement category; the features output from the second fully connected layer in the classification head are extracted and embedded into the dimensionality-reduced Map1; the resulting feature is fed into the regression head, which outputs the score of the Tai Chi movement;
The step of inputting the pose graph 3Dposeseq[K,T,A,3] into the ST-GCN backbone network to obtain the feature map includes the following steps: for the input pose graph 3Dposeseq[K,T,A,3], the coordinates (x, y, z) of the human skeleton key points are projected onto the three bases of the coordinate axes to form the features of the three bases; the features of the three bases are convolved independently and then weighted using the channel attention mechanism weights B[b1,b2,b3,b4,b5,b6] to obtain the weighted features; the three weighted features are concatenated along the channel dimension to obtain a feature map.
2. The deep learning-based Tai Chi movement scoring method according to claim 1, characterized in that the key points of the human skeleton include the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, upper spine, central spine, right hip, right knee, right ankle, right toe, left hip, left knee, left ankle, left toe, central pelvis, left thumb tip, left middle finger tip, left little finger tip, right thumb tip, right middle finger tip, and right little finger tip.
3. The deep learning-based Tai Chi movement scoring method according to claim 1, characterized in that the classification head comprises a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a softmax layer connected in sequence. The first fully connected layer has an input dimension of 512 and an output dimension of 512. The first activation layer uses the SiLU function, with an input dimension of 512 and an output dimension of 512. The second fully connected layer has an input dimension of 512 and an output dimension of 256. The second activation layer also uses the SiLU function, with an input dimension of 256 and an output dimension of 256. The third fully connected layer has an input dimension of 256 and an output dimension of 24. The softmax layer has an input dimension of 24 and an output dimension of 24.
4. The deep learning-based Tai Chi movement scoring method according to claim 1, characterized in that the regression head comprises a first fully connected layer, a first activation layer, a second fully connected layer, a second activation layer, a third fully connected layer, and a Sigmoid layer connected in sequence. The first fully connected layer has an input dimension of 512 and an output dimension of 512. The first activation layer uses the SiLU function, with an input dimension of 512 and an output dimension of 512. The second fully connected layer has an input dimension of 512 and an output dimension of 256. The second activation layer uses the SiLU function, with an input dimension of 256 and an output dimension of 256. The third fully connected layer has an input dimension of 256 and an output dimension of 1. The Sigmoid layer has an input dimension of 1 and an output dimension of 1.
5. The deep learning-based Tai Chi movement scoring method according to claim 1, characterized in that the step in which, after dimensionality reduction, the feature map is input into the classification head and the regression head respectively, the classification head outputs a one-hot code representing the Tai Chi movement category, and the regression head outputs the score value of the Tai Chi movement, further includes the following steps: the feature map is split into two parts, Map0 and Map1, along the channel dimension using the split method. Map0 is then fed into the classification head after dimensionality reduction, and the classification head outputs the one-hot code representing the Tai Chi movement category. Map1 is then fed into the regression head after dimensionality reduction, and the regression head outputs the score value of the Tai Chi movement.
6. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the deep learning-based Tai Chi movement scoring method as described in any one of claims 1-5.
7. An electronic device, characterized in that it includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, it implements the steps of the deep learning-based Tai Chi movement scoring method as described in any one of claims 1-5.