An intelligent analysis method for motion learning state based on spatio-temporal graph neural network and multi-modal fusion

By combining spatiotemporal graph neural networks with multimodal fusion methods, the problems of multimodal data fusion and real-time performance are solved, enabling efficient motion learning state analysis and providing personalized VR sports teaching support.

CN122242893A · Pending · Publication Date: 2026-06-19 · CIVIL AVIATION FLIGHT UNIV OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CIVIL AVIATION FLIGHT UNIV OF CHINA
Filing Date
2026-04-07
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies lack multimodal data fusion technology, making it impossible to effectively integrate skeletal motion data, physiological signal data, and cognitive behavior data. Temporal alignment algorithms are crude and cannot handle time axis scaling. High-precision algorithms have high computational complexity and are difficult to run in real time. Personalized recommendations lack algorithmic support, and existing VR sports teaching systems lack clear data analysis methods.

Method used

A spatiotemporal graph neural network is combined with a multimodal fusion method. Spatiotemporal coordinates are obtained through optical-inertial fusion sensors, an ST-GCN network extracts features, DTW and a Siamese network evaluate how closely an action matches the standard, an attention mechanism fuses the multimodal data, DKVMN evaluates knowledge mastery, deep reinforcement learning generates personalized intervention strategies, and an NVIDIA Jetson AGX Xavier edge computing unit is used to reduce end-to-end latency.

Benefits of technology

It achieves millisecond-level real-time feedback with end-to-end latency of less than 20ms. The system can efficiently integrate multimodal data, provide personalized motion learning state analysis, and improve the data analysis accuracy and real-time performance of VR sports teaching.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses an intelligent analysis method for motor learning states based on spatiotemporal graph neural networks and multimodal fusion, comprising: S1, using spatiotemporal graph neural networks and dynamic time warping methods to transform learners' raw motor data into structured action features; S2, introducing physiological and cognitive data based on the structured action features obtained in S1, and using an attention mechanism for dynamic weighted fusion to characterize the learner's comprehensive state; S3, using the fused state in S2 as context, combined with knowledge node interaction data, constructing a fine-grained knowledge mastery model through DKVMN to bridge the learner's comprehensive state and cognitive understanding, and obtaining the learner's knowledge gap; S4, using the learner's comprehensive state in S2 and the learner's knowledge gap in S3 as inputs, generating personalized intervention strategies through deep reinforcement learning to drive the optimization of the teaching process.

Description

Technical Field

[0001] This application relates to the field of athlete motion analysis technology, specifically to an intelligent analysis method for motion learning states based on spatiotemporal graph neural networks and multimodal fusion.

Background Technology

[0002] Motion analysis of human skeletal joints is of great importance because it overcomes the limitations of traditional video analysis (susceptibility to lighting interference), deep learning (reliance on expensive GPUs), and the discomfort associated with wearable devices. Its low computational cost, high accuracy, and environmental robustness provide precise quantitative support for fields such as medical rehabilitation, sports training, and human-computer interaction. The technology has moved from laboratory research to practical applications, achieving real-time tracking and classification of joint angles and motion trajectories using machine learning models (such as SVM and LSTM) together with motion capture and pose estimation systems (such as NOKOV and OpenPose). Significant results have been reported in gait analysis, abnormal movement recognition, and exoskeleton control (e.g., a dynamic movement recognition rate of 84% and a 20% improvement in rehabilitation efficiency). However, existing technologies suffer from several problems: multimodal data fusion is lacking, so skeletal motion data, physiological signal data, and cognitive behavioral data cannot be integrated effectively; temporal alignment algorithms are coarse and use fixed time windows, so they cannot handle scaling of the time axis; real-time performance and accuracy are in tension, because high-precision algorithms are computationally complex and difficult to run in real time; and personalized recommendations lack algorithmic support, being mostly based on manually written rules rather than adaptive optimization algorithms such as reinforcement learning. Therefore, further research is needed.

Summary of the Invention

[0003] The purpose of this application is to provide a method for intelligent analysis of motion learning states based on spatiotemporal graph neural networks and multimodal fusion. The specific technical solution is as follows:

[0004] A method for intelligent analysis of motor learning states based on spatiotemporal graph neural networks and multimodal fusion includes: S1, using spatiotemporal graph neural networks and dynamic time warping to transform learners' raw motor data into structured action features; S2, introducing physiological and cognitive data based on the structured action features obtained in S1, and using an attention mechanism for dynamic weighted fusion to characterize the learner's overall state; S3, using the fused state in S2 as context, combined with knowledge node interaction data, constructing a fine-grained knowledge mastery model through DKVMN to bridge the learner's overall state and cognitive understanding, and obtaining the learner's knowledge gaps; S4, using the learner's overall state in S2 and the learner's knowledge gaps in S3 as inputs, generating personalized intervention strategies through deep reinforcement learning to drive the optimization of the teaching process.

[0005] S1 includes: S1.1, obtaining the spatiotemporal coordinate sequence of human skeletal joints during motion based on motion images; S1.2, constructing a spatiotemporal graph containing spatial and temporal edges based on the spatiotemporal coordinate sequence of human skeletal joints in S1.1; S1.3, extracting the time-varying features of skeletal joint coordinates from the spatiotemporal graph constructed in S1.2 using a hierarchical ST-GCN network to obtain motion feature vectors; S1.4, using the DTW algorithm to solve for the optimal solution of skeletal joint coordinates changing over time based on dynamic programming of human skeletal joints; S1.5, using a Siamese network to calculate the similarity between motion features and the optimal solution based on the motion feature vectors obtained in S1.3 and the optimal solution of skeletal joint coordinates changing over time in S1.4, outputting a score, locating erroneous segments by accumulating distance through DTW, and locating specific erroneous joints by analyzing node contribution.

[0006] In S1.1, the spatiotemporal coordinate sequence of human skeletal joints is acquired through an optical-inertial fusion sensor network. The optical acquisition frequency is 120fps, the inertial acquisition frequency is 1000fps, and clock synchronization is achieved through the IEEE 1588 protocol.

[0007] In S1.2, when constructing a spatiotemporal graph containing spatial and temporal edges, the human skeletal structure is modeled as an undirected graph G=(V,E), where the set of nodes V corresponds to 15 key joints, and the set of edges E = E_s ∪ E_t, where E_s is the set of spatial edges and E_t is the set of temporal edges.

[0008] The network structure of the hierarchical ST-GCN network in S1.3 is set as follows: Input layer → ST-GCN layer 1 (64 channels) → ST-GCN layer 2 (128 channels) → ST-GCN layer 3 (256 channels) → Global average pooling → Fully connected layer → Output 128-dimensional motion feature vector. The propagation rule of the l-th layer is:

[0009] [Equation image not reproduced in the source text.]

[0010] where A_k is the normalized adjacency matrix of the k-th partitioning strategy, and W_k^(l) is the learnable weight matrix of the l-th layer.
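The propagation formula itself is not legible in this text. For orientation only, the form commonly given for ST-GCN layers, which is consistent with the symbols A_k and W_k^(l) defined above, is sketched below. The feature matrix H^(l), the activation function σ, and the number of partitions K are assumptions that do not appear in the text.

```latex
H^{(l+1)} = \sigma\!\left( \sum_{k=1}^{K} A_k \, H^{(l)} \, W_k^{(l)} \right)
```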

[0011] When introducing physiological data in S2, the steps include: S2.1, acquiring physiological data during exercise through smart wearable devices; S2.2, extracting the fluctuation features of the physiological data during exercise in S2.1 using 1D-CNN; and S2.3, analyzing the temporal relationship between the fluctuation features in S2.2 using bidirectional LSTM, and finally outputting the physiological feature vector.

[0012] When introducing cognitive data in S2, the steps include: S2.4, obtaining known motion cognitive text data; S2.5, using the Transformer model to refine the motion cognitive text data in S2.4, and finally outputting a cognitive feature vector.

[0013] The method further includes S5: an NVIDIA Jetson AGX Xavier edge computing unit, INT8 quantization of the ST-GCN, TensorRT optimization, and a pipelined parallel design are used to execute the methods described in S1-S4, achieving an end-to-end latency of less than 20 ms.

[0014] The beneficial effects of this application are as follows: it acquires spatiotemporal coordinates through optical-inertial fusion, extracts features through ST-GCN, evaluates action accuracy through DTW and Siamese networks, fuses multimodal data through an attention mechanism, assesses knowledge mastery through DKVMN, generates personalized paths through DQN, and implements risk warning through LSTM. The system adopts an edge-cloud architecture with millisecond-level real-time feedback, solving the technical problem of unclear data analysis methods in existing VR sports teaching systems.

Attached Figure Description

[0015] Figure 1 is a flowchart of the method of this application.

Detailed Implementation

[0016] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to specific embodiments and accompanying drawings. It should be understood that these descriptions are merely exemplary and not intended to limit the scope of this application. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of this application.

[0017] As shown in Figure 1, an intelligent analysis method for motion learning states based on spatiotemporal graph neural networks and multimodal fusion includes:

[0018] S1. Using a spatiotemporal graph neural network and dynamic time warping, the learner's raw motion data is transformed into structured action features. Specifically, this includes:

[0019] S1.1. Obtain the spatiotemporal coordinate sequence of human skeletal joints during motion based on motion images. In application, the sequence is collected through an optical-inertial fusion sensor network. The optical acquisition frequency is 120fps, and the inertial acquisition frequency is 1000fps. Clock synchronization is achieved through the IEEE 1588 protocol.
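Once both sensor streams share a synchronized clock, one simple way to combine a 1000-samples-per-second inertial channel with 120fps optical frames is to interpolate the inertial signal at the optical frame timestamps. The sketch below assumes timestamps in seconds on a common time base and uses linear interpolation; the function name `resample_to_frames` and the toy data are hypothetical, and the text does not state which resampling scheme is actually used.

```python
from bisect import bisect_left

def resample_to_frames(src_times, src_values, frame_times):
    """Linearly interpolate a scalar signal at the given frame timestamps.

    src_times must be sorted ascending. Timestamps outside the source
    range are clamped to the first or last sample.
    """
    out = []
    for t in frame_times:
        i = bisect_left(src_times, t)
        if i == 0:
            out.append(src_values[0])
        elif i == len(src_times):
            out.append(src_values[-1])
        else:
            t0, t1 = src_times[i - 1], src_times[i]
            v0, v1 = src_values[i - 1], src_values[i]
            w = (t - t0) / (t1 - t0)
            out.append(v0 + w * (v1 - v0))
    return out

# Hypothetical data: 1 s of an inertial channel and 120 optical frame times.
imu_t = [k / 1000 for k in range(1000)]
imu_v = [2.0 * t for t in imu_t]      # a simple linear ramp
cam_t = [k / 120 for k in range(120)]
aligned = resample_to_frames(imu_t, imu_v, cam_t)
```

After this step each optical frame has one matching inertial value, so both modalities can be indexed by frame number.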

[0020] S1.2. Based on the spatiotemporal coordinate sequence of human skeletal joints in S1.1, construct a spatiotemporal graph containing spatial and temporal edges. In application, the human skeletal structure is modeled as an undirected graph G=(V,E), where the node set V corresponds to 15 key joints, and the edge set E = E_s ∪ E_t, where E_s is the set of spatial edges (skeleton connection relationships) and E_t is the set of temporal edges (connecting the same joint in adjacent frames).
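The graph construction in S1.2 can be illustrated with a short sketch. The 15-joint bone list below is a hypothetical layout chosen only for illustration, since the text does not enumerate the joints; nodes are indexed as `t * NUM_JOINTS + j`, which is also an assumption.

```python
NUM_JOINTS = 15

# Hypothetical 15-joint skeleton given as bone pairs; the actual joint
# layout is not specified in the text.
BONES = [
    (0, 1), (1, 2),               # head, neck, torso
    (1, 3), (3, 4), (4, 5),       # left arm
    (1, 6), (6, 7), (7, 8),       # right arm
    (2, 9), (9, 10), (10, 11),    # left leg
    (2, 12), (12, 13), (13, 14),  # right leg
]

def build_st_graph(num_frames):
    """Return (spatial_edges, temporal_edges) over nodes t * NUM_JOINTS + j."""
    spatial, temporal = [], []
    for t in range(num_frames):
        base = t * NUM_JOINTS
        # Spatial edges: bones within the same frame.
        spatial += [(base + a, base + b) for a, b in BONES]
        # Temporal edges: the same joint in the next frame.
        if t + 1 < num_frames:
            temporal += [(base + j, base + NUM_JOINTS + j) for j in range(NUM_JOINTS)]
    return spatial, temporal

e_s, e_t = build_st_graph(num_frames=4)
print(len(e_s), len(e_t))  # prints "56 45"
```

With 14 bones per frame and 15 joints linked across each pair of adjacent frames, four frames give 56 spatial and 45 temporal edges.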

[0021] S1.3. A hierarchical ST-GCN network is used to extract the time-varying features of skeletal joint coordinates from the spatiotemporal graph constructed in S1.2, obtaining motion feature vectors. In application, the network structure of the hierarchical ST-GCN network is set as follows: Input layer → ST-GCN layer 1 (64 channels) → ST-GCN layer 2 (128 channels) → ST-GCN layer 3 (256 channels) → Global average pooling → Fully connected layer → Output 128-dimensional motion feature vector. The propagation rule of the l-th layer is:

[0022] [Equation image not reproduced in the source text.]

[0023] where A_k is the normalized adjacency matrix of the k-th partitioning strategy, and W_k^(l) is the learnable weight matrix of the l-th layer.
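To make the roles of A_k and W_k^(l) concrete, the following is a minimal sketch of one spatial graph-convolution step in plain Python. It assumes the commonly used rule ReLU(Σ_k A_k H W_k) and the symmetric normalization D^{-1/2}(A + I)D^{-1/2}; neither is confirmed by the text, the temporal convolution of a full ST-GCN layer is omitted, and all names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (an assumed choice)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]

def graph_conv(h, adjs, weights):
    """One spatial step: ReLU(sum_k A_k @ H @ W_k)."""
    n, c_out = len(h), len(weights[0][0])
    total = [[0.0] * c_out for _ in range(n)]
    for a_k, w_k in zip(adjs, weights):
        term = matmul(matmul(a_k, h), w_k)
        total = [[t + u for t, u in zip(tr, ur)] for tr, ur in zip(total, term)]
    return [[max(0.0, v) for v in row] for row in total]

# Toy example: 3 joints in a chain, 2 input channels, 2 output channels, K = 1.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_norm = normalize_adjacency(adj)
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, -1.0], [0.5, 2.0]]
h1 = graph_conv(h0, [a_norm], [w])
```

Each row of `h1` is the updated feature vector of one joint, mixed from its neighbours according to `a_norm`. A trained network would stack such layers with the 64, 128, and 256 channel widths listed above.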

[0024] S1.4. Based on dynamic programming of human skeletal joints, the DTW algorithm is used to solve for the optimal solution of skeletal joint coordinates changing over time. In application:

[0025] Construct the cumulative distance matrix:

[0026] [Equation image not reproduced in the source text.]

[0027] The optimal alignment path W* is then solved using dynamic programming.
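The cumulative distance matrix and the alignment path W* can be illustrated with the textbook DTW recurrence, D(i, j) = d(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)). This is a sketch of standard DTW on scalar sequences, not necessarily the exact variant used in this application; the function name `dtw` and the sample sequences are hypothetical.

```python
def dtw(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) for dynamic time warping of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # D[i][j] = minimal accumulated cost aligning seq_a[:i] with seq_b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return D[n][m], path

# A reference movement and a slower execution of the same movement.
reference = [0, 1, 2, 3]
learner = [0, 0, 1, 2, 2, 3]
cost, path = dtw(reference, learner)
```

Because the learner sequence is only a time-stretched copy of the reference, the alignment cost is zero, which is the time-axis scaling that fixed-window comparison cannot handle.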

[0028] S1.5. Based on the motion feature vector obtained in S1.3 and the optimal solution of the skeletal joint coordinates changing over time in S1.4, a Siamese network is used to calculate the similarity between the motion features and the optimal solution and to output a score. The erroneous segment is located by the DTW cumulative distance, and the specific erroneous joint is located by node contribution analysis.
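The scoring and localization ideas in S1.5 can be sketched as follows. A fixed linear map stands in for the trained shared-weight branch of a Siamese network, cosine similarity stands in for the score, and per-joint Euclidean distance stands in for node contribution analysis. All three are assumptions made for illustration, since the text does not specify these details.

```python
import math

def embed(x, weights):
    """Shared branch: the same weights are applied to both inputs."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def joint_contributions(learner_frame, reference_frame):
    """Per-joint Euclidean error between two aligned frames of (x, y, z) joints."""
    return [math.dist(p, q) for p, q in zip(learner_frame, reference_frame)]

# Hypothetical toy values.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
score = cosine_similarity(embed([1.0, 2.0, 3.0], W), embed([1.0, 2.0, 3.0], W))
errors = joint_contributions([(0, 0, 0), (1, 1, 1)], [(0, 0, 0), (1, 1, 3)])
worst_joint = max(range(len(errors)), key=errors.__getitem__)
```

Here identical inputs give a similarity of 1, and the second joint is flagged because it deviates from the reference by 2 units along one axis.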

[0029] S2. Based on the structured action features obtained in S1, physiological and cognitive data are introduced and dynamically weighted and fused using an attention mechanism to characterize the learner's overall state. Specifically, the introduction of physiological data includes: S2.1, acquiring physiological data during movement through smart wearable devices; S2.2, extracting the fluctuation features of the physiological data during movement in S2.1 using 1D-CNN; S2.3, analyzing the temporal relationship of the fluctuation features in S2.2 using bidirectional LSTM, and finally outputting a physiological feature vector. The introduction of cognitive data includes: S2.4, acquiring known motion cognitive text data; S2.5, refining the motion cognitive text data in S2.4 using a Transformer model, and finally outputting a cognitive feature vector. In application, the cross-modal attention fusion is set as follows:

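[0030] [Equation image not reproduced in the source text.]

[0031] [Equation image not reproduced in the source text.]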
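The fusion formulas are not legible in this text. As a generic illustration of attention-weighted fusion, the sketch below scores each modality vector against a query with a scaled dot product, normalizes the scores with softmax, and returns the weighted sum. It assumes all three feature vectors have already been projected to a common dimension; the query vector and the feature values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Weight each modality by softmax(q . f / sqrt(d)); return (weights, fused)."""
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d) for feat in features]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(d)]
    return weights, fused

motion = [0.9, 0.1, 0.3, 0.5]
physio = [0.2, 0.8, 0.4, 0.1]
cognitive = [0.5, 0.5, 0.5, 0.5]
query = [1.0, 0.0, 0.0, 1.0]  # hypothetical learned query vector
weights, fused = attention_fuse([motion, physio, cognitive], query)
```

The weights sum to 1 and change with the input, which is what makes the fusion dynamic rather than a fixed average of the three modalities.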

[0032] S3. Using the fused state in S2 as context, and combining knowledge node interaction data, a fine-grained knowledge mastery model is constructed using the Dynamic Key-Value Memory Network (DKVMN) knowledge tracing method. This model bridges the learner's overall state and cognitive understanding and identifies the learner's knowledge gaps. Specifically:

[0033] The concept matrix is set as follows:

[0034] [Equation image not reproduced in the source text.]

[0035] The state matrix is set as follows:

[0036] [Equation image not reproduced in the source text.]

[0037] Attention weights are set as follows:

[0038] [Equation image not reproduced in the source text.]

[0039] Write operation:

[0040] [Equation image not reproduced in the source text.]
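For readers unfamiliar with DKVMN, the sketch below shows the attention, read, and write operations in their commonly published form. It assumes that the "concept matrix" corresponds to the DKVMN key matrix and the "state matrix" to the value matrix; the erase and add vectors are passed in directly rather than produced by learned layers, and all numbers are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dkvmn_attention(key_query, key_matrix):
    """Correlation weight of the query with each concept slot."""
    return softmax([sum(q * k for q, k in zip(key_query, row)) for row in key_matrix])

def dkvmn_read(weights, value_matrix):
    """Read vector: attention-weighted sum of the value slots."""
    d = len(value_matrix[0])
    return [sum(w * row[i] for w, row in zip(weights, value_matrix)) for i in range(d)]

def dkvmn_write(weights, value_matrix, erase, add):
    """Update each slot: M_v[i] * (1 - w_i * erase) + w_i * add, element-wise."""
    return [
        [v * (1.0 - w * e) + w * a for v, e, a in zip(row, erase, add)]
        for w, row in zip(weights, value_matrix)
    ]

# Three concept slots with 2-dimensional keys and values (toy sizes).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w = dkvmn_attention([2.0, 0.0], keys)
new_values = dkvmn_write(w, values, erase=[0.5, 0.5], add=[1.0, 0.0])
mastery = dkvmn_read(w, new_values)
```

Slots whose keys correlate most with the practised concept receive the largest update, so low values in a slot after many interactions can be read as a knowledge gap for that concept.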

[0041] S4. Using the learner's overall state in S2 and the learner's knowledge gap in S3 as inputs, personalized intervention strategies are generated through deep reinforcement learning to drive the optimization of the teaching process. Specifically, the learner's overall state S is a 32-dimensional vector, including skill mastery, knowledge mastery, fatigue index, learning style, and historical trajectory. The action space A formed by the learner's knowledge gap consists of 200 atomic tasks (100 motor skills + 45 knowledge learning + 5 recovery and adjustment + 50 comprehensive application). The reward function is set as follows:

[0042] [Equation image not reproduced in the source text.]
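The reward formula is not legible in this text. The sketch below illustrates how a DQN-style agent could use a reward over the 200 atomic tasks: a hypothetical reward that credits skill and knowledge gains and penalizes fatigue, epsilon-greedy action selection, and the one-step target r + γ·max Q. The reward terms, their weights, and γ are assumptions, not values from this application.

```python
import random

NUM_ACTIONS = 200  # atomic tasks, as stated in the text

def reward(skill_gain, knowledge_gain, fatigue, w=(1.0, 1.0, 0.5)):
    """Hypothetical reward: credit progress, penalize fatigue."""
    return w[0] * skill_gain + w[1] * knowledge_gain - w[2] * fatigue

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else pick argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def td_target(r, next_q_values, gamma=0.99, done=False):
    """One-step target used in DQN-style updates: r + gamma * max_a Q(s', a)."""
    return r if done else r + gamma * max(next_q_values)

rng = random.Random(0)
q = [0.0] * NUM_ACTIONS
q[42] = 1.0                      # pretend task 42 currently looks best
greedy = select_action(q, epsilon=0.0, rng=rng)
target = td_target(reward(0.2, 0.1, 0.4), q)
```

In a full DQN the list `q` would be the output of a neural network that takes the 32-dimensional learner state as input, and `target` would drive the loss for the chosen action.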

[0043] S5. An NVIDIA Jetson AGX Xavier edge computing unit, INT8 quantization of the ST-GCN, TensorRT optimization, and a pipelined parallel design are used to execute the methods described in S1-S4, achieving an end-to-end latency of less than 20 ms.
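The arithmetic behind INT8 quantization can be shown with a small sketch. It uses symmetric per-tensor quantization with a single scale, which is one common scheme; the calibration that TensorRT actually performs is more involved and is not reproduced here, and the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.25, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is the accuracy cost traded for lower latency on the edge device.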

Claims

1. A method for intelligent analysis of motion learning states based on spatiotemporal graph neural networks and multimodal fusion, characterized in that it includes: S1. Using a spatiotemporal graph neural network and a dynamic time warping method, the learner's raw motion data is transformed into structured action features; S2. Based on the structured action features obtained in S1, physiological and cognitive data are introduced and dynamically weighted and fused using an attention mechanism to characterize the learner's overall state; S3. Using the fused state in S2 as context, and combining the knowledge node interaction data, a fine-grained knowledge mastery model is constructed through DKVMN to bridge the learner's overall state and cognitive understanding, and the learner's knowledge gap is obtained; S4. Using the learner's overall state in S2 and the learner's knowledge gap in S3 as inputs, personalized intervention strategies are generated through deep reinforcement learning to drive the optimization of the teaching process.

2. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 1, characterized in that S1 includes: S1.1 Obtain the spatiotemporal coordinate sequence of human skeletal joints during motion based on motion images; S1.2 Construct a spatiotemporal graph containing spatial and temporal edges based on the spatiotemporal coordinate sequence of human skeletal joints in S1.1; S1.3 Use a hierarchical ST-GCN network to extract the time-varying features of skeletal joint coordinates from the spatiotemporal graph constructed in S1.2, and obtain motion feature vectors; S1.4 Based on the dynamic programming of human skeletal joints, use the DTW algorithm to solve for the optimal solution of the skeletal joint coordinates changing over time; S1.5 Based on the motion feature vector obtained in S1.3 and the optimal solution of the skeletal joint coordinates changing over time in S1.4, use a Siamese network to calculate the similarity between the motion features and the optimal solution, output a score, locate the erroneous segment by DTW cumulative distance, and locate the specific erroneous joint by node contribution analysis.

3. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 2, characterized in that, in step S1.1, the spatiotemporal coordinate sequence of human skeletal joints is acquired through an optical-inertial fusion sensor network. The optical acquisition frequency is 120fps, the inertial acquisition frequency is 1000fps, and clock synchronization is achieved through the IEEE 1588 protocol.

4. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 3, characterized in that, in step S1.2, when constructing the spatiotemporal graph containing spatial and temporal edges, the human skeletal structure is modeled as an undirected graph G=(V,E), where the node set V corresponds to 15 key joints, and the edge set E = E_s ∪ E_t, where E_s is the set of spatial edges and E_t is the set of temporal edges.

5. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 4, characterized in that the network structure of the hierarchical ST-GCN network in S1.3 is set as follows: Input layer → ST-GCN layer 1 (64 channels) → ST-GCN layer 2 (128 channels) → ST-GCN layer 3 (256 channels) → Global average pooling → Fully connected layer → Output 128-dimensional motion feature vector. The propagation rule of the l-th layer is: [Equation image not reproduced in the source text.], where A_k is the normalized adjacency matrix of the k-th partitioning strategy, and W_k^(l) is the learnable weight matrix of the l-th layer.

6. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 5, characterized in that introducing physiological data in S2 includes: S2.1 Acquire physiological data during exercise through smart wearable devices; S2.2 Use 1D-CNN to extract the fluctuation characteristics of the physiological data during exercise in S2.1; S2.3 Use bidirectional LSTM to analyze the temporal relationship of the fluctuation characteristics in S2.2, and finally output the physiological feature vector.

7. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 6, characterized in that introducing cognitive data in S2 includes: S2.4 Obtain known motion cognition text data; S2.5 Use the Transformer model to refine the motion cognition text data in S2.4, and finally output the cognitive feature vector.

8. The intelligent motion learning state analysis method based on spatiotemporal graph neural network and multimodal fusion as described in claim 7, characterized in that it also includes: S5. An NVIDIA Jetson AGX Xavier edge computing unit, INT8 quantization of the ST-GCN, TensorRT optimization, and a pipelined parallel design are used to execute the methods described in S1-S4, achieving an end-to-end latency of less than 20 ms.