Multimodal alignment-based sports interdisciplinary interaction system and method
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- RONGMENGYUESHI (SHANGHAI) SPORTS TECHNOLOGY CO LTD
- Filing Date
- 2026-06-01
- Publication Date
- 2026-06-26
AI Technical Summary
In current physical education teaching, movement evaluation relies on teachers' subjective judgment, which makes it difficult to accurately quantify and provide real-time feedback. Movement deviations are not corrected in a timely manner, resulting in low training efficiency. Physical exercise is monotonous and lacks interest, leading to low student participation. Furthermore, there is a lack of real-time linkage between movement perception and subject knowledge learning.
The method collects videos of students' physical exercise, extracts the coordinates of key skeletal points frame by frame to generate action temporal feature data, constructs a multimodal alignment linkage matching model, automatically pushes subject-specific answering tasks, generates real-time scores by combining action completion and answering results, and dynamically adjusts task difficulty and adaptation level.
It achieves a deep integration of physical exercise and subject learning, enhances interactivity and teaching relevance, accurately quantifies movement assessment, dynamically adjusts task difficulty, and improves students' learning initiative and comprehensive knowledge mastery.
Figure CN122290221A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of sports education technology, and specifically relates to a multimodal alignment-based interdisciplinary interactive system and method for sports.
Background Technology
[0002] Current physical education teaching relies heavily on traditional demonstrations and verbal instruction, with movement evaluation depending on teachers' subjective judgment. This makes it difficult to accurately quantify and provide real-time feedback on students' movement standardization and completion, leading to problems such as untimely correction of movement deviations and low training efficiency. Furthermore, the forms of physical exercise are monotonous and lack appeal, resulting in generally low student participation and hindering long-term, regular implementation. With the development of smart education, the integration of physical education and subject knowledge is becoming a trend. However, existing interactive models are mostly static, lacking real-time linkage between movement perception and knowledge learning. They cannot dynamically match learning tasks based on students' athletic performance, and traditional systems do not employ multimodal temporal alignment technology. This makes it difficult to accurately link movement data with subject knowledge points, resulting in delayed interactive triggering, poor adaptability, and an inability to automatically identify weak knowledge points based on incorrect answers.
Summary of the Invention
[0003] To address the shortcomings of existing technologies, this invention proposes a multimodal alignment-based interdisciplinary interactive system and method for sports. It collects videos of students' physical exercise, extracts the coordinates of key skeletal points frame by frame to generate temporal feature data of movements, and calculates the degree of movement completion. The system segments the movement data using force application frames and posture transition frames as anchor points, and compares the resulting segments with preset standard movement templates to generate movement deviation data. A linkage matching model between movements and subject knowledge points is constructed. After a student completes a standard movement and triggers the matching, a corresponding subject-specific question-and-answer task is automatically pushed to them. Real-time individual scoring data is generated by combining the movement completion rate with the answer results, and historical scoring data is synchronized. Weak knowledge points are marked and fed back to the linkage matching model, dynamically adjusting the difficulty of movement triggering and the level of question suitability. This application achieves deep integration of physical exercise and subject learning, enhances interactivity and teaching relevance, and is applicable to intelligent sports scenarios on campuses.
[0004] To achieve the above objectives, the present invention provides the following technical solution:
[0005] A multimodal alignment-based interdisciplinary interaction method for sports includes:
[0006] The AI exercise machine uses its front-facing camera to collect real-time RGB video streams of students' physical exercises during breaks. The machine extracts the coordinate sequence of key skeletal points frame by frame from the video stream to generate action temporal feature data. The action temporal feature data is then analyzed to obtain the degree of action completion.
[0007] The force application frames and posture transition frames in the action temporal feature data are used as temporal anchor points to segment the data into standardized action segment data. At the same time, action deviation data is generated by comparison with the preset standard action template.
[0008] Based on standardized action fragment data, a linkage and matching model between action fragments and subject knowledge points is constructed.
[0009] Once a student completes the corresponding physical activity and triggers the linkage matching model to generate a valid activity matching result, the AI exercise machine automatically pushes the subject-related question-and-answer task bound to the corresponding physical activity, forming a motion-triggered question-and-answer instruction.
[0010] Collect students' answers, combine them with the degree of action completion, generate real-time individual score data, synchronize historical score data, calculate and visualize the comparison results of individual historical scores, and mark weak knowledge points based on the errors in the answers.
[0011] The results of individual historical performance comparison, action deviation data and weak knowledge point data are fed back to the linkage matching model to dynamically adjust the difficulty of action triggering and the subject question matching level.
[0012] Specifically, the step of extracting the skeletal keypoint coordinate sequence frame by frame from the video stream to generate action temporal feature data includes:
[0013] A skeletal keypoint detection algorithm based on a convolutional pose machine is used to perform human detection on each frame of the video stream and output human bounding boxes.
[0014] Within the human body bounding box, the confidence heatmap of each joint point of the human body is predicted by the multi-stage convolutional network of the convolutional pose machine; the multi-stage convolutional network includes a first-stage convolutional network to an Nth-stage convolutional network, the first-stage convolutional network extracts a preliminary feature map from the input image, and the Nth-stage convolutional network fuses the feature map of the previous stage with the intermediate supervision signal to iteratively refine the joint point prediction results.
[0015] Non-maximum suppression processing is applied to the confidence heatmap to extract the pixel coordinates of each joint in the image coordinate system. The pixel coordinates are then arranged in chronological order to form a sequence of skeletal key point coordinates. Each frame corresponds to a coordinate vector containing all joints, and the coordinate vectors of all frames are indexed by the acquisition time to form action temporal feature data. The joints include the shoulder joint, elbow joint, wrist joint, hip joint, knee joint, and ankle joint.
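The extraction step described above can be illustrated with a minimal sketch. It assumes a single student per bounding box and replaces full non-maximum suppression with taking the single strongest cell of each joint heatmap, which is a simplification. The names `heatmap_peak`, `frame_vector`, and `temporal_features` are hypothetical and do not come from the application.

```python
from typing import Dict, List, Tuple

Heatmap = List[List[float]]  # heatmap[y][x] = confidence that the joint is at (x, y)

# The six joints named in the text; one heatmap per joint is assumed.
JOINTS = ["shoulder", "elbow", "wrist", "hip", "knee", "ankle"]


def heatmap_peak(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, confidence) of the strongest cell (simplified NMS)."""
    best_x, best_y, best_conf = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, conf in enumerate(row):
            if conf > best_conf:
                best_x, best_y, best_conf = x, y, conf
    return best_x, best_y, best_conf


def frame_vector(heatmaps: Dict[str, Heatmap]) -> List[int]:
    """Flatten the per-joint peaks of one frame into [x1, y1, x2, y2, ...]."""
    vector: List[int] = []
    for joint in JOINTS:
        x, y, _ = heatmap_peak(heatmaps[joint])
        vector += [x, y]
    return vector


def temporal_features(
    frames: List[Tuple[int, Dict[str, Heatmap]]]
) -> List[Tuple[int, List[int]]]:
    """frames holds (timestamp_ms, heatmaps); the result is ordered by acquisition time."""
    ordered = sorted(frames, key=lambda frame: frame[0])
    return [(timestamp, frame_vector(maps)) for timestamp, maps in ordered]
```

Each entry of the returned list pairs a timestamp with one coordinate vector, which corresponds to the action temporal feature data indexed by acquisition time.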
[0016] Specifically, the synchronous analysis of action timing feature data to obtain action completion degree includes:
[0017] The coordinate vector of each frame in the action temporal feature data is input into a preset action recognition classifier. The action recognition classifier adopts a binary classification structure based on a temporal convolutional network and outputs the probability distribution of the action category to which the current frame belongs.
[0018] Based on the probability distribution, determine whether the current frame belongs to the action category in the preset standard action sequence. When the probability of the highest category in the probability distribution is greater than the preset category threshold, mark the current frame as a valid action frame.
[0019] Calculate the percentage of effective action frames within a continuous time window, normalize the percentage of frames, and obtain the action completion rate.
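The completion-rate calculation above can be sketched as follows. The classifier itself is outside the sketch; its per-frame probability distributions are taken as given. The function names, the class-index encoding, and the choice to express the normalized result as a value between 0 and 1 are assumptions made for illustration.

```python
from typing import List, Sequence, Set


def valid_frame_flags(
    frame_probs: Sequence[Sequence[float]],
    expected_classes: Set[int],
    class_threshold: float,
) -> List[bool]:
    """Mark a frame valid when its top class is expected and confident enough."""
    flags = []
    for probs in frame_probs:
        top = max(range(len(probs)), key=lambda i: probs[i])
        flags.append(top in expected_classes and probs[top] > class_threshold)
    return flags


def completion_rate(flags: Sequence[bool], window: int) -> float:
    """Share of valid frames inside the most recent `window` frames, in [0, 1]."""
    recent = list(flags)[-window:]
    return sum(recent) / len(recent) if recent else 0.0
```

For example, at the 30 frames per second used in the embodiment, `window=30` would score the most recent second of movement.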
[0020] Specifically, the segmentation of motion sequence data using force application frames and posture transition frames as temporal anchors to obtain standardized motion segment data includes:
[0021] The displacement change of each joint point between adjacent frames is calculated for the coordinate sequence of the skeletal key points, and the displacement change is summed to obtain the inter-frame motion energy value, generating a motion energy curve;
[0022] Local peak points are detected on the motion energy curve. When the inter-frame motion energy value of a local peak point exceeds the preset force energy threshold, the frame index corresponding to that local peak point is marked as a force application frame.
[0023] The change in joint angles between adjacent frames is calculated and accumulated to obtain the posture change rate. When the posture change rate exceeds a preset transition rate threshold, the current frame is marked as a posture transition frame. Each joint angle is defined by three points: the shoulder, elbow, and wrist joints, or the hip, knee, and ankle joints.
[0024] Using the force application frames and the posture transition frames as segmentation boundaries, the action temporal feature data is divided into multiple continuous action segments. Each action segment is time-normalized by linear interpolation to a fixed number of frames to obtain standardized action segment data.
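The segmentation steps above can be sketched in a few small functions. This is an illustrative sketch under assumptions: displacement is taken as Euclidean distance, a local peak is a value greater than its left neighbour and not smaller than its right neighbour, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one (x, y) pixel coordinate per joint


def motion_energy(frames: Sequence[Frame]) -> List[float]:
    """Sum of per-joint displacements between each frame and its predecessor."""
    energy = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        energy.append(sum(math.dist(p, q) for p, q in zip(prev, cur)))
    return energy


def force_frames(energy: Sequence[float], threshold: float) -> List[int]:
    """Indices of local peaks whose energy exceeds the force energy threshold."""
    return [
        i for i in range(1, len(energy) - 1)
        if energy[i - 1] < energy[i] >= energy[i + 1] and energy[i] > threshold
    ]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at vertex b, e.g. shoulder-elbow-wrist."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.atan2(abs(cross), dot))


def posture_transition_frames(
    angles: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Frames whose accumulated joint-angle change from the previous frame exceeds the threshold."""
    return [
        i for i in range(1, len(angles))
        if sum(abs(x - y) for x, y in zip(angles[i], angles[i - 1])) > threshold
    ]


def split_at(frames: Sequence[Frame], anchors: Sequence[int]) -> List[Sequence[Frame]]:
    """Cut the sequence into consecutive segments at the anchor frame indices."""
    inner = sorted({a for a in anchors if 0 < a < len(frames)})
    bounds = [0] + inner + [len(frames)]
    return [frames[s:e] for s, e in zip(bounds, bounds[1:])]


def resample(segment: Sequence[Frame], n: int) -> List[Frame]:
    """Linearly interpolate a segment to exactly n frames."""
    if len(segment) == 1:
        return [list(segment[0]) for _ in range(n)]
    out = []
    for k in range(n):
        t = k * (len(segment) - 1) / (n - 1) if n > 1 else 0.0
        i = min(int(t), len(segment) - 2)
        w = t - i
        out.append([
            (p[0] + w * (q[0] - p[0]), p[1] + w * (q[1] - p[1]))
            for p, q in zip(segment[i], segment[i + 1])
        ])
    return out
```

Passing the union of the force application frames and posture transition frames to `split_at`, then calling `resample` on each segment, yields segments of equal length that can be compared frame by frame with a template.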
[0025] Specifically, the step of comparing the data with a preset standard action template to generate action deviation data includes:
[0026] Obtain a preset standard motion template; the standard motion template includes a standard skeletal key point coordinate sequence and the allowable deviation range for each key point;
[0027] Calculate the Euclidean distance between the coordinate vectors of each joint point in each frame of the standardized motion segment data and the coordinate vectors of the same joint point in the corresponding frame of the standard motion template to obtain the frame-level deviation value of each joint point.
[0028] The frame-level deviation values of all key points are weighted and summed to obtain the comprehensive deviation value of the corresponding frame.
[0029] Frames whose overall deviation value exceeds a preset deviation threshold are marked as deviation frames. The duration of continuous occurrence and the peak deviation amplitude of the deviation frames are counted, and the duration of continuous occurrence and the peak deviation amplitude are encapsulated into motion deviation data.
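The deviation calculation above can be sketched as follows. The sketch assumes that the segment and the template have already been normalized to the same number of frames and the same joint order, and it reports the continuous duration in frames. The function names and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Frame = Sequence[Point]


def frame_deviation(frame: Frame, template: Frame, weights: Sequence[float]) -> float:
    """Weighted sum of per-joint Euclidean distances for one frame."""
    return sum(w * math.dist(p, q) for w, p, q in zip(weights, frame, template))


def deviation_data(
    segment: Sequence[Frame],
    template: Sequence[Frame],
    weights: Sequence[float],
    threshold: float,
) -> List[Dict[str, float]]:
    """Group consecutive deviation frames and record their length and peak value."""
    values = [frame_deviation(f, t, weights) for f, t in zip(segment, template)]
    runs: List[Dict[str, float]] = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, value in enumerate(values + [float("-inf")]):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append({
                "start_frame": start,
                "duration_frames": i - start,
                "peak_deviation": max(values[start:i]),
            })
            start = None
    return runs
```

At the 30 frames per second stated in the embodiment, dividing `duration_frames` by 30 gives the duration in seconds.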
[0030] Specifically, the process of establishing the linkage matching model includes:
[0031] A multimodal temporal alignment algorithm is constructed; the multimodal temporal alignment algorithm includes an action encoder and a text encoder, the action encoder adopts a neural network structure based on temporal convolutional network and self-attention mechanism, and the text encoder adopts a pre-trained language model structure based on bidirectional encoder representation;
[0032] The standardized motion fragment data is input into the motion encoder. The input layer of the motion encoder receives a fixed number of skeletal keypoint coordinate sequences. Local temporal features are extracted through multiple convolutional layers of a temporal convolutional network. Then, the inter-frame dependencies are calculated through the fully connected layer of the self-attention mechanism, and the motion embedding vector is output.
[0033] The preset subject knowledge point text is input into the text encoder, and semantic features are extracted through multiple transformer layers of the pre-trained language model based on bidirectional encoder representation, and the text embedding vector is output.
[0034] Specifically, the process of establishing the linkage matching model also includes:
[0035] A contrastive learning-based training method is adopted, using standardized action fragment data and corresponding subject knowledge point text as positive sample pairs, and standardized action fragment data and random subject knowledge point text as negative sample pairs. The parameters of the action encoder and the text encoder are optimized so that the cosine similarity between the action embedding vector and the text embedding vector of the positive sample pair is greater than a preset similarity threshold, and the cosine similarity between the negative sample pairs is less than a preset dissimilarity threshold. The action embedding vector output by the trained action encoder is used as the mapping representation of the standardized action fragment data in the embedding space.
[0036] A linkage matching model is established based on mapping representation; the linkage matching model includes a matching mapping table, which stores the association between action embedding vectors and text embedding vectors. When the cosine similarity between the action embedding vector and the text embedding vector of the standardized action segment data exceeds a preset effective matching threshold, the binding relationship between the corresponding action segment and the corresponding subject knowledge point is recorded in the matching mapping table.
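How the matching mapping table could be filled from trained embeddings is sketched below. The encoders and the contrastive training are outside the sketch; the embedding vectors are taken as plain lists of numbers. The identifiers, the record layout, and the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_mapping_table(
    action_embeddings: Dict[str, Sequence[float]],
    text_embeddings: Dict[str, Sequence[float]],
    match_threshold: float,
) -> List[dict]:
    """Bind an action segment to a knowledge point when similarity exceeds the threshold."""
    table = []
    for segment_id, action_vec in action_embeddings.items():
        for kp_id, text_vec in text_embeddings.items():
            similarity = cosine(action_vec, text_vec)
            if similarity > match_threshold:
                table.append({
                    "segment": segment_id,
                    "knowledge_point": kp_id,
                    "action_vec": list(action_vec),
                    "similarity": similarity,
                })
    return table
```

Storing the action embedding in each record lets a later lookup compare a new segment against the table without re-running the text encoder.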
[0037] Specifically, the process of forming the motion-triggered answer instruction includes:
[0038] The standardized action segment data is received in real time. The trained action encoder computes the action embedding vector of the current action segment, and the cosine similarity between this vector and each action embedding vector stored in the matching mapping table of the linkage matching model is calculated to find the matching record with the maximum cosine similarity.
[0039] When the maximum cosine similarity is greater than the preset effective matching threshold, an effective action matching result is generated, and the subject knowledge point identifier bound to the corresponding action segment is obtained according to the binding relationship in the matching record.
[0040] Based on the subject knowledge point identifier, a set of questions associated with the corresponding knowledge point is queried from a preset subject answer database. Questions suitable for the current student's historical answer level are selected from the question set to generate a subject answer task. The subject answer database stores questions according to knowledge point tags and difficulty level indexes.
[0041] The AI exercise machine pushes the subject-specific question-and-answer task through its display interface and generates a motion-triggered answer instruction; the motion-triggered answer instruction includes a task identifier, a time limit for answering, and a scoring weight.
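The trigger flow above can be sketched end to end. The sketch assumes that each table record stores an action embedding and a knowledge point identifier, that question difficulty and student level share one numeric scale, and that the most suitable question is the one whose difficulty is closest to the student's level. All names and field layouts are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_segment(
    current_vec: Sequence[float], table: List[dict], match_threshold: float
) -> Optional[str]:
    """Return the bound knowledge point of the best record, or None if no valid match."""
    best_kp, best_sim = None, float("-inf")
    for record in table:
        sim = cosine(current_vec, record["action_vec"])
        if sim > best_sim:
            best_kp, best_sim = record["knowledge_point"], sim
    return best_kp if best_sim > match_threshold else None


def pick_question(bank: List[dict], kp_id: str, student_level: float) -> Optional[dict]:
    """Choose the question for kp_id whose difficulty is closest to the student's level."""
    candidates = [q for q in bank if q["knowledge_point"] == kp_id]
    return min(candidates, key=lambda q: abs(q["difficulty"] - student_level), default=None)


def make_instruction(question: dict, time_limit_s: int, score_weight: float) -> dict:
    """Package the three fields the text lists for a motion-triggered answer instruction."""
    return {
        "task_id": question["id"],
        "time_limit_s": time_limit_s,
        "score_weight": score_weight,
    }
```

When `match_segment` returns `None`, no task is pushed, which mirrors the requirement that only a valid action matching result triggers a question.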
[0042] Specifically, the step of collecting students' answers, combining them with the degree of action completion to generate real-time individual score data, synchronizing historical score data, calculating and visualizing the comparison results of individual historical scores, and marking weak knowledge points based on answer errors includes:
[0043] The AI exercise machine collects students' answers to the subject-specific question-and-answer tasks via its touchscreen or voice input interface; the answers include option selection results, text input results, or voice recognition results.
[0044] The answers are compared with the standard answers in the subject-specific answer database to determine the correctness of the answers and to record the time taken to answer them.
[0045] The individual real-time score data s is calculated according to a preset formula (the expression itself is missing from this text), where a represents the action completion rate, b represents the correctness score of the answer, and c represents the answering time. The formula uses a preset first weight coefficient, a preset second weight coefficient, and a preset third weight coefficient, which satisfy constraints that are likewise missing from this text. The correctness score is 1 for a correct answer and 0 for an incorrect answer.
[0046] The system synchronizes each student's historical real-time individual score data via local area network communication, generates the comparison results of individual historical scores, and displays them visually on the AI exercise machine's screen.
[0047] When the answer is incorrect, the subject knowledge point identifier corresponding to the subject answering task is recorded. This subject knowledge point identifier is accumulated into the corresponding student's incorrect knowledge point set. The frequency of errors of each subject knowledge point identifier in the incorrect knowledge point set is statistically analyzed. Subject knowledge point identifiers whose frequency exceeds the preset weak threshold are marked as weak knowledge point data.
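The weak knowledge point marking above amounts to counting errors per identifier and applying a threshold. A minimal sketch follows; the function names and the use of a plain list as the incorrect knowledge point set are assumptions.

```python
from collections import Counter
from typing import Dict, List


def record_answer(wrong_ids: List[str], kp_id: str, is_correct: bool) -> None:
    """Append the knowledge point identifier to the student's error list on a wrong answer."""
    if not is_correct:
        wrong_ids.append(kp_id)


def weak_points(wrong_ids: List[str], weak_threshold: int) -> Dict[str, int]:
    """Identifiers whose error frequency exceeds the preset weak threshold."""
    counts = Counter(wrong_ids)
    return {kp: n for kp, n in counts.items() if n > weak_threshold}
```

For instance, with a threshold of 2, a knowledge point must be answered incorrectly at least three times before it is marked as weak.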
[0048] Specifically, the process of feeding back the comparison results of individual historical scores, action deviation data, and weak knowledge point data to the linkage matching model to dynamically adjust the difficulty of action triggering and the subject question suitability level includes:
[0049] The improvement rate in the comparison results of the individual's historical scores, together with whether the individual's best historical record was reached, is mapped to an ability coefficient;
[0050] The comprehensive deviation value of each joint in the motion deviation data is normalized to obtain the motion standardization coefficient; the value of the motion standardization coefficient is equal to 1 minus the normalized comprehensive deviation value;
[0051] The error frequency corresponding to the knowledge point identifiers of each subject in the weak knowledge point data is normalized to obtain a knowledge point weakness vector.
[0052] The action trigger difficulty adjustment factor is calculated according to a preset formula (the expression itself is missing from this text) and fed back to the effective matching threshold in the linkage matching model. The formula uses the preset basic trigger difficulty, the ability coefficient p, the action standardization coefficient d, a preset first adjustment weight, and a preset second adjustment weight; the constraint on the two adjustment weights is likewise missing from this text;
[0053] The subject-specific question matching level is adjusted based on the knowledge point weakness vector and then fed back to the subject-specific question-answering task generation process in the linked matching model.
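The two normalization steps above can be sketched as follows. The text does not state how the values are normalized, so the sketch assumes that the comprehensive deviation value is divided by a chosen maximum and clipped to the range 0 to 1, and that error frequencies are divided by their total. Both choices, and the function names, are assumptions for illustration only. The difficulty adjustment factor itself is not sketched.

```python
from typing import Dict


def standardization_coefficient(deviation: float, max_deviation: float) -> float:
    """d = 1 - normalized deviation; the normalization by max_deviation is an assumption."""
    normalized = min(max(deviation / max_deviation, 0.0), 1.0)
    return 1.0 - normalized


def weakness_vector(error_counts: Dict[str, int]) -> Dict[str, float]:
    """Error frequencies scaled to sum to 1; scaling by the total is an assumption."""
    total = sum(error_counts.values())
    if total == 0:
        return {}
    return {kp: n / total for kp, n in error_counts.items()}
```

A coefficient close to 1 indicates a movement close to the template, and a larger entry in the weakness vector indicates a knowledge point that deserves more questions.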
[0054] Specifically, when the interdisciplinary sports interaction method is executed iteratively, the motion deviation data and weak knowledge point data generated in the previous iteration are used as inputs to the linkage matching model in the next iteration. The linkage matching model adjusts the allowable deviation range of each joint in the preset standard motion template according to the motion deviation data, and adjusts the binding relationship between questions and subject knowledge point identifiers in the subject answer database according to the weak knowledge point data, forming a closed-loop adaptive adjustment mechanism based on historical performance data.
[0055] A multimodal alignment-based interdisciplinary interactive system for sports includes:
[0056] The feature extraction module is used to collect real-time RGB video streams of human body movements during students' physical exercise, extract the coordinate sequence of key skeletal points frame by frame, generate temporal feature data of movements, and calculate the degree of completion of movements.
[0057] The deviation detection module is used to segment and normalize the action timing feature data to obtain standardized action segment data and compare it with the preset standard action template to generate action deviation data.
[0058] The linkage matching module is used to construct a multimodal temporal alignment algorithm, which maps standardized action fragment data to the embedding space and outputs action embedding vectors, converts subject knowledge point text into text embedding vectors, establishes a linkage matching model between action fragments and knowledge points, and forms a matching mapping table.
[0059] The motion-triggered push module is used to receive standardized motion fragment data in real time, calculate the cosine similarity between its motion embedding vector and each motion embedding vector in the linkage matching model, generate effective motion matching results, select questions suitable for students' answering level from the subject answering database based on the bound subject knowledge point identifier, push subject answering tasks through the AI training machine and generate motion-triggered answering instructions.
[0060] The scoring calculation module is used to collect students' touch screen or voice answer results, combine action completion rate, answer accuracy and answer time to calculate the real-time score data of an individual, and synchronously generate a comparison result of the individual's historical scores and display it visually, and mark the data of weak knowledge points.
[0061] Compared with the prior art, the beneficial effects of the present invention are:
[0062] 1. By introducing a multimodal temporal alignment algorithm and skeletal key point extraction technology, this solution achieves accurate capture and quantitative evaluation of students' physical exercise movements. Compared with the fuzzy evaluation method that relies on manual observation in traditional teaching, this solution can identify movement deviations with millisecond-level accuracy, transforming the abstract degree of movement completion into quantifiable data indicators, providing objective and scientific feedback for teaching, and effectively solving the technical pain points of strong subjectivity in movement guidance and inconsistent evaluation standards.
[0063] 2. This invention constructs an adaptive closed-loop system for physical exercise and subject learning, which not only enhances the interactivity of sports scenarios but also realizes the efficient reuse of educational resources. It can dynamically adjust the difficulty of tasks based on real-time motion data and answer results, accurately locate students' weak points, and upgrade from simple physical training to an intelligent learning experience that combines education and entertainment, thereby improving students' learning initiative and comprehensive knowledge mastery.
Attached Figure Description
[0064] Figure 1 is a schematic diagram of the interdisciplinary sports interaction method based on multimodal alignment according to the present invention;
[0065] Figure 2 is a flowchart illustrating the principle of the interdisciplinary sports interaction method based on multimodal alignment according to the present invention;
[0066] Figure 3 is an architecture diagram of the interdisciplinary sports interaction system based on multimodal alignment according to the present invention.
Detailed Implementation
[0067] Example 1:
[0068] Referring to Figure 1 and Figure 2, the present invention provides an embodiment of a multimodal alignment-based interdisciplinary sports interaction method. The method comprises steps S1 to S6, as follows:
[0069] S1: The AI exercise machine's front-facing camera collects real-time RGB video streams of students' physical exercises during breaks. The machine extracts the coordinate sequence of key skeletal points frame by frame from the video stream, generates action temporal feature data, and simultaneously analyzes the action temporal feature data to obtain the degree of action completion.
[0070] In this embodiment, AI exercise machines are deployed in classrooms, corridors, playgrounds, and other recess activity areas in primary and secondary schools. Each machine has a front-facing high-definition RGB camera with 1080P resolution and a stable capture rate of 30 frames per second, and a built-in high-performance edge computing chip that processes video locally in real time without relying on cloud servers. The devices maintain a stable connection with classroom teaching terminals and teacher management backends via the campus LAN, ensuring low latency and high reliability of data transmission. During students' daily recess, the AI exercise machine automatically enters working mode, with the front-facing camera continuously focused on the designated exercise area. It captures real-time RGB video streams of students performing standard recess exercises such as calisthenics, rope skipping, squats, chest expansions, leg presses, and lunges. Each frame of the video stream contains clear student body outlines, limb movement details, and complete posture information, without obstructions, blurring, or stuttering, so that the entire process of a movement from initiation to completion can be reproduced. While capturing the video stream, the device automatically adds a timestamp and a student identifier to each frame. The timestamp precision reaches the millisecond level, which supports accurate time-series analysis. Student identification is achieved through pre-recorded facial features or student ID numbers bound to the device, avoiding data confusion when multiple students exercise simultaneously.
[0071] Furthermore, this embodiment takes the standard broadcast gymnastics "chest expansion exercise" performed by third-grade elementary school students during recess as an example. An AI exercise machine is deployed in the recess activity area in front of the classroom. The front-facing camera of this device captures a real-time RGB video stream of a student completing a set of chest expansion exercises at a rate of 30 frames per second. The total duration of the video stream is approximately 8 seconds, with a total of 240 frames. Each frame clearly presents the complete movement details of the student's arms spreading out, chest expanding backward, and chest retracting forward, without any limb obstruction, light interference, or image shake.
[0072] The step of extracting the skeletal keypoint coordinate sequence frame by frame from the video stream to generate action temporal feature data includes:
[0073] S1.1: A skeletal keypoint detection algorithm based on a convolutional pose machine is used to perform human body detection on each frame of the video stream and output the human body bounding box.
[0074] Furthermore, the specific steps of S1.1 include:
[0075] (1) The acquired RGB real-time video stream of human body movement is decomposed into independent image frames at a frame rate of thirty frames per second. The size of each decomposed image frame is uniformly adjusted. The horizontal pixel count of the image is set to 1920 and the vertical pixel count is set to 1080. Then, the image is processed for brightness equalization. The overall brightness value of the image is adjusted to the range of 120 to 180 to eliminate the brightness difference caused by the intensity of natural light in the classroom and the angle of indoor lighting. Then, Gaussian noise filtering is performed. The size of the filter convolution kernel is set to 3×3 and the sigma parameter is set to 1.0 to remove random noise in the image without losing the details of the limb contour.
[0076] (2) The skeletal keypoint detection algorithm of the convolutional pose machine is adopted: image features are extracted through the human body detection network, which outputs human body candidate regions. The convolutional pose machine is prior art in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0077] (3) Non-maximum suppression is performed on all detected human candidate regions. The suppression threshold is set to 0.3. Redundant candidate regions with an overlap area of more than 70% are removed, and the optimal candidate region with the highest confidence and no large-area overlap is retained. For the retained optimal candidate region, the minimum bounding rectangle is calculated. The horizontal boundaries of the rectangle are based on the outermost pixels of the left and right limbs of the human body, each extended outward by 10 pixels. The vertical boundaries are based on the outermost pixels at the top and bottom of the human body, each extended outward by 10 pixels. This prevents the bounding box from being too tight and cutting off limb key points. The final human bounding box is output in standard rectangular format and completely encloses the student's head, torso, and all limbs. Its coordinates are expressed in pixels, with the upper left corner of the image as the origin, the horizontal direction as the X-axis, and the vertical direction as the Y-axis.
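Step (3) can be illustrated with a short sketch of greedy non-maximum suppression followed by the 10-pixel outward extension. The sketch assumes that overlap is measured as intersection over union and leaves the overlap threshold as a parameter, since the text gives both a suppression threshold and an overlap percentage. Boxes are `(x1, y1, x2, y2)` tuples in pixel coordinates, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), origin at the top-left corner


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], overlap_threshold: float) -> List[int]:
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= overlap_threshold for k in keep):
            keep.append(i)
    return keep


def pad_box(box: Box, margin: float, width: int, height: int) -> Box:
    """Extend the box outward by `margin` pixels on every side, clamped to the image."""
    return (
        max(0.0, box[0] - margin),
        max(0.0, box[1] - margin),
        min(width - 1.0, box[2] + margin),
        min(height - 1.0, box[3] + margin),
    )
```

With the 1920 by 1080 frames of the embodiment, `pad_box(box, 10, 1920, 1080)` applies the described extension without leaving the image.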
[0078] Furthermore, the human detection network based on the convolutional pose machine uses a dedicated dataset containing 50,000 images of primary and secondary school students' physical activities during the training phase. The batch size is set to 16, the initial learning rate is set to 0.0001, the learning rate decay strategy is cosine decay, the total number of training epochs is set to 50, the Adam optimizer is selected, the weight decay parameter is set to 0.0005, the deactivation probability of the Dropout layer is set to 0.5, and it is placed before the fully connected output layer to prevent overfitting. After complete training, the model achieves a human detection accuracy of over 98% in school sports scenarios. It can stably adapt to students' fast swinging limb movements and varied limb postures without detection omissions or bounding box offsets.
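The cosine decay strategy named above can be illustrated with the commonly used half-cosine schedule. The initial learning rate of 0.0001 and the 50 epochs come from the text; the final learning rate of 0 and the per-epoch (rather than per-batch) update are assumptions, and the function name is hypothetical.

```python
import math


def cosine_decay_lr(
    epoch: int,
    total_epochs: int = 50,
    initial_lr: float = 1e-4,
    final_lr: float = 0.0,  # assumed; the text does not state a floor value
) -> float:
    """Learning rate that falls from initial_lr to final_lr along a half cosine."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 0.0001, reaches half of that at epoch 25, and approaches 0 at epoch 50.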
[0079] S1.2: Within the human body bounding box, the confidence heatmap of each joint point of the human body is predicted by the multi-stage convolutional network of the convolutional pose machine; the multi-stage convolutional network includes a first-stage convolutional network to an Nth-stage convolutional network, the first-stage convolutional network extracts a preliminary feature map from the input image, and the Nth-stage convolutional network fuses the feature map of the previous stage with the intermediate supervision signal to iteratively refine the joint point prediction results. The convolutional network is a prior art in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0080] Furthermore, the multi-stage convolutional network of the convolutional pose machine is set up as six stages connected in a cascaded manner, with the output of each stage serving directly as the input of the next stage. The first-stage convolutional network consists of five consecutive convolutional layers and one output convolutional layer. In this embodiment, the image data within the human body bounding box is input into the first-stage convolutional network, and the five convolutional layers sequentially extract basic visual features such as edges, textures, and limb structures from the human body contour image to generate a preliminary feature map. The preliminary feature map is input into the output convolutional layer, which generates the initial confidence distribution of each joint point of the human body, yielding the initial joint point confidence heatmap of the first stage. This initial heatmap is used as the intermediate supervision signal of the first stage. The preliminary feature map output by the first stage and the intermediate supervision signal of the first stage are simultaneously input into the second-stage convolutional network. First, deep features are extracted from the preliminary feature map through four convolutional layers. Then, a feature fusion layer concatenates the extracted deep features and the intermediate supervision signal along the channel dimension to enhance the joint localization features. The fused features are input into the output convolutional layer to generate an optimized joint confidence heatmap, which is used as the intermediate supervision signal of the second stage and passed to the next stage. Following the processing logic of the second stage, the feature map and the intermediate supervision signal output by the previous stage are input into the current-stage convolutional network, and this is repeated stage by stage. After the sixth-stage convolutional network completes the final iteration, it outputs the final joint confidence heatmap. Each joint in the confidence heatmap corresponds to an independent confidence distribution region, with high-confidence regions concentrated near the true location of the joint.
[0081] Furthermore, during the training of the multi-stage convolutional network, a fully labeled dataset of skeletal key points of primary and secondary school students' sports movements was used. The dataset sample size was set to 50,000 images, the training batch size was set to 16, the initial learning rate was set to 0.0001, and the cosine decay method was used to adjust it to 0.0005. After each stage of the convolutional layer, a Dropout layer with a deactivation probability of 0.5 was set to prevent overfitting.
[0082] S1.3: Perform non-maximum suppression processing on the confidence heatmap, extract the pixel coordinates of each joint point in the image coordinate system, and arrange the pixel coordinates in chronological order to form a skeletal key point coordinate sequence. Each frame corresponds to a coordinate vector containing all joint points, and the coordinate vectors of all frames constitute action temporal feature data according to the acquisition time index. The joint points include the shoulder joint, elbow joint, wrist joint, hip joint, knee joint, and ankle joint.
[0083] Furthermore, the confidence heatmap is a probability distribution image of joint points output by a multi-stage convolutional network. High-brightness areas in the image represent extremely high probabilities of the corresponding joint point, while low-brightness areas represent extremely low probabilities. Non-maximum suppression (NMS) is a crucial operation for eliminating redundant candidate points in the heatmap and retaining the optimal joint point positions. This operation traverses every position in the confidence heatmap, selecting the core point with the highest confidence and suppressing surrounding candidate points with lower confidence, ensuring that each joint point retains only one optimal joint point position, avoiding duplicate positioning or positioning offset problems. The specific process of NMS is existing technology in this field and is not an inventive solution of this application, so it will not be elaborated here. After completing NMS, each optimal joint point position is converted into pixel coordinates in an image coordinate system. The image coordinate system has the upper left corner of the image as the origin, the horizontal axis as the X-axis, and the vertical axis as the Y-axis. Pixel coordinates can accurately reflect the specific position of the joint point in the image, with numerical precision reaching the single pixel level. This embodiment focuses on extracting six core joints: shoulder, elbow, wrist, hip, knee, and ankle. Each joint corresponds to an independent set of X and Y pixel coordinates. The coordinates of all joints are combined to form the set of skeletal keypoint coordinates for the current frame. Subsequently, according to the acquisition time sequence of the video stream, the set of skeletal keypoint coordinates for each frame is arranged sequentially to form a skeletal keypoint coordinate sequence. 
The skeletal keypoint coordinate sequence completely records the positional changes of each joint point throughout the entire process of the student's chest expansion exercise. Each frame corresponds to a coordinate vector containing the coordinates of all six joints. The coordinate vectors of all frames are indexed sequentially according to the acquisition timestamp, ultimately generating motion temporal feature data that can completely represent the dynamic changes of the student's movements. The motion temporal feature data accurately recreates every detail of the student's chest expansion exercise in digital form.
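The peak extraction in S1.3 can be illustrated with a minimal sketch. It assumes one heatmap per joint for a single student inside the bounding box, stored as a nested list `heatmap[y][x]`; the function names `extract_joint` and `frame_vector` and the `min_conf` cutoff are hypothetical and not part of the application. A point is kept only if no pixel in its 8-neighbourhood has a higher confidence, and the strongest such point is returned as the joint position.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]  # heatmap[y][x] is the confidence at pixel (x, y)


def extract_joint(heatmap: Heatmap, min_conf: float = 0.1) -> Optional[Tuple[int, int]]:
    """Return the (x, y) pixel of the strongest local maximum, or None.

    min_conf is an assumed cutoff below which no joint is reported.
    """
    height, width = len(heatmap), len(heatmap[0])
    best_value, best_xy = min_conf, None
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            # Collect the 8-neighbourhood, clipped at the image border.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Suppress any point that a neighbour exceeds; keep the best survivor.
            if all(value >= n for n in neighbours) and value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def frame_vector(heatmaps: List[Heatmap]) -> List[float]:
    """Concatenate per-joint (x, y) pixels into one coordinate vector."""
    vector: List[float] = []
    for heatmap in heatmaps:
        xy = extract_joint(heatmap)
        if xy is None:
            raise ValueError("joint not detected in this frame")
        vector.extend((float(xy[0]), float(xy[1])))
    return vector
```

With six joint heatmaps per frame, `frame_vector` yields a 12-value vector, and appending one vector per frame in acquisition order gives the coordinate sequence described above.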
[0084] Synchronously analyzing the action temporal feature data to obtain the action completion degree includes:
[0085] S1.4: Input the coordinate vector of each frame in the action temporal feature data into a preset action recognition classifier. The action recognition classifier adopts a binary classification structure based on a temporal convolutional network and outputs the probability distribution of the action category to which the current frame belongs.
[0086] Furthermore, the specific steps in S1.4 include:
[0087] The coordinate vectors corresponding to each frame are extracted sequentially from the action temporal feature data in acquisition order. Each coordinate vector contains the position information of all specified joints of the human body in the current frame. An action recognition classifier based on a temporal convolutional network is constructed. This classifier consists of a feature input layer, a temporal convolutional feature extraction layer, a global pooling layer, a fully connected classification layer, and a probability output layer. The layers are connected sequentially, with the output of each layer serving directly as the input of the next. The feature input layer receives the coordinate vector of each frame, and its dimension is set to 12, corresponding to the X and Y coordinates of the six joints in this scenario. The temporal convolutional feature extraction layer consists of three temporal convolutional units. In the first unit, the number of convolutional kernels is set to 32, the kernel length to 3, the stride to 1, and the padding to 1. In the second unit, the number of convolutional kernels is set to 64, the kernel length to 3, the stride to 1, and the padding to 1. In the third unit, the number of convolutional kernels is set to 128, the kernel length to 3, the stride to 1, and the padding to 1. Each temporal convolutional unit is followed by a rectified linear unit (ReLU) activation function to introduce nonlinear feature transformation. The normalized coordinate vector output by the feature input layer is fed into the temporal convolutional feature extraction layer, and feature extraction is completed sequentially through the three temporal convolutional units. The first unit extracts the basic distribution features of the coordinate vector. 
The second unit extracts the correlation features between key points, and the third unit extracts high-level discriminative features that can distinguish action categories, ultimately outputting a single-frame deep temporal feature map. The global pooling layer uses global average pooling, covering all feature dimensions of the deep temporal feature map. The pooling process does not change the number of feature channels; it only compresses the spatial dimension into a single value per channel, outputting a fixed-dimensional global feature vector. In this embodiment, the deep temporal feature map output by the temporal convolutional feature extraction layer is input into the global average pooling layer. The global average pooling operation completes feature compression, removes redundant feature information, and yields a fixed-dimensional global feature vector that carries all action-discriminative information of the current frame's coordinate vector. The fully connected classification layer is configured as a two-layer fully connected structure. The number of neurons in the first fully connected layer is set to 256, and it is followed by a ReLU activation function. The second fully connected layer is the classification output layer with a fixed number of 2 neurons, corresponding to the two output categories of the binary classification structure, namely the target standard action category and the non-target standard action category. In this embodiment, the global feature vector output by the global pooling layer is input into the fully connected classification layer. First, the high-order mapping and fusion of features are completed by the first fully connected layer. Then, the mapped features are sent to the second fully connected layer to generate the raw classification scores corresponding to the two action categories. 
The raw classification scores have not been normalized and only represent the classifier's tendency toward each category. The probability output layer normalizes these scores with the normalized exponential function (softmax), which operates on the raw classification scores output by the fully connected classification layer. After this calculation, the raw classification scores are converted into values between zero and one whose sum is always equal to one, forming a standard probability distribution. In this embodiment, the output is a probability distribution containing two values: the first value represents the probability that the current frame belongs to the target standard action category, and the second value represents the probability that the current frame belongs to the non-target standard action category.
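As an illustration of the probability output layer, the following sketch applies the normalized exponential function to two raw scores. The function name `softmax` and the example scores are assumptions made for illustration; subtracting the maximum score before exponentiation is a common numerical-stability measure and is not stated in the application.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw classification scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores: [target standard action, non-target standard action]
probabilities = softmax([2.0, 0.0])
```

The first entry of `probabilities` is the probability of the target standard action category and the second is the probability of the non-target category.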
[0088] Furthermore, the action recognition classifier was trained using a dataset of labeled single-frame coordinate vectors of physical education actions of primary and secondary school students. The dataset contained 50,000 samples, the batch size was set to 16, the initial learning rate was set to 0.0001, and the learning rate was adjusted using a cosine decay strategy. The total number of training rounds was set to 50, the Adam optimizer was selected, and the weight decay parameter was set to 0.0005. After the first fully connected layer, a Dropout layer with a deactivation probability of 0.5 was set to reduce the risk of overfitting.
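The cosine decay strategy mentioned above can be sketched as follows, assuming the common half-cosine form in which the learning rate falls from its initial value to a floor over the total number of training rounds. The floor `lr_min` is an assumption (the application does not state a final learning rate), as is the function name `cosine_lr`.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 50,
              lr_init: float = 1e-4, lr_min: float = 0.0) -> float:
    """Learning rate for a given epoch under half-cosine decay.

    lr_init and total_epochs follow the embodiment; lr_min is assumed.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(e) for e in range(51)]  # one value per training round
```

Under these assumptions the rate starts at 0.0001, reaches half that value at round 25, and approaches the floor at round 50.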
[0089] S1.5: Determine whether the current frame belongs to the action category in the preset standard action sequence based on the probability distribution. When the probability of the highest category in the probability distribution is greater than the preset category threshold, mark the current frame as a valid action frame.
[0090] Furthermore, the preset category threshold is a scientific judgment standard set based on a large amount of student action test data. Verified in actual campus scenarios, it can effectively distinguish between standard and non-standard action frames. In this embodiment, the category threshold set for chest expansion exercise is 0.85. After obtaining the probability distribution output by the action recognition classifier, the system automatically compares the probability values of the two categories, extracts the highest category probability, and compares this probability value with the preset category threshold. If the highest category probability is greater than 0.85, it indicates that the student's action in the current frame highly matches the standard chest expansion exercise action, and the limb posture and joint position meet the standard requirements. The system automatically marks this frame as a valid action frame. If the highest category probability is less than or equal to 0.85, it indicates that the student's action in the current frame has a significant deviation and does not meet the standard action requirements. The system determines this frame as an invalid action frame and does not include it in the action completion calculation. For example, in the 240-frame chest expansion exercise video stream of this embodiment, frames with standard student actions and postures are accurately marked as valid action frames, while frames with insufficient movement range, bent arms, or tilted bodies are judged as invalid action frames. The judgment result accurately matches the actual action state of the students.
[0091] S1.6: Calculate the percentage of valid action frames within a continuous time window, and normalize this percentage to obtain the action completion degree.
[0092] Furthermore, the continuous time window represents the actual duration for a student to complete a full set of chest expansion exercises, which in this embodiment is 8 seconds, corresponding to 240 frames. This time window completely covers the entire process from start to finish, with no missing time or segment interruptions. The system first counts the total number of all valid action frames within the continuous time window, then divides the total number of valid action frames by the total number of frames within the continuous time window to obtain the percentage of valid action frames. This percentage reflects the proportion of the student's standardized movements throughout the entire exercise process. Subsequently, the percentage of frames is directly mapped to a percentage score from 0 to 100, representing the degree of action completion. A higher degree of action completion indicates more standardized and complete movements; a lower value indicates more serious problems with missing or distorted movements. In this embodiment, in the 240-frame chest expansion exercise video stream completed by the student, 216 frames are valid, representing 90% of the total. After normalization, the final degree of action completion is 90 points, which reflects the quality of the student's chest expansion exercise.
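Steps S1.5 and S1.6 can be sketched together. The sketch assumes that a frame counts as valid only when the target standard action category is the highest-probability category and its probability exceeds the category threshold; the function names `is_valid_frame` and `completion_score` are hypothetical.

```python
from typing import Sequence, Tuple

Probs = Tuple[float, float]  # (p_target, p_non_target) for one frame


def is_valid_frame(probs: Probs, threshold: float = 0.85) -> bool:
    """A frame is valid when the target class wins and exceeds the threshold."""
    p_target, p_other = probs
    return p_target > p_other and p_target > threshold


def completion_score(frame_probs: Sequence[Probs], threshold: float = 0.85) -> float:
    """Share of valid frames in the time window, mapped to a 0-100 score."""
    if not frame_probs:
        raise ValueError("time window contains no frames")
    valid = sum(1 for p in frame_probs if is_valid_frame(p, threshold))
    return 100.0 * valid / len(frame_probs)


# Embodiment figures: 216 valid frames out of 240 give a score of 90.
window = [(0.95, 0.05)] * 216 + [(0.60, 0.40)] * 24
score = completion_score(window)
```

The probability values in `window` are invented for the example; only the 216/240 split comes from the embodiment.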
[0093] S2: Segment the action temporal feature data using the force application frames and posture transition frames as temporal anchor points to obtain standardized action segment data, and at the same time compare the data with the preset standard action template to generate action deviation data.
[0094] Furthermore, the action timing feature data fully records the dynamic changes of the joints in the student's chest expansion exercise. The force exertion frame is the core frame with the largest muscle force exertion and limb movement amplitude in the student's action, and the posture transition frame is the key transition frame from one posture to another. These two types of frames are the core timing nodes of the action and can accurately divide the different execution stages of the action.
[0095] Segmenting the action temporal feature data using the force application frames and posture transition frames as temporal anchor points to obtain standardized action segment data includes:
[0096] S2.1: Calculate the displacement change of each joint point between adjacent frames for the coordinate sequence of the skeletal key points, sum the displacement changes to obtain the inter-frame motion energy value, and generate a motion energy curve;
[0097] Furthermore, the skeletal keypoint coordinate sequence completely records the pixel coordinates of each joint in each frame. The coordinate changes between adjacent frames directly reflect the motion amplitude and speed of the joints. The system traverses the skeletal keypoint coordinate sequence frame by frame, calculating the X-axis and Y-axis displacement changes of the six major joints (shoulder, elbow, wrist, hip, knee, and ankle) between adjacent frames. The displacement change is the absolute difference between the coordinates of the current frame and the coordinates of the previous frame; the larger the value, the greater the motion amplitude of the joint. After the displacement changes of all joints are calculated, they are summed for the current frame, and the sum is the inter-frame motion energy value of that frame. The motion energy value represents the intensity of the student's limb movement in the current frame: the greater the amplitude and speed of the movement, the higher the motion energy value. The motion energy values of all frames are arranged in chronological order to form a continuous motion energy curve. Peaks of the curve correspond to frames with high motion energy values, and troughs correspond to frames with low motion energy values. The fluctuations of the motion energy curve clearly identify the key points of force exertion in the student's chest expansion exercise. In the chest expansion exercise of this embodiment, the motion energy value reaches its peak in the frames where both arms are expanding backward and is at its trough in the frames where both arms are stationary. The motion energy curve thus presents the energy change pattern of the movement.
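The motion energy calculation in S2.1 can be sketched as below. Each frame is assumed to be a flat coordinate vector (x and y for every joint), and the first frame is assigned an energy of 0.0 so that curve indices line up with frame indices; that alignment choice and the function names are assumptions.

```python
from typing import List, Sequence

Frame = Sequence[float]  # flat vector: x, y for each joint


def motion_energy(prev: Frame, curr: Frame) -> float:
    """Sum of absolute X and Y displacements of all joints between two frames."""
    return float(sum(abs(c - p) for p, c in zip(prev, curr)))


def energy_curve(frames: Sequence[Frame]) -> List[float]:
    """Inter-frame motion energy per frame; frame 0 has no predecessor."""
    curve = [0.0]
    for i in range(1, len(frames)):
        curve.append(motion_energy(frames[i - 1], frames[i]))
    return curve


# Toy example with a single joint: it moves by (3, 4) pixels, then rests.
curve = energy_curve([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
```

In the toy example the moving frame has an energy of 7 pixels (3 + 4) and the resting frame has an energy of 0.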
[0098] S2.2: Detect local peak points on the motion energy curve. When the inter-frame motion energy value of a local peak point exceeds the preset force exertion energy threshold, mark the frame index corresponding to that local peak point as a force application frame.
[0099] Further, the specific steps of S2.2 include: performing a full-domain scan of the motion energy curve, detecting all local peak points in the motion energy curve, where a local peak point is defined as a point whose motion energy value is higher than that of the adjacent frames, representing the instantaneous peak of force exertion in the action; comparing the inter-frame motion energy value of each local peak point with a preset force exertion energy threshold, in this embodiment the preset force exertion energy threshold is set to 80 pixels; if the inter-frame motion energy value corresponding to the local peak point is greater than the force exertion energy threshold, it indicates that the frame is the core force exertion frame of the student's action, with a large limb movement amplitude and obvious force exertion, and the system automatically records the frame index corresponding to the peak point and marks it as a force exertion frame; if the inter-frame motion energy value corresponding to the local peak point is less than or equal to the force exertion energy threshold, it indicates that the frame is only a slight movement and does not belong to the core force exertion node, and is not marked. In the chest expansion exercise video stream of this embodiment, the system detected a total of 3 local peak points that meet the force energy threshold requirements, which correspond to the three core force exertion moments of arm extension, chest expansion backward, and arm retraction. These 3 frames were accurately marked as force exertion frames and became the key timing anchor points for action segmentation.
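The peak detection in S2.2 can be sketched as follows, assuming a local peak is a point strictly higher than both of its immediate neighbours and using the 80-pixel threshold of this embodiment. The function name `force_frames` and the sample curve are hypothetical.

```python
from typing import List, Sequence


def force_frames(curve: Sequence[float], threshold: float = 80.0) -> List[int]:
    """Frame indices of local peaks whose motion energy exceeds the threshold."""
    peaks = []
    for i in range(1, len(curve) - 1):
        is_local_peak = curve[i] > curve[i - 1] and curve[i] > curve[i + 1]
        if is_local_peak and curve[i] > threshold:
            peaks.append(i)
    return peaks


# Invented curve: peaks at indices 2 (90) and 5 (120) pass; index 7 (70) does not.
anchors = force_frames([0, 50, 90, 40, 100, 120, 30, 70, 60])
```

A peak that equals the threshold is not marked, matching the "less than or equal to" branch described above.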
[0100] S2.3: Calculate the change in joint angles between adjacent frames, accumulate the changes in joint angles to obtain the posture change rate, and mark the current frame as a posture transition frame when the posture change rate exceeds a preset transition rate threshold; each joint angle is defined by three points, either the shoulder, elbow, and wrist joints or the hip, knee, and ankle joints.
[0101] Furthermore, the specific steps in S2.3 include:
[0102] (1) Clarify the joint combination involved in the joint angle calculation. The upper limb joint angle is composed of the positions of the three joint points: shoulder joint, elbow joint and wrist joint. The lower limb joint angle is composed of the positions of the three joint points: hip joint, knee joint and ankle joint. All joint points involved in the calculation are fixed points in the skeletal key point coordinate sequence.
[0103] (2) Traverse the sequence of key points of the skeleton in chronological order, select the current frame frame by frame starting from the second frame of the video stream, and select the previous frame of the current frame as the comparison reference frame.
[0104] (3) Extract the coordinate information of the three joints of the upper limb in the current frame, calculate the upper limb joint angle value corresponding to the current frame based on the spatial position relationship of the shoulder joint, elbow joint and wrist joint, then extract the coordinate information of the three joints of the lower limb in the current frame, and calculate the lower limb joint angle value corresponding to the current frame based on the spatial position relationship of the hip joint, knee joint and ankle joint.
[0105] (4) Extract the coordinate information of the three joints of the upper limb in the previous frame to obtain the joint angle value of the upper limb in the previous frame, and then extract the coordinate information of the three joints of the lower limb in the previous frame to obtain the joint angle value of the lower limb in the previous frame.
[0106] (5) Subtract the upper limb joint angle value of the previous frame from the upper limb joint angle value of the current frame, and take the absolute value of the difference as the change in upper limb joint angle.
[0107] (6) Subtract the lower limb joint angle value of the previous frame from the lower limb joint angle value of the current frame, and take the absolute value of the difference as the change in lower limb joint angle.
[0108] (7) The change in upper limb joint angle and the change in lower limb joint angle obtained for the current frame are added together. The sum is defined as the posture change rate of the current frame, which characterizes the overall change in limb posture compared to the previous frame.
[0109] (8) Based on the posture change pattern of primary and secondary school students' physical education activities during breaks, the transition rate threshold is set to 15° as the critical standard to distinguish between normal posture fluctuations and significant posture changes.
[0110] (9) Compare the calculated posture change rate of each frame with the transition rate threshold to determine whether the current frame is a posture transition frame. When the posture change rate of the current frame is greater than the transition rate threshold, the current frame is determined to contain a significant change in limb posture; it is marked as a posture transition frame and its frame index is recorded. When the posture change rate of the current frame is less than or equal to the transition rate threshold, the current frame is determined to contain only a minor posture adjustment and is not marked.
[0111] (10) After traversing all adjacent frames, collect all marked posture transition frames and sort them in ascending order of frame index to form a continuous and ordered sequence of posture transition frames.
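Steps (1) to (10) can be sketched as below. The sketch assumes the angle is measured at the middle joint (elbow or knee) between the two limb segments, which the application does not state explicitly, and represents each frame as a dictionary from joint name to pixel coordinates. The names `joint_angle`, `posture_change_rate`, and `transition_frames` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]  # joint name -> (x, y) pixel coordinates


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees (0 to 180) at vertex b formed by points a, b, c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def limb_angles(pose: Pose) -> Tuple[float, float]:
    upper = joint_angle(pose["shoulder"], pose["elbow"], pose["wrist"])
    lower = joint_angle(pose["hip"], pose["knee"], pose["ankle"])
    return upper, lower


def posture_change_rate(prev: Pose, curr: Pose) -> float:
    """Sum of absolute upper-limb and lower-limb angle changes, in degrees."""
    prev_upper, prev_lower = limb_angles(prev)
    curr_upper, curr_lower = limb_angles(curr)
    return abs(curr_upper - prev_upper) + abs(curr_lower - prev_lower)


def transition_frames(poses: Sequence[Pose], threshold: float = 15.0) -> List[int]:
    """Ascending frame indices whose posture change rate exceeds the threshold."""
    return [
        i for i in range(1, len(poses))
        if posture_change_rate(poses[i - 1], poses[i]) > threshold
    ]
```

Because the traversal starts at the second frame and proceeds in index order, the returned list is already sorted in ascending order of frame index.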
[0112] S2.4: Using the force application frames and the posture transition frames as segmentation boundaries, the action temporal feature data is segmented into multiple continuous action segments. Each action segment contains the coordinate vectors of all frames from its start boundary frame to its end boundary frame. Time normalization is performed on each action segment, linearly interpolating it to a fixed number of frames to obtain standardized action segment data.
[0113] Furthermore, the force application frames and posture transition frames together constitute the temporal segmentation boundary of the action. Following the frame index order, the system uses these temporal anchor points as boundaries to segment the complete action temporal feature data into multiple continuous and non-overlapping action segments. In this embodiment, the force application frames and posture transition frames segment the chest expansion movement into three core action segments: the arm extension phase, the backward chest expansion phase, and the arm retraction phase. Each action segment contains a complete action execution process, with no missing frames or segment errors. Due to differences in the movement speed of different students, the number of frames in the segmented action segments varies. To ensure the consistency of multimodal matching, the system performs time normalization on each action segment. Time normalization uses a linear interpolation algorithm to uniformly adjust the number of frames in each action segment to a preset fixed number. In this embodiment, the fixed number of frames is set to 60 frames. During the interpolation process, the original posture features and temporal patterns of the action are preserved, without changing the core features of the action. After time normalization, the length and format of all action segments are completely unified, and the final standardized action segment data has a unified feature dimension and temporal structure.
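The segmentation and time normalization in S2.4 can be sketched as follows. To keep segments non-overlapping, the sketch assumes each boundary frame starts a new segment (half-open ranges), which is one possible reading of the boundary handling; the function names `split_segments` and `resample` are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def split_segments(frames: Sequence[Frame], anchors: Sequence[int]) -> List[List[Frame]]:
    """Cut the frame sequence at anchor indices into non-overlapping segments."""
    inner = {a for a in anchors if 0 < a < len(frames)}
    bounds = sorted({0, len(frames)} | inner)
    return [list(frames[s:e]) for s, e in zip(bounds, bounds[1:])]


def resample(segment: Sequence[Frame], n: int = 60) -> List[List[float]]:
    """Linearly interpolate a segment to exactly n frames."""
    if not segment:
        raise ValueError("segment is empty")
    if len(segment) == 1 or n == 1:
        return [list(segment[0]) for _ in range(n)]
    out = []
    for k in range(n):
        t = k * (len(segment) - 1) / (n - 1)  # fractional source index
        i = min(int(t), len(segment) - 2)     # left neighbour
        frac = t - i
        out.append([a + frac * (b - a) for a, b in zip(segment[i], segment[i + 1])])
    return out


# Toy example: 10 one-dimensional frames, anchors at frames 3 and 7.
frames = [[float(i)] for i in range(10)]
standardized = [resample(seg, 60) for seg in split_segments(frames, [3, 7])]
```

Every entry of `standardized` has 60 frames regardless of the original segment length, and the first and last frames of each segment are preserved.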
[0114] The comparison with the preset standard action template generates action deviation data, including:
[0115] S2.5: Obtain a preset standard motion template; the standard motion template includes a standard skeletal key point coordinate sequence and the allowable deviation range for each key point;
[0116] Furthermore, preset standard movement templates are stored in the AI exercise machine's local database, with separate templates set for different recess physical education movements. In this embodiment, a template specifically for the standard chest expansion exercise is used. This template was recorded by professional physical education teachers according to the primary and secondary school physical education curriculum standards, ensuring standardized movements and postures. It has undergone multiple rounds of verification and optimization, possessing both authority and universality. The standard chest expansion exercise template includes two core components: first, a standard skeletal key point coordinate sequence, which records the standard coordinates of the six major joints in each frame of the standard chest expansion exercise, serving as a benchmark reference for movement comparison; second, the allowable deviation range for each key point, set according to the physical development characteristics of primary and secondary school students, allowing reasonable limb errors and avoiding misjudgments due to individual body shape differences. For example, the allowable deviation range for the shoulder joint is 5 pixels, and for the elbow joint, it is 6 pixels, ensuring both movement standardization and consideration of individual student differences. When performing movement comparison, the system directly retrieves the standard movement template from the local database, eliminating the need for cloud downloads and ensuring a real-time and efficient comparison process.
[0117] S2.6: Calculate the Euclidean distance between the coordinate vector of each joint point in each frame of the standardized action segment data and the coordinate vector of the same joint point in the corresponding frame of the standard action template, to obtain the frame-level deviation value of each joint point. The Euclidean distance formula is prior art in this field and is not an inventive solution of this application, so it is not described in detail here.
[0118] Furthermore, the system performs frame-by-frame mapping between the student's standardized movement segment data and the standard movement template, ensuring that movement data at the same time point are compared. For each frame, the system calculates the degree of difference in the coordinate vectors of the same joint point (shoulder, elbow, wrist, hip, knee, and ankle) between the student's movement and the standard movement, obtaining a frame-level deviation value for each joint point. The frame-level deviation value directly reflects the degree of deviation of a student's individual joint point from the standard position; the smaller the deviation value, the more standard the joint point position; the larger the deviation value, the more obvious the deviation. In this embodiment, the deviation value of the wrist joint is relatively small during the student's chest expansion exercise, while the deviation value of the shoulder joint is slightly larger in some frames. The system accurately records the deviation value for each frame and each joint.
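The frame-level deviation in S2.6 can be sketched as a per-joint Euclidean distance in pixels. Frames are assumed to be dictionaries from joint name to pixel coordinates, and the function name `frame_deviation` and the sample coordinates are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]  # joint name -> (x, y) pixel coordinates


def frame_deviation(student: Pose, template: Pose) -> Dict[str, float]:
    """Euclidean distance between each student joint and the template joint."""
    return {
        joint: math.hypot(student[joint][0] - template[joint][0],
                          student[joint][1] - template[joint][1])
        for joint in template
    }


# Invented coordinates: the shoulder is 5 pixels off, the wrist matches exactly.
deviation = frame_deviation(
    {"shoulder": (103.0, 204.0), "wrist": (150.0, 90.0)},
    {"shoulder": (100.0, 200.0), "wrist": (150.0, 90.0)},
)
```

Applying `frame_deviation` to each pair of corresponding frames gives one deviation value per frame and per joint.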
[0119] S2.7: The frame-level deviation values of all joints are weighted and summed, where different joints have different weight coefficients preset according to their influence on motion quality, to obtain the comprehensive deviation value of the frame;
[0120] Furthermore, different joints have varying degrees of influence on the quality of the chest expansion exercise. Therefore, the system presets differentiated weight coefficients for each joint. These weight coefficients are scientifically allocated based on the characteristics of the movement; the joint with the greater impact on the movement quality receives a higher weight coefficient. In this embodiment, the chest expansion exercise is mainly performed by the upper limbs. Therefore, the shoulder joint has the greatest impact on the overall posture control, with a preset weight coefficient of 0.2. The elbow joint has the second greatest impact, with a preset weight coefficient of 0.18. The wrist joint has a preset weight coefficient of 0.15. The hip joint, as the core connecting the trunk and lower limbs, has a preset weight coefficient of 0.2. The knee joint has a preset weight coefficient of 0.17, and the ankle joint has a preset weight coefficient of 0.1. The weight coefficients of all joints are summed to one. The system multiplies the frame-level deviation value of all joints in the current frame by their corresponding weight coefficients, and then sums the weighted values to obtain the comprehensive deviation value for that frame. The comprehensive deviation value integrates the deviation information of all joints, enabling a comprehensive and objective evaluation of the overall deviation degree of the student's movement in the current frame. This avoids the one-sided influence of deviations from a single joint, accurately reflecting the overall standardization of the movement.
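The weighted summation in S2.7 can be sketched using the weight coefficients given in this embodiment. The dictionary layout, the function name `comprehensive_deviation`, and the example deviation values are assumptions.

```python
from typing import Dict

# Weight coefficients from this embodiment; they sum to one.
WEIGHTS: Dict[str, float] = {
    "shoulder": 0.20,
    "elbow": 0.18,
    "wrist": 0.15,
    "hip": 0.20,
    "knee": 0.17,
    "ankle": 0.10,
}


def comprehensive_deviation(joint_deviation: Dict[str, float]) -> float:
    """Weighted sum of the frame-level deviation values of all joints."""
    return sum(WEIGHTS[joint] * joint_deviation[joint] for joint in WEIGHTS)


# Invented frame-level deviations in pixels.
value = comprehensive_deviation(
    {"shoulder": 12.0, "elbow": 6.0, "wrist": 3.0, "hip": 4.0, "knee": 2.0, "ankle": 1.0}
)
```

For the invented values, the weighted terms are 2.4, 1.08, 0.45, 0.8, 0.34, and 0.1, giving a comprehensive deviation of about 5.17 pixels.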
[0121] S2.8: Mark frames whose comprehensive deviation value exceeds a preset deviation threshold as deviation frames, count the continuous occurrence duration and peak deviation amplitude of the deviation frames, and encapsulate the continuous occurrence duration and peak deviation amplitude into motion deviation data.
[0122] Furthermore, the preset deviation threshold is a critical value for judging whether an action exceeds the standard. Based on standard action requirements and individual student differences, in this embodiment, the deviation threshold is set to 8 pixel units. The system compares the comprehensive deviation value of each frame with the preset deviation threshold. If the comprehensive deviation value is greater than the deviation threshold, it indicates that the student's action in that frame has a significant non-standard problem, and the system marks it as a deviation frame. If it is less than or equal to the deviation threshold, it indicates that the action meets the standard and is not marked. After marking, the system calculates the continuous occurrence duration of deviation frames, in milliseconds, reflecting the persistence of the action deviation. Simultaneously, it finds the maximum comprehensive deviation value among all deviation frames, which is the peak deviation amplitude, reflecting the severity of the action deviation. Finally, the continuous occurrence duration and peak deviation amplitude of the deviation frames are encapsulated to form complete action deviation data, which clearly records the location, duration, and severity of the student's action deviation.
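Step S2.8 can be sketched as follows. The sketch assumes that the continuous occurrence duration refers to the longest run of consecutive deviation frames and that the video runs at 30 frames per second (240 frames in 8 seconds, as in this embodiment). The function name `deviation_data` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Sequence, Union


def deviation_data(values: Sequence[float], threshold: float = 8.0,
                   fps: float = 30.0) -> Dict[str, Union[float, List[int]]]:
    """Package deviation frames, longest continuous duration (ms), and peak."""
    flagged = [v > threshold for v in values]
    longest = run = 0
    for is_deviation in flagged:
        run = run + 1 if is_deviation else 0
        longest = max(longest, run)
    peak = max((v for v, f in zip(values, flagged) if f), default=0.0)
    return {
        "frames": [i for i, f in enumerate(flagged) if f],
        "duration_ms": longest * 1000.0 / fps,
        "peak": peak,
    }


# Invented comprehensive deviation values for seven frames.
result = deviation_data([5.0, 9.0, 12.0, 10.0, 7.0, 8.0, 9.5])
```

In this example frames 1 to 3 form the longest run (three frames, 100 ms at 30 frames per second), the peak is 12 pixels, and the frame equal to the threshold is not marked.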
[0123] S3: Based on standardized action fragment data, construct a linkage and matching model between action fragments and subject knowledge points;
[0124] In this embodiment, the specific process is to map standardized action segment data into the embedding space of a pre-built multimodal temporal alignment algorithm to establish a linkage matching model between action segments and subject knowledge points.
[0125] Furthermore, the multimodal temporal alignment algorithm is the core technology for linking sports movements with subject knowledge points. It transforms visual movement data and text-based knowledge point data into feature vectors of the same dimension so that they can be matched across modalities. The embedding space is a high-dimensional feature space constructed by the multimodal temporal alignment algorithm, within which data from different modalities are associated through similarity calculation. The system first inputs standardized movement segment data into the multimodal temporal alignment algorithm. Through feature extraction and mapping, the movement data is transformed into movement embedding vectors within the embedding space. Simultaneously, the preset subject knowledge point text is transformed into text embedding vectors. Through contrastive learning training, the vector features are optimized so that related movement and knowledge point vectors lie close together within the embedding space. Finally, a linkage matching model between movement segments and subject knowledge points is established based on the vector mapping relationship. The model has a built-in matching mapping table that stores the binding relationship between movements and knowledge points, so that a recognized movement triggers the push of its bound knowledge point. In this embodiment, after the standardized movement segment data of the chest expansion exercise is mapped to the embedding space, it is bound to the figure symmetry knowledge point of elementary school mathematics, laying the foundation for motion-triggered answering.
[0126] The process of establishing the linkage matching model includes:
[0127] S3.1: Construct a multimodal temporal alignment algorithm; the multimodal temporal alignment algorithm includes an action encoder and a text encoder, the action encoder adopts a neural network structure based on temporal convolutional network and self-attention mechanism, and the text encoder adopts a pre-trained language model structure based on bidirectional encoder representation;
[0128] Furthermore, the input layer of the motion encoder receives a fixed number of frame sequences of skeletal keypoint coordinates after time-normalized processing. Here, the fixed number of frames is set to 60 frames. Each frame contains coordinate information for six joints: shoulder, elbow, wrist, hip, knee, and ankle. The input dimension is set to 60×6×2, corresponding to 60 frames, six joints, and two-dimensional coordinate values. The core of the motion encoder is a temporal convolutional network, which consists of four consecutive convolutional layers. The first convolutional layer has 64 kernels, a kernel size of 3, a stride of 1, and padding of 1. The second convolutional layer has 128 kernels, a kernel size of 3, a stride of 1, and padding of 1. The third convolutional layer has 256 kernels, a kernel size of 3, a stride of 1, and padding of 1. The fourth convolutional layer has 512 kernels, a kernel size of 3, a stride of 1, and padding of 1. Each convolutional layer is followed by a batch normalization layer and a linear activation function layer to extract local temporal features of action segments. The temporal convolutional network is then connected to a self-attention mechanism module, which is configured as a single-head attention structure. The dimensions of the query vector, key vector, and value vector are all set to 512, and the number of attention heads is set to 1. The dependencies between different frames are calculated through a fully connected layer to capture the long-term temporal correlation features of action segments. A fully connected layer is set after the self-attention mechanism module. The input dimension of this fully connected layer is 512, and the output dimension of the fully connected layer is set to 1024, finally outputting the action embedding vector.
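The layer configuration above can be checked with the standard output-length formula for a one-dimensional convolution. The following sketch traces the feature shape through the four convolutional layers. Flattening the six two-dimensional joint coordinates of each frame into 12 input channels is an assumption, since the embodiment only states the 60×6×2 input dimension; the function and variable names are likewise hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride, padding):
    """Standard output-length formula for a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


FRAMES, JOINTS, COORDS = 60, 6, 2
# Assumption: the six (x, y) pairs of each frame are flattened into 12 input channels.
in_channels = JOINTS * COORDS

# (output channels, kernel size, stride, padding) as listed in this embodiment.
conv_layers = [
    (64, 3, 1, 1),
    (128, 3, 1, 1),
    (256, 3, 1, 1),
    (512, 3, 1, 1),
]

length = FRAMES
shapes = [(in_channels, length)]
for out_channels, kernel, stride, padding in conv_layers:
    length = conv1d_output_length(length, kernel, stride, padding)
    shapes.append((out_channels, length))

for channels, frames in shapes:
    print(channels, frames)
```

With a kernel size of 3, a stride of 1, and padding of 1, the sequence length stays at 60 frames after every layer, so the self-attention module receives one 512-dimensional feature per frame, which matches the 512-dimensional query, key, and value vectors.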
[0129] Furthermore, the text encoder employs a pre-trained language model based on bidirectional encoder representation, using a basic model structure. The hidden layer dimension of the language model is set to 768, the number of attention heads to 12, the number of encoder layers to six, the intermediate layer dimension of the feedforward network to 3072, and the maximum sequence length to 128 characters to adapt to the input length of subject knowledge point text. The input to the text encoder is the subject knowledge point text after word segmentation. The input data is first converted into semantic feature vectors through an embedding layer, with the embedding layer dimension consistent with the model's hidden layer dimension, set to 768. Subsequently, it passes through a six-layer encoder structure, each encoder layer containing a multi-head self-attention layer and a feedforward network layer. Residual connections and layer normalization are used between layers. Finally, the model's pooling layer outputs a fixed-dimensional text embedding vector, with the output dimension consistent with the action encoder's output dimension, set to 1024.
[0130] Furthermore, the motion embedding vectors output by the motion encoder and the text embedding vectors output by the text encoder are placed in the same embedding space to ensure that the vector dimensions of the two modal data are consistent, both being 1024-dimensional. The integrated multimodal temporal alignment algorithm has the ability to process motion data and text data simultaneously, and can convert the temporal features of sports motion segments and the semantic features of subject knowledge points into embedding vectors of the same dimension. All hierarchical parameters and structural configurations are fixed according to preset values and do not require dynamic adjustment, ensuring the stability of the algorithm and the consistency of feature extraction.
[0131] S3.2: Input the standardized motion fragment data into the motion encoder. The input layer of the motion encoder receives a fixed number of skeletal keypoint coordinate sequences. After passing through multiple convolutional layers of a temporal convolutional network, local temporal features are extracted. Then, after passing through the fully connected layer of the self-attention mechanism, the inter-frame dependencies are calculated, and the motion embedding vector is output.
[0132] Furthermore, the specific steps in S3.2 include:
[0133] (1) The standardized motion segment data after time normalization is sent to the motion encoder. The input layer of the motion encoder receives a sequence of skeletal key point coordinates with a pre-set length of 60 frames. Each frame contains two-dimensional coordinate information of six joints: shoulder, elbow, wrist, hip, knee and ankle. The input layer organizes the data into a feature matrix of uniform dimension and sends it into the temporal convolutional network.
[0134] (2) The first convolutional layer of the temporal convolutional network completes the initial local action feature extraction. After outputting the feature map, it is connected to the batch normalization layer and the linear activation function layer to complete the feature optimization. The second convolutional layer further extracts more detailed action temporal change features based on the features of the first layer. It is also connected to the batch normalization layer and the linear activation function layer. The third convolutional layer focuses on the temporal features corresponding to the key poses in the action segment. After outputting, it is connected to the batch normalization layer and the linear activation function layer. The fourth convolutional layer completes the final high-level local temporal feature extraction. After outputting, it is also connected to the batch normalization layer and the linear activation function layer. After the progressive extraction of four convolutional layers, a high-dimensional feature map containing complete local temporal information is obtained.
[0135] (3) The high-dimensional feature map output by the temporal convolutional network is fed into the self-attention mechanism module. First, the query vector, key vector and value vector are generated by three independent fully connected layers respectively. The self-attention mechanism calculates the association weight between different frames based on the matching degree between the query vector and the key vector. Then, the weight and value vector are fused to obtain the attention feature. This process can completely capture the long-term dependency relationship between different frames in the action segment. After the output of the self-attention mechanism, a fully connected layer is connected to map the attention feature into the feature representation.
[0136] (4) After dimensional mapping and feature integration by the fully connected layer, the action encoder outputs the final action embedding vector, which is a fixed-dimensional vector of 1024 dimensions, fully carrying the temporal and pose features of the standardized action segment.
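Step (3) above can be illustrated by a small single-head attention sketch. It assumes the commonly used scaled dot-product form with softmax normalization, which the embodiment does not name explicitly, and it takes the query, key, and value vectors as already produced by the three fully connected layers. The function names and the toy two-dimensional vectors are hypothetical; the embodiment uses 512 dimensions and 60 frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(queries, keys, values):
    """Scaled dot-product attention with one head over a sequence of frames."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Matching degree between this frame's query and every frame's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Fuse the association weights with the value vectors.
        fused = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(fused)
    return outputs


# Toy example: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = single_head_attention(frames, frames, frames)
print(len(attended), len(attended[0]))
```

Each output row is a weighted average of all value vectors, which is how a frame at the start of the segment can draw on information from frames at the end.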
[0137] S3.3: Input the preset subject knowledge point text into the text encoder, extract semantic features through multiple transformer layers of the pre-trained language model based on bidirectional encoder representation, and output the text embedding vector;
[0138] Furthermore, the preset subject knowledge point text is standardized content compiled according to the primary and secondary school curriculum standards. In this embodiment, it is the symmetry knowledge point of elementary school mathematics figures. The text content is concise, accurate, and clearly defines the core concepts and key content of the knowledge point. After receiving the subject knowledge point text, the text encoder processes it through multiple transformer layers of a pre-trained language model based on bidirectional encoder representation. The transformer layer can extract the semantic features of the text from the bidirectional context, understand the connotation, extension, and logical relationship of the knowledge point, and transform the text into an abstract semantic feature representation. After multi-layer feature extraction and optimization, the text encoder outputs a text embedding vector with the same dimension as the action embedding vector. The text embedding vector accurately represents the semantic features of the knowledge point, ensuring comparability with the action embedding vector. In this embodiment, the text of the symmetry knowledge point of the figure is encoded to generate the corresponding text embedding vector.
[0139] Furthermore, the specific steps of S3.3 include:
[0140] (1) The pre-organized subject knowledge point text is fed into the text encoder. The text encoder adopts a pre-trained language model based on bidirectional encoder representation. The pre-trained language model first performs word segmentation on the input subject knowledge point text. The maximum sequence length after word segmentation is set to 128 words to adapt to the text length requirements of primary and secondary school subject knowledge points. After word segmentation, the text words are converted into word embedding vectors. The dimension of the embedding layer is set to 768, which is consistent with the overall hidden layer dimension of the model. At the same time, position encoding information is added to each word to preserve the word order features of the text. The dimension of the position encoding is also set to 768 to ensure that it matches the dimension of the word embedding vector.
[0141] (2) The embedding vector with position encoding is fed into multiple transformer layers of the model for semantic feature extraction. The number of transformer layers in this pre-trained language model is set to 6. Each transformer layer contains a multi-head self-attention module and a feedforward network module, and residual connections and layer normalization are used between layers. The number of attention heads in each multi-head self-attention module is set to 12 and the dimension of each head is set to 64, giving a total dimension of 768 (12 × 64), which is consistent with the dimension of the hidden layer. The multi-head self-attention module captures the bidirectional semantic dependencies between words in the knowledge point text. The feedforward network module of each transformer layer contains two fully connected layers. The dimension of the first fully connected layer is set to 3072, and the second fully connected layer maps back to 768. The activation function adopts the linear activation method to extract the deep semantic features of the text.
[0142] (3) The six-layer transformer layer performs progressive feature extraction on the embedded vector in sequence. Each layer further mines the conceptual connotation and semantic association of the knowledge point text based on the output features of the previous layer. After processing by all six layers of transformer layer, a high-dimensional feature sequence carrying complete text semantic information is obtained.
[0143] (4) The high-dimensional feature sequence output by the transformer layer is pooled and the feature representation of the whole text is extracted by sentence-level pooling. The pooled features are then mapped through a fully connected layer. The output dimension of the fully connected layer is set to 1024, which is consistent with the dimension of the output vector of the action encoder. After dimension mapping and feature integration, the text encoder finally outputs a fixed-dimensional text embedding vector. This vector can be directly used to calculate the cosine similarity with the action embedding vector output by the action encoder in the same space.
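The text encoder parameters stated in paragraph [0129] can be collected in a small configuration object, together with one possible form of the sentence-level pooling in step (4). Mean pooling is an assumption, since the embodiment does not specify the pooling method, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextEncoderConfig:
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_layers: int = 6
    intermediate_size: int = 3072
    max_sequence_length: int = 128
    output_size: int = 1024  # matches the action embedding dimension

    @property
    def head_size(self):
        """Per-head dimension; the hidden size must split evenly across heads."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden size must be divisible by the number of heads")
        return self.hidden_size // self.num_attention_heads


def mean_pool(token_features):
    """Sentence-level pooling by averaging the token feature vectors."""
    count = len(token_features)
    return [sum(column) / count for column in zip(*token_features)]


config = TextEncoderConfig()
print(config.head_size)  # 64
print(mean_pool([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

The pooled 768-dimensional feature would then pass through the fully connected layer with an output dimension of 1024, so that the text embedding vector and the action embedding vector can be compared directly.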
[0144] S3.4: A contrastive learning-based training method is adopted, using standardized action fragment data and corresponding subject knowledge point text as positive sample pairs, and standardized action fragment data and random subject knowledge point text as negative sample pairs. The parameters of the action encoder and the text encoder are optimized so that the cosine similarity between the action embedding vector and the text embedding vector of the positive sample pair is greater than a preset similarity threshold, and the cosine similarity between the negative sample pairs is less than a preset dissimilarity threshold. The action embedding vector output by the trained action encoder is used as the mapping representation of the standardized action fragment data in the embedding space. The cosine similarity calculation formula is the prior art in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0145] Furthermore, the specific steps in S3.4 include:
[0146] (1) Match pre-bound subject knowledge point texts for each standardized action segment data to form positive sample pairs. At the same time, randomly select other subject knowledge point texts that are not bound to the same standardized action segment data to form negative sample pairs. During the training process, the number of samples in each batch is set to 32, and each batch contains 16 positive sample pairs and 16 negative sample pairs to ensure a balanced distribution of positive and negative samples.
[0147] (2) The completed sample pairs are fed into the action encoder and the text encoder for forward inference. First, the standardized action fragment data in the positive sample pair is input into the action encoder to obtain the corresponding 1024-dimensional action embedding vector. Then, the corresponding subject knowledge point text in the positive sample pair is input into the text encoder to obtain the corresponding 1024-dimensional text embedding vector. The same processing method is used to process the negative sample pairs to obtain the action embedding vector and text embedding vector in the negative sample pairs respectively.
[0148] (3) Calculate the cosine similarity of the embedding vectors of all sample pairs. The cosine similarity calculation adopts the standard vector similarity calculation method. First, calculate the dot product of the two vectors, and then divide by the product of the magnitudes of the two vectors to obtain the normalized similarity value. The positive cosine similarity value is calculated for positive sample pairs, and the negative cosine similarity value is calculated for negative sample pairs. Before training, the positive similarity threshold is set to 0.8 and the negative dissimilarity threshold is set to 0.2 to constrain the training target.
[0149] (4) Construct a loss function for contrastive learning for parameter optimization. In the form of contrastive loss function, the similarity values of positive sample pairs and negative sample pairs are substituted into the calculation. The temperature parameter of contrastive loss function is set to 0.07 to control the smoothness of similarity distribution and ensure stable convergence of model training. After each batch of samples is calculated, the average loss value of the current batch is obtained.
[0150] (5) The Adam optimizer is used for gradient descent update. The learning rate of the Adam optimizer is set to 0.0001, the first moment decay coefficient is set to 0.9, the second moment decay coefficient is set to 0.999, and the weight decay is set to 0.0001. In each iteration, the gradient of the parameters related to the self-attention mechanism of the action encoder and the temporal convolutional network is calculated based on the loss value. At the same time, the gradient of the parameters related to each transformer layer and embedding layer of the pre-trained language model in the text encoder is calculated. All trainable parameters of the two encoders are updated synchronously based on the gradient values. The update direction aims to reduce the loss value.
[0151] (6) During the iterative training process, the cosine similarity values of positive sample pairs and negative sample pairs are continuously monitored. When the cosine similarity of all positive sample pairs in 50 consecutive batches is greater than 0.8 and the cosine similarity of all negative sample pairs is less than 0.2, the model is determined to have reached the convergence state and the training process is stopped.
[0152] (7) Retain the parameters of the action encoder and text encoder after training, put the trained action encoder into actual use, and use the action embedding vector output by the action encoder after the input standardized action fragment data is processed as the final mapping representation of the standardized action fragment data in the unified embedding space.
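Steps (3), (4), and (6) above can be sketched as follows. The cosine similarity follows the dot product and magnitude description of step (3). The loss shown is an InfoNCE-style form chosen as an assumption, because the embodiment states only that a contrastive loss with a temperature parameter of 0.07 is used. All function names are hypothetical.

```python
import math

TEMPERATURE = 0.07
POSITIVE_THRESHOLD = 0.8
NEGATIVE_THRESHOLD = 0.2


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(positive_sims, negative_sims, temperature=TEMPERATURE):
    """Assumed InfoNCE-style loss: each positive pair is contrasted with all negatives."""
    negative_total = sum(math.exp(s / temperature) for s in negative_sims)
    losses = []
    for s in positive_sims:
        positive_term = math.exp(s / temperature)
        losses.append(-math.log(positive_term / (positive_term + negative_total)))
    return sum(losses) / len(losses)  # average loss of the current batch


def batch_meets_targets(positive_sims, negative_sims):
    """True when every pair in the batch satisfies its similarity threshold."""
    return (all(s > POSITIVE_THRESHOLD for s in positive_sims)
            and all(s < NEGATIVE_THRESHOLD for s in negative_sims))


def converged(batch_history, required=50):
    """True when the most recent `required` batches all met the targets."""
    recent = batch_history[-required:]
    return len(recent) == required and all(recent)
```

Under this assumed loss, a batch whose positive pairs are well separated from its negative pairs yields a value close to zero, while overlapping similarities yield a larger value, which is the direction the optimizer in step (5) works against.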
[0153] S3.5: Establish a linkage matching model based on mapping representation; the linkage matching model includes a matching mapping table, which stores the association between action embedding vectors and text embedding vectors. When the cosine similarity between the action embedding vector and the text embedding vector of the standardized action fragment data exceeds a preset effective matching threshold, the binding relationship between the action fragment and the subject knowledge point is recorded in the matching mapping table.
[0154] Furthermore, the specific steps of S3.5 include:
[0155] (1) The action embedding vector and text embedding vector trained by the multimodal temporal alignment algorithm are used as the processing data. The action embedding vector corresponds to the mapping representation of standardized action fragment data in the embedding space, and the text embedding vector corresponds to the semantic feature representation of the preset subject knowledge points. The dimensions of the two types of vectors are kept consistent.
[0156] (2) Based on the unified dimension of action embedding vector and text embedding vector, a linkage matching model for realizing the association between sports actions and subject knowledge points is established. The linkage matching model uses a persistent matching mapping table as the core data structure to record and save the corresponding association between action embedding vector and text embedding vector.
[0157] (3) Initialize the matching mapping table on which the linkage matching model depends, so that the matching mapping table is in a blank and usable initial state. The matching mapping table is used to store complete association records containing action fragment identifiers, action embedding vectors, subject knowledge point identifiers, text embedding vectors and similarity values.
[0158] (4) Determine the vector similarity calculation method used in the linkage matching model. In this embodiment, cosine similarity is used as the calculation basis for measuring the correlation between action embedding vector and text embedding vector. Standardized similarity calculation is only performed on two types of embedding vectors with the same dimension.
[0159] (5) Based on the matching accuracy requirements of the interdisciplinary interactive scenario of campus sports, a fixed effective matching threshold is set for the linkage matching model. In this embodiment, the preset effective matching threshold is set to 0.85 as the critical standard for judging whether the action and knowledge point are effectively bound.
[0160] (6) Read the action embedding vector corresponding to the standardized action fragment data to be matched from the output of the multimodal temporal alignment algorithm that has been trained, and read the text embedding vector corresponding to each subject knowledge point in the text encoding result set of the preset subject knowledge points in the order of storage of the knowledge points.
[0161] (7) Perform cosine similarity calculation between the currently read action embedding vector and each sequentially read text embedding vector to obtain the similarity value corresponding to each pair of data. Cosine similarity lies in the range of minus one to one, and values closer to one indicate a stronger association.
[0162] (8) Compare each calculated similarity value with the preset effective matching threshold of 0.85 one by one, determine whether the current similarity value exceeds the preset effective matching threshold, and retain the action embedding vector and text embedding vector combination with similarity value exceeding 0.85. The combination is determined to be a matching combination with effective association, and the corresponding standardized action fragment data and subject knowledge points form a binding correspondence.
[0163] (9) Assign a unique action segment identifier and subject knowledge point identifier to each group of valid matching combinations that have been judged, and integrate the action segment identifier, the corresponding action embedding vector, the subject knowledge point identifier, the corresponding text embedding vector, and the calculated similarity value into a complete valid binding record.
[0164] (10) Write the generated valid binding records into the matching mapping table of the linkage matching model in order to complete the formal recording of the binding relationship between the action and the knowledge point;
[0165] (11) Iterate through all the motion embedding vectors corresponding to all standardized motion fragment data and all the text embedding vectors corresponding to all subject knowledge points, repeating the process until all motions and knowledge points are matched and recorded.
[0166] (12) After completing all matching and recording, solidify the training parameters used in the construction phase of the linkage matching model. The linkage matching model is trained using a labeled sports action and subject knowledge point pairing dataset. The number of samples is set to 50,000, the training batch size is set to 16, the initial learning rate is set to 0.0001, the learning rate is adjusted by the cosine decay strategy, the total number of training rounds is set to 50, the Adam optimizer is selected, the weight decay parameter is set to 0.0005, and a Dropout layer with a deactivation probability of 0.5 is set before the fully connected layer to avoid overfitting of the linkage matching model and ensure the stability of the linkage matching model in actual use.
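The traversal in steps (6) to (11) can be sketched as a nested loop that writes a binding record whenever the similarity exceeds the effective matching threshold of 0.85. The record field names, the identifiers, and the toy two-dimensional vectors are hypothetical; the embodiment uses 1024-dimensional embedding vectors.

```python
import math

EFFECTIVE_MATCH_THRESHOLD = 0.85  # from this embodiment


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_mapping_table(action_embeddings, text_embeddings,
                        threshold=EFFECTIVE_MATCH_THRESHOLD):
    """Both arguments map an identifier to its embedding vector."""
    table = []
    for action_id, action_vec in action_embeddings.items():
        for point_id, text_vec in text_embeddings.items():
            similarity = cosine_similarity(action_vec, text_vec)
            if similarity > threshold:  # only effective matches are recorded
                table.append({
                    "action_segment_id": action_id,
                    "action_embedding": action_vec,
                    "knowledge_point_id": point_id,
                    "text_embedding": text_vec,
                    "similarity": similarity,
                })
    return table


# Hypothetical toy embeddings.
actions = {"chest_expansion": [1.0, 0.1]}
points = {"figure_symmetry": [1.0, 0.0], "fractions": [0.0, 1.0]}
mapping_table = build_mapping_table(actions, points)
print([(r["action_segment_id"], r["knowledge_point_id"]) for r in mapping_table])
```

In this toy example only the pair of chest expansion and figure symmetry exceeds the threshold, so the table holds a single binding record.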
[0167] S4: When a student completes the corresponding physical activity and triggers the linkage matching model to generate a valid action matching result, the AI exercise machine automatically pushes the subject-related question-and-answer task bound to the corresponding physical activity, forming a motion-triggered question-and-answer instruction.
[0168] The process of generating the motion-triggered answer instruction includes:
[0169] S4.1: Receive the standardized action segment data in real time, calculate the action embedding vector of the current action segment through the trained action encoder, and perform cosine similarity calculation with each action embedding vector stored in the matching mapping table of the linkage matching model to find the matching record corresponding to the maximum cosine similarity.
[0170] Furthermore, the system receives standardized movement segment data generated by students' physical exercises in real time. This data is transmitted to the trained movement encoder, which generates the movement embedding vector for the current movement with a processing delay of no more than 15 milliseconds to ensure real-time triggering. Subsequently, the system calculates the cosine similarity between the current movement embedding vector and every movement embedding vector stored in the matching mapping table of the linkage matching model. The similarity value reflects the degree of matching between the current movement and the stored movements. By traversing all similarity values, the system finds the highest similarity value and locates the matching record corresponding to that value. This matching record is the movement-knowledge point binding entry that best matches the current student's movement. In this embodiment, after the student completes the chest expansion exercise, the system locates the matching record with the highest similarity value, which binds the chest expansion exercise to the figure symmetry knowledge point, thereby locking the target binding relationship.
[0171] S4.2: When the maximum cosine similarity is greater than the preset effective matching threshold, an effective action matching result is generated, and the subject knowledge point identifier bound to the action segment is obtained according to the binding relationship in the matching record;
[0172] Furthermore, the system compares the maximum similarity with the effective matching threshold. If the similarity is greater than the threshold, it indicates that the student's action is standardized and complete, and highly matches the bound action. The system immediately generates a valid action matching result and confirms the triggering of the answer process. If the similarity is less than or equal to the threshold, it indicates that the student's action is not standardized or incomplete, and the answer process is not triggered. After generating a valid matching result, the system extracts the corresponding subject knowledge point identifier from the matching record. The subject knowledge point identifier is a unique code, which in this embodiment corresponds to the figure symmetry knowledge point, enabling rapid location of question resources in the answer database.
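S4.1 and S4.2 together amount to a nearest-record lookup followed by a threshold test, which can be sketched as follows. The record field names, the identifiers, and the toy two-dimensional vectors are hypothetical.

```python
import math

EFFECTIVE_MATCH_THRESHOLD = 0.85  # from this embodiment


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_action(current_embedding, mapping_table, threshold=EFFECTIVE_MATCH_THRESHOLD):
    """Return the bound knowledge point identifier, or None if no valid match exists."""
    best_record = None
    best_similarity = -1.0
    for record in mapping_table:
        similarity = cosine_similarity(current_embedding, record["action_embedding"])
        if similarity > best_similarity:
            best_record, best_similarity = record, similarity
    if best_record is not None and best_similarity > threshold:
        return best_record["knowledge_point_id"]
    return None  # the answer process is not triggered


# Hypothetical mapping table with two stored action embeddings.
table = [
    {"action_embedding": [1.0, 0.0], "knowledge_point_id": "figure_symmetry"},
    {"action_embedding": [0.0, 1.0], "knowledge_point_id": "fractions"},
]
print(match_action([0.9, 0.1], table))  # figure_symmetry
print(match_action([1.0, 1.0], table))  # None
```

The second call returns no match because the current embedding is equally distant from both stored actions and its best similarity of about 0.71 does not exceed the threshold.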
[0173] S4.3: Based on the subject knowledge point identifier, query the set of questions associated with the knowledge point from the preset subject answer database, select questions from the question set that are suitable for the current student's historical answer level, and generate a subject answer task; the subject answer database stores questions according to knowledge point tags and difficulty level indexes;
[0174] In this embodiment, the preset subject answer database is a question bank designed for interdisciplinary interaction in primary and secondary schools. It uses a two-level index based on knowledge point tags and difficulty levels, which allows questions to be retrieved directly by tag and level. The database provides multiple sets of questions for each knowledge point, categorized into basic, advanced, and challenge levels to suit students of different learning levels. Based on the subject knowledge point identifier, the system retrieves all questions related to figure symmetry in the database. It then retrieves the current student's historical answer data, including accuracy rate, incorrect answers, and the difficulty level of completed questions, and automatically selects appropriate questions based on the student's actual level. In this embodiment, the student has a relatively high historical accuracy rate, so the system selects advanced-level questions, including symmetry judgment and symmetry axis drawing. The selected questions are integrated according to the answering order and requirements to generate a complete subject answer task. The task content aligns with the knowledge point and is suitable for the student's level, ensuring effective answering.
[0175] Furthermore, the specific steps of S4.3 include:
[0176] (1) Obtain the subject knowledge point identifier, which is the unique subject knowledge point code after the current action is successfully matched;
[0177] (2) Obtain the storage structure of the preset subject answer database. The subject answer database is stored in a two-level index according to knowledge point tags and difficulty level. Each question is bound to a unique knowledge point tag and the corresponding difficulty level. The difficulty level is divided into three levels from low to high: basic level, advanced level and challenge level, which are used to distinguish the appropriate questions for students with different learning levels.
[0178] (3) Using the obtained subject knowledge point identifier as the search condition, perform a precise search in the subject answer database, filter out all questions in the subject answer database that are bound to the subject knowledge point identifier, and form an initial set of questions directly related to the current knowledge point;
[0179] (4) Read the current student's historical answer data, which includes all information such as the student's past answer accuracy rate, wrong answer records, difficulty level of completed questions, and weak knowledge points, and use it as the basis for selecting suitable questions;
[0180] (5) Perform statistical analysis on the current student's historical answer data and calculate the student's historical answer accuracy rate for the corresponding subject knowledge point. The accuracy rate is divided into three intervals: an accuracy rate greater than or equal to 90% corresponds to the challenge level, an accuracy rate greater than or equal to 70% and less than 90% corresponds to the advanced level, and an accuracy rate less than 70% corresponds to the basic level.
[0181] (6) Based on the difficulty matching results obtained from the analysis, select questions from the initial question set that match the current student's difficulty level to form a candidate question set after difficulty filtering.
[0182] (7) Set the number of questions in a single question-answering task. In combination with the time limit for students to answer questions during breaks, the number of questions in the subject-specific question-answering task is fixed at three to ensure that the answering time is appropriate and meets the rhythm requirements of break-time interaction.
[0183] (8) Randomly select three questions from the candidate question set, avoiding questions that have appeared in the student's past answers and repeated questions, to ensure the effectiveness and freshness of the question selection;
[0184] (9) Sort the three drawn questions in order of increasing difficulty to form a question sequence that follows the students' thinking logic. Configure a unified answering rule for the sorted question sequence, including the answering method, answering time limit and scoring standard. The answering time limit is fixed at 10 seconds per question, for a total answering time limit of 30 seconds. The answering method supports two forms: touch selection and voice answering.
[0185] (10) Integrate the sorted question sequence with the configured answer rules into a complete subject-specific answer task. The subject-specific answer task includes question content, answer requirements, time limit, and scoring method information.
[0186] (11) Standardize the format of the generated subject answer tasks to make them compatible with the display interface and data transmission format of the AI training machine, so as to ensure that they can be directly pushed to the student end for display;
[0187] (12) Solidify the construction parameters of the subject answer database. The subject answer database contains all subject knowledge points from primary school to junior high school. The number of questions for each knowledge point and each difficulty level is no less than 50. The questions are written by subject teachers in accordance with the curriculum standards to ensure the accuracy and suitability of the questions.
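The question selection in steps (3) to (9) can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary fields `id`, `kp` and `level`, the function names, and the in-memory list standing in for the subject answer database are all assumptions introduced for the example.

```python
import random

LEVELS = ["basic", "advanced", "challenge"]  # low to high, as in step (2)

def difficulty_for_accuracy(accuracy):
    """Map a historical accuracy rate in [0, 1] to a difficulty level (step 5)."""
    if accuracy >= 0.9:
        return "challenge"
    if accuracy >= 0.7:
        return "advanced"
    return "basic"

def build_task(question_bank, knowledge_point, accuracy, seen_ids, rng=random):
    """Pick three unseen questions for one knowledge point (steps 3 to 9)."""
    level = difficulty_for_accuracy(accuracy)
    candidates = [
        q for q in question_bank
        if q["kp"] == knowledge_point      # step 3: exact match on the identifier
        and q["level"] == level            # step 6: difficulty filtering
        and q["id"] not in seen_ids        # step 8: skip previously answered questions
    ]
    # rng.sample draws without replacement and raises ValueError
    # if fewer than three candidates remain.
    picked = rng.sample(candidates, 3)
    # Step 9: order by increasing difficulty (a no-op when all share one level).
    picked.sort(key=lambda q: LEVELS.index(q["level"]))
    return {"questions": picked, "per_question_limit_s": 10, "total_limit_s": 30}
```

Sampling without replacement also covers the "no repeated questions" requirement of step (8) within a single task.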
[0188] S4.4: Push the subject-specific answering task through the display interface of the AI training machine, and generate a motion-triggered answering instruction at the same time; the motion-triggered answering instruction includes an answering task identifier, a time limit for answering, and a scoring weight.
[0189] Furthermore, the AI exercise machine features a high-definition touchscreen display with a simple and easy-to-understand design, adapted to the operating habits of primary and secondary school students. The system clearly displays the generated subject-specific quiz tasks on the screen, including the question content, options, and answer prompts. Students can complete the quiz using either touch or voice. Simultaneously, the system generates motion-triggered answer commands, which contain three core pieces of information: a unique code identifying the quiz task and linking it to the current question and student information; a time limit set according to the question's difficulty (10 seconds in this example) to encourage quick thinking; and a scoring weight that clearly defines the percentage of marks for action completion and answer accuracy (40% for action completion and 60% for answer accuracy in this example).
[0190] S5: Collect students' answers, combine them with the completion rate of actions, generate real-time individual score data, synchronize historical score data, calculate and visualize the comparison results of individual historical scores, and mark weak knowledge points based on the answer errors.
[0191] The specific steps of S5 include:
[0192] S5.1: Collect the answer results submitted by students for the subject-specific question-and-answer task through the touch screen or voice input interface of the AI training machine; the answer results include option selection results, text input results, or voice recognition results;
[0193] Furthermore, the AI training machine is equipped with both touch and voice input interfaces, adapting to the different operating habits of primary and secondary school students and making the data collection process convenient and efficient. The touchscreen supports tapping options, handwriting input, and other operations, with fast response and no latency. The voice input interface has a built-in high-precision speech recognition model, which adopts an encoder-decoder structure. The front end first performs framing, windowing, and Mel filtering on the input speech to extract 80-dimensional Mel spectrum features, with a frame length of 25 ms and a frame shift of 10 ms. The encoder is constructed by cascading a 3-layer convolutional neural network and a 4-layer bidirectional long short-term memory network. The convolutional layers use a 3×3 kernel and a 1×1 stride, with 64, 128, and 256 output channels respectively. The bidirectional long short-term memory network has a hidden layer dimension of 512 and is used to fully extract speech temporal features and contextual dependencies. A scaled dot-product attention module is added in the middle of the model to strengthen the weight of answer keyword features. The decoder uses a 1-layer long short-term memory network combined with a fully connected layer and a softmax output layer. The output dictionary covers 5832 Chinese characters, numbers, letters and symbols commonly used in primary and secondary schools. The overall structure is adapted to the short sequences and high real-time requirements of oral answering.
[0194] In this interdisciplinary interactive sports scenario, the input to this speech recognition model is limited to 3-10 seconds of student answer speech. It automatically filters out interference signals such as environmental noise, conversation, and vocalizations, outputting structured text results in a unified format of answer type plus specific content. This allows for direct character matching with standard answers in the subject-specific answer database and establishes clear data associations with action completion and student identification, enabling the speech recognition output to directly serve real-time scoring and adaptive task adjustments. The speech recognition model is specifically optimized for primary and secondary school answer scenarios, adapting to different accents and speaking speeds, achieving an accuracy rate exceeding 95% in typical campus noise environments. After completing the questions, students can choose to submit their answers via touch or voice. The system immediately collects the results, including multiple-choice options, fill-in-the-blank text input, and short-answer speech recognition. The collected results are quickly transmitted to the scoring module for processing.
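As a rough sizing aid for the front end described above, the number of Mel feature frames produced by a 3 to 10 second utterance can be estimated from the 25 ms frame length and 10 ms frame shift. The sketch below assumes that only full frames are kept and that no padding is applied; the source does not state the padding policy, and the function name is hypothetical.

```python
def mel_frame_count(duration_ms, frame_len_ms=25, frame_shift_ms=10):
    """Number of full analysis frames when no padding is applied (assumption)."""
    if duration_ms < frame_len_ms:
        return 0
    return 1 + (duration_ms - frame_len_ms) // frame_shift_ms

for seconds in (3, 10):
    frames = mel_frame_count(seconds * 1000)
    # Each frame carries an 80-dimensional Mel spectrum feature vector.
    print(f"{seconds} s -> {frames} frames x 80 Mel bins")
```

Under these assumptions a 3 second answer yields 298 frames and a 10 second answer yields 998 frames, which bounds the sequence length seen by the encoder.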
[0195] S5.2: Compare the answer results with the standard answers in the subject answer database to determine the correctness of the answer and record the answering time;
[0196] Furthermore, after collecting the answer results, the system immediately retrieves the standard answer to the corresponding question from the subject-specific answer database and performs a precise comparison word by word and option by option to quickly determine the correctness of the answer. The comparison process takes no more than 5 milliseconds. If the answer is completely consistent with the standard answer, it is judged as correct; if there is a difference, it is judged as incorrect. At the same time, the system accurately records the answering time from the start of the motion-sensor-triggered answer command to the end of the student's submission of the answer, with the time accuracy reaching the millisecond level, reflecting the student's answering speed.
[0197] S5.3: Calculate the individual real-time score data $s$ according to the formula $s = w_1 a + w_2 b - w_3 c$, where $a$ represents the action completion rate, $b$ represents the correctness score of the answer, $c$ represents the answering time, $w_1$ indicates the preset first weight coefficient, $w_2$ indicates the preset second weight coefficient, and $w_3$ indicates the preset third weight coefficient, the weight coefficients satisfying the preset value constraints. The correctness score is 1 for a correct answer and 0 for an incorrect answer.
[0198] Furthermore, the specific steps in S5.3 include:
[0199] (1) Obtain the action completion value obtained by the current student through action analysis. The action completion value is a real number between 0 and 100. At the same time, read the answer result of the current student after completing the subject answering task. Add up the answer correctness scores of all questions in this answering task to obtain the total answer correctness score of this answering task.
[0200] (2) Collect the actual time taken by the current student from the start of answering the question to the submission of the answer. The time taken is counted in seconds and the numerical precision is retained to the integer place.
[0201] (3) Load the three preset weight coefficients, wherein the preset first weight coefficient is used to weight the completion of the action, and the value is fixed at 0.4; the preset second weight coefficient is used to weight the correctness score of the answer, and the value is fixed at 0.6; and the preset third weight coefficient is used to weight the time spent answering the question, and the value is fixed at 0.1.
[0202] (4) Multiply the obtained action completion value by 0.4 to obtain the action completion weighted score, multiply the total answer correctness score by 0.6 to obtain the answer correctness weighted score, and multiply the collected answer time value by 0.1 to obtain the answer time weighted deduction value;
[0203] (5) Perform a summation operation on the weighted score of action completion and the weighted score of answer correctness to obtain the comprehensive basic score of action and answer. Then subtract the weighted deduction value of answer time from the comprehensive basic score to obtain the unstandardized original real-time score value.
[0204] (6) Perform interval constraint processing on the original real-time score values, and limit the calculation results to a closed interval range of 0 to 100. When the value is lower than zero, the value is directly taken as zero, and when the value is higher than 100, the value is directly taken as 100.
[0205] (7) The value after interval constraint is determined as the current student's real-time score data and is in integer form.
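Steps (1) to (7) can be sketched as a single function. The sketch takes the scales exactly as stated (action completion on 0 to 100, correctness summed from per-question 0/1 scores, time in whole seconds); the function and parameter names are assumptions, and rounding to the nearest integer is also an assumption, since the source only says the result is in integer form.

```python
def real_time_score(action_completion, correctness_scores, answer_time_s,
                    w_action=0.4, w_answer=0.6, w_time=0.1):
    """Weighted score clamped to [0, 100] and returned as an integer."""
    total_correct = sum(correctness_scores)            # step (1): sum of 0/1 scores
    raw = (w_action * action_completion                # step (4): weighted parts
           + w_answer * total_correct
           - w_time * answer_time_s)                   # step (5): time deduction
    clamped = max(0.0, min(100.0, raw))                # step (6): interval constraint
    return int(round(clamped))                         # step (7): rounding mode assumed

# Example: completion 90, two of three answers correct, 20 seconds used.
# 0.4 * 90 + 0.6 * 2 - 0.1 * 20 = 35.2, which rounds to 35.
print(real_time_score(90, [1, 1, 0], 20))
```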
[0206] S5.4: Synchronize the current student's historical real-time individual score data through local area network communication, generate personal historical score comparison results, and visualize them on the AI training machine's display screen in the form of personal score curves, historical best records, and progress.
[0207] Furthermore, the AI training machine establishes a stable communication connection with all identical devices in the classroom via the campus LAN, with a communication latency of less than 10 milliseconds, ensuring real-time and efficient data synchronization. The system automatically collects real-time individual scoring data from all students participating in the online interaction within the LAN, generating an independent personal historical performance comparison file for each student. The comparison includes the current score and the student's personal best record, the average score of the last five tests, and the rate of improvement / regression. The class interface only displays a summary of all students' self-improvement status, without showing score comparisons or rankings between students. It showcases collective honors based on the number of personal best records achieved, the number of consecutive improvements, and the number of self-breakthroughs. The results of the personal historical performance comparison are visualized on the AI training machine's display screen in the form of a dedicated personal growth panel. The interface features vibrant colors and clear fonts, conforming to the visual habits of primary and secondary school students.
[0208] S5.5: When the answer result is incorrect, record the subject knowledge point identifier corresponding to the subject answer task, accumulate the subject knowledge point identifier into the corresponding student's incorrect knowledge point set, count the error frequency of each subject knowledge point identifier in the incorrect knowledge point set, and mark the subject knowledge point identifiers with a frequency exceeding the preset weak threshold as weak knowledge point data.
[0209] Furthermore, the specific steps of S5.5 include:
[0210] (1) Obtain the answer result submitted by the current student and the subject knowledge point identifier corresponding to the answer result. The subject knowledge point identifier is uniquely associated with the knowledge point bound to this answer task.
[0211] (2) Determine the accuracy of the current student's answer and confirm whether the answer is incorrect. Only answers that are determined to be incorrect will be included in the knowledge point recording and statistics process.
[0212] (3) When the answer result is judged to be wrong, extract the subject knowledge point identifier corresponding to the answer task and use the subject knowledge point identifier as the error association information to be recorded;
[0213] (4) Retrieve the current student’s exclusive set of error knowledge points preset in the system. The set of error knowledge points is allocated independently to each student and is used to store the subject knowledge point identifiers corresponding to the errors made by the student in each answer. The sets of error knowledge points of different students are independent of each other and do not overlap.
[0214] (5) Add the subject knowledge point identifiers extracted this time to the current student's set of incorrect knowledge points to complete the recording operation of a single incorrect knowledge point. During the recording process, retain the original content of the subject knowledge point identifiers without making any modifications.
[0215] (6) Traverse the set of incorrect knowledge points of the current students and count the total number of times each subject knowledge point identifier appears in the set of incorrect knowledge points. The total number of times is the error frequency of the corresponding knowledge point.
[0216] (7) Set a preset threshold for judging weak knowledge points. Based on the actual learning and answering patterns of primary and secondary school students, the threshold is fixed at 3 times.
[0217] (8) Compare the error frequency of each subject knowledge point identifier in the set of incorrect knowledge points with the preset weak threshold, and determine the weakness of the knowledge points one by one. Select the subject knowledge point identifiers with more than 3 errors and determine that these identifiers correspond to content that the student has not mastered stably and needs to reinforce.
[0218] (9) Organize and classify all subject knowledge point identifiers that meet the judgment conditions to form the current student’s exclusive weak knowledge point data, and bind the generated weak knowledge point data with the current student’s identity identifier to ensure that the data and the student’s subject maintain a unique correspondence.
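The recording and counting in steps (1) to (9) can be sketched with a per-student error log and a frequency count. The data layout (a dictionary keyed by student identifier) and the function names are assumptions for illustration only.

```python
from collections import Counter

def record_error(error_sets, student_id, knowledge_point_id):
    """Append one incorrect answer's identifier to that student's own set (steps 3 to 5)."""
    error_sets.setdefault(student_id, []).append(knowledge_point_id)

def weak_knowledge_points(error_log, threshold=3):
    """Return identifiers whose error frequency exceeds the threshold (steps 6 to 8)."""
    counts = Counter(error_log)
    return {kp: n for kp, n in counts.items() if n > threshold}

error_sets = {}
for kp in ["fractions"] * 4 + ["angles"] * 3 + ["decimals"]:
    record_error(error_sets, "student-001", kp)

# Only "fractions" has more than 3 errors; "angles" has exactly 3 and is not marked.
weak = {sid: weak_knowledge_points(log) for sid, log in error_sets.items()}
print(weak)
```

Keying the result by student identifier corresponds to the binding in step (9).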
[0219] S6: The personal historical performance comparison results, action deviation data and weak knowledge point data are fed back to the linkage matching model to dynamically adjust the difficulty of action triggering and the subject question matching level.
[0220] The specific steps of S6 include:
[0221] S6.1: Map the improvement rate and whether the individual's best historical record has been reached in the comparison results of the individual's historical performance to an ability coefficient;
[0222] Furthermore, the ability coefficient is a quantitative indicator reflecting a student's self-growth and comprehensive abilities. It is generated by mapping the results of the comparison with the individual's historical performance. The greater the breakthrough and the closer the score is to, or the further it is beyond, the student's personal best record, the higher the ability coefficient. The system pre-defines the correspondence between individual progress, personal best achievement, and ability coefficient. For example, reaching the personal best record corresponds to an ability coefficient of 1.1, an improvement of 10% or more over the personal best record corresponds to 1.2, reaching or exceeding the historical average corresponds to 1.0, and falling short of the historical average corresponds to 0.9. The ability coefficient ranges from 0.8 to 1.2, objectively reflecting a student's level of self-challenge and self-transcendence. For instance, if a student's score exceeds their personal best record by 10% or more, the mapped ability coefficient is 1.2, indicating that the student's self-transcendence has been significant and their comprehensive abilities have steadily improved.
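The example correspondence in this paragraph can be written as a small lookup. The order in which the conditions are checked (strongest condition first) and the function name are assumptions; the source lists the example values but not a precedence rule.

```python
def ability_coefficient(current, personal_best, historical_average):
    """Map a score to the example coefficients 1.2 / 1.1 / 1.0 / 0.9."""
    if current >= personal_best * 1.10:   # 10% or more above the personal best
        return 1.2
    if current >= personal_best:          # personal best reached
        return 1.1
    if current >= historical_average:     # at or above the historical average
        return 1.0
    return 0.9                            # below the historical average

print(ability_coefficient(90, 80, 70))
print(ability_coefficient(75, 80, 70))
```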
[0223] S6.2: Normalize the comprehensive deviation value of each joint in the motion deviation data to obtain the motion standardization coefficient; the value of the motion standardization coefficient is equal to 1 minus the normalized comprehensive deviation value;
[0224] Furthermore, the normalization process transforms the comprehensive deviation value in the motion deviation data into a standardized value between 0 and 1, eliminating the influence of different deviation magnitudes. The motion standardization coefficient is negatively correlated with the motion deviation. The smaller the normalized comprehensive deviation value, the higher the motion standardization coefficient, which represents a more standard motion. After the system normalizes the comprehensive deviation value, it subtracts the value from 1 to obtain the motion standardization coefficient. The closer the coefficient is to 1, the higher the motion standardization.
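A minimal sketch of the coefficient follows. The source does not specify the normalization method, so dividing by a reference maximum deviation and clipping to the range 0 to 1 is an assumption, as are the names used.

```python
def motion_standardization(deviation, max_deviation):
    """Return 1 minus the normalized comprehensive deviation value."""
    if max_deviation <= 0:
        raise ValueError("max_deviation must be positive")
    # Assumed normalization: scale by a reference maximum and clip to [0, 1].
    normalized = min(max(deviation / max_deviation, 0.0), 1.0)
    return 1.0 - normalized
```

A deviation of 0 gives a coefficient of 1 (fully standard motion), and any deviation at or beyond the reference maximum gives 0.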
[0225] S6.3: Normalize the error frequency corresponding to the knowledge point identifiers of each subject in the weak knowledge point data to obtain a knowledge point weakness vector;
[0226] Furthermore, the knowledge point weakness vector is a feature vector that quantifies students' mastery of knowledge points. Each element corresponds to the weakness of a knowledge point. The system normalizes the error frequency in the weak knowledge point data, converting the frequency into a value between 0 and 1. The larger the value, the weaker the knowledge point. All the normalized weakness values are combined to form the knowledge point weakness vector, which can accurately reflect the students' mastery status of each knowledge point.
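One way to build such a vector is sketched below. Dividing each error frequency by the largest frequency is an assumed normalization; the source only requires values between 0 and 1, with larger values indicating weaker knowledge points.

```python
def weakness_vector(error_counts):
    """Normalize per-knowledge-point error frequencies to [0, 1]."""
    peak = max(error_counts.values(), default=0)
    if peak == 0:
        # No recorded errors: every knowledge point has zero weakness.
        return {kp: 0.0 for kp in error_counts}
    return {kp: n / peak for kp, n in error_counts.items()}

print(weakness_vector({"fractions": 4, "angles": 2, "decimals": 0}))
```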
[0227] S6.4: Calculate the action trigger difficulty adjustment factor according to the preset formula and feed it back to the effective matching threshold in the linkage matching model. The formula takes as inputs the preset basic trigger difficulty, the ability coefficient p, the action standardization coefficient d, the preset first adjustment weight, and the preset second adjustment weight. In this embodiment, the preset basic trigger difficulty is 0.85, the first adjustment weight is 0.1, and the second adjustment weight is 0.1.
[0228] S6.5: Adjust the subject question adaptation level according to the knowledge point weakness vector. For subject knowledge points whose weakness is higher than the preset reinforcement threshold, increase the extraction weight of the corresponding questions in the subject answer database and increase the question difficulty level. For subject knowledge points whose weakness is lower than the preset ignore threshold, decrease the extraction weight of the corresponding questions. The adjusted extraction weight and question difficulty level are then sent back to the subject answer task generation step in S4.3.
[0229] Furthermore, the specific steps in S6.5 include:
[0230] (1) Obtain the current student's knowledge point weakness vector and read all subject knowledge point identifiers corresponding to the knowledge point weakness vector;
[0231] (2) Set the reinforcement threshold to 0.6 and the ignore threshold to 0.2, and then iterate through each value in the knowledge point weakness vector;
[0232] (3) For subject knowledge points with a weakness level higher than 0.6, the question extraction weight increase operation is performed, and the original extraction weight of the subject knowledge point in the subject answer database is increased by a fixed value. The weight increase is set to 0.3. At the same time, for subject knowledge points with a weakness level higher than 0.6, the question difficulty level increase operation is performed, and the difficulty level of the question corresponding to the knowledge point is increased by one level. After the difficulty level is increased, it is kept in line with the student's learning stage.
[0233] (4) For subject knowledge point identifiers with a weakness level of less than 0.2, the question extraction weight reduction operation is performed, and the original extraction weight of the subject knowledge point identifier in the subject answer database is reduced by a fixed value. The weight reduction range is set to 0.2.
[0234] (5) For subject knowledge points whose weakness level is between the neglect threshold of 0.2 and the reinforcement threshold of 0.6, the original question extraction weight and question difficulty level remain unchanged, and no additional adjustment operations are performed;
[0235] (6) After completing the traversal and adjustment of all knowledge points, summarize the updated extraction weights and updated difficulty levels corresponding to the knowledge point identifiers of all subjects to form a complete question adaptation adjustment result, and standardize the updated extraction weights to ensure that the weight values of all knowledge points are within a reasonable range and that there are no weight abnormalities or conflicts.
[0236] (7) The updated weights and difficulty levels are uniformly returned to the subject-specific question-answering task generation step as the direct basis for selecting questions based on subject knowledge point identifiers.
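The traversal in steps (1) to (7) can be sketched as follows. The thresholds and increments are those given above; flooring weights at zero and standardizing by dividing by the sum are assumptions, since the source only requires the weights to stay within a reasonable range. All names are hypothetical.

```python
LEVELS = ["basic", "advanced", "challenge"]

def adjust_adaptation(weakness, weights, levels,
                      reinforce=0.6, ignore=0.2, up=0.3, down=0.2):
    """Return (new_weights, new_levels) for every knowledge point identifier."""
    new_weights, new_levels = {}, {}
    for kp, w in weakness.items():
        weight, level = weights[kp], levels[kp]
        if w > reinforce:
            weight += up                                   # step (3): raise weight
            idx = min(LEVELS.index(level) + 1, len(LEVELS) - 1)
            level = LEVELS[idx]                            # one level up, capped
        elif w < ignore:
            weight = max(weight - down, 0.0)               # step (4); floor assumed
        # Otherwise (step 5) weight and level stay unchanged.
        new_weights[kp] = weight
        new_levels[kp] = level
    total = sum(new_weights.values())
    if total > 0:                                          # step (6): assumed normalization
        new_weights = {kp: v / total for kp, v in new_weights.items()}
    return new_weights, new_levels
```

The returned dictionaries correspond to the adjustment result that step (7) passes back to the task generation step.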
[0237] When the interdisciplinary sports interaction method is executed iteratively, the motion deviation data and weak knowledge point data generated in the previous iteration are used as inputs to the linkage matching model in the next iteration. The linkage matching model adjusts the allowable deviation range of each joint in the preset standard motion template according to the motion deviation data, and adjusts the binding relationship between questions and subject knowledge point identifiers in the subject answer database according to the weak knowledge point data, forming a closed-loop adaptive adjustment mechanism based on historical performance data.
[0238] Example 2:
[0239] Referring to Figure 3, another embodiment of the present invention provides a multimodal alignment-based interdisciplinary sports interaction system, comprising:
[0240] Feature extraction module 10 collects real-time RGB video streams of students' physical exercise movements through the front-facing camera of the AI exercise machine, extracts the coordinate sequence of key skeletal points frame by frame, generates temporal feature data of the movement, and calculates the proportion of effective action frames based on the action recognition classifier of the temporal convolutional network to obtain the action completion degree.
[0241] The deviation detection module 20 is used to segment and normalize the motion timing feature data to obtain standardized motion segment data. At the same time, it compares the standardized motion segment data with the preset standard motion template, calculates the Euclidean distance of the joint coordinates and the comprehensive deviation value, statistically analyzes the deviation frame information, and generates motion deviation data.
[0242] The linkage matching module 30 is used to construct a multimodal temporal alignment algorithm that includes an action encoder and a text encoder. It maps standardized action segment data to the embedding space to output action embedding vectors, converts subject knowledge point text into text embedding vectors, and establishes a linkage matching model between actions and knowledge points through contrastive learning training. It forms a matching mapping table for action-knowledge point binding to achieve accurate alignment and association of multimodal data.
[0243] The motion-triggered push module 40 receives standardized action segment data in real time, calculates the cosine similarity between its action embedding vector and the action embedding vectors in the linkage matching model, generates effective action matching results, selects questions suitable for students' answering level from the subject answering database based on the bound subject knowledge point identifier, pushes subject answering tasks through the AI exercise machine and generates motion-triggered answering instructions, and completes the linkage triggering from sports actions to subject answering.
[0244] The scoring calculation module 50 is used to collect students' touch screen or voice answer results, compare them with the standard answers to determine the correctness and record the time taken, and calculate the real-time score data of a single person by weighting the action completion rate, answer accuracy rate and answer time. It also synchronizes the scoring data of multiple students in the local area network, synchronizes the historical scoring data of the corresponding students to generate personal historical score comparison results and displays them visually, and counts the frequency of knowledge point errors and marks the data of weak knowledge points.
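The similarity check performed by the motion-triggered push module 40 can be illustrated with a plain cosine similarity lookup. The mapping table layout (knowledge point identifier to reference action embedding), the default threshold of 0.85 and the function names are assumptions made for this sketch; real embeddings would come from the trained action encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_action(embedding, mapping_table, threshold=0.85):
    """Return (knowledge_point_id, similarity), or (None, similarity) if below threshold."""
    best_kp, best_sim = None, -1.0
    for kp, reference in mapping_table.items():
        sim = cosine_similarity(embedding, reference)
        if sim > best_sim:
            best_kp, best_sim = kp, sim
    if best_sim >= threshold:
        return best_kp, best_sim
    return None, best_sim

table = {"kp_angles": [1.0, 0.0, 0.0], "kp_fractions": [0.0, 1.0, 0.0]}
print(match_action([0.9, 0.1, 0.0], table))
```

Raising the threshold makes a valid action matching result harder to trigger, which is how a fed-back trigger difficulty could act on this step.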
[0245] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments under the guidance of the present invention without departing from the spirit and scope of the present invention. All of these variations are within the protection scope of the present invention.
Claims
1. An interdisciplinary sports interaction method based on multimodal alignment, characterized in that it comprises: collecting, through the front-facing camera of the AI exercise machine, real-time RGB video streams of students' physical exercises during breaks, extracting the coordinate sequence of skeletal key points frame by frame from the video stream to generate action temporal feature data, and analyzing the action temporal feature data to obtain the degree of action completion; using the force application frames and posture transition frames in the action temporal feature data as timing anchor points for segmentation to obtain standardized action segment data, and at the same time generating action deviation data by comparison with the preset standard action template; constructing, based on the standardized action segment data, a linkage matching model between action segments and subject knowledge points; when a student completes the corresponding physical action and triggers the linkage matching model to generate a valid action matching result, automatically pushing, by the AI exercise machine, the subject answering task bound to the corresponding physical action and forming a motion-triggered answering instruction; collecting students' answer results, combining them with the degree of action completion to generate real-time individual score data, synchronizing historical score data, calculating and visualizing the personal historical score comparison results, and marking weak knowledge point data based on answer errors; and feeding the personal historical score comparison results, action deviation data and weak knowledge point data back to the linkage matching model to dynamically adjust the action trigger difficulty and the subject question adaptation level.
2. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 1, characterized in that, The step of extracting the skeletal keypoint coordinate sequence frame by frame from the video stream to generate action temporal feature data includes: A skeletal keypoint detection algorithm based on a convolutional pose machine is used to perform human detection on each frame of the video stream and output human bounding boxes. Within the human body bounding box, the confidence heatmap of each joint point of the human body is predicted by the multi-stage convolutional network of the convolutional pose machine; the multi-stage convolutional network includes a first-stage convolutional network to an Nth-stage convolutional network, the first-stage convolutional network extracts a preliminary feature map from the input image, and the Nth-stage convolutional network fuses the feature map of the previous stage with the intermediate supervision signal to iteratively refine the joint point prediction results. Non-maximum suppression processing is applied to the confidence heatmap to extract the pixel coordinates of each joint in the image coordinate system. The pixel coordinates are then arranged in chronological order to form a sequence of skeletal key point coordinates. Each frame corresponds to a coordinate vector containing all joints, and the coordinate vectors of all frames are indexed by the acquisition time to form action temporal feature data. The joints include the shoulder joint, elbow joint, wrist joint, hip joint, knee joint, and ankle joint.
3. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 2, characterized in that, The synchronous analysis of action timing feature data yields the action completion degree, including: The coordinate vector of each frame in the action temporal feature data is input into a preset action recognition classifier. The action recognition classifier adopts a binary classification structure based on a temporal convolutional network and outputs the probability distribution of the action category to which the current frame belongs. Based on the probability distribution, determine whether the current frame belongs to the action category in the preset standard action sequence. When the probability of the highest category in the probability distribution is greater than the preset category threshold, mark the current frame as a valid action frame. Calculate the percentage of effective action frames within a continuous time window, normalize the percentage of frames, and obtain the action completion rate.
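The frame counting in claim 3 can be sketched as follows. The classifier itself is not shown: the per-frame probability distributions are taken as given. The threshold value of 0.8, the 0 to 100 output scale and all names are assumptions for illustration.

```python
def action_completion(frame_probs, target_classes, class_threshold=0.8):
    """Percentage of frames in the window marked as valid action frames."""
    if not frame_probs:
        return 0.0
    valid = 0
    for probs in frame_probs:
        top = max(range(len(probs)), key=lambda i: probs[i])
        # Valid when the top category is a target category and its
        # probability exceeds the preset category threshold.
        if top in target_classes and probs[top] > class_threshold:
            valid += 1
    return 100.0 * valid / len(frame_probs)

window = [[0.1, 0.9], [0.3, 0.7], [0.95, 0.05], [0.15, 0.85]]
print(action_completion(window, target_classes={1}))
```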
4. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 3, characterized in that, The process involves segmenting the motion sequence data using force application frames and posture transition frames as temporal anchors to obtain standardized motion segment data, including: The displacement change of each joint point between adjacent frames is calculated for the coordinate sequence of the skeletal key points, and the displacement change is summed to obtain the inter-frame motion energy value, generating a motion energy curve; Local peak points are detected on the motion energy curve. When the inter-frame motion energy value of the local peak point exceeds the preset force energy threshold, the frame index corresponding to the local peak point is marked as a force frame. The change in joint angles between adjacent frames is calculated, and the change in joint angles is accumulated to obtain the attitude change rate. When the attitude change rate exceeds a preset conversion rate threshold, the current frame is marked as an attitude conversion frame. The joint angles are composed of three points: the shoulder joint, the elbow joint, and the wrist joint, or the hip joint, the knee joint, and the ankle joint. Using the force-generating frame and the attitude transition frame as the segmentation boundary, the action timing feature data is divided into multiple continuous action segments. Each action segment is time-normalized, and each action segment is linearly interpolated to a fixed number of frames to obtain standardized action segment data.
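Two parts of claim 4, force frame detection on the motion energy curve and time normalization by linear interpolation, can be sketched as below. Posture transition detection from joint angles is omitted for brevity. The frame layout (a list of (x, y) joint coordinates per frame), the peak test and the function names are assumptions.

```python
import math

def motion_energy(frames):
    """Sum of joint displacements between each pair of adjacent frames."""
    return [sum(math.dist(p, q) for p, q in zip(prev, cur))
            for prev, cur in zip(frames, frames[1:])]

def force_frames(energy, threshold):
    """Frame indices at local energy peaks above the force energy threshold."""
    # energy[i] describes the transition into frame i + 1.
    return [i + 1 for i in range(1, len(energy) - 1)
            if energy[i] > energy[i - 1]
            and energy[i] >= energy[i + 1]
            and energy[i] > threshold]

def resample(segment, target_len):
    """Linearly interpolate a segment to a fixed number of frames."""
    if target_len < 2 or len(segment) < 2:
        raise ValueError("need at least two source frames and two target frames")
    out = []
    for k in range(target_len):
        pos = k * (len(segment) - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(segment) - 1)
        t = pos - lo
        out.append([tuple(a + t * (b - a) for a, b in zip(p, q))
                    for p, q in zip(segment[lo], segment[hi])])
    return out
```

For a single joint moving along x through 0, 1, 2, 7, 8, 9, the energy curve is 1, 1, 5, 1, 1, so with a threshold of 3 the only force frame is frame 3.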
5. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 4, characterized in that comparing with the preset standard action template to generate action deviation data includes: obtaining the preset standard action template, the standard action template including a standard skeletal key point coordinate sequence and the allowable deviation range of each key point; calculating the Euclidean distance between the coordinate vector of each key point in each frame of the standardized action segment data and the coordinate vector of the same key point in the corresponding frame of the standard action template to obtain a frame-level deviation value for each key point; weighting and summing the frame-level deviation values of all key points to obtain a comprehensive deviation value for the corresponding frame; marking frames whose comprehensive deviation value exceeds a preset deviation threshold as deviation frames; and counting the continuous duration and the peak deviation amplitude of the deviation frames, and encapsulating the continuous duration and the peak deviation amplitude into the action deviation data.
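The deviation computation of claim 5 can be illustrated with the short sketch below. It assumes the segment and the template have already been resampled to the same number of frames; the function names, the dictionary keys, and measuring duration in frames rather than seconds are illustrative choices, not part of the claimed method.

```python
import math

def frame_deviations(segment, template, weights):
    """Weighted sum of per-key-point Euclidean distances, one value per frame."""
    return [sum(w * math.dist(p, q) for w, p, q in zip(weights, frame, ref))
            for frame, ref in zip(segment, template)]

def deviation_runs(values, deviation_threshold):
    """Group consecutive deviation frames and report duration and peak amplitude."""
    runs, start = [], None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v > deviation_threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append({"start_frame": start,
                         "duration_frames": i - start,
                         "peak": max(values[start:i])})
            start = None
    return runs
```

Each returned record corresponds to one uninterrupted stretch of deviation frames, which is one way to package the duration and peak amplitude described above.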
6. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 5, characterized in that establishing the linkage matching model includes: constructing a multimodal temporal alignment algorithm, the multimodal temporal alignment algorithm including an action encoder and a text encoder, wherein the action encoder adopts a neural network structure based on a temporal convolutional network and a self-attention mechanism, and the text encoder adopts a pre-trained language model structure based on bidirectional encoder representations; inputting the standardized action segment data into the action encoder, wherein the input layer of the action encoder receives a skeletal key point coordinate sequence with a fixed number of frames, local temporal features are extracted through multiple convolutional layers of the temporal convolutional network, inter-frame dependencies are then calculated through the fully connected layer of the self-attention mechanism, and an action embedding vector is output; and inputting the preset subject knowledge point text into the text encoder, extracting semantic features through multiple transformer layers of the pre-trained language model, and outputting a text embedding vector.
7. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 6, characterized in that establishing the linkage matching model further includes: adopting a contrastive learning-based training method, in which standardized action segment data and the corresponding subject knowledge point text form positive sample pairs, and standardized action segment data and randomly selected subject knowledge point text form negative sample pairs; optimizing the parameters of the action encoder and the text encoder so that the cosine similarity between the action embedding vector and the text embedding vector of each positive sample pair is greater than a preset similarity threshold and the cosine similarity of each negative sample pair is less than a preset dissimilarity threshold; using the action embedding vector output by the trained action encoder as the mapping representation of the standardized action segment data in the embedding space; and establishing the linkage matching model based on the mapping representation, the linkage matching model including a matching mapping table that stores the associations between action embedding vectors and text embedding vectors, wherein, when the cosine similarity between the action embedding vector of the standardized action segment data and a text embedding vector exceeds a preset effective matching threshold, the binding relationship between the corresponding action segment and the corresponding subject knowledge point is recorded in the matching mapping table.
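To make the threshold conditions of claim 7 concrete, the sketch below shows a cosine similarity function, a hinge-style per-pair penalty, and the construction of the matching mapping table. The claim states only the target inequalities, so the hinge form of `pair_loss` is an assumption standing in for whatever contrastive objective is actually used; the encoders are not modelled, and all names and dictionary keys are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pair_loss(similarity, is_positive, sim_threshold, dissim_threshold):
    """Assumed hinge-style penalty: zero once the pair satisfies its inequality."""
    if is_positive:
        return max(0.0, sim_threshold - similarity)   # want similarity > sim_threshold
    return max(0.0, similarity - dissim_threshold)    # want similarity < dissim_threshold

def build_matching_table(action_embeddings, text_embeddings, match_threshold):
    """Record a binding for every (segment, knowledge point) pair above the threshold.

    action_embeddings / text_embeddings: dicts mapping an identifier to a vector,
        assumed to be outputs of the trained encoders.
    """
    table = []
    for segment_id, a in action_embeddings.items():
        for kp_id, t in text_embeddings.items():
            s = cosine(a, t)
            if s > match_threshold:
                table.append({"segment": segment_id,
                              "knowledge_point": kp_id,
                              "action_embedding": a,
                              "similarity": s})
    return table
```

Storing the action embedding alongside each binding lets a later lookup compare a new segment against the table without re-encoding the stored segments.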
8. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 7, characterized in that generating the action-triggered answer instruction includes: receiving the standardized action segment data in real time, calculating the action embedding vector of the current action segment with the trained action encoder, calculating the cosine similarity between that vector and each action embedding vector stored in the matching mapping table of the linkage matching model, and finding the matching record with the maximum cosine similarity; when the maximum cosine similarity is greater than the preset effective matching threshold, generating an effective action matching result and obtaining, according to the binding relationship in the matching record, the subject knowledge point identifier bound to the corresponding action segment; querying, based on the subject knowledge point identifier, the set of questions associated with the corresponding knowledge point from a preset subject answer database, and selecting from the question set questions suited to the current student's historical answer level to generate a subject answering task, wherein the subject answer database stores questions indexed by knowledge point tag and difficulty level; and pushing, by the AI training machine through its display interface, the subject answering task and generating the action-triggered answer instruction, the action-triggered answer instruction including an answering task identifier, an answering time limit, and a scoring weight.
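The lookup and push steps of claim 8 might be sketched as follows. The table and question-bank layouts, the rule of choosing the question whose difficulty is closest to the student's historical level, and every function and field name are illustrative assumptions; the claim specifies only that a suitable question is selected and which three fields the instruction carries.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def match_action(embedding, table, match_threshold):
    """Knowledge point bound to the most similar stored segment, or None."""
    best, best_sim = None, -1.0
    for record in table:
        s = cosine(embedding, record["action_embedding"])
        if s > best_sim:
            best, best_sim = record, s
    if best is None or best_sim <= match_threshold:
        return None  # no effective action matching result
    return best["knowledge_point"]

def pick_question(bank, knowledge_point, student_level):
    """Assumed rule: the question whose difficulty is closest to the student's level."""
    candidates = [q for q in bank if q["knowledge_point"] == knowledge_point]
    if not candidates:
        return None
    return min(candidates, key=lambda q: abs(q["difficulty"] - student_level))

def build_instruction(question, time_limit_s, scoring_weight):
    """Action-triggered answer instruction with the three fields named in the claim."""
    return {"task_id": question["id"],
            "time_limit_s": time_limit_s,
            "scoring_weight": scoring_weight}
```

For example, an embedding close to a stored "lever" segment yields the "lever" knowledge point, from which a question at the student's level is chosen and wrapped in an instruction for the display interface.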
9. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 8, characterized in that collecting students' answer results, combining them with the action completion degree to generate individual real-time score data, synchronizing individual historical score data, calculating and visually displaying the comparison result of individual historical scores, and marking weak knowledge point data according to answer errors includes: collecting, by the AI training machine through its touchscreen or voice input interface, the students' answer results for the subject answering task, the answer results including option selection results, text input results, or voice recognition results; comparing the answer results with the standard answers in the subject answer database to determine the correctness of each answer, and recording the answering time; calculating the individual real-time score data s according to a preset formula [formula not reproduced in the source text], where a represents the action completion degree, b represents the correctness score of the answer, c represents the answering time, and the formula further uses a preset first weight coefficient, a preset second weight coefficient, and a preset third weight coefficient [symbols and the constraint they satisfy not reproduced in the source text], the correctness score being 1 for a correct answer and 0 for an incorrect answer; synchronizing the individual historical score data of each student via local area network communication, generating the comparison result of individual historical scores, and displaying it visually on the display interface of the AI training machine; when an answer is incorrect, recording the subject knowledge point identifier corresponding to the subject answering task and adding it to the corresponding student's incorrect knowledge point set; and counting the error frequency of each subject knowledge point identifier in the incorrect knowledge point set, and marking the subject knowledge point identifiers whose error frequency exceeds a preset weak threshold as weak knowledge point data.
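The grading and weak knowledge point steps of claim 9 can be sketched as below. The scoring formula itself is not reproduced here because it is not recoverable from the text; only the 0/1 correctness score and the error-frequency count are shown. The case-insensitive, whitespace-trimmed comparison in `grade_answer` and both function names are assumptions made for illustration.

```python
from collections import Counter

def grade_answer(answer, standard_answer):
    """Correctness score: 1 for a correct answer, 0 for an incorrect one.

    Assumption: answers are compared as text, ignoring case and outer whitespace.
    """
    return 1 if answer.strip().lower() == standard_answer.strip().lower() else 0

def weak_knowledge_points(incorrect_ids, weak_threshold):
    """Knowledge point identifiers whose error frequency exceeds the weak threshold.

    incorrect_ids: the student's incorrect knowledge point set, stored here as a
        list with one entry per wrong answer so that frequencies can be counted.
    """
    counts = Counter(incorrect_ids)
    return {kp: n for kp, n in counts.items() if n > weak_threshold}
```

A student with three wrong answers on one knowledge point and one on another, under a weak threshold of 2, would have only the first marked as weak.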
10. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 9, characterized in that feeding back the comparison result of individual historical scores, the action deviation data, and the weak knowledge point data to the linkage matching model to dynamically adjust the action trigger difficulty and the subject question adaptation level includes: mapping the improvement rate in the comparison result of individual historical scores, together with whether the individual's best historical record was reached, to an ability coefficient; normalizing the comprehensive deviation value of each key point in the action deviation data to obtain an action standardization coefficient, the value of the action standardization coefficient being equal to 1 minus the normalized comprehensive deviation value; normalizing the error frequency corresponding to each subject knowledge point identifier in the weak knowledge point data to obtain a knowledge point weakness vector; calculating the action trigger difficulty adjustment coefficient according to a preset formula [formula not reproduced in the source text] and feeding it back to the effective matching threshold in the linkage matching model, where the formula uses a preset basic trigger difficulty, p represents the ability coefficient, d represents the action standardization coefficient, and the formula further uses a preset first adjustment weight and a preset second adjustment weight [symbols and the constraint they satisfy not reproduced in the source text]; and adjusting the subject question adaptation level based on the knowledge point weakness vector and feeding it back to the subject answering task generation process in the linkage matching model.
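Two of the normalization steps in claim 10 are simple enough to illustrate. The sketch below assumes the deviation is normalized by dividing by a caller-supplied maximum and clipping to [0, 1], and that error frequencies are normalized to sum to 1; the claim does not state which normalization is used, so both choices and both function names are assumptions. The difficulty adjustment formula is not reproduced in the text and is therefore not implemented.

```python
def standardization_coefficient(deviation, max_deviation):
    """Action standardization coefficient: 1 minus the normalized deviation.

    Assumption: normalization divides by max_deviation (> 0) and clips to [0, 1].
    """
    normalized = min(max(deviation / max_deviation, 0.0), 1.0)
    return 1.0 - normalized

def weakness_vector(error_counts):
    """Knowledge point weakness vector from error frequencies.

    Assumption: each frequency is divided by the total so the entries sum to 1.
    """
    total = sum(error_counts.values())
    if total == 0:
        return {kp: 0.0 for kp in error_counts}
    return {kp: n / total for kp, n in error_counts.items()}
```

Under these assumptions, a deviation of 0.25 against a maximum of 1.0 gives a coefficient of 0.75, and error counts of 3 and 1 give weakness entries of 0.75 and 0.25.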
11. The interdisciplinary sports interaction method based on multimodal alignment as described in claim 10, characterized in that, when the interdisciplinary sports interaction method is executed iteratively, the action deviation data and the weak knowledge point data generated in the previous iteration are used as inputs to the linkage matching model in the next iteration; the linkage matching model adjusts the allowable deviation range of each key point in the preset standard action template according to the action deviation data, and adjusts the binding relationship between questions and subject knowledge point identifiers in the subject answer database according to the weak knowledge point data.
12. A sports interdisciplinary interaction system based on multimodal alignment, used to implement the sports interdisciplinary interaction method based on multimodal alignment as described in any one of claims 1-11, characterized in that the system includes: a feature extraction module, used to collect real-time RGB video streams of human body movements during students' physical exercise, extract the skeletal key point coordinate sequence frame by frame, generate action temporal feature data, and calculate the action completion degree; a deviation detection module, used to segment and normalize the action temporal feature data to obtain standardized action segment data and compare it with the preset standard action template to generate action deviation data; a linkage matching module, used to construct a multimodal temporal alignment algorithm that maps the standardized action segment data to the embedding space and outputs action embedding vectors, converts subject knowledge point text into text embedding vectors, establishes a linkage matching model between action segments and knowledge points, and forms a matching mapping table; an action-triggered push module, used to receive standardized action segment data in real time, calculate the cosine similarity between its action embedding vector and each action embedding vector in the linkage matching model, generate effective action matching results, select questions suited to the student's answer level from the subject answer database based on the bound subject knowledge point identifier, push subject answering tasks through the AI training machine, and generate action-triggered answer instructions; and a scoring calculation module, used to collect students' touchscreen or voice answer results, calculate individual real-time score data by combining the action completion degree, answer correctness, and answering time, synchronously generate and visually display the comparison result of individual historical scores, and mark weak knowledge point data.