A training robot system, device and computer program product based on a multi-modal large model
A training and testing robot system based on a multimodal large model, combined with multi-sensor fusion and edge computing, addresses real-time image processing and intelligent decision-making for multimodal systems in environments without network or power. It enables efficient, accurate training and testing and supports multi-person training, improving the system's robustness and user experience.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: GUANGZHOU AEBELL ELECTRICAL TECH
- Filing Date: 2025-07-01
- Publication Date: 2026-06-23
AI Technical Summary
Multimodal systems face challenges such as high demands for data integration and computing resources, poor synchronization of information across different modalities, and a decline in user experience. In particular, they struggle to achieve real-time image processing and intelligent decision-making in environments without network or power.
The training and testing robot system adopts a multimodal large model, which combines machine vision, multi-sensor fusion, edge computing, adaptive visual analysis and intelligent voice interaction technologies to achieve real-time posture detection, motion capture and intelligent decision-making. It supports real-time image processing in environments without network or power, and improves the system's robustness and accuracy through cross-modal collaborative learning.
It enables efficient and accurate training and assessment in environments without internet or electricity, supports training and assessment for multiple people on the same or different subjects, eliminates subjective errors in manual assessment, and improves training efficiency and user experience.
Figure CN120876176B_ABST
Description
Technical Field
[0001] This invention belongs to the field of next-generation information technology, and in particular relates to a training and testing robot system, equipment and computer program product based on a multimodal large model.
Background Technology
[0002] Multimodality is a comprehensive concept encompassing multiple fields. It refers to the transmission, processing, and understanding of information through various means or channels within the same system or environment. These diverse means can include multiple senses or data types, such as vision, hearing, touch, and language. In a multimodal context, the various modalities do not exist in isolation but are interconnected and complementary, collectively forming a rich information interaction network. For example, in an intelligent education system, students can learn knowledge through multiple methods, such as watching videos (visual modality), listening to explanations (auditory modality), and hands-on activities (tactile modality). These different modalities work together to improve the efficiency and effectiveness of learning.
[0003] Multimodal applications are extremely broad, encompassing fields ranging from artificial intelligence to education, healthcare, and entertainment. In artificial intelligence, multimodal technology enables machines to perceive the world through multiple senses, much like humans, thus more accurately understanding complex environments and tasks. For example, self-driving cars not only rely on visual sensors to identify roads and obstacles but also combine data from other sensors such as radar and lidar to obtain more comprehensive environmental information. In healthcare, multimodal imaging technology combines various imaging modalities, such as magnetic resonance imaging (MRI) and positron emission tomography (PET), providing doctors with more comprehensive diagnostic information. In education, multimodal teaching methods, by combining various teaching resources such as text, images, and videos, meet the needs of students with different learning styles, improving the personalization and effectiveness of teaching.
[0004] The advantage of multimodal systems lies in their ability to integrate multiple information sources, thereby improving the accuracy and richness of information. By combining data from different modalities, the system can better handle complex situations and reduce the errors and uncertainties that may arise from a single modality. For example, in speech recognition systems, combining visual information (such as lip reading) can significantly improve recognition accuracy. However, multimodal systems also face many challenges. First, data from different modalities often have different formats and characteristics, and effectively integrating this data is a technical challenge. Second, multimodal systems need to process large amounts of data, which places higher demands on computing resources and storage capacity. Furthermore, ensuring information synchronization and consistency between different modalities is also a problem that needs to be solved. For example, in video conferencing, synchronization issues between voice and video can lead to a degraded user experience.
[0005] With continuous technological advancements, multimodal systems are evolving towards greater intelligence and automation. In the field of artificial intelligence, the application of technologies such as deep learning and reinforcement learning enables multimodal systems to automatically learn and adapt to different environments and tasks. For example, through joint training with multimodal data, machine learning models can better understand human language and behavioral patterns. In the Internet of Things (IoT) field, the deployment of multimodal sensor networks allows devices to perceive their environment more comprehensively, thereby achieving more intelligent automated control. In the future, multimodal technology is expected to be applied in more fields, bringing greater convenience and innovation to people's lives and work. Simultaneously, as the technology matures, multimodal systems will place greater emphasis on user experience and security to meet the needs of diverse users.
[0006] Large models are a rapidly emerging concept in the field of artificial intelligence in recent years, referring to machine learning models with a massive number of parameters. These models typically contain billions or even hundreds of billions of parameters and are capable of processing and understanding vast amounts of data. The core advantage of large models lies in their powerful representational and generalization abilities; they can automatically extract complex patterns and relationships from massive amounts of data through learning. For example, large models in natural language processing, such as the GPT series, can generate fluent text, answer various questions, and perform multiple tasks such as language translation. This capability stems from the large models' learning from large amounts of text data, enabling them to understand and generate natural language. The emergence of large models marks a significant step forward for artificial intelligence in handling complex tasks and large-scale data, laying the foundation for more intelligent machine learning systems.
[0007] Large-scale models have a wide range of applications, demonstrating enormous potential in fields such as natural language processing, computer vision, and speech recognition. In natural language processing, large-scale models like GPT and BERT can generate high-quality text for tasks such as language translation, sentiment analysis, and question answering. By learning from large amounts of text data, these models can understand and generate natural language, providing powerful support for various language-related applications. In computer vision, large-scale models can process and understand image and video data for tasks such as object recognition, image classification, and video analysis. In speech recognition, large-scale models can convert speech signals into text, enabling voice control and voice interaction. These applications not only improve system performance but also bring users a more convenient and intelligent experience.
[0008] This invention proposes a training and assessment robot system based on a multimodal large model. Its fundamental goal is to accelerate the transformation and upgrading of training, and to generate and improve responsiveness. Drawing on the team's regular training work, and built on training and assessment robots, sensor acquisition equipment, and a big data platform, the system comprises: a multimodal data acquisition layer that uses edge computing algorithms to achieve real-time posture detection and motion capture; a data processing and analysis layer that uses computer vision algorithms, including low-light compensation and motion blur correction, to perform real-time image processing through localized edge computing in environments without network or power; an intelligent decision-making and control layer that establishes a training subject rewriting decision mechanism to complete intelligent decision-making, evaluation, and error correction; and a human-computer interaction layer. While performing training subject transposition (switching) evaluation, the system uses cross-modal collaborative learning middleware to keep the modal representations consistent: vision compensates for inertial drift, inertial data maintains tracking during visual occlusion, and physiological data verifies motion quality, which improves system robustness and accuracy. It also supports training and assessment of multiple people on the same subject or on different subjects, significantly improving the efficiency of training and assessment.
Summary of the Invention
[0009] This invention aims to provide a training and assessment robot system based on a multimodal large model, which is superior to existing technologies. The system consists of three parts: a training and assessment robot, sensor acquisition equipment, and a big data platform. The training and assessment robot is developed for scenarios such as physical training, assessment, and competition. It is highly integrated: a single robot can intelligently assess eight general training subjects, including pull-ups, dips, push-ups, sit-ups, crunches, parallel bar dips, the 30m x 2 serpentine run, and the 3000m run. The core technologies employed are as follows:
1) Multimodal perception fusion technology: A machine vision and multi-sensor fusion scheme combines cameras with edge computing algorithms to achieve real-time posture detection and motion capture. For example, 3D skeletal key point recognition can judge whether movements meet the standard, such as the shoulder angle in pull-ups and the torso curvature in push-ups, with an accuracy of over 99%. The scheme also integrates tactile sensors and an inertial measurement unit (IMU) to help judge the contact force of actions and the body's balance.
2) Adaptive visual analysis algorithm: The independently developed computer vision algorithm adapts to complex environments through low-light compensation and motion blur correction. Even in environments without network or power, it can complete real-time image processing through localized edge computing, removing the strong network dependence of traditional devices. This technology can identify the eight training subjects, covering intelligent evaluation from basic physical fitness to tactical actions.
3) Intelligent voice interaction system: Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) are combined to respond to voice commands in multiple scenarios. The system supports adaptive dialect recognition and matching against a tactical terminology database, and broadcasts action error prompts in real time during training (such as "both elbows did not touch the knees, this action is invalid"). It also uses sound source localization to make interaction more reliable in noisy environments, and uses a large speech model to correct recognition results, ensuring the accuracy of interaction.
[0010] 4) Fully automated assessment system: Pioneering an AI referee decision-making model, it establishes a database of action standards through big data analysis and dynamically optimizes assessment thresholds using reinforcement learning. The system can automatically generate training records and performance curve analysis reports, and supports video playback arbitration, eliminating subjective errors inherent in manual assessment.
[0011] 5) Adaptive lifting mechanical structure: The independently developed automatic lifting device, linked with a precision motor and sensors, uses AI algorithms to automatically adjust the camera height and angle according to the needs of different subjects (such as the 30m x 2 serpentine run or push-ups) and the characteristics of the trainees, achieving full-scene coverage. An anti-shake algorithm is employed to ensure image stability during movement.
[0012] 6) Gesture recognition interaction system: Combines a shallow CNN with a statistical segmentation algorithm to achieve high-precision gesture interaction. Trainees can start, pause, and stop the algorithm by making specific gestures in the video area.
[0013] To achieve the above objectives, the technical solution of the present invention is as follows:
[0014] A training and testing robot system based on a multimodal large model, comprising a training and testing robot, sensor acquisition equipment, and a big data platform, and consisting of the following system functional layers:
[0015] The multimodal data acquisition layer uses machine vision and multi-sensor fusion to achieve real-time posture detection and motion capture with edge computing algorithms;
[0016] The data processing and analysis layer uses computer vision algorithms that include low-light compensation and motion blur correction to perform real-time image processing through localized edge computing in environments without network or power;
[0017] The intelligent decision-making and control layer establishes a training subject rewriting decision mechanism, rewrites the assessment decision according to the training subject transposition (switching) evaluation strategy, and completes intelligent decision-making, evaluation, and error correction;
[0018] The human-computer interaction layer is based on multimodal feedback to achieve human-computer interaction.
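The specification does not say which low-light compensation method the data processing and analysis layer uses. As a minimal illustration only, the sketch below applies gamma correction, a common brightening technique chosen here as an assumption, to an 8-bit grayscale frame held as a list of rows. The function name `gamma_correct` and the sample frame are hypothetical.

```python
def gamma_correct(image, gamma=0.5):
    """Brighten an 8-bit grayscale image (list of rows) with a power-law curve.

    gamma < 1 lifts dark pixels more than bright ones; gamma = 1 is a no-op.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Precompute a 256-entry lookup table so each pixel costs one index.
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[p] for p in row] for row in image]


# Hypothetical dark frame: two rows of three pixels.
dark_frame = [[0, 16, 64], [128, 200, 255]]
bright_frame = gamma_correct(dark_frame, gamma=0.5)
```

A lookup table keeps the per-pixel cost low, which suits the edge-computing setting described above, although a real device would likely use an optimized image library rather than plain lists.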
[0019] Preferably, the system further includes a cross-modal collaborative learning middleware, which is attached externally to the data processing and analysis layer to achieve a unified representation of visual, inertial, and physiological data. It uses vision to compensate for inertial drift, inertial data to maintain tracking during visual occlusion, and physiological data to verify motion quality, thereby improving the robustness and accuracy of the system.
[0020] Preferably, the system further includes an intelligent evaluation subsystem, which automatically generates a detailed evaluation report after training, the report being used for:
[0021] 1) Motion breakdown scoring: Quantitatively evaluate the quality of each stage of the motion;
[0022] 2) Progress trajectory analysis: Compare with historical training data to show the progress;
[0023] 3) Weakness diagnosis: Identify the key actions that need improvement;
[0024] 4) Training plan recommendations: Based on the current level, recommend key training areas for future progress;
[0025] Furthermore, the report presents key action frames and sensor data curves in a graphical format, facilitating post-training review.
[0026] Preferably, the multimodal data acquisition layer includes at least:
[0027] The visual data acquisition submodule is equipped with a high-precision, automatically adjustable camera array that adjusts its height and angle according to the trainee's height and the training scene to ensure all-around capture of training movements. The camera array uses: an RGB camera to capture color images and identify the trainee's appearance features and the training environment; a depth camera to acquire scene depth information; and an infrared camera to operate normally in low-light environments and assist in identification.
[0028] The camera system uses an automatic lifting mechanism and a gimbal control system to track the trainee's position in real time and maintain the best shooting angle.
[0029] The wearable device data acquisition submodule, which uses wearable devices as the out-of-band data source of the system, includes: an inertial measurement unit (IMU) containing an accelerometer, gyroscope, and magnetometer to accurately measure limb movement trajectory and posture; a surface electromyography (sEMG) sensor to monitor muscle activity and force exertion; a heart rate sensor to monitor the trainee's physiological load; and a pressure sensor integrated into the insole or handle to measure force distribution.
[0030] The aforementioned sensors communicate with the main system in real time via Bluetooth or proprietary protocols, with a sampling frequency of over 100Hz to ensure the accuracy of motion data;
[0031] The environmental perception submodule integrates various environmental sensors, including a force plate (a ground reaction force measurement platform), an ambient temperature and humidity sensor, a sound acquisition device, and an air quality monitoring sensor, which together construct a digital twin of the training scenario and provide data support for subsequent analysis.
[0032] Preferably, the data processing and analysis layer includes at least:
[0033] The multimodal data synchronization and fusion submodule employs a unified timestamp mechanism to ensure strict synchronization of data from different sources, with errors controlled within milliseconds. It achieves: spatiotemporal alignment: mapping visual, inertial, and physiological data to a unified spatiotemporal coordinate system; data completion: when data from one modality is missing, it uses data from other modalities for inference; and conflict resolution: when data from different modalities contradict each other, it performs weighted fusion based on confidence levels.
[0034] The fused data forms a digital motion twin of the trainee, including skeletal posture, muscle activity, and physiological state information;
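The synchronization and fusion steps above (timestamp alignment, data completion, and confidence-weighted conflict resolution) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the nearest-sample alignment rule, the 5 ms tolerance, and the confidence values are all hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t, max_gap):
    """Return the value whose timestamp is closest to t.

    timestamps must be sorted ascending. Returns None when the closest
    sample is farther than max_gap seconds away (treated as missing data).
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > max_gap:
        return None
    return values[best]


def fuse(estimates):
    """Confidence-weighted mean of (value, confidence) pairs.

    Pairs whose value is None are skipped, so the remaining modalities
    fill in for a missing one. Returns None if nothing is usable.
    """
    usable = [(v, c) for v, c in estimates if v is not None and c > 0]
    if not usable:
        return None
    total = sum(c for _, c in usable)
    return sum(v * c for v, c in usable) / total


# Hypothetical elbow-angle estimates (degrees) on a shared clock (seconds).
vision_t = [0.000, 0.033, 0.066]
vision_angle = [90.0, 92.0, 95.0]
imu_t = [0.000, 0.010, 0.020, 0.030, 0.040]
imu_angle = [89.0, 89.5, 90.5, 91.0, 92.5]

t_query = 0.031
v = nearest_sample(vision_t, vision_angle, t_query, max_gap=0.005)
m = nearest_sample(imu_t, imu_angle, t_query, max_gap=0.005)
fused = fuse([(v, 0.75), (m, 0.25)])          # both modalities present
imu_only = fuse([(None, 0.75), (m, 0.25)])    # vision missing, IMU fills in
```

Nearest-sample lookup is the simplest alignment rule; interpolating between neighbouring samples would be a natural refinement when the modalities run at very different rates.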
[0035] The motion analysis submodule based on the multimodal large model adopts a CLIP-like multimodal large model architecture to achieve unified representation learning across modalities. The motion analysis submodule based on the multimodal large model includes: a visual encoder: based on Vision Transformer, extracting spatiotemporal features from video streams; a sensor data encoder: processing time series data such as IMU and sEMG; and a text encoder: parsing training subject descriptions and scoring criteria.
[0036] The encoders described above map data from the different modalities into a shared feature space, enabling the system to understand the semantic relationships between visual actions, sensor data, and scoring criteria.
[0037] The skeletal point recognition and motion reconstruction submodule employs an improved OpenPose algorithm to detect key points of the human body in real time, and combines inertial data to achieve high-precision motion reconstruction. Specifically, the skeletal point recognition and motion reconstruction submodule implements: 2D key point detection: identifying joint positions from RGB images; 3D pose estimation: fusing multi-view images and depth data to reconstruct three-dimensional skeletons; kinematic completion: using biomechanical models and kinematic constraints to correct detection errors; and dynamic analysis: combining force plate data and electromyographic signals to calculate joint torques and power.
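Once a three-dimensional skeleton has been reconstructed, a movement standard such as an elbow or shoulder angle reduces to the angle between two limb segments. The sketch below shows that calculation for three key points; the function name and the sample coordinates (in metres) are illustrative assumptions, and the key point detection itself is outside its scope.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the 3D points a-b-c."""
    u = [a[i] - b[i] for i in range(3)]  # segment from the joint to a
    v = [c[i] - b[i] for i in range(3)]  # segment from the joint to c
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        raise ValueError("degenerate limb segment")
    cos_t = sum(x * y for x, y in zip(u, v)) / (nu * nv)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_t = max(-1.0, min(1.0, cos_t))
    return math.degrees(math.acos(cos_t))


# Hypothetical key points: upper arm horizontal, forearm vertical.
shoulder, elbow, wrist = (0.0, 1.4, 0.0), (0.3, 1.4, 0.0), (0.3, 1.7, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Comparing such angles against a per-subject threshold is one plausible way to turn the reconstructed skeleton into a pass or fail judgment for a repetition.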
[0038] Preferably, the intelligent decision-making and control layer includes at least:
[0039] The Training Subject Rewriting Decision Submodule configures and executes training subject rewriting decisions.
[0040] The automatic training subject identification and adaptation submodule employs prompt engineering to understand the requirements of different training subjects through natural language descriptions. This submodule is used to achieve: subject feature extraction: encoding subject descriptions into feature vectors; scene understanding: analyzing the environmental equipment layout and trainee equipment; action pattern matching: comparing real-time actions with subject standards in a multimodal feature space; and automatically adjusting evaluation criteria and feedback strategies when subject switching is detected.
[0041] The real-time action evaluation and error correction submodule is used to achieve zero-shot action evaluation based on a pre-trained multimodal contrastive learning model. The real-time action evaluation and error correction submodule includes at least the following functions: constructing positive and negative sample prompts; similarity calculation: comparing the cosine similarity between real-time action features and standard action features; error diagnosis: locating specific problem areas through feature differences; and correction suggestion generation: automatically generating personalized improvement suggestions based on a language model.
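The similarity step of zero-shot evaluation can be illustrated with toy vectors. The sketch below assumes the encoders have already produced an embedding for the observed action and for one positive ("standard") and one negative ("faulty") prompt; it then turns the two cosine similarities into a probability with a softmax, in the style of contrastive models. The function names, the temperature value, and the three-dimensional vectors are hypothetical stand-ins, not values from the specification.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0 or nv == 0:
        raise ValueError("zero-length embedding")
    return dot / (nu * nv)


def zero_shot_score(action_vec, positive_vec, negative_vec, temperature=0.1):
    """Probability that the action matches the positive prompt.

    Softmax over the two temperature-scaled cosine similarities.
    """
    s_pos = cosine(action_vec, positive_vec) / temperature
    s_neg = cosine(action_vec, negative_vec) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(s_pos - m), math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings standing in for encoder outputs.
standard = [1.0, 0.0, 0.0]
faulty = [0.0, 1.0, 0.0]
observed = [0.9, 0.1, 0.0]
p_correct = zero_shot_score(observed, standard, faulty)
```

Error diagnosis and suggestion generation would build on the same embeddings, for example by inspecting which feature dimensions differ most, but those steps are not sketched here.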
[0042] The adaptive control strategy submodule is used to optimize the control strategy using a reinforcement learning framework: simulation training: multiple virtual training sessions are conducted in a digital twin environment; policy transfer: the simulation strategy is transferred to the physical system for fine-tuning; online learning: the strategy is continuously optimized based on trainee feedback, so that the system can adapt to different trainees' body types, habits and ability levels, providing personalized assistance.
[0043] Preferably, the human-computer interaction layer includes at least:
[0044] A multimodal feedback system is used to provide real-time feedback through multiple channels: visual feedback: a virtual coach demonstrating correct movements on AR glasses or a display screen; auditory feedback: voice guidance or prompts; tactile feedback: vibration of a wristband to indicate movement deviations; force feedback: for exoskeleton integrated systems, it can directly guide limb movements to ensure that trainees receive effective guidance in different scenarios.
[0045] Preferably, the visual compensation for inertial drift specifically includes: when the IMU data deviates due to integration, the visual data provides an absolute reference;
[0046] The inertial-assisted visual occlusion specifically includes: when a limb is occluded, inertial data maintains motion tracking;
[0047] The physiological data verification of exercise quality specifically includes: monitoring electromyography and heart rate, and confirming exercise quality by checking these signals against the observed movements.
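The first two mechanisms above (vision as an absolute reference against inertial drift, and inertial data carrying the estimate through occlusion) can be sketched with a complementary filter. The specification does not name a filter, so this choice, the blending weight `alpha`, and the simulated gyroscope bias are assumptions made for illustration; physiological verification is not covered by this sketch.

```python
def complementary_filter(gyro_rates, vision_angles, dt, alpha=0.98, initial=0.0):
    """Fuse gyroscope rates (deg/s) with absolute vision angles (deg).

    vision_angles[i] is None when the limb is occluded; in that case the
    estimate is propagated by inertial integration alone.
    """
    if len(gyro_rates) != len(vision_angles):
        raise ValueError("inputs must have the same length")
    angle = initial
    history = []
    for rate, seen in zip(gyro_rates, vision_angles):
        predicted = angle + rate * dt  # inertial integration step
        if seen is None:
            angle = predicted          # occluded: trust the IMU
        else:
            # Visible: pull the integrated estimate toward the vision value.
            angle = alpha * predicted + (1 - alpha) * seen
        history.append(angle)
    return history


# Simulated stationary limb (true angle 0) with a 1 deg/s gyroscope bias.
steps, dt = 1000, 0.01
biased_gyro = [1.0] * steps
drift_only = complementary_filter(biased_gyro, [None] * steps, dt)
corrected = complementary_filter(biased_gyro, [0.0] * steps, dt)
```

In this simulation the uncorrected estimate drifts by about 10 degrees over 10 seconds, while the vision-corrected estimate stays below half a degree, which is the qualitative behaviour the paragraphs above describe.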
[0048] Meanwhile, the present invention also proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the corresponding functions of the training and testing robot system based on a multimodal large model as described above.
[0049] At the same time, the present invention also proposes a computer program product, which includes computer instructions that, when executed by a processor, perform the corresponding functions of the training and testing robot system based on a multimodal large model as described above.
[0050] This invention proposes a training and assessment robot system based on a multimodal large model. Its fundamental goal is to accelerate the transformation and upgrading of training, and to generate and improve responsiveness. Drawing on the team's regular training work, and built on training and assessment robots, sensor acquisition equipment, and a big data platform, the system comprises: a multimodal data acquisition layer that uses edge computing algorithms to achieve real-time posture detection and motion capture; a data processing and analysis layer that uses computer vision algorithms, including low-light compensation and motion blur correction, to perform real-time image processing through localized edge computing in environments without network or power; an intelligent decision-making and control layer that establishes a training subject rewriting decision mechanism to complete intelligent decision-making, evaluation, and error correction; and a human-computer interaction layer. While performing training subject transposition (switching) evaluation, the system uses cross-modal collaborative learning middleware to keep the modal representations consistent: vision compensates for inertial drift, inertial data maintains tracking during visual occlusion, and physiological data verifies motion quality, which improves system robustness and accuracy. It also supports training and assessment of multiple people on the same subject or on different subjects, significantly improving the efficiency of training and assessment.
Attached Figure Description
[0051] Figure 1 is a basic example diagram of the training and testing robot system based on a multimodal large model claimed in this invention;
[0052] Figure 2 is a basic example diagram of the multimodal data acquisition layer in the training and testing robot system based on a multimodal large model claimed in this invention;
[0053] Figure 3 is an example diagram of the data processing and analysis layer in the training and testing robot system based on a multimodal large model claimed in this invention;
[0054] Figure 4 shows one embodiment of the intelligent decision-making and control layer of the training and testing robot system based on a multimodal large model claimed in this invention;
[0055] Figure 5 shows yet another specific embodiment of the training and testing robot system based on a multimodal large model claimed in this invention.
Detailed Implementation
[0056] The following describes in detail several embodiments and beneficial effects of the multimodal large model-based training and testing robot system and method claimed in this invention, in order to facilitate a more detailed examination and breakdown of this invention.
[0057] To better understand the technical solution of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
[0058] It should be understood that the described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0059] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a," "said," and "the" as used in the embodiments of this invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.
[0060] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0061] It should be understood that although terms such as "first," "second," etc., may be used to describe methods and corresponding apparatus in embodiments of the present invention, the methods and apparatus should not be limited by these terms. These terms are only used to distinguish one item from another. For example, without departing from the scope of embodiments of the present invention, "first component," "first module," etc., may also be referred to as "second component," "second module," etc., and "second component," "second module," etc., may also be referred to as "first component," "first module."
[0062] Depending on the context, the word "if" as used here can be interpreted as "at the time of," "when," "in response to determination," or "in response to detection." Similarly, depending on the context, the phrase "if determination" or "if detection (of the stated condition or event)" can be interpreted as "when determination," "in response to determination," "when detection (of the stated condition or event)," or "in response to detection (of the stated condition or event)."
[0063] Figures 1 to 5 of the specification show a basic example of the training and testing robot system based on a multimodal large model and its internal functional layers, as claimed in this invention. As a preferred embodiment that can be superimposed, each node or module can interconnect with other nodes or modules for data and command transmission. As another preferred embodiment that can be superimposed, some nodes may have no interconnection with certain other nodes, or their interconnection with other nodes may be disabled or enabled.
[0064] Figures 1 to 5 of the specification show one embodiment of the training and testing robot system based on a multimodal large model claimed in this invention, along with its specific internal components and interconnections. The system includes:
[0065] The system comprises three components: a training and assessment robot, sensor acquisition equipment, and a big data platform, and consists of the following system functional layers:
[0066] The multimodal data acquisition layer uses machine vision and multi-sensor fusion to achieve real-time posture detection and motion capture with edge computing algorithms;
[0067] The data processing and analysis layer uses computer vision algorithms that include low-light compensation and motion blur correction to perform real-time image processing through localized edge computing in environments without network or power;
[0068] The intelligent decision-making and control layer establishes a training subject rewriting decision mechanism, rewrites the assessment decision according to the training subject transposition (switching) evaluation strategy, and completes intelligent decision-making, evaluation, and error correction;
[0069] As a preferred, superimposed embodiment, the establishment of a training subject rewriting decision mechanism, based on the transposition evaluation strategy of the training subject, specifically involves: the intelligent decision and control layer setting up a training subject transposition monitoring module to record the transposition actions and transposition time before and after the training subject switch. When a subject switch is detected, the transposition action and transposition time are recorded. On the one hand, the training assessment decision model during the transposition switching period is used to replace the training assessment decision strategy of the project before transposition; on the other hand, a timer is used to record the real-time duration of the transposition switch. When the real-time duration of the transposition switch reaches the transposition time, the training assessment decision strategy of the project after transposition is used to replace the training assessment decision model during the transposition switching period. This achieves the connection between the transposition period assessment and the transposition of the project before and after, improving the segmented real-time evaluation and adaptive control effect of the training and assessment robot system based on a multimodal large model.
[0070] The transition period training assessment decision model is a system-preset assessment decision model during the project transition. The transition period training assessment decision model can be a low assessment index model to match the low training volume during the transition period, or it can be a model with pre-set assessment indexes, or an assessment index model with some assessment items being empty.
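The strategy hand-over described in [0069] and [0070] can be illustrated with a minimal sketch. It is not the implementation of the invention: the class name `TransitionMonitor`, the strategy labels, and the use of seconds as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransitionMonitor:
    """Selects the active assessment strategy around a subject switch (hypothetical)."""
    active_subject: str
    transition_time_s: float                 # preset transition time, assumed in seconds
    pending_subject: Optional[str] = None
    switch_started_at: Optional[float] = None

    def on_switch_detected(self, new_subject: str, now_s: float) -> None:
        # Record the switch; the transition-period model takes over from here.
        self.pending_subject = new_subject
        self.switch_started_at = now_s

    def current_strategy(self, now_s: float) -> str:
        if self.pending_subject is None:
            return f"strategy:{self.active_subject}"
        elapsed = now_s - self.switch_started_at
        if elapsed >= self.transition_time_s:
            # Timer reached the transition time: hand over to the new subject.
            self.active_subject = self.pending_subject
            self.pending_subject = None
            self.switch_started_at = None
            return f"strategy:{self.active_subject}"
        # Still inside the transition window: use the transition-period model.
        return "strategy:transition-period"
```

In this sketch the caller supplies the current time, so the timer logic stays deterministic; a deployed system would presumably read it from the unified timestamp source.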
[0071] The human-computer interaction layer achieves human-computer interaction based on multimodal feedback.
[0072] As another preferred embodiment that can be superimposed, the system also includes a cross-modal collaborative learning middleware. The cross-modal collaborative learning middleware is attached externally to the data processing and analysis layer to achieve a unified representation of visual, inertial, and physiological data. It uses visual data to compensate for inertial drift, inertial data to assist under visual occlusion, and physiological data to verify motion quality, thereby improving the robustness and accuracy of the system.
[0073] As another preferred embodiment that can be superimposed, the system further includes an intelligent evaluation subsystem, which automatically generates a detailed evaluation report after training, the report being used for:
[0074] 1) Motion breakdown scoring: Quantitatively evaluate the quality of each stage of the motion;
[0075] 2) Progress trajectory analysis: Compare with historical training data to show the progress;
[0076] 3) Weakness diagnosis: Identify the key actions that need improvement;
[0077] 4) Training plan recommendations: Based on the current level, recommend key training areas for future progress;
[0078] Furthermore, the report presents key action frames and sensor data curves in graphical form, facilitating post-training review.
[0079] As another preferred embodiment that can be superimposed, the multimodal data acquisition layer includes at least:
[0080] The visual data acquisition submodule is equipped with a high-precision, automatically adjustable camera array that adjusts its height and angle according to the trainee's height and the training scene to ensure all-around capture of training movements. The camera array uses: an RGB camera to capture color images and identify the trainee's appearance features and the training environment; a depth camera to acquire scene depth information; and an infrared camera to operate normally in low-light environments and assist in identification.
[0081] The camera system uses an automatic lifting mechanism and a gimbal control system to track the trainee's position in real time and maintain the best shooting angle.
[0082] The wearable device data acquisition submodule, which uses wearable devices as the out-of-band data source of the system, includes: an inertial measurement unit (IMU) containing an accelerometer, gyroscope, and magnetometer to accurately measure limb movement trajectory and posture; a surface electromyography (sEMG) sensor to monitor muscle activity and force exertion; a heart rate sensor to monitor the trainee's physiological load; and a pressure sensor integrated into the insole or handle to measure force distribution.
[0083] The aforementioned sensors communicate with the main system in real time via Bluetooth or proprietary protocols, with a sampling frequency of over 100 Hz to ensure the accuracy of motion data;
[0084] The environmental perception submodule integrates various environmental sensors, including: a force plate for measuring ground reaction force, an environmental temperature and humidity sensor, a sound acquisition device, and an air quality monitoring sensor, to jointly construct a digital twin of the training scenario and provide data support for subsequent analysis.
[0085] As another preferred embodiment that can be superimposed, the data processing and analysis layer includes at least:
[0086] The multimodal data synchronization and fusion submodule employs a unified timestamp mechanism to ensure strict synchronization of data from different sources, with errors controlled within milliseconds. It achieves: spatiotemporal alignment: mapping visual, inertial, and physiological data to a unified spatiotemporal coordinate system; data completion: when data from one modality is missing, it uses data from other modalities for inference; and conflict resolution: when data from different modalities contradict each other, it performs weighted fusion based on confidence levels.
[0087] The fused data forms a digital motion twin of the trainee, including skeletal posture, muscle activity, and physiological state information;
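The data completion and conflict resolution steps of [0086] can be sketched as a confidence-weighted average. This is a simplified illustration under stated assumptions: each modality is assumed to report a scalar estimate (here a joint angle in degrees) together with a confidence value, and the function name `fuse_joint_angle` is hypothetical. The source does not specify the actual fusion rule beyond weighting by confidence.

```python
from typing import Dict, Optional, Tuple

# Each modality reports (estimate, confidence), or None when its data is missing.
Reading = Optional[Tuple[float, float]]


def fuse_joint_angle(readings: Dict[str, Reading]) -> Optional[float]:
    """Confidence-weighted average over the modalities that reported a value."""
    available = [r for r in readings.values() if r is not None and r[1] > 0.0]
    if not available:
        return None  # no modality available, nothing to infer from
    total_conf = sum(conf for _, conf in available)
    # Conflict resolution: contradictory estimates are blended by confidence.
    # Data completion: a missing modality simply drops out of the average.
    return sum(value * conf for value, conf in available) / total_conf
```

For example, `fuse_joint_angle({"vision": None, "imu": (94.0, 0.25)})` falls back to the inertial estimate alone when the visual reading is missing.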
[0088] The motion analysis submodule based on the multimodal large model adopts a CLIP-like multimodal large model architecture to achieve unified representation learning across modalities. The motion analysis submodule based on the multimodal large model includes: a visual encoder: based on Vision Transformer, extracting spatiotemporal features from video streams; a sensor data encoder: processing time series data such as IMU and sEMG; and a text encoder: parsing training subject descriptions and scoring criteria.
[0089] The encoders described above map different modal data to a shared feature space, enabling the system to understand the semantic relationships between visual actions, sensor data, and scoring criteria.
[0090] The skeletal point recognition and motion reconstruction submodule employs an improved OpenPose algorithm to detect key points of the human body in real time, and combines inertial data to achieve high-precision motion reconstruction. Specifically, the skeletal point recognition and motion reconstruction submodule implements: 2D key point detection: identifying joint positions from RGB images; 3D pose estimation: fusing multi-view images and depth data to reconstruct three-dimensional skeletons; kinematic completion: using biomechanical models and kinematic constraints to correct detection errors; and dynamic analysis: combining force plate data and electromyographic signals to calculate joint torques and power.
[0091] As another preferred embodiment that can be superimposed, the intelligent decision-making and control layer includes at least:
[0092] The automatic training subject identification and adaptation submodule employs prompt engineering to understand the requirements of different training subjects through natural language descriptions. This submodule is used to achieve: subject feature extraction: encoding subject descriptions into feature vectors; scene understanding: analyzing the layout of environmental equipment and the trainee's equipment; action pattern matching: comparing real-time actions with subject standards in a multimodal feature space; and automatic adjustment of evaluation criteria and feedback strategies when subject switching is detected.
[0093] The training subject rewriting decision submodule configures and executes training subject rewriting decisions.
[0094] As a preferred embodiment that can be superimposed, the establishment, configuration, and execution of training subject rewriting decisions based on the transition evaluation strategy for training subjects specifically involves the following. The intelligent decision-making and control layer sets up a training subject transition monitoring module to record the transition actions and transition time before and after a training subject switch. When a subject switch is detected, the transition action and transition time are recorded. On the one hand, the transition-period training assessment decision model replaces the training assessment decision strategy of the subject before the transition; on the other hand, a timer records the real-time duration of the transition. When the real-time duration of the transition reaches the transition time, the training assessment decision strategy of the subject after the transition replaces the transition-period training assessment decision model. This achieves seamless integration between the transition-period assessment and the assessments of the subjects before and after the transition, improving the segmented real-time evaluation and adaptive control of the training and assessment robot system based on a multimodal large model.
[0095] The real-time action evaluation and error correction submodule is used to achieve zero-shot action evaluation based on a pre-trained multimodal contrastive learning model. The real-time action evaluation and error correction submodule includes at least the following functions: constructing positive and negative sample prompts; similarity calculation: comparing the cosine similarity between real-time action features and standard action features; error diagnosis: locating specific problem areas through feature differences; and correction suggestion generation: automatically generating personalized improvement suggestions based on a language model.
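The similarity calculation in [0095] can be illustrated as follows. This is a minimal sketch, assuming the action and the positive and negative sample prompts have already been encoded into the shared feature space as plain vectors. The function names, the softmax over two similarities, and the `temperature` value are illustrative assumptions in the style of CLIP-like zero-shot classification, not details given in the source.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_score(
    action: Sequence[float],
    positive_prompt: Sequence[float],
    negative_prompt: Sequence[float],
    temperature: float = 0.1,  # assumed scaling factor
) -> float:
    """Softmax weight of the positive prompt, a value in (0, 1)."""
    s_pos = cosine_similarity(action, positive_prompt) / temperature
    s_neg = cosine_similarity(action, negative_prompt) / temperature
    m = max(s_pos, s_neg)  # subtract the maximum for numerical stability
    e_pos, e_neg = math.exp(s_pos - m), math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)
```

A score near 1 indicates that the real-time action is closer to the standard-action prompt than to the error prompt; a score of 0.5 means the two prompts are equally similar to the action.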
[0096] The adaptive control strategy submodule is used to optimize the control strategy using a reinforcement learning framework: simulation training: multiple virtual training sessions are conducted in a digital twin environment; policy transfer: the simulation strategy is transferred to the physical system for fine-tuning; online learning: the strategy is continuously optimized based on trainee feedback, so that the system can adapt to different trainees' body types, habits and ability levels, providing personalized assistance.
[0097] As another preferred embodiment that can be superimposed, the human-computer interaction layer includes at least:
[0098] A multimodal feedback system is used to provide real-time feedback through multiple channels: visual feedback: a virtual coach demonstrating correct movements on AR glasses or a display screen; auditory feedback: voice guidance or prompts; tactile feedback: vibration of a wristband to indicate movement deviations; force feedback: for exoskeleton integrated systems, it can directly guide limb movements to ensure that trainees receive effective guidance in different scenarios.
[0099] As another preferred embodiment that can be superimposed, the visual compensation for inertial drift specifically includes: when the IMU data deviates due to integration, the visual data provides an absolute reference;
[0100] The inertial-assisted visual occlusion specifically includes: when a limb is occluded, inertial data maintains motion tracking;
[0101] The physiological data verification of exercise quality specifically includes: monitoring electromyography and heart rate, which reflect the actual effect of the movement, and confirming exercise quality on that basis.
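The interplay described in [0099] and [0100] can be illustrated with a complementary filter, a common technique for this kind of fusion. The source does not name the filter actually used, so this is a sketch under assumptions: a single joint angle is tracked, the gyroscope supplies an angular rate, the camera supplies an absolute angle or `None` when the limb is occluded, and `vision_weight` is a hypothetical tuning parameter.

```python
from typing import Iterable, List, Optional, Tuple


def track_angle(
    samples: Iterable[Tuple[float, Optional[float]]],
    dt: float,
    initial_angle: float = 0.0,
    vision_weight: float = 0.1,  # assumed blend factor toward the visual reference
) -> List[float]:
    """samples: (gyro_rate_deg_per_s, vision_angle_deg or None when occluded)."""
    angle = initial_angle
    history = []
    for gyro_rate, vision_angle in samples:
        # Inertial prediction: integrating the rate accumulates drift over time.
        angle += gyro_rate * dt
        if vision_angle is not None:
            # Limb visible: pull the estimate toward the absolute visual reference.
            angle += vision_weight * (vision_angle - angle)
        # Limb occluded: the inertial prediction alone keeps the track alive.
        history.append(angle)
    return history
```

With a constant gyroscope bias and no visual reference, the estimate drifts linearly; with the visual reference present, the error stays bounded.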
[0102] Meanwhile, the present invention also proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the corresponding functions of the training and testing robot system based on a multimodal large model as described above.
[0103] At the same time, the present invention also proposes a computer program product, which includes computer instructions that, when executed by a processor, perform the corresponding functions of the training and testing robot system based on a multimodal large model as described above.
[0104] Furthermore, the training and testing robot system based on a multimodal large model provides the following. First, cross-modal collaborative learning. The system achieves unified representation learning of visual, inertial, and physiological data, so that information from different modalities is mutually reinforcing. Specifically, this includes: visual compensation for inertial drift: when IMU data deviates due to integration, visual data provides an absolute reference; inertial-assisted visual occlusion: when a limb is occluded, inertial data maintains motion tracking; and physiological data verification of motion quality: electromyography and heart rate reflect the actual effect of the movement. This synergy significantly improves the robustness and accuracy of the system. Second, dynamic prompt engineering. The system uses an image-text model similar to BLIP to automatically generate action descriptions. This is achieved through: action description generation: extracting natural language descriptions from demonstration videos; high-frequency word statistics: analyzing common descriptive words for similar actions; prompt template construction, such as "a standard {attention} posture should {feet together, chest out, abdomen in}"; and prompt integration: integrating the semantic features of multiple prompts to enhance discrimination ability. Third, the system supports simultaneous assessment of multiple people and multiple subjects, enabling multiple people to be trained and assessed in the same subject or in different subjects, which significantly improves the efficiency of training and assessment.
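The prompt template construction and prompt integration steps of [0104] can be sketched as below. Only the first template comes from the description; the second template, the function names, and the choice of averaging normalized embeddings are assumptions for illustration. The text encoder is passed in as a callable because the source does not specify one.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def build_prompts(action: str, requirements: Sequence[str]) -> List[str]:
    """Fill prompt templates with an action name and its requirements."""
    templates = [
        "a standard {action} posture should {req}",    # template from the description
        "a correct {action} movement requires {req}",  # assumed additional template
    ]
    req = ", ".join(requirements)
    return [t.format(action=action, req=req) for t in templates]


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def ensemble_embedding(prompts: Sequence[str], encode: Callable[[str], Vector]) -> Vector:
    """Prompt integration: average the normalized embeddings, then renormalize."""
    vectors = [l2_normalize(encode(p)) for p in prompts]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return l2_normalize(mean)
```

The resulting vector can then serve as the standard-action feature against which real-time action features are compared.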
[0105] This invention proposes a training and assessment robot system based on a multimodal large model. Its fundamental goal is to accelerate the transformation and upgrading of training, and to generate and improve responsiveness. Building on the team's regular training work, and using training and assessment robots, sensor acquisition equipment, and a big data platform as its foundation, the system provides: a multimodal data acquisition layer that uses edge computing algorithms to achieve real-time posture detection and motion capture; a data processing and analysis layer that uses computer vision algorithms, including low-light compensation and motion blur correction, to perform real-time image processing through localized edge computing in environments without network or power; an intelligent decision-making and control layer that establishes a training subject rewriting decision mechanism to complete intelligent decision-making, evaluation, and error correction; and a human-computer interaction layer. While achieving training subject transition evaluation, it uses the cross-modal collaborative learning middleware to achieve a consistent representation across modalities, using visual data to compensate for inertial drift, inertial data to assist under visual occlusion, and physiological data to verify motion quality, which improves system robustness and accuracy. It also supports training and assessment of multiple people on the same subject or on different subjects, significantly improving the efficiency of training and assessment.
[0106] In all the above embodiments, in order to meet certain special data transmission and read/write requirements, the above methods and corresponding devices can be expanded during operation by adding devices, modules, components, hardware, pin connections, or memory, or by using different processors.
[0107] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the methods, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
[0108] In the embodiments provided by this invention, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of method steps is only a logical or functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or other forms.
[0109] The units described as separate components of the method and apparatus may or may not be logically or physically separate, and may not be physical units. That is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0110] Furthermore, the method steps and their implementations, as well as the functional units, in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit described above can be implemented in hardware or in the form of hardware plus software functional units.
[0111] The aforementioned methods and apparatus can be implemented as integrated units in the form of software functional units, which can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), NVRAM, magnetic disks, or optical disks.
[0112] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
[0113] It should be noted that the above embodiments are only used to more clearly explain and illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. These modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A training and testing robot system based on a multimodal large model, the system comprising a training and testing robot, sensor acquisition equipment, and a big data platform, and consisting of the following system functional layers: The multimodal data acquisition layer uses machine vision and multi-sensor fusion to achieve real-time posture detection and motion capture with edge computing algorithms; The data processing and analysis layer uses computer vision algorithms, including low-light compensation and motion blur correction, to perform real-time image processing through localized edge computing in environments without network or power. The data processing and analysis layer includes at least: The multimodal data synchronization and fusion submodule, which enables data completion: when data of a certain modality is missing, it uses data from other modalities for inference; and conflict resolution: when data from different modalities conflict, it performs weighted fusion based on confidence level; The motion analysis submodule based on the multimodal large model, which parses the training subject descriptions and scoring criteria, and maps different modal data to a shared feature space to understand the semantic relationship between visual actions, sensor data and scoring criteria; The system also includes a cross-modal collaborative learning middleware, which is attached externally to the data processing and analysis layer to achieve a unified representation of visual, inertial, and physiological data. It uses visual data to compensate for inertial drift, inertial data to assist under visual occlusion, and physiological data to verify motion quality, thereby improving the robustness and accuracy of the system.
The intelligent decision-making and control layer establishes a training subject rewriting decision mechanism, executes decision rewriting based on the transition evaluation strategy for training subjects, and completes intelligent decision-making, evaluation, and error correction. The intelligent decision-making and control layer is equipped with a training subject transition monitoring module, which records the transition actions and transition times before and after the training subject switch. When a subject switch is detected, the transition actions and transition times are recorded. On the one hand, the transition-period training assessment decision model replaces the training assessment decision strategy of the subject before the transition. On the other hand, a timer records the real-time duration of the transition. When the real-time duration of the transition reaches the transition time, the training assessment decision strategy of the subject after the transition replaces the transition-period training assessment decision model. This links the transition-period assessment with the assessments of the subjects before and after the transition, and improves the segmented real-time evaluation and adaptive control of the training and assessment robot system based on the multimodal large model.
The intelligent decision-making and control layer includes at least: a training subject rewriting decision submodule, which configures and executes training subject rewriting decisions; The automatic identification and adaptation submodule for training subjects is used to achieve subject feature extraction: encoding subject descriptions into feature vectors; scene understanding: analyzing the layout of environmental equipment and the trainee's equipment; action pattern matching: comparing real-time actions with subject standards in a multimodal feature space; and automatic adjustment of evaluation criteria and feedback strategies when subject switching is detected. The human-computer interaction layer achieves human-computer interaction based on multimodal feedback.
2. The training and testing robot system based on a multimodal large model as described in claim 1, characterized in that: The system also includes an intelligent assessment subsystem, which automatically generates a detailed evaluation report after training. This report is used for: 1) Motion breakdown scoring: Quantitatively evaluate the quality of each stage of the motion; 2) Progress trajectory analysis: Compare with historical training data to show the progress; 3) Weakness diagnosis: Identify the key actions that need improvement; 4) Training plan recommendations: Based on the current level, recommend key training areas for future progress; Furthermore, the report presents key action frames and sensor data curves in graphical form, facilitating post-training review.
3. The training and testing robot system based on a multimodal large model as described in claim 2, characterized in that: The multimodal data acquisition layer includes at least: The visual data acquisition submodule is equipped with a high-precision, automatically adjustable camera array that adjusts its height and angle according to the trainee's height and the training scene to ensure all-around capture of training movements. The camera array uses: an RGB camera to capture color images and identify the trainee's appearance features and the training environment; a depth camera to acquire scene depth information; and an infrared camera to operate normally in low-light environments and assist in identification. The camera system uses an automatic lifting mechanism and a gimbal control system to track the trainee's position in real time and maintain the best shooting angle. The wearable device data acquisition submodule, which uses wearable devices as the out-of-band data source of the system, includes: an inertial measurement unit (IMU) containing an accelerometer, gyroscope, and magnetometer to accurately measure limb movement trajectory and posture; a surface electromyography (sEMG) sensor to monitor muscle activity and force exertion; a heart rate sensor to monitor the trainee's physiological load; and a pressure sensor integrated into the insole or handle to measure force distribution. The aforementioned sensors communicate with the main system in real time via Bluetooth or proprietary protocols, with a sampling frequency of over 100 Hz to ensure the accuracy of motion data; The environmental perception submodule integrates various environmental sensors, including: a force plate for measuring ground reaction force, an environmental temperature and humidity sensor, a sound acquisition device, and an air quality monitoring sensor, to jointly construct a digital twin of the training scenario and provide data support for subsequent analysis.
4. The training and testing robot system based on a multimodal large model as described in claim 3, characterized in that: The data processing and analysis layer specifically includes: The multimodal data synchronization and fusion submodule employs a unified timestamp mechanism to ensure strict synchronization of data from different sources, with errors controlled within milliseconds. It achieves: spatiotemporal alignment: mapping visual, inertial, and physiological data to a unified spatiotemporal coordinate system; data completion: when data from one modality is missing, it uses data from other modalities for inference; and conflict resolution: when data from different modalities contradict each other, it performs weighted fusion based on confidence levels. The fused data forms a digital motion twin of the trainee, including skeletal posture, muscle activity, and physiological state information; The motion analysis submodule based on the multimodal large model adopts a CLIP-like multimodal large model architecture to achieve unified representation learning across modalities. The motion analysis submodule based on the multimodal large model includes: a visual encoder: based on Vision Transformer, extracting spatiotemporal features from the video stream; a sensor data encoder: processing time series data such as IMU and sEMG; and a text encoder: parsing training subject descriptions and scoring criteria. The encoders described above map different modal data to a shared feature space, enabling the system to understand the semantic relationships between visual actions, sensor data, and scoring criteria. The skeletal point recognition and motion reconstruction submodule employs an improved OpenPose algorithm to detect key points of the human body in real time, and combines inertial data to achieve high-precision motion reconstruction.
Specifically, the skeletal point recognition and motion reconstruction submodule implements: 2D key point detection: identifying joint positions from RGB images; 3D pose estimation: fusing multi-view images and depth data to reconstruct three-dimensional skeletons; kinematic completion: using biomechanical models and kinematic constraints to correct detection errors; and dynamic analysis: combining force plate data and electromyographic signals to calculate joint torques and power.
5. The training and testing robot system based on a multimodal large model as described in claim 4, characterized in that: The intelligent decision-making and control layer also includes at least: The real-time action evaluation and error correction submodule is used to achieve zero-shot action evaluation based on a pre-trained multimodal contrastive learning model. The real-time action evaluation and error correction submodule includes at least the following functions: constructing positive and negative sample prompts; similarity calculation: comparing the cosine similarity between real-time action features and standard action features; error diagnosis: locating specific problem areas through feature differences; and correction suggestion generation: automatically generating personalized improvement suggestions based on a language model. The adaptive control strategy submodule is used to optimize the control strategy using a reinforcement learning framework: simulation training: multiple virtual training sessions are conducted in a digital twin environment; policy transfer: the simulation strategy is transferred to the physical system for fine-tuning; online learning: the strategy is continuously optimized based on trainee feedback, so that the system can adapt to different trainees' body types, habits and ability levels, providing personalized assistance.
6. The training and testing robot system based on a multimodal large model as described in claim 5, characterized in that: The human-computer interaction layer includes at least: A multimodal feedback system is used to provide real-time feedback through multiple channels: visual feedback: a virtual coach demonstrating correct movements on AR glasses or a display screen; auditory feedback: voice guidance or prompts; tactile feedback: vibration of a wristband to indicate movement deviations; force feedback: for exoskeleton integrated systems, it can directly guide limb movements to ensure that trainees receive effective guidance in different scenarios.
7. The training and testing robot system based on a multimodal large model as described in claim 6, wherein: The visual compensation for inertial drift specifically includes: when IMU data deviates due to integration, visual data provides an absolute reference; The inertial-assisted visual occlusion specifically includes: when a limb is occluded, inertial data maintains motion tracking; The physiological data verification of exercise quality specifically includes: monitoring electromyography and heart rate, which reflect the actual effect of the movement, and confirming exercise quality on that basis.
8. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the corresponding functions of the training and testing robot system based on a multimodal large model as described in any one of claims 1-7.
9. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, perform corresponding functions of the training and testing robot system based on a multimodal large model as described in any one of claims 1-7.