A method for full-body control of a humanoid robot based on human motion capture
By using multimodal human motion data acquisition and relocalization methods, combined with reinforcement learning algorithms, a task-independent low-level control strategy was trained, solving the problem of unstable control of humanoid robots in complex environments and achieving robust whole-body control and efficient learning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YIJIAHE TECH CO LTD
- Filing Date
- 2024-11-11
- Publication Date
- 2026-06-30
AI Technical Summary
Existing humanoid robot control methods struggle to maintain stability in complex environments. Traditional methods are sensitive to environmental changes and lack high-quality training data and safety features. Reinforcement learning faces challenges in high-dimensional control spaces and struggles to effectively mimic human actions.
By acquiring multimodal human motion data, standardizing the format, and relocalizing the data, human motion data is mapped onto a humanoid robot. Combined with reinforcement learning algorithms, a task-independent low-level control policy is trained. Deep reinforcement learning models such as Decoder Transformer and PPO algorithm are then used to optimize robot control.
Robust whole-body control of humanoid robots in diverse environments has been achieved, improving the robustness and adaptability of control, and enhancing the stability and learning efficiency of robots in real-world environments.
Description
Technical Field
[0001] This invention belongs to the field of humanoid robot technology, and specifically relates to a method for full-body control of a humanoid robot based on human motion capture.
Background Technology
[0002] In recent years, humanoid robots, as intelligent agents capable of mimicking and replicating human behavior, have received widespread attention. Due to their potential applications in service, industry, medicine, and home companionship, humanoid robots have become an important direction in robotics research. Humanoid robots can perform complex tasks such as walking, running, jumping, and carrying objects, thanks to their human-like shape and multi-degree-of-freedom joint design. However, achieving robust full-body control and learning in diverse and dynamically changing environments remains a significant challenge.
[0003] Because of their human-like appearance, humanoid robots serve as a natural hardware platform for general-purpose robots, potentially capable of solving all tasks humans can perform. Furthermore, real-world human-based devices, tasks, and tools are constructed and designed based on human forms, providing humanoid robots with a unique opportunity to utilize vast amounts of human motion and skill data for training, thus compensating for the scarcity of data in traditional robotics. By mimicking humans, humanoid robots can potentially leverage the rich skills and movements exhibited by humans, offering a promising pathway to achieving general-purpose robotic intelligence.
[0004] Existing humanoid robot control systems have the following problems:
[0005] 1. In the field of robot control, current control methods largely rely on model-driven approaches, which decouple the problem into physical modeling, mechanical analysis, and predefined control rules to achieve robot motion. However, these methods are often sensitive to environmental changes and task diversity, making it difficult to maintain stability in complex real-world scenarios. Furthermore, traditional methods depend on high-precision physical models, making them highly dependent on the environment and task, and difficult to adapt to changes in uncertain environments.
[0006] 2. Although humanoid robots are very similar to humans compared to other forms of robots, there are still significant differences in physical structure and dynamic characteristics between humanoid robots and humans in terms of morphology and actuation, including the number of degrees of freedom, link length, height, weight, visual parameters and mechanisms, actuation strength and responsiveness. This makes it difficult for humanoid robots to effectively read and analyze human motion data, resulting in problems such as unstable and unnatural movements. These factors all pose obstacles to humanoid robots' effective use and learning of human data.
[0007] 3. In recent years, data-driven control has attracted widespread attention. In particular, reinforcement learning methods enable robots to autonomously learn and optimize their control through interaction with the environment. However, reinforcement learning faces significant challenges in the high-dimensional, continuous control space of humanoid robots, mainly in terms of algorithm convergence, learning efficiency, and robustness. Furthermore, the lack of high-quality training data and safety issues in real-world environments are also bottlenecks limiting the application of reinforcement learning methods in practical robot control.
[0008] Therefore, developing a universal method for transferring rich human motion data to robots, enabling them to learn and imitate human actions in simulated and real environments, is a practical and effective path to achieving robust control of humanoid robots.
Summary of the Invention
[0009] Purpose of the invention: To address the problems of the existing technologies described above, a method for full-body control of a humanoid robot based on human motion capture is proposed.
[0010] Technical solution:
[0011] A method for full-body control of a humanoid robot based on human motion capture includes:
[0012] Step 1: Multimodal human motion data acquisition;
[0013] Step 2: Standardize the format of the collected multimodal human motion data;
[0014] Step 3: Relocalization method based on reference motion data for humanoid robots.
[0015] Preferably, multimodal human motion data acquisition includes: extracting human motion data from publicly available human motion datasets, acquiring human motion data from IMU (Inertial Measurement Unit), and acquiring human motion data from pure video.
[0016] Preferably, standardizing the format of the collected multimodal human motion data includes: using the npy or npz format directly; or converting collected FBX data to npy or npz; or converting collected BVH data to npy or npz.
[0017] Preferably, the relocalization of the humanoid robot based on reference motion data proceeds as follows:
[0018] Step 3.1: URDF file processing: URDF is an XML format file used to describe the structure of a robot. It defines information including the geometry of each part of the robot, connection method, joint type, and the position, rotation axis, and range limit of each joint. The joint positions and angles in the human motion data are matched with the parameters defined in the URDF file.
[0019] Step 3.2: Joint Mapping and Transformation: Determine the number and type of joints in the humanoid robot (including rotary joints and translational joints) and their motion axes; based on this information, establish a one-to-one mapping relationship between human joints and robot joints;
[0020] Step 3.3: Local and Global Coordinate System Conversion: Since the human body and the robot use different coordinate systems, coordinate system conversion is required;
[0021] Step 3.4: Motion data optimization and correction: After completing the basic joint mapping, the motion data needs to be further corrected to ensure its consistency and physical feasibility;
[0022] Step 3.5: Write the converted motion data into the humanoid robot's control system in a format supported by URDF.
[0023] Preferably, the number of joints and degrees of freedom in human motion data have more redundancy than those in humanoid robots. Some degrees of freedom are heuristically removed to achieve complete alignment of the two types of data information and ensure that the data conforms to the joint motion range defined in URDF. If the range is exceeded, scaling or thresholding is used to adjust the range of the data to avoid compromising the realism and coherence of the motion.
[0024] Preferably, smoothing filtering techniques, including low-pass filtering or Kalman filtering, are applied to reduce high-frequency noise, smooth data curves, and thus eliminate unnecessary jitter. In addition, artifacts are detected by analyzing outliers in human motion data and interpolated using nearby normal data points.
[0025] Preferably, the motion data is sent to the humanoid robot's control system using the relevant URDF library to issue motion commands in real time, enabling the robot to move along a predetermined motion trajectory. After verifying the validity of the data, the actual motion data is collected and a reference motion dataset is generated. The reference motion dataset is used as training data for machine learning or reinforcement learning models to improve the robustness and adaptability of the humanoid robot control. The reference motion dataset is also used to verify the performance of new controls. By comparing the ideal motion trajectory in the reference motion dataset with the actual motion trajectory, the accuracy and stability of the humanoid robot control system are evaluated.
[0026] Beneficial effects: This invention discloses a general system and implementation path by which humanoid robots imitate human movements to achieve full-body control. It takes human motion data sequences as input, extracts key motion information from them, and relocalizes (retargets) this information onto the humanoid robot. It can adapt to human motion data of different modalities, analyze the corresponding joint positions, rotations, and other information, generate a reference dataset for humanoid robots to imitate human movements, and deploy reinforcement learning algorithms, enabling the robot to achieve robust full-body control in diverse and dynamically changing environments. This not only improves the robustness of control but also unifies the various forms of human motion data that feed humanoid robot learning, enhancing the robot's adaptability in real-world environments.
Detailed Implementation
[0027] To achieve the above-mentioned technical objectives, the present invention adopts the following technical solution:
[0028] A general implementation path for whole-body control learning of a humanoid robot includes a humanoid robot motion data acquisition and normalization path, and a whole-body control learning method. The humanoid robot motion data acquisition and normalization path includes a multimodal human motion data acquisition scheme, a human motion data format normalization method, and a general relocalization method for humanoid robot reference motion data. The whole-body control learning method is a task-independent low-level algorithm trained using a reinforcement learning algorithm in simulation to control the humanoid robot's movements.
[0029] I. The Multimodal Human Motion Data Acquisition Scheme:
[0030] 1. Extracting Actions from Publicly Available Human Motion Datasets: Publicly available human motion datasets are typically released by research institutions or commercial companies and contain a large amount of processed and labeled motion data. Common publicly available datasets include the CMU Motion Capture Database, Human3.6M, and AMASS. These datasets contain a wide variety of motion types, such as walking, running, jumping, and weightlifting, covering various human postures and movements. The advantages of publicly available datasets are their large data volume, rich data types, and detailed annotations, making them suitable for initial training and validation of models. Their limitations are that the data is usually collected in a laboratory setting and therefore does not reflect complex real-world environments, and that the data may not fully meet the needs of specific application scenarios.
[0031] 2. Acquiring Human Motion Data from an IMU (Inertial Measurement Unit): An IMU is a widely used sensor for acquiring human motion data. It infers the human body's motion state by detecting physical quantities such as acceleration and angular velocity. An IMU typically consists of an accelerometer, a gyroscope, and a magnetometer. When acquiring human motion data, the IMU sensor needs to be fixed to multiple key parts of the body, such as the limbs, waist, and chest. These sensors can record the motion information of each part in real time, and the overall motion trajectory of the human body can be inferred through data fusion technology. The data acquired by the IMU includes three-axis acceleration, three-axis angular velocity, and magnetic field strength. By integrating this raw data, the attitude changes, velocities, and displacements of each part can be calculated. Combined with a human skeletal model, IMU data can be used to reconstruct the human body's motion trajectory. The advantages of IMUs are their portability, low cost, and ability to be used in complex environments, making them suitable for long-term monitoring and large-scale data acquisition. However, the accuracy of IMUs is lower than that of motion capture systems, and due to sensor drift, long-term data acquisition may lead to error accumulation. In addition, IMUs have difficulty accurately capturing the relative motion of different parts of the human body, and usually need to be combined with data from other sensors to improve accuracy.
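The error accumulation caused by sensor drift can be illustrated with a minimal sketch. It assumes a single gyroscope axis sampled at 100 Hz with a small constant bias; the bias value, the sampling rate, and the test signal are hypothetical and chosen only for illustration.

```python
import numpy as np

def integrate_gyro(rates, dt):
    """Integrate single-axis angular velocity (rad/s) into an angle (rad)."""
    return np.cumsum(rates) * dt

dt = 0.01                      # assumed 100 Hz sampling period
n_samples = 6000               # one minute of data
t = np.arange(n_samples) * dt

true_rate = 0.5 * np.cos(0.5 * t)   # hypothetical joint motion
bias = 0.002                        # hypothetical constant gyro bias (rad/s)
measured_rate = true_rate + bias

true_angle = integrate_gyro(true_rate, dt)
estimated_angle = integrate_gyro(measured_rate, dt)

# The integrated bias grows linearly with time: drift(t) = bias * t.
drift = estimated_angle - true_angle
print(f"drift after 60 s: {drift[-1]:.3f} rad")
```

Even a bias this small produces a clearly visible orientation error after a minute, which is why IMU data is usually fused with other measurements.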
[0032] 3. Acquiring Human Motion Data from Pure Video: Pure video acquisition is another method for obtaining human motion data, utilizing computer vision technology to extract motion information from video. Video data can be obtained through monocular or multi-view cameras. The video acquisition scenario can be a laboratory, outdoors, or any other practical application scenario. The camera setup and resolution directly affect the quality and accuracy of the data. Through computer vision technology, especially deep learning models, the joint positions and motion trajectories of the human body can be extracted from the video. Commonly used methods include pose estimation and depth estimation. Models such as OpenPose and HRNet can identify and track key points of the human body from monocular or multi-view videos, generating motion sequences of the human skeleton. In the case of multi-view videos, triangulation can be used to further improve the accuracy of 3D motion data. The advantage of pure video acquisition is that it is non-invasive, requiring no devices attached to the human body, and can be performed in natural scenes. In addition, video acquisition can capture rich environmental contextual information, facilitating subsequent behavior recognition and scene understanding.
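The multi-view triangulation mentioned above can be sketched as follows. This is a minimal linear (DLT) triangulation of one joint from two calibrated views; the camera intrinsics, the 0.5 m baseline, and the joint position are hypothetical values, and a real pipeline would also have to handle noisy 2D keypoints.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) pixel coordinates of the same keypoint in each view.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical intrinsics and two cameras; the second camera centre is at x = +0.5 m.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

joint = np.array([0.1, -0.2, 3.0])   # hypothetical wrist position (m)
recovered = triangulate(P1, P2, project(P1, joint), project(P2, joint))
```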
[0033] II. The method for standardizing the format of human motion data
[0034] Standardizing the format of human motion data is a crucial step. This process ensures that motion data from different sources can be processed and used uniformly, facilitating subsequent algorithm development and model training. Data format standardization involves not only data storage and retrieval but also format conversion so that the data can be directly used by a wide range of machine learning and reinforcement learning tools.
[0035] 1. NPY / NPZ formats: Commonly used data storage formats in Python, particularly suitable for handling large-scale numerical data such as human motion data. NPY and NPZ formats are directly supported by Python's NumPy library. These formats are highly efficient in storing and retrieving data, capable of quickly loading large-scale multidimensional arrays. This efficiency is especially important for applications handling large amounts of motion data, such as the control learning of humanoid robots.
[0036] 2. Conversion from FBX to NPY / NPZ: FBX (Filmbox) is a common 3D animation format widely used to store skeletal animation data and 3D models. Specialized libraries (such as pyfbx or Autodesk's FBX SDK) are used to parse FBX files. During parsing, the joint positions, rotations, and skeletal hierarchy information related to human movement are extracted. The extracted joint data is usually expressed in local coordinate systems. For easier subsequent processing, the local coordinates need to be converted to global coordinates, which can be done by recursively composing the transformation matrix of each joint with that of its parent node. Once the global coordinates of all joints have been computed, the data can be organized into a multidimensional array: each frame is stored as an array of shape (number of joints × 3), holding the X, Y, and Z coordinates, and the frames are stacked into a 3D array for the whole sequence. Finally, NumPy is used to save these arrays in NPY or NPZ format. For example, numpy.savez_compressed() can be used to compress and store multi-frame motion data in an NPZ file for later use.
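A minimal sketch of the final storage step, assuming the joint positions have already been extracted. The array names (`positions`, `parents`, `fps`), the clip size, and the file name are hypothetical; only `numpy.savez_compressed()` and `numpy.load()` are taken from NumPy itself.

```python
import os
import tempfile
import numpy as np

n_frames, n_joints = 120, 24          # hypothetical clip size
rng = np.random.default_rng(0)

# Stand-in for extracted global joint positions: (frames, joints, XYZ).
positions = rng.normal(size=(n_frames, n_joints, 3)).astype(np.float32)
# Hypothetical skeleton hierarchy: parent index per joint, -1 for the root.
parents = np.array([-1] + list(range(n_joints - 1)), dtype=np.int32)

path = os.path.join(tempfile.mkdtemp(), "walk_clip.npz")
np.savez_compressed(path, positions=positions, parents=parents,
                    fps=np.float32(30.0))

# Loading returns a dict-like archive keyed by the names used above.
with np.load(path) as data:
    loaded = data["positions"]
    fps = float(data["fps"])
```

Storing the hierarchy and frame rate next to the positions keeps each clip self-describing, which simplifies later relocalization.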
[0037] 3. Conversion from BVH to NPY / NPZ: BVH (Biovision Hierarchy) is a common 3D animation format, used especially for motion capture data. BVH files typically contain two parts: skeletal structure information and motion data. A specialized parsing library (such as bvh-python) is used to read BVH files. The parsing process extracts the skeletal hierarchy and the motion data for each frame. The skeletal structure information typically defines the parent-child relationships of the joints and their initial positions, while the motion data records the rotation or displacement of each joint in each frame. The motion data in the BVH file is converted from rotation angles (usually Euler angles) to joint positions in the global coordinate system, which usually requires recursively calculating the transformation matrix of each joint relative to the root node. The motion data for each frame is converted into an array of shape (number of joints × 3) containing the positions of all joints in space, and the frames are stacked into a multidimensional array representing the entire motion sequence. The converted array can be saved using NumPy in NPY or NPZ format. During this process, multiple motion sequences can be stored in compressed form to reduce file size.
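The recursive conversion from local rotations to global joint positions can be sketched as a small forward-kinematics routine. This is an illustrative simplification: it uses a hypothetical three-joint chain and rotations about a single axis, whereas a real BVH file specifies a per-joint Euler channel order that must be respected.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the Z axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(parents, offsets, local_rots):
    """Global joint positions from per-joint local rotations.

    parents[i]  : index of joint i's parent (-1 for the root).
    offsets[i]  : joint i's offset expressed in its parent's frame.
    local_rots[i]: 3x3 local rotation of joint i.
    Parents must appear before their children.
    """
    n = len(parents)
    global_rot = [None] * n
    global_pos = np.zeros((n, 3))
    for i in range(n):
        p = parents[i]
        if p < 0:
            global_rot[i] = local_rots[i]
            global_pos[i] = offsets[i]
        else:
            # Accumulate the parent's transform down the hierarchy.
            global_pos[i] = global_pos[p] + global_rot[p] @ offsets[i]
            global_rot[i] = global_rot[p] @ local_rots[i]
    return global_pos

# Hypothetical chain: root -> elbow -> hand, each bone 1 unit long along X.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
local_rots = [rot_z(90), rot_z(-90), rot_z(0)]
positions = forward_kinematics(parents, offsets, local_rots)
```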
[0038] III. A General Relocalization Method for Humanoid Robot Reference Motion Data
[0039] In the learning process of humanoid robot control, human motion data needs to be adapted to different robot platforms. This process involves converting standardized human motion data (usually stored in NPY / NPZ format) into a data format suitable for the specific robot structure and control system. The core of the universal relocalization method lies in adjusting and mapping human motion data according to the robot's physical structure (such as joints and links) and control requirements to ensure that the robot can effectively imitate and execute these movements. The following details the relocalization methods for converting motion data from NPY / NPZ format to different humanoid robots.
[0040] 1. URDF File Processing: URDF (Unified Robot Description Format) is an XML format file used to describe the structure of a robot. It defines the geometry, connection methods, joint types, and other information for each part of the robot. It also includes information such as the position, rotation axis, and range constraints of each joint. The joint positions and angles in human motion data need to be matched with these parameters defined in the URDF.
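As a sketch of how the joint parameters could be read from a URDF file, the following uses only Python's standard XML parser. The two-joint robot description is hypothetical and exists only for illustration; a real humanoid URDF has many more links and joints.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-joint arm, used only for illustration.
URDF_TEXT = """
<robot name="demo_arm">
  <link name="torso"/>
  <link name="upper_arm"/>
  <link name="forearm"/>
  <joint name="shoulder_pitch" type="revolute">
    <parent link="torso"/>
    <child link="upper_arm"/>
    <origin xyz="0 0.2 0.4" rpy="0 0 0"/>
    <axis xyz="0 1 0"/>
    <limit lower="-1.57" upper="1.57" effort="50" velocity="5"/>
  </joint>
  <joint name="elbow" type="revolute">
    <parent link="upper_arm"/>
    <child link="forearm"/>
    <origin xyz="0 0 -0.3" rpy="0 0 0"/>
    <axis xyz="0 1 0"/>
    <limit lower="0.0" upper="2.6" effort="30" velocity="5"/>
  </joint>
</robot>
"""

def read_joints(urdf_text):
    """Return {joint name: type, parent, child, axis, lower, upper}."""
    root = ET.fromstring(urdf_text)
    joints = {}
    for joint in root.findall("joint"):
        axis = joint.find("axis")
        limit = joint.find("limit")
        joints[joint.get("name")] = {
            "type": joint.get("type"),
            "parent": joint.find("parent").get("link"),
            "child": joint.find("child").get("link"),
            # URDF defaults the axis to (1, 0, 0) when it is omitted.
            "axis": tuple(float(v) for v in axis.get("xyz").split())
                    if axis is not None else (1.0, 0.0, 0.0),
            "lower": float(limit.get("lower", 0.0)) if limit is not None else None,
            "upper": float(limit.get("upper", 0.0)) if limit is not None else None,
        }
    return joints

joints = read_joints(URDF_TEXT)
```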
[0041] 2. Joint Mapping and Transformation: Determine the number, type (e.g., rotational joints, translational joints), and axes of motion of the humanoid robot. Based on this information, establish a one-to-one mapping between human joints and robot joints. For example, if the human shoulder joint corresponds to the robot shoulder joint, the shoulder joint position and rotation information in the human motion data need to be mapped to the motion parameters of the robot shoulder joint. In most cases, the number of joints and degrees of freedom in human motion data will have more redundancy than those in humanoid robot data. Some degrees of freedom need to be heuristically removed to achieve complete alignment of the two sets of data and ensure that the data conforms to the joint motion range defined in URDF. If the data exceeds the range, scaling or thresholding can be used to adjust the range to avoid compromising the realism and consistency of the motion.
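The mapping, the removal of redundant degrees of freedom, and the range adjustment can be sketched as below. The joint names, the mapping table, and the limits are hypothetical; the two range strategies shown (thresholding per frame and scaling a whole trajectory) are one possible reading of the adjustment described above.

```python
# Hypothetical human-to-robot joint mapping; unmapped human joints are dropped.
JOINT_MAP = {"LeftShoulder": "l_shoulder_pitch", "LeftElbow": "l_elbow"}
# Hypothetical joint limits (rad), as they might be read from a URDF file.
LIMITS = {"l_shoulder_pitch": (-1.57, 1.57), "l_elbow": (0.0, 2.6)}

def retarget_frame(human_angles):
    """Map one frame of human joint angles to robot joints with thresholding."""
    robot_angles = {}
    for human_name, robot_name in JOINT_MAP.items():
        lower, upper = LIMITS[robot_name]
        robot_angles[robot_name] = min(max(human_angles[human_name], lower), upper)
    return robot_angles

def scale_into_limits(trajectory, lower, upper):
    """Scale a joint trajectory about the range midpoint so its peak fits the limits.

    Unlike thresholding, this keeps the shape of the motion and avoids flat spots.
    """
    mid = 0.5 * (lower + upper)
    half_range = 0.5 * (upper - lower)
    peak = max(abs(a - mid) for a in trajectory)
    if peak <= half_range:
        return list(trajectory)
    factor = half_range / peak
    return [mid + (a - mid) * factor for a in trajectory]

frame = {"LeftShoulder": 2.0, "LeftElbow": -0.3, "LeftFinger1": 0.4}
robot_frame = retarget_frame(frame)
scaled = scale_into_limits([0.0, 1.0, 2.0], -1.0, 1.0)
```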
[0042] 3. Local and Global Coordinate System Transformation: Since the coordinate systems used by humans and robots may differ, coordinate system transformation is usually required. For example, the global coordinate system and the local coordinate systems of each joint of some humanoid robots may differ from the natural coordinate system of human body data. In this case, the motion data needs to be rotated and flipped to adapt to the robot's coordinate system, ensuring that the motion data can be correctly mapped onto the robot.
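As an illustration of such a conversion, the sketch below assumes the human data uses a right-handed Y-up frame and the robot uses a right-handed Z-up frame. Both conventions are assumptions made for the example; the actual axes depend on the dataset and the robot.

```python
import numpy as np

# Rotation of +90 degrees about X: maps (x, y, z) to (x, -z, y),
# so the Y-up "up" direction becomes the Z-up "up" direction.
Y_UP_TO_Z_UP = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
])

def convert_positions(positions, rotation=Y_UP_TO_Z_UP):
    """Rotate an array of positions with shape (..., 3) into the robot frame."""
    return positions @ rotation.T

# One frame, two joints (hypothetical values).
human_positions = np.array([[[0.0, 1.0, 0.0], [1.0, 2.0, 3.0]]])
robot_positions = convert_positions(human_positions)
```

Because the matrix is a proper rotation (determinant +1), distances between joints are preserved. A flip between left-handed and right-handed data would instead need a reflection, which also requires swapping left and right joint labels.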
[0043] [Formula image not reproduced in the source text.]
[0044] This formula is used to extract the 3D pose and motion of a human model from a monocular video. For a video sequence, it yields the joint pose parameters of the human posture model, where K is the number of joint degrees of freedom and the +1 accounts for global rotation, together with the shape parameters for each rigid-body structural block. (The symbols for these quantities are missing from the source text.)
[0045] [Formula image not reproduced in the source text.]
[0046] This formula describes the process of extracting human joint parameters and mapping them to the individual joint parameters of a humanoid robot (retargeting). The parameters include the global three-dimensional positions, orientations, linear velocities, and angular velocities of all rigid bodies in the robot.
[0047] 4. Motion Data Optimization and Correction: After completing basic joint mapping, further correction of the motion data is necessary to ensure its consistency and physical feasibility. Due to noise limitations of data acquisition equipment or the accuracy limitations of attitude estimation, the motion data mapped to the robot may contain unreasonable data such as jitter, foot slippage, and artifacts. Smoothing filtering techniques, such as low-pass filtering or Kalman filtering, should be applied to reduce high-frequency noise, smooth the data curve, and thus eliminate unnecessary jitter. Additionally, by analyzing outliers in the motion data (such as excessive angular velocity or angle changes), artifacts can be detected and corrected using neighboring normal data points. For example, for a sudden change in joint angle, interpolation using smoothed data segments before and after the change can eliminate the abrupt shift.
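A minimal sketch of the correction step for a single joint angle trajectory. It swaps in two simple techniques as stand-ins: a median-based outlier detector with linear interpolation, and a first-order low-pass filter (exponential smoothing). The window size, threshold, and smoothing factor are hypothetical and would need tuning per dataset; a Kalman filter is not shown.

```python
import numpy as np

def repair_outliers(x, window=5, threshold=0.5):
    """Replace samples far from the local median with interpolated values."""
    x = np.asarray(x, dtype=float)
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    local_median = np.array([np.median(padded[i:i + window])
                             for i in range(len(x))])
    bad = np.abs(x - local_median) > threshold
    idx = np.arange(len(x))
    fixed = x.copy()
    # Interpolate flagged samples from the neighbouring normal samples.
    fixed[bad] = np.interp(idx[bad], idx[~bad], x[~bad])
    return fixed, bad

def low_pass(x, alpha=0.3):
    """First-order low-pass filter: y[i] = alpha*x[i] + (1-alpha)*y[i-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * x[i] + (1.0 - alpha) * y[i - 1]
    return y

t = np.linspace(0.0, 2.0 * np.pi, 100)
clean = np.sin(t)                     # hypothetical joint angle (rad)
noisy = clean.copy()
noisy[40] += 2.0                      # injected artifact (sudden jump)

fixed, bad = repair_outliers(noisy)
smoothed = low_pass(fixed)
```

Repairing outliers before filtering matters: a low-pass filter applied directly to the spike would smear it across neighbouring frames instead of removing it.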
[0048] 5. Write the converted motion data into the robot control system in a URDF-supported format. Tools that work with URDF (such as PyBullet or ROS) can be used to send the motion data to the humanoid robot controller (typically using the `sensor_msgs/JointState` message type to describe joint states) and issue motion commands in real time, enabling the robot to move along a predetermined trajectory. After verifying the data's validity, the actual motion data is collected and a Reference Motion dataset is generated. This dataset can be used as training data for machine learning or reinforcement learning models to improve the robustness and adaptability of robot control. The dataset can also be used to validate the performance of new controllers. By comparing the ideal motion trajectory in the Reference Motion dataset with the actual motion trajectory, the controller's accuracy and stability can be evaluated.
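The comparison between ideal and actual trajectories can be sketched with two simple per-joint metrics. The choice of RMSE and maximum absolute error is an assumption made for illustration, since the text does not name a metric, and the trajectories below are made-up values.

```python
import numpy as np

def tracking_error(reference, actual):
    """Per-joint RMSE and maximum absolute error.

    Both inputs have shape (frames, joints) and hold joint angles in radians.
    """
    error = np.asarray(actual, dtype=float) - np.asarray(reference, dtype=float)
    rmse = np.sqrt(np.mean(error ** 2, axis=0))
    max_abs = np.max(np.abs(error), axis=0)
    return rmse, max_abs

# Hypothetical 4-frame, 2-joint example.
reference = np.zeros((4, 2))
actual = np.array([[0.1, 0.0], [0.1, 0.0], [0.1, 0.0], [0.1, 0.2]])
rmse, max_abs = tracking_error(reference, actual)
```

A constant offset (first joint) and a single late deviation (second joint) give the same RMSE here but different maximum errors, which is why reporting both is useful.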
[0049] 1) Training environment
[0050] Physical simulation environments mimic real-world physical laws, such as gravity, friction, and collision detection. In these environments, humanoid robots can perform various actions and receive feedback based on the results. Commonly used physical simulation environments include PyBullet, MuJoCo, and Isaac Gym. These environments support complex rigid body and soft body simulations and are suitable for training low-level policies for humanoid robots. To ensure the trained policy is task-independent, diverse tasks need to be designed in the training environment. For example, different task scenarios such as walking, running, jumping, and obstacle avoidance require the robot to learn to perform corresponding actions under different conditions. During training, environmental parameters (such as terrain, friction coefficient, and obstacle location) can be randomly varied to improve the generalization ability of the policy and ensure that the robot performs well in different environments.
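The random variation of environmental parameters can be sketched independently of any particular simulator. The parameter names, value ranges, and terrain types below are hypothetical; in practice each sampled configuration would be applied to a PyBullet, MuJoCo, or Isaac Gym environment at reset.

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodeConfig:
    task: str
    terrain: str
    friction: float
    obstacle_xy: tuple

TASKS = ("walk", "run", "jump", "avoid_obstacle")
TERRAINS = ("flat", "slope", "rough", "stairs")   # hypothetical terrain set

def sample_episode(rng):
    """Draw one randomized training configuration."""
    return EpisodeConfig(
        task=rng.choice(TASKS),
        terrain=rng.choice(TERRAINS),
        friction=rng.uniform(0.4, 1.2),            # assumed friction range
        obstacle_xy=(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
    )

rng = random.Random(0)                             # seeded for reproducibility
episodes = [sample_episode(rng) for _ in range(1000)]
```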
[0051] 2) Deep reinforcement learning algorithm
[0052] In this invention, a deep reinforcement learning algorithm based on Decoder Transformer is used to train the low-level policy of a humanoid robot. Decoder Transformer is a powerful sequence model capable of processing time-series data and learning complex temporal dependencies, making it ideal for capturing the correlations between robot actions. During low-level policy training, the Decoder Transformer model is used to predict the robot's joint actions at the next moment. Specifically, the input is the robot's current state (such as joint position, velocity, sensor data, etc.). The robot's current state is encoded into a fixed-dimensional vector sequence, which serves as the input to the Transformer. The model uses a self-attention mechanism to capture the dependencies between the current state and past states, thereby determining the next action. Finally, the model outputs an action vector, which is directly applied to the robot's joint control.
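The core of the sequence model described above is masked (causal) self-attention over the history of state embeddings. The sketch below shows a single attention head followed by a linear action head in NumPy. The dimensions and random weights are placeholders, and layer normalization, feed-forward blocks, and multiple heads are omitted, so it should be read as an illustration rather than the architecture used in the invention.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention.

    x: (T, d) sequence of state embeddings, oldest first.
    Each time step may attend only to itself and earlier steps.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # hide future steps
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d, n_actions = 6, 8, 4                            # hypothetical sizes
x = rng.normal(size=(T, d))                          # encoded state history
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, n_actions))              # linear action head

out, weights = causal_self_attention(x, Wq, Wk, Wv)
action = out[-1] @ W_out                             # next joint action vector
```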
[0053] [Formula image not reproduced in the source text.]
[0054] [Formula image not reproduced in the source text.]
[0055] [Formula image not reproduced in the source text.]
[0056] These formulas define a task-independent low-level behavior policy for humanoid robots, enabling them to maintain stability, coordination, and resistance to disturbances when later imitating various human movements, including those with large limb motions. The policy takes a proprioceptive state input and a target state input. The target state consists of the positions of eight selected reference body points (shoulders, elbows, hands, and ankles), the positional difference between each reference joint and the corresponding humanoid joint, and the linear velocity of the reference joints. The proprioceptive state consists of the joint positions, joint velocities, root linear velocity, root angular velocity, and previous action.
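One way to write the quantities described in paragraph [0056] is given below. The notation is an assumption, since the original formula images are not reproduced in the text: $s_t^{p}$ is the proprioceptive state, $s_t^{g}$ the target state, $q_t$ and $\dot{q}_t$ the joint positions and velocities, $v_t$ and $\omega_t$ the root linear and angular velocities, $a_{t-1}$ the previous action, $\hat{p}_t$ and $\hat{v}_t$ the positions and linear velocities of the eight reference body points, and $p_t$ the corresponding positions on the humanoid.

```latex
\begin{aligned}
s_t^{p} &= \left[\, q_t,\ \dot{q}_t,\ v_t,\ \omega_t,\ a_{t-1} \,\right], \\
s_t^{g} &= \left[\, \hat{p}_t,\ \hat{p}_t - p_t,\ \hat{v}_t \,\right], \\
a_t &\sim \pi\!\left(a_t \mid s_t^{p},\ s_t^{g}\right).
\end{aligned}
```

Under this reading, the policy $\pi$ never sees a task label, only the robot's own state and the reference points to track, which is what makes it task-independent.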
[0057] 3) Reinforcement learning and training methods
[0058] To obtain task-independent low-level policies, the Proximal Policy Optimization (PPO) algorithm is employed for policy training. PPO is a widely used reinforcement learning algorithm that constrains the magnitude of each policy update by introducing a clipping term on the probability ratio between the new and old policies. This keeps training stable while retaining good learning efficiency. In robot manipulation and control, the PPO algorithm can optimize robot motion policies, enabling precise gait control and manipulation and supporting task execution in complex environments. During training, data augmentation techniques are used to improve the policy's generalization ability. For example, using different initial conditions and target positions within the same task expands the training dataset, allowing the model to adapt to more varied environments. In a physical simulation environment, training can be accelerated by running multiple instances in parallel. Each instance can use different environmental parameters and initial conditions, thereby improving the policy's convergence speed.
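The update-limiting term in PPO can be made concrete with the standard clipped surrogate objective, sketched below in NumPy. The clip range of 0.2 is a commonly used default and is an assumption here, as are the example probabilities and advantages.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized).

    logp_new, logp_old: log-probabilities of the taken actions under the
    new and old policies; advantages: estimated advantage per sample.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps].
    return np.mean(np.minimum(unclipped, clipped))

logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.75, 0.25]))   # ratios 1.5 and 0.5
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(logp_new, logp_old, advantages)
```

In this example the first sample is capped at 1.2 × 1.0 and the second at 0.8 × (−1.0), so the objective is (1.2 − 0.8) / 2 = 0.2 rather than the unclipped 0.5.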
[0059] 4) Reward Design
[0060] The design of the reward function needs to consider multiple factors in order to guide the robot to learn stable and generalizable motion policies. The rewards for the task-independent full-body motion control policy include: a balance reward: the longer the robot maintains balance, the higher the reward, which helps the robot stay balanced across motion tasks; a joint stability reward: if the robot's joints change smoothly without violent shaking during movement, an additional reward is given; a reference motion imitation reward: the more closely the robot replicates the reference motion sequence within a controllable range while remaining stable, the higher the reward, so that human movements are imitated without sacrificing stability; and a collision penalty: if the robot collides with obstacles or other objects in the environment during movement, a penalty is given to encourage the robot to learn to avoid obstacles.
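The four reward terms can be combined into a single per-step reward, as in the sketch below. The weights, the exponential shaping, and the `sigma` scale are hypothetical choices made for illustration; the text names the terms but does not specify their functional form.

```python
import math

# Hypothetical weights for the four terms.
WEIGHTS = {"balance": 1.0, "smooth": 0.5, "imitation": 2.0, "collision": 5.0}

def step_reward(upright, joint_vel, prev_joint_vel,
                keypoints, ref_keypoints, collided, sigma=0.25):
    """Per-step reward combining balance, smoothness, imitation, and collisions."""
    # Balance: paid every step the robot stays upright, so longer balance
    # accumulates more reward.
    balance = 1.0 if upright else 0.0
    # Joint stability: penalize abrupt changes in joint velocity.
    change = sum((v - pv) ** 2 for v, pv in zip(joint_vel, prev_joint_vel))
    smooth = math.exp(-change)
    # Imitation: close tracking of the reference keypoint positions.
    dist2 = sum((a - b) ** 2
                for p, r in zip(keypoints, ref_keypoints)
                for a, b in zip(p, r))
    imitation = math.exp(-dist2 / sigma ** 2)
    collision = 1.0 if collided else 0.0
    return (WEIGHTS["balance"] * balance
            + WEIGHTS["smooth"] * smooth
            + WEIGHTS["imitation"] * imitation
            - WEIGHTS["collision"] * collision)

ref = [(0.0, 0.2, 1.4), (0.0, -0.2, 1.4)]            # hypothetical keypoints
perfect = step_reward(True, [0.1, 0.2], [0.1, 0.2], ref, ref, False)
off_target = step_reward(True, [0.1, 0.2], [0.1, 0.2],
                         [(0.3, 0.2, 1.4), (0.0, -0.2, 1.4)], ref, False)
crashed = step_reward(True, [0.1, 0.2], [0.1, 0.2], ref, ref, True)
```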
[0061] By using a deep reinforcement learning algorithm based on Decoder Transformer, task-independent low-level policies can be effectively trained, enabling humanoid robots to maintain stable motion control in different environments and tasks. Diverse training environments, reasonable reward design, and scientific training methods are key factors for success. Ultimately, the trained policy model possesses good generalization ability and can cope with complex real-world application scenarios.
[0062] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A full-body control method for a humanoid robot based on human motion capture, characterized in that it includes: Step 1: Multimodal human motion data acquisition; Step 2: Standardize the format of the collected multimodal human motion data; Step 3: Relocalization of the humanoid robot based on reference motion data; wherein the relocalization of the humanoid robot based on reference motion data specifically includes: Step 3.1: URDF file processing: URDF is an XML format file used to describe the structure of a robot. It defines information including the geometry of each part of the robot, connection method, joint type, and the position, rotation axis, and range limit of each joint. The joint positions and angles in the human motion data are matched with the parameters defined in the URDF file. Step 3.2: Joint Mapping and Transformation: Determine the number and type of joints in the humanoid robot (including rotary joints and translational joints) and their motion axes; based on this information, establish a one-to-one mapping relationship between human joints and robot joints; Step 3.3: Local and Global Coordinate System Conversion: Since the human body and the robot use different coordinate systems, coordinate system conversion is required; Step 3.4: Motion data optimization and correction: After completing the basic joint mapping, the motion data needs to be further corrected to ensure its consistency and physical feasibility; Step 3.5: Write the converted motion data into the humanoid robot's control system in a format supported by URDF.
2. The full-body control method for a humanoid robot based on human motion capture according to claim 1, wherein multimodal human motion data acquisition includes: extracting human motion data from publicly available human motion datasets, acquiring human motion data from an IMU (Inertial Measurement Unit), and acquiring human motion data from pure video.
3. The full-body control method for a humanoid robot based on human motion capture according to claim 1, wherein standardizing the format of the collected multimodal human motion data includes: using npy or npz format; or converting the collected FBX format to npy or npz; or converting the collected BVH format to npy or npz.
4. The full-body control method for a humanoid robot based on human motion capture according to claim 1, wherein the number of joints and degrees of freedom in the human motion data is more redundant than that of the humanoid robot; some degrees of freedom are removed heuristically to achieve complete alignment of the two types of data and to ensure that the data conforms to the joint range of motion defined in the URDF; and if the data is out of range, scaling or threshold limits are used to adjust the range of the data to avoid compromising the realism and coherence of the motion.
5. The full-body control method for a humanoid robot based on human motion capture according to claim 1, wherein smoothing filtering techniques, including low-pass filtering or Kalman filtering, are applied to reduce high-frequency noise and smooth data curves, thereby eliminating unnecessary jitter; and artifacts are detected by analyzing outliers in the human motion data and corrected by interpolation using nearby normal data points.
6. The full-body control method for a humanoid robot based on human motion capture according to claim 1, wherein, using the relevant URDF library, motion data is sent to the humanoid robot's control system to issue real-time motion commands, enabling the robot to move according to a predetermined motion trajectory; after verifying the validity of the data, actual motion data is collected and a reference motion dataset is generated; the reference motion dataset is used as training data for machine learning or reinforcement learning models to improve the robustness and adaptability of humanoid robot control; the reference motion dataset is also used to verify the performance of new controllers; and by comparing the ideal motion trajectory in the reference motion dataset with the actual motion trajectory, the accuracy and stability of the humanoid robot control system are evaluated.