A humanoid robot motion tracking control method, device, and electronic equipment

By constructing a target optimization problem function and a Markov decision process, and combining it with an off-policy reinforcement learning algorithm to train a humanoid robot motion tracking and control model, the problem of instability or falls during high-dynamic motion is solved. This achieves unified control of motion tracking and safe recovery, and improves the stability and safety of robot motion execution.

CN122253166A · Pending · Publication Date: 2026-06-23 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: BEIJING INST OF TECH
Filing Date: 2026-02-14
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, when humanoid robots encounter unknown external disturbances that cause instability or falls while performing high-dynamic actions, the high-dynamic motion tracking task and the high-dynamic motion safety recovery task are treated as two separate problems, making it difficult to effectively achieve motion tracking control.

Method used

A target optimization problem function is constructed by combining the reward functions of the high-dynamic motion tracking task and the high-dynamic motion safety recovery task, and is then modeled as a Markov decision process. A motion tracking control model for humanoid robots is trained using an off-policy reinforcement learning algorithm, and unified control is achieved through a policy network and a value network.

Benefits of technology

It achieves stable tracking and safe recovery of humanoid robots during highly dynamic movements, avoiding the shortcomings of phased control and improving the safety and stability of motion execution.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a humanoid robot motion tracking control method. To address the high-dynamic motion tracking safety control problem of humanoid robots, a target optimization problem function is constructed based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task, and is modeled as a Markov decision process. An initial humanoid robot motion tracking control model is trained using an off-policy reinforcement learning algorithm to solve the Markov decision process, yielding a trained humanoid robot motion tracking control model. The model consists of a policy network and a value network and is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot. This approach effectively achieves safe motion tracking control of humanoid robots.

Description

Technical Field

[0001] This invention relates to the field of robotics, and in particular to a method, apparatus, and electronic device for tracking and controlling the motion of a humanoid robot.

Background Technology

[0002] With the rapid development of humanoid robot technology, it has shown broad application prospects in service robots, entertainment performances, human-computer interaction, and complex environment operations. Compared with basic motion control tasks such as walking and standing, high-dynamic movements such as dancing and martial arts place higher demands on the robot's control precision, stability, and real-time decision-making capabilities.

[0003] Currently, a combination of deep reinforcement learning and imitation learning can be used to achieve high-fidelity tracking of human motion data in simulated environments, thereby controlling humanoid robots to perform various high-dynamic actions. However, these methods typically assume that the humanoid robot is always in a safe and controllable state of motion, meaning that it will not experience serious imbalance or falls during the execution of actions. Once the robot encounters unknown external disturbances, such as push forces, changes in ground friction, or execution errors, while performing the actions corresponding to the reference actions, it is highly susceptible to instability or even falls.

[0004] However, in the existing technology, when a humanoid robot becomes unstable or even falls, the high-dynamic motion tracking task and the high-dynamic motion safety recovery task are regarded as two independent problems. Different control methods are usually used, or a phased control method is used to handle the high-dynamic motion tracking task and the high-dynamic motion safety recovery task separately. Based on this, it is difficult to effectively achieve motion tracking control of humanoid robots.

Summary of the Invention

[0005] The purpose of this invention is to provide a humanoid robot motion tracking control method to solve the problem in the prior art where, when a humanoid robot becomes unstable or even falls, the high-dynamic motion tracking task and the high-dynamic motion safety recovery task are treated as two independent problems, and different control methods or staged control methods are usually used to handle them separately, making it difficult to effectively achieve motion tracking control of the humanoid robot.

[0006] To achieve the above objective, in a first aspect, embodiments of the present invention provide a humanoid robot motion tracking control method, the method comprising: for the high-dynamic motion tracking safety control problem of a humanoid robot, constructing an objective optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task; modeling the objective optimization problem function as a Markov decision process; and training an initial humanoid robot motion tracking control model using an off-policy reinforcement learning algorithm to solve the Markov decision process, obtaining a trained humanoid robot motion tracking control model, wherein the model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot and consists of a policy network and a value network.

[0007] In one embodiment, the objective optimization problem function may be defined by the following expression:

[0008] where the expression involves the following quantities: the discount factor at the k-th time step; the reward function of the high-dynamic motion tracking task; the indicator function of the high-dynamic motion safety recovery task at the k-th time step; the reward function of the high-dynamic motion safety recovery task; the probability of executing the high-dynamic motion tracking task, with one minus this probability being the probability of executing the high-dynamic motion safety recovery task; the task execution instruction that selects between the high-dynamic motion tracking task and the high-dynamic motion safety recovery task; the reference action state at the k-th time step; the initial reference action state of the high-dynamic motion tracking task or the high-dynamic motion safety recovery task; the set of unsafe reference action state data corresponding to the high-dynamic motion safety recovery task; and the set of reference action state data corresponding to the high-dynamic motion tracking task.

[0009] In one embodiment, modeling the objective optimization problem function as a Markov decision process includes: establishing a state space, an action space, and a reward function for the objective optimization problem function; wherein the state space includes the joint information state, the base posture information state, and the contact information state of the humanoid robot, together with the reference action state; the action space includes the control commands for the joints of the humanoid robot when performing high-dynamic actions; and the reward function is determined based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task.

[0010] In one embodiment, determining the reward function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task includes: determining the reward function according to a formula with two weight parameters, one being the weight parameter of the reward function of the high-dynamic motion tracking task and the other being the weight parameter of the reward function of the high-dynamic motion safety recovery task.

[0011] In one embodiment, the method further includes: constructing, for performing the high-dynamic motion tracking task, a motion change rate penalty function, a joint limit penalty function, a global reference position error function, a global reference posture error function, a relative body position error function, a linear velocity and angular velocity consistency function, and an undesired contact penalty function; and constructing the reward function of the high-dynamic motion tracking task as the weighted sum of these seven functions.

[0012] In one embodiment, the method further includes: The shoulder height relative error penalty function, relative posture penalty function, and displacement penalty function of the x-y axis before standing are constructed for performing high dynamic motion safety recovery tasks. The reward function for the high-dynamic motion safety recovery task is constructed by weighted summing of the shoulder height relative error penalty function, the relative posture penalty function, and the displacement penalty function of the x-y axis before standing.

[0013] In one embodiment, the humanoid robot motion tracking control model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot, including: Acquire the joint status, base posture status, contact status, and reference motion status of the humanoid robot; The state space is determined based on the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot. The state space is input into the humanoid robot motion tracking control model to obtain the high-dynamic motion tracking safety decision of the humanoid robot.

[0014] In one embodiment, training an initial humanoid robot motion tracking control model using an off-policy reinforcement learning algorithm to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model includes: obtaining an initial task state space, which is either the first initial task state space corresponding to the high-dynamic motion tracking task or the second initial task state space corresponding to the high-dynamic motion safety recovery task; inputting the initial task state space into the policy network to obtain a predicted high-dynamic motion tracking safety decision; updating the initial task state space based on the predicted high-dynamic motion tracking safety decision, and calculating the reward value of the predicted decision using the reward function; inputting the predicted high-dynamic motion tracking safety decision, the reward value, and the updated initial task state space into the value network; and training the humanoid robot motion tracking control model based on the policy optimization objective loss function until the model converges, thus obtaining the trained humanoid robot motion tracking control model.

[0015] In a second aspect, embodiments of the present invention provide a humanoid robot motion tracking control device, the device comprising: a construction module, configured to construct, for the high-dynamic motion tracking safety control problem of a humanoid robot, an objective optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task; a modeling module, configured to model the objective optimization problem function as a Markov decision process; and a solution module, configured to train an initial humanoid robot motion tracking control model using an off-policy reinforcement learning algorithm to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model, wherein the model is used to obtain the high-dynamic motion tracking safety decision of the humanoid robot and consists of a policy network and a value network.

[0016] In a third aspect, embodiments of the present invention provide an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the humanoid robot motion tracking control method described in the first aspect.

[0017] The technical solution provided by the embodiments of the present invention has the following advantages compared with the prior art: This invention provides a humanoid robot motion tracking control method. First, for the high-dynamic motion tracking safety control problem of humanoid robots, a target optimization problem function is constructed based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task. This target optimization problem function is then modeled as a Markov decision process. Finally, an initial humanoid robot motion tracking control model is trained using an off-policy reinforcement learning algorithm to solve the Markov decision process, resulting in a trained humanoid robot motion tracking control model. This model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot and consists of a policy network and a value network. This approach enables unified control of both the high-dynamic motion tracking task and the high-dynamic motion safety recovery task, avoiding the common practice in existing technologies of handling the two tasks separately with different control methods or staged control methods, and thus effectively achieves safe motion tracking control of the humanoid robot.

Attached Figure Description

[0018] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings:
Figure 1 is a flowchart of a humanoid robot motion tracking control method provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of a humanoid robot motion tracking control device provided in an embodiment of the present invention.

Detailed Implementation

[0019] To facilitate a clear description of the technical solutions in the embodiments of the present invention, the terms "first" and "second" are used to distinguish identical or similar items with essentially the same function and effect. For example, the first threshold and the second threshold are merely used to distinguish different thresholds and do not limit their order. Those skilled in the art will understand that the terms "first" and "second" do not limit the quantity or execution order, and that items described as "first" and "second" are not necessarily different.

[0020] It should be noted that in this invention, the terms "exemplary" or "for example" are used to indicate examples, illustrations, or descriptions. Any embodiment or design described as "exemplary" or "for example" in this invention should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.

[0021] In this invention, "at least one" means one or more, and "more than one" means two or more. "And/or" describes the relationship between associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean A alone, both A and B, or B alone.

[0022] As shown in Figure 1, which is a flowchart of a humanoid robot motion tracking control method according to an embodiment of the present invention, the method specifically includes the following steps:
S10: For the high-dynamic motion tracking safety control problem of a humanoid robot, construct a target optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task.

[0023] The high-dynamic motion tracking safety control problem refers to the following: when a humanoid robot is controlled to perform actions obtained by retargeting human motion data and encounters unknown external disturbances, such as push forces, changes in ground friction, or execution errors, that cause instability or a fall, it must be able to promptly resume normal motion. In other words, the high-dynamic motion tracking safety control problem includes both the high-dynamic motion tracking problem and the high-dynamic motion safety recovery problem. The high-dynamic motion tracking problem refers to tracking during the humanoid robot's normal execution of high-dynamic actions, while the high-dynamic motion safety recovery problem refers to the safe recovery of the humanoid robot after instability or a fall, enabling it to promptly return to normal and continue performing its actions.

[0024] Specifically, for the safety control problem of humanoid robots in high-dynamic motion tracking, including the high-dynamic motion tracking problem and the high-dynamic motion safety recovery problem, corresponding objective optimization problem functions are constructed based on the reward functions of the high-dynamic motion tracking task and the high-dynamic motion safety recovery task.

[0025] Optionally, based on the above embodiments, in some embodiments of the present invention, the objective optimization problem function may be defined by the following expression:

[0026] where the expression involves the following quantities: the discount factor at the k-th time step; the reward function of the high-dynamic motion tracking task; the indicator function of the high-dynamic motion safety recovery task at the k-th time step; the reward function of the high-dynamic motion safety recovery task; the probability of executing the high-dynamic motion tracking task, with one minus this probability being the probability of executing the high-dynamic motion safety recovery task; the task execution instruction that selects between the high-dynamic motion tracking task and the high-dynamic motion safety recovery task; the reference action state at the k-th time step; the initial reference action state of the high-dynamic motion tracking task or the high-dynamic motion safety recovery task; the set of unsafe reference action state data corresponding to the high-dynamic motion safety recovery task; and the set of reference action state data corresponding to the high-dynamic motion tracking task.
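For orientation only, one plausible way the quantities listed above could fit together is sketched below in LaTeX. Every symbol and the overall form are assumptions introduced here (the discount factor as \gamma, the task probability as p, the task execution instruction as c, the data sets as \mathcal{D}); the expression actually defined in the patent may differ, for example in how the indicator function enters or in whether weight parameters appear.

```latex
% Hypothetical reconstruction; all symbols are assumed, not taken from the patent.
\max_{\pi}\;
\mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}
  \Big( r^{\mathrm{track}}\big(s_k, a_k, s^{\mathrm{ref}}_k\big)
      + \mathbb{1}^{\mathrm{rec}}_{k}\, r^{\mathrm{rec}}\big(s_k, a_k\big) \Big)\right]
\quad \text{s.t.} \quad
c \sim \mathrm{Bernoulli}(p), \qquad
s^{\mathrm{ref}}_{0} \in
\begin{cases}
  \mathcal{D}^{\mathrm{track}}  & \text{if } c = 1 \ (\text{tracking, probability } p),\\[2pt]
  \mathcal{D}^{\mathrm{unsafe}} & \text{if } c = 0 \ (\text{recovery, probability } 1 - p).
\end{cases}
```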

[0027] S11: Model the objective optimization problem function as a Markov decision process.

[0028] Specifically, after constructing the objective optimization problem function, the objective optimization problem function is modeled as a Markov decision process.

[0029] Optionally, based on the above embodiments, in some embodiments of the present invention, one implementation of S11 may be: S111: For the objective optimization problem function, establish the state space, action space, and reward function.

[0030] The state space includes: the joint information state of the humanoid robot, the base posture information state, the contact information state, and the reference action state. The joint information state reflects the motion state of the humanoid robot during the execution of an action, and includes joint position information and joint velocity information. The base posture information state refers to the posture information of the humanoid robot during the current action. The reference action state refers to the reference motion postures available to the humanoid robot during the execution of an action. The action space includes the control commands for the humanoid robot's joints when performing high-dynamic actions. The reward function is determined based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task.
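As a rough illustration of the state space described above, the following Python sketch concatenates the four components into one flat observation vector. The `RobotState` container, the `build_observation` helper, and the example dimensions are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobotState:
    """Hypothetical container for the four state-space components."""
    joint_pos: List[float]     # joint position information
    joint_vel: List[float]     # joint velocity information
    base_posture: List[float]  # base posture, e.g. an orientation quaternion (assumed)
    contact: List[float]       # contact information, e.g. one flag per foot (assumed)
    reference: List[float]     # reference action state for the current step


def build_observation(state: RobotState) -> List[float]:
    """Concatenate all components into one flat vector for the policy network."""
    if len(state.joint_pos) != len(state.joint_vel):
        raise ValueError("joint_pos and joint_vel must have the same length")
    return (state.joint_pos + state.joint_vel + state.base_posture
            + state.contact + state.reference)


example = RobotState(
    joint_pos=[0.1, 0.2],
    joint_vel=[0.0, 0.0],
    base_posture=[1.0, 0.0, 0.0, 0.0],
    contact=[1.0, 1.0],
    reference=[0.1, 0.2],
)
observation = build_observation(example)
```

The action space would then be a vector of joint control commands with one entry per actuated joint.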

[0031] Specifically, for the constructed target optimization problem function, corresponding state space, action space, and reward function are established.

[0032] Optionally, based on the above embodiments, in some embodiments of the present invention, determining the reward function according to the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task includes: S20: Determine the reward function according to a formula that weights the two task reward functions.

[0033] Here, one weight parameter applies to the reward function of the high-dynamic motion tracking task and the other applies to the reward function of the high-dynamic motion safety recovery task. The specific values of the two weight parameters are not limited in this invention, and those skilled in the art can determine them according to the actual situation.
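A minimal LaTeX sketch of such a weighted combination follows, assuming a simple linear form. The symbols w^{track} and w^{rec} for the two weight parameters, and the subscript k for the time step, are notation introduced here for illustration.

```latex
% Assumed linear form; symbol names are illustrative.
r_k \;=\; w^{\mathrm{track}}\, r^{\mathrm{track}}_k \;+\; w^{\mathrm{rec}}\, r^{\mathrm{rec}}_k
```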

[0034] Optionally, based on the above embodiments, in some embodiments of the present invention, one implementation of the reward function for high-dynamic motion tracking tasks may be: S201: Constructs the motion change rate penalty function, joint limit penalty function, global reference position error function, global reference posture error function, relative body position error function, linear velocity and angular velocity consistency function, and undesired contact penalty function for performing high dynamic motion tracking tasks.

[0035] S202: The reward function for high-dynamic motion tracking tasks is constructed by weighted summation of the motion change rate penalty function, joint limit penalty function, global reference position error function, global reference posture error function, relative body position error function, linear velocity and angular velocity consistency function, and undesired contact penalty function.

[0036] Specifically, for the humanoid robot performing the high-dynamic motion tracking task, a motion change rate penalty function, a joint limit penalty function, a global reference position error function, a global reference posture error function, a relative body position error function, a linear velocity and angular velocity consistency function, and an undesired contact penalty function are constructed. A weighted sum of these functions is then taken to construct the reward function of the high-dynamic motion tracking task.

[0037] Optionally, based on the above embodiments, in some embodiments of the present invention, the reward function for the high dynamic motion tracking task may be defined by the following expression:

[0038] where the expression is a weighted sum in which each of the following functions, evaluated while performing the high-dynamic motion tracking task, is multiplied by its own weight parameter: the motion change rate penalty function; the joint limit penalty function; the global reference position error function; the global reference posture error function; the relative body position error function; the linear velocity and angular velocity consistency function; and the undesired contact penalty function.
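The weighted summation can be sketched in Python as below. The term names, the example values, and the weights are hypothetical placeholders; the patent does not fix specific weight values.

```python
from typing import Dict


def weighted_reward(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the sum of weight * value over all reward terms."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for terms: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical per-step values of the seven tracking terms.
tracking_terms = {
    "action_rate": -0.5,
    "joint_limit": 0.0,
    "global_position": 1.0,
    "global_posture": 1.0,
    "relative_body_position": 1.0,
    "velocity_consistency": 1.0,
    "undesired_contact": -1.0,
}
# Hypothetical weights, chosen only to make the example easy to follow.
tracking_weights = {name: 1.0 for name in tracking_terms}
tracking_reward = weighted_reward(tracking_terms, tracking_weights)
```

The same helper could be reused for the recovery-task reward, which is also described as a weighted sum of its component functions.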

[0039] Optionally, based on the above embodiments, in some embodiments of the present invention, the penalty function for the rate of change of action may be defined by the following expression:

[0040] where the expression is written in terms of the j-th action component of the humanoid robot at time t.
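One common way to penalize the rate of change of the action is the negative squared difference between consecutive actions, summed over action components. The sketch below assumes that form and that sign convention; the patent's own expression may differ.

```python
from typing import Sequence


def action_rate_penalty(action: Sequence[float], prev_action: Sequence[float]) -> float:
    """Negative sum of squared changes between consecutive actions (assumed form)."""
    if len(action) != len(prev_action):
        raise ValueError("action vectors must have the same length")
    return -sum((a - b) ** 2 for a, b in zip(action, prev_action))
```

For example, `action_rate_penalty([1.0, 2.0], [1.0, 0.0])` evaluates to `-4.0`, while identical consecutive actions give zero penalty.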

[0041] Optionally, based on the above embodiments, in some embodiments of the present invention, the joint limit penalty function may be defined by the following expression:

[0042] where the quantities are: the value of the i-th joint degree of freedom of the humanoid robot; the maximum value of the i-th joint degree of freedom of the humanoid robot; and the minimum value of the i-th joint degree of freedom of the humanoid robot.
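A joint limit penalty of this kind is often implemented by summing how far each joint value lies outside its allowed range. The following sketch assumes that form, with a negative sign so that larger violations give a lower reward; the function name and arguments are illustrative.

```python
from typing import Sequence


def joint_limit_penalty(q: Sequence[float],
                        q_min: Sequence[float],
                        q_max: Sequence[float]) -> float:
    """Negative total amount by which joint values exceed their limits (assumed form)."""
    if not (len(q) == len(q_min) == len(q_max)):
        raise ValueError("q, q_min and q_max must have the same length")
    violation = 0.0
    for value, low, high in zip(q, q_min, q_max):
        # Only one of the two terms can be positive for a given joint.
        violation += max(0.0, low - value) + max(0.0, value - high)
    return -violation
```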

[0043] Optionally, based on the above embodiments, in some embodiments of the present invention, the global reference position error function may be defined by the following expression:

[0044] where the quantities are: the joint positions in the reference motion of the humanoid robot; the joint positions of the humanoid robot when performing the action corresponding to the reference action; and a hyperparameter associated with the joint positions.
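Tracking-error terms with a scale hyperparameter are frequently written as an exponential kernel of the squared error, so that perfect tracking yields 1 and the value decays toward 0 as the error grows. The sketch below assumes that kernel, with `sigma` standing in for the hyperparameter; the patent's own expression may differ.

```python
import math
from typing import Sequence


def position_tracking_reward(ref_positions: Sequence[Sequence[float]],
                             actual_positions: Sequence[Sequence[float]],
                             sigma: float) -> float:
    """exp(-squared position error / sigma), an assumed kernel form."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if len(ref_positions) != len(actual_positions):
        raise ValueError("position lists must have the same length")
    squared_error = sum(
        (r - a) ** 2
        for ref_point, actual_point in zip(ref_positions, actual_positions)
        for r, a in zip(ref_point, actual_point)
    )
    return math.exp(-squared_error / sigma)
```

The same kernel shape could be applied to the posture, relative body position, and velocity terms by swapping in the corresponding error and hyperparameter.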

[0045] Optionally, based on the above embodiments, in some embodiments of the present invention, the global reference posture error function may be defined by the following expression:

[0046] where the quantities are: the joint degrees of freedom in the reference motion of the humanoid robot; the joint degrees of freedom of the humanoid robot when performing the action corresponding to the reference action; and a hyperparameter associated with the joint degrees of freedom.

[0047] Optionally, based on the above embodiments, in some embodiments of the present invention, the relative body position error function may be defined by the following expression:

[0048] where the quantities are: the position of the i-th joint relative to the root node in the reference motion of the humanoid robot; the position of the i-th joint relative to the root node when the humanoid robot performs the action corresponding to the reference action; and a hyperparameter associated with the position.

[0049] Optionally, based on the above embodiments, in some embodiments of the present invention, the linear velocity and angular velocity consistency function may be defined by the following expression:

[0050]

[0051] where the quantities are: the reference linear velocity of the i-th joint in the reference motion of the humanoid robot; the reference angular velocity of the i-th joint in the reference motion of the humanoid robot; the linear velocity of the i-th joint of the humanoid robot when performing the action corresponding to the reference action; the angular velocity of the i-th joint of the humanoid robot when performing the action corresponding to the reference action; the hyperparameter corresponding to the linear velocity; and the hyperparameter corresponding to the angular velocity.

[0052] Optionally, based on the above embodiments, in some embodiments of the present invention, the undesired contact penalty function may be defined by the following expression:

[0053] where the quantities are: the force received by the humanoid robot at the i-th joint at time t; the set of joints where contact forces are not permitted; and the minimum permissible contact force at each joint of the humanoid robot.
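A simple reading of this penalty is to count the joints in the forbidden set whose contact force exceeds the threshold. The sketch below assumes that counting form; the body part names, the dictionary layout, and the threshold value are hypothetical.

```python
from typing import Dict, Set


def undesired_contact_penalty(contact_forces: Dict[str, float],
                              forbidden: Set[str],
                              threshold: float) -> float:
    """Negative count of forbidden joints whose force exceeds the threshold (assumed form)."""
    count = sum(
        1 for name in forbidden
        if contact_forces.get(name, 0.0) > threshold
    )
    return -float(count)


forces = {"torso": 5.0, "knee": 0.5, "foot": 300.0}  # hypothetical forces
penalty = undesired_contact_penalty(forces, {"torso", "knee"}, threshold=1.0)
```

Here the foot is not in the forbidden set, so its large force is ignored, and only the torso contributes to the penalty.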

[0054] Optionally, based on the above embodiments, in some embodiments of the present invention, one implementation of the reward function for a high-dynamic motion safety recovery task can be: S203: Constructs a shoulder height relative error penalty function, a relative posture penalty function, and a displacement penalty function of the x-y axis before standing for performing a high-dynamic motion safety recovery task.

[0055] S204: Weighted summation of the shoulder height relative error penalty function, the relative posture penalty function, and the displacement penalty function of the x-y axis before standing is performed to construct the reward function for the high dynamic motion safety recovery task.

[0056] Specifically, for the humanoid robot performing the high-dynamic motion safety recovery task, a shoulder height relative error penalty function, a relative posture penalty function, and a displacement penalty function of the x-y axis before standing are constructed. A weighted sum of these three functions is then taken to construct the reward function of the high-dynamic motion safety recovery task.

[0057] Optionally, based on the above embodiments, in some embodiments of the present invention, the reward function for the high-dynamic motion safety recovery task can be defined by the following expression:

[0058] where the expression is a weighted sum in which each of the following functions, evaluated while performing the high-dynamic motion safety recovery task, is multiplied by its own weight parameter: the shoulder height relative error penalty function; the relative posture penalty function; and the displacement penalty function of the x-y axis before standing.

[0059] Optionally, based on the above embodiments, in some embodiments of the present invention, the shoulder height relative error penalty function may be defined by the following expression:

[0060] where the quantities are: the shoulder height in the reference pose of the humanoid robot; and the shoulder height of the humanoid robot when performing the action corresponding to the reference action.

[0061] Optionally, based on the above embodiments, in some embodiments of the present invention, the relative posture penalty function may be defined by the following expression:

[0062] where the expression is written in terms of the joint degrees of freedom of the root node of the humanoid robot.

[0063] Optionally, based on the above embodiments, in some embodiments of the present invention, the displacement penalty function of the x-y axis before standing may be defined by the following expression:

[0064] where the quantities are: the x-axis and y-axis coordinates of the root node of the humanoid robot; the reference height difference of the humanoid robot's shoulder position; and a custom parameter.

[0065] S12: Use an off-policy reinforcement learning algorithm to train an initial humanoid robot motion tracking control model to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model.

[0066] Here, the reinforcement learning algorithm may be FastSAC, an off-policy reinforcement learning algorithm designed for high-degree-of-freedom continuous control problems. The humanoid robot motion tracking control model consists of a policy network and a value network, and is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot. A high-dynamic motion tracking safety decision refers to the control commands given to the humanoid robot's joints when it performs high-dynamic actions.

[0067] Specifically, the initial humanoid robot motion tracking control model is trained using a policy-based reinforcement learning algorithm to solve the Markov decision process, thereby obtaining a trained humanoid robot motion tracking control model. Based on the trained humanoid robot motion tracking control model, high-dynamic motion tracking safety decisions for the humanoid robot are obtained.

[0068] Optionally, based on the above embodiments, in some embodiments of the present invention, S12 may be implemented as follows: S120: Obtain the initial task state space.

[0069] The initial task state space is determined based on the high-dynamic motion tracking task and the high-dynamic motion safety recovery task. Therefore, the initial task state space is either the first initial task state space corresponding to the high-dynamic motion tracking task or the second initial task state space corresponding to the high-dynamic motion safety recovery task. The first initial task state space can be determined based on the initial posture of the reference motion in the state space. The second initial task state space can be randomly determined from a preset posture set. This preset posture set consists of postures collected when controlling the humanoid robot to perform preset high-dynamic actions. Examples of preset high-dynamic actions include: releasing the humanoid robot from different heights; applying an external thrust of random direction and magnitude to the humanoid robot during free fall; and waiting for a period of time before the humanoid robot collides with the ground.
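The choice between the two kinds of initial task state can be sketched in Python as follows. This is a minimal illustration, not the method as claimed: the function name, the string labels, and the use of a single probability `p_track` to pick the task are assumptions, and uniform sampling is only one way to draw "randomly" from the preset posture set.

```python
import random


def sample_initial_state(p_track, reference_initial_pose, preset_postures, rng):
    """Return (task, initial_state).

    With probability p_track the episode is a tracking episode that starts from
    the initial posture of the reference motion; otherwise it is a recovery
    episode that starts from a posture drawn uniformly from the preset set.
    """
    if rng.random() < p_track:
        return "tracking", reference_initial_pose
    return "recovery", rng.choice(preset_postures)
```

For example, `sample_initial_state(0.7, ref_pose, fallen_postures, random.Random(0))` would start roughly 70% of episodes from the reference motion and the rest from a collected fallen posture.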

[0070] S121: Input the initial task state space into the policy network to obtain the predicted high-dynamic motion tracking safety decision.

[0071] S122: Update the initial task state space based on the predicted high-dynamic motion tracking safety decision, and calculate the reward value of the predicted high-dynamic motion tracking safety decision through the reward function.

[0072] S123: Input the predicted high-dynamic motion tracking safety decision, reward value, and updated initial task state space into the value network, and train the humanoid robot motion tracking control model based on the policy optimization objective loss function until the model converges, thus obtaining the trained humanoid robot motion tracking control model.

[0073] The policy optimization objective loss function is used to adjust the model parameters during the training of the humanoid robot motion tracking control model. The policy optimization objective loss function can be defined by the following expression:

[0074] Here, the terms denote, in order: the value function of performing an action in state s; the probability of performing that action in state s; any state in the state space; the action corresponding to that state in the action space; and a hyperparameter.

[0075] Specifically, a first initial task state space corresponding to a high-dynamic motion tracking task, or a second initial task state space corresponding to a high-dynamic motion safety recovery task, is obtained. After obtaining the first or second initial task state space, it is input into the policy network to obtain a predicted high-dynamic motion tracking safety decision. The initial task state space is updated based on the predicted high-dynamic motion tracking safety decision, and the reward value of the current predicted high-dynamic motion tracking safety decision is calculated using a reward function. Further, the predicted high-dynamic motion tracking safety decision, the reward value, and the updated initial task state space are input into the value network, and the humanoid robot motion tracking control model is trained based on the policy optimization objective loss function. The model parameters are adjusted until the model converges, resulting in a trained humanoid robot motion tracking control model.
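The flow of steps S120 to S123 can be sketched as a generic interaction loop. The Python skeleton below is an assumption-laden outline rather than the patented training procedure: the callback names (`env_reset`, `env_step`, `policy`, `update`), the replay buffer, and the mini-batch sampling are illustrative choices typical of off-policy methods, and the actual network update (the loss function of S123) is left to the caller-supplied `update` function.

```python
import random
from collections import deque


def collect_and_train(env_reset, env_step, policy, update, steps,
                      buffer_size=1000, batch_size=4, seed=0):
    """Run `steps` interaction steps and call `update` on sampled mini-batches."""
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)            # hypothetical replay buffer
    state = env_reset(rng)                        # S120: initial task state
    for _ in range(steps):
        action = policy(state, rng)               # S121: predicted decision
        next_state, reward, done = env_step(state, action)  # S122: update state, compute reward
        buffer.append((state, action, reward, next_state, done))
        if len(buffer) >= batch_size:
            batch = rng.sample(list(buffer), batch_size)
            update(batch)                         # S123: value/policy update (caller-defined)
        state = env_reset(rng) if done else next_state
    return buffer
```

Keeping the update behind a callback leaves the sketch independent of any particular loss; in practice the loop would run until a convergence criterion is met rather than for a fixed number of steps.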

[0076] Optionally, based on the above embodiments, in some embodiments of the present invention, the humanoid robot motion tracking control model may be used to obtain high-dynamic motion tracking safety decisions for the humanoid robot as follows: S30: Acquire the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot.

[0077] S31: Determine the state space based on the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot.

[0078] S32: Input the state space into the humanoid robot motion tracking control model to obtain the high dynamic motion tracking safety decision of the humanoid robot.

[0079] Specifically, the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot are first acquired. The state space is then determined from these four states. Finally, the state space is input into the humanoid robot motion tracking control model, and the high-dynamic motion tracking safety decision of the humanoid robot is obtained through the policy network in the model.
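Steps S30 to S32 amount to assembling one state vector and passing it to the policy. A minimal Python sketch follows; it assumes each state group is already available as a flat list of numbers and that the policy is any callable mapping a state vector to joint commands. The function names and the simple concatenation order are hypothetical.

```python
def build_state(joint_state, base_posture_state, contact_state, reference_action_state):
    """S31: concatenate the four state groups into one flat state vector."""
    return [*joint_state, *base_posture_state, *contact_state, *reference_action_state]


def decide(policy, joint_state, base_posture_state, contact_state, reference_action_state):
    """S32: feed the state vector to the policy and return the joint control commands."""
    state = build_state(joint_state, base_posture_state, contact_state,
                        reference_action_state)
    return policy(state)
```

In a deployed controller, `policy` would be the trained policy network, and `decide` would be called once per control cycle with freshly acquired measurements.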

[0080] Thus, the humanoid robot motion tracking control method provided in this embodiment first constructs a target optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task, addressing the high-dynamic motion tracking safety control problem of humanoid robots. This target optimization problem function is further modeled as a Markov decision process. Finally, an initial humanoid robot motion tracking control model is trained using a policy-based reinforcement learning algorithm to solve the Markov decision process, resulting in a trained humanoid robot motion tracking control model. This model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot and consists of a policy network and a value network. This approach enables unified control of both high-dynamic motion tracking and high-dynamic motion safety recovery tasks, avoiding the common practice in existing technologies of using different control methods or staged control methods to handle these two types of tasks separately, thereby effectively achieving safe motion tracking control of humanoid robots.

[0081] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time, but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

[0082] In one embodiment, as shown in Figure 2, which is a schematic diagram of a humanoid robot motion tracking and control device provided in an embodiment of the present invention, the device includes: a target optimization problem function construction and acquisition module 10, a modeling module 11, and a solution module 12.

[0083] The target optimization problem function construction and acquisition module 10 is used to construct a target optimization problem function for the high-dynamic motion tracking safety control problem of humanoid robots, based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task.

[0084] The modeling module 11 is used to model the target optimization problem function as a Markov decision process.

[0085] The solution module 12 is used to train an initial humanoid robot motion tracking control model using a policy-based reinforcement learning algorithm to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model. The humanoid robot motion tracking control model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot. The humanoid robot motion tracking control model consists of a policy network and a value network.

[0086] Thus, the humanoid robot motion tracking control device provided in this embodiment, through a target optimization problem function construction and acquisition module, addresses the high-dynamic motion tracking safety control problem of humanoid robots by constructing a target optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task. The modeling module models the target optimization problem function as a Markov decision process. The solution module uses a policy-based reinforcement learning algorithm to train an initial humanoid robot motion tracking control model to solve the Markov decision process, obtaining a trained humanoid robot motion tracking control model. This model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot and consists of a policy network and a value network. In this way, unified control of the high-dynamic motion tracking task and the high-dynamic motion safety recovery task can be achieved, avoiding the situation in existing technologies where different control methods or staged control methods are typically used to handle these two types of tasks separately, thereby effectively achieving motion tracking safety control of the humanoid robot.
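The three-module structure of the device can be pictured as a simple pipeline. The Python sketch below is only a structural illustration: the class name, field names, and the idea of passing each module in as a callable are assumptions, and the modules themselves are left abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MotionTrackingControlDevice:
    """Hypothetical composition of modules 10, 11, and 12."""
    construct_objective: Callable[[Any, Any], Any]  # module 10: build the target optimization problem function
    model_as_mdp: Callable[[Any], Any]              # module 11: model it as a Markov decision process
    solve: Callable[[Any], Any]                     # module 12: train the control model on the MDP

    def run(self, tracking_reward, recovery_reward):
        """Chain the modules: rewards -> objective -> MDP -> trained model."""
        objective = self.construct_objective(tracking_reward, recovery_reward)
        mdp = self.model_as_mdp(objective)
        return self.solve(mdp)
```

Because each module is an independent callable, it could equally be backed by software, hardware, or a combination of the two.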

[0087] Specific limitations regarding the humanoid robot motion tracking and control device can be found in the limitations of the humanoid robot motion tracking and control method described above, and will not be repeated here. Each module in the aforementioned device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0088] This invention provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it can implement the humanoid robot motion tracking and control method provided in this invention. For example, when the processor executes the computer program, it can implement the technical solution of the method embodiment shown in Figure 1; the implementation principle and technical effects are similar and will not be described again here.

[0089] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, databases, or other media used in the embodiments provided by this invention can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM), etc.

[0090] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0091] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.

Claims

1. A method for motion tracking and control of a humanoid robot, characterized in that the method includes: constructing, for the high-dynamic motion tracking safety control problem of a humanoid robot, a target optimization problem function based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task; modeling the target optimization problem function as a Markov decision process; and training an initial humanoid robot motion tracking control model using a policy-based reinforcement learning algorithm to solve the Markov decision process, thereby obtaining a trained humanoid robot motion tracking control model, wherein the humanoid robot motion tracking control model is used to obtain high-dynamic motion tracking safety decisions for the humanoid robot and consists of a policy network and a value network.

2. The method according to claim 1, characterized in that the target optimization problem function can be defined by the following expression, in which the terms denote, in order: the discount factor at the k-th time step; the reward function for the high-dynamic motion tracking task; the high-dynamic motion safety recovery task indicator function at the k-th time step; the reward function for the high-dynamic motion safety recovery task; the probability of performing the high-dynamic motion tracking task, with one minus this probability being the probability of performing the high-dynamic motion safety recovery task; the task execution indicator for the high-dynamic motion tracking task and the high-dynamic motion safety recovery task; the reference action state at the k-th time step; the initial reference motion state for the high-dynamic motion tracking task or the high-dynamic motion safety recovery task; the set of non-safety reference motion state data corresponding to the high-dynamic motion safety recovery task; and the set of reference motion state data corresponding to the high-dynamic motion tracking task.

3. The method according to claim 2, characterized in that the step of modeling the target optimization problem function as a Markov decision process includes: establishing a state space, an action space, and a reward function for the target optimization problem function; wherein the state space includes the joint information state, the base posture information state, the contact information state, and the reference action state of the humanoid robot; the action space includes the control commands for the joints of the humanoid robot when performing high-dynamic actions; and the reward function is determined based on the reward function for the high-dynamic motion tracking task and the reward function for the high-dynamic motion safety recovery task.

4. The method according to claim 3, characterized in that determining the reward function based on the reward function for the high-dynamic motion tracking task and the reward function for the high-dynamic motion safety recovery task includes: determining the reward function according to the formula, in which the two weight terms represent, respectively, the weight parameter of the reward function for the high-dynamic motion tracking task and the weight parameter of the reward function for the high-dynamic motion safety recovery task.

5. The method according to claim 4, characterized in that the method further includes: constructing, for performing the high-dynamic motion tracking task, a motion change rate penalty function, a joint limit penalty function, a global reference position error function, a global reference posture error function, a relative body position error function, a linear velocity and angular velocity consistency function, and an undesired contact penalty function; and constructing the reward function for the high-dynamic motion tracking task as a weighted sum of these functions.

6. The method according to claim 5, characterized in that the method further includes: constructing, for performing the high-dynamic motion safety recovery task, a shoulder height relative error penalty function, a relative posture penalty function, and a displacement penalty function along the x-y axes before standing; and constructing the reward function for the high-dynamic motion safety recovery task as a weighted sum of these three functions.

7. The method according to claim 6, characterized in that using the humanoid robot motion tracking control model to obtain high-dynamic motion tracking safety decisions for the humanoid robot includes: acquiring the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot; determining the state space based on the joint information state, base posture information state, contact information state, and reference action state of the humanoid robot; and inputting the state space into the humanoid robot motion tracking control model to obtain the high-dynamic motion tracking safety decision of the humanoid robot.

8. The method according to claim 7, characterized in that training an initial humanoid robot motion tracking control model using a policy-based reinforcement learning algorithm to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model includes: obtaining the initial task state space, which is the first initial task state space corresponding to the high-dynamic motion tracking task or the second initial task state space corresponding to the high-dynamic motion safety recovery task; inputting the initial task state space into the policy network to obtain the predicted high-dynamic motion tracking safety decision; updating the initial task state space based on the predicted high-dynamic motion tracking safety decision, and calculating the reward value of the predicted high-dynamic motion tracking safety decision using the reward function; and inputting the predicted high-dynamic motion tracking safety decision, the reward value, and the updated initial task state space into the value network, and training the humanoid robot motion tracking control model based on the policy optimization objective loss function until the model converges, thus obtaining the trained humanoid robot motion tracking control model.

9. A humanoid robot motion tracking and control device, characterized in that the device includes: a target optimization problem function construction and acquisition module, used to construct a target optimization problem function for the high-dynamic motion tracking safety control problem of humanoid robots, based on the reward function of the high-dynamic motion tracking task and the reward function of the high-dynamic motion safety recovery task; a modeling module, used to model the target optimization problem function as a Markov decision process; and a solution module, used to train an initial humanoid robot motion tracking control model using a policy-based reinforcement learning algorithm to solve the Markov decision process and obtain a trained humanoid robot motion tracking control model, wherein the humanoid robot motion tracking control model is used to obtain the high-dynamic motion tracking safety decision of the humanoid robot and consists of a policy network and a value network.

10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the humanoid robot motion tracking and control method according to any one of claims 1 to 8.