Modular multi-legged robot unified reinforcement learning control method and system
By constructing a unified action space and observation space, encoding the robot's topology with a graph neural network, and combining an action masking mechanism with a multi-critic value network, the method addresses policy transfer when the robot's structure changes and allows robots with different configurations to be controlled by the same policy network.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SHANDONG UNIV
- Filing Date: 2026-05-26
- Publication Date: 2026-06-26
AI Technical Summary
Existing technologies struggle to effectively transfer reinforcement learning strategies when robot structures change, and traditional methods require retraining control strategies for each robot configuration, resulting in high development costs, low training efficiency, and difficulty in reusing strategies.
A unified action space and observation space are constructed, and a graph neural network encodes the robot topology. The policy network generates a unified joint action control vector, which is combined with an action mask to produce the action output for each configuration. The networks are trained jointly with a multi-critic value network, so that robots with different configurations act under the same policy network.
This approach enables robots with different leg counts to share the same reinforcement learning strategy, reducing training costs, improving strategy learning efficiency and generalization ability, and enhancing adaptability to new configurations.
Figure CN122275014A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of robot control and reinforcement learning technology, and in particular to a unified reinforcement learning control method and system for modular multi-legged robots.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] With the development of robotics and artificial intelligence technologies, the mobility of legged robots in complex environments has gradually attracted attention. Traditional control methods for legged robots are typically designed for a single robot configuration, such as establishing separate control models, state spaces, and motion spaces for bipedal, quadrupedal, or hexapedal robots. When the number of legs, joints, or structural topology of the robot changes, it is often necessary to rebuild the control model and retrain the control strategy, resulting in high development costs, low training efficiency, and difficulty in reusing the strategy.
[0004] In recent years, reinforcement learning methods have been widely applied in the field of robot motion control. Through continuous interaction and optimization in a simulation environment, reinforcement learning can automatically learn complex control policies. However, existing reinforcement learning methods typically design fixed-dimensional observation and action spaces for specific robot structures. When the robot structure changes, the policy network struggles to directly transfer to the new robot form. Furthermore, traditional policy networks often use multilayer perceptrons (MLPs) to directly process state vectors, lacking the ability to model robot joint connections and mechanical topology, making it difficult to fully utilize robot morphological information.
[0005] With the development of modular robotics technology, reusing standardized leg modules to construct robot configurations with different numbers of legs has become an important design approach. For example, based on a unified leg drive unit, bipedal, quadrupedal, or hexapedal robots can be formed by increasing the number of modules and adjusting the body structure. However, existing control methods usually require training a control strategy separately for each robot configuration, making it difficult to achieve unified control.
Summary of the Invention
[0006] To address the aforementioned issues, this invention proposes a unified reinforcement learning control method and system for modular multi-morphological legged robots. It constructs a unified action space and a unified observation space, and generates action masks based on robot morphology types, enabling robots with different configurations to complete action outputs under the same policy network.
[0007] In some implementations, the following technical solution is adopted:
A unified reinforcement learning control method for modular multi-morphological legged robots, including:
constructing legged robot configurations with different numbers of legs, and establishing a unified action space based on the joint degrees of freedom of the robot with the largest number of legs;
mapping the base state, joint positions, joint velocities, and historical action information of the different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space;
taking the joints of the target robot as nodes and the joint positions and joint velocities as node features, constructing a topology graph based on the mechanical connections between the joints of the target robot, and generating the morphological feature vector of the target robot with a graph neural network;
inputting the morphological feature vector and the unified observation vector into the policy network to generate a unified joint action control vector, and combining it with the action mask of the target robot configuration to generate the joint action control vector of the target robot configuration.
[0008] As a further option, the method also includes:
using the morphological feature vector and the unified observation vector as inputs to the multi-critic value network, and outputting the state value estimate of each critic branch;
generating the fusion weights of the critic branches from the morphological feature vector of the target robot to obtain the final state value estimate, which represents the expected long-term cumulative return of the current state under the current policy;
based on the final state value estimate, jointly training the graph neural network, policy network, and multi-critic value network with a reinforcement learning algorithm to update the network parameters.
[0009] As a further embodiment, the multi-critic value network also includes a reward function that generates immediate rewards during training based on the robot's state and task completion. The reward function is a weighted sum of reward terms and penalty terms. The reward terms include a speed tracking reward, a gait quality reward, a foot lift reward, and a survival reward. The penalty terms include an energy consumption penalty, a joint velocity penalty, a joint acceleration penalty, an action rate-of-change penalty, a joint over-limit penalty, a foot slip penalty, a posture stability penalty, and an illegal contact penalty.
[0010] As a further approach, when the graph neural network, policy network, and multi-critic value network are jointly trained with a reinforcement learning algorithm, the loss function is constructed as: $L = L_{policy} + \lambda_1 L_{value} + \lambda_2 L_{entropy} + \lambda_3 L_{sym}$; where $L_{policy}$ is the policy loss, $L_{value}$ is the value loss, $L_{entropy}$ is the entropy regularization term, and $L_{sym}$ is the symmetry constraint loss; $\lambda_1$, $\lambda_2$, $\lambda_3$ are weighting coefficients, which can be set according to actual needs.
[0011] As a further embodiment, the unified motion space is a fixed-dimensional motion vector, the dimension of which is consistent with the number of joint degrees of freedom corresponding to the robot with the maximum number of legs; Each dimension of the motion vector corresponds to a control command for a joint, and the control command is specifically the joint target position increment.
[0012] As a further solution, the fixed-dimensional observation vector specifically includes: the angular velocity of the robot base, the gravity projection, the angles of each joint of the robot, the angular velocities of each joint of the robot, and the previous action information of each joint of the robot.
[0013] As a further approach, the joint action control vector of the target robot configuration is generated by combining the action mask of the target robot configuration, specifically: $q^{target} = q^{default} + \alpha\,(m \odot a)$; where $q^{target}$ is the target joint position, $q^{default}$ is the default joint angle, $\alpha$ is the action scaling factor, $m$ is the action mask of the target robot, and $a$ is the original action vector output by the policy. The operator $\odot$ means that the action mask is multiplied element-wise by the original action vector, so that only the activated joints produce action offsets.
[0014] In other embodiments, the following technical solution is adopted:
A unified reinforcement learning control system for modular multi-morphological legged robots, including:
a unified action space construction module, configured to construct legged robot configurations with different numbers of legs and to establish a unified action space based on the joint degrees of freedom of the robot with the largest number of legs;
a unified observation generation module, configured to map the base state, joint positions, joint velocities, and historical action information of the different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space;
a morphological graph encoding module, configured to construct a topology graph based on the mechanical connections between the joints of the target robot, with the joint positions and joint velocities as node features, and to generate the morphological feature vector of the target robot with a graph neural network;
a strategy output module, configured to input the morphological feature vector and the unified observation vector into the policy network to generate a unified joint action control vector, and to combine it with the action mask of the target robot configuration to generate the joint action control vector of the target robot configuration.
[0015] In other embodiments, the following technical solutions are adopted: A terminal device includes a processor and a memory, wherein the processor is used to implement instructions; and the memory is used to store multiple instructions adapted to be loaded and executed by the processor to implement the above-described modular multimorphic legged robot unified reinforcement learning control method.
[0016] In other embodiments, the following technical solution is adopted: A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the above-described modular multimorphic legged robot unified reinforcement learning control method.
[0017] Compared with the prior art, the beneficial effects of the present invention are: This invention enables bipedal, quadrupedal, and hexapedal robots to share the same reinforcement learning strategy by unifying the action space and observation space design; it encodes the robot joint topology using a graph neural network, allowing the policy network to better adapt to different robot configurations; and it combines the action masking mechanism of the target robot to achieve policy learning for robots with different configurations under a unified training framework, enabling robots with different configurations to complete their respective action outputs under the same policy network.
[0018] Other features and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Description of the Drawings
[0019] Figure 1 is a flowchart of the unified reinforcement learning control method for modular multi-morphological legged robots in an embodiment of the present invention;
Figure 2 is a schematic diagram of the multi-configuration legged robot structure in an embodiment of the present invention;
Figure 3 is a schematic diagram demonstrating the invocation of a trained strategy in a simulation environment in an embodiment of the present invention.
Detailed Implementation
[0020] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0021] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0022] Example 1
In one or more embodiments, a unified reinforcement learning control method for modular multimorphic legged robots is disclosed. With reference to Figure 1, it includes the following steps:
S101: Construct legged robot configurations with different numbers of legs, and establish a unified action space based on the joint degrees of freedom of the robot with the largest number of legs.
[0023] As a specific implementation method, three robot configurations are constructed, including bipedal, quadrupedal, and hexapedal robots. Each robot configuration uses the same leg drive module, and each leg contains multiple joint drive units. By changing the number of leg modules and the body structure, robots with different numbers of legs can be formed.
[0024] Specifically, as shown in Figure 2, a modular design concept is adopted: a bipedal robot serves as the basic motion unit, and connecting modules are used to construct legged robot configurations with different numbers of legs. The left side of Figure 2 shows a single bipedal robot, the middle shows the body connection module, and the right side shows different robot configurations formed by splicing multiple bipedal robots together with body connection modules. By changing the number of bipedal robots, various configurations such as bipedal, quadrupedal, and hexapedal robots can be formed. This demonstrates that the robot system in this embodiment has good modular expansion capabilities, enabling the rapid construction of robots of different shapes through structural splicing while maintaining consistency in the basic drive units.
[0025] Furthermore, the structure shown in Figure 2 illustrates that the control method in this embodiment is not designed for a single fixed-configuration robot, but rather for modular, multi-legged robot configurations. Therefore, the control method requires a unified observation space, a unified action space, and a morphological representation mechanism adapted to the robot configuration, so that the same control framework is applicable to robots with different numbers of legs.
[0026] Of course, the above structural forms are just examples, and more other robot configurations can be set up according to actual needs.
[0027] A unified motion space is established based on the joint degrees of freedom corresponding to the robot with the largest number of legs. The motion space is a fixed-dimensional motion vector, and the dimension of the motion space is consistent with the number of joint degrees of freedom corresponding to the robot with the largest number of legs. Each dimension in the motion vector corresponds to a control command for a joint. In this embodiment, the control command is designed in the form of joint target position increment.
[0028] The specific method for constructing the unified action space in this embodiment is as follows: S1011: Establish a unified sequence rule for all possible joints in a modular multi-legged robot system. First, sort each leg, then sort the internal joints of each leg to obtain a standard joint sequence.
[0029] S1012: Select the robot configuration with the most legs (i.e., the most degrees of freedom) as the dimensional benchmark for the unified motion space. For example, if the system includes three configurations: bipedal, quadrupedal, and hexapedal, and each leg has 3 driveable joints, then the robot configuration with the most legs is the hexapedal robot, with a total of 3 × 6 = 18 motion dimensions. The unified motion space is defined as an 18-dimensional vector, where each dimension of the vector corresponds one-to-one with the robot's joints according to a preset joint sequence. Specifically, the bipedal robot activates only the first 6 dimensions of the unified motion space, the quadrupedal robot activates only the first 12 dimensions of the unified motion space, and the hexapedal robot activates all 18 dimensions of the unified motion space.
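The sequence rule of S1011 and the dimensional benchmark of S1012 can be sketched as follows. This is a minimal illustration, assuming a leg-major ordering (all joints of leg 0, then leg 1, and so on); the names `joint_index` and `active_dims` are hypothetical and do not come from the patent text.

```python
# Values taken from the worked example in the text: 3 joints per leg,
# with bipedal, quadrupedal, and hexapedal configurations.
JOINTS_PER_LEG = 3
LEG_COUNTS = {"biped": 2, "quadruped": 4, "hexapod": 6}

MAX_LEGS = max(LEG_COUNTS.values())
ACTION_DIM = MAX_LEGS * JOINTS_PER_LEG  # 3 x 6 = 18 action dimensions


def joint_index(leg: int, joint: int) -> int:
    """Map (leg, joint) to its slot in the unified action vector.

    Assumption: legs are sorted first, then the joints inside each leg.
    """
    if not (0 <= leg < MAX_LEGS and 0 <= joint < JOINTS_PER_LEG):
        raise ValueError("leg or joint index out of range")
    return leg * JOINTS_PER_LEG + joint


def active_dims(config: str) -> range:
    """Action dimensions that a given configuration actually drives."""
    return range(LEG_COUNTS[config] * JOINTS_PER_LEG)
```

With this ordering, adding a leg module only appends dimensions at the end of the vector, which is why the bipedal and quadrupedal robots use a prefix of the 18-dimensional space.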
[0030] S102: Map the base state, joint position, joint velocity, and historical motion information of different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space.
[0031] In this embodiment, the fixed-dimensional observation vector includes the angular velocity (3D) of the robot base, the gravity projection (3D), the angles of each joint of the robot (the dimension is consistent with the degree of freedom of the robot with the largest number of legs), the angular velocity of each joint of the robot (the dimension is consistent with the degree of freedom of the robot with the largest number of legs), and the previous action information of each joint of the robot (the dimension is consistent with the degree of freedom of the robot with the largest number of legs).
[0032] For robot configurations with fewer degrees of freedom, only the dimensions corresponding to the actual joints in the unified observation vector are filled with effective joint states, while the remaining unused dimensions are filled with zeros. For robot configurations with more degrees of freedom, more effective dimensions are filled in. Thus, robots with different numbers of legs can be mapped to fixed observation vectors of the same dimensions, thereby serving as inputs to the same policy network.
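The zero-padding scheme described above can be sketched as follows. This is a minimal illustration, assuming NumPy and an 18-degree-of-freedom maximum configuration; the function name `build_unified_observation` and the ordering of the blocks inside the vector are assumptions, not details given in the text.

```python
import numpy as np

MAX_DOF = 18                    # hexapod: 6 legs x 3 joints (from the example)
OBS_DIM = 3 + 3 + 3 * MAX_DOF   # angular velocity, gravity, pos, vel, prev action


def build_unified_observation(base_ang_vel, gravity_proj,
                              joint_pos, joint_vel, prev_action):
    """Map a robot state with n <= MAX_DOF joints to a fixed-size vector."""
    n = len(joint_pos)
    if n > MAX_DOF or len(joint_vel) != n or len(prev_action) != n:
        raise ValueError("inconsistent or oversized joint arrays")

    def pad(values):
        # Real joints fill the leading slots; unused slots stay zero.
        out = np.zeros(MAX_DOF, dtype=np.float32)
        out[:n] = values
        return out

    return np.concatenate([
        np.asarray(base_ang_vel, dtype=np.float32),
        np.asarray(gravity_proj, dtype=np.float32),
        pad(joint_pos),
        pad(joint_vel),
        pad(prev_action),
    ])
```

Under these assumptions every configuration yields a 60-dimensional vector, so the same policy network input layer can serve all three robots.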
[0033] This embodiment constructs a unified action space and observation space, enabling bipedal, quadrupedal, and hexapedal robots to share the same reinforcement learning strategy. This allows for full utilization of the common characteristics of different robot configurations in motion control, achieving cross-configuration knowledge transfer and experience sharing. It reduces the complexity of designing, training, and maintaining control strategies separately for different robots, lowers training costs and sample consumption, improves strategy learning efficiency and generalization ability, and enhances the adaptability of the control method to new robot configurations.
[0034] S103: Using the joints of the target robot as nodes and the joint position and joint velocity as node features, a topology graph is constructed based on the mechanical connection relationship of the joints of the target robot, and the morphological feature vector of the target robot is generated through a graph neural network.
[0035] In this embodiment, in order to enable the control strategy to perceive different robot morphologies, a robot topology graph is further constructed based on a unified observation space, and morphological features are extracted through a graph neural network.
[0036] Specifically, a topological graph representing the robot's structural features is constructed using robot joints as graph nodes and mechanical connections, kinematic parent-child relationships, or pre-defined structural connections between joints as graph edges. In some implementations, a unified topological graph based on the maximum degree of freedom configuration is used. This unified topological graph has a fixed number of nodes and pre-defined connecting edges. Different robot configurations are mapped onto this unified topological graph through effective filling of node features, invalid zeroing, or masking of corresponding motion dimensions.
[0037] In this embodiment, node features include the position and velocity information of the corresponding joint. The topology graph is constructed as: $G = (V, E)$; where $V$ is the set of joint nodes and $E$ is the set of mechanical connection edges. The node features are: $x_i = [q_i, \dot{q}_i]$; where $q_i$ is the joint position and $\dot{q}_i$ is the joint velocity.
[0038] After the topology graph $G$ and the node features are input into the graph neural network, node features are propagated through the network according to the following formula: $h_i^{(k+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} W^{(k)} h_j^{(k)}\right)$; where $h_i^{(k)}$ is the layer-$k$ feature of node $i$, $\mathcal{N}(i)$ is the set of nodes adjacent to node $i$, $h_j^{(k)}$ is the layer-$k$ feature of neighboring node $j$, $W^{(k)}$ is a learnable weight matrix (a network parameter), and $\sigma$ is a non-linear activation function.
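[0039] The final morphological feature vector, which represents the current structural form of the robot, is: $z = \mathrm{Pool}\left(\{h_i^{(K)} \mid i \in V\}\right)$; where $\mathrm{Pool}(\cdot)$ denotes the pooling operation of the graph neural network and $K$ is the final layer.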
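The encoding step can be illustrated with a small NumPy sketch. Several details here are assumptions rather than statements of the patent: neighbor features are aggregated by a plain sum, `tanh` stands in for the activation function, pooling is a mean over nodes, and the hip joints of all legs are treated as connected through the body. The names `chain_adjacency` and `encode_morphology` are hypothetical.

```python
import numpy as np


def chain_adjacency(num_legs: int, joints_per_leg: int = 3) -> np.ndarray:
    """Adjacency matrix: joints chained within a leg, hips linked via the body."""
    n = num_legs * joints_per_leg
    adj = np.zeros((n, n))
    for leg in range(num_legs):
        base = leg * joints_per_leg
        for j in range(joints_per_leg - 1):
            adj[base + j, base + j + 1] = 1.0
            adj[base + j + 1, base + j] = 1.0
    hips = [leg * joints_per_leg for leg in range(num_legs)]
    for a in hips:
        for b in hips:
            if a != b:
                adj[a, b] = 1.0  # assumed body connection between hip joints
    return adj


def encode_morphology(adj, node_features, weights):
    """Propagate node features layer by layer, then pool to one vector."""
    h = node_features                    # shape (num_nodes, 2): position, velocity
    for w in weights:
        h = np.tanh(adj @ h @ w)         # sum over neighbors, then linear map
    return h.mean(axis=0)                # mean pooling over all joint nodes
```

For a bipedal robot, `node_features` would have shape `(6, 2)`; the returned vector has the width of the last weight matrix regardless of the number of joints, which is what lets one policy network consume it for every configuration.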
[0040] This embodiment uses a graph neural network based on the robot's real joint topology to perform morphological encoding of joint position and joint velocity. This can accurately characterize whether the current robot belongs to different configurations such as bipedal, quadrupedal, or hexapedal, and reflect the differences in its joint organization and kinematic structure.
[0041] S104: Input the morphological feature vector and the unified observation vector into the policy network to generate a unified joint motion control vector. Combine this with the motion mask of the target robot configuration to generate the joint motion control vector of the target robot configuration.
[0042] This embodiment constructs a unified policy network, which takes the unified state observation vector at the current time together with the morphological feature vector as input and outputs a fixed-dimensional joint action control vector. The dimension of the joint action control vector is consistent with the degrees of freedom of the maximum-degree-of-freedom robot configuration, providing a unified action representation for all potentially controllable joints. As an example, the joint action control vector is an 18-dimensional action vector, where each dimension represents the control quantity of the corresponding joint, i.e., the increment of the joint target position.
[0043] Since the actual number of joints varies in different robot configurations, an action masking mechanism is further constructed.
[0044] In this embodiment, an action mask is generated based on the target robot configuration, so that the target robot configuration activates only a portion of the action dimensions in the unified action space, resulting in an action vector $a^*$ with effective dimensions.
[0045] An action mask is a binary vector with the same dimension as the unified action space, used to indicate whether each action dimension is valid under the current robot configuration. Let the unified action space have dimension $d$. The action mask is defined as: $m = [m_1, m_2, \ldots, m_d]$, $m_i \in \{0, 1\}$; where $m_i = 1$ indicates that the joint corresponding to this dimension actually exists in the current robot configuration, so the action output takes effect; $m_i = 0$ indicates that the joint corresponding to this dimension does not exist in the current configuration, or is not currently under control, so this dimension is masked.
[0046] The specific method for constructing the action mask is as follows: First, determine the actual set of joints based on which leg modules the robot is composed of; then map this actual set of joints to the corresponding dimension of the unified motion space, setting the dimension of the actual joints to 1 and the dimension of the non-existent joints to 0, to obtain the motion mask vector corresponding to the current configuration.
[0047] For example, the unified motion space is established according to "six legs, three joints per leg", totaling 18 dimensions.
[0048] A bipedal robot actually has two legs with a total of six joints, so the corresponding action mask is: $m_{biped}$ = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]. This means that only the first 6 dimensions of the action vector are valid, while the remaining 12 dimensions are invalid.
[0049] The quadruped robot actually has four legs and a total of 12 joints, so the corresponding action mask is: $m_{quad}$ = [1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0]. This means that only the first 12 dimensions of the action vector are valid, while the remaining 6 dimensions are invalid.
[0050] The hexapod robot actually has six legs and a total of 18 joints, so the corresponding action mask is: $m_{hexapod}$ = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]. This indicates that all 18 dimensions of the action vector are active.
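The mask construction described above (determine the real joint set from the installed leg modules, then mark those dimensions) can be sketched as follows. This is a minimal illustration, assuming NumPy and a leg-major joint ordering; `build_action_mask` and its parameters are hypothetical names.

```python
import numpy as np


def build_action_mask(active_legs, max_legs: int = 6, joints_per_leg: int = 3):
    """Return a 0/1 vector marking which unified action dimensions are real.

    active_legs: indices of the leg modules present on the current robot.
    """
    mask = np.zeros(max_legs * joints_per_leg, dtype=np.float32)
    for leg in active_legs:
        if not 0 <= leg < max_legs:
            raise ValueError(f"leg index {leg} out of range")
        start = leg * joints_per_leg
        mask[start:start + joints_per_leg] = 1.0
    return mask


# Masks for the three configurations used in the examples.
MASKS = {
    "biped": build_action_mask(range(2)),
    "quadruped": build_action_mask(range(4)),
    "hexapod": build_action_mask(range(6)),
}
```

Passing leg indices rather than a leg count keeps the sketch usable when the installed modules are not a contiguous prefix, although the examples in the text only use prefixes.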
[0051] After the joint action control vector output by the policy network is combined element-wise with the action mask, only the effective action portion for the current robot configuration is retained. This is superimposed on the default joint positions of that configuration to generate the actual joint control target, specifically: $q^{target} = q^{default} + \alpha\,(m \odot a)$; where $q^{target}$ is the target joint position, $q^{default}$ is the default joint angle, $\alpha$ is the action scaling factor, $m$ is the action mask of the target robot, $a$ is the original action vector output by the policy, and $\odot$ denotes element-wise multiplication of the action mask and the original action vector.
[0052] In this embodiment, the original action vector $a$ output by the policy is multiplied element-wise by the action mask $m$ of the target robot, so that only activated joints produce action offsets, thereby masking invalid action dimensions that do not belong to the current robot configuration.
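The masked position-increment rule can be written in a few lines. This is a minimal sketch, assuming NumPy; the function name `joint_targets` and the default scaling value of 0.25 are illustrative assumptions, since the text does not give a value for the scaling factor.

```python
import numpy as np


def joint_targets(raw_action, mask, default_pos, action_scale: float = 0.25):
    """Target joint positions: default pose plus the masked, scaled action.

    action_scale is a hypothetical value; the text leaves it unspecified.
    """
    raw_action = np.asarray(raw_action, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    default_pos = np.asarray(default_pos, dtype=np.float64)
    if not (raw_action.shape == mask.shape == default_pos.shape):
        raise ValueError("action, mask, and default pose must share one shape")
    # Element-wise product zeroes the offsets of joints that do not exist.
    return default_pos + action_scale * (mask * raw_action)
```

Masked dimensions therefore always return the default joint angle, whatever the policy outputs for them.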
[0053] Finally, the actual joint control target is sent to the joint control interface of the currently active robot to drive the target robot to perform the corresponding action.
[0054] After performing the corresponding action and interacting with the environment, the target robot receives a reward $r$. At the same time, the robot's state is updated and output as the unified state observation vector for the next time step.
[0055] As a further implementation, a multi-critic value network is constructed. In this embodiment, the multi-critic value network includes multiple critic branches, each of which corresponds to a bipedal, quadrupedal, or hexapedal robot configuration.
[0056] Using the aforementioned morphological feature vector of the target robot and the unified observation vector as inputs to the multi-critic value network, the state value estimate of each critic branch is output. Then, based on the morphological feature vector of the target robot, the fusion weights of the critic branches are generated to obtain the final state value estimate $v$.
[0057] Specifically, the morphological feature vector $z$ of the target robot is obtained. The morphological feature vector is then input into a gating network, and three unnormalized scores are obtained according to the following formula: $[s_1, s_2, s_3] = W_2\,\phi(W_1 z + b_1) + b_2$; where $W_1$, $b_1$, $W_2$, $b_2$ are the parameters of the gating network, obtained through network training and optimization; $\phi$ is the activation function; and $s_1$, $s_2$, $s_3$ are the original scores corresponding to the bipedal, quadrupedal, and hexapedal critic branches, respectively.
[0058] The original scores are normalized using softmax to obtain the fusion weights: $w_k = \exp(s_k) / \sum_{j=1}^{3} \exp(s_j)$; where $k \in \{1, 2, 3\}$.
[0059] The final fusion weight is a three-dimensional probability vector: $w = [w_1, w_2, w_3]$. The value estimates $v_1$, $v_2$, $v_3$ output by the critic branch networks are fused by weighting to obtain the final state value estimate, specifically: $v = \sum_{k=1}^{3} w_k v_k$; where $\sum_{k=1}^{3} w_k = 1$.
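The gating and fusion steps can be sketched in NumPy as follows. The two-layer form of the gating network, the `tanh` activation, and the names `softmax` and `fuse_values` are assumptions made for illustration; the text only states that a gating network maps the morphological feature vector to three scores that are normalized with softmax.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()


def fuse_values(morph_vec, branch_values, w1, b1, w2, b2):
    """Fuse per-branch value estimates with morphology-dependent weights.

    morph_vec:     morphological feature vector from the graph encoder
    branch_values: value estimates of the biped, quadruped, hexapod critics
    w1, b1, w2, b2: gating network parameters (assumed two-layer form)
    """
    hidden = np.tanh(w1 @ morph_vec + b1)
    scores = w2 @ hidden + b2                 # three unnormalized scores
    weights = softmax(scores)                 # fusion weights, sum to 1
    fused = float(weights @ np.asarray(branch_values, dtype=np.float64))
    return fused, weights
```

Because the weights form a probability vector, the fused estimate always lies between the smallest and largest branch estimates.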
[0060] During training, the gating network is updated through backpropagation of the value loss between the final fused value and the discounted return, thereby gradually learning the weight allocation suitable for different robot forms.
[0061] This state value estimate characterizes the expected long-term cumulative reward of the current state under the current policy. It is used to calculate the advantage function and the target return during reinforcement learning training, which guide the updating of the policy network parameters.
[0062] It should be noted that the calculation of the advantage function and the target return follows the standard calculation process of the actor-critic algorithm in reinforcement learning. However, in this embodiment, the state value estimate used to calculate the target return is not provided by a single critic network. Instead, fusion weights are generated from the morphological feature vector, and the outputs of multiple critic branches are weighted and fused with them.
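One common way to compute these quantities is shown below as a hedged sketch: bootstrapped discounted returns, with the advantage taken as the return minus the fused value estimate. The text does not say which estimator is used (generalized advantage estimation is a frequent alternative), so this choice, the function name, and the default discount factor are assumptions.

```python
import numpy as np


def returns_and_advantages(rewards, fused_values, dones, last_value,
                           gamma: float = 0.99):
    """Discounted returns and simple advantages for one trajectory.

    fused_values: state value estimates after multi-critic fusion
    dones:        1.0 where the episode terminated at that step, else 0.0
    last_value:   fused value estimate of the state after the final step
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = float(last_value)
    for t in reversed(range(len(rewards))):
        # A terminal step cuts off bootstrapping from later states.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    advantages = returns - np.asarray(fused_values, dtype=np.float64)
    return returns, advantages
```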
[0063] Bipedal, quadrupedal, and hexapod robots differ in stability, support methods, gait characteristics, and state-reward mapping relationships. The true long-term cumulative reward for the same observed state varies across different robot configurations. If a single critic network is used to uniformly estimate the value of all robot configurations, it can easily lead to mutual interference between samples of different configurations, causing the value function to tend to average out and thus reducing the accuracy of state value assessment.
[0064] This embodiment sets up multiple critic branches to learn value functions that are better suited to different robot configurations, and adaptively generates fusion weights based on the morphological feature vector of the target robot extracted by the graph neural network. As a result, the critic branch corresponding to the current robot configuration carries a higher weight in the final state value estimate, which yields a value estimate that is more consistent with the true state-reward relationship of the current configuration. At the same time, the overall training framework stays consistent, which improves the accuracy of state value assessment under different robot configurations and enhances the training stability and convergence of the unified control strategy.
[0065] Based on the final state value estimate, a reinforcement learning algorithm is used to jointly train the graph neural network, policy network, and multi-critic value network to update the network parameters.
[0066] This embodiment employs a reinforcement learning algorithm to jointly train the policy network, graph neural network, and value network, thereby obtaining a unified control strategy applicable to various robot configurations.
[0067] Specifically, the training data is generated online by the robot interacting with the environment in a parallel simulation environment.
[0068] In the simulation training environment, bipedal robot models, quadrupedal robot models, and hexapod robot models are loaded simultaneously, and a robot configuration identifier is assigned to each parallel environment. The robot configuration identifier is used to characterize the type of robot that is activated to perform control in the current environment, where bipedal robots, quadrupedal robots, and hexapod robots correspond to different configuration categories.
[0069] During the reinforcement learning environment reset phase, the currently active robot configuration is randomly sampled. For non-currently active robot models, they can be isolated by hiding, sinking them below the ground, or moving them out of the effective simulation area to avoid interference from inactive robots in contact detection, observation acquisition, and action execution. This enables parallel sampling of multiple robot configurations within the same training framework, with each environment corresponding to only one active robot configuration at a single moment.
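The reset logic can be sketched without any simulator dependency. This is a minimal illustration that assumes uniform sampling over the three configurations and represents isolation abstractly as a "parked" state; the names `reset_environment` and `CONFIGS` are hypothetical, and a real environment would instead hide or relocate the inactive models through its physics engine.

```python
import random

CONFIGS = ("biped", "quadruped", "hexapod")


def reset_environment(rng: random.Random):
    """Pick the active configuration for one parallel environment.

    Returns the active configuration name and a placement table in which
    every other loaded robot model is marked as parked (isolated).
    """
    active = rng.choice(CONFIGS)  # assumption: uniform random sampling
    placement = {
        name: ("active" if name == active else "parked")
        for name in CONFIGS
    }
    return active, placement


# Example: assign a configuration to each of 8 parallel environments.
rng = random.Random(0)
assignments = [reset_environment(rng)[0] for _ in range(8)]
```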
[0070] Each interaction records observations, actions, rewards, termination markers, action probabilities, and value estimates to form a trajectory sample.
[0071] Specifically, the policy network outputs a unified action based on the unified observation vector and the morphological embedding generated by the graph neural network. This action, combined with the action mask, drives the robot of the current configuration. After the robot executes the action, the simulation environment updates the robot's state through the physics engine and returns the next-moment state observation, the immediate reward, and the termination flag. The next-moment state observation is then passed through the unified observation mapping and fed back to the policy network input, forming a closed-loop control process.
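How the action mask turns the unified action into joint targets can be sketched as below. The sketch follows the description in claim 7 (the mask is multiplied element-wise with the raw action, scaled, and added to the default joint angle); the function name and the numeric values are hypothetical.

```python
def masked_joint_targets(default_angles, raw_action, mask, action_scale):
    """Target position per joint: default + scale * (mask * raw action)."""
    if not (len(default_angles) == len(raw_action) == len(mask)):
        raise ValueError("all vectors must share the unified action dimension")
    return [
        q0 + action_scale * m * a
        for q0, a, m in zip(default_angles, raw_action, mask)
    ]

# Four-dimensional toy example: the last two joints are inactive.
default_angles = [0.1, -0.2, 0.3, 0.0]
raw_action = [1.0, -1.0, 0.5, 2.0]
mask = [1.0, 1.0, 0.0, 0.0]
targets = masked_joint_targets(default_angles, raw_action, mask, action_scale=0.25)
```

Masked joints keep their default angle no matter what the policy outputs for them, so a single policy head can serve robots with fewer joints than the unified action dimension.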
[0072] The entire process of a robot continuously performing actions, receiving feedback, and updating its state in its environment generates trajectory sample data required for reinforcement learning training. This trajectory sample data is a dataset automatically generated through online interaction between the robot and its environment, and includes at least the current observation, action, reward, next-moment observation, termination marker, and auxiliary information related to policy distribution and value estimation.
[0073] After trajectories of a predetermined length have been collected, the PPO algorithm is used to jointly update the policy network, the graph neural network, and the multi-critic value network together with its fusion weight network, yielding a unified control policy applicable to various robot configurations.
[0074] In this embodiment, in order to achieve efficient strategy optimization, a reward and constraint system is further constructed.
[0075] The reward function is used to generate immediate rewards based on the robot's state and task completion during training. Furthermore, a discounted reward and advantage function are constructed as learning objectives for the multi-critic fusion value network, thereby guiding the value network to evaluate the quality of states under different robot configurations and instructing the update of policy network parameters.
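One common way to compute the discounted return and the advantage from a collected trajectory is generalized advantage estimation (GAE). The source does not name the estimator, so GAE and the parameter names `gamma` and `lam` are assumptions in the sketch below.

```python
def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory.

    rewards, values, dones: per-step lists of equal length.
    last_value: value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    # The return target for the value network is advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

advantages, returns = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    last_value=5.0,
    gamma=0.5,
    lam=1.0,
)
```

With `lam=1.0` and zero value estimates, the returns reduce to plain discounted reward sums, and the termination flag on the last step prevents bootstrapping from `last_value`.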
[0076] The reward function can be composed of a weighted sum of multiple reward and penalty terms. Reward terms include the velocity tracking reward, gait quality reward, foot lift reward, and survival reward; penalty terms include the posture stability penalty, energy consumption penalty, joint velocity penalty, joint acceleration penalty, action rate-of-change penalty, joint over-limit penalty, foot slip penalty, and illegal contact penalty.
[0077] The calculation of each reward and penalty term is explained below. (1) The velocity tracking reward is calculated exponentially from the error between the target velocity and the actual velocity, and consists of a planar linear velocity tracking reward and a yaw angular velocity tracking reward. The planar linear velocity tracking reward is computed from the target planar linear velocity, the actual planar linear velocity, and a scale parameter that is a preset value. The yaw angular velocity tracking reward is computed from the target yaw rate and the actual yaw rate.
[0078] (2) The gait quality reward is calculated from the foot air time and from the consistency between the target gait phase and the actual contact state, and consists of a foot air time reward and a gait phase matching reward. The foot air time reward is computed from the air time of the f-th foot, a reference air time threshold, and an indicator of whether that foot is currently experiencing a touchdown event. The gait phase matching reward is computed from the target gait phase and the actual gait phase.
[0079] (3) The foot lift reward is calculated from the deviation of the foot height from the target lift height, using the current height of the f-th foot, the desired foot lift height, and a scale parameter that is a preset value.
[0080] (4) The survival reward is a constant reward given at every step for as long as the robot has not terminated.
[0081] (5) The posture stability penalty is calculated from the body gravity projection deviation, the base height deviation, the vertical velocity, and the roll and pitch angular velocities. The body flatness penalty term is computed from the component of gravity projected onto the horizontal plane in the body coordinate system, which represents how far the body deviates from a level attitude: the larger the value, the greater the roll and pitch deviations and the greater the penalty.
[0082] The base height penalty term is computed from the current base height and the target base height, and indicates how far the base height deviates from the target height: the greater the deviation, the greater the penalty.
[0083] The vertical velocity penalty term is computed from the velocity of the base in the vertical direction and is used to suppress ineffective vibration and jumping of the robot base in the vertical direction.
[0084] The roll and pitch angular velocity penalty term is computed from the angular velocity vector of the base in the roll and pitch directions and is used to suppress violent swaying of the body in those directions.
[0085] (6) The energy consumption penalty is calculated from the product of the joint torque and the joint velocity. (7) The joint velocity penalty and the joint acceleration penalty are calculated from the square of the joint velocity and the square of the joint acceleration, respectively. In these terms, the torque, velocity, and acceleration are those of the j-th joint.
[0086] (8) The action rate-of-change penalty is calculated from the squared difference between actions at adjacent time steps, that is, between the k-th dimension of the current action and the k-th dimension of the action at the previous time step.
[0087] (9) The joint over-limit penalty is calculated from the amount by which the position of the j-th joint exceeds its upper or lower limit, so that a penalty is imposed only when a limit is exceeded.
[0088] (10) The foot slip penalty is calculated from the planar velocity of a foot while it is in contact with the ground, using an indicator of whether the f-th foot is in contact with the ground and the velocity of that foot within the ground plane.
[0089] (11) The illegal contact penalty is calculated from whether the contact force at a body part that should not make contact exceeds a preset contact threshold, so that a penalty is imposed when a significant collision occurs at such a part.
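A few of the terms above can be sketched in code as follows. The exact functional forms are not recoverable from the text, so everything here is an assumption chosen to match the verbal descriptions: an exponential of the squared error for tracking, sums of squares for the rate penalty, a one-sided excess for joint limits, and a contact-gated squared planar speed for foot slip. All names and weights are hypothetical.

```python
import math

def tracking_reward(target, actual, sigma):
    """Exponential reward; equals 1.0 when target and actual coincide."""
    err = sum((t - a) ** 2 for t, a in zip(target, actual))
    return math.exp(-err / sigma)

def joint_limit_penalty(q, lower, upper):
    """Sum of the amounts by which joints exceed their limits."""
    return sum(
        max(0.0, qj - hi) + max(0.0, lo - qj)
        for qj, lo, hi in zip(q, lower, upper)
    )

def action_rate_penalty(action, prev_action):
    """Squared difference between consecutive actions."""
    return sum((a - p) ** 2 for a, p in zip(action, prev_action))

def foot_slip_penalty(contacts, foot_xy_velocities):
    """Squared planar foot speed, counted only for feet in contact."""
    return sum(
        c * (vx * vx + vy * vy)
        for c, (vx, vy) in zip(contacts, foot_xy_velocities)
    )

def total_reward(terms, weights):
    """Weighted sum; penalty terms carry negative weights."""
    return sum(weights[name] * value for name, value in terms.items())

terms = {
    "lin_vel": tracking_reward([1.0, 0.0], [1.0, 0.0], sigma=0.25),
    "joint_limit": joint_limit_penalty([0.5, 1.2, -1.3], [-1.0] * 3, [1.0] * 3),
    "action_rate": action_rate_penalty([0.5, 0.0], [0.0, 0.0]),
    "foot_slip": foot_slip_penalty([1.0, 0.0], [(0.1, 0.0), (2.0, 2.0)]),
}
weights = {"lin_vel": 1.0, "joint_limit": -2.0, "action_rate": -0.4, "foot_slip": -10.0}
reward = total_reward(terms, weights)
```

Keeping the weights in a separate dictionary makes it straightforward to hold one weight set per robot configuration while sharing the term definitions.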
[0090] For different robot configurations, different weights can be assigned to different reward items to take into account the differences in gait, stability, and body structure between bipedal, quadrupedal, and hexapedal robots. Termination conditions may include time step timeout, illegal body contact, posture violation, robot fall, or other abnormal states, to promptly end the current round when the robot becomes unstable or the task fails.
[0091] During parameter optimization, a reinforcement learning algorithm is used to jointly train the unified policy network, the graph neural network morphological encoding module, and the multi-critic value network. As an example, the proximal policy optimization (PPO) algorithm is employed.
[0092] Specifically, firstly, interactive trajectories of predetermined step lengths are acquired in multiple parallel environments. Then, based on the immediate rewards, termination markers, and state value estimates in the trajectories, discounted rewards and advantage functions are calculated. Next, a policy loss is constructed based on the probability ratio between the current policy and historical policies, and a value loss is constructed based on the difference between the state value estimate and the discounted reward. Simultaneously, an entropy regularization term can be introduced to enhance exploration capabilities, and an action symmetry constraint term can be introduced to encourage the robot to learn symmetrical gaits. Afterward, backpropagation is performed on the total loss function, which consists of the policy loss, value loss, entropy regularization term, and symmetry constraint term, and an optimizer is used to update the network parameters using gradients. Since the graph neural network, unified policy network, and multi-critic value network all participate in the calculation of the total loss, the above network structures can be jointly updated during the same training process.
[0093] Specifically, the total loss function is as follows: $$L = L_{\pi} + \lambda_{1} L_{V} + \lambda_{2} L_{H} + \lambda_{3} L_{sym}$$ where $L_{\pi}$ is the policy loss, $L_{V}$ is the value loss, $L_{H}$ is the entropy regularization term, and $L_{sym}$ is the symmetry constraint loss; $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting coefficients that can be set according to actual needs.
[0094] As a specific implementation, the PPO clipped surrogate objective is used, and the policy loss $L_{\pi}$ is: $$L_{\pi} = -\mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid o_t, z)}{\pi_{\theta_{old}}(a_t \mid o_t, z)}$$ where $o_t$ is the current unified observation, $z$ is the morphological vector output by the graph neural network, $a_t$ is the current action, $A_t$ is the advantage function, $\epsilon$ is the PPO clipping coefficient, $\pi_{\theta}$ is the current policy, and $\pi_{\theta_{old}}$ is the old policy.
[0095] $\pi_{\theta}(a_t \mid o_t, z)$ denotes the conditional probability that the policy network outputs action $a_t$ under the current policy parameters $\theta$, given the observation vector $o_t$ at the current time and the morphological feature vector $z$ of the target robot; $\pi_{\theta_{old}}(a_t \mid o_t, z)$ denotes the conditional probability of the same action $a_t$ under the old policy parameters $\theta_{old}$.
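The clipped surrogate loss can be illustrated with a small, framework-free sketch. It operates on log-probabilities, which is the usual way to form the probability ratio; the function name and default clipping coefficient are illustrative choices, not values from the source.

```python
import math

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean negative clipped surrogate objective over a batch."""
    losses = []
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) objective, negated to obtain a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# Sample 1: ratio 1.5 with positive advantage is clipped to 1.2.
# Sample 2: ratio 0.5 with negative advantage is clipped to 0.8.
loss = ppo_clip_loss(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
    clip_eps=0.2,
)
```

In both samples the clipped branch is selected, so further movement of the ratio in the same direction no longer changes the loss, which is what limits the size of each policy update.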
[0096] The final state value estimate after fusing the multiple critics is $V(o_t, z)$, and the value loss $L_{V}$ is: $$L_{V} = \mathbb{E}_t\left[\left(V(o_t, z) - R_t\right)^{2}\right]$$ where $R_t$ is the discounted return.
[0097] The entropy regularization term is: $$L_{H} = -\mathbb{E}_t\left[H\left(\pi_{\theta}(\cdot \mid o_t, z)\right)\right]$$ where $H$ is the entropy of the policy distribution; the symbol "$\cdot$" stands for all possible action values in the action space; and $\pi_{\theta}(\cdot \mid o_t, z)$ is the conditional probability distribution of the action variable under the current policy, given the observation vector $o_t$ at the current time and the morphological feature vector $z$ of the target robot.
[0098] Entropy regularization is used to maintain the policy's exploration capability and prevent the policy from converging to a local optimum too early.
[0099] The symmetry constraint loss is: $$L_{sym} = \mathbb{E}_t\left[\left\| \mu_{\theta}(M_{o}\, o_t, z) - M_{a}\, \mu_{\theta}(o_t, z) \right\|^{2}\right]$$ where $\mu_{\theta}$ is the action output of the current policy, $M_{o}$ is the observation mirror mapping matrix, and $M_{a}$ is the action mirror mapping matrix; $\mu_{\theta}(M_{o}\, o_t, z)$ is the action output obtained by feeding the mirrored observation into the current policy, and $M_{a}\, \mu_{\theta}(o_t, z)$ is the result of mirroring the policy's output action for the original observation.
[0100] Because bipedal, quadrupedal, and hexapod modular robots all have left-right mirror structures, and a unified observation space and a unified action space ensure that corresponding left and right modules have consistent semantics across different robot configurations, a symmetry constraint loss is introduced. This explicitly requires the policy network to output corresponding mirror actions under mirror observation inputs, thereby injecting left-right symmetry priors from robot control into the training process. This reduces meaningless asymmetric solutions and parameter search space, improves sample utilization, and makes it easier for the policy to learn the shared left-right coordination rules across configurations, thus enhancing training stability, generalization ability, and cross-configuration transfer ability.
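The symmetry constraint can be sketched as follows. Instead of explicit mirror matrices, the sketch represents each mirror map as an index permutation with per-entry signs, which is equivalent to a signed permutation matrix; this representation, the squared-error form, and the toy policies are assumptions for illustration.

```python
def mirror(vector, permutation, signs):
    """Apply a signed permutation: out[i] = signs[i] * vector[permutation[i]]."""
    return [signs[i] * vector[permutation[i]] for i in range(len(vector))]

def symmetry_loss(policy, observations, obs_perm, obs_signs, act_perm, act_signs):
    """Mean squared gap between policy(mirror(obs)) and mirror(policy(obs))."""
    total = 0.0
    for obs in observations:
        act_of_mirrored_obs = policy(mirror(obs, obs_perm, obs_signs))
        mirrored_act = mirror(policy(obs), act_perm, act_signs)
        total += sum((a - b) ** 2 for a, b in zip(act_of_mirrored_obs, mirrored_act))
    return total / len(observations)

# Toy setup: index 0 is the left module, index 1 the right module.
swap, keep_sign = [1, 0], [1.0, 1.0]
batch = [[1.0, 2.0]]

def symmetric_policy(obs):
    # Treats left and right identically.
    return list(obs)

def asymmetric_policy(obs):
    # Ignores the right module entirely.
    return [obs[0], 0.0]

loss_sym = symmetry_loss(symmetric_policy, batch, swap, keep_sign, swap, keep_sign)
loss_asym = symmetry_loss(asymmetric_policy, batch, swap, keep_sign, swap, keep_sign)
```

A policy that commutes with the mirror map incurs zero loss, while one that favors a side is penalized in proportion to the mismatch.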
[0101] Figure 3 shows the trained strategy being invoked in a simulation environment. The figure depicts hexapod, quadruped, and bipedal robots. Robots of different configurations all operate stably within the same simulation framework and are all controlled by the same trained unified control strategy, without the need to train separate strategies for bipedal, quadrupedal, and hexapod robots.
[0102] The results in Figure 3 show that the reinforcement learning control method proposed in this embodiment, based on graph neural network morphological encoding, a unified observation space, a unified action space, and multi-critic value assessment, can effectively extract the structural commonalities and morphological differences between robot configurations and learn from them a unified control strategy applicable to multiple configurations. The simulation results demonstrate that the unified strategy of this embodiment adapts to bipedal, quadrupedal, and hexapod robots simultaneously, enabling robots with different numbers of legs to complete motion control tasks under the same strategy, thus verifying the effectiveness and feasibility of the method for unified control of multi-morphological legged robots.
[0103] Embodiment 2 In one or more embodiments, a unified reinforcement learning control system for modular multi-morphological legged robots is disclosed, comprising: a unified motion space construction module, configured to construct legged robot configurations with different numbers of legs and to establish a unified motion space based on the joint degrees of freedom of the robot with the largest number of legs; a unified observation generation module, configured to map the base state, joint positions, joint velocities, and historical motion information of different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space; a morphological graph encoding module, configured to construct a topological graph based on the mechanical connection relationships of the target robot's joints, with joint positions and joint velocities as node features, and to generate the morphological feature vector of the target robot through a graph neural network; and a strategy output module, configured to input the morphological feature vector and the unified observation vector into the policy network to generate a unified joint motion control vector, and to combine it with the motion mask of the target robot configuration to generate the joint motion control vector of the target robot configuration.
[0104] It should be noted that the specific implementation of the above modules is the same as in Embodiment 1 and is not described again.
[0105] Embodiment 3 In one or more embodiments, a terminal device is disclosed, comprising a processor and a memory, wherein the processor is used to implement instructions, and the memory is used to store multiple instructions adapted to be loaded and executed by the processor to perform the unified reinforcement learning control method for modular multi-morphological legged robots described in Embodiment 1.
[0106] It should be understood that in this embodiment, the processor can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor can be a microprocessor or any conventional processor.
[0107] Memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of memory may also include non-volatile random access memory. For example, memory may also store information about the device type.
[0108] In the implementation process, each step of the above method can be completed by the integrated logic circuits in the processor hardware or by software instructions.
[0109] Embodiment 4 In one or more embodiments, a computer-readable storage medium is disclosed, in which a plurality of instructions are stored, the instructions being adapted to be loaded and executed by a processor of a terminal device to perform the unified reinforcement learning control method for modular multi-morphological legged robots described in Embodiment 1.
[0110] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A unified reinforcement learning control method for modular multi-morphological legged robots, characterized in that it comprises: constructing legged robot configurations with different numbers of legs, and establishing a unified motion space based on the joint degrees of freedom of the robot with the largest number of legs; mapping the base state, joint positions, joint velocities, and historical motion information of different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space; taking the joints of the target robot as nodes and the joint positions and joint velocities as node features, constructing a topological graph based on the mechanical connection relationships of the target robot's joints, and using a graph neural network to generate the morphological feature vector of the target robot; and inputting the morphological feature vector and the unified observation vector into the policy network to generate a unified joint motion control vector, and combining it with the motion mask of the target robot configuration to generate the joint motion control vector of the target robot configuration.
2. The unified reinforcement learning control method for modular multi-morphological legged robots as described in claim 1, characterized in that it further comprises: using the morphological feature vector and the unified observation vector as inputs to the multi-critic value network, and outputting the state value estimate of each critic branch; generating the fusion weights of the critic branches based on the morphological feature vector of the target robot to obtain the final state value estimate, wherein the final state value estimate represents the expected long-term cumulative return of the current state under the current strategy; and, based on the final state value estimate, using a reinforcement learning algorithm to jointly train the graph neural network, the policy network, and the multi-critic value network to update the network parameters.
3. The unified reinforcement learning control method for modular multi-morphological legged robots as described in claim 2, characterized in that the multi-critic value network further includes a reward function for generating immediate rewards based on the robot's state and task completion during training; the reward function is composed of a weighted sum of reward and penalty terms, wherein the reward terms include the velocity tracking reward, gait quality reward, foot lift reward, and survival reward, and the penalty terms include the energy consumption penalty, joint velocity penalty, joint acceleration penalty, action rate-of-change penalty, joint over-limit penalty, foot slip penalty, posture stability penalty, and illegal contact penalty.
4. The unified reinforcement learning control method for modular multi-morphological legged robots as described in claim 2, characterized in that, when a reinforcement learning algorithm is used to jointly train the graph neural network, the policy network, and the multi-critic value network, the loss function is constructed as follows: $$L = L_{\pi} + \lambda_{1} L_{V} + \lambda_{2} L_{H} + \lambda_{3} L_{sym}$$ where $L_{\pi}$ is the policy loss, $L_{V}$ is the value loss, $L_{H}$ is the entropy regularization term, and $L_{sym}$ is the symmetry constraint loss; $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting coefficients that can be set according to actual needs.
5. The modular multi-morphic legged robot unified reinforcement learning control method as described in claim 1, characterized in that, The unified motion space is a fixed-dimensional motion vector, and the dimension is consistent with the number of joint degrees of freedom corresponding to the robot with the maximum number of legs. Each dimension of the motion vector corresponds to a control command for a joint, and the control command is specifically the joint target position increment.
6. The unified reinforcement learning control method for a modular multi-morphic legged robot as described in claim 1, characterized in that, The fixed-dimensional observation vector specifically includes: the angular velocity of the robot base, gravity projection, the angles of each joint of the robot, the angular velocities of each joint of the robot, and the previous action information of each joint of the robot.
7. The unified reinforcement learning control method for modular multi-morphological legged robots as described in claim 1, characterized in that the joint motion control vector of the target robot configuration is generated in combination with the motion mask of the target robot configuration, specifically as follows: $$q^{target} = q^{default} + \alpha\,(m \odot a)$$ where $q^{target}$ is the target joint position, $q^{default}$ is the default joint angle, $\alpha$ is the motion scaling factor, $m$ is the motion mask of the target robot, and $a$ is the original action vector output by the strategy; $m \odot a$ means that the motion mask is multiplied element-wise by the original action vector, so that only the activated joints produce motion offsets.
8. A unified reinforcement learning control system for modular multi-morphological legged robots, characterized in that it comprises: a unified motion space construction module, configured to construct legged robot configurations with different numbers of legs and to establish a unified motion space based on the joint degrees of freedom of the robot with the largest number of legs; a unified observation generation module, configured to map the base state, joint positions, joint velocities, and historical motion information of different robot configurations into fixed-dimensional observation vectors, thereby constructing a unified observation space; a morphological graph encoding module, configured to construct a topological graph based on the mechanical connection relationships of the target robot's joints, with joint positions and joint velocities as node features, and to generate the morphological feature vector of the target robot through a graph neural network; and a strategy output module, configured to input the morphological feature vector and the unified observation vector into the policy network to generate a unified joint motion control vector, and to combine it with the motion mask of the target robot configuration to generate the joint motion control vector of the target robot configuration.
9. A terminal device comprising a processor and a memory, the processor being used to implement instructions and the memory being used to store multiple instructions, characterized in that the instructions are adapted to be loaded and executed by the processor to perform the unified reinforcement learning control method for modular multi-morphological legged robots according to any one of claims 1-7.
10. A computer-readable storage medium storing a plurality of instructions, characterized in that the instructions are adapted to be loaded and executed by the processor of a terminal device to perform the unified reinforcement learning control method for modular multi-morphological legged robots according to any one of claims 1-7.