Humanoid robot motion control method and system based on deep reinforcement learning

A humanoid robot motion control method based on deep reinforcement learning, in the field of robot motion and humanoid robot technology. It addresses problems such as slow training, poor anti-interference ability, and difficult parameter adjustment, with the effects of improving stability and reliability, accelerating learning, and improving training efficiency.

Pending Publication Date: 2020-07-03
CENT SOUTH UNIV
Cites: 6 | Cited by: 8

AI-Extracted Technical Summary

Problems solved by technology

However, compared with wheeled or tracked robots, humanoid robots are inherently unstable and require active control to achieve equilibrium, due to their limited support area, high center of mass, and limited actuator capabilities.
Therefore, the scope of application scenarios of humanoid robots is mainly limited by the balance of humanoid robots and the ability to deal with disturbances and uncertainties.
[0003] Classical control methods propo...

Method used

In the present embodiment, as shown in Figure 4, to address the characteristics of the ankle motion when the humanoid robot walks, during the phase in which the foot leaves the ground, after the deep deterministic policy gradient network determines the target angle of the ankle joint, the ankle joint is controlled passively. The advantages of this strategy are: (1) the contact between the foot and the ground is smoother; (2) the dynamic characteristics of the inverted pendulum are maintained; (3) when the foot is in contact with the ground, minimal force is required to drive the body around the ankle; (4) the overall noise in the system is reduced. Further preferably, the damping coefficient of the ankle is set to 1; this amount of damping helps absorb the impact of ground contact without hindering the swing.
[0034] In the present embodiment, a specific humanoid robot model is taken as an example, as shown in Figure 2, and walking is selected as the humanoid robot's motion mode. The humanoid robot model consists of a head, a torso, two arms, and two legs, and is constructed from real anthropometric data. The model contains twelve rigid bodies: the head, the torso, the left and right upper arms, the left and right forearms, the left and right thighs, the left and right lower legs, and the left and right feet. In addition, the model has the following ten joints: left and right hip joints, left and right knee joints, left and right ankle joints, left and right shoulder joints, and left and right elbow joints. Among them, the hip and ankle joints can rotate about the x-axis (medial-lateral) and the y-axis (front-back), and the shoulder and elbow joints can rotate about the x-axis and the z-axis.

Abstract

The invention discloses a humanoid robot motion control method and system based on deep reinforcement learning. The method comprises the following steps. S1, simulation control: the current state of a humanoid robot is obtained, and the target angle of each joint of the humanoid robot is calculated and determined by a preset deep reinforcement learning model according to the current state. S2, PD control: through a PD controller, the target angle serves as the control target, the actual angle and the joint torque of each joint serve as feedback, the control torque of the joint is determined, and the joint is controlled to act according to the control torque. The method has the advantages of good control stability, good reliability and the like.

Application Domain

Technology Topic

Image


Examples

  • Experimental program (1)

Example Embodiment

[0032] The following further describes the present invention with reference to the accompanying drawings of the specification and specific preferred embodiments, but the protection scope of the present invention is not limited thereby.
[0033] As shown in Figure 1, the motion control method of a humanoid robot based on deep reinforcement learning of this embodiment includes: S1. Simulation control: acquire the current state of the humanoid robot, and calculate and determine the target angle of each joint of the humanoid robot from the current state using a preset deep reinforcement learning model; S2. PD control: through the PD controller, take the target angle as the control target and the actual angle and joint torque of the joint as feedback, determine the control torque of the joint, and control the joint action according to the control torque.
[0034] In this embodiment, a specific humanoid robot model is taken as an example for description, as shown in Figure 2, and walking is chosen as the motion mode of the humanoid robot. Suppose the humanoid robot model is composed of a head, a torso, two arms, and two legs, and is constructed from real anthropometric data. The model contains twelve rigid bodies: the head, the torso, the left and right upper arms, the left and right forearms, the left and right thighs, the left and right calves, and the left and right feet. In addition, the model has the following ten joints: left and right hip joints, left and right knee joints, left and right ankle joints, left and right shoulder joints, and left and right elbow joints. Among them, the hip and ankle joints can rotate about the x-axis (medial-lateral) and the y-axis (front-back), and the shoulder and elbow joints can rotate about the x-axis (left-right) and the z-axis (up-down). Two frictionless walls are added to the simulated environment to constrain the humanoid robot to move in the sagittal plane, so the x-axis rotation of the ankle provides most of the movement. The y-axis rotation of the ankle remains unchanged, so that when the robot leans, the foot can still make firm contact with the ground. The knee joint is constrained to rotate only around the x-axis, giving the system a total of 14 degrees of freedom. According to the weight and height of a human, the mass and length proportions of the body parts are calculated from an anthropometric table, and the body segments and their moments of inertia are simplified into uniform capsule shapes to speed up the simulation. Suppose the height of the humanoid robot model is set to 1.8 meters and the weight to 75 kg. A simulated inertial measurement unit (IMU) sensor is attached to the center of the torso to measure its velocity and acceleration. Force sensors are mounted on the bottoms of the left and right feet to detect the ground contact force. All joint angles and joint velocities can be read directly from the simulation environment. It should be noted that the structure and joints of the humanoid robot model may also take other forms, and the motion type may also be another motion, such as arm motion.
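For reference, the model specification described in this paragraph can be gathered into a single configuration structure. The following is a minimal Python sketch; the field names are illustrative, and the values are those stated above.

```python
# Hypothetical configuration summarizing the simulated humanoid of this embodiment.
# Field names are illustrative; values are taken from the paragraph above.
HUMANOID_MODEL = {
    "height_m": 1.8,
    "mass_kg": 75.0,
    "rigid_bodies": [
        "head", "torso",
        "left_upper_arm", "right_upper_arm",
        "left_forearm", "right_forearm",
        "left_thigh", "right_thigh",
        "left_calf", "right_calf",
        "left_foot", "right_foot",
    ],  # twelve rigid bodies
    "joints": {
        # joint name -> rotation axes used in the sagittal-plane simulation
        "hip":      ["x", "y"],
        "knee":     ["x"],
        "ankle":    ["x", "y"],   # y-axis rotation held fixed during walking
        "shoulder": ["x", "z"],
        "elbow":    ["x", "z"],
    },
    "sensors": {
        "imu": "torso_center",                  # velocity and acceleration
        "force": ["left_foot", "right_foot"],   # ground contact force
    },
}
```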
[0035] In this embodiment, the deep reinforcement learning model includes a first experience replay pool and a second experience replay pool. The first experience replay pool is used to store the newly generated experience of the deep reinforcement learning model; the second experience replay pool is used to store both the newly generated experience of the deep reinforcement learning model and the old experience removed from the first experience replay pool. The deep reinforcement learning model extracts experience from the first experience replay pool with a preset first probability and from the second experience replay pool with a preset second probability to train the neural network. The reward function of the deep reinforcement learning model is the sum of multiple reward sub-items; the reward sub-items include: an upper-body posture reward, a center-of-mass position reward, a center-of-mass velocity reward, and a ground contact force reward. The reward sub-items preferably also include a ground contact state reward and a power consumption reward. It should be noted that when the motion form of the humanoid robot differs, the reward function changes accordingly, and reward sub-items are added or removed.
[0036] In this embodiment, the upper-body posture reward r_pose is given by formula (1):
[0037] r_pose = ω_torsoPitch·r_torsoPitch + ω_pelvisPitch·r_pelvisPitch + ω_torsoRoll·r_torsoRoll + ω_pelvisRoll·r_pelvisRoll   (1)
[0038] ω_torsoPitch is the weight of the upper-body torso pitch term and r_torsoPitch is the reward for the upper-body torso pitch; ω_pelvisPitch is the weight of the lower-body pelvis pitch term and r_pelvisPitch is the reward for the lower-body pelvis pitch; ω_torsoRoll is the weight of the upper-body torso roll term and r_torsoRoll is the reward for the upper-body torso roll; ω_pelvisRoll is the weight of the lower-body pelvis roll term and r_pelvisRoll is the reward for the lower-body pelvis roll. In this embodiment, the upper-body posture is represented by the pitch and roll angles of the torso and pelvis, and the desired pitch and roll angles of the pelvis and torso are 0, i.e., the direction in which the upper body is upright.
[0039] The center-of-mass position reward r_CoM_pos is given by formula (2):
[0040] r_CoM_pos = ω_xyCoM·r_xyCoM + ω_zCoM·r_zCoM   (2)
[0041] ω_xyCoM is the weight of the horizontal center-of-mass position term and r_xyCoM is the reward for the horizontal position; ω_zCoM is the weight of the vertical center-of-mass position term and r_zCoM is the reward for the vertical position. In this embodiment, the center-of-mass position reward is decomposed into horizontal and vertical components. For the horizontal center-of-mass position, the target position is the center of the support polygon, to provide maximum disturbance compensation. For the vertical center-of-mass position, the robot should stand upright and maintain a certain height.
[0042] The center-of-mass velocity reward r_CoM_vel is given by formula (3):
[0043] r_CoM_vel = ω_xyCoM·r_xyCoMvel + ω_zCoM·r_zCoMvel   (3)
[0044] The weights in formula (3) are defined as in formula (2), and r_xyCoMvel and r_zCoMvel are the rewards for the horizontal and vertical center-of-mass velocity, respectively. In this embodiment, the center-of-mass velocity is treated like the center-of-mass position, and its reward is decomposed into two components: the velocity in the horizontal plane and the vertical velocity. The center-of-mass velocity is expressed in the world coordinate system. With the goal of minimizing vertical movement, the desired vertical center-of-mass velocity is 0, and the desired horizontal center-of-mass velocity is derived from the capture point. The capture point is only valid when the robot is in contact with the ground and is not slipping.
[0045] The ground contact force reward r_GRF is given by formula (4):
[0046] r_GRF = ω_Fleft·r_Fleft + ω_Fright·r_Fright   (4)
[0047] ω_Fleft is the weight of the left-foot contact force term and r_Fleft is the reward for the left-foot contact force; ω_Fright is the weight of the right-foot contact force term and r_Fright is the reward for the right-foot contact force. In this embodiment, the contact force must be evenly distributed between the two feet to maintain a stable balance. A total mass of 137 kg produces 671.3 N of force per foot.
[0048] The ground contact state reward r_contact is given by formula (5):
[0049] r_contact = 0 if only the feet are in contact with the ground; r_contact = k if a foot loses contact with the ground; r_contact = l if a body part other than the feet is in contact with the ground   (5)
[0050] k is a preset first constant and l is a preset second constant; both constants are negative, and the first constant is greater than the second, preferably k = -2 and l = -10. In this embodiment, only the feet should be in contact with the ground when the robot is standing, so the robot is penalized when a foot loses contact with the ground or when a body part other than the feet touches the ground.
[0051] The power consumption reward r_power is given by formula (6):
[0052] r_power = -ω_power · Σ_{j=1..J} |τ_j·q_j|   (6)
[0053] ω_power is the preset weight, j is the index of a joint drive, J is the total number of joint drives, τ_j is the joint torque of drive j, and q_j is the joint angular velocity of drive j.
[0054] In this embodiment, the upper-body torso pitch reward r_torsoPitch, the lower-body pelvis pitch reward r_pelvisPitch, the upper-body torso roll reward r_torsoRoll, the lower-body pelvis roll reward r_pelvisRoll, the horizontal position reward r_xyCoM, the vertical position reward r_zCoM, the left-foot contact force reward r_Fleft, and the right-foot contact force reward r_Fright are all calculated according to formula (7):
[0055] r_i = exp(-α_i·(x_target - x)^2)   (7)
[0056] In formula (7), r_i is the calculated reward value, x is the reward parameter (the measured quantity for the corresponding sub-item), x_target is its target value, and α_i is a preset normalization factor.
[0057] The reward function of the deep reinforcement learning model is then given by formula (8):
[0058] r = r_pose + r_CoM_pos + r_CoM_vel + r_GRF + r_contact + r_power   (8)
[0059] The definitions of the parameters in formula (8) are the same as above.
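To make the combination of formulas (1) through (8) concrete, here is a minimal Python sketch of the reward computation. The weight, target, and normalization values, the state field names, and the mapping of k and l to the two penalty cases follow the reconstruction above and are illustrative rather than disclosed values.

```python
import math

def sub_reward(x, x_target, alpha):
    """Formula (7): exponential kernel reward for one tracked quantity."""
    return math.exp(-alpha * (x_target - x) ** 2)

def total_reward(state, weights, targets, alphas, k=-2.0, l=-10.0, w_power=1e-3):
    """Sum of the reward sub-items as in formula (8).

    `state` is assumed to be a dict exposing the measured quantities used by
    the sub-items; the exact field names are illustrative.
    """
    def weighted(names):
        return sum(
            weights[n] * sub_reward(state[n], targets[n], alphas[n]) for n in names
        )

    # (1) upper-body posture: pitch and roll of torso and pelvis
    r_pose = weighted(("torso_pitch", "pelvis_pitch", "torso_roll", "pelvis_roll"))
    # (2)-(3) center-of-mass position and velocity, horizontal and vertical parts
    r_com_pos = weighted(("com_xy", "com_z"))
    r_com_vel = weighted(("com_xy_vel", "com_z_vel"))
    # (4) ground contact force, evenly distributed between the two feet
    r_grf = weighted(("grf_left", "grf_right"))
    # (5) contact state: k if a foot loses contact, l if a non-foot part touches the ground
    if state["non_foot_contact"]:
        r_contact = l
    elif state["foot_lost_contact"]:
        r_contact = k
    else:
        r_contact = 0.0
    # (6) power consumption penalty over all joint drives
    r_power = -w_power * sum(
        abs(tau * qd) for tau, qd in zip(state["torques"], state["joint_vels"])
    )
    # (8) total reward
    return r_pose + r_com_pos + r_com_vel + r_grf + r_contact + r_power
```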
[0060] In this embodiment, the deep reinforcement learning model adopts a deep deterministic policy gradient (DDPG) network, comprising an action (actor) network and an evaluation (critic) network, each with two hidden layers: the first layer has 400 neurons and the second layer has 300 neurons. The output of the action network passes through a ReLU activation function. During training of the deep deterministic policy gradient network, the training experience is stored in experience replay pools. In this embodiment there are two experience replay pools, the first experience replay pool and the second experience replay pool, which can store 70,000 experiences. Training starts once 20,000 experiences have been stored. The learning rates of the actor and the critic are set to 10^-8 and 2×10^-8, respectively. The reward discount γ is set to 0.99, and the training batch is 100 samples. The deep deterministic policy gradient network determines the distance and speed of the next swing foot based on the speed of the previous step, the pitch angle of the trunk, the step length, and the ZMP (zero moment point) position.
[0061] In this embodiment, the input to the action network of the deep deterministic policy gradient network is the current state of the humanoid robot, that is, the current angle of each joint is used as a state feature, and the output is the target angle of each joint. In addition to the state features, the evaluation network of the deep deterministic policy gradient network also takes the action parameters as input; the action parameters skip the first hidden layer and are fed directly into the second hidden layer. The network input of the deep deterministic policy gradient network consists of continuous state features, which are filtered by a Butterworth filter with a cut-off frequency of 10 Hz, while discrete state features remain unchanged.
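The actor-critic structure described above can be sketched as follows, assuming PyTorch as the neural-network library; the library choice and any wiring details beyond those stated here (such as the critic's scalar output head) are assumptions rather than part of the disclosure.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network: state features -> target joint angles (hidden layers of 400 and 300)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)

    def forward(self, state):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(h))
        # The embodiment states that the actor output passes through a ReLU activation.
        return torch.relu(self.out(h))

class Critic(nn.Module):
    """Evaluation network: the action skips the first hidden layer and enters the second."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)  # action joins at the second hidden layer
        self.out = nn.Linear(300, 1)                  # scalar Q-value head (assumed)

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        return self.out(h)
```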
[0062] In this embodiment, as shown in Figure 3, the training process of the deep deterministic policy gradient network is as follows. 1. Initialize the neural network parameters and the experience replay pools. 2. According to the current state s_t, the deep deterministic policy gradient network computes the action a_t for the current state and the reward r_t for the action a_t, and the network is updated; after the humanoid robot executes the action a_t, it enters the next state s_{t+1}, and the state transition [s_t, a_t, r_t, s_{t+1}] is saved to the first experience replay pool and the second experience replay pool. The first experience replay pool stores experience in a standard FIFO (first in, first out) manner, so the distribution of its experience samples roughly corresponds to the current policy. The second experience replay pool not only stores the new experience [s_t, a_t, r_t, s_{t+1}] generated by the deep deterministic policy gradient network during the state transition, but also stores the experience discarded by the first experience replay pool once that pool is full. After the second experience replay pool is full, old experience is overwritten according to the distance difference between the stored samples and the new experience sample; the distance difference can be expressed as in equation (9):
[0063] i_overwrite = argmin_{i∈D} Σ_{d=1..D_N} C_d·(i_d - j_d)^2   (9)
[0064] In formula (9), i_overwrite is the old experience to be overwritten, i is an old experience sample in the second experience replay pool, D is the set of experience samples in the second experience replay pool, j is the new experience sample, d is a dimension of the state-action space, D_N is the total dimension of the state-action space, i_d is the d-th dimension of sample i, j_d is the d-th dimension of sample j, and C_d is a preset size-related scaling constant that depends on the size and distribution of the database.
[0065] In this embodiment, when the neural network is trained using the experience samples stored in the first and second experience replay pools, experience samples are drawn uniformly at random from the first experience replay pool with probability β, and uniformly at random from the second experience replay pool with probability 1-β, to train the neural network.
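The two-pool replay mechanism, with the FIFO first pool, the distance-based overwriting of the second pool as in equation (9), and the β / 1-β sampling, could be sketched as follows. This is a minimal illustration under the reconstruction above; the flattening of experiences and the per-dimension scaling vector are assumptions.

```python
import random
import numpy as np

class DualReplayPool:
    """First pool: plain FIFO. Second pool: keeps new and evicted experience; once full,
    it overwrites the stored sample with the smallest scaled distance to the incoming one."""

    def __init__(self, capacity=70_000, beta=0.5, scale=None):
        self.capacity = capacity
        self.beta = beta          # probability of sampling from the first pool
        self.scale = scale        # per-dimension scaling constants C_d (assumed vector)
        self.first, self.second = [], []

    def _flat(self, exp):
        s, a, r, s_next = exp
        return np.concatenate([np.ravel(s), np.ravel(a), [r], np.ravel(s_next)])

    def add(self, experience):
        # FIFO behaviour of the first pool; evicted samples go to the second pool.
        self.first.append(experience)
        evicted = self.first.pop(0) if len(self.first) > self.capacity else None

        for exp in filter(None, (experience, evicted)):
            if len(self.second) < self.capacity:
                self.second.append(exp)
            else:
                # Equation (9): overwrite the old sample with the smallest scaled distance.
                x = self._flat(exp)
                c = self.scale if self.scale is not None else np.ones_like(x)
                dists = [np.sum(c * (self._flat(old) - x) ** 2) for old in self.second]
                self.second[int(np.argmin(dists))] = exp

    def sample(self, batch_size=100):
        pool = self.first if random.random() < self.beta else self.second
        return random.sample(pool, min(batch_size, len(pool)))
```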
[0066] In this embodiment, when a joint of the humanoid robot is controlled to perform the next action, that is, to execute the target angle, PD control is adopted: through the PD controller, the target angle is taken as the control target, the actual angle and joint torque of the joint are taken as feedback, the control torque of the joint is determined, and the joint action is controlled according to the control torque. The PD controller serves as the low-level controller; because of its spring-damper characteristics, the PD controller resembles the biomechanics of the system and can control the humanoid robot well in executing the target angle. The input of the PD controller is the target angle calculated by the deep deterministic policy gradient network, and the output is the torque of the joint driving device. The PD controller uses the actual angle of the joint and the torque of the joint driving device as feedback, and the feedback signals are filtered; the filter cut-off frequency is preferably 50 Hz, and the filtering method is preferably Butterworth filtering.
[0067] In this embodiment, the control law of the PD controller is shown in equation (10):
[0068] u = K_p·(q_target - q_measured) - K_d·q'_measured   (10)
[0069] In formula (10), u is the output of the PD controller, i.e., the control quantity applied to the joint driver by the PD controller; K_p and K_d are preset PD gains; q_target is the target angle of the joint; q_measured is the measured current angle of the joint; and q'_measured is the measured current angular velocity of the joint.
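The PD law of formula (10) can be written directly as a small helper; a minimal Python sketch follows, in which the gain values in the usage line are placeholders rather than values from the disclosure.

```python
def pd_output(q_target, q_measured, qdot_measured, kp, kd):
    """Formula (10): PD control output from the target angle and the filtered feedback."""
    return kp * (q_target - q_measured) - kd * qdot_measured

# Example use for one joint (gains and angles are illustrative placeholders):
u = pd_output(q_target=0.30, q_measured=0.25, qdot_measured=0.10, kp=200.0, kd=5.0)
```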
[0070] For example, when the humanoid robot is walking and the raised foot touches the ground, the humanoid robot begins to rotate around the ankle joint. At this time, the hip joint needs to move in coordination with the ankle joint to keep the torso upright and to provide the power that pushes the torso forward. Here the output of the PD controller is the target angular velocity of the hip joint. The purpose is to keep the torso upright without overshoot, because overshoot would cause the torso to swing back and forth and endanger stability. Ideally, the torso should lean forward slightly to maintain momentum and a smooth natural gait; for this reason, this embodiment uses the residual error of the PD controller to deviate the torso slightly from the z-axis.
[0071] If the pitch of the torso remains unchanged relative to the z-axis, the horizontal velocity of the hip equals the horizontal velocity of the torso center, that is, v_t = v_p, where v_t and v_p are the linear velocities of the torso center of mass and of the hip joint, ω is the angular velocity of the thigh around the hip joint, and the angular velocity around the ankle, which can be measured directly, satisfies a relation in which α is the angle between the leg and the z-axis and L is the length of the leg. In the corresponding PD control law, K is the control gain and Φ is the trunk pitch angle: if the trunk pitch angle Φ is greater than the target value Φ_0, i.e., Φ > Φ_0, the control acts so that the pitch angle decreases, and vice versa. The target pitch Φ_0 is selected close to zero, Φ_0 = 0.02.
[0072] In this embodiment, as shown in Figure 4, in view of the characteristics of the ankle motion of the humanoid robot during walking, in the phase in which the foot leaves the ground, after the deep deterministic policy gradient network determines the target angle of the ankle joint, the ankle joint is controlled passively. The advantages of this strategy are: (1) the contact between the foot and the ground is smoother; (2) the dynamic characteristics of the inverted pendulum are maintained; (3) when the foot is in contact with the ground, only minimal force is required to drive the body around the ankle; (4) the total noise in the system is reduced. More preferably, the damping coefficient of the ankle is set to 1; this amount of damping helps absorb the impact of ground contact without hindering the swing motion.
[0073] Specifically, when the foot is off the ground, a torque is applied to the ankle to push the body forward. The torque is determined by the current walking speed, with the goal of keeping the momentum of the humanoid robot within a certain range. If the required walking speed is given, then Δv = v_0 - v_desire, where Δv is the required speed change, v_0 is the current speed, and v_desire is the target speed. If the pitch of the trunk remains constant, the angular velocity of the trunk is zero, ω_torso = 0, and the hip velocity change Δv_hip equals the velocity change of the torso center Δv_center, i.e., Δv_center = Δv_hip. If the toe-off phase is short, the hip joint angle of the trailing leg remains approximately constant during the motion, and the momentum of the hind foot can be ignored. To keep the trunk angular velocity ω_torso = 0, a torque must act on the hip joint of the trailing leg during the motion; this relation involves the torque τ_hip acting on the hip joint, the moment of inertia of the trunk J_torso, the unit time Δt, and the angular velocity of rotation around the ankle per unit time. For the ankle joint of the leading leg during the motion, the corresponding relation involves the torque τ acting on the ankle joint, the torque τ_c caused by the damper, the torque τ_hip acting on the hip joint, the unit time Δt, the moment of inertia J_leg of the leading leg around the front ankle joint, the angular velocity of rotation around the ankle per unit time, the length of the leg l, the mass of the leg m_l, the angle β' between the two legs, and the damping coefficient c of the ankle joint.
[0074] In this embodiment, through the aforementioned control strategy, the stability and reliability of the motion control of the humanoid robot can be effectively ensured. It should be noted that although in this embodiment only the walking form of the humanoid robot is used as an example to describe the motion control, the technical solution of the present invention is not limited to the walking motion control of the humanoid robot.
[0075] In this embodiment, the control frequency of the simulation control is lower than the control frequency of the PD control. For the walking motion of the humanoid robot, the frequency of the simulation control is preferably less than or equal to 50 Hz, more preferably less than or equal to 25 Hz; the control frequency of the PD control is greater than or equal to 300 Hz, more preferably greater than or equal to 500 Hz. That is, the simulation control based on the deep deterministic policy gradient network gives a coarser-grained joint control target, and the fine-grained PD control then drives the joint specifically to achieve that target.
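The resulting two-rate structure can be illustrated with the following minimal loop, assuming a 25 Hz policy rate and a 500 Hz PD rate as in the preferred values above; `policy` and the simulator interface (`get_state`, `get_joint_angles`, `get_joint_velocities`, `apply_torques`, `step`) are hypothetical placeholders, and the gains and joint quantities are assumed to be per-joint arrays.

```python
# Minimal sketch of the two-rate control loop: the deep RL policy (simulation control)
# runs at a low rate and outputs target joint angles; the PD loop runs at a high rate.
POLICY_HZ = 25     # simulation control frequency (preferred value in this embodiment)
PD_HZ = 500        # PD control frequency (preferred value in this embodiment)
STEPS_PER_POLICY = PD_HZ // POLICY_HZ

def control_episode(sim, policy, kp, kd, episode_steps=1000):
    """`sim` is a hypothetical simulator; kp, kd, q, qdot are per-joint arrays."""
    for _ in range(episode_steps):
        state = sim.get_state()                        # S1: read current state
        q_target = policy(state)                       # S1: target angle for each joint
        for _ in range(STEPS_PER_POLICY):              # S2: PD control at the high rate
            q = sim.get_joint_angles()
            qdot = sim.get_joint_velocities()
            tau = kp * (q_target - q) - kd * qdot      # formula (10), applied per joint
            sim.apply_torques(tau)
            sim.step(1.0 / PD_HZ)
```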
[0076] The humanoid robot motion control system based on deep reinforcement learning of this embodiment includes a simulation control module and a PD control module. The simulation control module is used to obtain the current state of the humanoid robot and to calculate and determine the target angle of each joint of the humanoid robot from the current state with a preset deep reinforcement learning model. The PD control module is used to take the target angle as the control target, take the actual angle and joint torque of the joint as feedback, determine the control torque of the joint, and control the joint action according to the control torque. The humanoid robot motion control system based on deep reinforcement learning of this embodiment is used to implement the above motion control method.
[0077] In this embodiment, the deep reinforcement learning model includes a first experience replay pool and a second experience replay pool. The first experience replay pool is used to store the newly generated experience of the deep reinforcement learning model; the second experience replay pool is used to store both the newly generated experience of the deep reinforcement learning model and the old experience removed from the first experience replay pool. The deep reinforcement learning model extracts experience from the first experience replay pool with a preset first probability and from the second experience replay pool with a preset second probability to train the neural network.
[0078] In this embodiment, the reward function of the deep reinforcement learning model is the sum of multiple reward sub-items; the reward sub-items include: an upper-body posture reward, a center-of-mass position reward, a center-of-mass velocity reward, and a ground contact force reward. The reward sub-items also include a ground contact state reward and a power consumption reward. The control frequency of the simulation control is lower than that of the PD control.
[0079] The above are only preferred embodiments of the present invention, and do not limit the present invention in any form. Although the present invention has been disclosed as above in preferred embodiments, it is not intended to limit the present invention. Therefore, any simple modifications, equivalent changes and modifications made to the above embodiments based on the technical essence of the present invention without departing from the technical solution of the present invention should fall within the protection scope of the technical solution of the present invention.