[0020] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art. It should be noted that the embodiments of the present invention and the features in the embodiments can be combined with each other if there is no conflict. Hereinafter, the present invention will be described in detail with reference to the drawings and in conjunction with the embodiments.
[0021] Figure 1 is a flowchart of a method for motion control of a legged robot based on deep reinforcement learning provided by an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:
[0022] (1) Build a 3D model of a legged robot in the Webots simulation environment, and set the environmental variables, including the robot density, weight, gravitational coefficient, ground friction, joint motor parameters, control cycle, and so on. The 3D model of the legged robot includes a body and four legs. The four legs are respectively connected to the body, are located below the body, and are positioned at the left front, left rear, right front and right rear; each leg includes a calf, a knee joint, a thigh and a hip joint, where the calf is connected to the thigh through the knee joint, and the thigh is connected to the body through the hip joint.
[0023] (2) Initialize the state of the 3D model of the legged robot (preset the centroid position, centroid velocity, attitude, angular velocity, and the initial angles of the knee and hip joints), and set the instant reward function at time t as
[0024] R(s(t), a(t)) = w1 × forward speed(t) - w2 × attitude deviation(t) - w3 × position deviation(t)
[0025] And the cumulative reward function (state-action value function)
[0026] Q(s(t), a(t)) = R(s(t), a(t)) + γR(s(t+1), a(t+1)) + γ²R(s(t+2), a(t+2)) + …
[0027] where s(t) is the state of the robot 3D model at time t, a(t) is the expected angle of the knee and hip joints of the robot 3D model at time t, R(s(t), a(t)) is the instant reward function, w1, w2 and w3 are constants, Q(s(t), a(t)) is the cumulative reward function, and γ is a constant.
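For illustration only, the following Python sketch shows how the cumulative reward relates to the instant rewards through the constant γ; the reward values used are made-up placeholders, not data from this embodiment.

```python
# Minimal sketch: discounted cumulative reward (state-action value) from instant rewards.
# The reward sequence below is a made-up placeholder, not data from the embodiment.
gamma = 0.95  # the constant gamma used later in this embodiment

instant_rewards = [1.2, 0.8, 1.0, 0.5]  # R(s(t), a(t)), R(s(t+1), a(t+1)), ...

q_value = sum((gamma ** k) * r for k, r in enumerate(instant_rewards))
print(q_value)  # approximation of Q(s(t), a(t)) truncated after four steps
```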
[0028] (3) Construct an action network and a target action network. The inputs of the action network and the target action network are the state s(t) of the robot 3D model in step (2); the output of the action network is the expected angle a(t) of the knee and hip joints, and the output of the target action network is a'(t). The weight of the action network is θ_a(t), and the weight of the target action network is θ_a^t(t).
[0029] Construct an evaluation network and a target evaluation network. The inputs of the evaluation network and the target evaluation network are the state s(t) of the robot 3D model in step (2) and the expected angle a(t) of the knee and hip joints. The output of the evaluation network is the cumulative reward function Q(s(t), a(t)), and the output of the target evaluation network is the cumulative reward function Q'(s(t), a(t)). The weight of the evaluation network is θ_c(t), and the weight of the target evaluation network is θ_c^t(t).
[0030] Take the state s(t) of the robot 3D model in step (2) as the input and the expected angle a(t) of the knee and hip joints as the output, and initialize the corresponding deep neural networks, namely the action network and the target action network; take the state s(t) of the robot 3D model and the expected angle a(t) of the knee and hip joints as the input and the cumulative reward function Q(s(t), a(t)) as the output, and initialize the corresponding deep neural networks, namely the evaluation network and the target evaluation network.
[0031] (4) In the state s(t) of the robot 3D model at time t, generate the expected angle a(t) of the knee and hip joints through the action network, move the knee and hip joints of the robot 3D model to the expected angle a(t) at time t+1, read the robot state information s(t+1) at this time, calculate the instant reward R(t+1) of the robot 3D model motion at time t+1, and store [s(t), a(t), s(t+1), R(t+1)] as a sample in the replay variable.
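For illustration only, a minimal Python sketch of the replay variable of step (4) and the random sampling of step (5) is given below; the buffer capacity is an assumed value.

```python
import random
from collections import deque

# Minimal sketch of the replay variable of step (4); the capacity is an assumed value.
replay = deque(maxlen=100000)

def store_transition(s_t, a_t, s_next, r_next):
    # Store one sample [s(t), a(t), s(t+1), R(t+1)] as in step (4).
    replay.append((s_t, a_t, s_next, r_next))

def sample_batch(batch_size=200):
    # Randomly select a certain number of samples, as in step (5).
    return random.sample(replay, batch_size)
```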
[0032] (5) Repeat step (4) until multiple samples are collected, then randomly select a certain number of samples from the replay variable, use the target action network to generate a'(t+1) corresponding to s(t+1), take s(t+1) and a'(t+1) as inputs, and use the target evaluation network to obtain the value Q'(s(t+1), a'(t+1)) of the cumulative reward function, where a'(t+1) is the output of the target action network at time t+1.
[0033] (6) Use the instant reward R(t+1) stored in the replay variable to update the value Q(s(t), a(t)) of the cumulative reward function:
[0034] Q(s(t), a(t)) = R(t+1) + γQ'(s(t+1), a'(t+1))
[0035] (7) Take [s(t), a(t)] as the input and Q(s(t), a(t)) as the output to construct training samples, train the evaluation network, and obtain the new evaluation network weight θ_c(t+1);
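For illustration only, the following Python sketch outlines how steps (5) to (7) could be implemented, assuming the PyTorch library (this embodiment does not prescribe a framework) and assuming that the networks (actor, critic, target_actor, target_critic) and the critic optimizer have already been created as in a later sketch; s, a, s_next and r denote batched tensors sampled from the replay variable.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of steps (5)-(7), assuming PyTorch and that actor, critic,
# target_actor, target_critic and critic_optimizer already exist (see a later sketch);
# s, a, s_next, r are batched tensors drawn from the replay variable.
gamma = 0.95

def train_critic(s, a, s_next, r):
    with torch.no_grad():
        a_next = target_actor(s_next)                       # a'(t+1) from the target action network
        q_next = target_critic(torch.cat([s_next, a_next], dim=1))
        q_target = r + gamma * q_next                       # Q = R(t+1) + gamma * Q'(s(t+1), a'(t+1))
    q_pred = critic(torch.cat([s, a], dim=1))               # evaluation network output
    loss = F.mse_loss(q_pred, q_target)                     # regression toward the updated Q value
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()                                 # yields the new weights theta_c(t+1)
```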
[0036] (8) Take the state s(t) of the robot 3D model as the input, obtain the expected angle a(t) of the knee and hip joints from the action network, obtain the output of the new evaluation network from step (7), namely the cumulative reward function Q(s(t), a(t)), from s(t) and a(t), further calculate the gradient of Q(s(t), a(t)) with respect to a(t), and train the action network based on this gradient to obtain the new action network weight θ_a(t+1);
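For illustration only, a corresponding sketch of step (8) under the same assumptions is given below; maximizing the output of the evaluation network with respect to the generated action propagates the gradient of Q(s(t), a(t)) with respect to a(t) through the action network.

```python
import torch

# Minimal sketch of step (8), same assumptions as above: the gradient of Q with respect
# to a(t) is propagated through the action network by maximizing the evaluation
# network's output (i.e. minimizing its negative mean).
def train_actor(s):
    a_pred = actor(s)                                       # a(t) from the action network
    q_value = critic(torch.cat([s, a_pred], dim=1))         # Q(s(t), a(t)) from the new evaluation network
    loss = -q_value.mean()                                  # ascend the gradient of Q w.r.t. a(t)
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()                                  # yields the new weights theta_a(t+1)
```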
[0037] (9) According to the weight θ_c(t+1) of the new evaluation network obtained in step (7), update the weight θ_c^t(t+1) of the target evaluation network: θ_c^t(t+1) = θ_c(t+1)τ + θ_c^t(t)(1-τ);
[0038] According to the weight θ_a(t+1) of the new action network obtained in step (8), update the weight θ_a^t(t+1) of the target action network: θ_a^t(t+1) = θ_a(t+1)τ + θ_a^t(t)(1-τ), where τ is a constant.
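For illustration only, the soft update of step (9) can be sketched as follows under the same assumptions; τ = 0.01 is an assumed example value, not a value prescribed by this embodiment.

```python
# Minimal sketch of step (9): soft update of the target networks with the constant tau.
tau = 0.01  # assumed example value

def soft_update(target_net, net):
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        # theta_target(t+1) = theta(t+1) * tau + theta_target(t) * (1 - tau)
        target_param.data.copy_(param.data * tau + target_param.data * (1.0 - tau))
```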
[0039] Through the above process, one round of training of the target action network and the target evaluation network is completed.
[0040] (10) Repeat steps (4)-(9) until all networks (the action network, the target action network, the evaluation network and the target evaluation network) converge.
[0041] (11) According to the state of the legged robot 3D model, use the converged action network from step (10) to obtain the expected angles of the knee and hip joints of the legged robot 3D model, thereby realizing the motion control of the legged robot 3D model.
[0042] In other words, a control instruction can be obtained in real time according to the motion state of the robot, and this control instruction maximizes the cumulative reward Q.
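For illustration only, once the action network has converged, the control instruction of step (11) reduces to a single forward pass, as in the following sketch under the same assumptions.

```python
import torch

# Minimal sketch of step (11): once the action network has converged, the control
# instruction is obtained directly from the current state (same assumptions as above).
def control_step(s_t):
    with torch.no_grad():
        return actor(s_t)   # expected knee and hip joint angles a(t) that maximize Q
```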
[0043] The implementation process of the present invention is illustrated below by taking the legged motion of a quadruped robot as an example.
[0044] First, a 3D model of the legged robot is established in the Webots simulation environment. Set the robot body coordinate system (Figure 2): the position of the robot's center of mass is the origin, the z-axis points vertically upward, the x-axis points to the side of the robot in the horizontal plane, and the y-axis points in the forward direction of the robot in the horizontal plane. Set the global coordinate system (Figure 2): the starting point on the horizontal plane is the origin of the global coordinate system, the z-axis points vertically upward, and the y-axis points north. Looking down on the robot, with the origin of the body coordinate system as the center, the four legs of the robot are designated front left, front right, back left and back right. Each leg has two degrees of freedom, namely the hip joint h rotating about the x-axis and the knee joint k rotating about the x-axis. Therefore, for legged motion, the control quantity of the 3D model of the legged robot is selected as:
[0045] a = [θ_hfr θ_kfr θ_hfl θ_kfl θ_hbr θ_kbr θ_hbl θ_kbl]
[0046] where θ is the joint angle, the subscript h denotes the hip joint, k denotes the knee joint, f and b denote front and back respectively, and l and r denote left and right respectively.
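For illustration only, the following sketch shows how the control quantity a could be applied to the eight joint motors from within a Webots controller written in Python; the motor device names are assumptions chosen for this sketch and are not defined in this embodiment.

```python
# Minimal sketch of applying the control quantity a to the eight joint motors in a
# Webots controller; the device names below are assumptions for illustration.
from controller import Robot

robot = Robot()
timestep = int(robot.getBasicTimeStep())

motor_names = ["hip_fr", "knee_fr", "hip_fl", "knee_fl",
               "hip_br", "knee_br", "hip_bl", "knee_bl"]
motors = [robot.getDevice(name) for name in motor_names]

def apply_action(a):
    # a = [theta_hfr, theta_kfr, theta_hfl, theta_kfl, theta_hbr, theta_kbr, theta_hbl, theta_kbl]
    for motor, angle in zip(motors, a):
        motor.setPosition(float(angle))   # position-controlled joint motor
```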
[0047] The state quantity of the robot is selected as:
[0048] s = [p_x p_y p_z v_x v_y v_z θ_roll θ_pitch θ_yaw w_x w_y w_z]
[0049] where p_x and v_x are the position and velocity of the robot along the x direction in the global coordinate system (and similarly for the y and z components), θ_roll, θ_pitch and θ_yaw are the roll angle (rotation about the body y-axis), pitch angle (rotation about the body x-axis) and yaw angle (rotation about the body z-axis), respectively, and w_x, w_y and w_z are the corresponding angular velocities.
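For illustration only, the 12-dimensional state quantity could be assembled from Webots sensors as sketched below, reusing the robot and timestep objects from the previous sketch; the sensor device names and the finite-difference velocity estimate are assumptions of this sketch.

```python
# Minimal sketch of assembling the 12-dimensional state s from Webots sensors; the
# device names ("gps", "inertial_unit", "gyro") are assumptions for illustration, and
# the velocity is approximated by finite differences of the GPS position.
gps = robot.getDevice("gps")
imu = robot.getDevice("inertial_unit")
gyro = robot.getDevice("gyro")
for device in (gps, imu, gyro):
    device.enable(timestep)

prev_position = [0.0, 0.0, 0.0]

def read_state():
    global prev_position
    position = list(gps.getValues())                    # p_x, p_y, p_z
    velocity = [(p - q) / (timestep / 1000.0)           # finite-difference v_x, v_y, v_z
                for p, q in zip(position, prev_position)]
    prev_position = position
    attitude = list(imu.getRollPitchYaw())              # theta_roll, theta_pitch, theta_yaw
    angular_velocity = list(gyro.getValues())           # w_x, w_y, w_z
    return position + velocity + attitude + angular_velocity
```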
[0050] Set the instant reward function as:
[0051] R(s(t), a(t)) = 1000 v_y(t) - 50(|θ_pitch(t)| + |θ_roll(t)| + |θ_yaw(t)|)
[0052]                - 100(|p_x(t) - p_x^d(t)| + |p_y(t) - p_y^d(t)| + |p_z(t) - p_z^d(t)|)
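For illustration only, this instant reward can be sketched in Python as follows, where p^d is taken to be the preset desired centroid position (how the desired position is supplied is an assumption of this sketch).

```python
# Minimal sketch of the instant reward of this example; p_desired stands for the preset
# desired centroid position (an assumption of how the deviation terms are supplied).
def instant_reward(state, p_desired):
    px, py, pz, vx, vy, vz, roll, pitch, yaw, wx, wy, wz = state
    forward_term = 1000.0 * vy                                   # reward forward speed along y
    attitude_term = 50.0 * (abs(pitch) + abs(roll) + abs(yaw))   # penalize attitude deviation
    position_term = 100.0 * (abs(px - p_desired[0]) +
                             abs(py - p_desired[1]) +
                             abs(pz - p_desired[2]))             # penalize position deviation
    return forward_term - attitude_term - position_term
```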
[0053] In the state-action value function of step (2), γ = 0.95 is selected.
[0054] According to step (3), taking s as the network input and a as the network output, establish and initialize the action network and the evaluation network. Each network contains 5 hidden layers, and each hidden layer contains 500 neurons. The activation function of the neurons is the ReLU function:
[0055] φ(a)=max(0,a)
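For illustration only, the action network and evaluation network of this example, together with their target copies, could be built as sketched below, again assuming PyTorch; the learning rates are assumed values, while the layer sizes follow the 5 hidden layers of 500 ReLU neurons described above and the input/output dimensions follow the state and control quantities defined above.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch of the networks of this example, assuming PyTorch: 5 hidden layers of
# 500 ReLU neurons each; STATE_DIM = 12 and ACTION_DIM = 8 follow the state and control
# quantities defined above; the learning rates are assumed values.
STATE_DIM, ACTION_DIM, HIDDEN = 12, 8, 500

def mlp(in_dim, out_dim):
    layers, dim = [], in_dim
    for _ in range(5):                       # 5 hidden layers of 500 neurons
        layers += [nn.Linear(dim, HIDDEN), nn.ReLU()]
        dim = HIDDEN
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

actor = mlp(STATE_DIM, ACTION_DIM)               # action network: s -> a
critic = mlp(STATE_DIM + ACTION_DIM, 1)          # evaluation network: (s, a) -> Q
target_actor = copy.deepcopy(actor)              # target action network
target_critic = copy.deepcopy(critic)            # target evaluation network
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
```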
[0056] According to step (4), obtain the state s(t) of the legged robot 3D model from Webots, feed it to the action network to obtain a(t), have each joint of the robot reach the expected angle given by a(t) at time t+1, read the new state s(t+1), and calculate the instant reward R(s(t), a(t)). Store s(t), a(t), R(s(t), a(t)) and s(t+1) in the replay variable.
[0057] Step (4) is repeated continuously. When the number of samples in the replay variable is greater than 1000, 200 samples are randomly selected, and the outputs of the target action network and the target evaluation network are calculated according to step (5). Using the output of the target evaluation network, the value Q(s(t), a(t)) of the cumulative reward function is updated according to step (6).
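For illustration only, the following sketch ties the previous sketches together into one training loop matching steps (4) to (9) of this example; the exploration noise, the desired centroid position and the absence of episode resets are simplifying assumptions of this sketch.

```python
import numpy as np
import torch

# Minimal sketch of the training loop of this example, reusing the sketches above;
# the exploration noise scale, the desired centroid position and the lack of episode
# resets are assumptions of this sketch, not part of the embodiment.
noise_std = 0.1                                   # assumed exploration noise

s_t = read_state()
while True:
    a_t = control_step(torch.tensor([s_t], dtype=torch.float32))[0].numpy()
    a_t = a_t + np.random.normal(0.0, noise_std, size=ACTION_DIM)   # explore around a(t)
    apply_action(a_t)
    robot.step(timestep)                          # advance the Webots simulation to t+1
    s_next = read_state()
    r_next = instant_reward(s_next, p_desired=[0.0, 0.0, 0.3])      # assumed desired centroid
    store_transition(s_t, a_t, s_next, r_next)
    if len(replay) > 1000:                        # as in this example: start after 1000 samples
        batch = sample_batch(200)                 # randomly select 200 samples
        s, a, s2, r = [torch.tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)]
        train_critic(s, a, s2, r.unsqueeze(1))    # steps (5)-(7)
        train_actor(s)                            # step (8)
        soft_update(target_critic, critic)        # step (9)
        soft_update(target_actor, actor)
    s_t = s_next
```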
[0058] At this point, the samples {[s(t), a(t)], Q(s(t), a(t))} can be used to train the evaluation network.
[0059] According to step (8), update the weight θ_a(t+1) of the action network. According to step (9), use the action network and the evaluation network to update the weights of the target action network and the target evaluation network.
[0060] At this point, one round of training is complete. Steps (4)-(9) are repeated continuously, and after about 20,000 iterations the networks converge.
[0061] This embodiment also provides a legged robot motion control system based on deep reinforcement learning. The system includes: a first module for constructing a legged robot 3D model in the Webots simulation environment, wherein the legged robot 3D model includes a body and four legs, the four legs are connected to the body and are located below the body, each leg includes a calf, a knee joint, a thigh and a hip joint, the calf is connected to the thigh through the knee joint, and the thigh is connected to the body through the hip joint; a second module for initializing the state of the legged robot 3D model and presetting the instant reward function R(s(t), a(t)) and the cumulative reward function Q(s(t), a(t)) at time t, where s(t) is the state of the robot 3D model at time t and a(t) is the expected angle of the knee and hip joints of the robot 3D model at time t; a third module for constructing the action network and the target action network, as well as the evaluation network and the target evaluation network; a fourth module for generating the expected angle a(t) of the knee and hip joints through the action network in the state s(t) of the robot 3D model at time t, moving the knee and hip joints of the robot 3D model to the expected angle a(t) at time t+1, reading the robot state information s(t+1) at this time, calculating the instant reward R(t+1) of the robot 3D model motion at time t+1, and storing [s(t), a(t), s(t+1), R(t+1)] as a sample in the replay variable; a fifth module for, after multiple samples have been collected through the fourth module, randomly selecting a certain number of samples from the replay variable, using the target action network to generate a'(t+1) corresponding to s(t+1), taking this as the input, and then using the target evaluation network to obtain the value Q'(s(t+1), a'(t+1)) of the cumulative reward function, where a'(t+1) is the output of the target action network at time t+1; a sixth module for updating the value Q(s(t), a(t)) of the cumulative reward function using the instant reward R(t+1) stored in the replay variable; a seventh module for taking [s(t), a(t)] as the input and Q(s(t), a(t)) as the output, constructing training samples, training the evaluation network, and obtaining the new evaluation network weight θ_c(t+1); an eighth module for taking the state s(t) of the robot 3D model as the input, obtaining the expected angle a(t) of the knee and hip joints from the action network, obtaining the output of the new evaluation network, namely the cumulative reward function Q(s(t), a(t)), from s(t) and a(t), further calculating the gradient of Q(s(t), a(t)) with respect to a(t), and training the action network based on this gradient to obtain the new action network weight θ_a(t+1); a ninth module for updating the weight θ_c^t(t+1) of the target evaluation network with the new evaluation network weight θ_c(t+1), and updating the weight θ_a^t(t+1) of the target action network with the new action network weight θ_a(t+1); a tenth module for repeating the above training until the action network, the target action network, the evaluation network and the target evaluation network converge; and an eleventh module for using the converged action network of the tenth module to obtain the expected angles of the knee and hip joints of the legged robot 3D model according to the state of the legged robot 3D model, thereby realizing the motion control of the legged robot 3D model.
[0062] This embodiment is directed at a new type of movement mechanism and a new compound motion mode of a legged robot. Based on the deep deterministic policy gradient method, the legged robot for the first time achieves stable and rapid movement in an unknown environment and obtains a new and efficient compound movement mode. This method breaks through the dependence of existing control methods on an object dynamics model and an environment model, greatly reduces the parameter-tuning workload in existing controller design, and can learn motion control strategies for multiple motion modes simultaneously, without the training having to be designed separately for each mode. More importantly, because the robot continuously learns the optimal motion strategy during active exploration of the environment, only a short period of retraining is required when the structure or design parameters of the robot change, or when the ground contact conditions change, so this intelligent motion control method can be applied to similar objects and similar environments.
[0063] The deep deterministic policy gradient method adopted in this embodiment relies on deep neural networks to model the policy function and the value function in a continuous action space. Through reasonable algorithm design and sufficient learning and training, the optimal control strategy can be obtained in the continuous action space. This process is realized through the autonomous exploration and learning of the robot without human intervention, and a large amount of early learning and training can be carried out in the Webots robot simulation software.
[0064] The above-mentioned embodiments are only preferred specific implementations of the present invention, and the usual changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should be included in the protection scope of the present invention.