Multi-axis collaborative robot dynamic path planning method based on deep reinforcement learning
By constructing a kinematic and collision detection model for a multi-axis collaborative robot and introducing dynamic Gaussian noise and an importance sample replay mechanism, the deep reinforcement learning model was optimized, solving the learning difficulties in dynamic obstacle avoidance planning for a six-axis collaborative robot and achieving more efficient obstacle avoidance planning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2024-04-23
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies for dynamic obstacle avoidance planning in six-axis collaborative robots suffer from problems such as long learning cycles, unreasonable reward function construction, and high computational load due to the high dimensionality of the state space, making it difficult to effectively improve the robot's dynamic obstacle avoidance capabilities.
We construct a kinematic model and an obstacle collision detection model for a multi-axis collaborative robot, design a deep reinforcement learning model, introduce irrelevant dynamic zero-mean Gaussian noise and an important sample replay mechanism, optimize the neural network strategy, and improve learning efficiency and accuracy by dynamically controlling the Gaussian noise variance and the reward function design.
By optimizing deep reinforcement learning methods, the robot's obstacle avoidance planning ability in dynamic environments is improved, the learning cycle is shortened, and the convergence speed and obstacle avoidance efficiency of the algorithm are enhanced.
Smart Images

Figure CN118596133B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of robot dynamic obstacle avoidance planning technology, specifically relating to a dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning. Background Technology
[0002] Robots are widely used in many fields such as automobile production, agricultural harvesting, medical surgery, and aerospace. However, the wide range of applications has brought new challenges to robot programming technology. In particular, the real-time requirements for algorithms are higher in dynamic obstacle avoidance planning scenarios, and the large amount of online collision detection increases the difficulty of obstacle avoidance planning. Traditional path planning algorithms are quite challenging in such scenarios. In recent years, reinforcement learning methods have gradually emerged in various complex tasks, providing a new approach to dynamic obstacle avoidance planning for robots. Reinforcement learning differs significantly from traditional decision-making and planning algorithms. It improves the agent's control over the robot by continuously interacting with and learning from the environment. While reinforcement learning has achieved some success in robot control planning, most current methods focus on obstacle avoidance planning in dynamic environments such as mobile robots, autonomous vehicles, and unmanned vessels. Research on obstacle avoidance planning for six-axis collaborative robots based on deep reinforcement learning algorithms in dynamic environments is relatively limited. Furthermore, when constructing a deep reinforcement learning model for an agent in complex scenarios involving dynamic obstacle avoidance planning for a six-axis collaborative robot, there are some issues that need to be addressed and optimized. These include long learning cycles due to environmental complexity, difficulty in convergence due to unreasonable reward function construction, and excessive computational load due to the high dimensionality of the state space. Summary of the Invention
[0003] To address the shortcomings of existing technologies and improve the dynamic obstacle avoidance capabilities of robots, this invention adopts the following technical solution:
[0004] A dynamic path planning method for multi-axis collaborative robots based on deep reinforcement learning includes the following steps:
[0005] Step 1: Construct the kinematic model of the multi-axis collaborative robot;
[0006] Step 2: Construct a collision detection model between the multi-axis collaborative robot and obstacles;
[0007] Step 3: Based on the kinematic model and collision detection model, construct a deep reinforcement learning model for the multi-axis collaborative robot. The deep reinforcement learning model includes a deep neural network and a reinforcement learning algorithm. The deep neural network includes a participant neural network and a commentator neural network. The participant neural network acquires state values and outputs action values, while the commentator neural network evaluates the value of the action values. The reinforcement learning algorithm introduces irrelevant dynamic neighborhood Gaussian noise. By dynamically controlling the Gaussian noise, the exploration intensity of the action values is adjusted, thereby optimizing the neural network strategy. As the learning process and the number of explorations increase, the neural network strategy becomes more and more mature, and the estimation of the optimal action becomes more and more accurate. At this point, the exploration intensity can be reduced to save computing power and time. An important sample replay mechanism is introduced, using the TD error of the sample data as a standard to measure the importance of the sample and ranking them by importance. The probability of extracting and learning data samples with higher rankings is increased. If the difference δ between the current state / action value and the target state / action value is large, then the sample data is considered to be of high importance, and its importance is ranked higher.
[0008] Step 4: Design the state space and action space of the deep reinforcement learning model;
[0009] Step 5: Design the reward function for the deep reinforcement learning model;
[0010] Step 6: Optimize the deep reinforcement learning model.
[0011] Furthermore, in step 1, the robot's structural dimensions and the angles of each joint are obtained, and the position and orientation of each joint in space are obtained through matrix translation and rotation calculations.
[0012] Furthermore, in step 2, the robot's links are wrapped with cylindrical bounding boxes and simplified into spatial line segments, and obstacles are wrapped with spherical bounding boxes. A collision detection model is constructed based on the radii of the cylindrical and spherical bounding boxes, thereby converting the complex spatial relationships between the robot's links and obstacles into relative positional relationships between spatial line segments and spheres.
[0013] Furthermore, the central axis segment of the cylindrical enclosure is L, with endpoints M1 and M2, and the center of the spherical enclosure is P. obs The radius of the spherical bounding box is r. obs The collision model of the cylinder has a radius of r, and the collision detection is as follows:
[0014] like and Then, the center P of the sphere can be obtained using Heron's formula. obs The minimum distance to the central axis segment L is:
[0015]
[0016] Where a = |M1-M2|, b = |M1-P obs |,c=|M2-P obs |, t=(a+b+c) / 2;
[0017] like and Then the center of the ball P obs The minimum distance d to the central axis segment L is d = |M1P obs |;
[0018] like and Then the center of the ball P obs The minimum distance d to the central axis segment L is d = |M2P obs |;
[0019] If drr obs When the value is greater than 0, the robot will not collide with the obstacle; otherwise, a collision will occur.
[0020] Furthermore, in step 5, the reward function of the deep reinforcement learning model is as follows:
[0021]
[0022] Where, ω g and ω i d represents the weight of the corresponding reward function. g d represents the Euclidean distance between the target object at its current position and the robot end effector at its current position. s Indicates the safe distance, d i (t) represents the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t, which is calculated by the robot-environment obstacle collision model, d i (t-1) represents the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t-1. When the difference between the two is greater than 0, it indicates that the robot link is moving away from the obstacle, and the agent will receive a positive reward. Conversely, it indicates that the robot is moving closer to the obstacle, and the agent will receive a negative reward. t This represents the distance threshold, and collision represents a collision.
[0023] ω t .d g The reward function is used to guide the robot gripper to gradually approach the target position;
[0024] (if d i <d s The reward function is used when the distance between each link of the robot and an environmental obstacle is less than a safe distance d. sWhen performing obstacle avoidance maneuvers, priority should be given to ensure that the robot body does not collide with obstacles.
[0025] R g The Euclidean distance between the robot end effector position and the target position reaches a threshold d. t When this is achieved, it indicates that the robot's terminal gripper has been positioned at the target location, thus providing the agent with a significant positive reward.
[0026] R c This is used to impose a severe negative penalty on the agent when any link of the robot collides with an obstacle.
[0027] Furthermore, in step 3, the Gaussian noise variance σ is dynamically controlled. 2 The intensity of the agent's exploration: σ 2 The smaller the value of σ, the thinner the normal distribution, and the greater the probability that the output action value is close to the target action value 'a', thus enhancing the development properties; 2 The larger the value, the wider the normal distribution shape, and the greater the probability that the output action value deviates from the target action value 'a', thus enhancing the exploration nature.
[0028] Furthermore, the dynamic control Gaussian noise variance σ 2 The formula is as follows:
[0029]
[0030] The learning process is divided into T stages, with the initial stage (0 to 10) being the first stage. In this stage, select a large σ 2 The value σ helps the agent's exploratory learning and increases the probability of obtaining effective samples. As the learning process progresses through multiple stages, the number of samples gradually becomes sufficient, necessitating the development of sample data. 2 Gradually reduce the sample size, increase the intensity of sample development, and in Maintain constant until stage T-2 This ensures a good balance between exploration and development in the learning process, preventing σ from becoming too high after T-2 phases. 2 Instead of decaying to 0, it needs to remain. N e To learn the number of rounds, since a fully greedy algorithm based on a finite number of rounds is still a greedy algorithm with local information and is still far from the optimal solution, it is necessary for the agent to retain a certain amount of exploration capability.
[0031] Furthermore, in the reinforcement learning algorithm of step 3, to avoid the learning process getting stuck in local optima, samples are drawn according to their respective probabilities:
[0032]
[0033] in rank(i) represents the ranking of the i-th sample among all samples, and N represents the number of samples, arranged in descending order according to the corresponding δ.
[0034] Furthermore, the robot in step 4 is a six-axis collaborative robot, and the state space of the deep reinforcement learning model contains five parts: the three-dimensional coordinates of the six joints of the six-axis collaborative robot and the difference J between the target position. i (x,y,z)-(x t ,y t ,z t The difference J between the three-dimensional coordinates of the robot's six joints and the position of the dynamic obstacle. i (x,y,z)-(x o ,y o ,z o The target position's three-dimensional coordinates (x, y) are referenced to the robot's base coordinate system. t ,y t ,z t ), the three-dimensional coordinates of the dynamic obstacle (x) with reference to the robot's base coordinate system. o ,y o ,z o If the robot has reached the target position (in_goal), the input state is represented as follows:
[0035] s={J1(x,y,z)-(x t ,y t ,z t ),J2(x,y,z)-(x t ,y t ,z t ),J3(x,y,z)-(x t ,y t ,z t ),
[0036] J4(x,y,z)-(x t ,y t ,z t ),J5(x,y,z)-(x t ,y t ,z t ),J6(x,y,z)-(x t ,y t ,z t ),
[0037] J1(x,y,z)-(x o ,y o ,z o ),J2(x,y,z)-(x o ,yo ,z o ),J3(x,y,z)-(x o ,y o ,z o ),
[0038] J4(x,y,z)-(x o ,y o ,z o ),J5(x,y,z)-(x o ,y o ,z o ),J6(x,y,z)-(x o ,y o ,z o ),
[0039] (x t ,y t ,z t ),(x o ,y o ,z o ),in_goal}.
[0040] The action space of the deep reinforcement learning model is designed as a six-dimensional vector based on the characteristics of path planning tasks, and outputs the rotational angular velocities of the six joints of the robot, as follows:
[0041] a={ω1,ω2,ω3,ω4,ω5,ω6}.
[0042] Furthermore, the specific process of step 6 is as follows:
[0043] First, a CoppeliaSim simulation environment is set up for the deep reinforcement learning model optimization process, and the CoppeliaSim simulation environment is initialized for each planning round.
[0044] Then, the current relevant states in the environment, including the differences between the robot's multiple joint positions and the target position, the differences between the robot's joint positions and the obstacle positions, the target position, the obstacle position, and the target arrival state value, are input into the deep reinforcement learning network model as state space values. The target position and the dynamic real-time position of the obstacle are obtained from the simulation environment. Based on this, the remaining state space values are calculated by the kinematic model and the collision detection model.
[0045] Next, the motion space values output from the deep reinforcement learning model drive the robot's movement in the simulation environment. A collision detection model is used to determine whether a collision has occurred or whether the planning task has been completed. If a collision occurs or the number of planning experiments reaches the maximum number of steps, the task is not completed in this round, the round ends, the dynamic obstacle avoidance planning task is deemed to have failed, a large negative reward is given, and a new round is started. If the robot's end effector successfully plans to the target position, the task is deemed to have been executed successfully; otherwise, the next step is performed.
[0046] By continuously interacting with the environment, the parameters of the deep reinforcement learning network model are optimized, thereby continuously improving the robot's dynamic obstacle avoidance planning ability. When the obstacle avoidance planning ability of the robot converges, the training process stops to avoid overfitting.
[0047] The advantages and beneficial effects of this invention are as follows:
[0048] (1) This invention constructs a kinematic model and an obstacle collision model for the robot, designs a deep reinforcement learning model for the robot, and introduces unrelated dynamic zero-mean Gaussian noise to improve the exploratory nature of action selection. It also dynamically controls the magnitude of the Gaussian noise variance to maintain the balance between the algorithm exploration process and the development process. Finally, it introduces an important sample replay mechanism to improve the convergence speed of the learning process.
[0049] (2) This invention constructs a state space, action space and reward function that fully consider the characteristics of robot dynamic obstacle avoidance planning, so as to promote better learning of robot agents and enable agents to continuously iterate and update strategies in the CoppeliaSim simulation environment to obtain intelligent dynamic obstacle avoidance planning strategies. Attached Figure Description
[0050] Figure 1 This is a flowchart of the method in an embodiment of the present invention.
[0051] Figure 2 This is a schematic diagram of the structural dimensions of a six-axis collaborative robot in an embodiment of the present invention.
[0052] Figure 3a This is a bounding box model diagram of the robot model in an embodiment of the present invention.
[0053] Figure 3b This is a spatial line segment diagram of the robot model in an embodiment of the present invention.
[0054] Figure 3c This is a diagram of a sphere surrounded by obstacles in an embodiment of the present invention.
[0055] Figure 4a This is one of the simplified diagrams showing the relative positional distribution of the obstacle-surrounded sphere and the cylinder axis in an embodiment of the present invention.
[0056] Figure 4b This is the second simplified diagram showing the relative positional distribution of the obstacle-surrounded sphere and the cylinder axis in an embodiment of the present invention.
[0057] Figure 4c This is the third simplified diagram showing the relative positional distribution of the obstacle-surrounded sphere and the cylinder axis in an embodiment of the present invention.
[0058] Figure 5 This is a schematic diagram of the dynamic obstacle avoidance planning and training process of a six-axis collaborative robot in an embodiment of the present invention. Detailed Implementation
[0059] The specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for illustration and explanation only and are not intended to limit the present invention.
[0060] like Figure 1 As shown, the dynamic path planning method for multi-axis collaborative robots based on deep reinforcement learning includes the following steps:
[0061] Step 1: Construct the kinematic model of the six-axis collaborative robot;
[0062] like Figure 2 As shown, the model elements include the known robot structural dimensions and the angles of each joint of the robot. By using matrix translation and rotation calculations to determine the position and orientation of each joint in space, the kinematic model of the robot can be constructed.
[0063] Step 2: Construct a collision detection model between the six-axis collaborative robot and obstacles;
[0064] like Figure 3a As shown, firstly, a cylindrical enclosure is used to wrap around each link of the robot, and then simplified as follows. Figure 3b The spatial line segment shown, then as Figure 3c As shown, an obstacle-bound sphere model is obtained by wrapping the environmental obstacle with a spherical bounding box. The radius of this model is the radius r of the spherical bounding box. obs The sum of the robot's link radius r transforms the complex spatial relationships between the robot's links and obstacles into the relative positional relationships between spatial line segments and the sphere.
[0065] like Figures 4a to 4c As shown, three possible positional relationships between each link and the obstacle bounding box are revealed, where L is the central axis segment of the cylindrical collision model of a single robot link, with endpoints M1, M2, and P. obs r is the center of the spherical bounding box of the obstacle. obs Let r be the radius of the spherical bounding box of the obstacle, and r be the radius of the cylindrical collision model. Let a = |M1 - M2|, b = |M1 - P|.obs |,c=|M2-P obs |, then let t = (a + b + c) / 2. If and That is Figure 4a In the case shown, using Heron's formula, the minimum distance from the center of the bounding box to the central axis segment L is:
[0066]
[0067] if and That is Figure 4b In the case shown, the minimum distance d from the center of the bounding box sphere to the central axis segment L is d = |M1P obs |. If and That is Figure 4c In the case shown, the minimum distance d from the center of the bounding box sphere to the central axis segment L is d = |M²P|. obs In conclusion, if drr obs If the value is greater than 0, the robot will not collide with the obstacle; otherwise, a collision is considered to have occurred.
[0068] Step 3: Based on the kinematic model and collision detection model, construct a deep reinforcement learning model for the six-axis collaborative robot;
[0069] The deep reinforcement learning model consists of a deep neural network and a reinforcement learning algorithm. The deep neural network includes a fully connected network and an activation function. The neural network consists of an input layer, three hidden layers, and an output layer. The neural network architecture is shown in Tables 1 and 2 below. The input of the actor neural network is the state space value s, and the output is the action space value a. The critic neural network is used to evaluate the value of the action value.
[0070] Table 1 Actor Neural Network Architecture
[0071]
[0072] Table 2. Critical Neural Network Architecture
[0073]
[0074]
[0075] The reinforcement learning algorithm chosen is the DDPG algorithm, and uncorrelated dynamic zero-mean Gaussian noise is introduced into the original DDPG algorithm to dynamically control the variance σ of the Gaussian noise. 2 σ can be used 2 To control the intensity of the agent's exploration: σ 2The smaller the value of σ, the "thinner" the normal distribution, the greater the probability that the output action value will be close to 'a', and the more properties it can develop; 2 The larger the value, the "fatter" the normal distribution shape, increasing the probability of the output action value deviating from 'a', and leading to more exploration-related characteristics. As the learning process and number of explorations increase, the neural network strategy becomes more sophisticated, and the estimation of the optimal action becomes more accurate. At this point, the exploration effort can be reduced to save computing power and time, and the Gaussian noise variance σ can be dynamically controlled. 2 As shown in equation (2).
[0076]
[0077] The learning process is divided into T stages, with the initial stage (0 to 10) being the first stage. In this stage, the larger value of σ is selected. 2 This facilitates the agent's exploratory learning, increases the probability of obtaining effective samples, and as the learning process progresses through multiple stages, the number of samples gradually becomes sufficient. It then becomes necessary to develop sample data, σ. 2 Gradually reducing the sample size can increase the development of the sample, therefore in Maintaining a constant at stage T-2 This ensures a good balance between exploration and development in the learning process. After T-2 phases, σ will not be affected. 2 Instead of decaying to 0, it needs to remain. (N e (For the number of learning rounds), because a fully greedy algorithm based on a finite number of rounds is still a greedy algorithm with local information and is still far from the optimal solution, it is necessary for the agent to retain a certain amount of exploration ability.
[0078] Furthermore, the original DDPG algorithm introduces an important sample replay mechanism, using the magnitude of the TD error of the sample data as a standard to measure the importance of the sample and ranking them by importance. This increases the probability of extracting and learning data samples with higher rankings. If the difference δ between the current state action value and the target state action value is relatively large, it can be concluded that this sample data is of high importance, and its importance is ranked higher. To avoid the learning process getting stuck in local optima, samples are extracted according to their respective probabilities.
[0079]
[0080] in rank(i) is the ranking of the i-th sample among all samples, arranged in descending order according to the corresponding δ.
[0081] The improved DDPG algorithm flow is shown in Table 3.
[0082] Table 3. Flowchart of the improved DDPG algorithm integrating dynamic variance action extraction and key sample playback
[0083]
[0084] Step 4: Design the state space and action space of the deep reinforcement learning model;
[0085] The state space of a deep reinforcement learning model consists of five parts: the three-dimensional coordinates of the six joints of the six-axis collaborative robot and the difference J between the target position. i (x, y, z)-(x) t y t , z t The difference J between the three-dimensional coordinates of the robot's six joints and the position of the dynamic obstacle. i (x, y, z)-(x) o y o , z o The target position's three-dimensional coordinates (x, y) are referenced to the robot's base coordinate system. t y t , z t ), the three-dimensional coordinates of the dynamic obstacle (x) with reference to the robot's base coordinate system. o y o , z o If the robot has reached the target position (in_goal), the input state is represented as follows:
[0086] s={J1(x,y,z)-(x t y t , z t ), J2(x, y, z)-(x t y t , z t ), J3(x, y, z)-(x t y t , z t ),
[0087] J4(x,y,z)-(x t ,y t ,z t ),J5(x,y,z)-(x t ,y t ,z t ),J6(x,y,z)-(x t ,y t ,z t ),
[0088] J1(x,y,z)-(x o ,y o ,z o),J2(x,y,z)-(x o ,y o ,z o ),J3(x,y,z)-(x o ,y o ,z o ),
[0089] J4(x,y,z)-(x o ,y o ,z o ),J5(x,y,z)-(x o ,y o ,z o ),J6(x,y,z)-(x o ,y o ,z o ),
[0090] (x t ,y t ,z t ),(x o ,y o ,z o ),in_goal}(4)
[0091] The action space of the deep reinforcement learning model is designed as a six-dimensional vector based on the characteristics of path planning tasks, and outputs the rotational angular velocities of the six joints of the robot, as follows:
[0092] a={ω1,ω2,ω3,ω4,ω5,ω6}#(5)
[0093] Step 5: Design the reward function for the deep reinforcement learning model;
[0094] The reward function of a deep reinforcement learning model consists of four parts, the first part being:
[0095] R1=ω t .d g #(6)
[0096] Where d g It is the Euclidean distance between the target object at its current position and the robot end effector at its current position, expressed as follows:
[0097]
[0098] Where (x) t ,y t ,z t (x) represents the coordinates of the target object relative to the robot's coordinate system. g ,y g ,z g) represents the coordinates of the robot's end effector gripper. By setting this reward function, the robot gripper can be guided to gradually approach the target position.
[0099] When the distance between each link of the robot and environmental obstacles is less than a certain safe distance d s When performing obstacle avoidance, priority should be given to ensuring that the robot body does not collide with obstacles. Therefore, a second reward function needs to be set, as shown below:
[0100]
[0101] Where d i (t) is the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t, which can be calculated by the robot-environment obstacle collision model. i (t-1) is the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t-1. When the difference between the two is greater than 0, it indicates that the robot link is moving away from the obstacle, and the agent will receive a positive reward. Conversely, it is inferred that the robot is moving closer to the obstacle, and the agent will receive a negative reward.
[0102] When the Euclidean distance between the robot end effector and the target position reaches a certain threshold d t When the robot's end gripper has been positioned at the target location, a large positive reward R is given to the agent. g This is the reward value for the third part. When any link of the robot collides with an obstacle, the agent is given a negative maximum penalty R. c This is the reward value for the fourth part. Therefore, the reward function of the robot's deep reinforcement learning model is as follows:
[0103]
[0104] Where ω g and ω i The weights are for the corresponding reward functions.
[0105] Step 6: Optimize the deep reinforcement learning model. After designing the deep reinforcement learning network model, based on the characteristics of the six-axis collaborative robot, design an appropriate state space, action space, and reward function to train the robot's dynamic obstacle avoidance process.
[0106] like Figure 5As shown, the deep reinforcement learning model optimization process first involves building a CoppeliaSim simulation environment, which is initialized for each planning round. Then, the current relevant states in the environment, including five types of state space values (state, target position, obstacle position, and target arrival state), are input into the deep reinforcement learning network model. The target position and the dynamic real-time position of the obstacle are obtained from the simulation environment. Based on this, the remaining state space values are calculated by the kinematics model and the collision detection model. Next, the motion space values output from the deep reinforcement learning model drive the robot's movement in the simulation environment. The collision detection model determines whether a collision has occurred or whether the planning task has been completed. If a collision occurs or the maximum number of steps in the planning experiment is reached, the task is considered unsuccessful in that round, the round ends, the dynamic obstacle avoidance planning task is deemed a failure, a large negative reward is given, and a new round begins. If the robot's end effector successfully plans to the target position, the task is considered successful; otherwise, the next step is performed. By continuously interacting with the environment, the parameters of the deep reinforcement learning network model are optimized, thereby continuously improving the robot's dynamic obstacle avoidance planning ability. When the obstacle avoidance planning ability of the robot converges, the training process stops to avoid overfitting.
[0107] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning, characterized in that... Includes the following steps: Step 1: Construct the kinematic model of the multi-axis collaborative robot; obtain the robot's structural dimensions and the angles of each joint; and calculate the position and orientation of each joint in space through matrix translation and rotation. Step 2: Construct a collision detection model between the multi-axis collaborative robot and obstacles; wrap the robot's rods with a cylindrical bounding box and simplify them into spatial line segments, wrap the obstacles with a spherical bounding box, and construct a collision detection model based on the radii of the cylindrical and spherical bounding boxes; The central axis segment of the cylindrical enclosure is Its endpoints are , The center of the spherical enclosure is The radius of the spherical bounding box is The radius of the cylinder collision model is The collision detection is as follows: like and The center of the sphere can then be obtained using Heron's formula. to the central axis segment The minimum distance is: in, , , , ; like and Then the center of the ball to the central axis segment minimum distance ; like and Then the center of the ball to the central axis segment minimum distance ; like When the time is right, the robot will not collide with the obstacle; otherwise, it will collide. Step 3: Based on the kinematic model and collision detection model, construct a deep reinforcement learning model for the multi-axis collaborative robot. The deep reinforcement learning model includes a deep neural network and a reinforcement learning algorithm. The deep neural network includes a participant neural network and a commentator neural network. The participant neural network acquires state values and outputs action values, while the commentator neural network evaluates the value of the action values. The reinforcement learning algorithm introduces irrelevant dynamic neighborhood Gaussian noise, and adjusts the exploration intensity of action values by dynamically controlling the Gaussian noise. An important sample replay mechanism is introduced, using the magnitude of sample data error as a standard to measure the importance of samples, and ranking them by importance. This increases the probability of extracting and learning data samples with higher rankings. If the difference between the current state / action value and the target state / action value is large, then this sample data is highly important, and its importance is ranked higher. Step 4: Design the state space and action space of the deep reinforcement learning model; Step 5: Design the reward function for the deep reinforcement learning model; Step 6: Optimize the deep reinforcement learning model.
2. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 1, characterized in that: In step 5, the reward function of the deep reinforcement learning model is as follows: in, and The weights of the corresponding reward function, This represents the Euclidean distance between the target object at its current position and the robot's end effector at its current position. Indicates a safe distance. This represents the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t, which is calculated by the robot-environment obstacle collision model. Let represent the distance between the i-th link and the center point of the spherical bounding box of the obstacle at time t-1. When the difference between the two is greater than 0, it indicates that the robot link is moving away from the obstacle, and the agent will receive a positive reward. Conversely, if the difference is less than 0, it indicates that the robot is moving towards the obstacle, and the agent will receive a negative reward. This represents the distance threshold, and collision represents a collision. The reward function is used to guide the robot gripper to gradually approach the target position; The reward function is used when the distance between each link of the robot and environmental obstacles is less than a safe distance. When necessary, obstacle avoidance should be prioritized. The Euclidean distance between the robot end effector position and the target position reaches a threshold. At that time, the agent is given a very large positive reward; This is used to impose a severe negative penalty on the agent when any link of the robot collides with an obstacle.
3. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 1, characterized in that: In step 3, the Gaussian noise variance is dynamically controlled. Controlling the intensity of the agent's exploration: The smaller the normal distribution, the thinner its shape, and the greater the probability that the output action value is close to the target action value, thus enhancing the development properties. The larger the value, the wider the normal distribution shape, and the greater the probability that the output action value deviates from the target action value, thus enhancing the exploration nature.
4. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 3, characterized in that: The dynamic control Gaussian noise variance The formula is as follows: The learning process is divided into T stages, with the initial stage (0 to 10) being the first stage. At this stage, select the larger ones. Value, as the learning process progresses through multiple stages, Gradually reduce the sample size, increase the intensity of sample development, and in Maintain constant until stage T-2 After T-2 stages, let Keep , This represents the number of learning rounds.
5. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 1, characterized in that: In the reinforcement learning algorithm of step 3, samples are drawn according to their respective probabilities: in , This indicates the rank of the i-th sample among all samples, where N represents the number of samples, and the rank is determined by the corresponding... Arranged from largest to smallest.
6. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 1, characterized in that: The robot in step 4 is a six-axis collaborative robot. The state space of the deep reinforcement learning model contains five parts: the three-dimensional coordinates of the six joints of the six-axis collaborative robot and the difference between the target position. The difference between the three-dimensional coordinates of the robot's six joints and the position of the dynamic obstacle. The three-dimensional coordinates of the target position with reference to the robot base coordinate system. The three-dimensional coordinates of dynamic obstacles with reference to the robot's base coordinate system. The input state, in_goal, indicates whether the robot has reached the target location. The input state is represented as follows: The action space of the deep reinforcement learning model is designed as a six-dimensional vector based on the characteristics of path planning tasks, and outputs the rotational angular velocities of the six joints of the robot, as follows: 。 7. The dynamic path planning method for multi-axis cooperative robots based on deep reinforcement learning according to claim 1, characterized in that: The specific process of step 6 is as follows: First, a simulation environment is set up for the optimization process of the deep reinforcement learning model, and the simulation environment is initialized for each planning round. Then, the current relevant states in the environment, including the differences between the robot's multiple joint positions and the target position, the differences between the robot's joint positions and the obstacle positions, the target position, the obstacle position, and the target arrival state value, are input into the deep reinforcement learning network model as state space values. The target position and the dynamic real-time position of the obstacle are obtained from the simulation environment. Based on this, the remaining state space values are calculated by the kinematic model and the collision detection model. Next, the motion space values output from the deep reinforcement learning model drive the robot's movement in the simulation environment. A collision detection model is used to determine whether a collision has occurred or whether the planning task has been completed. If a collision occurs or the number of planning experiments reaches the maximum number of steps, the task is not completed in this round, the round ends, the dynamic obstacle avoidance planning task is deemed to have failed, a large negative reward is given, and a new round is started. If the robot's end effector successfully plans to the target position, the task is deemed to have been executed successfully; otherwise, the next step is performed. Through continuous interaction between the agent and the environment, the parameters of the deep reinforcement learning network model are optimized. When the agent's obstacle avoidance planning ability converges, the training process stops.