Mobile robot dynamic path planning method, device, equipment and storage medium
By introducing LSTM network and RAdam optimization algorithm into DDPG algorithm, and combining it with priority experience replay mechanism, the problems of slow convergence speed and insufficient stability of deep reinforcement learning path planning method in dynamic environment are solved, and efficient path planning of mobile robot in complex environment is realized.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUXI TAIHU UNIV
- Filing Date
- 2026-03-27
- Publication Date
- 2026-06-19
Figure CN122237584A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device and storage medium for dynamic path planning of mobile robots.
Background Technology
[0002] Mobile robot navigation technology is an important research area in the field of robotics, typically involving key issues such as environment modeling, robot pose estimation, and path planning. Among these, the path planning problem refers to autonomously planning a feasible and optimal path from the robot's starting position to the target position, given the robot's pose as known or partially known, and based on a predetermined optimization objective.
[0003] Traditional path planning methods mostly rely on accurate environmental map information. For example, the A* algorithm and Dijkstra's algorithm are path planning methods based on graph search. These methods perform well in static or structured environments, but in complex dynamic environments, where complete environmental information is difficult to obtain and obstacles behave unpredictably, they often suffer from low path planning efficiency, poor real-time performance, and a low success rate.
[0004] In recent years, with the improvement of computing hardware performance and the development of deep learning technology, deep reinforcement learning methods have shown promising application prospects in the path planning problem of mobile robots in unknown or dynamic environments. Unlike traditional planning methods, deep reinforcement learning can achieve end-to-end path planning decisions through continuous interaction between the robot and its environment, based on a reward feedback mechanism. However, existing deep reinforcement learning path planning methods still generally suffer from slow network training convergence and insufficient stability, and these problems are even more pronounced in dynamic and complex environments.
Summary of the Invention
[0005] To help address the problems of slow network training convergence speed and insufficient stability that are still prevalent in existing deep reinforcement learning path planning methods, this application provides a method, apparatus, device, and storage medium for dynamic path planning of mobile robots.
[0006] In a first aspect, this application provides a dynamic path planning method for a mobile robot, employing the following technical solution: The method includes:
[0007] Obtain the state information of the mobile robot, the state information including pose information, speed information, and distance information of obstacles in the environment in which the mobile robot is located; construct an original dynamic path planning neural network model based on the DDPG algorithm, and update the network parameters of the original model using the RAdam algorithm to obtain a dynamic path planning neural network optimization model; obtain training data of the mobile robot in a preset training environment, and train the optimization model based on the training data to obtain a trained dynamic path planning neural network model; input the state information into the trained model, and output the optimal action of the mobile robot in its environment.
[0008] In one specific implementation, the original model of the dynamic path planning neural network includes an LSTM network, an Actor network, and a Critic network; wherein the LSTM network is cascaded with the Actor network, and the output of the LSTM network is the input of the Actor network.
[0009] In one specific implementation, updating the network parameters of the original model of the dynamic path planning neural network using the RAdam algorithm includes: calculating the loss functions of the Actor network and the Critic network respectively, and calculating the gradients of the Actor network and the Critic network respectively based on the loss functions; calculating the bias-corrected first-order moment estimate and the bias-corrected second-order moment estimate for the Actor network and the Critic network respectively, based on the corresponding gradients; calculating the adaptive learning rate correction factor for the Actor network and the Critic network respectively, based on the corresponding bias-corrected second-order moment estimates; and updating the network parameters of the Actor network and the Critic network respectively, according to the corresponding adaptive learning rate correction factor and bias-corrected first-order moment estimate.
[0010] In one specific implementation, calculating the gradients corresponding to the Actor network and the Critic network according to the loss functions includes:

$$g_k = \nabla_{\theta} J(\theta_k)$$

where $J$ represents the loss function of the Actor network or the Critic network; $\theta_k$ represents the parameters of the Actor network or the Critic network; $k$ represents the number of iterations; $\nabla_{\theta}$ represents the gradient operator with respect to the network parameters $\theta$; and $g_k$ represents the gradient of the Actor network or the Critic network.

The step of calculating the bias-corrected first-order and second-order moment estimates for the Actor network and the Critic network respectively, based on the corresponding gradients, includes:

$$m_k = \beta_1 m_{k-1} + (1 - \beta_1)\, g_k, \qquad \hat{m}_k = \frac{m_k}{1 - \beta_1^{k}},$$

$$v_k = \beta_2 v_{k-1} + (1 - \beta_2)\, g_k^{2}, \qquad \hat{v}_k = \frac{v_k}{1 - \beta_2^{k}},$$

where $m_k$ represents the first-order moment estimate of the gradient after $k$ iterations; $m_{k-1}$ represents the first-order moment estimate of the gradient after $k-1$ iterations; $\beta_1$ represents the first-order moment decay coefficient; $g_k$ represents the gradient corresponding to the Actor network or the Critic network; $\hat{m}_k$ represents the bias-corrected first-order moment estimate after $k$ iterations; $v_k$ represents the second-order moment estimate of the gradient after $k$ iterations; $v_{k-1}$ represents the second-order moment estimate of the gradient after $k-1$ iterations; $g_k^{2}$ represents the element-wise square of the gradient; $\beta_2$ represents the second-order moment decay coefficient; and $\hat{v}_k$ represents the bias-corrected second-order moment estimate after $k$ iterations.

The step of calculating the adaptive learning rate correction factors for the Actor network and the Critic network, based on the corresponding bias-corrected second-order moment estimates, includes computing a factor $\lambda_k$ (the defining equation is not legible in the source text), where $\lambda_k$ represents the adaptive learning rate correction factor.

The step of updating the network parameters of the Actor network and the Critic network respectively, according to the corresponding adaptive learning rate correction factor and bias-corrected first-order moment estimate, includes:

$$\theta_{k+1} = \theta_k - \alpha\, \lambda_k\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}$$

where $\theta_{k+1}$ represents the network parameters at the $(k+1)$-th iteration; $\theta_k$ represents the network parameters at the $k$-th iteration; $\alpha$ represents the base learning rate of the RAdam optimizer; and $\epsilon$ represents the smoothing term, which is a constant.
[0011] In one specific implementation scheme, training the dynamic path planning neural network optimization model based on the training data includes: The training data is input into the dynamic path planning neural network optimization model, and the state transition experience corresponding to the current moment is output. Store the state transition experience corresponding to the current moment into the experience replay pool built into the dynamic path planning neural network optimization model; Calculate the temporal difference error value of each state transition experience in the experience replay pool, and set the calculated temporal difference error value as the learning value corresponding to each state transition experience; Upon receiving the learning value corresponding to the state transition experience at a new moment, the priority of each state transition experience is determined based on the learning value, and model training samples are extracted based on the priority of all state transition experiences. The extracted model training samples are input into the dynamic path planning neural network optimization model for model training; When the dynamic path planning neural network optimization model reaches the preset training termination condition, the model training is completed, and a trained dynamic path planning neural network model is generated.
[0012] In one specific implementation, calculating the temporal difference error value for each state transition experience in the experience replay pool includes:

$$\delta_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $\delta_t$ represents the temporal difference error value of the state transition experience; $\gamma$ represents the discount factor, which is a constant; $s_t$ and $s_{t+1}$ represent the training data of the mobile robot at time $t$ and time $t+1$, respectively; $a_t$ and $a_{t+1}$ represent the executed actions obtained by inputting the training data at time $t$ and time $t+1$ into the dynamic path planning neural network optimization model, respectively; $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ represents the maximum Q value of the mobile robot at time $t+1$; $Q(s_t, a_t)$ represents the Q value of the mobile robot at time $t$; and $r_t$ represents the return value obtained when the mobile robot performs action $a_t$ and transfers from state $s_t$ to state $s_{t+1}$.
[0013] In one specific implementation, the higher the priority of the state transition experience, the higher the probability of it being extracted as a model training sample.
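The relationship between priority and sampling probability can be illustrated with a minimal sketch. It assumes a proportional scheme in which the priority of an experience is the magnitude of its temporal difference error plus a small constant, and the sampling probability is that priority divided by the sum of all priorities. The function names, the constant `eps`, and the proportional form itself are illustrative assumptions; the text above only requires that higher priority means higher sampling probability.

```python
import random
from typing import List, Sequence


def sampling_probabilities(td_errors: Sequence[float], eps: float = 1e-3) -> List[float]:
    """Map TD errors to sampling probabilities (assumed proportional scheme)."""
    # eps keeps experiences with zero TD error from never being sampled.
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors: Sequence[float], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Draw experience indices with replacement, weighted by priority."""
    probs = sampling_probabilities(td_errors)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


# Hypothetical TD errors for four stored experiences.
td_errors = [0.1, 2.0, 0.5, 0.0]
probs = sampling_probabilities(td_errors)
batch = sample_indices(td_errors, batch_size=8, rng=random.Random(0))
```

In this sketch the experience with the largest temporal difference error (index 1) receives the largest probability, while the experience with zero error still keeps a small, nonzero chance of being drawn.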
[0014] In a second aspect, this application provides a dynamic path planning device for a mobile robot, which adopts the following technical solution. The device includes: an information acquisition module, used to acquire the state information of the mobile robot, including pose information, speed information, and distance information of obstacles in the environment in which the mobile robot is located; a model building module, used to build an original model of a dynamic path planning neural network based on the DDPG algorithm, and to update the network parameters of the original model using the RAdam algorithm to obtain a dynamic path planning neural network optimization model; a model training module, used to acquire training data of the mobile robot in a preset training environment, and to train the optimization model based on the training data to obtain a trained dynamic path planning neural network model; and a path planning module, used to input the state information into the trained model and output the optimal action of the mobile robot in its environment.
[0015] In a third aspect, this application provides a computer device that adopts the following technical solution: it includes a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute any of the above-described mobile robot dynamic path planning methods.
[0016] In a fourth aspect, this application provides a computer-readable storage medium that stores a computer program that can be loaded by a processor to execute any of the above-described mobile robot dynamic path planning methods.
[0017] In summary, this application has the following beneficial technical effects: 1. Based on the traditional Actor-Critic dual-network structure of the DDPG algorithm, an LSTM (Long Short-Term Memory) network is introduced to design a policy network structure that cascades the LSTM network and the Actor network, so as to enhance the policy network's ability to model the temporal characteristics of the environment state.
[0018] 2. The RAdam (Rectified Adam) optimization algorithm is introduced during the network parameter update process to improve the stability and convergence speed of the network training process.
[0019] 3. A priority experience replay mechanism is introduced during the training phase, and different sampling priorities are assigned according to the importance of experience samples, thereby improving the utilization efficiency of high-value experience samples and accelerating the learning process of the policy network.
[0020] 4. Through the optimized network design, the solution in this application can output the optimal executable action of the mobile robot in the current state in a complex dynamic environment, guide the robot to reach the target position safely and quickly, and improve the success rate and robustness of path planning.
Attached Figure Description
[0021] Figure 1 is a flowchart of the dynamic path planning method for mobile robots in an embodiment of this application; Figure 2 is a network architecture diagram of the DDPG algorithm model after cascading an LSTM network in an embodiment of this application; Figure 3 is a schematic diagram of the dynamic path planning device for a mobile robot in an embodiment of this application; Figure 4 is a schematic diagram of a computer device in an embodiment of this application.
[0022] Reference numerals: 301, information acquisition module; 302, model building module; 303, model training module; 304, path planning module.
Detailed Implementation
[0023] This application is described in further detail below with reference to Figures 1-4.
[0024] This application discloses a dynamic path planning method for mobile robots. This method addresses the problems of slow model training convergence, low stability, and low path planning success rate in mobile robots operating in dynamic and complex environments.
[0025] Mobile robot navigation technology is an important research area in the field of robotics, and path planning is a key issue in the development of mobile robots. Traditional path planning methods mostly rely on accurate environmental map information. For example, the A* algorithm and Dijkstra's algorithm are path planning methods based on graph search. These methods perform well in static or structured environments, but in complex dynamic environments, where complete environmental information is difficult to obtain and obstacles behave unpredictably, they often suffer from low path planning efficiency, poor real-time performance, and a low success rate.
[0026] In recent years, with the improvement of computing hardware performance and the development of deep learning technology, deep reinforcement learning methods have shown promising application prospects in the path planning problem of mobile robots in unknown or dynamic environments. Unlike traditional planning methods, deep reinforcement learning can achieve end-to-end path planning decisions through continuous interaction between the robot and its environment, based on a reward feedback mechanism. However, existing deep reinforcement learning path planning methods still generally suffer from slow network training convergence speed and insufficient stability, especially in dynamic and complex environments where these problems are more pronounced. Furthermore, current methods also exhibit low path planning success rates, resulting in poor path planning performance for mobile robots in dynamic and complex environments. Therefore, to help address the problems of slow training speed, poor stability, and low success rate in path planning models for mobile robots in dynamic and complex environments, this application provides a dynamic path planning method for mobile robots.
[0027] Referring to Figure 1, the method includes the following steps: S10: Obtain the state information of the mobile robot, including the pose information, speed information, and distance information of obstacles in the environment in which the mobile robot is located.
[0028] Specifically, sensors mounted on the mobile robot can detect and acquire information about the robot itself, such as its pose and speed. They can also detect the distance to obstacles in the robot's environment. Combining the robot's pose, speed, and the distance to obstacles in the environment yields the robot's state information, which can be arranged in chronological order to form a state sequence $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ represents the length of the state sequence and $s_t$ represents the state information of the mobile robot at time $t$. In this embodiment, a multi-line LiDAR is used to acquire the robot's state information, but other sensors can also be used in practice, and no limitation is made here.
[0029] S20: Construct the original model of the dynamic path planning neural network based on the DDPG algorithm, and use the RAdam algorithm to update the network parameters of the original model to obtain the dynamic path planning neural network optimization model.
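As a minimal sketch of how such a chronological state sequence might be maintained, the following keeps the $n$ most recent state vectors in a fixed-length buffer. The class name `StateWindow`, the flat concatenation of pose, speed, and obstacle distances into one vector, and the example dimensions are assumptions made for illustration; the text above does not prescribe a particular data layout.

```python
from collections import deque
from typing import Deque, List, Sequence


class StateWindow:
    """Keeps the n most recent state vectors in chronological order."""

    def __init__(self, n: int) -> None:
        if n <= 0:
            raise ValueError("n must be positive")
        # A bounded deque silently drops the oldest state once n is exceeded.
        self._buf: Deque[List[float]] = deque(maxlen=n)

    def push(self, pose: Sequence[float], speed: Sequence[float],
             obstacle_dists: Sequence[float]) -> None:
        # One state s_t: pose, speed, and obstacle distances, flattened (assumed layout).
        self._buf.append([*pose, *speed, *obstacle_dists])

    def sequence(self) -> List[List[float]]:
        # Oldest state first, newest state last.
        return list(self._buf)


window = StateWindow(n=3)
for t in range(5):
    # Hypothetical readings: pose (x, y, heading), speed (linear, angular), two distances.
    window.push(pose=(float(t), 0.0, 0.0), speed=(0.5, 0.0), obstacle_dists=(2.0, 3.0))
seq = window.sequence()
```

After five pushes with `n=3`, `seq` holds only the states from the last three time steps, which is the form of input a sequence model would consume.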
[0030] Specifically, a basic model of a dynamic path planning neural network based on the DDPG algorithm is constructed. This basic model includes an LSTM network, an Actor network, and a Critic network. The LSTM network and the Actor network are cascaded, with the output of the LSTM network serving as the input to the Actor network. The basic model of the dynamic path planning neural network based on the DDPG algorithm is a policy network structure with a cascaded LSTM network and Actor network, built upon the original Actor-Critic dual-network structure of the DDPG algorithm. The LSTM network is used to perform temporal modeling of the environmental state information acquired by the mobile robot at continuous time points. Its output features serve as the input to the Actor network, used to generate the optimal action corresponding to the current state, thereby achieving temporal dependency modeling of the robot's behavioral decisions.
[0031] Figure 2 is a network architecture diagram of the DDPG algorithm model after cascading an LSTM network in an embodiment of this application. The Actor network in the original DDPG algorithm includes a state input module, a temporal feature extraction module, and an action decision module. The state input module is used to receive environmental state information obtained by the mobile robot at multiple consecutive time points. In this application, the Actor network and the LSTM network are cascaded. The temporal feature extraction module receives the output of the LSTM network and passes it to the action decision module, which ultimately generates the action decision of the mobile robot in the current state.
[0032] The LSTM (Long Short-Term Memory) network is primarily used for temporal modeling of the mobile robot's state sequence across multiple consecutive time points. Through its forget gate, input gate, and output gate, the LSTM network selectively memorizes and updates environmental state information from consecutive time points. Its hidden state output integrates historical state information with currently perceived information and is input into the Actor network. In this way, implicit features reflecting the dynamic trends of environmental changes are extracted, enabling temporal modeling of the dynamic environment and stable decision-making. The state update process satisfies:

$$f_t = \sigma\left(W_f [h_{t-1}, s_t] + b_f\right)$$
$$i_t = \sigma\left(W_i [h_{t-1}, s_t] + b_i\right)$$
$$o_t = \sigma\left(W_o [h_{t-1}, s_t] + b_o\right)$$
$$\tilde{c}_t = \tanh\left(W_c [h_{t-1}, s_t] + b_c\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $s_t$ represents the state information of the mobile robot at time $t$; $h_t$ represents the hidden state output by the LSTM network at time $t$, used to characterize the temporal features after fusing historical state information with current state information; $c_t$ represents the memory state of the LSTM network at the current moment, used to store and update long-term feature information of the environment state; $h_{t-1}$ and $c_{t-1}$ are the hidden state and memory state at the previous moment; $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate of the LSTM network, respectively, used to control the retention of historical information, the writing of new information, and the output ratio of the current memory state; $W_f$, $W_i$, and $W_o$ represent the forget gate weight matrix, the input gate weight matrix, and the output gate weight matrix, respectively; $b_f$, $b_i$, and $b_o$ represent the forget gate bias, input gate bias, and output gate bias, respectively; $W_c$ and $b_c$ represent the weight matrix and bias vector used to generate the candidate memory $\tilde{c}_t$, respectively; $\sigma$ and $\tanh$ represent the Sigmoid function and the hyperbolic tangent function, respectively; and $\odot$ indicates element-wise multiplication.
[0033] Through the gating mechanism in the LSTM network, the LSTM network can perform temporal modeling of environmental state information at continuous time steps. Its output is used as the input of the Actor network to generate the action decision of the mobile robot in the current state.
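The gating mechanism can be made concrete with a small sketch of one step of a standard LSTM cell, written in plain Python so that each gate is visible. It assumes the common formulation in which each gate applies one weight matrix to the concatenation of the previous hidden state and the current state vector. The function names, the parameter dictionary keys, and the tiny dimensions are illustrative assumptions rather than details taken from this application.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Return W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(s_t: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM step: returns the new hidden state h_t and memory state c_t."""
    z = h_prev + s_t  # list concatenation [h_{t-1}, s_t]
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]    # input gate
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(p["W_c"], z, p["b_c"])]  # candidate memory
    # Keep part of the old memory and write part of the candidate memory.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Expose a gated view of the memory as the hidden state.
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Toy check with all-zero parameters: every gate equals 0.5 and the candidate is 0.
hidden, state_dim = 2, 3
zeros_W = [[0.0] * (hidden + state_dim) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params = {"W_f": zeros_W, "b_f": zeros_b, "W_i": zeros_W, "b_i": zeros_b,
          "W_o": zeros_W, "b_o": zeros_b, "W_c": zeros_W, "b_c": zeros_b}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero parameters the forget gate halves the previous memory, so `c_t` becomes `[0.5, -0.5]`. In a trained network the gates vary per element and per time step, which is what lets the hidden state track trends in the environment.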
[0034] The action decision module of the Actor network is a multi-layer fully connected neural network whose input is the hidden state output by the LSTM network at the current time step. The action output of the mobile robot in its current state is generated through nonlinear mapping. The executed action output by the action decision module can be represented as:

$$a_t = \mu(h_t \mid \theta^{\mu})$$

where $\mu$ represents the Actor policy function; $\theta^{\mu}$ represents the Actor network parameters; $h_t$ is the hidden state output by the LSTM network; and $a_t$ represents the output action, which includes the linear and angular velocity control values of the mobile robot.
[0035] In the cascaded structure, the LSTM network is located at the front end of the Actor network and provides the Actor network with state feature representations that contain historical information. This enhances the policy network's ability to perceive dynamic environments and generates a probability distribution of actions to be taken in a given state, thus enabling more accurate selection of the actions that the robot will actually perform.
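As a rough sketch of such an action decision module, the following maps an LSTM hidden state to a linear and an angular velocity through one hidden layer. The ReLU activation, the single hidden layer, the `tanh` output squashing, the velocity limits `v_max` and `w_max`, and the restriction of linear velocity to forward motion are all assumptions chosen for illustration. The text above only states that the module is a multi-layer fully connected network that outputs linear and angular velocity control values.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Fully connected layer without activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def actor_action(h_t: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector,
                 v_max: float = 0.5, w_max: float = 1.0) -> Tuple[float, float]:
    """Map the hidden state h_t to (linear velocity, angular velocity)."""
    hidden = [max(0.0, z) for z in dense(W1, h_t, b1)]  # ReLU (assumed)
    raw = dense(W2, hidden, b2)                         # two raw outputs
    linear = v_max * (math.tanh(raw[0]) + 1.0) / 2.0    # squashed into [0, v_max]
    angular = w_max * math.tanh(raw[1])                 # squashed into [-w_max, w_max]
    return linear, angular


# Hypothetical weights for a 2-dimensional hidden state and 3 hidden units.
h_t = [0.2, -0.1]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.5, 0.5, 0.5], [1.0, -1.0, 0.0]]
b2 = [0.0, 0.0]
linear, angular = actor_action(h_t, W1, b1, W2, b2)
```

Bounding the outputs keeps the commanded velocities within limits the robot can execute, whatever the raw network outputs are.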
[0036] The Actor network is updated based on the policy gradient method. The temporal difference error used in the update can be expressed as:

$$\delta_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

and the Actor network parameters are then updated by a policy gradient step (the update equation is not legible in the source text). Here $\delta_t$ represents the temporal difference error value of the state transition experience; $\gamma$ represents the discount factor, which is a constant and generally takes values in the range (0, 1); $s_t$ and $s_{t+1}$ represent the training data of the mobile robot at time $t$ and time $t+1$, respectively; $a_t$ and $a_{t+1}$ represent the executed actions obtained by inputting the training data at time $t$ and time $t+1$ into the dynamic path planning neural network optimization model, respectively; $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ represents the maximum Q value of the mobile robot at time $t+1$; $Q(s_t, a_t)$ represents the Q value of the mobile robot at time $t$; $r_t$ represents the return value obtained when the mobile robot performs action $a_t$ and transfers from state $s_t$ to state $s_{t+1}$, that is, the immediate reward the mobile robot receives after performing the action; $\theta_t$ represents the network parameters that determine the policy at time $t$ and are used to approximate the policy function; and $\alpha$ represents the learning rate.
[0037] The Critic network in the original DDPG algorithm consists of two neural networks with identical structure: a Q-target network and a Q-estimation network. The input layer of the Critic network takes the robot's state information and action as input, and is followed by three fully connected hidden layers. The Critic network updates its parameters using gradient descent, and its loss function is defined as the squared temporal difference error:

$$L = \delta_t^{2}$$

where $L$ represents the loss function of the Critic network and $\delta_t$ represents the temporal difference error value of the state transition experience.
[0038] The update method for the Critic network parameters can be expressed as follows:

$$\theta^{Q} \leftarrow \theta^{Q} - \alpha_{Q}\, \nabla_{\theta^{Q}} L$$

where $\theta^{Q}$ represents the parameters of the Critic network; $\alpha_{Q}$ represents the learning rate of the Critic network; and $L$ represents the loss function of the Critic network.
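A single gradient descent step of this kind can be sketched on a deliberately simplified Critic. The sketch assumes a linear Q function over a feature vector built from state and action, a loss equal to the squared temporal difference error, and a target value that is held fixed during the step. These are simplifying assumptions for illustration: the Critic described above has three fully connected hidden layers, and the names `critic_step`, `phi`, and `target` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def critic_step(w: Vector, phi: Vector, target: float, lr: float) -> Tuple[Vector, float]:
    """One gradient descent step on L = delta^2 for a linear Q(s, a) = w . phi."""
    delta = target - dot(w, phi)               # temporal difference error
    grad = [-2.0 * delta * x for x in phi]     # dL/dw with the target held fixed
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return new_w, delta


# Hypothetical features for one (state, action) pair and a fixed target value.
w0 = [0.0, 0.0, 0.0]
phi = [1.0, 0.5, 2.0]
target = 1.0  # stands in for r + gamma * Q_target(s', a')
w1, delta_before = critic_step(w0, phi, target, lr=0.05)
_, delta_after = critic_step(w1, phi, target, lr=0.05)
```

After one step the error on this sample shrinks from 1.0 to 0.475, which is the behavior gradient descent on a squared error is expected to show for a sufficiently small learning rate.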
[0039] By cascading an LSTM network at the front end of the Actor network based on the original DDPG algorithm, time-series features in the environment can be extracted, thereby enhancing the stability of the policy network.
[0040] In addition, in order to improve the convergence speed of the network, accelerate the training speed, and further increase the stability of the network, the RAdam optimization algorithm is introduced to update the parameters of the model network during the network parameter update process to obtain an optimized model.
[0041] S30: Acquire training data of the mobile robot in a preset training environment, and train the dynamic path planning neural network optimization model based on the training data to obtain the trained dynamic path planning neural network model.
[0042] Specifically, a dynamic path planning training environment for the mobile robot is pre-built, and a dynamic obstacle model is set up within this environment. The mobile robot is placed in the training environment, and its initial and target positions are randomly generated. Training data is acquired using sensors on the robot within the training environment. This training data can be understood as the robot's state information obtained in the training environment, including its posture, speed, and distances to obstacles in its environment. Then, the dynamic path planning neural network optimization model is trained using the training data obtained from each robot state iteration, ultimately resulting in a fully trained dynamic path planning neural network model.
[0043] S40: Input the state information into the trained dynamic path planning neural network model, and output the optimal action for the mobile robot in its environment.
[0044] Specifically, the state information acquired by the mobile robot in a real-world application scenario is input into the trained dynamic path planning neural network model, which outputs the optimal action of the mobile robot in its environment. The optimal action output includes the linear velocity and angular velocity control values of the mobile robot, thereby achieving optimal dynamic path planning.
[0045] In this application, an LSTM (Long Short-Term Memory) network is introduced into the traditional Actor-Critic dual-network structure of the DDPG algorithm, and a policy network structure in which the LSTM network is cascaded with the Actor network is designed to enhance the policy network's ability to model the temporal characteristics of the environmental state. At the same time, the RAdam (Rectified Adam) optimization algorithm is introduced during network parameter updates to improve the stability and convergence speed of the network training process. Through this optimized network design, this application can output the optimal executable action for the mobile robot in the current state within a complex dynamic environment, guiding the robot to the target location safely and quickly and thus improving the success rate and robustness of path planning.
[0046] In one embodiment, updating the network parameters of the original model of the dynamic path planning neural network using the RAdam algorithm can be performed as follows. First, calculate the loss functions for the Actor network and the Critic network respectively, and then calculate the gradients for the Actor network and the Critic network based on the loss functions. The gradient calculation can be expressed as:

$$g_k = \nabla_{\theta} J(\theta_k)$$

where $J$ represents the loss function of the Actor network or the Critic network; $\theta_k$ represents the network parameters of the Actor network or the Critic network; $k$ represents the number of iterations; $\nabla_{\theta}$ represents the gradient operator with respect to the network parameters $\theta$, used to calculate the partial derivatives of the loss function with respect to the parameters; and $g_k$ represents the gradient of the Actor network or the Critic network.
[0047] After the gradients are calculated, the bias-corrected first-order and second-order moment estimates for the Actor and Critic networks are calculated based on their respective gradients. The calculation can be expressed as follows:

$$m_k = \beta_1 m_{k-1} + (1 - \beta_1)\, g_k, \qquad \hat{m}_k = \frac{m_k}{1 - \beta_1^{k}},$$

$$v_k = \beta_2 v_{k-1} + (1 - \beta_2)\, g_k^{2}, \qquad \hat{v}_k = \frac{v_k}{1 - \beta_2^{k}},$$

where $m_k$ represents the first-order moment estimate of the gradient after $k$ iterations; $m_{k-1}$ represents the first-order moment estimate of the gradient after $k-1$ iterations; $\beta_1$ represents the first-order moment decay coefficient; $g_k$ represents the gradient corresponding to the Actor network or the Critic network; $\hat{m}_k$ represents the bias-corrected first-order moment estimate after $k$ iterations; $v_k$ represents the second-order moment estimate of the gradient after $k$ iterations; $v_{k-1}$ represents the second-order moment estimate of the gradient after $k-1$ iterations; $g_k^{2}$ represents the element-wise square of the gradient; $\beta_2$ represents the second-order moment decay coefficient; and $\hat{v}_k$ represents the bias-corrected second-order moment estimate after $k$ iterations.
[0048] Next, based on the bias-corrected second-order moment estimates for the Actor and Critic networks, the adaptive learning rate correction factor $\lambda_k$ is calculated for each network (the defining equation is not legible in the source text). The adaptive learning rate correction factor can be used to measure the reliability of the current gradient estimate. By introducing this factor, the effective learning rate gradually increases with the number of iterations in the early stages of training, achieving a warmup effect. This avoids the unstable parameter updates caused by an excessively large learning rate early in training. In deep learning optimization, warmup refers to the strategy of increasing the learning rate gradually (for example, linearly or in steps) in the initial stages of training, rather than using a preset large base learning rate from the beginning, thereby improving training stability.
[0049] Finally, the network parameters of the Actor network and the Critic network are updated respectively, according to the corresponding adaptive learning rate correction factor and bias-corrected first-order moment estimate. The network parameter update can be expressed as follows:

$$\theta_{k+1} = \theta_k - \alpha\, \lambda_k\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}$$

where $\theta_{k+1}$ represents the network parameters at the $(k+1)$-th iteration; $\theta_k$ represents the network parameters at the $k$-th iteration; $\alpha$ represents the base learning rate of the RAdam optimizer; and $\epsilon$ represents the smoothing term, which is a constant used to prevent the denominator from being zero.
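The four steps above can be sketched as a small optimizer in plain Python. The sketch assumes the rectification term of the published RAdam algorithm, which depends on the iteration count and the second-order decay coefficient, and which falls back to an unscaled momentum step while the variance estimate is still unreliable (threshold 4, as in the published algorithm). Whether this application uses exactly that term is an assumption, as are the class name `RAdam`, the default hyperparameters, and the toy quadratic objective.

```python
import math
from typing import List, Optional


class RAdam:
    """Minimal RAdam sketch for a flat list of parameters (illustrative only)."""

    def __init__(self, n_params: int, lr: float = 1e-3, beta1: float = 0.9,
                 beta2: float = 0.999, eps: float = 1e-8) -> None:
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params   # first-order moment estimates
        self.v = [0.0] * n_params   # second-order moment estimates
        self.k = 0                  # iteration counter
        self.rho_inf = 2.0 / (1.0 - beta2) - 1.0

    def rectification(self, k: int) -> Optional[float]:
        """Correction factor at iteration k, or None while it is undefined."""
        b2k = self.beta2 ** k
        rho_k = self.rho_inf - 2.0 * k * b2k / (1.0 - b2k)
        if rho_k <= 4.0:
            return None
        num = (rho_k - 4.0) * (rho_k - 2.0) * self.rho_inf
        den = (self.rho_inf - 4.0) * (self.rho_inf - 2.0) * rho_k
        return math.sqrt(num / den)

    def step(self, params: List[float], grads: List[float]) -> List[float]:
        self.k += 1
        rect = self.rectification(self.k)
        updated = []
        for j, (p, g) in enumerate(zip(params, grads)):
            self.m[j] = self.beta1 * self.m[j] + (1.0 - self.beta1) * g
            self.v[j] = self.beta2 * self.v[j] + (1.0 - self.beta2) * g * g
            m_hat = self.m[j] / (1.0 - self.beta1 ** self.k)
            if rect is None:
                # Early iterations: momentum step without adaptive scaling.
                updated.append(p - self.lr * m_hat)
            else:
                v_hat = self.v[j] / (1.0 - self.beta2 ** self.k)
                updated.append(p - self.lr * rect * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


# Separate optimizer instances, mirroring independent Actor and Critic updates.
actor_opt, critic_opt = RAdam(n_params=1, lr=0.05), RAdam(n_params=1, lr=0.05)
actor_params, critic_params = [1.0], [-2.0]
for _ in range(300):
    # Toy objective f(x) = x^2 with gradient 2x stands in for the real losses.
    actor_params = actor_opt.step(actor_params, [2.0 * actor_params[0]])
    critic_params = critic_opt.step(critic_params, [2.0 * critic_params[0]])
```

With `beta2 = 0.999` the factor is undefined for the first four iterations and then rises slowly from roughly 0.02 toward 1, which is the warmup behavior described above.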
[0050] It should be noted that the RAdam algorithm introduced in this application updates the parameters of the Actor network and the Critic network separately. The parameter update processes for the Actor and Critic networks are executed independently. In the original DDPG algorithm, the parameter update direction is determined based on the policy gradient method. The introduced RAdam optimization algorithm acts as a parameter optimizer to improve the gradient update process. Specifically, the gradient information of the network parameters is first calculated using the DDPG algorithm. Then, the RAdam optimization algorithm is used to estimate the first and second moments of the gradient, and the learning rate is adaptively adjusted through variance correction, thereby achieving stable updates of the network parameters. Therefore, the RAdam optimization algorithm does not change the basic update rules of the DDPG algorithm, but rather optimizes the parameter update process based on it.
[0051] In this application, by introducing the RAdam algorithm for optimization, the random fluctuations in parameter update amplitude can be effectively reduced in the early stage of neural network training, and the stability of gradient update can be improved. As the number of training iterations increases, the adaptive learning rate gradually transitions to a stable update stage, thereby ensuring training stability while accelerating the convergence speed of the neural network, and thus improving the learning efficiency and policy performance of the improved DDPG algorithm in dynamic path planning tasks.
[0052] In one embodiment, training the dynamic path planning neural network optimization model based on training data can be performed as follows. First, training data is input into the dynamic path planning neural network optimization model, and the model outputs the state transition experience corresponding to the current moment. Specifically, the mobile robot continuously acquires training data containing its state information in the simulated training environment; this training data includes the robot's pose, its velocity, and information on obstacles in its environment. Each time training data is acquired, it is input into the dynamic path planning neural network optimization model whose parameters have been updated, and the corresponding execution action result, i.e., the state transition experience, is output. Each piece of training data corresponds to one state transition experience. Then, the state transition experience corresponding to the current moment is stored in the experience replay pool built into the dynamic path planning neural network optimization model. With the experience replay pool, different priorities can be assigned based on the importance of experience samples during model training, and high-value samples are preferentially extracted from the pool for training, thereby improving sample utilization efficiency and accelerating neural network convergence.
[0053] Next, the temporal difference error value of each state transition experience in the experience replay pool is calculated, and the calculated temporal difference error value is set as the learning value corresponding to that state transition experience. The calculation can be expressed as follows: $\delta_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$, where $\delta_t$ denotes the temporal difference error value of the state transition experience; $\gamma$ denotes the discount factor, which is a constant; $s_t$ and $s_{t+1}$ denote the training data of the mobile robot at time t and time t+1, respectively; $a_t$ and $a_{t+1}$ denote the execution actions obtained by inputting the training data at time t and time t+1, respectively, into the dynamic path planning neural network optimization model; $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ denotes the maximum Q value of the mobile robot at time t+1; $Q(s_t, a_t)$ denotes the Q value of the mobile robot at time t; and $r_t$ denotes the return value obtained when the mobile robot performs action $a_t$ and transfers from state $s_t$ to state $s_{t+1}$.
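As a small numerical illustration of the temporal difference error described above, the following sketch computes it from scalar inputs. The function name `td_error`, its argument names, and the `done` flag for terminal transitions are assumptions introduced for illustration; in a DDPG-style setup the maximum Q value at time t+1 would typically be approximated by a target Critic evaluated at the target Actor's action, which is not modeled here.

```python
def td_error(reward, q_current, q_next_max, gamma=0.99, done=False):
    """Temporal difference error: (reward + gamma * max Q at t+1) - Q at t.

    `done` (hypothetical) marks a terminal transition, for which no
    bootstrapped future value is added to the target.
    """
    target = reward if done else reward + gamma * q_next_max
    return target - q_current


# Example: reward 1.0, Q(s_t, a_t) = 2.0, max Q at t+1 = 3.0, gamma = 0.9
delta = td_error(1.0, 2.0, 3.0, gamma=0.9)
```

A large absolute value of `delta` indicates that the Critic's estimate for this transition is far from its target, which is why it can serve as the learning value of the experience.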
[0054] It should be noted that, in the Actor network, the temporal difference error value is calculated for the purpose of updating the parameters of the Actor network, whereas here it is calculated for the purpose of determining the learning value of each state transition experience from the magnitude of the temporal difference error value. The same parameter can thus serve different purposes in different scenarios.
[0055] Subsequently, upon receiving the learning value corresponding to the state transition experience at a new time step, the priority of each state transition experience is determined based on the learning value, and model training samples are extracted according to the priority of all state transition experiences; the higher the priority of a state transition experience, the higher the probability of it being extracted as a model training sample. Specifically, during the actual execution of the model, extraction is not simply based on a threshold value, but rather on probability sampling based on a priority ratio. This sampling method can be implemented using the existing SumTree data structure. Threshold-based extraction may result in low-priority samples not being selected; probability sampling avoids this situation and improves the diversity of training samples, thereby improving the accuracy of model training. The extracted model training samples are then input into the dynamic path planning neural network optimization model for model training.
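The priority-proportional sampling mentioned above can be sketched with a SumTree, in which each leaf holds one experience's priority and each internal node holds the sum of its children. This is a minimal illustration and not the application's implementation; the class and method names (`SumTree`, `add`, `update`, `get`, `sample`) are hypothetical.

```python
import random


class SumTree:
    """Binary tree whose leaves store priorities and whose root stores their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # internal nodes followed by leaves
        self.data = [None] * capacity            # stored experiences
        self.write = 0                           # next slot to overwrite

    def total(self):
        return self.tree[0]

    def add(self, priority, item):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = item
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # overwrite oldest when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                         # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Return (leaf index, priority, item) for a value in [0, total]."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # descend until a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, batch_size, rng=random):
        """Draw one experience from each of `batch_size` equal priority segments."""
        segment = self.total() / batch_size
        return [self.get(rng.uniform(segment * i, segment * (i + 1)))
                for i in range(batch_size)]
```

Because every experience with non-zero priority owns a slice of the interval `[0, total]`, low-priority experiences can still be drawn, only less often, which is the diversity property that threshold-based extraction lacks.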
[0056] It should be noted that after acquiring new training data and calculating new state transition experiences each time, the new experiences are stored in the experience replay pool. Each time the data is updated, the model can update the priority and probability of each state transition experience in the experience pool based on the new experiences, thereby improving the utilization efficiency of high-value experience samples and accelerating the learning process of the policy network.
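One common way to turn learning values into priorities and sampling probabilities is shown below as a hedged sketch. The exponent `alpha` and the small constant `eps` come from the widely used proportional prioritized experience replay formulation and are assumptions here; the text above does not specify them, and the function names are hypothetical.

```python
def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """Map temporal difference errors to priorities.

    `eps` keeps every priority strictly positive so that no experience
    becomes impossible to sample; `alpha` controls how strongly the
    priorities skew the sampling (alpha = 0 gives uniform sampling).
    """
    return [(abs(delta) + eps) ** alpha for delta in td_errors]


def sampling_probabilities(priorities):
    """Normalize priorities into probabilities that sum to one."""
    total = sum(priorities)
    return [p / total for p in priorities]


# Example: three stored experiences with different temporal difference errors.
probs = sampling_probabilities(priorities_from_td_errors([0.0, 0.5, -2.0]))
```

Whenever a new experience is stored or an existing one is re-evaluated, recomputing its priority and renormalizing is enough to refresh the probabilities of the whole pool.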
[0057] Finally, when the dynamic path planning neural network optimization model reaches the preset training termination condition, the model training is completed, and the trained dynamic path planning neural network model is finally generated.
[0058] In this application, a priority experience replay mechanism is introduced during the training phase. Different sampling priorities are assigned based on the importance of experience samples, thereby improving the utilization efficiency of high-value experience samples and accelerating the learning process of the policy network. After training, the output of the neural network model represents the optimal executable action for the mobile robot in the current environmental state, guiding the robot to reach the target location safely and quickly, thus improving the success rate and accuracy of path planning.
[0059] Figure 1 is a flowchart illustrating a dynamic path planning method for a mobile robot in one embodiment. It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order; unless explicitly stated otherwise, there is no strict order requirement for the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and they are not necessarily executed sequentially, but can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0060] Based on the above method, this application also discloses a dynamic path planning device for a mobile robot.
[0061] Referring to Figure 3, the device includes the following modules: an information acquisition module 301, used to acquire the state information of the mobile robot, including the pose information, the speed information, and the obstacle distance information of the environment in which the mobile robot is located; a model building module 302, used to build the original model of the dynamic path planning neural network based on the DDPG algorithm, and to update the network parameters of the original model using the RAdam algorithm to obtain the dynamic path planning neural network optimization model; a model training module 303, used to acquire training data of the mobile robot in a preset training environment, and to train the dynamic path planning neural network optimization model based on the training data to obtain the trained dynamic path planning neural network model; and a path planning module 304, used to input the state information into the trained dynamic path planning neural network model and output the optimal action of the mobile robot in its environment.
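The data flow between the information acquisition module and the path planning module can be sketched as follows. This is only an illustration of the interfaces; the class `RobotState`, the helper `to_feature_vector`, the class `PathPlanningModule`, and the concrete field layouts are hypothetical, and the `policy` callable stands in for the trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class RobotState:
    """State information gathered by the information acquisition module."""
    pose: Tuple[float, ...]                # assumed layout: x, y, heading
    velocity: Tuple[float, ...]            # assumed layout: linear, angular
    obstacle_distances: Tuple[float, ...]  # assumed: one distance per range reading


def to_feature_vector(state: RobotState) -> List[float]:
    """Flatten the state information into one model input vector."""
    return [*state.pose, *state.velocity, *state.obstacle_distances]


class PathPlanningModule:
    """Feeds state information to a trained policy and returns its action."""

    def __init__(self, policy: Callable[[Sequence[float]], Sequence[float]]):
        self.policy = policy

    def plan(self, state: RobotState) -> List[float]:
        return list(self.policy(to_feature_vector(state)))
```

Keeping the policy behind a plain callable means the same planning module can wrap the model before and after training without changes.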
[0062] In one embodiment, the original model of the dynamic path planning neural network in the model building module 302 includes an LSTM network, an Actor network, and a Critic network; wherein the LSTM network and the Actor network are cascaded, and the output of the LSTM network is the input of the Actor network.
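The cascade of the LSTM network and the Actor network can be sketched in plain Python as below. The weights are random and untrained, the layer sizes are arbitrary, and the assumption that the LSTM consumes a short sequence of observations before the Actor acts is made only for illustration; `TinyLSTMCell`, `TinyActor`, and `select_action` are hypothetical names.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _linear(weights, biases, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def _random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMCell:
    """Single LSTM cell with randomly initialized, untrained weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, candidate, output.
        self.weights = {g: _random_matrix(hidden_size, width, rng) for g in "ifgo"}
        self.biases = {g: [0.0] * hidden_size for g in "ifgo"}

    def step(self, x, h, c):
        z = list(x) + list(h)
        i = [_sigmoid(v) for v in _linear(self.weights["i"], self.biases["i"], z)]
        f = [_sigmoid(v) for v in _linear(self.weights["f"], self.biases["f"], z)]
        g = [math.tanh(v) for v in _linear(self.weights["g"], self.biases["g"], z)]
        o = [_sigmoid(v) for v in _linear(self.weights["o"], self.biases["o"], z)]
        c_next = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_next = [ov * math.tanh(cv) for ov, cv in zip(o, c_next)]
        return h_next, c_next


class TinyActor:
    """Maps the LSTM output to an action bounded in [-1, 1] per dimension."""

    def __init__(self, hidden_size, action_size, seed=1):
        rng = random.Random(seed)
        self.weights = _random_matrix(action_size, hidden_size, rng)
        self.biases = [0.0] * action_size

    def act(self, h):
        return [math.tanh(v) for v in _linear(self.weights, self.biases, h)]


def select_action(lstm, actor, observation_sequence):
    """Run the LSTM over the observations, then feed its output to the Actor."""
    h = [0.0] * lstm.hidden_size
    c = [0.0] * lstm.hidden_size
    for x in observation_sequence:
        h, c = lstm.step(x, h, c)
    return actor.act(h)
```

The only coupling between the two networks is the hidden vector `h`, which matches the statement that the output of the LSTM network is the input of the Actor network.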
[0063] In one embodiment, the model building module 302 is specifically used to calculate the loss functions of the Actor network and the Critic network respectively, and calculate the gradients of the Actor network and the Critic network respectively based on the loss functions; calculate the first-order moment estimate and the second-order moment estimate of the bias correction for the Actor network and the Critic network respectively based on the gradients of the Actor network and the Critic network respectively; calculate the adaptive learning rate correction factor for the Actor network and the Critic network respectively based on the second-order moment estimate of the bias correction for the Actor network and the Critic network respectively; and update the network parameters of the Actor network and the Critic network respectively based on the adaptive learning rate correction factor and the first-order moment estimate of the bias correction for the Actor network and the Critic network respectively.
[0064] In one embodiment, in the model building module 302, calculating the gradients corresponding to the Actor network and the Critic network according to the loss function includes: $g_k = \nabla_{\theta} L(\theta_k)$, where $L$ denotes the loss function of the Actor network or the Critic network; $\theta$ denotes the network parameters of the Actor network or the Critic network; $k$ denotes the number of iterations; $\nabla_{\theta}$ denotes the gradient operator with respect to the network parameters $\theta$; and $g_k$ denotes the gradient of the Actor network or the Critic network. Calculating the bias-corrected first-order moment estimate and second-order moment estimate for the Actor network and the Critic network, respectively, based on the corresponding gradients includes: $m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k$, $\hat{m}_k = m_k / (1-\beta_1^k)$, $v_k = \beta_2 v_{k-1} + (1-\beta_2) g_k^2$, $\hat{v}_k = v_k / (1-\beta_2^k)$, where $m_k$ denotes the first moment estimate of the gradient after k iterations; $m_{k-1}$ denotes the first moment estimate of the gradient after k-1 iterations; $\beta_1$ denotes the first-order moment attenuation coefficient; $g_k$ denotes the gradient corresponding to the Actor network or the Critic network; $\hat{m}_k$ denotes the bias-corrected first-order moment estimate after k iterations; $v_k$ denotes the second moment estimate of the gradient after k iterations; $v_{k-1}$ denotes the second moment estimate of the gradient after k-1 iterations; $g_k^2$ denotes the element-wise square of the gradient; $\beta_2$ denotes the second-order moment attenuation coefficient; and $\hat{v}_k$ denotes the bias-corrected second-order moment estimate after k iterations. Calculating the adaptive learning rate correction factors for the Actor network and the Critic network, respectively, based on their bias-corrected second-order moment estimates gives the adaptive learning rate correction factor, denoted $r_k$. Updating the network parameters of the Actor network and the Critic network, respectively, based on the adaptive learning rate correction factor and the bias-corrected first moment estimate takes the network parameters $\theta_k$ at the k-th iteration to the network parameters $\theta_{k+1}$ at the (k+1)-th iteration, using the base learning rate $\alpha$ of the RAdam optimizer and the smoothing term $\epsilon$, which is a constant.
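As a worked illustration of what the bias correction does, assume (as is usual for Adam-style optimizers, though the initialization is not stated here) that both moment estimates start at zero, and write $m_k$, $v_k$ for the first and second moment estimates, $\hat{m}_k$, $\hat{v}_k$ for their bias-corrected versions, $g_k$ for the gradient, and $\beta_1$, $\beta_2$ for the attenuation coefficients. With the usual exponential-moving-average forms, the first iteration gives:

```latex
\begin{aligned}
m_1 &= \beta_1 m_0 + (1-\beta_1)\, g_1 = (1-\beta_1)\, g_1,
  & \hat{m}_1 &= \frac{m_1}{1-\beta_1^{1}} = g_1, \\
v_1 &= \beta_2 v_0 + (1-\beta_2)\, g_1^{2} = (1-\beta_2)\, g_1^{2},
  & \hat{v}_1 &= \frac{v_1}{1-\beta_2^{1}} = g_1^{2}.
\end{aligned}
```

Without the correction, a choice such as $\beta_1 = 0.9$ would leave the raw estimate at $m_1 = 0.1\, g_1$, strongly biased toward the zero initialization; dividing by $1-\beta_1^k$ removes that bias, and the divisor tends to 1 as $k$ grows.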
[0065] In one embodiment, the model training module 303 is specifically used to input training data into the dynamic path planning neural network optimization model and output the state transition experience corresponding to the current time step; store the state transition experience corresponding to the current time step into the experience replay pool built into the dynamic path planning neural network optimization model; calculate the temporal difference error value of each state transition experience in the experience replay pool, and set the calculated temporal difference error value as the learning value corresponding to each state transition experience; when receiving the learning value corresponding to the state transition experience at a new time step, determine the priority of each state transition experience according to the learning value, and extract model training samples according to the priority of all state transition experiences; input the extracted model training samples into the dynamic path planning neural network optimization model for model training; when the dynamic path planning neural network optimization model reaches the preset training termination condition, the model training is completed, and a trained dynamic path planning neural network model is generated.
[0066] In one embodiment, the model training module 303 calculates the temporal difference error value for each state transition experience in the experience replay pool, including: $\delta_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$, where $\delta_t$ denotes the temporal difference error value of the state transition experience; $\gamma$ denotes the discount factor, which is a constant; $s_t$ and $s_{t+1}$ denote the training data of the mobile robot at time t and time t+1, respectively; $a_t$ and $a_{t+1}$ denote the execution actions obtained by inputting the training data at time t and time t+1, respectively, into the dynamic path planning neural network optimization model; $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ denotes the maximum Q value of the mobile robot at time t+1; $Q(s_t, a_t)$ denotes the Q value of the mobile robot at time t; and $r_t$ denotes the return value obtained when the mobile robot performs action $a_t$ and transfers from state $s_t$ to state $s_{t+1}$.
[0067] In one embodiment, in the model training module 303, the higher the priority of the state transition experience, the higher the probability of it being extracted as a model training sample.
[0068] The mobile robot dynamic path planning device provided in this application embodiment can be applied to the mobile robot dynamic path planning method provided in the above embodiments. For relevant details, please refer to the above method embodiments. The implementation principle and technical effect are similar, and will not be repeated here.
[0069] It should be noted that the mobile robot dynamic path planning device provided in this embodiment is only illustrated by the above-described division of functional modules / units when performing mobile robot dynamic path planning. In practical applications, the above functions can be assigned to different functional modules / units as needed, that is, the internal structure of the mobile robot dynamic path planning device can be divided into different functional modules / units to complete all or part of the functions described above. Furthermore, the implementation method of the mobile robot dynamic path planning method provided in the above method embodiments and the implementation method of the mobile robot dynamic path planning device provided in this embodiment belong to the same concept. The specific implementation process of the mobile robot dynamic path planning device provided in this embodiment is detailed in the above method embodiments and will not be repeated here.
[0070] This application also discloses a computer device.
[0071] Specifically, as shown in Figure 4, the computer device can be a desktop computer, laptop computer, handheld computer, or cloud server, etc. The computer device may include, but is not limited to, a processor and memory. The processor and memory can be connected via a bus or other means. The processor can be a Central Processing Unit (CPU). The processor can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a graphics processing unit (GPU), an embedded neural network processing unit (NPU) or other dedicated deep learning coprocessor, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above types of chips.
[0072] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the methods in the above embodiments of this application. The processor executes various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include a program storage area and a data storage area, wherein the program storage area may store the operating system and at least one application program required for a function; the data storage area may store data created by the processor, etc. Furthermore, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0073] This application also discloses a computer-readable storage medium.
[0074] Specifically, the computer-readable storage medium is used to store a computer program, which, when executed by a processor, implements the methods described in the above-described method embodiments. Those skilled in the art will understand that implementing all or part of the processes in the methods described in the above-described embodiments of this application can be accomplished by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments described above. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk drive (HDD), or solid-state drive (SSD), etc.; the storage medium can also include combinations of the above types of memory.
[0075] This specific embodiment is merely an explanation of the present invention and is not intended to limit the invention. After reading this specification, those skilled in the art can make modifications to this embodiment without contributing any inventive step, but such modifications are protected by patent law as long as they are within the scope of the claims of the present invention.
Claims
1. A dynamic path planning method for a mobile robot, characterized in that: The method includes: The state information of the mobile robot is obtained, including the pose information, speed information, and distance information of obstacles in the environment in which the mobile robot is located. A dynamic path planning neural network original model based on the DDPG algorithm is constructed, and the network parameters of the original dynamic path planning neural network model are updated using the RAdam algorithm to obtain an optimized dynamic path planning neural network model. The training data of the mobile robot in a preset training environment is obtained, and the dynamic path planning neural network optimization model is trained based on the training data to obtain a trained dynamic path planning neural network model. The state information is input into the trained dynamic path planning neural network model, and the optimal action of the mobile robot in its environment is output.
2. The method according to claim 1, characterized in that: The original model of the dynamic path planning neural network includes an LSTM network, an Actor network, and a Critic network; wherein the LSTM network is cascaded with the Actor network, and the output of the LSTM network is the input of the Actor network.
3. The method according to claim 2, characterized in that: The step of updating the network parameters of the original dynamic path planning neural network model using the RAdam algorithm includes: Calculate the loss functions of the Actor network and the Critic network respectively, and calculate the gradients of the Actor network and the Critic network respectively based on the loss functions; Calculate the first-order moment estimate and the second-order moment estimate of the bias correction for the Actor network and the Critic network respectively, based on the gradients corresponding to the Actor network and the Critic network. Calculate the adaptive learning rate correction factor for the Actor network and the Critic network respectively based on the second-order moment estimate of the bias correction for the Actor network and the Critic network; The network parameters of the Actor network and the Critic network are updated according to the adaptive learning rate correction factor corresponding to the Actor network and the Critic network and the first moment estimate of the bias correction, respectively.
4. The method according to claim 3, characterized in that: The step of calculating the gradients corresponding to the Actor network and the Critic network according to the loss function includes: $g_k = \nabla_{\theta} L(\theta_k)$, where $L$ denotes the loss function of the Actor network or the Critic network; $\theta$ denotes the network parameters of the Actor network or the Critic network; $k$ denotes the number of iterations; $\nabla_{\theta}$ denotes the gradient operator with respect to the network parameters $\theta$; and $g_k$ denotes the gradient of the Actor network or the Critic network; The step of calculating the bias-corrected first-order moment estimate and second-order moment estimate for the Actor network and the Critic network, respectively, based on the gradients corresponding to the Actor network and the Critic network includes: $m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k$, $\hat{m}_k = m_k / (1-\beta_1^k)$, $v_k = \beta_2 v_{k-1} + (1-\beta_2) g_k^2$, $\hat{v}_k = v_k / (1-\beta_2^k)$, where $m_k$ denotes the first moment estimate of the gradient after k iterations; $m_{k-1}$ denotes the first moment estimate of the gradient after k-1 iterations; $\beta_1$ denotes the first-order moment attenuation coefficient; $g_k$ denotes the gradient corresponding to the Actor network or the Critic network; $\hat{m}_k$ denotes the bias-corrected first-order moment estimate after k iterations; $v_k$ denotes the second moment estimate of the gradient after k iterations; $v_{k-1}$ denotes the second moment estimate of the gradient after k-1 iterations; $g_k^2$ denotes the element-wise square of the gradient; $\beta_2$ denotes the second-order moment attenuation coefficient; and $\hat{v}_k$ denotes the bias-corrected second-order moment estimate after k iterations; The step of calculating the adaptive learning rate correction factors for the Actor network and the Critic network, respectively, based on their bias-corrected second-order moment estimates gives the adaptive learning rate correction factor, denoted $r_k$; The step of updating the network parameters of the Actor network and the Critic network, respectively, according to the corresponding adaptive learning rate correction factor and bias-corrected first moment estimate takes the network parameters $\theta_k$ at the k-th iteration to the network parameters $\theta_{k+1}$ at the (k+1)-th iteration, using the base learning rate $\alpha$ of the RAdam optimizer and the smoothing term $\epsilon$, which is a constant.
5. The method according to claim 1, characterized in that: The step of training the dynamic path planning neural network optimization model based on the training data includes: The training data is input into the dynamic path planning neural network optimization model, and the state transition experience corresponding to the current moment is output. Store the state transition experience corresponding to the current moment into the experience replay pool built into the dynamic path planning neural network optimization model; Calculate the temporal difference error value of each state transition experience in the experience replay pool, and set the calculated temporal difference error value as the learning value corresponding to each state transition experience; Upon receiving the learning value corresponding to the state transition experience at a new moment, the priority of each state transition experience is determined based on the learning value, and model training samples are extracted based on the priority of all state transition experiences. The extracted model training samples are input into the dynamic path planning neural network optimization model for model training; When the dynamic path planning neural network optimization model reaches the preset training termination condition, the model training is completed, and a trained dynamic path planning neural network model is generated.
6. The method according to claim 5, characterized in that: The calculation of the temporal difference error value for each state transition experience in the experience replay pool includes: $\delta_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$, where $\delta_t$ denotes the temporal difference error value of the state transition experience; $\gamma$ denotes the discount factor, which is a constant; $s_t$ and $s_{t+1}$ denote the training data of the mobile robot at time t and time t+1, respectively; $a_t$ and $a_{t+1}$ denote the execution actions obtained by inputting the training data at time t and time t+1, respectively, into the dynamic path planning neural network optimization model; $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ denotes the maximum Q value of the mobile robot at time t+1; $Q(s_t, a_t)$ denotes the Q value of the mobile robot at time t; and $r_t$ denotes the return value obtained when the mobile robot performs action $a_t$ and transfers from state $s_t$ to state $s_{t+1}$.
7. The method according to claim 5, characterized in that: The higher the priority of the state transition experience, the higher the probability that it will be extracted as a model training sample.
8. A dynamic path planning device for a mobile robot, characterized in that: The device includes: The information acquisition module (301) is used to acquire the state information of the mobile robot, including the pose information, speed information and obstacle distance information in the environment where the mobile robot is located. The model building module (302) is used to build the original model of the dynamic path planning neural network based on the DDPG algorithm, and to update the network parameters of the original model of the dynamic path planning neural network using the RAdam algorithm to obtain the optimized model of the dynamic path planning neural network. The model training module (303) is used to acquire the training data of the mobile robot in a preset training environment, and to train the dynamic path planning neural network optimization model according to the training data to obtain the trained dynamic path planning neural network model. The path planning module (304) is used to input the state information into the trained dynamic path planning neural network model and output the optimal action of the mobile robot in the environment.
9. A computer device, characterized in that it includes a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program that can be loaded by a processor to execute the method according to any one of claims 1 to 7.