Improved multi-AGV path planning method based on the MATD3 algorithm

By improving the MATD3 algorithm, constructing a state space and action space, and introducing a heuristic soft action probability and reward network, the problems of high computational complexity and long training time in multi-AGV path planning are solved, and efficient path planning and obstacle avoidance capabilities are achieved.

CN122237618A · Pending · Publication Date: 2026-06-19 · SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI
Filing Date: 2024-12-18
Publication Date: 2026-06-19

Application Information

Patent Timeline

18 Dec 2024: Application

19 Jun 2026: Publication (CN122237618A)

IPC: G01C21/34

AI Tagging

Application Domain

Instruments for road network navigation

Technology Topics

Algorithm · Reinforcement learning algorithm

AI Technical Summary

Technical Problem

Existing multi-AGV path planning algorithms suffer from high computational cost and complexity in large-scale environments and multi-AGV collaborative operations, while deep reinforcement learning methods suffer from long training times and low search efficiency.

Method used

The MATD3 algorithm is designed and improved by constructing a state space, action space, and reward function, introducing a heuristic soft action probability and reward network, optimizing the AGV's learning process, simplifying the AGV's motion process, providing dense feedback signals, reducing the exploration space, and improving the learning convergence speed.

Benefits of technology

This technology enables multiple AGVs to autonomously avoid obstacles and collisions in locally observable discrete environments, and to plan collision-free paths, thereby improving the efficiency and convergence speed of path planning.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention belongs to the field of multi-AGV path planning, specifically an improved multi-AGV path planning method based on the MATD3 algorithm. It incorporates improvements to the deep reinforcement learning algorithm MATD3, specifically by introducing heuristic soft action probabilities and a reward network optimization algorithm. Furthermore, to meet the task requirements of AGVs, this algorithm redesigns the state space and action space for the AGVs and proposes a novel dynamic reward function design. This invention addresses path planning for AGVs in locally observable discrete environments, enabling multiple AGVs to autonomously avoid static obstacles and collisions between AGVs, and to plan collision-free, efficient paths to the target location.

Description

Technical Field

[0001] This invention relates to the field of multi-AGV path planning technology, specifically to a multi-AGV path planning method based on an improved MATD3 algorithm.

Background Technology

[0002] Automated Guided Vehicles (AGVs) have become ubiquitous in many industries due to their superior efficiency in transportation. In practical applications, multiple AGVs often need to cooperate to complete transportation tasks. During this process, AGVs autonomously move, transport goods, and stop at designated locations according to specific path planning and operational requirements. However, with the increasing size of environments and the growing number of AGVs, using classic algorithms to plan collision-free paths for multiple AGVs faces challenges such as high computational load and complexity.

[0003] In recent years, the rapid development of deep reinforcement learning (DRL) methods has opened up a new avenue for solving multi-AGV path planning problems. By continuously interacting with complex environments, deep reinforcement learning enables AGVs to learn and optimize their strategies autonomously and independently, enhancing their autonomous decision-making and collaborative capabilities, and demonstrating adaptability and robustness in dynamic and complex environments.

[0004] Although deep reinforcement learning methods have great potential for multi-AGV path planning, using them alone leads to long training times and low search efficiency.

Summary of the Invention

[0005] To address the aforementioned problems, the present invention aims to design an improved multi-AGV path planning method based on the MATD3 algorithm. By building a simulation interactive environment, designing the state space, action space, and reward function of the AGV, and improving the reinforcement learning MATD3 algorithm, the method can achieve path planning for multiple AGVs.

[0006] The technical solution adopted by this invention to achieve the above objectives is a multi-AGV path planning method based on an improved MATD3 algorithm, comprising the following steps:

[0007] 1) The AGV acquires a state space containing its position and orientation information;

[0008] 2) The AGV inputs the state space into its own Actor network to generate multi-dimensional action probabilities, and takes the action with the highest action probability as the action to be executed;

[0009] 3) The AGV generates its reward value and heuristic soft action probability through its state space and the actions it performs;

[0010] 4) Train the MATD3 network, learn the heuristic soft action probability during the update process of the AGV's Actor network, introduce and adopt a reward network to jointly update the MATD3 network parameters, and obtain the trained MATD3 model.

[0011] 5) After each update, the trained MATD3 algorithm is tested. Each AGV inputs the real-time state information obtained from the environment into its own Actor network, calculates and executes the actions output by the network; when all AGVs have planned a collision-free effective path, the multi-AGV path planning task is completed.

[0012] In step 1), the state space includes four channels of information and one vector;

[0013] The four channels of information are: the positions of static obstacles within the field of view, the current positions of all AGVs within the field of view, the target positions of other AGVs within the field of view, and the target position of the current AGV within the field of view.

[0014] The vector is the direction vector from the AGV's current position to its target position.

[0015] In step 1), the AGV's motion space is A = {a0, a1, a2, a3, a4}. At each discrete time step, the AGV can choose to remain stationary at its current position or move to the position of the next time step by choosing one of the four basic directions: up, down, left, and right.

[0016] In step 3), the reward value is obtained through a reward function, in which the dynamic reward $R_{\text{dynamic}}$ is defined as follows:

[0017] $R_{\text{dynamic}} = R_1 + R_2$

[0018] Specifically, a dynamic action reward function R1 is calculated and generated based on the current position, state environment, and specific actions taken by the AGV.

[0019]

[0020] In the formula, $\omega_o$, $\omega_a$, and $\omega_{g1}$ are the weights of the total repulsive force $F_1$ exerted by static obstacles on the current AGV in the state space, the total repulsive force $F_2$ exerted by other AGVs on the current AGV in the state space, and the attractive force $F_g$ exerted by the target position on the current AGV, respectively; $vector_{action}$ represents the specific action currently being taken by the AGV; and $R_{MAX1} = (\omega_o \cdot F_1 + \omega_a \cdot F_2 + \omega_{g1} \cdot F_g) \cdot vector_{action}$ is the calculated maximum value;

[0021] Calculate the dynamic state reward function R2 for the current position of the AGV:

[0022] $R_2 = -\omega_{g2} \cdot |pos_g - pos_A|$

[0023] In the formula, $\omega_{g2}$ is the weight of the dynamic state reward function $R_2$, $pos_g$ represents the current target position of the AGV, and $pos_A$ represents the current position of the AGV.

[0024] Step 3) involves generating heuristic soft action probabilities, including the following steps:

[0025] Step 1: The AGV records the action it took in the previous moment;

[0026] Step 2: The AGV removes the action from the previous moment from the motion space;

[0027] Step 3: The AGV uses the channel information in its own state space to delete actions that have obstacles in the corresponding direction;

[0028] Step 4: The AGV selects an action from the available actions using vector information in its own state space and generates heuristic soft action probabilities.

[0029] Step 5: The AGV executes its own actions calculated by the Actor network;

[0030] Step 6: Update the heuristic soft action probability based on the results of the AGV's actions.

[0031] The loss function of the reward network is:

[0032]

[0033] where the reward network output is the value calculated by the i-th AGV using its own reward network parameters μ and the s, a, r, s′ information of all AGVs in the acquired experience; E is the expected value of the squared difference between the reward network output and r; $s_i$ and $a_i$ represent the state and action of the i-th AGV, respectively; r represents the specific reward value obtained by the AGV in the environment by taking action a from state s; and s′ represents the next state.

[0034] The Actor network is updated using the following formula:

[0035] $D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$

[0036] where $D_{KL}$ denotes the KL divergence, P is the action probability calculated by the Actor network of the AGV, Q is the heuristic soft action probability, and P(i) and Q(i) are the probabilities of P and Q on the i-th discrete action, respectively.

[0037]

[0038] where $J(\mu_i)$ denotes the objective used to update the Actor network parameters of the i-th AGV; the gradient terms indicate that the parameters of the i-th AGV are updated using the policy gradient, which updates the Actor network that appears inside the Critic function and the reward network of the i-th AGV; $E_{s,a \sim D}$ indicates that state information s and action information a are taken from the experience buffer D for training, with the goal of maximizing the expected value E after the gradient update; $\mu_i$ represents the action strategy of the i-th AGV, i.e., the parameters of the Actor network; the remaining two function symbols represent the Critic function and the reward network, respectively; and i represents the AGV sequence number.

[0039] The present invention has the following beneficial effects and advantages:

[0040] 1. To meet the task requirements of AGVs, this algorithm redesigns the state space and action space of AGVs and proposes a novel dynamic reward function. Discretizing the state space and action space simplifies the AGV's motion process, and the proposed novel dynamic reward function provides denser feedback signals to the AGV, accelerating its learning and exploration.

[0041] 2. This invention addresses path planning for AGVs in locally observable discrete environments, enabling multiple AGVs to autonomously avoid static obstacles and collisions between AGVs, and to plan collision-free effective paths to reach the target location.

[0042] 3. This invention proposes and employs a novel heuristic soft-motion guidance mechanism, which can help AGVs learn in a discrete space, reducing the AGV's exploration space and accelerating the convergence speed.

[0043] 4. This invention can be applied to discrete environment scenarios and achieves fast convergence and good planning results.

Description of the Drawings

[0044] Figure 1 is a flowchart of the multi-AGV path planning method based on the improved MATD3 algorithm provided in an embodiment of the present invention;

[0045] Figure 2 is a simulation training scenario diagram of multiple AGVs provided in an embodiment of the present invention;

[0046] Figure 3 is the total training reward curve of the improved MATD3 algorithm provided in an embodiment of the present invention;

[0047] Figure 4 is the total training path length curve of the improved MATD3 algorithm provided in an embodiment of the present invention.

Detailed Implementation

[0048] This invention proposes a multi-AGV path planning method based on an improved MATD3 algorithm. To make the design scheme and technical advantages of this invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.

[0049] The multi-AGV path planning method based on the improved MATD3 algorithm specifically includes the following steps:

[0050] (1) Improved MATD3 algorithm network

[0051] This invention uses the MATD3 network as a framework and guides the AGV during training by introducing heuristic soft action probabilities and a reward network.

[0052] (I) MATD3 network

[0053] The AGV's Actor network is updated by maximizing the Q-value of its Critic network, and the gradient calculation formula is as follows:

[0054]

[0055] In the formula, $s = (s_1, \ldots, s_N)$ contains the state information of all AGVs and $a = (a_1, \ldots, a_N)$ contains the action information of all AGVs, with i = 1...N, where N represents the total number of AGVs. The Q-value term is the action-value function, which, according to the characteristics of the MATD3 algorithm, is the smaller of the two values of the double Critic network. D is the experience buffer used to store the sampled information (s, a, r, s') of each agent. $E_{s,a \sim D}$ means that state information s and action information a are taken from the experience buffer D for training, with the goal of maximizing the expected value E after the gradient update. The gradient terms indicate that the i-th AGV is updated by the policy gradient, which updates the parameters of the Actor network that appears inside the action-value function of the i-th AGV. $\mu_i$ is the action strategy of the i-th AGV, i.e., the parameters of the Actor network.

[0056] The next-time-step state $s' = (s_1', \ldots, s_N')$ is input into the target Actor networks of the AGVs to obtain the actions $a' = (a_1', \ldots, a_N')$ that the AGVs may take at the next moment. The value function of the double Critic network is defined as follows:

[0057]

[0058] In the formula, y1 and y2 are the two target values of the double Critic network, $r_i$ is the reward value, and γ is the discount rate. The two target Q-value terms are the values calculated by the two target networks of the double Critic network, where i represents the i-th AGV and each target network has its own parameters; the values are obtained by inputting s' and a' into the two target networks, respectively.

[0059] Therefore, the loss function for each Critic network in the double Critic network is calculated using the mean squared error (MSE), and the update formula is as follows:

[0060]

[0061] where y = min(y1, y2) is the target value estimated by the double Critic network, i.e., the value assessment of taking the corresponding action in the corresponding state. The Critic output is the value calculated by the i-th Critic network using its own network parameters μ and the s, a, r, s′ information of all AGVs from the acquired experience. E is the expected value of the squared difference between the Critic output and y; this expectation is the loss function, and the smaller it is, the better.
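The target and loss described in this subsection can be sketched as follows. This is a minimal illustration that assumes the usual TD3-style form, y_j = r_i + γ·Q'_j(s', a'); the function names are hypothetical, plain floats stand in for network outputs, and terminal-state handling is omitted.

```python
def td3_target(reward, gamma, q1_next, q2_next):
    """Target value y = min(y1, y2), with y_j = r + gamma * Q'_j(s', a') (assumed form)."""
    y1 = reward + gamma * q1_next
    y2 = reward + gamma * q2_next
    return min(y1, y2)


def critic_mse_loss(q_values, targets):
    """Mean squared error between Critic outputs and target values over a batch."""
    if not q_values or len(q_values) != len(targets):
        raise ValueError("q_values and targets must be non-empty and equally long")
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

Taking the smaller of the two targets is what limits value overestimation; each of the two Critic networks would be trained against the same y.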

[0062] (II) Heuristic Soft Action Probability

[0063] This invention designs a heuristic soft action probability to assist AGVs in path planning. Generating the heuristic soft action probability includes the following steps:

[0064] Step 1: The AGV records the action it took in the previous moment;

[0065] Step 2: The AGV removes the action from the previous moment from the motion space;

[0066] Step 3: The AGV uses the channel information in its own state space to delete actions that contain obstacles in the corresponding direction;

[0067] Step 4: The AGV selects an action from the available actions using vector information in its own state space and generates heuristic soft action probabilities;

[0068] Step 5: The AGV executes the action calculated by its own Actor network;

[0069] Step 6: Update the heuristic soft action probability based on the results of AGV execution.
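Steps 1 to 4 above can be sketched as below. Several details are assumptions made only for illustration and are not stated in the text: the action indexing (0 stay, 1 up, 2 down, 3 left, 4 right), the square local view centred on the AGV, the dot-product alignment rule, the fallback to the stay action, and the `peak` probability mass. The execution and update of Steps 5 and 6 are not shown.

```python
# Assumed action indexing: 0 stay, 1 up, 2 down, 3 left, 4 right, as (row, col) deltas.
DELTAS = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}


def heuristic_soft_probs(prev_action, obstacles, direction, peak=0.6):
    """Return a 5-element soft action distribution.

    obstacles: square local-view grid (1 = static obstacle), centred on the AGV.
    direction: (d_row, d_col) vector from the current position to the target.
    """
    c = len(obstacles) // 2
    # Step 2: drop the action taken at the previous moment.
    candidates = [a for a in DELTAS if a != prev_action]
    # Step 3: drop actions whose neighbouring cell holds an obstacle.
    candidates = [a for a in candidates
                  if obstacles[c + DELTAS[a][0]][c + DELTAS[a][1]] != 1]
    if not candidates:  # assumed fallback: stay in place
        candidates = [0]

    # Step 4: choose the candidate best aligned with the direction vector.
    def alignment(a):
        return DELTAS[a][0] * direction[0] + DELTAS[a][1] * direction[1]

    best = max(candidates, key=alignment)
    # Soft distribution: `peak` on the chosen action, the rest spread evenly.
    rest = (1.0 - peak) / (len(DELTAS) - 1)
    return [peak if a == best else rest for a in range(len(DELTAS))]
```

Keeping non-zero probability on every action is what makes the guidance "soft": the Actor is nudged toward the heuristic choice rather than forced onto it.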

[0070] In this invention, the AGV learns the heuristic soft action probability through the KL divergence, calculated as follows:

[0071] $D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$

[0072] where $D_{KL}$ denotes the KL divergence, P is the action probability calculated by the Actor network of the AGV, Q is the heuristic soft action probability, and P(i) and Q(i) are the probabilities of P and Q on the i-th discrete action, respectively.
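As a small illustration of the quantity defined here, the sketch below computes the standard discrete KL divergence between the Actor's action probabilities P and the heuristic soft action probabilities Q. The function name and the `eps` smoothing term are assumptions added for numerical safety, not part of the described method.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) over the discrete actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

The value is zero when the two distributions coincide and grows as the Actor's output moves away from the heuristic distribution.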

[0073] (III) Reward Network

[0074] This invention redesigns the network structure of the MATD3 algorithm by introducing a reward network during the network's learning process. The reward network has the same structure as the Critic network; it receives the state and action information of all AGVs as input and outputs the reward value for the current AGV. The reward loss function is:

[0075]

[0076] where the reward network output is the value calculated by the i-th AGV using its own reward network parameters μ and the s, a, r, s′ information of all AGVs in the acquired experience; E is the expected value of the squared difference between the reward network output and r, which is the loss function, and the smaller it is, the better; $s_i$ and $a_i$ represent the state and action of the i-th AGV, respectively; and r represents the specific reward value obtained by the AGV in the environment by taking action a from state s.
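Based only on the verbal definition above (an expected squared difference between the reward network output and the observed reward), one plausible way to write the reward loss is shown below. The symbol $\hat{R}_i^{\mu}$ for the reward network output is a placeholder chosen here, since the original notation is not reproduced in the text.

```latex
L(\mu_i) = \mathbb{E}_{(s, a, r) \sim D}\left[ \left( \hat{R}_i^{\mu}(s, a) - r_i \right)^2 \right]
```

Here $s$ and $a$ collect the states and actions of all AGVs, and $r_i$ is the reward of the i-th AGV, matching the stated inputs and output of the reward network.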

[0077] Therefore, the update of the Actor network is redefined as follows:

[0078]

[0079] where $J(\mu_i)$ denotes the objective used to update the Actor network parameters of the i-th AGV; the gradient terms indicate that the i-th AGV is updated by the policy gradient, which updates the parameters of the Actor network that appears inside the Critic function and the reward network of the i-th AGV; $E_{s,a \sim D}$ means that state information s and action information a are taken from the experience buffer D for training, with the goal of maximizing the expected value E after the gradient update; $\mu_i$ represents the action strategy of the i-th AGV, i.e., the parameters of the Actor network; the remaining two function symbols represent the Critic function and the reward network, respectively; and i represents the AGV sequence number.

[0080] (2) Initialization of the training environment and the improved MATD3 algorithm network

[0081] In this invention, the environment is set as a discrete environment. An obstacle map is generated randomly or systematically, and AGVs are added to this environment. Each AGV is assigned a number and a task. In the discrete map environment, all movable positions within the local view are represented by 0, and the positions of static obstacles are represented by 1. AGVs are numbered starting from 0, and within the local view, starting from 3.

[0082] In this invention, based on the characteristics of the improved MATD3 algorithm, when there are N AGVs in a discrete environment, the improved MATD3 algorithm has N pairs of Actor-double Critic networks. Simultaneously, each network also has a corresponding target network to aid learning. At the start of training, the networks for all AGVs and the target network in the improved MATD3 algorithm are initialized.

[0083] (3) Action selection

[0084] Action selection refers to the AGV using its own independent Actor network, calculating the action probability by inputting its current state, and then selecting the action with the highest probability to execute.
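A minimal sketch of this greedy selection is given below. The `softmax` helper and the idea that the Actor produces raw scores before normalisation are assumptions for illustration; the text only states that the action with the highest probability is executed.

```python
import math


def softmax(logits):
    """Convert raw scores into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(action_probs):
    """Return the index of the action with the highest probability."""
    return max(range(len(action_probs)), key=lambda i: action_probs[i])
```

For example, `select_action(softmax([0.0, 2.0, 1.0, -1.0, 0.5]))` picks index 1, the largest score.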

[0085] The above steps involve state space design, action space design, reward function design, and algorithm parameter setting, the specific designs of which are as follows:

[0086] ① State space design: The state space of the AGV consists of four channels of information and one vector.

[0087] The four channels are: the positions of static obstacles within the field of view, the current positions of all AGVs within the field of view, the target positions of other AGVs within the field of view, and the target position of the current AGV within the field of view.

[0088] The vector is the direction vector from the AGV's current position to its target position.

[0089] ② Motion space design: The motion space of the AGV is A={a0,a1,a2,a3,a4}. At each discrete time step, the AGV can choose to remain stationary at its current position, or choose one of the four basic directions of up, down, left, and right to move to the position of the next time step.
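The state space described above could be assembled roughly as follows. This is a sketch under assumptions not stated in the text: binary channel encoding, a square field of view of radius `radius`, cells beyond the map edge treated as obstacles, and an unnormalised direction vector. The function and parameter names are hypothetical.

```python
def build_observation(grid_obstacles, agv_positions, agv_targets, me, radius=1):
    """Return (channels, direction) for the AGV with index `me`.

    channels[0]: static obstacles in view     channels[1]: all AGVs in view
    channels[2]: targets of other AGVs        channels[3]: target of this AGV
    """
    size = 2 * radius + 1
    r0, c0 = agv_positions[me]
    rows, cols = len(grid_obstacles), len(grid_obstacles[0])
    ch = [[[0] * size for _ in range(size)] for _ in range(4)]

    def local(pos):
        """Map a global (row, col) to local-view coordinates, or None if out of view."""
        lr, lc = pos[0] - r0 + radius, pos[1] - c0 + radius
        return (lr, lc) if 0 <= lr < size and 0 <= lc < size else None

    for lr in range(size):
        for lc in range(size):
            gr, gc = r0 - radius + lr, c0 - radius + lc
            outside = not (0 <= gr < rows and 0 <= gc < cols)
            # Assumption: cells beyond the map edge count as obstacles.
            if outside or grid_obstacles[gr][gc] == 1:
                ch[0][lr][lc] = 1
    for pos in agv_positions:
        cell = local(pos)
        if cell:
            ch[1][cell[0]][cell[1]] = 1
    for k, tgt in enumerate(agv_targets):
        cell = local(tgt)
        if cell:
            ch[3 if k == me else 2][cell[0]][cell[1]] = 1
    tr, tc = agv_targets[me]
    return ch, (tr - r0, tc - c0)
```

The four grids would feed the Actor as image-like channels, while the direction vector supplies goal information even when the target lies outside the local view.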

[0090] ③ Reward Function Design: Based on the concept of an artificial potential field, a dynamic reward function is introduced during the AGV's movement, calculating the reward using the AGV's state and actions. A reward function related to Euclidean or Manhattan distance is also utilized.

[0091] Given the current position, state, and actions of the AGV, calculate and generate a dynamic action reward function R1:

[0092]

[0093] In the formula, $\omega_o = 0.3$, $\omega_a = 0.5$, and $\omega_{g1} = 3$ are the weights of the total repulsive force $F_1$ exerted by static obstacles on the current AGV in the state space, the total repulsive force $F_2$ exerted by other AGVs on the current AGV in the state space, and the attractive force $F_g$ exerted by the target position on the current AGV, respectively; $vector_{action}$ represents the specific action currently being taken by the AGV; and $R_{MAX1} = (\omega_o \cdot F_1 + \omega_a \cdot F_2 + \omega_{g1} \cdot F_g) \cdot vector_{action}$ is the calculated maximum value.

[0094] Calculate the dynamic state reward function R2 for the current position of the AGV:

[0095] $R_2 = -\omega_{g2} \cdot |pos_g - pos_A|$

[0096] In the formula, $\omega_{g2} = 0.05$ is the weight of the dynamic state reward function $R_2$, $pos_g$ represents the current target position of the AGV, and $pos_A$ represents the current position of the AGV.

[0097] The dynamic reward $R_{\text{dynamic}}$ consists of $R_1$ and $R_2$:

[0098] $R_{\text{dynamic}} = R_1 + R_2$

[0099] The final reward function consists of several parts. The first part is the dynamic reward $R_{\text{dynamic}} \in (-0.15, 0)$. The second part is the fixed reward, which follows the traditional path planning reward design. The goal is to enable the AGV to reach its target earlier, so the AGV receives a small negative reward of -0.05 for each step it takes. When the AGV reaches the goal, it receives a positive reward of 0.4. When the AGV collides, it receives a large negative reward of -0.1. During training, the AGV is encouraged to explore: when it has not reached the goal and takes the stationary action, it receives a negative reward of -0.2, which is lower than the minimum dynamic reward value.
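The reward composition can be illustrated with the sketch below, using the constants given in the text. Several points are assumptions: the Manhattan distance is used for $R_2$ (the text names Euclidean or Manhattan), the fixed terms are treated as mutually exclusive in the order shown, and $R_1$ is passed in as a number because its formula is not reproduced here. Any normalisation needed to keep the dynamic reward inside (-0.15, 0) is likewise not shown.

```python
W_G2 = 0.05  # weight of R2, from the description
STEP, GOAL, COLLISION, STAY = -0.05, 0.4, -0.1, -0.2  # fixed rewards, from the description


def r2_state_reward(pos, goal):
    """R2 = -w_g2 * |pos_g - pos_A|, here with the Manhattan distance (assumed choice)."""
    return -W_G2 * (abs(goal[0] - pos[0]) + abs(goal[1] - pos[1]))


def fixed_reward(reached_goal, collided, stayed):
    """Fixed part of the reward; the priority order is an assumption."""
    if reached_goal:
        return GOAL
    if collided:
        return COLLISION
    if stayed:
        return STAY
    return STEP


def total_reward(r1, pos, goal, reached_goal, collided, stayed):
    """Dynamic reward (R1 + R2) plus the fixed reward; r1 is supplied by the caller."""
    return r1 + r2_state_reward(pos, goal) + fixed_reward(reached_goal, collided, stayed)
```

Because $R_2$ is evaluated at every step, the AGV receives feedback even when it is far from the goal, which is the dense-signal property the description emphasises.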

[0100] ④ The improved MATD3 algorithm parameter settings are shown in Table 1:

[0101] Table 1

[0102]

[0103] The flowchart of the multi-AGV path planning method based on the improved MATD3 algorithm is shown in Figure 1, and the environment settings of the specific implementation are shown in Figure 2.

[0104] The multi-AGV path planning method based on the improved MATD3 algorithm includes the following steps:

[0105] Step 1: Build a simulation environment model in PyCharm using Python, set obstacle positions, and randomly generate the starting and target positions of the AGV. Write functions for the AGV to interact with the environment, enabling the AGV to collect data while performing actions within the environment.

[0106] Step 2: Initialization: For each AGV in the improved MATD3 algorithm, there is an Actor network and its target network, a double Critic network and its target network, and a reward network and its target network. Initialize these networks by setting the learning rate, optimizer, and hyperparameters, and initialize the simulation environment.

[0107] Step 3: The AGV obtains its state space, which contains four channels and one vector.

[0108] Step 4: The AGV inputs the state space into its own Actor network to generate 5-dimensional action probabilities, and takes the action with the highest action probability as the action to be executed.

[0109] Step 5: The AGV generates its reward value and heuristic soft action probability through its state space and the actions it performs.

[0110] Step 5.1: The reward value and heuristic soft action probability, as well as the algorithm process and design, are detailed in the description of the invention.

[0111] Step 6: Set the algorithm training parameters. After the amount of data in the experience buffer reaches the set value, train the network and perform gradient descent according to the loss function to update the network parameters.

[0112] Step 6.1: In this case scenario, the training parameters are set using the algorithm parameters listed in the invention description.

[0113] Step 7: Repeat the above steps iteratively to obtain a trained network model.
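The data-collection side of Steps 1 to 7 relies on an experience buffer that must reach a set size before training starts. A minimal sketch of such a buffer is shown below; the class name, the `capacity` and `min_size` parameters, and uniform random sampling are assumptions for illustration, and the actual values would come from Table 1.

```python
import random
from collections import deque


class ExperienceBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity, min_size):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.min_size = min_size

    def add(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def ready(self):
        """True once enough data has been collected to start training (Step 6)."""
        return len(self.data) >= self.min_size

    def sample(self, batch_size):
        """Draw a uniform random mini-batch without replacement."""
        return random.sample(list(self.data), batch_size)
```

In a training loop, each environment step would call `add`, and the network update would run only when `ready()` returns true.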

[0114] During the testing phase, the AGVs use a pre-trained network model. Each AGV inputs its own state space into the Actor network and executes the actions calculated by the network.

[0115] In the example environment of Figure 2, the total reward curve and the total path length curve of the improved MATD3 algorithm during the iteration process are shown in Figure 3 and Figure 4, respectively.

[0116] The above is a detailed description of one embodiment of the present invention, but the present invention is not limited to the described embodiment. Without departing from the core spirit of the present invention, those skilled in the art can make various equivalent modifications or substitutions, and these equivalent modifications or substitutions are all considered to be included within the scope of the claims of this application.

Claims

1. A multi-AGV path planning method based on an improved MATD3 algorithm, characterized in that it includes the following steps: 1) the AGV acquires a state space containing its position and orientation information; 2) the AGV inputs the state space into its own Actor network to generate multi-dimensional action probabilities, and takes the action with the highest action probability as the action to be executed; 3) the AGV generates its reward value and heuristic soft action probability through its state space and the actions it performs; 4) the MATD3 network is trained, the heuristic soft action probability is learned during the update process of the AGV's Actor network, and a reward network is introduced and adopted to jointly update the MATD3 network parameters, obtaining the trained MATD3 model; 5) after each update, the trained MATD3 algorithm is tested: each AGV inputs the real-time state information obtained from the environment into its own Actor network, and calculates and executes the actions output by the network; when all AGVs have planned a collision-free effective path, the multi-AGV path planning task is completed.

2. The multi-AGV path planning method of the improved MATD3 algorithm according to claim 1, characterized in that, in step 1), the state space includes four channels of information and one vector; the four channels of information are: the positions of static obstacles within the field of view, the current positions of all AGVs within the field of view, the target positions of other AGVs within the field of view, and the target position of the current AGV within the field of view; the vector is the direction vector from the AGV's current position to its target position.

3. The multi-AGV path planning method of the improved MATD3 algorithm according to claim 1, characterized in that, In step 1), the AGV's motion space is A = {a0, a1, a2, a3, a4}. At each discrete time step, the AGV can choose to remain stationary at its current position or move to the position of the next time step by choosing one of the four basic directions: up, down, left, and right.

4. The multi-AGV path planning method of the improved MATD3 algorithm according to claim 1, characterized in that, in step 3), the reward value is obtained through a reward function, in which the dynamic reward $R_{\text{dynamic}}$ is defined as follows: $R_{\text{dynamic}} = R_1 + R_2$. Specifically, a dynamic action reward function $R_1$ is calculated and generated based on the current position, state environment, and specific actions taken by the AGV. In the formula, $\omega_o$, $\omega_a$, and $\omega_{g1}$ are the weights of the total repulsive force $F_1$ exerted by static obstacles on the current AGV in the state space, the total repulsive force $F_2$ exerted by other AGVs on the current AGV in the state space, and the attractive force $F_g$ exerted by the target position on the current AGV, respectively; $vector_{action}$ represents the specific action currently being taken by the AGV; and $R_{MAX1} = (\omega_o \cdot F_1 + \omega_a \cdot F_2 + \omega_{g1} \cdot F_g) \cdot vector_{action}$ is the calculated maximum value. The dynamic state reward function $R_2$ for the current position of the AGV is calculated as: $R_2 = -\omega_{g2} \cdot |pos_g - pos_A|$. In the formula, $\omega_{g2}$ is the weight of the dynamic state reward function $R_2$, $pos_g$ represents the current target position of the AGV, and $pos_A$ represents the current position of the AGV.

5. A multi-AGV path planning method for an improved MATD3 algorithm according to claim 1, characterized in that, Step 3) involves generating heuristic soft action probabilities, including the following steps: Step 1: The AGV records the action it took in the previous moment; Step 2: The AGV removes the action from the previous moment from the motion space; Step 3: The AGV uses the channel information in its own state space to delete actions where there are obstacles in the corresponding direction; Step 4: The AGV selects an action from the available actions using vector information in its own state space and generates heuristic soft action probabilities. Step 5: The AGV executes its own actions calculated by the Actor network; Step 6: Update the heuristic soft action probability based on the results of the AGV's actions.

6. A multi-AGV path planning method for an improved MATD3 algorithm according to claim 1, characterized in that the loss function of the reward network is defined such that the reward network output is the value calculated by the i-th AGV using its own reward network parameters μ and the s, a, r, s′ information of all AGVs in the acquired experience; E is the expected value of the squared difference between the reward network output and r; $s_i$ and $a_i$ represent the state and action of the i-th AGV, respectively; r represents the specific reward value obtained by the AGV in the environment by taking action a from state s; and s′ represents the next state.

7. A multi-AGV path planning method for an improved MATD3 algorithm according to claim 1, characterized in that the Actor network is updated using formulas in which $D_{KL}$ denotes the KL divergence, P is the action probability calculated by the Actor network of the AGV, Q is the heuristic soft action probability, and P(i) and Q(i) are the probabilities of P and Q on the i-th discrete action, respectively; $J(\mu_i)$ denotes the objective used to update the Actor network parameters of the i-th AGV; the gradient terms indicate that the parameters of the i-th AGV are updated using the policy gradient, which updates the Actor network that appears inside the Critic function and the reward network of the i-th AGV; $E_{s,a \sim D}$ indicates that state information s and action information a are taken from the experience buffer D for training, with the goal of maximizing the expected value E after the gradient update; $\mu_i$ represents the action strategy of the i-th AGV, i.e., the parameters of the Actor network; the remaining two function symbols represent the Critic function and the reward network, respectively; and i represents the AGV sequence number.