Multi-agent path planning method based on hybrid heuristic search and reinforcement learning
By combining heuristic search and reinforcement learning methods with A*, MCTS and MAPPO algorithms, multi-agent path planning is optimized, solving the problems of low path planning efficiency and poor inter-agent cooperation in dynamic environments, and realizing efficient and flexible path planning and cooperation capabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENYANG ROBOT IND DEVELOPMENT GROUP CO LTD
- Filing Date
- 2024-12-20
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multi-agent path planning methods are inefficient in dynamic environments, have difficulty in resolving conflicts, and exhibit poor cooperation among agents, especially in autonomous robots, drone swarms, and autonomous driving.
A hybrid heuristic search and reinforcement learning approach is adopted, combining the A* algorithm to generate the initial path, using Monte Carlo Tree Search (MCTS) to optimize the local path, resolving conflicts through backtracking, and optimizing agent cooperation through MAPPO reinforcement learning. A hierarchical training task is designed to gradually adapt to complex environments.
It improves the efficiency of path planning and the collaborative ability among agents, adapts to dynamic environments, enhances the real-time performance, robustness, and adaptability of path planning, and significantly improves the efficiency and security of multi-agent path planning.
Figure CN122258884A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of multi-agent path planning, specifically a multi-agent path planning method based on hybrid heuristic search and reinforcement learning.
Background Technology
[0002] Multi-agent path planning is an important research area in artificial intelligence and robotics, with wide applications in autonomous robots, drone swarms, and autonomous driving. Its core objective is to plan paths for multiple agents in a shared environment, ensuring that each agent's path from the starting point to the target point is conflict-free and completes the task in the shortest time or distance. In multi-agent path planning, traditional methods often plan paths independently for each agent. Graph search-based multi-agent path planning methods, such as Cooperative A* and Conflict-based Search (CBS) algorithms, have addressed the conflict problem to some extent. The CBS algorithm, based on heuristic search, uses a conflict backtracking mechanism to optimize conflict paths, ensuring that the paths of multiple agents do not overlap. However, these methods typically assume that each agent's actions are discrete and are suitable for static environments; they perform poorly in dynamic environments and complex agent interactions. In contrast, learning-based multi-agent path planning methods, such as Deep Q-Network (DQN), approximate the Q-function by training a deep neural network, allowing agents to select actions by maximizing the Q-value. Although DQN has achieved good results in single-agent path planning, in multi-agent environments, the interactions between agents are more complex, the training process requires a huge amount of data and a long time, and it is difficult to achieve stable convergence.
Summary of the Invention
[0003] The purpose of this invention is to address the problems of low path planning efficiency, difficult conflict resolution, and poor inter-agent cooperation in the field of multi-agent path planning. It designs a multi-agent path planning method based on hybrid heuristic search and reinforcement learning. This method leverages the advantages of both local path optimization and global path cooperation, improving the efficiency of path planning and the collaborative capabilities between agents. It is particularly suitable for multi-agent tasks in dynamic environments, such as robot swarms, autonomous driving, and intelligent logistics, exhibiting high real-time performance, robustness, and adaptability, and significantly improving the efficiency and safety of multi-agent path planning.
[0004] The technical solution adopted by this invention to achieve the above objectives is: a multi-agent path planning method based on hybrid heuristic search and reinforcement learning, comprising the following steps:
[0005] S1: Generate a preliminary path for each agent using the A* algorithm to ensure the shortest path from the starting point to the target;
[0006] S2: The Monte Carlo tree search strategy is used to locally optimize the initial path and simulate the behavior between agents;
[0007] S3: Optimize the path by backtracking to solve the local conflict problem caused by mutual interference between agents; and perform multiple simulations in a loop to obtain the updated global path;
[0008] S4: Optimize the behavior strategy of each agent through reinforcement learning to achieve cooperation with other agents, avoid collisions, and combine the updated global path to obtain a global path planning strategy to complete the path planning of multiple agents.
[0009] Step S1 specifically includes:
[0010] S1-1: Initialize the path planning task by inputting the starting and target positions, and set the starting and target positions for each agent;
[0011] S1-2: Use the A* algorithm to search for the shortest path from the starting point to the target point on the grid map, and calculate the overall priority of each node, i.e.:
[0012] f(x) = g(x) + h(x)
[0013] Where g(x) represents the actual cost from the starting point to the current node x, i.e. the length of the path already traveled, and h(x) represents the heuristically estimated cost from the current node x to the target node;
[0014] S1-3: During the path search process, dynamically update the OPEN list and the CLOSED list to ensure that the search process can find the optimal solution;
[0015] S1-4: Based on the path search results, save the complete initial path for each agent. The path includes: the sequence of nodes visited and cost information.
[0016] If there is a conflict between the paths of the agents, that is, the paths of multiple agents intersect or the nodes overlap, the conflict area is marked and the conflict information is transmitted to the local path optimization module.
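Steps S1-1 to S1-4 can be illustrated with a minimal sketch of A* on a grid map. The 4-connected moves, unit step cost, Manhattan heuristic, and the function name `astar` are assumptions made for illustration; the source does not specify them.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on a grid; grid[r][c] == 1 marks an obstacle.

    Returns (path, cost), or (None, inf) when the goal is unreachable.
    """
    def h(cell):
        # Manhattan distance (assumed heuristic): admissible for unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # OPEN list ordered by f = g + h
    g_cost = {start: 0}
    parent = {start: None}
    closed = set()                      # CLOSED list

    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node in closed:
            continue  # stale heap entry for an already expanded node
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g
        closed.add(node)
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == 1 or nxt in closed:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None, float("inf")
```

Calling `astar` once per agent yields the node sequence and cost that S1-4 stores as that agent's initial path.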
[0017] Step S2 specifically includes:
[0018] S2-1: Within the conflict areas marked in the initial path, the local path optimization module performs local optimization on the initial path using a Monte Carlo tree search strategy;
[0019] S2-2: Each node represents the current path state, and the edges represent path decisions such as adjusting the direction or reselecting nodes. The search budget and path cost evaluation criteria are set.
[0020] S2-3: Construct a search tree from the current path's state, where the root node represents the current path and child nodes represent possible path choices; expand the tree by selecting the node with the largest UCB1 value, i.e.:
[0021] $\mathrm{UCB1}_i = \frac{W_i}{N_i} + C\sqrt{\frac{\ln N_p}{N_i}}$
[0022] Where $W_i$ is the cumulative reward of node i, representing the total quality of the current path; $N_i$ is the number of times node i is visited, $N_p$ is the number of times the parent node is visited, and C is a constant;
[0023] S2-4: When a leaf node is not fully explored, select an unvisited action (such as adjusting direction or waiting) to expand it and generate a new node;
[0024] The new node record includes: status information such as the path, current benefits, and conflict costs;
[0025] S2-5: Starting from the expanded node, simulate the path planning and decision-making of the agent until the set termination condition is met, at which point the simulation ends.
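The selection rule in S2-3 can be sketched as follows, assuming the standard UCB1 form built from the quantities defined above (cumulative reward, visit count, parent visit count, constant C). The `Node` class, the default value of C, and the convention that unvisited children are tried first are illustrative assumptions.

```python
import math

class Node:
    """Search-tree node holding the statistics used by UCB1."""
    def __init__(self, action=None, parent=None):
        self.action = action          # path decision that led to this node
        self.parent = parent
        self.children = []
        self.visits = 0               # N_i
        self.total_reward = 0.0       # W_i
        if parent is not None:
            parent.children.append(self)

def ucb1(node, c=math.sqrt(2)):
    """Mean reward plus an exploration bonus; unvisited nodes score infinity."""
    if node.visits == 0:
        return float("inf")
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def select_child(node, c=math.sqrt(2)):
    """Pick the child with the largest UCB1 value."""
    return max(node.children, key=lambda child: ucb1(child, c))
```

A larger C favors rarely visited path decisions, while C = 0 reduces the rule to choosing the child with the best average reward.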
[0026] In step S3, the backtracking optimization path includes the following steps:
[0027] S3-1: Reward determination: In Monte Carlo tree search, after a set number of simulations, the reward of each simulated path, that is, each path from the root node to a leaf node of the search tree, is calculated according to the preset reward evaluation criteria.
[0028] S3-2: Reward backtracking preparation: The reward backtracking operation starts from the leaf node reached at the end of a simulation; the leaf node corresponds to the final state of that simulation and stores the reward obtained along the simulated path; at the same time, the local path represented by each node along the way from the leaf node back to the root node is recorded, so that different paths can later be compared and the path with the highest reward determined.
[0029] S3-3: Reward backtracking process: Starting from the current leaf node, the cumulative reward obtained by the leaf node is passed to all of its ancestor nodes according to the set rules; during backtracking, in addition to the cumulative reward, the relevant statistics of each parent node are updated; the updated parent node then becomes the current node, and the process repeats up the tree structure until the root node is reached.
[0030] S3-4: Path selection optimization: After reward backtracking is complete, the paths represented by the different branches of the search tree are evaluated based on the cumulative reward of each node and the other relevant statistics; by comparing these evaluations, the path with the highest reward is found.
[0031] S3-5: Merging paths into a new global path: Determine the local range covered by the highest-reward optimized path in the search tree, replace the part of the original global path that needs optimization with the recorded highest-reward path, and obtain the updated global path.
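A minimal sketch of S3-3 to S3-5 follows. It assumes each tree node stores the local path segment it represents, that branches are compared by mean reward, and that the conflict area corresponds to an index range of the global path; the names `backpropagate`, `best_child`, and `splice` are hypothetical.

```python
class Node:
    """Tree node that records the local path segment it represents."""
    def __init__(self, segment, parent=None):
        self.segment = segment
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0
        if parent is not None:
            parent.children.append(self)

def backpropagate(leaf, reward):
    """Pass a simulation reward from the leaf up to the root (S3-3)."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

def best_child(root):
    """Compare branches by mean reward and return the best one (S3-4)."""
    def mean_reward(child):
        return child.total_reward / child.visits if child.visits else float("-inf")
    return max(root.children, key=mean_reward)

def splice(global_path, start_idx, end_idx, local_path):
    """Replace global_path[start_idx..end_idx] with the optimized segment (S3-5)."""
    return global_path[:start_idx] + local_path + global_path[end_idx + 1:]
```

Comparing by mean reward rather than raw cumulative reward is one possible choice; it avoids favoring a branch only because it was simulated more often.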
[0032] Step S4 includes the following steps:
[0033] S4-1: Initialize the reinforcement learning model of the agent, define the observation space, action space and joint reward function in the spatial encoder, and the reward function comprehensively considers path cost, collision penalty and overall efficiency index.
[0034] S4-2: In a path planning environment, the agents are trained using the MAPPO method, and the behavior of each agent is optimized using a joint policy.
[0035] S4-3: Through multiple rounds of training, the agent learns to cooperate with other agents in dynamic environments and optimizes the global path planning strategy.
[0036] S4-4: Design layered training tasks based on environmental complexity, gradually increasing scene complexity;
[0037] S4-5: Use a validation set to test the path planning performance of the agent in different scenarios, and verify the stability and effectiveness of the algorithm.
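The joint reward described in S4-1 can be sketched as a blend of an individual term and a team term. The source names the ingredients (path cost, collision penalty, overall efficiency) but not their form, so the weights, the goal bonus, and the use of the team mean as the efficiency index are all assumptions.

```python
def joint_reward(step_costs, collided, arrived,
                 collision_penalty=1.0, goal_bonus=5.0, team_weight=0.5):
    """Per-agent rewards for one time step.

    step_costs: path cost paid by each agent in this step
    collided:   True for each agent involved in a collision
    arrived:    True for each agent that reached its target in this step
    All coefficients are hypothetical tuning parameters.
    """
    individual = [
        -cost - collision_penalty * float(hit) + goal_bonus * float(done)
        for cost, hit, done in zip(step_costs, collided, arrived)
    ]
    # Team term: the mean individual reward, used here as a simple
    # stand-in for the overall efficiency index.
    team = sum(individual) / len(individual)
    return [(1.0 - team_weight) * r + team_weight * team for r in individual]
```

With `team_weight` set to 0 each agent is rewarded only for its own path, while a value of 1 gives every agent the same shared reward.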
[0038] In step S4-2, the training process further includes collecting the experience of each agent and storing it in the experience replay pool, specifically:
[0039] Multi-agent parallel interaction: Multiple agents simultaneously generate and execute actions in the path planning environment based on their respective current policy networks; each agent inputs the observed environmental state into the policy network, the policy network outputs the corresponding action probability distribution or specific action value, and determines and executes the action according to the corresponding rules;
[0040] During training, the state $s_t$, action $a_t$, reward $r_t$, and next state $s_{t+1}$ of each agent are collected, forming a sequence of samples $(s_t, a_t, r_t, s_{t+1})$; after a set amount of experience data has been collected, it is stored in the experience replay pool to provide material for subsequent policy learning and optimization.
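The experience collection described above can be sketched with a small per-agent buffer. The class name `ExperienceBuffer`, the capacity, and the `drain` step are assumptions; because proximal policy optimization is an on-policy method, this sketch clears the buffer after each update rather than replaying old data.

```python
from collections import deque, namedtuple

# One sample (s_t, a_t, r_t, s_{t+1}) for a single agent.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ExperienceBuffer:
    """Stores transitions separately for each agent."""
    def __init__(self, n_agents, capacity=1024):
        self.data = [deque(maxlen=capacity) for _ in range(n_agents)]

    def add(self, agent_id, state, action, reward, next_state):
        self.data[agent_id].append(Transition(state, action, reward, next_state))

    def ready(self, batch_size):
        """True once every agent has collected at least batch_size samples."""
        return all(len(d) >= batch_size for d in self.data)

    def drain(self):
        """Return all stored transitions per agent and clear the buffer."""
        batch = [list(d) for d in self.data]
        for d in self.data:
            d.clear()
        return batch
```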
[0041] In step S4-2, the multi-agent proximal policy optimization method optimizes the behavior of each agent using a joint policy, including the following steps:
[0042] a. Value network evaluation and advantage estimation
[0043] The value network is used to estimate the value of each agent's state, predicting the expected cumulative reward that the agent can obtain by continuing to execute the current policy when in state s; the advantage function $A_t$ of each agent at each time step is calculated;
[0044] b. Construct the loss function:
[0045] The loss functions of all agents are combined by summation or weighted summation into a joint loss function. Based on this joint loss function, the policy network parameters of all agents are updated jointly, using the policy network and value network.
[0046] The policy network and value network are optimized, and the loss function is:
[0047] $L_t = \min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)$
[0048] Where $r_t(\theta)$ represents the policy probability ratio, $A_t$ is the advantage function, and $\epsilon$ is the clipping range parameter.
[0049] c. Joint strategy optimization:
[0050] To construct a joint loss function, the loss functions of all agents are integrated. A common approach is weighted summation. Assuming there are n agents with respective loss functions $L^{(1)}, \dots, L^{(n)}$ and the weight of agent i is set to $\omega_i$, the joint loss function is:
[0051] $L_{joint} = \sum_{i=1}^{n} \omega_i L^{(i)}$
[0052] d. Calculate the gradient of the joint loss function using stochastic gradient descent, and update the parameters of all agent policy networks based on the gradient.
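Steps a to c can be sketched in plain Python as follows. The source does not say how the advantage is computed, so generalized advantage estimation (GAE) is swapped in here as an assumption; the function names and the default values of `gamma`, `lam`, and `eps` are likewise illustrative. The per-agent loss is taken as the negative mean of the clipped term so that minimizing it maximizes the surrogate objective.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage A_t per time step from rewards and value-network estimates."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) for one sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def agent_loss(ratios, advantages, eps=0.2):
    """Negative mean clipped surrogate for one agent."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return -sum(terms) / len(terms)

def joint_loss(per_agent_losses, weights):
    """Weighted sum of the agents' losses."""
    return sum(w * loss for w, loss in zip(weights, per_agent_losses))
```

In a full implementation the ratios would come from the policy network and the joint loss would be differentiated by an automatic differentiation framework, which this sketch leaves out.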
[0053] In step S4-4, designing hierarchical training tasks based on environmental complexity and gradually increasing scene complexity specifically involves:
[0054] The agent is first trained in a simple environment and gradually transitions to a complex dynamic environment, which includes adding obstacles, simulating dynamic obstacles, and introducing cooperation with other agents;
[0055] During training, the reward function and environmental complexity at each stage are adjusted according to the progress of the task in order to gradually improve the learning efficiency and ability of the agent.
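One way to picture the staged training is a small curriculum table with a promotion rule. Every number below (obstacle counts, agent counts, penalties, the 0.8 success threshold, and the three-evaluation window) is a hypothetical placeholder, not a value from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    static_obstacles: int
    dynamic_obstacles: int
    n_agents: int
    collision_penalty: float  # reward term adjusted per stage

# Hypothetical stages ordered from simple to complex.
CURRICULUM = [
    Stage("static", 5, 0, 1, 1.0),
    Stage("dynamic", 10, 3, 2, 2.0),
    Stage("cooperative", 15, 5, 4, 4.0),
]

def next_stage_index(current, success_rates, threshold=0.8, window=3):
    """Advance one stage when the recent mean success rate reaches the threshold."""
    if len(success_rates) < window:
        return current
    mean = sum(success_rates[-window:]) / window
    if mean >= threshold and current < len(CURRICULUM) - 1:
        return current + 1
    return current
```

Tying promotion to measured success keeps the agent on a stage until it is learned, which is one reading of adjusting complexity "according to the progress of the task".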
[0056] The present invention has the following beneficial effects and advantages:
[0057] 1. This invention employs an efficient multi-agent path planning method combining A*, MCTS, MAPPO, and curriculum learning. The A* algorithm quickly generates the globally optimal initial path, MCTS optimizes conflicts in local dynamic environments, and MAPPO reinforcement learning further optimizes the multi-agent cooperation strategy. The introduction of curriculum learning enables agents to gradually adapt to environmental complexity, improving learning efficiency and robustness.
[0058] 2. The method of the present invention combines the high efficiency of global path planning, the flexibility of local dynamic optimization, and the holistic nature of agent cooperation, making it suitable for multi-agent task scenarios in complex environments.
[0059] 3. The method of the present invention achieves a dual improvement in path planning efficiency and collaborative performance through the organic combination of multiple technologies, and exhibits strong generalization ability and robustness in complex environments with dynamic and multi-constraint conditions.
Attached Figure Description
[0060] Figure 1 is a schematic diagram of the principle of multi-agent path planning combining hybrid heuristic search and reinforcement learning according to the present invention;
[0061] Figure 2 is a schematic diagram of the conflict area marking of the present invention;
[0062] Figure 3 is a flowchart of the hybrid heuristic search of the present invention;
[0063] Figure 4 is a simulation diagram of the MCTS execution of the present invention;
[0064] Figure 5 is a structure diagram of the MAPPO neural network of the present invention.
Detailed Implementation
[0065] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments.
[0066] Figure 1 illustrates the principle of multi-agent path planning that combines hybrid heuristic search and reinforcement learning in this invention.
[0067] An initial path is generated for each agent using the A* method, and Monte Carlo Tree Search (MCTS) is used for local optimization of the path to avoid conflicts and improve path quality. This method effectively combines global optimal path search with local path optimization, reducing conflicts and improving efficiency. For collaboration among multiple agents, the MAPPO reinforcement learning algorithm is used for coordination and optimization between agents. Reinforcement learning enhances the agents' cooperation and further optimizes path planning. To train agents more efficiently and enhance their adaptability in complex environments, a curriculum learning strategy is introduced. Layered training tasks, from simple to complex, guide agents to gradually adapt to scenarios of varying difficulty, from basic path planning in static environments to path adjustment in dynamic obstacle environments, and finally to complex tasks involving deep multi-agent collaboration. At each stage, the reward function is dynamically adjusted to improve the agent's learning performance. By combining the local path optimization capabilities of MCTS with the global collaboration strategy of MAPPO, and the step-by-step guidance of curriculum learning, the system can perform real-time path planning and updates in dynamic environments, significantly improving the efficiency and robustness of multi-agent path planning, adapting to changes in obstacles and adjustments in agent behavior, and comprehensively enhancing task completion efficiency and safety.
[0068] (1) Algorithm Framework
[0069] As shown in Figure 1, the method provided by this invention combines three algorithms (A*, MCTS, and MAPPO) to form an efficient multi-agent path planning framework. First, the A* algorithm is used to generate an initial path for each agent, ensuring the shortest path from the starting point to the goal. However, since the traditional A* algorithm cannot handle dynamic obstacles and the behavior of other agents, it may cause conflicts in complex environments. Therefore, MCTS is then used to locally optimize the initial path, simulating the behavior between agents, and backtracking to optimize the path, resolving local conflicts caused by mutual interference between agents. MCTS balances exploration and exploitation through multiple simulations, optimizing path selection and adapting to environmental changes. Finally, to achieve cooperation between agents and global path optimization, the MAPPO reinforcement learning method is used to optimize the behavioral strategy of each agent, enabling agents to not only consider their own goals but also coordinate and cooperate with other agents, avoiding collisions and improving the overall path planning effect. Furthermore, combined with a curriculum learning strategy, agents can gradually adapt to complex environments, improving learning efficiency. The overall approach combines these three techniques to overcome the limitations of a single algorithm. It can ensure the global optimum of the path and flexibly cope with dynamic changes and multi-agent cooperation problems, thereby achieving more efficient and reliable path planning.
[0070] (2) Specific process of the method
[0071] As shown in Figure 1, the specific process of the method is as follows:
[0072] S1: Generate a preliminary path for each agent using the A* algorithm to ensure the shortest path from the starting point to the target;
[0073] S2: The Monte Carlo tree search strategy is used to locally optimize the initial path and simulate the behavior between agents;
[0074] S3: Optimize the path by backtracking to solve the local conflict problem caused by mutual interference between agents; and perform multiple simulations in a loop to obtain the updated global path;
[0075] S4: Optimize the behavior strategy of each agent through reinforcement learning to achieve cooperation with other agents, avoid collisions, and combine the updated global path to obtain a global path planning strategy to complete the path planning of multiple agents.
[0076] Figure 2 illustrates the conflict region marking method of this invention. The multi-agent path planning method of this invention comprises two parts: a hybrid heuristic search and a reinforcement learning policy optimization. The conflict region marking method is implemented through the hybrid heuristic search, specifically as follows:
[0077] S1-1: Initialize the path planning task by inputting the starting and target positions, and set the starting and target positions for each agent;
[0078] S1-2: Use the A* algorithm to search for the shortest path from the starting point to the target point on the grid map, and calculate the overall priority of each node, i.e.:
[0079] f(x) = g(x) + h(x)
[0080] Where g(x) represents the actual cost from the starting point to the current node x, i.e. the length of the path already traveled, and h(x) represents the heuristically estimated cost from the current node x to the target node;
[0081] S1-3: During the path search process, dynamically update the OPEN list and the CLOSED list to ensure that the search process can find the optimal solution;
[0082] S1-4: Based on the path search results, save the complete initial path for each agent. The path includes: the sequence of nodes visited and cost information.
[0083] If there is a conflict between the paths of the agents, that is, the paths of multiple agents intersect or the nodes overlap, the conflict area is marked and the conflict information is transmitted to the local path optimization module.
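The conflict marking step can be sketched as a scan over time-indexed paths. The source only says that paths "intersect or the nodes overlap", so the two conflict kinds checked here (two agents in the same cell at the same time, and two agents swapping cells), the rule that an agent waits at its goal, and the bounding-box region with a margin are assumptions.

```python
def find_conflicts(paths):
    """Return (time, kind, agent_a, agent_b, cell) tuples.

    paths maps an agent id to its list of cells, one cell per time step.
    An agent that has finished its path is assumed to wait at the last cell.
    """
    def at(path, t):
        return path[min(t, len(path) - 1)]

    agents = sorted(paths)
    horizon = max(len(p) for p in paths.values())
    conflicts = []
    for t in range(horizon):
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                pa, pb = paths[a], paths[b]
                if at(pa, t) == at(pb, t):
                    conflicts.append((t, "vertex", a, b, at(pa, t)))
                elif (t + 1 < horizon
                      and at(pa, t) == at(pb, t + 1)
                      and at(pa, t + 1) == at(pb, t)):
                    conflicts.append((t, "swap", a, b, at(pa, t)))
    return conflicts

def conflict_region(conflicts, margin=1):
    """Bounding box (row_min, col_min, row_max, col_max) around conflict cells."""
    if not conflicts:
        return None
    rows = [c[4][0] for c in conflicts]
    cols = [c[4][1] for c in conflicts]
    # The lower bounds are clamped at 0; upper bounds are not clamped here.
    return (max(0, min(rows) - margin), max(0, min(cols) - margin),
            max(rows) + margin, max(cols) + margin)
```

The returned region is what would be handed to the local path optimization module as the area to re-plan.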
[0084] Figure 4 is a simulation diagram of the MCTS execution of the present invention. For step S2, the specific method is as follows:
[0085] S2-1: Within the conflict areas marked in the initial path, the local path optimization module performs local optimization on the initial path using a Monte Carlo tree search strategy;
[0086] S2-2: Each node represents the current path state, and the edges represent possible path decisions (such as adjusting the direction or reselecting nodes). Set the search budget (such as the number of simulations or time limit) and path cost evaluation criteria (combining path length, conflict cost, etc.).
[0087] S2-3: Construct a search tree from the current path's state, where the root node represents the current path and child nodes represent possible path choices; expand the tree by selecting the node with the largest UCB1 value, i.e.:
[0088] $\mathrm{UCB1}_i = \frac{W_i}{N_i} + C\sqrt{\frac{\ln N_p}{N_i}}$
[0089] Where $W_i$ is the cumulative reward of node i, representing the total quality of the current path; $N_i$ is the number of times node i is visited, $N_p$ is the number of times the parent node is visited, and C is a constant;
[0090] S2-4: When a leaf node is not fully explored, select an unvisited action (such as adjusting direction or waiting) to expand it and generate a new node.
[0091] The new node record includes: status information such as the path, current benefits, and conflict costs;
[0092] S2-5: Starting from the expanded node, simulate the path planning and decision-making of the agent until the set termination condition is met (such as path completion or end of conflict area).
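The simulation step S2-5 can be sketched as a random playout inside the conflict area. The square region, the action set, the per-step cost, the conflict penalty, the goal bonus, and the `reserved` map of cells occupied by another agent at given time steps are all assumptions introduced for illustration.

```python
import random

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "wait": (0, 0)}

def rollout(start, goal, reserved, size, max_steps=20,
            conflict_penalty=5.0, goal_bonus=10.0, rng=None):
    """Random playout in a size x size region; returns (reward, path).

    reserved maps a time step to the cell another agent occupies then.
    The playout ends when the goal is reached or the step budget runs out.
    """
    rng = rng or random.Random()
    pos, path, reward = start, [start], 0.0
    for t in range(1, max_steps + 1):
        if pos == goal:
            break
        dr, dc = MOVES[rng.choice(list(MOVES))]
        nxt = (pos[0] + dr, pos[1] + dc)
        if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
            nxt = pos  # moving out of the region is treated as waiting
        pos = nxt
        path.append(pos)
        reward -= 1.0  # path length cost
        if reserved.get(t) == pos:
            reward -= conflict_penalty  # conflict cost
    if pos == goal:
        reward += goal_bonus
    return reward, path
```

The returned reward combines path length and conflict cost, matching the evaluation criteria named in S2-2, and is the value that would be propagated back up the tree.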
[0093] Figure 3 is a flowchart of the hybrid heuristic search process of the present invention. Further, following the preceding step S2, the present invention backtracks to optimize the path, specifically including the following steps:
[0094] S3-1: Reward determination: In Monte Carlo tree search, after a set number of simulations, the reward of each simulated path, that is, each path from the root node to a leaf node of the search tree, is calculated according to the preset reward evaluation criteria.
[0095] S3-2: Reward backtracking preparation: The reward backtracking operation starts from the leaf node reached at the end of a simulation; the leaf node corresponds to the final state of that simulation and stores the reward obtained along the simulated path; at the same time, the local path represented by each node along the way from the leaf node back to the root node is recorded, so that different paths can later be compared and the path with the highest reward determined.
[0096] S3-3: Reward backtracking process: Starting from the current leaf node, the cumulative reward obtained by the leaf node is passed to all of its ancestor nodes according to the set rules; during backtracking, in addition to the cumulative reward, the relevant statistics of each parent node are updated; the updated parent node then becomes the current node, and the process repeats up the tree structure until the root node is reached.
[0097] S3-4: Path selection optimization: After reward backtracking is complete, the paths represented by the different branches of the search tree are evaluated based on the cumulative reward of each node and the other relevant statistics; by comparing these evaluations, the path with the highest reward is found.
[0098] S3-5: Merging paths into a new global path: Determine the local range covered by the highest-reward optimized path in the search tree, replace the part of the original global path that needs optimization with the recorded highest-reward path, and obtain the updated global path.
[0099] Furthermore, with reference to Figure 1 and Figure 5, where Figure 5 shows the MAPPO neural network structure of this invention, the method for optimizing the reinforcement learning strategy of this invention includes the following steps:
[0100] S4-1: Initialize the reinforcement learning model of the agent, define the observation space, action space and joint reward function in the spatial encoder, and the reward function comprehensively considers path cost, collision penalty and overall efficiency index.
[0101] S4-2: In a path planning environment, agents are trained using the MAPPO method (Multi-Agent Proximal Policy Optimization Method), and the behavior of each agent is optimized using a joint policy.
[0102] During training, the experience of each agent is collected and stored in the experience replay pool, specifically as follows:
[0103] Multi-agent parallel interaction: Multiple agents simultaneously generate and execute actions in the path planning environment based on their respective current policy networks; each agent inputs the observed environmental state into the policy network, the policy network outputs the corresponding action probability distribution or specific action value, and determines and executes the action according to the corresponding rules;
[0104] During training, the state $s_t$, action $a_t$, reward $r_t$, and next state $s_{t+1}$ of each agent are collected, forming a sequence of samples $(s_t, a_t, r_t, s_{t+1})$; after a set amount of experience data has been collected, it is stored in the experience replay pool to provide material for subsequent policy learning and optimization.
[0105] Then, a multi-agent proximal policy optimization method is used to optimize the behavior of each agent through a joint policy, specifically as follows:
[0106] a. Value network evaluation and advantage estimation
[0107] The value network is used to estimate the value of each agent's state, predicting the expected cumulative reward that the agent can obtain by continuing to execute the current policy when in state s; the advantage function $A_t$ of each agent at each time step is calculated;
[0108] b. Construct the loss function:
[0109] The loss functions of all agents are combined by summation or weighted summation into a joint loss function. Based on this joint loss function, the policy network parameters of all agents are updated jointly, using the policy network and value network.
[0110] The policy network and value network are optimized, and the loss function is:
[0111] $L_t = \min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)$
[0112] Where $r_t(\theta)$ represents the policy probability ratio, $A_t$ is the advantage function, and $\epsilon$ is the clipping range parameter.
[0113] c. Joint strategy optimization:
[0114] To construct a joint loss function, the loss functions of all agents are integrated. A common approach is weighted summation. Assuming there are n agents with respective loss functions $L^{(1)}, \dots, L^{(n)}$ and the weight of agent i is set to $\omega_i$, the joint loss function is:
[0115] $L_{joint} = \sum_{i=1}^{n} \omega_i L^{(i)}$
[0116] d. Calculate the gradient of the joint loss function using stochastic gradient descent, and update the parameters of all agent policy networks based on the gradient.
[0117] S4-3: Through multiple rounds of training, the agent learns to cooperate with other agents in dynamic environments and optimizes the global path planning strategy.
[0118] S4-4: Design layered training tasks based on environmental complexity, gradually increasing scene complexity;
[0119] The agent is first trained in a simple environment and gradually transitions to a complex dynamic environment, which includes adding obstacles, simulating dynamic obstacles, and introducing cooperation with other agents;
[0120] During training, the reward function and environmental complexity at each stage are adjusted according to the progress of the task in order to gradually improve the learning efficiency and ability of the agent.
[0121] S4-5: Use a validation set to test the path planning performance of the agent in different scenarios, and verify the stability and effectiveness of the algorithm.
[0122] S4-6: At each stage, the agent gradually adapts to environmental changes and masters corresponding strategies through reinforcement learning training.
[0123] S4-7: Through step-by-step training under the curriculum, the learning efficiency of the agent and its generalization ability in complex scenarios are significantly improved.
[0124] S4-8: Finally, use a validation set to test the path planning performance of the agent in different scenarios to verify the stability and effectiveness of the algorithm.
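The validation in S4-5 and S4-8 could be summarized with a few aggregate metrics. The source does not name any, so the four used here (success rate, mean collisions, mean sum of path costs, mean makespan) and the episode record layout are assumptions.

```python
def evaluate(episodes):
    """Aggregate validation metrics over a list of episode records.

    Each record is a dict with:
      'reached'    - list of bool, one per agent
      'collisions' - number of collisions in the episode
      'path_costs' - list of path costs, one per agent
    """
    n = len(episodes)
    return {
        # An episode counts as a success only if every agent reached its target.
        "success_rate": sum(all(e["reached"]) for e in episodes) / n,
        "mean_collisions": sum(e["collisions"] for e in episodes) / n,
        "mean_sum_of_costs": sum(sum(e["path_costs"]) for e in episodes) / n,
        # Makespan: the cost of the slowest agent in each episode.
        "mean_makespan": sum(max(e["path_costs"]) for e in episodes) / n,
    }
```

Reporting these per scenario type (static, dynamic, cooperative) would show whether performance holds as scene complexity grows.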
[0125] In summary, this invention proposes a multi-agent path planning method that combines heuristic search and reinforcement learning strategies. The core technologies are as follows: the A* algorithm generates an initial path through fast search for the globally optimal path.
[0126] MCTS performs local path optimization, resolving conflicts between agents through simulation and backtracking to improve path quality. MAPPO is used to enhance agent cooperation, achieving dynamic adjustment of the global path through joint policy optimization. Curriculum learning strategies are used to train agents to gradually adapt to dynamic environments ranging from simple to complex, improving learning efficiency and generalization ability. Points for which protection is sought: A multi-agent path planning framework combining heuristic search and reinforcement learning, including the integrated application of the A* algorithm, MCTS for path conflict optimization, MAPPO for agent cooperation optimization, and curriculum learning strategies. Real-time path planning and update capabilities in dynamic environments: Supports efficient and robust path planning for multiple agents under changing obstacles and behavioral adjustments.
[0127] Those skilled in the art will understand that the above description is merely a preferred embodiment of the present invention, and the features described in the various embodiments and / or claims of this disclosure can be combined in various ways, even if such combinations are not explicitly described in this disclosure. This is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
[0128] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various alterations and modifications to the invention without departing from its spirit and scope. Thus, if these modifications and variations of the invention fall within the scope of the claims and their equivalents, the invention is also intended to include them.
Claims
1. A multi-agent path planning method based on hybrid heuristic search and reinforcement learning, characterized in that it includes the following steps: S1: Generate a preliminary path for each agent using the A* algorithm, ensuring the shortest path from the starting point to the target; S2: Locally optimize the initial path using a Monte Carlo tree search strategy, simulating the interactions between agents; S3: Optimize the path by backtracking to resolve local conflicts caused by mutual interference between agents, and run multiple simulations in a loop to obtain the updated global path; S4: Optimize the behavior policy of each agent through reinforcement learning so that it cooperates with other agents and avoids collisions, and combine the result with the updated global path to obtain a global path planning strategy, completing the path planning for the multiple agents.
2. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 1, characterized in that step S1 specifically includes: S1-1: Initialize the path planning task by inputting and setting the starting position and target position of each agent; S1-2: Use the A* algorithm to search for the shortest path from the starting point to the target point on the grid map, and calculate the overall priority of each node, i.e.: f(x) = g(x) + h(x), where g(x) is the actual cost from the starting point to the current node x, i.e. the length of the path already traveled, and h(x) is the heuristically estimated cost from the current node x to the target node; S1-3: During the path search, dynamically update the OPEN list and the CLOSED list to ensure that the search can find the optimal solution; S1-4: Based on the search results, save the complete initial path of each agent, including the sequence of nodes visited and the cost information; if the paths of the agents conflict, that is, the paths of multiple agents intersect or their nodes overlap, mark the conflict area and pass the conflict information to the local path optimization module.
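Steps S1-2 to S1-4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the 4-connected unit-cost grid, the Manhattan heuristic for h(x), the function names `astar` and `vertex_conflicts`, and the time-indexed reading of "overlapping nodes" (two agents in the same cell at the same step) are all assumptions made for illustration.

```python
import heapq


def astar(grid, start, goal):
    """Return (path, cost) for a 4-connected grid (0 = free, 1 = obstacle), or None."""
    def h(cell):
        # Assumed heuristic: Manhattan distance, admissible for unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # OPEN list, ordered by f = g + h
    g_cost = {start: 0}
    parent = {start: None}
    closed = set()                       # CLOSED list
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell in closed:
            continue                     # stale heap entry, already expanded
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], g
        closed.add(cell)
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == 1 or nxt in closed:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None


def vertex_conflicts(paths):
    """List (time, cell, agents) where two or more agents share a cell at the same step."""
    horizon = max(len(p) for p in paths.values())
    conflicts = []
    for t in range(horizon):
        occupied = {}
        for agent, path in paths.items():
            cell = path[min(t, len(path) - 1)]   # assumed: agents wait at their goal
            occupied.setdefault(cell, []).append(agent)
        for cell, agents in occupied.items():
            if len(agents) > 1:
                conflicts.append((t, cell, sorted(agents)))
    return conflicts
```

The conflict list returned by `vertex_conflicts` plays the role of the marked conflict areas that are handed to the local path optimization module.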
3. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 1, characterized in that step S2 specifically includes: S2-1: Within the conflict areas marked in the initial path, the local path optimization module locally optimizes the initial path using a Monte Carlo tree search strategy; S2-2: Each node represents a current path state, and each edge represents a path decision such as adjusting the direction or reselecting nodes; the search budget and the path cost evaluation criteria are set; S2-3: Construct a search tree from the state of the current path, where the root node represents the current path and the child nodes represent possible path choices; expand the tree by selecting the node with the largest UCB1 value, i.e.: $\mathrm{UCB1}_i = \frac{W_i}{N_i} + C\sqrt{\frac{\ln N_p}{N_i}}$, where $W_i$ is the cumulative reward of node i, representing the total quality of the current path, $N_i$ is the number of times node i has been visited, $N_p$ is the number of times the parent node has been visited, and C is a constant; S2-4: When a leaf node is not fully explored, select an unvisited action (such as adjusting direction or waiting) to expand it and generate a new node; the new node records state information including the path, the current reward, and the conflict cost; S2-5: Starting from the expanded node, simulate the path planning and decision-making of the agent until the set termination condition is met, at which point the simulation ends.
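The selection rule of step S2-3 can be sketched in Python using the standard UCB1 form, mean reward W_i / N_i plus the exploration bonus C * sqrt(ln N_p / N_i). The function names, the default C = sqrt(2), and the convention that unvisited children score infinity are assumptions for illustration, not values stated in the claims.

```python
import math


def ucb1(w_i, n_i, n_p, c=math.sqrt(2)):
    """UCB1 score of a child: mean reward plus an exploration bonus."""
    if n_i == 0:
        return float("inf")   # assumed convention: try unvisited children first
    return w_i / n_i + c * math.sqrt(math.log(n_p) / n_i)


def select_child(children, n_p, c=math.sqrt(2)):
    """children: list of (action, W_i, N_i); return the action with the largest UCB1."""
    return max(children, key=lambda ch: ucb1(ch[1], ch[2], n_p, c))[0]
```

With C = 0 the rule reduces to picking the child with the best mean reward; a larger C shifts selection toward rarely visited path decisions.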
4. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 1, characterized in that, in step S3, optimizing the path by backtracking includes the following steps: S3-1: Reward determination: in the Monte Carlo tree search, after a set number of simulations, the reward of each simulated path, that is, each path from the root node to a leaf node of the search tree, is calculated according to the preset reward evaluation criteria; S3-2: Reward backtracking preparation: the reward backtracking operation starts from the leaf node reached at the end of the simulation; the leaf node corresponds to the final state of each simulation and stores the reward obtained along the simulated path; at the same time, the local path represented by each node on the way back from the leaf node to the root node is recorded, to support the later comparison of different paths and the identification of the path with the highest reward; S3-3: Reward backtracking process: starting from the current leaf node, the cumulative reward obtained by the leaf node is passed to its parent nodes according to set rules; during backtracking, the statistics of each parent node are updated together with its cumulative reward; the updated parent node becomes the new current node, and the above steps are repeated, propagating the reward to higher-level parent nodes along the tree until the root node is reached; S3-4: Path selection optimization: after the reward backtracking is complete, the paths represented by the different branches of the search tree are evaluated using the cumulative reward and other statistics of each node; by comparing the evaluation results of the different paths, the path with the highest reward is identified; S3-5: Merging into a new global path: determine the local range covered by the highest-reward optimized path in the search tree, and replace the parts of the original global path that need optimization with the recorded highest-reward path to obtain the updated global path.
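A minimal Python sketch of steps S3-3 to S3-5 follows. The claims leave the propagation rule, the statistics, and the value assessment unspecified, so this sketch assumes the simplest choices: the simulation reward is added unchanged to every ancestor, the only statistics are a visit count and a reward sum, and branches are ranked by mean reward. The class `Node` and the functions `backpropagate`, `best_actions`, and `splice` are hypothetical names.

```python
class Node:
    def __init__(self, action=None, parent=None):
        self.action = action        # path decision leading to this node
        self.parent = parent
        self.children = []
        self.visits = 0             # visit count N_i
        self.total_reward = 0.0     # cumulative reward W_i
        if parent is not None:
            parent.children.append(self)


def backpropagate(leaf, reward):
    """Add one simulation reward to the leaf and to every ancestor up to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def best_actions(root):
    """Follow the visited child with the highest mean reward from the root downward."""
    actions, node = [], root
    while node.children:
        visited = [ch for ch in node.children if ch.visits > 0]
        if not visited:
            break
        node = max(visited, key=lambda ch: ch.total_reward / ch.visits)
        actions.append(node.action)
    return actions


def splice(global_path, start, end, local_path):
    """Replace global_path[start:end] with the optimized local segment."""
    return global_path[:start] + local_path + global_path[end:]
```

For example, `splice(path, i, j, segment)` replaces the conflicting stretch `path[i:j]` of the original global path with the highest-reward local segment and leaves the rest untouched.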
5. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 1, characterized in that step S4 includes the following steps: S4-1: Initialize the reinforcement learning model of the agents, and define the observation space, action space and joint reward function in the spatial encoder, where the reward function jointly accounts for path cost, collision penalty and an overall efficiency index; S4-2: In the path planning environment, train the agents using the MAPPO method, optimizing the behavior of each agent with a joint policy; S4-3: Through multiple rounds of training, the agents learn to cooperate with other agents in dynamic environments and optimize the global path planning strategy; S4-4: Design layered training tasks based on environmental complexity, gradually increasing scene complexity; S4-5: Use a validation set to test the path planning performance of the agents in different scenarios, verifying the stability and effectiveness of the algorithm.
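One possible shape for the joint reward function of step S4-1 is sketched below in Python. The claims name the three ingredients (path cost, collision penalty, overall efficiency index) but not how they are combined; the linear combination, the use of makespan as the efficiency index, the weight values, and the name `joint_reward` are all assumptions.

```python
def joint_reward(path_costs, collisions, makespan,
                 w_cost=1.0, w_collision=10.0, w_time=0.1):
    """Shared team reward; higher is better, so every term enters as a penalty.

    path_costs: per-agent path lengths for the episode
    collisions: number of collisions observed in the episode
    makespan:   steps until the last agent reaches its goal (assumed efficiency index)
    """
    total_cost = sum(path_costs)
    return -(w_cost * total_cost + w_collision * collisions + w_time * makespan)
```

A collision weight well above the per-step cost makes a detour preferable to a collision, which is the behavior the cooperative training is meant to encourage.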
6. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 5, characterized in that the training process in step S4-2 further includes collecting the experience of each agent and storing it in the experience replay pool, specifically: Multi-agent parallel interaction: multiple agents simultaneously generate and execute actions in the path planning environment based on their respective current policy networks; each agent inputs the observed environmental state into its policy network, the policy network outputs the corresponding action probability distribution or a specific action value, and the action is determined and executed according to the corresponding rules; during training, the state $s_t$, action $a_t$, reward $r_t$ and next state $s_{t+1}$ of each agent are collected, forming a sequence of samples $(s_t, a_t, r_t, s_{t+1})$; after a set amount of experience data has been collected, the data is stored in the experience replay pool to provide material for subsequent policy learning and optimization.
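The experience pool described here can be sketched as a small Python container holding per-agent (s_t, a_t, r_t, s_{t+1}) samples. The fixed capacity, the uniform sampling, and the names `Transition` and `ExperienceBuffer` are assumptions; the claim does not specify how the pool is sized or sampled.

```python
import random
from collections import deque, namedtuple

# One sample (s_t, a_t, r_t, s_{t+1}), tagged with the agent that produced it.
Transition = namedtuple("Transition", "agent state action reward next_state")


class ExperienceBuffer:
    def __init__(self, capacity):
        # Bounded deque: once full, the oldest samples are dropped first.
        self.data = deque(maxlen=capacity)

    def add(self, agent, state, action, reward, next_state):
        self.data.append(Transition(agent, state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw up to batch_size samples uniformly without replacement."""
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)
```

PPO-style methods are usually trained on recently collected data, so in practice such a pool is often kept small and cleared after each policy update; whether the invention does so is not stated.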
7. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 5, characterized in that, in step S4-2, the multi-agent proximal policy optimization (MAPPO) method optimizes the behavior of each agent using a joint policy through the following steps: a. Value network evaluation and advantage estimation: the value network estimates the value of each agent's state, predicting the expected cumulative reward that the agent can obtain by continuing to execute the current policy from state s; the advantage function $A_t$ of each agent is calculated at each time step; b. Loss function construction: the loss functions of all agents are combined by summation or weighted summation into a joint loss function, from which the parameters of all agents are updated together and the policy network and value network are optimized; the per-agent loss function is $L_t = \min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)$, where $r_t(\theta)$ is the policy probability ratio, $A_t$ is the advantage function, and $\epsilon$ is the clipping range parameter; c. Joint policy optimization: the loss functions of all agents are integrated into the joint loss function, commonly by weighted summation; assuming there are n agents with respective loss functions $L_i$, i = 1, 2, …, n, and weights $\omega_i$, the joint loss function is $L_{\mathrm{joint}} = \sum_{i=1}^{n} \omega_i L_i$; d. Compute the gradient of the joint loss function and update the parameters of all agents' policy networks by stochastic gradient descent.
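Steps b and c can be sketched numerically in plain Python. The clipped term min(r·A, clip(r, 1-ε, 1+ε)·A) follows the formula in this claim; negating its mean so that minimizing the loss raises the surrogate is an assumption about sign convention, as are ε = 0.2, the per-agent mean, and the function names. The gradient step of item d is omitted because it depends on the network library.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def agent_loss(ratios, advantages, eps=0.2):
    """Negative mean surrogate (assumed sign: minimizing this maximizes the surrogate)."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return -sum(terms) / len(terms)


def joint_loss(per_agent_losses, weights):
    """Weighted sum of the per-agent losses."""
    return sum(w * loss for w, loss in zip(weights, per_agent_losses))
```

The clipping removes the incentive to push the policy ratio beyond 1 ± ε in a single update, which limits how far each agent's policy can move while the others are also changing.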
8. The multi-agent path planning method based on hybrid heuristic search and reinforcement learning according to claim 5, characterized in that, in step S4-4, designing layered training tasks based on environmental complexity and gradually increasing scene complexity specifically involves: the agent is first trained in a simple environment and gradually transitions to a complex dynamic environment, which includes adding obstacles, simulating dynamic obstacles, and introducing cooperation with other agents; during training, the reward function and the environmental complexity of each stage are adjusted according to task progress, so as to gradually improve the learning efficiency and capability of the agent.
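A staged curriculum of this kind can be sketched in Python as a list of stage configurations plus a promotion rule. The stage contents, the success-rate threshold of 0.9, and the names `Stage`, `CURRICULUM`, and `next_stage` are hypothetical; the claim only states that complexity and the reward function are adjusted as training progresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    static_obstacles: int
    dynamic_obstacles: int
    num_agents: int
    collision_penalty: float   # reward-function setting adjusted per stage


# Hypothetical stages, ordered from simple to complex.
CURRICULUM = [
    Stage("open grid", 0, 0, 1, 1.0),
    Stage("static obstacles", 10, 0, 2, 2.0),
    Stage("dynamic obstacles", 10, 4, 4, 5.0),
]


def next_stage(index, success_rate, threshold=0.9):
    """Advance one stage once the recent success rate reaches the threshold."""
    if success_rate >= threshold and index < len(CURRICULUM) - 1:
        return index + 1
    return index
```

Training would call `next_stage` after each evaluation round and rebuild the environment from `CURRICULUM[index]` whenever the index changes.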