Multi-UGV path planning method and device based on multi-agent reinforcement learning

By using a multi-agent reinforcement learning approach, the UGV cluster is divided into groups and a two-layer search task model is designed. This solves the problem of low path planning efficiency in traditional multi-UGV systems in dynamic environments, and achieves efficient path planning and computational resource optimization.

CN116029473B · Active · Publication Date: 2026-06-26 · HUBEI UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUBEI UNIV OF TECH
Filing Date
2022-12-30
Publication Date
2026-06-26

Abstract

The application provides a multi-UGV path planning method and device based on multi-agent reinforcement learning. The method comprises steps 1 to 8. The application operates on the basis of existing resources, improves cooperation and coordination among the agents in the multi-agent system, and uses a distributed search structure that automatically decomposes a complex learning problem into local sub-problems that are easier to learn. It raises the level of intelligence, expands the field of application, learns a decentralized strategy in a centralized setting, greatly reduces the amount of computation, and is highly practical.

Description

Technical Field

[0001] This invention relates to the field of multi-agent system optimization technology, and in particular to a multi-UGV path planning method and device based on multi-agent reinforcement learning.

Background Technology

[0002] A UGV (Unmanned Ground Vehicle) is an intelligent mobile device with a certain degree of self-learning and adaptability. It has broad application prospects in multi-task fields such as industrial automation, disaster relief, intelligent transportation, and military operations. Traditional multi-UGV systems mostly use centralized search algorithms, which can solve most path planning problems in static environments. However, their central planner needs complete map information and the positions of all agents to plan the optimal path, which consumes significant computational resources. Furthermore, in real-world dynamic environments, map information and the specific positions of agents change in real time, so traditional centralized search algorithms suffer from poor scalability, limited search space, and low learning efficiency in agent path planning. Besides the core objective of completing tasks through self-navigation and path planning, the application of multi-UGV systems faces the challenge of considering the heterogeneous characteristics of each UGV, scheduling tasks accordingly, and finding the collision-free optimal path for each UGV. Therefore, developing a multi-UGV path planning method and device based on multi-agent reinforcement learning that effectively overcomes these shortcomings has become a pressing technical problem for the industry.

Summary of the Invention

[0003] To address the aforementioned problems in the existing technology, embodiments of the present invention provide a multi-UGV path planning method and device based on multi-agent reinforcement learning.

[0004] Firstly, embodiments of the present invention provide a multi-UGV path planning method based on multi-agent reinforcement learning, comprising: Step 1: dividing the UGV cluster into groups according to the size of the search area, with the UGVs in each group having the same performance, and the groups, which cannot communicate directly with one another for obstacle avoidance, communicating through relay UGVs; Step 2: rasterizing the rectangular search area into several search regions and establishing a two-layer search task model, in which the upper-layer model issues region search instructions that direct each group of UGVs into a different region to carry out its search task, and the lower-layer model is an adaptive distributed search within a region, in which the individual UGVs of a group search so as to complete a traversal of that region; Step 3: designing the model state space, with 3 as the label of the region where the current group is located, 2 as the label of regions where other groups are located, 1 as the label of covered regions, and 0 as the label of uncovered regions; Step 4: designing the model action space, where, for each agent, the current position is taken as 1, the range reachable in one step is represented by a 3x3 grid, and the complete action space is the set $A_i = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$; Step 5: designing the reward functions of the upper-layer and lower-layer task models; Step 6: designing the model network structure; Step 7: completing network training; Step 8: completing the testing of the upper-layer and lower-layer task models.

[0005] Based on the above method embodiments, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 5 specifically includes: For the upper-level task, no two groups may be adjacent, and covered areas may not be searched again. The group position at the end of the previous state determines the target at the next moment, so as to minimize energy consumption. The reward function is:

[0006]

[0007] Where reward is the reward value, s′(i) is the state at the previous time step, s(i) is the state at different time steps, n is the number of agent groups, out indicates that the agent is outside the region, avg is the mean of the dispersion, and KL is the KL divergence.

[0008] The reward value is at most 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; and the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence and the larger the positive reward, so that the groups as a whole are encouraged to spread out;

[0009] For the lower-level task, the search should be completed as quickly as possible with movement in straight lines as far as possible, avoiding collisions between agents and repeated coverage of already scanned areas. The reward design is the same as for the upper-level task, except that the last two terms are replaced by a reward given when adjacent actions have the same value. The reward design includes:

[0010]

[0011] Where a(i) represents the action taken at different times, and a′(i) represents the action taken at the previous time.

[0012] Based on the above method embodiments, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 6 specifically includes: In the upper-layer task model, after each agent performs an action, a reward value $Q_{jt}$ must be obtained to judge whether the current action achieved the expected value. Computing $Q_{jt}$ requires the current state matrix and the relative position vectors of all agents, so a neural network with convolutional and linear layers in parallel is used. The convolutional layers process the current state, and their output is flattened into a vector. The linear layers process the vectorized relative state information. These vectors are then merged and passed through a linear neural network to output the Q-value of each action. Finally, a Mix network is connected to obtain a reward value $Q_{jt}$ that satisfies monotonicity. For the lower-level task model, a convolutional network can be used, and the rest is the same as for the upper-level task.

[0013] Based on the above method embodiments, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 7 specifically includes: During network training, each agent takes action a in the current state s and obtains the real-time reward value r and the state s' after the environment transition. The target reward value $Q_{target}$ is calculated as:

[0014] $Q_{target} = r + \gamma \max_{a'} Q(s', a')$

[0015] Where γ is the reward discount factor and max selects the maximum among the Q-values of all actions a' in state s'. The loss function L is designed as: $L = (Q_{jt} - Q_{target})^2$.

[0016] Based on the above method embodiments, the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention includes step 8 as follows: after the network training is completed, the state S of the agent is input, the output is the value vector of each action, a greedy strategy is adopted to select the action with the highest value, and the planning of the upper and lower layer task models is completed.

[0017] Secondly, embodiments of the present invention provide a multi-UGV path planning device based on multi-agent reinforcement learning, comprising: a first main module for implementing step 1: dividing the UGV cluster into groups according to the size of the search area, with the UGVs in each group having the same performance, and the groups, which cannot communicate directly with one another for obstacle avoidance, communicating through relay UGVs; and step 2: rasterizing the rectangular search area into several search patches and establishing a two-layer search task model, in which the upper-layer model issues patch search instructions that direct each group of UGVs into a different patch to carry out its search task, and the lower-layer model is an adaptive distributed search within a patch, in which the individual UGVs of a group search so as to complete a traversal of that patch; a second main module for implementing step 3: designing the model state space, with 3 as the label of the patch where the current group is located, 2 as the label of patches where other groups are located, 1 as the label of covered patches, and 0 as the label of uncovered areas; and step 4: designing the model action space, where, for each agent, the current position is taken as 1, the range reachable in one step is represented by a 3x3 grid, and the complete action space is the set $A_i = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$; a third main module for implementing step 5: designing the reward functions of the upper-layer and lower-layer task models; and step 6: designing the model network structure; and a fourth main module for implementing step 7: completing network training; and step 8: completing the testing of the upper-layer and lower-layer task models.

[0018] Thirdly, embodiments of the present invention provide an electronic device, comprising:

[0019] At least one processor; and

[0020] At least one memory communicatively connected to the processor, wherein:

[0021] The memory stores program instructions that can be executed by the processor. The processor can call the program instructions to execute the multi-UGV path planning method based on multi-agent reinforcement learning provided by any of the various implementations of the first aspect.

[0022] Fourthly, embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute a multi-UGV path planning method based on multi-agent reinforcement learning provided by any of the various implementations of the first aspect.

[0023] The multi-UGV path planning method and device based on multi-agent reinforcement learning provided in this invention operate on the basis of existing resources, improve the cooperation and coordination among agents in a multi-agent system, use a distributed search structure to automatically decompose complex learning problems into easier-to-learn local sub-problems, raise the level of intelligence, expand the field of application, and learn a decentralized strategy end to end in a centralized setting, which greatly reduces the amount of computation and is highly practical.

Attached Figure Description

[0024] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0025] Figure 1 is a flowchart of a multi-UGV path planning method based on multi-agent reinforcement learning provided in an embodiment of the present invention;

[0026] Figure 2 is a schematic diagram of the structure of a multi-UGV path planning device based on multi-agent reinforcement learning provided in an embodiment of the present invention;

[0027] Figure 3 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present invention;

[0028] Figure 4 is a schematic diagram of the structure of a multi-UGV path planning model based on multi-agent reinforcement learning provided in an embodiment of the present invention;

[0029] Figure 5 is a schematic diagram of the discretized action region of the intelligent agent provided in an embodiment of the present invention;

[0030] Figure 6 is a schematic diagram of the QMIX network architecture used in the upper and lower layers provided in an embodiment of the present invention;

[0031] Figure 7 is a schematic diagram of the training results of the upper-layer task model provided in an embodiment of the present invention.

Detailed Implementation

[0032] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. In addition, the technical features of the various embodiments or individual embodiments provided by the present invention can be arbitrarily combined with each other to form feasible technical solutions. Such combinations are not constrained by the order of steps and / or structural composition patterns, but must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.

[0033] This invention provides a multi-UGV path planning method based on multi-agent reinforcement learning. Referring to Figure 1, the method includes: Step 1: Divide the UGV cluster into groups based on the size of the search area. The UGVs in each group have the same performance. The groups cannot communicate directly with one another for obstacle avoidance and instead communicate through relay UGVs. Step 2: Rasterize the rectangular search area into several search patches and establish a two-layer search task model. The upper-layer model issues patch search instructions, directing each group of UGVs into a different patch to carry out its search task. The lower-layer model is an adaptive distributed search within a patch, where the individual UGVs of a group search so as to complete the traversal of that patch. Step 3: Design the model state space, with 3 as the label for the patch where the current group is located, 2 as the label for patches where other groups are located, 1 as the label for covered patches, and 0 as the label for uncovered areas. Step 4: Design the model action space. For each agent, the current position is taken as 1, and the range reachable in one step is represented by a 3x3 grid. The complete action space is the set $A_i = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$. Step 5: Design the reward functions of the upper-level and lower-level task models. Step 6: Design the model network structure. Step 7: Complete network training. Step 8: Complete the testing of the upper-level and lower-level task models.

[0034] Based on the above method embodiments, as an optional embodiment, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 5 specifically includes: For the upper-level task, no two groups may be adjacent, and covered areas may not be searched again. The group position at the end of the previous state determines the target at the next moment, so as to minimize energy consumption. The reward function is:

[0035]

[0036] Where reward is the reward value, s′(i) is the state at the previous time step, s(i) is the state at different time steps, n is the number of agent groups, out indicates that the agent is outside the region, avg is the mean of the dispersion, and KL is the KL divergence.

[0037] The reward value is at most 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; and the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence and the larger the positive reward, so that the groups as a whole are encouraged to spread out;

[0038] For the lower-level task, the search should be completed as quickly as possible with movement in straight lines as far as possible, avoiding collisions between agents and repeated coverage of already scanned areas. The reward design is the same as for the upper-level task, except that the last two terms are replaced by a reward given when adjacent actions have the same value. The reward design includes:

[0039]

[0040] Where a(i) represents the action taken at different times, and a′(i) represents the action taken at the previous time.

[0041] Based on the above method embodiments, as an optional embodiment, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 6 specifically includes: In the upper-layer task model, after each agent performs an action, a reward value $Q_{jt}$ must be obtained to judge whether the current action achieved the expected value. Computing $Q_{jt}$ requires the current state matrix and the relative position vectors of all agents, so a neural network with convolutional and linear layers in parallel is used. The convolutional layers process the current state, and their output is flattened into a vector. The linear layers process the vectorized relative state information. These vectors are then merged and passed through a linear neural network to output the Q-value of each action. Finally, a Mix network is connected to obtain a reward value $Q_{jt}$ that satisfies monotonicity. For the lower-level task model, a convolutional network can be used, and the rest is the same as for the upper-level task.

[0042] Based on the above method embodiments, as an optional embodiment, in the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention, step 7 specifically includes: During network training, each agent takes action a in the current state s and obtains the real-time reward value r and the state s' after the environment transition. The target reward value $Q_{target}$ is calculated as:

[0043] $Q_{target} = r + \gamma \max_{a'} Q(s', a')$ (3)

[0044] Where γ is the reward discount factor and max selects the maximum among the Q-values of all actions a' in state s'. The loss function L is designed as: $L = (Q_{jt} - Q_{target})^2$ (4).
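Equations (3) and (4) can be illustrated with a short numeric sketch. The function names, the discount value, and the `done` flag (which skips bootstrapping at a terminal state) are assumptions made for illustration and are not specified in the text.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Equation (3): Q_target = r + gamma * max over a' of Q(s', a').

    `done` is an assumed convenience flag: at a terminal state there is
    no next state to bootstrap from, so the target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_jt, q_target):
    """Equation (4): L = (Q_jt - Q_target)^2."""
    return (q_jt - q_target) ** 2


# Hypothetical numbers: the best next-state value is 0.4.
target = td_target(reward=0.5, next_q_values=[0.1, 0.4, 0.2], gamma=0.9)
loss = squared_td_loss(q_jt=0.7, q_target=target)
print(target, loss)
```

In a full training loop the loss would be averaged over a batch and minimized by gradient descent; that machinery is omitted here.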

[0045] Based on the above method embodiments, as an optional embodiment, the multi-UGV path planning method based on multi-agent reinforcement learning provided in this embodiment of the invention includes step 8 as follows: after the network training is completed, the state S of the agent is input, the output is the value vector of each action, a greedy strategy is adopted to select the action with the highest value, and the planning of the upper and lower layer task models is completed.
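The greedy selection in step 8 amounts to an argmax over the nine action values. The sketch below assumes the value vector is ordered so that index 0 corresponds to action 1; the function name and the tie-breaking rule (lowest label wins) are illustrative choices.

```python
def greedy_action(q_values):
    """Return the 1-based label of the highest-valued action in A_i = {1..9}.

    Ties are broken in favour of the lowest action label (an assumption).
    """
    if len(q_values) != 9:
        raise ValueError("expected one value per action in A_i = {1, ..., 9}")
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return best_index + 1


# Hypothetical value vector produced by a trained network.
print(greedy_action([0.1, 0.3, 0.9, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]))
```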

[0046] The multi-UGV path planning method based on multi-agent reinforcement learning provided in this invention operates on the basis of existing resources, improves the cooperation and coordination among agents in a multi-agent system, uses a distributed search structure to automatically decompose complex learning problems into easier-to-learn local sub-problems, improves the level of intelligence, expands the application field, and performs end-to-end learning distributed strategy in a centralized setting, which greatly reduces the amount of computation and has high practicality.

[0047] In another embodiment, a two-layer search task model is established. The rectangular search area is rasterized and divided into several zones. The upper-layer model is a command center that centrally issues zone search commands, and each group of UGVs enters the target zone to search, which can be regarded as a task allocation algorithm. The lower-layer model is an intra-group adaptive distributed search, where individual UGVs within the group conduct searches independently, which is a conventional distributed search algorithm.
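As a minimal sketch of the rasterization step, the rectangular search area can be cut into square zones indexed by (row, column). The function names, the square zone shape, and the clipping of edge zones are assumptions; the text does not fix the zone size or geometry.

```python
def rasterize(width, height, zone_size):
    """Split a width x height rectangle into square zones.

    Returns a dict mapping (row, col) to (x0, y0, x1, y1) bounds.
    Zones on the far edges are clipped if the size does not divide evenly.
    """
    if zone_size <= 0:
        raise ValueError("zone_size must be positive")
    rows = -(-height // zone_size)  # ceiling division
    cols = -(-width // zone_size)
    zones = {}
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * zone_size, r * zone_size
            zones[(r, c)] = (x0, y0,
                             min(x0 + zone_size, width),
                             min(y0 + zone_size, height))
    return zones


def zone_of(x, y, zone_size):
    """Return the (row, col) index of the zone containing point (x, y)."""
    return (int(y // zone_size), int(x // zone_size))


zones = rasterize(width=100, height=50, zone_size=25)
print(len(zones), zones[(1, 3)], zone_of(30, 10, 25))
```

The upper-layer model would then assign groups to zone indices, while the lower-layer model plans inside the bounds of a single zone.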

[0048] The state space is designed as follows: 3 represents the region where the current group is located, 2 represents the regions where other groups are located, 1 represents covered regions, and 0 represents uncovered regions. For example, for a 3x3 region, if two groups of UGVs are performing a search task, one possible state is:

[0049]
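The example matrix is not reproduced in this text. The sketch below builds one hypothetical 3x3 state with two groups using the labels defined above; the group names, positions, and covered cells are invented for illustration.

```python
UNCOVERED, COVERED, OTHER_GROUP, CURRENT_GROUP = 0, 1, 2, 3


def build_state(rows, cols, covered, group_positions, current_group):
    """Encode the grid from the point of view of `current_group`.

    covered: set of (row, col) cells already searched.
    group_positions: dict mapping a group id to its (row, col) cell.
    """
    state = [[UNCOVERED] * cols for _ in range(rows)]
    for r, c in covered:
        state[r][c] = COVERED
    for group_id, (r, c) in group_positions.items():
        state[r][c] = CURRENT_GROUP if group_id == current_group else OTHER_GROUP
    return state


# Hypothetical layout: group "A" has covered two cells and sits at (0, 2);
# group "B" sits at (2, 0).
state = build_state(3, 3,
                    covered={(0, 0), (0, 1)},
                    group_positions={"A": (0, 2), "B": (2, 0)},
                    current_group="A")
for row in state:
    print(row)
```

Each group sees the same grid with labels 2 and 3 swapped according to which group is treated as the current one.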

[0050] Referring to Figure 4, for each agent's action space, with the current position as 1, the range that can be reached in one step can be represented by a 3x3 grid. The complete action space can be represented as the set $A_i = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$.
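One way to realize this action set in code is a lookup table from action label to grid offset. The text states only that the current position is labelled 1; the ordering of labels 2 to 9 around the centre (clockwise from "up") is an assumption made here, as are the names `ACTION_OFFSETS` and `apply_action`.

```python
# Action label -> (row offset, column offset).
# 1 = stay in place (the current position); 2..9 = the eight neighbours,
# numbered clockwise starting from "up". This numbering is hypothetical.
ACTION_OFFSETS = {
    1: (0, 0),
    2: (-1, 0), 3: (-1, 1), 4: (0, 1), 5: (1, 1),
    6: (1, 0), 7: (1, -1), 8: (0, -1), 9: (-1, -1),
}


def apply_action(position, action, rows, cols):
    """Return (new_position, inside), where `inside` is False if the move
    leaves the rows x cols grid."""
    dr, dc = ACTION_OFFSETS[action]
    r, c = position[0] + dr, position[1] + dc
    inside = 0 <= r < rows and 0 <= c < cols
    return (r, c), inside


print(apply_action((1, 1), 4, rows=3, cols=3))
print(apply_action((0, 0), 2, rows=3, cols=3))
```

The `inside` flag gives the reward function a direct way to detect the "leaves the area" case.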

[0051] For the upper-level task, it is required that no two groups are adjacent and that the covered areas cannot be searched again. At the same time, the group position at the end of the previous state needs to be taken into account to determine the target at the next moment, so as to achieve the minimum energy consumption. Therefore, the following reward function is designed as shown in equation (1).

[0052] The reward value is initially 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; at the same time, a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; then, the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence and the larger the positive reward, that is, the groups as a whole are encouraged to spread out.
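Equation (1) itself is not reproduced in this text, so the following is only a sketch of the reward structure described in the paragraph above, not the patented formula. All weights, the use of Manhattan distance for the travel term, the treatment of a collision as two groups occupying the same zone, and the caller-supplied `dispersion_bonus` standing in for the KL-divergence term are assumptions.

```python
def upper_reward(prev_positions, positions, covered, rows, cols,
                 dispersion_bonus=0.0,
                 w_collision=0.5, w_revisit=0.3, w_out=0.5, w_travel=0.05):
    """Start from 1 and subtract penalties; add a dispersion bonus.

    prev_positions / positions: dict mapping group id -> (row, col).
    covered: set of zones already searched.
    dispersion_bonus: stands in for the KL-divergence term, whose exact
    form is not given in the text.
    """
    reward = 1.0
    cells = list(positions.values())
    if len(set(cells)) < len(cells):  # two groups chose the same zone
        reward -= w_collision
    for group_id, (r, c) in positions.items():
        if not (0 <= r < rows and 0 <= c < cols):
            reward -= w_out  # group left the search area
        elif (r, c) in covered:
            reward -= w_revisit  # group re-entered a covered zone
        pr, pc = prev_positions[group_id]
        reward -= w_travel * (abs(r - pr) + abs(c - pc))  # travel cost
    return reward + dispersion_bonus


prev = {"A": (0, 0), "B": (2, 2)}
now = {"A": (0, 1), "B": (2, 1)}
print(upper_reward(prev, now, covered={(0, 0), (2, 2)}, rows=3, cols=3))
```

Keeping the dispersion term as an argument leaves room to plug in whichever KL-based measure the full equation defines.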

[0053] Referring to Figure 5, after each agent performs an action, a reward value $Q_{jt}$ must be obtained to judge whether the current action achieved the expected value. This reward value needs to consider the current state (a matrix) and the relative positions (vectors) of all agents, so a neural network with convolutional and linear branches in parallel is required. The convolutional layer processes the current state, and its output is flattened into a vector. The linear layer processes the vectorized relative state information, and its output is merged with that vector. The merged vector then passes through a linear neural network to output the Q-value for each action. This is then connected to a Mix network to obtain a $Q_{jt}$ that satisfies monotonicity.
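The monotonicity property of the Mix network can be illustrated with a deliberately simplified, single-layer mixer: if every mixing weight is made non-negative, raising any agent's Q-value can never lower the joint value. The function name and the fixed example weights are assumptions; in a QMIX-style network the weights would be produced from the global state by a separate network, and more than one mixing layer would normally be used.

```python
def mix(agent_qs, raw_weights, bias=0.0):
    """Combine per-agent Q-values into a joint value Q_jt.

    Taking abs() of the raw weights keeps every mixing weight
    non-negative, which makes Q_jt monotonically non-decreasing
    in each agent's Q-value.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("one weight per agent is required")
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


# Hypothetical values: one raw weight is negative but is used as 0.5.
base = mix([1.0, 2.0], raw_weights=[-0.5, 0.25], bias=0.1)
raised = mix([1.5, 2.0], raw_weights=[-0.5, 0.25], bias=0.1)
print(base, raised)
```

Because of this property, each agent greedily maximizing its own Q-value also maximizes the joint value, which is what allows decentralized execution after centralized training.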

[0054] During network training, each UGV group takes action a in the current state s, obtaining the reward value r at that moment and the state s' after the environment transition, with a' denoting an action available in s'. $Q_{target}$ is calculated as shown in equation (3), and the loss function is shown in equation (4). After network training is completed, the UGV state of each group is input to obtain the value of each action, and a greedy strategy selects the action with the maximum value, which completes the path planning of the upper-layer area.

[0055] Referring to Figure 6, the image shows the training results of an upper-layer network in a 10x10 area. Yellow represents the area where the current group is located, green represents the areas where the other two groups are located, purple represents areas that have not yet been searched, and blue-green represents areas that have been searched. It can be seen that after training, the groups maintain a relatively large distance from each other.

[0056] For the lower-level task, the requirement is to complete the search as quickly as possible and move in straight lines as much as possible. Essentially, it is also necessary to avoid collisions between agents and repeated coverage of already scanned areas. The reward value design is the same as that of the upper-level task, except that the last two terms are changed to rewarding if adjacent actions have the same value. The reward value design is shown in equation (2). The design of the lower-level task network structure is relatively simple, requiring only a convolutional network. The rest is basically the same as that of the upper-level task and will not be described in detail.
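Equation (2) is not reproduced in this text. As a sketch of the straight-line term only, the function below adds a bonus whenever an agent repeats its previous action, which is one reading of "rewarding if adjacent actions have the same value". The function name and the bonus size are assumptions.

```python
def straightness_bonus(actions, bonus=0.1):
    """Sum a bonus for every step whose action equals the previous one.

    actions: the sequence of action labels one agent has taken so far.
    """
    return sum(bonus for prev, cur in zip(actions, actions[1:]) if prev == cur)


# Hypothetical trajectory: three of the four consecutive pairs repeat.
print(straightness_bonus([2, 2, 2, 4, 4]))
```

An agent that keeps heading in one direction accumulates this bonus, while frequent turns earn nothing, which favours the straight sweeps described for the lower-level search.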

[0057] Referring to Figure 7, the left image shows the training results of the lower-level network. As can be seen, the movements of each agent follow relatively straight lines and the overlap is low, which better meets the actual needs of the scenario.

[0058] The various embodiments of the present invention are implemented through programmed processing on a device with processor functionality. Therefore, in practical engineering, the technical solutions and functions of the various embodiments can be encapsulated into modules. On this basis, and building upon the above embodiments, the embodiments of the present invention provide a multi-UGV path planning device based on multi-agent reinforcement learning, which is used to execute the multi-UGV path planning method based on multi-agent reinforcement learning in the above method embodiments. Referring to Figure 2, the device includes: a first main module for implementing step 1: dividing the UGV cluster into groups based on the size of the search area, where the UGVs in each group have the same performance, and the groups cannot communicate directly with one another for obstacle avoidance but communicate through relay UGVs; and step 2: rasterizing the rectangular search area into several search patches and establishing a two-layer search task model, in which the upper-layer model issues patch search instructions that direct each group of UGVs into a different patch to carry out its search task, and the lower-layer model is an adaptive distributed search within a patch, where the individual UGVs of a group search so as to complete the traversal of that patch; a second main module for implementing step 3: designing the model state space, where 3 is the label for the patch where the current group is located, 2 is the label for patches where other groups are located, 1 is the label for covered patches, and 0 is the label for uncovered areas; and step 4: designing the model action space, where, for each agent, the current position is taken as 1, the range reachable in one step is represented by a 3x3 grid, and the complete action space is the set $A_i = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$; a third main module for implementing step 5: designing the reward functions of the upper-layer and lower-layer task models; and step 6: designing the model network structure; and a fourth main module for implementing step 7: completing network training; and step 8: completing the testing of the upper-layer and lower-layer task models.

[0059] The multi-UGV path planning device based on multi-agent reinforcement learning provided in this embodiment of the invention employs the several modules shown in Figure 2 and operates on the basis of existing resources, improving the cooperation and coordination among agents in a multi-agent system. Using a distributed search structure, complex learning problems are automatically decomposed into easier-to-learn local sub-problems, which raises the level of intelligence and expands the field of application. Learning a decentralized strategy end to end in a centralized setting greatly reduces the amount of computation and is highly practical.

[0060] It should be noted that the apparatus in the device embodiments provided by the present invention can be used not only to implement the methods in the above method embodiments, but also to implement the methods in other method embodiments provided by the present invention. The difference lies only in the setting of corresponding functional modules. Its principle is basically the same as that of the above device embodiments provided by the present invention. As long as those skilled in the art, based on the above device embodiments and referring to the specific technical solutions in other method embodiments, obtain corresponding technical means and technical solutions composed of these technical means by combining technical features, and improve the apparatus in the above device embodiments while ensuring the practicality of the technical solutions, they can obtain corresponding device-type embodiments for implementing the methods in other method-type embodiments. For example:

[0061] Based on the above device embodiments, as an optional embodiment, the multi-UGV path planning device based on multi-agent reinforcement learning provided in this embodiment of the invention further includes: a first sub-module, used to implement step 5, specifically including: for the upper-level task, requiring that any two groups are not adjacent, and that the covered areas cannot be searched again; using the group position at the end of the previous state to determine the target at the next moment, achieving minimal energy consumption; the reward function is:

[0062]

[0063] Where reward is the reward value, s′(i) is the state at the previous time step, s(i) is the state at different time steps, n is the number of agent groups, out indicates that the agent is outside the region, avg is the mean of the dispersion, and KL is the KL divergence.

[0064] The reward value is at most 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; and the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence and the larger the positive reward, so that the groups as a whole are encouraged to spread out;

[0065] For the lower-level task, complete the search as quickly as possible and move in straight lines as much as possible, avoiding collisions between agents and repeated coverage of already scanned areas. The reward design is the same as for the upper-level task, except that two items are changed to rewarding adjacent actions if their values are the same. The reward design includes:

[0066]

[0067] Where a(i) represents the action taken at different times, and a′(i) represents the action taken at the previous time.
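The straight-line term of the lower-level reward, which rewards an agent when its current action a(i) equals its previous action a′(i), could look roughly like this. The function name and the `bonus` value are assumptions for illustration; the actual formula is not reproduced in this text.

```python
def straightness_bonus(actions, prev_actions, bonus=0.1):
    """Sum a fixed bonus for every agent whose action repeats the previous one.

    actions[i] plays the role of a(i), prev_actions[i] the role of a'(i).
    The bonus magnitude is a hypothetical value.
    """
    if len(actions) != len(prev_actions):
        raise ValueError("actions and prev_actions must have the same length")
    return sum(bonus for a, a_prev in zip(actions, prev_actions) if a == a_prev)
```

For example, with three agents of which two repeat their previous action, the bonus is granted twice.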

[0068] Based on the above device embodiments, as an optional embodiment, the multi-UGV path planning device based on multi-agent reinforcement learning provided in this embodiment of the invention further includes: a second sub-module, used to implement step 6, specifically including: in the upper-layer task model, after each agent performs a behavior, it needs to obtain a reward value Q_jt, which is used to determine whether the current action has achieved the expected value. Computing the reward value Q_jt requires considering the current state matrix and the relative position vectors of all agents. A neural network with convolutional and linear layers in parallel is used. The convolutional layers process the current state, which is flattened to obtain a vector. The linear layers process the vectorized relative state information. Finally, these vectors are merged and passed through the linear neural network to output the Q-value of each action. Then, a Mix network is connected to obtain the reward value Q_jt that satisfies monotonicity. For the lower-level task model, a convolutional network can be used, and the rest is the same as the upper-level task.
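The monotonicity requirement on Q_jt can be illustrated with a minimal sketch. It assumes, in the spirit of QMIX-style value mixing, that monotonicity is enforced by making the mixing weights non-negative; the text above does not state how the Mix network achieves it. The function `monotonic_mix` and its fixed `raw_weights` are hypothetical stand-ins for a learned mixing network.

```python
def monotonic_mix(agent_qs, raw_weights, bias=0.0):
    """Combine per-agent Q-values into a joint value Q_jt.

    Taking abs() of each raw weight keeps every mixing weight non-negative,
    so Q_jt never decreases when any single agent's Q-value increases.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("one weight per agent Q-value is required")
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias
```

Because each weight is non-negative, the action that maximizes an agent's own Q-value also maximizes Q_jt, which is what allows each agent to act on its own values after centralized training.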

[0069] Based on the above device embodiments, as an optional embodiment, the multi-UGV path planning device based on multi-agent reinforcement learning provided in this embodiment of the invention further includes: a third sub-module, used to implement step 7, specifically including: during network training, each agent takes action a in the current state s to obtain a real-time reward value r and the state s' after the environment transition, and a target reward value Q_target, whose calculation formula is:

[0070] Q_target = r + γ max_{a'} Q(s', a')

[0071] Where γ is the reward discount factor, max selects the maximum value among all current Q values, and the loss function L is designed as: L = (Q_jt - Q_target)^2.
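As a worked illustration of the two formulas above, the following sketch computes Q_target and the squared loss for a single transition. The function names and the numeric values in the usage note are illustrative assumptions only.

```python
def td_target(reward, next_q_values, gamma):
    """Q_target = r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)


def td_loss(q_jt, q_target):
    """L = (Q_jt - Q_target)^2 for one transition."""
    return (q_jt - q_target) ** 2
```

For instance, with r = 1.0, γ = 0.9 and next-state Q-values [0.5, 2.0, 1.0], the target is 1.0 + 0.9 × 2.0 = 2.8, and a current estimate Q_jt = 2.3 gives a loss of 0.25.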

[0072] Based on the above device embodiments, as an optional embodiment, the multi-UGV path planning device based on multi-agent reinforcement learning provided in this embodiment of the invention further includes: a fourth sub-module, used to implement step 8, which specifically includes: after the network training is completed, inputting the state S of the agent, outputting the value vector of each action, adopting a greedy strategy, selecting the action with the highest value, and completing the planning of the upper and lower layer task models.
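The greedy selection in step 8 amounts to taking the arg-max over the output value vector. The sketch below assumes the vector is ordered to match the action labels 1 to 9 of the action space A_i; the function name is hypothetical.

```python
def greedy_action(action_values):
    """Return the 1-based label of the action with the highest value.

    Ties are resolved in favor of the lowest action label.
    """
    if not action_values:
        raise ValueError("action_values must not be empty")
    best_index = max(range(len(action_values)), key=action_values.__getitem__)
    return best_index + 1  # labels run from 1, list indices from 0
```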

[0073] The method in this embodiment of the invention is implemented using an electronic device; therefore, it is necessary to introduce the relevant electronic device. For this purpose, this embodiment of the invention provides an electronic device. As shown in Figure 3, the electronic device includes at least one processor, a communications interface, at least one memory, and a communications bus, wherein the at least one processor, the communications interface, and the at least one memory communicate with each other via the communications bus. The at least one processor can invoke logical instructions stored in the at least one memory to execute all or part of the steps of the methods provided in the foregoing method embodiments.

[0074] Furthermore, when the logical instructions in the at least one memory are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various method embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0075] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0076] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0077] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. Based on this understanding, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or sometimes in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0078] It should be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0079] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-UGV path planning method based on multi-agent reinforcement learning, characterized in that it comprises:
Step 1: Divide the UGV cluster into groups according to the size of the area to be searched. Each group of UGVs has the same performance, and the groups cannot communicate or avoid obstacles with each other; they communicate with each other through relay UGVs.
Step 2: Rasterize the rectangular area to be searched, divide it into several areas to be searched, and establish a two-layer search task model. The upper-layer model is responsible for issuing area search instructions and directing each group of UGVs to enter different areas to carry out search tasks. The lower-layer model is an adaptive distributed search within the area, where individual UGVs within the group carry out searches to complete the traversal search task within a specific area.
Step 3: Design the model state space, with 3 as the label for the area where the current group is located, 2 as the label for the area where other groups are located, 1 as the label for the covered area, and 0 as the label for the uncovered area.
Step 4: Design the model's action space. For each agent's action space, with the current position as 1, the range that can be reached in one step is represented by a 3x3 grid. A complete action space set is represented as A_i = {1,2,3,4,5,6,7,8,9}.
Step 5: Complete the design of the reward functions for the upper-level task model and the lower-level task model, including: For the upper-level task, it is required that no two groups are adjacent, and the covered areas cannot be searched again. The group position at the end of the previous state determines the target at the next moment, achieving the goal of minimizing energy consumption. The reward function is: where reward is the return value, s′(i) represents the state at the previous time step, s(i) represents the state at different time steps, n represents the number of agent groups, out represents the agent exceeding the region range, avg is the mean of the dispersion, and KL is the KL divergence. The reward value is at most 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence, and a positive reward is given, with the expectation that the groups as a whole will spread out further. For the lower-level task, complete the search as quickly as possible and move in straight lines as much as possible, avoiding collisions between agents and repeated coverage of already scanned areas. The reward design is the same as for the upper-level task, except that two items are changed to rewarding adjacent actions if their values are the same. The reward design includes: where a(i) represents the action taken at different times, and a′(i) represents the action taken at the previous moment.
Step 6: Design the model network structure, including: In the upper-level task model, after performing an action, each agent needs to obtain a reward value Q_jt, which is used to determine whether the current action has achieved the expected value. Computing the reward value Q_jt requires considering the current state matrix and the relative position vectors of all agents. A neural network with convolutional and linear layers in parallel is used. The convolutional layers process the current state, which is flattened to obtain a vector. The linear layers process the vectorized relative state information. Finally, these vectors are merged and passed through the linear neural network to output the Q-value of each action. Then, a Mix network is connected to obtain the reward value Q_jt that satisfies monotonicity. For the lower-level task model, a convolutional network can be used, and the rest is the same as the upper-level task.
Step 7: Complete network training.
Step 8: Complete the testing of the upper and lower level task models.

2. The multi-UGV path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that Step 7 specifically includes: During network training, each agent takes action a in the current state s and obtains a real-time reward value r and the state s' after the environment transition, and the target reward value Q_target is calculated as: Q_target = r + γ max_{a'} Q(s', a'), where γ is the reward discount factor and max selects the maximum value among all current Q values; the loss function L is designed as: L = (Q_jt - Q_target)^2.

3. The multi-UGV path planning method based on multi-agent reinforcement learning according to claim 2, characterized in that, Step 8 specifically includes: after the network training is completed, input the state S of the agent, output the value vector of each action, adopt a greedy strategy, select the action with the highest value, and complete the planning of the upper and lower layer task models.

4. A multi-UGV path planning device based on multi-agent reinforcement learning, characterized in that it comprises:
The first main module is used to implement step 1: According to the size of the area to be searched, the UGV cluster is divided into groups. Each group of UGVs has the same performance, and the groups cannot communicate or avoid obstacles with each other; they communicate with each other through relay UGVs. And step 2: The rectangular area to be searched is rasterized and divided into several areas to be searched. A two-layer search task model is established. The upper layer model is responsible for issuing area search instructions and directing each group of UGVs to enter different areas to carry out search tasks. The lower layer model is an adaptive distributed search within the area. Individual UGVs within the group carry out searches to complete the traversal search task within a specific area.
The second main module is used to implement step 3: designing the model state space, where 3 is the label for the area where the current group is located, 2 is the label for the areas where other groups are located, 1 is the label for covered areas, and 0 is the label for uncovered areas; and step 4: designing the model action space. For each agent's action space, the current position is 1, the range that can be reached in one step is represented by a 3x3 grid, and a complete action space set is represented as A_i = {1,2,3,4,5,6,7,8,9}.
The third main module is used to implement step 5: designing the reward function for the upper-level task model and the lower-level task model. This includes: for the upper-level task, requiring that no two groups are adjacent, and that covered areas cannot be searched again; using the group position at the end of the previous state to determine the target at the next moment, achieving minimal energy consumption. The reward function is: where reward is the return value, s′(i) represents the state at the previous time step, s(i) represents the state at different time steps, n represents the number of agent groups, out represents the agent exceeding the region range, avg is the mean of the dispersion, and KL is the KL divergence. The reward value is at most 1 and decreases according to the state: if two groups of UGVs collide, a negative reward is added; if a group of UGVs moves into an area it has already traversed, a negative reward is added; if a group of UGVs leaves the area, a negative reward is added; a continuous negative reward is added based on the position of each group of UGVs at the previous moment and its target position at the next moment; the overall dispersion is evaluated using KL divergence. The higher the degree of dispersion, the larger the KL divergence, and a positive reward is given, with the expectation that the groups as a whole will spread out further. For the lower-level task, complete the search as quickly as possible and move in straight lines as much as possible, avoiding collisions between agents and repeated coverage of already scanned areas. The reward design is the same as for the upper-level task, except that two items are changed to rewarding adjacent actions if their values are the same. The reward design includes: where a(i) represents the action taken at different times, and a′(i) represents the action taken at the previous moment. And step 6: designing the model network structure, including: In the upper-level task model, after performing an action, each agent needs to obtain a reward value Q_jt, which is used to determine whether the current action has achieved the expected value. Computing the reward value Q_jt requires considering the current state matrix and the relative position vectors of all agents. A neural network with convolutional and linear layers in parallel is used. The convolutional layers process the current state, which is flattened to obtain a vector. The linear layers process the vectorized relative state information. Finally, these vectors are merged and passed through the linear neural network to output the Q-value of each action. Then, a Mix network is connected to obtain the reward value Q_jt that satisfies monotonicity. For the lower-level task model, a convolutional network can be used, and the rest is the same as the upper-level task.
The fourth main module is used to implement step 7: complete network training; and step 8: complete the testing of the upper and lower layer task models.

5. An electronic device, characterized in that it comprises: at least one processor, at least one memory, and a communication interface; wherein the processor, the memory, and the communication interface communicate with each other; and the memory stores program instructions that can be executed by the processor, which invokes the program instructions to perform the method described in any one of claims 1 to 3.

6. A non-transitory computer-readable storage medium, characterized in that, The non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method described in any one of claims 1 to 3.