An unmanned aerial vehicle cluster task planning algorithm based on hierarchical multi-agent deep reinforcement learning and an evaluation method thereof
The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning solves the coupling problem between task allocation and trajectory planning, realizes autonomous task planning of UAV swarms in unknown environments, improves the efficiency and robustness of task planning, and can be quantitatively evaluated in a physical simulation environment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
- Filing Date
- 2024-07-25
- Publication Date
- 2026-06-26
AI Technical Summary
Existing UAV swarm mission planning algorithms neglect the coupling between mission allocation and trajectory planning, resulting in long computation time and high resource consumption, failing to meet real-time and dynamic requirements. Furthermore, deep reinforcement learning-based methods exhibit poor generalization and robustness in mission planning models.
A UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning is adopted. The task allocation and trajectory planning models are designed using the QMIX and MADDPG algorithms, respectively, and the coupling between the two is maintained during training. A distributed UAV swarm collaborative task planning algorithm and simulation system are designed, and a simulation environment is built using UE4 and AirSim for evaluation.
It enables autonomous mission planning for UAV swarms in unknown environments, improves autonomy, reduces reliance on global information, enhances the efficiency and robustness of mission planning, and allows for quantitative evaluation of algorithm performance in physical simulation environments.
[Figure CN119088073B_ABST]
Abstract
Description
Technical Field
[0001] This invention belongs to the field of unmanned aerial vehicle (UAV) technology, and specifically relates to a UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning and its evaluation method.
Background Technology
[0002] Unmanned aerial vehicle (UAV) systems have wide applications in both military and civilian fields. They offer advantages such as low cost, flexibility, and ease of deployment, and are currently widely used in battlefield reconnaissance, enemy strikes, and search and rescue operations. However, due to the limited endurance and equipment carried by a single UAV, it cannot complete complex and large-scale mission requirements. UAV swarm systems, composed of multiple UAVs, can accomplish more complex tasks through autonomous decision-making and control. UAV swarm mission planning technology is one of the core technologies for achieving autonomous decision-making and control of UAV swarms. The purpose of mission planning technology is to assign each UAV a set of optimal target points to maximize the overall mission benefit and ensure that the UAVs avoid obstacles, threat zones, and no-fly zones in the environment while reaching the target points.
[0003] Most current methods separate task allocation from flight path planning in mission planning technology, ignoring the coupling effect between actual task allocation and flight path. These methods often model the task allocation and flight path planning problems separately, transforming the mission planning problem into a multi-objective optimization problem, and then using intelligent optimization algorithms, such as ant colony optimization and genetic algorithms, to solve it. These methods require a lot of computing time and resources and cannot meet the requirements of real-time and dynamic performance.
[0004] With the development of artificial intelligence, deep reinforcement learning has shown its superiority in solving UAV mission planning problems. Most deep reinforcement learning-based methods either require global information in the mission allocation stage and need a centralized ground station to issue mission allocation instructions, or simplify the mission planning problem so that the algorithm converges, resulting in poor generalization and robustness of their mission planning models.
Summary of the Invention
[0005] This invention provides a UAV swarm task planning algorithm and its evaluation method based on hierarchical multi-agent deep reinforcement learning. A distributed UAV swarm task planning algorithm is designed, taking into account the coupling between task allocation and trajectory planning problems. A UAV swarm collaborative task planning simulation system is established, and a swarm task planning algorithm performance evaluation method is designed based on the simulation system.
[0006] To achieve the above objectives, the present invention adopts the following technical solution:
[0007] A drone swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning includes the following steps:
[0008] Step 1: Based on the task requirements, establish Markov decision process models for the task allocation and trajectory planning of the UAV swarm, respectively, for multi-UAV and multi-target scenarios.
[0009] Step 2: Based on the Markov decision process models for UAV swarm task allocation and trajectory planning established in Step 1, design a QMIX-based task allocation algorithm for the UAV swarm task allocation problem and a MADDPG-based trajectory planning algorithm for the UAV swarm trajectory planning problem, employing a centralized training and distributed execution approach. The QMIX-based task allocation algorithm comprises two networks: a Q-network and a Mix network. The input to the Q-network is the observation for UAV swarm task allocation, and the output is the task number $a_T = \{i\}$ selected at the next time step; the network contains two hidden layers of dimension 200 and uses the ReLU activation function. The input of the Mix network is the joint observation of all UAVs and the tasks selected by all UAVs, and the output is the weighted sum of the Q-network outputs of all UAVs; the network contains two hidden layers of dimension 200 and uses the ReLU activation function. The MADDPG-based trajectory planning algorithm contains an Actor network and a Critic network. The input of the Actor network is the observation for UAV swarm trajectory planning, and the output is the speed control command of the UAV. The input of the Critic network is the joint observation of all UAVs and the speed control commands of all UAVs, and the output is the evaluation of the action to be performed in the current state.
[0010] Step 3: Based on the task allocation and trajectory planning algorithms designed in Step 2, a hierarchical deep reinforcement learning framework is designed. Considering the coupling between the two sub-problems of task allocation and trajectory planning, the task allocation and trajectory planning problems are solved simultaneously through training. The hierarchical deep reinforcement learning framework is divided into a top layer and a bottom layer. The top layer is the task allocation layer, which assigns an optimal target point to each UAV and outputs the task selected by the UAV in the next moment. The bottom layer is the trajectory planning layer, which provides each UAV with collision-free speed commands and outputs speed control commands for the UAV based on the task selected by the top task allocation layer, so that it can successfully reach the target point. During training, the output of the top task allocation layer affects the input of the bottom trajectory planning layer, and the output of the bottom trajectory planning layer affects the reward of the top task allocation layer, realizing an interactive training method. By maintaining the coupling relationship between task allocation and trajectory planning during training, the relationship between task allocation and trajectory planning is implicitly learned during the training of each model.
[0011] In the steps described above, the Markov decision process model for task allocation in Step 1 includes: the observation space for UAV swarm task allocation, which includes the drone's current location and attack capabilities and the locations and attack requirements of all mission points; the action space for UAV swarm task allocation, $a_T = \{i\}$, where $i$ represents the task sequence number selected by the drone at the next moment; and the reward function $r_T$ of the UAV swarm, which comprises two parts, trajectory cost and attack penalty. The Markov decision process model for trajectory planning includes: the observation space for UAV swarm trajectory planning, which includes the target point's location information, the drone's own location and speed information, and external sensor information; the action space for UAV swarm trajectory planning, $a_P = \{v_x, v_y\}$, which contains the velocity commands of the UAV in the x and y directions at the next moment; and the reward function $r_P$ for UAV swarm trajectory planning, which includes transition rewards, success rewards, and collision penalties.
[0012] Another aspect of the present invention provides an evaluation method for the above-mentioned UAV swarm mission planning algorithm, comprising the following steps:
[0013] Step 1: Establish a simulation environment for UAV swarm task planning based on UE4. The simulation environment includes a static 3D environment with obstacles. Use AirSim as the UAV control interface to interact with the simulation environment, control the movement of the UAVs, and obtain UAV information.
[0014] Step 2: Design evaluation metrics to quantitatively evaluate the performance of the task allocation and trajectory planning algorithms designed in Step 2. Evaluation metrics include average reward, success rate, collision rate, trajectory efficiency, motion smoothness, and trajectory curvature.
[0015] Beneficial Effects: This invention provides a UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning and its evaluation method. When modeling the task planning problem, it decomposes task planning into two sub-problems: task allocation and trajectory planning. When solving the task planning problem, a joint solution framework is designed to solve the task allocation and trajectory planning problems simultaneously. During training, the output of the top-level task allocation affects the input of the bottom-level trajectory planning, and the output of the bottom-level trajectory planning affects the reward of the top-level task allocation, achieving an interactive training method. By maintaining this coupling during training, the relationship between task allocation and trajectory planning is implicitly learned while each model is trained, which addresses the problem that existing task planning algorithms ignore the coupling between task allocation and trajectory planning. This invention uses multi-agent deep reinforcement learning to train the task allocation and trajectory planning algorithms, enabling UAVs to select appropriate tasks and reach the task area without obtaining global information. UAVs can thus complete swarm task planning in unknown environments by relying on local information, which improves their autonomy. Furthermore, this invention builds a UAV swarm combat simulation environment based on UE4 and AirSim, and designs evaluation indicators to comprehensively, reasonably, and quantitatively evaluate the effectiveness of the task planning algorithm in a physical simulation environment.
Attached Figure Description
[0016] Figure 1 is a schematic diagram showing the drone equipped with 7 distance sensors in an embodiment of the present invention;
[0017] Figure 2 is a schematic diagram of the network architecture of the QMIX and MADDPG algorithms in an embodiment of the present invention;
[0018] Figure 3 is a schematic diagram of the hierarchical deep reinforcement learning framework in an embodiment of the present invention;
[0019] Figure 4 shows the average reward during the training process for the MADDPG, MATD3, and MAPPO algorithms;
[0020] Figure 5 is a schematic diagram of the mission planning of 5 UAVs and 5 target points in a simulation environment using the MAPPO algorithm in an embodiment of the present invention;
[0021] Figure 6 is a schematic diagram of the mission planning of 5 UAVs and 5 target points in a simulation environment using the MADDPG algorithm in an embodiment of the present invention;
[0022] Figure 7 is a schematic diagram of the task planning of 10 drones and 10 target points in a simulation environment in an embodiment of the present invention;
[0023] Figure 8 is a schematic diagram of deep reinforcement learning in an embodiment of the present invention.
Detailed Implementation
[0024] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0025] A method for drone swarm mission planning based on hierarchical multi-agent deep reinforcement learning includes the following steps:
[0026] 1. Multi-agent deep reinforcement learning model
[0027] Based on mission requirements, Markov decision process models for drone swarm task allocation and trajectory planning are established for multi-drone, multi-target scenarios.
[0028] (1) Top-level task allocation model
[0029] 1) State and Observation Space
[0030] The state contains global information, while observations are the local information observed by each drone. In the top-level task allocation model, each drone's observations include its current position, attack capabilities, and the positions and attack requirements of all task points, defined as: The observations of all UAVs constitute the global state, defined as:
[0031] 2) Action Space
[0032] The drone's action is defined as the task sequence number selected at each moment, $a_T = \{i\}$, where $i$ represents the task sequence number selected by the drone at the next moment.
[0033] 3) Reward function
[0034] During task allocation, all drones share a global reward. The reward function consists of two parts: trajectory cost and attack penalty. The trajectory cost is shown in the following formula:
[0035] $r_d = -(d_1 + d_2 + \dots + d_i + \dots + d_N)$
[0036] In the formula: $d_i$ represents the actual trajectory length of drone $i$ as it reaches its selected target point.
[0037] The attack penalty is shown in the following formula:
[0038] $r_c = -\lambda \times \eta$
[0039] In the formula: η is the number of mismatches between the drone's attack capabilities and the mission's attack requirements, and λ is a constant.
[0040] Therefore, the global reward is as follows:
[0041] $r_T = \delta_1 r_d + \delta_2 r_c$
[0042] In the formula: $\delta_1$ and $\delta_2$ are constants;
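The global task allocation reward above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the default constants, and the rule that a "mismatch" means a drone's capability label differs from the requirement label of its selected task are all assumptions made for the example.

```python
from math import dist

def trajectory_cost(paths):
    """r_d: negative sum of the actual path lengths d_i of all drones."""
    total = 0.0
    for path in paths:
        total += sum(dist(a, b) for a, b in zip(path, path[1:]))
    return -total

def attack_penalty(capabilities, requirements, assignment, lam=1.0):
    """r_c = -lambda * eta, where eta counts capability/requirement mismatches.

    Assumption: a mismatch is a drone whose capability label differs from
    the requirement label of the task it selected.
    """
    eta = sum(1 for drone, task in enumerate(assignment)
              if capabilities[drone] != requirements[task])
    return -lam * eta

def global_reward(paths, capabilities, requirements, assignment,
                  delta1=1.0, delta2=1.0, lam=1.0):
    """r_T = delta1 * r_d + delta2 * r_c, shared by all drones."""
    r_d = trajectory_cost(paths)
    r_c = attack_penalty(capabilities, requirements, assignment, lam)
    return delta1 * r_d + delta2 * r_c

# Two drones: both fly 5 m; drone 1 selects a task it is not equipped for.
paths = [[(0, 0), (3, 4)], [(0, 0), (0, 2), (0, 5)]]
reward = global_reward(paths, ["A", "B"], ["A", "A"], [0, 1],
                       delta1=0.1, delta2=1.0, lam=2.0)
```

Because the reward is shared, a single mismatched drone lowers the return seen by every drone, which is what pushes the swarm toward a jointly consistent allocation.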
[0043] (2) Low-level trajectory planning model
[0044] 1) State and Observation Space
[0045] During trajectory planning, the observation space includes the position and velocity information of each UAV, the position information of the target point, and information from external sensors. In its definition, $g_i$ represents the polar coordinates of the target point relative to the UAV, and $v_x$, $v_y$ give the current velocity of the drone in the x and y directions. For external sensor information, as shown in Figure 1, the drone is equipped with 7 distance sensors, which return the distances from the drone to obstacles. The observations of all drones constitute the global state.
[0046] 2) Action Space
[0047] To simplify model training, it is assumed that the drone flies at a fixed altitude. The action space contains the drone's velocity components in the x and y directions, $a_P = \{v_x, v_y\}$;
[0048] 3) Reward function
[0049] The reward function includes a transition reward, a success reward, and a collision penalty. The transition reward is used to avoid sparse rewards, providing the drone with a small reward at each time step to guide it to the target point. The transition reward is expressed as follows:
[0050] $r_1 = \alpha (d_{t-1} - d_t)$
[0051] In the formula: $d_t$ is the distance from the UAV to the target point at time step $t$, and $\alpha$ is a constant;
[0052] The success reward represents the incentive given to the drone after it reaches the target point, as shown in the following formula:
[0053]
[0054] In the formula: $r_{arrival}$ is a positive number, $p_i$ is the current coordinates of the drone, and $g_i$ is the coordinates of the mission point selected by the drone;
[0055] Collision penalties are used to guide drones to avoid obstacles and prevent collisions with each other, as shown in the following formula:
[0056]
[0057] In the formula: $R$ is the radius of the UAV, and the accompanying distance term is the shortest distance from the UAV to an obstacle;
[0058] The total reward is shown in the following formula:
[0059] $r_P = r_1 + r_2 + r_3$
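The three reward terms can be combined as in the following sketch. Only the transition reward follows a formula given in the text; the success and collision terms are written in an assumed threshold form, and the names `arrive_radius`, `r_collision`, and all default values are hypothetical placeholders.

```python
from math import dist

def transition_reward(prev_dist, curr_dist, alpha=1.0):
    """r_1 = alpha * (d_{t-1} - d_t): positive when the drone gets closer."""
    return alpha * (prev_dist - curr_dist)

def success_reward(p_i, g_i, r_arrival=10.0, arrive_radius=0.5):
    """r_2: r_arrival once the drone is within arrive_radius of its target.

    Assumption: the arrival test is a simple distance threshold.
    """
    return r_arrival if dist(p_i, g_i) < arrive_radius else 0.0

def collision_penalty(min_obstacle_dist, drone_radius=1.0, r_collision=-10.0):
    """r_3: penalty when the shortest obstacle distance drops below R.

    Assumption: a constant penalty; the original formula is not reproduced.
    """
    return r_collision if min_obstacle_dist < drone_radius else 0.0

def trajectory_reward(prev_pos, pos, goal, min_obstacle_dist, alpha=1.0):
    """r_P = r_1 + r_2 + r_3 for one drone at one time step."""
    r1 = transition_reward(dist(prev_pos, goal), dist(pos, goal), alpha)
    r2 = success_reward(pos, goal)
    r3 = collision_penalty(min_obstacle_dist)
    return r1 + r2 + r3

# Moving 3 m closer in free space earns only the dense transition reward.
step_reward = trajectory_reward((0, 0), (3, 0), (10, 0), min_obstacle_dist=5.0)
```

The dense term gives a learning signal at every step, while the two sparse terms mark the episode outcomes that actually matter.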
[0060] 2. Design of QMIX-based task allocation algorithm and MADDPG-based trajectory planning algorithm
[0061] The overall network architecture is shown in Figure 2. Each UAV contains a task allocation network and a trajectory planning network. The input to the task allocation network is the current position of the UAV, its attack capabilities, and the positions and attack requirements of all tasks. After a series of linear layers, the output is the value of each task the UAV can select. Then, following the greedy principle, the task with the highest value is selected as the task for the next moment. The input to the trajectory planning network is the position information of the target point, the action information of the UAV, and the information from external sensors. After being processed by linear layers, the output is the speed command for the next moment.
[0062] The hidden layers of the task assignment network are two fully connected layers of dimension 200, and the hidden layers of the trajectory planning network are two fully connected layers of dimension 64. The networks use the ReLU activation function. The task assignment network is trained using the QMIX algorithm. In addition to the task assignment network, a mixing (Mix) network must be defined during training to perform a weighted summation of the Q-values output by all task selection networks. The input of the Mix network is all the Q-values and the global state, and the output is the global value obtained by the weighted summation of the Q-values. The trajectory planning network is trained using the MADDPG algorithm, which is based on the Actor-Critic framework. A centralized Critic network is therefore needed during training. Its structure is similar to that of the Actor network. Its input is the information of all UAVs, and its output is the evaluation of the current UAV's action, taking into account global information and the actions of the other UAVs.
[0063] 3. Design of a hierarchical deep reinforcement learning framework
[0064] The task assignment model is trained using the QMIX algorithm, an extension of the DQN algorithm. QMIX is a multi-agent reinforcement learning algorithm based on value decomposition. Value decomposition allows the UAV to access global information during training, enabling it to train from a global perspective. The core idea of the QMIX algorithm is to use a neural network $f$ to approximate the global value $Q_{total}(s,u)$ such that the following condition is satisfied:
[0065] $$\frac{\partial Q_{total}}{\partial Q_i} \ge 0, \quad \forall i \in \{1, \dots, N\}$$
[0066] In the formula, $Q_i$ is the value function of each drone;
[0067] This condition makes the relationship between $Q_{total}$ and $Q_i$ monotonic, thus ensuring that the following condition holds:
[0068] $$\arg\max_{u} Q_{total}(s,u) = \left[\arg\max_{u_1} Q_1, \; \arg\max_{u_2} Q_2, \; \dots, \; \arg\max_{u_N} Q_N\right]$$
[0069] The above formula ensures that the joint action obtained by $\arg\max_u Q_{total}(s,u)$ is the same as the result $[u_1, u_2, \dots, u_N]$ of applying $\arg\max$ to each $Q_i$, so that the locally optimal action chosen by each drone is exactly a part of the globally optimal action. $Q_{total}$ is updated in a manner similar to DQN, learning the target global action value function corresponding to the optimal policy by minimizing the loss function. The loss function is shown in the following equation:
[0070] $$L(\theta) = \left(r + \gamma \max_{a'} Q^{*}(s', a') - Q^{*}(s, a \mid \theta)\right)^{2}$$
[0071] In the formula, $L(\theta)$ is the loss function of the mixing network, $Q^{*}(s,a \mid \theta)$ is the global value at the current moment, $Q^{*}(s',a')$ is the global value at the next time step, $r$ is the reward at the current moment, $s$, $s'$ and $a$, $a'$ are the states and actions at the current moment and the next moment, respectively, $\theta$ denotes the network parameters, and $\gamma$ is the discount factor.
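The monotonicity property can be checked numerically with a toy example. The sketch below swaps in a single linear mixing layer with non-negative weights for the Mix network (the real mixing network is a multi-layer network whose weights depend on the global state), and the Q-tables, weights, and function names are invented for illustration. It shows that each drone's local greedy choice coincides with the joint action that maximizes the mixed value.

```python
from itertools import product

def mix(agent_qs, raw_weights, bias):
    """Monotonic mixing: Q_total = sum_i |w_i| * Q_i + b, so dQ_total/dQ_i >= 0."""
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias

def greedy_joint_action(q_tables):
    """Decentralized execution: each drone takes the argmax of its own Q_i."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

def best_joint_action(q_tables, raw_weights, bias):
    """Centralized check: brute-force argmax of Q_total over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(
        joint_actions,
        key=lambda u: mix([q[a] for q, a in zip(q_tables, u)], raw_weights, bias),
    )

# Hypothetical per-drone Q-values for three drones and three tasks.
q_tables = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4], [-1.0, -0.2, 0.7]]
raw_weights = [0.5, -2.0, 1.2]  # taking the absolute value enforces monotonicity
bias = 0.3

local_choice = greedy_joint_action(q_tables)
global_choice = best_joint_action(q_tables, raw_weights, bias)
```

Taking the absolute value of the weights is one common way to enforce the non-negative derivative; without it, a negative weight would make a drone's locally best task globally harmful.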
[0072] The trajectory planning model is trained using the MADDPG algorithm, which is an extension of the deep deterministic policy gradient (DDPG) algorithm. It uses N continuous policies $\mu_i$, whose gradient can be written as follows:
[0073] $$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}\left[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)\Big|_{a_i = \mu_i(o_i)}\right]$$
[0074] In the formula, $\nabla_{\theta_i} J(\mu_i)$ is the policy gradient of the i-th drone, $\mu_i(a_i \mid o_i)$ is the Actor network of the i-th drone, $Q_i^{\mu}(x, a_1, \dots, a_N)$ is the global Critic network, $x$ represents the global state, and $a_N$ is the action of the N-th drone. The Actor network is updated by maximizing the policy gradient.
[0075] The centralized Critic network is updated using the following formula:
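[0076] $$L(\theta_i) = \mathbb{E}\left[\left(Q_i^{\mu}(x, a_1, \dots, a_N) - y\right)^{2}\right], \quad y = r_i + \gamma\, Q_i^{\mu'}(x', a'_1, \dots, a'_N)$$
[0077] In the formula: $Q_i^{\mu}(x, a_1, \dots, a_N)$ is the output of the Critic network at the current moment, $Q_i^{\mu'}(x', a'_1, \dots, a'_N)$ is the output of the Critic network at the next time step, $x$ and $x'$ are the global states at the current and next time steps, respectively, $a_N$ and $a'_N$ are the actions of the N-th drone at the current moment and the next moment, $r_i$ is the reward of the i-th drone, and $\gamma$ is the discount factor.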
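The Critic and Actor updates reduce to two scalar losses, which the following sketch computes on plain Python lists in place of network outputs. The function names, the `done` flag for terminal steps, and the default discount factor are assumptions made for the example; in practice the values would come from the Critic and target Critic networks and the losses would be minimized by gradient descent.

```python
def td_target(reward, next_q, gamma=0.95, done=False):
    """y = r + gamma * Q'(x', a'_1..a'_N); no bootstrapping on terminal steps."""
    return reward + (0.0 if done else gamma * next_q)

def critic_loss(current_qs, rewards, next_qs, dones, gamma=0.95):
    """Mean squared TD error over a batch of transitions."""
    errors = [
        (q - td_target(r, nq, gamma, d)) ** 2
        for q, r, nq, d in zip(current_qs, rewards, next_qs, dones)
    ]
    return sum(errors) / len(errors)

def actor_loss(critic_values):
    """Negative mean Critic output: minimizing it maximizes the predicted value."""
    return -sum(critic_values) / len(critic_values)

# Batch of two transitions; the second one ends the episode.
loss_c = critic_loss([1.0, 2.0], [0.5, 1.0], [1.0, 3.0], [False, True], gamma=0.9)
loss_a = actor_loss([1.0, 3.0])
```

The Critic sees the joint state and all actions only during training; at execution time each drone needs just its own Actor.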
[0078] As shown in Figure 3, during training, the two algorithms are trained synchronously and influence each other. The input of the bottom-level trajectory planning network depends on the output of the top-level task allocation network. At the same time, the top-level task allocation network does not interact directly with the environment, but interacts with it through the output of the bottom-level trajectory planning network. The state and reward of the top-level network depend on the output of the bottom-level network. By maintaining the coupling between task allocation and trajectory planning during training, the correlation between the two processes is implicitly learned as the two network models are updated, thereby improving the efficiency of the overall task planning algorithm. The detailed algorithm flow is as follows.
[0079]
[0080]
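The coupling between the two layers can be illustrated with a toy training loop. This is a sketch under strong simplifying assumptions, not the algorithm flow of the invention: a tabular value estimate stands in for the QMIX Q-networks, a fixed clipped-velocity controller stands in for the MADDPG Actor, there are no obstacles, and the attack penalty is omitted. The speed limit, control step, and arrival tolerance reuse the values stated in the embodiment; every function name is hypothetical.

```python
import random
from math import dist

V_MAX, DT = 2.0, 0.2  # speed limit (m/s) and 5 Hz control step

def bottom_policy(pos, goal):
    """Stand-in for the trajectory planning Actor: clipped velocity toward the goal."""
    return tuple(max(-V_MAX, min(V_MAX, g - p)) for p, g in zip(pos, goal))

def rollout(start, goal, max_steps=200, tol=0.15):
    """Fly one drone with the bottom layer and return its actual path length d_i."""
    pos, length = start, 0.0
    for _ in range(max_steps):
        if dist(pos, goal) < tol:
            break
        v = bottom_policy(pos, goal)
        nxt = tuple(p + vi * DT for p, vi in zip(pos, v))
        length += dist(pos, nxt)
        pos = nxt
    return length

def train(starts, goals, episodes=300, eps=0.2, lr=0.1, seed=0):
    rng = random.Random(seed)
    n, m = len(starts), len(goals)
    q = [[0.0] * m for _ in range(n)]  # tabular stand-in for per-drone Q-networks
    for _ in range(episodes):
        # Top layer: each drone selects a task (epsilon-greedy).
        tasks = [rng.randrange(m) if rng.random() < eps
                 else max(range(m), key=q[i].__getitem__) for i in range(n)]
        # Bottom layer: its input (the goal) is the top layer's output.
        lengths = [rollout(starts[i], goals[tasks[i]]) for i in range(n)]
        # Top-layer reward is computed from the bottom layer's trajectories.
        r_top = -sum(lengths)
        for i in range(n):
            q[i][tasks[i]] += lr * (r_top - q[i][tasks[i]])
    return [max(range(m), key=q[i].__getitem__) for i in range(n)]

assignment = train([(0.0, 0.0), (10.0, 0.0)], [(1.0, 0.0), (9.0, 0.0)], episodes=50)
```

The key point is the data flow inside the loop: the top layer never touches the environment directly, and its reward is only known after the bottom layer has actually flown the selected assignment.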
[0081] The evaluation of the above-mentioned drone swarm task planning algorithm includes the following steps:
[0082] 1. Establish a simulation environment for UAV swarm mission planning based on UE4.
[0083] To compare and verify the effectiveness of the above-designed task planning algorithm in multi-UAV swarm task planning, the simulation environment used is the AirSim UAV simulator developed by Microsoft. The AirSim UAV simulator, developed based on Unreal Engine, has a near-realistic quadcopter dynamics model and rich API interfaces, making it very suitable for UAV swarm simulation experiments.
[0084] Both the simulation and the training of the reinforcement learning model were developed in a Python 3.9.19, PyTorch 1.13.0 environment. The specific hardware configuration of the training environment is shown in the table below:
[0085]
[0086] Training ran for approximately 5000 episodes in total, and the algorithm took about 12 hours to converge. The environment dimensions were set to 60 m × 60 m × 50 m, the obstacles were several cylinders or cuboids of varying sizes, the target point was a sphere with a radius of 0.15 m, the drone radius was 1 m, the speed range was set to (-2, 2) m/s, the drone's flight altitude was fixed at 3 m, and the control frequency was set to 5 Hz. The network and parameter settings during training are shown in the table below:
[0087]
[0088]
[0089] 2. Design evaluation metrics to quantitatively evaluate algorithm performance.
[0090] (1) Evaluation index design
[0091] To evaluate the quality of task planning, a set of quantitative evaluation metrics was designed, and 100 experiments were conducted to evaluate the algorithm's performance from multiple perspectives, including safety, effectiveness, and robustness.
[0092] 1) Average reward: The average reward is the average reward of all drones at each time step during the completion of the mission. A higher average reward indicates higher overall mission efficiency.
[0093] 2) Success rate: The percentage of drones that reach the target location within the specified time without any collision;
[0094] 3) Collision rate: The percentage of drones that failed to complete their mission due to collisions with obstacles;
[0095] 4) Trajectory efficiency: The ratio of the sum of the straight-line distances between the starting point and the target point of all drones to the sum of the actual trajectory lengths of all drones;
[0096] 5) Motion smoothness: Motion smoothness reflects the change in drone speed. The lower the motion smoothness, the smoother the drone's movement. The calculation formula is as follows:
[0097]
[0098] In the formula: $v_i$ represents the drone speed, $N$ represents the number of drones, and $T$ is the total number of time steps.
[0099] 6) Trajectory curvature: Curvature reflects the turning points of the trajectory. The smaller the curvature, the smoother the trajectory. The calculation formula is as follows:
[0100]
[0101] In the formula: $x$ and $y$ represent the drone coordinates, $N$ represents the number of drones, and $T$ represents the total number of time steps.
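Three of the metrics above are defined fully in words and can be computed directly from logged trajectories, as in this sketch. The outcome labels and function names are assumptions made for the example; motion smoothness and trajectory curvature are left out because their formulas are not reproduced in the text.

```python
from math import dist

def path_length(path):
    """Actual trajectory length: sum of distances between consecutive waypoints."""
    return sum(dist(a, b) for a, b in zip(path, path[1:]))

def trajectory_efficiency(paths, goals):
    """Sum of straight-line start-to-target distances over sum of actual lengths."""
    straight = sum(dist(path[0], goal) for path, goal in zip(paths, goals))
    actual = sum(path_length(path) for path in paths)
    return straight / actual if actual > 0 else 0.0

def success_and_collision_rate(outcomes):
    """Percentages of drones that succeeded or collided.

    Assumption: each drone's run is labeled 'success', 'collision', or 'timeout'.
    """
    n = len(outcomes)
    return (100.0 * outcomes.count("success") / n,
            100.0 * outcomes.count("collision") / n)

# Drone 0 takes an L-shaped detour (7 m for a 5 m gap); drone 1 flies straight.
paths = [[(0, 0), (3, 0), (3, 4)], [(0, 0), (0, 5)]]
goals = [(3, 4), (0, 5)]
efficiency = trajectory_efficiency(paths, goals)
rates = success_and_collision_rate(["success"] * 8 + ["collision", "timeout"])
```

A trajectory efficiency of 1 means every drone flew a straight line to its target; obstacle avoidance necessarily pushes the value below 1.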
[0102] (2) Algorithm performance evaluation
[0103] 1) Convergence analysis
[0104] The trend of the average reward during training for the various algorithms is shown in Figure 4. As the figure shows, the MADDPG algorithm converges faster than the MATD3 and MAPPO algorithms.
[0105] 2) Algorithm Performance Analysis
[0106] The table below shows the evaluation metrics for the MADDPG and MAPPO algorithms when the number of drones is 5.
[0107]
[0108] As can be seen from the table, the MADDPG algorithm has a higher average reward and a higher success rate compared to the MAPPO algorithm.
[0109] Figures 5 and 6 illustrate the task planning of the MAPPO and MADDPG algorithms in a simulation environment (Figure 5: MAPPO; Figure 6: MADDPG). The figures show that, during path planning, the MADDPG algorithm keeps a greater distance from obstacles than the MAPPO algorithm, resulting in better obstacle avoidance.
[0110] The table below compares the centralized task allocation algorithm with the QMIX task allocation algorithm.
[0111]
[0112] The top-level task allocation algorithm trained by QMIX has overall efficiency similar to that of the centralized Hungarian algorithm. However, the QMIX task allocation algorithm does not require each drone to know the information of all drones, making it a distributed task allocation algorithm. With similar performance, its scalability and robustness are better than those of the centralized task allocation algorithm.
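For small swarms, a centralized baseline of this kind can be reproduced with an exhaustive search, which returns the same optimum the Hungarian algorithm would find for the same cost matrix. The sketch below is an illustration only: it uses brute force in place of the Hungarian algorithm, assumes straight-line distance as the assignment cost, and its function name and sample coordinates are hypothetical.

```python
from itertools import permutations
from math import dist

def centralized_assignment(drone_positions, target_positions):
    """Exhaustive search for the one-to-one assignment with minimum total distance.

    Needs the positions of all drones and all targets at once, which is what
    makes this baseline centralized. Cost grows factorially with swarm size.
    """
    n = len(drone_positions)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(target_positions)), n):
        cost = sum(dist(drone_positions[i], target_positions[t])
                   for i, t in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost

drones = [(0, 0), (10, 0), (5, 5)]
targets = [(9, 1), (1, 1), (5, 6)]
assignment, cost = centralized_assignment(drones, targets)
```

The function signature makes the contrast with the distributed approach visible: it cannot run on a single drone unless that drone has been told where every other drone is.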
[0113] 3) Generalization analysis
[0114] In a simulated urban environment with dimensions of 400 m × 400 m × 50 m, the task planning algorithm was tested with 10 drones and 10 target points. The simulation results are shown in Figure 7. The experimental results show that the task planning algorithm proposed in this invention can complete the task planning of 10 UAVs and 10 target points.
[0115] The above are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments. For those skilled in the art, within their knowledge and without departing from the principle of the present invention, several improvements can be made to the present invention, and these improvements are also considered to be within the scope of protection of the present invention.
Claims
1. A drone swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning, characterized in that it includes the following steps: Step 1: Based on the task requirements, establish Markov decision process models for the task allocation and trajectory planning of the UAV swarm, respectively, for multi-UAV and multi-target scenarios. Step 2: Based on the Markov decision process models for UAV swarm task allocation and trajectory planning established in Step 1, design a task allocation algorithm based on QMIX for the UAV swarm task allocation problem, and a trajectory planning algorithm based on MADDPG for the UAV swarm trajectory planning problem, using a centralized training and distributed execution approach. Step 3: Based on the task allocation and trajectory planning algorithms designed in Step 2, a hierarchical deep reinforcement learning framework is designed. Considering the coupling between the two sub-problems of task allocation and trajectory planning, the framework is trained synchronously to solve the task allocation and trajectory planning problems. The hierarchical deep reinforcement learning framework consists of a top-level task allocation layer and a bottom-level trajectory planning layer. The top-level task allocation layer assigns an optimal target point to each UAV and outputs the task selected by the UAV at the next moment. The bottom-level trajectory planning layer provides each UAV with collision-free speed commands and outputs speed control commands for the UAV based on the task selected by the top-level task allocation layer, enabling it to successfully reach the target point. During training, the QMIX-based task allocation algorithm and the MADDPG-based trajectory planning algorithm are trained synchronously and influence each other. The input of the bottom-level trajectory planning network depends on the output of the top-level task allocation network. At the same time, the top-level task allocation network does not directly interact with the environment but interacts with the environment through the output of the bottom-level trajectory planning network. The state and reward of the top-level network depend on the output of the bottom-level network. By maintaining the coupling relationship between task allocation and trajectory planning during training, the correlation between the two processes is implicitly learned as the two network models are updated.
2. The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning according to claim 1, characterized in that the Markov decision process models for UAV swarm task allocation and trajectory planning each include an observation space, an action space, and a reward function.
3. The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning according to claim 2, characterized in that the Markov decision process model for task allocation includes an observation space encompassing the UAV's current location, attack capabilities, and the locations and attack requirements of all task points; the UAV's action is defined as the task sequence number selected at each moment; and the reward function $r_T$ consists of two parts, trajectory cost and attack penalty, as shown in the following formula: $r_T = \delta_1 r_d + \delta_2 r_c$. In the formula: $\delta_1$ and $\delta_2$ are constants, $r_d$ is the trajectory cost, and $r_c$ is the attack penalty. The observation space in the Markov decision process model for trajectory planning includes the UAV's position information, velocity information, target point position information, and information from external sensors. The action space includes the drone's speed commands in the x and y directions at the next moment. The reward function $r_P$ includes transition rewards, success rewards, and collision penalties, as shown in the following formula: $r_P = r_1 + r_2 + r_3$. In the formula: $r_1$ is the transition reward, $r_2$ is the success reward, and $r_3$ is the collision penalty.
4. The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning according to claim 1, characterized in that the QMIX-based task allocation algorithm comprises two networks, a Q-network and a Mix network. The Q-network takes the UAV's task allocation observation as input and outputs the number of the task selected at the next time step; it contains two 200-dimensional hidden layers with the ReLU activation function. The Mix network takes the joint observations of all UAVs and the tasks selected by all UAVs as input and outputs a weighted sum of the Q-network outputs of all UAVs; it likewise contains two 200-dimensional hidden layers with the ReLU activation function. Training uses centralized training with distributed execution and proceeds as follows: the local states of each UAV at the current and next time steps are input into the Q-network to obtain the action values at the current and next time steps, respectively; the outputs of all Q-networks and the global states at the current and next time steps are input into the Mix network, whose outputs are the predicted global values at the current and next time steps, respectively; following the TD error, the global return at the current time step is computed from the reward and the predicted global value at the next time step; the mean squared error between this global return and the predicted global value at the current time step is taken as the loss function, and the Q-network and Mix network are updated by minimizing it.
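The training step above can be sketched numerically for a single transition. This is a simplified illustration, not the claimed implementation: the mixing weights and bias are passed in directly (in QMIX they are produced from the global state), taking their absolute value is the usual QMIX device for keeping the mixed value monotonic in each agent's value, and target networks are omitted. All names are hypothetical.

```python
def mix(agent_qs, weights, bias):
    # Weighted sum of per-UAV values; abs() keeps every weight non-negative
    # so the mixed value never decreases when one UAV's value increases.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias

def qmix_td_loss(q_now, actions, q_next, reward,
                 weights_now, bias_now, weights_next, bias_next,
                 gamma=0.99, done=False):
    # Value of the task each UAV actually selected at the current step.
    chosen = [q[a] for q, a in zip(q_now, actions)]
    q_tot = mix(chosen, weights_now, bias_now)
    # Greedy per-UAV values at the next step, mixed into the next global value.
    greedy_next = [max(q) for q in q_next]
    q_tot_next = mix(greedy_next, weights_next, bias_next)
    # Global return (TD target) and squared TD error.
    target = reward + (0.0 if done else gamma * q_tot_next)
    loss = (target - q_tot) ** 2
    return loss, q_tot, target
```

For example, with two UAVs and two tasks each, `qmix_td_loss([[1.0, 2.0], [0.5, 0.0]], [1, 0], [[0.0, 1.0], [3.0, 2.0]], 1.0, [1.0, -2.0], 0.5, [1.0, 1.0], 0.0, gamma=0.5)` mixes the chosen values to 3.5, forms a target of 3.0, and returns a loss of 0.25.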
5. The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning according to claim 1, characterized in that the MADDPG-based trajectory planning algorithm comprises an Actor network and a Critic network. The Actor network takes the UAV's trajectory planning observation as input and outputs the UAV's velocity control command. The Critic network takes the joint observations and velocity control commands of all UAVs as input and outputs an evaluation of the current action in the current state. Training uses centralized training with distributed execution and proceeds as follows: the global state at the next time step is input into the Critic network, whose output is the value at the next time step; following the TD error, the global return at the current time step is computed from the reward at the current time step and the value at the next time step; the global state at the current time step is input into the Critic network, whose output is the value at the current time step and serves as the predicted global value; the mean squared error between the global return and the predicted global value is taken as the loss function, and the Critic network is updated by minimizing it; the negative of the Critic network's output is taken as the loss function of the Actor network, and the Actor network is updated by minimizing it.
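The two losses described above can be sketched for one transition with the networks passed in as plain callables. This is an illustration under assumptions: `target_critic` and `target_actors` follow the standard MADDPG use of target networks, which the claim does not mention (the same objects as `critic` and `actors` may be passed instead), and all names are hypothetical.

```python
def critic_td_loss(critic, target_critic, target_actors,
                   obs, actions, reward, next_obs, gamma=0.95, done=False):
    # Next-step commands from each UAV's policy, then the next-step value.
    next_actions = [pi(o) for pi, o in zip(target_actors, next_obs)]
    next_value = target_critic(next_obs, next_actions)
    # Global return (TD target) from the current reward and the next value.
    target = reward + (0.0 if done else gamma * next_value)
    # Predicted global value for the joint observation and joint command.
    value = critic(obs, actions)
    return (target - value) ** 2

def actor_loss(critic, actors, agent, obs, actions):
    # Swap in the command the current policy of `agent` would issue, keep the
    # other UAVs' stored commands, and score the joint command with the critic.
    new_actions = list(actions)
    new_actions[agent] = actors[agent](obs[agent])
    # Minimizing the negated critic output maximizes the estimated value.
    return -critic(obs, new_actions)
```

In a real training loop both losses would be averaged over a replay batch and minimized by gradient descent on the network parameters; here they are only evaluated.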
6. The UAV swarm task planning algorithm based on hierarchical multi-agent deep reinforcement learning according to claim 1, characterized in that the QMIX-based task allocation algorithm ensures that the locally optimal action selected by each UAV is exactly a component of the globally optimal joint action. By minimizing the loss function, the target global action-value function corresponding to the optimal policy is learned. The global action-value function is shown in the following equation: [equation not reproduced], whose terms are, in order: the loss function of the Mix network; the global value at the current time step; the global value at the next time step; the reward at the current time step; the current state and action and the next state and action; the network parameters; and the discount factor.
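For reference, the standard QMIX temporal-difference loss matches the list of terms in this claim. The symbols below are the conventional ones and are assumptions, since the claim's own notation is not available here; in particular the target parameters \theta^{-} are standard QMIX practice rather than something the claim states.

```latex
\mathcal{L}(\theta) = \bigl( y^{\text{tot}} - Q_{\text{tot}}(s, a; \theta) \bigr)^{2},
\qquad
y^{\text{tot}} = r + \gamma \max_{a'} Q_{\text{tot}}(s', a'; \theta^{-})

% Consistency between local and global optima, guaranteed when
% \partial Q_{\text{tot}} / \partial Q_i \ge 0 for every UAV i:
\arg\max_{a} Q_{\text{tot}}(s, a) =
\bigl( \arg\max_{a_1} Q_1(o_1, a_1), \dots, \arg\max_{a_N} Q_N(o_N, a_N) \bigr)
```

Here Q_tot(s, a; θ) is the global value at the current time step, Q_tot(s', a'; θ⁻) the global value at the next time step, r the reward, γ the discount factor, and Q_i the per-UAV value computed from local observation o_i.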