A multi-unmanned aerial vehicle cooperative search path planning method based on three-dimensional motion decoupling

CN122308393A · Pending · Publication Date: 2026-06-30 · NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
Filing Date
2026-03-19
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing multi-UAV collaborative search trajectory planning technologies suffer from inefficient learning over coupled actions, insufficient modeling of target uncertainty, an imbalance between exploration and exploitation, and poor multi-UAV collaboration in 3D environments, making it difficult to plan paths and discover targets efficiently and safely in complex 3D scenes.

Method used

A multi-agent reinforcement learning algorithm based on three-dimensional action decoupling is adopted. It combines a three-dimensional action decoupling Actor network, a non-sparse reward function, and a dynamic noise attenuation mechanism. The Actor network controls the horizontal motion and the vertical altitude of the UAV through decoupled outputs, the exploration intensity is adjusted according to the target confidence, and a non-sparse multi-dimensional reward function is constructed to accelerate policy learning.

Benefits of technology

The method improves the target detection rate, reduces energy consumption and collision rate, and enhances the collaborative search efficiency and robustness of UAV swarms in complex 3D environments.

✦ Generated by Eureka AI based on patent content.

Figure CN122308393A_ABST

Abstract

This invention discloses a multi-UAV cooperative search trajectory planning method based on 3D action decoupling. It addresses the core problems of multi-UAV cooperative search in unknown 3D environments: inefficient learning over coupled 3D actions, insufficient target uncertainty modeling, an imbalance between exploration and exploitation, and poor multi-UAV cooperation. The invention models the multi-UAV cooperative search process as a partially observable Markov decision process, designs a 3D action decoupling Actor network to achieve decoupled control of horizontal-plane motion and vertical altitude motion, constructs a non-sparse multi-dimensional reward function to provide dense feedback for policy learning, and introduces a dynamic noise attenuation mechanism based on target confidence to allocate exploration resources spatially and to complete the iterative optimization of the cooperative search strategy. The method improves the target detection rate, reduces flight energy consumption and collision rate, and adapts to multi-UAV cooperative search trajectory planning in complex, unknown 3D environments.

Description

Technical Field

[0001] This invention relates to the field of multi-UAV cooperative control and trajectory planning technology, and in particular to a multi-UAV cooperative target search trajectory planning method and system based on three-dimensional action decoupling for three-dimensional unknown environments.

Background Technology

[0002] In recent years, multi-UAV collaborative systems have become a research hotspot because of their efficiency and robustness in mission execution, and they have been widely applied in scenarios such as agricultural monitoring, industrial inspection, disaster perception, and emergency search and rescue. In cooperative target search, the mission area and the target distribution are unknown, so the UAVs must explore step by step and discover targets by following the system's search strategy. In practice, UAV platforms have limited energy and the mission has strict response-time requirements; at the same time, the mission area may contain potential threats such as obstacles. In target search missions in unknown environments, UAV swarms therefore need to plan flight paths autonomously under energy and safety constraints to complete area detection, target search, and identification, and they achieve higher search efficiency and environmental adaptability than single-UAV operations.

[0003] However, current research largely focuses on path planning in two-dimensional discrete space, and existing multi-UAV cooperative search trajectory planning techniques still have significant limitations in practice. First, few works fully consider the complexity and constraints of the three-dimensional continuous flight environment, so they cannot meet the practical need of low-altitude UAVs to adjust altitude dynamically and avoid obstacles in three-dimensional space. Second, in real-world scenarios airborne sensors are subject to noise and detection errors, so target recognition results are inherently uncertain; existing methods do not model this uncertainty effectively, and missed or false detections easily reduce search efficiency. Third, traditional reinforcement learning-based trajectory planning methods output planar motion and altitude control in three-dimensional space as a single coupled action, which causes an explosion in the dimensionality of the action space, low policy learning efficiency, and slow convergence, making it difficult to meet the real-time decision-making requirements of multi-UAV cooperation. Fourth, existing exploration strategies mostly use fixed or monotonically decreasing action noise, which cannot adjust the exploration intensity according to the current knowledge of the environment; this easily results in insufficient exploration of unknown areas and excessive perturbation in already detected areas, making it difficult to balance exploration and exploitation. Fifth, traditional single-agent reinforcement learning algorithms lack multi-UAV cooperation mechanisms, while existing multi-agent algorithms suffer from large convergence fluctuations and weak generalization in three-dimensional unknown environments, with significant performance degradation when the number of targets or the scale of obstacles changes. How to plan UAV paths dynamically in a three-dimensional continuous flight environment, so as to avoid potential threat areas while maximizing the number of targets detected, has therefore become a key challenge in this field.

Summary of the Invention

[0004] Technical solution: To achieve the above-mentioned objectives, the present invention adopts the following technical solution:

[0005] In a first aspect, the present invention provides a multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling, comprising the following steps:

[0006] Based on a collaborative search-oriented UAV swarm system architecture, a system model is established, including an airborne environmental perception model, a UAV dynamics model, an air-to-air communication model, and a target confidence probability update model, providing a theoretical foundation for subsequent algorithm design. The multi-UAV collaborative search process is modeled as a partially observable Markov decision process, transforming the 3D trajectory planning problem into a sequential decision problem solvable through reinforcement learning. A multi-agent reinforcement learning algorithm based on 3D action decoupling is designed, whose Actor network comprises a fully connected backbone, a CNN spatial feature extraction branch, and dual action output branches, achieving decoupled control of UAV horizontal-plane motion and vertical altitude actions. A non-sparse multi-dimensional reward function is constructed to provide dense feedback signals for policy learning, accelerating algorithm convergence. A dynamic noise attenuation mechanism based on target confidence is introduced together with a hybrid exploration strategy that combines Gaussian noise and OU noise; it adaptively adjusts the action noise intensity according to the regional target confidence to achieve a spatially optimized allocation of exploration resources.

[0007] Furthermore, the airborne environment perception model includes calculating the target detection probability using a camera's depth perception model, dynamically adjusting low-probability areas based on flight altitude, quantifying the target confidence probability using a Bayesian update method, and generating a global target confidence probability map through multi-aircraft information fusion.

[0008] Furthermore, the UAV dynamics model employs three-dimensional dynamic equations to describe the UAV's motion state. The three-dimensional action decoupling Actor network structure consists of a fully connected backbone, parallel convolutional neural network branches, and two lightweight action output branches. The fully connected backbone extracts UAV state information and non-spatial global information, while the convolutional branches capture the spatial context features of the environmental map. After concatenation and fusion, these two components are input into the two independent action output branches, which are used to generate horizontal motion control commands and vertical height adjustment amounts, respectively.

[0009] Furthermore, the non-sparse multidimensional reward function design is divided into five categories: search-driven reward positively correlated with area exploration coverage and negatively correlated with UAV energy consumption; target discovery reward associated with target confidence; threat avoidance penalty when facing no-fly zone threats; distance maintenance reward to avoid collisions and communication interruptions; and penalty for UAVs flying out of the search area boundary.

[0010] Furthermore, the dynamic noise attenuation mechanism combines Gaussian noise and Ornstein-Uhlenbeck noise to form a hybrid mechanism. The dynamic noise attenuation mechanism based on target confidence is as follows: a target confidence threshold is preset. When the regional target confidence is lower than the threshold, the noise attenuation rate is slowed down to maintain strong exploration capability; when the regional target confidence is higher than the threshold, the noise attenuation is accelerated to promote strategy convergence.

[0011] In a second aspect, the present invention provides a multi-UAV cooperative search trajectory planning system based on three-dimensional action decoupling, comprising:

[0012] The cognitive modeling module is used to construct a three-dimensional unknown search environment, generate target and no-fly zone distribution models, construct UAV three-dimensional dynamic equations, sensor observation models, communication models and Bayesian-based target confidence update models, and initialize environmental parameters and task constraints.

[0013] The collaborative decision-making module is used to model the collaborative search problem as a partially observable Markov decision process, defining the state space and action space. It has a built-in trained 3D action decoupling Actor network to receive local observation information from UAVs and output distributed trajectory control actions for horizontal plane motion and vertical height adjustment, thereby realizing multi-UAV collaborative trajectory planning.

[0014] The reinforcement learning module is used to construct the Actor-Critic network framework, design reward functions and exploration strategies, utilize the Actor network to fuse full observations and spatial features, decouple output planar motion and height adjustment actions, complete centralized training based on the experience replay pool, iteratively update network parameters, and optimize the cooperative search strategy.

[0015] The exploration optimization module uses an action noise adaptive decay mechanism based on target confidence to dynamically adjust the exploration intensity, enabling online adaptive updating and training optimization of policies in a multi-machine collaborative environment.

[0016] In a third aspect, the present invention provides a computer system including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling.

[0017] In a fourth aspect, the present invention provides a computer program product, including a computer program, which, when executed by a processor, implements the steps of the multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling.

[0018] Beneficial Effects: This invention proposes a multi-UAV cooperative target search algorithm (AD-MARL) based on 3D action-decoupled multi-agent reinforcement learning. The algorithm enhances policy representation and the learnability of the action space through a 3D action decoupling Actor network; it introduces an action noise attenuation mechanism based on target confidence to allocate exploration resources reasonably in space; and it designs a non-sparse reward function to accelerate the policy learning process. The algorithm performs well on core indicators such as target discovery rate, normalized average energy consumption, collision rate, and search efficiency. In complex 3D scenarios with no-fly zones, communication distance limitations, and collision avoidance constraints, the algorithm can guide UAV swarms to adjust flight altitude and path adaptively, explore unknown areas efficiently, and avoid threats. Furthermore, it demonstrates strong generalization ability and robustness under different environmental complexities and constraints.

Attached Figure Description

[0019] Figure 1 is a schematic diagram of the UAV depth perception model in an embodiment of the present invention.

[0020] Figure 2 is a schematic diagram of the AD-MARL algorithm framework in an embodiment of the present invention.

[0021] Figure 3 compares the convergence performance curves of the embodiment of the present invention with those of the DDPG and MADDPG algorithms.

[0022] Figure 4 compares the target discovery rates of the embodiment of the present invention and the comparison algorithms.

[0023] Figure 5 compares the average collision rates of the embodiment of the present invention and the comparison algorithms.

[0024] Figure 6 compares the normalized energy consumption of the embodiment of the present invention and the comparison algorithms.

Detailed Implementation

[0025] To make the technical problems, technical solutions, and beneficial effects of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

[0026] With reference to the scenario shown in Figure 1, this invention discloses a multi-UAV cooperative search trajectory planning method based on 3D action decoupling. To address the core problems of existing methods in multi-UAV target search tasks under unknown 3D environments, namely inefficient learning over coupled 3D actions, insufficient target uncertainty modeling, an imbalance between exploration and exploitation, and poor multi-UAV collaboration, this invention proposes the AD-MARL algorithm. Through a 3D action decoupling network, a non-sparse reward function, an adaptive exploration mechanism, and a centralized-training, distributed-execution framework, it achieves efficient, safe, and collaborative trajectory planning for multiple UAVs in unknown environments.

[0027] The cooperative search method described in this invention first establishes a system model encompassing UAV dynamics, environmental perception, communication constraints, and target uncertainty modeling, providing a theoretical foundation for subsequent algorithm design. Based on this, the multi-UAV cooperative search process is modeled as a partially observable Markov decision process (POMDP), and a non-sparse reward function is constructed, incorporating multi-dimensional guidance signals such as target discovery, obstacle avoidance, inter-UAV collision avoidance, and communication connectivity maintenance, to provide dense feedback and promote efficient strategy convergence. Subsequently, a three-dimensional action decoupling Actor network structure is adopted, fusing full UAV observation information with spatial map features to achieve decoupled control output for horizontal plane motion and vertical altitude actions. Simultaneously, a dynamic noise attenuation mechanism based on target confidence is introduced to dynamically adjust the exploration intensity and achieve a rational spatial allocation of exploration resources. Finally, the network is trained and its parameters updated using a multi-agent reinforcement learning framework with centralized training and distributed execution, continuously optimizing the cooperative trajectory planning strategy of the UAV swarm, thereby maximizing the number of targets discovered while effectively avoiding potential threat areas. Simulation results show that the AD-MARL algorithm maintains a target discovery rate of over 80% in all test scenarios. When the number of targets reaches 15, its target discovery rate is improved by 8.5% and 13.2% compared to the MADDPG algorithm and the DDPG algorithm, respectively.

[0028] The specific steps of the embodiments of the present invention will be described in detail below with reference to a specific system model.

[0029] Step (1) establishes a multi-UAV system model, including a UAV 3D dynamics model, a sensor perception model, a target confidence update model, and an air-to-air communication model. It includes the following specific steps:

[0030] (1a) Establish a three-dimensional dynamic model of the UAV. Assume the UAV swarm performs its task in three-dimensional space, and let the position coordinates of each UAV be defined at time t. The motion state of the UAV is controlled through the horizontal flight acceleration components and a control signal for the change in height:

[0031]

[0032] where $\Delta t = 0.2$ s is the planning step size. The UAV's horizontal coordinates are updated from its velocity and acceleration, while in the vertical direction the altitude is updated directly by adding the height change.
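The update described in (1a) can be sketched as follows. This is only an illustration, assuming a standard constant-acceleration discretization on the horizontal axes; the patent's own equation is not reproduced in this text, and the names `UavState` and `step` are hypothetical.

```python
from dataclasses import dataclass

DT = 0.2  # planning step size in seconds, as stated in the text


@dataclass
class UavState:
    x: float   # horizontal position (m)
    y: float
    z: float   # altitude (m)
    vx: float  # horizontal velocity (m/s)
    vy: float


def step(s: UavState, ax: float, ay: float, dh: float, dt: float = DT) -> UavState:
    """Advance one planning step.

    Assumption: the horizontal axes follow constant-acceleration kinematics,
    and the altitude is updated directly by the commanded height change dh.
    """
    return UavState(
        x=s.x + s.vx * dt + 0.5 * ax * dt * dt,
        y=s.y + s.vy * dt + 0.5 * ay * dt * dt,
        z=s.z + dh,
        vx=s.vx + ax * dt,
        vy=s.vy + ay * dt,
    )
```

Keeping the altitude out of the velocity and acceleration chain is what lets the vertical command be learned as a separate, one-dimensional output.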

[0033] (1b) Establish an airborne environmental perception model. The UAV is equipped with a pinhole camera sensor, and the detection field of view radius is positively correlated with the flight altitude, expressed as:

[0034]

[0035] where CSS is the camera sensor size and FL = 40 mm is the lens focal length. The target detection probability is defined as $p_d \in [0.5, 1]$ and the false alarm probability as $p_f \in [0, 0.5)$; the detection accuracy at different heights is corrected by the depth perception model. When the detection probability is lower than the threshold, the average probability of 0.5 is used instead, so that low-confidence detection results do not interfere with decision-making.
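A rough sketch of the perception model in (1b) follows. The patent's expression for the field-of-view radius is not reproduced in this text, so the similar-triangles pinhole relation below is an assumption; the function names and the threshold argument are hypothetical.

```python
FL_MM = 40.0  # lens focal length in mm, as stated in the text


def fov_radius(height_m: float, css_mm: float, fl_mm: float = FL_MM) -> float:
    """Ground footprint radius under a pinhole model (assumed form).

    Similar triangles give footprint width = height * CSS / FL;
    the radius is taken as half of that width, so it grows with altitude.
    """
    return height_m * css_mm / (2.0 * fl_mm)


def effective_detection_prob(p_d: float, threshold: float) -> float:
    """Replace a detection probability below the threshold with 0.5."""
    return p_d if p_d >= threshold else 0.5
```

A detection probability of 0.5 carries no information in a Bayesian update, which is why it is a natural stand-in for readings that are too unreliable to use.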

[0036] (1c) Establish a target confidence probability update model. A Bayesian update method is used to quantify the confidence of a target existing within a grid cell. A global target confidence probability is defined, with an initial value of 0.5. The confidence probability increases when the UAV detects a target and decreases when it does not. To simplify calculations, a nonlinear transformation is applied to the confidence probability:

[0037]

[0038] where the probability in the expression is the target confidence probability that UAV $i$ obtains for cell $C_{x,y}$ from its own observations. Each UAV independently maintains a target confidence probability map, and a weighted fusion strategy based on the number of searches fuses the detection information of all UAVs into a global target confidence probability map.
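The confidence update in (1c) can be illustrated with a standard Bayesian occupancy update. The patent's specific nonlinear transformation is not reproduced in this text; the log-odds form below is a common choice and is an assumption here, as are the function names and the exact weighting used in `fuse`.

```python
import math


def bayes_update(prior: float, detected: bool, p_d: float, p_f: float) -> float:
    """Posterior probability that a target occupies the cell."""
    if detected:
        like_target, like_empty = p_d, p_f
    else:
        like_target, like_empty = 1.0 - p_d, 1.0 - p_f
    num = like_target * prior
    return num / (num + like_empty * (1.0 - prior))


def log_odds(p: float) -> float:
    """Nonlinear transform under which the Bayes update becomes an addition."""
    return math.log(p / (1.0 - p))


def fuse(confidences, search_counts):
    """Search-count-weighted average of per-UAV confidences for one cell."""
    total = sum(search_counts)
    if total == 0:
        return 0.5  # no UAV has searched the cell: keep the initial value
    return sum(c * n for c, n in zip(confidences, search_counts)) / total
```

In log-odds form a detection adds `log(p_d / p_f)` to the cell value and a miss adds `log((1 - p_d) / (1 - p_f))`, which avoids repeated divisions when many observations arrive.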

[0039] (1d) Establish an air-to-air communication model. The communication channel between UAVs adopts a combined LoS / NLoS model, with the average path loss being:

[0040]

[0041] where $P_{LoS}(d_{ij})$ is the line-of-sight transmission probability, and $L_{LoS}$ and $L_{NLoS}$ are the line-of-sight and non-line-of-sight path losses, respectively. UAVs can exchange state information and detection data only when the distance between them is less than the communication radius of 100 m.
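As a sketch of the communication model in (1d): the average path loss expression is not reproduced in this text, so the probability-weighted mean below is an assumed (though common) form, and the helper names are hypothetical. Only the 100 m radius comes from the description.

```python
import math

COMM_RADIUS_M = 100.0  # communication radius stated in the text


def average_path_loss(p_los: float, l_los: float, l_nlos: float) -> float:
    """Probability-weighted mean of LoS and NLoS path loss (assumed form)."""
    return p_los * l_los + (1.0 - p_los) * l_nlos


def distance(a, b) -> float:
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def neighbors(positions, i, radius=COMM_RADIUS_M):
    """Indices of the UAVs within communication range of UAV i."""
    return [
        j for j, p in enumerate(positions)
        if j != i and distance(positions[i], p) < radius
    ]
```

The neighbor list is what determines which other UAVs contribute their distance, uncertainty map, and confidence map to the local observation.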

[0042] Step (2) models the collaborative search as a partially observable Markov decision process and designs an environmental uncertainty map and a multidimensional reward function. It includes the following specific steps:

[0043] (2a) Define the state space and action space. In this invention, the observation state of UAV i is:

[0044]

[0045] where $p_{uav,i}$ is the three-dimensional position of UAV $i$; a further component is the threat-zone observation of UAV $i$; $m_i$ is the search-area uncertainty map of UAV $i$, composed of the uncertainty measures $u_{x,y}$ of all cells $C_{x,y}$; the uncertainty of the whole search area is composed of the uncertainty maps of all UAVs, i.e., $s_m = [m_1, m_2, \ldots, m_N]$; and $\Gamma_i$ is the target confidence map of UAV $i$. $p_i^{nei}$, $m_i^{u,nei}$, and $\Gamma_i^{nei}$ denote, respectively, the distance information between UAV $i$ and its neighboring UAVs, the neighbors' search-area uncertainty maps, and the neighbors' target confidence maps. The action space of a single UAV $i$ is defined as:

[0046]

[0047] where the horizontal acceleration component controls the horizontal motion of UAV $i$ and plans the corresponding horizontal trajectory, and the altitude change component adjusts the corresponding vertical motion of the UAV.

[0048] (2b) Construct a map representation of environmental uncertainty. Each cell $C_{x,y}$ in the search area E has an associated uncertainty value for each UAV $i$, which expresses how uncertain UAV $i$ is about the targets within that cell. In this map, one limiting value indicates that the UAV has complete knowledge of the contents of cell $C_{x,y}$, and the other indicates that the UAV has no knowledge of them. Once the UAV has searched cell $C_{x,y}$ at time step t, the uncertainty associated with that cell is reduced at the uncertainty reduction rate λ, expressed as:

[0049]

[0050] where λ is the uncertainty attenuation coefficient. The largest uncertainty value means that cell $C_{x,y}$ is completely unknown to UAV $i$ at time step t; the value decreases as the UAV searches the cell repeatedly.
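A minimal sketch of the uncertainty map in (2b). The update expression is not reproduced in this text, so the multiplicative decay and the initial value of 1.0 for "completely unknown" are assumptions, and the function names are hypothetical.

```python
def make_uncertainty_map(width: int, height: int):
    """All cells start fully unknown (assumed value 1.0)."""
    return [[1.0 for _ in range(width)] for _ in range(height)]


def search_cells(u_map, cells, lam: float):
    """Scale the uncertainty of each searched cell by lam (0 < lam < 1).

    Assumption: multiplicative decay; unsearched cells are left unchanged.
    """
    for x, y in cells:
        u_map[y][x] *= lam
    return u_map
```

Under this assumed form a cell searched k times has uncertainty `lam ** k`, so repeated visits yield diminishing reductions in uncertainty.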

[0051] (2c) Design a non-sparse joint reward. To provide dense feedback to the UAV swarm, the following sub-rewards are considered. To guide the UAVs to explore the entire mission area and to encourage them to minimize energy consumption during mission execution, a search-driven reward is designed as follows:

[0052]

[0053] where the environmental uncertainty value of the grid cell is used as the reward. The reward is positively correlated with the area exploration coverage and negatively correlated with energy consumption, guiding the UAVs to explore unknown areas efficiently. To ensure the reliability of target discovery, a target discovery reward is designed:

[0054]

[0055] where α2 is a positive number, τ is the reliability threshold, and the probability in the expression is the confidence probability of element (x, y) in the global target confidence probability map at time t. To ensure the safe flight of the UAV, the threat avoidance reward is designed as follows:

[0056]

[0057] where C1 is a negative reward value and $d_z$ is the radius of the no-fly zone. To help the UAV swarm maintain communication while preventing collisions between UAVs, the distance maintenance reward is:

[0058]

[0059] where C2 is a positive reward value, $d_{com}$ is the maximum communication distance between UAVs, and $d_{safe}$ is the safe distance between UAVs. To prevent UAVs from flying out of the search area E, crossing the boundary incurs a penalty:

[0060]

[0061] C3 represents a negative reward value. In summary, the total reward obtained by drone i is:

[0062]
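The five sub-rewards in (2c) might be combined as in the sketch below. None of the reward expressions are reproduced in this text, so every functional form and every default coefficient here is a placeholder assumption (only the 100 m communication distance is taken from the description), and `total_reward` is a hypothetical name.

```python
def total_reward(
    cell_uncertainty,   # uncertainty of the cell searched this step
    energy,             # normalized energy spent this step
    confidence,         # global target confidence of the cell
    dist_to_nofly,      # distance to the nearest no-fly zone center (m)
    neighbor_dists,     # distances to neighboring UAVs (m)
    inside_area,        # True if the UAV is inside search area E
    *,
    alpha1=1.0, beta=0.1, alpha2=5.0, tau=0.8,
    c1=-10.0, c2=1.0, c3=-10.0,
    d_z=20.0, d_safe=5.0, d_com=100.0,
):
    """Sum of the five dense sub-rewards (assumed forms)."""
    r_search = alpha1 * cell_uncertainty - beta * energy      # explore, save energy
    r_target = alpha2 if confidence >= tau else 0.0           # reliable discovery
    r_threat = c1 if dist_to_nofly < d_z else 0.0             # no-fly zone penalty
    r_dist = sum(c2 for d in neighbor_dists if d_safe < d < d_com)  # spacing
    r_bound = 0.0 if inside_area else c3                      # boundary penalty
    return r_search + r_target + r_threat + r_dist + r_bound
```

Because at least the search term is nonzero on most steps, the agent receives feedback long before any target is actually found, which is the point of a non-sparse design.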

[0063] Step (3): Based on the above steps, construct a network framework based on 3D action decoupling and an action noise adaptive attenuation module; including the following specific steps:

[0064] (3a) A 3D motion decoupling Actor network is designed. The network consists of a fully connected backbone and parallel convolutional neural network branches. The UAV state information and local environment map information are input to the fully connected layer for processing to extract their unstructured motion and state features. The parallel convolutional neural network branches take the joint representation of the environment perception map and the target confidence probability map as image input, and capture the spatial context features in the environment map through convolution and pooling operations.

[0065] (3b) Decouple the action output from the fused features. A fusion layer concatenates the outputs of the two branches into a unified joint feature representation. This representation is then fed into two independent fully connected action output branches: the planar action branch generates the motion control command components in the horizontal plane, and the height action branch generates the height adjustment amount in the vertical direction, thereby decoupling horizontal motion from height control.
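The structure in (3a) and (3b) can be sketched without any deep learning library. The toy forward pass below is an illustration only: the layer sizes, the single-channel map, the single 3x3 convolution with global average pooling, the tanh output range, and the class name `DecoupledActor` are all assumptions, and no training is shown.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def dense(x, w, b, act=None):
    """Fully connected layer: y = act(W x + b)."""
    y = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [act(v) for v in y] if act else y


def conv3x3_gap(img, kernel):
    """One 3x3 valid convolution with ReLU, then global average pooling."""
    h, w = len(img), len(img[0])
    acc, n = 0.0, 0
    for r in range(h - 2):
        for c in range(w - 2):
            s = sum(kernel[i][j] * img[r + i][c + j]
                    for i in range(3) for j in range(3))
            acc += relu(s)
            n += 1
    return acc / n


def init(rows, cols, rng):
    """Small random weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DecoupledActor:
    def __init__(self, state_dim, n_kernels=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.w_fc, self.b_fc = init(hidden, state_dim, rng), [0.0] * hidden
        self.kernels = [init(3, 3, rng) for _ in range(n_kernels)]
        fused = hidden + n_kernels
        self.w_xy, self.b_xy = init(2, fused, rng), [0.0, 0.0]  # planar head
        self.w_z, self.b_z = init(1, fused, rng), [0.0]         # height head

    def forward(self, state, env_map):
        f_state = dense(state, self.w_fc, self.b_fc, relu)       # FC backbone
        f_map = [conv3x3_gap(env_map, k) for k in self.kernels]  # CNN branch
        fused = f_state + f_map                                  # concatenation
        a_xy = dense(fused, self.w_xy, self.b_xy, math.tanh)     # (ax, ay)
        dh = dense(fused, self.w_z, self.b_z, math.tanh)[0]      # height change
        return a_xy, dh
```

Both heads read the same fused features but have separate weights, so the horizontal command and the height adjustment are produced independently; the tanh outputs would still need scaling to physical limits.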

[0066] (3c) Establish a hybrid action noise mechanism. A hybrid mechanism combining Gaussian noise and OU noise is used as the exploration strategy, providing both spatial exploration capability and continuity of motion.

[0067]
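One way to picture the hybrid noise in (3c) is the sketch below. How the patent combines the two noise sources is not reproduced in this text; adding them is an assumption, as are all parameter values and the class name `HybridNoise`.

```python
import random


class HybridNoise:
    """Sum of temporally correlated OU noise and independent Gaussian noise."""

    def __init__(self, dim, theta=0.15, sigma_ou=0.2, sigma_g=0.1,
                 dt=0.2, seed=None):
        self.theta, self.sigma_ou, self.sigma_g = theta, sigma_ou, sigma_g
        self.dt = dt
        self.rng = random.Random(seed)
        self.ou = [0.0] * dim

    def reset(self):
        """Clear the OU state, e.g. at the start of an episode."""
        self.ou = [0.0] * len(self.ou)

    def sample(self, scale=1.0):
        # Euler-Maruyama step of dx = -theta * x * dt + sigma * dW (zero mean).
        self.ou = [
            x - self.theta * x * self.dt
            + self.sigma_ou * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
            for x in self.ou
        ]
        # Add independent Gaussian noise and apply the overall intensity.
        return [scale * (x + self.rng.gauss(0.0, self.sigma_g)) for x in self.ou]
```

The OU component changes slowly from step to step, which keeps the perturbed trajectory smooth, while the Gaussian component supplies uncorrelated perturbations; `scale` is where a decaying noise intensity would enter.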

[0068] Furthermore, a dynamic noise attenuation mechanism based on target confidence is proposed:

[0069]

[0070] where $\Gamma_{th}$ is the target confidence probability threshold, $\varepsilon_{base}$ is the base intensity of the action noise, and $\delta_{base}$ is the base attenuation coefficient. The intensity of the action noise decays exponentially with the number of training epochs $e$ and is modulated by $\delta$, i.e.:

[0071] $\varepsilon(e) = \max\left(\varepsilon_{\min},\ \varepsilon_{base} \cdot \exp(-\delta \cdot e)\right)$,

[0072] where $\varepsilon_{base}$ is the initial noise intensity and $\varepsilon_{\min}$ is the minimum noise intensity, which prevents the policy from getting stuck in a local optimum because the noise vanishes too early.
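The confidence-dependent decay can be sketched as follows. The rule that maps the regional target confidence to the attenuation coefficient δ is not reproduced in this text; scaling the base coefficient by a constant factor on each side of the threshold is an assumption, and the factor values, defaults, and function names are hypothetical.

```python
import math


def decay_rate(confidence, gamma_th, delta_base, slow=0.5, fast=2.0):
    """Attenuation coefficient chosen from the regional target confidence.

    Assumption: below the threshold the decay is slowed (keep exploring);
    at or above it the decay is accelerated (promote convergence).
    """
    return delta_base * (slow if confidence < gamma_th else fast)


def noise_intensity(epoch, confidence, gamma_th=0.5,
                    eps_base=1.0, eps_min=0.05, delta_base=0.005):
    """eps(e) = max(eps_min, eps_base * exp(-delta * e))."""
    delta = decay_rate(confidence, gamma_th, delta_base)
    return max(eps_min, eps_base * math.exp(-delta * epoch))
```

With these placeholder values, at epoch 200 a low-confidence region still receives noise of about 0.61 while a high-confidence region is down to about 0.14, so exploration effort shifts toward the less understood parts of the map.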

[0073] Step (4): Based on the above modeling, the AD-MARL region search procedure based on 3D action decoupling is as follows. First, a 200 m × 200 m target search region is initialized, together with the Actor main network, the dual Critic main networks, and the corresponding target networks for 3 UAVs. During training, the environment is randomly generated in each round. Each UAV outputs an action through its Actor network based on local observations, superimposes the adaptive hybrid noise and executes the result, obtains an immediate reward and the next observation, and stores the experience tuple in the replay pool. Once the number of stored samples exceeds the batch size, a batch is sampled at random and the target Q value is calculated using the target networks. The Critic networks are updated by minimizing the temporal-difference loss. After every 2 updates of the Critic networks, the Actor network is updated once according to the policy gradient, and the target network parameters are then synchronized through soft updates.
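The update schedule in Step (4) can be sketched independently of any network code. In the illustration below the three callbacks stand in for the actual Critic, Actor, and target-network updates, the soft-update coefficient `tau` is a hypothetical value, and `train` and `soft_update` are hypothetical names.

```python
def soft_update(target, source, tau=0.01):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    for k in target:
        target[k] = tau * source[k] + (1.0 - tau) * target[k]


def train(num_steps, update_critic, update_actor, sync_targets,
          buffer_size, batch_size, policy_delay=2):
    """Run the update schedule; returns (critic_updates, actor_updates)."""
    critic_updates = actor_updates = 0
    for _ in range(num_steps):
        buffer_size += 1               # one new experience tuple is stored
        if buffer_size <= batch_size:  # wait until enough samples exist
            continue
        update_critic()                # minimize TD loss on a sampled batch
        critic_updates += 1
        if critic_updates % policy_delay == 0:
            update_actor()             # delayed policy-gradient step
            sync_targets()             # soft update of the target networks
            actor_updates += 1
    return critic_updates, actor_updates
```

Updating the Actor less often than the Critics lets the value estimates settle before the policy is moved along their gradient.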

[0074] The following simulation experiments were conducted using Python 3.7 and PyTorch. The invention was also compared with traditional algorithms to verify its beneficial effects.

[0075] Figure 3 compares the average reward obtained during training by the embodiment of the present invention with that of the MADDPG and DDPG algorithms. The AD-MARL algorithm learns faster initially, with the average reward rising rapidly in the first 200 rounds and stabilizing after approximately 600 rounds, eventually converging to a higher reward value. In contrast, while the MADDPG algorithm has strong collaborative learning capabilities, its convergence process fluctuates significantly, and its final average reward is slightly lower than that of AD-MARL. The DDPG algorithm, lacking an explicit coordination mechanism among multiple agents, has lower learning efficiency and slower average reward growth, and it fails to improve effectively in later stages.

[0076] Figures 4 to 6 compare the target discovery rate, average collision rate, and normalized average energy consumption of the embodiment of the present invention with those of the DDPG and MADDPG algorithms in three different search scenarios. As the number of targets increases, the total number of targets discovered by each algorithm increases accordingly. The AD-MARL algorithm consistently maintains the highest target discovery rate. Although the increased target density in the search area places higher demands on the completeness of area coverage, leading to a lower target discovery rate for all algorithms than in the training environment, the AD-MARL algorithm maintains a target discovery rate above 80% in all test scenarios. When the number of targets is three times the number of training targets, its target discovery rate is improved by 8.5% and 13.2% compared to the MADDPG and DDPG algorithms, respectively. The AD-MARL algorithm also maintains the lowest average collision rate throughout the task, indicating that its cooperative trajectory planning strategy can effectively avoid close-range collisions between UAVs and keep swarm flight safe and stable. In terms of energy consumption, the normalized average energy consumption of the AD-MARL algorithm is reduced by 3.3% and 5.3% compared to the MADDPG and DDPG algorithms, respectively. These results support the advantages of the AD-MARL algorithm in target discovery rate, normalized average energy consumption, and collision rate.

[0077] Based on the description of the present invention, those skilled in the art should readily recognize that the multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling designed in this invention can effectively improve the target detection rate during the search process and achieve reasonable energy consumption.

[0078] Based on the same inventive concept, this invention also discloses a multi-UAV cooperative search trajectory planning system based on three-dimensional action decoupling, comprising: an environmental cognition modeling module, which constructs the underlying models of the three-dimensional unknown search environment, UAV dynamics, perception, communication, and target confidence update, and initializes the task constraints; a reinforcement learning training module, which is based on the Actor-Critic framework and the multi-dimensional reward function and uses an experience replay pool to complete centralized training and iterative optimization of the network parameters; a dynamic exploration optimization module, which introduces an action noise adaptive attenuation mechanism based on target confidence and adjusts the multi-UAV exploration intensity online to achieve strategy adaptation; and a distributed cooperative decision-making module, which models the search task as a partially observable Markov decision process (POMDP) and whose built-in three-dimensional action decoupling Actor network receives local observations, fuses spatial features, and outputs decoupled control commands for horizontal-plane motion and vertical height adjustment in a distributed manner, ultimately achieving efficient cooperative trajectory planning for multiple UAVs.

[0079] This invention also discloses a computer system, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the multi-UAV cooperative search trajectory planning method based on three-dimensional motion decoupling.

[0080] This invention also discloses a computer program product, including a computer program that, when executed by a processor, implements the steps of the multi-UAV cooperative search trajectory planning method based on three-dimensional motion decoupling.

[0081] The program code used to implement the method of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the steps of the method of the present invention to be performed. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server. All aspects not detailed in this invention are well-known to those skilled in the art.
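By way of example only, program code for the Bayesian target confidence update and the search-count-weighted fusion of the per-UAV confidence maps might take the following form in Python. The function names, the default detection probability `p_detect` and false alarm probability `p_false`, and the fallback to a plain average when no UAV has searched a cell are assumptions for illustration. The nonlinear transformation of the confidence probability is not reproduced here because its form is not stated in this section.

```python
def bayes_update(prior, detected, p_detect=0.9, p_false=0.1):
    """Posterior probability that a grid cell holds a target after one sensor reading.

    p_detect: assumed probability of a detection when a target is present.
    p_false:  assumed probability of a detection when no target is present.
    """
    if detected:
        num = p_detect * prior
        den = num + p_false * (1.0 - prior)
    else:
        num = (1.0 - p_detect) * prior
        den = num + (1.0 - p_false) * (1.0 - prior)
    # Keep the prior if the reading has zero likelihood under both hypotheses.
    return num / den if den > 0.0 else prior


def fuse_cell(confidences, search_counts):
    """Fuse per-UAV confidences for one cell, weighting each by its number of searches."""
    total = sum(search_counts)
    if total == 0:
        # No UAV has searched this cell yet: fall back to a plain average (assumption).
        return sum(confidences) / len(confidences)
    return sum(c * n for c, n in zip(confidences, search_counts)) / total


# One UAV starts from an uninformative prior and receives a detection.
after_hit = bayes_update(0.5, detected=True)
# Two UAVs disagree about a cell; the one that searched it more often dominates.
fused = fuse_cell([after_hit, 0.1], [3, 1])
```

Applying `fuse_cell` to every grid cell would yield a global target confidence probability map. In practice the detection and false alarm probabilities would themselves depend on flight altitude rather than being constants.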

Claims

1. A multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling, characterized in that it includes the following steps:
constructing a multi-UAV cooperative search system model, including a UAV three-dimensional dynamics model, an airborne environmental perception model, a target confidence probability update model, and an air-to-air communication model, providing an underlying theoretical foundation for trajectory planning;
modeling the multi-UAV cooperative search process in a three-dimensional unknown environment as a partially observable Markov decision process, defining the state space, observation space, and action space of the corresponding UAV cluster, and transforming the three-dimensional trajectory planning problem into a sequential decision problem that can be solved by reinforcement learning;
constructing a multi-agent reinforcement learning Actor-Critic network framework based on three-dimensional action decoupling, and designing a three-dimensional action decoupling Actor network that integrates UAV observation information and environmental spatial features and outputs horizontal motion control commands and vertical height adjustments separately, so as to realize decoupled decisions for UAV horizontal motion and height control;
constructing a non-sparse multi-dimensional reward function, including a search-driven reward, a target discovery reward, a threat avoidance penalty, a distance maintenance reward, and a boundary exceedance penalty, to provide dense feedback signals for policy learning in reinforcement learning and accelerate algorithm convergence;
designing an adaptive exploration mechanism based on target confidence, which combines Gaussian noise and OU noise to form a hybrid exploration strategy, wherein a dynamic noise attenuation mechanism adaptively adjusts the action noise intensity according to the regional target confidence to achieve spatially optimized allocation of exploration resources; and
executing the AD-MARL algorithm, based on a multi-agent reinforcement learning framework with centralized training and distributed execution, to complete the iterative update of network parameters and the optimization of the collaborative search strategy, wherein the trained Actor network outputs UAV trajectory control actions in a distributed manner to achieve collaborative search trajectory planning for multiple UAVs.

2. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the UAV three-dimensional dynamics model controls the horizontal trajectory of the UAV through horizontal flight acceleration components and directly updates its vertical altitude through altitude change control signals; the detection field-of-view radius of the onboard camera depth perception model is positively correlated with flight altitude; and the perception model dynamically adjusts the detection accuracy of low-probability areas based on flight altitude, dynamically corrects the target detection probability and false alarm probability, and uses the average probability to replace results below the detection probability threshold.

3. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the target confidence probability update model uses a Bayesian update method to quantify the credibility of the presence of a target within a grid cell and performs a nonlinear transformation on the confidence probability; each UAV independently maintains a target confidence probability map, and a weighted fusion strategy based on the number of search attempts is introduced to generate a global target confidence probability map.

4. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the observation state of a single UAV within the observation space includes the UAV's own three-dimensional coordinate position, threat zone observation information, a local search area uncertainty map, and a local target confidence map, as well as the distance information, uncertainty maps, and target confidence maps of neighboring UAVs; the uncertainty map of the search area describes, through an uncertainty decay coefficient, how well the UAV knows the information of each grid cell, and the uncertainty value of a grid cell gradually decreases as the UAV searches it repeatedly.

5. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the three-dimensional action decoupling Actor network consists of a fully connected backbone, a parallel convolutional neural network branch, and two independent motion output branches; the fully connected backbone is used to extract UAV state information and non-spatial global information, while the parallel convolutional neural network branch is used to capture the spatial context features of the environmental cognition map and the target confidence probability map; after the two types of features are concatenated and fused, they are input to each of the two independent motion output branches, wherein the planar motion branch generates motion control commands in the horizontal plane and the altitude motion branch generates the vertical altitude adjustment amount.

6. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the non-sparse multi-dimensional reward function consists of a search-driven reward, a target discovery reward, a threat avoidance penalty, a distance maintenance reward, and a boundary exceedance penalty; the search-driven reward is positively correlated with regional environmental uncertainty and exploration coverage, and negatively correlated with UAV energy consumption; the target discovery reward is set based on the global target confidence probability and a preset reliability threshold; the threat avoidance penalty is a negative reward given when the UAV enters the radius of a no-fly zone; the distance maintenance reward is a positive reward given when the UAV meets the communication distance constraints and the inter-UAV safety distance constraints; the boundary exceedance penalty is a negative reward given when the UAV flies out of the preset search area; and the total reward for a single UAV is the sum of the above-mentioned sub-rewards.

7. The multi-UAV cooperative search trajectory planning method based on three-dimensional action decoupling according to claim 1, characterized in that the dynamic noise attenuation mechanism based on target confidence presets a target confidence threshold; when the regional target confidence is lower than the threshold, the noise attenuation rate is slowed down to maintain strong exploration capability; when the regional target confidence is higher than the threshold, noise attenuation is accelerated to promote policy convergence; the intensity of the action noise decreases exponentially with the number of training rounds; and a minimum noise intensity is set to prevent the noise from vanishing too early and causing the strategy to become stuck in a local optimum.

8. A multi-UAV cooperative search trajectory planning system based on three-dimensional action decoupling, implementing the method of any one of claims 1-7, characterized in that it includes:
a cognitive modeling module, which constructs a three-dimensional unknown search environment, generates a target and no-fly zone distribution model, and constructs and initializes the UAV three-dimensional dynamic equations, sensor observation model, communication model, and Bayesian-based target confidence update model;
a collaborative decision module, which models the collaborative search problem of multiple UAVs as a partially observable Markov decision process, has a built-in trained three-dimensional action decoupling Actor network, receives local observation information from the UAVs, and outputs distributed trajectory control actions for horizontal plane motion and vertical height adjustment;
a reinforcement learning module, which is used to build the Actor-Critic network framework, design the non-sparse multi-dimensional reward function, complete centralized training based on the experience replay pool, and iteratively update network parameters to optimize the collaborative search strategy; and
an exploration and optimization module, which is used to implement the adaptive noise decay mechanism based on target confidence, dynamically adjust the exploration intensity, and achieve online adaptive updates and training optimization of the policy.

9. A computer system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, it implements the steps of the method according to any one of claims 1-7.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-7.
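For illustration only, and without limiting the claims, the confidence-based dynamic noise attenuation recited in claim 7 could be sketched in Python as follows. The function names, the threshold value, the slow and fast decay rates, the initial intensity, and the minimum noise intensity are hypothetical values chosen for the example. The sketch covers only the intensity schedule and does not generate the Gaussian or OU noise samples themselves.

```python
import math


def decay_noise(sigma, confidence, threshold=0.6,
                slow_rate=0.001, fast_rate=0.01, sigma_min=0.05):
    """Return the action noise intensity for the next training round.

    Below the confidence threshold the decay is slow, preserving exploration;
    at or above it the decay is fast, promoting convergence. The result is
    never allowed to fall under `sigma_min`. All numeric defaults are assumptions.
    """
    rate = slow_rate if confidence < threshold else fast_rate
    return max(sigma_min, sigma * math.exp(-rate))


def schedule(confidence, rounds, sigma0=1.0):
    """Apply the decay for `rounds` training rounds at a fixed regional confidence."""
    sigma = sigma0
    for _ in range(rounds):
        sigma = decay_noise(sigma, confidence)
    return sigma


low_conf_sigma = schedule(confidence=0.2, rounds=100)   # poorly explored region
high_conf_sigma = schedule(confidence=0.9, rounds=100)  # well-characterized region
```

With these assumed rates, a region whose target confidence stays below the threshold retains a larger noise intensity after the same number of rounds than a region above it, and the intensity stops decreasing once it reaches the minimum. In training, the regional confidence would change from round to round, so the decay rate would switch accordingly.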