Multi-UAV target search method based on self-attention and reinforcement learning
By introducing a self-attention mechanism into the MADDPG algorithm and combining a target probability map with a Voronoi diagram, the UAV search strategy is optimized. This addresses the low target search efficiency of UAV swarms in complex environments and enables efficient dynamic target search.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING TECH UNIV
- Filing Date
- 2025-06-12
- Publication Date
- 2026-06-26
AI Technical Summary
Existing multi-agent reinforcement learning methods struggle to use global information effectively when UAV swarms search for targets in complex environments, and are inefficient in dynamic target search, especially when obstacles are dense and targets are moving rapidly.
A multi-agent deep deterministic policy gradient algorithm with a self-attention mechanism (SA-MADDPG) and a region partitioning strategy combining a target probability map (TPM) with a Voronoi diagram are introduced to optimize the UAV search strategy, improve perception and collaborative decision-making capabilities, and reduce redundant coverage.
It improves the search efficiency and adaptability of UAV swarms in complex environments, effectively handles dynamic targets, is applicable to UAV swarms of different sizes, and enhances the system's deployability and application scope.
Figure CN120669757B_ABST
Abstract
Description
Technical Field
[0001] The invention relates to a multi-UAV target search method based on self-attention and reinforcement learning, belonging to the field of artificial intelligence and UAV swarm intelligence control.
Background Technology
[0002] In recent years, the application of drones in military, search and rescue, and environmental monitoring fields has been increasing, especially in search and reconnaissance missions. With their high flight speed, strong communication capabilities, and ability to operate regardless of terrain, drones are particularly well-suited for performing these tasks in complex and dangerous environments. As technology continues to advance, drones are demonstrating enormous potential in exploring uncharted territories.
[0003] Multi-UAV cooperative target search refers to UAVs using onboard sensors to detect target areas and sharing information through communication networks to collaboratively execute tasks, thereby significantly reducing search time. Deep reinforcement learning (DRL) has been widely applied in multi-agent cooperative tasks due to its self-learning and adaptive capabilities. In single-agent systems, methods based on shared experience (such as Deep Q-Networks, DQN) have been successfully applied to tasks such as target search and path planning. However, as task complexity increases, single-agent methods are insufficient to meet the needs of multi-UAV cooperation. Therefore, multi-agent reinforcement learning (MARL) methods have emerged, especially frameworks with centralized training and distributed execution (such as the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm), which are widely used in cooperative decision-making and target search in multi-UAV systems.
[0004] While multi-agent reinforcement learning (MARL) methods have made significant progress in target search within UAV swarms, dynamic target search in complex environments still faces several challenges. First, the algorithms perform poorly in environments with complex and dense obstacles. As obstacle density and UAV swarm size increase, capturing global and local environmental features becomes increasingly difficult. Current MARL algorithms primarily rely on neural networks such as multilayer perceptrons and convolutional neural networks, which struggle to strike a balance between scalability and utilization of global information. Second, search efficiency remains insufficient when dealing with dynamic targets, especially when the target's initial position is unknown. Existing methods typically transform target search into a target tracking problem; however, these methods perform poorly when targets are moving rapidly, and probability-based search strategies are inefficient when targets deviate from the predicted path.
Summary of the Invention
[0005] To address the aforementioned problems in the existing technology, the present invention has made the following improvements:
[0006] First, a self-attention mechanism is introduced into the MADDPG algorithm to improve the perception and collaborative decision-making capabilities of UAVs. This mechanism enables each UAV to prioritize relevant spatial features, such as the positions of obstacles and teammates, thereby enhancing its adaptability in complex environments.
[0007] Second, by combining the target probability map (TPM) with a Voronoi diagram-based region partitioning strategy, an exploration incentive mechanism is proposed to ensure efficient distribution of UAVs, reduce redundant coverage, and promote priority search of high-probability target areas, ultimately improving search efficiency in dynamic target environments.
[0008] Specifically, this invention is a multi-UAV cooperative dynamic target search method based on multi-agent deep reinforcement learning. This method transforms the target search task of multiple UAVs into a multi-agent cooperative problem, and optimizes the UAV search strategy under the framework of centralized training and decentralized execution.
[0009] To address the shortcomings of existing methods in complex environments, this invention introduces a self-attention mechanism on top of the traditional MADDPG, enabling each UAV to autonomously identify and prioritize key spatial features relevant to target search, thus improving the adaptability and search efficiency of UAVs in dynamic, obstacle-filled environments. In addition, this invention combines a target probability map (TPM) with a Voronoi diagram-based search region partitioning strategy, avoiding redundant coverage during the search process and supporting the collaborative search capabilities of the UAV swarm under different environmental conditions.
[0010] The implementation steps of the method of the present invention include:
[0011] Step S1: Discretize the search area into a grid, construct a target probability map TPM to represent the distribution probability of the target in the environment, and dynamically update the TPM using a Bayesian inference method;
[0012] Step S2: Train the multi-UAV search strategy based on the improved MADDPG algorithm and make decentralized decisions during the task execution phase;
[0013] In step S2, the improved MADDPG is obtained by using a self-attention mechanism to optimize the observation information based on MADDPG to obtain SA-MADDPG (Self-Attention MADDPG, a self-attention multi-agent deep deterministic policy gradient algorithm);
[0014] In step S2, the UAV search area is divided based on the Voronoi diagram to reduce redundant searches, and the TPM is used to guide the UAVs to prioritize searching high-probability target areas;
[0015] Step S3: The UAVs perform the search task according to the optimized strategy and update the environmental information in real time.
[0016] Specifically, in step S2:
[0017] Each drone is treated as an independent intelligent agent. Each agent makes independent decisions based on its environment, while sharing collaborative mission objective information with other agents.
[0018] Define the observation space $o_t$, action space $a_t$, and search reward $r_t$ of the drone.
[0019] The observation $o_t$ is input into the policy network incorporating the self-attention mechanism to obtain the action $a_t$.
[0020] The observation $o_t$ and the action $a_t$ are input into the evaluation network to obtain the evaluation score $q_i$ for the observation and action.
[0021] Using the minimization of variance as the loss function, the parameters of the policy network and the evaluation network are updated according to the gradient descent method.
[0022] Let the observation $o_t$ of the drone pass through a three-layer convolutional network $g_i$ to give the feature map $x = g_i(o_t)$. The policy network incorporating the self-attention mechanism is then computed as follows:
[0023] Step 1. Let the input feature map $x$ have dimensions $C \times H \times W$, where $C$ is the number of channels and $H \times W$ is the spatial dimension.
[0024] Step 2. Transform the feature map $x$ into three feature spaces $Q = W_Q x$, $K = W_K x$, and $V = W_V x$, where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices and $Q$, $K$, and $V$ are the query, key, and value in the attention mechanism, respectively.
[0025] Step 3. Calculate the dot product between the query $Q$ and the key $K$, then apply a softmax operation to obtain the attention map $A$:
[0026] $A = \mathrm{softmax}(Q^T K)$
[0027] Step 4. Apply the attention map $A$ to the value $V$ to calculate the attention-weighted features $O$:
[0028] $O = VA$
[0029] Step 5. Reshape the self-attention feature map to match the input dimension:
[0030] $c_i = \mathrm{reshape}(O, C, H, W)$.
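Steps 1 to 5 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented network: the C×H×W feature map is assumed to be flattened to a C×N matrix (N = H·W), the weight matrices are assumed to act on the channel axis (with a square W_V so that the output can be reshaped back to C×H×W), and the softmax is assumed to normalize each column of A so that O = VA is a weighted average of value vectors. The function names and the reduced channel count `C_red` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Apply steps 1-5 to a feature map x of shape (C, H, W)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)        # step 1: C x N, with N = H * W
    q = w_q @ flat                    # step 2: query, C_red x N
    k = w_k @ flat                    #         key,   C_red x N
    v = w_v @ flat                    #         value, C x N
    a = softmax(q.T @ k, axis=0)      # step 3: N x N, columns sum to 1 (assumed axis)
    o = v @ a                         # step 4: C x N
    return o.reshape(c, h, w), a      # step 5: back to C x H x W

rng = np.random.default_rng(0)
C, H, W, C_red = 8, 5, 5, 4           # illustrative sizes only
x = rng.normal(size=(C, H, W))
c_i, A = self_attention(
    x,
    rng.normal(size=(C_red, C)),      # W_Q
    rng.normal(size=(C_red, C)),      # W_K
    rng.normal(size=(C, C)),          # W_V
)
```

The output `c_i` has the same shape as the input feature map, which is what allows it to be passed alongside the convolutional features to the next network stage.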
[0031] Dividing the UAV search area with the Voronoi diagram and using the TPM to guide the UAVs to prioritize high-probability target areas are both achieved through a dedicated reward function.
[0032] The total reward function r is calculated as follows:
[0033] $r = r_{cov} + r_{vor} + r_{obs} + r_{step}$,
[0034] where:
[0035] The coverage reward $r_{cov}$ rewards the drone's search according to the newly explored area, using the following formula:
[0036] Here, $D_{sensor,t}$ is the set of grid cells perceived by the drone at time $t$, and the other set in the formula is the complement of the area covered up to the previous time step.
[0037] The Voronoi reward $r_{vor}$ rewards the distribution of the drones according to the maximum difference between the areas of the Voronoi cells into which the drones divide the region. The formula is as follows:
[0038]
[0039] Here, the first area term, calculated from the drone's position, is the area of the drone's Voronoi cell at time $t$; the second is the area of the largest Voronoi cell at time $t-1$.
[0040] The obstacle penalty $r_{obs}$ penalizes drone collisions according to the number of collisions, using the following formula:
[0041]
[0042] The constant penalty $r_{step}$ is a constant negative number with a value of -0.03.
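As an illustration of the quantity behind the coverage term, the newly explored area can be counted with boolean masks. The sketch below assumes a square sensing footprint of half-width d, clipped at the grid boundary, and returns only the number of newly covered cells. The exact formula for the coverage reward, including any scaling, is not reproduced in this text, so no reward value is computed; the helper names `sensor_mask` and `newly_covered` are hypothetical.

```python
import numpy as np

def sensor_mask(shape, pos, d):
    """Cells inside a (2d+1) x (2d+1) square around pos (assumed footprint)."""
    mask = np.zeros(shape, dtype=bool)
    r, c = pos
    mask[max(r - d, 0):r + d + 1, max(c - d, 0):c + d + 1] = True
    return mask

def newly_covered(sensed, covered_prev):
    """Size of: sensed cells at time t, intersected with cells not covered before t."""
    return int(np.count_nonzero(sensed & ~covered_prev))

covered = np.zeros((10, 10), dtype=bool)   # nothing explored yet

first = sensor_mask(covered.shape, (2, 2), d=1)
n_first = newly_covered(first, covered)    # whole footprint is new
covered |= first

second = sensor_mask(covered.shape, (2, 3), d=1)
n_second = newly_covered(second, covered)  # only the non-overlapping column is new
```

Moving one cell sideways re-senses most of the previous footprint, so the count drops sharply; this is the effect a coverage reward uses to discourage redundant search.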
[0043] The beneficial effects of this invention are:
[0044] (1) This invention introduces a self-attention mechanism (i.e., SA-MADDPG) into the MADDPG algorithm, enabling UAVs to prioritize key environmental information (such as obstacles, teammate distribution, etc.), thereby improving perception and collaborative decision-making capabilities and thus increasing target search efficiency in complex environments.
[0045] (2) This invention combines the target probability map (TPM) with a region partitioning strategy based on Voronoi diagrams to optimize the search paths and coverage of the UAVs, reduce redundant searches, and improve responsiveness in dynamic target environments, making it suitable for searching for fast-moving targets.
[0046] (3) SA-MADDPG is adopted to enable the drone swarm to operate stably in complex environments with dense obstacles and unpredictable target movements. At the same time, the scalability of the algorithm is optimized so that it is applicable to drone swarms of different sizes, thereby improving the system's deployability and application scope.
Brief Description of the Drawings
[0047] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0048] Figure 1 is a schematic diagram of the distributed decision-making and centralized training process of the SA-MADDPG algorithm of this invention;
[0049] Figure 2 is a schematic diagram of the policy network and evaluation network that integrate the self-attention mechanism of the present invention;
[0050] Figure 3 is a comparison chart of the training reward curves of the SA-MADDPG algorithm, which incorporates a self-attention mechanism, and the MADDPG and DQN algorithms.
Detailed Implementation
[0051] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0052] 1. Overview
[0053] The specific implementation steps of this invention include:
[0054] First, the problem is modeled by transforming the UAV target search task into a multi-agent collaborative problem. Each UAV acts as an independent agent, utilizing MADDPG and a self-attention mechanism (i.e., SA-MADDPG) for target search.
[0055] Next, the search area is divided using a Voronoi diagram so that the drone swarm searches different areas collaboratively and efficiently. The target probability map (TPM) helps assess the possible distribution of targets, enabling drones to prioritize searching high-probability areas and improving search efficiency and accuracy.
[0056] During execution, the search strategy is adjusted based on real-time feedback to adapt to dynamic changes in the target.
[0057] 2. Modeling the multi-UAV cooperative search problem:
[0058] First, we need to clarify the environmental modeling, UAV modeling, and target probability map updates for the UAV cooperative search problem.
[0059] The discretized target region is represented by a three-channel binary grid, with each grid cell representing a portion of the environment. The value of each grid cell (x, y) consists of three parts: the probability of the target's presence, an obstacle presence marker, and a drone presence marker. These values satisfy the following constraints:
[0060] Let d represent the sensing range of each drone's sensor.
[0061] Let D = {0, 1, 2, 3} represent the set of possible movement directions for the drone, which represent moving one grid cell up, down, left, and right respectively.
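The movement set D can be sketched as a lookup from action index to grid offset. The sketch assumes that the row index grows downward (so "up" is row - 1) and that a move into an obstacle or off the grid leaves the drone in place and is flagged as a collision; neither convention is stated in the source, and the names `ACTION_DELTAS` and `step_position` are hypothetical.

```python
import numpy as np

# Action index -> (row offset, column offset): up, down, left, right.
ACTION_DELTAS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def step_position(pos, action, obstacles):
    """Return (new_position, collided) for one move on the obstacle grid."""
    dr, dc = ACTION_DELTAS[action]
    r, c = pos[0] + dr, pos[1] + dc
    rows, cols = obstacles.shape
    out_of_bounds = not (0 <= r < rows and 0 <= c < cols)
    # Short-circuit keeps us from indexing the grid with an invalid cell.
    blocked = out_of_bounds or bool(obstacles[r, c])
    return (pos, True) if blocked else ((r, c), False)

obstacles = np.zeros((3, 3), dtype=bool)
obstacles[0, 1] = True                      # one obstacle above the centre cell

moved_right = step_position((1, 1), 3, obstacles)
hit_obstacle = step_position((1, 1), 0, obstacles)
hit_wall = step_position((0, 0), 2, obstacles)
```

The collision flag is the kind of signal an obstacle penalty can count per time step.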
[0062] The distribution of targets is represented by a target probability map, where each grid cell value $P_{xy}(t) \in [0,1]$ represents the probability that the target appears in cell $(x,y)$ at time $t$. Initially, the target probability in each grid cell is $P_{xy}(0) = 0.5$, indicating that the target's location is unknown. The probability update rule for target movement is: The target probability map is updated using a Bayesian model based on sensor detection information. When the drone scans for a target, the probability of the target's appearance is updated as follows:
[0063] where $p_a$ denotes the accuracy of the drone's sensor. If the drone fails to detect a target, $p_a$ is replaced with $1 - p_a$.
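The sensor-based update can be sketched as a per-cell binary Bayes rule. The update formula itself is not preserved in this text, so the form below is an assumption: it is the standard posterior for a symmetric sensor whose probability of a correct reading is p_a, which is consistent with the stated rule that p_a is replaced by 1 - p_a when nothing is detected. The target-movement update is not modeled, and the function name `bayes_update` is hypothetical.

```python
import numpy as np

def bayes_update(tpm, sensed, detected, p_a):
    """Update target probabilities in sensed cells; leave other cells unchanged.

    tpm      : float array of P_xy(t)
    sensed   : bool array, True where a drone's sensor covers the cell
    detected : bool array, True where the sensor reported a target
    p_a      : sensor accuracy (assumed symmetric: false-alarm rate = 1 - p_a)
    """
    like = np.where(detected, p_a, 1.0 - p_a)   # p_a, or 1 - p_a on no detection
    post = like * tpm / (like * tpm + (1.0 - like) * (1.0 - tpm))
    return np.where(sensed, post, tpm)

tpm = np.full((4, 4), 0.5)                      # P_xy(0) = 0.5 everywhere
sensed = np.zeros((4, 4), dtype=bool)
sensed[0, :2] = True                            # two cells scanned
detected = np.zeros((4, 4), dtype=bool)
detected[0, 0] = True                           # target reported in one of them

tpm = bayes_update(tpm, sensed, detected, p_a=0.9)
```

Starting from the uninformative prior of 0.5, a detection raises the cell to p_a and a miss lowers it to 1 - p_a, while unscanned cells keep their previous value.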
[0064] 3. A multi-UAV cooperative search method based on SA-MADDPG
[0065] The following describes how to use SA-MADDPG to solve the problem of cooperative search by drones.
[0066] First, define the observation space $o_t$, action space $a_t$, and search reward $r_t$ of the UAV at time $t$.
[0067] The observation space of the drone has four components: the latest update of the TPM, the positions of all other drones, the distribution of obstacles in the environment, and the drone's current position, which serves as a reference point. Each component is represented as two-dimensional data with dimensions matching the size of the environment.
[0068] The action space is represented by D = {0, 1, 2, 3}, which indicates the selectable movement directions of the drone.
[0069] The search reward $r$ is defined as the sum of the coverage reward, the Voronoi reward, the obstacle penalty, and the constant penalty:
[0070] $r = r_{cov} + r_{vor} + r_{obs} + r_{step}$
[0071] The coverage reward $r_{cov}$ rewards the drone's search according to the newly discovered area, using the following formula: Here, $D_{sensor,t}$ is the set of grid cells perceived by the drone at time $t$, and the other set in the formula is the complement of the area covered up to the previous time step.
[0072] The Voronoi reward $r_{vor}$ rewards the distribution of the drones according to the maximum difference between the areas of the Voronoi cells into which the drones divide the region. The formula is as follows: for $i = 1, 2, ..., N$. Here, the first area term, calculated from the drone's position, is the area of the drone's Voronoi cell at time $t$; the second is the area of the largest Voronoi cell at time $t-1$. The obstacle penalty $r_{obs}$ penalizes drone collisions according to the number of collisions, using the following formula: The constant penalty $r_{step}$ is a constant negative number with a value of -0.03.
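The Voronoi cell areas that the Voronoi reward depends on can be computed directly on the grid. The sketch below assumes a discrete Voronoi partition in which every grid cell belongs to the nearest drone by Euclidean distance, with ties going to the lowest drone index, and measures area as a cell count. It computes only the areas and their spread; the reward formula itself is not reproduced in this text. The function name `voronoi_areas` is hypothetical.

```python
import numpy as np

def voronoi_areas(shape, drone_positions):
    """Return (cell count per drone, owner index per grid cell)."""
    rows, cols = np.indices(shape)
    pos = np.asarray(drone_positions)
    # Squared distance from every grid cell to every drone: shape (N, H, W).
    d2 = (rows[None] - pos[:, 0, None, None]) ** 2 \
       + (cols[None] - pos[:, 1, None, None]) ** 2
    owner = d2.argmin(axis=0)                  # nearest drone per cell
    areas = np.bincount(owner.ravel(), minlength=len(pos))
    return areas, owner

# Two drones spread out: the 4 x 6 grid is split evenly.
spread_areas, _ = voronoi_areas((4, 6), [(1, 1), (1, 4)])

# Two drones bunched in one corner: one cell is much larger than the other.
bunched_areas, _ = voronoi_areas((4, 6), [(1, 0), (1, 1)])

spread_gap = int(spread_areas.max() - spread_areas.min())
bunched_gap = int(bunched_areas.max() - bunched_areas.min())
```

A small gap between the largest and smallest cell indicates an even distribution of the drones, which is the behavior the Voronoi reward is described as encouraging.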
[0073] Second, the observation $o_t$ is input into the SA-MADDPG policy network to obtain the action $a_t$. In the SA-MADDPG network architecture, the data is processed as follows:
[0074] $a_t = h_i(g_i(o_t), c_i(o_t))$
[0075] Here, $h_i$ is a two-layer fully connected network, $g_i$ is a three-layer convolutional neural network, and $c_i$ provides weighting factors that measure the importance of the environmental features to the drone's decision-making process.
[0076] $c_i$ is a feature extraction network based on self-attention, $c_i(o_t) = f_i(g_i(o_t))$, where $f_i$ is a self-attention network.
[0077] The calculation steps of $c_i$ are as follows:
[0078] Step 1. Input the observation $o_t$ of the UAV into $g_i$ to obtain the feature map $x = g_i(o_t)$. Let the input feature map $x$ have dimensions $C \times H \times W$, where $C$ is the number of channels and $H \times W$ is the spatial dimension.
[0079] Step 2. Transform the feature map $x$ into three feature spaces $Q = W_Q x$, $K = W_K x$, and $V = W_V x$, where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices.
[0080] Step 3. Calculate the attention map using the dot product between the query and the key, followed by a softmax operation:
[0081] $A = \mathrm{softmax}(Q^T K)$
[0082] Step 4. Apply the attention map to the value to calculate the attention-weighted features:
[0083] $O = VA$
[0084] Step 5. Reshape the self-attention feature map to match the input dimension:
[0085] $c_i = \mathrm{reshape}(O, C, H, W)$
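For readers checking dimensions, the shapes implied by the five steps can be written out as below. This bookkeeping rests on assumptions not stated in the source: the feature map is flattened over its spatial positions into N = HW columns, the projection matrices act on the channel axis, and C' is a hypothetical projected channel count for the query and key.

```latex
\begin{aligned}
x &\in \mathbb{R}^{C \times N}, \qquad N = HW \\
Q = W_Q x,\quad K = W_K x &\in \mathbb{R}^{C' \times N}, \qquad W_Q, W_K \in \mathbb{R}^{C' \times C} \\
V = W_V x &\in \mathbb{R}^{C \times N}, \qquad W_V \in \mathbb{R}^{C \times C} \\
A = \operatorname{softmax}\left(Q^{T} K\right) &\in \mathbb{R}^{N \times N} \\
O = V A &\in \mathbb{R}^{C \times N} \\
c_i = \operatorname{reshape}(O, C, H, W) &\in \mathbb{R}^{C \times H \times W}
\end{aligned}
```

Under these assumptions the attention map relates every spatial position to every other position, so its size grows with the square of the number of grid cells in the feature map.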
[0086] Third, the observations and actions of all agents are combined to obtain the joint observation $O_t$ and the joint action $A_t$, which are input into the evaluation network to obtain the evaluation score $q_i$ for the joint observation and joint action:
[0087] $q_i = h_i(g_i(O_t), c_i(O_t), A_t)$
[0088] Figure 1 illustrates the distributed decision-making and centralized training process of the SA-MADDPG algorithm. Figure 2 shows a schematic diagram of the policy network and the evaluation network that incorporate the self-attention mechanism.
[0089] Fourth, use the minimization of variance as the loss function to update the parameters of the policy network and the evaluation network according to the gradient descent method.
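The loss is not spelled out further in this text. One plausible reading, offered here only as an assumption, is the mean squared temporal-difference error used for the critic in standard MADDPG, where the target is the reward plus the discounted score of a target evaluation network. The sketch below shows that loss only; the discount factor `gamma`, the `done` mask, and the function name `critic_loss` are hypothetical and do not come from the source.

```python
import numpy as np

def critic_loss(q, reward, q_next_target, gamma=0.95, done=None):
    """Mean squared TD error between scores q and targets y (assumed form).

    q             : evaluation scores q_i for sampled (O_t, A_t)
    reward        : search rewards r_t for the same samples
    q_next_target : target-network scores for the next joint observation/action
    done          : 1.0 where the episode ended, else 0.0 (optional)
    """
    done = np.zeros_like(reward) if done is None else done
    y = reward + gamma * (1.0 - done) * q_next_target
    return float(np.mean((q - y) ** 2))

q = np.array([1.0, 2.0])
reward = np.array([0.5, 0.0])
q_next = np.array([1.0, 2.0])

loss = critic_loss(q, reward, q_next, gamma=0.5)
loss_terminal = critic_loss(q, reward, q_next, gamma=0.5, done=np.array([0.0, 1.0]))
```

In a full training loop this scalar would be differentiated with respect to the evaluation network parameters by an automatic differentiation library; NumPy is used here only to show the arithmetic.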
[0090] 4. Performance Analysis
[0091] To verify the effectiveness of the proposed multi-UAV target search method based on the self-attention mechanism and reinforcement learning, we designed experiments to analyze the reward curves of the algorithm during training and to evaluate its performance against benchmark algorithms. A two-dimensional UAV search environment was built on a custom framework similar to OpenAI Gym, with its state discretized into cells. At the beginning of each simulation scenario, the initial positions of the UAVs and moving targets were randomly generated. To ensure the continuity of the search task, a fixed number of targets was always present in the environment: once a target was detected and removed, a new target was generated at a random location. Obstacles were randomly distributed throughout the environment. The number of UAVs was set to 4, the number of targets to 2, and the number of obstacles to between 0 and 200, randomly selected at the beginning of each round. The maximum number of training rounds was set to 10,000, and the maximum number of time steps per round was 150. We trained the algorithm under these settings and compared our method with the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) and Deep Q-Network (DQN) methods. The corresponding training reward curves are shown in Figure 3. As shown, the average reward of all algorithms increases with the number of training iterations and eventually converges. In the final convergence results, the average reward curve of the SA-MADDPG algorithm shows a significant advantage over the other two methods.
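The experimental settings stated above can be collected in a small configuration object. The numeric values are taken from the paragraph; the class name, field names, and the sampling helper are hypothetical, and settings the text does not give (such as grid size or learning rates) are deliberately left out.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchExperimentConfig:
    """Settings reported for the training experiment."""
    num_uavs: int = 4
    num_targets: int = 2
    min_obstacles: int = 0
    max_obstacles: int = 200
    max_rounds: int = 10_000
    max_steps_per_round: int = 150

    def sample_obstacle_count(self, rng: random.Random) -> int:
        """Draw the obstacle count for one round (both bounds inclusive)."""
        return rng.randint(self.min_obstacles, self.max_obstacles)

config = SearchExperimentConfig()
rng = random.Random(0)                      # seeded only for reproducibility
counts = [config.sample_obstacle_count(rng) for _ in range(5)]
max_total_steps = config.max_rounds * config.max_steps_per_round
```

Freezing the dataclass keeps the settings immutable during a run, so every round is sampled from the same stated ranges.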
[0092] 5. Summary
[0093] This invention relates to a method for collaborative dynamic target search among multiple unmanned aerial vehicles (UAVs), belonging to the field of artificial intelligence and UAV swarm intelligent control.
[0094] To address the target search problem of UAVs in complex environments, this invention proposes an SA-MADDPG search method that combines a self-attention mechanism with a multi-UAV cooperative search strategy using Voronoi region partitioning and exploration incentives.
[0095] This invention discretizes the search area into a grid map, constructs a target probability map (TPM) to represent the target distribution, and uses Bayesian inference to dynamically update the TPM; each UAV perceives the environment based on its onboard sensors and uses a self-attention mechanism to optimize the observation information;
[0096] Search strategy optimization: The SA-MADDPG algorithm is used to train a multi-UAV search strategy. Global information is used to optimize the strategy during the centralized training phase, and distributed decision-making is carried out during the execution phase.
[0097] Exploration incentive mechanism: Divide the UAV search area based on the Voronoi diagram to reduce redundant searches, and use the TPM to guide UAVs to prioritize searching high-probability target areas;
[0098] Mission execution: The UAV executes the search mission according to the trained strategy and updates environmental information in real time.
[0099] The technical solution of the present invention has been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after these changes or substitutions will all fall within the scope of protection of the present invention.
Claims
1. A multi-UAV target search method based on self-attention and reinforcement learning, characterized in that the target search task of the UAVs is transformed into a multi-agent collaborative problem, each UAV is treated as an independent agent, and the policy network SA-MADDPG is used for target search. At the same time, a Voronoi diagram is used to rationally divide the search area, and a target probability map (TPM) is used to help evaluate the possible distribution of targets, so that the UAVs prioritize searching areas with high target probability. Finally, during the execution of the search task, the search strategy is adjusted according to real-time feedback to adapt to the dynamic changes of the targets. SA-MADDPG is a multi-agent deep deterministic policy gradient (MADDPG) algorithm combined with the self-attention mechanism. Each UAV uses onboard sensors to observe the environment, and the self-attention mechanism is used to optimize the observation information. The steps of the target search method include: S1: Discretize the search area into a grid map, construct a TPM to represent the probability distribution of the target in the environment, and use Bayesian inference to dynamically update the TPM; S2: Use the SA-MADDPG algorithm network structure as the policy network to obtain the search policy; S3: The UAVs execute the search task according to the search policy and update environmental information in real time. In step S2, each UAV, i.e., each agent, makes independent decisions based on its environment, while sharing target information for the collaborative search task with the other agents. The observation space, action space, and search reward of the UAV are defined; the observation is input into the policy network SA-MADDPG to obtain the action, and the observation and action are input into the evaluation network to obtain the evaluation score for the observation and action. The optimal search policy is obtained based on the evaluation score. The parameters of the policy network and the evaluation network are iteratively updated using the loss function. The observation space includes the latest TPM update, the positions of the other UAVs, the distribution of obstacles in the environment, and the current position of the UAV as a reference point; the action space includes the UAV's possible directions of movement; the search reward is the sum of the coverage reward, the Voronoi reward, the obstacle penalty, and the constant penalty. The coverage reward is the reward for UAV search calculated from the newly explored area; the Voronoi reward is the reward for UAV distribution calculated from the maximum difference between the areas of the Voronoi cells into which the UAVs divide the region; the obstacle penalty is the penalty for UAV collisions calculated from the number of collisions; the constant penalty is a constant negative number.
2. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 1, characterized in that in step S1, the TPM is defined as follows: The discretized task region is represented by a three-channel binary grid, where each grid cell represents a portion of the environment. The value of any grid cell (x, y) includes the probability of the target's presence, an obstacle presence marker, and a drone presence marker. These values satisfy the following constraints: Let d represent the sensing range of each drone's sensor; let D = {0, 1, 2, 3} represent the set of movement directions of the UAV, corresponding to moving one grid cell up, down, left, and right respectively; let the probability of a target appearing in grid cell (x, y) of the TPM be $P_{x,y}(t)$, where $P_{x,y}(t) \in [0,1]$ represents the probability that the target appears in grid cell (x, y) at time t; initially, $P_{x,y}(0) = 0.5$ for grid cell (x, y), indicating that the target's location is unknown, i.e., the probability is 50%; let the probability update rule for the target's appearance be... where $P_{x,y}(t+1)$ represents the probability that the target appears in grid cell (x, y) at time t+1, and $p_{m,n}(t)$ represents the probability that the target appears in grid cell (m, n) at time t. The TPM is updated using a Bayesian method based on sensor detection information, specifically as follows: When the drone scans for a target, the probability of the target appearing is updated as follows: where $p_a$ denotes the accuracy of the sensor; if the drone fails to detect the target, $p_a$ is replaced with $1 - p_a$.
3. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 2, characterized in that in step S2, the data processing in the SA-MADDPG network structure is as follows: the observation $o_t$ of UAV $i$ is input into its policy network to obtain the action $a_t$, according to the formula $a_t = h_i(g_i(o_t), c_i(o_t))$, where $h_i$ is a two-layer fully connected network, $g_i$ is a three-layer convolutional neural network, and $c_i$ is a feature extraction network based on self-attention, $c_i(o_t) = f_i(g_i(o_t))$, where $f_i$ is a self-attention network.
4. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 3, characterized in that the observation $o_t$ of the drone is input into $g_i$ to obtain the feature map $x = g_i(o_t)$, which is input into the attention network $f_i$; the subsequent calculation steps are as follows: 1) Let the dimension of the input feature map $x$ be $C \times H \times W$, where $C$ is the number of channels and $H \times W$ is the spatial dimension; 2) Transform the feature map $x$ into three feature spaces $Q = W_Q x$, $K = W_K x$, and $V = W_V x$, where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices and $Q$, $K$, and $V$ are the query, key, and value in the attention mechanism, respectively; 3) Calculate the dot product between the query $Q$ and the key $K$, and then apply a softmax operation to obtain the attention map $A$: $A = \mathrm{softmax}(Q^T K)$; 4) Apply the attention map $A$ to the value $V$ to calculate the attention-weighted features $O$: $O = VA$; 5) Reshape the self-attention feature map using the reshape function to match the input dimension: $c_i = \mathrm{reshape}(O, C, H, W)$.
5. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 2, characterized in that in step S2, the observation $o_t$ and the action $a_t$ are input into the evaluation network to obtain an evaluation score for the observation and action, calculated as $q_i = h_i(g_i(o_t), c_i(o_t), a_t)$.
6. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 1, characterized in that in step S2, the parameters of the policy network and the evaluation network are updated by gradient descent, using the minimization of variance as the loss function.
7. The multi-UAV target search method based on self-attention and reinforcement learning according to claim 1, characterized in that in step S2, the search reward $r$ is defined as the sum of the coverage reward, the Voronoi reward, the obstacle penalty, and the constant penalty: $r = r_{cov} + r_{vor} + r_{obs} + r_{step}$. The coverage reward $r_{cov}$ rewards the drone's search according to the newly discovered area, using the following formula: Here, $D_{sensor,t}$ is the set of grid cells perceived by the drone at time $t$, and the other set in the formula is the complement of the area covered up to the previous time step. The Voronoi reward $r_{vor}$ rewards the distribution of the drones according to the maximum difference between the areas of the Voronoi cells into which the drones divide the region. The formula is as follows: for $i = 1, 2, ..., N$. Here, the first area term, calculated from the position of UAV $i$, is the area of the Voronoi cell of UAV $i$ at time $t$; the second is the area of the largest Voronoi cell at time $t-1$. The obstacle penalty $r_{obs}$ penalizes drone collisions according to the number of collisions, using the following formula: The constant penalty $r_{step}$ is a constant negative number with a value of -0.03.