A post-disaster unmanned aerial vehicle deployment method based on worst-case flexibility policy evaluation
By constructing energy consumption and channel models, and combining reward and safety assessments, the flight strategy of drone swarms was optimized, solving the problems of drone resource waste and network connectivity in remote post-disaster scenarios, and realizing the efficient utilization of drone resources and effective support for post-disaster relief.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF MINING & TECH
- Filing Date
- 2023-04-12
- Publication Date
- 2026-06-26
AI Technical Summary
In remote disaster scenarios, existing drone deployment methods struggle to achieve continuous trajectory optimization of drone swarms, network connectivity maintenance, drone role selection, and energy management, leading to resource waste and rescue delays.
We employ a worst-case flexible strategy evaluation method to construct energy consumption and channel models. By combining reward evaluation, safety evaluation, and worst-case evaluation, we optimize the flight strategy of UAV swarms through reinforcement learning, thereby achieving optimal flight position and role management for UAVs.
While ensuring network connectivity and coverage, energy consumption should be reduced to avoid insufficient remaining energy of drones, thereby achieving efficient utilization of drone resources and effective support for post-disaster relief.
Figure CN116488705B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of unmanned aerial vehicle (UAV)-assisted disaster relief, and in particular relates to a method for deploying UAVs after a disaster based on a worst-case flexible strategy assessment.
Background Technology
[0002] China is one of the countries most severely affected by natural disasters. In a typical year, approximately 200 million people are affected by disasters nationwide, with direct economic losses amounting to around 200 billion yuan. With the sustained rapid development of the national economy, the expansion of production scale, and the accumulation of social wealth, disaster losses are showing an increasingly severe trend. Disasters have become one of the main factors restricting the sustained and stable development of the national economy. China experiences a wide variety of natural disasters, with high frequency and severity. In remote mountainous areas, natural disasters caused by extreme weather or other reasons can easily spread regionally, and these geographical and weather factors also pose significant challenges to disaster relief and rescue efforts. Currently, UAVs are mainly used for reconnaissance missions in confined and geographically dispersed environments, which limits their scope of use, reduces flexibility, and makes it difficult to ensure network connectivity between UAVs. Especially in post-disaster emergency scenarios where the distances between the target point, the UAV, and the ground base station are long and the environment is harsh, it is even more necessary to adjust the flight trajectory of the UAV in real time so that it can safely, continuously, and efficiently complete remote reconnaissance missions.
[0003] After an accident, drones, due to their flexible deployment, unmanned operation, and ability to fly in confined spaces, can quickly penetrate dangerous areas to perform reconnaissance missions. They collect data using their onboard sensor modules and transmit it back to ground stations for data analysis in disaster relief. Due to the limited perception and communication range of a single drone, multiple drone swarms can self-organize and form flight ad hoc networks through situational awareness and information exchange. In these networks, drones cooperate, quantifying their mission attributes to jointly optimize energy efficiency and mission completion, thereby reducing energy consumption caused by unreasonable task allocation. Ground station operators typically plan drone flight trajectories using a task-driven approach. Furthermore, for multi-drone swarms, the topology between drones also influences flight trajectories. Therefore, dynamically planning the optimal trajectory for a drone swarm to meet mission requirements, reduce energy consumption, and ensure network connectivity is crucial. Currently, internationally, most methods employ directed acyclic graphs (DAGs) to transform the path planning problem for a single drone into an integer linear programming problem. Graph theory, convex optimization, and distributed deep reinforcement learning are then used to minimize the total trajectory length and reduce the probability of interruption. These methods are suitable for scenarios with discrete and small operational spaces. However, in complex and challenging post-disaster scenarios, even minor deviations in UAV flight caused by discrete location selection can lead to serious accidents. Therefore, a more rigorous continuous trajectory design is needed to adjust the UAV's flight status in real time.
[0004] Furthermore, tasks in post-disaster scenarios are often more complex. Network connectivity is crucial for cooperative drone swarms to ensure information can be exchanged and relayed back to ground systems. Failure to return information due to a lack of connectivity not only wastes drone resources but, more seriously, can delay disaster relief efforts, leading to greater human and financial losses. To reduce excessive routing overhead while maintaining multi-hop link connectivity, Q. Zhu et al. proposed a multi-relay drone selection scheme based on fuzzy optimal selection in 2017. This scheme aims to achieve a trade-off between surveillance tasks and connectivity maintenance by utilizing historical detection information and surveillance return assessments. However, this method is only suitable for situations where historical detection data is readily available. When the detection scenario is a remote post-disaster mountainous or forested area, the lack of previous detection data prevents an effective trade-off and makes the method unsuitable for post-disaster scenarios. For remote scenarios after disasters, there are still some issues worth further discussion: (1) optimizing the 3D position of the UAV swarm through continuous trajectory deployment; (2) whether the UAV has enough remaining energy to return to the ground base station after completing each mission; (3) role selection and contribution analysis for each UAV in a swarm that collaborates on complex tasks; (4) maintaining the connectivity between the UAV and the ground base station to ensure reliable data transmission.
Summary of the Invention
[0005] To address the aforementioned technical issues, this invention proposes a post-disaster drone deployment method based on worst-case flexible strategy assessment, which can effectively avoid high-risk flight locations and ultimately achieve a balance between coverage and energy consumption.
[0006] To achieve the above objectives, this invention provides a post-disaster drone deployment method based on worst-case flexible strategy assessment, comprising:
[0007] Obtain the current location of the drone swarm, the location of the mission target, the remaining energy, the communication energy consumption, the flight energy consumption, and the maximum communication distance of the drones;
[0008] An energy consumption model for the UAV swarm is constructed based on the communication energy consumption and flight energy consumption, and a channel model is constructed based on the current position of the UAV swarm, the position of the mission target, and the maximum communication distance of the UAVs.
[0009] Set the optimization constraint objectives for the drone swarm;
[0010] Based on the energy consumption model and channel model, and using the optimization constraint objective, the UAV swarm is optimized to obtain the optimal flight strategy of the UAV swarm.
[0011] Optionally, the energy consumption model is:
[0012] $$E = \sum_{i=1}^{N} E_p(i)$$
[0013] Where $E$ is the total energy consumption of the drone swarm, $E_p(i)$ represents the propulsion energy consumption of the $i$-th drone, and $N$ represents the number of drones.
[0014] Optionally, the channel model is:
[0015]
[0016] Where the left-hand side is the maximum communication range of the drone, $p_i$ represents the given transmission power, $\sigma^2$ represents the variance of the Gaussian white noise, $\gamma_{th}$ represents the signal-to-noise ratio threshold, and $h_u$ represents the relative height between drone $u_i$ and the ground base station.
[0017] Optionally, the optimization constraint objective is:
[0018]
[0019] Where $\rho_{\pi}$ represents the trajectory distribution based on policy $\pi$, the action $a_t$ is the 3D coordinates, $r(s_t, a_t)$ represents the reward function, and $c(s_t, a_t)$ represents the loss function.
[0020] Optionally, optimizing the drone swarm includes:
[0021] A reward evaluation model, a safety evaluation model, and a worst-case evaluation model are set for the optimization constraint objective; wherein, the parameters of the reward evaluation model include: the effective coverage rate of the UAV, and the parameters of the safety evaluation model include: the remaining energy consumption of the UAV;
[0022] Based on the reward evaluation model, safety evaluation model, and worst-case evaluation model, the optimal flight position strategy for the drone swarm is obtained.
[0023] Optionally, the reward evaluation model is:
[0024]
[0025] Where $r(s_t, a_t)$ represents the reward function, and $\iota$ represents network connectivity.
[0026] Optionally, constructing the safety evaluation model includes:
[0027] Design the loss function;
[0028] The conditional value at risk (CVaR) is obtained based on the expected value of the loss function;
[0029] The safety evaluation model is constructed based on the conditional value at risk.
[0030] The loss function is:
[0031]
[0032] Where $E_p(i)$ represents the energy consumption for UAV propulsion flight, and $E_r(i)$ represents the remaining energy of the drone.
[0033] Optionally, the safety evaluation model is:
[0034]
[0035] Where $F_C$ is the cumulative distribution function of $p^{\pi}(C \mid s, a)$, $\alpha$ represents the risk level, the action $a$ is the 3D coordinates, and $D$ represents the constraint value.
[0036] Optionally, the worst-case assessment model is:
[0037]
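[0038] Where $\Gamma_{\pi}(s, a, \alpha)$ represents the safety metric, $\mathrm{CVaR}_{\alpha}$ represents the conditional value at risk, $\alpha$ represents the risk level, and $\Phi^{-1}(\alpha)$ represents the inverse of the standard normal cumulative distribution function evaluated at $\alpha$.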
[0039] Optionally, obtaining the optimal flight strategy includes: performing role management on the drone swarm, wherein role management includes: role assignment and role switching.
[0040] Role allocation includes: based on the energy utilization efficiency of the drone swarm, each drone is allocated as a relay drone (RU), an articulation drone (AU), a detection and coverage mission drone (MU), or a standby drone (SU);
[0041] The role switching includes:
[0042] Based on the network connectivity in the reward function, a preset number of drones are added to the MU, and their return path energy consumption is calculated; if the coverage areas of the MUs overlap, other MUs are added to RU∪AU; SU is in standby mode.
[0043] Compared with the prior art, the present invention has the following advantages and technical effects:
[0044] This invention introduces the maximum entropy method, making the actions of UAVs more exploratory and preventing the location strategy of UAV swarms from getting trapped in local optima. Replacing the original energy consumption loss function with a Gaussian-distributed safety assessment yields a closed-form estimate of the conditional value at risk (CVaR). Based on this, adjusting the constraint value D controls the proportion of cases in which energy consumption exceeds the remaining energy, enhancing the safety of UAV space exploration and preventing the UAV from having insufficient remaining energy to complete the detection mission. The reward function incorporates the spatial exploration ratio (SER) and network topology connectivity, and the loss function keeps the energy consumption of the UAV within a threshold. These two factors work together within the framework of worst-case flexible strategy evaluation, enabling UAVs to achieve the optimal trade-off between spatial exploration ratio, energy consumption, and network connectivity in target detection in remote areas, especially when the target is dynamically triggered and changes.
[0045] This invention uses reinforcement learning based on two evaluation parameters, reward and safety, to obtain a better flight position strategy for UAV swarms, thereby achieving a balance between energy consumption and coverage while maintaining the connectivity of the UAV topology network.
Brief Description of the Drawings
[0046] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0047] Figure 1 is a flowchart of a post-disaster drone deployment method based on worst-case flexible strategy assessment, as described in an embodiment of the present invention.
[0048] Figure 2 is a schematic diagram of the overall system model of an embodiment of the present invention, wherein (a) is a schematic diagram of task 1 triggering, (b) is a schematic diagram of 3D deployment, and (c) is a schematic diagram of role management;
[0049] Figure 3 is a schematic diagram of a 3D deployment framework for a drone swarm based on WCSAC, according to an embodiment of the present invention.
[0050] Figure 4 is a diagram illustrating the optimization strategy for reshaping the long-term loss distribution in an embodiment of the present invention.
Detailed Implementation
[0051] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0052] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0053] As shown in Figure 1, this invention provides a post-disaster drone deployment method based on worst-case flexible strategy assessment, which specifically includes:
[0054] Obtain the current location of the drone swarm, the location of the mission target, the remaining energy, the communication energy consumption, the flight energy consumption, and the maximum communication distance of the drones;
[0055] An energy consumption model for UAV swarms is constructed based on communication energy consumption and flight energy consumption, and a channel model is constructed based on the current position of the UAV swarm, the position of the mission target, and the maximum communication distance of the UAVs.
[0056] Set optimization constraints for the drone swarm;
[0057] Based on energy consumption and channel models, and utilizing optimization constraints, the drone swarm is optimized to obtain the optimal flight strategy for the drone swarm.
[0058] Furthermore, the energy consumption model is as follows:
[0059] $$E = \sum_{i=1}^{N} E_p(i)$$
[0060] Where $E$ is the total energy consumption of the drone swarm, $E_p(i)$ represents the propulsion energy consumption of the $i$-th drone, and $N$ represents the number of drones.
[0061] Furthermore, the channel model is as follows:
[0062]
[0063] Where the left-hand side is the maximum communication range of the drone, $p_i$ represents the given transmission power, $\sigma^2$ represents the variance of the Gaussian white noise, $\gamma_{th}$ represents the signal-to-noise ratio threshold, and $h_u$ represents the relative height between drone $u_i$ and the ground base station.
[0064] Furthermore, the optimization constraint objective is:
[0065]
[0066] Where $\rho_{\pi}$ represents the trajectory distribution based on policy $\pi$, the action $a_t$ is the 3D coordinates, $r(s_t, a_t)$ represents the reward function, and $c(s_t, a_t)$ represents the loss function.
[0067] Furthermore, optimization of drone swarms includes:
[0068] A reward evaluation model, a safety evaluation model, and a worst-case evaluation model are set for the optimization constraint objective; the parameters of the reward evaluation model include the effective coverage rate of the UAV, and the parameters of the safety evaluation model include the remaining energy consumption of the UAV.
[0069] Based on the reward evaluation model, safety evaluation model, and worst-case evaluation model, the optimal flight position strategy for the drone swarm is obtained.
[0070] Furthermore, the reward evaluation model is as follows:
[0071]
[0072] Where $r(s_t, a_t)$ represents the reward function, and $\iota$ represents network connectivity.
[0073] Furthermore, the construction of the safety evaluation model includes:
[0074] Design the loss function;
[0075] The conditional value at risk (CVaR) is obtained based on the expected value of the loss function;
[0076] The safety evaluation model is constructed based on the conditional value at risk.
[0077] The loss function is:
[0078]
[0079] Where $E_p(i)$ represents the energy consumption for UAV propulsion flight, and $E_r(i)$ represents the remaining energy of the drone.
[0080] Furthermore, the safety evaluation model is as follows:
[0081]
[0082] Where $F_C$ is the cumulative distribution function of $p^{\pi}(C \mid s, a)$, $\alpha$ represents the risk level, the action $a$ is the 3D coordinates, and $D$ represents the constraint value.
[0083] Furthermore, the worst-case assessment model is as follows:
[0084]
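[0085] Where $\Gamma_{\pi}(s, a, \alpha)$ represents the safety metric, $\mathrm{CVaR}_{\alpha}$ represents the conditional value at risk, $\alpha$ represents the risk level, and $\Phi^{-1}(\alpha)$ represents the inverse of the standard normal cumulative distribution function evaluated at $\alpha$.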
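The worst-case metric can be illustrated with a short Python sketch. It assumes the long-term loss C at a state-action pair is approximated by a Gaussian with mean `q_c` and variance `var_c`, and uses the standard closed-form CVaR of a Gaussian; the function and argument names are hypothetical, and the patent's own equation is not reproduced in this text, so this is an illustration rather than the claimed formula.

```python
import math
from statistics import NormalDist


def gaussian_cvar(q_c: float, var_c: float, alpha: float) -> float:
    """Closed-form CVaR at risk level alpha for a loss C ~ N(q_c, var_c).

    Assumption: q_c and var_c stand for learned estimates of the mean and
    variance of the long-term loss (hypothetical names, not from the patent).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if var_c < 0.0:
        raise ValueError("variance must be non-negative")
    if alpha == 1.0:
        return q_c  # risk-neutral case: CVaR reduces to the mean
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = std_normal.inv_cdf(alpha)  # inverse CDF evaluated at alpha
    return q_c + std_normal.pdf(z) / alpha * math.sqrt(var_c)


def is_safe(q_c: float, var_c: float, alpha: float, d: float) -> bool:
    """Safety check: the worst-case metric must not exceed the constraint value d."""
    return gaussian_cvar(q_c, var_c, alpha) <= d
```

A smaller `alpha` looks only at the worst tail of the loss distribution, so the metric grows and the same constraint value becomes harder to satisfy.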
[0086] Furthermore, obtaining the optimal flight strategy includes: role management of the drone swarm, which includes role assignment and role switching.
[0087] Role allocation includes: based on the energy utilization efficiency of the drone swarm, each drone is allocated as a relay drone (RU), an articulation drone (AU), a detection and coverage mission drone (MU), or a standby drone (SU);
[0088] Role switching includes:
[0089] Based on the network connectivity in the reward function, a preset number of drones are added to the MU, and their return path energy consumption is calculated; if the coverage areas of the MUs overlap, other MUs are added to RU∪AU; SU is in standby state.
[0090] 1. Model Establishment
[0091] The drone swarm $U = \{u_1, u_2, \ldots, u_N\}$ is to complete exploration missions $M = \{m_1, m_2, \ldots, m_t\}$ in remote forests or mountains, which include scheduling, data sensing, and information transmission. Tasks are assumed to be dynamically triggered within the detection area, meaning the center and scale of each task are variable.
[0092] 1.1 Environment Model
[0093] In the initial state shown in Figure 2(a), all drones are on standby around the ground station (GS). As shown in Figure 2(b), when a mission is triggered, the drone swarm is deployed and adjusts its flight positions to maximize coverage of the mission area. However, this flight adjustment may lead to communication interruptions. To successfully transmit sensor data back to the GS, network connectivity must be maintained. Therefore, a certain number of drones are selected for connectivity maintenance, rather than all being used for detection and coverage of the mission area. As shown in Figure 2(c), considering UAV role management and energy utilization efficiency, each UAV takes one of the following four roles during flight adjustments: Relay UAV (RU), Articulation UAV (AU), Mission UAV (MU), and Standby UAV (SU). Specifically, UAVs assigned as MUs are responsible for coverage and data transmission in the mission area; UAVs assigned as RUs are responsible for maintaining connectivity between the GS and AUs to ensure reliable information transmission; UAVs assigned as AUs are responsible for connecting RUs and other UAVs (AU or MU) and transmitting information; UAVs assigned as SUs do not contribute to mission execution.
[0094] Let $T$ be the time it takes for a drone to complete a mission, which is divided into $t_{max}$ time slots, each with a duration of $\delta$. The flight speed of drone $u_i$ is $v_i$, and its position coordinates are $L_{u_i}(t) = (x_{u_i}(t), y_{u_i}(t), z_{u_i}(t))$. The task region (with radius $r_m$ and center $L_m$) and the coverage area of a single drone (with radius $r_u$) are both circular areas. This embodiment considers that the task area may be time-varying in the real world. Therefore, the following dynamic task triggering mechanism is designed, in which the radius of the emergency area may expand or shrink, for example as a fire spreads or is extinguished.
[0095]
[0096] Where $k > 1$ is the adjustment coefficient, and the two radius terms represent the initial radius and the dynamic radius of the region of task $i$, respectively.
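Because the triggering equation itself is not reproduced in this text, the following Python sketch only illustrates one plausible reading of the mechanism: the task radius is multiplied by the adjustment coefficient when the emergency spreads and divided by it when the emergency recedes. The multiplicative update, the function name, and the `expanding` flag are all assumptions made for illustration.

```python
def dynamic_radius(initial_radius: float, k: float, expanding: bool) -> float:
    """Return an updated task-region radius.

    Assumed rule (not taken from the patent): scale by k when the emergency
    area expands and by 1/k when it shrinks, with k > 1 as stated in the text.
    """
    if k <= 1.0:
        raise ValueError("the adjustment coefficient k must be greater than 1")
    if initial_radius < 0.0:
        raise ValueError("radius must be non-negative")
    return initial_radius * k if expanding else initial_radius / k
```

For example, with an initial radius of 10 and k = 2, this sketch yields a radius of 20 for a spreading fire and 5 for one being extinguished.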
[0097] However, 3D deployment of drone swarms can lead to communication disruptions due to the large distances between drones. To successfully transmit data captured by the drone swarm from remote areas back to the GS, network connectivity between the drones must be guaranteed. Therefore, a certain number of drones must be deployed for connectivity maintenance rather than for coverage tasks within the mission area. As shown in Figure 2(c), considering role management and energy efficiency, the drones are assigned four different roles: RU, AU, MU, and SU. The corresponding sets are likewise denoted RU, AU, MU, and SU. Drones acting as MUs focus on providing coverage and transmitting sensor data back to the GS via AUs and RUs; drones acting as RUs maintain connectivity between the GS and AUs to ensure reliable information transmission; drones acting as AUs are responsible for connecting RUs to another AU or MU and transmitting sensor data; drones acting as SUs do not contribute to the mission.
[0098] 1.2 Channel Model
[0099] Based on the 3GPP (3rd Generation Partnership Project) Technical Specification Group Radio Access Network study on Enhanced LTE Support for Aerial Vehicles (Release 15), this embodiment adopts the RMa-AV LOS channel model corresponding to remote area scenarios. The path loss of this channel model can be expressed as:
[0100]
[0101] Where $h_u$ represents the relative altitude between UAV $u_i$ and the GS, and the distance term represents the relative distance between drone $u_i$ and drone $u_j$. Given a transmission power $p_i$, the maximum communication distance of drone $u_i$ is:
[0102]
[0103] Where $\sigma^2$ represents the variance of the Gaussian white noise and $\gamma_{th}$ represents the signal-to-noise ratio (SNR) threshold; an SNR that meets this threshold allows the receiver to successfully decode the information. Furthermore, the reliable communication distance between drones is bounded not only by the maximum communication range but also by the minimum separation $d_{min}$, the safe distance required to avoid collisions. Therefore, a link from drone $u_i$ to drone $u_j$ is successfully established only when their relative distance is at least $d_{min}$ and at most the maximum communication distance.
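The channel model can be sketched in Python as follows. The path-loss expression is the RMa-AV LOS form as commonly quoted from the 3GPP aerial-vehicle study, written here from memory and therefore to be treated as an assumption; the carrier frequency `fc_ghz`, the dB/dBm units, and all function names are likewise illustrative and do not appear in the text.

```python
import math


def _slope(h_u_m: float) -> float:
    # Height-dependent path-loss slope (assumed RMa-AV LOS form).
    return max(23.9 - 1.8 * math.log10(h_u_m), 20.0)


def _offset_db(fc_ghz: float) -> float:
    # Frequency-dependent constant term, carrier frequency in GHz.
    return 20.0 * math.log10(40.0 * math.pi * fc_ghz / 3.0)


def rma_av_los_path_loss_db(d_m: float, h_u_m: float, fc_ghz: float) -> float:
    """Path loss in dB at distance d_m (metres) for a drone at height h_u_m."""
    return _slope(h_u_m) * math.log10(d_m) + _offset_db(fc_ghz)


def max_comm_distance_m(p_tx_dbm: float, noise_dbm: float, snr_th_db: float,
                        h_u_m: float, fc_ghz: float) -> float:
    """Largest distance at which the received SNR still meets the threshold."""
    budget_db = p_tx_dbm - noise_dbm - snr_th_db  # tolerable path loss
    return 10.0 ** ((budget_db - _offset_db(fc_ghz)) / _slope(h_u_m))


def link_ok(d_ij_m: float, d_min_m: float, d_max_m: float) -> bool:
    """A link exists when the distance is collision-safe and within range."""
    return d_min_m <= d_ij_m <= d_max_m
```

Inverting the path-loss expression for the distance is possible in closed form here because the slope depends only on the drone height, not on the distance itself.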
[0104] 1.3 Energy Consumption Model
[0105] The energy consumption of UAVs can be divided into two main parts: communication energy consumption and propulsion (flight) energy consumption. This embodiment mainly studies the scheduling problem of UAVs, so it focuses on the latter. For a rotary-wing UAV flying at a uniform speed $v_i$ over a flight distance $d_i$, the propulsion energy consumption can be expressed as:
[0106]
[0107] Where $U_{tip}$ represents the rotor blade tip speed, $v_0$ is the mean rotor induced velocity in hover, $d_0$ and $s$ represent the fuselage drag ratio and rotor solidity, respectively, and $\rho$ and $A$ represent the air density and rotor disk area, respectively. $P_0$ and $P_1$ are two defined constants, as follows:
[0108]
[0109] Therefore, the total energy consumption of a drone swarm can be defined as:
[0110] $$E = \sum_{i=1}^{N} E_p(i)$$
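The energy model can be illustrated with the sketch below. It assumes the widely used rotary-wing propulsion power model whose parameters match the symbols listed above (blade profile power, induced power, and parasite power), since the patent's own expression is not reproduced in this text. The default parameter values are illustrative placeholders, not values from the patent, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RotorParams:
    """Illustrative placeholder values (assumptions, not from the patent)."""
    p0: float = 79.86     # blade profile power in hover, W
    p1: float = 88.63     # induced power in hover, W
    u_tip: float = 120.0  # rotor blade tip speed, m/s
    v0: float = 4.03      # mean rotor induced velocity in hover, m/s
    d0: float = 0.6       # fuselage drag ratio
    s: float = 0.05       # rotor solidity
    rho: float = 1.225    # air density, kg/m^3
    a: float = 0.503      # rotor disk area, m^2


def propulsion_power(v: float, p: RotorParams) -> float:
    """Propulsion power in W at uniform speed v (assumed standard model)."""
    blade = p.p0 * (1.0 + 3.0 * v**2 / p.u_tip**2)
    induced = p.p1 * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * p.v0**4)) - v**2 / (2.0 * p.v0**2)
    )
    parasite = 0.5 * p.d0 * p.rho * p.s * p.a * v**3
    return blade + induced + parasite


def propulsion_energy(v: float, d: float, p: RotorParams) -> float:
    """Energy in J to fly distance d at speed v: power times flight time d / v."""
    if v <= 0.0:
        raise ValueError("speed must be positive")
    return propulsion_power(v, p) * d / v


def swarm_energy(flights, p: RotorParams) -> float:
    """Total swarm energy: sum of per-drone propulsion energies.

    flights is an iterable of (speed, distance) pairs, one per drone.
    """
    return sum(propulsion_energy(v, d, p) for v, d in flights)
```

Under this model the hover power is simply the sum of the two constants, and the energy for a fixed distance is not monotone in speed, which is why speed and trajectory choices affect the swarm total.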
[0111] 1.4 Optimization Problem
[0112] This embodiment applies to a mission area detection and perception scenario in remote areas. Therefore, the optimization objective includes maximizing the coverage of the mission area, i.e., maximizing the Spatial Exploration Ratio (SER). However, the mobility of UAVs may cause network disconnections, leading to unreliable information transmission. Furthermore, in remote areas, the energy consumption of UAV propulsion is also significant. Therefore, the optimization problem in this embodiment is to maximize SER and minimize energy consumption while maintaining network topology connectivity during UAV scheduling.
[0113] P:
[0114]
[0115] Where the role variable indicates the role assigned to $u_i$, and $\iota$ is a binary variable representing network connectivity. Specifically, if every MU $u_i$ has a safe path back to the GS, then $\iota = 1$; otherwise, $\iota = 0$. Therefore, the first constraint guarantees network connectivity and the reliability of subsequent transmissions. In other words, if the network topology is not connected, the coverage rate is recorded as 0 even if the drone swarm has actual coverage. The second constraint states that a drone's energy consumption should not exceed its remaining energy $E_r$ (the initial energy, also referred to as the remaining energy), which ensures that the drone can successfully return to the GS after completing its remote mission. Each drone $u_i$ has its own coverage area. As shown in Figure 2(c), the detection areas of multiple UAVs may overlap. When there is overlapping coverage between UAVs, the GS operator uses the Monte Carlo method to calculate the covered area.
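A Monte Carlo coverage estimate of the kind mentioned above can be sketched as follows. The sketch assumes the SER is the fraction of the circular task area covered by at least one drone footprint, projected onto the ground plane; the function name, the sample count, and the fixed seed are illustrative choices rather than details from the patent.

```python
import math
import random


def monte_carlo_ser(task_center, task_radius, drone_centers, drone_radius,
                    samples=20000, seed=0):
    """Estimate the covered fraction of a circular task area.

    Points are drawn uniformly inside the task disk; a point counts as covered
    if it lies inside at least one drone footprint, so overlapping footprints
    are never counted twice.
    """
    rng = random.Random(seed)
    cx, cy = task_center
    covered = 0
    for _ in range(samples):
        # The square root makes the samples uniform over the disk area.
        r = task_radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        x, y = cx + r * math.cos(theta), cy + r * math.sin(theta)
        if any(math.hypot(x - ux, y - uy) <= drone_radius
               for ux, uy in drone_centers):
            covered += 1
    return covered / samples
```

Sampling avoids computing the exact area of a union of circles, which becomes tedious once more than two footprints overlap.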
[0116] 2. 3D Deployment and Connectivity Maintenance of Drones Based on WCSAC
[0117] In a post-disaster scenario, the target location, the drone's location, and the drone's remaining energy form a state space. Within this state space, the SER (Spatial Exploration Ratio) can be iteratively optimized by adjusting the drone's location. This embodiment adjusts the drone's flight based on its real-time remaining energy status at the time the detection mission is launched. Therefore, this optimization problem is modeled as a single-step constrained Markov decision process (CMDP). A CMDP is defined as a tuple $(S, A, p, r, c, G, \gamma)$, which includes a (multidimensional, continuous, and bounded) state space $S$, a (multidimensional, continuous, and bounded) action space $A$, a probability transition function $p$, a reward function $r$, a loss function $c$, an adjustable safety threshold $G$, and a discount factor $\gamma$ ($\gamma \in (0, 1]$).
[0118] Drones are regarded as agents that interact with their environment in reinforcement learning. Each agent $u_i$, based on the current state $s \in S$, executes the flight strategy $a \in A$, i.e., a set of 3D coordinates (a continuous action space). Then, the agent receives a reward $r(s, a)$ and a loss $c(s, a)$. The entire process is one iteration of a loop. When the environment changes, the agent starts a new round of policy learning from an initial state $s_0 \sim p(s_0)$.
[0119] 2.1 Problem Transformation
[0120] While actor-critic (AC) methods can solve continuous-variable problems, they are inefficient and unstable when the utility function is zero. Therefore, T. Haarnoja et al. proposed soft actor-critic (SAC), a maximum entropy deep reinforcement learning method (rendered in this document as flexible strategy evaluation), to incentivize more exploratory policies while abandoning those that offer no clear benefit. Considering that the reward function associated with equation (7) is likely to be zero, and that unsafe interactions with the environment are unacceptable in post-disaster scenarios, this embodiment constructs a worst-case soft actor-critic (WCSAC) framework, i.e., reinforcement learning with safety constraints.
[0121] First, we transform the optimization problem P into a SAC-based problem. By combining security constraints with maximum entropy deep reinforcement learning, the SAC-based solution to the optimization problem P1 can be expressed as:
[0122] P1:
[0123]
[0124] Where $\rho_{\pi}$ represents the trajectory distribution based on policy $\pi$. The optimization problem P is transformed into P1 according to the following two steps: (1) the coverage and network connectivity of the UAV swarm are combined to obtain an effective SER as the reward, corresponding to the optimization objective $r(s_t, a_t)$ in P1; (2) the inequality $E(t) \le E_r(t)$ in the energy consumption condition is transformed into $E(t) - E_r(t) \le 0$, corresponding to the first condition $c(s_t, a_t)$ in P1. Maximum entropy reinforcement learning enables an agent to attempt as many actions as possible to explore its state space, where $H$ represents the adjustable entropy threshold, i.e., the minimum degree of randomness in its exploratory behavior.
[0125] Next, role management provides an additional safeguard when solving problem P. For example, as shown in Figure 2(c), if the coverage area of $u_i$ includes the coverage area of another drone $u_j$, or the two overlap, the role of $u_j$ can be switched to SU to improve the energy efficiency of the drone swarm. Therefore, the actual energy consumed by the drone swarm in problem P is less than or equal to that in the solution to problem P1, which also supports the feasibility of the method proposed in this embodiment. The following subsections provide detailed explanations. The WCSAC-based 3D deployment framework for drones is shown in Figure 3.
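The overlap-based role switch described in this paragraph can be sketched in Python. The sketch assumes equal circular footprints of radius `r_u`, treats any overlap between two MUs as grounds for switching the later one to SU, and processes drones in identifier order; these choices and all names are illustrative assumptions, not the patent's procedure.

```python
import math


def switch_overlapping_to_standby(positions, roles, r_u):
    """Return a new role map in which redundant MUs become SUs.

    positions: dict mapping drone id -> (x, y) ground-plane position.
    roles:     dict mapping drone id -> one of "RU", "AU", "MU", "SU".
    Two equal circular footprints overlap when their centres are closer
    than 2 * r_u.
    """
    new_roles = dict(roles)
    mission = sorted(uid for uid, role in roles.items() if role == "MU")
    for idx, ui in enumerate(mission):
        if new_roles[ui] != "MU":
            continue  # already demoted by an earlier drone
        for uj in mission[idx + 1:]:
            if new_roles[uj] != "MU":
                continue
            if math.dist(positions[ui], positions[uj]) < 2.0 * r_u:
                new_roles[uj] = "SU"
    return new_roles
```

For instance, with MUs at (0, 0), (1, 0), and (10, 0) and a footprint radius of 1, the second drone is switched to SU while the first and third keep the MU role.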
[0126] 2.2 3D Deployment of Drones Based on WCSAC
[0127] As mentioned above, to avoid unbounded energy consumption by the UAV in remote detection scenarios, this embodiment sets a reasonable energy consumption limit that keeps energy consumption as low as possible while still allowing the UAV to perform exploration actions. Therefore, the first constraint of P1 is refined using the WCSAC framework. WCSAC replaces the safety assessment of SAC by learning the distribution of the long-term loss to obtain risk-averse decisions, i.e., strategies with non-negative remaining energy. Two parameterized Q-value functions corresponding to the optimization policy $\pi$ are used, namely $Q_r$ (with parameters $(\varphi, \psi)$) and $Q_c$ (with its own parameters).
[0128] 1) Actor-Critic
[0129] According to the transformation step (1) in formula (8), network connectivity is incorporated into SER. The reward function designed in this embodiment is as follows:
[0130]
[0131] The reward function returns the effective SER. That is, if an MU has no return path to the GS, the reward value is equal to 0. Furthermore, drones with effective coverage are assigned the MU role first.
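The effective-SER reward can be sketched as a connectivity indicator multiplied by the coverage ratio. The sketch assumes that any two nodes (drones or the GS) are linked when their distance lies between a minimum separation and a maximum communication distance, and that every node passed in may act as a relay; the function names and the breadth-first search are illustrative choices, not the patent's stated procedure.

```python
import math
from collections import deque


def connectivity_indicator(positions, mission_ids, gs_id, d_min, d_max):
    """Return 1 if every mission drone has a multi-hop path to the GS, else 0.

    positions: dict mapping node id -> (x, y, z), including the GS itself.
    """
    def linked(a, b):
        return d_min <= math.dist(positions[a], positions[b]) <= d_max

    reached = {gs_id}
    queue = deque([gs_id])
    while queue:  # breadth-first search outward from the GS
        node = queue.popleft()
        for other in positions:
            if other not in reached and linked(node, other):
                reached.add(other)
                queue.append(other)
    return 1 if all(m in reached for m in mission_ids) else 0


def effective_ser_reward(ser, iota):
    """Reward equals the SER when the network is connected and 0 otherwise."""
    return iota * ser
```

Multiplying by the indicator means that coverage gained by flying out of communication range earns nothing, which pushes the policy to keep relays in place.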
[0132] The value of the policy $\pi$ under the maximum entropy exploration objective, i.e., the flexible Q-value, can be obtained by starting from any Q-function and iteratively applying the modified Bellman backup operator $T_1^{\pi}$:
[0133]
[0134] The flexible state value function V(s) is expressed as:
[0135]
[0136] Here, β represents the temperature parameter, which adjusts the randomness of the optimal policy by determining the relative weight of the entropy term with respect to the reward.
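The role of the temperature can be illustrated with a sample-based estimate of the flexible (soft) state value. The sketch assumes the standard maximum-entropy form, in which the state value is the expectation over actions of the Q-value minus the temperature times the log-probability of the action; the patent's own expression is not reproduced in this text, and the function and argument names are hypothetical.

```python
def soft_state_value(q_values, log_probs, beta):
    """Monte Carlo estimate of V(s) from actions sampled at state s.

    q_values:  Q(s, a_k) for each sampled action a_k.
    log_probs: log pi(a_k | s) for the same actions.
    beta:      temperature weighting the entropy term against the reward.
    """
    if not q_values or len(q_values) != len(log_probs):
        raise ValueError("need equally many Q-values and log-probabilities")
    total = sum(q - beta * lp for q, lp in zip(q_values, log_probs))
    return total / len(q_values)
```

With a temperature of zero the estimate is the plain average Q-value; a larger temperature adds a bonus for actions that the policy assigns low probability, which is what encourages exploration.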
[0137] $B$ represents the distribution of previously sampled states and actions, i.e., the replay buffer, which can be used to train the parameters of the flexible Q-function by minimizing the flexible Bellman residual:
[0138]
[0139] The estimated value of Q can be calculated using the following formula:
[0140]
[0141] Furthermore, instead of forcing the entropy to a fixed value, the optimal temperature β is adjusted. β can be updated by minimizing the following equation:
[0142]
[0143] where the entropy threshold is set, in this embodiment, to the negative of the dimension of the action space.
[0144] 2) Safety Evaluation
[0145] In remote detection scenarios, previous research suggests that the loss function can simply be designed as the energy consumption of the drone swarm. However, since the agent cannot fully perceive the characteristics of the environment (the area to be detected) in advance, the energy consumption may be unbounded, so it must be limited. In addition, safety constraints are set to ensure that the drones can successfully return to the GS after completing the detection mission. In post-disaster scenarios, the drones should avoid any flight adjustment that disconnects the network topology or leaves a drone without remaining energy for its return. On this basis, this embodiment dynamically adjusts the safety constraints according to exploration needs and, using the idea of Lagrange multipliers, designs the following loss function:
[0146]
[0147] This loss function returns a binary value. This embodiment transforms the problem of minimizing energy consumption into the problem of maximizing remaining energy: if the drone's energy consumption exceeds its remaining energy, the loss function returns 1; otherwise, it returns 0. This mechanism relaxes the energy constraint, i.e., it does not blindly pursue minimum energy consumption, and it also keeps the loss on the same order of magnitude as the reward function, laying the foundation for balanced decisions.
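The binary behaviour of the loss can be written as a short Python sketch. The function and argument names are hypothetical; only the 1/0 rule is taken from the text.

```python
def energy_loss(propulsion_energy, remaining_energy):
    """Return 1 if energy consumption exceeds remaining energy, else 0."""
    return 1 if propulsion_energy > remaining_energy else 0
```

Because the output is either 0 or 1, the loss stays on a scale comparable to a bounded reward, whatever the absolute energy units are.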
[0148] The expected cumulative long-term loss of the drone starting from the state-action pair (s,a) is:
[0149]
[0150] Given policy π, the probability distribution p^π(C|s,a), and the long-term loss C^π(s,a), the distributional Bellman operator T2^π is defined as:
[0151] T2^π C(s,a) = c(s,a) + γC(s′,a′) (17)
[0152] where s′ ~ p(·|s,a) and a′ ~ π(·|s′).
[0153] To further simplify, replace the Maxwell-Boltzmann distribution with a Gaussian distribution:
[0154]
[0155] where the expectation and the variance are estimated using the standard Bellman equation:
[0156]
[0157]
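For orientation, the sketch below computes single-sample TD targets for the mean and variance of a Gaussian long-term loss, using the recursion commonly used for distributional safety critics. Whether formulas (19) and (20) take exactly this form is an assumption, and the function names are hypothetical.

```python
def mean_target(c, gamma, next_q):
    """TD target for the expected long-term loss: c + gamma * Q_c(s', a')."""
    return c + gamma * next_q


def variance_target(c, gamma, q, next_q, next_v):
    """TD target for the variance of the long-term loss (assumed form).

    c      : immediate loss c(s, a)
    q      : current mean estimate Q_c(s, a)
    next_q : mean estimate at (s', a')
    next_v : variance estimate at (s', a')
    """
    return (c ** 2 - q ** 2
            + 2.0 * gamma * c * next_q
            + gamma ** 2 * next_v
            + gamma ** 2 * next_q ** 2)
```

As a consistency check, when q already equals c + γ·next_q, the variance target reduces to γ²·next_v, i.e., the discounted variance of the next step.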
[0158] This embodiment uses two neural networks, with parameters μ and η respectively, to estimate the expectation and the variance for the safety evaluation. A simplified 2-Wasserstein distance is used to estimate the safety evaluation loss:
[0159]
[0160] where u ~ N(Q1,V1) and v ~ N(Q2,V2). Based on formulas (19) and (20), the safety evaluation is updated by calculating the 2-Wasserstein distance. Therefore, the goal of this embodiment is to minimize the two loss functions J_C(μ) and J_V(η):
[0161]
[0162]
[0163] where the two target terms are the temporal-difference (TD) targets from formulas (19) and (20), respectively.
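For two one-dimensional Gaussians, the squared 2-Wasserstein distance has a simple closed form, which the sketch below evaluates. Treating this closed form as the simplified distance of formula (21) is an assumption, and the function name is hypothetical.

```python
import math


def wasserstein2_gaussian(q1, v1, q2, v2):
    """Squared 2-Wasserstein distance between N(q1, v1) and N(q2, v2) in 1-D."""
    if v1 < 0 or v2 < 0:
        raise ValueError("variances must be non-negative")
    # (mean gap)^2 + (std gap)^2, expanded so only variances are needed.
    return (q1 - q2) ** 2 + v1 + v2 - 2.0 * math.sqrt(v1 * v2)
```

The first term penalizes the gap between the predicted mean and its TD target, and the remaining terms penalize the gap between the predicted and target standard deviations.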
[0164] In safety-critical domains, the constraint value D is highly likely to be exceeded. As shown in Figure 4, the long-term loss (Equation (16)) is on the x-axis and its probability density function is on the y-axis. pcntl(α) denotes the α-percentile (explained in detail in Equation (24)). This embodiment considers worst-case performance in a multi-UAV mission scenario where safety is paramount. The expected value of the loss function is replaced with the conditional value at risk (CVaR), a risk metric for judging the safety of a policy, and the constraint is imposed on CVaR. As a result, the optimized policy shifts the tail of p^π(C|s,a) to the left of D. That is, the drone evolves from initially aimless exploration to moving without crossing the boundary D. Before explaining how to transform the energy minimization problem into a remaining-energy maximization problem with an adjustable risk-level constraint, the following two definitions are given:
[0165] Definition 1 (Safety Risk Level): A positive scalar α∈(0,1] represents the safety risk level of WCSAC. WCSAC with a smaller α (α→0) is expected to be more pessimistic and risk-averse with respect to safety. Conversely, a larger α leads to less risk-averse behavior, and α = 1 corresponds to the risk-neutral case.
[0166] In the post-disaster application scenario considered in this embodiment, α refers to the percentage by which the drone's energy consumption exceeds its remaining energy. The focus here is on the α-percentile, i.e.:
[0167]
[0168] where F_C is the cumulative distribution function (CDF) of p^π(C|s,a).
[0169] Definition 2 (Safety Based on CVaR): Considering the risk level α, if policy π is safe, then the following should be satisfied:
[0170]
[0171] where (s_t, a_t) ~ ρ_π. Definition 2 provides a new constraint for learning risk-averse policies. This constraint is easier to compute than the traditional constraint in Equation (8) because, with a safety evaluation based on a Gaussian distribution, a closed-form estimate of CVaR can be obtained.
[0172] 3) Worst-case scenario assessment
[0173] Based on the distributional safety evaluation above, the expected long-term energy consumption can be replaced with a new safety metric, CVaR_α, where α refers to the proportion of energy consumption exceeding the remaining energy; this metric guides the safe exploration of the drones. The new safety metric for risk level α is:
[0174]
[0175] where φ(·) and Φ(·) represent the probability density function (PDF) and CDF of the standard normal distribution, respectively. According to Definition 2, given a risk level α, the optimal policy must make Γ_π satisfy:
[0176]
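The closed-form CVaR of a Gaussian long-term loss can be sketched with the Python standard library. The expression mean + φ(Φ⁻¹(α))/α · std is the usual Gaussian CVaR; taking it as the form of formula (26), and testing safety as Γ ≤ D, are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def gaussian_cvar(mean, variance, alpha):
    """CVaR at risk level alpha of a Gaussian loss N(mean, variance)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if variance < 0:
        raise ValueError("variance must be non-negative")
    if alpha == 1.0:
        # Risk-neutral case: CVaR reduces to the expectation.
        return mean
    std_normal = NormalDist()
    z = std_normal.inv_cdf(alpha)
    return mean + std_normal.pdf(z) / alpha * variance ** 0.5


def is_safe(mean, variance, alpha, d):
    """Assumed safety test: the CVaR must not exceed the bound d."""
    return gaussian_cvar(mean, variance, alpha) <= d
```

A smaller α inflates the variance term, so the same (mean, variance) pair is judged more pessimistically, in line with Definition 1.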
[0177] For policy optimization, using Kullback-Leibler (KL) divergence is the most convenient:
[0178]
[0179] where Z^π(s_t) is the partition function that normalizes the distribution. The KL divergence can be transformed into:
[0180]
[0181] where the Q-value is calculated using formula (13), and κ represents the safety weight. Since the partition function has no effect on updating the parameter θ, it can be omitted. Therefore, the new policy loss is as follows:
[0182]
[0183] The safety weight κ can be learned by minimizing the following loss function:
[0184]
[0185] where, if D ≥ Γ_π(s,a,α), κ decreases; otherwise, κ increases to improve safety.
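The direction of the κ update can be checked with a small sketch. It assumes a loss of the form J(κ) = E[κ (D − Γ_π(s,a,α))] and a plain gradient step clipped at zero; both are assumptions that reproduce the stated behaviour, not the exact expression of formula (31).

```python
def safety_weight_loss(kappa, d, cvar_values):
    """Assumed loss J(kappa) = mean(kappa * (D - Gamma)) over a batch."""
    return sum(kappa * (d - g) for g in cvar_values) / len(cvar_values)


def safety_weight_step(kappa, d, cvar_values, lr=0.1):
    """One gradient-descent step on kappa, kept non-negative."""
    grad = sum(d - g for g in cvar_values) / len(cvar_values)
    return max(kappa - lr * grad, 0.0)
```

When the batch CVaR is below D the gradient is positive and κ shrinks; when it is above D the gradient is negative and κ grows, putting more weight on safety in the policy loss.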
[0186] The pseudocode for the WCSAC-based UAV 3D deployment and connectivity maintenance algorithm is shown in Algorithm 1.
[0187]
[0188]
[0189]
[0190] 2.3 Role Management of Unmanned Aerial Vehicle Clusters
[0191] If these four roles are integrated into the action space, the space complexity becomes (R^3 × 4)^N. After the WCSAC agent (i.e., the drone) completes the algorithm, the drone's flight strategy, i.e., its 3D position, is obtained, and from it the initial set MU. At this point, the space complexity is reduced to (R^3 × 3)^N. Since the reward function (formula (9)) contains connectivity constraints, this embodiment assigns a role to each UAV based on its position during the mission and its distance from other UAVs. By analyzing the return paths of the determined MUs, a UAV role management mechanism is proposed, comprising role assignment and role switching.
[0192] Specifically, the return path of each MU u_i is determined first, i.e., the set of drones that MU u_i uses to send its sensor data back to the GS. Since RUs and AUs contribute similarly to the task, this embodiment does not distinguish between them. Therefore, all elements of the sets RU and AU can be obtained from these return paths, and U\(RU∪AU∪MU) gives the set of SUs. These steps complete the role assignment process. Next, this embodiment further refines the allocation of MUs. For example, as shown in Figure 2(b), the coverage area of MU3 is covered by another MU, MU1. To avoid wasting energy resources, the algorithm converts such drones to the SU role. The pseudocode for the drone swarm role management algorithm is shown in Algorithm 2.
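The role assignment and role switching steps can be sketched with plain set operations. The data layout (a dictionary of return paths and a dictionary of coverage relations) and the function names are hypothetical, and the rule that a drone already acting as MU keeps the MU role even when it lies on another return path is an assumption; Algorithm 2 remains the reference.

```python
def assign_roles(all_uavs, mission_uavs, return_paths):
    """Split the swarm into MU, relay (RU and AU merged), and SU sets.

    return_paths[m] lists the drones that MU m uses to reach the GS.
    """
    mu = set(mission_uavs)
    relay = set()
    for m in mu:
        relay.update(return_paths.get(m, ()))
    relay -= mu  # assumption: an MU on another return path stays an MU
    su = set(all_uavs) - relay - mu  # SU = U \ (RU ∪ AU ∪ MU)
    return mu, relay, su


def switch_redundant_mus(mu, su, covered_by):
    """Move every MU whose coverage lies inside another MU's coverage to SU.

    covered_by[m] names the MU whose coverage contains that of m.
    """
    mu, su = set(mu), set(su)
    for m, other in covered_by.items():
        if m in mu and other in mu and other != m:
            mu.discard(m)
            su.add(m)
    return mu, su
```

For a swarm {1, 2, 3, 4, 5} with MUs 1 and 3 that both relay through drone 2, the sketch yields relay set {2} and SU set {4, 5}; if MU3 is then reported as covered by MU1, MU3 is moved to the SU set.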
[0193]
[0194]
[0195] 2.4 Algorithm Description
[0196] For WCSAC-based 3D drone deployment, to reduce the size of the state-action space, this embodiment constructs an (N×4) state matrix comprising the positions of the ground base station, the target, and the N drones, as well as their remaining energy. The output, i.e., the action policy, is an (N×3) matrix representing the 3D positions of the drone swarm. The size of the action space therefore becomes (R^3)^N. The algorithm proposed in this embodiment is named WCSAC-cntRM and is described in detail below:
[0197] First, each drone (agent) stands by around the GS, performs an action from its current policy (flying to a random location), and triggers a detection mission when it observes a change in its current state (lines 2-3). Next, the state is updated (line 5). All traversed states are stored in the replay buffer for use in the subsequent gradient steps (line 7). Samples are drawn from the replay buffer to update all network parameters, and this step is repeated throughout the method. When the iteration count t exceeds the collection time Tc, the following update operations begin. For reward evaluation, this embodiment independently learns two soft Q-functions (φ1 and φ2) and the temperature to avoid overestimation and reduce positive bias (lines 13-14). For safety evaluation, this embodiment uses two independent neural networks to estimate the mean and variance of a Gaussian distribution (lines 18-19). The safety metric is calculated in the worst-case policy network (line 24), and the policy parameters are also updated (lines 25-26). When the iteration count is divisible by the update frequency ε, the parameters of the four target networks are soft-updated using the parameter τ∈[0,1] to reduce network fluctuations and instability (lines 28-31). Finally, drone role management is performed to maximize drone utilization.
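The periodic target-network update mentioned above can be sketched as a Polyak-style average. The convention target ← τ·online + (1 − τ)·target, the flat parameter lists, and the function names are assumptions for illustration.

```python
def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target parameters with weight tau."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]


def maybe_update_targets(step, update_every, targets, onlines, tau):
    """Update every target network only when step is a multiple of update_every."""
    if step % update_every != 0:
        return targets
    return [soft_update(t, o, tau) for t, o in zip(targets, onlines)]
```

A small τ makes the target networks trail the online networks slowly, which is what damps fluctuations in the TD targets.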
[0198] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for post-disaster drone deployment based on worst-case flexible strategy assessment, characterized in that it includes:
obtaining the current location of the drone swarm, the location of the mission target, the remaining energy, the communication energy consumption, the flight energy consumption, and the maximum communication distance of the drones;
constructing an energy consumption model for the UAV swarm based on the communication energy consumption and the flight energy consumption, and constructing a channel model based on the current position of the UAV swarm, the position of the mission target, and the maximum communication distance of the UAVs;
setting the optimization constraint objective for the drone swarm;
based on the energy consumption model, the channel model, and the remaining energy of the UAV swarm, optimizing the UAV swarm using the optimization constraint objective to obtain the optimal flight strategy of the UAV swarm;
wherein optimizing the drone swarm includes: setting a reward evaluation model, a safety evaluation model, and a worst-case evaluation model for the optimization constraint objective, wherein the parameters of the reward evaluation model include the effective coverage rate of the UAV, and the parameters of the safety evaluation model include the remaining energy consumption of the UAV; and obtaining the optimal flight position strategy for the drone swarm based on the reward evaluation model, the safety evaluation model, and the worst-case evaluation model;
the reward evaluation model is as follows:
wherein r(s_t, a_t) represents the reward function; the connectivity term indicates network connectivity; S_c(t) represents the actual coverage area of the target region by a single drone at time t; N represents the number of drones; a_t represents the drone's action choice at time t, i.e., its next position; and s_t represents the drone's state, i.e., its current location;
the construction of the safety evaluation model includes: designing the loss function; obtaining the conditional value at risk based on the expected value of the loss function; and obtaining the safety evaluation model based on the conditional value at risk;
the loss function is:
wherein E_p(i) is the energy consumption for propulsion flight of the drone, and E_r(i) is the remaining energy of the drone;
the safety evaluation model is as follows:
wherein F_C represents the cumulative distribution function; α indicates the risk level; D represents the constraint that the drone's energy consumption exceeds its remaining energy, with a value between 0 and 1; C(s_t, a_t) represents the loss function; a_t represents the drone's action choice at time t, i.e., its next position; and s_t represents the drone's state, i.e., its current location;
the worst-case evaluation model is as follows:
wherein Γ_π(s,a,α) indicates the safety metric; α indicates the risk level; Φ represents the cumulative distribution function; the mean and variance terms represent the expectation and the variance, respectively; s represents the state space; and a represents the action space.
2. The method for post-disaster drone deployment based on worst-case flexible strategy assessment according to claim 1, characterized in that the energy consumption model is as follows:
wherein E is the total energy consumption of the drone swarm, E_p(i) is the energy consumption for propulsion flight of a drone, N is the number of drones, and i denotes the i-th drone.
3. The method for post-disaster drone deployment based on worst-case flexible strategy assessment according to claim 1, characterized in that the channel model is as follows:
wherein the quantities in the model are: the maximum communication range of the drone; the given transmission power; the variance of the Gaussian white noise; the threshold of the signal-to-noise ratio; the height of the drone relative to the ground base station and the ground surface; and the carrier frequency f_c.
4. The method for post-disaster drone deployment based on worst-case flexible strategy assessment according to claim 1, characterized in that the optimization constraint objective is:
wherein ρ_π denotes the trajectory distribution induced by the policy π; r(s_t, a_t) represents the reward function; c(s_t, a_t) represents the loss function; a_t represents the drone's action choice at time t, i.e., its next position; s_t represents the state of the drone at time t, i.e., its current position; the expectation operator denotes the cumulative expectation; D is the remaining energy constraint; and H is an adjustable entropy threshold that enforces a minimum degree of randomness in the drones' exploration of the space.
5. The method for post-disaster drone deployment based on worst-case flexible strategy assessment according to claim 1, characterized in that obtaining the optimal flight strategy includes: performing role management on the drone swarm, wherein role management includes role assignment and role switching; role assignment includes: based on the energy utilization efficiency of the drone cluster, assigning each drone as a relay drone RU, an articulated drone AU, a detection and coverage task drone MU, or a standby drone SU; and role switching includes: after the role assignment process is completed, further refining the allocation of MUs, wherein if the coverage area of an MU is covered by another MU, the role of the covered MU is switched to SU.