A Dynamic Aerial Wireless Mission Network Planning Method and System Based on Multi-Agent Reinforcement Learning

Independent agents trained with multi-agent reinforcement learning are assigned to the service subnets of a dynamic aerial wireless mission network, where they make distributed decisions and switch links. This solves the collaborative optimization problem of network planning in dynamic environments and yields efficient, stable topology adjustment and planning results.

CN122308069A · Pending · Publication Date: 2026-06-30 · XIDIAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIDIAN UNIV
Filing Date
2026-03-16
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve coordinated optimization of QoS for high throughput, low latency, and high reliability services in dynamic airborne wireless mission networks. Furthermore, traditional algorithms and single-agent deep reinforcement learning methods suffer from low planning efficiency, insufficient coordinated optimization capabilities, and poor training stability in dynamic environments, making it difficult to provide real-time, reliable, and globally efficient planning solutions.

Method used

A multi-agent reinforcement learning approach is adopted, with one agent configured for each of the high-throughput, low-latency, and high-reliability service subnets. Each agent makes distributed decisions from its local observation state, and link switching is optimized using an action masking mechanism and a cooperative reward function. A dynamic network topology model and a heterogeneous service model are constructed, and the policy networks are trained centrally on them to achieve dynamic adjustment of the network topology.

Benefits of technology

It achieves real-time performance and stability in network planning under dynamic environments, reduces the coupling of online decision-making, satisfies the fixed constraint of link budget, avoids resource waste, and outputs a complete topology evolution sequence, providing a data foundation for subsequent analysis and iteration.

✦ Generated by Eureka AI based on patent content.

Figure CN122308069A_ABST

Abstract

This invention discloses a dynamic aerial wireless mission network planning method and system based on multi-agent reinforcement learning. It addresses the problems of low planning efficiency, insufficient collaborative optimization capabilities, poor training stability, and difficulty in providing real-time, reliable, and globally efficient planning schemes in existing technologies. It achieves a reproducible, verifiable, and deployable software implementation to improve engineering usability. The method includes: configuring a corresponding agent for each service subnet; wherein the agent pre-loads a trained policy network; at each decision moment, each agent collects local observation states and inputs them into the corresponding policy network to obtain the optimal pairwise link switching action; each agent, based on the action, controls the connection and disconnection of communication links between UAVs within the corresponding service subnet, updating the network topology of the corresponding service subnet; until the mission ends, the network planning result in the mission time domain is obtained.

Description

Technical Field

[0001] This invention relates to the field of dynamic aerial wireless mission network planning technology, and in particular to a dynamic aerial wireless mission network planning method and system based on multi-agent reinforcement learning.

Background Technology

[0002] In aerial wireless mission networks (such as emergency communication support, disaster monitoring, infrastructure inspection, low-altitude logistics, and urban governance scenarios), a complete mission often requires the network to simultaneously support three types of heterogeneous services within the same mission time domain: high-throughput data backhaul tasks, low-latency control and alarm interaction tasks, and high-reliability critical data reporting tasks. To achieve these goals, existing solutions typically plan separate networks for different services or adopt static, uniform topology strategies. This makes it difficult to achieve coordinated quality of service (QoS) assurance for multiple tasks in environments with limited resources and rapidly changing network topology due to node mobility.

[0003] Traditional network planning algorithms, such as graph theory optimization, mixed-integer programming, and heuristic rules, typically perform centralized computation based on a network "snapshot" at a certain moment. These methods can find feasible solutions in static scenarios where node positions are fixed or change slowly, but they suffer from computational lag, frequent replanning, and short-sighted optimization objectives in dynamic task environments with continuous UAV maneuvers and rapid topology evolution. This makes it difficult to achieve long-term performance optimization across the entire task time domain, leading to a decline in the overall network service quality and resource utilization efficiency.

[0004] Adaptive methods such as deep reinforcement learning offer a new approach to dynamic planning. However, the mainstream single-agent framework faces fundamental limitations in this scenario: the large number of network nodes and task subnets causes a combinatorial explosion of the action space and low exploration efficiency; throughput, latency, and reliability differ so much in units and variation patterns that a single reward function can hardly balance them, so the policy tends to favor a single metric; and the time-varying topology makes the environment non-stationary, which increases the variance of policy gradient estimates and hinders stable convergence during training. Therefore, existing technologies struggle to achieve real-time, collaborative, and efficient planning for multi-task aerial wireless mission networks in complex dynamic environments with high mobility and strong constraints.

[0005] Furthermore, from an engineering implementation perspective, current network planning software for aerial wireless mission networks is insufficient or incomplete: it lacks an integrated simulation/control platform that supports reproducible experiments and rapid iteration, and it lacks a unified control and data acquisition toolchain for "multi-service QoS coordination + dynamic topology + constraint feasibility," which makes it difficult to verify algorithms quickly and to integrate and deploy them in engineering practice. Therefore, the technical problems this invention aims to solve include: (1) achieving coordinated QoS optimization for three types of services (high throughput, low latency, and high reliability) within the time domain of a dynamic mission; (2) maintaining the feasibility and continuous stability of the planning scheme under hard constraints such as communication radius, link budget, and topology connectivity; (3) integrating the intelligent planning algorithm with network simulation and network control in a closed loop to achieve a deployable and verifiable system-level implementation of network planning.

[0006] In the field of dynamic airborne wireless mission network planning, in order to achieve quality of service (QoS) assurance for various heterogeneous tasks such as high throughput, low latency and high reliability, existing technical solutions can be mainly divided into two categories: traditional optimization algorithms based on mathematical models and data-driven adaptive learning algorithms.

[0007] The first category comprises traditional optimization algorithms based on mathematical models, such as graph theory optimization, mixed-integer programming, and heuristic rules. These algorithms typically simplify dynamic network planning problems into static optimization problems based on one or more time-lapse "snapshots" of the network. Through centralized solutions, these methods can obtain feasible topology solutions in scenarios where node positions are fixed or change slowly. However, in dynamic mission environments where UAVs are constantly maneuvering and network topology evolves rapidly, these methods have fundamental limitations: First, computation relies on complete global state information, and the solution time increases rapidly with network size, making it difficult to meet the real-time requirements of the mission; second, optimization usually targets instantaneous performance or short-term goals, lacking consideration for long-term benefits across the entire mission time domain, easily leading to short-sighted planning solutions; third, frequent restarts of computation are required when changes in node positions cause abrupt changes in the feasible solution space, making it difficult to guarantee the continuity and stability of the solution. Therefore, traditional algorithms are ill-suited to highly dynamic and time-varying environments and cannot achieve long-term optimization of network performance.

[0008] The second category comprises data-driven adaptive learning algorithms (the closest existing solution). To overcome the limitations of traditional algorithms, some research has introduced reinforcement learning frameworks, modeling network planning as a sequential decision problem. Among these, single-agent deep reinforcement learning methods based on proximal policy optimization have shown potential in static or quasi-static network planning. This method involves a single agent interacting with the environment to learn policies for selecting and adjusting links within a fixed set of nodes, optimizing a weighted sum of multiple objectives such as throughput and latency. Network constraints (such as communication radius) can be embedded through action masks.

[0009] However, directly applying this single-agent proximal policy optimization framework to dynamic aerial wireless mission network planning still exposes several key drawbacks:
Explosion of the decision dimension and action space: Dynamic planning requires simultaneous link selection for three types of task subnets over a continuous time domain. The action space is the Cartesian product of the candidate link combinations of each subnet and changes over time, which leads to low exploration efficiency and difficult policy convergence.
Difficulty in balancing and shaping multi-objective rewards: Throughput, latency, and reliability are metrics with different units and widely different numerical ranges, and they vary with link distance and channel state in complex ways. A single agent relies on one unified reward function to integrate all objectives, which makes reward design difficult and easily biases the policy toward one metric at the expense of the others.
Environmental non-stationarity and training instability: In dynamic scenarios, node maneuvers and policy updates jointly cause the state transition distribution to change continuously, which increases the variance of value estimates and policy gradient estimates and results in oscillation and difficult convergence during training.

[0010] In summary, while the closest single-agent deep reinforcement learning planning schemes have made progress in static scenarios, their core architecture struggles to cope with the challenges of action space, reward function, and environmental non-stationarity caused by "node maneuvering, multi-task, and distributed decision-making" in dynamic aerial wireless mission networks. This results in low planning efficiency, insufficient collaborative optimization capabilities, and poor training stability, making it difficult to provide real-time, reliable, and globally efficient planning solutions.

Summary of the Invention

[0011] This invention provides a dynamic aerial wireless mission network planning method and system based on multi-agent reinforcement learning, which solves the problems of low planning efficiency, insufficient collaborative optimization capability, poor training stability, and difficulty in providing real-time, reliable, and globally efficient planning schemes in the prior art. It achieves a reproducible, verifiable, and deployable software implementation to improve engineering usability.

[0012] In a first aspect, the present invention provides a dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning, comprising:
S101, configuring a corresponding agent for each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, wherein each agent is preloaded with a trained policy network;
S102, at each decision moment, each agent collecting the local observation state within its service subnet and inputting it into the corresponding policy network to obtain the optimal pairwise link switching action;
S103, each agent connecting or disconnecting the communication links between UAVs in its service subnet according to the optimal pairwise link switching action, and updating the network topology of that service subnet;
S104, repeating S102 to S103 until the mission ends to obtain the network planning result in the mission time domain, wherein the network planning result includes the topology evolution sequence of the service subnet topologies after each decision moment.
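The loop formed by S102 to S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `apply_swap` and `plan`, the representation of a link as a `frozenset` of two UAV ids, and the shape of the observation dictionary are all hypothetical, and each policy is treated as an opaque callable standing in for a trained policy network.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Link = FrozenSet[int]                   # undirected link between two UAV ids (assumed encoding)
Action = Optional[Tuple[Link, Link]]    # (link to disconnect, link to establish); None = no-op
Policy = Callable[[dict], Action]       # hypothetical stand-in for a trained policy network


def apply_swap(active: Set[Link], action: Action) -> Set[Link]:
    """S103: disconnect one active link and establish one new link."""
    if action is None:
        return set(active)
    drop, add = action
    return (active - {drop}) | {add}


def plan(subnets: Dict[str, Set[Link]],
         policies: Dict[str, Policy],
         steps: int) -> List[Dict[str, Set[Link]]]:
    """Repeat S102-S103 for `steps` decision moments (S104) and return
    the topology evolution sequence, one snapshot per decision moment."""
    history: List[Dict[str, Set[Link]]] = []
    for t in range(steps):
        for name, policy in policies.items():
            # S102: each agent only sees its own subnet (local observation).
            obs = {"t": t, "active": subnets[name]}
            subnets[name] = apply_swap(subnets[name], policy(obs))
        history.append({k: set(v) for k, v in subnets.items()})
    return history
```

Because every non-trivial action removes exactly one link and adds exactly one, the number of active links per subnet stays constant across the returned sequence.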

[0013] In conjunction with the first aspect, in one possible implementation, the step of inputting the local observation state into the corresponding policy network to obtain the optimal pairwise link switching action includes:
The policy network outputs an action probability distribution based on the local observation state;
The probability of every action that does not meet the first preset constraint is set to zero through the action masking mechanism, wherein the first preset constraint includes the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint;
The optimal pairwise link switching action is obtained by sampling from the remaining actions, wherein the optimal pairwise link switching action consists of selecting a link to be disconnected from the set of active links of the corresponding service subnet at the current time, and selecting a link to be newly established from the set of feasible inactive links of that service subnet at the current time.

[0014] In conjunction with the first aspect, in one possible implementation, the policy network is obtained through the following offline training steps: A dynamic network topology model and a heterogeneous service model are constructed; Based on the dynamic network topology model and the heterogeneous service model, an agent is assigned to each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, and the local observation state, pairwise link switching action, and cooperative reward function of each agent are defined; Based on the dynamic network topology model and the heterogeneous service model, a dynamic environment simulation module is constructed, and in the environment simulation the policy network of each agent is trained in a centralized manner based on the local observation state, pairwise link switching action, and cooperative reward function to obtain the trained policy network.

[0015] In conjunction with the first aspect, in one possible implementation, the construction of the dynamic network topology model and heterogeneous service model includes: The set of UAV nodes and the time-varying set of candidate links are modeled as a time-varying undirected graph; The dynamic network topology model is constructed from this time-varying undirected graph, and the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint are defined in the dynamic network topology model; Based on the mission requirements, the node set of the time-varying undirected graph is divided into a non-overlapping high-throughput service subnet, low-latency service subnet, and high-reliability service subnet; The heterogeneous service model is constructed from these three service subnets, and the fixed link budget constraint and the optimization objective are defined in the heterogeneous service model.
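One way to picture the time-varying undirected graph is as a candidate link set recomputed from node positions at each time instant. The sketch below makes several assumptions that the text does not state: positions are 2-D coordinates, candidate links are restricted to node pairs inside the same subnet, and the helper names `candidate_links` and `is_partition` are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Position = Tuple[float, float]  # assumed 2-D coordinates


def candidate_links(positions: Dict[int, Position],
                    members: Set[int],
                    radius: float) -> Set[FrozenSet[int]]:
    """Candidate links of one subnet at one time instant: node pairs whose
    distance does not exceed the maximum communication radius."""
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(members), 2)
        if math.dist(positions[u], positions[v]) <= radius
    }


def is_partition(all_nodes: Iterable[int], subnets: Dict[str, Set[int]]) -> bool:
    """True if the subnets are pairwise disjoint and together cover all nodes."""
    seen: Set[int] = set()
    for members in subnets.values():
        if seen & members:      # a node appears in two subnets
            return False
        seen |= members
    return seen == set(all_nodes)
```

Calling `candidate_links` again after the node positions change yields the candidate link set for the next time instant, which is what makes the graph time-varying.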

[0016] In conjunction with the first aspect, in one possible implementation, the optimization objective is expressed in terms of the following quantities (the formula and its symbols are not reproduced in this text): the task importance weight of the high-throughput service subnet and the normalized performance value of the high-throughput service subnet; the task importance weight of the low-latency service subnet and the normalized performance value of the low-latency service subnet; and the task importance weight of the high-reliability service subnet and the normalized performance value of the high-reliability service subnet.
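For orientation only, one form consistent with the quantities listed in this paragraph is a weighted sum of the three per-subnet values accumulated over the mission. Every symbol below is an assumption introduced for illustration (the subscripts H, L, and R for the high-throughput, low-latency, and high-reliability subnets, the weights w, the per-subnet values Q-hat, and the horizon T), and the summation over decision moments is inferred from the description of the objective as covering the full mission time domain rather than taken from the original formula.

```latex
J \;=\; \sum_{t=1}^{T} \Bigl( w_{H}\,\hat{Q}_{H}(t) \;+\; w_{L}\,\hat{Q}_{L}(t) \;+\; w_{R}\,\hat{Q}_{R}(t) \Bigr)
```

Under this reading, a larger weight makes the corresponding subnet's service quality count more toward the global score that all three agents are rewarded on.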

[0017] In conjunction with the first aspect, in one possible implementation, the construction of the dynamic environment simulation module includes: Based on the time-varying undirected graph, a smooth turning model is used to generate the continuous motion trajectories of the UAV nodes and update the node positions in real time; Based on the updated positions, a Rician fading channel model is used to measure the throughput, end-to-end delay, and bit error rate of each candidate link in real time; Based on the communication radius constraint defined in the dynamic network topology model, the legality of the action selected by the agent is verified to ensure that the length of every active link does not exceed the maximum communication radius; Based on the topological connectivity constraint defined in the dynamic network topology model, the connectivity of each service subnet is verified after the action to ensure that the graph formed by the active links of each subnet remains connected; Based on the comprehensive service quality optimization objective defined in the heterogeneous service model, the immediate reward is calculated and fed back to each agent.
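The channel measurement step can be illustrated with a standard Rician fading sample. The sketch below is an assumption-laden illustration: the unit-mean-power normalization, the K-factor parameter, and the use of the Shannon capacity formula as a throughput proxy are choices made here, not details given in the text, and delay and bit error rate are not modeled.

```python
import math
import random


def rician_power_gain(k_factor: float, rng: random.Random) -> float:
    """Sample the power gain |h|^2 of a Rician channel normalized to unit mean power.

    h = sqrt(K/(K+1)) + scatter, where the scatter is complex Gaussian
    with total variance 1/(K+1); K is the line-of-sight to scatter power ratio.
    """
    los = math.sqrt(k_factor / (k_factor + 1.0))          # line-of-sight amplitude
    sigma = math.sqrt(1.0 / (2.0 * (k_factor + 1.0)))     # std per real dimension
    re = los + rng.gauss(0.0, sigma)
    im = rng.gauss(0.0, sigma)
    return re * re + im * im


def shannon_throughput(bandwidth_hz: float, mean_snr: float, gain: float) -> float:
    """Throughput proxy in bit/s (assumption: Shannon capacity at the faded SNR)."""
    return bandwidth_hz * math.log2(1.0 + mean_snr * gain)
```

In a simulator, `mean_snr` would typically fall with link distance, so that moving nodes change both the average link quality and its fading realization.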

[0018] In conjunction with the first aspect, in one possible implementation, the step of centrally training the policy network in the environment simulation to obtain the trained policy network includes: A multi-agent proximal policy optimization algorithm is adopted, with an Actor policy network constructed for each agent and one centralized Critic value network; In the environment simulation, each agent collects the current local observation state within its service subnet, as defined above, and inputs it into its Actor policy network to obtain the action probability distribution and select a pairwise link switching action for training; The environment simulation executes that action, verifies its legality against the communication radius constraint and topological connectivity constraint defined in the dynamic network topology model, calculates the immediate reward from the optimization objective defined in the heterogeneous service model (the comprehensive global service quality score over the full time domain), and returns each agent's local observation state at the next moment; The experience generated by each interaction is stored as an experience tuple in the experience replay pool, wherein the experience tuple includes the current local observation state, the pairwise link switching action, the immediate reward, and the local observation state at the next moment; The interaction and policy updates continue until the policy network converges, yielding the trained policy network.
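Two pieces of this training step lend themselves to a short sketch: the experience tuple and the clipped surrogate objective that proximal policy optimization maximizes for each Actor. The code below is a generic illustration, not the training procedure of this invention: the field names, the integer action encoding, and the clipping parameter `eps = 0.2` are assumptions, and the advantages are taken as given inputs (in a multi-agent setup they would be derived from the centralized Critic).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Experience:
    """One interaction record, mirroring the four fields named in the text."""
    obs: Tuple[float, ...]        # current local observation state
    action: int                   # index of the pairwise link switching action (assumed encoding)
    reward: float                 # immediate reward
    next_obs: Tuple[float, ...]   # local observation state at the next moment


def clipped_surrogate(ratios: List[float],
                      advantages: List[float],
                      eps: float = 0.2) -> float:
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratios` are new-policy / old-policy action probabilities.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(1.0 + eps, r))
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

The clipping limits how far a single update can move an Actor away from the policy that collected the data, which is one reason this family of methods is often chosen when training stability is a concern.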

[0019] In a second aspect, the present invention provides a dynamic aerial wireless mission network planning system based on multi-agent reinforcement learning, for performing the method described herein, comprising: multiple agents configured in the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, respectively. Each agent includes: a model storage unit, used to store the preloaded, trained policy network; a local observation acquisition unit, connected to the model storage unit and configured to acquire the local observation state within the corresponding service subnet at each decision moment; a policy inference unit, connected to the model storage unit and the local observation acquisition unit respectively, and configured to input the local observation state into the policy network to obtain the optimal pairwise link switching action; and a topology adjustment execution unit, connected to the policy inference unit and configured to connect or disconnect the communication links between UAVs in the corresponding service subnet according to the optimal pairwise link switching action.

[0020] In conjunction with the second aspect, one possible implementation also includes an offline training subsystem deployed at a ground control center, the offline training subsystem comprising: The dynamic network and service modeling module is configured to build dynamic network topology models and heterogeneous service models. The multi-agent decision-making process construction module is connected to the dynamic network and service modeling module. It is configured to assign an agent to the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet based on the dynamic network topology model and the heterogeneous service model, and define the local observation state, pairwise link switching action, and cooperative reward function of each agent. The high-fidelity dynamic environment simulation module is connected to the multi-agent decision-making process construction module and is configured to simulate the maneuver trajectory of UAV nodes and channel fading, and update link performance parameters in real time. The centralized training engine module is connected to the multi-agent decision-making process construction module and the high-fidelity dynamic environment simulation module, respectively. It is configured to construct a dynamic environment simulation module based on the dynamic network topology model and the heterogeneous service model. In the environment simulation, the policy network corresponding to each agent is centrally trained based on the local observation state, paired link exchange actions and cooperative reward function to obtain the trained policy network. The model loading module, connected to the centralized training engine module, is configured to load the trained policy network into the model storage unit of each agent.

[0021] One or more technical solutions provided in this invention have at least the following technical effects or advantages: This invention configures a corresponding agent for each of the three service subnets (high-throughput, low-latency, and high-reliability), and each agent is preloaded with a trained policy network. Configuring an independent agent for each service subnet decomposes the complex multi-objective collaborative problem into three sub-problems and reduces the coupling of online decision-making. At each decision moment, each agent collects the local observation state within its service subnet and inputs it into the corresponding policy network to obtain the optimal pairwise link switching action. Because each agent decides from local observations, no global information is needed, which avoids the communication overhead and latency of collecting it in large-scale networks and enables distributed real-time planning. Each agent then connects or disconnects communication links between UAVs within its service subnet according to that action and updates the subnet's network topology. Pairwise link switching adjusts the topology dynamically while keeping the total number of active links in each subnet constant, so the fixed link budget constraint is satisfied by construction and resource over-allocation or waste is avoided. These steps repeat until the mission ends, yielding the network planning result, which includes the topology evolution sequence formed by the updated subnet topologies after each decision moment. This continuous iteration extends network planning over the entire mission time domain, moving from static snapshot optimization to dynamic continuous optimization. The output topology evolution sequence records the network's trajectory throughout the mission and provides a data foundation for post-mission analysis, performance evaluation, and policy iteration.

Attached Figure Description

[0022] Figure 1 is a flowchart of the steps of the dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the structure of the service function chain deployment module used in an embodiment of the present invention;
Figure 3 is a schematic diagram of the state processing and decision-making module of the multi-agent policy network (Actor) provided in an embodiment of the present invention.

Detailed Implementation

[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0024] In a first aspect, this invention provides a dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning. Referring to Figure 1, the method includes the following steps S101 to S104.

[0025] S101, configure a corresponding agent for each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, wherein each agent is preloaded with a trained policy network;
S102, at each decision moment, each agent collects the local observation state within its service subnet and inputs it into the corresponding policy network to obtain the optimal pairwise link switching action.
Specifically, in step S102, inputting the local observation state into the corresponding policy network to obtain the optimal pairwise link switching action includes the following steps S1021 to S1023.

[0026] S1021, the policy network outputs the action probability distribution based on the local observation state;
S1022, the probability of every action that does not meet the first preset constraint is set to zero through the action masking mechanism, wherein the first preset constraint includes the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint;
S1023, the optimal pairwise link switching action is sampled from the remaining actions, wherein the optimal pairwise link switching action consists of selecting a link to be disconnected from the set of active links of the corresponding service subnet at the current time, and selecting a link to be newly established from the set of feasible inactive links of that service subnet at the current time.
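The three checks in S1022 can be sketched as a filter over all (disconnect, establish) pairs. This is an illustrative sketch under assumptions: links are `frozenset` pairs of node ids, positions are 2-D, connectivity is tested with a breadth-first search, and the function names `is_connected` and `legal_swaps` are hypothetical.

```python
import math
from collections import deque
from itertools import product


def is_connected(nodes, links) -> bool:
    """Breadth-first search: True if every node is reachable over `links`."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for link in links:
        u, v = tuple(link)
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == nodes


def legal_swaps(nodes, active, inactive, positions, radius):
    """Return the (drop, add) pairs that pass all three mask constraints."""
    legal = []
    for drop, add in product(sorted(active, key=sorted), sorted(inactive, key=sorted)):
        u, v = tuple(add)
        within_radius = math.dist(positions[u], positions[v]) <= radius
        no_multi_edge = add not in active                     # never duplicate an active link
        stays_connected = is_connected(nodes, (active - {drop}) | {add})
        if within_radius and no_multi_edge and stays_connected:
            legal.append((drop, add))
    return legal
```

The returned list is exactly the set of actions whose probabilities survive the mask; every other pairwise action would have its probability set to zero before sampling.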

[0027] For example, taking a high-throughput service subnet as an example, assume that the subnet consists of 4 drones, numbered 1, 2, 3, and 4. At the current decision time t, the local observation state collected by the agent includes: the normalized position of each node (e.g., node 1 is at (0.2, 0.3), node 2 is at (0.5, 0.6), etc.), speed, heading angle; distance margin, normalized throughput estimate, etc. of each candidate link. At the current time, the set of active links in the subnet is A(t) = {link12, link23, link34} (i.e., the links between nodes 1-2, 2-3, and 3-4 are in a connected state), and the set of feasible inactive links is F(t) = {link14, link24} (i.e., the links between nodes 1-4 and 2-4 are within the communication radius and are currently inactive). After receiving the aforementioned local observation state, the agent's policy network outputs an action probability distribution after forward propagation. This distribution covers all possible pairwise link-swapping actions, including (link12, link14), (link12, link24), (link23, link14), (link23, link24), (link34, link14), (link34, link24), and no-op (preserving the topology). Assume the original probability distribution output by the policy network is: P(link12, link14) = 0.1, P(link12, link24) = 0.05, P(link23, link14) = 0.3, P(link23, link24) = 0.2, P(link34, link14) = 0.15, P(link34, link24) = 0.1, and no-op = 0.1.

[0028] Subsequently, the action masking mechanism is executed: each action is checked against the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint. Assume that the distance between node 2 and node 4 exceeds the maximum communication radius. All actions that create the new link link24, namely (link12, link24), (link23, link24), and (link34, link24), then violate the communication radius constraint, are deemed illegal, and have their probability set to zero. The remaining swaps are checked for connectivity: (link12, link14) leaves the active links {link23, link34, link14}, which form the path 2-3-4-1; (link23, link14) leaves {link12, link34, link14}, which form the path 2-1-4-3; and (link34, link14) leaves {link12, link23, link14}, which form the path 4-1-2-3. All three keep the subnet connected and satisfy all constraints, and the no-op is also legal. Therefore, the legal actions are (link12, link14), (link23, link14), (link34, link14), and the no-op, with renormalized probabilities P(link12, link14) = 0.1 / (0.1 + 0.3 + 0.15 + 0.1) = 0.1 / 0.65 ≈ 0.154, P(link23, link14) = 0.3 / 0.65 ≈ 0.462, P(link34, link14) = 0.15 / 0.65 ≈ 0.231, and P(no-op) = 0.1 / 0.65 ≈ 0.154. The agent samples from this renormalized distribution. For example, if the action (link23, link14) is sampled, it is taken as the optimal pairwise link switching action at the current decision moment: link23 between node 2 and node 3 is disconnected, and link14 between node 1 and node 4 is established at the same time.
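The masking and renormalization arithmetic of this example can be reproduced with a few lines of Python. The function names `mask_and_renormalize` and `sample_action` and the string encoding of actions are hypothetical; the probabilities are the ones assumed in the example.

```python
import random


def mask_and_renormalize(probs: dict, legal: set) -> dict:
    """Zero out illegal actions, then rescale the rest to sum to one."""
    masked = {a: (p if a in legal else 0.0) for a, p in probs.items()}
    total = sum(masked.values())
    if total <= 0.0:
        raise ValueError("no legal action has positive probability")
    return {a: p / total for a, p in masked.items()}


def sample_action(probs: dict, rng: random.Random):
    """Draw one action according to the (renormalized) distribution."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


raw = {
    ("link12", "link14"): 0.10, ("link12", "link24"): 0.05,
    ("link23", "link14"): 0.30, ("link23", "link24"): 0.20,
    ("link34", "link14"): 0.15, ("link34", "link24"): 0.10,
    "no-op": 0.10,
}
legal = {("link12", "link14"), ("link23", "link14"), ("link34", "link14"), "no-op"}
masked = mask_and_renormalize(raw, legal)
for action in sorted(legal, key=str):
    print(action, round(masked[action], 3))
```

Rounded to three decimals, the four legal actions come out as 0.154, 0.462, 0.231, and 0.154 (for the no-op), and a masked action can never be drawn by `sample_action` because its weight is zero.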

[0029] Here, the policy network is obtained through the following offline training steps: (1) Constructing a dynamic network topology model and a heterogeneous service model. Here, constructing the dynamic network topology model and the heterogeneous service model includes: (1.1) The set of UAV nodes and the time-varying set of candidate links are modeled as a time-varying undirected graph; the dynamic network topology model is constructed from this time-varying undirected graph, and the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint are defined in the dynamic network topology model; (1.2) Based on the mission requirements, the node set of the time-varying undirected graph is divided into a non-overlapping high-throughput service subnet, low-latency service subnet, and high-reliability service subnet; the heterogeneous service model is constructed from these three service subnets, and the fixed link budget constraint and the optimization objective are defined in the heterogeneous service model.

[0030] Specifically, the optimization objective is expressed as: $\max J = w_{tp}\,\tilde{Q}_{tp} + w_{dl}\,\tilde{Q}_{dl} + w_{rl}\,\tilde{Q}_{rl}$, where $w_{tp}$ denotes the task importance weight of the high-throughput service subnet and $\tilde{Q}_{tp}$ its standardized performance value; $w_{dl}$ denotes the task importance weight of the low-latency service subnet and $\tilde{Q}_{dl}$ its standardized performance value; and $w_{rl}$ denotes the task importance weight of the high-reliability service subnet and $\tilde{Q}_{rl}$ its standardized performance value.

[0031] (2) Based on the dynamic network topology model and the heterogeneous service model, one agent is assigned to each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, and the local observation state, pairwise link switching action, and cooperative reward function of each agent are defined. (3) Based on the dynamic network topology model and the heterogeneous service model, a dynamic environment simulation module is constructed. In the environment simulation, the policy network of each agent is trained in a centralized manner using the local observation state, pairwise link switching action, and cooperative reward function, yielding the trained policy network.

[0032] Here, constructing the dynamic environment simulation module includes: (3.1.1) based on the time-varying undirected graph, using a smooth turn mobility model to generate continuous motion trajectories for the UAV nodes and updating the node positions in real time; (3.1.2) based on the updated positions, using the Rician fading channel model to compute the throughput, end-to-end delay, and bit error rate of each candidate link in real time; (3.1.3) based on the communication radius constraint defined in the dynamic network topology model, verifying the legality of the action selected by the agent, so that the distance of any activated link does not exceed the maximum communication radius; (3.1.4) based on the topological connectivity constraint defined in the dynamic network topology model, verifying the connectivity of each service subnet after the action, so that the graph formed by the active links of each subnet remains connected; (3.1.5) calculating the immediate reward from the global QoS optimization objective defined in the heterogeneous service model and feeding it back to each agent.

[0033] For example, consider a network of $N$ UAVs whose nodes maneuver continuously within the planning time domain $[0, T]$. The network at each time $t$ is modeled as a time-varying undirected graph $G(t) = (V, E(t))$, where $V$ is the node set and $E(t)$ is the set of all possible communication links at time $t$.

[0034] The position of node $i$ at time $t$ is denoted $p_i(t)$, and the Euclidean distance between the node pair $(i, j)$ is $d_{ij}(t) = \lVert p_i(t) - p_j(t) \rVert$.

[0035] The network is divided into three non-overlapping subnets according to the type of task carried: a high-throughput subnet, a low-latency subnet, and a high-reliability subnet, serving high-throughput, low-latency, and high-reliability tasks respectively. Their node sets are denoted $V_{tp}$, $V_{dl}$, and $V_{rl}$ and satisfy $V_{tp} \cup V_{dl} \cup V_{rl} = V$ and $V_a \cap V_b = \emptyset$ for $a \neq b$, where $V$ is the node set of the whole network.

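[0036] For service subnet $k$, the candidate link set at time $t$ is determined by physical communication reachability: $E_k^{cand}(t) = \{(i, j) \mid i, j \in V_k,\ i \neq j,\ d_{ij}(t) \le R_{max}\}$, where $V_k$ is the node set of subnet $k$, $d_{ij}(t)$ is the Euclidean distance between nodes $i$ and $j$ at time $t$, and $R_{max}$ is the maximum communication radius of a UAV node.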

[0037] Dynamic network planning must satisfy the following constraints. No illegal edges: the active link set $E_k^{act}(t)$ of each service subnet $k$ must contain no self-loops and no multiple edges.

[0038] Communication reachability: every active link $(i, j) \in E_k^{act}(t)$ must satisfy $d_{ij}(t) \le R_{max}$, so that the link distance stays within the communication radius.

[0039] Fixed link budget: the total number of active links in each service subnet $k$ is fixed at every time, i.e. $|E_k^{act}(t)| = B_k$, where $B_k$ is the link budget of subnet $k$.

[0040] Topology connectivity: the active topology $G_k(t) = (V_k, E_k^{act}(t))$ of each service subnet must remain connected.
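The candidate link set and the four constraints above can be checked mechanically. The following Python sketch is one possible way to do so, assuming 2-D node positions and links stored as node-ID pairs; the function names (`candidate_links`, `is_connected`, `satisfies_constraints`) and the square layout in the example are illustrative, not part of the specification.

```python
import math
from itertools import combinations

def candidate_links(positions, r_max):
    """Node pairs whose Euclidean distance does not exceed r_max."""
    return {
        (i, j)
        for i, j in combinations(sorted(positions), 2)
        if math.dist(positions[i], positions[j]) <= r_max
    }

def is_connected(nodes, links):
    """Depth-first search over the active links of one subnet."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for i, j in links:
        adjacency[i].add(j)
        adjacency[j].add(i)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes

def satisfies_constraints(nodes, active, positions, r_max, budget):
    """Check no illegal edges, reachability, fixed budget, and connectivity."""
    links = [tuple(sorted(e)) for e in active]
    no_illegal = all(i != j for i, j in links) and len(set(links)) == len(links)
    in_range = all(math.dist(positions[i], positions[j]) <= r_max for i, j in links)
    return (no_illegal and in_range and len(links) == budget
            and is_connected(nodes, links))

# Four UAVs on a unit square (assumed layout); the diagonals are out of range.
positions = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0), 4: (0.0, 1.0)}
feasible = candidate_links(positions, r_max=1.2)
ok = satisfies_constraints(positions, [(1, 2), (2, 3), (3, 4)], positions, 1.2, budget=3)
```

A link set that reuses the same edge twice, or that includes the out-of-range diagonal (2, 4), is rejected by the same call.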

[0041] The optimization objective is to maximize the overall global Quality of Service (QoS) evaluation score over the entire maneuver time domain $[0, T]$. First, the performance metrics of each subnet at time $t$ are calculated: the average effective throughput of the high-throughput subnet, the average latency of the low-latency subnet (a metric to be minimized), and the average reliability score of the high-reliability subnet. Accumulating these metrics over the time domain, standardizing them, and weighting them yields the overall time-domain optimization objective: $\max J = w_{tp}\,\tilde{Q}_{tp} + w_{dl}\,\tilde{Q}_{dl} + w_{rl}\,\tilde{Q}_{rl}$, where $\tilde{Q}_{tp}$, $\tilde{Q}_{dl}$, and $\tilde{Q}_{rl}$ are the standardized subnet performance values and $w_{tp}$, $w_{dl}$, and $w_{rl}$ are the task importance weights.
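As a rough illustration of how such a weighted score could be computed, the sketch below averages each metric over time, scales it to [0, 1], and inverts the latency term so that lower latency scores higher. The min-max scaling, the reference bounds, the weights, and the sample values are all assumptions; the specification does not state which standardization is used.

```python
def normalize(value, lo, hi, lower_is_better=False):
    """Min-max scale into [0, 1]; invert for metrics that should be minimized."""
    if hi <= lo:
        raise ValueError("upper bound must exceed lower bound")
    scaled = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return 1.0 - scaled if lower_is_better else scaled

def mean(values):
    return sum(values) / len(values)

def global_objective(throughput, latency, reliability, weights, bounds):
    """Weighted sum of time-averaged, normalized subnet metrics."""
    q_tp = normalize(mean(throughput), *bounds["tp"])
    q_dl = normalize(mean(latency), *bounds["dl"], lower_is_better=True)
    q_rl = normalize(mean(reliability), *bounds["rl"])
    return weights["tp"] * q_tp + weights["dl"] * q_dl + weights["rl"] * q_rl

# Hypothetical two-step traces: throughput in Mbit/s, latency in ms, reliability in [0, 1].
score = global_objective(
    throughput=[40.0, 60.0],
    latency=[20.0, 30.0],
    reliability=[0.9, 0.9],
    weights={"tp": 0.4, "dl": 0.3, "rl": 0.3},
    bounds={"tp": (0.0, 100.0), "dl": (0.0, 100.0), "rl": (0.0, 1.0)},
)
```

With these inputs the three normalized terms are 0.5, 0.75, and 0.9, giving a score of 0.4 × 0.5 + 0.3 × 0.75 + 0.3 × 0.9 = 0.695.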

[0042] This invention adopts the "centralized training, distributed execution" paradigm, assigning one agent to each subnet to form a multi-agent system, for which the following Markov decision process model is established: 1) State set: the local observation state of agent $k$ at time $t$ includes the following. Node features: normalized position, velocity, heading angle, and cluster-head flag.

[0043] Candidate link features: for each candidate link, the distance margin, the normalized link performance estimates (throughput, latency, and reliability score), and the activation flag.

[0044] Topological features: node degree.

[0045] Time features: trajectory progress.

[0046] 2) Action set: a "pairwise link switching" action format is adopted, $a = (e^{-}, e^{+})$, where $e^{-}$ is chosen from the currently active link set and $e^{+}$ is chosen from the currently inactive feasible edge set. A no-operation action is also introduced, which allows the policy to keep the topology unchanged.

[0047] 3) Reward set: the reward function guides the agents to cooperatively optimize the global objective while satisfying the constraints: $r(t) = Q(t) - \lambda\,\mathbb{1}_{viol}(t)$, where $Q(t)$ is the global QoS score at time $t$ (obtained by weighted summation of the standardized performance of each subnet), $\mathbb{1}_{viol}(t)$ is the constraint violation indicator function (equal to 1 if a constraint is violated), and $\lambda$ is the penalty coefficient.
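The action set and reward of this decision process can be sketched in a few lines of Python. The names `enumerate_actions` and `cooperative_reward`, the use of `None` for the no-operation, and the penalty value are illustrative assumptions; the link sets are those of the high-throughput example used elsewhere in this description.

```python
NO_OP = None  # assumed encoding of the no-operation action

def enumerate_actions(active, feasible_inactive):
    """All (link to disconnect, link to create) pairs plus the no-op."""
    pairs = [(drop, create)
             for drop in sorted(active)
             for create in sorted(feasible_inactive)]
    return [NO_OP] + pairs

def cooperative_reward(q_scores, weights, violated, penalty=1.0):
    """Shared reward: weighted global QoS score minus a violation penalty."""
    q_global = sum(weights[k] * q_scores[k] for k in weights)
    return q_global - penalty * (1.0 if violated else 0.0)

actions = enumerate_actions(
    active={"link12", "link23", "link34"},
    feasible_inactive={"link14", "link15", "link24"},
)

weights = {"tp": 0.4, "dl": 0.3, "rl": 0.3}    # hypothetical weights
q_scores = {"tp": 0.5, "dl": 0.75, "rl": 0.9}  # hypothetical normalized scores
reward_ok = cooperative_reward(q_scores, weights, violated=False)
reward_bad = cooperative_reward(q_scores, weights, violated=True)
```

With three active links and three feasible inactive links the agent chooses among 3 × 3 + 1 = 10 actions before masking. Every agent receives the same reward value, which is what makes the reward cooperative.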

[0048] Here, centrally training the policy networks in the environment simulation to obtain the trained policy networks includes: (3.2.1) adopting a multi-agent proximal policy optimization algorithm and constructing an Actor policy network for each agent and a centralized Critic value network; (3.2.2) in the environment simulation, each agent collects the current local observation state in its service subnet according to the defined local observation state, and inputs it into the corresponding Actor policy network to obtain the action probability distribution and select a pairwise training link switching action; (3.2.3) the environment simulation executes the pairwise training link switching action, verifies its legality against the communication radius constraint and the topological connectivity constraint defined in the dynamic network topology model, calculates the immediate reward from the global QoS score over the entire time domain defined in the heterogeneous service model, and returns the local observation state of each agent at the next time step; (3.2.4) the experience data generated by each interaction is stored as an experience tuple in the experience replay pool, where the experience tuple includes the current local observation state, the pairwise training link switching action, the immediate reward, and the local observation state at the next time step; (3.2.5) the above is repeated until the policy networks converge, yielding the trained policy networks.
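Step (3.2.4) can be pictured with a small container for experience tuples. The sketch below is an assumption about one convenient layout, not the claimed implementation: the class names `Experience` and `ExperiencePool`, the bounded capacity, and the `drain` method (which hands a batch to the policy update and empties the pool) are all hypothetical.

```python
from collections import deque
from typing import Any, NamedTuple

class Experience(NamedTuple):
    """One interaction: observation, action, immediate reward, next observation."""
    obs: Any
    action: Any
    reward: float
    next_obs: Any

class ExperiencePool:
    """Bounded pool; the oldest tuples are dropped once capacity is reached."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def store(self, experience):
        self._items.append(experience)

    def drain(self):
        """Return all stored tuples for a policy update and empty the pool."""
        batch = list(self._items)
        self._items.clear()
        return batch

    def __len__(self):
        return len(self._items)

pool = ExperiencePool(capacity=2)
for step in range(3):
    pool.store(Experience(obs=[step], action=None, reward=0.1 * step, next_obs=[step + 1]))
size_before = len(pool)
batch = pool.drain()
```

Draining the pool after each update matches the on-policy nature of proximal policy optimization, where data collected under an older policy is normally not reused.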

[0049] For example, a multi-agent proximal policy optimization (MAPPO) algorithm is used for training. Network structure: three Actor policy networks are constructed (one for each of the three subnet agents), together with one centralized Critic value network.

[0050] Decision-making process: each Actor network receives its own local observation, extracts features using a multilayer perceptron, and outputs an action probability distribution. The agent samples an action from this distribution, and action masking ensures that the sampled action is legal.

[0051] Training update: during the training phase, the Critic network receives the global state and estimates the state value function. The advantage function is computed using generalized advantage estimation (GAE) and is then standardized within each subnet group. The Actor networks are updated using the clipped objective function of PPO, and the Critic network is updated by minimizing the mean squared error of the value function, thereby achieving stable and efficient cooperative policy learning.
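The three quantities named in the training update (generalized advantage estimation, per-group standardization, and the PPO clipped objective) have standard textbook forms, sketched below in plain Python. The hyperparameter defaults (`gamma`, `lam`, `clip_eps`) are common choices and are assumptions here; the specification does not give values.

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    `values` holds one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def standardize(xs, eps=1e-8):
    """Zero-mean, unit-variance scaling, applied separately per subnet group."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / math.sqrt(var + eps) for x in xs]

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

adv = gae(rewards=[1.0, 1.0], values=[0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_std = standardize(adv)
```

With `gamma = lam = 1` and zero value estimates, the advantage reduces to the undiscounted return-to-go, which makes the recursion easy to check by hand.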

[0052] In the environment simulation, the agents execute actions, while the environment model is responsible for the following. Node maneuvering: the UAV positions are updated using a smooth turn mobility model, generating continuous trajectories that conform to physical constraints.

[0053] Link quality calculation: Based on the updated node locations and the Rician fading channel model, the throughput, latency, and bit error rate of each link are calculated in real time.

[0054] State and reward feedback: the next state and the immediate reward are calculated from the new topology and link quality.

[0055] Experience collection and training: experience tuples are stored and passed on for subsequent MAPPO policy updates.
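The node maneuvering and link quality steps of the environment model can be approximated as follows. This is a simplified sketch under several assumptions that are not stated in the specification: a constant turn rate per step stands in for the smooth turn model, the Rician channel is normalized to unit mean power, path loss follows a single exponent, and throughput is taken as the Shannon capacity. All parameter names and defaults are hypothetical.

```python
import math
import random

def smooth_turn_step(x, y, heading, speed, turn_rate, dt):
    """Advance one step with constant speed and constant turn rate."""
    heading = (heading + turn_rate * dt) % (2.0 * math.pi)
    return (x + speed * math.cos(heading) * dt,
            y + speed * math.sin(heading) * dt,
            heading)

def rician_power_gain(k_factor, rng):
    """Sample |h|^2 for a Rician channel normalized to unit mean power."""
    los = math.sqrt(k_factor / (k_factor + 1.0))           # line-of-sight part
    sigma = math.sqrt(1.0 / (2.0 * (k_factor + 1.0)))      # scattered part
    re = los + sigma * rng.gauss(0.0, 1.0)
    im = sigma * rng.gauss(0.0, 1.0)
    return re * re + im * im

def link_throughput(distance, k_factor, rng,
                    bandwidth_hz=1e6, snr_ref=1e4, path_loss_exp=2.0):
    """Shannon-capacity estimate in bit/s (assumed throughput model)."""
    gain = rician_power_gain(k_factor, rng)
    snr = snr_ref * gain / max(distance, 1.0) ** path_loss_exp
    return bandwidth_hz * math.log2(1.0 + snr)

rng = random.Random(0)
straight = smooth_turn_step(0.0, 0.0, heading=0.0, speed=10.0, turn_rate=0.0, dt=1.0)
mean_gain = sum(rician_power_gain(5.0, rng) for _ in range(20000)) / 20000
rate = link_throughput(distance=50.0, k_factor=5.0, rng=rng)
```

A zero turn rate produces straight-line flight, and averaging many fading samples recovers the unit mean power of the normalized channel, which gives two quick sanity checks on the sketch.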

[0056] After training, the algorithm can output a dynamic planning solution over the entire time domain, including: the topology evolution sequence; performance curves, namely the throughput-time, latency-time, and reliability score-time curves of each subnet; and a deployable policy model, namely the trained Actor network parameters, which can be used for online distributed decision-making.

[0057] S103, each agent controls the establishment and disconnection of communication links between UAVs in its service subnet according to the optimal pairwise link switching action, and updates the network topology of that service subnet. For example, taking the high-throughput service subnet, assume that the current active link set of this subnet is {link12, link23, link34} and the feasible inactive link set is {link14, link15, link24}. The optimal pairwise link switching action obtained by the agent through policy network inference is (link23, link14), that is, disconnecting link23 and creating link14. The agent sends a disconnect command to UAV 2 and UAV 3 to release the communication resources between them, and sends a create command to UAV 1 and UAV 4 to establish a new communication connection. After this action is executed, the active link set of the subnet is updated to {link12, link14, link34}, the total number of active links remains unchanged at 3, and the updated topology remains connected. In this way, each agent performs one link switch for its subnet at each decision time, thereby realizing the dynamic evolution of the network topology over the entire task time domain.
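The update in this example can be traced with a short Python sketch. Links are written as node pairs, the function name `execute_swap` and the command tuple layout are illustrative assumptions, and connectivity is assumed to have been guaranteed beforehand by the action mask.

```python
def execute_swap(active_links, action):
    """Apply one pairwise link switching action; links are (node, node) tuples.

    Returns the new active link set and the commands sent to the UAVs.
    """
    if action is None:                      # no-operation keeps the topology
        return set(active_links), []
    drop, create = action
    if drop not in active_links or create in active_links:
        raise ValueError("action does not match the current active link set")
    commands = [("disconnect", uav, drop) for uav in drop]
    commands += [("create", uav, create) for uav in create]
    return (set(active_links) - {drop}) | {create}, commands

active = {(1, 2), (2, 3), (3, 4)}           # link12, link23, link34
new_active, commands = execute_swap(active, ((2, 3), (1, 4)))
```

Because every swap removes exactly one link and adds exactly one, the fixed link budget holds by construction and needs no separate check.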

[0058] S104, repeat S102 to S103 until the task ends, and obtain the network planning results in the task time domain; wherein, the network planning results include: the topology evolution sequence of the service subnets after the network topology is updated at each decision time.

[0059] For example, setting the task time domain T = 100 seconds and the decision interval Δt = 1 second, then a total of 100 S102-S103 loops are executed. At each decision time t (t = 1, 2, ..., 100), the agents of the three subnets output pairwise link switching actions to update the topology of their respective subnets. The active link sets of the high-throughput subnet, low-latency subnet, and high-reliability subnet at each time t are recorded and denoted as G_tp(t), G_dl(t), and G_rl(t). After the task is completed, these topology states arranged in chronological order are summarized to obtain the topology evolution sequence. This sequence fully describes the dynamic changes in the topology of the three subnets throughout the entire task, and can be used for subsequent analysis, visualization, or as a final network planning solution delivery.
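The S102 to S103 loop and the recording of the topology evolution sequence can be outlined as below. The function `run_mission`, the policy signature `(t, links) -> links`, and the subnet keys `tp`, `dl`, `rl` are assumptions for illustration; the placeholder policy `keep_topology` always chooses the no-operation.

```python
def run_mission(policies, initial_links, horizon_s=100, interval_s=1):
    """Run the decision loop and record the active link sets at every step."""
    steps = round(horizon_s / interval_s)
    current = {name: frozenset(links) for name, links in initial_links.items()}
    sequence = []
    for t in range(1, steps + 1):
        # Each subnet agent decides independently from its own link set.
        current = {name: frozenset(policies[name](t, current[name]))
                   for name in current}
        sequence.append((t, current))
    return sequence

def keep_topology(t, links):
    """Placeholder policy that always selects the no-operation."""
    return links

initial = {
    "tp": {(1, 2), (2, 3), (3, 4)},   # high-throughput subnet
    "dl": {(5, 6)},                   # low-latency subnet
    "rl": {(7, 8)},                   # high-reliability subnet
}
policies = {name: keep_topology for name in initial}
evolution = run_mission(policies, initial)
```

With T = 100 s and a 1 s decision interval the sequence holds 100 snapshots, one per decision time, each containing the three active link sets.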

[0060] In a simulation experiment provided by this invention, to verify the method, a simulation verification platform containing the following modules is built in the Mininet+Ryu environment: (1) Scenario and Data Generation Layer: Used to define mission requirements scenarios and generate or import input data such as UAV node trajectories, communication equipment parameters, and flight missions (business flows). It supports the automatic generation of node trajectories that conform to flexible maneuvering modes using a smooth turn model.

[0061] (2) Multi-agent decision-making and training layer: This is the core algorithm layer, which deploys the MAPPO framework and includes components such as the Actor / Critic network, experience replay pool, and optimizer. It is responsible for interacting with the environment and training and evaluating policies.

[0062] (3) Dynamic Network Environment Simulation Layer: Simulates the real physical world. This layer receives link adjustment actions from the agents, updates node states based on the mobility model, calculates link-level performance indicators (throughput, latency, bit error rate) in real time based on the channel model (such as Rician fading) and the path loss model, verifies the connectivity constraints, and finally calculates the reward and returns the next state.

[0063] (4) Results Visualization and Analysis Layer: This layer visualizes the training process and the final planning results. It includes training curves (such as cumulative rewards and number of constraint violations), dynamic topology evolution animations, QoS performance comparison charts for each subnet, and reports on key indicators such as connectivity.

[0064] This platform integrates the core algorithm, high-fidelity environment simulation, and visualization analysis, enabling comprehensive and intuitive verification of the effectiveness, stability, and superiority of the system for planning aerial wireless mission networks in dynamic scenarios, thus providing a solid foundation for final engineering deployment.

[0065] Figure 3 is a schematic diagram of the state processing and decision-making module of the multi-agent policy network (Actor) in this invention.

[0066] First, the local observation state of the current subnet is input to the feature extraction module. This observation contains a node feature matrix (each row holds the normalized position, velocity, heading, and cluster-head flag of one node) and a candidate link feature matrix (each row holds the link distance margin, the normalized performance estimates, and the activation flag of one candidate link). A multilayer perceptron with shared parameters encodes the node and link features and reduces their dimensionality, producing a unified feature embedding.

[0067] The processed node and link features are aggregated and combined with the topological features (node degree) and the time features to form a comprehensive state representation of the agent. This representation is then fed into the core of the policy network, a deep neural network composed of fully connected layers. Through forward propagation, the network maps states to actions and outputs an action probability distribution. This distribution covers all feasible pairwise link switching actions (including the no-op) under the action mask constraint.

[0068] The agent samples from this probability distribution and selects the action to be executed at the current time. This completes one topology optimization decision for the subnet.
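The forward pass described above can be imitated with a very small fully connected network. The sketch below is a toy stand-in, assuming a single hidden layer, random untrained weights, and an already flattened observation vector; the layer sizes, the parameter dictionary layout, and the function names are hypothetical. Masking is applied to the logits before the softmax, which gives the same result as zeroing and renormalizing the probabilities.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    if not legal:
        raise ValueError("no legal action")
    top = max(legal)                        # subtract the max for stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(obs_dim, hidden_dim, n_actions, rng):
    """Random small weights; a trained Actor would load learned values instead."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": matrix(hidden_dim, obs_dim), "b1": [0.0] * hidden_dim,
            "w2": matrix(n_actions, hidden_dim), "b2": [0.0] * n_actions}

def actor_forward(obs, params, mask):
    hidden = relu(linear(obs, params["w1"], params["b1"]))
    logits = linear(hidden, params["w2"], params["b2"])
    return masked_softmax(logits, mask)

params = init_params(obs_dim=4, hidden_dim=8, n_actions=4, rng=random.Random(0))
probs = actor_forward([0.2, 0.5, 0.1, 1.0], params, mask=[True, True, False, True])
```

Sampling from `probs` then yields the action for the current decision time, and the masked third action can never be drawn.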

[0069] Secondly, this invention provides a dynamic aerial wireless mission network planning system based on multi-agent reinforcement learning, used to execute the method described above, comprising: multiple agents respectively configured in a high-throughput service subnet, a low-latency service subnet, and a high-reliability service subnet, each agent comprising: a model storage unit for storing a pre-loaded, pre-trained policy network; a local observation acquisition unit connected to the model storage unit, configured to acquire the local observation state within the corresponding service subnet at each decision time; a policy inference unit connected to both the model storage unit and the local observation acquisition unit, configured to input the local observation state into the policy network to obtain the optimal pairwise link switching action; and a topology adjustment execution unit connected to the policy inference unit, configured to control the connection and disconnection of communication links between UAVs within the corresponding service subnet according to the optimal pairwise link switching action.

[0070] The multi-agent reinforcement learning-based dynamic aerial wireless mission network planning system also includes an offline training subsystem deployed at a ground control center. This offline training subsystem comprises: a dynamic network and service modeling module, configured to construct the dynamic network topology model and the heterogeneous service model; a multi-agent decision-making process construction module, connected to the dynamic network and service modeling module and configured to assign an agent to each of the high-throughput, low-latency, and high-reliability service subnets based on the dynamic network topology model and the heterogeneous service model, and to define the local observation state, pairwise link switching action, and cooperative reward function of each agent; a high-fidelity dynamic environment simulation module, connected to the multi-agent decision-making process construction module and configured to simulate the maneuvering trajectories of the UAV nodes and channel fading, and to update link performance parameters in real time; a centralized training engine module, connected to both the multi-agent decision-making process construction module and the high-fidelity dynamic environment simulation module, and configured to construct the dynamic environment simulation based on the dynamic network topology model and the heterogeneous service model and, in that environment simulation, to centrally train the policy network of each agent using the local observation states, pairwise link switching actions, and cooperative reward function, yielding the trained policy networks; and a model loading module, connected to the centralized training engine module and configured to load the trained policy networks into the model storage unit of each agent.

[0071] For example, the system architecture shown in Figure 2 is illustrated using a small-scale scenario with three service subnets of two UAVs each. Assume the high-throughput service subnet includes UAVs A and B, the low-latency service subnet includes UAVs C and D, and the high-reliability service subnet includes UAVs E and F. The ground control center deploys the offline training subsystem, which includes a dynamic network and service modeling module, a multi-agent decision-making process construction module, a high-fidelity dynamic environment simulation module, a centralized training engine module, and a model loading module.

[0072] First, the dynamic network and service modeling module constructs the dynamic network topology model and the heterogeneous service model, modeling the six UAVs as a time-varying graph and dividing them into three service subnets. Based on these models, the multi-agent decision-making process construction module assigns an agent to each subnet and defines the local observation state, pairwise link switching action, and cooperative reward function of each agent. The high-fidelity dynamic environment simulation module simulates the UAV trajectories and channel changes according to the dynamic network topology model, providing an interactive environment for training. The centralized training engine module uses the multi-agent proximal policy optimization algorithm to centrally train the policy networks of the three agents in the environment simulation, obtaining the trained policy networks. The model loading module loads the trained policy networks onto the corresponding UAVs in the subnets via data link, with the agent for the high-throughput subnet deployed on UAV A, the agent for the low-latency subnet deployed on UAV C, and the agent for the high-reliability subnet deployed on UAV E.

[0073] After the task begins, each agent performs online planning at each decision time. Taking the high-throughput subnet as an example, the model storage unit on UAV A holds the pre-loaded trained policy network. The local observation acquisition unit collects the current position, velocity, and link quality information of UAVs A and B within its subnet, forming the local observation state. The policy inference unit inputs this local observation state into the policy network, and the policy network outputs the optimal pairwise link switching action. Assuming that at the current time the only active link of the subnet is AB and the feasible inactive link set is empty, the agent outputs a no-operation, keeping the topology unchanged. At the next time, after the UAV positions have changed, the active link is still AB and the feasible inactive link set is still empty, so the agent again outputs a no-operation. When a new feasible link appears at some time, the agent outputs the corresponding switching action. The topology adjustment execution unit controls the connection and disconnection of communication links according to these actions, updating the subnet topology. This continues until the task ends, at which point the network planning result for the entire task time domain is obtained.

[0074] The various embodiments described in this specification are presented in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on its differences from other embodiments. All or part of this invention can be used in numerous general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices, etc.

[0075] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the present invention.

Claims

1. A dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning, characterized in that it comprises: S101, configuring a corresponding agent for each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, wherein each agent is preloaded with a trained policy network; S102, at each decision time, each agent collects the local observation state in the corresponding service subnet and inputs the local observation state into the corresponding policy network to obtain the optimal pairwise link switching action; S103, each agent controls the connection and disconnection of communication links between UAVs in the corresponding service subnet according to the optimal pairwise link switching action, and updates the network topology of the corresponding service subnet; S104, repeating S102 to S103 until the task ends to obtain the network planning result in the task time domain, wherein the network planning result includes the topology evolution sequence of the service subnets after the network topology is updated at each decision time.

2. The dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the step of inputting the local observation state into the corresponding policy network to obtain the optimal pairwise link switching action includes: the policy network outputs an action probability distribution based on the local observation state; the probability of each action that does not meet a first preset constraint is set to zero through the action masking mechanism, wherein the first preset constraint includes the communication radius constraint, the no-multiple-edge constraint, and the topological connectivity constraint; and the optimal pairwise link switching action is obtained by sampling from the remaining actions, wherein the optimal pairwise link switching action is: selecting a link to be disconnected from the set of active links of the corresponding service subnet at the current time, and selecting a link to be newly established from the set of feasible inactive links of the corresponding service subnet at the current time.

3. The dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the policy network is obtained through an offline training step comprising: constructing a dynamic network topology model and a heterogeneous service model; based on the dynamic network topology model and the heterogeneous service model, assigning an agent to each of the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, and defining the local observation state, pairwise link switching action, and cooperative reward function of each agent; and, based on the dynamic network topology model and the heterogeneous service model, constructing a dynamic environment simulation module, and in the environment simulation training the policy network of each agent in a centralized manner using the local observation state, pairwise link switching action, and cooperative reward function to obtain the trained policy network.

4. The dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning according to claim 3, characterized in that the construction of the dynamic network topology model and the heterogeneous service model includes: modeling the set of UAV nodes and the time-varying set of candidate links as a time-varying undirected graph, constructing the dynamic network topology model from the time-varying undirected graph, and defining the communication radius constraint, the no-illegal-edge constraint, and the topological connectivity constraint in the dynamic network topology model; and dividing the node set of the time-varying undirected graph, according to the task requirements, into a non-overlapping high-throughput service subnet, low-latency service subnet, and high-reliability service subnet, constructing the heterogeneous service model from the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet, and defining the fixed link budget constraint and the optimization objective in the heterogeneous service model.

5. The multi-agent reinforcement learning-based dynamic aerial wireless mission network planning method according to claim 4, characterized in that the optimization objective is expressed as: $\max J = w_{tp}\,\tilde{Q}_{tp} + w_{dl}\,\tilde{Q}_{dl} + w_{rl}\,\tilde{Q}_{rl}$, where $w_{tp}$ denotes the task importance weight of the high-throughput service subnet and $\tilde{Q}_{tp}$ its standardized performance value; $w_{dl}$ denotes the task importance weight of the low-latency service subnet and $\tilde{Q}_{dl}$ its standardized performance value; and $w_{rl}$ denotes the task importance weight of the high-reliability service subnet and $\tilde{Q}_{rl}$ its standardized performance value.

6. The dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning according to claim 4, characterized in that constructing the dynamic environment simulation module includes: based on the time-varying undirected graph, using a smooth turn mobility model to generate continuous motion trajectories for the UAV nodes and updating the node positions in real time; based on the updated positions, using the Rician fading channel model to compute the throughput, end-to-end delay, and bit error rate of each candidate link in real time; based on the communication radius constraint defined in the dynamic network topology model, verifying the legality of the action selected by the agent, so that the distance of any activated link does not exceed the maximum communication radius; based on the topological connectivity constraint defined in the dynamic network topology model, verifying the connectivity of each service subnet after the action, so that the graph formed by the active links of each subnet remains connected; and, based on the global QoS optimization objective defined in the heterogeneous service model, calculating the immediate reward and feeding it back to each agent.

7. The dynamic aerial wireless mission network planning method based on multi-agent reinforcement learning according to claim 3, characterized in that the step of training the policy network in the environment simulation to obtain the trained policy network includes: adopting a multi-agent proximal policy optimization algorithm and constructing an Actor policy network for each agent and a centralized Critic value network; in the environment simulation, each agent collects the current local observation state within the corresponding service subnet according to the defined local observation state, and inputs the current local observation state into the corresponding Actor policy network to obtain the action probability distribution and select a pairwise training link switching action; the environment simulation executes the pairwise training link switching action, verifies the legality of the action based on the communication radius constraint and the topological connectivity constraint defined in the dynamic network topology model, calculates the immediate reward based on the global QoS score optimization objective over the entire time domain defined in the heterogeneous service model, and returns the local observation state of each agent at the next time step; the experience data generated by each interaction is stored as an experience tuple in the experience replay pool, wherein the experience tuple includes the current local observation state, the pairwise training link switching action, the immediate reward, and the local observation state at the next time step; and the above is repeated until the policy network converges, yielding the trained policy network.

8. A dynamic aerial wireless mission network planning system based on multi-agent reinforcement learning, used to execute the method according to any one of claims 1 to 7, characterized in that it comprises: multiple agents respectively configured in a high-throughput service subnet, a low-latency service subnet, and a high-reliability service subnet, each agent including: a model storage unit, used to store a pre-loaded, pre-trained policy network; a local observation acquisition unit, connected to the model storage unit and configured to acquire the local observation state within the corresponding service subnet at each decision time; a policy inference unit, connected to the model storage unit and the local observation acquisition unit respectively, and configured to input the local observation state into the policy network to obtain the optimal pairwise link switching action; and a topology adjustment execution unit, connected to the policy inference unit and configured to control the connection and disconnection of communication links between UAVs in the corresponding service subnet according to the optimal pairwise link switching action.

9. The multi-agent reinforcement learning-based dynamic aerial wireless mission network planning system according to claim 8, characterized in that, It also includes an offline training subsystem deployed at a ground control center, the offline training subsystem comprising: The dynamic network and service modeling module is configured to build dynamic network topology models and heterogeneous service models. The multi-agent decision-making process construction module is connected to the dynamic network and service modeling module. It is configured to assign an agent to the high-throughput service subnet, the low-latency service subnet, and the high-reliability service subnet based on the dynamic network topology model and the heterogeneous service model, and define the local observation state, pairwise link switching action, and cooperative reward function of each agent. The high-fidelity dynamic environment simulation module is connected to the multi-agent decision-making process construction module and is configured to simulate the maneuver trajectory of UAV nodes and channel fading, and update link performance parameters in real time. The centralized training engine module is connected to the multi-agent decision-making process construction module and the high-fidelity dynamic environment simulation module, respectively. It is configured to construct a dynamic environment simulation module based on the dynamic network topology model and the heterogeneous service model. In the environment simulation, the policy network corresponding to each agent is centrally trained based on the local observation state, paired link exchange actions and cooperative reward function to obtain the trained policy network. The model loading module, connected to the centralized training engine module, is configured to load the trained policy network into the model storage unit of each agent.