A routing method for distributed unmanned aerial vehicle ad hoc networks based on deep reinforcement learning
By building a distributed deep reinforcement learning architecture on each drone node and optimizing drone routing using the Dijkstra algorithm and Markov decision process, the performance degradation of centralized routing schemes in adversarial scenarios is solved, achieving more stable and efficient drone network communication.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-02-13
- Publication Date
- 2026-06-26
AI Technical Summary
Existing drone routing schemes based on centralized deep reinforcement learning are susceptible to the influence of the control center in real-world drone combat scenarios, leading to a decline in network routing performance and reducing the stability of the drone communication network.
A distributed UAV ad hoc network routing method is adopted, which utilizes a deep reinforcement learning architecture to build a deep Q-network on each UAV node, pre-trains it using the Dijkstra algorithm, and combines Markov decision process and experience replay memory unit to optimize routing decisions.
It improves the robustness of drone networks, reduces packet transmission time and hop count, lowers packet loss rate, balances energy consumption and network load, extends network lifespan, and enhances routing performance.
Description
Technical Field
[0001] This invention belongs to the field of UAV ad hoc networks, and specifically relates to a routing method for distributed UAV ad hoc networks based on deep reinforcement learning.
Background Technology
[0002] With significant advancements in modern technology, such as artificial intelligence, sensors, batteries, radio, and the Global Positioning System (GPS), unmanned aerial vehicles (UAVs) have found numerous diverse applications. Due to their small size, high speed, and flexibility, UAVs are widely used in fields such as military reconnaissance and public administration.
[0003] Timely and reliable communication between drones relies on intelligent and efficient routing protocols. However, most existing reinforcement learning-based drone routing solutions use centralized deep reinforcement learning with relatively traditional reinforcement learning algorithms. They apply reinforcement learning only to the selection of the next-hop neighbor, without considering transmission performance across the drone network as a whole. A central drone learns the state of the entire network environment and then sends the corresponding actions to each node in the network. In actual drone combat scenarios, once the control center that makes the routing decisions for the entire network is affected, the routing performance of the whole network degrades severely and the stability of the drone communication network is reduced.
Summary of the Invention
[0004] To address the problems existing in the background technology, this invention provides a routing method for distributed UAV ad hoc networks based on deep reinforcement learning, comprising:
[0005] S1: Use drones as nodes and create a drone communication network by relying on the communication links between drones;
[0006] Preferably, the drone nodes in the drone communication network adopt a random movement model, and the drone nodes can dynamically join or leave the network. Each drone node can act as a relay node, a source node, or a destination node. The drone nodes periodically send Hello messages to neighboring nodes to update the drone communication network in real time. If no feedback information for the Hello message is received from the neighboring node within a specified time, the communication link is considered to be disconnected.
[0007] For a given drone node, Hello messages are periodically sent to neighboring nodes to determine whether a response is received from the neighboring nodes within a specified time. If a neighboring node returns a response, the node is retained in the neighbor table; otherwise, the node is deleted.
[0008] S2: A deep reinforcement learning architecture for the UAV communication network is constructed using a Markov decision process; the deep reinforcement learning architecture includes an input layer, a dual deep Q-network, an output layer, a demonstration data buffer, and an experience replay memory unit;
[0009] Preferably, the deep reinforcement learning architecture employs a multi-layer fully connected neural network that is trained by backpropagation and optimized using the Adam algorithm. The input layer takes the current node state and neighboring node states as input. After extensive training, the features of the input node states are extracted, and the Q-value corresponding to the current input state is output. The dual deep Q-network consists of an eval-Q network and a tar-Q network. The eval-Q network is responsible for exploring the latest routing environment, while the tar-Q network is responsible for storing the experience learned from the current environment. The demonstration data buffer mainly stores the original training data, and the experience replay memory unit mainly stores the target training data.
[0010] S3: Randomly generate the source and destination nodes of the original data packets, run the Dijkstra algorithm to send the original data packets from the source node to the destination node; and generate original training data based on the routing process of the original data packets to pre-train the deep reinforcement learning architecture.
[0011] Preferably, the pre-training of the deep reinforcement learning architecture includes:
[0012] Before pre-training the deep reinforcement learning architecture, the parameters of the reinforcement learning algorithm are defined: state $s_t$, action $a_t$, and reward signal $r_t$;
[0013] State $s_t$ is $\{D_t, N_t, B_t, A_t\}$, where $D_t$ represents the destination node of the data packet forwarded by the current node, $N_t$ represents the set of neighboring nodes of the current node, $B_t$ represents the set of data packets queued at the current node and its neighboring nodes, and $A_t$ represents the set of actions performed by the current node and its neighboring nodes in the previous three iterations;
[0014] Action $a_t$: the action chosen by the drone node at time $t$, namely the selection of a node $i$ as the next-hop node for forwarding the data packet; the selectable actions are the set of all neighboring nodes of the current node;
[0015] Reward signal $r_t$: $\gamma$ is the discount factor; $f_{tran}$ represents the forwarding cost, which is half the total number of nodes in the network; $R_i$ represents the number of duplicate loops occurring in the path of data packet $i$; $H_i$ represents the number of hops from packet $i$ to the destination node; $O_i^{(n)}$ represents the waiting time of packet $i$ in the queue of the corresponding drone node; $w_1$, $w_2$, and $w_3$ represent the weight parameters; and $n$ represents the number of hops in the routing process.
[0016] After defining the parameters, the value function of the deep reinforcement learning architecture is pre-trained. First, the original training data $(s_t, a_t, r_t, s_{t+1})$ is generated from the routing process of the original data packets and stored in the demonstration data buffer. Original training data is then randomly sampled from the demonstration data buffer, and $s_t$ is input to the tar-Q network to calculate the value Q. The output layer uses an ε-greedy mechanism to predict the behavior of the current node, and the parameters of the eval-Q network are updated using gradient descent. Every C steps, the parameters of the tar-Q network are updated by setting $\theta^- = \theta$, until the loss function is less than a set threshold, where $\theta^-$ represents the parameters of the tar-Q network and $\theta$ represents the parameters of the eval-Q network;
[0017] S4: Input the coordinates of the destination node D of the target data packet, obtain the link state of the current node A and the link states of the neighboring nodes of the current node A into the pre-trained deep reinforcement learning architecture, and obtain the next hop node B of the current node A; generate target training data according to the routing process of the target data packet, and retrain the deep reinforcement learning architecture according to the original training data and the target training data.
[0018] Preferably, the retraining of the deep reinforcement learning architecture includes:
[0019] S41: Generate state $s_t$ based on the coordinates of the destination node D of the target data packet, the link state of the current node A, and the link states of the neighboring nodes of the current node A;
[0020] S42: Input state $s_t$ into the pre-trained deep reinforcement learning architecture to obtain the next-hop node of the current node and the target training data $(s_t, a_t, r_t, s_{t+1})$, and store the target training data $(s_t, a_t, r_t, s_{t+1})$ in the experience replay memory unit;
[0021] S43: Randomly sample data from the experience replay memory unit and the demonstration data buffer and input it into the eval-Q network to calculate the value Q. The output layer uses an ε-greedy mechanism to predict the behavior of the current node, and the parameters of the eval-Q network are updated using gradient descent. Every C steps, the parameters of the tar-Q network are updated by setting $\theta^- = \theta$;
[0022] The ratio of data randomly sampled from the experience replay memory unit to data randomly sampled from the demonstration data buffer is 1:η, where η is a simulation parameter that is manually set before the simulation begins.
[0023] Preferably, the loss function includes:
[0024] $L_{DQN}(\theta) = E[(y - Q(s, a; \theta))^2]$
[0025]
[0026] Where $Q(s, a; \theta)$ represents the cumulative reward value that the dual deep Q-network outputs for choosing behavior $a_t$ in environment state $s_t$ when $s_t$ is given as input, and $y$ is the target value calculated by the target neural network.
[0027] Preferably, the gradient descent method is used to update the parameters θ of the eval-Q network with a learning rate α:
[0028]
[0029]
[0030] S5: Take the next hop node B as the starting node and repeat steps S4-S5 until the next hop node is the destination node, thus completing the routing of the target data packet;
[0031] Preferably, during the routing of the target data packet, the next-hop node B of the current node A is calculated using a deep reinforcement learning architecture, and the next-hop node of node B is calculated. It is then determined whether the next-hop node of the current node B is node A. If it is, a loop is generated. For loops, the communication link of path B-A is temporarily set to disconnected, and a suboptimal next-hop node is selected until a node that will not cause a loop is selected as the next-hop node.
[0032] The present invention has at least the following beneficial effects
[0033] This invention uses the Dijkstra algorithm to generate raw data for pre-training of a deep reinforcement learning architecture. Pre-learning yields better initial performance, accelerates algorithm convergence, and reduces training costs. By building a deep reinforcement learning architecture for each UAV node in an ad hoc UAV network and training it with routing data from each UAV, this invention is more suitable for large-scale UAV networking. It can significantly enhance network robustness, reduce packet transmission time, hop count, and packet loss rate, while effectively balancing energy consumption and network load to extend network lifetime and improve the overall routing performance of the network.
Attached Figure Description
[0034] Figure 1 is a flowchart of the method of this invention;
[0035] Figure 2 is a flowchart of the training process of the deep reinforcement learning architecture;
[0036] Figure 3 is a schematic diagram showing the simulation parameter settings;
[0037] Figure 4 is a simulation diagram of algorithm convergence;
[0038] Figure 5 is a simulation diagram of the maximum queue length;
[0039] Figure 6 is a simulation diagram comparing end-to-end delays;
[0040] Figure 7 is a simulation diagram comparing control overhead.
Detailed Implementation
[0041] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0042] Referring to Figure 1, this invention provides an efficient routing method for distributed unmanned aerial vehicle (UAV) ad hoc networks based on deep reinforcement learning, comprising:
[0043] S1: Use drones as nodes and create a drone communication network by relying on the communication links between drones;
[0044] Preferably, the drone nodes in the drone communication network adopt a random movement model, and the drone nodes can dynamically join or leave the network. Each drone node can act as a relay node, a source node, or a destination node. The drone nodes periodically send Hello messages to neighboring nodes to update the drone communication network in real time. If no feedback information for the Hello message is received from the neighboring node within a specified time, the communication link is considered to be disconnected.
[0045] For a given drone node, Hello messages are periodically sent to neighboring nodes to determine whether a response is received from the neighboring nodes within a specified time. If a neighboring node returns a response, the node is retained in the neighbor table; otherwise, the node is deleted.
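The Hello-based neighbor maintenance described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `NeighborTable`, the `timeout` value, and the node identifiers are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NeighborTable:
    """Tracks neighbors by the time their last Hello response arrived."""
    timeout: float  # hypothetical "specified time", in seconds
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_hello_response(self, neighbor_id: str, now: float) -> None:
        # A response keeps (or adds) the neighbor in the table.
        self.last_seen[neighbor_id] = now

    def prune(self, now: float) -> List[str]:
        # Delete neighbors whose response is overdue; their links count as disconnected.
        expired = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for n in expired:
            del self.last_seen[n]
        return expired


table = NeighborTable(timeout=2.0)
table.on_hello_response("uav_1", now=0.0)
table.on_hello_response("uav_2", now=0.0)
table.on_hello_response("uav_1", now=1.5)  # uav_1 answers the next Hello, uav_2 does not
removed = table.prune(now=3.0)
print(removed, sorted(table.last_seen))
```

In this run only `uav_2` exceeds the timeout, so it is deleted while `uav_1` is retained.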
[0046] S2: A deep reinforcement learning architecture for the UAV communication network is constructed using a Markov decision process; the deep reinforcement learning architecture includes an input layer, a dual deep Q-network, an output layer, a demonstration data buffer, and an experience replay memory unit;
[0047] Preferably, the deep reinforcement learning architecture employs a multi-layer fully connected neural network that is trained by backpropagation and optimized using the Adam algorithm. The input layer takes the current node state and neighboring node states as input. After extensive training, the features of the input node states are extracted, and the Q-value corresponding to the current input state is output. The dual deep Q-network consists of an eval-Q network and a tar-Q network. The eval-Q network is responsible for exploring the latest routing environment, while the tar-Q network is responsible for storing the experience learned from the current environment. The demonstration data buffer mainly stores the original training data, and the experience replay memory unit mainly stores the target training data.
[0048] Each drone node has a separate deep reinforcement learning architecture, meaning that during subsequent training, the state of each node and the states of its neighbors are used to train the current node's deep reinforcement learning architecture.
[0049] S3: Randomly generate the source and destination nodes of the original data packets, run the Dijkstra algorithm to send the original data packets from the source node to the destination node; and generate original training data based on the routing process of the original data packets to pre-train the deep reinforcement learning architecture.
[0050] Dijkstra's algorithm is a shortest path algorithm that finds the shortest path from one vertex to all other vertices. It solves the shortest path problem. The main feature of Dijkstra's algorithm is that it starts from the starting point and uses a greedy algorithm strategy, traversing to the nearest unvisited vertex adjacent to the starting point each time, until it extends to the destination.
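As an illustration of how Dijkstra's algorithm could supply demonstration routes for pre-training, the sketch below computes a lowest-cost path and converts it into (current node, next hop) pairs. The graph, the link costs, and the helper names are hypothetical; the full training tuple also needs the state and reward defined in the following paragraphs, which this sketch does not attempt to build.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: link cost}


def dijkstra_path(graph: Graph, src: str, dst: str) -> List[str]:
    """Return the lowest-cost path from src to dst, or [] if unreachable."""
    dist = {src: 0.0}
    prev: Dict[str, str] = {}
    heap: List[Tuple[float, str]] = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return []
    # Walk the predecessor links back from the destination.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def demonstration_hops(path: List[str]) -> List[Tuple[str, str]]:
    """Turn a path into (current node, chosen next hop) pairs."""
    return list(zip(path, path[1:]))


# Hypothetical four-node topology with symmetric link costs.
graph: Graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.0, "D": 5.0},
    "C": {"A": 4.0, "B": 1.0, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}
path = dijkstra_path(graph, "A", "D")
hops = demonstration_hops(path)
print(path, hops)
```

Each pair in `hops` is one expert decision that a node could record as the action component of a demonstration sample.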
[0051] Preferably, the pre-training of the deep reinforcement learning architecture includes:
[0052] Before pre-training the deep reinforcement learning architecture, the parameters of the reinforcement learning algorithm are defined: state $s_t$, action $a_t$, and reward signal $r_t$;
[0053] State $s_t$ is $\{D_t, N_t, B_t, A_t\}$, where $D_t$ represents the destination node of the data packet forwarded by the current node; $N_t$ represents the set of neighboring nodes of the current node, that is, the set of all nodes within the effective communication range of the current node; $B_t$ represents the set of data packets queued at the current node and its neighboring nodes, that is, the set of all data packets in the receiving queues of the current node and its neighboring nodes; and $A_t$ represents the set of actions performed by the current node and its neighboring nodes in the previous three iterations, that is, the set of actions taken by the node in the previous three rounds when forwarding to the same destination node.
[0054] Action $a_t$: the action chosen by the drone node at time $t$, namely the selection of a node $i$ as the next-hop node for forwarding the data packet; the selectable actions are the set of all neighboring nodes of the current node;
[0055] Reward signal $r_t$: $\gamma$ is the discount factor; $f_{tran}$ represents the forwarding cost, which is half the total number of nodes in the network; $R_i$ represents the number of duplicate loops in the routing path of packet $i$; $H_i$ represents the hop count of data packet $i$ from the source node to the destination node, i.e., the number of relay nodes the data packet passes through during the entire transmission process; $O_i$ represents the waiting time of data packet $i$ in the queue of the corresponding drone node, specifically the total time from when the data packet enters the node's receiving queue to when the data packet is sent; $w_1$, $w_2$, and $w_3$ represent weight parameters, with larger weight values for items with greater influence; $n$ represents the number of hops in the routing process and is mainly used to calculate the reward for the (n+1)th hop action counted back from the last hop. For example, when calculating the reward for the second-to-last hop action, $n$ is set to 1.
[0056] After defining the parameters, the value function of the deep reinforcement learning architecture is pre-trained. First, the original training data $(s_t, a_t, r_t, s_{t+1})$ is generated from the routing process of the original data packets and stored in the demonstration data buffer. Original training data is then randomly sampled from the demonstration data buffer, and $s_t$ is input to the tar-Q network to calculate the value Q. The output layer uses an ε-greedy mechanism to predict the behavior of the current node, and the parameters of the eval-Q network are updated using gradient descent. Every C steps, the parameters of the tar-Q network are updated by setting $\theta^- = \theta$, until the loss function is less than a set threshold, where $\theta^-$ represents the parameters of the tar-Q network and $\theta$ represents the parameters of the eval-Q network;
[0057] S4: Input the coordinates of the destination node D of the target data packet, obtain the link state of the current node A and the link states of the neighboring nodes of the current node A into the pre-trained deep reinforcement learning architecture, and obtain the next hop node B of the current node A; generate target training data according to the routing process of the target data packet, and retrain the deep reinforcement learning architecture according to the original training data and the target training data.
[0058] Preferably, the retraining of the deep reinforcement learning architecture includes:
[0059] S41: Generate state $s_t$ based on the coordinates of the destination node D of the target data packet, the link state of the current node A, and the link states of the neighboring nodes of the current node A;
[0060] S42: Input state $s_t$ into the pre-trained deep reinforcement learning architecture to obtain the next-hop node of the current node and the target training data $(s_t, a_t, r_t, s_{t+1})$, and store the target training data $(s_t, a_t, r_t, s_{t+1})$ in the experience replay memory unit;
[0061] S43: Randomly sample data from the experience replay memory unit and the demonstration data buffer and input it into the eval-Q network to calculate the value Q. The output layer uses an ε-greedy mechanism to predict the behavior of the current node, and the parameters of the eval-Q network are updated using gradient descent. Every C steps, the parameters of the tar-Q network are updated by setting $\theta^- = \theta$;
[0062] The ratio of data randomly sampled from the experience replay memory unit to data randomly sampled from the demonstration data buffer is 1:η, where η is a simulation parameter that is manually set before the simulation begins.
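The 1:η sampling ratio between the two buffers can be illustrated with a short sketch. The function name `sample_mixed`, the batch size, the rounding rule, and the string-valued transitions are all assumptions made for the example; the source only fixes the ratio itself.

```python
import random
from typing import List, Sequence, Tuple

Transition = Tuple[str, str, float, str]  # (s_t, a_t, r_t, s_{t+1}), simplified to strings


def sample_mixed(replay: Sequence[Transition], demo: Sequence[Transition],
                 batch_size: int, eta: float, rng: random.Random) -> List[Transition]:
    """Draw a batch whose replay:demo sample counts are approximately 1:eta."""
    # Assumed rounding rule: the replay share of the batch is 1 / (1 + eta).
    n_replay = min(round(batch_size / (1.0 + eta)), len(replay))
    n_demo = min(batch_size - n_replay, len(demo))
    batch = rng.sample(list(replay), n_replay) + rng.sample(list(demo), n_demo)
    rng.shuffle(batch)
    return batch


# Hypothetical buffer contents, labelled by origin so the mix is visible.
replay_buffer = [(f"replay_s{i}", "B", 0.0, f"replay_s{i + 1}") for i in range(20)]
demo_buffer = [(f"demo_s{i}", "B", 0.0, f"demo_s{i + 1}") for i in range(20)]

batch = sample_mixed(replay_buffer, demo_buffer, batch_size=8, eta=3.0, rng=random.Random(0))
n_from_replay = sum(1 for s, _, _, _ in batch if s.startswith("replay"))
print(n_from_replay, len(batch) - n_from_replay)
```

With η = 3 and a batch of 8, two samples come from the experience replay memory unit and six from the demonstration data buffer.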
[0063] Preferably, the loss function includes:
[0064] $L_{DQN}(\theta) = E[(y - Q(s, a; \theta))^2]$
[0065]
[0066] Where $Q(s, a; \theta)$ represents the cumulative reward value that the dual deep Q-network outputs for choosing behavior $a_t$ in environment state $s_t$ when $s_t$ is given as input, and $y$ is the target value calculated by the target neural network.
[0067] Preferably, the gradient descent method is used to update the parameters θ of the eval-Q network with a learning rate α:
[0068]
[0069]
[0070] S5: Take the next hop node B as the starting node and repeat steps S4-S5 until the next hop node is the destination node, thus completing the routing of the target data packet;
[0071] Preferably, during the routing of the target data packet, the next-hop node B of the current node A is calculated using a deep reinforcement learning architecture, and the next-hop node of node B is calculated. It is then determined whether the next-hop node of the current node B is node A. If it is, a loop is generated. For loops, the communication link of path B-A is temporarily set to disconnected, and a suboptimal next-hop node is selected until a node that will not cause a loop is selected as the next-hop node.
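The loop-avoidance rule can be sketched as a next-hop selection that skips the node the packet just came from and falls back to the next-best candidate. This is one reading of the paragraph above, in which node B performs the re-selection; the function name and the Q values are hypothetical, and only the two-node loop A-B-A described in the text is handled.

```python
from typing import Dict, Optional


def select_next_hop(q_values: Dict[str, float], previous_hop: Optional[str]) -> Optional[str]:
    """Pick the highest-Q neighbor that does not send the packet straight back."""
    # Rank neighbors from best to worst estimated Q value.
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    for candidate in ranked:
        if candidate == previous_hop:
            # Treat the link back to the previous hop as temporarily disconnected.
            continue
        return candidate
    return None  # no loop-free neighbor is available


# Hypothetical Q values held by node B for its neighbors; the packet arrived from A.
q_at_b = {"A": 0.9, "C": 0.7, "D": 0.4}
choice = select_next_hop(q_at_b, previous_hop="A")
print(choice)
```

Here node A has the largest Q value but would create a loop, so the suboptimal neighbor C is chosen instead.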
[0072] In this embodiment, the UAV communication network of the present invention consists of N UAV nodes, and its communication topology is a connected graph G = (V, E), where V is the set of UAV nodes, E is the set of edges between UAV nodes, and each edge corresponds to a communication link between drone nodes;
[0073] For each drone node i, the set of its neighboring nodes at time t and the number of those neighboring nodes are defined.
[0074] Data packets can be sent via wireless communication links between drone nodes. Each drone node in the network can be either a source node or a destination node.
[0075] When a data packet p of size L, sent from the source node to the destination node d, arrives at relay node j, the processing procedure is as follows:
[0076] If j = d, it means that the data packet has reached the destination node and the data packet routing process has ended.
[0077] Otherwise, the data packet will be sent to the next-hop neighbor node selected by the routing policy learned by the current drone node.
[0078] During the routing process, the data packet transmission time $t_{i,j}$ from node $i$ to the next-hop node $j$ is defined as the sum of the waiting time of the data packet in the queue of node $i$ and the forwarding time of the wireless link, i.e. $t_{i,j} = w_i + L / B_{i,j}$. The waiting time $w_i$ of a data packet in the node queue is the time difference between the data packet entering the node queue and leaving the node queue. The forwarding time of the wireless link is measured by the ratio of the data packet size $L$ to the link's maximum transmission rate $B_{i,j}$.
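A small worked example of the per-hop transmission time, computed as the queue waiting time plus the packet size divided by the link's maximum transmission rate. The numeric values and units (seconds, bits, bits per second) are assumptions chosen only for illustration.

```python
def transmission_time(wait_s: float, packet_bits: float, max_rate_bps: float) -> float:
    """Queue waiting time plus link forwarding time (packet size / max rate)."""
    if max_rate_bps <= 0:
        raise ValueError("maximum transmission rate must be positive")
    return wait_s + packet_bits / max_rate_bps


# Hypothetical hop: 2 ms in the queue, an 8000-bit packet, a 1 Mbit/s link.
t_ij = transmission_time(wait_s=0.002, packet_bits=8000, max_rate_bps=1_000_000)
print(f"{t_ij:.3f} s")
```

The forwarding term contributes 8 ms here, so the hop takes about 10 ms in total.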
[0079] Each drone node has a certain initial energy. When a node has data packets to send or in its queue, it is considered to be in a working state and consumes energy; each data packet transmission also consumes a certain amount of energy. Otherwise, the node is considered dormant and is treated as consuming no energy. $e_i$ represents the remaining energy of node $i$, which is the difference between the initial energy and the consumed energy. When a node's energy falls below a given threshold, the node is considered inactive and unreachable. When the energy is 0, the node is deleted.
[0080] Next, the distributed routing problem of UAV ad hoc networks is modeled as a Markov decision process. Since this invention studies distributed routing protocols, each UAV node in the network is considered an agent. Each agent can make decisions and choose actions based on the state of the network environment. The Markov decision process is defined by a quadruple (S, A, P, R), where S is a finite set of states, A is a finite set of actions, and P is the probability of transitioning from state $s_t$ to state $s_{t+1}$ after the agent performs action $a_t$ at time $t$. This invention uses a model-free reinforcement learning method, meaning that the policy can be optimized and improved even when the state transition probability matrix P is unknown. R refers to the immediate reward obtained after the agent performs action $a_t$, which indicates how good or bad the current action is. The detailed definitions of the state space, action space, and reward function are given below:
[0081] State space S: The state of each agent is defined as the joint network state observed at time $t$. The state of an agent can be represented as $s_t = \{D_t, N_t, B_t, A_t\}$, where $D_t$ represents the destination node of the data packet forwarded by the current node, $N_t$ is the set of neighboring nodes, $B_t$ is the set of data packets queued at the current node and its neighboring nodes, and $A_t$ represents the set of actions performed by the current node and its neighbors during the previous three iterations. It is assumed that queued data packets and other necessary information from neighboring nodes are locally observable; during data transmission, this information can be passed to neighboring nodes via piggybacked acknowledgments.
[0082] Action space: Action $a_t$ represents the action chosen by the agent at time $t$, namely the selection of a node $i$ as the next-hop node to forward the data packet.
[0083] Reward function: The reward function $r_t(s_t, a_t)$ refers to the immediate reward that the environment provides to the agent after it performs action $a_t$ at time $t$ and transitions from state $s_t$ to state $s_{t+1}$.
[0084] This scheme proposes a distributed reward strategy that takes the information of the entire network into account. On one hand, the reward function includes local rewards describing the interactions between individuals, i.e., information about the relationship between two adjacent nodes. On the other hand, a global reward is introduced to reflect the quality of the action, i.e., the transmission direction of the data packet. Specifically, the reward is calculated using the final path of all data packets: the value of each agent node in the path is calculated, and finally the global reward is calculated. Therefore, this scheme defines the immediate reward of the last-hop action of data packet $i$ and, from it, the reward of each previously performed forwarding action, where $\gamma$ is the discount factor; $f_{tran}$ represents the forwarding cost, which is half the total number of nodes in the network; $R_i$ represents the number of duplicate loops occurring in the path of data packet $i$; $H_i$ represents the number of hops from packet $i$ to the destination node; $O_i^{(n)}$ represents the waiting time of data packet $i$ in the queue of the corresponding agent node; and $w_1$, $w_2$, and $w_3$ are weight parameters representing the importance of the corresponding indicators, all of which are penalty terms.
[0085] By defining the state and action space and the reward, the drone ad hoc network routing problem can be formally described as an MDP process.
[0086] The agent's goal is to find a deterministic optimal policy $\pi$ that maximizes the total cumulative reward. Based on the above formula, the deep reinforcement learning-based UAV networking routing problem can be defined as maximizing the future cumulative reward $G_t = r_{des} + \gamma r_{des-1} + \gamma^2 r_{des-2}$. To estimate the cumulative future reward, the Q function is represented as the expected cumulative future reward:
[0087] $Q^{\pi}(s_t, a_t) = E[G_t \mid s_t = s, a_t = a]$
[0088]
[0089] Here, $G_t$ represents the total reward during one packet routing process. The optimal strategy $\pi^*$ can be defined as the strategy that maximizes the Q-function and returns the optimal action given the state.
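The cumulative reward $G_t$ weights the last-hop reward $r_{des}$ by 1, the preceding hop's reward by $\gamma$, and the hop before that by $\gamma^2$. The sketch below evaluates that sum for hypothetical reward values; the function name and the numbers are illustrative only.

```python
from typing import Sequence


def discounted_return(rewards_from_last_hop: Sequence[float], gamma: float) -> float:
    """Sum rewards with weight gamma**k, where k = 0 is the last hop (r_des)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards_from_last_hop))


# Hypothetical rewards ordered as r_des, r_des-1, r_des-2.
rewards = [10.0, -1.0, -2.0]
g_t = discounted_return(rewards, gamma=0.5)
print(g_t)  # 10.0 + 0.5 * (-1.0) + 0.25 * (-2.0) = 9.0
```

Because the discount is applied backwards from the destination, hops further from delivery contribute less to the total.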
[0090] The goal of this invention is to learn an optimal routing strategy that can correctly forward each data packet, extend network lifetime, optimize network load, and reduce the average number of hops per data packet. This problem has been proven to be NP-hard.
[0091] Therefore, this invention explores the use of Deep Reinforcement Learning (DRL) to learn the optimal routing strategy and proposes a DQN-based routing algorithm to solve this problem.
[0092] In real-world large-scale drone collaborative networks, the network state space becomes extremely large as the number of drones increases.
[0093] Reinforcement learning methods such as Q-learning require storing all Q-values in a Q-table, which results in an exceptionally large Q-table. Due to the limitations of UAV hardware, storing and updating such a massive Q-table is inefficient, thus impacting task performance. Therefore, in this solution, we utilize the DQN algorithm to transform the Q-table update process in traditional Q-learning reinforcement learning into a function fitting problem, addressing the issue that Q-learning is unsuitable for high-dimensional state-action spaces.
[0094] The algorithm is shown in Figure 2.
[0095] The scheme employs a distributed design, in which each drone node determines its state $s_t$ from the observed network state information and selects action $a_t$ according to the ε-greedy principle. That is, a neighboring node is randomly selected as the next hop with probability ε, and the neighboring node with the largest Q value is selected as the next hop with probability 1-ε. The agent then receives a reward $r_t(s_t, a_t)$ and enters the next state $s_{t+1}$, and the experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the agent's experience replay memory unit.
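The ε-greedy selection can be sketched in a few lines. The function name, the neighbor identifiers, and the Q values are hypothetical; in the described architecture the Q values would come from the node's eval-Q network rather than a literal dictionary.

```python
import random
from typing import Dict


def epsilon_greedy(q_values: Dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Explore a random neighbor with probability epsilon, otherwise exploit the best Q."""
    if not q_values:
        raise ValueError("node has no neighbors to forward to")
    if rng.random() < epsilon:
        # Exploration: any neighbor may be chosen as the next hop.
        return rng.choice(sorted(q_values))
    # Exploitation: the neighbor with the largest Q value.
    return max(q_values, key=q_values.get)


rng = random.Random(42)
q_neighbors = {"uav_2": 0.1, "uav_5": 0.8, "uav_7": 0.3}  # hypothetical Q values
greedy_choice = epsilon_greedy(q_neighbors, epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy(q_neighbors, epsilon=1.0, rng=rng)
print(greedy_choice, explore_choice)
```

With ε = 0 the call always returns the best-valued neighbor, while ε = 1 always picks uniformly at random; intermediate values trade off the two.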
[0096] Because the data samples generated by reinforcement learning are correlated, the training of the agent is difficult to converge. Randomly sampling from the experience replay memory unit removes the correlation between experience data. This mechanism also allows the agent to use both new and old experiences for training, making the training more efficient. Since the Q-value changes during training, updating the Q-network against a constantly changing set of values can lead to uncontrollable estimates, causing algorithm instability. To address this, a separate tar-Q network, whose parameters are updated only periodically, supplies the target values used to update the eval-Q network. This significantly reduces the correlation between the target value and the estimate, thus stabilizing the algorithm.
[0097] DQN approximates the true Q-value using the Q-value estimated by the eval-Q network. It first generates a shallow neural network model consisting of an input layer, an output layer, and two hidden layers. Let $\theta$ represent the parameters of the eval-Q network and $\theta^-$ represent the parameters of the tar-Q network. These two neural networks have the same structure but different parameters. Training a neural network optimizes the loss function, which is the deviation between the target value and the estimated value output by the network. Therefore, the loss function minimized by the DQN agent is defined as follows:
[0098] $L_{DQN}(\theta) = \mathbb{E}\left[\left(y - Q(s,a;\theta)\right)^{2}\right]$
[0099]
[0100] Where Q(s,a;θ) is the estimated value of the output of the trained neural network, and y is the target value calculated by the target neural network.
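The loss can be illustrated numerically with a short Python sketch. The formula for the target value y is not reproduced in this text, so the sketch assumes the standard DQN target, y = r + γ · max Q(s', a'; θ⁻), with y = r for a terminal transition; the patent's dual deep Q-network may compute y differently. All function names and sample values are hypothetical.

```python
def td_target(reward, next_q_target, gamma, done):
    """Assumed standard DQN target: y = r + gamma * max_a' Q(s', a'; theta^-)."""
    if done or not next_q_target:
        # Terminal transition: no future value is added.
        return reward
    return reward + gamma * max(next_q_target)


def dqn_loss(batch, gamma):
    """Mean squared deviation between targets y and estimates Q(s, a; theta).

    Each batch item is (q_estimate, reward, next_q_target, done), where
    next_q_target lists the tar-Q outputs for every action in the next state.
    """
    squared_errors = []
    for q_estimate, reward, next_q_target, done in batch:
        y = td_target(reward, next_q_target, gamma, done)
        squared_errors.append((y - q_estimate) ** 2)
    return sum(squared_errors) / len(squared_errors)


batch = [
    (1.0, -1.0, [2.0, 3.0], False),  # y = -1 + 0.5 * 3 = 0.5, squared error 0.25
    (0.0, 2.0, [], True),            # y = 2, squared error 4
]
loss = dqn_loss(batch, gamma=0.5)    # (0.25 + 4.0) / 2 = 2.125
```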
[0101] This scheme uses gradient descent to update the parameters θ of the trained neural network with a learning rate α.
[0102]
[0103]
[0104] The parameters of the target neural network are updated by copying the parameters of the trained neural network to it after every several training steps.
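The gradient step and the periodic copy to the target network can be sketched as follows. For brevity the sketch assumes a linear Q function, Q = θ · φ, in place of the neural network, and takes the target values y as given instead of computing them from the tar-Q network. The function names, feature vectors, learning rate, and copy interval are all hypothetical.

```python
def sgd_step(theta, features, y, alpha):
    """One gradient-descent step on (y - Q)^2 for a linear Q = theta . features."""
    q = sum(t * f for t, f in zip(theta, features))
    error = y - q
    # Gradient of (y - q)^2 with respect to theta is -2 * error * features.
    grad = [-2.0 * error * f for f in features]
    return [t - alpha * g for t, g in zip(theta, grad)]


def train(samples, alpha, sync_every):
    theta = [0.0, 0.0]          # eval-Q parameters
    theta_target = list(theta)  # tar-Q parameters
    syncs = 0
    for step, (features, y) in enumerate(samples, start=1):
        theta = sgd_step(theta, features, y, alpha)
        if step % sync_every == 0:
            # Copy the trained parameters into the target parameters.
            theta_target = list(theta)
            syncs += 1
    return theta, theta_target, syncs


# Three identical samples with feature vector [1, 0] and target y = 1.
samples = [([1.0, 0.0], 1.0)] * 3
theta, theta_target, syncs = train(samples, alpha=0.25, sync_every=2)
```

After three steps the eval parameters have moved to `[0.875, 0.0]`, while the target parameters still hold the values copied at step 2, `[0.75, 0.0]`, which shows how the target lags behind the network being trained.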
[0105] In actual training, routing protocols based on deep reinforcement learning often exhibit poor initial performance. This is because reinforcement learning algorithms learn their strategies through trial and error, which negatively impacts their initial performance. Therefore, in scenarios requiring rapid route generation for UAV swarm operations, traditional methods relying solely on reinforcement learning clearly have significant drawbacks. It is necessary to improve the early learning performance of the algorithm to accelerate its learning speed, increase its efficiency, and adapt to future real-world UAV swarm applications.
[0106] Specifically, this scheme allows each agent in the network to pre-learn an initial policy. This initial policy may not be perfect, and its performance may not meet the set indicators, but it greatly reduces the trial-and-error cost of the agent's self-learning in the initial stage. As a result, the initial performance of network training improves significantly and rewards converge much faster.
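One way such pre-learning data could be produced is to record the next hop chosen at every node along a Dijkstra shortest path. The sketch below uses only the Python standard library instead of NetworkX so that it is self-contained; the graph, the link weights, and the sample format (current node, destination, next hop) are assumptions made for illustration and are not the patent's exact state encoding.

```python
import heapq


def dijkstra_path(graph, source, target):
    """Shortest path by link weight; `graph` maps node -> {neighbor: weight}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None  # the destination is unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def demonstrations(graph, source, target):
    """Turn one Dijkstra route into (current node, destination, next hop) samples."""
    path = dijkstra_path(graph, source, target)
    if path is None:
        return []
    return [(cur, target, nxt) for cur, nxt in zip(path, path[1:])]


# Hypothetical four-node network with symmetric link weights.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.0, "D": 5.0},
    "C": {"A": 4.0, "B": 1.0, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}
demo_buffer = demonstrations(graph, "A", "D")
```

Here the shortest route from A to D is A, B, C, D with total weight 3, so the buffer holds three demonstration samples that an agent could be pre-trained on before it begins learning by trial and error.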
[0107] This invention uses Python 3.6 to write code for an event-driven simulator, uses PyTorch as a deep learning framework to implement a DQN-based routing algorithm, and uses NetworkX to build a drone networking environment.
[0108] To evaluate and compare the performance of the proposed routing algorithm and the benchmark algorithm in a real UAV network, repeated training is required on a UAV network with a custom number of nodes. To fit the actual scenario of UAV swarm operations, each node is set to have a random number of neighboring nodes and may leave the effective communication range of its neighbors at any time. Then, the queue length, transmission queue length, wireless link data transmission rate and data packet size of each node are initialized. The maximum lifetime is set to half of the total number of nodes, and the source and destination nodes of each data packet are randomly generated.
[0109] The system creates a simulated packet routing process within a predefined UAV network, which is discretely updated at a series of time steps. Throughout one episode of the simulation, links randomly break and randomly recover at each time step. Furthermore, the link weights fluctuate in an approximately sinusoidal manner throughout the simulation.
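The link dynamics described above can be sketched as a per-time-step update. The break and recovery probabilities, the oscillation period, the amplitude, and the per-link phase are hypothetical values chosen for the example, and an exact sine is used where the text says only "approximately sinusoidal".

```python
import math
import random


def step_links(links, t, rng, p_break=0.05, p_recover=0.2, period=50.0):
    """Advance all links by one time step, in place."""
    for link in links.values():
        if link["up"]:
            # A working link breaks with probability p_break.
            if rng.random() < p_break:
                link["up"] = False
        elif rng.random() < p_recover:
            # A broken link recovers with probability p_recover.
            link["up"] = True
        # The weight swings sinusoidally around its base value.
        swing = link["amplitude"] * math.sin(
            2.0 * math.pi * t / period + link["phase"]
        )
        link["weight"] = link["base"] * (1.0 + swing)


rng = random.Random(7)
links = {
    ("A", "B"): {"up": True, "base": 1.0, "amplitude": 0.3, "phase": 0.0, "weight": 1.0},
    ("B", "C"): {"up": True, "base": 2.0, "amplitude": 0.3, "phase": 1.0, "weight": 2.0},
}
history = []
for t in range(200):
    step_links(links, t, rng)
    history.append({edge: (link["up"], link["weight"]) for edge, link in links.items()})
```

A router running on top of this model would ignore links whose `up` flag is false and use `weight` as the link cost at that time step.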
[0110] At the start of each episode, numerous data packets (network load) are generated on the network, each with a random source and destination node. A new data packet is initialized after several time steps once a packet has been successfully transmitted. An episode ends once a certain number of data packets have been generated and successfully transmitted on the network. The average packet delivery time and various network performance metrics are then calculated. The simulation requires the routing process to determine the path for each data packet using a routing algorithm. In this project, we primarily explore the shortest path using Dijkstra's algorithm, and explore distributed Q-learning and deep Q-learning using various reward functions.
[0111] Refer to Figure 3. To evaluate and compare the performance of the proposed routing algorithm and the benchmark algorithm, the number of UAV nodes was set to 50, 60, 70, 80, 90, 100, and 110, the sending queue length to 30, the receiving queue length to 150, the number of training rounds to 2250, the maximum link bandwidth to 10 Mbps, the number of data packets to 5000, the test load interval to 500, the minimum test load to 1000, and the maximum test load to 5000. To simulate the actual scenario of UAV swarm operations, each node has a random number of neighboring nodes and may leave the effective communication range of its neighbors at any time. The data transmission rate of each wireless link between nodes was chosen at random from 1 Mbps, 2 Mbps, 5.5 Mbps, and 11 Mbps. The data packet size was 1000 bytes, and the maximum packet lifetime was half the total number of nodes. The source and destination nodes of each data packet were randomly generated. Setting the number of data packets in the network to 5000 ensured both a sufficient volume of data transmission and sufficient training for each node. A different random network topology was used for each training session.
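For orientation, the per-hop transmission time implied by these settings can be worked out directly. The sketch assumes that 1 Mbps means 10^6 bits per second and ignores headers, propagation delay, and queueing delay; the function and constant names are hypothetical.

```python
PACKET_SIZE_BYTES = 1000
LINK_RATES_MBPS = (1, 2, 5.5, 11)


def transmission_time_ms(packet_bytes, rate_mbps):
    """Time to push one packet onto a link, in milliseconds."""
    bits = packet_bytes * 8
    # bits / (rate_mbps * 1e6) seconds equals bits / (rate_mbps * 1e3) milliseconds.
    return bits / (rate_mbps * 1000.0)


times_ms = {rate: transmission_time_ms(PACKET_SIZE_BYTES, rate) for rate in LINK_RATES_MBPS}
# 8 ms at 1 Mbps, 4 ms at 2 Mbps, about 1.45 ms at 5.5 Mbps, about 0.73 ms at 11 Mbps
```

Under these assumptions a single hop costs between roughly 0.73 ms and 8 ms of transmission time, so the link rate chosen for each hop has a direct effect on packet delivery time.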
[0112] As shown in Figure 4, with 100 nodes the algorithm converges after approximately 1000 training rounds. Pre-learning accelerates the algorithm's learning speed.
[0113] As shown in Figure 5, the DQN algorithm is compared with traditional AODV (Ad hoc On-Demand Distance Vector routing) and Q-Routing. With 50 nodes and a gradually increasing packet delivery rate, the queue length of the DQN algorithm is much shorter than that of AODV and Q-Routing, so that congestion is avoided even with large data volumes.
[0114] As shown in Figure 6, the end-to-end latency is the time from when the current node sends a message until the next node receives it. With 50 to 110 nodes, the DQN algorithm maintains an end-to-end latency of ≤ 50 ms, meeting the technical specifications. In contrast, the latency of the traditional AODV algorithm increases because neighbor discovery and establishment take longer as the number of nodes increases.
[0115] In practical applications, control overhead is also a very important technical indicator. It is the proportion of control messages among all messages sent in the network. As shown in Figure 7, the DQN algorithm significantly reduces control overhead compared with traditional AODV and Q-Routing, especially when the number of nodes is large. This ensures cost-effectiveness in practical applications.
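The control overhead defined above is a simple ratio, sketched here with hypothetical message counts. The function name and the split of the total into control and data messages are assumptions made for the example.

```python
def control_overhead(control_msgs, data_msgs):
    """Share of control messages among all messages sent in the network."""
    total = control_msgs + data_msgs
    if total == 0:
        return 0.0  # nothing was sent, so there is no overhead to report
    return control_msgs / total


# Hypothetical counts: 200 control messages and 1800 data messages.
overhead = control_overhead(200, 1800)  # 200 / 2000 = 0.1
```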
[0116] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A routing method for distributed unmanned aerial vehicle (UAV) ad hoc networks based on deep reinforcement learning, characterized in that it comprises:
S1: Using drones as nodes, create a drone communication network based on the communication links between drones;
S2: Construct a deep reinforcement learning architecture for the drone communication network using a Markov decision process; the deep reinforcement learning architecture includes an input layer, a dual deep Q-network, an output layer, a demonstration data buffer, and an experience replay memory unit;
S3: Randomly generate the source and destination nodes of original data packets, and run the Dijkstra algorithm to send the original data packets from the source node to the destination node; generate original training data from the routing process of the original data packets to pre-train the deep reinforcement learning architecture.
Before pre-training the deep reinforcement learning architecture, define the parameters of the reinforcement learning algorithm: the state $s_t$, the action $a_t$, and the reward signal $r_t$.
The state comprises: the destination node of the data packet forwarded by the current node; the set of neighboring nodes of the current node; the set of data packets queued at the current node and its neighboring nodes; and the set of actions performed by the current node and its neighboring nodes in the first three iterations.
The action denotes the choice made by a drone node at time t, namely the next-hop node selected by the current node to forward the data packet; the selectable actions are the set of all neighboring nodes of the current node.
The reward signal is defined in terms of: a discount factor; the forwarding cost, which is half the total number of nodes in the network; the number of repeated loops in the path of the data packet; the number of hops the data packet takes to reach the destination node; the waiting time of the data packet in the queue at the corresponding drone node; weight parameters; and n, the number of hops in the routing process.
After the parameters are defined, the value function of the deep reinforcement learning architecture is pre-trained: original training data are first generated from the routing process of the original data packets and stored in the demonstration data buffer; the original training data are then randomly drawn from the demonstration data buffer and ... input to the tar-Q network to calculate the Q value; the output layer uses the ... mechanism to predict the behavior of the current node; the parameters of the eval-Q network are updated by gradient descent, and every C steps the parameters of the tar-Q network are updated using ..., until the loss function is less than a set threshold; where $\theta^{-}$ denotes the parameters of the tar-Q network, θ denotes the parameters of the eval-Q network, and $s_{t+1}$ denotes the next state;
S4: Input the coordinates of the destination node D of the target data packet, the link state of the current node A, and the link states of the neighboring nodes of the current node A into the pre-trained deep reinforcement learning architecture to obtain the next-hop node B of the current node A; generate target training data from the routing process of the target data packet, and retrain the deep reinforcement learning architecture on the original training data and the target training data. Retraining the deep reinforcement learning architecture includes:
S41: Generate the state from the coordinates of the destination node D of the target data packet, the link state of the current node A, and the link states of the neighboring nodes of the current node A;
S42: Input the state into the pre-trained deep reinforcement learning architecture and output the next-hop node of the current node to obtain the target training data, and store the target training data in the experience replay memory unit;
S43: Randomly sample data from the experience replay memory unit and the demonstration data buffer and input them into the tar-Q network to compute the Q value; the output layer uses the ... mechanism to predict the behavior of the current node; the parameters of the eval-Q network are updated by gradient descent, and every C steps the parameters of the tar-Q network are updated using ...;
S5: Take the next-hop node B as the starting node and repeat steps S4-S5 until the next-hop node is the destination node, thus completing the routing of the target data packet.
2. The routing method for a distributed UAV ad hoc network based on deep reinforcement learning according to claim 1, characterized in that the drone nodes in the drone communication network adopt a random movement model and can dynamically join or leave the network; each drone node can act as a relay node, a source node, or a destination node; the drone nodes periodically send Hello messages to neighboring nodes to update the drone communication network in real time; and if no feedback to a Hello message is received from a neighboring node within a specified time, the communication link is considered to be disconnected.
3. The routing method for a distributed unmanned aerial vehicle (UAV) ad hoc network based on deep reinforcement learning according to claim 1, characterized in that the loss function is: $L_{DQN}(\theta) = \mathbb{E}\left[\left(y - Q(s,a;\theta)\right)^{2}\right]$, where $Q(s,a;\theta)$ denotes the output of the dual deep Q-network when the environment state s is input, that is, the cumulative reward obtained after selecting action a in that environment state, and y is the target value calculated by the target neural network.
4. The routing method for a distributed unmanned aerial vehicle (UAV) ad hoc network based on deep reinforcement learning according to claim 3, characterized in that the parameters θ of the trained neural network are updated by the gradient descent method with learning rate α.
5. The routing method for a distributed unmanned aerial vehicle (UAV) ad hoc network based on deep reinforcement learning according to claim 1, characterized in that, during the routing of the target data packet, the next-hop node B of the current node A is calculated by the deep reinforcement learning architecture, and the next-hop node of node B is then calculated; it is determined whether the next-hop node of node B is node A, and if so, a loop has been generated; for a loop, the communication link of path B-A is temporarily disconnected and a suboptimal next-hop node is selected, until a node that does not cause a loop is selected as the next-hop node.