Space-ground integrated network QoE-aware adaptive routing method based on deep learning
By constructing an SDN-based integrated space-ground network model and a multi-agent partially observable Markov decision process, combined with an Actor-Critic algorithm constrained by a Safety Shield, the method addresses the routing difficulties caused by high network dynamism and the resulting user experience issues in integrated space-ground networks, achieving efficient and reliable routing strategy optimization and improving user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-30
Abstract
Description
Technical Field
[0001] This invention pertains to integrated space-ground networks and relates to a QoE-aware adaptive routing method for integrated space-ground networks based on deep learning.
Background Technology
[0002] With the advent of the 6G communication vision, the Space-Air-Ground Integrated Network (SAGIN) has become a core infrastructure for achieving seamless global coverage and overcoming terrain limitations. The construction of low-Earth orbit (LEO) satellite mega-constellations, represented by Starlink, OneWeb, and China's Guowang, is accelerating. By integrating space-based (satellites), air-based (drones / HAPs), and ground-based networks, SAGIN can provide broadband access services to remote areas, oceans, aviation, and emergency disaster relief scenarios. However, this heterogeneous, multi-layered network architecture presents unprecedented challenges to routing technology. Unlike fixed terrestrial networks, SAGIN is highly dynamic and time-varying. First, the topology changes frequently: LEO satellites move at high speeds of 7.6 km/s, so the network topology changes on a timescale of seconds or even milliseconds, and satellite-to-ground and inter-satellite links switch frequently. Second, existing protocols become ineffective: traditional internet routing protocols (such as OSPF and BGP) are designed primarily for static or slowly changing networks, and in an integrated space-air-ground network environment they suffer from slow convergence and exponential growth in signaling overhead. Moreover, by the time the optimal path has been calculated and distributed, the actual physical link has often changed or broken, resulting in routing lag and packet loss.
[0003] Current research on routing algorithms for SAGIN mainly focuses on QoS (Quality of Service) optimization, which has the following serious shortcomings, and this is the direct motivation for the development of this technology:
[0004] QoS metrics cannot represent the true user experience (QoE): Existing research focuses primarily on minimizing end-to-end latency or packet loss rate. However, in experience-sensitive applications such as high-definition video conferencing and VR / AR transmission, low latency alone does not equate to a high-quality experience. For example, frequently switching routing paths in pursuit of the lowest latency can lead to severe jitter and out-of-order delivery, resulting in video stuttering, increased buffering, and a sharp decline in the user's subjective experience (QoE). Currently, there is a lack of a mechanism that can effectively map and jointly optimize network layer parameters (QoS) and application layer awareness (QoE).
[0005] The "difficulty in implementation" of traditional DRL routing algorithms: Although deep reinforcement learning (DRL) has been introduced into the routing field for its powerful non-linear decision-making capabilities, existing DRL routing schemes generally suffer from safety risks caused by exploration. In the early stages of training, or when the environment changes suddenly, the agent often tries physically unreachable paths (such as links that are blocked by the Earth), resulting in communication interruption. In addition, existing algorithms lack awareness of the remaining lifetime of a link and are prone to selecting, as relays, satellite nodes that are about to be disconnected.
[0006] In summary, existing technologies suffer from rapid routing failures in the highly dynamic environment of integrated space-ground networks, a disconnect between traditional QoS optimization and user experience, and a lack of physical reliability guarantees for AI algorithms.
Summary of the Invention
[0007] To address the aforementioned problems in the prior art, this invention employs a deep learning-based QoE-aware adaptive routing method for integrated space-ground networks, comprising:
[0008] S1. Construct an SDN-based integrated space-ground network model; the SDN-based integrated space-ground network model includes: SDN control center, LEO satellite constellation layer, HAP / UAV relay layer, and ground user and gateway layer;
[0009] S2. Model the routing process of the SDN-based integrated space-ground network model as a multi-agent partially observable Markov decision process;
[0010] S3. The optimal routing strategy is calculated using the Actor-Critic algorithm with Safety Shield constraints based on a multi-agent partially observable Markov decision process.
[0011] Beneficial effects:
[0012] 1. To address the slow convergence and poor reliability caused by the blind exploration of traditional DRL, this invention introduces a Safety Shield mechanism. Before the agent makes a decision, a dynamic mask is generated based on physical topology constraints (i.e., remaining link connectivity time and remaining link bandwidth). The Q-value is corrected based on the dynamic mask, and the optimal action is selected based on the corrected Q-value. Next-hop nodes that are about to disconnect or are physically invisible due to orbital movement are forcibly eliminated, completely eliminating routing black holes. This significantly improves the convergence speed and robustness of the algorithm while ensuring physical reliability.
2. To address the degradation of user experience caused by frequent route switching in highly dynamic environments, this invention designs a composite reward function. Instead of anchoring solely to latency indicators, it integrates multi-dimensional features such as end-to-end latency, remaining link bandwidth, and congestion level. A smoothness penalty, i.e., a path switching penalty, is introduced, forcing the agent to balance performance and stability and achieving experience-first, QoE-driven transmission optimization.
Attached Figure Description
[0013] Figure 1 is a flowchart of a deep learning-based QoE-aware adaptive routing method for a space-ground integrated network provided in an embodiment of the present invention;
[0014] Figure 2 shows the SDN simulation architecture provided in an embodiment of the present invention;
[0015] Figure 3 is a diagram illustrating the architecture of a QoE-aware routing optimization algorithm for multiple service types provided in an embodiment of the present invention;
[0016] Figure 4 is a diagram of the Safety Shield decision model provided in an embodiment of the present invention;
[0017] Figure 5 is a diagram illustrating the hierarchical structure of federated learning provided in an embodiment of the present invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] As shown in Figures 1 and 3, this embodiment of the invention employs a deep learning-based QoE-aware adaptive routing method for integrated space-ground networks, including:
[0020] S1. Construct an SDN-based integrated space-ground network model; the SDN-based integrated space-ground network model includes: SDN control center, LEO satellite constellation layer, HAP / UAV relay layer, and ground user and gateway layer;
[0021] As shown in Figure 2, to overcome the high cost, the difficulty of reproducing the environment, and the high risk of on-orbit testing in real satellite network experiments, this invention first establishes a high-fidelity simulation base and adopts the decoupling concept of software-defined networking (SDN) to separate network forwarding behavior from routing control logic.
[0022] Physical modeling of the data plane (Infrastructure Layer): The OMNeT++ discrete event simulator is used as the basic platform, and the INET open source framework and OS3 satellite orbit model library are integrated.
[0023] Heterogeneous Node Deployment: A three-layer heterogeneous node configuration is set up in the simulation scenario. The top layer is the LEO satellite constellation layer, which includes satellite nodes and is configured with Kepler orbital parameters (orbital altitude, inclination, right ascension of the ascending node, etc.) to simulate real periodic high-speed motion. The middle layer is the HAP / UAV (High Altitude Platform / UAV) relay layer, which serves as a regional relay node, including high altitude platforms, UAVs, and other nodes, and is set to quasi-static or low-speed hovering mode. The bottom layer is the ground user and gateway layer, which serves as the initiator and receiver of services, including ground stations, terminals, and other nodes.
[0024] Complex Channel Modeling: To approximate the real physical environment, an RF (Radio Frequency) link model is adopted for satellite-to-ground links, and the free-space path loss formula together with the rain attenuation and atmospheric absorption losses defined by the ITU-R model are applied when calculating the signal-to-noise ratio. For inter-satellite links (ISL), an FSO (Free Space Optical Communication) model is adopted, and a line-of-sight (LoS) detection algorithm determines in real time whether the Earth blocks the connection between two satellites; if blocking occurs, the link is immediately interrupted.
Control Layer and Southbound Interface Development: The control logic runs on an external server supporting the Python / PyTorch deep learning framework. To achieve real-time interaction between the C++ environment (OMNeT++) and the Python environment (agent), this invention uses a TCP / UDP socket synchronous communication interface.
State Upload: The SDN controller agent module in the simulation environment periodically (e.g., every 100 ms) collects the topology connection status of the entire network, the real-time bandwidth utilization of each link, and the packet loss rate. This data is then serialized into Protocol Buffers format and sent to the Python side.
Flow Rule Install: The Python agent calculates the next-hop action based on the received state. The action is then converted into a standard OpenFlow flow table entry and injected back, via the socket, into the routing table of the satellite node in the simulation environment to guide the actual forwarding of data packets. The simulation base architecture is shown in Figure 2.
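The line-of-sight detection described for inter-satellite links can be illustrated with a small geometric check. This is a minimal sketch rather than the implementation used in the embodiment: it assumes a spherical Earth with a radius of 6371 km and Earth-centered Cartesian positions in kilometres, and the function name `has_line_of_sight` and the optional clearance `margin_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth


def has_line_of_sight(p1, p2, margin_km=0.0):
    """Return True if the straight segment p1-p2 clears the Earth.

    p1, p2: (x, y, z) positions in km in an Earth-centered frame.
    margin_km: hypothetical extra clearance required above the surface.
    """
    d = [b - a for a, b in zip(p1, p2)]
    seg_len_sq = sum(c * c for c in d)
    if seg_len_sq == 0.0:
        # Degenerate case: both endpoints coincide.
        return math.hypot(*p1) > EARTH_RADIUS_KM + margin_km
    # Parameter of the point on the segment closest to the Earth's center.
    t = -sum(a * c for a, c in zip(p1, d)) / seg_len_sq
    t = max(0.0, min(1.0, t))
    closest = [a + t * c for a, c in zip(p1, d)]
    return math.hypot(*closest) > EARTH_RADIUS_KM + margin_km
```

A link whose closest approach to the Earth's center falls below the surface radius would be treated as blocked and interrupted in the simulation.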
[0025] S2. Model the routing process of the SDN-based integrated space-ground network model as a multi-agent partially observable Markov decision process;
[0026] The multi-agent partially observable Markov decision process is as follows:
[0027]
[0028] where the elements of the tuple are as follows. The first element is the state transition probability function. The global reward function comprises the reward function of each agent i, where i is the index of the agent, i.e., the index of a satellite node in the LEO satellite constellation layer of the integrated space-ground network model, and N is the number of satellite nodes in the LEO satellite constellation layer. G = (V, E) is the graph structure of the integrated space-ground network model: V is the set of nodes (including LEO satellites, UAVs, ground stations, terminals, etc.) in the LEO satellite constellation layer, the HAP / UAV relay layer, and the ground user and gateway layer, and E is the set of edges. A usable link edge exists between two nodes if and only if they satisfy the connectivity constraints of physical line-of-sight visibility and the minimum elevation angle for communication. The state space consists of the local states observed by each agent i. The action space consists of the local actions performed by each agent i, namely the selection of the next-hop node for the current data packet.
[0029] To achieve fine-grained perception of service types, this invention utilizes extended fields in the MPQUIC protocol header to identify service types, defining service flag bits:
[0030]
[0031] where the three flag values indicate latency-sensitive services, bandwidth-sensitive services, and normal services, respectively. This mechanism enables network nodes to quickly identify service intent without introducing additional signaling overhead.
[0032] Because the connectivity between nodes changes over time, the local action of agent i at time t is the selection of a node k from the set of neighboring nodes of agent i at time t. The local state observed by agent i at time t comprises the service flag and, for each neighboring node k: the estimated end-to-end delay of forwarding via neighboring node k; the remaining bandwidth of the direct physical link between agent i and neighboring node k; and the congestion level indicator in the forwarding direction of neighboring node k.
[0033] The specific method for obtaining the local state is as follows:
[0034] Service flag: obtained directly by parsing the Service_Type field in the MPQUIC protocol header.
[0035] End-to-end delay estimate: By parsing the interaction timestamps and acknowledgment frames in the MPQUIC protocol headers of the flows passing through the node, the historical end-to-end round-trip delay of the service flow to the destination node via neighbor node k is obtained directly. The SDN control center periodically sends link probe information based on its global network view, from which the expected reference delay of the remaining path from neighbor node k to the destination node is calculated. The measured historical round-trip delay and the expected remaining-path reference delay are then combined to obtain the estimated end-to-end delay of forwarding via node k.
[0036] The end-to-end delay estimate is calculated by parsing the timestamps of MPQUIC protocol message exchanges or by combining the periodic link probe information issued by the SDN controller.
[0037] Remaining bandwidth of the link: A Sketch data structure (containing a hash function and a counter array) accumulates, in real time, the number of bytes of the data packets flowing through the physical link from the agent to neighbor node k within a statistical time window. The current bandwidth throughput is calculated from this count, and the remaining bandwidth of the link is then obtained by subtracting the used bandwidth from the total capacity of the physical link.
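The byte-counting step above can be sketched with a Count-Min-style structure. This is an illustrative sketch only: the class name `ByteSketch`, the helper `remaining_bandwidth_bps`, the sketch dimensions, and the use of SHA-256 as the hash function are all assumptions, since the text specifies only "a hash function and a counter array".

```python
import hashlib


class ByteSketch:
    """Count-Min-style sketch that accumulates bytes per link key."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One independent hash per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, n_bytes):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += n_bytes

    def estimate(self, key):
        # The minimum over rows bounds the over-count caused by collisions.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))

    def reset(self):
        # Called at the start of each statistical time window.
        self.rows = [[0] * self.width for _ in range(self.depth)]


def remaining_bandwidth_bps(sketch, link_key, window_s, capacity_bps):
    """Remaining bandwidth = link capacity minus measured throughput."""
    used_bps = sketch.estimate(link_key) * 8 / window_s
    return max(0.0, capacity_bps - used_bps)
```

For example, 125,000 bytes counted on a link during a 1 s window correspond to 1 Mbit/s of used bandwidth, which is subtracted from the link capacity.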
[0038] Congestion level indicator: A Flow Entropy filter based on the Sketch mechanism statistically analyzes the feature distribution of local micro-flows from agent i to node k, and the occupancy of the underlying buffer queue of neighboring node k is obtained. The micro-flow feature distribution and the buffer queue occupancy are combined into a congestion index, which quantitatively characterizes the congestion risk of the current link.
[0039] This joint state representation method can simultaneously characterize the features of business requirements and the dynamic changes of network resources, providing a foundation for the design of subsequent composite reward functions.
[0040] In multi-service scenarios, different types of services have significantly different focuses on network performance metrics. To address this, this invention constructs a composite reward function that integrates QoE metrics and path switching penalty terms to guide reinforcement learning agents in achieving an adaptive trade-off between low latency, high throughput, and routing stability.
[0041] At time t, agent i, in its current local observation state, takes an action (i.e., selects a specific neighbor node as the next hop); the environment evaluates the action and returns an immediate reward. To force the agent to balance performance and stability, the reward of agent i is defined as a composite function of QoE metrics and a switching penalty, mapping a state-action pair to a reward:
[0042] where the left-hand side is the immediate reward received by agent i for taking the action in the current local state. The weighting coefficients are set dynamically according to the service flag, and all other terms are calculated from the local observation data of agent i, as detailed below:
[0043] The delay reward term is the reward based on the end-to-end latency observed by agent i, reflecting the latency performance of the path. It is computed as the difference between the maximum tolerable end-to-end latency threshold set by the system for the service flow and the estimated path delay via the neighbor node corresponding to the action, as observed by agent i at time t, divided by the delay threshold.
[0044] According to the 3GPP Non-Terrestrial Network (NTN) standard and service type (Flag), the range of this maximum end-to-end latency threshold is set as follows: For latency-sensitive services (such as voice / online interaction), the optimal value is... For ordinary data services, the optimal value is .
[0045] The bandwidth reward term is the reward based on the remaining link bandwidth observed by agent i, used to measure the efficiency of link resource utilization. It is computed as the remaining bandwidth of the direct physical link between agent i and the neighbor node corresponding to the action, as observed at time t, divided by the bandwidth requirement of the service flow. The bandwidth requirement of the service flow is obtained dynamically by parsing the resource request field exchanged during the MPQUIC protocol handshake phase.
[0046] The congestion penalty term is based on the congestion index observed by agent i. It is computed as the congestion level indicator in the forwarding direction of the neighbor node corresponding to the action, as observed by agent i at time t, divided by the congestion threshold set by the system.
[0047] The range of values for the congestion threshold is: That is, the normalized congestion index Its optimal value is This value avoids excessive penalties when the network is under light load, while generating a significant gradient penalty signal when congestion approaches the queue overflow threshold, prompting the agent to switch routes in a timely manner.
[0048] The path switching penalty constrains network oscillations caused by frequent path reconstruction. It uses an indicator function that takes the value 1 when the action chosen by agent i at time t differs from the action chosen at time t-1 (that is, when the path changes), and 0 otherwise. This mechanism reduces routing oscillations and improves system stability and link convergence speed.
[0049] The weighting parameters are set as follows for different business types:
[0050] Latency-sensitive services ( ): the weights are set so that the primary optimization objective is to minimize end-to-end latency.
[0051] Optionally, , , .
[0052] Bandwidth-sensitive services ( ): the weights are set to enhance system throughput and avoid link congestion.
[0053] Optionally, , , .
[0054] Normal services ( ): the weights are set to maintain a balance between latency and throughput.
[0055] Optionally, .
[0056] The aforementioned composite reward function, by introducing QoE constraints and stability control mechanisms, enables reinforcement learning agents to dynamically balance multiple performance indicators, thereby achieving multi-service collaborative optimization.
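The four reward terms described above can be combined as in the following sketch. The reward equation itself is not reproduced in this text, so the linear weighted form (delay and bandwidth terms added, congestion and switching terms subtracted) is an assumption, as are the names `RewardWeights`, `EXAMPLE_WEIGHTS`, and `composite_reward`. The numeric weights are placeholders chosen for illustration and are not the values of the embodiment.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    delay: float
    bandwidth: float
    congestion: float
    switch: float


# Hypothetical weights per service type (placeholders, not the patent's values).
EXAMPLE_WEIGHTS = {
    "latency": RewardWeights(0.6, 0.1, 0.2, 0.1),
    "bandwidth": RewardWeights(0.1, 0.6, 0.2, 0.1),
    "normal": RewardWeights(0.3, 0.3, 0.2, 0.2),
}


def composite_reward(w, delay_est, delay_max, bw_remaining, bw_required,
                     congestion, congestion_threshold, action, prev_action):
    """Assumed form: weighted QoE terms minus congestion and switching penalties."""
    r_delay = (delay_max - delay_est) / delay_max      # delay reward term
    r_bw = bw_remaining / bw_required                  # bandwidth reward term
    p_cong = congestion / congestion_threshold         # congestion penalty term
    p_switch = 1.0 if action != prev_action else 0.0   # indicator of a path change
    return (w.delay * r_delay + w.bandwidth * r_bw
            - w.congestion * p_cong - w.switch * p_switch)
```

With these placeholder weights, keeping the previous next hop always scores higher than switching to a different node with otherwise identical observations, which is the stabilizing effect the switching penalty is meant to provide.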
[0057] S3. The optimal routing strategy is calculated using the Actor-Critic algorithm with Safety Shield constraints based on a multi-agent partially observable Markov decision process.
[0058] The calculation of the optimal routing strategy involves: constructing an Actor-Critic model based on Safety Shield constraints; training the Actor-Critic model based on Safety Shield constraints using a hierarchical federated cooperative routing training framework based on a multi-agent partially observable Markov decision process; obtaining a trained Actor-Critic model based on Safety Shield constraints; and obtaining the optimal routing strategy based on the trained Actor-Critic model based on Safety Shield constraints.
[0059] The Actor-Critic model includes an Actor network and a Critic network. As shown in Figure 5, this invention employs an Actor-Critic reinforcement learning framework for multi-agent cooperative routing optimization. Within this framework, to further improve state evaluation accuracy in highly dynamic networks, the value network (Critic) uses a Dueling DQN architecture. Training the Actor-Critic model based on Safety Shield constraints includes:
[0060] S31. Deploy an Actor-Critic model based on Safety Shield constraints as a local model on each satellite node i in the LEO satellite constellation layer; divide the satellite nodes into multiple clusters, and set an aggregation node for each cluster;
[0061] S32. Each satellite node i trains its own local model to obtain an updated local model, and then uploads the updated local model to the corresponding aggregation node.
[0062] The training process relies solely on data from the local experience pool and does not involve any cross-node data exchange, thus fully protecting data privacy.
[0063] As shown in Figure 4, training the local model by satellite node i includes:
[0064] S321. Obtain the local state of satellite node i at time t and input it into the Actor network of the local model, which outputs the policy distribution. The policy distribution gives the probability that satellite node i, in that local state, selects each candidate action, and is parameterized by the parameters of the Actor network of the local model of satellite node i; k is the index of a neighboring node of satellite node i at time t.
[0065] Satellite node i also needs to collect, at time t, the remaining connectivity time of the link between itself and the node corresponding to each candidate action.
[0066] Because the orbits of low-Earth orbit satellites are highly periodic and predictable, the SDN control center can calculate the end time of the current visibility window between any two nodes i and k by analyzing the two-line elements (TLE) and real-time ephemeris data of all satellite nodes in the network, combined with the system's preset minimum elevation angle constraint. The remaining link connectivity time at time t is then the difference between that end time and t.
[0067] S322. Input the local state and each candidate action of satellite node i into the Critic network of the local model to obtain the Q value of each candidate action, where k is the index of a neighboring node of satellite node i;
[0068] S323. Use the Safety Shield mechanism to correct the Q value of each candidate action, obtaining the corrected Q value of each candidate action, and select the optimal action according to the corrected Q values;
[0069] Correcting the Q value using the Safety Shield mechanism includes:
[0070] Generate a dynamic mask vector; the dynamic mask vector includes two layers of masking:
[0071] First layer (physical shielding): eliminates physically unreachable links and links with insufficient remaining connectivity time. Second layer (congestion shielding): borrowing from congestion-aware mechanisms, eliminates nodes for which the remaining bandwidth of the current link is less than the bandwidth required by the service, preventing data flows from being directed to an already congested link.
[0072] Dynamic mask vector Specifically:
[0073]
[0074] where the mask element for a candidate action is set to 1 when the following conditions hold simultaneously. First, the node corresponding to the candidate action belongs to the set of all physically reachable neighbor nodes of satellite node i at time t in the actual physical topology G, i.e., nodes that satisfy the connectivity constraints (line-of-sight visibility and an elevation angle greater than the minimum elevation angle). Second, the remaining connectivity time of the link between satellite node i and that node at time t satisfies the system's preset safety time threshold, which is the minimum link hold time needed to ensure complete transmission of the current service data stream and avoid link interruption during transmission. Third, the remaining bandwidth of the link between satellite node i and that node at time t satisfies the bandwidth required by the service.
[0075] The safety time threshold can be calculated adaptively and dynamically from the characteristics of the service flow, using the average burst data volume of the current service data stream, the bandwidth required by the service, and a reserved guard time interval (a value taken at random from its preferred range), thus achieving a trade-off between maximizing link utilization and preventing transmission interruption.
[0076] The process of statistically analyzing the average burst data volume of the current business data stream includes: using a lightweight measurement module based on the Sketch mechanism to perform real-time statistics on the data packets that continuously arrive at agent i in the current business data stream, obtaining the data packet sequence, calculating the average total number of bytes in the data packet sequence, and obtaining the average burst data volume.
[0077] This invention incorporates the remaining link time into the action constraint and mask generation mechanism, giving the agent foresight into network topology changes. Before a routing action is executed, the remaining connectivity lifetime of the link is strictly verified through dynamic masking, which forces the agent to actively avoid satellite nodes that are about to pass out of view or disconnect, thus ensuring transmission continuity in a highly dynamic environment at the level of the physical mechanism.
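The two-layer mask and the adaptive safety time threshold can be sketched as follows. The mask and threshold equations are not reproduced in this text, so two points are assumptions: the threshold is taken as the time needed to send one average burst at the required bandwidth plus the guard interval, and the comparisons are non-strict. The function names and units are hypothetical.

```python
def safety_time_threshold(avg_burst_bits, bw_required_bps, guard_s):
    """Assumed form: burst transmission time plus a reserved guard interval."""
    return avg_burst_bits / bw_required_bps + guard_s


def dynamic_mask(candidates, reachable, remaining_time_s, remaining_bw_bps,
                 bw_required_bps, t_threshold_s):
    """Return a 0/1 mask with one entry per candidate next hop."""
    mask = []
    for k in candidates:
        # Layer 1 (physical): reachable and the link lives long enough.
        physical_ok = (k in reachable
                       and remaining_time_s.get(k, 0.0) >= t_threshold_s)
        # Layer 2 (congestion): enough remaining bandwidth for the service.
        congestion_ok = remaining_bw_bps.get(k, 0.0) >= bw_required_bps
        mask.append(1 if physical_ok and congestion_ok else 0)
    return mask
```

A candidate is kept only if it passes both layers, so a node that is visible but about to leave the visibility window is masked out in the same way as an unreachable one.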
[0078] After the Safety Shield mechanism applies the validity mask to the Q value of each candidate action evaluated by the value network, the corrected Q value is obtained.
[0079] Selecting the optimal action based on the corrected Q values includes: computing the Softmax probability distribution over the corrected Q values of all candidate actions, and selecting the optimal action (i.e., the action with the highest probability) according to that distribution.
[0080] Softmax probability distribution:
[0081]
[0082] where j is the index of the candidate action, and the temperature parameter controls the degree of exploration. This parameter adjusts the randomness of the agent's action selection to achieve a dynamic balance between exploration and exploitation.
[0083] When the temperature is large, the probabilities of the safe candidate actions tend toward a uniform distribution, prompting the agent to explore more randomly and avoid becoming trapped prematurely in a locally optimal routing strategy in complex dynamic network topologies. When the temperature is small, the differences between the probabilities are significantly amplified, and the agent is more inclined to exploit the next hop with the highest current value estimate. In actual training, the temperature is usually decayed exponentially or linearly as the number of training rounds or time steps increases, so that the model can fully explore the unknown network environment in the early stage of training and converge stably to the optimal routing strategy in the later stage.
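The mask-based correction and temperature-controlled Softmax selection can be sketched as below. Since the correction formula is not reproduced in this text, the sketch assumes the common convention of assigning negative infinity to masked actions so that their Softmax probability is exactly zero. The exponential decay schedule and its parameters (`tau_start`, `tau_min`, `decay`) are hypothetical examples of the decay mentioned above.

```python
import math


def corrected_q(q_values, mask):
    """Assumed correction: masked-out actions receive a Q value of -inf."""
    return [q if m == 1 else -math.inf for q, m in zip(q_values, mask)]


def softmax_with_temperature(q_corr, tau):
    """Softmax over corrected Q values; masked actions get probability 0."""
    finite = [q for q in q_corr if q != -math.inf]
    if not finite:
        raise ValueError("no safe candidate action")
    q_max = max(finite)  # subtract the maximum for numerical stability
    exps = [math.exp((q - q_max) / tau) if q != -math.inf else 0.0
            for q in q_corr]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_temperature(step, tau_start=1.0, tau_min=0.05, decay=0.995):
    """Hypothetical exponential decay of the temperature with a lower bound."""
    return max(tau_min, tau_start * decay ** step)


def select_action(q_values, mask, tau):
    """Index of the highest-probability action after masking."""
    probs = softmax_with_temperature(corrected_q(q_values, mask), tau)
    return max(range(len(probs)), key=probs.__getitem__)
```

In this sketch an unsafe action can never be selected, however large its raw Q value, because its probability is zero at every temperature.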
[0084] S324. Construct the safe action set of satellite node i at time t, and use the Safety Shield mechanism to correct the optimal action according to the safe action set, obtaining the corrected action;
[0085] The actions in the safe action set are nodes that belong to the set of neighboring nodes of satellite node i at time t, do not appear in the node routing and forwarding table stored in the SDN control center, have a buffer queue occupancy rate less than the preset maximum buffer threshold, and have remaining power higher than the preset minimum safe energy threshold.
[0086] Building the safe action set based on the local state includes:
[0087] Step 1: Initialize the safe action set as an empty set, and obtain all physically reachable neighbor nodes of satellite node i to construct a candidate set of neighbor nodes;
[0088] Step 2: Perform a legality check on each candidate node in the candidate set of neighbor nodes; candidate nodes that pass the legality check are added to the safe action set.
[0089] Legality determination includes:
[0090] (1) Loop prevention determination: The node routing table of the SDN control center records the path of nodes already traversed by the current service flow. Check whether the candidate node already exists in this historical path list; if not, the anti-loop condition is met.
[0091] (2) Cache capacity determination: Obtain the current underlying buffer queue occupancy rate of the candidate node through a state-aware mechanism. If the occupancy rate is strictly less than the system's maximum cache threshold, the node is deemed to have enough space to accommodate new bursts of data, and the cache capacity requirement is met.
[0092] Optionally, the maximum cache threshold ranges from 70% to 90%, with an optimal value of 80%. This range can maintain high link utilization while reserving sufficient buffer space for micro-bursts in the network, effectively preventing hard packet loss caused by queue overflow.
[0093] (3) Node energy determination: Obtain the real-time remaining power of the candidate node. If the remaining power is higher than the minimum safe energy threshold for maintaining basic communication links and data forwarding, the energy requirement is met.
[0094] Optionally, the minimum safe energy threshold ranges from 15% to 30% of the node's total battery capacity, with an optimal value of 20%. This threshold setting ensures that highly dynamic heterogeneous nodes (such as energy-constrained drones or low-Earth orbit micro-nano satellites) can trigger self-protection mechanisms in a timely manner when their power is low (refusing to forward non-critical relay traffic), thus avoiding unexpected blind spots or routing black holes in the network physical topology due to the complete depletion of node energy.
[0095] If and only if candidate nodes If all three of the above conditions are met, its legality is deemed valid.
[0096] In one embodiment, if none of the candidate nodes meet the conditions at a certain moment, the source node or the previous hop node is used as the only security degradation action (i.e., triggering a traffic rollback or retransmission mechanism).
[0097] The set of security actions obtained through the above steps forcibly eliminates illegal actions that could lead to loops, buffer overflows, or energy depletion from both physical and resource perspectives.
[0098] The Safety Shield mechanism corrects the optimal action according to the set of safe actions, as follows: if the optimal action belongs to the set of safe actions, the corrected action is the optimal action; if the optimal action does not belong to the set of safe actions, the action with the highest probability under the policy distribution is chosen from the set of safe actions. The specific rule is as follows:
[0099]
[0100] where the candidate action ranges over the actions within the set of safe actions.
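The correction rule described above can be illustrated with a short sketch. The function name and the dictionary representation of the policy distribution are assumptions made for illustration only; the original symbols are not recoverable from this text.

```python
def shield_action(optimal_action, safe_actions, policy_probs):
    """Return the corrected action under the Safety Shield rule.

    policy_probs maps each candidate action to its probability under the
    Actor's policy distribution (an illustrative representation).
    """
    if not safe_actions:
        raise ValueError("the set of safe actions must not be empty")
    # Case 1: the optimal action is already safe, so it is kept unchanged.
    if optimal_action in safe_actions:
        return optimal_action
    # Case 2: otherwise pick the safe action the policy considers most likely.
    return max(safe_actions, key=lambda a: policy_probs.get(a, 0.0))
```

For example, with probabilities `{"a": 0.5, "b": 0.3, "c": 0.2}` and safe actions `["b", "c"]`, an unsafe optimal action `"a"` would be replaced by `"b"`.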
[0101] S325. Satellite node i executes the corrected action to obtain the next local state and the reward;
[0102] S326. An experience sample is constructed and stored in the local experience pool of the corresponding satellite node i; satellite node i draws experience samples from the local experience pool, trains its own local model on the collected samples, and obtains its updated local model.
[0103] Training of the local model by satellite node i based on the collected experience samples includes:
[0104] Step 1: Input the local state into the Actor network of the local model, which outputs the policy distribution. The policy distribution gives the probability that satellite node i, in its local state, selects each candidate action k, and is parameterized by the Actor network parameters of the local model of satellite node i;
[0105] Step 2: Input the local state, the next local state, and the corrected action into the Critic network of the local model to obtain the state value function and the advantage function;
[0106] Step 3: Calculate the objective function of the Actor network from the policy distribution and the advantage function; calculate the loss function of the Critic network from the reward and the state value function;
[0107] During the training of the Actor network, the objective function is:
[0108]
[0109] The corresponding gradient update direction is:
[0110]
[0111] The advantage function evaluates how good the currently selected physical next-hop action is relative to the average routing level under the current network topology connectivity state. It is multiplied by the gradient of the log-policy to guide the update of the Actor network parameters, which effectively reduces gradient estimation variance and improves training stability in highly dynamic network environments.
[0112] Based on the temporal-difference (TD) learning concept, a TD error is constructed as the driving signal for updating the value network parameters. Its expression is:
[0113]
[0114] By minimizing the mean squared loss function of the TD error, adaptive optimization of the value network parameters is achieved. The loss function is defined as follows:
[0115]
[0116] Step 4: Update the parameters of the Actor network according to the objective function of the Actor network, and update the parameters of the Critic network according to the loss function of the Critic network.
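Steps 1 to 4 can be illustrated with a minimal tabular sketch. Because the formulas are not recoverable from this text, the code assumes the standard one-step advantage Actor-Critic form: the TD error serves as the advantage estimate, the Actor ascends the log-policy weighted by the advantage, and the Critic descends the squared TD error. The discount factor, learning rates, and function names are hypothetical, and the Critic is simplified to a state-value estimate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(logits, v_s, v_next, action, reward,
                      gamma=0.9, actor_lr=0.1, critic_lr=0.1):
    """One update for a single state with tabular parameters.

    logits : Actor preferences for the candidate actions in the current state
    v_s    : Critic value estimate of the current local state
    v_next : Critic value estimate of the next local state
    Returns (new_logits, new_v_s, td_error, critic_loss).
    """
    # TD error, used here as the advantage estimate (an assumption).
    td_error = reward + gamma * v_next - v_s
    critic_loss = td_error ** 2

    # Gradient of log pi(action) w.r.t. softmax logits: one_hot(action) - pi.
    probs = softmax(logits)
    grad_log_pi = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

    # Actor: gradient ascent on log pi(action) * advantage.
    new_logits = [x + actor_lr * td_error * g for x, g in zip(logits, grad_log_pi)]
    # Critic: move V(s) toward the TD target (target held fixed).
    new_v_s = v_s + critic_lr * td_error
    return new_logits, new_v_s, td_error, critic_loss
```

A positive TD error raises the preference of the executed action and lowers the others, which is the variance-reducing role the text attributes to the advantage function.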
[0117] In one embodiment, the parameter updates of the local model are sparsified: only the important gradients with the top-ranked absolute values are uploaded, and they are compressed and transmitted using zero-padding, significantly reducing the size of the uploaded data.
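A sketch of this kind of top-ranked sparsification is given below. The keep ratio, the function names, and the (index, value) upload encoding are assumptions for illustration; the text does not specify how many entries are kept or the exact wire format.

```python
def sparsify_top_k(gradient, keep_ratio=0.1):
    """Keep the largest-magnitude entries of a flat gradient; zero the rest."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(gradient) * keep_ratio))
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(gradient)]


def to_sparse_payload(sparse_gradient):
    """Encode only the non-zero entries as (index, value) pairs for upload."""
    return [(i, g) for i, g in enumerate(sparse_gradient) if g != 0.0]
```

The receiver can rebuild the full-length vector by writing each value at its index and filling every other position with zero.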
[0118] S33. Each aggregation node r periodically receives the local models uploaded by each satellite node within the cluster, aggregates the local models uploaded by each satellite node within the cluster, and obtains the first aggregated model. The first aggregated model is then uploaded to the SDN control center.
[0119] The model synchronization period is dynamically adjusted based on network load and model convergence status.
[0120]
[0121] where the dynamic-adjustment sensitivity coefficient (or step-size factor) set by the system controls how much the synchronization period grows; its preset value range ensures that the synchronization period can be effectively lengthened as the model stabilizes, while avoiding system oscillations caused by abrupt period jumps from an excessively large step size. The gradient term is the gradient of the model loss function during the m-th round of global aggregation, and its norm (i.e., the absolute magnitude of the gradient) in the m-th round quantifies how much the current model training fluctuates: the smaller the norm, the closer the model is to convergence (i.e., the more stable it is). The model gradient norm in the initial stage (round 0) serves as the benchmark reference value for normalization, eliminating the influence of different initial network states and service scenarios on the gradient scale. The maximum allowed safe synchronization period prevents the period from growing indefinitely and causing severe policy divergence between nodes; it is set as a multiple of the initial synchronization period (measured in training steps or communication time slots), with a preset value range and optimal value. This upper limit both reduces inter-satellite communication overhead and prevents severe overfitting of individual node models in the highly dynamic integrated space-ground network scenario.
[0122] The specific calculation method is as follows: each satellite node i in the cluster calculates the loss gradient of its local value network from the temporal-difference error (TD error) and uploads the gradient along with the model parameters. The aggregation node obtains the global gradient as a weighted average of the local gradients of the nodes, weighted by the reputation weight of each satellite node within the cluster. In an engineering implementation, the parameter change of the global aggregated model between two consecutive rounds can also be used as an approximately equivalent substitute.
[0123] The above formula introduces a negative exponential decay term in the relative gradient. When the model fluctuates greatly in the early stages of training (the relative gradient is large), the exponential term approaches 0, and the system maintains high-frequency synchronization to ensure the correct convergence direction. As the model gradually stabilizes (the relative gradient decreases toward 0), the exponential term approaches 1 and the period multiplier increases. This mechanism drives the synchronization period to lengthen adaptively, minimizing the communication frequency and resource consumption of the satellite-to-ground link while preserving the convergence accuracy of the routing strategy.
[0124] When the model training stabilizes, the synchronization period is automatically extended to further reduce communication consumption.
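The adaptive behavior described above can be sketched as follows. The exact formula is not recoverable from this text, so the update below is only one form consistent with the description (a multiplier of one plus the sensitivity coefficient times a negative exponential of the relative gradient norm, capped at the maximum period); the function name and default values are hypothetical.

```python
import math


def next_sync_period(current_period, grad_norm, initial_grad_norm,
                     sensitivity=0.5, max_period=10.0):
    """Lengthen the synchronization period as the relative gradient shrinks.

    Assumed form: T_next = min(T * (1 + sensitivity * exp(-g_m / g_0)), T_max).
    """
    if initial_grad_norm <= 0.0:
        raise ValueError("initial_grad_norm must be positive")
    relative_gradient = grad_norm / initial_grad_norm
    # Small while training fluctuates strongly; approaches 1 as the gradient vanishes.
    decay = math.exp(-relative_gradient)
    multiplier = 1.0 + sensitivity * decay
    # Cap at the maximum allowed safe synchronization period.
    return min(current_period * multiplier, max_period)
```

Under this assumed form the period never shrinks, grows by at most a factor of one plus the sensitivity coefficient per round, and stops growing at the cap.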
[0125] The first aggregated model is:
[0126]
[0127] where the set in the formula is the set of satellite nodes contained in the cluster of aggregation node r, and the sample-size term is the size of the local experience sample set of satellite node i.
[0128] S34. The SDN control center receives the first aggregated model uploaded by each aggregation node and aggregates these first aggregated models to obtain the second aggregated model. The second aggregated model is then distributed to each aggregation node via the SDN control channel.
[0129]
[0130] where the terms in the formula are the number of clusters and the total size of the experience samples used in training within each cluster.
[0131] S35. Each aggregation node synchronizes the second aggregated model to the satellite nodes within its own cluster as the local model of each satellite node.
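The two-tier aggregation of steps S33 to S35 can be sketched as follows. Since the aggregation formulas are not recoverable from this text, the sketch assumes a sample-size-weighted average at both tiers, as suggested by the sample-size terms defined above; models are represented as flat parameter lists and all names are hypothetical.

```python
def weighted_average(models, weights):
    """Weighted average of flat parameter vectors."""
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    dim = len(models[0])
    return [sum(w * m[d] for m, w in zip(models, weights)) / total for d in range(dim)]


def hierarchical_aggregate(clusters):
    """Two-tier aggregation.

    clusters: list of clusters; each cluster is a list of
              (local_model, sample_count) pairs, one per satellite node.
    Returns (first_aggregated_models, second_aggregated_model).
    """
    first_models, cluster_sizes = [], []
    for cluster in clusters:
        models = [m for m, _ in cluster]
        counts = [n for _, n in cluster]
        # Tier 1: the aggregation node averages the local models of its cluster.
        first_models.append(weighted_average(models, counts))
        cluster_sizes.append(sum(counts))
    # Tier 2: the SDN control center averages the first aggregated models.
    second_model = weighted_average(first_models, cluster_sizes)
    return first_models, second_model
```

The second aggregated model returned here is what each aggregation node would then synchronize back to the satellite nodes of its cluster.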
[0132] S36. Determine whether the training stop condition has been met. If so, each satellite node obtains the trained local model; otherwise, return to step S32.
[0133] The specific training termination condition is that any of the following conditions are met:
[0134] (1) The current number of training episodes has reached the maximum iteration threshold preset by the system;
[0135] (2) In a series of training rounds, the average cumulative reward obtained by the agent’s output action tends to stabilize, and the fluctuation of the reward value is less than the set minimum convergence threshold (i.e., the model is judged to have reached the convergence state).
[0136] If any of the above training termination conditions are met, training will stop, and each satellite node will obtain the trained local model and use it for actual routing and forwarding operations.
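The two termination conditions can be checked with a small helper such as the one below. The window length, the threshold values, and the use of the max-min range as the measure of reward fluctuation are assumptions for illustration; the text only requires that the fluctuation fall below a set minimum convergence threshold.

```python
def should_stop(episode, reward_history, max_episodes=1000,
                window=20, convergence_threshold=0.01):
    """Return True when either training termination condition holds.

    reward_history holds the average cumulative reward of each past episode.
    """
    # (1) The maximum iteration threshold has been reached.
    if episode >= max_episodes:
        return True
    # (2) Reward fluctuation over the recent window is below the threshold.
    if len(reward_history) >= window:
        recent = reward_history[-window:]
        return max(recent) - min(recent) < convergence_threshold
    return False
```

Requiring a full window before testing convergence avoids stopping on the first few episodes, when too few rewards are available to judge stability.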
[0137] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A deep learning-based QoE-aware adaptive routing method for integrated space-ground networks, characterized in that it comprises: S1. constructing an SDN-based integrated space-ground network model, the SDN-based integrated space-ground network model comprising an SDN control center, an LEO satellite constellation layer, an HAP / UAV relay layer, and a ground user and gateway layer; S2. modeling the routing process of the SDN-based integrated space-ground network model as a multi-agent partially observable Markov decision process; S3. calculating the optimal routing strategy, based on the multi-agent partially observable Markov decision process, using an Actor-Critic algorithm with Safety Shield constraints.
2. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 1, characterized in that the multi-agent partially observable Markov decision process is defined by the following elements: the state transition probability function; the global reward function and the reward function of each agent i, where i is the index of the agent, i.e., the index of the satellite node in the LEO satellite constellation layer of the SDN-based integrated space-ground network model, and N is the number of satellite nodes in the LEO satellite constellation layer; the graph structure of the SDN-based integrated space-ground network model, where V is the set of nodes in the LEO satellite constellation layer, HAP / UAV relay layer, and ground user and gateway layer of the SDN-based integrated space-ground network model, and E is the set of edges in the SDN-based integrated space-ground network model; the state space and the local state observed by agent i; and the action space and the local action of agent i. At time t, the local action of agent i is selected from the set of neighboring nodes of agent i at time t, and the local state observed by agent i comprises: the estimated end-to-end delay of forwarding via each neighboring node; the remaining bandwidth of the physical direct link between agent i and each neighboring node; the congestion level indicator of each neighboring node; and a service flag that indicates a latency-sensitive service, a bandwidth-sensitive service, or a regular service.
3. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 2, characterized in that the reward function gives the instant reward received by agent i for taking its action in the current local state, and is composed of the following terms together with their weighting coefficients: a reward term based on the end-to-end latency estimate; a reward term based on the remaining bandwidth of the link; a penalty term based on the congestion level; and a path-switching penalty term. The path-switching penalty term uses an indicator function of the action chosen by agent i at time t and the action selected at time t-1, which takes the value 1 when its condition on these two actions holds and 0 otherwise.
4. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 3, characterized in that, hour, ; season ; season .
5. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 1, characterized in that, The calculation of the optimal routing strategy involves: constructing an Actor-Critic model based on Safety Shield constraints; training the Actor-Critic model based on Safety Shield constraints using a hierarchical federated cooperative routing training framework based on a multi-agent partially observable Markov decision process; obtaining a trained Actor-Critic model based on Safety Shield constraints; and obtaining the optimal routing strategy based on the trained Actor-Critic model based on Safety Shield constraints.
6. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 5, characterized in that training the Actor-Critic model based on Safety Shield constraints includes: S31. Deploy the Actor-Critic model based on Safety Shield constraints as a local model on each satellite node of the LEO satellite constellation layer; divide the satellite nodes into multiple clusters, and set an aggregation node for each cluster; S32. Each satellite node trains its own local model to obtain an updated local model, and then uploads the updated local model to the corresponding aggregation node; S33. Each aggregation node receives the local models uploaded by the satellite nodes in its cluster, aggregates them to obtain the first aggregated model, and uploads the first aggregated model to the SDN control center; S34. The SDN control center receives the first aggregated model uploaded by each aggregation node, aggregates these first aggregated models to obtain the second aggregated model, and distributes the second aggregated model to each aggregation node; S35. Each aggregation node synchronizes the second aggregated model to the satellite nodes within its own cluster as the local model of each satellite node; S36. Determine whether the training stop condition has been met. If so, each satellite node obtains the trained local model; otherwise, return to step S32.
7. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 6, characterized in that training of the local model by satellite node i includes: S321. Obtain the local state of satellite node i at time t, input the local state into the Actor network of the local model, and output the policy distribution; S322. Input the local state and each candidate action of satellite node i into the Critic network of the local model to obtain the Q value of each candidate action, where k is the index of the neighboring nodes of satellite node i at time t; S323. Use the Safety Shield mechanism to correct the Q value of each candidate action to obtain the corrected Q value of each candidate action, and choose the optimal action according to the corrected Q values; S324. Construct the set of safe actions of satellite node i at time t, and use the Safety Shield mechanism to correct the optimal action according to the set of safe actions, obtaining the corrected action; S325. Satellite node i executes the corrected action to obtain the next local state and the reward; S326. An experience sample is constructed and stored in the local experience pool of the corresponding satellite node i; satellite node i draws experience samples from the local experience pool, trains its own local model on the collected samples, and obtains its updated local model.
8. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 7, characterized in that correcting the Q value using the Safety Shield mechanism includes: ; ; where the symbols denote, respectively: the dynamic mask vector; the set of neighboring nodes of satellite node i at time t; the remaining connectivity time, at time t, of the link between satellite node i and the node corresponding to the candidate action; the preset safety time threshold; the remaining bandwidth, at time t, of the link between satellite node i and the node corresponding to the candidate action; and the bandwidth required by the service.
9. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 7, characterized in that each action in the set of safe actions is a node that exists in the neighbor node set of satellite node i at time t, does not exist in the node routing and forwarding table stored in the SDN control center, has a buffer queue occupancy rate less than the preset maximum buffer threshold, and has remaining power higher than the preset minimum safe energy threshold; and correcting the optimal action using the Safety Shield mechanism according to the set of safe actions includes: if the optimal action belongs to the set of safe actions, the corrected action is the optimal action; if the optimal action does not belong to the set of safe actions, the action with the highest probability under the policy distribution is chosen from the set of safe actions.
10. The deep learning-based QoE-aware adaptive routing method for integrated space-ground networks according to claim 7, characterized in that training of the local model by satellite node i based on the collected experience samples includes: Step 1: Input the local state into the Actor network of the local model and output the policy distribution; Step 2: Input the local state, the next local state, and the corrected action into the Critic network of the local model to obtain the state value function and the advantage function; Step 3: Calculate the objective function of the Actor network from the policy distribution and the advantage function; calculate the loss function of the Critic network from the reward and the state value function; Step 4: Update the parameters of the Actor network according to the objective function of the Actor network, and update the parameters of the Critic network according to the loss function of the Critic network.