A Multi-Party Cooperative Satellite Access and Anti-Interference Method Based on Deep Reinforcement Learning

By employing a multi-body collaborative approach using deep reinforcement learning, combined with the Actor-Critic algorithm and partially connected neural networks, the problems of low access efficiency and insufficient anti-interference capability of satellite networks in ultra-dense networks are solved, achieving more efficient satellite access and anti-interference effects.

CN117715054B · Active · Publication Date: 2026-06-30 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2023-12-05
Publication Date
2026-06-30


Abstract

This invention relates to a multi-agent cooperative satellite access and anti-jamming method based on deep reinforcement learning, belonging to the field of satellite communication. It utilizes the Actor-Critic offline learning method in deep reinforcement learning to build a partially connected neural network. The target network is used to softly update the neural network parameters, improving decision-making performance during adversarial processes and better adapting to changes in the electromagnetic environment. In environmental modeling and reinforcement learning state modeling, the actions from the previous time step are incorporated into the state, and combined with reward determination, different actions are output within consecutive time slots, making intelligent access more flexible and variable, and improving the anti-jamming capability of access. Using GPU computing networks and offline policy reinforcement learning methods, sample collection and training can be performed and effective intelligent access can be achieved even in the absence of training samples and prior data. This invention is applicable to the field of satellite communication, improving anti-jamming capabilities while ensuring user access accuracy.

Description

Technical Field

[0001] This invention relates to a multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, belonging to the field of satellite communication.

Background Technology

[0002] Massive satellite network access refers to providing high-speed internet access services to users across a wide area using satellite technology. This concept arose from the urgent need for widespread global internet access, particularly in regions with challenging geographical conditions and weak infrastructure. Developing satellite communication systems is a crucial step towards securing a leading position in the development of space information networks; it can promote the industrialization of services such as navigation enhancement, wide-area surveillance, and data acquisition and distribution; and it is also a vital measure to drive the comprehensive development of commercial aerospace and lead the upgrading of the information industry and aerospace technology.

[0003] However, the first challenge in large-scale satellite network access is that current random access protocols perform poorly in ultra-dense networks, necessitating efficient access schemes capable of handling a large number of requests. Another challenge is interference attacks. To address access congestion, some enhanced random access schemes include priority-based, packet-based, and code-extended random access. Some research has also considered coded random access and sparse code multiple access. However, these schemes require centralized scheduling mechanisms, which are unavailable in wide-area satellite access scenarios due to large propagation delays and a large number of users. To resist interference attacks, common techniques include direct sequence spread spectrum and frequency hopping spread spectrum, as well as multi-beam antennas and adaptive anti-interference routing.

[0004] However, much work focuses on random access mechanisms to improve success rates, and some even aim to achieve anti-jamming capabilities. Due to their high openness, satellites are vulnerable to jamming attacks. In malicious jamming environments, jammers degrade channel quality by sending interfering signals, leading to access failures. Furthermore, when a device is inaccessible, it continuously attempts to retransmit, causing rapid battery discharge and exacerbating channel congestion. Therefore, an advanced random access scheme is needed to support the large-scale operation of satellite networks under jamming attacks. Traditional anti-jamming methods are inadequate against intelligent jamming that adjusts its jamming strategy based on user behavior.

[0005] Therefore, for multi-body cooperative intelligent satellite access and anti-interference problems in complex electromagnetic environments, we must not only refer to traditional access and anti-interference methods, but also face intelligent interference with constantly changing strategies. We must combine deep reinforcement learning algorithms to continuously observe changes in the electromagnetic environment, learn the changing patterns of interference, and thus better improve the efficiency of multi-satellite access.

Summary of the Invention

[0006] To address the shortcomings of existing satellite access technologies in terms of insufficient anti-interference capabilities and poor environmental adaptability, the main objective of this invention is to propose a multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning. This method employs the Actor-Critic algorithm from deep reinforcement learning, combining deep neural networks with traditional Q-learning reinforcement learning. In a satellite-ground cooperative access environment, it sets up both traditional human-induced interference and intelligent interference modes, while considering resource allocation for transmission latency and transmission power. This improves anti-interference capabilities while ensuring user access accuracy.

[0007] The objective of this invention is achieved through the following technical solution:

[0008] The multi-agent cooperative electromagnetic interference suppression method based on deep reinforcement learning disclosed in this invention first constructs a complex electromagnetic environment for multi-agent reinforcement learning, which includes transmission delay, signal fading, and noise interference; the transmission channel is a time-varying channel with Markov properties; a partially connected neural network is used, which can simultaneously output two actions: channel selection and power allocation; a denser reward method is used to evaluate the quality of the action; agents cannot choose the same channel in consecutive time slots to increase the variability of their decisions; and through multiple rounds of iteration, the access capability and interference suppression capability are continuously improved.

[0009] The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning disclosed in this invention includes the following steps:

[0010] Step 1: Construct a complex electromagnetic environment for multiple agents;

[0011] An integrated air-space network is constructed, in which N intelligent users transmit information to satellites, M satellites receive information, and one traditional jammer and one intelligent jammer are deployed. Both types of jammers have an equal chance to access the channel using limited power. If the jammer and the user select the same channel in the same time slot, the user's transmission fails and the jamming succeeds; otherwise, the user's transmission succeeds and the jamming is successfully avoided. Furthermore, the jamming trajectory of the jammer is only partially observable.

[0012] Step 2: In the electromagnetic environment of Step 1, construct a three-dimensional coordinate system in Cartesian space and an Actor-Critic neural network;

[0013] Large-scale fading is modeled based on the LoS probability model, which handles line-of-sight (LOS) effects with shadowing and blocking. In the LoS probability model, large-scale fading follows a generalized Bernoulli distribution of two distinct events; the channel is either LoS or non-LoS (NLoS) with a certain probability. Since it is a satellite access model, only the LoS channel is considered; therefore, the large-scale fading between satellite m and user n is expressed as:

[0014]

[0015] Large-scale fading between satellite m and jammer j is represented as:

[0016]

[0017] Where β0 is the average power gain at the reference distance d0 = 1, l denotes a position vector in the three-dimensional spatial coordinate system (with one such vector each for the satellite, the user, and the jammer), and α is the path loss exponent.
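The equation images for the large-scale fading terms are not present in this text. As a rough sketch only, the quantities defined in this paragraph (reference gain β0 at d0 = 1, position vectors, path loss exponent α) are consistent with the common distance-based path loss form β0·‖l_a − l_b‖^(−α). That form, the function name, and the coordinates below are illustrative assumptions, not values or formulas confirmed by the patent.

```python
import math

def large_scale_fading(beta0, pos_a, pos_b, alpha):
    """Distance-based path loss: beta0 * ||pos_a - pos_b|| ** (-alpha).

    Assumed form; the original equation is not available in this text.
    """
    distance = math.dist(pos_a, pos_b)  # Euclidean distance in 3-D
    if distance <= 0.0:
        raise ValueError("positions must be distinct")
    return beta0 * distance ** (-alpha)

# Hypothetical geometry in arbitrary units (not from the patent).
satellite = (0.0, 0.0, 500.0)
user = (30.0, 40.0, 0.0)
print(large_scale_fading(beta0=1.0, pos_a=satellite, pos_b=user, alpha=2.0))
```

The same helper would apply to the satellite-jammer link by passing the jammer's position instead of the user's.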

[0018] The channel gain between satellite m and user n is expressed as:

[0019]

[0020] The channel gain between satellite m and jammer j is expressed as:

[0021]

[0022] Here, the small-scale fading terms of the satellite-user link and the satellite-jammer link at time t each follow a Rician distribution.

[0023] The jammer has a transmit power on the k-th channel of the m-th satellite, and user n has transmit power p_n(t). Therefore, the channel capacity of user n on the k-th channel of the m-th satellite is:

[0024]

[0025]

[0026] Where W is the channel bandwidth and the noise term represents the Gaussian noise power. The transmit powers of the user and the jammer satisfy per-slot power constraints for all t∈T, where P_tot is the maximum power that each user is allowed to use in a time slot, and the two remaining bounds are the maximum powers that the user and the enemy jammer can use, respectively.
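The capacity equation itself is missing from this text. The sketch below assumes it takes the usual Shannon form, with the jammer's received power treated as additive interference on top of the Gaussian noise. The function name, argument names, and numbers are hypothetical and serve only to illustrate how W, the transmit powers, the channel gains, and the noise power would combine.

```python
import math

def channel_capacity(bandwidth, p_user, gain_user, p_jammer, gain_jammer, noise_power):
    """Shannon-type capacity with the jammer treated as additive interference.

    Assumed form:
        W * log2(1 + p_user*gain_user / (noise_power + p_jammer*gain_jammer))
    """
    sinr = (p_user * gain_user) / (noise_power + p_jammer * gain_jammer)
    return bandwidth * math.log2(1.0 + sinr)

# Illustrative numbers only (not from the patent).
clear = channel_capacity(1e6, 1.0, 1e-3, 0.0, 1e-3, 1e-4)
jammed = channel_capacity(1e6, 1.0, 1e-3, 2.0, 1e-3, 1e-4)
print(clear > jammed)  # jamming on the same channel lowers the achievable rate
```

Under this reading, raising the user's power on a jammed channel can still lift the rate above a success threshold, which is the situation the reward design in Step 5 distinguishes.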

[0027] Step 3: Based on the position coordinates in the three-dimensional spatial coordinate system obtained in Step 2, the acquired spectrum information is digitized to obtain the input state of the neural network;

[0028] To ensure stable user access, it is necessary to obtain the spectrum occupancy status of consecutive time slots. Available, unoccupied channels are marked as 1, and unavailable, occupied channels are marked as -1. The spectrum occupancy status of consecutive time slots is used as input to a neural network. Let $b_m(t)$ represent the observed channel conditions: $b_{m,k}(t) = 1$ indicates that a user attempted to access the satellite and succeeded; $b_{m,k}(t) = -1$ indicates that a user attempted to access the satellite but failed; $b_{m,k}(t) = 0$ indicates that no user accessed the satellite.

[0029]

[0030] Here, F indicates that user n accesses the k-th channel of satellite m at time t, and u represents the user's occupancy status; $u_n(t)$ takes the value 0 or 1, corresponding to the two possible cases in which a user fails to access the satellite.

[0031] Define $b_m(t) = [b_{m,1}(t), b_{m,2}(t), \ldots, b_{m,k}(t)]$ and $B(t) = [b_1(t), b_2(t), \ldots, b_M(t)]^T$, which together indicate the occupancy status of the channels.
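The encoding described in Step 3 can be sketched as follows. This is a minimal illustration of the 1 / -1 / 0 marking and of stacking the per-satellite vectors into B(t); the helper names and the `(attempted, succeeded)` tuple layout are assumptions made for the example.

```python
def encode_channel(attempted, succeeded):
    """Return 1 for a successful access, -1 for a failed attempt, 0 if unused."""
    if not attempted:
        return 0
    return 1 if succeeded else -1

def occupancy_vector(outcomes):
    """Build b_m(t) from a list of (attempted, succeeded) pairs, one per channel."""
    return [encode_channel(attempted, succeeded) for attempted, succeeded in outcomes]

def occupancy_matrix(per_satellite_outcomes):
    """Build B(t): one row b_m(t) per satellite."""
    return [occupancy_vector(outcomes) for outcomes in per_satellite_outcomes]

# Two satellites with three channels each (hypothetical observations).
observations = [
    [(True, True), (True, False), (False, False)],
    [(False, False), (True, True), (True, True)],
]
print(occupancy_matrix(observations))
```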

[0032] Step 4: After obtaining the state input from Step 3, input it into the Actor network and the Critic network respectively. The agent will then take two actions to counter the jammer: selecting the appropriate channel and power.

[0033] The target gradients optimized by the two neural networks are as follows:

[0034]

[0035] Where ω is the value parameter, θ is the network parameter, $Q_\omega(s_t, a_t)$ represents the Q-value of the current action, $\pi_\theta(a_t \mid s_t)$ represents the policy at the current time step, and $E_\pi$ represents the expectation under the policy.

[0036] Using the mean squared error loss function:

[0037]

[0038] Where r is the reward for the action, γ is the decay factor, $V_w(s_t)$ represents the state-value function at the current moment, and $V(s_{t+1})$ represents the state-value function at the next moment.

[0039] The Actor network selects a suitable channel from the user's action space and simultaneously chooses the transmission power from the power action space; together these form the output action $A_t$. At time t, the Critic network calculates the state value $V(s_{t+1})$ and then outputs the temporal-difference (TD) error to evaluate the quality of the action. The TD error is calculated as follows:

[0040] $\text{TD-Error} = Q(s,a) - V_w(s_t)$

[0041] $= r + \gamma V(s_{t+1}) - V_w(s_t)$ (10)

[0042] Where Q(s,a) is the Q value at the current time.
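As a small worked example of equation (10), the sketch below computes the TD error for one transition and a mean squared error over a batch of such errors. The function names and the sample values are illustrative assumptions; γ = 0.95 is the value listed in the embodiment's parameter settings.

```python
def td_error(reward, gamma, v_next, v_current):
    """TD error: r + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_current

def critic_mse_loss(td_errors):
    """Mean squared TD error over a batch of transitions."""
    return sum(delta ** 2 for delta in td_errors) / len(td_errors)

# Hypothetical transition: reward 3, next-state value 2.0, current value 4.0.
delta = td_error(reward=3.0, gamma=0.95, v_next=2.0, v_current=4.0)
print(delta)                      # positive: the action did better than predicted
print(critic_mse_loss([delta, -0.5]))
```

A positive TD error means the observed return exceeded the Critic's estimate, so the Actor update would raise the probability of that action.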

[0043] Step 5: Input the state from Step 3 and the action from Step 4 into the environment to interact and receive rewards from the environment.

[0044] The interaction between the agent's actions, the jammer's actions, and the environment results in the following situations, with a maximum reward of 3 and a minimum reward of -1 for each step. In the following, $C_{n,m,k}(t)$ represents the transmission rate of the k-th channel of satellite m selected by user n at time t, and $C_{threshold}$ represents the threshold rate for a successful transmission.

[0045] ① Select the correct channel and correct power

[0046] In this scenario, the intelligent user selects a channel not occupied by other users and correctly avoids interference from an enemy jammer. In this case, the intelligent user receives a reward of r = 3. If multiple intelligent users select the same channel, the reward decreases to r = 2. If interference is received but the power is sufficient to reach the transmission rate, that is, $C_{n,m,k}(t) > C_{threshold}$, then r = 2; if multiple intelligent users choose the same channel, the reward is reduced to r = 1.

[0047] ② Choosing the correct channel but the wrong power

[0048] In this situation, the intelligent user selects a channel that is not occupied by other users, but it is subject to interference from the enemy, and the power is insufficient to reach the transmission rate, that is, $C_{n,m,k}(t) < C_{threshold}$. The reward is then r = 1; if multiple intelligent users choose the same channel, the reward is r = 0.

[0049] ③ Selecting the wrong channel

[0050] In this case, if the intelligent user chooses the same channel as other users and is not interfered with by an enemy jammer, the reward is r = 0. If it is interfered with by a jammer, r = -1.
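The three reward cases above can be collected into one function. The sketch below is one reading of the text: it assumes "other users" means non-cooperating users already occupying a channel, while "multiple intelligent users" means cooperating agents colliding with each other, and it assumes the agent-collision penalty is not applied in case ③ because the text does not mention it there. All names are hypothetical.

```python
def step_reward(shares_with_other_users, jammed, rate_above_threshold, shares_with_agents):
    """Per-step reward in the range -1..3 following cases 1-3 of Step 5."""
    if shares_with_other_users:
        # Case 3: wrong channel (already occupied by another user).
        return -1 if jammed else 0
    if not jammed:
        base = 3  # Case 1: free channel, jammer avoided.
    elif rate_above_threshold:
        base = 2  # Case 1: jammed, but power keeps the rate above the threshold.
    else:
        base = 1  # Case 2: jammed and power too low to reach the threshold.
    # Cooperating agents that pick the same channel each lose one point.
    return base - 1 if shares_with_agents else base

print(step_reward(False, False, True, False))  # best case
print(step_reward(True, True, False, False))   # worst case
```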

[0051] Step 6: The intermediate state from Step 3, the action from Step 4, and the reward from Step 5 are all input into the Actor and Critic networks for experience learning, optimizing and updating network parameters to achieve the satellite's anti-interference effect.

[0052] First, the Critic network learns by observing the environment and obtaining the state $s_{t+1}$ at the next time step. Through network computation, the value V of the next state is obtained, and the TD error is calculated from the Bellman equation:

[0053] $\text{TD-Error} = Q(s,a) - V(s) = r + \gamma V(s') - V(s)$ (11)

[0054] After the TD error is calculated, it is fed into the Actor network along with the state S and action A for learning. The TD error serves as the weight during Actor updates. The Critic network therefore does not need to estimate Q, only V; the TD error, which plays the role of the advantage function, can then be calculated. The TD error is minimized and its expectation is then taken. The policy gradient to be calculated at this point is:

[0055]

[0056] The state-value function V is used as a baseline, $\pi_\theta$ is the current policy, and γ is the decay factor; together they give the update weights, where $A^{\pi,\gamma}(s_t, a_t)$ is the advantage function, $A^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - V^{\pi,\gamma}(s)$.

[0057] $Q^{\pi,\gamma}(s,a)$ is the Q-value under the decay factor, and $V^{\pi,\gamma}(s)$ is the state-value function under the decay factor.
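The policy-gradient expression for this step is not present in this text. For orientation only, the standard baseline-corrected form that matches the symbols defined here (policy π_θ, decay factor γ, function A as Q minus V) is shown below; whether the patent's equation has exactly this form is an assumption.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      A^{\pi,\gamma}(s_t, a_t) \right],
\qquad
A^{\pi,\gamma}(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)
```

The approximation on the right is the TD error of equation (11), which is why the Critic only has to estimate V.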

[0058] The reward value can be used as a loss function to train the agent's decision direction. The larger the reward value, the smaller the difference from the expected Q value, and the better the training effect. The smaller the reward value, the larger the loss value, indicating poor action selection. If the reward value is very small or negative within consecutive time slots, other actions will be explored to update the policy gradient in order to find a better policy to improve the agent's anti-interference ability, until the optimal policy is found so that the reward value of the agent in the environmental feedback is close to the maximum reward and converges stably, thus achieving the satellite access and anti-interference effect.

[0059] Beneficial effects:

[0060] 1. This invention discloses a multi-agent cooperative electromagnetic interference suppression method based on deep reinforcement learning, utilizing the Actor-Critic offline learning algorithm in deep reinforcement learning. A partially connected neural network is constructed, and parameters are trained on each network branch separately. Therefore, the agent can perform two actions: accessing the channel and allocating power. Compared with other intelligent access technologies, this invention has a greater advantage in resource allocation. Furthermore, a target network is used to softly update the neural network parameters, improving decision-making performance during adversarial processes and better adapting to changes in the electromagnetic environment.

[0061] 2. The multi-agent cooperative electromagnetic interference suppression method based on deep reinforcement learning disclosed in this invention not only treats continuous spectrum information as state input in environmental modeling and reinforcement learning state modeling, but also incorporates the action from the previous time step into the state. Combined with reward determination, it can output different actions in continuous time slots, making intelligent access more flexible and variable, and its strategy more difficult for intelligent interference to learn. It improves the anti-interference capability of access while ensuring its own channel resource access.

[0062] 3. The multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed in this invention uses a GPU computing network and an offline policy reinforcement learning algorithm. It can perform sample collection and training and effective intelligent access even in the absence of training samples and prior data. At the same time, it simulates a more complex and realistic electromagnetic confrontation environment. Compared with other intelligent anti-interference methods, this invention also takes into account factors such as transmission delay, signal fading, power resource constraints, and partial observability of the environment, thereby improving the training speed and generalization ability.

Attached Figure Description

[0063] Figure 1 is a flowchart of the multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed in this invention;

[0064] Figure 2 shows the partially connected neural network constructed in this embodiment;

[0065] Figure 3 is a schematic diagram of random access in a satellite network under malicious interference attacks in this embodiment;

[0066] Figure 4 shows the access performance against conventional enemy jammers in this embodiment;

[0067] Figure 5 shows the access performance against enemy intelligent jammers in this embodiment.

Detailed Implementation

[0068] The present invention will now be described in detail with reference to the accompanying drawings and embodiments. The technical problems solved by the present invention and its beneficial effects are also described. It should be noted that the described embodiments are only intended to facilitate understanding of the present invention and do not constitute any limitation thereof.

[0069] This embodiment discloses a multi-agent cooperative electromagnetic interference mitigation method based on deep reinforcement learning, applied to the TensorFlow 2.0 and Tensorlayer 2.0 frameworks. It employs the Actor-Critic algorithm of deep reinforcement learning, replacing the traditional Q-table calculation method with a partially connected neural network. Based on an optimized model, it bridges the processed spectral information with electromagnetic environment observations, then feeds the bridged state into the network to calculate the output action and its Q-value. Finally, the state, action, reward, and Q-value of the entire process are used for empirical training to update the parameters. Specific parameter settings are shown in the table below:

[0070]
Parameter: Value
Attenuation factor γ: 0.95
Actor learning rate: 0.0001
Critic learning rate: 0.001
Number of channels: 16
Training batch: 300
Time slots per round: 109
Greed factor ε: 0.15
Optimizer: Adam

[0071] As shown in Figure 1, the method includes the following steps:

[0072] Step 101: Build the partially connected neural network for the Actor part of the Actor-Critic algorithm.

[0073] It employs a 3-layer partially connected Actor neural network. The first layer is a common layer containing 256 neurons; the second layer is a batch normalization layer, which improves the convergence speed of the neural network and enhances its generalization ability while preventing overfitting. Its normalization form is as follows:

[0074]

[0075]

[0076]

[0077] $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$ (16)

[0078] Where m is the number of samples, $x_i$ are the sample values, $\mu_B$ is the sample mean, followed by the sample variance, and γ and β are the introduced scaling and shift factors, respectively. ε is a manually set parameter that prevents the denominator from being zero. $\mathrm{BN}_{\gamma,\beta}(x_i)$ represents the sample data after passing through the BN layer, and $y_i$ is the output corresponding to the sample.

[0079] The procedure first takes the m sample values $x_i$, computes the sample mean $\mu_B$, then computes the sample variance, and standardizes the sample data to obtain the normalized sample values. Two trainable parameters are introduced, the scaling factor γ and the shift factor β; training them allows the network to recover the feature distribution that the original network needed to learn. Normalization accelerates network convergence. Because the mean and standard deviation depend on the batch, they can be regarded as injected noise, which helps prevent overfitting.
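The normalization steps described above (mean, variance, standardize, then scale and shift) correspond to the standard training-time batch normalization transform. A minimal pure-Python sketch over a one-dimensional batch follows; the default values for `gamma`, `beta`, and `eps` are assumptions for illustration, since in the network γ and β are learned.

```python
import math

def batch_norm(samples, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then apply the scale gamma and shift beta."""
    m = len(samples)
    mean = sum(samples) / m                               # sample mean
    variance = sum((x - mean) ** 2 for x in samples) / m  # (biased) sample variance
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in samples]
    return [gamma * x_hat + beta for x_hat in normalized]  # y_i = gamma * x_hat_i + beta

print(batch_norm([1.0, 2.0, 3.0]))
print(batch_norm([1.0, 2.0, 3.0], gamma=2.0, beta=1.0))
```

With the default γ = 1 and β = 0 the output has approximately zero mean and unit variance; other values of γ and β rescale and recenter that distribution.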

[0080] The third layer consists of two partially connected branches, each containing 128 neurons. Its network structure is shown in Figure 2: the first branch outputs the selection action in the channel selection action space, and the second branch outputs the power allocation action in the power magnitude action space. Since the two actions have different purposes and the directions of gradient calculation are also different, two different sets of network parameters need to be trained separately.

[0081] Furthermore, both branches can use the same Adam optimizer and the same learning rate of 0.001.
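To make the branching structure concrete, the following is a dependency-free sketch of a forward pass through a shared 256-unit layer followed by two separate 128-unit branches, one producing channel probabilities and one producing power-level probabilities. It is not the embodiment's TensorFlow/TensorLayer code: the batch normalization layer is omitted, the ReLU activations, the weight initialization, and the number of power levels (4) are assumptions, and the class and function names are hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer; weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class TwoBranchActor:
    """Shared layer feeding two branches with separate parameters."""

    def __init__(self, n_state, n_channels, n_power_levels, shared=256, branch=128, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(n_state, shared, rng)
        self.channel_hidden = init_layer(shared, branch, rng)
        self.channel_out = init_layer(branch, n_channels, rng)
        self.power_hidden = init_layer(shared, branch, rng)
        self.power_out = init_layer(branch, n_power_levels, rng)

    def forward(self, state):
        h = relu(dense(state, *self.shared))
        channel_h = relu(dense(h, *self.channel_hidden))
        power_h = relu(dense(h, *self.power_hidden))
        channel_probs = softmax(dense(channel_h, *self.channel_out))
        power_probs = softmax(dense(power_h, *self.power_out))
        return channel_probs, power_probs

# Flattened 16*9 state as in Step 104; 16 channels; 4 power levels (assumed).
actor = TwoBranchActor(n_state=16 * 9, n_channels=16, n_power_levels=4)
channel_probs, power_probs = actor.forward([0.0] * (16 * 9))
print(len(channel_probs), len(power_probs))
```

Because the two branches hold separate weights after the shared layer, gradients for channel selection and power allocation update different parameter sets, which is the property the text attributes to the partially connected design.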

[0082] Step 102: Construct a fully connected Critic network

[0083] The Critic part of this algorithm is similar to the Actor part, with 256 neurons in the first layer, followed by a BN layer, and then the output layer. However, the final output of the network has only one neuron, whose output value is the temporal-difference error (TD-Error), specifically a backward TD error, used to evaluate the quality of the Actor network's output action.

[0084] The TD error is calculated in the following form:

[0085] $\text{TD-Error} = r + \gamma V(s') - V(s)$ (17)

[0086] We also chose to use the Adam optimizer with a learning rate of 0.01.

[0087] Step 103: Configure other parameters required for network training. Set the learning rate, batch size, weight initialization method, weight decay coefficient, optimization method, number of iterations, epoch size, and decay factor.

[0088] Step 104: After initializing the network parameters and state, at the beginning of each round and in each time slot, observe the spectrum occupancy in the environment, marking available channels as 1, unavailable channels as -1, and 0 indicating uncertain occupancy due to transmission distance delay. The final result is a 16*9 matrix used as the input to the neural network. The state S at a certain moment can be represented as:

[0089]

[0090] The occupancy status of other users follows the Markov property over time, that is:

[0091]

[0092] Step 105: Input the state matrix from Step 104 into the partially connected Actor network from Step 101, which outputs actions Ac and Ap in the channel and power action spaces, respectively. During action selection, an ε-greedy strategy is adopted in order to collect more samples and explore the environment, that is:

[0093]

[0094] Since channel selection and power allocation can be viewed as a classification problem, this method uses the softmax activation function. Furthermore, this method constructs a discrete model, and after selecting the action output by the Actor neural network, a stochastic logistic regression is performed.
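The ε-greedy expression is not present in this text. The sketch below shows the conventional rule (explore uniformly with probability ε, otherwise take the most probable action) together with sampling an action from the softmax output, which is one plausible reading of the stochastic selection step mentioned above. Both functions are illustrative assumptions; ε = 0.15 is the greed factor from the embodiment's parameter settings.

```python
import random

def epsilon_greedy(action_probs, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_probs))
    return max(range(len(action_probs)), key=lambda a: action_probs[a])

def sample_action(action_probs, rng=random):
    """Draw an action index from the categorical distribution given by softmax."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

channel_probs = [0.1, 0.7, 0.2]  # hypothetical softmax output for three channels
print(epsilon_greedy(channel_probs, epsilon=0.15))
print(sample_action(channel_probs))
```

The same two helpers would be applied independently to the channel branch and the power branch to obtain Ac and Ap.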

[0095] Step 106: Send the state S from Step 104 and the actions Ac and Ap from Step 105 into the environment, which contains the jammer's jamming behavior. Figure 3 shows a round-based random jamming mode, in which 16 different channels are randomly attacked within four consecutive time slots. Under jamming, the agent receives a reward R. To maximize network throughput, if multiple intelligent users choose the same channel in the same time slot, their R is reduced by 1. Furthermore, if an agent chooses the same channel for two consecutive time slots, its R is reduced by 1. This flexible decision-making not only ensures effective channel access but also helps to disrupt the jamming decisions of the intelligent jammer, causing it to learn in the wrong direction.

[0096] Step 107: Input the state S from Step 104, the actions Ac and Ap from Step 105, and the reward R from Step 106 into the Critic network, which learns and outputs the temporal-difference error (TD-Error) calculated as in (17).

[0097] Step 108: The state, action, reward, and TD error from Step 107 are then fed back into the Actor network for training. The gradients that the Actor network needs to calculate at this point are:

[0098]

[0099] Where $A^{\pi,\gamma}(s_t, a_t)$ is the advantage function, $A^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - V^{\pi,\gamma}(s)$. The mean squared error loss function is then used to improve decision-making:

[0100]

[0101] The quality of a decision is reflected by the reward $r_t$ at each step and the round reward $R_{tot}$:

[0102]

[0103] Step 109: Repeat steps 104 to 106, and update the neural network parameters every 20 steps to maintain the stability of the Actor and Critic networks.

[0104] The reward R obtained by this method is compared with that of the random case in Figure 4. The system runs for 100 rounds with 100 time slots per round, resulting in a maximum reward of 300 per round. The solid line represents the cooperative intelligent access method based on the deep reinforcement learning Actor-Critic algorithm, while the dashed line represents the random decision-making access method. Both use channels whose transition probabilities have the Markov property over time, and both are subject to round-based random interference. As can be seen from Figure 4, the access energy efficiency achieved by this method is nearly four times that of random access, while maintaining convergence stability.

Under otherwise identical conditions, an intelligent interference mode was also implemented. This mode can itself learn the channel's changing patterns, so both sides employ intelligent methods in the game. The results are shown in Figure 5: the solid line represents the performance of intelligent access without considering power; the '·-' curve represents the performance when both sides consider power limitations; and the '--' dashed line represents the performance of our side under the enemy jamming mode with uniform power. It can be seen that this method is more robust against interference when limited power resources are considered. Furthermore, this method employs a cooperative multi-agent approach. To increase access flexibility, an intelligent user cannot select the same channel in consecutive time slots, which significantly increases the learning difficulty for the enemy's intelligent jammer. Additionally, to achieve greater access efficiency, two intelligent users cannot select the same channel for access in the same time slot. These rules contaminate the enemy jammer's training samples, causing it to learn in the wrong direction. Moreover, a target network is used to softly update the computational network, which increases the stability and robustness of the decision-making process.
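The soft target update mentioned here is usually implemented as a weighted blend of the target parameters and the online parameters. The text does not give the mixing coefficient, so `tau` below is a hypothetical parameter, and the flat list of floats stands in for the real network weights.

```python
def soft_update(target_params, online_params, tau):
    """Blend parameters: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_target
            for w_target, w in zip(target_params, online_params)]

target = [0.0, 1.0, -2.0]
online = [1.0, 1.0, 2.0]
for _ in range(3):  # the target drifts slowly toward the online weights
    target = soft_update(target, online, tau=0.1)
print(target)
```

A small `tau` keeps the target network changing slowly, which is the usual reason soft updates stabilize the value estimates used in the TD error.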

[0105] The above detailed description further illustrates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for multi-body cooperative satellite access and anti-jamming based on deep reinforcement learning, characterized in that: Includes the following steps, Step 1: Construct a complex electromagnetic environment for multiple agents; Step 2: In the electromagnetic environment of Step 1, construct a three-dimensional coordinate system in Cartesian space and an Actor-Critic neural network; The implementation method for step two is as follows: Large-scale fading is modeled based on a probabilistic LoS (LoS) model with shadowing and blocking effects. In the LoS probabilistic model, large-scale fading follows a generalized Bernoulli distribution of two distinct events. The channel is either LoS or non-LoS with a certain probability. Since it is a satellite access model, only the LoS channel is considered. Therefore, the large-scale fading between satellite m and user n is represented as: Large-scale fading between satellite m and jammer j is represented as: wherein is the reference distance = 1, and is a vector of three-dimensional spatial coordinates, , , denote the position vectors of the satellite, user and jammer, respectively, and a is the path loss exponent; The channel gain between satellite m and user n is expressed as: The channel gain between satellite m and jammer j is expressed as: in and It is the effect of a small-scale fading at time t, which follows a Rice distribution; The transmit power of the jammer on the k-th channel of the m-th satellite and user n are respectively and Therefore, the channel capacity of user n on the k-th channel of the m-th satellite is: , in For channel bandwidth, This represents the Gaussian noise power; the transmit power of the user and the jammer satisfies... , and ,in It is the maximum power that each user is allowed to use at time t. 
, These are the maximum power that the user and the enemy jammer can use, respectively; Step 3: Based on the position coordinates in the three-dimensional spatial coordinate system obtained in Step 2, the acquired spectrum information is digitized to obtain the input state of the neural network; The implementation method for step three is as follows: To ensure stable user access, it is necessary to obtain the spectrum occupancy status of consecutive time slots; available unoccupied channels are marked as 1, and unavailable occupied channels are marked as -1; and the spectrum occupancy status of consecutive time slots is used as input to a neural network. Indicates the observed channel conditions. =1 indicates that a user has successfully connected to the satellite and the connection was successful. =-1 indicates that a user attempted to connect to the satellite but failed. =0 indicates that no users are connected to the satellite; in, This indicates that user n accesses the k-th channel of satellite m at time t, and u represents the user's occupancy status; The two possible values ​​for indicating a user's failure to access the satellite are 0 and 1; definition , T This indicates the channel occupancy status; Step 4: After obtaining the state input from Step 3, input it into the Actor network and the Critic network respectively. The agent will then take two actions to counter the jammer: selecting the appropriate channel and power. Step 5: Input the state from Step 3 and the action from Step 4 into the environment to interact and receive rewards from the environment. The implementation method for step five is as follows: The interaction between the agent's actions, the jammer's actions, and the environment will result in the following situations: the maximum reward for each step is 3, and the minimum reward is -1; The following... This represents the transmission rate of the k-th channel of satellite m selected by user n at time t. 
The threshold is the transmission rate required for a transmission to count as successful.

① Correct channel and correct power: the intelligent user selects a channel that is not occupied by other users and avoids interference from the enemy jammer; the reward is r=3. If multiple intelligent users select the same channel, the reward is reduced to r=2. If the user is interfered with but its power is sufficient for the transmission rate to reach the threshold, the reward is r=2; if multiple intelligent users select the same channel in that case, the reward is reduced to r=1.

② Correct channel but wrong power: the intelligent user selects a channel that is not occupied by other users, but it is interfered with by the enemy jammer and its power is insufficient to achieve the required transmission rate; the reward is r=1. If multiple intelligent users select the same channel, the reward is r=0.

③ Wrong channel: the intelligent user selects the same channel as other users. If it is not interfered with by the enemy jammer, the reward is r=0; if it is interfered with by the jammer, the reward is r=-1.

Step 6: Input the intermediate state from Step 3, the action from Step 4 and the reward from Step 5 into the Actor and Critic networks for experience learning, optimizing and updating the network parameters to achieve the satellite anti-jamming effect.
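
The reward cases ① to ③ of Step 5 can be sketched as a small lookup function. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `step_reward` and its four boolean inputs are hypothetical, and the sketch assumes that a channel "occupied by other users" (case ③) is a different condition from several intelligent users sharing one channel (cases ① and ②).

```python
def step_reward(occupied_by_others: bool, shared_with_agents: bool,
                jammed: bool, rate_ok: bool) -> int:
    """Per-step reward for one intelligent user, in the range -1..3.

    All argument names are hypothetical:
      occupied_by_others  -- the chosen channel is already used by other users
      shared_with_agents  -- several intelligent users picked the same channel
      jammed              -- the enemy jammer interferes on the chosen channel
      rate_ok             -- the transmission rate reaches the threshold
    """
    if occupied_by_others:
        # Case 3: wrong channel.
        return -1 if jammed else 0

    if not jammed:
        reward = 3          # Case 1: free channel, jammer avoided.
    elif rate_ok:
        reward = 2          # Case 1: jammed, but power is still sufficient.
    else:
        reward = 1          # Case 2: jammed and power is insufficient.

    if shared_with_agents:
        reward -= 1         # Sharing among intelligent users costs one point.
    return reward
```

Under these assumptions the function reproduces the stated bounds: the largest value it can return is 3 and the smallest is -1.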

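The channel model in Step 2 of claim 1 can be illustrated with a short numerical sketch. The equations themselves are not reproduced in this text, so everything here is an assumption: a distance-power path loss of the form (d0/d)^a, a channel gain equal to the large-scale fading times the squared magnitude of the small-scale coefficient, and a Shannon-type capacity in which the jammer's received power adds to the noise. All function and parameter names are hypothetical.

```python
import math

def large_scale_fading(pos_a, pos_b, d0=1.0, a=2.0):
    """Assumed LoS path loss (d0 / distance) ** a between two 3-D positions."""
    distance = math.dist(pos_a, pos_b)
    return (d0 / distance) ** a

def channel_gain(beta, h):
    """Assumed gain: large-scale fading times small-scale fading power |h|^2."""
    return beta * abs(h) ** 2

def channel_capacity(bandwidth, p_user, g_user, p_jam, g_jam, noise_power):
    """Assumed Shannon capacity with the jammer treated as extra interference."""
    sinr = (p_user * g_user) / (p_jam * g_jam + noise_power)
    return bandwidth * math.log2(1.0 + sinr)
```

With these assumed forms, raising the jammer's transmit power lowers the user's capacity on that channel, which is the effect the power-selection action is meant to counter.
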
2. The multi-agent cooperative satellite access and anti-jamming method based on deep reinforcement learning according to claim 1, characterized in that Step 1 is implemented as follows: an integrated air-space network is constructed in which N intelligent users transmit information to satellites, M satellites receive the information, and one traditional jammer and one intelligent jammer are present. Both types of jammer have an equal chance of accessing a channel with limited power. If a jammer and a user select the same channel in the same time slot, the user's transmission fails and the jammer's interference succeeds; otherwise, the user's transmission succeeds and the user successfully avoids the interference. In addition, the jamming trajectory of the jammer is only partially observable.
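
The per-slot collision rule of claim 2 can be written out in a few lines. This is only a sketch of that single rule: the function name `slot_outcomes` and the representation of channels as integers are hypothetical, and power, fading and partial observability are deliberately left out.

```python
def slot_outcomes(user_channels, jammer_channels):
    """Return one boolean per user: True if its transmission succeeds.

    A user fails in this slot exactly when a jammer picked the same channel.
    """
    jammed = set(jammer_channels)
    return [channel not in jammed for channel in user_channels]
```

For example, with users on channels 0, 1 and 2 and the two jammers on channels 1 and 5, only the user on channel 1 fails in that slot.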

3. The multi-agent cooperative satellite access and anti-jamming method based on deep reinforcement learning according to claim 1, characterized in that Step 4 is implemented as follows: the Actor neural network optimizes a target gradient [equation not reproduced] whose terms are the value parameter, the network parameters, the Q value of the current action, the policy at the current moment, and the expectation over the policy. The mean squared error loss function [equation not reproduced] is used, where r is the reward for the action, γ is the discount factor, and the two state value functions are evaluated at the state at the current moment and at the state at the next moment, respectively. The Actor network selects a suitable channel from the user's channel action space and simultaneously selects the transmit power from the power action space; together these form the output action A_t. The Critic network computes the state value V at the current time step and then outputs the temporal-difference (TD) error, which evaluates the quality of the action. The TD error is computed by an expression [equation not reproduced] that includes the Q value at the current moment.
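
The TD error and the Critic's mean squared error loss described in claim 3 can be sketched numerically. The claim's equations are not reproduced in this text, so the sketch assumes the standard one-step form, in which r + γ·V(next state) plays the role of the current Q value; the function names, the default γ and the `done` flag are hypothetical.

```python
def td_error(reward, v_current, v_next, gamma=0.9, done=False):
    """Assumed one-step TD error: (r + gamma * V(s')) - V(s).

    The bracketed target stands in for the Q value at the current moment.
    """
    target = reward + (0.0 if done else gamma * v_next)
    return target - v_current

def critic_mse_loss(td_errors):
    """Mean squared TD error over a batch, used here as the Critic loss."""
    return sum(d * d for d in td_errors) / len(td_errors)
```

A positive TD error means the action did better than the Critic expected, and a negative one means it did worse, which is how the error evaluates the quality of the action.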

4. The multi-agent cooperative satellite access and anti-jamming method based on deep reinforcement learning according to claim 1, characterized in that Step 6 is implemented as follows: first, the Critic network learns by observing the environment and obtaining the state S_{t+1} at the next time step. The network computes the value V of the next state, and the Bellman equation is evaluated. Once the temporal-difference (TD) error has been calculated, it is fed into the Actor network together with the state S and the action A for learning; the TD error serves as the weight in the Actor update. The Critic network therefore needs to estimate V rather than Q, and the TD error, which acts as the advantage function, can then be calculated. The TD error is minimized and its expectation is taken. The policy gradient to be calculated is then given by an expression [equation not reproduced] in which the state value function V serves as the baseline and whose terms are the current policy, the discount factor used to obtain the updated weights, the advantage function, the Q value under the discount factor, and the state value function under the discount factor. The reward value can be used as a loss signal that guides the agent's decisions: the larger the reward, the smaller the difference from the expected Q value and the better the training effect; the smaller the reward, the larger the loss, indicating poor action selection. If the reward is very small or negative over consecutive time slots, other actions are explored to update the policy gradient and find a better policy that improves the agent's anti-jamming capability. This continues until the optimal policy is found, so that the reward fed back by the environment is close to the maximum reward and converges stably, thereby achieving satellite access and anti-jamming.
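
The update described in claim 4 matches the usual advantage actor-critic form, which can be written out as follows. The claim's own equations are not reproduced in this text, so this is the textbook formulation given for orientation only; the symbols θ (Actor parameters), π_θ (policy), δ_t (TD error) and L_critic are assumed notation.

```latex
\begin{align}
  \delta_t &= r_t + \gamma\, V(S_{t+1}) - V(S_t)
    && \text{(TD error, used as the advantage estimate)} \\
  \nabla_\theta J(\theta) &=
    \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, \delta_t \right]
    && \text{(Actor gradient, weighted by } \delta_t \text{)} \\
  L_{\text{critic}} &= \mathbb{E}\!\left[ \delta_t^{2} \right]
    && \text{(Critic loss, minimized over } V \text{)}
\end{align}
```

Because V(S_t) is subtracted as a baseline, the Critic only has to learn V, and the sign of δ_t decides whether the probability of the chosen channel and power is raised or lowered.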