A reinforcement learning congestion control method based on Transformer for time series modeling
By using the Transformer architecture for temporal modeling in congestion control, the method addresses the difficulty traditional methods have in adjusting dynamically to complex network environments, achieving efficient resource utilization and stable network transmission and improving network performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG NORMAL UNIV
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional congestion control methods struggle to dynamically adjust the congestion window or transmission rate in complex and dynamic network environments, leading to network instability and low resource utilization; existing reinforcement learning methods also perform poorly.
The traditional MLP is replaced with a Transformer architecture for temporal modeling. By combining a multi-head attention mechanism, residual connections, layer normalization, and a finely tuned PPO parameter configuration, an efficient reinforcement learning network congestion control method is established. Temporal modeling captures the dependencies in the network state so that the congestion window and transmission rate can be adjusted dynamically.
The method achieves efficient resource utilization in complex network environments, reduces the packet loss rate, improves the acknowledgment (ACK) rate, ensures model stability and adaptability, and enhances network transmission performance.
Figure CN121967328B_ABST (abstract drawing)
Description
Technical Field
[0001] This invention belongs to the fields of communication networks and artificial intelligence technology, and relates to a reinforcement learning congestion control method based on Transformer for time-series modeling.
Background Technology
[0002] Congestion control technology is a mechanism used in computer networks to prevent excessive data from flooding the network simultaneously, which can lead to latency, packet loss, and performance degradation. It monitors network load and dynamically adjusts the transmission rate to maintain efficient and stable data transmission. Common methods include slow start and congestion avoidance.
[0003] Traditional congestion control methods have evolved over time to develop their own interpretations of congestion signals and rate adjustment rules, but these rules are essentially "fixed structures" based on experience. For example, BBR attempts to proactively maintain the optimal transmission rate by modeling bandwidth and round-trip time (RTT). Fixed-rule strategies struggle to accommodate all environmental conditions and can even lead to network instability, unfair resource contention, and a decrease in overall throughput efficiency.
[0004] The connection between reinforcement learning and network congestion control lies primarily in "decision optimization." Traditional congestion control algorithms rely on manually designed rules, while reinforcement learning, through continuous interaction with the network environment, automatically learns the optimal transmission strategy based on feedback (such as throughput, latency, and packet loss rate). It can dynamically adjust the congestion window or transmission rate under complex and variable network conditions, achieving more efficient resource utilization and more stable transmission performance. Compared to fixed algorithms, reinforcement learning methods are more adaptable and perform better in scenarios such as mobile networks.
[0005] Existing simple reinforcement learning methods for network congestion control perform only slightly better than traditional algorithms, and sometimes worse. There is therefore an urgent need for a better-optimized reinforcement learning method with improved network adaptability, enabling more efficient congestion control in complex and ever-changing network environments.
Summary of the Invention
[0006] To address the shortcomings of the prior art, this invention proposes a reinforcement learning network congestion control method based on Transformer temporal modeling. It addresses the difficulty traditional congestion control methods have in dynamically adjusting the congestion window or transmission rate under complex and variable network conditions, which prevents them from achieving efficient resource utilization and stable transmission performance. The invention provides better and more stable congestion control than traditional algorithms.
[0007] This invention establishes an efficient reinforcement learning network congestion control method adaptable to complex network environments. It starts from the classic pairing in traditional reinforcement learning of the PPO algorithm with a simple MLP neural network, and replaces the MLP with a Transformer architecture for temporal modeling. Specifically, the traditional multilayer perceptron (MLP) is replaced with an improved Transformer as the feature extractor. Transformers excel at handling sequential data. In network congestion control scenarios, historical network states (such as latency, packet loss rate, and bandwidth) constitute a time series. Transformers can capture both short-range and long-term dependencies within these time series, whereas MLPs can only handle fixed-size inputs and have a weaker perception of temporal order. The self-attention mechanism of Transformers allows the model to dynamically focus on the most relevant parts of the input sequence. When making decisions, the model can therefore selectively attend to the network state information from a past period that has the greatest impact on the current decision, rather than treating all historical information indiscriminately. This is especially advantageous when handling longer sequences. Furthermore, the module does not adopt the standard Transformer module directly, but incorporates several targeted improvements to enhance stability during RL training and adaptability to network characteristics, such as Xavier initialization, residual connections with LayerNorm, and network state-aware modulation. In addition, this invention uses more finely tuned PPO parameters and optimizer configurations.
[0008] Therefore, this invention can efficiently and dynamically control network congestion in complex network environments, achieving efficient resource utilization, a low packet loss rate, and a high acknowledgment (ACK) rate during data transmission, while also keeping the performance of the reinforcement learning model stable and adapting to different network environments.
[0009] This invention proposes a reinforcement learning congestion control method based on Transformer-based temporal modeling. Addressing the shortcomings of traditional reinforcement learning congestion control methods in dynamic adjustment of congestion windows and transmission rates, as well as low resource utilization in complex network environments such as low bandwidth and high latency, this invention proposes a novel optimization method. The main difference lies in replacing the classic PPO+MLP combination reinforcement learning method with a more innovative PPO+Transformer reinforcement learning method for network congestion control. Basic MLPs can only see the current step's observations and cannot utilize historical changes in network state for congestion control. In contrast, the improved Transformer model in this invention can maintain a 64-step historical observation sequence and, through Transformer encoding, can learn: RTT rising trends, queue accumulation processes, bandwidth change patterns, and precursory signals before packet loss, enabling the model to make more stable and forward-looking congestion control decisions. Furthermore, several stabilization designs for Transformer are added, along with training stabilization mechanisms. Additionally, the PPO algorithm undergoes more refined parameter tuning and optimizer configuration.
[0010] The technical solution of this invention is as follows:
[0011] A reinforcement learning congestion control method based on Transformer for temporal modeling includes an NS3 simulation environment process (NS3 client) and a Python process (Python client). The NS3 client and the Python client communicate using the inter-process communication mechanism provided by the ZeroMQ library. The specific steps are as follows:
[0012] Step 1: Perform network simulation on the NS3 terminal to generate simulated network data;
[0013] Step 2: Send the generated simulated network data to a simulated TCP-RL protocol. The TCP-RL protocol simulates the sending, delay, packet loss, and ACK return of data packets.
[0014] Step 3: The congestion control RL environment on the Python side receives the simulated network data sent from the NS3 side, converts it into observation data, and sends it to the reinforcement learning agent;
[0015] Step 4: The reinforcement learning agent on the Python side analyzes the received observation data, forms a historical sequence, dynamically monitors the real-time state of data transmission, and returns its congestion control decision to the simulated network environment on the NS3 side as the reinforcement learning action. The actions in reinforcement learning include the current congestion window size, bytes in flight, throughput, round-trip time (RTT), RTT change rate, packet loss rate, and ACK rate.
[0016] Step 5: The actions returned from reinforcement learning by the Python side are first passed to the reinforcement learning environment by the Python side, and then sent to the NS3 side through the inter-process communication mechanism. The network environment of the NS3 side adjusts the network data transmission according to the feedback data of the reinforcement learning agent, thereby performing congestion control, and feeding back the network control results to the agent side in the form of reward.
[0017] Step 6: Iterative training: Repeat steps 2 to 5 until the preset training termination condition is met.
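The six steps above can be sketched as a plain interaction loop. This is a minimal illustration, not the patented implementation: `StubNs3Env`, `run_episode`, and `latest_metric_policy` are hypothetical names, the stub replaces the real NS3 process and its ZeroMQ channel with random numbers, and the reward is a toy placeholder.

```python
import random


class StubNs3Env:
    """Hypothetical stand-in for the NS3 side (a real setup would talk to ns-3 over ZeroMQ)."""

    def __init__(self, episode_len=5, seed=0):
        self.episode_len = episode_len
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self._observe()

    def _observe(self):
        # Seven placeholder metrics standing in for the observation vector.
        return [self.rng.random() for _ in range(7)]

    def step(self, action):
        # Steps 2 and 5: apply the action, advance one time step, return feedback.
        self.t += 1
        reward = -abs(action - 0.5)  # toy reward, NOT the reward used in the patent
        done = self.t >= self.episode_len
        return self._observe(), reward, done


def run_episode(env, policy):
    """Steps 3 to 5 for one episode: observe, act, receive a reward."""
    obs = env.reset()
    history, total_reward, done = [], 0.0, False
    while not done:
        history.append(obs)                   # Step 4: extend the historical sequence
        action = policy(history)              # Step 4: decide on an action
        obs, reward, done = env.step(action)  # Step 5: send the action, get the reward
        total_reward += reward
    return total_reward, len(history)


def latest_metric_policy(history):
    # Toy policy: reuse the first metric of the newest observation as the action.
    return history[-1][0]


total, steps = run_episode(StubNs3Env(), latest_metric_policy)
print(steps, round(total, 3))
```

Step 6 corresponds to calling `run_episode` repeatedly, with a learning update between calls, until the termination condition is met.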
[0018] According to a preferred embodiment of the present invention, in step 3, receiving simulated network data sent from the NS3 terminal and converting it into observation data includes:
[0019] `super().step()` is responsible for interacting with the NS3 client, passing the actions taken by the agent to the NS3 client, and executing a time step in the NS3 client. After the NS3 client completes the execution of a time step, it returns the current network state information to the Python client through the ZeroMQ library. The raw network state information is stored in the variable `obs`. The `transform_obs(self, obs)` method receives the raw observation data `obs` returned by `super().step()`. The raw observation data `obs` includes the network connection ID (socket ID), the slow start threshold `ssThresh`, the congestion window size `cWnd`, the number of bytes sent `bytesTX`, the number of bytes successfully received `bytesRx`, and the round-trip time `rtt`. It processes the raw observation data `obs` to extract the features needed by the agent, and finally forms the agent's observation space.
[0020] According to a preferred embodiment of the present invention, the intelligent agent includes a multi-head attention mechanism, a robust Transformer encoder block, an enhanced TCP congestion control feature extractor, and a stable TCP congestion control Actor-Critic policy module; the specific process is as follows:
[0021] Step 4.1: Input feature construction and embedding mapping process;
[0022] Let the state representation extracted by the reinforcement learning environment at time $t$ from the raw network observation data, which is stable, controllable, and generalizable, be:
[0023] $s_t \in \mathbb{R}^{7}$;
[0024] where a historical sequence of length $L$ is maintained to constitute the input tensor $X$:
[0025] $X = [s_{t-L+1}, \ldots, s_t] \in \mathbb{R}^{B \times L \times 7}$;
[0026] where $B$ indicates the batch size and $L$ indicates the number of historical time steps;
[0027] The original input sequence is fed into the enhanced TCP congestion control feature extractor module for signal preprocessing. An embedding mapping function maps the low-dimensional, heterogeneous network metrics to a unified high-dimensional feature space. The mapping is represented as:
[0028] $E = f_{\mathrm{embed}}(X)$;
[0029] where $f_{\mathrm{embed}}(\cdot)$ represents an embedding mapping network comprising linear mapping, layer normalization, and a nonlinear activation function, and outputs the features $E \in \mathbb{R}^{B \times L \times d}$;
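The embedding mapping of Step 4.1 (linear mapping, then layer normalization, then a nonlinear activation) can be sketched in NumPy as below. The choice of ReLU, the embedding width `d = 32`, the random weights, and the function names are assumptions made for illustration; the source only names the three stages. The history length of 64 follows paragraph [0009].

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def embed(x, weight, bias):
    """Linear mapping -> layer normalization -> activation (ReLU is an assumption)."""
    h = x @ weight + bias        # (B, L, 7) @ (7, d) -> (B, L, d)
    h = layer_norm(h)
    return np.maximum(h, 0.0)


rng = np.random.default_rng(0)
B, L, d = 2, 64, 32                       # batch size, history length, assumed embedding width
X = rng.normal(size=(B, L, 7))            # 7 network metrics per time step
W = rng.normal(scale=0.1, size=(7, d))    # hypothetical weights
b = np.zeros(d)
E = embed(X, W, b)
print(E.shape)
```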
[0030] Step 4.2: Stabilize the multi-head attention feature modeling process;
[0031] Let the input features be $Z$;
[0032] The query matrix, key matrix, and value matrix are constructed by linear projection, and are represented as follows:
[0033] $Q = ZW_Q,\quad K = ZW_K,\quad V = ZW_V$;
[0034] where $W_Q$, $W_K$, and $W_V$ are the parameter matrices of the linear transformations;
[0035] The attention score of a single head in multi-head attention is calculated as follows:
[0036] $A_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$;
[0037] where $d_k$ represents the feature dimension of each attention head; $A_{ij}$ indicates the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ indicates the $i$-th query vector, obtained from the input $Z$ through the linear transformation $W_Q$; $k_j$ indicates the $j$-th key vector, of dimension $d_k$, obtained from the input $Z$ through the linear transformation $W_K$; $i$ and $j$ are indices taking different values; and $\cdot$ represents the dot product. The attention score is normalized using the fixed scaling factor $\sqrt{d_k}$.
[0038] A gating mechanism is added to gate the output of the attention: a gate $g$ is computed with the sigmoid function $\sigma(\cdot)$, which maps its argument element-wise to $(0, 1)$;
[0039] Residual connections and layer normalization are added: output = LayerNorm(output + query), where output represents the current state, query represents the adjustment given by historical experience, and LayerNorm (LN) represents the layer normalization operation;
[0040] Subsequently, a Softmax operation is applied to the attention scores to obtain the attention weight matrix, which is used to form a weighted sum of the corresponding value vectors, giving the context feature representation. The outputs of the attention heads are recombined through a linear mapping and added to the input features via a residual connection. Finally, after layer normalization, the attention layer output $H$ is obtained. The calculation is expressed as follows:
[0041] $H = \mathrm{LN}(Z + \mathrm{MSA}(Z))$;
[0042] where LN represents the layer normalization operation, and MSA represents the stable multi-head attention computation;
[0043] Therefore, the stable multi-head attention mechanism can be represented overall as:
[0044] $\mathrm{SMHA}(Z) = \mathrm{LN}\left(Z + \sigma(\cdot) \odot \left(\mathrm{softmax}\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V\right)\right)$;
[0045] where SMHA represents the stable multi-head attention mechanism, LN represents the layer normalization operation, $Z$ is the input feature sequence, $\sigma(\cdot)$ is the sigmoid gate, and $\odot$ denotes the Hadamard product;
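As a rough illustration of the gated attention layer in Step 4.2, the NumPy sketch below implements a single head with fixed $\sqrt{d_k}$ scaling, a sigmoid gate, a residual connection, and layer normalization. The source does not state what the gate is computed from; computing it from the input `Z` through a hypothetical matrix `W_g` is an assumption, as are the single-head simplification and all weight values.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stable_attention(Z, W_q, W_k, W_v, W_g):
    """Single-head sketch of LN(Z + gate * softmax(Q K^T / sqrt(d_k)) V)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (B, L, L), fixed scaling
    context = softmax(scores) @ V                      # (B, L, d)
    gate = sigmoid(Z @ W_g)    # ASSUMPTION: the gate is driven by the input Z
    return layer_norm(Z + gate * context)


rng = np.random.default_rng(0)
B, L, d = 2, 64, 32
Z = rng.normal(size=(B, L, d))
W_q, W_k, W_v, W_g = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
H = stable_attention(Z, W_q, W_k, W_v, W_g)
print(H.shape)
```

Because the gate lies in (0, 1), it can only attenuate the attention context before the residual sum, which is one plausible reading of how gating damps abrupt changes during RL training.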
[0046] Step 4.3: Feedforward network and network state-aware modulation mechanism;
[0047] After the attention layer output $H$ is obtained, it is fed into the robust Transformer encoder block for nonlinear transformation. The feedforward network consists of two linear mapping layers and a nonlinear activation function. The linear mapping layers include a linear layer + ReLU and a linear layer + Tanh. The nonlinear activation function of the feedforward network is GELU. The output $F$ is represented as:
[0048] $F = \mathrm{Linear}_2(\mathrm{GELU}(\mathrm{Linear}_1(H)))$;
[0049] A network state-aware modulation mechanism is introduced at the output stage of the feedforward network. This mechanism generates a modulation factor $m$ from the attention output $H$:
[0050] $m = \tanh(f_{\mathrm{mod}}(H))$;
[0051] where $f_{\mathrm{mod}}(\cdot)$ represents a modulation network composed of linear transformations, and the Tanh function constrains the range of values of the modulation signal.
[0052] Finally, the output features of the feedforward network are dynamically adjusted through a combined additive and multiplicative modulation method, expressed as:
[0053] ;
[0054] The modulated features $\tilde{F}$ and the attention layer output $H$ are then processed again through a residual connection and layer normalization to obtain the final output $Y$ of a single Transformer encoder block:
[0055] $Y = \mathrm{LN}(H + \tilde{F})$;
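The feedforward and modulation stage of Step 4.3 can be sketched as follows. The exact form of the combined additive and multiplicative modulation is not recoverable from the text, so `F * (1 + m) + m` is only one assumed possibility, chosen so that `m = 0` leaves the feedforward output unchanged. The hidden width of `4 * d`, the tanh approximation of GELU, and all weights are likewise assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def encoder_tail(H, W1, b1, W2, b2, W_m, b_m):
    """Feedforward network, tanh-bounded modulation, residual connection, layer norm."""
    F = gelu(H @ W1 + b1) @ W2 + b2      # two linear layers with GELU in between
    m = np.tanh(H @ W_m + b_m)           # modulation factor, bounded to (-1, 1)
    F_mod = F * (1.0 + m) + m            # ASSUMED additive + multiplicative form
    return layer_norm(H + F_mod), m


rng = np.random.default_rng(0)
B, L, d = 2, 64, 32
H = rng.normal(size=(B, L, d))
W1, b1 = rng.normal(scale=0.1, size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(scale=0.1, size=(4 * d, d)), np.zeros(d)
W_m, b_m = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
Y, m = encoder_tail(H, W1, b1, W2, b2, W_m, b_m)
print(Y.shape)
```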
[0056] Step 4.4: Stacking of Transformer Feature Encoder Layers;
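[0057] Let the output of the Transformer feature encoder be $Y_{\mathrm{enc}}$:
[0058] $Y_{\mathrm{enc}} \in \mathbb{R}^{B \times L \times d}$;
[0059] where $B$ is the batch size, $L$ is the length of the time series, and $d$ is the feature channel dimension;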
[0060] Step 4.5: Temporal feature fusion and output representation construction;
[0061] The output features of the Transformer feature encoder are fused; specifically, the end-of-sequence feature and the average feature over the entire sequence are extracted separately:
[0062] $h_{\mathrm{last}} = Y_{\mathrm{enc}}[:, L, :],\quad h_{\mathrm{mean}} = \dfrac{1}{L}\sum_{t=1}^{L} Y_{\mathrm{enc}}[:, t, :]$;
[0063] where $Y_{\mathrm{enc}}$ represents the temporal feature tensor after encoding by the multi-layer robust Transformer blocks, and $L$ represents the sequence length. $h_{\mathrm{last}}$ indicates the current high-level network state and $h_{\mathrm{mean}}$ indicates the historical trend characteristics. $h_{\mathrm{last}}$ and $h_{\mathrm{mean}}$ are concatenated along the feature dimension and then fed into the feature fusion module $f_{\mathrm{fuse}}$ to obtain the final feature representation $z$:
[0064] $z = f_{\mathrm{fuse}}([h_{\mathrm{last}};\, h_{\mathrm{mean}}])$;
[0065] The final output feature $z \in \mathbb{R}^{B \times d}$ is fed into the Actor-Critic policy network to guide TCP congestion control decisions, where $B$ represents the batch size, that is, the number of TCP connections / environments processed in parallel, and $d$ indicates the number of feature channels.
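The fusion in Step 4.5 takes the last time step and the mean over all time steps, concatenates them, and passes the result through a fusion module. The sketch below assumes the fusion module is a single linear layer followed by tanh; the source does not describe its internals, and the weights and the name `fuse` are hypothetical.

```python
import numpy as np


def fuse(Y_enc, W_f, b_f):
    """Concatenate the end-of-sequence and sequence-average features, then project back to d."""
    h_last = Y_enc[:, -1, :]                               # current high-level state, (B, d)
    h_mean = Y_enc.mean(axis=1)                            # historical trend, (B, d)
    fused_in = np.concatenate([h_last, h_mean], axis=-1)   # (B, 2d)
    return np.tanh(fused_in @ W_f + b_f)                   # fusion module: linear + tanh (assumed)


rng = np.random.default_rng(0)
B, L, d = 2, 64, 32
Y_enc = rng.normal(size=(B, L, d))
W_f = rng.normal(scale=0.1, size=(2 * d, d))
b_f = np.zeros(d)
z = fuse(Y_enc, W_f, b_f)
print(z.shape)
```

The resulting `(B, d)` tensor is the kind of per-environment feature vector that an Actor head and a Critic head would both consume.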
[0066] According to a preferred embodiment of the present invention, the reinforcement learning agent forms the historical sequence as follows:
[0067] In the enhanced TCP congestion control feature extractor, `state_buffer` is a state buffer that stores the observations of the most recent `seq_len` time steps. During each forward pass, the new observation is normalized and reshaped to (batch_size, 1, input_dim) through `unsqueeze(1)`, where `batch_size` is the number of samples processed at one time during the current training or inference, and `input_dim` is the feature dimension of each time step's observation, that is, the 7-dimensional vector output by `transform_obs`. The reshaped observation is concatenated to the state buffer. After concatenation, the shape of `state_buffer` is still (batch_size, seq_len, input_dim), so the latest `seq_len` time steps are always kept.
[0068] Along the sequence dimension, the oldest step is discarded, so that `state_buffer` always stores the latest historical sequence, that is, the formed historical sequence;
[0069] The agent learns from the historical sequence to choose its actions, as follows: the historical sequence (batch_size, seq_len, input_dim) is fed into the feature extraction layer of the policy network to extract temporal features; the Actor network outputs action probabilities or specific actions based on the feature vectors; the Critic network evaluates the value function of the current state; the output action is sent to the NS3 side, which adjusts the network based on the action and returns new observations and rewards.
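The rolling `state_buffer` described in paragraphs [0067] and [0068] can be sketched in NumPy, where `obs[:, None, :]` plays the role of `unsqueeze(1)`. The class name `StateBuffer`, the zero-filled initial buffer, and the omission of the normalization step are assumptions of this sketch.

```python
import numpy as np


class StateBuffer:
    """Keeps the most recent seq_len observations with shape (batch_size, seq_len, input_dim)."""

    def __init__(self, batch_size, seq_len, input_dim):
        # ASSUMPTION: the buffer starts zero-filled before any observation arrives.
        self.buffer = np.zeros((batch_size, seq_len, input_dim))

    def push(self, obs):
        """obs has shape (batch_size, input_dim) and is assumed to be normalized already."""
        step = obs[:, None, :]   # NumPy analogue of unsqueeze(1): (batch_size, 1, input_dim)
        # Drop the oldest step along the sequence axis and append the newest one.
        self.buffer = np.concatenate([self.buffer[:, 1:, :], step], axis=1)
        return self.buffer


buf = StateBuffer(batch_size=1, seq_len=4, input_dim=7)
for k in range(1, 7):                       # push six observations into a four-step window
    history = buf.push(np.full((1, 7), float(k)))
print(history[0, :, 0])                     # first feature of each stored step
```

After six pushes into a four-step window, only observations 3 to 6 remain, oldest first.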
[0070] According to a preferred embodiment of the present invention, fixed temperature scaling is used, as shown in the following formula:
[0071] $A_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$;
[0072] where $A_{ij}$ indicates the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ indicates the $i$-th query vector, of dimension $d_k$, obtained from the input $Z$ through the linear transformation $W_Q$; $k_j$ indicates the $j$-th key vector, of dimension $d_k$, obtained from the input $Z$ through the linear transformation $W_K$; $i$ and $j$ are indices taking different values; $\cdot$ represents the dot product; and $d_k$ indicates the dimension;
[0073] Xavier initialization is performed. Each token of the input sequence $Z$ is linearly projected as $Q = ZW_Q$, $K = ZW_K$, $V = ZW_V$, where $Z$ is the input feature matrix, representing the feature vector at each position in the sequence; the batch size is the number of sequence samples processed in each training / inference iteration; `seq_len` is the sequence length, that is, the number of positions in each sample; $Q$, $K$, and $V$ are the Query, Key, and Value matrices after linear projection; $W_Q$ is the linear transformation weight matrix of the Query, used to project the input into the Query space; $W_K$ is the linear transformation weight matrix of the Key, likewise used to project the input into the Key space; and $W_V$ is the linear transformation weight matrix of the Value, likewise used to project the input into the Value space.
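To make the two preferred embodiments above concrete, the sketch below initializes hypothetical projection matrices with Xavier (Glorot) uniform initialization and computes attention scores with the fixed $\sqrt{d_k}$ scaling. The source does not say whether the uniform or the normal Xavier variant is used, nor does it give the matrix sizes; the uniform variant and the sizes `d_model = 32`, `d_k = 8` are assumptions.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))


def attention_scores(Z, W_q, W_k):
    """A[i, j] = (q_i . k_j) / sqrt(d_k), with a fixed (non-learned) scaling factor."""
    Q, K = Z @ W_q, Z @ W_k
    d_k = Q.shape[-1]
    return Q @ K.T / np.sqrt(d_k)


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 64, 32, 8            # assumed sizes
Z = rng.normal(size=(seq_len, d_model))      # one sequence of embedded network states
W_q = xavier_uniform(d_model, d_k, rng)
W_k = xavier_uniform(d_model, d_k, rng)
A = attention_scores(Z, W_q, W_k)
print(A.shape)
```

Keeping the initial weights inside the Xavier bound keeps the variance of `Q` and `K` moderate, and dividing by $\sqrt{d_k}$ keeps the scores from growing with the head dimension before the Softmax is applied.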
[0074] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the reinforcement learning congestion control method based on Transformer for time-series modeling described above.
[0075] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the reinforcement learning congestion control method based on Transformer for time-series modeling described above.
[0076] The beneficial effects of this invention are as follows:
[0077] This invention enhances the feature extraction capabilities of reinforcement learning through the Transformer architecture, proposing an intelligent decision-making method for network congestion control. In feature extraction, a multi-head attention mechanism is introduced to perform temporal modeling of multi-dimensional network states, effectively capturing the dynamic evolution of congestion states. In network architecture design, a lightweight Transformer encoder block is employed, reducing computational overhead while maintaining sensitivity to network fluctuations. Regarding training stability, gating mechanisms and other techniques are used to avoid policy oscillations during reinforcement learning training. Experiments show that this method achieves increased throughput and a reduced packet loss rate compared to traditional congestion control algorithms. This invention can adapt to different network load scenarios, significantly improving network resource utilization while ensuring fairness, providing a deployable intelligent congestion control solution for high-performance network environments such as 6G and data centers.
Attached Figure Description
[0078] Figure 1 is a schematic diagram of the process of the reinforcement learning congestion control method based on Transformer for temporal modeling according to the present invention;
[0079] Figure 2 is a structural block diagram of the intelligent agent of the present invention;
[0080] Figure 3 is a comparison chart of ACK signals under different congestion control methods according to an embodiment of the present invention;
[0081] Figure 4 is a comparison chart of packet loss rates under different congestion control methods according to an embodiment of the present invention;
[0082] Figure 5 is the reward curve of the reinforcement learning network congestion control method using Transformer for temporal modeling in an embodiment of the present invention.
Detailed Implementation
[0083] The present invention is further described below with reference to the accompanying drawings and embodiments, but is not limited thereto.
[0084] Terminology Explanation:
[0085] 1. The NS3 simulation environment process: NS3 (Network Simulator 3) is an open-source network simulator widely used in academia and industry. It can simulate various network topologies, protocol stacks, and traffic models. This invention simulates a real network environment on the NS3 side and interacts with the intelligent agent on the reinforcement learning side.
[0086] 2. Python process: In simulation-based learning scenarios, the network simulator (NS3) is responsible for simulating the behavior of a real network environment and providing feedback to the agent, while the agent is responsible for learning the optimal policy. These components typically reside in different processes to achieve decoupling and efficient execution. The NS3 process simulates the real environment, while the Python process hosts the agent, which reacts based on observations of the environment.
[0087] 3. ZeroMQ library, a high-performance, lightweight asynchronous message queue library, is an embedded, multi-threaded, and portable message queue system. It provides a set of socket APIs that can be used to build distributed or concurrent applications. In the technical solution described in this invention, the ZeroMQ library is used as a communication middleware to facilitate communication between the NS3 simulation environment process (usually a child process written in C++) and the Python machine learning training process (usually a parent process written in Python). ZeroMQ provides socket abstractions for various communication modes (such as request-response REQ / REP, publish-subscribe PUB / SUB, push-pull PUSH / PULL, etc.). This allows components written in different processes or languages to connect and exchange messages in a standardized way, without needing to worry about the underlying operating system IPC details.
[0088] 4. TCP-RL Protocol: TCP-RL is a communication protocol implementation responsible for intelligently and dynamically determining the congestion window size for data transmission to optimize communication performance under various network conditions. This protocol can work in conjunction with existing transport layer services and continuously learns and adapts from network feedback through a monitoring interval mechanism.
[0089] 5. ACK (acknowledgment) is a crucial communication mechanism. An ACK is a message sent by the receiver to the sender to indicate that the receiver has successfully received one or more data packets previously sent by the sender.
[0090] 6. RL stands for reinforcement learning.
[0091] 7. RTT (Round-Trip Time) refers to the total time elapsed from when the sender transmits a data packet until the sender receives an acknowledgment packet from the receiver. This time includes transmission delay, processing delay, queuing delay, and other factors, and is an important indicator of network performance.
[0092] 8. super().step() completes the process of sending control actions to ns-3 and receiving the raw network state from ns-3.
[0093] 9. The transform_obs(self, obs) method transforms the original network state variables returned by ns-3 into low-dimensional, physically interpretable state vectors suitable for PPO+Transformer learning.
[0094] Example 1
[0095] A reinforcement learning congestion control method based on Transformer for temporal modeling, as shown in Figure 1. The method aims to improve network data transmission efficiency and increase resource utilization under bandwidth-constrained and high-latency network conditions, ultimately constructing a congestion control agent capable of maintaining stable and efficient decision-making in complex, dynamic, and noisy network environments. The method includes an NS3 simulation environment process (NS3 client) and a Python process (Python client). The NS3 client and the Python client communicate using the inter-process communication mechanism provided by the ZeroMQ library. The specific steps are as follows:
[0096] Step 1: The NS3 terminal performs network simulation to generate simulated network data. The simulated network data includes network status such as socketID (network connection ID), ssThresh slow start threshold, cWnd congestion window size, bytesTX (number of bytes currently sent), bytesRx (number of bytes successfully received), and RTT (round-trip time).
[0097] Step 2: Send the generated simulated network data to a simulated TCP-RL protocol. The TCP-RL protocol simulates the sending, delay, packet loss, and ACK return of data packets.
[0098] Step 3: The congestion control RL environment on the Python side receives the simulated network data sent from the NS3 side, converts it into observation data, and sends it to the reinforcement learning agent;
[0099] Step 4: The reinforcement learning agent on the Python side analyzes the received 7-dimensional observation data to form a historical sequence, dynamically monitors the real-time state of data transmission, and returns its congestion control decision to the simulated network environment on the NS3 side as the reinforcement learning action. The actions in reinforcement learning include the current congestion window size, bytes in flight (the number of bytes sent but not yet acknowledged), throughput, round-trip time (RTT), RTT change rate, packet loss rate, and ACK rate.
[0100] Step 5: The action returned by the reinforcement learning agent on the Python client is first passed to the reinforcement learning environment in Python, and then sent to the NS3 client via inter-process communication. The NS3 client adjusts network data transmission based on the feedback from the reinforcement learning agent to perform congestion control, and feeds the network control results back to the agent in the form of rewards. This invention also improves the reward mechanism by adding a throughput-zeroing safety mechanism, making the reward more reasonable and enabling the model to focus more on the most relevant parts of the input sequence;
[0101] Step 6: Iterative training: Repeat steps 2 to 5 until the preset training termination condition is met.
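Step 5 mentions a reward with a throughput-zeroing safety mechanism but does not give its form. The sketch below shows one hypothetical reading: a weighted reward that trades throughput against RTT and loss, with a fixed penalty returned whenever throughput drops to zero so that a stalled connection is never rewarded for its low delay. The function name, the weights, the units, and the penalty value are all assumptions, not values from the source.

```python
def congestion_reward(throughput_mbps, rtt_ms, loss_rate,
                      w_tput=1.0, w_rtt=0.01, w_loss=10.0, zero_penalty=-1.0):
    """Hypothetical reward: throughput bonus minus RTT and loss penalties."""
    # Safety branch (assumed interpretation): a connection with no throughput
    # gets a fixed penalty instead of the weighted sum.
    if throughput_mbps <= 0.0:
        return zero_penalty
    return w_tput * throughput_mbps - w_rtt * rtt_ms - w_loss * loss_rate


stalled = congestion_reward(0.0, rtt_ms=5.0, loss_rate=0.0)
healthy = congestion_reward(10.0, rtt_ms=100.0, loss_rate=0.01)
print(stalled, healthy)
```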
[0102] Example 2
[0103] This example differs from the reinforcement learning congestion control method based on Transformer for temporal modeling described in Example 1 as follows:
[0104] In step 3, the simulated network data sent from the NS3 terminal is received and converted into observation data; this includes:
[0105] `super().step()` is responsible for interacting with the NS3 client: it passes the agent's action (here, the target congestion window produced by `transform_action(action)`) to the NS3 client and executes one time step there. After the NS3 client completes the time step, it returns the current network state information to the Python client through the ZeroMQ library. The raw network state information (such as the socket ID, that is, the network connection ID, and the slow start threshold `ssThresh`) is stored in the `obs` variable. The `transform_obs(self, obs)` method receives the raw observation data `obs` returned by `super().step()`, which includes the network connection ID, the slow start threshold `ssThresh`, the congestion window size `cWnd`, the number of bytes sent (`bytesTX`), the number of bytes successfully received (`bytesRx`), the round-trip time (`rtt`), and other network state information. It then processes `obs` to extract the features needed by the agent, ultimately forming the agent's observation space. This invention selects the seven most important dimensions of the raw data for parsing, accumulation, and calculation. The final observation data comprise: current window size, throughput, round-trip time, packet loss rate, RTT fluctuation, number of bytes in flight, and ACK reception rate.
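A possible shape for the observation transform is sketched below. The source lists the raw fields and the seven output features but not the formulas connecting them, so every derivation here is an assumption: byte counters are treated as cumulative, loss is approximated by unacknowledged bytes within the interval, and RTT fluctuation is taken as the difference from the previous RTT. Unlike the method `transform_obs(self, obs)` in the source, this is a free function that takes the previous raw observation and the step interval as extra, hypothetical arguments.

```python
def transform_obs(obs, prev, interval_s):
    """Map raw counters to the 7 features, in the order listed in the text (sketch only)."""
    sent = obs["bytesTX"] - prev["bytesTX"]      # bytes sent during this interval
    acked = obs["bytesRx"] - prev["bytesRx"]     # bytes received during this interval
    in_flight = max(obs["bytesTX"] - obs["bytesRx"], 0)
    throughput = acked / interval_s              # bytes per second
    loss_rate = max(sent - acked, 0) / sent if sent > 0 else 0.0   # crude approximation
    ack_rate = acked / sent if sent > 0 else 0.0
    rtt_change = obs["rtt"] - prev["rtt"]
    return [obs["cWnd"], throughput, obs["rtt"], loss_rate,
            rtt_change, in_flight, ack_rate]


prev_raw = {"bytesTX": 1000, "bytesRx": 900, "rtt": 0.05}
raw = {"cWnd": 14480, "bytesTX": 3000, "bytesRx": 2700, "rtt": 0.06}
features = transform_obs(raw, prev_raw, interval_s=0.1)
print(len(features))
```

In practice each feature would also be normalized to a comparable scale before it enters the history buffer.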
[0106] This invention constructs a congestion control agent for TCP congestion control tasks, used for time-series modeling and high-level feature abstraction of continuous network state sequences. As shown in Figure 2, the agent comprises a multi-head attention mechanism, a robust Transformer encoder block, an enhanced TCP congestion control feature extractor, and a stable TCP congestion control Actor-Critic policy module; through this hierarchical design it achieves accurate modeling of network congestion evolution patterns. The agent employs residual connections and layer normalization throughout its sub-modules to keep gradient propagation stable during long-term online training and to avoid model degradation caused by drastic changes in the state distribution. The specific process is as follows:
[0107] Step 4.1: Input feature construction and embedding mapping process;
[0108] Let the stable, controllable, and generalizable state representation that the reinforcement learning environment extracts from the raw network observation data at time $t$ be:
[0109] $s_t \in \mathbb{R}^{7}$;
[0110] where the seven components of $s_t$ correspond to the seven core network metrics in TCP congestion control. To characterize the temporal correlation of network states, the system maintains a historical sequence of length $L$, which constitutes the input tensor $X$:
[0111] $X = [s_{t-L+1}, \dots, s_t] \in \mathbb{R}^{B \times L \times 7}$;
[0112] where $B$ is the batch size and $L$ is the number of historical time steps;
[0113] The raw input sequence is fed into the enhanced TCP congestion control feature extractor module for signal preprocessing. An embedding mapping function maps the low-dimensional, heterogeneous network metrics into a unified high-dimensional feature space. The mapping is expressed as:
[0114] $Z = f_{\mathrm{emb}}(X)$;
[0115] where $f_{\mathrm{emb}}(\cdot)$ is an embedding mapping network comprising a linear mapping, layer normalization, and a nonlinear activation function, and $Z \in \mathbb{R}^{B \times L \times D}$ is the output feature, with $D$ the feature channel dimension;
[0116] This process achieves scale uniformity and noise suppression of the original network signal, providing a stable input for subsequent time series modeling.
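A minimal sketch of the embedding mapping described in Step 4.1 (linear mapping, then layer normalization, then a nonlinear activation), written in plain Python for readability. The choice of GELU as the activation, the output width `D = 16`, the random weights, and the omission of learnable layer-normalization gain and bias are assumptions of this sketch, not details taken from the source.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    # Normalize one feature vector to zero mean and unit variance.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def gelu(x):
    # Exact GELU via the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def embed(seq, weight, bias):
    """Map each 7-dim observation in `seq` to a D-dim feature vector."""
    out = []
    for obs in seq:
        lin = [sum(w * x for w, x in zip(row, obs)) + b
               for row, b in zip(weight, bias)]
        out.append([gelu(x) for x in layer_norm(lin)])
    return out


random.seed(0)
IN_DIM, D = 7, 16  # D is an arbitrary width chosen for the sketch
W = [[random.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(D)]
b = [0.0] * D
history = [[random.random() for _ in range(IN_DIM)] for _ in range(64)]
Z = embed(history, W, b)  # 64 time steps, each mapped to D features
```

Because layer normalization is applied per time step, metrics with very different units (bytes, seconds, ratios) end up on a comparable scale before attention is computed.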
[0117] Step 4.2: Stabilized multi-head attention feature modeling;
[0118] To extract the dependencies among network states over time, the Transformer feature encoder applies a stabilized multi-head attention mechanism to the embedded features. Let the input features be $Z$;
[0119] The query matrix, key matrix, and value matrix are constructed by linear projection:
[0120] $Q = Z W_Q,\quad K = Z W_K,\quad V = Z W_V$;
[0121] where $W_Q$, $W_K$, and $W_V$ are the parameter matrices of the linear transformations;
[0122] The attention score of a single head in the multi-head attention is calculated as:
[0123] $e_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$;
[0124] where $d_k$ is the feature dimension of each attention head; $e_{ij}$ is the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ is the $i$-th query vector, obtained from the input $Z$ through the linear transformation $W_Q$; $k_j$ is the $j$-th key vector, of dimension $d_k$, obtained from the input $Z$ through the linear transformation $W_K$; $i$ and $j$ are position indices; and $\cdot$ denotes the dot product. The attention score is normalized by the fixed scaling factor $\sqrt{d_k}$, which prevents numerical instability of the attention weights during training. This fixed temperature parameter keeps the network output stable.
[0125] A gating mechanism is added through which the attention output is passed, using $g = \sigma(\cdot)$, where $\sigma(\cdot)$ is the sigmoid function, applied element-wise and mapping to $(0, 1)$;
[0126] Residual connections and layer normalization are added: output = LayerNorm(output + query), where output represents the current state, query represents the adjustment given by historical experience, and LayerNorm (LN) denotes the layer normalization operation;
[0127] Subsequently, a Softmax operation is applied to the attention scores to obtain the attention weight matrix, which is used to form a weighted sum of the corresponding value vectors, giving the context feature representation. The outputs of the attention heads are recombined through a linear mapping and added to the input features via a residual connection. Finally, layer normalization yields the attention layer output. The calculation is expressed as:
[0128] $Z_{\mathrm{att}} = \mathrm{LN}(Z + \mathrm{MSA}(Z))$;
[0129] where LN denotes the layer normalization operation and MSA denotes the stabilized multi-head attention computation;
[0130] Therefore, the stabilized multi-head attention mechanism can be summarized as:
[0131] $\mathrm{SMHA}(Z) = \mathrm{LN}\big(Z + \sigma(\cdot) \odot (\mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V)\big)$;
[0132] where SMHA denotes the stabilized multi-head attention mechanism, LN denotes the layer normalization operation, $Z$ is the input feature sequence, $\sigma(\cdot)$ is the sigmoid gate, and $\odot$ denotes the Hadamard product;
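The gated attention of Step 4.2, LN(Z + σ(·) ⊙ (softmax(·)V)), can be sketched for a single head as follows. The source does not state what the gate is computed from; this sketch assumes the gate is a sigmoid of a linear projection of the input (`w_g`), and it uses one head with square projection matrices so that the residual addition is dimensionally valid. All function and parameter names are illustrative.

```python
import math


def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(z, w_q, w_k, w_v, w_g):
    """Single-head sketch of LN(Z + g * (softmax(QK^T / sqrt(d_k)) V))."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))  # fixed scaling factor 1/sqrt(d_k)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    context = matmul(weights, v)
    # Assumed gate: sigmoid of a linear projection of the input.
    gate = [[sigmoid(x) for x in row] for row in matmul(z, w_g)]
    out = [
        layer_norm([zi + gi * ci for zi, gi, ci in zip(z_row, g_row, c_row)])
        for z_row, g_row, c_row in zip(z, gate, context)
    ]
    return out, weights
```

Since the gate lies in (0, 1), it can only attenuate the attention contribution relative to the residual path, which is one way to read the "lightweight" stabilizing role the text gives it.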
[0133] Step 4.3: Feedforward network and network state-aware modulation mechanism;
[0134] After the attention layer output $Z_{\mathrm{att}}$ is obtained, it is further fed into the robust Transformer encoder block for nonlinear transformation. The feedforward network consists of two linear mapping layers and a nonlinear activation function. The linear mapping layers comprise a linear layer + ReLU and a linear layer + Tanh, and the nonlinear activation function of the feedforward network is GELU. The output $F$ of the feedforward network is expressed as:
[0135] $F = \mathrm{FFN}(Z_{\mathrm{att}})$;
[0136] To enhance the model's adaptability to changes in network congestion, this invention introduces a network state-aware modulation mechanism at the output stage of the feedforward network. This mechanism generates a modulation factor $m$ from the attention output $Z_{\mathrm{att}}$:
[0137] $m = \tanh\big(f_{\mathrm{mod}}(Z_{\mathrm{att}})\big)$;
[0138] where $f_{\mathrm{mod}}(\cdot)$ is a modulation network composed of linear transformations, and the Tanh function constrains the range of the modulation signal.
[0139] Finally, the output features of the feedforward network are dynamically adjusted through a combined additive and multiplicative modulation method, expressed as:
[0140] ;
[0141] This design enables the model to automatically suppress aggressive features when network congestion intensifies and amplify throughput-oriented features when the network is idle, thereby achieving adaptive adjustment of TCP behavior;
[0142] The modulated features $\tilde{F}$ and the attention layer output $Z_{\mathrm{att}}$ are then passed through a residual connection and layer normalization once more to obtain the final output of a single Transformer encoder block:
[0143] $Z_{\mathrm{out}} = \mathrm{LN}(Z_{\mathrm{att}} + \tilde{F})$;
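One possible reading of Step 4.3 is sketched below for a single feature vector. The source says only that the modulation is "combined additive and multiplicative" and Tanh-bounded; the concrete form used here, F ⊙ (1 + γ) + β with γ and β produced by two separate linear maps followed by Tanh (a FiLM-style scale and shift), is an assumption of this sketch, as are the use of a plain two-layer GELU feedforward network and all parameter names.

```python
import math


def linear(x, w, b):
    # y = W x + b for one vector x; w is a list of rows.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def modulated_block(h, w1, b1, w2, b2, w_scale, w_shift):
    """Feedforward output modulated by signals derived from the attention output h."""
    # Two-layer feedforward network with GELU in between.
    f = linear([gelu(x) for x in linear(h, w1, b1)], w2, b2)
    # Tanh keeps both modulation signals in (-1, 1).
    gamma = [math.tanh(x) for x in linear(h, w_scale, [0.0] * len(w_scale))]
    beta = [math.tanh(x) for x in linear(h, w_shift, [0.0] * len(w_shift))]
    # Assumed combination: multiplicative scale plus additive shift.
    mod = [fi * (1.0 + g) + bt for fi, g, bt in zip(f, gamma, beta)]
    # Residual connection with the attention output, then layer normalization.
    return layer_norm([hi + mi for hi, mi in zip(h, mod)])
```

With this form, zero modulation weights reduce the block to an ordinary Transformer feedforward sub-layer, so the modulation can be learned as a deviation from the standard behavior.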
[0144] Step 4.4: Stacking of Transformer Feature Encoder Layers;
[0145] The stabilized multi-head attention module and the network state-aware modulation module together constitute a complete robust Transformer encoder block; multiple such blocks with the same structure are stacked to form the complete Transformer feature encoder, which extracts high-order temporal features from the network state sequence layer by layer.
[0146] Let the output of the Transformer feature encoder be $H$:
[0147] $H \in \mathbb{R}^{B \times L \times D}$;
[0148] where $B$ is the batch size, $L$ is the length of the time series, and $D$ is the feature channel dimension;
[0149] Step 4.5: Temporal feature fusion and output representation construction;
[0150] To capture both the current network state and long-term congestion trends, the Transformer feature encoder fuses its output features. Specifically, it extracts the end-of-sequence feature and the average feature over the entire sequence separately:
[0151] $h_{\mathrm{last}} = H_{:,L,:},\quad h_{\mathrm{mean}} = \dfrac{1}{L}\sum_{t=1}^{L} H_{:,t,:}$;
[0152] where $H$ is the temporal feature tensor encoded by the stacked robust Transformer blocks and $L$ is the sequence length, set to 64 in this invention. $h_{\mathrm{last}}$ represents the current high-level network state, and $h_{\mathrm{mean}}$ represents the historical trend (long-term information). $h_{\mathrm{last}}$ and $h_{\mathrm{mean}}$ are concatenated along the feature dimension and fed into the feature fusion module to obtain the final feature representation $h$:
[0153] $h = f_{\mathrm{fuse}}([h_{\mathrm{last}};\, h_{\mathrm{mean}}]) \in \mathbb{R}^{B \times D}$;
[0154] The final output feature $h$ is fed into the Actor-Critic policy network (the stable TCP congestion control Actor-Critic policy module) to guide TCP congestion control decisions; here $B$ is the batch size, i.e., the number of TCP connections / environments processed in parallel, and $D$ is the number of feature channels.
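The fusion in Step 4.5 (last time step plus sequence mean, concatenated and passed through a fusion module) can be sketched as below for one sample. The source does not describe the internals of the fusion module; a single linear layer (`w_fuse`, `b_fuse`) is assumed here purely for illustration.

```python
def fuse(h_seq, w_fuse, b_fuse):
    """Fuse the end-of-sequence feature with the sequence-average feature.

    h_seq is a list of L feature vectors (one per time step) for one sample.
    """
    h_last = h_seq[-1]  # current high-level network state
    n = len(h_seq)
    h_mean = [sum(col) / n for col in zip(*h_seq)]  # long-term trend
    concat = list(h_last) + h_mean  # concatenate along the feature dimension
    # Assumed fusion module: one linear layer.
    return [sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(w_fuse, b_fuse)]
```

For example, with `h_seq = [[1, 2], [3, 4], [5, 6]]` the end-of-sequence feature is `[5, 6]` and the average feature is `[3.0, 4.0]`, so the fusion layer sees the four-element vector `[5, 6, 3.0, 4.0]`.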
[0155] After the network state observations are input into the policy network, the temporal Transformer feature encoder first models the current and historical network states jointly, generating a stable high-level state representation. The Actor network then outputs the TCP congestion control action based on this state representation, while the Critic network estimates the long-term reward of the state. The two are optimized jointly within the PPO framework, thereby achieving online learning and adaptive adjustment of the TCP congestion control policy. These are the main tasks of the stable TCP congestion control Actor-Critic policy module.
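For reference, the standard PPO clipped surrogate objective that an Actor network is typically trained with can be sketched as follows. This is the generic PPO formulation, not a detail taken from the source; the clipping range `clip_eps = 0.2` is a commonly used default and is an assumption here, and `ratios` denotes the new-to-old policy probability ratios for the sampled actions.

```python
def ppo_clip_loss(ratios, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate, averaged over samples.

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i (e.g. from the Critic)
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped objectives.
        terms.append(min(r * a, clipped * a))
    return -sum(terms) / len(terms)
```

The clipping bounds how far a single update can move the policy, which is relevant for congestion control because an overly aggressive policy change translates directly into window oscillation.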
[0156] The reinforcement learning agent forms the historical sequence as follows:
[0157] The enhanced TCP congestion control feature extractor contains a state buffer, `state_buffer`, which stores the raw network data received from the NS3 network environment as the observation history of the most recent `seq_len` time steps (set to 64 in this invention). During each forward pass, the new observation (the network metrics returned by the NS3 environment at the current time step) is normalized and reshaped to `(batch_size, 1, input_dim)` by `unsqueeze(1)`, matching the input requirements of the Transformer time-series model, and is then concatenated to the state buffer. Here `batch_size` is the number of samples processed at one time during training or inference, and `input_dim` is the feature dimension of each time-step observation, i.e., the 7-dimensional vector output by `transform_obs`. After concatenation, the shape of `state_buffer` remains `(batch_size, seq_len, input_dim)`, so it always holds the latest `seq_len` time steps. Because this invention sets `seq_len` to 64, the agent makes decisions using the network state of the past 64 steps.
[0158] Along the time dimension, the oldest step is discarded, so `state_buffer` always stores the latest historical sequence, i.e., the formed historical sequence;
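The rolling history described above can be sketched for a single environment with a fixed-length queue. The source performs this with tensor operations (`unsqueeze(1)`, concatenation, and dropping the oldest step); the `collections.deque` used here is a plain-Python stand-in with the same append-and-discard behavior. The class name `StateBuffer` and the zero-filled initial history are assumptions of this sketch.

```python
from collections import deque


class StateBuffer:
    """Keeps the most recent seq_len observations for one environment."""

    def __init__(self, seq_len=64, input_dim=7):
        self.seq_len = seq_len
        self.input_dim = input_dim
        # Assumption: start with zeros so the sequence always has seq_len steps.
        self.steps = deque(
            ([0.0] * input_dim for _ in range(seq_len)), maxlen=seq_len
        )

    def push(self, obs):
        if len(obs) != self.input_dim:
            raise ValueError("observation has the wrong feature dimension")
        # Appending to a full deque discards the oldest step automatically.
        self.steps.append(list(obs))
        return list(self.steps)  # shape: (seq_len, input_dim)
```

Each call to `push` returns a sequence of exactly `seq_len` steps, oldest first, which is the shape the temporal encoder expects for one sample.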
[0159] The agent learns from the historical sequence to produce actions. This process includes: the historical sequence `(batch_size, seq_len, input_dim)` is fed into the feature extraction layer (Transformer) of the policy network to extract temporal features; the Actor network outputs action probabilities or specific actions (such as congestion window increments) based on the feature vectors; the Critic network evaluates the value function of the current state; and the output action is sent to the NS3 terminal, which adjusts the network based on the action and returns a new observation and a reward.
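The observe, decide, act cycle described above can be sketched with a stand-in environment. `ToyEnv` is a hypothetical placeholder for the NS3-backed environment (its dynamics and reward are invented purely so the loop runs), and `policy` stands for the Transformer encoder plus Actor network; none of these names come from the source.

```python
class ToyEnv:
    """Hypothetical stand-in for the NS3-backed environment."""

    def __init__(self, steps=5):
        self.steps = steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 7  # 7-dimensional observation

    def step(self, action):
        self.t += 1
        obs = [float(action)] + [0.0] * 6
        reward = -abs(action - 10)  # toy reward: best window is 10
        done = self.t >= self.steps
        return obs, reward, done


def run_episode(env, policy, seq_len=64):
    # History starts zero-filled and always holds the latest seq_len steps.
    history = [[0.0] * 7 for _ in range(seq_len)]
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        history = (history + [obs])[-seq_len:]
        action = policy(history)              # Actor decides from the sequence
        obs, reward, done = env.step(action)  # environment applies the action
        total += reward
    return total
```

In a real setup the `step` call would block on the ZeroMQ exchange with the NS3 process, and the collected transitions would be stored for the PPO update rather than only summed.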
[0160] When the reinforcement learning agent receives and processes observation data, the method further includes:
[0161] The Transformer architecture is introduced for network control signal modeling. By observing the transmitted data over a long period, the model captures long-term dependencies between TCP states, and a historical observation buffer is added. For example, the RTT is monitored to detect packet loss in the queue, and the congestion window is readjusted accordingly. These are chained temporal changes that a basic MLP, which sees only the current state, cannot capture.
[0162] The attention mechanism focuses on sudden spikes (such as instantaneous queue growth), improving control stability;
[0163] The self-attention mechanism of the Transformer architecture allows the reinforcement learning agent to focus dynamically on the most relevant parts of the input sequence; for example, the importance of RTT, loss, and throughput varies at different time points, and the Transformer's attention adapts to this more flexibly.
[0164] More refined PPO parameters and optimizer configurations are used to build a custom policy that integrates the Transformer extractor deeply with the PPO actor / critic, together with more stable optimizer hyperparameters (AdamW).
[0165] The reinforcement learning agent further ensures model stability by the following methods:
[0166] To avoid gradient explosion, fixed-temperature scaling is used, as shown in the following formula:
[0167] $e_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$;
[0168] where $e_{ij}$ is the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ is the $i$-th query vector, of dimension $d_k$, obtained from the input through a linear transformation; $k_j$ is the $j$-th key vector, of dimension $d_k$, obtained from the input through a linear transformation; $i$ and $j$ are position indices; $\cdot$ denotes the dot product; and $d_k$ is the dimension;
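A small numerical illustration of why the fixed scaling matters: without it, dot products grow with the vector dimension and the Softmax collapses onto a single key, leaving almost no gradient for the others. The vectors below are made-up values chosen only to make the effect visible.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64
q = [0.5] * d_k
keys = [[0.5] * d_k, [0.25] * d_k, [0.0] * d_k]

# Raw dot products are 16, 8 and 0 for these vectors.
raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
# Dividing by sqrt(d_k) = 8 brings them down to 2, 1 and 0.
scaled = [s / math.sqrt(d_k) for s in raw]

p_raw = softmax(raw)        # almost all weight on the first key
p_scaled = softmax(scaled)  # weight is spread across the keys
```

Here the unscaled weights put more than 99% of the mass on the first key, while the scaled weights leave roughly a third of the mass on the other two keys.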
[0169] Xavier initialization is performed. Each token of the input sequence $X$ is linearly projected as $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $X \in \mathbb{R}^{B \times L \times d_{\mathrm{model}}}$ is the input feature matrix, representing the feature vector at each position in the sequence; $B$ is the batch size, i.e., the number of sequence samples processed in each training / inference iteration; $L$ (`seq_len`) is the sequence length, i.e., the number of positions in each sample; $Q$, $K$, and $V$ are the query, key, and value matrices after linear projection; $W_Q$ is the linear transformation weight matrix of the query, of size $d_{\mathrm{model}} \times d_k$, used to project the input into the query space; $W_K$ is the linear transformation weight matrix of the key, of size $d_{\mathrm{model}} \times d_k$, likewise used to project the input into the key space; and $W_V$ is the linear transformation weight matrix of the value, of size $d_{\mathrm{model}} \times d_v$, likewise used to project the input into the value space.
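Xavier (Glorot) uniform initialization draws each weight from U(−a, a) with a = sqrt(6 / (fan_in + fan_out)), which keeps the variance of activations roughly constant across layers. A minimal sketch for one projection matrix follows; the sizes `d_model = 128` and `d_k = 32` are assumed values for illustration and are not taken from the source.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out matrix with Xavier-uniform entries."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
d_model, d_k = 128, 32  # assumed sizes for the sketch
W_q = xavier_uniform(d_model, d_k, rng)  # query projection, d_model x d_k
```

The same routine would be applied to the key and value projections; in PyTorch the equivalent in-place call is `torch.nn.init.xavier_uniform_`.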
[0170] A lightweight gating mechanism has been added to allow the model to automatically adjust the intensity of attention output.
[0171] When the reinforcement learning agent performs network state-aware feedforward modulation, it does not feed the raw network parameters directly into the model. Instead, through dedicated preprocessing and latent feature engineering, the model learns the combined meaning of these parameters and their trends over time, and thus perceives the true network state. This understanding of the network state is transformed into a dynamic modulation signal that responds differently to subtle changes in the network state. The modulation signal is then applied to key computational components within the model (such as the feedforward layer output). Through a combination of additive and multiplicative modulation, the signal dynamically calibrates the amplitude and characteristics of the model's output.
[0172] The interaction between the NS3 and Python environments uses a richer observation space with a more rational structure and clearer physical meaning; a suitable arrangement of observations benefits model learning. A throughput-zeroing safety mechanism has been added to the reward, making the reward mechanism more reasonable and helping the model focus on the most relevant parts of the input sequence. The observation data is processed consistently, and the feedback logic is more rigorous.
[0173] The network congestion control results of the reinforcement learning congestion control method based on Transformer for temporal modeling proposed in this embodiment are shown in Figure 3 and Figure 4. Figure 3 and Figure 4 show that the congestion control method using the Transformer architecture outperforms both the basic MLP method and traditional methods in terms of both ACK acknowledgment performance and packet loss rate. Furthermore, as the number of rounds in which NS3 sends environment state data increases and the amount of data available to the reinforcement learning agent grows, the reward fed back from the NS3 environment to the agent shows a generally upward trend, as shown in Figure 5. Therefore, the reinforcement learning congestion control method based on Transformer for temporal modeling proposed in this invention can effectively improve network congestion control performance.
[0174] Example 3
[0175] A computer device includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of the reinforcement learning congestion control method based on Transformer for time-series modeling as described in Embodiment 1 or 2.
[0176] Example 4
[0177] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the reinforcement learning congestion control method based on Transformer for time-series modeling as described in Embodiment 1 or 2.
Claims
1. A reinforcement learning congestion control method based on Transformer for temporal modeling, characterized in that it comprises an NS3 simulation environment process (NS3 client) and a Python process (Python client), the NS3 client and the Python client communicating through the inter-process communication mechanism provided by the ZeroMQ library; the specific steps are as follows:
Step 1: Perform network simulation on the NS3 terminal to generate simulated network data;
Step 2: Send the generated simulated network data to a simulated TCP-RL protocol, which simulates the sending, delay, packet loss, and ACK return of data packets;
Step 3: The congestion control RL environment on the Python side receives the simulated network data sent from the NS3 side, converts it into observation data, and sends it to the reinforcement learning agent;
Step 4: The reinforcement learning agent on the Python side analyzes the received observation data, forms a historical sequence, dynamically monitors the real-time state of data transmission, and returns the congestion control result of its analysis and decision, as an action in reinforcement learning, to the simulated network environment on the NS3 side; the actions in reinforcement learning include the current congestion control window size, bytes in flight, throughput, RTT round-trip time, RTT change rate, packet loss rate, and information acknowledgment signal rate;
Step 5: The action returned by the reinforcement learning agent is first passed to the reinforcement learning environment on the Python side and then sent to the NS3 side through the inter-process communication mechanism; the network environment on the NS3 side adjusts network data transmission according to the feedback of the reinforcement learning agent, thereby performing congestion control, and feeds the network control result back to the agent in the form of a reward;
Step 6: Iterative training: repeat steps 2 to 5 until the preset training termination condition is met;
The agent includes a multi-head attention mechanism, a robust Transformer encoder block, an enhanced TCP congestion control feature extractor, and a stable TCP congestion control Actor-Critic policy module; the specific process is as follows:
Step 4.1: Input feature construction and embedding mapping process; let the stable, controllable, and generalizable state representation that the reinforcement learning environment extracts from the raw network observation data at time $t$ be: $s_t \in \mathbb{R}^{7}$; a historical sequence of length $L$ is maintained, constituting the input tensor $X$: $X = [s_{t-L+1}, \dots, s_t] \in \mathbb{R}^{B \times L \times 7}$; where $B$ is the batch size and $L$ is the number of historical time steps; the raw input sequence is fed into the enhanced TCP congestion control feature extractor module for signal preprocessing; an embedding mapping function maps the low-dimensional, heterogeneous network metrics into a unified high-dimensional feature space, the mapping being expressed as: $Z = f_{\mathrm{emb}}(X)$; where $f_{\mathrm{emb}}(\cdot)$ is an embedding mapping network comprising a linear mapping, layer normalization, and a nonlinear activation function, and $Z \in \mathbb{R}^{B \times L \times D}$ is the output feature;
Step 4.2: Stabilized multi-head attention feature modeling; let the input features be $Z$; the query matrix, key matrix, and value matrix are constructed by linear projection: $Q = Z W_Q,\quad K = Z W_K,\quad V = Z W_V$; where $W_Q$, $W_K$, and $W_V$ are the parameter matrices of the linear transformations; the attention score of a single head in the multi-head attention is calculated as: $e_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$; where $d_k$ is the feature dimension of each attention head; $e_{ij}$ is the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ is the $i$-th query vector, obtained from the input $Z$ through the linear transformation $W_Q$; $k_j$ is the $j$-th key vector, of dimension $d_k$, obtained from the input $Z$ through the linear transformation $W_K$; $i$ and $j$ are position indices; $\cdot$ denotes the dot product; the attention score is normalized using a fixed scaling factor; a gating mechanism is added through which the attention output is passed, using $g = \sigma(\cdot)$, where $\sigma(\cdot)$ is the sigmoid function, applied element-wise and mapping to $(0, 1)$; residual connections and layer normalization are added: output = LayerNorm(output + query), where output represents the current state, query represents the adjustment given by historical experience, and LN denotes the layer normalization operation; subsequently, a Softmax operation is applied to the attention scores to obtain the attention weight matrix, which is used to form a weighted sum of the corresponding value vectors, giving the context feature representation; the outputs of the attention heads are recombined through a linear mapping and added to the input features via a residual connection; finally, layer normalization yields the attention layer output, the calculation being expressed as: $Z_{\mathrm{att}} = \mathrm{LN}(Z + \mathrm{MSA}(Z))$; where LN denotes the layer normalization operation and MSA denotes the stabilized multi-head attention computation; therefore, the stabilized multi-head attention mechanism can be summarized as: $\mathrm{SMHA}(Z) = \mathrm{LN}\big(Z + \sigma(\cdot) \odot (\mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V)\big)$; where SMHA denotes the stabilized multi-head attention mechanism, LN denotes the layer normalization operation, $Z$ is the input feature sequence, and $\odot$ denotes the Hadamard product;
Step 4.3: Feedforward network and network state-aware modulation mechanism; after the attention layer output $Z_{\mathrm{att}}$ is obtained, it is further fed into the robust Transformer encoder block for nonlinear transformation; the feedforward network consists of two linear mapping layers and a nonlinear activation function; the linear mapping layers comprise a linear layer + ReLU and a linear layer + Tanh, and the nonlinear activation function of the feedforward network is GELU; the output $F$ of the feedforward network is expressed as: $F = \mathrm{FFN}(Z_{\mathrm{att}})$; a network state-aware modulation mechanism is introduced at the output stage of the feedforward network, which generates a modulation factor $m$ from the attention output: $m = \tanh\big(f_{\mathrm{mod}}(Z_{\mathrm{att}})\big)$; where $f_{\mathrm{mod}}(\cdot)$ is a modulation network composed of linear transformations, and the Tanh function constrains the range of the modulation signal; finally, the output features of the feedforward network are dynamically adjusted through a combined additive and multiplicative modulation method, expressed as: ; the modulated features $\tilde{F}$ and the attention layer output are then passed through a residual connection and layer normalization once more to obtain the final output of a single Transformer encoder block: $Z_{\mathrm{out}} = \mathrm{LN}(Z_{\mathrm{att}} + \tilde{F})$;
Step 4.4: Stacking of Transformer feature encoder layers; let the output of the Transformer feature encoder be $H$: $H \in \mathbb{R}^{B \times L \times D}$; where $B$ is the batch size, $L$ is the length of the time series, and $D$ is the feature channel dimension;
Step 4.5: Temporal feature fusion and output representation construction; the Transformer feature encoder fuses its output features; specifically, it extracts the end-of-sequence feature and the average feature over the entire sequence separately: $h_{\mathrm{last}} = H_{:,L,:},\quad h_{\mathrm{mean}} = \dfrac{1}{L}\sum_{t=1}^{L} H_{:,t,:}$; where $H$ is the temporal feature tensor encoded by the stacked robust Transformer blocks and $L$ is the sequence length; $h_{\mathrm{last}}$ represents the current high-level network state and $h_{\mathrm{mean}}$ represents the historical trend; $h_{\mathrm{last}}$ and $h_{\mathrm{mean}}$ are concatenated along the feature dimension and fed into the feature fusion module to obtain the final feature representation $h$: $h = f_{\mathrm{fuse}}([h_{\mathrm{last}};\, h_{\mathrm{mean}}]) \in \mathbb{R}^{B \times D}$; the final output feature $h$ is fed into the Actor-Critic policy network to guide TCP congestion control decisions; where $B$ is the batch size, i.e., the number of TCP connections / environments processed in parallel, and $D$ is the number of feature channels.
2. The reinforcement learning congestion control method based on Transformer for temporal modeling as described in claim 1, characterized in that, in step 3, receiving the simulated network data sent from the NS3 terminal and converting it into observation data includes: `super().step()` handles the interaction with the NS3 client, passing the action taken by the agent to the NS3 client and executing one time step in the NS3 client; after the NS3 client completes the time step, it returns the current network state information to the Python client through the ZeroMQ library, and the raw network state information is stored in the variable `obs`; the `transform_obs(self, obs)` method receives the raw observation data `obs` returned by `super().step()`, which includes the network connection ID (socket ID), the slow start threshold `ssThresh`, the congestion window size `cWnd`, the number of bytes sent `bytesTX`, the number of bytes successfully received `bytesRx`, and the round-trip time `rtt`; it processes the raw observation data `obs` to extract the features needed by the agent and finally forms the agent's observation space.
3. The reinforcement learning congestion control method based on Transformer for temporal modeling as described in claim 1, characterized in that the reinforcement learning agent forms the historical sequence as follows: in the enhanced TCP congestion control feature extractor, `state_buffer` is a state buffer that stores the observation history of the most recent `seq_len` time steps; during each forward pass, the new observation is normalized and reshaped to `(batch_size, 1, input_dim)` through `unsqueeze(1)` and concatenated to the state buffer, where `batch_size` is the number of samples processed at one time during training or inference and `input_dim` is the feature dimension of each time-step observation, that is, the 7-dimensional vector output by `transform_obs`; after concatenation, the shape of `state_buffer` remains `(batch_size, seq_len, input_dim)`, and the latest `seq_len` time steps are always kept; along the time dimension, the oldest step is discarded, so `state_buffer` always stores the latest historical sequence, i.e., the formed historical sequence; the agent learns from the historical sequence to produce actions, which includes: the historical sequence `(batch_size, seq_len, input_dim)` is fed into the feature extraction layer of the policy network to extract temporal features; the Actor network outputs action probabilities or specific actions based on the feature vectors; the Critic network evaluates the value function of the current state; and the output action is sent to the NS3 terminal, which adjusts the network based on the action and returns new observations and rewards.
4. A reinforcement learning congestion control method based on Transformer for temporal modeling as described in any one of claims 1-3, characterized in that fixed-temperature scaling is used, with the formula: $e_{ij} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}}$; where $e_{ij}$ is the attention score between the $i$-th query and the $j$-th key, representing the degree of attention; $q_i$ is the $i$-th query vector, of dimension $d_k$, obtained from the input through a linear transformation; $k_j$ is the $j$-th key vector, of dimension $d_k$, obtained from the input through a linear transformation; $i$ and $j$ are position indices; $\cdot$ denotes the dot product; and $d_k$ is the dimension; Xavier initialization is performed, and each token of the input sequence $X$ is linearly projected as $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $X \in \mathbb{R}^{B \times L \times d_{\mathrm{model}}}$ is the input feature matrix, representing the feature vector at each position in the sequence; $B$ is the batch size, i.e., the number of sequence samples processed in each training / inference iteration; $L$ (`seq_len`) is the sequence length, i.e., the number of positions in each sample; $Q$, $K$, and $V$ are the query, key, and value matrices after linear projection; $W_Q$ is the linear transformation weight matrix of the query, of size $d_{\mathrm{model}} \times d_k$, used to project the input into the query space; $W_K$ is the linear transformation weight matrix of the key, of size $d_{\mathrm{model}} \times d_k$, likewise used to project the input into the key space; and $W_V$ is the linear transformation weight matrix of the value, of size $d_{\mathrm{model}} \times d_v$, likewise used to project the input into the value space.
5. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the reinforcement learning congestion control method based on Transformer for temporal modeling as described in any one of claims 1-4.
6. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the reinforcement learning congestion control method based on Transformer for temporal modeling as described in any one of claims 1-4.