A cooperative beamforming method for terahertz unmanned aerial vehicle networks based on Transformer and reinforcement learning
Using a Transformer-based reinforcement learning approach, the method constructs a high-fidelity terahertz physical layer model and a multi-agent optimization model to address the path loss and interference topology problems of terahertz UAV networks, achieving efficient and robust beamforming control and improving system performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing terahertz UAV networks struggle to achieve high reliability and energy efficiency when faced with high path loss, beam alignment sensitivity, and complex interference topologies. Traditional methods suffer from high computational complexity and lack the ability to perceive complex network topologies.
A high-fidelity terahertz physical layer environment model is constructed, and a Transformer-based reinforcement learning approach is applied to it. By combining multi-agent reinforcement learning with a distributional soft actor-critic (DSAC) algorithm and a self-attention mechanism, beamforming strategies are optimized to achieve global interference topology perception and intelligent control.
It significantly improves the system throughput and energy efficiency of terahertz UAV networks, can cope with mechanical vibration and interference, maintains good scalability and robustness, and reduces computational complexity.
Figure CN122247470A_ABST
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of wireless communication and artificial intelligence, and relates to a collaborative beamforming method for terahertz unmanned aerial vehicle networks based on Transformer and reinforcement learning.
Background Technology
[0002] With the advent of the sixth-generation (6G) mobile communication standard, non-terrestrial networks have become a core architecture for realizing the vision of ubiquitous connectivity. Unmanned aerial vehicles (UAVs), with their high mobility and line-of-sight links, are widely used to assist or supplement ground base stations. Meanwhile, the terahertz band, which can provide ultra-large bandwidths of tens of GHz, is a key enabler of terabit-per-second applications. To compensate for the severe path loss in the terahertz band, massive multiple-input multiple-output (MIMO) technology is regarded as an essential complement.
[0003] However, building highly reliable terahertz UAV networks faces significant challenges at the physical layer, the network layer, and the algorithmic level. At the physical layer, terahertz waves exhibit extremely high free-space path loss and frequency-selective molecular absorption loss. High-gain, pencil-shaped narrow beams can compensate for these losses, but such beams are highly sensitive to mechanical micro-jitter during UAV flight; even millimeter-level pointing deviations can cause severe link interruptions. At the network layer, when multiple UAVs are deployed at high density, sidelobe energy leakage between beams generates complex and dynamically changing interference topologies, which sharply reduces overall network energy efficiency and system throughput. At the algorithmic level, traditional optimization methods based on analytical models, such as semidefinite relaxation and fractional programming, have computational complexity that grows exponentially with antenna array size and rely heavily on perfect instantaneous channel state information. Existing deep reinforcement learning methods, such as soft actor-critic (SAC) algorithms built on simple multilayer perceptron architectures, mostly employ oversimplified physical models (for example, modeling the beam shape as a flat-top pattern). They cannot perceive complex network topologies and fail to capture the non-Euclidean interference dependencies in UAV networks.
Summary of the Invention
[0004] In view of this, the purpose of this invention is to provide a collaborative beamforming method for terahertz unmanned aerial vehicle networks based on Transformer and reinforcement learning.
[0005] To achieve the above objective, the present invention provides the following technical solution: a collaborative beamforming method for terahertz UAV networks based on Transformer and reinforcement learning, comprising the following steps:
S1: Construct a high-fidelity terahertz physical layer environment and beam model, including a large-scale channel power gain model that incorporates a molecular absorption database, a total array gain model based on a uniform planar array, and a Gaussian pointing error model characterizing the mechanical micro-jitter of the UAV.
S2: Construct a multi-agent reinforcement learning joint optimization model: model the joint beamwidth and power control problem of base stations and UAVs as a multi-agent partially observable Markov decision process, formalized as a partially observable stochastic game (POSG), and define the corresponding state space, action space, and composite reward function, which includes a quality-of-service (QoS) barrier penalty term.
S3: Train the model using a distributional soft actor-critic (DSAC) algorithm built on a Transformer backbone network. The backbone network extracts global interference topology features from local observation states through a residual multi-head self-attention mechanism, a distributional value-evaluation network quantifies the long-tail risk of the channel, and an adaptive entropy adjustment mechanism balances policy exploration and convergence.
[0006] Furthermore, in S1, the large-scale channel power gain model is expressed as:
[0007] where f is the carrier frequency, the distance term is the transmission distance between UAV i and base station m, c is the speed of light, and the absorption term is the molecular absorption coefficient, which depends on the frequency f and the water vapor concentration v.
[0008] Furthermore, in S1, the total array gain function G(φ, θ; β) is modeled as follows:
[0009] where the gain term is the gain of the antenna array, φ and θ are the azimuth and elevation angles, respectively, β is the beamwidth, the two element counts are the numbers of active antenna elements along the two dimensions of the uniform planar array, and the two phase quantities are the phase terms of the array factor.
[0010] Furthermore, in S2, the state space is defined as follows: the observation state of the i-th agent includes the three-dimensional coordinates of its own UAV, its traffic buffer load, the instantaneous signal-to-interference-plus-noise ratio (SINR) of the previous time slot, the relative distance and azimuth between UAV i and its associated base station m, and the identity vector used to establish the node mapping.
[0011] Furthermore, in S2, the action space is defined as follows: within a continuous space, each agent jointly controls the transmission power and the beamwidth for UAV i.
[0012] Furthermore, in S2, the composite reward function is expressed as:
[0013] where N is the total number of UAVs; the per-UAV terms are the throughput and the transmission power of UAV i; and the remaining parameters are the energy-efficiency weighting factor, the minimum rate threshold for quality of service, the actual communication rate of UAV i, and the penalty coefficient.
[0014] Furthermore, in S3, the residual multi-head self-attention mechanism calculates the attention weight matrix A using the following formula:
[0015] where Q is the query matrix, K is the key matrix, and the scaling term is the dimension of the vectors.
[0016] Furthermore, in S3, for the distributional value evaluation, a distributional critic network is constructed using quantile regression, and its network parameters are updated by minimizing the quantile Huber loss function:
[0017] where D is the experience replay buffer, K is the number of quantiles, and the remaining terms are the Huber loss function, the temporal-difference target, and the output value of the critic network at the given quantile.
[0018] Furthermore, in S3, the adaptive entropy adjustment mechanism treats the temperature coefficient as a learnable parameter, whose objective function is expressed as:
[0019] where the policy term denotes the policy of the policy network in time slot t, the action and state terms denote the action and state in time slot t, respectively, and the last term is the target entropy.
[0020] A cooperative beamforming system for implementing the method, the system comprising:
an environment modeling module, used to construct the high-fidelity terahertz physical layer environment and beam model described in S1;
a multi-agent reinforcement learning optimization modeling module, used to construct the multi-agent reinforcement learning joint optimization model described in S2;
a training and decision-making module, comprising an actor network based on a Transformer backbone network, a distributional critic network based on quantile regression, and an adaptive entropy adjustment unit, which performs the global interference topology feature extraction, long-tail risk value evaluation, and policy optimization training described in S3, and outputs action instructions for the jointly controlled transmit power and beamwidth;
a beam execution module, used to control the base station antenna array, according to the transmit power and beamwidth output by the training and decision-making module, to perform downlink beamforming transmission to the UAV.
[0021] The beneficial effects of this invention are as follows: Firstly, regarding robustness and reliability, this invention introduces a high-fidelity terahertz channel model and array model and explicitly quantifies the impact of mechanical micro-jitter, enabling the proposed method to address the risk of link interruption caused by narrow-beam pointing deviation. The distributional reinforcement learning framework can perceive extreme risks from the distribution of action values and mitigate them, thereby selecting a more robust, jitter-tolerant beam control strategy.
[0022] Secondly, regarding interference coordination and global perception, compared to traditional methods that rely solely on multilayer perceptrons, this invention employs a Transformer-based backbone network architecture. This architecture, through a self-attention mechanism, can effectively reconstruct long-range, non-Euclidean global interference coupling relationships within the network from the local observation information of each agent, thereby achieving implicit coordination among agents and accurately identifying and suppressing potential sidelobe interference.
[0023] Third, regarding overall system performance, the collaborative method proposed in this invention can significantly improve the comprehensive performance of terahertz UAV networks. In complex scenarios with strong sidelobe interference and random mechanical jitter, compared with existing reinforcement learning methods based on multilayer perceptrons, this invention can achieve substantial performance improvements in both total system throughput and network energy efficiency.
[0024] Finally, regarding policy intelligence and scalability, the composite reward function designed in this invention includes a service quality barrier penalty term. This mechanism enables the model to autonomously learn intelligent connection access control strategies, actively silencing edge nodes operating under extremely poor channel conditions, thereby fundamentally eliminating their ineffective sidelobe interference to the entire network. This characteristic allows the algorithm to maintain good scalability and robustness even when facing large-scale, high-node-density networks, while significantly reducing the number of model parameters compared to traditional schemes.
[0025] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.
Attached Figure Description
[0026] To make the objectives, technical solutions, and advantages of the present invention clearer, the preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, wherein:
Figure 1 is a system model of a terahertz massive MIMO unmanned aerial vehicle network;
Figure 2 is the network architecture based on the Tr-DSAC algorithm;
Figure 3 shows comparison curves of training convergence performance for different algorithms;
Figure 4 is a comparison of network connection topologies;
Figure 5 is a weight heatmap of the attention mechanism;
Figure 6 is an ablation experiment verification diagram;
Figure 7 is a scalability analysis diagram.
Detailed Implementation
[0027] The following specific examples illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of the present invention. Unless otherwise specified, the following embodiments and features can be combined with each other.
[0028] The accompanying drawings are for illustrative purposes only and are schematic diagrams, not actual pictures. They should not be construed as limiting the invention. To better illustrate the embodiments of the invention, some parts in the drawings may be omitted, enlarged, or reduced, and do not represent the actual product dimensions. It is understandable to those skilled in the art that some well-known structures and their descriptions may be omitted in the drawings.
[0029] In the accompanying drawings of the embodiments of the present invention, the same or similar reference numerals correspond to the same or similar components. In the description of the present invention, it should be understood that if terms such as "upper," "lower," "left," "right," "front," and "rear" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, they are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, the terms used to describe positional relationships in the drawings are only for illustrative purposes and should not be construed as limiting the present invention. For those skilled in the art, the specific meaning of the above terms can be understood according to the specific circumstances.
[0030] This invention provides a cooperative beamforming method for terahertz UAV networks based on Transformer and reinforcement learning, the specific steps of which are as follows:
Step 1: Construct a high-fidelity terahertz physical layer environment and beam model
Using the HITRAN 2020 molecular absorption database, the large-scale channel power gain is modeled as follows:
[0031] where the model parameters are the carrier frequency, the transmission distance, the speed of light, and the molecular absorption coefficient, which depends on the frequency and the water vapor concentration.
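The formula itself is not reproduced in this text. As a hedged illustration only, a form commonly used for terahertz line-of-sight links and consistent with the four listed parameters is h(f, d) = (c / (4πfd))² · exp(−k(f, v) · d), that is, free-space spreading loss multiplied by molecular absorption loss. The sketch below assumes this form; the function name, the 300 GHz carrier, and the absorption coefficient value are illustrative assumptions, and a real model would look the coefficient up from HITRAN-derived data.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def channel_power_gain(f_hz: float, d_m: float, k_abs: float) -> float:
    """Large-scale power gain (linear scale) under the ASSUMED form
    (c / (4*pi*f*d))**2 * exp(-k_abs * d).

    k_abs is the molecular absorption coefficient in 1/m. In a full model it
    would depend on frequency and water vapor concentration; here it is
    passed in directly.
    """
    spreading = (C / (4.0 * math.pi * f_hz * d_m)) ** 2  # free-space spreading loss
    absorption = math.exp(-k_abs * d_m)                  # molecular absorption loss
    return spreading * absorption


def to_db(x: float) -> float:
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(x)


# Illustrative values only (not taken from the patent).
gain = channel_power_gain(f_hz=300e9, d_m=50.0, k_abs=0.01)
print(f"gain at 300 GHz over 50 m: {to_db(gain):.1f} dB")
```

With the absorption coefficient set to zero the expression reduces to the Friis free-space term, which is a convenient sanity check when plugging in tabulated absorption data.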
[0032] Meanwhile, considering that the base station is equipped with a uniform planar array (UPA), the system's total array gain function is modeled as the product of the element gain and the array factor AF:
[0033] A Gaussian pointing error model is introduced to characterize the mechanical micro-jitter of the UAV during dynamic hovering: the pointing error relative to the target direction follows a zero-mean Gaussian distribution with a specified variance.
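To make the effect of jitter concrete, the following sketch draws zero-mean Gaussian pointing errors and averages the resulting relative main-lobe gain. The Gaussian-shaped main-lobe approximation exp(−4 ln 2 · (ε/β)²), in which the gain halves at an error of half the beamwidth β, is a simplifying assumption used here in place of the full array model; the beamwidths and jitter standard deviation are made-up values.

```python
import math
import random


def mainlobe_relative_gain(err_rad: float, beamwidth_rad: float) -> float:
    """Relative gain at pointing error err_rad for a Gaussian-shaped main lobe
    whose half-power beamwidth is beamwidth_rad (assumed approximation)."""
    return math.exp(-4.0 * math.log(2.0) * (err_rad / beamwidth_rad) ** 2)


def mean_gain_under_jitter(beamwidth_rad, sigma_rad, n_samples=10_000, seed=0):
    """Monte Carlo average of the relative gain when the pointing error is
    drawn from a zero-mean Gaussian with standard deviation sigma_rad."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(n_samples):
        err = rng.gauss(0.0, sigma_rad)
        total += mainlobe_relative_gain(err, beamwidth_rad)
    return total / n_samples


sigma = math.radians(0.5)                                   # illustrative jitter level
narrow = mean_gain_under_jitter(math.radians(1.0), sigma)   # pencil beam
wide = mean_gain_under_jitter(math.radians(5.0), sigma)     # wider beam
print(f"mean relative gain: narrow={narrow:.3f}, wide={wide:.3f}")
```

Under the same jitter, the narrow beam retains less of its peak gain on average, which is why beamwidth is a useful control variable alongside power.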
[0034] Step 2: Construct a multi-agent reinforcement learning joint optimization model
The joint beamwidth and power control problem of base stations and UAVs is modeled as a multi-agent partially observable Markov decision process, formalized as a partially observable stochastic game (POSG):
1. State space: the observation state of the i-th agent includes the UAV's 3D coordinates, the traffic buffer load, the instantaneous SINR of the previous time slot, the relative distance and azimuth angle between the UAV and the base station, and a one-hot ID vector used to break symmetry and establish the node mapping.
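A minimal sketch of how such an observation vector might be assembled is shown below. The function name `build_observation`, the field order, and the units are hypothetical; the text lists the fields but does not fix a layout.

```python
import math


def build_observation(uav_xyz, bs_xyz, buffer_load, prev_sinr_db, uav_index, num_uavs):
    """Concatenate the per-agent features into one flat observation vector.

    Layout (an assumption): [x, y, z, buffer, previous SINR, distance,
    azimuth, one-hot ID of length num_uavs].
    """
    dx = uav_xyz[0] - bs_xyz[0]
    dy = uav_xyz[1] - bs_xyz[1]
    dz = uav_xyz[2] - bs_xyz[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)  # UAV to base station
    azimuth = math.atan2(dy, dx)                       # horizontal bearing in radians
    # One-hot ID distinguishes otherwise identical agents.
    one_hot = [1.0 if j == uav_index else 0.0 for j in range(num_uavs)]
    return [*uav_xyz, buffer_load, prev_sinr_db, distance, azimuth, *one_hot]


obs = build_observation(
    uav_xyz=(3.0, 4.0, 0.0), bs_xyz=(0.0, 0.0, 0.0),
    buffer_load=0.7, prev_sinr_db=12.5, uav_index=1, num_uavs=4,
)
print(len(obs), obs)
```

The one-hot suffix makes the observation length grow with the number of UAVs, so a fixed maximum network size (or a learned ID embedding) would be needed if the agent count varies.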
[0035] 2. Action space: each agent jointly controls the transmission power and the beamwidth within a continuous space.
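In continuous-control implementations, the policy usually emits values in [−1, 1] that are then mapped to physical ranges. The sketch below illustrates that mapping for the two action components; the power and beamwidth ranges are placeholder assumptions, not values from the patent.

```python
def scale_action(raw: float, low: float, high: float) -> float:
    """Affinely map a squashed action in [-1, 1] onto [low, high]."""
    clipped = max(-1.0, min(1.0, raw))  # guard against out-of-range inputs
    return low + 0.5 * (clipped + 1.0) * (high - low)


def decode_action(raw_power, raw_beamwidth,
                  p_range_w=(0.0, 1.0), bw_range_deg=(1.0, 30.0)):
    """Turn the two raw policy outputs into (transmit power in W, beamwidth in degrees).

    Both ranges are hypothetical defaults chosen for illustration.
    """
    power_w = scale_action(raw_power, *p_range_w)
    beamwidth_deg = scale_action(raw_beamwidth, *bw_range_deg)
    return power_w, beamwidth_deg


power_w, beamwidth_deg = decode_action(0.0, -1.0)
print(power_w, beamwidth_deg)
```

Allowing the power range to start at zero lets the policy silence a link entirely, which matches the access-control behavior the reward design is meant to encourage.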
[0036] 3. Composite Reward Function: Design a composite reward function that includes a QoS barrier penalty term:
[0037] When the actual rate falls below the minimum threshold, the nonlinear operator is activated and a large negative penalty is imposed. This forces the agent to learn an active "connection access control (CAC)" strategy.
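Because the reward formula is not reproduced in this text, the following is only one plausible reading of the description: total throughput, minus a weighted energy cost, minus a hinge-style penalty that activates when a UAV's rate falls below the QoS threshold. The hinge operator max(0, ·), the parameter names `lam` and `eta`, and the numbers in the example are all assumptions.

```python
def composite_reward(rates, powers, r_min, lam, eta):
    """Assumed composite reward:
        sum_i R_i  -  lam * sum_i P_i  -  eta * sum_i max(0, r_min - R_i)

    rates  : per-UAV achieved rates
    powers : per-UAV transmit powers
    r_min  : minimum QoS rate threshold
    lam    : energy-efficiency weighting factor
    eta    : penalty coefficient for QoS violations
    """
    throughput = sum(rates)
    energy_cost = lam * sum(powers)
    # The penalty is zero unless a rate drops below the threshold.
    qos_penalty = eta * sum(max(0.0, r_min - r) for r in rates)
    return throughput - energy_cost - qos_penalty


ok = composite_reward([2.0, 3.0], [1.0, 1.0], r_min=1.0, lam=0.5, eta=10.0)
violated = composite_reward([2.0, 0.5], [1.0, 1.0], r_min=1.0, lam=0.5, eta=10.0)
print(ok, violated)
```

With a large penalty coefficient, one under-served UAV can dominate the reward, which is what pushes the agents to change how weak links are served.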
[0038] Step 3: Design and train a global interference sensing network based on the Tr-DSAC algorithm
Design an actor-critic architecture that incorporates a Transformer backbone network to extract the global interference topology:
1. Residual multi-head self-attention mechanism: the original state is mapped to a high-dimensional feature sequence, and long-range interference dependencies are captured through multi-head self-attention (MHSA). The attention weight matrix is calculated as follows:
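The standard scaled dot-product form, A = softmax(Q Kᵀ / √d), matches the quantities named in the text (query matrix, key matrix, vector dimension) and is assumed in the single-head sketch below. The toy Q and K matrices are illustrative; in the described network they would come from learned linear projections of the per-UAV embeddings.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention_weights(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d), where d is the key dimension."""
    d = len(K[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]


# Three nodes with 2-dimensional queries and keys (toy values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_weights(Q, K)
for row in A:
    print([round(w, 3) for w in row])
```

Each row of A sums to one, so entry (i, j) can be read as how strongly node i attends to node j, which is the quantity later interpreted as interference coupling strength.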
[0039] Combining residual connections and layer normalization (LayerNorm), the output features of the backbone network are:
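One common arrangement, assumed here because the formula is not shown, is the post-norm residual block Z = LayerNorm(X + Sublayer(X)). The sketch applies it to a single feature vector and omits LayerNorm's learnable scale and shift for brevity; the stand-in sublayer is a placeholder for the attention output.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (nearly) unit variance.
    The learnable gain and bias of a full LayerNorm are omitted."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Assumed post-norm residual form: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Stand-in sublayer (hypothetical); a real block would call self-attention here.
out = residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [0.1 * t for t in v])
print([round(v, 3) for v in out])
```

The skip connection keeps the original embedding available to later layers even when the attention output is small, which helps gradients flow early in training.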
[0040] 2. Distributional value evaluation: to address the long-tail channel risk caused by jitter, a distributional critic network is constructed using quantile regression. The critic parameters are updated by minimizing the quantile Huber loss:
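The sketch below follows the quantile Huber loss as it is usually defined in quantile-regression-based distributional reinforcement learning: each predicted quantile is compared with every target sample, and the Huber error is weighted asymmetrically by the quantile level. The midpoint quantile levels, the threshold `kappa`, and the averaging convention are assumptions; the patent's exact expression is not reproduced in this text.

```python
def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear beyond kappa."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred_quantiles, target_quantiles, kappa: float = 1.0) -> float:
    """Quantile Huber loss between K predicted quantile values and a set of
    temporal-difference target samples (single transition, no batching)."""
    k = len(pred_quantiles)
    taus = [(i + 0.5) / k for i in range(k)]  # midpoint quantile levels (assumed)
    total = 0.0
    for tau, theta in zip(taus, pred_quantiles):
        for y in target_quantiles:
            u = y - theta                      # TD error for this pair
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa) / kappa
    return total / len(target_quantiles)


zero = quantile_huber_loss([1.0, 1.0], [1.0, 1.0])
under = quantile_huber_loss([0.0, 0.0], [1.0])
print(zero, under)
```

Because low and high quantiles are penalized asymmetrically, the critic learns the spread of returns rather than only their mean, which is what allows rare deep fades to be represented.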
[0041] 3. Adaptive entropy adjustment mechanism (Auto-Alpha): the temperature coefficient is treated as a learnable parameter. The algorithm maintains high entropy in the early stages of training to encourage extensive exploration and gradually reduces it in the later stages to achieve fine-grained convergence in beam control. The objective function is:
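The usual soft actor-critic temperature objective, J(α) = E[−α (log π(a|s) + H̄)] with target entropy H̄, fits the quantities the text names and is assumed in this sketch. Optimizing log α rather than α (to keep α positive), the learning rate, and the target entropy of −2 (the common "negative action dimension" heuristic for a two-dimensional action) are assumptions.

```python
import math


def temperature_loss(log_alpha, log_probs, target_entropy):
    """Assumed objective: mean over samples of -alpha * (log_pi + target_entropy)."""
    alpha = math.exp(log_alpha)
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / len(log_probs)


def temperature_step(log_alpha, log_probs, target_entropy, lr=1e-2):
    """One gradient-descent step on log_alpha for the loss above."""
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term  # derivative of the loss with respect to log_alpha
    return log_alpha - lr * grad


TARGET_ENTROPY = -2.0  # heuristic: minus the action dimension (power, beamwidth)

# Policy entropy above target (very negative log-probs): temperature shrinks.
lower = temperature_step(0.0, [-5.0, -4.0], TARGET_ENTROPY)
# Policy entropy below target (large log-densities): temperature grows.
higher = temperature_step(0.0, [3.0, 4.0], TARGET_ENTROPY)
print(lower, higher)
```

The update raises the temperature whenever the policy becomes more deterministic than the target allows, which counteracts premature collapse onto a single beam setting.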
[0042] Figure 1 depicts a terahertz massive MIMO UAV communication scenario; the green area represents the high-gain main lobe link, and the red area represents the sidelobe interference link that causes performance degradation.
[0043] Figure 2 shows the overall network architecture of the Tr-DSAC algorithm, including an input layer that incorporates geometric feature embedding, a residual Transformer backbone network with global interference awareness, and actor and critic head networks based on distributional reinforcement learning.
[0044] Figure 3 presents the convergence curves of different algorithms during the training phase, showing that the present invention converges faster and has stronger exploration capabilities.
[0045] Figure 4 is a network connection topology comparison diagram; it demonstrates the ability of the Tr-DSAC algorithm to actively mute weak-channel nodes to eliminate ineffective sidelobe interference.
[0046] Figure 5 presents the weight heatmap of the attention mechanism, which verifies the interpretability of the model in mapping physical-space interference to the attention weight matrix.
[0047] Figure 6 and Figure 7 are, respectively, the ablation experiment verification diagram and the scalability analysis diagram of algorithm throughput performance under different network node sizes.
[0048] This embodiment provides an execution device (such as a base station controller or airborne computing platform) for the Transformer-based reinforcement learning terahertz beamforming method, which operates according to the following steps:
1. Environment initialization and information perception: in each time slot, the system collects the physical state information of the base stations and UAVs in the current network and constructs an observation feature vector.
[0049] 2. Feature embedding and global interference topology extraction: the device feeds the original state into a local or central actor network, maps it to feature embeddings through linear layers, and feeds the embeddings into the multi-head self-attention layer. The attention weight matrix between UAV nodes is calculated according to the attention formula, and this matrix is used to quantify the potential interference coupling strength between nodes.
[0050] 3. Residual processing and action output: backbone features are obtained through residual connection and layer normalization. The actor then outputs the mean and log standard deviation of the Gaussian policy in parallel, and the action vector of the current time slot, namely the transmit power and beamwidth, is sampled using the reparameterization trick.
[0051] 4. Physical layer execution and environment interaction: the base station performs downlink beamforming transmission based on the generated transmit power and beamwidth. Signal propagation is subject to the HITRAN-based large-scale fading and array factor models. The system calculates the immediate reward from the actual throughput and QoS satisfaction using the barrier-penalty reward formula.
[0052] 5. Model training and adaptive evolution: during the training phase, the critic network evaluates the value distribution, including long-tail risk, and its parameters are updated by minimizing the quantile Huber loss; simultaneously, the temperature coefficient is dynamically updated by gradient descent. This mitigates policy collapse during narrow-beam alignment, and training continues until the network converges to the optimal joint power and beamwidth control strategy.
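The action-sampling part of step 3 in the procedure above can be sketched as follows. The sketch assumes a tanh-squashed diagonal Gaussian policy, which is the usual choice in soft actor-critic implementations but is not stated in the text; the function name and the small constant inside the logarithm are likewise assumptions.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, eps=None, rng=random):
    """Reparameterized sample a = tanh(mean + std * eps) with eps ~ N(0, 1).

    Returns the action (each component in (-1, 1)) and its log-probability,
    including the change-of-variables correction for the tanh squashing.
    """
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mean]  # noise independent of parameters
    action, log_prob = [], 0.0
    for mu, ls, e in zip(mean, log_std, eps):
        u = mu + math.exp(ls) * e                  # pre-squash Gaussian sample
        a = math.tanh(u)
        log_prob += -0.5 * e * e - ls - 0.5 * math.log(2.0 * math.pi)
        log_prob -= math.log(1.0 - a * a + 1e-6)   # tanh Jacobian correction
        action.append(a)
    return action, log_prob


# Two action components: raw transmit power and raw beamwidth.
fixed_action, fixed_logp = sample_squashed_gaussian([0.0, 0.5], [0.0, -1.0], eps=[0.0, 0.0])
random_action, _ = sample_squashed_gaussian([0.0, 0.5], [0.0, -1.0], rng=random.Random(0))
print(fixed_action, fixed_logp, random_action)
```

Because the noise is drawn independently of the policy parameters, gradients can flow through the mean and log standard deviation, which is the point of the reparameterization trick.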
[0053] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.
Claims
1. A cooperative beamforming method for terahertz UAV networks based on Transformer and reinforcement learning, characterized in that it includes the following steps:
S1: Construct a high-fidelity terahertz physical layer environment and beam model, including a large-scale channel power gain model that incorporates a molecular absorption database, a total array gain model based on a uniform planar array, and a Gaussian pointing error model characterizing the mechanical micro-jitter of the UAV.
S2: Construct a multi-agent reinforcement learning joint optimization model: model the joint beamwidth and power control problem of base stations and UAVs as a multi-agent partially observable Markov decision process, formalized as a partially observable stochastic game (POSG), and define the corresponding state space, action space, and composite reward function, which includes a quality-of-service (QoS) barrier penalty term.
S3: Train the model using a distributional soft actor-critic (DSAC) algorithm built on a Transformer backbone network. The backbone network extracts global interference topology features from local observation states through a residual multi-head self-attention mechanism, a distributional value-evaluation network quantifies the long-tail risk of the channel, and an adaptive entropy adjustment mechanism balances policy exploration and convergence.
2. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S1, the large-scale channel power gain model is expressed as: where f is the carrier frequency, the distance term is the transmission distance between UAV i and base station m, c is the speed of light, and the absorption term is the molecular absorption coefficient, which depends on the frequency f and the water vapor concentration v.
3. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S1, the total array gain function G(φ, θ; β) is modeled as follows: where the gain term is the gain of the antenna array, φ and θ are the azimuth and elevation angles, respectively, β is the beamwidth, the two element counts are the numbers of active antenna elements along the two dimensions of the uniform planar array, and the two phase quantities are the phase terms of the array factor.
4. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S2, the state space is defined as follows: the observation state of the i-th agent includes the three-dimensional coordinates of its own UAV, its traffic buffer load, the instantaneous signal-to-interference-plus-noise ratio (SINR) of the previous time slot, the relative distance and azimuth between UAV i and its associated base station m, and the identity vector used to establish the node mapping.
5. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S2, the action space is defined as follows: within a continuous space, each agent jointly controls the transmission power and the beamwidth for UAV i.
6. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S2, the composite reward function is expressed as: where N is the total number of UAVs; the per-UAV terms are the throughput and the transmission power of UAV i; and the remaining parameters are the energy-efficiency weighting factor, the minimum rate threshold for quality of service, the actual communication rate of UAV i, and the penalty coefficient.
7. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S3, the residual multi-head self-attention mechanism calculates the attention weight matrix A using the following formula: where Q is the query matrix, K is the key matrix, and the scaling term is the dimension of the vectors.
8. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S3, for the distributional value evaluation, a distributional critic network is constructed using quantile regression, and its network parameters are updated by minimizing the quantile Huber loss function: where D is the experience replay buffer, K is the number of quantiles, and the remaining terms are the Huber loss function, the temporal-difference target, and the output value of the critic network at the given quantile.
9. The terahertz UAV network cooperative beamforming method based on Transformer and reinforcement learning according to claim 1, characterized in that: in S3, the adaptive entropy adjustment mechanism treats the temperature coefficient as a learnable parameter, whose objective function is expressed as: where the policy term denotes the policy of the policy network in time slot t, the action and state terms denote the action and state in time slot t, respectively, and the last term is the target entropy.
10. A cooperative beamforming system for implementing the method as described in any one of claims 1 to 9, characterized in that the system includes:
an environment modeling module, used to construct the high-fidelity terahertz physical layer environment and beam model described in S1;
a multi-agent reinforcement learning optimization modeling module, used to construct the multi-agent reinforcement learning joint optimization model described in S2;
a training and decision-making module, comprising an actor network based on a Transformer backbone network, a distributional critic network based on quantile regression, and an adaptive entropy adjustment unit, which performs the global interference topology feature extraction, long-tail risk value evaluation, and policy optimization training described in S3, and outputs action instructions for the jointly controlled transmit power and beamwidth;
a beam execution module, used to control the base station antenna array, according to the transmit power and beamwidth output by the training and decision-making module, to perform downlink beamforming transmission to the UAV.