Reinforcement learning based array beam adaptive method
By employing a reinforcement learning-based array beam adaptation method, which combines graph neural networks and multi-agent reinforcement learning, the beamforming weights are dynamically updated. This solves the problems of array element gain, phase drift, and radio frequency interference, achieving adaptive capabilities with stable main lobe gain, controlled side lobes, and guaranteed null depth, thus improving the performance of the radio array system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGZHOU INST OF TECH
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
In existing radio array systems, the gain and phase of array elements are dynamically affected by temperature drift, device aging, and mismatch. This makes it difficult for fixed weights to meet the main lobe gain and side lobe constraints in the long term. The null depth and position in the direction of radio frequency interference are difficult to maintain in a controllable manner. Furthermore, the computational load is large and it is difficult to achieve low-latency online updates.
By employing a reinforcement learning-based approach, combining graph neural networks and multi-agent reinforcement learning, dynamic array graph modeling and distributed decision-making are used to dynamically update beamforming weights. Combined with health gating and constrained projection layers, this achieves adaptive capabilities with stable main lobe gain, controlled side lobes, and guaranteed null depth.
It maintains stable main lobe gain, reduces side lobe contamination, improves signal-to-noise ratio and observation efficiency, is compatible with hardware implementation constraints, and has distributed real-time adaptive capabilities under dynamic changes in array element state and the presence of radio frequency interference.
Figure CN122241081A_ABST
Description
Technical Field
[0001] This invention relates to the field of array signal processing and adaptive beamforming, and more particularly to an array beamforming adaptive method based on reinforcement learning.
Background Technology
[0002] Radio array systems acquire spatial signals through numerous array elements or subarrays and apply complex weights to each element's signal in the digital domain to achieve beamforming. The aim is to obtain high main lobe gain in the target direction and suppress sidelobes in non-target directions. With the increasing scale of radio telescope arrays and the expansion of observation frequency bands, digital beamforming technology has evolved from early fixed weighting and offline calibration to schemes such as online compensation based on array calibration, adaptive beamforming based on statistical criteria, and constrained null control. For example, in engineering practice, periodic calibration is often used to correct element amplitude and phase errors, and methods such as least mean squares, recursive least squares, or minimum variance distortionless response are used to adaptively adjust weights when interference exists. To meet the requirements of sidelobe templates or interference suppression, constrained beamforming methods such as linearly constrained minimum variance and convex optimization have also emerged. In recent years, research on neural networks for direction estimation, array state characterization, and beam control has also gradually increased, with distributed collaborative optimization and consensus updates used to reduce the computational and communication burden on the central node.
[0003] Existing technologies still have shortcomings in long-term operation scenarios for radio arrays, mainly in the following aspects:
[0004] First, the gain and phase of the array elements are dynamically affected by factors such as temperature drift, device aging, mismatch, and array element failure. Traditional fixed weights, or methods that rely on manual adjustment and periodic recalibration, cannot track these changes in a timely manner, making it difficult to keep the main lobe gain and side lobe indicators stable over the long term.
[0005] Secondly, the suppression of radio frequency interference usually relies on a stable array model or accurate estimation of the interference incident direction. When the interference direction changes, the estimation is uncertain, or the array element state drifts, the null depth and position are difficult to maintain in a controllable manner, which can easily lead to sidelobe contamination and reduce the signal-to-noise ratio and observation efficiency.
[0006] Third, solving for weights under multiple constraints such as sidelobes and nulls often involves a large amount of computation and is mostly done by centralized optimization, making low-latency online updates difficult in large-scale arrays. Such methods also lack compatibility with hardware implementation constraints such as constant modulus and phase quantization, which hinders engineering implementation.
[0007] Therefore, there is a need for an array beam adaptive method that can address the shortcomings of the existing technologies.
Summary of the Invention
[0008] One objective of this invention is to propose a reinforcement learning-based array beam adaptation method. In existing technologies, the gain and phase of array elements drift dynamically due to temperature drift, aging, mismatch, and element failure; fixed weights or manual parameter tuning struggle to consistently meet the main lobe gain and side lobe constraints; and controllable nulls and stable suppression of radio frequency interference cannot be achieved. To address these problems, this invention proposes a technical solution that combines graph neural network state representation based on dynamic array graph modeling with distributed incremental decision-making using multi-agent reinforcement learning. The solution characterizes changes in array element correlation through online edge weight updates, suppresses the impact of faulty and aging nodes on decision-making through health gating, corrects the weights under main lobe, side lobe, null, and hardware implementation constraints through a constraint projection layer, and achieves distributed consistent updates by combining primal-dual iteration with dual variable broadcasting. This invention thereby maintains stable main lobe gain, controlled side lobes, guaranteed null depth, and distributed real-time adaptive capability under long-term operation and interference conditions.
[0009] This invention provides a reinforcement learning-based array beam adaptation method, comprising:
[0010] S1. Collect the array element received signals, the calibration data, and the data characterizing the temperature state of the array elements and perform preprocessing; estimate the gain drift and phase drift of each array element; and estimate the set of RF interference incident directions from the array element received signals, to obtain the observation dataset and the set of RF interference incident directions.
S2. Dynamically update the initial array graph based on the observation dataset to generate the current array graph. The initial array graph uses array elements or subarrays as nodes, and the geometric adjacency or mutual coupling relationships between nodes as edges.
S3. Perform graph neural network feature extraction on the current array graph to generate a set of local state vectors, and calculate the set of health indicators from the node feature vectors.
S4. Based on the set of local state vectors, the set of health indicators, the set of RF interference incident directions, and the dual variables retained from the previous adaptive update time, perform distributed decision-making with a distributed policy network, in which each agent generates the beamforming weight increment for its corresponding array element or subarray, resulting in a set of weight increments.
S5. Superimpose the set of weight increments onto the beamforming weights from the previous adaptive update time to generate provisional beamforming weights.
S6. Based on the set of health indicators, suppress the update amplitude of the weights in the provisional beamforming weights that correspond to array elements or subarrays with lower health.
S7. Calculate the array pattern from the provisional beamforming weights, and have the constraint projection layer perform constraint projection on the provisional beamforming weights to generate the current beamforming weights.
S8. Apply the current beamforming weights to the beamforming process to form the output beam and collect the beam output signal; in combination with the array pattern, calculate the main lobe direction gain, the side lobe level, and the null depth for each RF interference incident direction in the set of RF interference incident directions; generate the reward value from the main lobe direction gain, the side lobe level, and each null depth; and generate the constraint residual vector from the side lobe level and each null depth.
S9. Update the dual variables according to the Lagrange multiplier iteration rule based on the constraint residual vector, update the distributed policy network based on the reward value and the updated dual variables, and retain the updated dual variables, the updated distributed policy network, and the current beamforming weights as the dual variables, the distributed policy network, and the beamforming weights of the previous adaptive update time.
[0011] Optionally, S1 includes:
[0012] The array element received signals are synchronously acquired within the preset observation time slot, and bandpass filtering, amplitude normalization, and time-domain framing are performed on them to obtain the preprocessed array element received signals.
[0013] A preset calibration signal is extracted from the calibration data. The complex gain estimate of each array element is calculated based on the preprocessed array element received signal and the calibration signal. The complex gain estimate is then compared with the complex gain reference value in the reference calibration result to obtain the gain drift and the phase drift.
[0014] A time-frequency transformation is applied to the preprocessed array element received signals to obtain a set of frequency domain snapshots. The spatial covariance matrix is calculated from the set of frequency domain snapshots. Based on the spatial covariance matrix, the direction estimation neural network outputs the radio frequency interference incident directions and their confidence levels. The radio frequency interference incident directions whose confidence levels are not less than a preset confidence threshold constitute the set of radio frequency interference incident directions.
[0015] The gain drift, phase drift, data characterizing the temperature state of the array elements, and spatial covariance matrix are compiled into the observation dataset.
[0016] Optionally, S2 includes:
[0017] Based on the observation dataset, a node feature vector is calculated for each node in the initial array graph. The node feature vector includes the gain drift, the phase drift, the data characterizing the temperature state, and the calibration residual of the corresponding array element or subarray.
[0018] For any two nodes connected by an edge in the initial array graph, the cross-correlation value of the received signals of the array elements corresponding to the two nodes is calculated based on the observation dataset, along with the temperature difference and the calibration residual difference between the two nodes.
[0019] The edge weight is generated based on the cross-correlation value, the temperature difference, and the calibration residual difference, and the initial array graph is updated with the edge weight to obtain the current array graph;
[0020] Specifically, when the edge weight is less than a preset edge weight threshold, the corresponding edge is deleted or the corresponding edge weight is set to zero to suppress message passing between weakly related nodes.
[0021] Optionally, S3 includes:
[0022] Map the node feature vectors in the current array graph to the node initial embedding vectors;
[0023] Based on the node feature vectors, a set of health indicators is calculated through a health assessment network, and the set of health indicators is converted into a set of gating coefficients.
[0024] In the message passing process of the graph neural network, for any target node in the current array graph, the initial embedding vector of each adjacent node is multiplied by that node's gating coefficient from the set of gating coefficients, and the gated adjacent-node messages are aggregated with weighting by the edge weights to obtain the aggregated message of the target node.
[0025] Update the node embedding vector of the target node according to the aggregated message;
[0026] The updated embedding vectors of each node are used to form a local state vector set, wherein the gating coefficient decreases as the health index decreases, so as to suppress the influence of failed or aging nodes on the local state vector set.
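The gated aggregation of S3 can be illustrated with a minimal sketch. The function name `gated_message_passing`, the weight matrices `W_self` and `W_nbr`, the clipping used as the gating map, the degree normalization, and the `tanh` update are all assumptions made for illustration; the text only requires that the gating coefficient decrease as the health indicator decreases and that aggregation use the edge weights.

```python
import numpy as np

def gated_message_passing(H, A, health, W_self, W_nbr):
    """One gated message-passing step (illustrative, not the patented network).

    H:      (N, D) initial node embedding vectors
    A:      (N, N) symmetric edge weight matrix with zero diagonal
    health: (N,)   health indicators, assumed to lie in [0, 1]
    """
    # Assumed gating map: the coefficient falls as health falls.
    gate = np.clip(health, 0.0, 1.0)
    # Gate each adjacent node's message before it is aggregated.
    gated = gate[:, None] * H
    # Edge-weighted aggregation, normalized by the total incident edge weight.
    degree = np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    messages = (A @ gated) / degree
    # Assumed update rule combining the node's own embedding and its aggregated message.
    return np.tanh(H @ W_self + messages @ W_nbr)
```

With a health indicator of zero, a node contributes nothing to the aggregated messages of its neighbors, which is the suppression effect described for failed or aging nodes.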
[0027] Optionally, S4 includes:
[0028] The local state vector set is distributed to each agent according to the one-to-one correspondence between nodes and agents, and each agent obtains a health index associated with its corresponding node.
[0029] The dual variables retained in the previous adaptive update time are broadcast to each agent as consistent constraint adjustment quantities, and the set of radio frequency interference incident directions is encoded as interference direction feature vectors.
[0030] Under the joint constraints of the dual variable and the interference direction feature vector, each agent generates beamforming weight increments for corresponding array elements or subarrays through the distributed policy network of the multi-agent reinforcement learning, based on the corresponding local state vector, the corresponding health index, the dual variable, and the interference direction feature vector, thus forming a set of weight increments.
[0031] The beamforming weight increment is in complex form, comprising a real part increment and an imaginary part increment.
[0032] Optionally, S5 includes:
[0033] The set of weight increments is added to the beamforming weights of the previous adaptive update time node by node to obtain the superposition result;
[0034] An update suppression coefficient is generated for each node based on the set of health indicators, and the update suppression coefficient decreases as the health indicator decreases.
[0035] The weights of each node in the superposition result are multiplied by the corresponding update suppression coefficients, and the product is used as provisional beamforming weights.
[0036] Specifically, when the health index corresponding to any node is less than the preset health threshold, the update suppression coefficient corresponding to that node is set to zero so that the provisional beamforming weight corresponding to that node remains the beamforming weight at the previous adaptive update time.
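The superposition and update suppression of S5 can be sketched as follows. The text states both that the superposition result is multiplied by the update suppression coefficient and that a zero coefficient leaves the weight at its previous value; the sketch follows the second statement by scaling only the increment, which is an assumed reading. The function name and the clipping of health to [0, 1] as the suppression coefficient are also assumptions.

```python
import numpy as np

def provisional_weights(w_prev, delta_w, health, health_threshold):
    """Superimpose weight increments with health-based update suppression.

    w_prev, delta_w: (N,) complex previous weights and per-node increments
    health:          (N,) health indicators, assumed to lie in [0, 1]
    """
    # Assumed suppression coefficient: decreases as the health indicator decreases.
    s = np.clip(health, 0.0, 1.0)
    # Below the preset health threshold the coefficient is forced to zero.
    s = np.where(health < health_threshold, 0.0, s)
    # Scaling the increment keeps the previous weight when s == 0.
    return w_prev + s * delta_w
```

For example, a node with health 0.1 and a threshold of 0.3 keeps its previous weight exactly, while a fully healthy node receives its full increment.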
[0037] Optionally, S6 includes:
[0038] The array pattern is calculated based on the provisional beamforming weights. The array pattern provides pattern amplitude at least in the main lobe direction, the sidelobe angular domain, and the corresponding directions of the set of radio frequency interference incident directions. It is determined whether the array pattern simultaneously satisfies the following conditions: the main lobe direction gain is not less than a preset main lobe gain threshold, the sidelobe level in the sidelobe angular domain is not greater than a preset sidelobe threshold, and the null depth in the corresponding direction for each radio frequency interference incident direction in the set of radio frequency interference incident directions is not less than a preset null threshold. If these conditions are met, the provisional beamforming weights are determined as the current beamforming weights. If these conditions are not met, the current beamforming weights are generated by the constraint projection layer by solving the following constraint optimization problem: under the conditions that the main lobe direction gain is not less than a preset main lobe gain threshold, the sidelobe level in the sidelobe angular domain is not greater than a preset sidelobe threshold, and the null depth in the corresponding direction for each radio frequency interference incident direction in the set of radio frequency interference incident directions is not less than a preset null threshold, the norm of the difference between the current beamforming weights and the provisional beamforming weights is minimized.
[0039] Furthermore, when generating the current beamforming weights, the constraint projection layer further satisfies hardware implementation constraints, which include at least one of the following: amplitude clipping constraints, constant modulus constraints, and phase quantization constraints. The phase quantization constraint maps the phase of the current beamforming weights to the nearest phase value in a preset phase codebook set to adapt to the finite-bit phase controller of the phased array.
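The hardware implementation constraints can be illustrated in isolation. This sketch covers only the amplitude clipping, constant modulus, and phase quantization steps, not the full constrained optimization of the projection layer. The function name, its parameters, and the uniform codebook with `2**phase_bits` levels are assumptions; the text only specifies a preset phase codebook set.

```python
import numpy as np

def apply_hardware_constraints(w, phase_bits=None, constant_modulus=False, max_amplitude=None):
    """Apply hardware-style constraints to complex beamforming weights."""
    amp = np.abs(w)
    phase = np.angle(w)
    if max_amplitude is not None:
        # Amplitude clipping constraint.
        amp = np.minimum(amp, max_amplitude)
    if constant_modulus:
        # Constant modulus constraint: unit amplitude, phase-only control.
        amp = np.ones_like(amp)
    if phase_bits is not None:
        # Assumed uniform phase codebook: multiples of 2*pi / 2**phase_bits.
        step = 2.0 * np.pi / (2 ** phase_bits)
        # Rounding to the nearest multiple gives the nearest codebook phase.
        phase = np.round(phase / step) * step
    return amp * np.exp(1j * phase)
```

With a 2-bit codebook the available phases are multiples of 90 degrees, so a weight with phase 0.3 rad maps to phase 0 and one with phase -2.0 rad maps to -90 degrees.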
[0040] Optionally, S7 includes:
[0041] The current beamforming weights are applied to the beamforming network, causing the received signals of each array element to be weighted and synthesized according to the current beamforming weights to obtain the beam output signal. The array pattern is calculated based on the current beamforming weights, and the main lobe direction gain is determined in a preset main lobe direction. The sidelobe level is determined in a preset sidelobe angle domain, where the sidelobe level is the maximum value of the pattern gain in the sidelobe angle domain. A corresponding null depth is determined for each radio frequency interference incident direction in the set of radio frequency interference incident directions, where the null depth for any radio frequency interference incident direction is the difference between the main lobe direction gain and the pattern gain of that radio frequency interference incident direction. A reward value is generated based on the main lobe direction gain, the sidelobe level, and each null depth. Constraint residual vectors are generated based on the difference between the sidelobe level and a preset sidelobe threshold, and the difference between each null depth and the preset null threshold. The constraint residual vectors include sidelobe level over-limit residuals and insufficient null depth residuals generated for each radio frequency interference incident direction in the set of radio frequency interference incident directions.
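The pattern metrics, reward, and constraint residuals described here can be sketched for a simple case. The sketch assumes a uniform linear array with half-wavelength spacing purely for illustration (the method itself is not limited to this geometry), expresses gains in dB, takes the positive part of each threshold violation as the residual, and combines the three metrics into a reward with hypothetical coefficients `c_sll` and `c_null`; none of these choices is specified in the text.

```python
import numpy as np

def steering(n_elems, theta_deg, spacing=0.5):
    """Steering vector of a uniform linear array (spacing in wavelengths)."""
    k = np.arange(n_elems)
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

def gain_db(w, theta_deg):
    """Pattern gain in dB for weights w in direction theta_deg."""
    return 20.0 * np.log10(np.abs(np.vdot(w, steering(len(w), theta_deg))) + 1e-12)

def evaluate_beam(w, main_dir, sidelobe_angles, rfi_dirs,
                  sll_max_db, null_min_db, c_sll=1.0, c_null=1.0):
    g_main = gain_db(w, main_dir)
    # Side lobe level: maximum pattern gain over the sidelobe angular domain.
    sll = max(gain_db(w, th) for th in sidelobe_angles)
    # Null depth: main lobe gain minus pattern gain in each interference direction.
    depths = np.array([g_main - gain_db(w, th) for th in rfi_dirs])
    # Residuals: sidelobe over-limit amount, then one null-depth shortfall per direction.
    residual = np.concatenate(([max(0.0, sll - sll_max_db)],
                               np.maximum(0.0, null_min_db - depths)))
    mean_depth = float(depths.mean()) if depths.size else 0.0
    # Hypothetical reward: favor main lobe gain and deep nulls, penalize sidelobes.
    reward = g_main - c_sll * sll + c_null * mean_depth
    return reward, residual, (g_main, sll, depths)
```

For uniform weights on an 8-element array steered to broadside, the first sidelobe lies near -13 dB, so a sidelobe threshold of -20 dB produces a positive over-limit residual, while an interference direction that coincides with a natural pattern null produces a zero null-depth residual.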
[0042] Optionally, S8 includes:
[0043] Based on the constraint residual vector, the dual variables retained at the previous adaptive update time are iteratively updated according to the preset step size, and the iterative update results are non-negatively truncated to obtain the updated dual variables.
[0044] A constraint-enhancing reward is constructed based on the reward value and the updated dual variable, wherein the constraint-enhancing reward is the reward value minus the inner product of the updated dual variable and the constraint residual vector;
[0045] The parameters of the distributed policy network of the multi-agent reinforcement learning are updated according to the constraint enhancement reward, so as to increase the expected value of the constraint enhancement reward;
[0046] The updated dual variables, the updated distributed policy network, and the current beamforming weights are retained as the dual variables, the distributed policy network, and the beamforming weights of the previous adaptive update time, respectively, for the next adaptive update time.
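The dual variable update and the constraint-enhancing reward can be written in a few lines. The sketch assumes the usual dual ascent sign convention, in which a positive residual (a violated constraint) increases the corresponding dual variable; the function names and the example values are illustrative only.

```python
import numpy as np

def dual_update(lam, residual, step):
    """Step the dual variables along the residual, then truncate at zero."""
    return np.maximum(0.0, lam + step * residual)

def constraint_enhanced_reward(reward, lam_new, residual):
    """Reward minus the inner product of updated dual variables and residuals."""
    return reward - float(np.dot(lam_new, residual))

# Illustrative values: one violated constraint, one satisfied with margin.
lam = dual_update(np.array([0.5, 0.0]), np.array([2.0, -1.0]), step=0.1)
r_aug = constraint_enhanced_reward(10.0, lam, np.array([2.0, -1.0]))
```

In this example the first dual variable rises from 0.5 to 0.7, the second stays at zero because of the non-negative truncation, and the policy update would then maximize the expected value of `r_aug` rather than the raw reward.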
[0047] The beneficial effects of this invention are:
[0048] 1. When the gain and phase of array elements drift dynamically with temperature, aging, mismatch, and failure, the method tracks the array state online through dynamic graph construction and the local state representation of the graph neural network, and continuously corrects the beamforming weights through multi-agent distributed decision-making, thereby maintaining the main lobe direction gain over long periods and reducing dependence on manual tuning and repeated calibration.
[0049] 2. It forms controllable nulls toward the incident directions of radio frequency interference, and simultaneously constrains the sidelobe level and null depth through primal-dual iteration driven by the reward and the constraint residuals. Furthermore, the constraint projection layer projects the provisional weights onto the feasible region that satisfies the main lobe, sidelobe, and null thresholds, thereby improving the stability of sidelobe suppression and interference suppression, reducing sidelobe contamination, and improving signal-to-noise ratio and observation efficiency.
[0050] 3. Health indicator gating and update suppression weaken the impact of faulty or aging array elements on graph message passing and weight updates, and broadcasting the dual variables achieves distributed, consistent constraint adjustment. The method therefore maintains robust distributed online adaptive capability when some array elements are abnormal or the array is large, and is further compatible with hardware implementation constraints such as constant modulus and phase quantization.
Description of the Drawings
[0051] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0052] Figure 1 is a flowchart of the reinforcement learning-based array beam adaptive method proposed in this invention.
Detailed Implementation
[0053] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0054] Referring to Figure 1, a reinforcement learning-based array beam adaptation method includes:
[0055] S1. Collect the array element received signals, the calibration data, and the data characterizing the temperature state of the array elements and perform preprocessing; estimate the gain drift and phase drift of each array element; and estimate the set of RF interference incident directions from the array element received signals, to obtain the observation dataset and the set of RF interference incident directions.
S2. Dynamically update the initial array graph based on the observation dataset to generate the current array graph. The initial array graph uses array elements or subarrays as nodes, and the geometric adjacency or mutual coupling relationships between nodes as edges.
S3. Perform graph neural network feature extraction on the current array graph to generate a set of local state vectors, and calculate the set of health indicators from the node feature vectors.
S4. Based on the set of local state vectors, the set of health indicators, the set of RF interference incident directions, and the dual variables retained from the previous adaptive update time, perform distributed decision-making with a distributed policy network, in which each agent generates the beamforming weight increment for its corresponding array element or subarray, resulting in a set of weight increments.
S5. Superimpose the set of weight increments onto the beamforming weights from the previous adaptive update time to generate provisional beamforming weights.
S6. Based on the set of health indicators, suppress the update amplitude of the weights in the provisional beamforming weights that correspond to array elements or subarrays with lower health.
S7. Calculate the array pattern from the provisional beamforming weights, and have the constraint projection layer perform constraint projection on the provisional beamforming weights to generate the current beamforming weights.
S8. Apply the current beamforming weights to the beamforming process to form the output beam and collect the beam output signal; in combination with the array pattern, calculate the main lobe direction gain, the side lobe level, and the null depth for each RF interference incident direction in the set of RF interference incident directions; generate the reward value from the main lobe direction gain, the side lobe level, and each null depth; and generate the constraint residual vector from the side lobe level and each null depth.
S9. Update the dual variables according to the Lagrange multiplier iteration rule based on the constraint residual vector, update the distributed policy network based on the reward value and the updated dual variables, and retain the updated dual variables, the updated distributed policy network, and the current beamforming weights as the dual variables, the distributed policy network, and the beamforming weights of the previous adaptive update time.
[0056] In this specific embodiment, S1 includes:
[0057] Within the same preset observation time slot, the received signals of all array elements are synchronously acquired at a common sampling rate. Each received signal is indexed by the array element index and the sampling point index, the latter running up to the total number of sampling points within the observation time slot, and each sample is the complex baseband value of the given array element at the given sampling point;
[0058] For each array element, bandpass filtering, amplitude normalization, and time-domain framing are performed sequentially on the received signal to obtain the preprocessed array element received signal. The bandpass filtering uses an FIR bandpass filter with 129 taps, with the passband fixed at 5 MHz to 45 MHz to suppress out-of-band noise and image components. Amplitude normalization calculates the root mean square amplitude of each array element within the observation time slot and divides all of that element's samples in the time slot by it, ensuring a consistent input amplitude scale across elements. Time-domain framing divides the signal into frames with a frame length and a frame shift, both given in sampling points, and each frame is multiplied by a Hanning window to reduce spectral leakage;
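The preprocessing chain can be sketched with NumPy alone. The 129 taps, the 5 MHz to 45 MHz passband, the RMS normalization, and the Hanning window come from the text; the windowed-sinc design with a Hamming window, the function names, and the sampling rate, frame length, and frame shift passed in by the caller are assumptions, since the source does not preserve those values.

```python
import numpy as np

def fir_bandpass(num_taps, f_lo, f_hi, fs):
    """Windowed-sinc band-pass FIR, built as the difference of two low-pass filters."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0

    def lowpass(fc):
        # Ideal low-pass impulse response with cutoff fc (Hz) at sampling rate fs.
        return 2.0 * fc / fs * np.sinc(2.0 * fc / fs * n)

    # The Hamming taper is an assumed design choice, not taken from the text.
    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(num_taps)

def preprocess(x, fs, frame_len, frame_shift):
    """x: (num_elements, num_samples) complex baseband samples."""
    h = fir_bandpass(129, 5e6, 45e6, fs)
    # Band-pass filter each array element's signal.
    y = np.stack([np.convolve(row, h, mode="same") for row in x])
    # Normalize each element by its RMS amplitude over the whole time slot.
    rms = np.sqrt(np.mean(np.abs(y) ** 2, axis=1, keepdims=True))
    y = y / np.maximum(rms, 1e-12)
    # Split into overlapping frames and apply a Hanning window to each frame.
    starts = np.arange(0, y.shape[1] - frame_len + 1, frame_shift)
    window = np.hanning(frame_len)
    frames = np.stack([y[:, s:s + frame_len] * window for s in starts], axis=1)
    return y, frames  # frames: (num_elements, num_frames, frame_len)
```

A call such as `preprocess(x, fs=100e6, frame_len=256, frame_shift=128)` uses hypothetical values; the sampling rate must exceed 90 MHz for the 45 MHz passband edge to be representable.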
[0059] A preset calibration signal is extracted from the calibration data. The calibration signal is a known complex calibration sequence of fixed length that is injected into the receiving chain of each array element at the beginning of the observation time slot. The calibration sequence is first aligned by locating the peak of the cross-correlation between the preprocessed array element received signal and the calibration signal, and a calibration segment of the same length as the calibration signal is extracted. The complex gain estimate of each array element is then calculated according to the least squares criterion and compared with the complex gain reference value in the reference calibration result to obtain the gain drift and the phase drift. The calculation relationship is as follows:
[0060] [missing information];
[0061] The quantities in this relationship are: the aligned calibration segment of the preprocessed received signal of each array element; the preset calibration signal; the complex conjugate operation; the complex magnitude operation; the complex phase angle operation; the complex gain estimate of each array element; the complex gain reference value of each array element in the reference calibration result; the gain drift of each array element; and the phase drift of each array element. The calibration residual is defined as [missing information], and the gain drift, phase drift, and calibration residual together serve as the source of node features for subsequent graph construction;
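The calibration steps can be sketched as follows. The cross-correlation alignment and the least squares gain estimate follow the description directly. The exact drift and residual formulas are not preserved in the source, so the amplitude ratio, the wrapped phase difference, and the normalized fit error used below are assumptions, as are the function names.

```python
import numpy as np

def align_calibration(y, c):
    """Extract the segment of y that best matches calibration sequence c."""
    # np.correlate conjugates its second argument, as needed for complex signals.
    corr = np.correlate(y, c, mode="valid")
    start = int(np.argmax(np.abs(corr)))
    return y[start:start + len(c)]

def estimate_drift(y_cal, c, g_ref):
    """Least squares complex gain estimate and drift relative to a reference gain."""
    # Least squares fit of y_cal = g * c: g = sum(conj(c) * y_cal) / sum(|c|^2).
    g_hat = np.vdot(c, y_cal) / np.vdot(c, c)
    # Assumed definitions (the source formula is missing):
    gain_drift = np.abs(g_hat) / np.abs(g_ref)          # amplitude ratio
    phase_drift = np.angle(g_hat * np.conj(g_ref))      # wrapped phase difference
    residual = np.linalg.norm(y_cal - g_hat * c) / np.linalg.norm(c)  # normalized fit error
    return g_hat, gain_drift, phase_drift, residual
```

If the received calibration segment is an exact scaled copy of the sequence, the fit error is zero and the estimate recovers the scale factor's magnitude and phase.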
[0062] Simultaneously, data characterizing the temperature state of the array elements are collected. The temperature sensor corresponding to each array element outputs one temperature sample within the observation time slot, and a first-order recursive low-pass filter with a recursion coefficient of 0.9 is applied to the temperature samples to reduce instantaneous measurement noise and make the temperature state continuous over time;
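The temperature smoothing is a standard first-order recursive filter. The sketch assumes that the coefficient 0.9 weights the previous smoothed value and that the filter is initialized with the first sample; the text gives only the coefficient.

```python
def smooth_temperature(samples, coeff=0.9):
    """First-order recursive low-pass over per-slot temperature samples."""
    smoothed = []
    state = None
    for t in samples:
        # Assumed form: new_state = coeff * old_state + (1 - coeff) * new_sample.
        state = t if state is None else coeff * state + (1.0 - coeff) * t
        smoothed.append(state)
    return smoothed
```

A jump from 20 to 30 degrees in one observation slot moves the smoothed state only to 21 degrees, which illustrates how single-sample measurement noise is attenuated.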
[0063] A time-frequency transformation is performed on the preprocessed array element received signal to obtain a set of frequency domain snapshots. A fast Fourier transform is applied to each frame, and a number of frequency points within the passband are selected to form snapshots: for each frequency point and each frame, the complex spectral values of all array elements at that frequency point are stacked by element index into an array snapshot vector. The spatial covariance matrix is obtained by averaging the outer products of all snapshot vectors. It is a square complex matrix whose dimension equals the number of array elements, and both its real and imaginary parts are retained as input to the neural network;
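The snapshot and covariance computation can be sketched as follows. The function name, the FFT length `n_fft`, and the list of selected bin indices `bins` are illustrative parameters, since the source does not preserve their values.

```python
import numpy as np

def spatial_covariance(frames, n_fft, bins):
    """frames: (num_elements, num_frames, frame_len) windowed frames.

    Returns the complex covariance matrix R and a (2, N, N) real tensor
    holding its real and imaginary parts as two channels.
    """
    # FFT of every frame, keeping only the selected passband bins.
    spec = np.fft.fft(frames, n=n_fft, axis=-1)[:, :, bins]   # (N, F, B)
    # Each (frame, bin) pair yields one snapshot vector across the array elements.
    snaps = spec.reshape(spec.shape[0], -1)                    # (N, F * B)
    # Average of the outer products of all snapshot vectors.
    R = snaps @ snaps.conj().T / snaps.shape[1]
    return R, np.stack([R.real, R.imag])
```

The resulting matrix is Hermitian and positive semidefinite by construction, and the two-channel real tensor matches the input layout described for the direction estimation network.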
[0064] Based on the spatial covariance matrix, the direction estimation neural network outputs the radio frequency interference incident directions and their confidence levels. The input of the direction estimation neural network is a two-channel tensor in which the real part of the spatial covariance matrix is the first channel and the imaginary part is the second channel. The network structure is fixed as three 2D convolutional layers followed by two fully connected layers. The kernel size of the three convolutional layers is [missing information], their numbers of channels are 16, 32, and 64 respectively, and all three use the ReLU activation function. The output dimensions of the two fully connected layers are 256 and 181 respectively, and Softmax is applied in the last layer to obtain the confidence vector over the 181 candidate directions produced by quantizing the angular domain at a fixed interval. The angles of the candidate directions whose confidence is not less than a preset confidence threshold are output as the radio frequency interference incident directions and constitute the set of radio frequency interference incident directions;
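The final selection step, from network outputs to the set of interference directions, can be sketched without the network itself. The function name is illustrative, and the angle grid is supplied by the caller because the quantization interval is not preserved in the source; a grid from -90 to 90 degrees in 1 degree steps would give 181 candidates, but that is an assumption.

```python
import numpy as np

def select_interference_directions(logits, angle_grid, conf_threshold):
    """Softmax over candidate directions, then keep those above the threshold."""
    # Numerically stable softmax over the last-layer outputs.
    z = logits - np.max(logits)
    conf = np.exp(z) / np.sum(np.exp(z))
    # Keep candidates whose confidence is not less than the preset threshold.
    keep = conf >= conf_threshold
    return angle_grid[keep], conf[keep]
```

Because Softmax confidences sum to one, the threshold also bounds how many directions can be selected at once; for example, a threshold of 0.3 admits at most three directions.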
[0065] Finally, the gain drift, phase drift, calibration residual, and temperature state data of each array element, together with the spatial covariance matrix, are compiled into the observation dataset and output along with the set of radio frequency interference incident directions.
[0066] In this specific embodiment, S2 includes:
[0067] Based on the observation dataset, the initial array graph is dynamically updated to generate the current array graph. Its node set contains one node per array element, so the number of nodes equals the total number of array elements. The initial array graph constructs edges from geometric adjacency relationships, fixed as the four-adjacency connection of a planar array: for any two nodes, an undirected edge is established between them if and only if their Manhattan distance on the array grid is 1. The edges are stored in a sparse adjacency list to support online edge removal;
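The four-adjacency construction can be sketched with a dictionary of neighbor sets as the sparse adjacency list. The function names, the row-major node numbering, and the grid dimensions passed by the caller are assumptions; the source does not preserve the array dimensions.

```python
def build_grid_adjacency(rows, cols):
    """Four-adjacency graph on a rows x cols planar array, nodes numbered row-major."""
    adj = {r * cols + c: set() for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                # Right neighbor: Manhattan distance 1 along the row.
                adj[i].add(i + 1)
                adj[i + 1].add(i)
            if r + 1 < rows:
                # Lower neighbor: Manhattan distance 1 along the column.
                adj[i].add(i + cols)
                adj[i + cols].add(i)
    return adj

def remove_edge(adj, i, j):
    """Remove an undirected edge online, e.g. when its weight falls below threshold."""
    adj[i].discard(j)
    adj[j].discard(i)
```

On a 3 x 3 grid the center node has four neighbors, each corner has two, and the graph has 12 undirected edges in total.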
[0068] At each adaptive update time, when generating the current array graph, a node feature vector is first calculated for each node. The node feature vector is a four-dimensional column vector whose components are, in order, the gain drift, the phase drift, the temperature state data, and the calibration residual, where the gain drift and phase drift are derived from the complex gain estimate and the complex gain reference value of the corresponding array element. To ensure a consistent input scale for the subsequent graph neural network, each of the four components is linearly scaled from its own value range to a fixed target interval, and each scaling range is kept constant throughout the entire run to avoid scale drift;
[0069] Then, for any edge in the initial array graph, the quantities used for the edge update are computed: the cross-correlation value of the preprocessed array element received signals of the two nodes, the temperature difference between the two nodes, and the difference in calibration residuals between the two nodes. The cross-correlation value is taken as the maximum amplitude of the normalized cross-correlation of the two preprocessed received signals within the observation time slot; the temperature difference is computed from the temperature status data of the two nodes, and the calibration residual difference from the calibration residuals of the two nodes;
[0070] Based on the above quantities, edge weights are generated. And update it to the current array graph, the edge weights are calculated according to the following formula:
[0071] ;
[0072] in Representing an edge At the current update time The edge weights, Representing an edge The cross-correlation values of the received signals from both ends of the array element are preprocessed. This represents the temperature difference between the two array elements. This represents the difference in calibration residuals between the two array elements. This represents the temperature difference attenuation coefficient. This represents the attenuation coefficient of the calibration residual difference. Represents an exponential function;
[0073] The weights of all edges form the edge weight matrix, which is symmetric with a zero diagonal because the graph is undirected. A preset edge weight threshold is then applied: when the weight of any edge is less than the threshold, that edge is removed from the sparse adjacency list and its weight is set to zero, to suppress message passing between weakly related nodes. Edges that are not removed are retained in the edge set of the current array graph with their computed weights. This yields the current array graph, consisting of the node features, the edge set, and the edge weight set.
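The edge weight update and pruning of step S2 can be sketched as below. The formula itself is not reproduced in the source text, so the form used here (cross-correlation multiplied by an exponential decay in the absolute temperature difference and the absolute calibration residual difference) is an assumption that is merely consistent with the listed quantities; all names and numeric values are hypothetical.

```python
import math

def edge_weight(rho, d_temp, d_resid, alpha_t, alpha_e):
    """Assumed form: w = rho * exp(-alpha_t*|dT| - alpha_e*|de|)."""
    return rho * math.exp(-alpha_t * abs(d_temp) - alpha_e * abs(d_resid))

def update_edges(adj, stats, alpha_t, alpha_e, w_min):
    """Recompute weights and prune weak edges.

    adj   : {node: set(neighbors)}, modified in place
    stats : {(i, j): (rho, d_temp, d_resid)} with i < j
    Returns {(i, j): weight} for the edges that are kept.
    """
    weights = {}
    for (i, j), (rho, d_temp, d_resid) in stats.items():
        w = edge_weight(rho, d_temp, d_resid, alpha_t, alpha_e)
        if w < w_min:
            adj[i].discard(j)   # weakly related: stop message passing
            adj[j].discard(i)
        else:
            weights[(i, j)] = w
    return weights

adj = {0: {1}, 1: {0, 2}, 2: {1}}
stats = {(0, 1): (0.9, 0.0, 0.0), (1, 2): (0.9, 10.0, 0.0)}
weights = update_edges(adj, stats, alpha_t=0.5, alpha_e=1.0, w_min=0.1)
```

In this example the edge (1, 2) has a large temperature difference, so its weight falls below the threshold and the edge is removed.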
[0074] In this specific embodiment, S3 includes:
[0075] The current array graph is received and graph neural network feature extraction is performed on it to generate a set of local state vectors; in addition, a set of health indicators is calculated from the node feature vectors and converted into a set of gating coefficients;
[0076] in Indicates the current adaptive update time. Represents a set of nodes and Indicates the total number of array elements. Represents the current set of edges. Representing an edge The edge weights;
[0077] The feature vector of each node Mapped to node initial embedding vector ,in Indicates the node index. Indicates gain drift. Indicates phase drift, This represents temperature status data. Indicates calibration residual, Indicates transpose, with the embedding dimension set to . The mapping is accomplished using a fully connected layer with its parameters fixed. and ReLU activation is applied to the linear output, followed by LayerNorm to suppress the effect of input scale fluctuations on the embedding distribution;
[0078] Subsequently, based on the node feature vector Health index is calculated using a health assessment network. And form a set of health indicators The health assessment network is a two-layer multilayer perceptron, and its parameters are trained offline and fixed based on historical calibration residuals and array element fault records before system deployment. The parameters of the first layer of the network are fixed as follows: and ReLU activation is employed, and the parameters of the second layer of the network are fixed. and And Sigmoid activation is used to constrain the output to And define the gating coefficient as Thus, the set of gating coefficients is obtained. This causes the gating coefficient to decrease synchronously when the health index decreases;
[0079] The graph neural network uses a two-layer message passing network. The output of each layer is indexed by the layer index, and the layer-0 embedding is the initial node embedding vector described above. For any target node in the current array graph, its set of neighboring nodes is defined, and the weights of the edges incident to the target node are normalized: each normalized edge weight is obtained by dividing the edge weight by the sum of the weights of all edges incident to the target node. When the neighbor set is empty or the denominator is zero, all normalized weights are set to zero to ensure numerical stability;
[0080] During message passing at each layer, the gating coefficient $g_j$ is first applied to the previous-layer embedding vector $h_j^{(l-1)}$ of each adjacent node $j$; the gated embeddings are then aggregated using the normalized edge weights $\hat{w}_{ij}$, and the embedding vector of the target node $i$ is updated according to the following rule:
[0081] $$h_i^{(l)} = \sigma\!\left( W_{\mathrm{self}}^{(l)}\, h_i^{(l-1)} + W_{\mathrm{nbr}}^{(l)} \sum_{j \in \mathcal{N}_i(t)} \hat{w}_{ij}\, g_j\, h_j^{(l-1)} + b^{(l)} \right);$$
[0082] where $h_i^{(l)}$ is the embedding vector of target node $i$ after the layer-$l$ update; $h_i^{(l-1)}$ is the embedding vector of target node $i$ at the previous layer; $\sigma(\cdot)$ is the ReLU activation function; $W_{\mathrm{self}}^{(l)}$ is the self-loop transformation matrix of layer $l$; $W_{\mathrm{nbr}}^{(l)}$ is the neighborhood message transformation matrix of layer $l$; $b^{(l)}$ is the bias vector of layer $l$; $\mathcal{N}_i(t)$ is the set of nodes adjacent to target node $i$ at time $t$; $\hat{w}_{ij}$ is the normalized edge weight; $g_j$ is the gating coefficient of adjacent node $j$; and $\sum$ denotes summation over adjacent nodes;
[0083] After two layers of message passing are completed, the final-layer embedding of each node is taken as its local state vector, and these vectors form the set of local state vectors, which is output together with the set of health indicators and the set of gating coefficients. Because the gating coefficient decreases as the health index decreases, the influence of failed or aging nodes on the set of local state vectors is suppressed during the message aggregation stage.
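One layer of the gated, edge-weighted message passing described in step S3 can be sketched in plain Python. This is an illustration under assumptions: the normalization divides by the sum of incident edge weights, the gating coefficients are supplied as given values, and the matrices in the example are identity matrices chosen only to make the effect of gating visible.

```python
def matvec(matrix, vec):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def gated_layer(h, adj, weights, gate, w_self, w_nbr, bias):
    """One message-passing layer.

    h       : {node: embedding list}
    adj     : {node: set(neighbors)}
    weights : {(i, j): edge weight} with i < j
    gate    : {node: gating coefficient}
    """
    out = {}
    for i, h_i in h.items():
        total = sum(weights[(min(i, j), max(i, j))] for j in adj[i])
        agg = [0.0] * len(h_i)
        if total > 0.0:  # empty neighborhood or zero sum: keep agg at zero
            for j in adj[i]:
                coef = weights[(min(i, j), max(i, j))] / total * gate[j]
                agg = [a + coef * x for a, x in zip(agg, h[j])]
        pre = [s + n + b for s, n, b in
               zip(matvec(w_self, h_i), matvec(w_nbr, agg), bias)]
        out[i] = [max(0.0, x) for x in pre]  # ReLU
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = {0: [1.0, 0.0], 1: [0.0, 2.0]}
adj = {0: {1}, 1: {0}}
weights = {(0, 1): 0.5}
gate = {0: 1.0, 1: 0.0}   # node 1 is treated as unhealthy
h1 = gated_layer(h0, adj, weights, gate, identity, identity, [0.0, 0.0])
```

Node 0 receives nothing from the gated-off node 1, while node 1 still receives the full message from the healthy node 0.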
[0084] In this specific embodiment, S4 includes:
[0085] Based on the set of local state vectors, the set of health indicators, and the set of incident directions of radio frequency interference, the distributed policy network of multi-agent reinforcement learning makes distributed decisions, and each agent generates the beamforming weight increment of its corresponding array element;
[0086] in Indicates the current adaptive update time. Indicates the total number of array elements. Represents a node Local state vector, Represents a node The health indicators are output by the health assessment network;
[0087] In the distributed execution framework, nodes are mapped one-to-one to agents, and each agent is indexed by its node number. At each adaptive update time, each agent takes the local state vector and the health index of its own node as its local observation input, which is stored in the local processing unit of the array element. Each agent holds only the local state vector and health index of its own node, so as to satisfy the distributed constraint;
[0088] The dual variable retained from the previous adaptive update is broadcast to all agents as a consistent constraint adjustment quantity. Its dimension is fixed at 5, and its components correspond in order to one sidelobe level constraint and four null depth constraints. The dual variable is truncated element-wise to a fixed range [value missing] and then divided by 10 to obtain the normalized dual variable input, which keeps the numerical scale stable;
[0089] The set of incident directions of radio frequency interference is encoded into an interference direction feature vector. All incident directions output by the direction estimation neural network are sorted from high to low confidence and the first 4 are taken, each represented by the radian value of its incident direction and the corresponding confidence level. If the actual number of directions is less than 4, the missing direction and confidence entries are set to 0 to pad the vector to a fixed length, and the entries are concatenated in a fixed order [missing information];
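The fixed-length encoding can be sketched as follows. The per-direction layout is not reproduced in the source text; the sketch assumes three values per direction (sine of the angle, cosine of the angle, confidence), chosen only because it gives the 12 dimensions that the policy network input allots to this feature. The function name is hypothetical.

```python
import math

def encode_interference(directions, k=4):
    """Encode up to k (theta_rad, confidence) pairs into a 3*k vector.

    Directions are sorted by confidence, truncated to k, and zero-padded.
    """
    top = sorted(directions, key=lambda d: d[1], reverse=True)[:k]
    feat = []
    for theta, conf in top:
        feat += [math.sin(theta), math.cos(theta), conf]
    feat += [0.0] * (3 * (k - len(top)))   # pad missing directions
    return feat

feat = encode_interference([(0.0, 0.6), (math.pi / 2, 0.9)])
```

With two detected directions, the first six entries describe them in order of decreasing confidence and the remaining six entries are zero padding.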
[0090] Each agent generates its weight increment using a distributed policy network with shared parameters, which ensures consistency across all agents. The policy network is a three-layer multilayer perceptron with an input dimension of 50. The input vector is the concatenation of the local state vector (32 dimensions), the health index (1 dimension), the normalized dual variable (5 dimensions), and the interference direction feature vector (12 dimensions). The first hidden layer has 128 neurons with ReLU activation, the second hidden layer has 64 neurons with ReLU activation, and the output layer is 2-dimensional with a bounded activation [missing information] that limits the output amplitude;
[0091] The policy network outputs the complex weight increments of the corresponding array elements to each agent. Its increment is due to the real part. Increment of the imaginary part Composed of a uniform step size coefficient Amplitude scaling is applied to control the magnitude of the weight increment in a single update and to match the superimposed update method in step S5. The decision output relationship is as follows:
[0092] ;
[0093] in Represents a node The increment of the real part of the weights, Represents a node The increment of the imaginary part of the weights, This represents the scaling factor for the weight increment. The parameter is Distributed policy network mapping, This represents the vector concatenation operation. Represents the imaginary unit and satisfies Ultimately, the weight increment set is formed by the parallel output of all intelligent agents. .
[0094] In this specific embodiment, S5 includes:
[0095] Receive weight increment set And read the beamforming weights retained at the previous adaptive update time. ,in Indicates the current adaptive update time. Indicates the total number of array elements. For the first Each element at any moment Complex beamforming weights, For the first Each element at any moment The increment of complex weights;
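[0096] First, the weight increment of each node is added to its beamforming weight from the previous adaptive update time to obtain the superposition result, and the superposition results of all nodes form the superposition result set, which characterizes the update when no suppression is applied;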
[0097] Subsequently, based on the set of health indicators Generate updated suppression coefficients for each node. ,in For the first Each element at any moment Health indicators, update inhibition coefficient This is used to suppress the update magnitude of the superposition result relative to the weights of the previous time step, and a preset health threshold is set. And perform a freeze update on the array elements with low health to avoid frequent perturbation of the weights of abnormal array elements;
[0098] The provisional beamforming weight is defined as And generate each array element according to the following rules This ensures that the update magnitude monotonically decreases as health decreases, and that the weights remain unchanged from the previous time step when health falls below a threshold.
[0099] ;
[0100] in For the first Each element at any moment Provisional beamforming weights, For the first Each element at any moment Beamforming weights, For the first Each element at any moment The weight increment, For the first Each element at any moment Health indicators As the health threshold, To update the suppression coefficient and when hour ,when hour This will result in the superimposed effect. Corresponding update range Scaling based on health level and freezing updates when health is low;
[0101] Finally, the provisional beamforming weights of all array elements are stacked in index order to form the provisional beamforming weight vector.
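The health-gated superposition of step S5 can be sketched as below. The exact expression for the update suppression coefficient is not reproduced in the source text; the sketch assumes that it equals the health index when the health index is at or above the threshold and is 0 otherwise, which satisfies the stated properties (monotone in health, frozen below the threshold). Names and values are hypothetical.

```python
def gated_update(w_prev, dw, health, h_min):
    """Provisional weights: w_prev + kappa * dw, with health-based kappa.

    w_prev, dw : lists of complex numbers (previous weights, increments)
    health     : list of health indices in [0, 1]
    h_min      : health threshold below which the weight is frozen
    """
    provisional = []
    for w, d, h in zip(w_prev, dw, health):
        kappa = h if h >= h_min else 0.0   # assumed suppression rule
        provisional.append(w + kappa * d)
    return provisional

w_prev = [1 + 0j, 0 + 1j]
dw = [0.2 + 0.2j, 0.5 + 0j]
w_tmp = gated_update(w_prev, dw, health=[0.5, 0.1], h_min=0.3)
```

The first element moves by half of its increment, while the second element, whose health is below the threshold, keeps its previous weight.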
[0102] In this specific embodiment, S6 includes:
[0103] Receive provisional beamforming weight vector The array pattern is calculated to determine whether the main lobe directional gain, side lobe level, and null depth constraints are satisfied. Indicates the current adaptive update time. Indicates the total number of array elements. Indicates the first Provisional weighted values for each array element Indicates transpose;
[0104] In this embodiment, the main lobe direction is fixed as Fix the corner domain mesh as And fix the side lobe angle domain as Used for sidelobe level calculation, where Represents the set of sampling angles in the radiation pattern. Indicates the side lobe angle domain;
[0105] The array pattern gain is a function of the direction angle and of the complex weight vector used to calculate the pattern. It is obtained by taking the conjugate inner product of the array response vector and the weight vector, taking the amplitude of the result, and converting it to decibels. The array response vector is calculated from the geometric coordinates of the array elements, the center frequency, and the speed of light using a plane-wave far-field model, and the array element coordinate table is fixed at system initialization to ensure repeatability.
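The pattern gain calculation can be illustrated with a minimal sketch. For brevity it assumes elements placed along a single axis and an angle measured from broadside, whereas the embodiment uses a planar array; the 20·log10 amplitude convention, the small floor that avoids log(0), and the example frequency are also assumptions.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s

def steering_vector(xs, theta, freq_hz):
    """Plane-wave far-field response for elements at positions xs (metres)."""
    k = 2.0 * math.pi * freq_hz / C
    return [cmath.exp(1j * k * x * math.sin(theta)) for x in xs]

def pattern_gain_db(w, xs, theta, freq_hz):
    """Gain in dB: amplitude of the conjugate inner product w^H a(theta)."""
    a = steering_vector(xs, theta, freq_hz)
    y = sum(wi.conjugate() * ai for wi, ai in zip(w, a))
    return 20.0 * math.log10(max(abs(y), 1e-12))

freq = 1.0e9
lam = C / freq
xs = [n * lam / 2.0 for n in range(4)]   # half-wavelength spacing
w = [0.25 + 0j] * 4                      # uniform weights
g_main = pattern_gain_db(w, xs, 0.0, freq)
g_null = pattern_gain_db(w, xs, math.asin(0.5), freq)
```

With uniform weights on four half-wavelength-spaced elements, broadside is the main lobe and 30 degrees falls on a pattern null.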
[0106] based on exist The values at the corresponding directions of the incident radio frequency interference are taken and constrained, where the gain threshold in the main lobe direction is fixed. The sidelobe threshold is fixed at The zero-depression threshold is fixed at The set of incident radio frequency interference directions, after being encoded by fixed length in step S4, corresponds to Each direction is denoted as... ;
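[0107] When the main lobe direction gain constraint, the sidelobe constraint, and the null constraint for every constrained interference direction are all satisfied, the provisional beamforming weight vector is taken directly as the current weight in the continuous domain;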
[0108] When any of the above constraints is not satisfied, the constraint projection layer solves the following constrained optimization problem to generate the current weight in the continuous domain:
[0109] $$w^{c}(t) = \arg\min_{w \in \mathbb{C}^{N}} \lVert w - \tilde{w}(t) \rVert_2 \quad \text{subject to} \quad G(\theta_0; w) \ge G_{\min}, \quad G(\theta; w) \le \tau_{\mathrm{SL}} \;\; \forall\, \theta \in \Theta_{\mathrm{SL}}, \quad G(\theta_k; w) \le \tau_{\mathrm{null}}, \;\; k = 1, \dots, K;$$
[0110] where $w^{c}(t)$ is the current beamforming weight vector in the continuous domain; $w$ is the optimization variable, a complex vector of dimension $N$; $\mathbb{C}$ is the set of complex numbers; $\lVert \cdot \rVert_2$ is the L2 norm; $\tilde{w}(t)$ is the provisional beamforming weight vector; $G(\theta; w)$ is the pattern gain calculated with weight $w$ in direction $\theta$; $\theta_0$ is the main lobe direction; $G_{\min}$ is the gain threshold in the main lobe direction; $\Theta_{\mathrm{SL}}$ is the sidelobe angular domain; $\tau_{\mathrm{SL}}$ is the sidelobe threshold; $\theta_k$ is the $k$-th incident direction of radio frequency interference; $\tau_{\mathrm{null}}$ is the null threshold; and $K$ is the number of interference directions used for the constraints;
[0111] To make the constraint over the sidelobe angular domain computable, the sidelobe angular domain is treated as a discrete set of angles, and the constraint is implemented equivalently as a finite number of inequality constraints that hold simultaneously for all angles in the set. The constrained optimization problem is solved numerically using the interior-point method, with the maximum number of iterations fixed at 30, the convergence threshold fixed at [value missing], and a fixed initial point [missing information], to ensure that each update is deterministic and can run in real time;
[0112] After the continuous-domain weight is obtained, the final current beamforming weights are generated to meet the hardware implementation constraints. A constant modulus constraint and a phase quantization constraint are fixed for each array element, with the constant modulus amplitude fixed at [value missing] and the number of phase quantization bits fixed at 1, and a phase codebook set is constructed on this basis. The phase of each component of the continuous-domain weight is mapped to the codebook phase with the smallest circular distance from it, and the amplitude is forced to the constant modulus amplitude, which gives the final weight of each array element and thus forms the current beamforming weight vector.
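The constant-modulus and phase-quantization step can be sketched as follows. The sketch assumes a uniform phase codebook with 2^bits entries over a full turn; the bit count and amplitude in the example are illustrative values, not values taken from the source.

```python
import cmath
import math

def quantize_weights(w, bits, amplitude):
    """Map each weight to the nearest codebook phase at a fixed amplitude.

    The codebook is assumed uniform: {0, step, ..., (2**bits - 1) * step}
    with step = 2*pi / 2**bits. Rounding phase/step and wrapping modulo
    the codebook size selects the entry at the smallest circular distance.
    """
    levels = 2 ** bits
    step = 2.0 * math.pi / levels
    quantized = []
    for wi in w:
        idx = round(cmath.phase(wi) / step) % levels
        quantized.append(cmath.rect(amplitude, idx * step))
    return quantized

w_cont = [0.9 * cmath.exp(0.1j),
          2.0 * cmath.exp(1j * (math.pi / 2 + 0.2)),
          cmath.exp(-3.0j)]
w_hw = quantize_weights(w_cont, bits=2, amplitude=1.0)
```

The three example weights snap to phases 0, π/2, and π respectively, and every output has unit amplitude regardless of the input amplitude.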
[0113] In this specific embodiment, S7 includes:
[0114] Receive current beamforming weight vector The results are applied to beamforming to form the output beam. At the same time, the main lobe directional gain, side lobe level and null depth are calculated based on the array pattern, and the reward value and constraint residual vector are generated accordingly.
[0115] in Indicates the current adaptive update time. Indicates the total number of array elements. Indicates the first Each element at any moment Complex beamforming weights, Indicates transpose;
[0116] In beamforming processing, complex multiply-accumulate operations are performed on the preprocessed array element received signals within the observation time slot to generate the beam output signal. Each input is the complex baseband sample of an array element at a sampling point taken at the sampling rate, and the accumulation runs over the array element index from the first to the last array element. Phase alignment uses the same conjugate weighting as in step S6, and the resulting beam output signal is cached in complex 32-bit floating-point format for reward recording and system monitoring in subsequent time slots;
[0117] At the same time, the array pattern is calculated from the current beamforming weight vector and performance metrics are extracted, where the array pattern gain function follows the definition in step S6 and uses the same main lobe direction, set of direction sampling angles, and sidelobe angular domain. The set of incident directions of radio frequency interference is sorted by confidence from high to low, and the first 4 directions are designated as the directions participating in the null assessment. When the actual number of directions is less than 4, the activation flag of each missing direction is set to 0 and the activation flag of each existing direction is set to 1, which keeps the dimension of the subsequent constraint residual vector fixed;
[0118] Based on the above conventions, the main lobe direction gain, side lobe level, and null depths are calculated, and a reward value is generated. With constraint residual vector The calculation relationship is as follows:
[0119] ;
[0120] in Indicates time The main lobe directional gain and its unit is Indicates time The sidelobe level is the maximum value of the pattern gain within the sidelobe angular domain, and the unit is... Indicates time Regarding the first The null depth in each incident direction of radio frequency interference, and the unit is... This indicates the operation of finding the maximum value. Indicates the sidelobe threshold. Indicates the zero-trap threshold. This indicates that the sidelobe level exceeds the residual limit. Indicates the first The residual is insufficient in the zero-depression depth. Indicates the first A flag indicating whether an interference direction is active is used to set the corresponding null trap residual to zero when the interference direction is missing. This represents the reward weight for the main lobe direction gain. This indicates the sidelobe over-limit penalty weight. This indicates the penalty weight for insufficient zero traps. This represents the summation operation. For dimension The constrained residual vector contains a sidelobe level overlimit residual in a fixed order and The residual is insufficient in the depth of the zero trap.
[0121] The reward value is numerically clipped to the interval [value missing], and the constraint residual vector is output together with the clipped reward value.
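The reward and constraint residual construction of step S7 can be sketched as below. The formula is not reproduced in the source text, so this form is an assumption consistent with the listed quantities: the reward is the weighted main lobe gain minus weighted residuals, the sidelobe residual is the amount by which the sidelobe level exceeds its threshold, and each null residual is the amount by which the pattern gain in an active interference direction exceeds the null threshold. Reward clipping is omitted because the interval is not given.

```python
def reward_and_residuals(g_main, sll, null_gains, active,
                         tau_sl, tau_null, lam_main, lam_sl, lam_null):
    """Return (reward, residual vector of length 1 + len(null_gains)).

    g_main     : main lobe direction gain (dB)
    sll        : maximum pattern gain over the sidelobe domain (dB)
    null_gains : pattern gain in each interference direction (dB)
    active     : 1.0 for an existing direction, 0.0 for a padded one
    """
    e_sl = max(0.0, sll - tau_sl)
    e_null = [m * max(0.0, g - tau_null) for g, m in zip(null_gains, active)]
    reward = lam_main * g_main - lam_sl * e_sl - lam_null * sum(e_null)
    return reward, [e_sl] + e_null

reward, residual = reward_and_residuals(
    g_main=10.0, sll=-18.0,
    null_gains=[-35.0, -45.0, 0.0, 0.0], active=[1.0, 1.0, 0.0, 0.0],
    tau_sl=-20.0, tau_null=-40.0,
    lam_main=1.0, lam_sl=0.5, lam_null=0.2)
```

The padded directions contribute zero residual because their activation flags are 0, so the residual vector always has five components.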
[0122] In this specific embodiment, S8 includes:
[0123] Receive reward value With constraint residual vector And read the dual variable retained at the previous adaptive update time. and distributed policy network parameters ,in Indicates the current adaptive update time. It is a column vector of dimension 5, and each component is constructed by step S7 according to the sidelobe level excess residual and the null depth insufficient residual. It is a column vector of dual variables of dimension 5 and is Each component corresponds one-to-one. Indicates transpose;
[0124] The dual variable is updated iteratively according to the Lagrange multiplier iteration rule, and the iteration result is truncated to obtain the updated dual variable. The dual update step size is fixed at 1, and the dual variable is truncated element by element to a fixed interval [value missing]: components less than 0 are set to 0, and components greater than the upper bound are set to the upper bound, which ensures that the dual variable is non-negative and bounded;
[0125] Based on the updated dual variable, the constraint-enhanced reward is constructed and used for the subsequent policy update. The dual variable update and the construction of the constraint-enhanced reward are determined by the following formulas:
[0126] $$\lambda(t) = \Pi_{[0,\,\lambda_{\max}]}\!\left( \lambda(t-1) + \eta_{\lambda}\, c(t) \right), \qquad \tilde{r}(t) = r(t) - \lambda(t)^{T} c(t);$$
[0127] where $\lambda(t)$ is the updated dual variable at time $t$; $\Pi_{[0,\,\lambda_{\max}]}(\cdot)$ is the truncated projection operator that clips the input vector element by element to the given range; $\lambda(t-1)$ is the dual variable retained from the previous adaptive update time; $\eta_{\lambda}$ is the dual update step size; $c(t)$ is the constraint residual vector at time $t$; $\tilde{r}(t)$ is the constraint-enhanced reward at time $t$; $r(t)$ is the reward value at time $t$; and the inner product term $\lambda(t)^{T} c(t)$ penalizes constraint violations;
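The dual update and constraint-enhanced reward of step S8 can be sketched in a few lines. The upper truncation bound is not given in the source text, so the value 10 used here is an assumption, as are the function names.

```python
def dual_update(lam_prev, residual, step, lam_max):
    """Projected dual ascent: clip(lam_prev + step * residual, 0, lam_max)."""
    return [min(lam_max, max(0.0, l + step * c))
            for l, c in zip(lam_prev, residual)]

def augmented_reward(reward, lam, residual):
    """Reward minus the inner product of dual variable and residuals."""
    return reward - sum(l * c for l, c in zip(lam, residual))

lam_prev = [0.0, 1.0, 9.5, 0.0, 0.0]
residual = [2.0, 5.0, 1.0, 0.0, 0.0]
lam = dual_update(lam_prev, residual, step=1.0, lam_max=10.0)
r_aug = augmented_reward(8.0, lam, residual)
```

A constraint that keeps being violated accumulates a larger multiplier, so its violation weighs more heavily in the reward seen by the policy update, up to the truncation bound.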
[0128] In the policy network update, all agents share the same set of parameters, and a parameter server performs centralized updates and broadcasts the updated parameters to each agent. Each agent retains the two-dimensional action mean vector output by its policy network during decision-making in step S4 and the two-dimensional action vector that was actually output. The action mean vector is the output of the distributed policy network for the agent's input vector. The action vector is obtained by sampling from an independent Gaussian distribution with this mean and a per-dimension standard deviation fixed at 1, followed by element-wise clipping to the interval [value missing], and it is combined with the weight increment scaling factor of step S4 to give the weight increment. This allows the policy gradient to be calculated from the log-likelihood of the sampling distribution;
[0129] At each adaptive update time, the parameter server collects the action mean vectors and action vectors of all agents together with their input vectors, and performs a gradient ascent update of the shared parameters based on the constraint-enhanced reward. The Adam optimizer is used with the learning rate fixed at [value missing], the first-order moment decay coefficient fixed at [value missing], and the second-order moment decay coefficient fixed at [value missing]. The gradient is computed by summing the log-likelihoods of the actions of all agents, multiplying the sum by the constraint-enhanced reward, and backpropagating with respect to the shared parameters, so that the expected constraint-enhanced reward increases;
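The surrogate objective behind this gradient step can be sketched as below. The sketch stops at the gradient with respect to each agent's action mean; backpropagation through the shared network and the Adam update are omitted, and the effect of action clipping on the likelihood is ignored. These simplifications, and the function names, are assumptions made for illustration.

```python
import math

def gaussian_logp(action, mean, std):
    """Log-likelihood of an action under independent Gaussians."""
    return sum(-0.5 * ((a - m) / std) ** 2
               - math.log(std) - 0.5 * math.log(2.0 * math.pi)
               for a, m in zip(action, mean))

def surrogate_and_grad(actions, means, std, aug_reward):
    """Objective aug_reward * sum(log-likelihoods) and its gradient
    with respect to every agent's action mean."""
    total_logp = sum(gaussian_logp(a, m, std) for a, m in zip(actions, means))
    objective = aug_reward * total_logp
    grad_means = [[aug_reward * (a - m) / std ** 2 for a, m in zip(act, mu)]
                  for act, mu in zip(actions, means)]
    return objective, grad_means

objective, grad = surrogate_and_grad(
    actions=[[0.5, -0.5]], means=[[0.0, 0.0]], std=1.0, aug_reward=2.0)
```

With a positive constraint-enhanced reward, the gradient pushes each mean toward the sampled action; with a negative one, it pushes the mean away.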
[0130] After the parameter update is completed, the updated dual variable and the updated distributed policy network parameters are broadcast to all agents, and the updated dual variable is used as the dual variable input at the next adaptive update time. The updated dual variable, the updated distributed policy network, and the current beamforming weight vector are retained, respectively, as the dual variable, the distributed policy network, and the beamforming weights of the previous adaptive update time for use at the next adaptive update time.
[0131] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
[0132] This invention directly addresses the aforementioned technical problem through a combination of "array graph representation plus distributed reinforcement learning closed-loop optimization": First, based on the received signals, calibration data, and temperature state data of array elements, gain drift and phase drift are estimated online, forming an observation set including the incident directions of radio frequency interference. Then, an array graph is constructed using array elements or subarrays as nodes and dynamically updated according to cross-correlation, temperature difference, and calibration residual differences, ensuring that changes in correlation within the array are continuously reflected. Based on this, a graph neural network performs message passing under edge weight constraints to generate local state vectors, enabling each agent to obtain a representation consistent with the states of its neighboring array elements. This drives multi-agent reinforcement learning to output complex weight increments in a distributed manner. Combined with the main lobe direction gain, side lobe level, and null depth of each interference direction calculated from the array radiation graph, reward and constraint residuals are formed. Through iterative updates of dual variables and reverse updates of the policy network, online adaptive adjustment of weights is achieved during long-term operation, thereby stably maintaining the main lobe gain while satisfying side lobe and null constraints, reducing side lobe contamination, and improving signal-to-noise ratio and observation efficiency.
[0133] Meanwhile, this invention improves the algorithm structure to address engineering problems related to array element dynamic drift, fault aging, and constraint controllability: First, it employs online edge weight updates and a weakly correlated edge deletion mechanism, enabling the graph structure to adaptively adjust with changes in the environment and array state, thus improving the effectiveness and robustness of state representation. Second, it introduces health assessment and gating coefficients to suppress the influence of low-health nodes during graph message aggregation and weight update stages, reducing the disturbance of failed or aging array elements to beamforming. Third, it adopts incremental action forms and sets a constraint projection layer to project the provisional weights output by the strategy onto a feasible region that satisfies the main lobe, side lobe, and null thresholds, as well as hardware implementation constraints, avoiding constraint failures caused by learning fluctuations. Fourth, it uses a distributed primal-dual framework and broadcasts dual variables as consistent constraint adjustment quantities, enabling agents to collaboratively satisfy global constraints even under distributed conditions, thereby achieving more stable and controllable long-term adaptive beamforming and interference suppression effects.
Claims
1. A reinforcement learning-based array beam adaptation method, characterized by comprising:
S1. Collect the received signals, calibration data, and data characterizing the temperature state of the array elements, perform preprocessing, estimate the gain drift and phase drift of each array element, and estimate the set of incident RF interference directions based on the received signals of the array elements, to obtain the observation dataset and the set of incident RF interference directions;
S2. Perform dynamic updates on the initial array graph based on the observation dataset to generate the current array graph, wherein the initial array graph uses array elements or subarrays as nodes and the geometric adjacency or mutual coupling relationships between nodes as edges;
S3. Perform graph neural network feature extraction on the current array graph to generate a set of local state vectors, and calculate a set of health indicators based on the node features of the current array graph;
S4. Based on the local state vector set, the health index set, the set of incident RF interference directions, and the dual variables retained from the previous adaptive update time, use a distributed policy network for distributed decision-making, wherein each agent generates the beamforming weight increment for the corresponding array element or subarray, resulting in a weight increment set;
S5. Superimpose the weight increment set on the beamforming weights from the previous adaptive update time to generate provisional beamforming weights, wherein the update amplitude of the weights corresponding to array elements or subarrays with lower health is suppressed based on the health index set;
S6. Calculate the array pattern based on the provisional beamforming weights, and perform constraint projection on the provisional beamforming weights by the constraint projection layer to generate the current beamforming weights;
S7. Apply the current beamforming weights to the beamforming process to form the output beam, collect the beam output signal, calculate, based on the array pattern, the main lobe directional gain, the side lobe level, and the null depth for each RF interference incident direction in the set of incident RF interference directions, generate a reward value based on the main lobe directional gain, the side lobe level, and each null depth, and generate a constraint residual vector based on the side lobe level and each null depth;
S8. Update the dual variables according to the Lagrange multiplier iteration rule based on the constraint residual vector, update the distributed policy network based on the reward value and the updated dual variables, and retain the updated dual variables, the updated distributed policy network, and the current beamforming weights as the dual variables, distributed policy network, and beamforming weights of the previous time for the next adaptive update time, respectively.
2. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S1 includes: The array element received signals are synchronously acquired within the preset observation time slot, and bandpass filtering, amplitude normalization and time domain frame division processing are performed on the array element received signals to obtain preprocessed array element received signals. A preset calibration signal is extracted from the calibration data. The complex gain estimate of each array element is calculated based on the preprocessed array element received signal and the calibration signal. The complex gain estimate is then compared with the complex gain reference value in the reference calibration result to obtain the gain drift and the phase drift. The received signal of the preprocessed array element is subjected to time-frequency transformation to obtain a set of frequency domain snapshots. The spatial covariance matrix is calculated based on the set of frequency domain snapshots. The radio frequency interference incident direction and its confidence level are output by the direction estimation neural network based on the spatial covariance matrix. The radio frequency interference incident directions with confidence levels not less than a preset confidence threshold constitute the set of radio frequency interference incident directions. The gain drift, phase drift, data characterizing the temperature state of the array elements, and spatial covariance matrix are compiled into the observation dataset.
3. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S2 includes: Based on the observation dataset, a node feature vector is calculated for each node in the initial array diagram. The node feature vector includes the gain drift, phase drift, data characterizing the temperature state of the corresponding array element or subarray, and calibration residual. For any two nodes connected by an edge in the initial array diagram, the cross-correlation value of the received signals of the corresponding array elements of the two nodes is calculated based on the observation dataset, the temperature difference between the two nodes is calculated, and the calibration residual difference between the two nodes is calculated. The edge weight is generated based on the cross-correlation value, the temperature difference, and the calibration residual difference, and the edge weight is updated to the initial array diagram to obtain the current array diagram; Specifically, when the edge weight is less than a preset edge weight threshold, the corresponding edge is deleted or the corresponding edge weight is set to zero to suppress message passing between weakly related nodes.
4. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S3 includes: The node feature vectors in the current array graph are mapped to initial node embedding vectors. Based on the node feature vectors, a set of health indicators is calculated through a health assessment network, and the set of health indicators is converted into a set of gating coefficients. In the message passing process of the graph neural network, for any target node in the current array graph, the corresponding gating coefficient in the set of gating coefficients is applied to the initial embedding vector of each adjacent node, and the gated adjacent node messages are weighted and aggregated in combination with the edge weights to obtain the aggregated message of the target node. The node embedding vector of the target node is updated according to the aggregated message. The updated embedding vectors of the nodes form a local state vector set, wherein the gating coefficient decreases as the health index decreases, so as to suppress the influence of failed or aging nodes on the local state vector set.
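A single round of the health-gated message passing in claim 4 might look like the sketch below. It assumes the gating coefficient is simply the health indicator clipped to [0, 1] and that the node update is a `tanh` of linear self and message terms; the matrices `w_self` and `w_msg` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def gated_message_passing(h, edge_w, health, w_self, w_msg):
    """One round of health-gated message passing.

    h:      (n, d) initial node embeddings
    edge_w: (n, n) edge weights, zero where no edge exists
    health: (n,) health indicators, assumed to lie in [0, 1]
    """
    gate = np.clip(health, 0.0, 1.0)   # assumed gate: monotone in health
    gated = gate[:, None] * h          # gate each node's outgoing message
    agg = edge_w @ gated               # edge-weighted sum over neighbors
    return np.tanh(h @ w_self + agg @ w_msg)

rng = np.random.default_rng(1)
h = rng.standard_normal((3, 4))
edge_w = np.array([[0.0, 0.8, 0.5],
                   [0.8, 0.0, 0.6],
                   [0.5, 0.6, 0.0]])
health = np.array([1.0, 0.9, 0.0])     # node 2 is treated as failed
w_self = rng.standard_normal((4, 4))
w_msg = rng.standard_normal((4, 4))
out = gated_message_passing(h, edge_w, health, w_self, w_msg)
```

With the health of node 2 at zero, its embedding no longer reaches nodes 0 and 1, which is the suppression effect the claim describes.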
5. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S4 includes: The local state vector set is distributed to the agents according to the one-to-one correspondence between nodes and agents, and each agent obtains the health index associated with its corresponding node. The dual variables retained from the previous adaptive update time are broadcast to each agent as consistent constraint adjustment quantities, and the set of radio frequency interference incident directions is encoded as an interference direction feature vector. Under the joint constraints of the dual variables and the interference direction feature vector, each agent generates a beamforming weight increment for its corresponding array element or subarray through the distributed policy network of the multi-agent reinforcement learning, based on the corresponding local state vector, the corresponding health index, the dual variables, and the interference direction feature vector, thus forming a set of weight increments. Each beamforming weight increment is in complex form, comprising a real part increment and an imaginary part increment.
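The per-agent decision in claim 5 can be illustrated with a deliberately small stand-in for the distributed policy network. The sketch assumes a single linear layer with a `tanh` output and a bounded step size `max_step`; none of these choices come from the claims, and `agent_weight_increment` is a hypothetical name.

```python
import numpy as np

def agent_weight_increment(local_state, health, dual, rfi_feature, params, max_step=0.05):
    """Map one agent's inputs to a complex weight increment.

    params has shape (2, input_dim): row 0 drives the real part increment,
    row 1 the imaginary part increment (an assumed, minimal policy form).
    """
    x = np.concatenate([local_state, [health], dual, rfi_feature])
    out = np.tanh(params @ x)          # two outputs bounded in (-1, 1)
    return max_step * complex(out[0], out[1])

rng = np.random.default_rng(2)
local_state = np.array([0.2, -0.1, 0.4])
dual = np.array([0.5, 0.0])
rfi_feature = np.array([np.sin(0.3), np.cos(0.3)])   # assumed direction encoding
params = 0.1 * rng.standard_normal((2, 8))
dw = agent_weight_increment(local_state, 0.9, dual, rfi_feature, params)
```

Bounding the increment keeps each adaptive update small, so a poorly trained policy cannot move the weights far in one step.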
6. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S5 includes: The set of weight increments is added node by node to the beamforming weights of the previous adaptive update time to obtain the superposition result. An update suppression coefficient is generated for each node based on the set of health indicators, and the update suppression coefficient decreases as the health index decreases. The weight of each node in the superposition result is multiplied by the corresponding update suppression coefficient, and the product is used as the provisional beamforming weight. Specifically, when the health index corresponding to any node is less than the preset health threshold, the update suppression coefficient corresponding to that node is set to zero so that the provisional beamforming weight corresponding to that node remains the beamforming weight at the previous adaptive update time.
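The health gating of claim 6 can be sketched as below. Note the interpretation: the claim multiplies the superposed weight by the suppression coefficient, whereas this sketch scales only the increment, a reading chosen so that a zero coefficient leaves the previous weight unchanged as the claim states. The coefficient form (health clipped to [0, 1]) and the default threshold are assumptions.

```python
import numpy as np

def provisional_weights(w_prev, dw, health, health_threshold=0.3):
    """Apply health-gated increments to the previous beamforming weights."""
    s = np.clip(health, 0.0, 1.0)                      # decreases with health
    s = np.where(health < health_threshold, 0.0, s)    # freeze unhealthy nodes
    return w_prev + s * dw

w_prev = np.array([1.0 + 0.0j, 0.0 + 1.0j])
dw = np.array([0.5 + 0.0j, 0.5 + 0.0j])
health = np.array([1.0, 0.1])          # second node is below the threshold
w_tmp = provisional_weights(w_prev, dw, health)
```

In this example the healthy node receives its full increment while the unhealthy node keeps its previous weight.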
7. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S6 includes: The array pattern is calculated based on the provisional beamforming weights. The array pattern provides pattern amplitudes at least in the main lobe direction, the sidelobe angular domain, and the directions corresponding to the set of radio frequency interference incident directions. It is determined whether the array pattern simultaneously satisfies the following conditions: the main lobe gain is not less than a preset main lobe gain threshold, the sidelobe level in the sidelobe angular domain is not greater than a preset sidelobe threshold, and the null depth in the corresponding direction for each radio frequency interference incident direction in the set of radio frequency interference incident directions is not less than a preset null threshold. If these conditions are met, the provisional beamforming weights are determined as the current beamforming weights. If not, the current beamforming weights are generated by the constraint projection layer by solving the following constrained optimization problem: minimizing the norm of the difference between the current beamforming weights and the provisional beamforming weights, while ensuring that the main lobe gain is not less than the preset main lobe gain threshold, the sidelobe level in the sidelobe angular domain is not greater than the preset sidelobe threshold, and the null depth in the corresponding direction for each radio frequency interference incident direction in the set of radio frequency interference incident directions is not less than the preset null threshold.
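The feasibility test that precedes the projection in claim 7 can be sketched for a uniform linear array. The array geometry (half-wavelength spacing), the dB conventions, and the names `steering`, `pattern_db` and `meets_constraints` are assumptions made for illustration; the projection solver itself is not shown.

```python
import numpy as np

def steering(n, theta_deg, d=0.5):
    """Steering vector of an n-element uniform linear array, spacing d wavelengths."""
    k = np.arange(n)
    return np.exp(2j * np.pi * d * k * np.sin(np.deg2rad(theta_deg)))

def pattern_db(w, theta_deg):
    """Pattern gain |w^H a(theta)| in dB, with a small floor to avoid log(0)."""
    return 20.0 * np.log10(np.abs(np.vdot(w, steering(len(w), theta_deg))) + 1e-12)

def meets_constraints(w, main_deg, sidelobe_degs, rfi_degs,
                      gain_min_db, sll_max_db, null_min_db):
    """Check main lobe gain, sidelobe level and null depth at once."""
    g_main = pattern_db(w, main_deg)
    sll = max(pattern_db(w, t) for t in sidelobe_degs)
    depths = [g_main - pattern_db(w, t) for t in rfi_degs]
    return bool(g_main >= gain_min_db and sll <= sll_max_db
                and all(dep >= null_min_db for dep in depths))

w = np.ones(8, dtype=complex) / 8         # uniform weights, main lobe at 0 degrees
sidelobes = np.arange(20, 91)             # assumed sidelobe angular domain
rfi = [np.rad2deg(np.arcsin(0.25))]       # a natural null of this uniform array
ok = meets_constraints(w, 0.0, sidelobes, rfi, -1.0, -12.0, 30.0)
too_strict = meets_constraints(w, 0.0, sidelobes, rfi, -1.0, -20.0, 30.0)
```

Uniform weighting has a first sidelobe near -13 dB, so it passes a -12 dB sidelobe threshold but fails a -20 dB one; in the second case the claimed method would hand the weights to the constraint projection layer.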
8. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S7 includes: The current beamforming weights are applied to the beamforming network, so that the received signals of the array elements are weighted and combined according to the current beamforming weights to obtain the beam output signal. The array pattern is calculated based on the current beamforming weights, and the main lobe direction gain is determined in a preset main lobe direction. The sidelobe level is determined in a preset sidelobe angular domain, where the sidelobe level is the maximum value of the pattern gain in the sidelobe angular domain. Furthermore, a corresponding null depth is determined for each radio frequency interference incident direction in the set of radio frequency interference incident directions. The null depth for any radio frequency interference incident direction is the difference between the main lobe direction gain and the pattern gain in that radio frequency interference incident direction. A reward value is generated based on the main lobe direction gain, the sidelobe level, and each null depth. A constraint residual vector is generated based on the difference between the sidelobe level and the preset sidelobe threshold, and the difference between each null depth and the preset null threshold. The constraint residual vector includes a sidelobe level over-limit residual and a null depth insufficiency residual generated for each radio frequency interference incident direction in the set of radio frequency interference incident directions.
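The reward and constraint residual vector of claim 8 can be sketched as follows. The claims do not give the reward formula, so the combination used here (main lobe gain minus sidelobe level plus the worst null depth, all in dB) is an assumption, as is the sign convention in which a positive residual means a violated constraint.

```python
import numpy as np

def reward_and_residuals(g_main_db, sll_db, null_depths_db, sll_max_db, null_min_db):
    """Return an assumed scalar reward and the signed constraint residual vector.

    Residual layout: [sidelobe over-limit, null insufficiency per RFI direction].
    Positive entries indicate a violated constraint.
    """
    null_depths_db = np.asarray(null_depths_db, dtype=float)
    residual = np.concatenate([[sll_db - sll_max_db], null_min_db - null_depths_db])
    worst_null = null_depths_db.min() if null_depths_db.size else 0.0
    reward = g_main_db - sll_db + worst_null
    return float(reward), residual

# Sidelobes 2 dB over the limit; first null 5 dB too shallow, second 5 dB to spare.
reward, residual = reward_and_residuals(0.0, -10.0, [25.0, 35.0], -12.0, 30.0)
```

Keeping the residuals signed lets a later dual update shrink a multiplier once its constraint is satisfied with margin.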
9. The array beam adaptive method based on reinforcement learning according to claim 1, characterized in that, S8 includes: Based on the constraint residual vector, the dual variables retained from the previous adaptive update time are iteratively updated according to the preset step size, and the iterative update results are truncated to non-negative values to obtain the updated dual variables. A constraint-enhancing reward is constructed based on the reward value and the updated dual variables, wherein the constraint-enhancing reward is the reward value minus the inner product of the updated dual variables and the constraint residual vector. The parameters of the distributed policy network of the multi-agent reinforcement learning are updated according to the constraint-enhancing reward, so as to increase the expected value of the constraint-enhancing reward. The updated dual variables, the updated distributed policy network, and the current beamforming weights are retained as the dual variables, the distributed policy network, and the beamforming weights of the previous adaptive update time, respectively, for use at the next adaptive update time.
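The dual update and constraint-enhancing reward of claim 9 correspond to a projected step on the dual variables followed by a Lagrangian-style penalty. A minimal sketch, assuming signed residuals in which positive values indicate violation and hypothetical function names:

```python
import numpy as np

def update_duals(dual_prev, residual, step):
    """Step along the residual, then truncate to non-negative values."""
    return np.maximum(0.0, dual_prev + step * residual)

def constraint_enhancing_reward(reward, dual_new, residual):
    """Reward minus the inner product of the dual variables and the residuals."""
    return reward - float(np.dot(dual_new, residual))

dual_prev = np.array([0.0, 1.0, 0.1])
residual = np.array([2.0, 5.0, -5.0])      # third constraint is satisfied
dual_new = update_duals(dual_prev, residual, step=0.1)
r_aug = constraint_enhancing_reward(35.0, dual_new, residual)
```

Violated constraints raise their dual variables and therefore their penalty, while the satisfied third constraint has its dual variable truncated to zero and stops influencing the reward.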
10. The array beam adaptive method based on reinforcement learning according to claim 7, characterized in that, when generating the current beamforming weights, the constraint projection layer further satisfies hardware implementation constraints, which include at least one of the following: an amplitude clipping constraint, a constant modulus constraint, and a phase quantization constraint. The phase quantization constraint maps the phase of the current beamforming weights to the nearest phase value in a preset phase codebook set to adapt to the finite-bit phase controller of the phased array.
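The three hardware constraints named in claim 10 each have a simple element-wise projection, sketched below. The uniform 2-bit codebook is an assumed example of a preset phase codebook set, and the nearest phase is chosen by circular distance so that phases near the wrap-around point are handled correctly.

```python
import numpy as np

def clip_amplitude(w, a_max):
    """Scale down any weight whose magnitude exceeds a_max; keep its phase."""
    mag = np.abs(w)
    return w * np.minimum(1.0, a_max / np.maximum(mag, 1e-12))

def constant_modulus(w):
    """Force unit magnitude while keeping each weight's phase."""
    return np.exp(1j * np.angle(w))

def quantize_phase(w, codebook):
    """Snap each phase to the nearest codebook entry, measured on the circle."""
    phases = np.angle(w)[:, None]
    dist = np.abs(np.angle(np.exp(1j * (phases - codebook[None, :]))))
    nearest = codebook[np.argmin(dist, axis=1)]
    return np.abs(w) * np.exp(1j * nearest)

codebook = 2 * np.pi * np.arange(4) / 4     # assumed 2-bit codebook: 0, 90, 180, 270 degrees
w = np.array([np.exp(0.3j), 2 * np.exp(1.0j), np.exp(-3.0j)])
w_q = quantize_phase(w, codebook)           # phases snap to 0, 90 and 180 degrees
w_c = clip_amplitude(w, 1.0)
```

Applying these projections after the pattern constraints can disturb sidelobe and null performance, which is presumably why the claim places them inside the constraint projection layer rather than after it.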