Hierarchical coordinated control method and system for gas-cooled microreactors using deep reinforcement learning
By employing a hierarchical coordinated control method based on deep reinforcement learning, combined with SAC agents and PID controllers, the problem of electric power control in gas-cooled microreactors was solved, achieving fast, smooth, and safe load tracking across the entire power range, thereby improving system stability and actuator lifespan.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-26
AI Technical Summary
The power control of gas-cooled microreactors faces challenges from multi-factor coupling and strong nonlinearity. Traditional PID control struggles to balance fast response and stability across the entire power range, easily resulting in overshoot, oscillation, and steady-state error. Furthermore, it lacks adaptive decoupling capabilities, leading to limited load tracking performance and severe wear on actuators.
A hierarchical coordinated control method based on deep reinforcement learning is adopted. The SAC agent optimizes the setpoints of electric power and outlet temperature in real time, and a physical constraint layer is introduced. Combined with the underlying PID controller, fast, smooth and safe load tracking is achieved.
It achieves fast, smooth, and safe load tracking of the gas-cooled microreactor across the full power range, breaking through the bottlenecks of strong coupling and nonlinear control, and improving the system's stability and the lifespan of the actuator.
Figure CN122284293A_ABST
Description
Technical Field
[0001] This invention belongs to the field of nuclear reactor control technology, specifically relating to a hierarchical coordinated control method and system for gas-cooled microreactors using deep reinforcement learning.
Background Technology
[0002] The complex thermodynamic processes and coaxial structure of gas-cooled microreactors (GBRs) lead to strong coupling between components in the energy conversion system. The use of the Brayton cycle further exacerbates this coupling between the reactor and the energy conversion system. When the turbine and compressor deviate from their design operating points, significant differences in system characteristics result in substantial coupling between the GBR's electrical power and secondary-side outlet temperature. This strong coupling and nonlinearity not only limit the load-tracking capability of GBRs but also make the system unstable at low power levels. These operational control challenges constitute one of the bottlenecks hindering the development and utilization of GBRs.
[0003] Current research on load-tracking strategies for small modular reactors coupled with Brayton cycle systems is limited. Furthermore, current research on gas-cooled microreactor control primarily focuses on nuclear power control, with little research on electric power control. Electric power control in gas-cooled microreactors faces the dual challenges of multi-factor coupling and strong nonlinearity. The core issue lies in the large volume and heat capacity of the graphite core, leading to a slow core temperature response, while the Brayton system responds quickly. However, power regulation requires coordination with strongly coupled variables such as helium and air flow rates; any change in any parameter can trigger a chain reaction. At different power levels, the characteristics of equipment in the Brayton system vary significantly, meaning that the same set of controller parameters cannot meet the regulation performance requirements under different operating conditions. However, due to the stringent safety requirements in the nuclear energy field, directly abandoning the industrially validated proportional-integral-derivative (PID) controller architecture would face significant implementation risks.
[0004] Therefore, an advanced hierarchical control strategy needs to be proposed: while retaining the bottom-level PID as the actuator to ensure system robustness, an advanced upper-level strategy coordinates the electric power and outlet temperature setpoints, thereby breaking through the control bottleneck of multi-factor coupling and strong nonlinearity of gas-cooled microreactors without changing the existing industrial control base.
Summary of the Invention
[0005] The technical problem to be solved by this invention is to address the shortcomings of the prior art by providing a hierarchical coordinated control method and system for gas-cooled microreactors using deep reinforcement learning. The method optimizes the setpoints for electric power and outlet temperature in real time through an intelligent agent and introduces a physical constraint layer (filtering and rate limiting). The final control is executed by the underlying PID controller, thereby achieving fast, smooth, and safe load tracking. It leverages the ability of deep reinforcement learning to handle strongly coupled nonlinearities while retaining the robustness and safety of the underlying PID control, and effectively suppresses control signal oscillations. This addresses the technical problems of existing gas-cooled microreactor control systems, which struggle to balance fast response and stability across the entire power range when facing the strong coupling, strong nonlinearity, and large thermal-inertia lag of the Brayton cycle, resulting in overshoot, oscillations, and steady-state errors, and which lack adaptive decoupling capabilities, leading to limited load tracking performance and severe wear of the actuators.
[0006] The present invention adopts the following technical solution: a hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning, including the following steps:
S1. Construct a hierarchical control architecture for a gas-cooled microreactor, which includes an upper intelligent decision layer, a middle signal processing layer, and a lower PID controller layer. Simultaneously, design an SAC intelligent agent controller in the intelligent decision layer and set a rate limiter and a low-pass filter in the signal processing layer.
S2. Design the observation space, action space, and composite reward function of the SAC agent controller. The observation space includes the thermal parameters of the gas-cooled microreactor secondary loop system. The action space consists of the optimized values of the electric power and outlet temperature setpoints. The composite reward function is used to guide the training and decision-making of the SAC agent controller.
S3. Based on the hierarchical control architecture of the gas-cooled microreactor and the designed SAC intelligent agent controller, conduct offline training and strategy tuning experiments covering the full power operating range of the gas-cooled microreactor under complex operating conditions to obtain a control strategy model with generalized robustness and a trained SAC intelligent agent controller. Deploy the control strategy model into the SAC intelligent agent controller.
S4. Execute real-time coordinated control, inputting the real-time operating parameters of the gas-cooled microreactor to the trained SAC intelligent agent controller. The SAC intelligent agent controller outputs the original setpoints for power and outlet temperature, as well as system status data, to achieve decoupled control of power and outlet temperature. The signal processing layer sequentially performs rate limiting and low-pass filtering on the original setpoints to obtain optimized setpoint instructions. The PID controller layer receives the optimized setpoint instructions and outputs turbine speed signals and fan speed signals through the PID controller to control the secondary loop energy conversion system of the gas-cooled microreactor, achieving load tracking of the gas-cooled microreactor.
[0007] Preferably, the SAC agent controller is constructed based on the maximum entropy reinforcement learning mechanism. The objective function of the SAC agent controller includes a cumulative reward term and an entropy term. The entropy term is used to measure the randomness of the policy, and the temperature parameter α of the objective function is used to control the trade-off between reward and entropy.
[0008] Preferably, the SAC intelligent agent controller adopts an Actor-Critic architecture, including a critic network with a dual-Q network, a soft Bellman equation update module that introduces an entropy term, and an automatic adjustment module for the temperature parameter α. The dual-Q network consists of two independent Q1 and Q2 networks used to estimate the value of the current state action pair.
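The dual-Q critic and the entropy-augmented soft Bellman update can be illustrated with a minimal sketch of how a single TD target might be computed. The function name `soft_td_target`, the discount factor `gamma`, and the scalar inputs are assumptions made for illustration; the source does not specify an implementation, and a real critic would operate on batches of network outputs.

```python
def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Soft Bellman target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')).

    q1_next, q2_next: the two critics' estimates for the next state-action pair.
    logp_next: log-probability of the sampled next action under the policy.
    gamma is a hypothetical discount factor, not a value from the source.
    """
    # Take the more conservative of the two independent critic estimates.
    q_min = min(q1_next, q2_next)
    # Entropy bonus: -alpha * log pi grows when the policy is more random.
    soft_value = q_min - alpha * logp_next
    # Do not bootstrap past a terminal transition.
    return reward + (0.0 if done else gamma * soft_value)
```

Both critics would then be regressed toward this shared target, so an over-optimistic estimate from either network cannot inflate it.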
[0009] Preferably, the thermal parameters in the observation space include the measured value of electric power, the reference value of the set value of electric power, the optimized value of the set value of electric power, the optimized value of the set value of electric power at the previous moment, the measured value of outlet temperature, the reference value of the set value of outlet temperature, the optimized value of the set value of outlet temperature, and the optimized value of the set value of outlet temperature at the previous moment.
[0010] Preferably, the composite reward function includes an error penalty term, a motion smoothing and vibration damping term, a zero-deviation reward term, an overshoot penalty term, and a cross-channel decoupling term. Each reward term works together to achieve multi-dimensional training and decision guidance for the SAC agent controller.
[0011] Preferably, the error penalty term sets a dead zone of 1% for power and temperature, and provides a reward to the SAC intelligent agent controller when the system deviation is within the dead zone; the error penalty term also monitors the reference power level in real time and automatically increases the power error weight in the low-power operating range.
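A minimal sketch of such an error term follows. Only the 1% dead zone comes from the text; the weights, the bonus value, the low-power threshold, the gain applied below it, and the assumption that errors are normalized fractions of the rated value are all hypothetical.

```python
def error_term(power_err, temp_err, ref_power,
               dead_zone=0.01,           # 1% dead zone (from the text)
               low_power_threshold=0.5,  # hypothetical boundary of the low-power range
               w_power=1.0, w_temp=1.0,  # hypothetical base weights
               low_power_gain=2.0,       # hypothetical weight increase at low power
               bonus=1.0):               # hypothetical in-dead-zone reward
    """Reward contribution from tracking errors (all inputs normalized)."""
    # Raise the power-error weight when the reference power is low.
    if ref_power < low_power_threshold:
        w_power *= low_power_gain
    reward = 0.0
    for err, weight in ((power_err, w_power), (temp_err, w_temp)):
        if abs(err) <= dead_zone:
            reward += bonus               # inside the dead zone: reward
        else:
            reward -= weight * abs(err)   # outside: proportional penalty
    return reward
```

With these placeholder values, the same 5% power error costs twice as much at a 40% reference power as at full power.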
[0012] Preferably, the motion smoothing and damping term penalizes the rate of change of the agent's action, and the penalty weight for the temperature channel is several times that for the power channel; the motion smoothing and damping term also monitors changes in the direction of the action output, penalizes sign flips of the control commands near steady state, and doubles the smoothing coefficient once the error falls within 3%.
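The smoothing term described here might be sketched as below. The 3% band and the doubling of the coefficient come from the text; the base weight, the temperature-to-power weight ratio (the text says only "several times"), and the sign-flip penalty are hypothetical placeholders.

```python
def smoothing_term(action, prev_action, prev_delta, errors,
                   w_power=0.1,       # hypothetical base smoothing weight
                   temp_ratio=3.0,    # hypothetical "several times" ratio
                   near_band=0.03,    # 3% band (from the text)
                   flip_penalty=0.5): # hypothetical sign-flip penalty
    """All arguments except the keyword ones are (power, temperature) pairs.

    prev_delta holds the previous step's action increments, which is what
    lets the term detect a reversal of direction.
    """
    weights = (w_power, w_power * temp_ratio)
    penalty = 0.0
    for a, a_prev, d_prev, err, w in zip(action, prev_action, prev_delta,
                                         errors, weights):
        delta = a - a_prev
        near_steady = abs(err) <= near_band
        if near_steady:
            w *= 2.0  # double the smoothing coefficient inside the 3% band
        penalty += w * abs(delta)
        # Penalize a reversal of the command direction near steady state.
        if near_steady and delta * d_prev < 0.0:
            penalty += flip_penalty
    return -penalty
```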
[0013] Preferably, the offline training under complex operating conditions uses typical load tracking conditions covering the full power operating range of the gas-cooled microreactor as the training environment, and drives the policy convergence of the SAC agent controller through at least 1500 training iterations to obtain the control policy model. The system status data includes the measured electric power, the measured outlet temperature, the current action of the SAC intelligent agent controller, and the action of the SAC intelligent agent controller at the previous moment.
[0014] Preferably, the rate limiter imposes a hard constraint on the rate of change of the set value, and the low-pass filter filters out high-frequency jitter noise induced by random strategy sampling of the SAC intelligent agent controller; the PID controller layer includes a power loop PID controller and a temperature loop PID controller, the power loop PID controller receives the optimized set value of electric power and outputs the turbine speed signal, and the temperature loop PID controller receives the optimized set value of outlet temperature and outputs the fan speed signal.
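As an illustration of the signal processing layer, the following sketch chains a rate limiter and a first-order low-pass filter on one setpoint channel. The class names, the discretization, and the parameters `max_rate`, `tau`, and `dt` are assumptions; the source does not give the filter order or any numeric limits.

```python
class RateLimiter:
    """Hard limit on how fast the setpoint may change, in units per second."""

    def __init__(self, max_rate, dt, initial):
        self.max_step = max_rate * dt  # largest allowed change per sample
        self.value = initial

    def step(self, target):
        delta = target - self.value
        # Clamp the requested change to the allowed step size.
        delta = max(-self.max_step, min(self.max_step, delta))
        self.value += delta
        return self.value


class LowPassFilter:
    """First-order low-pass, discretized as y += dt / (tau + dt) * (x - y)."""

    def __init__(self, tau, dt, initial):
        self.k = dt / (tau + dt)
        self.value = initial

    def step(self, x):
        self.value += self.k * (x - self.value)
        return self.value


def process_setpoint(raw_setpoint, limiter, lpf):
    """Rate-limit first, then filter, matching the order stated in the method."""
    return lpf.step(limiter.step(raw_setpoint))
```

One limiter and filter pair would be instantiated per channel, so the power and outlet temperature setpoints can carry different rate limits and time constants.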
[0015] In a second aspect, embodiments of the present invention provide a hierarchical coordinated control system for a gas-cooled microreactor employing deep reinforcement learning, comprising:
An architecture building module, used to build a hierarchical control architecture for a gas-cooled microreactor. The architecture includes an upper intelligent decision layer, a middle signal processing layer, and a lower PID controller layer. An SAC intelligent agent controller is deployed in the intelligent decision layer, and a rate limiter and a low-pass filter are set in the signal processing layer.
An element design module, used to design the observation space, action space, and composite reward function of the SAC intelligent agent controller. The observation space includes the thermal parameters of the gas-cooled microreactor secondary loop system, and the action space consists of the power setpoint optimization increment and the outlet temperature setpoint optimization increment.
A training and deployment module, used to conduct offline training under complex operating conditions covering the full power range based on the hierarchical control architecture, to obtain a control strategy model with generalized robustness, and to deploy the control strategy model to the SAC intelligent agent controller.
A real-time control module, used to input the real-time operating parameters of the gas-cooled microreactor to the SAC intelligent agent controller and to output the optimized increments of the power setpoint and the outlet temperature setpoint. The signal processing layer performs rate limiting and low-pass filtering on the optimized increments to synthesize smoothed power setpoints and outlet temperature setpoints. The PID controller layer outputs turbine speed signals and air flow signals to drive the secondary loop energy conversion system of the gas-cooled microreactor to achieve load tracking and decoupling control.
[0016] In a third aspect, embodiments of the present invention provide a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above-described hierarchical coordinated control method for gas-cooled microreactors employing deep reinforcement learning.
[0017] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium including a computer program, which, when executed by a processor, implements the steps of the above-described hierarchical coordinated control method for gas-cooled microreactors employing deep reinforcement learning.
[0018] In a fifth aspect, embodiments of the present invention provide a chip including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above-described hierarchical coordinated control method for gas-cooled microreactors employing deep reinforcement learning.
[0019] In a sixth aspect, embodiments of the present invention provide an electronic device including a computer program, which, when executed by the electronic device, implements the steps of the above-described hierarchical coordination control method for gas-cooled microreactors employing deep reinforcement learning.
[0020] Compared with the prior art, the present invention has at least the following beneficial effects: A hierarchical coordinated control method for gas-cooled microreactors employing deep reinforcement learning is proposed. The upper-level Soft Actor-Critic (SAC) agent utilizes maximum entropy reinforcement learning to handle the complex dynamic characteristics of the gas-cooled microreactor, characterized by strong coupling and strong nonlinearity, generating the optimal setpoint trajectory in real time. This solves the problem of traditional PID controllers' inability to adapt to wide operating conditions. The middle layer introduces a rate limiter and a low-pass filter as safety filters, forcibly constraining the rate of change of action and filtering out inherent random high-frequency jitter, ensuring the physical realizability and smoothness of control commands. The PID controller layer outputs dual signals of turbine and fan speeds, precisely matching the control requirements of the gas-cooled microreactor's secondary loop energy conversion system. The entire process achieves decoupled control of electrical power and outlet temperature, as well as load tracking.
[0021] Furthermore, by maximizing policy entropy, the agent can maintain the diversity of actions during training, fully explore the complex state space of the gas-cooled microreactor under different power levels, and avoid premature convergence to a suboptimal policy. The introduction of the temperature parameter α enables the system to adaptively adjust: increasing α under unknown operating conditions to encourage exploration, and decreasing α after the policy matures to lock in a high-precision optimal solution.
[0022] Furthermore, in systems like gas-cooled microreactors where safety is paramount, biases in value estimation can lead to overly aggressive and dangerous actions by the agent. By updating the policy using the minimum of the outputs of two independent Q-networks, this architecture provides a more conservative and accurate value estimate, significantly improving training stability and convergence speed. Combining the soft Bellman equation and entropy term updates ensures that the agent can quickly approximate the optimal policy while maintaining policy smoothness and preventing drastic changes in control commands when learning complex nonlinear mappings.
[0023] Furthermore, the observation vectors containing time-series information allow the SAC agent to memorize the system's inertial characteristics, enabling it to make forward-looking decisions rather than simply relying on reactive control based on current errors. This significantly enhances the system's ability to compensate for large time lags, effectively reduces overshoot, and improves tracking accuracy during variable load processes.
[0024] Furthermore, the error penalty term ensures basic tracking accuracy; the motion smoothing term forcibly suppresses high-frequency oscillations to protect the actuator; the overshoot penalty term directly constrains safety indicators; and the cross-channel decoupling term explicitly penalizes unnecessary coupling interventions, forcing the agent to learn to specialize in specific tasks. This refined reward shaping mechanism enables the trained model to not only complete control tasks but also achieve an optimal balance across multiple dimensions such as energy consumption, equipment lifespan, and safety.
[0025] Furthermore, setting a dead zone allows the system to fluctuate freely within a very small deviation range, avoiding wear and oscillations caused by frequent fine-tuning of the actuators; while automatically increasing the error weight in the low-power range specifically overcomes the problem of large gain changes and instability of the microreactor under low load. This adaptive operating condition mechanism enables the same control strategy to cover an operating range from 100% to 40% or even wider, without the need for manual switching of multiple sets of PID parameters, achieving seamless and smooth control across the entire power range.
[0026] Furthermore, by imposing stricter smoothing constraints on temperature actions, system thermal shocks caused by sudden temperature command changes are effectively prevented; the penalty for sign flipping near steady state completely eliminates high-frequency sawtooth waves in the control signal; and doubling the smoothing coefficient after the error enters 3% ensures a soft landing of the system when it approaches the set value.
[0027] Furthermore, by employing an offline training and online deployment model, the high-risk exploration process is confined to a simulation environment, ensuring the absolute safety of actual nuclear facility operation. Selecting a complex training environment covering all operating conditions allows the agent to anticipate various extreme and transient situations, learning generalized robust control strategies.
[0028] Furthermore, the rate limiter physically eliminates the possibility of sudden changes in the setpoint, preventing actuator overload; the low-pass filter filters out inherent noise generated by probability sampling, outputting clean control commands. The underlying PID controllers independently adjust turbine speed and airflow, retaining mature, industrially proven control logic, ensuring that the system remains within a controllable range even when the agent exhibits extreme anomalies.
[0029] It is understood that the beneficial effects of the second to sixth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
[0030] In summary, the method of this invention constructs a three-level hierarchical architecture, integrating SAC deep reinforcement learning with traditional PID control to achieve both decoupling and robustness; the composite reward function and adaptive training mechanism are adapted to full power conditions; the physical constraint layer suppresses signal oscillations, and the dual-loop PID executes the commands precisely, achieving fast, smooth, and safe load tracking of the gas-cooled microreactor throughout the entire process, thus overcoming the bottlenecks of strong coupling and nonlinear control.
[0031] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Attached Figure Description
[0032] Figure 1 is a schematic diagram illustrating the reinforcement learning principle of the present invention.
Figure 2 is a schematic diagram of the SAC-based hierarchical coordinated control system for the gas-cooled microreactor according to the present invention.
Figure 3 shows the setpoint curves for electric power and outlet temperature, where (a) is the normalized electric power setpoint and (b) is the outlet temperature setpoint.
Figure 4 shows the response of each parameter under the 100%-40% linear load reduction condition, where (a) is the normalized nuclear power, (b) is the core outlet temperature, (c) is the normalized electric power setpoint, (d) is the normalized electric power, (e) is the secondary side outlet temperature setpoint, (f) is the secondary side outlet temperature, (g) is the helium flow rate, and (h) is the air flow rate.
Figure 5 is a schematic diagram of a computer device provided in an embodiment of the present invention.
Figure 6 is a block diagram of a chip according to an embodiment of the present invention.
Figure 7 is a schematic flow diagram of the method of the present invention.
[0033] Reference numerals: 60. Computer equipment; 61. Processor; 62. Memory; 63. Computer program; 600. Electronic device; 610. Processing unit; 620. Storage unit; 6201. Random access memory unit; 6202. Cache memory unit; 6203. Read-only memory unit; 6204. Program / utility; 6205. Program module; 630. Bus; 640. Display unit; 650. Input / output interface; 660. Network adapter; 700. External device.
Detailed Implementation
[0034] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0035] In the description of this invention, it should be understood that the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.
[0036] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0037] It should also be further understood that the term "and / or" as used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. Additionally, the character " / " in this invention generally indicates that the preceding and following objects have an "or" relationship.
[0038] It should be understood that although terms such as first, second, third, etc., may be used in the embodiments of the present invention to describe the preset range, these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from one another. For example, without departing from the scope of the embodiments of the present invention, the first preset range may also be referred to as the second preset range, and similarly, the second preset range may also be referred to as the first preset range.
[0039] Depending on the context, the word "if" as used here can be interpreted as "when," "once," "in response to determination," or "in response to detection." Similarly, depending on the context, the phrase "if determination" or "if detection (of the stated condition or event)" can be interpreted as "when determination," "in response to determination," "when detection (of the stated condition or event)," or "in response to detection (of the stated condition or event)."
[0040] The accompanying drawings illustrate various structural schematic diagrams according to embodiments disclosed in this invention. These drawings are not to scale, and some details have been enlarged for clarity, and some details may have been omitted. The shapes of the various regions and layers shown in the drawings, as well as their relative sizes and positional relationships, are merely exemplary and may deviate from reality due to manufacturing tolerances or technical limitations. Furthermore, those skilled in the art can design regions / layers with different shapes, sizes, and relative positions as needed.
[0041] This invention provides a hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning. It optimizes the setpoint decision channel by using an agent to perceive the changing trends of power deviation, outlet temperature deviation, and other system thermal parameters in real time. Based on a maximum entropy reinforcement learning mechanism, it dynamically generates the optimal curve for the setpoints of power and outlet temperature, eliminating the coupling between them. Physical constraints and signal smoothing are introduced at the agent's output. A rate limiter imposes hard constraints on the setpoint change rate, and a low-pass filter removes high-frequency jitter noise induced by probabilistic sampling. The smoothed setpoint signal is then transmitted to the underlying PID controller, which achieves precise decoupling control of the secondary loop energy conversion system of the gas-cooled microreactor by coordinating the adjustment of turbine speed and air flow. By setting a reward function that rewards high-precision load tracking, stability control, and multi-variable channel decoupling, the agent is trained offline under complex operating conditions to construct a control strategy model with generalized robustness. During system operation, the optimal setpoint trajectory is dynamically mapped according to real-time state parameters. The result is a hierarchical multi-objective coordinated control of the gas-cooled microreactor based on deep reinforcement learning, which significantly improves the wide-range load tracking response characteristics while also providing strict safety boundary protection and actuator oscillation suppression.
[0042] Referring to Figure 7, the present invention discloses a hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning, comprising the following steps:
S1. Construct a hierarchical control architecture for the gas-cooled microreactor and design an intelligent agent controller, while also designing the intermediate signal processing layer.
Referring to Figure 1, reinforcement learning is an important branch of artificial intelligence (AI) and machine learning (ML), alongside supervised and unsupervised learning. This method simulates the process by which organisms learn optimal behavior through interaction with their environment. Unlike traditional supervised learning, reinforcement learning does not rely on pre-labeled datasets for training models. Instead, it depends on an agent learning how to achieve a predetermined goal in a specific environment through continuous trial, failure, adaptation, and optimization.
[0043] This invention introduces the Soft Actor-Critic (SAC) algorithm, a representative of deep reinforcement learning. SAC, one of the leading off-policy algorithms for tasks with continuous action spaces, aims to solve the problems of low sample efficiency and unstable training commonly found in deep reinforcement learning. Traditional reinforcement learning aims to find a policy that maximizes the sum of accumulated rewards. SAC introduces the concept of maximum entropy, and its core objective is to allow the agent to obtain the highest reward while maintaining as much randomness as possible in its actions. This achieves a dynamic balance between exploration and exploitation, enabling it to provide smoother and more robust control performance than traditional deterministic algorithms when dealing with complex physical systems with high dimensions and strong nonlinearity.
[0044] The objective function of SAC includes not only the cumulative reward but also an entropy term. In its standard maximum-entropy form:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
[0045] where $r(s_t, a_t)$ is the reward term, which guides the agent toward the task objective; $\mathcal{H}$ is the entropy term, which measures the randomness of the policy (the higher the entropy, the more diverse the actions); and $\alpha$ is a temperature parameter that controls the trade-off between reward and entropy: the higher $\alpha$, the stronger the agent's willingness to explore.
[0046] SAC follows an Actor-Critic framework and is primarily composed of the following three parts: 1) Critic network with double Q-learning. SAC uses two independent Q-networks (Q1 and Q2) to estimate the value of the current state-action pair. When calculating the target value, the minimum of the two Q-network outputs is taken. This effectively alleviates the Q-value overestimation that is prevalent in deep reinforcement learning.
[0047] 2) Update of the soft Bellman equation. Based on the maximum entropy objective, SAC introduces a soft state value. Its soft Bellman equation includes an entropy term in the TD update, allowing the evaluation of the Q function to take into account the value of future exploration.
[0048] 3) Automatic adjustment of the temperature parameter α. The algorithm automatically adjusts α based on the difference between the current entropy value and the target entropy value: if too little exploration is done, α is automatically increased to encourage random actions; if the policy is already robust, α is automatically decreased to lock in the optimal solution.
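The automatic temperature adjustment can be sketched as one gradient step on the commonly used loss J(α) = E[-α (log π(a|s) + H_target)]. The function name, the learning rate, and the choice to optimize log α (which keeps α positive) are assumptions, as is any particular target entropy; the source states only that α moves with the gap between the current and target entropy.

```python
import math


def update_log_alpha(log_alpha, logp_batch, target_entropy, lr=1e-3):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + H_target)].

    logp_batch: log-probabilities of actions sampled from the current policy,
    so -mean(logp_batch) is a sample estimate of the policy entropy.
    """
    alpha = math.exp(log_alpha)
    mean_logp = sum(logp_batch) / len(logp_batch)
    # dJ/d(log_alpha), using alpha = exp(log_alpha).
    grad = -alpha * (mean_logp + target_entropy)
    return log_alpha - lr * grad
```

When the estimated entropy is below the target, `mean_logp + target_entropy` is positive, the gradient is negative, and the step raises α; when the entropy exceeds the target, α falls.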
[0049] Referring to Figure 2, the hierarchical coordinated control system for a gas-cooled microreactor based on deep reinforcement learning designed in this invention consists of an upper intelligent decision-making layer, a middle signal processing layer, and a lower PID controller layer. The upper-layer SAC agent's output signal has two parts: reference setpoints for power and outlet temperature, and system state data, including measured power, measured outlet temperature, agent actions, and the agent's actions from the previous moment. This decouples power and outlet temperature, dynamically generating optimal command increments for power and outlet temperature. The middle layer constructs a "physical filter" for the agent's action output by introducing a rate limiter and a low-pass filter. This module smoothly reconstructs the original actions into optimized setpoint commands by limiting the slope of setpoint changes and filtering out high-frequency noise generated by the SAC stochastic policy. The lower layer consists of traditional PID controllers that receive the smoothed power and temperature setpoints and output turbine speed and fan speed signals, respectively.
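The lower layer can be pictured with a minimal discrete PID sketch in which one loop maps the smoothed power setpoint to a turbine speed command and the other maps the smoothed temperature setpoint to a fan speed command. The gains, output limits, sample time, and the names `PID` and `lower_layer_step` are hypothetical; the source does not give tuning values, and this sketch omits anti-windup.

```python
class PID:
    """Textbook positional PID with output clamping (no anti-windup)."""

    def __init__(self, kp, ki, kd, dt, out_min, out_max):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None  # no derivative action on the first sample

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Keep the actuator command within its allowed range.
        return max(self.out_min, min(self.out_max, out))


def lower_layer_step(power_sp, temp_sp, power_meas, temp_meas,
                     power_pid, temp_pid):
    """Return (turbine_speed_cmd, fan_speed_cmd) for one control period."""
    turbine_speed_cmd = power_pid.step(power_sp, power_meas)
    fan_speed_cmd = temp_pid.step(temp_sp, temp_meas)
    return turbine_speed_cmd, fan_speed_cmd
```

Because the two loops share no state, the coordination between the channels comes entirely from the setpoints supplied by the upper layers.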
[0050] S2. Design the observation space, action space, and composite reward function of the SAC agent. For the secondary loop system of a gas-cooled microreactor, the primary control objectives are electrical power and secondary-side outlet temperature. Therefore, the observation space of the reinforcement learning agent includes eight physical quantities: the measured electrical power, the reference value of the electrical power setpoint, the optimized value of the electrical power setpoint, the optimized value of the electrical power setpoint at the previous moment, the measured outlet temperature, the reference value of the outlet temperature setpoint, the optimized value of the outlet temperature setpoint, and the optimized value of the outlet temperature setpoint at the previous moment. The action space consists of the optimized values of the electrical power setpoint and the outlet temperature setpoint.
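One way to organize the observation and action quantities listed above is sketched below. The field names, their ordering, and the `next_observation` helper are hypothetical; only the list of quantities is taken from the text.

```python
from dataclasses import dataclass, astuple


@dataclass
class Observation:
    power_measured: float   # measured electrical power
    power_ref: float        # reference value of the power setpoint
    power_opt: float        # optimized power setpoint
    power_opt_prev: float   # optimized power setpoint at the previous moment
    temp_measured: float    # measured outlet temperature
    temp_ref: float         # reference value of the temperature setpoint
    temp_opt: float         # optimized temperature setpoint
    temp_opt_prev: float    # optimized temperature setpoint at the previous moment

    def to_vector(self):
        """Flatten to the list that would be fed to the policy network."""
        return list(astuple(self))


@dataclass
class Action:
    power_opt: float  # optimized electrical power setpoint
    temp_opt: float   # optimized outlet temperature setpoint


def next_observation(obs, action, power_measured, temp_measured):
    """Shift current optimized setpoints into the 'previous' slots and store the new action."""
    return Observation(power_measured, obs.power_ref, action.power_opt, obs.power_opt,
                       temp_measured, obs.temp_ref, action.temp_opt, obs.temp_opt)


# Assumed normalized values for illustration.
obs0 = Observation(1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
obs1 = next_observation(obs0, Action(0.95, 1.0), power_measured=0.99, temp_measured=1.0)
```

Carrying the previous optimized setpoints in the observation gives the agent the information it needs to reason about how quickly its own commands are changing.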
[0051] The design of the reward function is a core issue in the SAC algorithm, directly affecting whether the agent can achieve the desired goal. The basic idea behind the reward function design is to provide appropriate rewards as the agent gradually approaches the goal. The design concept of the reward function in this project is as follows: (1) Ensure the stability of the output signal, with the temperature channel having a higher smoothing weight than the power channel. This is intended to eliminate high-frequency oscillations in temperature regulation.
[0052] (2) Adaptive adjustment based on operating conditions: The function dynamically switches weights according to the reference power level. In the low power range, the system automatically increases the power error weight and strictly controls the accuracy threshold to solve the problem of difficult elimination of static error under low load.
[0053] (3) Decoupling control concept: An explicit decoupling term is introduced that penalizes "cross-channel intervention", that is, a large temperature action while the power is stable, or vice versa, to suppress the negative impact of strong coupling.
[0054] (4) Nonlinear penalty architecture: A multi-level penalty system is adopted, consisting of dead zone, linear, and quadratic terms. Small errors are not penalized, medium errors are penalized linearly, and large errors are penalized heavily with quadratic terms to ensure that the system can quickly return to normal under severe disturbances.
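The four design ideas above can be sketched as small building blocks of a composite reward. This is not the reward function of Table 1: every threshold, gain, and function name below is an assumed placeholder chosen only to show the shape of each term.

```python
def tiered_penalty(error, dead_zone, linear_limit, k_lin, k_quad):
    """Dead zone, then linear, then quadratic penalty on the absolute error."""
    e = abs(error)
    if e <= dead_zone:
        return 0.0                                   # small errors are not penalized
    if e <= linear_limit:
        return k_lin * (e - dead_zone)               # medium errors: linear
    # Large errors: continue from the linear branch and add a quadratic term.
    return k_lin * (linear_limit - dead_zone) + k_quad * (e - linear_limit) ** 2


def power_error_weight(reference_power, low_power_threshold=0.5,
                       low_weight=2.0, normal_weight=1.0):
    """Operating-condition switch: weight power error more heavily at low load."""
    return low_weight if reference_power < low_power_threshold else normal_weight


def cross_channel_penalty(power_error, temp_action_change, stable_band, k_cross):
    """Penalize a large temperature move while power is already within its stable band."""
    if abs(power_error) <= stable_band:
        return k_cross * abs(temp_action_change)
    return 0.0


def smoothness_penalty(power_change, temp_change, w_power=1.0, w_temp=2.0):
    """Output-smoothness term; the temperature channel carries the higher weight."""
    return w_power * abs(power_change) + w_temp * abs(temp_change)


# Assumed example values (normalized units).
small = tiered_penalty(0.005, dead_zone=0.01, linear_limit=0.05, k_lin=1.0, k_quad=100.0)
medium = tiered_penalty(0.03, dead_zone=0.01, linear_limit=0.05, k_lin=1.0, k_quad=100.0)
large = tiered_penalty(0.10, dead_zone=0.01, linear_limit=0.05, k_lin=1.0, k_quad=100.0)
```

A composite reward would then be the negative weighted sum of such terms; the symmetric case of the decoupling term (a large power move while temperature is stable) can reuse `cross_channel_penalty` with the arguments swapped.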
[0055] Based on the above ideas, the reward function is set as shown in Table 1.
[0056] Table 1 Design of the reward function
[0057] S3. Based on the above control system, conduct offline training and strategy tuning experiments under complex working conditions to obtain a control strategy model and intelligent agent with generalized robustness.
[0058] After constructing the hierarchical coordinated control architecture of the gas-cooled microreactor based on deep reinforcement learning, offline training is required to address the multivariable strong coupling and strong nonlinearity of the system during load changes. This invention selects typical load-tracking conditions covering the full power operating range, as shown in Figure 3, as the training environment, and the SAC agent policy gradually converges over 1500 training iterations.
[0059] S4. Execute real-time coordinated control, inputting the real-time operating parameters of the gas-cooled microreactor to the trained SAC intelligent agent controller. The SAC intelligent agent controller outputs the original setpoints for power and outlet temperature, as well as system status data, to decouple power and outlet temperature. The signal processing layer sequentially performs rate limiting and low-pass filtering on the original setpoints to obtain optimized setpoint instructions. The PID controller layer receives the optimized setpoint instructions and outputs turbine speed and fan speed signals through the PID controller to control the secondary loop energy conversion system of the gas-cooled microreactor, achieving load tracking of the gas-cooled microreactor.
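The lower PID layer of step S4 can be illustrated with a toy closed loop. The sketch below is not the reactor model: it uses an assumed first-order plant as a stand-in for the power channel, assumed PI gains, and a simple ramp in place of the conditioned setpoint, only to show how a PID controller tracks a smoothed setpoint.

```python
class PID:
    """Textbook discrete PID with output clamping (gains below are assumed)."""

    def __init__(self, kp, ki, kd, dt, out_min, out_max, integral=0.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = integral
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, out))


dt, tau = 0.1, 5.0   # sample time and toy plant time constant (assumed)
power = 1.0          # normalized power, starting at 100%
# The integral is pre-loaded so the controller output starts at 1.0 with zero error.
pid = PID(kp=2.0, ki=1.0, kd=0.0, dt=dt, out_min=0.0, out_max=1.5, integral=1.0)
setpoint, target = 1.0, 0.4
history = []
for _ in range(1000):
    setpoint = max(target, setpoint - 0.01)  # ramped setpoint, 100% down to 40%
    u = pid.step(setpoint, power)            # stand-in for the turbine speed command
    power += dt * (u - power) / tau          # toy first-order plant response
    history.append(power)
```

In the architecture described above, a second controller of the same form would handle the outlet temperature channel and output the fan speed signal.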
[0060] In another embodiment of the present invention, a hierarchical coordination control system for gas-cooled microreactors using deep reinforcement learning is provided. This system can be used to implement the above-mentioned hierarchical coordination control method for gas-cooled microreactors using deep reinforcement learning. Specifically, the hierarchical coordination control system for gas-cooled microreactors using deep reinforcement learning includes an architecture construction module, an element design module, a training and deployment module, and a real-time control module.
[0061] The architecture construction module is used to build a hierarchical control architecture for a gas-cooled microreactor. The architecture includes an upper intelligent decision layer, a middle signal processing layer, and a lower PID controller layer. An SAC intelligent agent controller is deployed in the intelligent decision layer, and a rate limiter and a low-pass filter are set in the signal processing layer. The element design module is used to design the observation space, action space and composite reward function of the SAC intelligent agent controller. The observation space includes the thermal parameters of the gas-cooled microreactor secondary loop system, and the action space consists of the power setpoint optimization increment and the outlet temperature setpoint optimization increment. The training and deployment module is used to conduct offline training under complex operating conditions covering the full power range based on the hierarchical control architecture, to obtain a control strategy model with generalized robustness, and to deploy the control strategy model to the SAC intelligent agent controller. The real-time control module is used to input the real-time operating parameters of the gas-cooled microreactor to the SAC intelligent agent controller, and output the optimized increments of the power setpoint and the outlet temperature setpoint. The signal processing layer performs rate limiting and low-pass filtering on the optimized increments to synthesize smoothed power setpoints and outlet temperature setpoints. The PID controller layer outputs turbine speed signals and air flow signals to drive the secondary loop energy conversion system of the gas-cooled microreactor to achieve load tracking and decoupling control.
[0062] This invention provides a terminal device comprising a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, graphics processing units (GPUs), tensor processing units (TPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions to achieve a corresponding method flow or function. The processor described in this embodiment can be used to run the hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning, including: A hierarchical control architecture for a gas-cooled microreactor is constructed, comprising an upper intelligent decision-making layer, a middle signal processing layer, and a lower PID controller layer. An SAC intelligent agent controller is designed in the intelligent decision-making layer, and a rate limiter and a low-pass filter are set in the signal processing layer. The observation space, action space, and composite reward function of the SAC agent controller are designed. The observation space includes the thermal parameters of the gas-cooled microreactor's secondary loop system, the action space comprises optimized values of the electrical power and outlet temperature setpoints, and the composite reward function guides the training and decision-making of the SAC agent controller. 
Based on this hierarchical control architecture and the designed SAC agent controller, offline training and strategy tuning experiments covering the full power operating range of the gas-cooled microreactor under complex operating conditions are conducted. A control strategy model with generalized robustness and a trained SAC agent controller are obtained. The control strategy model is deployed into the SAC agent controller. Real-time coordinated control is executed, and the real-time operating parameters of the gas-cooled microreactor are input to the trained SAC agent controller. The SAC agent controller outputs the original setpoints for power and outlet temperature, as well as system status data, to achieve decoupled control of power and outlet temperature. The signal processing layer sequentially performs rate limiting and low-pass filtering on the original setpoints to obtain optimized setpoint instructions. The PID controller layer receives the optimized setpoint instructions and outputs turbine speed and fan speed signals through the PID controller to control the secondary loop energy conversion system of the gas-cooled microreactor, thereby achieving load tracking of the gas-cooled microreactor.
[0063] Referring to Figure 5, the terminal device is a computer device. In this embodiment, the computer device 60 includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61. When executed by the processor 61, the computer program 63 implements the hierarchical coordination control method for gas-cooled microreactors using deep reinforcement learning as described in this embodiment. To avoid repetition, details are omitted here. Alternatively, when executed by the processor 61, the computer program 63 implements the functions of each module / unit in the hierarchical coordination control system for gas-cooled microreactors using deep reinforcement learning as described in this embodiment. To avoid repetition, details are omitted here.
[0064] Computer device 60 can be a desktop computer, laptop, handheld computer, cloud server, or other computing device. Computer device 60 may include, but is not limited to, a processor 61 and a memory 62. Those skilled in the art will understand that Figure 5 is merely an example of computer device 60 and does not constitute a limitation on computer device 60. It may include more or fewer components than shown, or combine certain components, or use different components. For example, the computer device may also include input / output devices, network access devices, buses, etc.
[0065] The processor 61 may be a Central Processing Unit (CPU), or other general-purpose processors, graphics processing units (GPUs), tensor processing units (TPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0066] The memory 62 can be an internal storage unit of the computer device 60, such as a hard disk or memory of the computer device 60. The memory 62 can also be an external storage device of the computer device 60, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. equipped on the computer device 60.
[0067] Furthermore, the memory 62 may include both internal storage units of the computer device 60 and external storage devices. The memory 62 is used to store computer programs and other programs and data required by the computer device. The memory 62 can also be used to temporarily store data that has been output or will be output.
[0068] Referring to Figure 6, the terminal device is an electronic device 600, which takes the form of a general-purpose computing device. The components of the electronic device may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including storage unit 620 and processing unit 610), a display unit 640, etc.
[0069] The storage unit stores program code, which can be executed by the processing unit 610 to perform the steps described in the method section of this specification according to various exemplary embodiments of the present invention. For example, the processing unit 610 can perform the steps shown in Figure 7.
[0070] Storage unit 620 may include readable media in the form of volatile storage units, such as random access memory (RAM) 6201 and / or cache memory 6202, and may further include read-only memory (ROM) 6203.
[0071] Storage unit 620 may also include a program / utility 6204 having a set (at least one) program module 6205, such program module 6205 including but not limited to: operating system, one or more application programs, other program modules and program data, each or some combination of these examples may include an implementation of a network environment.
[0072] Bus 630 can represent one or more of several types of bus structures, including a memory cell bus or memory cell controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of the multiple bus structures.
[0073] Electronic device 600 can also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 600, and / or with any device that enables electronic device 600 to communicate with one or more other computing devices (e.g., router, modem). This communication can be performed via input / output interface 650. Furthermore, electronic device 600 can also communicate with one or more networks (e.g., local area network, wide area network, and / or public network, such as the Internet) via network adapter 660. Network adapter 660 can communicate with other modules of electronic device 600 via bus 630. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms.
[0074] Example 3: This invention also provides a storage medium, specifically a computer-readable storage medium, which is a memory device in a terminal device for storing programs and data. It is understood that the computer-readable storage medium here can include both built-in storage media in the terminal device and extended storage media supported by the terminal device; it can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, the storage space also stores one or more instructions suitable for loading and execution by a processor, which can be one or more computer programs (including program code). More specific examples of the computer-readable storage medium include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, optical fiber, portable compact disk read-only memory, optical storage device, magnetic storage device, or any suitable combination thereof.
[0075] Computer-readable storage media also include data signals propagated in baseband or as part of a carrier wave, carrying readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium can also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, radio frequency, etc., or any suitable combination thereof.
[0076] Program code for performing the operations of this invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0077] One or more instructions stored in a computer-readable storage medium can be loaded and executed by a processor to implement the corresponding steps of the hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning in the above embodiments; one or more instructions in the computer-readable storage medium are loaded and executed by the processor to perform the following steps: A hierarchical control architecture for a gas-cooled microreactor is constructed, comprising an upper intelligent decision-making layer, a middle signal processing layer, and a lower PID controller layer. An SAC intelligent agent controller is designed in the intelligent decision-making layer, and a rate limiter and a low-pass filter are set in the signal processing layer. The observation space, action space, and composite reward function of the SAC agent controller are designed. The observation space includes the thermal parameters of the gas-cooled microreactor's secondary loop system, the action space comprises optimized values of the electrical power and outlet temperature setpoints, and the composite reward function guides the training and decision-making of the SAC agent controller. Based on this hierarchical control architecture and the designed SAC agent controller, offline training and strategy tuning experiments covering the full power operating range of the gas-cooled microreactor under complex operating conditions are conducted. A control strategy model with generalized robustness and a trained SAC agent controller are obtained. The control strategy model is deployed into the SAC agent controller. Real-time coordinated control is executed, and the real-time operating parameters of the gas-cooled microreactor are input to the trained SAC agent controller. The SAC agent controller outputs the original setpoints for power and outlet temperature, as well as system status data, to achieve decoupled control of power and outlet temperature. 
The signal processing layer sequentially performs rate limiting and low-pass filtering on the original setpoints to obtain optimized setpoint instructions. The PID controller layer receives the optimized setpoint instructions and outputs turbine speed and fan speed signals through the PID controller to control the secondary loop energy conversion system of the gas-cooled microreactor, thereby achieving load tracking of the gas-cooled microreactor.
[0078] The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0079] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0080] Figure 4 presents the response curves of key system parameters of the gas-cooled microreactor under conventional PID control and SAC-PID control, respectively, when the load is linearly reduced from 100% to 40%.
[0081] As shown in the normalized power response curve in Figure 4(d), during a large linear load reduction, traditional PID control exhibits obvious overshoot (the curve exceeds the setpoint trajectory), which may lead to reactor power fluctuation risks. With the SAC-PID strategy of this invention, the power curve closely follows the setpoint trajectory with almost no overshoot. Similarly, in the secondary-side outlet temperature response in Figure 4(f), the SAC-PID strategy effectively suppresses large temperature fluctuations. These results indicate that this invention, through the agent's forward-looking decision-making and overshoot penalty mechanism, minimizes the overshoot of key parameters and improves the safety margin of nuclear system operation.
[0082] In the nuclear power curve in Figure 4(a) and the helium flow rate curve in Figure 4(g), the traditional PID controller requires a long time to regain stability after operating condition changes, exhibiting significant lag and slowly decaying oscillation. In contrast, the SAC-PID strategy of this invention follows setpoint changes more quickly and rapidly enters steady state. This is attributed to the SAC agent's modeling and real-time optimization of the system's nonlinear dynamics, which enable it to adjust turbine speed and air flow rate in advance, significantly shortening the system's settling time and improving the microreactor's response speed to grid load demands.
[0083] The airflow signal in Figure 4(h) provides the clearest comparison. Traditional PID or unfiltered AI control often leads to frequent operation of the actuator (fan), generating high-frequency sawtooth fluctuations that shorten equipment lifespan. This invention introduces a rate limiter and a low-pass filter in the middle layer, making the output airflow command (and the final execution signal) smooth, without high-frequency jitter. This verifies the effectiveness of the physical constraint layer and shows that this invention can reduce mechanical wear, extend the service life of critical equipment, and balance control performance against equipment protection.
[0084] During the load reduction process, there is strong coupling between electrical power and outlet temperature. Figure 4 shows that, under the strategy of this invention, the outlet temperature (Figure 4(f)) remains stable while electrical power is being adjusted, and vice versa. This indicates that the cross-channel decoupling term in the composite reward function is effective: the agent learns to adjust the two variables independently, avoiding the chain reaction in traditional control where power adjustment disrupts temperature, and achieving multivariable coordinated control.
[0085] The method of this invention significantly reduces the overshoot and shortens the settling time of both the electrical power and the secondary-side outlet temperature. Compared with traditional control, it therefore provides better control performance for regulating the electrical power of gas-cooled microreactors and improves economic and safety benefits.
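The two time-domain indicators named above, overshoot and settling time, can be computed from a sampled response as in the following sketch. The definitions used here (peak excursion beyond the final value relative to the step size, and the first time after which the response stays inside a tolerance band) are common conventions assumed for illustration, and the sample data are invented, not measurements from Figure 4.

```python
def overshoot_percent(response, initial, final):
    """Peak excursion beyond the final value, as a percentage of the step size."""
    span = final - initial
    if span == 0:
        raise ValueError("initial and final values must differ")
    direction = 1.0 if span > 0 else -1.0
    # For a load reduction (span < 0), overshoot means dipping below the final value.
    peak = max(direction * (y - final) for y in response)
    return max(0.0, peak) / abs(span) * 100.0


def settling_time(times, response, final, band):
    """First time after which the response stays within +/- band of the final value."""
    settled_at = None
    for t, y in zip(times, response):
        if abs(y - final) > band:
            settled_at = None      # left the band again, so reset
        elif settled_at is None:
            settled_at = t         # entered the band
    return settled_at


# Invented normalized power samples for a 100% to 40% load reduction.
times = [0, 1, 2, 3, 4, 5, 6]
power = [1.0, 0.7, 0.45, 0.37, 0.39, 0.4, 0.4]
os_pct = overshoot_percent(power, initial=1.0, final=0.4)
t_settle = settling_time(times, power, final=0.4, band=0.02)
```

Applying the same two functions to the PID and SAC-PID responses gives a like-for-like numerical comparison of the curves.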
[0086] In summary, this invention presents a hierarchical coordinated control method and system for gas-cooled microreactors employing deep reinforcement learning. Addressing the control bottlenecks of strong coupling, strong nonlinearity, and thermal inertia hysteresis in gas-cooled microreactors, and without altering the existing industrial control foundation, this invention constructs a hierarchical architecture combining intelligent decision-making and physical execution. The upper layer uses a reinforcement learning agent to dynamically optimize the setpoint trajectories of electrical power and outlet temperature, achieving multivariable coordinated decoupling. The middle layer introduces a physical constraint layer containing a rate limiter and a low-pass filter, filtering out high-frequency jitter and suppressing actuator oscillations. The bottom layer retains an industrial-grade PID controller for final adjustment, ensuring the system's safety and robustness. Compared with traditional control strategies, the hierarchical coordinated controller shows smaller overshoot and shorter settling time in time-domain evaluation, together with good disturbance rejection under transient conditions, realizing effective regulation of the gas-cooled microreactor. It also provides a useful reference for the power regulation of other reactors.
[0087] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0088] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0089] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this invention can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0090] In the embodiments provided by this invention, it should be understood that the disclosed devices / terminals and methods can be implemented in other ways. For example, the device / terminal embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0091] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0092] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0093] If the integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0094] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus, and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart... Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0095] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0096] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0097] The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.
Claims
1. A hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning, characterized in that it includes the following steps:
S1. Construct a hierarchical control architecture for a gas-cooled microreactor, which includes an upper intelligent decision layer, a middle signal processing layer, and a lower PID controller layer. Design an SAC agent controller in the intelligent decision layer, and set a rate limiter and a low-pass filter in the signal processing layer.
S2. Design the observation space, action space, and composite reward function of the SAC agent controller. The observation space includes the thermal parameters of the gas-cooled microreactor secondary loop system. The action space consists of the optimized values of the electric power setpoint and the outlet temperature setpoint. The composite reward function is used to guide the training and decision-making of the SAC agent controller.
S3. Based on the hierarchical control architecture of the gas-cooled microreactor and the designed SAC agent controller, conduct offline training and policy tuning experiments under complex operating conditions covering the full power operating range of the gas-cooled microreactor, to obtain a control policy model with generalized robustness and a trained SAC agent controller. Deploy the control policy model into the SAC agent controller.
S4. Execute real-time coordinated control by inputting the real-time operating parameters of the gas-cooled microreactor to the trained SAC agent controller. The SAC agent controller outputs the original setpoints for power and outlet temperature, as well as system status data, to achieve decoupled control of power and outlet temperature. The signal processing layer sequentially performs rate limiting and low-pass filtering on the original setpoints to obtain optimized setpoint instructions. The PID controller layer receives the optimized setpoint instructions and outputs turbine speed signals and fan speed signals through the PID controllers to control the secondary loop energy conversion system of the gas-cooled microreactor, achieving load tracking of the gas-cooled microreactor.
2. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the SAC agent controller is built on the maximum-entropy reinforcement learning mechanism. The objective function of the SAC agent controller includes a cumulative reward term and an entropy term. The entropy term measures the randomness of the policy, and the temperature parameter α of the objective function controls the trade-off between reward and entropy.
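For reference, an objective of the kind described in claim 2 is commonly written in the following form in the SAC literature. The discount factor γ and the trajectory notation τ are assumptions taken from that standard formulation; the claim itself names only the cumulative reward term, the entropy term, and the temperature parameter α.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ \log \pi(a \mid s) \big]
```

A larger α favors more random (exploratory) policies, while α approaching zero recovers the conventional cumulative-reward objective.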
3. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the SAC agent controller adopts an Actor-Critic architecture, including a critic network with a dual-Q network, a soft Bellman equation update module that introduces an entropy term, and an automatic adjustment module for the temperature parameter α. The dual-Q network consists of two independent networks, Q1 and Q2, used to estimate the value of the current state-action pair.
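The critic update and the temperature adjustment named in claim 3 can be illustrated with a minimal scalar sketch. It assumes the common SAC convention of taking the minimum of the two Q estimates and a fixed target entropy; neither detail is stated in the claim, and the function names, the default discount factor, and the scalar (non-network) form are hypothetical simplifications.

```python
import math
from typing import Sequence


def soft_bellman_target(reward: float, done: bool, q1_next: float, q2_next: float,
                        logp_next: float, alpha: float, gamma: float = 0.99) -> float:
    """Soft target y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    # Assumption: the two critic estimates are combined by taking their minimum,
    # which is the usual way a dual-Q critic limits value overestimation.
    min_q = min(q1_next, q2_next)
    # The entropy term enters the target through -alpha * log pi(a'|s').
    soft_value = min_q - alpha * logp_next
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * soft_value


def temperature_loss(log_alpha: float, logps: Sequence[float],
                     target_entropy: float) -> float:
    """Loss J(alpha) = mean(-alpha * (log pi + target_entropy)) for tuning alpha."""
    # Assumption: alpha is parameterized as exp(log_alpha) to keep it positive.
    alpha = math.exp(log_alpha)
    return sum(-alpha * (lp + target_entropy) for lp in logps) / len(logps)
```

In a full implementation both Q networks would be regressed toward the same target, and α would be updated by gradient descent on the temperature loss.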
4. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the thermal parameters in the observation space include: the measured value of electric power, the reference value of the electric power setpoint, the optimized value of the electric power setpoint, the optimized value of the electric power setpoint at the previous moment, the measured value of outlet temperature, the reference value of the outlet temperature setpoint, the optimized value of the outlet temperature setpoint, and the optimized value of the outlet temperature setpoint at the previous moment.
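As an illustration, the eight quantities listed in claim 4 could be packed into a flat observation vector as sketched below. The class name, field names, ordering, and the optional normalization scales are assumptions made for the example; the claim only lists which quantities are observed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopSignals:
    """Signals of one control channel (electric power or outlet temperature)."""
    measured: float        # measured value
    reference: float       # reference value of the setpoint
    optimized: float       # optimized value of the setpoint
    optimized_prev: float  # optimized value of the setpoint at the previous moment


def build_observation(power: LoopSignals, temperature: LoopSignals,
                      power_scale: float = 1.0,
                      temperature_scale: float = 1.0) -> List[float]:
    """Return the 8-element observation: 4 power signals, then 4 temperature signals."""
    # Assumption: each channel is divided by a rated value so both channels
    # have comparable magnitude before entering the agent's networks.
    obs = []
    for signals, scale in ((power, power_scale), (temperature, temperature_scale)):
        obs.extend([
            signals.measured / scale,
            signals.reference / scale,
            signals.optimized / scale,
            signals.optimized_prev / scale,
        ])
    return obs
```

Including the previous optimized setpoints lets the agent observe its own most recent change, which the smoothing-related reward terms depend on.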
5. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the composite reward function includes an error penalty term, an action smoothing and oscillation damping term, a zero-bias reward term, an overshoot penalty term, and a cross-channel decoupling term. The reward terms work together to provide multi-dimensional training and decision guidance for the SAC agent controller.
6. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 5, characterized in that the error penalty term sets a dead zone of 1% for both power and temperature. When the system deviation is within the dead zone, the SAC agent controller is rewarded. The error penalty term also monitors the reference power level in real time and automatically increases the power error weight in the low-power operating range.
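A minimal sketch of an error term with a 1% dead zone and a power-dependent weight is given below. Only the 1% dead zone comes from claim 6; the relative-error definition, the in-band bonus, the low-power threshold, and the gain are hypothetical values chosen for illustration.

```python
def error_term(measured: float, reference: float, weight: float = 1.0,
               dead_zone: float = 0.01, in_band_bonus: float = 1.0) -> float:
    """Reward inside the dead zone, weighted penalty on relative error outside it."""
    # Assumption: the deviation is measured relative to the reference value.
    rel_err = abs(measured - reference) / max(abs(reference), 1e-9)
    if rel_err <= dead_zone:
        return in_band_bonus      # deviation within 1%: the agent is rewarded
    return -weight * rel_err      # otherwise: penalty proportional to the error


def power_error_weight(reference_power_fraction: float, base: float = 1.0,
                       low_power_threshold: float = 0.3,
                       low_power_gain: float = 2.0) -> float:
    """Increase the power error weight when the reference power level is low."""
    # Assumption: "low power" means below 30% of rated power, with a 2x weight.
    if reference_power_fraction < low_power_threshold:
        return base * low_power_gain
    return base
```

For example, `error_term(measured, reference, weight=power_error_weight(reference / rated))` would apply the heavier weight only in the low-power range, where `rated` is a hypothetical rated-power value.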
7. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 5, characterized in that the action smoothing and damping term penalizes the rate of change of the action, and the penalty weight for the temperature channel is several times that for the power channel; the term also monitors changes in the direction of the action output, penalizes sign flips of control commands near steady state, and doubles the smoothing coefficient when the error falls within 3%.
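The behavior described in claim 7 can be sketched for a single channel as follows. The 3% threshold and the doubling of the smoothing coefficient come from the claim; the base coefficient, the flip penalty, the interpretation of a "sign flip" as a reversal of the direction of action change, and the example channel weight are assumptions.

```python
def smoothing_term(action: float, prev_action: float, prev_delta: float,
                   rel_error: float, base_coeff: float = 0.1,
                   flip_penalty: float = 0.5, near_steady: float = 0.03,
                   channel_weight: float = 1.0) -> float:
    """Return a non-positive reward that penalizes fast or oscillating actions."""
    delta = action - prev_action
    near = abs(rel_error) < near_steady
    # The smoothing coefficient is doubled once the error is within 3%.
    coeff = base_coeff * (2.0 if near else 1.0)
    penalty = channel_weight * coeff * abs(delta)
    # Assumption: a sign flip means the action changed direction relative to
    # the previous step; it is only penalized near steady state.
    if near and delta * prev_delta < 0.0:
        penalty += flip_penalty
    return -penalty
```

Calling it with, say, `channel_weight=3.0` for the temperature channel and `1.0` for the power channel would reflect the "several times" weighting; the factor 3 is illustrative only.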
8. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the offline training under complex operating conditions uses typical load tracking conditions covering the full power operating range of the gas-cooled microreactor as the training environment, and drives the policy of the SAC agent controller to convergence through at least 1500 training iterations to obtain the control policy model. The system status data includes the measured value of electric power, the measured value of outlet temperature, the current action of the SAC agent controller, and the action of the SAC agent controller at the previous moment.
9. The hierarchical coordinated control method for gas-cooled microreactors using deep reinforcement learning according to claim 1, characterized in that the rate limiter imposes a hard constraint on the rate of change of the setpoint, and the low-pass filter removes high-frequency jitter induced by the stochastic policy sampling of the SAC agent controller; the PID controller layer includes a power loop PID controller and a temperature loop PID controller. The power loop PID controller receives the optimized electric power setpoint and outputs the turbine speed signal, and the temperature loop PID controller receives the optimized outlet temperature setpoint and outputs the fan speed signal.
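The signal path in claim 9 (rate limiter, then low-pass filter, then a loop PID controller) can be sketched in discrete time as below. The first-order filter form, the positional PID with output clamping, and all class names and parameters are assumptions for illustration; the claim does not specify filter order, PID form, or tuning values.

```python
class RateLimiter:
    """Hard limit on how fast the setpoint may change per second."""
    def __init__(self, max_rate: float, dt: float, initial: float = 0.0):
        self.max_step = max_rate * dt
        self.value = initial

    def step(self, target: float) -> float:
        delta = max(-self.max_step, min(self.max_step, target - self.value))
        self.value += delta
        return self.value


class LowPassFilter:
    """First-order low-pass filter (assumed form) to remove sampling jitter."""
    def __init__(self, time_constant: float, dt: float, initial: float = 0.0):
        self.alpha = dt / (time_constant + dt)
        self.value = initial

    def step(self, x: float) -> float:
        self.value += self.alpha * (x - self.value)
        return self.value


class PID:
    """Positional PID with output clamping (no anti-windup, for brevity)."""
    def __init__(self, kp: float, ki: float, kd: float, dt: float,
                 out_min: float, out_max: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, out))


def power_loop_step(raw_setpoint: float, measured_power: float,
                    limiter: RateLimiter, lpf: LowPassFilter, pid: PID) -> float:
    """One step of the power channel: limit, filter, then PID to turbine speed."""
    optimized_setpoint = lpf.step(limiter.step(raw_setpoint))
    return pid.step(optimized_setpoint, measured_power)
```

The temperature channel would use a second set of the same three objects, with its PID output interpreted as the fan speed signal.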
10. A hierarchical coordinated control system for a gas-cooled microreactor using deep reinforcement learning, characterized in that it includes:
An architecture building module, used to build a hierarchical control architecture for a gas-cooled microreactor. The architecture includes an upper intelligent decision layer, a middle signal processing layer, and a lower PID controller layer. An SAC agent controller is deployed in the intelligent decision layer, and a rate limiter and a low-pass filter are set in the signal processing layer.
An element design module, used to design the observation space, action space, and composite reward function of the SAC agent controller. The observation space includes the thermal parameters of the gas-cooled microreactor secondary loop system, and the action space consists of the power setpoint optimization increment and the outlet temperature setpoint optimization increment.
A training and deployment module, used to conduct offline training under complex operating conditions covering the full power range based on the hierarchical control architecture, to obtain a control policy model with generalized robustness, and to deploy the control policy model to the SAC agent controller.
A real-time control module, used to input the real-time operating parameters of the gas-cooled microreactor to the SAC agent controller and output the optimized increments of the power setpoint and the outlet temperature setpoint. The signal processing layer performs rate limiting and low-pass filtering on the optimized increments to synthesize smoothed power setpoints and outlet temperature setpoints. The PID controller layer outputs turbine speed signals and air flow signals to drive the secondary loop energy conversion system of the gas-cooled microreactor, achieving load tracking and decoupling control.