Intelligent control method and system for nuclear power plant accidents

By using a multi-agent reinforcement learning method, the action data of nuclear power plant emergency accidents were decomposed, and agent reward functions were designed to realize intelligent control of nuclear power plants. This solved the safety hazards caused by human judgment and improved the level of automation and operational efficiency.

CN116189945B · Active · Publication Date: 2026-06-23 · CHINA NUCLEAR POWER ENGINEERING COMPANY LTD +3

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA NUCLEAR POWER ENGINEERING COMPANY LTD
Filing Date
2022-09-09
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing nuclear power plant fault handling relies on manual judgment and experience, resulting in a heavy workload for operators, a high risk of errors, and safety hazards. The level of intelligence and automation is also insufficient.

Method used

A multi-agent reinforcement learning approach is adopted. By acquiring action data of nuclear power plant emergency accidents, the data is decomposed into multiple action subspaces and assigned to each agent. A reward function is designed to allow the agents to learn sub-policies and then aggregate them into a general control policy to achieve intelligent control.

Benefits of technology

It reduces manual workload, improves the intelligence and automation level of nuclear power plants, enhances operational efficiency and safety, and reduces operational risks.



Abstract

The present application relates to an intelligent control method and system for nuclear power plant accidents, comprising: obtaining action data of a nuclear power plant emergency accident; decomposing the action data to obtain multiple action subspaces; assigning the multiple action subspaces to the agents to obtain the action subspace corresponding to each agent; obtaining the reward function of the action subspace corresponding to each agent; each agent learning according to its corresponding reward function to obtain its sub-policy; aggregating the sub-policies of the agents to obtain an overall control policy; and intelligently controlling the nuclear power plant emergency accident based on the overall control policy. Based on multi-agent reinforcement learning, the present application realizes intelligent control of nuclear power plant accidents, greatly reduces manual workload and pressure, significantly improves the intelligence and automation level of the nuclear power plant's operation and control systems, and also improves the efficiency of nuclear power plant operation.

Description

Technical Field

[0001] This invention relates to the technical field of nuclear power plant system failures, and more specifically, to an intelligent control method and system for nuclear power plant accidents.

Background Technology

[0002] A nuclear power plant system is a complex system with multiple interconnected subsystems. The most common type of nuclear power plant is the pressurized water reactor (PWR) plant. Its working principle is as follows: nuclear fuel made from uranium undergoes nuclear fission within the reactor, releasing a large amount of heat energy; high-pressure circulating cooling water carries away the heat energy, generating steam in a steam generator; the high-temperature, high-pressure steam drives a turbine, which in turn drives a generator. Nuclear power is a clean, high-energy-density, low-carbon energy source, with virtually zero greenhouse gas emissions during operation. However, nuclear power plants are expensive to build, technologically demanding, and costly to maintain.

[0003] When well-controlled and with a robust emergency response system, nuclear power plants are actually quite safe facilities. However, nuclear power plants have extremely high safety requirements because an accident leading to the leakage of large amounts of radioactive materials such as nuclear waste and wastewater would pose a serious threat to the surrounding environment and the health and lives of residents, causing irreparable damage to the natural environment. Therefore, safety is usually the primary consideration in the design, construction, commissioning, operation, and management of nuclear power plant systems.

[0004] Pressurized water reactor (PWR) nuclear power plants typically consist of three loops, each containing numerous devices, control systems, and sensing systems. For example, a typical conventional PWR's primary loop power regulation system includes: turbine and generator power regulation, coolant average temperature and control rod position regulation, boron concentration regulation, steam bypass control system, and interlocking system. Generator power regulation and its associated turbine regulation are external controls; the steam bypass control system serves as an auxiliary to the power regulation system; and the interlocking system prevents excessive control rod lifting, which could lead to an emergency shutdown.

[0005] During the operation of a nuclear power plant, some equipment and control devices in the three loops mentioned above may malfunction (for example, a pipe in a loop may break, or a switch in a regulation system may fail). These malfunctions can cause fluctuations in some core physical parameters during the operation of the nuclear power plant (for example, a pipe in a loop may break, causing a drop in pressure in a container, which may then affect the loop temperature or flow rate).

[0006] In existing nuclear power plant operation systems, fault inspection and handling typically rely on operator judgment and experience-based operating procedures to eliminate faults and maintain safe operation. The drawback of this approach is that its efficiency and effectiveness depend heavily on the operator's judgment. Simultaneously handling multiple devices also places high demands on the operator, increasing the risk of confusion and errors, thus posing significant safety hazards and risks. Therefore, the overall level of intelligence and automation in the nuclear power plant system needs improvement.

Summary of the Invention

[0007] The technical problem to be solved by the present invention is to provide an intelligent control method and system for nuclear power plant accidents, addressing the shortcomings of the existing technology.

[0008] The technical solution adopted by this invention to solve its technical problem is: to construct an intelligent control method for nuclear power plant accidents, comprising the following steps:

[0009] Acquire action data for nuclear power plant emergency incidents;

[0010] The action data is decomposed to obtain multiple action subspaces;

[0011] The multiple action subspaces are assigned to each agent to obtain the action subspace corresponding to each agent.

[0012] Obtain the reward function for the action subspace corresponding to each of the aforementioned agents;

[0013] Each of the aforementioned agents learns according to its corresponding reward function to obtain its own sub-policy;

[0014] The sub-policies of each of the intelligent agents are summarized to obtain the overall control policy;

[0015] Intelligent control of nuclear power plant emergency accidents is carried out based on the overall control strategy.

[0016] In the intelligent control method for nuclear power plant accidents described in this invention, the step of decomposing the action data to obtain multiple action subspaces includes:

[0017] The action data is decomposed based on prior knowledge methods to obtain the multiple action subspaces;

[0018] Alternatively, the action data can be decomposed based on an action encoding method to obtain the multiple action subspaces.

[0019] In the intelligent control method for nuclear power plant accidents described in this invention, the step of decomposing the action data based on prior knowledge to obtain the multiple action subspaces includes:

[0020] Based on the prior knowledge, the types and relevance of actions are classified;

[0021] Based on the classification results, the types and correlations of actions are obtained;

[0022] Based on the type and correlation of the actions, the action data is decomposed to obtain the multiple action subspaces.

[0023] In the intelligent control method for nuclear power plant accidents described in this invention, the step of decomposing the action data based on the action coding method to obtain the plurality of action subspaces includes:

[0024] One-hot encoded actions are transformed into latent representations in the latent space using an action encoder.

[0025] The effect of each action is learned using a feedforward model neural network to obtain the latent representation of the action;

[0026] Clustering is performed in the latent space based on the latent representations of the actions to obtain clustering results;

[0027] The action data is decomposed into multiple action subspaces based on the clustering results.

[0028] In the intelligent control method for nuclear power plant accidents described in this invention, the step of allocating the plurality of action subspaces to each intelligent agent to obtain the action subspace corresponding to each intelligent agent includes:

[0029] The multiple action subspaces are assigned to each agent manually to obtain the action subspace corresponding to each agent.

[0030] Alternatively, the multiple action subspaces can be allocated using a neural network role selector to obtain the action subspace corresponding to each agent.

[0031] In the intelligent control method for nuclear power plant accidents described in this invention, the step of allocating the multiple action subspaces through a neural network role selector to obtain the action subspace corresponding to each agent includes:

[0032] Determine the input vector for the neural network role selector;

[0033] The input vector is mapped to a latent space of the same dimension as the action through the neural network;

[0034] The corresponding role is determined by maximizing the inner product between vectors;

[0035] The multiple action subspaces are allocated according to the determined corresponding roles to obtain the action subspace corresponding to each agent.

[0036] In the intelligent control method for nuclear power plant accidents described in this invention, the step of obtaining the reward function for the action subspace corresponding to each agent includes:

[0037] Designing the reward function in a process-oriented manner to obtain the reward functions for the multiple action subspaces and the reward function for each agent;

[0038] Alternatively, designing the reward function based on curriculum learning to obtain the reward functions for the multiple action subspaces and the reward function for each agent.

[0039] The present invention also provides an intelligent control system for nuclear power plant accidents, comprising:

[0040] The acquisition unit is used to acquire action data in emergency situations at nuclear power plants.

[0041] An action decomposition unit is used to decompose the action data to obtain multiple action subspaces;

[0042] An action allocation unit is used to allocate the multiple action subspaces to each agent to obtain the action subspace corresponding to each agent.

[0043] A reward design unit is used to obtain the reward function of the action subspace corresponding to each of the intelligent agents;

[0044] The learning strategy unit is used by each of the intelligent agents to learn according to the corresponding reward function and obtain the sub-policy of each of the intelligent agents.

[0045] The control strategy output unit is used to summarize the sub-policies of each of the intelligent agents to obtain the overall control strategy;

[0046] The control unit is used to intelligently control nuclear power plant emergency accidents based on the overall control strategy.

[0047] In the intelligent control system for nuclear power plant accidents described in this invention, the action decomposition unit is specifically used for:

[0048] The action data is decomposed based on prior knowledge methods to obtain the multiple action subspaces;

[0049] Alternatively, the action data can be decomposed based on an action encoding method to obtain the multiple action subspaces.

[0050] In the intelligent control system for nuclear power plant accidents described in this invention, the action allocation unit is specifically used for:

[0051] Determine the input vector for the neural network role selector;

[0052] The input vector is mapped to a latent space of the same dimension as the action through the neural network;

[0053] The corresponding role is determined by maximizing the inner product between vectors;

[0054] The multiple action subspaces are allocated according to the determined corresponding roles to obtain the action subspace corresponding to each agent.

[0055] The intelligent control method and system for nuclear power plant accidents of the present invention have the following beneficial effects: It includes: acquiring action data of a nuclear power plant emergency; decomposing the action data to obtain multiple action subspaces; allocating the multiple action subspaces to various agents to obtain the action subspace corresponding to each agent; obtaining the reward function for the action subspace corresponding to each agent; each agent learning according to the corresponding reward function to obtain its sub-policy; summarizing the sub-policies of each agent to obtain a general control policy; and performing intelligent control of the nuclear power plant emergency based on the general control policy. This invention, based on multi-agent reinforcement learning, achieves intelligent control of nuclear power plant accidents, greatly reducing manual workload and pressure, significantly improving the intelligence and automation level of the nuclear power plant's operating and control systems, and also improving the efficiency of nuclear power plant operation.

Attached Figure Description

[0056] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:

[0057] Figure 1 is a flowchart illustrating the intelligent control method for nuclear power plant accidents provided in an embodiment of the present invention;

[0058] Figure 2 is a schematic diagram of the multi-agent learning strategy provided in an embodiment of the present invention;

[0059] Figure 3 is a schematic diagram of the structure of the intelligent control system for nuclear power plant accidents provided in an embodiment of the present invention.

Detailed Implementation

[0060] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0061] Multi-agent systems are collections of multiple agents whose goal is to transform large, complex systems into smaller, more manageable systems that communicate and coordinate with each other. Reinforcement learning (RL) is an important method in machine learning; it's a learning approach that uses environmental feedback as input and discovers optimal behavioral policies through trial and error. Therefore, multi-agent reinforcement learning, which combines multi-agent systems and reinforcement learning methods, applies the ideas and algorithms of reinforcement learning to multi-agent systems.

[0062] The intelligent control method for nuclear power plant accidents proposed in this invention is a nuclear power plant emergency accident control method based on multi-agent reinforcement learning. Figure 1 is a flowchart illustrating a preferred embodiment of the intelligent control method for nuclear power plant accidents provided by the present invention.

[0063] As shown in Figure 1, the intelligent control method for nuclear power plant accidents includes the following steps:

[0064] Step S101: Obtain action data for nuclear power plant emergency accidents.

[0065] Optionally, the nuclear power plant emergency action data refers to the actions taken by operators during an emergency at a nuclear power plant. This action data is generated by collecting and storing the actions performed by operators during a nuclear power plant emergency.

[0066] Step S102: Decompose the action data to obtain multiple action subspaces.

[0067] In some embodiments, decomposing action data to obtain multiple action subspaces includes: decomposing action data based on prior knowledge methods to obtain multiple action subspaces; or decomposing action data based on action encoding methods to obtain multiple action subspaces. That is, the actions required for an emergency are decomposed into multiple subtasks (i.e., multiple action subspaces) using prior knowledge or embedding learning.

[0068] Specifically, for high-dimensional action spaces, the original action space can be decomposed into multiple action subspaces through role decomposition, thereby reducing the difficulty and complexity of the search. Optionally, this invention can employ either of the following two methods: the first is based on prior knowledge; the second is based on action encoding.

[0069] In some embodiments, decomposing action data based on prior knowledge to obtain multiple action subspaces includes: classifying the types and relevance of actions according to prior knowledge; obtaining the types and relevance of actions based on the classification results; and decomposing the action data according to the types and relevance of actions to obtain multiple action subspaces. That is, prior knowledge-based methods can classify the corresponding operations into several categories (such as switches, valves, and the addition of water or other liquids) according to their types and relevance, thus ensuring that each agent is responsible for only one type of operation.
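
The prior-knowledge decomposition described above can be sketched as a simple grouping step. This is a minimal illustration only: the action names and the three categories below are hypothetical placeholders chosen to mirror the examples in the text (switches, valves, liquid addition), not actions taken from the patent.

```python
from collections import defaultdict

# Hypothetical operator actions, each tagged with a category from prior knowledge.
# Names and categories are illustrative assumptions, not taken from the patent.
ACTIONS = [
    ("open_relief_valve", "valve"),
    ("close_isolation_valve", "valve"),
    ("start_injection_pump", "switch"),
    ("trip_breaker", "switch"),
    ("inject_borated_water", "liquid_addition"),
]


def decompose_by_prior_knowledge(actions):
    """Group action indices into subspaces keyed by their category."""
    subspaces = defaultdict(list)
    for index, (_name, category) in enumerate(actions):
        subspaces[category].append(index)
    return dict(subspaces)


subspaces = decompose_by_prior_knowledge(ACTIONS)

# One agent per category, so each agent handles a single type of operation.
agent_subspace = {
    agent_id: indices
    for agent_id, (_category, indices) in enumerate(sorted(subspaces.items()))
}
```

Because every category maps to exactly one agent, the assignment stays directly interpretable: an agent's identity tells you which kind of equipment it operates.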

[0070] In some embodiments, decomposing action data based on action encoding methods to obtain multiple action subspaces includes: converting one-hot encoded actions into latent representations in the latent space using an action encoder; learning the effect of actions using a feedforward model neural network to obtain latent representations of actions; performing clustering based on the latent representations of actions in the latent space to obtain clustering results; and decomposing the action data into multiple action subspaces based on the clustering results. The action encoding-based approach employs the RODE algorithm for action decomposition, which uses a method similar to word embedding in natural language processing to map one-hot encoded actions into latent representations in the latent space using an encoder, and then uses k-means clustering of the latent space to classify the actions into several categories.
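
The clustering step can be illustrated with a small, dependency-free k-means. This is a sketch under stated assumptions: the 2-D vectors in `LATENT` are hand-written stand-ins for the latent representations a trained action encoder would produce, and the deterministic "first k points" initialisation is a simplification chosen for reproducibility.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical latent representations, one per discrete action (index = action id).
LATENT = [(0.0, 0.1), (1.0, 1.1), (0.1, 0.0), (1.1, 1.0), (0.05, 0.05)]

labels = kmeans(LATENT, k=2)

# Actions sharing a cluster label form one action subspace (one role).
action_subspaces = {}
for action_index, label in enumerate(labels):
    action_subspaces.setdefault(label, []).append(action_index)
```

Actions 0, 2 and 4 lie close together in the latent space and end up in one subspace, while actions 1 and 3 form the other.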

[0071] Specifically, the RODE algorithm mainly consists of three parts. The first part (a) primarily involves learning the representation of actions and the subspace decomposition of actions. First, an action encoder transforms each one-hot encoded action $a$ into an action representation $z_a$ in the $d$-dimensional space $\mathbb{R}^d$. Then, a feedforward model neural network is used to learn the effect of the action: the network receives the action representation $z_a$, the actions of the other agents $a_{-i}$, and the current observation $o_i$ of agent $i$ as input, and predicts the observation $o_i'$ at the next time step and the global reward $r$. After collecting a sufficient number of samples, clustering is performed in the latent space based on the learned action representations (actions that are close to each other in the latent space produce similar effects and can be classified as the same role). Therefore, based on the clustering results, the original high-dimensional action space can be divided into multiple subsets of actions (multiple action subspaces) in the latent space, each corresponding to a different role.
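
One way to write the training objective for the action encoder and the forward model described above is given below. This is a hedged reconstruction, not a formula from the patent: $p_o$ and $p_r$ are assumed names for the observation and reward predictors, $\lambda$ is an assumed weighting factor, and the reward term assumes that $r$ in the text denotes the global reward.

```latex
\mathcal{L}_{e} \;=\; \mathbb{E}\Big[
    \big\lVert\, p_o(z_a,\, o_i,\, a_{-i}) - o_i' \,\big\rVert_2^2
    \;+\; \lambda \,\big(\, p_r(z_a,\, o_i,\, a_{-i}) - r \,\big)^2
\Big]
```

Minimising this loss pushes actions with similar effects on the next observation and reward towards nearby representations $z_a$, which is what makes the subsequent clustering meaningful.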

[0072] Based on the action's effect, actions are clustered after learning their latent representations, thus decomposing the action space so that each subset of actions can perform a specific function. To fully utilize this decomposition, the RODE algorithm uses a two-layer hierarchical structure to coordinate role and action selection. At the top level, a role selector assigns a role to each agent at regular time steps. After being assigned a role, the agent explores the corresponding role's limited action space to learn the appropriate role strategy.

[0073] In Part 2(b), RODE designs an efficient, portable, and lightweight architecture for the role selector and the role's policy learning. For the role selector, a conventional Q-network can be simply used, with its input being the local action and observation history, and its output being the Q-value for each role. However, this architecture may be inefficient because it ignores information about the action space of different roles. Intuitively, choosing a role determines a subset of actions to be performed over a given period, therefore the Q-value for a role is closely related to its corresponding action space. Therefore, we can construct the role selector using the mean of all action representations within the action space.

[0074] To select a role, all agents share a linear layer and a gated recurrent unit (GRU) to encode each agent's local action-observation history into a fixed-length vector, which serves as the input to the role selector. This fixed-length vector is then passed through a fully connected neural network and mapped onto a latent space of the same dimension as the action representation. Finally, the corresponding role is selected by maximizing the inner product between the vectors.
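
The inner-product role selection can be sketched as follows. This is a minimal illustration, assuming each role is represented by the mean of its action representations as described above; the role names, the 2-D vectors, and the hand-written `history_embedding` (standing in for the output of the shared linear layer, GRU, and fully connected network) are all hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_role(history_embedding, role_action_representations):
    """Pick the role whose mean action representation best matches the embedding."""
    role_vectors = {role: mean_vector(reps)
                    for role, reps in role_action_representations.items()}
    return max(role_vectors,
               key=lambda role: dot(history_embedding, role_vectors[role]))


# Hypothetical action representations grouped by role (action subspace).
ROLE_ACTION_REPRESENTATIONS = {
    "valves": [[1.0, 0.0], [0.8, 0.2]],
    "pumps": [[0.0, 1.0], [0.2, 0.8]],
}

# Stand-in for the encoded local action-observation history of one agent.
history_embedding = [0.3, 0.9]
chosen_role = select_role(history_embedding, ROLE_ACTION_REPRESENTATIONS)
```

Here the embedding points mostly along the second latent dimension, so the "pumps" role scores the larger inner product and is selected.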

[0075] After selecting the corresponding role, the corresponding sub-policy can be learned based on the corresponding action subspace. When learning the action value function (the Q function), a structure similar to QMIX can be adopted, in which a neural network integrates the local value functions of the agents to obtain a joint action value function. Each agent's local value function only requires its own local observations, thus the entire system operates in a distributed manner. The action that maximizes the cumulative expected reward is selected and executed through the local value functions. By adopting this method, latent representations can be used to better reflect the effects of different actions.
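
The QMIX-style value factorisation mentioned above is usually stated as a monotonicity constraint. The notation below is assumed for illustration: $Q_i$ is the local value function of agent $i$, $\tau_i$ its local action-observation history, and $f_{mix}$ the mixing network.

```latex
Q_{tot}(\boldsymbol{\tau}, \mathbf{a})
  = f_{mix}\big(Q_1(\tau_1, a_1), \dots, Q_n(\tau_n, a_n)\big),
\qquad
\frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \quad \forall i \in \{1, \dots, n\}

\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a})
  = \Big(\arg\max_{a_1} Q_1(\tau_1, a_1), \;\dots,\; \arg\max_{a_n} Q_n(\tau_n, a_n)\Big)
```

The second line is the practical consequence: each agent can act greedily on its own $Q_i$ and the resulting joint action still maximises $Q_{tot}$.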

[0076] Furthermore, to achieve better results, the first and second methods can be combined in practical applications.

[0077] Step S103: Assign multiple action subspaces to each agent to obtain the action subspace corresponding to each agent.

[0078] In some embodiments, assigning multiple action subspaces to each agent to obtain an action subspace corresponding to each agent includes: manually assigning multiple action subspaces to each agent to obtain an action subspace corresponding to each agent; or, assigning multiple action subspaces through a neural network role selector to obtain an action subspace corresponding to each agent.

[0079] In some embodiments, allocating multiple action subspaces using a neural network role selector to obtain the action subspace corresponding to each agent includes: determining the input vector of the neural network role selector; mapping the input vector to a latent space of the same dimension as the action through the neural network; determining the corresponding role by maximizing the inner product between the vectors; and allocating multiple action subspaces according to the determined corresponding role to obtain the action subspace corresponding to each agent.

[0080] Specifically, after decomposing the action space, agents need to be assigned to different action subspaces. This can be done manually or by constructing a neural network role selector. The principle of the neural network role selector is the same as the RODE algorithm. To select a role, all agents share a linear layer and a gated recurrent unit to encode each agent's local action-observation history into a fixed-length vector. This fixed-length vector serves as the input to the role selector. This fixed-length vector is then mapped to a latent space of the same dimension as the action representation after passing through a fully connected neural network. Finally, the corresponding role is selected by maximizing the inner product between the vectors.

[0081] Furthermore, since each action in a nuclear power plant system has a clear meaning, and the corresponding action subspace also has a clear physical meaning, in this embodiment of the invention, each subspace can be directly assigned to a single agent, that is, each agent is responsible for a certain type of operation. This achieves the effect of being both simple and direct, while also making the network highly interpretable.

[0082] Step S104: Obtain the reward function for the action subspace corresponding to each agent.

[0083] In some embodiments, obtaining the reward function for each agent's corresponding action subspace includes: designing the reward function in a process-oriented manner to obtain the reward functions for multiple action subspaces and the reward function for each agent; or, designing the reward function based on curriculum learning to obtain the reward functions for multiple action subspaces and the reward function for each agent.

[0084] Specifically, for designing reward functions using a process-oriented approach: process variables are incorporated into the reward function, for example by combining it with various control indicators within the process (e.g., response time, maximum overshoot, etc.). Simultaneously, a limit on the number of operations is added (e.g., introducing a negative reward term that scales with the number of operations). The goal is to reach the control targets as quickly as possible and with as few operations as possible, while ensuring safe operation.
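
A process-oriented reward of this kind might look like the sketch below. The function name, the weights, and the fixed penalty for unsafe episodes are all assumptions made for illustration; the patent names the ingredients (response time, maximum overshoot, an operation-count penalty, safety) but does not give a formula.

```python
def process_reward(response_time_s, overshoot_fraction, num_operations, safe,
                   w_time=1.0, w_overshoot=10.0, w_ops=0.1, unsafe_penalty=100.0):
    """Higher is better: penalise slow response, overshoot, and operation count.

    All weights are hypothetical tuning parameters, not values from the patent.
    """
    if not safe:
        # Safety is treated as a hard requirement with a large fixed penalty.
        return -unsafe_penalty
    return -(w_time * response_time_s
             + w_overshoot * overshoot_fraction
             + w_ops * num_operations)


# Same control quality, but the second episode needed fewer operator actions.
many_ops = process_reward(10.0, 0.5, 20, safe=True)
few_ops = process_reward(10.0, 0.5, 5, safe=True)
unsafe = process_reward(1.0, 0.0, 1, safe=False)
```

With these weights, the episode that reaches the same control quality using fewer operations receives the higher reward, and any unsafe episode scores worse than both.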

[0085] Regarding reward function design based on curriculum learning: Compared with machine learning that treats all samples indiscriminately, curriculum learning mimics the human learning process, allowing the model to start learning from easy samples and gradually progress to more complex samples and knowledge. Therefore, by designing control metrics and reward functions in stages, the agent can learn how to operate in a more balanced way, for example by initially setting easier-to-achieve control goals and gradually increasing the control difficulty once these goals are achieved.
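
The staged tightening of control goals can be sketched as a simple schedule. Everything here is an illustrative assumption: the tolerance values, the number of stages, and the success-rate threshold for promotion are hypothetical and would need to be chosen per plant and per accident scenario.

```python
def curriculum_tolerance(stage, start=0.20, end=0.02, num_stages=5):
    """Linearly tighten the allowed tracking error from `start` to `end`."""
    stage = max(0, min(stage, num_stages - 1))
    return start + (end - start) * stage / (num_stages - 1)


def next_stage(stage, recent_success_rate, promote_at=0.9, num_stages=5):
    """Advance to a harder stage once the agent succeeds often enough."""
    if recent_success_rate >= promote_at and stage < num_stages - 1:
        return stage + 1
    return stage


# Walk through the curriculum assuming the agent masters every stage.
stage = 0
tolerances = [curriculum_tolerance(stage)]
while stage < 4:
    stage = next_stage(stage, recent_success_rate=0.95)
    tolerances.append(curriculum_tolerance(stage))
```

The reward function at each stage would then count an episode as successful only if the controlled variable stays within the current tolerance, so the goal starts easy and becomes stricter over training.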

[0086] Step S105: Each agent learns according to the corresponding reward function to obtain the sub-policy of each agent.

[0087] Specifically, after being assigned to different action subspaces, each agent learns in its corresponding subspace and thus obtains the corresponding sub-control strategy.

[0088] In some embodiments, each agent may learn using the algorithm structure shown in Figure 2. As shown in Figure 2, each agent's reward function (treated here as a local value function for simplicity) only requires its own local observations. Therefore, the entire system executes in a distributed manner, with each agent selecting the action with the highest cumulative expected reward according to its local value function. The upper layer uses a mixing value network to combine the individual value functions of the agents into the overall value function, while ensuring that the joint action value function is monotonic in each local value function. Therefore, selecting the actions that maximize the local value functions also maximizes the joint action value function. This architecture achieves centralized training and distributed execution, thus balancing training and execution efficiency.
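
The claim that greedy local choices also maximise the joint value can be checked numerically on a toy case. This sketch replaces the mixing network with a fixed weighted sum with non-negative weights, which is a deliberate simplification; the local Q-values, weights, and bias are made-up numbers.

```python
from itertools import product

# Hypothetical local Q-values: LOCAL_Q[agent][action index within its subspace].
LOCAL_Q = [
    [0.2, 1.5, -0.3],  # agent 0
    [0.7, 0.1],        # agent 1
]

# Non-negative mixing weights keep Q_tot monotonic in every local Q-value.
MIX_WEIGHTS = [0.5, 2.0]
MIX_BIAS = -1.0


def q_total(joint_action):
    """Simplified monotonic mixing: non-negative weighted sum plus a bias."""
    return MIX_BIAS + sum(w * q[a]
                          for w, q, a in zip(MIX_WEIGHTS, LOCAL_Q, joint_action))


# Distributed execution: each agent greedily maximises its own local Q-value.
greedy_joint = tuple(max(range(len(q)), key=q.__getitem__) for q in LOCAL_Q)

# Centralised check: brute-force argmax of Q_tot over all joint actions.
best_joint = max(product(*(range(len(q)) for q in LOCAL_Q)), key=q_total)
```

Both routes arrive at the same joint action, which is why the agents can act on local observations alone at execution time even though training uses the joint value.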

[0089] Step S106: Summarize the sub-policies of each agent to obtain the overall control policy.

[0090] Step S107: Intelligent control of nuclear power plant emergency accidents based on the overall control strategy.
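
Steps S106 and S107 can be pictured as wrapping the per-agent sub-policies into one callable that returns a joint action. This is only a sketch: the patent does not specify how the aggregation is implemented, and the agent names, observation keys, and threshold values below are hypothetical hand-written rules standing in for learned sub-policies, not real plant setpoints.

```python
def aggregate_policies(sub_policies):
    """Build an overall policy that queries every agent's sub-policy."""
    def overall_policy(observations):
        # Each agent acts on its own local observation within its own subspace.
        return {agent: policy(observations[agent])
                for agent, policy in sub_policies.items()}
    return overall_policy


# Hypothetical stand-ins for learned sub-policies (thresholds are made up).
SUB_POLICIES = {
    "valve_agent": lambda obs: ("open_relief_valve"
                                if obs["pressure_mpa"] > 16.0 else "hold"),
    "pump_agent": lambda obs: ("start_injection_pump"
                               if obs["level_m"] < 2.0 else "hold"),
}

control = aggregate_policies(SUB_POLICIES)
joint_action = control({
    "valve_agent": {"pressure_mpa": 17.0},
    "pump_agent": {"level_m": 3.0},
})
```

The joint action is a mapping from agent to chosen operation, which the control unit would then apply to the plant.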

[0091] Figure 3 is a schematic diagram of a preferred embodiment of the intelligent control system for nuclear power plant accidents provided by the present invention. This intelligent control system for nuclear power plant accidents can be applied to the intelligent control method for nuclear power plant accidents disclosed in the embodiments of the present invention.

[0092] Specifically, as shown in Figure 3, the intelligent control system for nuclear power plant accidents includes:

[0093] Acquisition unit 301 is used to acquire action data of nuclear power plant emergency accidents.

[0094] Optionally, the nuclear power plant emergency action data refers to the actions taken by operators during an emergency at a nuclear power plant. This action data is generated by collecting and storing the actions performed by operators during a nuclear power plant emergency.

[0095] The action decomposition unit 302 is used to decompose the action data to obtain multiple action subspaces.

[0096] In some embodiments, the action decomposition unit is specifically used to: decompose action data based on a prior knowledge method to obtain multiple action subspaces; or, decompose action data based on an action encoding method to obtain multiple action subspaces.

[0097] Specifically, for high-dimensional action spaces, the original action space can be decomposed into multiple action subspaces through role decomposition, thereby reducing the difficulty and complexity of the search. Optionally, this invention can employ either of the following two methods: the first is based on prior knowledge; the second is based on action encoding.

[0098] In some embodiments, decomposing action data based on prior knowledge to obtain multiple action subspaces includes: classifying the types and relevance of actions according to prior knowledge; obtaining the types and relevance of actions based on the classification results; and decomposing the action data according to the types and relevance of actions to obtain multiple action subspaces. That is, prior knowledge-based methods can classify the corresponding operations into several categories (such as switches, valves, and the addition of water or other liquids) according to their types and relevance, thus ensuring that each agent is responsible for only one type of operation.

[0099] In some embodiments, decomposing action data based on action encoding methods to obtain multiple action subspaces includes: converting one-hot encoded actions into latent representations in the latent space using an action encoder; learning the effect of actions using a feedforward model neural network to obtain latent representations of actions; performing clustering based on the latent representations of actions in the latent space to obtain clustering results; and decomposing the action data into multiple action subspaces based on the clustering results. The action encoding-based approach employs the RODE algorithm for action decomposition, which uses a method similar to word embedding in natural language processing to map one-hot encoded actions into latent representations in the latent space using an encoder, and then uses k-means clustering of the latent space to classify the actions into several categories.

[0100] Specifically, the RODE algorithm mainly consists of three parts. The first part (a) primarily involves learning the representations of actions and decomposing the action space into subspaces. First, an action encoder transforms the one-hot encoded actions into action representations z_a in a d-dimensional space R^d. Then, a feedforward model neural network is used to learn the effect of each action: the network receives the action representation z_a, the actions a_{-i} of the other agents, and the current observation o_i of agent i as input, and predicts the observation o_i' at the next time step and the global reward r. After a sufficient number of samples have been collected, clustering is performed in the latent space based on the learned action representations (actions that are close to each other in the latent space produce similar effects and can be assigned to the same role). Based on the clustering results, the original high-dimensional action space can thus be divided into multiple subsets of actions (multiple action subspaces), each corresponding to a different role.
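The clustering step described above can be sketched as follows. This is a minimal illustration only: the two-dimensional vectors in `LATENT` are hypothetical stand-ins for the representations that a trained action encoder would produce, the training of the encoder and the forward model is omitted, and the plain k-means routine with farthest-point initialization is an assumption made for reproducibility rather than the procedure of the RODE algorithm itself.

```python
import math

# Hypothetical latent action representations (stand-ins for encoder output).
LATENT = {
    "open_valve_A": (0.9, 0.1),
    "open_valve_B": (1.0, 0.2),
    "start_pump_A": (0.1, 0.9),
    "start_pump_B": (0.2, 1.0),
    "inject_coolant": (-0.9, -0.8),
    "inject_boric_acid": (-1.0, -0.9),
}


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


actions = list(LATENT)
labels = kmeans([LATENT[a] for a in actions], k=3)

# Each cluster of actions forms one action subspace (one role).
subspaces = {}
for action, label in zip(actions, labels):
    subspaces.setdefault(label, []).append(action)
```

Actions whose representations lie close together end up in the same subspace, which mirrors the idea that actions with similar effects belong to the same role.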

[0101] Based on the action's effect, actions are clustered after learning their latent representations, thus decomposing the action space so that each subset of actions can perform a specific function. To fully utilize this decomposition, the RODE algorithm uses a two-layer hierarchical structure to coordinate role and action selection. At the top level, a role selector assigns a role to each agent at regular time steps. After being assigned a role, the agent explores the corresponding role's limited action space to learn the appropriate role strategy.

[0102] In the second part (b), RODE designs an efficient, portable, and lightweight architecture for the role selector and for learning each role's policy. For the role selector, a conventional Q-network can simply be used, with the local action-observation history as input and the Q-value of each role as output. However, this architecture may be inefficient because it ignores information about the action spaces of the different roles. Intuitively, choosing a role determines the subset of actions available over a given period, so the Q-value of a role is closely related to its corresponding action space. The role selector can therefore be constructed using, for each role, the mean of all action representations within that role's action space.

[0103] To select a role, all agents share a linear layer and a gated recurrent unit (GRU) to encode each agent's local action-observation history into a fixed-length vector, which serves as the input to the role selector. This fixed-length vector is then passed through a fully connected neural network and mapped onto a latent space of the same dimension as the action representation. Finally, the corresponding role is selected by maximizing the inner product between the vectors.
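The role selection described above can be sketched as follows, under stated assumptions. The vectors in `ROLE_ACTIONS` and the history embedding `z_history` are hypothetical values; in an actual embodiment the embedding would come from the shared linear layer, the GRU, and the fully connected network, none of which is modeled here. Each role is represented by the mean of the action representations in its subspace, and the role with the largest inner product is selected.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical action representations grouped by role (action subspace).
ROLE_ACTIONS = {
    "valve_role": [(0.9, 0.1), (1.0, 0.2)],
    "pump_role": [(0.1, 0.9), (0.2, 1.0)],
}

# Role representation: mean of the action representations in the subspace.
role_repr = {role: mean_vector(vs) for role, vs in ROLE_ACTIONS.items()}


def select_role(history_embedding, role_repr):
    """Pick the role whose representation maximizes the inner product."""
    return max(role_repr, key=lambda role: dot(history_embedding, role_repr[role]))


# Hypothetical embedding of one agent's local action-observation history,
# already mapped into the same latent space as the action representations.
z_history = (0.2, 0.8)
chosen = select_role(z_history, role_repr)
```

With these values the inner products are 0.31 for the valve role and 0.79 for the pump role, so the pump role is selected.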

[0104] After the corresponding role has been selected, the corresponding sub-policy can be learned within the corresponding action subspace. When learning the action value function, i.e., the Q-function, a structure similar to QMIX can be adopted, in which a neural network integrates the local value functions of the agents to obtain a joint action value function. Each agent's local value function requires only its own local observations, so the entire system operates in a distributed manner. Each agent selects and executes the action that maximizes the cumulative expected reward according to its local value function. With this method, latent representations can better reflect the effects of different actions.
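The relationship between the local value functions and the joint action value function can be sketched as follows. This is a simplified illustration, not the mixing network itself: the single-layer mixer with fixed weights, and the values in `LOCAL_Q`, `WEIGHTS`, and `BIAS`, are assumptions. In QMIX the mixing weights are produced by additional networks conditioned on the global state; only the non-negativity constraint that yields monotonicity is reproduced here.

```python
from itertools import product


def mix(local_qs, weights, bias):
    """Monotonic mixer: absolute values keep every mixing weight non-negative."""
    return sum(abs(w) * q for w, q in zip(weights, local_qs)) + bias


# Hypothetical local Q-values: one list per agent, one entry per action.
LOCAL_Q = [
    [0.2, 1.5, -0.3],  # agent 0
    [0.7, 0.1],        # agent 1
]
WEIGHTS = [0.8, -1.2]  # the sign is removed inside mix()
BIAS = 0.5

# Distributed execution: each agent maximizes its own local value function.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in LOCAL_Q)

# Centralized check: exhaustive search over joint actions of the mixed value.
joint_best = max(
    product(*(range(len(q)) for q in LOCAL_Q)),
    key=lambda joint: mix(
        [LOCAL_Q[i][a] for i, a in enumerate(joint)], WEIGHTS, BIAS
    ),
)
```

Because the mixed value is non-decreasing in every local value, the joint action found by exhaustive search coincides with the actions the agents choose independently, which is what allows execution to be distributed.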

[0105] Furthermore, to achieve better results, the first and second methods can be combined in practical applications.

[0106] Action allocation unit 303 is used to allocate multiple action subspaces to each agent to obtain the action subspace corresponding to each agent.

[0107] The action allocation unit is specifically used for: determining the input vector of the neural network role selector; mapping the input vector to a latent space of the same dimension as the action through the neural network; determining the corresponding role by maximizing the inner product between the vectors; and allocating multiple action subspaces according to the determined corresponding roles to obtain the action subspace corresponding to each agent.

[0108] Specifically, after decomposing the action space, agents need to be assigned to different action subspaces. This can be done manually or by constructing a neural network role selector. The principle of the neural network role selector is the same as the RODE algorithm. To select a role, all agents share a linear layer and a gated recurrent unit to encode each agent's local action-observation history into a fixed-length vector. This fixed-length vector serves as the input to the role selector. This fixed-length vector is then mapped to a latent space of the same dimension as the action representation after passing through a fully connected neural network. Finally, the corresponding role is selected by maximizing the inner product between the vectors.

[0109] Furthermore, since each action in a nuclear power plant system has a clear meaning, and the corresponding action subspace also has a clear physical meaning, in this embodiment of the invention, each subspace can be directly assigned to a single agent, that is, each agent is responsible for a certain type of operation. This achieves the effect of being both simple and direct, while also making the network highly interpretable.

[0110] The reward design unit 304 is used to obtain the reward function of the action subspace corresponding to each agent.

[0111] In some embodiments, the reward design unit 304 is specifically used to: design reward functions in a process-oriented manner to obtain reward functions for multiple action subspaces and reward functions for each agent; or design reward functions in a manner related to curriculum learning to obtain reward functions for multiple action subspaces and reward functions for each agent.

[0112] Specifically, when reward functions are designed using a process-oriented approach, process variables are incorporated into the reward function, for example by combining it with various control indicators of the process (e.g., response time, maximum overshoot, etc.). At the same time, a limit on the number of operations is added (e.g., by introducing a negative reward term that scales with the number of operations). The goal is to reach the control indicators as quickly as possible and with as few operations as possible, while ensuring safe operation.
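A process-oriented reward of the kind described above might be sketched as follows. The function name `process_reward`, its weights, and the fixed penalty for unsafe operation are hypothetical; the text only states that control indicators and a penalty related to the number of operations are combined while safe operation is ensured.

```python
def process_reward(response_time_s, overshoot, n_operations, safe,
                   w_time=0.1, w_overshoot=1.0, w_ops=0.05,
                   unsafe_penalty=100.0):
    """Return a reward that is higher for faster, smoother, sparser control.

    All weights are illustrative assumptions. An unsafe episode receives a
    large fixed penalty regardless of the other indicators.
    """
    if not safe:
        return -unsafe_penalty
    return -(
        w_time * response_time_s      # slower response costs more
        + w_overshoot * overshoot     # larger maximum overshoot costs more
        + w_ops * n_operations        # every additional operation costs more
    )


fast = process_reward(response_time_s=10, overshoot=0.1, n_operations=4, safe=True)
slow = process_reward(response_time_s=20, overshoot=0.1, n_operations=4, safe=True)
busy = process_reward(response_time_s=10, overshoot=0.1, n_operations=9, safe=True)
unsafe = process_reward(response_time_s=1, overshoot=0.0, n_operations=1, safe=False)
```

Under these assumed weights, a faster response with fewer operations always scores higher, and any unsafe outcome scores lower than every safe one shown.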

[0113] Regarding reward function design related to curriculum learning: compared with machine learning that treats all samples indiscriminately, curriculum learning mimics the human learning process, allowing the model to start from easy samples and gradually progress to more complex samples and knowledge. By designing the control metrics and reward functions accordingly, the agent can learn how to operate in a more balanced way. For example, control goals that are easier to achieve are set initially, and the difficulty of control is gradually increased once these goals have been achieved.
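The staged increase in difficulty described above can be sketched as follows. The class `Curriculum`, the tolerance values in `STAGES`, the promotion threshold, and the window length are all hypothetical; the sketch only shows one way a reward criterion could be tightened after the easier goal has been met reliably.

```python
# Hypothetical tolerances on relative tracking error, from easy to hard.
STAGES = [0.20, 0.10, 0.05]


class Curriculum:
    """Tighten the control goal once recent episodes succeed often enough."""

    def __init__(self, stages, promote_at=0.8, window=5):
        self.stages = stages
        self.promote_at = promote_at  # required success rate for promotion
        self.window = window          # number of recent episodes considered
        self.level = 0
        self.recent = []

    @property
    def tolerance(self):
        return self.stages[self.level]

    def reward(self, tracking_error):
        """1.0 if the error is within the current stage's tolerance, else 0.0."""
        return 1.0 if abs(tracking_error) <= self.tolerance else 0.0

    def record(self, success):
        """Store an episode outcome and advance the stage when warranted."""
        self.recent = (self.recent + [bool(success)])[-self.window:]
        window_full = len(self.recent) == self.window
        rate = sum(self.recent) / self.window
        last_stage = self.level == len(self.stages) - 1
        if window_full and rate >= self.promote_at and not last_stage:
            self.level += 1
            self.recent = []


curriculum = Curriculum(STAGES)
reward_before = curriculum.reward(0.15)  # within the easy tolerance
for _ in range(5):
    curriculum.record(True)
reward_after = curriculum.reward(0.15)   # outside the tightened tolerance
```

The same tracking error is rewarded at the first stage but not after promotion, so the agent must improve to keep earning reward.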

[0114] The learning strategy unit 305 is used by each agent to learn according to the corresponding reward function and obtain the sub-policy of each agent.

[0115] The control strategy output unit 306 is used to summarize the sub-policies of each agent to obtain the overall control strategy.
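How the sub-policies might be summarized into an overall control strategy can be sketched as follows. The agent names, the normalized observation fields, the thresholds, and the hand-written rules are hypothetical placeholders for learned sub-policies; the sketch assumes that the overall strategy is simply the joint action formed from each agent's own decision.

```python
def make_total_policy(sub_policies):
    """Combine per-agent sub-policies into one joint control policy."""
    def total_policy(observations):
        # Each agent acts only on its own local observation.
        return {
            agent: policy(observations[agent])
            for agent, policy in sub_policies.items()
        }
    return total_policy


# Hypothetical rule-based stand-ins for learned sub-policies.
# Observations are normalized, illustrative quantities, not plant values.
def valve_policy(obs):
    return "open_relief_valve" if obs["pressure"] > 1.0 else "hold"


def pump_policy(obs):
    return "start_feedwater_pump" if obs["level"] < 0.5 else "hold"


total_policy = make_total_policy({
    "valve_agent": valve_policy,
    "pump_agent": pump_policy,
})

joint_action = total_policy({
    "valve_agent": {"pressure": 1.2},
    "pump_agent": {"level": 0.7},
})
```

Here the valve agent opens the relief valve while the pump agent holds, and the two decisions together form the joint control action.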

[0116] Control unit 307 is used for intelligent control of nuclear power plant emergency accidents based on the overall control strategy.

[0117] This invention introduces multi-agent reinforcement learning into the nuclear power plant control system, thereby realizing an artificial intelligence-assisted nuclear power plant control scheme. This greatly reduces the workload and stress placed on human operators, significantly improves the intelligence and automation level of the nuclear power plant operating system and control system, and also improves the efficiency of nuclear power plant operation.

[0118] Nuclear power plant control systems involve numerous parameters and control variables, which results in complex control schemes. Searching within a high-dimensional action space composed of many control variables is difficult, and it is hard to guarantee accuracy and real-time performance during control. A solution can be found by exploiting the correlations between control variables. Borrowing the role decomposition approach from multi-agent reinforcement learning, the variables to be controlled can be divided, according to their correlations, among multiple virtual roles that are controlled individually. In essence, this breaks one search problem in a high-dimensional control space into multiple search problems in lower-dimensional spaces, which significantly reduces the difficulty and time cost of the search and improves its efficiency.
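The reduction in search effort can be illustrated with a small worked example. The figures are hypothetical: twelve control variables with three settings each, split evenly among three roles. The sketch compares the number of joint settings with the number each role must search on its own.

```python
import math

# Hypothetical setup: 12 control variables, each with 3 possible settings.
settings_per_variable = [3] * 12

# Hypothetical decomposition into three roles of four variables each.
roles = [
    settings_per_variable[0:4],
    settings_per_variable[4:8],
    settings_per_variable[8:12],
]

# Undecomposed search: every combination of all twelve variables.
joint_size = math.prod(settings_per_variable)

# Decomposed search: each role only enumerates its own four variables.
per_role_sizes = [math.prod(role) for role in roles]
decomposed_total = sum(per_role_sizes)
```

Under these assumptions the undecomposed space contains 531441 combinations, whereas each role searches 81, or 243 in total across the three roles.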

[0119] Nuclear power plant systems are complex and require the control of many devices. When one operator must operate multiple devices simultaneously, errors are likely to occur because of the heavy workload. When multiple operators operate the same devices, communication problems can also arise. With multi-agent reinforcement learning, multiple devices can be operated simultaneously without communication costs or delays. This increases the safety and reliability of nuclear power plant operation and helps move towards a new model of unmanned, fully intelligent nuclear power in the future.

[0120] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For similar or identical parts, the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

[0121] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0122] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0123] The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. They do not limit the scope of protection of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention should fall within the scope of the claims of the present invention.

Claims

1. A method for intelligent control of nuclear power plant accidents, characterized in that it includes the following steps: acquiring action data of nuclear power plant emergency accidents; decomposing the action data to obtain multiple action subspaces, wherein decomposing the action data to obtain multiple action subspaces includes: converting one-hot encoded actions into latent representations in a latent space using an action encoder; learning the effect of actions using a feedforward model neural network to obtain the latent representations of the actions; performing clustering in the latent space based on the latent representations of the actions to obtain a clustering result; and decomposing the action data into the multiple action subspaces based on the clustering result; after the multiple action subspaces are obtained, assigning the multiple action subspaces to each agent to obtain the action subspace corresponding to each agent; obtaining the reward function of the action subspace corresponding to each agent; each agent learning according to its corresponding reward function to obtain its sub-policy, wherein a centralized training and distributed execution approach is adopted: each agent's local value function depends only on its own local observations; during training, a hybrid value network is used to combine the individual value functions of the agents to obtain the overall value function, ensuring that the joint action value function has the same monotonicity as each local value function; during execution, each agent selects the action with the highest cumulative expected reward by maximizing its own local value function; summarizing the sub-policies of the agents to obtain the overall control strategy; and performing intelligent control of nuclear power plant emergency accidents based on the overall control strategy.

2. The intelligent control method for nuclear power plant accidents according to claim 1, characterized in that the step of decomposing the action data to obtain multiple action subspaces further includes: decomposing the action data based on a prior-knowledge method to obtain the multiple action subspaces.

3. The intelligent control method for nuclear power plant accidents according to claim 2, characterized in that decomposing the action data based on the prior-knowledge method to obtain the multiple action subspaces includes: classifying the types and relevance of actions according to the prior knowledge; obtaining the types and relevance of the actions based on the classification results; and decomposing the action data according to the types and relevance of the actions to obtain the multiple action subspaces.

4. The intelligent control method for nuclear power plant accidents according to claim 1, characterized in that the step of assigning the multiple action subspaces to each agent to obtain the action subspace corresponding to each agent includes: assigning the multiple action subspaces to each agent manually to obtain the action subspace corresponding to each agent; or allocating the multiple action subspaces using a neural network role selector to obtain the action subspace corresponding to each agent.

5. The intelligent control method for nuclear power plant accidents according to claim 4, characterized in that allocating the multiple action subspaces using a neural network role selector to obtain the action subspace corresponding to each agent includes: determining the input vector of the neural network role selector; mapping the input vector, through the neural network, to a latent space of the same dimension as the action; determining the corresponding role by maximizing the inner product between the vectors; and allocating the multiple action subspaces according to the determined roles to obtain the action subspace corresponding to each agent.

6. The intelligent control method for nuclear power plant accidents according to claim 1, characterized in that obtaining the reward function of the action subspace corresponding to each agent includes: designing reward functions in a process-oriented manner to obtain the reward functions of the multiple action subspaces and the reward function of each agent; or designing reward functions in a manner related to curriculum learning to obtain the reward functions of the multiple action subspaces and the reward function of each agent.

7. An intelligent control system for nuclear power plant accidents, characterized in that it includes: an acquisition unit, used to acquire action data of nuclear power plant emergency accidents; an action decomposition unit, used to decompose the action data to obtain multiple action subspaces, which includes decomposing the action data based on an action encoding method to obtain the multiple action subspaces, wherein decomposing the action data based on the action encoding method includes: converting one-hot encoded actions into latent representations in a latent space using an action encoder; learning the effect of actions using a feedforward model neural network to obtain the latent representations of the actions; performing clustering in the latent space based on the latent representations of the actions to obtain a clustering result; and decomposing the action data into the multiple action subspaces based on the clustering result; an action allocation unit, used to allocate the multiple action subspaces to each agent to obtain the action subspace corresponding to each agent; a reward design unit, used to obtain the reward function of the action subspace corresponding to each agent; a learning strategy unit, used by each agent to learn according to the corresponding reward function to obtain the sub-policy of each agent, wherein a centralized training and distributed execution approach is adopted: during training, a hybrid value network is used to combine the individual value functions of the agents to obtain the overall value function, ensuring that the joint action value function has the same monotonicity as each local value function; during execution, each agent selects the action with the highest cumulative expected reward by maximizing its own local value function; a control strategy output unit, used to summarize the sub-policies of the agents to obtain the overall control strategy; and a control unit, used to perform intelligent control of nuclear power plant emergency accidents based on the overall control strategy.

8. The intelligent control system for nuclear power plant accidents according to claim 7, characterized in that the action decomposition unit is also specifically used for: decomposing the action data based on a prior-knowledge method to obtain the multiple action subspaces.

9. The intelligent control system for nuclear power plant accidents according to claim 7, characterized in that the action allocation unit is specifically used for: determining the input vector of the neural network role selector; mapping the input vector, through the neural network, to a latent space of the same dimension as the action; determining the corresponding role by maximizing the inner product between the vectors; and allocating the multiple action subspaces according to the determined roles to obtain the action subspace corresponding to each agent.