A method for coordinated control of a solid oxide fuel cell gas supply system
By employing distributed deep reinforcement learning, a coordinated control model for hydrogen and air intelligent agents was established, which solved the complexity problem of the gas supply system and improved the stability and efficiency of output voltage and stack temperature.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIVERSITY OF ELECTRIC POWER
- Filing Date
- 2022-12-16
- Publication Date
- 2026-06-23
AI Technical Summary
Existing solid oxide fuel cell gas supply systems are highly complex to control, making it difficult to coordinate the control of hydrogen and air flow rates, which leads to difficulties in ensuring the stability and efficiency of output power, voltage, and stack temperature.
A distributed deep reinforcement learning approach is adopted. By setting up hydrogen and air agents, offline training is performed using the PE-MA4DPG algorithm. Combined with explorer, pioneer, and demonstrator modules, a coordinated control strategy model is established and executed in a distributed manner in an online application to achieve coordinated control of hydrogen and air flow.
It improves the efficiency of the gas supply system, ensures the stability of output voltage and stack temperature, meets the constraints of excess oxygen rate and fuel utilization rate, and enhances the robustness and control performance of the system.
Figure CN116154236B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of energy management technology for solid oxide fuel cell gas supply systems, and in particular to a coordinated control method for solid oxide fuel cell gas supply systems based on distributed deep reinforcement learning.
Background Technology
[0002] Solid oxide fuel cells (SOFCs) have become one of the most promising power generation technologies of the 21st century due to their quiet operation, environmental friendliness, and high efficiency. Their widespread application is of great significance for protecting the environment and alleviating the energy crisis.
[0003] However, SOFCs are complex nonlinear systems with multiple inputs and multiple outputs. Output power, output voltage, stack temperature, and operating efficiency are simultaneously affected by various operational variables such as hydrogen flow rate and air flow rate, resulting in high control complexity. In practical SOFC applications, the gas supply system needs to provide the required oxygen and hydrogen to the stack according to demand. It must ensure the stack reaction performs optimally while minimizing unnecessary losses, reducing parasitic power, and improving overall system efficiency. Simultaneously, to control the stack temperature, the gas supply system also needs to control the air flow rate in real time to remove excess heat from the stack and keep it operating within a reasonable range, thereby improving stack performance and lifespan. Furthermore, SOFCs face numerous operational constraints, including maintaining the fuel utilization rate between 0.7 and 0.9 and the excess oxygen rate between 8 and 11. To address these issues, there is an urgent need to develop a coordinated control method for solid oxide fuel cell gas supply systems based on distributed deep reinforcement learning.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a coordinated control method for a solid oxide fuel cell gas supply system based on distributed deep reinforcement learning. This invention is the first to apply distributed deep reinforcement learning to the energy management of a solid oxide fuel cell gas supply system, and combines artificial intelligence technology with traditional gas flow control technology to improve the efficiency of the solid oxide fuel cell gas supply system.
[0005] The objective of this invention can be achieved through the following technical solutions:
[0006] The purpose of this invention is to provide a coordinated control method for a solid oxide fuel cell gas supply system, comprising the following steps:
[0007] S1: Offline training: Two agents are set up, namely a hydrogen agent and an air agent. The hydrogen agent and the air agent are used to control the flow rate of hydrogen and air entering the solid oxide fuel cell, respectively. Then, the agents are trained by centralized learning and distributed execution to ensure that the two agents can consider each other's strategies during training. An exploration unit is introduced during training to improve the adaptive ability and robustness, and finally a coordinated control strategy model is obtained.
[0008] S2: Online Application: Based on the trained coordinated control strategy model, the hydrogen agent detects the hydrogen flow and output voltage of the solid oxide fuel cell, and the air agent controls the oxygen flow by adjusting the voltage of the air compressor motor. Each agent executes decisions based on its own sensor status to ensure that the output voltage and stack temperature of the solid oxide fuel cell reach the preset ideal values.
[0009] Furthermore, both the hydrogen agent and the air agent include one actor network and two critic networks.
[0010] Furthermore, in S1, the PE-MA4DPG algorithm is used for offline training. The PE-MA4DPG algorithm builds on the DDPG algorithm and adopts an actor-critic architecture to select appropriate actions in the continuous action space.
[0011] Furthermore, in S1, the PE-MA4DPG algorithm includes a policy network and a value function network;
[0012] The policy network consists of the actor's current network and the actor's target network.
[0013] The value function network consists of the critic's current network and the critic's target network;
[0014] The input to each agent's actor network includes the action state information of all agents, which is used for centralized training. This allows each agent to establish a centralized critic network and provide a corresponding value function, thus mitigating the problem of environmental instability.
[0015] Furthermore, the current network of the critic optimizes the updated parameters by minimizing the loss function of each agent, the loss function being calculated as follows:
[0016]
[0017] $$y_i = r_i + \gamma Q'(S', a'_1, \ldots, a'_N, \theta^{Q'}) \tag{2}$$
[0018] Where: $a_1, a_2, \ldots, a_N$ are the actions of the $N$ agents; $r_i$ is the reward value; $y_i$ is the target Q value; $\gamma$ is the reward discount coefficient.
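As a minimal sketch of how formula (2) feeds the critic update, the Python below computes the target value for a small batch and then a loss. Formula (1) is not reproduced in this text, so the mean squared error used here is an assumption (the usual choice in DDPG-style critics), and the function names and batch values are hypothetical.

```python
def td_target(reward, next_q, gamma):
    """Formula (2): y_i = r_i + gamma * Q'(S', a'_1, ..., a'_N)."""
    return reward + gamma * next_q


def critic_loss(q_values, targets):
    """Mean squared TD error over a minibatch (assumed form of the loss)."""
    squared = [(y - q) ** 2 for q, y in zip(q_values, targets)]
    return sum(squared) / len(squared)


# Hypothetical minibatch of two transitions for one agent.
rewards = [1.0, 0.5]
next_qs = [2.0, 4.0]      # target critic evaluated at (S', a'_1, ..., a'_N)
current_qs = [2.0, 2.0]   # current critic evaluated at (S, a_1, ..., a_N)

targets = [td_target(r, q, gamma=0.5) for r, q in zip(rewards, next_qs)]
loss = critic_loss(current_qs, targets)
```

Because the target critic is evaluated on the next actions of all N agents, each agent's value estimate accounts for the other agent's policy during training.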
[0019] Furthermore, the PE-MA4DPG algorithm employs a distributed multi-agent training framework, comprising five modules: population space, explorer, pioneer, demonstrator, and common experience pool.
[0020] Furthermore, the population space module is the living environment of the population, and each population space includes two agents, which may be any pair drawn from explorers, pioneers, and demonstrators.
[0021] The environments in different population spaces are the same but independent of each other. Two agents are trained in these environments to obtain richer samples. The two agents in each population space represent the hydrogen flow controller and the air flow controller, respectively.
[0022] Furthermore, the explorer module has a complete intelligent agent structure. Different explorers adopt different exploration principles to improve sample diversity. Different explorers explore in different population spaces to obtain more samples to be put into the common experience pool.
[0023] The Pioneer module includes a SAC algorithm agent, which comprehensively explores the environment using a maximum entropy exploration strategy.
[0024] The demonstrator module includes conventional hydrogen flow controllers and air flow controllers whose parameters have been tuned to achieve outstanding control performance. These conventional hydrogen flow controllers and air flow controllers interact in different population spaces and their corresponding environments to create high-value demonstration samples, which are placed in a common experience pool to guide explorers in learning.
[0025] The common experience pool module includes two common experience pools, which respectively store exploration samples collected by pioneers and explorers and demonstration samples collected by demonstrators.
[0026] Furthermore, by employing artificially designed load current conditions across different episodes, multiple agents in different population spaces can gradually learn the corresponding control strategies from simple to complex. The variation of load current with each episode is as follows:
[0027]
[0028] Where $\Delta I_{st}$ is the difference in load current;
[0029] Different network models were used in the actor networks of different explorers;
[0030] The explorers in population spaces 1-2 adopt an ε-greedy strategy and are named ε-explorers; their exploration actions are as follows:
[0032]
[0033] Where $a_l^{\varepsilon}$ is the action of the $l$-th explorer, and the remaining symbols denote, respectively, the policy function of the $l$-th explorer and a random action;
[0034] Explorers in population spaces 3-4 use an OU (Ornstein-Uhlenbeck) noise exploration strategy and are called OU explorers; their exploration actions are as follows:
[0035]
[0036] Where the symbols denote, respectively, the action of the $j$-th explorer, the policy function of the $j$-th explorer, and the OU noise;
[0037] In population spaces 5-8, explorers use a Gaussian noise exploration strategy and are therefore called Gaussian explorers; their exploration actions are as follows:
[0038]
[0039] Where the symbols denote, respectively, the action of the $m$-th explorer, the policy function of the $m$-th explorer, and the Gaussian noise.
[0040] Furthermore, the hydrogen agent outputs the corresponding hydrogen flow rate value, and the air agent outputs the corresponding air compressor motor voltage value, thereby achieving distributed optimal coordinated control.
[0041] The control interval for both the hydrogen and air agents is 0.01s, ensuring that the output voltage and stack temperature of the solid oxide fuel cell reach ideal values, and that the constraints on oxygen excess rate and fuel utilization rate are met, thereby guaranteeing the normal operation of the system. The overall objective function is as follows:
[0042]
[0043] Where $F(t)$ is the objective function, $e_v$ is the error of the output voltage, $e_T$ is the error of the stack temperature, $\lambda$ is the excess oxygen rate, and $\rho$ is the fuel utilization rate.
[0044] Compared with the prior art, the present invention has the following technical advantages:
[0045] 1. This invention proposes a coordinated control model for the gas supply system of a 5kW solid oxide fuel cell that simultaneously considers air flow rate, hydrogen flow rate, and their interaction, which is key to solving existing technical problems.
[0046] 2. This invention proposes a data-driven coordinated control method for a gas supply system. Compared with other control strategies, this control strategy can achieve coordinated control of hydrogen and air.
[0047] 3. This invention proposes a multi-agent twin-delayed deep deterministic policy gradient algorithm based on population evolution (PE-MA4DPG). This algorithm incorporates the survival-of-the-fittest mechanism and population evolution mechanism from evolutionary biology. Different population combinations are set during pre-learning, and by combining imitation learning and curriculum learning, agents of different combinations can be fully trained in different environments. Furthermore, inefficiently trained populations are periodically eliminated, ultimately improving the robustness of the coordinated strategy.
[0048] 4. The proposed algorithm can improve the control performance of output power, voltage and stack temperature by formulating reasonable control strategies, while preventing the constraints from being violated.
Attached Figure Description
[0049] Figure 1 is a flowchart of a coordinated control method for a solid oxide fuel cell gas supply system based on distributed deep reinforcement learning.
[0050] Figure 2 is a schematic diagram of the coordinated control of different intelligent agents in this embodiment.
Detailed Implementation
[0051] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. Component models, material names, connection structures, control methods, algorithms, and other features not explicitly described in this technical solution are considered common technical features disclosed in the prior art.
[0052] The coordinated control method for the gas supply system of a solid oxide fuel cell based on distributed deep reinforcement learning in this technical solution includes the following steps:
[0053] 1) Offline training: Before formal application, the distributed deep reinforcement learning algorithm is trained offline for the solid oxide fuel cell gas supply system. Two agents are set up, named the hydrogen agent and the air agent, to control the flow rates of hydrogen and air entering the solid oxide fuel cell, respectively. The algorithm is trained by a centralized learning and distributed execution method to ensure that the two agents can consider each other's strategies during training. Explorer, pioneer, and demonstrator exploration units are introduced in the training to obtain better adaptability and robustness, finally yielding a highly robust coordinated control strategy.
[0054] 2) Online application: When the algorithm is formally applied, the hydrogen agent detects the hydrogen flow rate and output voltage of the SOFC, and the air agent detects the air flow rate and output voltage of the SOFC. The oxygen flow rate is controlled by adjusting the voltage of the air compressor motor. Each agent makes decisions based on its own sensor status. That is, the distributed optimal coordinated control of the solid oxide fuel cell gas supply system is realized by adopting a decentralized execution method.
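The decentralized execution in step 2) can be sketched in Python as follows. This is an illustration only: the two `*_actor` functions are hypothetical proportional rules standing in for the trained actor networks, all quantities are assumed to be normalized to [0, 1], and `V_REF` is an invented set point.

```python
V_REF = 1.0  # hypothetical normalized voltage set point


def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def hydrogen_actor(obs):
    """Stand-in for the trained hydrogen actor: returns a hydrogen flow command."""
    hydrogen_flow, output_voltage = obs
    return clip(hydrogen_flow + 0.1 * (V_REF - output_voltage))


def air_actor(obs):
    """Stand-in for the trained air actor: returns a compressor motor voltage command."""
    air_flow, output_voltage = obs
    return clip(air_flow + 0.2 * (V_REF - output_voltage))


def control_step(hydrogen_obs, air_obs):
    """Decentralized execution: each actor sees only its own sensor readings."""
    return hydrogen_actor(hydrogen_obs), air_actor(air_obs)


CONTROL_INTERVAL_S = 0.01  # control interval stated in the description
h_cmd, a_cmd = control_step((0.5, 0.8), (0.4, 0.8))
```

The point of the structure is that `control_step` passes no shared state between the two actors; coordination comes only from what each learned during centralized training.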
[0055] The PE-MA4DPG algorithm in this technical solution is trained using a centralized learning, distributed execution method. In online applications, each agent consists of one actor network and two critic networks. During offline training, the PE-MA4DPG algorithm, using the actor-critic architecture of the DDPG algorithm, can select appropriate actions from a continuous action space. The PE-MA4DPG algorithm mainly consists of two networks: a policy network and a value function network. The policy network comprises the current and target networks of the actor network, while the value function network comprises the current and target networks of the critics. Each agent's actor network receives not only its own state information but also the action state information of all agents, thus enabling centralized training. This is equivalent to each agent establishing a centralized critic network and providing a corresponding value function, mitigating the problem of environmental instability. However, the actor network only needs to collect local information, achieving distributed coordination control.
[0056] The critic's current network optimizes and updates its parameters by minimizing the loss function for each agent. The loss function is calculated as follows:
[0057]
[0058] $$y_i = r_i + \gamma Q'(S', a'_1, \ldots, a'_N, \theta^{Q'}) \tag{2}$$
[0059] Where: $a_1, a_2, \ldots, a_N$ are the actions of the $N$ agents; $r_i$ is the reward value; $y_i$ is the target Q value; $\gamma$ is the reward discount coefficient. The current actor network updates its parameters through backpropagation of the neural network's gradient. The gradient calculation formula is:
[0060] The distributed multi-agent training framework in PE-MA4DPG includes the following modules: population space, explorers, pioneers, demonstrators, and a common experience pool.
[0061] (1) Role
[0062] 1) Population Space: The population space refers to the living environment of a particular population. Each population space includes two agents, which can be any pair of explorers, pioneers, or demonstrators. The environments in different population spaces are the same but independent of each other. The two agents undergo intensive training in these environments to obtain richer samples. This embodiment sets up 24 population spaces, and the two agents in each population space represent a hydrogen controller (hydrogen flow control) and an air controller (air flow control), respectively. The complexity of the environment increases with the number of episodes, i.e., a curriculum learning guidance strategy is adopted.
[0063] 2) Explorers: Each explorer has a complete agent structure. Different explorers adopt different exploration principles, just as different populations adopt different survival strategies in the process of biological evolution, to increase sample diversity. Different explorers explore in different population spaces to obtain more samples to put into the common experience pool, that is, they adopt a multi-population exploration strategy.
[0064] 3) Pioneers: Pioneers include SAC algorithm agents, which comprehensively explore the environment through the maximum entropy exploration strategy. Compared with explorers, pioneers using the maximum entropy exploration strategy can effectively explore more and richer samples, thus becoming part of a multi-population exploration strategy.
[0065] 4) Demonstrators: Demonstrators include conventional controllers whose parameters have been tuned and which can achieve outstanding control performance. These conventional controllers interact in different population spaces and their corresponding environments to create high-value demonstration samples and put them into the common experience pool to guide explorers to learn (imitation learning guidance strategy).
[0066] 5) Common Experience Pools: Two common experience pools are used to store exploration samples and demonstration samples, respectively. A classified replay mechanism is adopted to improve the training efficiency of the algorithm; this is the curriculum learning guidance and classified experience replay strategy.
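A rough Python sketch of the two common experience pools with classified replay is given below. The class name, the `demo_prob` mixing probability, and the fallback behaviour when one pool is empty are assumptions for illustration; the description only states that samples are drawn from the different pools according to probability.

```python
import random


class ClassifiedReplay:
    """Two pools: exploration samples (pool 1) and demonstration samples (pool 2)."""

    def __init__(self, demo_prob=0.3, seed=0):
        self.exploration = []
        self.demonstration = []
        self.demo_prob = demo_prob  # hypothetical probability of drawing a demonstration
        self.rng = random.Random(seed)

    def add(self, sample, from_demonstrator=False):
        pool = self.demonstration if from_demonstrator else self.exploration
        pool.append(sample)

    def sample(self, batch_size):
        if not self.exploration and not self.demonstration:
            raise ValueError("both experience pools are empty")
        batch = []
        for _ in range(batch_size):
            want_demo = self.rng.random() < self.demo_prob
            # Fall back to the other pool when the chosen one is empty.
            if (want_demo and self.demonstration) or not self.exploration:
                pool = self.demonstration
            else:
                pool = self.exploration
            batch.append(self.rng.choice(pool))
        return batch


replay = ClassifiedReplay(demo_prob=0.3)
for i in range(10):
    replay.add(("explore", i))
for i in range(5):
    replay.add(("demo", i), from_demonstrator=True)
batch = replay.sample(32)
```

Keeping the pools separate lets the share of demonstration samples in each batch be set independently of how many exploration samples have accumulated.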
[0067] (2) Process Overview
[0068] 1) Explorers and pioneers in each population space explore their own environment in parallel, interact with the environment according to the policy function, and add the actions of other agents to generate samples and put them into the common experience pool 1.
[0069] 2) Each demonstrator in the population space interacts with the environment according to its own controller and adds the actions of other agents to the sample in a unified manner. The generated expert samples are put into the common experience pool 2.
[0070] 3) Each explorer employs a classification replay mechanism, collecting samples from different experience replay pools according to probability for learning and updating its own parameters.
[0071] 4) Pioneers collect samples from their own experience pool to update their own parameters.
[0072] 5) Every 1000 episodes, the total average reward value of agents in all population spaces is calculated. The best population space is retained and the other population spaces are deleted. The agents in the best population space are copied and put into other population spaces for training. This step is called extinction. Keeping the best agents for the next training is to save computing power and improve the final efficiency.
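The extinction step in item 5) can be sketched in Python as below. The dictionary layout of a population space and the parameter lists are hypothetical placeholders for real agent objects; only the keep-the-best-and-copy logic and the 1000-episode period come from the description.

```python
import copy


def extinction(population_spaces, average_rewards):
    """Keep the best population space and overwrite the others with copies of it."""
    best = max(range(len(population_spaces)), key=lambda i: average_rewards[i])
    winners = population_spaces[best]
    refreshed = [winners if i == best else copy.deepcopy(winners)
                 for i in range(len(population_spaces))]
    return refreshed, best


EXTINCTION_PERIOD = 1000  # episodes, as stated in step 5)

# Hypothetical population spaces, each holding placeholder agent parameters.
spaces = [
    {"hydrogen": [0.1], "air": [0.2]},
    {"hydrogen": [0.9], "air": [0.8]},
    {"hydrogen": [0.5], "air": [0.5]},
]
rewards = [-5.0, -1.0, -3.0]  # total average reward per population space
episode = 1000
if episode % EXTINCTION_PERIOD == 0:
    spaces, best_index = extinction(spaces, rewards)
```

Deep copies are used so that the replicated agents can diverge again during subsequent training instead of sharing parameters.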
[0073] (3) Population space and curriculum guidance strategies
[0074] This technical solution includes 24 population spaces. Drawing inspiration from curriculum learning, it uses artificially designed load current conditions across different episodes to allow multiple agents in different population spaces to gradually learn the corresponding control strategies from simple to complex. The load current variation with the episodes is shown below:
[0075]
[0076] Where $\Delta I_{st}$ is the difference in load current.
[0077] The above formula shows that the maximum control error in the population space increases gradually from small to large according to the episodes, enabling the agent to learn from simple control tasks first, and then gradually learn from load conditions with longer control times and more complex control strategies.
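The load-current formula itself is not reproduced in this text, so the Python below is only a sketch of the easy-to-hard idea described above, under the assumption of a linear ramp. The function names, the ramp shape, and the step bounds (`step_min`, `step_max`, in amperes) are all hypothetical and are not the patent's formula (3).

```python
import random


def max_load_step(episode, total_episodes, step_min=1.0, step_max=10.0):
    """Largest allowed load-current change, growing linearly with the episode index."""
    frac = min(max(episode / total_episodes, 0.0), 1.0)
    return step_min + frac * (step_max - step_min)


def sample_load_step(episode, total_episodes, rng):
    """Draw a load-current change within the current curriculum bound."""
    bound = max_load_step(episode, total_episodes)
    return rng.uniform(-bound, bound)


rng = random.Random(0)
early = sample_load_step(0, 100, rng)    # small disturbance early in training
late = sample_load_step(100, 100, rng)   # up to the full range at the end
```

Early episodes therefore present small load disturbances, and later episodes expose the agents to the full range.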
[0078] (4) Multiple population exploration strategies
[0079] Different network models are used for the actor networks of the different explorers. The explorers in population spaces 1-2 adopt an ε-greedy strategy and are named ε-explorers; their exploration actions are as follows:
[0081]
[0082] Where $a_l^{\varepsilon}$ is the action of the $l$-th explorer, and the remaining symbols denote, respectively, the policy function of the $l$-th explorer and a random action.
[0083] Explorers in population spaces 3-4 use an OU (Ornstein-Uhlenbeck) noise exploration strategy and are called OU explorers; their exploration actions are as follows:
[0084]
[0085] Where the symbols denote, respectively, the action of the $j$-th explorer, the policy function of the $j$-th explorer, and the OU noise.
[0086] In population spaces 5-8, explorers use a Gaussian noise exploration strategy; therefore, these explorers are called Gaussian explorers. The exploration actions are as follows:
[0087]
[0088] Where the symbols denote, respectively, the action of the $m$-th explorer, the policy function of the $m$-th explorer, and the Gaussian noise.
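The three exploration styles named above can be sketched in Python as follows. The exact exploration formulas are not reproduced in this text, so the forms below are the textbook ones and should be read as assumptions: ε-greedy replaces the policy action with a uniformly random one with probability ε, while the OU and Gaussian explorers add noise to the policy action. All parameter values are hypothetical.

```python
import math
import random


def clip(x, low, high):
    return max(low, min(high, x))


def epsilon_action(policy_action, epsilon, low, high, rng):
    """With probability epsilon take a uniformly random action, else follow the policy."""
    if rng.random() < epsilon:
        return rng.uniform(low, high)
    return policy_action


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process reverting to zero."""

    def __init__(self, rng, theta=0.15, sigma=0.2, dt=0.01):
        self.rng, self.theta, self.sigma, self.dt = rng, theta, sigma, dt
        self.state = 0.0

    def sample(self):
        drift = -self.theta * self.state * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.state += drift + diffusion
        return self.state


def ou_action(policy_action, noise, low, high):
    """Policy action plus temporally correlated OU noise, clipped to the action range."""
    return clip(policy_action + noise.sample(), low, high)


def gaussian_action(policy_action, sigma, low, high, rng):
    """Policy action plus independent Gaussian noise, clipped to the action range."""
    return clip(policy_action + rng.gauss(0.0, sigma), low, high)


rng = random.Random(0)
noise = OUNoise(rng)
actions = [
    epsilon_action(0.5, epsilon=0.1, low=0.0, high=1.0, rng=rng),
    ou_action(0.5, noise, 0.0, 1.0),
    gaussian_action(0.5, sigma=0.05, low=0.0, high=1.0, rng=rng),
]
```

OU noise is correlated from one step to the next, which tends to produce smoother exploratory flow commands than independent Gaussian noise at a short control interval.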
[0089] The PE-MA4DPG algorithm is trained using a centralized learning and distributed execution method. In online applications, the coordinated control framework proposed in this technical solution is shown in Figure 1. The coordinated control strategy in this technical solution must meet the following requirements: 1) It should be able to consider the nonlinear characteristics of the SOFC and have good robustness. 2) It should be able to simultaneously consider the interaction between air flow rate and hydrogen flow rate and their effects on output power, voltage and stack temperature. 3) By formulating a reasonable control strategy, it should improve the control performance of output power, voltage and stack temperature while preventing the constraints from being violated.
[0090] The coordinated control model includes a hydrogen agent and an air agent, and the two controllers mentioned above are equivalent to two agents. A centralized training and decentralized execution strategy is used to enable the two agents to coordinate with each other.
[0091] In online application of the algorithm, each agent only needs to receive the status of its own sensors (the hydrogen flow sensor, located at the anode of the fuel cell, and the air flow sensor, located at the cathode of the fuel cell) to make decisions. The hydrogen agent outputs the corresponding hydrogen flow rate, and the air agent outputs the corresponding air compressor motor voltage, thereby achieving distributed optimal coordinated control. The control interval of the agents is 0.01s. The control objective of the coordinated control strategy is to make the SOFC output voltage and stack temperature reach ideal values, and to satisfy the constraints of excess oxygen rate and fuel utilization rate, thereby ensuring the normal operation of the system. Its overall objective function is as follows:
[0092]
[0093] Where $F(t)$ is the objective function, $e_v$ is the error of the output voltage, $e_T$ is the error of the stack temperature, $\lambda$ is the excess oxygen rate, and $\rho$ is the fuel utilization rate.
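The objective formula is not reproduced in this text, so the Python below is a sketch of one plausible shape for it rather than the patent's formula (7). It assumes a weighted sum of squared tracking errors plus a linear penalty for leaving the constraint ranges given in the background section (fuel utilization 0.7 to 0.9, excess oxygen rate 8 to 11). The weights `w_v`, `w_t` and the `penalty` factor are hypothetical.

```python
FUEL_UTILIZATION_RANGE = (0.7, 0.9)   # constraint range from the background section
EXCESS_OXYGEN_RANGE = (8.0, 11.0)     # constraint range from the background section


def violation(value, bounds):
    """Distance by which a value falls outside its allowed range (0 if inside)."""
    low, high = bounds
    return max(low - value, 0.0, value - high)


def objective(e_v, e_t, excess_oxygen, fuel_utilization,
              w_v=1.0, w_t=1.0, penalty=10.0):
    """Assumed objective: negative tracking cost minus a constraint penalty."""
    tracking = w_v * e_v ** 2 + w_t * e_t ** 2
    constraint = (violation(excess_oxygen, EXCESS_OXYGEN_RANGE)
                  + violation(fuel_utilization, FUEL_UTILIZATION_RANGE))
    return -(tracking + penalty * constraint)


inside = objective(e_v=1.0, e_t=2.0, excess_oxygen=9.0, fuel_utilization=0.8)
outside = objective(e_v=0.0, e_t=0.0, excess_oxygen=12.0, fuel_utilization=0.8)
```

With this shape, an agent is rewarded for driving $e_v$ and $e_T$ toward zero and is penalized in proportion to how far $\lambda$ or $\rho$ leaves its range.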
[0094] The above description of the embodiments is provided to enable those skilled in the art to understand and use the invention. It will be apparent to those skilled in the art that various modifications can be made to these embodiments, and the general principles described herein can be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments, and any improvements and modifications made by those skilled in the art based on the disclosure of the present invention without departing from the scope of the invention should be within the protection scope of the present invention.
Claims
1. A coordinated control method for a solid oxide fuel cell gas supply system, characterized in that it includes the following steps:
S1: Offline training: Two agents are set up, namely a hydrogen agent and an air agent. The hydrogen agent and the air agent are used to control the flow rate of hydrogen and air entering the solid oxide fuel cell, respectively. Then, the agents are trained by centralized learning and distributed execution to ensure that the two agents can consider each other's strategies during training. An exploration unit is introduced during training to improve the adaptive ability and robustness, and finally a coordinated control strategy model is obtained.
S2: Online application: Based on the trained coordinated control strategy model, the hydrogen agent detects the hydrogen flow and output voltage of the solid oxide fuel cell, and the air agent controls the oxygen flow by adjusting the voltage of the air compressor motor. Each agent executes decisions based on its own sensor status to make the output voltage and stack temperature of the solid oxide fuel cell reach the preset ideal values.
Both the hydrogen agent and the air agent include one actor network and two critic networks;
In S1, the PE-MA4DPG algorithm is used for offline training. The PE-MA4DPG algorithm builds on the DDPG algorithm and adopts an actor-critic architecture to select appropriate actions in the continuous action space.
In S1, the PE-MA4DPG algorithm includes a policy network and a value function network; the policy network consists of the actor's current network and the actor's target network; the value function network consists of the critic's current network and the critic's target network;
The input to the actor network for each agent includes the action state information of all agents, which is used for centralized training. This enables each agent to establish a centralized critic network and provide a corresponding value function, thus mitigating the problem of environmental instability.
The population space module is the living environment of the population. Each population space includes two agents, which can be any two of the explorers, pioneers, and demonstrators. The environments in different population spaces are the same but independent of each other. Two agents are trained in these environments to obtain richer samples. The two agents in each population space represent the hydrogen flow controller and the air flow controller, respectively.
The explorer module has a complete intelligent agent structure. Different explorers adopt different exploration principles to improve sample diversity. Different explorers explore in different population spaces to obtain more samples to be put into the common experience pool.
The pioneer module includes a SAC algorithm agent, which comprehensively explores the environment using a maximum entropy exploration strategy.
The demonstrator module includes conventional hydrogen flow controllers and air flow controllers whose parameters have been tuned to achieve outstanding control performance. These conventional hydrogen flow controllers and air flow controllers interact in different population spaces and their corresponding environments to create high-value demonstration samples, which are placed in a common experience pool to guide explorers in learning.
The common experience pool module includes two common experience pools, which respectively store exploration samples collected by pioneers and explorers and demonstration samples collected by demonstrators.
2. The coordinated control method for a solid oxide fuel cell gas supply system according to claim 1, characterized in that the critic's current network optimizes and updates its parameters by minimizing the loss function for each agent. The loss function is calculated according to formulas (1) and (2), where formula (2) is: $$y_i = r_i + \gamma Q'(S', a'_1, \ldots, a'_N, \theta^{Q'}) \tag{2}$$ In the formula: $a_1, a_2, \ldots, a_N$ are the actions of the $N$ agents; $r_i$ is the reward value; $y_i$ is the target Q value; $\gamma$ is the reward discount coefficient.
3. The coordinated control method for a solid oxide fuel cell gas supply system according to claim 1, characterized in that, The PE-MA4DPG algorithm employs a distributed multi-agent training framework, comprising five modules: population space, explorer, pioneer, demonstrator, and common experience pool.
4. The coordinated control method for a solid oxide fuel cell gas supply system according to claim 1, characterized in that artificially designed load current conditions are used across different episodes to enable multiple agents in different population spaces to gradually learn the corresponding control strategies from simple to complex. The load current variation with the episodes is as follows: (3) where $\Delta I_{st}$ is the difference in load current; Different network models are used in the actor networks of different explorers; The explorers in population spaces 1-2 adopt an ε-greedy strategy and are named ε-explorers, and their exploration actions are as follows: (4) where $a_l^{\varepsilon}$ is the action of the $l$-th explorer, and the remaining symbols denote, respectively, the policy function of the $l$-th explorer and a random action; Explorers in population spaces 3-4 use an OU (Ornstein-Uhlenbeck) noise exploration strategy and are called OU explorers, and their exploration actions are as follows: (5) where the symbols denote, respectively, the action of the $j$-th explorer, the policy function of the $j$-th explorer, and the OU noise; In population spaces 5-8, explorers use a Gaussian noise exploration strategy and are therefore called Gaussian explorers, and their exploration actions are as follows: (6) where the symbols denote, respectively, the action of the $m$-th explorer, the policy function of the $m$-th explorer, and the Gaussian noise.
5. The coordinated control method for a solid oxide fuel cell gas supply system according to claim 1, characterized in that the hydrogen agent outputs the corresponding hydrogen flow rate value, and the air agent outputs the corresponding air compressor motor voltage value, thereby achieving distributed optimal coordinated control. The control interval for both the hydrogen and air agents is 0.01s, ensuring that the output voltage and stack temperature of the solid oxide fuel cell reach ideal values, and that the constraints on excess oxygen rate and fuel utilization rate are met, thereby guaranteeing the normal operation of the system. The overall objective function is as follows: (7) where $F(t)$ is the objective function, $e_v$ is the error of the output voltage, $e_T$ is the error of the stack temperature, $\lambda$ is the excess oxygen rate, and $\rho$ is the fuel utilization rate.