Reinforcement learning intermittent process control method based on improved AC algorithm
By improving the priority sampling and reward function design of the Actor-Critic algorithm and combining it with deep reinforcement learning, a PER-SAC controller was constructed, which solved the multi-time-varying control problem in the penicillin fermentation process and achieved efficient and stable yield improvement.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGNAN UNIV
- Filing Date
- 2023-05-11
- Publication Date
- 2026-06-23
AI Technical Summary
The penicillin fermentation process is difficult to control precisely due to its multiple time-varying characteristics. Traditional control methods rely on process models and are susceptible to sensor or actuator failures. Existing deep reinforcement learning algorithms suffer from overestimation of value functions and hyperparameter sensitivity in large-scale data and long-term control tasks, resulting in unstable control performance.
The Actor-Critic algorithm is improved by prioritizing sampling to enhance data sampling efficiency. A reward function for the sparse reward problem is designed and control action constraints are introduced. By combining deep reinforcement learning and prioritizing sampling, a PER-SAC controller is constructed for the intermittent control of penicillin fermentation.
It achieves self-learning process dynamics without relying on process models, improves penicillin yield and control stability, solves the sparse reward problem, and enhances learning efficiency and control effectiveness.
Figure CN116520703B_ABST (abstract drawing)
Description
Technical Field
[0001] This invention relates to a reinforcement learning control method for intermittent (batch) processes based on an improved Actor-Critic (AC) algorithm, belonging to the fields of deep reinforcement learning and intermittent process control.
Background Technology
[0002] Batch processes are one of the most important processing methods in modern industry, with applications spanning numerous fields, including fine chemicals, biopharmaceuticals, and semiconductors. Unlike continuous processes, batch processes exhibit "multiple time-varying" production characteristics, a term made up of four parts. "Multiple" refers to diverse products, with batch processes switching between different products on the same equipment; "repetitive" means that the process of producing the same product is repeated; "time" refers to switching between time periods, with multiple operational steps occurring within the same batch; and "varying" refers to changing indicators, with different control objectives and schemes applied at different stages of operation for different products. As a typical batch process, penicillin fermentation suffers from low penicillin yield because these characteristics make the fermentation conditions difficult to control precisely.
[0003] Traditional model-based control methods, such as model predictive control (MPC) and iterative learning control (ILC), depend heavily on the accuracy of the underlying process model. In practical penicillin fermentation, however, sensor or actuator malfunctions and unknown disturbances can cause mismatches between the actual process and the model, leading to inaccurate predictions of process variables. Developing a control strategy that does not require a process model is therefore of great significance.
[0004] To achieve precise control of the penicillin fermentation process, reinforcement learning (RL), as a self-learning data-driven method, is a good alternative to traditional control methods. Deep reinforcement learning algorithms, such as the deep deterministic policy gradient algorithm and the proximal policy optimization algorithm, have outperformed traditional control methods in some process control applications. However, value function overestimation and hyperparameter sensitivity in these algorithms can lead to unstable control and potentially cause irreversible losses to the production process. The recently proposed Soft Actor-Critic (SAC) algorithm, which combines the maximum entropy framework with a stochastic off-policy scheme, is robust to disturbances and is suitable for intermittent process control tasks. It has already achieved considerable success in tasks such as robot control. However, robot control tasks are typically short in duration and involve relatively small amounts of data. For the control task of the penicillin fermentation process, which features large-scale data and long-term continuous control, targeted improvements to the SAC algorithm are needed. Therefore, it is necessary to develop an RL controller based on the SAC algorithm that meets the specific characteristics and requirements of the penicillin fermentation process.
Summary of the Invention
[0005] In view of the above-mentioned shortcomings and deficiencies of the existing technology, this invention provides a reinforcement learning intermittent process control method based on an improved AC (Actor-Critic) algorithm. For penicillin fermentation processes with long batch times and significant nonlinear characteristics, the random sampling method in the experience replay mechanism of the SAC algorithm is improved to priority sampling, thereby enhancing the sampling efficiency of the SAC algorithm for numerous historical process data and achieving a more robust control effect. Simultaneously, to address the sparse reward problem, a novel reward function is designed that not only considers product quality deviation but also introduces control action constraints to guide the controller in updating its strategy, thus improving penicillin yield.
[0006] A reinforcement learning-based intermittent process control method based on an improved Actor-Critic algorithm, applied to a penicillin fermentation process, comprising:
[0007] Step 1: The penicillin fermentation process control task is modeled as an optimal control model based on a Markov decision process. This optimal control model includes a controller and actuators. The controller outputs a control signal based on the current state information of the penicillin fermentation process, determines the immediate reward $r_t$ from the deviation between the next state value monitored by the sensors and the expected process trajectory, and adjusts the control strategy based on $r_t$; the actuator controls the penicillin fermentation process according to the control signal output by the controller; the expected process trajectory refers to the control target of the penicillin fermentation process.
[0008] Step 2: Initialize the controller parameters in the optimal control model. The controller includes an actor module, a critic module, and an experience replay module. The actor module includes an actor network for receiving process states and outputting control actions. The critic module includes two critic networks and two target critic networks. The critic networks receive the next state value and reward signal from the environment so that the controller can update the control measurements, while the target critic networks receive historical data replayed in the experience pool. The experience replay module stores historical data of each interaction with the penicillin fermentation process in the experience replay pool and replays the data to the target critic networks using priority sampling when parameters are updated.
[0009] Step 3: After initializing the parameters, the actor network in the controller outputs control actions based on the state information of the penicillin fermentation process at each moment;
[0010] Step 4: The actuator adjusts the control variables online and interacts with the penicillin fermentation process according to the control actions in Step 3; at the next sampling time, the changes of each process variable are observed by the sensors as the next state vector $s_{t+1}$, which is returned to the actor network of the controller to determine the next control action.
[0011] Step 5: Calculate the immediate reward $r_t$ based on the deviation between the actual state value and the expected value at the next moment, and return it to the critic network of the controller, which iterates to the optimal policy using a soft policy iteration method.
[0012] Step 6: Store the interaction sample of each time step in the transition experience pool $D$. These samples are used as historical data to update the network parameters. Each sample contains the current state, the action, the reward, and the state at the next time step, i.e., $(s_t, a_t, r_t, s_{t+1})$;
[0013] Step 7: Once the number of samples in the experience pool reaches the set number, the TD error of each sample is used as the measure of its replay importance. A fixed number of transition samples are drawn by priority sampling as historical data to be replayed to the target critic network, i.e., $[(s_1, a_1, r_1, s_2), \ldots, (s_i, a_i, r_i, s_{i+1}), \ldots, (s_n, a_n, r_n, s_{n+1})]$.
[0014] Step 8: Using the experience replay samples from the priority sampling in Step 7, calculate the deviation between the target critic network's output and the critic network's output, and update the critic network parameters from Step 2 using the mini-batch gradient ascent method.
[0015] Step 9: At each sampling moment in the penicillin fermentation process, simultaneously perform the online control process of Steps 3 to 5 and the experience replay process of Steps 6 to 8 until the current batch of fermentation is completed. Determine whether the final quality indicators meet the requirements. If they do, end the control and output the control strategy; otherwise, save the network model and continue the control for the next batch.
[0016] Optionally, the optimal control model P(·) in step 1 is:
[0017]
[0018] Where $t$ is the sampling time, $k$ is the current batch, $T$ is the termination time of the current batch, and $j$ denotes a historical batch that previously interacted with the intermittent process; $s_t \in S$ is the state vector at time $t$, and $S$ represents the state space $S := \{s_0, s_1, s_2, \ldots\}$. The state information of the penicillin fermentation process is reflected by the process variables at each time step, i.e., the sampled values of each process variable observed by each sensor in the fermentation equipment at each sampling time; $a_t \in A$ is the control signal vector output by the controller, and $A$ represents the action space $A := \{a_0, a_1, a_2, \ldots\}$; $d_t \in \mathbb{R}^{n_d}$ is the process disturbance vector, where $\mathbb{R}^{n_d}$ is the real space of dimension $n_d$ containing all process disturbance information; $f(\cdot)$ represents the nonlinear dynamic characteristics of the process;
[0019] The control objective of the method is to find, under unknown disturbances, an optimal policy $\pi^*(s,a)$ that maps the state space $S := \{s_0, s_1, s_2, \ldots\}$ to the action space $A := \{a_0, a_1, a_2, \ldots\}$ so as to achieve the expected control requirements; the objective function in the optimal control model $P(\cdot)$ is used to measure the control effectiveness of the penicillin fermentation process:
[0020]
[0021] Here, $E_k[\cdot]$ represents the expected Markov reward that the current batch $k$ can obtain, and $\gamma$ is the discount factor for future rewards.
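As a small numerical illustration of the objective, the sketch below computes the discounted sum of rewards for one batch. It assumes the common form $\sum_{t=0}^{T} \gamma^t r_t$; the formula for paragraph [0020] is not reproduced in this text, so that form is an assumption, and the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards over one batch: sum_t gamma**t * r_t.

    ASSUMPTION: standard discounted-return form; the source formula is not shown.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With $0 < \gamma < 1$, rewards late in a long fermentation batch contribute less than early ones, which keeps the sum bounded for long horizons.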
[0022] Optionally, the reward function $r_t$ considers both product quality deviation and control action constraints, and the controller is effectively trained by increasing the number of effective reward samples, wherein the reward function $r_t$ is:
[0023]
[0024] Here, the first term is an indicator on the process product concentration, in which the expected product quality is compared with the actual product quality $y_t$ at time $t$ of the intermittent process, and $\delta_1$ is the allowable quality deviation. The second term is a path constraint on the control signal $a_t$, in which $u_t$ represents the sampled value of the control signal in a normal batch and $\delta_2$ represents the allowable fluctuation range of the control signal. At each moment, when the controller receives a positive reward from the environment, the action is encouraged; otherwise, the action is suppressed.
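The idea of a two-term reward can be sketched as follows. The exact expression for paragraph [0023] is not available in this text, so the sketch makes a loud assumption: each term is +1 inside its tolerance band and -1 outside. The function name `guided_reward` and the argument names are hypothetical.

```python
def guided_reward(y_ref, y, a, u_nominal, delta1, delta2):
    """Illustrative reward: quality-deviation term plus action path-constraint term.

    ASSUMPTION: each term is +1 inside its tolerance band and -1 outside.
    The patent's exact expression is not available in this text.
    """
    # Quality term: actual quality y within delta1 of the expected quality y_ref.
    quality_term = 1.0 if abs(y_ref - y) <= delta1 else -1.0
    # Path-constraint term: action a within delta2 of the normal-batch signal u_nominal.
    action_term = 1.0 if abs(a - u_nominal) <= delta2 else -1.0
    return quality_term + action_term
```

Because the action term is evaluated at every sampling time, the controller receives an informative signal throughout the batch rather than only at its end, which is the stated purpose of the constraint term.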
[0025] Optionally, in step 3 the actor network outputs control actions from the received process state according to the following formula:
[0026] $a_t = \tanh(\mu_\theta(s_t) + \sigma_\theta(s_t) \odot \zeta), \quad \zeta \sim N(0, I)$ (7)
[0027] The actor network $a_t \sim \pi_\theta(a_t \mid s_t)$, parameterized by $\theta$, has two output nodes that together describe the probability density distribution of the output action: one gives its mean $\mu_\theta(s_t)$ and the other its standard deviation $\sigma_\theta(s_t)$.
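Equation (7) can be illustrated with a short NumPy sketch. The mean and standard deviation would normally come from the actor network; here they are passed in directly, and the function name `sample_action` is hypothetical.

```python
import numpy as np

def sample_action(mu, sigma, rng):
    """Reparameterized, tanh-squashed Gaussian action as in equation (7)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    zeta = rng.standard_normal(mu.shape)   # zeta ~ N(0, I)
    return np.tanh(mu + sigma * zeta)      # element-wise product, then squash

rng = np.random.default_rng(0)
action = sample_action(mu=[0.2, -1.0], sigma=[0.5, 0.1], rng=rng)
```

The squashed action lies in (-1, 1); in practice it would presumably be rescaled to the actuator range (for example the substrate feed rate limits), which this text does not specify.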
[0028] Optionally, the expression for the optimal strategy in step 5 is:
[0029]
[0030] Where $\alpha$ is the temperature coefficient balancing policy entropy and reward importance in the objective function, and $H(\pi(\cdot \mid s_t))$ is the entropy of the current strategy;
[0031] Wherein, the policy entropy $H(\pi(\cdot \mid s_t))$ is represented as:
[0032] $H(\pi(\cdot \mid s_t)) = E_{a_t \sim \pi}[-\log \pi(a_t \mid s_t)]$ (6)
[0033] Where $a_t$ is the action output by the actor network $\pi(s,a)$ in the current state $s_t$.
[0034] Optionally, the soft policy iteration method in step 5 is performed in the following manner:
[0035] In the policy evaluation step, an experience replay mechanism is used to draw transition samples $(s_i, a_i, r_i, s_{i+1}) \in D$ from the experience replay pool, and the Bellman optimality operator is used to define a soft Q function that guides the algorithm update, as follows:
[0036]
[0037] Where $V(s_{i+1})$ is the soft state value function, i.e.:
[0038]
[0039] Define the sequence $Q_i$ by repeatedly applying the Bellman operator above; then, as $i$ approaches infinity, $Q_i$ converges to the action value function of the optimal policy $\pi^*(s,a)$.
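The policy evaluation step can be sketched numerically. The formulas for paragraphs [0036] and [0038] are not reproduced in this text, so the sketch assumes the standard SAC forms: a soft value $V(s_{i+1}) = Q(s_{i+1}, a_{i+1}) - \alpha \log \pi(a_{i+1} \mid s_{i+1})$ estimated from one sampled action, and a backup $r_i + \gamma V(s_{i+1})$. Function names and default values are hypothetical, and terminal-state handling is omitted.

```python
import numpy as np

def soft_value(q_next, log_prob_next, alpha):
    """Single-sample estimate of V(s') = Q(s', a') - alpha * log pi(a'|s').

    ASSUMPTION: standard SAC soft value; the source formula is not shown.
    """
    q_next = np.asarray(q_next, dtype=float)
    log_prob_next = np.asarray(log_prob_next, dtype=float)
    return q_next - alpha * log_prob_next

def soft_q_target(reward, q_next, log_prob_next, alpha=0.2, gamma=0.99):
    """Soft Bellman backup r + gamma * V(s') (assumed form, no terminal flag)."""
    reward = np.asarray(reward, dtype=float)
    return reward + gamma * soft_value(q_next, log_prob_next, alpha)
```

For example, `soft_q_target(1.0, 2.0, -1.0, alpha=0.5, gamma=0.9)` evaluates to 1 + 0.9 × (2 + 0.5) = 3.25; the entropy bonus raises the target when the next action is less probable.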
[0040] In the policy improvement step, the policy $\pi(s,a)$ is updated toward the exponential of the new soft Q function using the Kullback-Leibler divergence $D_{KL}(\cdot)$, as follows:
[0041]
[0042] Where $\Pi$ is the feasible set of policy functions, and the normalizing term is a partition function.
[0043] Optionally, the priority sampling of historical data in step 7 is performed in the following manner:
[0044] Calculate the average TD error of the two critic networks:
[0045]
[0046] $Q_{target}(s_{i+1}, a_{i+1})$ is the target action value function, which is approximated by the target critic network in Step 2; for the $i$-th experience sample $(s_t, a_t, r_t, s_{t+1})$, the priority is defined as follows:
[0047] $p_i = |\delta_i| + \varepsilon$ (12)
[0048] Where $\varepsilon$ is the priority smoothing term; after quantizing the priority, the sampling probability of each sample is calculated as follows:
[0049]
[0050] Here, $\xi_1$ determines the degree to which priority affects the sampling probability; when $\xi_1 = 0$, sampling is completely random. The importance sampling (IS) weight of each sample is quantified as follows:
[0051]
[0052] Where $N$ is the number of samples in the experience replay pool, and $\xi_2$ is the hyperparameter for compensating for the non-uniform probability $P(i)$.
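The priority sampling procedure can be sketched as below. Equation (12) is taken from the text; the sampling probability and IS weight formulas are not reproduced here, so the sketch assumes the usual prioritized experience replay forms $P(i) = p_i^{\xi_1} / \sum_k p_k^{\xi_1}$ and $w_i = (1 / (N \cdot P(i)))^{\xi_2}$, which match the described roles of $\xi_1$, $\xi_2$, and $N$. The default values of `xi1` and `xi2`, the max-normalization of the weights, and the function name `per_sample` are assumptions.

```python
import numpy as np

def per_sample(td_errors, batch_size, xi1=0.6, xi2=0.4, eps=0.1, rng=None):
    """Draw a prioritized mini-batch; returns (indices, IS weights, probabilities)."""
    rng = np.random.default_rng() if rng is None else rng
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + eps  # equation (12)
    scaled = priorities ** xi1
    probs = scaled / scaled.sum()               # ASSUMED form of P(i)
    n = len(probs)
    idx = rng.choice(n, size=batch_size, p=probs)
    weights = (1.0 / (n * probs[idx])) ** xi2   # ASSUMED form of the IS weight
    weights /= weights.max()                    # common normalization, also assumed
    return idx, weights, probs
```

Setting `xi1=0` makes every probability equal to 1/N, which reproduces the completely random sampling mentioned in the text.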
[0053] Optionally, in step 8, the actor and critic network parameters are updated with the priority-sampled historical data in the following manner:
[0054] The two critic networks $Q_{w_1}$ and $Q_{w_2}$ are updated as follows:
[0055]
[0056]
[0057]
[0058] Where $w$ represents the parameters of the target critic network, and $Q_{target}(s_i, a_i)$ is the target action-value function approximated by the target critic network.
[0059] The parameter $\theta$ of the actor network in the actor module is optimized as follows:
[0060]
[0061]
[0062] The initial parameters $w$ of the target critic network in the controller are the same as those of the critic network. Subsequent updates use a soft update method, with the same update frequency as the actor network. The update method is as follows:
[0063]
[0064] Where $\tau$ is the soft update coefficient, and the remaining symbol denotes the parameters of the target critic network.
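A soft update of this kind is commonly implemented as Polyak averaging. The formula for paragraph [0063] is not reproduced in this text, so the sketch assumes the standard form, target ← τ · online + (1 − τ) · target; the value of `tau` and the function name `soft_update` are hypothetical.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    ASSUMPTION: standard soft-update form; the source formula is not shown.
    Parameters are given as lists of NumPy arrays of matching shapes.
    """
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

A small `tau` makes the target critic trail the critic slowly, which is the usual way to keep the regression target stable between updates.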
[0065] Optionally, the discount factor $\gamma$ for future rewards lies in the range $0 < \gamma < 1$.
[0066] Optionally, the smoothing term ε for priority is 0.1.
[0067] The beneficial effects of this invention are:
[0068] 1) The reinforcement learning intermittent process control method based on the improved Actor-Critic algorithm proposed in this application combines deep reinforcement learning and priority sampling for intermittent process control. It learns the process dynamics on its own without requiring a prior process model, and is a novel data-driven intermittent process control method.
[0069] 2) This application presents a carefully designed guided reward function for RL controllers, which considers not only product quality deviation but also introduces control action constraints. In intermittent process control focused on improving product quality, the reward signal is generally given by the deviation between the product quality and the expected quality. However, in processes with relatively slow product output rates, the method of only providing corresponding rewards based on product quality deviation at the final stage of each batch results in insufficient effective reward samples in a single scenario, i.e., the sparse reward problem. This makes it difficult for the controller to learn effective information during the process, and may even ultimately fail to obtain a stable control strategy. The introduction of control action constraints increases the number of effective reward samples, solving the sparse reward problem.
[0070] 3) This application replaces the completely random sampling of the experience replay mechanism in the SAC algorithm with priority sampling (PER), which selects by probability the samples that matter most for the policy update. IS weights are applied to the update of the action value function Q, yielding an intermittent process controller based on the Soft Actor-Critic algorithm with priority sampling (PER-SAC). Compared with existing RL controllers, it has higher learning efficiency and can obtain a higher penicillin yield under the same conditions.
Attached Figure Description
[0071] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0072] Figure 1 is a structural diagram of the PER-SAC algorithm provided in this application.
[0073] Figure 2 is an analogy diagram of intermittent process control based on the RL algorithm provided in this application.
[0074] Figure 3 is a structural diagram of the actor and critic networks in the PER-SAC controller provided in this application.
[0075] Figure 4 is a graph of the penicillin concentration of one batch during the penicillin fermentation process under the PER-SAC controller provided in this application.
[0076] Figure 5 is a plot of the reward obtained in one batch during the penicillin fermentation process under the PER-SAC controller provided in this application.
[0077] Figure 6 is a plot of the control action (i.e., substrate feed rate) output by the PER-SAC controller provided in this application during one batch of penicillin fermentation.
Detailed Implementation
[0078] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0079] Example 1:
[0080] This embodiment provides a reinforcement learning intermittent process control method based on an improved Actor-Critic algorithm, applied to the control of penicillin fermentation processes. Referring to Figure 1, the method includes:
[0081] Step 1: The penicillin fermentation process control task is modeled as an optimal control model based on a Markov decision process. This optimal control model includes a controller and actuators. The controller outputs a control signal based on the current state information of the penicillin fermentation process, determines the immediate reward $r_t$ from the deviation between the next state value monitored by the sensors and the expected process trajectory, and adjusts the control strategy based on $r_t$; the actuator controls the penicillin fermentation process according to the control signal output by the controller; the expected process trajectory refers to the control target of the penicillin fermentation process.
[0082] Step 2: Initialize the controller parameters in the optimal control model. The controller includes an actor module, a critic module, and an experience replay module. The actor module includes an actor network for receiving process states and outputting control actions. The critic module includes two critic networks and two target critic networks. The critic networks receive the next state value and reward signal from the environment to guide the controller in updating control measurements. The target critic networks receive historical data replayed in the experience pool described in Step 7 and update the network parameters using methods such as mini-batch gradient descent. The experience replay module stores historical data from each interaction with the penicillin fermentation process in the experience replay pool and replays the data to the target critic networks using priority sampling during parameter updates.
[0083] Step 3: After initializing the parameters, the actor network in the controller outputs control actions based on the state information of the penicillin fermentation process at each moment;
[0084] Step 4: The actuator adjusts the control variables online and interacts with the penicillin fermentation process according to the control actions in Step 3; at the next sampling time, the changes of each process variable are observed by the sensors as the next state vector $s_{t+1}$, which is returned to the actor network of the controller to determine the next control action.
[0085] Step 5: Calculate the immediate reward $r_t$ based on the deviation between the actual state value and the expected value at the next moment, and return it to the critic network of the controller, which iterates to the optimal policy using a soft policy iteration method.
[0086] Step 6: Store the interaction sample of each time step in the transition experience pool $D$. These samples are used as historical data to update the network parameters. Each sample contains the current state, the action, the reward, and the state at the next time step, i.e., $(s_t, a_t, r_t, s_{t+1})$;
[0087] Step 7: Once the number of samples in the experience pool reaches the set number, the TD error of each sample is used as the measure of its replay importance. A fixed number of transition samples are drawn by priority sampling as historical data to be replayed to the target critic network, i.e., $[(s_1, a_1, r_1, s_2), \ldots, (s_i, a_i, r_i, s_{i+1}), \ldots, (s_n, a_n, r_n, s_{n+1})]$.
[0088] Step 8: Using the experience replay samples from the priority sampling in Step 7, calculate the deviation between the target critic network's output and the critic network's output, and update the critic network parameters from Step 2 using the mini-batch gradient ascent method.
[0089] Step 9: At each sampling moment in the penicillin fermentation process, simultaneously perform the online control process of Steps 3 to 5 and the experience replay process of Steps 6 to 8 until the current batch of fermentation is completed. Determine whether the final quality indicators meet the requirements. If they do, end the control and output the control strategy; otherwise, save the network model and continue the control for the next batch.
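The storage side of Steps 6 and 7 can be sketched with a small container. This is a minimal illustration, not the patent's implementation: the class name `ExperiencePool`, the fixed capacity with oldest-first eviction, and the `min_samples` threshold parameter are all assumptions made for the example.

```python
from collections import deque

class ExperiencePool:
    """Fixed-capacity store of (s_t, a_t, r_t, s_next) transitions.

    ASSUMPTION: capacity-based eviction is not described in the text.
    """

    def __init__(self, capacity, min_samples):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first
        self.min_samples = min_samples

    def store(self, state, action, reward, next_state):
        """Step 6: save one interaction sample."""
        self.buffer.append((state, action, reward, next_state))

    def ready(self):
        """Step 7 precondition: enough samples collected to start replay."""
        return len(self.buffer) >= self.min_samples

    def __len__(self):
        return len(self.buffer)
```

During a batch, the controller would call `store` at every sampling time and begin priority-sampled replay only once `ready()` returns true.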
[0090] Example 2:
[0091] This embodiment provides a reinforcement learning intermittent process control method based on an improved Actor-Critic algorithm, illustrated using penicillin fermentation process control as an example. Referring to Figure 1, the method includes:
[0092] Step 1: Model the penicillin fermentation process control task based on the improved Actor-Critic algorithm as a purely data-driven optimal control model:
[0093] Solving RL tasks is based on Markov Decision Processes (MDPs). Therefore, this application assumes that the dynamics of the penicillin fermentation process are given by an unknown probability distribution and follow a Markov process:
[0094] $s_{t+1} = f(s_t, a_t, d_t)$ (1)
[0095] Where $t$ is the sampling time; $s_t \in S$ is the state vector at time $t$, and $S$ represents the state space $S := \{s_0, s_1, s_2, \ldots\}$. The state information of the penicillin fermentation process is reflected by the process variables at each time step, i.e., the sampled values of each process variable observed by each sensor in the fermentation equipment at each sampling time; $a_t \in A$ is the control signal vector output by the controller, and $A$ represents the action space $A := \{a_0, a_1, a_2, \ldots\}$; $d_t \in \mathbb{R}^{n_d}$ is the process disturbance vector, where $\mathbb{R}^{n_d}$ is the real space of dimension $n_d$ containing all process disturbance information; $f(\cdot)$ represents the nonlinear dynamic characteristics of the process. It is worth noting that reinforcement learning can update the policy by directly interacting with the intermittent process and using historical data, so no assumptions about $d_t$ are needed; it is a model-free, data-driven control approach.
[0096] The control objective of this application is to find, under unknown disturbances, an optimal policy $\pi^*(s,a)$ that maps the state space $S := \{s_0, s_1, s_2, \ldots\}$ to the action space $A := \{a_0, a_1, a_2, \ldots\}$ so as to achieve the expected control requirements. Therefore, the control of the penicillin fermentation process based on reinforcement learning can be described as an optimal control problem $P(\cdot)$:
[0097]
[0098] Where $k$ is the current batch, $T$ is the end time of the current batch, and $j$ is the historical data obtained from previous interactions with the penicillin fermentation process; the objective function in the optimal control model $P(\cdot)$ is used to measure the control effect of the penicillin fermentation process. In this invention, the reward function $r_t$ of the Markov process is used to quantify the objective function.
[0099]
[0100] Here, $E_k[\cdot]$ represents the expected Markov reward that the current batch $k$ can obtain, and $\gamma$ is the discount factor for future rewards. To avoid an overly long sequence or an unbounded reward in continuing tasks, the discount factor is usually chosen as $0 < \gamma < 1$.
[0101] In the field of artificial intelligence, the evaluation signal used to guide adjustments is called a reward. Each time a controller takes an action and interacts with the environment, it receives a reward signal to update its policy $\pi(s,a)$. In intermittent process control focused on improving product quality, the reward signal is generally given by the deviation between the product quality and the expected quality. However, for penicillin fermentation processes with long batch times and slow product output, if rewards are given only on the basis of product quality deviations at the end of each batch, the number of effective reward samples in a single episode is insufficient, resulting in a sparse reward problem. This makes it difficult for the controller to learn effective information during the process, and may even prevent convergence.
[0102] To address the sparse reward problem, this invention meticulously designs the reward function for reinforcement learning, considering not only product quality deviations but also introducing control action constraints. By increasing the number of effective reward samples, the controller is effectively trained.
[0103]
[0104] In the reward function, the first term is an indicator on the process product concentration, in which the expected product quality is compared with the actual product quality $y_t$ at time $t$ of the penicillin fermentation process, with $\delta_1$ representing the allowable quality deviation; the second term is a path constraint on the control signal $a_t$, in which $u_t$ represents the sampled value of the control signal in a normal batch and $\delta_2$ represents the allowable fluctuation range of the control signal. At each moment, when the controller receives a positive reward from the environment, the action is encouraged; otherwise, the action is suppressed.
[0105] Step 2: Initialize the controller parameters in the optimal control model to prepare for online interaction with the penicillin fermentation process. Since the method in this application is a reinforcement learning-based intermittent process control method based on an improved Actor-Critic algorithm, the controller will be referred to thereafter as the PER-SAC controller.
[0106] As shown in Figure 2, the PER-SAC controller includes three modules: an actor module, a critic module, and an experience replay module. The actor module comprises one actor network responsible for receiving process states and outputting control actions. The critic module includes two critic networks and two target critic networks. The critic networks receive the next state value and reward signal from the environment to update the control strategy. The target critic networks receive historical data replayed from the experience pool described in Step 7 and update the network parameters using methods such as mini-batch gradient descent. The experience replay module stores historical data from each interaction with the process in the experience replay pool and replays the data to the target critic networks using priority sampling during parameter updates. The network structures in the actor and critic modules are shown in Figure 3;
[0107] In the actor module, the input is a state vector composed of the sampled values of the penicillin fermentation process variables. The hidden layers consist of a three-layer fully connected neural network, and the output is the control action, i.e., the setpoint of the control variables. In the critic module, the input of the critic network is the current state vector and action vector, and its output is the Q-value, i.e., the state-action value function approximated by the neural network for the current state-action pair, which is used to evaluate the current strategy. For the target critic network, the input is the state-action pair from the transition samples $(s_t, a_t, r_t, s_{t+1})$ drawn from the experience replay pool, and the output is the neural-network approximation of the target state-action value function $Q_{target}(s_i, a_i)$. The critic network parameters are updated using mini-batch gradient ascent. Regarding the hidden layers of each network, the actor network applies batch normalization and the LeakyReLU activation function after each of the three fully connected layers, with the Tanh function as the activation function of the final output layer. The critic network also applies batch normalization and the LeakyReLU activation function after the fully connected layers. The hidden layers of the critic network and the target critic network are initialized with the same weights at the start of control.
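The actor structure described above can be illustrated with a plain NumPy forward pass. This is a simplified sketch under stated assumptions: the hidden width of 64, the weight initialization, and the function names are hypothetical, batch normalization is omitted for brevity, and the network returns a single squashed output rather than separate mean and standard deviation heads.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation: x for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

def init_actor(state_dim, action_dim, hidden=64, seed=0):
    """Random weights for three hidden layers plus one output layer (sizes assumed)."""
    rng = np.random.default_rng(seed)
    sizes = [state_dim, hidden, hidden, hidden, action_dim]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, state):
    """Map a state vector to a control action in (-1, 1)."""
    x = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        x = leaky_relu(x @ w + b)   # hidden layer (batch normalization omitted)
    w, b = params[-1]
    return np.tanh(x @ w + b)       # Tanh output layer
```

A call such as `actor_forward(init_actor(4, 1), np.ones(4))` returns one bounded action for a four-variable state; a trainable version would normally be built in a deep learning framework rather than by hand.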
[0108] Step 3: After initializing the parameters, the actor network in the PER-SAC controller outputs control actions based on the penicillin fermentation process status at each time step;
[0109] The SAC algorithm increases the randomness of the output action by adding an action entropy term to the objective function, which helps improve the controller's exploratory ability and thus allows for faster learning of the penicillin fermentation process dynamics.
[0110]
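The formula belonging to paragraph [0110] is not reproduced in this text. As a hedged reconstruction only, assuming the standard maximum-entropy objective used by SAC, the entropy-augmented objective would take the following form. The discount factor $\gamma$ and its placement inside the sum are assumptions, as is the equation number.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t) + \alpha \, \mathrm{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right] \tag{5}
```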
[0111] Where $\alpha$ is the temperature coefficient that balances policy entropy against reward importance in the objective function, and $\mathrm{H}(\pi(\cdot \mid s_t))$ is the entropy of the current policy, expressed as:
[0112] $$\mathrm{H}(\pi(\cdot \mid s_t)) = \mathbb{E}_{a_t \sim \pi}\left[-\log \pi(a_t \mid s_t)\right] \tag{6}$$
[0113] Where $a_t$ is the action output by the actor network $\pi(s,a)$ in the current state $s_t$. The actor network in the SAC algorithm outputs actions in the form of a probability density distribution. If the mean of the probability density distribution is $\mu_\theta(s_t)$ and the standard deviation is $\sigma_\theta(s_t)$, then the action $a_t$ is selected as follows:
[0114] $$a_t = \tanh\left(\mu_\theta(s_t) + \sigma_\theta(s_t) \odot \zeta\right), \quad \zeta \sim N(0, I) \tag{7}$$
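The action selection of Eq. (7) can be illustrated with a minimal sketch. The means and standard deviations below are made-up numbers standing in for the actor network outputs, and the function name `sample_action` is hypothetical.

```python
import math
import random

def sample_action(mu, sigma, rng):
    """Draw a_t = tanh(mu + sigma * zeta) with zeta ~ N(0, I), as in Eq. (7).

    mu and sigma are lists with one entry per control variable. They are
    placeholder values here, not outputs of a trained actor network.
    """
    action = []
    for m, s in zip(mu, sigma):
        zeta = rng.gauss(0.0, 1.0)               # independent standard normal noise
        action.append(math.tanh(m + s * zeta))   # squash into (-1, 1)
    return action

rng = random.Random(0)
mu = [0.2, -1.0, 0.5]        # assumed means for three control variables
sigma = [0.1, 0.3, 0.05]     # assumed standard deviations
a_t = sample_action(mu, sigma, rng)

# With zero standard deviation the noise term vanishes and a_t = tanh(mu).
a_det = sample_action(mu, [0.0, 0.0, 0.0], rng)
```

Because of the tanh, every component of the action lies strictly between -1 and 1. How these values are mapped to physical setpoint ranges is not shown here.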
[0115] Step 4: The actuator adjusts the control variables online according to the control actions in Step 3, interacting with the penicillin fermentation process. At the next sampling time, the changes in each process variable are observed by sensors as the next state vector $s_{t+1}$, which is returned to the actor network of the PER-SAC controller to output the next control action.
[0116] Step 5: Calculate the immediate reward $r_t$ based on the deviation between the next state and the expected trajectory, and return it to the critic network of the PER-SAC controller, which iterates to the optimal policy using a soft policy iteration method.
[0117] The soft policy iteration is a dual gradient optimization method that alternates between policy evaluation and policy improvement within a maximum entropy framework. In the policy evaluation step, SAC uses the experience replay mechanism to draw transition experience samples $(s_i, a_i, r_i, s_{i+1}) \in D$ from the experience replay pool and uses the Bellman optimal operator to define a soft Q function that guides the algorithm update, as follows:
[0118] Where $V(s_{i+1})$ is the state value function in soft updates, i.e.:
[0119]
[0120] Define the sequence $Q_i$ by repeated application of this operator. Then, as $i$ approaches infinity, the sequence $Q_i$ converges to the action value function of the optimal strategy $\pi^{*}(s,a)$.
[0121] In the policy improvement step, the SAC algorithm uses the Kullback-Leibler divergence $D_{KL}(\cdot)$ to update the policy $\pi(s,a)$ toward the exponential of the new soft Q function, as follows:
[0122]
[0123] Where $\Pi$ is the feasible set of policy functions, and the normalizing term is a partition function that normalizes the distribution and has no effect on the gradient of the new policy. Repeated application of soft policy evaluation and soft policy improvement converges to the optimal maximum entropy policy among all policies in $\Pi$.
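The alternation of soft policy evaluation and soft policy improvement can be illustrated on a toy tabular problem. This is a sketch under stated assumptions, not the controller of this application. The two-state process, its rewards, and the transition rule are invented, and the standard SAC forms $V(s) = \mathbb{E}_a[Q(s,a) - \alpha \log \pi(a \mid s)]$ and $Q(s,a) = r + \gamma V(s')$ are assumed. In the tabular case the KL projection reduces to a softmax over $Q/\alpha$.

```python
import math

def soft_policy_iteration(reward, gamma=0.9, alpha=0.5, sweeps=200):
    """Tabular soft policy iteration on a toy deterministic process.

    reward[s][a] is the immediate reward. Taking action a moves to state a,
    a made-up transition rule chosen only to keep the example small.
    One evaluation sweep alternates with one improvement step.
    """
    n = len(reward)
    q = [[0.0] * n for _ in range(n)]
    pi = [[1.0 / n] * n for _ in range(n)]    # start from a uniform policy
    for _ in range(sweeps):
        # Soft policy evaluation: V(s) = E_a[Q(s,a) - alpha * log pi(a|s)]
        v = [sum(p * (qa - alpha * math.log(p)) for p, qa in zip(pi[s], q[s]))
             for s in range(n)]
        # Soft Bellman backup: Q(s,a) = r(s,a) + gamma * V(s')
        q = [[reward[s][a] + gamma * v[a] for a in range(n)] for s in range(n)]
        # Soft policy improvement: pi(a|s) proportional to exp(Q(s,a) / alpha)
        for s in range(n):
            m = max(q[s])                     # subtract max for numerical stability
            w = [math.exp((qa - m) / alpha) for qa in q[s]]
            z = sum(w)                        # partition function for state s
            pi[s] = [x / z for x in w]
    return q, pi

reward = [[0.0, 1.0], [0.0, 1.0]]   # action 1 always pays more
q, pi = soft_policy_iteration(reward)
```

The resulting policy prefers the better-paying action but keeps a nonzero probability on the other one. The temperature `alpha` controls how much probability remains on the lower-value action.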
[0124] Steps one through five constitute the online control process between the PER-SAC controller and the penicillin fermentation process. To update the controller's control strategy, step five uses historical data, learning the process dynamics contained in that data to make the controller more intelligent. To enable the controller to learn from a large amount of historical data more effectively, this invention incorporates a prioritized experience replay (PER) mechanism, specifically steps six through eight.
[0125] Step Six: Store the interaction sample at each time step in the transition experience pool $D$, to be used as historical data for updating the network parameters. Each sample contains the current state, the action, the reward, and the state at the next time step, i.e., $(s_t, a_t, r_t, s_{t+1})$;
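A minimal sketch of the storage in Step Six is shown below. The class name `ExperiencePool`, the fixed capacity, and the drop-oldest behaviour are assumptions made for illustration, since the text does not say how the pool is bounded.

```python
from collections import deque, namedtuple

# One interaction sample (s_t, a_t, r_t, s_{t+1}) as described in Step Six.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ExperiencePool:
    """Fixed-capacity pool D; the oldest samples are dropped when it is full.

    The capacity limit is an assumption for illustration only.
    """

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.samples.append(Transition(state, action, reward, next_state))

    def __len__(self):
        return len(self.samples)

pool = ExperiencePool(capacity=3)
for t in range(5):
    # Made-up scalar state, action and reward values.
    pool.store(state=[float(t)], action=[0.1 * t], reward=1.0,
               next_state=[float(t + 1)])
```

After five stores into a pool of capacity three, only the three most recent transitions remain.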
[0126] Step 7: Once the number of samples in the experience pool reaches a set number, the TD error of each sample is used as the standard for measuring the importance of replaying that sample. A fixed number of transition experience samples $[(s_1, a_1, r_1, s_2), \ldots, (s_i, a_i, r_i, s_{i+1}), \ldots, (s_n, a_n, r_n, s_{n+1})]$ are replayed using priority sampling (PER).
[0127] However, since the SAC algorithm has two critic networks, this application defines the TD error of each sample as the average TD error of the two critic networks:
[0128]
[0129] $Q_{target}(s_{i+1}, a_{i+1})$ is the target action value function, which is approximated by the target critic network in step two.
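The averaged TD error can be sketched as follows. The exact expression of Eq. (11) is not reproduced in this text, so the form used here (the mean of the two critics' one-step TD errors against a shared target) is an assumed reading, and all numbers are invented.

```python
def average_td_error(reward, gamma, q_target_next, q1, q2):
    """Mean of the two critics' TD errors for one sample.

    Assumed form: delta_j = r_i + gamma * Q_target(s_{i+1}, a_{i+1}) - Q_wj(s_i, a_i),
    averaged over j = 1, 2. This is a plausible reading of Eq. (11),
    not a confirmed transcription of it.
    """
    td_target = reward + gamma * q_target_next
    delta_1 = td_target - q1     # TD error of the first critic
    delta_2 = td_target - q2     # TD error of the second critic
    return 0.5 * (delta_1 + delta_2)

# Made-up numbers for a single transition.
delta = average_td_error(reward=1.0, gamma=0.5, q_target_next=10.0, q1=5.0, q2=5.5)
```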
[0130] Therefore, for the i-th experience sample $(s_t, a_t, r_t, s_{t+1})$, the priority is defined as follows:
[0131] $$p_i = |\delta_i| + \varepsilon \tag{12}$$
[0132] Where ε is a priority smoothing term, which prevents the extreme case where samples are not replayed once the TD error reaches zero. After quantizing the priority, the sampling probability of each sample is calculated as follows:
[0133]
[0134] Here, $\xi_1$ determines the degree to which the priority affects the sampling probability; $\xi_1 = 0$ corresponds to completely random sampling. The importance sampling (IS) weight of each sample is quantified as follows:
[0135]
[0136] Where $N$ is the number of samples in the experience replay pool, and $\xi_2$ is the hyperparameter that compensates for the non-uniform probability $P(i)$.
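The priority, sampling probability, and importance sampling weight can be sketched together. Eq. (12) is taken from the text. The forms used for Eqs. (13) and (14) are the usual prioritized-replay expressions and are assumptions here, as are the values of `xi1` and `xi2`, the division by the largest weight, and the TD errors.

```python
import random

def priorities(td_errors, eps=0.1):
    """p_i = |delta_i| + eps, Eq. (12)."""
    return [abs(d) + eps for d in td_errors]

def sampling_probabilities(p, xi1):
    """Assumed form of Eq. (13): P(i) = p_i**xi1 / sum_k p_k**xi1."""
    scaled = [x ** xi1 for x in p]
    total = sum(scaled)
    return [x / total for x in scaled]

def importance_weights(probs, xi2):
    """Assumed form of Eq. (14): w_i = (1 / (N * P(i)))**xi2.

    Dividing by the largest weight is a common convention, not something
    stated in the text.
    """
    n = len(probs)
    raw = [(1.0 / (n * pr)) ** xi2 for pr in probs]
    top = max(raw)
    return [w / top for w in raw]

td_errors = [0.0, 0.4, -1.9, 0.9]            # made-up TD errors
p = priorities(td_errors)
probs = sampling_probabilities(p, xi1=0.6)
uniform = sampling_probabilities(p, xi1=0.0)  # xi1 = 0 gives uniform sampling
weights = importance_weights(probs, xi2=0.4)

rng = random.Random(0)
batch_idx = rng.choices(range(len(p)), weights=probs, k=2)  # prioritized draw
```

The sample with the largest absolute TD error is drawn most often and receives the smallest weight, which offsets its over-representation in the mini-batch.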
[0137] Step 8: Using the experience replay samples obtained by priority sampling in Step 7, update the network parameters described in Step 2 according to the mini-batch gradient ascent method:
[0138] The PER-SAC algorithm constructs two structurally identical critic networks $Q_{w_1}$ and $Q_{w_2}$ to estimate the value function of the current state-action pair. The parameters are updated using the mean squared error between the Q-value predicted by the critic network and the target Q-value, where the target action value function is approximated by the target critic network:
[0139]
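[0140] $$a_{i+1} \sim \pi_\theta(\cdot \mid s_{i+1}) \tag{16}$$
[0141] Then the loss function of the critic network $Q_{w_j}$ is:
[0142] $$L(w) = \mathbb{E}\left[\left(Q_{target}(s_i, a_i) - Q_{w_j}(s_i, a_i)\right)^2\right] \tag{17}$$
[0143] $$w \leftarrow w - \beta_{critic} \nabla_w L(w) \tag{18}$$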
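The critic update of Eqs. (17) and (18) can be sketched with a linear stand-in for the critic network. The linear model, the mini-batch, and all numbers are hypothetical. The target uses the usual SAC form (reward plus the discounted minimum of the two target critics minus the entropy term), which is an assumed reading of Eq. (15).

```python
def q_value(w, s, a):
    """Linear stand-in for a critic network: Q_w(s, a) = w . [s, a, 1]."""
    return w[0] * s + w[1] * a + w[2]

def soft_target(r, gamma, alpha, q_bar_1, q_bar_2, log_prob_next):
    """Assumed form of Eq. (15): r + gamma * (min target Q - alpha * log pi)."""
    return r + gamma * (min(q_bar_1, q_bar_2) - alpha * log_prob_next)

def critic_loss(w, batch, targets):
    """Mini-batch estimate of the mean squared error in Eq. (17)."""
    errs = [(y - q_value(w, s, a)) ** 2 for (s, a), y in zip(batch, targets)]
    return sum(errs) / len(errs)

def critic_step(w, batch, targets, beta):
    """One update w <- w - beta * grad_w L(w), as in Eq. (18)."""
    grad = [0.0, 0.0, 0.0]
    for (s, a), y in zip(batch, targets):
        err = q_value(w, s, a) - y           # d/dQ of (y - Q)^2 is 2 * (Q - y)
        for k, feat in enumerate((s, a, 1.0)):
            grad[k] += 2.0 * err * feat / len(batch)
    return [wk - beta * gk for wk, gk in zip(w, grad)]

# Made-up mini-batch of (state, action) pairs and target-network outputs.
batch = [(0.5, 0.1), (1.0, -0.2), (-0.3, 0.4)]
targets = [soft_target(r=1.0, gamma=0.9, alpha=0.2,
                       q_bar_1=2.0, q_bar_2=1.5, log_prob_next=-0.7)
           for _ in batch]
w = [0.0, 0.0, 0.0]
loss_before = critic_loss(w, batch, targets)
for _ in range(50):
    w = critic_step(w, batch, targets, beta=0.1)
loss_after = critic_loss(w, batch, targets)
```

Each step subtracts the gradient of the loss, as written in Eq. (18), so the mean squared error against the fixed targets decreases over the iterations.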
[0144] Where $\bar{w}$ denotes the parameters of the target critic network. The parameters $\theta$ of the actor network in the actor module are optimized as follows:
[0145]
[0146]
[0147] The PER-SAC algorithm synchronizes the parameters of the target critic network using a soft update method, with the same update frequency as the actor network. The update method is as follows:
[0148]
[0149] Where $\tau$ is the soft update coefficient and $\bar{w}$ denotes the parameters of the target critic network.
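The soft update can be sketched as follows. The form $\bar{w} \leftarrow \tau w + (1 - \tau)\bar{w}$ is the usual one and is an assumed reading of Eq. (21). The parameter vectors and the value of `tau` are made up.

```python
def soft_update(target_params, online_params, tau):
    """Assumed form of Eq. (21): w_bar <- tau * w + (1 - tau) * w_bar."""
    return [tau * w + (1.0 - tau) * wb
            for w, wb in zip(online_params, target_params)]

w = [1.0, -2.0, 0.5]         # critic parameters (made-up values)
w_bar = [0.0, 0.0, 0.0]      # target critic parameters
for _ in range(3):
    w_bar = soft_update(w_bar, w, tau=0.5)
```

A small `tau` makes the target critic track the critic slowly, which keeps the regression target of Eq. (17) from shifting abruptly between updates. A `tau` of 1 would copy the critic parameters outright.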
[0150] Step Nine: At each sampling point during the penicillin fermentation process, simultaneously perform the control actions from Step Three, as well as the strategy and network model updates from Steps Five to Seven, until the batch ends. If the final quality indicators meet the requirements, end the control and output the control strategy. Otherwise, save the network model and continue control for the next batch.
[0151] In summary, the control flow of the penicillin fermentation process based on the PER-SAC controller is shown in Table 1:
[0152] Table 1: Control flow chart of penicillin fermentation process based on PER-SAC controller
[0153]
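The order of operations in Steps 3 to 9 can be sketched as a skeleton loop. Every component below is a placeholder chosen so that the loop runs: the scalar process, the random action, the reward rule, the quality test, and the uniform replay draw are not taken from this application, and the function name `run_batches` is hypothetical.

```python
import random

def run_batches(n_steps=20, max_batches=5, min_pool=8, batch_size=4, seed=0):
    """Skeleton of Steps 3 to 9 with placeholder components."""
    rng = random.Random(seed)
    pool = []                                   # experience pool D
    updates = 0
    for batch in range(1, max_batches + 1):
        s = 0.0                                 # initial process state
        for t in range(n_steps):
            a = rng.uniform(-1.0, 1.0)          # Step 3: actor outputs an action
            s_next = 0.9 * s + 0.1 * a          # Step 4: process responds
            r = 1.0 if abs(s_next) < 0.05 else -1.0   # Step 5: immediate reward
            pool.append((s, a, r, s_next))      # Step 6: store the sample
            if len(pool) >= min_pool:           # Step 7: replay once the pool is large enough
                replayed = rng.sample(pool, batch_size)  # uniform draw stands in for PER
                updates += 1                    # Step 8: parameter update would use `replayed`
            s = s_next
        quality_ok = abs(s) < 0.05              # Step 9: end-of-batch quality check
        if quality_ok:
            break                               # requirements met: stop and keep the policy
    return batch, len(pool), updates

batches_run, pool_size, n_updates = run_batches()
```

The pool persists across batches, so later batches replay experience gathered in earlier ones.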
[0154] To verify the control effect of the method in this application, a simulation experiment was conducted following the PER-SAC-based penicillin fermentation process control flow described in Table 1. The experimental results are shown in Figures 4-6. Figure 4 is a simulation comparison of the penicillin concentration obtained when the fermentation process is controlled with the method of this application and the penicillin concentration obtained without control. It can be seen that the penicillin concentration obtained with the method of this application is extremely close to the target operating condition. Figure 5 shows the total reward value obtained by the controlled operating condition in this batch. The reward at each moment in the batch is mostly positive, meaning that the deviation between the controlled operating condition and the target operating condition has been kept within the allowable range. Figure 6 compares the trajectory of the control variables adjusted online by the actuator according to the control signal with the trajectory of the control variables in the target operating condition. It can be seen that the control variables are mostly adjusted near the target trajectory.
[0155] Some steps in the embodiments of the present invention can be implemented using software, and the corresponding software program can be stored in a readable storage medium, such as an optical disc or a hard disk.
[0156] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A reinforcement learning-based intermittent process control method based on an improved Actor-Critic algorithm, characterized in that the method is applied to the penicillin fermentation process and includes:
Step 1: Model the penicillin fermentation process control task as an optimal control model based on a Markov decision process. The optimal control model includes a controller and an actuator. The controller outputs a control signal based on the current state information of the penicillin fermentation process, determines an immediate reward at the next moment based on the deviation between the next state value monitored by sensors and the expected process trajectory, and adjusts the control strategy based on the immediate reward; the actuator controls the penicillin fermentation process according to the control signal output by the controller; the expected process trajectory refers to the control target of the penicillin fermentation process; the reward simultaneously considers product quality deviation and control action constraints, so that the controller is effectively trained by increasing the number of effective reward samples;
Step 2: Initialize the controller parameters in the optimal control model. The controller includes an actor module, a critic module, and an experience replay module. The actor module includes an actor network for receiving process states and outputting control actions. The critic module includes two critic networks and two target critic networks. The critic networks receive the next state value and the reward signal from the environment so that the controller can update the control strategy, while the target critic networks receive historical data replayed from the experience pool. The experience replay module stores the historical data of each interaction with the penicillin fermentation process in the experience replay pool and replays the data to the target critic networks using priority sampling when parameters are updated;
Step 3: After initializing the parameters, the actor network in the controller outputs control actions based on the state information of the penicillin fermentation process at each moment;
Step 4: The actuator adjusts the control variables online according to the control actions in Step 3 and interacts with the penicillin fermentation process; at the next sampling time, the changes in each process variable are observed by sensors as the next state vector $s_{t+1}$, which is returned to the actor network of the controller to output the next control action;
Step 5: Calculate the immediate reward $r_t$ based on the deviation between the actual state value and the expected value at the next moment, and return it to the critic network of the controller, which iterates to the optimal policy using a soft policy iteration method;
Step 6: Store the interaction sample at each time step in the transition experience pool $D$, to be used subsequently as historical data for updating the parameters of each network. Each sample contains the current state, the action, the reward, and the state at the next moment, i.e., $(s_t, a_t, r_t, s_{t+1})$;
Step 7: Once the number of samples in the experience pool reaches the set number, the TD error of each sample is used as the standard for measuring the importance of replaying that sample. A fixed number of transition experience samples are drawn using a priority sampling method as historical data to be replayed to the target critic network;
Step 8: Using the experience replay samples obtained by priority sampling in Step 7, calculate the deviation between the target critic network's output and the critic network's output, and update the critic network parameters from Step 2 using the mini-batch gradient ascent method;
Step 9: At each sampling moment in the penicillin fermentation process, simultaneously perform the online control process of steps 3 to 5 and the experience replay process of steps 6 to 8 until the current batch of fermentation is completed. Determine whether the final quality indicators meet the requirements. If the requirements are met, end the control and output the control strategy; otherwise, save the network model and continue the control for the next batch.
2. The method according to claim 1, characterized in that the optimal control model in step 1 is given by Eq. (2), in which: the time index is the sampling time; the batch index denotes the current batch, which has its own end time, and the earlier batches are historical batches that previously interacted with the intermittent process; the state vector at time t belongs to the state space, and the state information of the penicillin fermentation process is reflected by the process variables at each moment, i.e., by the sampled values of each process variable observed by each sensor in the fermentation equipment at each sampling time; the control signal vector output by the controller belongs to the action space; the process disturbance vector belongs to the set of real vectors containing all process disturbance information; and a nonlinear function describes the dynamic characteristics of the process. The control objective of the method is to find, under unknown disturbances, the optimal strategy that maps the state space to the action space so as to achieve the expected control requirements. The objective function of the optimal control model, given by Eq. (3), is used to measure the control effectiveness of the penicillin fermentation process; in Eq. (3), the expectation is taken over the Markov rewards received in the current batch, and the discount factor weights future rewards.
3. The method according to claim 2, characterized in that the reward is given by Eq. (4), in which: the product quality indicator is the concentration of the process product; the quality terms are the expected product quality of the intermittent process, the actual product quality, and the tolerable quality deviation; and the path constraint term for the control signal is defined by the sampled values of the control signals in a normal batch and the allowable fluctuation range of the control signal. At each moment, when the controller receives a positive reward from the environment, the action is encouraged; otherwise, the action is suppressed.
4. The method according to claim 3, characterized in that, in step 3, the actor network outputs control actions based on the received process state according to the following formula: $$a_t = \tanh\left(\mu_\theta(s_t) + \sigma_\theta(s_t) \odot \zeta\right), \quad \zeta \sim N(0, I) \tag{7}$$ The actor network $\pi_\theta$, parameterized by $\theta$, has two output nodes: one represents the mean $\mu_\theta(s_t)$ of the probability density distribution of the output action, and the other represents its standard deviation $\sigma_\theta(s_t)$.
5. The method according to claim 4, characterized in that the expression for the optimal strategy in step 5 is given by Eq. (5), in which $\alpha$ is the temperature coefficient that balances policy entropy against reward importance in the objective function, and $\mathrm{H}(\pi(\cdot \mid s_t))$ is the entropy of the current strategy; the policy entropy is expressed as: $$\mathrm{H}(\pi(\cdot \mid s_t)) = \mathbb{E}_{a_t \sim \pi}\left[-\log \pi(a_t \mid s_t)\right] \tag{6}$$ in which $a_t$ is the action output by the actor network $\pi(s,a)$ in the current state $s_t$.
6. The method according to claim 5, characterized in that the soft policy iteration method in step 5 is performed as follows: in the policy evaluation step, transition experience samples $(s_i, a_i, r_i, s_{i+1}) \in D$ are drawn from the experience replay pool using the experience replay mechanism, and the Bellman optimal operator is used to define a soft Q function that guides the algorithm update, as given by Eq. (8), in which $V(s_{i+1})$ is the state value function in soft updates, given by Eq. (9); the sequence $Q_i$ defined by repeated application of this operator converges, as $i$ approaches infinity, to the action value function of the optimal strategy $\pi^{*}(s,a)$; in the policy improvement step, the Kullback-Leibler divergence $D_{KL}(\cdot)$ is used to update the strategy $\pi(s,a)$ toward the exponential of the new soft Q function, as given by Eq. (10), in which $\Pi$ is the feasible set of policy functions and the normalizing term is a partition function.
7. The method according to claim 6, characterized in that, in step 7, the priority sampling of historical data is performed as follows: the average TD error $\delta_i$ of the two critic networks is calculated according to Eq. (11), in which $Q_{target}(s_{i+1}, a_{i+1})$ is the target action value function, approximated by the target critic network in step two; for the experience sample $(s_t, a_t, r_t, s_{t+1})$, the priority is defined as: $$p_i = |\delta_i| + \varepsilon \tag{12}$$ in which $\varepsilon$ is a smoothing term for the priority; after quantizing the priority, the sampling probability $P(i)$ of each sample is calculated according to Eq. (13), in which $\xi_1$ determines the degree to which the priority affects the sampling probability and $\xi_1 = 0$ corresponds to completely random sampling; and the importance sampling weight of each sample is quantified according to Eq. (14), in which $N$ is the number of samples in the experience replay pool and $\xi_2$ is the hyperparameter that compensates for the non-uniform probability $P(i)$.
8. The method according to claim 7, characterized in that, in step 8, the actor and critic network parameters are updated using the historical data obtained by priority sampling as follows: the two critic networks $Q_{w_1}$ and $Q_{w_2}$ are updated according to Eq. (15) together with: $$L(w) = \mathbb{E}\left[\left(Q_{target}(s_i, a_i) - Q_{w_j}(s_i, a_i)\right)^2\right] \tag{17}$$ $$w \leftarrow w - \beta_{critic} \nabla_w L(w) \tag{18}$$ in which $\bar{w}$ denotes the parameters of the target critic network, and the target critic network provides an approximation of the target action value function; the actor network parameters $\theta$ in the actor module are optimized according to Eqs. (19) and (20); the initial parameters of the target critic network in the controller are identical to the critic network parameters, and subsequent updates use a soft update method with the same update frequency as the actor network, according to Eq. (21), in which $\tau$ is the soft update coefficient and $\bar{w}$ denotes the parameters of the target critic network.
9. The method according to claim 8, characterized in that the range of the discount factor for the future reward is: .
10. The method according to claim 9, characterized in that the priority smoothing term $\varepsilon = 0.1$.