Reinforcement learning based online 3D fuzzy modeling method for distributed parameter systems and application
By using the Actor-Critic framework based on reinforcement learning and Markov decision processes, online 3D fuzzy modeling of distributed parameter systems was achieved, solving the problem of decreased model accuracy in existing technologies and realizing real-time, high-precision temperature modeling and production optimization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2023-12-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to achieve online 3D fuzzy modeling of distributed parameter systems that combines high accuracy and real-time performance. In particular, when the dynamic characteristics of the system change, the accuracy of offline models based on historical data decreases, making it impossible to adapt to the needs of real-time data updates.
We employ an Actor-Critic framework based on reinforcement learning, combined with Markov decision processes, to construct and optimize a 3D fuzzy model of a distributed parameter system using online data. The Actor network generates actions, and the Critic network evaluates the value of the actions, enabling real-time updates and optimization of the model.
It enables high-precision online modeling of distributed parameter systems, and can respond to and adapt to new data inputs in real time. It is applicable to fields such as rotary hearth furnace temperature, chemical reactor temperature, and rolled steel plate temperature, optimizing production processes and improving product quality and energy efficiency.
Figure CN117633936B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of distributed parameter system modeling technology, and in particular to an online 3D fuzzy modeling method and application for distributed parameter systems based on reinforcement learning.
Background Technology
[0002] In recent years, although 3D fuzzy systems have begun to be explored in the field of Distributed Parameter System (DPS) modeling, their practical applications are still relatively limited, and development is still in its early stages. A 3D fuzzy system is a system based on fuzzy logic theory that can process data with three-dimensional spatial coordinates. The core of this system is fuzzy sets and fuzzy operations, which allow for the processing of uncertain, imprecise, or fuzzy information. In a 3D fuzzy system, data is typically divided into several fuzzy sets, each corresponding to a specific region of the input space. Each fuzzy set has a corresponding membership function used to determine the degree to which the input data belongs to that set. By comparing the input data with these membership functions, the membership degree of the input data to each set can be obtained.
[0003] Another important characteristic of 3D fuzzy systems is fuzzy computation. Traditional mathematical operations typically work only with precise values, while fuzzy computation operates on the membership degrees of the input data. For example, the fuzzy union of two fuzzy sets produces a new fuzzy set whose membership function is the pointwise maximum of the membership functions of the two input sets. This computational approach allows the system to better handle uncertainty and fuzziness. 3D fuzzy systems naturally achieve spatiotemporal separation and spatiotemporal synthesis internally. Compared to traditional dimensionality reduction-based modeling methods, 3D fuzzy systems avoid the model accuracy loss caused by dimensionality reduction, and the 3D fuzzy rule base makes the model linguistically interpretable. Therefore, 3D fuzzy systems have unique advantages in modeling distributed parameter systems.
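As a minimal illustration of membership degrees and the max-based combination described above, the following Python sketch uses Gaussian membership functions. The function names, centers, and widths are illustrative assumptions and do not come from the patent.

```python
import math

def gaussian_membership(x, center, width):
    """Degree (between 0 and 1) to which x belongs to a fuzzy set
    described by a Gaussian membership function (assumed shape)."""
    return math.exp(-((x - center) / width) ** 2)

def fuzzy_union(mu_a, mu_b):
    """Max-based combination (fuzzy union) of two membership degrees."""
    return max(mu_a, mu_b)

# Hypothetical fuzzy sets "low" and "high" over a normalized temperature.
x = 0.3
mu_low = gaussian_membership(x, center=0.0, width=0.5)
mu_high = gaussian_membership(x, center=1.0, width=0.5)
mu_either = fuzzy_union(mu_low, mu_high)
```

Here the input 0.3 belongs more strongly to "low" than to "high", so the union takes the "low" degree.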
[0004] Online modeling techniques for distributed parameter systems have made progress in real-time data acquisition and processing, machine learning-driven modeling, distributed computing, and specialized mathematical modeling software. However, they face challenges such as difficulty in capturing complexity, data quality and availability, huge computational resource requirements, continuous model updates and maintenance, declining interpretability, and real-time performance and latency. These challenges hinder the accuracy and reliability of the models, necessitating further technological innovation and comprehensive solutions to achieve more accurate, efficient, and real-time responsive online modeling of distributed parameter systems.
[0005] Reinforcement learning algorithms are highly adaptable, capable of learning and optimizing in constantly changing environments. Through continuous trial and error, they dynamically adjust the model based on data and feedback from different scenarios to adapt to various fuzzy environments. The online learning nature of reinforcement learning-based 3D fuzzy modeling enables it to respond to and adapt to new data inputs in real time.
[0006] Most modeling methods for 3D fuzzy systems are offline methods based on historical data. When the dynamic characteristics of the system change, offline models based on historical data become unsuitable for the current system, and their accuracy deteriorates. Given the limitations of offline modeling, online updating modeling driven by real-time acquired data remains a pressing problem. Therefore, researching and establishing an online modeling method based on real-time data is of great significance. The characteristics of distributed parameter systems change with time and environment, so the model needs continuous updating and maintenance to maintain accuracy. Real-time model updates may require efficient algorithms and techniques, and the core idea of reinforcement learning algorithms is that an agent receives rewards through interaction with the environment and seeks strategies to maximize those rewards. Essentially, this is an incremental learning method, naturally suited to online learning scenarios.
[0007] Achieving highly accurate online 3D fuzzy modeling of distributed parameter systems based on real-time data has become a technical problem that needs to be solved.
Summary of the Invention
[0008] The purpose of this invention is to overcome the shortcomings of the existing technology and provide an online 3D fuzzy modeling method and application for distributed parameter systems based on reinforcement learning.
[0009] The objective of this invention can be achieved through the following technical solutions:
[0010] According to one aspect of the present invention, an online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning is provided, the method comprising the following steps:
[0011] Step S1: Based on the sensor data collected in the distributed parameter system, construct a dataset and build a Markov decision process model.
[0012] Step S2: Establish an online 3D fuzzy model of the distributed parameter system based on the Actor-Critic reinforcement learning model framework;
[0013] Step S3: Optimize the online 3D fuzzy model of the distributed parameter system.
[0014] Preferably, step S1 specifically includes:
[0015] First, based on the sensor data collected from the distributed parameter system, data preprocessing and dataset construction are performed;
[0016] Secondly, based on the requirements of online 3D fuzzy modeling for distributed parameter systems, the system state, actions, and reward function are determined.
[0017] Finally, a Markov decision process model is constructed.
[0018] Preferably, in step S1, constructing the dataset specifically involves:
[0019] The input to a nonlinear distributed parameter system is $u(t) \in \mathbb{R}^m$ and the spatiotemporal output is $y(z,t) \in \mathbb{R}$, where $t$ is the time variable and $z$ is the spatial variable, defined on a spatial domain;
[0020] There are $P$ sensors located at spatial points $z_1, z_2, \ldots, z_P$; writing $Z$ for this set of sensor locations, the system output is $y(Z,t)$;
[0021] Given the input variables $\{u(t-1), u(t-2), \ldots, u(t-K)\}$ and $\{y(Z,t-1), y(Z,t-2), \ldots, y(Z,t-J)\}$ of the 3D fuzzy system, the 3D fuzzy rule of the 3D fuzzy system is expressed as follows:
[0022]
[0023]
[0024]
[0025]
[0026] where the sets indexed by $i = 1, 2, \ldots, J$ are 3D fuzzy sets; the sets indexed by $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, K$ are traditional fuzzy sets; the remaining symbol denotes the spatial basis function; $K$ is the order of the input variable $u(t)$, $J$ is the order of the output variable $y(z,t)$, and $m$ is the order of the traditional fuzzy set.
[0027] A spatiotemporal 3D fuzzy model is identified by establishing the relationship between the model input and output; its dataset $D$ is shown in the following equation:
[0028]
[0029] The parameters are defined as follows:
[0030]
[0031] L is the time length, K is the order of the input variable u(t), and J is the order of the output variable y(z,t).
[0032] Let the state be $S_t = x_k$. From the properties of distributed parameter systems, the state $S_t$ depends only on its previous state $S_{t-1}$ and therefore satisfies the Markov property.
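The dataset equation itself is not reproduced in this text. The following Python sketch shows one way such a dataset could be assembled from lagged inputs and sensor outputs. The layout of each regressor, $x_k = [u(k-1), \ldots, u(k-K), y(Z,k-1), \ldots, y(Z,k-J)]$ with target $y(Z,k)$, and the function name `build_dataset` are assumptions made for illustration.

```python
import numpy as np

def build_dataset(u, y, K, J):
    """Build lagged regressors x_k and targets y(Z, k).

    u: array of shape (L, m), inputs u(t) over L time steps
    y: array of shape (L, P), outputs y(Z, t) at the P sensors
    Assumed layout: x_k = [u(k-1..k-K), y(Z, k-1..k-J)].
    """
    n_steps = u.shape[0]
    start = max(K, J)  # first index with a full history available
    X, Y = [], []
    for k in range(start, n_steps):
        u_lags = [u[k - i] for i in range(1, K + 1)]
        y_lags = [y[k - j] for j in range(1, J + 1)]
        X.append(np.concatenate(u_lags + y_lags))
        Y.append(y[k])
    return np.asarray(X), np.asarray(Y)

# Tiny synthetic example (placeholder values, not patent data).
L, m, P, K, J = 6, 3, 11, 2, 1
u = np.arange(L * m, dtype=float).reshape(L, m)
y = np.arange(L * P, dtype=float).reshape(L, P)
X, Y = build_dataset(u, y, K, J)
```

With these sizes each regressor has K·m + J·P = 17 entries, and the first max(K, J) time steps are dropped because their history is incomplete.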
[0033] Preferably, in step S1, constructing the Markov decision process model specifically involves constructing a Markov decision process (MDP) quintuple (S, A, R, P, γ):
[0034]
[0035]
[0036]
[0037] where $S$ is the state space, $A$ is the action space, $R_t$ is the reward obtained by the agent after taking action $A_t$ in state $S_t$, $P$ is the state transition probability matrix, and $\gamma$ is the decay (discount) factor.
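The concrete definitions of the state, action, and reward are not reproduced in this text. The sketch below shows one way the modeling problem could be wrapped as an MDP in Python, assuming the state is the regressor $x_k$, the action is the predicted sensor output, and the reward is the negative mean squared prediction error. The class name `ModelingMDP`, the reward form, and the value of `gamma` are all assumptions; the transition probabilities $P$ are left implicit in the data.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ModelingMDP:
    """Online-modeling MDP over a dataset (X, Y); hypothetical wrapper."""
    X: np.ndarray        # states, one row per time step
    Y: np.ndarray        # measured sensor outputs following each state
    gamma: float = 0.99  # decay (discount) factor; the value is an assumption
    t: int = 0

    def reset(self):
        self.t = 0
        return self.X[self.t]

    def step(self, action):
        """action: predicted sensor outputs. Returns (next_state, reward)."""
        # Assumed reward: negative mean squared prediction error.
        reward = -float(np.mean((action - self.Y[self.t]) ** 2))
        self.t += 1
        next_state = self.X[self.t]
        return next_state, reward

env = ModelingMDP(X=np.zeros((3, 4)), Y=np.ones((3, 2)))
s0 = env.reset()
s1, r = env.step(np.array([1.0, 0.0]))
```

A reward that grows as the prediction error shrinks ties the agent's objective directly to model accuracy, which is why an error-based reward is a natural choice for a modeling task.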
[0038] Preferably, in step S2, establishing an online 3D fuzzy model of a distributed parameter system based on the Actor-Critic reinforcement learning model framework includes determining the input and output, constructing the Actor network and Critic network, selecting reinforcement learning algorithms and parameters, collecting sample data and performing training optimization, and applying the trained 3D fuzzy model to the actual system and verifying its performance.
[0039] More preferably, the Actor network selects actions, and the Critic network evaluates the value of the actions. The two cooperate to achieve the goal of online 3D fuzzy modeling and dynamic decision-making of the system.
[0040] More preferably, the establishment of the online model of the distributed parameter system based on the Actor-Critic reinforcement learning model framework specifically includes:
[0041] Actor and Critic are represented by two 3D fuzzy systems, where Actor serves as an online model of the distributed parameter system;
[0042] At time step $t$, the Actor takes the state $S_t$ as input and outputs the predicted value of the DPS at time step $t+1$. The Critic takes the state $S_t$ and the action $A_t$ as input and outputs the behavior value function $Q(s,a)$, and the environment formed by the DPS outputs the next state $S_{t+1}$ and the reward $R_t$;
[0043] The temporal-difference (TD) target is used to update $Q(s,a)$, and the Actor policy function is updated along the positive gradient direction of $Q(s,a)$ using the chain rule;
[0044] The structure of the Actor policy function μ(s) is shown in the following equation:
[0045]
[0046]
[0047] where the sets indexed by $i = 1, 2, \ldots, J$ are 3D fuzzy sets; the sets indexed by $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, K$ are traditional fuzzy sets; the remaining symbol denotes the spatial basis function; $K$ is the order of the traditional fuzzy set, $J$ is the order of the 3D fuzzy set, and $a_l, b_l, c_l, d_l$ are the coefficients of the Fourier spatial basis functions;
[0048] The structure of the Critic behavior value function Q(s,a) is shown in the following equation:
[0049]
[0050] where the two set symbols denote a 3D fuzzy set and a traditional fuzzy set, respectively, and $Q_1$ and $Q_N$ are both constants.
[0051] Preferably, in step S3, optimizing the online model of the distributed parameter system specifically involves: updating the Actor policy function parameters based on the model error using the stochastic gradient method, updating the Critic behavior value function parameters by minimizing the loss function, using the updated Actor policy function as the model of the distributed parameter system, and finally updating the parameters of the target fuzzy system.
[0052] Using a 3D fuzzy system as a nonlinear function approximator, the structure of the 3D fuzzy system is fixed, and the parameters are updated through gradient backpropagation. The Actor policy function parameter $\theta^{\mu}$ is updated as follows:
[0053]
[0054]
[0055] The Critic behavior value function parameter $\theta^{Q}$ is updated as follows:
[0056]
[0057]
[0058] where $J(\pi_\theta)$ is the expected return or total return of the policy $\pi_\theta$, $\theta^{\mu}$ is the parameter of the policy function $\mu$, $N$ is the sample size, $\nabla$ denotes the gradient, $y_t$ is the target value at time step $t$, $R_t$ is the immediate reward at time step $t$, and the remaining term is the average reward in experience replay.
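The update formulas themselves are not reproduced in this text. For orientation only, the following is the standard deterministic policy gradient form with a temporal-difference target, written with the symbols defined above. The notation $\bar{R}$ for the average reward, $\alpha$ for the learning rate, and $\mu'$, $Q'$ for the target Actor and target Critic is assumed here, and the exact expressions, in particular how the average reward enters the target and whether the decay factor $\gamma$ appears, are assumptions rather than the patent's own equations.

```latex
\begin{aligned}
\nabla_{\theta^{\mu}} J(\pi_\theta)
  &\approx \frac{1}{N}\sum_{t=1}^{N}
     \nabla_a Q(s,a\mid\theta^{Q})\big|_{s=S_t,\;a=\mu(S_t)}\,
     \nabla_{\theta^{\mu}}\mu(s\mid\theta^{\mu})\big|_{s=S_t}, \\
\theta^{\mu} &\leftarrow \theta^{\mu} + \alpha\,\nabla_{\theta^{\mu}} J(\pi_\theta), \\
L(\theta^{Q}) &= \frac{1}{N}\sum_{t=1}^{N}\bigl(y_t - Q(S_t,A_t\mid\theta^{Q})\bigr)^{2}, \\
y_t &= R_t - \bar{R} + Q'\bigl(S_{t+1},\,\mu'(S_{t+1}\mid\theta^{\mu'})\mid\theta^{Q'}\bigr).
\end{aligned}
```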
[0059] Preferably, the parameters of the target fuzzy system include target Critic parameters and target Actor parameters, which are updated as follows:
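[0060] $\theta^{Q'} = \tau\theta^{Q} + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} = \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
[0061] where $\tau$ is the weighting factor, $\theta^{Q}$ is the parameter of the Critic behavior value function, $\theta^{\mu}$ is the parameter of the Actor policy function, $\theta^{Q'}$ is the parameter of the target Critic behavior value function, and $\theta^{\mu'}$ is the parameter of the target Actor policy function.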
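The weighted update of the target parameters can be sketched in a few lines of Python. The helper name `soft_update`, the list-of-arrays parameter layout, and the value of `tau` are illustrative assumptions.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

tau = 0.01  # weighting factor; the value is an assumption
theta_Q = [np.array([1.0, 2.0])]         # online Critic parameters (placeholder)
theta_Q_target = [np.array([0.0, 0.0])]  # target Critic parameters (placeholder)
theta_Q_target = soft_update(theta_Q_target, theta_Q, tau)
```

The same helper applies to the Actor: the target Actor parameters are blended with the online Actor parameters, so that each target system slowly tracks its own online counterpart.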
[0062] According to another aspect of the present invention, an application of an online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning is provided. This method is applied to distributed parameter systems with spatiotemporal coupling characteristics, wherein the distributed parameter system includes a rotary hearth furnace temperature model, a chemical reactor temperature model, or a steel rolling plate temperature tracking model. The specific application process is as follows:
[0063] First, sensor data from the distributed parameter system are collected to construct a dataset, and a Markov decision process model is built based on the online modeling problem of the distributed parameter system.
[0064] Secondly, a reinforcement learning model based on the Actor-Critic framework is established, and the 3D fuzzy system continuously optimizes the model parameters according to changes in the environment.
[0065] Finally, the optimized online 3D fuzzy model is embedded into the distributed parameter system to more accurately simulate and predict temperature distribution at different locations, and to make real-time adjustments in production based on the predictions.
[0066] Compared with the prior art, the present invention has the following beneficial effects:
[0067] 1. This invention combines the incremental learning approach in reinforcement learning algorithms with the unique advantages of 3D fuzzy systems in processing systems with spatiotemporal coupling characteristics. It can model the system online with high accuracy from scratch and respond to and adapt to new data inputs in real time.
[0068] 2. This invention has a wide range of applications and can be used in various fields that require real-time and accurate temperature modeling, such as: temperature modeling of rotary hearth furnaces, temperature modeling of chemical reactors, and temperature tracking modeling of rolled steel plates.
[0069] 3. This invention helps to optimize production processes, improve product quality, increase energy efficiency, and reduce energy consumption.
Attached Figure Description
[0070] Figure 1 is a schematic diagram of the framework of the online 3D fuzzy modeling method in this invention;
[0071] Figure 2 is a schematic diagram of the rapid thermal chemical vapor deposition (RTCVD) system in this invention;
[0072] Figure 3 is a schematic diagram of the actual output of the DPS model in this invention;
[0073] Figure 4 is a schematic diagram of the prediction output of the DPS model in this invention;
[0074] Figure 5 is a schematic diagram showing the actual value and the model prediction value at sensor S5 in this invention;
[0075] Figure 6 is a schematic diagram showing the actual value and the model prediction value at sensor S7 in this invention;
[0076] In the attached figures, q represents argon gas mixed with 10% silane, T represents the wafer temperature, and r represents the wafer radius.
Detailed Implementation
[0077] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0078] This embodiment relates to an online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning. The method includes the following steps:
[0079] Step S1: Dataset Construction and Markov Decision Process Model. A dataset is constructed from the sensor data collected in the distributed parameter system. Based on the online modeling problem of the distributed parameter system, a Markov decision process model is built. This process includes feature extraction, definition of state and action spaces, design of reward functions, and model training and validation, aiming to effectively establish a framework for online system modeling using reinforcement learning methods.
[0080] Consider a nonlinear distributed parameter system with input $u(t) \in \mathbb{R}^m$ and spatiotemporal output $y(z,t) \in \mathbb{R}$, where $t$ is the time variable and $z$ is the spatial variable, defined on a spatial domain. Assume there are $P$ sensors located at spatial points $z_1, z_2, \ldots, z_P$; writing $Z$ for this set of sensor locations, the system output can be represented as $y(Z,t)$.
[0081] Furthermore, step S1 includes:
[0082] Given the input variables $\{u(t-1), u(t-2), \ldots, u(t-K)\}$ and $\{y(Z,t-1), y(Z,t-2), \ldots, y(Z,t-J)\}$ of a 3D fuzzy system, and setting the orders of the input variable $u(t)$ and the output variable $y(z,t)$ to $K$ and $J$ respectively, the 3D fuzzy rule of the 3D fuzzy system can be expressed as:
[0083]
[0084] where the symbols denote, in order, a 3D fuzzy set, the traditional fuzzy sets, and the spatial basis functions.
[0085] A spatiotemporal 3D fuzzy model is identified by establishing the relationship between the model input and output, where $L$ is the time length and the orders of the input variable $u(t)$ and the output variable $y(z,t)$ are set to $K$ and $J$, respectively. The dataset $D$ is shown below.
[0086]
[0087] The parameters are defined as follows:
[0088]
[0089] Let the state be $S_t = x_k$. From the properties of distributed parameter systems, the state $S_t$ depends only on its previous state $S_{t-1}$, and therefore satisfies the Markov property.
[0090] Step S2: Actor-Critic Reinforcement Learning Model Framework. A reinforcement learning model based on the Actor-Critic framework was established. The Actor-Critic model is an important framework in reinforcement learning, where the Actor is responsible for generating actions, while the Critic evaluates the value of the actions taken by the Actor. This model will be used to build an online model of a distributed parameter system. The Actor improves its action policy by learning reward signals from interactions with the environment, while the Critic is responsible for evaluating the quality of the policy. This process, through the Actor network selecting actions and the Critic network evaluating the value of actions, achieves the goal of online modeling and dynamic decision-making of the system through their collaboration.
[0091] Step S3: Update Model Parameters. In S3, the Critic function is updated by minimizing the loss function. This step primarily aims to evaluate the model's predictive performance and adjust it based on the error. Simultaneously, the policy function (Actor) is updated using stochastic gradient descent based on the model error to optimize the policy and improve model performance. Finally, the updated policy function is used to adjust the parameters of the target fuzzy system, applying the policy function as the model for the distributed parameter system, thereby achieving online modeling and optimization. This process aims to improve the accuracy and performance of the distributed parameter system model by optimizing the Critic and Actor networks.
[0092] The update of the Critic behavior value function is based on the semi-gradient method, using TD(0) as the update target. TD(0) means that the value estimate at the current time step $t$ is updated toward a target built from the reward and the estimated value function at the next time step $t+1$. TD, as a method of estimating the value function, is commonly used to update the Critic's value function in order to gradually improve its prediction accuracy.
[0093] The TD target is the reward $R_{t+1}$ plus the value function estimate $Q(S_{t+1}, A_{t+1})$ at the next time step, where $A_{t+1}$ is the action chosen by the current policy in state $S_{t+1}$. The Critic's estimate at the current time step is $Q(S_t, A_t)$. According to the semi-gradient method, the gradient is calculated using the TD error, and the parameters of the Critic value function are updated. The loss function measures the difference between the TD target and the prediction of the Critic function.
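The semi-gradient TD(0) update described above can be sketched as follows. For brevity the sketch replaces the 3D fuzzy Critic with a linear Critic $Q(s,a) = w \cdot \phi(s,a)$; the feature map `features`, the function name `critic_td0_update`, and the learning rate are illustrative assumptions. The target follows the text as written (reward plus next-step value estimate).

```python
import numpy as np

def features(s, a):
    """Hypothetical feature vector for a linear critic: s and a concatenated."""
    return np.concatenate([s, a])

def critic_td0_update(w, s, a, r_next, s_next, a_next, lr):
    """One semi-gradient TD(0) step for Q(s, a) = w . features(s, a).

    The TD target r_next + Q(s_next, a_next) is treated as a constant,
    so only the gradient of Q(s, a) with respect to w is used.
    """
    q = w @ features(s, a)
    td_target = r_next + w @ features(s_next, a_next)
    td_error = td_target - q
    w_new = w + lr * td_error * features(s, a)
    return w_new, td_error

w = np.zeros(3)
s, a = np.array([1.0, 0.0]), np.array([0.5])
s_next, a_next = np.array([0.0, 1.0]), np.array([0.5])
w, delta = critic_td0_update(w, s, a, r_next=-1.0,
                             s_next=s_next, a_next=a_next, lr=0.1)
```

The method is called "semi-gradient" because the target also depends on `w`, yet its gradient is deliberately ignored; this keeps the update a simple step toward the bootstrapped target.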
[0094] This invention provides an online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning. It combines the incremental learning approach of reinforcement learning with the unique advantages of 3D fuzzy systems in handling systems with spatiotemporal coupling characteristics, enabling high-precision modeling of the system from scratch. Reinforcement learning algorithms are highly adaptable, capable of learning and optimizing in constantly changing environments. Through continuous trial and error, they dynamically adjust the model based on data and feedback from different scenarios to adapt to various fuzzy environments. The online learning characteristic allows reinforcement learning-based 3D fuzzy modeling to respond to and adapt to new data inputs in real time.
[0095] As shown in Figure 1, based on the Actor-Critic framework, two 3D fuzzy systems are used to represent the Actor and Critic, respectively, where the Actor serves as an online model of the distributed parameter system. At time step $t$, the Actor takes the state $S_t$ as input and outputs the predicted value of the DPS at time step $t+1$. The Critic takes the state $S_t$ and the action $A_t$ as input and outputs the value $Q(s,a)$, and the environment formed by the DPS outputs the next state $S_{t+1}$ and the reward $R_t$. The action value function $Q(s,a)$ is updated using a temporal-difference target, and then the policy function is updated along the positive gradient direction of $Q(s,a)$ using the chain rule.
[0096] Based on the Actor-Critic framework and employing deterministic policy gradient theory, the Actor function is used as the final model to be built. The DPS modeling process is incorporated into a Markov decision process, and the Actor and Critic functions are iteratively updated alternately through the interaction between the agent and the environment. Online modeling is a continuous problem without a terminal state. To address this problem, an average reward is proposed, with maximizing the average reward as the agent's objective. Simulation results verify the effectiveness of the proposed method. The main steps are as follows:
[0097] Step 1: Construct the Markov Decision Process (MDP) quintuple (S, A, R, P, γ).
[0098] $S$ is the state space, $A$ is the action space, and $R_t$ denotes the reward obtained by the agent after taking action $A_t$ in state $S_t$.
[0099]
[0100]
[0101]
[0102] Step 2: Considering the spatiotemporal coupling characteristics of distributed parameter systems, both Actor and Critic are represented using 3D fuzzy systems. The structure of the Actor's policy function μ(s) is shown below.
[0103]
[0104] where the symbols denote, in order, a 3D fuzzy set, a traditional fuzzy set, and the spatial basis functions, and $K$ and $J$ are defined as the model orders.
[0105]
[0106] where $a_l, b_l, c_l, d_l$ are the coefficients of the Fourier spatial basis functions.
[0107] The structure of the behavior value function Q(s,a) in Critic is shown below:
[0108]
[0109] where the first two set symbols denote 3D fuzzy sets, the third denotes a traditional fuzzy set, and $Q_t$ and $Q_N$ are constants.
[0110] Step 3: When the state space and action space are infinite-dimensional, function approximation is typically used to represent the value function and policy function. This invention uses a 3D fuzzy system as a nonlinear function approximator, fixing the structure of the 3D fuzzy system and updating the parameters through gradient backpropagation. The update of the objective function and policy function parameters is shown below.
[0111]
[0112]
[0113] where $\theta^{\mu}$ is the parameter of the policy function $\mu$, $N$ is the sample size, and $\alpha$ is the learning rate.
[0114] $J(\pi_\theta)$ is the expected return or total return of the policy $\pi_\theta$; it measures the average return obtained by acting under $\pi_\theta$. In reinforcement learning, the goal of policy optimization is to maximize the expected return, i.e., to maximize $J(\pi_\theta)$. To achieve this goal, gradient-based optimization (gradient ascent on the return, or equivalently gradient descent on its negative) is typically used to update the policy parameters $\theta$. Specifically, according to the policy gradient theorem, the gradient of $J(\pi_\theta)$ with respect to the policy parameters $\theta$ is computed, and the parameters are updated along this gradient, so that the policy iterates towards higher expected returns. The purpose of updating the parameters is to find the optimal policy $\pi_\theta$ that maximizes the expected return.
[0115] The loss function measures the difference between the TD target and the prediction of the Critic function. During the update of the Critic function, the gradient of the loss function guides the direction of the parameter update. By adjusting its parameters, the Critic function can predict the value function more accurately. The update of the Critic behavior value function is shown below:
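The idea of moving the policy parameters along the gradient of the return through the chain rule can be illustrated with a small Python sketch. It substitutes a linear Actor $\mu(s) = W s$ for the 3D fuzzy system and uses a toy Critic $Q(s,a) = -\lVert a - a^{*}\rVert^{2}$; both choices, the function name `actor_gradient_step`, and the step size are assumptions made only for illustration.

```python
import numpy as np

def actor_gradient_step(W, states, dq_da, alpha):
    """Gradient-ascent step for a linear actor mu(s) = W @ s.

    states: (N, state_dim) sampled states
    dq_da:  (N, action_dim) critic gradients dQ/da evaluated at a = mu(s)
    Chain rule: dQ/dW = outer(dQ/da, s), averaged over the N samples.
    """
    N = states.shape[0]
    grad_W = dq_da.T @ states / N
    return W + alpha * grad_W

# Toy critic Q(s, a) = -||a - a_star||^2, so dQ/da = -2 (a - a_star).
a_star = np.array([1.0])
states = np.array([[1.0, 0.0], [1.0, 0.0]])
W = np.zeros((1, 2))
for _ in range(200):
    actions = states @ W.T
    dq_da = -2.0 * (actions - a_star)
    W = actor_gradient_step(W, states, dq_da, alpha=0.1)
```

Because the toy Critic peaks at `a_star`, repeated ascent steps drive the Actor's output for the sampled state toward that value, while the weight on the unused state component stays at zero.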
[0116]
[0117]
[0118] The parameter updates for the target Critic and Actor fuzzy systems are shown below.
[0119] $\theta^{Q'} = \tau\theta^{Q} + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} = \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$ (14)
[0120] Where τ is the weighting factor.
[0121] This embodiment also relates to the application of an online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning, and the specific embodiment is as follows:
[0122] The simulation case is a three-zone rapid thermal chemical vapor deposition (RTCVD) reactor, a typical distributed parameter system. RTCVD exhibits various characteristics of heat treatment systems, such as nonlinearity, time variation, and spatiotemporal coupling. Its structure is shown in Figure 2. The RTCVD reactor has three heating zones: lamp group 1, lamp group 2, and lamp group 3. A wafer with a radius of r = 7.6 cm is located at the center of the reactor. The gas q, argon mixed with 10% silane at 5 atmospheres, is injected into the reactor from the top. The three input variables are $u_a(t)$, $u_b(t)$, and $u_c(t)$.
[0123] Under heating conditions, silane undergoes a chemical reaction to produce silicon and hydrogen gas, depositing a thin polycrystalline silicon film on the wafer. To ensure a uniform thickness of this polycrystalline silicon film, the wafer temperature T needs to be controlled to be constant throughout. Since the wafer is rotating, we only need to consider radial temperature uniformity.
[0124] Because the internal pressure of the reactor is low during the reaction, the heat transfer effect between the wafer and the gas can be ignored. The heat released by the chemical and physical processes within the reactor has a very small impact compared to the heat transfer and radiation from the wafer, so this effect can also be ignored. Furthermore, during the reaction, the temperature difference between the upper and lower surfaces of the wafer is very small and the rotation is slow; therefore, the wafer temperature can be considered to change only radially.
[0125] To fully capture the dynamic information of the system, a perturbation signal with an amplitude not exceeding 10% is added to the system input signal. The input variables with the perturbation signal are given by the following formulas:
[0126] u_a(t) = 0.2028 + 0.1*0.2028*normrnd(0,1) (15)
[0127] u_b(t) = 0.1008 + 0.1*0.1008*normrnd(0,1) (16)
[0128] u_c(t) = 0.2245 + 0.1*0.2245*normrnd(0,1) (17)
[0129] Where 0.2028, 0.1008, and 0.2245 are the steady-state inputs when the internal temperature of the RTCVD is 1000K, and normrnd is a normally distributed random number function.
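A perturbed input sequence of this kind can be generated with NumPy as sketched below. The sketch assumes that each lamp group is perturbed around its own steady-state value, uses the 0.1 s sampling period and 500 s duration stated for this experiment, and replaces `normrnd(0,1)` with NumPy's standard normal generator; the random seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

dt = 0.1          # sampling period in seconds
duration = 500.0  # experiment length in seconds
n_steps = int(round(duration / dt))

# Steady-state inputs for the three lamp groups, as listed in the text.
u_steady = np.array([0.2028, 0.1008, 0.2245])

# u_i(t) = u_ss_i + 0.1 * u_ss_i * N(0, 1), independently per zone and step.
noise = rng.standard_normal((n_steps, 3))
u = u_steady + 0.1 * u_steady * noise
```

Each row of `u` holds the three lamp-group inputs for one sampling instant, giving 5000 rows for the whole experiment.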
[0130] Eleven measurement sensors were placed radially along the wafer. To simulate measurement noise, independent white noise with an amplitude of 0.2 and a mean of 0 was added to these 11 sets of measurement data. The sampling period was set to Δt = 0.1 s, and the experiment lasted a total of 500 s. Figure 3 shows the actual output of the distributed parameter system, and Figure 4 shows the model prediction output for the distributed parameter system.
[0131] Data from 330 s to 370 s were selected for comparison. The actual output and the model prediction output for the fifth sensor (s5) are compared in Figure 5, and those for the seventh sensor (s7) are compared in Figure 6. The comparison shows that the reinforcement learning based online modeling algorithm of this invention performs well on the RTCVD system.
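A comparison over a time window like the one above can also be summarized numerically. The Python sketch below computes a root-mean-square error for one sensor over a chosen interval; the text does not name an error metric, so the use of RMSE, the function name `window_rmse`, and the placeholder arrays are assumptions, and the numbers produced are not results from the patent.

```python
import numpy as np

def window_rmse(y_true, y_pred, dt, t_start, t_end, sensor):
    """RMSE between measured and predicted output of one sensor
    over the time window [t_start, t_end).

    y_true, y_pred: arrays of shape (n_steps, n_sensors); sensor is 1-based.
    """
    i0, i1 = int(round(t_start / dt)), int(round(t_end / dt))
    err = y_true[i0:i1, sensor - 1] - y_pred[i0:i1, sensor - 1]
    return float(np.sqrt(np.mean(err ** 2)))

# Placeholder arrays standing in for measured and predicted temperatures.
y_true = np.full((5000, 11), 1000.0)
y_pred = y_true + 0.5
rmse_s5 = window_rmse(y_true, y_pred, dt=0.1,
                      t_start=330.0, t_end=370.0, sensor=5)
```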
[0132] In addition, for rotary hearth furnace temperature modeling, the 3D fuzzy system can continuously learn and improve during model operation, thereby more accurately predicting the temperature distribution of the rotary hearth furnace at different locations, which helps to optimize production processes, improve energy efficiency and reduce energy consumption.
[0133] Furthermore, temperature tracking modeling for rolled steel sheets in the steel industry can also benefit from this technology. Through continuous learning and improvement, the 3D fuzzy system can adjust the model based on real-time production data to more accurately track the temperature distribution of the steel sheet throughout the rolling process, helping to ensure product quality and allowing for timely adjustments during production to cope with changing conditions.
[0134] Overall, this 3D fuzzy system technology, which can continuously learn and improve during model operation, has broad applicability to various fields that require real-time, accurate temperature modeling. It can help optimize production processes, improve efficiency, and best reflect the actual conditions of the current environment.
[0135] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for online 3D fuzzy modeling of a distributed parameter system based on reinforcement learning, characterized in that, The method includes the following steps: Step S1: First, based on the sensor data collected in the distributed parameter system, perform data preprocessing and construct a dataset; second, based on the online 3D fuzzy modeling requirements of the distributed parameter system, determine the system state, actions, and reward function; finally, construct a Markov decision process model. Step S2: Establish an online 3D fuzzy model of the distributed parameter system based on the Actor-Critic reinforcement learning model framework, including determining the input and output, constructing the Actor network and Critic network, selecting reinforcement learning algorithms and parameters, collecting sample data and performing training and optimization, and applying the trained 3D fuzzy model to the actual system and verifying its performance. Step S3, optimizing the online 3D fuzzy model of the distributed parameter system, specifically involves: using the stochastic gradient method to update the Actor policy function parameters based on the 3D fuzzy model error, updating the Critic behavior value function parameters by minimizing the loss function, using the updated Actor policy function as the 3D fuzzy model of the distributed parameter system, and finally updating the parameters of the target fuzzy system. 
The 3D fuzzy system is used as a nonlinear function approximator; its structure is fixed, and its parameters are updated through gradient backpropagation. The update formulas for the Actor policy function parameters and for the Critic behavior value function parameters involve the following quantities: the expected return (total return) under the parameterized policy distribution; the parameters of the policy function; the sample size; the target value at each time step; the instant reward at each time step; the average reward in experience replay; the target Actor policy; the target Critic behavior value function network; the Critic behavior value function; the learning rate; and the expectation E, taken over states s that follow the state distribution induced by the parameterized policy.
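For orientation only, the quantities listed in claim 1 match the usual deterministic Actor-Critic updates with target systems. The sketch below uses assumed notation that is not confirmed by this text: μ(s|θ) for the Actor, Q(s, a|w) for the Critic, primes for the target systems, γ for the attenuation factor, α for the learning rate, and N for the sample size. It does not show how the average reward in experience replay enters the claimed formulas, because that cannot be recovered here.

```latex
% Assumed notation; a generic deterministic Actor-Critic form, not the claimed formulas.
\begin{aligned}
y_i &= r_i + \gamma\, Q'\big(s_{i+1},\, \mu'(s_{i+1}\mid\theta') \,\big|\, w'\big) \\
L(w) &= \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s_i, a_i \mid w)\big)^2 \\
\nabla_\theta J &\approx \frac{1}{N}\sum_{i=1}^{N}
  \nabla_a Q(s, a \mid w)\big|_{s=s_i,\ a=\mu(s_i\mid\theta)}\;
  \nabla_\theta \mu(s \mid \theta)\big|_{s=s_i} \\
\theta &\leftarrow \theta + \alpha\, \nabla_\theta J
\end{aligned}
```

In this form the Critic is fitted by minimizing L(w), and the Actor parameters move along the positive gradient of the expected return obtained through the chain rule.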
2. The online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning according to claim 1, characterized in that, in step S1, constructing the dataset specifically involves the following. The nonlinear distributed parameter system has a control input and a spatio-temporal output y(z, t), where t is the time variable and z is the spatial variable within the spatial domain. P sensors are located at spatial points in the spatial domain, and the system output is measured at those points. The input variables of the 3D fuzzy system are then determined. Each 3D fuzzy rule of the 3D fuzzy system combines 3D fuzzy sets, traditional fuzzy sets, and spatial basis functions, where K is the order of the input variables, J is the order of the output variables, m is the order of the traditional fuzzy sets, and l denotes the l-th rule. A spatio-temporal 3D fuzzy model is identified by establishing the relationship between the input and the output from a data set D, which is characterized by the time length, the number of sensors P, the order K of the input variable, and the order J of the output variable y(z, t). The state is then defined; by the nature of distributed parameter systems, the state depends only on its previous state and therefore satisfies the Markov property.
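The dataset construction in claim 2 can be illustrated with a sliding-window sketch. This is a minimal example under stated assumptions, not the claimed definition of D: the function name `build_dataset`, the array layouts, and the choice of a single scalar control input are hypothetical, and the pairing of K past inputs and J past outputs with the current sensor readings is inferred from the orders K and J named in the claim.

```python
import numpy as np

def build_dataset(u, y, K, J):
    """Build (past inputs, past outputs, current output) samples.

    u: shape (T,), control input per time step (assumed scalar input)
    y: shape (T, P), output measured by P sensors per time step
    K: number of past inputs per sample; J: number of past outputs per sample
    """
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(K, J)                 # first time step with a full history
    past_u, past_y, targets = [], [], []
    for t in range(start, len(u)):
        past_u.append(u[t - K:t])     # u(t-K), ..., u(t-1)
        past_y.append(y[t - J:t])     # y(z, t-J), ..., y(z, t-1), shape (J, P)
        targets.append(y[t])          # y(z, t) at the P sensor locations
    return np.array(past_u), np.array(past_y), np.array(targets)

# Usage: 6 time steps, 2 sensors, K = 2 past inputs, J = 3 past outputs.
u_demo = np.arange(6.0)
y_demo = np.arange(12.0).reshape(6, 2)
X_u, X_y, Y = build_dataset(u_demo, y_demo, K=2, J=3)
```

Each sample is built only from values observed before time t, so the same routine can be called on a growing buffer as new sensor data arrives.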
3. The online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning according to claim 1, characterized in that, in step S1, constructing the Markov decision process model specifically means constructing the Markov decision process (MDP) five-tuple (S, A, P, R, γ), where S is the state space; A is the action space; R is the reward obtained by the agent after taking an action in a given state; P is the state transition probability matrix; and γ is the attenuation factor. The state vector at time t consists of spatial state components, whose subscript z corresponds to the spatial domain, and control input state components, whose subscript corresponds to the control input variables. The definitions also refer to the predicted output and the actual output of the distributed parameter system at time t over the spatial domain.
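As a rough illustration of the MDP elements in claim 3, the sketch below assembles a state vector from spatial components and control input components and scores a prediction against the actual output. The function names and the reward form (negative mean squared error over the sensor locations) are assumptions made for this example; the claim does not state the reward formula in this text.

```python
import numpy as np

def make_state(y_hist, u_hist):
    """Concatenate spatial state components and control input components."""
    return np.concatenate([np.ravel(y_hist), np.ravel(u_hist)])

def reward(y_pred, y_true):
    """Assumed reward: negative mean squared error between predicted and actual output."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return -float(np.mean((y_pred - y_true) ** 2))

# Usage: 2 past output snapshots at 3 sensors, plus 2 past control inputs.
s_demo = make_state(np.zeros((2, 3)), np.ones(2))
r_perfect = reward([1.0, 2.0], [1.0, 2.0])
r_off = reward([0.0, 0.0], [1.0, 1.0])
```

With this choice the reward is largest (zero) when the model output matches the measured output, so maximizing return corresponds to reducing the modeling error.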
4. The online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning according to claim 1, characterized in that the Actor network selects actions and the Critic network evaluates the value of those actions; the two work together to achieve online modeling and dynamic decision-making for the system.
5. The online 3D fuzzy modeling method of a distributed parameter system based on reinforcement learning according to claim 1, characterized in that establishing the online 3D fuzzy model of the distributed parameter system based on the Actor-Critic reinforcement learning framework is specifically as follows. The Actor and the Critic are each represented by a 3D fuzzy system, and the Actor serves as the online model of the distributed parameter system (DPS). At each time step, the Actor takes the state as input and outputs the predicted value of the DPS at that time step; the Critic takes the state and the action as input and outputs the value of the Critic behavior value function; the environment, composed of the DPS, outputs the next state and the reward. The Critic behavior value function is updated using the temporal-difference target, and the Actor policy function is updated along the positive gradient direction using the chain rule. The structure of the Actor policy function is defined in terms of: the 3D fuzzy set of the l-th rule for the i-th historical moment; the traditional fuzzy set of the l-th rule for the j-th control input variable at the k-th moment; the spatial basis function of the l-th rule; the order of the traditional fuzzy sets; the order of the 3D fuzzy sets; the coefficients of the Fourier spatial basis functions; the outputs of the distributed parameter system over the spatial domain at the previous J moments; the l-th fuzzy rule; the value of the j-th control input variable at time t; and m, the number of control input variables. The structure of the Critic behavior value function is defined in terms of: the 3D fuzzy sets of the current predicted output; a traditional fuzzy set; constants; the fuzzy rules, where N is the total number of fuzzy rules; and the true values of the m-th control input over the previous K time steps.
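To make the ingredients named in claim 5 concrete, here is a minimal forward pass of a 3D fuzzy system used as an Actor. It is a sketch under several assumptions that this text does not confirm: Gaussian membership functions, a single past output snapshot and a single scalar control input as inputs, averaging of the 3D membership over the sensor locations, a cosine (Fourier) basis on z in [0, 1], and a normalized weighted sum over rules. All function and parameter names are hypothetical.

```python
import numpy as np

def gaussian(x, center, width):
    """Gaussian membership grade (assumed membership shape)."""
    return np.exp(-((x - center) / width) ** 2)

def fourier_basis(z, n_terms):
    """Rows 1, cos(pi z), cos(2 pi z), ... evaluated at positions z in [0, 1]."""
    z = np.asarray(z, dtype=float)
    return np.stack([np.cos(k * np.pi * z) for k in range(n_terms)])  # (n_terms, P)

def fuzzy3d_forward(y_prev, u_prev, z, c3d, s3d, cu, su, coeff):
    """Predict the output at the sensor positions z.

    y_prev: (P,) previous output at the P sensors (spatial input)
    u_prev: scalar previous control input (traditional input)
    c3d, s3d: (L, P) centers and widths of the 3D fuzzy sets, one row per rule
    cu, su: (L,) centers and widths of the traditional fuzzy sets
    coeff: (L, n_terms) Fourier coefficients of each rule's spatial basis function
    """
    mu3d = gaussian(y_prev[None, :], c3d, s3d).mean(axis=1)  # (L,) averaged over sensors
    mu_u = gaussian(u_prev, cu, su)                          # (L,) traditional memberships
    w = mu3d * mu_u                                          # rule firing strengths
    phi = coeff @ fourier_basis(z, coeff.shape[1])           # (L, P) spatial basis functions
    return (w[:, None] * phi).sum(axis=0) / (w.sum() + 1e-12)

# Usage: one rule whose fuzzy sets are centered exactly on the inputs.
z_demo = np.array([0.0, 0.5, 1.0])
y_prev_demo = np.array([1.0, 2.0, 3.0])
out_demo = fuzzy3d_forward(
    y_prev_demo, 0.5, z_demo,
    c3d=y_prev_demo[None, :], s3d=np.ones((1, 3)),
    cu=np.array([0.5]), su=np.array([1.0]),
    coeff=np.array([[2.0, 0.0]]),
)
```

Because the structure (number of rules and basis terms) is fixed, only the centers, widths, and Fourier coefficients would be adjusted during training, which fits the claim's statement that parameters are updated by gradient backpropagation.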
6. The online 3D fuzzy modeling method of a distributed parameter system based on reinforcement learning according to claim 1, characterized in that the parameters of the target fuzzy system include the target Critic parameters and the target Actor parameters, whose update formulas involve the following quantities: a weighting factor; the Critic behavior value function parameters; the target Critic behavior value function parameters; the Actor policy function parameters; and the target Actor policy function parameters.
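A common way to update target parameters with a weighting factor in Actor-Critic methods is a soft update, in which each target parameter moves a small fraction toward its online counterpart. The sketch below assumes that this is the role of the weighting factor in claim 6; the name `tau` and the function `soft_update` are hypothetical.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter array."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [
        tau * np.asarray(p, dtype=float) + (1.0 - tau) * np.asarray(tp, dtype=float)
        for tp, p in zip(target_params, online_params)
    ]

# Usage: move the target Critic parameters 10% of the way toward the online ones.
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
```

A small weighting factor makes the target systems change slowly, which tends to stabilize the temporal-difference target used to train the Critic.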
7. The application of the online 3D fuzzy modeling method for distributed parameter systems based on reinforcement learning according to claim 1, characterized in that the method is applied to a distributed parameter system with spatiotemporal coupling characteristics, wherein the distributed parameter system is a rotary hearth furnace temperature model, a chemical reactor temperature model, or a steel plate temperature tracking model, so as to simulate and predict the temperature distribution at different locations of the distributed parameter system and to make real-time production adjustments based on the predictions.