A furnace temperature setting method for a walking beam furnace based on deep reinforcement learning

By using a deep reinforcement learning-based method, the furnace temperature setpoint of the walking beam furnace is adjusted in real time, solving the problems of staticity and lag, optimizing heating quality and energy consumption, adapting to dynamic operating conditions, and reducing maintenance costs.

CN122239482A · Pending · Publication Date: 2026-06-19 · ANSTEEL ENG TECH CORP


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ANSTEEL ENG TECH CORP
Filing Date: 2026-04-13
Publication Date: 2026-06-19

Application Information

Patent Timeline

13 Apr 2026: Application
19 Jun 2026: Publication (CN122239482A)

IPC: G05B13/04; F27D19/00

AI Tagging

Application Domain

Control devices for furnaces; Adaptive control


Similar Technology Patents

Intelligent detection trigger and multi-feature visual recognition method and device for high-temperature smelting furnace
CN122258650A · Character and pattern recognition; Control devices for furnaces; Anomaly detection; Visual recognition
A high-temperature box-type resistance furnace convenient to clean
CN224365314U · Charge supports; Furnace types
A pneumatic thermal zoning intelligent control system and method for industrial kilns
CN122192006A · Control devices for furnaces
Wafer heating furnace and heating method
CN122216984A · Charge manipulation; Furnace types
Electric bag composite dust removal device for steel plant flue gas treatment
CN122230885A · Dispersed particle filtration; External electric electrostatic separator; Temperature control; Control system


Smart Images

Figure CN122239482A_ABST

Patent Text Reader

Abstract

This invention belongs to the field of furnace temperature control for heating furnaces, and specifically concerns a furnace temperature setting method for a walking beam furnace based on deep reinforcement learning. The method comprises modeling the furnace temperature setting problem, designing a reinforcement learning framework, training a policy network with the A3C algorithm, and deploying the policy network for real-time control. The learned control strategy adjusts the furnace temperature dynamically according to operating conditions, so that the control strategy always matches the current operating conditions, giving more precise control and faster response. Furthermore, by linking control decisions with production plan data, the method optimizes over the entire production sequence rather than a single operating point. It ensures product quality while minimizing energy consumption, improving product yield and reducing fuel consumption, which brings economic and environmental benefits to metallurgical enterprises.


Description

Technical Field

[0001] This invention relates to the field of furnace temperature control, specifically to a furnace temperature setting method for a walking beam furnace based on deep reinforcement learning.

Background Technology

[0002] Walking beam furnaces are key equipment in hot rolling production lines. Their task is to heat the steel billets or slabs to be rolled from room temperature to the target temperature suitable for rolling in a uniform and rapid manner. The furnace is usually divided into multiple control zones along its length, such as a preheating zone, a heating zone, and a soaking zone. The steel billet moves forward intermittently and step by step in the furnace through the walking beam mechanism at the bottom of the furnace, passing through each heating zone in sequence, and finally reaching the requirements for exiting the furnace. The furnace temperature control directly affects the heating quality and energy consumption of the steel billet.

[0003] Currently, the methods for optimizing and controlling the furnace temperature of walking beam furnaces have the following significant shortcomings: First, static and lagging nature: Traditional algorithms calculate a static optimal solution for specific operating conditions, which cannot adapt to the dynamic changes in operating conditions such as billet specifications and rolling rhythm in actual production. Second, model dependence: Establishing an accurate mechanistic model requires a large amount of prior knowledge and complex parameter identification, and the model's description of the complex radiation, convection heat exchange and billet heat conduction process in the furnace often has deviations, resulting in unsatisfactory control effects. Finally, rigid rules: Expert experience rules rely on a static, manually set rule base, which cannot cover all possible combinations of operating conditions, and the maintenance and updating of the rule base is costly.

[0004] To address the aforementioned technical shortcomings, the following solution is proposed.

Summary of the Invention

[0005] The purpose of this invention is to provide a furnace temperature setting method for a walking beam furnace based on deep reinforcement learning, so as to overcome the technical shortcomings described in the background art.

[0006] To achieve the above objective, the present invention provides the following technical solution: a furnace temperature setting method for a walking beam furnace based on deep reinforcement learning, comprising the following steps:
Step 1: Model the furnace temperature setting problem of the walking beam furnace, and establish a multi-objective optimization function and constraints. The multi-objective optimization function considers both energy consumption minimization and yield maximization. The constraints include furnace temperature process requirements, heating uniformity conditions, and equipment safety operation constraints.
Step 2: Design a reinforcement learning framework for optimizing furnace temperature control. Use a Markov decision process to describe the furnace temperature setting problem, determine the state space and action space of the furnace energy consumption optimization model, and determine the reward function and state-action value function from the state space, action space and constraints, thereby obtaining the optimal policy of the furnace temperature control optimization model.
Step 3: Based on historical furnace temperature data, billet condition data, and environmental data, train the policy network using the A3C algorithm.
Step 4: Deploy the trained policy network in the heating furnace control system and adjust the furnace temperature setpoint in real time.

[0007] Furthermore, Step 1 includes the following:

Step 1.1: Establish the system's objective function. To reflect actual industrial needs more comprehensively, the system objective function considers both energy consumption and heating quality deviation. It sets minimizing energy consumption ($F_1$) and minimizing heating quality deviation ($F_2$, equivalent to maximizing yield) as parallel objectives, in line with actual process requirements. The objective function is defined as:

$$\min F = F_1 + F_2$$

where the energy consumption target $F_1$ is the weighted sum of the average set temperature of the heating zones and the rate of change of the set temperature, accumulated over each heating period:

$$F_1 = \sum_{t=1}^{M}\left(w_1\,\bar{T}_{\mathrm{set}}(t) + w_2\,\lvert\dot{T}_{\mathrm{set}}(t)\rvert\right)\Delta t$$

Here $\bar{T}_{\mathrm{set}}(t)$ is the average set temperature of the heating zones, $\dot{T}_{\mathrm{set}}(t)$ is the rate of change of the set temperature, $M$ is the number of heating periods, $\Delta t$ is the control interval, and $w_1$ and $w_2$ are weighting coefficients.

The heating quality deviation target $F_2$ is the weighted sum, over each heating period, of the deviation between the billet surface temperature and the target temperature and the portion of the deviation between the billet surface temperature and the core temperature that exceeds the maximum allowable cross-sectional temperature difference. It quantifies temperature tracking error and uniformity together:

$$F_2 = \sum_{t=1}^{M}\left(w_3\,\lvert T_s(t) - T^{*}\rvert + w_4\,\max\!\left(0,\ \lvert T_s(t) - T_c(t)\rvert - \Delta T_{\max}\right)\right)$$

Here $T_s(t)$ and $T_c(t)$ are the surface and core temperatures of the steel billet at time $t$, $T^{*}$ is the target temperature, $w_3$ and $w_4$ are weighting coefficients reflecting the relative importance of temperature tracking and uniformity, and $\Delta T_{\max}$ is the maximum permissible cross-sectional temperature difference.

[0008] Step 1.2: Establish constraints for furnace temperature control. The optimization problem is solved subject to heating process quality constraints and heating furnace equipment safety constraints:

$$\lvert T_s(t) - T^{*}\rvert \le \varepsilon,\qquad \lvert T_s(t) - T_c(t)\rvert \le \Delta T_{\max},\qquad T_j^{\min} \le T_j \le T_j^{\max}$$

where $T^{*}$ is the target process temperature of the steel billet (°C), $\varepsilon$ is the surface temperature tolerance, $\Delta T_{\max}$ is the upper limit of the cross-sectional temperature difference, and $T_j$ is the furnace temperature value in zone $j$ (°C). The furnace temperature setpoint of each section must stay within a safe range $[T_j^{\min}, T_j^{\max}]$ to prevent equipment damage or production accidents, with limits defined separately for each furnace zone.

Furthermore, Step 2 includes the following specific steps:

Step 2.1: Determine the state space. In the temperature control system of a walking beam furnace, the observed variables obtained by the agent from the environment include the target temperature of the steel billet $T^{*}$, the actual temperature of each heating zone (preheating section temperature $T_{\mathrm{pre}}$, heating section temperature $T_{\mathrm{heat}}$, soaking section temperature $T_{\mathrm{soak}}$), the current position of the steel billet $x$, the predicted surface temperature of the steel billet $\hat{T}_s$, the billet core temperature $T_c$, the billet moving speed $v$, the timestamp $t$, and so on. The state space $S$ is represented as:

$$S = \{T^{*},\ T_{\mathrm{pre}},\ T_{\mathrm{heat}},\ T_{\mathrm{soak}},\ x,\ \hat{T}_s,\ T_c,\ v,\ t\}$$

Step 2.2: Determine the action space. In this system, the action space $A$ of the agent is the set of adjustments to the set temperature of each heating zone, i.e.:

$$A = \{\Delta T_{\mathrm{pre}},\ \Delta T_{\mathrm{heat}},\ \Delta T_{\mathrm{soak}}\}$$

where $\Delta T_i$ is the change in the set temperature of the $i$-th heating zone, and its value range is limited by process safety constraints.

[0009] Step 2.3: Set the reward function. The reward function $r(s_t, a_t)$ is the immediate return that the environment feeds back to the agent when the agent takes action $a_t$ in state $s_t$. To balance minimizing energy consumption, optimizing heating quality, and ensuring safety, the reward is set as a weighted combination of an energy consumption term, a surface temperature tracking term, and a cross-sectional temperature difference term, each with its own weighting coefficient, where the last term depends on the maximum allowable cross-sectional temperature difference $\Delta T_{\max}$. The reward function encourages the agent to ensure temperature tracking accuracy and heating uniformity while reducing energy consumption.

[0010] Step 2.4: Set the state-action value function. $Q^{\pi}(s, a)$ measures how good policy $\pi$ is, that is, the expected cumulative discounted return of the reward function under policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k}\ \middle|\ s_t = s,\ a_t = a\right]$$

where the agent's policy $\pi$ is the mapping from state $s$ to action $a$, and $\gamma \in [0, 1]$ is the discount factor. The optimal policy $\pi^{*}$ is the policy that maximizes the state-action value function: $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$.

[0011] Furthermore, in Step 3, training the policy network using the A3C algorithm includes: initializing the global actor network parameters and critic network parameters; setting the learning rates, discount factor, and number of worker threads; creating multiple worker threads, each of which independently copies the global parameters, interacts with the heating furnace simulation environment, and collects experience data; calculating the loss gradient based on the advantage function, including the policy loss, value loss, and entropy regularization loss; asynchronously updating the global network parameters until the convergence condition is met; and saving the trained actor network parameters for online inference. The specific process is as follows:
Step 3.1: Randomly initialize the global actor network parameters $\theta$ and the global critic network parameters $\omega$. Set the actor network learning rate $\alpha$, the critic network learning rate $\beta$, the discount factor $\gamma$, the number of worker threads $N$, and the global step counter.
Step 3.2: Create $N$ worker threads, each of which independently executes the following steps:
Step 3.2.1: Copy the global network parameters locally to initialize the local actor network parameters and local critic network parameters, i.e. $\theta' \leftarrow \theta$ and $\omega' \leftarrow \omega$.
Step 3.2.2: Initialize the thread-local experience buffer, reset the walking beam furnace simulation environment, and obtain the initial state $s_0$.
Step 3.2.3: For each time step $t$, perform the following operations until the end of the round:
Step 3.2.3.1: Based on the current state $s_t$, generate the action probability distribution through the local actor network: $\pi(a \mid s_t; \theta') = f_{\theta'}(s_t)$, where $f_{\theta'}$ is the forward computation function of the local actor network.
Step 3.2.3.2: Sample a basic action $\tilde{a}_t$ from the probability distribution and add exploration noise to enhance exploration: $a_t = \tilde{a}_t + \epsilon_t$, $\epsilon_t \sim \mathcal{N}(0, \sigma_t^{2})$, where the noise standard deviation $\sigma_t$ decays over time.

[0012] Step 3.2.3.3: Perform a safety check before executing the action to ensure that the furnace temperature setpoint adjustment is within the allowable range.
Step 3.2.3.4: Execute action $a_t$ in the heating furnace simulation environment and observe the reward $r_t$ and the next state $s_{t+1}$.
Step 3.2.3.5: Store the experience tuple $(s_t, a_t, r_t, s_{t+1})$ in the local experience buffer.
Step 3.2.4: When the local experience buffer reaches the batch size $B$, or at the end of the round, calculate the estimated value of the advantage function:

$$A_t = \sum_{k=0}^{n-1}\gamma^{k}\, r_{t+k} + \gamma^{n}\, V(s_{t+n}; \omega') - V(s_t; \omega')$$

where $V(\cdot\,; \omega')$ is the state value output by the local critic network and $n$ is the number of steps from step $t$ to the last step in the buffer.

[0013] Step 3.2.5: Calculate the local network gradients: the critic network gradient $d\omega$, taken with respect to the local critic parameters $\omega'$, and the actor network gradient $d\theta$, taken with respect to the local actor parameters $\theta'$.
Step 3.2.6: Asynchronously apply the local gradients to the global network parameters $\theta$ and $\omega$, using the learning rates $\alpha$ and $\beta$ respectively.
Step 3.2.7: Clear the local experience buffer and update the global step counter.
Step 3.3: Repeat Step 3.2 until one of the following convergence conditions is met: the average cumulative reward remains stable over several consecutive rounds; the change in network parameters is less than the threshold; or the preset maximum number of training rounds is reached.
Step 3.4: Save the trained global actor network parameters $\theta^{*}$ for subsequent online inference.

[0014] Compared with the prior art, the beneficial effects of the present invention are: In this invention, a policy network based on deep reinforcement learning can be used to perceive the dynamic changes in working conditions such as billet specifications and rolling rhythm in real time, and adaptively adjust the furnace temperature setpoint so that the control strategy always matches the current working conditions. This solves the static and lag problems in the prior art. Furthermore, by learning the optimal strategy through interaction between the agent and the environment, the dependence of traditional control methods on precise mechanism models is avoided, the requirements for prior knowledge and parameter identification are reduced, and the practicality of the method is improved.

[0015] In this invention, by taking energy consumption minimization and yield maximization as parallel optimization objectives, the reward function guides the policy network toward global optimization rather than local optima. While ensuring the heating quality of steel billets, it effectively reduces the fuel consumption of the heating furnace and eliminates the need for manual maintenance of a complex rule base. The policy network can continuously learn to adapt to new operating conditions, thereby reducing maintenance costs in the production process.

Description of the Attached Figures

[0016] To facilitate understanding by those skilled in the art, the present invention is further described below with reference to the accompanying drawings. Figure 1 is a flowchart of the overall method of the present invention. Figure 2 is a basic framework diagram of deep reinforcement learning decision-making in the present invention.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] As shown in Figure 1, this invention proposes a furnace temperature setting method for a walking beam furnace based on deep reinforcement learning. The method applies deep reinforcement learning to furnace temperature control: an energy consumption and heating quality optimization model is established, a policy network based on the A3C algorithm is trained using historical furnace temperature and billet status data, and the trained network is deployed in the furnace control system to acquire the furnace status in real time and dynamically adjust the temperature setpoint of each heating zone, achieving precise temperature control and energy saving. The specific implementation process is as follows:

Step 1: Establish an energy consumption and quality optimization model for the walking beam furnace. The objective function is set to minimize energy consumption and maximize yield across all heating time periods. The constraints are that the furnace temperature meets the heating process quality requirements and that the heating furnace equipment operates safely. Specifically, this step includes the following two sub-steps:

1.1: Establish the objective function of the system, namely, to minimize the total energy consumption of the heating furnace and the heating quality deviation within the operating time:

$$\min F = F_1 + F_2$$

where $F_1$ is the total energy consumption target:

$$F_1 = E = \sum_{t=1}^{M}\left(w_1\,\bar{T}_{\mathrm{set}}(t) + w_2\,\lvert\dot{T}_{\mathrm{set}}(t)\rvert\right)\Delta t$$

and $F_2$ is the heating quality deviation target:

$$F_2 = \sum_{t=1}^{M}\left(w_3\,\lvert T_s(t) - T^{*}\rvert + w_4\,\max\!\left(0,\ \lvert T_s(t) - T_c(t)\rvert - \Delta T_{\max}\right)\right)$$

In these formulas, $E$ is the total energy consumption of the heating furnace, $M$ is the total number of time steps in the heating process, $\Delta t$ is the time interval between two consecutive control operations, $\bar{T}_{\mathrm{set}}(t)$ is the average set temperature of the heating zones, and $\dot{T}_{\mathrm{set}}(t)$ is the rate of change of the set temperature.

[0019] $T_s(t)$ and $T_c(t)$ are the surface temperature and core temperature of the steel billet at time $t$, respectively; $T^{*}$ is the target process temperature; $w_1$, $w_2$, $w_3$, $w_4$ are weighting coefficients, calibrated experimentally; and $\Delta T_{\max}$ is the maximum permissible cross-sectional temperature difference.
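The two objectives in 1.1 can be evaluated numerically from logged trajectories. The following is a minimal sketch that follows the verbal definitions (weighted sum of average setpoint and setpoint rate of change for energy; tracking error plus the excess over the permitted cross-sectional difference for quality). The function names, the use of absolute values, and all numeric weights are illustrative assumptions, not values from the patent.

```python
from typing import Sequence


def energy_objective(avg_set_temp: Sequence[float], set_temp_rate: Sequence[float],
                     dt: float, w1: float, w2: float) -> float:
    """F1: weighted average setpoint plus weighted setpoint rate, accumulated over time."""
    return sum((w1 * a + w2 * abs(r)) * dt for a, r in zip(avg_set_temp, set_temp_rate))


def quality_objective(t_surface: Sequence[float], t_core: Sequence[float],
                      t_target: float, dt_max: float, w3: float, w4: float) -> float:
    """F2: tracking error plus the part of the surface-core gap above dt_max."""
    total = 0.0
    for ts, tc in zip(t_surface, t_core):
        tracking = abs(ts - t_target)
        excess = max(0.0, abs(ts - tc) - dt_max)
        total += w3 * tracking + w4 * excess
    return total


# Hypothetical two-step trajectory (temperatures in degrees Celsius).
f1 = energy_objective([1000.0, 1100.0], [0.0, 10.0], dt=1.0, w1=0.001, w2=0.01)
f2 = quality_objective([1180.0, 1200.0], [1120.0, 1190.0],
                       t_target=1200.0, dt_max=30.0, w3=1.0, w4=2.0)
print(f1, f2)
```

In the second step of this example the billet is on target and within the permitted cross-sectional difference, so only the first step contributes to the quality deviation.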

[0020] 1.2: Establish constraints for furnace temperature control, namely, the furnace temperature must meet the heating process quality requirements and equipment safety operation constraints. Heating process quality is a key indicator of the heating effect on steel billets and is characterized by temperature tracking accuracy and heating uniformity. Heating process quality constraints include the surface temperature tracking constraint: $\lvert T_s(t) - T^{*}\rvert \le \varepsilon$, where $T_s(t)$ is the surface temperature of the steel billet, $T^{*}$ is the target process temperature, and $\varepsilon$ is the maximum permissible tracking error.

[0021] Cross-sectional temperature difference constraint: $\lvert T_s(t) - T_c(t)\rvert \le \Delta T_{\max}$, where $T_c(t)$ is the core temperature of the steel billet and $\Delta T_{\max}$ is the maximum permissible cross-sectional temperature difference.

[0022] Equipment safety operation constraints: the set temperature of each heating zone of the furnace must be within safe limits: $T_j^{\min} \le T_j \le T_j^{\max}$. Specific temperature limits are defined for each zone: the preheating section, the first heating section, the second heating section, and the soaking section.
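The three constraint families in 1.2 can be checked together. The sketch below is illustrative only: the zone names mirror the four sections listed above, but the numeric limits in `ZONE_LIMITS` and the function name `check_constraints` are hypothetical placeholders, since the actual limits are not available in this text.

```python
# Hypothetical safe setpoint ranges in degrees Celsius (placeholders, not patent values).
ZONE_LIMITS = {
    "preheating": (700.0, 1000.0),
    "heating_1": (1000.0, 1250.0),
    "heating_2": (1100.0, 1300.0),
    "soaking": (1150.0, 1280.0),
}


def check_constraints(t_surface, t_core, t_target, eps, dt_max, setpoints,
                      limits=ZONE_LIMITS):
    """Return which constraint families hold for one billet and one set of zone setpoints."""
    return {
        # Surface temperature tracking: |Ts - T*| <= eps
        "tracking": abs(t_surface - t_target) <= eps,
        # Cross-sectional temperature difference: |Ts - Tc| <= dt_max
        "uniformity": abs(t_surface - t_core) <= dt_max,
        # Equipment safety: every zone setpoint inside its permitted range
        "safety": all(limits[z][0] <= sp <= limits[z][1] for z, sp in setpoints.items()),
    }


result = check_constraints(
    t_surface=1195.0, t_core=1175.0, t_target=1200.0, eps=10.0, dt_max=30.0,
    setpoints={"preheating": 900.0, "heating_1": 1200.0, "heating_2": 1250.0, "soaking": 1290.0},
)
print(result)
```

Here the soaking setpoint exceeds its placeholder upper limit, so only the safety check fails.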

[0023] The above constraints together ensure that the heating process meets the process quality requirements, guarantee the safe operation of the equipment, and provide a feasible search space for the optimization algorithm.

[0024] Step 2: Design a reinforcement learning framework for furnace temperature control of the walking beam furnace. The furnace temperature setting problem is described using a Markov decision process, which determines the model's state space, action space, reward function, and state-action value function. Specifically, this step involves:

2.1: Determine the state space. In the control system of a walking beam furnace, the observed variables acquired by the agent from the environment include the target temperature of the steel billet $T^{*}$, the temperature of each heating zone (preheating section temperature $T_{\mathrm{pre}}$, heating section temperature $T_{\mathrm{heat}}$, soaking section temperature $T_{\mathrm{soak}}$), the current position of the steel billet $x$, the predicted surface temperature of the steel billet $\hat{T}_s$, the billet core temperature $T_c$, the billet moving speed $v$, the timestamp $t$, and so on. The state space $S$ is represented as:

$$S = \{T^{*},\ T_{\mathrm{pre}},\ T_{\mathrm{heat}},\ T_{\mathrm{soak}},\ x,\ \hat{T}_s,\ T_c,\ v,\ t\}$$

2.2: Determine the action space. In this system, the action space $A$ of the agent is the set of adjustments to the set temperature of each heating zone, i.e.:

$$A = \{\Delta T_{\mathrm{pre}},\ \Delta T_{\mathrm{heat}},\ \Delta T_{\mathrm{soak}}\}$$

where $\Delta T_i$ is the change in the set temperature of the $i$-th heating zone, and its value range is limited by process safety constraints.
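As an illustration of 2.1 and 2.2, the state can be held in a small record with the nine observed variables, and the action can be a bounded list of three setpoint changes. The class name, field names, and the `MAX_STEP` bound below are assumptions made for this sketch.

```python
from dataclasses import dataclass, astuple


@dataclass
class FurnaceState:
    """Observed variables listed in 2.1 (field names are illustrative)."""
    target_temp: float
    preheat_temp: float
    heating_temp: float
    soaking_temp: float
    billet_position: float
    surface_temp_pred: float
    core_temp: float
    billet_speed: float
    timestamp: float

    def to_vector(self):
        """Flatten the state into a list suitable as policy network input."""
        return list(astuple(self))


MAX_STEP = 10.0  # hypothetical per-step bound on a setpoint change, in degrees Celsius


def clip_action(deltas, max_step=MAX_STEP):
    """Limit each zone's setpoint change to [-max_step, +max_step]."""
    return [max(-max_step, min(max_step, d)) for d in deltas]


state = FurnaceState(1200.0, 850.0, 1180.0, 1220.0, 12.5, 1105.0, 1040.0, 0.02, 3600.0)
print(state.to_vector())
print(clip_action([15.0, -3.0, -20.0]))
```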

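[0025] 2.3: Set the reward function:

$$r_t = -\left(k_1\,\bar{T}_{\mathrm{set}}(t) + k_2\,\lvert\dot{T}_{\mathrm{set}}(t)\rvert + k_3\,\lvert T_s(t) - T^{*}\rvert + k_4\,\max\!\left(0,\ \lvert T_s(t) - T_c(t)\rvert - \Delta T_{\max}\right)\right)$$

In the formula, $k_1$, $k_2$, $k_3$, $k_4$ are weighting coefficients, and $\Delta T_{\max}$ is the maximum allowable cross-sectional temperature difference. The reward function encourages the agent to ensure temperature tracking accuracy and heating uniformity while reducing energy consumption.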
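A reward of this kind can be sketched as a negative weighted penalty, so that lower energy use, smaller tracking error, and smaller uniformity violations all raise the reward. The exact functional form and the default weights below are assumptions for illustration; only the three ingredients (energy, tracking accuracy, uniformity) come from the text.

```python
def reward(avg_set_temp, set_temp_rate, t_surface, t_core, t_target, dt_max,
           k=(0.001, 0.01, 1.0, 2.0)):
    """Negative weighted penalty; the four weights k are hypothetical."""
    k1, k2, k3, k4 = k
    energy = k1 * avg_set_temp + k2 * abs(set_temp_rate)
    tracking = k3 * abs(t_surface - t_target)
    uniformity = k4 * max(0.0, abs(t_surface - t_core) - dt_max)
    return -(energy + tracking + uniformity)


# On target and uniform: only the energy term is penalized.
r_good = reward(1000.0, 0.0, 1200.0, 1190.0, 1200.0, 30.0)
# 20 degrees off target with a 60 degree surface-core gap: larger penalty.
r_bad = reward(1000.0, 0.0, 1180.0, 1120.0, 1200.0, 30.0)
print(r_good, r_bad)
```

Because every term is a penalty, the reward is never positive, and the agent maximizes it by driving each penalty toward zero.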

[0026] 2.4: Set the state-action value function. $Q^{\pi}(s, a)$ measures how good policy $\pi$ is, that is, the expected cumulative discounted return of the reward function under policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k}\ \middle|\ s_t = s,\ a_t = a\right]$$

where the agent's policy $\pi$ is the mapping from state $s$ to action $a$, and $\gamma$ is the discount factor. The optimal policy $\pi^{*}$ is the policy that maximizes the state-action value function: $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$.

Step 3: Train the policy network based on the A3C algorithm. Deep reinforcement learning (DRL) is a decision-making method that integrates deep learning and reinforcement learning. It achieves autonomous learning and optimization by constructing an interaction framework between the agent and its environment. Within the DRL framework, the agent perceives the environmental state in real time, generates actions based on a policy network, and receives feedback through a reward function. This allows it to maximize long-term cumulative rewards through continuous trial and error and policy updates. The deep reinforcement learning decision-making process for furnace temperature control is described using a Markov decision process structure, as shown in Figure 2.
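The cumulative discounted return that appears inside the state-action value function in 2.4 can be computed for a single recorded episode with a backward pass over the rewards. This is a generic sketch; the function name is illustrative, and a single-episode return is only one sample of the expectation.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * r_k, accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # 1 + 0.5 * (1 + 0.5 * 1) = 1.75
```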

[0027] It should be noted that the agent is the decision-making core of the entire control system. Its goal is to learn an optimal strategy to maximize the long-term cumulative reward obtained in the process of controlling the heating furnace. The environment is the object of interaction with the agent, receiving the furnace temperature setpoint action output by the agent. After each decision step, it feeds back to the agent the new state, the immediate reward, and the signal indicating whether the current round has ended.
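The interaction described above (action in, then new state, immediate reward, and end-of-round signal out) can be sketched with a toy environment. `ToyFurnaceEnv` is a hypothetical single-zone stand-in with first-order billet heating, not the furnace simulation used in the patent; its dynamics and constants are assumptions chosen only to make the loop runnable.

```python
class ToyFurnaceEnv:
    """Hypothetical one-zone furnace: the billet relaxes toward the zone setpoint."""

    def __init__(self, target=1200.0, horizon=50):
        self.target = target
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.setpoint = 1000.0
        self.billet = 20.0
        return (self.setpoint, self.billet)

    def step(self, delta):
        # Apply the setpoint adjustment inside assumed safe limits.
        self.setpoint = max(800.0, min(1300.0, self.setpoint + delta))
        # First-order heating of the billet toward the setpoint.
        self.billet += 0.1 * (self.setpoint - self.billet)
        self.t += 1
        reward = -abs(self.billet - self.target) - 0.001 * self.setpoint
        done = self.t >= self.horizon
        return (self.setpoint, self.billet), reward, done


def run_episode(env, policy):
    """Run one round and return the cumulative reward and the number of steps."""
    state, done, total, steps = env.reset(), False, 0.0, 0
    while not done:
        state, r, done = env.step(policy(state))
        total += r
        steps += 1
    return total, steps


total, steps = run_episode(ToyFurnaceEnv(), policy=lambda s: 5.0)
print(total, steps)
```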

[0028] Furthermore, the state space defines the information that the agent can perceive at any given time regarding the heating furnace system and the condition of the billet, which serves as the basis for the agent's decision-making. The action space defines the operations that the agent can perform, i.e., its output to the heating furnace control system. In this problem, the action is to directly set the furnace temperature in three key areas, and the reward is the environment's immediate evaluation of the actions performed by the agent, which is the core signal guiding the agent's learning direction. This transforms the complex multi-objective engineering optimization problem into a scalar signal that the agent can understand and maximize.

[0029] The A3C (Asynchronous Advantage Actor-Critic) algorithm is a deep reinforcement learning algorithm based on an actor-critic architecture. A3C employs an asynchronous multi-threaded training mechanism, eliminating the need for a target network and an experience replay pool. Its core idea is to create multiple worker threads that interact with the environment in parallel. Each thread maintains local network parameters and asynchronously updates the global network parameters by calculating the advantage function. The training process for a policy network based on the A3C algorithm is as follows: Step 3.1: Algorithm Initialization: Initialize the global actor network parameters θ and the critic network parameters ω, where the actor network outputs the action probability distribution π(a|s;θ), and the critic network evaluates the state value V(s;ω). Set the learning rate: actor network α=0.0001, critic network β=0.001, set the discount factor γ=0.99, the number of worker threads N=16, initialize the exploration noise parameter σ=0.1, and set the noise attenuation coefficient λ=0.01; Step 3.2: Create worker threads and start parallel training. Each thread runs independently and interacts with the heating furnace simulation environment. The core training process is that each worker thread repeatedly executes steps 3.3 to 3.5 until the convergence condition is met.

[0030] Step 3.3: At each time step $t$, each worker thread performs the following sequence of operations: copy the latest parameters from the global network to the local network; based on the current state $s_t$, generate the action probability distribution through the local actor network, $\pi(a \mid s_t; \theta') = f_{\theta'}(s_t)$, where $f_{\theta'}$ is the forward computation function of the local actor network; sample a basic action $\tilde{a}_t$ from the distribution; add exploration noise, $a_t = \tilde{a}_t + \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, \sigma_t^{2})$; clip the action so that it stays within safe limits, $a_t \leftarrow \mathrm{clip}(a_t, a_{\min}, a_{\max})$; execute the action in the environment, observe the reward $r_t$ and the next state $s_{t+1}$, and store the experience tuple $(s_t, a_t, r_t, s_{t+1})$ in the local experience buffer.

[0031] If the current round ends or the buffer reaches the batch size B, then proceed to the gradient calculation process (step 3.4); otherwise, t←t+1, and continue the interaction of the next time step.
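The noise and clipping operations in Step 3.3 can be sketched as follows. The initial standard deviation 0.1 and decay coefficient 0.01 are the values given in Step 3.1; the exponential decay law, the Gaussian noise, and the function names are assumptions, since the decay formula itself is not available in this text.

```python
import math
import random


def noise_sigma(step, sigma0=0.1, decay=0.01):
    """Noise standard deviation at a given step (exponential decay is an assumption)."""
    return sigma0 * math.exp(-decay * step)


def explore_and_clip(base_action, step, low, high, rng):
    """Add Gaussian exploration noise, then clip each component to [low, high]."""
    sigma = noise_sigma(step)
    noisy = [a + rng.gauss(0.0, sigma) for a in base_action]
    return [max(low, min(high, a)) for a in noisy]


rng = random.Random(0)  # seeded so that the sketch is reproducible
action = explore_and_clip([9.95, -9.95, 0.0], step=0, low=-10.0, high=10.0, rng=rng)
print(noise_sigma(0), noise_sigma(100), action)
```

Clipping after adding noise keeps exploration from pushing a setpoint adjustment outside the permitted range.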

[0032] Step 3.4: Gradient Calculation and Global Update. When the local buffer is full or the round ends, the worker thread performs the following calculations:

Step 3.4.1: Calculate the advantage function. Based on the state value estimated by the local critic network, calculate the advantage value accumulated over the buffered experience:

$$A_t = \sum_{k=0}^{n-1}\gamma^{k}\, r_{t+k} + \gamma^{n}\, V(s_{t+n}; \omega') - V(s_t; \omega')$$

where $n$ is the span from the current step to the last step within the buffer.
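The advantage estimate of Step 3.4.1 is usually computed with one backward pass over the buffer: the return is bootstrapped from the critic's value of the state after the last buffered step, and the critic's value of each visited state is subtracted. The sketch below assumes this standard n-step form; the function name and the toy numbers are illustrative.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage per buffered step: discounted n-step return minus the critic's value.

    rewards[i] and values[i] belong to the i-th buffered step; bootstrap_value is
    the critic's estimate for the state after the last step (0.0 if the round ended).
    """
    advantages = []
    ret = bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret
        advantages.append(ret - v)
    advantages.reverse()
    return advantages


adv = n_step_advantages([1.0, 1.0], [0.5, 0.5], bootstrap_value=0.0, gamma=0.5)
print(adv)
```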

[0033] Step 3.4.2: Construct the loss function. The loss function quantifies the network's prediction error and guides parameter optimization. It consists of three parts. Policy loss: used to optimize the actor network and encourage the selection of high-advantage actions. The formula is: $L_{\pi} = -\sum_{t}\log\pi(a_t \mid s_t; \theta')\,A_t$.

[0034] Value loss: used to optimize the critic network, reduce value estimation error, and improve the accuracy of state value prediction. The formula is: $L_{V} = \sum_{t} A_t^{2}$.

[0035] Entropy regularization loss: prevents the policy from prematurely converging to a local optimum by increasing policy randomness to promote exploration. The formula is: $L_{H} = \sum_{t}\sum_{a}\pi(a \mid s_t; \theta')\log\pi(a \mid s_t; \theta')$.

[0036] Total loss: a weighted sum of the above components, balancing policy optimization, value estimation, and exploration requirements. The formula is: $L = L_{\pi} + c_v L_{V} + c_e L_{H}$, where $c_v$ and $c_e$ are weighting coefficients. Calculate the gradients of the total loss with respect to the local network parameters $\theta'$ and $\omega'$.
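The three loss components of Step 3.4.2 can be evaluated on a small batch as follows. This sketch assumes a discrete action distribution for the entropy term and uses sums over the batch; the coefficients `c_v` and `c_e`, the function name, and the toy inputs are illustrative assumptions. In an actual training setup the gradients would come from an automatic differentiation framework rather than from these scalar values.

```python
import math


def a3c_losses(log_probs, advantages, action_probs, c_v=0.5, c_e=0.01):
    """Return (policy_loss, value_loss, entropy_loss, total_loss) for one batch.

    log_probs[i]    : log-probability of the action taken at step i
    advantages[i]   : advantage estimate at step i
    action_probs[i] : full action distribution at step i (discrete, as an assumption)
    """
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages))
    value_loss = sum(a * a for a in advantages)
    # Negative entropy: minimizing it increases policy randomness.
    entropy_loss = sum(p * math.log(p) for dist in action_probs for p in dist if p > 0.0)
    total = policy_loss + c_v * value_loss + c_e * entropy_loss
    return policy_loss, value_loss, entropy_loss, total


p_loss, v_loss, e_loss, total_loss = a3c_losses(
    log_probs=[math.log(0.5)], advantages=[2.0], action_probs=[[0.5, 0.5]]
)
print(p_loss, v_loss, e_loss, total_loss)
```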

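[0037] Step 3.4.3: Asynchronous Global Update. The total loss is minimized using gradient descent. Each worker thread uploads its calculated gradients to the global network to update the global network parameters: $\theta \leftarrow \theta - \alpha\,\nabla_{\theta'} L$ and $\omega \leftarrow \omega - \beta\,\nabla_{\omega'} L$, where $\alpha$ and $\beta$ are the learning rates.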
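The asynchronous update pattern of Step 3.4.3 can be sketched with ordinary threads: each worker copies the global parameters, computes a gradient locally, and applies it to the shared parameters without waiting for other workers. The quadratic toy loss, the `GlobalParams` class, and the lock around each update are assumptions for illustration, not the patent's networks.

```python
import threading


class GlobalParams:
    """Shared parameter vector updated by gradient descent from several threads."""

    def __init__(self, values, lr):
        self.values = list(values)
        self.lr = lr
        self.step = 0
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return list(self.values)

    def apply_gradients(self, grads):
        with self.lock:
            self.values = [v - self.lr * g for v, g in zip(self.values, grads)]
            self.step += 1


def worker(global_params, n_updates):
    for _ in range(n_updates):
        local = global_params.snapshot()            # copy global parameters
        grads = [2.0 * (w - 1.0) for w in local]    # gradient of toy loss sum((w - 1)^2)
        global_params.apply_gradients(grads)        # asynchronous push to the global copy


params = GlobalParams([0.0, 5.0], lr=0.1)
threads = [threading.Thread(target=worker, args=(params, 50)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params.step, params.values)
```

A gradient may be computed from a slightly stale copy of the parameters; with a small learning rate the shared parameters still settle near the minimizer of the toy loss.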

[0038] Step 3.5: Training Termination and Model Saving. Step 3.5.1: Continue training until the convergence condition is met: the average cumulative reward changes by less than 1% over 100 consecutive rounds, or the maximum number of training steps is reached.

[0039] Step 3.5.2: After training is complete, save the trained global actor network parameters θ* for subsequent online inference.
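The reward-based stopping rule of Step 3.5.1 can be sketched as a comparison of two consecutive 100-round windows. Comparing window means is one possible reading of "changes by less than 1% over 100 consecutive rounds" and is an assumption here, as is the function name.

```python
def has_converged(episode_rewards, window=100, tol=0.01):
    """True if the mean reward of the last window differs from the previous
    window's mean by less than tol (relative change)."""
    if len(episode_rewards) < 2 * window:
        return False
    prev = sum(episode_rewards[-2 * window:-window]) / window
    curr = sum(episode_rewards[-window:]) / window
    if prev == 0.0:
        return curr == 0.0
    return abs(curr - prev) / abs(prev) < tol


stable = has_converged([-100.0] * 200)
improving = has_converged([float(r) for r in range(-300, -100)])
too_short = has_converged([-100.0] * 150)
print(stable, improving, too_short)
```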

[0040] Step 4: Deploy the trained A3C policy network to the walking beam furnace control system. Data is collected in real time through the sensor network installed at key locations in the furnace. The deployed policy network receives this real-time status as input and outputs the optimal control action as a temperature setting suggestion, which is executed by the PLC to realize intelligent closed-loop control.
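One online control cycle of Step 4 can be sketched as: read the state, query the policy for setpoint changes, bound each change, and clamp the resulting setpoints to the zone limits before they are handed to the PLC. The function name, the `max_step` bound, the zone limits, and the stand-in policy below are all hypothetical.

```python
def control_step(policy, state, current_setpoints, limits, max_step=10.0):
    """Compute new zone setpoints from the policy output, inside safe limits."""
    deltas = policy(state)
    new_setpoints = {}
    for (zone, sp), d in zip(current_setpoints.items(), deltas):
        d = max(-max_step, min(max_step, d))            # bound the adjustment
        low, high = limits[zone]
        new_setpoints[zone] = max(low, min(high, sp + d))  # clamp to the zone's range
    return new_setpoints


# Stand-in for the trained actor network: returns one change per zone.
fake_policy = lambda state: [20.0, -5.0, 3.0]
limits = {"preheating": (700.0, 1000.0), "heating": (1000.0, 1300.0), "soaking": (1150.0, 1250.0)}
setpoints = {"preheating": 900.0, "heating": 1200.0, "soaking": 1248.0}

new_sp = control_step(fake_policy, state=None, current_setpoints=setpoints, limits=limits)
print(new_sp)
```

In this example the preheating change is bounded to 10 degrees and the soaking setpoint is held at its upper limit.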

[0041] The working principle of this invention is as follows: it uses a deep reinforcement learning-based furnace temperature control strategy for the walking beam furnace to dynamically adjust the furnace temperature according to the operating conditions, ensuring that the control strategy always matches the current operating conditions. This results in more precise control and a faster response. Furthermore, by linking control decisions with production plan data, the system can achieve global optimization rather than local optimization. This ensures product quality while minimizing energy consumption throughout the entire production sequence, improving product yield, reducing fuel consumption, and bringing economic and environmental benefits to metallurgical enterprises.

[0042] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, enabling those skilled in the art to better understand and utilize it. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A furnace temperature setting method for a walking beam furnace based on deep reinforcement learning, characterized in that it includes the following steps: Step 1: Model the furnace temperature setting problem of the walking beam furnace, and establish a multi-objective optimization function and constraints; Step 2: Design a reinforcement learning framework, use a Markov decision process to describe the furnace temperature setting problem, determine the state space, action space, reward function and state-action value function, and obtain the optimal policy of the heating furnace temperature control optimization model; Step 3: Based on historical furnace temperature data, billet condition data, and environmental data, train the policy network using the A3C algorithm; Step 4: Deploy the trained policy network in the heating furnace control system and adjust the furnace temperature setpoint in real time.

2. The furnace temperature setting method based on deep reinforcement learning for a walking beam furnace according to claim 1, wherein, In step one, the multi-objective optimization function simultaneously considers minimizing energy consumption and maximizing yield, with constraints including furnace temperature process requirements, heating uniformity conditions, and equipment safety operation constraints.

3. The furnace temperature setting method based on deep reinforcement learning for a walking beam furnace according to claim 2, characterized in that, The multi-objective optimization function takes minimizing energy consumption and minimizing heating quality deviation as parallel objectives. The energy consumption objective is the weighted sum and accumulation of the average set temperature and the rate of change of the set temperature in each heating zone during each heating time period. The heating quality deviation target is the weighted sum of the deviation between the billet surface temperature and the target temperature during each heating time period, and the portion of the deviation between the billet surface temperature and the core temperature that exceeds the maximum allowable cross-sectional temperature difference.

4. The furnace temperature setting method for a walking beam furnace based on deep reinforcement learning according to claim 1, characterized in that, in step two, the state space includes the billet target temperature, the actual temperature of the preheating section, the actual temperature of the heating section, the actual temperature of the soaking section, the current position of the billet, the predicted surface temperature of the billet, the core temperature of the billet, the billet moving speed, and the timestamp.

5. The furnace temperature setting method for a walking beam furnace based on deep reinforcement learning according to claim 1, characterized in that, in step two, the action space includes the set temperature change of the preheating section, the set temperature change of the heating section, and the set temperature change of the soaking section.

6. The furnace temperature setting method for a walking beam furnace based on deep reinforcement learning according to claim 1, characterized in that, in step three, the specific steps for training the policy network using the A3C algorithm include: initializing the global actor network parameters and critic network parameters, and setting the learning rate, discount factor, and number of worker threads; creating multiple worker threads, each of which independently copies the global parameters, interacts with the heating furnace simulation environment, and collects experience data; calculating the loss gradient based on the advantage function; asynchronously updating the global network parameters until the convergence condition is met; and saving the trained actor network parameters for online inference.

7. The furnace temperature setting method for a walking beam furnace based on deep reinforcement learning according to claim 6, characterized in that the calculation of the loss gradient based on the advantage function includes calculating the policy loss, value loss, and entropy regularization loss.