Method for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller, and the controller
By determining simulation parameters based on spatial and system models, the method addresses inefficiencies in RL-based AC/HVAC controller pre-training, enhancing training efficiency and accuracy by accounting for hysteresis.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MITSUBISHI ELECTRIC R&D CENTRE EUROPE BV
- Filing Date
- 2025-11-07
- Publication Date
- 2026-07-02
AI Technical Summary
Existing reinforcement learning (RL)-based AC/HVAC controllers face inefficiencies in pre-training, particularly due to the challenges of determining appropriate data retention periods for historical state data, which can lead to insufficient or excessive training times, and the inability to effectively account for hysteresis in system behavior.
Determine simulation parameters based on spatial and system models to define the range of historical state data and pre-training time, incorporating hysteresis considerations to improve controller training efficiency and accuracy.
The method reduces the risk of overfitting and ensures effective pre-training by accurately accounting for hysteresis, leading to improved stability and efficiency in complex environments.
Abstract
Description
[Technical Field]
[0001] The present invention relates to a method for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller for an air conditioning (AC) system or a heating, ventilation, and air conditioning (HVAC) system. The present invention also relates to a system for determining at least one simulation parameter for a simulation to pre-train an RL-based controller for an AC system or an HVAC system. Furthermore, the present invention relates to an RL-based controller.
[Background Art]
[0002] It is well known to use reinforcement learning for HVAC controllers. See, for example, Patent Document 1. Reinforcement learning is a method used in computerized systems, and in particular in HVAC controllers, in which artificial intelligence, including, for example, neural networks, learns optimal behavior through interaction with the environment. The method involves iterative trial and error by an agent, such as an HVAC controller, to find appropriate behavior, with desirable results being rewarded according to certain criteria.
[0003] Figure 3 schematically illustrates such a known reinforcement learning process. The features depicted in Figure 3 can be defined as follows.
[0004] Agent: A trained entity. The agent interacts with the environment and chooses actions.
[0005] Environment: The location where the agent exists and takes action. The agent's actions affect the environment, and the environment provides the agent with rewards or feedback.
[0006] Action: The choices an agent can make within the environment.
[0007] Reward: The evaluation or value assigned to an agent's actions. The agent attempts to learn actions that maximize reward over time.
[0008] State: The agent's state within the environment. The state provides information for the agent to make decisions about its actions. During reinforcement learning, the agent selects an action based on its current state and receives a reward as a result.
[0009] Simulations are often used for pre-training controllers / agents. For example, before deploying a reinforcement learning agent to a real-world environment, the agent is pre-trained through a simulation over a certain period. The simulation, and therefore pre-training, typically involves a series of iterations that determine and perform actions by the agent based on the current state, and then feed back the resulting state changes to the agent along with the observed rewards.
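The iteration described above can be sketched as a short loop. This is only a minimal illustration: the `ToyRoom` environment, the three discrete actions, and the tabular Q-learning update with its hyperparameters are assumptions chosen for brevity, not the simulation or learning algorithm of the disclosure.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # hypothetical actions: cool, idle, heat


class ToyRoom:
    """Toy simulated environment (an assumption, not the patent's model)."""

    def __init__(self, setpoint=22, start=26):
        self.setpoint = setpoint
        self.start = start
        self.temp = start

    def reset(self):
        self.temp = self.start
        return self.temp

    def step(self, action):
        # Each action shifts the integer room temperature by one degree,
        # clamped to the range 15..30.
        self.temp = max(15, min(30, self.temp + action))
        reward = -abs(self.temp - self.setpoint)
        return self.temp, reward


def pretrain(env, n_iterations, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Run pre-training iterations and return the learned Q-table."""
    rng = random.Random(seed)
    q = defaultdict(float)  # maps (state, action) to an estimated value
    state = env.reset()
    for _ in range(n_iterations):
        # The agent chooses an action based on the current state.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        # The simulation returns the resulting state and the observed reward.
        next_state, reward = env.step(action)
        # The feedback updates the agent (tabular Q-learning update).
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_action(q, state):
    """Return the action with the highest learned value in `state`."""
    return max(ACTIONS, key=lambda a: q[(state, a)])
```

The number of iterations passed to `pretrain` plays the role of the pre-training duration, which the disclosure proposes to derive from models rather than from intuition.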
[0010] Existing solutions sometimes fail to achieve the desired results, for example, because control actions do not necessarily produce the desired outcome precisely, and / or because pre-training is inefficient.
[Prior Art Documents]
[Patent Documents]
[0011] [Patent Document 1] U.S. Patent Application Publication No. 2012-0040601(A1)
[Summary of the Invention]
[Problems to Be Solved by the Invention]
[0012] This invention seeks to improve RL-based AC / HVAC controllers, particularly with regard to their pre-training.
[Means for Solving the Problems]
[0013] This objective is addressed by the subject matter of the independent claim. Preferred embodiments are specified in the dependent claims, this specification, and the figures.
[0014] Accordingly, a method is disclosed for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller for an air conditioning (AC) system or a heating, ventilation, and air conditioning (HVAC) system. The method includes a step of determining at least one of the following simulation parameters a) and b), based on a spatial model that models the space to which the system (i.e., the AC or HVAC system) is applied, and / or based on a system model (i.e., a model of the AC or HVAC system):
a) the range of historical state data on which the controller should be pre-trained in at least one pre-training iteration of the simulation;
b) the pre-training time period that should be applied through the simulation.
[0015] During pre-training by simulation, and during optional additional learning after deployment in the real world, it is often the case that only the current time step is fed back to the agent as a state variable. This makes it difficult for the agent to correctly account for hysteresis within the system. Some existing solutions also consider historical state data. For example, in Christian Blad, Simon Boegh, Carsten Skovmose Kallesoee, "Data-driven Offline Reinforcement Learning for HVAC-systems", Energy, Volume 261, Part B, 2022, Article Number 125290, ISSN 0360-5442, the behavior of the environment is simulated using an LSTM (Long Short-Term Memory) network. The state fed back to the agent includes data from up to the previous 6 time steps along with the current state.
[0016] However, even when historical data is taken into account, the results may still not be satisfactory. For example, when determining how long or how much data should be retained so that the controller can learn the hysteresis of the system from the retained data, intuition and experience of the controller designer are usually relied upon as key factors. However, in some cases, the data retention period may be selected too short to properly account for hysteresis, while in other cases, the data retention period may be too long, thus requiring excessive time for controller training.
[0017] In this regard, while reinforcement learning typically assumes the Markov property, under which only the current state matters, "learning hysteresis" means that the controller bases its decisions on past data as well. By "learning" this hysteresis effect, the controller / agent can better handle situations where past actions also affect the present, thus improving stability and effectiveness in complex environments. Further examples in this regard can be found in "Real building implementation of a deep reinforcement learning controller to enhance energy efficiency and indoor temperature control" by Alberto Silvestri, Davide Coraci, Silvio Brandi, Alfonso Capozzoli, Esther Borkowski, Johannes Koehler, Duan Wu, Melanie N. Zeilinger, Arno Schlueter, in Applied Energy, Volume 368, 2024, Article number 123447, ISSN 0306-2619, https://doi.org/10.1016/j.apenergy.2024.123447.
[0018] For this reason, the present disclosure seeks to determine simulation parameter a) defined above, which corresponds to and / or is similar to the data retention period, in a more rule-based manner, specifically based on a spatial model and / or a system model.
[0019] Additionally or alternatively, to improve the efficiency of controller training through simulation, the required simulation time can be determined from the parameters of the spatial model and / or system model, in accordance with simulation parameter b) specified above. In other words, the length of the simulation period to be applied during controller training, which may correspond to or be similar to the pre-training time period specified above, can be automatically determined from those models.
[0020] The model-based determination of simulation parameters a) and / or b) reduces the risk of the controller being overfitted due to a short simulation period, and conversely, the risk of the controller being pre-trained for an unnecessarily long time due to a long simulation period. Note that the RL-based controller may include at least one neural network or at least one other machine learning model whose parameters can be trained by reinforcement learning. Generally, such a neural network or machine learning model may model state-action relationships that determine the control actions to be selected by the RL-based controller depending on the currently observed state. These states may include, for example, measurable parameters of the space and / or desired settings, such as those set by the user.
[0021] In one embodiment of this method, which includes the step of determining at least parameter a), the range of past state data is determined based on at least one of the following parameters of the spatial model, and in particular based on the product of these parameters:
- a resistance parameter (R) that represents the thermal resistance of the space to heat transfer;
- a capacity parameter (C) that represents the heat capacity of the space for storing heat.
[0022] For example, the product of these parameters may be used to determine a first time parameter. At least one additional time parameter may be determined based on a system model. For example, the additional time parameter may depend on, or be equivalent to, a parameter in the system model that affects the speed at which the system reaches a target steady state. When multiple AC / HVAC systems controlled by an RL-based controller regulate the air in a common space such as a building, such time parameters may be determined based on the system model of each system. The final parameter, which defines the time span corresponding to and / or defining the range of past state data, may be selected from the multiple time parameters, in particular as the maximum among them.
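The selection rule described in this paragraph can be sketched as follows. The function names, the units, and the conversion to a number of time steps are illustrative assumptions; the disclosure's own equations (Figures 7 to 10) are not reproduced in this section.

```python
import math
from typing import Iterable


def building_time_parameter(r_k_per_w: float, c_j_per_k: float) -> float:
    """First time parameter as the product R * C, in seconds (K/W * J/K = s)."""
    return r_k_per_w * c_j_per_k


def retention_time(r_k_per_w: float, c_j_per_k: float,
                   system_time_params_s: Iterable[float] = ()) -> float:
    """Pick the maximum of the building and system time parameters.

    `system_time_params_s` holds one hypothetical time parameter per
    AC / HVAC system that serves the common space.
    """
    candidates = [building_time_parameter(r_k_per_w, c_j_per_k)]
    candidates.extend(system_time_params_s)
    return max(candidates)


def retention_steps(tau_s: float, step_s: float) -> int:
    """Number of past time steps needed to cover `tau_s` (assumed rounding up)."""
    return math.ceil(tau_s / step_s)
```

For instance, with R = 0.005 K/W and C = 2e6 J/K the building time parameter is about 10 000 s, so system time parameters of 600 s and 1 800 s would not change the selected range, whereas a system time parameter of 20 000 s would.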
[0023] In a further embodiment of the method, which includes the step of determining at least parameter b), the pre-training time period is determined based on a parameter of the system model that affects the speed at which the system reaches the target steady state. Additionally or alternatively, the pre-training time period may be determined based on a simulation parameter a), in particular based on or as a product of the parameter with a certain integer.
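As a rough sketch of the second option, the pre-training time period can be computed as an integer multiple of parameter a). The multiplier, the step length, and the function names below are placeholders for illustration; the source does not state which integer is used.

```python
import math


def pretraining_period(tau_save_s: float, n_multiples: int) -> float:
    """Pre-training time period as an integer multiple of parameter a)."""
    if not isinstance(n_multiples, int) or n_multiples < 1:
        raise ValueError("n_multiples must be a positive integer")
    return n_multiples * tau_save_s


def pretraining_iterations(period_s: float, step_s: float) -> int:
    """Number of simulation steps needed to cover the period (rounded up)."""
    return math.ceil(period_s / step_s)
```

With a retention time of 10 000 s and an assumed multiplier of 50, the period is 500 000 s, which corresponds to 1 667 steps of 300 s.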
[0024] In further embodiments, the space modeled by the spatial model is a building, or is contained by a building. Therefore, the spatial model may also be referred to as a building model.
[0025] In one embodiment, the spatial model includes a lumped-parameter model of the space. This can help determine the simulation parameters in a time-efficient manner.
[0026] To give a further example, the method includes the step of adjusting at least one parameter of the spatial model based on data measured for the actual space. In other words, the model may be fitted to the measured data to increase its accuracy.
[0027] To give a further example, the method includes a step of adjusting at least one parameter of the spatial model based on the season being simulated. For example, an AC / HVAC system can operate primarily in cooling mode during the summer, while operating primarily in heating mode during the winter. This can be taken into account in the above example. For example, the spatial model may include thermal capacity C and / or thermal resistance R as parameters. In practical building simulations, R and C are usually considered fixed properties of the building materials, meaning that R and C do not generally change with the season or heating / cooling mode. On the other hand, in data-driven models or machine learning-based methods such as this reinforcement learning method, the parameters optimized during cooling mode and heating mode may differ. As a result, although R and C are physically fixed in actual buildings, different values for R and C may be learned for cooling mode and heating mode.
[0028] To give a further example, the system model includes a function that describes how quickly, over time, the HVAC system reaches the target steady state.
[0029] To give a further example, past state data includes pre-stored and / or recorded historical state data during preceding pre-training iterations. For example, if there is no pre-determined and / or simulated state data during the initial simulation and / or pre-training cycle, pre-stored historical state data may be referenced. Once such data becomes sufficiently available, for example, in the form of data recorded during preceding pre-training iterations, this data can be referred to as past state data.
[0030] To give a further example, the method includes a step of performing a simulation using at least one determined simulation parameter, which includes a step of performing a series of pre-training iterations.
[0031] To give a further example, the method includes the step of controlling a system (e.g., a real-world AC or HVAC system) using a pre-trained controller, wherein the controller is configured to take into account, during control, the same range of past state data as defined by simulation parameter a). In other words, the range of past state data determined by simulation parameter a) can be used after pre-training and also during real-world deployment to accurately determine control behavior.
[0032] To give a further example, the controller is also trained based on its past actions. Therefore, a further simulation parameter relating to the range of past actions on which the controller should be pre-trained may be considered. This range may optionally be set flexibly, for example, based on a spatial or system model, and in particular based on or as the retention time length. Specifically, past action variables, which may generally relate to a series of commands sent to the device by the controller, are considered within the same length range as the past state data (e.g., τ_save, which will be discussed below). This setting ensures that the influence of the controller's actions is taken into account while considering the hysteresis of the target system, and therefore the efficiency of the learning process is improved.
[0033] Further disclosed are reinforcement learning (RL) based controllers for air conditioning (AC) systems or for heating, ventilation, and air conditioning (HVAC) systems, which include means for performing the method described herein in any of the embodiments disclosed herein.
[0034] The controller means may include, for example, at least one processor and / or at least one computer-readable medium on which a computer program is stored, the computer program including instructions that cause the processor to execute any of the disclosed methods when the program is executed by the processor.
[0035] This disclosure also relates to a structure for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller for an air conditioning (AC) system or a heating, ventilation, and air conditioning (HVAC) system, the structure including:
• a controller configured to control an AC or HVAC system;
• a computer that is operably connected to the controller.
The structure is configured:
○ to determine at least one simulation parameter for the simulation based on a spatial model representing the space to which the AC or HVAC system is applied, and / or based on a system model, i.e., a model of the AC or HVAC system;
○ to perform the simulation using the at least one determined simulation parameter.
[0036] The at least one simulation parameter is at least one of the following:
a) the range of historical state data on which the controller should be pre-trained in at least one pre-training iteration of the simulation;
b) the pre-training time period that should be applied through the simulation.
[0037] Exemplary embodiments are described below with reference to the attached schematic drawings.
[Brief Description of the Drawings]
[0038]
[Figure 1] An HVAC system including a controller according to one embodiment of the present disclosure.
[Figure 2] Units included in a controller according to one embodiment of the present disclosure.
[Figure 3] A basic diagram illustrating reinforcement learning.
[Figure 3a] An illustration of how past or future states can be considered in one embodiment of the present disclosure.
[Figure 4] A schematic diagram of a structure according to one embodiment of the present disclosure.
[Figure 5] A schematic diagram of the spatial model.
[Figure 6] A flowchart of a method according to one embodiment of the present disclosure.
[Figure 7] An equation referenced by the method shown in Figure 6.
[Figure 8] An equation referenced by the method shown in Figure 6.
[Figure 9] An equation referenced by the method shown in Figure 6.
[Figure 10] An equation referenced by the method shown in Figure 6.
[Modes for Carrying Out the Invention]
[0039] Figure 1 illustrates an HVAC system 10 that treats the air in a space 12 in a generally known manner. The space 12 is, for example, a building having several rooms and / or floors, or includes such a building. The HVAC system 10 includes a set of devices that include an outdoor unit and an indoor unit in a generally known manner. Optionally, a remote controller may also be provided that allows a user to remotely control either the outdoor unit or the indoor unit, for example by defining setpoints for the operation of these units.
[0040] The HVAC system 10 further includes sensors for measuring the conditions of the space 12, such as the indoor temperature and / or humidity within the space 12, and / or the airflow within the space 12, and / or the power consumption of the entire HVAC system 10, and / or the air quality within the space 12. The air quality may be determined, for example, based on the carbon dioxide level within the space 12.
[0041] The HVAC system 10 further includes an HVAC controller 14. The HVAC controller 14 is configured to use the above-mentioned setpoints and / or sensor signals as state input information and to output control signals (see dotted lines in Figure 1) for controlling the outdoor unit and / or indoor unit. As schematically illustrated in Figure 1, the HVAC controller 14 thus acts on the space 12 at least indirectly.
[0042] In the illustrated example, the HVAC controller 14 is a reinforcement learning (RL) based controller. This means that, once deployed to regulate the air in space 12 in a real-world environment, the HVAC controller 14 undergoes an ongoing reinforcement learning process. Through this process, the HVAC controller 14 continuously interacts with the environment in the form of space 12, receiving feedback based on its control actions and improving its control strategy over time. Control actions may include, for example, adjusting temperature setpoints, changing fan speeds, or controlling air conditioning or heating output. The control strategy as a whole may include how to select control actions in response to observed conditions.
[0043] As described with reference to Figure 3, the controller 14 improves its performance by learning from the consequences of its actions, optimizing the efficiency of the HVAC system 10 and maintaining the desired conditions within the space 12. These consequences may generally be reflected in the rewards shown in Figure 3. Such rewards may be determined by a reward function that can take into account any factors, such as energy efficiency, occupant comfort, air quality, and system performance, in relation to achieving and maintaining a set temperature. The controller is trained to maximize cumulative rewards over time by adjusting its actions.
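A reward function of the kind mentioned here might combine several penalty terms. The sketch below is an assumption made purely for illustration: the choice of terms, the weights, and the CO2 limit are hypothetical and are not specified in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the source does not specify any values."""
    energy: float = 1.0        # per kW of electrical power
    comfort: float = 0.5       # per kelvin of deviation from the setpoint
    air_quality: float = 0.01  # per ppm of CO2 above the limit


def reward(power_kw: float, indoor_temp_c: float, setpoint_c: float,
           co2_ppm: float, co2_limit_ppm: float = 1000.0,
           weights: Optional[RewardWeights] = None) -> float:
    """Negative weighted sum of energy use, discomfort and excess CO2."""
    w = weights or RewardWeights()
    energy_penalty = w.energy * power_kw
    comfort_penalty = w.comfort * abs(indoor_temp_c - setpoint_c)
    air_penalty = w.air_quality * max(0.0, co2_ppm - co2_limit_ppm)
    return -(energy_penalty + comfort_penalty + air_penalty)
```

Under these assumed weights, drawing 2 kW while the room is 2 K off the setpoint and 200 ppm above the CO2 limit gives a reward of about -5, and the reward is zero only when all three penalties vanish.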
[0044] The reinforcement learning algorithm utilized by controller 14 employs a neural network or another machine learning model to approximate a mapping between the system state and the corresponding control action to be taken. The neural network is trained through iterative interactions with the environment as described above, receives state data, outputs control actions, and adjusts its internal mapping model / function based on the reward received from each action.
[0045] Specifically, the HVAC controller 14 utilizes reinforcement learning algorithms such as Q-learning, Deep Q Network (DQN), Proximal Policy Optimization (PPO), or Deep Deterministic Policy Gradient (DDPG) to improve its decision-making process. These algorithms enable the controller to explore different control strategies and improve its performance based on the observed results, and the controller's neural network or other machine learning model facilitates the modeling of the state-action relationships described above.
[0046] Figure 2 is an illustrative diagram of some of the units included in the HVAC controller 14. Some of these units are optional. While each of the units discussed below may be a software module, at least the input and output units may also include, or at least be connected to, a hardware interface for performing the data transmission described.
[0047] The HVAC controller 14 includes an input unit for receiving input data, such as sensor data or user input information, that may correspond to the state data described above. This data is transferred to the power consumption model establishment unit, the control value determination unit, and the historical data storage unit, which will be discussed below. In this way, the control value determination unit may receive user input information, for example, regarding whether to enable reinforcement learning and / or whether to apply the control value determined by reinforcement learning when it is enabled.
[0048] Optionally, the input unit may also receive future data, particularly future state data, received based on predictions and / or forecasts. This future data may be calculated, for example, based on weather forecasts. The future data is output to the future data storage unit, which will be discussed below.
[0049] The HVAC controller 14 includes a power consumption model establishment unit, or alternatively, a general system model construction unit. For example, power consumption can be calculated based on the coefficient of performance (COP). The COP is defined as a function of the supply capacity Q for a room. The supply capacity usually relates to the rate at which thermal energy (heating or cooling) is delivered to maintain the room temperature at a desired level. Specifically, the COP relates the supply capacity Q to the amount of power input required. In one example, power consumption can be calculated as the ratio of Q to COP, or as a function of this ratio. In particular, when calculating this ratio, the COP may also be set to a constant value.
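The ratio described above can be written as a one-line calculation. In the sketch below, the default COP of 3.0 and the option of passing a COP curve as a callable are illustrative assumptions rather than values from the disclosure.

```python
from typing import Callable, Union

CopSpec = Union[float, Callable[[float], float]]


def power_consumption(q_supply_w: float, cop: CopSpec = 3.0) -> float:
    """Electrical power in watts, estimated as |Q| / COP.

    `cop` is either a constant or a function of the supply capacity Q.
    The default of 3.0 is a placeholder, not a value from the source.
    """
    cop_value = cop(q_supply_w) if callable(cop) else cop
    if cop_value <= 0:
        raise ValueError("COP must be positive")
    return abs(q_supply_w) / cop_value
```

With a constant COP of 3, a supply capacity of 3 000 W corresponds to 1 000 W of electrical input.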
[0050] These units generally work to model the equipment of the HVAC system 10 used to treat air, such as indoor and / or outdoor units. This will be explained in more detail below with respect to the equations in Figure 8. Optionally, power consumption may be part of the system model, or alternatively, it may be omitted. As will be similarly detailed below with respect to the equations in Figures 7 to 10, the power consumption model establishment unit, or alternatively, the system model construction unit, outputs information to the storage time calculation unit that can be used to calculate the respective storage time lengths.
[0051] The HVAC controller 14 further includes a building model construction unit, which may also be referred to as a spatial model construction unit. This unit is configured to construct a building model as a specific example of a spatial model. See the consideration in Figure 5 below. Construction may involve receiving at least a portion of the model from an external source, such as a server or another computer.
[0052] The HVAC controller 14 further includes a storage time length calculation unit that receives input information from the building model construction unit and, optionally, from the power consumption model establishment unit or, alternatively, from the system model construction unit. The storage time length calculation unit is configured to determine, based on the input data, the range of historical state data on which the controller should be pre-trained in at least one pre-training iteration of the simulation. Because the historical state data can be regarded as time-series data that is typically stored or saved for use in pre-training the controller, this unit is referred to as a storage time length calculation unit. Alternatively, it could be referred to, for example, as a historical state data range calculation unit. For completeness, it should be noted that pre-training iterations refer equally to the time series or time sequence of simulated operation of the HVAC controller 14.
[0053] The HVAC controller 14 further includes a simulation execution unit. This unit receives input information from the optional power consumption model establishment unit, or alternatively from the system model construction unit, as well as from the building model construction unit and a reinforcement learning unit. The reinforcement learning unit may output simulated and / or internal control actions during the pre-training phase. The simulation execution unit is configured to use these simulated control actions to determine the simulated influence of the HVAC controller 14 on the simulated environment, i.e., the space, and in particular the building modeled by the building model construction unit, and thus to determine the resulting changed states. The simulated states thus computed are output by the simulation execution unit to the historical data storage unit of the HVAC controller 14.
[0054] The historical data storage unit is configured to store past state data, which may also be referred to as historical data. This state data may relate to any actually measured state data received by the input unit, such as the indoor temperature and / or humidity in space 12, and / or the airflow in space 12, and / or the power consumption of the entire HVAC system 10, and / or the air quality in space 12. This state data may be stored as a time series of state data, particularly as a continuous time series. During pre-training with simulations, state data computed by the simulation execution unit may be stored as historical data.
[0055] As an option, a future data storage unit is further provided. This unit may store predicted state data, which is computed based on, for example, weather forecasts, as described above.
[0056] Both the historical data storage unit and the optional future data storage unit receive a determined time length, i.e., the determined range of past state data, as input information from the storage time length calculation unit. The amount of data to be stored by the historical data storage unit and the optional future data storage unit, or at least the amount of state data to be output by these units, is set according to this determined time length and range.
[0057] Furthermore, the HVAC controller 14 includes a training period length determination unit. This unit is configured to determine the pre-training time period to be applied by the simulation. Therefore, the training period length determination unit may also be called a pre-training time period determination unit. The method by which this period is determined will be examined below with respect to the equations in Figures 6 to 9.
[0058] The training period length determination unit, the historical data storage unit, and optionally the future data storage unit output data to the reinforcement learning unit. Specifically, the training period length determination unit outputs the determined pre-training time period that should be applied by the simulation, while the historical data storage unit and the future data storage unit each output the time series of state data stored in them. The range or amount of such state data is set according to the calculation result of the storage time length calculation unit.
[0059] The reinforcement learning unit uses the input data to limit pre-training to each pre-training time period. Furthermore, the received historical state data is used as historical data for the simulation and may be referenced, for example, along with the state data for the current instance or iteration to account for hysteresis.
[0060] Reinforcement learning algorithms typically assume the Markov property. This means that state transitions and rewards at each step depend only on the preceding state and the selected action. Under this assumption, earlier states and actions do not influence future outcomes. Therefore, incorporating hysteresis into RL algorithms requires some kind of modification, as discussed below.
[0061] One way to account for hysteresis is to extend the current state by incorporating the history of related events that influence the next state. By embedding this history information into the current state, hysteresis can be considered while still maintaining the Markov property assumed in reinforcement learning. This method can also be applied to future information by incorporating future information, such as forecasts, into the current state and making it available for decision-making.
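The state extension described in this paragraph is commonly implemented as a sliding window over recent observations. The following is a minimal sketch under that assumption; the class name, the padding with the initial state, and the flat tuple layout are illustrative choices, not details from the disclosure.

```python
from collections import deque
from typing import Deque, Sequence, Tuple


class HistoryAugmenter:
    """Builds an extended state from the current and the last n past states."""

    def __init__(self, n_history: int):
        if n_history < 0:
            raise ValueError("n_history must be non-negative")
        # The window holds the n past states plus the current one.
        self._window: Deque[Tuple[float, ...]] = deque(maxlen=n_history + 1)

    def reset(self, initial_state: Sequence[float]) -> Tuple[float, ...]:
        """Fill the window with the initial state (assumed padding rule)."""
        self._window.clear()
        for _ in range(self._window.maxlen):
            self._window.append(tuple(initial_state))
        return self.augmented()

    def push(self, state: Sequence[float]) -> Tuple[float, ...]:
        """Add the newest state; the oldest one drops out automatically."""
        self._window.append(tuple(state))
        return self.augmented()

    def augmented(self) -> Tuple[float, ...]:
        """Flatten the window, oldest first, into one extended state."""
        return tuple(value for state in self._window for value in state)
```

Because the extended state carries the relevant history, the learning algorithm can still treat each transition as depending only on the (extended) preceding state. Forecast values could be appended to the same tuple in the same way.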
[0062] Figure 3a further illustrates this point. This figure represents a state transition diagram of a Markov decision process (MDP), on which the reinforcement learning of the present disclosure also relies. In this process, when in state S_t, action A_t is taken according to policy π (shown by a dotted line in the figure), transitioning to state S_{t+1} and receiving an immediate reward R_{t+1}. As illustrated, state transitions and rewards at each point in time usually depend only on the preceding state and the selected action.
[0063] Therefore, in a system where past states and actions can influence state transitions and rewards as intended by this disclosure, these factors can be incorporated into preceding states and influence decisions.
[0064] Based on this input information, the reinforcement learning unit trains the controller, for example, based on Q-learning, and outputs appropriate control actions. During pre-training, these control actions are output to the simulation execution unit. During real-world deployment, these control actions are output to the control value determination unit, which converts the received control actions into control values, such as voltage signals, on or off signals, or data flags, that are appropriate for actually controlling the equipment in the HVAC system. These control values are finally output by the output unit.
[0065] It should be noted that during real-world deployment and once the simulation is complete, at least the simulation execution unit, power consumption model establishment unit, building model construction unit, storage time length calculation unit, and training period length determination unit can be left idle. On the other hand, the historical data storage unit and the optional future data storage unit can still update the stored data. The reinforcement learning unit can continue referencing the stored data up to a range predetermined by the storage time length calculation unit. In other words, the same range of past state data determined during and / or in preparation for the simulation can be used during real-world deployment to improve control accuracy.
[0066] Based on the above, Figure 4 shows a structure 102 according to an alternative embodiment of the present disclosure. Structure 102 includes an HVAC controller 14 and another computer system 100, for example, an external computer associated with the HVAC controller 14. As indicated by the arrows, the computer system 100 and the HVAC controller 14 are connected to exchange data. The computer system 100 may be provided in particular during the simulation phase, when pre-training the HVAC controller 14. To perform the simulation phase and pre-training, the computer system 100 may be selectively connected only to the HVAC controller 14, for example. At least one of the units mentioned above (i.e., the simulation execution unit, the power consumption model establishment unit, the building model construction unit, the storage time length calculation unit, and the training period length determination unit), which are effective only during or in preparation for the simulation, may be implemented in the computer system 100 rather than in the HVAC controller 14, for example, as their respective software units.
[0067] Figure 5 shows an exemplary spatial model in the form of a building model, such as that constructed by the building model construction unit described above. The model is a so-called lumped model and is generally similar to a low-dimensional model. As is common for lumped models, the model depicted is a simplified mathematical representation of the spatial thermal properties of the building that ignores spatial variations of specific properties and models the system as a set of separate, uniform components.
[0068] Specifically, the lumped model assumes that the entire space or building can be considered to have a uniform temperature at any given point in time. The thermal properties considered in this model include heat capacity C and thermal resistance R. Environmental conditions, such as those measured by sensors and interacting with the modeled building, include the outdoor temperature T_o and the indoor temperature T_i. A further related parameter is
[0069]
[Math: symbol not reproduced in the extracted text]
[0070] which represents the behavior of the HVAC system 10 in terms of heat input.
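A building model of this kind, with one thermal resistance and one heat capacity, is commonly written as the energy balance C · dT_i/dt = (T_o − T_i) / R + Q, where Q is the heat input of the HVAC system. The exact formula of the disclosure is not reproduced in the text, so this form is an assumption. The sketch below integrates that assumed balance with an explicit Euler step. The function name, the values of R and C and the time step are hypothetical.

```python
def simulate_rc(t_in0, t_out, q_hvac, r, c, dt, steps):
    """Explicit-Euler integration of C * dT_i/dt = (T_o - T_i) / R + Q.

    Assumed model form; units: r in K/W, c in J/K, q_hvac in W, dt in s.
    """
    temps = [t_in0]
    for _ in range(steps):
        t_in = temps[-1]
        d_temp = ((t_out - t_in) / r + q_hvac) / c
        temps.append(t_in + dt * d_temp)
    return temps


R = 0.005   # K/W, hypothetical thermal resistance
C = 2.0e6   # J/K, hypothetical heat capacity (R * C is about 10000 s)

# Free response with the HVAC system off: the room drifts towards T_o.
trace = simulate_rc(t_in0=20.0, t_out=30.0, q_hvac=0.0, r=R, c=C, dt=60.0, steps=600)
```

After roughly one time constant R · C, the indoor temperature has covered about 63 % of the gap to the outdoor temperature, which is why R · C is a natural measure of how long past states remain relevant.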
[0071] Depending on the season and whether the function of the HVAC system 10 during that season is primarily related to cooling or heating, different building models may be selected and / or the parameters R and C may be adjusted. For example, depending on the season, the way in which indoor and outdoor heat loads affect room temperature may differ.
[0072] Figure 6 is a flowchart of a method according to one embodiment of the present invention. In step S1, a building model and a system model are prepared. The system model is also referred to as the device model. The building model is also referred to as the RC building model, reflecting its optional lumped nature as shown in Figure 5 and the equation in Figure 7.
[0073] The building model may be determined, for example, by the building model construction unit discussed above, whereas the device model may be determined, for example, as part of the power consumption model described above and / or as a substitute for the power consumption model.
[0074] In the optional step S2, parameter fitting between the constructed model and the measured data can be included. In this way, the model can be more accurately adjusted to fit the measured real-world data using any known strategy.
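The disclosure leaves the fitting strategy open. As one illustrative possibility, and assuming the first-order building model C · dT_i/dt = (T_o − T_i) / R with the HVAC system off and a constant outdoor temperature, the product R · C can be estimated from a cooling or warming curve by a log-linear least-squares fit. The function name and the synthetic data below are hypothetical; no measured data from the source is used.

```python
import math


def fit_time_constant(times, t_in, t_out):
    """Estimate R*C from a free response T_i(t) - T_o = A * exp(-t / (R*C)).

    Fits a straight line to log|T_i - T_o| over time; the slope is -1 / (R*C).
    """
    ys = [math.log(abs(ti - t_out)) for ti in t_in]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    var = sum((t - mean_t) ** 2 for t in times)
    return -1.0 / (cov / var)


# Synthetic, noise-free "measurements" generated with a known time constant.
TRUE_RC = 7200.0                                 # seconds, hypothetical
times = [i * 600.0 for i in range(12)]           # one sample every 10 minutes
measured = [30.0 - 8.0 * math.exp(-t / TRUE_RC) for t in times]

rc_est = fit_time_constant(times, measured, t_out=30.0)
```

Such a fit identifies only the product R · C. Separating R and C would need additional information, for example a known heat input.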
[0075] In step S3, the storage time length τ_save, which determines the range of past state data, is determined. This is done, for example, based on an apparatus model that models at least the indoor and / or outdoor units of the HVAC system 10, more specifically, for example, their capacity for heat transfer. The exemplary equation in Figure 7, derived from such a model, shows a first measure for determining the storage time length τ_save. As can be inferred from the equation in Figure 7, the building-specific parameter τ1 is calculated from the building model parameters R and C, in particular as the product of these parameters. This has been found to be a suitable measure that yields accurate and useful results. Note that the unit of τ1 when calculated by the equation in Figure 7 is seconds.
[0076] Furthermore, based on the equation in Figure 8, which corresponds to an HVAC system model as an example of an apparatus model, additional parameters τ2 and optionally τ3 are determined. In the equation in Figure 8, Q relates, for example, to the amount of heat transferred at time t. Q_max defines the steady state that is estimated to be reached with respect to heat transfer. T is the time constant that characterizes the speed of the process, e.g., how fast the system reaches equilibrium. T may be determined as the step response time, which can be measured or calculated from the system specifications, or it may depend on this step response time. In other words, the parameter T relates to how quickly the system establishes a steady temperature within a given space. The parameter e is Euler's number.
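The equation of Figure 8 is not reproduced in the text. A first-order step response Q(t) = Q_max · (1 − e^(−t/T)) is consistent with the symbols described (Q, Q_max, T, e), and the sketch below assumes that form. The function name and the numerical values are hypothetical.

```python
import math


def heat_transfer(t, q_max, time_constant):
    """Assumed first-order step response Q(t) = Q_max * (1 - exp(-t / T))."""
    return q_max * (1.0 - math.exp(-t / time_constant))


Q_MAX = 5000.0    # W, hypothetical steady-state heat transfer
T_CONST = 300.0   # s, hypothetical time constant T

# Fraction of the steady state reached after exactly one time constant.
fraction_at_T = heat_transfer(T_CONST, Q_MAX, T_CONST) / Q_MAX
```

Under this assumed form, the system reaches about 63 % of its steady state after one time constant T, which gives T a direct interpretation as a measure of how long a past control action keeps influencing the present.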
[0077] The parameter τ2 is equal to T and can also be in seconds or another unit of time here. For example, if there are multiple cooperative HVAC systems controlled by the same HVAC controller 14, such a parameter τ that is equal to T may be determined from each of their system models defined by the equations in FIG. 8 for each of the systems. In this way, further parameters such as τ3 can be determined.
[0078] Finally, the storage time length τ_save, i.e., the range of past state data, is determined as the maximum of the parameters τ1, τ2 and, where determined, τ3 (see the equation in FIG. 9). As an option, especially when there are no multiple cooperative HVAC systems, τ_save may also be set exactly equal to either τ1 or τ2.
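The selection of τ_save as the largest of the building time constant τ1 = R · C and the system time constants τ2, τ3 can be sketched as follows. The function name and all numerical values are hypothetical.

```python
def storage_time_length(r, c, system_time_constants):
    """Return tau_save = max(tau_1, tau_2, tau_3, ...) in seconds.

    tau_1 = R * C comes from the building model; the remaining values are
    the time constants T of each HVAC system model.
    """
    tau_1 = r * c
    return max([tau_1, *system_time_constants])


# Building dominated case: R * C (about 10000 s) exceeds both system constants.
tau_save = storage_time_length(r=0.005, c=2.0e6, system_time_constants=[300.0, 900.0])

# System dominated case: a light space with a slow second HVAC system.
tau_save_light = storage_time_length(r=0.001, c=1.0e5, system_time_constants=[300.0, 900.0])
```

Taking the maximum ensures that the slowest relevant dynamic, whether of the building or of any of the cooperating systems, is fully covered by the retained history.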
[0079] Therefore, τ_save defines the length of the time span, in the form of a time series, for which the state data should be retained and stored. In this way, τ_save is converted into a certain amount or range of data, depending, for example, on the time resolution of said state data.
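The conversion of τ_save into an amount of data can be illustrated as a simple division by the sampling interval, rounded up so that the full time span is covered. The function name and the sampling intervals are hypothetical.

```python
import math


def history_samples(tau_save_s, resolution_s):
    """Number of stored samples needed to span tau_save at a given resolution."""
    if resolution_s <= 0:
        raise ValueError("resolution_s must be positive")
    return math.ceil(tau_save_s / resolution_s)


# 10000 s of history sampled every 5 minutes.
n_samples = history_samples(10000.0, 300.0)

# One hour of history sampled every minute.
n_samples_fine = history_samples(3600.0, 60.0)
```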
[0080] In step S4, the pre-training time period T_sim to be covered by the simulation, in other words the simulation time length, is determined. According to the equation in FIG. 10, this simulation time length T_sim is determined as the product of τ_save and N, where N is an integer. Considering the constraints of computer resources, the standard value of N may be set to approximately 2 to 3, or 1 to 4. This choice is aimed at avoiding overfitting while effectively capturing the hysteresis pattern within the target period.
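The relation T_sim = N · τ_save can be sketched as below. The default N = 3 lies within the range mentioned in the text; the function name and the value of τ_save are hypothetical.

```python
def simulation_time_length(tau_save_s, n=3):
    """Return T_sim = N * tau_save, with N a positive integer."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    return n * tau_save_s


t_sim = simulation_time_length(10000.0, n=3)   # seconds
t_sim_hours = t_sim / 3600.0
```

With a τ_save of 10000 s, this choice yields a simulated period of a little over eight hours per pre-training run.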
[0081] Then, according to step S5, the RL-based controller 14 (referred to as the RL agent in S5) is trained using the parameters τ_save and T_sim. This means that during each training iteration, and throughout the simulation time length T_sim, the respective range τ_save of historical state data is taken into account.
[0082] Therefore, by determining, from the building model and equipment model parameters, the data length τ_save that needs to be retained in order to take hysteresis into account, the computing time can be reduced. Furthermore, determining the simulation time length T_sim ensures that meaningful simulation results, and therefore meaningful pre-training results, are generated within a reasonable timeframe.
[0083] Then, according to step S6, the HVAC controller 14 (i.e., the RL agent), once trained, can be deployed in the real world to control at least one HVAC system 10. During deployment, τ_save may still be applied by the HVAC controller 14.
[0084] Furthermore, transfer learning from the simulation to the actual space and / or building may be performed. This transfer learning can be performed, for example, by retraining all layers in the neural network of the controller used in the simulation-based learning, or by retraining only the output layers of the neural network.
[0085] In yet another embodiment, the results of a reinforcement learning process according to any of the examples disclosed herein can be shared, for example, with other controllers and / or HVAC systems. This can be done, for example, via an online cloud, via some other server, or via another communication link. For example, if the heat source configuration is the same for some other controller, and / or the parameters of the building model are the same as those of an existing pre-trained controller, then pre-training of the reinforcement learning simulation for that other controller can be omitted or at least reduced, and the desired control policy can be acquired, at least partially, by transfer learning. Furthermore, the transfer may be applied to only a portion of some controller that has already been at least partially trained, for example, only to a specific layer, and as a result, the pre-trained knowledge that can be derived, for example, by any of the disclosed embodiments, can be maintained, at least partially. This enables effective deployment in real environments and adds value to the pre-training process as an additional feature.
[0086] The various aspects of this disclosure are summarized below as supplementary notes.
[0087] (Note 1) A method for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or a heating, ventilation and air conditioning (HVAC) system (10), the method including a step of determining, based on a spatial model that models the space (12) to which the AC or HVAC system (10) is applied and / or based on a model of the AC or HVAC system, at least one of the following simulation parameters a) and b): a) the range of past state data (τ_save) based on which the controller (14) should be pre-trained in at least one pre-training iteration of the simulation; b) the pre-training time period (T_sim) to be covered by the simulation.
(Note 2) The method according to Note 1, comprising at least the step of determining parameter a), wherein the range of past state data (τ_save) is determined based on at least one of the following parameters of the spatial model, in particular based on the product of these parameters: - the resistance parameter (R) of the space (12), representing the thermal resistance to heat transfer; - the capacity parameter (C), representing the heat capacity of the space (12) for storing heat.
(Note 3) The method according to Note 1 or 2, wherein the simulation parameter b) is determined based on the simulation parameter a), in particular based on the product of that parameter and an integer.
(Note 4) The method according to any one of Notes 1 to 3, wherein the space (12) is a building or is contained in a building.
(Note 5) The method according to any one of Notes 1 to 4, wherein the spatial model includes a lumped model of the space (12).
(Note 6) The method according to any one of Notes 1 to 5, further comprising the step of adjusting at least one parameter of the spatial model based on data measured for the actual space (12).
(Note 7) The method according to any one of Notes 1 to 6, further comprising the step of adjusting at least one parameter of the spatial model based on the season being simulated.
(Note 8) The method according to any one of Notes 1 to 7, wherein the system model includes a function that describes the time rate at which the AC or HVAC system (10) reaches a target steady state.
(Note 9) The method according to any one of Notes 1 to 8, wherein the past state data includes historical state data that is pre-stored and / or recorded during preceding pre-training iterations.
(Note 10) The method according to any one of Notes 1 to 9, further comprising the step of performing the simulation using the at least one determined simulation parameter, including performing a series of pre-training iterations.
(Note 11) The method according to any one of Notes 1 to 10, further comprising a step of controlling the AC or HVAC system (10) using the pre-trained controller (14), wherein, in particular, the controller (14) is configured to take into account, during control, past state data within the same range (τ_save) as defined by the simulation parameter a).
(Note 12) The method according to any one of Notes 1 to 11, wherein the controller (14) is trained based on its past actions.
(Note 13) A reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or for a heating, ventilation and air conditioning (HVAC) system (10), comprising means for performing the method according to any one of Notes 1 to 12.
(Note 14) The controller (14) according to Note 13, wherein the means includes at least one processor and at least one computer-readable medium on which a computer program is stored, the computer program including instructions that, when the program is executed by the processor, cause the processor to perform the method according to Note 1.
(Note 15) A structure (102) for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or a heating, ventilation and air conditioning (HVAC) system (10), wherein the structure (102) includes: - a controller (14) configured to control the AC or HVAC system (10); and - a computer (100) that is operably connected to the controller (14); wherein the structure (102) is ○ configured to determine at least one simulation parameter for the simulation based on a model representing the space (12) to which the AC or HVAC system (10) is applied, and / or based on a model of the AC or HVAC system, and ○ configured to perform the simulation using the at least one determined simulation parameter; and wherein the at least one simulation parameter is at least one of: a) the range of past state data (τ_save) based on which the controller (14) should be pre-trained in at least one pre-training iteration of the simulation; b) the pre-training time period (T_sim) to be covered by the simulation.
[Explanation of Symbols]
[0088] 10 HVAC system, 12 space, 14 HVAC controller.
Claims
1. A method for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or for a heating, ventilation and air conditioning (HVAC) system (10), the method including a step of determining, based on a spatial model that models the space (12) to which the AC or HVAC system (10) is applied and / or based on a model of the AC or HVAC system, at least one of the following simulation parameters a) and b): a) the range of past state data (τ_save) based on which the controller (14) should be pre-trained in at least one pre-training iteration of the simulation; b) the pre-training time period (T_sim) to be covered by the simulation.
2. The method according to claim 1, comprising at least the step of determining parameter a), wherein the range of past state data (τ_save) is determined based on at least one of the following parameters of the spatial model, in particular based on the product of these parameters: - the resistance parameter (R) of the space (12), representing the thermal resistance to heat transfer; - the capacity parameter (C), representing the heat capacity of the space (12) for storing heat.
3. The method according to claim 1 or 2, wherein the simulation parameter b) is determined based on the simulation parameter a), in particular based on the product of that parameter and an integer.
4. The method according to claim 1 or 2, wherein the space (12) is a building or is contained in a building.
5. The method according to claim 1 or 2, wherein the spatial model includes a lumped model of the space (12).
6. The method according to claim 1 or 2, further comprising the step of adjusting at least one parameter of the spatial model based on data measured for the actual space (12).
7. The method according to claim 1 or 2, further comprising the step of adjusting at least one parameter of the spatial model based on the season being simulated.
8. The method according to claim 1 or 2, wherein the system model includes a function that describes the time rate at which the AC or HVAC system (10) reaches a target steady state.
9. The method according to claim 1 or 2, wherein the past state data includes historical state data that is pre-stored and / or recorded during preceding pre-training iterations.
10. The method according to claim 1 or 2, further comprising the step of performing the simulation using the at least one determined simulation parameter, including performing a series of pre-training iterations.
11. The method according to claim 1 or 2, further comprising a step of controlling the AC or HVAC system (10) using the pre-trained controller (14), wherein the controller (14) is configured to take into account, during control, past state data within the same range (τ_save) as defined by the simulation parameter a).
12. The method according to claim 1 or 2, wherein the controller (14) is trained based on its past actions.
13. A reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or for a heating, ventilation and air conditioning (HVAC) system (10), comprising means for performing the method according to claim 1 or 2.
14. The controller (14) according to claim 13, wherein the means includes at least one processor and at least one computer-readable medium on which a computer program is stored, the computer program including instructions that, when the program is executed by the processor, cause the processor to perform the method according to claim 1.
15. A structure (102) for determining at least one simulation parameter for a simulation to pre-train a reinforcement learning (RL) based controller (14) for an air conditioning (AC) system or a heating, ventilation and air conditioning (HVAC) system (10), wherein the structure (102) includes: - a controller (14) configured to control the AC or HVAC system (10); and - a computer (100) that is operably connected to the controller (14); wherein the structure (102) is ○ configured to determine at least one simulation parameter for the simulation based on a model representing the space (12) to which the AC or HVAC system (10) is applied, and / or based on a model of the AC or HVAC system, and ○ configured to perform the simulation using the at least one determined simulation parameter; and wherein the at least one simulation parameter is at least one of: a) the range of past state data (τ_save) based on which the controller (14) should be pre-trained in at least one pre-training iteration of the simulation; b) the pre-training time period (T_sim) to be covered by the simulation.