A multi-energy supply system coordinated scheduling optimization method based on reinforcement learning
Through an improved model reinforcement learning framework combined with dynamic confidence propagation and adaptive collaborative strategy generation, the method addresses the computational complexity and real-time limitations of scheduling optimization for multi-energy combined supply systems, and achieves efficient, stable collaborative operation and resource matching across the energy subsystems.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: CHN ENERGY NEW ENERGY TECHNOLOGY RESEARCH INSTITUTE CO LTD
- Filing Date: 2025-11-13
- Publication Date: 2026-06-23
AI Technical Summary
Traditional scheduling optimization methods for multi-energy combined systems suffer from high computational complexity and insufficient real-time performance when dealing with the dynamic, nonlinear, and high-dimensional characteristics of complex multi-energy coupled systems, making it difficult to meet dynamic scheduling requirements. Furthermore, machine learning-based methods suffer from unstable training, slow convergence speed, and insufficient policy generalization ability.
An improved model reinforcement learning framework is adopted, which combines a dynamic confidence propagation mechanism, an adaptive collaborative strategy generation mechanism and a risk-aware value assessment mechanism to construct an intelligent decision-making framework for multi-energy systems. Through dynamic cognitive modeling, adaptive strategy generation and risk assessment, the multi-energy system can achieve efficient and stable collaborative operation in dynamic environments.
The method significantly improves the collaborative operation efficiency and intelligent decision-making capability of multi-energy systems in complex environments, makes the definitions of the system's state space and action space more complete and computable, reduces model error accumulation, strengthens the system's operational robustness and prediction accuracy under external disturbances, and realizes efficient energy transfer and resource matching among multi-energy systems.
Figure CN121745527B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of energy system optimization, and in particular to a method for collaborative scheduling optimization of multi-energy combined supply systems based on reinforcement learning.
Background Technology
[0002] With the diversification of energy structures and the development of smart grids and regional energy internet, various energy systems such as electricity, heat, cooling, gas, and energy storage are gradually interconnected, forming multi-energy combined supply systems. These systems, through energy coupling and optimized allocation, can significantly improve energy utilization efficiency, reduce system operating costs, and promote a high proportion of renewable energy integration. However, due to the complex dynamic coupling relationships, spatiotemporal correlations, and uncertainties among different energy sources, traditional scheduling optimization methods exhibit significant limitations in addressing the dynamic, nonlinear, and high-dimensional characteristics of complex multi-energy coupled systems.
[0003] Existing methods for scheduling optimization in multi-energy combined supply systems fall mainly into two categories: optimization methods based on mathematical programming and predictive control methods based on machine learning. The former, such as mixed-integer programming, dynamic programming, or robust optimization, rely on accurate system modeling and linearization assumptions and typically require a global optimum to be solved within a fixed time period. However, in complex real-world operating environments, frequent changes in energy supply and demand, nonlinear equipment operating characteristics, and the difficulty of accurately modeling external disturbances lead to high computational complexity and insufficient real-time performance, so these methods struggle to meet dynamic scheduling requirements. The latter use neural networks or deep reinforcement learning models to model and predict system states, which improves adaptability to some extent. However, their training often relies on large amounts of real-world operating data and lacks the support of model constraints, making them prone to unstable training, slow convergence, and insufficient policy generalization.
Summary of the Invention
[0004] One objective of this invention is to propose a collaborative scheduling optimization method for multi-energy combined supply systems based on reinforcement learning. This invention fully utilizes an improved model reinforcement learning framework, a dynamic confidence propagation mechanism, an adaptive collaborative strategy generation mechanism, and a risk-aware value assessment mechanism. During the operation of the multi-energy system, an intelligent decision-making framework integrating cognition, optimization, and evaluation is established, enabling the multi-energy system to operate efficiently, stably, and collaboratively in a dynamic environment.
[0005] According to an embodiment of the present invention, a collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning includes the following steps:
[0006] Collect and preprocess real-time operating data and external environmental data from the combined energy supply system to form a standardized dataset;
[0007] A mathematical model of the multi-energy combined supply system is established based on a standardized dataset. The system state space, system action space and system constraints are defined, and the system dynamic behavior equations are constructed.
[0008] Based on the system's dynamic behavior equations and historical operating data of the combined energy supply system, a system dynamic dataset is generated.
[0009] The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module of the improved model-based reinforcement learning (MBRL) framework for training. A dynamic confidence propagation mechanism is introduced to output the confidence of the prediction results.
[0010] Within each scheduling cycle, a confidence-corrected prediction sequence is generated based on the confidence level of the prediction results and the external disturbance information of the multi-energy combined supply system, and a set of candidate scheduling actions is formed in combination with system constraints.
[0011] The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework, and a multi-energy collaborative scheduling strategy is generated using a dynamic game and policy update mechanism.
[0012] The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to evaluate the long-term benefits of the multi-energy collaborative scheduling strategy, obtain the evaluation results, and select the optimal scheduling strategy for the current iteration stage based on the evaluation results.
[0013] The optimal scheduling strategy of the current iteration stage is input into the multi-energy combined supply system for execution, feedback information from system operation is collected, and the optimal collaborative scheduling strategy is output.
[0014] The optimal cooperative scheduling strategy is input into the experience pool of the improved model reinforcement learning framework. Through the experience sample replay and model parameter update mechanism, joint iterative training is performed to obtain the updated improved model reinforcement learning framework.
[0015] Optionally, the standardized dataset includes predicted values of electrical load, thermal load, cooling load, gas load, equipment operating status, energy storage unit state of charge, and energy market price information. The preprocessing includes missing value completion, outlier correction, and data normalization.
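The three preprocessing operations named above can be illustrated with a minimal sketch. The specific techniques are assumptions chosen for illustration and are not stated in the patent: linear interpolation for missing values, clipping to a median-absolute-deviation band for outliers, and min-max scaling for normalization. The function name `preprocess` and the sample values are hypothetical.

```python
import numpy as np

def preprocess(series, z_max=3.0):
    """Fill gaps, clip outliers, and min-max normalize one measurement channel."""
    x = np.asarray(series, dtype=float).copy()
    idx = np.arange(x.size)
    missing = np.isnan(x)
    # Missing value completion (assumed method): linear interpolation
    # between the nearest valid samples.
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    # Outlier correction (assumed method): clip to median +/- z_max * robust sigma,
    # where the robust sigma is derived from the median absolute deviation.
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    sigma = 1.4826 * mad if mad > 0 else 1.0
    x = np.clip(x, med - z_max * sigma, med + z_max * sigma)
    # Data normalization (assumed method): scale to the range [0, 1].
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Hypothetical electrical load readings in kW with one gap and one spike.
load_kw = [100.0, np.nan, 104.0, 900.0, 98.0, 102.0]
clean = preprocess(load_kw)
```

Each channel (electrical load, thermal load, price, and so on) would be processed separately in this sketch, since their units and ranges differ.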
[0016] Optionally, the construction of the system's dynamic behavior equations specifically includes:
[0017] A mathematical model of a multi-energy combined supply system is established based on a standardized dataset. The process of establishing the model involves constructing energy balance equations, equipment operation constraint equations, and multi-energy coupling relationship equations for the electric energy system, thermal energy system, cold energy system, gas system, and energy storage system using a standardized dataset, thereby generating a mathematical model of the multi-energy combined supply system.
[0018] Based on the mathematical model of the multi-energy combined supply system, the system state space, system action space and system constraints are defined. The system state space is defined by selecting state variables that characterize the operation of the multi-energy combined supply system. The system action space is defined by the controllable scheduling variables of each energy subsystem. The system constraints are defined by the energy balance equation, equipment capacity limits and safe operation boundary conditions.
[0019] Based on the system state space, system action space, and system constraints, a system dynamic behavior equation is constructed. The construction process involves associating the system state changes with the control actions and external factors through energy balance and coupling relationships, thereby forming the system dynamic behavior equation.
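As a rough illustration of how a state variable, control actions and constraints can be tied together through an energy balance, the sketch below models a single storage unit on an electric bus. All names, parameter values and the choice of a linear state-of-charge update are assumptions for illustration; the patent does not give the equations.

```python
from dataclasses import dataclass

@dataclass
class StorageParams:
    # Hypothetical equipment parameters.
    capacity_kwh: float = 200.0
    eta_charge: float = 0.95
    eta_discharge: float = 0.95
    soc_min: float = 0.1
    soc_max: float = 0.9

def step_soc(soc, p_charge_kw, p_discharge_kw, dt_h, p):
    """Discrete-time state-of-charge update (state transition for one state variable)."""
    delta_kwh = (p.eta_charge * p_charge_kw - p_discharge_kw / p.eta_discharge) * dt_h
    return soc + delta_kwh / p.capacity_kwh

def power_balance_residual(p_grid_kw, p_chp_kw, p_discharge_kw, p_charge_kw, load_kw):
    """Electric bus balance: supply minus demand, zero when balanced."""
    return p_grid_kw + p_chp_kw + p_discharge_kw - p_charge_kw - load_kw

def is_feasible(soc_next, residual_kw, p, tol_kw=1e-6):
    """System constraints: safe state-of-charge band and energy balance."""
    return p.soc_min <= soc_next <= p.soc_max and abs(residual_kw) <= tol_kw

params = StorageParams()
soc_next = step_soc(0.5, p_charge_kw=0.0, p_discharge_kw=38.0, dt_h=1.0, p=params)
residual = power_balance_residual(60.0, 40.0, 38.0, 0.0, 138.0)
```

Here the state is the storage state of charge, the actions are the power set-points, and the load is the external factor; a full model would add analogous equations for the thermal, cold and gas subsystems.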
[0020] Optionally, the generation of the system dynamic dataset specifically includes:
[0021] The state change process of the multi-energy combined supply system is discretized based on the system dynamic behavior equation, and the system state transition relationship is formed.
[0022] The system state information, system action information, and external disturbance information in the system state transition relationship are aligned and organized in time series order, and combined with the historical operation data of the multi-energy supply system to form a time series sample set;
[0023] The time-series sample set is classified and labeled according to different energy types to form a dynamic sample subset, which includes electric energy systems, thermal energy systems, cold energy systems, gas systems and energy storage systems.
[0024] Data consistency verification and anomaly correction are performed on the dynamic sample subsets. Based on system energy balance constraints, when the energy balance deviation of a sample in a dynamic sample subset is greater than a preset threshold, the sample is judged to be abnormal and abnormal data processing is performed, forming a consistency-verified dynamic sample set.
[0025] The consistency-verified dynamic sample set is merged to form the system dynamic dataset, which includes the system state, system action, external disturbance and the corresponding system state at the next moment.
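The dataset construction steps can be sketched as follows. This is a simplified illustration under stated assumptions: each action vector holds the power supplied by each source, the first disturbance entry is the load, and abnormal samples are only flagged rather than corrected, because the patent does not specify the abnormal data processing. All function names are hypothetical.

```python
import numpy as np

def build_transitions(states, actions, disturbances):
    """Align time series into (s_t, a_t, d_t, s_{t+1}) samples."""
    n_steps = len(states) - 1
    return [(states[t], actions[t], disturbances[t], states[t + 1])
            for t in range(n_steps)]

def balance_deviation(action, disturbance):
    """Assumed balance check: total supplied power versus the load."""
    return abs(float(np.sum(action)) - float(disturbance[0]))

def consistency_filter(samples, threshold_kw):
    """Split samples into those within the balance threshold and those flagged."""
    kept, flagged = [], []
    for sample in samples:
        deviation = balance_deviation(sample[1], sample[2])
        (flagged if deviation > threshold_kw else kept).append(sample)
    return kept, flagged

# Hypothetical series: storage state of charge, two supply set-points, load.
states = np.array([[0.50], [0.45], [0.40], [0.42]])
actions = np.array([[60.0, 40.0], [70.0, 35.0], [50.0, 20.0]])
disturbances = np.array([[100.0], [105.0], [90.0]])

samples = build_transitions(states, actions, disturbances)
kept, flagged = consistency_filter(samples, threshold_kw=5.0)
```

In this example the third sample supplies 70 kW against a 90 kW load, so it exceeds the 5 kW threshold and is flagged.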
[0026] Optionally, the output of the confidence level of the prediction result specifically includes:
[0027] The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module in the improved model reinforcement learning framework to construct a state transition model. The state transition model is constructed by fitting the system state change law through nonlinear mapping based on the system dynamic dataset.
[0028] The state transition model is trained, the difference between the predicted output of the state transition model and the actual system state is calculated, and the parameters of the state transition model are adjusted.
[0029] A dynamic confidence propagation mechanism is introduced during the parameter adjustment process of the state transition model to obtain the confidence level of the prediction result. The confidence level of the prediction result is obtained by adding a confidence propagation unit in the hidden layer of the state transition model, combining the hidden layer output with the confidence weight matrix and bias parameters, and mapping it through the Sigmoid function.
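A minimal forward-pass sketch of a state transition model with a confidence unit is given below. It follows the description above in that the confidence is a Sigmoid of the hidden-layer output combined with a weight matrix and bias; the single tanh hidden layer, the layer sizes and the class name `TransitionModel` are assumptions, and training of the parameters is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TransitionModel:
    """One-hidden-layer predictor with a confidence head (untrained sketch)."""

    def __init__(self, n_in, n_hidden, n_state, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b2 = np.zeros(n_state)
        # Confidence weight matrix and bias applied to the hidden-layer output.
        self.Wc = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.bc = np.zeros(n_state)

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)          # nonlinear hidden layer
        s_next = self.W2 @ h + self.b2              # predicted next state
        conf = sigmoid(self.Wc @ h + self.bc)       # confidence in (0, 1)
        return s_next, conf

# Input is assumed to be the concatenation of state, action and disturbance.
model = TransitionModel(n_in=6, n_hidden=8, n_state=3)
x = np.ones(6) * 0.5
s_next, conf = model.forward(x)
```

The confidence head produces one value per predicted state component, which allows later steps to weight each component separately.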
[0030] Optionally, the formation of the candidate scheduling action set specifically includes:
[0031] During each scheduling cycle, external disturbance information of the multi-energy combined supply system is acquired, and the external disturbance information vector is input into the multi-energy system dynamic cognitive modeling module to perform disturbance correction on the system state prediction sequence and generate a disturbance correction prediction sequence. The external disturbance information includes weather changes, load fluctuations, energy price changes and equipment operating status changes.
[0032] Based on the confidence level of the prediction results, the disturbance correction prediction sequence is weighted and corrected to form a confidence-corrected prediction sequence. This is done by weighting and adjusting the predicted system state at each time step, so that the result depends more on the model output when confidence is high and more on the current system state when confidence is low.
[0033] The confidence-corrected prediction sequence is combined with system constraints to construct a set of candidate scheduling actions. The construction process involves screening scheduling actions based on energy balance constraints, equipment capacity constraints, and safe operation constraints to obtain candidate scheduling actions that satisfy all constraints, thus forming a set of candidate scheduling actions.
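The weighting rule and the constraint screening described above can be sketched as a convex blend followed by a filter. The blend `c * prediction + (1 - c) * current` is one natural reading of the text and is an assumption, as are the function names, the balance tolerance and the sample numbers.

```python
import numpy as np

def confidence_correct(pred_seq, conf_seq, current_state):
    """Blend each predicted state with the current state by its confidence."""
    pred_seq = np.asarray(pred_seq, dtype=float)
    conf_seq = np.asarray(conf_seq, dtype=float)
    current = np.asarray(current_state, dtype=float)
    return conf_seq * pred_seq + (1.0 - conf_seq) * current

def screen_actions(actions, load_kw, p_max_kw, tol_kw=1.0):
    """Keep only actions that satisfy capacity limits and the energy balance."""
    feasible = []
    for action in actions:
        a = np.asarray(action, dtype=float)
        within_capacity = np.all((a >= 0.0) & (a <= p_max_kw))
        balanced = abs(a.sum() - load_kw) <= tol_kw
        if within_capacity and balanced:
            feasible.append(a)
    return feasible

# Two-step load prediction (kW) with per-step confidence; current load is 100 kW.
corrected = confidence_correct([[120.0], [130.0]], [[0.9], [0.5]], [100.0])

# Candidate set-points for two sources against the corrected first-step load.
candidates = screen_actions(
    actions=[[60.0, 58.0], [80.0, 50.0], [100.0, 18.0]],
    load_kw=float(corrected[0, 0]),
    p_max_kw=np.array([90.0, 60.0]),
)
```

With confidence 0.9 the first step stays close to the model output (118 kW), while with confidence 0.5 the second step is pulled halfway back to the current state (115 kW).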
[0034] Optionally, the generation of the multi-energy cooperative scheduling strategy specifically includes:
[0035] The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework to establish a policy function. The policy parameters are initialized using samples from the system dynamic dataset to form an initial policy sample set. The policy function is established by defining a probability mapping between the system state and the candidate scheduling actions.
[0036] A multi-objective reward function is constructed based on the initial policy sample set. The multi-objective reward function converts the energy efficiency, economy and safety of the multi-energy combined supply system into corresponding numerical benefits and combines them by weighted summation using assigned weight coefficients.
[0037] Based on the multi-objective reward function, calculate the expected reward value of executing each candidate scheduling action under the current system state;
[0038] The expected return value is obtained by weighted summation of the current instantaneous reward and the expected value of the system state at the next moment;
[0039] A dynamic game and strategy update mechanism is adopted. Based on the difference in expected reward values of candidate scheduling actions under the same system state, the initial strategy function is iteratively updated to generate the result after strategy update.
[0040] Constraint verification is performed on the results of the policy update. If the verification result of a candidate scheduling action exceeds the preset constraint boundary, that candidate scheduling action is removed from the results of the policy update, yielding the scheduling policy after constraint verification.
[0041] The scheduling strategy after constraint verification is normalized to form a multi-energy collaborative scheduling strategy, which is a set of joint scheduling schemes for electric energy, thermal energy, cold energy, gas and energy storage systems.
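The update, verification and normalization steps can be illustrated with a small sketch. The patent does not specify the dynamic game or the update rule, so the multiplicative-weights update used here is a swapped-in stand-in that shifts probability toward actions whose expected return is above the policy average. The function names, learning rate and numbers are hypothetical.

```python
import numpy as np

def update_policy(probs, q_values, lr=0.5):
    """Multiplicative-weights step driven by differences in expected return."""
    probs = np.asarray(probs, dtype=float)
    q = np.asarray(q_values, dtype=float)
    advantage = q - np.dot(probs, q)      # return relative to the policy average
    new_probs = probs * np.exp(lr * advantage)
    return new_probs / new_probs.sum()

def enforce_constraints(probs, violation, bound):
    """Drop actions whose violation exceeds the bound, then renormalize."""
    probs = np.where(np.asarray(violation) > bound, 0.0, probs)
    total = probs.sum()
    if total == 0.0:
        raise ValueError("no feasible candidate action remains")
    return probs / total

# Three candidate actions with uniform initial probability.
policy = update_policy([1 / 3, 1 / 3, 1 / 3], q_values=[1.0, 2.0, 3.0])
# The third action violates a constraint by 5 units against a bound of 1.
policy = enforce_constraints(policy, violation=[0.0, 0.0, 5.0], bound=1.0)
```

After the update the third action has the highest probability, but it is removed by the constraint check, so the remaining probability mass is renormalized over the first two actions.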
[0042] Optionally, obtaining the optimal scheduling strategy for the current iteration stage specifically includes:
[0043] The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to establish a value assessment model. The value assessment model takes the current state vector of the system as input and computes the weighted cumulative sum of the instantaneous reward and the risk penalty over the prediction time domain.
[0044] Based on the value assessment model, the expected long-term return of the system is calculated by a discount weighting method. The expected long-term return refers to the comprehensive return of the multi-energy collaborative scheduling strategy in the prediction time domain.
[0045] The expected long-term returns are normalized to form a set of value assessment indicators, which includes energy efficiency return indicators, economic return indicators, safety return indicators and risk penalty indicators.
[0046] Based on the value assessment index set, a risk-weighted comprehensive evaluation mechanism is used to calculate the comprehensive evaluation score, which is the weighted sum of energy efficiency benefit index, economic benefit index and safety benefit index minus the weighted value of risk penalty index.
[0047] Based on the comprehensive evaluation score, the long-term benefits of the multi-energy collaborative scheduling strategy are ranked, and the multi-energy collaborative scheduling strategy that meets the optimal evaluation criterion is selected to generate the optimal scheduling strategy for the current iteration stage.
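The discounted return, the risk-weighted score and the ranking can be sketched directly from the description above. The weights, indicator values and strategy labels are hypothetical, and the indicators are assumed to be already normalized to [0, 1].

```python
def discounted_return(rewards, risk_penalties, gamma=0.95, risk_weight=1.0):
    """Discount-weighted sum of reward minus risk penalty over the horizon."""
    return sum(
        gamma ** k * (r - risk_weight * p)
        for k, (r, p) in enumerate(zip(rewards, risk_penalties))
    )

def composite_score(indicators, weights):
    """Weighted benefits minus the weighted risk penalty."""
    return (
        weights["efficiency"] * indicators["efficiency"]
        + weights["economy"] * indicators["economy"]
        + weights["safety"] * indicators["safety"]
        - weights["risk"] * indicators["risk"]
    )

weights = {"efficiency": 0.3, "economy": 0.4, "safety": 0.3, "risk": 0.5}
strategies = {
    "A": {"efficiency": 0.8, "economy": 0.7, "safety": 0.9, "risk": 0.6},
    "B": {"efficiency": 0.7, "economy": 0.7, "safety": 0.8, "risk": 0.1},
}

scores = {name: composite_score(ind, weights) for name, ind in strategies.items()}
best = max(scores, key=scores.get)
```

Strategy A has the higher raw benefits, but its larger risk penalty lowers its score to about 0.49 against about 0.68 for strategy B, so B is selected.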
[0048] Optionally, the output of the optimal cooperative scheduling strategy specifically includes:
[0049] The optimal scheduling strategy is input into the real-time control module of the multi-energy system to perform coordinated scheduling operations on the power system, heating system, cooling system, gas system and energy storage system, and generate the actual execution action sequence.
[0050] During the scheduling process, real-time operating data of the multi-energy combined supply system is collected. The operating data includes system state vector, scheduling action vector and external disturbance information, forming a set of actual operating data samples.
[0051] The instant reward value is calculated based on real-time operation data. The instant reward value is obtained from the system's energy efficiency benefits, economic benefits and safety benefits. It is compared with the value assessment results of the risk-aware value assessment module. The scheduling parameters are dynamically optimized using a feedback correction mechanism to generate the optimal collaborative scheduling strategy.
[0052] Optionally, the update of the improved model reinforcement learning framework specifically includes:
[0053] The optimal cooperative scheduling strategy and the corresponding system operation feedback information are input into the experience pool of the improved model reinforcement learning framework. The experience pool consists of experience samples, which include the current system state vector, scheduling action vector, immediate reward value and the system state vector at the next moment.
[0054] The experience samples in the experience pool are randomly and uniformly drawn to form an experience sample batch set.
[0055] Based on the batch set of empirical samples, the joint parameters of the dynamic cognitive modeling module, adaptive collaborative strategy generation module, and risk-aware value assessment module of the multi-energy system are updated. The gradient descent mechanism is used to obtain the updated and improved model reinforcement learning framework.
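A minimal experience pool with uniform random sampling might look like the sketch below. The fixed capacity with oldest-first eviction is an assumption, as are the class and field names; the gradient-descent update of the three modules is not shown because the patent gives no loss functions.

```python
import random
from collections import deque, namedtuple

# One experience sample: current state, action, immediate reward, next state.
Experience = namedtuple("Experience", "state action reward next_state")

class ExperiencePool:
    """Fixed-capacity pool; the oldest samples are dropped when it is full."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a batch uniformly at random without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

pool = ExperiencePool(capacity=3, seed=0)
for t in range(5):
    pool.add(state=[t], action=[0], reward=float(t), next_state=[t + 1])
batch = pool.sample(2)
```

Uniform sampling breaks the temporal correlation between consecutive scheduling steps, which is the usual motivation for experience replay.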
[0056] The beneficial effects of this invention are:
[0057] Compared with existing technologies, this invention introduces an improved model reinforcement learning framework into the scheduling optimization process of multi-energy combined supply systems, and constructs an intelligent scheduling optimization system that integrates system dynamic cognition, adaptive strategy generation and risk perception assessment, thereby significantly improving the collaborative operation efficiency and intelligent decision-making capability of multi-energy systems in complex environments.
[0058] This invention establishes a standardized system dynamic dataset and mathematical modeling framework, enabling a systematic description of the coupling characteristics of multiple energy sources, including electricity, heat, cold, gas, and energy storage. On this basis, the dataset is input into the improved model reinforcement learning framework for learning and optimization, resulting in a more complete and computable definition of the system's state and action spaces and ensuring the interpretability and controllability of the scheduling decision-making process. By introducing a dynamic confidence propagation mechanism into the multi-energy system dynamic cognitive modeling module, this invention can quantify the reliability of prediction results in real time during model training and prediction, and dynamically correct system state predictions using confidence information. This effectively reduces scheduling deviations caused by model error accumulation, improving the system's robustness and prediction accuracy under external disturbances.
[0059] In the strategy generation stage, this invention designs an adaptive collaborative strategy generation module that combines dynamic game theory with reinforcement learning strategy iteration mechanisms to construct a multi-objective reward function system. With energy efficiency, economy, and safety as core objectives, it achieves optimal collaborative control among multi-energy systems through multi-weighted and dynamic adjustments. This design not only overcomes the limitations of traditional single-objective scheduling but also endows the scheduling strategy with self-learning, self-evolution, and self-balancing capabilities. It can dynamically optimize the strategy structure according to environmental changes and energy demand, achieving efficient energy flow and resource matching among multi-energy systems.
[0060] Furthermore, this invention employs a risk-aware value assessment module to comprehensively and quantitatively evaluate the long-term benefits and potential risks of different strategies. The assessment results are then used to constrain and guide the strategy update process, thereby avoiding the overfitting and decision instability problems inherent in traditional reinforcement learning methods during strategy optimization. This module introduces a risk trade-off mechanism in the long-term scheduling benefit assessment, ensuring that the scheduling results balance immediate benefits with long-term steady-state performance, guaranteeing that the system maintains safe and stable operation while improving economic efficiency.
Attached Figure Description
[0061] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0062] Figure 1 is an overall flowchart of the collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning proposed in this invention;
[0063] Figure 2 is a schematic diagram of the module structure of the improved model reinforcement learning framework used in the collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning proposed in this invention.
Detailed Implementation
[0064] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0065] Referring to Figures 1-2, a collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning includes the following steps:
[0066] Collect and preprocess real-time operating data and external environmental data from the combined energy supply system to form a standardized dataset;
[0067] A mathematical model of the multi-energy combined supply system is established based on a standardized dataset. The system state space, system action space and system constraints are defined, and the system dynamic behavior equations are constructed.
[0068] Based on the system's dynamic behavior equations and historical operating data of the combined energy supply system, a system dynamic dataset is generated.
[0069] The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module of the improved model-based reinforcement learning (MBRL) framework for training. A dynamic confidence propagation mechanism is introduced to output the confidence of the prediction results.
[0070] Within each scheduling cycle, a confidence-corrected prediction sequence is generated based on the confidence level of the prediction results and the external disturbance information of the multi-energy combined supply system, and a set of candidate scheduling actions is formed in combination with system constraints.
[0071] The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework, and a multi-energy collaborative scheduling strategy is generated using a dynamic game and policy update mechanism.
[0072] The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to evaluate the long-term benefits of the multi-energy collaborative scheduling strategy, obtain the evaluation results, and select the optimal scheduling strategy for the current iteration stage based on the evaluation results.
[0073] The optimal scheduling strategy of the current iteration stage is input into the multi-energy combined supply system for execution, feedback information from system operation is collected, and the optimal collaborative scheduling strategy is output.
[0074] The optimal cooperative scheduling strategy is input into the experience pool of the improved model reinforcement learning framework. Through the experience sample replay and model parameter update mechanism, joint iterative training is performed to obtain the updated improved model reinforcement learning framework.
[0075] In this embodiment, the standardized dataset includes predicted values of electrical load, thermal load, cooling load, gas load, equipment operating status, energy storage unit state of charge, and energy market price information. The preprocessing includes missing value completion, outlier correction, and data normalization.
[0076] In this embodiment, the construction of the system dynamic behavior equation specifically includes:
[0077] A mathematical model of a multi-energy combined supply system is established based on a standardized dataset. The process of establishing the model involves constructing energy balance equations, equipment operation constraint equations, and multi-energy coupling relationship equations for the electric energy system, thermal energy system, cold energy system, gas system, and energy storage system using a standardized dataset, thereby generating a mathematical model of the multi-energy combined supply system.
[0078] Based on the mathematical model of the multi-energy combined supply system, the system state space, system action space and system constraints are defined. The system state space is defined by selecting state variables that characterize the operation of the multi-energy combined supply system. The system action space is defined by the controllable scheduling variables of each energy subsystem. The system constraints are defined by the energy balance equation, equipment capacity limits and safe operation boundary conditions.
[0079] Based on the system state space, system action space, and system constraints, a system dynamic behavior equation is constructed. The construction process involves associating the system state changes with the control actions and external factors through energy balance and coupling relationships, thereby forming the system dynamic behavior equation.
[0080] In this embodiment, the generation of the system dynamic dataset specifically includes:
[0081] The state change process of the multi-energy combined supply system is discretized based on the system dynamic behavior equation, and the system state transition relationship is formed.
[0082] The system state information, system action information, and external disturbance information in the system state transition relationship are aligned and organized in time series order, and combined with the historical operation data of the multi-energy supply system to form a time series sample set;
[0083] The time-series sample set is classified and labeled according to different energy types to form a dynamic sample subset, which includes electric energy systems, thermal energy systems, cold energy systems, gas systems and energy storage systems.
[0084] Data consistency verification and anomaly correction are performed on the dynamic sample subsets. Based on system energy balance constraints, when the energy balance deviation of a sample in a dynamic sample subset is greater than a preset threshold, the sample is judged to be abnormal and abnormal data processing is performed, forming a consistency-verified dynamic sample set.
[0085] The consistency-verified dynamic sample set is merged to form the system dynamic dataset, which includes the system state, system action, external disturbance and the corresponding system state at the next moment.
[0086] In this embodiment, the output of the confidence level of the prediction result specifically includes:
[0087] The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module in the improved model reinforcement learning framework to construct a state transition model. The state transition model is constructed by fitting the system state change law through nonlinear mapping based on the system dynamic dataset.
[0088] The state transition model is trained, the difference between the predicted output of the state transition model and the actual system state is calculated, and the parameters of the state transition model are adjusted.
[0089] A dynamic confidence propagation mechanism is introduced during the parameter adjustment process of the state transition model to obtain the confidence level of the prediction result. The confidence level of the prediction result is obtained by adding a confidence propagation unit in the hidden layer of the state transition model, combining the hidden layer output with the confidence weight matrix and bias parameters, and mapping it through the Sigmoid function.
[0090] In this embodiment, the formation of the candidate scheduling action set specifically includes:
[0091] During each scheduling cycle, external disturbance information of the multi-energy combined supply system is acquired, and the external disturbance information vector is input into the multi-energy system dynamic cognitive modeling module to perform disturbance correction on the system state prediction sequence and generate a disturbance correction prediction sequence. The external disturbance information includes weather changes, load fluctuations, energy price changes and equipment operating status changes.
[0092] Based on the confidence level of the prediction results, the disturbance correction prediction sequence is weighted and corrected to form a confidence-corrected prediction sequence. This is done by weighting and adjusting the predicted system state at each time step, so that the result depends more on the model output when confidence is high and more on the current system state when confidence is low.
[0093] The confidence-corrected prediction sequence is combined with system constraints to construct a set of candidate scheduling actions. The construction process involves screening scheduling actions based on energy balance constraints, equipment capacity constraints, and safe operation constraints to obtain candidate scheduling actions that satisfy all constraints, thus forming a set of candidate scheduling actions.
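The screening in [0093] can be sketched as a filter that keeps only actions passing all three constraint families. The two-variable action, the single electrical bus, and every limit value below are hypothetical simplifications; the actual system also covers thermal, cooling, and gas subsystems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    grid_kw: float     # power purchased from the grid
    storage_kw: float  # storage power; positive = discharge, negative = charge


def is_feasible(action, pv_kw, load_kw, soc_kwh, *, grid_max_kw=800.0,
                storage_max_kw=500.0, soc_min_kwh=200.0, soc_max_kwh=1800.0,
                dt_h=1.0, tol_kw=1e-6):
    """Check energy balance, equipment capacity, and safe-operation limits."""
    # Energy balance: supply must match the predicted load.
    supply_kw = action.grid_kw + pv_kw + action.storage_kw
    balance_ok = abs(supply_kw - load_kw) <= tol_kw
    # Equipment capacity: grid import and storage power limits.
    capacity_ok = (0.0 <= action.grid_kw <= grid_max_kw
                   and abs(action.storage_kw) <= storage_max_kw)
    # Safe operation: state of charge after the step stays within bounds.
    next_soc = soc_kwh - action.storage_kw * dt_h
    safety_ok = soc_min_kwh <= next_soc <= soc_max_kwh
    return balance_ok and capacity_ok and safety_ok


def screen_actions(actions, pv_kw, load_kw, soc_kwh):
    """Return the candidate scheduling action set (all constraints satisfied)."""
    return [a for a in actions if is_feasible(a, pv_kw, load_kw, soc_kwh)]


actions = [
    Action(600.0, 0.0),     # feasible
    Action(200.0, 400.0),   # feasible, ends exactly at the lower SOC bound
    Action(100.0, 500.0),   # violates the safe-operation (SOC) limit
    Action(700.0, 0.0),     # violates energy balance
    Action(900.0, -300.0),  # violates grid capacity
]
candidates = screen_actions(actions, pv_kw=300.0, load_kw=900.0, soc_kwh=600.0)
```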
[0094] In this embodiment, the generation of the multi-energy cooperative scheduling strategy specifically includes:
[0095] The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework to establish a policy function. The policy parameters are initialized from samples in the system dynamic dataset to form an initial policy sample set. The policy function is established as a probability mapping between the system state and the candidate scheduling actions.
[0096] A multi-objective reward function is constructed based on the initial policy sample set. The energy efficiency, economy and safety of the multi-energy combined supply system are each converted into a corresponding numerical benefit, and the multi-objective reward is obtained as their weighted sum using assigned weight coefficients.
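The weighted summation in [0096] is a one-line computation. The sketch below uses the weights reported for Example 1 (0.4, 0.4, 0.2); the benefit values and the dictionary keys are hypothetical, and the benefits are assumed to be already scaled to comparable ranges.

```python
def multi_objective_reward(benefits, weights):
    """Weighted sum of the energy-efficiency, economic, and safety benefits."""
    return sum(weights[name] * benefits[name] for name in weights)


weights = {"energy_efficiency": 0.4, "economy": 0.4, "safety": 0.2}
benefits = {"energy_efficiency": 0.5, "economy": 0.25, "safety": 1.0}
reward = multi_objective_reward(benefits, weights)  # 0.2 + 0.1 + 0.2
```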
[0097] Based on the multi-objective reward function, calculate the expected reward value for executing each candidate scheduling action in the current system state:
[0098] (The formula is not reproduced in this text.)
[0099] where the formula involves the following quantities: the current system state vector; the scheduling action taken in that state; the individual benefits from energy efficiency, economy, and operational safety, respectively; the corresponding weighting coefficients; the discount factor; the prediction horizon length; the time step index; the value function; the benchmark value function; the confidence weighting coefficient; the constraint penalty coefficient; the constraint residual of each constraint term; and the number of constraints. A quadratic penalty is applied to any residual that does not satisfy its constraint;
[0100] The expected reward value is obtained by weighted summation of the current instantaneous reward and the expected value of the system state at the next moment;
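Paragraph [0100] describes a one-step form, and the quantities listed in [0099] suggest a multi-step form with a constraint penalty. Both are written out below as a sketch. All symbols are assigned here for illustration, and the arrangement of the terms, in particular where the confidence weighting coefficient and the benchmark value function enter, is an assumption rather than the formula of [0098].

```latex
% One-step form described in [0100]
Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}\left[ V(s_{t+1}) \right],
\qquad
r(s_t, a_t) = w_E R_E + w_C R_C + w_S R_S

% One possible multi-step form using the quantities listed in [0099]
% (assumed arrangement)
Q(s_t, a_t) = \sum_{k=0}^{H-1} \gamma^{k}
    \left( w_E R_E^{(k)} + w_C R_C^{(k)} + w_S R_S^{(k)} \right)
  + \gamma^{H} \beta \left[ V(s_{t+H}) - V_b(s_{t+H}) \right]
  - \lambda \sum_{i=1}^{N} g_i^{2}
```

Here $s_t$ is the current system state vector, $a_t$ the scheduling action, $R_E, R_C, R_S$ the energy-efficiency, economic, and safety benefits with weights $w_E, w_C, w_S$, $\gamma$ the discount factor, $H$ the prediction horizon, $k$ the time step index, $V$ and $V_b$ the value function and benchmark value function, $\beta$ the confidence weighting coefficient, $\lambda$ the constraint penalty coefficient, $g_i$ the residual of the $i$-th constraint, and $N$ the number of constraints.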
[0101] A dynamic game and strategy update mechanism is adopted. Based on the difference in expected reward values of candidate scheduling actions under the same system state, the initial strategy function is iteratively updated to generate the result after strategy update.
[0102] Constraint verification is performed on the results of the policy update. If the verification result of the candidate scheduling action is greater than the preset constraint boundary, the corresponding candidate scheduling action is removed from the results of the policy update, and the scheduling policy after constraint verification is obtained.
[0103] The scheduling strategy after constraint verification is normalized to form a multi-energy collaborative scheduling strategy, which is a set of joint scheduling schemes for electric energy, thermal energy, cold energy, gas and energy storage systems.
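The update, verification, and normalization steps in [0101] to [0103] can be sketched for a discrete set of candidate actions. The patent does not specify the dynamic game or the update rule, so the exponential (multiplicative-weights) update below is a swapped-in stand-in, and the function names, step size, and sample values are assumptions.

```python
import math


def update_policy(probs, expected_rewards, step=1.0):
    """Shift probability toward actions whose expected reward exceeds the mean.

    Uses a multiplicative-weights update on the difference between each
    action's expected reward and the policy-weighted average.
    """
    baseline = sum(p * q for p, q in zip(probs, expected_rewards))
    raw = [p * math.exp(step * (q - baseline))
           for p, q in zip(probs, expected_rewards)]
    total = sum(raw)
    return [r / total for r in raw]


def verify_and_normalize(probs, violations, bound):
    """Remove actions whose constraint violation exceeds the bound, then renormalize."""
    kept = [p if v <= bound else 0.0 for p, v in zip(probs, violations)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no candidate action satisfies the constraint bound")
    return [p / total for p in kept]


# Three candidate actions with a uniform initial policy (illustrative values).
initial = [1.0 / 3.0] * 3
expected_rewards = [1.0, 2.0, 3.0]
violations = [0.0, 0.0, 0.5]  # the third action exceeds the bound below
updated = update_policy(initial, expected_rewards)
final = verify_and_normalize(updated, violations, bound=0.1)
```

In this example the highest-reward action gains the most probability in the update but is then removed by the constraint check, so the final policy is split between the two remaining actions.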
[0104] In this embodiment, obtaining the optimal scheduling strategy for the current iteration stage specifically includes:
[0105] The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to establish a value assessment model. The value assessment model takes the current state vector of the system as input and computes the weighted cumulative sum of the instantaneous reward and the risk penalty over the prediction time domain.
[0106] Based on the value assessment model, the expected long-term return of the system is calculated by a discount weighting method. The expected long-term return refers to the comprehensive return of the multi-energy collaborative scheduling strategy in the prediction time domain.
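The discount weighting in [0106] can be sketched as a discounted sum of per-step reward minus a weighted risk penalty. How the risk penalty is measured is not stated in the patent; the per-step penalty list, the risk weight, and the function name below are assumptions.

```python
def discounted_return(rewards, risk_penalties, gamma=0.95, risk_weight=1.0):
    """Discounted sum over the prediction horizon of reward minus weighted risk."""
    return sum(
        (gamma ** k) * (r - risk_weight * p)
        for k, (r, p) in enumerate(zip(rewards, risk_penalties))
    )


# Three-step horizon with a risk penalty at the second step (illustrative values).
long_term_return = discounted_return(
    rewards=[1.0, 1.0, 1.0],
    risk_penalties=[0.0, 0.5, 0.0],
    gamma=0.5,
)
# 1.0 + 0.5 * (1.0 - 0.5) + 0.25 * 1.0 = 1.5
```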
[0107] The expected long-term returns are normalized to form a set of value assessment indicators, which includes energy efficiency return indicators, economic return indicators, safety return indicators and risk penalty indicators.
[0108] Based on the set of value assessment indicators, a risk-weighted comprehensive evaluation mechanism is used to calculate the comprehensive evaluation score, which is the weighted sum of the energy efficiency return indicator, the economic return indicator and the safety return indicator, minus the weighted risk penalty indicator.
[0109] Based on the comprehensive evaluation score, the long-term benefits of the multi-energy collaborative scheduling strategy are ranked, and the multi-energy collaborative scheduling strategy that meets the optimal evaluation criterion is selected to generate the optimal scheduling strategy for the current iteration stage.
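The scoring and ranking in [0108] and [0109] can be sketched as follows, assuming the indicators have already been normalized to [0, 1] and taking "optimal evaluation criterion" to mean the highest score. The strategy names, indicator values, and risk weight are hypothetical.

```python
def comprehensive_score(indicators, weights, risk_weight):
    """Weighted sum of the three return indicators minus the weighted risk penalty."""
    gain = sum(weights[name] * indicators[name] for name in weights)
    return gain - risk_weight * indicators["risk"]


weights = {"energy_efficiency": 0.4, "economy": 0.4, "safety": 0.2}
strategies = {
    "A": {"energy_efficiency": 0.8, "economy": 0.6, "safety": 0.9, "risk": 0.5},
    "B": {"energy_efficiency": 0.7, "economy": 0.7, "safety": 0.8, "risk": 0.1},
}
scores = {
    name: comprehensive_score(ind, weights, risk_weight=0.5)
    for name, ind in strategies.items()
}
# Rank strategies from highest to lowest score and pick the first.
ranking = sorted(scores, key=scores.get, reverse=True)
best = ranking[0]
```

Strategy A has the higher raw returns, but its larger risk penalty leaves it behind strategy B once the penalty is subtracted.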
[0110] In this embodiment, the output of the optimal cooperative scheduling strategy specifically includes:
[0111] The optimal scheduling strategy is input into the real-time control module of the multi-energy system to perform coordinated scheduling operations on the power system, heating system, cooling system, gas system and energy storage system, and generate the actual execution action sequence.
[0112] During the scheduling process, real-time operating data of the multi-energy combined supply system is collected. The operating data includes system state vector, scheduling action vector and external disturbance information, forming a set of actual operating data samples.
[0113] The instant reward value is calculated from the real-time operating data; it is obtained from the system's energy efficiency benefits, economic benefits and safety benefits, and is compared with the value assessment results of the risk-aware value assessment module. Based on this comparison, the scheduling parameters are dynamically optimized using a feedback correction mechanism to generate the optimal collaborative scheduling strategy.
[0114] In this embodiment, the update of the improved model reinforcement learning framework specifically includes:
[0115] The optimal cooperative scheduling strategy and the corresponding system operation feedback information are input into the experience pool of the improved model reinforcement learning framework. The experience pool consists of experience samples, which include the current system state vector, scheduling action vector, immediate reward value and the system state vector at the next moment.
[0116] The experience samples in the experience pool are randomly and uniformly drawn to form an experience sample batch set.
[0117] Based on the experience sample batch set, the parameters of the multi-energy system dynamic cognitive modeling module, the adaptive collaborative strategy generation module, and the risk-aware value assessment module are jointly updated by gradient descent to obtain the updated improved model reinforcement learning framework.
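The experience pool and uniform random sampling in [0115] and [0116] can be sketched with the standard library. The fixed capacity with oldest-first eviction is an assumption, since the patent does not say how the pool is bounded; the class and field names are hypothetical, and the gradient-descent update itself is not shown.

```python
import random
from collections import deque, namedtuple

# One experience sample: current state, action, immediate reward, next state.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ExperiencePool:
    """Bounded experience pool with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest samples are evicted first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement."""
        if batch_size > len(self._buffer):
            raise ValueError("batch size exceeds number of stored samples")
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


pool = ExperiencePool(capacity=3, seed=0)
for i in range(5):
    pool.add(state=[float(i)], action=[0.0], reward=1.0, next_state=[float(i + 1)])
batch = pool.sample(2)
```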
[0118] Example 1:
[0119] Taking a comprehensive energy station in an industrial park as an example, this station provides comprehensive energy supply to office buildings, production workshops, and dormitories within the park. It comprises five main subsystems: an electrical system, a thermal system, a cooling system, a gas system, and an energy storage system. The system utilizes distributed photovoltaic power generation, gas-fired boilers, electric chillers, absorption chillers, and lithium battery energy storage equipment to achieve multi-energy complementarity and cascaded energy utilization. Due to the high uncertainty of external environment, load demand, and renewable energy output, traditional scheduling optimization methods suffer from problems such as long calculation times, insufficient prediction accuracy, and slow response to disturbances, making it difficult to achieve dynamic optimal system operation.
[0120] To address the aforementioned problems, this invention proposes a reinforcement learning-based collaborative scheduling optimization method for multi-energy combined supply systems. Real-time operational data from the energy system and external environmental data are collected, standardized, and then input into the improved model reinforcement learning framework. The multi-energy system dynamic cognitive modeling module learns the dynamic behavior of the system, constructs a predictive relationship between system state and scheduling actions, and introduces a dynamic belief propagation mechanism to correct the prediction reliability. The adaptive collaborative strategy generation module, based on a multi-objective reward function, comprehensively considers system energy efficiency, economy, and safety, and dynamically generates collaborative scheduling schemes through the reinforcement learning strategy iteration process. The risk-aware value assessment module comprehensively evaluates the long-term benefits of each strategy and selects the optimal scheduling strategy for actual implementation.
[0121] The experiment used a one-week period for the integrated energy station in the industrial park as the research cycle, and compared simulations using both the traditional linear optimization scheduling method and the method of this invention. The simulation step size was 1 hour, totaling 168 scheduling cycles. The system electricity price was set according to the time-of-use pricing strategy, with RMB 1.05 / kWh during the day and RMB 0.68 / kWh during the night. The photovoltaic installed capacity was 1.0MW, the energy storage system capacity was 2MWh, and the gas price was RMB 3.20 / Nm³. The weighting coefficients of the reward function were set as follows: energy efficiency 0.4, economic efficiency 0.4, and safety 0.2.
[0122] Table 1: Comparison results of reinforcement learning optimization methods for multi-energy supply systems
[0123]
| Indicator | Unit | Traditional linear optimization scheduling | Reinforcement learning scheduling (this invention) | Improvement rate (%) |
| --- | --- | --- | --- | --- |
| Overall system energy efficiency | % | 79.2 | 86.1 | +8.7 |
| Weekly average operating cost | Yuan / day | 13,850 | 12,410 | -10.4 |
| Electricity purchased from the power grid | MWh | 78.6 | 70.3 | -10.6 |
| Energy storage utilization rate | % | 64.2 | 82.8 | +28.9 |
| Load tracking error (RMSE) | kW | 93.5 | 58.7 | -37.2 |
| Policy convergence time | Hour | 13.6 | 8.3 | -38.9 |
| Average risk index | - | 0.41 | 0.27 | -34.1 |
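The improvement-rate column of Table 1 appears to be the relative change with respect to the traditional method, (new - old) / old × 100. This reading is an assumption, but recomputing it from the two value columns reproduces every listed figure to within 0.1 percentage point, as the short check below shows.

```python
def improvement_rate(baseline, proposed):
    """Relative change of the proposed value with respect to the baseline, in percent."""
    return (proposed - baseline) / baseline * 100.0


# (traditional value, this invention's value, listed improvement rate in %)
rows = {
    "Overall system energy efficiency": (79.2, 86.1, 8.7),
    "Weekly average operating cost": (13850.0, 12410.0, -10.4),
    "Electricity purchased from the power grid": (78.6, 70.3, -10.6),
    "Energy storage utilization rate": (64.2, 82.8, 28.9),
    "Load tracking error (RMSE)": (93.5, 58.7, -37.2),
    "Policy convergence time": (13.6, 8.3, -38.9),
    "Average risk index": (0.41, 0.27, -34.1),
}

recomputed = {name: improvement_rate(old, new) for name, (old, new, _) in rows.items()}
# Largest gap between a recomputed rate and the rate listed in the table.
max_gap = max(abs(recomputed[name] - listed) for name, (_, _, listed) in rows.items())
```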
[0124] As shown in Table 1, the method proposed in this invention significantly improves upon traditional methods in terms of energy efficiency, economy, and operational stability. The overall system energy efficiency is improved by approximately 8.7%, primarily due to the multi-energy system dynamic cognitive modeling module's ability to capture the nonlinear coupling relationships between multiple energy sources, avoiding the biases caused by linear assumptions in traditional models. The system remains stable under dynamic disturbances, with energy storage system utilization increasing by approximately 28.9% and charging / discharging cycles reduced by over 20%, effectively extending equipment lifespan. In terms of economy, the reinforcement learning scheduling method adaptively optimizes the operating strategy through a multi-objective reward function, resulting in a 10.4% reduction in average weekly operating costs and a 10.6% decrease in grid-purchased electricity, reflecting both improved overall energy utilization and reduced dependence on external energy purchases.
[0125] In terms of stability and real-time performance, the load tracking error decreased from 93.5 kW to 58.7 kW, indicating that the system can respond to load changes more accurately. The strategy convergence time was shortened from 13.6 hours to 8.3 hours, a reduction of about 38.9% compared with the traditional method, demonstrating that the reinforcement learning strategy can quickly learn and adapt to environmental changes and has stronger dynamic optimization capabilities. In addition, the risk-aware value assessment module effectively balances short-term gains and long-term risks, reducing the system's average risk index from 0.41 to 0.27 and significantly improving the execution security of the scheduling strategy.
[0126] In summary, this invention, through the deep integration of reinforcement learning and multi-energy system scheduling, constructs an intelligent optimization scheduling mechanism with self-learning, self-adaptation, and high robustness, achieving dynamic balance and real-time optimization of energy supply and demand. The results show that the method of this invention possesses significant advantages in multi-energy systems, including high energy efficiency, low cost, fast response, and low operational risk, and has strong engineering application and promotion value.
Claims
1. A collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning, characterized in that it includes the following steps: Real-time operating data and external environmental data of the multi-energy combined supply system are collected and preprocessed to form a standardized dataset; A mathematical model of the multi-energy combined supply system is established based on the standardized dataset. The system state space, system action space and system constraints are defined, and the system dynamic behavior equations are constructed. Based on the system dynamic behavior equations and historical operating data of the multi-energy combined supply system, a system dynamic dataset is generated. The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module of the improved model-based reinforcement learning (MBRL) framework for training. A dynamic confidence propagation mechanism is introduced to output the confidence level of the prediction results. Within each scheduling cycle, a confidence-corrected prediction sequence is generated based on the confidence level of the prediction results and the external disturbance information of the multi-energy combined supply system, and a set of candidate scheduling actions is formed in combination with system constraints. The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework, and a multi-energy cooperative scheduling strategy is generated by using a dynamic game and policy update mechanism.
The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to evaluate the long-term benefits of the multi-energy collaborative scheduling strategy, obtain the evaluation results, and select the optimal scheduling strategy for the current iteration stage based on the evaluation results. The optimal scheduling strategy of the current iteration stage is input into the multi-energy combined supply system for execution, feedback information from system operation is collected, and the optimal collaborative scheduling strategy is output. The optimal cooperative scheduling strategy is input into the experience pool of the improved model reinforcement learning framework. Through the experience sample replay and model parameter update mechanism, joint iterative training is performed to obtain the updated improved model reinforcement learning framework. The output of the confidence level of the prediction result specifically includes: The system dynamic dataset is input into the multi-energy system dynamic cognitive modeling module in the improved model reinforcement learning framework to construct a state transition model. The state transition model is constructed by fitting the system state change law through nonlinear mapping based on the system dynamic dataset. The state transition model is trained, the difference between the predicted output of the state transition model and the actual system state is calculated, and the parameters of the state transition model are adjusted. A dynamic confidence propagation mechanism is introduced during the parameter adjustment process of the state transition model to obtain the confidence level of the prediction result. 
The confidence level of the prediction result is obtained by adding a confidence propagation unit in the hidden layer of the state transition model, combining the hidden layer output with the confidence weight matrix and bias parameters, and mapping it through the Sigmoid function. The formation of the candidate scheduling action set specifically includes: During each scheduling cycle, external disturbance information of the multi-energy combined supply system is acquired, and the external disturbance information vector is input into the multi-energy system dynamic cognitive modeling module to perform disturbance correction on the system state prediction sequence and generate a disturbance correction prediction sequence. The external disturbance information includes weather changes, load fluctuations, energy price changes and equipment operating status changes. Based on the confidence level of the prediction results, the disturbance correction prediction sequence is weighted and corrected to form a confidence correction prediction sequence. This involves weighting the predicted system state at each time step so that the result relies more on the model output when confidence is high and more on the current system state when confidence is low. The confidence-corrected prediction sequence is combined with system constraints to construct a set of candidate scheduling actions. The construction process involves screening scheduling actions based on energy balance constraints, equipment capacity constraints, and safe operation constraints to obtain candidate scheduling actions that satisfy all constraints, thus forming a set of candidate scheduling actions.
2. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the standardized dataset includes predicted values of electrical load, thermal load, cooling load, gas load, equipment operating status, energy storage unit state of charge, and energy market price information. The preprocessing includes missing value completion, outlier correction, and data normalization.
3. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the construction of the system dynamic behavior equations specifically includes: A mathematical model of the multi-energy combined supply system is established based on the standardized dataset. The process of establishing the model involves constructing energy balance equations, equipment operation constraint equations, and multi-energy coupling relationship equations for the electric energy system, thermal energy system, cold energy system, gas system, and energy storage system using the standardized dataset, thereby generating a mathematical model of the multi-energy combined supply system. Based on the mathematical model of the multi-energy combined supply system, the system state space, system action space and system constraints are defined. The system state space is defined by selecting state variables that characterize the operation of the multi-energy combined supply system. The system action space is defined by the controllable scheduling variables of each energy subsystem. The system constraints are defined by the energy balance equation, equipment capacity limits and safe operation boundary conditions. Based on the system state space, system action space, and system constraints, a system dynamic behavior equation is constructed. The construction process involves associating the system state changes with the control actions and external factors through energy balance and coupling relationships, thereby forming the system dynamic behavior equation.
4. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the generation of the system dynamic dataset specifically includes: The state change process of the multi-energy combined supply system is discretized based on the system dynamic behavior equation, and the system state transition relationship is formed. The system state information, system action information, and external disturbance information in the system state transition relationship are aligned and organized in time series order, and combined with the historical operation data of the multi-energy combined supply system to form a time series sample set; The time-series sample set is classified and labeled according to different energy types to form dynamic sample subsets, which cover the electric energy system, thermal energy system, cold energy system, gas system and energy storage system. Data consistency verification and anomaly correction are performed on the dynamic sample subsets. Based on system energy balance constraints, when the energy balance deviation of a sample in a dynamic sample subset is greater than a preset threshold, the sample is judged to be abnormal and abnormal data processing is performed, forming a consistency-verified dynamic sample set. The consistency-verified dynamic sample set is unified and integrated to form a system dynamic dataset, which includes system state, system action, external disturbance and the corresponding system state at the next moment.
5. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the generation of the multi-energy cooperative scheduling strategy specifically includes: The candidate scheduling action set is input into the adaptive cooperative policy generation module of the improved model reinforcement learning framework to establish a policy function. The policy parameters are initialized from samples in the system dynamic dataset to form an initial policy sample set. The policy function is established as a probability mapping between the system state and the candidate scheduling actions. A multi-objective reward function is constructed based on the initial policy sample set. The energy efficiency, economy and safety of the multi-energy combined supply system are each converted into a corresponding numerical benefit, and the multi-objective reward is obtained as their weighted sum using assigned weight coefficients. Based on the multi-objective reward function, the expected reward value of executing each candidate scheduling action under the current system state is calculated; The expected reward value is obtained by weighted summation of the current instantaneous reward and the expected value of the system state at the next moment; A dynamic game and strategy update mechanism is adopted. Based on the difference in expected reward values of candidate scheduling actions under the same system state, the initial strategy function is iteratively updated to generate the result after strategy update. Constraint verification is performed on the results of the policy update. 
If the verification result of the candidate scheduling action is greater than the preset constraint boundary, the corresponding candidate scheduling action is removed from the results of the policy update, and the scheduling policy after constraint verification is obtained. The scheduling strategy after constraint verification is normalized to form a multi-energy collaborative scheduling strategy, which is a set of joint scheduling schemes for electric energy, thermal energy, cold energy, gas and energy storage systems.
6. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the optimal scheduling strategy for the current iteration stage is obtained specifically through: The multi-energy collaborative scheduling strategy is input into the risk-aware value assessment module of the improved model reinforcement learning framework to establish a value assessment model. The value assessment model takes the current state vector of the system as input and computes the weighted cumulative sum of the instantaneous reward and the risk penalty over the prediction time domain. Based on the value assessment model, the expected long-term return of the system is calculated by a discount weighting method. The expected long-term return refers to the comprehensive return of the multi-energy collaborative scheduling strategy in the prediction time domain. The expected long-term returns are normalized to form a set of value assessment indicators, which includes energy efficiency return indicators, economic return indicators, safety return indicators and risk penalty indicators. Based on the set of value assessment indicators, a risk-weighted comprehensive evaluation mechanism is used to calculate the comprehensive evaluation score, which is the weighted sum of the energy efficiency return indicator, the economic return indicator and the safety return indicator, minus the weighted risk penalty indicator. Based on the comprehensive evaluation score, the long-term benefits of the multi-energy collaborative scheduling strategy are ranked, and the multi-energy collaborative scheduling strategy that meets the optimal evaluation criterion is selected to generate the optimal scheduling strategy for the current iteration stage.
7. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the output of the optimal cooperative scheduling strategy specifically includes: The optimal scheduling strategy is input into the real-time control module of the multi-energy system to perform coordinated scheduling operations on the power system, heating system, cooling system, gas system and energy storage system, and generate the actual execution action sequence. During the scheduling process, real-time operating data of the multi-energy combined supply system is collected. The operating data includes system state vector, scheduling action vector and external disturbance information, forming a set of actual operating data samples. The instant reward value is calculated from the real-time operating data; it is obtained from the system's energy efficiency benefits, economic benefits and safety benefits, and is compared with the value assessment results of the risk-aware value assessment module. Based on this comparison, the scheduling parameters are dynamically optimized using a feedback correction mechanism to generate the optimal collaborative scheduling strategy.
8. The collaborative scheduling optimization method for a multi-energy combined supply system based on reinforcement learning according to claim 1, characterized in that the update of the improved model reinforcement learning framework specifically includes: The optimal cooperative scheduling strategy and the corresponding system operation feedback information are input into the experience pool of the improved model reinforcement learning framework. The experience pool consists of experience samples, which include the current system state vector, scheduling action vector, immediate reward value and the system state vector at the next moment. The experience samples in the experience pool are randomly and uniformly drawn to form an experience sample batch set. Based on the experience sample batch set, the parameters of the multi-energy system dynamic cognitive modeling module, the adaptive collaborative strategy generation module, and the risk-aware value assessment module are jointly updated by gradient descent to obtain the updated improved model reinforcement learning framework.