Intraday optimal dispatch method, device, and storage medium for an integrated energy system
By employing a Transformer-Mamba hybrid structure and a deep deterministic policy gradient algorithm based on generative adversarial imitation learning, the problems of economy, low carbon emissions, and operational safety in the scheduling of integrated energy systems are solved, improving the system's forward-looking perception and scheduling stability, and realizing the optimization of multi-energy complementary utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID ANHUI ELECTRIC POWER CO LTD ELECTRIC POWER SCI RES INST
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-19

Figure CN122243050A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of energy system scheduling methods, specifically to an intraday optimal scheduling method, device, and storage medium for an integrated energy system.
Background Technology
[0002] With the rapid growth of installed capacity of new energy sources, wind power, photovoltaics, and other renewable energy sources are being integrated into integrated energy systems on a large scale, resulting in highly coupled and highly uncertain operating characteristics among multiple energy flows such as electricity, heat, cooling, and gas. Affected by meteorological conditions and fluctuations in user-side loads, the output of new energy sources and various types of loads exhibit significant randomness on a daily scale. Traditional optimization scheduling methods based on deterministic models struggle to simultaneously consider economy, low carbon emissions, and operational safety within a limited computational timeframe.
[0003] In recent years, deep reinforcement learning has been gradually introduced into the field of integrated energy system scheduling due to its self-learning ability in high-dimensional and nonlinear decision-making problems. However, existing methods generally suffer from the following shortcomings: they rely heavily on prediction results at a single time step, making it difficult to characterize the evolution trend of the system state over a future period, resulting in "short-sightedness" in energy storage device scheduling; the prediction model and the scheduling decision model are independent of each other, and prediction errors cannot be corrected through feedback from scheduling results; reinforcement learning has low exploration efficiency under complex constraints, and the policy output is prone to oscillations, affecting the safety of system operation. Therefore, there is an urgent need for an intraday scheduling method for integrated energy systems that can simultaneously improve the system's forward-looking perception capability, scheduling stability, and overall optimization performance.
Summary of the Invention
[0004] To address the aforementioned problems in existing technologies, this invention proposes an intraday optimal scheduling method, device, and storage medium for integrated energy systems based on Transformer-Mamba multi-step interval prediction and a generative adversarial imitation learning enhanced twin delayed deep deterministic policy gradient algorithm (GAIL-ERCL2-TD3). Through the collaborative training of the prediction model and the scheduling model, the economic efficiency, low-carbon performance, and stability of the integrated energy system during intraday operation are synergistically improved.
[0005] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0006] A method for optimal intraday scheduling of an integrated energy system, the process of which is as follows:
[0007] Acquire the status and action observation data of the integrated energy system at each historical scheduling time step within the intraday historical time period prior to the current scheduling time step, and preprocess the acquired observation data;
[0008] A dataset is constructed using preprocessed observation data from intraday historical time periods to pretrain a prediction model. The state observation data of the time step closest to the current scheduling time step is input into the pretrained prediction model, and the prediction model predicts the state data interval of the current scheduling time step.
[0009] The state space is composed of the state observation data of the integrated energy system at each historical scheduling time step in the intraday historical time period before the current scheduling time step, and the state data prediction results of the integrated energy system at the current scheduling time step; the action space is composed of the action observation data of the integrated energy system at each historical scheduling time step in the intraday historical time period, and a multi-objective optimization function, reward function, agent value network and policy network for intraday scheduling are established, thereby constructing a sequential decision model for intraday scheduling.
[0010] The value network and policy network of the agent are trained based on the state in the state space and the action in the action space, and the parameters of the value network and policy network are adjusted according to the reward function calculation results, thereby obtaining a trained agent.
[0011] The predicted state data of the integrated energy system at the current scheduling time step in the state space is input into the policy network of the trained agent, and the policy network outputs the optimal action for the current scheduling time step.
[0012] Furthermore, after the agent executes the current scheduling time step and outputs the optimal action, the real reward signal based on the feedback of the optimal action is introduced into the loss function of the pre-trained prediction model to further correct the parameters of the pre-trained prediction model.
[0013] Furthermore, the state observation data of the integrated energy system includes wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heating load observation data, and electrical load observation data; the action observation data of the integrated energy system includes gas turbine output observation data, energy storage device charging and discharging power observation data, and observation data of the power exchanged with the main grid.
[0014] Furthermore, the preprocessing includes normalization and Gaussian process regression. After normalizing the observed data, Gaussian process regression is used to obtain the probability density function of the same type of state data of the integrated energy system at all historical scheduling time steps in the intraday historical time period.
[0015] Furthermore, for the preprocessed observation data, a sliding window is used to take the various state observation data of the previous historical scheduling time step as input and the state observation data of the next historical scheduling time step as the target output, thereby obtaining multiple sample pairs. The input and target output of each sample pair exist in the form of time series segments, and the state data in the target output are divided into two categories: upper boundary segments and lower boundary segments. All sample pairs constitute the dataset used for training the prediction model.
[0016] Furthermore, the prediction model is a combined model formed by cascading the Transformer model and the Mamba model.
[0017] Furthermore, the multi-objective optimization function for intraday scheduling aims to minimize the total cost of the integrated energy system while ensuring that the constraints are met within the region, and simultaneously improve the utilization rate of new energy sources and reduce power fluctuations in tie lines.
[0018] Furthermore, the total cost is a weighted sum of the operating cost of the integrated energy system, the renewable energy curtailment penalty cost, the load shedding penalty cost, the power flow limit violation penalty cost, and the tie-line power fluctuation penalty cost.
[0019] Furthermore, the reward function for intraday scheduling is set to the negative of the total cost.
[0020] Furthermore, the constraints include power balance constraints, energy storage device state of charge constraints, grid interaction power upper and lower limit constraints, and line power flow constraints.
[0021] Furthermore, a Critic regularization term is introduced into the agent's value network to suppress the divergence of Q values, and an L2 regularization term is introduced to constrain the growth of the value network parameters.
[0022] Furthermore, the agent's policy network serves as the generator in the generative adversarial imitation learning architecture. The generative adversarial network composed of the generator and discriminator networks serves as a model for learning expert knowledge. During generator training, the policy gradient is updated based on dynamic imitation weights.
[0023] An electronic device includes a processor and a memory, wherein program instructions in the memory are read and executed to perform the above-described integrated energy system intraday optimal scheduling method.
[0024] A storage medium storing program instructions, which, when read and executed, perform the aforementioned integrated energy system intraday optimal scheduling method.
[0025] Compared with the prior art, the advantages of the present invention are:
[0026] This invention achieves multi-step interval prediction through a Transformer-Mamba hybrid structure, significantly enhancing the scheduling agent's ability to perceive future system state evolution trends and improving energy storage and multi-energy complementary utilization. It introduces a prediction-scheduling synchronous training mechanism to break down information silos between traditional prediction and scheduling models, achieving adaptive improvement in overall scheduling performance. Through generative adversarial imitation learning and the error-regularized Critic mechanism, it improves the convergence speed and policy stability of reinforcement learning under complex constraints, reducing the risk of scheduling command oscillations.
Attached Figure Description
[0027] Figure 1 is a diagram of the integrated energy system architecture used in the embodiments of the present invention.
[0028] Figure 2 is an overall flowchart of the method in an embodiment of the present invention.
Detailed Implementation
[0029] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0030] This embodiment discloses an intraday optimal scheduling method for an integrated energy system. As shown in Figure 1, the integrated energy system comprises an electrical subsystem, a thermal subsystem, a cooling load subsystem, and a gas subsystem, all coupled through energy conversion devices. On the electrical subsystem side, photovoltaics, wind turbines, energy storage, grid interconnection, and gas turbines are the main components; on the thermal subsystem side, boilers, waste heat boilers, and thermal storage tanks are the core equipment; on the cooling load subsystem side, electric chillers and absorption chillers work together; and on the gas subsystem side, natural gas consumption not only affects economic efficiency but is also directly related to carbon emissions.
[0031] As shown in Figure 2, the process of the intraday optimal scheduling method for the integrated energy system in this embodiment is as follows:
[0032] Step 1: In this embodiment, the scheduling process of the integrated energy system during the day is discretized into a scheduling process of multiple consecutive scheduling time steps, each scheduling time step being 15 minutes.
[0033] Obtain the status and action observation data of the integrated energy system at each historical scheduling time step within the intraday historical time period prior to the current scheduling time step.
[0034] The status observation data for each historical time step includes wind speed observation data, light intensity observation data, temperature observation data, as well as observation data of cooling load (i.e., the load of electric chillers and absorption chillers), heat load (i.e., the load of gas boilers, waste heat boilers and thermal storage tanks), and electrical load (i.e., conventional basic electrical load and the power consumption of other electrical equipment in the system).
[0035] The action observation data for each historical time step includes the output observation data of the gas turbine at the corresponding moment, the charging and discharging power observation data of the energy storage device, and the power observation data interacting with the main grid.
[0036] Step 2: Normalize all wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heating load observation data, and electrical load observation data for all historical scheduling time steps in the intraday historical time period of the integrated energy system obtained in Step 1.
[0037] After normalization, the Gaussian process regression (GPR) method is used to obtain the probability density functions of the same type of state data for all historical scheduling time steps in the intraday historical time period of the integrated energy system. Specifically, these are the probability distribution functions of wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heat load observation data, and electrical load observation data.
[0038] In this embodiment, the Gaussian process regression method is used, which is a commonly used method for interval analysis due to its strong ability to capture nonlinear relationships and its advantages in uncertainty modeling. It constructs the covariance matrix between data pairs using a kernel function, and establishes a nonparametric model between the input and the target output for regression analysis. This process can be expressed as follows:
[0039]
[0040] Here, $D$ is the historical training set, constructed with the sliding window method from the state observation data and action observation data obtained in step 1 within the intraday historical time period. $y$ is the vector of historical target observations: a one-dimensional vector formed by arranging one type of state observation data (i.e., one of the wind speed, light intensity, temperature, cooling load, heating load, or electrical load observation data extracted separately in step 1) in chronological order across all historical scheduling time steps. $X$ is the corresponding input sample matrix, also extracted from the data obtained in step 1. The "24 time points" refer to 24 consecutive historical scheduling time steps. Each row of $X$ is a time series segment (TSS): for a given target observation, it is the historical feature vector formed by concatenating all state observation data and action observation data obtained in step 1 within the 24 consecutive historical scheduling time steps before that observation time.
[0041] $x_*$ is the input TSS vector for the current prediction time step, where "current prediction time step" refers to the target time step for which state prediction and scheduling are required. Specifically, $x_*$ is the feature vector formed by concatenating all state observation data and action observation data obtained in step 1 within the 24 consecutive historical scheduling time steps immediately preceding the current prediction time step.
[0042] The function output $p(y_* \mid x_*, D)$ is the probability density function of the target data (i.e., a specific type of state observation data) at the current prediction time.
[0043] $k(\cdot, \cdot)$ is the kernel function of the Gaussian process regression.
[0044] Then, based on the probability distribution functions of wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heat load observation data, and electrical load observation data, the 95% confidence intervals for wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heat load observation data, and electrical load observation data in all historical scheduling time steps are calculated respectively.
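The Gaussian process regression step and the 95% interval can be illustrated with the following minimal sketch. It is not the embodiment's implementation: the squared-exponential kernel, the `noise` and `length_scale` values, and the names `rbf_kernel` and `gpr_interval` are assumptions made for illustration, since the source does not state which kernel or hyperparameters are used. The interval returned is for the latent function (observation noise is not added back).

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gpr_interval(X, y, x_star, noise=1e-2, length_scale=1.0, z=1.96):
    """Return (mean, lower, upper) of the GP posterior at x_star.

    X: (n, d) matrix whose rows are time series segments (TSS).
    y: (n,) vector of one type of normalized state observation.
    x_star: (m, d) TSS rows for the time steps to predict.
    z: 1.96 gives the 95% interval of a Gaussian.
    """
    K = rbf_kernel(X, X, length_scale) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, x_star, length_scale)
    K_ss = rbf_kernel(x_star, x_star, length_scale)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                           # posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.clip(np.diag(K_ss) - np.sum(v**2, axis=0), 0.0, None)
    std = np.sqrt(var)                             # posterior standard deviation
    return mean, mean - z * std, mean + z * std
```

Applied once per state variable (wind speed, light intensity, temperature, and the three loads), the `lower` and `upper` outputs would play the role of the lower and upper boundary series of the 95% confidence interval.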
[0045] Finally, preprocessed observation data from intraday historical time periods are extracted to construct training sample pairs for the prediction model. Specifically, using a sliding window approach, the various state observation data from the previous historical scheduling time step (or historical data from multiple consecutive time steps) are used as input, and the state observation data from the next consecutive historical scheduling time step are used as the target output, thus obtaining multiple sample pairs. The input and target output in each sample pair exist in the form of time series segments (TSS). To effectively quantify the uncertainty of wind and solar power output and various load types, the state data in the target output are divided into two categories: upper boundary segments and lower boundary segments. Here, "interval" refers to the 95% confidence interval of each state observation data; the upper and lower boundaries refer to the upper and lower bound time series of this 95% confidence interval, respectively. Distinguishing between upper and lower boundary segments for prediction model training enables the prediction model not only to predict the development trend of the state but also to learn and output the dynamic range of state fluctuations (i.e., the uncertainty interval). All sample pairs constitute the dataset used for subsequent prediction model training.
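The sliding-window construction of sample pairs can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `make_sample_pairs`, the default `window=24`, and the `horizon` parameter are hypothetical choices, and the arrays are assumed to be already normalized and aligned by time step.

```python
import numpy as np

def make_sample_pairs(features, lower, upper, window=24, horizon=1):
    """Build (input TSS, target lower TSS, target upper TSS) sample pairs.

    features: (T, F) preprocessed state/action observations per time step.
    lower, upper: (T, S) bounds of the 95% interval for each state variable.
    window: number of past time steps in each input segment (assumed value).
    horizon: number of future time steps in each target segment (assumed value).
    """
    inputs, targets_lo, targets_hi = [], [], []
    for start in range(len(features) - window - horizon + 1):
        split = start + window
        inputs.append(features[start:split])              # past segment
        targets_lo.append(lower[split:split + horizon])   # lower boundary segment
        targets_hi.append(upper[split:split + horizon])   # upper boundary segment
    return np.stack(inputs), np.stack(targets_lo), np.stack(targets_hi)
```

With 96 time steps per day (15-minute steps), a window of 24, and a horizon of 1, this yields 72 sample pairs.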
[0046] Step 3: Generate a prediction model and pre-train it using the dataset obtained in Step 2 to obtain a pre-trained prediction model. Then, input the state observation data of the time step closest to the current scheduling time step (i.e., the previous time step) into the pre-trained prediction model. The prediction model then predicts the state data range for the current scheduling time step (i.e., the upper and lower boundaries of the predicted state). In this way, during subsequent integrated energy system scheduling, the trained reinforcement learning agent can make a decision based on the state prediction results of the current time step (considering the boundary characteristics of uncertainty fluctuations) and output the optimal execution action for the current time step, thereby achieving robust optimization scheduling of the system at the current time step.
[0047] The prediction model in this embodiment is a combination of the Transformer model and the Mamba model.
[0048] The Transformer model effectively captures long-term trends and global features through its self-attention mechanism, improving prediction accuracy. The basic structure of the Transformer model includes multiple encoder and decoder layers. Each encoder layer consists of a multi-head self-attention mechanism, a feedforward neural network, residual connections, and normalization operations. The decoder layer has a similar structure to the encoder layer, but adds a multi-head encoder-decoder attention layer to handle the dependency between the encoder output and the decoder input.
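The self-attention operation at the core of each encoder layer can be sketched as below. This is a generic single-head scaled dot-product attention, shown only to make the mechanism concrete; the function name `self_attention` and the projection matrices `w_q`, `w_k`, `w_v` are illustrative, and the embodiment's layer sizes, head count, and masking are not given in the source.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # similarity between time steps
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ v, weights
```

Each row of `weights` shows how strongly one time step attends to every other time step in the segment, which is how long-range dependencies within the 24-step window can be captured.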
[0049] The Mamba model excels at handling sequence data with temporal dependencies, preserving contextual information over long time steps and accurately capturing local temporal fluctuations. The basic structure of the Mamba model mainly includes linear projection layers, one-dimensional convolutional layers, a core gated state-space model (Selective SSM) layer, and a hardware-aware parallel algorithm module. Its core principle is the introduction of a selective scanning mechanism, which dynamically determines whether to remember or forget information based on input features, thereby achieving efficient modeling of extremely long sequences while maintaining linear computational complexity.
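The selective scanning idea can be illustrated with a toy single-channel recurrence. This is a heavily simplified sketch, not the Mamba architecture itself: a real Mamba block uses learned linear projections over many channels, a one-dimensional convolution, gating, and a hardware-aware parallel scan. The name `selective_scan` and all parameters here are hypothetical.

```python
import numpy as np

def selective_scan(x, A, w_delta, W_B, W_C):
    """Toy single-channel selective state-space recurrence.

    x: (T,) input sequence.
    A: (n,) diagonal of the state matrix (negative entries give decay).
    w_delta: scalar; W_B, W_C: (n,) vectors. The step size, B, and C all
    depend on the current input, which is what makes the scan "selective".
    """
    h = np.zeros_like(A, dtype=float)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        delta = np.log1p(np.exp(w_delta * x_t))  # softplus keeps the step positive
        B_t = W_B * x_t                          # input-dependent input vector
        C_t = W_C * x_t                          # input-dependent output vector
        h = np.exp(delta * A) * h + delta * B_t * x_t  # discretized state update
        y[t] = C_t @ h
    return y
```

The loop visits each time step once, so the cost grows linearly with sequence length; a large `delta` lets the current input overwrite the state, while a small one preserves earlier context.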
[0050] In the prediction model of this embodiment, the Transformer model and the Mamba model are connected in sequence (i.e., they are in a cascaded structure, with the output of the Transformer model directly connected to the input of the Mamba model). The output parameters of the Transformer model are directly used as part of the input of the Mamba model for prediction, while the remaining input of the Mamba model is the historical state memory array of the previous time step (i.e., the combination of various state and action observation data of the previous historical scheduling time step).
[0051] When training with the dataset obtained in step 2, the input to the Transformer model is the input time series segment of each sample pair (i.e., a feature sequence formed by concatenating the state observation data and action observation data at each moment within the previous one or more historical scheduling time steps). Positional encoding is first applied to the input sequence, which is then fed into the multi-head self-attention module and the feedforward neural network module of the Transformer model. By calculating the attention weights between different time steps and different feature dimensions in the sequence, the model captures the local spatial coupling correlations and global long-term evolution trends among multi-energy loads, meteorological conditions, and equipment status. The final output is the extracted high-dimensional deep feature representation vector, which serves as the basis for subsequent feature fusion and dynamic evolution.
[0052] The input structure of the Mamba model is as follows:
[0053] $$Z_t = \left[\, F_t \,;\, M_{t-1} \,\right]$$
[0054] In the formula: $F_t$ is the output vector after deep feature extraction by the Transformer model; $M_{t-1}$ is the state memory array for the historical moment. With respect to the dataset sample pairs from step 2, this array is the set of 16 types of historical state and action observation data from the previous historical scheduling time step (i.e., time $t-1$) immediately adjacent to the target prediction time. $Z_t$ is the input vector of the Mamba state-space model, obtained by concatenating $F_t$ and $M_{t-1}$.
[0055] When training with the dataset obtained in step 2, the input time series segment of each sample pair (i.e., the state observation data and action observation data at each moment within the previous one or more historical scheduling time steps) is first fed into the Transformer model, where its internal multi-head self-attention mechanism performs feature extraction to obtain the high-dimensional deep feature representation vector $F_t$. Subsequently, $F_t$ is concatenated with the historical state memory array of the previous time step, $M_{t-1}$, to obtain the cascaded vector $Z_t$. The Mamba model takes $Z_t$ as input and performs temporal evolution processing through its internal selective state equation, ultimately obtaining the predicted state data interval for each moment of the next historical scheduling time step in the sample pair, i.e., the lower boundary sequence $\hat{y}^{L}_t$ and the upper boundary sequence $\hat{y}^{U}_t$ of the predicted state.
[0056] After each sample pair completes a forward prediction, the prediction error between the state data prediction results $\hat{y}^{L}_t$ and $\hat{y}^{U}_t$ and the actual target output in the sample pair is calculated. Here, the "actual target output" refers to $y^{L}_t$ and $y^{U}_t$, which are, respectively, the actual lower boundary segment and the actual upper boundary segment of the 95% confidence interval of each state data obtained from Gaussian process regression (GPR) in step 2.
[0057] In the initial stage, 1000 batches of samples are used to pre-train the prediction model independently on historical data. For each sample pair, the prediction error between the predicted interval and the actual interval is calculated (e.g., using the quantile loss function or mean squared error). This error is then backpropagated to update the parameters of the Transformer model and the Mamba model simultaneously. Through continuous iterative training on a large number of sample pairs, the model learns the temporal mapping between the input sequences and the lower and upper boundary sequences, yielding the pre-trained prediction model. This pre-training stage is purely supervised learning based on sequence prediction errors and does not involve the subsequent loss function updates based on agent scheduling rewards.
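The two loss options mentioned for pre-training can be sketched as follows. The function names `interval_mse` and `pinball_loss` are hypothetical, and how the embodiment weights the lower and upper boundary errors is not stated in the source; equal weighting is assumed here.

```python
import numpy as np

def interval_mse(lower_true, upper_true, lower_pred, upper_pred):
    """Mean squared error on both boundary sequences (equal weighting assumed)."""
    lo_err = np.mean((np.asarray(lower_pred) - np.asarray(lower_true)) ** 2)
    hi_err = np.mean((np.asarray(upper_pred) - np.asarray(upper_true)) ** 2)
    return float(lo_err + hi_err)

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

With the pinball loss, a high quantile level such as 0.975 penalizes under-prediction much more than over-prediction, which pushes the prediction toward the upper boundary; a low level such as 0.025 does the opposite.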
[0058] Step 4: For the intraday scheduling of the integrated energy system, using the integrated energy system applied to the intraday scheduling plan as the environment of the sequential decision model, construct the state space, action space, multi-objective optimization function, and reward function of the sequential decision model.
[0059] In this embodiment, the state space of the constructed sequential decision model consists of the state observation data of the integrated energy system for each historical scheduling time step within the intraday historical time period prior to the current scheduling time step obtained in step 1, and the state data prediction result of the integrated energy system for the current scheduling time step obtained in step 3. The state observation data of the integrated energy system for each historical scheduling time step and the state data prediction result of the integrated energy system for the current scheduling time step are respectively used as states in the state space.
[0060] In this embodiment, the action space of the constructed sequential decision-making model consists of the action observation data of the integrated energy system at each historical scheduling time step within the intraday historical time period obtained in step 1. The action observation data of the integrated energy system at each historical scheduling time step is used as the action in the action space.
[0061] In this embodiment, the multi-objective optimization function of the constructed sequential decision model aims to minimize the total cost of the integrated energy system while ensuring that the constraints are met within the region, and at the same time improve the utilization rate of new energy sources and reduce the power fluctuation of the tie line.
[0062] Specifically, the integrated energy system includes $N_G$ gas turbines, $N_{PV}$ photovoltaic power stations, $N_W$ wind power stations, $N_{ES}$ energy storage power stations, a non-adjustable rigid load, and an adjustable flexible load. The multi-objective optimization function established in this embodiment is shown in the following formula:
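[0063] $$\min\; C_{total} = \omega_1 C_{op} + \omega_2 C_{cur} + \omega_3 C_{ls} + \omega_4 C_{pf} + \omega_5 C_{tie}$$
[0064] In the formula: $C_{op}$, $C_{cur}$, $C_{ls}$, $C_{pf}$, and $C_{tie}$ are, respectively, the operating cost of the integrated energy system, the renewable energy curtailment penalty cost, the load shedding penalty cost, the power flow limit violation penalty cost, and the tie-line power fluctuation penalty cost; $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$, and $\omega_5$ are the weighting coefficients of each cost; $C_{total}$ is the total cost of the integrated energy system.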
[0065] Therefore, in the sequential decision-making model constructed in this embodiment, the objective is to maximize the reward, and the reward function $r$ is set to the negative of the total cost $C_{total}$, as shown in the following formula:
[0066] $$r = -C_{total}$$
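The weighted total cost and the reward derived from it can be sketched as below. The class `CostTerms`, the function names, and the default unit weights are illustrative assumptions; the source does not give the weighting coefficient values.

```python
from dataclasses import dataclass

@dataclass
class CostTerms:
    operating: float        # operating cost
    curtailment: float      # renewable energy curtailment penalty
    load_shedding: float    # load shedding penalty
    flow_violation: float   # power flow limit violation penalty
    tie_line: float         # tie-line power fluctuation penalty

def total_cost(terms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five cost terms (weights are assumed values)."""
    parts = (terms.operating, terms.curtailment, terms.load_shedding,
             terms.flow_violation, terms.tie_line)
    return sum(w * c for w, c in zip(weights, parts))

def reward(terms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Reward is the negative of the total cost."""
    return -total_cost(terms, weights)
```

Because the reward is the negated cost, an agent that maximizes reward minimizes the weighted total cost; raising one weight makes the agent trade other costs against that term.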
[0067] The operating cost of the integrated energy system, $C_{op}$, is shown in the following formula:
[0068]
[0069] In the formula: $T$ is the set of all discrete time points; $P^{G}_{i,t}$ is the output of the $i$-th gas turbine at time $t$; $P^{ES}_{j,t}$ is the charging and discharging power of the $j$-th energy storage device at time $t$; $P^{grid}_t$ is the power exchanged with the main grid; $\Delta P^{flex}_t$ is the reduction amount of the flexible load; $c_{G}$, $c_{ES}$, $c_{grid}$, and $c_{flex}$ are the corresponding cost coefficients.
[0070] The renewable energy curtailment penalty cost, $C_{cur}$, is shown in the following formula:
[0071]
[0072] In the formula: $\lambda_{cur}$ is the penalty coefficient for curtailment of renewable energy; $\Delta P_t$ is the power surplus during period $t$, which can be expressed as the following formula:
[0073]
[0074] Here, $P^{PV}_{i,t}$ is the output of the $i$-th photovoltaic power station during period $t$; $P^{W}_{j,t}$ is the output of the $j$-th wind power station during period $t$; $P^{L}_t$ is the load during period $t$.
[0075] The load shedding penalty cost, $C_{ls}$, is shown in the following formula:
[0076]
[0077] In the formula: $\lambda_{ls}$ is the load shedding penalty coefficient.
[0078] The power flow limit violation penalty cost, $C_{pf}$, is shown in the following formula:
[0079]
[0080] In the formula: $\lambda_{pf}$ is the penalty coefficient for power flow limit violations; $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 if and only if a branch power flow exceeds its limit, and 0 otherwise; $E_t$ is a branch power flow limit violation event occurring in the integrated energy system during period $t$.
[0081] The tie-line power fluctuation penalty cost, $C_{tie}$, is shown in the following formula:
[0082]
[0083] In the formula: $\lambda_{tie}$ is the penalty factor for tie-line power fluctuations; $P^{plan}_t$ is the planned tie-line power value issued by the main grid.
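The four penalty costs can be sketched as follows. The exact functional forms are not recoverable from the source, so every form below is an assumption: curtailment is taken as the positive part of the power surplus, load shedding is passed in as a per-period amount, the flow penalty counts periods in which a violation event occurs, and the tie-line penalty uses the absolute deviation from the planned value. The function name and coefficient names are hypothetical.

```python
def penalty_costs(surplus, shed, flow_violated, tie_power, tie_plan,
                  k_cur=1.0, k_ls=1.0, k_pf=1.0, k_tie=1.0):
    """Return (curtailment, load shedding, flow violation, tie-line) penalties.

    surplus: per-period power surplus (renewable output minus load).
    shed: per-period amount of load shed.
    flow_violated: per-period booleans, True if any branch exceeds its limit.
    tie_power, tie_plan: actual and planned tie-line power per period.
    """
    c_cur = k_cur * sum(max(s, 0.0) for s in surplus)   # assumed: only surplus > 0 counts
    c_ls = k_ls * sum(shed)
    c_pf = k_pf * sum(1 for v in flow_violated if v)    # indicator function per period
    c_tie = k_tie * sum(abs(p - q) for p, q in zip(tie_power, tie_plan))  # assumed form
    return c_cur, c_ls, c_pf, c_tie
```

A squared deviation could replace the absolute value in the tie-line term if larger excursions should be penalized more heavily; the source does not say which is used.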
[0084] In this embodiment, the constraints of the established multi-objective optimization function include power balance constraints, energy storage device state of charge constraints, grid interaction power upper and lower limit constraints, and line power flow constraints. The formulas for each constraint are as follows:
[0085] Power balance constraints:
[0086]
[0087] In the formula, $P^{G}_t$, $P^{PV}_t$, $P^{W}_t$, and $P^{ES}_t$ are, respectively, the outputs of the gas turbines, photovoltaic power stations, wind power stations, and energy storage power stations during period $t$; $P^{grid}_t$ is the power exchanged with the main grid during period $t$; $P^{L}_t$ is the total load during period $t$; $\Delta P^{flex}_t$ is the actual reduction in flexible load during period $t$.
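A power balance check for a single period can be sketched as below. The sign conventions are assumptions not stated in the source: storage discharge and grid import are taken as positive, and the flexible load reduction lowers the demand to be served. The function name is hypothetical.

```python
def power_balance_residual(p_gt, p_pv, p_wind, p_es, p_grid, load, flex_cut):
    """Supply minus net demand for one period; zero when balanced.

    Assumed conventions: p_es > 0 means discharging, p_grid > 0 means
    importing from the main grid, flex_cut >= 0 reduces the served load.
    """
    supply = p_gt + p_pv + p_wind + p_es + p_grid
    return supply - (load - flex_cut)
```

A nonzero residual indicates that the candidate action violates the balance constraint for that period and would need to be corrected or penalized.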
[0088] Energy storage device state of charge (SOC) constraints:
[0089]
[0090]
[0091]
[0092]
[0093] In the formula, $SOC_t$ is the state of charge of the energy storage device in period $t$; $E^{ES}$ is the energy storage capacity; $\eta$ is the charge and discharge efficiency; $P^{ES}_{\min}$ and $P^{ES}_{\max}$ are the lower and upper limits of the charging and discharging power, respectively; $SOC_{\min}$ and $SOC_{\max}$ are the minimum and maximum values of the state of charge of the energy storage device, respectively; $SOC_0$ and $SOC_T$ are the energy storage states of charge at the beginning and end of the scheduling cycle.
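The state-of-charge constraints can be sketched as follows. This is an illustration under assumptions: the update rule (discharge divided by efficiency, charge multiplied by efficiency), the sign convention, and the requirement that the final state of charge equals the initial one are common modeling choices, not forms confirmed by the source. The 0.25 h step matches the 15-minute scheduling time step; the function names are hypothetical.

```python
def step_soc(soc, power, capacity, eta=0.95, dt=0.25):
    """Advance the state of charge by one period.

    power > 0 discharges and power < 0 charges (assumed convention);
    dt is the period length in hours (15 minutes = 0.25 h).
    """
    if power >= 0:
        energy_out = power * dt / eta   # discharging draws more than is delivered
    else:
        energy_out = power * dt * eta   # charging stores less than is absorbed
    return soc - energy_out / capacity

def soc_feasible(soc_path, powers, p_min, p_max, soc_min, soc_max, tol=1e-6):
    """Check power limits, SOC limits, and equal start/end SOC (assumed)."""
    within_power = all(p_min <= p <= p_max for p in powers)
    within_soc = all(soc_min <= s <= soc_max for s in soc_path)
    cyclic = abs(soc_path[0] - soc_path[-1]) <= tol
    return within_power and within_soc and cyclic
```

Checking the whole path rather than a single step reflects that the state-of-charge constraint couples decisions across periods, which is why forward-looking state prediction matters for storage scheduling.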
[0094] Upper and lower limits of grid interaction power constraints:
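[0095] $$P^{grid}_{\min} \le P^{grid}_t \le P^{grid}_{\max}$$
[0096] In the formula, $P^{grid}_{\min}$ and $P^{grid}_{\max}$ are, respectively, the minimum and maximum values of the active power exchanged between the integrated energy system and the main grid, and $P^{grid}_t$ is the power exchanged with the main grid during period $t$.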
[0097] Line power flow constraints:
[0098]
[0099] In the formula, $G_{l,n}$ is the power transfer distribution factor of line $l$ with respect to node $n$, where the nodes include those at which the gas turbines, energy storage power stations, photovoltaic power stations, wind power stations, and loads are located; $F^{\max}_l$ is the upper power flow limit of line $l$; $N_{bus}$ is the total number of nodes in the power grid; $P^{L}_{n,t}$ is the load value of node $n$ in period $t$ after flexible loads participate in scheduling.
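A line flow check based on power transfer distribution factors can be sketched as below. The matrix layout (lines by nodes), the definition of net injection as generation minus load at each node, and the function names are assumptions for illustration; the source does not give the network data.

```python
import numpy as np

def line_flows(ptdf, injections):
    """Line flows from a (lines, nodes) distribution factor matrix.

    injections: (nodes,) net injection per node, assumed to be
    generation plus storage discharge minus load at that node.
    """
    return ptdf @ injections

def flow_violations(ptdf, injections, limits):
    """Boolean mask of lines whose absolute flow exceeds its upper limit."""
    return np.abs(line_flows(ptdf, injections)) > limits
```

Taking `flow_violations(...).any()` for a period gives the kind of yes/no violation event that an indicator-based power flow penalty would count.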
[0100] Step 5: For intraday scheduling of the integrated energy system, based on the generative adversarial imitation learning enhanced, error-regularized twin delayed deep deterministic policy gradient (GAIL-ERCL2-TD3) scheduling algorithm, construct the value network (Critic network) and policy network (Actor network) of the agent in the sequential decision model. Specific details are as follows:
[0101] (A) In this embodiment, an Error Regularized Critic (ERC) term is introduced into the Critic network to suppress the divergence of Q values. The Critic regularization term is shown in the following formula:
[0102]
[0103] In the formula: the projection direction is based on the principal features of the state (i.e., the eigenvector of the state covariance matrix associated with its largest eigenvalue); the weight term is the ERC regularization weight; the Q term is the current estimate of the Critic network; the state term is the state feature vector of the integrated energy system in a given time period; the action term is the scheduling action vector output by the agent in that period; the target term is the target Q value; the modified reward is the reward value after the regularization penalty term is introduced; and the original reward is the reward value fed back by the environment.
[0104] The regularization term in the Critic network loss function is defined as follows:
[0105]
[0106] In the formula: the direction vector is the eigenvector of the state covariance matrix associated with its largest eigenvalue; the strength term is the regularization strength; the target term is the target Q value corresponding to the experience sample; the state and action terms are the state data and action data of the experience sample, respectively; the expectation operator computes the mathematical expectation; and the left-hand side is the Error Regularization (ERC) loss term.
[0107] Furthermore, L2 regularization (weight decay) is introduced into the Critic network to further constrain the growth of the Critic network parameters and prevent overfitting caused by excessive model complexity. The L2 regularization formula is shown below:
[0108] $$L_{L2} = \lambda_{L2}\,\lVert \phi \rVert_2^2$$
[0109] In the formula: $\lambda_{L2}$ is the L2 weighting coefficient; $\phi$ denotes the Critic network parameters; and $L_{L2}$ is the L2 regularization loss term.
[0110] Therefore, in this embodiment, the joint loss function of the Critic network is constructed as follows:
[0111]
[0112] In the formula: the batch size is the number of samples drawn in each training iteration; the Q term is the Critic network's Q-value estimate for the i-th experience sample; the state and action terms are the i-th state data and the i-th action data sampled from the experience replay pool, respectively; the target term is the target Q value corresponding to the i-th sample; and the left-hand side is the joint loss function of the Critic network.
[0113] In this embodiment, the Critic network comprehensively considers the balance between minimizing the TD error, principal direction constraints, and parameter sparsity, effectively reducing the estimation bias and oscillation of the Critic. It suppresses the deviation of the TD error in the principal direction of the state, reducing overestimation; reduces the variance of the TD error, improving training stability; suppresses overfitting in irrelevant dimensions, making the Critic network learning more physically interpretable; delays policy updates, reducing policy oscillation; and smooths the target policy by adding action noise to improve robustness.
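The structure of such a joint Critic loss (TD error term, principal-direction regularizer, and L2 weight decay) can be sketched as below. The exact ERC expression did not survive in the text, so the form used here, the squared batch mean of the TD error weighted by the projection of the state onto the principal direction, is an assumption chosen only to illustrate the idea. All names and default weights are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def critic_joint_loss(q_values, targets, states, principal_dir, params,
                      erc_weight=0.1, l2_weight=1e-3):
    """Return (total, (td_loss, erc_loss, l2_loss)) for one mini-batch.

    ASSUMED ERC form: erc_weight * (mean_i[td_i * <principal_dir, s_i>])**2.
    """
    n = len(q_values)
    td = [q - y for q, y in zip(q_values, targets)]
    td_loss = sum(e * e for e in td) / n  # mean squared TD error
    projected = sum(e * dot(principal_dir, s) for e, s in zip(td, states)) / n
    erc_loss = erc_weight * projected ** 2  # penalize TD error along the principal direction
    l2_loss = l2_weight * sum(w * w for w in params)  # weight decay on Critic parameters
    return td_loss + erc_loss + l2_loss, (td_loss, erc_loss, l2_loss)


total, parts = critic_joint_loss(
    q_values=[1.0, 3.0], targets=[0.0, 1.0],
    states=[[1.0, 0.0], [0.0, 1.0]], principal_dir=[1.0, 0.0],
    params=[2.0], erc_weight=0.5, l2_weight=0.25,
)
```

Returning the three components separately makes it easy to monitor how much each regularizer contributes during training.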
[0114] (B) In this embodiment, the Generative Adversarial Imitation Learning (GAIL) framework is adopted, and the generator therein is used as the Actor network.
[0115] In the Actor network, the policy gradient is computed via the chain rule, and the network's own parameters are updated as shown in the following formula:
[0116] $$\nabla_{\theta} J(\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}}\left[\nabla_{\theta}\pi_{\theta}(s_t)\,\nabla_{a_t} Q_{\phi}(s_t, a_t)\Big|_{a_t=\pi_{\theta}(s_t)}\right]$$
[0117] In the formula: $\nabla_{\theta} J(\theta)$ is the gradient of the policy objective function of the Actor network; $\mathbb{E}_{s_t \sim \rho^{\pi}}$ is the operator that computes the mathematical expectation under the state visitation distribution $\rho^{\pi}$ induced by the policy $\pi$; $\nabla_{\theta}$ is the gradient operator with respect to the Actor network parameters $\theta$; $\pi_{\theta}(s_t)$ is the deterministic policy output by the Actor network with parameters $\theta$ in state $s_t$ (i.e., the generated scheduling action); $\nabla_{a_t}$ is the gradient operator with respect to the action $a_t$; $Q_{\phi}(s_t, a_t)$ is the action value function (i.e., the Q-value estimate) evaluated by the Critic network with parameters $\phi$ for the state-action pair $(s_t, a_t)$; $s_t$ is the state feature vector of the integrated energy system in period $t$, generated by the interaction of the agent with the environment; and $a_t$ is the action taken by the agent (i.e., the specific scheduling instruction of the integrated energy system).
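The chain-rule update can be illustrated on a one-dimensional toy problem. In the sketch below, the linear actor, the quadratic critic with its optimum at an action of twice the state, the learning rate, and all names are hypothetical and are chosen only so that the result is easy to verify by hand.

```python
def actor(theta, s):
    """Toy deterministic policy: action = theta * state."""
    return theta * s


def dq_da(s, a):
    """Action gradient of the toy critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient(theta, states):
    """Batch mean of (d pi / d theta) * (d Q / d a), evaluated at a = pi(s)."""
    grads = [s * dq_da(s, actor(theta, s)) for s in states]  # d pi / d theta = s
    return sum(grads) / len(grads)


def train(theta, states, lr=0.1, steps=200):
    """Gradient ascent on the policy objective."""
    for _ in range(steps):
        theta += lr * policy_gradient(theta, states)
    return theta


theta_star = train(0.0, [0.5, 1.0, 1.5])  # approaches 2.0, the maximizer of the toy critic
```

The actor never sees the reward directly; it only follows the critic's action gradient, which is why critic bias and oscillation translate directly into policy errors.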
[0118] In the Generative Adversarial Imitation Learning (GAIL) framework of this embodiment, the imitation learning of expert policies can be achieved through a generative adversarial network consisting of a generator (i.e., an Actor network) and a discriminator. The generator is responsible for generating actions based on the state, while the discriminator is used to distinguish between the actions generated by the generator and expert behavior (the probability that the output state and action pairs come from the expert policy). This GAIL framework does not require manual design of a reward function; it directly uses the discriminator's output on the state and action pairs generated by the generator as the reward, training the generator to generate behaviors similar to expert behavior.
[0119] In the adversarial imitation reinforcement learning agent proposed in this embodiment, a parameterized discriminator network is designed, whose parameters are the weights and biases of the discriminator neural network. The discriminator network takes a state-action tuple as input and outputs the probability that the tuple originates from expert knowledge (rather than being generated by the current agent). The Actor network (i.e., the generator) and the discriminator network constitute a generative adversarial network, which serves as an imitation model to learn expert knowledge and guide the agent in the early stages of imitation.
[0120] During the imitation process, the discriminator network and the Actor network (i.e., the generator) are in an adversarial relationship. The loss function (Min-Max objective) of the generative adversarial network is shown in the following equation:
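[0121] $$\min_{\theta}\max_{\omega} L(\theta,\omega) = \mathbb{E}_{E}\left[\log D_{\omega}(s_t^{E}, a_t^{E})\right] + \mathbb{E}_{\pi_{\theta}}\left[\log\left(1 - D_{\omega}(s_t, a_t)\right)\right]$$
[0122] In the formula: $\theta$ denotes the weight parameters of the Actor network (i.e., the generator); $\omega$ denotes the weight parameters of the discriminator network; $L(\theta,\omega)$ is the joint loss function of the generative adversarial network; $D_{\omega}(\cdot)$ is the probability, output by the discriminator network, that a state-action pair originates from expert knowledge; $\mathbb{E}_{E}$ is the operator that computes the mathematical expectation over the expert experience dataset; $\mathbb{E}_{\pi_{\theta}}$ is the operator that computes the mathematical expectation over the state-action data distribution sampled under the agent's current policy; $s_t^{E}$ and $a_t^{E}$ are, respectively, the integrated energy system state and the corresponding dispatch action in period $t$ in the expert experience data; and $s_t$ and $a_t$ are, respectively, the system state and the scheduling action generated by the interaction of the agent in period $t$.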
[0123] The discriminator network updates its parameters by maximizing this loss function, which allows it to distinguish agent behavior from expert knowledge. The Actor network then updates its parameters by minimizing the same loss function. The loss function of the imitation part of the Actor network can be further expressed as follows:
[0124]
[0125] In the formula: the left-hand side is the loss function of the Actor network for imitation learning based on expert demonstrations; the expectation operator computes the mathematical expectation under the state distribution sampled with the current policy of the Actor network; the discriminator term is the probability value output by the discriminator network; the state term is the state feature vector of the integrated energy system in a given time period; and the policy term is the scheduling action generated by the Actor network, with its current parameters, in that state.
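The adversarial relationship can be made concrete with the standard GAIL objective, in which the discriminator output is the probability that a pair comes from the expert. The sketch below assumes that standard form (mean log-probability on expert pairs plus mean log of one minus the probability on agent pairs); the function names and the clamping constant `eps` are hypothetical.

```python
import math


def discriminator_objective(d_expert, d_agent, eps=1e-12):
    """Objective maximized by the discriminator.

    d_expert: discriminator outputs on expert state-action pairs.
    d_agent:  discriminator outputs on agent-generated state-action pairs.
    """
    expert_term = sum(math.log(max(d, eps)) for d in d_expert) / len(d_expert)
    agent_term = sum(math.log(max(1.0 - d, eps)) for d in d_agent) / len(d_agent)
    return expert_term + agent_term


def actor_imitation_loss(d_agent, eps=1e-12):
    """Term minimized by the Actor: mean log(1 - D) on its own state-action pairs."""
    return sum(math.log(max(1.0 - d, eps)) for d in d_agent) / len(d_agent)
```

The Actor lowers its imitation loss by producing actions that the discriminator scores as more expert-like, so a higher discriminator output on agent pairs means a lower loss.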
[0126] Therefore, in the adversarial imitation reinforcement learning agent proposed in this embodiment, the policy gradient of the Actor network should be re-expressed as shown in the following equation:
[0127]
[0128] In the formula: the weight term is the dynamic imitation weight; the left-hand side is the overall policy gradient of the Actor network, obtained by combining environmental exploration and expert imitation; the expectation operator computes the mathematical expectation under the state distribution; the two gradient operators take partial derivatives with respect to the Actor network parameters and the action, respectively; the policy term is the deterministic action output by the Actor network; the value term is the action value (Q value) evaluated by the Critic network; and the discriminator term is the probability, evaluated by the discriminator network, that the state-action pair originates from the expert policy.
[0129] It can therefore be seen that the objective of the Actor network in this embodiment includes two parts: maximizing the action value evaluated by the Critic network in order to explore (based on environmental rewards), and maximizing the probability output by the discriminator network in order to imitate (based on expert demonstrations).
[0130] To fully leverage the guiding role of expert knowledge while preventing the agent from becoming overly reliant on it, this embodiment also incorporates a dynamic imitation weight. In the initial stages of imitation, the training of the Actor network is primarily guided by expert knowledge, and the imitation weight gradually decreases as training progresses. The decay law of the dynamic imitation weight is defined as a function of the training time step, as shown in the following equation:
[0131]
[0132] In the formula: the step variable is the current training time step; the step limit is the preset maximum training time step; and the decay function is a quadratic decay function, under which the dynamic imitation weight gradually decreases as the training time step increases.
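A quadratic decay schedule of this kind can be sketched as follows. The specific form used here, an initial weight multiplied by the square of the remaining training fraction, is an assumption, since the original expression is not available; the function name and the default initial weight are hypothetical.

```python
def imitation_weight(step, max_steps, initial_weight=1.0):
    """Quadratic decay from initial_weight at step 0 to 0 at max_steps (ASSUMED form)."""
    progress = min(max(step / max_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return initial_weight * (1.0 - progress) ** 2
```

With this form the weight falls quickly at first and flattens near the end, so it is already one quarter of its initial value halfway through training.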
[0133] When the discriminator's discrimination rate stabilizes at around 0.5, the agent has essentially mastered the expert knowledge (at this point the discriminator can no longer effectively distinguish whether an action comes from expert experience or is generated by the agent, i.e., a Nash equilibrium has been reached). At this stage, the decay of the dynamic imitation weight needs to be accelerated so that the agent transitions smoothly to autonomous exploration mode. The dynamic imitation weight formula is adjusted as follows:
[0134]
[0135] Through the above mechanism, the policy network of the agent in this embodiment realizes an adaptive switch from "expert guidance" to "autonomous exploration", which improves the robustness and exploration capability of the later policy while ensuring the convergence speed in the early stage.
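The switch to faster decay can be sketched as a simple trigger on recent discriminator outputs. The window-based stability test, the width of the band around 0.5, and the extra multiplicative factor are all assumptions made for illustration; the adjusted formula of this embodiment may differ.

```python
def discriminator_is_confused(recent_outputs, center=0.5, band=0.05):
    """True when every recent discriminator output lies within `band` of `center`."""
    return bool(recent_outputs) and all(abs(d - center) <= band for d in recent_outputs)


def accelerated_weight(base_weight, recent_outputs, acceleration=0.5):
    """Shrink the imitation weight by an extra factor once the discriminator is confused."""
    if discriminator_is_confused(recent_outputs):
        return base_weight * acceleration
    return base_weight
```

Requiring the whole window to sit near 0.5, rather than a single sample, avoids triggering the acceleration on a momentary dip early in training.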
[0136] Step 6: The agent of the sequential decision model is trained based on the state space and action space.
[0137] During the offline training phase of the agent, the agent selects a current state from the state space, and the Actor network selects a current action from the action space based on this state. Then, it calculates a reward function based on the interaction results of the current action in the environment, obtaining the reward function calculation result. Next, the Critic network calculates a value function based on the reward function calculation result, obtaining the value function calculation result, and updates its own network parameters based on the value function calculation result. Simultaneously, the Actor network updates its own network parameters based on the value function calculation result. This process is repeated until both the Actor network and the Critic network converge, resulting in a trained agent.
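Within such a training loop, a TD3-style agent typically computes the Critic target from the smaller of two target Critic estimates, adds clipped noise to the target action, and updates the Actor less often than the Critic. The sketch below shows these three ingredients in isolation; the hyperparameter values, the scalar state and action, and all function names are hypothetical and do not reproduce the training procedure of this embodiment.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0, rng=random):
    """Target Q value with target policy smoothing and clipped double-Q."""
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, a_low, a_high)
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))  # pessimistic estimate
    return reward + (0.0 if done else gamma * q_next)


def should_update_actor(step, policy_delay=2):
    """Delayed policy update: refresh the Actor only every `policy_delay` Critic steps."""
    return step % policy_delay == 0
```

Taking the minimum of the two target estimates counteracts overestimation, while the delay gives the Critic time to settle before the Actor follows its gradient.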
[0138] During the online scheduling and execution phase of the integrated energy system, after obtaining the trained agent, the predicted state data of the integrated energy system at the current scheduling time step in the state space is input into the Actor network of the trained agent. The Actor network then outputs the optimal action for the current scheduling time step (i.e., the specific equipment scheduling execution instruction that achieves the lowest overall system operating cost and the least penalty for exceeding limits at the current scheduling time step).
[0139] Furthermore, to ensure that the prediction model not only achieves temporal prediction accuracy but also takes the optimality of the system scheduling decisions into account, after the agent outputs and executes the optimal action at the current scheduling time step, the actual reward signal fed back by the system environment for that action is introduced into the loss function of the pre-trained prediction model. This allows the parameters of the pre-trained prediction model to be further corrected in reverse (i.e., online fine-tuning of the prediction model). Specifically, the pre-trained prediction model constructs a weighted loss function by combining the actual prediction error observed at the next time step with the scheduling reward value obtained from executing the optimal action at the current scheduling time step. The network parameters are then updated based on this weighted loss function, achieving closed-loop collaborative optimization of prediction and scheduling.
[0140] This weighted loss function is used to further update the parameters of the prediction model. The reward value directly reflects the quality of the scheduling strategy and is also used in this embodiment to evaluate the performance of the prediction model. The formula for calculating the weighted loss function of the pre-trained prediction model is as follows:
[0141]
[0142] In the formula: the coefficient terms are the weighting coefficients; the error term is the error between the prediction result output by the prediction model and the actual measured value (consistent with the pre-training stage described earlier, the error here specifically refers to the mean square error between the predicted interval result and the actual observed value); the reward term is the actual feedback reward value obtained by the agent after performing the optimal action at the current scheduling time step; and the two bounding terms are the best and worst possible reward values, which are used to normalize the reward value.
[0143] This guides the prediction model towards evolving towards "generating prediction results that are conducive to obtaining high scheduling rewards." Through this synchronous training mechanism, the optimization objective of the prediction model is no longer limited to minimizing prediction errors, but is directly related to scheduling performance, thereby achieving the co-evolution of the prediction model and the scheduling model and continuously improving the overall performance of intraday scheduling of the integrated energy system.
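One way to combine prediction error and normalized scheduling reward in a single loss is sketched below. The convex combination with a single coefficient `alpha` and the min-max normalization are assumptions, as the original weighted loss expression is not available; the function names are hypothetical.

```python
def normalized_reward(reward, worst, best):
    """Map a reward into [0, 1], where 1 corresponds to the best attainable reward."""
    if best == worst:
        raise ValueError("best and worst reward must differ")
    return min(max((reward - worst) / (best - worst), 0.0), 1.0)


def weighted_prediction_loss(mse, reward, worst, best, alpha=0.7):
    """Blend prediction error with a penalty that shrinks as the reward improves.

    ASSUMED form: alpha * mse + (1 - alpha) * (1 - normalized reward).
    """
    penalty = 1.0 - normalized_reward(reward, worst, best)
    return alpha * mse + (1.0 - alpha) * penalty
```

With this form, two predictions with the same mean square error are ranked by the scheduling reward they lead to, which is the coupling between prediction and scheduling that the closed loop relies on.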
[0144] Step 7: At the current scheduling time step, the integrated energy system receives and physically executes the optimal action instruction output by the intelligent agent network. The system state is thus transferred and enters the next time step, thereby completing the full scheduling of the integrated energy system at the current scheduling time step.
[0145] This embodiment also discloses an electronic device, including a processor and a memory, wherein program instructions in the memory are read and executed to perform steps 1-7 of the above-described integrated energy system intraday optimal scheduling method.
[0146] This embodiment also discloses a storage medium storing program instructions, which, when read and executed, perform steps 1-7 of the above-described integrated energy system intraday optimal scheduling method.
[0147] The preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings. These embodiments are merely descriptions of preferred embodiments and are not intended to limit the scope or concept of the invention. The specific technical features described in the above embodiments can be combined in any suitable manner without contradiction. Such combinations, as long as they do not violate the spirit of the present invention, should also be considered as part of this disclosure. To avoid unnecessary repetition, the present invention will not further describe the various possible combinations.
[0148] This invention is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this invention and without departing from the design idea of this invention, all modifications and improvements made by those skilled in the art to the technical solutions of this invention should fall within the protection scope of this invention. The technical content for which protection is sought in this invention has been fully described in the claims.
Claims
1. A method for optimal intraday scheduling of an integrated energy system, characterized in that the method comprises:
acquiring the state and action observation data of the integrated energy system at each historical scheduling time step within the intraday historical time period prior to the current scheduling time step, and preprocessing the acquired observation data;
constructing a dataset from the preprocessed observation data of the intraday historical time period to pre-train a prediction model, inputting the state observation data of the time step closest to the current scheduling time step into the pre-trained prediction model, and predicting, by the prediction model, the state data interval of the current scheduling time step;
forming a state space from the state observation data of the integrated energy system at each historical scheduling time step in the intraday historical time period before the current scheduling time step and the state data prediction results of the integrated energy system at the current scheduling time step; forming an action space from the action observation data of the integrated energy system at each historical scheduling time step in the intraday historical time period; and establishing a multi-objective optimization function, a reward function, and an agent value network and policy network for intraday scheduling, thereby constructing a sequential decision model for intraday scheduling;
training the value network and policy network of the agent based on the states in the state space and the actions in the action space, and adjusting the parameters of the value network and policy network according to the reward function calculation results, thereby obtaining a trained agent;
inputting the predicted state data of the integrated energy system at the current scheduling time step in the state space into the policy network of the trained agent, the policy network outputting the optimal action for the current scheduling time step; and
after the agent outputs and executes the optimal action at the current scheduling time step, introducing the real reward signal fed back for the optimal action into the loss function of the pre-trained prediction model to further correct the parameters of the pre-trained prediction model.
2. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the state observation data of the integrated energy system includes wind speed observation data, light intensity observation data, temperature observation data, cooling load observation data, heating load observation data, and electrical load observation data; and the action observation data of the integrated energy system includes gas turbine output observation data, energy storage device charging and discharging power observation data, and observation data of the power exchanged with the main grid.
3. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the preprocessing includes normalization and Gaussian process regression; after the observation data are normalized, Gaussian process regression is used to obtain the probability density function of each type of state data of the integrated energy system over all historical scheduling time steps in the intraday historical time period.
4. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that, for the preprocessed observation data, a sliding window is used to take the state observation data of the previous historical scheduling time step as the input and the state observation data of the next historical scheduling time step as the target output, thereby obtaining multiple sample pairs; the input and target output of each sample pair exist in the form of time series segments, and the state data in the target output are divided into two categories: upper boundary segments and lower boundary segments; and all sample pairs constitute the dataset used for training the prediction model.
5. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the prediction model is a combined model formed by cascading a Transformer model and a Mamba model.
6. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the multi-objective optimization function for intraday scheduling aims to minimize the total cost of the integrated energy system while ensuring that the constraints are met within the guaranteed region, and simultaneously to improve the utilization rate of new energy sources and reduce power fluctuations in tie lines.
7. The intraday optimal scheduling method for an integrated energy system according to claim 6, characterized in that the total cost is a weighted sum of the operating cost of the integrated energy system, the renewable energy curtailment penalty cost, the load shedding penalty cost, the power flow limit-violation penalty cost, and the tie-line power fluctuation penalty cost.
8. The intraday optimal scheduling method for an integrated energy system according to claim 7, characterized in that the reward function for intraday scheduling is set to the negative of the total cost.
9. The intraday optimal scheduling method for an integrated energy system according to claim 6, characterized in that the constraints include power balance constraints, energy storage device state of charge constraints, grid interaction power upper and lower limit constraints, and line power flow constraints.
10. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the agent's value network incorporates a Critic regularization term to suppress divergence of the Q values, and an L2 regularization term is introduced to constrain the growth of the value network parameters.
11. The intraday optimal scheduling method for an integrated energy system according to claim 1, characterized in that the agent's policy network adopts the generator in a generative adversarial imitation learning architecture; the generative adversarial network composed of the generator and the discriminator network serves as an imitation model to learn expert knowledge; and, during generator training, the policy gradient is updated based on the dynamic imitation weight.
12. An electronic device comprising a processor and a memory, characterized in that, when the program instructions in the memory are read and executed, the intraday optimal scheduling method for the integrated energy system according to any one of claims 1-11 is performed.
13. A storage medium storing program instructions, characterized in that, when the program instructions are read and executed, the intraday optimal scheduling method for the integrated energy system according to any one of claims 1-11 is performed.