Power grid economic dispatch method and device based on risk-aware distributional reinforcement learning
By using risk-aware distributional reinforcement learning, with quantile critic networks and a distorted-expectation objective function to optimize power grid dispatch strategies, this approach addresses the shortcomings of traditional methods in handling power system uncertainties and improves the economic dispatch performance of the power grid.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID JILIN ELECTRIC POWER COMPANY LIMITED
- Filing Date
- 2026-02-09
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional stochastic programming and robust optimization methods struggle to handle uncertainties accurately in real-time economic dispatch of power systems. They tend to produce overly conservative results or to neglect tail risks, and so fail to manage grid operating costs and voltage limit violation risks effectively.
The method employs risk-aware distributional reinforcement learning. A quantile critic network models the reward distribution, and the dispatch strategy is optimized with a distorted-expectation objective function. The approach explicitly manages tail risk and generates feasible dispatch strategies by combining action projection with reward/penalty mechanisms.
It improves the robustness of dispatch strategies, reduces grid operating costs and the risk of voltage limit violations, and enhances the safety and feasibility of power system operation.
Figure CN122246699A_ABST
Description
Technical Field
[0001] This invention belongs to the field of power system operation and control technology, and specifically relates to a power grid economic dispatch method and device based on risk-aware distributional reinforcement learning.
Background Technology
[0002] With the increasing penetration of variable renewable energy and the intensification of load fluctuations, real-time economic dispatch (RTED) of power systems faces significant uncertainties. On short timescales, these uncertainties are closely coupled with network constraints, making risk management a key design objective in RTED.
[0003] Traditional stochastic programming and robust optimization methods, while widely used to handle uncertainty, have limitations in practice. Stochastic programming relies on assumed probability distributions and is susceptible to model misspecification; robust optimization typically depends on a fixed-shape uncertainty set, which often leads to overly conservative results. Both methods require explicit distribution models or uncertainty sets, information that is often difficult to obtain accurately in real-world scenarios.
[0004] Reinforcement Learning (RL), as a data-driven approach, learns dispatch policies through interaction with the environment without fully specifying system dynamics or probabilistic laws. However, most standard RL algorithms (such as DDPG, TD3, and SAC) aim to maximize expected rewards and are inherently risk-neutral. In RTED, source-load uncertainty mainly manifests as power balance deviations, which can shrink the feasible region and cause drastic fluctuations in operating costs. Expectation-based policies may ignore tail risks, resulting in poor performance in certain extreme scenarios. To address this issue, it is necessary to go beyond average performance and explicitly consider the distribution of future rewards.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the prior art by proposing a power grid economic dispatch method and device based on risk-aware distributional reinforcement learning. The invention introduces distributional reinforcement learning, uses quantile critic networks to model the reward distribution, and explicitly manages tail risk by optimizing a distorted-expectation objective, thereby producing a risk-sensitive dispatch strategy that reduces power grid operating costs and the risk of voltage limit violations.
[0006] A first aspect of this invention proposes a power grid economic dispatch method based on risk-aware distributional reinforcement learning, comprising:
[0007] Modeling the real-time dispatch problem of the power system as a Markov decision process;
[0008] Based on the Markov decision process, constructing a distributional reinforcement learning model and generating a replay experience pool to store training samples. The distributional reinforcement learning model includes a policy network and two quantile critic networks with identical structures. The input of the policy network is the current state of the power system, and its output is the corresponding original action. The original action is mapped to the feasible region through a safety projection to generate the final action;
[0009] Training the distributional reinforcement learning model using training samples from the replay experience pool, wherein the policy network is updated using a distorted-expectation objective function to minimize the risk-aware cumulative cost, and the distorted-expectation objective function is calculated from the quantiles output by the quantile critic networks and a preset risk measure function;
[0010] During real-time dispatch, inputting the current state of the power system into the trained policy network and passing the output of the policy network through the safety projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
[0011] In one specific embodiment of the present invention, the step of modeling the real-time scheduling problem of the power system as a Markov decision process includes:
[0012] 1) Define the state space:
[0013] The state at time t includes: time features; the ultra-short-term active power forecasts of load and renewable energy; the current outputs of conventional generators (CG), renewable energy sources (RES), and battery storage systems (BSS); the node voltage magnitudes; and the state of charge of the BSS;
[0014] Specifically, the state comprises: the predicted active power of the load at time t+1; the predicted active power of renewable energy at time t+1; the active and reactive power setpoints of the conventional generators at time t; the active and reactive power setpoints of renewable energy at time t; the active power setpoint of the battery energy storage system at time t; the node voltage magnitudes at time t; and the state of charge of the battery energy storage system at time t.
[0015] 2) Define the action space;
[0016] The action at time t includes: the active and reactive power setpoints of the CG at the next time step; the active and reactive power setpoints of the RES at the next time step; and the active power setpoint of the BSS at the next time step;
[0017] 3) Define the reward function;
[0018] The reward function at time t is the negative of the total operating cost;
[0019] The total operating cost includes the power generation cost of the CG, the cost of wind or solar curtailment penalties for the RES, and a soft penalty for voltage over-limit, expressed as follows:
[0020]
[0021]
[0022]
[0023] where
[0024]
[0025] Here, the node subscript denotes the index of the node in the power system at which the variable is located, and the sets involved are the set of generator nodes, the set of renewable energy nodes, and the set of all nodes. One positive weighting coefficient applies to the power system generation cost and the wind or solar curtailment penalty cost, and a second positive weighting coefficient applies to the soft penalty term for voltage limit violations. The generation cost at each generator node is a quadratic cost function whose coefficients correspond to the quadratic, linear, and constant cost terms. The wind or solar curtailment penalty at each RES node is a cost function of the available renewable energy output of the node at time t and a penalty cost constant. The soft penalty term for voltage limit violations is defined using the upper and lower voltage limits of each node.
[0026] In one specific embodiment of the present invention, the method further includes:
[0027] The policy network is composed of a fully connected neural network;
[0028] The input to the policy network is the state at time t, and the output is the action at time t before safety projection, which is a function of the state parameterized by the policy network parameters. The safe action is obtained after safety projection;
[0029] The projection operator clips the action to the physical constraints of the equipment, including:
[0030] CG output constraints: the active and reactive power outputs of each CG must lie between their respective lower and upper limits;
[0031] CG ramp constraints: the change in the active power output of each CG between adjacent time steps must lie within its maximum ramp-down and ramp-up rates;
[0032] RES output constraints: the output of each RES is limited by its maximum design active power, its maximum available active power at time t, and its maximum power angle;
[0033] BSS power and capacity constraints: the power of each BSS must lie within its maximum charging and discharging power, and its stored energy must lie between its minimum and maximum capacity;
[0034] The limits involved are defined per node: the lower and upper limits of CG active power output; the lower and upper limits of CG reactive power output; the maximum ramp-down and ramp-up rates of the CG; the maximum design active power of the RES; the maximum available active power output of the RES at time t; the maximum power angle of the RES; the maximum charging power and maximum discharging power of the BSS; and the minimum and maximum capacity of the BSS;
[0035] If any variable in the action exceeds the upper limit of its constraint, the projection operator sets that variable to the upper limit; if any variable in the action falls below the lower limit of its constraint, the projection operator sets that variable to the lower limit.
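The clipping behavior of the projection operator can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `GeneratorLimits`, `project_cg`, and `project_bss` are hypothetical, the BSS sign convention (positive power means discharge), the ideal storage efficiency, and the way the ramp and energy limits are folded into a single interval are assumptions, and the 0.25-hour step corresponds to the 15-minute interval mentioned in the embodiment.

```python
from dataclasses import dataclass


def clip(x: float, lo: float, hi: float) -> float:
    """Limit x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


@dataclass
class GeneratorLimits:
    # Hypothetical container for the per-node CG limits.
    p_min: float
    p_max: float
    q_min: float
    q_max: float
    ramp_down: float  # maximum decrease per step (positive number)
    ramp_up: float    # maximum increase per step (positive number)


def project_cg(p_raw, q_raw, p_prev, lim):
    """Clip raw CG setpoints to the output and ramp limits."""
    # The ramp limits narrow the active-power interval around the
    # previous setpoint (assumed way of combining the two constraints).
    lo = max(lim.p_min, p_prev - lim.ramp_down)
    hi = min(lim.p_max, p_prev + lim.ramp_up)
    return clip(p_raw, lo, hi), clip(q_raw, lim.q_min, lim.q_max)


def project_bss(p_raw, energy, p_ch_max, p_dis_max, e_min, e_max, dt_hours=0.25):
    """Clip the raw BSS setpoint to the power and capacity limits.

    Assumptions: positive power discharges the battery, efficiency is
    ideal, and `energy` is the stored energy in the same unit as e_min/e_max.
    """
    hi = min(p_dis_max, (energy - e_min) / dt_hours)   # cannot go below e_min
    lo = max(-p_ch_max, -(e_max - energy) / dt_hours)  # cannot go above e_max
    return clip(p_raw, lo, hi)


if __name__ == "__main__":
    lim = GeneratorLimits(20.0, 100.0, -30.0, 30.0, 15.0, 15.0)
    print(project_cg(120.0, -50.0, 90.0, lim))
    print(project_bss(5.0, 1.0, 3.0, 3.0, 0.5, 2.0))
```

Because every variable is clipped independently, this projection is cheap and always returns a point inside the box constraints; coupled network constraints such as voltage limits are not handled here, which is consistent with the source treating voltage through a soft penalty in the reward.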
[0036] In one specific embodiment of the present invention, it further includes:
[0037] Both quantile critic networks are composed of fully connected neural networks. The inputs to a quantile critic network are the state at time t and the action after safety projection, and the output is a vector of quantiles of the action-value distribution at time t, evaluated at equally spaced quantile levels. Each of the two quantile critic networks has its own parameters.
[0038] In a specific embodiment of the present invention, the process of constructing the replay experience pool is as follows:
[0039] 1) Set the initial time and initialize the environment to obtain the initial state;
[0040] 2) Determine whether the number of samples in the replay experience pool is less than the number required for training the model. If yes, proceed to step 3); otherwise, proceed to step 4).
[0041] 3) Calculate the reward function at time t based on the current state;
[0042] The action at time t is obtained by randomly sampling the action space and then applying the safety projection. The simulation environment then returns the state at the next time step that results from applying this action in the current state. The current state, action, reward, and next state are added together, as the sample for time t, to the initially empty replay experience pool;
[0043] Let t = t + 1, then return to step 2).
[0044] 4) When the number of samples in the replay experience pool reaches the required number of samples for training the model, the construction of the replay experience pool is complete.
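The pool construction steps above can be sketched in a few lines of Python. This is an illustrative sketch only: `ReplayPool`, `ToyEnv`, and `warm_up` are hypothetical names, the one-dimensional toy environment stands in for the grid simulator, and the simple interval clip stands in for the safety projection. The `deque` with `maxlen` also shows one way to discard the earliest sample once a maximum pool size is reached.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity replay experience pool (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.samples.append((state, action, reward, next_state))

    def sample(self, batch_size, rng):
        return rng.sample(list(self.samples), batch_size)

    def __len__(self):
        return len(self.samples)


class ToyEnv:
    """Hypothetical stand-in for the power system simulator."""

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action
        reward = -abs(self.x)  # toy cost, not the patent's reward
        return self.x, reward


def warm_up(env, pool, n_required, rng, a_low=-1.0, a_high=1.0):
    """Fill the pool with randomly sampled, projected actions."""
    state = env.reset()
    while len(pool) < n_required:
        raw = rng.uniform(a_low - 0.5, a_high + 0.5)   # random raw action
        action = max(a_low, min(a_high, raw))          # stand-in projection
        next_state, reward = env.step(action)
        pool.add(state, action, reward, next_state)
        state = next_state
    return pool


if __name__ == "__main__":
    rng = random.Random(0)
    pool = warm_up(ToyEnv(), ReplayPool(capacity=100), n_required=20, rng=rng)
    print(len(pool), len(pool.sample(8, rng)))
```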
[0045] In a specific embodiment of the present invention, training the distributional reinforcement learning model using training samples in the replay experience pool includes:
[0046] During training, a batch of samples is randomly drawn from the current replay experience pool, each element of the batch being the sample data stored for one time point. For each sampled batch, the following training steps are performed:
[0047] 1) Initialize the training iteration counter, and set the policy network update frequency, the discount factor, the Gaussian noise variance vector, the update coefficient for the policy network and quantile critic network parameters, and the model optimizer learning rate;
[0048] Copy the initial policy network to serve as the initial target policy network, which has its own target policy network parameters; copy each of the two initial quantile critic networks to serve as the initial target quantile critic networks, each with its own target quantile critic network parameters;
[0049] 2) Input any sample from the currently sampled batch into the current policy network to calculate the original target action for the next time step, where the noise term is a Gaussian noise vector with zero mean and the preset variance;
[0050] The safe target action is then obtained after safety projection;
[0051] 3) Input the state at the next time step and the safe target action into the two target quantile critic networks to obtain two quantile distribution estimates;
[0052] Calculate the distorted expectation of the output of each of the two target quantile critic networks, using the vector of quantile levels that divide the interval [0,1] equally, the risk parameter, and the corresponding quantile weights. The distribution with the smaller distorted expectation is then selected as the target distribution, and the target quantile critic network parameters used are those of the network whose output has the smaller expected value;
[0053] 4) Construct the temporal difference target distribution;
[0054] Input the current state and action into the two current quantile critic networks to obtain two quantile distribution estimates;
[0055] 5) Train the current quantile critic networks using the quantile Huber loss function to minimize the difference between the predicted quantiles and the target quantiles;
[0056] The loss function is:
[0057]
[0058] where the quantile level is the k-th element of the quantile vector;
[0059] The temporal difference error of the critic network at the k-th quantile is calculated as:
[0060]
[0061] The quantile Huber loss function is calculated as:
[0062]
[0063] where the indicator function takes the value 1 when its condition holds and 0 otherwise, and the remaining parameter is the Huber threshold;
[0064] The loss function is minimized using the backpropagation algorithm, with the learning rate set to the model optimizer's learning rate, and the parameters of the two quantile critic networks are optimized respectively.
[0065] After the two quantile critic networks are updated, increment the iteration counter by 1;
[0066] 6) Determine whether the iteration counter is divisible by the policy network update frequency. If yes, proceed to step 7); otherwise, return to step 2).
[0067] 7) Update the policy network;
[0068] The gradients at different quantiles are weighted according to the weights derived from the distortion function:
[0069]
[0070] Use this gradient to update the policy network parameters by gradient ascent, with the learning rate set to the model optimizer learning rate;
[0071] Then, with a small update coefficient, synchronize the parameters of the current policy network and quantile critic networks to the corresponding target networks using a moving average:
[0072]
[0073]
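The moving-average synchronization of the target networks is commonly written as a soft (Polyak) update. The sketch below assumes that standard form, with parameters represented as flat lists of floats; `soft_update` and `tau` are hypothetical names, and 0.005 is the update coefficient reported for the embodiment.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward the online one.

    Assumed form: target <- (1 - tau) * target + tau * online.
    """
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = soft_update(target, online)
    print(target)
```

A small coefficient keeps the target networks changing slowly, which stabilizes the temporal difference targets used to train the critics.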
[0074] 8) Generate a new sample using the updated policy network;
[0075] This includes reading the current environment state from the simulation environment and calculating the reward function at time t based on the current state; inputting the current state into the current policy network and obtaining the action after safety projection; and obtaining from the simulation environment the state at the next time step that results from applying this action in the current state. This generates a new sample;
[0076] Determine if the number of samples in the current replay experience pool exceeds the preset maximum number:
[0077] If the maximum number is exceeded, delete the earliest sample that entered the replay experience pool and then add the new sample to the pool; otherwise, add the new sample to the pool directly;
[0078] Then let t = t + 1;
[0079] 9) Determine whether the current time has exceeded the set maximum number of time steps:
[0080] If it has, model training is complete and the trained policy network is saved; otherwise, a new batch of samples is randomly drawn from the current replay experience pool, and the process returns to step 2).
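The critic update in step 5) relies on the quantile Huber loss. The following Python sketch shows one common form of that loss under stated assumptions: the temporal difference error is taken elementwise (target quantile minus predicted quantile at the same index, which is one reading of "the error at the k-th quantile"), the indicator is 1 when the error is negative, and the loss is averaged over quantiles. The function names are hypothetical, and other formulations pair every predicted quantile with every target quantile.

```python
def huber(delta, kappa=1.0):
    """Huber function with threshold kappa."""
    a = abs(delta)
    if a <= kappa:
        return 0.5 * delta * delta
    return kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Average asymmetric Huber loss over paired quantiles (assumed form)."""
    total = 0.0
    for theta, y, tau in zip(pred, target, taus):
        delta = y - theta                       # TD error at this quantile
        indicator = 1.0 if delta < 0 else 0.0   # 1 when the error is negative
        total += abs(tau - indicator) * huber(delta, kappa)
    return total / len(pred)


if __name__ == "__main__":
    taus = [0.25, 0.75]
    print(quantile_huber_loss([0.0, 0.0], [1.0, -2.0], taus))
```

The asymmetric weight is what makes each output learn a different quantile: for a low quantile level, overestimates are penalized more heavily than underestimates, and the reverse holds for a high level.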
[0081] In one specific embodiment of the present invention, it further includes:
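[0082] The distorted expectation is calculated using the Wang risk measure, whose distortion function is $g(u) = \Phi\left(\Phi^{-1}(u) + \eta\right)$, where $u$ is the scalar input of the function, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\eta$ is the risk parameter;
[0083] The corresponding quantile weights are $w(\tau) = \varphi\left(\Phi^{-1}(\tau) + \eta\right) / \varphi\left(\Phi^{-1}(\tau)\right)$, where $\tau$ is the quantile input and $\varphi$ is the probability density function of the standard normal distribution.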
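As a rough illustration of how such quantile weights behave, the sketch below evaluates the Wang transform and its derivative with Python's standard library. It assumes the common convention in which the risk parameter is added inside the normal CDF, so that a positive value puts more weight on low quantiles; the midpoint quantile levels and the normalization of the weights to sum to one are additional assumptions, and all function names are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def wang_distortion(u, eta):
    """Wang transform g(u) = Phi(Phi^{-1}(u) + eta), for 0 < u < 1."""
    return _STD.cdf(_STD.inv_cdf(u) + eta)


def wang_weight(tau, eta):
    """Derivative of the Wang transform at quantile level tau."""
    z = _STD.inv_cdf(tau)
    return _STD.pdf(z + eta) / _STD.pdf(z)


def midpoint_levels(k):
    """k equally spaced quantile levels at interval midpoints (assumed)."""
    return [(i + 0.5) / k for i in range(k)]


def normalized_weights(k, eta):
    """Wang weights at the midpoint levels, rescaled to sum to 1 (assumed)."""
    raw = [wang_weight(t, eta) for t in midpoint_levels(k)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    print(normalized_weights(4, 0.0))   # risk-neutral: uniform weights
    print(normalized_weights(4, 0.75))  # more weight on low quantiles
```

With a zero risk parameter the weights are uniform and the distorted expectation reduces to the ordinary mean of the quantiles.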
[0084] A second aspect of this invention proposes a power grid economic dispatch device based on risk-aware distributional reinforcement learning, comprising:
[0085] A Markov decision process construction module, used to model the real-time dispatch problem of the power system as a Markov decision process;
[0086] A distributional reinforcement learning model construction module, used to construct a distributional reinforcement learning model based on the Markov decision process and to generate a replay experience pool for storing training samples. The distributional reinforcement learning model includes a policy network and two quantile critic networks with identical structures. The input of the policy network is the current state of the power system, and its output is the corresponding original action. The original action is mapped to the feasible region through a safety projection to generate the final action;
[0087] A model training module, used to train the distributional reinforcement learning model using training samples in the replay experience pool, wherein the policy network is updated using a distorted-expectation objective function to minimize the risk-aware cumulative cost, and the distorted-expectation objective function is calculated from the quantiles output by the quantile critic networks and a preset risk measure function;
[0088] An economic dispatch module, used to input the current state of the power system into the trained policy network during real-time dispatch. The output of the policy network is then passed through the safety projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
[0089] A third aspect of the present invention provides an electronic device comprising:
[0090] At least one processor; and a memory communicatively connected to said at least one processor;
[0091] The memory stores instructions that can be executed by the at least one processor, and the instructions are configured to execute the above-described power grid economic dispatch method based on risk-aware distributional reinforcement learning.
[0092] A fourth aspect of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above-described power grid economic dispatch method based on risk-aware distributional reinforcement learning.
[0093] The features and beneficial effects of this invention are as follows:
[0094] 1. This invention proposes a risk-aware RTED framework based on distributional reinforcement learning. By improving the TD3 algorithm and introducing quantile distributional critics, it captures the full distribution of returns rather than only the expected value.
[0095] 2. This invention uses distorted expectations (such as the Wang risk measure) as the optimization objective, enabling the dispatch strategy to adjust its focus on tail risks (such as high costs and limit violations) according to actual needs, thereby enhancing the robustness of the strategy.
[0096] 3. This invention combines action projection and reward / penalty mechanisms, mitigating constraint violations during exploration and improving the safety and feasibility of dispatch strategies in actual power systems.
[0097] 4. Compared with traditional reinforcement learning baselines, the method proposed in this invention performs better in risk-sensitive RTED tasks, reducing operating costs and the risk of voltage limit violations.
Attached Figure Description
[0098] Figure 1 is an overall flowchart of a power grid economic dispatch method based on risk-aware distributional reinforcement learning according to an embodiment of the present invention.
Detailed Implementation
[0099] This invention proposes a power grid economic dispatch method and device based on risk-aware distributional reinforcement learning, which is described in further detail below with reference to specific embodiments.
[0100] A first aspect of this invention proposes a power grid economic dispatch method based on risk-aware distributional reinforcement learning, comprising:
[0101] Modeling the real-time dispatch problem of the power system as a Markov decision process;
[0102] Based on the Markov decision process, constructing a distributional reinforcement learning model and generating a replay experience pool to store training samples. The distributional reinforcement learning model includes a policy network and two quantile critic networks with identical structures. The input of the policy network is the current state of the power system, and its output is the corresponding original action. The original action is mapped to the feasible region through a safety projection to generate the final action;
[0103] Training the distributional reinforcement learning model using training samples from the replay experience pool, wherein the policy network is updated using a distorted-expectation objective function to minimize the risk-aware cumulative cost, and the distorted-expectation objective function is calculated from the quantiles output by the quantile critic networks and a preset risk measure function;
[0104] During real-time dispatch, inputting the current state of the power system into the trained policy network and passing the output of the policy network through the safety projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
[0105] In a specific embodiment of the present invention, the overall process of the power grid economic dispatch method based on risk-aware distributional reinforcement learning is shown in Figure 1 and includes the following steps:
[0106] 1) Model the real-time power system dispatching problem as a Markov decision process (MDP); the specific steps are as follows:
[0107] 1-1) Define the state space.
[0108] In this embodiment, the state at time t includes: time features (time index and month index); the ultra-short-term active power forecasts of load and renewable energy (in this embodiment, the time interval between adjacent time steps is 15 minutes); the current outputs of conventional generators (CG), renewable energy sources (RES), and battery storage systems (BSS); the node voltage magnitudes; and the state of charge of the BSS.
[0109] Specifically, the state comprises: the predicted active power of the load at time t+1; the predicted active power of renewable energy at time t+1; the active and reactive power setpoints of the conventional generators at time t; the active and reactive power setpoints of renewable energy at time t; the active power setpoint of the battery energy storage system at time t; the node voltage magnitudes at time t; and the state of charge of the battery energy storage system at time t.
[0110] 1-2) Define the action space.
[0111] In this embodiment, the action at time t includes: the active and reactive power setpoints of the CG at the next time step; the active and reactive power setpoints of the RES; and the active power setpoint of the BSS.
[0112] 1-3) Define the reward function.
[0113] In this embodiment, the reward function at time t is the negative of the total operating cost.
[0114] The total operating cost includes the power generation cost of the CG, the wind / solar curtailment penalty cost of the RES, and the soft penalty term for voltage over-limit, as expressed below:
[0115]
[0116]
[0117]
[0118] where
[0119]
[0120] Here, the node subscript denotes the index of the node in the power system at which the variable is located, and the sets involved are the set of generator nodes, the set of renewable energy nodes, and the set of all nodes. The positive weighting coefficient for the power system generation cost and the wind / solar curtailment penalty cost is set to 0.5 in a specific embodiment of the present invention, and the positive weighting coefficient of the soft penalty term for voltage limit violations is set to 0.1. The generation cost at each generator node is a quadratic cost function whose coefficients correspond to the quadratic, linear, and constant cost terms and are determined by the operating characteristics of the generator at that node. The wind / solar curtailment penalty at each RES node is a cost function of the available renewable energy output of the node at time t and a penalty cost constant, which is uniformly set to 0.05 for all nodes in one specific embodiment of the present invention. The soft penalty term for voltage limit violations is defined using the upper and lower voltage limits of each node.
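To make the structure of this reward concrete, here is a minimal Python sketch. The weights 0.5 and 0.1 and the curtailment constant 0.05 are the values given for the embodiment; everything else is an assumption for illustration, in particular the linear form of the curtailment penalty, the linear form of the voltage penalty, and the way the weights are applied. The function names and the tuple layouts are hypothetical.

```python
def generation_cost(p, a, b, c):
    """Quadratic generation cost a*p^2 + b*p + c for one generator."""
    return a * p * p + b * p + c


def curtailment_penalty(p_available, p_dispatched, c_curt=0.05):
    """Assumed linear penalty on unused available renewable output."""
    return c_curt * max(0.0, p_available - p_dispatched)


def voltage_penalty(v, v_min, v_max):
    """Assumed linear soft penalty on the amount outside [v_min, v_max]."""
    return max(0.0, v - v_max) + max(0.0, v_min - v)


def reward(cg, res, buses, w_cost=0.5, w_volt=0.1):
    """Negative weighted total operating cost (assumed combination).

    cg:    list of (p, a, b, c) per generator node
    res:   list of (p_available, p_dispatched) per renewable node
    buses: list of (v, v_min, v_max) per node
    """
    gen = sum(generation_cost(p, a, b, c) for p, a, b, c in cg)
    curt = sum(curtailment_penalty(pa, p) for pa, p in res)
    volt = sum(voltage_penalty(v, lo, hi) for v, lo, hi in buses)
    return -(w_cost * (gen + curt) + w_volt * volt)


if __name__ == "__main__":
    cg = [(10.0, 0.01, 2.0, 5.0)]
    res = [(8.0, 6.0)]
    buses = [(1.06, 0.95, 1.05), (1.00, 0.95, 1.05)]
    print(reward(cg, res, buses))
```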
[0121] 2) Based on the results of step 1), construct a distributional reinforcement learning model.
[0122] In this embodiment, the distributional reinforcement learning model is built on an improved Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, referred to as DR-TD3. DR-TD3 comprises one policy network (actor) and two quantile critic networks; the specific steps are as follows:
[0123] 2-1) Construct a policy network.
[0124] In this embodiment, the policy network is a fully connected neural network. In one specific embodiment of the invention, the policy network contains two hidden layers of 256 neurons each. The input to the policy network is the state at time t, and the output is the action at time t before safety projection, which is a function of the state parameterized by the policy network parameters. The safe action is obtained by additionally applying the safety projection.
[0125] In this embodiment, the projection operator clips the action to the physical constraints of the equipment, specifically including:
[0126] CG output constraints: the active and reactive power outputs of each CG must lie between their respective lower and upper limits;
[0127] CG ramp constraints: the change in the active power output of each CG between adjacent time steps must lie within its maximum ramp-down and ramp-up rates;
[0128] RES output constraints: the output of each RES is limited by its maximum design active power, its maximum available active power at time t, and its maximum power angle;
[0129] BSS power and capacity constraints: the power of each BSS must lie within its maximum charging and discharging power, and its stored energy must lie between its minimum and maximum capacity.
[0130] The limits involved are defined per node: the lower and upper limits of CG active power output; the lower and upper limits of CG reactive power output; the maximum ramp-down and ramp-up rates of the CG; the maximum design active power of the RES; the maximum available active power output of the RES at time t; the maximum power angle of the RES; the maximum charging power and maximum discharging power of the BSS; and the minimum and maximum capacity of the BSS.
[0131] If any variable in the action exceeds the upper limit of its constraint, the projection operator sets that variable to the upper limit; if any variable in the action falls below the lower limit of its constraint, the projection operator sets that variable to the lower limit.
[0132] 2-2) Construct the quantile critic networks.
[0133] In this embodiment, the two quantile critic networks have the same structure, and both are fully connected neural networks. In a specific embodiment of the invention, each quantile critic network contains two hidden layers of 256 neurons each. The inputs to a quantile critic network are the state at time t and the action after safety projection, and the output is a vector of quantiles of the action-value distribution at time t, evaluated at equally spaced quantile levels. Each of the two quantile critic networks has its own parameters.
[0134] 3) Construct a replay experience pool.
[0135] 3-1) Set the initial time and initialize the environment to obtain the initial state.
[0136] 3-2) Determine whether the number of samples in the replay experience pool is less than the number required for training the model (set to 5000 in one specific embodiment of the present invention): if yes, proceed to step 3-3); otherwise, proceed to step 3-4).
[0137] 3-3) Calculate the reward function at time t based on the current state.
[0138] The action at time t is obtained by randomly sampling the action space and then applying the safety projection. The simulation environment then returns the state at the next time step that results from applying this action in the current state. The current state, action, reward, and next state are added together, as the sample for time t, to the initially empty replay experience pool.
[0139] Let t = t + 1, and then return to step 3-2).
[0140] 3-4) Once the number of samples in the replay experience pool reaches the number required for training the model, the replay experience pool is complete; proceed to step 4).
[0141] 4) Train the distributional reinforcement learning model constructed in step 2) using samples from the replay experience pool.
[0142] In this embodiment, during model training, a batch of samples is randomly drawn from the current replay experience pool (in this embodiment, the batch size is 256), each element of the batch being the sample data stored for one time point. For each sampled batch, the following training steps are performed:
[0143] 4-1) Initialize the training iteration counter, and set the policy network update frequency, the discount factor (0.99 in one specific embodiment of the present invention), the Gaussian noise variance vector (0.2 times the maximum action value in one specific embodiment of the invention), the update coefficient for the policy network and quantile critic network parameters (0.005 in one specific embodiment of the present invention), and the model optimizer learning rate (0.0003 in one specific embodiment of the present invention).
[0144] Copy the initial policy network to serve as the initial target policy network (target actor), which has its own target policy network parameters; copy each of the two initial quantile critic networks to serve as the initial target quantile critic networks, each with its own target quantile critic network parameters.
[0145] 4-2) Input any sample from the currently sampled batch into the current policy network to calculate the original target action at the next time step, where the noise term is a Gaussian noise vector with zero mean and the preset variance.
[0146] The safe target action is then obtained after safety projection.
[0147] 4-3) Input the state at the next time step and the safe target action into the two target quantile critic networks to obtain two quantile distribution estimates.
[0148] In this embodiment, to explicitly manage tail risk, a distorted expectation is used as the optimization objective. The distorted expectation is computed by applying a distortion function $g$ to the cumulative distribution function (CDF). This embodiment uses the Wang risk measure, whose distortion function is $g_\beta(u) = \Phi(\Phi^{-1}(u) + \beta)$, where $u$ is the scalar input of the function, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\beta$ is the risk parameter. When $\beta > 0$, lower quantiles (i.e., high-cost / low-reward regions) are assigned higher weights to reflect risk aversion. The corresponding quantile weight is $w_\beta(\tau) = \phi(\Phi^{-1}(\tau) + \beta) / \phi(\Phi^{-1}(\tau))$, where $\tau$ is the quantile input and $\phi$ is the probability density function of the standard normal distribution.
[0149] The distorted expectation is computed for each of the two target quantile critic networks: $D_i = \frac{1}{K}\sum_{k=1}^{K} w_\beta(\tau_k)\, Z^{k}_{\theta'_i}(s_{t+1}, a'_{t+1})$, $i = 1, 2$, where $\tau = (\tau_1, \dots, \tau_K)$ is the vector of quantile fractions obtained by dividing the interval [0,1] into $K$ equal parts. The distribution with the smaller distorted expectation is selected as the target distribution $Z_{\theta'_{\min}}(s_{t+1}, a'_{t+1})$, where $\theta'_{\min}$ denotes the parameters of the target quantile critic network whose distorted expectation is smaller.
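As a non-limiting illustration, the Wang quantile weights and the selection of the target distribution described above may be sketched as follows. The function names are hypothetical, and the sketch assumes that the quantile fractions are the midpoints $(k + 0.5)/K$ of $K$ equal sub-intervals of [0,1] and that the weighted quantiles are averaged with a factor $1/K$; neither detail is fixed by the text.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def quantile_midpoints(num_quantiles):
    """Midpoints of K equal sub-intervals of [0, 1] (assumed choice of fractions)."""
    return [(k + 0.5) / num_quantiles for k in range(num_quantiles)]


def wang_weight(tau, beta):
    """Derivative of the Wang distortion g(u) = Phi(Phi^-1(u) + beta) at u = tau."""
    z = _STD_NORMAL.inv_cdf(tau)
    return _STD_NORMAL.pdf(z + beta) / _STD_NORMAL.pdf(z)


def distorted_expectation(quantiles, beta):
    """Average of the quantile values, each scaled by its Wang weight (1/K factor assumed)."""
    taus = quantile_midpoints(len(quantiles))
    weighted = [wang_weight(tau, beta) * q for tau, q in zip(taus, quantiles)]
    return sum(weighted) / len(quantiles)


def select_target(quantiles_1, quantiles_2, beta):
    """Return the quantile estimate whose distorted expectation is smaller."""
    d1 = distorted_expectation(quantiles_1, beta)
    d2 = distorted_expectation(quantiles_2, beta)
    return quantiles_1 if d1 <= d2 else quantiles_2
```

With a risk parameter of 0 every weight equals 1 and the distorted expectation reduces to the ordinary mean of the quantiles; a positive risk parameter shifts weight toward the low-reward quantiles.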
[0150] 4-4) Construct the temporal difference target distribution $y_t = r_t + \gamma\, Z_{\theta'_{\min}}(s_{t+1}, a'_{t+1})$.
[0151] Input the current state $s_t$ and action $a_t$ into the two current quantile critic networks to obtain two quantile distribution estimates $Z_{\theta_i}(s_t, a_t)$, $i = 1, 2$.
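[0152] 4-5) Train the current quantile critic networks using the quantile Huber loss function to minimize the difference between the predicted quantiles and the target quantiles.
[0153] The loss function is:
[0154] $$L(\theta_i) = \frac{1}{K}\sum_{k=1}^{K} \rho^{\kappa}_{\tau_k}\left(\delta^{k}_{i}\right), \quad i = 1, 2$$
[0155] where $\tau_k$ is the $k$-th element of the quantile vector $\tau$.
[0156] The temporal difference error $\delta^{k}_{i}$ of the critic network at the $k$-th quantile is computed as:
[0157] $$\delta^{k}_{i} = y^{k}_{t} - Z^{k}_{\theta_i}(s_t, a_t)$$
[0158] The quantile Huber loss function $\rho^{\kappa}_{\tau}$ is computed as:
[0159] $$\rho^{\kappa}_{\tau}(\delta) = \left|\tau - \mathbb{1}\{\delta < 0\}\right| \cdot \frac{L_{\kappa}(\delta)}{\kappa}, \qquad L_{\kappa}(\delta) = \begin{cases} \frac{1}{2}\delta^{2}, & |\delta| \le \kappa \\ \kappa\left(|\delta| - \frac{1}{2}\kappa\right), & |\delta| > \kappa \end{cases}$$
[0160] where the indicator $\mathbb{1}\{\delta < 0\}$ equals 1 when $\delta < 0$ and 0 when $\delta \ge 0$, and $\kappa$ is the Huber threshold, which is set to 1.0 in one specific embodiment of the present invention.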
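As a non-limiting illustration, the temporal difference target and the quantile Huber loss of step 4-5) may be sketched as follows. The function names are hypothetical. The sketch pairs the $k$-th predicted quantile with the $k$-th target quantile, which is one reading of the per-quantile temporal difference error described above, and it assumes midpoint quantile fractions $(k + 0.5)/K$; a pairwise comparison of all predicted and target quantiles is another common variant.

```python
def huber(delta, kappa=1.0):
    """Huber function: quadratic inside [-kappa, kappa], linear outside."""
    abs_delta = abs(delta)
    if abs_delta <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs_delta - 0.5 * kappa)


def quantile_huber(delta, tau, kappa=1.0):
    """Asymmetric Huber loss for quantile fraction tau and TD error delta."""
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * huber(delta, kappa) / kappa


def td_target(reward, next_quantiles, gamma=0.99):
    """Shift and discount every quantile of the selected target distribution."""
    return [reward + gamma * q for q in next_quantiles]


def critic_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Mean quantile Huber loss over K quantiles (k-th prediction vs k-th target)."""
    num_quantiles = len(pred_quantiles)
    total = 0.0
    for k, (pred, target) in enumerate(zip(pred_quantiles, target_quantiles)):
        tau = (k + 0.5) / num_quantiles  # assumed midpoint fraction
        total += quantile_huber(target - pred, tau, kappa)
    return total / num_quantiles
```

In an actual implementation the loss would be computed on tensors by an automatic differentiation framework so that it can be minimized by backpropagation; the scalar version above only shows the arithmetic.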
[0161] In this embodiment, the loss function is minimized by backpropagation, with the learning rate set to the model optimizer learning rate, to optimize the parameters $\theta_1$ and $\theta_2$ of the two quantile critic networks respectively.
[0162] After the two quantile critic networks are updated, let the iteration counter $n = n + 1$.
[0163] 4-6) Judgment: if $n$ is divisible by $d$, proceed to step 4-7) to update the policy network, the target policy network, and the target quantile critic networks; otherwise, return to step 4-2).
[0164] 4-7) Update the policy network.
[0165] In this embodiment, when the policy gradient is computed, the gradients at different quantiles are weighted by the weights derived from the distortion function:
[0166]
[0167] Use the resulting gradient to update the policy network parameters $\psi$ by gradient ascent, with the learning rate set to the model optimizer learning rate.
[0168] This step then applies a soft update with a small update coefficient $\rho$, synchronizing the parameters of the current policy network and the current quantile critic networks to the corresponding target networks by a moving average:
[0169] $$\psi' \leftarrow \rho\,\psi + (1 - \rho)\,\psi'$$
[0170] $$\theta'_i \leftarrow \rho\,\theta_i + (1 - \rho)\,\theta'_i, \quad i = 1, 2$$
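As a non-limiting illustration, the delayed update check of step 4-6) and the moving-average soft update of step 4-7) may be sketched as follows. The function names are hypothetical, and network parameters are represented as flat lists of floats purely for illustration.

```python
def is_policy_update_step(iteration, update_frequency):
    """True when the iteration counter is divisible by the policy update frequency."""
    return iteration % update_frequency == 0


def soft_update(target_params, online_params, rho=0.005):
    """Move each target parameter a fraction rho toward its online counterpart."""
    return [
        rho * online + (1.0 - rho) * target
        for online, target in zip(online_params, target_params)
    ]
```

With the coefficient 0.005 of the specific embodiment, each call moves a target parameter 0.5% of the way toward the current network, so the target networks change slowly and the temporal difference targets stay stable between updates.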
[0171] 4-8) Generate a new sample using the updated policy network.
[0172] In this embodiment, the current environment state $s_t$ is read from the simulation environment, and the reward $r_t$ at time $t$ is computed from the current state $s_t$. The state $s_t$ is input into the current policy network, and the action $a_t$ is obtained from the network output after safe projection. The simulation environment takes the current state $s_t$ and the action $a_t$ and yields the state $s_{t+1}$ at the next moment, which produces a new sample $(s_t, a_t, r_t, s_{t+1})$.
[0173] Determine whether the number of samples in the current replay experience pool exceeds the preset maximum number (set to 1,000,000 in one specific embodiment of the present invention):
[0174] If the maximum number is exceeded, delete the earliest sample that entered the replay experience pool and then add the new sample $(s_t, a_t, r_t, s_{t+1})$ to the replay experience pool; otherwise, add the new sample $(s_t, a_t, r_t, s_{t+1})$ to the replay experience pool directly.
[0175] Then let t = t + 1.
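As a non-limiting illustration, a replay experience pool with a preset maximum size and first-in-first-out eviction may be sketched as follows. The class and method names are hypothetical; the sketch relies on the `maxlen` behavior of `collections.deque`, which discards the oldest entry when a new one is appended to a full pool.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity first-in-first-out store of (s, a, r, s_next) samples."""

    def __init__(self, capacity=1_000_000):
        # A deque with maxlen drops its oldest item when a full pool is appended to.
        self._samples = deque(maxlen=capacity)

    def __len__(self):
        return len(self._samples)

    def add(self, state, action, reward, next_state):
        """Store one transition, evicting the earliest one if the pool is full."""
        self._samples.append((state, action, reward, next_state))

    def oldest(self):
        """Return the earliest sample still held in the pool."""
        return self._samples[0]

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct samples uniformly at random."""
        indices = rng.sample(range(len(self._samples)), batch_size)
        return [self._samples[i] for i in indices]
```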
[0176] 4-9) Determine whether the current time $t$ exceeds the set maximum number of time steps (set to 200,000 in one specific embodiment of the present invention):
[0177] If it is exceeded, model training is complete and the trained policy network is saved; otherwise, a new batch of samples is randomly drawn from the current replay experience pool, and the process returns to step 4-2).
[0178] 5) Utilize the trained policy network for online real-time scheduling.
[0179] During real-time scheduling, the current state $s_t$ of the power system is read and input into the policy network trained in step 4), and the action $a_t$ at the current time is obtained after safe projection. According to the active and reactive power setpoints in $a_t$, the corresponding control commands are issued to each device, thereby realizing risk-aware real-time economic dispatch of the power system.
[0180] To implement the above embodiments, a second aspect of the present invention proposes a power grid economic dispatch device based on risk-aware distributional reinforcement learning, comprising:
[0181] The Markov decision process building module is used to model the real-time scheduling problem of the power system as a Markov decision process.
[0182] The distributional reinforcement learning model construction module is used to construct a distributional reinforcement learning model based on the Markov decision process and to generate a replay experience pool for storing training samples. The distributional reinforcement learning model includes a policy network and two quantile critic networks with identical structures. The input of the policy network is the current state of the power system, and the output of the policy network is the corresponding raw action. The raw action is mapped to the feasible region through a safe projection to generate the final action.
[0183] The model training module is used to train the distributional reinforcement learning model using training samples in the replay experience pool; wherein the policy network is updated using a distorted expectation objective function to minimize the risk-aware cumulative cost, and the distorted expectation objective function is calculated based on the quantiles output by the quantile critic networks and a preset risk measure function;
[0184] The economic dispatch module is used to input the current state of the power system into the trained policy network during real-time dispatch. The output of the policy network is then passed through the safe projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
[0185] In one specific embodiment of the present invention, the step of modeling the real-time scheduling problem of the power system as a Markov decision process includes:
[0186] 1) Define the state space:
[0187] Let the state at time t be $s_t$, which includes: time features; the ultra-short-term active power forecasts of load and renewable energy $(\hat{P}^{L}_{t+1}, \hat{P}^{RES}_{t+1})$; the current outputs of conventional generators (CG), renewable energy sources (RES), and battery storage systems (BSS) $(P^{CG}_{t}, Q^{CG}_{t}, P^{RES}_{t}, Q^{RES}_{t}, P^{BSS}_{t})$; the node voltage magnitudes $V_t$; and the state of charge of the BSS $SOC_t$;
[0188] where $\hat{P}^{L}_{t+1}$ is the forecast active power of the load at time t+1; $\hat{P}^{RES}_{t+1}$ is the forecast active power of renewable energy at time t+1; $P^{CG}_{t}$ and $Q^{CG}_{t}$ are the active and reactive power setpoints of the conventional generators at time t; $P^{RES}_{t}$ and $Q^{RES}_{t}$ are the active and reactive power setpoints of renewable energy at time t; $P^{BSS}_{t}$ is the active power setpoint of the battery energy storage system at time t; $V_t$ is the node voltage magnitude at time t; and $SOC_t$ is the state of charge of the battery energy storage system at time t.
[0189] 2) Define the action space;
[0190] Let the action at time t be $a_t$, which includes: the active and reactive power setpoints of the CG at the next moment $(P^{CG}_{t+1}, Q^{CG}_{t+1})$, the active and reactive power setpoints of the RES at the next moment $(P^{RES}_{t+1}, Q^{RES}_{t+1})$, and the active power setpoint of the BSS at the next moment $P^{BSS}_{t+1}$;
[0191] 3) Define the reward function;
[0192] Let the reward function at time t be $r_t$, which is the negative of the total operating cost;
[0193] The total operating cost includes the power generation cost of the CG, the cost of wind or solar curtailment penalties for the RES, and a soft penalty for voltage over-limit, expressed as follows:
[0194]
[0195]
[0196]
[0197] in,
[0198]
[0199] where the subscript $i$ denotes the index of the node in the power system at which the variable is located; the set $\mathcal{G}$ is the set of generator nodes; the set $\mathcal{R}$ is the set of renewable energy nodes; the set $\mathcal{N}$ is the set of all nodes; $w_c$ is the positive weighting coefficient for the generation cost and the wind or solar curtailment penalty cost of the power system; $w_v$ is the positive weighting coefficient for the soft penalty term for voltage limit violation; $C^{CG}_{i}$ is the quadratic cost function of the generator at node $i$, with $a_i$, $b_i$, and $c_i$ the coefficients of the quadratic, linear, and constant cost terms, respectively; $C^{RES}_{i}$ is the wind or solar curtailment penalty cost function of the RES at node $i$, where $\bar{P}^{RES}_{i,t}$ is the available renewable energy output of node $i$ at time $t$ and $c^{cur}$ is the penalty cost constant; $C^{V}_{i}$ is the soft penalty term for voltage limit violation, where $V^{\max}_{i}$ and $V^{\min}_{i}$ are the upper and lower voltage limits of node $i$.
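As a non-limiting illustration, a reward of the kind described above (the negative of a weighted sum of generation cost, curtailment penalty, and voltage soft penalty) may be sketched as follows. All function names are hypothetical. The quadratic generation cost follows the text; the linear curtailment penalty (penalty constant times curtailed power) and the linear voltage penalty (distance outside the voltage band) are assumed forms chosen for illustration, since the exact expressions are not recoverable from the text.

```python
def generation_cost(p, a, b, c):
    """Quadratic generation cost a*p^2 + b*p + c of one conventional generator."""
    return a * p * p + b * p + c


def curtailment_penalty(p_available, p_dispatched, unit_penalty):
    """Assumed linear penalty on renewable power that is available but not used."""
    return unit_penalty * max(0.0, p_available - p_dispatched)


def voltage_penalty(v, v_min, v_max):
    """Assumed linear soft penalty: distance by which v lies outside [v_min, v_max]."""
    return max(0.0, v - v_max) + max(0.0, v_min - v)


def reward(generators, renewables, voltages, w_cost, w_volt):
    """Negative total operating cost.

    generators: iterable of (p, a, b, c) per generator node
    renewables: iterable of (p_available, p_dispatched, unit_penalty) per RES node
    voltages:   iterable of (v, v_min, v_max) per node
    """
    gen = sum(generation_cost(p, a, b, c) for p, a, b, c in generators)
    cur = sum(curtailment_penalty(pa, p, c) for pa, p, c in renewables)
    vol = sum(voltage_penalty(v, lo, hi) for v, lo, hi in voltages)
    return -(w_cost * (gen + cur) + w_volt * vol)
```

Because the voltage term enters the reward as a weighted penalty rather than a hard constraint, a large voltage weight relative to the cost weight steers the policy away from limit violations without making any action infeasible.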
[0200] In one specific embodiment of the present invention, it further includes:
[0201] The policy network is composed of a fully connected neural network;
[0202] The input of the policy network is the state $s_t$ at time $t$, and the output is the action $\tilde{a}_t$ at time $t$ before safe projection, expressed as $\tilde{a}_t = \pi_{\psi}(s_t)$, where $\psi$ denotes the policy network parameters; the safe action $a_t = \mathcal{P}(\tilde{a}_t)$ is obtained after safe projection;
[0203] where the projection operator $\mathcal{P}$ clips the action $\tilde{a}_t$ to the physical constraints of the equipment, including:
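[0204] CG output constraints: $P^{CG,\min}_{i} \le P^{CG}_{i,t+1} \le P^{CG,\max}_{i}$, $Q^{CG,\min}_{i} \le Q^{CG}_{i,t+1} \le Q^{CG,\max}_{i}$;
[0205] CG ramping constraints: $-R^{down}_{i} \le P^{CG}_{i,t+1} - P^{CG}_{i,t} \le R^{up}_{i}$;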
[0206] RES output constraints: ;
[0207] BSS power and capacity constraints: ;
[0208] where $P^{CG,\min}_{i}$ and $P^{CG,\max}_{i}$ are the lower and upper limits of the active power output of the CG at node $i$; $Q^{CG,\min}_{i}$ and $Q^{CG,\max}_{i}$ are the lower and upper limits of the reactive power output of the CG at node $i$; $R^{down}_{i}$ and $R^{up}_{i}$ are the maximum ramp-down rate and maximum ramp-up rate of the CG at node $i$; $P^{RES,\max}_{i}$ is the maximum design active power of the RES at node $i$; $\bar{P}^{RES}_{i,t}$ is the maximum active power that the RES at node $i$ can output at time $t$; $\varphi^{\max}_{i}$ is the maximum power angle of the RES at node $i$; $P^{ch,\max}_{i}$ and $P^{dis,\max}_{i}$ are the maximum charging power and maximum discharging power of the BSS at node $i$; $E^{\min}_{i}$ and $E^{\max}_{i}$ are the minimum and maximum capacity of the BSS at node $i$;
[0209] If any variable in the action $\tilde{a}_t$ exceeds the upper limit of the corresponding constraint, the projection operator $\mathcal{P}$ sets that variable to the upper limit of the corresponding constraint; if any variable in the action falls below the lower limit of the corresponding constraint, the projection operator $\mathcal{P}$ sets that variable to the lower limit of the corresponding constraint.
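As a non-limiting illustration, the clipping performed by the projection operator may be sketched for a conventional generator and a battery storage system as follows. The function names are hypothetical. The battery sketch assumes that positive power means discharging, that the storage is lossless, and that the stored energy changes by power times a step length `dt`; none of these conventions is stated in the text.

```python
def clip(value, lower, upper):
    """Limit value to the interval [lower, upper]."""
    return max(lower, min(value, upper))


def project_cg_active(p_raw, p_prev, p_min, p_max, ramp_down, ramp_up):
    """Clip a CG active power setpoint to its output limits and ramp limits."""
    lower = max(p_min, p_prev - ramp_down)
    upper = min(p_max, p_prev + ramp_up)
    return clip(p_raw, lower, upper)


def project_bss_power(p_raw, energy, p_charge_max, p_discharge_max,
                      e_min, e_max, dt=1.0):
    """Clip a BSS power setpoint to power limits and remaining energy headroom.

    Assumed convention: positive power discharges, negative power charges.
    """
    upper = min(p_discharge_max, (energy - e_min) / dt)
    lower = max(-p_charge_max, -(e_max - energy) / dt)
    return clip(p_raw, lower, upper)
```

For example, a raw setpoint of 120 for a generator currently at 80, with output limits [20, 100] and ramp limits of 10 down and 15 up, is clipped to 95, because the ramp-up limit is tighter than the output limit.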
[0210] In one specific embodiment of the present invention, it further includes:
[0211] Both quantile critic networks are composed of fully connected neural networks; the input of each quantile critic network is the state $s_t$ at time $t$ and the action $a_t$ after safe projection, and the output is the quantiles of the action-value distribution at time $t$, $Z_{\theta_i}(s_t, a_t) = (Z^{1}_{\theta_i}, \dots, Z^{K}_{\theta_i})$, where $\tau = (\tau_1, \dots, \tau_K)$ is the vector of $K$ equally spaced quantile fractions to be predicted, and $\theta_1$ and $\theta_2$ are the parameters of the two quantile critic networks.
[0212] In a specific embodiment of the present invention, the process of constructing the replay experience pool is as follows:
[0213] 1) Set the initial time $t = 0$ and initialize the environment to obtain the initial state $s_0$;
[0214] 2) Determine whether the number of samples in the replay experience pool is less than the set number required for training the model; if so, proceed to step 3); otherwise, proceed to step 4).
[0215] 3) Compute the reward $r_t$ at time $t$ from the current state $s_t$;
[0216] The action $a_t$ at time $t$ is obtained by random sampling in the action space followed by safe projection. The simulation environment takes the current state $s_t$ and the action $a_t$ and yields the state $s_{t+1}$ at the next moment; then $(s_t, a_t, r_t, s_{t+1})$ is added, as the sample at time $t$, to the initially empty replay experience pool;
[0217] Let t = t + 1, then return to step 2).
[0218] 4) When the number of samples in the replay experience pool reaches the required number of samples for training the model, the construction of the replay experience pool is complete.
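As a non-limiting illustration, the construction of the replay experience pool with randomly sampled, safely projected actions may be sketched as follows. The function names and the callables `reset`, `step`, `reward_fn`, and `project` are hypothetical stand-ins for the simulation environment, the reward function, and the safe projection.

```python
import random


def uniform_action(low, high, rng=random):
    """Draw one action component-wise uniformly from the box [low, high]."""
    return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]


def fill_replay_pool(reset, step, reward_fn, sample_action, project, required):
    """Collect (s, a, r, s_next) samples with random actions until `required` is reached."""
    pool = []
    state = reset()                                   # step 1: initial state
    while len(pool) < required:                       # step 2: pool size check
        reward = reward_fn(state)                     # step 3: reward from current state
        action = project(state, sample_action())      # random action, then safe projection
        next_state = step(state, action)              # environment transition
        pool.append((state, action, reward, next_state))
        state = next_state                            # t = t + 1
    return pool                                       # step 4: construction complete
```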
[0219] In a specific embodiment of the present invention, training the distributional reinforcement learning model using training samples in the replay experience pool includes:
[0220] When training the model, a batch of samples $\{(s_t, a_t, r_t, s_{t+1})\}_{t \in \mathcal{B}}$ is randomly sampled from the current replay experience pool, where $\mathcal{B}$ is the set of sampled time steps; for each sampled batch, the following training steps are performed:
[0221] 1) Initialize the training iteration counter $n$; set the policy network update frequency $d$, the discount factor $\gamma$, the Gaussian noise variance vector $\sigma^2$, the soft-update coefficient $\rho$ for the parameters of the policy network and the quantile critic networks, and the model optimizer learning rate;
[0222] Copy the initial policy network $\pi_{\psi}$ as the initial target policy network $\pi_{\psi'}$, where $\psi'$ denotes the parameters of the target policy network; copy the two initial quantile critic networks to serve as the initial target quantile critic networks $Z_{\theta'_i}$, $i = 1, 2$, where $\theta'_i$ denotes the parameters of the $i$-th target quantile critic network;
[0223] 2) Input the next-step state $s_{t+1}$ of any sample from the currently sampled batch into the current policy network to compute the raw target action for the next time step: $\tilde{a}_{t+1} = \pi_{\psi}(s_{t+1}) + \epsilon$, where $\epsilon$ is a Gaussian noise vector with mean 0 and variance $\sigma^2$;
[0224] The safe target action $a'_{t+1} = \mathcal{P}(\tilde{a}_{t+1})$ is obtained after safe projection;
[0225] 3) Input the state $s_{t+1}$ at the next moment and the safe target action $a'_{t+1}$ into the two target quantile critic networks to obtain two quantile distribution estimates $Z_{\theta'_i}(s_{t+1}, a'_{t+1})$, $i = 1, 2$;
[0226] The distorted expectation is computed for each of the two target quantile critic networks: $D_i = \frac{1}{K}\sum_{k=1}^{K} w_\beta(\tau_k)\, Z^{k}_{\theta'_i}(s_{t+1}, a'_{t+1})$, $i = 1, 2$, where $\tau = (\tau_1, \dots, \tau_K)$ is the vector of quantile fractions obtained by dividing the interval [0,1] into $K$ equal parts, $\beta$ is the risk parameter, and $w_\beta$ is the quantile weight; the distribution with the smaller distorted expectation is then selected as the target distribution $Z_{\theta'_{\min}}(s_{t+1}, a'_{t+1})$, where $\theta'_{\min}$ denotes the parameters of the target quantile critic network whose distorted expectation is smaller;
[0227] 4) Construct the temporal difference target distribution $y_t = r_t + \gamma\, Z_{\theta'_{\min}}(s_{t+1}, a'_{t+1})$;
[0228] Input the current state $s_t$ and action $a_t$ into the two current quantile critic networks to obtain two quantile distribution estimates $Z_{\theta_i}(s_t, a_t)$, $i = 1, 2$;
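[0229] 5) Train the current quantile critic networks using the quantile Huber loss function to minimize the difference between the predicted quantiles and the target quantiles;
[0230] The loss function is:
[0231] $$L(\theta_i) = \frac{1}{K}\sum_{k=1}^{K} \rho^{\kappa}_{\tau_k}\left(\delta^{k}_{i}\right), \quad i = 1, 2$$
[0232] where $\tau_k$ is the $k$-th element of the quantile vector $\tau$;
[0233] The temporal difference error $\delta^{k}_{i}$ of the critic network at the $k$-th quantile is computed as:
[0234] $$\delta^{k}_{i} = y^{k}_{t} - Z^{k}_{\theta_i}(s_t, a_t)$$
[0235] The quantile Huber loss function $\rho^{\kappa}_{\tau}$ is computed as:
[0236] $$\rho^{\kappa}_{\tau}(\delta) = \left|\tau - \mathbb{1}\{\delta < 0\}\right| \cdot \frac{L_{\kappa}(\delta)}{\kappa}, \qquad L_{\kappa}(\delta) = \begin{cases} \frac{1}{2}\delta^{2}, & |\delta| \le \kappa \\ \kappa\left(|\delta| - \frac{1}{2}\kappa\right), & |\delta| > \kappa \end{cases}$$
[0237] where the indicator $\mathbb{1}\{\delta < 0\}$ equals 1 when $\delta < 0$ and 0 when $\delta \ge 0$, and $\kappa$ is the Huber threshold;
[0238] The loss function is minimized by backpropagation, with the learning rate set to the model optimizer learning rate, and the parameters $\theta_1$ and $\theta_2$ of the two quantile critic networks are optimized respectively.
[0239] After the two quantile critic networks are updated, let the iteration counter $n = n + 1$;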
[0240] 6) Judgment: if $n$ is divisible by $d$, proceed to step 7); otherwise, return to step 2).
[0241] 7) Update the policy network;
[0242] The gradients at different quantiles are weighted by the weights derived from the distortion function:
[0243]
[0244] Use the resulting gradient to update the policy network parameters $\psi$ by gradient ascent, with the learning rate set to the model optimizer learning rate;
[0245] Then, with a small update coefficient $\rho$, synchronize the parameters of the current policy network and the current quantile critic networks to the corresponding target networks by a moving average:
[0246] $$\psi' \leftarrow \rho\,\psi + (1 - \rho)\,\psi'$$
[0247] $$\theta'_i \leftarrow \rho\,\theta_i + (1 - \rho)\,\theta'_i, \quad i = 1, 2$$
[0248] 8) Generate a new sample using the updated policy network;
[0249] This includes reading the current environment state $s_t$ from the simulation environment and computing the reward $r_t$ at time $t$ from the current state $s_t$; the state $s_t$ is input into the current policy network, and the action $a_t$ is obtained from the policy network output after safe projection. The simulation environment takes the current state $s_t$ and the action $a_t$ and yields the state $s_{t+1}$ at the next moment, which produces a new sample $(s_t, a_t, r_t, s_{t+1})$;
[0250] Determine whether the number of samples in the current replay experience pool exceeds the preset maximum number:
[0251] If the maximum number is exceeded, delete the earliest sample that entered the replay experience pool and then add the new sample $(s_t, a_t, r_t, s_{t+1})$ to the replay experience pool; otherwise, add the new sample $(s_t, a_t, r_t, s_{t+1})$ to the replay experience pool directly;
[0252] Then let t = t + 1;
[0253] 9) Determine whether the current time $t$ exceeds the set maximum number of time steps:
[0254] If it is exceeded, model training is complete and the trained policy network is saved; otherwise, a new batch of samples is randomly drawn from the current replay experience pool, and the process returns to step 2).
[0255] In one specific embodiment of the present invention, it further includes:
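[0256] The distorted expectation is calculated using the Wang risk measure, whose distortion function is $g_\beta(u) = \Phi(\Phi^{-1}(u) + \beta)$, where $u$ is the scalar input of the function, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\beta$ is the risk parameter;
[0257] The corresponding quantile weight is $w_\beta(\tau) = \phi(\Phi^{-1}(\tau) + \beta) / \phi(\Phi^{-1}(\tau))$, where $\tau$ is the quantile input and $\phi$ is the probability density function of the standard normal distribution.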
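As a non-limiting illustration, the relationship between the Wang distortion function and its quantile weights can be worked out in closed form. The derivation below assumes the standard form of the Wang transform with risk parameter $\beta$, with $\Phi$ and $\phi$ the cumulative distribution function and probability density function of the standard normal distribution; the symbol names are chosen here for illustration.

```latex
\begin{aligned}
g_\beta(u) &= \Phi\big(\Phi^{-1}(u) + \beta\big), \\
w_\beta(u) &= \frac{d g_\beta(u)}{d u}
  = \phi\big(\Phi^{-1}(u) + \beta\big)\cdot\frac{d\,\Phi^{-1}(u)}{d u}
  = \frac{\phi\big(\Phi^{-1}(u) + \beta\big)}{\phi\big(\Phi^{-1}(u)\big)}, \\
w_\beta(u) &= \exp\!\Big(-\beta\,\Phi^{-1}(u) - \tfrac{1}{2}\beta^{2}\Big).
\end{aligned}
```

The last line shows that for $\beta > 0$ the weight decreases monotonically in $u$, so low quantiles (low reward, high cost) receive the largest weights, and that $\beta = 0$ gives a weight of 1 everywhere, which recovers the risk-neutral expectation.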
[0258] The present invention thus introduces distributional reinforcement learning, uses quantile critic networks to model the reward distribution, and explicitly manages tail risk by optimizing the distorted expectation objective. This yields risk-sensitive scheduling strategies that reduce grid operating costs and the risk of voltage limit violations.
[0259] To implement the above embodiments, a third aspect of the present invention provides an electronic device, comprising:
[0260] At least one processor; and a memory communicatively connected to said at least one processor;
[0261] The memory stores instructions that can be executed by the at least one processor, and the instructions are configured to execute the above-described power grid economic dispatch method based on risk-aware distributional reinforcement learning.
[0262] To implement the above embodiments, a fourth aspect of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above-described power grid economic dispatch method based on risk-aware distributional reinforcement learning.
[0263] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0264] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device. The aforementioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the power grid economic dispatch method based on risk-aware distributional reinforcement learning according to the above embodiments.
[0265] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0266] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0267] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0268] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the function involved, as will be understood by those skilled in the art to which embodiments of this application pertain.
[0269] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-including system, or other system that can fetch and execute instructions from, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable and programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM). Furthermore, computer-readable media can even be paper or other suitable media on which programs can be printed, because programs can be obtained electronically, for example, by optically scanning the paper or other media, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.
[0270] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0271] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
[0272] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0273] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
Claims
1. A power grid economic dispatch method based on risk-aware distributional reinforcement learning, characterized in that it includes: modeling the real-time scheduling problem of the power system as a Markov decision process; based on the Markov decision process, constructing a distributional reinforcement learning model and generating a replay experience pool to store training samples, wherein the distributional reinforcement learning model includes a policy network and two quantile critic networks with identical structures, the input of the policy network is the current state of the power system, the output of the policy network is the corresponding raw action, and the raw action is mapped to the feasible region through a safe projection to generate the final action; training the distributional reinforcement learning model using training samples from the replay experience pool, wherein the policy network is updated using a distorted expectation objective function to minimize the risk-aware cumulative cost, the distorted expectation objective function being calculated based on the quantiles output by the quantile critic networks and a preset risk measure function; and, during real-time scheduling, inputting the current state of the power system into the trained policy network and passing the output of the policy network through the safe projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
2. The method according to claim 1, characterized in that modeling the real-time scheduling problem of the power system as a Markov decision process includes: 1) Define the state space: let the state at time t be $s_t$, which includes: time features; the ultra-short-term active power forecasts of load and renewable energy $(\hat{P}^{L}_{t+1}, \hat{P}^{RES}_{t+1})$; the current outputs of conventional generators (CG), renewable energy sources (RES), and battery storage systems (BSS) $(P^{CG}_{t}, Q^{CG}_{t}, P^{RES}_{t}, Q^{RES}_{t}, P^{BSS}_{t})$; the node voltage magnitudes $V_t$; and the state of charge of the BSS $SOC_t$; where $\hat{P}^{L}_{t+1}$ is the forecast active power of the load at time t+1; $\hat{P}^{RES}_{t+1}$ is the forecast active power of renewable energy at time t+1; $P^{CG}_{t}$ and $Q^{CG}_{t}$ are the active and reactive power setpoints of the conventional generators at time t; $P^{RES}_{t}$ and $Q^{RES}_{t}$ are the active and reactive power setpoints of renewable energy at time t; $P^{BSS}_{t}$ is the active power setpoint of the battery energy storage system at time t; $V_t$ is the node voltage magnitude at time t; and $SOC_t$ is the state of charge of the battery energy storage system at time t. 2) Define the action space: let the action at time t be $a_t$, which includes: the active and reactive power setpoints of the CG at the next moment $(P^{CG}_{t+1}, Q^{CG}_{t+1})$, the active and reactive power setpoints of the RES at the next moment $(P^{RES}_{t+1}, Q^{RES}_{t+1})$, and the active power setpoint of the BSS at the next moment $P^{BSS}_{t+1}$; 3) Define the reward function: let the reward function at time t be $r_t$, which is the negative of the total operating cost; the total operating cost includes the power generation cost of the CG, the wind or solar curtailment penalty cost of the RES, and a soft penalty for voltage limit violation, expressed as follows: where the subscript $i$ denotes the index of the node in the power system at which the variable is located; the set $\mathcal{G}$ is the set of generator nodes; the set $\mathcal{R}$ is the set of renewable energy nodes; the set $\mathcal{N}$ is the set of all nodes; $w_c$ is the positive weighting coefficient for the generation cost and the wind or solar curtailment penalty cost of the power system; $w_v$ is the positive weighting coefficient for the soft penalty term for voltage limit violation; $C^{CG}_{i}$ is the quadratic cost function of the generator at node $i$, with $a_i$, $b_i$, and $c_i$ the coefficients of the quadratic, linear, and constant cost terms, respectively; $C^{RES}_{i}$ is the wind or solar curtailment penalty cost function of the RES at node $i$, where $\bar{P}^{RES}_{i,t}$ is the available renewable energy output of node $i$ at time $t$ and $c^{cur}$ is the penalty cost constant; $C^{V}_{i}$ is the soft penalty term for voltage limit violation, where $V^{\max}_{i}$ and $V^{\min}_{i}$ are the upper and lower voltage limits of node $i$.
3. The method according to claim 2, characterized in that it further includes: the policy network is composed of a fully connected neural network; the input of the policy network is the state $s_t$ at time $t$, and the output is the action $\tilde{a}_t$ at time $t$ before safe projection, expressed as $\tilde{a}_t = \pi_{\psi}(s_t)$, where $\psi$ denotes the policy network parameters; the safe action $a_t = \mathcal{P}(\tilde{a}_t)$ is obtained after safe projection; where the projection operator $\mathcal{P}$ clips the action $\tilde{a}_t$ to the physical constraints of the equipment, including: CG output constraints: $P^{CG,\min}_{i} \le P^{CG}_{i,t+1} \le P^{CG,\max}_{i}$, $Q^{CG,\min}_{i} \le Q^{CG}_{i,t+1} \le Q^{CG,\max}_{i}$; CG ramping constraints: $-R^{down}_{i} \le P^{CG}_{i,t+1} - P^{CG}_{i,t} \le R^{up}_{i}$; RES output constraints: ; BSS power and capacity constraints: ; where $P^{CG,\min}_{i}$ and $P^{CG,\max}_{i}$ are the lower and upper limits of the active power output of the CG at node $i$; $Q^{CG,\min}_{i}$ and $Q^{CG,\max}_{i}$ are the lower and upper limits of the reactive power output of the CG at node $i$; $R^{down}_{i}$ and $R^{up}_{i}$ are the maximum ramp-down rate and maximum ramp-up rate of the CG at node $i$; $P^{RES,\max}_{i}$ is the maximum design active power of the RES at node $i$; $\bar{P}^{RES}_{i,t}$ is the maximum active power that the RES at node $i$ can output at time $t$; $\varphi^{\max}_{i}$ is the maximum power angle of the RES at node $i$; $P^{ch,\max}_{i}$ and $P^{dis,\max}_{i}$ are the maximum charging power and maximum discharging power of the BSS at node $i$; $E^{\min}_{i}$ and $E^{\max}_{i}$ are the minimum and maximum capacity of the BSS at node $i$; if any variable in the action $\tilde{a}_t$ exceeds the upper limit of the corresponding constraint, the projection operator $\mathcal{P}$ sets that variable to the upper limit of the corresponding constraint; if any variable in the action falls below the lower limit of the corresponding constraint, the projection operator $\mathcal{P}$ sets that variable to the lower limit of the corresponding constraint.
4. The method according to claim 3, characterized in that it further includes: both of the aforementioned quantile critic networks are fully connected neural networks. The input of each quantile critic network is the state at time t and the action at time t after safe projection. The output is the quantiles of the action-value distribution at time t, predicted at a vector of equally spaced quantile levels. Each of the two quantile critic networks has its own parameters.
5. The method according to claim 4, characterized in that the replay experience pool is constructed as follows:
1) Set the initial time and initialize the environment to obtain the initial state.
2) Determine whether the number of samples in the replay experience pool is less than the number required for training the model. If yes, proceed to step 3); otherwise, proceed to step 4).
3) Calculate the reward at time t based on the current state. Obtain the action at time t by randomly sampling the action space and then applying the safe projection. In the simulation environment, apply this action in the current state to obtain the state at the next time step. The current state, the action, the reward, and the next state together form the sample for time t, which is added to the initially empty replay experience pool. Let t = t + 1, then return to step 2).
4) When the number of samples in the replay experience pool reaches the number required for training the model, the construction of the replay experience pool is complete.
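The four steps above amount to a warm-up loop that fills the pool with randomly explored transitions. A minimal Python sketch follows, assuming a toy one-dimensional environment in place of the grid simulation. `ToyEnv`, `build_replay_pool`, the sampling range, and the tuple ordering are all hypothetical choices for illustration.

```python
import random


def project(action, lower, upper):
    """Clip each action variable to its [lower, upper] range."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


class ToyEnv:
    """Hypothetical stand-in for the grid simulation; the state is one number."""

    def reset(self):
        self.state = [0.0]
        return self.state

    def reward(self, state):
        return -abs(state[0])

    def step(self, action):
        self.state = [self.state[0] + action[0]]
        return self.state


def build_replay_pool(env, n_required, lower, upper, seed=0):
    rng = random.Random(seed)
    pool = []                       # initially empty replay experience pool
    state = env.reset()             # step 1: initial time and initial state
    t = 0
    while len(pool) < n_required:   # step 2: keep collecting until enough samples
        reward = env.reward(state)  # step 3: reward from the current state
        # Sample slightly beyond the bounds so the projection has an effect.
        raw = [rng.uniform(lo - 0.5, hi + 0.5) for lo, hi in zip(lower, upper)]
        action = project(raw, lower, upper)
        next_state = env.step(action)
        pool.append((state, action, reward, next_state))
        state = next_state
        t += 1
    return pool                     # step 4: the pool is complete


pool = build_replay_pool(ToyEnv(), n_required=5, lower=[-1.0], upper=[1.0])
```

Each stored tuple holds everything a later training step needs, so the simulation does not have to be replayed when a batch is drawn.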
6. The method according to claim 5, characterized in that training the distributed reinforcement learning model using training samples from the replay experience pool includes: when training the model, a batch of samples is randomly drawn from the current replay experience pool, the batch being the set of sampled data at the sampled time points; for each batch obtained by sampling, the following training steps are performed:
1) Initialize the training iteration counter. Set the policy network update frequency, the discount factor, the Gaussian noise variance vector, the parameter update coefficient for the policy network and the quantile critic networks, and the model optimizer learning rate. Copy the initial policy network as the initial target policy network, which has its own target policy network parameters. Copy the two initial quantile critic networks as the two initial target quantile critic networks, each of which has its own target quantile critic network parameters.
2) Input any sample from the currently sampled batch into the current policy network to calculate the original target action for the next time step, to which a Gaussian noise vector with zero mean and the set variance is added. The safe target action is then obtained by safe projection.
3) Input the state at the next time step and the safe target action into the two target quantile critic networks to obtain two quantile distribution estimates. Calculate the distorted expectation for each of the two target quantile critic networks, using a vector of equally spaced quantile levels on the interval [0,1], a risk parameter, and the corresponding quantile weights. Select the distribution with the smaller expected value as the target distribution; the target quantiles are those output by the target quantile critic network, identified by its parameters, whose distribution has the smaller expected value.
4) Construct the temporal difference target distribution. Input the current state and action into the two current quantile critic networks to obtain two quantile distribution estimates.
5) Train the current quantile critic networks using the quantile Huber loss function to minimize the difference between the predicted quantiles and the target quantiles. The loss function is defined over the elements of the quantile vector, using the temporal difference error of each critic network at the k-th quantile and the quantile Huber loss function. The quantile Huber loss function contains an indicator that takes the value 1 when its condition holds and 0 otherwise, and a Huber threshold. The loss function is minimized using the backpropagation algorithm, with the learning rate set to the model optimizer learning rate, and the parameters of the two quantile critic networks are optimized separately. After both quantile critic networks have been updated, increase the iteration counter by 1.
6) Determine whether the iteration counter is divisible by the policy network update frequency. If yes, proceed to step 7); otherwise, return to step 2).
7) Update the policy network. The gradients at different quantiles are weighted according to the weights derived from the distortion function, and the policy network parameters are updated using the gradient ascent method, with the learning rate set to the model optimizer learning rate. Then, using the small update coefficient, synchronize the parameters of the current policy network and the current quantile critic networks to the corresponding target networks by a moving average.
8) Generate a new sample using the updated policy network. Read the current environment state from the simulation environment and calculate the reward at time t based on the current state. Input the current state into the current policy network and obtain the action after safe projection of the policy network output. In the simulation environment, apply this action in the current state to obtain the state at the next time step, which generates a new sample. Determine whether the number of samples in the current replay experience pool exceeds the preset maximum number: if it does, delete the earliest sample that entered the replay experience pool and then add the new sample; if it does not, add the new sample directly. Then let t = t + 1.
9) Determine whether the current time has exceeded the maximum set number of time steps. If it has, the model training is complete and the trained policy network is saved; otherwise, randomly draw a new batch of samples from the current replay experience pool and return to step 2).
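Steps 4), 5), and 7) of the training procedure can be illustrated with a small Python sketch of the numerical pieces. Several points are assumptions rather than the claimed formulas: the temporal difference error is formed per quantile index as target minus prediction, the indicator is 1 when that error is negative, the loss is averaged over quantiles without dividing by the Huber threshold, and the moving average moves the target parameters toward the current ones. All function names are hypothetical.

```python
def td_target(reward, gamma, next_quantiles):
    """Temporal difference target distribution: shift and scale the target quantiles."""
    return [reward + gamma * z for z in next_quantiles]


def huber(u, kappa):
    """Huber function: quadratic for |u| <= kappa, linear beyond it."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber(u, tau, kappa):
    """Asymmetric Huber loss for quantile level tau and temporal difference error u."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber(u, kappa)


def critic_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """Mean quantile Huber loss, pairing prediction and target at the same index."""
    total = 0.0
    for z, y, tau in zip(pred_quantiles, target_quantiles, taus):
        total += quantile_huber(y - z, tau, kappa)
    return total / len(taus)


def soft_update(target_params, params, rho):
    """Moving-average synchronization of target parameters with a small coefficient rho."""
    return [(1.0 - rho) * tp + rho * p for tp, p in zip(target_params, params)]
```

The asymmetric weight is what makes each output converge to a different quantile: for a high quantile level, underestimates are penalized more heavily than overestimates, and the reverse holds for a low level.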
7. The method according to claim 6, characterized in that it further includes: the distorted expectation is calculated using the Wang risk measure. Its distortion function takes a scalar input and is defined using the cumulative distribution function of the normal distribution. The corresponding quantile weights are a function of the quantile input and are defined using the probability density function of the normal distribution.
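For orientation, the Wang transform is commonly written as g(tau) = Phi(Phi^-1(tau) + eta), where Phi is the standard normal cumulative distribution function and eta is the risk parameter; its derivative, the ratio of the standard normal density at Phi^-1(tau) + eta to the density at Phi^-1(tau), gives a weight for each quantile. The Python sketch below assumes this common form, midpoint quantile levels, and weights normalized to sum to one; none of these details is confirmed by the claim text, and the sign convention for eta may differ from the one used in the method.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def wang_weights(n, eta):
    """Normalized Wang-transform weights at the n midpoint quantile levels."""
    taus = [(2 * k + 1) / (2 * n) for k in range(n)]
    raw = []
    for tau in taus:
        z = _STD_NORMAL.inv_cdf(tau)
        # Derivative of g(tau) = Phi(Phi^-1(tau) + eta) with respect to tau.
        raw.append(_STD_NORMAL.pdf(z + eta) / _STD_NORMAL.pdf(z))
    total = sum(raw)
    return [w / total for w in raw]


def distorted_expectation(quantiles, eta):
    """Weighted average of quantile values (assumed sorted in ascending order)."""
    weights = wang_weights(len(quantiles), eta)
    return sum(w * q for w, q in zip(weights, quantiles))


def select_target(quantiles_1, quantiles_2, eta):
    """Pick the quantile estimate with the smaller distorted expectation."""
    e1 = distorted_expectation(quantiles_1, eta)
    e2 = distorted_expectation(quantiles_2, eta)
    return quantiles_1 if e1 <= e2 else quantiles_2
```

In this sketch a positive eta puts more weight on the low quantiles, so the distorted expectation falls below the plain mean, while eta equal to zero recovers the ordinary expectation.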
8. A power grid economic dispatch device based on risk-aware distributed reinforcement learning, characterized in that it includes:
A Markov decision process construction module, used to model the real-time dispatch problem of the power system as a Markov decision process;
A distributed reinforcement learning model construction module, used to construct a distributed reinforcement learning model based on the Markov decision process and to generate a replay experience pool for storing training samples; the distributed reinforcement learning model includes a policy network and two quantile critic networks with identical structures; the input of the policy network is the current state of the power system, and the output of the policy network is the corresponding original action; the original action is mapped to the feasible region through a safe projection to generate the final action;
A model training module, used to train the distributed reinforcement learning model using training samples in the replay experience pool; wherein the policy network is updated using a distorted expectation objective function to minimize the risk-aware cumulative cost, and the distorted expectation objective function is calculated based on the quantiles output by the quantile critic networks and a preset risk measure function;
An economic dispatch module, used to input the current state of the power system into the trained policy network during real-time dispatch; the output of the policy network is then processed by the safe projection to obtain the current action, thereby realizing real-time economic dispatch of the power system.
9. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to perform the method according to any one of claims 1-7.