A method and system for testing reinforcement learning algorithms for vehicle-grid interaction
By constructing a two-stage power flow calculation framework and a reinforcement learning agent, the invention addresses the credibility problem of algorithm testing in the field of vehicle-grid interaction, enables rapid testing and scalability across different scenarios, and improves the engineering credibility and iteration efficiency of algorithm testing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies lack testing methods for reinforcement learning algorithms that balance physical accuracy and scalability, making fair comparisons and reproducible experiments difficult in the field of vehicle-grid interaction. Furthermore, the lack of optimization-based solutions such as optimal power flow (OPF) as performance benchmarks undermines the credibility of conclusions about algorithm performance.
A two-stage power flow calculation framework is constructed, which combines offline benchmark evaluation and online interactive evaluation. The observation state space, action space and reward function are defined by a reinforcement learning agent, the physical control quantity is solved by a linearized DistFlow model, and the evaluation is carried out under a unified index system.
The method improves the credibility and scalability of reinforcement learning algorithm testing: different algorithms can be compared under the same physical model and the same cost accounting basis, the cost of repetitive modeling is reduced, and the engineering credibility and iteration efficiency of algorithm testing are enhanced.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of power systems and reinforcement learning, and relates to a method and system for testing reinforcement learning algorithms for vehicle-grid interaction.
Background Technology
[0002] With the increasing penetration of distributed photovoltaic generation, wind power, energy storage, and electric vehicles on the distribution side, the distribution network is gradually evolving into a new type of power system that deeply integrates generation, grid, load, and storage. Against this backdrop, reinforcement learning is widely used for coordinated control of vehicle-grid interaction because of its adaptive optimization capability in complex and uncertain environments. However, the proliferation of algorithm types has also brought prominent problems. Existing technologies lack a unified evaluation method that balances physical accuracy and scalability. Published works differ significantly in scenario configuration, discrete time axis, state and action definitions, reward definition, random seeds, and number of training steps, making fair comparisons and reproducible experiments difficult. Furthermore, most evaluations rely on simplified power flow models, do not support physical modeling of bidirectional V2G power exchange, and lack systematic comparison against optimization-based solutions such as optimal power flow (OPF) as performance benchmarks, so conclusions about algorithm performance are not sufficiently credible.
Summary of the Invention
[0003] The technical problem to be solved by this invention is how to improve the credibility of performance testing of reinforcement learning algorithms in the field of vehicle-grid interaction.
[0004] To solve the above-mentioned technical problems, the present invention is implemented using the following technical solution.
[0005] A method for testing reinforcement learning algorithms for vehicle-grid interaction includes:
[0006] S1: Obtain distribution network topology parameters, output forecasts of distributed photovoltaic and wind power, and basic operation data of electric vehicles from external data; construct a vehicle-grid interaction simulation environment and generate a discrete time axis that divides the simulation cycle into multiple time steps;
[0007] S2: Establish a two-stage power flow calculation framework for calculating the total operating cost of the distribution network system;
[0008] S3: In the offline benchmark evaluation stage, construct a multi-time-step scheduling optimization model that jointly optimizes multiple decision variables within the simulation cycle, obtain the full-cycle scheduling solution, and generate offline benchmark evaluation results under the two-stage power flow calculation framework;
[0009] S4: In the online interactive evaluation stage, model the vehicle-to-grid (V2G) interactive scheduling problem as a Markov decision process. The observation state space (comprising grid state and vehicle information), the normalized action space, the state transition equation, and the reward function are defined for the reinforcement learning agent. At any time step, the normalized action output by the agent based on the current observation state is decoded into physical control quantities, which are substituted into the corresponding constraints as known parameters. Under these constraints, the linearized DistFlow model for that time step is constructed and solved to obtain the remaining free variables;
[0010] S5: Input the physical control quantities and the remaining free variables into the two-stage power flow calculation framework to obtain the total operating cost of the distribution network system, and pass this cost to the reinforcement learning agent, which calculates the reward for the corresponding time step and generates the online evaluation result. The next state is updated according to the state transition equation of the vehicle-grid interaction simulation environment. If the current time step reaches the maximum set period, the process terminates and the final online evaluation result is output. The offline benchmark evaluation result and the final online evaluation result are evaluated using a unified index system to generate the final evaluation result.
[0011] In the aforementioned reinforcement learning algorithm testing method for vehicle-to-grid interaction, the vehicle-to-grid interaction simulation environment includes at least: a distribution network topology model, a distributed generation model of distributed photovoltaic and wind power, an energy storage system model, a conventional generator model, a V2G-enabled charging station and electric vehicle session model, and a flexible interconnection device model.
[0012] In the aforementioned reinforcement learning algorithm testing method for vehicle-to-grid interaction, in step S2, the two-stage power flow calculation framework is as follows: In the first stage, with the goal of minimizing the total operating cost of the distribution network system, the linearized DistFlow model of the distribution network is used to approximate the power flow solution, obtaining the output of controllable equipment and the generation cost, approximate power purchase cost, and related loss cost of the distribution network system; In the second stage, the open-source distribution system simulator OpenDSS is used to perform accurate power flow calculation on the solution results of the controllable equipment output in the first stage, obtaining corrected results, including node voltage, line power flow and network loss, accurate power purchase cost, and related loss cost. The approximate power purchase cost and related loss cost in the first stage are replaced with the accurate power purchase cost and related loss cost, thereby obtaining the accurate total operating cost of the distribution network system.
[0013] The aforementioned reinforcement learning algorithm testing method for vehicle-to-grid interaction, wherein the distributed photovoltaic and wind power distributed generation model of the distribution network is expressed as follows:
[0014]
[0015]
[0016]
[0017] where the first quantity is the active power that each photovoltaic / wind power unit actually delivers to the grid at a given time, the second is the active curtailment (peak-shaving) power set by the control system at that time, and the reactive power output is calculated from a fixed power factor.
[0018] The aforementioned method for testing reinforcement learning algorithms for vehicle-to-grid interaction, wherein the distribution network energy storage system model is represented as follows:
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025] where the two efficiency parameters are the charging efficiency and the discharging efficiency, respectively; the time step is the length of one interval on the discrete time axis; the energy state is that of the energy storage system model at a given time; the charging power and discharging power are those of the energy storage system model; and the two power limits are the maximum charging power and maximum discharging power of the energy storage system model, respectively.
[0026] In the aforementioned reinforcement learning algorithm testing method for vehicle-to-grid (V2G) interaction, the V2G-enabled charging station and electric vehicle session model is expressed as follows:
[0027]
[0028]
[0029]
[0030]
[0031]
[0032] where the charging and discharging powers are those of a given electric vehicle at a given time; the two efficiency parameters are the charging efficiency and the discharging efficiency, respectively; the time step is the length of one interval on the discrete time axis; the battery energy is that of the electric vehicle at a given time; the two power limits are the maximum charging power and maximum discharging power of the electric vehicle, respectively; and the target charge is the energy level the electric vehicle should reach. A binary variable is defined to prevent charging and discharging from occurring simultaneously and to enforce the physical maximum power of the electric vehicle. A penalty term with its own penalty coefficient is added to the objective function; it is evaluated at the departure time of the electric vehicle and is based on the charge shortfall at departure, that is, the gap between the target charge and the battery energy of the electric vehicle when it leaves the station.
[0033] The aforementioned reinforcement learning algorithm testing method for vehicle-to-grid interaction includes two types of scenario data generation modes during the electric vehicle charging phase: random generation mode and external input mode.
[0034] In the random generation mode, vehicle arrival times follow a Gaussian mixture model with three peaks (morning, noon, and evening): a peak is first selected according to the given weights, and the arrival time is then sampled from the corresponding normal distribution. The dwell time is sampled from a discrete uniform distribution, and the initial battery level is sampled from a continuous uniform distribution within a specified range. By adjusting the peak weights, the dwell time interval, and the initial battery level distribution, different types of typical charging demand scenarios are constructed.
[0035] In external input mode, session data is obtained by importing a CSV file that meets the predefined field format.
[0036] In the aforementioned method for testing reinforcement learning algorithms for vehicle-grid interaction, the flexible interconnection device model is expressed as follows:
[0037]
[0038]
[0039]
[0040]
[0041]
[0042] where the SOP operating loss is determined by a loss coefficient; the state of each NOP is given by a binary variable whose two values indicate that the switch is closed or open, respectively; the active power and reactive power flowing through the SOP, and the active power and reactive power flowing through the NOP, are separate variables; M is a sufficiently large constant that provides the upper limit of apparent power; the power generation cost of a conventional generator at a given time is a function of its active power output at that time; and cost coefficients one, two, and three are loaded from the data file.
[0043] In the aforementioned reinforcement learning algorithm testing method for vehicle-to-grid interaction, the objective function in the first stage of the two-stage power flow calculation framework is to minimize the sum of the total operating cost and penalty terms over the entire simulation cycle:
[0044]
[0045]
[0046]
[0047]
[0048]
[0049]
[0050]
[0051] where the objective comprises the electricity purchase cost, the operating cost of all generators, the loss cost of all SOPs, the depreciation cost of all ESSs, and the sum of all penalty terms, which include the slack current penalty, the penalty for electric vehicles that are not fully charged, and the slack penalty for SOP / NOP. The node set and the line set are defined separately, and each line is identified by its starting node and its terminal node. For each line at each time, the model uses the active and reactive power flows, with positive values representing flow from the starting node to the terminal node, and the active and reactive currents. For each node at each time, the active power terms are the generator output, the grid-connected photovoltaic / wind power, the energy storage output, the charging pile output, the power flowing into the SOP and the power injected into the NOP, the base load, and the slack variable, together with the power flowing into the upstream power grid at the slack node; the reactive power terms are the corresponding reactive quantities. The squared voltage of each node at time t is bounded by the minimum and maximum allowable voltages of that node; each line has a resistance and a reactance; and the horizon is the total simulation time.
[0052] In the aforementioned method for testing reinforcement learning algorithms for vehicle-grid interaction, the second stage includes the following:
[0053] At each time step, the equipment output at the corresponding moment is extracted from the results of the first stage and, together with the base load and photovoltaic / wind power output, used to construct the node active / reactive power injection at the corresponding moment;
[0054] The node active / reactive power injection is passed as a fixed input to OpenDSS, which performs a nonlinear AC power flow calculation at the corresponding time point to obtain the exact node voltages, the exact line losses, and the exact power purchased at the slack (balancing) node;
[0055] The optimization results of the first stage are used as the control scheme, and the approximate electricity purchase cost from the linearized DistFlow model is replaced with the exact electricity purchase cost calculated in the second stage, giving the exact total cost of the distribution network system:
[0056] .
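The cost replacement performed by the second stage can be sketched as follows. This is a minimal illustration, not the invention's implementation: the class `StageOneCosts`, the functions `purchase_cost` and `exact_total_cost`, and the grouping of the remaining cost items into a single `other` field are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class StageOneCosts:
    """Cost components returned by the linearized DistFlow stage (hypothetical layout)."""
    generation: float        # generator operating cost
    purchase_approx: float   # approximate electricity purchase cost
    loss_approx: float       # approximate loss-related cost
    other: float = 0.0       # assumed catch-all, e.g. storage depreciation and penalties


def purchase_cost(prices, slack_power_kw, dt_hours):
    """Purchase cost from per-step prices and slack-node import power."""
    return sum(p * s * dt_hours for p, s in zip(prices, slack_power_kw))


def exact_total_cost(stage1, purchase_exact, loss_exact):
    """Keep the stage-one control costs, swap in the exact purchase and loss costs."""
    return stage1.generation + purchase_exact + loss_exact + stage1.other
```

In this sketch the stage-one approximations are kept in the record only so that they can be compared with the exact values; they do not enter the final total.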
[0057] In the aforementioned method for testing reinforcement learning algorithms for vehicle-grid interaction, constructing the Markov decision process for the online computing mode involves the following key elements:
[0058] 1) State Space: The state space is designed as a high-dimensional continuous vector, including:
[0059]
[0060] where the state vector includes the real-time electricity price, the status of whether each electric vehicle is at a charging station, the normalized remaining on-site time of each vehicle, and the normalized energy each vehicle still needs to reach its specified charge level;
[0061] 2) Action Space: Each dimension is normalized to a common bounded interval, and the action is expressed as:
[0062]
[0063] where the action vector collects the net power of the electric vehicles, the net power of the energy storage system model, the active curtailment (peak-shaving) power of the photovoltaic and wind power models, and the power values flowing through the SOP and the NOP;
[0064] 3) Reward Function: The agent's goal is to maximize the reward, which is computed from a weighted sum of the operating cost and penalty terms:
[0065]
[0066] where the penalty terms are the voltage over-limit penalty of each node at each time, the energy left uncharged by each electric vehicle at its departure time, and the penalty for an OpenDSS power flow calculation failure; penalty coefficients one, two, three, and four are the corresponding weights.
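The action decoding and reward calculation described in items 2) and 3) can be illustrated with a short Python sketch. Several points are assumptions made for illustration only: the normalized interval is taken to be [-1, 1], the reward is taken to be the negative of the weighted sum, the voltage band 0.95 to 1.05 p.u. is a placeholder, and all function and parameter names are hypothetical.

```python
def decode_action(a_norm, p_min, p_max):
    """Map a normalized action (assumed range [-1, 1]) to a physical power in [p_min, p_max]."""
    a = max(-1.0, min(1.0, a_norm))  # clip out-of-range agent outputs
    return p_min + 0.5 * (a + 1.0) * (p_max - p_min)


def voltage_violation(voltages_pu, v_min=0.95, v_max=1.05):
    """Sum of per-node excursions outside the allowed band (band values are placeholders)."""
    return sum(max(0.0, v_min - v) + max(0.0, v - v_max) for v in voltages_pu)


def step_reward(total_cost, volt_violation, unmet_energy, pf_failed,
                w_cost=1.0, w_volt=1.0, w_energy=1.0, w_fail=1.0,
                fail_penalty=1.0):
    """Negative weighted sum of operating cost and penalty terms (sign is an assumption)."""
    weighted = (w_cost * total_cost
                + w_volt * volt_violation
                + w_energy * unmet_energy
                + w_fail * (fail_penalty if pf_failed else 0.0))
    return -weighted
```

For example, `decode_action(0.0, -50.0, 50.0)` returns the midpoint of a symmetric charge / discharge range, which corresponds to zero net power.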
[0067] A computer system includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to perform the steps of the method described above.
[0068] The beneficial effects achieved by this invention are as follows: The testing method of this invention constructs a two-stage power flow model on a discrete time axis. It provides an offline multi-time-step optimization mode that yields the full-cycle benchmark scheduling solution, and an online interactive stepping mode that models vehicle-grid interactive scheduling as a Markov decision process. This meets the solution speed required for large-scale training and evaluation while improving the engineering credibility of the results through exact verification. Furthermore, the offline benchmark and the online strategy are compared under the same physical model and the same cost accounting basis, which enhances the credibility of the algorithm testing conclusions.
[0069] The testing method of this invention uniformly constructs the distribution network topology and time-series data in a simulation environment, and incorporates distributed photovoltaic / wind power, energy storage systems, V2G-enabled charging stations and electric vehicle interactions, flexible interconnection devices, and other objects into the same model system. Relevant parameters can be configured through data files, thereby enabling the rapid generation of test scenarios with different network structures, different combinations of controllable devices, and different operating conditions within the same environment. Therefore, it reduces the cost of repetitive modeling and code modifications across multiple computational examples and scenarios, and improves the testability and scalability of reinforcement learning algorithms in engineering-oriented vehicle-to-grid interaction problems.
[0070] The testing method of this invention evaluates the offline benchmark evaluation results and the online interactive evaluation output under a unified evaluation standard, comprehensively considering indicators such as economy and security, thereby facilitating comprehensive comparative analysis of different reinforcement learning algorithms and benchmark methods; at the same time, the evaluation results are output in a unified format, which facilitates repeated running and reproduction of experiments under the same scenario configuration and improves the efficiency of algorithm iteration and engineering verification.
Attached Figure Description
[0071] Figure 1 is a flowchart of the reinforcement learning algorithm testing method for vehicle-grid interaction according to Embodiment 1 of the present invention;
[0072] Figure 2 is a schematic diagram of the testing method in Embodiment 1 of the present invention;
[0073] Figure 3 is a schematic diagram of the IEEE 33-node scenario in Embodiment 1 of the present invention;
[0074] Figure 4 is a schematic diagram of the minimum voltage values at different times in various scenarios in Embodiment 1 of the present invention;
[0075] Figure 5 is a schematic diagram of the power flow changes of the first branch l1 and the last branch l32 in various scenarios in Embodiment 1 of the present invention;
[0076] Figure 6 is a schematic diagram of the total charging load changes in various scenarios in Embodiment 1 of the present invention;
[0077] Figure 7 is a schematic diagram illustrating the training performance of different reinforcement learning algorithms in Embodiment 1 of the present invention.
Detailed Implementation
[0078] The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features in the embodiments are detailed descriptions of the technical solution of the present invention, rather than limitations thereof. In the absence of conflict, the embodiments of the present invention and the technical features in the embodiments can be combined with each other.
[0079] The term "and / or" simply describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. Additionally, the character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0080] Embodiment 1
[0081] As shown in Figure 1 and Figure 2, this embodiment provides a method for testing reinforcement learning algorithms for vehicle-grid interaction, including:
[0082] Step 1: Obtain distribution network topology parameters, output forecasts of distributed photovoltaic and wind power, and basic operation data of electric vehicles from external data files; construct a vehicle-to-grid interaction simulation environment and generate a discrete time axis that divides the simulation period into multiple time steps; the vehicle-to-grid interaction simulation environment includes at least: a distribution network topology model, a distributed generation model of distributed photovoltaic and wind power, an energy storage system model, a conventional generator model, a V2G-enabled charging station and electric vehicle session model, and a flexible interconnection device model;
[0083] Step 2: Establish a two-stage power flow calculation framework. In the first stage, with the goal of minimizing the total operating cost of the distribution network system, the linearized DistFlow model of the distribution network is used to approximate the power flow, obtaining the output of controllable equipment, the generation cost, approximate power purchase cost, and related loss costs of the distribution network system. In the second stage, the OpenDSS open-source distribution system simulator is used to perform accurate power flow calculation on the solution results of the controllable equipment output in the first stage, obtaining corrected results, including node voltage, line power flow and network loss, accurate power purchase cost, and related loss costs. The approximate power purchase cost and related loss costs in the first stage are replaced with accurate power purchase cost and related loss costs, thereby obtaining the accurate total operating cost of the distribution network system.
[0084] Step 3: In the offline benchmark evaluation stage, a multi-time-step scheduling optimization model is constructed to jointly optimize multiple decision variables within the simulation cycle, obtain the full-cycle scheduling solution, and generate offline benchmark evaluation results under the two-stage power flow calculation framework; the decision variables include electric vehicle charging and discharging power, energy storage charging and discharging power, the curtailment of distributed generation, and the status and power of flexible interconnection devices.
[0085] Step 4: In the online interactive evaluation phase, the vehicle-to-grid (V2G) interactive scheduling problem is modeled as a Markov decision process. The observation state space (comprising grid state and vehicle information), the normalized action space, the state transition equation, and the reward function are defined for the reinforcement learning agent. At any given time step, the normalized action output by the agent based on the current observation state is decoded into physical control quantities, which are substituted into the corresponding constraints as known parameters; under these constraints, the linearized DistFlow model for that time step is constructed and solved to obtain the remaining free variables;
[0086] Step 5: Input the physical control quantities and the remaining free variables into the two-stage power flow calculation framework to calculate the total operating cost of the distribution network system. Then pass the total operating cost to the reinforcement learning agent, which calculates the reward for the corresponding time step and generates the online evaluation result. Update the next state according to the state transition equation of the vehicle-grid interaction simulation environment. If the current time step reaches the maximum set period, terminate and output the final online evaluation result. Use a unified index system to evaluate the offline benchmark evaluation result and the final online evaluation result to generate the final evaluation result.
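The online evaluation loop of Steps 4 and 5 can be outlined as a plain Python function. This is only a structural sketch under stated assumptions: every callable passed in (`policy`, `decode`, `solve_step`, `two_stage_cost`, `transition`, `reward_fn`) is a hypothetical placeholder for the corresponding component of the method, and none of them is an API of OpenDSS or of any reinforcement learning library.

```python
def run_episode(policy, decode, solve_step, two_stage_cost, transition,
                reward_fn, state, max_steps):
    """Roll out one online evaluation episode and collect per-step costs and rewards."""
    costs, rewards = [], []
    for t in range(max_steps):
        action = policy(state)                 # normalized action from the agent
        controls = decode(action)              # physical control quantities
        free_vars = solve_step(t, controls)    # per-step linearized DistFlow solve
        cost = two_stage_cost(t, controls, free_vars)  # exact total operating cost
        costs.append(cost)
        rewards.append(reward_fn(cost))
        state = transition(state, controls, t)  # environment state update
    return costs, rewards
```

Because the components are injected as arguments, the same loop can in principle be reused for different agents and scenarios, which matches the stated goal of comparing algorithms under one physical model.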
[0087] In the distributed photovoltaic and wind power distributed generation model of the power distribution network:
[0088] Typical output prediction values are obtained from external data files, and a deterministic prediction curve is generated using linear interpolation.
[0089] To reflect the volatility and uncertainty of photovoltaic and wind power output, a random perturbation is added to the deterministic prediction curve: the actual maximum available output of distributed photovoltaic and wind power at each time step is sampled from a Gaussian distribution whose mean is the deterministic prediction value. In addition, the active curtailment (peak-shaving) power is set as a control variable that reduces the actual maximum available output of photovoltaic and wind power, and the reduction is limited to between 0 and the actual maximum output. This yields the grid-connected output of distributed photovoltaic and wind power, including both active and reactive power outputs, expressed as:
[0090]
[0091]
[0092]
[0093] where the first quantity is the active power that each photovoltaic / wind power unit actually delivers to the grid at a given time, the second is the active curtailment (peak-shaving) power set by the control system at that time, and the reactive power output is calculated from a fixed power factor.
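The photovoltaic / wind power model above (interpolation, Gaussian perturbation, curtailment, fixed power factor) can be sketched in Python as follows. The function names, the relative standard deviation parameter `rel_sigma`, and the clipping of sampled output at zero are assumptions introduced for illustration.

```python
import math
import random


def interpolate_forecast(points, steps_per_interval):
    """Linearly interpolate forecast points onto a finer discrete time axis."""
    out = []
    for k in range(len(points) - 1):
        for s in range(steps_per_interval):
            frac = s / steps_per_interval
            out.append(points[k] + frac * (points[k + 1] - points[k]))
    out.append(points[-1])
    return out


def sample_available(forecast, rel_sigma, rng):
    """Gaussian sample centred on the forecast; clipping at zero is an assumption."""
    return max(0.0, rng.gauss(forecast, rel_sigma * forecast))


def grid_output(p_avail, p_curt, power_factor):
    """Grid-connected active and reactive power after curtailment."""
    p_curt = min(max(p_curt, 0.0), p_avail)   # curtailment limited to [0, available]
    p_grid = p_avail - p_curt
    q_grid = p_grid * math.tan(math.acos(power_factor))  # fixed power factor
    return p_grid, q_grid
```

A call such as `sample_available(f, 0.1, random.Random(0))` draws one perturbed value around forecast `f` with a reproducible seed.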
[0094] In the energy storage system model of the distribution network:
[0095] The net charge / discharge power is decomposed into two non-negative variables, the charging power and the discharging power, and a binary variable prevents charging and discharging from occurring simultaneously. The dynamic evolution of the internal energy state of the energy storage system is described by a state transition equation. The model is expressed as:
[0096]
[0097]
[0098]
[0099]
[0100]
[0101]
[0102] where the two efficiency parameters are the charging efficiency and the discharging efficiency, respectively; the time step is the length of one interval on the discrete time axis; the energy state is that of the energy storage system model at a given time; the charging power and discharging power are those of the energy storage system model; and the two power limits are the maximum charging power and maximum discharging power of the energy storage system model, respectively.
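For orientation, a common way to write an energy storage model with the ingredients listed above is sketched below. The notation is assumed for illustration and is not taken from the invention: $P^{\mathrm{ch}}_t$ and $P^{\mathrm{dis}}_t$ are the charging and discharging power, $u_t$ is the binary variable, $E_t$ is the energy state, $\eta^{\mathrm{ch}}$ and $\eta^{\mathrm{dis}}$ are the efficiencies, and $\Delta t$ is the time step. The sign convention of the net power and the energy bounds $E^{\min}$, $E^{\max}$ are also assumptions.

```latex
\begin{aligned}
P^{\mathrm{net}}_{t} &= P^{\mathrm{ch}}_{t} - P^{\mathrm{dis}}_{t} \\
0 \le P^{\mathrm{ch}}_{t} &\le u_{t}\,P^{\mathrm{ch,max}} \\
0 \le P^{\mathrm{dis}}_{t} &\le (1-u_{t})\,P^{\mathrm{dis,max}} \\
u_{t} &\in \{0,1\} \\
E_{t+1} &= E_{t} + \Bigl(\eta^{\mathrm{ch}} P^{\mathrm{ch}}_{t} - \frac{P^{\mathrm{dis}}_{t}}{\eta^{\mathrm{dis}}}\Bigr)\Delta t \\
E^{\min} \le E_{t} &\le E^{\max}
\end{aligned}
```

With $u_t = 1$ only charging is possible and with $u_t = 0$ only discharging, which is how the binary variable excludes simultaneous charging and discharging.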
[0103] In the V2G-enabled charging station and electric vehicle session model:
[0104] A three-tiered structure of vehicle, station, and charging session enables detailed modeling of charging and discharging behavior. For vehicles, each electric vehicle is configured with parameters such as battery capacity, permissible charging / discharging range, maximum charging / discharging power, and efficiency. Charging / discharging is only permitted at the corresponding charging station when the vehicle is present.
[0105] The electric vehicle's state of charge and charging / discharging power are represented as the active power injection or absorption at the access node on the distribution network side.
[0106]
[0107]
[0108]
[0109]
[0110]
[0111] where the charging and discharging powers are those of a given electric vehicle at a given time; the two efficiency parameters are the charging efficiency and the discharging efficiency, respectively; the time step is the length of one interval on the discrete time axis; the battery energy is that of the electric vehicle at a given time; the two power limits are the maximum charging power and maximum discharging power of the electric vehicle, respectively; and the target charge is the energy level the electric vehicle should reach. A binary variable is defined to prevent charging and discharging from occurring simultaneously and to enforce the physical maximum power of the electric vehicle. A penalty term with its own penalty coefficient is added to the objective function; it is evaluated at the departure time of the electric vehicle and is based on the charge shortfall at departure, that is, the gap between the target charge and the battery energy of the electric vehicle when it leaves the station.
[0112] During the electric vehicle charging phase, each charging and discharging process is abstracted as a session vector whose elements are, respectively, the vehicle identification number, the number of the charging station used, the arrival time, the departure time, and the initial SOC upon arrival.
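The electric vehicle energy update and the departure shortfall penalty can be illustrated with a small Python sketch. The function names and the clipping of the battery energy to the range from zero to capacity are assumptions made here; the source only states that efficiencies, a time step, power limits, and a shortfall penalty are part of the model.

```python
def ev_step(energy_kwh, p_ch_kw, p_dis_kw, dt_h, eta_ch, eta_dis, capacity_kwh):
    """Advance the battery energy of one vehicle by one time step."""
    if p_ch_kw > 0.0 and p_dis_kw > 0.0:
        raise ValueError("charging and discharging must not occur simultaneously")
    nxt = energy_kwh + (eta_ch * p_ch_kw - p_dis_kw / eta_dis) * dt_h
    return min(max(nxt, 0.0), capacity_kwh)  # clipping is an assumption


def departure_shortfall(energy_at_departure_kwh, target_kwh):
    """Energy still missing relative to the target when the vehicle leaves."""
    return max(0.0, target_kwh - energy_at_departure_kwh)


def shortfall_penalty(shortfall_kwh, coeff):
    """Penalty term added to the objective for an unmet charging target."""
    return coeff * shortfall_kwh
```

For instance, a vehicle that leaves with 40 kWh against a 50 kWh target has a shortfall of 10 kWh, which is then scaled by the penalty coefficient.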
[0113] During the electric vehicle charging phase, there are two types of scenario data generation modes: random generation mode and external input mode.
[0114] In the random generation mode, vehicle arrival times follow a Gaussian mixture model with three peaks (morning, noon, and evening): a peak is first selected according to the given weights, and the arrival time is then sampled from the corresponding normal distribution. The dwell time is sampled from a discrete uniform distribution, and the initial battery level is sampled from a continuous uniform distribution within a specified range. By adjusting the peak weights, the dwell time interval, and the initial battery level distribution, different types of typical charging demand scenarios are constructed.
[0115] In external input mode, session data is obtained by importing a CSV file that meets the predefined field format.
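The random generation mode can be sketched with the Python standard library. All numeric values below (peak means and standard deviations, dwell range, SOC range, 15-minute steps, 96-step horizon) are placeholder assumptions, and the `Session` field names are hypothetical; only the sampling structure follows the description.

```python
import random
from dataclasses import dataclass


@dataclass
class Session:
    vehicle_id: int
    station_id: int
    arrival_step: int
    departure_step: int
    initial_soc: float


# Assumed (mean hour, standard deviation in hours) for morning, noon, evening peaks.
PEAKS = [(8.0, 1.0), (12.5, 1.0), (18.5, 1.5)]


def sample_session(vehicle_id, n_stations, weights, rng, steps_per_hour=4,
                   dwell_range=(4, 32), soc_range=(0.2, 0.6), horizon=96):
    """Draw one charging session: pick a peak, then arrival, dwell, and initial SOC."""
    mean, std = rng.choices(PEAKS, weights=weights, k=1)[0]  # choose the peak first
    hour = min(max(rng.gauss(mean, std), 0.0), 23.99)        # arrival from that normal
    arrival = int(hour * steps_per_hour)
    dwell = rng.randint(*dwell_range)                        # discrete uniform dwell
    departure = min(arrival + dwell, horizon)
    soc = rng.uniform(*soc_range)                            # continuous uniform SOC
    return Session(vehicle_id, rng.randrange(n_stations), arrival, departure, soc)
```

Passing a seeded `random.Random` instance as `rng` makes a generated scenario repeatable, which supports rerunning experiments under the same scenario configuration.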
[0116] In the flexible interconnected device model:
[0117] The soft open point (SOP, an intelligent soft switch) is modeled as an ideal controllable power source connecting two different nodes, and its operating loss is modeled as a quadratic function of the transmitted active power.
[0118] The normally open point (NOP) is modeled as an ideal tie switch using the Big-M method.
[0119] In the conventional generator model, the generator's power generation cost per unit time is modeled as a standard quadratic function of active power output, expressed as:
[0120]
[0121]
[0122]
[0123]
[0124]
[0125] where the SOP operating loss is determined by a loss coefficient; the state of each NOP is given by a binary variable whose two values indicate that the switch is closed or open, respectively; the active power and reactive power flowing through the SOP, and the active power and reactive power flowing through the NOP, are separate variables; M is a sufficiently large constant that provides the upper limit of apparent power; the power generation cost of a conventional generator at a given time is a function of its active power output at that time; and cost coefficients one, two, and three are loaded from the data file.
[0126] In the first stage of the two-stage power flow calculation framework, the optimal scheduling plan for all controllable devices within the simulation time is solved. A linearized DistFlow model of the distribution network power flow is adopted, and the objective function is to minimize the sum of the total operating cost and penalty terms over the entire simulation cycle:
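As a reading aid, the three modeling statements above (quadratic SOP loss, Big-M tie switch, quadratic generator cost) are commonly written in the following form. All symbols are assumed for illustration and are not taken from the invention: $c^{\mathrm{loss}}_s$ is the SOP loss coefficient, $z_{n,t}$ is the NOP binary state, and $a_g$, $b_g$, $c_g$ are the three generator cost coefficients.

```latex
\begin{aligned}
P^{\mathrm{loss}}_{s,t} &= c^{\mathrm{loss}}_{s}\,\bigl(P^{\mathrm{SOP}}_{s,t}\bigr)^{2} \\
-M z_{n,t} \le P^{\mathrm{NOP}}_{n,t} &\le M z_{n,t}, \qquad
-M z_{n,t} \le Q^{\mathrm{NOP}}_{n,t} \le M z_{n,t}, \qquad z_{n,t} \in \{0,1\} \\
C^{\mathrm{gen}}_{g,t} &= a_{g}\bigl(P^{\mathrm{gen}}_{g,t}\bigr)^{2} + b_{g}\,P^{\mathrm{gen}}_{g,t} + c_{g}
\end{aligned}
```

When $z_{n,t} = 0$ the Big-M constraints force the power through the tie switch to zero, and when $z_{n,t} = 1$ they leave it free up to the bound $M$.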
[0127]
[0128]
[0129]
[0130]
[0131]
[0132]
[0133]
[0134] where the quantities defined are, in order: the electricity purchase cost; the operating cost of all generators; the loss cost of all SOPs and the depreciation cost of all ESSs; the sum of all penalty terms, including the slack current penalty, the penalty for electric vehicles that are not fully charged, and the slack penalty for the SOP / NOP; the sets of nodes and lines; the line that starts at one node and ends at another; the active and reactive power flows on that line at each time step, with positive values representing flow from the start node to the end node; the active and reactive currents on that line at each time step; at each node and time step, the active power output of the generators, the grid-connected photovoltaic / wind active power, the active power output of energy storage, the active power of the charging piles, the active power flowing into the SOP, the active power injected into the NOP, the base active load, and the active slack variable; the power flowing into the upstream power grid at the slack node; at each node and time step, the corresponding reactive quantities (generators, grid-connected photovoltaic / wind power, energy storage, charging piles, SOP, NOP, base reactive load, and reactive slack variable); the squared voltage of each node at time t; the minimum and maximum allowable node voltages; the resistance and reactance of each line; and the total simulation time.
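For readers unfamiliar with the linearized DistFlow model, the sketch below evaluates it on a small radial feeder for fixed net loads. It uses the widely known form of the approximation: losses are neglected, so each line carries the sum of the downstream net loads, and the squared voltage drops by 2(rP + xQ) along each line. The function name, the node numbering convention, and the 4-node data are assumptions made for illustration; the first stage described above embeds these relations as constraints of an optimization problem rather than evaluating them directly.

```python
def lindistflow(parent, r, x, p_load, q_load, v0_sq=1.0):
    """Evaluate the linearized DistFlow equations on a radial feeder.

    parent[j] -- upstream node of node j; node 0 is the slack node and
                 nodes are numbered so that parent[j] < j
    r[j], x[j] -- resistance / reactance of the line (parent[j], j)
    p_load, q_load -- net active / reactive load per node (generation is negative)
    Returns (P, Q, v): P[j], Q[j] are the flows on the line into node j
    (P[0], Q[0] are the power drawn at the slack node) and v[j] is the
    squared voltage magnitude of node j.
    """
    n = len(parent)
    P = list(p_load)
    Q = list(q_load)
    # Backward sweep: with losses neglected, the flow into node j equals
    # its own net load plus the flows into all of its children.
    for j in range(n - 1, 0, -1):
        P[parent[j]] += P[j]
        Q[parent[j]] += Q[j]
    # Forward sweep: squared-voltage drop along each line.
    v = [v0_sq] * n
    for j in range(1, n):
        v[j] = v[parent[j]] - 2.0 * (r[j] * P[j] + x[j] * Q[j])
    return P, Q, v

# Illustrative 4-node chain 0 -> 1 -> 2 -> 3 in per-unit values.
P, Q, v = lindistflow(
    parent=[-1, 0, 1, 2],
    r=[0.0, 0.01, 0.02, 0.01],
    x=[0.0, 0.02, 0.02, 0.01],
    p_load=[0.0, 0.1, 0.2, 0.1],
    q_load=[0.0, 0.05, 0.1, 0.05],
)
```

Because every relation is linear in the power and squared-voltage variables, the same equations can be placed in a mixed-integer linear or quadratic program, which is what makes the first stage tractable over many time steps.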
[0135] In the second stage, OpenDSS is used to perform precise power flow verification of the scheduling scheme from the first stage. No further optimization is performed; instead, a step-by-step sequential simulation is carried out.
[0136] First, at each time step, extract the equipment output at the corresponding moment from the results of the first stage, and together with the base load and photovoltaic / wind power output, construct the node active / reactive power injection at the corresponding moment;
[0137] Then, the active / reactive power injection at each node is passed as a fixed input to OpenDSS to perform a nonlinear AC power flow calculation at the corresponding time point, thereby obtaining the accurate voltages, the accurate line losses, and the accurate purchased power at the balancing node;
[0138] Finally, the optimization results obtained in the first stage are used as the control scheme, and the accurate electricity purchase cost calculated in the second stage replaces the approximate electricity purchase cost from the linearized DistFlow model, giving the accurate total cost of the distribution network system:
[0139] .
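The cost correction in the second stage can be sketched as below. The accurate purchase cost is computed from the balancing-node power returned by the AC power flow and then substituted for the approximate value. The dictionary keys, the inclusion of the penalty term in the total, and the numbers are assumptions of this example; the AC power flow itself is represented only by its result (`p_slack_accurate`) so that the sketch does not depend on any OpenDSS binding.

```python
def accurate_purchase_cost(prices, p_slack_accurate, dt_hours=1.0):
    """Purchase cost from the balancing-node power of the accurate power flow."""
    return sum(price * p * dt_hours for price, p in zip(prices, p_slack_accurate))

def corrected_total_cost(stage1_costs, purchase_cost):
    """Replace the approximate purchase cost and re-sum all cost components."""
    corrected = dict(stage1_costs)          # keep the stage-1 result untouched
    corrected["purchase"] = purchase_cost   # the only component that is replaced
    return sum(corrected.values()), corrected

# Illustrative stage-1 cost breakdown (keys are hypothetical).
stage1 = {
    "purchase": 100.0,         # approximate, from the linearized model
    "generation": 40.0,
    "sop_loss": 2.0,
    "ess_depreciation": 3.0,
    "penalty": 0.0,
}
purchase = accurate_purchase_cost(prices=[0.5, 0.5], p_slack_accurate=[110.0, 100.0])
total, breakdown = corrected_total_cost(stage1, purchase)
```

The gap between `stage1["purchase"]` and `purchase` reflects the network losses that the linearized model neglects and the AC power flow recovers.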
[0140] The two-stage power flow computing framework includes an offline computing mode and an online computing mode.
[0141] In the offline computing mode, the scheduling variables of all time steps within a day are taken as the overall decision vector, and a multi-time-step mixed-integer programming problem is constructed.
[0142] First, a multi-time-step linearized DistFlow model and cost function covering all time steps are built in Pyomo to jointly optimize decision variables such as the daily charging and discharging power of electric vehicles, the energy storage power, and the SOP / NOP status. At the same time, the generation cost, the approximate power purchase cost, and the various loss costs of the distribution network system are obtained. Then, the optimal scheduling plan obtained in the first stage is used to drive OpenDSS to complete the accurate power flow simulation of the second stage, calculate the corrected power purchase cost and network loss, and recalculate the total cost of the distribution network system based on the corrected power purchase cost.
[0143] In the online computing mode, the vehicle-to-grid (V2G) scheduling problem is first modeled as a Markov decision process with interactions at discrete time steps, and the corresponding state space, action space, and reward function are defined. On this basis, a reinforcement learning agent is introduced. During the policy learning phase, the agent optimizes its network parameters through repeated trial-and-error interaction with the V2G simulation environment, thereby learning an approximation of the optimal online scheduling policy, including:
[0144] At each time step, the control action is output adaptively based on the current observation state. After the reinforcement learning agent submits a normalized action at a given time step, the normalized action is decoded into a fixed set of physical decisions, including the charging power of all electric vehicles, the ESS charging and discharging power, the DER peak shaving rate, the SOP / NOP states, and so on. At this point, the free variables that remain undetermined in the power grid include the conventional generator output and the power purchased from the main network. Therefore, a single-step Pyomo optimization model is constructed at the corresponding time step, and a linearized DistFlow model is used to solve for the remaining free variables not determined by the reinforcement learning agent, with the goal of minimizing the immediate economic cost. Subsequently, OpenDSS is called to accurately verify the voltage and network loss and to obtain the corrected electricity purchase cost, after which the total cost of the distribution network system is recalculated.
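To make the idea of "solving the remaining free variables" concrete, the sketch below replaces the single-step Pyomo model with a deliberately simplified single-bus stand-in that has a closed-form solution. All network constraints are dropped, power can be bought (or sold) at the same price, and the generator cost is the quadratic form used earlier with a strictly positive quadratic coefficient. These simplifications, and every name in the code, are assumptions of this example and not the method of the specification.

```python
def solve_residual_dispatch(net_demand, price, c1, c2, c3, p_gen_max):
    """Single-bus stand-in for the single-step problem.

    net_demand is the power still to be supplied after the agent's decoded
    actions (EV, ESS, DER, SOP/NOP) have been fixed. The free variables are
    the generator output g and the purchased power (net_demand - g).

    minimise  c1*g^2 + c2*g + c3 + price*(net_demand - g),  0 <= g <= p_gen_max

    For c1 > 0 the objective is convex in g, so the optimum is the
    unconstrained stationary point (price - c2) / (2*c1) clipped to the bounds.
    """
    g = (price - c2) / (2.0 * c1)
    g = min(max(g, 0.0), p_gen_max)
    purchase = net_demand - g
    cost = c1 * g * g + c2 * g + c3 + price * purchase
    return g, purchase, cost

# High price: the generator runs until its marginal cost equals the price.
g_hi, buy_hi, cost_hi = solve_residual_dispatch(5.0, 1.0, 0.125, 0.5, 0.0, 10.0)
# Low price: buying from the main network is cheaper, so the generator stays off.
g_lo, buy_lo, cost_lo = solve_residual_dispatch(5.0, 0.2, 0.125, 0.5, 0.0, 10.0)
```

The design point this illustrates is the division of labor: the agent fixes the flexible resources, and a small deterministic optimization fills in the rest so that the power balance always holds before the accurate power flow check.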
[0145] In the online computing mode, the elements of the Markov decision process include:
[0146] 1) State Space: The state space is designed as a high-dimensional continuous vector, including:
[0147]
[0148] where the state components are the real-time electricity price, the indicator of whether electric vehicle m is at the charging station, the normalized remaining dwell time of vehicle m, and the normalized energy that vehicle m still needs to reach its specified charge level.
[0149] 2) Action Space: Each dimension is normalized to a fixed interval, and the action is represented as:
[0150]
[0151] where the action vector comprises the net power of the electric vehicles, the net power of the energy storage system model, the active peak shaving power of the photovoltaic and wind power models, and the power values flowing through the SOP and the NOP, respectively.
[0152] 3) Reward Function: The agent's goal is to maximize the reward function, which is built from the weighted sum of operating costs and penalty terms:
[0153]
[0154] where the quantities defined are the voltage over-limit penalty of each node at each time step; the unfilled energy of each electric vehicle at its departure time; the penalty for failure of the OpenDSS power flow calculation; and the corresponding penalty coefficients one, two, three, and four.
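The three elements of the Markov decision process listed above can be sketched together as follows. Several details are assumptions of this example because the corresponding formulas are not legible in the text: the normalized interval is taken as [-1, 1], the decoding is a linear map onto each device's physical bounds, the reward is the negative of the weighted sum, and the first coefficient is applied to the operating cost. All function names and session fields are hypothetical.

```python
def build_observation(price, t, sessions, horizon, capacity):
    """Observation: price, then per vehicle (present flag, remaining time, energy need)."""
    obs = [price]
    for s in sessions:
        present = s["arrival"] <= t < s["departure"]
        if present:
            remaining = (s["departure"] - t) / horizon            # normalized dwell time left
            need = max(s["target"] - s["energy"], 0.0) / capacity  # normalized energy need
        else:
            remaining, need = 0.0, 0.0
        obs += [1.0 if present else 0.0, remaining, need]
    return obs

def decode_action(action, bounds):
    """Map each normalized action in [-1, 1] (assumed) linearly onto (lo, hi)."""
    physical = []
    for a, (lo, hi) in zip(action, bounds):
        a = min(max(a, -1.0), 1.0)  # guard against out-of-range policy outputs
        physical.append(lo + (a + 1.0) * 0.5 * (hi - lo))
    return physical

def step_reward(operating_cost, voltages, v_min, v_max, unmet_energy, pf_failed, w):
    """Negative weighted sum of cost and penalties (sign convention assumed)."""
    over_limit = sum(max(v_min - v, 0.0) + max(v - v_max, 0.0) for v in voltages)
    unmet = sum(unmet_energy)           # energy missing at departure, per vehicle
    failed = 1.0 if pf_failed else 0.0  # power flow did not converge
    return -(w[0] * operating_cost + w[1] * over_limit + w[2] * unmet + w[3] * failed)

sessions = [
    {"arrival": 8, "departure": 17, "energy": 20.0, "target": 50.0},
    {"arrival": 18, "departure": 23, "energy": 10.0, "target": 40.0},
]
obs = build_observation(price=0.6, t=9, sessions=sessions, horizon=24, capacity=60.0)
ctrl = decode_action([-1.0, 0.0, 1.0, 2.0], [(-7.0, 7.0), (-50.0, 50.0), (0.0, 30.0), (0.0, 1.0)])
reward = step_reward(10.0, [0.94, 1.0, 1.06], 0.95, 1.05, [2.0], False, (1.0, 100.0, 5.0, 1000.0))
```

Keeping the action normalized lets the same policy network be reused when device ratings change, since only the `bounds` passed to the decoder need to be updated.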
[0155] In this embodiment, the evaluation phase outputs at least two of the following indicators: total cost, electricity purchase cost, power generation cost, SOP loss cost, energy storage discharge depreciation cost, voltage qualification rate, line power flow load rate, and electric vehicle charging satisfaction rate.
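Two of the listed indicators can be computed as in the sketch below. The specification does not define them, so the definitions used here are assumptions: the voltage qualification rate is the share of node-time voltage samples inside the allowed band, and the charging satisfaction rate is the share of vehicles that reach their target energy at departure. The band of 0.95 to 1.05 per unit is likewise only an illustrative default.

```python
def voltage_qualification_rate(voltages, v_min=0.95, v_max=1.05):
    """Share of (time step, node) voltage samples within [v_min, v_max]."""
    samples = [v for row in voltages for v in row]
    qualified = sum(1 for v in samples if v_min <= v <= v_max)
    return qualified / len(samples)

def charging_satisfaction_rate(final_energy, target_energy, tol=1e-6):
    """Share of vehicles whose energy at departure reaches the target."""
    met = sum(1 for e, tgt in zip(final_energy, target_energy) if e >= tgt - tol)
    return met / len(target_energy)

# Rows are time steps, columns are nodes (per-unit voltage magnitudes).
vqr = voltage_qualification_rate([[1.00, 0.94], [1.02, 1.06]])
csr = charging_satisfaction_rate([50.0, 40.0, 30.0], [50.0, 45.0, 30.0])
```

Computing the indicators from the accurate power flow results of both the offline and the online mode is what allows the two to be compared on the same basis.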
[0156] As shown in Figures 3 to 7, the simulation verification of the method in a distribution network environment is described with examples, to verify that the invention responds in a physically consistent way to different load and distributed generation scenarios and supports the training and comparative evaluation of reinforcement learning algorithms.
[0157] As shown in Figure 3, a 33-node distribution network was selected as a case study, with a simulation period of 0 h to 24 h and a discrete time step of 1 h. Electric vehicles used the randomly generated session mode, and photovoltaic, wind power, conventional generators, energy storage devices, and the SOP and NOP flexible interconnection devices were all enabled. To verify the environment's responsiveness to different operating conditions, three scenarios were constructed: Scenario 1 is a normal load scenario used as the control; Scenario 2 is a distributed generation enhancement scenario in which, relative to Scenario 1, the maximum output limit of photovoltaic and wind power during the 11 h to 14 h period is set to three times the original value; Scenario 3 is an electric vehicle discharge test scenario in which, relative to Scenario 1, the base load curve for the 19 h to 21 h period is doubled and the corresponding electricity price curve is also doubled.
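The three scenarios can be derived from one base configuration by scaling hourly curves inside a time window, as sketched below. The flat base curves, the dictionary keys, and the treatment of each window as half-open (for example hours 11, 12, and 13 for "11 h to 14 h") are assumptions of this illustration; the case study's actual profiles are not given in the text.

```python
def scale_window(curve, start_h, end_h, factor):
    """Copy an hourly curve, multiplying hours start_h <= h < end_h by factor."""
    return [v * factor if start_h <= h < end_h else v for h, v in enumerate(curve)]

def build_scenarios(base):
    """Derive Scenarios 1 to 3 from the base curves without mutating them."""
    s1 = {name: list(curve) for name, curve in base.items()}
    # Scenario 2: distributed generation limit tripled around midday.
    s2 = dict(s1, der_max=scale_window(base["der_max"], 11, 14, 3.0))
    # Scenario 3: base load and price doubled during the evening peak.
    s3 = dict(
        s1,
        load=scale_window(base["load"], 19, 21, 2.0),
        price=scale_window(base["price"], 19, 21, 2.0),
    )
    return {"scenario_1": s1, "scenario_2": s2, "scenario_3": s3}

# Placeholder 24-point curves; real profiles would be loaded from data files.
base = {"der_max": [1.0] * 24, "load": [1.0] * 24, "price": [1.0] * 24}
scenarios = build_scenarios(base)
```

Because every scenario shares the same network, devices, and vehicle sessions, any difference in the resulting voltages and power flows can be attributed to the scaled curves alone.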
[0158] As shown in Figure 4, in terms of voltage level, Scenario 2 significantly raises the feeder-end voltage during the midday period. Using the lowest node voltage of the entire network at each time step as the indicator, the average lowest voltage of Scenario 2 from 11 h to 14 h is about 4.1% higher than that of Scenario 1. This indicates that the increased output of distributed power sources alleviates the low-voltage problem during this period to some extent, a trend consistent with the physical laws of distribution network operation.
[0159] As shown in Figure 5, in terms of power flow distribution, Scenario 2 significantly alters the power flow direction of some feeders. Taking the L32 branch near the feeder end as an example, from 11 h to 14 h the active power flow of L32 in Scenario 1 is approximately -0.74 pu, while in the same period in Scenario 2 it becomes approximately +0.14 pu, indicating a reverse power flow. This suggests that the end area feeds power back to the upstream node after satisfying its local load. In Scenario 3, the power flow of the L1 branch from 19 h to 21 h is approximately 4.1 times that of Scenario 1.
[0160] As shown in Figure 6, the net daily charging energy of electric vehicles in Scenario 3 is about 8.8% lower than in Scenario 1, indicating that some energy is fed back to the grid through discharge during periods of high electricity prices. At the same time, because the base load and the electricity price both increase during the evening peak, the total daily purchased energy in Scenario 3 is still about 4.7% higher than in Scenario 1, and the share of the electricity purchase cost incurred during the 19 h to 21 h period is significantly higher, which correctly reflects the change in system operating characteristics.
[0161] As shown in Figure 7, four reinforcement learning algorithms (PPO, SAC, TD3, and DDPG) are trained and compared under a unified simulation environment and a unified scenario configuration. The reward convergence curves of the different algorithms can be output to characterize differences in training stability and convergence speed.
[0162] As shown in Table 1, after training is completed, this embodiment summarizes and outputs the safety-related and cost-related indicators of each reinforcement learning algorithm's results under a unified evaluation standard, which facilitates comprehensive comparison of different algorithms and reproducible experiments.
[0163] Table 1 Performance Comparison between Offline and Online Computing Modes
[0164]
[0165] In summary, the results of this embodiment show that the present invention can generate voltage and power flow responses that conform to the physical laws of the distribution network under typical scenarios such as different distributed power output, load surges and electricity price changes. At the same time, it supports multiple reinforcement learning algorithms to complete training, evaluation and comparison in the same environment, providing an effective simulation and analysis tool for testing vehicle-to-grid interaction reinforcement learning algorithms.
[0166] Embodiment 2
[0167] A computer system includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to perform the steps of the method as described in Embodiment 1.
[0168] Embodiment 3
[0169] A computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the steps of the method as described in Embodiment 1.
[0170] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0171] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0172] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0173] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0174] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of the present invention without departing from the spirit and scope of the claims. All of these forms are within the protection scope of the present invention.
Claims
1. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction, characterized in that it comprises: S1: obtaining, from external data, distribution network topology parameters, output forecasts of distributed photovoltaic and wind power, and basic operation data of electric vehicles; constructing a vehicle-grid interactive simulation environment and generating a discrete time axis that divides the simulation cycle into multiple time steps; S2: establishing a two-stage power flow calculation framework for calculating the total operating cost of the distribution network system; S3: in the offline benchmark evaluation stage, constructing a multi-time-step scheduling optimization model to jointly optimize multiple decision variables within the simulation cycle, obtaining the full-cycle scheduling solution, and generating offline benchmark evaluation results under the two-stage power flow calculation framework; S4: in the online interactive evaluation stage, modeling the vehicle-to-grid (V2G) interactive scheduling problem as a Markov decision process; defining, for a reinforcement learning agent, the observation state space (including the grid state and vehicle information), the normalized action space, the state transition equation, and the reward function; at any time step, decoding the normalized action output by the agent based on the current observation state into physical control quantities, substituting the physical control quantities into the corresponding constraints as known parameters, and, under these constraints, constructing and solving a linearized DistFlow model for that time step to obtain the remaining free variables; S5: inputting the physical control quantities and the remaining free variables into the two-stage power flow calculation framework to obtain the total operating cost of the distribution network system, which is input into the reinforcement learning agent; the reinforcement learning agent calculates the reward for the corresponding time step and generates online evaluation results; the next state is updated according to the state transition equation of the vehicle-grid interactive simulation environment; if the current time step reaches the maximum set period, the process terminates and outputs the final online evaluation result; the offline benchmark evaluation result and the final online evaluation result are evaluated using a unified indicator system to generate the final evaluation result.
2. The method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 1, characterized in that the vehicle-to-grid (V2G) interactive simulation environment includes at least: a power distribution network topology model, a distributed power source model for distributed photovoltaic and wind power, an energy storage system model, a conventional generator model, a V2G-enabled charging station and electric vehicle session model, and a flexible interconnection device model.
3. The method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 1, characterized in that, in step S2, in the first stage of the two-stage power flow calculation framework, with the goal of minimizing the total operating cost of the distribution network system, an approximate power flow solution is obtained using the linearized DistFlow model of the distribution network, yielding the output of the controllable equipment and the power generation cost, approximate power purchase cost, and related loss cost of the distribution network system; in the second stage, the OpenDSS open-source power distribution system simulator is used to perform accurate power flow calculations on the controllable equipment output from the first stage, and corrected results are obtained, including node voltage, line power flow and network loss, accurate power purchase cost, and related loss cost; the approximate power purchase cost and related loss cost from the first stage are replaced with the accurate power purchase cost and related loss cost, thereby obtaining the accurate total operating cost of the power distribution network system.
4. The method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 3, characterized in that the distributed generation model for the distributed photovoltaic and wind power of the distribution network is expressed as follows: where the quantities defined are the actual active power finally delivered to the grid by each photovoltaic / wind power unit at each time step, and the active peak shaving power determined by the control system at each time step; the reactive power output is calculated from a fixed power factor.
5. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 3, characterized in that the distribution network energy storage system model is represented as follows: where the quantities defined are the charging efficiency and the discharging efficiency; the time step; the energy state of the energy storage system model at each moment; the charging power and discharging power of the energy storage system model; and the maximum charging power and maximum discharging power of the energy storage system model.
6. The method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 3, characterized in that the V2G-enabled charging station and electric vehicle session model is represented as follows: where the quantities defined are the charging power and discharging power of each electric vehicle at each time step; the charging efficiency and the discharging efficiency; the time step; the battery level of each electric vehicle at each moment; the maximum charging power and maximum discharging power of the electric vehicle; and the target charging amount of the electric vehicle; a binary variable is defined to prevent charging and discharging from occurring simultaneously and to limit the physical maximum power of the electric vehicle; a penalty term with its penalty coefficient is added to the objective function, defined in terms of the departure time of the electric vehicle, the amount of charge not achieved when the vehicle leaves, and the battery level of the electric vehicle when it leaves the station.
7. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 6, characterized in that the electric vehicle session data has two scenario data generation modes: a random generation mode and an external input mode. In the random generation mode, the vehicle arrival time follows a Gaussian mixture model with three peaks (morning, noon, and evening). A peak is first selected according to the given weights, and the arrival time is then sampled from the corresponding normal distribution. The dwell time is sampled from a discrete uniform distribution, and the initial battery level is sampled from a continuous uniform distribution within a specified range. By adjusting the peak weights, the dwell time interval, and the initial battery level distribution, different types of typical charging demand scenarios are constructed. In the external input mode, session data is obtained by importing a CSV file that meets the predefined field format.
8. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 3, characterized in that the flexible interconnected device model is represented as follows: where the quantities defined are, in order: the operating loss of the SOP and its loss coefficient; the state of the NOP, given by a binary variable whose values indicate whether the switch is closed; the active power and reactive power flowing through the SOP; the active power and reactive power flowing through the NOP; the upper limit of apparent power; M, a constant; the generation cost of the conventional generator at each time step; the active power output of the conventional generator at each time step; and cost coefficients one, two, and three, which are loaded from the data file.
9. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 1, characterized in that, in the first stage of the two-stage power flow calculation framework, the objective is to minimize the sum of the total operating cost and the penalty terms over the entire simulation cycle: where the quantities defined are, in order: the electricity purchase cost; the operating cost of all generators; the loss cost of all SOPs and the depreciation cost of all ESSs; the sum of all penalty terms, including the slack current penalty, the penalty for electric vehicles that are not fully charged, and the slack penalty for the SOP / NOP; the sets of nodes and lines; the line that starts at one node and ends at another; the active and reactive power flows on that line at each time step, with positive values representing flow from the start node to the end node; the active and reactive currents on that line at each time step; at each node and time step, the active power output of the generators, the grid-connected photovoltaic / wind active power, the active power output of energy storage, the active power of the charging piles, the active power flowing into the SOP, the active power injected into the NOP, the base active load, and the active slack variable; the power flowing into the upstream power grid at the slack node; at each node and time step, the corresponding reactive quantities (generators, grid-connected photovoltaic / wind power, energy storage, charging piles, SOP, NOP, base reactive load, and reactive slack variable); the squared voltage of each node at time t; the minimum and maximum allowable node voltages; the resistance and reactance of each line; and the total simulation time.
10. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 9, characterized in that the second stage includes: at each time step, extracting the equipment output at the corresponding moment from the results of the first stage and, together with the base load and photovoltaic / wind power output, constructing the node active / reactive power injection at the corresponding moment; passing the active / reactive power injection at each node as a fixed input to OpenDSS to perform a nonlinear AC power flow calculation at the corresponding time point, thereby obtaining the accurate voltages, the accurate line losses, and the accurate purchased power at the balancing node; using the optimization results obtained in the first stage as the control scheme, and replacing the approximate electricity purchase cost from the linearized DistFlow model with the accurate electricity purchase cost calculated in the second stage, giving the accurate total cost of the distribution network system.
11. A method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction according to claim 1, characterized in that, in the online computing mode, the elements of the Markov decision process include: 1) State Space: the state space is designed as a high-dimensional continuous vector, where the state components are the real-time electricity price, the indicator of whether electric vehicle m is at the charging station, the normalized remaining dwell time of vehicle m, and the normalized energy that vehicle m still needs to reach its specified charge level; 2) Action Space: each dimension is normalized to a fixed interval, where the action vector comprises the net power of the electric vehicles, the net power of the energy storage system model, the active peak shaving power of the photovoltaic and wind power models, and the power values flowing through the SOP and the NOP, respectively; 3) Reward Function: the agent's goal is to maximize the reward function, which is built from the weighted sum of operating costs and penalty terms, where the quantities defined are the voltage over-limit penalty of each node at each time step; the unfilled energy of each electric vehicle at its departure time; the penalty for failure of the OpenDSS power flow calculation; and the corresponding penalty coefficients one, two, three, and four.
12. A computer system comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method for testing reinforcement learning algorithms for vehicle-to-everything (V2X) interaction as described in any one of claims 1-10.