Method, device, equipment, medium and product for acquiring policy simulator
By exchanging data and adjusting the reward function between the source and target domains, a policy simulator suitable for the target domain is generated, solving the security problem of cross-domain policy deployment and realizing the generation of effective and secure power dispatch policies in the target domain.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-19
AI Technical Summary
In cross-domain reinforcement learning transfer, deploying the source domain strategy directly to the target domain carries an unknown security risk, so the power dispatch strategy generated in the target domain may be insecure.
By acquiring data from the source and target domains, a first reward function is determined, and the source domain policy simulator is adjusted based on this function to generate a policy simulator suitable for the target domain. Historical data from the target domain is used to adjust the reward function to adapt to dynamic differences, thus avoiding real-time training.
The generated power dispatching strategy is effective and secure in the target domain without requiring extensive real-time training, and adapts to the dynamic differences between the source and target domains.

Figure CN122246677A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of power systems, and in particular to a method, apparatus, equipment, medium, and product for acquiring a strategy simulator.
Background Technology
[0002] A microgrid is a small-scale power system composed of distributed power sources, energy storage devices, energy conversion devices, loads, and protection devices. During microgrid operation, dispatching is required. Microgrid dispatching refers to the scheduling of various devices within the microgrid to achieve economical, reliable, and sustainable operation.
[0003] In related technologies, cross-domain reinforcement learning transfer algorithms are used to transfer policies from the source domain to the target domain. That is, the policies generated by the power policy simulator in the source domain are used in the training process of the power policy simulator in the target domain to transfer knowledge from the source domain to the power policy simulator in the target domain. The power policy simulator in the target domain after transfer learning is then used to generate power dispatching policies in the target domain.
[0004] In implementing the above scheme, the strategy from the source domain needs to be directly deployed in the target domain for real-time training. However, when the security of the strategy relative to the target domain is unknown, there is a security risk when it is deployed in a target domain that is different from the source domain.
Summary of the Invention
[0005] This application provides a method, apparatus, device, medium, and product for obtaining a strategy simulator. The technical solution is as follows:
[0006] In one aspect, a method for obtaining a strategy simulator is provided, the method comprising:
[0007] Obtain a source domain policy simulator, which is used to analyze the environmental state of the source domain and generate a power dispatching policy in the source domain;
[0008] Acquire first data from the source domain and second data from the target domain within a historical time period. The first data is used to indicate the environmental state and scheduling actions of the source domain, and the second data is used to indicate the environmental state and scheduling actions of the target domain.
[0009] A first reward function is determined based on the first data and the second data. The first reward function is used to provide reinforcement learning feedback for the predicted scheduling actions in the process of generating power dispatching strategies.
[0010] The source domain policy simulator is adjusted based on the first reward function to obtain the target domain policy simulator. The target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching strategy for the target domain.
[0011] In another aspect, a device for acquiring a strategy simulator is provided, the device comprising:
[0012] The acquisition module is used to acquire the source domain policy simulator, which is used to analyze the environmental state of the source domain and generate a power dispatching policy in the source domain.
[0013] The acquisition module is further configured to acquire first data in the source domain and second data of the target domain within a historical time period. The first data is used to indicate the environmental state and scheduling actions of the source domain, and the second data is used to indicate the environmental state and scheduling actions of the target domain.
[0014] The correction module is used to obtain a first reward function based on the first data and the second data. The first reward function is used to provide feedback for the predicted scheduling actions in the process of generating the power dispatch strategy.
[0015] The adjustment module is used to adjust the source domain policy simulator based on the first reward function to obtain the target domain policy simulator. The target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching strategy in the target domain.
[0016] In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, at least one program, code set or instruction set, the at least one instruction, the at least one program, the code set or instruction set being loaded and executed by the processor to implement the method for obtaining a strategy simulator as described in any of the embodiments of this application above.
[0017] In another aspect, a computer-readable storage medium is provided, wherein at least one instruction, at least one program, code set, or instruction set is stored in the storage medium, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for obtaining a strategy simulator as described in any of the embodiments of this application above.
[0018] In another aspect, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the strategy simulator acquisition method described in any of the above embodiments.
[0019] The technical solution provided in this application includes at least the following beneficial effects:
[0020] A first reward function is obtained by using first data provided by the source domain and second data generated by the target domain within a historical time period to provide reinforcement learning feedback during the generation of power dispatching strategies. This first reward function is then applied to the source domain policy simulator corresponding to the source domain to obtain a target domain policy simulator applicable to the target domain. The power dispatching strategy for the target domain is generated through the target domain policy simulator. In other words, the reward function corresponding to the source domain policy simulator is adjusted using historical data provided by the target domain. Under the influence of the historical data provided by the target domain, the adjusted reward function can adapt to the dynamic differences between the source and target domains, thus enabling the adjusted policy simulator to adapt to the target domain. This ensures the effectiveness and security of the generated power dispatching strategy applied to the target domain without requiring extensive real-time training in the target domain.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a schematic diagram of an implementation environment provided by an exemplary embodiment of this application;
[0023] Figure 2 is a flowchart of a method for obtaining a strategy simulator provided in an exemplary embodiment of this application;
[0024] Figure 3 is a flowchart of a method for obtaining a strategy simulator provided in an exemplary embodiment of this application;
[0025] Figure 4 is a schematic diagram illustrating the migration and use of a strategy simulator provided in an exemplary embodiment of this application;
[0026] Figure 5 is a flowchart of a method for obtaining a strategy simulator provided in an exemplary embodiment of this application;
[0027] Figure 6 is a schematic diagram illustrating the migration and use of a strategy simulator provided in an exemplary embodiment of this application;
[0028] Figure 7 is a structural block diagram of a strategy simulator acquisition device provided in an exemplary embodiment of this application;
[0029] Figure 8 is a structural block diagram of a strategy simulator acquisition device provided in an exemplary embodiment of this application;
[0030] Figure 9 is a schematic diagram of the structure of a server provided in an exemplary embodiment of this application.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0032] In this application, the terms "first" and "second" are used to distinguish between identical or similar items that have essentially the same function. It should be understood that there is no logical or temporal dependency between "first" and "second", nor is there any limitation on the quantity or execution order.
[0033] First, a brief introduction to the terms used in the embodiments of this application will be given.
[0034] Reinforcement Learning (RL) is a machine learning method that learns how to make decisions through interaction with its environment. The fundamental framework of reinforcement learning is the Markov Decision Process (MDP). In this learning process, an agent receives state information from the environment and chooses an action to perform. After performing the action, the agent receives feedback from the environment, typically in the form of a reward or punishment. The agent's goal is to maximize its cumulative reward, which is usually achieved by learning a policy that tells the agent which action to take in a given state. Reinforcement learning does not rely on large amounts of labeled data but learns through trial and error, making it extremely useful in scenarios where complex decision-making processes need to be optimized through exploration and exploitation, such as games, robot control, and resource management.
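The interaction loop described above can be sketched with a small tabular Q-learning example. This is only an illustration of the general agent, state, action, and reward cycle; the five-state chain environment, the step penalty, and all names (`env_step`, `train`, `q_table`) are assumptions made for this sketch and do not come from the application.

```python
import random

N_STATES = 5        # toy states 0..4; state 4 is the goal (assumption)
ACTIONS = (-1, 1)   # move left or move right


def env_step(state, action):
    """Toy environment: return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.1  # small penalty for every non-goal step
    return next_state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Exploration: random action; exploitation: best known action.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# The learned policy maps each non-terminal state to its highest-valued action.
greedy_policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
```

The agent is never told which action is correct; it only sees the reward returned by `env_step`, which is the trial-and-error property the paragraph refers to.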
[0035] The reward function is a core component of reinforcement learning. It defines the feedback value that an agent receives from the environment after performing a specific action in a given state. This value can be positive (reward) or negative (penalty). The goal of the reward function is to quantify the quality of the agent's behavior, guiding the agent to learn how to maximize cumulative rewards by choosing different actions, thereby achieving long-term goals. In different application scenarios, the design of the reward function will vary depending on the specific needs of the task, directly affecting the agent's learning efficiency and final performance.
[0036] Microgrids: Microgrids are small-scale power systems composed of distributed power sources, energy storage devices, energy conversion devices, loads, and protection devices. The concept of microgrids aims to enable the flexible and efficient application of distributed power sources and solve the grid connection problems of a large number and diverse range of distributed power sources. Developing and expanding microgrids can significantly promote the large-scale integration of distributed power sources and renewable energy, facilitating the transition from traditional power grids to smart grids.
[0037] Microgrid scheduling is a complex problem. In addition to ensuring energy balance within the microgrid, it also requires ensuring that the output power of distributed power sources and energy storage devices in the microgrid meets the load demand of the microgrid, guaranteeing the safety and stability of the microgrid, and achieving economically optimized operation of the microgrid.
[0038] Secondly, the implementation environment involved in the embodiments of this application will be described, for illustrative purposes only. Referring to Figure 1, the implementation environment involves a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a communication network 130, which can be a wired network or a wireless network.
[0039] Terminal 110 can be a mobile phone, tablet computer, desktop computer, laptop computer, smart TV, vehicle terminal, smart home device, etc., and this application embodiment does not limit it to any particular type. Optionally, terminal 110 can directly upload historical data (second data) of the power system to server 120, wherein the power system is the power system in the actual power environment corresponding to the target domain, and the historical data includes the environmental status and scheduling actions recorded by the power system during the historical time period.
[0040] Optionally, an application with a microgrid dispatch strategy generation function is installed in the terminal 110. Illustratively, the application can be implemented as a power system management application, a power system diagnostic application, a power system data analysis application, etc.; or, the application can be implemented as a small program that depends on a host application, and the host application can be implemented as any of the above-mentioned programs. This application embodiment does not limit this.
[0041] Server 120 is used to migrate the source domain policy simulator from the source domain to the target domain. Server 120 can obtain historical data (second data) of the power system within a historical time period from terminal 110, or server 120 itself stores historical data (second data) of the power system within a historical time period.
[0042] Server 120 is equipped with a source domain policy simulator. Server 120 collects first data in the source domain policy simulator, which is used to indicate the environmental state and scheduling actions of the source domain, and receives second data provided by terminal 110. Server 120 determines a first reward function based on the first data and the second data. Server 120 adjusts the source domain policy simulator based on the first reward function to obtain a target domain policy simulator, which is used to analyze the environmental state of the target domain and generate a power scheduling strategy in the target domain.
[0043] Optionally, after obtaining the target domain policy simulator, server 120 can provide background services for the application with microgrid scheduling policy generation function in terminal 110; or, after obtaining the target domain policy simulator, server 120 can send the target domain policy simulator to terminal 110, and terminal 110 can store the target domain policy simulator locally, so terminal 110 can independently implement the above-mentioned microgrid scheduling policy generation function. In some optional embodiments, the above-mentioned acquisition process of the target domain policy simulator can also be implemented independently by terminal 110, and this application embodiment does not limit this.
[0044] It is worth noting that server 120 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
[0045] In some embodiments, the server 120 described above can also be implemented as a node in a blockchain system.
[0046] Based on the above introduction to terminology and implementation environment, the method for obtaining the strategy simulator provided in this application will be explained, taking the method executed by the server as an example. As shown in Figure 2, the method includes the following steps 210 to 240.
[0047] Step 210: Obtain the source domain policy simulator.
[0048] The source domain policy simulator is used to analyze the environmental state of the source domain and generate power dispatching policies in the source domain.
[0049] The source domain refers to the environment in which an agent initially learns and collects data.
[0050] Optionally, in the power dispatching scenario, the source domain includes at least one of a simulated power system and a stable power system. The simulated power system is a virtual system obtained through algorithmic or model simulation, implemented using a source domain policy simulator. The stable power system is a real-world power system with a long service life; that is, the stable power system is equipped with a source domain policy simulator to generate power dispatching policies.
[0051] For example, this application illustrates the implementation of the above-mentioned simulated power system / stable power system as a microgrid.
[0052] In some embodiments, a state space is defined for the source domain, which is used to indicate the range of environmental state changes in the source domain.
[0053] Illustratively, in a microgrid, this state space includes power-related operating parameters of the microgrid, which may optionally include at least one of the following:
[0054] 1. Power supply and demand relationship: The state space includes the power supply and demand relationship between microgrid nodes, such as the power required or available by the i-th node, where i is a positive integer.
[0055] 2. Load, Photovoltaic Output, and State of Charge (SOC) of Energy Storage: The state space can be represented by the load, the photovoltaic output, the SOC of the energy storage devices, the purchase and sale prices of electricity, and the scheduling period of the microgrid operating environment. For example, the state for time period t consists of the load demand during time period t, the state of charge of the energy storage device during time period t, the photovoltaic output during time period t, the prices for purchasing and selling electricity, and the dispatching period t.
[0056] 3. State of charge of energy storage devices: The state space may also include the minimum and maximum values of the state of charge of energy storage devices within the scheduling cycle, as well as the state of charge of energy storage devices.
[0057] 4. Power Balance Constraint: The state space needs to satisfy the power balance constraint, that is, the power output of the microgrid (including distributed generation resources such as photovoltaic and wind power) plus the power exchanged with the external grid (electricity purchase and sales) must equal the electricity demand of the load, as shown in Formula 1:
[0058] Formula 1: $P_G + P_{Ba} + P_{grid} = P_L - P_W - P_{PV}$
[0059] Here, $P_G$ is the power of the generator set, $P_{Ba}$ is the power of the energy storage device, $P_{grid}$ is the power exchanged with the external power grid, $P_L$ is the load demand, $P_W$ is the wind power output, and $P_{PV}$ is the photovoltaic power output.
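As a minimal sketch, the state items listed above and the power balance constraint of Formula 1 could be represented as follows. The class name `MicrogridState`, its field names, the kW units, and the tolerance are assumptions chosen for illustration; the application does not prescribe a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicrogridState:
    """Operating parameters for one time period t (illustrative field names)."""
    t: int             # dispatching period
    load_kw: float     # P_L, load demand
    pv_kw: float       # P_PV, photovoltaic output
    wind_kw: float     # P_W, wind output
    soc: float         # state of charge of the energy storage device, in [0, 1]
    buy_price: float   # price for purchasing electricity
    sell_price: float  # price for selling electricity


def balance_residual(state, generator_kw, storage_kw, grid_kw):
    """Left side minus right side of Formula 1; zero means the constraint holds."""
    supply = generator_kw + storage_kw + grid_kw              # P_G + P_Ba + P_grid
    net_load = state.load_kw - state.wind_kw - state.pv_kw    # P_L - P_W - P_PV
    return supply - net_load


def is_balanced(state, generator_kw, storage_kw, grid_kw, tol=1e-6):
    """True when Formula 1 is satisfied within a small tolerance."""
    return abs(balance_residual(state, generator_kw, storage_kw, grid_kw)) <= tol


state = MicrogridState(t=12, load_kw=100.0, pv_kw=30.0, wind_kw=10.0,
                       soc=0.6, buy_price=0.8, sell_price=0.4)
```

With these example numbers the net load is 60 kW, so `is_balanced(state, 40.0, 10.0, 10.0)` holds, while raising the grid exchange to 20 kW leaves a 10 kW surplus.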
[0060] In this embodiment of the application, taking the power scenario as an example, the environmental state of the source domain is the operating parameters of the microgrid at a certain moment (e.g., time t) obtained from the above state space, which describes the system state of the microgrid at a certain moment.
[0061] In some embodiments, an action space is defined for the source domain, which is used to indicate the scope of scheduled actions in the source domain.
[0062] Illustratively, in a microgrid, the action space includes scheduling actions for each device in the microgrid; that is, the action space is a set of candidate scheduling actions that each device in the microgrid may perform. In some embodiments, the scheduling actions used are related to the control decisions of the microgrid. Optionally, the scheduling actions include power dispatching, charging and discharging of energy storage devices, flexible load control, adjusting the power generation of controllable generators, etc.
[0063] Illustratively, a scheduling action adjusts the operating parameters of the devices in the microgrid. In some embodiments, a scheduling action corresponds to at least two parameters, including: the scheduling object and the corresponding adjustment parameters; where the scheduling object refers to a device in the power system, and the adjustment parameters refer to the amount of adjustment to the device's operating parameters. For example, a scheduling action may increase the output power of a generator to 50%, or decrease the amount of electricity stored in an energy storage device to 50%.
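The two-parameter form of a scheduling action (scheduling object plus adjustment parameter) can be sketched as a small record type. The names `SchedulingAction` and `apply_actions`, and the choice to express setpoints as fractions between 0 and 1, are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingAction:
    """One scheduling action: which device to adjust and the new setpoint."""
    target_device: str   # the scheduling object, e.g. "generator"
    setpoint: float      # adjustment parameter as a fraction of capacity (assumption)


def apply_actions(setpoints, actions):
    """Return a new setpoint table with every action applied in order."""
    updated = dict(setpoints)
    for action in actions:
        if action.target_device not in updated:
            raise KeyError(f"unknown device: {action.target_device}")
        # Clamp to the valid range so an action cannot exceed device capacity.
        updated[action.target_device] = min(max(action.setpoint, 0.0), 1.0)
    return updated


current = {"generator": 0.3, "storage": 0.8}
actions = [
    SchedulingAction("generator", 0.5),  # increase generator output to 50%
    SchedulingAction("storage", 0.5),    # decrease stored electricity to 50%
]
after = apply_actions(current, actions)
```

The input table is copied rather than mutated, so the pre-action state stays available for logging as part of the recorded data.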
[0064] In this embodiment, the source domain policy simulator uses reinforcement learning to generate power dispatching strategies in the source domain. Illustratively, the environmental state (operating parameters) of the microgrid at time t is input into the source domain policy simulator. The simulator selects at least one dispatching action from candidate dispatching actions in the action space. This at least one dispatching action is the power dispatching strategy of the source domain policy simulator for the microgrid (simulated power system / stable power system) under the environmental state (operating parameters) at time t.
[0065] Step 220: Obtain the first data in the source domain and the second data in the target domain within the historical time period.
[0066] The first data is used to indicate the environmental status and scheduling actions of the source domain.
[0067] Optionally, when the source domain is a simulated power system, the first data includes data randomly generated by the source domain policy simulator, and / or data simulated by the source domain policy simulator in historical simulation scenarios. The historical simulation scenarios are those simulated using power dispatch policies generated by the source domain policy simulator.
[0068] Optionally, when the source domain is a stable power system, the first data includes data randomly generated by the source domain policy simulator, and / or data generated by the source domain policy simulator within a historical scheduling period. The historical scheduling period refers to the time during which the stable power system performs power scheduling based on the power scheduling strategy generated by the source domain policy simulator.
[0069] The second data is used to indicate the environmental status and scheduling actions of the target domain.
[0070] The target domain is the new environment in which an agent wants to apply its learned knowledge.
[0071] Optionally, in the power dispatching scenario, the target domain is implemented as a real power system, where the real power system is the power system deployed in a real scenario.
[0072] For example, this application illustrates the above-mentioned real power system as a microgrid.
[0073] In some embodiments, a state space is defined for the target domain, which indicates the range of environmental state changes within the target domain. In the embodiments of this application, since there are certain commonalities among power systems, the state spaces of the source domain and the target domain are the same or similar.
[0074] In some embodiments, an action space is defined for the target domain, which indicates the scope of scheduling actions within the target domain. In the embodiments of this application, since there are certain commonalities among power systems, the action spaces of the source domain and the target domain are the same or similar.
[0075] In this embodiment of the application, the second data is the data generated by the target domain within a historical time period, that is, the second data is the data recorded by the real power system within a historical time period, including environmental status (operating parameters) and scheduling actions.
[0076] Optionally, the historical time period can be a pre-set time period, a user-defined time period, or a time period divided according to preset rules. For example, the target domain policy simulator for a target domain has an update cycle, and the above historical time period is the n days prior to the current update cycle, where n is a positive integer.
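Selecting the second data for an "n days before the current update" window could look like the sketch below. The record layout (a dict with `timestamp`, `state`, and `action` keys) and the function name `select_history` are hypothetical; the application does not define a storage format.

```python
from datetime import datetime, timedelta


def select_history(records, update_time, n_days):
    """Keep records whose timestamp falls in the n days before update_time."""
    if n_days <= 0:
        raise ValueError("n_days must be a positive integer")
    start = update_time - timedelta(days=n_days)
    # Half-open window: the start is included, the update moment is excluded.
    return [r for r in records if start <= r["timestamp"] < update_time]


update_time = datetime(2024, 12, 16)
records = [
    {"timestamp": datetime(2024, 12, 1), "state": {"soc": 0.4}, "action": "charge"},
    {"timestamp": datetime(2024, 12, 10), "state": {"soc": 0.7}, "action": "discharge"},
    {"timestamp": datetime(2024, 12, 15), "state": {"soc": 0.5}, "action": "idle"},
    {"timestamp": datetime(2024, 12, 16), "state": {"soc": 0.5}, "action": "charge"},
]
recent = select_history(records, update_time, n_days=7)
```

Here only the records from 10 and 15 December fall inside the seven-day window.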
[0077] Step 230: Determine the first reward function based on the first data and the second data.
[0078] The first reward function is used to provide reinforcement learning feedback for the predicted scheduling actions during the generation of power dispatching strategies.
[0079] To illustrate, the reward function is a core concept in reinforcement learning. The reward function defines the feedback that an agent receives after taking a specific scheduled action in a specific environmental state. This feedback helps the agent evaluate the quality of its behavior and guides its learning process to optimize long-term performance.
[0080] Optionally, the reinforcement learning feedback mentioned above includes positive feedback and negative feedback, where positive feedback encourages the agent's behavior and negative feedback inhibits the agent's behavior.
[0081] In power dispatching scenarios, reinforcement learning feedback is linked to the control objectives and optimization strategies of microgrids. For example, in power dispatching scenarios, positive feedback corresponding to the reward function can include: rewarding dispatching behaviors that balance power supply and demand in the microgrid, rewarding dispatching behaviors that balance power in the microgrid, rewarding dispatching behaviors that increase the use of renewable energy, and rewarding dispatching behaviors where generators or energy storage devices in the microgrid effectively respond to the optimal setpoints allocated by the energy management unit. In power dispatching scenarios, negative feedback corresponding to the first reward function can include: penalizing dispatching behaviors that cause grid instability or voltage deviation, penalizing dispatching behaviors that cause excessive discharge or excessive generation of energy storage devices, penalizing dispatching behaviors that cause the microgrid to over-rely on the main grid for power supply instead of utilizing its own distributed generation resources, and penalizing dispatching behaviors that cause significant environmental pollution through power generation methods.
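A reward of this kind is often written as a weighted sum of bonus and penalty terms. The sketch below assumes one such form; the particular terms, weights, and SOC limits are illustrative values chosen for the example, not values given in the application.

```python
def dispatch_reward(imbalance_kw, renewable_kw, grid_import_kw, soc,
                    soc_min=0.2, soc_max=0.9,
                    w_renewable=0.1, w_balance=1.0, w_grid=0.05, w_soc=100.0):
    """Toy reward: positive feedback for renewable use, negative for violations."""
    # Positive feedback: more renewable energy used.
    bonus = w_renewable * renewable_kw
    # Negative feedback: supply/demand mismatch.
    balance_penalty = w_balance * abs(imbalance_kw)
    # Negative feedback: reliance on the main grid (only imports are penalized).
    grid_penalty = w_grid * max(grid_import_kw, 0.0)
    # Negative feedback: state of charge outside the allowed band.
    soc_violation = max(soc_min - soc, 0.0) + max(soc - soc_max, 0.0)
    soc_penalty = w_soc * soc_violation
    return bonus - balance_penalty - grid_penalty - soc_penalty


good = dispatch_reward(imbalance_kw=0.0, renewable_kw=40.0, grid_import_kw=20.0, soc=0.5)
bad = dispatch_reward(imbalance_kw=0.0, renewable_kw=40.0, grid_import_kw=20.0, soc=0.1)
```

The two calls differ only in the state of charge, so the gap between `good` and `bad` is exactly the over-discharge penalty.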
[0082] In some embodiments, a second reward function corresponding to the source domain is obtained; the second reward function is adjusted based on the first data and the second data to obtain a first reward function.
[0083] In other embodiments, a third reward function corresponding to the target domain is obtained; the third reward function is adjusted based on the first data and the second data to obtain a first reward function.
[0084] In some embodiments, the second reward function corresponding to the source domain and the third reward function corresponding to the target domain are the same; or, the second reward function corresponding to the source domain and the third reward function corresponding to the target domain are different.
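The paragraphs above do not state how the second (or third) reward function is adjusted. Purely as an assumption for illustration, the sketch below adds a correction term that compares how often each state-action pair appears in the target-domain history versus the source-domain data, using a smoothed log frequency ratio. The function names, the discrete string labels for states and actions, and the additive form are all hypothetical.

```python
import math
from collections import Counter


def build_correction(first_data, second_data, smoothing=1.0):
    """Return correction(state, action) = log of smoothed target/source frequency."""
    source_counts = Counter(first_data)
    target_counts = Counter(second_data)
    n_keys = len(set(source_counts) | set(target_counts))
    source_total = len(first_data) + smoothing * n_keys
    target_total = len(second_data) + smoothing * n_keys

    def correction(state, action):
        p_target = (target_counts[(state, action)] + smoothing) / target_total
        p_source = (source_counts[(state, action)] + smoothing) / source_total
        return math.log(p_target / p_source)

    return correction


def make_first_reward(second_reward, correction, weight=1.0):
    """Assumed additive adjustment: first = second + weight * correction."""
    def first_reward(state, action):
        return second_reward(state, action) + weight * correction(state, action)
    return first_reward


# Toy data: (environmental state, scheduling action) pairs.
first_data = [("peak", "discharge")] * 3 + [("peak", "import")]
second_data = [("peak", "discharge")] + [("peak", "import")] * 3


def second_reward(state, action):
    """Toy source-domain reward: discharging at peak is preferred."""
    return 1.0 if action == "discharge" else 0.0


correction = build_correction(first_data, second_data)
first_reward = make_first_reward(second_reward, correction)
```

In this toy data the target history favors importing at peak, so the correction raises the reward for `import` and lowers it for `discharge` relative to the source-domain reward.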
[0085] Step 240: Adjust the source domain policy simulator based on the first reward function to obtain the target domain policy simulator.
[0086] In this embodiment of the application, the target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching policy in the target domain.
[0087] In some embodiments, the source domain policy simulator includes a second reward function. Replacing the second reward function in the source domain policy simulator with a first reward function yields a target domain policy simulator. That is, the original second reward function of the source domain policy simulator is replaced by the obtained first reward function, thereby using the source domain policy simulator with the adjusted reward function as the target domain policy simulator for the power policy generation process of the target domain.
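Swapping the reward function while keeping the rest of the simulator unchanged can be sketched with an immutable record and `dataclasses.replace`. The `PolicySimulator` class, the toy rewards, and the 80 kW limit are assumptions for this example; a real simulator would also hold the policy parameters and training logic.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

State = Dict[str, float]
RewardFn = Callable[[State, float], float]


@dataclass(frozen=True)
class PolicySimulator:
    """Minimal stand-in for a policy simulator: a domain label plus its reward."""
    domain: str
    reward_fn: RewardFn

    def feedback(self, state: State, action_kw: float) -> float:
        """Reinforcement learning feedback for one predicted scheduling action."""
        return self.reward_fn(state, action_kw)


def second_reward(state: State, action_kw: float) -> float:
    """Toy source-domain reward: penalize mismatch with the net load."""
    return -abs(state["net_load_kw"] - action_kw)


def first_reward(state: State, action_kw: float) -> float:
    """Toy adjusted reward: extra penalty above a hypothetical 80 kW limit."""
    overload = max(action_kw - 80.0, 0.0)
    return second_reward(state, action_kw) - 2.0 * overload


source_sim = PolicySimulator(domain="source", reward_fn=second_reward)
# Replace only the reward function; the source simulator itself is left untouched.
target_sim = replace(source_sim, domain="target", reward_fn=first_reward)
```

For a state with a 90 kW net load, dispatching 90 kW scores 0 under `source_sim` but -20 under `target_sim`, which shows how the same action can receive different feedback once the reward is replaced.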
[0088] In some embodiments, the target domain policy simulator applies reinforcement learning algorithms to generate power dispatch policies for the target domain. Optionally, the reinforcement learning algorithms include at least one of the following: maximum entropy reinforcement learning algorithms (e.g., Soft Q-Learning, Soft Actor-Critic (SAC), etc.), Policy Gradient (PG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), etc., and this application embodiment does not limit the specific implementation of these algorithms.
[0089] In some embodiments, a fifth environmental state corresponding to the target domain is obtained; the fifth environmental state is input into the target domain policy simulator, and a third scheduling action is predicted based on the feedback of the scheduling action by the first reward function; a power scheduling strategy in the target domain is generated based on the third scheduling action.
[0090] That is, the fifth environmental state (operating parameters) of the real power system is input into the target domain policy simulator. The target domain policy simulator selects at least one third scheduling action from the candidate scheduling actions in the action space through the first reward function. The at least one third scheduling action is the power scheduling strategy of the target domain policy simulator for the real power system under the fifth environmental state (operating parameters).
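A heavily simplified sketch of this selection step is shown below: each candidate scheduling action is scored with the first reward function and the best ones form the strategy. A trained policy (for example one learned with SAC or PPO) would normally replace this one-step greedy ranking; the ranking, the toy reward, and the names `generate_strategy` and `toy_first_reward` are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, float]
Action = Tuple[str, float]  # (scheduling object, setpoint in kW)


def toy_first_reward(state: State, action: Action) -> float:
    """Toy reward: match the net load, with a hypothetical device-limit penalty."""
    _device, setpoint_kw = action
    mismatch = abs(state["net_load_kw"] - setpoint_kw)
    penalty = 50.0 if setpoint_kw > state["device_limit_kw"] else 0.0
    return -mismatch - penalty


def generate_strategy(state: State, candidates: Sequence[Action],
                      reward_fn: Callable[[State, Action], float],
                      k: int = 1) -> List[Action]:
    """Return the k candidate actions with the highest reward feedback."""
    ranked = sorted(candidates, key=lambda a: reward_fn(state, a), reverse=True)
    return list(ranked[:k])


fifth_state = {"net_load_kw": 90.0, "device_limit_kw": 80.0}
candidates = [("generator", 60.0), ("generator", 80.0), ("generator", 90.0)]
strategy = generate_strategy(fifth_state, candidates, toy_first_reward)
```

Although 90 kW matches the net load exactly, it exceeds the assumed device limit, so the 80 kW action is selected instead.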
[0091] In some embodiments, the fifth environmental state described above can be implemented as the environmental state of the real power system at the current moment. The "current moment" indicates the time when the real power system has a scheduling need, for example, the moment when the terminal / server controlling the real power system receives a scheduling request.
[0092] Optionally, the above power dispatching strategy can be implemented through a single dispatching action or through a combination of multiple dispatching actions, without any limitation.
[0093] In summary, a first reward function is obtained by using the first data provided by the source domain and the second data generated by the target domain within a historical time period to provide reinforcement learning feedback during the generation of power dispatching strategies. This first reward function is then applied to the source domain policy simulator corresponding to the source domain, resulting in a target domain policy simulator applicable to the target domain. The power dispatching strategy for the target domain is generated using the target domain policy simulator. In other words, the reward function corresponding to the source domain policy simulator is adjusted using the historical data provided by the target domain. Under the influence of the historical data provided by the target domain, the adjusted reward function can adapt to the dynamic differences between the source and target domains, thus enabling the adjusted policy simulator to adapt to the target domain. This ensures the effectiveness and security of the generated power dispatching strategy applied to the target domain without requiring extensive real-time training in the target domain.
[0094] In some optional embodiments, a correction function between the source and target domains is obtained using the first data and the second data. This correction function is then used to adjust the second reward function of the source domain policy simulator, thereby obtaining, from the source domain policy simulator, a target domain policy simulator that adapts to the dynamic differences between the target and source domains. Please refer to Figure 3, which illustrates a flowchart of a method for obtaining a policy simulator provided in an exemplary embodiment of this application; the method includes steps 310 to 340.
[0095] Step 310: Obtain the source domain policy simulator.
[0096] The source domain policy simulator is used to analyze the environmental state of the source domain and generate power dispatching policies in the source domain.
[0097] Optionally, in the power dispatching scenario, the source domain includes at least one of a simulated power system and a stable power system. The simulated power system is a virtual system obtained through algorithmic or model simulation, implemented using a source domain policy simulator. The stable power system is a real-world power system with a long service life; that is, the stable power system is equipped with a source domain policy simulator to generate power dispatching policies.
[0098] In this embodiment, the source domain policy simulator uses reinforcement learning to generate power dispatching strategies in the source domain. Illustratively, the environmental state (operating parameters) of the microgrid at time t is input into the source domain policy simulator. The simulator selects at least one dispatching action from candidate dispatching actions in the action space. This at least one dispatching action is the power dispatching strategy of the source domain policy simulator for the microgrid (simulated power system / stable power system) under the environmental state (operating parameters) at time t.
[0099] Taking execution of the method by a server as an example, optionally, the server reads the source domain policy simulator from the database, or the server receives the source domain policy simulator sent by the terminal.
[0100] Step 320: Obtain the first data in the source domain and the second data in the target domain within the historical time period.
[0101] The first data is used to indicate the environmental status and scheduling actions of the source domain.
[0102] Optionally, when the source domain is a simulated power system, the first data includes data randomly generated by the source domain policy simulator, and / or data simulated by the source domain policy simulator in historical simulation scenarios. The historical simulation scenarios are those simulated using power dispatch policies generated by the source domain policy simulator.
[0103] Optionally, when the source domain is a stable power system, the first data includes data randomly generated by the source domain policy simulator, and / or data generated by the source domain policy simulator within a historical scheduling period. The historical scheduling period refers to the time during which the stable power system performs power scheduling based on the power scheduling strategy generated by the source domain policy simulator.
[0104] The second data is used to indicate the environmental status and scheduling actions of the target domain.
[0105] In this embodiment of the application, the second data is the data generated by the target domain within a historical time period, that is, the second data is the data recorded by the real power system within a historical time period, including environmental status (operating parameters) and scheduling actions.
[0106] Optionally, the historical time period can be a pre-set time period, a user-defined time period, or a time period divided according to preset rules. For example, the target domain policy simulator for a target domain has an update cycle, and the above historical time period is the data from the n days prior to the current update cycle, where n is a positive integer.
[0107] Step 331: Obtain the correction function between the source domain and the target domain based on the first data and the second data.
[0108] The correction function is used to express the difference between the first state transition probability corresponding to the source domain and the second state transition probability corresponding to the target domain. The first state transition probability is used to indicate the probability of environmental state change in the source domain under the influence of scheduling actions, and the second state transition probability is used to indicate the probability of environmental state change in the target domain under the influence of scheduling actions.
[0109] State transition probability is a fundamental concept in Markov chain theory. It describes the probability of transitioning from one state to another in a Markov chain. A Markov chain is a stochastic process characterized by the fact that the probability distribution of future states depends only on the current state and is independent of previous historical states; this property is known as the Markov property.
[0110] In this embodiment, the state transition probability describes the probability that the system (source domain / target domain) will transition to the next environmental state after taking a scheduling action in a given environmental state. Illustratively, the first state transition probability T(s′|s,a) represents the probability that the source domain will transition to a new environmental state s′ after taking scheduling action a in the current environmental state s, where the new environmental state s′ depends only on the current environmental state s and the scheduling action a, and is independent of historical environmental states. The second state transition probability T′(s′|s,a) represents the probability that the target domain will transition to a new environmental state s′ after taking scheduling action a in the current environmental state s, where the new environmental state s′ depends only on the current environmental state s and the scheduling action a, and is independent of historical environmental states.
[0111] In some embodiments, a relative ratio between the first state transition probability and the second state transition probability is determined based on the first data and the second data, and a correction function between the source domain and the target domain is determined based on the relative ratio. That is, the relative ratio between the first state transition probability and the second state transition probability is directly determined using the first data and the second data, and the correction function between the source domain and the target domain is determined based on the relative ratio.
[0112] The relative ratio between the first state transition probability and the second state transition probability includes the average ratio, log density ratio, mutual information ratio, etc.
[0113] In some embodiments, at least one classifier is constructed to characterize the relative ratio between the first state transition probability and the second state transition probability. The classifier is used to identify the data source to which the input data belongs, including a source domain or a target domain; that is, the classifier determines whether the input data comes from the source domain or the target domain.
[0114] In some embodiments, the classifier is trained on the first data and the second data, and the classifier trained on the first data and the second data is used to determine the relative ratio between the first state transition probability and the second state transition probability.
[0115] Illustratively, a classifier is obtained; the classifier is trained based on the first data and the second data to obtain the trained classifier; and the correction function between the source domain and the target domain is obtained based on the trained classifier.
[0116] Illustratively, the classifier training process includes: labeling first data with a first label, the first label indicating that the data comes from a source domain; and labeling second data with a second label, the second label indicating that the data comes from a target domain; inputting the first data into the classifier to obtain a first classification result; and inputting the second data into the classifier to obtain a second classification result, the first classification result indicating the data source of the predicted first data, and the second classification result indicating the data source of the predicted second data; and training the classifier based on the difference between the first classification result and the first label, and the difference between the second classification result and the second label, to obtain the trained classifier.
[0117] In some embodiments, the difference between a first classification result and a first label, and the difference between a second classification result and a second label are determined by specifying a loss function. Optionally, the specified loss function can be implemented as at least one of the following: cross-entropy loss, mean squared error loss (MSE), log loss, least absolute deviations loss (L1 loss), etc., without limitation.
[0118] That is, by introducing a classifier to directly represent the relative ratio between the first state transition probability and the second state transition probability, the problem of difficulty in obtaining the state transition probability in some scenarios is avoided, and the efficiency of obtaining the correction function is improved.
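The classifier-based approach above can be sketched as follows. This is a minimal illustration, not the implementation of this application: it assumes each data record has been flattened into a numeric feature vector, uses a hand-written logistic regression as the classifier, and trains it with cross-entropy loss. The function names `train_domain_classifier` and `log_ratio`, the learning rate, and the toy data are all hypothetical. When the two data sets are of equal size, the classifier's logit approximates the log ratio of the target-domain density to the source-domain density of the input.

```python
import numpy as np

def train_domain_classifier(source_x, target_x, lr=0.1, epochs=500):
    """Fit logistic regression; label 0 = source domain, label 1 = target domain."""
    x = np.vstack([source_x, target_x])
    y = np.concatenate([np.zeros(len(source_x)), np.ones(len(target_x))])
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))    # predicted P(target | input)
        w -= lr * x.T @ (p - y) / len(y)      # gradient step on mean cross-entropy
    return w

def log_ratio(w, x):
    """Classifier logit, i.e. log P(target | x) - log P(source | x)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.hstack([x, np.ones((len(x), 1))]) @ w

# Toy data (assumption): target-domain records are shifted relative to source.
rng = np.random.default_rng(0)
first_data = rng.normal(0.0, 1.0, size=(200, 3))
second_data = rng.normal(1.0, 1.0, size=(200, 3))
weights = train_domain_classifier(first_data, second_data)
```

A positive value of `log_ratio` indicates that the record looks more like target-domain data, and a negative value indicates source-domain data.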
[0119] In other embodiments, a first state transition probability is determined based on the first data, and a second state transition probability is determined based on the second data; a correction function between the source and target domains is determined based on the relative ratio between the first and second state transition probabilities. That is, the first state transition probability is determined using the first data, the second state transition probability is determined using the second data, and the relative ratio between the first and second state transition probabilities is then used to determine the correction function between the source and target domains.
[0120] In some embodiments, the first state transition probability is represented by a first state transition function, and the second state transition probability is represented by a second state transition function, wherein the state transition function is a probability density function.
[0121] In some embodiments, the determination of the first state transition function and the second state transition function can be implemented in one of the following ways:
[0122] 1. Direct observation and statistics: Illustratively, the first state transition function is obtained through observation and statistics of the first data, and the second state transition function is obtained through observation and statistics of the second data. For example, two counts are collected from the first data: a first count of the records in which scheduling action a is executed in environmental state s (i.e., the count corresponding to (s,a)), and a second count of the records in which the environmental state transitions to s′ after scheduling action a is executed in environmental state s (i.e., the count corresponding to (s,a,s′)). The first state transition function is obtained from the ratio of the second count to the first count for each combination of environmental state and scheduling action. The observation and statistical process for the second state transition function is the same as that for the first state transition function and is not repeated here.
[0123] 2. Theoretical model / formula derivation: Illustratively, preset theoretical models / formulas related to the power scenario are obtained. By substituting the first data into the preset theoretical models / formulas, the corresponding first theoretical model / formula is selected from them to characterize the first state transition function; by substituting the second data into the preset theoretical models / formulas, the corresponding second theoretical model / formula is selected from them to characterize the second state transition function.
[0124] 3. Computer simulation estimation: Illustratively, based on the first data, the first initial probability distribution over the environmental states is calculated, and a first state transition frequency matrix is constructed, whose elements represent the number of transitions from one environmental state to another. The first state transition frequency matrix is normalized to obtain the probability of each state transition, which gives the state transition matrix. Using the state transition matrix, state transitions are simulated through matrix multiplication, i.e., P(S_{t+1}) = P(S_t) × T, where T is the state transition matrix and P(S_t) is the first initial probability distribution. The first state transition function is obtained through iterative simulation. The simulation estimation process for the second state transition function is the same as that for the first state transition function and is not described in detail here.
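Ways 1 and 3 above can be sketched as follows, assuming a discrete (or discretized) set of environmental states and scheduling actions. The function names and the toy state labels are hypothetical, and `normalize_rows` assumes every state has at least one observed outgoing transition.

```python
from collections import Counter
import numpy as np

def estimate_transitions(triples):
    """Way 1: estimate T(s'|s,a) as count(s,a,s') / count(s,a)."""
    pair_counts = Counter((s, a) for s, a, _ in triples)
    triple_counts = Counter(triples)
    return {
        (s, a, s_next): n / pair_counts[(s, a)]
        for (s, a, s_next), n in triple_counts.items()
    }

def normalize_rows(frequency_matrix):
    """Way 3: turn a state transition frequency matrix into probabilities."""
    counts = np.asarray(frequency_matrix, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def propagate(p_t, transition_matrix):
    """One simulation step: P(S_{t+1}) = P(S_t) x T."""
    return np.asarray(p_t, dtype=float) @ transition_matrix

# Toy records (assumption): battery level states and a single action.
records = [("low", "charge", "high"), ("low", "charge", "high"),
           ("low", "charge", "low")]
t_first = estimate_transitions(records)

T = normalize_rows([[3, 1], [2, 2]])
p_next = propagate([1.0, 0.0], T)
```

Calling `propagate` repeatedly on its own output gives the iterative simulation described in way 3.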
[0125] In some embodiments, the relative ratio between the first state transition function and the second state transition function is determined as the correction function. In one example, the relative ratio is implemented as the logarithmic density ratio, as shown in Formula 2:
[0126] Formula 2: Δr = log T′(s′|s,a) − log T(s′|s,a)
[0127] Where Δr is the correction function, T′(s′|s,a) is the second state transition function, and T(s′|s,a) is the first state transition function.
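As a small worked sketch of the logarithmic density ratio, assume it is taken as the target-domain transition T′ over the source-domain transition T. The function name `correction` and the smoothing constant `eps`, which guards against taking the logarithm of zero, are assumptions for illustration only.

```python
import math

def correction(t_target, t_source, eps=1e-8):
    """Delta r = log T'(s'|s,a) - log T(s'|s,a); eps avoids log(0)."""
    return math.log(t_target + eps) - math.log(t_source + eps)

# A transition that is equally likely in both domains needs no correction.
same = correction(0.5, 0.5)
# A transition that is less likely in the target domain is penalized.
penalized = correction(0.2, 0.8)
```

Under this assumption, the correction is zero where the two domains agree and negative for transitions that the source domain over-represents relative to the target domain.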
[0128] Step 332: Combine the second reward function with the correction function to obtain the first reward function.
[0129] The second reward function is the reward function corresponding to the source domain policy simulator. Illustratively, the second reward function corresponding to the source domain policy simulator is obtained, and then combined with the correction function to obtain the first reward function.
[0130] In some embodiments, the second reward function is added to the correction function to obtain the first reward function, as exemplarily shown in Formula 3:
[0131] Formula 3: r′ = r + Δr
[0132] Where r′ is the first reward function, r is the second reward function, and Δr is the correction function.
[0133] In some embodiments, the weighted sum of the second reward function and the correction function is used as the first reward function. Illustratively, a preset weight relationship is obtained, and the second reward function and the correction function are weighted and summed based on the preset weight relationship to obtain the first reward function. The preset weight relationship is preset by the system. Because the new reward function is a weighted sum of the second reward function and the correction function, adjusting the preset weight relationship controls how strongly the correction function modifies the second reward function. This allows for adaptation to more diverse scenario requirements and balances knowledge from the source domain against knowledge from the target domain.
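The two ways of combining the second reward function with the correction function can be sketched together. This is an illustrative sketch only: the function name `make_first_reward` and the weight parameters `w_reward` and `w_correction` are hypothetical, and with both weights set to 1 the result reduces to the plain sum r + Δr.

```python
def make_first_reward(second_reward, correction, w_reward=1.0, w_correction=1.0):
    """Build r' = w_reward * r + w_correction * delta_r."""
    def first_reward(s, a, s_next):
        return (w_reward * second_reward(s, a, s_next)
                + w_correction * correction(s, a, s_next))
    return first_reward

# Toy reward and correction functions (assumptions for illustration).
toy_reward = lambda s, a, s_next: 1.0
toy_correction = lambda s, a, s_next: -0.5

plain_sum = make_first_reward(toy_reward, toy_correction)
damped = make_first_reward(toy_reward, toy_correction, w_correction=0.5)
```

Lowering `w_correction` weakens the influence of the target-domain correction, while raising it makes the first reward function follow the target-domain dynamics more closely.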
[0134] Step 340: Replace the second reward function in the source domain policy simulator with the first reward function to obtain the target domain policy simulator.
[0135] In this embodiment of the application, the target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching policy in the target domain.
[0136] That is, the original second reward function of the source domain policy simulator is replaced by the obtained first reward function, so that the source domain policy simulator with the adjusted reward function is used as the target domain policy simulator for the power policy generation process of the target domain.
[0137] In some embodiments, the target domain policy simulator applies reinforcement learning algorithms to generate power dispatch policies for the target domain. Optionally, the reinforcement learning algorithms include at least one of the following: maximum entropy reinforcement learning algorithms (e.g., Soft Q-Learning, Soft Actor-Critic (SAC), etc.), Policy Gradient (PG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), etc., and this application embodiment does not limit the specific algorithms used.
[0138] In some embodiments, a fifth environmental state corresponding to the target domain is obtained; the fifth environmental state is input into the target domain policy simulator, and a third scheduling action is predicted based on the feedback of the scheduling action by the first reward function; a power scheduling strategy in the target domain is generated based on the third scheduling action.
[0139] That is, the fifth environmental state (operating parameters) of the real power system is input into the target domain policy simulator. The target domain policy simulator selects at least one third scheduling action from the candidate scheduling actions in the action space through the first reward function. The at least one third scheduling action is the power scheduling strategy of the target domain policy simulator for the real power system under the fifth environmental state (operating parameters).
[0140] In some embodiments, the fifth environmental state described above can be implemented as the environmental state of the real power system at the current moment. The "current moment" indicates the time when the real power system has a scheduling need, for example, the moment when the terminal / server controlling the real power system receives a scheduling request.
[0141] Optionally, the above power dispatching strategy can be implemented through a single dispatching action or through a combination of multiple dispatching actions, without any limitation.
[0142] Illustratively, please refer to Figure 4, which shows a migration diagram of a policy simulator provided in an exemplary embodiment of this application. Taking a simulated power system 410 as the source domain and a real power system 420 as the target domain as an example, the simulated power system 410 corresponds to a policy simulator 411, which corresponds to a second reward function 412. The policy simulator 411 obtains first data 413 and obtains historical data recorded by the real power system 420 within a historical time period as second data 421. The first data 413 and the second data 421 are used to obtain a correction function 431. The second reward function 412 is corrected by the correction function 431 to obtain a first reward function 422. The policy simulator 411 is updated using the first reward function 422 to obtain the updated policy simulator 414.
[0143] The real power system 420 inputs the current environmental state into the updated policy simulator 414, and generates a power dispatch policy 423 through the updated policy simulator 414. The real power system 420 then completes power dispatch through the power dispatch policy 423.
[0144] In summary, a first reward function is obtained by using the first data provided by the source domain and the second data generated by the target domain within a historical time period to provide reinforcement learning feedback during the generation of power dispatching strategies. This first reward function is then applied to the source domain policy simulator corresponding to the source domain, resulting in a target domain policy simulator applicable to the target domain. The power dispatching strategy for the target domain is generated using the target domain policy simulator. In other words, the reward function corresponding to the source domain policy simulator is adjusted using the historical data provided by the target domain. Under the influence of the historical data provided by the target domain, the adjusted reward function can adapt to the dynamic differences between the source and target domains, thus enabling the adjusted policy simulator to adapt to the target domain. This ensures the effectiveness and security of the generated power dispatching strategy applied to the target domain without requiring extensive real-time training in the target domain.
[0145] In this embodiment, a correction function between the source and target domains is obtained by combining first data from the source domain and second data from the target domain. The original reward function of the source domain policy simulator is fine-tuned using the correction function to obtain a new reward function. Driven by the new reward function, the source domain policy simulator (target domain policy simulator) with the updated reward function can adapt to the dynamic differences between the source and target domains and provide the power dispatching strategy generation function for the target domain.
[0146] In some optional embodiments, the constructed first and second classifiers are trained using the first data and the second data, and the resulting first and second classifiers are used to characterize the correction function. Please refer to Figure 5, which illustrates a flowchart of a method for obtaining a policy simulator provided in an exemplary embodiment of this application; the method includes steps 510 to 580.
[0147] Step 510: Obtain the source domain policy simulator.
[0148] The source domain policy simulator is used to analyze the environmental state of the source domain and generate power dispatching policies in the source domain.
[0149] Optionally, in the power dispatching scenario, the source domain includes at least one of a simulated power system and a stable power system. The simulated power system is a virtual system obtained through algorithmic or model simulation, implemented using a source domain policy simulator. The stable power system is a real-world power system with a long service life; that is, the stable power system is equipped with a source domain policy simulator to generate power dispatching policies.
[0150] In this embodiment, the source domain policy simulator uses reinforcement learning to generate power dispatching strategies in the source domain. Illustratively, the environmental state (operating parameters) of the microgrid at time t is input into the source domain policy simulator. The simulator selects at least one dispatching action from candidate dispatching actions in the action space. This at least one dispatching action is the power dispatching strategy of the source domain policy simulator for the microgrid (simulated power system / stable power system) under the environmental state (operating parameters) at time t.
[0151] Taking execution of the method by a server as an example, optionally, the server reads the source domain policy simulator from the database, or the server receives the source domain policy simulator sent by the terminal.
[0152] Step 520: Obtain the first data in the source domain and the second data of the target domain within the historical time period.
[0153] The first data is used to indicate the environmental state and scheduling action of the source domain. In this embodiment, the first data includes a first environmental state, a first scheduling action, and a second environmental state. The second environmental state is the environmental state to which the first environmental state transitions under the influence of the first scheduling action. That is, the first data is a triplet (s, a, s′) composed of the first environmental state, the first scheduling action, and the second environmental state.
[0154] The second data is used to indicate the environmental state and scheduling action of the target domain. In the embodiments of this application, the second data includes a third environmental state, a second scheduling action, and a fourth environmental state. The fourth environmental state is the environmental state to which the third environmental state transitions under the influence of the second scheduling action. That is, the second data is a triplet (s, a, s′) composed of the third environmental state, the second scheduling action, and the fourth environmental state.
[0155] In this embodiment of the application, taking the source domain as a simulated power system as an example, the first data is data randomly generated by the source domain policy simulator, and / or the first data is data generated by the source domain policy simulator based on the second data.
[0156] In some embodiments, the process of acquiring the first data is implemented as follows: acquiring a tuple formed by a first environment state and a first scheduling action; inputting the tuple into a source domain policy simulator to obtain a second environment state corresponding to the source domain under the influence of the first scheduling action; and acquiring the first data composed of the first environment state, the first scheduling action, and the second environment state. That is, by generating a tuple data (s, a) composed of the first environment state and the first scheduling action, the source domain policy simulator predicts the second environment state under the conditions of the above tuple data, thereby obtaining a triple data (s, a, s′) composed of the first environment state, the first scheduling action, and the second environment state.
[0157] In some embodiments, the tuple (s,a) formed by the first environment state and the first scheduling action can be generated in the state space and action space corresponding to the source domain, or it can be generated based on the second data.
[0158] In one example, the state space and action space corresponding to the source domain are obtained, wherein the state space is used to indicate the range of environmental state changes in the source domain, and the action space is used to indicate the range of scheduled actions in the source domain; a first environmental state is sampled in the state space, and a first scheduled action is sampled in the action space.
[0159] Optionally, the sampling process can be implemented as random sampling, sequential sampling, stratified sampling, Bayesian sampling, Markov Chain Monte Carlo (MCMC) sampling, etc., without limitation here.
[0160] In another example, when the state space of the source domain and the state space of the target domain are the same, and the action space of the source domain and the action space of the target domain are the same, the tuple formed by the third environment state and the second scheduling action in the second data is used as the tuple formed by the first environment state and the first scheduling action. That is, the third environment state and the second scheduling action are input into the source domain policy simulator to obtain the second environment state of the source domain under the influence of the second scheduling action in the third environment state of the source domain, and the first data composed of the first environment state (third environment state), the first scheduling action (second scheduling action) and the second environment state is obtained.
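The two ways of forming the tuple (s, a) and then obtaining the first data can be sketched as follows. This is an illustration under stated assumptions: `source_step` is a hypothetical stand-in for the source domain policy simulator's prediction of the next environmental state, states and actions are represented as plain numbers, and uniform random sampling is used as one of the permitted sampling methods.

```python
import random

def sample_pairs(state_space, action_space, n, seed=0):
    """Sample (s, a) tuples from the source-domain state and action spaces."""
    rng = random.Random(seed)
    return [(rng.choice(state_space), rng.choice(action_space)) for _ in range(n)]

def pairs_from_second_data(second_data):
    """Reuse the (s, a) part of target-domain triplets (same spaces assumed)."""
    return [(s, a) for s, a, _ in second_data]

def build_first_data(source_step, pairs):
    """Query the source-domain simulator for s' to form triplets (s, a, s')."""
    return [(s, a, source_step(s, a)) for s, a in pairs]

# Hypothetical source-domain dynamics: the state moves by the action amount.
toy_step = lambda s, a: s + a

second_data = [(3, 1, 5), (4, -1, 2)]  # toy target-domain records
first_data = build_first_data(toy_step, pairs_from_second_data(second_data))
sampled = build_first_data(toy_step, sample_pairs([0, 1, 2], [-1, 1], n=4))
```

Reusing the (s, a) part of the second data means both domains are compared on the same state-action pairs, so any difference in s′ reflects a difference in dynamics rather than in sampling.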
[0161] Step 531: Obtain the classifiers, which include a triple classifier and a binary classifier.
[0162] The classifier is used to identify the data source of the input data, which can be either the source domain or the target domain. In other words, the classifier determines whether the input data comes from the source domain or the target domain.
[0163] Optionally, the classifier can be built on a server or obtained from other devices; this is not limited here.
[0164] In this embodiment, the classifier includes a triple classifier q_sas(s, a, s′) and a binary classifier q_sa(s, a). The input to the triple classifier q_sas(s, a, s′) is the triplet data (s, a, s′); that is, the triple classifier q_sas(s, a, s′) is used to distinguish whether a given state transition (s, a, s′) originates from the source domain or the target domain. The binary classifier q_sa(s, a) is used to distinguish whether a given state-action pair (s, a) originates from the source domain or the target domain.
[0165] Optionally, the above classifier can be implemented as a Support Vector Machine (SVM), a K-Nearest Neighbors (KNN) classifier, a Decision Tree classifier, a Gradient Boosting Trees (GBT) classifier, or a neural network-based classifier, etc.
[0166] Step 532: Label the first data with a first label, and label the second data with a second label.
[0167] The first label indicates that the data comes from the source domain, and the second label indicates that the data comes from the target domain.
[0168] It is worth noting that step 531 can be executed before step 532, after step 532, or in parallel with step 532.
[0169] In this embodiment of the application, the first data labeled with the first label and the second data labeled with the second label are training data used to train the classifier.
[0170] In some embodiments, the first data and the second data are mixed after the first data and the second data are labeled.
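The labeling and mixing steps can be sketched as follows. The numeric label values, the function name `label_and_mix`, and the fixed shuffle seed are assumptions for illustration.

```python
import random

SOURCE_LABEL = 0  # first label: data comes from the source domain
TARGET_LABEL = 1  # second label: data comes from the target domain

def label_and_mix(first_data, second_data, seed=0):
    """Attach domain labels to both data sets and shuffle them together."""
    labeled = ([(x, SOURCE_LABEL) for x in first_data]
               + [(x, TARGET_LABEL) for x in second_data])
    random.Random(seed).shuffle(labeled)
    return labeled

training_set = label_and_mix([(3, 1, 4), (4, -1, 3)], [(3, 1, 5), (4, -1, 2)])
```

Each element of `training_set` is a pair of a triplet and its domain label, so training samples can be drawn from the mixed set without regard to which domain they came from.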
[0171] Step 541: Input the first data into the triple classifier to obtain the first subclassification result, and input the second data into the triple classifier to obtain the second subclassification result.
[0172] The first subclassification result indicates the data source of the first data predicted by the triple classifier, and the second subclassification result indicates the data source of the second data predicted by the triple classifier.
[0173] In some embodiments, when the first data and the second data are input into the triple classifier, the first data and the second data are regarded as a set of training data, and training data (which may be the first data or the second data) are randomly selected from the set and input into the triple classifier for the training process of the triple classifier.
[0174] Step 542: Based on the difference between the first subclassification result and the first label, and the difference between the second subclassification result and the second label, train a triple classifier to obtain the trained triple classifier.
[0175] In some embodiments, a first loss function is used to determine the difference between the prediction result of the triple classifier based on the training data and the label. Optionally, the first loss function can be implemented as at least one of the following: cross-entropy loss, mean squared error loss (MSE), log loss, and least absolute deviations loss (L1 loss), without limitation.
[0176] In one example, the first loss function is implemented as the cross-entropy loss function, as shown in Formula 4:
[0177] Formula 4:
[0178] Where, when (s, a, s′) is implemented as the first data, the corresponding term is the loss value of the first subclassification result q_sas(s, a, s′); when (s, a, s′) is implemented as the second data, the corresponding term is the loss value of the second subclassification result q_sas(s, a, s′).
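As a hedged illustration, a standard binary cross-entropy loss matching this description could be written as below. The expectation notation, the dataset symbols $\mathcal{D}_{\text{source}}$ and $\mathcal{D}_{\text{target}}$, and the conditional form $q_{sas}(\cdot \mid s, a, s')$ are assumptions made for this sketch; the application's own notation may differ.

```latex
\mathcal{L}_{sas}
  = -\,\mathbb{E}_{(s,a,s') \sim \mathcal{D}_{\text{source}}}
      \bigl[\log q_{sas}(\text{source} \mid s, a, s')\bigr]
    \;-\;\mathbb{E}_{(s,a,s') \sim \mathcal{D}_{\text{target}}}
      \bigl[\log q_{sas}(\text{target} \mid s, a, s')\bigr]
```

The first term covers samples carrying the first label and the second term covers samples carrying the second label, with $q_{sas}(\text{source} \mid \cdot) = 1 - q_{sas}(\text{target} \mid \cdot)$.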
[0179] Step 551: Input the first environment state and the first scheduling action into the binary classifier to obtain the third subclassification result, and input the third environment state and the second scheduling action into the binary classifier to obtain the fourth subclassification result.
[0180] The third subclassification result is used to indicate the data source of the first environmental state and the first scheduling action predicted by the binary classifier, and the fourth subclassification result is used to indicate the data source of the third environmental state and the second scheduling action predicted by the binary classifier.
[0181] In this embodiment of the application, the first environment state and the first scheduling action in the first data form a binary tuple, and the third environment state and the second scheduling action in the second data form a binary tuple.
[0182] In some embodiments, the binary tuples obtained from the first data and from the second data together form one training set. Training samples (each of which may be a binary tuple from the first data or a binary tuple from the second data) are randomly selected from this set and input into the binary classifier during its training process.
[0183] Step 552: Based on the difference between the third subclassification result and the first label, and the difference between the fourth subclassification result and the second label, train the binary classifier to obtain the trained binary classifier.
[0184] In some embodiments, a second loss function is used to determine the difference between the prediction result of the binary classifier based on the training data and the label. Optionally, the second loss function can be implemented as at least one of the following: cross-entropy loss, mean squared error loss (MSE), log loss, and least absolute deviations loss (L1 loss), without limitation.
[0185] In one example, the second loss function is implemented as the cross-entropy loss function, as shown in Formula 5:
[0186] Formula 5:
[0187] Where, when (s, a) is implemented as the binary tuple consisting of the first environment state and the first scheduling action in the first data, the corresponding term is the loss value of the third subclassification result q_sa(s, a); when (s, a) is implemented as the binary tuple consisting of the third environment state and the second scheduling action in the second data, the corresponding term is the loss value of the fourth subclassification result q_sa(s, a).
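The two training procedures can be sketched together in a few lines. This is a minimal sketch under stated assumptions, not the application's implementation: both classifiers are modeled as logistic regression trained by per-sample gradient descent on the cross-entropy loss, and the helper names (`train_classifier`, `prob_target`) and the toy dynamics are hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(samples, labels, lr=0.5, epochs=200, seed=0):
    """Fit a logistic-regression domain classifier with cross-entropy SGD.

    The model output is the predicted probability that the input comes
    from the target domain (label 1); label 0 marks the source domain.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(samples[0]), 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the mixed training set in random order
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, samples[i])) + b)
            grad = p - labels[i]  # derivative of cross-entropy w.r.t. the logit
            w = [wj - lr * grad * xj for wj, xj in zip(w, samples[i])]
            b -= lr * grad
    return w, b


def prob_target(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (assumption): source dynamics s' = s + a, target dynamics s' = s + 0.5 * a.
rng = random.Random(1)
pairs = [(rng.uniform(0.0, 1.0), rng.uniform(0.5, 1.0)) for _ in range(40)]
first_data = [(s, a, s + a) for s, a in pairs]         # first label (0)
second_data = [(s, a, s + 0.5 * a) for s, a in pairs]  # second label (1)
labels = [0] * len(first_data) + [1] * len(second_data)

# The triple classifier sees (s, a, s'); the binary classifier sees only (s, a).
triple_model = train_classifier(first_data + second_data, labels)
binary_model = train_classifier([t[:2] for t in first_data + second_data], labels)
```

In this toy setup the two domains share the same (s, a) distribution, so only the triple classifier has information that separates them; the binary classifier exists to factor out any difference in how states and actions are visited.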
[0188] In this embodiment of the application, the trained classifier includes the trained triple classifier and the trained binary classifier.
[0189] Step 560: Obtain the correction function between the source domain and the target domain based on the trained triple classifier and the trained binary classifier.
[0190] In this embodiment, the relative ratio between the first state transition probability and the second state transition probability is characterized by the trained triple classifier and the trained binary classifier, and the correction function between the source domain and the target domain is determined based on this relative ratio.
[0191] The relative ratio between the first state transition probability and the second state transition probability may be implemented as an average ratio, a log density ratio, a mutual information ratio, etc.
[0192] In one example, taking the above relative ratio as the log density ratio, the correction function Δr is shown in Formula 6:
[0193] Formula 6:
[0194] Where T(s′|s,a) represents the first state transition probability corresponding to the source domain, T′(s′|s,a) represents the second state transition probability corresponding to the target domain, q_sas(s, a, s′) represents the trained triple classifier, and q_sa(s, a) represents the trained binary classifier.
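For orientation, the way two domain classifiers can characterize a log density ratio of transition probabilities follows from Bayes' rule. The derivation below is a sketch under assumptions: the conditional notation $q(\text{target} \mid \cdot)$ and $q(\text{source} \mid \cdot)$ is introduced here, and the sign is chosen so that the correction is positive when the target-domain transition probability is the larger one.

```latex
\begin{aligned}
\frac{p(\text{target} \mid s, a, s')}{p(\text{source} \mid s, a, s')}
  &= \frac{T'(s' \mid s, a)\, p(s, a \mid \text{target})\, p(\text{target})}
          {T(s' \mid s, a)\, p(s, a \mid \text{source})\, p(\text{source})},
\qquad
\frac{p(\text{target} \mid s, a)}{p(\text{source} \mid s, a)}
   = \frac{p(s, a \mid \text{target})\, p(\text{target})}
          {p(s, a \mid \text{source})\, p(\text{source})} \\[4pt]
\Delta r(s, a, s')
  &= \log \frac{T'(s' \mid s, a)}{T(s' \mid s, a)} \\
  &\approx \log q_{sas}(\text{target} \mid s, a, s') - \log q_{sas}(\text{source} \mid s, a, s')
     - \log q_{sa}(\text{target} \mid s, a) + \log q_{sa}(\text{source} \mid s, a)
\end{aligned}
```

Dividing the first ratio by the second cancels the state-action and prior terms, leaving only the ratio of transition probabilities.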
[0195] Illustratively, when the correction function is implemented as the log density ratio between the first state transition probability and the second state transition probability, the correction function can apply a positive reward when the second state transition probability is greater than the first state transition probability, and apply a negative reward (i.e., a penalty) when the second state transition probability is less than the first state transition probability, thereby adapting to the dynamic differences between the source and target domains through the correction function.
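The sign behavior described above can be checked with a small sketch. It assumes that each trained classifier returns the probability that its input comes from the target domain; the function name `correction` and the clamping constant `eps` are hypothetical.

```python
import math


def correction(p_target_sas, p_target_sa, eps=1e-8):
    """Log density ratio estimate: logit of the triple classifier output
    minus logit of the binary classifier output.

    p_target_sas: assumed output of the triple classifier for (s, a, s').
    p_target_sa:  assumed output of the binary classifier for (s, a).
    """
    def logit(p):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        return math.log(p) - math.log(1.0 - p)

    return logit(p_target_sas) - logit(p_target_sa)


bonus = correction(0.8, 0.5)    # transition looks more like the target domain
penalty = correction(0.2, 0.5)  # transition looks more like the source domain
```

When both classifiers return the same probability the estimate is zero, meaning the transition is judged equally likely under both sets of dynamics.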
[0196] Step 570: Add the correction function and the second reward function to obtain the first reward function.
[0197] The second reward function is the reward function corresponding to the source domain policy simulator. Illustratively, the second reward function corresponding to the source domain policy simulator is obtained, and then combined with the correction function to obtain the first reward function.
[0198] In this embodiment of the application, the second reward function is added to the correction function to obtain the first reward function (as shown in Formula 3).
[0199] Step 580: Replace the second reward function in the source domain policy simulator with the first reward function to obtain the target domain policy simulator.
[0200] In this embodiment of the application, the target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching policy in the target domain.
[0201] That is, the original second reward function of the source domain policy simulator is replaced by the obtained first reward function, so that the source domain policy simulator with the adjusted reward function serves as the target domain policy simulator in the process of generating the power dispatching policy for the target domain.
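Steps 570 and 580 can be sketched as a reward wrapper followed by a swap. The `PolicySimulator` class, its `reward_fn` attribute, and the `weight` parameter are hypothetical names used only for illustration; a weight of 1.0 corresponds to the plain sum.

```python
class PolicySimulator:
    """Hypothetical container: a simulator that exposes its reward function."""

    def __init__(self, reward_fn):
        self.reward_fn = reward_fn


def make_first_reward(second_reward, correction_fn, weight=1.0):
    """First reward = second reward + weight * correction."""
    def first_reward(s, a, s_next):
        return second_reward(s, a, s_next) + weight * correction_fn(s, a, s_next)
    return first_reward


def to_target_simulator(source_sim, correction_fn, weight=1.0):
    """Return a new simulator whose reward function is the first reward function."""
    first_reward = make_first_reward(source_sim.reward_fn, correction_fn, weight)
    return PolicySimulator(first_reward)


# Toy example (assumption): the reward is a negative generation cost and the
# correction is a constant penalty.
source_sim = PolicySimulator(lambda s, a, s_next: -2.0 * a)
target_sim = to_target_simulator(source_sim, lambda s, a, s_next: -0.5)
```

Returning a new object leaves the source domain policy simulator untouched, which is a design choice of this sketch rather than something the embodiments require.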
[0202] In some embodiments, the target domain policy simulator applies reinforcement learning algorithms to generate power dispatch policies for the target domain. Optionally, the reinforcement learning algorithms include at least one of the following: maximum entropy reinforcement learning algorithms (e.g., the Soft Q-Learning algorithm, the Soft Actor-Critic (SAC) algorithm, etc.), policy gradient algorithms, Trust Region Policy Optimization (TRPO) algorithms, Proximal Policy Optimization (PPO) algorithms, etc.; the embodiments of this application do not limit the specific algorithm used.
[0203] In some embodiments, a fifth environmental state corresponding to the target domain is obtained; the fifth environmental state is input into the target domain policy simulator, and a third scheduling action is predicted based on the feedback that the first reward function provides for scheduling actions; a power dispatching strategy in the target domain is generated based on the third scheduling action.
[0204] That is, the fifth environmental state (operating parameters) of the real power system is input into the target domain policy simulator. The target domain policy simulator selects at least one third scheduling action from the candidate scheduling actions in the action space through the first reward function. The at least one third scheduling action is the power scheduling strategy of the target domain policy simulator for the real power system under the fifth environmental state (operating parameters).
[0205] In some embodiments, the fifth environmental state described above can be implemented as the environmental state of the real power system at the current moment. The "current moment" indicates the time when the real power system has a scheduling need, for example, the moment when the terminal / server controlling the real power system receives a scheduling request.
[0206] Optionally, the above power dispatching strategy can be implemented through a single dispatching action or through a combination of multiple dispatching actions, without any limitation.
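As a deliberately simplified illustration of selecting a third scheduling action from candidate actions through the first reward function, the sketch below uses a one-step greedy choice. This is an assumption made for brevity: the embodiments describe reinforcement learning algorithms, and a trained policy would take the place of the greedy rule. The names `select_scheduling_action` and `predict_next_state`, and the toy reward, are hypothetical.

```python
def select_scheduling_action(state, candidate_actions, predict_next_state, first_reward):
    """Pick the candidate action with the highest first-reward value (one-step greedy)."""
    return max(
        candidate_actions,
        key=lambda a: first_reward(state, a, predict_next_state(state, a)),
    )


# Toy example (assumption): the state is a net load in MW and the action is
# a storage discharge in MW.
def predict_next_state(load, discharge):
    return load - discharge


def first_reward(load, discharge, next_load):
    # Penalize residual load and, more lightly, use of the storage device.
    return -abs(next_load) - 0.1 * discharge


fifth_state = 3.0
third_action = select_scheduling_action(
    fifth_state, [0.0, 1.0, 2.0, 3.0, 4.0], predict_next_state, first_reward
)
```

A dispatching strategy made of several actions could be obtained by repeating the selection on the predicted next state.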
[0207] Illustratively, refer to Figure 6, which shows a migration diagram of a policy simulator provided in an exemplary embodiment of this application. Taking the source domain as a simulated power system 610 and the target domain as a real power system 620 as an example, the simulated power system 610 corresponds to a policy simulator 611, which corresponds to a second reward function 612. First data 613 is obtained through the policy simulator 611, and historical data recorded by the real power system 620 in the historical time period is obtained as second data 621.
[0208] The triple classifier and the binary classifier 601 are trained using the first data 613 and the second data 621 to obtain the trained triple classifier and the binary classifier 602. The correction function 631 is calculated using the trained triple classifier and the binary classifier 602. The second reward function 612 is corrected using the correction function 631 to obtain the first reward function 622. The policy simulator 611 is updated using the first reward function 622 to obtain the updated policy simulator 614.
[0209] The real power system 620 inputs the current environmental state into the updated policy simulator 614, and generates a power dispatch policy 623 through the updated policy simulator 614. The real power system 620 then completes power dispatch through the power dispatch policy 623.
[0210] In summary, a first reward function is obtained by using the first data provided by the source domain and the second data generated by the target domain within a historical time period to provide reinforcement learning feedback during the generation of power dispatching strategies. This first reward function is then applied to the source domain policy simulator corresponding to the source domain, resulting in a target domain policy simulator applicable to the target domain. The power dispatching strategy for the target domain is generated using the target domain policy simulator. In other words, the reward function corresponding to the source domain policy simulator is adjusted using the historical data provided by the target domain. Under the influence of the historical data provided by the target domain, the adjusted reward function can adapt to the dynamic differences between the source and target domains, thus enabling the adjusted policy simulator to adapt to the target domain. This ensures the effectiveness and security of the generated power dispatching strategy applied to the target domain without requiring extensive real-time training in the target domain.
[0211] In this embodiment, a triple classifier and a binary classifier are trained using first data from the source domain and second data from the target domain. A correction function is obtained using the trained triple and binary classifiers. The correction function is used to adjust the second reward function of the source domain policy simulator. Specifically, based on the second reward function, the correction function applies a positive reward when the second state transition probability of the target domain is greater than the first state transition probability of the source domain, and applies a negative reward (i.e., a penalty) when the second state transition probability of the target domain is less than the first state transition probability of the source domain. This adapts to the dynamic differences between the source and target domains, thereby enabling the source domain policy simulator to be transferred to the target domain by fine-tuning the reward function. This approach is well-suited for scenarios where real-time data acquisition costs are high in the target domain.
[0212] It should be noted that this application may display prompt interfaces, pop-ups, or output voice prompts before and during the collection of user data. These prompt interfaces, pop-ups, or voice prompts are used to inform the user that their data is being collected. This ensures that the application only begins the steps for collecting user data after receiving confirmation from the user regarding the prompt interface or pop-up; otherwise (i.e., without user confirmation), the steps for collecting user data end, meaning no user data is collected. In other words, all user data collected in this application is collected with the user's consent and authorization, and the collection, use, and processing of related user data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
[0213] Please refer to Figure 7, which shows a structural block diagram of a strategy simulator acquisition device provided in an exemplary embodiment of this application. The device includes the following modules:
[0214] The acquisition module 710 is used to acquire the source domain policy simulator, which is used to analyze the environmental state of the source domain and generate a power dispatching policy in the source domain.
[0215] The acquisition module 710 is further configured to acquire first data in the source domain and second data of the target domain within a historical time period, wherein the first data is used to indicate the environmental state and scheduling actions of the source domain and the second data is used to indicate the environmental state and scheduling actions of the target domain.
[0216] The correction module 720 is used to obtain a first reward function based on the first data and the second data. The first reward function is used to provide feedback for the predicted scheduling actions in the process of generating the power dispatch strategy.
[0217] The adjustment module 730 is used to adjust the source domain policy simulator based on the first reward function to obtain the target domain policy simulator. The target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching strategy in the target domain.
[0218] In some optional embodiments, the source domain policy simulator includes a second reward function;
[0219] The adjustment module 730 is further configured to replace the second reward function in the source domain policy simulator with the first reward function to obtain the target domain policy simulator.
[0220] In some optional embodiments, the correction module 720 is further configured to obtain a correction function between the source domain and the target domain based on the first data and the second data. The correction function is used to express the difference between a first state transition probability corresponding to the source domain and a second state transition probability corresponding to the target domain. The first state transition probability is used to indicate the probability of environmental state change in the source domain under the influence of scheduling actions, and the second state transition probability is used to indicate the probability of environmental state change in the target domain under the influence of scheduling actions.
[0221] The correction module 720 is further configured to combine the second reward function with the correction function to obtain the first reward function.
[0222] In some optional embodiments, the correction module 720 is further configured to use a weighted sum of the second reward function and the correction function as the first reward function.
[0223] In some optional embodiments, the acquisition module 710 is further configured to acquire a classifier, the classifier being configured to identify the data source to which the input data belongs, the data source including the source domain or the target domain;
[0224] As shown in Figure 8, the correction module 720 further includes:
[0225] Training unit 721 is used to train the classifier based on the first data and the second data to obtain the trained classifier;
[0226] The determining unit 722 is used to obtain the correction function between the source domain and the target domain based on the trained classifier.
[0227] In some optional embodiments, the correction module 720 further includes:
[0228] Labeling unit 723 is used to label the first data with a first label and to label the second data with a second label; the first label is used to indicate that the data comes from the source domain and the second label is used to indicate that the data comes from the target domain.
[0229] The training unit 721 is further configured to input the first data into the classifier to obtain a first classification result, and to input the second data into the classifier to obtain a second classification result; the first classification result is used to indicate the data source of the predicted first data, and the second classification result is used to indicate the data source of the predicted second data;
[0230] The training unit 721 is further configured to train the classifier based on the difference between the first classification result and the first label, and the difference between the second classification result and the second label, to obtain the trained classifier.
[0231] In some optional embodiments, the first data includes a first environmental state, a first scheduling action, and a second environmental state, wherein the second environmental state is the environmental state reached after the first environmental state transitions under the influence of the first scheduling action; the second data includes a third environmental state, a second scheduling action, and a fourth environmental state, wherein the fourth environmental state is the environmental state reached after the third environmental state transitions under the influence of the second scheduling action; the classifier includes a triple classifier and a binary classifier.
[0232] The training unit 721 is further configured to input the first data into the triple classifier to obtain a first sub-classification result, and to input the second data into the triple classifier to obtain a second sub-classification result; the first sub-classification result is used to indicate the data source of the predicted first data, and the second sub-classification result is used to indicate the data source of the predicted second data;
[0233] The training unit 721 is further configured to input the first environmental state and the first scheduling action into the binary classifier to obtain a third sub-classification result, and to input the third environmental state and the second scheduling action into the binary classifier to obtain a fourth sub-classification result; the third sub-classification result is used to indicate the data source of the predicted first environmental state and the first scheduling action, and the fourth sub-classification result is used to indicate the data source of the predicted third environmental state and the second scheduling action.
[0234] The training unit 721 is further configured to train the triple classifier based on the difference between the first sub-classification result and the first label, and the difference between the second sub-classification result and the second label, to obtain the trained triple classifier.
[0235] The training unit 721 is further configured to train the binary classifier based on the difference between the third sub-classification result and the first label, and the difference between the fourth sub-classification result and the second label, to obtain the trained binary classifier, wherein the trained classifier includes the trained triple classifier and the trained binary classifier.
[0236] In some optional embodiments, the determining unit 722 is further configured to characterize the relative ratio between the first state transition probability and the second state transition probability through the trained classifier;
[0237] The determining unit 722 is further configured to determine the correction function between the source domain and the target domain based on the relative ratio.
[0238] In some optional embodiments, the acquisition module 710 further includes:
[0239] The acquisition unit 711 is used to acquire a binary tuple formed by the first environment state and the first scheduling action;
[0240] The generation unit 712 is used to input the binary tuple into the source domain policy simulator to obtain the second environment state corresponding to the source domain under the influence of the first scheduling action.
[0241] The acquisition unit 711 is further configured to acquire the first data composed of the first environment state, the first scheduling action, and the second environment state.
[0242] In some optional embodiments, the acquisition unit 711 is further configured to acquire the state space and action space corresponding to the source domain, wherein the state space is used to indicate the range of environmental state changes in the source domain, and the action space is used to indicate the range of scheduling actions in the source domain.
[0243] The acquisition unit 711 is further configured to sample the first environmental state in the state space and sample the first scheduling action in the action space.
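The acquisition of the first data described for units 711 and 712 can be sketched as follows. Uniform sampling, the interval representation of the state space and action space, and the names `collect_first_data` and `simulate_step` are assumptions of this sketch; the embodiments do not prescribe a sampling distribution.

```python
import random


def collect_first_data(state_space, action_space, simulate_step, n_samples, seed=0):
    """Sample (state, action) tuples and query the simulator for the next state.

    state_space / action_space: lists of (low, high) bounds, one per dimension.
    simulate_step: stand-in for the source domain policy simulator's transition.
    """
    rng = random.Random(seed)
    first_data = []
    for _ in range(n_samples):
        s = tuple(rng.uniform(lo, hi) for lo, hi in state_space)   # first environmental state
        a = tuple(rng.uniform(lo, hi) for lo, hi in action_space)  # first scheduling action
        s_next = simulate_step(s, a)                               # second environmental state
        first_data.append((s, a, s_next))
    return first_data


# Toy source-domain dynamics (assumption): one battery state of charge in [0, 1].
def simulate_step(state, action):
    (soc,) = state
    (power,) = action
    return (min(1.0, max(0.0, soc + power)),)


first_data = collect_first_data([(0.0, 1.0)], [(-0.2, 0.2)], simulate_step, n_samples=5)
```

Each element of `first_data` is one transition composed of the first environmental state, the first scheduling action, and the second environmental state.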
[0244] In some optional embodiments, the acquisition module 710 is used to acquire the fifth environmental state corresponding to the target domain;
[0245] The device further includes:
[0246] The prediction module 740 is used to input the fifth environmental state into the target domain policy simulator and predict the third scheduling action based on the feedback of the first reward function to the scheduling action.
[0247] The prediction module 740 is also used to generate a power dispatching strategy in the target domain based on the third dispatching action.
[0248] It should be noted that the strategy simulator acquisition device provided in the above embodiments is only an example of the division of the above functional modules. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the strategy simulator acquisition device and the strategy simulator acquisition method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.
[0249] Figure 9 shows a schematic diagram of the structure of a server provided in an exemplary embodiment of this application. Specifically, the server includes the following structure.
[0250] Server 900 includes a Central Processing Unit (CPU) 901, a system memory 904 including Random Access Memory (RAM) 902 and Read Only Memory (ROM) 903, and a system bus 905 connecting the system memory 904 and the CPU 901. Server 900 also includes a mass storage device 906 for storing an operating system 913, application programs 914, and other program modules 915.
[0251] Mass storage device 906 is connected to central processing unit 901 via a mass storage controller (not shown) connected to system bus 905. Mass storage device 906 and its associated computer-readable media provide non-volatile storage for server 900. That is, mass storage device 906 may include computer-readable media (not shown) such as hard disk or compact disc read-only memory (CD-ROM) drives.
[0252] Without loss of generality, computer-readable media can include computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented using any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above-mentioned types. The system memory 904 and mass storage device 906 described above can be collectively referred to as memory.
[0253] According to various embodiments of this application, server 900 can also be connected to a remote computer on a network, such as the Internet. That is, server 900 can be connected to network 912 via network interface unit 911 connected to system bus 905, or it can also use network interface unit 911 to connect to other types of networks or remote computer systems (not shown).
[0254] The aforementioned memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
[0255] Embodiments of this application also provide a computer device including a processor and a memory. The memory stores at least one instruction, at least one program, code set, or instruction set. The processor loads and executes the at least one instruction, at least one program, code set, or instruction set to implement the strategy simulator acquisition method provided in the above-described method embodiments. Optionally, the computer device may be a terminal or a server.
[0256] Embodiments of this application also provide a computer-readable storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the method for obtaining the strategy simulator provided in the above-described method embodiments.
[0257] Embodiments of this application also provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the strategy simulator acquisition method described in any of the above embodiments.
[0258] Optionally, the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid-state drives (SSDs), or optical discs, etc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM). The sequence numbers of the embodiments in this application are merely descriptive and do not represent the superiority or inferiority of the embodiments.
[0259] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0260] The above description is merely an optional embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A method for obtaining a strategy simulator, characterized in that, The method includes: Obtain a source domain policy simulator, which is used to analyze the environmental state of the source domain and generate a power dispatching policy in the source domain; Acquire first data from the source domain and second data from the target domain within a historical time period. The first data is used to indicate the environmental state and scheduling actions of the source domain, and the second data is used to indicate the environmental state and scheduling actions of the target domain. A first reward function is determined based on the first data and the second data. The first reward function is used to provide reinforcement learning feedback for the predicted scheduling actions in the process of generating power dispatching strategies. The source domain policy simulator is adjusted based on the first reward function to obtain the target domain policy simulator. The target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching strategy for the target domain.
2. The method according to claim 1, characterized in that, The source domain policy simulator includes a second reward function; The step of adjusting the source domain policy simulator based on the first reward function to obtain the target domain policy simulator includes: The second reward function in the source domain policy simulator is replaced with the first reward function to obtain the target domain policy simulator.
3. The method according to claim 2, characterized in that, The step of determining the first reward function based on the first data and the second data includes: Based on the first data and the second data, a correction function is obtained between the source domain and the target domain. The correction function is used to express the difference between the first state transition probability corresponding to the source domain and the second state transition probability corresponding to the target domain. The first state transition probability is used to indicate the probability of environmental state change in the source domain under the influence of scheduling action, and the second state transition probability is used to indicate the probability of environmental state change in the target domain under the influence of scheduling action. The second reward function is combined with the correction function to obtain the first reward function.
4. The method according to claim 3, characterized in that, The step of combining the second reward function with the correction function to obtain the first reward function includes: The weighted sum of the second reward function and the correction function is used as the first reward function.
5. The method according to claim 3 or 4, characterized in that, The step of obtaining the correction function between the source domain and the target domain based on the first data and the second data includes: Obtain a classifier, which is used to identify the data source to which the input data belongs, the data source including the source domain or the target domain; The classifier is trained based on the first data and the second data to obtain the trained classifier. The correction function between the source domain and the target domain is obtained based on the trained classifier.
6. The method according to claim 5, characterized in that, The step of training the classifier based on the first data and the second data to obtain the trained classifier includes: The first data is labeled with a first label, and the second data is labeled with a second label; the first label is used to indicate that the data comes from the source domain, and the second label is used to indicate that the data comes from the target domain. The first data is input into the classifier to obtain a first classification result, and the second data is input into the classifier to obtain a second classification result; the first classification result is used to indicate the data source of the predicted first data, and the second classification result is used to indicate the data source of the predicted second data; Based on the difference between the first classification result and the first label, and the difference between the second classification result and the second label, the classifier is trained to obtain the trained classifier.
7. The method according to claim 6, characterized in that the first data includes a first environmental state, a first scheduling action, and a second environmental state, wherein the second environmental state is the environmental state reached after the first environmental state transitions under the influence of the first scheduling action; the second data includes a third environmental state, a second scheduling action, and a fourth environmental state, wherein the fourth environmental state is the environmental state reached after the third environmental state transitions under the influence of the second scheduling action; and the classifier includes a triple classifier and a binary classifier; the step of inputting the first data into the classifier to obtain a first classification result, and inputting the second data into the classifier to obtain a second classification result, includes: inputting the first data into the triple classifier to obtain a first sub-classification result, and inputting the second data into the triple classifier to obtain a second sub-classification result, wherein the first sub-classification result indicates the predicted data source of the first data and the second sub-classification result indicates the predicted data source of the second data; and inputting the first environmental state and the first scheduling action into the binary classifier to obtain a third sub-classification result, and inputting the third environmental state and the second scheduling action into the binary classifier to obtain a fourth sub-classification result, wherein the third sub-classification result indicates the predicted data source of the first environmental state and the first scheduling action, and the fourth sub-classification result indicates the predicted data source of the third environmental state and the second scheduling action; and the step of training the classifier based on the difference between the first classification result and the first label, and the difference between the second classification result and the second label, to obtain the trained classifier, includes: training the triple classifier based on the difference between the first sub-classification result and the first label and the difference between the second sub-classification result and the second label, to obtain a trained triple classifier; and training the binary classifier based on the difference between the third sub-classification result and the first label and the difference between the fourth sub-classification result and the second label, to obtain a trained binary classifier, wherein the trained classifier includes the trained triple classifier and the trained binary classifier.
8. The method according to claim 5, characterized in that the step of obtaining the correction function between the source domain and the target domain based on the trained classifier includes: using the trained classifier to represent the relative ratio between the first state transition probability and the second state transition probability; and determining the correction function between the source domain and the target domain based on the relative ratio.
9. The method according to any one of claims 1 to 4, characterized in that the step of obtaining the first data in the source domain includes: obtaining a binary tuple formed by a first environmental state and a first scheduling action; inputting the binary tuple into the source domain policy simulator to obtain a second environmental state, which is the environmental state of the source domain after the influence of the first scheduling action; and obtaining the first data composed of the first environmental state, the first scheduling action, and the second environmental state.
10. The method according to claim 9, characterized in that the step of obtaining the binary tuple formed by the first environmental state and the first scheduling action includes: obtaining a state space and an action space corresponding to the source domain, wherein the state space indicates the range over which the environmental state of the source domain varies and the action space indicates the range of scheduling actions in the source domain; and sampling the first environmental state from the state space and the first scheduling action from the action space.
11. The method according to any one of claims 1 to 4, characterized in that the method further includes: obtaining a fifth environmental state corresponding to the target domain; inputting the fifth environmental state into the target domain policy simulator, and predicting a third scheduling action based on the feedback that the first reward function provides on scheduling actions; and generating a power dispatching strategy for the target domain based on the third scheduling action.
12. A device for acquiring a policy simulator, characterized in that the device includes: an acquisition module, used to acquire a source domain policy simulator, wherein the source domain policy simulator is used to analyze the environmental state of the source domain and generate a power dispatching strategy in the source domain; the acquisition module being further used to acquire first data in the source domain and second data of the target domain within a historical time period, wherein the first data indicates the environmental state and scheduling actions of the source domain and the second data indicates the environmental state and scheduling actions of the target domain; a correction module, used to obtain a first reward function based on the first data and the second data, wherein the first reward function is used to provide feedback on predicted scheduling actions in the process of generating a power dispatching strategy; and an adjustment module, used to adjust the source domain policy simulator based on the first reward function to obtain a target domain policy simulator, wherein the target domain policy simulator is used to analyze the environmental state of the target domain and generate a power dispatching strategy in the target domain.
13. A computer device, characterized in that the computer device includes a processor and a memory, the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the method for acquiring a policy simulator according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor to implement the method for acquiring a policy simulator according to any one of claims 1 to 11.
15. A computer program product, characterized in that it includes a computer program or instructions that, when executed by a processor, implement the method for acquiring a policy simulator according to any one of claims 1 to 11.
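
Illustrative sketch (not part of the claims): the classifier-based correction of claims 4 to 10 could be realized roughly as follows. Everything concrete here is an assumption made for illustration: the one-dimensional toy dynamics, the helper names (source_step, target_step, sample_pairs, train_classifier, log_odds, correction, first_reward), the use of logistic regression, and in particular the log-ratio form of the correction, which the claims describe only as being "determined based on the relative ratio". Label 0 stands for the first label (source domain) and label 1 for the second label (target domain).

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.5  # assumed transition noise (toy value)

def source_step(s, a):
    """Hypothetical source-domain simulator: next state = s + a + noise."""
    return s + a + NOISE * rng.standard_normal(s.shape)

def target_step(s, a):
    """Toy stand-in for target-domain history: same dynamics plus an offset of 1."""
    return s + a + 1.0 + NOISE * rng.standard_normal(s.shape)

def sample_pairs(n):
    """Sample (state, action) pairs uniformly from toy state and action spaces."""
    return rng.uniform(-1, 1, (n, 1)), rng.uniform(-1, 1, (n, 1))

def add_bias(x):
    return np.hstack([x, np.ones((len(x), 1))])

def train_classifier(x, y, lr=0.5, steps=2000):
    """Logistic regression by gradient descent; label 0 = source, 1 = target."""
    xb = add_bias(x)
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w

def log_odds(w, x):
    """log p(target | x) - log p(source | x) under the trained classifier."""
    return add_bias(x) @ w

def correction(w_triple, w_pair, s, a, s_next):
    """Assumed log-ratio correction: triple log-odds minus pair log-odds."""
    triple = np.hstack([s, a, s_next])
    pair = np.hstack([s, a])
    return log_odds(w_triple, triple) - log_odds(w_pair, pair)

def first_reward(second_reward, delta, weight=1.0):
    """Weighted sum of the original (second) reward and the correction."""
    return second_reward + weight * delta

# First data: sampled (state, action) pairs pushed through the source simulator.
s1, a1 = sample_pairs(2000)
s1_next = source_step(s1, a1)
# Second data: toy substitute for logged target-domain history.
s2, a2 = sample_pairs(2000)
s2_next = target_step(s2, a2)

labels = np.concatenate([np.zeros(2000), np.ones(2000)])
triples = np.vstack([np.hstack([s1, a1, s1_next]), np.hstack([s2, a2, s2_next])])
pairs = np.vstack([np.hstack([s1, a1]), np.hstack([s2, a2])])
w_triple = train_classifier(triples, labels)  # classifier over (s, a, s')
w_pair = train_classifier(pairs, labels)      # classifier over (s, a)

zero = np.zeros((1, 1))
delta_source_like = correction(w_triple, w_pair, zero, zero, zero)[0]        # s' = s + a
delta_target_like = correction(w_triple, w_pair, zero, zero, zero + 1.0)[0]  # s' = s + a + 1
```

Under these assumptions, a transition that looks typical of the target domain receives a positive correction and one that looks typical only of the source domain receives a negative one, so the weighted sum steers the adjusted simulator toward behavior that remains plausible in the target domain. Subtracting the pair classifier's log-odds removes any difference in how states and actions were sampled, leaving only the difference in transition dynamics.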