An integrated scheduling and decision-making method for end-to-end multi-energy coupled systems in seaports
By adopting an end-to-end integrated scheduling and decision-making method for multi-energy coupling systems in seaports, and utilizing deep reinforcement learning algorithms to achieve deep coupling between logistics and energy, the problem of the separation between port logistics and energy scheduling has been solved, operating costs have been reduced, and the level of port intelligence has been improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HOHAI UNIV
- Filing Date
- 2026-04-29
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, port logistics and energy dispatch are disconnected, making it impossible to achieve overall cost optimization by leveraging the flexibility of operational rhythms.
We construct a deep reinforcement learning algorithm based on deep deterministic policy gradients, and combine it with dynamic task models and energy asset models. Through an end-to-end integrated scheduling decision-making method for multi-energy coupled seaport systems, we achieve deep coupling and global optimization of logistics operations and energy scheduling.
By enabling intelligent agents to autonomously decide on crane operating rates, production loads can be shifted to lower-cost periods, significantly reducing port operating costs, improving economic efficiency, and enhancing the port's level of intelligence.
Smart Images

Figure CN122114565B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of smart port technology, and in particular relates to an integrated scheduling and decision-making method for an end-to-end multi-energy coupled system of a seaport. Background Technology
[0002] With the advancement of global "dual carbon" goals and green shipping initiatives, ports, as key nodes in international logistics and major energy consumers, are accelerating their transformation towards green and intelligent operations. Electrification equipment such as electric cranes, automated guided vehicles (AGVs), and shore power systems for ships are becoming increasingly widespread. The coupling and integration of multiple energy forms, including distributed renewable energy, energy storage systems, and heat pumps, is gradually transforming modern ports into complex integrated energy systems. Currently, ports generally adopt a separate scheduling approach of "logistics first, energy second": the logistics department independently formulates operational plans based on ship arrival schedules and cargo throughput, determining equipment start-up, shutdown, and operating hours; the energy management department then passively performs economic scheduling, using the resulting electricity load curve as a rigid boundary.
[0003] However, current methods suffer from the problem of disconnect between logistics and energy scheduling, and the inability to leverage operational flexibility to achieve overall cost optimization. Summary of the Invention
[0004] This application provides an integrated scheduling and decision-making method, device, terminal equipment, and storage medium for an end-to-end multi-energy coupling system in a seaport. This method can solve the problems of current methods where logistics and energy scheduling are separated and global cost optimization cannot be achieved by utilizing the flexibility of operational rhythm.
[0005] In a first aspect, embodiments of this application provide an end-to-end integrated scheduling decision-making method for a multi-energy coupled port system, comprising: S1, constructing a port collaborative scheduling simulation environment including a dynamic task model and an energy asset model, for defining the system's state space, action space, and composite reward function; S2, based on the state space, action space, and composite reward function, constructing a deep reinforcement learning algorithm based on deep deterministic policy gradient, wherein the deep reinforcement learning algorithm includes an actor network and a critic network; S3, at each time step, interacting the deep reinforcement learning algorithm with the port collaborative scheduling simulation environment to generate an experience tuple, and storing each experience tuple in an experience replay pool; S4, sampling experience data from the experience replay pool, iteratively updating the actor network and critic network until a preset maximum number of iterations is reached, generating the final collaborative scheduling strategy.
[0006] In one possible implementation of the first aspect, S1 above involves constructing a port collaborative scheduling simulation environment that includes a dynamic task model and an energy asset model, used to define the system's state space, action space, and composite reward function, as follows:
[0007] S101. Construct the state space S, as follows:
[0008] (1)
[0009] In the formula, For the first The state vector for each decision time step; The current moment; For the first The state of charge of the port energy storage system at all times; For the first The electricity price on the grid at any given time; For the generator in the first Efforts made at all times; For the first The remaining workload of the dockside cranes at any given time; For the first The remaining workload of the yard cranes at all times; The deadline for the task. This represents the remaining time until the task deadline.
[0010] S102, Constructing a Joint Action Space , means as follows:
[0011] (2)
[0012] In the formula, Let be the joint action vector output by the agent at the t-th decision time step; For the first The generator output control command at all times; For the first Charge and discharge power control commands for the energy storage system at all times; For the first Time-bound control commands for the working speed of the dock cranes; For the first Constant-time operating rate control commands for yard cranes;
[0013] S103. Construct a dynamic task model within the simulation environment to calculate the dynamic power load generated by the operational actions, as detailed below:
[0014] (3)
[0015] (4)
[0016] (5)
[0017] In the formula, , The first The actual amount of work completed by dock cranes and yard cranes at any given time; These refer to the number of dock cranes and the number of yard cranes, respectively. These refer to the maximum operating rate of a single dock crane and the maximum operating rate of a single yard crane, respectively. These are the energy consumption per operation of a dock crane and the energy consumption per operation of a yard crane, respectively. For the first Dynamic power load at any given time;
[0018] S104. Construct an energy asset model within the simulation environment, as detailed below:
[0019] (6)
[0020] (7)
[0021] In the formula, For the first Power purchased from the external power grid at all times; For the first A fixed load at any given time; For the first The actual output of the gas turbine at any given moment; For the first The actual output of the energy storage system at any given time; This represents the state of charge of the port energy storage system at time t+1. Battery capacity; The charge / discharge efficiency coefficient;
[0022] S105. Constructing a composite reward function:
[0023] (8)
[0024] (9)
[0025] (10)
[0026] (11)
[0027] In the formula, For the first The reward value at any given moment; For the first The cost of purchasing electricity at any given moment; For the first Real-time gas turbine operating costs; This represents the secondary cost coefficient of the gas turbine. This represents the primary cost coefficient of the gas turbine. This represents the fixed cost coefficient for the gas turbine. This is a penalty for task failure. This is a preset constant for the huge fine.
[0028] Optionally, in another possible implementation of the first aspect, the aforementioned S2, based on the state space, action space, and composite reward function, constructs a deep reinforcement learning algorithm based on deep deterministic policy gradients, as follows:
[0029] S201. Constructing the actor network and action mapping for deep reinforcement learning algorithms:
[0030] (12)
[0031] (13)
[0032] (14)
[0033] (15)
[0034] (16)
[0035] (17)
[0036] In the formula, For actor network parameters; Represents the deterministic policy function of the actor network; These are the original motion vectors of the actor network; This is a normalized action vector that is restricted to the interval [-1,1] after being processed by the hyperbolic tangent activation function; For exploring noise at the current moment; formulas (14) to (17) represent the specific methods for mapping normalized actions to the actual physical control space, where, From maximum output and minimum effort constraint; Maximum charge and discharge power Constraints, where a positive value represents discharging and a negative value represents charging; and The normalized space [-1,1] is linearly mapped to the physical job space [0,1].
[0037] S202, Constructing a critic network for deep reinforcement learning algorithms:
[0038] (18)
[0039] In the formula, The expected long-term return for taking action in the current state; This is the fusion state vector input to the critic network; The joint action vector input to the critic network; For the network parameters of the commentator.
[0040] Optionally, in another possible implementation of the first aspect, S3 above interacts with the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment at each time step to generate an experience tuple, and stores each experience tuple in the experience replay pool, as follows:
[0041] (19)
[0042] (20)
[0043] In the formula, For experience replay pool; For the first The empirical tuple at time step (i.e., the state vector) Joint action vector Reward Value Decision-making time steps state vector and end flag .
[0044] Optionally, in another possible implementation of the first aspect, S4 above involves iteratively updating the actor network and critic network by sampling experience data from the experience replay pool until a preset maximum number of iterations is reached, generating the final cooperative scheduling strategy, as follows:
[0045] S401, from the experience replay pool Randomly sample a batch of data ;
[0046] S402. Calculate the target motion vector and target Q value at time t+1, as follows:
[0047] (twenty one)
[0048] (twenty two)
[0049] In the formula, For the first The target action vector at any given time; and These are the target actor network and the target critic network, respectively. and These are the target network parameters corresponding to the target actor network and the target critic network, respectively. For the first The target actor's network outputs gas turbine control commands at all times; The target value for temporal difference is used to guide the reviewer network updates; Discount factor; For the target critic network based on the first The expected future value evaluated from the time-state vector and the target action vector; For the first The target actor's network outputs the energy storage system's charging and discharging control commands at all times; For the first The dock crane operation rate command output by the target actor network at any time; For the first The target actor network outputs the yard crane operation rate command;
[0050] S403. Construct the loss function for the critic network and update the critic network parameters using gradient descent. The details are as follows:
[0051] (twenty three)
[0052] (twenty four)
[0053] In the formula, For the loss function of the critic network; The learning rate of the critic network;
[0054] S404. Calculate the policy gradient of the actor network and update the actor network parameters through gradient ascent. The details are as follows:
[0055] (25)
[0056] (26)
[0057] In the formula, Let be the performance objective function of the actor network, i.e., the expected cumulative reward; is a general symbol for action variables; k is the index set identifier for each specific action dimension in the joint action space; gt is the identifier for gas turbine output control; ess is the identifier for energy storage system charge and discharge control; qc is the identifier for dock crane operation rate control; yc is the identifier for yard crane operation rate control. For the first Specific action control quantities in each dimension;
[0058] S405. Perform a soft update on the target network parameters, as follows:
[0059] (27)
[0060] (28)
[0061] In the formula, This is a soft update coefficient;
[0062] S406. Repeat steps S401-S405 until the preset maximum number of iterations is reached, generating the final collaborative scheduling strategy.
[0063] Secondly, embodiments of this application provide an integrated scheduling and decision-making device for an end-to-end multi-energy coupled port system, comprising: a first construction module for constructing a port collaborative scheduling simulation environment including a dynamic task model and an energy asset model, for defining the system's state space, action space, and composite reward function; a second construction module for constructing a deep reinforcement learning algorithm based on the state space, action space, and composite reward function, the deep reinforcement learning algorithm including an actor network and a critic network; a generation module for interacting the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment at each time step to generate an experience tuple, and storing each experience tuple in an experience replay pool; and an update generation module for iteratively updating the actor network and critic network by sampling experience data from the experience replay pool until a preset maximum number of iterations is reached, thereby generating the final collaborative scheduling strategy.
[0064] Thirdly, embodiments of this application provide a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the aforementioned integrated scheduling decision method for an end-to-end multi-energy coupling system of a seaport.
[0065] Fourthly, embodiments of this application provide a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the aforementioned integrated scheduling and decision-making method for an end-to-end multi-energy coupling system of a seaport.
[0066] Beneficial Effects: This application offers the following beneficial effects: First, by incorporating port operation rhythm as an endogenous decision variable into an end-to-end reinforcement learning framework, it breaks the traditional "logistics first, energy later" separate scheduling model, achieving deep coupling and global optimization of logistics operations and energy scheduling. The agent can autonomously decide the crane operating rate based on real-time electricity price signals, increasing output during low-price periods and actively slowing down during high-price periods, thereby shifting production load to low-price periods and achieving "peak shaving and valley filling" at the production planning level. Experimental results show that compared to traditional separate scheduling, this application can further reduce operating costs and significantly improve port economic efficiency. Second, this application adopts a model-free deep reinforcement learning method, which does not rely on precise mathematical modeling. The agent learns through trial and error with the simulation environment, enabling it to adapt to complex operating conditions such as electricity price fluctuations and random task changes, exhibiting good robustness and generalization ability. Finally, the trained end-to-end agent can be directly deployed in the port control system, instantly outputting collaborative scheduling instructions based on the current state to meet real-time decision-making needs, greatly improving the port's intelligence level and market response flexibility. Attached Figure Description
[0067] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0068] Figure 1 This is a flowchart illustrating an integrated scheduling and decision-making method for an end-to-end multi-energy coupling system of a seaport, provided in one embodiment of this application.
[0069] Figure 2 This is a schematic diagram of the structure of the terminal device provided in the embodiments of this application;
[0070] Figure 3 This is a schematic diagram of the structure of an integrated scheduling and decision-making device for a multi-energy coupling system of a seaport based on an end-to-end approach, provided in one embodiment of this application. Detailed Implementation
[0071] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.
[0072] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integrals, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components and / or a collection thereof.
[0073] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0074] As used in this application specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if detected [the described condition or event]" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once detected [the described condition or event]," or "in response to detection [the described condition or event]."
[0075] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0076] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0077] The following description, with reference to the accompanying drawings, details an integrated scheduling and decision-making method, apparatus, terminal equipment, and storage medium for an end-to-end multi-energy coupling system in a seaport, as provided in this application.
[0078] Figure 1 The illustration shows a flowchart of an integrated scheduling decision-making method for a multi-energy coupling system of a seaport based on an end-to-end approach, as provided in an embodiment of this application.
[0079] like Figure 1 As shown, the integrated scheduling decision-making method for an end-to-end multi-energy coupled seaport system includes the following steps:
[0080] S1. Construct a port collaborative scheduling simulation environment that includes a dynamic task model and an energy asset model to define the system's state space, action space, and composite reward function;
[0081] Furthermore, in this embodiment of the application, step S1 includes:
[0082] S101. Construct the state space S, as follows:
[0083] (1)
[0084] In the formula, For the first The state vector for each decision time step; The current moment; For the first The state of charge of the port energy storage system at all times; For the first The electricity price on the grid at any given time; For the generator in the first Efforts made at all times; For the first The remaining workload of the dockside cranes at any given time; For the first The remaining workload of the yard cranes at all times; The deadline for the task. This represents the remaining time until the task deadline.
[0085] S102, Constructing a Joint Action Space , means as follows:
[0086] (2)
[0087] In the formula, Let be the joint action vector output by the agent at the t-th decision time step; For the first The generator output control command at all times; For the first Charge and discharge power control commands for the energy storage system at all times; For the first Time-bound control commands for the working speed of the dock cranes; For the first Constant-time operating rate control commands for yard cranes;
[0088] S103. Construct a dynamic task model within the simulation environment to calculate the dynamic power load generated by the operational actions, as detailed below:
[0089] (3)
[0090] (4)
[0091] (5)
[0092] In the formula, , The first The actual amount of work completed by dock cranes and yard cranes at any given time; These refer to the number of dock cranes and the number of yard cranes, respectively. These refer to the maximum operating rate of a single dock crane and the maximum operating rate of a single yard crane, respectively. These are the energy consumption per operation of a dock crane and the energy consumption per operation of a yard crane, respectively. For the first Dynamic power load at any given time;
[0093] S104. Construct an energy asset model within the simulation environment, as detailed below:
[0094] (6)
[0095] (7)
[0096] In the formula, For the first Power purchased from the external power grid at all times; For the first A fixed load at any given time; For the first The actual output of the gas turbine at any given moment; For the first The actual output of the energy storage system at any given time; This represents the state of charge of the port energy storage system at time t+1. Battery capacity; The charge / discharge efficiency coefficient;
[0097] S105. Constructing a composite reward function:
[0098] (8)
[0099] (9)
[0100] (10)
[0101] (11)
[0102] In the formula, For the first The reward value at any given moment; For the first The cost of purchasing electricity at any given moment; For the first Real-time gas turbine operating costs; This represents the secondary cost coefficient of the gas turbine. This represents the primary cost coefficient of the gas turbine. This represents the fixed cost coefficient for the gas turbine. This is a penalty for task failure. This is a preset constant for the huge fine.
[0103] S2. Based on the state space, action space, and composite reward function, a deep reinforcement learning algorithm based on deep deterministic policy gradient is constructed. The deep reinforcement learning algorithm includes an actor network and a critic network.
[0104] Furthermore, in this embodiment of the application, step S2 includes:
[0105] S201. Constructing the actor network and action mapping for deep reinforcement learning algorithms:
[0106] (12)
[0107] (13)
[0108] (14)
[0109] (15)
[0110] (16)
[0111] (17)
[0112] In the formula, For actor network parameters; Represents the deterministic policy function of the actor network; These are the original motion vectors of the actor network; This is a normalized action vector that is restricted to the interval [-1,1] after being processed by the hyperbolic tangent activation function; For exploring noise at the current moment; Equations (14) to (17) represent the method of mapping normalized actions to the actual physical control space, where, From maximum output and minimum effort constraint; Maximum charge and discharge power Constraints, where a positive value represents discharging and a negative value represents charging; and The normalized space [-1,1] is linearly mapped to the physical job space [0,1].
[0113] S202, Constructing a critic network for deep reinforcement learning algorithms:
[0114] (18)
[0115] In the formula, The expected long-term return for taking action in the current state; This is the fusion state vector input to the critic network; The joint action vector input to the critic network; For the network parameters of the commentator.
[0116] S3. At each time step, the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment interact to generate an experience tuple, and each experience tuple is stored in the experience replay pool.
[0117] Furthermore, in this embodiment, step S3 is specifically represented by the following formula:
[0118] (19)
[0119] (20)
[0120] In the formula, For experience replay pool; For the first The empirical tuple at time step (i.e., the state vector) Joint action vector Reward Value Decision-making time steps state vector and end flag .
[0121] S4. By sampling experience data from the experience replay pool, the actor network and critic network are iteratively updated until the preset maximum number of iterations is reached, and the final collaborative scheduling strategy is generated.
[0122] Furthermore, in this embodiment of the application, step S4 includes:
[0123] S401, from the experience replay pool Randomly sample a batch of data ;
[0124] S402. Calculate the target motion vector and target Q value at time t+1, as follows:
[0125] (twenty one)
[0126] (twenty two)
[0127] In the formula, For the first The target action vector at any given time; and These are the target actor network and the target critic network, respectively. and These are the target network parameters corresponding to the target actor network and the target critic network, respectively. For the first The target actor's network outputs gas turbine control commands at all times; The target value for temporal difference is used to guide the reviewer network updates; Discount factor; For the target critic network based on the first The expected future value evaluated from the time-state vector and the target action vector; For the first The target actor's network outputs the energy storage system's charging and discharging control commands at all times; For the first The dock crane operation rate command output by the target actor network at any time; For the first The target actor network outputs the yard crane operation rate command;
[0128] S403. Construct the loss function for the critic network and update the critic network parameters using gradient descent. The details are as follows:
[0129] (twenty three)
[0130] (twenty four)
[0131] In the formula, For the loss function of the critic network; The learning rate of the critic network;
[0132] S404. Calculate the policy gradient of the actor network and update the actor network parameters through gradient ascent. The details are as follows:
[0133] (25)
[0134] (26)
[0135] In the formula, Let be the performance objective function of the actor network, i.e., the expected cumulative reward; is a general symbol for action variables; k is the index set identifier for each specific action dimension in the joint action space; gt is the identifier for gas turbine output control; ess is the identifier for energy storage system charge and discharge control; qc is the identifier for dock crane operation rate control; yc is the identifier for yard crane operation rate control. For the first Specific action control quantities in each dimension;
[0136] S405. Perform a soft update on the target network parameters, as follows:
[0137] (27)
[0138] (28)
[0139] In the formula, This is a soft update coefficient;
[0140] S406. Repeat steps S401-S405 until the preset maximum number of iterations is reached, generating the final collaborative scheduling strategy.
[0141] For example, the maximum number of iterations can be preset to 3000.
[0142] This application provides an end-to-end integrated scheduling decision-making method for multi-energy coupled systems in seaports. First, a port collaborative scheduling simulation environment is constructed, including a dynamic task model and an energy asset model, to define the system's state space, action space, and composite reward function. Then, based on the state space, action space, and composite reward function, a deep reinforcement learning algorithm based on deep deterministic policy gradient is constructed. At each time step, the deep reinforcement learning algorithm interacts with the port collaborative scheduling simulation environment to generate an experience tuple, which is stored in an experience replay pool. Finally, by sampling experience data from the experience replay pool, the actor network and critic network are iteratively updated until a preset maximum number of iterations is reached, generating the final collaborative scheduling strategy. This application can break down the scheduling barriers between logistics and energy, using operational rhythm as a decision variable, and achieving global optimization through deep reinforcement learning, significantly reducing port operating costs and improving the level of intelligence.
[0143] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0144] To implement the above embodiments, this application also proposes a terminal device.
[0145] Figure 2 This is a schematic diagram of the structure of a terminal device according to an embodiment of this application.
[0146] like Figure 2 As shown, the terminal device 200 includes:
[0147] The system includes a memory 210 and at least one processor 220, and a bus 230 connecting different components (including the memory 210 and the processor 220). The memory 210 stores a computer program, which, when executed by the processor 220, implements an integrated scheduling and decision-making method for an end-to-end multi-energy coupling system of a seaport as described in this application embodiment.
[0148] Bus 230 represents one or more of several bus architectures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of the various bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
[0149] Terminal device 200 typically includes various electronically readable media. These media can be any available media that can be accessed by terminal device 200, including volatile and non-volatile media, removable and non-removable media.
[0150] Memory 210 may also include computer system readable media in the form of volatile memory, such as RAM 240 and / or cache 250. Terminal device 200 may further include other removable / non-removable, volatile / non-volatile computer system storage media. By way of example only, storage system 260 may be used to read and write non-removable, non-volatile magnetic media (… Figure 2 Not shown; usually referred to as a "hard drive"). Although Figure 2 As not shown, a disk drive for reading and writing to a removable non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 230 via one or more data media interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of this application.
[0151] A program / utility 280 having a set (at least one) of program modules 270 may be stored in, for example, memory 210. Such program modules 270 include—but are not limited to—an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 270 typically perform the functions and / or methods described in the embodiments of this application.
[0152] Terminal device 200 can also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), and with one or more devices that enable a user to interact with terminal device 200, and / or with any device that enables terminal device 200 to communicate with one or more other computing devices (e.g., network card, modem, etc.). This communication can be performed via input / output (I / O) interface 292. Furthermore, terminal device 200 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 293. As shown, network adapter 293 communicates with other modules of terminal device 200 via bus 230. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with terminal device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0153] The processor 220 performs various functional applications and data processing by running programs stored in the memory 210.
[0154] It should be noted that the implementation process and technical principles of the terminal device in this embodiment are explained in the foregoing description of an integrated scheduling and decision-making method for an end-to-end multi-energy coupling system of a seaport, and will not be repeated here.
[0155] Corresponding to the above embodiment, an integrated scheduling decision-making method for an end-to-end multi-energy coupled seaport system, Figure 3 This diagram illustrates a structural block diagram of an integrated scheduling and decision-making device for a multi-energy coupling system of a seaport based on an end-to-end approach, provided by an embodiment of this application. For ease of explanation, only the parts relevant to the embodiments of this application are shown.
[0156] Reference Figure 3 The device 300 includes:
[0157] The first construction module 301 is used to construct a port collaborative scheduling simulation environment that includes a dynamic task model and an energy asset model, and is used to define the system's state space, action space, and composite reward function.
[0158] The second construction module 302 is used to construct a deep reinforcement learning algorithm based on the state space, action space and composite reward function, wherein the deep reinforcement learning algorithm includes an actor network and a critic network.
[0159] The generation module 303 is used to interact with the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment at each time step to generate an experience tuple, and store each experience tuple in the experience replay pool;
[0160] The update generation module 304 is used to iteratively update the actor network and the critic network by sampling experience data from the experience replay pool until a preset maximum number of iterations is reached, thereby generating the final collaborative scheduling strategy.
[0161] In practical use, the end-to-end integrated scheduling and decision-making device for a multi-energy coupling system of a seaport provided in this application embodiment can be configured in any terminal device to execute the aforementioned end-to-end integrated scheduling and decision-making method for a multi-energy coupling system of a seaport.
[0162] This application provides an integrated scheduling and decision-making device for an end-to-end multi-energy coupled port system. First, a port collaborative scheduling simulation environment is constructed, including a dynamic task model and an energy asset model, to define the system's state space, action space, and composite reward function. Then, based on the state space, action space, and composite reward function, a deep reinforcement learning algorithm based on deep deterministic policy gradient is constructed. At each time step, the deep reinforcement learning algorithm interacts with the port collaborative scheduling simulation environment to generate an experience tuple, which is stored in an experience replay pool. Finally, by sampling experience data from the experience replay pool, the actor network and critic network are iteratively updated until a preset maximum number of iterations is reached, generating the final collaborative scheduling strategy. This application can break down the scheduling barriers between logistics and energy, using operational rhythm as a decision variable, and achieving global optimization through deep reinforcement learning, significantly reducing port operating costs and improving the level of intelligence.
[0163] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.
[0164] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0165] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps described in the various method embodiments above.
[0166] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps described in the various method embodiments above.
[0167] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some regions, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0168] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0169] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0170] In the embodiments provided in this application, it should be understood that the disclosed devices / terminal equipment and methods can be implemented in other ways. For example, the device / terminal equipment embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling or direct coupling or communication connection may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0171] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0172] To verify the effectiveness of the method of the present invention, the following comparison scheme was designed:
[0173] Option 1 (this application): The energy-logistics collaborative scheduling method proposed in this invention is adopted.
[0174] Option 2 (Traditional Method): Distribute the 4000 TEUs of work evenly over 24 hours. During this period, to complete the logistics tasks, the cranes must operate continuously at a fixed average rate, regardless of whether electricity prices are at their peak or trough, forming a fixed power load curve. Then, use the same methods as Option 1... An intelligent agent, but its action space is limited to energy scheduling (2-dimensional) to optimize the energy cost under this fixed load.
[0175] The comparison results are shown in Table 1. In Scheme 2, due to the rigid operating rhythm, the agent can only optimize on the energy side, and its optimal operating cost is 386. In Scheme 1, the agent keenly detects that the period from 0:00 to 08:00 is the off-peak electricity price period, so it actively concentrates most of the crane operations during the early morning hours when the electricity price is lowest, and puts the cranes into "dormancy" during the peak electricity price period, ultimately achieving a total operating cost of 333 yuan. In contrast, Scheme 2 is limited by uniform operation, and its power load remains constant, resulting in having to bear a higher average electricity price during the peak electricity price period. Compared with Scheme 2, the collaborative scheduling method proposed in this application, by activating the flexibility of the operation side, creates an additional cost saving of more than 13%, fully demonstrating the great advantages and economic value of the method of this invention.
[0176] Table 1
[0177]
[0178] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A method for integrated scheduling and decision-making of a multi-energy coupled seaport system based on end-to-end, characterized in that, The method includes: S1. Construct a port collaborative scheduling simulation environment that includes a dynamic task model and an energy asset model to define the system's state space, action space, and composite reward function; S2. Based on the state space, action space, and composite reward function, construct a deep reinforcement learning algorithm based on deep deterministic policy gradient. The deep reinforcement learning algorithm includes an actor network and a critic network. S3. At each time step, the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment are interacted to generate an experience tuple, and each experience tuple is stored in the experience replay pool. S4. By sampling experience data from the experience replay pool, the actor network and the critic network are iteratively updated until the preset maximum number of iterations is reached, and the final collaborative scheduling strategy is generated. Wherein, S101, constructing the state space S, is represented as follows: ;(1) In the formula, For the first The state vector for each decision time step; The current moment; For the first The state of charge of the port energy storage system at all times; For the first The electricity price on the grid at any given time; For the generator in the first Efforts made at all times; For the first The remaining workload of the dockside cranes at any given time; For the first The remaining workload of the yard cranes at all times; The deadline for the task. This represents the remaining time until the task deadline. S102, Construct the joint action space , means as follows: ;(2) In the formula, Let be the joint action vector output by the agent at the t-th decision time step; For the first The generator output control command at all times; For the first Charge and discharge power control commands for the energy storage system at all times; For the first Time-bound control commands for the working speed of the dock cranes; For the first Constant-time operating rate control commands for yard cranes; S103. Construct a dynamic task model within the simulation environment to calculate the dynamic power load generated by the operational actions, as detailed below: ;(3) ;(4) ;(5) In the formula, , The first The actual amount of work completed by dock cranes and yard cranes at any given time; These refer to the number of dock cranes and the number of yard cranes, respectively. These refer to the maximum operating rate of a single dock crane and the maximum operating rate of a single yard crane, respectively. These are the energy consumption per operation of a dock crane and the energy consumption per operation of a yard crane, respectively. For the first Dynamic power load at any given time; S104. Construct an energy asset model within the simulation environment, as detailed below: ;(6) ;(7) In the formula, For the first Power purchased from the external power grid at all times; For the first A fixed load at any given time; For the first The actual output of the gas turbine at any given moment; For the first The actual output of the energy storage system at any given time; This represents the state of charge of the port energy storage system at time t+1. Battery capacity; The charge / discharge efficiency coefficient; S105. Construct the composite reward function: ;(8) ;(9) ;(10) ;(11) In the formula, For the first The reward value at any given moment; For the first The cost of purchasing electricity at any given moment; For the first Real-time gas turbine operating costs; This represents the secondary cost coefficient of the gas turbine. This represents the primary cost coefficient of the gas turbine. This represents the fixed cost coefficient for the gas turbine. This is a penalty for task failure. This is a pre-set constant for the enormous fine; S201. Constructing the actor network and action mapping for deep reinforcement learning algorithms: ;(12) ;(13) ;(14) ; (15) ;(16) ;(17) In the formula, For actor network parameters; Represents the deterministic policy function of the actor network; These are the original motion vectors of the actor network; This is a normalized action vector that is restricted to the interval [-1,1] after being processed by the hyperbolic tangent activation function; For exploring noise at the current moment; formulas (14) to (17) represent the specific methods for mapping normalized actions to the actual physical control space, where, From maximum output and minimum effort constraint; Maximum charge and discharge power Constraints, where a positive value represents discharging and a negative value represents charging; and The normalized space [-1,1] is linearly mapped to the physical job space [0,1]. S202, Constructing a critic network for deep reinforcement learning algorithms: ;(18) In the formula, The expected long-term return for taking action in the current state; This is the fusion state vector input to the critic network; The joint action vector input to the critic network; For the network parameters of the commentator.
2. The method according to claim 1, characterized in that, In step S3, at each time step, the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment interact to generate an experience tuple, and each experience tuple is stored in the experience replay pool, as follows: ;(19) ; (20) In the formula, For experience replay pool; For the first The empirical tuple at time step (i.e., the state vector) Joint action vector Reward Value Decision-making time steps state vector and end flag .
3. The method according to claim 2, characterized in that, Step S4 involves sampling experience data from the experience replay pool and iteratively updating the actor network and critic network until a preset maximum number of iterations is reached, generating the final collaborative scheduling strategy, as detailed below: S401, from the experience replay pool Randomly sample a batch of data ; S402. Calculate the target motion vector and target Q value at time t+1, as follows: ;(21) ; (22) In the formula, For the first The target action vector at any given time; and These are the target actor network and the target critic network, respectively. and These are the target network parameters corresponding to the target actor network and the target critic network, respectively. For the first The target actor's network outputs gas turbine control commands at all times; The target value for temporal difference is used to guide the reviewer network updates; Discount factor; For the target critic network based on the first The expected future value evaluated from the time-state vector and the target action vector; For the first The target actor's network outputs the energy storage system's charging and discharging control commands at all times; For the first The dock crane operation rate command output by the target actor network at any time; For the first The target actor network outputs the yard crane operation rate command; S403. Construct the loss function for the critic network and update the critic network parameters using gradient descent. The details are as follows: ;(23) ; (24) In the formula, For the loss function of the critic network; The learning rate of the critic network; S404. Calculate the policy gradient of the actor network and update the actor network parameters through gradient ascent. The details are as follows: ;(25) ; (26) In the formula, Let be the performance objective function of the actor network, i.e., the expected cumulative reward; is a general symbol for action variables; k is the index set identifier for each specific action dimension in the joint action space; gt is the identifier for gas turbine output control; ess is the identifier for energy storage system charge and discharge control; qc is the identifier for dock crane operation rate control; yc is the identifier for yard crane operation rate control. For the first Specific action control quantities in each dimension; S405. Perform a soft update on the target network parameters, as follows: ;(27) ; (28) In the formula, This is a soft update coefficient; S406. Repeat steps S401-S405 until the preset maximum number of iterations is reached, generating the final collaborative scheduling strategy.
4. An integrated scheduling and decision-making device for an end-to-end multi-energy coupling system of a seaport, characterized in that, include: The first building module is used to construct a port collaborative scheduling simulation environment that includes a dynamic task model and an energy asset model, and is used to define the system's state space, action space, and composite reward function. The second construction module is used to construct a deep reinforcement learning algorithm based on the state space, action space and composite reward function. The deep reinforcement learning algorithm includes an actor network and a critic network. The generation module is used to interact with the deep reinforcement learning algorithm and the port collaborative scheduling simulation environment at each time step to generate an experience tuple, and store each experience tuple in the experience replay pool; The update generation module is used to iteratively update the actor network and the critic network by sampling experience data from the experience replay pool until a preset maximum number of iterations is reached, thereby generating the final collaborative scheduling strategy. The construction of a port collaborative scheduling simulation environment, which includes a dynamic task model and an energy asset model, is used to define the system's state space, action space, and composite reward function, including: The state space S is constructed as follows: ;(1) In the formula, For the first The state vector for each decision time step; The current moment; For the first The state of charge of the port energy storage system at all times; For the first The electricity price on the grid at any given time; For the generator in the first Efforts made at all times; For the first The remaining workload of the dockside cranes at any given time; For the first The remaining workload of the yard cranes at all times; The deadline for the task. This represents the remaining time until the task deadline. Construct the joint action space , means as follows: ;(2) In the formula, Let be the joint action vector output by the agent at the t-th decision time step; For the first The generator output control command at all times; For the first Charge and discharge power control commands for the energy storage system at all times; For the first Time-bound control commands for the working speed of the dock cranes; For the first Constant-time operating rate control commands for yard cranes; A dynamic task model is constructed within the simulation environment to calculate the dynamic electrical load generated by the operational actions, as detailed below: ;(3) ;(4) ;(5) In the formula, , The first The actual amount of work completed by dock cranes and yard cranes at any given time; These refer to the number of dock cranes and the number of yard cranes, respectively. These refer to the maximum operating rate of a single dock crane and the maximum operating rate of a single yard crane, respectively. These are the energy consumption per operation of a dock crane and the energy consumption per operation of a yard crane, respectively. For the first Dynamic power load at any given time; The energy asset model within the simulation environment is constructed as follows: ;(6) ;(7) In the formula, For the first Power purchased from the external power grid at all times; For the first A fixed load at any given time; For the first The actual output of the gas turbine at any given moment; For the first The actual output of the energy storage system at any given time; This represents the state of charge of the port energy storage system at time t+1. Battery capacity; The charge / discharge efficiency coefficient; Construct the composite reward function: ;(8) ;(9) ;(10) ;(11) In the formula, For the first The reward value at any given moment; For the first The cost of purchasing electricity at any given moment; For the first Real-time gas turbine operating costs; This represents the secondary cost coefficient of the gas turbine. This represents the primary cost coefficient of the gas turbine. This represents the fixed cost coefficient for the gas turbine. This is a penalty for task failure. This is a pre-set constant for the enormous fine; The step of constructing a deep reinforcement learning algorithm based on the state space, action space, and composite reward function, including: Constructing an actor network and action mapping for a deep reinforcement learning algorithm: ;(12) ;(13) ;(14) ; (15) ;(16) ;(17) In the formula, For actor network parameters; Represents the deterministic policy function of the actor network; These are the original motion vectors of the actor network; This is a normalized action vector that is restricted to the interval [-1,1] after being processed by the hyperbolic tangent activation function; For exploring noise at the current moment; formulas (14) to (17) represent the specific methods for mapping normalized actions to the actual physical control space, where, From maximum output and minimum effort constraint; Maximum charge and discharge power Constraints, where a positive value represents discharging and a negative value represents charging; and The normalized space [-1,1] is linearly mapped to the physical job space [0,1]. Building a network of critics for deep reinforcement learning algorithms: ;(18) In the formula, The expected long-term return for taking action in the current state; This is the fusion state vector input to the critic network; The joint action vector input to the critic network; For the network parameters of the commentator.
5. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method as described in any one of claims 1 to 3.
6. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 3.