A container yard decision-making method, system, device and medium based on deep reinforcement learning

CN122243098A · Pending · Publication Date: 2026-06-19 · QINGDAO PORT INT CO LTD +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QINGDAO PORT INT CO LTD
Filing Date: 2026-03-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing container yard management methods rely on fixed rules or penalty points and adapt poorly to changes in container operation instructions and yard conditions, resulting in uneven resource allocation, low yard utilization, and high container rehandling (overturning) rates.

Method used

A container yard decision-making method based on deep reinforcement learning is adopted. The state space and action space of the yard are defined and a yard decision-making model is constructed. The reward function is defined by combining hard and flexible rules. The model is trained using a deep reinforcement learning algorithm to optimize container location allocation.

Benefits of technology

The method enables dynamic adjustment of container positions based on the real-time status of the yard, improving yard resource utilization, optimizing resource allocation, enhancing the model's adaptability and flexibility, and reducing the container rehandling (overturning) rate in the yard.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a container yard decision-making method, system, device, and medium based on deep reinforcement learning, belonging to the field of container yard technology. The method constructs a yard decision-making model; defines a reward function based on hard rules, flexible rules, and yard storage conditions; and trains the yard decision-making model using a deep reinforcement learning algorithm. During training, the model adjusts its decisions according to the yard environment and the reward function, and after each decision the yard state and reward signal are fed back to the model. After a preset number of training rounds, once the objective function of the preset training strategy is reached, the trained yard decision-making model is obtained and embedded into a yard management system. The invention adapts to real-time changes in the yard state, responding to changes in container type, yard storage conditions, and subsequent tasks. While ensuring that hard rules are enforced, it handles flexible rules and yard storage conditions flexibly, achieving multi-objective optimization.

Description

Technical Field

[0001] This invention belongs to the field of container yard technology, specifically relating to a container yard decision-making method, system, device, and medium based on deep reinforcement learning.

Background Technology

[0002] With the continuous increase in port throughput, the management and scheduling of container yards, as one of the core links in port logistics, are becoming increasingly complex and urgent.

[0003] Existing methods typically allocate container locations based on fixed rules or penalty points. These rules are often static and difficult to adapt to real-time changes in the yard's status. For example, factors such as the priority of container handling orders, the yard's storage conditions, and the status of container orders are constantly changing, but fixed rules or penalty points cannot respond to these changes in real time. This leads to uneven distribution of yard resources, increased container rehandling rates, and reduced yard utilization.

Summary of the Invention

[0004] This invention provides a container yard decision-making method based on deep reinforcement learning. The method is used to optimize the allocation of container yard locations, improve the utilization rate of yard resources, and dynamically adjust the container locations according to the yard's busyness and subsequent planned instructions.

[0005] The method includes: defining the state space and action space of the yard to form the yard environment, and constructing the yard decision model; defining the reward function based on hard rules, flexible rules, and factors such as yard storage conditions, container order status, and the current status of ships and operating equipment; training the yard decision model using a deep reinforcement learning algorithm, where during training the model adjusts its decisions based on the yard environment and the reward function; after each decision, feeding the yard state and reward signal back to the yard decision model, and, after a preset number of training rounds, once the objective function of the preset training strategy is reached, obtaining the trained yard decision model; and embedding the trained yard decision model into the yard management system.

[0006] It should be further explained that the state space of the yard is defined by four components: the load at each yard location, the idle state of each location, the container-type code of each location, and the future yard tasks within a predefined time window. The action space consists of selecting a target location from the available locations. At each moment, the yard decision model observes the current yard state and selects an action from the action space according to the yard policy.

[0007] It should be further noted that the yard environment in the method is a three-dimensional grid yard. Three spatial coordinate dimensions are defined (bay, column, and tier), so that each grid cell corresponds to a specific storage location in the yard. A state tensor is defined over the three spatial dimensions of the yard and a feature dimension for each grid cell. For each cell of the three-dimensional grid yard, the features are: a load factor indicating whether the current storage location is occupied; a mask indicating the container types that are allowed to be stacked; a flag marking whether the location is temporarily or manually closed; and a task relevance value, the degree to which the location matches subsequent tasks.

[0008] It should be further noted that container characteristics are encoded before the yard decision model is constructed.

[0009] Each container's characteristics include basic attributes and business attributes. A mask is generated based on the hard rules; the generated mask is a three-dimensional Boolean mask matrix in which a valid position is marked as 1 and an invalid position is marked as 0.

[0010] Traverse each grid cell of the yard and check hard rules based on the characteristics of the containers and the attributes of the yard location.

[0011] If a position meets the hard rules, it is marked as a valid position; otherwise, it is marked as an invalid position.

[0012] The hard rules include: size compatibility, type matching, and whether it is closed.
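The mask generation in paragraphs [0009] to [0012] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Slot` and `Container` classes, their field names, and the container type labels are hypothetical, and the three checks stand in for the size-compatibility, type-matching, and closed-location rules named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    # All field names are illustrative assumptions, not taken from the source.
    max_length_ft: int         # longest container the location can hold
    allowed_types: frozenset   # container types permitted at this location
    closed: bool               # True if the location is closed


@dataclass(frozen=True)
class Container:
    length_ft: int
    ctype: str


def satisfies_hard_rules(container, slot):
    """Size compatibility, type matching, and the closed check."""
    return (container.length_ft <= slot.max_length_ft
            and container.ctype in slot.allowed_types
            and not slot.closed)


def build_mask(yard, container):
    """yard[bay][column][tier] holds a Slot; returns a same-shaped 1/0 mask."""
    return [[[1 if satisfies_hard_rules(container, slot) else 0
              for slot in column]
             for column in bay]
            for bay in yard]


# One bay, two columns, two tiers.
general = Slot(40, frozenset({"GP"}), False)
reefer = Slot(40, frozenset({"GP", "RF"}), False)
closed = Slot(40, frozenset({"GP", "RF"}), True)
short = Slot(20, frozenset({"RF"}), False)
yard = [[[general, reefer], [closed, short]]]

mask = build_mask(yard, Container(40, "RF"))
```

Only the second location passes all three checks for a 40 ft refrigerated container, so it is the only cell marked 1.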

[0013] It should be further noted that the step of training the yard decision model using a deep reinforcement learning algorithm also includes: using a Transformer encoder to process the yard state and extract global features; and using a multilayer perceptron to output the probability distribution over locations, $\pi_\theta(a \mid s)$,

[0014] where $\pi_\theta$ represents the policy network, $s$ indicates the yard state, and $a$ indicates an action. During training, the PPO algorithm is used to train the yard decision-making model, and the objective function is defined as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

[0015] where $\theta$ denotes the policy parameters being optimized, $\pi_\theta$ is the action probability distribution of the current policy, $\theta_{old}$ denotes the fixed reference parameters, $\hat{A}_t$ quantifies the relative value of actions (the advantage), and $\epsilon$ is the trust-region parameter that controls policy updates.

[0016] It should be further noted that, in the method, the reward function is built from four terms. For the hard rules, a hard-rule reward term applies to actions that violate the strict rules on container type and location: such an action receives a penalty whose magnitude is a positive integer; otherwise, the term carries no penalty. For the flexible rules, a flexible-rule reward term is adjusted dynamically according to the yard's storage conditions and container priority; it is computed from the container priority and the average load of the yard, scaled by the reward coefficient for flexible rules. For the yard storage conditions, a yard-storage reward term depends on the load of the location selected by the action, scaled by the reward coefficient for the yard storage status. The method also defines subsequent tasks for the yard, with a subsequent-task reward term: if the action matches the position required by the subsequent planned instructions, the term is a positive reward; otherwise, no reward is given. The overall reward function combines the hard-rule, flexible-rule, yard-storage, and subsequent-task reward terms.

[0017] This application also provides a container yard decision-making system based on deep reinforcement learning, the system comprising: a model building module, used to define the state space and action space of the yard to form the yard environment and to build the yard decision model; a reward definition module, used to define the reward function by combining hard rules, flexible rules, and factors such as yard storage conditions, container instruction status, and the current status of ships and operating equipment; a model training module, used to train the yard decision-making model using a deep reinforcement learning algorithm, where during training the model adjusts its decisions based on the yard environment and the reward function; a model output module, used to feed the yard state and reward signal back to the yard decision model after each decision and, after a preset number of training rounds, once the objective function of the preset training strategy is reached, to obtain the trained yard decision model; and a model embedding module, used to embed the trained yard decision-making model into the yard management system.

[0018] According to another embodiment of this application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of a container yard decision-making method based on deep reinforcement learning.

[0019] According to yet another embodiment of this application, a storage medium is also provided that stores a computer program which, when executed by a processor, implements the steps of the deep reinforcement learning-based container yard decision-making method.

[0020] As can be seen from the above technical solutions, the present invention has the following advantages: The container yard decision-making method based on deep reinforcement learning provided in this application optimizes the decision-making strategy through multiple rounds of training and trial-and-error. During training, the yard decision-making model adjusts its decisions based on environmental rewards. After each decision, the yard's state and reward signals are fed back to the yard decision-making model for strategy adjustment. Through multiple rounds of training, the yard decision-making model gradually learns the optimal container location allocation strategy. The trained yard decision-making model can be embedded into a yard management system to provide real-time container allocation decision support. The system automatically allocates container locations based on the real-time state of the yard, maximizing yard space utilization, optimizing resource allocation, and balancing various constraints.

[0021] This application designs a reward function that considers hard rules, flexible rules, yard storage conditions, and subsequent task instructions. This multi-factor design reflects the actual needs and objectives of yard operations more comprehensively, providing richer and more accurate feedback signals to the yard decision-making model and guiding it to learn decision-making strategies that are better aligned with actual operations. A deep reinforcement learning algorithm is used to train the yard decision-making model, and a multilayer perceptron is used to fit the mapping between container allocation locations and yard states. This leverages the ability of deep reinforcement learning to optimize decisions in dynamic environments, as well as the feature learning and nonlinear mapping capabilities of neural networks, enabling better handling of complex yard state information and more accurate decisions. Furthermore, through multiple rounds of training, the decision-making strategy is continuously refined through trial and error based on rewards and penalties from environmental feedback. This allows the yard decision-making model to gradually explore and discover better decisions in actual operational scenarios, enhancing the model's adaptability and flexibility.

Description of the Drawings

[0022] To more clearly illustrate the technical solution of the present invention, the accompanying drawings used in the description will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart of a container yard decision-making method based on deep reinforcement learning; Figure 2 is a schematic diagram of a container yard decision-making system based on deep reinforcement learning; Figure 3 is a schematic diagram of an electronic device.

Detailed Implementation

[0024] The container yard decision-making method based on deep reinforcement learning provided in this application uses the yard location and containers as the state space and action space when constructing the yard decision-making model. The yard state includes location load, idle state, container type, and subsequent tasks.

[0025] The action space here is defined by the container stacking positions, and the goal of the yard decision model is to select the optimal position for container allocation based on the current state. This application also uses deep reinforcement learning algorithms (such as Proximal Policy Optimization, PPO) to train the yard decision model, and fits the mapping relationship between container allocation positions and yard states through neural networks.

[0026] The yard decision-making model continuously optimizes its decision-making strategy through multiple rounds of training and trial and error. During training, the model adjusts its decisions based on environmental rewards. After each decision, the yard's state and reward signals are fed back to the model for strategy adjustment. Through multiple rounds of training, the model gradually learns the optimal container location allocation strategy. The trained model can then be embedded into a yard management system to provide real-time container allocation decision support. The system automatically allocates container locations based on the yard's real-time status, maximizing yard space utilization, optimizing resource allocation, and balancing various constraints.

[0027] The following details the specific steps of a container yard decision-making method based on deep reinforcement learning. For illustrative purposes and not for limiting purposes, specific details such as particular system architectures and technologies are presented to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application can also be implemented in other embodiments without these specific details.

[0028] It should be understood that, when used in this specification, the term "comprising" indicates the presence of the described feature, integral, step, operation, element, and / or component, but does not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components, and / or collections thereof. The terms "comprising," "including," "having," and variations thereof all mean "including but not limited to," unless otherwise specifically emphasized.

[0029] The terms "one embodiment" or "some embodiments" used in this application mean that one or more embodiments of this application include the specific features, structures, or characteristics described in that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this application do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized.

[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0031] Referring to Figure 1, which shows a flowchart of a container yard decision-making method based on deep reinforcement learning in a specific embodiment, the method includes:

S101: Define the state space and action space of the yard to form the yard environment, and construct the yard decision model.

[0032] In some embodiments, the state space includes the location load, idle status, container type, and subsequent tasks of the yard. Defining the yard information as a multi-dimensional vector transforms the complex yard state into a numerical form that can be processed by the model, allowing the model to perceive the real-time status of the yard.

[0033] The action space in this embodiment is composed of the stacking positions of containers. The constructed yard decision model provides decision options. The yard decision model selects an action from the action space based on the current state, that is, determines the stacking position of the containers.

[0034] This embodiment uses a neural network to construct the yard decision-making model, for example a Transformer encoder combined with an MLP. Taking the MLP as an example, the input is a state-space vector, which is processed through multiple hidden layers to output the value or probability distribution of each action. Assume the MLP has $L$ hidden layers, and the output of the $l$-th hidden layer is $h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$, where $\sigma$ is the activation function, $W^{(l)}$ is a weight matrix, and $b^{(l)}$ is a bias vector. The output layer outputs the value or probability distribution of actions, fitting the mapping between states and actions so that the model can make reasonable action decisions based on the input yard state.
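The MLP forward pass described in [0034] can be sketched in plain Python. This is an illustrative sketch under stated assumptions: ReLU as the activation and a softmax output layer are choices made here, not details confirmed by the source, and the weights are arbitrary numbers rather than trained values.

```python
import math


def relu(vector):
    return [max(0.0, x) for x in vector]


def affine(weights, bias, inputs):
    """Computes W @ inputs + bias with nested lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_policy(state, hidden_layers, output_layer):
    """Each hidden layer applies activation(W h + b); the output layer
    is turned into a probability distribution over actions."""
    h = state
    for weights, bias in hidden_layers:
        h = relu(affine(weights, bias, h))
    out_weights, out_bias = output_layer
    return softmax(affine(out_weights, out_bias, h))


# Arbitrary illustrative weights: 4 state features -> 2 hidden units -> 3 actions.
hidden = [([[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2]], [0.0, 0.1])]
output = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0])
probs = mlp_policy([8.0, 3.0, 2.0, 9.0], hidden, output)
```

The result `probs` holds one probability per candidate action and sums to 1, which is the form a policy network needs for sampling a stacking location.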

[0035] As one implementation of this embodiment, the state space of the yard is defined by four components.

[0036] These components are: the load at each yard location, the idle state of each location, the container-type code of each location, and the future yard tasks within a predefined time window.

[0037] The action space is used to select a target location from available locations. At each time step, the yard decision model monitors the current yard status and selects an action from the action space based on the yard policy.

[0038] For example, at a certain moment the yard decision model observes the current yard state, in which the location load is 8, the idle-state value is 3, the container type code is 2, and the subsequent-task identifier is 9. Then, according to the yard policy, a first action is selected from the action space; this first action is the stacking position chosen by the yard decision model for the current container. The model executes the first action, and the yard transitions to the next state. The model then selects a second action from the action space according to the new state, and the process repeats, so that the best position for each container is chosen based on the current state.
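The observe, act, and transition cycle in [0038] can be illustrated with a toy environment. Everything here is a simplified stand-in: `ToyYard` tracks only a per-location load, and `least_loaded_policy` is a hand-written placeholder for the learned policy, not the patent's model.

```python
class ToyYard:
    """Hypothetical yard with a handful of locations and a tier limit."""

    def __init__(self, n_locations, max_tiers):
        self.loads = [0] * n_locations
        self.max_tiers = max_tiers

    def state(self):
        return tuple(self.loads)

    def available_actions(self):
        # A location is available while it has room for another container.
        return [i for i, load in enumerate(self.loads) if load < self.max_tiers]

    def step(self, action):
        if action not in self.available_actions():
            raise ValueError("location is full")
        self.loads[action] += 1
        return self.state()


def least_loaded_policy(state, actions):
    """Placeholder policy: pick the emptiest available location."""
    return min(actions, key=lambda i: (state[i], i))


yard = ToyYard(n_locations=3, max_tiers=2)
trajectory = []
for _ in range(4):
    current = yard.state()                                    # observe
    action = least_loaded_policy(current, yard.available_actions())  # decide
    trajectory.append(action)
    yard.step(action)                                         # environment transitions
```

In training, the placeholder policy would be replaced by the policy network, and each `step` would also return a reward signal.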

[0039] S102: Define the reward function based on hard rules, flexible rules, and yard storage conditions.

[0040] Existing reward functions often use fixed weights, which cannot adapt to dynamic scenarios. This method achieves adaptive optimization by using a dynamic weight adjustment mechanism, combined with hard constraints and multi-objective elastic rules.

[0041] In this embodiment, the reward function is built from the following terms.

[0042] For the hard rules, a hard-rule reward term applies to actions that violate the strict rules on container type and location. Such an action receives a penalty whose magnitude is a positive integer; otherwise, the term carries no penalty.

[0043] For the flexible rules, a flexible-rule reward term is adjusted dynamically according to the yard's storage conditions and container priority. It is computed from the container priority and the average load of the yard, scaled by the reward coefficient for the flexible rules.

[0044] For the yard storage conditions, a yard-storage reward term depends on the load of the location selected by the action, scaled by the reward coefficient for the yard storage status.

[0045] The method also defines subsequent tasks for the yard, with a subsequent-task reward term: if the action matches the position required by the subsequent planned instructions, the term is a positive reward; otherwise, no reward is given.

[0046] The overall reward function combines the hard-rule, flexible-rule, yard-storage, and subsequent-task reward terms.
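A reward with this four-term structure might look like the sketch below. The exact formulas are not recoverable from the text, so every functional form, coefficient, and default value here is an assumption made purely for illustration, including the choice to sum the four terms.

```python
def yard_reward(violates_hard_rule, container_priority, avg_yard_load,
                location_load, matches_next_task,
                penalty=10, alpha=1.0, beta=0.5, task_bonus=5.0):
    """Illustrative four-term reward; all formulas below are assumed.

    penalty    : positive integer charged for a hard-rule violation
    alpha      : hypothetical coefficient of the flexible-rule term
    beta       : hypothetical coefficient of the yard-storage term
    task_bonus : hypothetical positive reward for matching the next task
    """
    # Hard rules: penalise violations, assumed neutral otherwise.
    r_hard = -float(penalty) if violates_hard_rule else 0.0
    # Flexible rules (assumed form): priority counts more when the yard
    # is lightly loaded; avg_yard_load is taken to lie in [0, 1].
    r_flex = alpha * container_priority * (1.0 - avg_yard_load)
    # Yard storage (assumed form): discourage already-loaded locations.
    r_storage = -beta * location_load
    # Subsequent tasks: reward a match with the planned instruction.
    r_task = task_bonus if matches_next_task else 0.0
    return r_hard + r_flex + r_storage + r_task


good = yard_reward(False, 2, 0.5, 0.4, True)
bad = yard_reward(True, 0, 0.0, 0.0, False)
```

With these assumed forms, a rule-abiding placement that matches the next planned task scores well above a placement that breaks a hard rule, which is the ordering the text describes.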

[0047] S103: The yard decision model is trained using a deep reinforcement learning algorithm; during the training process, the yard decision model adjusts its decisions based on the yard environment and the reward function.

[0048] This embodiment uses the Proximal Policy Optimization (PPO) algorithm for model training. The PPO algorithm improves decision-making performance by optimizing the policy network. It maximizes cumulative rewards while ensuring that policy updates are not too drastic. Furthermore, the PPO algorithm is combined with experience replay to ensure training stability, and multi-objective collaborative optimization is achieved through sub-objective normalization.

[0049] As one embodiment of this application, a Transformer encoder is used to process the stockpile state and extract global features.

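[0050] A multilayer perceptron outputs the probability distribution over locations, $\pi_\theta(a \mid s)$,

[0051] where $\pi_\theta$ represents the policy network, $s$ indicates the yard state, and $a$ indicates an action. During training, the PPO algorithm is used to train the yard decision-making model, and the objective function is defined as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

[0052] where $\theta$ denotes the policy parameters being optimized, $\pi_\theta$ is the action probability distribution of the current policy, $\theta_{old}$ denotes the fixed reference parameters, $\hat{A}_t$ quantifies the relative value of actions (the advantage), and $\epsilon$ is the trust-region parameter that controls policy updates.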
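For readers unfamiliar with PPO, the standard clipped surrogate objective can be computed as in the sketch below. It follows the commonly published PPO-clip form and is assumed, not confirmed, to match the objective used in this embodiment; the function name, argument names, and the default `eps=0.2` are illustrative.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Mean clipped surrogate over a batch of (state, action) samples.

    new_log_probs : log-probabilities of the taken actions under the current policy
    old_log_probs : the same under the fixed reference policy
    advantages    : advantage estimates for the taken actions
    eps           : clipping (trust-region) parameter
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                # probability ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep ratio near 1
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return total / len(advantages)


# Ratio 1.5 with a positive advantage is capped at 1.2;
# ratio 0.5 with a negative advantage is held at 0.8.
objective = ppo_clip_objective(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

Clipping removes any incentive to move the probability ratio far from 1 in a single update, which is how PPO keeps policy updates from being too drastic.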

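[0053] This embodiment also defines a value function loss.

[0054] A total loss function is also defined, combining the policy network loss, the value function loss, and the entropy loss,

[0055] where each of the three loss terms is weighted by its own coefficient.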

[0056] Compared to existing deep reinforcement learning methods that focus on only a single type of loss function during training, such as considering only policy gradient loss to optimize the policy network, this embodiment of the container yard decision-making method based on deep reinforcement learning defines both a value function loss and a total loss function. The total loss function integrates the policy network loss, the value function loss, and the entropy loss. This design considers more factors and optimizes the model from different perspectives.

[0057] The coefficients in the total loss function are configured specifically for the container yard decision-making scenario. Different coefficient values change the weight of each loss term in the total loss, and therefore the direction and focus of model training. This scenario-specific parameter configuration distinguishes the method from general deep reinforcement learning methods.

[0058] The entropy loss term measures the randomness of the policy. Adding the entropy loss, weighted by its coefficient, to the total loss function encourages the model to maintain a certain degree of policy randomness during training, so that it explores the possible situations in the yard environment more comprehensively.
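One common way to combine a policy term, a value function loss, and an entropy term is sketched below. The source's formulas and coefficient values are not recoverable, so the mean-squared-error value loss, the sign conventions, and the three default coefficients are assumptions taken from typical PPO setups.

```python
import math


def value_loss(predicted_values, target_returns):
    """Mean squared error between value estimates and observed returns (assumed form)."""
    pairs = list(zip(predicted_values, target_returns))
    return sum((v - r) ** 2 for v, r in pairs) / len(pairs)


def entropy(action_probs):
    """Shannon entropy of one action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def total_loss(policy_objective, predicted_values, target_returns,
               action_distributions,
               c_policy=1.0, c_value=0.5, c_entropy=0.01):
    """Loss to minimise: the policy objective is maximised, so it enters
    with a minus sign; entropy is rewarded, so it is subtracted too.
    The three coefficients are hypothetical defaults."""
    mean_entropy = (sum(entropy(p) for p in action_distributions)
                    / len(action_distributions))
    return (-c_policy * policy_objective
            + c_value * value_loss(predicted_values, target_returns)
            - c_entropy * mean_entropy)


loss = total_loss(
    policy_objective=0.2,
    predicted_values=[1.0, 2.0],
    target_returns=[1.5, 1.0],
    action_distributions=[[0.5, 0.5]],
)
```

Raising `c_entropy` pushes the policy toward more exploration, while raising `c_value` puts more weight on accurate value estimates.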

[0059] For deployment, the trained model is converted to ONNX format and deployed to edge devices. A REST interface is exposed via Spring Boot to support real-time decision requests. TensorRT is enabled to accelerate inference, keeping the latency of a single decision below 50 ms.

[0060] Historical operation data is loaded and a virtual yard simulator is constructed. This embodiment can adopt a curriculum learning strategy to train the model in stages. The specific steps are as follows:

S1031: Enable only basic size rules.

[0061] This embodiment only considers the basic dimensional rules of containers, such as their length, width, and height. When making decisions, the model only needs to ensure that the selected stacking location can accommodate the container's dimensions. For example, if each stacking location has certain space constraints, the model needs to determine whether the container can be placed at that location without exceeding the space limit.

[0062] Compared to existing technologies that introduce all constraints from the outset, which exposes the model to a complex decision space and increases the difficulty of learning, the curriculum learning strategy starts training with simple, basic rules and gradually increases complexity. This approach allows the model to first master basic decision-making abilities and establish an initial understanding of the environment, laying the foundation for learning more complex rules later.

[0063] S1032: Overlay type matching rules.

[0064] Building upon step S1031, container type matching rules are superimposed. Different types of containers may have different storage requirements and stacking restrictions; for example, refrigerated containers require specific refrigeration equipment, and dangerous goods containers require special secure areas. When making decisions, the model must consider not only the basic size rules but also ensure that the container type matches the stacking location.

[0065] By progressively adding rules, the model can learn new constraints based on its existing knowledge. This phased learning approach reduces the learning difficulty, enabling the model to better understand and handle the relationships between different rules. Compared to introducing all rules at once, this stage allows the model to learn type-matching rules more deeply, improving the accuracy of decision-making.

[0066] S1033: Activate the complete constraint system.

[0067] In this embodiment, all constraints are enabled, including basic size rules, type matching rules, and other possible constraints such as weight balance and job priority. The model needs to comprehensively consider all these constraints to make the optimal decision.

[0068] After training in steps S1031 and S1032, the model has mastered the basic rules and type matching rules. In step S1033, a complete constraint system is introduced, allowing the model to integrate and apply the previously learned knowledge, improving the quality of decision-making. This avoids the learning difficulties and slow convergence problems caused by facing too many constraints at the beginning.
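The staged activation of constraints in S1031 to S1033 can be sketched as a rule set that grows per stage. The rule names, dictionary fields, and the single weight check standing in for the "complete constraint system" are all hypothetical; a real training run would execute many simulated episodes per stage instead of just listing feasible locations.

```python
# Hypothetical rule checks; each takes a container and a location description.
RULES = {
    "size": lambda c, s: c["length_ft"] <= s["max_length_ft"],
    "type": lambda c, s: c["type"] in s["allowed_types"],
    "weight": lambda c, s: c["weight_t"] <= s["max_weight_t"],
}

# Each stage keeps the earlier rules and adds new ones.
CURRICULUM = [
    ("S1031", ("size",)),
    ("S1032", ("size", "type")),
    ("S1033", ("size", "type", "weight")),
]


def feasible_locations(container, locations, active_rules):
    """Indices of locations that satisfy every currently active rule."""
    return [i for i, loc in enumerate(locations)
            if all(RULES[name](container, loc) for name in active_rules)]


def curriculum_feasibility(container, locations):
    # In training, each stage would run policy updates in the simulator
    # with only `active_rules` enforced before moving to the next stage.
    return {stage: feasible_locations(container, locations, active_rules)
            for stage, active_rules in CURRICULUM}


container = {"length_ft": 40, "type": "RF", "weight_t": 28}
locations = [
    {"max_length_ft": 40, "allowed_types": {"GP", "RF"}, "max_weight_t": 30},
    {"max_length_ft": 40, "allowed_types": {"GP"}, "max_weight_t": 30},
    {"max_length_ft": 20, "allowed_types": {"RF"}, "max_weight_t": 30},
    {"max_length_ft": 40, "allowed_types": {"RF"}, "max_weight_t": 25},
]
stages = curriculum_feasibility(container, locations)
```

The feasible set shrinks from three locations to one as rules are added, which shows how the decision space the model faces tightens gradually instead of all at once.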

[0069] This embodiment leverages the computing power of a distributed GPU cluster to accelerate model training. In each epoch, the model makes multiple decisions in the virtual yard simulator and receives rewards or penalties based on those decisions. Through continuous iterative training, the model gradually adjusts its decision-making strategy to maximize cumulative rewards. Using a distributed GPU cluster allows 1000 epochs of training to be completed within an acceptable timeframe, improving training efficiency. Iterating through 1000 epochs also gives the model enough experience to learn effective decision-making strategies for a variety of scenarios, improving model performance and stability.

[0070] S104: After each decision, the yard state and reward signal are fed back to the yard decision model. After a preset number of training rounds, once the objective function of the preset training strategy is reached, the trained yard decision model is obtained.

[0071] In this embodiment, after each decision and action by the model, the storage yard environment returns a new state and reward signal, which is then fed back to the model. Simultaneously, the number of training rounds is accumulated, and the model's training performance in different rounds is recorded.

[0072] The system checks whether the preset number of training epochs has been reached and determines whether the current strategy meets the objective function requirements of the preset training strategy. The objective function is usually constructed based on the reward function, such as maximizing the cumulative reward or minimizing a certain loss. If the preset number of epochs has not been reached or the objective function requirements have not been met, training continues; if both are met, the model is considered to have completed training. When both the number of training epochs and the objective function conditions are met, the trained container yard decision model is output. At this point, the model has learned a better decision strategy from a large amount of training data and feedback information, and can provide effective decision support for container yards in practical applications.

[0073] S105: Embed the yard decision-making model into the yard management system.

[0074] In this embodiment, within the container yard management system, when a container needs to be assigned a stacking location, the system inputs the current yard status information into an embedded decision model. Based on learned strategies, the model outputs the optimal stacking location decision for the container. The system then uses this result to schedule subsequent operations, achieving intelligent container yard management.

[0075] In one embodiment of the present invention, in which the yard environment is a three-dimensional grid yard, a possible implementation is described below in a non-limiting manner.

[0076] In this method, the yard environment is a three-dimensional grid yard. In the three-dimensional grid yard, bays, columns, and tiers are defined, with each grid cell corresponding to one yard location.

[0077] In this embodiment, the definition of the three-dimensional mesh storage yard better reflects the physical structure of the actual storage yard, and can more accurately reflect the spatial layout of the storage yard and the state of each location. By using the state tensor, the spatial information of the storage yard and the feature information of each location are integrated together, providing the model with rich and structured input, enabling the model to better understand the actual situation of the storage yard and thus make more reasonable decisions.

[0078] Define the state tensor, in which one set of dimensions represents the three-dimensional spatial dimensions of the storage yard and a further dimension represents the feature dimension of each grid cell. In the three-dimensional grid storage yard, the following features are set for each location: a load factor representing the occupancy level of the current yard location; and a mask indicating the types of containers that are allowed to be stacked. In the mask, each dimension corresponds to one container type: an element of 1 indicates that the corresponding container type is allowed to be stacked at that location, while an element of 0 indicates that it is not. This helps to quickly identify locations that meet the type requirements when allocating containers.

[0079] A closure flag indicates whether the location is temporarily closed: one value indicates that the location is normal and available, and the other indicates that the location is closed and cannot be used for stacking containers.

[0080] Task relevance measures the degree of matching between the location and subsequent tasks. The higher the value, the stronger the correlation between the location and subsequent tasks. When making decisions, if subsequent tasks have specific requirements for the location, locations with high task relevance can be given priority.

[0081] This embodiment allows the model to more comprehensively evaluate the advantages and disadvantages of each location, thereby making decisions that better meet actual needs. For example, when allocating containers, it is necessary to consider not only whether the location is available, but also whether the location allows for the stacking of that type of container, and the degree of matching with subsequent tasks.
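
The per-location features described above can be sketched as a nested-list state tensor. This is a minimal sketch under stated assumptions: the container type list, the function names, the feature ordering, and the encoding of the closure flag (1.0 for closed) are all hypothetical, since the disclosure does not fix them.

```python
from typing import List, Set

# Hypothetical container type list; one mask element per type.
CONTAINER_TYPES = ["general", "reefer", "hazardous"]


def cell_features(load: float, allowed: Set[str], closed: bool,
                  task_relevance: float) -> List[float]:
    """Feature vector of one grid cell (assumed ordering):
    [load factor, type mask..., closure flag, task relevance]."""
    type_mask = [1.0 if t in allowed else 0.0 for t in CONTAINER_TYPES]
    return [load, *type_mask, 1.0 if closed else 0.0, task_relevance]


def empty_state(bays: int, columns: int, layers: int):
    """State tensor of shape bays x columns x layers x features,
    with every location empty, open, and accepting all types."""
    default = cell_features(0.0, set(CONTAINER_TYPES), False, 0.0)
    return [[[list(default) for _ in range(layers)]
             for _ in range(columns)]
            for _ in range(bays)]


state = empty_state(bays=2, columns=3, layers=4)
# Mark one location as occupied, reefer-only, and relevant to a later task.
state[1][2][0] = cell_features(1.0, {"reefer"}, False, 0.8)
print(state[1][2][0])  # [1.0, 0.0, 1.0, 0.0, 0.0, 0.8]
```

In practice an array library would normally hold this tensor; plain lists are used here only to keep the example dependency-free.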

[0082] Furthermore, before the yard decision-making model is constructed, the container characteristics are also encoded. Each container's characteristics include basic attributes and business attributes.

[0083] The mask is generated based on the hard rules. The generated mask is a three-dimensional Boolean mask matrix in which a valid position is marked as 1 and an invalid position is marked as 0.

[0084] Each grid cell of the yard is traversed, and the hard rules are checked based on the characteristics of the container and the attributes of the yard location.

[0085] If a position meets the hard rules, it is marked as a valid position; otherwise, it is marked as an invalid position.

[0086] The hard rules include: size compatibility, type matching, and whether the location is closed.

[0087] This embodiment pre-encodes the container features and generates the mask matrix, so that valid locations can be selected quickly before a decision is made, which reduces computational complexity. Furthermore, the mask matrix generation process takes all of the hard rules into account, ensuring the validity and accuracy of the decision.
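
The mask generation steps above (traverse each grid cell, check the hard rules, mark 1 or 0) can be sketched as follows. The `Slot` and `Container` classes, their fields, and the way size compatibility is tested (container length not exceeding the slot's maximum length) are assumptions for illustration only; the disclosure names the three hard rules but not their concrete checks.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Slot:
    """Hypothetical attributes of one yard location."""
    max_length_ft: int
    allowed_types: FrozenSet[str]
    closed: bool


@dataclass(frozen=True)
class Container:
    """Hypothetical encoded container characteristics."""
    length_ft: int
    ctype: str


def is_valid(slot: Slot, box: Container) -> bool:
    """Check the three hard rules: size compatibility, type matching,
    and whether the location is closed."""
    return (box.length_ft <= slot.max_length_ft
            and box.ctype in slot.allowed_types
            and not slot.closed)


def build_mask(yard: List[List[List[Slot]]], box: Container) -> List[List[List[int]]]:
    """Traverse every grid cell and mark valid positions 1, invalid 0."""
    return [[[1 if is_valid(slot, box) else 0 for slot in column]
             for column in bay]
            for bay in yard]


slot_ok = Slot(40, frozenset({"general"}), closed=False)
slot_closed = Slot(40, frozenset({"general"}), closed=True)
slot_small = Slot(20, frozenset({"general"}), closed=False)
yard = [[[slot_ok, slot_closed, slot_small]]]  # 1 bay x 1 column x 3 layers

print(build_mask(yard, Container(40, "general")))  # [[[1, 0, 0]]]
```

Because the mask depends on the container being placed, it is rebuilt for each container rather than computed once for the yard.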

[0088] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0089] The following are embodiments of a container yard decision-making system based on deep reinforcement learning provided in this disclosure. This system and the container yard decision-making methods based on deep reinforcement learning in the above embodiments belong to the same inventive concept. For details not described in detail in the embodiments of the container yard decision-making system based on deep reinforcement learning, please refer to the embodiments of the container yard decision-making methods based on deep reinforcement learning described above.

[0090] As shown in Figure 2, the system includes: a model building module, which is used to define the state space and action space of the yard to form the yard environment and to build the yard decision model.

[0091] A reward definition module, which is used to define the reward function by combining hard rules, flexible rules, and factors such as yard storage conditions, container instruction status, and the current status of ships and operating equipment. A model training module, which is used to train the yard decision-making model using a deep reinforcement learning algorithm; during the training process, the yard decision-making model adjusts its decisions based on the yard environment and the reward function.

[0092] A model output module, which is used to feed back the yard state and reward signal to the yard decision model after each decision. After a preset number of training rounds, and once the objective function of the preset training strategy is reached, the trained yard decision model is obtained.

[0093] The model embedding module is used to embed the trained yard decision-making model into the yard management system.

[0094] The container yard decision-making system based on deep reinforcement learning comprises the units and algorithmic steps of the various examples described in conjunction with the embodiments disclosed herein. It can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can implement the described functions using different methods for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0095] Those skilled in the art will understand that various aspects of the deep reinforcement learning-based container yard decision-making method can be implemented as a system, method, or program product. Therefore, various aspects of this disclosure can be specifically implemented in the following forms: a completely hardware implementation, a completely software implementation (including firmware, microcode, etc.), or a combination of hardware and software aspects, collectively referred to herein as a "circuit," "module," or "system."

[0096] As shown in Figure 3, this application also provides an electronic device, including a display module 103, a memory 102, a processor 101, and a computer program stored in the memory 102 and executable on the processor 101. When the processor 101 executes the program, it implements the steps of the container yard decision-making method based on deep reinforcement learning.

[0097] In embodiments of the present invention, electronic devices include, but are not limited to, laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the embodiments described and / or claimed herein.

[0098] In this embodiment, processor 101 may be implemented using at least one of an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), a Field-Programmable Gate Array (FPGA), a processor, a controller, a microcontroller, a microprocessor, or an electronic unit designed to perform the functions described herein. In some cases, such implementations may be implemented within a controller. For software implementations, implementations such as processes or functions may be implemented with separate software modules that allow the performance of at least one function or operation. The software code may be implemented by a software application (or program) written in any suitable programming language, and the software code may be stored in memory and executed by the controller.

[0099] The display module 103 is used to display information input by the user or information provided to the user. The display module 103 may include a display panel, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.

[0100] The memory 102 can be used to store software programs and various data. The memory 102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0101] This application also provides a storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the container yard decision-making method based on deep reinforcement learning.

[0102] The storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0103] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A container yard decision-making method based on deep reinforcement learning, characterized in that the method includes: defining the state space and action space of the yard to form the yard environment, and constructing the yard decision model; defining the reward function based on hard rules, flexible rules, and factors such as yard storage conditions, container instruction status, and the current status of ships and operating equipment; training the yard decision-making model using a deep reinforcement learning algorithm, wherein during the training process the yard decision-making model adjusts its decisions based on the yard environment and the reward function; after each decision, feeding back the state and reward signal of the yard to the yard decision model, and, after a preset number of training rounds and once the objective function of the preset training strategy is reached, obtaining the trained yard decision model; and embedding the yard decision-making model into the yard management system.

2. The container yard decision-making method based on deep reinforcement learning according to claim 1, characterized in that the state space of the yard is defined to comprise: the load at each yard location, whether each location is idle, the container type encoding of each location, and the future yard tasks within a preset time window; the action space is used to select the target location from the available locations; and the yard decision model observes the current yard state at each moment and selects an action from the action space based on the yard policy.

3. The container yard decision-making method based on deep reinforcement learning according to claim 1, characterized in that, in the method, the storage yard environment is a three-dimensional grid storage yard; in the three-dimensional grid storage yard, three spatial coordinate dimensions (bay, column, and layer) are defined so that each grid cell corresponds to a specific storage location in the yard; a state tensor is defined, in which one set of dimensions represents the three-dimensional spatial dimensions of the storage yard and a further dimension represents the feature dimension of each grid cell; and, in the three-dimensional grid storage yard, the following are set for each location: a load factor indicating whether the current storage location is occupied; a mask indicating the types of containers that are allowed to be stacked; a flag marking whether the location is temporarily closed or manually closed; and a task relevance indicating the degree to which the location matches subsequent tasks.

4. The container yard decision-making method based on deep reinforcement learning according to claim 3, characterized in that, before the yard decision model is constructed, the container characteristics are also encoded, each container's characteristics including basic attributes and business attributes; the mask is generated based on the hard rules, the generated mask being a three-dimensional Boolean mask matrix in which a valid position is marked as 1 and an invalid position is marked as 0; each grid cell of the yard is traversed and the hard rules are checked based on the characteristics of the container and the attributes of the yard location; if a position meets the hard rules, it is marked as a valid position, and otherwise it is marked as an invalid position; and the hard rules include: size compatibility, type matching, and whether the location is closed.

5. The container yard decision-making method based on deep reinforcement learning according to claim 1, characterized in that the step of training the yard decision model using the deep reinforcement learning algorithm further includes: using a Transformer encoder to process the yard state and extract global features; and outputting the probability distribution over locations using a multilayer perceptron, the distribution being defined in terms of the policy network, the yard state, and the action a.

6. The container yard decision-making method based on deep reinforcement learning according to claim 5, characterized in that, during the training process, the PPO algorithm is used to train the yard decision-making model, and the objective function is defined in terms of: the policy parameters to be optimized; the action probability distribution of the current policy; the fixed reference parameters; a term quantifying the relative value of actions; and a trust-region parameter controlling policy updates.

7. The container yard decision-making method based on deep reinforcement learning according to claim 1, characterized in that, in the method, the reward function is set as follows: for the hard rules, a hard-rule reward term is set for actions that violate the hard rules on container type and location, in which a positive integer represents the penalty applied to such actions, otherwise ...; for the flexible rules, a flexible-rule reward term is set, which is dynamically adjusted based on the storage conditions in the yard and the priority of containers, and which is defined in terms of the container priority, the average load of the storage yard, and a reward coefficient for the flexible rules; for the storage conditions in the yard, a storage-status reward term is set, which depends on the load of the location corresponding to the action and on a reward coefficient for the storage status; the method also defines subsequent tasks for the storage yard and sets a subsequent-task reward term, in which, if the action matches the location required by the subsequently planned instructions, the term takes a positive number representing the reward, otherwise ...; and the reward function is formed from the above reward terms.

8. A container yard decision-making system based on deep reinforcement learning, characterized in that the system is used to implement the container yard decision-making method based on deep reinforcement learning as described in any one of claims 1 to 7, and the system includes: a model building module, which is used to define the state space and action space of the yard to form the yard environment and to build the yard decision model; a reward definition module, which is used to define the reward function by combining hard rules, flexible rules, and factors such as yard storage conditions, container instruction status, and the current status of ships and operating equipment; a model training module, which is used to train the yard decision-making model using a deep reinforcement learning algorithm, wherein during the training process the yard decision-making model adjusts its decisions based on the yard environment and the reward function; a model output module, which is used to feed back the state and reward signal of the yard to the yard decision model after each decision, the trained yard decision model being obtained after a preset number of training rounds and once the objective function of the preset training strategy is reached; and a model embedding module, which is used to embed the trained yard decision-making model into the yard management system.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the container yard decision-making method based on deep reinforcement learning as described in any one of claims 1 to 7.

10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the container yard decision-making method based on deep reinforcement learning as described in any one of claims 1 to 7.