Machine learning device, inference device, machine learning method, recording medium, and method for generating trained model

The machine learning device improves reinforcement learning efficiency and accuracy by calculating future losses and updating models based on corrected rewards, optimizing loading and unloading operations at a tank base.

WO2026140530A1 · PCT designated stage · Publication Date: 2026-07-02 · ENEOS HLDG INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ENEOS HLDG INC
Filing Date: 2025-11-07
Publication Date: 2026-07-02

Application Information

IPC: G06N3/092; G06N20/00

AI Tagging

Technology Topics

Engineering Data mining

AI Technical Summary

Technical Problem

Existing reinforcement learning methods using neural networks are inefficient and lack accuracy in considering future situations, leading to suboptimal performance in tasks such as loading and unloading raw materials at a tank base.

Method used

A machine learning device that performs reinforcement learning by calculating a future loss based on a target state and updating the learning model using a corrected reward to improve the efficiency and accuracy of learning, specifically for tasks involving loading and unloading raw materials at a tank base.

Benefits of technology

Enhances the efficiency and accuracy of learning processes by considering future scenarios, resulting in improved planning and execution of loading and unloading operations at a tank base.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This machine learning device is for performing reinforcement learning on a learning model that provides an output relating to an action of an agent (A1) from a state of an environment (E1), and comprises: a memory that stores a program; and a processor that executes the program stored in the memory. The processor acquires a current state of the environment (E1) as a first state, determines an action of the agent (A1) from the first state using the learning model, acquires a first reward that is a reward for the determined action, and a second state that is the state of the environment (E1) changed due to the determined action, calculates a future loss that is a loss incurred in the future on the basis of the second state and a target state corresponding to a future situation of the environment (E1), and updates the learning model on the basis of the future loss and the first reward.

Description

Machine Learning Device, Inference Device, Machine Learning Method, Recording Medium, and Method for Generating Trained Model

[0001] (Cross-reference to related applications) This application claims the benefit of priority of Japanese Patent Application No. 2024-226074, filed on December 23, 2024, the entire specification of which is incorporated herein by reference. (Technical field) The present disclosure relates to a machine learning device, an inference device, a machine learning method, a recording medium, and a method for generating a trained model.

[0002] Various methods have been proposed to solve complex problems. For example, techniques have been devised that infer an optimal solution to a problem using a model trained by machine learning.

[0003] For example, Patent Document 1 describes training a neural network by applying reinforcement learning, which is one form of machine learning.

[0004] Patent Document 1: Japanese Patent No. 7335434

[0005] However, in reinforcement learning using a learning model such as a neural network, there is room to improve the efficiency of learning by considering future situations and to improve the accuracy of learning.

[0006] In view of the above problems, an object of the present disclosure is to provide a machine learning device, an inference device, a machine learning method, a recording medium, and a method for generating a trained model that can improve the efficiency and accuracy of learning.

[0007] To solve the above problems, a machine learning device according to one aspect of the present disclosure is a machine learning device that performs reinforcement learning on a learning model that outputs an agent's action from the state of the environment, and comprises a memory that stores a program and a processor that executes the program stored in the memory, wherein the processor acquires the current state of the environment as a first state, uses the learning model to determine the agent's action from the first state, acquires a first reward which is the reward for the determined action and a second state which is the state of the environment changed by the determined action, calculates a future loss which is a loss for the future based on the second state and a target state corresponding to a future situation in the environment, and updates the learning model based on the future loss and the first reward.

[0008] A machine learning method according to another aspect of the present disclosure is a machine learning method that performs reinforcement learning on a learning model that outputs an agent's action from the state of an environment, comprising the steps of: obtaining the current state of the environment as a first state; determining the agent's action from the first state using the learning model; obtaining a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and updating the learning model based on the future loss and the first reward.

[0009] A recording medium according to another aspect of the present disclosure is a non-transitory computer-readable recording medium that records a machine learning program. The machine learning program performs reinforcement learning on a learning model that outputs an agent's action from the state of an environment, and causes a computer to perform the steps of: acquiring the current state of the environment as a first state; determining the agent's action from the first state using the learning model; acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and updating the learning model based on the future loss and the first reward.

[0010] A method for generating a trained model according to another aspect of the present disclosure is a method for generating a trained model by performing reinforcement learning on a learning model that outputs an action of an agent from the state of an environment, comprising: a step of obtaining the current state of the environment as a first state; a step of using the learning model to determine the action of the agent from the first state; a step of obtaining a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; a step of calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and a step of updating the learning model based on the future loss and the first reward.

[0011] Furthermore, this disclosure may be implemented as a semiconductor integrated circuit that implements part or all of the program, as an information processing device, or as a system including an information processing device.

[0012] The machine learning apparatus, inference apparatus, machine learning method, recording medium, and method for generating a trained model described herein can improve the efficiency and accuracy of learning.

[0013] (Brief description of drawings)
This diagram schematically shows an example of the configuration of a planning system according to the embodiment of this disclosure.
This diagram shows an example of raw material loading and unloading to and from a tank base.
This diagram shows the relationship between the environment and the agent related to machine learning.
This diagram schematically shows an example of the hardware configuration of a server device.
This block diagram shows an example of various functions in the server device.
This diagram shows an example of loading and unloading plan information.
This diagram shows an example of the probability distribution of an action.
This diagram shows an example of constraints in the limiting section.
This diagram shows an example of the relationship between planned actions and target states.
This diagram shows an overview of input and output by the trained model.
This flowchart shows an example of the learning process flow.
This flowchart shows an example of the inference process flow.

[0014] The following descriptions illustrate some aspects of this disclosure.

[0015] A machine learning device according to a first aspect of this disclosure is a machine learning device that performs reinforcement learning on a learning model that outputs an agent's action based on the state of the environment, and comprises a memory that stores a program, and a processor that executes the program stored in the memory, wherein the processor acquires the current state of the environment as a first state, uses the learning model to determine the agent's action from the first state, acquires a first reward which is the reward for the determined action, and a second state which is the state of the environment changed by the determined action, calculates a future loss which is a loss for the future based on the second state and a target state corresponding to a future situation in the environment, and updates the learning model based on the future loss and the first reward.
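The sequence in the first aspect (acquire a first state, determine an action, acquire a first reward and a second state, calculate a future loss, update the model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the learning model is reduced to a list of action preferences, `toy_env_step` is a hypothetical one-tank environment, the future loss is an absolute difference, and the update is a REINFORCE-style rule driven by the first reward minus a weighted future loss.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_env_step(level, action):
    """Hypothetical environment: action 0 adds 10 units, action 1 adds 30."""
    second_state = level + (10.0, 30.0)[action]
    first_reward = 1.0  # constant immediate reward, for illustration only
    return first_reward, second_state


def train_step(prefs, first_state, target_state, lr=0.1, weight=1.0, rng=random):
    """One update: act, observe, compute the future loss, adjust preferences."""
    probs = softmax(prefs)
    action = rng.choices(range(len(prefs)), weights=probs)[0]
    first_reward, second_state = toy_env_step(first_state, action)
    # Assumed form of the future loss: distance from the target state.
    future_loss = abs(second_state - target_state)
    # Assumed learning signal: first reward reduced by the future loss.
    signal = first_reward - weight * future_loss
    # REINFORCE-style preference update (an assumption, not the patent's rule).
    new_prefs = [
        p + lr * signal * ((1.0 if i == action else 0.0) - probs[i])
        for i, p in enumerate(prefs)
    ]
    return new_prefs, second_state, future_loss
```

With a first state of 20 and a target state of 50, either sampled action shifts preference toward action 1, the one that lands on the target: choosing action 0 yields a negative signal that pushes probability away from it, and choosing action 1 yields a positive signal that reinforces it.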

[0016] In a machine learning apparatus according to a second aspect of this disclosure, relating to the first aspect, the processor calculates the future loss based on the difference between the second state and the target state.

[0017] In a machine learning device according to a third aspect of this disclosure, relating to the first or second aspect, the processor corrects the first reward based on the future loss to obtain a second reward, calculating the second reward such that the larger the future loss, the smaller the value of the second reward is compared to the first reward, and updates the learning model based on the second reward.
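The third aspect states only a monotonic relationship: the larger the future loss, the smaller the second reward relative to the first reward. One simple form that satisfies it is a weighted subtraction, sketched below. The subtractive form, the `weight` parameter, and the function name are assumptions for illustration; the disclosure as quoted here does not fix a formula.

```python
def corrected_reward(first_reward: float, future_loss: float, weight: float = 1.0) -> float:
    """Return a second reward that shrinks as the future loss grows.

    Assumed form: second_reward = first_reward - weight * future_loss.
    """
    if future_loss < 0:
        raise ValueError("future_loss is expected to be non-negative")
    if weight <= 0:
        raise ValueError("weight must be positive to keep the relation monotonic")
    return first_reward - weight * future_loss
```

With this form, a future loss of zero leaves the first reward unchanged, and any positive future loss strictly lowers it.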

[0018] In the machine learning apparatus according to the fourth aspect of this disclosure, relating to the third aspect, the processor updates the learning model such that the second reward increases and the future loss decreases.

[0019] In a machine learning device according to a fifth aspect of this disclosure, relating to the first or second aspect, the processor corrects the first reward based on the future loss to obtain a second reward, acquires the first state corresponding to each of the multiple steps included in the episode, determines the agent's action corresponding to each of the first states of each acquired step, acquires the first reward and the second state corresponding to each of the determined actions of each step, calculates the future loss corresponding to each of the second states of each acquired step, calculates the second reward based on the future loss and the first reward corresponding to some of the steps among the multiple steps, and updates the learning model based on the second reward.

[0020] In the machine learning apparatus according to the sixth aspect of this disclosure, according to the fifth aspect, the processor calculates a second reward corresponding to at least two or more of the steps among the plurality of steps, and updates the learning model based on the average value of the plurality of second rewards calculated.
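The fifth and sixth aspects can be read as follows: per-step first rewards and future losses are collected over an episode, second rewards are computed for a subset of the steps, and their average drives the update. A minimal sketch of that averaging is given here, assuming a subtractive correction with a hypothetical `weight` parameter; how the subset of steps is chosen is left to the caller.

```python
def mean_second_reward(first_rewards, future_losses, selected_steps, weight=1.0):
    """Average the corrected (second) reward over the selected steps of one episode.

    first_rewards[t] and future_losses[t] belong to step t.
    Assumed correction: second = first - weight * future_loss.
    """
    if len(first_rewards) != len(future_losses):
        raise ValueError("one first reward and one future loss are needed per step")
    if len(selected_steps) < 2:
        raise ValueError("the sixth aspect averages over two or more steps")
    second_rewards = [
        first_rewards[t] - weight * future_losses[t] for t in selected_steps
    ]
    return sum(second_rewards) / len(second_rewards)
```

For example, with first rewards `[1.0, 2.0, 3.0, 4.0]`, future losses `[0.5, 0.0, 1.0, 2.0]`, and selected steps `[0, 2]`, the second rewards are 0.5 and 2.0, and their average is 1.25.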

[0021] In a machine learning apparatus according to a seventh aspect of this disclosure, relating to the first or second aspect, the processor sets, as the target state, a state of the environment before a planned action that is to be performed in the environment after the second state, the state being one in which the evaluation of the planned action is improved.

[0022] In the machine learning device according to the eighth aspect of this disclosure, according to the seventh aspect, the learning model takes as input data relating to at least one of the plans for loading and unloading of multiple types of raw materials to and from the tank base, and outputs data relating to at least one of the plans for loading and unloading of each of the multiple tanks owned by the tank base.

[0023] In the machine learning apparatus according to the ninth aspect of this disclosure, relating to the eighth aspect, an episode is set whose steps each correspond to a loading or an unloading of raw materials to or from the tank base, and the processor sets unloading as the planned action, sets the step immediately preceding the step corresponding to that unloading as the target step, and sets, as the target state, a state at the target step in which the evaluation related to said unloading is improved.
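The selection of target steps in the ninth aspect can be sketched as an index lookup over the episode's step attributes. The string labels `"loading"` and `"unloading"` and the function name are assumptions for illustration; an unloading at the very first step has no preceding step in the episode and is skipped here, which is also an assumption.

```python
def find_target_steps(attributes):
    """Return indices of steps that immediately precede an unloading step.

    attributes: list of "loading" / "unloading" labels, one per episode step.
    """
    return [
        i - 1
        for i, attribute in enumerate(attributes)
        if attribute == "unloading" and i > 0
    ]
```

For the step sequence loading, loading, unloading, loading, unloading, the target steps are indices 1 and 3.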

[0024] In the machine learning apparatus according to the tenth aspect of this disclosure, relating to the ninth aspect, the processor sets the state of each of the multiple tanks owned by the tank base as the target state.

[0025] In the machine learning apparatus according to the eleventh aspect of this disclosure, according to the tenth aspect, the processor calculates, as the future loss, the difference between the state of each tank indicated by the second state acquired for the target step and the state of each tank indicated by the target state corresponding to that target step.

[0026] In the machine learning apparatus according to the twelfth aspect of this disclosure, according to the eleventh aspect, the processor calculates the difference as the Euclidean distance between the state of each tank indicated by the second state and the state of each tank indicated by the target state.
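The Euclidean distance named in the twelfth aspect can be written out as a short sketch. The representation of a state as a mapping from tank identifier to stored quantity is an assumption for illustration; the disclosure also allows tank states that include the type of raw material, which this sketch does not model.

```python
import math


def euclidean_future_loss(second_state, target_state):
    """Euclidean distance between per-tank quantities of two states.

    Both arguments map a tank identifier to a stored quantity (assumed layout).
    """
    if second_state.keys() != target_state.keys():
        raise ValueError("both states must describe the same tanks")
    squared = sum(
        (second_state[tank] - target_state[tank]) ** 2 for tank in second_state
    )
    return math.sqrt(squared)
```

For example, with second-state quantities of 30, 40, and 0 for tanks 24a, 24b, and 24c, and target quantities of 27, 44, and 0, the per-tank differences are 3, 4, and 0, so the future loss is 5.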

[0027] In a machine learning apparatus according to a thirteenth aspect of this disclosure, relating to the first or second aspect, the processor selects one of the multiple actions of the agent based on the probability distribution of the multiple actions of the agent obtained by the learning model.

[0028] In a machine learning apparatus according to a fourteenth aspect of this disclosure, according to the thirteenth aspect, the processor sets a restriction on each of the multiple actions in the probability distribution and selects one of the multiple actions from among the actions for which no restriction has been set.
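Selecting an action from a probability distribution while excluding restricted actions (the thirteenth and fourteenth aspects) is commonly done by masking and sampling from the remaining weights. The following is a minimal sketch under that assumption; the function name, the representation of restrictions as a set of action indices, and the implicit renormalization over the unrestricted actions are illustrative choices, not details from the disclosure.

```python
import random


def select_action(probs, restricted, rng=random):
    """Sample an action index from probs, never returning a restricted index.

    probs: selection probability per action; restricted: set of blocked indices.
    """
    allowed = [i for i in range(len(probs)) if i not in restricted]
    if not allowed:
        raise ValueError("every action is restricted")
    weights = [probs[i] for i in allowed]
    if sum(weights) <= 0:
        raise ValueError("no probability mass left on unrestricted actions")
    # random.choices normalizes the weights internally.
    return rng.choices(allowed, weights=weights)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing training runs.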

[0029] In the machine learning apparatus according to the 15th aspect of this disclosure, relating to the first or second aspect, the environment corresponds to a raw material tank base, and the agent is a virtual entity that carries out the loading and unloading of the raw materials at the tank base.

[0030] The inference device according to the sixteenth aspect of this disclosure uses the learned model, which has been learned by the above-described machine learning device, as a trained model to infer output data corresponding to input data.

[0031] A machine learning method according to a 17th aspect of this disclosure is a machine learning method that performs reinforcement learning on a learning model that outputs an agent's action from the state of an environment, comprising the steps of: acquiring the current state of the environment as a first state; determining the agent's action from the first state using the learning model; acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and updating the learning model based on the future loss and the first reward.

[0032] A recording medium according to the 18th aspect of this disclosure is a non-transitory computer-readable recording medium that records a machine learning program. The machine learning program performs reinforcement learning on a learning model that outputs an agent's action from the state of an environment, and causes a computer to perform the steps of: acquiring the current state of the environment as a first state; determining the agent's action from the first state using the learning model; acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and updating the learning model based on the future loss and the first reward.

[0033] A method for generating a trained model according to a 19th aspect of the present disclosure is a method for generating a trained model by performing reinforcement learning on a learning model that outputs an action of an agent from the state of the environment, comprising: a step of obtaining the current state of the environment as a first state; a step of using the learning model to determine the action of the agent from the first state; a step of obtaining a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; a step of calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and a step of updating the learning model based on the future loss and the first reward.

[0034] The following description illustrates embodiments of the present disclosure. To facilitate understanding of the description, the same reference numerals are used for identical components and steps in each drawing whenever possible, and redundant descriptions are omitted.

[0035] <Overall Configuration> Figure 1 is a schematic diagram showing an example of the configuration of a planning system 1 according to one embodiment. The planning system 1 is a system that creates a plan corresponding to a pre-set problem.

[0036] As shown in Figure 1, the planning system 1 consists of a server device 2 and a user terminal 3. The server device 2 and the user terminal 3 can communicate with each other via the network NT.

[0037] Server device 2 is an information processing device (computer) that creates a plan to address a problem using information entered by user terminal 3. In this embodiment, one example of a problem is a problem related to the loading (inbound) and unloading (outbound) of raw materials to and from tank base 21. In this embodiment, a problem related to the planning of loading and unloading raw materials to and from tank base 21 is referred to as the "loading and unloading problem".

[0038] Figure 2 shows an example of the loading and unloading of raw materials to and from the tank base 21. The tank base 21 is a base for temporarily storing raw materials. For example, raw materials are loaded into the tank base 21 using a ship F1, and raw materials are unloaded from the tank base 21 using a ship F2, etc. Note that the means for loading and unloading are not limited to ships F1 and F2. The tank base 21 is equipped with multiple tanks 24. In the example shown in Figure 2, the tank base 21 has tank 24a, tank 24b, and tank 24c. Note that the number of tanks 24 that the tank base 21 has is not limited. The tank base 21 can store the loaded raw materials in each of the tanks 24. The tank base 21 can also unload the raw materials stored in each of the tanks 24. Multiple types of raw materials may be loaded into the tank base 21 and stored in each of the tanks 24. Furthermore, the tank base 21 may have multiple tanks 24 into which different types of raw materials are loaded, or multiple types of raw materials may be loaded into a single tank 24. Each of the multiple types of raw materials stored in each tank 24 can be unloaded from that tank 24. Although this embodiment uses the problem of both the loading and the unloading of raw materials to and from the tank base 21 as an example, it may also use the problem of either the loading or the unloading of raw materials to or from the tank base 21 as an example.

[0039] In this embodiment, the server device 2 creates a general plan (overall plan) for the loading and unloading of multiple types of raw materials to and from the tank base 21, and then creates a specific plan for the loading and unloading of each raw material to and from each tank 24 in the tank base 21.

[0040] Specifically, server device 2 is a device that uses machine learning to infer a plan. Therefore, server device 2 corresponds to a machine learning device when performing training and to an inference device when performing inference.

[0041] Figure 3 is a diagram illustrating an overview of machine learning. In this embodiment, the server device 2 performs reinforcement learning. For example, the server device 2 performs deep reinforcement learning. Reinforcement learning is a method in which agent A1 learns what actions will result in a greater reward (evaluation) by completing tasks (solutions to problems) while interacting with the environment E1. Agent A1 is the subject of action. Action is the behavior of agent A1. Environment E1 is both the target of agent A1's actions and a prerequisite for agent A1. That is, the state of environment E1 is the state in which agent A1 is placed within environment E1. The state of environment E1 changes according to the actions agent A1 takes with respect to environment E1. Agent A1 is presented with a reward (evaluation for the action) corresponding to that action. If agent A1 performs a favorable action in environment E1, it is presented with a greater reward compared to if it performs an unfavorable action in environment E1. The reward evaluation method can be set appropriately according to the learning objectives, etc.

[0042] As shown in operation S1, agent A1 takes an action on environment E1 in the given state of environment E1. As a result, the state of environment E1 changes to a new state according to the action taken by agent A1. Furthermore, as shown in operation S2, agent A1 is presented with the new state of environment E1 and a reward corresponding to the action taken by agent A1. In reinforcement learning, learning is performed based on the interaction between environment E1 and agent A1 that occurs by repeating operations S1 and S2. For example, when learning is performed using an episode with multiple steps, an interaction between environment E1 and agent A1, such as operations S1 and S2, is performed for each step. Here, an episode is the flow (period) from the start to the end of the task to be solved in reinforcement learning.
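The repeated S1 and S2 exchange over the steps of an episode can be sketched as a plain loop. The `ToyTankEnv` class, its single-tank state, and its reward rule are hypothetical stand-ins invented for illustration; they are not the simulation model of the disclosure.

```python
class ToyTankEnv:
    """Hypothetical single-tank environment used only to illustrate the loop."""

    def __init__(self, level=0.0, capacity=100.0):
        self.level = level
        self.capacity = capacity

    def step(self, action):
        """Apply a signed quantity change; return (new_state, reward)."""
        requested = self.level + action
        # Assumed reward rule: +1 if the request fits the tank, -1 otherwise.
        reward = 1.0 if 0.0 <= requested <= self.capacity else -1.0
        self.level = min(max(requested, 0.0), self.capacity)
        return self.level, reward


def run_episode(env, policy, n_steps):
    """Repeat S1 (agent acts) and S2 (new state and reward are presented)."""
    state = env.level
    trajectory = []
    for _ in range(n_steps):
        action = policy(state)                 # S1: action chosen for the state
        next_state, reward = env.step(action)  # S2: new state and reward
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

A fixed policy such as `lambda state: 30.0` run for three steps from a level of 50 fits once and then overflows twice, so the rewards are 1.0, -1.0, -1.0 and the level ends clamped at 100.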

[0043] Also, the action selected by agent A1 in operation S1 is determined based on policy W1. Policy W1 is a rule (policy) that serves as an indicator for determining the action of agent A1 according to the state of environment E1 before the action. For example, policy W1 is information in which a plurality of actions that agent A1 can take corresponding to the state of environment E1 are associated with the probability (selection probability) of each action being executed. For example, reinforcement learning aims to enable agent A1 to execute a more preferable action in the state of environment E1 by adjusting policy W1. Note that policy W1 may be information in which a plurality of actions that agent A1 can take corresponding to the state of environment E1 are associated with Q-values (action evaluations) for each action by applying Q-learning. Also, the Q-values may be converted into probabilities corresponding to each action, for example, by using a softmax function or the like.
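The conversion of Q-values into selection probabilities with a softmax function can be sketched as follows. The `temperature` parameter is an assumption added for illustration (a value of 1.0 gives the plain softmax); subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math


def q_values_to_probs(q_values, temperature=1.0):
    """Convert per-action Q-values into selection probabilities via softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

Actions with higher Q-values receive higher probabilities, equal Q-values receive equal probabilities, and the probabilities sum to one.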

[0044] In deep reinforcement learning, policy W1 is obtained from the state of environment E1 using a neural network. That is, the neural network produces an output related to the action of agent A1 from the state of environment E1. Note that the neural network may output policy W1 covering a plurality of actions, or may output policy W1 for one action. The neural network is then updated (trained) so that the reward is maximized. That is, the neural network is an example of a "learning model". Note that the learning model is not limited to a neural network and can be appropriately changed according to the reinforcement learning method selected by the user, the type of problem, and the like.

[0045] Returning to Figure 1, user terminal 3 is a terminal device and is an information processing device (computer) used by the user. User terminal 3 is, for example, a personal computer or the like. The user can input various information through user terminal 3 and instruct server device 2 to create a plan.

[0046] <Hardware Configuration> Figure 4 is a diagram schematically showing an example of the hardware configuration of server device 2.

[0047] As shown in Figure 4, the server device 2 includes a control device 40, a communication device 41, and a storage device 42. The control device 40 is mainly composed of a CPU (Central Processing Unit) 43 and a memory 44.

[0048] In the control device 40, the CPU 43, which is an example of a processor, executes a predetermined program stored in the memory 44 or the storage device 42 or the like, and functions as various functional configurations described later. The memory 44 is a computer-readable storage medium, and may be composed of, for example, at least one of a RAM (Random Access Memory), a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), and the like. The memory 44 can store various data, and store programs and the like necessary for executing processing in the server device 2.

[0049] The communication device 41 is composed of a communication interface or the like for communicating with an external device. The communication device 41 can communicate with, for example, the user terminal 3.

[0050] The storage device 42 is a non-transitory computer-readable recording medium, and is composed of, for example, a hard disk, a solid state drive, or the like. The storage device 42 stores various programs, various information, and information on processing results necessary for executing processing in the control device 40. Other examples of non-transitory computer-readable recording media include portable recording media such as magnetic tapes, flexible disks, optical disks, digital versatile disks, Blu-ray disks, magneto-optical disks, memory cards, and USB memories.

[0051] The server device 2 may consist of a single information processing device or multiple information processing devices. Furthermore, Figure 4 only shows a portion of the main hardware configuration of the server device 2, and the server device 2 may have other configurations. For example, the server device 2 may further include an input device (not shown) and a display device (not shown). The input device (e.g., a keyboard, mouse, etc.) receives user operations and inputs those operations to the server device 2. The display device (e.g., a display, etc.) outputs characters and images. The server device 2 may have an integrated input device and display device (e.g., a touch panel). The user terminal 3, like the server device 2, includes a control device (CPU and memory), a communication device, a storage device, an input device, and a display device.

[0052] <Functional Configuration> Figure 5 is a block diagram showing an example of various functions in the server device 2. Various processes are executed according to the functions in each block. Computer programs that implement the functions of at least some of the functional blocks shown in Figure 5 may be installed in the storage of one or more computers. The CPU of one or more computers may perform the functions of multiple functional blocks shown in Figure 5 by reading the computer programs installed on its own machine into main memory and executing them.

[0053] Furthermore, the functions of each functional block shown in Figure 5 may be executed by a single computer, or they may be executed in a distributed manner across multiple computers. When the functions of each functional block shown in Figure 5 are executed in a distributed manner across multiple computers, these multiple computers may send and receive data via a communication network including a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet.

[0054] As shown in Figure 5, the server device 2 has a functional configuration that mainly consists of a learning unit 51 and an inference unit 52. In other words, in the server device 2, the learning unit 51 functions as a machine learning device, and the inference unit 52 functions as an inference device.

[0055] The learning unit 51 performs learning (deep reinforcement learning) on the learning model. The learning unit 51 mainly comprises a simulation unit 60, a first acquisition unit 61, a decision unit 62, a limiting unit 63, a second acquisition unit 64, a setting unit 65, a calculation unit 66, a correction unit 67, and an update unit 68.

[0056] The simulation unit 60 executes a simulation related to the loading and unloading problem. For example, with respect to the loading and unloading problem, environment E1 corresponds to a simulation model of the tank base 21. Agent A1 is a virtual entity that performs actions such as loading and unloading at the tank base 21. Agent A1 executes an action determined according to policy W1 based on the state of the tank base 21. An episode is the entire loading and unloading problem (all processes), and the loading and unloading of raw materials to and from the tank base 21 each corresponds to a step (stage). That is, loading included in the loading and unloading problem corresponds to one step, unloading included in the loading and unloading problem corresponds to another step, and unloading and loading correspond to different steps. Then, policy W1 is learned so that the best action (or the action closest to the best) can be determined for the loading and unloading problem. That is, learning is performed on the neural network that outputs policy W1 in response to the loading and unloading problem.

[0057] Specifically, the neural network can receive input data related to the planning of loading and unloading multiple types of raw materials to and from the tank base 21. For example, the input data may include at least one of the following: initial inventory information, loading and unloading plan information, and constraint information.

[0058] The initial inventory information shows the inventory status of each tank 24 in the initial state of the loading and unloading problem. The inventory status is the type and quantity of raw materials stored in each tank 24. In other words, the initial inventory information is information that associates tank 24 (name, identification information, etc.) with the type of raw materials stored and the quantity (weight or percentage) stored.

[0059] The loading and unloading plan information is the overall plan for loading and unloading included in the loading and unloading problem. Figure 6 shows an example of loading and unloading plan information. The loading and unloading plan information associates an attribute, a date (or sequence number), and a loading or unloading quantity for each type of raw material. The attribute indicates loading or unloading. The loading or unloading quantity for each type of raw material indicates the amount of that raw material loaded into or unloaded from the tank base 21. In other words, the loading and unloading plan information does not specify loading or unloading for individual tanks 24, but shows the loading and unloading plan for the tank base 21 as a whole. In Figure 6, the loading or unloading quantity for each type of raw material is shown as the amount of each type of raw material corresponding to the attribute (loading or unloading). The example in Figure 6 shows a case where six types of raw materials, raw material G1, raw material G2, raw material G3, raw material G4, raw material G5, and raw material G6, are handled in the loading and unloading problem. Furthermore, the loading and unloading plan information may be associated with the identification information (name, etc.) of the vessel involved in the loading or unloading.

[0060] The constraint information indicates the constraints in the loading and unloading problem. For example, the constraint information includes at least one of the following: upper and lower limit constraints on the inventory in tank 24, constraints on the number of tanks 24 used during loading and unloading, minimum quantities related to loading and unloading, concentration constraints, and tank internal layer separation constraints. The concentration constraint is a constraint on the combination and proportion (concentration) of multiple types of raw materials stored in one tank 24. The tank internal layer separation constraint is information that restricts the combination and order in which multiple types of raw materials are stored to prevent separation of multiple types of raw materials within tank 24. The constraint information may also include upper and lower limit constraints on the API (American Petroleum Institute) specific gravity of tank 24. Here, API specific gravity refers to the specific gravity of crude oil as defined by the American Petroleum Institute. API specific gravity is a value that can be measured, for example, in accordance with ASTM D1298. In this disclosure, API specific gravity may be simply referred to as "API".

[0061] Furthermore, the input to the neural network may be preprocessed to ensure that it can be input to the neural network in accordance with the information described above.

[0062] Furthermore, the neural network can output data related to the loading and unloading plans for each of the multiple tanks 24 located at the tank base 21.

[0063] For example, the output data includes information that associates tank selection information, raw material quantity information, and type information. The tank selection information specifies the tank 24 to be operated on regarding loading or unloading. That is, the tank 24 to be loaded into or unloaded is selected in the tank selection information. The raw material quantity information is information that indicates the amount of raw material to be loaded into or unloaded from the selected tank 24, corresponding to the selected tank 24. The type information is information that indicates the type of raw material to be loaded into or unloaded from the selected tank 24, corresponding to the selected tank 24. For example, the output data is shown as loading 100 kl of raw material G3 into tank 24a. If the loading / unloading problem involves multiple steps (loading or unloading), the tank selection information, raw material quantity information, and type information are output corresponding to each step.

[0064] Furthermore, the output data may include evaluation indicators corresponding to the output tank selection information, raw material quantity information, and type information. Examples of evaluation indicators include the raw material agreement rate, the raw material group agreement rate, and the API error. The raw material agreement rate is the ratio of the planned quantity (actual quantity) of raw material to the required quantity (ideal quantity) for each type. For example, in raw material shipment, the raw material agreement rate is the ratio of the quantity of raw material to be shipped to the quantity of raw material required for shipment. When calculating the raw material agreement rate for multiple types, for example, the average of the raw material agreement rates for each type of raw material is used. The raw material group agreement rate is the ratio of the planned quantity (actual quantity) to the required quantity (ideal quantity) of raw material for each group. A group is, for example, a set of multiple types of raw materials with similar properties. For example, in raw material shipment, the raw material group agreement rate is the ratio of the total quantity of raw material from the same group to be shipped to the total quantity of raw material from that group required for shipment. When calculating the raw material group agreement rate for multiple groups, for example, the average of the raw material group agreement rates for each group is used. The API error is the difference between the API of the required raw material (e.g., the average of the APIs of the various raw materials) and the API of the planned raw material (e.g., the average of the APIs of the various raw materials). For example, in the shipment of raw materials, the API error represents the difference between the API of the raw material required for shipment and the API of the raw material scheduled to be shipped.
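As a minimal sketch of the evaluation indicators described above, the following Python snippet computes a raw material agreement rate, a raw material group agreement rate, and an API error. The function names, the dictionary layout, the sample quantities, the grouping, and the use of an absolute value for the API error are assumptions made for illustration; the description only states that the rates are ratios of planned to required quantities (averaged over types or groups) and that the API error is a difference of API values.

```python
def agreement_rate(planned, required):
    """Average, over raw material types, of planned quantity / required quantity."""
    rates = [planned.get(kind, 0.0) / need for kind, need in required.items() if need > 0]
    return sum(rates) / len(rates)


def group_agreement_rate(planned, required, groups):
    """Average, over groups, of total planned quantity / total required quantity."""
    rates = []
    for members in groups.values():
        need = sum(required.get(kind, 0.0) for kind in members)
        if need > 0:
            rates.append(sum(planned.get(kind, 0.0) for kind in members) / need)
    return sum(rates) / len(rates)


def api_error(required_apis, planned_apis):
    """Absolute difference between mean API values (the sign convention is an assumption)."""
    mean = lambda values: sum(values) / len(values)
    return abs(mean(required_apis) - mean(planned_apis))


# Hypothetical shipment: quantities in kl per raw material type.
required = {"G1": 100.0, "G2": 200.0, "G3": 50.0}
planned = {"G1": 90.0, "G2": 200.0, "G3": 50.0}
groups = {"group_a": ["G1", "G2"], "group_b": ["G3"]}  # hypothetical grouping

rate = agreement_rate(planned, required)
group_rate = group_agreement_rate(planned, required, groups)
error = api_error([30.0, 34.0], [31.0, 35.0])
```

In this example only raw material G1 falls short of its required quantity, so both rates are slightly below 1.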

[0065] Furthermore, the output of the neural network may be subjected to post-processing.

[0066] Thus, the simulation unit 60 is capable of performing simulations related to the loading and unloading problem.

[0067] Returning to Figure 5, the first acquisition unit 61 acquires the current state in environment E1 as the "first state". That is, the first acquisition unit 61 acquires the first state in which agent A1 is located before agent A1 takes action. Specifically, the first acquisition unit 61 observes the state of the tank base 21 as environment E1 and sets it as the first state.

[0068] In the loading and unloading problem, the first state includes, for example, the state of the tanks 24 in the tank base 21 as environment E1. The state of the tanks 24 is information indicating the amount of raw materials stored in each of the tanks 24 and the type of raw materials being stored.

[0069] When learning is performed through an episode, the first acquisition unit 61 acquires a first state corresponding to each of the multiple steps included in the episode.

[0070] The decision unit 62 determines the action of agent A1 based on policy W1. Specifically, the decision unit 62 determines the action of agent A1 based on policy W1 obtained using a neural network in response to the first state. Policy W1 may be restricted by the restriction unit 63, which will be described later. For example, if policy W1 is information of a probability distribution P1 in which multiple actions and probabilities are associated, the decision unit 62 selects one action from among the multiple actions based on the probability distribution P1. In this way, the decision unit 62 determines the action of agent A1 corresponding to the first state.

[0071] When learning is performed through episodes, the decision unit 62 determines the action of agent A1 for each of the first states of each step acquired by the first acquisition unit 61.

[0072] The restriction unit 63 sets restrictions on the action selection based on policy W1 in the decision unit 62. Specifically, the restriction unit 63 sets a restriction for each of the multiple actions in the probability distribution P1 of policy W1. A restriction, for example, prohibits the decision unit 62 from selecting the restricted action. That is, the restriction unit 63 sets a mask for the multiple actions included in policy W1. Figure 7 is a diagram showing an example in which restrictions are set for each action of policy W1. Figure 7 shows an example of the probability distribution P1, in which probabilities are set for each of the actions AC1, AC2, AC3, AC4, AC5, AC6, AC7, AC8, AC9, and AC10. Based on predetermined constraint conditions, the restriction unit 63 sets restrictions so that the decision unit 62 does not select, as the action for agent A1, any action determined to be inappropriate for agent A1 in the first state. In the example in Figure 7, the restrictions set by the restriction unit 63 are shown as a mask MS. Figure 7 shows an example where the decision unit 62 is prohibited from selecting actions AC1, AC2, AC3, AC4, AC9, and AC10 as actions for agent A1. In this case, the decision unit 62 selects one action (for example, action AC6) from among the actions included in policy W1 that are not restricted (actions AC5, AC6, AC7, and AC8), and decides on it as the action for agent A1.
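The masking in Figure 7 can be illustrated with the following Python sketch, which zeroes the probabilities of prohibited actions and samples from the remaining ones. The probability values, the renormalization step, and the function name `masked_sample` are assumptions for illustration; the description only states that masked actions may not be selected.

```python
import random


def masked_sample(probs, mask, rng):
    """Zero out prohibited actions, renormalize, and sample one allowed action index."""
    masked = [p if allowed else 0.0 for p, allowed in zip(probs, mask)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("all actions are masked")
    weights = [p / total for p in masked]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


# Actions AC1..AC10 map to indices 0..9; the probabilities are hypothetical.
probs = [0.1] * 10
# Mask MS of Figure 7: AC1 to AC4, AC9, and AC10 are prohibited (False).
mask = [False] * 4 + [True] * 4 + [False] * 2

action = masked_sample(probs, mask, random.Random(0))  # index of AC5, AC6, AC7, or AC8
```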

[0073] Figure 8 shows an example of the constraint conditions used in the restriction unit 63. For example, the constraint conditions are upper and lower limit constraints on the inventory of tank 24, upper and lower limit constraints on the API of the inventory of tank 24, an upper limit constraint on the number of tanks 24 used when unloading, an upper limit constraint on the number of tanks 24 used when loading, a lower limit constraint on the amount loaded per tank, a lower limit constraint on the amount unloaded per tank, an upper limit constraint on the amount transported when shifting between tanks, a lower limit constraint on the amount transported when shifting between tanks, an upper limit constraint on the number of shifts between tanks per predetermined period (e.g., one month), and an upper limit constraint on the number of shifts between tanks per predetermined period (e.g., one day). A shift is an operation to move raw materials from one tank 24 to another tank 24. The restriction unit 63 may use at least one of the above constraints. In addition, the type of action to which each constraint condition is applied is set. For loading and unloading, the types of actions are the selection of tank 24 when unloading, the amount of raw materials when unloading, the selection of tank 24 when loading, and the amount of raw materials when loading. For shifts, the types of actions include whether or not to use the tank, the selection of the tank 24 to unload from, the selection of the tank 24 to load into, and the amount of raw material. Figure 8 is a diagram showing an example of the correspondence between the constraints and the types of actions to which those constraints can be applied. In Figure 8, "○" indicates that a particular constraint is applicable to a particular action. In addition, each constraint may or may not be assigned a priority. The example in Figure 8 shows a case where three levels of priority (high, medium, low) are set.
In the example in Figure 8, "high" is the highest priority, "medium" is the next highest, and "low" is the lowest. More specifically, if multiple constraints are imposed on the same action, a high-priority constraint is applied in preference to medium-priority and low-priority constraints, and a medium-priority constraint is applied in preference to low-priority constraints. Furthermore, "preferential application" may mean, for example, that when multiple constraints are imposed on the same action, the degree of influence of each constraint on the action is adjusted according to the priority order of those constraints; or it may mean that only the constraint with the highest priority among those multiple constraints is applied to the action; or it may mean that the constraints are applied to the action in such a way that the constraint with the highest priority is always satisfied, while the constraints of other priorities are satisfied as much as possible. Furthermore, the number of priority levels is not limited to three, and can be changed as appropriate by the user, for example, to two levels. Furthermore, the priority of each constraint can be changed as appropriate by the user.
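One of the interpretations of "preferential application" above (the highest-priority constraints are always satisfied, lower priorities are satisfied as far as possible) can be sketched in Python as follows. The function `build_mask`, the representation of a constraint as a (priority, predicate) pair, and the sample limits are hypothetical; this is one possible reading, not the method fixed by the description.

```python
PRIORITY_ORDER = ["high", "medium", "low"]


def build_mask(actions, constraints):
    """Return one boolean per action; True means the action may be selected.

    High-priority constraints are always enforced. Medium- and low-priority
    constraints are enforced only if at least one action remains afterwards.
    """
    allowed = list(actions)
    for level in PRIORITY_ORDER:
        checks = [check for priority, check in constraints if priority == level]
        filtered = [a for a in allowed if all(check(a) for check in checks)]
        if level == "high" or filtered:
            allowed = filtered
    return [a in allowed for a in actions]


# Hypothetical candidate loading amounts (kl) for one tank.
actions = [0, 50, 100, 150, 200]
constraints = [
    ("high", lambda amount: amount <= 150),    # upper limit on tank inventory
    ("medium", lambda amount: amount >= 100),  # lower limit on amount loaded per tank
    ("low", lambda amount: amount <= 40),      # conflicts with the medium constraint
]

mask = build_mask(actions, constraints)
```

Here the low-priority constraint would leave no candidate once the medium-priority constraint has been applied, so it is skipped and the amounts 100 and 150 remain selectable.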

[0074] In this way, the restriction unit 63 sets restrictions on multiple actions before the action decision is made. In this embodiment, the case in which the restriction unit 63 is provided in the learning unit 51 is given as an example, but the restriction unit 63 may be omitted. Furthermore, the constraints used by the restriction unit 63 are not limited to the above example for the loading and unloading problem and may be changed by the user as appropriate. In addition, when a problem other than the loading and unloading problem is applied to the planning creation system 1, constraints that are effective for that problem may be set as the constraints used by the restriction unit 63. In this case as well, when multiple constraints are set, each of the multiple constraints may be assigned an appropriate priority.

[0075] Returning to Figure 5, the second acquisition unit 64 acquires the state of the environment E1 that has changed as a result of the action determined by the decision unit 62 as the "second state". The second acquisition unit 64 also acquires the reward for the action determined by the decision unit 62 as the "first reward".

[0076] Specifically, the second acquisition unit 64 observes the state of the tank base 21 as environment E1 after the action and sets it to the second state. The second acquisition unit 64 also acquires an evaluation of the action that changed the state of the tank base 21 from the first state to the second state as the first reward.

[0077] In the loading and unloading problem, the second state includes, for example, the state of the tank 24 in the tank base 21 as environment E1.

[0078] When learning is performed through episodes, the second acquisition unit 64 acquires the first reward and the second state in accordance with each step's action determined by the decision unit 62.

[0079] The setting unit 65 sets the target state of environment E1. The target state is the desired state of environment E1 corresponding to future circumstances. In other words, the target state indicates a more favorable state of environment E1 considering future circumstances. Specifically, future circumstances are planned actions, that is, actions planned to be performed on environment E1 after environment E1 has entered the second state. Figure 9 shows an example of the relationship between planned actions and the target state. For example, suppose an action is taken to bring goods into environment E1 (first state), and environment E1 enters the second state. If a further action to bring goods into environment E1 is then planned for the second state, that transport of goods is the planned action. Note that future circumstances are not limited to planned actions. The setting unit 65 sets, as the target state, the state of environment E1 before the planned action is taken that improves the evaluation of the planned action. Since planned actions are performed according to the state of environment E1, the state of environment E1 that is more suitable for taking the planned action (pre-action state) becomes the target state. In other words, the target state represents the state of environment E1 prior to the planned action that is more favorable for the planned action to be performed in the future. Because the planned action is performed after environment E1 enters the second state, the target state (the state of environment E1 before the planned action is performed) is the target for the second state of environment E1. That is, the second state of environment E1 and the target state each represent the state of environment E1 at the same stage. As will be described later, the neural network is updated so that the difference between the second state and the target state becomes small. For example, the setting unit 65 sets the target state using a heuristic optimization solver such as Optuna, a genetic algorithm (GA), or random search.

[0080] In the loading and unloading problem, the setting unit 65 sets the state of each of the multiple tanks 24 as a target state. For example, the setting unit 65 sets the ideal state of each of the multiple tanks 24 as a target state. That is, the setting unit 65 sets the target state of each tank 24 at the timing corresponding to the second state as the target state. Specifically, the setting unit 65 sets targets for the amount of raw material stored in each of the tanks 24 and the type of raw material stored.

[0081] For example, the setting unit 65 defines the planned action as "unloading," the step corresponding to unloading as the "unloading step," and the step immediately preceding the unloading step as the "target step." The target step may correspond to either loading or unloading. The setting unit 65 then defines, as the target state, the state of the target step in which the evaluation related to unloading in the unloading step is expected to improve. That is, the state immediately preceding unloading (the state of the target step) in which the evaluation of unloading improves becomes the target state. Since the state of the target step immediately preceding the unloading step is the second state, the second state and the target state correspond to each other. The evaluation related to unloading is, for example, the raw material agreement rate. The raw material agreement rate is the ratio of the amount of raw material to be unloaded to the amount of raw material required in unloading, so the ideal state is one in which each tank 24 can unload the required amount (or close to the required amount) of each raw material. For this reason, the setting unit 65 sets this ideal state of the tanks 24 as the target state. The setting unit 65 estimates the state of each tank 24 before unloading in which the required amount of each type of raw material can be unloaded, and sets this as the target state. For example, the setting unit 65 estimates the state of each tank 24 in which the various raw materials can be unloaded such that the raw material agreement rate is equal to or greater than a set value. The setting unit 65 may also estimate the state of each tank 24 in which the raw material agreement rate is optimized by taking into account the state of the target step (state before action and action content). When the setting unit 65 sets the target state of each tank 24, it is preferable to maintain the material balance. The material balance is the total amount of each type of raw material stored in the multiple tanks 24. That is, the target state of each tank 24 is set such that its material balance is equal to that of the second state.
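The following Python sketch illustrates one way a target state could be found by random search while preserving the material balance. Everything specific here is an assumption for illustration: the state layout (one list of per-material quantities per tank), the simplified evaluation in which each raw material may be unloaded from at most `max_tanks` tanks, and the function names. Tank capacity and the other constraints are ignored.

```python
import random


def unloading_agreement_rate(state, required, max_tanks):
    """Average ratio of unloadable to required quantity, using at most max_tanks tanks per material."""
    rates = []
    for m, need in enumerate(required):
        per_tank = sorted((tank[m] for tank in state), reverse=True)
        available = sum(per_tank[:max_tanks])
        rates.append(min(available, need) / need)
    return sum(rates) / len(rates)


def random_search_target(second_state, required, max_tanks, trials, rng):
    """Search for a redistribution of the same totals that scores at least as well."""
    n_tanks, n_materials = len(second_state), len(required)
    # Material balance: total of each raw material over all tanks stays fixed.
    totals = [sum(tank[m] for tank in second_state) for m in range(n_materials)]
    best = second_state
    best_score = unloading_agreement_rate(second_state, required, max_tanks)
    for _ in range(trials):
        candidate = [[0.0] * n_materials for _ in range(n_tanks)]
        for m, total in enumerate(totals):
            weights = [rng.uniform(0.01, 1.0) for _ in range(n_tanks)]
            scale = total / sum(weights)
            for t in range(n_tanks):
                candidate[t][m] = weights[t] * scale
        score = unloading_agreement_rate(candidate, required, max_tanks)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical second state: 4 tanks, 2 raw materials (kl).
second_state = [[50.0, 0.0], [50.0, 0.0], [0.0, 60.0], [0.0, 60.0]]
required = [90.0, 100.0]  # quantities required by the planned unloading
target, score = random_search_target(second_state, required, 1, 200, random.Random(0))
```

Because the search starts from the second state itself, the returned target never scores worse than the second state under this evaluation.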

[0082] In the above example, the evaluation for setting the target state was given as the raw material agreement rate, but other indicators may be applied. For example, constraint compliance rate or API agreement rate may be used as indicators for evaluation. The constraint compliance rate indicates the percentage of time that constraints are observed. The constraints are, for example, constraints relating to the composition ratios of various raw materials. For example, the setting unit 65 sets the target state so that the composition ratio shown by the state of each tank 24 at the target step approaches the composition ratio shown by the state of each tank 24 at the beginning of the loading / unloading problem. That is, the setting unit 65 sets the target state so that the constraint compliance rate improves. Note that the evaluation method is not limited to the above and other methods may be applied.

[0083] Returning to Figure 5, the calculation unit 66 calculates the "future loss," which is the loss for the future, based on the second state and the target state in environment E1. For example, the calculation unit 66 compares the second state and the target state and calculates the future loss based on the difference between the respective states. The difference is represented, for example, by the Euclidean distance. That is, the calculation unit 66 calculates the difference between the state of each tank 24 indicated by the second state and the state of each tank 24 indicated by the target state using the Euclidean distance and defines it as the future loss.
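As a small illustration of the Euclidean-distance calculation, the Python sketch below treats each state as a list of per-tank quantity vectors and computes the distance over all entries. The state layout and the sample values are assumptions; the description does not fix how tank states are encoded numerically.

```python
import math


def future_loss(second_state, target_state):
    """Euclidean distance between two states given as per-tank lists of quantities."""
    squared = 0.0
    for second_tank, target_tank in zip(second_state, target_state):
        for second_value, target_value in zip(second_tank, target_tank):
            squared += (second_value - target_value) ** 2
    return math.sqrt(squared)


# Hypothetical states: 2 tanks, 2 raw materials (kl).
second_state = [[100.0, 0.0], [0.0, 50.0]]
target_state = [[97.0, 0.0], [0.0, 54.0]]

loss = future_loss(second_state, target_state)  # sqrt(3**2 + 4**2) = 5.0
```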

[0084] When learning is performed through episodes, the calculation unit 66 calculates future losses corresponding to each of the second states of each step acquired by the second acquisition unit 64. In particular, the calculation unit 66 calculates the difference between the state of each tank 24 indicated by the second state acquired in relation to the target step and the state of each tank 24 indicated by the target state corresponding to the target step as the future loss.

[0085] The correction unit 67 corrects the first reward based on the future loss to obtain the "second reward." Specifically, the correction unit 67 subtracts the future loss from the first reward to obtain the second reward. In other words, the larger the future loss, the smaller the second reward becomes compared to the first reward. In this way, the correction unit 67 corrects the first reward by taking the future loss into account and sets the second reward.

[0086] When learning is performed through an episode, the correction unit 67 calculates a second reward using the future loss and first reward corresponding to some of the steps among the multiple steps. Specifically, the correction unit 67 selects (samples) at least two steps from among the multiple steps included in the episode. The selection is performed, for example, randomly. Then, the correction unit 67 calculates a second reward using the future loss and first reward corresponding to each of the selected steps.

[0087] The update unit 68 updates the neural network based on the second reward, that is, based on the future loss and the first reward. Specifically, the update unit 68 updates the neural network so that the second reward becomes larger than before the update. It is preferable that the update unit 68 updates the neural network so that the second reward is maximized. For example, the update unit 68 updates the neural network to obtain policy W1 so that agent A1 can perform actions that increase the second reward (especially the first reward). For example, an algorithm such as Q-learning or SARSA is applied for learning. When the neural network is updated so that the second reward increases, the future-loss component included in the second reward decreases; that is, the neural network is updated so that the future loss becomes smaller than before the update. In other words, the update unit 68 updates the neural network so that the second reward increases and the future loss decreases, for example so that the second reward is maximized and the future loss is minimized. The update is performed, for example, using backpropagation. Note that the method of updating the neural network with the second reward is not limited to backpropagation, and other methods may be used. As a result, learning progresses so that the first reward, which is the evaluation of the action, increases and the future loss decreases.
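To make the role of the second reward in a Q-learning style update concrete, the following Python sketch applies the standard tabular Q-learning rule with the second reward in place of the usual reward. A table is used here as a stand-in for the neural network of the embodiment, and the state labels, action names, learning rate `alpha`, and discount factor `gamma` are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, second_reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update driven by the second (corrected) reward."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = second_reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # all action values start at 0.0
actions = ["load", "unload"]

# Second reward = first reward - future loss, e.g. 1.0 - 0.25.
value = q_update(q, "s0", "load", second_reward=0.75, next_state="s1", actions=actions)
# value = 0.0 + 0.1 * (0.75 + 0.9 * 0.0 - 0.0) = 0.075
```

A larger future loss lowers the second reward and therefore lowers the value learned for the action that led to it.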

[0088] When learning is performed through episodes, the update unit 68 updates the neural network using the second rewards corresponding to the selected steps. Specifically, the update unit 68 updates the neural network using the average value of the second rewards of the selected steps.
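The correction and averaging over sampled steps might look like the following Python sketch. The record layout (dictionaries with `first_reward` and `future_loss` keys), the function names, and the sample values are assumptions for illustration.

```python
import random


def corrected_reward(first_reward, future_loss):
    """Second reward: the first reward reduced by the future loss."""
    return first_reward - future_loss


def average_second_reward(steps, n, rng):
    """Randomly sample n of the recorded steps and average their second rewards."""
    sampled = rng.sample(steps, n)
    second_rewards = [corrected_reward(s["first_reward"], s["future_loss"]) for s in sampled]
    return sum(second_rewards) / len(second_rewards)


# Hypothetical episode with M = 4 recorded steps; N = 2 of them are sampled.
steps = [
    {"first_reward": 1.0, "future_loss": 0.25},
    {"first_reward": 0.5, "future_loss": 0.10},
    {"first_reward": 2.0, "future_loss": 0.70},
    {"first_reward": 1.5, "future_loss": 0.00},
]
mean_second = average_second_reward(steps, 2, random.Random(0))
```

The resulting average is the single value that would be passed to the update of the neural network.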

[0089] In this way, deep reinforcement learning is performed in the learning unit 51.

[0090] The inference unit 52 performs inference using the updated neural network as a trained model. Inference is the process of obtaining output data corresponding to input data using the trained model. Figure 10 shows an overview of the input and output of the trained model. The inference unit 52 inputs, into the trained model, input data regarding the plans for loading and unloading multiple types of raw materials to and from the tank base 21. The input data includes, for example, initial inventory information, loading and unloading plan information, and constraint information. For example, the input data is preprocessed so that it can be input into the neural network. The inference unit 52 then obtains, as an inference result, output data regarding the plans for loading and unloading each raw material for each of the multiple tanks 24 at the tank base 21. The output data includes, for example, information that associates tank selection information, raw material quantity information, and type information. For example, the output of the neural network is post-processed to obtain the output data.

[0091] <Processing Flow> Figure 11 is a flowchart showing an example of the learning process flow according to this embodiment. Each of the following processes is started, for example, according to a user's instruction to start learning. In the learning process, each episode consists of M steps, and each step is associated with a number i (an integer from 1 to M). The order and content of each of the following steps can be changed as appropriate.

[0092] (Step SP10) The simulation unit 60 initializes the model related to the loading and unloading problem. For example, the state of the environment E1 and the weights of the neural network are initialized. That is, each parameter related to the loading and unloading problem is set to its initial state. Each hyperparameter related to learning may also be set. Then, the process moves on to step SP11.

[0093] (Step SP11) The simulation unit 60 sets the step number i included in the episode to 1 (i=1). Then, the process moves on to step SP12.

[0094] (Step SP12) The simulation unit 60 starts the episode and executes step number i. Then, the process moves on to step SP13.

[0095] (Step SP13) The first acquisition unit 61 acquires the current state in environment E1 as the first state. That is, the first acquisition unit 61 acquires the state of environment E1 before agent A1's action in step number i as the first state. Then the process moves on to step SP14.

[0096] (Step SP14) The decision unit 62 determines the action of agent A1 according to the policy W1 obtained from the neural network corresponding to the first state. That is, the decision unit 62 determines the action of agent A1 in step number i. If the policy W1 is the probability distribution P1 of the action, the restriction unit 63 may set a restriction on the action. Then the process moves on to step SP15.

[0097] (Step SP15) The simulation unit 60 applies the determined action of agent A1 to the environment E1. Then, the process moves on to step SP16.

[0098] (Step SP16) The second acquisition unit 64 acquires the state that has changed due to agent A1's actions in environment E1 as the second state, and also acquires the reward for the action as the first reward. That is, the second acquisition unit 64 acquires the second state and the first reward corresponding to environment E1 after agent A1's actions in step i. Then the process moves on to step SP17.

[0099] (Step SP17) The setting unit 65 sets the target state of environment E1. That is, the setting unit 65 sets the target state of environment E1 after the action of agent A1 in step number i. Then the process moves on to step SP18.

[0100] (Step SP18) The calculation unit 66 calculates the future loss based on the second state and the target state. That is, the calculation unit 66 calculates the future loss corresponding to step number i. The first state, second state, first reward, and future loss corresponding to step number i are stored in association with each other. Then the process moves on to step SP19.

[0101] (Step SP19) The simulation unit 60 determines whether the episode has finished. Specifically, the simulation unit 60 determines that the episode has finished if the step number i is the final number (i = M). If the episode has not finished, proceed to step SP20. If the episode has finished, proceed to step SP21.

[0102] (Step SP20) The simulation unit 60 adds 1 to the number i. Then, the process moves to step SP12 and the process is executed again. That is, steps SP12 to SP18 are repeatedly executed until the episode ends.

[0103] (Step SP21) The correction unit 67 samples future losses and first rewards corresponding to some of the steps among the multiple steps. For example, the correction unit 67 samples future losses and first rewards corresponding to N (M > N) steps out of M steps. Then the process moves on to step SP22.

[0104] (Step SP22) The correction unit 67 corrects the first reward by future loss for each of the N sampled steps to obtain the second reward. The correction unit 67 then calculates the average value of the second rewards corresponding to each of the N steps. The process then proceeds to step SP23.

[0105] (Step SP23) The update unit 68 updates the neural network based on the second reward. Specifically, the update unit 68 updates the neural network using the average value of the second reward. This updates the policy W1 for determining the action of agent A1.

[0106] In this way, deep reinforcement learning is performed. That is, a trained model is generated by training (updating) a neural network, which is an example of a learning model. The neural network updated in step SP23 may be evaluated for its learning status using test episodes, etc. If the learning objective is achieved, training is terminated, and if the learning objective is not achieved, training may be restarted from step SP11.
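The per-step part of the flow in Figure 11 (steps SP10 to SP20) can be sketched in Python as follows. All callables passed to `run_episode` are hypothetical placeholders for the simulation unit, the policy, the setting unit, and the calculation unit, and the toy single-tank environment is an assumption chosen only to make the sketch runnable. Sampling, correction, and the update (steps SP21 to SP23) are left out.

```python
def run_episode(reset, step, policy, set_target, future_loss, m_steps):
    """Run one episode of M steps and store the per-step data needed for learning."""
    records = []
    state = reset()                                              # SP10: initial state
    for i in range(1, m_steps + 1):                              # SP11, SP19, SP20: i = 1..M
        first_state = state                                      # SP13: state before the action
        action = policy(first_state)                             # SP14: action from the policy
        second_state, first_reward = step(first_state, action)   # SP15, SP16
        target_state = set_target(second_state, i)               # SP17
        loss = future_loss(second_state, target_state)           # SP18
        records.append({
            "step": i,
            "first_state": first_state,
            "second_state": second_state,
            "first_reward": first_reward,
            "future_loss": loss,
        })
        state = second_state
    return records


# Toy stand-ins (all hypothetical): one tank, inventory in kl, fixed policy and target.
records = run_episode(
    reset=lambda: 0.0,
    step=lambda inventory, amount: (inventory + amount, 1.0),
    policy=lambda inventory: 10.0,
    set_target=lambda second_state, i: 50.0,
    future_loss=lambda second, target: abs(second - target),
    m_steps=5,
)
```

In this toy run the inventory approaches the target of 50 kl by 10 kl per step, so the recorded future loss shrinks from 40 to 0 over the five steps.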

[0107] Figure 12 is a flowchart showing an example of the inference process flow according to this embodiment. Each of the following processes is started, for example, in accordance with a user's instruction to start inference. Note that the order and content of each of the following steps can be changed as appropriate.

[0108] (Step SP30) The inference unit 52 acquires the trained neural network as a trained model. Then, the process moves on to step SP31.

[0109] (Step SP31) The inference unit 52 inputs input data to the neural network. For example, the input data is input to the neural network after preprocessing. Then the process moves on to step SP32.

[0110] (Step SP32) The inference unit 52 obtains output data from the neural network. For example, post-processing may be performed on the output data output from the neural network. Then the processing is completed.
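
The inference flow of steps SP30 to SP32 can be pictured with the following minimal sketch. The names `infer`, `preprocess`, and `postprocess` are hypothetical, and the trained model is assumed to be any callable; the actual pre-processing and post-processing are not specified here.

```python
def infer(model, raw_input, preprocess=None, postprocess=None):
    """Run one inference pass: optional pre-processing, model call,
    optional post-processing. All callables are assumptions for illustration.
    """
    # Step SP31: pre-process the input data (if any) and feed it to the model.
    model_input = preprocess(raw_input) if preprocess else raw_input
    model_output = model(model_input)

    # Step SP32: post-process the model output (if any) and return it.
    return postprocess(model_output) if postprocess else model_output
```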

[0111] <Effects> The server device 2 according to this embodiment is a machine learning device that performs reinforcement learning on a learning model that outputs the action of agent A1 from the state of environment E1, and comprises: a first acquisition unit 61 that acquires the current state in environment E1 as the first state; a decision unit 62 that uses the learning model to determine the action of agent A1 from the first state; a second acquisition unit 64 that acquires a first reward, which is the reward for the determined action, and a second state, which is the state of environment E1 changed by the determined action; a calculation unit 66 that calculates a future loss, which is a loss in the future, based on the second state and a target state corresponding to a future situation in environment E1; and an update unit 68 that updates the learning model based on the future loss and the first reward.

[0112] This configuration allows for efficient learning by updating the learning model based on future losses and the first reward, thus considering losses in future situations. Furthermore, by using future losses to advance learning, the accuracy of the learning model can be improved. In other words, the learning model can learn more favorable actions for agent A1 that take future situations into account. That is, the learning model can learn actions that are considered beneficial from a long-term perspective. Additionally, efficient learning becomes possible, reducing computational burden and saving computer resources.

[0113] Furthermore, in the server device 2, the calculation unit 66 calculates future losses based on the difference between the second state and the target state.

[0114] With this configuration, future losses can be appropriately calculated as the difference between the second state and the target state.

[0115] Furthermore, the server device 2 is further equipped with a correction unit 67 that corrects the first reward based on future losses to obtain a second reward. The correction unit 67 calculates the second reward such that the value becomes smaller relative to the first reward as the future losses increase, and the update unit 68 updates the learning model based on the second reward.

[0116] With this configuration, by using future losses to make a negative adjustment to the first reward, it becomes possible to appropriately reflect future losses in the second reward.

[0117] Furthermore, in the server device 2, the update unit 68 updates the learning model so that the second reward increases and future losses decrease.

[0118] This configuration allows the learning model to be updated to minimize future losses, enabling it to learn more favorable actions for agent A1 that take future situations into account.

[0119] Furthermore, the server device 2 is further equipped with a correction unit 67 that corrects the first reward based on future losses to obtain a second reward. The first acquisition unit 61 acquires a first state corresponding to each of the multiple steps included in the episode, the decision unit 62 determines the action of agent A1 corresponding to each of the first states of each step acquired by the first acquisition unit 61, the second acquisition unit 64 acquires a first reward and a second state corresponding to each of the actions of each step determined by the decision unit 62, the calculation unit 66 calculates future losses corresponding to each of the second states of each step acquired by the second acquisition unit 64, the correction unit 67 calculates a second reward based on the future losses and first rewards corresponding to some of the steps among the multiple steps, and the update unit 68 updates the learning model based on the second reward.

[0120] With this configuration, when an episode contains multiple steps, calculating the second reward from only a sampled subset of the steps keeps the calculation efficient while still reflecting the states of the steps.

[0121] Furthermore, in the server device 2, the correction unit 67 calculates a second reward corresponding to at least two or more steps among the multiple steps, and the update unit 68 updates the learning model based on the average value of the multiple calculated second rewards.

[0122] With this configuration, using the average value of the second rewards corresponding to the multiple sampled steps allows the state of each sampled step to be reflected in learning.

[0123] Furthermore, the server device 2 includes a setting unit 65 that sets the state of environment E1 prior to a planned action as the target state, such that the evaluation of the planned action performed in environment E1 after the second state is improved.

[0124] This configuration makes it possible to set target states that correspond to planned actions as future situations. In other words, the learning model can learn more favorable actions for agent A1 that take planned actions into consideration.

[0125] Furthermore, in the server device 2, the learning model takes data relating to at least one of the plans for loading and unloading multiple types of raw materials to and from the tank base 21 as input, and outputs data relating to at least one of the plans for loading and unloading each of the multiple tanks 24 that the tank base 21 has.

[0126] This configuration allows for the training of a learning model that can handle the problem of raw material loading and unloading.

[0127] Furthermore, in the server device 2, episodes are set that have steps corresponding to the loading and unloading of raw materials to and from the tank base 21. The setting unit 65 sets the planned action as unloading, the step immediately preceding the step corresponding to unloading as the target step, and the state of the target step in which the evaluation related to said unloading is improved as the target state.

[0128] This configuration makes it possible to address the problem of raw material loading and unloading, and to set the target state with unloading as the planned action. In other words, the learning model can learn actions of agent A1 that are more favorable for the planned unloading.
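
As an illustration of how the target step relates to the unloading step, the sketch below returns the index of the step immediately preceding each unloading step. The function name `target_steps` and the encoding of an episode as a list of the strings "loading" and "unloading" are assumptions made for this example only.

```python
def target_steps(step_kinds):
    """Return the indices of the steps immediately preceding each unloading step.

    step_kinds is a hypothetical encoding of an episode, e.g.
    ["loading", "loading", "unloading"]. An unloading step at index 0 has no
    preceding step and therefore yields no target step.
    """
    return [
        i - 1
        for i, kind in enumerate(step_kinds)
        if kind == "unloading" and i > 0
    ]
```

For the episode `["loading", "loading", "unloading", "loading", "unloading"]`, the target steps are the steps at indices 1 and 3.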

[0129] Furthermore, in the server device 2, the setting unit 65 sets the state of each of the multiple tanks 24 owned by the tank base 21 as the target state.

[0130] This configuration makes it possible to set the target state in accordance with the tank base 21.

[0131] Furthermore, in the server device 2, the calculation unit 66 calculates the difference between the state of each tank indicated by the second state acquired in accordance with the target step and the state of each tank 24 indicated by the target state corresponding to the target step as a future loss.

[0132] This configuration makes it possible to calculate future losses based on the state of the tank 24.

[0133] Furthermore, in the server device 2, the calculation unit 66 calculates the difference (future loss) as the Euclidean distance between the state of each tank 24 indicated by the second state and the state of each tank 24 indicated by the target state.

[0134] With this configuration, the difference in the state of the tank 24 between the second state and the target state can be effectively calculated using the Euclidean distance.
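
A minimal sketch of this calculation is shown below. It assumes, purely for illustration, that the state of each tank 24 is summarized by a single number (for example, a stock level) so that the second state and the target state are vectors with one entry per tank; the function name `future_loss` is hypothetical.

```python
import math


def future_loss(second_state, target_state):
    """Euclidean distance between the per-tank values of the second state
    and those of the target state (one number per tank is an assumption).
    """
    # math.dist raises ValueError if the two vectors differ in length.
    return math.dist(second_state, target_state)
```

With two tanks, a second state of `[3.0, 0.0]` and a target state of `[0.0, 4.0]` give a future loss of 5.0.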

[0135] Furthermore, in the server device 2, the decision unit 62 selects one of the multiple actions based on the probability distribution P1 of multiple actions of agent A1 obtained by the learning model.

[0136] With this configuration, an action is determined from the probability distribution P1 (policy W1), and agent A1 can be made to execute the action determined using the learning model.

[0137] Furthermore, the server device 2 is further equipped with a restriction unit 63 that sets restrictions on each of the multiple actions in the probability distribution P1, and the decision unit 62 selects one action from among the multiple actions for which no restrictions have been set.

[0138] This configuration prevents the decision unit 62 from selecting an inappropriate action included in the probability distribution P1.
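
The selection with restrictions can be sketched as follows. This is a hedged illustration: the function name `select_action`, the representation of restrictions as a set of action indices, and the uniform fallback when all unrestricted actions have zero probability are assumptions, not features stated in this disclosure.

```python
import random


def select_action(probabilities, restricted, rng=None):
    """Sample one action index from the probability distribution, skipping
    every index contained in `restricted`.

    The remaining probabilities are used as relative weights, which is
    equivalent to renormalizing over the unrestricted actions.
    """
    rng = rng or random.Random()
    allowed = [i for i in range(len(probabilities)) if i not in restricted]
    if not allowed:
        raise ValueError("every action is restricted")

    weights = [probabilities[i] for i in allowed]
    if sum(weights) <= 0:
        # Assumed fallback: choose uniformly if no probability mass remains.
        return rng.choice(allowed)
    return rng.choices(allowed, weights=weights, k=1)[0]
```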

[0139] Furthermore, the server device 2 uses the learning model trained as described above as a trained model to infer output data corresponding to the input data.

[0140] This configuration allows the trained model to perform inferences, for example, to address loading and unloading problems.

[0141] <Modifications> This disclosure is not limited to the embodiments described above. That is, any modifications made to the embodiments described above by a person skilled in the art are also included in the scope of this disclosure, as long as they retain the features of this disclosure. Furthermore, the elements of the embodiments described above and the modifications described later can be combined to the extent that it is technically possible, and any combination thereof is also included in the scope of this disclosure, as long as it retains the features of this disclosure.

[0142] In the above embodiment, one example is that each function is provided by the server device 2, but each function may also be provided by the user terminal 3. Alternatively, each function may be distributed between the server device 2 and the user terminal 3. For example, the user terminal 3 may function as both a machine learning device and an inference device. Alternatively, one of the server device 2 and the user terminal 3 may function as a machine learning device, and the other of the server device 2 and the user terminal 3 may function as an inference device. Alternatively, two server devices 2 that can communicate with each other may be installed, with one server device 2 functioning as a machine learning device and the other server device 2 functioning as an inference device. Alternatively, multiple server devices 2 that can communicate with each other may be installed, with each function constituting the machine learning device and each function constituting the inference device distributed among multiple server devices 2, and these may function together as a machine learning device or an inference device.

[0143] Furthermore, while the above embodiment described the application of the problem of loading and unloading raw materials at the tank base 21 to the planning system 1 as an example, the problems that can be applied to the planning system 1 are not limited to those described above. For example, the problem of vehicle entry and exit at a parking facility may be applied to the planning system 1. Also, the problem of ship entry and departure at a port may be applied to the planning system 1. Also, the problem of receiving and shipping goods and products at a warehouse or store may be applied to the planning system 1. It should be noted that the problems that can be applied to the planning system 1 are not limited to those described above, and a variety of problems can be applied.

[0144] Furthermore, in the above embodiment, the case in which the difference between the second state and the target state is calculated using the Euclidean distance was given as an example of the future loss, but this difference may be calculated by a method other than the Euclidean distance. In addition, the future loss is not limited to the difference between the second state and the target state (e.g., the Euclidean distance) itself. For example, the future loss may be calculated by multiplying the difference between the second state and the target state by a constant (correction coefficient).

[0145] Furthermore, although the above embodiment described a case where the correction unit 67 calculates the second reward by subtracting future losses from the first reward, it is not limited to this. For example, the correction unit 67 may calculate the second reward by subtracting future losses multiplied by a constant (correction coefficient) from the first reward. As long as the first reward can be corrected with future losses as a negative component, the method for calculating the second reward is not limited.
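
The relationship between the first reward, the future loss, and the second reward described above can be written compactly as follows. The symbols are introduced here only for illustration and do not appear elsewhere in this disclosure: $r^{(1)}_i$ and $r^{(2)}_i$ denote the first and second rewards of step $i$, $L_i$ denotes the future loss of step $i$, and $\lambda$ denotes the correction coefficient, assumed to be positive so that the future loss acts as a negative component.

```latex
r^{(2)}_i = r^{(1)}_i - \lambda \, L_i , \qquad \lambda > 0
```

Under this notation, the plain subtraction described in the above embodiment corresponds to $\lambda = 1$.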

[0146] Furthermore, while the above embodiment described the case where the optimal solution to the loading / unloading problem is learned as an example, it is not limited to this. That is, it is not limited to the case where the optimal solution is learned, but a solution close to the optimal solution may also be learned. Also, a solution above a predetermined level may be learned. In other words, the degree of learning performed on the learning model is not limited.

[0147] The various types of information described in this disclosure (e.g., status, reward, etc.) may be expressed using absolute values, relative values from a given value, or other corresponding information.

[0148] In this disclosure, expressions such as "based on," "using," and "by" (including equivalent expressions) do not mean "based solely on," "using only," or "by only" unless otherwise specified. In other words, the phrase "based on" covers both "based solely on" and "based at least on," and the same applies to equivalent expressions such as "using" and "by."

[0149] The term “decision” in this disclosure may encompass a wide variety of actions. “Decision” may include, for example, judgment, calculation, computation, processing, derivation, investigation, exploration, and confirmation. Furthermore, “decision” may include, for example, considering something to have been “decided,” such as resolving, selecting, choosing, establishing, or comparing. In short, “decision” may include considering any action to have been “decided.”

[0150] In this disclosure, where expressions such as "obtain," "set," "as input," "using," and "based on" (including similar expressions) are applied to information, unless otherwise specified, this includes both cases where the information itself is used and cases where information that has been processed in some way (e.g., with noise added, normalized, reduced to extracted features, or converted to an intermediate representation) is used. Furthermore, where it is stated that some result is obtained by "obtaining," "setting," "as input," "using," or "based on" (including similar expressions), unless otherwise specified, this includes both cases where the result is obtained solely based on the information in question and cases where the result is also influenced by other information, factors, conditions, and/or states. Likewise, where it is stated that something "outputs" information (including similar expressions), unless otherwise specified, this includes both cases where the information itself is output and cases where information that has been processed in some way (e.g., with noise added, normalized, reduced to extracted features, or converted to an intermediate representation) is output.

Claims

1. A machine learning device that performs reinforcement learning on a learning model that outputs an agent's action based on the state of the environment, comprising: a memory that stores a program; and a processor that executes the program stored in the memory, wherein the processor acquires the current state of the environment as a first state; uses the learning model to determine the agent's action from the first state; acquires a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; calculates a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and updates the learning model based on the future loss and the first reward.

2. The machine learning apparatus according to claim 1, wherein the processor calculates the future loss based on the difference between the second state and the target state.

3. The machine learning apparatus according to claim 1 or 2, wherein the processor adjusts the first reward based on the future loss to obtain a second reward, calculates the second reward such that the value becomes smaller relative to the first reward as the future loss increases, and updates the learning model based on the second reward.

4. The machine learning apparatus according to claim 3, wherein the processor updates the learning model so that the second reward is large and the future loss is small.

5. The machine learning apparatus according to claim 1 or 2, wherein the processor corrects the first reward based on the future loss to obtain a second reward, acquires the first state corresponding to each of the multiple steps included in the episode, determines the agent's action corresponding to each of the first states of each acquired step, acquires the first reward and the second state corresponding to each of the determined actions of each step, calculates the future loss corresponding to each of the second states of each acquired step, calculates the second reward based on the future loss and the first reward corresponding to some of the steps among the multiple steps, and updates the learning model based on the second reward.

6. The machine learning apparatus according to claim 5, wherein the processor calculates a second reward corresponding to at least two or more of the steps, and updates the learning model based on the average value of the calculated second rewards.

7. The machine learning apparatus according to claim 1 or 2, wherein the processor sets the state of the environment prior to the planned action, in which the evaluation of the planned action performed in the environment after the second state is improved, as the target state.

8. The machine learning device according to claim 7, wherein the learning model takes data relating to at least one of the plans for loading and unloading of a plurality of types of raw materials to and from a tank base as input, and outputs data relating to at least one of the plans for loading and unloading of each of the plurality of tanks owned by the tank base.

9. The machine learning apparatus according to claim 8, wherein an episode is set having steps corresponding to the loading and unloading of the raw materials to and from the tank base, the processor sets the planned action as unloading, the step immediately preceding the step corresponding to unloading as the target step, and the state of the target step in which the evaluation related to said unloading is improved as the target state.

10. The machine learning apparatus according to claim 9, wherein the processor sets the state of each of the multiple tanks owned by the tank base as the target state.

11. The machine learning apparatus according to claim 10, wherein the processor calculates the difference between the state of each tank indicated by the second state acquired in accordance with the target step and the state of each tank indicated by the target state corresponding to the target step as the future loss.

12. The machine learning apparatus according to claim 11, wherein the processor calculates the difference as the Euclidean distance between the state of each tank indicated by the second state and the state of each tank indicated by the target state.

13. The machine learning apparatus according to claim 1 or 2, wherein the processor selects one of the multiple actions of the agent based on the probability distribution of the multiple actions of the agent obtained by the learning model.

14. The machine learning apparatus according to claim 13, wherein the processor sets a restriction on each of the plurality of actions in the probability distribution and selects one of the plurality of actions from among the actions for which no restriction has been set.

15. The machine learning apparatus according to claim 1 or 2, wherein the environment corresponds to a raw material tank base, and the agent is a virtual entity that loads and unloads the raw materials at the tank base.

16. An inference device that uses the learning model trained by the machine learning device according to claim 1 or 2 as a trained model to infer output data corresponding to input data.

17. A machine learning method for performing reinforcement learning on a learning model that outputs an agent's action based on the state of the environment, comprising: a step of acquiring the current state of the environment as a first state; a step of determining the agent's action from the first state using the learning model; a step of acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; a step of calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and a step of updating the learning model based on the future loss and the first reward.

18. A non-transitory computer-readable recording medium recording a machine learning program that performs reinforcement learning on a learning model that outputs an agent's action based on the state of the environment, the program causing a computer to execute: a step of acquiring the current state of the environment as a first state; a step of determining the agent's action from the first state using the learning model; a step of acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; a step of calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and a step of updating the learning model based on the future loss and the first reward.

19. A method for generating a trained model by performing reinforcement learning on a learning model that outputs an agent's action based on the state of the environment, comprising: a step of acquiring the current state of the environment as a first state; a step of determining the agent's action from the first state using the learning model; a step of acquiring a first reward, which is the reward for the determined action, and a second state, which is the state of the environment changed by the determined action; a step of calculating a future loss, which is a loss for the future, based on the second state and a target state corresponding to a future situation in the environment; and a step of updating the learning model based on the future loss and the first reward.