Information processing device, information processing method, and computer program
By identifying and excluding actions that lead to specific states with no options, the method allows reinforcement learning to proceed efficiently, addressing the challenge of unavailable actions in reinforcement learning.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: ENEOS HLDG INC
- Filing Date: 2025-10-29
- Publication Date: 2026-06-25
Description
Information Processing Apparatus, Information Processing Method, and Computer Program
[0001] The present disclosure relates to an information processing apparatus, an information processing method, and a computer program. This application claims priority based on Japanese Application No. 2024-220216 filed on December 16, 2024, and incorporates all the contents described in the above Japanese application.
[0002] In reinforcement learning, an agent observes the state of the environment and selects an action to be executed from the available action options based on the observed state. The environment is updated according to the selected action, and the agent selects the next action based on the updated state of the environment. In such a reinforcement learning process, a state may arise in which the only action options are ones the agent cannot select, leaving the agent unable to solve the problem.
[0003] Patent Document 1 discloses a reinforcement learning method in which, when a state in which a situation (attention situation) for resetting the environment occurs is observed during learning of environmental data, the features of the first environmental data in the first state in which the attention situation has occurred are compared with the features of the second environmental data in the second state going back in time from the first state, and the difference between the respective features is learned by a learner.
[0004] Patent Document 1: Japanese Unexamined Patent Application Publication No. 2020-166795
[0005] An information processing device according to one aspect of the present disclosure comprises: a memory for storing a learning program; and a processor for executing the learning program, wherein the learning program includes a reinforcement learning program that rewards the agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and the processor includes: a determination unit that determines, based on the action A(i,k) selected from among the actions A(i,j) in step i, whether the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable; an exclusion unit that excludes the action A(i-n,k) actually selected in step i-n from the agent's selection targets when the determination unit determines that the state Si+1 of the environment model is the specific state; and a rewind unit that returns step i of the episode to the previous step i-m when the determination unit determines that the state Si+1 of the environment model is the specific state, where:
- i: a variable representing the step number of the state of the environment model
- Si: a set of variables representing the state of the environment model at step i
- j: an identifier for an action (1 ≤ j ≤ N)
- A(i,j): a set of variables representing the action of identifier j at step i
- k: the identifier of the action actually selected by the agent (1 ≤ k ≤ N)
- A(i,k): a set of variables representing the action of identifier k actually selected at step i
- n: a setting value of 0 or greater representing the number of steps to go back from step i to the step of the excluded action
- m: a setting value representing the number of steps to go back from step i to the step at which the episode resumes (m > n)
[0006]
Figure 1 is a diagram showing an example of the configuration of a machine learning system according to the embodiment.
Figure 2 is a block diagram showing an example of the hardware configuration of a learning device according to the embodiment.
Figure 3 is a diagram showing the relationship between the agent and the environment.
Figure 4 is a diagram for explaining the flow of one episode.
Figure 5 is a diagram showing an example of the probability distribution of actions.
Figure 6 is a diagram showing an example of the configuration of an agent (learner) which is a neural network.
Figure 7 is a functional block diagram showing an example of the functions of the learning device according to the embodiment.
Figure 8 is a diagram for explaining an example of the oil loading and unloading problem.
Figure 9 is a diagram showing an example of initial environment information for initial environment setup.
Figure 10 is a diagram showing an example of a loading and unloading plan that is set.
Figure 11 is a diagram for explaining the mask of actions.
Figure 12 is a diagram for explaining dead ends.
Figure 13 is a diagram for explaining the determination of dead ends by the determination unit.
Figure 14 is a diagram for explaining an example of a mask of actions when it is determined that the state of the environment has transitioned to a dead end.
Figure 15 is a diagram for explaining an example of step reversal.
Figure 16 is a flowchart showing an example of reinforcement learning processing by the learning device according to the embodiment.
Figure 17 is a diagram for explaining a first modified example of step reversal.
Figure 18 illustrates a second modified example of the action mask and step reversal when it is determined that the environmental state transitions to a dead end.
Figure 19 illustrates a third modified example of the action mask and step reversal when it is determined that the environmental state transitions to a dead end.
Figure 20 is a flowchart showing an example of inference processing by the inference device according to the fourth modified example.
[0007] However, in the reinforcement learning method disclosed in Patent Document 1, it is necessary to stop reinforcement learning and perform the complex learning described above each time the environment is reset.
[0008] According to this disclosure, reinforcement learning can continue even when, during the reinforcement learning process, the agent reaches a state in which every action option is one that it cannot select.
[0009] The embodiments of this disclosure are outlined below.
[0010] (1) The information processing device according to this embodiment includes a memory for storing a learning program and a processor for executing the learning program, wherein the learning program includes a reinforcement learning program that rewards the agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and the processor includes: a determination unit that determines, based on the action A(i,k) selected from among the actions A(i,j) in step i, whether the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable; an exclusion unit that excludes the action A(i-n,k) actually selected in step i-n from the agent's selection targets when the determination unit determines that the state Si+1 of the environment model is the specific state; and a rewind unit that returns step i of the episode to the previous step i-m when the determination unit determines that the state Si+1 of the environment model is the specific state, where:
- i: a variable representing the step number of the state of the environment model
- Si: a set of variables representing the state of the environment model at step i
- j: an identifier for an action (1 ≤ j ≤ N)
- A(i,j): a set of variables representing the action of identifier j at step i
- k: the identifier of the action actually selected by the agent (1 ≤ k ≤ N)
- A(i,k): a set of variables representing the action of identifier k actually selected at step i
- n: a setting value of 0 or greater representing the number of steps to go back from step i to the step of the excluded action
- m: a setting value representing the number of steps to go back from step i to the step at which the episode resumes (m > n)
[0011] As a result, after going back to step i-m, the agent will not select action A(i-n,k), which was the cause of the specific state, again. Therefore, even if the agent encounters a specific state in the reinforcement learning process where only action options that it cannot select exist, reinforcement learning can still proceed.
[0012] (2) In (1) above, the specific state may be a state in which all of the actions A(i+1, j) in step i+1 violate the constraints defined in the environment model. This allows reinforcement learning to proceed even when the system falls into a specific state due to constraints in an environment where constraints are imposed.
[0013] (3) In (1) above, m = n + 1 is also acceptable. This allows for efficient reinforcement learning by going back to step i-n-1 immediately preceding the action A(i-n, k) that was excluded from the selection.
[0014] (4) In (1) above, m > n+1 is also acceptable. This allows reinforcement learning to proceed by going back to step i-m, which is several steps earlier than the action A(i-n, k) that was excluded from the selection.
[0015] (5) In (1) above, n may be 0. This makes it possible to exclude action A(i,k), which is the direct cause of falling into a specific state, from the selection.
[0016] (6) In (1) above, n > 0 is also acceptable. This allows for efficient searching of the cause of falling into a specific state if such a cause exists prior to step i.
[0017] (7) In (6) above, the exclusion unit may exclude from the agent's selection a plurality of actions previously selected by the agent. This makes it possible to exclude a plurality of previously selected actions from the selection at once.
[0018] (8) In any one of (1) to (7) above, if the state Si-m+1 of the environment model in the next step i-m+1 is the specific state due to the action A(i-m,k1) actually selected by the agent after returning to step i-m, the exclusion unit may exclude the action A(i-n1,k) actually selected in step i-n1 from the agent's selection targets, and the rewind unit may return step i-m of the episode to the previous step i-m1, where:
- k1: the identifier of the action actually selected by the agent after the step has been returned (1 ≤ k1 ≤ N, k1 ≠ k)
- n1: a setting value representing the number of steps to go back from step i to the step of the excluded action (n1 ≥ m)
- m1: a setting value representing the number of steps to go back from step i to the step at which the episode resumes (m1 > n1)
This makes it possible to efficiently search for a solution that avoids the specific state by repeating the rewind when a single rewind is not enough to avoid falling into the specific state.
[0019] (9) The information processing method according to this embodiment is an information processing method by an information processing device that rewards the agent for the action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and includes the steps of: determining whether the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable, based on the action A(i,k) selected from among the actions A(i,j) in step i; excluding the action A(i-n,k) that was actually selected in step i-n from the agent's selection targets if it is determined that the state Si+1 of the environment model is the specific state; and returning step i of the episode to the previous step i-m if it is determined that the state Si+1 of the environment model is the specific state. Therefore, even if the reinforcement learning process encounters a specific state where only action options that the agent cannot choose exist, reinforcement learning can still proceed.
[0020] (10) The computer program according to this embodiment is a computer program that rewards the agent for the action A(i,k) selected by the agent in step i of an episode, and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and causes the computer to execute the following steps: determine whether the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable, based on the action A(i,k) selected from among the actions A(i,j) in step i; if it is determined that the state Si+1 of the environment model is the specific state, exclude the action A(i-n,k) that was actually selected in step i-n from the agent's selection targets; and if it is determined that the state Si+1 of the environment model is the specific state, return step i of the episode to the previous step i-m. As a result, after going back to step i-m, the agent will not again select the action A(i-n,k) that caused the specific state. Therefore, even if the reinforcement learning process encounters a specific state where only action options that the agent cannot choose exist, reinforcement learning can still proceed.
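The determination, exclusion, and rewind described in items (1) to (10) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the disclosure: the toy environment (a counter that may grow by 2 or 3 but must not exceed 7), the names `run_episode`, `feasible`, `selectable`, and `max_rewinds`, the choice to store exclusions per step number, the clamping of rewind targets to step 1, and the treatment of a next state whose remaining actions are all masked as a dead end.

```python
from typing import Callable, Dict, List, Set, Tuple

# Toy environment (hypothetical): the state is a counter starting at 0.
ACTIONS = [2, 3]   # action identifier j -> amount added to the counter
LIMIT = 7          # constraint: the counter must never exceed LIMIT


def feasible(state: int, j: int) -> bool:
    """An action is feasible when it does not violate the constraint."""
    return state + ACTIONS[j] <= LIMIT


def is_terminal(state: int) -> bool:
    """The episode ends successfully when the counter reaches LIMIT."""
    return state == LIMIT


def run_episode(choose: Callable[[List[int]], int], n: int = 0, m: int = 1,
                max_rewinds: int = 100) -> Tuple[List[int], int, int]:
    """Run one episode with dead-end determination, exclusion, and rewind."""
    assert m > n >= 0
    states: Dict[int, int] = {1: 0}      # states[i] is S_i
    chosen: Dict[int, int] = {}          # chosen[i] is the action taken at step i
    excluded: Dict[int, Set[int]] = {}   # excluded[i] holds actions masked at step i
    i, rewinds = 1, 0

    def selectable(step: int, state: int) -> List[int]:
        masked = excluded.get(step, set())
        return [j for j in range(len(ACTIONS))
                if feasible(state, j) and j not in masked]

    while not is_terminal(states[i]):
        options = selectable(i, states[i])
        if not options:
            raise RuntimeError("no selectable action remains at step %d" % i)
        k = choose(options)
        chosen[i] = k
        next_state = states[i] + ACTIONS[k]
        # Determination: would S_{i+1} leave the agent with no selectable action?
        if not is_terminal(next_state) and not selectable(i + 1, next_state):
            rewinds += 1
            if rewinds > max_rewinds:
                raise RuntimeError("rewind budget exhausted")
            src = max(1, i - n)                      # step of the excluded action
            excluded.setdefault(src, set()).add(chosen[src])
            i = max(1, i - m)                        # rewind target, clamped to step 1
            continue
        states[i + 1] = next_state
        i += 1
    return [chosen[s] for s in range(1, i)], states[i], rewinds


# Greedy stand-in for the policy: always take the first selectable action.
actions, final_state, rewinds = run_episode(lambda options: options[0])
```

With the greedy stand-in and n = 0, m = 1, the first pass would move the counter to 6, where both actions exceed the limit. The sketch then masks the action chosen at step 3, resumes from step 2, and finishes the episode at 7 after a single rewind. Calling `run_episode(..., n=1, m=2)` instead masks the action chosen one step earlier and resumes from step 1.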
[0021] This disclosure can be implemented not only as an information processing device having the characteristic configuration described above, an information processing method using the characteristic processing as a step, and a computer program for causing the information processing device to execute the characteristic processing, but also as an information processing system including the information processing device, or as part or all of the information processing device being implemented as a semiconductor integrated circuit.
[0022] <Details of Embodiments of the Disclosure> The details of embodiments of the disclosure will be described below with reference to the drawings. At least some of the embodiments described below may be combined in any way.
[0023] [1. Machine Learning System] Figure 1 is a diagram showing an example of the configuration of a machine learning system according to an embodiment. The machine learning system 1 according to the embodiment includes a learning device 10 and a terminal device 20. The learning device 10 is an example of an "information processing device".
[0024] The learning device 10 and the terminal device 20 are connected, for example, by a communication line, and are able to communicate data with each other. Figure 1 shows an example where the learning device 10 and the terminal device 20 are connected one-to-one, but the learning device 10 may be connected to multiple terminal devices 20 via a network.
[0025] The learning device 10 performs reinforcement learning. The terminal device 20 includes, for example, an input device and a display device. The user can, for example, operate the terminal device 20 to instruct the start of learning. Furthermore, the user can input setting information for the learning model into the terminal device 20, or input input data for machine learning into the terminal device 20. The terminal device 20 can transmit the setting information or input data to the learning device 10 to set up the learning model or input the input data into the learning device 10. When the learning device 10 finishes machine learning, the learning device 10 transmits the results of machine learning to the terminal device 20, and the terminal device 20 can display the learning results on the display device.
[0026] For example, the learning device 10 is a computer. In a specific example, the learning device 10 is a computer with large-scale and high-speed computing capabilities, such as a supercomputer. The learning device 10 may be a computer dedicated to machine learning, or it may be a general-purpose computer. The learning device 10 may be a server, and the terminal device 20 may be a client.
[0027] For example, the learning device 10 may have the functions of the terminal device 20. In this case, the machine learning system 1 consists of a single learning device 10.
[0028] [2. Hardware Configuration of the Learning Device] Figure 2 is a block diagram showing an example of the hardware configuration of the learning device according to this embodiment.
[0029] The learning device 10 includes a processor 101, a non-volatile memory 102, a volatile memory 103, and an interface (hereinafter also referred to as "IF") 104. The processor 101, the non-volatile memory 102, the volatile memory 103, and the IF 104 are each connected to one another by a bus (data bus). The processor 101, the non-volatile memory 102, the volatile memory 103, and the IF 104 can each transmit data to one another via the bus.
[0030] The volatile memory 103 is a semiconductor memory such as SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory). The non-volatile memory 102 is a rewritable storage device such as flash memory or a hard disk. The learning program 200 is stored in the non-volatile memory 102. The learning function of the learning device 10 is realized when the learning program 200 is executed by the processor 101. The learning program 200 is an example of a "computer program".
[0031] The processor 101 is, for example, a CPU (Central Processing Unit). However, the processor 101 is not limited to a CPU. The processor 101 may also be a GPU (Graphics Processing Unit). In a specific example, the processor 101 is a multi-core processor. The processor 101 may also be a single-core processor. The processor 101 may include multiple processors or cores and be capable of performing parallel processing. The processor 101 is configured to execute computer programs. The processor 101 may include, for example, an ASIC (Application Specific Integrated Circuit) as part, or programmable hardware such as an FPGA (Field Programmable Gate Array) or a CPLD (Complex Programmable Logic Device) as part.
[0032] The learning program 200 includes an agent 210 and an environment model 220. The agent 210 is composed of a learner (learning model). The environment model 220 provides the environment in which the agent 210 acts. The environment model 220 is, for example, a simulator that simulates a real environment. Both the agent 210 and the environment model 220 are computer programs. Hereinafter, the environment model will also be simply referred to as the "environment".
[0033] The IF 104 is a communication interface for communication with the terminal device 20. For example, the IF 104 is an Ethernet interface ("Ethernet" is a registered trademark).
[0034] [3. Reinforcement Learning] The following describes reinforcement learning that can be performed using the learning program 200.
[0035] Figure 3 shows the relationship between the agent and the environment. In reinforcement learning, agent 210 and the environment 220 interact according to a Markov decision process, and learning progresses. Agent 210 is given the state of the environment 220, and in the given state, agent 210 acts according to the policy. In response to agent 210's actions, the state of the environment 220 changes, and agent 210 is given the updated new state of the environment 220 and a reward corresponding to the action. The goal of reinforcement learning is to learn a policy that maximizes the accumulated reward at the end of such interaction cycles (episodes).
[0036] Figure 4 is a diagram illustrating the flow of an episode. An episode consists of one or more steps. In the example in Figure 4, one episode consists of three steps.
[0037] In the diagram, the white circle represents the state S of the environment 220. In the initial state S1 of the environment 220, agent 210 can select one of three actions A(1,1), A(1,2), or A(1,3). For example, in step 1, agent 210 selects action A(1,2) from the three actions A(1,1), A(1,2), and A(1,3), and when it executes the selected action A(1,2), the state of the environment 220 transitions from state S1 to state S2.
[0038] Agent 210 observes the state S2 of the environment 220. In the environment 220 in state S2, Agent 210 can select one of three actions A(2,1), A(2,2), or A(2,3). For example, in step 2, Agent 210 selects action A(2,2) from the three actions A(2,1), A(2,2), and A(2,3), and when it executes the selected action A(2,2), the state of the environment 220 transitions from state S2 to state S3.
[0039] Agent 210 observes the state S3 of the environment 220. In the environment 220 in state S3, Agent 210 can select one of three actions A(3,1), A(3,2), or A(3,3). For example, in step 3, Agent 210 selects action A(3,2) from the three actions A(3,1), A(3,2), and A(3,3), and when it executes the selected action A(3,2), the state of the environment 220 transitions from state S3 to the terminal state ST, which satisfies the episode termination conditions.
[0040] Agent 210 calculates a probability distribution for selecting each action according to the policy at each step. Figure 5 shows an example of the probability distribution of actions. The probability distribution includes the selection probability for each action A1, A2, A3, A4, A5, ... The policy is a function that defines the relationship between the observed state of the environment 220 and the probability distribution of actions. Agent 210 is composed of a function approximator that learns the policy, and the function approximator includes one or more parameters.
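As a minimal sketch of selecting an action from such a probability distribution, the following Python fragment samples an action index according to its selection probability. The fixed `policy` function, its probability values, and the name `select_action` are hypothetical stand-ins; in the disclosure the distribution comes from a learned function approximator.

```python
import random
from typing import List, Sequence


def policy(state: Sequence[float]) -> List[float]:
    """Hypothetical fixed policy: maps any observed state to selection probabilities."""
    return [0.2, 0.5, 0.3]


def select_action(probs: Sequence[float], rng: random.Random) -> int:
    """Sample a 0-based action index according to its selection probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)          # seeded so that runs are repeatable
probs = policy([0.0])
counts = [0, 0, 0]
for _ in range(10_000):
    counts[select_action(probs, rng)] += 1
```

Over many draws, the counts approach the proportions given by the distribution, so the action with the highest selection probability is chosen most often but not exclusively.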
[0041] For example, a function approximator is a neural network, and more specifically, a deep neural network (DNN).
[0042] Figure 6 is a diagram showing an example of the configuration of an agent (learner) that is a neural network. The neural network includes an input layer, an intermediate layer, and an output layer.
[0043] The input layer includes a plurality of nodes (shown as circles in the figure). Parameters s1, s2, s3,... of the state S of the environment 220 are input to each node of the input layer.
[0044] The intermediate layer is composed of one or more processing layers. In the example shown in Figure 6, the intermediate layer has a multi-layer structure. The intermediate layer includes a plurality of nodes.
[0045] The output layer includes a plurality of nodes.
[0046] The nodes in adjacent layers among the input layer, the intermediate layer, and the output layer are connected by edges (synaptic connections, shown as line segments in the figure). Weights (connection weights) are set for the edges. The policy is represented by the combination of these weights.
[0047] Each node of the output layer corresponds to actions A1, A2, A3,.... The selection probabilities of the corresponding actions A1, A2, A3,... are output from each node of the output layer.
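The mapping from state parameters to selection probabilities can be sketched as a small feed-forward network in plain Python. The layer sizes, the weight values, the ReLU activation, and the softmax output are assumptions chosen for illustration; the disclosure specifies only that weighted edges connect the layers and that the output nodes yield selection probabilities.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]   # one row of incoming weights per node


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Weighted sum of the inputs for every node in the next layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def softmax(v: Vector) -> Vector:
    """Turn output-node scores into probabilities that sum to 1."""
    top = max(v)                                  # subtract the max for numerical stability
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(state: Vector, w_hidden: Matrix, b_hidden: Vector,
                         w_out: Matrix, b_out: Vector) -> Vector:
    hidden = relu(dense(state, w_hidden, b_hidden))   # intermediate layer
    return softmax(dense(hidden, w_out, b_out))       # output layer


state = [0.3, 0.5, 0.2, 0.35]                         # s1..s4, arbitrary values
w_hidden = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.4, -0.1, 0.2]]
b_hidden = [0.0, 0.1]
w_out = [[0.5, -0.3], [0.1, 0.2], [-0.4, 0.6]]        # three output nodes: A1, A2, A3
b_out = [0.0, 0.0, 0.0]
probs = action_probabilities(state, w_hidden, b_hidden, w_out, b_out)
```

Changing any weight changes the resulting probabilities, which is the sense in which the combination of weights represents the policy.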
[0048] [4. Function of the Learning Device] Figure 7 is a functional block diagram showing an example of the function of the learning device according to the embodiment.
[0049] The learning device 10 includes functions of a setting unit 301, a start unit 302, an observation unit 303, a probability generation unit 304, an exclusion unit 305, a selection unit 306, a determination unit 307, an execution unit 308, an environment update unit 309, an end determination unit 310, a model update unit 311, and a backward unit 312. Each of the setting unit 301, the start unit 302, the observation unit 303, the probability generation unit 304, the exclusion unit 305, the selection unit 306, the determination unit 307, the execution unit 308, the environment update unit 309, the end determination unit 310, the model update unit 311, and the backward unit 312 is realized by the processor 101.
[0050] The setting unit 301 initializes the agent 210 and the environment 220. For example, the setting unit 301 initializes the parameters (weights) of the agent 210. The setting unit 301 initializes the environment 220 according to the setting information provided by the terminal device 20.
[0051] As an example, consider reinforcement learning that deals with the problem of loading and unloading oil (crude oil). Figure 8 is a diagram illustrating an example of the oil loading and unloading problem. In the example in Figure 8, oil is loaded into and unloaded from five oil tanks A, B, C, D, and E. In the example in Figure 8, the oil occupancy rate in oil tank A is 30% (i.e., oil tank A contains 30% of its maximum capacity), the oil occupancy rate in oil tank B is 50%, the oil occupancy rate in oil tank C is 20%, the oil occupancy rate in oil tank D is 35%, and the oil occupancy rate in oil tank E is 60%.
[0052] Furthermore, in the example shown in Figure 8, the API specific gravity of oil in oil tank A is SGa, in oil tank B it is SGb, in oil tank C it is SGc, in oil tank D it is SGd, and in oil tank E it is SGe. Here, API specific gravity refers to the specific gravity of crude oil as defined by the American Petroleum Institute. API specific gravity is a value that can be obtained, for example, in accordance with ASTM D1289.
[0053] In this problem, the following operations are possible: selecting the receiving tank, determining the amount to be received, selecting the output tank, and determining the amount to be output. Selecting the receiving tank is the operation of choosing the destination tank for the oil from oil tanks A, B, C, D, and E. Determining the amount to be received is the operation of determining the amount of oil to be received. Selecting the output tank is the operation of choosing the output tank for the oil from oil tanks A, B, C, D, and E. Determining the amount to be output is the operation of determining the amount of oil to be output.
[0054] For example, if oil tank A is selected as the receiving tank and the amount to be received is determined to be 5 kL, then 5 kL of oil will be delivered from the oil tanker (ship) to oil tank A. For example, if oil tank B is selected as the output tank and the amount to be output is determined to be 10 kL, then 10 kL of oil will be delivered from oil tank B to the oil tanker (ship).
[0055] In this problem, it is also possible to perform the operation of transferring oil between oil tanks (hereinafter also referred to as "shifting"). For example, if oil tank C is selected as the receiving tank and the amount to be received is determined to be 7 kL, and oil tank D is selected as the output tank and the amount to be output is determined to be 7 kL, then 7 kL of oil will be transferred from oil tank D to oil tank C.
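The receiving, output, and shifting operations can be sketched as a single stock update in Python. The function name `apply_operation`, its parameters, and the stock values in kL are hypothetical; the disclosure gives occupancy rates for Figure 8 but no absolute stock values.

```python
from typing import Dict, Optional


def apply_operation(stocks: Dict[str, float],
                    receiving: Optional[str] = None, amount_in: float = 0.0,
                    output: Optional[str] = None, amount_out: float = 0.0) -> Dict[str, float]:
    """Return the stocks (kL) after one operation, leaving the input unchanged."""
    updated = dict(stocks)
    if output is not None:
        updated[output] -= amount_out       # oil leaves the output tank
    if receiving is not None:
        updated[receiving] += amount_in     # oil enters the receiving tank
    return updated


# Hypothetical stocks in kL for oil tanks A to E.
stocks = {"A": 30.0, "B": 50.0, "C": 20.0, "D": 35.0, "E": 60.0}

loaded = apply_operation(stocks, receiving="A", amount_in=5.0)        # ship -> tank A
unloaded = apply_operation(stocks, output="B", amount_out=10.0)       # tank B -> ship
shifted = apply_operation(stocks, receiving="C", amount_in=7.0,
                          output="D", amount_out=7.0)                 # tank D -> tank C
```

Shifting selects both a receiving tank and an output tank with equal amounts, so the total stock across all tanks is unchanged, whereas loading and unloading change the total.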
[0056] For example, when oil is brought into an oil tank (here, oil tank E), the API specific gravity of the oil in that tank may change. For instance, if the API specific gravity of the oil contained in oil tank E is SGe, and 10 kL of oil with an API specific gravity of SG0 is brought in from an oil tanker (ship), the API specific gravity of the oil contained in oil tank E will change from SGe to SGe1.
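The disclosure does not state how the new value SGe1 is computed. One plausible sketch, assuming ideal mixing in which volumes are additive, converts each API gravity to specific gravity with the standard relation API = 141.5 / SG - 131.5, takes the volume-weighted average of the specific gravities, and converts back. The function names and the numeric values below are hypothetical.

```python
def api_to_sg(api: float) -> float:
    """Specific gravity from API gravity: SG = 141.5 / (API + 131.5)."""
    return 141.5 / (api + 131.5)


def sg_to_api(sg: float) -> float:
    """API gravity from specific gravity: API = 141.5 / SG - 131.5."""
    return 141.5 / sg - 131.5


def blended_api(vol_tank: float, api_tank: float,
                vol_in: float, api_in: float) -> float:
    """API gravity after mixing, assuming volumes add without shrinkage."""
    sg = (vol_tank * api_to_sg(api_tank) + vol_in * api_to_sg(api_in)) / (vol_tank + vol_in)
    return sg_to_api(sg)


# Hypothetical values: 30 kL at API 30 in the tank, 10 kL at API 40 brought in.
sge1 = blended_api(30.0, 30.0, 10.0, 40.0)
```

Under this assumption the blended value always lies between the two input API gravities, closer to that of the larger volume.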
[0057] In this problem, the environment 220 consists of the stock (amount of oil stored) in each of oil tanks A, B, C, D, and E, the amount of oil brought in from the ship, and the amount of oil shipped out to the ship. The actions that agent 210 can perform are selecting the receiving tank, determining the amount to be received, selecting the output tank, and determining the amount to be output.
[0058] Figure 9 shows an example of initial environment information for the initial setup of the environment 220. The initial environment information is included in the setting information. For each oil tank (referred to as "tank" in the figure) A, B, C, D, and E, the initial environment information specifies an inventory lower limit, an inventory upper limit, an API lower limit, an API upper limit, an initial inventory, and an initial API specific gravity. In the example in Figure 9, the initial settings are as follows:
- Oil tank A: inventory lower limit "STL_A", inventory upper limit "STU_A", API lower limit "SGL_A", API upper limit "SGU_A", initial inventory "ST_A0", initial API specific gravity "SG_A0"
- Oil tank B: inventory lower limit "STL_B", inventory upper limit "STU_B", API lower limit "SGL_B", API upper limit "SGU_B", initial inventory "ST_B0", initial API specific gravity "SG_B0"
- Oil tank C: inventory lower limit "STL_C", inventory upper limit "STU_C", API lower limit "SGL_C", API upper limit "SGU_C", initial inventory "ST_C0", initial API specific gravity "SG_C0"
- Oil tank D: inventory lower limit "STL_D", inventory upper limit "STU_D", API lower limit "SGL_D", API upper limit "SGU_D", initial inventory "ST_D0", initial API specific gravity "SG_D0"
- Oil tank E: inventory lower limit "STL_E", inventory upper limit "STU_E", API lower limit "SGL_E", API upper limit "SGU_E", initial inventory "ST_E0", initial API specific gravity "SG_E0"
[0059] Returning to Figure 7, the setting unit 301 initializes the environment 220 according to the initial environment information described above.
[0060] The setting unit 301 sets a loading and unloading plan as the target for loading and unloading oil. The agent 210 selects an action according to the policy, with the set loading and unloading plan as the target. The loading and unloading plan is included, for example, in the setting information.
[0061] Figure 10 shows an example of a loading and unloading plan that is set. The loading and unloading plan specifies, for example, the plan number, the type (loading or unloading), the date of loading or unloading, the ID of the ship to be loaded or unloaded, the error limit of the API specific gravity of the unloaded oil, the API specific gravity of the loaded or unloaded oil, and the amount of oil loaded or unloaded. In the example in Figure 10, the plans are as follows:
- Plan number "1": type "loading", date "November 21, 2024", ship ID "EN01", API specific gravity "SG_I1", amount of oil "Am1"
- Plan number "2": type "loading", date "November 23, 2024", ship ID "GA02", API specific gravity "SG_I2", amount of oil "Am2"
- Plan number "3": type "unloading", date "November 25, 2024", ship ID "KA03", error limit for API specific gravity "Tol3_U-Tol3_L" (Tol3_U is the upper limit, Tol3_L is the lower limit), API specific gravity "SG_O3", amount of oil "Am3"
- Plan number "4": type "unloading", date "November 27, 2024", ship ID "SA04", error limit for API specific gravity "Tol4_U-Tol4_L" (Tol4_U is the upper limit, Tol4_L is the lower limit), API specific gravity "SG_O4", amount of oil "Am4"
For the loading plans, plan numbers "1" and "2", the error limit for API specific gravity is not specified.
[0062] Returning to Figure 7, the setting unit 301 further sets constraints for agent 210. The constraints are defined as follows, for example:
(1) The amount of oil in the oil tanks does not exceed the upper limit.
(2) The amount of oil in the oil tanks does not fall below the lower limit.
(3) The number of oil tanks used during loading and unloading does not exceed the upper limit.
(4) The amount of oil loaded does not fall below the lower limit.
(5) The amount of oil unloaded does not fall below the lower limit.
(6) The API specific gravity of the oil in the oil tanks does not exceed the upper limit.
(7) The API specific gravity of the oil in the oil tanks does not fall below the lower limit.
[0063] For example, the upper and lower limits of inventory in constraints (1) and (2) are determined by the initial environmental information. The number of oil tanks used in constraint (3) is determined, for example, in the constraint information that defines the constraints. The lower limits of the amount of oil brought in and taken out in constraints (4) and (5) are determined in the constraint information. The upper and lower limits of the API specific gravity of oil in constraints (6) and (7) are determined by the initial environmental information of environment 220. The constraint information is included, for example, in the setting information.
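Constraints (1) to (7) can be sketched as a single check that reports which constraint numbers a candidate state would violate. The `Tank` and `ConstraintInfo` classes, their field names, the function `violated_constraints`, and all numeric values are hypothetical; the split between per-tank limits and constraint information follows the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tank:
    stock: float          # current inventory (kL)
    stock_lower: float    # inventory lower limit, from the initial environment information
    stock_upper: float    # inventory upper limit
    api: float            # current API specific gravity
    api_lower: float      # API lower limit
    api_upper: float      # API upper limit


@dataclass
class ConstraintInfo:
    max_tanks_used: int   # upper limit for constraint (3)
    min_loaded: float     # lower limit for constraint (4), in kL
    min_unloaded: float   # lower limit for constraint (5), in kL


def violated_constraints(tanks: Dict[str, Tank], info: ConstraintInfo,
                         tanks_used: int, loaded: Optional[float] = None,
                         unloaded: Optional[float] = None) -> List[int]:
    """Return the sorted numbers of the constraints that are violated."""
    violated = set()
    for tank in tanks.values():
        if tank.stock > tank.stock_upper:
            violated.add(1)
        if tank.stock < tank.stock_lower:
            violated.add(2)
        if tank.api > tank.api_upper:
            violated.add(6)
        if tank.api < tank.api_lower:
            violated.add(7)
    if tanks_used > info.max_tanks_used:
        violated.add(3)
    if loaded is not None and loaded < info.min_loaded:       # only when loading occurs
        violated.add(4)
    if unloaded is not None and unloaded < info.min_unloaded:  # only when unloading occurs
        violated.add(5)
    return sorted(violated)


info = ConstraintInfo(max_tanks_used=2, min_loaded=1.0, min_unloaded=1.0)
tanks = {
    "A": Tank(stock=101.0, stock_lower=10.0, stock_upper=100.0,
              api=32.0, api_lower=28.0, api_upper=36.0),
    "B": Tank(stock=50.0, stock_lower=10.0, stock_upper=100.0,
              api=27.0, api_lower=28.0, api_upper=36.0),
}
result = violated_constraints(tanks, info, tanks_used=1, loaded=0.5)
```

In this hypothetical state, tank A is over its inventory upper limit, tank B is under its API lower limit, and the loaded amount is under its lower limit, so the check reports constraints 1, 4, and 7.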
[0064] The start unit 302 starts an episode. The start unit 302 can start multiple episodes using multiple agents 210. For example, the multiple agents 210 share a common policy.
[0065] The observation unit 303 observes the state of the environment 220. The state S of the environment 220 includes multiple parameters s1, s2, s3, s4, ... For example, parameter s1 is the inventory amount in oil tank A, parameter s2 is the API specific gravity of the oil in oil tank A, parameter s3 is the inventory amount in oil tank B, and parameter s4 is the API specific gravity of the oil in oil tank B. The environment 220 outputs the state S, and the observation unit 303 observes the state S by receiving the state S.
[0066] The probability generation unit 304 generates probability distributions for actions A1, A2, A3, ... for each agent 210 according to the policy. For example, the state S = (s1, s2, s3, s4, ...) of the environment 220 is input to the DNN described above, and the probability distributions for actions A1, A2, A3, ... are output from the DNN. Specifically, the probability generation unit 304 is a function approximator (e.g., a DNN) of the policy that constitutes the agent 210.
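As a rough illustration of how the raw outputs of a policy network can become the probability distribution described here, the sketch below applies a softmax to per-action scores. The disclosure does not state which output layer the DNN uses, so the softmax and the names `softmax` and `logits` are assumptions made purely for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn arbitrary per-action scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for actions A1, A2, A3 computed from state S.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

A higher score yields a higher selection probability, which matches the behavior the selection unit 306 relies on.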
[0067] The exclusion unit 305 excludes from the selection any actions A1, A2, A3, ... that violate the constraints. Hereinafter, excluding an action from the selection will also be referred to as "masking an action."
[0068] Figure 11 is a diagram illustrating the masking of actions. In the diagram, actions selectable by agent 210 are shown as black-filled circles, and masked actions are shown as dashed circles. In the state Si of environment 220 at step i (where i is a natural number), if performing action A(i,1) would violate a constraint, then action A(i,1) is masked. For example, in the state Si of environment 220, if the inventory in oil tank A is 9 kL less than the inventory limit, and action A(i,1) is to bring 10 kL of oil into oil tank A, then performing action A(i,1) would make the inventory in oil tank A larger than the inventory limit. In this case, the exclusion unit 305 masks action A(i,1).
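The masking rule in this paragraph can be sketched as follows. This is a minimal sketch that checks only the inventory constraints (1) and (2); the `Tank` record, the function names, the representation of an action as a signed volume in kL, and the concrete limits (0 to 100 kL) are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tank:
    """Hypothetical tank state: current inventory and its limits, in kL."""
    inventory: float
    lower_limit: float
    upper_limit: float


def violates_inventory_limits(tank: Tank, delta_kl: float) -> bool:
    """Return True if changing the inventory by delta_kl breaks constraint (1) or (2)."""
    after = tank.inventory + delta_kl
    return after > tank.upper_limit or after < tank.lower_limit


def action_mask(tank: Tank, deltas: list[float]) -> list[bool]:
    """True marks an action that must be masked (excluded from selection)."""
    return [violates_inventory_limits(tank, d) for d in deltas]


# Worked example from the text: the tank is 9 kL below its upper limit,
# so bringing in 10 kL would exceed the limit and is masked.
tank_a = Tank(inventory=91.0, lower_limit=0.0, upper_limit=100.0)
mask = action_mask(tank_a, [10.0, 5.0, -20.0])
print(mask)  # [True, False, False]
```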
[0069] Here, in step i, if in state Si of environment 220 all of actions A(i,1), A(i,2), and A(i,3) are unavailable, then state Si is said to be a dead end. A dead end is an example of a "specific state".
[0070] Figure 12 is a diagram illustrating a dead end. In state Si of environment 220, if all actions A(i,1), A(i,2), and A(i,3) violate the constraints, agent 210 cannot select any of actions A(i,1), A(i,2), and A(i,3). In this case, state Si of environment 220 is a dead end. In a dead end, agent 210 cannot select an action, and therefore the episode cannot proceed.
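Expressed over a mask in which `True` means "masked", the dead-end condition is simply that every entry is `True`. The helper name `is_dead_end` below is illustrative only.

```python
def is_dead_end(masked: list[bool]) -> bool:
    """Return True when every action in the state is masked (True = masked)."""
    return all(masked)


print(is_dead_end([True, True, True]))   # True: A(i,1), A(i,2), A(i,3) all violate constraints
print(is_dead_end([True, False, True]))  # False: A(i,2) can still be selected
```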
[0071] Returning to Figure 7, the selection unit 306 selects an action based on the probability distribution generated by the probability generation unit 304. Actions with a high probability are selected with a high probability, and actions with a low probability are selected with a low probability. However, the selection unit 306 cannot select masked actions.
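One way to sample from the policy while never returning a masked action is to zero out the masked probabilities and sample in proportion to what remains. The disclosure does not say how masked probabilities are handled, so this proportional reweighting, as well as the name `select_action`, is an assumption.

```python
import random


def select_action(probs: list[float], mask: list[bool], rng: random.Random) -> int:
    """Sample an action index from probs, never returning a masked index."""
    weights = [0.0 if masked else p for p, masked in zip(probs, mask)]
    if sum(weights) <= 0.0:
        raise ValueError("no selectable action: the state is a dead end")
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


rng = random.Random(0)
# Action 0 has the highest probability but is masked, so it is never picked.
picks = [select_action([0.7, 0.2, 0.1], [True, False, False], rng) for _ in range(1000)]
print(0 in picks)  # False
```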
[0072] The determination unit 307 determines whether, as a result of the first action selected by agent 210 in the first step of an episode, the state of environment 220 in the second step, which is later than the first step, will transition to a dead end.
[0073] Figure 13 is a diagram illustrating the determination of a dead end by the determination unit. In the figure, unselectable actions are shown by solid circles. In the state Si of the environment 220 in step i, actions A(i,1), A(i,2), and A(i,3) are selectable. If agent 210 selects action A(i,2), the state of the environment 220 transitions to Si+1 in the next step i+1. The determination unit 307 determines whether all of actions A(i+1,1), A(i+1,2), and A(i+1,3) in step i+1 violate the constraints. That is, the determination unit 307 determines whether the state Si+1 of the environment 220 in step i+1 is a dead end. In the example in Figure 13, all of actions A(i+1,1), A(i+1,2), and A(i+1,3) violate the constraints. In this case, the determination unit 307 determines that the state Si+1 of the environment 220 in step i+1 is a dead end.
[0074] The determination unit 307 determines whether the state of the environment 220 in a future step is a dead end. That is, the determination unit 307 evaluates not the state Si of the environment 220 in the current step i (the step i to be executed), but the state Si+1 of the environment 220 in the next step i+1. Therefore, the actions in step i+1 that the determination unit 307 finds to violate the constraints have not been masked by the exclusion unit 305. In Figure 13, these unselectable actions are shown as solid circles to distinguish them from masked actions (dashed circles).
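The look-ahead performed by the determination unit can be sketched as a one-step simulation followed by a check for remaining selectable actions. The toy environment below (a single tank with limits of 0 to 100 kL and two delivery actions) and all function names are hypothetical.

```python
from typing import Callable


def leads_to_dead_end(state, action, simulate: Callable, selectable: Callable) -> bool:
    """Simulate action from state and report whether the next state has no selectable action."""
    next_state = simulate(state, action)
    return len(selectable(next_state)) == 0


# Toy environment (hypothetical): the state is a tank inventory in kL,
# and each action delivers a fixed amount of oil.
ACTIONS = (30.0, 60.0)


def simulate(inventory: float, delta: float) -> float:
    return inventory + delta


def selectable(inventory: float) -> list[float]:
    return [d for d in ACTIONS if 0.0 <= inventory + d <= 100.0]


print(leads_to_dead_end(20.0, 60.0, simulate, selectable))  # True: at 80 kL neither delivery fits
print(leads_to_dead_end(20.0, 30.0, simulate, selectable))  # False: at 50 kL, +30 is still possible
```

Note that the current state is left untouched; only a simulated copy of the next state is inspected.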
[0075] Returning to Figure 7, if the determination unit 307 determines that the state S of the environment 220 does not transition to a dead end, the execution unit 308 executes the action selected by the selection unit 306.
[0076] The environment update unit 309 updates the state S of the environment 220 as a result of the action being performed. That is, the environment update unit 309 estimates, through simulation, the changes in the environment 220 caused by the action and determines the state of the environment 220 after the change. For example, if in step i an action is performed to deliver Q kL of oil with API specific gravity P to oil tank A, whose inventory amount is STa and whose API specific gravity is SGa, the environment update unit 309 determines the API specific gravity and inventory amount of oil tank A in step i+1 after the delivery and updates the state of the environment 220 from Si to Si+1.
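The disclosure does not specify how the simulation computes the post-delivery values. The sketch below shows one possible calculation under a loudly stated assumption: ideal mixing with no volume change, in which specific gravity blends in proportion to volume. The conversion between API gravity and specific gravity is the standard definition; the blending rule and all function names are assumptions.

```python
def api_to_sg(api: float) -> float:
    """Convert API gravity to specific gravity (standard definition)."""
    return 141.5 / (api + 131.5)


def sg_to_api(sg: float) -> float:
    """Convert specific gravity back to API gravity."""
    return 141.5 / sg - 131.5


def deliver(inventory_kl: float, tank_api: float, q_kl: float, p_api: float):
    """Return (new inventory, new API gravity) after delivering q_kl of oil.

    Assumes ideal mixing: specific gravity is averaged by volume.
    """
    total = inventory_kl + q_kl
    blended_sg = (inventory_kl * api_to_sg(tank_api) + q_kl * api_to_sg(p_api)) / total
    return total, sg_to_api(blended_sg)


# Hypothetical numbers: 60 kL at API 30 in the tank, 40 kL at API 40 delivered.
new_inventory, new_api = deliver(60.0, 30.0, 40.0, 40.0)
print(new_inventory, round(new_api, 2))
```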
[0077] The termination determination unit 310 determines whether or not the termination condition of the episode has been met. The termination condition is, for example, that the number of steps has reached a predetermined value N.
[0078] If the termination determination unit 310 determines that the termination condition has not been met, the step number is incremented, and the observation unit 303 observes the updated state of the environment 220.
[0079] If the termination determination unit 310 determines that the termination condition has been met, the model update unit 311 updates the agent 210. That is, the model update unit 311 updates the weights of the DNN that serves as the policy.
[0080] For example, the model update unit 311 generates an oil loading and unloading plan as a result of the actions for each of several episodes. The model update unit 311 compares the oil loading and unloading plans generated from each episode with the oil loading and unloading plan set as the target, and determines the oil loading and unloading plan that best approximates the target. For example, the model update unit 311 determines the cumulative reward in the episode in which the oil loading and unloading plan that best approximates the target was generated, and determines the parameters (weights) of agent 210 according to the cumulative reward. The model update unit 311 updates agent 210 by setting the determined parameters to agent 210.
[0081] If the determination unit 307 determines that the state S of the environment 220 transitions to a dead end, the exclusion unit 305 excludes (masks) a third action, which is at least one of the actions selected by agent 210 in the episode, from the selection targets of agent 210.
[0082] Figure 14 illustrates an example of action masking when it is determined that the environmental state is transitioning to a dead end. In the example shown in Figure 14, the exclusion unit 305 masks the action A(i,2) selected in the current execution step i. In other words, the exclusion unit 305 masks the action A(i,2) selected in the step i immediately preceding step i+1, in which it is determined that the state Si+1 of the environment 220 is a dead end.
[0083] For example, if the determination unit 307 determines that the state S of the environment 220 transitions to a dead end, the exclusion unit 305 may determine whether the number of episodes, among the multiple episodes, in which a dead end occurs exceeds a certain number (a threshold), and may mask the selected action A(i,2) if it does. The exclusion unit 305 does not need to mask the selected action A(i,2) if the number of episodes in which a dead end occurs is less than or equal to the threshold. In this case, the episodes in which a dead end occurs are terminated, but a certain number of episodes in which no dead end occurs remain, and reinforcement learning can proceed.
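The threshold test across parallel episodes can be sketched as a simple count, using the strict "exceeds the threshold" comparison. The function name and the per-episode flag list are hypothetical.

```python
def should_mask(dead_end_flags: list[bool], threshold: int) -> bool:
    """Decide whether to mask the selected action, given one flag per parallel episode."""
    return sum(dead_end_flags) > threshold


flags = [True, False, True, True, False]  # 3 of 5 episodes head into a dead end
print(should_mask(flags, threshold=2))    # True: 3 > 2, so mask and rewind
print(should_mask(flags, threshold=3))    # False: let the remaining episodes continue
```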
[0084] Returning to Figure 7, the rewind unit 312 rewinds the steps to be executed by agent 210 when the determination unit 307 determines that the state S of the environment 220 has transitioned to a dead end.
[0085] Figure 15 is a diagram illustrating an example of step reversal. In the example in Figure 15, the rewind unit 312 rewinds from the current execution target step i to step i-1, one step prior. That is, the rewind unit 312 rewinds from step i, in which action A(i,2) was masked, to step i-1, one step prior.
[0086] Returning to Figure 7, after going back through the steps, the observation unit 303 observes the state Si-1 of the environment 220 at step i-1. Subsequently, in step i-1, the learning device 10 executes the process described above. As a result, the action A(i,2) that causes a dead end is masked, so that action A(i,2) is not selected again, and reinforcement learning (episode) can proceed.
[0087] [5. Operation of the Learning Device] Figure 16 is a flowchart showing an example of reinforcement learning processing by the learning device according to the embodiment.
[0088] For example, the user operates the terminal device 20 to input configuration information into the terminal device 20. The configuration information includes environment initial information that defines the initial state of the environment 220, constraint information that defines the constraint conditions, and target information. In learning about the oil loading and unloading problem, the target information is, for example, the oil loading and unloading plan described above. The input configuration information is transmitted from the terminal device 20 to the learning device 10.
[0089] The processor 101 receives the configuration information and initializes the learning device 10 according to the configuration information (step S101). That is, the processor 101 initializes the environment 220 according to the environment initial information and sets the constraints according to the constraint information. Furthermore, the processor 101 sets the target according to the target information.
[0090] The processor 101 sets the parameters (weights) of agent 210 and reads out agent 210 (step S102). The agent 210 read out may be a pre-trained model that has been trained in advance by machine learning (e.g., supervised learning), or it may be an untrained model.
[0091] Processor 101 sets the variable i to its initial value "1" and starts the episode (step S103). That is, the first step 1 of the episode is set as the target of execution. Specifically, processor 101 starts multiple episodes. The number of episodes is included in the configuration information. The following processes proceed in parallel across multiple episodes.
[0092] The processor 101 observes the state Si of the environment 220 (step S104).
[0093] The processor 101 identifies actions that violate the constraints based on the observed state Si of the environment 220, and masks the identified actions (step S105).
[0094] The processor 101 calculates the selection probability for each of the multiple actions A(i,j) based on the observed state Si of the environment 220, and generates a probability distribution. Based on the generated probability distribution, the processor 101 selects action A(i,k) (step S106).
[0095] The processor 101 determines whether the state Si+1 of the environment 220 in the next step i+1 transitions to a dead end as a result of the execution of the selected action A(i,k) (step S107). That is, the processor 101 determines whether all of the actions A(i+1,j) in step i+1 violate the constraint conditions.
[0096] If the state Si+1 of environment 220 does not transition to a dead end (NO in step S107), the processor 101 executes the selected action A(i,k) and updates the state Si of environment 220 to Si+1 (step S108). That is, the processor 101 determines the next state Si+1 of environment 220 by simulation.
[0097] The processor 101 determines whether the episode termination condition (i = N) has been met (step S109). If the episode termination condition has not been met (NO in step S109), the processor 101 increments the variable i (step S110) and returns to step S104.
[0098] If the state Si+1 of environment 220 transitions to a dead end (YES in step S107), the processor 101 determines whether the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than a threshold (step S111).
[0099] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is less than or equal to the threshold (NO in step S111), the processor 101 moves to step S108. In this case, in the next step, episodes in which the state Si+1 of environment 220 does not transition to a dead end proceed, and episodes in which the state Si+1 of environment 220 transitions to a dead end are terminated.
[0100] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than the threshold (YES in step S111), the processor 101 masks the action A(i,k) selected in step i to be executed (step S112).
[0101] Furthermore, the processor 101 decrements the variable i (step S113) and returns to step S104. This rewinds the step to be executed to the previous step.
[0102] If the episode termination condition is met (YES in step S109), the processor 101 updates agent 210 (step S114). That is, the processor 101 determines the cumulative reward in the episode in which the oil loading and unloading plan that most closely approximates the target was generated, and determines the parameters of agent 210 according to the cumulative reward. The processor 101 updates agent 210 by setting the determined parameters to agent 210.
[0103] The processor 101 determines whether the termination condition for reinforcement learning has been met (step S115). For example, the termination condition for reinforcement learning is that the number of episode executions has reached a predetermined value. In another example, the termination condition for reinforcement learning may be that the performance of agent 210 has reached a predetermined value. Here, the performance of agent 210 may be the degree of agreement between the output oil loading and unloading plan and the target.
[0104] If the termination condition for reinforcement learning is not met (NO in step S115), the processor 101 returns the environment 220 to its initial state and returns to step S103. If the termination condition for reinforcement learning is met (YES in step S115), the reinforcement learning process ends.
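The per-episode part of this flow (steps S104 to S113) can be sketched for a single episode as follows. This is a simplified sketch, not the processing of Figure 16 itself: it omits the multi-episode threshold of step S111 and the model update, it keys each mask by the sequence of actions leading to the step (an assumption about how masks are stored), and it uses a random policy as a stand-in for the DNN. The toy tank environment and every name in the code are hypothetical.

```python
import random


def run_episode(initial_state, n_steps, actions, simulate, violates, policy, rng):
    """Run one episode with dead-end look-ahead, masking, and one-step rewind.

    Returns the executed actions, or None if nothing is left to try at step 1.
    """
    states = [initial_state]  # states[-1] is the state of the step to be executed
    chosen = []               # actions executed so far
    banned = {}               # action prefix -> actions masked after a dead end

    def selectable(state, prefix):
        extra = banned.get(prefix, set())
        return [a for a in actions if not violates(state, a) and a not in extra]

    while len(chosen) < n_steps:
        prefix = tuple(chosen)
        state = states[-1]
        options = selectable(state, prefix)                  # S104, S105
        if not options:                                      # every action is masked
            if not chosen:
                return None
            banned.setdefault(prefix[:-1], set()).add(chosen.pop())
            states.pop()
            continue
        action = policy(state, options, rng)                 # S106
        next_state = simulate(state, action)
        if not selectable(next_state, prefix + (action,)):   # S107: dead end ahead
            banned.setdefault(prefix, set()).add(action)     # S112: mask A(i,k)
            if chosen:                                       # S113: rewind one step
                chosen.pop()
                states.pop()
            continue
        chosen.append(action)                                # S108
        states.append(next_state)
    return chosen


# Toy environment: one tank with limits 0..100 kL and three fixed-volume actions.
ACTIONS = (60, -70, 45)


def simulate(inventory, delta):
    return inventory + delta


def violates(inventory, delta):
    return not 0 <= inventory + delta <= 100


def random_policy(state, options, rng):
    return rng.choice(options)


plan = run_episode(15, 5, ACTIONS, simulate, violates, random_policy, random.Random(0))
print(plan)  # [60, -70, 45, 45, -70]
```

In this toy setting only one five-step sequence avoids every dead end, so the result is the same for any random seed; the seed only changes how many masks and rewinds are needed to find it.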
[0105] [6-1. First Modification] In the embodiment described above, when it was determined that the state S of the environment 220 had transitioned to a dead end, the execution was reversed from the current execution target step i to step i-1, which was one step prior (see Figure 15), but the embodiment is not limited to this.
[0106] Figure 17 is a diagram illustrating a first modified example of step reversal. In the modified example shown in Figure 17, the rewind unit 312 goes back two steps from the current execution target step i to step i-2. That is, the rewind unit 312 goes back from step i, in which action A(i,2) was masked, to step i-2, two steps prior. Furthermore, the number of steps that the rewind unit 312 goes back may be three or more. In this way, even if the rewind goes back two or more steps from the step of the masked action A(i,2), the masked action A(i,2) will not be selected again, and the state S of the environment 220 will not reach a dead end via the same path as last time.
[0107] [6-2. Second Modification] In the above-described embodiment, when it is determined that the state S of the environment 220 transitions to a dead end, the action A(i,2) selected in the current execution target step i is masked (see Figure 15), but the embodiment is not limited to this.
[0108] Figure 18 illustrates a second modified example of action masking and step reversal when it is determined that the environmental state has transitioned to a dead end. In the modified example shown in Figure 18, the exclusion unit 305 masks action A(i-1,2) selected in step i-1, one step prior to step i, rather than action A(i,2) selected in step i, the current execution target. In other words, the exclusion unit 305 masks action A(i-1,2) selected in step i-1, two steps prior to step i+1, where it is determined that the state Si+1 of the environment 220 is a dead end. Furthermore, the action that the exclusion unit 305 masks may be an action selected in a step two or more steps prior to step i, the current execution target.
[0109] Furthermore, in this case, the rewind unit 312 rewinds to a step one or more steps prior to the step of the masked action. In the modified example shown in Figure 18, the rewind unit 312 rewinds to step i-2, one step prior to step i-1, in which action A(i-1,2) was masked. However, the rewind unit 312 may also rewind to a step two or more steps prior to step i-1, in which action A(i-1,2) was masked. As a result, since one of the actions on the path to the dead end is masked, the masked action is not selected again after the steps have been rewound, and the state S of the environment 220 does not reach the dead end via the same path as last time.
[0110] Furthermore, it is possible to mask multiple previously selected actions together. For example, in the example shown in Figure 18, the exclusion unit 305 may mask actions A(i,2) and actions A(i-1,2). This makes it possible to mask multiple actions that cause a dead end together.
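The first and second modifications can be read as two parameters: how far back the masked action lies (n steps before step i) and how far back execution resumes (m steps before step i, with m > n), which is the notation used in claim 1. The sketch below illustrates that reading. Storing masks under the sequence of actions that precedes the masked step, and the function name, are assumptions.

```python
def mask_and_rewind(path: list[str], banned: dict, n: int, m: int) -> list[str]:
    """Mask the action chosen at step i-n and rewind to step i-m.

    path holds the actions selected at steps 1..i (the last one was selected
    but not executed because it leads to a dead end). Returns the actions of
    steps 1..i-m-1, so that step i-m is the next step to be executed.
    """
    i = len(path)
    if not 0 <= n < m < i:
        raise ValueError("require 0 <= n < m < i")
    masked_step = i - n
    prefix = tuple(path[: masked_step - 1])  # actions leading up to the masked step
    banned.setdefault(prefix, set()).add(path[masked_step - 1])
    return path[: i - m - 1]


banned: dict = {}
# Second modification at i = 4: mask the action of step i-1, rewind to step i-2.
remaining = mask_and_rewind(["a1", "a2", "a3", "a4"], banned, n=1, m=2)
print(remaining)  # ['a1']
print(banned)     # {('a1', 'a2'): {'a3'}}
```

With n = 0 and m = 1 the same helper reproduces the base embodiment, and with n = 0 and m = 2 the first modification.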
[0111] [6-3. Third Modification] For example, the masking of the action and step reversal when the state S of the environment 220 transitions to a dead end may be repeated multiple times. Figure 19 is a diagram illustrating a third modification of the masking of the action and step reversal when it is determined that the state of the environment transitions to a dead end. In the modification shown in Figure 19, as a result of the selection unit 306 selecting action A(i,2) in step i, the state Si+1 of the environment 220 in the next step i+1 becomes a dead end. The exclusion unit 305 masks the action A(i,2) selected in step i, and the rewind unit 312 rewinds to step i-1, one step prior to step i.
[0112] In step i-1, the selection unit 306 again selects the previously selected action A(i-1,2). If all of the actions A(i,1), A(i,2), and A(i,3) in step i are masked, the state Si of the environment 220 is a dead end. In this case, the exclusion unit 305 masks the action A(i-1,2) selected in step i-1, and the rewind unit 312 rewinds to step i-2, one step prior to step i-1.
[0113] In step i-2, the selection unit 306 again selects the previously selected action A(i-2,2). If all of the actions A(i-1,1), A(i-1,2), and A(i-1,3) in step i-1 are masked, the state Si-1 of the environment 220 is a dead end. In this case, the exclusion unit 305 masks the action A(i-2,2) selected in step i-2, and the rewind unit 312 rewinds to step i-3, one step prior to step i-2. By repeating the above, the rewind unit 312 goes back to step k.
[0114] This makes it possible to identify the action A(p,2) that causes the state S of environment 220 to reach a dead end, and to mask action A(p,2). Therefore, a path that avoids the dead end can be searched for.
[0115] As another variation, if the ratio of masked actions to the total number of actions in a step being rewound to exceeds a threshold, the exclusion unit 305 may mask all actions in that step, even if unmasked actions remain, and the rewind unit 312 may rewind further. This reduces the amount of computation in the process of rewinding multiple steps and allows for the efficient search of a path that avoids dead ends.
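The ratio test in this variation amounts to a single comparison per step. In the sketch below the function name and the example threshold of 0.5 are hypothetical.

```python
def treat_step_as_exhausted(masked: list[bool], ratio_threshold: float) -> bool:
    """Return True when the share of masked actions in a step exceeds ratio_threshold."""
    return sum(masked) / len(masked) > ratio_threshold


print(treat_step_as_exhausted([True, True, False], ratio_threshold=0.5))   # True: 2/3 > 0.5
print(treat_step_as_exhausted([True, False, False], ratio_threshold=0.5))  # False: 1/3 <= 0.5
```

The trade-off is that a step treated as exhausted may still contain a workable action, so a lower threshold saves computation at the cost of possibly discarding valid paths.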
[0116] [6-4. Fourth Modification] In the above-described embodiment, the reinforcement learning process of the learning device 10 in the learning phase, in which the agent 210 is trained by machine learning, has been explained, but the embodiment is not limited thereto. In the inference process in the inference phase, in which the agent 210 trained by reinforcement learning performs inference, it may likewise be determined whether the state S of the environment 220 transitions to a dead end, and if the state S of the environment 220 transitions to a dead end, the action selected in the previous step may be masked. The fourth modification of the embodiment will be described in detail below.
[0117] In the fourth modified example, the learning device 10 functions as an inference device. However, the agent 210, which has been trained by reinforcement learning using the learning device 10, may be installed in an inference device different from the learning device 10, and the inference process may be performed by the inference device.
[0118] Referring to Figure 7, in the learning device 10 which functions as an inference device, the function of the model update unit 311 is disabled (not used). That is, since there is no need to adjust the parameters of the trained agent 210, the model update unit 311 is unnecessary. The other functions of the inference device are the same as those of the learning device 10.
[0119] Figure 20 is a flowchart showing an example of inference processing by the inference device according to the fourth modified example.
[0120] For example, the user operates the terminal device 20 to input configuration information. The configuration information includes initial environmental information that shows the actual inventory status of the oil tanks, and target information that is the actual oil loading and unloading plan. Furthermore, the configuration information includes the same constraints as those used in the reinforcement learning process.
[0121] The processor 101 receives the configuration information and initializes the learning device 10 according to the configuration information (step S101). That is, the processor 101 initializes the environment 220 according to the environment initial information and sets the constraints according to the constraint information. Furthermore, the processor 101 sets the target according to the target information.
[0122] The processor 101 sets the parameters (weights) of agent 210 and reads out agent 210 (step S202). The agent 210 read out is a trained model obtained through reinforcement learning.
[0123] Processor 101 sets the variable i to its initial value "1" and starts the episode (step S103). That is, the first step 1 of the episode is set as the target of execution. Specifically, processor 101 starts multiple episodes. The number of episodes is included in the configuration information. The following processes proceed in parallel across multiple episodes.
[0124] The processor 101 observes the state Si of the environment 220 (step S104).
[0125] The processor 101 identifies actions that violate the constraints based on the observed state Si of the environment 220, and masks the identified actions (step S105).
[0126] The processor 101 calculates the selection probability for each of the multiple actions A(i,j) based on the observed state Si of the environment 220, and generates a probability distribution. Based on the generated probability distribution, the processor 101 selects action A(i,k) (step S106).
[0127] The processor 101 determines whether the state Si+1 of the environment 220 in the next step i+1 transitions to a dead end as a result of the execution of the selected action A(i,k) (step S107). That is, the processor 101 determines whether all of the actions A(i+1,j) in step i+1 violate the constraint conditions.
[0128] If the state Si+1 of environment 220 does not transition to a dead end (NO in step S107), the processor 101 executes the selected action A(i,k) and updates the state Si of environment 220 to Si+1 (step S108). That is, the processor 101 determines the next state Si+1 of environment 220 by simulation.
[0129] The processor 101 determines whether the episode termination condition (i = N) has been met (step S109). If the episode termination condition has not been met (NO in step S109), the processor 101 increments the variable i (step S110) and returns to step S104.
[0130] If the state Si+1 of environment 220 transitions to a dead end (YES in step S107), the processor 101 determines whether the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than a threshold (step S111).
[0131] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is less than or equal to the threshold (NO in step S111), the processor 101 moves to step S108. In this case, in the next step, episodes in which the state Si+1 of environment 220 does not transition to a dead end proceed, and episodes in which the state Si+1 of environment 220 transitions to a dead end are terminated.
[0132] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than the threshold (YES in step S111), the processor 101 masks the action A(i,k) selected in step i to be executed (step S112).
[0133] Furthermore, the processor 101 decrements the variable i (step S113) and returns to step S104. This rewinds the step to be executed to the previous step.
[0134] If the episode termination condition is met (YES in step S109), the processor 101 determines whether the inference processing termination condition has been met (step S214). For example, the inference processing termination condition is that the number of episode executions has reached a predetermined value. In another example, the inference processing termination condition is that the degree of agreement between the output oil loading and unloading plan and the target is greater than a predetermined value.
[0135] If the termination condition for the inference process is not met (NO in step S214), the processor 101 returns the environment 220 to its initial state and returns to step S103. If the termination condition for the inference process is met (YES in step S214), the processor 101 outputs the inference result, i.e., the oil loading and unloading plan with the highest degree of agreement with the target (step S215). This completes the inference process.
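Selecting the output of step S215 can be sketched as taking the candidate plan with the highest agreement score. The disclosure does not define how the degree of agreement is computed, so the `agreement` function below (the fraction of plan entries that equal the target) and the list-of-amounts plan format are purely hypothetical.

```python
def agreement(plan: list[float], target: list[float]) -> float:
    """Hypothetical score: fraction of plan entries equal to the target entries."""
    matches = sum(1 for p, t in zip(plan, target) if p == t)
    return matches / len(target)


def best_plan(plans: list[list[float]], target: list[float]) -> list[float]:
    """Return the candidate plan with the highest agreement with the target."""
    return max(plans, key=lambda plan: agreement(plan, target))


target = [100.0, 50.0, 80.0]
candidates = [[100.0, 40.0, 70.0], [100.0, 50.0, 70.0], [90.0, 50.0, 60.0]]
print(best_plan(candidates, target))  # [100.0, 50.0, 70.0]
```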
[0136] [6-5. Other Modifications] In the embodiments described above, a learning device 10 that deals with the problem of loading and unloading petroleum has been described, but this is illustrative and not limiting. For example, the learning device 10 can deal with problems similar to the problem of loading and unloading petroleum. A similar problem is, for example, the problem of receiving and shipping goods in a warehouse. In this problem, multiple petroleum tanks are replaced with multiple warehouses or multiple areas in one warehouse, and petroleum is replaced with cargo or items to be stored. Furthermore, the learning device 10 may deal with problems different from the problem of loading and unloading petroleum and the problem of receiving and shipping goods in a warehouse.
[0137] [7. Addendum] (Note 1) An information processing device comprising: a determination unit that determines whether, as a result of a learning agent executing a first action selected by the learning agent in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which the learning agent cannot select any of one or more second actions; an exclusion unit that excludes a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets when the determination unit determines that the state of the environment transitions to the specific state; and a rewind unit that, when the determination unit determines that the state of the environment transitions to the specific state, rewinds the step to be executed by the learning agent from the first step to a fourth step earlier than a third step in which the third action was selected.
[0138] (Note 2) The information processing device according to Note 1, wherein the specific state is a state in which all of the one or more second actions violate the constraints defined in the environment.
[0139] (Note 3) The information processing apparatus described in Note 1 or Note 2, wherein the fourth step is the step immediately preceding the third step.
[0140] (Note 4) The information processing apparatus according to Note 1 or Note 2, wherein the fourth step is a step several steps earlier than the third step.
[0141] (Note 5) The information processing apparatus according to any one of Notes 1 to 4, wherein the third action is the first action and the third step is the first step.
[0142] (Note 6) The information processing device according to any one of Notes 1 to 4, wherein the third step is a step prior to the first step.
[0143] (Note 7) The information processing device described in Note 6, wherein the third action is a plurality of actions selected in the episode.
[0144] (Note 8) The information processing device according to any one of Notes 1 to 7, wherein, when the learning agent, having gone back to the fourth step, executes a fourth action selected in the fourth step and as a result the state of the environment transitions to the specific state, the exclusion unit excludes a fifth action selected earlier than the third action, and the rewind unit rewinds the step to be executed by the learning agent from the fourth step to a sixth step earlier than a fifth step in which the fifth action was selected.
[0145] (Note 9) An information processing method comprising: determining whether, as a result of a learning agent executing a first action selected by the learning agent in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which the learning agent cannot select any of one or more second actions; if it is determined that the state of the environment transitions to the specific state, excluding a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets; and if it is determined that the state of the environment transitions to the specific state, rewinding the step to be executed by the learning agent from the first step to a fourth step earlier than a third step in which the third action was selected.
[0146] (Note 10) A computer program for causing a computer to perform the following steps: determining whether, as a result of a learning agent executing a first action selected by the learning agent in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which the learning agent cannot select any of one or more second actions; if it is determined that the state of the environment transitions to the specific state, excluding a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets; and if it is determined that the state of the environment transitions to the specific state, rewinding the step to be executed by the learning agent from the first step to a fourth step earlier than a third step in which the third action was selected.
[0147] [8. Supplementary Notes] The embodiments disclosed herein are illustrative in all respects and are not restrictive. The scope of rights in this disclosure is indicated by the claims rather than the embodiments described above, and includes the meaning of equivalents of the claims and all modifications within that scope.
[0148]
1 Machine Learning System
10 Learning Device (Information Processing Device)
20 Terminal Device
101 Processor
102 Non-volatile Memory
103 Volatile Memory
104 Interface (IF)
200 Learning Program
210 Agent (Learning Agent)
220 Environment Model (Environment)
301 Setting Unit
302 Start Unit
303 Observation Unit
304 Probability Generation Unit
305 Exclusion Unit
306 Selection Unit
307 Judgment Unit
308 Execution Unit
309 Environment Update Unit
310 Termination Judgment Unit
311 Model Update Unit
312 Retrospective Unit
Claims
1. An information processing device comprising: a memory for storing a learning program; and a processor for executing the learning program, wherein the learning program includes a reinforcement learning program that rewards an agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of an environment model to the state Si+1 of the next step i+1, and the processor includes: a determination unit that determines whether the state Si+1 of the environment model in the next step i+1, resulting from the action A(i,k) selected from among the actions A(i,j) in step i, is a specific state in which all actions A(i+1,j) are unavailable; an exclusion unit that excludes the action A(i-n,k) actually selected in step i-n from the agent's selection targets when the determination unit determines that the state Si+1 of the environment model is the specific state; and a rewind unit that returns step i of the episode to the earlier step i-m when the determination unit determines that the state Si+1 of the environment model is the specific state.
Here, the symbols are defined as follows:
i: a variable representing the step number of the state of the environment model
Si: a set of variables representing the state of the environment model at step i
j: an identifier for the action (1 ≤ j ≤ N)
A(i, j): a set of variables representing the action of identifier j at step i
k: an identifier for the action actually selected by the agent (1 ≤ k ≤ N)
A(i, k): a set of variables representing the action of identifier k actually selected at step i
n: a setting value of 0 or greater representing the number of steps returned from step i
m: a setting value representing the number of steps returned from step i (m > n)
2. The information processing apparatus according to claim 1, wherein the specific state is a state in which all of the actions A(i+1, j) in step i+1 violate the constraints defined in the environment model.
3. The information processing apparatus according to claim 1 or claim 2, wherein m = n + 1.
4. The information processing apparatus according to claim 1 or claim 2, wherein m > n + 1.
5. The information processing apparatus according to any one of claims 1 to 4, wherein n = 0.
6. The information processing apparatus according to any one of claims 1 to 4, wherein n > 0.
7. The information processing apparatus according to claim 6, wherein the exclusion unit excludes a plurality of actions previously selected by the agent from the agent's selection targets.
8. The information processing device according to any one of claims 1 to 7, wherein, if the state Si-m+1 of the environment model in the next step i-m+1 is the specific state as a result of the action A(i-m, k1) actually selected by the agent after returning to step i-m, the exclusion unit excludes the action A(i-n1, k) actually selected in step i-n1 from the agent's selection targets, and the rewind unit returns step i-m of the episode to the earlier step i-m1.
Here, the symbols are defined as follows:
k1: an identifier of the action actually selected by the agent after the step has been returned (1 ≤ k1 ≤ N, k1 ≠ k)
n1: a setting value representing the number of steps returned from step i (n1 ≥ m)
m1: a setting value representing the number of steps returned from step i (m1 > n1)
9. An information processing method performed by an information processing device that rewards an agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of an environment model to the state Si+1 of the next step i+1, the method comprising: determining whether the state Si+1 of the environment model in the next step i+1, resulting from the action A(i,k) selected from among the actions A(i,j) in step i, is a specific state in which all actions A(i+1,j) are unavailable; if it is determined that the state Si+1 of the environment model is the specific state, excluding the action A(i-n,k) actually selected in step i-n from the agent's selection targets; and if it is determined that the state Si+1 of the environment model is the specific state, returning step i of the episode to the earlier step i-m.
Here, the symbols are defined as follows:
i: an incrementable variable representing the step number of the environment model's state
Si: a set of variables representing the state of the environment model at step i
j: an identifier for the action (1 ≤ j ≤ N)
A(i, j): a set of variables representing the action of identifier j at step i
k: an identifier for the action actually selected by the agent (1 ≤ k ≤ N)
A(i, k): a set of variables representing the action of identifier k actually selected at step i
n: a setting value of 0 or greater representing the number of steps returned from step i
m: a setting value representing the number of steps returned from step i (m > n)
10. A computer program for rewarding an agent for an action A(i,k) selected by the agent in step i of an episode and for transitioning the state Si of an environment model to the state Si+1 of the next step i+1, the computer program causing a computer to perform the following steps: determine whether the state Si+1 of the environment model in the next step i+1, resulting from the action A(i,k) selected from among the actions A(i,j) in step i, is a specific state in which all actions A(i+1,j) are unavailable; if it is determined that the state Si+1 of the environment model is the specific state, exclude the action A(i-n,k) actually selected in step i-n from the agent's selection targets; and if it is determined that the state Si+1 of the environment model is the specific state, return step i of the episode to the earlier step i-m.
Here, the symbols are defined as follows:
i: an incrementable variable representing the step number of the environment model's state
Si: a set of variables representing the state of the environment model at step i
j: an identifier for the action (1 ≤ j ≤ N)
A(i, j): a set of variables representing the action of identifier j at step i
k: an identifier for the action actually selected by the agent (1 ≤ k ≤ N)
A(i, k): a set of variables representing the action of identifier k actually selected at step i
n: a setting value of 0 or greater representing the number of steps returned from step i
m: a setting value representing the number of steps returned from step i (m > n)
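The index arithmetic recited in the claims (exclude the action selected in step i-n, then return the episode to step i-m, with m > n and n ≥ 0) can be checked with a short Python sketch. The names `RewindPlan` and `plan_rewind` are hypothetical and do not appear in the disclosure, and the numbering of steps from 0 is an assumption made only so that the lower bound can be validated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewindPlan:
    excluded_step: int  # step i-n whose selected action A(i-n, k) is excluded
    resume_step: int    # step i-m to which the episode is returned


def plan_rewind(i: int, n: int, m: int) -> RewindPlan:
    """Compute the excluded step and the resume step for a dead end found after step i."""
    if n < 0:
        raise ValueError("n must be 0 or greater")
    if m <= n:
        raise ValueError("m must be greater than n")
    if i - m < 0:
        # Assumption: steps are numbered from 0, so the episode cannot
        # be returned to a step before its first step.
        raise ValueError("cannot return to a step before step 0")
    return RewindPlan(excluded_step=i - n, resume_step=i - m)
```

For example, under the settings n = 0 and m = n + 1 = 1, a specific state detected after step 5 gives `plan_rewind(5, 0, 1)`, which excludes the action selected in step 5 and resumes from step 4. Under n = 2 and m = 3 the same call excludes the action selected in step 3 and resumes from step 2. In both cases the resume step is strictly earlier than the excluded step, as the condition m > n requires.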