Information processing device, information processing method, and computer program
The information processing device addresses reinforcement learning inefficiencies by identifying and excluding actions causing specific states, allowing continuous learning through step reversion, ensuring uninterrupted progress.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- ENEOS HLDG INC
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
AI Technical Summary
Reinforcement learning methods face interruptions when agents encounter states where only unavailable actions exist, leading to inefficiencies and the need to restart learning processes.
The information processing device has a determination unit that identifies specific states in which all actions are unavailable, an exclusion unit that excludes the action that caused such a state from selection, and a rewind unit that reverts the episode to an earlier step, so that reinforcement learning can continue.
This enables reinforcement learning to proceed even when only unselectable actions are present, and prevents the agent from reselecting the action that led to the specific state, thus maintaining learning continuity.
Description
Technical Field
[0001] The present disclosure relates to an information processing apparatus, an information processing method, and a computer program.
Background Art
[0002] In reinforcement learning, an agent observes the state of the environment and selects an action to be executed from the available action options based on the observed state. The environment is updated according to the selected action, and the agent selects the next action based on the state of the updated environment. In such a reinforcement learning process, the agent may reach a state in which every remaining action option is one that it cannot select, leaving it unable to solve the problem.
[0003] Patent Document 1 discloses a reinforcement learning method in which, when a state in which the environment is reset (a state of interest) occurs during learning of environmental data, the features of the first environmental data in the first state in which the state of interest occurred are compared with the features of the second environmental data in the second state traced back from the first state, and the difference between the respective features is learned by a learner.
Prior Art Documents
Patent Documents
[0004]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0005] However, in the reinforcement learning method disclosed in Patent Document 1, every time the agent falls into a state where the environment is reset, reinforcement learning needs to be stopped and the above-mentioned complex learning needs to be executed.
Means for Solving the Problems
[0006] An information processing device according to one aspect of the present disclosure comprises: a memory that stores a learning program; and a processor that executes the learning program. The learning program includes a reinforcement learning program that rewards the agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1. The processor includes: a determination unit that determines whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unselectable; an exclusion unit that excludes the action A(i-n,k) actually selected in step i-n from the agent's selection targets when the determination unit determines that the state Si+1 of the environment model is the specific state; and a rewind unit that returns the episode from step i to the earlier step i-m when the determination unit determines that the state Si+1 of the environment model is the specific state. Here,
i: a variable representing the step number of the state of the environment model
Si: a set of variables representing the state of the environment model at step i
j: identifier of an action (1 ≤ j ≤ N)
A(i,j): a set of variables representing the action with identifier j in step i
k: identifier of the action actually selected by the agent (1 ≤ k ≤ N)
A(i,k): a set of variables representing the action with identifier k, which is actually selected in step i
n: a setting value of 0 or greater giving how many steps before step i the action to be excluded was selected
m: a setting value giving how many steps the episode is rewound from step i (m > n)
[0007] This disclosure can be implemented not only as an information processing device having the characteristic configuration described above, but also as an information processing method having the characteristic processing as its steps, as a computer program for causing the information processing device to execute the characteristic processing, as an information processing system including the information processing device, or as a semiconductor integrated circuit that implements part or all of the information processing device.
Effects of the Invention
[0008] According to this disclosure, reinforcement learning can continue even when, during the reinforcement learning process, the only action options that exist for the agent are ones that it cannot select.
Brief Description of the Drawings
[0009]
[Figure 1] Figure 1 shows an example of the configuration of a machine learning system according to an embodiment.
[Figure 2] Figure 2 is a block diagram showing an example of the hardware configuration of a learning device according to this embodiment.
[Figure 3] Figure 3 shows the relationship between the agent and the environment.
[Figure 4] Figure 4 is a diagram illustrating the flow of one episode.
[Figure 5] Figure 5 shows an example of a probability distribution of actions.
[Figure 6] Figure 6 shows an example of the configuration of an agent (learner) in a neural network.
[Figure 7] Figure 7 is a functional block diagram showing an example of the functions of the learning device according to this embodiment.
[Figure 8] Figure 8 is a diagram illustrating an example of a problem involving the loading and unloading of petroleum.
[Figure 9] Figure 9 shows an example of initial environment information for initial environment setup.
[Figure 10] Figure 10 shows an example of a set loading and unloading plan.
[Figure 11] Figure 11 is a diagram for explaining an action mask.
[Figure 12] Figure 12 is a diagram for explaining a dead end.
[Figure 13] Figure 13 is a diagram for explaining the determination of a dead end by a determination unit.
[Figure 14] Figure 14 is a diagram for explaining an example of an action mask when it is determined that the state of the environment has transitioned to a dead end.
[Figure 15] Figure 15 is a diagram for explaining an example of step backtracking.
[Figure 16] Figure 16 is a flowchart showing an example of reinforcement learning processing by a learning device according to an embodiment.
[Figure 17] Figure 17 is a diagram for explaining a first modification example of step backtracking.
[Figure 18] Figure 18 is a diagram for explaining an action mask and a second modification example of step backtracking when it is determined that the state of the environment has transitioned to a dead end.
[Figure 19] Figure 19 is a diagram for explaining an action mask and a third modification example of step backtracking when it is determined that the state of the environment has transitioned to a dead end.
[Figure 20] Figure 20 is a flowchart showing an example of inference processing by an inference device according to a fourth modification example.
Embodiments for Carrying Out the Invention
[0010] <Summary of Embodiments of the Present Disclosure> The following lists and explains the summary of the embodiments of the present disclosure.
[0011] (1) The information processing device according to this embodiment includes a memory that stores a learning program and a processor that executes the learning program. The learning program includes a reinforcement learning program that rewards the agent for an action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1. The processor includes: a determination unit that determines whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unselectable; an exclusion unit that excludes the action A(i-n,k) actually selected in step i-n from the agent's selection targets when the determination unit determines that the state Si+1 of the environment model is the specific state; and a rewind unit that returns the episode from step i to the earlier step i-m when the determination unit determines that the state Si+1 of the environment model is the specific state. Here,
i: a variable representing the step number of the state of the environment model
Si: a set of variables representing the state of the environment model at step i
j: identifier of an action (1 ≤ j ≤ N)
A(i,j): a set of variables representing the action with identifier j in step i
k: identifier of the action actually selected by the agent (1 ≤ k ≤ N)
A(i,k): a set of variables representing the action with identifier k, which is actually selected in step i
n: a setting value of 0 or greater giving how many steps before step i the action to be excluded was selected
m: a setting value giving how many steps the episode is rewound from step i (m > n)
[0012] As a result, after going back to step i-m, the agent will not again select action A(i-n,k), which caused the specific state. Therefore, even if the agent falls into a specific state during reinforcement learning in which only unselectable action options exist, reinforcement learning can still proceed.
[0013] (2) In (1) above, the specific state may be a state in which all of the actions A(i+1,j) in step i+1 violate the constraints defined in the environment model. This allows reinforcement learning to proceed even when the system falls into a specific state due to constraints in an environment where constraints are imposed.
[0014] (3) In (1) above, m may be equal to n + 1. This allows for efficient reinforcement learning by going back to step i-n-1, the step immediately preceding the step in which the excluded action A(i-n,k) was selected.
[0015] (4) In (1) above, m may be greater than n + 1. This allows reinforcement learning to proceed by going back to step i-m, which is two or more steps earlier than the step in which the excluded action A(i-n,k) was selected.
[0016] (5) In (1) above, n may be 0. This makes it possible to exclude action A(i,k), which is the direct cause of falling into a specific state, from the selection.
[0017] (6) In (1) above, n may be greater than 0. This allows an efficient search for a cause of the specific state that lies in a step before step i.
[0018] (7) In (6) above, the exclusion unit may exclude from the agent's selection targets a plurality of actions previously selected by the agent. This makes it possible to exclude a plurality of previously selected actions at once.
[0019] (8) In any one of (1) to (7) above, if the state Si-m+1 of the environment model in the next step i-m+1 is the specific state as a result of the action A(i-m,k1) actually selected by the agent after returning to step i-m, the exclusion unit may exclude the action A(i-n1,k) actually selected in step i-n1 from the agent's selection targets, and the rewind unit may return the episode from step i-m to the earlier step i-m1. Here,
k1: identifier of the action actually selected by the agent after the step has been returned (1 ≤ k1 ≤ N, k1 ≠ k)
n1: a setting value giving how many steps before step i the action to be excluded was selected (n1 ≥ m)
m1: a setting value giving how many steps the episode is rewound from step i (m1 > n1)
This allows an efficient search for a solution that avoids the specific state by rewinding repeatedly, even when the specific state cannot be avoided by a single rewind.
[0020] (9) The information processing method according to this embodiment is an information processing method performed by an information processing device that rewards the agent for the action A(i,k) selected by the agent in step i of the episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and includes the steps of: determining whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unselectable; excluding the action A(i-n,k) that was actually selected in step i-n from the agent's selection targets if it is determined that the state Si+1 of the environment model is the specific state; and returning the episode from step i to the earlier step i-m if it is determined that the state Si+1 of the environment model is the specific state. As a result, after going back to step i-m, the agent will not again select action A(i-n,k), which caused the specific state. Therefore, even if the agent falls into a specific state during reinforcement learning in which only unselectable action options exist, reinforcement learning can still proceed.
[0021] (10) The computer program according to this embodiment is a computer program that rewards the agent for the action A(i,k) selected by the agent in step i of an episode and transitions the state Si of the environment model to the state Si+1 of the next step i+1, and causes the computer to perform the following steps: determining whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state Si+1 of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unselectable; if it is determined that the state Si+1 of the environment model is the specific state, excluding the action A(i-n,k) that was actually selected in step i-n from the agent's selection targets; and if it is determined that the state Si+1 of the environment model is the specific state, returning the episode from step i to the earlier step i-m. As a result, after going back to step i-m, the agent will not again select action A(i-n,k), which caused the specific state. Therefore, even if the agent falls into a specific state during reinforcement learning in which only unselectable action options exist, reinforcement learning can still proceed.
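As an illustration only, the determination, exclusion, and rewind processing summarized in (1) can be sketched in Python as follows. The toy transition graph, the state names, the function name run_with_rewind, the use of the lowest-numbered selectable action in place of sampling from a learned policy, and the clamping of the rewind target to the first step are all assumptions made for this sketch and are not taken from the disclosure. Steps are counted from 0 here, and exclusions are recorded per step index.

```python
from collections import defaultdict

# Hypothetical environment model: state -> {action identifier: next state}.
# "dead" is non-terminal and offers no action, so it is a specific state.
GRAPH = {
    "start": {1: "a", 2: "b"},
    "a": {1: "dead", 2: "goal"},
    "b": {1: "goal"},
    "dead": {},
    "goal": {},
}


def run_with_rewind(graph, start, goal, n=0, m=1, max_rewinds=100):
    """Run one episode, masking and rewinding whenever a dead end is found."""
    states = [start]             # states[i] is the state at step i
    chosen = []                  # chosen[i] is the action selected in step i
    excluded = defaultdict(set)  # step index -> excluded action identifiers
    rewinds = 0
    i = 0
    while states[i] != goal:
        options = sorted(a for a in graph[states[i]] if a not in excluded[i])
        if not options:
            raise RuntimeError(f"no selectable action left at step {i}")
        k = options[0]  # stand-in for sampling from the policy
        nxt = graph[states[i]][k]
        chosen.append(k)
        # Determination: is every action in the next step unselectable?
        selectable_next = [a for a in graph[nxt] if a not in excluded[i + 1]]
        if nxt != goal and not selectable_next:
            if rewinds >= max_rewinds:
                raise RuntimeError("rewind budget exhausted")
            rewinds += 1
            # Exclusion: mask the action that was selected in step i-n.
            src = max(i - n, 0)
            excluded[src].add(chosen[src])
            # Rewind: return to step i-m (clamped to the first step).
            i = max(i - m, 0)
            del states[i + 1:]
            del chosen[i:]
            continue
        states.append(nxt)
        i += 1
    return chosen, rewinds
```

With the default n=0 and m=1, the sketch first reaches the dead end through action 1 in state "a", masks that action, rewinds one step, and then completes the episode; other values of n and m change only which step is masked and how far the episode is rewound.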
[0022] <Details of the embodiments of this disclosure> The embodiments of this disclosure will be described in detail below with reference to the drawings. At least some of the embodiments described below may be combined in any way.
[0023] [1. Machine Learning Systems] Figure 1 shows an example of the configuration of a machine learning system according to an embodiment. The machine learning system 1 according to the embodiment includes a learning device 10 and a terminal device 20. The learning device 10 is an example of an "information processing device".
[0024] The learning device 10 and the terminal device 20 are connected, for example, by a communication line, and are able to communicate data with each other. Figure 1 shows an example where the learning device 10 and the terminal device 20 are connected one-to-one, but the learning device 10 may be connected to multiple terminal devices 20 via a network.
[0025] The learning device 10 performs reinforcement learning. The terminal device 20 includes, for example, an input device and a display device. The user can, for example, operate the terminal device 20 to instruct the start of learning. Furthermore, the user can enter setting information for the learning model or input data for machine learning into the terminal device 20. The terminal device 20 can send the setting information or input data to the learning device 10 to set up the learning model or supply the input data to the learning device 10. When the learning device 10 finishes machine learning, it sends the results of machine learning to the terminal device 20, and the terminal device 20 can display the learning results on the display device.
[0026] For example, the learning device 10 is comprised of a computer. In a specific example, the learning device 10 is a computer with large-scale and high-speed computing capabilities, such as a supercomputer. The learning device 10 may be a computer dedicated to machine learning, or it may be a general-purpose computer. The learning device 10 may be a server, and the terminal device 20 may be a client.
[0027] For example, the learning device 10 may have the functions of the terminal device 20. In this case, the machine learning system 1 is composed of a single learning device 10.
[0028] [2. Hardware configuration of the learning device] Figure 2 is a block diagram showing an example of the hardware configuration of a learning device according to this embodiment.
[0029] The learning device 10 includes a processor 101, a non-volatile memory 102, a volatile memory 103, and an interface (hereinafter also referred to as "IF") 104. The processor 101, the non-volatile memory 102, the volatile memory 103, and the IF 104 are each connected to one another by a bus (data bus). The processor 101, the non-volatile memory 102, the volatile memory 103, and the IF 104 can each transmit data to one another via the bus.
[0030] The volatile memory 103 is a semiconductor memory such as SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory). The non-volatile memory 102 is a rewritable storage device such as flash memory or a hard disk. The learning program 200 is stored in the non-volatile memory 102. The learning function of the learning device 10 is realized when the learning program 200 is executed by the processor 101. The learning program 200 is an example of a "computer program".
[0031] The processor 101 is, for example, a CPU (Central Processing Unit). However, the processor 101 is not limited to a CPU. The processor 101 may also be a GPU (Graphics Processing Unit). In a specific example, the processor 101 is a multi-core processor. The processor 101 may also be a single-core processor. The processor 101 may include multiple processors or cores and be capable of performing parallel processing. The processor 101 is configured to execute computer programs. The processor 101 may include, for example, an ASIC (Application Specific Integrated Circuit) as part, or programmable hardware such as an FPGA (Field Programmable Gate Array) or a CPLD (Complex Programmable Logic Device) as part.
[0032] The learning program 200 includes an agent 210 and an environment model 220. The agent 210 is composed of a learner (learning model). The environment model 220 provides the environment in which the agent 210 acts. The environment model 220 is, for example, a simulator that simulates a real environment. Both the agent 210 and the environment model 220 are computer programs. Hereinafter, the environment model will also be simply referred to as the "environment".
[0033] The IF 104 is a communication interface for communication with the terminal device 20. For example, the IF 104 is an Ethernet interface ("Ethernet" is a registered trademark).
[0034] [3. Reinforcement Learning] The following describes reinforcement learning that can be performed using learning program 200.
[0035] Figure 3 shows the relationship between the agent and the environment. In reinforcement learning, agent 210 and environment 220 interact according to a Markov decision process, and learning progresses. Agent 210 is given the state of environment 220, and in the given state, agent 210 acts according to the policy. In response to agent 210's actions, the state of environment 220 changes, and agent 210 is given the updated new state of environment 220 and a reward corresponding to the action. The goal of reinforcement learning is to learn a policy that maximizes the accumulated reward at the end of such interaction cycles (episodes).
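The interaction cycle described above can be sketched as a short loop. This is a minimal illustration under stated assumptions: the class CountdownEnv, its reset and step methods, the reward of -1 per step, and the function run_episode are hypothetical names and values chosen for the sketch, not part of the learning program 200.

```python
class CountdownEnv:
    """Hypothetical toy environment: the state is a counter that must reach 0."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # The action is the amount subtracted from the counter (1 or 2).
        self.state = max(self.state - action, 0)
        reward = -1.0               # each step costs 1, so short episodes score higher
        done = self.state == 0      # episode termination condition
        return self.state, reward, done


def run_episode(env, policy):
    """Let the agent act until the episode ends and return the accumulated reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # the agent acts according to the policy
        state, reward, done = env.step(action)  # the environment returns state and reward
        total_reward += reward
    return total_reward
```

A policy that always subtracts 2 finishes in fewer steps than one that always subtracts 1 and therefore accumulates a higher reward, which is the quantity that learning tries to maximize.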
[0036] Figure 4 is a diagram illustrating the flow of an episode. An episode consists of one or more steps. In the example in Figure 4, one episode consists of three steps.
[0037] In the diagram, the white circle represents state S of environment 220. In environment 220 in initial state S1, agent 210 can select one of three actions A(1,1), A(1,2), or A(1,3). For example, in step 1, agent 210 selects action A(1,2) from the three actions A(1,1), A(1,2), and A(1,3), and when it executes the selected action A(1,2), the state of environment 220 transitions from state S1 to state S2.
[0038] Agent 210 observes the state S2 of the environment 220. In the environment 220 in state S2, Agent 210 can choose from three actions A(2,1), A(2,2), and A(2,3). For example, in step 2, Agent 210 selects action A(2,2) from the three actions A(2,1), A(2,2), and A(2,3), and when it executes the selected action A(2,2), the state of the environment 220 transitions from state S2 to state S3.
[0039] Agent 210 observes state S3 of environment 220. In environment 220 in state S3, agent 210 can select one of three actions A(3,1), A(3,2), or A(3,3). For example, in step 3, agent 210 selects action A(3,2) from the three actions A(3,1), A(3,2), and A(3,3), and when the selected action A(3,2) is executed, the state of environment 220 transitions from state S3 to terminal state ST, which satisfies the episode termination conditions.
[0040] Agent 210 calculates a probability distribution for selecting each action according to the policy at each step. Figure 5 shows an example of the action probability distribution. The probability distribution includes the selection probability for each action A1, A2, A3, A4, A5, ... The policy is a function that defines the relationship between the observed state of the environment 220 and the action probability distribution. Agent 210 is composed of a function approximator that learns the policy, and the function approximator includes one or more parameters.
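Selecting an action from such a probability distribution can be sketched as follows. The function name sample_action and the numbering of actions from 1 are assumptions for illustration; the disclosure does not specify how the sampling is implemented.

```python
import random


def sample_action(probabilities, rng):
    """Draw one action identifier (1..N) according to its selection probability."""
    identifiers = list(range(1, len(probabilities) + 1))
    return rng.choices(identifiers, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded only so that the example is repeatable
picks = [sample_action([0.1, 0.6, 0.3], rng) for _ in range(1000)]
```

Over many draws, the action with the largest selection probability is chosen most often, while the other actions are still explored occasionally.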
[0041] For example, a function approximator is a neural network, and more specifically, a deep neural network (DNN).
[0042] Figure 6 shows an example of the configuration of an agent (learner) in a neural network. A neural network includes an input layer, a hidden layer, and an output layer.
[0043] The input layer contains multiple nodes (shown as circles in the diagram). Each node in the input layer receives parameters s1, s2, s3, ... of the state S of environment 220 as input.
[0044] The hidden (intermediate) layer consists of one or more processing layers. In the example shown in Figure 6, the hidden layer has a multilayer structure. The hidden layer contains multiple nodes.
[0045] The output layer includes multiple nodes.
[0046] In the input, hidden, and output layers, nodes in adjacent layers are connected by edges (synaptic connections, shown as line segments in the diagram). Each edge is assigned a weight (connection weight). The policy is represented as a combination of these weights.
[0047] Each node in the output layer corresponds to an action A1, A2, A3, ... The selection probability of the corresponding action A1, A2, A3, ... is output from each node in the output layer.
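The mapping from state parameters to selection probabilities can be sketched as a small forward pass. The layer sizes, the ReLU activation in the hidden layer, the softmax at the output, and the function names are assumptions for this sketch; the disclosure states only that the approximator is, for example, a DNN whose output nodes give the selection probabilities.

```python
import math


def softmax(logits):
    """Convert output-layer values into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: each row of weights feeds one node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def policy_forward(state, layers):
    """Propagate state parameters s1, s2, ... through the layers."""
    values = list(state)
    for index, (weights, biases) in enumerate(layers):
        values = dense(values, weights, biases)
        if index < len(layers) - 1:
            values = [max(0.0, v) for v in values]  # ReLU (assumed) in hidden layers
    return softmax(values)


# Hypothetical network: 2 state parameters -> 2 hidden nodes -> 3 actions.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
    ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),
]
probabilities = policy_forward([0.3, 0.5], layers)
```

Learning the policy then amounts to adjusting the weights and biases so that actions leading to higher accumulated reward receive higher probabilities.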
[0048] [4. Functions of the learning device] Figure 7 is a functional block diagram showing an example of the functions of the learning device according to this embodiment.
[0049] The learning device 10 includes the following functions: a setting unit 301, a start unit 302, an observation unit 303, a probability generation unit 304, an exclusion unit 305, a selection unit 306, a determination unit 307, an execution unit 308, an environment update unit 309, a termination determination unit 310, a model update unit 311, and a rewind unit 312. Each of these functions is implemented by the processor 101.
[0050] The setting unit 301 initializes the agent 210 and the environment 220. For example, the setting unit 301 initializes the parameters (weights) of the agent 210. The setting unit 301 initializes the environment 220 according to the setting information provided by the terminal device 20.
[0051] As an example, consider reinforcement learning applied to the problem of loading and unloading petroleum (crude oil). Figure 8 is a diagram illustrating an example of this problem. In the example in Figure 8, petroleum is loaded into and unloaded from five oil tanks A, B, C, D, and E. The fill level of oil tank A is 30% (i.e., it holds 30% of its maximum capacity), that of oil tank B is 50%, that of oil tank C is 20%, that of oil tank D is 35%, and that of oil tank E is 60%.
[0052] Furthermore, in the example in Figure 8, the API specific gravity of oil in oil tank A is SGa, the API specific gravity of oil in oil tank B is SGb, the API specific gravity of oil in oil tank C is SGc, the API specific gravity of oil in oil tank D is SGd, and the API specific gravity of oil in oil tank E is SGe. Here, API specific gravity refers to the specific gravity of crude oil as defined by the American Petroleum Institute. API specific gravity is a value that can be obtained, for example, in accordance with ASTM D1298.
[0053] In this problem, the following operations are possible: selecting the receiving tank, determining the amount to be received, selecting the source tank, and determining the amount to be removed. Selecting the receiving tank is the operation of choosing the destination tank for the oil from oil tanks A, B, C, D, and E. Determining the amount to be received is the operation of determining the amount of oil to be received. Selecting the source tank is the operation of choosing the source tank for the oil from oil tanks A, B, C, D, and E. Determining the amount to be removed is the operation of determining the amount of oil to be removed.
[0054] For example, if oil tank A is selected as the receiving tank and the amount to be received is determined to be 5 kL, then 5 kL of oil will be delivered from the oil tanker (ship) to oil tank A. Likewise, if oil tank B is selected as the source tank and the amount to be removed is determined to be 10 kL, then 10 kL of oil will be shipped from oil tank B to the oil tanker (ship).
[0055] In this problem, it is also possible to transfer oil between oil tanks (hereinafter also referred to as "shifting"). For example, if oil tank C is selected as the receiving tank and the amount to be received is determined to be 7 kL, and oil tank D is selected as the source tank and the amount to be removed is determined to be 7 kL, then 7 kL of oil will be transferred from oil tank D to oil tank C.
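The receiving, shipping, and shifting operations can be sketched as updates to a table of tank inventories. The function names and the starting inventories in kL are hypothetical values chosen for illustration; only the tank names and the transferred amounts (5 kL, 10 kL, 7 kL) follow the examples in the text.

```python
def receive(stock, tank, amount_kl):
    """Return a new inventory table after oil is received into a tank."""
    updated = dict(stock)
    updated[tank] += amount_kl
    return updated


def ship(stock, tank, amount_kl):
    """Return a new inventory table after oil is removed from a tank."""
    updated = dict(stock)
    updated[tank] -= amount_kl
    return updated


def shift(stock, source, destination, amount_kl):
    """Transfer oil between tanks: remove from the source, receive at the destination."""
    return receive(ship(stock, source, amount_kl), destination, amount_kl)


# Hypothetical starting inventories in kL.
stock = {"A": 30.0, "B": 50.0, "C": 20.0, "D": 35.0, "E": 60.0}
after = receive(stock, "A", 5.0)       # ship -> tank A
after = ship(after, "B", 10.0)         # tank B -> ship
after = shift(after, "D", "C", 7.0)    # tank D -> tank C
```

Each function returns a new table rather than modifying its argument, which makes it straightforward to keep the inventories of earlier steps.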
[0056] For example, when oil is brought into an oil tank (let's call it oil tank E), the API specific gravity of the oil may change. For instance, if the API specific gravity of the oil stored in oil tank E is SGe, and 10kL of oil with an API specific gravity of SG0 is brought in from an oil tanker (ship), the API specific gravity of the oil stored in oil tank E will change from SGe to SGe1.
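The disclosure does not state how the new value SGe1 is calculated. One possible calculation, assuming ideal mixing in which volumes are additive, converts each API value to specific gravity with the standard relation API = 141.5 / SG - 131.5, takes the volume-weighted average of the specific gravities, and converts the result back. The function names below are hypothetical.

```python
def api_to_sg(api):
    """Convert API gravity (degrees API) to specific gravity at 60 °F."""
    return 141.5 / (api + 131.5)


def sg_to_api(sg):
    """Convert specific gravity at 60 °F to API gravity (degrees API)."""
    return 141.5 / sg - 131.5


def blended_api(volume_in_tank, api_in_tank, volume_received, api_received):
    """API gravity after mixing, assuming volumes are additive (an assumption)."""
    total_volume = volume_in_tank + volume_received
    blended_sg = (volume_in_tank * api_to_sg(api_in_tank)
                  + volume_received * api_to_sg(api_received)) / total_volume
    return sg_to_api(blended_sg)
```

For instance, receiving 10 kL of lighter oil into a tank that holds 50 kL of heavier oil moves the tank's API value toward that of the received oil, but by less than the midpoint because the tank volume dominates.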
[0057] In this problem, the state of environment 220 consists of the oil inventory (stock) of each of oil tanks A, B, C, D, and E, the amount of oil brought in from the ship, and the amount of oil shipped out to the ship. The actions that agent 210 can perform are selecting the receiving tank, determining the amount to be received, selecting the source tank, and determining the amount to be removed.
[0058] Figure 9 shows an example of initial environment information for the initial setup of environment 220. Initial environment information is included in the setup information. The initial setup of environment 220 includes inventory lower limit, inventory upper limit, API lower limit, API upper limit, initial inventory, and initial API for each oil tank (referred to as "tank" in the figure) A, B, C, D, E. In the example in Figure 9, the initial setup of oil tank A has an inventory lower limit of "STL_A", an inventory upper limit of "STU_A", an API lower limit of "SGL_A", an API upper limit of "SGU_A", an inventory of "ST_A0", and an API specific gravity of "SG_A0". The initial setup of oil tank B has an inventory lower limit of "STL_B", an inventory upper limit of "STU_B", an API lower limit of "SGL_B", an API upper limit of "SGU_B", an inventory of "ST_B0", and an API specific gravity of "SG_B0". In the initial settings for oil tank C, the lower limit of inventory is "STL_C", the upper limit of inventory is "STU_C", the lower limit of API is "SGL_C", the upper limit of API is "SGU_C", the inventory is "ST_C0", and the API specific gravity is "SG_C0". In the initial settings for oil tank D, the lower limit of inventory is "STL_D", the upper limit of inventory is "STU_D", the lower limit of API is "SGL_D", the upper limit of API is "SGU_D", the inventory is "ST_D0", and the API specific gravity is "SG_D0". In the initial settings for oil tank E, the lower limit of inventory is "STL_E", the upper limit of inventory is "STU_E", the lower limit of API is "SGL_E", the upper limit of API is "SGU_E", the inventory is "ST_E0", and the API specific gravity is "SG_E0".
[0059] Returning to Figure 7, the setting unit 301 initializes the environment 220 according to the initial environment information described above.
[0060] The setting unit 301 sets a loading / unloading plan as the target for loading and unloading petroleum. The agent 210 selects an action according to the policy, with the set loading / unloading plan as the target. The loading / unloading plan is included, for example, in the setting information.
[0061] Figure 10 shows an example of a set loading / unloading plan. The loading / unloading plan specifies, for example, the plan number, the type (loading or unloading), the date of loading or unloading, the ID of the vessel to be loaded or unloaded, the error limit for the API specific gravity of the oil to be unloaded, the API specific gravity of the oil to be loaded or unloaded, and the amount of oil to be loaded or unloaded. In the example in Figure 10, plan number "1" is of type "loading", the date is "November 21, 2024", the vessel ID is "EN01", the API specific gravity is "SG_I1", and the amount of oil is "Am1". Plan number "2" is of type "loading", the date is "November 23, 2024", the vessel ID is "GA02", the API specific gravity is "SG_I2", and the amount of oil is "Am2". In the loading plans, plan numbers "1" and "2", the error limit for API specific gravity is not specified. Plan number "3" is of type "unloading", the date is "November 25, 2024", the vessel ID is "KA03", the error limit for API specific gravity is "Tol3_U-Tol3_L" (Tol3_U is the upper limit, Tol3_L is the lower limit), the API specific gravity is "SG_O3", and the amount of oil is "Am3". Plan number "4" is of type "unloading", the date is "November 27, 2024", the vessel ID is "SA04", the error limit for API specific gravity is "Tol4_U-Tol4_L" (Tol4_U is the upper limit, Tol4_L is the lower limit), the API specific gravity is "SG_O4", and the amount of oil is "Am4".
[0062] Returning to Figure 7, the setting unit 301 further sets constraints for agent 210. The constraints are defined, for example, as follows:
(1) The inventory of an oil tank must not exceed the upper limit.
(2) The inventory of an oil tank must not fall below the lower limit.
(3) The number of oil tanks used during loading and unloading must not exceed the upper limit.
(4) The amount of oil brought in must not fall below the lower limit.
(5) The amount of oil shipped out must not fall below the lower limit.
(6) The API specific gravity of the oil in stock in an oil tank must not exceed the upper limit.
(7) The API specific gravity of the oil in stock in an oil tank must not fall below the lower limit.
[0063] For example, the upper and lower limits of inventory in constraints (1) and (2) are determined by the initial environmental information. The number of oil tanks used in constraint (3) is defined, for example, in the constraint information that defines the constraints. The lower limits of the amount of oil brought in and taken out in constraints (4) and (5) are defined in the constraint information. The upper and lower limits of the API specific gravity of oil in constraints (6) and (7) are determined by the initial environmental information of environment 220. The constraint information is included, for example, in the setting information.
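As a rough illustration of how constraints (1), (2), (6), and (7) might be checked for a single oil tank, the following Python sketch may help. The names `TankLimits` and `violates_constraints` and all numeric values are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TankLimits:
    """Hypothetical per-tank limits (assumed to come from the initial environment information)."""
    min_inventory: float
    max_inventory: float
    min_api: float
    max_api: float


def violates_constraints(inventory: float, api: float, limits: TankLimits) -> bool:
    """Return True if the tank state breaks constraint (1), (2), (6), or (7)."""
    if inventory > limits.max_inventory:  # constraint (1): inventory upper limit
        return True
    if inventory < limits.min_inventory:  # constraint (2): inventory lower limit
        return True
    if api > limits.max_api:              # constraint (6): API specific gravity upper limit
        return True
    if api < limits.min_api:              # constraint (7): API specific gravity lower limit
        return True
    return False


# Illustrative values only.
limits = TankLimits(min_inventory=10.0, max_inventory=100.0, min_api=30.0, max_api=40.0)
```

Constraints (3) to (5) concern the action rather than the tank state, so a fuller check would also need the action as an input.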
[0064] The starter unit 302 initiates an episode. The starter unit 302 can initiate multiple episodes using multiple agents 210. For example, multiple agents 210 are configured with a common policy.
[0065] The observation unit 303 observes the state of the environment 220. The state S of the environment 220 includes multiple parameters s1, s2, s3, s4, ... For example, parameter s1 is the inventory amount in oil tank A, parameter s2 is the API specific gravity of the oil in oil tank A, parameter s3 is the inventory amount in oil tank B, and parameter s4 is the API specific gravity of the oil in oil tank B. The environment 220 outputs the state S, and the observation unit 303 observes the state S by receiving the state S.
[0066] The probability generation unit 304 generates probability distributions for actions A1, A2, A3, ... for each agent 210 according to the policy. For example, the state S=(s1,s2,s3,s4, ...) of the environment 220 is input to the DNN described above, and the probability distributions for actions A1, A2, A3, ... are output from the DNN. Specifically, the probability generation unit 304 is a function approximator (e.g., DNN) of the policy that constitutes agent 210.
[0067] The exclusion unit 305 excludes from the selection any actions A1, A2, A3, ... that violate the constraints. Hereafter, excluding an action from the selection will also be referred to as "masking an action."
[0068] Figure 11 is a diagram illustrating the masking of actions. In the diagram, actions selectable by agent 210 are shown as black-filled circles, and masked actions are shown as dashed circles. In the state Si of environment 220 at step i (where i is a natural number), if performing action A(i,1) would violate a constraint, then action A(i,1) is masked. For example, in the state Si of environment 220, if the inventory in oil tank A is 9kL less than the inventory limit, and action A(i,1) is to bring 10kL of oil into oil tank A, then performing action A(i,1) would make the inventory in oil tank A exceed the inventory limit. In this case, the exclusion unit 305 masks action A(i,1).
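The numeric example above can be reproduced with a few lines of Python. This is only a sketch: the function name `mask_actions`, the 100 kL limit, and the second and third delivery amounts are assumptions made for illustration.

```python
def mask_actions(inventory_kl, limit_kl, delivery_options_kl):
    """Return one boolean per action; True means the action is masked."""
    return [inventory_kl + q > limit_kl for q in delivery_options_kl]


limit_kl = 100.0                 # hypothetical inventory limit of oil tank A
inventory_kl = limit_kl - 9.0    # 9 kL below the limit, as in the text
options = [10.0, 5.0, 9.0]       # hypothetical amounts for A(i,1), A(i,2), A(i,3)

masked = mask_actions(inventory_kl, limit_kl, options)
print(masked)  # [True, False, False]
```

Only the 10 kL delivery is masked, because it alone would push the inventory past the limit.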
[0069] Here, in step i, if in state Si of environment 220 all of actions A(i,1), A(i,2), and A(i,3) are unavailable, then state Si is said to be a dead end. A dead end is an example of a "specific state".
[0070] Figure 12 is a diagram illustrating a dead end. In state Si of environment 220, if all actions A(i,1), A(i,2), and A(i,3) violate the constraints, agent 210 cannot select any of actions A(i,1), A(i,2), and A(i,3). In this case, state Si of environment 220 is a dead end. In a dead end, agent 210 cannot select an action, and therefore the episode cannot proceed.
[0071] Returning to Figure 7, the selection unit 306 selects an action based on the probability distribution generated by the probability generation unit 304. Actions with a high probability are selected with a high probability, and actions with a low probability are selected with a low probability. However, the selection unit 306 cannot select masked actions.
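One possible way to sample an action while never picking a masked one is sketched below. It assumes that masked actions are given zero weight and that the remaining probabilities are used as relative weights. The disclosure does not spell out this detail, and the function name `select_action` is hypothetical.

```python
import random


def select_action(probs, masked, rng=random):
    """Sample an action index according to probs, never returning a masked index.

    Returns None when no selectable action has any probability mass,
    for example when every action is masked.
    """
    weights = [0.0 if m else p for p, m in zip(probs, masked)]
    if sum(weights) <= 0.0:
        return None
    # random.choices treats the weights as relative, so no explicit renormalization is needed.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
draws = [select_action([0.7, 0.2, 0.1], [True, False, False], rng) for _ in range(200)]
```

In this example, index 0 has the highest probability but is masked, so every draw is 1 or 2.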
[0072] The determination unit 307 determines whether, as a result of the first action selected by agent 210 in the first step of an episode, the state of environment 220 in the second step, which is later than the first step, will transition to a dead end.
[0073] Figure 13 is a diagram illustrating the determination of a dead end by the determination unit. In the figure, unselectable actions are shown by solid circles. In step i, in state Si of environment 220, actions A(i,1), A(i,2), and A(i,3) are selectable. If agent 210 selects action A(i,2), in the next step i+1, the state of environment 220 transitions to Si+1. The determination unit 307 determines whether all of actions A(i+1,1), A(i+1,2), and A(i+1,3) in step i+1 violate the constraints. That is, the determination unit 307 determines whether state Si+1 of environment 220 in step i+1 is a dead end. In the example in Figure 13, all of actions A(i+1,1), A(i+1,2), and A(i+1,3) violate the constraints. In this case, the determination unit 307 determines that state Si+1 of environment 220 in step i+1 is a dead end.
[0074] The determination unit 307 determines whether the state of environment 220 in a future step is a dead end. That is, the determination unit 307 evaluates not the state Si of environment 220 in the current step i (the step i to be executed), but the state Si+1 of environment 220 in the next step i+1. Therefore, the actions that the determination unit 307 finds to be unselectable have not been masked by the exclusion unit 305. In Figure 13, unselectable actions are shown as solid circles to distinguish them from masked actions (dashed circles).
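The look-ahead determination can be sketched as follows: the selected action is simulated without being committed, and the resulting state is tested for a dead end. The toy model (a single tank with a 100 kL limit and three fixed deliveries) and the function names are assumptions made for illustration.

```python
from typing import Callable, Sequence


def is_dead_end(state, actions: Sequence, violates: Callable) -> bool:
    """A state is a dead end when every action violates the constraints."""
    return all(violates(state, a) for a in actions)


def leads_to_dead_end(state, action, actions: Sequence,
                      step: Callable, violates: Callable) -> bool:
    """Simulate `action` from `state` and test whether the next state is a dead end."""
    next_state = step(state, action)
    return is_dead_end(next_state, actions, violates)


# Toy model (hypothetical): the state is the inventory in kL of one tank.
LIMIT = 100
ACTIONS = (10, 20, 30)  # each action delivers a fixed amount in kL


def step(inventory, amount):
    return inventory + amount


def violates(inventory, amount):
    return inventory + amount > LIMIT
```

With an inventory of 70 kL, `leads_to_dead_end(70, 30, ACTIONS, step, violates)` is true because the tank would be full afterwards, whereas delivering 10 kL leaves at least one selectable action.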
[0075] Returning to Figure 7, if the determination unit 307 determines that the state S of the environment 220 does not transition to a dead end, the execution unit 308 executes the action selected by the selection unit 306.
[0076] The environment update unit 309 updates the state S of the environment 220 as a result of the action being performed. That is, the environment update unit 309 estimates the changes in the environment 220 caused by the action being performed through simulation and determines the state of the environment 220 after the change. For example, if in step i the action is performed to deliver QkL of oil with an API specific gravity P to oil tank A, where the inventory is STa0% and the API specific gravity is SGa0, the environment update unit 309 determines the API specific gravity and inventory of oil tank A in step i+1 after the delivery and updates the state of the environment 220 from Si to Si+1.
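The disclosure does not state how the blended API specific gravity is calculated. The sketch below therefore assumes ideal, volume-additive mixing and uses the standard conversion between API gravity and specific gravity (SG = 141.5 / (API + 131.5)). The function names are hypothetical, and the inventory is expressed in kL rather than as a percentage.

```python
def api_to_sg(api: float) -> float:
    """Standard conversion from API gravity to specific gravity."""
    return 141.5 / (api + 131.5)


def sg_to_api(sg: float) -> float:
    """Inverse of api_to_sg."""
    return 141.5 / sg - 131.5


def deliver(inventory_kl: float, tank_api: float, q_kl: float, oil_api: float):
    """Return (new_inventory_kl, new_api) after delivering q_kl of oil into the tank.

    Assumption: volumes are additive, so specific gravity blends by volume fraction.
    The total volume after delivery is assumed to be positive.
    """
    total = inventory_kl + q_kl
    mixed_sg = (inventory_kl * api_to_sg(tank_api) + q_kl * api_to_sg(oil_api)) / total
    return total, sg_to_api(mixed_sg)


new_inventory, new_api = deliver(50.0, 30.0, 50.0, 40.0)
```

Under these assumptions the blended value lies between the API specific gravities of the two oils, and mixing two oils with the same value leaves it unchanged.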
[0077] The termination determination unit 310 determines whether or not the termination condition of the episode has been met. The termination condition is, for example, that the number of steps has reached a predetermined value N.
[0078] If the termination determination unit 310 determines that the termination condition has not been met, the step number is incremented, and the observation unit 303 observes the updated state of the environment 220.
[0079] If the termination determination unit 310 determines that the termination condition has been met, the model update unit 311 updates the agent 210. That is, the model update unit 311 updates the weights of the DNN that serves as the policy.
[0080] For example, the model update unit 311 generates an oil loading and unloading plan as a result of the actions for each of several episodes. The model update unit 311 compares the oil loading and unloading plans generated from each episode with the oil loading and unloading plan set as the target, and determines the oil loading and unloading plan that best approximates the target. The model update unit 311 determines, for example, the cumulative reward in the episode in which the oil loading and unloading plan that best approximates the target was generated, and determines the parameters (weights) of agent 210 according to the cumulative reward. The model update unit 311 updates agent 210 by setting the determined parameters to agent 210.
[0081] If the determination unit 307 determines that the state S of the environment 220 transitions to a dead end, the exclusion unit 305 excludes (masks) a third action, which is at least one of the actions selected by agent 210 in the episode, from the selection targets of agent 210.
[0082] Figure 14 illustrates an example of action masking when it is determined that the environmental state is transitioning to a dead end. In the example shown in Figure 14, the exclusion unit 305 masks the action A(i,2) selected in the current execution step i. In other words, the exclusion unit 305 masks the action A(i,2) selected in step i immediately preceding step i+1, in which it was determined that the state Si+1 of the environment 220 is a dead end.
[0083] For example, if the determination unit 307 determines that the state S of the environment 220 transitions to a dead end, the exclusion unit 305 may determine whether the number of episodes, among the multiple episodes, in which a dead end occurs exceeds a certain number (a threshold), and may mask the selected action A(i,2) if it does. The exclusion unit 305 does not need to mask the selected action A(i,2) if the number of episodes in which a dead end occurs is less than or equal to the threshold. In this case, the episodes that reach a dead end are terminated, but a certain number of episodes in which no dead end occurs remain, and reinforcement learning can proceed.
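The threshold decision across parallel episodes could be expressed as below. This is a minimal sketch; the function name `should_mask` and the representation of episodes as a list of booleans are assumptions.

```python
def should_mask(dead_end_flags, threshold):
    """Return True when more than `threshold` parallel episodes are heading into a dead end.

    dead_end_flags[e] is True if episode e would transition to a dead end in the next step.
    """
    return sum(1 for flag in dead_end_flags if flag) > threshold


# Three of four episodes would reach a dead end; with a threshold of 2 the action is masked.
print(should_mask([True, True, False, True], threshold=2))   # True
# Exactly two affected episodes do not exceed the threshold, so no masking is needed.
print(should_mask([True, True, False, False], threshold=2))  # False
```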
[0084] Returning to Figure 7, the rewind unit 312 rewinds the steps to be executed by agent 210 if the determination unit 307 determines that the state S of the environment 220 has transitioned to a dead end.
[0085] Figure 15 is a diagram illustrating an example of rewinding steps. In the example in Figure 15, the rewind unit 312 rewinds from the current execution target step i to step i-1, which is one step prior. That is, the rewind unit 312 rewinds from step i, in which action A(i,2) was masked, to step i-1, which is one step prior.
[0086] Returning to Figure 7, after going back through the steps, the observation unit 303 observes the state Si-1 of the environment 220 at step i-1. Subsequently, in step i-1, the learning device 10 executes the process described above. As a result, the action A(i,2) that causes a dead end is masked, so action A(i,2) is not selected again, and reinforcement learning (episode) can proceed.
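The interplay of look-ahead, masking, and rewinding within one episode might look like the following sketch. It uses a toy single-tank environment with hypothetical limits and actions, and a uniform random choice stands in for sampling from the policy. As a further assumption, dead-end masks are keyed by (step, state) so that a mask recorded for one state is not applied to a different state reached at the same step; the disclosure describes masks per step only.

```python
import random

LOW, HIGH = 0, 100        # hypothetical inventory limits (kL)
ACTIONS = (40, 50, -70)   # hypothetical inventory change of each action (kL)


def available(step, inv, masks):
    """Actions that neither violate the limits nor carry a dead-end mask."""
    blocked = masks.get((step, inv), set())
    return [a for a in ACTIONS if LOW <= inv + a <= HIGH and a not in blocked]


def run_episode(start, n_steps, rng):
    """Return the list of executed actions, or None if no action can be selected."""
    states, chosen, masks = [start], [], {}
    i = 0
    while i < n_steps:
        inv = states[i]                        # observe state S_i
        candidates = available(i, inv, masks)  # ordinary masking plus dead-end masks
        if not candidates:
            return None                        # nothing selectable; the sketch gives up
        action = rng.choice(candidates)        # stand-in for sampling from the policy
        nxt = inv + action                     # simulate S_{i+1} without committing
        if not available(i + 1, nxt, masks):   # S_{i+1} would be a dead end
            masks.setdefault((i, inv), set()).add(action)  # mask A(i,k)
            i = max(i - 1, 0)                  # rewind one step (stay at the first step)
            del states[i + 1:]                 # discard states after the rewound step
            del chosen[i:]                     # discard actions from the rewound step on
            continue
        chosen.append(action)                  # execute the action
        states.append(nxt)
        i += 1
    return chosen


plan = run_episode(start=25, n_steps=5, rng=random.Random(0))
```

In this toy model, delivering 40 kL from an inventory of 25 kL leads to 65 kL, where every action violates a limit. That action is therefore masked, and the episode always starts with the 50 kL delivery.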
[0087] [5. Operation of the learning device] Figure 16 is a flowchart showing an example of reinforcement learning processing by a learning device according to an embodiment.
[0088] For example, the user operates the terminal device 20 to input configuration information. The configuration information includes environment initial information that defines the initial state of the environment 220, constraint information that defines the constraint conditions, and target information. In learning about the oil loading and unloading problem, the target information is, for example, the oil loading and unloading plan described above. The input configuration information is transmitted from the terminal device 20 to the learning device 10.
[0089] The processor 101 receives the configuration information and initializes the learning device 10 according to the configuration information (step S101). That is, the processor 101 initializes the environment 220 according to the environment initial information and sets the constraints according to the constraint information. Furthermore, the processor 101 sets the target according to the target information.
[0090] The processor 101 sets the parameters (weights) of agent 210 and reads out agent 210 (step S102). The agent 210 read out may be a model pre-trained by machine learning (e.g., supervised learning), or it may be an untrained model.
[0091] Processor 101 sets the variable i to its initial value "1" and starts the episode (step S103). That is, the first step 1 of the episode is set as the target for execution. Specifically, processor 101 starts multiple episodes. The number of episodes is included in the configuration information. The following processes proceed in parallel across multiple episodes.
[0092] The processor 101 observes the state Si of the environment 220 (step S104).
[0093] The processor 101 identifies actions that violate constraints based on the observed state Si of the environment 220 and masks the identified actions (step S105).
[0094] The processor 101 calculates the selection probability for each of the multiple actions A(i,j) based on the observed state Si of the environment 220, and generates a probability distribution. Based on the generated probability distribution, the processor 101 selects action A(i,k) (step S106).
[0095] The processor 101 determines whether the state Si+1 of the environment 220 in the next step i+1 transitions to a dead end as a result of the execution of the selected action A(i,k) (step S107). That is, the processor 101 determines whether all of the actions A(i+1,j) in step i+1 violate the constraints.
[0096] If the state Si+1 of environment 220 does not transition to a dead end (NO in step S107), the processor 101 executes the selected action A(i,k) and updates the state Si of environment 220 to Si+1 (step S108). That is, the processor 101 determines the next state Si+1 of environment 220 by simulation.
[0097] The processor 101 determines whether the episode termination condition (i=N) has been met (step S109). If the episode termination condition has not been met (NO in step S109), the processor 101 increments the variable i (step S110) and returns to step S104.
[0098] If the state Si+1 of environment 220 transitions to a dead end (YES in step S107), the processor 101 determines whether the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than a threshold (step S111).
[0099] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is less than or equal to the threshold (NO in step S111), the processor 101 proceeds to step S108. In this case, in the next step, episodes in which the state Si+1 of environment 220 does not transition to a dead end proceed, and episodes in which the state Si+1 of environment 220 transitions to a dead end are terminated.
[0100] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than the threshold (YES in step S111), the processor 101 masks the action A(i,k) selected in step i to be executed (step S112).
[0101] Furthermore, processor 101 decrements the variable i (step S113) and returns to step S104. This rewinds the step to be executed to the previous step.
[0102] If the episode termination condition is met (YES in step S109), the processor 101 updates agent 210 (step S114). That is, the processor 101 determines the cumulative reward in the episode in which the oil loading and unloading plan that most closely approximates the target was generated, and determines the parameters of agent 210 according to the cumulative reward. The processor 101 updates agent 210 by setting the determined parameters to agent 210.
[0103] The processor 101 determines whether the termination condition for reinforcement learning has been met (step S115). For example, the termination condition for reinforcement learning is that the number of episode executions has reached a predetermined value. In another example, the termination condition for reinforcement learning may be that the performance of agent 210 has reached a predetermined value. Here, the performance of agent 210 may be the degree of agreement between the outputted oil loading and unloading plan and the target.
[0104] If the termination condition for reinforcement learning is not met (NO in step S115), the processor 101 returns the environment 220 to its initial state and returns to step S103. If the termination condition for reinforcement learning is met (YES in step S115), the reinforcement learning process ends.
[0105] [6-1. First variation] In the embodiment described above, when it was determined that the state S of the environment 220 transitions to a dead end, the step was rewound from the current execution target step i to step i-1, which is one step prior (see Figure 15), but the embodiment is not limited to this.
[0106] Figure 17 is a diagram illustrating a first modified example of rewinding steps. In the modified example shown in Figure 17, the rewind unit 312 rewinds two steps from the current execution target step i to step i-2. That is, the rewind unit 312 rewinds two steps from step i, in which action A(i,2) was masked, to step i-2. Furthermore, the number of steps that the rewind unit 312 rewinds may be three or more. In this way, even if the rewind goes back two or more steps from the step of the masked action A(i,2), the masked action A(i,2) will not be selected again, and the state S of the environment 220 will not reach a dead end via the same path as before.
[0107] [6-2. Second variation] In the embodiment described above, when it was determined that the state S of the environment 220 was transitioning to a dead end, the action A(i,2) selected in the current execution step i was masked (see Figure 15), but the embodiment is not limited to this.
[0108] Figure 18 illustrates a second modified example of action masking and rewinding steps when it is determined that the state of the environment transitions to a dead end. In the modified example shown in Figure 18, the exclusion unit 305 masks action A(i-1,2) selected in step i-1, one step prior to the current execution target step i, rather than action A(i,2) selected in step i. In other words, the exclusion unit 305 masks action A(i-1,2) selected in step i-1, two steps prior to step i+1, for which it was determined that the state Si+1 of the environment 220 is a dead end. Furthermore, the action that the exclusion unit 305 masks may be an action selected in a step two or more steps prior to the current execution target step i.
[0109] Furthermore, in this case, the rewind unit 312 rewinds to a step one or more steps prior to the step of the masked action. In the modified example shown in Figure 18, the rewind unit 312 rewinds to step i-2, one step prior to step i-1, in which action A(i-1,2) was masked. However, the rewind unit 312 may also rewind to a step two or more steps prior to step i-1, in which action A(i-1,2) was masked. As a result, since one of the actions in the path leading to the dead end is masked, the masked action is not selected again after the steps have been rewound, and the state S of the environment 220 does not reach the dead end via the same path as before.
[0110] Furthermore, it is possible to mask multiple previously selected actions together. For example, in the example shown in Figure 18, the exclusion unit 305 may mask actions A(i,2) and A(i-1,2). This allows multiple actions that cause a dead end to be masked together.
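The embodiment and the first and second variations differ only in which step's action is masked and how far the steps are rewound. Using the symbols n and m from claim 1 (the masked action is the one selected in step i-n, the episode returns to step i-m, and m > n), the relationship can be sketched as follows. The function name `mask_and_rewind_steps` is hypothetical.

```python
from typing import Tuple


def mask_and_rewind_steps(i: int, n: int = 0, m: int = 1) -> Tuple[int, int]:
    """Return (step whose selected action is masked, step to rewind to).

    n: how many steps before step i the masked action was selected (n >= 0).
    m: how many steps before step i the episode is rewound to (m > n).
    """
    if n < 0 or m <= n:
        raise ValueError("require m > n >= 0")
    return i - n, i - m


print(mask_and_rewind_steps(5))            # embodiment: (5, 4)
print(mask_and_rewind_steps(5, n=0, m=2))  # first variation: (5, 3)
print(mask_and_rewind_steps(5, n=1, m=2))  # second variation: (4, 3)
```

The sketch does not check that step i-m still lies within the episode; a real implementation would need to handle that boundary.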
[0111] [6-3. Third Variation] For example, the masking of the action and the rewinding of steps when the state S of the environment 220 transitions to a dead end may be repeated multiple times. Figure 19 is a diagram illustrating a third modified example of action masking and rewinding steps when it is determined that the state of the environment transitions to a dead end. In the modified example shown in Figure 19, as a result of the selection unit 306 selecting action A(i,2) in step i, the state Si+1 of the environment 220 in the next step i+1 becomes a dead end. The exclusion unit 305 masks the action A(i,2) selected in step i, and the rewind unit 312 rewinds to step i-1, one step prior to step i.
[0112] In step i-1, the selection unit 306 again selects the previously selected action A(i-1,2). If all of the actions A(i,1), A(i,2), and A(i,3) in step i are masked, the state Si of the environment 220 is a dead end, so the exclusion unit 305 masks the action A(i-1,2) selected in step i-1, and the rewind unit 312 rewinds to step i-2, one step prior to step i-1.
[0113] In step i-2, the selection unit 306 again selects the previously selected action A(i-2,2). If all of actions A(i-1,1), A(i-1,2), and A(i-1,3) in step i-1 are masked, the state Si-1 of environment 220 is a dead end, so the exclusion unit 305 masks action A(i-2,2) selected in step i-2, and the rewind unit 312 rewinds to step i-3, one step prior to step i-2. Repeating this process, the rewind unit 312 rewinds to step k.
[0114] This allows us to identify the action A(p,2) that causes the state S of environment 220 to reach a dead end, and to mask action A(p,2). Therefore, we can search for a path that avoids the dead end.
[0115] As another variation, if the ratio of masked actions to the total number of actions in a step reached by rewinding exceeds a threshold, the exclusion unit 305 may mask all actions in that step, even if unmasked actions remain, and the rewind unit 312 may rewind further. This reduces the amount of computation in the process of rewinding multiple steps and allows for the efficient search of a path that avoids dead ends.
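The ratio test in this variation could be written as a small predicate. This is only a sketch; the function name `should_skip_step` and the example threshold of 0.5 are assumptions.

```python
def should_skip_step(num_masked: int, num_actions: int, ratio_threshold: float) -> bool:
    """Return True when the share of masked actions in a step exceeds the threshold.

    In that case the remaining actions of the step are treated as masked as well,
    and the steps are rewound further.
    """
    if num_actions <= 0:
        raise ValueError("num_actions must be positive")
    return num_masked / num_actions > ratio_threshold


print(should_skip_step(2, 3, 0.5))  # True: two of three actions are already masked
print(should_skip_step(1, 3, 0.5))  # False: the step is still worth exploring
```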
[0116] [6-4. Fourth variation] In the embodiments described above, the reinforcement learning process of the learning device 10 in the learning phase, in which the agent 210 is trained by machine learning, has been explained, but the invention is not limited thereto. In the inference process in the inference phase, in which the agent 210 trained by reinforcement learning performs inference, it may also be determined whether the state S of the environment 220 transitions to a dead end, and if the state S of the environment 220 transitions to a dead end, the action selected in the previous step may be masked. A fourth modified example of the embodiment will be described in detail below.
[0117] In the fourth modified example, the learning device 10 functions as an inference device. However, the agent 210, which has been trained by reinforcement learning using the learning device 10, may be installed in an inference device different from the learning device 10, and the inference process may be performed by the inference device.
[0118] Referring to Figure 7, in the learning device 10 which functions as an inference device, the function of the model update unit 311 is disabled (not used). That is, since there is no need to adjust the parameters of the trained agent 210, the model update unit 311 is unnecessary. The other functions of the inference device are the same as those of the learning device 10.
[0119] Figure 20 is a flowchart showing an example of inference processing by the inference device according to the fourth modified example.
[0120] For example, the user operates the terminal device 20 to input configuration information. The configuration information includes initial environmental information that shows the actual inventory status of the oil tanks, and target information that is the actual oil loading and unloading plan. Furthermore, the configuration information includes the same constraints as those in the reinforcement learning process.
[0121] The processor 101 receives the configuration information and initializes the learning device 10 according to the configuration information (step S101). That is, the processor 101 initializes the environment 220 according to the environment initial information and sets the constraints according to the constraint information. Furthermore, the processor 101 sets the target according to the target information.
[0122] The processor 101 sets the parameters (weights) of agent 210 and reads out agent 210 (step S202). The agent 210 read out is a trained model obtained through reinforcement learning.
[0123] Processor 101 sets the variable i to its initial value "1" and starts the episode (step S103). That is, the first step 1 of the episode is set as the target for execution. Specifically, processor 101 starts multiple episodes. The number of episodes is included in the configuration information. The following processes proceed in parallel across multiple episodes.
[0124] The processor 101 observes the state Si of the environment 220 (step S104).
[0125] The processor 101 identifies actions that violate constraints based on the observed state Si of the environment 220 and masks the identified actions (step S105).
[0126] The processor 101 calculates the selection probability for each of the multiple actions A(i,j) based on the observed state Si of the environment 220, and generates a probability distribution. Based on the generated probability distribution, the processor 101 selects action A(i,k) (step S106).
[0127] The processor 101 determines whether the state Si+1 of the environment 220 in the next step i+1 transitions to a dead end as a result of the execution of the selected action A(i,k) (step S107). That is, the processor 101 determines whether all of the actions A(i+1,j) in step i+1 violate the constraints.
[0128] If the state Si+1 of environment 220 does not transition to a dead end (NO in step S107), the processor 101 executes the selected action A(i,k) and updates the state Si of environment 220 to Si+1 (step S108). That is, the processor 101 determines the next state Si+1 of environment 220 by simulation.
[0129] The processor 101 determines whether the episode termination condition (i=N) has been met (step S109). If the episode termination condition has not been met (NO in step S109), the processor 101 increments the variable i (step S110) and returns to step S104.
[0130] If the state Si+1 of environment 220 transitions to a dead end (YES in step S107), the processor 101 determines whether the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than a threshold (step S111).
[0131] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is less than or equal to the threshold (NO in step S111), the processor 101 proceeds to step S108. In this case, in the next step, episodes in which the state Si+1 of environment 220 does not transition to a dead end proceed, and episodes in which the state Si+1 of environment 220 transitions to a dead end are terminated.
[0132] If the number of episodes in which the state Si+1 of environment 220 transitions to a dead end is greater than the threshold (YES in step S111), the processor 101 masks the action A(i,k) selected in step i to be executed (step S112).
[0133] Furthermore, processor 101 decrements the variable i (step S113) and returns to step S104. This rewinds the step to be executed to the previous step.
[0134] If the episode termination condition is met (YES in step S109), the processor 101 determines whether the inference process termination condition has been met (step S214). For example, the inference process termination condition is that the number of episode executions has reached a predetermined value. In another example, the inference process termination condition is that the degree of agreement between the outputted oil loading and unloading plan and the target is greater than a predetermined value.
[0135] If the termination condition for the inference process is not met (NO in step S214), the processor 101 returns the environment 220 to its initial state and returns to step S103. If the termination condition for the inference process is met (YES in step S214), the processor 101 outputs the inference result, i.e., the oil loading and unloading plan with the highest degree of agreement with the target (step S215). This completes the inference process.
[0136] [6-5. Other variations] The above-described embodiment illustrates a learning device 10 that deals with the problem of loading and unloading petroleum, but this is illustrative and not limiting. For example, the learning device 10 can deal with problems similar to the problem of loading and unloading petroleum. A similar problem is, for example, the problem of receiving and shipping goods in a warehouse. In this problem, multiple petroleum tanks are replaced with multiple warehouses or multiple areas in one warehouse, and petroleum is replaced with cargo or items to be stored. Furthermore, the learning device 10 may deal with problems different from the problem of loading and unloading petroleum and the problem of receiving and shipping goods in a warehouse.
[0137] [7. Addendum] (Note 1) An information processing device comprising: a determination unit that determines whether, as a result of a learning agent executing a first action selected in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which it is impossible for the learning agent to select one or more second actions; an exclusion unit that, when the determination unit determines that the state of the environment transitions to the specific state, excludes a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets; and a rewind unit that, when the determination unit determines that the state of the environment transitions to the specific state, rewinds the step to be executed by the learning agent from the first step to a fourth step that is earlier than a third step in which the third action was selected.
[0138] (Note 2) The information processing device according to Note 1, wherein the specific state is a state in which all of the one or more second actions violate the constraints defined in the environment.
[0139] (Note 3) The information processing device according to Note 1 or Note 2, wherein the fourth step is the step immediately preceding the third step.
[0140] (Note 4) The information processing device according to Note 1 or Note 2, wherein the fourth step is a step that is several steps earlier than the third step.
[0141] (Note 5) The information processing device according to any one of Notes 1 to 4, wherein the third action is the first action, and the third step is the first step.
[0142] (Note 6) The information processing device according to any one of Notes 1 to 4, wherein the third step is a step that precedes the first step.
[0143] (Note 7) The information processing device according to Note 6, wherein the third action is a plurality of actions selected in the episode.
[0144] (Note 8) The information processing device according to any one of Notes 1 to 7, wherein, if the learning agent, having been rewound to the fourth step, executes a fourth action selected in the fourth step and as a result the state of the environment transitions to the specific state, the exclusion unit excludes a fifth action that was selected prior to the third action, and the rewind unit rewinds the step to be executed by the learning agent from the fourth step to a sixth step that is earlier than a fifth step in which the fifth action was selected.
[0145] (Note 9) An information processing method including: a step of determining whether, as a result of a learning agent executing a first action selected in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which it is impossible for the learning agent to select one or more second actions; a step of excluding, when it is determined that the state of the environment transitions to the specific state, a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets; and a step of rewinding, when it is determined that the state of the environment transitions to the specific state, the step to be executed by the learning agent from the first step to a fourth step that is earlier than a third step in which the third action was selected.
[0146] (Note 10) A computer program for causing a computer to execute: a step of determining whether, as a result of a learning agent executing a first action selected in a first step of an episode, the state of the environment in which the learning agent acts transitions, in a second step later than the first step, to a specific state in which it is impossible for the learning agent to select one or more second actions; a step of excluding, when it is determined that the state of the environment transitions to the specific state, a third action, which is at least one of the actions selected by the learning agent in the episode, from the learning agent's selection targets; and a step of rewinding, when it is determined that the state of the environment transitions to the specific state, the step to be executed by the learning agent from the first step to a fourth step that is earlier than a third step in which the third action was selected.
[0147] [8. Supplementary Notes] The embodiments disclosed herein are illustrative in all respects and are not restrictive. The scope of rights of the present disclosure is defined by the claims rather than by the embodiments described above, and is intended to include all modifications within the meaning and scope equivalent to the claims.
[Explanation of Symbols]
[0148]
1 Machine learning system
10 Learning device (information processing device)
20 Terminal device
101 Processor
102 Non-volatile memory
103 Volatile memory
104 Interface (IF)
200 Learning program
210 Agent (learning agent)
220 Environment model (environment)
301 Setting unit
302 Start unit
303 Observation unit
304 Probability generation unit
305 Exclusion unit
306 Selection unit
307 Determination unit
308 Execution unit
309 Environment update unit
310 Termination determination unit
311 Model update unit
312 Rewind unit
Claims
1. An information processing apparatus comprising:
a memory that stores a learning program; and
a processor that executes the learning program, wherein
the learning program includes a reinforcement learning program that gives an agent a reward for the action A(i,k) selected in step i of an episode and transitions the state S_i of an environment model to the state S_{i+1} of the next step i+1, and
the processor includes:
a determination unit that determines whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state S_{i+1} of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable;
an exclusion unit that, when the determination unit determines that the state S_{i+1} of the environment model is the specific state, excludes the action A(i-n,k) that was actually selected in step i-n from the selection targets of the agent; and
a rewind unit that, when the determination unit determines that the state S_{i+1} of the environment model is the specific state, returns step i of the episode to the earlier step i-m,
where:
i: A variable representing the step number of the state of the environment model.
S_i: A set of variables representing the state of the environment model in step i.
j: Identifier of an action (1 ≤ j ≤ N).
A(i,j): A set of variables representing the action with identifier j in step i.
k: Identifier of the action actually selected by the agent (1 ≤ k ≤ N).
A(i,k): A set of variables representing the action with identifier k actually selected in step i.
n: A setting value of 0 or greater representing how many steps before step i the action to be excluded was selected.
m: A setting value representing how many steps the episode is returned from step i (m > n).
2. The information processing apparatus according to claim 1, wherein the specific state is a state in which all of the actions A(i+1,j) in step i+1 violate constraints defined in the environment model.
3. The information processing apparatus according to claim 1, wherein m = n + 1.
4. The information processing apparatus according to claim 1, wherein m > n + 1.
5. The information processing apparatus according to claim 1, wherein n = 0.
6. The information processing apparatus according to claim 1, wherein n > 0.
7. The information processing apparatus according to claim 6, wherein the exclusion unit excludes a plurality of actions previously selected by the agent from the selection targets of the agent.
8. The information processing apparatus according to any one of claims 1 to 7, wherein, when the action A(i-m,k1) actually selected by the agent after returning to step i-m results in the state S_{i-m+1} of the environment model in the next step i-m+1 being the specific state,
the exclusion unit excludes the action A(i-n1,k) that was actually selected in step i-n1 from the selection targets of the agent, and
the rewind unit returns step i-m of the episode to the earlier step i-m1,
where:
k1: Identifier of the action actually selected by the agent after the step has been returned (1 ≤ k1 ≤ N, k1 ≠ k).
n1: A setting value representing how many steps before step i the action to be excluded was selected (n1 ≥ m).
m1: A setting value representing how many steps the episode is returned from step i (m1 > n1).
9. An information processing method performed by an information processing device that gives an agent a reward for the action A(i,k) selected by the agent in step i of an episode and transitions the state S_i of an environment model to the state S_{i+1} of the next step i+1, the method including:
a step of determining whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state S_{i+1} of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable;
a step of excluding, when it is determined that the state S_{i+1} of the environment model is the specific state, the action A(i-n,k) that was actually selected in step i-n from the selection targets of the agent; and
a step of returning, when it is determined that the state S_{i+1} of the environment model is the specific state, step i of the episode to the earlier step i-m,
where:
i: An incrementable variable representing the step number of the state of the environment model.
S_i: A set of variables representing the state of the environment model in step i.
j: Identifier of an action (1 ≤ j ≤ N).
A(i,j): A set of variables representing the action with identifier j in step i.
k: Identifier of the action actually selected by the agent (1 ≤ k ≤ N).
A(i,k): A set of variables representing the action with identifier k actually selected in step i.
n: A setting value of 0 or greater representing how many steps before step i the action to be excluded was selected.
m: A setting value representing how many steps the episode is returned from step i (m > n).
10. A computer program that gives an agent a reward for the action A(i,k) selected by the agent in step i of an episode and transitions the state S_i of an environment model to the state S_{i+1} of the next step i+1, the computer program causing a computer to execute:
a step of determining whether, as a result of the action A(i,k) selected from among the actions A(i,j) in step i, the state S_{i+1} of the environment model in the next step i+1 is a specific state in which all actions A(i+1,j) are unavailable;
a step of excluding, when it is determined that the state S_{i+1} of the environment model is the specific state, the action A(i-n,k) that was actually selected in step i-n from the selection targets of the agent; and
a step of returning, when it is determined that the state S_{i+1} of the environment model is the specific state, step i of the episode to the earlier step i-m,
where:
i: An incrementable variable representing the step number of the state of the environment model.
S_i: A set of variables representing the state of the environment model in step i.
j: Identifier of an action (1 ≤ j ≤ N).
A(i,j): A set of variables representing the action with identifier j in step i.
k: Identifier of the action actually selected by the agent (1 ≤ k ≤ N).
A(i,k): A set of variables representing the action with identifier k actually selected in step i.
n: A setting value of 0 or greater representing how many steps before step i the action to be excluded was selected.
m: A setting value representing how many steps the episode is returned from step i (m > n).
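The index relationships among i, n, m, n1, and m1 in the claims above can be checked with a short Python sketch. The names `RewindPlan`, `plan_rewind`, and `plan_second_rewind` are hypothetical and do not appear in this disclosure. The sketch also assumes that steps are numbered from 0, so a return to a negative step is rejected; the claims do not state this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewindPlan:
    exclude_step: int  # step whose selected action is excluded
    return_step: int   # step to which the episode is returned


def plan_rewind(i: int, n: int, m: int) -> RewindPlan:
    """Steps i-n and i-m for the first rewind (n >= 0, m > n)."""
    if n < 0:
        raise ValueError("n must be 0 or greater")
    if m <= n:
        raise ValueError("m must be greater than n")
    if i - m < 0:
        # Assumption of this sketch: the episode starts at step 0.
        raise ValueError("cannot return to a step before the episode start")
    return RewindPlan(exclude_step=i - n, return_step=i - m)


def plan_second_rewind(i: int, m: int, n1: int, m1: int) -> RewindPlan:
    """Steps i-n1 and i-m1 for a repeated rewind (n1 >= m, m1 > n1)."""
    if n1 < m:
        raise ValueError("n1 must be at least m")
    if m1 <= n1:
        raise ValueError("m1 must be greater than n1")
    if i - m1 < 0:
        # Same step-0 assumption as in plan_rewind.
        raise ValueError("cannot return to a step before the episode start")
    return RewindPlan(exclude_step=i - n1, return_step=i - m1)
```

For example, with i = 5, n = 0, and m = 1 (the combination of claims 3 and 5), `plan_rewind(5, 0, 1)` excludes the action selected in step 5 and returns the episode to step 4. If the action newly selected after that return again leads to the specific state, `plan_second_rewind(5, 1, 1, 2)` excludes the action selected in step 4 and returns the episode to step 3.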