Controlling an agent using relative variational intrinsic control
By training a policy neural network using the Relative Variational Intrinsic Control (RVIC) technique, the problem of insufficient skill learning diversity without external rewards is solved. This enables the learning of a more universal and effective set of skills with less resource consumption, applicable to both real and simulated environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2021-09-10
- Publication Date
- 2026-06-23
AI Technical Summary
Existing skill learning methods, without external rewards, struggle to ensure the diversity of the learned skill set in altering the agent-environment relationship, resulting in poor performance of the skill set in real-world tasks.
The policy neural network is trained using the Relative Variational Intrinsic Control (RVIC) technique. Rewards are generated through relative and absolute discriminators to incentivize the learning of a set of skills that can be distinguished in terms of changing the agent-environment relationship.
With fewer training iterations and lower computational resource consumption, the system learns a more universal and effective set of skills, enabling the agent to complete tasks faster; the approach is applicable to both real and simulated environments.
Figure CN116134451B_ABST
Abstract
Description
Technical Field
[0001] This specification relates to the use of machine learning models to control agents.
Background Technology
[0002] Machine learning models receive inputs and generate outputs, such as predicted outputs, based on the received inputs. Some machine learning models are parametric models, which generate outputs based on the received inputs and on the values of the model's parameters.
[0003] Some machine learning models are deep models, which use multiple layers to generate outputs from received inputs. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers, each of which applies a non-linear transformation to the received input to generate an output.
Summary of the Invention
[0004] This specification describes a system implemented as a computer program on one or more computers in one or more locations, which trains a policy neural network for controlling an agent. Specifically, the system trains the policy neural network such that it can be used to control an agent to perform a set of skills in an unsupervised manner (e.g., using only intrinsic rewards).
[0005] The subject matter described herein may be implemented in certain embodiments to achieve one or more of the following advantages.
[0006] Even without external rewards, agents can still learn useful behaviors by identifying and mastering a diverse set of skills within their environment. Existing skill learning methods use mutual information objectives to incentivize each skill to be diverse and distinguishable from the others. However, applying these existing methods can yield skill sets that are diverse only in a trivial sense, such as skills that are distinguishable only by the final state of the trajectory generated by performing the skill. The final state of a skill, however, depends on its initial state, i.e., on the context in which the skill is performed. To ensure useful skill diversity, this specification discloses a technique that uses a skill learning objective, relative variational intrinsic control (RVIC), to incentivize the learning of skills that are distinguishable in how they alter the agent's relationship to its environment. The resulting skill set tiles the space of affordances available to the agent and is more useful for downstream applications than skills discovered by existing methods, for example, when reused in hierarchical reinforcement learning for tasks with external rewards.
[0007] Compared to conventional systems, the system described in this specification can train policy neural networks to an acceptable performance level in fewer training iterations, thereby consuming fewer computational resources (e.g., memory and computing power). For example, a hierarchical agent using a skill-conditioned policy pre-trained with relative variational intrinsic control can achieve higher performance levels than a hierarchical agent using a pre-trained skill policy discovered by existing skill learning methods. Furthermore, a set of one or more policy neural networks trained by the system described in this specification can select actions that enable the agent to complete tasks more efficiently (e.g., faster) than policy neural networks trained by alternative systems. As mentioned above, the learned skills can be more general and therefore easier to combine to complete tasks.
[0008] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of the subject matter will become apparent from the specification, drawings, and claims.
Attached Figure Description
[0009] Figure 1 shows an example policy neural network system.
[0010] Figure 2 shows an example architecture for Relative Variational Intrinsic Control (RVIC).
[0011] Figure 3 is a flowchart of an example process for controlling an agent using relative variational intrinsic control. The same reference numerals and names in the various figures indicate the same elements.
Detailed Implementation
[0012] This specification describes a method for training policy neural networks.
[0013] The policy neural network is configured to: receive a policy input, which includes an observation characterizing the state of the environment and data identifying a skill; and generate a policy output (e.g., a probability distribution over a set of possible actions), which is used to control an agent interacting with the environment so that the agent performs the identified skill. Specifically, the policy neural network is trained to learn a set of skills using a skill learning objective known as Relative Variational Intrinsic Control (RVIC), such that the skills are diverse in how each one alters the agent's relationship with the environment. During training, the policy output of the policy neural network is used to determine actions for controlling the agent. As will be described later, in general, a "skill" includes a set of actions performed by an agent in an environment.
[0014] In some implementations, the environment is a real-world environment, and the agent is a mechanical agent that interacts with the real-world environment. For example, the agent may be a robot that interacts with the environment to perform a specific task, such as locating an object of interest in the environment, moving the object of interest to a specified location in the environment, or navigating to a specified destination in the environment; or the agent may be an autonomous or semi-autonomous land, air, or sea vehicle that navigates in the environment.
[0015] In these implementations, observations may include one or more of the following: images, object orientation data, and sensor data captured as the agent interacts with the environment, such as sensor data from image, distance, or orientation sensors, or from actuators.
[0016] For example, in the case of a robot, observations may include data characterizing the robot’s current state, such as one or more of the following: joint orientation, joint velocity, joint force, torque or acceleration, such as gravity-compensated torque feedback, and the global or relative pose of items held by the robot.
[0017] In the case of robots or other mechanical agents or vehicles, observations can similarly include one or more of the following: orientation, linear or angular velocity, force, torque or acceleration, and the global or relative pose of one or more parts of the agent. Observations can be defined in 1, 2, or 3 dimensions and can be absolute and / or relative.
[0018] Observations may also include, for example: sensed electronic signals, such as motor current or temperature signals; and / or image or video data, such as data from sensors of the agent or information from sensors located separately from the agent in the environment.
[0019] In the case of electronic agents, observation may include data from one or more sensors that monitor a part of the plant or service facility, such as current, voltage, power, temperature, and other sensors and / or electronic signals indicating the operation of electronic and / or mechanical items of the equipment.
[0020] In these implementations, the action can be a control input for controlling a robot, such as the torque of the robot's joints or a higher level of control command, or for an autonomous or semi-autonomous land, air, or sea vehicle, such as the torque of the vehicle's control surfaces or other control elements or a higher level of control command.
[0021] In other words, actions can include, for example, orientation, velocity, or force / torque / acceleration data of one or more joints of a robot or a part of another mechanical agent. Action data may additionally or alternatively include electronic control data, such as motor control data, or more generally, data for controlling one or more electronic devices within an environment, the control of which affects the observed state of the environment. For example, in the case of autonomous or semi-autonomous land, air, or sea vehicles, actions can include navigational actions such as steering and movement, such as braking and / or acceleration of the vehicle.
[0022] In some implementations, the environment is a simulated environment, and the agent is implemented as one or more computers that interact with the simulated environment. Training the agent in a simulated environment allows the agent to learn from a large amount of simulated training data while avoiding the risks associated with training the agent in a real-world environment, such as damage to the agent caused by performing poorly chosen actions. The agent trained in the simulated environment can then be deployed in the real-world environment.
[0023] For example, the simulation environment could be a simulation of a robot or vehicle, and a reinforcement learning system could be trained on the simulation. For instance, the simulation environment could be a motion simulation environment, such as a driving simulation or a flight simulation, and the agent could be a simulated vehicle navigating through the motion simulation. In these implementations, actions could be control inputs used to control a simulated user or simulated vehicle.
[0024] In another example, the simulated environment could be a video game, and the agent could be a simulated user playing the video game.
[0025] In another example, the environment can be a chemical synthesis or protein folding environment, such that each state is a corresponding state of a protein chain or one or more intermediate or precursor chemicals, and the agent is a computer system for determining how to fold the protein chain or synthesize the chemical. In this example, actions are possible folding actions for folding the protein chain or actions for assembling the precursor chemical / intermediate, and the desired outcome can include, for example, folding the protein to stabilize it and enable it to perform a specific biological function or providing an efficient synthetic route for the chemical. As another example, the agent can be a mechanical agent that executes or controls protein folding actions automatically selected by the system without human interaction. Observations can include direct or indirect observation of the protein's state, and / or can be derived from the simulation.
[0026] In a similar manner, the environment can be a drug design environment, such that each state corresponds to a potential pharmaceutical chemical, and the agent is a computer system for determining the elements and / or synthetic pathways of the pharmaceutical chemical. The drug / synthesis can be designed based, for example, on rewards derived from the drug's objectives in a simulation. As another example, the agent can be a mechanical agent that performs or controls the synthesis of the drug.
[0027] In some applications, the agent can be a static or mobile software agent, i.e., a computer program configured to operate autonomously and / or in conjunction with other software agents or personnel to perform tasks. For example, the environment could be an integrated circuit routing environment, and the system could be configured to learn to perform routing tasks for routing interconnects of integrated circuits such as ASICs. Thus, the reward (or cost) can depend on one or more routing metrics, such as interconnect resistance, capacitance, impedance, loss, speed or propagation delay, physical line parameters (such as width, thickness, or geometry), and design rules. Observations can be observations of component positions and interconnects; actions can include component placement actions, for example, to define component position or orientation, and / or interconnect routing actions, such as interconnect selection and / or placement actions. Therefore, a routing task can include placing components, i.e., determining the position and / or orientation of components in an integrated circuit, and / or determining the routing of interconnects between components. Once the routing task is completed, integrated circuits, such as ASICs, can be manufactured based on the determined placement and / or routing. Alternatively, the environment could be a data packet communication network environment, and the agent could be a router used to route data packets on the communication network based on observations of the network.
[0028] Generally, in a simulated environment, observations may include one or more previously described observations or simulated versions of observation types, and actions may include one or more previously described actions or simulated versions of action types.
[0029] In some other applications, the agent can control actions in a real-world environment including equipment (e.g., in a data center or mains power or water distribution system, or in a manufacturing plant or service facility). Thus, observations can relate to the operation of the plant or facility. For example, observations could include observations of equipment power or water usage, or observations of power generation or distribution control, or observations of resource usage or waste generation. The agent can control actions in the environment to improve efficiency, for example, by reducing resource usage, and / or to reduce the environmental impact of operations in the environment, for example, by reducing waste. These actions can include actions that control or impose operating conditions on equipment in the plant / facility, and / or actions that result in changes to settings in the operation of the plant / facility, such as adjusting or turning on / off components of the plant / facility.
[0030] In some further applications, the environment is a real-world environment, and the agent manages the allocation of tasks across computing resources (e.g., on mobile devices and / or in data centers). In these implementations, actions may include assigning tasks to specific computing resources.
[0031] As another example, actions may include presenting an advertisement, observations may include ad impressions or click counts or ratios, and rewards may represent previous selections of items or content by one or more users.
[0032] Generally, in the applications described above, where the environment is a simulated version of the real-world environment, once the system / method has been trained in the simulation, it can then be applied to the real-world environment. That is, control signals generated by the system / method can be used to control the agent to perform tasks in the real-world environment in response to observations from the real-world environment. Optionally, the system / method can continue training in the real-world environment based on one or more rewards from the real-world environment.
[0033] Optionally, in any of the above embodiments, observations at any given time step may include data from previous time steps, which may be helpful in characterizing the environment, such as actions performed at previous time steps, rewards received at previous time steps, etc.
[0034] Figure 1 shows an example policy neural network system 100. The policy neural network system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented.
[0035] The policy neural network system 100 uses a policy neural network 102 to control the agent 104 that interacts with the environment 106 by selecting an action 116 to be performed by the agent 104 at each of multiple time steps.
[0036] Each input to the policy neural network 102 may include an observation 110 that characterizes the state of the environment with which the agent 104 is interacting, and the output of the policy neural network (“policy output” 114) may define an action 116 to be performed by the agent 104 in response to the observation 110.
[0037] As a specific example, the output of the policy neural network 102 can be a corresponding Q-value for each action in the action set, which represents the predicted return, i.e., the predicted time-discounted sum of future rewards, that the agent will receive as a result of performing the action in response to the observation.
[0038] Then, system 100 can control agent 104 based on the Q value of actions in the action set, for example, by selecting the action with the highest Q value as the action to be performed by agent 104.
[0039] As another specific example, each input to the policy neural network 102 can be an observation, and the output of the policy neural network 102 can be a probability distribution over the set of actions, where the probability of each action represents the likelihood that performing the action in response to the observation will maximize the predicted reward. The system 100 can then control the agent 104 based on probability, for example, by selecting the action with the highest probability as the action to be performed by the agent 104, or by sampling actions from the probability distribution. As another specific example, the policy output can directly define the action to be performed; that is, the policy neural network 102 can output a policy output that defines a single action.
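The two selection rules described above (taking the highest-valued action, or sampling from a probability distribution) can be sketched as follows. This is a minimal illustration, not the system's implementation; the function names `greedy_action` and `sample_action` and the example values are hypothetical.

```python
import random

def greedy_action(values):
    """Return the index of the largest entry (works for Q-values or probabilities)."""
    return max(range(len(values)), key=lambda a: values[a])

def sample_action(probabilities, rng):
    """Draw one action index from a categorical distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Hypothetical outputs for a four-action set.
q_values = [0.1, 1.7, -0.3, 0.9]
probabilities = [0.0, 0.0, 1.0, 0.0]
rng = random.Random(0)

chosen_by_q = greedy_action(q_values)
chosen_by_sampling = sample_action(probabilities, rng)
```

Greedy selection is deterministic, while sampling keeps some randomness in the agent's behavior.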
[0040] In some cases, to allow for fine-grained control of the agent, system 100 can treat the space of actions to be performed by the agent, i.e., the possible set of control inputs, as a continuous space. This setting is called a continuous control setting. In these cases, the output of the policy neural network 102 can be parameters of a multivariate probability distribution in the space, such as the mean and covariance of a multivariate normal distribution.
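In the continuous control setting, an action can be drawn from the distribution whose parameters the network outputs. The sketch below assumes a diagonal covariance, so each action dimension has its own mean and standard deviation; that simplification and the name `sample_continuous_action` are assumptions for illustration only.

```python
import random

def sample_continuous_action(mean, std, rng):
    """Sample each action dimension independently (diagonal-covariance assumption)."""
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

rng = random.Random(0)
# Hypothetical two-dimensional control input, e.g. two joint torques.
action = sample_continuous_action([0.5, -1.0], [0.1, 0.2], rng)
```

A full covariance matrix would additionally capture correlations between action dimensions.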
[0041] At each time step while controlling agent 104, system 100 receives observations 110 characterizing the current state of environment 106 at that time step, as well as data (also referred to as skill data 112, or simply "skills") identifying skills from the skill set.
[0042] Skill data 112 identifies a skill from a skill set. As used herein, a "skill" is a behavior performed by agent 104 as a result of a series of actions 116 performed by agent 104 in response to successive observations 110. At the beginning of training the policy neural network 102, the behaviors will typically be random or nearly random and may not be distinct from one another. However, as training progresses, the behaviors will typically become different because of the way system 100 trains the policy neural network 102; for example, they will modify the state of the environment differently when starting from a given initial state. That is, when the policy neural network 102 is provided with policy inputs that include data identifying a given skill 112 and the policy output 114 is used to control the agent, conditioning the policy neural network on the given skill 112 will cause agent 104 to perform a different set of actions 116 than conditioning the policy neural network 102 on a different skill from the skill set.
[0043] The skill set can be finite or infinite. When the skill set is finite, the skill data 112 can be, for example, a one-hot vector identifying a given skill or a dense embedding identifying a given skill. When the skill set is infinite, the skill data 112 can be a point, such as a vector, from a continuous space representing the skill set.
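The two kinds of skill data can be illustrated with a short sketch. The helper names are hypothetical, and sampling a continuous skill uniformly from [-1, 1] in each dimension is an assumption made only for this example.

```python
import random

def one_hot_skill(index, num_skills):
    """Encode a skill from a finite set as a one-hot vector."""
    if not 0 <= index < num_skills:
        raise ValueError("skill index out of range")
    return [1.0 if i == index else 0.0 for i in range(num_skills)]

def sample_continuous_skill(dim, rng):
    """Sample a skill as a point in a continuous space (uniform range is an assumption)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

discrete_skill = one_hot_skill(2, num_skills=4)
continuous_skill = sample_continuous_skill(3, random.Random(0))
```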
[0044] System 100 uses training engine 108 to train policy neural network 102 (e.g., to learn a set of skills that are distinguishable in how they change the relationship between agent 104 and environment 106).
[0045] During training, the training engine 108 repeatedly uses the policy neural network 102 to generate trajectories. Each trajectory includes a sequence of observations 110 received while the agent 104 interacts with the environment 106 under the control of the policy neural network conditioned on a given selected skill 112. The trajectory data is described in more detail below with reference to Figure 2.
[0046] Then, training engine 108 trains policy neural network 102 on the generated trajectories. In some cases, engine 108 trains policy neural network 102 on-policy, i.e., it trains on the generated trajectories immediately, so that any given trajectory used for training was generated by the current version of policy neural network 102. In other cases, engine 108 trains policy neural network 102 off-policy, i.e., it stores the generated trajectories in memory and then trains policy neural network 102 on the stored trajectories, so that a given trajectory used for training may have been generated by an earlier version of policy neural network 102.
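For the second case, in which generated trajectories are stored in memory and reused later, a bounded buffer is one common way to hold them. The sketch below is an assumption about how such storage might look; the class name `TrajectoryBuffer`, its capacity, and uniform sampling are illustrative choices, not details taken from this specification.

```python
import random
from collections import deque

class TrajectoryBuffer:
    """Fixed-capacity store that evicts the oldest trajectory first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, trajectory):
        self._storage.append(trajectory)

    def sample(self, batch_size, rng):
        """Uniformly sample up to batch_size stored trajectories without replacement."""
        items = list(self._storage)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._storage)

buffer = TrajectoryBuffer(capacity=2)
for name in ["t0", "t1", "t2"]:  # placeholder trajectories
    buffer.add(name)
batch = buffer.sample(2, random.Random(0))
```

Because sampled trajectories may come from earlier versions of the policy, training on them typically needs an algorithm that tolerates stale data.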
[0047] To train the policy neural network on a given trajectory, training engine 108 uses a relative discriminator neural network 120 and an absolute discriminator neural network 126. Generally, in implementations, each of these discriminator neural networks is also trained during the training of the policy neural network, specifically to predict the selected skill.
[0048] The relative discriminator neural network 120 is configured to process relative input 118, which includes the initial observation and the last observation in the sequence of observations within a given trajectory, to generate a relative output 122 comprising a corresponding relative score for each skill in the skill set. Each relative score represents an estimated likelihood that the policy neural network 102 was conditioned on the corresponding skill when the trajectory data was generated. In an embodiment, during training of the policy neural network 102, the training engine 108 also trains the relative discriminator neural network 120 by optimizing an objective function that encourages an increase in the relative score corresponding to the actually selected skill, i.e., encourages the relative discriminator neural network 120 to generate the relative output 122 more accurately. For example, the objective function could be a log-likelihood or another objective that measures the relative score assigned to the actually selected skill.
[0049] The absolute discriminator neural network 126 is configured to process an absolute input 124, which includes the last observation in the sequence (but not the initial observation), to generate an absolute output 128 comprising a corresponding absolute score for each skill in the skill set. Each absolute score represents an estimated likelihood that the policy neural network 102 was conditioned on the corresponding skill when the trajectory data was generated. During training of the policy neural network, the training engine 108 trains the absolute discriminator neural network 126 by optimizing an objective function that encourages an increase in the absolute score corresponding to the actually selected skill, i.e., encourages the absolute discriminator neural network 126 to generate the absolute output 128 more accurately. For example, the objective function could be a log-likelihood or another objective that measures the absolute score assigned to the actually selected skill.
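The log-likelihood objective mentioned for both discriminators can be sketched as the log-probability that a discriminator assigns to the skill that was actually selected. The example assumes the discriminator emits one unnormalized logit per skill and that scores are obtained with a softmax; both are assumptions, and the gradient step that would increase this quantity is not shown.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax over per-skill logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def skill_log_likelihood(logits, selected_skill):
    """Log-probability assigned to the selected skill (higher is better)."""
    return log_softmax(logits)[selected_skill]

# Hypothetical discriminator logits for a three-skill set.
logits = [2.0, 0.0, 0.0]
objective = skill_log_likelihood(logits, selected_skill=0)
```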
[0050] In some implementations, the relative input 118 includes the first N observations and the last N observations in the observation sequence of a given trajectory, where N is a predefined constant. In these implementations, the absolute input 124 includes the last N observations in the sequence.
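Building the two discriminator inputs from a trajectory amounts to simple slicing. The function names below are hypothetical, and concatenating the first N and last N observations into one list is an illustrative choice.

```python
def _check(observations, n):
    if n < 1 or n > len(observations):
        raise ValueError("n must be between 1 and the trajectory length")

def relative_input(observations, n=1):
    """First n and last n observations of the trajectory."""
    _check(observations, n)
    return list(observations[:n]) + list(observations[-n:])

def absolute_input(observations, n=1):
    """Last n observations of the trajectory only."""
    _check(observations, n)
    return list(observations[-n:])

trajectory = ["s0", "s1", "s2", "s3"]  # placeholder observations
rel = relative_input(trajectory)
absolute = absolute_input(trajectory)
```

With N = 1 this reduces to the initial and last observation for the relative input and the last observation alone for the absolute input.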
[0051] When training on a given trajectory, the training engine 108 generates a reward 130 based on the relative output 122 and the absolute output 128 (e.g., based on the difference between the absolute score and the relative score). Generating reward 130 from relative output 122 and absolute output 128 is described in more detail below with reference to Figure 2 and Figure 3.
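One way such a reward could be computed is sketched below. It assumes the scores are probabilities, that the difference is taken between their logarithms, and that the relative score is rewarded while the absolute score is penalized; the log form and the function name `intrinsic_reward` are assumptions, not details confirmed by this text.

```python
import math

def intrinsic_reward(relative_scores, absolute_scores, selected_skill):
    """Log relative score minus log absolute score for the selected skill."""
    return (math.log(relative_scores[selected_skill])
            - math.log(absolute_scores[selected_skill]))

# Hypothetical discriminator outputs for a three-skill set.
relative = [0.7, 0.2, 0.1]
absolute = [0.4, 0.3, 0.3]
reward = intrinsic_reward(relative, absolute, selected_skill=0)
```

Under these assumptions the reward is positive when the skill is easier to identify from the change between the initial and last observation than from the last observation alone.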
[0052] Then, the training engine 108 uses reward 130 to train the policy neural network 102.
[0053] Specifically, the training engine 108 trains the policy neural network 102 through reinforcement learning, based on reward 130, to maximize the expected time-discounted reward for the generated trajectory data.
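The time-discounted sum of rewards for a single trajectory can be computed with a backward pass over the rewards. The reward values and the discount factor below are hypothetical.

```python
def discounted_return(rewards, discount):
    """Sum of rewards where the reward at step t is weighted by discount**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total

# Hypothetical trajectory with a single reward at the final step.
trajectory_return = discounted_return([0.0, 0.0, 1.0], discount=0.5)
```

Reinforcement learning then adjusts the policy parameters to increase the expected value of this quantity over trajectories.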
[0054] Training the policy neural network based on the reward is described in more detail below with reference to Figure 2 and Figure 3.
[0055] Once the policy neural network 102 has been trained to allow the agent 104 to execute a set of skills in the environment 106, the system 100 or another system can use the trained policy neural network to control the agent 104 in such a way as to make the agent 104 execute skills in the environment 106, for example, to explore the environment without requiring any external rewards.
[0056] Alternatively or additionally, system 100 or another system may train a controller neural network (“meta-controller”), for example, using hierarchical reinforcement learning techniques. This controller neural network controls agent 104 by selecting from a meta-action space that includes the set of skills and, optionally, primitive actions. The meta-controller can be trained on a specific task for which external rewards are available, thereby allowing the learned skills to be reused to improve task learning. In other words, in response to a given observation, the meta-controller can be used to select from a set of “meta-actions” that includes the learned skills and, in some cases, some or all of the actions in the action set. When the meta-controller selects a skill, the agent is controlled, for example, for a fixed number of time steps, by the trained policy neural network conditioned on the selected skill. When the meta-controller selects a primitive action, the agent is controlled by performing that single primitive action in response to the current observation.
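The difference between executing a skill and executing a primitive action can be sketched with a toy environment. Everything here is hypothetical: the tuple encoding of meta-actions, the one-dimensional environment, and the hand-written stand-in for a trained skill-conditioned policy.

```python
def execute_meta_action(meta_action, observation, env_step, skill_policy, skill_steps):
    """Run one meta-action and return the resulting observation."""
    kind, value = meta_action
    if kind == "primitive":
        # A primitive action is performed once.
        return env_step(observation, value)
    if kind != "skill":
        raise ValueError("meta_action kind must be 'skill' or 'primitive'")
    # A skill controls the agent for a fixed number of time steps.
    for _ in range(skill_steps):
        action = skill_policy(observation, value)
        observation = env_step(observation, action)
    return observation

def toy_env_step(position, action):
    """Toy environment: the observation is a position on a line."""
    return position + action

def toy_skill_policy(position, skill):
    """Stand-in for a trained policy: skill 1 moves right, any other skill moves left."""
    return 1 if skill == 1 else -1

after_skill = execute_meta_action(("skill", 1), 0, toy_env_step, toy_skill_policy, 3)
after_primitive = execute_meta_action(("primitive", -1), 0, toy_env_step, toy_skill_policy, 3)
```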
[0057] Figure 2 shows example architecture 200. Example architecture 200 is an example configuration for training policy neural network 102 using relative discriminator neural network 120 and absolute discriminator neural network 126.
[0058] In example architecture 200, policy neural network 102 (represented as "skill-conditioned policy") receives observations 110 and data identifying skill 112 as input.
[0059] In some implementations, system 100 selects skills from a discrete set. In other implementations, the system selects skills from a continuous space of skills. Selecting a skill is described below with reference to Figure 3.
[0060] System 100 generates trajectory 202, a sequence (s_0, ..., s_T) of observations 110, by controlling the agent using the policy outputs generated by the policy neural network 102 while it is conditioned on the data identifying skill 112 for a fixed number of time steps T. That is, at each time step t within the fixed number of time steps, the system receives observation s_t 110 and uses the policy neural network 102, conditioned on the data identifying skill 112, to select action a_t 116. The policy neural network 102 is referred to as "skill-conditioned" because the policy neural network 102 receives the data identifying skill 112 (Ω) as input.
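The rollout described above can be sketched as a loop that repeatedly queries a skill-conditioned policy and steps the environment. The toy policy and environment are hypothetical stand-ins used only to make the example runnable.

```python
def generate_trajectory(initial_observation, skill, policy, env_step, num_steps):
    """Roll out `policy` conditioned on `skill` for `num_steps` time steps."""
    observations = [initial_observation]
    actions = []
    for _ in range(num_steps):
        action = policy(observations[-1], skill)
        actions.append(action)
        observations.append(env_step(observations[-1], action))
    return observations, actions

def toy_policy(position, skill):
    """Stand-in policy: skill 1 moves right, any other skill moves left."""
    return 1 if skill == 1 else -1

def toy_env_step(position, action):
    """Toy environment: the observation is a position on a line."""
    return position + action

observations, actions = generate_trajectory(0, 1, toy_policy, toy_env_step, num_steps=5)
```

A rollout of a fixed number of steps yields one more observation than actions, because the initial observation is included.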
[0061] The policy neural network 102 can have any suitable architecture that allows it to process the observation and the data identifying the skill to generate policy outputs.
[0062] As a specific example, the policy neural network 102 can encode the observation and the data identifying the skill separately, combine the encoded observation and the encoded skill, for example, by concatenation or summation, and then generate a policy output by processing the combination through a multilayer perceptron (MLP) or another set of one or more feedforward layers. For example, the policy neural network 102 can encode the data identifying the skill by processing the data using an MLP.
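A forward pass through such an architecture can be sketched in plain Python. The layer sizes, the single-layer encoders, the random untrained weights, and the softmax output over two actions are all assumptions chosen to keep the example small.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one linear layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_forward(observation, skill, params):
    """Encode observation and skill separately, concatenate, then apply an MLP."""
    obs_code = relu(linear(observation, params["obs_encoder"]))
    skill_code = relu(linear(skill, params["skill_encoder"]))
    combined = obs_code + skill_code  # list concatenation
    hidden = relu(linear(combined, params["hidden"]))
    return softmax(linear(hidden, params["head"]))

rng = random.Random(0)
params = {
    "obs_encoder": init_layer(3, 8, rng),
    "skill_encoder": init_layer(4, 8, rng),
    "hidden": init_layer(16, 16, rng),
    "head": init_layer(16, 2, rng),
}
action_probs = policy_forward([0.1, -0.2, 0.3], [0.0, 1.0, 0.0, 0.0], params)
```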
[0063] When observations include high-dimensional sensor data (e.g., images or laser data), a policy neural network can use a convolutional neural network to encode the observations. As another example, when observations consist only of relatively low-dimensional inputs (e.g., sensor readings characterizing the robot's current state), a policy neural network can use a multilayer perceptron to encode the observations. As yet another example, when observations include both high-dimensional sensor data and low-dimensional inputs, a policy neural network can include a convolutional encoder to encode the high-dimensional data, a fully connected encoder to encode the lower-dimensional data, and a subnetwork that operates on combinations (e.g., concatenation) of the encoded data to generate the encoded observations.
[0064] Example architecture 200 may include two skill discriminators, namely a relative discriminator neural network 120 (q_φ) and an absolute discriminator neural network 126. Each skill discriminator may include one or more neural networks. System 100 trains the two skill discriminators 120 and 126 to predict the skill from the initial and final states (s_0, s_T) of a trajectory (in the case of relative discriminator neural network 120) and from only the last state s_T of the trajectory (in the case of absolute discriminator neural network 126). System 100 uses the outputs of the two skill discriminators to generate reward 130.
[0065] In some implementations, the reward is based on the difference between the probabilities that the two skill discriminators 120 and 126 assign to the skill used to generate the trajectory. The system uses the reward to incentivize the learning of a set of skills that are diverse in how each skill alters the agent's relationship with the environment. Because the absolute discriminator predicts the skill only from the absolute state of the environment after the skill has terminated, the system trains the policy adversarially with respect to the absolute discriminator. For example, the system rewards discriminability under the relative discriminator q_φ while penalizing discriminability under the absolute discriminator. For example, the system maximizes reward 130, which is computed from the outputs of the two skill discriminators 120 and 126 and relates the skill to the initial observation given the last observation:
[0066] Reward = [equation not reproduced in the source text], where q is the variational distribution used to infer the probability of skill Ω, defined by the corresponding discriminator neural network (the other notation has been described previously).
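The reward expression itself does not survive in this text. One form consistent with the surrounding description (reward discriminability under the relative discriminator, penalize it under the absolute discriminator) is given below as an assumption. The use of log-probabilities and the symbol q_abs for the absolute discriminator are placeholders, since neither is legible in the source.

```latex
r(\Omega, s_0, s_T) \;=\; \log q_{\phi}(\Omega \mid s_0, s_T) \;-\; \log q_{\mathrm{abs}}(\Omega \mid s_T)
```

Here Ω is the selected skill, s_0 and s_T are the initial and last observations of the trajectory, and each q is the distribution over skills defined by the corresponding discriminator neural network.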
[0067] Each discriminator neural network can have any suitable architecture that allows the discriminator to map corresponding inputs to corresponding outputs.
[0068] In some implementations, the absolute discriminator neural network and the relative discriminator neural network share some parameters, such as shared sub-neural networks, neural network weights, or layers. For example, the absolute discriminator neural network and the relative discriminator neural network can share an encoder neural network that generates an encoded representation of the received observation (e.g., trajectory 202). The encoder neural network can be shared by sequentially using it for the absolute discriminator neural network and the relative discriminator neural network.
[0069] In some implementations, when the observation includes high-dimensional sensor data (e.g., images or laser data), the encoder neural network can use a convolutional neural network to encode the observation. As another example, when the observation only includes relatively low-dimensional inputs (e.g., sensor readings characterizing the robot's current state), the encoder neural network can encode the observation by using a multilayer perceptron (MLP). As yet another example, when the observation includes both high-dimensional sensor data and low-dimensional inputs, the encoder neural network can include a convolutional encoder that encodes the high-dimensional data, a fully connected encoder that encodes the low-dimensional data, and a subnetwork that operates on combinations (e.g., concatenation) of the encoded data to generate the encoded observation.
[0070] In these implementations, the absolute discriminator neural network includes an absolute decoder neural network, such as an MLP, which is configured to process the encoded representation of the last observation to generate an absolute output.
[0071] In these implementations, the relative discriminator neural network includes a relative decoder neural network, such as an MLP, which is configured to process the concatenation of the encoded representations of the initial and last observations to generate a relative output.
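The shared-encoder arrangement described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the architecture of the specification: the class name `SharedEncoderDiscriminators`, the single-layer encoder with a tanh non-linearity, the single-layer decoders, and the random initialization are all hypothetical choices made to keep the example small. It shows only the wiring: one encoder is applied to each observation, the absolute decoder sees the encoded last observation, and the relative decoder sees the concatenation of the encoded initial and last observations.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class SharedEncoderDiscriminators:
    """Hypothetical sketch: two skill discriminators sharing one encoder."""

    def __init__(self, obs_dim, code_dim, num_skills, seed=0):
        rng = random.Random(seed)
        self.enc = init_layer(rng, code_dim, obs_dim)             # shared encoder
        self.abs_dec = init_layer(rng, num_skills, code_dim)      # absolute decoder
        self.rel_dec = init_layer(rng, num_skills, 2 * code_dim)  # relative decoder

    def encode(self, obs):
        return [math.tanh(v) for v in linear(*self.enc, obs)]

    def absolute_scores(self, last_obs):
        # Uses only the encoded last observation.
        return softmax(linear(*self.abs_dec, self.encode(last_obs)))

    def relative_scores(self, initial_obs, last_obs):
        # Concatenates the encoded initial and last observations.
        code = self.encode(initial_obs) + self.encode(last_obs)
        return softmax(linear(*self.rel_dec, code))
```

Each method returns one score per skill, and the scores sum to one, so they can be read as estimated skill probabilities.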
[0072] Figure 3 is a flowchart of an example process 300 for training a policy neural network using relative variational intrinsic control. For convenience, process 300 will be described as being executed by a system of one or more computers located in one or more locations. For example, a properly programmed policy neural network system, such as the policy neural network system 100 of Figure 1, can execute process 300.
[0073] The system repeatedly executes process 300 to train the policy neural network system. This training scheme can be called a relative variational intrinsic control scheme because the generated rewards are intrinsic, i.e., generated without any external information about the quality of any given generated trajectory, and are relative measures based on changes in the environment, i.e., based on both the relative and absolute discriminator neural network outputs. The policy neural network needs to learn skills to change the environment in different, diverse ways.
[0074] The system selects a skill from the skill set (302). In some implementations, the system samples the skill from a uniform probability distribution over the skill set. As described above, a skill is a behavior that results from the agent performing a series of actions in response to successive observations.
[0075] While the policy neural network is conditioned on a selected skill, the system generates a trajectory (304) by controlling the agent using the policy neural network. The trajectory includes a sequence of observations received by the agent while it interacts with the environment using the policy neural network conditioned on the selected skill.
[0076] In some implementations, trajectories are generated starting from the last state of the environment of a previous trajectory. That is, the system begins controlling the agent from the last state of the environment in the most recently generated trajectory, and the initial observations in the trajectory characterize the last state of the environment of the previous trajectory.
[0077] In some other implementations, after generating each trajectory, the system determines whether criteria for resetting the environment (e.g., resetting to a state from which the agent starts acting) have been met. For example, the criteria might include the environment entering a state that has been designated as a termination state for agent 104, or a threshold number of actions having been performed since the environment was last reset, or both. In response to determining that the criteria are met after generating the previous trajectory, the system selects a state of the environment from the set of possible initial states of the environment as the initial state for the next trajectory to be generated. In response to determining that the criteria are not met, the system uses the last state of the previous trajectory as the initial state of the next trajectory.
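The choice of starting state described in the two preceding paragraphs can be sketched as a small helper. This is an illustrative sketch only: the function name `next_initial_state`, its arguments, and the uniform choice among the possible initial states are assumptions, since the specification does not say how an initial state is picked from the set.

```python
import random


def next_initial_state(last_state, is_terminal, steps_since_reset,
                       max_steps, initial_states, rng=random):
    """Return (state, was_reset) for the next trajectory.

    Resets when the environment reached a termination state or when the
    threshold number of actions since the last reset has been reached.
    Otherwise the next trajectory continues from `last_state`.
    """
    if is_terminal or steps_since_reset >= max_steps:
        # Assumption: pick uniformly from the possible initial states.
        return rng.choice(initial_states), True
    return last_state, False
```

For example, `next_initial_state("s5", False, 3, 10, ["a", "b"])` keeps `"s5"` as the starting state, whereas passing `is_terminal=True` triggers a reset.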
[0078] Then, the system trains a policy neural network on the generated trajectory.
[0079] To train the policy neural network, the system uses a relative discriminator neural network to process a relative input (306). The relative input includes the initial observation and the last observation in the sequence (i.e., the sequence of observations in the trajectory). The relative discriminator neural network is configured to process the relative input to generate a relative output, which includes a corresponding relative score for each skill in the skill set, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill when the trajectory was generated. The system trains the relative discriminator neural network by optimizing an objective function that encourages an increase in the relative score corresponding to the selected skill.
[0080] The system uses an absolute discriminator neural network to process an absolute input (308). The absolute input includes the last observation in the sequence (but not the initial observation in the sequence). The absolute discriminator neural network is configured to process the absolute input to generate an absolute output, which includes a corresponding absolute score for each skill in the skill set, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill when the trajectory was generated. The system trains the absolute discriminator neural network by optimizing an objective function that encourages an increase in the absolute score corresponding to the selected skill.
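One objective that encourages an increase in the score of the selected skill is the negative log-likelihood (cross-entropy) of that skill. The specification does not name a particular objective, so the following is a sketch under that assumption, using a hypothetical linear-softmax discriminator and a hand-written gradient step. The function `nll_step`, the feature vector, and the learning rate are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_step(weights, features, skill, lr=0.1):
    """One gradient-descent step on -log p(skill | features).

    `weights` holds one row per skill and is updated in place.
    Returns the loss measured before the update.
    """
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[skill])
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to logit k is p_k - 1[k == skill].
        grad_logit = probs[k] - (1.0 if k == skill else 0.0)
        for j, f in enumerate(features):
            row[j] -= lr * grad_logit * f
    return loss
```

Calling `nll_step` repeatedly with the same features and selected skill lowers the loss, which is the same as raising the score assigned to that skill. The same step could be applied to either discriminator, with the features derived from the relative input or the absolute input respectively.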
[0081] The system generates a reward for the trajectory based on the absolute score and the relative score corresponding to the selected skill (310). In some embodiments, the reward is equal to or directly proportional to the difference between the relative score and the absolute score corresponding to the selected skill. In other embodiments, the reward is equal to or directly proportional to the difference between the logarithm of the relative score and the logarithm of the absolute score corresponding to the selected skill.
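Both reward variants in the preceding paragraph reduce to a one-line computation on the two scores for the selected skill. The sketch below is illustrative: the function name `rvic_reward`, the `use_log` switch, and the small constant `eps` that guards against taking the logarithm of zero are assumptions rather than part of the specification.

```python
import math


def rvic_reward(relative_scores, absolute_scores, skill,
                use_log=True, eps=1e-8):
    """Reward from the two discriminators' scores for the selected skill.

    use_log=False: relative score minus absolute score.
    use_log=True:  log relative score minus log absolute score.
    """
    rel = relative_scores[skill]
    ab = absolute_scores[skill]
    if use_log:
        # eps is an illustrative safeguard, not taken from the source.
        return math.log(rel + eps) - math.log(ab + eps)
    return rel - ab
```

In both variants the reward is positive when the relative discriminator identifies the selected skill more confidently than the absolute discriminator does, and zero when the two scores agree.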
[0082] The system trains the policy neural network based on the reward for the trajectory (312). For training, the system can use the reward as a sparse reward for the trajectory, i.e., associate the reward only with the last observation in the trajectory, or use the reward as a dense reward for the trajectory, i.e., associate the reward (or a time-discounted version of the reward) with each observation in the trajectory.
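The sparse and dense options can be sketched as a function that spreads a single trajectory reward over the steps of the trajectory. The function name `per_step_rewards` is hypothetical, and the discounting convention is an assumption: here the reward at step t is scaled by the discount raised to the number of steps remaining, so the last observation always receives the full reward. The specification does not fix this convention.

```python
def per_step_rewards(reward, num_steps, dense=False, discount=1.0):
    """Assign one trajectory reward to each of `num_steps` observations.

    Sparse: zeros everywhere except the last observation.
    Dense:  every observation gets the reward, scaled by
            discount ** (steps remaining until the last observation).
    """
    if not dense:
        return [0.0] * (num_steps - 1) + [reward]
    return [reward * discount ** (num_steps - 1 - t) for t in range(num_steps)]
```

With `discount=1.0` the dense variant simply copies the reward to every observation.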
[0083] In some implementations, training the policy neural network based on the reward for the trajectory includes training the policy neural network to maximize the time-discounted expected reward for the generated trajectory.
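For reference, the time-discounted reward of a single trajectory is the sum of its per-step rewards, each weighted by a power of the discount factor. The helper below is a minimal sketch with a hypothetical name, `discounted_return`. The quantity the policy is trained to maximize is the expectation of this value over trajectories.

```python
def discounted_return(rewards, discount):
    """Sum of rewards[t] * discount**t, accumulated from the last step back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + discount * total
    return total
```

For a sparse reward of 2.0 at the end of a three-step trajectory and a discount of 0.5, the function returns 0.5, because the reward is discounted twice.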
[0084] Once a set of rewards has been calculated for a batch of trajectories, the system can train the policy neural network by iteratively updating its parameter values using a reinforcement learning technique (e.g., off-policy reinforcement learning) on the batch of trajectories and their corresponding rewards. The system can use any suitable off-policy reinforcement learning technique to perform the training, such as Q-learning reinforcement learning, actor-critic reinforcement learning, or policy gradient-based reinforcement learning.
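As one concrete instance of the off-policy techniques listed above, a tabular Q-learning update conditioned on the skill might look as follows. This is a deliberately simplified sketch: the specification describes a policy neural network rather than a table, and the function name `q_learning_update`, the transition layout, and the hyperparameters are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, transition, actions, lr=0.1, discount=0.99):
    """Apply one Q-learning update to a skill-conditioned value table.

    `q` maps (observation, skill, action) to a value.
    `transition` is (obs, skill, action, reward, next_obs, done).
    Returns the updated value.
    """
    obs, skill, action, reward, next_obs, done = transition
    # Bootstrap from the best next action unless the trajectory ended.
    best_next = 0.0 if done else max(q[(next_obs, skill, a)] for a in actions)
    target = reward + discount * best_next
    key = (obs, skill, action)
    q[key] += lr * (target - q[key])
    return q[key]
```

The table can be created with `defaultdict(float)` so that unseen entries start at zero. Because the update uses the maximum over next actions rather than the action actually taken, it can learn from trajectories generated by an earlier version of the policy.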
[0085] This specification uses the term "configured" in conjunction with system and computer program components. For a system of one or more computers to be configured to perform a particular operation or action, this means that software, firmware, hardware, or a combination thereof have been installed on the system, which, in operation, causes the system to perform the operation or action. For one or more computer programs to be configured to perform a particular operation or action, this means that one or more programs include instructions that, when executed by a data processing device, cause that device to perform the operation or action.
[0086] Embodiments of the subject matter and functional operation described in this specification can be implemented in digital electronic circuits, in tangibly implemented computer software or firmware, in computer hardware (including the structures disclosed in this specification and their equivalents), or in one or more combinations thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium, for execution by a data processing apparatus or for controlling the operation of a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or one or more combinations thereof. Alternatively or additionally, the program instructions can be encoded on artificially generated propagation signals, such as machine-generated electrical, optical, or electromagnetic signals, generated to encode information for transmission to a suitable receiver device for execution by the data processing apparatus.
[0087] The term "data processing device" refers to data processing hardware and encompasses all kinds of devices, apparatuses, and machines used for processing data, including programmable processors, computers, or multiple processors or computers. The device may also be or further include special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the device may optionally include code that creates an execution environment for computer programs, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.
[0088] A computer program (which may also be referred to as a program, software, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages or declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but does not need to, correspond to a file in a file system. A program may be stored as a part of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple co-located files (e.g., a file storing one or more modules, subroutines, or code portions). A computer program can be deployed to execute on a single computer or on multiple computers located at a single site or distributed across multiple sites and interconnected via a data communication network.
[0089] In this specification, the term "database" is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or not at all, and it can be stored on storage devices in one or more locations. Therefore, for example, an indexed database may include multiple collections of data, each of which can be organized and accessed differently.
[0090] Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process programmed to perform one or more specific functions. Typically, an engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or multiple computers.
[0091] The processes and logic flows described in this specification can be executed by one or more programmable computers, which execute one or more computer programs to perform functions by manipulating input data and generating output. The processes and logic flows can also be executed by special-purpose logic circuitry (such as an FPGA or ASIC) or by a combination of special-purpose logic circuitry and one or more programmable computers.
[0092] A computer suitable for executing computer programs can be based on a general-purpose or special-purpose microprocessor, or both, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from read-only memory or random access memory, or both. The basic components of a computer are the central processing unit for executing or running instructions and one or more memory devices for storing instructions and data. The central processing unit and memory may be supplemented by or incorporated into special-purpose logic circuitry. Typically, a computer will also include, or be operatively coupled to, one or more mass storage devices for receiving or transferring data from or to one or more mass storage devices used for storing data, such as disks, magneto-optical disks, or optical disks. However, a computer does not need to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, personal digital assistant (PDA), mobile audio or video player, game console, GPS receiver, or portable storage device (e.g., a Universal Serial Bus (USB) flash drive), to name just a few.
[0093] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example: semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD ROMs and DVD-ROMs.
[0094] To provide interaction with the user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball), through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual, auditory, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. Additionally, the computer can interact with the user by sending and receiving documents to and from a device used by the user; for example, by sending a web page to a web browser on the user's device in response to a request received from a web browser. Furthermore, the computer can interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smartphone running a messaging application) and, in turn, receiving response messages from the user.
[0095] The data processing apparatus used to implement machine learning models may also include, for example, dedicated hardware accelerator units for processing the common and computationally intensive parts of machine learning training or production, i.e., inference, workloads.
[0096] Machine learning models can be implemented and deployed using machine learning frameworks such as TensorFlow, Microsoft Cognitive Toolkit, Apache Singa, or Apache MXNet.
[0097] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes, for example, a backend component as a data server, or a middleware component as an application server, or a frontend component as a client computer having, for example, a graphical user interface, web browser, or app through which a user can interact with embodiments of the subject matter described in this specification, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium—for example, a communication network. Examples of communication networks include local area networks (LANs) and wide area networks (WANs) (e.g., the Internet).
[0098] A computing system may include clients and servers. Clients and servers are typically geographically separated and usually interact via a communication network. The client-server relationship is established by means of computer programs running on respective computers and having a client-server relationship with each other. In some embodiments, the server sends data (e.g., HTML pages) to a user device, for example, for the purpose of displaying data to a user interacting with a device acting as a client and receiving user input from that user. Data generated at the user device, such as the result of user interaction, may be received at the server from the device.
[0099] While this specification contains numerous specific implementation details, these should not be construed as limiting the scope of any invention or the scope of what may be claimed, but rather as descriptions of features specific to particular embodiments of a particular invention. Certain features described in this specification within the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed in this way, in some cases one or more features from the claimed combination may be removed from the combination, and the claimed combination may involve sub-combinations or variations of sub-combinations.
[0100] Similarly, although operations are depicted in the accompanying drawings and recited in the claims in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or sequentially, or to perform all shown operations to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated into a single software product or packaged into multiple software products.
[0101] Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the drawings do not necessarily require the specific order or sequence shown to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous.
Claims
1. A method for training a policy neural network for use in controlling an agent interacting with an environment, wherein the policy neural network is configured to receive policy inputs, the policy inputs including input observations characterizing the state of the environment and data identifying skills from a skill set, and the policy neural network is configured to generate policy outputs defining control policies for controlling the agent, the method including repeatedly performing operations, the operations including:
selecting a skill from the skill set;
while the policy neural network is conditioned on the selected skill, generating a trajectory by controlling the agent using the policy neural network, the trajectory including a sequence of observations received while the agent interacts with the environment while being controlled using the policy neural network conditioned on the selected skill;
processing a relative input using a relative discriminator neural network, the relative input including (i) an initial observation in the sequence and (ii) a last observation in the sequence, the relative discriminator neural network being configured to process the relative input to generate a relative output, the relative output including a corresponding relative score for each skill in the skill set, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill when the trajectory was generated;
processing an absolute input using an absolute discriminator neural network, the absolute input including the last observation in the sequence, the absolute discriminator neural network being configured to process the absolute input to generate an absolute output, the absolute output including a corresponding absolute score for each skill in the skill set, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill when the trajectory was generated;
generating a reward for the trajectory based on the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill; and
training the policy neural network based on the reward for the trajectory.
2. The method according to claim 1, wherein the operations further comprise: training the absolute discriminator neural network to optimize an objective function that encourages an increase in the absolute score corresponding to the selected skill.
3. The method according to claim 1, wherein the operations further comprise: training the relative discriminator neural network to optimize an objective function that encourages an increase in the relative score corresponding to the selected skill.
4. The method according to claim 1, wherein the absolute discriminator neural network and the relative discriminator neural network share some parameters.
5. The method according to claim 4, wherein the absolute discriminator neural network and the relative discriminator neural network share an encoder neural network, which generates an encoded representation of the received observations.
6. The method according to claim 5, wherein the absolute discriminator neural network includes an absolute decoder neural network configured to process the encoded representation of the last observation to generate the absolute output.
7. The method according to claim 5, wherein the relative discriminator neural network includes a relative decoder neural network configured to process the concatenation of the encoded representations of the initial observation and the last observation to generate the relative output.
8. The method according to claim 1, wherein training the policy neural network based on the reward for the trajectory includes training the policy neural network to maximize the time-discounted expected reward for the generated trajectory, and wherein the reward rewards high relative scores and penalizes high absolute scores.
9. The method according to claim 8, wherein the reward is equal to or directly proportional to the difference between the relative score corresponding to the selected skill and the absolute score corresponding to the selected skill.
10. The method according to claim 8, wherein the reward is equal to or directly proportional to the difference between the logarithm of the relative score corresponding to the selected skill and the logarithm of the absolute score corresponding to the selected skill.
11. The method according to claim 1, wherein selecting a skill from the skill set includes: sampling the skill from a uniform probability distribution over the skill set.
12. The method according to claim 1, wherein training the policy neural network based on the reward for the trajectory includes training the policy neural network through off-policy reinforcement learning.
13. The method according to any one of claims 1 to 12, wherein generating the trajectory includes generating the trajectory starting from the last state of the environment for a previous trajectory, and wherein the initial observation characterizes the last state of the environment for the previous trajectory.
14. The method according to claim 13, wherein the operations further comprise: after generating the trajectory, determining whether criteria for resetting the environment have been met; and in response to determining that the criteria are met, selecting a state of the environment from the set of possible initial states of the environment as the initial state for the next trajectory to be generated.
15. A system comprising one or more computers and one or more storage devices storing instructions, the instructions being operable, when executed by the one or more computers, to cause the one or more computers to perform the method according to any one of claims 1 to 14.
16. A computer storage medium encoded with instructions that, when executed by one or more computers, cause one or more computers to perform the method according to any one of claims 1 to 14.