Offline learning for robot control using reward prediction models
By training a reward prediction model on expert and unlabeled experience data and then applying offline reinforcement learning, the approach addresses the limited performance of existing robot control methods when reward signals are sparse and visual observations are complex, and produces an effective robot control policy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2021-07-27
- Publication Date
- 2026-06-26
AI Technical Summary
Existing behavioral cloning agents struggle to learn effectively in environments with sparse or unavailable reward signals and complex visual observations, and they cannot make use of large amounts of unlabeled trajectory data, which limits their performance.
By training a reward prediction model to generate task-specific reward predictions, and combining this with offline reinforcement learning techniques, a policy neural network is trained using expert and unlabeled experience data to generate policy outputs that control the robot to perform specific tasks.
High-performance robot control is achieved with sparse reward signals and in complex environments. The approach can use a large amount of unlabeled trajectory data to improve the success rate and stability of task completion.
Figure CN115812180B_ABST
Description
[0001] Cross-references to related applications
[0002] This application claims the benefit of U.S. Provisional Patent Application No. 63/057,850, filed on July 28, 2020, which is incorporated herein by reference in its entirety.
Background Technology
[0003] This specification relates to the use of neural networks to control robots.
[0004] A neural network is a machine learning model that uses one or more layers of non-linear units to predict an output for a received input. In addition to the output layer, some neural networks also include one or more hidden layers. The output of each hidden layer is used as the input to the next layer in the network (i.e., the next hidden layer or the output layer). Each layer of the network generates an output from the received input based on the current values of its corresponding set of weights.
Summary of the Invention
[0005] This specification describes a system, implemented as computer programs on one or more computers in one or more locations, that trains a policy neural network used to control a robot, i.e., to select actions to be performed by the robot while it is interacting with its environment so that the robot can perform a specific task.
[0006] Specific embodiments of the subject matter described in this specification can be implemented to achieve one or more of the following advantages.
[0007] Robotic manipulation tasks can involve sparse or unavailable reward signals and complex visual observations. Existing behavioral cloning (BC) agents can sometimes solve these tasks through supervised learning from pixels and expert demonstrations, without rewards. However, because supervised policies only regress on expert trajectories, they cannot make use of the potentially large amounts of data from other agents and other tasks. This specification describes a technique for training a policy from demonstrations and a large number of unlabeled trajectories as follows: (1) learning a reward function by contrasting expert observations with unlabeled observations; (2) annotating some or all of the data using the learned reward function; and (3) training an offline reinforcement learning agent on the annotated data. Across several continuous control tasks, the described technique consistently outperforms BC given an equal number of demonstrations and no task rewards. Furthermore, the performance of the described technique scales with the number of unlabeled trajectories across several orders of magnitude. Additionally, for several tasks, the described technique outperforms BC with only 10% of the demonstrations. Moreover, the described technique is robust to low-quality unlabeled trajectories.
In one example described herein, a method includes obtaining robot experience data characterizing robot interactions with an environment. The robot experience data may include multiple experiences, each including (i) an observation characterizing a state of the environment and (ii) an action performed by the corresponding robot in response to the observation. The experiences include expert experience from episodes of a specific task performed by an expert agent, and unlabeled experience. At least some of the unlabeled experience may be irrelevant to the specific task or may not be identifiable as relevant to the specific task.
The method includes training a reward prediction model on a first subset of the robot experience data. The reward prediction model receives a reward input that includes an input observation and generates a reward prediction as output, the reward prediction being a prediction of the task-specific reward for the specific task that should be assigned to the input observation. Training the reward prediction model includes optimizing an objective function comprising (a) a first term that encourages the reward prediction model to assign a first reward value to observations from expert experience, the first reward value indicating that the specific task is successfully completed after the environment is in the state characterized by the observation, and (b) a second term that encourages the reward prediction model to assign a second reward value to observations from unlabeled experience, the second reward value indicating that the specific task is not successfully completed after the environment is in the state characterized by the observation. The trained reward prediction model is then used to process experiences in the robot experience data to generate a corresponding reward prediction for each processed experience, and a policy neural network is trained on (i) the processed experiences and (ii) their corresponding reward predictions. The policy neural network is configured to receive a network input that includes an observation and to generate a policy output that defines a control policy for a robot performing the specific task.
[0008] When a robot performs a specific task, a trained policy neural network can be used to control it. For example, observations can be obtained from one or more sensors that sense the real-world environment, and these observations can be fed as input to the trained policy neural network. The input can be used by the policy neural network to generate the output, and the output of the trained policy neural network is used to select actions to control the robot to perform a specific task.
[0009] Data specifying the trained policy neural network can be provided for use in controlling a robot while the robot performs the specific task.
[0010] The first subset may include the expert experience and a proper subset of the unlabeled experience.
[0011] The objective function may include a third term that encourages the reward prediction model to assign the second reward value to observations from expert experience. The first and second terms may have a different sign than the third term in the objective function.
[0012] The objective function may include a fourth term (in addition to or in lieu of the third term) that penalizes the reward prediction model for correctly distinguishing expert experience from unlabeled experience based on the first predetermined number of observations of an episode of the particular task performed by the expert agent.
[0013] Training the policy neural network may include using an offline reinforcement learning technique to train the policy neural network on (i) the experiences and (ii) the corresponding reward predictions for the experiences. The offline reinforcement learning technique may be an offline actor-critic technique. For example, the offline reinforcement learning technique may be critic-regularized regression (CRR).
[0014] Training the reward prediction model can include applying data augmentation to the experience from the robot experience data that is used to train the reward prediction model.
[0015] At least some of the experiences in the first subset of the robot's experience data are related to the real-world environment.
[0016] In another example described herein, a system includes one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, are operable to cause the one or more computers to perform any of the methods described herein.
[0017] A computer storage medium may be provided that can be encoded with instructions that, when executed by one or more computers, cause one or more computers to perform any of the methods described herein.
[0018] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of this subject matter will become apparent from the specification, the accompanying drawings, and the claims.
Attached Figure Description
[0019] Figure 1 shows an example neural network training system.
[0020] Figure 2 is a flowchart of an example process for training a policy neural network.
[0021] Figure 3 compares the performance of the described training process with that of a conventional training process.
[0022] The same reference numerals and names in different figures indicate the same elements.
Detailed Implementation
[0023] Figure 1 shows an example neural network training system 100. System 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0024] System 100 trains a policy neural network 110 to control robot 112, that is, to select actions to be performed by robot 112 while robot 112 is interacting with environment 114 so that robot 112 can perform specific tasks.
[0025] For example, a specific task may include navigating robot 112 to different locations in the environment, locating different objects, picking up different objects, or moving different objects to one or more designated locations.
[0026] It should be understood that references to controlling a robot in this specification include controlling any type of physical (i.e., real-world) intelligent agent. An intelligent agent can be a mechanical intelligent agent, such as an autonomous vehicle, a control system for an industrial facility, such as a data center or power grid, a single actuator, or multiple distributed actuators. A physical intelligent agent can be electrical. For example, the techniques described in this specification can be used to control the generation of voltage or current within one or more components of a system, such as controlling lights, such as LEDs or X-rays, controlling the generation of electromagnetic fields, or any control over other electrical components. For example, sensors can monitor the condition of objects inside an X-ray machine, such as human or animal patients, and the policy neural network 110 can control the generation of X-rays within the X-ray machine.
[0027] The techniques described in this specification can also be used to train policy neural networks to control software agents, such as software agents that control simulated robots or vehicles in virtual environments or software agents that control user interfaces.
[0028] Software agents can be controlled based on real-world inputs, such as sensor data from real-world sensors, or on virtual inputs, such as outputs from virtual sensors that receive input from a virtual environment. Similarly, real-world agents can be controlled based on either real-world or virtual inputs. Virtual environments can be constructed based on real-world inputs. For example, sensor data relating to interactions in a real-world environment can be obtained and used to create a virtual environment.
[0029] Each input to the policy neural network 110 can include an observation characterizing the state of the environment with which the agent is interacting, and the output of the policy neural network (the “policy output”) can define an action to be performed by the agent in response to the observation, for example by defining a probability distribution over the possible actions to be performed by the agent.
[0030] Observations can include, for example, one or more of images, object position data, and sensor data captured as the agent interacts with its environment, such as sensor data from an image, distance, or position sensor or from an actuator. For example, in the case of a robot, observations can include data characterizing the robot's current state, such as one or more of joint position, joint velocity, joint force, torque, or acceleration (for example, gravity-compensated torque feedback), and the global or relative pose of an object held by the robot. More generally, observations can similarly include one or more of position, linear or angular velocity, force, torque, or acceleration, and the global or relative pose of one or more parts of the agent. Observations can be defined in one, two, or three dimensions and can be absolute and/or relative. Observations can also include, for example, sensed electronic signals, such as motor current or temperature signals, and/or image or video data, such as data from sensors of the agent or from sensors located separately from the agent in its environment.
[0031] Actions can be control inputs for controlling robots, such as torques on robot joints or higher-level control commands, or control inputs for controlling autonomous or semi-autonomous land, air, or sea vehicles, such as torques on the vehicle's control surfaces or other control elements or higher-level control commands.
[0032] In other words, actions can include, for example, position, velocity, or force/torque/acceleration data for one or more joints of a robot or for parts of another mechanical agent. Action data can additionally or alternatively include electronic control data, such as motor control data, or more generally data for controlling one or more electronic devices within the environment whose control has an effect on the observed state of the environment.
[0033] In one example, the observations include one or more images of the environment captured by one or more cameras, such as the robot's camera sensors, one or more cameras located at different locations in the environment outside the robot, or both, as well as low-dimensional proprioceptive features of the robot.
[0034] As a specific example, each input to the policy neural network 110 can include an action and an observation, and the output of the policy neural network 110 can be a Q-value, which represents the predicted return that will be received by the robot as a result of performing the action in response to the observation.
[0035] A return refers to a cumulative measure of the rewards received by the robot 112, such as a time-discounted sum of rewards. Typically, rewards are scalar values and characterize, for example, the agent's progress in completing the task.
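The time-discounted sum mentioned above can be illustrated with a short sketch. The function name `discounted_return` and the discount value are assumptions made for illustration and do not come from the specification.

```python
def discounted_return(rewards, discount=0.99):
    """Return sum over t of discount**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + discount * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# Hypothetical episode in which only the final step is rewarded.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, discount=0.5))  # prints 0.125
```

With a discount below one, the same final reward contributes less to the return the later it arrives.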
[0036] As a specific example, the reward can be a sparse binary reward, which is zero unless the task is successfully completed, and one if the task is successfully completed as a result of the action performed.
[0037] As another specific example, the reward can be a dense reward, which measures the robot's progress toward completing the task based on individual observations received during a round of attempting to perform the task. That is, individual observations can be associated with non-zero reward values that indicate the robot's progress toward completing the task while the environment is in the state characterized by the observations.
[0038] System 100 can then control robot 112 based on the Q-values of actions in the action set, for example by selecting the action with the highest Q-value as the action to be performed by robot 112.
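As a rough illustration of selecting the action with the highest Q-value, the sketch below assumes a finite candidate action set and a callable standing in for the trained network. Both `select_greedy_action` and `toy_q_function` are hypothetical names.

```python
def select_greedy_action(q_function, observation, actions):
    """Return the candidate action with the highest Q-value for the observation."""
    return max(actions, key=lambda action: q_function(observation, action))


def toy_q_function(observation, action):
    # Hypothetical stand-in for the trained network: it scores an action
    # higher the closer it is to the (scalar) observation.
    return -abs(observation - action)


print(select_greedy_action(toy_q_function, observation=0.3, actions=[-1.0, 0.0, 1.0]))  # prints 0.0
```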
[0039] As another specific example, each input to the policy neural network 110 can be an observation, and the output of the policy neural network 110 can be a probability distribution over a set of actions, where the probability of each action represents the likelihood that performing the action in response to the observation will maximize the predicted return. The system 100 can then control the robot 112 based on the probabilities, for example by selecting the action with the highest probability as the action to be performed by the robot 112, or by sampling an action according to the probability distribution.
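The two selection rules described above (highest probability, or sampling) might look like the following sketch. It assumes the policy output is a plain list of per-action probabilities. `select_action` is a hypothetical helper.

```python
import random


def select_action(probabilities, sample=True, rng=random):
    """Pick an action index from a list of per-action probabilities."""
    indices = range(len(probabilities))
    if sample:
        # Draw one index in proportion to its probability.
        return rng.choices(indices, weights=probabilities, k=1)[0]
    # Otherwise pick the most probable action.
    return max(indices, key=lambda i: probabilities[i])


policy_output = [0.1, 0.7, 0.2]
print(select_action(policy_output, sample=False))  # prints 1
```

Passing a seeded `random.Random` instance as `rng` makes the sampled choice reproducible.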
[0040] In some cases, to allow for fine-grained control of the agent, system 100 can treat the space of actions to be performed by the robot, i.e., the set of possible control inputs, as a continuous space. This setting is called a continuous control setting. In these cases, the output of policy neural network 110 can be parameters of a multivariate probability distribution in the space, such as the mean and covariance of a multivariate normal distribution.
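For the continuous control setting, a minimal sketch of drawing an action from the distribution parameters is shown below. As a simplifying assumption it uses a diagonal covariance, i.e., one standard deviation per action dimension. The name `sample_continuous_action` is illustrative only.

```python
import random


def sample_continuous_action(mean, std, rng=random):
    """Draw one action from a normal distribution with diagonal covariance.

    `mean` and `std` hold one entry per action dimension.
    """
    return [rng.gauss(mu, sigma) for mu, sigma in zip(mean, std)]


# Hypothetical policy output for a two-dimensional action, e.g. two joint torques.
action = sample_continuous_action(mean=[0.5, -0.2], std=[0.1, 0.05], rng=random.Random(0))
print(action)
```

A full covariance matrix would require correlated sampling, which this sketch does not attempt.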
[0041] The policy neural network 110 can have any suitable architecture that allows the policy neural network 110 to process observations to generate policy outputs.
[0042] As a specific example, when the observations include high-dimensional sensor data (e.g., image or laser data), the policy neural network 110 can be a convolutional neural network. As another example, when the observations only include relatively low-dimensional inputs (e.g., sensor readings characterizing the robot's current state), the policy neural network 110 can be a multilayer perceptron. As yet another example, when the observations include both high-dimensional sensor data and low-dimensional inputs, the policy neural network 110 can include a convolutional encoder for encoding the high-dimensional data, a fully connected encoder for encoding the low-dimensional data, and a policy subnetwork that operates on a combination (e.g., a concatenation) of the encoded data to generate a policy output.
[0043] In order to allow system 100 to train neural network 110, system 100 obtains robot experience data 120. Typically, robot experience data 120 is data representing robot interactions with environment 114.
[0044] In some cases, robot experience data 120 characterizes real-world interactions between real-world robots and real-world environments.
[0045] In some other cases, robot experience data 120 characterizes the interaction between a computer-simulated version of robot 112 and a computer-simulated environment 114. After training using the simulated experience data, the policy neural network 110 can then be used to control the real-world robot 112 in the real-world environment 114. Training in the simulated environment allows the neural network 110 to learn from a large amount of simulated training data while avoiding the risks associated with training in the real-world environment, such as damage to the robot due to performing poorly chosen actions or general wear and tear on the robot due to repeated interactions with the real-world environment.
[0046] Robot experience data 120 includes experience 122, which in turn includes observations and actions performed by the robot in response to the observations.
[0047] The robot experience data 120 can include a large amount of experience 122 collected when one or more robots perform various tasks or interact randomly with the environment. However, the robot experience data 120 is not typically associated with a task-specific reward, which is required to train the policy neural network 110 through reinforcement learning. That is, although a task-specific reward is needed to train the policy neural network 110 to control the robot 112 to perform a specific task, such a reward is not available in the robot experience data 120.
[0048] More specifically, robot experience data 120 will typically include a small amount of expert experience data 124 collected while an expert agent, such as a robot controlled by a user or by a learned policy, successfully performs the specific task.
[0049] In addition, robot experience data 120 will typically include a large amount of unlabeled experience 126. Unlabeled experience is experience for which system 100 has no access to a reward for the specific task and no indication of whether the experience was collected while the specific task was being successfully performed.
[0050] For example, unlabeled experience can include experience collected as one or more robots perform different tasks or interact randomly with their environment. As a specific example, robot experience data 120 can include data collected from interactions between multiple robots performing multiple different tasks. For example, system 100 may have previously trained one or more other policy neural networks to control the robots to perform other tasks, and robot experience data 120 can include any data collected as a result of previous training.
[0051] As another example, unlabeled experience can include experience collected while one or more robots attempt to perform the specific task, but without any indication of whether the specific task was successfully performed during the trajectory to which a given unlabeled experience belongs.
[0052] Therefore, although a large amount of data 120 is available for system 100, system 100 cannot directly use data 120 to train policy neural network 110 because the experience in the experience data 120 is not associated with the reward of a specific task.
[0053] To allow system 100 to use data 120 to train policy neural network 110, the system uses experience data 120 to train reward prediction model 140, which receives an input observation as input and generates a reward prediction as output. The reward prediction is a prediction of the task-specific reward for the specific task that should be assigned to the input observation.
[0054] Therefore, after training, the trained reward prediction model 140 is able to predict the task-specific reward of the observations, even if these observations were not generated when the robot performed a specific task.
[0055] The reward prediction model 140 can have any suitable architecture that allows the model 140 to process observations to generate reward predictions. Specifically, the reward prediction model 140 can have an architecture similar to that of the policy neural network 110, but with a different output layer that allows the reward prediction model 140 to generate a single value as output, rather than a potentially multi-valued policy output.
[0056] System 100 uses reward prediction model 140 to generate task-specific training data 150 for a specific task, which associates each of the multiple experiences 122 in experience data 120 with a task-specific reward for the specific task.
[0057] Specifically, for each experience in the second subset of experience 122 in robot experience data 120, system 100 uses reward prediction model 140 to process the observations in the experience to generate a reward prediction and associate the reward prediction with the experience.
[0058] By generating training data 150 in this way, system 100 is able to generate a large amount of training data for a specific task based on only a small amount of expert experience.
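The annotation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: experiences are assumed to be (observation, action) pairs, and `annotate_with_rewards` and `toy_reward_model` are hypothetical names.

```python
def annotate_with_rewards(experiences, reward_model):
    """Attach a predicted task-specific reward to each (observation, action) pair."""
    return [
        {"observation": observation, "action": action, "reward": reward_model(observation)}
        for observation, action in experiences
    ]


def toy_reward_model(observation):
    # Hypothetical stand-in for the trained reward prediction model:
    # scalar observations above 0.5 are treated as "task succeeded".
    return 1.0 if observation > 0.5 else 0.0


experiences = [(0.9, "close_gripper"), (0.1, "move_left")]
training_data = annotate_with_rewards(experiences, toy_reward_model)
print([item["reward"] for item in training_data])  # prints [1.0, 0.0]
```

Only the observation is passed to the reward model here, matching the description of a reward prediction that is generated from the observation in each experience.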
[0059] System 100 then trains policy neural network 110, for example, using offline reinforcement learning based on task-specific training data 150 for a specific task. With training completed entirely offline, the system is able to train policy neural network 110 based on a large amount of data without additional robot-environment interaction—that is, without environmental interaction other than the interactions already reflected in the robot's experience data 120. Avoiding such additional robot-environment interaction allows the robot to be trained to interact with the real-world environment without causing any additional wear and tear or breakage to the robot and without performing any additional actions that could be unsafe and harmful to the robot, the environment, or both.
[0060] In some implementations, system 100 first trains reward prediction model 140, and then uses reward prediction model 140 to train policy neural network 110 after reward prediction model 140 has been trained.
[0061] In some other implementations, system 100 repeatedly updates both the reward prediction model 140 and the policy neural network 110 at each of multiple training iterations. That is, system 100 performs multiple training iterations during the training of policy neural network 110 and, at each iteration, updates reward prediction model 140 and then uses the updated reward prediction model 140 to generate the task-specific training data 150 used to update policy neural network 110 at that iteration.
[0062] The operation of system 100 in training policy neural network 110 starting from robot experience data 120 is described in more detail below with reference to Figure 2.
[0063] After system 100 has trained the policy neural network 110, system 100 can use the trained policy neural network 110 to control robot 112 while robot 112 performs the specific task.
[0064] Alternatively or additionally, the system can provide data specifying the trained policy neural network 110, i.e., the trained values of the neural network's parameters, for use in controlling the robot when it performs the specific task. For example, the system can provide this data to another system over a data communication network or over a wired connection, allowing the other system to control the robot when it performs the specific task.
[0065] In either of these cases, system 100 can then acquire experience generated as a result of using the trained policy neural network 110 to control the robot to perform a specific task, and add this experience to robot experience data 120. In this way, this experience becomes available for training another neural network to control the robot to perform another task. Therefore, the amount of experience in robot experience data 120 can continue to increase and continue to be reused by system 100 to learn new tasks, even if none of the experience data 120 is associated with the task reward of the new task or was generated when the robot performed the new task.
[0066] Figure 2 is a flowchart of an example process 200 for training a policy neural network using a reward prediction model. For convenience, process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a suitably programmed neural network training system, such as the neural network training system 100 of Figure 1, can perform process 200.
[0067] The system obtains robot experience data representing the robot's interaction with the environment (step 202).
[0068] Specifically, the data includes experience, which in turn includes observations and actions performed by the corresponding robot in response to the observations.
[0069] More specifically, the robot experience data includes expert experience from rounds of the specific task performed by an expert agent, and unlabeled experience, which is not associated with any indication of whether it was collected while the specific task was being successfully performed.
[0070] The system trains a reward prediction model on a subset of the robot experience data (step 204). The reward prediction model receives a reward input that includes an input observation and generates a reward prediction as output. The reward prediction is a prediction of the task-specific reward for the specific task that should be assigned to the input observation, that is, a numerical value representing the predicted task-specific reward.
[0071] Typically, the subset of robot experience data used to train the reward prediction model includes all of the expert experience obtained in step 202 and a proper subset of the unlabeled experience obtained in step 202.
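One way to assemble such a training subset is sketched below. The sampling rule (a uniformly random fraction of the unlabeled experience) and the names `build_reward_training_subset` and `unlabeled_fraction` are assumptions made for illustration. The specification only requires that the unlabeled part be a proper subset.

```python
import random


def build_reward_training_subset(expert, unlabeled, unlabeled_fraction=0.5, seed=0):
    """Keep every expert experience and sample a proper subset of the unlabeled ones."""
    if not 0.0 < unlabeled_fraction < 1.0:
        raise ValueError("unlabeled_fraction must be strictly between 0 and 1")
    unlabeled = list(unlabeled)
    # Never take all unlabeled experiences, so the result stays a proper subset.
    count = min(int(len(unlabeled) * unlabeled_fraction), max(len(unlabeled) - 1, 0))
    rng = random.Random(seed)
    return list(expert), rng.sample(unlabeled, count)


expert = [("expert", i) for i in range(3)]
unlabeled = [("unlabeled", i) for i in range(10)]
expert_part, unlabeled_part = build_reward_training_subset(expert, unlabeled)
print(len(expert_part), len(unlabeled_part))  # prints 3 5
```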
[0072] More specifically, the system trains a reward prediction model to optimize the objective function.
[0073] The objective function includes a first term that encourages the reward prediction model to assign a first reward value, such as a value of 1 or another positive value, to observations derived from expert experience. This first reward value indicates that a specific task was successfully completed after the environment was in a state characterized by the observations.
[0074] The objective function also includes a second term that encourages the reward prediction model to assign a second reward value, such as a negative one or a zero value, to observations from unlabeled experience. This second reward value indicates that a specific task was not successfully completed after the environment was in a state characterized by the observations.
[0075] For example, the objective function can be a loss function L that satisfies:
[0076] $$L(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}_E}\left[-\log R_\psi(s_t)\right] + \mathbb{E}_{s'_t \sim \mathcal{D}_U}\left[-\log\left(1 - R_\psi(s'_t)\right)\right]$$
[0077] where $\mathbb{E}$ is the expectation operator, $s_t$ is an observation sampled from the set $\mathcal{D}_E$ of expert experience used to train the reward prediction model $R_\psi$ with parameters $\psi$, $R_\psi(s_t)$ is the reward prediction generated by the reward prediction model by processing an input that includes the observation $s_t$, $s'_t$ is an observation sampled from the set $\mathcal{D}_U$ of unlabeled experience used to train the reward prediction model, and $R_\psi(s'_t)$ is the reward prediction generated by the reward prediction model by processing an input that includes the observation $s'_t$.
[0078] The above loss function is minimized when the reward prediction model assigns 1 to all observations from expert experience and 0 to all observations from unlabeled experience, that is, when it assigns to every unlabeled observation a reward value indicating that the task will not be successfully performed after the observation is received. However, because the system does not have access to labels for the unlabeled experience, the unlabeled experience data can include, in addition to unsuccessful experiences, successful experiences that occurred during trajectories in which the task was successfully performed. In other words, the unlabeled experience data can include false negatives, i.e., unlabeled experiences that should be assigned a reward value of 1 even though the loss function encourages the reward prediction model to assign them a reward value of zero. In some cases, the presence of such false negatives can reduce the usefulness of the trained reward prediction model when training the policy neural network.
[0079] In view of this, in some implementations, the objective function includes, in addition to the first and second terms, a third term that encourages the reward prediction model to assign the second reward value to observations from expert experience. In these implementations, the third term can have the opposite sign to the first and second terms, and the first term can be scaled relative to the second term, i.e., it has a lower weight in the loss function than the second term. By modifying the objective function in this way, that is, by scaling the first term and adding a third term that is also scaled relative to the second term but has the opposite sign, the system can effectively account for the presence of false negatives in the unlabeled experience data. As a specific example, in these implementations, the objective function can be a loss function L that satisfies:
[0080] $$L(\psi) = \eta\,\mathbb{E}_{s_t \sim \mathcal{D}_E}\left[-\log R_\psi(s_t)\right] + \mathbb{E}_{s'_t \sim \mathcal{D}_U}\left[-\log\left(1 - R_\psi(s'_t)\right)\right] - \eta\,\mathbb{E}_{s_t \sim \mathcal{D}_E}\left[-\log\left(1 - R_\psi(s_t)\right)\right]$$
[0081] where $\eta$ is a hyperparameter set to a positive value between zero and one.
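The two loss functions described above can be written out as a small sketch. It assumes a cross-entropy form in which the reward predictions are probabilities in [0, 1], and it estimates each expectation by a batch average. The names `reward_loss`, `pu_reward_loss`, and the clipping constant `eps` are illustrative assumptions, not part of the specification.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _neg_log(p, eps=1e-7):
    # Clip so that log(0) is never evaluated.
    return -math.log(max(p, eps))


def reward_loss(expert_preds, unlabeled_preds):
    """First and second terms: push expert predictions to 1, unlabeled ones to 0."""
    expert_term = _mean([_neg_log(p) for p in expert_preds])
    unlabeled_term = _mean([_neg_log(1.0 - p) for p in unlabeled_preds])
    return expert_term + unlabeled_term


def pu_reward_loss(expert_preds, unlabeled_preds, eta=0.5):
    """Variant with the first term scaled by eta and a third, opposite-sign term."""
    first = eta * _mean([_neg_log(p) for p in expert_preds])
    second = _mean([_neg_log(1.0 - p) for p in unlabeled_preds])
    third = -eta * _mean([_neg_log(1.0 - p) for p in expert_preds])
    return first + second + third


expert_preds = [0.9, 0.8]          # hypothetical model outputs on expert observations
unlabeled_preds = [0.2, 0.7, 0.1]  # hypothetical model outputs on unlabeled observations
print(reward_loss(expert_preds, unlabeled_preds))
print(pu_reward_loss(expert_preds, unlabeled_preds, eta=0.5))
```

The third term offsets part of the penalty that the second term places on unlabeled observations that resemble expert ones, which is how the variant accounts for successful experiences hidden in the unlabeled data.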
[0082] In some implementations, instead of or in addition to the third term, the objective function includes a fourth term that penalizes the reward prediction model for accurately distinguishing whether early observations at the very beginning of a round come from expert experience or from unlabeled experience. As used herein, a "round" (that is, an episode) is a chronologically ordered sequence of experiences in which the robot attempts to perform some task starting from a corresponding initial environment state, or a chronologically ordered sequence of experiences in which the robot interacts randomly with the environment starting from a corresponding initial environment state.
[0083] Early observations in a round typically do not reflect task-specific behavior, because at the start of the round the robot has not yet performed any meaningful actions that would indicate whether it will successfully perform the specific task later in the round. Rather, the same early observations can be received whether the agent is interacting randomly, performing another task, or performing the specific task.
[0084] As a specific example, the system is able to identify the first n observations in any given round, such as the first five, first ten, or first twenty observations, as early observations. The system is then able to sample early observations from an unlabeled round and early observations from an expert round, and calculate the average of the reward predictions generated by the reward model for the early observations from the unlabeled round (“unlabeled average”) and the average of the reward predictions generated by the reward model for the early observations from the expert round (“expert average”).
[0085] If the expert average is not higher than the unlabeled average, the system sets the fourth term to zero.
[0086] If the expert average is higher than the unlabeled average, the system sets the fourth term to a negative value equal to the negative of the loss calculated on the early observations from the expert and unlabeled rounds, such as the loss with only the first and second terms or the loss with the first, second, and third terms.
[0087] Therefore, the system first uses the early observations to check whether the reward model is overfitting to the early observations, i.e., by checking whether the expert average is higher than the unlabeled average. If the reward model is overfitting to the early observations, the system uses the early observations again to calculate a negated loss that adjusts the training of the reward model.
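As an illustrative, non-limiting sketch, the check on early observations described above could be implemented as follows. The sketch assumes reward predictions strictly between 0 and 1 and a binary cross-entropy form for the first and second terms; that form and the names `early_predictions`, `base_loss`, and `fourth_term` are assumptions made for illustration.

```python
import numpy as np

def early_predictions(round_predictions, n=10):
    """Reward predictions for the first n observations of one round."""
    return np.asarray(round_predictions[:n], dtype=float)

def base_loss(expert_preds, unlabeled_preds, eps=1e-8):
    """Assumed cross-entropy form of the first and second terms."""
    e = np.clip(np.asarray(expert_preds, dtype=float), eps, 1.0 - eps)
    u = np.clip(np.asarray(unlabeled_preds, dtype=float), eps, 1.0 - eps)
    return np.mean(-np.log(e)) + np.mean(-np.log(1.0 - u))

def fourth_term(expert_early_preds, unlabeled_early_preds):
    """Zero unless the model separates early expert observations from early
    unlabeled observations; otherwise the negated loss on those observations."""
    expert_avg = float(np.mean(expert_early_preds))
    unlabeled_avg = float(np.mean(unlabeled_early_preds))
    if expert_avg <= unlabeled_avg:
        # Expert average is not higher: no sign of overfitting to early observations.
        return 0.0
    # Expert average is higher: add the negated loss so that training is
    # pushed away from telling the early observations apart.
    return -float(base_loss(expert_early_preds, unlabeled_early_preds))
```

In this sketch, the value returned by `fourth_term` would be added to the loss computed on the full training batch.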
[0088] The system can use any appropriate neural network training technique to train the reward prediction model to optimize the reward prediction objective function, e.g., any gradient-based technique, such as stochastic gradient descent or a technique that uses the Adam optimizer or the RMSProp optimizer.
[0089] In some implementations, before training the reward prediction model on the subset, the system applies data augmentation to the robot experience data used to train the model. In some cases, because expert experience can be very limited in scale, the reward prediction model can achieve "high" performance by simply memorizing all expert states and blindly assigning a reward of 0 to all other states. Applying data augmentation to the experience used to train the reward prediction model helps mitigate this problem. Typically, data augmentation involves applying techniques that increase the amount of expert experience and, optionally, unlabeled experience, or that modify that experience. For example, images can be distorted, rotated, or cropped, and sensor inputs can be discarded or perturbed with random noise. Examples of specific data augmentation techniques that can be applied are described in Supplementary Material B of *Task-Relevant Adversarial Imitation Learning* by Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang, published at CoRL 2020.
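As an illustrative, non-limiting sketch, data augmentation of a single observation could combine a random image crop with noise on, or dropping of, the sensor inputs. The function name `augment_observation`, the parameter defaults, and the choice to replace dropped sensor inputs with zeros are assumptions made for illustration and are not the specific techniques of the cited reference.

```python
import numpy as np

def augment_observation(image, sensors, rng, crop=4, noise_std=0.01, drop_prob=0.1):
    """Randomly crop an image and perturb or drop a sensor vector.

    image: array of shape [height, width, channels].
    sensors: 1-D float array of sensor readings.
    rng: a numpy.random.Generator.
    """
    h, w = image.shape[:2]
    # Random crop: remove `crop` pixels in total along each spatial axis.
    top = rng.integers(0, crop + 1)
    left = rng.integers(0, crop + 1)
    cropped = image[top:top + h - crop, left:left + w - crop]
    # Perturb the sensor readings with Gaussian noise.
    noisy = sensors + rng.normal(0.0, noise_std, size=sensors.shape)
    # Occasionally discard the sensor readings entirely (assumed: zero them).
    if rng.random() < drop_prob:
        noisy = np.zeros_like(noisy)
    return cropped, noisy

rng = np.random.default_rng(0)
aug_image, aug_sensors = augment_observation(np.zeros((64, 64, 3)), np.ones(7), rng)
```

Each call draws a fresh crop offset and fresh noise, so the same expert observation can be presented to the reward prediction model in many slightly different forms.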
[0090] The system uses the trained reward prediction model to process each experience in the robot experience data to generate a corresponding reward prediction for each experience (step 206). That is, for each experience, after performing the training in step 204, the system uses the reward prediction model to process a reward input that includes the observation from the experience to generate the corresponding reward prediction for the experience.
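As an illustrative, non-limiting sketch, step 206 amounts to a pass over the stored experiences in which each experience is annotated with the output of the trained reward prediction model. The dictionary layout of an experience, the function name `annotate_with_rewards`, and the stand-in `toy_reward_model` are hypothetical.

```python
import numpy as np

def annotate_with_rewards(experiences, reward_model):
    """Return copies of the experiences, each with a predicted reward attached.

    experiences: list of dicts with at least an "observation" entry.
    reward_model: callable mapping an observation to a scalar reward prediction.
    """
    return [
        {**exp, "reward": float(reward_model(exp["observation"]))}
        for exp in experiences
    ]

def toy_reward_model(observation):
    """Stand-in for a trained reward prediction model (sigmoid of the sum)."""
    return 1.0 / (1.0 + np.exp(-np.sum(observation)))

dataset = [
    {"observation": np.zeros(3), "action": 0},
    {"observation": np.ones(3), "action": 1},
]
annotated = annotate_with_rewards(dataset, toy_reward_model)
```

The same pass is applied to expert and unlabeled experiences alike, so that every experience used in step 208 carries a reward prediction.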
[0091] The system trains a policy neural network based on (i) the experiences and (ii) the corresponding reward predictions for the experiences (step 208).
[0092] Typically, the system can use an off-policy reinforcement learning technique to train the policy neural network based on (i) the experiences and (ii) the corresponding reward predictions for the experiences. Because the technique is "off-policy," i.e., it does not require that any experience on which the neural network is trained be generated using the current version of the neural network, the system can train the policy neural network entirely "offline" on the task-specific training data, i.e., without needing to use the neural network to control the robot to perform the specific task.
[0093] The system can use any appropriate off-policy reinforcement learning technique to train the policy neural network.
[0094] As a specific example, the system can use an offline actor-critic technique, in which a critic neural network is trained jointly with the policy neural network. An example of such a technique is Critic Regularized Regression (CRR). CRR is described in more detail in *Critic Regularized Regression* by Ziyu Wang, Alexander Novikov, Konrad Zolna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, et al.
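As an illustrative, non-limiting sketch, the policy update in a CRR-style technique can be viewed as behavioral cloning in which each dataset action is weighted according to an advantage estimated by the critic. The sketch below computes only the policy loss from precomputed quantities; the function name `crr_policy_loss`, the argument layout, and the weight cap `max_weight` are assumptions made for illustration, and training of the critic itself is omitted.

```python
import numpy as np

def crr_policy_loss(log_probs, q_data, q_policy_samples,
                    mode="binary", beta=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood of the dataset actions.

    log_probs: log pi(a|s) for each dataset action, shape [B].
    q_data: critic values Q(s, a) for the dataset actions, shape [B].
    q_policy_samples: critic values Q(s, a_j) for m actions a_j sampled
        from the current policy at each state, shape [B, m].
    """
    log_probs = np.asarray(log_probs, dtype=float)
    q_data = np.asarray(q_data, dtype=float)
    q_policy_samples = np.asarray(q_policy_samples, dtype=float)
    # Advantage estimate: dataset action value minus the mean sampled value.
    advantage = q_data - q_policy_samples.mean(axis=1)
    if mode == "binary":
        # Clone only those actions that the critic rates above average.
        weights = (advantage > 0).astype(float)
    elif mode == "exp":
        # Exponential weighting, capped for numerical stability (assumed cap).
        weights = np.minimum(np.exp(advantage / beta), max_weight)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # In a gradient-based implementation the weights are treated as constants.
    return -np.mean(weights * log_probs)
```

In this sketch, the rewards used to train the critic would be the reward predictions generated in step 206 rather than rewards supplied by the environment.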
[0095] As another specific example, the system can update the policy neural network using either a batch reinforcement learning technique that relies on distributional policy gradients or a batch reinforcement learning technique that relies on non-distributional deterministic policy gradients. Batch reinforcement learning is described in more detail in *Batch Reinforcement Learning*, published by Sascha Lange, Thomas Gabel, and Martin Riedmiller in Springer's *Reinforcement Learning*, pp. 45–72, 2012. Distributional and non-distributional deterministic policy gradients are described in more detail in *Distributed Distributional Deterministic Policy Gradients*, presented by Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap at the 2018 International Conference on Learning Representations.
[0096] In some implementations, the system first trains the reward prediction model and then, once the reward prediction model has been trained, uses it to train the policy neural network. In other words, the system performs a single iteration of process 200 to complete the training of both the reward prediction model and the policy neural network, and in step 208 the policy neural network is trained on the entire robot experience dataset available to the system.
[0097] In some other implementations, the system repeatedly updates the reward prediction model and the policy neural network by performing process 200 multiple times. That is, the system performs process 200 multiple times during the training of the policy neural network and, at each iteration, trains the reward prediction model and the policy neural network starting from the values of their respective parameters as of the preceding iteration.
[0098] In these implementations, at each iteration, the system can obtain a portion of the entire robot experience dataset. As a specific example, the system can sample a batch of expert experience and multiple batches of unlabeled experience from the entire robot experience dataset. The system can then use the batch of expert experience and one of the multiple batches of unlabeled experience as the subset for training the reward prediction model, and use the batch of expert experience and all of the batches of unlabeled experience as the data for training the policy neural network.
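As an illustrative, non-limiting sketch, the per-iteration sampling could be organized as follows. The sketch assumes that the reward prediction model is trained on the expert batch together with one of the unlabeled batches, while the policy neural network is trained on the expert batch together with all of the unlabeled batches; the function name `sample_iteration_batches` and the uniform sampling with replacement are assumptions made for illustration.

```python
import numpy as np

def sample_iteration_batches(expert, unlabeled, batch_size, num_unlabeled_batches, rng):
    """Sample the data used in one iteration of the training process.

    expert, unlabeled: lists of experiences.
    rng: a numpy.random.Generator.
    Returns (reward_subset, policy_data).
    """
    # One batch of expert experience, sampled uniformly with replacement.
    expert_batch = [expert[i] for i in rng.integers(0, len(expert), size=batch_size)]
    # Several batches of unlabeled experience.
    unlabeled_batches = [
        [unlabeled[i] for i in rng.integers(0, len(unlabeled), size=batch_size)]
        for _ in range(num_unlabeled_batches)
    ]
    # Reward prediction model: expert batch plus one unlabeled batch.
    reward_subset = {"expert": expert_batch, "unlabeled": unlabeled_batches[0]}
    # Policy neural network: expert batch plus every unlabeled batch.
    policy_data = expert_batch + [e for batch in unlabeled_batches for e in batch]
    return reward_subset, policy_data
```

Before the policy update in a given iteration, each experience in `policy_data` would be annotated with a reward prediction from the current reward prediction model.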
[0099] Graph 300 of Figure 3 shows a comparison of the performance of the described training process with that of a conventional training process.
[0100] Specifically, Figure 3 shows the average return 310 per round of a specific task, as reflected in the expert experience data within a specific robot experience dataset. Because the expert experience data is collected from expert agents, the average return 310 represents the performance of the expert agent on the specific task.
[0101] Figure 3 also shows the average return 320 per round, as reflected in the unlabeled experience within the specific robot experience dataset. As is evident from Figure 3, and for the reasons stated above, the average expert round has a significantly higher return than the average unlabeled round.
[0102] Figure 3 also shows the performance of the described technique (referred to as Offline Reinforced Imitation Learning (ORIL)) relative to a state-of-the-art existing technique that does not rely on pre-existing task rewards (Behavioral Cloning (BC)).
[0103] Specifically, the y-axis of graph 300 shows the average return, and the x-axis of graph 300 shows the number of unlabeled experiences in the robot experience data. For all quantities of unlabeled experience data, there is the same relatively small number of expert experiences (189).
[0104] Figure 3 shows the performance of ORIL as curve 330 and the performance of BC as curve 340.
[0105] As can be seen from Figure 3, ORIL can learn to use larger amounts of unlabeled experience to approach expert-level performance. That is, as the amount of unlabeled experience increases, curve 330 approaches the average return 310, whereas the performance of BC degrades when the robot experience data contains significantly more unlabeled experience than expert experience.
[0106] This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform a specific operation or action means that the system has installed on it software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform those operations or actions. For one or more computer programs to be configured to perform a specific operation or action means that the one or more programs include instructions that, when executed by a data processing device, cause that device to perform those operations or actions.
[0107] The embodiments of the subject matter and functional operation described in this specification can be implemented in digital electronic circuit systems, tangibly embodied computer software or firmware, computer hardware, including the structures disclosed in this specification and their structural equivalents, or combinations thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by a data processing device or for controlling the operation of a data processing device. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination thereof. Alternatively or additionally, the program instructions can be encoded on artificially generated propagation signals, such as machine-generated electrical, optical, or electromagnetic signals, which are generated to encode information for transmission to a suitable receiver device for execution by the data processing device.
[0108] The term "data processing device" refers to data processing hardware and encompasses all kinds of devices, apparatuses, and machines used for processing data, including, for example, programmable processors, computers, or multiple processors or computers. The device may also be or further include a dedicated logic circuit system, such as an FPGA (Field-Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). In addition to hardware, the device may optionally include code that creates an execution environment for computer programs, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.
[0109] A computer program can be written in any programming language, including compiled or interpreted languages, declarative or procedural languages, and can also be referred to or described as a program, software, software application, application program, module, software module, script, or code; and can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for a computing environment. A program may, but is not required to, correspond to a file in a file system. A program can be stored as part of a file that holds other programs or data, such as one or more scripts stored in a markup language document, or as a single file dedicated to the program in question, or as multiple collaborative files, such as a file storing one or more modules, subroutines, or portions of code. A computer program can be deployed to execute on a single computer or on multiple computers located at a site or distributed across multiple sites and interconnected via a data communication network.
[0110] In this specification, the term "database" is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, an index database can comprise multiple collections of data, each of which can be organized and accessed in different ways.
[0111] Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process programmed to perform one or more specific functions. Typically, an engine will be implemented as one or more software modules or components installed on one or more computers at one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in others, multiple engines can be installed and run on the same one or more computers.
[0112] The processes and logic flows described in this specification can be executed by one or more programmable computers that execute one or more computer programs to perform functions by performing operations on input data and generating output. These processes and logic flows can also be executed by a dedicated logic circuit system, such as an FPGA or ASIC, or by a combination of a dedicated logic circuit system and one or more programmable computers.
[0113] A computer suitable for executing computer programs can be based on a general-purpose microprocessor, a special-purpose microprocessor, or both, or on any other type of central processing unit (CPU). Typically, the CPU receives instructions and data from read-only memory (ROM) or random access memory (RAM), or both. The essential components of a computer are a CPU for performing or executing instructions and one or more memory devices for storing instructions and data. The CPU and memory can be supplemented by, or incorporated in, a special-purpose logic circuit system. Typically, a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or will be operatively coupled to such devices to receive data from them, transfer data to them, or both. However, a computer is not required to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device, such as a universal serial bus (USB) flash drive, to name a few.
[0114] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROMs and DVD-ROMs.
[0115] To provide interaction with the user, embodiments of the subject matter described herein can be implemented on a computer having a display device for displaying information to the user, such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor; and a keyboard and pointing device, such as a mouse or trackball, through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form (including sound input, voice input, or tactile input). Additionally, the computer can interact with the user by sending documents to and receiving documents from the device used by the user, for example, by sending web pages to a web browser on the user's device in response to a request received from a web browser. Furthermore, the computer can interact with the user by sending text messages or other forms of messages to a personal device, such as a smartphone running a messaging application, and receiving response messages from the user.
[0116] Data processing devices used to implement machine learning models can also include, for example, dedicated hardware accelerator units for processing the common and compute-intensive parts of machine learning training or production (i.e., inference) workloads.
[0117] Machine learning models can be implemented and deployed using a machine learning framework, such as the TensorFlow framework.
[0118] Embodiments of the subject matter described in this specification can be implemented in computing systems that include backend components (e.g., a data server), computing systems that include middleware components (e.g., an application server), computing systems that include frontend components (e.g., a client computer with a graphical user interface, web browser, or application through which a user can interact with embodiments of the subject matter described in this specification), or computing systems that include any combination of one or more such backend components, middleware components, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium, such as a communication network. Examples of communication networks include local area networks (LANs) and wide area networks (WANs), such as the Internet.
[0119] A computing system can include clients and servers. Clients and servers are generally remote from each other and typically interact via a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. In some embodiments, the server transmits data, such as HTML pages, to a user device, for example, to display data to a user interacting with the device and to receive user input from that user. Data generated at the user device, such as the results of user interactions, can be received at the server from the device.
[0120] While this specification contains numerous specific implementation details, these details should not be construed as limiting the scope of any invention or the scope of what may be claimed, but rather as descriptions of features specific to particular embodiments of a particular invention. Certain features described in this specification within the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described within the context of a single embodiment may also be implemented separately or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed in this way, in some cases it is possible to remove one or more features from the claimed combination, and the claimed combination may involve sub-combinations or variations of sub-combinations.
[0121] Similarly, although the operations are depicted in a specific order in the accompanying drawings and described in a specific order in the claims, this should not be construed as requiring the operations to be performed in the specific order shown or in a sequential order, or requiring the execution of all illustrated operations to achieve the desired result. In some cases, multitasking and parallel processing can be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0122] Specific embodiments of this subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions described in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the drawings do not necessarily require the specific order or sequence shown to achieve the desired result. In some cases, multitasking and parallel processing can be advantageous.
Claims
1. A method, the method comprising: obtaining robot experience data that characterizes interactions of a robot with an environment, the robot experience data including multiple experiences, each of which includes an observation characterizing a state of the environment and an action performed by the robot in response to the observation, wherein the multiple experiences include: expert experience derived from rounds of a specific task performed by an expert agent, and unlabeled experience; training a reward prediction model based on a first subset of the robot experience data, wherein the reward prediction model receives a reward input including an input observation and generates a reward prediction as output, the reward prediction being a prediction of a task-specific reward, corresponding to the specific task, that should be assigned to the input observation, and wherein training the reward prediction model includes optimizing an objective function that: includes a first term that encourages the reward prediction model to assign a first reward value to observations derived from expert experience, the first reward value indicating that the specific task was successfully completed after the environment was in the state characterized by the observation, and
includes a second term that encourages the reward prediction model to assign a second reward value to observations from unlabeled experience, the second reward value indicating that the specific task was not successfully completed after the environment was in the state characterized by the observation; processing experiences in the robot experience data using the trained reward prediction model to generate a corresponding reward prediction for each of the processed experiences; and training a policy neural network based on (i) the processed experiences and (ii) the corresponding reward predictions for the processed experiences, wherein the policy neural network is configured to receive a network input including an observation and to generate a policy output that defines a control policy for the robot to perform the specific task.
2. The method according to claim 1, further comprising: controlling the robot using the trained policy neural network while the robot performs the specific task.
3. The method according to claim 1, further comprising: providing data specifying the trained policy neural network for use in controlling the robot while the robot performs the specific task.
4. The method of claim 1, wherein the first subset comprises the expert experience and a proper subset of the unlabeled experience.
5. The method of claim 1, wherein the objective function includes a third term, the third term encouraging the reward prediction model to assign the second reward value to observations derived from expert experience.
6. The method of claim 5, wherein the first and second terms have different signs from the third term in the objective function.
7. The method of claim 5, wherein the objective function includes a fourth term, the fourth term penalizing the reward prediction model for correctly distinguishing expert experience from unlabeled experience based on a first predetermined number of observations of rounds of the specific task performed by the expert agent.
8. The method of claim 1, wherein training the policy neural network comprises: training the policy neural network using an offline reinforcement learning technique based on the experiences and the corresponding reward predictions for the experiences.
9. The method of claim 8, wherein the offline reinforcement learning technique is an offline actor-critic technique.
10. The method of claim 8, wherein the offline reinforcement learning technique is critic regularized regression.
11. The method of claim 1, wherein training the reward prediction model comprises: applying data augmentation to the experiences in the robot experience data that are used to train the reward prediction model.
12. The method according to any one of claims 1 to 11, wherein at least some of the experiences in the first subset of the robot experience data are related to a real-world environment.
13. The method of claim 2, wherein controlling the robot comprises: obtaining observations from one or more sensors that sense the real-world environment, providing the observations to the trained policy neural network, and using the output of the trained policy neural network to select actions for controlling the robot to perform the specific task.
14. A system comprising one or more computers and one or more storage devices storing instructions, the instructions, when executed by the one or more computers, causing the one or more computers to perform the method according to any one of claims 1 to 13.
15. A non-transitory computer-readable storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the method according to any one of claims 1 to 13.