Robotic manipulation with deep reinforcement learning
By combining deep reinforcement learning with asynchronous parallel training, robots can quickly learn complex manipulation tasks on real physical systems. This addresses the long training times and low efficiency of existing techniques and yields earlier gains in performance and safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2017-09-14
- Publication Date
- 2026-06-19
AI Technical Summary
The existing learning process for robot manipulation tasks is limited by hand-designed policy representations and human-provided demonstrations, resulting in long training times and low efficiency, making it difficult to quickly improve performance in real physical systems.
We employ deep reinforcement learning methods, using asynchronous parallel training algorithms to collect experience data from multiple robots to train a policy neural network. This reduces the reliance on manually designed policy representations. We utilize deep Q-functions and DDPG or NAF algorithms for model-free learning, enabling rapid updates and optimization of policy parameters.
It enables rapid learning of complex manipulation tasks on real robots, reducing training time, improving efficiency and safety, reducing robot joint wear, and achieving earlier performance improvements.
Figure CN115338859B_ABST
Abstract
Description
[0001] This application is a divisional application of the invention patent application filed on September 14, 2017, with application number 201780067067.9 and invention title "Deep Reinforcement Learning for Robot Manipulation".
Technical Field
[0002] The inventive concept of this application relates to deep reinforcement learning methods, and more particularly to extensions of deep reinforcement learning methods to improve the operational performance of one or more robots.
Background Technology
[0003] Many robots are programmed to manipulate one or more objects using one or more end effectors. For example, a robot can use an end effector to apply force to an object and cause the object to move. As another example, a robot can use a grasping end effector or other end effector to displace an object without necessarily grasping it. As yet another example, a robot can use a grasping end effector such as an "impactive" gripper or an "ingressive" gripper (e.g., one that physically penetrates the object using pins, needles, etc.) to pick up an object from a first position, move the object to a second position, and put the object down at the second position.
Summary of the Invention
[0004] The embodiments described below provide improvements in the operational performance of one or more robots when performing one or more tasks. As described herein, using reinforcement learning processes to improve the performance of one or more robots facilitates the rapid learning of optimal methods or policies for performing a specific physical task using one or more robots. The robots are able to use the learned policies to improve the efficiency of task execution. For example, as the described reinforcement learning process proceeds, the physical task can be performed by the robot faster and / or with less power consumption. Because other aspects of robot performance also improve as the learning process proceeds, such physical tasks can additionally or alternatively be performed more safely, or can continue to be performed within defined safety parameters.
[0005] As will be apparent from the following disclosure, the learning process can be iterative. As each new iteration of the policy for a particular task is passed to the computing device(s) responsible for controlling the physical actions of the robots, one or more robots can perform the task according to that new, improved iteration. In this way, the aforementioned efficiency gains in the physical actions performed by the robots can occur frequently as the learning process continues. In general, the rate of robot performance improvement and the resulting efficiency gains can be particularly fast, enabling the robots to complete physical tasks optimally in a shorter time than with other learning techniques. It will be understood that this allows the aforementioned advantages, such as improved power consumption of the robots, to be realized at an earlier stage. This is described below, for example, as part of the explanation of the decoupling between training threads and experience threads for one or more robots on different computer processors. Specifically, it is explained that parallelizing the training algorithm across multiple robots that pool their policy updates asynchronously can yield a more accurate and / or more robust policy neural network after a given number of training iterations.
[0006] The goal of reinforcement learning is to control an agent that attempts to maximize a reward function, which, in the context of a robot skill (also referred to herein as a task), represents a user-provided definition of what the robot should attempt to accomplish. At time $t$, in state $x_t$, the agent selects and executes an action $u_t$ according to its policy $\pi(u_t \mid x_t)$, transitions to a new state $x_{t+1}$ according to the robot's dynamics $p(x_{t+1} \mid x_t, u_t)$, and receives a reward $r(x_t, u_t)$. The goal of reinforcement learning is to find the optimal policy $\pi^*$ that maximizes the expected sum of rewards from the initial state distribution. Rewards are determined based on a reward function, which, as mentioned above, depends on the robot task to be completed. Therefore, reinforcement learning in the robot context seeks to learn the optimal policy for performing a given robot task.
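The objective stated in this paragraph can be written compactly as follows. This is only a sketch under assumptions that the paragraph does not spell out: a finite horizon $T$, an initial state distribution written as $p(x_1)$, and no discount factor (the paragraph mentions none).

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{\,x_1 \sim p(x_1),\;\; u_t \sim \pi(u_t \mid x_t),\;\; x_{t+1} \sim p(x_{t+1} \mid x_t, u_t)}
\left[\, \sum_{t=1}^{T} r(x_t, u_t) \right]
```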
[0007] The embodiments disclosed herein utilize deep reinforcement learning to train a policy network that parameterizes a policy for determining robot actions based on a current state. This current state may include the robot's state (e.g., angles of the robot's joints, positions of multiple end effectors and / or their time derivatives) and / or the current state of one or more components in the robot's environment (e.g., the current state of multiple sensors in the robot's environment, the current pose of multiple target objects in the robot's environment). The policy network may be a neural network, such as a deep neural network. For example, the policy network may be a neural network that takes the current state as input and generates an output indicating the action to be performed based on the neural network's input and learned policy parameters. For example, the output may indicate a velocity command to be provided to each actuator in the robot's actuators, or a torque to be applied to each actuator in the robot's actuators. The robot can utilize the policy neural network by applying the current state to the policy neural network in each control cycle of the robot, processing the current state using the policy neural network to generate an output, and implementing control commands to perform the action indicated by the output. The state after the implementation of the control command can then be used as the current state for the next control cycle.
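The control cycle described above can be sketched as a short loop. This is an illustrative toy, not the disclosed implementation: `linear_policy` stands in for the policy neural network, `integrate_velocity` stands in for a robot whose state is just its joint angles, and the 0.05 s period is an assumed 20 Hz control cycle. All names are hypothetical.

```python
from typing import Callable, List

State = List[float]
Action = List[float]


def linear_policy(params: List[List[float]], state: State) -> Action:
    """Toy stand-in for the policy network: one velocity command per actuator."""
    return [sum(w * s for w, s in zip(row, state)) for row in params]


def integrate_velocity(state: State, action: Action, dt: float = 0.05) -> State:
    """Toy robot: each joint angle advances by its commanded velocity for one cycle."""
    return [q + dt * v for q, v in zip(state, action)]


def run_control_cycles(
    policy: Callable[[List[List[float]], State], Action],
    params: List[List[float]],
    state: State,
    step: Callable[[State, Action], State],
    num_cycles: int,
) -> List[State]:
    """Apply the policy once per control cycle and return every visited state."""
    visited = [state]
    for _ in range(num_cycles):
        action = policy(params, state)  # process the current state with the policy
        state = step(state, action)     # implement the control command
        visited.append(state)           # the resulting state feeds the next cycle
    return visited
```

With `params = [[-1.0, 0.0], [0.0, -1.0]]` the toy policy commands each joint back toward zero, so the joint angles shrink by 5% per cycle.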
[0008] The embodiments disclosed herein collect experience data from multiple robots operating simultaneously. Each robot generates instances of experience data during iterative execution of episodes, each of which is an exploration of the task to be performed and is guided by the policy network and its current policy parameters during that episode. For example, during an episode, the robot may generate an instance of experience data in each of the robot's multiple control cycles during that episode. Each instance of experience data may indicate a corresponding: current / starting state, subsequent state to which the robot transitions from the starting state, robot action performed to transition from the starting state to the subsequent state (where the action is based on applying the starting state to the policy network with its current policy parameters), and optionally a reward for the action (e.g., as determined based on a reward function). The experience data collected during the episodes is used to train the policy network by iteratively updating the policy parameters of the policy network based on batches of collected experience data. Furthermore, prior to the execution of each of the multiple episodes performed by a robot, the current updated policy parameters can be provided (or retrieved) for use during that episode. For example, each robot can obtain the updated policy parameters from the most recent training iteration before executing each of its episodes, and use those updated policy parameters when executing that episode. Thus, the experience data for each episode is based on a policy network with updated policy parameters from the most recent (relative to the start of the episode) training iteration.
[0009] In many implementations, the training of the policy network is asynchronous with respect to the generation and collection of experience data from multiple robots. That is, the threads training / updating the policy network are decoupled from the threads generating and / or collecting experience data from the multiple robots. For example, the training / updating threads may operate on one or more processors, and the experience threads may operate on one or more additional processors separate from those operating the training / updating threads. This decoupling between the training threads and the experience threads ensures that a difference between the training speed and the experience collection speed does not halt the control programs of the robots that generate experience data, which typically must send control commands at a fixed frequency. In other words, decoupling allows experience data collection to continue through the corresponding experience threads without halting those threads for training purposes. Furthermore, decoupling allows the training threads to operate in parallel with the experience threads, asynchronously and iteratively updating the policy parameters of the policy network and iteratively providing updated policy parameters for episodes. Also, in many implementations, the training threads may operate at a frequency (e.g., 60 Hz) greater than the control frequency (e.g., 20 Hz) of one or more (e.g., all) of the robots. In these implementations, real-world (e.g., wall-clock) training time can be reduced (compared to techniques that do not utilize experience data from multiple robots) by obtaining experience data from multiple robots operating in parallel and by performing training asynchronously in separate threads. For example, training can proceed with no (or little) delay caused by a lack of new experience data in the buffer. Also, for example, separate threads can avoid the need to stop experience data collection for training to proceed, and vice versa.
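The decoupling described above can be illustrated with a minimal threading sketch. It is not the disclosed system: an integer `param_version` stands in for the policy parameters, the "training step" is a placeholder that only bumps that version, and every name (`Shared`, `experience_thread`, `training_thread`, `run_collection_and_training`) is hypothetical. The point it shows is that each experience thread keeps its own fixed control period and never waits for training, while the trainer samples from the shared buffer at its own pace.

```python
import random
import threading
import time
from collections import deque


class Shared:
    """State shared by the experience threads and the training thread."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.buffer = deque(maxlen=10_000)  # buffer of experience instances
        self.param_version = 0              # stands in for the policy parameters
        self.first_update = threading.Event()


def experience_thread(shared, robot_id, episodes, cycles, period_s):
    for episode in range(episodes):
        with shared.lock:
            version = shared.param_version  # fetch latest parameters at episode start
        for cycle in range(cycles):
            with shared.lock:
                shared.buffer.append((robot_id, episode, cycle, version))
            time.sleep(period_s)            # fixed control period; never waits on training


def training_thread(shared, stop, batch_size=4, train_period_s=0.0005):
    while not stop.is_set():
        with shared.lock:
            k = min(batch_size, len(shared.buffer))
            batch = random.sample(list(shared.buffer), k)
        if batch:
            # Placeholder: a real trainer would take a gradient step on `batch`.
            with shared.lock:
                shared.param_version += 1
            shared.first_update.set()
        time.sleep(train_period_s)          # stands in for training compute time


def run_collection_and_training(num_robots=3, episodes=2, cycles=5, period_s=0.001):
    shared, stop = Shared(), threading.Event()
    trainer = threading.Thread(target=training_thread, args=(shared, stop))
    robots = [
        threading.Thread(target=experience_thread,
                         args=(shared, i, episodes, cycles, period_s))
        for i in range(num_robots)
    ]
    trainer.start()
    for r in robots:
        r.start()
    for r in robots:
        r.join()
    shared.first_update.wait(timeout=5.0)   # make sure at least one update happened
    stop.set()
    trainer.join()
    return shared
```

Because the version is read only at the start of an episode, every instance in the buffer records which parameter version produced it, which is what allows training to remain off-policy with respect to older instances.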
[0010] Furthermore, utilizing experience data from multiple robots and decoupling the training and experience collection threads can produce a more accurate and / or more robust model after a given number of training iterations than if these techniques were not used. This can be because, for example, the experience data generated by a robot in a given episode is based on policy parameters that were updated based on both past instances of experience data from that robot and past instances of experience data from the other robot(s) operating in parallel. For example, the policy parameters utilized in a robot's third episode can be based not only on experience data from that robot's first and / or second episodes, but also on experience data from the first and / or second episodes of the other robot(s) operating in parallel. In this way, the experience data generated in the third episode is based on a policy network with updated policy parameters trained on experience data from more than two previous episodes, which can cause the experience data generated in the third episode to lead to faster convergence than if the updated policy parameters had been trained on experience data from only two previous episodes.
[0011] In the various embodiments described herein, one or more of the multiple robots generating experience data can operate asynchronously relative to each other, and / or updated policy parameters can be asynchronously provided to (or retrieved by) the robots before episode execution. In this way, the updated policy parameters provided to each of the multiple robots can differ from one another. For example, at a first time point, a first robot can obtain updated policy parameters for use in an episode to be executed by the first robot. At this first time point, a second robot may still be executing a previous episode. At a second time point following the first time point, the second robot can then obtain further updated policy parameters for use in an episode that is to be executed by the second robot and that immediately follows the previous episode. The further updated policy parameters obtained at the second time point can differ (due to further training between the first and second time points) from the updated policy parameters obtained by the first robot at the first time point. In this way, the updated policy parameters obtained by the first robot at the first time point are never provided to the second robot. Instead, the second robot obtains the more up-to-date further updated policy parameters.
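The timeline in this paragraph can be sketched with a small versioned parameter store. `ParameterStore`, `publish`, and `latest` are hypothetical names introduced only for illustration; the disclosure does not specify how parameters are stored or versioned.

```python
import threading
from typing import List, Tuple


class ParameterStore:
    """Holds the most recent policy parameters together with a version counter."""

    def __init__(self, params: List[float]) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._params = list(params)

    def publish(self, params: List[float]) -> int:
        """Called by the trainer after each training iteration."""
        with self._lock:
            self._version += 1
            self._params = list(params)
            return self._version

    def latest(self) -> Tuple[int, List[float]]:
        """Called by a robot just before it starts an episode."""
        with self._lock:
            return self._version, list(self._params)


store = ParameterStore([0.0])
store.publish([0.1])                  # training iteration finishes
version_a, params_a = store.latest()  # first time point: first robot starts an episode
store.publish([0.2])                  # further training while the second robot is busy
version_b, params_b = store.latest()  # second time point: second robot starts its episode
```

Here the first robot runs its episode with version 1 while the second robot, which fetched later, runs with version 2 and never sees version 1.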
[0012] In some embodiments, a method is provided that includes performing the following steps during the execution of multiple episodes by each of a plurality of robots, wherein each episode is an exploration of performing a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of robot experience data generated by the robots during the episodes, wherein each instance of robot experience data is generated during a corresponding one of the episodes and is generated based at least in part on a corresponding output generated using the policy neural network with corresponding policy parameters of the policy neural network for the corresponding episode; iteratively generating updated policy parameters of the policy neural network, wherein each iteration of the iterative generation includes generating the updated policy parameters using a group of one or more instances of robot experience data in the buffer during the iteration; and updating, by each robot in conjunction with the start of each of the multiple episodes executed by that robot, the policy neural network to be used by the robot in the episode, wherein updating the policy neural network includes using the updated policy parameters of the most recent iteration of the iterative generation.
[0013] These and other implementations disclosed herein may include one or more of the following features.
[0014] Each of the updated policy parameters defines a corresponding value for the corresponding node in the corresponding layer of the policy neural network.
[0015] For a given robot of the robots, instances of robot experience data can be stored in the buffer at a first frequency that is lower than the frequency at which updated policy parameters are iteratively generated.
[0016] For each robot of the robots, instances of robot experience data can be stored in the buffer at a corresponding frequency that is lower than the frequency at which updated policy parameters are iteratively generated.
[0017] Storing instances of robot experience data in a buffer can be executed by one or more processors in a first thread, and iterative generation can be executed by one or more processors in a second thread that is separate from the first thread. For example, the first thread can be executed by one or more processors in a first group, and the second thread can be executed by one or more processors in a second group, wherein the second group does not overlap with the first group.
[0018] Each iteration of the iterative generation may include generating the updated policy parameters based on minimizing a loss function in view of a group of one or more instances of robot experience data in the buffer during that iteration.
[0019] Each iteration of the iterative generation may include off-policy learning in view of a group of one or more instances of robot experience data in the buffer during that iteration. For example, the off-policy learning may be Q-learning, such as Q-learning using the normalized advantage function (NAF) algorithm or the deep deterministic policy gradient (DDPG) algorithm.
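As an illustration of the kind of loss such an iteration might minimize, the sketch below evaluates a NAF-style Q-value and a squared temporal-difference loss for one experience instance. It assumes the commonly published NAF parameterization, Q(x, u) = V(x) - 0.5 (u - mu(x))^T L L^T (u - mu(x)) with L lower-triangular, and a one-step target y = r + gamma * V'(x'). The paragraphs above only name the algorithm, so the numeric inputs (which a network would normally output) and all function names are hypothetical.

```python
from typing import List, Sequence


def naf_q_value(value: float, mu: Sequence[float],
                chol: List[List[float]], action: Sequence[float]) -> float:
    """Q(x, u) = V(x) - 0.5 * (u - mu)^T L L^T (u - mu), with L given as `chol`."""
    diff = [a - m for a, m in zip(action, mu)]
    n = len(diff)
    # z = L^T diff, so the quadratic form equals the squared norm of z.
    z = [sum(chol[i][j] * diff[i] for i in range(n)) for j in range(n)]
    return value - 0.5 * sum(v * v for v in z)


def td_target(reward: float, gamma: float, next_value: float) -> float:
    """One-step target y = r + gamma * V'(x')."""
    return reward + gamma * next_value


def squared_td_loss(q: float, target: float) -> float:
    """Per-instance loss that a training iteration would average over a batch."""
    return (target - q) ** 2


mu = [0.5, -0.5]                  # greedy action for this state
chol = [[1.0, 0.0], [2.0, 3.0]]   # lower-triangular factor L
q_greedy = naf_q_value(1.0, mu, chol, mu)
q_other = naf_q_value(1.0, mu, chol, [1.5, -0.5])
loss = squared_td_loss(q_other, td_target(reward=1.0, gamma=0.5, next_value=2.0))
```

Because the advantage term is a non-positive quadratic, the Q-value is largest at the action `mu`, which is what makes the greedy action available in closed form for continuous actions.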
[0020] Each instance of experience data can indicate a corresponding: starting robot state, subsequent robot state transitioned to from the starting robot state, action performed to transition from the starting robot state to the subsequent robot state, and reward for that action. The action performed to transition from the starting robot state to the subsequent robot state can be generated based on processing the starting robot state using the policy neural network with the updated policy parameters for the corresponding episode. The reward for that action can be generated based on a reward function of the reinforcement learning policy.
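One possible in-memory shape for such an instance, together with a bounded buffer and batch sampling, is sketched below. The field names, the `reaching_reward` function (negative squared distance to a target plus a small action penalty), and the buffer size are all assumptions made for illustration; the text above does not prescribe any of them.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """One instance of experience data."""
    state: Tuple[float, ...]       # starting robot state
    action: Tuple[float, ...]      # action performed in that state
    next_state: Tuple[float, ...]  # subsequent robot state
    reward: float                  # reward for the action


def reaching_reward(state: Sequence[float], action: Sequence[float],
                    target: Sequence[float]) -> float:
    """Hypothetical reward r(x, u): stay close to the target with small actions."""
    distance_sq = sum((s - g) ** 2 for s, g in zip(state, target))
    effort_sq = sum(a * a for a in action)
    return -distance_sq - 0.01 * effort_sq


def make_transition(state, action, next_state, target) -> Transition:
    return Transition(tuple(state), tuple(action), tuple(next_state),
                      reaching_reward(state, action, target))


def sample_batch(buffer: Deque[Transition], batch_size: int,
                 rng: random.Random) -> List[Transition]:
    """Draw a training batch uniformly at random from the buffer."""
    return rng.sample(list(buffer), min(batch_size, len(buffer)))
```

A `deque` with `maxlen` silently drops the oldest instances once full, which is one simple way to bound memory when robots keep producing data.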
[0021] The method may also include: terminating the execution of multiple episodes and terminating iterative generation based on one or more criteria; and providing the policy neural network with the most recently generated version of the updated policy parameters for use by one or more additional robots.
[0022] In some implementations, a method is provided that includes, by one or more processors of a given robot: executing a given episode of explorations of performing a task based on a policy network having a first set of policy parameters; providing first instances of robot experience data generated based on the policy network during the given episode; and prior to the given robot executing a subsequent episode of performing the task based on the policy network, replacing one or more policy parameters of the first set with updated policy parameters, wherein the updated policy parameters are generated based on additional instances of robot experience data generated by an additional robot during an additional robot episode of explorations of performing the task by the additional robot; wherein the subsequent episode immediately follows the first episode, and wherein executing the task based on the policy network in the subsequent episode includes using the updated policy parameters in place of the replaced policy parameters.
[0023] These and other implementations disclosed herein may include one or more of the following features.
[0024] In some embodiments, the method may further include generating, by one or more additional processors during the execution of the subsequent episode, further updated policy parameters, wherein the generation of the further updated policy parameters is based on one or more instances of robot experience data generated during the first episode. The method may also include providing the further updated policy parameters for use by the additional robot when the additional robot performs a corresponding episode. In some of these embodiments, the additional robot begins executing the corresponding episode while the given robot is executing the subsequent episode, and / or the given robot does not use the further updated policy parameters in any episode it performs. In some of these embodiments, the method may further include: generating, by one or more additional processors, yet further updated policy parameters, wherein the yet further updated policy parameters are generated during the execution of the subsequent episode and after the generation of the further updated policy parameters; and providing the yet further updated policy parameters for use by the given robot when it performs a further subsequent episode of the task based on the policy network. The further subsequent episode immediately follows the subsequent episode. In some versions of these implementations: the given robot begins executing the further subsequent episode while the additional robot is executing the corresponding episode; the additional robot uses neither the updated policy parameters nor the yet further updated policy parameters in any episode it performs; and / or the additional robot does not use the updated policy parameters in any episode it performs.
[0025] The policy network may include or consist of a neural network model, and each of the updated policy parameters can define a corresponding value for the corresponding node of the corresponding layer of the neural network model.
[0026] In some implementations, the method further includes, during the execution of a given episode of the task: identifying, in a given iteration of output from the policy network, a violation of one or more criteria for the given robot; modifying the output of the given iteration so that the one or more criteria are no longer violated; and generating a given instance of the instances of experience data based on the modified output. The criteria may include one or more of the following: joint position constraints, joint velocity constraints, and end effector position constraints.
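One simple way to modify an output so that joint velocity and joint position constraints are no longer violated is to clamp it, as sketched below. The clamping rule, the assumption that the output is a per-joint velocity command, and the name `enforce_limits` are illustrative choices and not the disclosed method. The sketch also assumes each joint currently lies within its position limits.

```python
from typing import List, Sequence


def enforce_limits(velocities: Sequence[float], positions: Sequence[float],
                   dt: float, max_speed: float,
                   pos_min: Sequence[float], pos_max: Sequence[float]) -> List[float]:
    """Return velocity commands that respect speed and position limits."""
    safe = []
    for v, q, lo, hi in zip(velocities, positions, pos_min, pos_max):
        # Joint velocity constraint: cap the commanded speed.
        v = max(-max_speed, min(max_speed, v))
        # Joint position constraint: cap v so that q + v * dt stays in [lo, hi].
        v = max((lo - q) / dt, min((hi - q) / dt, v))
        safe.append(v)
    return safe


safe = enforce_limits(
    velocities=[5.0, -0.2, 1.0],
    positions=[0.0, 0.0, 0.95],
    dt=0.1, max_speed=2.0,
    pos_min=[-1.0, -1.0, -1.0], pos_max=[1.0, 1.0, 1.0],
)
```

In this example the first command is capped by the speed limit, the second passes through unchanged, and the third is reduced so the joint stops at its upper position limit.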
[0027] In some implementations, the method further includes generating a given exploration during a given episode by: applying a current state representation as input to a policy network, the current state representation indicating at least the current state of a given robot; generating an output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output. Providing control commands to the actuators based on the output may include: generating a modified output by adding noise to the output; and providing control commands based on the modified output. The output may include the velocity or torque of each of a plurality of actuators of the robot, and providing control commands may include providing control commands that cause the actuators to apply velocity or torque.
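The noise step can be sketched as follows. The paragraph says only that noise is added to the output; the zero-mean Gaussian distribution, the `sigma` parameter, and the function name are assumptions made for this example.

```python
import random
from typing import List, Sequence


def add_exploration_noise(policy_output: Sequence[float], sigma: float,
                          rng: random.Random) -> List[float]:
    """Perturb each actuator command with independent zero-mean Gaussian noise."""
    return [u + rng.gauss(0.0, sigma) for u in policy_output]


rng = random.Random(0)  # seeded so that an exploration run can be reproduced
modified_output = add_exploration_noise([0.1, -0.3], sigma=0.05, rng=rng)
```

The control commands would then be derived from `modified_output` rather than from the raw policy output, so that the robot explores around the current policy.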
[0028] The first instance of experience data may indicate a corresponding: starting robot state, subsequent robot state transitioned to from the starting robot state, action performed to transition from the starting robot state to the subsequent robot state, and reward for that action.
[0029] In some implementations, a method is provided that includes receiving a given instance of robot experience data generated by a given robot among a plurality of robots. The given instance of robot experience data is generated during a given episode of task exploration performed based on a given version of policy parameters of a policy network utilized by the given robot at the time the given instance was generated. The method also includes receiving additional instances of robot experience data from an additional robot among the plurality of robots, the additional instances being generated during episodes of task exploration performed by the additional robot based on the policy network. The method further includes generating, as the given robot and the additional robot continue performing episodes of task exploration, a new version of the policy parameters of the policy network based on training the policy network based at least in part on the given instance and the additional instances. The method also includes providing the new version of the policy parameters to the given robot for performing an immediately subsequent episode of task exploration by the given robot based on the new version of the policy parameters.
[0030] These and other implementations disclosed herein may include one or more of the following features.
[0031] Receiving the given instance may occur in one iteration of a plurality of experience data iterations of receiving instances of experience data from the given robot, wherein the experience data iterations occur at a first frequency. Training the policy network to generate the updated parameters may include performing a plurality of training iterations, including: a first training iteration of training the policy network based at least in part on the given instance and the additional instances; and one or more additional training iterations of training the policy network based on further instances of experience data from the plurality of robots. The training iterations may occur at a second frequency that is greater than the first frequency of the experience data iterations.
[0032] In some implementations, a method is provided that includes iteratively receiving instances of experience data generated by a plurality of robots operating asynchronously and concurrently. Each of the instances of experience data is generated by a corresponding robot among the plurality of robots during a corresponding episode of task exploration based on a policy neural network. The method also includes iteratively training a policy network based on the experience data received from the plurality of robots to generate one or more updated parameters of the policy network in each training iteration. The method further includes iteratively and asynchronously providing instances of updated parameters to the robots for updating the robot's policy neural network prior to the episode of task exploration on which the instances of experience data are based.
[0033] In some implementations, a method implemented by one or more processors is provided, comprising, during the execution of multiple episodes by each of a plurality of robots, each episode including the execution of a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of robot experience data generated by the plurality of robots during the episodes, each instance of robot experience data being generated during a corresponding one of the episodes and being generated based at least in part on a corresponding output generated using the policy neural network with corresponding policy parameters of the policy neural network for the corresponding episode; iteratively generating updated policy parameters of the policy neural network, wherein each iteration of the iterative generation includes generating the updated policy parameters using a group of one or more instances of robot experience data in the buffer during the iteration; and updating, by each of the robots in conjunction with the start of each episode of the plurality of episodes executed by that robot, the policy neural network to be used by the robot in the episode, wherein updating the policy neural network includes using the updated policy parameters of the most recent iteration of the iterative generation.
[0034] In some implementations, a method is provided comprising, by one or more processors of a given robot of a plurality of robots: executing a given episode of performing a task based on a policy network having a first set of policy parameters; providing first instances of robot experience data generated based on the policy network during the given episode; and prior to the execution of a subsequent episode of performing the task based on the policy network by the given robot: replacing one or more policy parameters of the first set with updated policy parameters, wherein the updated policy parameters are generated by training the policy network based on additional instances of robot experience data, the additional instances being generated by an additional robot during an additional robot episode of performing the task by the additional robot, wherein the additional robot episode is based on the policy network; wherein the subsequent episode immediately follows the first episode, and wherein performing the task based on the policy network in the subsequent episode includes using the updated policy parameters in place of the replaced policy parameters.
[0035] In some implementations, a method implemented by one or more processors is provided, comprising: receiving, in one iteration of a plurality of experience data iterations of receiving experience data from a plurality of real physical robots, a given instance of robot experience data generated by a given robot of the plurality of real physical robots, wherein the given instance of robot experience data is generated during the execution of a task based on a given version of policy parameters of a policy network utilized by the given robot at the time the given instance is generated; receiving additional instances of robot experience data from additional robots among the plurality of real physical robots, the additional instances being generated during the execution of the task by the additional robots based on the policy network, wherein the plurality of experience data iterations occur at a first frequency; generating, as the given robot and the additional robots continue to perform episodes of the task, a new version of the policy parameters of the policy network based on training of the policy network based at least in part on the given instance and the additional instances, wherein the training of the policy network includes a plurality of training iterations occurring at a second frequency greater than the first frequency, the plurality of training iterations including: a first training iteration of training the policy network based at least in part on the given instance and the additional instances; and one or more additional training iterations of training the policy network based on further instances of experience data from the plurality of real physical robots; and providing the given robot with the new version of the policy parameters for executing an immediately following episode in which the given robot performs the task based on the new version of the policy parameters.
[0036] In some implementations, a method implemented by one or more processors is provided, comprising: iteratively receiving instances of experience data generated by a plurality of real physical robots operating asynchronously and concurrently, wherein each of the instances of experience data is generated by a corresponding robot among the plurality of real physical robots during a corresponding episode of performing a task based on a policy neural network; iteratively training a policy neural network based on the experience data received from the plurality of real physical robots to generate one or more updated parameters of the policy neural network in each training iteration; and iteratively and asynchronously providing instances of updated parameters to the robot for updating the robot's policy neural network prior to a subsequent episode of the task performance on which a further instance of experience data is based.
[0037] In some embodiments, computer-readable instructions are provided that, when executed by a computing device, cause the methods described in the embodiments of this disclosure to be performed.
[0038] In some embodiments, an apparatus for deep reinforcement learning is provided, configured to perform the methods described in embodiments of this disclosure.
[0039] Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by one or more processors (e.g., one or more central processing units (CPUs), one or more graphics processing units (GPUs), and / or one or more tensor processing units (TPUs)) to perform one or more methods such as those described above and / or elsewhere herein. Yet another embodiment may include a system of one or more computers and / or one or more robots comprising one or more processors operable to execute the stored instructions to perform one or more methods such as those described above and / or elsewhere herein.
[0040] It should be understood that all combinations of the foregoing and additional concepts described in more detail herein are considered part of the subject matter disclosed herein. For example, all combinations of the claimed subject matter appearing at the end of this disclosure are considered part of the subject matter disclosed herein.
Description of the Drawings
[0041] Figure 1 shows an example environment in which the implementations disclosed herein can be carried out.
[0042] Figure 2 shows one of the robots of Figure 1 and an example of the robot's grasping end effector moving along a path.
[0043] Figure 3 is a flowchart illustrating an example of an episode performed by a robot.
[0044] Figure 4 is a flowchart illustrating an example method for storing experience data.
[0045] Figure 5 is a flowchart illustrating an example method of training to update the parameters of a policy network.
[0046] Figure 6 schematically depicts an example architecture of a robot.
[0047] Figure 7 schematically depicts an example architecture of a computer system.
Detailed Description
[0048] Robotics applications of reinforcement learning often compromise the autonomy of the learning process, in order to achieve practical training time for real physical systems. This compromise in reinforcement learning may be due to the introduction of hand-designed policy representations and / or human-provided demonstrations.
[0049] The embodiments described herein are extensions of deep reinforcement learning methods to improve the operational performance of one or more robots. As previously described, the embodiments offer operational advantages, such as efficiency gains in the physical actions performed by the robot, and also mitigate the drawbacks of existing robotic applications of reinforcement learning. These may include reducing the need for hand-designed policy representations (e.g., “model-based” policy representations) and / or reducing the need for human-provided demonstrations. In some embodiments described herein, a policy neural network that parameterizes the policy is trained via deep reinforcement learning, which mitigates the need for hand-designed policy representations. Furthermore, the policy neural network can be “model-free” because it does not explicitly learn a model of the robot’s environment. As a specific example, in some embodiments, deep reinforcement learning algorithms based on off-policy training of deep Q-functions can be extended to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to be trained on experience data generated by real physical robots. In some of these embodiments, real-world training time (e.g., wall-clock time) can be reduced by parallelizing the algorithm across multiple robots that asynchronously pool their policy updates. In some of these implementations, parallelizing the algorithm across multiple robots that asynchronously pool their policy updates can additionally and / or alternatively produce a policy neural network that is more accurate and / or more robust after a given number of training iterations than if such parallelization were not utilized. This can be because, for example, the experience data generated by a robot in a given instance is based on policy parameters that are updated based on both past instances of experience data from that robot and past instances of experience data from the other robot(s) operating in parallel. Furthermore, some implementations of the techniques disclosed herein enable the learning of three-dimensional (“3D”) robot manipulation tasks (e.g., door-opening tasks, pick-and-place tasks) on real robots without any prior demonstrations or manually designed (e.g., “model-based”) policy representations.
[0050] As described in more detail below, in various implementations, off-policy reinforcement learning methods (e.g., Q-learning based on off-policy training of deep Q-functions) are extended to learn complex manipulation policies from scratch. In some of these implementations, the complex manipulation policies are learned without user-provided demonstrations and / or are learned using neural network representations that do not require task-specific domain knowledge. Furthermore, in some of these implementations, the policy is learned using an off-policy deep Q-function algorithm such as Deep Deterministic Policy Gradient (DDPG) or Normalized Advantage Function (NAF). For example, asynchronous deep reinforcement learning can be utilized, such as asynchronous deep reinforcement learning that uses a parallel NAF algorithm across multiple real physical robots. This enables sample-efficient training on real robotic platforms, allows for greater time efficiency in training by utilizing the shared experience of multiple robots, and / or enables more robust training due to the variations between the multiple robots and / or their environments. The improved time efficiency of training allows, for example, real physical robots to use improved policies to perform physical tasks at an earlier time. This brings technical advantages such as, for example, improved power consumption of the robots at an earlier stage and / or a lower overall rate of wear on robot joints, owing to earlier improvements in the policy and the resulting physical actions of the robots.
[0051] In some past applications of pooled experience from multiple robots (collective robot learning), under the assumption that simulation time was inexpensive and training was dominated by backpropagation of neural networks, these applications sought to reduce overall training time. Conversely, in cases where experience is expensive and neural network backpropagation is relatively inexpensive, some implementations described seek to minimize training time when training is based on data from real physical robots. For example, various embodiments disclosed herein collect experience data from multiple robots operating asynchronously with each other. Furthermore, various implementations utilize the collected experience data to train a policy neural network asynchronously (but simultaneously) with the operation of the multiple robots. For example, a buffer of experience data collected from one of the robots' episodes can be used to update the policy neural network, and updated policy parameters from the updated policy neural network can be provided for implementation by one or more of the multiple robots before executing the corresponding next episode. In this way, the collection of experience data can be asynchronous among the multiple robots and asynchronous with the updating of the policy neural network. However, each of the multiple robots can utilize updated policy parameters at each episode, where these updated policy parameters are based on experience data from past episodes of the robot and past episodes of other robots. This asynchronous execution and neural network training can achieve an acceleration of overall training time due to the simultaneous collection of experience across multiple robot platforms.
[0052] In real-world robotic environments, particularly those involving contact events, the environmental dynamics are often unavailable or cannot be modeled accurately. Therefore, the implementations disclosed herein focus on model-free reinforcement learning, which includes policy search methods and value iteration methods. While policy search methods provide a more direct way to optimize the true objective, they typically require significantly more data than value iteration methods because they learn on-policy. Consequently, some implementations disclosed herein focus specifically on value iteration methods, such as those based on Q-learning with function approximation. Two examples of value iteration methods are DDPG and NAF, which extend deep Q-learning to continuous action spaces and are more sample-efficient than competing policy search methods because, for example, they learn off-policy from a replay buffer.
[0053] The goal of reinforcement learning is to control an agent that attempts to maximize a reward function, which, in the context of a robot skill (also referred to herein as a task), represents a user-provided definition of what the robot should try to accomplish. At time $t$, in state $x_t$, the agent selects and executes an action $u_t$ according to its policy $\pi(u_t \mid x_t)$, transitions to a new state $x_{t+1}$ according to the dynamics $p(x_{t+1} \mid x_t, u_t)$, and receives a reward $r(x_t, u_t)$. In the implementations described herein, the infinite-horizon discounted return problem is considered. The objective of the infinite-horizon discounted return problem is the future return discounted by $\gamma$ from time $t$ to $T$, given by $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$. The goal is to find the optimal policy $\pi^*$ that maximizes the expected sum of returns from the initial state distribution, given by $R = \mathbb{E}_{\pi}[R_1]$.
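As a small illustration of the discounted return described above, the following Python sketch sums rewards weighted by powers of the discount factor. It assumes a finite, already-recorded list of rewards (a truncation of the infinite-horizon case), and the function name `discounted_return` is hypothetical rather than taken from this disclosure.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum of gamma**(i - t) * rewards[i] for i = t .. len(rewards) - 1."""
    total = 0.0
    # Accumulate from the last reward backwards: R_i = r_i + gamma * R_{i+1}.
    for r in reversed(rewards[t:]):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]          # illustrative rewards for three time steps
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward recursion uses one multiply-add per step and avoids computing explicit powers of the discount factor.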
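[0054] Among reinforcement learning methods, off-policy methods such as Q-learning can provide data efficiency gains compared to on-policy variants. This can be beneficial for robotics applications. Q-learning learns a greedy deterministic policy $\pi(u_t \mid x_t) = \delta(u_t = \mu(x_t))$ by iterating between learning the Q-function of a policy, $Q^{\pi_n}(x_t, u_t) = \mathbb{E}[R_t \mid x_t, u_t]$ (the expectation being taken over trajectories that follow $\pi_n$), and updating the policy by greedily maximizing the Q-function, $\mu_{n+1}(x_t) = \arg\max_u Q^{\pi_n}(x_t, u_t)$. Let $\theta^Q$ parameterize the action-value function, let $\beta$ be an arbitrary exploration policy, and let $\rho^\beta$ be the state visitation induced by $\beta$. The learning objective is to minimize the Bellman error, where the target $y_t$ is held fixed:
[0055] $L(\theta^Q) = \mathbb{E}_{x_t \sim \rho^\beta,\, u_t \sim \beta}\left[\left(Q(x_t, u_t \mid \theta^Q) - y_t\right)^2\right], \quad y_t = r(x_t, u_t) + \gamma\, Q(x_{t+1}, \mu(x_{t+1}))$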
[0056] For continuous action problems, the policy update step can be intractable for Q-functions parameterized by deep neural networks. Therefore, in the various implementations described herein, extensions of Q-learning with function approximation can be utilized. Two examples of extensions of Q-learning with function approximation are DDPG and NAF. DDPG circumvents the intractability by employing an actor-critic approach, while NAF restricts the class of Q-functions to the expression below to enable a closed-form update, as in the case of discrete actions. During exploration, temporally correlated noise can optionally be added to the policy network output.
[0057] $Q(x, u \mid \theta^Q) = A(x, u \mid \theta^A) + V(x \mid \theta^V), \quad A(x, u \mid \theta^A) = -\tfrac{1}{2}\,(u - \mu(x \mid \theta^\mu))^{T} P(x \mid \theta^P)\,(u - \mu(x \mid \theta^\mu))$
[0058] This reinforcement learning formulation can be applied to robotic systems to learn various robotic skills defined by a reward function. However, the learning process is often time-consuming. Therefore, in the embodiments disclosed herein, a parallel variant of NAF or a parallel version of DDPG can be utilized to alleviate this concern. This enables the learning of a neural network parameterized Q-function from scratch in complex real-world robotic tasks. In practical deep robotic learning applications, the learning time is limited by the rate at which real robots can collect data in real time, not by the network training speed. Therefore, the various embodiments disclosed herein propose using asynchronous NAF to efficiently utilize multiple real robots for data collection and improve the speed of real-world learning. This achieves faster improvements in real-world robot performance, as well as the corresponding efficiency gains described above, when the learned policy is executed by a robot performing a physical task. Furthermore, the various embodiments disclosed herein implement aggressive exploration during Q-learning (e.g., by adding large temporally correlated noise to the policy network output to increase the degree of exploration), which can be beneficial and / or necessary when learning from scratch. In some of these implementations, techniques may also be employed to enable aggressive exploration while minimizing or preventing violations of one or more criteria (e.g., safety criteria and / or other criteria) of the robot performing the exploration.
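The text above refers to noise that is correlated over time being added to the policy network output during exploration, but it does not name a specific noise process. The Python sketch below uses a discretized Ornstein-Uhlenbeck process, which is one common choice and is an assumption here; the class name, the parameter values, and the scalar `policy_output` are likewise hypothetical.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated scalar noise: each sample depends on the previous one."""

    def __init__(self, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.theta = theta      # rate at which the noise is pulled back toward zero
        self.sigma = sigma      # scale of the random kick added at each step
        self.dt = dt            # duration of one control cycle (arbitrary units)
        self.state = 0.0
        self._rng = random.Random(seed)

    def sample(self):
        kick = self.sigma * math.sqrt(self.dt) * self._rng.gauss(0.0, 1.0)
        self.state += -self.theta * self.state * self.dt + kick
        return self.state


noise = OrnsteinUhlenbeckNoise(seed=0)
policy_output = 0.3  # hypothetical velocity command produced by the policy network
noisy_actions = [policy_output + noise.sample() for _ in range(5)]
```

Because each sample starts from the previous state, consecutive perturbations tend to push the robot in the same direction for several control cycles, which can explore more of the workspace than independent noise of the same scale.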
[0059] In the asynchronous NAF implementations disclosed herein, multiple trainer threads (updating / training the policy neural network) are separated from multiple experience collector threads (each collecting experience data from one or more robots during exploration). In some of these implementations, this decoupling between the training and collector threads ensures that differences in training speed do not halt the robot's control procedures for generating experience data, which typically require sending controls at a fixed frequency. While the multiple trainer threads maintain training from a replay buffer (populated by the multiple experience collector threads), each of the multiple experience collector threads synchronizes its policy parameters with the multiple trainer threads at the start of each episode (e.g., updates its policy neural network with updated parameters generated in the most recent iteration of the training threads), executes commands on the robot, and pushes instances of experience data into the replay buffer.
[0060] The following is an overview of an example algorithm for performing asynchronous NAF using N collector threads and one training thread. Although the example algorithm is presented as having one training thread, in some implementations, multiple training threads may be provided (e.g., distributed training across multiple threads).
[0061] // Trainer thread
[0062] Randomly initialize the normalized Q-network $Q(x, u \mid \theta^Q)$,
[0063] where $\theta^Q = \{\theta^\mu, \theta^P, \theta^V\}$
[0064] Initialize the target network $Q'$ with weights $\theta^{Q'} \leftarrow \theta^Q$
[0065] Initialize the shared replay buffer $R \leftarrow \emptyset$
[0066] for iteration = 1, I do
[0067]     Randomly sample a minibatch of m transitions from R
[0068]     Set $y_i = r_i + \gamma V'(x'_i \mid \theta^{Q'})$
[0069]     Update the weights $\theta^Q$ by minimizing the loss:
[0070]     $L = \frac{1}{m} \sum_i \left(y_i - Q(x_i, u_i \mid \theta^Q)\right)^2$
[0071]     Update the target network: $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$
[0072] end for
[0073] // Collector thread n, n = 1...N
[0074] Randomly initialize the policy network $\mu(x \mid \theta^\mu_n)$
[0075] for episode = 1, M do
[0076]     Synchronize the policy network weights $\theta^\mu_n \leftarrow \theta^\mu$
[0077]     Initialize a random process $\mathcal{N}$ for action exploration
[0078]     Receive the initial observation state $x_1 \sim p(x_1)$
[0079]     for t = 1, T do
[0080]         Select action $u_t = \mu(x_t \mid \theta^\mu_n) + \mathcal{N}_t$
[0081]         Execute $u_t$ and observe $r_t$ and $x_{t+1}$
[0082]         Send transition $(x_t, u_t, r_t, x_{t+1})$ to R (the shared replay buffer)
[0083]     end for
[0084] end for
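The separation between collector threads and a trainer thread outlined above can be sketched in Python as follows. This is a minimal structural sketch, not the NAF algorithm itself: the one-dimensional toy dynamics, the single policy parameter `gain`, and the stand-in gradient update are all assumptions made so that the example stays short and runnable, and every name in it is hypothetical.

```python
import random
import threading
import time


class SharedState:
    """Replay buffer plus the latest policy parameter, guarded by one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.buffer = []      # transitions (x, u, r, x_next)
        self.gain = 0.0       # toy policy parameter: u = -gain * x
        self.version = 0      # incremented on every trainer update


def trainer(shared, iterations, batch_size, lr=0.1):
    for _ in range(iterations):
        # Wait until the collectors have produced some experience, then sample.
        while True:
            with shared.lock:
                if shared.buffer:
                    k = min(batch_size, len(shared.buffer))
                    batch = random.sample(shared.buffer, k)
                    gain = shared.gain
                    break
            time.sleep(0.001)
        # Stand-in update (NOT the NAF loss): ascend r = -((1 - gain) * x)**2.
        grad = sum(2.0 * (1.0 - gain) * x * x for x, _, _, _ in batch) / len(batch)
        with shared.lock:
            shared.gain = gain + lr * grad
            shared.version += 1


def collector(shared, episodes, steps, seed):
    rng = random.Random(seed)
    for _ in range(episodes):
        with shared.lock:                  # synchronize parameters at episode start
            gain = shared.gain
        x = rng.uniform(-1.0, 1.0)         # initial observation
        for _ in range(steps):
            u = -gain * x + rng.gauss(0.0, 0.1)   # policy output plus exploration noise
            x_next = x + u                          # toy dynamics
            r = -x_next * x_next
            with shared.lock:
                shared.buffer.append((x, u, r, x_next))
            x = x_next


def run(num_collectors=3, episodes=4, steps=5, iterations=50):
    shared = SharedState()
    threads = [threading.Thread(target=collector, args=(shared, episodes, steps, n))
               for n in range(num_collectors)]
    threads.append(threading.Thread(target=trainer, args=(shared, iterations, 8)))
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared


result = run()
print(len(result.buffer), result.version)  # 60 transitions, 50 trainer updates
```

The collectors never wait for the trainer: each one only takes the lock briefly to copy the current parameter or append a transition, so a slow training iteration would not stall the fixed-rate control loop of a real robot.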
[0085] As described herein, in various implementations, neural networks can parameterize the action-value functions and policies. In some of these implementations, various state representations can be used as inputs to the model when generating outputs indicating actions to be performed based on the policy. These state representations can indicate the state of the robot, and optionally the state of one or more environmental objects. As an example, a robot state representation can include joint angles and end effector positions, along with their time derivatives. In some implementations, a success signal (e.g., a target position) can be appended to the robot state representation. As described herein, the success signal can be used to determine a reward for an action and / or for other purposes. The specific success signal will depend on the task for which reinforcement learning is being performed. For example, for a reaching task, the success signal could be the destination / target position of the end effector. As another example, for a door-opening task, the success signal could include the handle position when the door is closed and quaternion measurements from a sensor attached to the door frame (e.g., an inertial measurement unit attached to the door frame). In various implementations, standard feedforward networks can be used as policy neural networks to parameterize the action-value functions and policies. As an example, networks with two hidden layers of 100 units each can be used to parameterize each of $\mu(x)$, $L(x)$ (the Cholesky decomposition of $P(x)$), and $V(x)$ in NAF, and each of $\mu(x)$ and $Q(x; u)$ in DDPG. For $Q(x; u)$ in DDPG, the action vector $u$ can be added as another input to the second hidden layer, followed by a linear projection. ReLU can be used as the hidden activation, and hyperbolic tangent (Tanh) can be used for the final layer activation in the policy-only network $\mu(x)$ to constrain the action range.
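To show how the three NAF network outputs named above could be combined, the following Python sketch computes a Q-value from a mean action, a lower-triangular Cholesky factor, and a state value. It assumes the usual NAF quadratic advantage form, in which P = L L^T and the advantage is -1/2 (u - mu)^T P (u - mu); the function name and the plain-list inputs stand in for network outputs and are hypothetical.

```python
def naf_q_value(u, mu, L, v):
    """Q = V - 0.5 * (u - mu)^T P (u - mu), with P = L L^T and L lower triangular."""
    n = len(u)
    d = [u[i] - mu[i] for i in range(n)]
    # z = L^T d, so that d^T P d = d^T L L^T d = ||z||^2, which is never negative.
    z = [sum(L[i][j] * d[i] for i in range(n)) for j in range(n)]
    advantage = -0.5 * sum(zj * zj for zj in z)
    return v + advantage


mu = [0.5, -0.25]                 # hypothetical mean action from the mu head
L = [[1.0, 0.0], [0.5, 2.0]]      # hypothetical Cholesky factor from the L head
v = 3.0                           # hypothetical state value from the V head

print(naf_q_value(mu, mu, L, v))            # at u = mu the advantage is zero: 3.0
print(naf_q_value([1.5, -0.25], mu, L, v))  # 3.0 - 0.5 * 1.0 = 2.5
```

Because the quadratic term can never be positive, the Q-value is maximized exactly at u = mu, which is what makes the greedy policy update available in closed form.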
[0086] As described herein, in some implementations, techniques can be employed to enable active exploration while minimizing or preventing violations of one or more criteria (e.g., safety criteria and / or other criteria) that govern the robot's exploration. In some of these implementations, criteria may include velocity limits for each joint of the robot, positional limits for each joint of the robot, and / or other kinematic and / or dynamic constraints on the robot. For example, if the robot's commanded output during exploration (e.g., via a policy network in use) indicates a joint velocity that will exceed its velocity limit, that velocity may be modified (e.g., reduced to the velocity limit) before the output is implemented, or alternatively, an error may be thrown and a new episode exploration may begin.
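The velocity-limit criterion described above, in which a commanded joint velocity that exceeds its limit is reduced to the limit before being applied, can be sketched in Python as follows. The function name and the symmetric per-joint limits are assumptions made for illustration.

```python
def clamp_joint_velocities(commanded, limits):
    """Clip each commanded joint velocity into the range [-limit, +limit]."""
    if len(commanded) != len(limits):
        raise ValueError("one velocity limit per joint is required")
    return [max(-lim, min(lim, v)) for v, lim in zip(commanded, limits)]


# Hypothetical three-joint command in rad/s with a 1.0 rad/s limit on every joint.
print(clamp_joint_velocities([0.5, -2.0, 1.5], [1.0, 1.0, 1.0]))  # [0.5, -1.0, 1.0]
```

The alternative mentioned in the text, raising an error and starting a new episode, would replace the clipping expression with a check that aborts the current episode.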
[0087] In some implementations, a bounding sphere (or other bounding shape) around the end effector location can be used as a criterion. If the robot's command output during exploration would send the robot's end effector outside the bounding sphere, the robot's forward kinematics can be used to project the commanded velocity, plus some corrective velocity, onto the sphere's surface. Additional and / or alternative criteria can be used. For example, when performing exploration to learn certain tasks such as opening / closing a door, criteria can be provided to prevent the robot from pushing too hard against certain objects (e.g., doors). For example, an additional boundary plane can be added a few centimeters in front of the closed position of the door, and / or torque limits can be added to one or more of the robot's joints (e.g., so that the robot will not apply excessive torque if the commanded velocity cannot be achieved due to contact with the door or handle).
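A simplified version of the bounding-sphere criterion can be sketched in Python as shown below. The sketch works directly in Cartesian end effector coordinates and therefore skips the forward kinematics step that maps joint commands to end effector motion; that simplification, the one-step position prediction, and all names are assumptions made for illustration.

```python
import math


def constrain_to_sphere(position, velocity, dt, center, radius):
    """Return a velocity whose one-step prediction stays inside the sphere."""
    predicted = [p + v * dt for p, v in zip(position, velocity)]
    offset = [q - c for q, c in zip(predicted, center)]
    dist = math.sqrt(sum(o * o for o in offset))
    if dist <= radius:
        return list(velocity)            # the command already satisfies the criterion
    # Project the predicted point onto the sphere surface along the radial direction.
    scale = radius / dist
    target = [c + o * scale for c, o in zip(center, offset)]
    # The corrected velocity is the one that reaches the projected point in one step.
    return [(t - p) / dt for t, p in zip(target, position)]


corrected = constrain_to_sphere(
    position=[0.9, 0.0, 0.0], velocity=[1.0, 0.0, 0.0],
    dt=0.2, center=[0.0, 0.0, 0.0], radius=1.0)
print(corrected)  # roughly [0.5, 0.0, 0.0]: the end effector stops at the surface
```

On a real arm, the corrected Cartesian velocity would still need to be converted back into joint commands, which this sketch does not attempt.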
[0088] Additional description of the above, and of additional embodiments, is now provided with reference to the accompanying drawings. Figure 1 shows an example environment in which the implementations disclosed herein can be carried out. Example robots 180A and 180B are included in Figure 1. Robots 180A and 180B are "robot arms" with multiple degrees of freedom, enabling grasping end effectors 182A and 182B to traverse any of multiple potential paths through movement of the robots, in order to position the grasping end effectors 182A and 182B at desired locations. For example, Figure 2 shows an example of robot 180A moving its end effector along path 201. Figure 2 includes phantom and non-phantom images of robot 180A, showing two different poses from the set of poses struck by robot 180A and its end effector as it moves along path 201. Referring again to Figure 1, robots 180A and 180B each further control two opposing "claws" of their corresponding grasping end effectors 182A and 182B to actuate the claws between at least an open position and a closed position (and / or optionally multiple "partially closed" positions).
[0089] Example vision sensors 184A and 184B are also shown in Figure 1. In Figure 1, vision sensor 184A is mounted at a fixed pose relative to the base or other stationary reference point of robot 180A. Vision sensor 184B is also mounted at a fixed pose relative to the base or other stationary reference point of robot 180B. As shown in Figure 1, the pose of vision sensor 184A relative to robot 180A differs from the pose of vision sensor 184B relative to robot 180B. As described herein, in some embodiments, this can be advantageous in providing diversity in the experience data generated by each of robots 180A and / or 180B (if the experience data is at least partially influenced by sensor data from vision sensors 184A and 184B). Vision sensors 184A and 184B are sensors capable of generating images or other visual data relating to the shape, color, depth, and / or other features of object(s) within the sensor's line of sight. Vision sensors 184A and 184B can be, for example, a monographic camera, a stereo camera, and / or a 3D laser scanner. A 3D laser scanner includes one or more lasers that emit light and one or more sensors that collect data relating to the reflection of the emitted light. A 3D laser scanner can be, for example, a time-of-flight 3D laser scanner or a triangulation-based 3D laser scanner, and may include a position-sensitive detector (PSD) or other optical position sensor.
[0090] Vision sensor 184A has a field of view of at least a portion of the workspace of robot 180A, such as the portion of the workspace that includes example spatula 191A. Although a resting surface for spatula 191A is not shown in Figure 1, the spatula can rest on a table, a box, and / or other surface(s). In other embodiments, as described herein, more objects, fewer objects, additional objects, and / or alternative objects may be provided during one or more episodes performed by robot 180A. Each episode may be an exploration in performing a task that involves a corresponding one of spatula 191A and stapler 191B (e.g., touching the object, or "picking and placing" the object). Additional and / or alternative objects may be provided. For example, for a "door opening" task, a door may be provided in the environment of each of robots 180A and 180B. Vision sensor 184B has a field of view of at least a portion of the workspace of robot 180B, such as the portion of the workspace that includes example stapler 191B. Although a resting surface for stapler 191B is not shown in Figure 1, the stapler can rest on a table, a box, and / or other surface(s). In other embodiments, as described herein, more objects, fewer objects, additional objects, and / or alternative objects may be provided during one or more episodes performed by robot 180B.
[0091] Although specific robots 180A and 180B are shown in Figure 1, additional and / or alternative robots can be used, including additional robot arms similar to robots 180A and 180B, robots with other robot arm forms, humanoid robots, animal-shaped robots, robots that move via one or more wheels (e.g., self-balancing robots), submersible robots, unmanned aerial vehicles (“UAVs”), and so on. Furthermore, although specific grasping end effectors are shown in Figure 1, additional and / or alternative end effectors can also be used. For example, a non-grasping end effector can be used. Furthermore, although specific mountings of vision sensors 184A and 184B are shown in Figure 1, additional and / or alternative mountings can be used, or the vision sensors can be omitted. For example, in some embodiments, the vision sensors can be mounted directly to the robot, such as on a non-actuable component of the robot or on an actuable component of the robot (e.g., on an end effector or a component near an end effector). Also, for example, in some embodiments, the vision sensors can be mounted on a non-stationary structure separate from their associated robot, and / or can be mounted in a non-stationary manner on a structure separate from their associated robot.
[0092] Robots 180A, 180B, and / or other robots can be used to execute multiple episodes, each of which is an exploration of performing a task based on a model-free reinforcement learning network. For example, robots 180A and 180B can each include a policy network, such as a deep neural network representing a deterministic policy function. At the start of an episode, the robot's current state (e.g., a pseudo-randomly chosen starting state) can be applied as input to the policy network, along with a success signal (e.g., a target position of the end effector for a reaching task), and an output is generated over the policy network based on the input. The policy network output indicates the action to be performed in the robot's next control cycle. For example, the policy network output could be a velocity command, in joint space, for each actuator of the robot. As another example, the policy network output could be a motor torque for each actuator of the robot. The action is then performed by the robot. The robot's state after the action is performed can then be applied as input to the policy network, along with the success signal, and an additional output is generated over the network based on the input. This can continue iteratively (e.g., in each control cycle of the robot) until the success signal is achieved (e.g., as determined based on a reward satisfying a criterion) and / or other criteria are met. Other criteria could be, for example, that the duration of the episode has met a threshold (e.g., X seconds), or that a threshold number of control cycles has occurred. A new episode can begin after the success signal and / or other criteria are met.
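The control-cycle loop of a single episode, as described above, can be sketched in Python as follows. The sketch assumes a one-dimensional state, a sparse reward of 1.0 on reaching the target, and caller-supplied `policy` and `step` callables; all of these, and the function name itself, are hypothetical stand-ins for the policy network and the physical robot.

```python
def run_episode(policy, step, initial_state, goal, max_cycles, tolerance):
    """Run control cycles until the goal is reached or the cycle budget is spent."""
    state = initial_state
    transitions = []
    for _ in range(max_cycles):
        action = policy(state, goal)          # e.g. a velocity command
        next_state = step(state, action)      # the robot executes the action
        reached = abs(next_state - goal) <= tolerance
        reward = 1.0 if reached else 0.0
        transitions.append((state, action, next_state, reward))
        state = next_state
        if reached:                           # success signal achieved
            break
    return transitions


# Toy stand-ins: move toward the goal by at most 0.25 per control cycle.
toy_policy = lambda s, g: max(-0.25, min(0.25, g - s))
toy_step = lambda s, a: s + a

episode = run_episode(toy_policy, toy_step, 0.0, 1.0, max_cycles=10, tolerance=1e-6)
print(len(episode), episode[-1][3])  # 4 control cycles, final reward 1.0
```

The `max_cycles` argument plays the role of the threshold number of control cycles mentioned in the text, so an episode also ends when the task is never completed.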
[0093] As shown in Figure 1, and as described in more detail herein, when robots 180A and 180B (and optionally additional robot(s)) execute episodes, the experience collector engine 112 receives instances of experience data generated by robots 180A and 180B (and optionally additional robot(s)). For example, robot 180A may provide a new instance of experience data to the experience collector engine 112 during each control cycle of its execution of an episode. As another example, robot 180A may provide all instances of experience data generated during an episode to the experience collector engine 112 at the end of the episode. As yet another example, robot 180A may provide a new instance of experience data every 0.2 seconds or at other regular or irregular interval(s). Each instance of experience data is generated by the corresponding robot in the corresponding iteration based on the inputs applied to the robot's policy network and / or the outputs generated by the robot's policy network. For example, each instance of experience data can indicate the robot's current state, the action to be performed based on the output of the policy network, the robot's state after performing the action, and / or the reward for that action (as indicated by the output generated over the policy network and / or a separate reward function).
[0094] The experience collector engine 112 stores instances of received experience data in the replay buffer 122. The replay buffer 122 may include memory and / or a database accessible to the training engine 114. Although a single experience collector engine 112 is shown, it should be understood that multiple experience collector engines 112 may be provided. For example, each robot may include its own experience collector engine or be associated with its own experience collector engine, and these engines may all store instances of experience data in the replay buffer 122.
[0095] Training engine 114 iteratively trains one or more parameters of policy network 124 using techniques such as those described herein (e.g., techniques related to Q-learning such as NAF and / or DDPG variants). In each iteration of training, training engine 114 may use one or more instances of experience data from replay buffer 122 to generate updated policy parameters. Training engine 114 may optionally purge the utilized instance(s) from replay buffer 122, and / or the instances may otherwise be purged (e.g., based on a first-in-first-out scheme).
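A replay buffer with first-in-first-out purging, from which a trainer draws random minibatches, can be sketched in Python as follows. The class name, the fixed capacity, and the uniform sampling are assumptions; the text above does not specify a capacity or a sampling scheme.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of experience instances; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)   # deque discards from the left when full
        self._rng = random.Random(seed)

    def add(self, transition):
        self._items.append(transition)

    def sample(self, batch_size):
        # Draw without replacement; return fewer items if the buffer is small.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i)                    # 0 and 1 are purged once the capacity is exceeded
print(sorted(buf.sample(10)))     # [2, 3, 4]
```

In a multi-threaded setting, calls to `add` and `sample` would additionally need to be guarded by a lock, which is omitted here for brevity.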
[0096] Before either robot 180A or 180B executes a new episode, the robot can update its policy network using the policy parameters most recently generated by training engine 114. In some embodiments, the policy parameters can be "pushed" to robots 180A and 180B by training engine 114. In other embodiments, the policy parameters can be "pulled" by robots 180A and 180B. Accordingly, in the implementation of Figure 1, robots 180A, 180B, and optional additional robots can operate in parallel, each executing multiple episodes based on the same model-free reinforcement learning policy network. However, at one or more times, the policy parameters utilized by one or more of the robots may differ from those utilized by one or more of the other robots. For example, at time T1 and before the start of a given episode, robot 180A may synchronize its policy parameters with the most recently updated policy parameters. However, at time T1, robot 180B may be in the middle of an episode and may still be operating with less up-to-date policy parameters from a previous iteration of training engine 114 (the immediately preceding iteration, or an iteration multiple iterations earlier). At time T2, robot 180B may synchronize its policy parameters with yet more recently updated policy parameters, while robot 180A is still in the given episode at time T2 and is still operating with the (now less up-to-date) policy parameters from time T1.
[0097] As described herein, when instances of experience data are generated during parallel operation of the robots, the experience collector engine 112 may add instances of experience data for each of robots 180A, 180B, and / or other robots to the replay buffer 122. In some embodiments, the experience data for each robot may be added to the replay buffer 122 at a corresponding (and optionally identical) frequency of the robot (e.g., the robot's control cycle frequency). For example, a robot may have a control frequency of 60 Hz and provide experience data at 60 Hz (i.e., 60 instances of experience data per second). In some of these embodiments, the training engine 114 may perform training iterations at a frequency greater than one or more (e.g., all) of the robot's frequencies, and these iterations may be performed as the robots continue to operate in parallel and generate experience data based on the episodes. One or more of these techniques may enable the convergence of the policy network to occur faster than if these techniques were not employed.
[0098] The experience collector engine 112, replay buffer 122, training engine 114, and policy network 124 are shown in Figure 1 as being separate from robots 180A and 180B. However, in some embodiments, all or aspects of one or more of these components may be implemented on robots 180A and / or 180B (e.g., via one or more processors of robots 180A and 180B). For example, robots 180A and 180B may each include an instance of experience collector engine 112. In some embodiments, all or aspects of one or more of these components may be implemented on one or more computer systems that are separate from robots 180A and 180B but communicate with them via a network. In some of these embodiments, experience data may be transferred from the robots to the components over one or more networks, and updated policy parameters may be transferred from the components to the robots over one or more networks.
[0099] Figure 3 is a flowchart illustrating an example method 300 of performing an attempt to move an object and storing data associated with that attempt. For convenience, the operations in the flowchart are described with reference to a system performing the operations. This system may include one or more components of a robot, such as a processor and / or robot control system of robots 180A, 180B, 640, and / or other robots. Furthermore, although the operations of method 300 are shown in a specific order, this is not intended to be limiting. One or more operations may be reordered, omitted, or added.
[0100] At box 352, an episode of performing a task begins.
[0101] At box 354, if any updated parameters are available, the system synchronizes the policy parameters of the policy network used by the system based on the updated parameters. For example, the system can replace one or more policy parameters of the policy network with one or more recently updated policy parameters generated based on method 500 of Figure 5. In some implementations, replacing a policy parameter with another policy parameter includes replacing the value of a node in the neural network model with another value.
[0102] At box 356, the system initializes a random process for task exploration. As used herein, randomness includes pseudo-randomness and true randomness. As an example, the system can move the robot's end effector to a random starting position. As another example, the system can make each joint of the robot individually exhibit a specific motion state (e.g., specific position, velocity, and / or acceleration).
[0103] At box 358, the system identifies the current state. The current state may include the current robot state and/or the current state of one or more environmental objects. The current state of the environmental objects may be determined based on sensors attached to those objects and/or on sensor data from the robot. For example, the current state of an environmental object may be based on sensor data from one or more sensors, such as an inertial measurement unit (IMU) attached to a door when the task is to open the door. As another example, the current state of an environmental object may be based on vision sensor data captured by the robot's vision sensors (e.g., the current position of the object may be determined based on that vision sensor data). In the first iteration of box 358, the current robot state will be the initial robot state following the initialization at box 356. For example, the initial robot state may include the current state of one or more components of the robot, such as the position, velocity, and/or acceleration of each joint and/or of the end effector.
[0104] At box 360, the system selects an action to be performed based on the current state and the policy network. For example, the system can apply the current state as input to the reinforcement learning policy model and generate, over the model and based on the input, an output that indicates the action to be performed. The system can select the action based on the output. In some implementations, the output includes torque values and/or other values to be applied to the robot's actuators, and selecting the action can include selecting those values as the action. In implementations where additional and/or alternative observations are identified at box 358, they can also be applied as input to the policy network.
[0105] At box 362, the system performs the action and observes the reward and the subsequent state resulting from the action. For example, the system may generate one or more motion commands to cause one or more actuators to move and perform the action. The system may observe the reward based on a reward function and optionally based on a success signal provided to the system. For example, the reward function for a task in which the end effector reaches a target pose may be based on the difference between the pose of the end effector resulting from the action and the target pose (where the target pose is provided as the success signal). The subsequent state may include the subsequent robot state and/or the subsequent state of one or more environmental objects. For example, the subsequent robot state may include the state of one or more components of the robot as a result of the action, such as the position, velocity, and/or acceleration of each joint and/or of the end effector. In some implementations, at box 362, the system additionally or alternatively identifies other observations resulting from the action, such as vision sensor data captured by the robot's vision sensors after the action is performed and/or other sensor data from other sensor(s) after the action is performed.
[0106] At box 364, the system sends an instance of experience data to the replay buffer. For example, the system itself can store the instance in the replay buffer, or it can provide the instance to a separate component that stores it in the replay buffer. As described with respect to method 500 of Figure 5, the instance and other instances from other robots can be used during training to update the policy parameters. In some implementations, the experience data may include data indicating the current state of box 358, the action of box 360, and/or the observed reward and/or subsequent state of box 362.
[0107] At box 366, the system determines whether success or other criteria have been met. For example, if the reward observed at box 362 satisfies a threshold, the system can determine success. As another example, another criterion can be that a threshold time has elapsed and/or that a threshold number of iterations of boxes 358, 360, 362, and 364 have been performed.
[0108] If the system determines that success or other criteria have been met, the system proceeds to box 352 and begins a new episode. Note that in the new episode, at box 354, the system can synchronize the policy parameters with one or more parameters that have been updated relative to those of the immediately preceding episode (as a result of those parameters being updated concurrently according to method 500 of Figure 5 and/or other methods). For example, the updated parameters can be updated according to method 500 of Figure 5 using experience data from one or more other robots that asynchronously generate experience data according to method 300 of Figure 3. In these and other ways, each episode of Figure 3 can utilize policy parameters updated based on experience data from other robot(s). This allows each episode to generate experience data that enables more efficient training. The system can execute multiple episodes using method 300 until training (according to method 500 of Figure 5 and/or other methods) is complete and/or until some other signal is received (e.g., an error occurs).
[0109] If the system determines that success or other criteria are not met, the system proceeds to box 358 and performs additional iterations of boxes 358, 360, 362, and 364.
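The per-episode loop of boxes 352 to 366 can be sketched as follows. This is only an illustrative sketch, not the implementation described in this application: `ToyReachEnv`, `LinearPolicy`, `run_episode`, the single scalar "policy parameter", the noise scale, and the success threshold are all hypothetical stand-ins chosen so that the example runs without robot hardware.

```python
import random
from dataclasses import dataclass


@dataclass
class LinearPolicy:
    """Hypothetical stand-in for the policy network: action = gain * (target - state)."""
    gain: float = 0.5

    def act(self, state: float, target: float) -> float:
        return self.gain * (target - state)


@dataclass
class ToyReachEnv:
    """Hypothetical 1-D 'robot': the state is a position, the action is a displacement."""
    target: float = 1.0
    state: float = 0.0

    def reset(self, rng: random.Random) -> float:
        # Box 356: start each episode from a random initial position.
        self.state = rng.uniform(-0.1, 0.1)
        return self.state

    def step(self, action: float):
        self.state += action
        reward = -abs(self.target - self.state)  # negative distance to the target
        return self.state, reward


def run_episode(env, policy, latest_params, replay_buffer, rng,
                max_steps=50, success_reward=-0.02, noise_std=0.005):
    """Run one episode; return True if the success criterion was met."""
    if latest_params is not None:
        policy.gain = latest_params            # box 354: synchronize parameters
    state = env.reset(rng)                     # box 356: initialize exploration
    for _ in range(max_steps):                 # box 366: iteration threshold
        # Box 360: select an action from the policy, plus exploration noise.
        action = policy.act(state, env.target) + rng.gauss(0.0, noise_std)
        next_state, reward = env.step(action)  # box 362: act, observe the result
        # Box 364: send the experience instance to the replay buffer.
        replay_buffer.append((state, action, reward, next_state))
        state = next_state                     # box 358: new current state
        if reward >= success_reward:           # box 366: success criterion
            return True
    return False
```

In this sketch the replay buffer is a plain list and the updated parameters are passed in by the caller; in a multi-robot setting both would be shared with a separate training process.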
[0110] In Figure 3, boxes 370, 372, and 374 are also shown. They represent optional boxes that can be implemented to determine whether the action selected at box 360 violates one or more robot criteria and, if so, to take a corrective action before the action is executed at box 362. At box 370, the system determines whether the action of box 360 violates one or more robot criteria. If not, the system proceeds to box 362 and executes the action.
[0111] If the system determines that the action of box 360 violates one or more robot criteria, the system proceeds to box 372 or box 374. At box 372, the system can modify the action so that it no longer violates the criteria and provide the modified action, in place of the original action, for execution at box 362. For example, if a velocity constraint is violated, the action can be modified so that it no longer violates the velocity constraint. The modified action, rather than the unmodified action, can be provided in the instance of experience data.
[0112] At box 374, the system may optionally terminate the episode in response to certain violations of the robot criteria. If the system terminates the episode at box 374, the system may return to box 352 and/or wait for intervention (e.g., human intervention).
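The optional checks of boxes 370, 372, and 374 can be illustrated with a short sketch. The function name `check_action`, the use of a single symmetric velocity limit, clipping as the modification, and the separate `abort_limit` that triggers termination are all assumptions made for illustration; the text above does not prescribe how an action is modified or which violations end an episode.

```python
from typing import List, Optional, Sequence, Tuple


def check_action(
    joint_velocities: Sequence[float],
    velocity_limit: float,
    abort_limit: Optional[float] = None,
) -> Tuple[Optional[List[float]], bool]:
    """Return (action_to_execute, was_modified).

    The returned action is None when the episode should be terminated.
    """
    worst = max(abs(v) for v in joint_velocities)
    if worst <= velocity_limit:
        # Box 370: no criterion is violated, execute the action unchanged.
        return list(joint_velocities), False
    if abort_limit is not None and worst > abort_limit:
        # Box 374: the violation is severe enough to terminate the episode.
        return None, False
    # Box 372: clip each joint velocity into [-velocity_limit, +velocity_limit].
    clipped = [max(-velocity_limit, min(velocity_limit, v)) for v in joint_velocities]
    return clipped, True
```

When `was_modified` is true, the clipped action is the one that would be stored in the experience instance, since it is the action actually executed.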
[0113] As described herein, in many implementations method 300 of Figure 3 can be implemented on each of multiple robots that operate in parallel during one or more (e.g., all) of their respective iterations of method 300. For example, one or more processors of a first robot can execute an instance of method 300, one or more processors of a second robot can execute an instance of method 300, and so on. This allows more instances of experience data to be generated within a given time period than if only one robot were operating method 300. In implementations where training of the policy neural network occurs at a higher frequency than the generation of experience data by a given robot, this can result in time-efficient training of the policy neural network. Furthermore, this allows each episode of method 300 for a given robot to utilize policy parameters updated based on experience data from other robot(s). This can result in the training of the policy neural network converging in fewer iterations than if asynchronous training based on experience data from multiple robots were not utilized. Furthermore, in embodiments where the end effectors, sensors, actuators, and/or other hardware components of the multiple robots vary and/or wear differently, and/or where different robots interact with different objects (e.g., objects of different sizes, weights, shapes, translucencies, and/or materials) and/or in different environments (e.g., different surfaces, different lighting, and/or different environmental obstacles), utilizing experience data generated by multiple robots during training can provide robustness of the trained policy network to various robot and/or environment configurations.
[0114] Figure 4 is a flowchart illustrating an example method 400 for storing experience data. For convenience, the operations in the flowchart are described with reference to a system that performs the operations. This system may include one or more components (such as a processor) of one or more computer systems and/or one or more components (such as a robot's processor and/or robot control system) of one or more robots. Furthermore, although the operations of method 400 are shown in a specific order, this is not intended to be limiting. One or more operations may be reordered, omitted, or added.
[0115] At box 452, experience data collection begins.
[0116] At box 454, the system receives an instance of experience data of a robot. The robot is one of multiple robots that provide experience data to the system and/or other systems in parallel (but optionally at different frequencies). For example, the robots may each be implementing method 300 of Figure 3, and box 454 can be executed in response to each execution of box 364 of method 300.
[0117] At box 456, the system stores the instance of experience data in the replay buffer. At box 458, the system determines whether training is complete. The system can determine that training is complete in response to a signal from an optional separate training system that updates policy parameters based on the experience data stored in the replay buffer.
[0118] If the system determines at box 458 that training is not complete, the system returns to box 454 and receives additional instances of experience data (from the same robot or a different robot). It should be understood that multithreading of one or more boxes of method 400 can be implemented to enable the simultaneous receipt of experience data from multiple robots.
[0119] If the system determines at box 458 that training is complete, experience data collection ends at box 460.
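A replay buffer that accepts experience instances from several robots at once, as method 400 describes, might look like the following sketch. The class name `ReplayBuffer`, the bounded capacity, the lock-based synchronization, and the dictionary layout of an instance are assumptions for illustration only; the robots are simulated here by threads.

```python
import random
import threading
from collections import deque


class ReplayBuffer:
    """Thread-safe buffer; the oldest instances are dropped once capacity is reached."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, instance):
        # Boxes 454 and 456: receive one instance and store it.
        with self._lock:
            self._data.append(instance)

    def sample(self, batch_size, rng=random):
        # Copy under the lock so sampling does not race with concurrent writers.
        with self._lock:
            pool = list(self._data)
        return rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        with self._lock:
            return len(self._data)


def collect(buffer, robot_id, num_instances):
    """Simulate one robot sending experience instances (hypothetical payload)."""
    for step in range(num_instances):
        buffer.add({"robot": robot_id, "step": step})


def demo(num_robots=3, per_robot=100):
    buffer = ReplayBuffer(capacity=1000)
    threads = [
        threading.Thread(target=collect, args=(buffer, robot_id, per_robot))
        for robot_id in range(num_robots)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return buffer
```

A training process would call `sample` on the same buffer while the collector threads keep calling `add`, which is what lets data collection and training proceed at different rates.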
[0120] Although methods 300 and 400 are shown in separate figures here for clarity, it should be understood that one or more blocks of method 400 may be executed by one or more of the same components that execute one or more blocks of method 300. For example, one or more (e.g., all) blocks of methods 300 and 400 may be executed by one or more processors of a robot. Furthermore, it should be understood that one or more blocks of method 400 may be executed in conjunction with one or more blocks of method 300, or before or after one or more blocks of method 300.
[0121] Figure 5 is a flowchart illustrating an example method 500 of training to update the parameters of a policy network. For convenience, the operations in the flowchart are described with reference to a system that performs the operations. This system may include one or more components of a computer system, such as the training engine 114 and/or a processor (e.g., GPU and/or CPU) of another computer system. Furthermore, although the operations of method 500 are shown in a specific order, this is not intended to be limiting. One or more operations may be reordered, omitted, or added.
[0122] Training begins at box 552.
[0123] At box 554, the system initializes a normalized Q-network, such as a normalized Q-network that parameterizes a randomly initialized Q-function estimate. For example, the system can initialize a normalized Q-network ...
[0124] At box 556, the system initializes a target policy network whose output actions serve as inputs to the normalized Q-network initialized at box 554. For example, the system can utilize ... to initialize the target policy network.
[0125] At box 558, the system samples a batch of experience data from the replay buffer. For example, the system can sample a batch of one or more instances of experience data stored in the replay buffer based on method 300 of Figure 3 and/or method 400 of Figure 4.
[0126] In some implementations, for the sampled experience data, the system sets:
[0127]
[0128] At box 560, the system updates the normalized Q-network based on the experience data sampled at box 558. For example, the system can perform backpropagation and/or other techniques on the Q-network based on a loss function. For instance, the system can minimize the loss ... to update the weights of the Q-network.
[0129] At box 562, the system updates the target policy network based on the updates to the normalized Q-network. For example, the system can update the target policy network based on the gradient of the loss function with respect to the network parameters. For example, the system can use ... to update the target policy network.
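The expressions for boxes 554 to 562 are not reproduced in this text. For orientation only, the following shows the form these steps take in the published normalized advantage function (NAF) algorithm. It is an assumption that this application uses the same expressions, and all symbols here (the network parameters $\theta^{Q}$ and target parameters $\theta^{Q'}$, the discount $\gamma$, the batch size $N$, the subsequent state $x'_i$, and the rate $\tau$) follow the published algorithm rather than anything recoverable from this passage.

```latex
\begin{align*}
Q(x, u \mid \theta^{Q}) &= A(x, u \mid \theta^{A}) + V(x \mid \theta^{V}), \\
A(x, u \mid \theta^{A}) &= -\tfrac{1}{2}\,\bigl(u - \mu(x \mid \theta^{\mu})\bigr)^{T}
    P(x \mid \theta^{P})\,\bigl(u - \mu(x \mid \theta^{\mu})\bigr), \\
y_i &= r_i + \gamma\, V'(x'_i \mid \theta^{Q'}), \\
L &= \frac{1}{N} \sum_{i} \bigl(y_i - Q(x_i, u_i \mid \theta^{Q})\bigr)^{2}, \\
\theta^{Q'} &\leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}.
\end{align*}
```

In that algorithm the first two lines define the normalized Q-network, the third is the target set for each sampled instance, the fourth is the loss minimized to update the Q-network weights, and the last is the update of the target network.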
[0130] At box 564, the system provides updates for the robot to use in subsequent episodes. For example, the system can provide updated policy parameters and / or other parameters for the robot to use in subsequent episodes.
[0131] At box 566, the system determines whether training is complete. In some implementations, determining that training is complete may be based on determining that convergence has been achieved, that a threshold number of iterations of boxes 558-564 have occurred, that all available experience data has been processed, that a threshold amount of time has elapsed, and/or that other criteria have been met.
[0132] If the system determines that training is complete, training ends at box 568. If the system determines that training is not complete, the system returns to box 558. As described herein, method 500 can be executed concurrently with methods 300 and 400 described herein. In some of these embodiments, the frequency at which iterations of method 500 are executed can be greater than the frequency at which iterations of methods 300 and/or 400 are executed. As a non-limiting example, method 500 can be executed at a rate of 100 Hz, and method 300 can be executed at a rate of 20 Hz. Note that in some embodiments, methods 300, 400, and/or 500 can be executed “continuously”, in that experience data is continuously generated by one or more real-world robots and used to continuously update the target policy network.
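The arithmetic of one pass through boxes 558 to 562 can be sketched numerically. The sketch assumes a one-step temporal-difference target, a mean squared error loss, and a soft (weighted-average) target update, as commonly used with NAF and DDPG; the exact expressions are not reproduced in this text, and the function names and default values of `gamma` and `tau` are hypothetical.

```python
from typing import List, Sequence


def td_targets(rewards: Sequence[float], next_values: Sequence[float],
               gamma: float = 0.99) -> List[float]:
    """Target for each sampled instance: reward plus discounted next-state value."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


def mean_squared_loss(targets: Sequence[float], q_values: Sequence[float]) -> float:
    """Loss to minimize when updating the Q-network (box 560)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params: Sequence[float], online_params: Sequence[float],
                tau: float = 0.001) -> List[float]:
    """Move the target network a fraction tau toward the online network (box 562)."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]
```

For example, with `gamma=0.5`, rewards `[1.0, 0.0]` and next-state values `[2.0, 4.0]` give targets `[2.0, 2.0]`. A small `tau` keeps the targets stable while the online network changes on every iteration.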
[0133] To provide additional details regarding the implementations described herein, some example tasks that can be learned using the reinforcement learning techniques disclosed herein are described in more detail. Example tasks include random target reaching, door pushing and/or pulling, and pick and place. For example, in a reaching task, a robotic arm may attempt to reach a random target in space from a fixed initial configuration. The random target is generated for each episode by uniformly sampling a point from a cube of size 0.2 m centered at a point. The random target can be provided as a success signal. Given the end effector position e and the target position y, the reward function can be: $r(x, u) = -c_1 d(y, e(x)) - c_2 u^{T} u$.
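The reaching task can be illustrated with a short sketch. It assumes that the reward penalizes both the distance to the target and the squared magnitude of the action, that d is the Euclidean distance, and that 0.2 m is the side length of the cube. The function names and the default weights `c1` and `c2` are hypothetical.

```python
import math
import random


def sample_target(center, side=0.2, rng=random):
    """Uniformly sample a target point from a cube of the given side length."""
    half = side / 2.0
    return tuple(c + rng.uniform(-half, half) for c in center)


def reaching_reward(end_effector, target, action, c1=1.0, c2=0.001):
    """Negative weighted sum of the distance to the target and the action cost."""
    distance = math.dist(end_effector, target)  # d(y, e(x)), assumed Euclidean
    effort = sum(u * u for u in action)         # u^T u
    return -c1 * distance - c2 * effort
```

The reward is at most zero and approaches zero as the end effector nears the target with small actions, so maximizing it favors reaching the target without large torques.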
[0134] As another example, in a door pushing and pulling task, the robotic arm can attempt to open a door by pushing or pulling its handle. The handle can be turned downward by up to 90 degrees, and the door can open up to 90 degrees in both directions. The door has a spring so that it gradually closes when no external force is applied. The door has a latch so that it can be opened only when the handle is turned more than approximately 60 degrees. An IMU sensor attached to the door can be used to measure the door angle, and quaternion readings from the IMU sensor can be used to compute the loss. For example, the reward function can consist of two parts: the proximity of the end effector to the handle, and a measure of how far the door is opened in the correct direction. The first part of the reward function depends on the distance between the end effector position e and the handle position h in its neutral state. The second part of the reward function depends on the distance between the handle quaternion q and its value q_o when the handle is turned and the door is opened. State features may include the joint angles of the robot arm and their time derivatives, the end effector position, the resting door handle position, the door frame position, the door angle, and the handle angle.
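A two-part door reward of this kind can be sketched as follows. The text does not give the exact form, so everything specific here is an assumption: the Euclidean distance for the first part, the quaternion distance 1 − |⟨q, q_o⟩| for the second part (which presumes unit quaternions), the weights `c1` and `c2`, and the function names.

```python
import math


def quaternion_distance(q, q_ref):
    """Distance in [0, 1] between two unit quaternions; 0 means the same rotation."""
    dot = abs(sum(a * b for a, b in zip(q, q_ref)))  # q and -q are the same rotation
    return 1.0 - min(1.0, dot)


def door_reward(end_effector, handle_rest, q_handle, q_open, c1=1.0, c2=1.0):
    """Penalize distance to the handle and distance from the 'door open' orientation."""
    reach_term = math.dist(end_effector, handle_rest)   # first part: e versus h
    open_term = quaternion_distance(q_handle, q_open)   # second part: q versus q_o
    return -c1 * reach_term - c2 * open_term
```

Taking the absolute value of the dot product matters because a quaternion and its negation encode the same orientation, so an IMU reading that flips sign should not change the reward.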
[0135] Figure 6 schematically depicts an example architecture of robot 640. Robot 640 includes a robot control system 660, one or more operating components 640a-640n, and one or more sensors 642a-642m. Sensors 642a-642m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth. Although sensors 642a-m are depicted as integrated with robot 620, this is not intended to be limiting. In some embodiments, sensors 642a-m may be located external to robot 620, for example, as separate units.
[0136] Operating components 640a-640n may include, for example, one or more end effectors and/or one or more servo motors or other actuators to effectuate movement of one or more components of the robot. For example, robot 620 may have multiple degrees of freedom, and each actuator may control the actuation of robot 620 in one or more of the degrees of freedom in response to a control command. As used herein, the term actuator encompasses mechanical or electrical devices (e.g., motors) that generate motion, in addition to any driver(s) that may be associated with the actuator and that translate received control commands into one or more signals for driving the actuator. Therefore, providing a control command to an actuator may include providing the control command to a driver that translates the control command into appropriate signals for driving an electrical or mechanical device to generate the desired motion.
[0137] The robot control system 660 may be implemented in one or more processors, such as a CPU, GPU, and/or other controller(s) of the robot 620. In some embodiments, the robot 620 may include a “brain box” that may include all or aspects of the control system 660. For example, the brain box may provide real-time bursts of data to the operating components 640a-n, with each real-time burst including a set of one or more control commands that specify the motion parameters (if any) for each of one or more of the operating components 640a-n. In some embodiments, the robot control system 660 may perform one or more aspects of the methods 300, 400, and/or 500 described herein.
[0138] As described herein, in some embodiments, all or aspects of the control commands generated by the control system 660 in moving one or more components of the robot can be based on output generated over a policy network based on the current robot state and/or other observations. Although the control system 660 is shown in Figure 6 as an integral part of robot 620, in some embodiments, all or aspects of control system 660 may be implemented in components that are separate from, but in communication with, robot 620. For example, all or aspects of control system 660 may be implemented on one or more computing devices, such as computing device 710, that communicate with robot 620 via wired and/or wireless communication.
[0139] Figure 7 is a block diagram of an example computing device 710 that can optionally be used to perform one or more aspects of the techniques described herein. The computing device 710 typically includes at least one processor 714 that communicates with a number of peripheral devices via a bus subsystem 712. These peripheral devices may include a storage subsystem 724 (including, for example, a memory subsystem 725 and a file storage subsystem 726), user interface output devices 720, user interface input devices 722, and a network interface subsystem 716. The input and output devices allow users to interact with the computing device 710. The network interface subsystem 716 provides an interface to external networks and is coupled to corresponding interface devices in other computing devices.
[0140] User interface input device 722 may include a keyboard, pointing devices (such as a mouse, trackball, touchpad, or graphics tablet), scanner, touchscreen integrated into a display, audio input devices (such as a speech recognition system), microphone, and / or other types of input devices. Generally, the term "input device" is intended to include all possible types of devices and methods for inputting information onto computing device 710 or a communication network.
[0141] User interface output device 720 may include a display subsystem, a printer, a fax machine, or a non-visual display such as an audio output device. The display subsystem may include a cathode ray tube (CRT), a flat panel device such as a liquid crystal display (LCD), a projection device, or other mechanisms for creating visual images. The display subsystem may also provide non-visual displays, such as via an audio output device. Generally, the term "output device" is intended to encompass all possible types of devices and methods for outputting information from computing device 710 to a user or another machine or computing device.
[0142] Storage subsystem 724 stores the programming and data constructs that provide the functionality of some or all of the modules described herein. For example, storage subsystem 724 may include the logic to perform selected aspects of the methods of Figure 3, Figure 4, and/or Figure 5.
[0143] These software modules are typically executed by processor 714 alone or in combination with other processors. The memory 725 used in storage subsystem 724 may include a plurality of memories, including main random access memory (RAM) 730 for storing instructions and data during program execution and read-only memory (ROM) 732 for storing fixed instructions. File storage subsystem 726 can provide persistent storage for program and data files and may include hard disk drives, floppy disk drives and associated removable media, CD-ROM drives, optical drives, or removable media cartridges. Modules implementing the functionality of certain embodiments may be stored by file storage subsystem 726 in storage subsystem 724 or in other machines accessible to processor(s) 714.
[0144] The bus subsystem 712 provides a mechanism for enabling various components and subsystems of the computing device 710 to communicate with each other on demand. Although the bus subsystem 712 is schematically shown as a single bus, alternative implementations of the bus subsystem may use multiple buses.
[0145] The computing device 710 can be of various types, including workstations, servers, computing clusters, blade servers, server farms, or any other data processing system or computing device. Due to the constantly evolving nature of computers and networks, the description of the computing device 710 depicted in Figure 7 is intended only as a specific example for the purpose of illustrating some implementations. Many other configurations of the computing device 710 are possible, having more or fewer components than the computing device depicted in Figure 7.
[0146] While several embodiments have been described and illustrated herein, various other means and / or structures may be used to perform the functions and / or obtain the results and / or one or more advantages described herein, and each of such variations and / or modifications is considered to be within the scope of the embodiments described herein. More generally, all parameters, dimensions, materials, and configurations described herein are exemplary, and actual parameters, dimensions, materials, and / or configurations will depend on the specific application(s) taught. Those skilled in the art will recognize, or be able to identify, many equivalents of the specific embodiments described herein using only conventional experimentation. Therefore, it should be understood that the foregoing embodiments are given by way of example only, and that embodiments may be practiced in ways other than those specifically described and claimed within the scope of the appended claims and their equivalents. Embodiments of this disclosure pertain to each individual feature, system, article, material, kit, and / or method described herein. Furthermore, any combination of two or more such features, systems, articles, materials, kits, and / or methods that are not contradictory is included within the scope of this disclosure.
Claims
1. A method implemented by one or more processors, comprising: during execution of a plurality of episodes by each of a plurality of real physical robots, each episode including performing a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of robot experience data generated by the plurality of real physical robots during the episodes, each of the instances of robot experience data being generated during a corresponding one of the episodes and being generated at least in part based on a corresponding output generated using the policy neural network with the corresponding policy parameters of the policy neural network for the corresponding episode, wherein the instances of robot experience data of the plurality of real physical robots are stored in the buffer at a first frequency; iteratively generating updated policy parameters of the policy neural network at a second frequency that is greater than the first frequency, wherein each iteration of the iterative generation includes generating the updated policy parameters using a group of one or more of the instances of robot experience data in the buffer during the iteration; and by each of the robots, in conjunction with the start of each of a plurality of the episodes executed by the robot, updating the policy neural network to be used by the robot in the episode, wherein updating the policy neural network includes using the updated policy parameters of the most recent iteration of the iterative generation.
2. The method according to claim 1, wherein each of the updated policy parameters defines a corresponding value for a corresponding node of a corresponding layer of the policy neural network.
3. The method according to claim 1, wherein, for each of the robots, the instances of robot experience data are stored in the buffer at a corresponding frequency, each corresponding frequency being lower than the frequency at which the updated policy parameters are iteratively generated.
4. The method according to claim 1, wherein storing the instances of robot experience data in the buffer is executed by one or more of the processors in a first thread, and wherein the iterative generation is executed by one or more of the processors in a second thread that is separate from the first thread.
5. The method according to claim 4, wherein the first thread is executed by a first group of one or more of the processors, and the second thread is executed by a second group of one or more of the processors, the second group not overlapping with the first group.
6. The method according to claim 1, wherein each iteration of the iterative generation includes generating the updated policy parameters based on minimizing a loss function in view of the group of one or more instances of robot experience data in the buffer during that iteration.
7. The method according to claim 1, wherein each iteration of the iterative generation includes off-policy learning in view of the group of one or more instances of robot experience data in the buffer during that iteration.
8. The method according to claim 7, wherein the off-policy learning is Q-learning.
9. The method according to claim 8, wherein the Q-learning utilizes a normalized advantage function (NAF) algorithm or a deep deterministic policy gradient (DDPG) algorithm.
10. The method according to claim 1, wherein each of the instances of robot experience data indicates a corresponding: starting robot state, subsequent robot state transitioned to from the starting robot state, action performed to transition from the starting robot state to the subsequent robot state, and reward for the action; wherein the action performed to transition from the starting robot state to the subsequent robot state is generated based on processing the starting robot state using the policy neural network with the updated policy parameters for the corresponding episode, and wherein the reward for the action is generated based on a reward function of the reinforcement learning policy.
11. The method according to claim 1, further comprising: based on one or more criteria, terminating the execution of the plurality of episodes and terminating the iterative generation; and providing the most recently generated version of the updated policy parameters for the policy neural network, for use by one or more additional robots.
12. A method implemented by one or more processors, comprising: by one or more processors of a given robot of a plurality of real physical robots: executing a given episode of performing a task based on a policy network having a first group of policy parameters; providing, in one of a plurality of experience data iterations that provide experience data from the given robot, a first instance of robot experience data generated based on the policy network during the given episode, wherein the plurality of experience data iterations occur at a first frequency; and before the given robot executes a subsequent episode of performing the task based on the policy network: replacing one or more policy parameters of the first group with updated policy parameters, wherein the updated policy parameters are generated by training the policy network based on an additional instance of robot experience data generated by an additional robot during an additional robot episode in which the additional robot performs the task, wherein the additional robot episode is based on the policy network, and wherein the training of the policy network includes a plurality of training iterations occurring at a second frequency that is greater than the first frequency; wherein the subsequent episode immediately follows the first episode, and wherein performing the task based on the policy network in the subsequent episode includes using the updated policy parameters in place of the replaced policy parameters.
13. The method of claim 12, further comprising: generating, by one or more additional processors and during the execution of the subsequent episode, further updated policy parameters, wherein generating the further updated policy parameters is based on one or more of the first instances of robot experience data generated during the given episode; and providing the further updated policy parameters for use by the additional robot when the additional robot executes a corresponding episode.
14. The method according to claim 13, wherein the additional robot begins executing the corresponding episode while the given robot is executing the subsequent episode.
15. The method according to claim 13, wherein the further updated policy parameters are not utilized by the given robot in executing any episode.
16. The method of claim 13, further comprising: generating, by one or more additional processors, yet further updated policy parameters, wherein the yet further updated policy parameters are generated during the execution of the subsequent episode and after the generation of the further updated policy parameters; and providing the yet further updated policy parameters for use by the given robot when executing a further subsequent episode of performing the task based on the policy network, wherein the further subsequent episode immediately follows the subsequent episode.
17. The method according to claim 16, wherein the given robot begins executing the further subsequent episode while the additional robot is executing the corresponding episode.
18. The method according to claim 16, wherein the updated policy parameters and the yet further updated policy parameters are not utilized by the additional robot in executing any episode.
19. The method according to claim 12, wherein the updated policy parameters are not utilized by the additional robot in performing any episode.
20. The method according to claim 12, wherein the policy network comprises a neural network model.
21. The method according to claim 20, wherein each of the updated policy parameters defines a corresponding value for a corresponding node of a corresponding layer of the neural network model.
22. The method of claim 12, further comprising, during performance of the given episode of executing the task: determining that a given iteration of output from the policy network violates one or more criteria of the given robot; modifying the output of the given iteration so that it no longer violates the one or more criteria; and generating a given instance of the experience data based on the modified output.
23. The method according to claim 22, wherein the criteria include one or more of: a joint position limit, a joint velocity limit, and an end effector position limit.
24. The method of claim 12, wherein performing the given episode comprises: applying a current state representation as input to the policy network, the current state representation indicating at least a current state of the given robot; generating output by processing the input using the policy network; and providing control commands to one or more actuators of the given robot based on the output.
25. The method according to claim 24, wherein providing the control commands to the actuators based on the output comprises: generating modified output by adding noise to the output; and providing the control commands based on the modified output.
26. The method according to claim 24, wherein the output includes a velocity or a torque for each of a plurality of actuators of the given robot, and wherein providing the control commands comprises providing control commands that cause the actuators to apply the velocity or the torque.
27. The method according to claim 12, wherein each of the first instances of the experience data indicates a corresponding: starting robot state, subsequent robot state transitioned to from the starting robot state, action performed to transition from the starting robot state to the subsequent robot state, and reward for the action.
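As a non-limiting illustration of the per-step processing recited in claims 22 to 27, the following minimal Python sketch applies a state to a policy, adds noise to the output, modifies the output so that joint velocity and joint position limits are respected, and records each step as a (state, action, reward, next state) instance. The names `Transition`, `linear_policy`, `add_noise`, `enforce_limits`, and `run_episode`, the toy joint dynamics, the reward, and all numeric values are assumptions made for this sketch and do not appear in the claims.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One instance of experience data (hypothetical container)."""
    state: tuple
    action: tuple
    reward: float
    next_state: tuple


def linear_policy(params, state):
    """Toy stand-in for the policy network: a velocity command per joint."""
    gain, target = params
    return [gain * (target - q) for q in state]


def add_noise(output, sigma, rng):
    """Exploration: add zero-mean Gaussian noise to the policy output."""
    return [u + rng.gauss(0.0, sigma) for u in output]


def enforce_limits(state, action, dt, vel_limit, pos_low, pos_high):
    """Modify velocity commands so they no longer violate the limits."""
    safe = []
    for q, v in zip(state, action):
        v = max(-vel_limit, min(vel_limit, v))  # joint velocity limit
        # Joint position limit: keep q + v * dt inside [pos_low, pos_high].
        v = max((pos_low - q) / dt, min((pos_high - q) / dt, v))
        safe.append(v)
    return safe


def run_episode(policy, params, state, steps, rng, sigma=0.1, dt=0.1,
                vel_limit=1.0, pos_low=-1.0, pos_high=1.0, goal=0.5):
    """Run one episode and return the experience generated during it."""
    experience = []
    for _ in range(steps):
        output = policy(params, state)
        noisy = add_noise(output, sigma, rng)
        action = enforce_limits(state, noisy, dt, vel_limit, pos_low, pos_high)
        next_state = [q + v * dt for q, v in zip(state, action)]  # toy dynamics
        reward = -sum((q - goal) ** 2 for q in next_state)        # assumed reward
        experience.append(
            Transition(tuple(state), tuple(action), reward, tuple(next_state)))
        state = next_state
    return experience
```

In this sketch the experience is recorded from the modified output rather than the raw policy output, which corresponds to generating the instance of experience data based on the modified output. The sketch assumes the starting joint positions already lie within the position limits.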
28. A method implemented by one or more processors, comprising: receiving, in one of a plurality of experience data iterations in which experience data is received from a plurality of real physical robots, a given instance of robot experience data generated by a given robot of the plurality of real physical robots, wherein the given instance of robot experience data is generated during a given episode of executing a task based on a given version of policy parameters of a policy network utilized by the given robot at the time the given instance is generated; receiving additional instances of robot experience data from additional robots of the plurality of real physical robots, the additional instances being generated during episodes in which the additional robots execute the task based on the policy network, wherein the plurality of experience data iterations occur at a first frequency; while the given robot and the additional robots continue to perform episodes of executing the task, generating a new version of the policy parameters of the policy network based on training of the policy network that is based at least in part on the given instance and the additional instances, wherein the training of the policy network includes a plurality of training iterations occurring at a second frequency greater than the first frequency, the plurality of training iterations including: a first training iteration of training the policy network based at least in part on the given instance and the additional instances; and one or more additional training iterations of training the policy network based on further instances of experience data from the plurality of real physical robots; and providing the new version of the policy parameters to the given robot for use in performing a subsequent episode of executing the task by the given robot based on the new version of the policy parameters.
29. The method of claim 28, further comprising: performing, by the given robot and based on the new version of the policy parameters, the subsequent episode, wherein the subsequent episode is the immediately subsequent episode of executing the task performed by the given robot.
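As a non-limiting illustration of the two frequencies recited in claim 28, the following minimal Python sketch runs several training iterations for every experience data iteration and increments a parameter version at each training iteration, so that a robot pulling parameters before its next episode receives a newer version. The class `PolicyTrainer`, the functions `simulate_episode` and `run`, the single scalar parameter, and the supervised gradient step are assumptions made for this sketch; in particular, the gradient step is only a placeholder for a model-free update such as DDPG or NAF, and the loop is sequential rather than truly asynchronous.

```python
import random
from collections import deque


class PolicyTrainer:
    """Hypothetical trainer holding one policy parameter and a replay buffer."""

    def __init__(self, weight=0.0, lr=0.1, capacity=1000, seed=0):
        self.weight = weight            # single illustrative policy parameter
        self.version = 0                # incremented at every training iteration
        self.lr = lr
        self.buffer = deque(maxlen=capacity)
        self.experience_iterations = 0
        self.training_iterations = 0
        self._rng = random.Random(seed)

    def receive(self, instances):
        """One experience data iteration: store instances from one robot."""
        self.buffer.extend(instances)
        self.experience_iterations += 1

    def train_iteration(self, batch_size=8):
        """One training iteration on a random minibatch (placeholder update)."""
        if not self.buffer:
            return
        batch = self._rng.sample(list(self.buffer),
                                 min(batch_size, len(self.buffer)))
        grad = sum(2.0 * (self.weight * x - y) * x for x, y in batch) / len(batch)
        self.weight -= self.lr * grad
        self.version += 1
        self.training_iterations += 1

    def latest(self):
        """Return the newest parameter version and its value."""
        return self.version, self.weight


def simulate_episode(rng, steps=5):
    """Stand-in for a robot episode: (input, target) pairs with target = 2 * input."""
    return [(x, 2.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(steps))]


def run(trainer, num_robots=2, rounds=5, train_per_experience=3, seed=1):
    """Interleave experience iterations with a larger number of training iterations."""
    rng = random.Random(seed)
    versions_used = {robot: [] for robot in range(num_robots)}
    for _ in range(rounds):
        for robot in range(num_robots):
            version, _weight = trainer.latest()     # pulled before the episode
            versions_used[robot].append(version)
            trainer.receive(simulate_episode(rng))  # first frequency
            for _ in range(train_per_experience):   # second, higher frequency
                trainer.train_iteration()
    return versions_used
```

With the default arguments, each robot begins every episode with a parameter version that is six training iterations newer than the version it used in its previous episode, because both robots' experience contributes to training between its episodes.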
30. A method implemented by one or more processors, comprising: iteratively receiving instances of experience data generated by a plurality of real physical robots operating asynchronously and simultaneously, wherein each of the instances of experience data is generated by a corresponding robot of the plurality of real physical robots during a corresponding episode of executing a task based on a policy neural network, and wherein the instances of experience data generated by the plurality of real physical robots are received at a first frequency; iteratively training the policy neural network at a second frequency based on the experience data received from the plurality of real physical robots, to generate one or more updated parameters of the policy neural network at each training iteration, wherein the second frequency is greater than the first frequency; and iteratively and asynchronously providing instances of the updated parameters to the robots, for updating the policy neural networks of the robots before the episodes of executing the task on which further instances of the experience data are based.
31. The method of claim 30, further comprising: performing, by the robots, the episodes of executing the task on which the instances of experience data are based.
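As a non-limiting illustration of the asynchronous operation recited in claims 30 and 31, the following minimal Python sketch runs several robot threads and one trainer thread concurrently. Each robot thread pulls the latest parameters before every episode and sends experience to a shared queue, while the trainer thread performs more than one update per received instance and publishes each result. The names `ParameterStore`, `robot_worker`, `trainer_worker`, and `run_async`, the integer stand-in for the parameters, and the placeholder update are assumptions made for this sketch.

```python
import queue
import threading


class ParameterStore:
    """Thread-safe holder for the latest policy parameters (hypothetical)."""

    def __init__(self, params):
        self._lock = threading.Lock()
        self._params = params
        self._version = 0

    def publish(self, params):
        """Store new parameters and advance the version counter."""
        with self._lock:
            self._params = params
            self._version += 1

    def latest(self):
        """Return the current (version, parameters) pair."""
        with self._lock:
            return self._version, self._params


def robot_worker(robot_id, store, experience_queue, episodes, log):
    """Robot thread: refresh parameters before each episode, then send experience."""
    for episode in range(episodes):
        version, _params = store.latest()
        log.append(version)
        # Stand-in for executing the episode and generating experience data.
        experience_queue.put((robot_id, episode, version))


def trainer_worker(store, experience_queue, expected, steps_per_instance, received):
    """Trainer thread: several placeholder updates per received instance."""
    params = 0
    while len(received) < expected:
        received.append(experience_queue.get())  # blocks until data arrives
        for _ in range(steps_per_instance):
            params += 1                          # placeholder for a gradient step
            store.publish(params)


def run_async(num_robots=3, episodes=4, steps_per_instance=2):
    """Start all threads, wait for them to finish, and return the results."""
    store = ParameterStore(0)
    experience_queue = queue.Queue()
    logs = [[] for _ in range(num_robots)]
    received = []
    trainer = threading.Thread(
        target=trainer_worker,
        args=(store, experience_queue, num_robots * episodes,
              steps_per_instance, received))
    robots = [
        threading.Thread(target=robot_worker,
                         args=(i, store, experience_queue, episodes, logs[i]))
        for i in range(num_robots)
    ]
    trainer.start()
    for thread in robots:
        thread.start()
    for thread in robots:
        thread.join()
    trainer.join()
    return store, logs, received
```

Because the threads are not synchronized with one another, the parameter version a robot sees before a given episode varies from run to run; the sketch guarantees only that the versions seen by each robot never decrease and that all experience is eventually consumed by the trainer.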
32. Computer-readable instructions that, when executed by a computing device, cause the method according to any one of claims 1 to 31 to be performed.
33. An apparatus for deep reinforcement learning, configured to perform the method according to any one of claims 1 to 31.