Data-efficient hierarchical reinforcement learning

By combining a multi-level hierarchical reinforcement learning model with off-policy correction, the approach addresses the difficulty of complex multi-level reasoning and the wasted resources in training for robot control tasks, enabling efficient training and robot control in complex environments.

CN117549293B · Active · Publication Date: 2026-06-12 · GOOGLE LLC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GOOGLE LLC
Filing Date: 2019-05-17
Publication Date: 2026-06-12

Figure CN117549293B_ABST

Abstract

Hierarchical reinforcement learning (HRL) models are trained and/or used for robotic control. An HRL model can include at least a higher-level policy model and a lower-level policy model. Some implementations relate to techniques that enable more efficient off-policy training of the higher-level policy model and/or the lower-level policy model. Some of these implementations use off-policy correction, which re-labels the higher-level actions of experience data, generated in the past using a previously trained version of the HRL model, with modified higher-level actions. The modified higher-level actions are then used to train the higher-level policy model off-policy. This can enable efficient off-policy training even though the lower-level policy model at training time is a different version from the one in use when the experience data was collected.
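The abstract does not say how the modified higher-level actions are chosen. As a minimal sketch of one plausible rule, assumed here for illustration only, the logged higher-level action can be replaced by whichever candidate makes the current lower-level policy best reproduce the logged lower-level actions. The function names, the toy linear policy, and the `gain` parameter are all hypothetical.

```python
def lower_policy_action(state, goal, gain):
    """Hypothetical lower-level policy: step toward an absolute goal state."""
    return [gain * (g - s) for s, g in zip(state, goal)]


def action_mismatch(states, actions, goal, gain):
    """Squared error between logged actions and the current policy's actions."""
    total = 0.0
    for state, logged in zip(states, actions):
        predicted = lower_policy_action(state, goal, gain)
        total += sum((a - p) ** 2 for a, p in zip(logged, predicted))
    return total


def relabel_goal(states, actions, original_goal, candidates, gain):
    """Assumed selection rule: keep the goal with the smallest mismatch."""
    options = [original_goal] + candidates
    return min(options, key=lambda goal: action_mismatch(states, actions, goal, gain))


# Experience logged with an older lower-level policy (gain 1.0) and goal [2.0].
states = [[0.0], [1.0]]
actions = [[2.0], [1.0]]

# The current lower-level policy has gain 0.5, so a different goal explains
# the logged actions better than the original one.
relabeled = relabel_goal(states, actions, [2.0], [[3.5], [1.0]], gain=0.5)
```

In this toy case the logged goal `[2.0]` is replaced by `[3.5]`, and the higher-level policy would then be trained on the re-labeled experience rather than on the stale pairing.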

Claims

1. A method implemented by one or more processors, the method comprising:
identifying a current state observation of a robot;
determining, using a higher-level policy model of a hierarchical reinforcement learning model, a higher-level action for transitioning from the current state observation to a target state observation;
generating an atomic action based on processing the current state observation and the higher-level action using a lower-level policy model of the hierarchical reinforcement learning model;
applying the atomic action to the robot to cause the robot to transition to an updated state;
generating an intrinsic reward for the atomic action, the intrinsic reward being generated based on the updated state and the target state observation; and
training the lower-level policy model based on the intrinsic reward for the atomic action.
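The steps of claim 1 can be pictured as a single control step that produces one training transition for the lower-level policy. The sketch below is illustrative only: the policies, the robot dynamics, and the reward (a negative sum of absolute differences, chosen arbitrarily) are hypothetical stand-ins, and the actual update of the lower-level policy model is left to whatever learner consumes the stored transition.

```python
def lower_level_step(state, higher_policy, lower_policy, robot_step, reward_fn):
    """Run one control step and return a transition for lower-level training."""
    target = higher_policy(state)         # higher-level action (target state observation)
    action = lower_policy(state, target)  # atomic action
    updated = robot_step(state, action)   # robot transitions to the updated state
    reward = reward_fn(updated, target)   # intrinsic reward for the atomic action
    return {"state": state, "target": target, "action": action,
            "reward": reward, "updated": updated}


# Hypothetical stand-ins for the models and the (simulated) robot.
def toy_higher_policy(state):
    return [s + 1.0 for s in state]


def toy_lower_policy(state, target):
    return [0.5 * (t - s) for s, t in zip(state, target)]


def toy_robot_step(state, action):
    return [s + a for s, a in zip(state, action)]


def toy_reward(updated, target):
    return -sum(abs(u - t) for u, t in zip(updated, target))


# Transitions collected this way would later be used to train the lower-level policy.
replay_buffer = []
replay_buffer.append(
    lower_level_step([0.0, 0.0], toy_higher_policy, toy_lower_policy,
                     toy_robot_step, toy_reward)
)
```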

2. The method according to claim 1, further comprising: following the training, using the hierarchical reinforcement learning model to control one or more actuators of an additional robot.

3. The method according to claim 1 or claim 2, wherein the robot is a simulated robot.

4. The method according to claim 1 or claim 2, wherein generating the intrinsic reward based on the updated state and the target state observation comprises generating the intrinsic reward based on an L2 difference between the updated state and the target state observation.
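A minimal sketch of an L2-based intrinsic reward follows. The claim only requires that the reward be based on the L2 difference; the negative sign (so that ending closer to the target scores higher) and the function name are assumptions made for this example.

```python
import math


def l2_intrinsic_reward(updated_state, target_state):
    """Negative Euclidean (L2) distance between the updated state and the target."""
    if len(updated_state) != len(target_state):
        raise ValueError("states must have the same dimensionality")
    # math.dist computes the Euclidean distance (Python 3.8+).
    return -math.dist(updated_state, target_state)


# The updated state is 3 units off in one dimension and 4 in the other,
# so the L2 distance is 5 and the reward is -5.0.
reward = l2_intrinsic_reward([1.0, 2.0], [4.0, 6.0])
```

Reaching the target exactly gives the maximum reward of zero under this sign convention.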

5. The method according to claim 1 or claim 2, further comprising generating an environmental reward and training the higher-level policy model based on the environmental reward.

6. A method implemented by one or more processors of a robot, the method comprising:
identifying a current state of the robot;
determining, at a first control step and using a higher-level policy model of a hierarchical reinforcement learning model, a higher-level action for transitioning from the current state to a target state;
generating a first lower-level action for the first control step based on processing the current state and the higher-level action using a lower-level policy model of the hierarchical reinforcement learning model;
applying the first lower-level action to the robot to cause the robot to transition to an updated state;
generating, at a second control step that follows the first control step, an updated higher-level action, wherein generating the updated higher-level action comprises at least applying the current state, the updated state, and the higher-level action to a transition function;
generating a second lower-level action for the second control step based on processing the updated state and the updated higher-level action using the lower-level policy model; and
applying the second lower-level action to the robot to cause the robot to transition to a further updated state.
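Claim 6 does not define the transition function. One candidate, assumed here purely for illustration, treats the higher-level action as a relative offset and recomputes it as current state + higher-level action - updated state, so that the absolute target stays fixed between control steps. The toy lower-level policy and robot dynamics below are likewise hypothetical.

```python
def goal_transition(state, goal, updated_state):
    """Assumed transition function: re-express the same absolute target
    as an offset from the updated state."""
    return [s + g - u for s, g, u in zip(state, goal, updated_state)]


def lower_policy(state, goal):
    """Hypothetical lower-level policy: cover half of the remaining offset."""
    return [0.5 * g for g in goal]


def robot_step(state, action):
    """Stand-in for the robot: the action is applied as a displacement."""
    return [s + a for s, a in zip(state, action)]


# First control step: the higher-level action is an offset toward the target.
state = [0.0, 0.0]
goal = [2.0, -4.0]
first_action = lower_policy(state, goal)
updated = robot_step(state, first_action)

# Second control step: update the higher-level action, then act again.
updated_goal = goal_transition(state, goal, updated)
second_action = lower_policy(updated, updated_goal)
further_updated = robot_step(updated, second_action)
```

With this choice the higher-level policy model does not need to be queried at every control step: the absolute target implied by `updated_goal` is still `[2.0, -4.0]`.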

7. A robot comprising one or more processors for performing the method of claim 6.

8. A computer-readable storage medium comprising instructions that, when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 6.

9. A robot control system comprising one or more processors for performing the method according to any one of claims 1 to 6.