Data-efficient hierarchical reinforcement learning
A multi-level hierarchical reinforcement learning model combined with off-policy correction addresses the difficulty of complex multi-level reasoning and the waste of resources in robot control tasks, enabling efficient training and robot control in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2019-05-17
- Publication Date
- 2026-06-12
Figure CN117549293B_ABST
Abstract
Claims
1. A method implemented by one or more processors, the method comprising: identifying a current state observation of a robot; determining, using a higher-level policy model of a hierarchical reinforcement learning model, a higher-level action for transitioning from the current state observation to a target state observation; generating an atomic action by processing the current state observation and the higher-level action using a lower-level policy model of the hierarchical reinforcement learning model; applying the atomic action to the robot to cause the robot to transition to an updated state; generating an intrinsic reward for the atomic action, the intrinsic reward being generated based on the updated state and the target state observation; and training the lower-level policy model based on the intrinsic reward for the atomic action.
2. The method according to claim 1, further comprising: following the training, using the hierarchical reinforcement learning model to control one or more actuators of an additional robot.
3. The method according to claim 1 or claim 2, wherein the robot is a simulated robot.
4. The method according to claim 1 or claim 2, wherein generating the intrinsic reward based on the updated state and the target state observation comprises generating the intrinsic reward based on an L2 difference between the updated state and the target state observation.
5. The method according to claim 1 or claim 2, further comprising: generating an environmental reward; and training the higher-level policy model based on the environmental reward.
6. A method implemented by one or more processors of a robot, the method comprising: identifying a current state of the robot; determining, in a first control step and using a higher-level policy model of a hierarchical reinforcement learning model, a higher-level action for transitioning from the current state to a target state; generating a first lower-level action for the first control step based on processing the current state and the higher-level action using a lower-level policy model of the hierarchical reinforcement learning model; applying the first lower-level action to the robot to cause the robot to transition to an updated state; generating, in a second control step following the first control step, an updated higher-level action, wherein generating the updated higher-level action comprises applying at least the current state, the updated state, and the higher-level action to a transition function; generating a second lower-level action for the second control step based on processing the updated state and the updated higher-level action using the lower-level policy model; and applying the second lower-level action to the robot to cause the robot to transition to a further updated state.
7. A robot comprising one or more processors for performing the method of claim 6.
8. A computer-readable storage medium comprising instructions that, when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 6.
9. A robot control system comprising one or more processors for performing the method according to any one of claims 1 to 6.
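Illustrative sketch (not part of the claims). The steps recited in claims 1, 4 and 6 can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the claims do not fix the sign of the intrinsic reward, the form of the transition function, or any policy architecture. The sketch assumes the higher-level action is a desired offset from the current state, uses the negative L2 distance as the intrinsic reward, and uses the transition function h(s, g, s') = s + g - s', a common choice in goal-conditioned hierarchical reinforcement learning. The two policy functions and the toy environment are hypothetical stand-ins, and no actual training is performed; the rollout only collects the intrinsic rewards that a lower-level trainer would consume.

```python
import math


def target_state(state, goal_offset):
    """Absolute target state, assuming the higher-level action is an offset."""
    return [s + g for s, g in zip(state, goal_offset)]


def intrinsic_reward(updated_state, target):
    """Negative L2 distance between the updated state and the target state.

    The negative sign is an assumption; claim 4 only says the reward is
    based on the L2 difference.
    """
    return -math.sqrt(sum((u - t) ** 2 for u, t in zip(updated_state, target)))


def goal_transition(state, goal_offset, updated_state):
    """Assumed transition function h(s, g, s') = s + g - s'.

    It re-expresses the same absolute target relative to the updated state.
    """
    return [s + g - u for s, g, u in zip(state, goal_offset, updated_state)]


def higher_level_policy(state):
    """Hypothetical stand-in: always request a +1.0 move in every dimension."""
    return [1.0 for _ in state]


def lower_level_policy(state, goal_offset):
    """Hypothetical stand-in: step toward the goal, at most 0.5 per dimension."""
    return [max(-0.5, min(0.5, g)) for g in goal_offset]


def apply_action(state, action):
    """Toy environment (assumption): the state moves by exactly the action."""
    return [s + a for s, a in zip(state, action)]


def rollout(state, steps=3):
    """Run one higher-level decision followed by `steps` lower-level steps."""
    goal = higher_level_policy(state)  # first control step
    log = []
    for _ in range(steps):
        action = lower_level_policy(state, goal)
        updated = apply_action(state, action)
        reward = intrinsic_reward(updated, target_state(state, goal))
        log.append((action, updated, reward))
        # Later control steps: update the higher-level action with the
        # transition function instead of querying the higher-level policy.
        goal = goal_transition(state, goal, updated)
        state = updated
    return log


if __name__ == "__main__":
    for action, updated, reward in rollout([0.0, 0.0]):
        print(action, updated, round(reward, 4))
```

In this toy setting the target stays fixed in absolute coordinates while the offset passed to the lower-level policy shrinks as the state approaches it, which is the purpose of applying the current state, the updated state and the previous higher-level action to the transition function.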