Trust region strategy optimization method and device based on post-event experience and related equipment

A trust region strategy optimization method, applied in the field of machine learning for intelligent robots, that addresses slow learning speed, low exploration efficiency, and difficult reward function design, and that increases accuracy, improves convergence speed, and reduces variance.

Pending Publication Date: 2020-12-18
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

However, reinforcement learning is currently facing many problems such as slow learning speed, difficult rewa

Example Embodiment

[0039] To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative work fall within the protection scope of the present invention.

[0040] The present invention is a trust region strategy optimization method based on post-event experience. Its principle is to take the experience data of robot actions collected under target conditions during strategy training, and to use the target points actually reached in that experience data as virtual targets. Poi...
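The idea of treating reached target points as virtual targets can be illustrated with a minimal sketch. It is not the patent's implementation: the `Transition` layout, the 2-D target points, and the sparse reward in `sparse_reward` are all assumptions made for illustration, and the approach shown is what the reinforcement learning literature usually calls hindsight relabeling.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]  # hypothetical 2-D target point


class Transition(NamedTuple):
    state: Point     # robot position before the action (assumed layout)
    action: int
    achieved: Point  # target point actually reached after the action
    goal: Point      # target the strategy was conditioned on
    reward: float


def sparse_reward(achieved: Point, goal: Point, tol: float = 1e-6) -> float:
    """Assumed reward: 1.0 when the achieved point coincides with the goal, else 0.0."""
    dist_sq = sum((a - g) ** 2 for a, g in zip(achieved, goal))
    return 1.0 if dist_sq <= tol else 0.0


def relabel(trajectory: List[Transition], virtual_goal: Point) -> List[Transition]:
    """Copy a trajectory, swapping in a virtual goal and recomputing each reward."""
    return [
        t._replace(goal=virtual_goal, reward=sparse_reward(t.achieved, virtual_goal))
        for t in trajectory
    ]


def virtual_experience(trajectory: List[Transition]) -> List[List[Transition]]:
    """Build one relabelled copy per distinct target point reached in the trajectory."""
    reached = list(dict.fromkeys(t.achieved for t in trajectory))  # unique, order kept
    return [relabel(trajectory, g) for g in reached]


# A trajectory that never reaches its original goal (5, 5).
original = [
    Transition((0.0, 0.0), 0, (1.0, 0.0), (5.0, 5.0), 0.0),
    Transition((1.0, 0.0), 1, (1.0, 1.0), (5.0, 5.0), 0.0),
]
virtual = virtual_experience(original)
```

Every reward in the original trajectory is zero, yet each relabelled copy contains at least one rewarded step, which is why such data can help when the reward function is simple and sparse.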

Abstract

The invention discloses a trust region strategy optimization method and device based on post-event experience, and related equipment. The method comprises the following steps: S100, taking a target point reached in the experience data as a virtual target point, and generating virtual post-event experience data; S200, filtering the virtual targets with a post-event target filtering algorithm to obtain the corresponding training data; S300, based on the virtual experience data, correcting the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling; S400, correcting the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling so as to estimate the KL divergence between strategies; and S500, correcting the strategy gradient direction through the KL divergence, and computing the strategy update step length from the maximum KL divergence step length. With this method, an intelligent agent can explore the environment and its tasks effectively from a small amount of interaction data and a simply designed reward function, and can learn and update its behavior strategies efficiently.
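Two of the numerical ingredients named in steps S300 to S500 can be sketched in a few lines. This is a hedged illustration and not the patent's estimator: the function names are hypothetical, the importance weights are the standard self-normalized form, and the step length uses the usual quadratic approximation of the KL divergence, which the excerpt does not spell out.

```python
import math
from typing import Sequence


def weighted_is_estimate(
    values: Sequence[float],
    log_p_target: Sequence[float],
    log_p_behaviour: Sequence[float],
) -> float:
    """Self-normalized importance sampling estimate of the target expectation.

    Samples were drawn under the behaviour distribution; the weights
    p_target / p_behaviour are normalized by their sum, which reduces
    variance at the cost of a small bias.
    """
    log_w = [lt - lb for lt, lb in zip(log_p_target, log_p_behaviour)]
    shift = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def max_kl_step(
    grad: Sequence[float], direction: Sequence[float], max_kl: float
) -> float:
    """Largest step beta such that 0.5 * beta**2 * (grad . direction) <= max_kl.

    `direction` is assumed to be the KL-corrected (natural) gradient direction,
    so grad . direction equals the quadratic KL curvature along that direction.
    """
    curvature = sum(g * d for g, d in zip(grad, direction))
    if curvature <= 0.0:
        raise ValueError("grad . direction must be positive")
    return math.sqrt(2.0 * max_kl / curvature)


# Two samples; the second is three times more likely under the target.
estimate = weighted_is_estimate([1.0, 3.0], [0.0, math.log(3.0)], [0.0, 0.0])
step = max_kl_step([1.0, 2.0], [1.0, 2.0], max_kl=0.1)
```

Normalizing by the sum of the weights, rather than by the sample count, keeps the estimate bounded by the observed values, which matters when relabelled data is far from the distribution that generated the original experience.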

Description

Technical field

[0001] The invention belongs to the field of machine learning for intelligent robots, and in particular relates to a trust region strategy optimization method and device based on post-event experience, and related equipment.

Background art

[0002] With its rapid development, artificial intelligence technology has made its mark in many industries through intelligent and automated information processing. However, the current mainstream deep learning methods in the field of artificial intelligence mostly rely on large-scale human-labeled data. How to obtain data and complete the learning process through the autonomous interaction of robots or agents with the environment is a major difficulty in the field. As an important branch of artificial intelligence, reinforcement learning can help robots explore and learn while interacting autonomously with the environment. However, reinforcement learning is cu...

Application Information

IPC(8): G06N20/00
CPC G06N20/00
Inventor 兰旭光, 张翰博, 柏思特, 郑南宁
Owner XI AN JIAOTONG UNIV