
Trust region strategy optimization method and device based on post-event experience and related equipment

A trust region policy optimization method for machine learning in intelligent robotics. It addresses three problems of reinforcement learning: slow learning, low exploration efficiency, and difficult reward function design. The reported effects are higher accuracy, faster convergence, and lower variance.

Pending Publication Date: 2020-12-18
XI AN JIAOTONG UNIV
Cites: 1; Cited by: 0

AI Technical Summary

Problems solved by technology

However, reinforcement learning currently faces several problems, including slow learning, difficult reward function design, and low exploration efficiency. These problems make it difficult to apply to complex real-world tasks.



Examples

Experimental program
Comparison scheme
Effect test

Detailed Description of the Embodiments

[0039] To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.

[0040] The present invention is a trust region policy optimization method based on post-event (hindsight) experience. Its principle is to take the experience data of actions executed by the robot, which is collected during policy training under target conditions, and to use the target points actually reached in that experience data as a virtual...
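The relabeling idea in [0040], which is also step S100 of the abstract, can be sketched as follows. This is a minimal illustration and not the patented algorithm. It rests on several assumptions: goals are points in Euclidean space; the reward is a hypothetical sparse reward of 0 within a tolerance `tol` of the goal and -1 otherwise; and a crude duplicate-goal filter stands in for the post-event target filtering algorithm of step S200, which the source does not specify. The names `Step`, `sparse_reward`, and `hindsight_relabel` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Goal = Tuple[float, ...]


@dataclass
class Step:
    state: Tuple[float, ...]
    action: int
    achieved_goal: Goal  # point the robot actually reached after this step
    goal: Goal           # target the policy was conditioned on
    reward: float


def sparse_reward(achieved: Goal, goal: Goal, tol: float = 0.05) -> float:
    """Hypothetical sparse reward: 0 if within tol (Euclidean) of the goal, else -1."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def hindsight_relabel(episode: Sequence[Step], tol: float = 0.05) -> List[List[Step]]:
    """Build one virtual episode per distinct reached point: the original
    states and actions are kept, the goal is replaced by the reached point,
    and rewards are recomputed against that virtual goal."""
    virtual: List[List[Step]] = []
    kept_goals: List[Goal] = []
    for step in episode:
        g = step.achieved_goal
        # Placeholder filter (assumption): skip virtual goals that nearly
        # duplicate one already kept. Not the patent's S200 algorithm.
        if any(sparse_reward(g, k, tol) == 0.0 for k in kept_goals):
            continue
        kept_goals.append(g)
        virtual.append([
            Step(s.state, s.action, s.achieved_goal, g,
                 sparse_reward(s.achieved_goal, g, tol))
            for s in episode
        ])
    return virtual
```

An episode that never reaches its original goal earns only -1 rewards. Each virtual episode, however, ends with at least one successful step, so a sparse, simply designed reward still yields useful training signal.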


Abstract

The invention discloses a trust region policy optimization method and device based on post-event experience, and related equipment. The method comprises the following steps: S100, taking the target points actually reached in the experience data as virtual target points and generating virtual post-event experience data; S200, filtering the virtual targets with a post-event target filtering algorithm to obtain the corresponding training data; S300, based on the virtual experience data, correcting the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling; S400, correcting this distribution deviation based on weighted importance sampling so as to estimate the KL divergence between policies; and S500, correcting the policy gradient direction using the KL divergence, and computing the policy update step size from the maximum KL divergence step size. With this method, an agent can explore its environment and tasks effectively from a small amount of interaction data and a simply designed reward function, and can learn and update its behavior policy efficiently.
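The abstract describes steps S300 to S500 only at a high level and gives no formulas. The sketch below shows one standard way to realize them. It is built on assumptions and is not the patented procedure:

(1) "Weighted importance sampling" is taken to mean self-normalized importance weights. For post-event data, `log_p_target` would be a trajectory's summed action log-probabilities under the virtual goal, and `log_p_behavior` the same quantity under the original goal.

(2) The per-sample KL term uses the surrogate ½(log π_old − log π_new)², which matches KL(π_old‖π_new) to second order.

(3) The gradient direction is corrected into a natural-gradient direction s = F⁻¹g by conjugate gradient, where F is the Fisher matrix (the Hessian of the KL divergence). As in TRPO, the step length β = √(2δ / sᵀFs) is chosen so that the quadratic KL model equals the maximum KL divergence δ.

All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
FisherVectorProduct = Callable[[Vector], Vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_importance_weights(log_p_target: Sequence[float],
                                log_p_behavior: Sequence[float]) -> Vector:
    """Self-normalized importance weights w_i / sum_j w_j, where
    w_i = p_target(x_i) / p_behavior(x_i), computed in log space."""
    log_w = [lt - lb for lt, lb in zip(log_p_target, log_p_behavior)]
    shift = max(log_w)  # subtract the max before exp to avoid overflow
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def weighted_kl_estimate(weights: Sequence[float],
                         logp_old: Sequence[float],
                         logp_new: Sequence[float]) -> float:
    """Weighted estimate of KL(pi_old || pi_new) using the second-order
    surrogate 0.5 * (log pi_old - log pi_new)^2 per sample (assumption)."""
    return sum(w * 0.5 * (lo - ln) ** 2
               for w, lo, ln in zip(weights, logp_old, logp_new))


def conjugate_gradient(fvp: FisherVectorProduct, g: Vector,
                       iters: int = 10, tol: float = 1e-10) -> Vector:
    """Approximately solve F x = g (the natural-gradient direction) using
    only Fisher-vector products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)
    p = list(g)
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        fp = fvp(p)
        alpha = rr / dot(p, fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def max_kl_step(fvp: FisherVectorProduct, g: Vector, max_kl: float) -> Vector:
    """Scale the natural-gradient direction s so that the quadratic KL model
    0.5 * step^T F step equals max_kl: beta = sqrt(2 * max_kl / s^T F s)."""
    s = conjugate_gradient(fvp, g)
    beta = math.sqrt(2.0 * max_kl / dot(s, fvp(s)))
    return [beta * si for si in s]


def matrix_fvp(F: Sequence[Sequence[float]]) -> FisherVectorProduct:
    """Fisher-vector product for a small explicit matrix (for illustration)."""
    return lambda v: [dot(row, v) for row in F]
```

For example, take F = [[2, 0], [0, 1]], gradient [1, 1], and max_kl = 0.01. The direction is s = [0.5, 1.0], and the returned step satisfies ½·stepᵀF·step = 0.01. In a full implementation, the weighted KL estimate would typically be recomputed after the step (for example, in a backtracking line search) to check that the new policy stays within the trust region.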

Description

Technical Field

[0001] The invention belongs to the field of machine learning for intelligent robots, and in particular relates to a trust region policy optimization method, device, and related equipment based on post-event experience.

Background

[0002] With the rapid development of artificial intelligence technology, intelligent and automated information processing has emerged in many industries. However, the current mainstream deep learning methods in artificial intelligence mostly rely on large-scale human-labeled data. A major difficulty in the field is how robots or agents can obtain data and complete the learning process through autonomous interaction with the environment. As an important branch of artificial intelligence, reinforcement learning can help robots explore and learn while interacting autonomously with the environment. However, reinforcement learning is cu...


Application Information

IPC(8): G06N20/00
CPC: G06N20/00
Inventors: 兰旭光 (Lan Xuguang), 张翰博 (Zhang Hanbo), 柏思特 (Bai Site), 郑南宁 (Zheng Nanning)
Owner: XI AN JIAOTONG UNIV