A method for generating high-quality simulated experiences for dialogue policy learning

The method, applied in the field of machine learning, addresses the weakening of the Dyna-Q framework's advantages and the low efficiency of DDQ, thereby avoiding poor learning performance.

Active Publication Date: 2021-08-10
NANHU LAB

AI Technical Summary

Problems solved by technology

That is, a world model implemented with a data-hungry model such as a DNN weakens the advantages of the Dyna-Q framework and makes DDQ very inefficient in practice.

Examples

Embodiment 1

[0054] As shown in Figure 1, this scheme proposes a GP-based deep Dyna-Q method for dialogue policy learning. The basic procedure follows the prior art: human conversation data is used to initialize the dialogue policy model and the world model, and dialogue policy learning then begins. Dialogue policy learning for the dialogue policy model consists of two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). Direct reinforcement learning uses a Deep Q-Network (DQN) to improve the dialogue policy from real experience. The dialogue policy model interacts with the user; at each step, given the observed dialogue state $s$, it selects the action $a$ that maximizes the value function $Q$. The dialogue policy model then receives the reward $r$ and the real user's action $a_r^u$, and updates the current state to $s'$. The real experience $(s, a, r, a_r^u, t)$ is then stored...
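The direct reinforcement learning step described in [0054] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the epsilon-greedy exploration, the reading of $t$ as a dialogue-termination flag, and the names `select_action`, `direct_rl_step`, `toy_q_function`, and `toy_user` are all assumptions introduced here, and the DQN parameter update itself is omitted.

```python
import random
from collections import deque


def select_action(q_values, epsilon, rng):
    """Pick argmax_a Q(s, a), exploring at random with probability epsilon.

    Epsilon-greedy exploration is an assumption; the text only says the
    action is chosen by maximizing Q.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def direct_rl_step(state, q_function, user, real_buffer, epsilon, rng):
    """Run one dialogue turn with the real user and store the experience."""
    action = select_action(q_function(state), epsilon, rng)
    reward, user_action, next_state, terminated = user(state, action)
    # Store the real experience (s, a, r, a_r^u, t); t is assumed to be
    # a termination flag.
    real_buffer.append((state, action, reward, user_action, terminated))
    return next_state, terminated


# Hypothetical stand-ins for the Q-network and the real user.
def toy_q_function(state):
    return [0.1, 0.7, 0.2]


def toy_user(state, action):
    done = action == 1
    reward = 1.0 if done else -0.1
    return reward, "confirm", state + 1, done


rng = random.Random(0)
real_buffer = deque(maxlen=1000)
next_state, terminated = direct_rl_step(
    0, toy_q_function, toy_user, real_buffer, epsilon=0.0, rng=rng
)
```

With `epsilon=0.0` the sketch is purely greedy, which matches the wording of the paragraph most closely; a nonzero value would add exploration.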

Embodiment 2

[0086] As shown in Figure 9, this embodiment is similar to Embodiment 1. The difference is that, before a simulated experience is stored in the buffer, a quality detector inspects it, and only simulated experiences that pass the inspection are stored in the buffer.
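The gating step in [0086] amounts to filtering simulated experiences before they reach the buffer. The sketch below shows that control flow only; `store_if_high_quality` and `toy_detector` are hypothetical names, and the reward-range check is a placeholder for whatever quality detector is actually used.

```python
from collections import deque


def store_if_high_quality(experiences, quality_detector, buffer):
    """Append only the simulated experiences that the detector accepts."""
    accepted = 0
    for experience in experiences:
        if quality_detector(experience):
            buffer.append(experience)
            accepted += 1
    return accepted


def toy_detector(experience):
    """Placeholder check (assumption): reject implausible rewards."""
    _state, _action, reward, _user_action, _terminated = experience
    return -1.0 <= reward <= 1.0


sim_buffer = deque(maxlen=5000)
candidates = [
    ("s0", "request_time", 0.4, "inform_time", False),
    ("s0", "request_time", 7.5, "inform_time", False),  # implausible reward
    ("s0", "request_time", -0.2, "inform_time", False),
]
n_kept = store_if_high_quality(candidates, toy_detector, sim_buffer)
```

Passing the detector in as a callable keeps the buffer logic unchanged whichever detector is plugged in.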

[0087] Specifically, the quality detector separately checks the quality of the upper-bound simulated experience $e_l$, the lower-bound simulated experience $e_b$, and the meta simulated experience $e_i$. The quality detector can be a conventional GAN (generative adversarial network) quality detector or the KL divergence (Kullback-Leibler divergence) quality detector developed independently by the applicant.

[0088] The KL divergence quality detector is briefly introduced below. As shown in Figure 4, the quality of a simulated experience is checked mainly by comparing the simulated experience w...
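Since the description of the KL divergence quality detector is cut off here, the following is only a generic sketch of a KL-based check, not the applicant's detector. It assumes the comparison is between user actions seen in simulated and in real experience, uses add-one smoothing so the divergence stays finite, and accepts a batch when the divergence is below a threshold; the function names, the smoothing, and the threshold value are all assumptions.

```python
import math
from collections import Counter


def smoothed_distribution(items, vocabulary, alpha=1.0):
    """Empirical distribution over vocabulary with additive smoothing."""
    counts = Counter(items)
    total = len(items) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined on the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def passes_kl_check(simulated, real, threshold):
    """Accept the simulated items if KL(simulated || real) <= threshold."""
    vocabulary = sorted(set(simulated) | set(real))
    p = smoothed_distribution(simulated, vocabulary)
    q = smoothed_distribution(real, vocabulary)
    return kl_divergence(p, q) <= threshold


real_actions = ["inform"] * 8 + ["deny"] * 2
similar_sim = ["inform"] * 7 + ["deny"] * 3
different_sim = ["thanks"] * 10

similar_ok = passes_kl_check(similar_sim, real_actions, threshold=0.5)
different_ok = passes_kl_check(different_sim, real_actions, threshold=0.5)
```

In this toy setup the simulated batch that resembles the real data passes and the one dominated by an unseen action does not.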

Abstract

The invention provides a method for generating high-quality simulated experience for dialogue policy learning, belonging to the field of machine learning, comprising the following steps: S1. generating simulated experience from the predictions of a GP-based world model; S2. storing the simulated experience in a buffer for dialogue policy model training. The Gaussian-process-based world model of this solution avoids the problem that the quality of simulated experience generated by a traditional DNN model depends on the amount of training data, where too little training data leads to poor learning results and low learning efficiency.
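Steps S1 and S2 can be illustrated with a toy Gaussian process regressor. The sketch below is an assumption-laden stand-in, not the patent's world model: it uses a one-dimensional input feature, an RBF kernel, and predicts only a scalar reward. Deriving "upper" and "lower" values from the predictive standard deviation is one plausible reading of the upper-bound and lower-bound simulated experiences, and every name here (`rbf`, `solve`, `gp_predict`, `simulate_reward`) is hypothetical.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gp_predict(train_x, train_y, query, noise=1e-6):
    """Return the GP posterior mean and standard deviation at query."""
    gram = [
        [rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
        for i, a in enumerate(train_x)
    ]
    k_star = [rbf(query, a) for a in train_x]
    mean = sum(k * w for k, w in zip(k_star, solve(gram, train_y)))
    var = rbf(query, query) - sum(
        k * v for k, v in zip(k_star, solve(gram, k_star))
    )
    return mean, math.sqrt(max(var, 0.0))


def simulate_reward(train_x, train_y, query, width=1.0):
    """S1 (sketch): mean prediction plus bounds from the predictive std."""
    mean, std = gp_predict(train_x, train_y, query)
    return {"meta": mean, "upper": mean + width * std, "lower": mean - width * std}


# S2 (sketch): store the simulated experience in a buffer for training.
sim_buffer = []
experience = simulate_reward([0.0, 1.0, 2.0], [0.0, 0.5, 1.0], query=1.5)
sim_buffer.append(experience)
```

A GP returns an uncertainty estimate alongside each prediction even with few training points, which is the property the abstract contrasts with a data-hungry DNN world model.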

Description

Technical field

[0001] The invention belongs to the technical field of machine learning, and in particular relates to a method for generating high-quality simulated experience for dialogue policy learning.

Background

[0002] Task-completion dialogue policy learning aims to build a task-oriented dialogue system that helps users complete a specific single task or multi-domain tasks through several rounds of natural language interaction. It has been widely used in chatbots and personal voice assistants such as Apple's Siri and Microsoft's Cortana.

[0003] In recent years, reinforcement learning has gradually become the mainstream method for dialogue policy learning. With reinforcement learning, the dialogue system can gradually adjust and optimize its policy through natural language interaction with the user to improve performance. However, the original reinforcement learning method requires a lot of human-computer dialogue interactions befo...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/332, G06N3/00, G06N20/00
CPC: G06N3/008, G06F16/3329, G06N20/00
Inventor: 平洋, 曹江, 方文其, 吴冠霖, 栾绍童, 闫顼
Owner: NANHU LAB