Method and system for detecting simulated user experience quality in dialog policy learning

A technique for simulated user experience and dialogue policy learning, applied in the field of machine learning. It addresses training instability, sensitivity to hyperparameter selection, and the resulting limits on dialogue learning performance, and it effectively controls the quality of simulated experience.

Active Publication Date: 2021-06-18
NANHU LAB

AI Technical Summary

Problems solved by technology

GAN training is highly unstable, which is very likely to prevent dialogue policy learning from converging, and it is highly sensitive to the choice of hyperparameters. Both issues seriously restrict the performance of dialogue learning.

Examples

Embodiment 1

[0029] As shown in Figure 1, this scheme proposes a method for detecting the quality of simulated user experience in dialogue policy learning. Dialogue policy learning for the dialogue policy model consists of two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). Direct reinforcement learning uses a Deep Q-Network (DQN) to improve the dialogue policy based on real experience. The dialogue policy model interacts with the user (User); at each step, it selects the action a that maximizes the value function Q given the observed dialogue state s. The dialogue policy model then receives the reward r and the real user's action a_r^u, and updates the current state to s'. The real experience (s, a, r, a_r^u, t) is stored in the real user experience database, where t indicates whether the dialogue has terminated.
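The direct reinforcement learning step in [0029] can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Experience`, `select_action`, and `store_real_experience`, the buffer capacity, and the representation of Q as a plain list of per-action values are all assumptions made for the example. A real DQN would compute the Q values with a neural network.

```python
from collections import deque, namedtuple

# Hypothetical container mirroring the tuple (s, a, r, a_r^u, t) from [0029].
Experience = namedtuple("Experience", ["s", "a", "r", "a_u", "t"])


def select_action(q_values):
    """Return the index of the action with the largest Q value (greedy choice)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def store_real_experience(buffer, s, a, r, a_u, t):
    """Append one real experience to the real user experience buffer."""
    buffer.append(Experience(s, a, r, a_u, t))


# Assumed capacity; the source does not specify one.
real_buffer = deque(maxlen=5000)

q_values = [0.1, 0.9, 0.3]          # stand-in for Q(s, .) produced by a DQN
action = select_action(q_values)    # picks the action with the highest Q value
store_real_experience(real_buffer, s="state_0", a=action, r=-1.0,
                      a_u="inform", t=False)
```

The buffer stores exactly the fields named in the text; the next state s' is tracked by the dialogue policy model rather than stored in the tuple, following the source.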

[0030] Maximize the value functi...

Embodiment 2

[0045] This embodiment is similar to Embodiment 1. The difference is that, in the initial stage, the dictionary world-dict contains only a limited number of actions (behaviors), so the dictionary same-dict is also very short. To warm up the world model, the simulated experience is preferably regarded as qualified whenever the length of same-dict is less than a constant C. The constant C is determined by those skilled in the art according to specific conditions and is not limited here.

[0046] Only once the length of same-dict reaches a certain value, that is, when it is greater than or equal to the constant C, is the predefined variable KL_pre used to track the KL divergence between the dictionaries real-dict and world-dict as a similarity measure.
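The warm-up rule and the KL tracking described in [0045] and [0046] can be sketched as below. Several details are assumptions, because the excerpt does not spell them out: `same_dict` is taken to map each shared key to a pair (frequency in real-dict, frequency in world-dict), frequencies are normalized over the shared keys, the divergence is computed as KL(real || world), and a simulated experience counts as qualified when the new KL value does not exceed `kl_pre`. The function names are hypothetical.

```python
import math


def kl_divergence(same_dict):
    """KL(real || world) over shared keys.

    same_dict maps key -> (real_freq, world_freq); both counts are assumed
    positive because the key appears in both dictionaries.
    """
    real_total = sum(real for real, _ in same_dict.values())
    world_total = sum(world for _, world in same_dict.values())
    kl = 0.0
    for real_freq, world_freq in same_dict.values():
        p = real_freq / real_total
        q = world_freq / world_total
        kl += p * math.log(p / q)
    return kl


def is_qualified(same_dict, kl_pre, c):
    """Return (qualified, updated kl_pre) for the current dictionaries."""
    # Warm-up phase: too few shared keys, so accept the simulated experience.
    if len(same_dict) < c:
        return True, kl_pre
    kl = kl_divergence(same_dict)
    # Assumed acceptance rule: qualified if the divergence did not grow.
    if kl <= kl_pre:
        return True, kl
    return False, kl_pre
```

For example, `is_qualified({"inform": (2, 2)}, kl_pre=0.5, c=5)` accepts the experience without computing any divergence, because only one key is shared and the warm-up threshold of 5 has not been reached.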

Embodiment 3

[0048] This embodiment provides a system for detecting the quality of simulated user experience in dialogue policy learning, which implements the method of Embodiment 1 or Embodiment 2. The system comprises a quality detector connected to the world model, the real user experience library, and the dialogue policy model. The quality detector includes a KL divergence detector, which checks the quality of the simulated experience generated by the world model against the real experience generated by real users.

[0049] Specifically, the quality detector includes a dictionary real-dict for storing real experience, a dictionary world-dict for storing simulated experience, and a dictionary same-dict that stores, for each primary key in the intersection of real-dict and world-dict, its frequency values in the two dictionaries.
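One possible in-memory layout for the three dictionaries in [0049] is sketched here. It assumes that the primary keys are user action labels and that each dictionary stores occurrence counts; the excerpt does not state either point, and the helper name `build_same_dict` and the sample action labels are hypothetical.

```python
from collections import Counter


def build_same_dict(real_dict, world_dict):
    """Map each key present in both dictionaries to (real count, world count)."""
    shared_keys = real_dict.keys() & world_dict.keys()
    return {key: (real_dict[key], world_dict[key]) for key in shared_keys}


# Counts of user actions seen in real and in simulated experience (made-up data).
real_dict = Counter(["inform", "request", "inform", "thanks"])
world_dict = Counter(["inform", "request", "request", "deny"])

same_dict = build_same_dict(real_dict, world_dict)
```

Keys that occur in only one dictionary ("thanks" and "deny" here) are left out of `same_dict`, so its length grows only as the world model starts producing actions that real users have also produced.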

Abstract

The invention provides a method and a system for detecting simulated user experience quality in dialogue policy learning. The method comprises the following steps: S1, generating simulated experience with a world model; S2, performing quality detection on the simulated experience with a quality detector based on KL divergence; and S3, storing the simulated experience that passes quality detection for dialogue policy model training. By introducing a quality detector based on KL divergence, the scheme evaluates the quality of simulated experience more easily and effectively, greatly improves computational efficiency while preserving the robustness and effectiveness of the dialogue policy, and effectively controls the quality of the simulated experience.
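Steps S1 to S3 amount to a generate, check, store loop, which can be sketched as below. The function name `filter_simulated_experience` and its parameters are hypothetical; `generate` stands in for the world model and `detect` for the KL-divergence quality detector, both passed in as plain callables so that the sketch stays independent of any particular model.

```python
def filter_simulated_experience(generate, detect, n_samples):
    """Run S1-S3 for n_samples candidate experiences and return the kept ones."""
    buffer = []
    for _ in range(n_samples):
        experience = generate()          # S1: world model produces an experience
        if detect(experience):           # S2: quality detector checks it
            buffer.append(experience)    # S3: keep it for policy training
    return buffer


# Usage with stand-in callables: a scripted generator and a toy detector.
candidates = iter(["inform", "noise", "request", "noise"])
kept = filter_simulated_experience(
    generate=lambda: next(candidates),
    detect=lambda experience: experience != "noise",
    n_samples=4,
)
```

Experiences rejected in S2 are simply discarded in this sketch, so only qualified simulated experience reaches the buffer used for training.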

Description

Technical field

[0001] The invention belongs to the technical field of machine learning, and in particular relates to a method and system for detecting the quality of simulated user experience in dialogue policy learning.

Background

[0002] Task-completion dialogue policy learning aims to build a task-oriented dialogue system that helps users complete a specific single task or multi-domain tasks through several rounds of natural language interaction. It has been widely used in chatbots and personal voice assistants such as Apple's Siri and Microsoft's Cortana.

[0003] In recent years, reinforcement learning has gradually become the mainstream method for dialogue policy learning. With reinforcement learning, the dialogue system can gradually adjust and optimize its policy through natural language interaction with the user to improve performance. However, the original reinforcement learning method requires a lot of human-computer dialogue in...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/332, G06F16/36, G06N3/00, G06N20/00
CPC: G06N3/008, G06F16/3329, G06F16/374, G06N20/00
Inventor: 曹江, 吴冠霖, 方文其, 平洋, 栾绍童, 闫顼
Owner NANHU LAB