Dialogue policy acquisition method and device, and related equipment

By simulating K-segment single-action dialogues in a multi-action dialogue strategy, and utilizing discrete policy models and world models, the problem of poor generalization effect of multi-action dialogue strategies in real-world scenarios is solved, achieving efficient dialogue strategy planning and reducing deployment costs.

CN116431771B · Active · Publication Date: 2026-06-16 · CHINA MOBILE COMM LTD RES INST +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MOBILE COMM LTD RES INST
Filing Date
2021-12-31
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing multi-action dialogue strategy learning methods have poor generalization performance in real-world scenarios and are difficult to effectively handle unfamiliar human-computer dialogues. Furthermore, robust learning methods suffer from high deployment costs and unstable training processes.

Method used

By acquiring information about the current dialogue state, the first latent vector is obtained, and K single-action dialogues are simulated. The dialogue state transition is modeled in a compact latent state space using discrete policy models and world models, and multi-action decision-making methods are deeply explored to improve the planning efficiency and accuracy of dialogue strategies.

Benefits of technology

Modeling dialogue state transitions in a compact hidden state space improves the generalization effect of multi-action dialogue strategies in real-world scenarios, reduces deployment and subsequent maintenance costs, and avoids additional manual annotation and environment modeling.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a dialogue strategy acquisition method and device and related equipment, wherein the method comprises: acquiring first information representing a current dialogue state; obtaining a first latent vector corresponding to the current dialogue state based on the first information; and simulating K segments of single-action dialogue according to the first latent vector to obtain a dialogue strategy. The method provided in the embodiments of this application improves the generalization of MADPL to real-world scenarios.

Description

Technical Field

[0001] This application relates to the field of natural language processing technology, and in particular to a method, apparatus and related equipment for obtaining dialogue strategies.

Background Technology

[0002] Task-oriented dialogue systems (TDS) help users complete specific tasks (such as restaurant reservations) using natural language and have been applied to various business services. To address the problem of verbose dialogues in traditional dialogue strategies, research has emerged on Multi-Action Dialog Policy Learning (MADPL).

[0003] In related technologies, MADPL directly imitates and learns action combinations in human dialogue datasets. However, the action combinations in human dialogue datasets are limited, and potential action combinations are often not covered in human dialogue datasets, resulting in poor generalization of MADPL to real-world scenarios.

Summary of the Invention

[0004] This application provides a method, apparatus, and related equipment for obtaining dialogue strategies, which solves the problem that MADPL has poor generalization effect on real-world scenarios.

[0005] To achieve the above objectives, in a first aspect, embodiments of this application provide a method for obtaining a dialogue strategy, comprising:

[0006] Obtain first information that represents the current dialogue state;

[0007] Based on the first information, obtain the first hidden vector corresponding to the current dialogue state;

[0008] Based on the first latent vector, simulate K segments of single-action dialogue to obtain the dialogue strategy, where K is a positive integer.

[0009] In a second aspect, embodiments of this application provide a dialogue strategy acquisition device, comprising:

[0010] The acquisition module is used to acquire the first information that represents the current dialogue state;

[0011] The first determining module is used to obtain the first hidden vector corresponding to the current dialogue state based on the first information;

[0012] The second determining module is used to simulate K segments of single-action dialogue based on the first latent vector to obtain a dialogue strategy, where K is a positive integer.

[0013] In a third aspect, embodiments of this application provide an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the dialogue policy acquisition method described in the first aspect.

[0014] In a fourth aspect, embodiments of this application provide a readable storage medium storing a program that, when executed by a processor, implements the steps of the dialogue policy acquisition method described in the first aspect.

[0015] In this embodiment, by obtaining first information representing the current dialogue state, and based on the first information, obtaining the first latent vector corresponding to the current dialogue state, and then simulating K segments of single-action dialogue based on the first latent vector, a dialogue strategy is obtained. In this way, by simulating K segments of single-action dialogue based on the first latent vector, dialogue state transitions can be modeled in a compact latent state space, improving the efficiency and accuracy of planning, thereby enhancing the generalization effect of MADPL on real-world scenarios.

Description of the Drawings

[0016] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings are described below. Obviously, the following drawings are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the listed drawings without creative effort.

[0017] Figure 1 is a flowchart comparing traditional dialogue strategies and multi-action dialogue strategies;

[0018] Figure 2 is the first flowchart of the dialogue strategy acquisition method provided in the embodiments of this application;

[0019] Figure 3 is the second flowchart of the dialogue strategy acquisition method provided in the embodiments of this application;

[0020] Figure 4 is a schematic diagram of the structure of the dialogue strategy acquisition device provided in the embodiments of this application;

[0021] Figure 5 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0022] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0023] To better understand the method provided in this application, a brief introduction to the background of the method is given below.

[0024] Task-oriented dialogue systems (TDS) help users complete specific tasks (such as restaurant reservations) using natural language and have been applied to various business services. In TDS, the dialogue strategy determines the next dialogue action based on the current dialogue state and significantly impacts dialogue efficiency (such as conciseness and fluency). See Figure 1, a flowchart comparing traditional dialogue strategies and multi-action dialogue strategies. Traditional dialogue strategies typically assume that the system generates only one dialogue action per turn in response to the current dialogue state, which leads to lengthy dialogues and increases the risk of task failure. To address these weaknesses, research has emerged on multi-action dialogue strategies (Multi-Action Dialog Policy Learning, MADPL), which improve the expressiveness and efficiency of TDS by simultaneously generating multiple actions as the response to the current state.

[0025] Currently, MADPL typically has two main technical approaches: supervised learning methods and robust learning methods.

[0026] Supervised learning methods learn multi-action dialogue strategies by mimicking action combinations from stored human dialogue samples in human dialogue datasets. These methods typically employ basic networks, such as feedforward fully connected neural networks and recurrent neural networks, to decode and predict multiple dialogue actions from the current dialogue state. However, task-oriented dialogues have a "one-to-many" characteristic, meaning that multiple suitable responses may exist within the same dialogue scenario. Therefore, under a multi-action setting, different action combinations can be considered as responses within the same dialogue scenario, and these potentially reasonable action combinations are often not covered in human dialogue datasets. Consequently, existing methods based on directly mimicking human dialogue samples in human dialogue datasets often only explore a small subspace of the entire action space, making it difficult to effectively handle unseen real-world human-computer dialogues. This results in poor robustness and leads to the poor generalization performance of MADPL in real-world scenarios.

[0027] Robust learning methods address the poor generalization performance of MADPL to real-world scenarios from three perspectives: 1) active learning, using human-computer interaction for data augmentation; 2) reinforcement learning, using human-computer interaction for further learning; and 3) adversarial learning, using a neural network-based discriminator to determine response quality and provide additional supervision. However, the performance improvements achieved through these methods come at the cost of additional supervision. For example, significant human labor is required to label dialogues, construct real-world environmental systems, and design complex learning strategies. Furthermore, algorithms based on reinforcement learning and adversarial learning suffer from training instability, introducing additional deployment costs in practical applications. Therefore, while robust learning methods can solve the problem of MADPL's poor generalization performance to real-world scenarios, they suffer from high deployment costs and difficulty in maintenance.

[0028] To better address the poor generalization performance of current MADPL in real-world scenarios, this application provides a dialogue policy acquisition method. Rather than directly imitating action combinations, this method makes deeper use of the dataset annotation information. By fully exploring the decision-making process behind multiple actions, it generalizes to real-world scenarios, thus improving MADPL's generalization performance. Furthermore, the dialogue policy acquisition method provided in this application achieves efficient joint training of its modules. While improving generalization to real-world scenarios, it minimizes the need for additional manual annotation and environment modeling, reducing deployment and subsequent maintenance costs.

[0029] The core idea of the dialogue strategy acquisition method provided in this application is to imagine in advance the dialogue content of each single-action dialogue and use it as auxiliary information to enhance the prediction of multiple actions. First, considering only single-action dialogues, the system-side single-action prediction and the user-side response pattern to a single action are modeled, allowing the model to simulate the single-action dialogue process. By simulating single-action dialogues, the content to be expressed can be prepared before multiple actions are predicted. Then, based on the simulated single-action dialogues, the dialogue strategy is obtained. This is described in detail below.

[0030] See Figure 2, which is the first flowchart of the dialogue strategy acquisition method provided in the embodiments of this application.

[0031] As shown in Figure 2, the method provided in this embodiment may include the following steps:

[0032] Step 101: Obtain first information to represent the current dialogue state;

[0033] The current dialogue state can be considered a summary of all user interactions with the system prior to the current response. Optionally, the first information includes at least one of the following: the representation result of the returned entity from the query; the last system action prior to the current dialogue state; the last user action prior to the current dialogue state; the state of the user's request slot; and the state of the system's notification slot. The states of the user's request slot and the system's notification slot are determined by the actual situation. For example, if the user inputs "What's the weather like today?", then the state of the user's request slot can be set to two slots: time and location, and the state of the system's notification slot can be set to one slot: weather conditions.

[0034] Step 102: Based on the first information, obtain the first hidden vector corresponding to the current dialogue state;

[0035] In practice, a structured dialogue state $s_t$ is first constructed based on the first information, where $t$ is the current dialogue round and is a positive integer. The structured dialogue state $s_t$ can be a 553-dimensional binary vector. A fully connected feedforward network (FFN) can be used to extract the dense latent state features, i.e., the first latent vector $h_t$. The FFN can consist of two linear transformations with a Rectified Linear Unit (ReLU) activation function in between.

[0036] Specifically, $h_t = \mathrm{FFN}_{enc}(s_t) = \max(0, s_t W_1 + b_1) W_2 + b_2$. Let the current dialogue state be the initial dialogue state, i.e., $t = 0$. Based on the first information, a structured dialogue state representation $s_0$ is constructed, and the first latent vector $h_0$ used to describe the initial dialogue state is obtained.
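The encoder formula above, two linear transformations with a ReLU in between, can be sketched in plain Python as follows. This is a minimal illustration only: the hidden width (16), latent width (8), random initialization, and the sparse binary input are assumptions made for the example and are not taken from the application.

```python
import random

def linear(x, W, b):
    """Return x W + b for a row vector x; W is a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def ffn_enc(s, W1, b1, W2, b2):
    """h = max(0, s W1 + b1) W2 + b2."""
    return linear(relu(linear(s, W1, b1)), W2, b2)

def init_matrix(rows, cols, rng):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
STATE_DIM, HIDDEN_DIM, LATENT_DIM = 553, 16, 8  # 16 and 8 are illustrative
W1, b1 = init_matrix(STATE_DIM, HIDDEN_DIM, rng), [0.0] * HIDDEN_DIM
W2, b2 = init_matrix(HIDDEN_DIM, LATENT_DIM, rng), [0.0] * LATENT_DIM

# Hypothetical structured dialogue state: a sparse binary vector.
s_t = [float(rng.random() < 0.05) for _ in range(STATE_DIM)]
h_t = ffn_enc(s_t, W1, b1, W2, b2)  # dense latent vector of length LATENT_DIM
```

In a real system the weights would be learned jointly with the other modules rather than drawn at random.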

[0037] Step 103: Simulate K segments of single-action dialogue based on the first latent vector to obtain the dialogue strategy, where K is a positive integer.

[0038] In practical implementation, a single-action dialogue planning module is proposed to model the co-occurrence patterns of each action in the dialogue context by utilizing actions from human dialogue samples already stored in the human dialogue dataset. K segments of single-action dialogue are simulated based on the first latent vector to obtain the dialogue policy. By simulating K segments of single-action dialogue based on the first latent vector, dialogue state transitions can be modeled in a compact latent state space, improving the efficiency and accuracy of planning, thereby enhancing the generalization effect of MADPL on real-world scenarios.

[0039] In this embodiment, by obtaining first information representing the current dialogue state, and based on the first information, obtaining the first latent vector corresponding to the current dialogue state, and then simulating K segments of single-action dialogue based on the first latent vector, a dialogue strategy is obtained. In this way, by simulating K segments of single-action dialogue based on the first latent vector, dialogue state transitions can be modeled in a compact latent state space, improving the efficiency and accuracy of planning, thereby enhancing the generalization effect of MADPL on real-world scenarios.

[0040] Optionally, step 103 includes:

[0041] The first latent vector is input into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogues, thereby obtaining K target latent vectors and K sets of action sequences. The first model is a discrete policy model and the second model is a world model that simulates user behavior. The K sets of action sequences include at least one action predicted by the first model.

[0042] The dialogue strategy is obtained based on the first latent vector, the K target latent vectors, and the K sets of action sequences.

[0043] For specific implementation, refer to Figure 3, which is the second flowchart of the dialogue policy acquisition method provided in the embodiments of this application. Figure 3 shows the specific process of single-action dialogue planning in step 202. Taking one single-action dialogue among the K simulated single-action dialogues as an example, the following explains how to simulate K single-action dialogues based on the first latent vector and obtain the dialogue strategy.

[0044] The first latent vector $h_0$, used to describe the initial dialogue state, is input into the discrete policy model and the world model. The discrete policy model and the world model interact for several dialogue rounds in the single-action dialogue mode. That is, each single-action dialogue segment in the K segments predicts several dialogue rounds forward based on the current dialogue state, ultimately obtaining the target latent vector $h^{(1)}$ and the action sequence $(a_1^{(1)}, a_2^{(1)}, \dots)$ corresponding to that single-action dialogue segment. In each of the dialogue rounds predicted forward, the first model predicts one action, and the actions predicted across these rounds constitute the action sequence $(a_1^{(1)}, a_2^{(1)}, \dots)$.

[0045] The other single-action dialogue segments in the K segments are simulated in the same way, ultimately yielding K target latent vectors $h^{(1)}, h^{(2)}, h^{(3)}, \dots, h^{(K)}$ and K sets of action sequences.

[0046] In this embodiment, by inputting the first latent vector into the first model and the second model respectively, each single-action dialogue segment in K single-action dialogues is simulated to obtain K target latent vectors and K sets of action sequences. Then, based on the first latent vector, the K target latent vectors, and the K sets of action sequences, a dialogue strategy is obtained. In this way, several dialogue rounds can be predicted forward based on the current dialogue state to obtain additional information beyond the human dialogue dataset, such as the K target latent vectors and the K sets of action sequences, to determine the dialogue strategy. This allows for deep utilization of limited human dialogue datasets, enabling generalization to unknown human-computer dialogues, thereby improving the generalization effect of MADPL on real-world scenarios.

[0047] Optionally, the step of inputting the first latent vector into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogue segments, and obtaining K target latent vectors and K sets of action sequences, includes:

[0048] The first latent vector is input into the first model to obtain the first action, and the K sets of action sequences include the first action;

[0049] The first hidden vector and the first action are respectively input into the second model to obtain the second hidden vector. The second hidden vector corresponds to the future dialogue state. The future dialogue state is the dialogue state after the current dialogue state predicted by the second model based on the first hidden vector and the first action.

[0050] Compare the first hidden vector and the second hidden vector to obtain the comparison result;

[0051] If the comparison result meets the preset rules, the second latent vector obtained by the second model will be determined as one of the K target latent vectors.

[0052] For specific implementation, refer to Figure 3, which shows the specific process of single-action dialogue planning in step 202. Taking one single-action dialogue among the K simulated single-action dialogues as an example, the following explains how to obtain the target latent vector and the action sequence.

[0053] Input the first latent vector $h_0$ into the first model (DP) to obtain the first action $a_1^{(1)}$. Then input the first latent vector $h_0$ and the first action $a_1^{(1)}$ into the second model (World) to obtain the second latent vector $h_1^{(1)}$. Compare the initial latent vector $h_0$ with the latent vector $h_n^{(1)}$ of the current dialogue round (at this point, $n = 1$) to decide whether to stop planning.

[0054] If the comparison result does not conform to the preset rules, i.e., planning does not stop, the input latent vector is updated and the above steps are repeated. That is, the first latent vector is updated to $h_1^{(1)}$ and input into the first model (DP) to obtain the first action $a_2^{(1)}$; then the first latent vector $h_1^{(1)}$ and the first action $a_2^{(1)}$ are input into the second model (World) to obtain the second latent vector $h_2^{(1)}$. Note that the comparison still uses the first latent vector before the update, i.e., the initial latent vector $h_0$, together with the latent vector $h_n^{(1)}$ of the current dialogue round (at this point, $n = 2$); whether to stop planning is decided based on the comparison result.

[0055] This repeats until the comparison result meets the preset rules. The last second latent vector obtained by the second model is then determined as the target latent vector $h^{(1)}$, one of the K target latent vectors; $h^{(1)}$ can be regarded as a summary of this single-action dialogue segment.

[0056] The steps described above are explained in detail below with a specific example. Take predicting one dialogue turn forward in this single-action dialogue segment as an example, that is, moving from the latent state at step $n$ to the latent state at step $n+1$ of the $k$-th single-action dialogue in dialogue round $t$, where $k$ is the single-action dialogue index, $t$ is the dialogue round, and $n$ is a positive integer. For clarity, the single-action dialogue index $k$ and the dialogue round $t$ are omitted. Given the latent dialogue state $h_n$, an action $a_n$ is first predicted, and $a_n$ is then used to update $h_n$ to $h_{n+1}$. This can be achieved through a cyclical interaction between the discrete policy model (DP) and the world model, i.e.:

[0057] $a_n = \mathrm{DP}(h_n) = \mathrm{GumbelSoftmax}(\mathrm{Linear}(h_n);\ T_d)$

[0058] $h_{n+1} = \mathrm{World}(h_n, a_n) = \mathrm{GRU}(h_n, \mathrm{Emb}(a_n))$

[0059] Here, $a_n$ is an integer. DP is a single linear layer followed by a Gumbel-Softmax function. The Gumbel-Softmax function samples a single dialogue action probabilistically from the categorical distribution, which improves the diversity of the planned paths (i.e., the single-action dialogues, referring to their state transition paths in the latent space). $T_d$ is the temperature, used to balance the approximation bias against the gradient variance. A classic GRU is used as the world model to model the latent state transition patterns. $\mathrm{Emb}$ is an embedding layer that returns the latent vector of a given single dialogue action $a_n$.
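As an illustration of the discrete policy step described above (a linear layer followed by Gumbel-Softmax sampling at temperature $T_d$), the following plain-Python sketch draws one action from a latent vector. The weights, dimensions, and the helper names `gumbel_noise`, `gumbel_softmax`, and `dp_step` are hypothetical and chosen only for this example.

```python
import math
import random

def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return -math.log(-math.log(u))

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + g) / T)."""
    noisy = [(l + gumbel_noise(rng)) / temperature for l in logits]
    m = max(noisy)                          # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def dp_step(h, W, b, temperature, rng):
    """Linear layer + Gumbel-Softmax; returns (action index, relaxed sample)."""
    logits = [sum(hi * W[i][j] for i, hi in enumerate(h)) + b[j]
              for j in range(len(b))]
    y = gumbel_softmax(logits, temperature, rng)
    return max(range(len(y)), key=y.__getitem__), y

rng = random.Random(0)
h_n = [0.2, -0.1, 0.4]                      # toy 3-dimensional latent state
W = [[0.5, -0.3, 0.1, 0.0],
     [0.2, 0.4, -0.6, 0.1],
     [-0.1, 0.3, 0.2, 0.7]]                 # toy weights for 4 actions
b = [0.0, 0.0, 0.0, 0.0]
a_n, y = dp_step(h_n, W, b, temperature=0.5, rng=rng)
```

A lower temperature pushes the relaxed sample `y` toward a one-hot vector, while a higher temperature smooths it; this is the bias/variance trade-off that $T_d$ controls.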

[0060] Executing multiple actions simultaneously and executing the corresponding single actions sequentially result in similar (or identical) dialogue state transitions. Therefore, it is assumed that once a sufficient amount of state transition has been reached, the predicted information is sufficient, and planning stops. Based on this, whether to stop planning is determined by comparing the initial latent state $h_0$ with the current latent state $h_n$. This can also be modeled using a neural network, i.e.:

[0061] $c_n = \mathrm{FFN}(h_0 : h_n)$

[0062] where $c_n$ is a binary variable, ":" indicates vector concatenation, and FFN is a 2-layer fully connected feedforward network with a ReLU activation function in the middle layer.
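The control flow of one planned path (predict an action, update the latent state, then compare against the initial state to decide whether to stop) can be sketched as below. The three callables are toy stand-ins, not the learned DP, GRU world model, or stop network; the `max_steps` cap and the distance threshold of 1.2 are likewise assumptions added so the example is deterministic and always terminates.

```python
def plan_single_action_dialogue(h0, dp, world, stop, max_steps=10):
    """Roll out one simulated single-action dialogue in latent space.

    dp(h) -> action index
    world(h, a) -> next latent vector
    stop(h0, h) -> True when planning should end
    Returns (target latent vector, action sequence).
    """
    h, actions = h0, []
    for _ in range(max_steps):
        a = dp(h)
        h = world(h, a)
        actions.append(a)
        if stop(h0, h):  # always compared against the initial h0
            break
    return h, actions

# Toy stand-ins (hypothetical): each latent component is a pending topic.
def toy_dp(h):
    """Pick the index of the largest component."""
    return max(range(len(h)), key=h.__getitem__)

def toy_world(h, a):
    """Mark the chosen topic as handled by zeroing its component."""
    return h[:a] + [0.0] + h[a + 1:]

def toy_stop(h0, h):
    """Stop once the state has moved far enough from h0 (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(h0, h)) >= 1.2

h0 = [0.9, 0.5, 0.1]
h_target, actions = plan_single_action_dialogue(h0, toy_dp, toy_world, toy_stop)
```

Running the same function K times with a stochastic policy would yield the K target latent vectors and K action sequences used in the later decoding step.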

[0063] In this embodiment, by inputting a first latent vector into a first model to obtain a first action, and K action sequences including the first action, and inputting the first latent vector and the first action into a second model to obtain a second latent vector, the decision-making process for combining dialogue actions can be decoupled and refined. First, the model simulates single-action dialogue to determine the dialogue topic and related action content, and then combines actions based on this. This allows for deep utilization of limited data, enabling generalization to unknown human-computer dialogues and improving the generalization effect of MADPL on real-world scenarios.

[0064] Furthermore, by jointly modeling the user side and the system side, single-action dialogue can be simulated. Before multi-action prediction, the dialogue content can be simulated by predicting several dialogue turns ahead based on the current dialogue state, thereby realizing context-based action co-occurrence pattern modeling and enhancing subsequent multi-action prediction.

[0065] Optionally, obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes:

[0066] The K target latent vectors are input into a third model to obtain K target states. The third model is a recovery model implemented by a fully connected feedforward network that can map the latent vectors to a structured state space. The dialogue strategy includes the K target states.

[0067] For specific implementation, refer to Figure 3. To learn the dialogue state transitions in the latent space, a recovery model maps the initial and final planned latent vectors $h_0$ and $h_N$ back into the structured state space to match the dialogue states $s_t$ and $s_{t+1}$ of the current and next dialogue rounds, respectively.

[0068] $s_t = \mathrm{Recover}(h_0)$

[0069] $s_{t+1} = \mathrm{Recover}(h_N)$

[0070] The recovery model, Recover, can be implemented using an FFN. Recover and the state encoder together form an autoencoder that ensures correspondence with the structured dialogue state.

[0071] Optionally, obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes:

[0072] Based on the first latent vector and the K target latent vectors, K probability distributions corresponding one-to-one with the K action sequences are determined. The K probability distributions are used to describe whether each action in the K action sequences is selected. The dialogue strategy includes the K probability distributions.

[0073] In specific implementation, given the initial and final latent vectors $h_0$ and $h^{(k)}$, a neural network is used to decode the multi-action probabilities, and this is applied to each path, i.e., to each segment of single-action dialogue. Since the task modeling provided in this embodiment assumes no dependency between dialogue actions, each specific action in the action sequence is binary-classified to determine whether the action is selected. The decoder can be constructed using a set of FFNs:

[0074] $p_m^{(k)} = \mathrm{FFN}_m(h_0 : h^{(k)}), \quad m = 1, \dots, M$

[0075] $p^{(k)} = (p_1^{(k)}, \dots, p_M^{(k)})$

[0076] where ":" indicates vector concatenation, $k$ represents the planned path number, i.e., the segment number of a single-action dialogue, and $M$ is the size of the action space, i.e., the number of actions in the action sequence.

[0077] Specifically, $p^{(k)}$ is the probability distribution corresponding to the $k$-th action sequence, which contains $M$ actions, where $m$ and $M$ are positive integers. $p_m^{(k)}$ represents the probability of the $m$-th action in the action sequence of the $k$-th single-action dialogue. For example, when $k = 1$, $p_1^{(1)}$ indicates the probability of action $a_1^{(1)}$: if $p_1^{(1)}$ is 0, action $a_1^{(1)}$ is not selected; if $p_1^{(1)}$ is 1, action $a_1^{(1)}$ is selected.
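The per-action binary decoding described above can be illustrated with a small plain-Python sketch. For brevity, each per-action classifier here is a single linear unit followed by a sigmoid, which is a simplifying assumption; the application describes a set of FFNs. The weights and the helper name `decode_path` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_path(h0, h_k, heads):
    """Return one selection probability per action for a planned path.

    heads[m] = (weights, bias) is a stand-in binary classifier for action m,
    applied to the concatenation h0 : h_k.
    """
    x = h0 + h_k  # list concatenation plays the role of vector concatenation
    return [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for weights, bias in heads]

h0 = [1.0, 0.0]
h_k = [0.0, 1.0]
heads = [
    ([0.0, 0.0, 0.0, 0.0], 0.0),      # undecided action: probability 0.5
    ([10.0, 0.0, 0.0, 10.0], 0.0),    # strongly selected action
    ([-10.0, 0.0, 0.0, -10.0], 0.0),  # strongly rejected action
]
p_k = decode_path(h0, h_k, heads)
```

Because each action has its own classifier, selecting one action does not constrain the others, which matches the stated assumption of no dependency between dialogue actions.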

[0078] In this embodiment, by determining K probability distributions corresponding one-to-one with the K action sequences based on the first latent vector and the K target latent vectors, the simulated K segments of single-action dialogue can be decoded and then aggregated, effectively avoiding the impact of dialogue planning errors on the dialogue strategy. During decoding, the action sequence is treated as a set, and multiple binary classification models, used in place of the original multi-label classification model, decide whether each action in the action sequence is selected, thereby effectively decoupling the actions in the action space and reducing the influence of noise between actions.

[0079] Optionally, after determining the K probability distributions corresponding one-to-one with the K action sequences based on the first latent vector and the K target latent vectors, the method further includes:

[0080] The K probability distributions are integrated using an aggregation function to obtain an aggregation result, wherein the aggregation function includes a function implemented using the average value;

[0081] The aggregation result is sampled to obtain a target action group for responding to the current dialogue state, and the dialogue strategy includes the target action group.

[0082] In practical implementation, to effectively reduce the impact of bad paths, i.e., bad single-action dialogues, an aggregation function can be used to integrate K probability distributions corresponding to K action sequences, i.e.:

[0083] $P_t = \mathrm{Aggr}(p^{(1)}, \dots, p^{(K)})$

[0084] In this embodiment, the aggregation function is implemented using the average value. Task-oriented dialogues have an important "one-to-many" nature, where different actions can be taken in the same context, requiring the incorporation of random factors into the dialogue strategy. Furthermore, this embodiment empirically verifies that such randomness can be achieved using basic sampling methods. In this embodiment, the Gumbel-Sigmoid function can be used to sample multiple actions, i.e.

[0085] $A_t = \mathrm{GumbelSigmoid}(P_t) = \dfrac{\exp((P_t + g_1)/T)}{\exp((P_t + g_1)/T) + \exp(g_2/T)}$

[0086] Here, the Gumbel-Sigmoid function is a modification of the Gumbel-Softmax function that treats the sigmoid as a softmax over two logits, $p$ and 0. $T$ is the temperature factor, and $g_1$ and $g_2$ are two samples drawn from Gumbel noise. $A_t$ is the target action group.
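The aggregation and sampling steps can be sketched together as follows. The sketch averages the K per-path score vectors and then applies a Gumbel-Sigmoid, written as a two-way softmax over the logits $(p, 0)$ with independent Gumbel noise on each. Reading the relaxed sample as "selected" when it exceeds 0.5 is an assumption made here to turn it into a discrete action group; the function names are hypothetical.

```python
import math
import random

def aggregate_mean(paths):
    """Aggr: element-wise average of K per-path score vectors."""
    k = len(paths)
    return [sum(col) / k for col in zip(*paths)]

def gumbel(rng):
    """Draw one Gumbel(0, 1) sample."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return -math.log(-math.log(u))

def gumbel_sigmoid(p, temperature, rng):
    """Softmax over the two logits (p, 0) with Gumbel noise g1 and g2."""
    a = (p + gumbel(rng)) / temperature
    b = gumbel(rng) / temperature
    m = max(a, b)                       # subtract max for stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

def sample_action_group(paths, temperature, rng, threshold=0.5):
    """Aggregate K paths, sample each action, return selected indices."""
    P_t = aggregate_mean(paths)
    soft = [gumbel_sigmoid(p, temperature, rng) for p in P_t]
    return [m for m, v in enumerate(soft) if v > threshold]

rng = random.Random(0)
paths = [[100.0, -100.0], [100.0, -100.0]]  # K = 2 paths, M = 2 actions
A_t = sample_action_group(paths, temperature=1.0, rng=rng)
```

With scores this extreme the noise cannot flip the outcome, so only action 0 is selected; with scores near zero, repeated calls would return different action groups, which is the randomness the "one-to-many" property calls for.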

[0087] In this embodiment, an aggregation function is used to integrate K groups of probability distributions to obtain an aggregation result. The aggregation result is then sampled to obtain a target action group for answering the current dialogue state. The dialogue strategy includes the target action group, which can effectively reduce the impact of bad single-action dialogues in K segments of single-action dialogues. Furthermore, the influence of random factors can be considered in the dialogue strategy, improving the generalization ability and robustness of the dialogue strategy for human-computer dialogue in real business scenarios.

[0088] The embodiments of this application also provide joint training with multi-task objectives for the models used in the embodiments, such as the first model, the second model, and the recovery model, to determine the values of the variable parameters in the models. Each task is a supervised learning task, and the tasks provided in the embodiments of this application include Task 1, Task 2, and Task 3.

[0089] Task 1: Discrete Act Prediction (DAP). For a planned sequence of single actions $a = (a_0, \dots, a_{N-1})$, the goal is to maximize the log-likelihood (MLE) of the joint probability $p(a \mid h_0)$, which can be decomposed into:

[0090]

[0091] Where θ and φ are the trainable parameters of the discrete dialogue policy model and the world model, respectively.

[0092] Task 2: Stop Flag Prediction (SFP). Similar to Task 1, an objective is defined for predicting the stop flag sequence

[0093] $c = (c_1, \dots, c_N)$: the MLE of the joint probability $p(c \mid h_0)$, which can be decomposed as:

[0094]

[0095] where $\gamma$ denotes the trainable parameters of the stop prediction model, and $p_{\phi,\theta}(h_n \mid h_{n-1})$ is the joint probability of state transition and discrete action prediction, which decomposes into $p_\phi(h_n \mid a_{n-1}, h_{n-1})\, p_\theta(a_{n-1} \mid h_{n-1})$.

[0096] Task 3: State Recovery (SR). A state recovery objective is used to supervise state encoding and state transitions. More precisely, the current dialogue state s_t and the next dialogue state s_{t+1} are predicted from the initial hidden vector h0 and the last hidden vector h_N, respectively, that is:

[0097]

[0098]

[0099] Where η denotes the trainable parameters of the state encoder, and the recovery model has its own trainable parameters. The joint probability p_{φ,θ}(h_n | h_{n-1}) is interpreted in the same way as in Task 2.

[0100] It is important to note that due to the diversity of human dialogue, no prior assumptions are made regarding sequential dependencies between dialogue actions. Instead, it can be assumed that for any sequence of single actions in a multi-action context, there will always exist a plausible single-action dialogue process corresponding to the real world. To this end, the single-action dialogue planning module is trained using all sequential combinations of expert instances. In practice, this is achieved by randomly shuffling the action sequences in each batch before forward propagation.
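As a non-limiting illustration, the random shuffling of action sequences within a batch can be sketched as follows. The sketch assumes that each training instance carries its expert actions as a list of labels; the function name `shuffle_action_orders` and the example labels are hypothetical.

```python
import random
from typing import List, Sequence


def shuffle_action_orders(batch: Sequence[Sequence[str]], rng: random.Random) -> List[List[str]]:
    """Return a copy of the batch with each instance's action order permuted."""
    shuffled = []
    for actions in batch:
        order = list(actions)  # copy so the expert data stays untouched
        rng.shuffle(order)
        shuffled.append(order)
    return shuffled


# Hypothetical usage: reshuffle before every forward pass.
batch = [["inform", "request", "bye"], ["greet"]]
shuffled_batch = shuffle_action_orders(batch, random.Random(1))
```

Because a fresh permutation is drawn each time a batch is formed, repeated epochs expose the planning module to many orderings of the same expert action set without enumerating all of them explicitly.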

[0101] The dialogue policy acquisition method provided in this application adopts a supervised multi-task learning approach to deeply mine and utilize the labeled information in the dataset without introducing additional data sources. This greatly alleviates the manual workload and ensures stable training, facilitating deployment and updates for actual industry applications. In other words, the dialogue policy acquisition method provided in this application improves the generalization effect on real-world scenarios while minimizing additional manual annotation and environment modeling, thus reducing deployment and subsequent maintenance costs.

[0102] Referring to Figure 3, the following example illustrates the dialogue policy acquisition method provided in this application.

[0103] The dialogue policy acquisition method provided in this application includes:

[0104] Step 201: Encode the input current state into a latent vector h0. Specifically, construct a structured dialogue state representation St. Based on the dialogue state representation St, a fully connected feedforward network (FFN) is used to extract dense latent state features, where t is the current dialogue turn. The dialogue state representation St includes four types of information: the representation result corresponding to the entity returned by the query; the last user action; the last system action; and the signaling state containing the request slots and notification slots of the user and the system.
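As a non-limiting illustration, Step 201 can be sketched as a concatenation of the four information types followed by a single fully connected layer. The sketch assumes that each information type is already available as a numeric vector. The tanh activation, the random placeholder weights, and the names `build_state` and `FeedForwardEncoder` are assumptions made for illustration only.

```python
import math
import random
from typing import List, Sequence


def build_state(
    entity_repr: Sequence[float],
    last_user_act: Sequence[float],
    last_system_act: Sequence[float],
    slot_state: Sequence[float],
) -> List[float]:
    """Concatenate the four information types into one structured state vector."""
    parts = (entity_repr, last_user_act, last_system_act, slot_state)
    return [float(x) for part in parts for x in part]


class FeedForwardEncoder:
    """One fully connected layer with tanh; weights are random placeholders."""

    def __init__(self, in_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)
        ]
        self.bias = [0.0] * hidden_dim

    def __call__(self, state: Sequence[float]) -> List[float]:
        return [
            math.tanh(sum(w * x for w, x in zip(row, state)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


# Hypothetical usage: a 10-dimensional state encoded into a 4-dimensional h0.
state = build_state([1, 0], [0, 1, 0], [1, 0, 0], [1, 1])
encoder = FeedForwardEncoder(in_dim=len(state), hidden_dim=4)
h0 = encoder(state)
```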

[0105] Step 202: Simulate K segments of single-action dialogue based on the encoded latent vector. Specifically, the k-th segment simulates N_k rounds of dialogue forward from the current dialogue state. In each simulated round, a discrete policy network takes the current hidden state h_n of the dialogue and predicts a dialogue action a_{n+1} as the system response; an environment model then simulates user behavior: given the previous dialogue state h_n and the system reply a_{n+1}, it predicts the next dialogue state h_{n+1}.
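As a non-limiting illustration, the simulation loop of Step 202 can be sketched as follows. The policy, the world model, and the stopping rule are passed in as callables, and the toy versions shown here are placeholders rather than the trained models of this application. The stopping rule in particular is a hypothetical comparison of consecutive hidden vectors, and all function names are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def simulate_paths(
    h0: Sequence[float],
    policy: Callable[[Sequence[float], random.Random], int],
    world_model: Callable[[Sequence[float], int], Vector],
    should_stop: Callable[[Sequence[float], Sequence[float]], bool],
    num_paths: int,
    max_steps: int,
    rng: random.Random,
) -> List[Tuple[Vector, List[int]]]:
    """Roll out K single-action dialogues from h0.

    Returns, for each path, the final hidden vector and the action sequence.
    """
    results = []
    for _k in range(num_paths):
        h = list(h0)
        actions: List[int] = []
        for _n in range(max_steps):
            a = policy(h, rng)          # discrete policy: one action per round
            h_next = world_model(h, a)  # world model: simulated user reaction
            actions.append(a)
            stop = should_stop(h, h_next)
            h = h_next
            if stop:
                break
        results.append((h, actions))
    return results


def toy_policy(h: Sequence[float], rng: random.Random) -> int:
    return rng.randrange(3)  # one of three hypothetical actions


def toy_world_model(h: Sequence[float], a: int) -> Vector:
    return [0.5 * v + 0.1 * a for v in h]


def toy_stop(h_prev: Sequence[float], h_next: Sequence[float]) -> bool:
    # Hypothetical preset rule: stop once the hidden state barely changes.
    return max(abs(x - y) for x, y in zip(h_prev, h_next)) < 0.05


paths = simulate_paths(
    [1.0, -1.0], toy_policy, toy_world_model, toy_stop,
    num_paths=4, max_steps=5, rng=random.Random(0),
)
```

Each entry of `paths` pairs a final hidden vector with the single actions taken along that path, so paths may differ in length when the stopping rule fires early.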

[0106] Step 203: Decode and predict, for each segment of simulated dialogue, the probability distribution over the multiple actions that should be executed simultaneously. Specifically, a neural network can be used to decode the multi-action probabilities, and it is applied to each path.
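As a non-limiting illustration, the decoding of Step 203 can be sketched as a single linear layer followed by a sigmoid, applied separately to each simulated path. The sketch assumes that the decoder input is the concatenation of the initial hidden vector and the final hidden vector of a path. The layer shape, the random placeholder weights, and the name `MultiActionDecoder` are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


class MultiActionDecoder:
    """Maps [h0 ; h_final] to one independent selection probability per action."""

    def __init__(self, hidden_dim: int, num_actions: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Placeholder weights; a trained decoder would learn these.
        self.weights = [
            [rng.uniform(-0.5, 0.5) for _ in range(2 * hidden_dim)]
            for _ in range(num_actions)
        ]

    def __call__(self, h0: Sequence[float], h_final: Sequence[float]) -> List[float]:
        features = list(h0) + list(h_final)
        logits = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical usage: decode K = 2 paths that share the same initial vector.
decoder = MultiActionDecoder(hidden_dim=2, num_actions=3)
initial = [1.0, -1.0]
final_vectors = [[0.2, 0.1], [0.0, 0.3]]
path_probs = [decoder(initial, h_final) for h_final in final_vectors]
```

Using a sigmoid per action rather than a softmax over actions lets several actions receive a high probability at once, which suits a multi-action response.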

[0107] Step 204: Aggregate the decoded probabilities and sample the multiple actions to be performed. Specifically, an averaging function can be used to integrate the multi-action probabilities, and it can be verified empirically that the required randomness can be achieved with a basic sampling method.

[0108] Referring to Figure 4, this application provides a dialogue strategy acquisition device 300, including:

[0109] The acquisition module 301 is used to acquire first information that represents the current dialogue state;

[0110] The first determining module 302 is used to obtain the first hidden vector corresponding to the current dialogue state based on the first information;

[0111] The second determining module 303 is used to simulate K segments of single-action dialogue based on the first latent vector to obtain a dialogue strategy, where K is a positive integer.

[0112] Optionally, the first information includes at least one of the following:

[0113] The representation of the entity returned by the query;

[0114] The last system action before the current dialogue state;

[0115] The last user action before the current dialogue state;

[0116] The status of the user's request slot;

[0117] The status of the system's notification slots.

[0118] Optionally, the second determining module 303 is further used for the following:

[0119] The first latent vector is input into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogues, thereby obtaining K target latent vectors and K sets of action sequences. The first model is a discrete policy model and the second model is a world model that simulates user behavior. The K sets of action sequences include at least one action predicted by the first model.

[0120] The dialogue strategy is obtained based on the first latent vector, the K target latent vectors, and the K sets of action sequences.

[0121] Optionally, the step of inputting the first latent vector into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogue segments, and obtaining K target latent vectors and K sets of action sequences, includes:

[0122] The first latent vector is input into the first model to obtain the first action, and the K sets of action sequences include the first action;

[0123] The first hidden vector and the first action are respectively input into the second model to obtain the second hidden vector. The second hidden vector corresponds to the future dialogue state. The future dialogue state is the dialogue state after the current dialogue state predicted by the second model based on the first hidden vector and the first action.

[0124] Compare the first hidden vector and the second hidden vector to obtain the comparison result;

[0125] If the comparison result meets the preset rules, the second latent vector obtained by the second model will be determined as one of the K target latent vectors.

[0126] Optionally, obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes:

[0127] The K target latent vectors are input into a third model to obtain K target states. The third model is a recovery model implemented by a fully connected feedforward network that can map the latent vectors to a structured state space. The dialogue strategy includes the K target states.

[0128] Optionally, obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes:

[0129] Based on the first latent vector and the K target latent vectors, K probability distributions corresponding one-to-one with the K action sequences are determined. The K probability distributions are used to describe whether each action in the K action sequences is selected. The dialogue strategy includes the K probability distributions.

[0130] Optionally, after determining the K probability distributions corresponding one-to-one with the K action sequences based on the first latent vector and the K target latent vectors, the method further includes:

[0131] The K probability distributions are integrated using an aggregation function to obtain an aggregation result, wherein the aggregation function includes a function implemented using the average value;

[0132] The aggregation result is sampled to obtain a target action group for responding to the current dialogue state, and the dialogue strategy includes the target action group.

[0133] The dialogue strategy acquisition device 300 provided in this application embodiment can implement all the processes that can be implemented in the dialogue strategy acquisition method embodiment of this application, and achieve the same beneficial effects. To avoid repetition, it will not be described again here.

[0134] This application provides an electronic device. As shown in Figure 5, the electronic device 400 includes a processor 401, a memory 402, and a computer program stored in the memory 402 and executable on the processor 401. The various components of the electronic device 400 are coupled together via a bus system 403. It is understood that the bus system 403 is used to enable communication between these components.

[0135] The processor 401 is used to acquire first information that represents the current dialogue state;

[0136] Based on the first information, obtain the first hidden vector corresponding to the current dialogue state;

[0137] Based on the first latent vector, simulate K segments of single-action dialogue to obtain the dialogue strategy, where K is a positive integer.

[0138] Optionally, the first information includes at least one of the following:

[0139] The representation of the entity returned by the query;

[0140] The last system action before the current dialogue state;

[0141] The last user action before the current dialogue state;

[0142] The status of the user's request slot;

[0143] The status of the system's notification slots.

[0144] Optionally, the processor 401 is also used for the following:

[0145] The first latent vector is input into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogues, thereby obtaining K target latent vectors and K sets of action sequences. The first model is a discrete policy model and the second model is a world model that simulates user behavior. The K sets of action sequences include at least one action predicted by the first model.

[0146] The dialogue strategy is obtained based on the first latent vector, the K target latent vectors, and the K sets of action sequences.

[0147] Optionally, the processor 401 is further configured to input the first latent vector into the first model and the second model respectively, simulate each single-action dialogue in the K single-action dialogue segments, and obtain K target latent vectors and K sets of action sequences, including:

[0148] The first latent vector is input into the first model to obtain the first action, and the K sets of action sequences include the first action;

[0149] The first hidden vector and the first action are respectively input into the second model to obtain the second hidden vector. The second hidden vector corresponds to the future dialogue state. The future dialogue state is the dialogue state after the current dialogue state predicted by the second model based on the first hidden vector and the first action.

[0150] Compare the first hidden vector and the second hidden vector to obtain the comparison result;

[0151] If the comparison result meets the preset rules, the second latent vector obtained by the second model will be determined as one of the K target latent vectors.

[0152] Optionally, the processor 401 is further configured to obtain the dialogue policy based on the first latent vector, the K target latent vectors, and the K sets of action sequences, including:

[0153] The K target latent vectors are input into a third model to obtain K target states. The third model is a recovery model implemented by a fully connected feedforward network that can map the latent vectors to a structured state space. The dialogue strategy includes the K target states.

[0154] Optionally, the processor 401 is further configured to obtain the dialogue policy based on the first latent vector, the K target latent vectors, and the K sets of action sequences, including:

[0155] Based on the first latent vector and the K target latent vectors, K probability distributions corresponding one-to-one with the K action sequences are determined. The K probability distributions are used to describe whether each action in the K action sequences is selected. The dialogue strategy includes the K probability distributions.

[0156] Optionally, after determining the K probability distributions corresponding one-to-one with the K sets of action sequences based on the first latent vector and the K target latent vectors, the processor 401 is further configured to perform the following:

[0157] The K probability distributions are integrated using an aggregation function to obtain an aggregation result, wherein the aggregation function includes a function implemented using the average value;

[0158] The aggregation result is sampled to obtain a target action group for responding to the current dialogue state, and the dialogue strategy includes the target action group.

[0159] The electronic device 400 provided in this application embodiment can implement all the processes that can be implemented in the dialogue strategy acquisition method embodiment of this application, and achieve the same beneficial effects. To avoid repetition, it will not be described again here.

[0160] This application also provides a computer-readable storage medium storing a computer program. When executed by a processor, this computer program implements the various processes of the above-described dialogue policy acquisition method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0161] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. A method for obtaining dialogue strategies, characterized in that it includes: obtaining first information that represents the current dialogue state; based on the first information, obtaining the first hidden vector corresponding to the current dialogue state; and based on the first latent vector, simulating K segments of single-action dialogue to obtain the dialogue strategy, where K is a positive integer; wherein the step of simulating K segments of single-action dialogue based on the first latent vector to obtain the dialogue strategy includes: inputting the first latent vector into the first model and the second model respectively to simulate each single-action dialogue in the K segments of single-action dialogue, thereby obtaining K target latent vectors and K sets of action sequences, wherein the first model is a discrete policy model, the second model is a world model that simulates user behavior, and the K sets of action sequences include at least one action predicted by the first model; and obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences.

2. The method according to claim 1, characterized in that the first information includes at least one of the following: the representation of the entity returned by the query; the last system action before the current dialogue state; the last user action before the current dialogue state; the status of the user's request slot; the status of the system's notification slots.

3. The method according to claim 1, characterized in that, The step of inputting the first latent vector into the first model and the second model respectively to simulate each single-action dialogue in the K single-action dialogue segments, and obtaining K target latent vectors and K sets of action sequences, includes: The first latent vector is input into the first model to obtain the first action, and the K sets of action sequences include the first action; The first hidden vector and the first action are respectively input into the second model to obtain the second hidden vector. The second hidden vector corresponds to the future dialogue state. The future dialogue state is the dialogue state after the current dialogue state predicted by the second model based on the first hidden vector and the first action. Compare the first hidden vector and the second hidden vector to obtain the comparison result; If the comparison result meets the preset rules, the second latent vector obtained by the second model will be determined as one of the K target latent vectors.

4. The method according to claim 1, characterized in that, The step of obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes: The K target latent vectors are input into a third model to obtain K target states. The third model is a recovery model implemented by a fully connected feedforward network that can map the latent vectors to a structured state space. The dialogue strategy includes the K target states.

5. The method according to claim 1, characterized in that, The step of obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences includes: Based on the first latent vector and the K target latent vectors, K probability distributions corresponding one-to-one with the K action sequences are determined. The K probability distributions are used to describe whether each action in the K action sequences is selected. The dialogue strategy includes the K probability distributions.

6. The method according to claim 5, characterized in that, After determining the K probability distributions corresponding one-to-one with the K action sequences based on the first latent vector and the K target latent vectors, the method further includes: The K probability distributions are integrated using an aggregation function to obtain an aggregation result, wherein the aggregation function includes a function implemented using the average value; The aggregation result is sampled to obtain a target action group for responding to the current dialogue state, and the dialogue strategy includes the target action group.

7. A dialogue strategy acquisition device, characterized in that it includes: an acquisition module used to acquire the first information that represents the current dialogue state; a first determining module used to obtain the first hidden vector corresponding to the current dialogue state based on the first information; and a second determining module used to simulate K segments of single-action dialogue based on the first latent vector to obtain a dialogue strategy, where K is a positive integer; wherein the dialogue strategy acquisition device is also used for: inputting the first latent vector into the first model and the second model respectively to simulate each single-action dialogue in the K segments of single-action dialogue, thereby obtaining K target latent vectors and K sets of action sequences, wherein the first model is a discrete policy model, the second model is a world model that simulates user behavior, and the K sets of action sequences include at least one action predicted by the first model; and obtaining the dialogue strategy based on the first latent vector, the K target latent vectors, and the K sets of action sequences.

8. An electronic device, characterized in that, It includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the dialogue policy acquisition method as described in any one of claims 1 to 6.

9. A readable storage medium, characterized in that, The readable storage medium stores a program that, when executed by a processor, implements the steps of the dialogue policy acquisition method as described in any one of claims 1 to 6.