Dialogue management method and device, electronic equipment and computer readable storage medium

By detecting the perplexity of question-answer text pairs and optimizing the pairs whose perplexity is too high, the dialogue management method solves the problem of low-quality question-answer pairs polluting the context in noisy environments, improves the coherence and accuracy of the dialogue system, and enhances its robustness in complex environments.

CN122240777A · Pending · Publication Date: 2026-06-19 · SHENZHEN TCL NEW-TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN TCL NEW-TECH CO LTD
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In noisy environments, speech recognition errors and users' unrelated conversations with other people lead to the generation of low-quality question-answer pairs, polluting the contextual memory of the large model and affecting the accuracy and fluency of subsequent interactions.

Method used

A question-and-answer text pair is acquired and its perplexity is detected. When the perplexity exceeds a threshold, the pair is optimized within the model's historical context: it is deleted, tagged with a target identifier that instructs the model to ignore it, or replaced with a text pair generated by a stronger audio understanding model.

Benefits of technology

Effectively identify and eliminate low-quality dialogues, prevent context pollution, improve the coherence and accuracy of multi-turn dialogues, enhance the robustness of dialogue systems in noisy and multi-person dialogue scenarios, and improve the user interaction experience.

Generated by Eureka AI based on patent content.


Abstract

This application discloses a dialogue management method, apparatus, electronic device, and computer-readable storage medium, belonging to the field of artificial intelligence technology. The method includes: acquiring a first question-and-answer text pair; performing perplexity detection on the first question-and-answer text pair to obtain its perplexity; and optimizing the first question-and-answer text pair within a first historical context of a first model when the perplexity exceeds a perplexity threshold. This effectively identifies and removes low-quality dialogues from the dialogue history of the first model, preventing contextual pollution, thereby significantly improving the coherence and accuracy of multi-turn dialogues, enhancing the robustness of the dialogue system in noisy and multi-person dialogue scenarios, and improving the user experience.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to a dialogue management method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] In large model-based voice dialogue systems, speech recognition errors in noisy environments or non-task-related conversations between users and others often lead to the generation of low-quality question-answer pairs. These low-quality question-answer pairs contaminate the contextual memory of the large model, thereby affecting the accuracy and fluency of subsequent interactions.

Summary of the Invention

[0003] This application provides a dialogue management method, apparatus, electronic device, and computer-readable storage medium that can effectively identify and clear low-quality memories in dialogue history and prevent context pollution.

[0004] In a first aspect, embodiments of this application provide a dialogue management method, including: obtaining a first question-and-answer text pair, where the first question-and-answer text pair includes a first question text and a first response text generated by a first model based on the first question text; performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair; and, when the perplexity is greater than a perplexity threshold, optimizing the first question-and-answer text pair in a first historical context of the first model.

[0005] In one embodiment, the first question text is obtained by performing speech recognition on a first question speech using a speech recognition model, and the step of performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair includes: obtaining the confidence score of the first question text output by the speech recognition model; and determining the perplexity of the first question-and-answer text pair based on the confidence score.

[0006] In one embodiment, the step of performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair includes: determining the relevance between the first question-and-answer text pair and a second historical context of the first model; and determining the perplexity of the first question-and-answer text pair based on the relevance.

[0007] In one embodiment, the optimization processing of the first question-and-answer text pair in the first historical context of the first model includes: deleting the first question-and-answer text pair from the first historical context of the first model; or adding, in the first historical context of the first model, a target identifier to the first question-and-answer text pair, the target identifier being used to instruct the first model to ignore the first question-and-answer text pair.

[0008] In one embodiment, the first question text is obtained by speech recognition of the first question speech, and the optimization processing of the first question-and-answer text pair in the first historical context of the first model includes: inputting the first question speech into a second model, and outputting a second question text and a second response text through the second model; generating a second question-and-answer text pair based on the second question text and the second response text; and replacing, in the first historical context of the first model, the first question-and-answer text pair with the second question-and-answer text pair.

[0009] In one embodiment, the first question text is obtained by speech recognition of the first question speech, and the optimization processing of the first question-and-answer text pair in the first historical context of the first model includes: acquiring a second question speech, where the second question speech and the first question speech are speech from different channels collected from the same speech source; determining the target direction of the speech source based on the phase difference information between the first question speech and the second question speech, and generating a beamforming signal pointing to the target direction; performing speech recognition on the beamforming signal to generate a third question text; inputting the third question text into the first model, and outputting a third response text through the first model; generating a third question-and-answer text pair based on the third question text and the third response text; and replacing, in the first historical context of the first model, the first question-and-answer text pair with the third question-and-answer text pair.

[0010] In one embodiment, the first question text is obtained in the following manner: acquiring the first question speech and a user image; extracting features from the mouth region of the person in the user image to obtain visual feature information; extracting features from the first question speech to obtain speech feature information; and inputting the visual feature information and the speech feature information into a third model, and outputting the first question text through the third model.

[0011] In a second aspect, embodiments of this application provide a dialogue management device, the device comprising: an acquisition module, used to acquire a first question-and-answer text pair, where the first question-and-answer text pair includes a first question text and a first response text generated by a first model based on the first question text; a detection module, used to perform perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair; and an optimization module, used to optimize the first question-and-answer text pair in the first historical context of the first model when the perplexity is greater than the perplexity threshold.

[0012] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech using a speech recognition model, and the detection module includes: a score acquisition submodule, used to acquire the confidence score of the first question text output by the speech recognition model; and a first determining submodule, used to determine the perplexity of the first question-and-answer text pair based on the confidence score.

[0013] In one embodiment, the detection module includes: a relevance determination submodule, used to determine the relevance between the first question-and-answer text pair and the second historical context of the first model; and a second determining submodule, used to determine the perplexity of the first question-and-answer text pair based on the relevance.

[0014] In one embodiment, the optimization module includes: a first optimization submodule, used to delete the first question-and-answer text pair from the first historical context of the first model; or a second optimization submodule, used to add a target identifier to the first question-and-answer text pair in the first historical context of the first model, the target identifier being used to instruct the first model to ignore the first question-and-answer text pair.

[0015] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech, and the optimization module includes: a first response submodule, used to input the first question speech into the second model and output the second question text and the second response text through the second model; a first generation submodule, used to generate a second question-and-answer text pair based on the second question text and the second response text; and a first replacement submodule, used to replace the first question-and-answer text pair with the second question-and-answer text pair in the first historical context of the first model.

[0016] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech, and the optimization module includes: a speech acquisition submodule, used to acquire the second question speech, where the second question speech and the first question speech are speech from different channels collected from the same speech source; a signal generation submodule, used to determine the target direction of the speech source based on the phase difference information between the first question speech and the second question speech, and to generate a beamforming signal pointing to the target direction; a second generation submodule, used to perform speech recognition on the beamforming signal and generate the third question text; a second response submodule, used to input the third question text into the first model and output the third response text through the first model; a third generation submodule, used to generate a third question-and-answer text pair based on the third question text and the third response text; and a second replacement submodule, used to replace the first question-and-answer text pair with the third question-and-answer text pair in the first historical context of the first model.

[0017] In one embodiment, the first question text is obtained in the following manner: acquiring the first question speech and a user image; extracting features from the mouth region of the person in the user image to obtain visual feature information; extracting features from the first question speech to obtain speech feature information; and inputting the visual feature information and the speech feature information into a third model, and outputting the first question text through the third model.

[0018] In a third aspect, embodiments of this application also provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps in the dialogue management method described above.

[0019] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in the dialogue management method described above.

[0020] In a fifth aspect, embodiments of this application also provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in the various optional implementations described in the embodiments of this application.

[0021] In summary, in this embodiment, by acquiring the first question-and-answer text pair and performing perplexity detection on it to obtain its perplexity, optimization processing can be performed on the first question-and-answer text pair within the first historical context of the first model when the perplexity exceeds a perplexity threshold. This effectively identifies and removes low-quality dialogues from the dialogue history of the first model, preventing contextual pollution, thereby significantly improving the coherence and accuracy of multi-turn dialogues, enhancing the robustness of the dialogue system in noisy and multi-person dialogue scenarios, and improving the user experience.

Attached Figure Description

[0022] To more clearly illustrate the technical solutions in this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart illustrating a dialogue management method provided in an embodiment of this application; Figure 2 is a schematic flowchart of a specific embodiment of the optimization process provided in this application; Figure 3 is a schematic flowchart of another specific embodiment of the optimization process provided in one embodiment of this application; Figure 4 is a schematic diagram of the structure of a dialogue management device provided in an embodiment of this application; Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0024] The technical solutions of this application will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0025] With the popularization of voice interaction technology, dialogue systems based on Large Language Models (LLMs) have been widely used in scenarios such as intelligent assistants and customer service robots. In these systems, user speech is usually converted into text by an Automatic Speech Recognition (ASR) module, and then the dialogue model generates a response. However, in real-world usage environments, ASR is susceptible to interference from background noise, multi-person conversations, and user accents, resulting in recognition errors and highly semantically perplexing input text. Furthermore, conversations inserted by the user during human-computer interaction can also be misrecognized as valid input by the system, leading to irrelevant or low-quality responses.

[0026] If these highly perplexing question-and-answer pairs are included in the model's dialogue history (i.e., contextual memory), they form so-called "poison memory." This not only occupies part of the limited context space but also misleads the understanding and generation of subsequent dialogues, manifesting as stiff responses, biased intent recognition, and frequent requests for clarification. Most existing dialogue management strategies employ simple rejection mechanisms based on confidence filtering or intent matching, or forcibly switch topics when low-quality input is detected. These strategies lack the ability to dynamically identify and repair contaminated context, and therefore fail to fundamentally optimize the coherence and quality of multi-turn dialogues.

[0027] To address the current challenge of effectively managing the dialogue history of a model, this application aims to provide a dialogue management method. By acquiring the first question-and-answer text pair and performing perplexity detection on it to obtain its perplexity level, this method optimizes the first question-and-answer text pair within the first historical context of the first model when the perplexity exceeds a perplexity threshold. This effectively identifies and removes low-quality dialogues from the first model's dialogue history, preventing contextual pollution and significantly improving the coherence and accuracy of multi-turn dialogues. It also enhances the robustness of the dialogue system in noisy and multi-person dialogue scenarios, improving the user experience.

[0028] The following sections provide detailed descriptions of each embodiment. It should be noted that the order in which the embodiments are described is not intended to limit the priority of the embodiments.

[0029] Figure 1 shows a flowchart of a dialogue management method according to an embodiment of this application. The entity executing the dialogue management method can be a dialogue management device, which can be integrated into any electronic device with data processing, network communication, and program execution functions. The electronic device can be a server or a terminal, etc.

[0030] The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, network acceleration services (Content Delivery Network, CDN), as well as big data and artificial intelligence platforms.

[0031] The terminal can be a smartphone, tablet, laptop, desktop computer, smart home device, etc., but is not limited to these. The terminal and server can be connected directly or indirectly through wired or wireless communication, which is not limited herein.

[0032] In this embodiment, the description will be from the perspective of a dialogue management device, which can be integrated into a server or terminal. To facilitate the explanation of the dialogue management method of this application, the following will describe the dialogue management device integrated into a terminal in detail, that is, the terminal will be used as the execution subject for detailed explanation.

[0033] Referring to Figure 1, which shows a flowchart of a dialogue management method according to this application, the method may specifically include steps S101 to S103, as follows:

S101: Obtain the first question-and-answer text pair.

[0034] In this embodiment, the first question-and-answer text pair includes a first question text and a first response text generated based on the first question text by a first model. The first question text can be obtained by speech recognition of the first question speech using a speech recognition model, or it can be text data entered by the user via a keyboard.

[0035] In this embodiment, the first model can be built upon a large model, which refers to an artificial neural network model with a very large number of parameters. In the field of artificial intelligence, a large model typically refers to a model with hundreds of millions to trillions of parameters. Such models usually need to be trained on large-scale datasets and require significant computational resources for optimization and tuning. Large models are commonly used to solve complex tasks such as natural language processing, computer vision, and speech recognition.

[0036] In this embodiment, the large model can be a large-scale pre-trained model such as Doubao, the ChatGPT series, BERT, XLNet, the Zhipu model, Claude, the Moonshot AI model, the ChatGLM model, the Tongyi Qianwen model, the MiniMax model, the Xinghuo model, the Llama model, the 360GPT model, the Qwen model, the Baichuan model, the Yunque model, the vivoLM model, DeepSeek, Tencent Yuanbao, or Wenxin Yiyan, which is not limited in the embodiments of this application.

[0037] For example, a user can speak the question "What's the weather like today?" into the terminal. After receiving the user's voice, the terminal can call a speech recognition model to recognize it and transcribe it into a first question text. This first question text is then input into a first model. Based on its own knowledge and understanding of the first question text, the first model generates a first response text, such as "Today is sunny, the temperature is 20-25 degrees Celsius," and displays the first response text through the interactive interface or broadcasts it aloud. The terminal adds this "first question text - first response text" as a paired first question-and-answer text pair to the first model's dialogue history context for reference in subsequent rounds of dialogue.

[0038] S102: Perform perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair.

[0039] In this implementation, perplexity is used to characterize the dialogue quality of the first question-and-answer text pair. Specifically, perplexity is a quantitative indicator that measures whether the first question-and-answer text pair is semantically confusing, anomalous, or irrelevant to the dialogue flow. High perplexity typically means that the input may have been produced by misrecognized noise or is irrelevant to the dialogue itself.

[0040] In this embodiment, by performing perplexity detection on the first question-and-answer text pair, the dialogue quality of the first question-and-answer text pair can be effectively detected, thereby providing an accurate quantitative basis for subsequent optimization processing.

[0041] S103: When the perplexity is greater than the perplexity threshold, optimize the first question-answer text pair in the first historical context of the first model.

[0042] In this embodiment, the perplexity threshold can be preset based on dialogue quality requirements or empirical data. For example, if strict control of the terminal's dialogue quality is required, a lower perplexity threshold can be set, so that more question-and-answer text pairs exceed the threshold and are optimized.

[0043] In this embodiment, when the perplexity of the first question-and-answer text pair is greater than the perplexity threshold, it indicates that the first question-and-answer text pair has a high degree of semantic uncertainty or confusion. If the first question-and-answer text pair is not processed, it will form a toxic memory in the first historical context of the first model, misleading the understanding and generation of subsequent dialogues. Therefore, in order to prevent it from entering and polluting the context memory of the first model, the first question-and-answer text pair will be optimized in the first historical context of the first model.

[0044] In this embodiment, the first historical context refers to the memory cache or context window that the first model relies on when conducting the current round of dialogue, which records question-and-answer pairs from several past rounds. After generating the first response text based on the first question text, the first model will automatically store the first question-and-answer text pair in the first historical context of the first model.

[0045] In this embodiment, optimizing the first question-and-answer text pair in the first historical context of the first model means eliminating the negative impact of the first question-and-answer text pair in the first historical context to improve the data quality of the first historical context.
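As an illustration of how such optimization might look, the following is a minimal sketch of the two options named in paragraph [0007], deletion and tagging, assuming the historical context is held as a simple list. The `QAPair` class, the `IGNORE_TAG` string, and the function names are hypothetical and introduced only for this example; the source does not specify the data structure or the form of the target identifier.

```python
from dataclasses import dataclass, replace

# Hypothetical target identifier; the source does not define its form.
IGNORE_TAG = "[IGNORE]"


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    tag: str = ""  # empty string means the pair is used normally


def delete_pair(history: list[QAPair], index: int) -> list[QAPair]:
    """Return a copy of the history without the pair at `index`."""
    if not 0 <= index < len(history):
        raise IndexError("index out of range")
    return history[:index] + history[index + 1:]


def mark_ignored(history: list[QAPair], index: int) -> list[QAPair]:
    """Return a copy of the history with the pair at `index` tagged."""
    if not 0 <= index < len(history):
        raise IndexError("index out of range")
    tagged = replace(history[index], tag=IGNORE_TAG)
    return history[:index] + [tagged] + history[index + 1:]


history = [
    QAPair("What's the weather like today?", "Today is sunny."),
    QAPair("Help me make an alarm", "Sorry, could you repeat that?"),
]
cleaned = delete_pair(history, 1)   # the flagged pair is dropped
flagged = mark_ignored(history, 1)  # the flagged pair stays but carries the tag
```

Deletion frees context space, while tagging keeps the pair available for later inspection. How the first model is made to honor the tag (for example, through its system prompt) is left open here.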

[0046] In this implementation, by detecting the perplexity of the first question-and-answer text pair in real time during each round of dialogue with the user, and selectively optimizing first question-and-answer text pairs with perplexity exceeding a perplexity threshold, the pollution of the first model's context by low-quality dialogue caused by noise misidentification and irrelevant interruptions can be prevented at its source. This ensures that the first model's context environment remains highly relevant and of high quality, significantly improving the coherence, accuracy, and overall user experience of multi-turn dialogues, and enhancing the robustness of the voice interaction system in complex real-world environments.
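The per-turn flow of steps S101 to S103 can be sketched as follows. This is an illustrative outline under stated assumptions, not the application's implementation: the history is a plain list of (question, response) tuples, deletion is used as the optimization, and `manage_turn` and `toy_detect` are hypothetical names. The toy detector is a placeholder for a real perplexity detector.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (question text, response text)
Detector = Callable[[Pair, List[Pair]], float]


def manage_turn(history: List[Pair], pair: Pair,
                detect: Detector, threshold: float) -> bool:
    """Run S101 to S103 for one turn; return True if the pair was removed."""
    prior = list(history)             # context recorded before this pair
    history.append(pair)              # S101: the pair is stored in the history
    perplexity = detect(pair, prior)  # S102: perplexity detection
    if perplexity > threshold:        # S103: optimize only above the threshold
        history.pop()                 # optimization by deletion
        return True
    return False


def toy_detect(pair: Pair, prior: List[Pair]) -> float:
    # Placeholder only: treats one-word questions as confusing.
    return 0.9 if len(pair[0].split()) < 2 else 0.1


history: List[Pair] = []
kept = manage_turn(
    history, ("What's the weather like today?", "Today is sunny."),
    toy_detect, 0.5)
dropped = manage_turn(
    history, ("uh", "Sorry, could you repeat that?"), toy_detect, 0.5)
```

Passing the detector as a parameter lets the same loop work with either a confidence-based or a relevance-based perplexity score.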

[0047] In one feasible implementation, the first question text is obtained by speech recognition of the first question speech using a speech recognition model; the step of performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair may specifically include: obtaining the confidence score of the first question text output by the speech recognition model; and determining the perplexity of the first question-and-answer text pair based on the confidence score.

[0048] In this embodiment, the confidence score is the speech recognition model's probability estimate of the correctness of its recognition result, usually represented numerically (for example, between 0.0 and 1.0). The higher the score, the more certain the ASR system is that the text converted from the audio segment is accurate. For example, when the user says "Set an alarm for me", the speech recognition model configured in the ASR system not only outputs the recognized text; its internal calculation process (for example, based on a neural network or a hidden Markov model) also calculates a confidence score for the entire sentence or for each word. This score reflects the degree of match between the audio signal features and the language model's prediction. The terminal can obtain the confidence score of the first question text output by the speech recognition model directly through an API (Application Programming Interface) or an internal data bus. In a noisy environment, if the recognition result is the ambiguous "Help me make an alarm", its confidence score will usually be significantly lower than in a clear environment.

[0049] In this embodiment, there is a strong correlation between low ASR confidence and high text perplexity. If a piece of speech is clear and grammatically standard, ASR outputs standard text with high confidence. Conversely, if the speech is contaminated by noise or is itself meaningless mumbling, ASR is "hesitant" and outputs text with low confidence and often chaotic semantics. Therefore, the confidence score and the perplexity are negatively correlated.

[0050] In this embodiment, the step of determining the perplexity of the first question-answer text pair based on the confidence score may specifically include: determining the reciprocal of the confidence score as the perplexity of the first question-answer text pair; or inputting the confidence score into a preset first negative correlation function, and calculating the perplexity of the first question-answer text pair through the first negative correlation function. For example, the first negative correlation function can be set as PP = 1 - Conf, where PP represents the perplexity of the first question-answer text pair, and Conf represents the confidence score of the first question text. When the confidence score Conf is 0.9 (high), the perplexity PP is 0.1 (low); when Conf is 0.2 (low), PP is 0.8 (high).
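The two mappings from confidence score to perplexity can be written out as a short sketch. The function names are hypothetical, and the input range checks are an added assumption (a confidence score in the range 0.0 to 1.0, as in the example in the text).

```python
def perplexity_from_reciprocal(conf: float) -> float:
    """Perplexity as the reciprocal of the confidence score."""
    if not 0.0 < conf <= 1.0:
        raise ValueError("confidence must be in (0, 1]")
    return 1.0 / conf


def perplexity_from_complement(conf: float) -> float:
    """Perplexity as PP = 1 - Conf, the example negative correlation function."""
    if not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return 1.0 - conf


high_conf_pp = perplexity_from_complement(0.9)  # low perplexity
low_conf_pp = perplexity_from_complement(0.2)   # high perplexity
```

The two mappings live on different scales: the reciprocal yields values of 1 or more, while 1 - Conf stays between 0 and 1. The perplexity threshold therefore has to be chosen for whichever mapping is in use.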

[0051] In this embodiment, by directly using the confidence score natively output by the speech recognition model to evaluate the perplexity of the question-answer text, a lightweight, low-latency detection mechanism closely related to the quality of the speech signal is realized. This method does not need to introduce an additional complex semantic understanding model, makes full use of the by-products of the existing ASR process, has high computational efficiency, can quickly identify low-confidence inputs caused by poor audio quality, and thus timely trigger optimization processing to effectively prevent the contamination of the conversation history by low-quality question-answer text pairs.

[0052] In one feasible implementation, the step of performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair may specifically include: determining the relevance between the first question-and-answer text pair and the second historical context of the first model; and determining the perplexity of the first question-and-answer text pair based on the relevance.

[0053] In this embodiment, the second historical context refers to the dialogue history consisting of several rounds of historical question-and-answer pairs recorded by the first model before the generation of the first question-and-answer text pair. Relevance is a quantitative indicator used to measure the semantic and topical coherence and association between the first question-and-answer text pair and the second historical context.

[0054] It should be noted that in a normal, task-oriented multi-turn dialogue, the topics are usually coherent; while suddenly inserted, irrelevant remarks will create a clear semantic gap with the previous dialogue history.

[0055] In this embodiment, the step of determining the relevance between the first question-and-answer text pair and the second historical context of the first model may specifically include: generating a first semantic vector of the first question-and-answer text pair and a second semantic vector of the second historical context through a text embedding model; and determining the cosine similarity between the first semantic vector and the second semantic vector as the relevance between the first question-and-answer text pair and the second historical context of the first model.

[0056] For example, the second historical context (e.g., the most recent three rounds of dialogue about "weather and travel") and the first question text in the first question-and-answer text pair (e.g., "What do you want to eat tonight?") can be converted into high-dimensional semantic vectors using a text embedding model (such as Sentence-BERT). Then, the cosine similarity between these two semantic vectors is calculated, and this cosine similarity is determined as the relevance. The closer the cosine similarity value is to 1, the more semantically relevant it is; the closer it is to 0, the less relevant it is.
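For reference, cosine similarity between two semantic vectors can be computed as in the sketch below. The short vectors are made-up stand-ins for the output of a text embedding model, which would normally have hundreds of dimensions; the function name and the zero-vector handling are assumptions for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embedding model output.
context_vec = [0.9, 0.1, 0.0]    # e.g. recent turns about weather and travel
question_vec = [0.1, 0.2, 0.95]  # e.g. "What do you want to eat tonight?"
relevance = cosine_similarity(context_vec, question_vec)
```

In general, cosine similarity ranges from -1 to 1, so an implementation may need to decide how to treat negative values before using the result as a relevance score.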

[0057] In this implementation, if the first question-and-answer text pair is highly relevant to the historical dialogue, then it is a normal and understandable continuation with low semantic perplexity. If it is completely irrelevant, then it is an abrupt and confusing input for the current dialogue flow, i.e., it has high perplexity. Therefore, relevance and perplexity are negatively correlated.

[0058] In this embodiment, the step of determining the perplexity of the first question-and-answer text pair based on relevance may specifically include: determining the perplexity of the first question-and-answer text pair as the reciprocal of the relevance; or, inputting the relevance into a preset second negative correlation function, and calculating the perplexity of the first question-and-answer text pair through the second negative correlation function. For example, the second negative correlation function can be set to PP=1-Rel, where PP represents the perplexity of the first question-and-answer text pair, and Rel represents the relevance. If the calculated relevance Rel is 0.9 (high relevance), then the perplexity PP is 0.1 (low); if the relevance Rel is 0.1 (low relevance), then the perplexity PP is 0.9 (high).
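
As an illustrative sketch only, the two mappings from relevance to perplexity described above can be written as follows. The floor value `eps`, the function names, and the threshold used in the usage lines are assumptions of this sketch. The reciprocal and the linear form PP=1-Rel produce values on different scales, so the perplexity threshold would need to be chosen to match whichever mapping is used.

```python
def perplexity_reciprocal(rel, eps=1e-6):
    """PP = 1 / Rel. Rel is floored at eps (an assumption) to avoid
    division by zero when the relevance is zero or negative."""
    return 1.0 / max(rel, eps)

def perplexity_linear(rel):
    """PP = 1 - Rel, the example second negative correlation function."""
    return 1.0 - rel

def exceeds_threshold(perplexity, threshold):
    """True when the pair should be optimized in the historical context."""
    return perplexity > threshold

# Worked example from the text: Rel = 0.9 gives PP = 0.1, Rel = 0.1 gives PP = 0.9.
HYPOTHETICAL_THRESHOLD = 0.5
for rel in (0.9, 0.1):
    pp = perplexity_linear(rel)
    print(rel, round(pp, 3), exceeds_threshold(pp, HYPOTHETICAL_THRESHOLD))
```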

[0059] In this embodiment, the perplexity of the first question-and-answer text pair is measured indirectly by calculating the semantic relevance between the first question-and-answer text pair and the second historical context, thus realizing a detection mechanism based on the coherence of the dialogue content. This method can identify utterances that are inserted midway and are completely irrelevant to the current dialogue task, even though their ASR confidence may be high (e.g., the chatter itself is clearly audible). By filtering out such semantic interference, the consistency of the dialogue history in terms of topic and intent can be maintained, thereby improving the completion quality and logical coherence of multi-turn task-oriented dialogues.

[0060] In this embodiment, the step of performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair may specifically include: obtaining the confidence score of the first question-and-answer text pair output by the speech recognition model, and determining the relevance between the first question-and-answer text pair and the second historical context of the first model; determining the first perplexity based on the confidence score, and determining the second perplexity based on the relevance; and performing a weighted sum of the first perplexity and the second perplexity to obtain the perplexity of the first question-and-answer text pair.

[0061] In this embodiment, by fusing speech recognition confidence and contextual semantic relevance information, a multi-layered and more robust detection mechanism is constructed to comprehensively evaluate the perplexity of the first question-and-answer text pair in a weighted manner. This mechanism can capture surface recognition errors caused by audio quality through confidence and identify logically irrelevant inputs at the content level through semantic relevance. The two complement each other, avoiding misjudgment or omission by a single indicator. By weighted fusion of the two perplexity levels, the system can flexibly adjust the detection strategy according to the focus of the actual scenario, thereby achieving more accurate and comprehensive recognition of low-quality dialogue content. This lays a solid foundation for subsequent efficient optimization processing and comprehensively improves the environmental adaptability and intelligence level of the dialogue management system.
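
The weighted fusion can be sketched as below. This is a minimal illustration under stated assumptions: both component perplexities are computed with the linear form 1 - x (the mapping from confidence to the first perplexity is assumed here for illustration), and the weights are hypothetical tuning parameters.

```python
def fused_perplexity(confidence, relevance, w_conf=0.5, w_rel=0.5):
    """Weighted sum of a confidence-based and a relevance-based perplexity.

    Assumption: both components use the linear mapping PP = 1 - x.
    """
    first_perplexity = 1.0 - confidence   # from the ASR confidence score
    second_perplexity = 1.0 - relevance   # from the contextual relevance
    return w_conf * first_perplexity + w_rel * second_perplexity

# Clearly audible but off-topic chatter: high confidence, low relevance.
balanced = fused_perplexity(confidence=0.95, relevance=0.1)
# Putting more weight on relevance makes the same input score higher.
relevance_heavy = fused_perplexity(0.95, 0.1, w_conf=0.2, w_rel=0.8)
print(relevance_heavy > balanced)
```

Shifting weight toward the relevance term is one way a system could emphasize off-topic interruptions over acoustic errors, in line with the scenario-dependent adjustment described above.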

[0062] In one feasible implementation, the first question-and-answer text pair is optimized in the first historical context of the first model, including: deleting the first question-and-answer text pair in the first historical context of the first model; or, adding a target identifier to the first question-and-answer text pair in the first historical context of the first model, the target identifier being used to instruct the first model to ignore the first question-and-answer text pair.

[0063] In this embodiment, deletion means completely removing the first question-and-answer text pair from the dialogue history context cache where it should have been added or has been temporarily stored, without leaving any textual trace.

[0064] In this implementation, when the perplexity of the first question-and-answer text pair is determined to be high, indicating that the interaction contains almost no useful information and poses a significant risk of contamination, a deletion operation can be performed. Thus, in subsequent dialogue rounds, the first model will completely forget about this interaction when generating responses, and the first question-and-answer text pair will not appear in its context window. This fundamentally eliminates any negative impact of this low-quality content on subsequent semantic understanding and response generation, ensuring the absolute cleanliness of the historical context.

[0065] In this embodiment, the target identifier is a special mark or instruction added before or after the text pair or associated as metadata. The target identifier is used to instruct the first model to ignore the first question-and-answer text pair. For example, the target identifier can be formatted as: [Low confidence input, please ignore] or a more structured instruction.

[0066] In this implementation, considering that some highly perplexing inputs may be "partially contaminated," still containing fragments of the user's true intent (such as the core entity "time"), complete deletion could lead to broken dialogue turns or information loss. Therefore, target content, which is the perplexing part of the first question-and-answer text pair, can also be extracted. In this case, the target identifier can specifically be used to instruct the first model to ignore the target content of the first question-and-answer text pair.

[0067] In this implementation, by adding a target identifier to the first question-and-answer text pair within the first model's first historical context, this target identifier plays a crucial role when the first model reads the historical context in subsequent dialogues and needs to understand the entire dialogue flow. It doesn't allow the model to completely ignore the text, but rather guides the model's attention mechanism or internal processing flow. For example, the target identifier can prompt the model to: reduce the weight of the first question-and-answer text pair when calculating context embeddings; focus on identifying valid content while actively ignoring target content with high perplexity when attempting to parse the first question-and-answer text pair; and treat the first question-and-answer text pair as a reference source of questionable reliability, using it cautiously only when no other explicit information is available.

[0068] In this implementation, by using a target identifier to instruct the first model to ignore the target content of the first question-and-answer text pair while retaining the valid content of the first question-and-answer text pair, the interaction is not completely discarded (maintaining the continuity of the dialogue turns), and the negative impact of noise is suppressed to the greatest extent possible, while retaining potentially valid content. This allows the model to maintain the coherence of the dialogue and potentially extract valid information even when faced with partially poor-quality input.

[0069] In this implementation, the two optimization strategies, direct deletion and adding a target identifier, are combined to form a flexible, multi-layered mechanism for purifying the historical context. The direct deletion strategy provides a simple and reliable solution for heavily polluted interactions, keeping the dialogue history clean. The target identifier strategy allows the system to suppress noise while attempting to salvage potentially useful information, thus balancing thorough cleanup against dialogue continuity and information integrity. The combined use of these two strategies improves the adaptability and overall performance of the dialogue management system in handling complex input situations.
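
A minimal sketch of the two strategies follows. The `QAPair` record, the function name, and in particular the use of two thresholds to choose between tagging and deletion are assumptions of this sketch; the text above only states that either strategy may be applied when the perplexity exceeds the perplexity threshold.

```python
from dataclasses import dataclass
from typing import List, Optional

IGNORE_TAG = "[Low confidence input, please ignore]"

@dataclass
class QAPair:
    question: str
    answer: str
    tag: Optional[str] = None  # target identifier stored as metadata

def optimize_history(history: List[QAPair], index: int, perplexity: float,
                     tag_threshold: float, delete_threshold: float) -> List[QAPair]:
    """Return a new history with the pair at `index` deleted or tagged.

    Assumption: very high perplexity leads to deletion, moderately high
    perplexity leads to tagging, and anything else is left unchanged.
    """
    updated = list(history)
    if perplexity > delete_threshold:
        del updated[index]  # no textual trace remains
    elif perplexity > tag_threshold:
        pair = updated[index]
        updated[index] = QAPair(pair.question, pair.answer, IGNORE_TAG)
    return updated

history = [
    QAPair("What's the weather tomorrow?", "Sunny."),
    QAPair("uh what eat tonight", "Sorry, could you repeat that?"),
]
print(len(optimize_history(history, 1, 0.95, 0.5, 0.9)))
```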

[0070] In one feasible implementation, referring to Figure 2, the first question text is obtained by performing speech recognition on the first question speech. The optimization of the first question-and-answer text pair in the first historical context of the first model may specifically include steps S201 to S203, as follows:

S201: Input the first question speech into the second model, and output the second question text and the second response text through the second model.

[0071] In this embodiment, the second model is distinct from the speech recognition model, possessing audio understanding and contextual reasoning capabilities. Specifically, this second model can be Audio-LLM (Audio Large Language Model). Audio-LLM is a large language model capable of directly processing audio input, possessing capabilities such as speech recognition, speaker recognition, and speech summarization, and is generally more robust in complex scenarios than traditional ASR.

[0072] In this implementation, traditional speech recognition models typically rely solely on acoustic and language models for conversion, making them susceptible to noise and pronunciation issues. The second model, however, can process the raw audio signal end-to-end and integrate an understanding of the speech content, its semantics, and even the context of the dialogue. Inputting the first question speech into the second model allows it to better resist noise, compensate for unclear pronunciation, and potentially incorporate dialogue history to identify a more accurate second question text. Based on this second question text, it then outputs a more accurate and fluent second response text. For example, if a user asks a voice question in an interaction: "Play Jay Chou's 'Seven Mile Fragrance'" (which ASR might interpret as "Play Zhou Jielun's 'Seven Mile Fragrance'" due to unclear pronunciation or environmental noise), the second model can directly output the second question text as "Play Jay Chou's 'Seven Mile Fragrance'" and the second response text as "Okay, I'm about to play Jay Chou's 'Seven Mile Fragrance' for you."

[0073] S202: Generate a second question-and-answer text pair based on the second question text and the second response text.

[0074] In this embodiment, the second question-and-answer text pair is a combination of the second question text and the high-quality second response text output by the second model. For example, "Play Jay Chou's 'Seven Mile Fragrance'" is paired with "Okay, I'm about to play Jay Chou's 'Seven Mile Fragrance' for you" to form a new second question-and-answer text pair.

[0075] S203: Replace the first question-and-answer text pair with the second question-and-answer text pair in the first historical context of the first model.

[0076] In this implementation, after generating the second question-and-answer text pair, the previously stored low-quality first question-and-answer text pair can be found in the first historical context of the first model and overwritten with the newly constructed second question-and-answer text pair. This not only clears the first model's toxic memories but also replaces them with more accurate and beneficial ones. For example, when the user engages in subsequent dialogue (e.g., asking "Who sang that song just now?"), the first model, reviewing the historical context, retrieves the second question-and-answer text pair as: "Play Jay Chou's 'Seven Mile Fragrance'" → "Okay, I'm about to play Jay Chou's 'Seven Mile Fragrance' for you." Based on the second question-and-answer text pair, the first model correctly understands the historical context (that a song by Jay Chou was just played) and thus provides the correct response ("It was Jay Chou").

[0077] In this implementation, a more powerful second model is introduced to reprocess and directly respond to the original speech corresponding to high-perplexity questions. This reprocessed question-and-answer pair replaces the original low-quality recording, achieving an optimization mechanism that moves from "identifying the problem" to "actively fixing the problem." This not only removes toxic memories from the dialogue history but also replaces them with beneficial memories containing correct semantics and action results. This not only purifies the context but also significantly improves the accuracy, information content, and subsequent reference value of the historical context, fundamentally improving the logical coherence and task completion of multi-turn dialogues.
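
Steps S201 to S203 can be sketched as follows. The second model is represented by a caller-supplied function that takes audio and returns a question text and a response text; `fake_second_model` is a hypothetical stub that returns the example texts from above and does not call any real Audio-LLM.

```python
from typing import Callable, List, Tuple

TextPair = Tuple[str, str]  # (question text, response text)

def replace_with_second_model(
    history: List[TextPair],
    index: int,
    first_question_speech: bytes,
    second_model: Callable[[bytes], TextPair],
) -> List[TextPair]:
    """Re-run the original speech through the second model and overwrite
    the low-quality pair stored at `index`."""
    second_question, second_response = second_model(first_question_speech)  # S201
    second_pair = (second_question, second_response)                        # S202
    updated = list(history)
    updated[index] = second_pair                                            # S203
    return updated

def fake_second_model(audio: bytes) -> TextPair:
    # Hypothetical stub standing in for an audio understanding model.
    return ("Play Jay Chou's 'Seven Mile Fragrance'",
            "Okay, I'm about to play Jay Chou's 'Seven Mile Fragrance' for you.")

history = [("Play Zhou Jielun's 'Seven Mile Fragrance'", "Sorry, I could not find that.")]
history = replace_with_second_model(history, 0, b"raw-audio-bytes", fake_second_model)
print(history[0][0])
```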

[0078] In one feasible implementation, referring to Figure 3, the first question text is obtained by performing speech recognition on the first question speech. The optimization of the first question-and-answer text pair in the first historical context of the first model may specifically include steps S301 to S306, as follows:

S301: Obtain the second question speech.

[0079] In this embodiment, the second question speech and the first question speech are speech from different channels collected from the same speech source. That is, the second question speech and the first question speech are multiple audio signals synchronously collected by multiple microphones at different spatial locations. The second question speech can be one or more speech signals. Each signal contains a target sound source (e.g., user A) and interfering sound sources (e.g., user B and ambient noise), but due to the differences in microphone positions, the intensity ratio and arrival time (phase) of the target and interference in each signal are different.

[0080] In this embodiment, considering that a single microphone cannot distinguish the direction of sound, when the terminal detects that the first question text recognized based on the first question voice collected by a single channel has a high perplexity, it will retrieve the second question voice collected synchronously with the first question voice.

[0081] In this embodiment, when the terminal detects that the perplexity of the first question text recognized based on the first question speech acquired through a single channel is less than or equal to the perplexity threshold, the second question speech can be deleted to reduce unnecessary memory space usage.

[0082] S302: Based on the phase difference information between the first question speech and the second question speech, determine the target direction of the speech source and generate a beamforming signal pointing to the target direction.

[0083] In this embodiment, the phase difference information refers to the difference in the time it takes for the same sound signal to arrive at different microphones, which is reflected in the phase shift of the audio waveform. By calculating the phase difference between the signals of each channel, the direction of arrival of the voice source can be deduced.

[0084] In this embodiment, the target direction of the speech source can be determined based on phase difference information using a preset sound source localization algorithm. The sound source localization algorithm can be GCC-PHAT (Generalized Cross-Correlation Phase Transform), subspace methods, or other algorithms.
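
As a simplified illustration of delay-based localization, the sketch below estimates the inter-channel delay with plain time-domain cross-correlation and converts it to an angle for a two-microphone, far-field geometry. This deliberately swaps in a simpler technique than GCC-PHAT, which additionally whitens the cross-spectrum in the frequency domain. The sample rate, microphone spacing, and function names are assumptions of this sketch.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, approximate value in air

def estimate_delay_samples(ref, other, max_lag):
    """Lag in samples at which `other` best matches `ref`.
    A positive result means `other` is a delayed copy of `ref`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(other):
                score += ref[i] * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def direction_of_arrival(delay_samples, sample_rate, mic_spacing):
    """Angle from broadside, in degrees, for two microphones (far field)."""
    tau = delay_samples / sample_rate
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing))
    return math.degrees(math.asin(ratio))

# Synthetic pulse; the second channel is the first delayed by 3 samples.
ref = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 10
other = [0.0] * 3 + ref[:-3]
lag = estimate_delay_samples(ref, other, max_lag=5)
angle = direction_of_arrival(lag, sample_rate=16000, mic_spacing=0.1)
print(lag, round(angle, 1))
```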

[0085] In this embodiment, beamforming is a digital signal processing technique that weights, delays, and sums multi-channel signals, and the beamforming signal is the output of this processing. Its principle is that, by adjusting the weights and delays of each channel signal, sound signals from the target direction are amplified by in-phase superposition, while interference signals from other directions are suppressed by out-of-phase cancellation.

[0086] In the specific implementation, the process of generating a beamforming signal pointing in the target direction is as follows: First, the system calculates the phase difference between the first question speech and the second question speech (i.e., the synchronization signal collected by the multi-channel microphone) and estimates the target direction of the sound source using GCC-PHAT or subspace methods. Then, based on the target direction, a preset digital beamforming filter is used to apply specific time delays and complex weights to the signals of each channel, so that the sound waves from the target direction are enhanced by in-phase superposition during synthesis, while the interfering sound waves from other directions are suppressed by out-of-phase cancellation. Finally, the multi-channel original speech signal is input into the filter for processing, and a clean beamforming signal pointing in the target direction is output, thereby improving the speech signal-to-noise ratio and clarity at the physical level.
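
The delay, weight, and sum principle can be illustrated with a basic delay-and-sum beamformer. This sketch assumes integer sample delays, equal weights, and equal-length channels; the complex weights and fractional delays mentioned above are omitted for brevity.

```python
def delay_and_sum(channels, delays):
    """Align each channel toward the target direction and average.

    `delays[k]` is the number of samples by which channel k lags the
    reference channel for sound arriving from the target direction.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay is required per channel")
    n = len(channels[0])
    out = [0.0] * n
    for signal, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay  # read ahead to undo the channel's lag
            if 0 <= j < n:
                out[i] += signal[j]
    return [value / len(channels) for value in out]

# A target pulse reaches microphone 1 two samples after microphone 0.
mic0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

steered = delay_and_sum([mic0, mic1], [0, 2])    # steered at the target
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # no alignment applied
print(max(steered), max(unsteered))
```

With the correct delays the pulse adds in phase and keeps its full amplitude, while without alignment its energy is spread across two samples at half amplitude. Sound from other directions is misaligned in the same way and is therefore attenuated.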

[0087] S303: Perform speech recognition on the beamforming signal to generate the third question text.

[0088] In this embodiment, the spatially enhanced beamforming signal, with its significantly improved signal-to-noise ratio, can be fed into the speech recognition model for re-recognition. Because the quality of the input signal is improved at the physical level, the probability of the speech recognition model making a mistake is greatly reduced. For example, mixed speech that was previously misrecognized may now be clearly recognized as accurate question text from user A.

[0089] S304: Input the third question text into the first model, and output the third response text through the first model.

[0090] In this embodiment, by re-inputting a more accurate third question text into the first model, the first model is able to output a more accurate third response text that reflects the user's true intent.

[0091] S305: Generate a third question-and-answer text pair based on the third question text and the third response text.

[0092] In this embodiment, the third question-and-answer text pair is a question-and-answer pair formed by recombining the third question text and the high-quality third response text output by the first model.

[0093] S306: Replace the first question-and-answer text pair with the third question-and-answer text pair in the first historical context of the first model.

[0094] In this embodiment, after generating the third question-and-answer text pair, the previously stored low-quality first question-and-answer text pair can be found in the first historical context of the first model and overwritten with the newly constructed third question-and-answer text pair.

[0095] In this implementation, by leveraging the hardware advantages of a multi-channel microphone array and combining sound source localization and beamforming signal processing techniques, noise and interference are actively suppressed at the physical signal level, fundamentally improving the input quality of front-end speech recognition. When perplexity detection at the software level detects a problem, this solution does not merely patch it at the back-end text level, but rather goes back to the audio input end, using spatial filtering techniques to reacquire a cleaner speech signal and recognize it again. This achieves high-quality optimization of low-quality dialogue history based on hardware collaboration, making it particularly suitable for complex acoustic scenarios such as multi-person conversations and noisy environments, significantly improving the robustness and recognition accuracy of the voice interaction system.
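
The overall flow of steps S301 to S306 can be sketched as a single function. Localization, beamforming, speech recognition, and the first model are passed in as caller-supplied functions; their names and signatures, and the choice to give the first model only the history preceding the replaced pair, are assumptions of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

TextPair = Tuple[str, str]  # (question text, response text)

def reprocess_with_beamforming(
    history: List[TextPair],
    index: int,
    channels: Sequence[Sequence[float]],  # S301: first and second question speech
    locate: Callable,
    beamform: Callable,
    recognize: Callable,
    first_model: Callable,
) -> List[TextPair]:
    direction = locate(channels)                                   # S302: target direction
    enhanced = beamform(channels, direction)                       # S302: beamforming signal
    third_question = recognize(enhanced)                           # S303
    third_response = first_model(third_question, history[:index])  # S304
    third_pair = (third_question, third_response)                  # S305
    updated = list(history)
    updated[index] = third_pair                                    # S306
    return updated

# Usage with trivial stand-in functions (hypothetical, for illustration only).
history = [("What's on TV tonight?", "A film at 8 pm."), ("wha tie mit", "Sorry?")]
updated = reprocess_with_beamforming(
    history, 1, [[0.0], [0.0]],
    locate=lambda ch: 30.0,
    beamform=lambda ch, direction: ch[0],
    recognize=lambda signal: "What time is it?",
    first_model=lambda question, context: "It is 3 pm.",
)
print(updated[1])
```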

[0096] In one feasible implementation, the first question text can be obtained by: acquiring the first question speech and user image; extracting features from the mouth region of the person in the user image to obtain visual feature information; extracting features from the first question speech to obtain speech feature information; inputting the visual feature information and speech feature information into a third model, and outputting the first question text through the third model.

[0097] In this embodiment, the first question speech refers to the raw audio signal generated when the user initiates the current round of dialogue. The user image refers to a video frame or image containing the user's face captured by a camera at the same time point as the acquisition of the first question speech. By acquiring the audio and video data streams of the user speaking in parallel through a microphone array and a camera, it is ensured that the subsequently extracted visual feature information and speech feature information are temporally aligned, providing a prerequisite for multimodal fusion.

[0098] In this embodiment, the mouth region refers to an image patch containing the lips and surrounding area, located and cropped from the user image. Visual feature information refers to a high-dimensional feature vector extracted from the mouth region image using a computer vision model (such as a convolutional neural network), which characterizes lip shape, degree of opening and closing, movement trajectory, and texture changes. Visual feature information is strongly correlated with the phonemes of pronunciation.

[0099] In practical implementation, a facial landmark detection model can first be used to locate the lips and then perform standardized cropping to obtain an image of the mouth region. Subsequently, the sequence of mouth region images is input into a visual feature extraction network. This network learns how the appearance and movement patterns of the lips correspond to different articulation units. For example, the lip closure and plosive movements are visually distinct when pronouncing the sounds "ba" and "pa". The extracted visual feature information is essentially the visual spectrum of the speech content, which is unaffected by ambient noise.

[0100] In this embodiment, speech feature information refers to the features extracted from the original audio signal that can characterize its acoustic properties, such as Mel frequency cepstral coefficients, Mel spectrograms, or deep acoustic features extracted by neural networks.

[0101] In practical implementation, the first question speech can be preprocessed (for example by pre-emphasis, framing, and windowing), and then its time-frequency features, such as a Mel spectrogram, can be calculated. Time-frequency features reflect the changes in the frequency and energy of the sound over time and are the core representation of audio content, but they are easily contaminated by background noise and reverberation.
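
The preprocessing steps named above can be sketched as follows. The pre-emphasis coefficient 0.97, the Hamming window, and the 400-sample frames with a 160-sample hop (25 ms and 10 ms at 16 kHz) are common choices assumed here for illustration. Computing the Mel spectrogram itself is left out of this sketch.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; boosts high frequencies."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - coeff * signal[i - 1]
                          for i in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, frame_len=400, hop=160):
    """Pre-emphasize, frame, and window a mono signal."""
    window = hamming(frame_len)
    emphasized = pre_emphasis(signal)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(emphasized, frame_len, hop)]

# A 1000-sample synthetic tone yields four full frames with these settings.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1000)]
frames = windowed_frames(tone)
print(len(frames), len(frames[0]))
```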

[0102] In this embodiment, the third model is an audio-visual speech recognition model. This third model is an end-to-end deep learning model used to fuse and jointly process visual and speech bimodal features, ultimately directly outputting the recognized text. For example, the third model can be an AVSR (Audio-Visual Speech Recognition) model. The architecture of this AVSR model typically includes two feature encoding branches (visual encoder and audio encoder) and a fusion decoder.

[0103] In this embodiment, the third model can fuse temporally aligned visual feature information sequences and speech feature information sequences to obtain fused multimodal features. Fusion can be performed in the early stage (feature concatenation), the middle stage (attention mechanism interaction), or the late stage (decision layer). For example, the third model can use a modal attention mechanism: automatically assigning higher weights to visual features in noisy audio frames; and relying more on audio features in areas with clear pronunciation but unclear lip movements. The fused multimodal features are fed into a pre-defined sequence modeling network (such as a Transformer or recurrent neural network), which learns the complex joint distribution between visual and speech features. Finally, the decoder predicts the most likely word sequence, i.e., the first question text, based on the more robust and information-rich joint representation output by the sequence modeling network.
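
The idea of modality weighting can be illustrated with a toy per-frame fusion. In an actual audio-visual speech recognition model the weights would be produced by learned attention layers; here they come from two hypothetical reliability scores passed in by the caller, and both feature vectors are assumed to have already been projected to the same dimension.

```python
import math

def softmax2(a, b):
    """Numerically stable softmax over two scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def fuse_frame(audio_feat, visual_feat, audio_score, visual_score):
    """Weighted sum of one frame's audio and visual feature vectors.

    The scores are hypothetical reliability estimates; a higher score
    gives the corresponding modality more weight in the fused feature.
    """
    w_audio, w_visual = softmax2(audio_score, visual_score)
    return [w_audio * a + w_visual * v for a, v in zip(audio_feat, visual_feat)]

audio_feat, visual_feat = [1.0, 0.0], [0.0, 1.0]
clean_frame = fuse_frame(audio_feat, visual_feat, 0.0, 0.0)    # equal trust
noisy_frame = fuse_frame(audio_feat, visual_feat, -10.0, 10.0) # audio unreliable
print(clean_frame, noisy_frame[1] > noisy_frame[0])
```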

[0104] In this embodiment, by fusing visual and audio information of the user's lips at the speech recognition front end, a more accurate and reliable first question text can be generated from the source. This effectively reduces the probability of generating highly perplexing question-and-answer pairs due to errors in recognizing a single audio modality, thereby fundamentally reducing the risk of dialogue history being contaminated. This lays a solid and clean data foundation for subsequent high-quality dialogue management and is particularly suitable for complex interactive scenarios with high requirements for reliability and anti-interference capabilities.

[0105] To facilitate better implementation of the dialogue management method of this application, this application also provides a dialogue management device based on the above-described dialogue management method. The meanings of the terms used are the same as in the dialogue management method described above, and specific implementation details can be found in the descriptions of the method embodiments.

[0106] Based on the same inventive concept, and referring to Figure 4, this application provides a dialogue management device 400, which includes: an acquisition module 401, used to acquire a first question-and-answer text pair, the first question-and-answer text pair including a first question text and a first answer text generated by a first model based on the first question text; a detection module 402, used to perform perplexity detection on the first question-and-answer text pair and obtain the perplexity of the first question-and-answer text pair; and an optimization module 403, used to optimize the first question-and-answer text pair in the first historical context of the first model when the perplexity is greater than the perplexity threshold.
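
One possible software arrangement of the three modules is sketched below. The class name, the method names, and the decision to inject the first model, the detector, and the optimizer as functions are assumptions of this sketch rather than the structure of the device itself.

```python
from typing import Callable, List, Tuple

TextPair = Tuple[str, str]  # (question text, answer text)

class DialogueManagementDevice:
    """Software sketch of device 400 with modules 401, 402, and 403."""

    def __init__(self, first_model: Callable, detect: Callable,
                 optimize: Callable, threshold: float):
        self.first_model = first_model
        self.detect = detect        # returns the perplexity of a pair
        self.optimize = optimize    # returns an optimized history
        self.threshold = threshold
        self.history: List[TextPair] = []

    def handle_turn(self, question_text: str) -> str:
        # Acquisition module 401: obtain the first question-and-answer text pair.
        answer = self.first_model(question_text, self.history)
        pair = (question_text, answer)
        self.history.append(pair)
        # Detection module 402: perplexity of the pair against earlier history.
        perplexity = self.detect(pair, self.history[:-1])
        # Optimization module 403: act only above the perplexity threshold.
        if perplexity > self.threshold:
            self.history = self.optimize(self.history, len(self.history) - 1)
        return answer

# Usage with stand-in functions; deletion is used as the optimization here.
device = DialogueManagementDevice(
    first_model=lambda question, history: "ok",
    detect=lambda pair, history: 0.9 if "noise" in pair[0] else 0.1,
    optimize=lambda history, index: history[:index] + history[index + 1:],
    threshold=0.5,
)
device.handle_turn("What's the weather tomorrow?")
device.handle_turn("noise noise")
print(device.history)
```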

[0107] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech using a speech recognition model; the detection module 402 includes: a score acquisition submodule, used to obtain the confidence score of the first question-and-answer text pair output by the speech recognition model; and a first determination submodule, used to determine the perplexity of the first question-and-answer text pair based on the confidence score.

[0108] In one embodiment, the detection module 402 includes: a relevance determination submodule, used to determine the relevance between the first question-and-answer text pair and the second historical context of the first model; and a second determination submodule, used to determine the perplexity of the first question-and-answer text pair based on the relevance.

[0109] In one embodiment, the optimization module 403 includes: a first optimization submodule, used to delete the first question-and-answer text pair in the first historical context of the first model; or a second optimization submodule, used to add a target identifier to the first question-and-answer text pair in the first historical context of the first model, the target identifier being used to instruct the first model to ignore the first question-and-answer text pair.

[0110] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech, and the optimization module 403 includes: a first response submodule, used to input the first question speech into the second model and output the second question text and the second response text through the second model; a first generation submodule, used to generate a second question-and-answer text pair based on the second question text and the second response text; and a first replacement submodule, used to replace the first question-and-answer text pair with the second question-and-answer text pair in the first historical context of the first model.

[0111] In one embodiment, the first question text is obtained by performing speech recognition on the first question speech, and the optimization module 403 includes: a speech acquisition submodule, used to acquire the second question speech, the second question speech and the first question speech being speech from different channels collected from the same speech source; a signal generation submodule, used to determine the target direction of the speech source based on the phase difference information between the first question speech and the second question speech, and to generate a beamforming signal pointing to the target direction; a second generation submodule, used to perform speech recognition on the beamforming signal and generate the third question text; a second response submodule, used to input the third question text into the first model and output the third response text through the first model; a third generation submodule, used to generate a third question-and-answer text pair based on the third question text and the third response text; and a second replacement submodule, used to replace the first question-and-answer text pair with the third question-and-answer text pair in the first historical context of the first model.

[0112] In one embodiment, the first question text is obtained in the following manner: acquiring the first question speech and a user image; performing feature extraction on the mouth region of the person in the user image to obtain visual feature information; performing feature extraction on the first question speech to obtain speech feature information; and inputting the visual feature information and the speech feature information into the third model, and outputting the first question text through the third model.

[0113] By employing the technical solution of this application embodiment, in each round of dialogue with the user, the perplexity of the first question-and-answer text pair is detected in real time, and targeted optimization processing is performed on the first question-and-answer text pairs with perplexity exceeding the perplexity threshold. This fundamentally prevents low-quality dialogue caused by noise misidentification and irrelevant interruptions from polluting the context of the first model. Thus, the context environment of the first model always maintains high relevance and high quality, significantly improving the coherence, accuracy, and overall user experience of multi-turn dialogues, and enhancing the robustness of the voice interaction system in complex real-world environments.

[0114] Specific limitations regarding the dialogue management device 400 can be found in the limitations of the dialogue management method described above, and will not be repeated here. Each module in the aforementioned dialogue management device 400 can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0115] In addition, this application also provides an electronic device. Figure 5 shows a structural diagram of the electronic device involved in this application. Specifically, the electronic device may include components such as a processor 501 with one or more processing cores and a memory 502 with one or more computer-readable storage media. Those skilled in the art will understand that the electronic device structure shown in Figure 5 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or have a different component arrangement. Wherein: the processor 501 is the control center of the electronic device. It connects the various parts of the electronic device via various interfaces and lines. By running or executing software programs and/or modules stored in the memory 502, and by calling data stored in the memory 502, it performs various functions and processes data, thereby providing overall monitoring of the electronic device. Optionally, the processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 501.

[0116] The memory 502 can be used to store software programs and modules. The processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the electronic device. In addition, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.

[0117] In one feasible implementation, the electronic device further includes a power supply 503 that supplies power to the various components. Preferably, the power supply 503 can be logically connected to the processor 501 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 503 may also include one or more DC or AC power supplies, recharging systems, power equipment debugging circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0118] In one feasible implementation, the electronic device may further include an input unit 504, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

[0119] Although not shown, the electronic device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 501 in the electronic device loads the executable files corresponding to the processes of one or more applications into the memory 502 according to the following instructions, and the processor 501 runs the applications stored in the memory 502, thereby implementing the steps in any of the dialogue management methods provided in the embodiments of this application.

[0120] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the electronic device to which the present application is applied. A specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.

[0121] In one feasible implementation, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the methods described in any embodiment of this application.

[0122] In one feasible implementation, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the methods described in any embodiment of this application.

[0123] In one feasible implementation, a computer program product is also proposed, comprising a computer program or instructions that, when executed by a processor, implement the methods described in any embodiment of this application.

[0124] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0125] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.

[0126] Therefore, this application provides a computer-readable storage medium storing a computer program that can be loaded by a processor to execute the steps in any of the dialogue management methods provided in this application.

[0127] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0128] The computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0129] Since the instructions stored in the computer-readable storage medium can execute the steps of any of the dialogue management methods provided in this application, the beneficial effects that any of the dialogue management methods provided in this application can achieve can be realized, as detailed in the preceding embodiments, and will not be repeated here.

[0130] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0131] The foregoing has provided a detailed description of a dialogue management method, apparatus, electronic device, and computer-readable storage medium provided in this application. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A dialogue management method, characterized in that the method includes: obtaining a first question-and-answer text pair, the first question-and-answer text pair including a first question text and a first answer text generated by a first model based on the first question text; performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair; and when the perplexity is greater than a perplexity threshold, optimizing the first question-and-answer text pair in a first historical context of the first model.
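
The three steps of claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the claim does not fix how perplexity is computed, so the sketch assumes the conventional language-model definition (the exponential of the mean negative token log-probability), and it assumes deletion as the optimization. The names `QAPair`, `perplexity_from_logprobs`, and `manage_turn` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One question-and-answer text pair (hypothetical container)."""
    question: str
    answer: str


def perplexity_from_logprobs(token_logprobs: List[float]) -> float:
    """Assumed metric: exp of the mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def manage_turn(history: List[QAPair], pair: QAPair,
                token_logprobs: List[float], threshold: float) -> bool:
    """Record the pair, detect its perplexity, and optimize if it is too high.

    Returns True when the pair exceeded the threshold and was removed.
    """
    history.append(pair)                      # step 1: obtain the pair
    ppl = perplexity_from_logprobs(token_logprobs)  # step 2: detect perplexity
    if ppl > threshold:                       # step 3: optimize the history
        history.pop()                         # assumed optimization: deletion
        return True
    return False
```

The threshold is a tuning parameter; the value used in any real deployment would have to be chosen empirically.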

2. The dialogue management method according to claim 1, characterized in that the first question text is obtained by performing speech recognition on a first question speech using a speech recognition model; and performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair includes: obtaining a confidence score of the first question text output by the speech recognition model; and determining the perplexity of the first question-and-answer text pair based on the confidence score.
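
One possible way to derive a perplexity from a recognition confidence is sketched below. The claim only states that the perplexity is determined "based on" the confidence score; the mapping used here (one minus the mean per-word confidence) and the function name `perplexity_from_confidence` are assumptions made for illustration.

```python
from typing import Sequence


def perplexity_from_confidence(word_confidences: Sequence[float]) -> float:
    """Map ASR confidences in [0, 1] to a perplexity-like score in [0, 1].

    Assumption: score = 1 - mean confidence. The source does not fix a formula.
    """
    if not word_confidences:
        # No recognized words at all: treat the turn as maximally uncertain.
        return 1.0
    if any(c < 0.0 or c > 1.0 for c in word_confidences):
        raise ValueError("confidences must lie in [0, 1]")
    mean_conf = sum(word_confidences) / len(word_confidences)
    return 1.0 - mean_conf
```

With this mapping, a low-confidence transcription produces a high score, so the same "greater than threshold" comparison from claim 1 applies unchanged.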

3. The dialogue management method according to claim 1, characterized in that performing perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair includes: determining the relevance between the first question-and-answer text pair and a second historical context of the first model; and determining the perplexity of the first question-and-answer text pair based on the relevance.
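
A relevance-based perplexity could be approximated as shown below. The sketch assumes a bag-of-words cosine similarity as the relevance measure and takes the perplexity to be one minus that similarity; neither choice comes from the source, and a production system would more plausibly use sentence embeddings. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def _bag(text: str) -> Counter:
    """Lower-cased word counts of a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def perplexity_from_relevance(pair_text: str, history_texts: List[str]) -> float:
    """Assumed mapping: perplexity = 1 - relevance to the historical context."""
    if not history_texts:
        # Assumption: with no history to compare against, do not flag the turn.
        return 0.0
    relevance = cosine(_bag(pair_text), _bag(" ".join(history_texts)))
    return 1.0 - relevance
```

A side conversation that shares few words with the ongoing dialogue receives a score close to 1, while an on-topic follow-up scores lower.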

4. The dialogue management method according to claim 1, characterized in that optimizing the first question-and-answer text pair in the first historical context of the first model includes: deleting the first question-and-answer text pair from the first historical context of the first model; or adding, in the first historical context of the first model, a target identifier to the first question-and-answer text pair, the target identifier being used to instruct the first model to ignore the first question-and-answer text pair.
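
The two alternatives in claim 4 (deleting the pair, or marking it with a target identifier) can be illustrated as follows. The marker string `[IGNORE]`, the `Turn` container, and the prompt layout produced by `render_context` are all hypothetical; the source does not specify what the identifier looks like or how the context is serialized.

```python
from dataclasses import dataclass
from typing import List

IGNORE_TAG = "[IGNORE]"  # hypothetical target identifier


@dataclass
class Turn:
    question: str
    answer: str
    ignored: bool = False


def delete_turn(history: List[Turn], index: int) -> None:
    """Option 1: remove the pair from the historical context entirely."""
    del history[index]


def tag_turn(history: List[Turn], index: int) -> None:
    """Option 2: keep the pair but mark it so the model is told to ignore it."""
    history[index].ignored = True


def render_context(history: List[Turn]) -> str:
    """Serialize the history into prompt text, prefixing ignored turns."""
    lines = []
    for turn in history:
        prefix = f"{IGNORE_TAG} " if turn.ignored else ""
        lines.append(f"{prefix}User: {turn.question}")
        lines.append(f"{prefix}Assistant: {turn.answer}")
    return "\n".join(lines)
```

Tagging preserves the record for later auditing or re-evaluation, whereas deletion shortens the context; for tagging to have any effect, the model's instructions would also need to explain what the identifier means.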

5. The dialogue management method according to claim 1, characterized in that the first question text is obtained by performing speech recognition on a first question speech; and optimizing the first question-and-answer text pair in the first historical context of the first model includes: inputting the first question speech into a second model, and outputting a second question text and a second response text through the second model; generating a second question-and-answer text pair based on the second question text and the second response text; and replacing, in the first historical context of the first model, the first question-and-answer text pair with the second question-and-answer text pair.
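
The replacement step of claim 5 might look like the sketch below. The second model is represented by an arbitrary callable that takes raw speech bytes and returns a question text and a response text; this interface, the tuple representation of a pair, and the name `replace_with_second_model` are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]                         # (question text, response text)
SecondModel = Callable[[bytes], Tuple[str, str]]  # assumed interface


def replace_with_second_model(history: List[QAPair], index: int,
                              question_speech: bytes,
                              second_model: SecondModel) -> QAPair:
    """Re-process the original speech and overwrite the flagged pair in place."""
    second_question, second_response = second_model(question_speech)
    history[index] = (second_question, second_response)
    return history[index]
```

Because the pair is overwritten at the same index, the order of the remaining turns in the historical context is unchanged.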

6. The dialogue management method according to claim 1, characterized in that the first question text is obtained by performing speech recognition on a first question speech; and optimizing the first question-and-answer text pair in the first historical context of the first model includes: acquiring a second question speech, the second question speech and the first question speech being speech of different channels collected from the same speech source; determining a target direction of the speech source based on phase difference information between the first question speech and the second question speech, and generating a beamforming signal pointing to the target direction; performing speech recognition on the beamforming signal to generate a third question text; inputting the third question text into the first model, and outputting a third response text through the first model; generating a third question-and-answer text pair based on the third question text and the third response text; and replacing, in the first historical context of the first model, the first question-and-answer text pair with the third question-and-answer text pair.
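
The direction-finding and beamforming steps of claim 6 can be illustrated with a two-microphone delay-and-sum beamformer. The sketch is a simplification under stated assumptions: instead of working with per-frequency phase differences, it estimates the inter-channel delay in whole samples by time-domain cross-correlation (a delay of τ seconds corresponds to a phase difference of 2πfτ at frequency f), and it assumes a far-field source and a speed of sound of 343 m/s. All function names and parameters are hypothetical.

```python
import math
from typing import List

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_delay(ch1: List[float], ch2: List[float], max_lag: int) -> int:
    """Return the lag in samples by which ch2 trails ch1 (cross-correlation peak)."""
    best_lag, best_score = 0, -math.inf
    n = len(ch1)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ch1[i] * ch2[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_of_arrival(lag: int, sample_rate: float, mic_spacing: float) -> float:
    """Far-field arrival angle in radians, measured from the array broadside."""
    sin_theta = SPEED_OF_SOUND * lag / sample_rate / mic_spacing
    return math.asin(max(-1.0, min(1.0, sin_theta)))


def delay_and_sum(ch1: List[float], ch2: List[float], lag: int) -> List[float]:
    """Shift ch2 back by the estimated lag and average it with ch1."""
    n = len(ch1)
    out = []
    for i in range(n):
        j = i + lag
        aligned = ch2[j] if 0 <= j < n else 0.0  # zero-pad past the edges
        out.append(0.5 * (ch1[i] + aligned))
    return out
```

The averaged output reinforces sound arriving from the estimated direction and attenuates sound from other directions; it would then be passed to speech recognition to produce the third question text.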

7. The dialogue management method according to claim 1, characterized in that the first question text is obtained by: acquiring a first question speech and a user image; performing feature extraction on the mouth region of the person in the user image to obtain visual feature information; performing feature extraction on the first question speech to obtain speech feature information; and inputting the visual feature information and the speech feature information into a third model, and outputting the first question text through the third model.

8. A dialogue management device, characterized in that the device includes: an acquisition module, used to acquire a first question-and-answer text pair, the first question-and-answer text pair including a first question text and a first answer text generated by a first model based on the first question text; a detection module, used to perform perplexity detection on the first question-and-answer text pair to obtain the perplexity of the first question-and-answer text pair; and an optimization module, used to optimize the first question-and-answer text pair in a first historical context of the first model when the perplexity is greater than a perplexity threshold.

9. An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the dialogue management method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the dialogue management method as described in any one of claims 1 to 7.