Intelligent dialogue method and device, electronic equipment and storage medium

The intelligent dialogue system, built on the WebSocket and SSE protocol architecture and combined with a dynamic interruption mechanism based on voice energy and semantic judgment, solves the problems of interaction latency and compliance risks in existing systems, and achieves natural and smooth voice-text collaborative interaction and personalized dialogue experience.

CN122245310A · Pending · Publication Date: 2026-06-19 · 中国邮政储蓄银行股份有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
中国邮政储蓄银行股份有限公司
Filing Date
2026-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing intelligent dialogue systems suffer from high interaction latency, multimodal asynchrony, primitive interruption mechanisms, lack of user profiling and context awareness capabilities, and lack of multi-turn context memory and recovery mechanisms, resulting in unnatural user experience, high compliance risks, and low professionalism.

Method used

It adopts a dual-protocol architecture of WebSocket and SSE to achieve millisecond-level synchronization of ASR, LLM and TTS modules. Combined with a dynamic interruption mechanism based on voice energy and semantic judgment, it adjusts the interaction strategy through user profiles and business state machines, and maintains a context memory buffer to ensure dialogue continuity.

Benefits of technology

It achieves millisecond-level streaming voice-text collaborative interaction, supports "speaking while recognizing, generating and broadcasting", enhances user immersion and professionalism, reduces compliance risks, and provides personalized experience and dialogue coherence.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses an intelligent dialogue method, device, electronic device, and storage medium. The method includes responding to user voice input by outputting two synchronous results: one is text data generated by an LLM (Large Language Model), and the other is voice data obtained through speech recognition and speech synthesis. Based on the voice data, a dynamic engine interruption mechanism is used to process it to obtain optimized target voice. This application optimizes intelligent dialogue and solves the shortcomings of traditional voice assistants, such as high response latency, rigid interruption mechanisms, and fragmented user experience in multi-turn dialogues.

Description

Technical Field

[0001] This application relates to the field of intelligent dialogue technology, and in particular to an intelligent dialogue method, device, electronic device, and storage medium.

Background Technology

[0002] Streaming voice dialogue is a real-time, two-way human-computer voice interaction method. Its core lies in the streaming processing capabilities of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), enabling a natural conversational experience where users can speak and listen simultaneously, greatly reducing interaction latency. Simply put, it doesn't wait for the user to finish speaking before processing and responding. Instead, the system recognizes the content and understands the intent while the user is speaking, and generates and plays the response speech at the appropriate time, making the whole process resemble a natural conversation between humans.

[0003] Existing technical solution one: Rule-based or NLP-based intelligent customer service systems, such as Rasa and Dialogflow. These systems rely on predefined intent classifiers (such as BERT-based models), slot-filling models, and state machine dialogue managers, and employ a serial blocking architecture of sentence recognition → sentence generation → sentence playback. The drawbacks are: inability to understand complex semantics and context; high interaction latency; lack of streaming capabilities; and text-to-speech asynchrony. Furthermore, the interruption mechanism relies solely on voice activity detection (VAD) driven by a voice energy threshold and lacks semantic judgment capabilities.

[0004] Existing technical solution two: A voice interaction system based on WebSocket, which supports audio stream transmission via WebSocket to achieve real-time ASR recognition and TTS playback. However, text generation still relies on traditional NLP or static templates and does not introduce large language model streaming generation. The disadvantages are: text generation is not streaming, and TTS only begins after the entire sentence is output; there is no SSE collaboration mechanism, and the front end cannot render the generated content in real time; interruption control is crude, only supporting "immediate interruption" and lacking semantic safety point judgment.

[0005] Existing technical solution three: An LLM-based dialogue system that uses a large language model to generate semantic content and supports streaming token output (SSE), but it is not deeply coupled with the speech engine and lacks a dual-mode interruption mechanism of "speech + semantics". Its drawbacks include: speech playback and text generation are not strictly synchronized; there is no dynamic strategy engine to adjust interruption behavior based on user profiles; there is no context memory recovery mechanism, making it easy to lose the dialogue state after an interruption; and it is not optimized for highly compliant scenarios such as finance.

Summary of the Invention

[0006] This application provides an intelligent dialogue method, device, electronic device, and storage medium to achieve millisecond-level "simultaneous generation, display, and playback." At the same time, strict multimodal synchronization ensures a natural and smooth experience, and intelligent interruption remains compliant with controllable risk.

[0007] The embodiments of this application adopt the following technical solutions:

[0008] In a first aspect, embodiments of this application provide an intelligent dialogue method, the method comprising:

[0009] In response to user voice input, two synchronous results are output: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis.

[0010] Based on the voice data, a dynamic engine interruption mechanism is used to process it, resulting in the optimized target voice.

[0011] In some embodiments, the step of responding to user voice input and outputting two synchronized results includes:

[0012] The WebSocket protocol is used to establish WebSocket connections with the TTS service and ASR service respectively to respond to the user's voice input;

[0013] An HTTP streaming request is initiated to the LLM Large Language Model service based on the SSE streaming protocol to obtain the response content corresponding to the user's voice input.

[0014] In some embodiments, the step of processing the speech data using a dynamic engine interruption mechanism to obtain optimized target speech includes:

[0015] The voice energy of the voice data is detected. If the voice energy exceeds a preset dynamic threshold and the duration exceeds a specified duration, it is determined to be a potential user speaking and an interruption candidate event is triggered.

[0016] The semantic interruption decision module sends a request to the agent and determines whether to interrupt the current audio playback based on the output of the semantic interruption decision node.

[0017] In some embodiments, the step of processing the speech data using a dynamic engine interruption mechanism to obtain optimized target speech includes:

[0018] Execute differentiated interruption strategies based on semantic judgment results;

[0019] If it is an interruptible policy, immediately terminate the TTS playback process and clear any unfinished statements in the context memory buffer;

[0020] If the policy is non-interruptible, record the interruption request and start a timer, setting an interruption checkpoint a preset number of milliseconds after the end of the current statement.
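The two branches above can be sketched as follows. This is a minimal illustration, not the application's implementation: `PlaybackSession`, its field names, and the default `offset_ms` value are hypothetical, and the checkpoint is read here as "end of the current statement plus a preset offset".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlaybackSession:
    """Hypothetical playback state; all field names are illustrative."""
    tts_playing: bool = True
    # Unfinished statements held in the context memory buffer.
    pending_statements: List[str] = field(default_factory=list)
    # Checkpoint (ms) at which a deferred interruption may take effect.
    deferred_interrupt_at_ms: Optional[int] = None


def apply_interrupt_policy(session, interruptible, statement_end_ms, offset_ms=300):
    """Apply the interruptible / non-interruptible branch to a session."""
    if interruptible:
        # Interruptible policy: stop TTS at once and drop unfinished statements.
        session.tts_playing = False
        session.pending_statements.clear()
        return "interrupted"
    # Non-interruptible policy: record the request and schedule a checkpoint
    # a preset number of milliseconds after the current statement ends.
    session.deferred_interrupt_at_ms = statement_end_ms + offset_ms
    return "deferred"
```

A caller would pass the semantic judgment result as `interruptible` and poll `deferred_interrupt_at_ms` from its own timer.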

[0021] In some embodiments, the step of processing the speech data using a dynamic engine interruption mechanism to obtain optimized target speech includes:

[0022] Based on the results of this interaction, adaptive adjustments are executed through the dynamic strategy engine.

[0023] If a user frequently attempts to interrupt and is rejected each time, the speech energy threshold should be appropriately lowered or the judgment tolerance of the LLM large language model should be increased.

[0024] If a user forcibly interrupts the interaction at a high-risk point, causing misunderstanding, the weight of interruption restrictions in subsequent similar scenarios will be increased.

[0025] In response to interruption events, update the interruption preference tags in the user profile and record all interruption events to the log. The interruption events include at least the trigger time, energy peak, statement content, LLM judgment result, actual interruption time, and subsequent user behavior fields.
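As a rough sketch of the logging and adaptation described above, the record below carries the listed fields and a helper lowers the energy threshold after repeated rejections. The field names, the `"reject"` label, and the numeric defaults (`min_rejections`, `step`, `floor`) are assumptions made for illustration; the application does not specify them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InterruptionEvent:
    trigger_time_ms: int
    energy_peak: float
    statement: str
    llm_decision: str              # assumed labels: "allow" or "reject"
    actual_interrupt_time_ms: int
    subsequent_user_behavior: str


def to_log_line(event):
    """Serialize one interruption event as a JSON log line."""
    return json.dumps(asdict(event), ensure_ascii=False)


def adapt_energy_threshold(threshold, recent_decisions,
                           min_rejections=3, step=0.1, floor=0.2):
    """Lower the threshold when the last `min_rejections` attempts were all rejected."""
    tail = recent_decisions[-min_rejections:]
    if len(tail) == min_rejections and all(d == "reject" for d in tail):
        return max(floor, threshold - step)
    return threshold
```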

[0026] In some embodiments, the method further includes: maintaining a context memory buffer for accurately restoring context semantics after interruption recovery, wherein the context memory buffer stores information including at least:

[0027] The original ASR text and semantic parsing results of the most recent three rounds of dialogue;

[0028] The syntax tree fragment and intent label of the currently incomplete statement;

[0029] The audio playback progress indicator at the moment of interruption;

[0030] Audio block index mapping table in the TTS synthesis process.
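One possible in-memory layout for the four items listed above is sketched here. The class and field names are hypothetical, and the value types (for example, representing a syntax tree fragment as a string) are simplifying assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class ContextMemoryBuffer:
    # Most recent three rounds: (original ASR text, semantic parsing result).
    rounds: Deque[Tuple[str, dict]] = field(default_factory=lambda: deque(maxlen=3))
    # Syntax tree fragment and intent label of the currently incomplete statement.
    pending_fragment: Optional[str] = None
    pending_intent: Optional[str] = None
    # Audio playback progress at the moment of interruption.
    playback_position_ms: int = 0
    # TTS audio block index -> (start, end) character span it was synthesized from.
    audio_block_index: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def add_round(self, asr_text, parse_result):
        """Append a dialogue round; the deque silently drops the oldest beyond three."""
        self.rounds.append((asr_text, parse_result))
```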

[0031] In some embodiments, the method further includes:

[0032] Multiple dialogue state nodes are defined using a business state machine, and corresponding actions are triggered based on state transition conditions, specifically including:

[0033] When entering a sensitive conversation node, automatically reduce the interruption sensitivity;

[0034] Match the preset TTS speech rate template corresponding to the current state;

[0035] Dynamically load corresponding prompt word templates for the LLM large language model to generate response content.
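The state-driven actions above could be expressed as a small lookup table. The state names, event names, sensitivity values, and template paths below are all invented for illustration; only the three kinds of per-state action come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateConfig:
    interrupt_sensitivity: float   # lower value = harder to interrupt
    tts_rate_template: str         # preset speech rate template for the state
    prompt_template: str           # prompt word template loaded for the LLM


# Hypothetical dialogue state nodes; "rate_disclosure" plays the sensitive node.
STATE_TABLE = {
    "greeting": StateConfig(1.0, "normal", "prompts/greeting.txt"),
    "rate_disclosure": StateConfig(0.3, "slow", "prompts/rate_disclosure.txt"),
}

# (current state, event) -> next state
TRANSITIONS = {("greeting", "user_asks_rate"): "rate_disclosure"}


def transition(state, event):
    """Return the next state and the configuration to apply on entering it."""
    next_state = TRANSITIONS.get((state, event), state)
    return next_state, STATE_TABLE[next_state]
```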

[0036] In a second aspect, embodiments of this application also provide an intelligent dialogue device, the device comprising:

[0037] The output module is used to respond to user voice input and output two synchronous results: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis.

[0038] The interruption module is used to process the voice data using a dynamic engine interruption mechanism to obtain the optimized target voice.

[0039] In a third aspect, embodiments of this application also provide an electronic device, including: a processor; and a memory arranged to store computer-executable instructions, which, when executed, cause the processor to perform the above-described method.

[0040] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium that stores one or more programs, which, when executed by an electronic device including multiple applications, cause the electronic device to perform the above-described method.

[0041] At least one technical solution adopted in the embodiments of this application can achieve the following beneficial effects: In response to user voice input, two synchronous results are output: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis. The text data can be returned as SSE streaming response text, while the voice data is subject to interruption handling. Specifically, a dynamic engine interruption mechanism processes the voice data, and the optimized target voice is then returned over the WebSocket protocol. In this way, millisecond-level streaming voice-text collaborative interaction can be achieved: the WebSocket + SSE dual-protocol architecture supports a natural dialogue experience of "speaking while recognizing, generating and broadcasting". In addition, a voice and semantic dual-mode dynamic coupling interruption mechanism achieves intelligent, secure and personalized interruption control while ensuring compliance.

Attached Figure Description

[0042] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0043] Figure 1 is a schematic diagram of the system architecture of the intelligent dialogue method in the embodiments of this application;

[0044] Figure 2 is a flowchart illustrating the intelligent dialogue method in the embodiments of this application;

[0045] Figure 3 is a schematic diagram of the structure of the intelligent dialogue device in the embodiments of this application;

[0046] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of this application.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0048] The drawbacks of the related technologies are:

[0049] (1) High interaction latency and poor non-streaming experience

[0050] Existing systems mostly adopt a serial mode of "sentence recognition → sentence generation → sentence broadcasting". After the user finishes speaking, the system will only respond after 1 to 3 seconds, which does not conform to the rhythm of natural conversation.

[0051] (2) Outdated protocol architecture and multimodal asynchrony

[0052] The lack of a WebSocket and SSE coordination mechanism leads to asynchronous playback of ASR text recognition, LLM text generation, and TTS audio, resulting in a disconnect between front-end display and voice playback, which affects user experience.

[0053] (3) The interruption mechanism is primitive and lacks semantic safety control.

[0054] Relying solely on voice energy thresholds to trigger interruptions, and lacking semantic judgment capabilities, it is easily interrupted when broadcasting key information (such as interest rates and contract numbers), resulting in compliance risks or business losses.

[0055] (4) Lack of user profiling and context awareness

[0056] All users are subject to a uniform interruption policy, which cannot dynamically adjust the speaking speed, tone, and interruption tolerance based on user type (VIP / high-risk), emotional state, or historical behavior, resulting in a rigid user experience.

[0057] (5) No multi-turn context memory and retrieval mechanism

[0058] When a user interrupts, the system loses its dialogue state, leading to repeated questions, logical breaks, and disjointed LLM-generated content, which seriously affects professionalism and trustworthiness.

[0059] To address the aforementioned shortcomings, the intelligent dialogue method in this application embodiment has the following technical effects:

[0060] (1) Reduced interaction latency: Through the streaming collaborative architecture, the system can respond quickly after the user finishes speaking, which is far superior to the high latency of traditional systems.

[0061] (2) Strict multimodal synchronization ensures a natural and smooth experience. The error between text display and voice playback is very small, achieving "what you see is what you hear" and enhancing user immersion.

[0062] (3) Intelligent, compliant interruption with controllable risk. Semantic judgment prevents interruption during key statements (such as "annual interest rate of 6.5%"), reducing compliance risk.

[0063] (4) Personalized experience and improved conversion rate. The dynamic strategy engine adjusts the interaction style according to user profile. It is more lenient for VIP users to interrupt and more cautious for high-risk users, which can greatly improve satisfaction.

[0064] (5) Enhanced dialogue coherence and improved professionalism. The contextual memory mechanism ensures that semantics are not lost after interruption, avoids repeated questioning, and enhances user trust.

[0065] The technical solutions provided by the various embodiments of this application are described in detail below with reference to the accompanying drawings.

[0066] As shown in Figure 1, the system is based on a distributed software architecture that combines streaming communication protocol collaboration, dual-mode interruption decision-making, and context memory retention, and is divided into four core modules:

[0067] Front-end (Web / APP): Responsible for user interaction and audio / video capture.

[0068] Real-time voice dialogue engine: the core scheduling and real-time processing hub.

[0069] Voice middleware: Provides ASR (automatic speech recognition) and TTS (text-to-speech) capabilities.

[0070] Agent platform: Provides large-scale model dialogue, knowledge base and tool call capabilities.

[0071] It consists of four core layers: front-end interaction layer, real-time voice dialogue engine layer, voice middleware layer, and Agent intelligent platform layer. Each layer works together through standardized communication protocols to achieve end-to-end real-time voice interaction capabilities.

[0072] The front-end (Web / APP) includes, but is not limited to: an audio acquisition unit: capturing user voice input; a WebSocket communication unit: establishing a long-lived connection with the real-time voice dialogue engine to transmit audio streams and text messages; and a real-time UI rendering unit: displaying interactive interfaces such as dialogue status and playback feedback.

[0073] For the front-end interaction layer (Web / App), the front-end serves as the user interaction entry point, responsible for collecting user input, displaying interaction status, and establishing a real-time communication link with the back-end engine. It mainly includes the following units: Audio Acquisition Unit: Collects user voice input through the device's microphone, completing the initial encoding and encapsulation of audio data. WebSocket Communication Unit: Establishes a long-lived connection with the real-time voice dialogue engine, enabling bidirectional real-time transmission of audio streams and text messages. Real-time UI Rendering Unit: Dynamically renders the dialogue interface, playback status, and interactive prompts based on engine feedback, enhancing the user experience.

[0074] The real-time voice dialogue engine includes, but is not limited to: audio processing: audio format processing, audio streaming, message conversion; dialogue control: dialogue routing, session control, session initialization, interruption control, dialogue state management; and output feedback: playback feedback, text streaming SSE, and text stream segmentation.

[0075] For the real-time voice dialogue engine layer, this layer serves as the core scheduling hub of the system, responsible for coordinating voice processing, dialogue management, and intelligent interaction. Its main functional modules include: Audio Processing Module: Includes sub-functions such as audio format processing, audio streaming, and message conversion, adapting to different front-end and middleware audio protocols. Dialogue Control Module: Ensures the continuity and real-time performance of multi-turn dialogues through mechanisms such as dialogue routing, session control, session initialization, interruption control, and dialogue state management. Output Feedback Module: Supports text streaming SSE push, text stream segmentation, and audio playback feedback, enabling real-time output of intelligent responses.

[0076] The voice middleware includes, but is not limited to: TTS (Text-to-Speech): CosyVoice2 TTS, iFlytek XTTS 4.0, traditional TTS, and iFlytek XTTS 5.0. ASR (Automatic Speech Recognition): FunASR and iFlytek ASR.

[0077] The voice middleware layer provides professional speech recognition (ASR) and text-to-speech (TTS) capabilities. It interacts with the real-time engine via WebSocket. Key components include: a TTS service integrating CosyVoice2 TTS, iFlytek XTTS 4.0 / 5.0, and traditional TTS engines, supporting multi-voice and multi-scenario speech generation; and an ASR service integrating FunASR, iFlytek ASR, and other recognition engines to achieve high-accuracy real-time speech-to-text conversion.

[0078] The Agent platform includes, but is not limited to: Capability orchestration layer: process orchestration, tool invocation, knowledge base RAG, MCP. LLM-MA platform: large models such as DeepSeek, Pangu, GLM, and Qwen.

[0079] The Agent intelligent platform layer provides the system with intelligent decision-making and content generation capabilities. It interacts with the real-time engine via HTTP requests and SSE streaming responses. It is mainly divided into: Capability Orchestration Layer: This layer utilizes components such as process orchestration, tool invocation, knowledge base RAG, and MCP to decompose and execute complex tasks. LLM-MA Platform: This platform integrates large language models such as DeepSeek, Pangu, GLM, and Qwen, providing natural language understanding, dialogue generation, and knowledge reasoning capabilities.

[0080] The specific interaction process between each of the above modules includes:

[0081] Step S1. Frontend → Real-time Voice Dialogue Engine

[0082] Establish a WebSocket persistent connection to transmit audio streams and text messages in real time.

[0083] Step S2. Real-time voice dialogue engine ↔ Voice middleware

[0084] Establish a WebSocket connection with ASR for speech recognition; establish a WebSocket connection with TTS for speech synthesis.

[0085] Step S3. Real-time Voice Dialogue Engine ↔ Agent Platform

[0086] The Agent platform is invoked via an HTTP request; the Agent platform returns the scripted text via an SSE streaming response.

[0087] One round of interaction proceeds as follows:

1. Voice Acquisition and Recognition: User voice is acquired by the front end and transmitted to the real-time engine via WebSocket. The engine then schedules the ASR middleware to convert the speech to text.

2. Intelligent Interaction and Generation: The recognized text is routed by the engine to the Agent platform. After the large model generates the response text, it is streamed back to the engine via SSE.

3. Speech Synthesis and Output: The response text is synthesized into speech by the TTS middleware scheduled by the engine and then pushed to the front end for playback via WebSocket, completing one round of interaction.

[0088] By combining the above system with intelligent dialogue methods, it is possible to solve the problems of high response latency, rigid interruption mechanisms, and fragmented user experience in traditional voice assistants during multi-turn dialogues.

[0089] This application provides an intelligent dialogue method. Figure 2 is a flowchart of the intelligent dialogue method in an embodiment of this application. As shown in Figure 2, the method includes at least the following steps S210 to S220:

[0090] In step S210, in response to the user's voice input, two synchronous results are output: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis.

[0091] After the front end transfers the user's voice input to the real-time voice dialogue engine, it outputs two synchronous results. The text data generated by the LLM large language model and the voice data obtained by speech recognition and speech synthesis are returned to the real-time voice dialogue engine.

[0092] Step S220: Based on the speech data, a dynamic engine interruption mechanism is used to process it to obtain the optimized target speech.

[0093] By using a dynamic engine interruption mechanism to process the speech data, an optimized target speech can be obtained, which is then returned along with the text data.

[0094] The above method responds to user voice input, outputting two synchronous results: one is text data generated by the LLM large language model, and the other is voice data obtained through speech recognition and speech synthesis. This achieves millisecond-level streaming voice-text collaborative interaction, supporting a natural dialogue experience of "speaking while recognizing, generating, and broadcasting" through a WebSocket + SSE dual-protocol architecture.

[0095] Based on the aforementioned method, the voice data is processed using a dynamic engine interruption mechanism to obtain optimized target voice. A dual-mode dynamic coupling interruption mechanism combining voice and semantics is constructed to achieve intelligent, secure, and personalized interruption control while ensuring compliance.

[0096] Unlike related technologies that suffer from high interaction latency and poor non-streaming experience, the above method, using a dual-protocol streaming collaborative architecture of WebSocket and SSE, achieves millisecond-level synchronization of the three modules ASR, LLM, and TTS, supporting "generating, displaying, and playing simultaneously".

[0097] Unlike the primitive interruption mechanisms in related technologies, which lack semantic security controls, the above method introduces a user profile and business state machine linkage strategy engine to achieve dynamic adaptation of interruption parameters, speech rate and tone, and speech style; and establishes a context memory buffer and recovery mechanism to ensure semantic coherence and state traceability in multi-turn dialogues, thereby improving professionalism and conversion rate.

[0098] Unlike related technologies that lack multi-turn context memory and recovery mechanisms, the above method employs a dual-mode dynamic coupling interruption controller based on speech and semantics: speech energy serves only as a candidate trigger, while the semantic model determines whether to execute the interrupt and the safety point offset.

[0099] It's important to note that replacing WebSocket with gRPC streaming or HTTP/2 Server Push requires frontend support and has poor compatibility. Furthermore, replacing SSE with WebSocket text stream pushing of LLM tokens increases protocol complexity and negates the inherent streaming advantages of SSE.

[0100] It should be noted that when replacing the interruption decision model, LLM semantic judgment can be replaced with a lightweight BERT classifier or rule engine, but the semantic understanding depth is insufficient and the accuracy decreases; the safety point offset can be replaced with a fixed delay (such as 1.5 seconds), but the flexibility is poor and it cannot adapt to different sentence lengths.

[0101] It's important to note that when using a strategy engine, user profile-driven approaches can be replaced with static configurations (such as applying a uniform policy to all users), but the personalized advantage is lost. Furthermore, while dynamic parameter adjustments can be replaced with manual rule configurations, they cannot achieve the adaptive evolution of "understanding you better the more you use it."

[0102] In one embodiment of this application, the step of responding to user voice input and outputting two synchronous results includes: establishing WebSocket connections with TTS service and ASR service respectively using the WebSocket protocol to respond to the user voice input; and initiating an HTTP streaming request to LLM large language model service based on SSE streaming protocol to obtain the response content corresponding to the user voice input.

[0103] It adopts a collaborative processing mechanism of WebSocket and SSE protocols. At the same time, in order to overcome the problems of "response lag" and "streaming asynchrony" in existing voice interaction systems, it provides a dual-protocol collaborative flow control mechanism. Through the division of labor and cooperation between WebSocket and SSE protocols, it realizes a low-latency closed loop from voice input to text / audio output.

[0104] The WebSocket protocol is a network protocol that enables full-duplex communication over a single TCP connection, allowing clients and servers to establish persistent, low-latency, bidirectional data transmission channels.

[0105] The SSE protocol, or Server-Sent Events, is a one-way real-time communication technology based on the HTTP protocol. It allows the server to proactively push data streams to the client without the client having to poll frequently, making it particularly suitable for scenarios where the server needs to continuously send updates to the client.

[0106] The technical implementation process is as follows:

[0107] Step S1: Upload the audio stream in segments from the front end.

[0108] After the front-end interaction module starts recording, it collects audio data every 100ms, forming a data block of approximately 2KB in size. This data block is packaged into a WebSocket binary frame (Opcode=0x2), with an additional timestamp field (Unix millisecond-level format) and a sequence number (incrementing from 0), and sent to the back-end engine module via a WSS encrypted channel.
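The framing in step S1 can be sketched as below. The text states that a timestamp and a sequence number accompany each roughly 2 KB audio block but does not give the byte layout, so the 12-byte header used here (`!QI`: unsigned 64-bit millisecond timestamp, unsigned 32-bit sequence number, network byte order) is purely an assumption. Sending the result as a WebSocket binary frame is left to whichever client library is in use.

```python
import struct
import time

# Assumed header layout: uint64 Unix timestamp in ms, uint32 sequence number.
HEADER = struct.Struct("!QI")


def pack_audio_frame(seq, pcm_chunk, timestamp_ms=None):
    """Prefix one audio chunk with its timestamp and sequence number."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return HEADER.pack(timestamp_ms, seq) + pcm_chunk


def unpack_audio_frame(frame):
    """Split a frame back into (timestamp_ms, seq, pcm_chunk)."""
    timestamp_ms, seq = HEADER.unpack_from(frame)
    return timestamp_ms, seq, frame[HEADER.size:]
```

On the receiving side, `seq` allows reordering and gap detection, while `timestamp_ms` feeds the later text/audio alignment.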

[0109] Step S2: ASR incremental recognition and text push.

[0110] After receiving the audio block, the backend immediately forwards it to the ASR service through the speech engine middleware. The ASR service returns two types of results:

[0111] Partial Result: Not the final recognition result, carrying the is_final=false flag, used for real-time display of draft text on the front end;

[0112] Final Result: Sentence recognition complete, is_final=true, triggering the semantic understanding and response generation process.

[0113] All text results are pushed to the front end in JSON format via the same WebSocket connection:

[0114]-[0120] JSON:

    {
      "type": "asr_text",
      "text": "Hello, this is an insurance consultation service",
      "is_final": false,
      "timestamp": 1719876543210,
      "seq": 5
    }
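A minimal sketch of producing and consuming this message follows. The JSON field names are taken from the example above; the function names and the two callbacks (`on_partial`, `on_final`) are hypothetical.

```python
import json


def make_asr_message(text, is_final, timestamp_ms, seq):
    """Build the asr_text message pushed to the front end."""
    return json.dumps(
        {"type": "asr_text", "text": text, "is_final": is_final,
         "timestamp": timestamp_ms, "seq": seq},
        ensure_ascii=False,
    )


def handle_asr_message(raw, on_partial, on_final):
    """Route a message: partial results update the draft text,
    final results trigger semantic understanding and response generation."""
    msg = json.loads(raw)
    if msg.get("type") != "asr_text":
        return False  # not an ASR text message; ignored by this handler
    if msg["is_final"]:
        on_final(msg["text"])
    else:
        on_partial(msg["text"])
    return True
```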

[0121] Step S3: SSE stream calls LLM to generate responses.

[0122] Upon detecting the "final" text, the backend constructs an SSE request, submitting complete context information including the context, business status, and user profile to the LLM service. The SSE connection remains open, and the LLM returns tokens sequentially.
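On the receiving side, the SSE stream arrives as text lines in which `data:` fields carry the payload and a blank line ends each event. The parser below is a simplified sketch that handles only `data:` fields and comment lines; how the LLM service encodes a token inside each payload is not stated in the text, so the payload is yielded as an opaque string.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of decoded lines."""
    data_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if name == "data":
            data_parts.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

Because this is a generator, the engine can process each token as soon as its event is complete while the connection stays open.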

[0123] Step S4: Dual-output synchronization mechanism ("one in, two out" flow control).

[0124] For each received token, the following two parallel operations are performed:

[0125] Text output path: Append the token to the current response text, push it to the front end via WebSocket, and update the UI display;

[0126] Audio output path: The token is sent to the TTS scheduler to generate the corresponding audio block (approximately 200-300ms in length), which is also pushed to the front end for playback via WebSocket in binary frame form.

[0127] The "one-in-two-out" flow control mechanism takes as input the LLM token stream obtained by SSE and outputs two paths: one pushes text via WebSocket and the other drives TTS to generate and push audio, ultimately achieving near-synchronous output of text and speech, significantly improving the naturalness of the audio.
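The "one in, two out" flow can be sketched with `asyncio`: each token from the LLM stream is placed on two queues, one consumed by a text path and one by an audio path. The callbacks `send_text`, `synthesize`, and `send_audio` are hypothetical stand-ins for the WebSocket push and the TTS scheduler, and pushing the accumulated response text (rather than only the new token) is one reading of the text path described above.

```python
import asyncio


async def fan_out(tokens, send_text, synthesize, send_audio):
    """Fan each LLM token out to a text path and an audio path."""
    text_q = asyncio.Queue()
    audio_q = asyncio.Queue()

    async def text_path():
        response = ""
        while (tok := await text_q.get()) is not None:
            response += tok            # append the token to the response text
            await send_text(response)  # push to the front end

    async def audio_path():
        while (tok := await audio_q.get()) is not None:
            await send_audio(await synthesize(tok))  # TTS, then push audio

    workers = [asyncio.create_task(text_path()),
               asyncio.create_task(audio_path())]
    async for tok in tokens:
        await text_q.put(tok)
        await audio_q.put(tok)
    # None marks the end of the stream for both paths.
    await text_q.put(None)
    await audio_q.put(None)
    await asyncio.gather(*workers)
```

Running the two paths as separate tasks means slow synthesis does not hold back text display, which is the point of splitting the output.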

[0128] Step S5: Front-end multimodal alignment.

[0129] The front end maintains two buffer queues:

[0130] Text buffer: sorted by seq, with missing frames filled in;

[0131] Audio buffer: sorted by timestamp, with silent frames inserted to fill gaps;

[0132] A timestamp-based alignment algorithm is used, with a maximum tolerance error of ±50ms. If the threshold is exceeded, a compensation mechanism is activated (playing a short segment of audio earlier / later).
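The alignment rule can be sketched as follows, assuming that both timestamps are scheduled presentation times in milliseconds and that audio blocks have a fixed duration (the description gives roughly 200-300 ms, so a fixed value is a simplification). The function names and the SILENCE placeholder are hypothetical.

```python
TOLERANCE_MS = 50   # maximum tolerated text/audio skew from the description
SILENCE = b""       # placeholder payload standing in for a silent frame


def playback_adjustment_ms(text_ts_ms: int, audio_ts_ms: int,
                           tolerance_ms: int = TOLERANCE_MS) -> int:
    """Return how many ms to shift audio playback; 0 means within tolerance.

    A negative value means "play the audio earlier", a positive value "later".
    """
    skew = audio_ts_ms - text_ts_ms
    if abs(skew) <= tolerance_ms:
        return 0
    # Compensate only the part of the skew that exceeds the tolerance.
    return -(skew - tolerance_ms) if skew > 0 else -(skew + tolerance_ms)


def fill_gaps_with_silence(blocks, block_ms: int):
    """Sort (timestamp_ms, payload) blocks and insert silent frames into gaps."""
    out = []
    for ts, payload in sorted(blocks, key=lambda b: b[0]):
        if out:
            expected = out[-1][0] + block_ms
            while expected < ts:
                out.append((expected, SILENCE))
                expected += block_ms
        out.append((ts, payload))
    return out
```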

[0133] The alignment buffers are combined with a backpressure control mechanism. Each buffer has an upper limit (e.g., a maximum of 10 text entries and a maximum of 3 audio blocks). When the front-end consumption rate is lower than the production rate, the back end actively pauses SSE reading or reduces the TTS generation rate. The client load status is monitored through WebSocket ping/pong heartbeats, and the flow rate is adjusted dynamically.
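One common way to obtain this kind of backpressure is a bounded queue: when the queue is full, the producer's put call suspends, which in effect pauses upstream reading until the consumer catches up. The sketch below uses asyncio.Queue with the example limits from the description; the function run_pipeline and the simulated slow consumer are assumptions, and heartbeat-based load monitoring is not modeled.

```python
import asyncio

MAX_TEXT_ITEMS = 10    # example upper limits given in the description
MAX_AUDIO_BLOCKS = 3


async def run_pipeline(n_items: int, maxsize: int):
    """Run a fast producer against a slow consumer through a bounded queue."""
    queue = asyncio.Queue(maxsize=maxsize)
    consumed, depths = [], []

    async def produce():
        for i in range(n_items):
            await queue.put(i)            # suspends while the queue is full
            depths.append(queue.qsize())  # record the depth after each put
        await queue.put(None)             # end-of-stream marker (assumed)

    async def consume():
        while (item := await queue.get()) is not None:
            await asyncio.sleep(0.001)    # simulate slow front-end playback
            consumed.append(item)

    await asyncio.gather(produce(), consume())
    return consumed, max(depths)
```

Because the queue never holds more than maxsize items, the producer is throttled to the consumer's pace without any explicit rate calculation.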

[0134] In one embodiment of this application, the step of processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice includes: detecting the voice energy of the voice data; if the voice energy exceeds a preset dynamic threshold and the duration exceeds a specified duration, determining it as potential user speech and triggering an interruption candidate event; invoking a semantic interruption decision module to send a request to the agent; and determining whether to interrupt the current voice playback based on the output of the semantic interruption decision node. A context memory buffer together with an interruption recovery engine is used to preserve the semantics of interrupted statements and to support seamless recovery in multi-turn dialogues.

[0135] To address the problem of false interruptions or delayed responses caused by the "one-size-fits-all" interruption strategy of traditional speech systems, a dual-mode dynamic coupling interruption mechanism that integrates physical layer speech detection and semantic layer understanding judgment is proposed, thereby achieving more intelligent and humanized interruption control.

[0136] The technical implementation process is as follows:

[0137] Step D1: Voice energy detection triggers candidate events.

[0138] During TTS playback, the front end continuously monitors microphone input and calculates the audio energy (RMS value) within every 100ms window. When the energy exceeds a preset dynamic threshold and the duration exceeds a specified duration, it is determined to be a potential user speaking event, triggering an "interruption candidate event." This event is reported to the back end via WebSocket, carrying an energy curve segment and a timestamp.
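Step D1 can be sketched as an RMS computation over consecutive 100 ms windows, reporting a candidate once enough consecutive windows exceed the threshold. The threshold value, the minimum duration, and the function names are assumptions; the application describes the threshold as dynamic but does not say how it is derived, so it is passed in as a parameter here.

```python
import math

WINDOW_MS = 100  # window length from the description


def rms(samples) -> float:
    """Root-mean-square energy of a sequence of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_candidate(samples, sample_rate: int, threshold: float,
                     min_duration_ms: int):
    """Return the start time (ms) of the first qualifying loud run, else None.

    A run qualifies when consecutive windows with RMS above the threshold
    cover at least min_duration_ms.
    """
    window = sample_rate * WINDOW_MS // 1000
    needed = math.ceil(min_duration_ms / WINDOW_MS)
    run = 0
    for i in range(0, len(samples) - window + 1, window):
        if rms(samples[i:i + window]) > threshold:
            run += 1
            if run >= needed:
                start_window = i // window - (run - 1)
                return start_window * WINDOW_MS
        else:
            run = 0  # the loud run was broken; start counting again
    return None
```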

[0139] Step D2: The semantic interruption decision module intervenes in the analysis.

[0140] Upon receiving the interruption request, the backend immediately queries the content of the currently broadcast statement and its interruption attribute: it calls the semantic interruption decision module, sends a request to the Agent, and the semantic interruption decision node determines whether the current system broadcast should be interrupted.

[0141] The backend receives a structured response:

[0142]-[0145] JSON:

{
  "interruptible": false,
  "safe_point_offset": 1200
}

[0146] The above methods enable a leap from "passive response" to "proactive understanding"; avoid legal disputes or customer complaints caused by arbitrary interruptions during the transmission of key information; and achieve the personalized service goal of "understanding you better the more you use it" through dynamic strategy evolution.

[0147] In one embodiment of this application, the step of processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice includes: executing a differentiated interruption strategy based on the semantic judgment result; if it is an interruptible strategy, immediately terminating the TTS playback process and clearing the unfinished statements in the context memory buffer; if it is an uninterruptible strategy, recording the interruption request and starting a timer, setting an interruption checkpoint at a preset number of milliseconds at the end of the current statement.

[0148] Continuing with the above steps, step D3: interrupt policy execution branch.

[0149] Execute a differentiated interruption strategy based on the semantic judgment result:

[0150] Scenario A: Interruptible (interruptible = true)

[0151] Immediately terminate the TTS playback process; clear any unfinished statements in the context memory buffer; reset the business state machine to the "waiting for user input" state; send the control command {"command": "stop_tts"} to the front end.

[0152] Scenario B: Uninterruptible (interruptible = false)

[0153] Record the interruption request as pending; start a timer and set an interruption checkpoint at a preset number of milliseconds (safe_point_offset) relative to the end of the current statement; upon reaching the safe point, perform the interruption; if the user stops speaking during this period, cancel the interruption request.
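The two branches of step D3 can be sketched as below. This is an illustration under stated assumptions: the Session class, the state names, and the tick-based timer are hypothetical, and the checkpoint is computed by adding safe_point_offset to the statement end time, which is only one possible reading of the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """Hypothetical per-dialogue state held by the backend."""
    state: str = "speaking"
    unfinished: list = field(default_factory=list)
    pending_interrupt_at_ms: Optional[int] = None
    outbox: list = field(default_factory=list)  # commands for the front end


def stop_now(session: Session) -> None:
    """Scenario A actions: stop TTS, clear unfinished text, reset the state."""
    session.unfinished.clear()
    session.state = "waiting_for_user_input"
    session.outbox.append({"command": "stop_tts"})


def apply_decision(session: Session, decision: dict, statement_end_ms: int) -> None:
    """Branch on the structured response of the semantic decision node."""
    if decision["interruptible"]:
        stop_now(session)
    else:
        # Scenario B: keep the request pending until the safe point.
        # Assumption: the offset is measured from the statement end time.
        session.pending_interrupt_at_ms = (
            statement_end_ms + decision["safe_point_offset"])


def on_tick(session: Session, now_ms: int, user_still_speaking: bool) -> None:
    """Timer callback: cancel, wait, or execute the pending interruption."""
    if session.pending_interrupt_at_ms is None:
        return
    if not user_still_speaking:
        session.pending_interrupt_at_ms = None  # user stopped: cancel request
    elif now_ms >= session.pending_interrupt_at_ms:
        session.pending_interrupt_at_ms = None
        stop_now(session)
```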

[0154] In one embodiment of this application, the step of processing the speech data using a dynamic engine interruption mechanism to obtain optimized target speech includes: performing adaptive adjustments based on the current interaction result through a dynamic strategy engine; if the user frequently attempts to interrupt and is rejected each time, appropriately reducing the speech energy threshold or increasing the judgment tolerance of the LLM large language model; if the user forcibly interrupts at a high-risk node, causing misunderstanding of the interaction, increasing the interruption restriction weight for subsequent similar scenarios; responding to interruption events, updating the interruption preference label in the user profile, and recording all interruption events to a log, wherein the interruption event at least includes the trigger time, energy peak, sentence content, LLM judgment result, actual interruption timing, and user subsequent behavior fields.

[0155] Continue with the above steps, step D4: Adjusting dynamic strategy engine parameters.

[0156] The dynamic strategy engine adaptively adjusts based on the interaction results: if a user frequently attempts to interrupt but is rejected, the voice energy threshold is appropriately lowered or the LLM judgment tolerance is increased; if a user forcibly interrupts at a high-risk point, causing misunderstanding, interruption restrictions in subsequent similar scenarios are increased; the "interruption preference" tag in the user profile is updated to influence future strategy decisions. Based on user profiles, the dynamic strategy engine adjusts interruption thresholds, tolerance, and delay times in real time to achieve a personalized experience. A user profile and business state machine-linked strategy engine is introduced to achieve dynamic adaptation of interruption parameters, speech rate and tone, and dialogue style.

[0157] In addition, all interruption events are recorded in the log system, including fields such as: trigger time, energy peak, statement content, LLM judgment result, actual interruption time, and subsequent user behavior, providing a basis for subsequent data analysis and model training.
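A possible shape for the logged interruption event and for the adaptive adjustment of step D4 is sketched below. The field names follow the list in the description; the label strings, the high_risk flag, the rejection limit of 3, and the 10% adjustment step are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InterruptionEvent:
    trigger_time_ms: int
    energy_peak: float
    statement: str
    llm_decision: str                     # assumed labels: "allowed" / "rejected"
    actual_interrupt_time_ms: Optional[int]
    user_followup: str                    # assumed label, e.g. "misunderstanding"
    high_risk: bool = False               # hypothetical flag for high-risk nodes


@dataclass
class InterruptPolicy:
    energy_threshold: float = 0.10        # illustrative starting values
    restriction_weight: float = 1.0


def adapt(policy: InterruptPolicy, events: List[InterruptionEvent],
          reject_limit: int = 3, step: float = 0.1) -> InterruptPolicy:
    """Adjust the policy from recent events, following the two rules above."""
    rejected = sum(1 for e in events if e.llm_decision == "rejected")
    if rejected >= reject_limit:
        # Frequent rejected attempts: lower the voice energy threshold.
        policy.energy_threshold *= (1.0 - step)
    forced_and_harmful = any(
        e.high_risk
        and e.actual_interrupt_time_ms is not None
        and e.user_followup == "misunderstanding"
        for e in events
    )
    if forced_and_harmful:
        # Interruption at a high-risk node caused misunderstanding:
        # tighten restrictions for similar scenarios.
        policy.restriction_weight *= (1.0 + step)
    return policy
```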

[0158] In one embodiment of this application, the method further includes: maintaining a context memory buffer for accurately restoring context semantics after interruption recovery. The context memory buffer stores information including at least: the original ASR text and semantic parsing results of the most recent three rounds of dialogue; the syntax tree fragments and intent tags of the currently incomplete statements; the speech playback progress pointer at the time of interruption; and the audio block index mapping table in the TTS synthesis process.

[0159] A context memory buffer and recovery mechanism is established to ensure semantic coherence and state traceability in multi-turn dialogues, thereby improving professionalism and conversion rate. The context memory buffer uses an in-memory database for high-speed read and write access and stores the following key information: (1) the original ASR text and semantic parsing results of the last three (or more) rounds of dialogue; (2) the syntax tree fragments and intent tags of the currently incomplete statements; (3) the playback progress pointer (Timestamp + Offset) at the time of interruption; (4) the audio block index mapping table in the TTS synthesis process. This buffer ensures that the context semantics can be accurately restored when the dialogue resumes after an interruption, avoiding information loss or repeated playback.
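The four kinds of stored information can be sketched as a small data structure. This example uses a plain in-process object rather than an in-memory database, and it assumes that the audio block index mapping table maps each audio block index to the character offset where that block starts in the sentence; both choices, like the class and method names, are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ContextMemory:
    # (1) ASR text and parse results of the most recent three turns.
    turns: deque = field(default_factory=lambda: deque(maxlen=3))
    # (2) Syntax tree fragment and intent tag of the incomplete statement.
    incomplete: Optional[dict] = None
    # (3) Playback progress pointer: (timestamp_ms, offset_ms).
    playback_pointer: Optional[Tuple[int, int]] = None
    # (4) Audio block index -> start character offset (assumed mapping).
    block_to_char: Dict[int, int] = field(default_factory=dict)

    def add_turn(self, asr_text: str, parse: dict) -> None:
        """Store one turn; the oldest turn is dropped once three are held."""
        self.turns.append({"asr_text": asr_text, "parse": parse})

    def remaining_text(self, sentence: str, interrupted_block: int) -> str:
        """Text still to be spoken when playback stopped at the given block."""
        start = self.block_to_char.get(interrupted_block, 0)
        return sentence[start:]
```

With such a mapping, playback can resume from the interrupted block instead of repeating the whole sentence.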

[0160] In one embodiment of this application, the method further includes: defining multiple dialogue state nodes using a business state machine, and triggering corresponding actions based on state transition conditions, specifically including: automatically reducing interruption sensitivity when entering a sensitive dialogue node; matching a preset TTS speech rate template corresponding to the current state; and dynamically loading corresponding prompt word templates for the LLM large language model to generate response content.

[0161] This application also provides a business state machine, modeled based on a finite state machine (FSM), defining multiple dialogue state nodes (such as "Welcome," "Authentication," "Quotation Explanation," "Contract Confirmation," etc.), triggering corresponding actions through state transition conditions:

[0162] When entering sensitive nodes (such as payment or calling payment interfaces), the interruption sensitivity is automatically reduced and the safe point window is extended; the preset TTS speech rate template is matched according to the current state (such as slowing down the speech rate by 10% during the verification stage); the corresponding prompt template is dynamically loaded for the LLM to use in generating responses.
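A finite state machine of this kind can be sketched with a state table and a transition table. The state names follow the examples in the description; the event names, the prompt template identifiers, the choice of which state counts as sensitive, and the 0.5 sensitivity factor are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateConfig:
    sensitive: bool        # whether interruption sensitivity is reduced
    speech_rate: float     # 1.0 = normal TTS speed
    prompt_template: str   # identifier of the prompt template to load


STATES = {
    "welcome": StateConfig(False, 1.0, "greet"),
    # Speech rate slowed by 10% during verification, as in the description.
    "authentication": StateConfig(False, 0.9, "verify_identity"),
    "quotation_explanation": StateConfig(False, 1.0, "explain_quote"),
    # Assumption: contract confirmation is treated as a sensitive node.
    "contract_confirmation": StateConfig(True, 0.9, "confirm_contract"),
}

TRANSITIONS = {
    ("welcome", "user_greeted"): "authentication",
    ("authentication", "identity_verified"): "quotation_explanation",
    ("quotation_explanation", "quote_accepted"): "contract_confirmation",
}

SENSITIVE_FACTOR = 0.5  # hypothetical reduction applied in sensitive states


class DialogueFSM:
    def __init__(self, state: str = "welcome", base_sensitivity: float = 1.0):
        self.state = state
        self.base_sensitivity = base_sensitivity

    @property
    def config(self) -> StateConfig:
        return STATES[self.state]

    @property
    def interruption_sensitivity(self) -> float:
        factor = SENSITIVE_FACTOR if self.config.sensitive else 1.0
        return self.base_sensitivity * factor

    def fire(self, event: str) -> StateConfig:
        """Apply a transition and return the configuration of the new state."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {self.state!r} on {event!r}")
        self.state = TRANSITIONS[key]
        return self.config
```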

[0163] Atomic operation mechanism of the streaming control bus: when an interruption occurs, the ASR buffer is cleared, the SSE stream is terminated, and the state machine is reset together as a single atomic operation to ensure system consistency.
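One way to approximate this atomic behavior in a single-process asyncio backend is to perform the three actions under one lock, so that no other coroutine observes a half-reset system. The class StreamControlBus, its attributes, and the state names are hypothetical, and a distributed deployment would need a different mechanism.

```python
import asyncio


class StreamControlBus:
    """Hypothetical bus that applies interruption side effects as one unit."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.asr_buffer = []
        self.sse_task = None          # task that reads the SSE token stream
        self.state = "speaking"

    async def interrupt(self) -> None:
        async with self._lock:
            # 1. Clear buffered ASR audio.
            self.asr_buffer.clear()
            # 2. Terminate the SSE stream by cancelling its reader task.
            if self.sse_task is not None and not self.sse_task.done():
                self.sse_task.cancel()
                try:
                    await self.sse_task
                except asyncio.CancelledError:
                    pass  # expected: the reader was cancelled on purpose
            self.sse_task = None
            # 3. Reset the state machine.
            self.state = "waiting_for_user_input"


async def demo():
    bus = StreamControlBus()
    bus.asr_buffer.extend([b"chunk1", b"chunk2"])
    bus.sse_task = asyncio.create_task(asyncio.sleep(3600))  # stand-in reader
    await asyncio.sleep(0)  # let the stand-in task start
    task = bus.sse_task
    await bus.interrupt()
    return bus, task.cancelled()
```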

[0164] This application also provides an intelligent dialogue device 300. Figure 3 is a schematic diagram of the structure of the intelligent dialogue device in an embodiment of this application. As shown in Figure 3, the intelligent dialogue device 300 includes at least an output module 310 and an interruption module 320, wherein:

[0165] In one embodiment of this application, the output module 310 is specifically used to: respond to user voice input and output two synchronous results respectively, one being text data generated by the LLM large language model, and the other being voice data obtained by speech recognition and speech synthesis.

[0166] After the front end transfers the user's voice input to the real-time voice dialogue engine, it outputs two synchronous results. The text data generated by the LLM large language model and the voice data obtained by speech recognition and speech synthesis are returned to the real-time voice dialogue engine.

[0167] In one embodiment of this application, the interruption module 320 is specifically used to: process the speech data using a dynamic engine interruption mechanism to obtain optimized target speech.

[0168] By using a dynamic engine interruption mechanism to process the speech data, an optimized target speech can be obtained, which is then returned along with the text data.

[0169] In one embodiment of this application, the output module 310 is further configured to:

[0170] The WebSocket protocol is used to establish WebSocket connections with the TTS service and ASR service respectively to respond to the user's voice input;

[0171] An HTTP streaming request is initiated to the LLM Large Language Model service based on the SSE streaming protocol to obtain the response content corresponding to the user's voice input.

[0172] In one embodiment of this application, the interruption module 320 is further configured to:

[0173] The voice energy of the voice data is detected. If the voice energy exceeds a preset dynamic threshold and the duration exceeds a specified duration, it is determined to be a potential user speaking and an interruption candidate event is triggered.

[0174] The semantic interruption decision module sends a request to the agent and determines whether to interrupt the current audio playback based on the output of the semantic interruption decision node.

[0175] In one embodiment of this application, the interruption module 320 is further configured to:

[0176] Execute differentiated interruption strategies based on semantic judgment results;

[0177] If it is an interruptible policy, immediately terminate the TTS playback process and clear any unfinished statements in the context memory buffer;

[0178] If the policy is non-interruptible, record the interruption request and start a timer, setting an interrupt checkpoint at the end of the current statement with a preset number of milliseconds.

[0179] In one embodiment of this application, the interruption module 320 is further configured to:

[0180] Based on the results of this interaction, adaptive adjustments are executed through the dynamic strategy engine.

[0181] If a user frequently attempts to interrupt and is rejected each time, the speech energy threshold should be appropriately lowered or the judgment tolerance of the LLM large language model should be increased.

[0182] If a user forcibly interrupts the interaction at a high-risk point, causing misunderstanding, the weight of interruption restrictions in subsequent similar scenarios will be increased.

[0183] In response to interruption events, update the interruption preference tags in the user profile and record all interruption events to the log. The interruption events include at least the trigger time, energy peak, statement content, LLM judgment result, actual interruption time, and subsequent user behavior fields.

[0184] In one embodiment of this application, a context memory buffer is also included for storing:

[0185] The original ASR text and semantic parsing results of the most recent three rounds of dialogue;

[0186] The syntax tree fragment and intent label of the currently incomplete statement;

[0187] The audio playback progress indicator at the moment of interruption;

[0188] Audio block index mapping table in the TTS synthesis process.

[0189] In one embodiment of this application, a business state machine is also included for:

[0190] Multiple dialogue state nodes are defined using a business state machine, and corresponding actions are triggered based on state transition conditions, specifically including:

[0191] When entering a sensitive conversation node, automatically reduce the interruption sensitivity;

[0192] Match the preset TTS speech rate template corresponding to the current state;

[0193] Dynamically load corresponding prompt word templates for the LLM large language model to generate response content.

[0194] It is understood that the above-mentioned intelligent dialogue device can implement each step of the intelligent dialogue method provided in the foregoing embodiments. The relevant explanations of the intelligent dialogue method are applicable to the intelligent dialogue device and will not be repeated here.

[0195] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of this application. Referring to Figure 4, at the hardware level, the electronic device includes a processor, and optionally also includes an internal bus, a network interface, and memory. The memory may include main memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk drive. Of course, the electronic device may also include other hardware required for other business operations.

[0196] The processor, network interface, and memory can be interconnected via an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. This bus can be divided into address bus, data bus, control bus, etc. For ease of representation, the bus is shown in Figure 4 as a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.

[0197] Memory is used to store programs. Specifically, programs may include program code, which includes computer operation instructions. Memory may include main memory and non-volatile memory, and provides instructions and data to the processor.

[0198] The processor reads the corresponding computer program from non-volatile memory into main memory and then runs it, forming an intelligent dialogue device at the logical level. The processor executes the program stored in memory and specifically performs the following operations:

[0199] In response to user voice input, two synchronous results are output: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis.

[0200] The voice data is processed using a dynamic engine interruption mechanism to obtain optimized target voice.

[0201] The method executed by the intelligent dialogue device disclosed in the embodiment shown in Figure 1 of this application can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by instructions in software form. The processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method.

[0202] The electronic device can also perform the method executed by the intelligent dialogue device in Figure 1 and implement the functions of the intelligent dialogue device in the embodiment shown in Figure 1, which are not described in detail here.

[0203] This application also proposes a computer-readable storage medium that stores one or more programs, the programs including instructions that, when executed by an electronic device including multiple applications, enable the electronic device to perform the method executed by the intelligent dialogue device in the embodiment shown in Figure 1, and specifically to perform:

[0204] In response to user voice input, two synchronous results are output: one is text data generated by the LLM large language model, and the other is voice data obtained by speech recognition and speech synthesis.

[0205] The voice data is processed using a dynamic engine interruption mechanism to obtain optimized target voice.

[0206] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0207] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0208] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0209] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0210] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0211] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0212] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0213] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0214] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0215] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. An intelligent dialogue method, characterized in that the method includes: in response to user voice input, outputting two synchronous results, one being text data generated by the LLM large language model and the other being voice data obtained by speech recognition and speech synthesis; and processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice.

2. The intelligent dialogue method as described in claim 1, characterized in that outputting two synchronous results in response to user voice input includes: establishing WebSocket connections with the TTS service and the ASR service respectively, using the WebSocket protocol, to respond to the user's voice input; and initiating an HTTP streaming request to the LLM large language model service based on the SSE streaming protocol to obtain the response content corresponding to the user's voice input.

3. The intelligent dialogue method as described in claim 1, characterized in that the step of processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice includes: detecting the voice energy of the voice data; if the voice energy exceeds a preset dynamic threshold and the duration exceeds a specified duration, determining it as potential user speech and triggering an interruption candidate event; invoking a semantic interruption decision module to send a request to the agent; and determining whether to interrupt the current voice playback based on the output of the semantic interruption decision node.

4. The intelligent dialogue method as described in claim 1, characterized in that the step of processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice includes: executing a differentiated interruption strategy based on the semantic judgment result; if it is an interruptible strategy, immediately terminating the TTS playback process and clearing the unfinished statements in the context memory buffer; if it is an uninterruptible strategy, recording the interruption request and starting a timer, setting an interruption checkpoint at a preset number of milliseconds at the end of the current statement.

5. The intelligent dialogue method as described in claim 1, characterized in that the step of processing the voice data using a dynamic engine interruption mechanism to obtain optimized target voice includes: performing adaptive adjustments based on the current interaction result through a dynamic strategy engine; if the user frequently attempts to interrupt and is rejected each time, appropriately reducing the voice energy threshold or increasing the judgment tolerance of the LLM large language model; if the user forcibly interrupts at a high-risk node, causing misunderstanding of the interaction, increasing the interruption restriction weight for subsequent similar scenarios; and, in response to interruption events, updating the interruption preference tag in the user profile and recording all interruption events to a log, wherein each interruption event includes at least the trigger time, energy peak, statement content, LLM judgment result, actual interruption time, and subsequent user behavior fields.

6. The intelligent dialogue method as described in claim 1, characterized in that the method further includes: maintaining a context memory buffer for accurately restoring context semantics after interruption recovery, wherein the context memory buffer stores information including at least: the original ASR text and semantic parsing results of the most recent three rounds of dialogue; the syntax tree fragments and intent tags of the currently incomplete statements; the voice playback progress pointer at the time of interruption; and the audio block index mapping table in the TTS synthesis process.

7. The intelligent dialogue method as described in claim 1, characterized in that the method further includes: defining multiple dialogue state nodes using a business state machine, and triggering corresponding actions based on state transition conditions, specifically including: automatically reducing interruption sensitivity when entering a sensitive dialogue node; matching a preset TTS speech rate template corresponding to the current state; and dynamically loading corresponding prompt word templates for the LLM large language model to generate response content.

8. An intelligent dialogue device, characterized in that the device includes: an output module, used to respond to user voice input and output two synchronous results, one being text data generated by the LLM large language model and the other being voice data obtained by speech recognition and speech synthesis; and an interruption module, used to process the voice data using a dynamic engine interruption mechanism to obtain optimized target voice.

9. An electronic device, comprising: a processor; and a memory configured to store computer-executable instructions, which, when executed, cause the processor to perform the method of any one of claims 1 to 7.

10. A computer-readable storage medium storing one or more programs, which, when executed by an electronic device including a plurality of applications, cause the electronic device to perform the method of any one of claims 1 to 7.