An identity perception and full-duplex voice interaction control method and system based on an audio multi-modal large model, a storage medium, and an electronic device
By processing speech jointly within a single audio multimodal large model, the invention addresses multi-model fragmentation and latency in voice interaction systems, achieving efficient and natural voice interaction control and improving recognition accuracy and response speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 杭州市余杭区海创人形机器人产业创新中心 (Hangzhou Yuhang District Haichuang Humanoid Robot Industry Innovation Center)
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing voice interaction systems suffer from problems such as fragmented multi-model architecture, wasted computing power and response latency, poor continuous interaction experience, and slow inference due to long sequence inputs in end-to-end solutions.
An identity perception and full-duplex voice interaction control method based on an audio multimodal large model is adopted. By combining semantic encoder, voiceprint encoder, linear projection layer and large language model, joint attention calculation and autoregressive generation of output sequence are realized, the interaction state is dynamically switched, and a wake word customization and real-time voiceprint anchoring mechanism are introduced.
It improves the system's noise immunity and robustness, reduces inference latency, achieves a natural and smooth continuous interactive experience, and reduces system maintenance costs.
Figure CN122245327A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and human-computer interaction technology, specifically relating to a method, system, storage medium and electronic device for identity perception and full-duplex voice interaction control based on an audio multimodal large model.
Background Technology
[0002] With the rapid development of artificial intelligence technology, voice interaction has become one of the mainstream interaction methods for smart devices. Existing voice interaction systems typically adopt a cascaded processing architecture, decomposing tasks such as wake word detection (KWS), voice activity detection (VAD), speaker verification (SV), automatic speech recognition (ASR), and natural language processing (NLP) into multiple independent modules, which are then processed sequentially.
[0003] Specifically, a typical cascaded processing flow includes: First, a lightweight KWS model running on a low-power chip continuously monitors the audio stream and detects preset wake words; upon detection of the wake word, the VAD module begins detecting the user's voice activity; subsequently, the system extracts the voiceprint features (such as X-vector) of the speech segment and performs authentication by calculating the cosine similarity with pre-registered voiceprints; after successful authentication, the ASR engine transcribes the speech into text; finally, the NLP model performs intent understanding and semantic analysis on the text to generate corresponding responses. Regarding sentence segmentation logic, traditional systems primarily rely on VAD for physical energy detection, typically setting a relatively long silence threshold (e.g., greater than 500 milliseconds) to prevent missegmentation; some advanced solutions attempt to introduce text classification models to assist in judgment, but this is usually done after the ASR generates the complete text, making streaming processing impossible.
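The cosine-similarity check used by the cascaded voiceprint stage can be sketched as follows. This is a minimal illustration, not the prior-art implementation: the function names and the 0.6 threshold are assumptions chosen for the example, and real systems compare learned embeddings such as X-vectors rather than toy vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as "no match"
    return dot / (norm_a * norm_b)

def verify_speaker(query_vec, enrolled_vec, threshold=0.6):
    """Hard accept/reject decision; the threshold value is an assumption."""
    return cosine_similarity(query_vec, enrolled_vec) >= threshold
```

A hard threshold of this kind is what makes the cascaded stage reject borderline scores outright, since no semantic context is available to it.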
[0004] However, the above-mentioned cascaded processing architecture has the following technical drawbacks in practical applications:
[0005] First, the multiple models are fragmented and lack joint decision-making capabilities. Modules such as KWS, voiceprint, ASR, and NLP operate independently, and each module can only make decisions based on its own input, failing to achieve complementary multimodal information. For example, when the voiceprint similarity is at a boundary value (such as 0.6) and the environment is noisy, traditional voiceprint models will directly reject the user; while large models can combine the semantic information that "the user accurately uttered a code known only to the device owner" to assist in identity verification. However, the existing architecture cannot utilize this kind of complementary multimodal information.
[0006] Second, there is wasted computing power and response latency. Traditional systems often require multiple processing stages before speech from non-wake words or non-target speakers can be recognized and intercepted, resulting in unnecessary consumption of computing resources. Furthermore, to ensure semantic integrity, the system must wait for a relatively long VAD silence period (e.g., 500–700ms), leading to interaction delays and impacting user experience.
[0007] Third, the continuous interaction experience is poor. In continuous dialogue scenarios, users usually do not repeat the wake word. However, existing systems struggle to accurately distinguish between "the user speaking to the device" and "background human voices / TV sounds" without using a wake word, which can easily lead to false triggers or rejection, making it impossible to achieve natural and smooth continuous identity-aware interaction.
[0008] Fourth, long input sequences lead to slow inference. Some solutions that attempt to use end-to-end models try to incorporate voiceprint information as a long input sequence, resulting in a surge in the number of input tokens and inference latency that cannot meet the needs of real-time interaction.
Summary of the Invention
[0009] The purpose of this invention is to solve the aforementioned technical problems in the prior art and to provide an identity perception and full-duplex voice interaction control method based on an audio multimodal large model. This method addresses the technical problems in the existing cascaded processing architecture, such as insufficient joint decision-making ability due to multi-model fragmentation, wasted computing resources and response delays, poor continuous interaction experience, and slow inference due to long sequence inputs in end-to-end solutions.
[0010] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0011] A method for identity perception and full-duplex voice interaction control based on an audio multimodal large model includes the following steps:
[0012] Step 1: Obtain the reference audio and the currently queried audio, and perform signal preprocessing;
[0013] Step 2: Input the reference audio and the current query audio into the multimodal large model to extract semantic feature sequences and voiceprint feature vectors;
[0014] Step 3: Use the multimodal large model to perform joint attention calculation and autoregressively generate the output sequence;
[0015] Step 4: Dynamically switch the interaction state according to the output sequence and execute the corresponding interaction control logic.
[0016] Furthermore, the multimodal large model includes:
[0017] A semantic encoder encodes the current query audio to generate a sequence of semantic features.
[0018] The voiceprint encoder encodes the reference audio and the current query audio separately to generate the reference voiceprint feature vector and the query voiceprint feature vector.
[0019] The linear projection layer projects the reference voiceprint feature vector and the query voiceprint feature vector onto the feature space of the large language model to obtain the corresponding voiceprint embedding vector.
[0020] An adapter is used to compress semantic feature sequences.
[0021] A large language model takes the voiceprint embedding vectors, concatenated with the compressed semantic feature sequence, as its input.
[0022] Furthermore, in step three, the generated output sequence includes, in a preset logical order:
[0023] The first control identifier is used to indicate the wake word detection status;
[0024] The second control identifier is used to indicate the voiceprint authentication status;
[0025] Speech-to-text transcription is used to reconstruct the user's speech content;
[0026] Interaction status labels are used to indicate the semantic integrity of the current statement and the state of the dialogue flow.
[0027] Furthermore, the wake word customization includes:
[0028] Receive new wake word commands set by the user via natural language;
[0029] Update the corresponding wake word field in the system prompts of the multimodal large model according to the new wake word instruction.
[0030] Furthermore, the interaction status label includes at least one of the following:
[0031] Semantic completeness tags indicate that the currently queried audio expresses a complete semantic intent;
[0032] The semantically incomplete tag indicates that the semantics of the currently queried audio have not yet ended;
[0033] The "confirmation" tag indicates that the current audio query is a confirmatory or affirmative word.
[0034] The "wait" or "pause" label indicates that the user wishes to pause the conversation.
[0035] Furthermore, in step four, the dynamic switching of interaction states includes standby state and active state;
[0036] In the standby state, a first system prompt is constructed and input into the multimodal large model. The first system prompt contains instructions to perform wake word detection and voiceprint verification.
[0037] In the active state, a second system prompt is constructed and input into the multimodal large model. The second system prompt contains instructions to ignore wake word detection and continue voiceprint verification.
[0038] The trigger condition for switching from standby mode to active mode is: the generated first control identifier indicates that a wake word has been detected;
[0039] The trigger conditions for switching from active state back to standby state are: the generated interaction status label indicates a wait or pause label, or the active countdown times out. The active countdown is reset after each successful interaction.
[0040] Furthermore, in step four, the interactive control logic includes:
[0041] When generating the first control flag of the output sequence, if the first control flag indicates that no wake-up word has been detected and the current system is in standby mode, the inference process is immediately interrupted.
[0042] When generating the second control flag of the output sequence, if the second control flag indicates a non-target speaker, the inference process is immediately interrupted and the current input is determined to be interference speech.
[0043] If the first control flag indicates that a wake word has been detected, and the second control flag indicates that the target person is speaking, then the speech-to-text and interactive status labels will continue to be generated.
[0044] Furthermore, it also includes an instant voiceprint anchoring step:
[0045] When in standby mode and with no reference audio, the model relies solely on wake word detection.
[0046] When in standby mode, if the first control flag of the output sequence is generated and the first control flag indicates that a wake-up word has been detected, the voiceprint feature vector of the current query audio is extracted and cached as a temporary reference voiceprint vector for the current session.
[0047] When switching to the active state, the current query audio input is compared with the temporary reference voiceprint vector to verify the speaker's identity.
[0048] A system for implementing the above-mentioned identity perception and full-duplex voice interaction control method based on an audio multimodal large model includes an audio acquisition module, a multimodal large model inference module, a state control module, and an output parsing and execution module.
[0049] The audio acquisition module is used to acquire reference audio and the currently queried audio and perform signal processing; the multimodal large model inference module generates the output sequence autoregressively through joint attention calculation; the state control module is used to dynamically switch the interaction state according to the output sequence and execute the corresponding interaction control logic; the output parsing and execution module is used to parse the model output and execute the corresponding control logic or business instructions.
[0050] Furthermore, the multimodal large-scale model inference module incorporates a semantic encoder, a voiceprint encoder, a linear projection layer, an adapter, and a large language model.
[0051] The semantic encoder encodes the current query audio to generate a sequence of semantic features;
[0052] The voiceprint encoder encodes the reference audio and the current query audio separately to generate a reference voiceprint feature vector and a query voiceprint feature vector;
[0053] The linear projection layer projects the reference voiceprint feature vector and the query voiceprint feature vector onto the feature space of the large language model to obtain the corresponding voiceprint embedding vector.
[0054] The adapter compresses the semantic feature sequences;
[0055] The large language model takes the voiceprint embedding vector and the compressed semantic feature sequence as input and generates the output sequence.
[0056] A computer-readable storage medium storing a computer program, which, when executed by a processor, implements the aforementioned method for identity perception and full-duplex voice interaction control based on an audio multimodal large model.
[0057] An electronic device includes one or more processors and a memory, wherein a computer program is stored in the memory, and when the computer program is executed by the one or more processors, the electronic device implements the above-described identity perception and full-duplex voice interaction control method based on an audio multimodal large model.
[0058] The present invention, by adopting the above-described technical solution, has the following beneficial effects:
[0059] This invention significantly improves noise resistance and robustness. Utilizing the cross-attention mechanism of a large language model, it enables the model to dynamically "focus" on target voiceprint features amidst noisy background noise, achieving selective auditory perception similar to the "cocktail party effect." Compared to traditional solutions where independent voiceprint models rely solely on cosine similarity for comparison, this invention deeply integrates semantic context (such as the user uttering specific content known only to the device owner) to assist in identity verification, significantly improving recognition accuracy and system robustness in complex acoustic environments.
[0060] This invention boasts exceptional inference efficiency. Input-side optimization involves compressing and projecting variable-length voiceprint feature sequences into fixed-length voiceprint embedding vectors, avoiding the use of long input sequences as model inputs. This makes the inference speed almost equivalent to a pure ASR model, effectively solving the inference latency problem caused by excessively long inputs in end-to-end models. Output-side optimization implements an "early termination" mechanism by forcing key classification labels such as wake-word detection and voiceprint verification to the beginning of the generated sequence. During inference, once a rejection label (such as "wake-word not detected" or "non-target speaker") is generated, the system can immediately interrupt the subsequent text generation process, resulting in a significant reduction in the processing time for invalid speech, greatly conserving computational resources.
[0061] This invention provides a natural and smooth continuous interactive experience. By introducing a state machine switching mechanism based on dynamic system prompts, it achieves a natural, anthropomorphic interaction mode of "waking up by calling the name first, and then only speaking afterwards." In continuous dialogue, the system can ignore the wake word and rely solely on voiceprint anchoring to continuously verify the speaker's identity, accurately distinguishing between the owner's commands and background voice interference, providing voiceprint-level security protection throughout, and effectively improving the fluency and accuracy of continuous dialogue.
[0062] This invention features a unified architecture and low maintenance costs. It integrates the core tasks of wake word detection, voiceprint verification, automatic speech recognition, and interaction state determination (sentence segmentation) into a single end-to-end multimodal large model. This unified architecture not only simplifies system deployment but also allows wake words and interaction strategies to be changed in real time simply by modifying system prompts, eliminating the need for complex firmware upgrades or cloud training and thus significantly reducing product development and maintenance costs.
Attached Figure Description
[0063] The present invention will be further described below with reference to the accompanying drawings:
[0064] Figure 1 is a flowchart of the identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to the present invention;
[0065] Figure 2 is a schematic diagram of the dynamic switching logic of the interaction state in the present invention.
Detailed Implementation
[0066] Example 1: A Method for Identity Perception and Full-Duplex Voice Interaction Control Based on Audio Multimodal Large Model
[0067] As shown in Figure 1 and Figure 2, this invention provides a method for identity perception and full-duplex voice interaction control based on an audio multimodal large model, which includes the following steps:
[0068] Step 1: Obtain the reference audio and the currently queried audio, and perform signal preprocessing;
[0069] First, the system acquires ambient sound in real time via an audio input device (such as a microphone). The system maintains a circular audio buffer and uses an energy threshold or a lightweight VAD model (such as SileroVAD) for speech activity detection.
[0070] When the duration of continuous silence reaches the preset silence duration threshold, the current speech segment is considered to have ended, and the segment is extracted from the buffer as the current query audio. To compensate for the potential loss of the first syllable due to VAD trigger delay, the system prepends, during extraction, the historical audio frames covering a preset forward backtracking duration before the VAD trigger point. The silence duration threshold is preferably 100 ms to 200 ms, and the forward backtracking duration is preferably 200 ms.
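The buffering, silence-threshold, and forward-backtracking behavior described in Step 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the class name, the 20 ms frame size, and the mean-square energy threshold are hypothetical, and a simple energy detector stands in for a lightweight VAD model such as SileroVAD.

```python
from collections import deque

class EnergySegmenter:
    """Cuts a frame stream into query segments using an energy-based detector."""

    def __init__(self, frame_ms=20, silence_ms=200, preroll_ms=200,
                 energy_threshold=0.01):
        self.silence_frames = silence_ms // frame_ms      # frames of silence that end a segment
        self.preroll = deque(maxlen=preroll_ms // frame_ms)  # history kept for backtracking
        self.energy_threshold = energy_threshold
        self.segment = []
        self.in_speech = False
        self.silent_run = 0

    @staticmethod
    def _energy(frame):
        # Mean-square energy; frames are assumed to be non-empty lists of floats.
        return sum(s * s for s in frame) / len(frame)

    def push(self, frame):
        """Feed one frame; return the finished segment or None."""
        is_speech = self._energy(frame) >= self.energy_threshold
        if not self.in_speech:
            if is_speech:
                # Speech onset: prepend buffered history to recover the first syllable.
                self.segment = list(self.preroll) + [frame]
                self.preroll.clear()
                self.in_speech = True
                self.silent_run = 0
            else:
                self.preroll.append(frame)
            return None
        self.segment.append(frame)
        self.silent_run = 0 if is_speech else self.silent_run + 1
        if self.silent_run >= self.silence_frames:
            finished, self.segment = self.segment, []
            self.in_speech = False
            self.silent_run = 0
            return finished
        return None
```

The `deque` with `maxlen` plays the role of the circular buffer: old frames fall off automatically, so only the most recent backtracking window is retained while the system is waiting for speech.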
[0071] The sources of the reference audio include two modes:
[0072] In the pre-registration mode, the user's pre-recorded voiceprint registration audio is stored in a local secure area or in the cloud as a long-term identity benchmark.
[0073] Instant Anchoring Mode: In a specific interactive session, a valid wake-up audio (such as a speech segment containing a wake word) dynamically captured by the system is used as a temporary reference audio.
[0074] The current query audio is a speech segment input by the user in real time, which may contain wake words, command content, or background noise. Before the audio is input into the model, it can be preprocessed, including but not limited to pre-emphasis, framing, windowing, and Mel-spectrum extraction, to meet the input requirements of the semantic encoder and voiceprint encoder.
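The first three preprocessing operations named above (pre-emphasis, framing, windowing) can be illustrated with the following sketch. The coefficient 0.97, the Hamming window, and the 400-sample frame with 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate) are common defaults chosen for illustration, not values stated in the source; Mel-spectrum extraction is omitted to keep the example self-contained.

```python
import math

def pre_emphasis(samples, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; boosts high frequencies."""
    if not samples:
        return []
    return [samples[0]] + [samples[i] - coeff * samples[i - 1]
                           for i in range(1, len(samples))]

def frame_signal(samples, frame_len, hop_len):
    """Split into overlapping frames; a trailing partial frame is dropped."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop_len)]

def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def preprocess(samples, frame_len=400, hop_len=160):
    """Pre-emphasize, frame, and window a list of float samples."""
    window = hamming(frame_len)
    frames = frame_signal(pre_emphasis(samples), frame_len, hop_len)
    return [[x * w for x, w in zip(frame, window)] for frame in frames]
```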
[0075] Step 2: Input the reference audio and the current query audio into the multimodal large model to extract semantic feature sequences and voiceprint feature vectors, and project the voiceprint feature vectors into voiceprint embedding vectors;
[0076] After the audio is acquired, it is input into a pre-trained multimodal large model. As shown in Figure 2, this model uses a dual-encoder structure to process the audio input:
[0077] (1) Semantic Feature Sequence Extraction: The current query audio is fed into a pre-trained and frozen semantic encoder (such as Whisper, Conformer, etc.). The semantic encoder performs temporal modeling on the audio and outputs a high-dimensional semantic feature sequence. This sequence retains the temporal information and semantic details of the speech content, but the data volume is large. In order to reduce the computational load of the subsequent large language model, this embodiment also includes an adapter module. The adapter is usually composed of a multi-layer 1D Convolutional Neural Network (1D-CNN) or downsampling layers, which is used to perform temporal step compression (downsampling) on the semantic feature sequence, reducing the sequence length while retaining key semantic information, and obtaining a compressed semantic feature sequence.
[0078] (2) Voiceprint Feature Vector Extraction: The reference audio and the current query audio are input into the voiceprint encoder. The voiceprint encoder preferably uses a pre-trained voiceprint recognition model (such as CAM++, ResNet34, etc.), keeping the parameters frozen during training and inference to ensure the stability of the voiceprint features. The voiceprint encoder performs global pooling on the entire audio segment, outputting a fixed-length reference voiceprint feature vector and a query voiceprint feature vector. These two vectors represent the identity features of the registered user and the current speaker, respectively.
[0079] (3) Vector Projection: Since the dimension of the voiceprint feature vector is inconsistent with the dimension of the hidden layer of the Large Language Model (LLM), this embodiment uses a linear projection layer for processing. The linear projection layer maps the reference voiceprint feature vector and the query voiceprint feature vector to the feature space of the Large Language Model through a trainable linear transformation matrix and a bias term, generating corresponding reference voiceprint embedding vectors and query voiceprint embedding vectors. These voiceprint embedding vectors are constructed as one or a few special tokens so that they can be directly inserted into the input sequence of the LLM.
[0080] In this way, the model compresses lengthy voiceprint features into a very small number of tokens, avoiding the inference delay problem caused by using voiceprints as long sequence inputs.
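How Step 2 keeps the input short can be sketched as follows: each fixed-length voiceprint vector becomes a single projected token, and the semantic sequence is downsampled before concatenation. This is an illustrative sketch only. Mean pooling stands in for the 1D-CNN adapter, the weight matrix would be learned in practice, and the ordering (reference token, query token, then semantic frames) is an assumption, since the source states only that the parts are concatenated.

```python
def linear_project(vec, weight, bias):
    """y = W x + b, with `weight` given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def compress(features, stride=4):
    """Stand-in adapter: mean-pool every `stride` consecutive frames."""
    pooled = []
    for start in range(0, len(features), stride):
        chunk = features[start:start + stride]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

def build_llm_input(ref_vp, query_vp, semantic_seq, weight, bias, stride=4):
    """Two voiceprint tokens followed by the compressed semantic sequence."""
    ref_token = linear_project(ref_vp, weight, bias)
    query_token = linear_project(query_vp, weight, bias)
    return [ref_token, query_token] + compress(semantic_seq, stride)
```

With a stride of 4, ten semantic frames become three, so the full input is five vectors long; the voiceprint information costs two tokens regardless of how long the reference audio was.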
[0081] Step 3: The large language model accepts the concatenated vector sequence, uses a joint attention mechanism to simultaneously focus on voiceprint identity information and semantic content information, and generates the output sequence autoregressively.
[0082] The generated output sequence strictly follows the preset logical order:
[0083] The first control identifier, located at the beginning of the sequence, is used to indicate the wake word detection status, such as whether a wake word is detected or not.
[0084] The second control identifier, which follows the first control identifier, is used to indicate the voiceprint authentication status, such as the target speaker or the non-target speaker.
[0085] The speech-to-text transcription, located in the middle section, is used to reconstruct the user's speech content;
[0086] Interaction status labels, located at the end of the sequence, are used to indicate the semantic integrity of the current statement and the state of the dialogue flow.
[0087] After generating the first control flag or the second control flag, the inference engine decides whether to continue generating subsequent content based on the flag's judgment result: if the first control flag indicates that no wake word has been detected and the current system is in standby mode, the inference process is immediately interrupted; or if the second control flag indicates that the speaker is not the target speaker, the inference process is immediately interrupted, and the current input is determined to be interference speech. Only when the first control flag indicates that a wake word has been detected and the second control flag indicates that the target speaker is speaking, will the speech-to-text and interaction status labels continue to be generated.
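The early-termination rule can be sketched as a decoding loop that inspects the first two generated tokens and stops pulling from the model as soon as a rejection tag appears. The tag strings, the function names, and the `fake_model` generator are hypothetical stand-ins for a real autoregressive decoder; the source does not specify the token vocabulary.

```python
WAKE, NO_WAKE = "<WAKE>", "<NO_WAKE>"
TARGET, NON_TARGET = "<TARGET>", "<NON_TARGET>"

def decode_with_early_exit(token_stream, state):
    """Consume tokens lazily; return (tokens, reason) and stop on a rejection tag."""
    tokens = []
    for position, token in enumerate(token_stream):
        tokens.append(token)
        # First control identifier: only gates generation in the standby state.
        if position == 0 and token == NO_WAKE and state == "IDLE":
            return tokens, "no_wake_word"
        # Second control identifier: a non-target speaker is always rejected.
        if position == 1 and token == NON_TARGET:
            return tokens, "interference_speech"
    return tokens, "completed"

produced = []  # records which tokens the stand-in model was actually asked for

def fake_model(tokens):
    """Stand-in for autoregressive generation: yields one token per step."""
    for token in tokens:
        produced.append(token)
        yield token
```

Because the stream is a generator, returning early means the remaining tokens are never computed, which is where the savings on invalid speech come from.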
[0088] Regarding the custom wake word feature, this invention allows users to dynamically set the wake word via natural language commands. For example, when receiving a user's command to "set the wake word to 'Hello Car'", the system parses the intent and updates the corresponding wake word field in the system prompts of the multimodal large model in real time. In this way, the model can adapt to different wake word requirements without retraining, greatly improving the system's flexibility.
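A wake word update of this kind amounts to rewriting one field of the system prompt, as the following sketch shows. The prompt wording, the class name, and the assumption that the new wake word has already been extracted from the user's natural language command are all hypothetical.

```python
# Hypothetical prompt template; the source does not give the actual wording.
PROMPT_TEMPLATE = (
    "Wake word: {wake_word}. "
    "Report whether the wake word is present, "
    "then whether the speaker matches the reference."
)

class PromptConfig:
    """Holds the wake word field and renders the system prompt from it."""

    def __init__(self, wake_word="Hey Jarvis"):
        self.wake_word = wake_word

    def set_wake_word(self, new_wake_word):
        """Replace the wake word field; no model retraining is involved."""
        cleaned = new_wake_word.strip()
        if not cleaned:
            raise ValueError("wake word must not be empty")
        self.wake_word = cleaned

    def render(self):
        return PROMPT_TEMPLATE.format(wake_word=self.wake_word)
```

For example, after `set_wake_word("Hello Car")` the next call to `render()` produces a prompt that names the new wake word, and the model weights are untouched.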
[0089] The specific definition of interactive status labels includes, but is not limited to, the following types:
[0090] A semantically complete tag indicates that the currently queried audio expresses a complete semantic intent, which the system can use to execute instructions.
[0091] The semantically incomplete label indicates that the semantics of the current query audio have not yet ended (e.g., the user paused but did not finish speaking). The system should keep the microphone on, concatenate the current query audio with the subsequent input audio frames at the semantic feature level, and continue to receive and recognize them.
[0092] The "confirmation" tag indicates that the current audio query is an affirmative or confirming word such as "um" or "yes," and the system may respond briefly or continue listening.
[0093] The "wait" or "pause" label indicates that the user explicitly requests a pause in the conversation (e.g., "Don't say anything more"). The system should then stop broadcasting and enter a silent state.
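The four label types map naturally onto a small dispatch function, sketched below. The enum values and action names are hypothetical. One simplification is made loudly here: for an incomplete utterance this sketch accumulates transcribed text, whereas the source concatenates the audio at the semantic feature level.

```python
from enum import Enum

class InteractionLabel(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    CONFIRM = "confirm"
    PAUSE = "pause"

def handle_label(label, pending, text):
    """Return (action, payload, new_pending) for one parsed utterance."""
    combined = f"{pending} {text}".strip()
    if label is InteractionLabel.COMPLETE:
        return "execute", combined, ""           # act on the full utterance
    if label is InteractionLabel.INCOMPLETE:
        return "keep_listening", "", combined    # hold content until the user finishes
    if label is InteractionLabel.CONFIRM:
        return "acknowledge", text, pending      # brief response or continue listening
    return "stop_and_idle", "", ""               # PAUSE: stop output, go silent
```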
[0094] Step 4: Dynamically switch the interaction state according to the output sequence and execute the corresponding interaction control logic.
[0095] The system's operation relies on cooperation between the state machine controller and the inference early termination mechanism. As shown in Figure 2, the state machine mainly includes a standby state (IDLE) and an active state (ACTIVE).
[0096] In the standby state, the system constructs a first system prompt and inputs it into the multimodal large model. The first system prompt contains explicit instructions requiring the model to focus on wake word detection and voiceprint verification.
[0097] In the active state, the system constructs a second system prompt and inputs it into the multimodal large model. The second system prompt contains instructions that require the model to ignore wake word detection and instead continuously perform voiceprint verification to confirm whether it is the same speaker.
[0098] State transition trigger condition:
[0099] Switching from standby to active state: When the first control flag in the output sequence generated by the model indicates that a wake word has been detected, the state switch is triggered and the system enters the active state;
[0100] Switching from active state to standby state: When the interaction state label generated by the model indicates "wait or pause label", or when the internal active countdown of the system times out (e.g., no valid voice is detected within 20 seconds), the state switch is triggered and the system returns to the standby state.
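The two states and their transition triggers can be sketched as a small state machine. The class, the method names, and the prompt strings are hypothetical; the 20-second timeout follows the example in the text. Time is passed in explicitly so the behavior is easy to verify.

```python
IDLE, ACTIVE = "IDLE", "ACTIVE"

# Hypothetical wording; the source only describes what each prompt instructs.
FIRST_PROMPT = "Detect the wake word and verify the speaker against the reference."
SECOND_PROMPT = "Ignore wake word detection; keep verifying that the speaker is unchanged."

class InteractionStateMachine:
    def __init__(self, active_timeout_s=20.0):
        self.state = IDLE
        self.active_timeout_s = active_timeout_s
        self.deadline = None  # time at which the active state expires

    def system_prompt(self):
        """Select the prompt that matches the current state."""
        return FIRST_PROMPT if self.state == IDLE else SECOND_PROMPT

    def on_wake_detected(self, now):
        """First control identifier reported a wake word: IDLE -> ACTIVE."""
        self.state = ACTIVE
        self.deadline = now + self.active_timeout_s

    def on_successful_interaction(self, now):
        """Each successful interaction resets the active countdown."""
        if self.state == ACTIVE:
            self.deadline = now + self.active_timeout_s

    def on_pause_label(self):
        """A wait or pause label returns the system to standby."""
        self.state, self.deadline = IDLE, None

    def tick(self, now):
        """Apply the timeout rule and return the current state."""
        if self.state == ACTIVE and now >= self.deadline:
            self.state, self.deadline = IDLE, None
        return self.state
```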
[0101] Real-time voiceprint anchoring mechanism: To eliminate the cumbersome process of users pre-registering their voiceprints and support the experience of "first sentence wake-up to lock identity, subsequent continuous conversations without wake-up", this method also includes real-time voiceprint anchoring steps:
[0102] Cold start phase: When the system is in standby mode, the reference audio is empty or set to a null token. At this time, the model only relies on wake word detection to work.
[0103] When the system is in standby mode and the generated first control identifier indicates that a wake-up word has been detected, the system automatically extracts the voiceprint feature vector of the currently queried audio (i.e., the query voiceprint feature vector) and caches it as a temporary reference voiceprint vector for the current session.
[0104] The system then switches to an active state. During this active state, all subsequent input audio for the current query is compared to this temporary reference voiceprint vector to verify the speaker's identity, instead of relying on pre-registered long-term reference audio. This ensures that even if the pre-registered voiceprint changes slightly, as long as the user speaks continuously, the system can accurately identify and maintain the interaction, achieving a natural, continuous dialogue experience. It also implements temporary access control based on "whoever wakes up, locks," effectively shielding against subsequent background noise.
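The anchoring steps can be sketched as a small cache that decides which reference voiceprint to supply to the model. The class and method names are hypothetical, and clearing the temporary reference when the session ends is an assumption inferred from the phrase "for the current session".

```python
class VoiceprintAnchor:
    """Chooses the reference voiceprint for each inference call."""

    def __init__(self, enrolled=None):
        self.enrolled = enrolled    # optional long-term, pre-registered vector
        self.session_ref = None     # temporary reference for the current session

    def reference_for(self, state):
        """Return the vector to use as reference, or None during cold start."""
        if state == "ACTIVE" and self.session_ref is not None:
            return self.session_ref
        return self.enrolled        # None means: rely on wake word detection only

    def on_wake(self, query_voiceprint):
        """Wake word detected in standby: cache the query vector as the anchor."""
        self.session_ref = list(query_voiceprint)

    def on_session_end(self):
        """Assumed cleanup when the system returns to standby."""
        self.session_ref = None
```

In this sketch whoever triggers the wake word becomes the reference for the rest of the session, which is the "whoever wakes up, locks" behavior described above.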
[0105] Interactive control logic and inference early termination: To reduce latency and computational power consumption, this embodiment implements a strict inference early termination mechanism:
[0106] Wake-up filtering: During the generation of the output sequence, once the first control flag is generated, if the flag indicates "no wake-up word detected" and the current system is in standby mode, the inference process is immediately interrupted, the current query audio is discarded, and no further voiceprint verification results and transcribed text are generated.
[0107] Identity Filtering: If the first control flag indicates that a wake word has been detected, the model continues to generate a second control flag. If the flag indicates "non-target speaker" (i.e., voiceprint mismatch), the inference process is immediately interrupted, the current input is determined to be interference speech, no business instructions are executed, and the model may choose not to give the user any feedback or only give a brief rejection prompt (such as outputting "Do not disturb" or remaining silent).
[0108] Normal execution: The model will only continue to generate subsequent speech-to-text and interaction status labels when the first control flag indicates "wake word detected" and the second control flag indicates "target speaker". After the system parses the complete text and intent, it executes the corresponding control logic (such as playing music, adjusting the air conditioner, answering questions, etc.) and resets the active countdown.
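On the parsing side, a completed output sequence can be split into its four parts and checked before any business instruction runs. The following is a sketch under assumptions: the angle-bracket tag format and tag names are hypothetical, and the parser would misread transcribed text that itself contains such a tag.

```python
import re
from typing import NamedTuple, Optional

class ParsedOutput(NamedTuple):
    wake: Optional[bool]     # None if the tag was never generated
    target: Optional[bool]
    text: str
    label: Optional[str]

_TAG = re.compile(r"<([A-Z_]+)>")

def parse_output(raw):
    """Split a raw model output string into control flags, text, and label."""
    wake = target = label = None
    for tag in _TAG.findall(raw):
        if tag in ("WAKE", "NO_WAKE"):
            wake = tag == "WAKE"
        elif tag in ("TARGET", "NON_TARGET"):
            target = tag == "TARGET"
        else:
            label = tag
    text = _TAG.sub(" ", raw).strip()
    return ParsedOutput(wake, target, text, label)

def should_execute(parsed, state):
    """Execute only for the target speaker with a complete utterance."""
    wake_ok = parsed.wake is True or state == "ACTIVE"
    return wake_ok and parsed.target is True and parsed.label == "COMPLETE"
```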
[0109] The training method for the large language model of this invention adopts a two-stage training strategy:
[0110] Phase 1: Hybrid Modality Alignment. In this phase, the large language model, semantic encoder, and voiceprint encoder are frozen; only the adapter and linear projection layer are trained. Semantic features and voiceprint features are aligned to the LLM's text embedding space by alternating between an ASR alignment task and a speaker recognition task. The ASR alignment task input contains only the semantic feature sequence, with the system prompt "Please transcribe the speech," and the loss is the standard cross-entropy loss for ASR text generation. The speaker recognition task input includes both the reference voiceprint embedding and the query voiceprint embedding, with the system prompt "Determine if it is the same speaker," and the loss is a classification loss over the target speaker and non-target speaker labels.
[0111] Phase Two: Multi-task Full-Scale Instruction Fine-tuning. In this phase, the large language model is unfrozen. Building upon Phase One, the model is trained to understand complex system prompts and simultaneously master multiple tasks, including wake-word detection, voiceprint gating, speech-to-text transcription, and interactive state labeling. To address the issue of sample scarcity, this embodiment employs a data generation pipeline based on controlled speech synthesis (TTS) to construct large-scale, accurately labeled training data. An orthogonal balanced sampling strategy is used to ensure the model does not exhibit bias in multi-task learning.
[0112] To address the task imbalance between long-sequence text generation and key classification label prediction, this invention designs an adaptive token-weighted cross-entropy loss function:
[0113] $$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \omega_t \log P\left(y_t \mid y_{<t}, X; \theta\right)$$
[0114] where $T$ is the total length of the target sequence; $y_t$ is the target token at time step $t$; $y_{<t}$ is the sequence of tokens generated before time step $t$; $\omega_t$ is the task importance weight of the $t$-th token; $X$ is the multimodal input sequence fed to the large language model; $\theta$ denotes the trainable parameters of the model; and $P(y_t \mid y_{<t}, X; \theta)$ is the conditional probability of predicting the $t$-th token $y_t$ given the input and the preceding context.
[0115] This invention addresses the task imbalance problem by assigning different weights $\omega_t$ to different types of tokens. Critical control tokens (such as the wake-up tag and the voiceprint verification tag) are assigned a higher first weight value (e.g., 10.0); secondary state tokens (such as the interaction status tag) are assigned a medium second weight value (e.g., 5.0); and ordinary text generation tokens are assigned a basic third weight value (e.g., 1.0). This amplifies the gradient contribution of the critical control tags, pushing the model to converge first on correct wake-up and identity determination. It should be noted that the above weight values are merely examples and can be adjusted for specific scenarios in practice. Any configuration that satisfies the hierarchical relationship "control tag weight > state tag weight > text tag weight" falls within the protection scope of this invention.
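As a small worked illustration of the token weighting, the following sketch computes a weighted cross-entropy from per-token probabilities using the example weights 10.0, 5.0, and 1.0. It assumes an unnormalized sum over time steps; whether the actual loss divides by the sequence length or the weight total is not stated here, and the names `TOKEN_WEIGHTS`, `weighted_cross_entropy`, and `grad_wrt_prob` are hypothetical.

```python
import math
from typing import List, Tuple

# Example weights from the text; only the ordering control > state > text matters.
TOKEN_WEIGHTS = {"control": 10.0, "state": 5.0, "text": 1.0}


def weighted_cross_entropy(steps: List[Tuple[float, str]]) -> float:
    """Sum of -w_t * log p_t over time steps.

    Each step is (probability the model assigns to the target token, token type).
    Assumption: no normalization by sequence length or total weight.
    """
    return -sum(TOKEN_WEIGHTS[kind] * math.log(p) for p, kind in steps)


def grad_wrt_prob(p: float, kind: str) -> float:
    """Derivative of the per-token loss -w * log(p) with respect to p."""
    return -TOKEN_WEIGHTS[kind] / p


# One control token, one state token, and one text token, each predicted at p = 0.5.
example = [(0.5, "control"), (0.5, "state"), (0.5, "text")]
loss = weighted_cross_entropy(example)  # (10 + 5 + 1) * ln 2
```

At equal predicted probability, a control token contributes ten times the gradient magnitude of a text token, which is the mechanism the paragraph relies on to prioritize wake-up and identity decisions.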
[0116] This invention can be flexibly applied to a variety of scenarios:
[0117] Scenario 1: Targeted wake-up in a noisy environment (cocktail party effect)
[0118] Environment: The mall is noisy with many people talking in the background.
[0119] Action: User A says, "Hey Jarvis, turn on the air conditioner."
[0120] Processing:
[0121] 1. The system is in IDLE state. Input includes the voiceprint of the pre-registered user A as a reference.
[0122] 2. The model uses a joint attention mechanism to focus on the voiceprint features of user A amidst noise.
[0123] 3. The output sequence begins with the wake-word-detected tag, followed by the target-speaker tag.
[0124] 4. The system confirms that the command is valid and continues generating the transcribed text "turn on the air conditioner" and the semantically complete tag.
[0125] 5. Execute the command to turn on the air conditioner.
[0126] Comparative advantage: If user B speaks loudly but does not say the wake word, or says the wake word but the voiceprint does not match, the model triggers early termination within the first two tokens. This takes only about 50 ms, saving considerable computing power and avoiding false triggers.
[0127] Scenario 2: Continuous dialogue without wake word and anti-interruption
[0128] Environment: Inside the vehicle, the driver interacts with the vehicle.
[0129] Operation:
[0130] 1. The driver first says, "Hey Jarvis, navigate to the company." (This triggers wake-up, switches the system to the ACTIVE state, and anchors the driver's voiceprint.)
[0131] 2. After thinking for a moment, the driver continues, "...while I'm at it, I'll check the nearby gas stations."
[0132] 3. During the conversation, a passenger interrupts, "I want to go too."
[0133] Processing:
[0134] 1. First voice segment: The system recognizes it as a complete command and executes navigation. The state remains ACTIVE, and the countdown resets.
[0135] 2. Second voice segment (after thinking): VAD detects a short silence (150 ms) and triggers inference. The model outputs the semantically incomplete label (because "while I'm at it..." usually links the preceding text to content that follows). The system keeps the microphone on, concatenates the current audio with the subsequent audio at the semantic feature level, does not interrupt the dialogue, and waits for further input.
[0136] 3. Passenger interruption: The model detects that the voiceprint does not match the anchored voiceprint (the driver's), outputs the non-target-speaker label, immediately interrupts inference, and ignores the passenger's instruction, ensuring driving safety.
[0137] 4. Third voice segment (driver continues): The model generates the complete text and the semantically complete tag, and the system merges the context to execute the compound command "navigate to the company and check the gas station".
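The feature-level concatenation used in this scenario can be sketched as a small buffer that accumulates semantic feature frames until the model reports a label other than "incomplete". This is an illustrative simplification under stated assumptions: the class `SemanticFeatureBuffer`, its method names, and the label strings are hypothetical, frames are represented as plain lists of floats, and segments rejected by the voiceprint gate are assumed never to be pushed.

```python
from typing import List

Frame = List[float]


class SemanticFeatureBuffer:
    """Holds semantic feature frames across VAD segments of one utterance."""

    def __init__(self) -> None:
        self._frames: List[Frame] = []

    def extend(self, segment: List[Frame]) -> List[Frame]:
        """Append a new segment and return all frames to feed to the model."""
        self._frames.extend(segment)
        return list(self._frames)

    def on_label(self, label: str) -> None:
        """Keep the frames while the utterance is incomplete; otherwise reset."""
        if label != "incomplete":
            self._frames.clear()


buf = SemanticFeatureBuffer()
first_input = buf.extend([[0.1], [0.2]])   # "...while I'm at it"
buf.on_label("incomplete")                  # model asks for more audio
second_input = buf.extend([[0.3]])          # driver continues
buf.on_label("complete")                    # compound command executed
```

With this arrangement the second inference sees the frames of both segments together, which is what lets the model produce one merged transcript rather than two fragments.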
[0138] Scenario 3: Custom wake word and instant exit
[0139] Operation: The user sets a new wake word, "Xiao Chuang Tongxue", through the app.
[0140] Processing:
[0141] 1. The system does not need to retrain the model; it only needs to update the wake word field in the system prompt to "Xiao Chuang Tongxue".
[0142] 2. The user says, "Xiao Chuang Tongxue, stop."
[0143] 3. After the model recognizes the wake word and the voiceprint, it generates the transcribed text "stop" followed by a wait or pause tag at the end.
[0144] 4. When the state control module detects the wait or pause tag, it immediately stops the current voice broadcast and switches the state back to IDLE, achieving a millisecond-level response.
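Wake word customization through the system prompt can be sketched as follows. The prompt wording, the `PromptConfig` class, and `build_system_prompt` are hypothetical; the source only states that a wake word field in the system prompt is updated and that the ACTIVE-state prompt tells the model to ignore wake word detection while continuing voiceprint verification.

```python
from dataclasses import dataclass


@dataclass
class PromptConfig:
    wake_word: str = "Hey Jarvis"


def build_system_prompt(cfg: PromptConfig, state: str) -> str:
    """Return the state-dependent system prompt (illustrative wording only)."""
    if state == "IDLE":
        return (
            f"Wake word: {cfg.wake_word}. Detect the wake word, then verify "
            "the speaker against the reference voiceprint."
        )
    # ACTIVE: continuous dialogue without repeating the wake word.
    return (
        "Ignore wake word detection. Keep verifying the speaker against "
        "the reference voiceprint."
    )


cfg = PromptConfig()
cfg.wake_word = "Xiao Chuang Tongxue"  # set by the user in the app; no retraining
idle_prompt = build_system_prompt(cfg, "IDLE")
```

Changing the wake word is then a string update on the configuration object; the model weights are untouched.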
[0145] Example 2: Identity Awareness and Full-Duplex Voice Interaction Control System
[0146] The present invention also provides a control system for implementing the above control method, comprising the following modules:
[0147] Audio acquisition module: Used to acquire the reference audio and the current query audio and to perform signal processing. This module typically includes a microphone, an audio buffer, and a VAD detection unit, which together implement audio acquisition, framing, silence detection, and forward backtracking.
[0148] Multimodal large model inference module: This module is the core of the system and contains the trained semantic encoder, voiceprint encoder, linear projection layer, adapter, and large language model. Based on the input audio and the system prompt, it uses joint attention computation to autoregressively generate an output sequence containing the control flags, the speech-to-text transcription, and the interaction status label.
[0149] State control module: Used to maintain the interaction state of the system, dynamically switch states according to the output sequence and an internal timer, and execute the real-time voiceprint anchoring logic.
[0150] Output parsing and execution module: Used to parse the sequence output by the model. It decides, based on the early termination logic, whether to stop inference early, and converts the final speech-to-text transcription and interaction status label into specific control commands, such as waking the device, executing a service request, waiting for the user, or pausing the broadcast.
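The state control module's behavior can be sketched as a two-state machine. This is a simplified illustration under explicit assumptions: the class `StateController`, the label strings, the action names, and the 10-second timeout are hypothetical (no timeout value is given here), voiceprints are plain float lists, and the current time is passed in explicitly so the logic is deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StateController:
    active_timeout_s: float = 10.0         # hypothetical value
    state: str = "IDLE"
    anchor: Optional[List[float]] = None   # temporary reference voiceprint
    deadline: float = 0.0                  # when the active countdown expires

    def _to_idle(self) -> None:
        self.state = "IDLE"
        self.anchor = None

    def step(self, wake: bool, target: bool, label: str,
             voiceprint: List[float], now: float) -> str:
        """Apply one parsed model output and return the action to take."""
        if self.state == "ACTIVE" and now >= self.deadline:
            self._to_idle()                 # active countdown timed out
        if not target:
            return "ignore"                 # interfering speaker
        if self.state == "IDLE":
            if not wake:
                return "ignore"
            self.state = "ACTIVE"
            self.anchor = list(voiceprint)  # instant voiceprint anchoring
            self.deadline = now + self.active_timeout_s
        if label == "wait_or_pause":
            self._to_idle()
            return "stop"
        if label == "incomplete":
            return "wait_more"              # keep listening, do not execute
        self.deadline = now + self.active_timeout_s  # reset the countdown
        return "execute"


ctrl = StateController()
trace = [
    ctrl.step(True, True, "complete", [0.1, 0.2], now=0.0),     # wake + command
    ctrl.step(False, True, "incomplete", [0.1, 0.2], now=2.0),  # unfinished
    ctrl.step(False, False, "complete", [0.9, 0.9], now=3.0),   # passenger
    ctrl.step(False, True, "complete", [0.1, 0.2], now=4.0),    # driver continues
    ctrl.step(False, True, "wait_or_pause", [0.1, 0.2], now=5.0),
]
```

The trace mirrors the in-vehicle scenario: one wake-up, a pending utterance, a rejected interruption, a completed compound command, and an explicit stop that returns the system to IDLE.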
[0151] This example includes a comparative experiment against a traditional cascaded voice interaction process, in which a real-time wake word detection module continuously monitors the user's voice, and only after a valid wake word is detected are the voiceprint verification, voice activity detection (VAD), and automatic speech recognition (ASR) modules invoked in sequence. Both methods use the same VAD algorithm (Silero VAD). The difference is that the traditional approach sets the VAD end-of-speech silence duration to 500-700 ms to ensure that the utterance is complete, whereas this solution, which supports continuous interaction without a wake word, shortens it to 150 ms and thereby significantly reduces interaction latency.
[0152] Test platform: Single Nvidia RTX 4090 graphics card (GPU) + Intel i9-14900KF central processing unit (CPU).
[0153] 1. Single inference time: With KV cache (key-value cache) reuse enabled, the average inference time is about 80 ms (including audio encoding, voiceprint encoding, and LLM inference) for audio inputs averaging about 3 seconds, and the platform supports about 12 parallel real-time inference channels.
[0154] 2. Comprehensive Cascaded Delay Comparison
[0155]
| Test condition | Traditional solution | This solution |
| --- | --- | --- |
| Single interaction only (wake-up not considered) | VAD tail time 500 ms; total 500 ms | VAD tail time 150 ms + model inference about 80 ms; total about 230 ms |
| Single interaction with wake-up (correct wake word, correct identity) | Streaming wake word detection time ignored; about 800 ms from the wake-up announcement until voice requests are accepted again; VAD tail time 500 ms; voiceprint matching takes basically the same time as in this solution and is ignored; total about 1.3 s | VAD tail time 150 ms + model inference about 80 ms; total about 230 ms |
| Single interaction with wake-up (correct wake word, incorrect identity) | Wake-up announcement about 800 ms + VAD tail time 500 ms; total about 1.3 s | VAD tail time 150 ms + first two tokens about 50 ms (early termination); total about 200 ms |
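The totals in this comparison follow from simple addition of the stated components. The sketch below reproduces that arithmetic and derives the relative latency reduction; the constant and function names are hypothetical, and the percentage figures are computed from the approximate values above rather than reported measurements.

```python
# Component latencies in milliseconds, as stated in the comparison.
TRADITIONAL_VAD_TAIL_MS = 500
WAKE_ANNOUNCEMENT_MS = 800
OURS_VAD_TAIL_MS = 150
FULL_INFERENCE_MS = 80
EARLY_EXIT_MS = 50  # first two tokens only


def reduction_percent(baseline_ms: int, ours_ms: int) -> float:
    """Relative latency reduction of this solution versus the baseline."""
    return 100.0 * (baseline_ms - ours_ms) / baseline_ms


traditional_no_wake = TRADITIONAL_VAD_TAIL_MS                            # 500
traditional_with_wake = WAKE_ANNOUNCEMENT_MS + TRADITIONAL_VAD_TAIL_MS   # 1300
ours_full = OURS_VAD_TAIL_MS + FULL_INFERENCE_MS                         # 230
ours_rejected = OURS_VAD_TAIL_MS + EARLY_EXIT_MS                         # 200

no_wake_gain = reduction_percent(traditional_no_wake, ours_full)         # 54.0
with_wake_gain = reduction_percent(traditional_with_wake, ours_full)
rejected_gain = reduction_percent(traditional_with_wake, ours_rejected)
```

Under these figures the shortened VAD tail accounts for most of the saving in the no-wake case, while removing the wake-up announcement dominates in the two wake-up cases.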
[0156] 3. Semantic completeness: Averaged over the four types of interaction status labels, the sentence segmentation accuracy of this solution is about 96% (on the test set constructed for this solution).
[0157] 4. Comparison of response success rates:
[0158] Wake word detection: Traditional solutions achieve an accuracy of less than 75%, while this solution achieves an accuracy of more than 90% in wake word detection and state recognition.
[0159] Voiceprint identification: Traditional solutions achieve an accuracy of about 70% in voiceprint judgment, while this solution achieves an accuracy of about 87% in voiceprint identity verification status recognition, which is further improved to 94% in scenarios with interference from surrounding electrical appliances.
[0160] Example 3: Computer-readable storage medium
[0161] This invention provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the aforementioned method for identity perception and full-duplex voice interaction control based on an audio multimodal large model.
[0162] Example 4: Electronic Equipment
[0163] The present invention provides an electronic device including one or more processors and a memory. The memory stores a computer program that, when executed by the one or more processors, causes the electronic device to implement the above-described identity perception and full-duplex voice interaction control method based on an audio multimodal large model.
[0164] This invention constructs a unified audio multimodal model, deeply integrating core tasks such as wake-up word detection, voiceprint verification, automatic speech recognition, and interaction state determination into a single architecture, completely breaking the inherent limitations of traditional cascaded processing architectures. This invention utilizes a joint attention mechanism to achieve complementary enhancement of voiceprint features and semantic content, significantly improving the accuracy of identity verification and system robustness in complex acoustic environments. By compressing voiceprint features into fixed-length embedding vectors and setting an early termination flag at the front of the output sequence, it achieves an order-of-magnitude improvement in inference efficiency and extreme conservation of computing resources. Combined with dynamic state machine switching and an instant voiceprint anchoring mechanism, it endows the system with natural, anthropomorphic interaction capabilities of "one-time wake-up, continuous dialogue, and identity locking," accurately distinguishing target users from background interference. Furthermore, this invention supports flexible customization of wake-up words and interaction strategies by modifying system prompts, greatly reducing product development and maintenance costs. In summary, this invention has achieved significant technological advancements in recognition accuracy, response speed, interactive experience, and system flexibility, providing a complete solution for building high-performance, low-latency, and highly secure full-duplex voice interaction systems.
[0165] The above are merely specific embodiments of the present invention, but the technical features of the present invention are not limited thereto. Any simple changes, equivalent substitutions, or modifications made based on the present invention to solve essentially the same technical problems and achieve essentially the same technical effects are all covered within the protection scope of the present invention.
Claims
1. A method for identity perception and full-duplex voice interaction control based on an audio multimodal large model, characterized in that the method comprises the following steps: Step 1: obtaining a reference audio and a current query audio; Step 2: inputting the reference audio and the current query audio into the multimodal large model to extract a semantic feature sequence and voiceprint feature vectors; Step 3: using the multimodal large model to perform joint attention calculation and autoregressively generate an output sequence; Step 4: dynamically switching the interaction state according to the output sequence and executing the corresponding interaction control logic.
2. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 1, characterized in that the multimodal large model includes: a semantic encoder, which encodes the current query audio to generate a semantic feature sequence; a voiceprint encoder, which encodes the reference audio and the current query audio separately to generate a reference voiceprint feature vector and a query voiceprint feature vector; a linear projection layer, which projects the reference voiceprint feature vector and the query voiceprint feature vector into the feature space of the large language model to obtain the corresponding voiceprint embedding vectors; an adapter, which compresses the semantic feature sequence; and a large language model, which receives as input the concatenation of the voiceprint embedding vectors and the compressed semantic feature sequence.
3. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 1, characterized in that, in Step 3, the generated output sequence includes the following in a preset logical order: a first control identifier, used to indicate the wake word detection status; a second control identifier, used to indicate the voiceprint authentication status; a speech-to-text transcription, used to reconstruct the user's speech content; and an interaction status label, used to indicate the semantic completeness of the current utterance and the state of the dialogue flow.
4. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 3, characterized in that the method further includes wake word customization, comprising: receiving a new wake word instruction set by the user via natural language; and updating the corresponding wake word field in the system prompt of the multimodal large model according to the new wake word instruction.
5. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 3, characterized in that the interaction status label includes at least one of the following: a semantically complete label, indicating that the current query audio expresses a complete semantic intent; a semantically incomplete label, indicating that the semantics of the current query audio have not yet ended; a confirmation label, indicating that the current query audio is a confirmatory or affirmative word; and a wait or pause label, indicating that the user wishes to pause the conversation.
6. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 3, characterized in that, in Step 4, the interaction states that are dynamically switched include a standby state and an active state; in the standby state, a first system prompt is constructed and input into the multimodal large model, the first system prompt containing instructions to perform wake word detection and voiceprint verification; in the active state, a second system prompt is constructed and input into the multimodal large model, the second system prompt containing instructions to ignore wake word detection and continue voiceprint verification; the trigger condition for switching from the standby state to the active state is that the generated first control identifier indicates that a wake word has been detected; and the trigger condition for switching from the active state back to the standby state is that the generated interaction status label is the wait or pause label, or that the active countdown times out.
7. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 6, characterized in that, in Step 4, the interaction control logic includes: when the first control identifier of the output sequence is generated, if the first control identifier indicates that no wake word has been detected and the system is currently in the standby state, immediately interrupting the inference process; when the second control identifier of the output sequence is generated, if the second control identifier indicates a non-target speaker, immediately interrupting the inference process and determining that the current input is interfering speech; and if the first control identifier indicates that a wake word has been detected and the second control identifier indicates the target speaker, continuing to generate the speech-to-text transcription and the interaction status label.
8. The identity perception and full-duplex voice interaction control method based on an audio multimodal large model according to claim 6, characterized in that the method further includes an instant voiceprint anchoring step: in the standby state, when the first control identifier of the output sequence is generated and indicates that a wake word has been detected, extracting the voiceprint feature vector of the current query audio and caching it as a temporary reference voiceprint vector for the current session; and after switching to the active state, comparing each current query audio input with the temporary reference voiceprint vector to verify the speaker's identity.
9. A control system for implementing the identity perception and full-duplex voice interaction control method based on an audio multimodal large model as described in any one of claims 1 to 8, characterized in that it comprises: an audio acquisition module, used to acquire the reference audio and the current query audio and to perform signal processing; a multimodal large model inference module, used to autoregressively generate the output sequence through joint attention computation; a state control module, used to dynamically switch the interaction state according to the output sequence and execute the corresponding interaction control logic; and an output parsing and execution module, used to parse the model output and execute the corresponding control logic or service instructions.
10. The control system based on an identity perception and full-duplex voice interaction control method according to claim 9, characterized in that: The multimodal large model inference module incorporates a semantic encoder, a voiceprint encoder, a linear projection layer, an adapter, and a large language model. The semantic encoder encodes the current query audio to generate a sequence of semantic features; The voiceprint encoder encodes the reference audio and the current query audio separately to generate a reference voiceprint feature vector and a query voiceprint feature vector; The linear projection layer projects the reference voiceprint feature vector and the query voiceprint feature vector onto the feature space of the large language model to obtain the corresponding voiceprint embedding vector. The adapter compresses the semantic feature sequences; The large language model takes the voiceprint embedding vector and the compressed semantic feature sequence as input and generates the output sequence.
11. A computer-readable storage medium storing a computer program thereon, characterized in that: When the computer program is executed by the processor, it implements the identity perception and full-duplex voice interaction control method based on an audio multimodal large model as described in any one of claims 1 to 8.
12. An electronic device, characterized in that it comprises: one or more processors; and a memory storing a computer program which, when executed by the one or more processors, causes the electronic device to implement the identity perception and full-duplex voice interaction control method based on an audio multimodal large model as described in any one of claims 1 to 8.