Wake-up free voice interaction method and computer program product

By employing a two-tier decision-making architecture that combines local sensory perception with cloud-based final adjudication, the computational burden and unnatural interaction issues caused by wake word dependence in smart devices are resolved. This enables efficient and natural human-computer interaction without wake words, reducing false triggering rates and protecting user security.

CN122201268A · Pending · Publication Date: 2026-06-12 · HANGZHOU ROBAM APPLIANCES CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU ROBAM APPLIANCES CO LTD
Filing Date
2026-03-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

The existing voice interaction methods of smart devices require wake words, which results in a heavy computing burden on the device, unnatural human-computer interaction, cumbersome user experience, and a high rate of false triggering.

Method used

It adopts a two-stage decision architecture of local sensing perception + cloud final judgment. It obtains audio and multimodal sensing data on the device to initially judge potential commands, and performs semantic understanding in the cloud to determine the effective commands, reducing the computing burden on the device.

Benefits of technology

It enables voice interaction without a wake word, reducing device-side computing power consumption, improving the smoothness and naturalness of interaction, reducing the false trigger rate, and protecting user security and privacy.



Abstract

The application discloses a wake-up-free voice interaction method and a computer program product, and relates to the technical field of large language models. The method comprises the following steps: determining the instruction confidence of audio data according to the audio data and multi-modal sensor data of a user, wherein the instruction confidence represents the possibility that the audio data comprises a potential instruction; and, if it is determined that the audio data comprises a potential instruction, obtaining historical audio data related to the audio data and sending an intent recognition request comprising the audio data, the historical audio data and the multi-modal sensor data to the cloud, to instruct the cloud to perform semantic understanding on the audio data, the historical audio data and the multi-modal sensor data, obtain a complete user intent, and determine whether the potential instruction is a valid instruction based on the complete user intent. The method realizes a voice interaction mechanism without a wake-up word, ensures that the device side is not limited by its hardware, breaks through the computing power bottleneck of the device side, improves interaction fluency, and reduces the false trigger rate in wake-up-free mode.

Description

Technical Field

[0001] This disclosure generally relates to the field of large language model technology, and more particularly to a wake-up-free voice interaction method and computer program product.

Background Technology

[0002] Currently, voice is the main way for users to interact with smart devices (such as smart speakers, smart kitchen appliances, mobile phones, cars, and headphones). A user must first say a wake word to the voice assistant, and only after the smart device has been woken up can it receive subsequent voice commands. However, voice assistants can usually recognize only a fixed wake word with a single pronunciation, and the entire wake-up process and command decision-making process are executed on the device side. As a result, the device side carries a heavy computing burden, the human-computer interaction is unnatural, and the user experience is cumbersome.

Summary of the Invention

[0003] In view of the above defects or deficiencies in the existing technology, it is desirable to provide a wake-up-free voice interaction method and computer program product. Only the simple determination of whether a potential command exists in the audio data is executed on the device side, while the complex semantic understanding task is transferred to the cloud, which realizes a two-stage decision architecture of local sensitive perception + cloud final decision. This achieves a voice interaction mechanism that does not require a wake word, ensures that the device side is not limited by its hardware, breaks through the computing power bottleneck of the device side, improves the smoothness of interaction, and reduces the false trigger rate in wake-up-free mode.

[0004] In a first aspect, this application provides a wake-up-free voice interaction method. The method includes: acquiring audio data and multimodal sensor data of a user, and determining the instruction confidence level of the audio data based on the audio data and the multimodal sensor data, where the instruction confidence level characterizes the possibility that the audio data includes a potential instruction; and, if it is determined that the audio data includes the potential instruction, obtaining historical audio data related to the audio data and sending an intent recognition request to the cloud, where the intent recognition request includes the audio data, the historical audio data, and the multimodal sensor data, and instructs the cloud to perform semantic understanding on the audio data, the historical audio data, and the multimodal sensor data to obtain the complete user intent and to determine whether the potential instruction is a valid instruction based on the complete user intent.

[0005] In conjunction with the first aspect, in one possible implementation, determining the instruction confidence level of the audio data based on the audio data and the multimodal sensor data includes: extracting acoustic features and semantic features from the audio data, and extracting user behavior features from the multimodal sensor data; determining an audio confidence level based on the acoustic features and the semantic features, where the audio confidence level characterizes the probability that the audio data includes the audio signal corresponding to the potential instruction; determining a user behavior confidence level based on the user behavior features, where the user behavior confidence level characterizes the probability that the multimodal sensor data includes the user behavior corresponding to the potential instruction; and obtaining the instruction confidence level by weighted fusion of the audio confidence level and the user behavior confidence level.

[0006] In conjunction with the first aspect, in one possible implementation, determining that the audio data includes the potential instruction includes: extracting instruction prompts related to the potential instruction from the audio data, and determining the matching degree between the extracted instruction prompts and preset instruction prompts; and, if the instruction confidence level is higher than the confidence level threshold and the matching degree is higher than the matching degree threshold, determining that the audio data includes the potential instruction.
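The two-condition check in [0006] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the threshold values, the preset prompt list, and the use of Python's `difflib.SequenceMatcher` as the matching-degree measure are all assumptions, since the application does not specify how the matching degree is computed.

```python
from difflib import SequenceMatcher

# Hypothetical values; paragraph [0006] does not fix either threshold.
CONFIDENCE_THRESHOLD = 0.4
MATCH_THRESHOLD = 0.8
# Hypothetical preset instruction prompts.
PRESET_PROMPTS = ("turn on", "turn off", "set timer")


def matching_degree(prompt, presets=PRESET_PROMPTS):
    """Best string similarity (0..1) between an extracted prompt and any preset prompt.

    String similarity is only a stand-in for whatever matching measure is used in practice.
    """
    return max(SequenceMatcher(None, prompt.lower(), p).ratio() for p in presets)


def has_potential_instruction(confidence, prompt):
    """Both conditions must hold: confidence above its threshold and match above its threshold."""
    return confidence > CONFIDENCE_THRESHOLD and matching_degree(prompt) > MATCH_THRESHOLD
```

Requiring both conditions means a confident but unrelated utterance, or a keyword spoken with low overall confidence, does not trigger the cloud request.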

[0007] In conjunction with the first aspect, in one possible implementation, obtaining historical audio data related to the audio data includes: determining a historical moment, where the duration between the historical moment and the current moment corresponding to the audio data satisfies a duration threshold; and determining the audio data in the circular buffer between the historical moment and the current moment as the historical audio data, where the circular buffer is used to cache audio data in real time.
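A circular buffer of this kind could be sketched as below, assuming audio arrives as timestamped frames. The class name, the frame representation, and the capacity are hypothetical; `collections.deque` with `maxlen` is used because it discards the oldest entry automatically once the buffer is full.

```python
from collections import deque
from typing import List


class AudioRingBuffer:
    """Fixed-capacity cache of (timestamp, frame) pairs; the oldest frames are overwritten."""

    def __init__(self, capacity_frames: int):
        self._frames = deque(maxlen=capacity_frames)

    def push(self, timestamp: float, frame: bytes) -> None:
        """Cache one audio frame captured at `timestamp` (seconds)."""
        self._frames.append((timestamp, frame))

    def history(self, now: float, window_s: float) -> List[bytes]:
        """Return frames captured in [now - window_s, now), oldest first."""
        start = now - window_s
        return [frame for t, frame in self._frames if start <= t < now]
```

For instance, with one frame per second, `history(now, 5.0)` returns at most the five frames that precede the current moment, which corresponds to looking back 5 seconds from the current audio.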

[0008] In conjunction with the first aspect, in one possible implementation, performing semantic understanding on the audio data, the historical audio data, and the multimodal sensor data to obtain the complete user intent includes: converting the audio data and the historical audio data into text to obtain first text information; describing the user behavior features extracted from the multimodal sensor data in natural language to obtain second text information; and constructing the intent from the first text information and the second text information to obtain the complete user intent.
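As a rough illustration of [0008], the sketch below assumes speech-to-text has already produced transcripts (the first text information) and shows only how behavior features might be described in natural language (the second text information) and combined. The feature names, phrases, and output layout are hypothetical and are not taken from the application.

```python
# Hypothetical mapping from behavior feature names to natural-language phrases.
BEHAVIOR_PHRASES = {
    "facing_device": "the user is facing the device",
    "near_device": "the user is moving near the device",
    "screen_on": "the device screen is lit",
}


def describe_behavior(features):
    """Turn boolean behavior features into a short natural-language description."""
    phrases = [text for name, text in BEHAVIOR_PHRASES.items() if features.get(name)]
    return "; ".join(phrases) if phrases else "no relevant user behavior was observed"


def build_intent_text(history_transcript, current_transcript, features):
    """Combine the transcripts and the behavior description into one text for semantic understanding."""
    return (
        f"Earlier speech: {history_transcript}\n"
        f"Current speech: {current_transcript}\n"
        f"Observed behavior: {describe_behavior(features)}"
    )
```

The combined text is what a cloud-side model would then analyze to obtain the complete user intent.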

[0009] In conjunction with the first aspect, in one possible implementation, determining whether the potential instruction is a valid instruction based on the complete user intent includes: identifying and analyzing the complete user intent using a large language model to obtain the instruction type of the potential instruction; and determining whether the potential instruction is a valid instruction based on the instruction type and the instruction confidence level corresponding to the potential instruction.

[0010] In conjunction with the first aspect, in one possible implementation, determining whether the potential instruction is a valid instruction based on the instruction type and the instruction confidence level corresponding to the potential instruction includes: if the instruction type is determined to be a command and the instruction confidence level is higher than the confidence level threshold, determining that the potential instruction is a valid instruction; and, if the instruction type is determined to be a mention and / or the instruction confidence level is lower than the confidence level threshold, determining that the potential instruction is an invalid instruction.
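The rule in [0010] reduces to a small decision function once the instruction type is known. In the sketch below the type is assumed to have been produced already by the large language model; the enum, the default threshold, and the handling of a confidence exactly equal to the threshold (treated as invalid here) are assumptions, since the application does not address that boundary case.

```python
from enum import Enum


class InstructionType(Enum):
    COMMAND = "command"  # the user is actually asking the device to act
    MENTION = "mention"  # the user merely mentions the words in conversation


def is_valid_instruction(instruction_type, confidence, threshold=0.4):
    """Valid only if the type is a command AND the confidence exceeds the threshold.

    The default threshold is a hypothetical value.
    """
    return instruction_type is InstructionType.COMMAND and confidence > threshold
```

For example, "turn off the stove" classified as a command with confidence 0.7 would be valid, while the same words quoted in a story (a mention) would be rejected regardless of confidence.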

[0011] In conjunction with the first aspect, in one possible implementation, the method further includes: if the potential instruction is determined to be a valid instruction, using the multimodal sensor data to determine whether executing the potential instruction poses a security risk; if it is determined that executing the potential instruction poses a security risk, not executing the potential instruction and notifying the user of the execution failure and the reason for the failure; and, if it is determined that executing the potential instruction poses no security risk, executing the potential instruction.
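The safety gate in [0011] might look like the following sketch. The single risk rule (refusing to ignite a burner when no person is detected nearby), the command name, and the sensor key are invented purely for illustration; the application does not list concrete risk rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecutionResult:
    executed: bool
    failure_reason: Optional[str] = None  # set only when execution is refused


def find_risk(command: str, sensors: dict) -> Optional[str]:
    """Hypothetical rule: do not ignite a burner when nobody is detected nearby."""
    if command == "ignite_burner" and not sensors.get("person_nearby", False):
        return "no person detected near the appliance"
    return None


def execute_if_safe(command: str, sensors: dict, run: Callable[[str], None]) -> ExecutionResult:
    """Run the command only if the sensor data shows no risk; otherwise report why it was refused."""
    reason = find_risk(command, sensors)
    if reason is not None:
        return ExecutionResult(executed=False, failure_reason=reason)
    run(command)
    return ExecutionResult(executed=True)
```

Returning the reason alongside the refusal lets the caller notify the user of both the failure and its cause, as the paragraph requires.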

[0012] In conjunction with the first aspect, in one possible implementation, determining the audio confidence level based on the acoustic features and the semantic features includes: identifying and classifying the acoustic features to obtain an acoustic confidence score that characterizes the reliability of the classification result; determining the similarity between the semantic features and each instruction keyword in the instruction keyword library, and taking the maximum of these similarities as the semantic confidence score, where the semantic confidence score represents the degree of certainty that the audio data contains the audio signal that best matches the user's operation intention; and performing weighted fusion of the acoustic confidence score and the semantic confidence score to obtain the audio confidence level.
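One way to read [0012] is sketched below, assuming the semantic features and the instruction keywords are represented as numeric vectors and compared by cosine similarity. The vector representation, the similarity measure, the clamping of negative similarities to zero, and the equal default weights are all assumptions; the application names none of them.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_confidence(feature, keyword_library):
    """Maximum similarity between the semantic feature and any keyword vector, clamped at 0."""
    best = max(cosine(feature, vec) for vec in keyword_library.values())
    return max(0.0, best)


def audio_confidence(acoustic_score, semantic_score, w_acoustic=0.5):
    """Weighted fusion of the acoustic and semantic scores; the weight is a hypothetical default."""
    return w_acoustic * acoustic_score + (1.0 - w_acoustic) * semantic_score
```

Taking the maximum over the keyword library means a single strong keyword match is enough to raise the semantic score, which suits a recall-oriented first stage.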

[0013] In a second aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the wake-up-free voice interaction method described in the first aspect.

[0014] This application provides a wake-up-free voice interaction method and a computer program product. In the method, only the simple determination of whether a potential command exists in the audio data is executed on the device side, and the complex semantic understanding task is transferred to the cloud. This two-stage decision architecture of local sensitive perception + cloud final decision achieves a voice interaction mechanism that does not require a wake word and ensures that the device side is not limited by its hardware, thereby breaking through the computing power bottleneck of the device side. The device only needs to process lightweight data such as the user's audio data and multimodal sensor data, which avoids the computing power consumed in existing technologies when the device performs the entire wake-up and command adjudication process, and thus significantly reduces the device's computational burden. In addition, the introduction of multimodal sensor data enables the device to comprehensively perceive the user's behavioral state without relying on a fixed wake word, making the human-computer interaction process more in line with natural conversation habits. This not only enhances the naturalness and flexibility of human-computer interaction, but also improves the accuracy and sensitivity with which the cloud is triggered to adjudicate the validity of potential commands. The cloud's integrated analysis of multi-source data ensures a complete and accurate understanding of the user's intent, which reduces the processing burden on the device, improves the smoothness of interaction, and reduces the false trigger rate in wake-up-free mode. Wake-up-free operation is thus achieved while maximizing the protection of user security and privacy.

Attached Figure Description

[0015] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is the first flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 2 is the first schematic diagram of the potential instruction determination process in one embodiment;
Figure 3 is the second flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 4 is the third flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 5 is the second schematic diagram of the potential instruction determination process in one embodiment;
Figure 6 is the fourth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 7 is the fifth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 8 is the sixth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 9 is the seventh flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 10 is the eighth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 11 is a schematic diagram of the instruction execution and feedback process in one embodiment;
Figure 12 is the ninth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 13 is the tenth flowchart of a wake-up-free voice interaction method in one embodiment;
Figure 14 is a schematic diagram of the structure of a smart device in one embodiment;
Figure 15 is a schematic diagram of the interaction process of a wake-up-free voice interaction system in one embodiment;
Figure 16 is a diagram of the internal structure of an electronic device in one embodiment.

Detailed Implementation

[0016] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0017] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. Furthermore, the term "and / or" in this document merely describes the relationship between related objects and indicates that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. The terms "first" and "second," etc., in the specification and claims of the embodiments of this application are used to distinguish different objects, not to describe a specific order of objects.

[0018] With the widespread adoption of smart devices (such as smart speakers, smart kitchen appliances, mobile phones, cars, and headphones), voice has become an important human-computer interaction method. Currently, most mainstream voice assistants require users to first say a fixed wake-up word before the smart device can receive subsequent voice commands; the whole process suffers from unnatural interaction, cumbersome user experience, and accessibility issues.

[0019] For example, unnatural interaction can be understood as disrupting the continuity and naturalness of the conversation, falling far short of the seamless communication patterns between humans. A cumbersome user experience can be understood as requiring a wake word before every voice command, which is particularly tedious in scenarios with frequent interactions. Accessibility issues can be understood as follows: reciting the complete wake word can be a burden for users with unusual pronunciations, for users who find a fixed keyword (such as an English word) hard to pronounce, and for users with mobility issues or in urgent situations (e.g., a user with mobility issues cannot quickly approach the device, or a user who sees a fire must run to the device and shout a fixed wake word to activate it).

[0020] Therefore, it can be understood that existing smart devices need to be woken up before receiving voice commands, and the entire device wake-up process and command adjudication process are executed on the device side, resulting in a heavy computing burden on the device side, unnatural human-computer interaction, and a cumbersome user experience.

[0021] To address the aforementioned technical problems, this application provides a wake-up-free voice interaction method and a computer program product. The execution entity of the wake-up-free voice interaction method can be a processing unit, which can be built into a smart device or independently located outside the smart device. The smart device can be a smart speaker, smart kitchen appliance, mobile phone, car, headphones, or other electronic devices. This application does not specifically limit the configuration or form of the processing unit.

[0022] The wake-up-free voice interaction method and computer program product provided by this application are described below with reference to Figures 1 to 16. To facilitate understanding, the following exemplary embodiments describe the method in detail. It is understood that these exemplary embodiments can be combined with each other, and similar concepts or processes may not be described repeatedly in some embodiments.

[0023] Figure 1 is a flowchart illustrating the wake-up-free voice interaction method provided in the embodiments of this application. As shown in Figure 1, the wake-up-free voice interaction method includes the following steps 101 and 102.

[0024] Step 101: Obtain the user's audio data and multimodal sensor data, and determine the command confidence level of the audio data based on the audio data and multimodal sensor data.

[0025] Multimodal sensing data can be understood as non-audio information related to user behavior collected through multiple sensors. For example, multimodal sensing data may include visual image information collected by a camera, user location information collected by a human body sensor, and device status information collected by a device status sensor.

[0026] The different types of multimodal sensor data complement each other and provide more basis for judgment, making device perception more reliable and reducing false alarms. In other words, the user's multimodal sensor data is acquired so that the device does not have to be woken up by a spoken wake word, and so that the user's audio data has a supplementary basis for assessing user intent.

[0027] Instruction confidence is used to characterize the likelihood that audio data contains potential instructions. In other words, instruction confidence can be understood as a quantitative indicator of the likelihood that audio data contains potential instructions. It is a real-time, comprehensive intent confidence.

[0028] Among them, potential instructions can be user operation instructions that may exist in the user's audio data. These potential instructions may be actual voice commands issued by the user, or they may be voice data that resembles voice commands in the user's speech. In short, potential instructions are user operation instructions that have not yet been confirmed.

[0029] Specifically, an audio acquisition module and multimodal sensors are pre-installed in the smart device. The audio acquisition module collects the user's audio data in real time and sends it to the processing unit in real time. At the same time, the multimodal sensors collect the user's multimodal sensing data in real time and send it to the processing unit in real time. In this way, the processing unit can obtain the user's audio data and multimodal sensing data in real time.

[0030] The processing unit can perform comprehensive analysis on the real-time acquired audio data. Specifically, it extracts latent instruction features from the audio data and makes a preliminary judgment on whether the audio data contains latent instructions based on the extracted latent instruction features. These latent instruction features are speech features used to characterize potential instructions, such as keywords, emotions, and intonation.

[0031] For example, the processing unit analyzes whether the audio data contains instruction keywords with a sufficiently high similarity to preset potential instructions, identifies the user's emotion and tone while speaking, and detects the user's visual attention information while speaking. If instruction keywords, visual attention information (used to determine whether the user is focused when speaking), emotion (such as calm, questioning, excited, or dissatisfied), and tone (such as level, rising, or falling) are all determined for the user's audio data at the current moment, they are all taken as potential instruction features, and the processing unit judges whether these features meet the preset conditions. The result of this judgment preliminarily determines whether the audio data contains potential instructions. For example, if instruction keywords are identified in the audio data, the user's visual attention is focused while speaking, the user's emotion is calm, and the tone is level, it can be preliminarily determined that the audio data may contain a user operation intention that has not yet been finally confirmed, that is, the audio data contains potential instructions. Conversely, if no instruction keywords are identified in the audio data, the user's visual attention is not focused while speaking, the user's emotion is questioning, and the tone is rising, it can be preliminarily determined that the audio data does not contain potential instructions. The multiple potential instruction features together act as a "high-sensitivity sensor": once multiple potential instruction features that meet the preset conditions are detected simultaneously in the audio data, the corresponding audio data can be marked as a "potential instruction" event.

[0032] To enrich the criteria for determining potential commands, the processing unit can perform comprehensive analysis of the user's multimodal sensor data at the current moment when audio data is marked as a "potential command" event. For example, the processing unit identifies whether the user is facing the smart device and whether the user is making any opening or closing motions from visual image information; combines this with human body sensing information to determine whether the user is moving near the smart device; and identifies whether the smart device's screen is lit up and whether headphones are being worn from device status information; thereby obtaining the comprehensive analysis results of the multimodal sensor data.

[0033] At this point, the processing unit can determine the instruction confidence level of the audio data based on various potential instruction characteristics and comprehensive analysis results. This determination process can be based on a pre-set instruction confidence evaluation algorithm, or it can utilize a pre-trained lightweight confidence evaluation model. No specific limitations are specified here.

[0034] It should be noted that the processing unit can first analyze the user's audio data at the current moment and then analyze the user's multimodal sensor data at the current moment, as described above; alternatively, it can analyze the audio data and the multimodal sensor data at the same time, until the multiple potential command features have been detected from the audio data and the analysis results of the multimodal sensor data have been obtained.

[0035] Step 102: If it is determined that the audio data contains a potential instruction, obtain the historical audio data related to the audio data and send an intent recognition request to the cloud. The intent recognition request includes the audio data, the historical audio data, and the multimodal sensor data, and instructs the cloud to perform semantic understanding on the audio data, the historical audio data, and the multimodal sensor data to obtain the complete user intent, and to determine whether the potential instruction is a valid instruction based on the complete user intent.

[0036] In this system, the audio acquisition module in the smart device periodically collects audio data. The historical audio data was collected before the current audio data was collected, and the collection times of the historical audio data and the current audio data are consecutive. In practical applications, historical audio data within the historical time window corresponding to the current audio time can be retrieved from local storage media. This is mainly to provide contextual information for the current audio data to support the understanding of the intended meaning of the audio data.

[0037] Specifically, during the implementation of the wake-up-free voice interaction method, the processing unit can determine whether the audio data contains potential commands based on the command confidence level of the user's audio data at the current moment. The determination compares the command confidence level with a preset confidence threshold. If the command confidence level is higher than the confidence threshold, it can be determined that the audio data contains potential commands, at which point the cloud data upload mechanism can be triggered to automatically generate an intent recognition request. For example, data upload is triggered when the command confidence level S_total ≥ the confidence threshold Threshold_trigger. Because the final decision is made in the cloud, Threshold_trigger is set low enough to ensure a high recall rate (preferring false positives to false negatives); its value can range from 0.3 to 0.5.

[0038] Once the processing unit on the device side determines that the audio data acquired at the current moment contains a potential instruction, it automatically generates an intent recognition request. At the same time, the processing unit acquires the historical audio data related to the audio data and packages the audio data at the current moment, the historical audio data, and the multimodal sensor data into a data packet. The data packet is placed in the intent recognition request, and the request carrying the data packet is sent to the cloud via the network. This is a one-time, triggered upload.
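The packaging step in [0038] could be sketched as follows. The JSON layout, the field names, and the base64 encoding of raw audio bytes are assumptions chosen for illustration; the application does not define a wire format, and the network transmission itself is omitted.

```python
import base64
import json


def build_intent_request(current_audio: bytes, history_audio: bytes, sensor_data: dict) -> str:
    """Package current audio, historical audio, and sensor data into one request body.

    Audio bytes are base64-encoded so that the packet can be serialized as JSON.
    """
    packet = {
        "audio": base64.b64encode(current_audio).decode("ascii"),
        "history_audio": base64.b64encode(history_audio).decode("ascii"),
        "sensor_data": sensor_data,
    }
    # Hypothetical envelope; the "type" field name is illustrative only.
    return json.dumps({"type": "intent_recognition_request", "packet": packet})
```

Because everything the cloud needs travels in a single packet, the device uploads once per "potential instruction" event rather than streaming audio continuously.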

[0039] For example, Figure 2 is a schematic diagram of the potential instruction determination process. As shown in Figure 2, a smart device such as a smart kitchen appliance is equipped with an audio acquisition module, a camera, a human body sensor, and a device status sensor. When a user speaks towards the smart kitchen appliance and makes gestures, the camera identifies the user's orientation, the human body sensor detects whether the user is moving within its detection range, and the device status sensor records device status information such as whether the device screen is lit up and whether headphones are being worn. The processing unit in the smart kitchen appliance calculates the command confidence of the audio data based on the user's orientation information, human body movement information, device status information, and the user's audio data at the current moment. If the confidence level is high enough, it can be determined that the user is moving near the smart kitchen appliance, the user is speaking towards the smart kitchen appliance, the device screen is lit up, and headphones are being worn. At this time, the processing unit determines that there is a potential command in the user's audio data at the current moment and that the cloud needs to be asked for deep adjudication to determine whether the potential command is a valid command. It therefore extracts from the local cache the audio data going back 5 seconds from the current moment as historical audio data, and packages and transmits the user's audio data at the current moment, the historical audio data, and the multimodal sensor data to the cloud for deep processing.

[0040] In response to the intent recognition request, the cloud extracts the current audio data, historical audio data, and multimodal sensor data from the data packet, and inputs the current audio data, historical audio data, and multimodal sensor data into the pre-trained intent recognition model for semantic understanding to obtain the complete user intent.

[0041] Then, the cloud can extract instruction-related keywords from the complete user intent and match at least one extracted keyword against at least one preset valid instruction corresponding to the potential instruction. If the match succeeds, the potential instruction is determined to be a valid instruction; otherwise, it is determined to be an invalid instruction.

[0042] In the wake-word-free voice interaction method provided in this application, only the simple determination of whether a potential command exists in the audio data is executed on the device side, and the complex semantic understanding task is transferred to the cloud. This two-stage decision-making architecture of local sensitive perception + cloud final decision-making achieves a wake-word-free voice interaction mechanism and ensures that the device side is not limited by its hardware, thus overcoming the device side's computing power bottleneck. The device side only needs to process lightweight data such as user audio data and multimodal sensor data, which avoids the computing power consumed in existing technologies when the device side performs the entire wake-up and command decision-making process and significantly reduces the device side's computing burden. Furthermore, the introduction of multimodal sensor data allows the device side to comprehensively perceive the user's behavioral state without relying on a fixed wake word, making the human-computer interaction process more in line with natural conversation habits. This not only enhances the naturalness and flexibility of human-computer interaction but also improves the accuracy and sensitivity with which the cloud is triggered to determine the validity of potential commands. The cloud's integrated analysis of multi-source data ensures a complete and accurate understanding of user intent, which reduces the processing burden on the device side, improves the smoothness of interaction, and reduces the false trigger rate in wake-word-free mode. Wake-word-free operation is thus achieved while maximizing the protection of user security and privacy.

[0043] In one example embodiment based on the method shown in Figure 1, step 101 determines the command confidence level of the audio data based on the audio data and the multimodal sensor data. In this embodiment, this step can be implemented through steps 201 to 204 shown in Figure 3.

[0044] Step 201: Extract acoustic and semantic features from audio data, and extract user behavior features from multimodal sensor data.

[0045] Step 202: Determine the audio confidence level based on acoustic and semantic features. The audio confidence level is used to characterize the probability that the audio data includes the audio signal corresponding to the potential instruction.

[0046] Step 203: Determine the user behavior confidence level based on user behavior characteristics. The user behavior confidence level is used to characterize the probability that the multimodal sensing data includes user behavior corresponding to potential commands.

[0047] Step 204: Perform a weighted fusion of audio confidence and user behavior confidence to obtain the instruction confidence of the audio data.

[0048] Acoustic features can be understood as the fundamental frequency, energy, spectrum, pitch, speech rate and volume changes in the user's audio data at the current moment. These features can be extracted using Mel-frequency cepstral coefficient (MFCC) extraction or fundamental frequency tracking algorithms. The purpose is to capture the tone and emotional tendency of the user when issuing the command, thereby distinguishing an imperative tone from an everyday conversational tone.

[0049] Semantic features can be understood as the keywords and syntactic structure information contained in the user's audio data at the current moment. Specifically, they can be obtained through keyword extraction and semantic parsing, with the aim of identifying the core operational intent and core operational object in potential instructions.

[0050] User behavior characteristics can be understood as user action patterns obtained from the user's multimodal sensor data at the current moment, such as gestures or changes in body posture. These can be obtained using computer vision algorithms or inertial sensor data analysis, with the aim of providing more basis for judgment and confirming the user's interaction intent through non-voice cues.

[0051] Audio confidence can be understood as the probability index that the user's audio data at the current moment contains potential instructions. It can be obtained by comprehensively evaluating acoustic and semantic features through a machine learning classifier. Its purpose is to quantify the instruction relevance of audio data.

[0052] User behavior confidence can be understood as a probability index that user behavior contains potential instructions. It can be analyzed and calculated through behavior recognition models to quantify the interactive indicativeness of multimodal sensor data.

[0053] Weighted fusion can be understood as combining audio confidence and user behavior confidence according to preset weights. It can adopt a dynamic weight allocation mechanism, such as adjusting audio weights based on the level of ambient noise. Its purpose is to integrate the reliability advantages of different modalities and generate a unified and accurate command confidence.
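As a rough illustration of the dynamic weight allocation mentioned above, the following sketch lowers the audio weight as ambient noise rises. The decibel bounds, the weight range, and the function name `dynamic_weights` are assumptions made for this example; the application does not specify them.

```python
def dynamic_weights(noise_db: float, quiet_db: float = 40.0, loud_db: float = 80.0) -> dict:
    """Shift weight from the audio modality to user behavior as noise rises."""
    # Clamp the noise level into [quiet_db, loud_db] and rescale it to [0, 1].
    ratio = min(max((noise_db - quiet_db) / (loud_db - quiet_db), 0.0), 1.0)
    w_audio = 0.7 - 0.4 * ratio  # assumed range: 0.7 when quiet, 0.3 when loud
    return {"audio": w_audio, "behavior": 1.0 - w_audio}
```

The sketch only produces the weights; how they enter the fusion depends on the fusion formula actually used.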

[0054] Specifically, the processing unit first extracts acoustic and semantic features from the user's audio data at the current moment to capture various acoustic characteristics. Then, based on these acoustic and semantic features, it determines the audio confidence level, quantifying the probability of potential commands in the user's audio data at the current moment. For example, by training a classification or regression model, using acoustic and semantic features as input, it outputs a probability value between 0 and 1, and uses this probability value as the audio confidence level. Alternatively, the acoustic and semantic features can be input into different sub-models to obtain their respective confidence scores, and then the two confidence scores can be fused to obtain the final audio confidence level.

[0055] Simultaneously, the processing unit extracts user behavior features from multimodal sensor data to obtain criteria for non-voice interaction judgments. These features include: the presence of a face in the visual image information, whether the face is facing the device, and whether user lip movements are detected; whether the smart device's screen is on, whether it is unlocked when on, and whether the screen is off and inactive. Accordingly, user behavior confidence can include visual confidence S_visual and device context state confidence S_context.

[0056] The camera on the smart device (with privacy protection processing on the device side) continuously captures visual image information and uses a pre-trained lightweight visual model to detect in real time whether there is a face in the image, the orientation of the face, and whether lip movement is detected. It outputs a visual confidence score S_visual, which can be used to characterize the probability that a user is detected in the visual image information and that the user is facing the smart device. When S_visual=0, it means that no face is detected in the visual image information or the detected face is facing away from the smart device.

[0057] The processing unit queries the current device status of the smart device through the device status sensor. If the current device status indicates that the screen of the smart device is on and unlocked, the device context state confidence S_context = 1.0 is determined; if the current device status indicates that the smart device is a smart speaker and there has been user activity in the last minute, the device context state confidence S_context = 0.8 is determined; if the current device status indicates that the screen of the smart device is off and there is no activity, or other situations, the device context state confidence S_context = 0.2 is determined.
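The three cases in paragraph [0057] can be sketched as a small lookup function. This is only a minimal illustration: the `DeviceStatus` fields and the function name `context_confidence` are hypothetical, and the one-minute activity window is expressed as 60 seconds.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical snapshot from the device status sensor (illustrative fields)."""
    screen_on: bool = False
    unlocked: bool = False
    is_smart_speaker: bool = False
    seconds_since_activity: float = float("inf")


def context_confidence(status: DeviceStatus) -> float:
    """Return S_context using the three cases described in paragraph [0057]."""
    if status.screen_on and status.unlocked:
        return 1.0
    if status.is_smart_speaker and status.seconds_since_activity <= 60:
        return 0.8
    # Screen off with no activity, or any other situation.
    return 0.2
```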

[0058] At this point, the processing unit can perform weighted fusion of audio confidence S_audio, visual confidence S_visual, device context state confidence S_context, and pre-set temporal smoothing factor S_temporal (which can be pre-trained by the model or adjusted based on actual experience) to obtain the instruction confidence S_total of the audio data, and its calculation formula is shown in Equation (1).

[0059] $$S_{\text{total}} = (S_{\text{audio}})^{w_{\text{audio}}} \cdot (S_{\text{visual}})^{w_{\text{visual}}} \cdot (S_{\text{context}})^{w_{\text{context}}} \cdot (S_{\text{temporal}})^{w_{\text{temporal}}} \tag{1}$$

In Equation (1), w_audio represents the weight coefficient of the audio confidence S_audio, w_visual represents the weight coefficient of the visual confidence S_visual, w_context represents the weight coefficient of the device context state confidence S_context, and w_temporal represents the weight coefficient of the temporal smoothing factor S_temporal. The use of the temporal smoothing factor and its weight coefficient in weighted fusion aims to improve the accuracy, stability, and adaptability of the fusion results.
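Equation (1) can be sketched in a few lines, reading the juxtaposed terms as a product. The function name `fuse_confidence` and all score and weight values below are illustrative assumptions, not values from the application.

```python
def fuse_confidence(scores: dict, weights: dict) -> float:
    """Weighted geometric fusion: the product of score ** weight over all modalities."""
    total = 1.0
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
        total *= score ** weights[name]
    return total


# Illustrative inputs only.
scores = {"audio": 0.9, "visual": 0.8, "context": 1.0, "temporal": 0.95}
weights = {"audio": 0.4, "visual": 0.3, "context": 0.2, "temporal": 0.1}
s_total = fuse_confidence(scores, weights)
```

One consequence of the multiplicative form is that any score of 0 with a positive weight drives S_total to 0, so a case such as S_visual = 0 (no face detected) suppresses the instruction confidence regardless of the audio score.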

[0060] Finally, a weighted fusion mechanism is used to combine the two, dynamically adjusting the weights of each modality to adapt to environmental changes, thereby generating a comprehensive command confidence score. This multimodal collaborative processing approach can accurately distinguish between real commands and environmental interference, avoiding misjudgments caused by relying solely on audio data.

[0061] The wake-up-free voice interaction method provided in this application determines the command confidence of audio data by fusing the audio confidence of audio data and the user behavior confidence of multimodal sensor data. This effectively reduces false wake-ups and missed wake-ups, lowers the computing power consumption of the device, and improves the naturalness of human-computer interaction and user experience.

[0062] In one example embodiment based on the method shown in Figure 1, step 102 determines that the audio data includes a potential instruction. In this embodiment, this step can be implemented through steps 301 and 302 shown in Figure 4.

[0063] Step 301: Extract instruction prompts related to potential instructions from the audio data and determine the matching degree between the instruction prompts and preset instruction prompts.

[0064] Step 302: If the confidence level of the instruction is higher than the confidence threshold and the matching degree is higher than the matching degree threshold, then it is determined that the audio data contains a potential instruction.

[0065] In this context, instruction prompts can be understood as keywords or phrases in the user's audio data at the current moment that are directly related to potential instructions. They can also be described as optional, very short attention prompts, and can be detected using rule-based keyword matching algorithms or lightweight natural language processing models. The aim is to focus on the core semantic elements of the user's intent, avoiding over-reliance on the complete audio data and thus reducing interference from background noise. For example, instruction prompts can include "hey," "hello," a snap of the fingers, "range hood," "all-in-one machine," and other custom words or phrases. Their function is not necessarily to wake up the device, but rather to serve as a strong trigger feature for initiating the cloud data upload mechanism.

[0066] Matching degree can be understood as the semantic similarity between the extracted instruction prompts and the standard instruction prompts in the preset instruction prompt library. It can be computed with a string edit distance algorithm or a word vector similarity model. The purpose is to verify whether the extracted instruction prompts conform to the semantic features of the potential instruction and to filter out irrelevant speech segments.

[0067] Dual threshold judgment can be understood as a joint decision-making process that combines two independent indicators: instruction confidence and matching degree. It can be implemented by setting a dynamically adjustable threshold range, with the aim of avoiding the limitations of a single indicator and ensuring the reliability of the judgment result.

[0068] Specifically, the processing unit accurately extracts instruction prompts from the user's audio data at the current moment to anchor the core elements of the user's intent. It then determines the matching degree between this instruction prompt and standard instruction prompts in a preset instruction prompt library to verify its semantic validity. When both instruction confidence and matching degree simultaneously meet their respective threshold conditions, it confirms that the user's audio data at the current moment contains a potential instruction, thus forming a dual verification mechanism. This dual verification mechanism ensures effective differentiation between genuine instructions and interfering speech in noisy environments. This collaborative operation optimizes the robustness of instruction judgment and achieves efficient resource allocation.

[0069] For example, if the confidence level of the instruction in the user's audio data at the current moment is greater than 0.3 and the matching degree between the instruction prompt and the preset instruction prompt is 0.8, then it can be determined that the user's audio data at the current moment includes a potential instruction.
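The dual-threshold judgment of steps 301 and 302 can be sketched as follows, using a string edit distance for the matching degree. The confidence threshold of 0.3 follows the example above; the matching degree threshold of 0.7, the normalization by string length, and all function names are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def matching_degree(prompt: str, preset_prompts: list) -> float:
    """Best normalized similarity (1.0 = identical) against the preset library."""
    best = 0.0
    for preset in preset_prompts:
        longest = max(len(prompt), len(preset)) or 1
        distance = edit_distance(prompt.lower(), preset.lower())
        best = max(best, 1.0 - distance / longest)
    return best


def has_potential_instruction(s_total: float, match: float,
                              conf_threshold: float = 0.3,
                              match_threshold: float = 0.7) -> bool:
    """Step 302: both indicators must exceed their own thresholds."""
    return s_total > conf_threshold and match > match_threshold
```

A word vector similarity model could replace `matching_degree` without changing the final joint check.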

[0070] Referring to the potential instruction generation process shown in Figure 5, if the user says "Hey, turn on the air conditioner" to the smart air conditioner at the current moment, the instruction confidence of "Hey, turn on the air conditioner" is high enough, and the instruction prompt "Hey" is present, then it is determined that the potential instruction "turn on the air conditioner" exists in "Hey, turn on the air conditioner".

[0071] The wake-free voice interaction method provided in this application improves the robustness and accuracy of determining whether audio data contains potential commands by judging whether the command confidence and the instruction prompts of the audio data simultaneously meet their corresponding threshold conditions. By introducing non-fixed attention prompts, it offers users a compromise, more controllable interaction method that is more natural than fixed wake words and more accurate than a purely wake-word-free approach. It avoids invalid cloud requests caused by inflated command confidence and ensures accurate recognition of real commands in low-confidence scenarios, thereby significantly improving the naturalness, flexibility, and response efficiency of voice interaction.

[0072] In one example embodiment based on the method shown in Figure 1, step 102 acquires historical audio data related to the audio data. In this embodiment, this step can be implemented through steps 401 and 402 shown in Figure 6.

[0073] Step 401: Determine the historical moment. The historical duration between the historical moment and the current moment corresponding to the audio data meets the duration threshold.

[0074] Step 402: Determine the audio data between the current time and the historical time in the circular buffer as historical audio data; the circular buffer is used to cache audio data in real time.

[0075] Historical moments can be understood as past moments that are a specific time interval away from the current moment corresponding to the audio data. They can be dynamically calculated and determined based on the system timestamp. The purpose is to accurately define the starting boundary of historical audio data and avoid blindly expanding the time range to introduce irrelevant audio segments.

[0076] The duration threshold can be understood as a preset time threshold used to determine whether the historical duration meets the requirements. Its purpose is to control the width of the historical audio data window and prevent the context from being missing due to the time window being too narrow.

[0077] A circular buffer can be understood as a first-in, first-out (FIFO) data caching mechanism. It can be implemented as a circular array or a fixed-size queue structure. Its purpose is to continuously and efficiently maintain the latest audio stream, reduce memory allocation overhead, and ensure data continuity.

[0078] Specifically, the processing unit first determines the historical time based on the current time and duration threshold corresponding to the audio data, thereby limiting the effective time window range of the historical audio data and ensuring that the acquired historical audio data is closely related to the semantic scenario of the potential instruction; for example, the audio data that goes back a period of time (such as 10 seconds or 30 seconds) from the current time can be used as the historical audio data, and the historical audio data can also be used as the preceding context audio data of the current audio data.

[0079] Subsequently, the processing unit utilizes the real-time caching characteristics of the circular buffer to directly extract continuous audio data segments between the current moment and historical moments. Since the circular buffer continuously updates the audio stream using a first-in-first-out mechanism, the device can instantly obtain the required historical audio data without additional processing of historical data storage logic, significantly reducing the real-time computing burden. At the same time, the coordinated design of the duration threshold and the circular buffer ensures that the range of historical audio data avoids interference from irrelevant information and prevents the loss of key context, thus building a highly relevant contextual foundation for cloud-based intent recognition.

[0080] For example, the audio acquisition module in the smart device acquires audio data in real time and writes it into a circular buffer in real time. The circular buffer retains the real-time audio data from the last 30 seconds. When audio data has been retained for more than 30 seconds, or when the retained audio data occupies 85% of the storage space, audio data is deleted from the circular buffer in real time. Deletion prioritizes the audio data that has been stored the longest, thereby ensuring that the buffer is not overloaded.
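Steps 401 and 402 together with the 30-second retention rule can be sketched with a double-ended queue. The class name `AudioRingBuffer`, the use of timestamped frames, and the method names are assumptions for illustration; the 85% storage rule is omitted to keep the sketch short.

```python
from collections import deque


class AudioRingBuffer:
    """Keeps only the most recent `retention_s` seconds of audio frames (FIFO eviction)."""

    def __init__(self, retention_s: float = 30.0):
        self.retention_s = retention_s
        self._frames = deque()  # (timestamp_seconds, frame_bytes), oldest first

    def write(self, timestamp: float, frame: bytes) -> None:
        self._frames.append((timestamp, frame))
        # Evict the oldest frames once they fall outside the retention window.
        while self._frames and timestamp - self._frames[0][0] > self.retention_s:
            self._frames.popleft()

    def history(self, now: float, duration_s: float) -> list:
        """Return frames between the historical moment (now - duration_s) and now."""
        start = now - duration_s
        return [frame for ts, frame in self._frames if start <= ts <= now]


# Usage: one frame per second for 41 seconds, then read back the last 10 seconds.
buf = AudioRingBuffer(retention_s=30.0)
for t in range(41):
    buf.write(float(t), bytes([t]))
recent = buf.history(now=40.0, duration_s=10.0)
```

Because eviction happens on every write, reading the history requires no separate storage management, which matches the low device-side overhead described above.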

[0081] The wake-up-free voice interaction method provided in this application embodiment effectively alleviates the computing burden on the device in real-time processing by obtaining historical audio data related to the audio data at the current moment from the circular buffer, and improves the semantic relevance between historical audio data and potential commands, thereby enhancing the accuracy of cloud intent recognition and the real-time performance of interactive response.

[0082] In one example embodiment based on the method shown in Figure 1, step 102 performs semantic understanding on the audio data, historical audio data, and multimodal sensor data to obtain the complete user intent. In this embodiment, this process can be implemented through steps 501 to 503 shown in Figure 7.

[0083] Step 501: Convert the audio data and historical audio data into text to obtain the first text information.

[0084] Step 502: Perform natural language description on the user behavior features extracted from the multimodal sensor data to obtain the second text information.

[0085] Step 503: Construct intent from the first text information and the second text information to obtain the complete user intent.

[0086] The first text information can be understood as a textual representation obtained by converting the user's audio data at the current moment and historical audio data into text. It can be implemented using automatic speech recognition technology or a speech-to-text model based on deep learning. The purpose is to convert the two parts of audio data into continuous and complete semantic text, preserving the coherence of the dialogue context.

[0087] The second text information can be understood as a textual description obtained by describing user behavior characteristics in natural language. It can be achieved by using feature extraction algorithms combined with predefined templates or lightweight text generation models. The purpose is to transform the user's non-voice behavior data (such as user gestures, facial orientation, user location, device status, etc.) into a text form that can be parsed by a natural language model.

[0088] Intent construction can be understood as the process of inferring a user's complete intent based on integrated textual information. It can be achieved using a semantic parsing engine or a pre-built intent prompt template. Its purpose is to integrate multi-source information to eliminate semantic ambiguity and restore the user's true needs.

[0089] Specifically, the cloud can use Automatic Speech Recognition (ASR) algorithms to merge and transcribe the current audio data (i.e., the user's audio data collected by the device at the current moment) and the historical audio data extracted from the data packet into complete text, thus obtaining the first text information. In this way, the current audio data is processed within the historical dialogue sequence, avoiding semantic gaps caused by isolated analysis.

[0090] At the same time, the cloud extracts user behavior features from multimodal sensor data and converts the extracted user behavior features into a simple natural language description to obtain second text information, so that the user behavior cues captured by the multimodal sensors can participate in the intent understanding process in a linguistic form; thereby completing the input preprocessing for current audio data and historical audio data.

[0091] Subsequently, the cloud fills the first and second text information into a pre-built intent prompt template to complete the intent construction. Through cross-validation and semantic association of the two-source text, the voice history and behavioral context are simultaneously integrated to form an intent inference that covers the complete interaction scenario, thereby obtaining the complete user intent.
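Steps 502 and 503 can be sketched as template filling. The template wording, the feature keys (`face_toward_device`, `lip_movement`, `screen_on`), and the function names are hypothetical; the application does not disclose the actual template.

```python
INTENT_PROMPT_TEMPLATE = (
    "Dialogue transcript (history followed by current utterance):\n"
    "{first_text}\n\n"
    "Observed user behavior and device state:\n"
    "{second_text}\n\n"
    "Describe the user's complete intent."
)


def describe_behavior(features: dict) -> str:
    """Step 502: render behavior features as short natural-language sentences."""
    parts = []
    if "face_toward_device" in features:
        facing = "is" if features["face_toward_device"] else "is not"
        parts.append(f"The user {facing} facing the device.")
    if "lip_movement" in features:
        moving = "are" if features["lip_movement"] else "are not"
        parts.append(f"The user's lips {moving} moving.")
    if "screen_on" in features:
        state = "on" if features["screen_on"] else "off"
        parts.append(f"The device screen is {state}.")
    return " ".join(parts) if parts else "No behavior information is available."


def build_intent_prompt(first_text: str, features: dict) -> str:
    """Step 503: fill both text sources into the template."""
    return INTENT_PROMPT_TEMPLATE.format(
        first_text=first_text, second_text=describe_behavior(features)
    )
```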

[0092] The wake-free voice interaction method provided in this application constructs intent by converting first text information from current and historical audio data (a context window) and second text information from multimodal sensor data in the cloud. Through cross-validation and semantic association of the two-source text, the method simultaneously integrates speech history and behavioral context to form intent inference covering the complete interaction scenario, achieving accurate intent recognition. This not only provides crucial contextual information for the adjudication and execution of subsequent effective commands, enabling human-like judgments, but also comprehensively assesses the risks of execution through multimodal sensor information such as images and voice, improving the security of command execution. This effectively solves the problem of fragmented user intent, accurately captures implicit needs in continuous interactions, and avoids command misjudgment due to missing context or unintegrated behavioral information, thereby significantly improving the naturalness of voice interaction and user experience.

[0093] In one example embodiment based on the method shown in Figure 1, step 102 determines whether a potential instruction is a valid instruction based on the complete user intent. In this embodiment, this process can be implemented through steps 601 and 602 shown in Figure 8.

[0094] Step 601: Use a large language model to identify and analyze the complete user intent to obtain the instruction type of the potential instruction.

[0095] Step 602: Determine whether a potential instruction is a valid instruction based on the instruction type and the instruction confidence level corresponding to the potential instruction.

[0096] Among them, the large language model can be understood as a natural language processing model based on deep learning. It can be implemented using a neural network model with the Transformer architecture, and its purpose is to perform deep semantic parsing of the complete user intent.

[0097] Instruction type can be understood as a semantic classification category of potential instructions, which can include types such as commands and mentions, with the purpose of distinguishing the essential intent expressed by the user.

[0098] Command confidence can be understood as the probability value that the user's audio data at the current moment contains a potential command, and its purpose is to provide a quantitative indicator of data reliability.

[0099] Specifically, the cloud utilizes a pre-trained Large Language Model (LLM) to identify and analyze the complete user intent that integrates current audio data, historical audio data, and multimodal sensor data. This accurately identifies the command type of the potential instruction, overcoming the limitations of relying solely on audio features. Based on this, a dual verification process combining command type and command confidence level is performed to accurately determine the validity of the potential command. That is, based on the currently determined command type and command confidence level, the cloud looks up the corresponding command probability in a preset mapping between command types, confidence levels, and probabilities. If a command probability is found and is higher than a probability threshold, the potential command is determined to be valid. If a command probability is found and is lower than the probability threshold, the potential command is determined to be invalid. If no command probability is found, the command type is considered abnormal, and the potential command is also determined to be invalid.
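The lookup described above can be sketched as follows. The application does not disclose the form of the preset mapping, so the banded table, every number in it, the probability threshold of 0.7, and the function names are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical preset mapping: command type -> (minimum confidence, command probability),
# ordered from the highest confidence band to the lowest. All numbers are illustrative.
PRESET_MAPPING = {
    "COMMAND": [(0.8, 0.95), (0.5, 0.75), (0.0, 0.40)],
    "MENTION": [(0.0, 0.05)],
}


def lookup_probability(command_type: str, confidence: float) -> Optional[float]:
    """Return the mapped command probability, or None if nothing is found."""
    bands = PRESET_MAPPING.get(command_type)
    if bands is None:
        return None
    for min_conf, probability in bands:
        if confidence >= min_conf:
            return probability
    return None


def is_valid_by_mapping(command_type: str, confidence: float,
                        prob_threshold: float = 0.7) -> bool:
    probability = lookup_probability(command_type, confidence)
    # No probability found means an abnormal type, which is treated as invalid.
    return probability is not None and probability > prob_threshold
```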

[0100] For example, when submitting the assembled complete user intent to an LLM for identification and analysis, the LLM can be asked to answer or output only one of the two instruction types: "Command" or "Mention," thereby completing the LLM's inference call.

[0101] The wake-free voice interaction method provided in this application embodiment effectively solves the problem of false triggering or missed triggering caused by relying solely on instruction confidence by combining the instruction type of the potential instruction and the instruction confidence corresponding to the potential instruction for dual verification. It significantly improves the accuracy and naturalness of wake-free voice interaction, provides key contextual information for subsequent adjudication and execution, and enables large language models to make human-like judgments.

[0102] In one example embodiment based on the method shown in Figure 8, step 602 determines whether a potential instruction is a valid instruction based on the instruction type and the instruction confidence level corresponding to the potential instruction. In this embodiment, this process can be implemented through steps 701 and 702 shown in Figure 9.

[0103] Step 701: If the instruction type is determined to be a command and the instruction confidence level is higher than the confidence level threshold, then the potential instruction is determined to be a valid instruction.

[0104] Step 702: If the instruction type is determined to be a mention and/or the instruction confidence is lower than the confidence threshold, then the potential instruction is determined to be an invalid instruction.

[0105] The instruction type can be understood as the category of intent expressed in the audio data uploaded by the smart device, which can be a command or a mention.

[0106] A command can be understood as an operation instruction issued by the user, while a mention can be understood as the user simply referencing the device name without any intention to operate. The purpose is to distinguish between the user's true operational intention and an unintentional mention.

[0107] Specifically, the cloud utilizes a pre-trained LLM (Large Language Model) to identify and analyze the complete user intent, determining the instruction type as either a "Command" or a "Mention." Then, it comprehensively assesses the instruction type and the corresponding instruction confidence level. When the instruction type is determined to be a command and the instruction confidence level is higher than a confidence threshold, the LLM confirms the potential instruction as valid. When the instruction type is determined to be a mention or the instruction confidence level is lower than the confidence threshold, the LLM confirms the potential instruction as invalid. This dual-judgment mechanism effectively distinguishes between the user's true operational intent and unintentional mentions, avoiding the limitations of single-dimensional judgment and ensuring that operations are only executed when the user explicitly issues a high-confidence command.

[0108] It can be understood that the core task of the LLM is to perform contextualized intent analysis; through carefully designed prompt engineering, the LLM is configured as a dedicated "intent adjudicator." The command is executed only when the LLM output is a command and the confidence level of the instruction uploaded from the device is higher than the confidence threshold. All other cases (such as the LLM output being a mention, an abnormal instruction type, or an instruction confidence level lower than the confidence threshold) are discarded by default without any response. For example, the LLM can understand that "turn on the air conditioner" in a user's statement "I just told Xiao Ai to turn on the air conditioner" is a mention, while saying "turn on the air conditioner" to a smart device is a command. In other words, the LLM can tell the essential difference between A saying "I want to turn on the air conditioner" and A telling B "I told the speaker I want to turn on the air conditioner."
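The decision in steps 701 and 702 reduces to a single conjunction, sketched below. The function name `adjudicate` and the default threshold of 0.5 are assumptions; the application does not give a concrete threshold for this stage.

```python
def adjudicate(instruction_type: str, instruction_confidence: float,
               confidence_threshold: float = 0.5) -> bool:
    """Return True only for a command whose confidence exceeds the threshold."""
    is_command = instruction_type.strip().upper() == "COMMAND"
    return is_command and instruction_confidence > confidence_threshold
```

Any other instruction type, including an unexpected LLM output, falls through to the invalid branch, so the default outcome is to take no action.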

[0109] Therefore, it can be understood that by "translating" current audio data, historical audio data, and multimodal sensor data into a language that LLM understands, and then allowing it to make judgments through a rigorous question-and-answer process, and finally combining the "fuse" of command confidence, safe and accurate command decisions can be achieved.

[0110] For example, when the complete user intent is "I just told Xiao Ai [turn on the air conditioner] and it turned on," and "turn on the air conditioner" is the potential instruction, the LLM inference process is roughly as follows:

1. Understanding quotation and meta-language: The LLM recognizes the verb "told," indicating that the following "turn on the air conditioner" is a quotation of a past statement by the user, a kind of meta-language (language about language), rather than a current request for a real action.
2. Analyzing time adverbs and context: The time adverb "just" clearly positions the action of telling Xiao Ai to "turn on the air conditioner" in the past, unrelated to "what needs to be done now."
3. Understanding syntactic structure: The entire sentence is a declarative sentence (stating something that happened in the past), not an imperative sentence (issuing a command).
4. Combining multimodal signals (enhancing confidence): Visual signals uploaded from the device may show that the user was not facing the device when saying this, but was talking to a friend (low S_visual value), which serves as an auxiliary feature, further supporting the LLM's "Mention" judgment.

[0111] As another example, when the text to be analyzed is simply "turn on the air conditioner," there is no context indicating that it is a quotation or reference, and its syntax is itself imperative. Meanwhile, the multimodal signal indicates that the user is looking at the device (the S_visual value is high). Therefore, the LLM outputs "Command."

[0112] If the cloud determines the potential instruction to be valid through the LLM, the cloud's instruction executor is invoked to perform the response operation and feed the result back to the smart device for presentation to the user. If the cloud determines the potential instruction to be invalid through the LLM, the entire intent recognition request is silently discarded, the cloud does not feed back any response to the smart device, and the smart device remains in its original state.

[0113] The code for the above reasoning process is as follows.

LLM intent adjudication service code:

    import openai

    # Note: extract_trigger_phrase, run_asr, execute_command,
    # send_response_to_device and log_event are helpers that this listing
    # calls but does not define.

    def llm_intention_arbitration_service(asr_text, multimodal_features):
        """
        LLM-based intent determination service.

        Parameters:
            asr_text: The complete context text after ASR conversion.
            multimodal_features: Multimodal feature dictionary uploaded by the device.
        Returns:
            intention: "COMMAND" or "MENTION"
            confidence: confidence score
        """
        # 1. Input preprocessing: locate the trigger phrase to be adjudicated in the
        #    ASR text (roughly located by timestamp or keyword matching).
        trigger_phrase = extract_trigger_phrase(asr_text)  # For example: "Turn on the air conditioner"
        full_context = asr_text

        # 2. Build the prompt template.
        system_prompt = """
        You are a smart device intent arbiter. You analyze the user's input text rigorously.
        The portion within quotation marks in the text is a phrase awaiting adjudication.
        Please determine whether this phrase is a direct instruction that needs to be
        executed (COMMAND), or merely content mentioned in the dialogue (MENTION).
        Please follow these rules when performing analysis:
        - Output COMMAND if the phrase is an action the user is requesting the device to perform.
        - If the phrase is the user recalling, quoting, exemplifying, discussing, or negating
          an instruction, rather than wanting to execute it now, output MENTION.
        - Your output must be a single word: COMMAND or MENTION.
        """
        user_prompt = f"""
        Please analyze the following text: "{full_context}"
        The phrase awaiting adjudication is: "{trigger_phrase}".
        """

        # 3. Call the LLM API (such as OpenAI GPT-4, Claude, or a locally deployed LLM).
        llm_response = call_llm_api(
            model="gpt-4o-mini",
            system_message=system_prompt,
            user_message=user_prompt,
            temperature=0.1,  # Low temperature for high output certainty; should not be adjusted arbitrarily.
        )

        # 4. Parse the LLM response.
        intention, confidence = parse_llm_response(llm_response)

        # 5. (Optional) Fuse multimodal features for final decision calibration.
        #    If the visual confidence is extremely low, the result can be rejected or its
        #    confidence reduced even if the LLM classifies the phrase as COMMAND.
        if intention == "COMMAND" and multimodal_features.get("visual_confidence", 0) < 0.2:
            intention = "MENTION"
            confidence = 0.5
        return intention, confidence

    # --- Auxiliary functions ---

    def call_llm_api(model, system_message, user_message, temperature):
        # Abstract interface for calling the large language model API.
        # Example: using OpenAI-style API calls.
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            temperature=temperature,
            max_tokens=10,  # Limit the output length; only one word is needed.
        )
        return response.choices[0].message.content.strip()

    def parse_llm_response(response):
        # Parse the text response of the LLM and extract the intent and confidence level
        # by simple keyword matching.
        if "COMMAND" in response:
            return "COMMAND", 0.9  # Intent and estimated confidence level.
        elif "MENTION" in response:
            return "MENTION", 0.9
        else:
            # If the LLM does not output as required, default to MENTION to keep the system safe.
            return "MENTION", 0.5

Main processing flow code:

    def main_processing_pipeline(uploaded_data_packet):
        # 1. Execute ASR.
        asr_text = run_asr(uploaded_data_packet.audio_context)

        # 2. Call the LLM adjudication service.
        intention, confidence = llm_intention_arbitration_service(
            asr_text, uploaded_data_packet.multimodal_features
        )

        # 3. Act on the ruling.
        if intention == "COMMAND" and confidence > 0.8:
            # Execute the command.
            execute_command(asr_text)
            send_response_to_device("Command executed")
        else:
            # Silently discard: no action is performed and the user is not made aware.
            log_event("Silent drop due to: ", intention)

[0118] The wake-free voice interaction method provided in this application uses a large language model in the cloud to understand the complete user intent and effectively distinguish between commands and mentions. This addresses the core pain points of existing solutions, greatly reduces the false trigger rate and missed trigger rate, makes voice interaction more natural and smooth, and improves the user experience.

[0119] Based on the method shown in Figure 9 above, in one example embodiment, when a potential instruction is determined to be a valid instruction in step 901, the method can further determine whether to ultimately execute the instruction based on a judgment of whether executing it is safe. Therefore, the wake-up-free voice interaction method provided in this application embodiment also includes an instruction execution judgment process. In this embodiment, this process can be implemented through steps 801 to 803 shown in Figure 10.

[0120] Step 801: If the potential instruction is determined to be a valid instruction, determine whether there is a security risk in executing the potential instruction based on the multimodal sensor data.

[0121] Step 802: If it is determined that there is a security risk in executing a potential instruction, then the potential instruction will not be executed, and the user will be notified of the instruction execution failure and the reason for the failure.

[0122] Step 803: If it is determined that there is no security risk in executing the potential instruction, then execute the potential instruction.

[0123] Specifically, if a potential instruction is deemed valid via LLM in the cloud, a safety assessment of the impending execution instruction can be further conducted by comprehensively analyzing multimodal sensor data. This can be achieved by using pre-set safety rules to comprehensively analyze the multimodal sensor data. For instance, if a visual sensor detects a user's hand approaching a dangerous area, or an inertial sensor detects that the device is in an unstable state with violent shaking, it can be determined that there is a safety risk in executing the potential instruction.

[0124] If the cloud determines, through LLM (large language model) adjudication, that executing a potential instruction poses a security risk, it proactively suspends execution of that instruction and informs the user of the failure and the specific reason through appropriate feedback. This feedback may include voice prompts (e.g., "Your hand has been detected near a dangerous area; for safety reasons, the instruction has been cancelled."), screen displays (e.g., "Safety warning: The current environment is not suitable for performing this operation; the instruction has been paused."), vibration alerts, or a combination of these methods. This application does not specifically limit the form of feedback.

[0125] If the cloud determines through LLM adjudication that there is no security risk in executing a potential instruction, then it drives the corresponding smart device or software function to operate according to the semantic content of the potential instruction; for example, it may do so by sending a control signal to the device's execution module to complete the action corresponding to the potential instruction, or by calling the application programming interface to execute the software-level operation.
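Steps 801 to 803 can be sketched as a rule-based gate placed in front of the executor. The following is a minimal illustration only, not the application's implementation: the sensor fields (`hand_near_hazard`, `shake_level`), the `SHAKE_LIMIT` value, and the function names are all hypothetical, chosen to mirror the two example rules in paragraph [0123].

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SensorSnapshot:
    """Hypothetical multimodal sensor snapshot; field names are assumptions."""
    hand_near_hazard: bool = False  # e.g. derived from a visual sensor
    shake_level: float = 0.0        # e.g. derived from an inertial sensor, 0..1


SHAKE_LIMIT = 0.7  # assumed threshold for "violent shaking"


def assess_risk(snapshot: SensorSnapshot) -> Optional[str]:
    """Step 801: return a failure reason if a pre-set safety rule fires, else None."""
    if snapshot.hand_near_hazard:
        return "hand detected near a dangerous area"
    if snapshot.shake_level > SHAKE_LIMIT:
        return "device is unstable"
    return None


def gate_and_execute(command: str, snapshot: SensorSnapshot,
                     execute: Callable[[str], None]) -> Tuple[bool, str]:
    """Steps 802 and 803: block and explain on risk, otherwise execute."""
    reason = assess_risk(snapshot)
    if reason is not None:
        # Step 802: do not execute; report the failure and its reason.
        return False, f"Instruction cancelled: {reason}"
    # Step 803: no risk found, so run the instruction.
    execute(command)
    return True, "Instruction executed"
```

Returning the reason as text keeps the feedback channel (voice, screen, vibration) independent of the rule evaluation, which matches the open-ended feedback options described above.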

[0126] For example, if the cloud determines a potential instruction to be valid through LLM, but the analysis of multimodal sensor data reveals other reasons, such as security risks, that prevent the instruction from being executed, the cloud will still provide feedback to the user, such as a voice prompt indicating execution failure and explaining the reason for the failure. For instance, the feedback might be that the action of opening the door of an all-in-one machine could potentially cause burns, therefore it was not executed.

[0127] If the cloud uses LLM to identify from complete user intent that the user is merely mentioning a function rather than issuing a command, the potential command can be ruled invalid. For example, a user tells a family member, "Turn the range hood to the highest setting later." Although it includes "highest setting," it's a casual remark, therefore an invalid command, and the kitchen appliance will not respond at all.

[0128] If the cloud uses the LLM to identify, from the complete user intent, a potential instruction that is clear, direct, and free of security risks, then the potential instruction is deemed valid. For example, a user might say to the oven, "Preheat to 180 degrees," or "Add 5 minutes to the steam oven." In this case, the kitchen appliance executes the command smoothly and immediately, providing a confirmation tone.

[0129] If the cloud-based system uses LLM to identify a potential, valid instruction from the complete user intent, and detects a safety hazard or state conflict, it will provide relevant feedback to the user and execute the instruction only after user confirmation. For example, if a user says "Open the oven door" to a running oven, the appliance will not immediately open the door but will instead ask via voice, "The oven temperature is very high, are you sure you want to open the door?" The appliance will then execute the instruction after user confirmation. Another example is if a young child says "Open the oven door" (potentially posing a burn risk). In this case, the appliance will directly refuse to execute the instruction and respond, "This function is locked." This allows the appliance to not only "understand commands" but also "judge the situation," becoming a safety-conscious smart kitchen manager.

[0130] The above-described execution and feedback process for invalid instructions, valid instructions, and valid instructions with security risks is illustrated in Figure 11. As shown in Figure 11, the cloud receives the kitchen appliance instruction, which is the potential instruction in the complete user intent. It then determines whether the potential instruction is valid; if it is valid, whether executing it poses a security risk; and, if there is a security risk, whether to execute it: the instruction is executed once the user confirms, or cancelled if the user cancels. The process is as shown in Figure 11, and further details are not elaborated here.
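The branching described for Figure 11 can be summarized as a small decision function. This is a sketch under assumptions, not the flow chart itself: the `Outcome` names, the `locked` flag (standing in for the "This function is locked" case in paragraph [0129]), and both function names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    IGNORE = "ignore"            # invalid instruction: no response at all
    EXECUTE = "execute"          # valid and no security risk
    ASK_CONFIRM = "ask_confirm"  # valid, but a hazard or state conflict was detected
    REFUSE = "refuse"            # valid, but the function is locked for this user
    CANCEL = "cancel"            # user declined after being asked to confirm


def decide(is_valid: bool, has_risk: bool, locked: bool = False) -> Outcome:
    """First pass: map the adjudication results to the next action."""
    if not is_valid:
        return Outcome.IGNORE
    if locked:
        return Outcome.REFUSE
    if has_risk:
        return Outcome.ASK_CONFIRM
    return Outcome.EXECUTE


def resolve_confirmation(user_confirmed: bool) -> Outcome:
    """Second pass, only reached after ASK_CONFIRM."""
    return Outcome.EXECUTE if user_confirmed else Outcome.CANCEL
```

Under these assumptions, "Open the oven door" spoken to a running oven would yield `decide(True, True)`, that is `ASK_CONFIRM`, and the door would open only if `resolve_confirmation(True)` follows.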

[0131] The wake-free voice interaction method provided in this application, while ensuring the validity of user commands, further introduces a security risk assessment mechanism based on multimodal sensor data. This effectively avoids accidental injury or device damage that may result from executing commands in unsafe scenarios, significantly improving the security and reliability of human-computer interaction in smart devices. Simultaneously, clear feedback on failure reasons enhances user understanding and trust in system decisions, optimizing the overall user experience.

[0132] Based on the method shown in Figure 3 above, in one example embodiment, step 202 determines the audio confidence level based on acoustic and semantic features. In this embodiment, the specific process of this step can be implemented through steps 901 to 903 shown in Figure 12.

[0133] Step 901: Identify and classify the acoustic features to obtain the acoustic confidence level used to characterize the reliability of the classification results.

[0134] Step 902: Determine the similarity between the semantic features and each instruction keyword in the instruction keyword library, and determine the maximum similarity among multiple similarities as the semantic confidence score. The semantic confidence score represents the degree of certainty that there is an audio signal in the audio data that best matches the user's operation intention.

[0135] Step 903: Perform weighted fusion of acoustic confidence and semantic confidence to obtain audio confidence.

[0136] Specifically, the processing unit extracts acoustic features such as fundamental frequency, energy, spectrum, pitch, speech rate, and volume changes from the user's audio data at the current moment. These extracted acoustic features are then input into a pre-trained lightweight acoustic model (such as a TinyML model), which identifies and classifies the acoustic features and outputs an acoustic confidence score S_acoustic between 0 and 1. This lightweight acoustic model is trained to distinguish between "imperative intonation" (such as affirmative, short phrases) and "normal conversational intonation."

[0137] Meanwhile, the processing unit extracts instruction keywords and other semantic features such as speech recognition text from the user's audio data at the current moment, and quickly matches the extracted semantic features with a dynamically updated instruction keyword library (which includes instruction keywords such as open, stop, and close). The matching process can use matching algorithms such as cosine similarity to calculate the similarity between the extracted semantic features and each instruction keyword in the instruction keyword library, and select the maximum similarity from all calculated similarities as the semantic confidence S_keyword.
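As an illustration of this matching step, the sketch below computes S_keyword as the maximum cosine similarity between a feature vector and each entry of a keyword library. It assumes that the semantic features and the keywords are already represented as numeric vectors; the three-dimensional toy vectors, the function names, and the clamping of negative similarities to 0 are assumptions made for the example, not details given in this application.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_confidence(feature: Sequence[float],
                        keyword_library: Dict[str, Sequence[float]]) -> float:
    """S_keyword: the largest similarity between the feature and any library keyword."""
    if not keyword_library:
        return 0.0
    best = max(cosine_similarity(feature, vec) for vec in keyword_library.values())
    # Assumption: clamp negative similarities so the score stays in [0, 1].
    return max(0.0, best)


# Toy keyword library with made-up vectors for "open" and "stop".
library = {"open": [1.0, 0.0, 0.0], "stop": [0.0, 1.0, 0.0]}
s_keyword = semantic_confidence([1.0, 1.0, 0.0], library)
```

Taking the maximum rather than the mean means a single strong keyword match is enough to raise S_keyword, regardless of how many unrelated keywords the library contains.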

[0138] At this point, the acoustic confidence score S_acoustic and the semantic confidence score S_keyword are weighted and fused to obtain the audio confidence score S_audio of the user's audio data at the current moment, where S_audio = 0.3 × S_acoustic + 0.7 × S_keyword. From this formula, it can be understood that audio confidence depends more on explicit instruction keywords and speech recognition text.
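A short worked example makes the effect of the 0.3 and 0.7 weights concrete. The function below is a minimal sketch of the fusion; the function name and the range check on the inputs are assumptions added for illustration.

```python
W_ACOUSTIC = 0.3  # weight of S_acoustic, as given in the formula
W_KEYWORD = 0.7   # weight of S_keyword, as given in the formula


def audio_confidence(s_acoustic: float, s_keyword: float) -> float:
    """S_audio as the weighted sum of the acoustic and semantic confidence scores."""
    for name, value in (("s_acoustic", s_acoustic), ("s_keyword", s_keyword)):
        # Assumed sanity check: both scores are expected to lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return W_ACOUSTIC * s_acoustic + W_KEYWORD * s_keyword


# Strong imperative intonation but a weak keyword match.
tone_only = audio_confidence(0.9, 0.2)
# Flat intonation but a strong keyword match.
keyword_only = audio_confidence(0.2, 0.9)
```

Here `tone_only` is about 0.41 and `keyword_only` is about 0.69, so a clear keyword match with flat intonation scores higher than imperative intonation with no clear keyword.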

[0139] The wake-free voice interaction method provided in this application, by performing fine matching between semantic features and a command keyword library and selecting the maximum similarity as semantic confidence, can more accurately capture user intent. Even when users express themselves in diverse ways, it can effectively identify the core command semantics and reduce the ambiguity of semantic understanding. Furthermore, by weighted fusion of acoustic confidence and semantic confidence, it can dynamically adjust according to the reliability and importance of different information sources, thereby obtaining a more comprehensive, stable, and accurate audio confidence. This significantly improves the accuracy and reliability of the entire wake-free voice interaction system in recognizing potential commands, reduces the occurrence of false wake-ups and missed wake-ups, and thus improves the user experience, making human-computer interaction more natural and efficient.

[0140] For example, refer to the flowchart of the wake-up-free voice interaction method shown in Figure 13. As shown in Figure 13, "user and environment" can be understood as the audio acquisition module and multimodal sensors on the device side operating continuously to acquire the user's audio data and multimodal sensing data at the current moment. The smart device, as the device side, has a processing unit that includes an intelligent perception layer, which covers audio acquisition and circular buffering, multimodal sensing data perception, and local lightweight fusion decision-making. Audio acquisition and circular buffering can be understood as writing the acquired audio data into a circular buffer in real time; multimodal sensing data can be understood as visual image information, device status information, human body sensing information, etc.; and local lightweight fusion decision-making can be understood as determining the command confidence level of the audio data based on the user's audio data and multimodal sensing data at the current moment, and, when the command confidence level exceeds the confidence level threshold, generating an intent recognition request carrying a data packet and sending it to the cloud. Subsequently, the cloud-based decision-making and execution layer responds to the intent recognition request, performs semantic understanding on the current audio data, historical audio data, and multimodal sensor data in the data packet to obtain the complete user intent, and determines whether the potential instruction is a valid instruction based on the complete user intent. Depending on whether it is determined to be an invalid instruction, a valid instruction, or a valid instruction with security risks, the corresponding execution and feedback are performed. For the specific processes involved, refer to the aforementioned embodiments; they are not repeated here.
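The device-side part of this flow (circular buffering followed by a thresholded upload) can be sketched as follows. This is an illustrative sketch only: the class and function names, the `CONFIDENCE_THRESHOLD` value, and the 10-second default window are assumptions, while the packet keys `audio_context` and `multimodal_features` reuse the names from the cloud-side code listing.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class AudioRingBuffer:
    """Fixed-capacity buffer of (timestamp_s, frame) pairs; the oldest frames are overwritten."""

    def __init__(self, capacity_frames: int) -> None:
        self._frames: Deque[Tuple[float, bytes]] = deque(maxlen=capacity_frames)

    def write(self, timestamp_s: float, frame: bytes) -> None:
        """Append a frame in real time; deque(maxlen=...) drops the oldest when full."""
        self._frames.append((timestamp_s, frame))

    def history(self, now_s: float, window_s: float) -> List[bytes]:
        """Frames captured between now_s - window_s and now_s, oldest first."""
        start = now_s - window_s
        return [frame for t, frame in self._frames if start <= t <= now_s]


CONFIDENCE_THRESHOLD = 0.6  # assumed value; the threshold is not specified here


def maybe_build_request(buf: AudioRingBuffer, now_s: float,
                        instruction_confidence: float,
                        sensor_data: Dict[str, float],
                        window_s: float = 10.0) -> Optional[dict]:
    """Build the intent recognition packet only when the local confidence is high enough."""
    if instruction_confidence <= CONFIDENCE_THRESHOLD:
        return None  # nothing is uploaded; the device keeps listening
    return {
        "audio_context": buf.history(now_s, window_s),
        "multimodal_features": sensor_data,
    }
```

Because the buffer has a fixed capacity, memory use on the device stays bounded no matter how long it listens, and only the recent window leaves the device when a potential instruction is detected.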

[0141] It should be noted that although the operations of the method of this application are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flowchart can be performed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.

[0142] Referring to Figure 14, this application also provides a smart device, including an audio acquisition module, a multimodal sensor, and a processing unit. The audio acquisition module is used to acquire user audio data in real time and send the audio data to the processing unit. The multimodal sensor is used to acquire multimodal sensing data in real time and send the multimodal sensing data to the processing unit. The processing unit is configured to execute the wake-free voice interaction method described in the foregoing embodiments to determine whether the audio data includes a potential instruction based on the audio data and multimodal sensor data, and to send an intent recognition request to the cloud if the audio data includes a potential instruction. The intent recognition request is used to instruct the cloud to adjudicate whether the potential instruction is a valid instruction.

[0143] It should be noted that the process by which the processing unit determines that the audio data contains potential instructions, and the process by which the cloud determines whether a potential instruction is a valid instruction, can be referred to the aforementioned method embodiments. They will not be repeated here.

[0144] Referring to Figure 15, this application also provides a wake-up-free voice interaction system, including a smart device and a cloud server, wherein: The smart device is used to acquire the user's audio data and multimodal sensor data, and determine the instruction confidence of the audio data based on the audio data and multimodal sensor data. The instruction confidence is used to characterize the possibility that the audio data includes potential instructions. If it is determined that the audio data includes potential instructions, the device acquires historical audio data related to the audio data, generates an intent recognition request, and sends the intent recognition request to the cloud server. The intent recognition request includes the audio data, the historical audio data, and the multimodal sensor data. The cloud server is used to perform semantic understanding on the audio data, historical audio data, and multimodal sensor data to obtain the complete user intent, and to determine whether the potential instructions are valid based on the complete user intent.

[0145] It should be noted that the process by which the smart device determines that the audio data contains potential instructions, and the process by which the cloud server determines whether the potential instructions are valid, can be referred to the aforementioned method embodiments. They will not be repeated here.

[0146] Figure 16 shows a schematic diagram of a hardware architecture suitable for implementing embodiments of this application. As shown in Figure 16, the electronic device specifically includes a processor 1001, a communication interface 1002, a memory 1003, and a communication bus 1004. The processor 1001, communication interface 1002, and memory 1003 communicate with each other via the communication bus 1004. The memory 1003 stores programs that can be executed by the processor 1001. The processor 1001 executes the programs stored in the memory 1003 to implement the steps of the method shown in Figure 1.

[0147] The communication bus 1004 mentioned in the above electronic device can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus 1004 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is represented in Figure 16 by a single thick line, but this does not imply that there is only one bus or one type of bus. Communication interface 1002 is used for communication between the aforementioned electronic device and other devices. Memory 1003 may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor 1001. The aforementioned processor 1001 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0148] In another embodiment of this application, a computer-readable storage medium is provided, which stores a computer program that, when run on a computer, causes the computer to perform the steps of the method shown in Figure 1 as described in the above embodiments.

[0149] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. This computer program product includes one or more computer instructions. When these computer instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, microwave, etc.) means. The computer-readable storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape, etc.), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive), etc.

[0150] The above description is merely a preferred embodiment of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the inventive concept. Examples include technical solutions formed by substituting the above features with (but not limited to) technical features with similar functions disclosed in this application.

Claims

1. A wake-up-free voice interaction method, characterized in that the method includes: The system acquires user audio data and multimodal sensor data, and determines the instruction confidence level of the audio data based on the audio data and the multimodal sensor data. The instruction confidence level is used to characterize the possibility that the audio data includes potential instructions. If it is determined that the audio data includes the potential instruction, then historical audio data related to the audio data is obtained, and an intent recognition request is sent to the cloud. The intent recognition request includes the audio data, the historical audio data, and the multimodal sensor data, to instruct the cloud to perform semantic understanding on the audio data, the historical audio data, and the multimodal sensor data to obtain the complete user intent, and to determine whether the potential instruction is a valid instruction based on the complete user intent.

2. The method according to claim 1, characterized in that, Determining the command confidence level of the audio data based on the audio data and the multimodal sensing data includes: Acoustic and semantic features are extracted from the audio data, and user behavior features are extracted from the multimodal sensor data; The audio confidence level is determined based on the acoustic features and the semantic features, and the audio confidence level is used to characterize the probability that the audio data includes the audio signal corresponding to the potential instruction; The user behavior confidence level is determined based on the user behavior characteristics, and the user behavior confidence level is used to characterize the probability that the multimodal sensing data includes the user behavior corresponding to the potential instruction; The instruction confidence is obtained by weighted fusion of the audio confidence and the user behavior confidence.

3. The method according to claim 2, characterized in that, The determination that the audio data includes the potential instruction includes: Extract instruction prompts related to the potential instructions from the audio data, and determine the matching degree between the instruction prompts and preset instruction prompts; If the confidence level of the instruction is higher than the confidence level threshold and the matching degree is higher than the matching degree threshold, then it is determined that the audio data includes the potential instruction.

4. The method according to any one of claims 1 to 3, characterized in that, The acquisition of historical audio data related to the audio data includes: Determine a historical moment, wherein the historical duration between the historical moment and the current moment corresponding to the audio data satisfies a duration threshold; The audio data between the current time and the historical time in the circular buffer is determined as the historical audio data; the circular buffer is used to cache audio data in real time.

5. The method according to any one of claims 1 to 3, characterized in that, The step of performing semantic understanding on the audio data, the historical audio data, and the multimodal sensor data to obtain the complete user intent includes: The audio data and the historical audio data are converted into text to obtain the first text information; The user behavior features extracted from the multimodal sensing data are described in natural language to obtain the second text information; The first text information and the second text information are used to construct the intent to obtain the complete user intent.

6. The method according to any one of claims 1 to 3, characterized in that, The step of determining whether a potential instruction is a valid instruction based on the complete user intent includes: The complete user intent is identified and analyzed using a large language model to obtain the instruction type of the potential instruction; Based on the instruction type and the instruction confidence level corresponding to the potential instruction, determine whether the potential instruction is a valid instruction.

7. The method according to claim 6, characterized in that, The step of determining whether a potential instruction is a valid instruction based on the instruction type and the instruction confidence level corresponding to the potential instruction includes: If the instruction type is determined to be a command and the instruction confidence level is higher than the confidence level threshold, then the potential instruction is determined to be the valid instruction. If the instruction type is determined to be a mention and / or the instruction confidence level is lower than the confidence level threshold, then the potential instruction is determined to be an invalid instruction.

8. The method according to claim 7, characterized in that, The method further includes: If the potential instruction is determined to be a valid instruction, the multimodal sensing data is used to determine whether there is a security risk in executing the potential instruction. If it is determined that there is a security risk in executing the potential instruction, then the potential instruction will not be executed, and the user will be informed of the instruction execution failure and the reason for the failure. If it is determined that there is no security risk in executing the potential instruction, then the potential instruction is executed.

9. The method according to claim 2, characterized in that, Determining audio confidence based on the acoustic features and the semantic features includes: The acoustic features are identified and classified to obtain acoustic confidence scores that characterize the reliability of the classification results; The similarity between the semantic features and each instruction keyword in the instruction keyword library is determined, and the maximum similarity among the multiple similarities is determined as the semantic confidence score. The semantic confidence score represents the degree of certainty that there is an audio signal in the audio data that best matches the user's operation intention. The acoustic confidence and the semantic confidence are weighted and fused to obtain the audio confidence.

10. A computer program product, characterized in that, The computer program product includes instructions that, when executed, cause the method as described in any one of claims 1-9 to be implemented.