A voice instruction processing method, device, equipment and medium

By extracting and enhancing speech features with a pre-trained sound source separation model and a lightweight Transformer encoder, and by combining device and environmental information, the method addresses insufficient command acquisition accuracy and achieves high-accuracy command response and multi-device collaboration in noisy environments.

CN122201296A · Pending · Publication Date: 2026-06-12 · MALANSHAN AUDIO & VIDEO LABORATORY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
MALANSHAN AUDIO & VIDEO LABORATORY
Filing Date
2026-03-27
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, instruction processing relies on a combination of microphone and main chip, and also on cloud computing, which leads to insufficient instruction acquisition accuracy and reliance on fixed command words for response, resulting in insufficient accuracy of instruction response.

Method used

A pre-trained sound source separation model is used to extract target human voice commands. A lightweight Transformer encoder is used for speech feature extraction and semantic enhancement. Combined with device and environmental information, the optimal execution device and operation are determined through arbitration scoring to achieve accurate matching and execution of commands.

Benefits of technology

It improves the accuracy and robustness of command response, can accurately extract the target user's voice signal in noisy environments, achieves seamless collaborative response between multiple devices, and enhances the accuracy of voice interaction and user experience.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a voice instruction processing method, apparatus, device, and medium, relating to the field of computer technology. The method is applied to a target terminal and includes: processing original voice information with a pre-trained sound source separation model to extract a target human voice instruction, and performing feature extraction on the target human voice instruction to obtain a target voice feature sequence; processing, with a deep bidirectional pre-trained language model, target voice text information obtained from the target voice feature sequence together with auxiliary information to obtain a target voice instruction intent, where the auxiliary information includes device information and environment information; determining a comprehensive arbitration score for each intent behavior determined from the target voice instruction intent, where each intent behavior includes an intent device and an intent operation; and taking the intent device and the intent operation with the highest comprehensive arbitration score as the optimal execution device and the optimal execution operation to determine a target execution instruction, and sending the target execution instruction to the optimal execution device for execution. This can improve the accuracy of the instruction response.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a voice command processing method, apparatus, device, and medium.

Background Technology

[0002] Currently, instruction processing is mainly accomplished through a combination of "microphone + main chip," relying on a cloud-based processing engine. However, the instruction processing suffers from drawbacks such as insufficient instruction acquisition accuracy, reliance on fixed command words for response, and the existence of multiple executable devices, resulting in insufficient accuracy in instruction response.

[0003] In summary, improving the accuracy of command response is an urgent problem to be solved.

Summary of the Invention

[0004] In view of this, the purpose of the present invention is to provide a voice command processing method, apparatus, device, and medium that can improve the accuracy of command response, and the specific solution is as follows:

[0005] In a first aspect, this application discloses a voice command processing method, applied to a target terminal, comprising:

[0006] The raw speech information is processed using a pre-trained sound source separation model to extract the target human voice command, and the target human voice command is then subjected to feature extraction to obtain the target speech feature sequence.

[0007] The target speech feature sequence is converted into target speech text information, and the target speech text information and auxiliary information are processed by a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech command intent; the auxiliary information includes device information and environmental information.

[0008] Based on the target voice command intent, each intent behavior is determined, and a comprehensive arbitration score is determined for each intent behavior; the intent behavior includes intent device and intent operation;

[0009] The intent device and intent operation corresponding to the highest comprehensive arbitration score are respectively designated as the optimal execution device and optimal execution operation. Based on the optimal execution device and optimal execution operation, a target execution instruction matching the target voice instruction intent is determined, and the target execution instruction is sent to the optimal execution device so that the optimal execution device executes the target execution instruction.

[0010] Optionally, the step of extracting features from the target human voice command to obtain a target speech feature sequence includes:

[0011] The target human voice command is subjected to feature extraction to obtain an initial speech feature sequence;

[0012] A semantic enhancement network based on a lightweight Transformer bidirectional encoder is used to assign corresponding feature weights to the speech features of each speech frame in the initial speech feature sequence to obtain the target speech feature sequence; wherein, the degree of correlation between the speech features corresponding to the speech frame and the core content of the instruction is positively linearly correlated with the weight of the corresponding speech frame.

[0013] Optionally, the step of extracting features from the target human voice command to obtain an initial speech feature sequence includes:

[0014] The initial speech feature sequence is obtained by extracting features from the target human voice command using Mel frequency cepstral coefficients and in combination with preprocessing operations; the preprocessing operations include DC removal, pre-emphasis, framing, and windowing.

[0015] Optionally, before processing the raw speech information using a pre-trained sound source separation model to extract the target human voice command, the method further includes:

[0016] The pre-trained sound source separation model is obtained by training a time-domain audio separation network using a training dataset. The training dataset includes data from simulated multi-sound source scenarios and real-recorded multi-sound source scenarios. The simulated multi-sound source scenario data includes scenarios with different noise types, different signal-to-noise ratio ranges, and different sound source distances. The noise types include environmental noise, equipment noise, and other human voice interference.

[0017] Optionally, determining the comprehensive arbitration score corresponding to each of the intended behaviors includes:

[0018] Determine the initial arbitration score of the intended behavior under each preset arbitration dimension, and determine the comprehensive arbitration score of each intended behavior based on the weighted algorithm and the initial arbitration score.

[0019] Optionally, the preset arbitration dimensions include semantic association degree dimension, state matching degree dimension, physical proximity degree dimension, and user preference coefficient dimension; semantic association degree is the degree of association between the intent device and the intent operation; state matching degree is the degree of matching between the current state of the intent device and the intent operation; physical proximity degree is the quantified value of the distance between the user and the intent device; and user preference coefficient is the quantified value of the user's historical operation of the intent device.
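For illustration only, the weighted arbitration described above can be sketched as a weighted sum over the four preset dimensions followed by selection of the highest-scoring intent behavior. The weight values, the [0, 1] score range, the device names, and the helper names `comprehensive_score` and `arbitrate` are assumptions made for this sketch; the application does not specify them.

```python
from dataclasses import dataclass

# Hypothetical weights for the four arbitration dimensions (not given in the source).
WEIGHTS = {
    "semantic": 0.4,     # semantic association degree
    "state": 0.3,        # state matching degree
    "proximity": 0.2,    # physical proximity degree
    "preference": 0.1,   # user preference coefficient
}


@dataclass
class IntentBehavior:
    device: str
    operation: str
    scores: dict  # initial arbitration score per dimension, assumed to lie in [0, 1]


def comprehensive_score(behavior, weights=WEIGHTS):
    """Weighted sum of the initial arbitration scores of one intent behavior."""
    return sum(weights[dim] * behavior.scores[dim] for dim in weights)


def arbitrate(behaviors, weights=WEIGHTS):
    """Return the intent behavior with the highest comprehensive arbitration score."""
    return max(behaviors, key=lambda b: comprehensive_score(b, weights))


# Made-up candidate behaviors for a command such as "too bright".
candidates = [
    IntentBehavior("living_room_main_light", "dim",
                   {"semantic": 0.9, "state": 0.8, "proximity": 0.7, "preference": 0.6}),
    IntentBehavior("balcony_light", "turn_off",
                   {"semantic": 0.6, "state": 0.9, "proximity": 0.2, "preference": 0.3}),
]
best = arbitrate(candidates)
```

With these made-up numbers the living-room light scores 0.80 and the balcony light 0.58, so the first candidate would be chosen as the optimal execution device and operation.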

[0020] Optionally, converting the target speech feature sequence into target speech text information includes:

[0021] The ultra-lightweight convolutional neural network is used to query the speech features corresponding to the specified speech command in the target speech feature sequence;

[0022] If a voice feature corresponding to the specified voice command exists, the specified voice command is sent to the corresponding device to execute the specified voice command, and the target voice feature sequence is converted into initial voice text information through a streaming recurrent neural network model, and the specified voice command and the initial voice text information are fused into target voice text information;

[0023] If no speech feature corresponds to the specified speech instruction, the target speech feature sequence is converted into initial speech text information through a streaming recurrent neural network model, and the initial speech text information is used as the target speech text information.

[0024] In a second aspect, this application discloses a voice command processing apparatus, applied to a target terminal, comprising:

[0025] The instruction extraction module is used to process the original speech information using a pre-trained sound source separation model to extract the target human voice instruction, and to perform feature extraction on the target human voice instruction to obtain the target speech feature sequence.

[0026] The instruction intent acquisition module is used to convert the target speech feature sequence into target speech text information, and process the target speech text information and auxiliary information through a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech instruction intent; the auxiliary information includes device information and environmental information.

[0027] An arbitration module is used to determine each intent behavior based on the target voice command intent, and to determine a comprehensive arbitration score corresponding to each intent behavior; the intent behavior includes intent device and intent operation;

[0028] The optimal instruction determination module is used to select the intent device and intent operation corresponding to the highest comprehensive arbitration score as the optimal execution device and optimal execution operation, respectively, and determine the target execution instruction that matches the target voice instruction intent based on the optimal execution device and optimal execution operation, and send the target execution instruction to the optimal execution device so that the optimal execution device executes the target execution instruction.

[0029] In a third aspect, this application discloses an electronic device, including:

[0030] Memory, used to store computer programs;

[0031] A processor is used to execute the computer program to implement the aforementioned disclosed voice command processing method.

[0032] In a fourth aspect, this application discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned voice command processing method.

[0033] As can be seen, this application utilizes a pre-trained sound source separation model to process raw speech information to extract target human voice commands, and performs feature extraction on the target human voice commands to obtain target speech feature sequences; the target speech feature sequences are converted into target speech text information, and the target speech text information and auxiliary information are processed by a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech command intent; the auxiliary information includes device information and environmental information; based on the target speech command intent, each intent behavior is determined, and a comprehensive arbitration score corresponding to each intent behavior is determined; the intent behavior includes intent device and intent operation; the intent device and intent operation corresponding to the highest comprehensive arbitration score are respectively taken as the optimal execution device and optimal execution operation, and a target execution command matching the target speech command intent is determined based on the optimal execution device and optimal execution operation, and the target execution command is sent to the optimal execution device so that the optimal execution device executes the target execution command. Therefore, this application's pre-trained sound source separation model extracts target human voice commands, improving command acquisition accuracy and further enhancing command response accuracy. This application does not rely on fixed command words, but instead utilizes a deep bidirectional pre-trained language model based on a Transformer encoder to obtain command intent, possessing semantic intent reasoning capabilities. This accurately transforms ambiguous spoken expressions into clear device operation commands, achieving a technological leap from speech recognition to speech understanding in voice interaction. Furthermore, intent reasoning considers not only target speech text information but also auxiliary information, fully taking into account real-world factors such as device and environmental information, improving speech understanding accuracy and further enhancing command response accuracy. This application uses an arbitration method to calculate the optimal execution device and optimal execution operation, further improving command response accuracy.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0035] Figure 1 is a flowchart of a voice command processing method disclosed in this application;

[0036] Figure 2 is a schematic diagram of the structure of a voice command processing device disclosed in this application;

[0037] Figure 3 is a structural diagram of an electronic device disclosed in this application.

Detailed Implementation

[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0039] Currently, instruction processing is mainly accomplished through a combination of "microphone + main chip," relying on a cloud-based processing engine. However, the instruction processing suffers from drawbacks such as insufficient instruction acquisition accuracy, reliance on fixed command words for response, and the existence of multiple executable devices, resulting in insufficient accuracy in instruction response.

[0040] Therefore, this application proposes a voice command processing scheme that can improve the accuracy of command response.

[0041] This application discloses a voice command processing method. As shown in Figure 1, the method is applied to a target terminal and includes:

[0042] Step S11: Use a pre-trained sound source separation model to process the original speech information to extract the target human voice command, and perform feature extraction on the target human voice command to obtain the target speech feature sequence.

[0043] It should be noted that existing voice interaction technologies mostly employ passive noise reduction mechanisms, which significantly degrade speech recognition performance in complex scenarios such as sudden noise or concurrent multi-user speech. This application, however, addresses the contradiction between effective speech acquisition and insufficient recognition accuracy under noise interference through a pre-trained sound source separation model, achieving robustness in speech acquisition under extreme environments. The design concept of this application is to upgrade the traditional passive noise reduction mode to an active sound source focusing mode, designing a front-end speech processing architecture similar to human hearing characteristics. This ensures accurate extraction of clear and pure target user speech signals in various noisy environments, providing a reliable data foundation for all subsequent intelligent speech processing flows.

[0044] In this embodiment, a pre-trained sound source separation model is proposed to achieve efficient separation of human voice and noise in multi-source scenarios, providing a high signal-to-noise ratio input for subsequent speech recognition. This requires pre-training the sound source separation model. Specifically, before processing the original speech information using the pre-trained sound source separation model to extract the target human voice command, the method further includes: training a time-domain audio separation network using a training dataset to obtain the pre-trained sound source separation model; the training dataset includes data from simulated multi-source scenarios and real-recorded multi-source scenarios; the simulated multi-source scenario data includes scenarios with different noise types, different signal-to-noise ratio ranges, and different sound source distances; the noise types include environmental noise, equipment noise, and other human voice interference.

[0045] It should be noted that time-domain audio separation networks have the advantages of low latency and high separation accuracy, making them suitable for deployment requirements at the device end (terminal).

[0046] It should be noted that, during the model training phase, the noise types include environmental noise, equipment noise, and human voice interference, and the signal-to-noise ratio ranges all fall within 0 dB to 20 dB. The sound source distances are not specifically limited here. Real data is collected from actual application scenarios, such as homes, offices, and outdoors, to ensure the comprehensiveness and representativeness of the dataset. During training, the accuracy of human voice separation and the improvement in signal-to-noise ratio are the core evaluation indicators. By iteratively optimizing the model parameters, the model learns to distinguish the spectral characteristics, temporal characteristics, and feature differences of human voices and various types of noise, thereby achieving accurate extraction of the target human voice. During the model deployment phase, the trained sound source separation model is deployed on the terminal's neural processing unit (NPU). Relying on the NPU's efficient parallel computing capabilities, the model can achieve real-time inference. After the multi-microphone array collects the on-site speech signal, the signal is input into the separation model on the NPU. The model performs real-time preprocessing, feature extraction, and separation on the signal, and finally outputs the separated human voice track and noise track. According to actual tests, this step can improve the signal-to-noise ratio of the speech signal by 15 dB or more and effectively suppress various types of noise interference.
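As a minimal sketch of how one simulated training sample in the 0 dB to 20 dB range might be produced, clean speech can be mixed with noise that has been rescaled to a chosen signal-to-noise ratio. The function names `mix_at_snr` and `measure_snr_db`, the synthetic sine and Gaussian-noise stand-ins, and the uniform sampling of the target ratio are assumptions for illustration and are not taken from the application.

```python
import numpy as np


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `target_snr_db`, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    wanted_noise_power = speech_power / (10 ** (target_snr_db / 10))
    scaled_noise = noise * np.sqrt(wanted_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise


def measure_snr_db(signal, noise):
    """Signal-to-noise ratio in dB from the mean powers of the two components."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in for one second of clean voice
noise = rng.standard_normal(16000)           # stand-in for environmental noise
target = rng.uniform(0.0, 20.0)              # SNR drawn from the 0 dB to 20 dB range
mixture, scaled_noise = mix_at_snr(speech, noise, target)
measured = measure_snr_db(speech, scaled_noise)  # matches `target` up to rounding
```

The same `measure_snr_db` helper could be applied before and after separation to estimate the signal-to-noise ratio improvement that the paragraph names as a core evaluation indicator.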

[0047] In a specific embodiment, the above model training stage is specifically described as follows:

1. Data preprocessing: Before audio data is input into the model, the following preprocessing operations are performed:
(1) Resampling: All training audio (including clean human voice and noisy mixed voice) is uniformly resampled to a sampling rate of 16 kHz;
(2) Amplitude normalization: The amplitude of the audio signal is normalized to the range [-1, 1] to accelerate model convergence;
(3) Short-time Fourier transform: The preprocessed time-domain waveform is converted into a spectrum as model input through the short-time Fourier transform (STFT). The STFT parameters are set as follows: the window length is 1024 sample points (corresponding to 64 ms), the frame shift is 256 sample points (corresponding to 16 ms), and the number of Fourier transform points is 1024. Therefore, the input feature is a 513-dimensional complex spectrum (only the amplitude spectrum is used as the input feature, and the phase information is used in the later waveform reconstruction).

2. Core structure of the model: This embodiment adopts the U-Net (U-shaped Convolutional Network) architecture for human voice separation. The specific parameters are as follows:
Overall structure: an encoder-decoder structure, including 4 downsampling layers and 4 upsampling layers, with skip connections between the encoder and the decoder.
Convolution kernel: All convolutional layers use a (5, 5) two-dimensional convolution kernel to extract features in both the time and frequency dimensions.
Number of channels: The number of channels in the first convolutional layer of the encoder is 32. After that, the number of channels doubles with each downsampling (i.e., from 32 to 64 to 128 to 256). The decoder part is gradually reduced to 1 channel after concatenation and outputs a soft mask.
Activation function: The encoder part uses LeakyReLU (Leaky Rectified Linear Unit). The last layer of the decoder uses the Sigmoid activation function to generate a mask in the range [0, 1].

3. Training configuration and convergence conditions:
(1) Basic training configuration: The Adam optimizer is used and the initial learning rate is set to 0.001. The loss function is the L1 loss (Least Absolute Deviations Loss) between the predicted spectrum and the target clean spectrum.
(2) Convergence and stopping conditions: Early stopping is adopted. The model training cycle is preset to 100 epochs. After each epoch, the loss value is calculated on the validation set. If the validation set loss does not decrease for 10 consecutive epochs, training is stopped and rolled back to the model parameters with the minimum validation set loss.

4. Model evaluation indicators and specific values: On the publicly available test set, after the early stopping condition is met, the model's performance indicators are as follows:
(1) Signal-to-noise ratio improvement: After model processing, the signal-to-noise ratio of the output audio on the test set improves by ≥12.5 dB compared to the noisy input audio;
(2) Voice separation accuracy: The voice separation accuracy in this patent is measured by the objective evaluation indicator SDR (Signal to Distortion Ratio). On the test set, the average SDR value of the audio after model processing is ≥10.8 dB.
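The figures in this embodiment can be checked with a few lines of arithmetic, and the early stopping rule can be sketched as a small function. The constants below are taken from the paragraph above; the function name `early_stopping_epoch`, its 1-indexed return convention, and the interpretation of "does not decrease" as "no new minimum" are assumptions made for this sketch.

```python
SAMPLE_RATE = 16_000   # Hz, after resampling
WIN_LENGTH = 1024      # STFT window length in samples
HOP_LENGTH = 256       # STFT frame shift in samples
N_FFT = 1024           # number of Fourier transform points

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE   # 64.0 ms
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # 16.0 ms
freq_bins = N_FFT // 2 + 1                    # 513 bins for a real-valued input

# Encoder channels: 32 in the first layer, doubling at each of the 4 downsampling layers.
encoder_channels = [32 * 2 ** i for i in range(4)]   # [32, 64, 128, 256]


def early_stopping_epoch(val_losses, patience=10, max_epochs=100):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    minimum of the validation loss; `best_epoch` is the checkpoint to restore.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Made-up validation losses: the minimum is reached at epoch 3, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

For the made-up loss curve, training would stop at epoch 13 and roll back to the parameters saved at epoch 3.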

[0048] It should be noted that, compared to traditional general-purpose noise reduction approaches, this application adopts a high-order technical paradigm of target sound source extraction. This paradigm differs from the traditional method of simply reducing the overall ambient sound intensity. By accurately locating and separating the target sound source, it fundamentally improves the input signal-to-noise ratio in noisy environments, laying a solid foundation for the stable implementation of various subsequent high-order functions.

[0049] In this embodiment, feature optimization is required for the separated human voice signal to highlight the core semantic information in the voice command, reduce the interference of irrelevant semantics on subsequent recognition, and improve the accuracy and robustness of voice recognition. Specifically, the step of extracting features from the target human voice command to obtain a target voice feature sequence includes: extracting features from the target human voice command to obtain an initial voice feature sequence; and using a semantic enhancement network built based on a lightweight Transformer bidirectional encoder to assign corresponding feature weights to the voice features of each voice frame in the initial voice feature sequence to obtain a target voice feature sequence. The degree of correlation between the voice features corresponding to the voice frame and the core content of the command is positively linearly correlated with the corresponding weight of the voice frame.

[0050] It should be noted that the lightweight Transformer encoder, as a semantic enhancement network (specifically using the open-source WavLM semantic enhancement model (Waveform Language Model)), adopts a simplified network structure, reducing the number of network parameters and computational load, adapting to device resource constraints, while ensuring semantic enhancement effect.

[0051] In one specific embodiment, taking the voice command "Please dim the lights over there a little" as an example, the semantic logic of the command is analyzed using a self-attention mechanism. It is determined that "lights" is the object of the operation and "dim" is the core action; together they carry the key semantics of the command, so the speech time frames corresponding to these two words are assigned high weights. "Over there," as a demonstrative, and "a little," as a degree modifier, are auxiliary semantics with less impact on command execution; therefore, their corresponding time frames are assigned lower weights. The weighted speech feature vector effectively highlights the core semantics of the command while weakening secondary semantic interference, enabling the subsequent speech recognition model to quickly focus on key information and improve recognition accuracy.

[0052] It should be noted that during the operation of the semantic enhancement network, the extracted speech feature sequence is input into the Transformer encoder. The encoder calculates and assigns weights to each time frame of the speech feature sequence through a self-attention mechanism. The core logic is: based on the semantic relevance of speech, time frames related to the core content of the instruction are given higher weights, while time frames corresponding to irrelevant or secondary semantics are given lower weights, thereby achieving weighted enhancement of semantic features.
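The frame weighting idea can be illustrated with a heavily simplified sketch. It computes scaled dot-product self-attention directly on the feature frames, without the learned query, key, and value projections that a real Transformer encoder would use, and takes the average attention each frame receives as that frame's weight. The function names and this particular way of reducing the attention matrix to one weight per frame are assumptions and are not described in the application.

```python
import numpy as np


def frame_weights(features):
    """Per-frame weights from scaled dot-product self-attention.

    features: array of shape (num_frames, dim).
    Returns an array of shape (num_frames,) whose entries sum to 1.
    """
    dim = features.shape[1]
    scores = features @ features.T / np.sqrt(dim)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)          # softmax: each row sums to 1
    return attn.mean(axis=0)                               # average attention received


def enhance(features):
    """Scale every frame's feature vector by its attention-derived weight."""
    weights = frame_weights(features)
    return features * weights[:, None], weights


rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 13))   # stand-in for 6 frames of 13-dimensional features
enhanced, w = enhance(feats)
```

In a trained network, frames that other frames attend to strongly would receive larger weights, which corresponds to the higher weights for frames related to the core content of the instruction.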

[0053] In this embodiment, the step of extracting features from the target human voice command to obtain an initial speech feature sequence includes: using Mel-Frequency Cepstral Coefficients (MFCCs) in conjunction with preprocessing operations to extract features from the target human voice command to obtain an initial speech feature sequence; the preprocessing operations include DC removal, pre-emphasis, framing, and windowing. It should be noted that this application uses Mel-Frequency Cepstral Coefficients (MFCCs) as the speech feature parameter, and combines signal preprocessing (DC removal, pre-emphasis, framing, and windowing) during the extraction process to ensure that the extracted speech features accurately reflect the semantic information of the speech.
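The four preprocessing operations named above can be sketched in NumPy as follows. The frame length of 400 samples (25 ms at 16 kHz), the frame shift of 160 samples (10 ms), the pre-emphasis coefficient of 0.97, and the Hamming window are common defaults assumed for this sketch; the application does not give these values. The mel filter bank and discrete cosine transform that complete the MFCC computation are omitted.

```python
import numpy as np


def preprocess(signal, frame_len=400, hop=160, pre_emphasis=0.97):
    """DC removal, pre-emphasis, framing, and windowing of a 1-D signal.

    Returns an array of shape (num_frames, frame_len).
    Assumes len(signal) >= frame_len.
    """
    signal = signal - np.mean(signal)                                            # DC removal
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])   # pre-emphasis
    num_frames = 1 + (len(emphasized) - frame_len) // hop                        # trailing samples dropped
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = emphasized[idx]                                                     # framing
    return frames * np.hamming(frame_len)                                        # windowing


t = np.arange(16000) / 16000.0
signal = 1.0 + np.sin(2 * np.pi * 220 * t)   # one second of a tone with a DC offset
frames = preprocess(signal)                  # shape (98, 400)
```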

[0054] In summary, the core of this application's perceptual enhancement layer addresses the issues of effective speech acquisition and decryption. At the speech input end, a high-performance audio processing model (pre-trained sound source separation model) is deployed. This model can accurately separate and extract the target user's clean speech signal from complex environmental noise in real time. Simultaneously, through intelligent gain adjustment and sound quality optimization algorithms, it completes the preprocessing of the speech signal, providing high-quality input data support for subsequent processes such as speech recognition and intent parsing.

[0055] Step S12: Convert the target speech feature sequence into target speech text information, and process the target speech text information and auxiliary information through a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech command intent; the auxiliary information includes device information and environmental information.

[0056] It should be noted that this application improves the accuracy of command recognition by combining the target speech-text information and auxiliary information. The device information in the auxiliary information includes device operating status, device spatial location, and device historical operations, while the environmental information includes light intensity and ambient temperature. It should also be noted that the joint processing is completed by the intent parsing module, which focuses on contextual judgment. Its input information includes two parts: first, the speech-transcribed text output from the complete recognition path (target speech-text information); and second, the contextual encoding information (auxiliary information) digitized by the device sensors. The contextual encoding information covers key parameters such as environmental status and device status. The model achieves contextual judgment of the user's voice command intent through the fusion analysis of the transcribed text and contextual encoding information, ultimately generating a structured user intent data object containing contextual information, ensuring that the recognition results are highly adapted to actual application scenarios.
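One possible way to present the transcribed text and the contextual encoding information to a text-based language model is to serialize them into a single input string. The sketch below is an assumption for illustration: the class names, the field names, the units, and the use of a literal `[SEP]` separator are not specified in the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceInfo:
    name: str
    operating_status: str
    location: str


@dataclass
class EnvironmentInfo:
    light_intensity_lux: float
    ambient_temperature_c: float


def build_model_input(text: str, devices: List[DeviceInfo], env: EnvironmentInfo) -> str:
    """Join transcribed text, device context, and environment context into one string."""
    device_part = "; ".join(
        f"{d.name} is {d.operating_status} in {d.location}" for d in devices
    )
    env_part = (
        f"light {env.light_intensity_lux:g} lux, "
        f"temperature {env.ambient_temperature_c:g} C"
    )
    return f"{text} [SEP] {device_part} [SEP] {env_part}"


model_input = build_model_input(
    "too bright",
    [DeviceInfo("main light", "on", "living room")],
    EnvironmentInfo(light_intensity_lux=800, ambient_temperature_c=24),
)
```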

[0057] It should be noted that this application integrates static environmental data (such as light intensity and ambient temperature) collected by sensors with the dynamic operating status of the equipment in real time, and organically incorporates it into the entire process of semantic understanding and decision execution. This design enables a qualitative improvement in the system's cognitive ability, accurately distinguishing semantic ambiguities, such as differentiating the physical temperature meaning of "cold" from its physiological health meaning, and the daytime supplemental lighting and nighttime lighting scenarios corresponding to the "turn on the lights" command, truly achieving cognitive intelligence at the contextual awareness level. In one specific embodiment, when the target voice command is "too bright," the intended meaning of the target voice command may include dimming the main living room light (including the degree of dimming, which is determined based on environmental information, such as the difference in light intensity between daytime supplemental lighting and nighttime lighting), turning off the balcony light, etc.

[0058] It should be noted that the deep bidirectional pre-trained language model based on the Transformer encoder is a miniature BERT (Bidirectional Encoder Representations from Transformers) model. This model has the advantages of being lightweight and having strong semantic understanding capabilities, and is suitable for edge deployment requirements.

[0059] In this embodiment, converting the target speech feature sequence into target speech text information includes: querying the target speech feature sequence for speech features corresponding to a specified speech command using an ultra-lightweight convolutional neural network; if a speech feature corresponding to the specified speech command exists, sending the specified speech command to a corresponding device to execute the specified speech command, and converting the target speech feature sequence into initial speech text information using a streaming recurrent neural network model, and fusing the specified speech command and the initial speech text information into target speech text information; if no speech feature corresponding to the specified speech command exists, converting the target speech feature sequence into initial speech text information using a streaming recurrent neural network model, and using the initial speech text information as the target speech text information.
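The control flow of this conversion step can be sketched with the two models passed in as plain callables. The function name `to_target_text`, the callable signatures, and the use of simple string concatenation as the fusion step are assumptions for illustration; the application does not describe how the specified voice command and the initial text are fused.

```python
from typing import Callable, List, Optional, Sequence


def to_target_text(
    features: Sequence[float],
    detect_fixed_command: Callable[[Sequence[float]], Optional[str]],
    transcribe: Callable[[Sequence[float]], str],
    dispatch: Callable[[str], None],
) -> str:
    """Convert a feature sequence into target speech text information."""
    fixed = detect_fixed_command(features)   # ultra-lightweight CNN in the source
    if fixed is not None:
        dispatch(fixed)                      # send the specified command for execution
    initial_text = transcribe(features)      # streaming recurrent model in the source
    if fixed is None:
        return initial_text
    return f"{fixed} {initial_text}"         # hypothetical fusion: concatenation


# Stub models standing in for the two networks.
sent: List[str] = []
with_command = to_target_text(
    [0.1, 0.2], lambda f: "lights on", lambda f: "it is a bit dark here", sent.append
)
without_command = to_target_text(
    [0.1, 0.2], lambda f: None, lambda f: "it is a bit dark here", sent.append
)
```

Passing the models in as callables keeps the sketch independent of any particular inference framework.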

[0060] It should be noted that since the auxiliary information is acquired by local sensors, a network is generally not required, thus enabling operation under offline conditions. However, if the auxiliary information must be acquired via a network, the step of processing the target speech text information and auxiliary information to obtain the target speech command intent cannot be achieved without a network. Therefore, when the absence of a network is detected, subsequent steps are directly prohibited, and only the specified speech command is sent to the corresponding device to execute it, avoiding the inability to perform any corresponding actions when a network is absent. It should be noted that the specified speech command is also known as a fixed command word.
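The offline fallback described above might be expressed as the following guard. It is a hedged sketch: the function name, its parameters, and the returned labels are hypothetical, and the check "auxiliary information requires a network and no network is available" is one reading of the paragraph.

```python
from typing import Callable, Optional


def handle_utterance(
    fixed_command: Optional[str],
    aux_requires_network: bool,
    network_available: bool,
    dispatch: Callable[[str], None],
    run_intent_pipeline: Callable[[], None],
) -> str:
    """Run the full pipeline, or only the fixed command when offline."""
    # The specified (fixed) command is executed whenever it was detected.
    if fixed_command is not None:
        dispatch(fixed_command)

    # Without the network-dependent auxiliary information, intent parsing
    # cannot proceed, so the subsequent steps are skipped.
    if aux_requires_network and not network_available:
        return "fixed_command_only"

    run_intent_pipeline()
    return "full_pipeline"
```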

[0061] It should be noted that this application adopts a dual-track path architecture when converting target speech feature sequences into target speech text information, including a fast detection path using an ultra-lightweight convolutional neural network and a complete recognition path using a streaming recurrent neural network model.

[0062] It should be noted that the fast detection path primarily focuses on low-latency wake-up. The ultra-lightweight convolutional neural network boasts advantages such as fewer parameters, lower computational cost, and faster response speed. Deployed in a persistent memory mode (an end-to-end deployment), it eliminates the need for temporary model loading, achieving a wake-up response time of ≤50ms. The designated voice command in the fast detection path is used to wake up the corresponding device and execute a simple command response, ensuring the system can respond quickly after the user issues a voice command, thus improving the user experience. Furthermore, the fast detection path also incorporates the ability to filter invalid voice signals, reducing subsequent system resource consumption, and the ability to wake up the subsequent complete recognition path using a wake word. This wake word can be a fixed command word or other words adjusted by the device; no specific limitations are imposed here.

[0063] In one specific embodiment, the fast detection path is an edge-side voice command recognition model that employs a lightweight TC-ResNet architecture (a lightweight temporal convolutional residual network). The input layer of the network receives 96 frames of 40-dimensional MFCC features. The backbone network contains 8 convolutional layers and makes extensive use of depthwise separable convolutions to reduce the number of parameters. The number of channels per layer is kept between 64 and 128. Average pooling is applied before the fully connected layers to reduce dimensionality. The training configuration uses the Adam optimizer with an initial learning rate of 0.001, a batch size of 32, and a cosine decay strategy for the learning rate. The convergence condition is to stop training early if the validation set loss does not decrease for 10 consecutive epochs, with a maximum of 100 training epochs. Data preprocessing includes pre-emphasis, framing, Hamming windowing, extraction of 40-dimensional MFCCs and their first-order difference coefficients, and mean normalization to enhance robustness. The model evaluation metrics mainly focus on the accuracy (>95%) in edge scenarios, the model parameter size (<1MB), the inference latency (<500ms), and the instruction recognition rate (>50%) in low signal-to-noise ratio environments.
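The learning-rate schedule and convergence condition of this training configuration can be illustrated with a short sketch. The constants come from the paragraph above; the function names are hypothetical, the cosine schedule is assumed to decay to zero over the maximum number of epochs, and "does not decrease" is interpreted as "does not fall below the best validation loss seen so far".

```python
import math
from typing import Iterable

INITIAL_LR = 0.001  # initial learning rate stated in the embodiment
MAX_EPOCHS = 100    # maximum number of training epochs
PATIENCE = 10       # epochs without improvement before stopping


def cosine_lr(epoch: int, total_epochs: int = MAX_EPOCHS,
              base_lr: float = INITIAL_LR) -> float:
    """Cosine-decayed learning rate for a 0-based epoch index."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def epochs_until_stop(val_losses: Iterable[float], patience: int = PATIENCE,
                      max_epochs: int = MAX_EPOCHS) -> int:
    """Return how many epochs run before early stopping or the epoch cap."""
    best = math.inf
    stale = 0        # consecutive epochs without a new best loss
    epochs_run = 0
    for loss in val_losses:
        if epochs_run >= max_epochs:
            break
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run
```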

[0064] It should be noted that the complete recognition path mainly focuses on accurate transcription. The streaming recurrent neural network model has streaming processing capabilities, which can realize real-time transcription of speech signals. It can perform recognition calculations without waiting for complete speech input, and is suitable for continuous speech command scenarios.

[0065] In one specific embodiment, the complete recognition path uses a streaming recurrent neural network model as the core recognition engine. The model structure is based on multi-layer unidirectional LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) units, with the hidden state dimension of each layer set to 256. A projection layer is used to reduce the output dimension to suit edge computing. The input features are 80-dimensional Fbank (filter bank) features with frame-by-frame mean subtraction applied, supporting streaming data input with a step size of 10ms. The training configuration uses Connectionist Temporal Classification (CTC) as the loss function, employs the Adam optimizer with an initial learning rate of 0.001, and incorporates gradient clipping to prevent gradient explosion. The batch size is set to 64 based on memory limitations. The convergence condition is to stop training when the Character Error Rate (CER) on the development set decreases by less than 0.5% for five consecutive rounds, with a maximum of 50 training rounds. Data preprocessing includes online feature extraction, Voice Activity Detection (VAD) to filter out silent segments, and speed-perturbation-based data augmentation to improve generalization ability. The model evaluation metrics mainly focus on word error rate (required to be <8%), first inference latency (<500ms), and stability during streaming decoding in real-time transcription scenarios.
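As an illustration of the convergence condition, the sketch below computes a character error rate with the standard Levenshtein edit distance and checks for a plateau. The function names are hypothetical, and the 0.5% threshold is assumed to be an absolute CER difference of 0.005 between consecutive rounds; the application does not state whether the threshold is absolute or relative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def should_stop(cer_history: Sequence[float], min_gain: float = 0.005,
                rounds: int = 5, max_rounds: int = 50) -> bool:
    """Stop at the round cap, or when CER improved by less than
    min_gain in each of the last `rounds` consecutive rounds."""
    if len(cer_history) >= max_rounds:
        return True
    if len(cer_history) < rounds + 1:
        return False
    recent = cer_history[-(rounds + 1):]
    gains = [recent[k] - recent[k + 1] for k in range(rounds)]
    return all(g < min_gain for g in gains)
```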

[0066] In summary, the core of this application in the understanding and parsing layer lies in solving the problems of speech semantic recognition and intent parsing. A dual-track parallel processing architecture is adopted, specifically divided into two processing branches: one is a lightweight keyword detection branch with millisecond-level response capability, mainly responsible for the rapid recognition and response of wake words and core control commands (fixed command words); the other is an end-to-end speech recognition and intent understanding joint model branch, responsible for deep semantic parsing of the user's complete voice commands. The processing results of the two branches are fused here, incorporating real-time data from environmental sensors (such as light intensity, ambient temperature, etc.), ultimately generating a structured user intent data object (target voice command intent) containing contextual information.
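One possible shape for the structured user intent data object is sketched below. All class and field names are hypothetical; the application only states that the object combines the results of both branches with real-time sensor data such as light intensity and ambient temperature.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EnvironmentSnapshot:
    """Real-time readings from environmental sensors (assumed units)."""
    light_intensity_lux: Optional[float] = None
    ambient_temperature_c: Optional[float] = None


@dataclass
class UserIntent:
    """Structured user intent carrying contextual information."""
    transcript: str                 # output of the full recognition branch
    fixed_command: Optional[str]    # output of the keyword branch, if any
    environment: EnvironmentSnapshot
    device_states: Dict[str, str] = field(default_factory=dict)


# Example: "too bright" spoken at night with the main light switched on.
intent = UserIntent(
    transcript="too bright",
    fixed_command=None,
    environment=EnvironmentSnapshot(light_intensity_lux=320.0,
                                    ambient_temperature_c=24.5),
    device_states={"main living room light": "on"},
)
```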

[0067] Step S13: Determine each intent behavior based on the target voice command intent, and determine the comprehensive arbitration score corresponding to each intent behavior; the intent behavior includes intent device and intent operation.

[0068] It should be noted that the target voice command intent may simultaneously involve several intent operations. For example, when the target voice command is "It's too bright," the intent actions may include dimming the main living room light (including the degree of dimming, which is determined based on environmental information, such as the difference in light intensity between daytime supplemental lighting and nighttime lighting), turning off the balcony light, etc. It should be noted that unless it is a decisive turning-off operation, other turning-on or adjusting operations require determining the degree of turning on or adjusting; that is, the intent operation includes the degree of operation.

[0069] It should be noted that a single voice command may correspond to multiple executable devices (intent devices, such as dimming and turning off the lights corresponding to the main living room light and balcony light). Traditional technical solutions are prone to device response conflicts or require cumbersome device naming to distinguish the executing entity. This application determines the optimal executing device and the optimal executing operation through arbitration, realizing seamless and orderly collaborative response among multiple devices, replacing the isolated operation mode of a single device.

[0070] It should be noted that this application constructs an interpretable and quantifiable arbitration scoring model as the core engine of the system's instruction decision-making, breaking the traditional "black box" mode of arbitration decision-making. Through multi-dimensional quantitative scoring, it achieves accurate matching and decision-making for fuzzy user instructions, ensuring the accuracy and rationality of instruction execution. Specifically, determining the comprehensive arbitration score corresponding to each intent behavior includes: determining the initial arbitration score of the intent behavior under each preset arbitration dimension; and determining the comprehensive arbitration score of each intent behavior based on a weighted algorithm and the initial arbitration scores. The preset arbitration dimensions include a semantic relevance dimension, a state matching degree dimension, a physical proximity dimension, and a user preference coefficient dimension. Semantic relevance is the degree of relevance between the intent device and the intent operation; state matching degree is the degree of matching between the current state of the intent device and the intent operation; physical proximity is the quantified value of the distance between the user and the intent device; and the user preference coefficient is the quantified value of the user's historical operation of the intent device.

[0071] It should be noted that the semantic relevance can be calculated by querying a pre-constructed knowledge graph. The knowledge graph stores the association relationships and association weights between various devices and their corresponding executable operations. For example, the association degree between "light" and the "dim" operation is set to 0.95, and the association degree between "air conditioner" and the "temperature adjustment" operation is set to 0.98. The association degree ranges from 0 to 1. The higher the value, the stronger the association between the instruction and the device operation. The state matching degree is used to determine whether the current operating state of the device supports the intent operation corresponding to the voice command and to ensure the efficiency of operation execution. The calculation process combines real-time device state data (such as the "on / off" state of the light, the "current temperature" state of the air conditioner, etc.). For example, a light that is already in the off state has a state matching degree of 0.9 for the "on" command and a state matching degree of 0.1 for the "dim" command. The state matching degree ranges from 0 to 1. The higher the value, the better the device state fits the instruction. The physical proximity is used to quantify the distance between the user and the device, serving as an auxiliary reference for command decision-making. It is estimated using the Bluetooth Received Signal Strength Indicator (RSSI) or Ultra-Wideband (UWB) precision ranging technology on the device, converting the signal strength or ranging result into a quantized value in the 0-1 range. A higher value indicates a closer physical distance between the user and the device, and a higher priority for the device as the target of command execution. The user preference coefficient combines user historical usage habits to optimize command decision-making. By analyzing user historical operation records (such as frequently used devices, frequently used operations, operation times, etc.), user preferences are learned and quantified into a coefficient in the 0-1 range. A higher value indicates that the device or operation is more in line with the user's usage habits and has a higher priority in command decision-making.
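Two of these dimensions can be illustrated with a small sketch. The knowledge graph is reduced here to a dictionary holding the two association weights given above, and the RSSI conversion is assumed to be a clamped linear mapping between two hypothetical calibration points (-90 dBm as "far" and -30 dBm as "near"); the application does not specify the conversion function or these bounds.

```python
from typing import Dict, Tuple

# Association weights taken from the examples in the text.
ASSOCIATIONS: Dict[Tuple[str, str], float] = {
    ("light", "dim"): 0.95,
    ("air conditioner", "temperature adjustment"): 0.98,
}


def semantic_relevance(device: str, operation: str,
                       graph: Dict[Tuple[str, str], float] = ASSOCIATIONS) -> float:
    """Look up the device-operation association weight (0 if unknown)."""
    return graph.get((device, operation), 0.0)


def proximity_from_rssi(rssi_dbm: float, far_dbm: float = -90.0,
                        near_dbm: float = -30.0) -> float:
    """Map an RSSI reading to a 0-1 proximity value (assumed linear, clamped)."""
    value = (rssi_dbm - far_dbm) / (near_dbm - far_dbm)
    return max(0.0, min(1.0, value))
```

A stronger (less negative) RSSI yields a value closer to 1, matching the convention that a higher value means the user is closer to the device.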

[0072] It should be noted that the decision-making logic of the arbitration engine is based on an interpretable weighted scoring function. The design of the scoring function follows the principle of prioritizing the weight of core factors and adapting the weight of auxiliary factors, and clearly defines the weight allocation of each preset arbitration dimension. The weight of each preset arbitration dimension can be changed in real time according to the actual situation. For example, the comprehensive arbitration score = 0.4 × semantic relevance + 0.3 × state matching degree + 0.2 × physical proximity + 0.1 × user preference coefficient.
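The example scoring function can be written as a short sketch. The weights are the ones given in the example above; the dictionary keys and the function name are hypothetical, and the range check is an added safeguard rather than something the application requires.

```python
from typing import Mapping

# Example weights from the text; they may be changed at run time.
DEFAULT_WEIGHTS = {
    "semantic_relevance": 0.4,
    "state_match": 0.3,
    "physical_proximity": 0.2,
    "user_preference": 0.1,
}


def comprehensive_score(scores: Mapping[str, float],
                        weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the initial arbitration scores of one intent behavior."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sum(weights[name] * scores[name] for name in weights)
```

Because the example weights sum to 1 and every dimension lies in the 0-1 range, the comprehensive score also stays within 0-1.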

[0073] Step S14: The intent device and intent operation corresponding to the highest comprehensive arbitration score are respectively designated as the optimal execution device and optimal execution operation. Based on the optimal execution device and optimal execution operation, a target execution instruction matching the target voice instruction intent is determined, and the target execution instruction is sent to the optimal execution device so that the optimal execution device executes the target execution instruction.

[0074] It should be noted that after the optimal execution device executes the target execution instruction, it needs to provide clear voice confirmation information to the user so that the user can judge whether the execution is correct; of course, the target execution instruction can also be sent to the optimal execution device after the user confirms that the target execution instruction is correct.

[0075] In summary, this application primarily addresses the issues of intent matching and precise execution at the decision-making and execution layer. The terminal incorporates a multi-factor dynamic arbitration algorithm. This algorithm matches and compares structured user intent data objects with the real-time operational status files of all currently controllable devices. Through multi-dimensional parameter calculations, it obtains a comprehensive suitability score for each device. Based on this score, the system abandons the traditional command broadcasting mode, precisely selecting the device with the highest comprehensive suitability and issuing specific operation commands matching the user's intent to it. After the command is executed, it provides the user with clear voice confirmation information.

[0076] In one specific embodiment, the voice command is "Too bright." There are two possible intent behaviors: dimming the main living room light (the degree of dimming is determined based on environmental information, such as the difference in light intensity between daytime supplemental lighting and nighttime illumination) and turning off the balcony light. For dimming the main living room light: the semantic relevance between the main living room light and the dimming operation is 0.9; the state matching degree between the state of the main living room light (on and allowed to be dimmed) and the dimming operation is 0.9; the quantified distance between the main living room light and the user is 0.9; and the user preference coefficient for the dimming operation is 0.9. Therefore, the comprehensive arbitration score for dimming the main living room light is 0.4 × 0.9 + 0.3 × 0.9 + 0.2 × 0.9 + 0.1 × 0.9 = 0.9. For turning off the balcony light: the semantic relevance between the balcony light and the turn-off operation is 0.9; the state matching degree between the state of the balcony light (on and allowed to be dimmed) and the turn-off operation is 0.9; the quantified distance between the balcony light and the user is 0.6; and the user preference coefficient for the turn-off operation of the balcony light is 0.7. Therefore, the comprehensive arbitration score for turning off the balcony light is 0.4 × 0.9 + 0.3 × 0.9 + 0.2 × 0.6 + 0.1 × 0.7 = 0.82. Since 0.9 is higher than 0.82, the main living room light is determined to be the optimal execution device and the dimming operation is determined to be the optimal execution operation. Dimming the main living room light is therefore the target execution instruction that matches the target voice command intent, ensuring the accuracy and rationality of the instruction execution.

[0077] It should be noted that if the comprehensive arbitration scores for dimming the main living room light and turning off the balcony light are equal and are the highest, then both operations can be performed simultaneously.
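The "Too bright" example and the tie rule can be reproduced with the following sketch. The dimension values and weights are those of the embodiment; the function names and the floating-point tolerance used to detect equal scores are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

WEIGHTS = (0.4, 0.3, 0.2, 0.1)  # relevance, state match, proximity, preference
Behavior = Tuple[str, str]       # (intent device, intent operation)


def score(dims: Tuple[float, float, float, float]) -> float:
    """Comprehensive arbitration score as a weighted sum."""
    return sum(w * d for w, d in zip(WEIGHTS, dims))


def select_best(candidates: Dict[Behavior, Tuple[float, float, float, float]],
                tol: float = 1e-9) -> List[Behavior]:
    """Return every behavior whose score equals the highest score."""
    scored = {behavior: score(dims) for behavior, dims in candidates.items()}
    top = max(scored.values())
    return [b for b, s in scored.items() if math.isclose(s, top, abs_tol=tol)]


candidates = {
    ("main living room light", "dim"): (0.9, 0.9, 0.9, 0.9),
    ("balcony light", "turn off"): (0.9, 0.9, 0.6, 0.7),
}
best = select_best(candidates)  # only the dimming behavior remains
```

If two behaviors had identical dimension values, `select_best` would return both, which corresponds to executing both operations simultaneously.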

[0078] It should be noted that, based on the above, the temporal audio separation network (pre-trained sound source separation model), the deep bidirectional pre-trained language model based on the Transformer encoder, the fast detection path (ultra-lightweight convolutional neural network), and the complete recognition path (streaming recurrent neural network model) can all be deployed on the terminal, and the device sensors are also on the same terminal (modular deployment of multimodal sensor interfaces). No network is needed when transmitting device and environmental information, thus enabling offline operation on the terminal. Because the models and networks primarily use lightweight designs and heterogeneous computing architectures, complex processing flows can be successfully deployed on resource-constrained target terminals. The entire processing flow is supported by chip-level software-hardware co-optimization technology: the modular deployment integrating dedicated signal processing units (such as an NPU), multimodal sensor interfaces, and arbitration logic units ensures that the processing flow is completed within hundreds of milliseconds. This ensures that computationally intensive tasks in the method (such as sound source separation and semantic enhancement) can run efficiently and with low power consumption offline on the local terminal. Complex interactions can be completed without relying on a cloud-based brain, highlighting the autonomy, real-time performance, and privacy security of edge intelligence. The method satisfies both the demand for cloud-level intelligent processing capability and the terminal-level requirement for low power consumption, effectively breaking the triangle dilemma of performance, latency, and power consumption, and avoiding the inherent defects of cloud-based systems such as response latency, leakage of user privacy data, and strong network dependence.

[0079] It should be noted that the voice command processing method can be specifically applied to a voice command processing system burned into a target terminal. By burning the system into various terminals, portability, flexibility and standardization can be achieved.

[0080] As can be seen, this application utilizes a pre-trained sound source separation model to process raw speech information to extract target human voice commands, and performs feature extraction on the target human voice commands to obtain target speech feature sequences; the target speech feature sequences are converted into target speech text information, and the target speech text information and auxiliary information are processed by a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech command intent; the auxiliary information includes device information and environmental information; based on the target speech command intent, each intent behavior is determined, and a comprehensive arbitration score corresponding to each intent behavior is determined; the intent behavior includes intent device and intent operation; the intent device and intent operation corresponding to the highest comprehensive arbitration score are respectively taken as the optimal execution device and optimal execution operation, and a target execution command matching the target speech command intent is determined based on the optimal execution device and optimal execution operation, and the target execution command is sent to the optimal execution device so that the optimal execution device executes the target execution command. Therefore, this application's pre-trained sound source separation model extracts target human voice commands, improving command acquisition accuracy and further enhancing command response accuracy. This application does not rely on fixed command words, but instead utilizes a deep bidirectional pre-trained language model based on a Transformer encoder to obtain command intent, possessing semantic intent reasoning capabilities. This accurately transforms ambiguous spoken expressions into clear device operation commands, achieving a technological leap from speech recognition to speech understanding in voice interaction. Furthermore, intent reasoning considers not only target speech text information but also auxiliary information, fully taking into account real-world factors such as device and environmental information, improving speech understanding accuracy and further enhancing command response accuracy. This application uses an arbitration method to calculate the optimal execution device and optimal execution operation, setting preset arbitration dimensions including semantic relevance, state matching degree, physical proximity, and user preference coefficient, fully considering the influence of each dimension and improving command response accuracy.

[0081] Accordingly, this application also discloses a voice command processing device applied to a target terminal. As shown in Figure 2, the device includes:

[0082] The instruction extraction module 11 is used to process the original speech information using a pre-trained sound source separation model to extract the target human voice instruction, and to perform feature extraction on the target human voice instruction to obtain the target speech feature sequence.

[0083] The instruction intent acquisition module 12 is used to convert the target speech feature sequence into target speech text information, and process the target speech text information and auxiliary information through a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech instruction intent; the auxiliary information includes device information and environmental information.

[0084] The arbitration module 13 is used to determine each intent behavior based on the target voice command intent, and to determine a comprehensive arbitration score corresponding to each intent behavior; the intent behavior includes intent device and intent operation.

[0085] The optimal instruction determination module 14 is used to take the intent device and intent operation corresponding to the highest comprehensive arbitration score as the optimal execution device and optimal execution operation, respectively, and determine the target execution instruction that matches the target voice instruction intent based on the optimal execution device and optimal execution operation, and send the target execution instruction to the optimal execution device so that the optimal execution device executes the target execution instruction.

[0086] The more specific working process of each of the above modules can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.
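How the four modules chain together can be sketched as a simple composition. This is an assumed structure for illustration only: the class name, the callable signatures, and the use of plain Python callables for modules 11 to 13 are hypothetical, and each callable would wrap the corresponding models described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Behavior = Tuple[str, str]  # (intent device, intent operation)


@dataclass
class VoiceCommandDevice:
    extract: Callable[[bytes], List[List[float]]]             # module 11
    acquire_intent: Callable[[List[List[float]], dict], str]  # module 12
    arbitrate: Callable[[str], Dict[Behavior, float]]         # module 13
    send: Callable[[str, str], None]                          # used by module 14

    def process(self, raw_audio: bytes, auxiliary: dict) -> Behavior:
        """Run modules 11 to 14 in order and return the executed behavior."""
        features = self.extract(raw_audio)
        intent = self.acquire_intent(features, auxiliary)
        scores = self.arbitrate(intent)
        # Module 14: pick the behavior with the highest comprehensive score.
        device, operation = max(scores, key=scores.get)
        self.send(device, operation)
        return device, operation
```

The sketch assumes that arbitration returns at least one intent behavior; an empty result would need separate handling.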

[0087] As can be seen, this application utilizes a pre-trained sound source separation model to process raw speech information to extract target human voice commands, and performs feature extraction on the target human voice commands to obtain target speech feature sequences; the target speech feature sequences are converted into target speech text information, and the target speech text information and auxiliary information are processed by a deep bidirectional pre-trained language model based on a Transformer encoder to obtain the target speech command intent; the auxiliary information includes device information and environmental information; based on the target speech command intent, each intent behavior is determined, and a comprehensive arbitration score corresponding to each intent behavior is determined; the intent behavior includes intent device and intent operation; the intent device and intent operation corresponding to the highest comprehensive arbitration score are respectively taken as the optimal execution device and optimal execution operation, and a target execution command matching the target speech command intent is determined based on the optimal execution device and optimal execution operation, and the target execution command is sent to the optimal execution device so that the optimal execution device executes the target execution command. Therefore, this application's pre-trained sound source separation model extracts target human voice commands, improving command acquisition accuracy and further enhancing command response accuracy. This application does not rely on fixed command words, but instead utilizes a deep bidirectional pre-trained language model based on a Transformer encoder to obtain command intent, possessing semantic intent reasoning capabilities. This accurately transforms ambiguous spoken expressions into clear device operation commands, achieving a technological leap from speech recognition to speech understanding in voice interaction. Furthermore, intent reasoning considers not only target speech text information but also auxiliary information, fully taking into account real-world factors such as device and environmental information, improving speech understanding accuracy and further enhancing command response accuracy. This application uses an arbitration method to calculate the optimal execution device and optimal execution operation, further improving command response accuracy.

[0088] Furthermore, embodiments of this application also provide an electronic device. Figure 3 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application.

[0089] Figure 3 is a schematic diagram of the structure of an electronic device 20 provided in an embodiment of this application. Specifically, the electronic device 20 may include: at least one processor 21, at least one memory 22, a display screen 23, an input / output interface 24, a communication interface 25, a power supply 26, and a communication bus 27. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the voice command processing method disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.

[0090] In this embodiment, the power supply 26 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 25 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 24 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0091] Furthermore, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk, or optical disk, etc. The resources stored thereon may include computer programs 221, and the storage method may be temporary storage or permanent storage. The computer programs 221 may include, in addition to computer programs capable of performing the voice command processing method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, computer programs capable of performing other specific tasks.

[0092] Furthermore, embodiments of this application also disclose a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned disclosed voice command processing method.

[0093] The specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0094] The various embodiments in this application are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. For the same or similar parts between the various embodiments, refer to each other. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple, and relevant parts can be referred to in the method section.

[0095] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0096] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0097] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0098] The above provides a detailed description of a voice command processing method, apparatus, device, and storage medium provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A voice command processing method, characterized in that it is applied to a target terminal and includes: processing raw speech information using a pre-trained sound source separation model to extract a target human voice command, and performing feature extraction on the target human voice command to obtain a target speech feature sequence; converting the target speech feature sequence into target speech text information, and processing the target speech text information and auxiliary information by a deep bidirectional pre-trained language model based on a Transformer encoder to obtain a target speech command intent, wherein the auxiliary information includes device information and environmental information; determining each intent behavior based on the target voice command intent, and determining a comprehensive arbitration score for each intent behavior, wherein the intent behavior includes an intent device and an intent operation; and taking the intent device and intent operation corresponding to the highest comprehensive arbitration score as the optimal execution device and optimal execution operation, respectively, determining, based on the optimal execution device and optimal execution operation, a target execution instruction matching the target voice command intent, and sending the target execution instruction to the optimal execution device so that the optimal execution device executes the target execution instruction.

2. The voice command processing method according to claim 1, characterized in that the step of performing feature extraction on the target human voice command to obtain the target speech feature sequence comprises: performing feature extraction on the target human voice command to obtain an initial speech feature sequence; and using a semantic enhancement network based on a lightweight bidirectional Transformer encoder to assign a corresponding feature weight to the speech features of each speech frame in the initial speech feature sequence, thereby obtaining the target speech feature sequence; wherein the weight of each speech frame is positively and linearly correlated with the degree of correlation between that frame's speech features and the core content of the command.
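The weighting relationship in claim 2 can be sketched as follows. The sketch assumes that a per-frame relevance score is already available (in the claim it would come from the semantic enhancement network, which is not modeled here) and that scores are non-negative. The `weight_frames` function and the normalization by the sum of scores are illustrative assumptions, chosen only because they make each weight proportional to its relevance score.

```python
import numpy as np


def weight_frames(features, relevance):
    """Scale each frame's features by a weight proportional to its relevance.

    features:  array of shape (n_frames, n_dims), the initial feature sequence
    relevance: array of shape (n_frames,), assumed non-negative scores
    """
    features = np.asarray(features, dtype=np.float64)
    relevance = np.asarray(relevance, dtype=np.float64)
    if features.shape[0] != relevance.shape[0]:
        raise ValueError("one relevance score is required per frame")
    total = relevance.sum()
    if total <= 0:
        raise ValueError("relevance scores must have a positive sum")
    weights = relevance / total          # linear in each relevance score
    return features * weights[:, None], weights


features = np.ones((3, 2))
weighted, weights = weight_frames(features, [1.0, 3.0, 0.0])
```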

3. The voice command processing method according to claim 2, characterized in that the step of performing feature extraction on the target human voice command to obtain the initial speech feature sequence comprises: extracting features from the target human voice command using Mel-frequency cepstral coefficients in combination with preprocessing operations to obtain the initial speech feature sequence; wherein the preprocessing operations include DC removal, pre-emphasis, framing, and windowing.
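The four preprocessing operations named in claim 3 can be sketched with NumPy as below. The 16 kHz sample rate, 25 ms frame length, 10 ms hop, pre-emphasis coefficient of 0.97, and Hamming window are common defaults assumed for illustration; none of them is specified in this application. The sketch stops before the Mel filterbank and cepstral steps.

```python
import numpy as np


def preprocess(signal, sample_rate=16000, frame_ms=25, hop_ms=10, pre_emph=0.97):
    """DC removal, pre-emphasis, framing, and windowing (all parameters assumed)."""
    x = np.asarray(signal, dtype=np.float64)
    if x.size == 0:
        raise ValueError("signal is empty")
    x = x - x.mean()                                   # DC removal
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])     # pre-emphasis
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if len(x) < frame_len:                             # pad short inputs to one frame
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Build an index matrix so that row i selects the samples of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                                    # framing
    return frames * np.hamming(frame_len)              # windowing


one_second = np.random.default_rng(0).standard_normal(16000)
frames = preprocess(one_second)
```

Each row of `frames` is one windowed frame, ready for a power spectrum, Mel filterbank, and discrete cosine transform if full MFCC features are needed.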

4. The voice command processing method according to claim 1, characterized in that, before processing the raw speech information using the pre-trained sound source separation model to extract the target human voice command, the method further comprises: training a time-domain audio separation network on a training dataset to obtain the pre-trained sound source separation model; wherein the training dataset includes simulated multi-sound-source scenario data and real recorded multi-sound-source scenario data; the simulated multi-sound-source scenario data covers different noise types, different signal-to-noise ratio ranges, and different sound source distances; and the noise types include environmental noise, equipment noise, and interfering human voices.
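One common way to produce simulated data at a chosen signal-to-noise ratio, as mentioned in claim 4, is to scale a noise recording before adding it to clean speech. The sketch below is an assumption about how such data could be generated, not the procedure used in this application; `mix_at_snr` is a hypothetical helper, and random arrays stand in for real recordings.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scaled_noise) so that speech-to-noise power equals snr_db."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    noise = np.resize(noise, speech.shape)   # repeat or trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = gain * noise
    return speech + scaled, scaled


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)          # stand-in for a clean voice command
noise = rng.standard_normal(8000)            # stand-in for a shorter noise clip
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```

Sweeping `snr_db` over a range and drawing `noise` from environmental, equipment, and competing-speaker recordings would cover the noise types and SNR ranges listed in the claim; modeling sound source distance would need an additional step such as room impulse response convolution, which is not shown.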

5. The voice command processing method according to claim 1, characterized in that determining the comprehensive arbitration score for each intent behavior comprises: determining an initial arbitration score of the intent behavior under each preset arbitration dimension, and determining the comprehensive arbitration score of each intent behavior from the initial arbitration scores using a weighting algorithm.

6. The voice command processing method according to claim 5, characterized in that the preset arbitration dimensions include a semantic association degree dimension, a state matching degree dimension, a physical proximity degree dimension, and a user preference coefficient dimension; wherein the semantic association degree is the degree of association between the intent device and the intent operation; the state matching degree is the degree of matching between the current state of the intent device and the intent operation; the physical proximity degree is a quantified value of the distance between the user and the intent device; and the user preference coefficient is a quantified value of the user's historical operations on each intent device.
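The weighted combination described in claims 5 and 6 can be sketched as a weighted sum over the four dimensions. The weights, the initial scores, the assumption that every score lies in [0, 1], and all names below are hypothetical values chosen for illustration; this application does not specify them.

```python
# Short keys for the four preset arbitration dimensions (names are assumptions).
DIMENSIONS = ("semantic", "state", "proximity", "preference")


def comprehensive_score(initial_scores, weights):
    """Weighted sum of the initial arbitration scores across all dimensions."""
    return sum(weights[d] * initial_scores[d] for d in DIMENSIONS)


# Hypothetical weights; here they sum to 1 so the result stays in [0, 1].
weights = {"semantic": 0.4, "state": 0.3, "proximity": 0.2, "preference": 0.1}

# Hypothetical initial scores for two (intent device, intent operation) pairs.
candidates = {
    ("bedroom_light", "turn_on"): {
        "semantic": 0.9, "state": 1.0, "proximity": 0.8, "preference": 0.5,
    },
    ("living_room_light", "turn_on"): {
        "semantic": 0.9, "state": 0.0, "proximity": 0.3, "preference": 0.7,
    },
}

scores = {b: comprehensive_score(s, weights) for b, s in candidates.items()}
best = max(scores, key=scores.get)
```

With these example values, both candidates match the command equally well semantically, so state matching and proximity decide the outcome.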

7. The voice command processing method according to any one of claims 1 to 6, characterized in that the step of converting the target speech feature sequence into target speech text information comprises: using an ultra-lightweight convolutional neural network to search the target speech feature sequence for speech features corresponding to a specified voice command; if speech features corresponding to the specified voice command exist, sending the specified voice command to the corresponding device for execution, converting the target speech feature sequence into initial speech text information through a streaming recurrent neural network model, and fusing the specified voice command and the initial speech text information into the target speech text information; and if no speech features corresponding to the specified voice command exist, converting the target speech feature sequence into initial speech text information through the streaming recurrent neural network model, and using the initial speech text information as the target speech text information.
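The two branches of claim 7 can be sketched as plain control flow. The keyword spotter, the streaming recognizer, and the dispatcher are passed in as callables because their implementations are not modeled here; `to_target_text` and its parameters are hypothetical names. The claim does not say how the specified command and the initial text are fused, so the prefix-based fusion below is purely an assumption.

```python
def to_target_text(features, spot_command, transcribe, dispatch):
    """Convert a feature sequence to target text, with a fast path for known commands.

    spot_command(features) -> str or None   (stand-in for the lightweight CNN)
    transcribe(features)   -> str           (stand-in for the streaming RNN model)
    dispatch(command)      -> None          (stand-in for sending to the device)
    """
    command = spot_command(features)
    if command is not None:
        dispatch(command)                   # execute the specified command right away
    initial_text = transcribe(features)
    if command is None:
        return initial_text                 # no specified command: use the text as-is
    # Assumed fusion: prepend the command unless the transcript already contains it.
    if command in initial_text:
        return initial_text
    return f"{command} {initial_text}".strip()


sent = []
fused = to_target_text(
    features=[0.1, 0.2],
    spot_command=lambda f: "lights on",
    transcribe=lambda f: "in the bedroom please",
    dispatch=sent.append,
)
plain = to_target_text(
    features=[0.1, 0.2],
    spot_command=lambda f: None,
    transcribe=lambda f: "what time is it",
    dispatch=sent.append,
)
```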

8. A voice command processing apparatus, characterized in that it is applied to a target terminal and comprises: an instruction extraction module, configured to process raw speech information using a pre-trained sound source separation model to extract a target human voice command, and to perform feature extraction on the target human voice command to obtain a target speech feature sequence; an instruction intent acquisition module, configured to convert the target speech feature sequence into target speech text information, and to process the target speech text information and auxiliary information with a deep bidirectional pre-trained language model based on a Transformer encoder to obtain a target voice command intent, wherein the auxiliary information includes device information and environmental information; an arbitration module, configured to determine each intent behavior based on the target voice command intent, and to determine a comprehensive arbitration score for each intent behavior, wherein each intent behavior includes an intent device and an intent operation; and an optimal instruction determination module, configured to take the intent device and the intent operation corresponding to the highest comprehensive arbitration score as the optimal execution device and the optimal execution operation, respectively, to determine, based on the optimal execution device and the optimal execution operation, a target execution instruction that matches the target voice command intent, and to send the target execution instruction to the optimal execution device so that the optimal execution device executes it.

9. An electronic device, characterized in that it comprises: a memory, configured to store a computer program; and a processor, configured to execute the computer program to implement the voice command processing method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is configured to store a computer program; wherein, when the computer program is executed by a processor, it implements the voice command processing method according to any one of claims 1 to 7.