Voice wake-up method, apparatus, storage medium, and system
By employing a two-level separation and wake-up scheme, combined with a neural network model and a conformer network structure, the problems of low wake-up rate and high false wake-up rate in complex acoustic environments are solved, achieving a more efficient voice wake-up effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2021-03-31
- Publication Date
- 2026-06-12
AI Technical Summary
In the presence of background noise, existing voice wake-up technologies suffer from low wake-up rates and high false wake-up rates in scenarios with multiple sound sources or far-field echoes, resulting in poor voice wake-up performance.
A two-level separation and wake-up scheme is adopted. First, pre-wake-up judgment is performed through the first-level streaming separation and wake-up processing. Then, wake-up confirmation is performed in the second-level offline scenario. The neural network model is used for multi-level separation and wake-up processing. Combined with the Conformer network structure and multi-feature fusion technology, the wake-up rate is improved and the false wake-up rate is reduced.
The scheme significantly improves the accuracy and reliability of voice wake-up in complex acoustic environments, maintaining a high wake-up rate while reducing the false wake-up rate.
Figure CN115148197B_ABST
Description
Technical Field
[0001] This application relates to the field of terminal technology, and in particular to a voice wake-up method, device, storage medium and system.
Background Technology
[0002] With the rise of intelligent voice interaction, more and more electronic devices support voice interaction functions. Among them, voice wake-up, as the starting point of voice interaction, is widely used in various electronic devices, such as smart speakers and smart TVs. When there are electronic devices in the user's space that support voice wake-up, the user issues a wake-up voice, and the woken-up electronic device will respond to the speaker's request and interact with the user.
[0003] In related technologies, in order to improve the wake-up rate of electronic devices, the wake-up module in the electronic device can be trained under multiple conditions and the trained wake-up module can be used for voice wake-up; alternatively, microphone array processing technology can be used for voice wake-up; or, traditional sound source separation technology can be used for voice wake-up.
[0004] While the above methods have made some progress in terms of wake-up rate, human voices are still recognized relatively poorly in the presence of background noise. In scenarios with multiple sound sources, strong sound sources, or far-field echoes in particular, the wake-up rate is even lower and the voice wake-up effect of electronic devices is poor.
Summary of the Invention
[0005] In view of this, a voice wake-up method, device, storage medium, and system are proposed. The embodiments of this application design a two-level separation and wake-up scheme. In the first-level scenario, a pre-wake-up judgment is performed using the first-level separation and wake-up scheme. After successful pre-wake-up, a second wake-up confirmation is performed in the second-level scenario, ensuring a high wake-up rate while reducing the false wake-up rate, thereby achieving a better voice wake-up effect.
[0006] In a first aspect, embodiments of this application provide a voice wake-up method, the method comprising:
[0007] Acquire raw first microphone data;
[0008] The first wake-up data is obtained by performing a first-level processing on the first microphone data. The first-level processing includes a first-level separation processing and a first-level wake-up processing based on a neural network model.
[0009] When the first wake-up data indicates that the pre-wake-up is successful, the second wake-up data is obtained by performing a second-level processing based on the first microphone data. The second-level processing includes a second-level separation processing and a second-level wake-up processing based on a neural network model.
[0010] The wake-up result is determined based on the second wake-up data.
[0011] In this implementation, a two-level separation and wake-up scheme is designed. In the first-level scenario, the original first microphone data undergoes first-level separation processing and first-level wake-up processing to obtain the first wake-up data, and a pre-wake-up judgment is made based on the first wake-up data. The first-level separation and wake-up scheme can ensure a high wake-up rate, but it also brings a high false wake-up rate. Therefore, when the first wake-up data indicates that the pre-wake-up is successful, the first microphone data undergoes second-level separation processing and second-level wake-up processing in the second-level scenario. That is, wake-up is confirmed a second time on the same first microphone data. This can achieve better separation performance and ensure a high wake-up rate while reducing the false wake-up rate, thereby achieving a better voice wake-up effect.
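The gating logic described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the names `two_level_wake`, `WakeResult`, `first_level` and `second_level` are hypothetical, each level is reduced to a callable that returns a confidence value, and both thresholds are assumed values chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class WakeResult:
    woken: bool
    decided_at: str      # "first-level" or "second-level"
    confidence: float


def two_level_wake(
    mic_data: Sequence[float],
    first_level: Callable[[Sequence[float]], float],
    second_level: Callable[[Sequence[float]], float],
    pre_threshold: float = 0.5,      # assumed value, not from the source
    confirm_threshold: float = 0.8,  # assumed value, not from the source
) -> WakeResult:
    """Run the first level always, and the second level only after pre-wake-up."""
    pre_confidence = first_level(mic_data)
    if pre_confidence < pre_threshold:
        # Pre-wake-up failed: the second level is never invoked.
        return WakeResult(False, "first-level", pre_confidence)
    # Pre-wake-up succeeded: confirm on the same microphone data.
    confirm_confidence = second_level(mic_data)
    woken = confirm_confidence >= confirm_threshold
    return WakeResult(woken, "second-level", confirm_confidence)
```

In this sketch a permissive first level keeps the wake-up rate high, while the second level rejects pre-wake-ups that do not survive confirmation.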
[0012] In conjunction with the first aspect, in one possible implementation of the first aspect, the step of performing a first-level processing based on the first microphone data to obtain the first wake-up data includes:
[0013] The first microphone data is preprocessed to obtain multi-channel feature data;
[0014] Based on the multi-channel feature data, the first separation data is obtained by calling the pre-trained first-level separation module. The first-level separation module is used to perform the first-level separation processing.
[0015] Based on the multi-channel feature data and the first separation data, the first wake-up data is obtained by calling the pre-trained first-level wake-up module, which is used to perform the first-level wake-up processing.
[0016] In this implementation, the first microphone data is preprocessed to obtain multi-channel feature data. Then, the first-level separation module is called to output the first separation data based on the multi-channel feature data. Then, the first-level wake-up module is called to output the first wake-up data based on the multi-channel feature data and the first separation data. This realizes the first-level separation processing and the first-level wake-up processing of the first microphone data in the first-level scenario, ensuring that the wake-up rate of the pre-wake-up is as high as possible.
[0017] In conjunction with the first possible implementation of the first aspect, in the second possible implementation of the first aspect, the step of obtaining second wake-up data by performing second-level processing based on the first microphone data when the first wake-up data indicates successful pre-wake-up includes:
[0018] When the first wake-up data indicates that the pre-wake-up is successful, the second separation data is obtained by calling the pre-trained second-level separation module based on the multi-channel feature data and the first separation data. The second-level separation module is used to perform the second-level separation processing.
[0019] Based on the multi-channel feature data, the first separation data, and the second separation data, the second wake-up data is obtained by calling the pre-trained second-level wake-up module. The second-level wake-up module is used to perform the second-level wake-up processing.
[0020] In this implementation, when the first wake-up data indicates successful pre-wake-up, the second-level separation module is called to output the second separation data based on the multi-channel feature data and the first separation data. The second-level wake-up module is then called to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data. This achieves second-level separation processing and second-level wake-up processing of the first microphone data based on the first separation data output by the first-level separation module in the second-level scenario. In other words, the first microphone data is re-confirmed for wake-up, which ensures a high wake-up rate while reducing the false wake-up rate, further improving the voice wake-up effect.
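The data dependencies between the two levels can be sketched as below. The module objects are stand-in callables supplied by the caller, and the dictionary keys, `run_pipeline`, and `pre_wake_ok` are hypothetical names introduced only for this illustration. The point shown is which earlier outputs each module consumes.

```python
from typing import Any, Callable, Dict, Optional


def run_pipeline(
    mic_data: Any,
    modules: Dict[str, Callable[..., Any]],
    pre_wake_ok: Callable[[Any], bool],
) -> Dict[str, Optional[Any]]:
    """Wire the modules together in the order described in the text."""
    # Preprocessing produces the multi-channel feature data shared by all modules.
    features = modules["preprocess"](mic_data)

    # First level: separation sees the features; wake-up sees features + separation.
    first_separation = modules["separate_1"](features)
    first_wakeup = modules["wake_1"](features, first_separation)

    if not pre_wake_ok(first_wakeup):
        return {"first_wakeup": first_wakeup, "second_wakeup": None}

    # Second level: separation reuses the first separation data as an extra input,
    # and wake-up receives the features plus both sets of separation data.
    second_separation = modules["separate_2"](features, first_separation)
    second_wakeup = modules["wake_2"](features, first_separation, second_separation)
    return {"first_wakeup": first_wakeup, "second_wakeup": second_wakeup}
```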
[0021] In conjunction with the second possible implementation of the first aspect, in the third possible implementation of the first aspect, the first-level separation process is a streaming sound source separation process, and the first-level wake-up process is a streaming sound source wake-up process; and / or,
[0022] The second-level separation process is an offline sound source separation process, and the second-level wake-up process is an offline sound source wake-up process.
[0023] In this implementation, the first-level scenario is the first-level streaming scenario, and the second-level scenario is the second-level offline scenario. Since the first-level separation and wake-up scheme is designed for streaming, it generally sacrifices separation performance to ensure a high wake-up rate, but it also brings a high false wake-up rate. Therefore, when the first wake-up data indicates that the pre-wake-up is successful, the first microphone data is then processed offline in the second-level offline scenario for the second-level separation and second-level wake-up. This can achieve better separation performance, ensure a high wake-up rate while reducing the false wake-up rate, and further improve the voice wake-up effect.
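One way to picture the streaming and offline split is shown below. It is a sketch under stated assumptions: the source does not say how audio is retained between the levels, so the bounded history buffer, the class `StreamingPreWake`, the function `listen`, and the scoring callables are all hypothetical. The first level scores each chunk as it arrives, and the second level runs once over the buffered segment after a successful pre-wake-up.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple

Chunk = List[float]


class StreamingPreWake:
    """First level: scores chunks one at a time and keeps recent audio."""

    def __init__(self, score_chunk: Callable[[Chunk], float],
                 threshold: float, history_chunks: int) -> None:
        self.score_chunk = score_chunk
        self.threshold = threshold
        # Bounded buffer (an assumption): old chunks are dropped automatically.
        self.history: Deque[Chunk] = deque(maxlen=history_chunks)

    def push(self, chunk: Chunk) -> bool:
        self.history.append(chunk)
        return self.score_chunk(chunk) >= self.threshold

    def buffered_segment(self) -> List[float]:
        return [sample for chunk in self.history for sample in chunk]


def listen(chunks: Iterable[Chunk], pre_wake: StreamingPreWake,
           confirm_offline: Callable[[List[float]], bool]
           ) -> Tuple[Optional[int], bool]:
    """Return (index of the pre-wake-up chunk, confirmed?) or (None, False)."""
    for index, chunk in enumerate(chunks):
        if pre_wake.push(chunk):
            # Second level: offline pass over the whole buffered segment.
            return index, confirm_offline(pre_wake.buffered_segment())
    return None, False
```

Because the offline pass sees the complete segment rather than a single chunk, it can afford a heavier separation model than the streaming pass.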
[0024] Combining the second or third possible implementation of the first aspect, in the fourth possible implementation of the first aspect,
[0025] The first-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input multiple-output; and / or,
[0026] The second-level wake-up module includes wake-up models in the form of multiple-input single-output or multiple-input multiple-output.
[0027] In this implementation, the first-level wake-up module and / or the second-level wake-up module are multi-input wake-up modules. Compared with the single-input wake-up modules in related technologies, this not only saves computation by avoiding the significant increase in, and waste of, computation caused by repeatedly calling the wake-up model, but also greatly improves wake-up performance by making better use of the correlation between the input parameters.
[0028] In combination with any one of the second to fourth possible implementations of the first aspect, in the fifth possible implementation of the first aspect, the first-level separation module and / or the second-level separation module adopt a dual-path conformer (dpconformer) network structure.
[0029] In this implementation, a dual-path Conformer network structure is provided, based on Conformer self-attention layer modeling. By alternating Conformer layers that operate within blocks and between blocks, it can model long sequences while avoiding the increase in computation caused by applying Conformer directly to the full sequence. Furthermore, owing to the strong modeling capability of Conformer networks, it can significantly improve the separation effect of the separation modules (i.e., the first-level separation module and / or the second-level separation module).
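The dual-path idea can be illustrated with plain lists. The sketch below is not a Conformer: `intra_layer` and `inter_layer` are placeholder callables standing in for the within-block and between-block layers, all function names are hypothetical, and the cost functions count only attended position pairs, which is a simplified model of self-attention cost.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def split_into_blocks(seq: Sequence[float], block_len: int) -> List[List[float]]:
    """Zero-pad the sequence and cut it into blocks of equal length."""
    pad = (-len(seq)) % block_len
    padded = list(seq) + [0.0] * pad
    return [padded[i:i + block_len] for i in range(0, len(padded), block_len)]


def dual_path_pass(blocks: List[List[float]],
                   intra_layer: Layer, inter_layer: Layer) -> List[List[float]]:
    """One alternation: process within each block, then across blocks."""
    # Intra-block path: each block is modeled independently (local context).
    blocks = [intra_layer(block) for block in blocks]
    # Inter-block path: the same position is gathered from every block
    # and modeled as one short sequence (global context).
    columns = [inter_layer(list(col)) for col in zip(*blocks)]
    return [list(row) for row in zip(*columns)]


def pairs_full(seq_len: int) -> int:
    """Position pairs attended when attention spans the full sequence."""
    return seq_len * seq_len


def pairs_dual_path(seq_len: int, block_len: int) -> int:
    """Position pairs attended by one intra-block plus one inter-block pass."""
    n_blocks = -(-seq_len // block_len)  # ceiling division
    intra = n_blocks * block_len * block_len
    inter = block_len * n_blocks * n_blocks
    return intra + inter
```

Under this simplified count, a sequence of 1000 frames cut into blocks of 50 gives 20 blocks, so the dual-path pass touches 20 × 50² + 50 × 20² = 70,000 pairs, against 1,000,000 for attention over the full sequence.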
[0030] In conjunction with any one of the second to fifth possible implementations of the first aspect, in the sixth possible implementation of the first aspect, the first-level separation module and / or the second-level separation module are separation modules for performing at least one task, the at least one task including a sound source separation task alone, or including the sound source separation task and other tasks;
[0031] The other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0032] In this implementation, a multi-task design scheme is provided for sound source separation tasks and other tasks. For example, other tasks include at least one of sound source localization tasks, specific person extraction tasks, specific direction extraction tasks, and specific person confirmation tasks. The sound source separation results can be associated with other information and provided to downstream tasks or lower-level wake-up modules, thereby improving the output effect of the separation module (i.e., the first-level separation module and / or the second-level separation module).
[0033] In combination with any one of the second to sixth possible implementations of the first aspect, in the seventh possible implementation of the first aspect, the first-level wake-up module and / or the second-level wake-up module are wake-up modules for performing at least one task, wherein the at least one task includes a wake-up task alone, or includes the wake-up task and other tasks;
[0034] The other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0035] This implementation provides a multi-task design scheme for the sound source wake-up task and other tasks. These other tasks include at least one of the following: sound source localization, specific person extraction, specific direction extraction, and specific person confirmation. The sound source wake-up result can be correlated with other information and provided to downstream tasks, improving the output performance of the wake-up module (i.e., the first-level wake-up module and / or the second-level wake-up module). For example, if the other task is sound source localization, the wake-up module can provide more accurate directional information while providing the sound source wake-up result. Compared with related technologies that directly use multiple fixed beams in space, this ensures a more accurate directional estimation effect.
[0036] In an eighth possible implementation of the first aspect, combining any one of the first to seventh possible implementations, the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the step of calling the pre-trained first-level separation module to output the first separation data based on the multi-channel feature data includes:
[0037] The multi-channel feature data is input into the first-level multi-feature fusion model to output the first single-channel feature data;
[0038] The first single-channel feature data is input into the first-level separation model to obtain the first separation data.
[0039] This implementation provides a multi-channel feature data fusion mechanism to avoid manual selection of feature data in related technologies. The first-level multi-feature fusion model automatically learns the interrelationships between feature channels and the contribution of each feature to the final separation effect, further ensuring the separation effect of the first-level separation model.
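The source does not specify the internal form of the multi-feature fusion model at this point. As one assumed possibility, the sketch below fuses the feature channels into a single channel with softmax-normalized weights. In a trained model the `logits` would be learned parameters; here they are passed in by hand, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(channels: Sequence[Sequence[float]],
                  logits: Sequence[float]) -> List[float]:
    """Combine several equal-length feature channels into one channel.

    Each channel gets one weight; the weights are positive and sum to 1,
    so they can be read as the contribution of each feature to the output.
    """
    if len(channels) != len(logits):
        raise ValueError("one logit is required per feature channel")
    weights = softmax(logits)
    length = len(channels[0])
    return [
        sum(w * channel[t] for w, channel in zip(weights, channels))
        for t in range(length)
    ]
```

With equal logits every channel contributes equally. Training would shift the logits so that channels that help separation receive larger weights, which replaces manual feature selection.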
[0040] In a ninth possible implementation of the first aspect, combining any one of the second to eighth possible implementations, the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the step of calling the pre-trained second-level separation module to output the second separation data based on the multi-channel feature data and the first separation data includes:
[0041] The multi-channel feature data and the first separation data are input into the second-level multi-feature fusion model to output the second single-channel feature data;
[0042] The second single-channel feature data is input into the second-level separation model to obtain the second separation data.
[0043] This implementation provides a multi-channel feature data fusion mechanism to avoid manual selection of feature data in related technologies. The second-level multi-feature fusion model automatically learns the interrelationships between feature channels and the contribution of each feature to the final separation effect, further ensuring the separation effect of the second-level separation model.
[0044] In a tenth possible implementation of the first aspect, combining any one of the first to ninth possible implementations, the first-level wake-up module includes a first wake-up model in the form of multiple inputs and a single output. The step of calling the pre-trained first-level wake-up module to output the first wake-up data based on the multi-channel feature data and the first separation data includes:
[0045] The multi-channel feature data and the first separation data are input into the first wake-up model to output the first wake-up data. The first wake-up data includes a first confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0046] In this implementation, a first wake-up model in the form of multiple inputs and single output is provided. Since the first wake-up model is a multi-input model, it avoids the problem of significantly increased computational load and waste caused by repeated calls to the wake-up model in related technologies, saves computational resources, and improves the processing efficiency of the first wake-up model. Furthermore, due to better utilization of the correlation between various input parameters, the wake-up performance of the first wake-up model is greatly improved.
[0047] In conjunction with any one of the first to ninth possible implementations of the first aspect, in the eleventh possible implementation of the first aspect, the first-level wake-up module includes a first wake-up model in the form of multiple inputs and multiple outputs and a first post-processing module. The step of calling the pre-trained first-level wake-up module to output the first wake-up data based on the multi-channel feature data and the first separation data includes:
[0048] The multi-channel feature data and the first separation data are input into the first wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0049] The phoneme sequence information corresponding to each of the multiple sound source data is input into the first post-processing module, and the first wake-up data is output. The first wake-up data includes the second confidence level corresponding to each of the multiple sound source data. The second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0050] In this implementation, a first wake-up model with multiple inputs and multiple outputs is provided. On the one hand, since the first wake-up model is a multi-input model, it avoids the problem of significantly increased computational load and waste caused by repeated calls to the wake-up model in related technologies, thus saving computational resources and improving the processing efficiency of the first wake-up model. On the other hand, since the first wake-up model is a multi-output model, it can simultaneously output the phoneme sequence information corresponding to multiple sound source data, thereby avoiding the situation where the wake-up rate is low due to mutual interference between various sound source data, and further ensuring the subsequent wake-up rate.
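The post-processing step, which turns one phoneme sequence per sound source into one confidence per sound source, can be sketched as below. The similarity measure is a stand-in: the source describes the confidence as an acoustic feature similarity but does not define how it is computed, so this example uses the matching ratio from Python's `difflib` instead. The phoneme labels, function names, and threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def post_process(phoneme_seqs: Sequence[Sequence[str]],
                 wake_word: Sequence[str]) -> List[float]:
    """Return one confidence in [0, 1] per sound source.

    Stand-in scoring: the ratio of matching phonemes between the decoded
    sequence and the wake-word phoneme sequence.
    """
    return [
        SequenceMatcher(None, list(seq), list(wake_word)).ratio()
        for seq in phoneme_seqs
    ]


def any_source_wakes(confidences: Sequence[float], threshold: float) -> bool:
    """Pre-wake-up succeeds if at least one source is confident enough."""
    return any(c >= threshold for c in confidences)


# Hypothetical example: three separated sources, one of which carries the wake word.
WAKE_WORD = ["n", "i", "h", "ao"]
sources = [
    ["n", "i", "h", "ao"],   # target speaker
    ["t", "a"],              # interfering speaker
    [],                      # noise-only source, nothing decoded
]
scores = post_process(sources, WAKE_WORD)
```

Scoring each source separately means an interfering speaker lowers only its own confidence and does not mask the source that carries the wake-up word.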
[0051] In conjunction with any one of the second to eleventh possible implementations of the first aspect, in the twelfth possible implementation of the first aspect, the second-level wake-up module includes a second wake-up model in the form of multiple inputs and a single output. The step of calling the pre-trained second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data includes:
[0052] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model to output the second wake-up data. The second wake-up data includes a third confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0053] In this implementation, a second wake-up model in the form of multiple inputs and single output is provided. Since the second wake-up model is a multi-input model, it avoids the problem of significantly increased computational load and waste caused by repeated calls to the wake-up model in related technologies, saves computational resources, and improves the processing efficiency of the second wake-up model. Furthermore, by making better use of the correlation between various input parameters, the wake-up performance of the second wake-up model is greatly improved.
[0054] In a thirteenth possible implementation of the first aspect, combining any one of the second to eleventh possible implementations, the second-level wake-up module includes a second wake-up model in the form of multiple inputs and multiple outputs and a second post-processing module. The step of calling the pre-trained second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data includes:
[0055] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0056] The phoneme sequence information corresponding to each of the multiple sound source data is input into the second post-processing module, and the second wake-up data is output. The second wake-up data includes the fourth confidence level corresponding to each of the multiple sound source data. The fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0057] In this implementation, a second wake-up model with multiple inputs and multiple outputs is provided. On the one hand, since the second wake-up model is a multi-input model, it avoids the problem of significantly increased computational load and waste caused by repeated calls to the wake-up model in related technologies, thus saving computational resources and improving the processing efficiency of the second wake-up model. On the other hand, since the second wake-up model is a multi-output model, it can simultaneously output the phoneme sequence information corresponding to multiple sound source data, thereby avoiding the situation where the wake-up rate is low due to mutual interference between various sound source data, and further ensuring the subsequent wake-up rate.
[0058] In a second aspect, embodiments of this application provide a voice wake-up device, the device comprising: an acquisition module, a first-level processing module, a second-level processing module, and a determination module;
[0059] The acquisition module is used to acquire the original first microphone data;
[0060] The first-level processing module is used to perform first-level processing based on the first microphone data to obtain first wake-up data. The first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model.
[0061] The second-level processing module is used to perform second-level processing on the first microphone data to obtain second wake-up data when the first wake-up data indicates that the pre-wake-up is successful. The second-level processing includes second-level separation processing and second-level wake-up processing based on a neural network model.
[0062] The determination module is used to determine the wake-up result based on the second wake-up data.
[0063] In conjunction with the second aspect, in one possible implementation of the second aspect, the device further includes a preprocessing module, and the first-level processing module further includes a first-level separation module and a first-level wake-up module;
[0064] The preprocessing module is used to preprocess the first microphone data to obtain multi-channel feature data;
[0065] The first-level separation module is used to perform the first-level separation processing based on the multi-channel feature data and output the first separation data.
[0066] The first-level wake-up module is used to perform the first-level wake-up processing based on the multi-channel feature data and the first separation data, and output the first wake-up data.
[0067] In conjunction with the first possible implementation of the second aspect, in the second possible implementation of the second aspect, the second-level processing module further includes a second-level separation module and a second-level wake-up module;
[0068] The second-level separation module is used to perform second-level separation processing based on the multi-channel feature data and the first separation data when the first wake-up data indicates successful pre-wake-up, and output the second separation data.
[0069] The second-level wake-up module is used to perform second-level wake-up processing based on the multi-channel feature data, the first separation data, and the second separation data, and output the second wake-up data.
[0070] In conjunction with the second possible implementation of the second aspect, in the third possible implementation of the second aspect,
[0071] The first-level separation process is a streaming sound source separation process, and the first-level wake-up process is a streaming sound source wake-up process; and / or,
[0072] The second-level separation process is an offline sound source separation process, and the second-level wake-up process is an offline sound source wake-up process.
[0073] Combining the second or third possible implementation of the second aspect, in the fourth possible implementation of the second aspect,
[0074] The first-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input multiple-output; and / or,
[0075] The second-level wake-up module includes wake-up models in the form of multiple-input single-output or multiple-input multiple-output.
[0076] In combination with any one of the second to fourth possible implementations of the second aspect, in the fifth possible implementation of the second aspect, the first-level separation module and / or the second-level separation module adopt a dual-path conformer (dpconformer) network structure.
[0077] In conjunction with any one of the second to fifth possible implementations of the second aspect, in the sixth possible implementation of the second aspect, the first-level separation module and / or the second-level separation module are separation modules for performing at least one task, the at least one task including a sound source separation task alone, or including the sound source separation task and other tasks;
[0078] The other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0079] In combination with any one of the second to sixth possible implementations of the second aspect, in the seventh possible implementation of the second aspect, the first-level wake-up module and / or the second-level wake-up module are wake-up modules for performing at least one task, the at least one task including a wake-up task alone, or including the wake-up task and other tasks;
[0080] The other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0081] In conjunction with any one of the first to seventh possible implementations of the second aspect, in the eighth possible implementation of the second aspect, the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the first-level separation module is further configured to:
[0082] The multi-channel feature data is input into the first-level multi-feature fusion model to output the first single-channel feature data;
[0083] The first single-channel feature data is input into the first-level separation model to obtain the first separation data.
[0084] In a ninth possible implementation of the second aspect, combining any one of the second to eighth possible implementations, the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the second-level separation module is further used for:
[0085] The multi-channel feature data and the first separation data are input into the second-level multi-feature fusion model to output the second single-channel feature data;
[0086] The second single-channel feature data is input into the second-level separation model to obtain the second separation data.
[0087] In conjunction with any one of the first to ninth possible implementations of the second aspect, in the tenth possible implementation of the second aspect, the first-level wake-up module includes a first wake-up model in the form of multiple-input single-output, and the first-level wake-up module is further configured to:
[0088] The multi-channel feature data and the first separation data are input into the first wake-up model to output the first wake-up data. The first wake-up data includes a first confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0089] In conjunction with any one of the first to ninth possible implementations of the second aspect, in the eleventh possible implementation of the second aspect, the first-level wake-up module includes a first wake-up model in the form of multiple-input multiple-output and a first post-processing module. The first-level wake-up module is further configured to:
[0090] The multi-channel feature data and the first separation data are input into the first wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0091] The phoneme sequence information corresponding to each of the multiple sound source data is input into the first post-processing module, and the first wake-up data is output. The first wake-up data includes the second confidence level corresponding to each of the multiple sound source data. The second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0092] In conjunction with any one of the second to eleventh possible implementations of the second aspect, in the twelfth possible implementation of the second aspect, the second-level wake-up module includes a second wake-up model in the form of multiple-input single-output, and the second-level wake-up module is further used for:
[0093] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model to output the second wake-up data. The second wake-up data includes a third confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0094] In a thirteenth possible implementation of the second aspect, combining any one of the second to eleventh possible implementations, the second-level wake-up module includes a second wake-up model in the form of multiple-input multiple-output and a second post-processing module. The second-level wake-up module is further used for:
[0095] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0096] The phoneme sequence information corresponding to each of the multiple sound source data is input into the second post-processing module, and the second wake-up data is output. The second wake-up data includes the fourth confidence level corresponding to each of the multiple sound source data. The fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0097] In a third aspect, embodiments of this application provide an electronic device, the electronic device comprising:
[0098] a processor; and
[0099] a memory for storing processor-executable instructions;
[0100] wherein the processor is configured to implement the voice wake-up method provided by the first aspect or any possible implementation of the first aspect when executing the instructions.
[0101] In a fourth aspect, embodiments of this application provide a non-volatile computer-readable storage medium storing computer program instructions thereon, which, when executed by a processor, implement the voice wake-up method provided by the first aspect or any possible implementation of the first aspect.
[0102] In a fifth aspect, embodiments of this application provide a computer program product including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code is executed in an electronic device, the processor in the electronic device executes the voice wake-up method provided by the first aspect or any possible implementation of the first aspect.
[0103] In a sixth aspect, embodiments of this application provide a voice wake-up system for executing the voice wake-up method provided by the first aspect or any possible implementation of the first aspect.
Attached Figure Description
[0104] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this application together with the specification and serve to explain the principles of this application.
[0105] Figure 1 illustrates the relationship between the wake-up rate of electronic devices and the distance to the sound source in related technologies.
[0106] Figure 2 shows a schematic diagram of the structure of an electronic device provided in an exemplary embodiment of this application.
[0107] Figure 3 shows a flowchart of a voice wake-up method provided in an exemplary embodiment of this application.
[0108] Figure 4 shows a schematic diagram of the principle of a voice wake-up method provided in an exemplary embodiment of this application.
[0109] Figure 5 shows a schematic diagram of the structure of a dpconformer network provided in an exemplary embodiment of this application.
[0110] Figure 6 shows a schematic diagram of the principle of a two-level separation scheme provided in an exemplary embodiment of this application.
[0111] Figures 7 to 14 show schematic diagrams of the principles of several possible implementations of the first-level separation scheme provided in the exemplary embodiments of this application.
[0112] Figure 15 shows a schematic diagram of the principle of a two-level wake-up scheme provided in an exemplary embodiment of this application.
[0113] Figures 16 to 19 show schematic diagrams of the principles of several possible implementations of the first-level wake-up scheme provided in the exemplary embodiments of this application.
[0114] Figures 20 to 23 show schematic diagrams of the principle of the voice wake-up method in a single-microphone scenario provided in an exemplary embodiment of this application.
[0115] Figures 24 to 28 show schematic diagrams of the principle of the voice wake-up method in a multi-microphone scenario provided in an exemplary embodiment of this application.
[0116] Figure 29 shows a flowchart of a voice wake-up method provided in another exemplary embodiment of this application.
[0117] Figure 30 shows a block diagram of a voice wake-up device provided in an exemplary embodiment of this application.
Detailed Implementation
[0118] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.
[0119] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.
[0120] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented without certain specific details. In some instances, methods, means, components, and circuits well-known to those skilled in the art have not been described in detail in order to highlight the main points of this application.
[0121] Voice interaction technology is a crucial technology in modern electronic devices, including smartphones, speakers, televisions, robots, tablets, and in-vehicle systems. Voice wake-up is a key function of voice interaction technology. Using a specific wake-up word or command (such as "Hey Celia"), electronic devices in non-voice interaction states (e.g., sleep or other states) are activated, enabling voice recognition, voice search, dialogue, voice navigation, and other voice functions. This ensures the constant availability of voice interaction technology while avoiding power consumption issues or the potential for user privacy breaches caused by prolonged voice interaction.
[0122] In ideal environments (such as quiet environments where the user is close to the electronic device to be woken up), voice wake-up functionality meets user needs, achieving a wake-up rate of over 95%. However, real-world acoustic environments are often more complex. When the user is far from the electronic device (e.g., 3-5 meters) and there is background noise (e.g., television sounds, voices, background music, reverberation, echoes, etc.), the wake-up rate drops sharply. As shown in Figure 1, the wake-up rate of electronic devices decreases as the distance from the sound source increases, where the distance from the sound source is the distance between the user and the electronic device. In the test shown in Figure 1, the wake-up rate was 80% at a sound source distance of 0.5 meters, 65% at 1 meter, 30% at 3 meters, and 10% at 5 meters. Such a low wake-up rate results in poor voice wake-up performance of electronic devices.
[0123] While some progress has been made in improving wake-up rates through methods provided in related technologies, human voice recognition is relatively poor in the presence of background noise. This is especially true in scenarios with multiple sound sources of interference (such as interference from other speakers, background music, echo remnants in echo scenes, etc.), strong sound source interference, or far-field echo scenes, where the wake-up rate is even lower and there is a higher rate of false wake-ups.
[0124] The embodiments of this application design a two-level separation and wake-up scheme. In the first-level streaming scenario, the first-level separation and wake-up scheme is used to make a pre-wake-up judgment to ensure the wake-up rate is as high as possible, but this will bring a higher false wake-up rate. Therefore, after the pre-wake-up is successful, offline wake-up confirmation is performed in the second-level offline scenario to ensure a high wake-up rate while reducing the false wake-up rate, thereby obtaining a better voice wake-up effect.
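The control flow of the two-level scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: `stage1_score`, `stage2_score`, `second_threshold`, and `cache_size` are hypothetical names, the two scoring callables stand in for the trained first-level and second-level separation and wake-up modules, and the use of a second threshold for confirmation is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence

Segment = Sequence[float]


def two_level_wakeup(
    segments: Iterable[Segment],
    stage1_score: Callable[[Segment], float],        # hypothetical streaming scorer
    stage2_score: Callable[[List[Segment]], float],  # hypothetical offline scorer
    first_threshold: float,
    second_threshold: float,                         # assumed confirmation threshold
    cache_size: int = 100,                           # assumed cache length, in segments
) -> bool:
    """Return True if a pre-wake-up is confirmed by the second level."""
    cache: Deque[Segment] = deque(maxlen=cache_size)
    for segment in segments:
        cache.append(segment)
        # Level 1 (streaming): a permissive check that favours a high wake-up rate.
        if stage1_score(segment) > first_threshold:
            # Level 2 (offline): re-check the cached audio to reject false wake-ups.
            if stage2_score(list(cache)) > second_threshold:
                return True
    return False
```

In this sketch the loop keeps listening after a failed confirmation; that is a design choice of the example rather than something stated in the text.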
[0125] First, some terms used in the embodiments of this application will be introduced.
[0126] 1. Offline sound source wake-up processing: This refers to performing sound source wake-up processing on the audio content after acquiring the complete audio content. Offline sound source wake-up processing includes offline separation processing and offline wake-up processing.
[0127] 2. Streaming sound source wake-up processing (also known as online sound source wake-up processing): This refers to acquiring audio segments in real time or at preset time intervals and performing sound source wake-up processing on those segments. Streaming sound source wake-up processing includes streaming separation processing and streaming wake-up processing.
[0128] An audio segment is a run of consecutive samples collected in real time or at preset time intervals; for example, the preset time interval is 16 milliseconds. This application does not limit this.
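As a rough illustration of how a sample stream might be cut into such segments, the sketch below yields fixed-length segments of 16 milliseconds. The 16 kHz sample rate and the function name `iter_segments` are assumptions made for the example; the text only gives the 16 millisecond interval.

```python
from typing import Iterator, List, Sequence


def iter_segments(
    samples: Sequence[float],
    sample_rate_hz: int = 16000,  # assumed sample rate, not stated in the text
    segment_ms: int = 16,         # example interval given in the text
) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping segments; a trailing partial segment is dropped."""
    seg_len = sample_rate_hz * segment_ms // 1000  # 256 samples at 16 kHz
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        yield list(samples[start:start + seg_len])
```

A real streaming front end would typically buffer the trailing partial segment until more samples arrive instead of dropping it.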
[0129] 3. Multi-source separation technology: This refers to the technique of separating received single-microphone or multi-microphone speech signals into multiple sound source data. These multiple sound source data include the sound source data of the target object and the sound source data of interfering sound sources. Multi-source separation technology is used to separate the sound source data of the target object from the sound source data of interfering sound sources in order to better determine wake-up status.
[0130] 4. Wake-up technology, also known as Keyword Spotting (KWS), is used to determine whether the sound source data under test contains a preset wake-up word. The wake-up word can be a default setting or a user-defined setting. For example, the default fixed wake-up word might be "Xiaoyi Xiaoyi," which the user cannot change, and the wake-up scheme design often relies on specific training sample data. Alternatively, users can manually set personalized wake-up words. Regardless of the personalized wake-up word chosen by the user, a high wake-up rate is expected, while frequent model self-learning on the electronic device side is undesirable. Optionally, the modeling methods for wake-up technology include, but are not limited to, the following two possible implementation methods: The first is to build a wake-up module using whole words, such as using a fixed wake-up word as the output target of the wake-up module; the second is to build a wake-up module for phoneme recognition based on phoneme representation in general speech recognition, such as automatically constructing a corresponding personalized decoding map when supporting a fixed wake-up word or a user-defined wake-up word, and ultimately relying on the output decoding map of the wake-up module to determine the user's wake-up intention.
[0131] For the first possible implementation method, which uses fixed wake-up word modeling, in multi-source interference scenarios, the wake-up module requires a single output data point to indicate whether wake-up has occurred or whether a fixed wake-up word has been used. For the second possible implementation method, which uses phoneme modeling, in multi-source interference scenarios, the outputs of the wake-up module from multiple sound sources are meaningful and require separate decoding graphs to determine whether a custom wake-up word has been used. Therefore, in multi-source interference scenarios, the wake-up module using fixed wake-up word modeling is a multi-input, single-output model; while the wake-up module using phoneme modeling is a multi-input, multi-output model, with multiple output data points corresponding to the posterior probability sequences of phonemes from multiple sound sources.
[0132] Please refer to Figure 2, which illustrates a schematic diagram of the structure of an electronic device provided in an exemplary embodiment of this application.
[0133] The electronic device can be a terminal or a server. The terminal may be a mobile terminal or a fixed terminal, for example a mobile phone, speaker, television, robot, tablet, in-vehicle device, headphones, smart glasses, smartwatch, laptop, or desktop computer. The server can be a single server, a server cluster consisting of several servers, or a cloud computing service center.
[0134] Referring to Figure 2, the electronic device 200 may include one or more of the following components: processing component 202, memory 204, power supply component 206, multimedia component 208, audio component 210, input/output (I/O) interface 212, sensor component 214, and communication component 216.
[0135] Processing component 202 typically controls the overall operation of electronic device 200, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. Processing component 202 may include one or more processors 220 to execute instructions to perform all or part of the steps of the voice wake-up method provided in the embodiments of this application. Furthermore, processing component 202 may include one or more modules to facilitate interaction between processing component 202 and other components. For example, processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.
[0136] Memory 204 is configured to store various types of data to support the operation of electronic device 200. Examples of such data include instructions for any application or method operating on electronic device 200, contact data, phonebook data, messages, pictures, multimedia content, etc. Memory 204 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0137] Power supply component 206 provides power to various components of electronic device 200. Power supply component 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 200.
[0138] Multimedia component 208 includes a screen that provides an output interface between the electronic device 200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 208 includes a front-facing camera and/or a rear-facing camera. When the electronic device 200 is in an operating mode, such as a shooting mode or a multimedia content mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities. Optionally, the electronic device 200 may acquire video information through the cameras (front-facing camera and/or rear-facing camera).
[0139] Audio component 210 is configured to output and/or input audio signals. For example, audio component 210 includes a microphone (MIC) configured to receive external audio signals when electronic device 200 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 204 or transmitted via communication component 216. Optionally, electronic device 200 acquires raw first microphone data through the microphone. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
[0140] I/O interface 212 provides an interface between processing component 202 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0141] Sensor component 214 includes one or more sensors for providing state assessments of various aspects of electronic device 200. For example, sensor component 214 can detect the on/off state of electronic device 200, the relative positioning of components such as the display and keypad of electronic device 200, changes in position of electronic device 200 or a component of electronic device 200, the presence or absence of user contact with electronic device 200, orientation or acceleration/deceleration of electronic device 200, and temperature changes of electronic device 200. Sensor component 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor component 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor component 214 may also include an accelerometer, gyroscope, magnetometer, pressure sensor, or temperature sensor.
[0142] Communication component 216 is configured to facilitate wired or wireless communication between electronic device 200 and other devices. Electronic device 200 can access wireless networks based on communication standards, such as WiFi, 2G, or 3G, or combinations thereof. In one exemplary embodiment, communication component 216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 216 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0143] In an exemplary embodiment, the electronic device 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the voice wake-up method provided in the embodiments of this application.
[0144] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 204 including computer program instructions, which can be executed by the processor 220 of the electronic device 200 to complete the voice wake-up method provided in the embodiments of this application.
[0145] The voice wake-up method provided in this application will now be described using several exemplary embodiments.
[0146] Please refer to Figure 3, which illustrates a flowchart of a voice wake-up method provided in an exemplary embodiment of this application. In this embodiment, the method is described as applied to the electronic device shown in Figure 2. The method includes the following steps.
[0147] Step 301: Obtain the raw first microphone data.
[0148] The electronic device acquires microphone output signals through a single microphone or multiple microphones and uses the microphone output signals as the raw first microphone data.
[0149] Optionally, the first microphone data includes the sound source data of the target object and the sound source data of the interfering sound source, which includes at least one of the following: the voice of other objects besides the target object, background music, and ambient noise.
[0150] Step 302: Preprocess the first microphone data to obtain multi-channel feature data.
[0151] To address issues such as acoustic echo, reverberation, and signal amplitude that may arise in real-world acoustic scenarios, the electronic device preprocesses the first microphone data to obtain multi-channel feature data. Optionally, the preprocessing includes at least one of the following: acoustic echo cancellation (AEC), dereverberation, voice activity detection (VAD), automatic gain control (AGC), and beam filtering.
[0152] Optionally, the multi-channel feature data may comprise multiple groups of multi-channel features. The multi-channel feature data includes at least one of the following: multi-channel time-domain signal data, multi-channel spectrum data, multiple groups of inter-channel phase difference (IPD) data, multi-directional feature data, and multi-beam feature data.
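To make one of these feature types concrete, the following sketch computes inter-channel phase difference (IPD) data from multi-microphone signals. It is an illustration under assumptions, not the feature extractor of this application: the frame length, hop size, Hann window, and the helper names `stft` and `ipd_features` are chosen for the example, and the microphone pairs are supplied by the caller.

```python
import numpy as np


def stft(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Hann-windowed short-time Fourier transform, shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def ipd_features(mics: np.ndarray, pairs) -> np.ndarray:
    """mics: (channels, samples). Returns wrapped IPD in radians, shape (pairs, frames, bins)."""
    spectra = np.stack([stft(channel) for channel in mics])
    # angle(X_i * conj(X_j)) equals phase(X_i) - phase(X_j), wrapped to (-pi, pi].
    return np.stack([np.angle(spectra[i] * np.conj(spectra[j])) for i, j in pairs])
```

For example, `ipd_features(mics, [(0, 1), (0, 2)])` would produce two groups of IPD data for a three-microphone array. Systems often feed the cosine and sine of the IPD to a network rather than the raw wrapped angle.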
[0153] Step 303: Perform first-level separation processing based on multi-channel feature data to obtain first-separation data.
[0154] The first-level separation process can also be called the first-level neural network separation process. The first-level separation process is a separation process based on a neural network model, that is, the first-level separation process includes calling the neural network model to perform sound source separation processing.
[0155] Optionally, based on the multi-channel feature data, the electronic device calls a pre-trained first-level separation module to output the first separation data. The first-level separation module performs the first-level separation processing, which is a streaming sound source separation process. Optionally, the first-level separation module uses a dpconformer network structure.
[0156] The ways in which the electronic device calls the pre-trained first-level separation module based on the multi-channel feature data to output the first separation data include, but are not limited to, the following two possible implementations:
[0157] In one possible implementation, the first-level separation module includes a first-level separation model. The electronic device splices the multi-channel feature data and inputs the spliced multi-channel feature data into the first-level separation model, which outputs the first separation data.
[0158] In another possible implementation, the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model. The electronic device inputs the multi-channel feature data into the first-level multi-feature fusion model, which outputs first single-channel feature data; the first single-channel feature data is then input into the first-level separation model, which outputs the first separation data. For ease of explanation, the second possible implementation is used as an example below. This application does not limit this.
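The first of the two implementations, splicing, can be pictured as a concatenation of per-frame feature groups along the feature axis. The sketch below assumes each group is an array of shape (frames, dimension); the function name `splice_features` is hypothetical. The fusion-model variant is a trained network and is not sketched here.

```python
import numpy as np


def splice_features(feature_groups):
    """Concatenate feature groups of shape (frames, dim_k) into (frames, sum of dim_k)."""
    frame_counts = {group.shape[0] for group in feature_groups}
    if len(frame_counts) != 1:
        raise ValueError("all feature groups must share the same number of frames")
    return np.concatenate(feature_groups, axis=-1)
```

With splicing, the input width of the separation model grows with the number of feature groups, whereas a fusion model reduces them to a single-channel representation first.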
[0159] Optionally, the first-level multi-feature fusion model is the conformer feature fusion model.
[0160] The first-level separation model employs a streaming network structure. Optionally, the first-level separation model may use a dpconformer network structure.
[0161] The first-level separation model is a neural network model, that is, a model obtained by training a neural network. Optionally, the first-level separation model can employ any of the following network structures: deep neural network (DNN), long short-term memory (LSTM) network, convolutional neural network (CNN), fully convolutional time-domain audio separation network (Conv-TasNet), and dual-path recurrent neural network (DPRNN). It should be noted that the first-level separation model can also employ other network structures suitable for streaming scenarios; this embodiment does not limit this.
[0162] The separation task design of the first-level separation module can be a single-task design of the streaming sound source separation task, or a multi-task design of the streaming sound source separation task and other tasks. Optionally, other tasks include the orientation estimation task corresponding to each of the multiple sound sources and/or the sound source object recognition task corresponding to each of the multiple sound sources.
[0163] In one possible implementation, a first-level separation module is used to perform blind separation of multiple sound source data, and the first separation data includes the separated multiple sound source data.
[0164] In another possible implementation, the first-level separation module is used to extract the sound source data of the target object from multiple sound source data, and the first separation data includes the extracted sound source data of the target object.
[0165] In another possible implementation, the first-level separation module is used to extract the sound source data of the target object from multiple sound source data based on video information. The first separation data includes the extracted sound source data of the target object. For example, the video information includes the visual data of the target object.
[0166] In another possible implementation, the first-level separation module is used to extract at least one sound source data in the target direction from multiple sound source data, and the first separation data includes at least one sound source data in the target direction.
[0167] It should be noted that details of the possible implementations of the separation task design are given in the relevant descriptions of the following embodiments and are not introduced here.
[0168] Optionally, for blind separation tasks that require separating data from multiple sound sources, the cost function in the first-level separation module is a function designed based on the Permutation Invariant Training (PIT) criterion.
[0169] Optionally, during training with the cost function, the electronic device sorts multiple sample sound source data in chronological order of the start times of their speech segments and calculates the loss value of the cost function based on the sorted sample sound source data. The first-level separation module is then trained based on the calculated loss value.
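The two ways of pairing model outputs with reference sources mentioned above can be contrasted in a short sketch. It uses mean squared error as the per-source loss, which is an assumption; the function names are hypothetical, and the start-time variant reflects one reading of the sorting step, in which the k-th output is matched to the k-th earliest-starting reference.

```python
from itertools import permutations

import numpy as np


def mse(a, b) -> float:
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))


def pit_loss(estimates, references) -> float:
    """Permutation invariant training: minimum mean loss over all output-to-reference assignments."""
    n = len(references)
    return min(
        sum(mse(estimates[p], references[i]) for i, p in enumerate(perm)) / n
        for perm in permutations(range(n))
    )


def start_time_ordered_loss(estimates, references, start_times) -> float:
    """Fixed assignment: references sorted by speech start time are matched to outputs in order."""
    order = sorted(range(len(references)), key=lambda i: start_times[i])
    return sum(mse(estimates[k], references[i]) for k, i in enumerate(order)) / len(order)
```

The permutation search costs n! evaluations for n sources, which is manageable for two or three sources; a fixed start-time ordering avoids the search but requires reliable start times in the training labels.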
[0170] Optionally, after obtaining multiple sound source data through the first-level separation module, the multiple sound source data can be directly input into the next-level processing model, namely the first-level wake-up module.
[0171] Optionally, for multi-microphone scenarios, after obtaining multiple sound source data through the first-level separation module, the statistical information of the multiple sound source data is calculated, the statistical information is input into the beamforming model to output beamforming data, and the beamforming data is input into the next-level processing model, namely the first-level wake-up module.
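One common way to turn separation results into beamforming data is to compute spatial covariance matrices as the statistical information and derive a minimum variance distortionless response (MVDR) beamformer from them. The sketch below shows that technique for a single frequency bin. MVDR, the time-frequency masks, and all function names are assumptions chosen for illustration; the text does not specify which statistics or which beamforming model are used.

```python
import numpy as np


def spatial_covariance(stft_bin: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """stft_bin: (channels, frames) complex values of one frequency bin; mask: (frames,) in [0, 1]."""
    weighted = stft_bin * mask  # mask broadcasts across channels
    return weighted @ stft_bin.conj().T / max(float(mask.sum()), 1e-8)


def mvdr_weights(phi_target: np.ndarray, phi_noise: np.ndarray, ref_channel: int = 0) -> np.ndarray:
    """MVDR weights w = (phi_noise^-1 phi_target) u / trace(phi_noise^-1 phi_target)."""
    numerator = np.linalg.solve(phi_noise, phi_target)
    return numerator[:, ref_channel] / np.trace(numerator)


def apply_beamformer(weights: np.ndarray, stft_bin: np.ndarray) -> np.ndarray:
    """Return the beamformed single-channel signal, shape (frames,)."""
    return weights.conj() @ stft_bin
```

In such a setup the masks would come from the separated sound source data (for example, target versus everything else), and the same computation would be repeated for every frequency bin.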
[0172] Step 304: Perform first-level wake-up processing based on multi-channel feature data and first separation data to obtain first wake-up data.
[0173] Optionally, the electronic device, based on multi-channel feature data and the first separation data, calls the pre-trained first-level wake-up module to output the first wake-up data. The first-level wake-up module performs first-level wake-up processing, which is a streaming sound source wake-up process.
[0174] It should be noted that the descriptions of multi-channel feature data and the first separation data can be found in the relevant steps above, and will not be repeated here.
[0175] Optionally, the electronic device inputs multi-channel feature data and first separation data into the first-level wake-up module to output first wake-up data.
[0176] Optionally, the wake-up scheme is a multi-input single-output streaming wake-up scheme (MISO-KWS), meaning the first-level wake-up module uses a fixed wake-up word for modeling. The first-level wake-up module is a multi-input single-output wake-up model, with input parameters including multi-channel feature data and first separation data, and output parameters including a first confidence level. The first confidence level indicates the probability that the original first microphone data includes the preset wake-up word.
[0177] Optionally, the first confidence level is a multidimensional vector, where each dimension of the multidimensional vector has a probability value between 0 and 1.
[0178] Optionally, the wake-up scheme is a multiple-input multiple-output streaming wake-up scheme (MIMO-KWS), where the first-level wake-up module uses phoneme modeling. This first-level wake-up module includes a MIMO-KWS wake-up model and a first post-processing module (e.g., a decoder). The input parameters of the first-level wake-up module (i.e., the input parameters of the wake-up model) include multi-channel feature data and first separation data. The output parameters of the wake-up model include phoneme sequence information corresponding to each of the multiple sound source data. The phoneme sequence information corresponding to the sound source data indicates the probability distribution of multiple phonemes in the sound source data; that is, the phoneme sequence information includes the probability values corresponding to each of the multiple phonemes. The output parameters of the first-level wake-up module (i.e., the output parameters of the first post-processing module) include a second confidence level corresponding to each of the multiple sound source data. The second confidence level indicates the acoustic feature similarity between the sound source data and the preset wake-up word.
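The role of the first post-processing module, turning per-source phoneme probabilities into per-source confidence levels, can be sketched with a simple Viterbi-style alignment. This is a stand-in for a real decoder, under stated assumptions: the wake-up word is given as a list of phoneme indices, the whole input is aligned to the wake-up word with no filler or silence model, and the confidence is the geometric mean of the probabilities along the best path. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def keyword_confidence(posteriors: Sequence[Sequence[float]], keyword: Sequence[int]) -> float:
    """posteriors[t][p] is the probability of phoneme p at frame t; keyword lists phoneme indices."""
    n_frames, n_states = len(posteriors), len(keyword)
    if n_states == 0 or n_frames < n_states:
        return 0.0
    neg_inf = float("-inf")
    # best[s]: best summed log-probability of a monotonic path ending in keyword state s.
    best = [neg_inf] * n_states
    best[0] = math.log(max(posteriors[0][keyword[0]], 1e-10))
    for t in range(1, n_frames):
        new = [neg_inf] * n_states
        for s in range(n_states):
            prev = max(best[s], best[s - 1] if s > 0 else neg_inf)  # stay or advance
            if prev > neg_inf:
                new[s] = prev + math.log(max(posteriors[t][keyword[s]], 1e-10))
        best = new
    return math.exp(best[-1] / n_frames)


def second_confidences(per_source_posteriors, keyword: Sequence[int]) -> List[float]:
    """One confidence level per separated sound source."""
    return [keyword_confidence(p, keyword) for p in per_source_posteriors]
```

Because the wake-up word enters only as a phoneme index list, a user-defined wake-up word changes the `keyword` argument and not the acoustic model, which matches the motivation given for phoneme modeling.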
[0179] The preset wake-up word can be a fixed wake-up word set by default or a wake-up word set by the user. This application does not limit this aspect.
[0180] The first-level wake-up module adopts a streaming network structure. Optionally, the first-level wake-up module adopts a streaming dpconformer network structure.
[0181] Optionally, the first-level wake-up module can employ any of the DNN, LSTM, and CNN network structures. It should be noted that the first-level wake-up module can also employ other network structures suitable for streaming scenarios. The network structure of the first-level wake-up module can be understood by analogy with the network structure of the first-level separation module, and this embodiment does not limit this.
[0182] The wake-up task design of the first-level wake-up module can be a single-task design of the wake-up task or a multi-task design of the wake-up task and other tasks. Optionally, other tasks include orientation estimation tasks and/or sound source object recognition tasks.
[0183] Optionally, the first wake-up data includes a first confidence level, which indicates the probability that the original first microphone data includes a preset wake-up word. Optionally, the first wake-up data includes second confidence levels corresponding to multiple sound source data, which indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0184] Optionally, the first wake-up data may also include the location information corresponding to the wake-up event and/or the object information of the wake-up object, wherein the object information is used to indicate the object identity of the sound source data.
[0185] Step 305: Determine whether to pre-wake up based on the first wake-up data.
[0186] The electronic device sets a first threshold for the first-level wake-up module. The first threshold is the threshold that the first wake-up data must exceed for the electronic device to be successfully pre-woken.
[0187] In one possible implementation, the first wake-up data includes a first confidence level, which indicates the probability that the original first microphone data includes a preset wake-up word. When the first confidence level in the first wake-up data is greater than a first threshold, the pre-wake-up is determined to be successful, i.e., the first-level streaming wake-up is successful. The cached multi-channel feature data and the first separation data are input to the second-level separation module, and step 306 is executed. When the first confidence level is less than or equal to the first threshold, the pre-wake-up is determined to be unsuccessful, i.e., the first-level streaming wake-up is unsuccessful, and the process ends.
[0188] In another possible implementation, the first wake-up data includes the second confidence levels corresponding to multiple sound source data. The second confidence levels are used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word. When any second confidence level in the first wake-up data is greater than the first threshold value, the pre-wake-up is determined to be successful, that is, the first-level streaming wake-up is successful. The cached multi-channel feature data and the first separation data are input to the second-level separation module, and step 306 is executed. When all the second confidence levels in the first wake-up data are less than or equal to the first threshold value, the pre-wake-up is determined to be unsuccessful, that is, the first-level streaming wake-up is unsuccessful, and the process ends.
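The two pre-wake-up decision rules of step 305 can be sketched as follows. This is a minimal illustration only: the function names `pre_wake_single` and `pre_wake_multi` are hypothetical and do not appear in the application, and the confidence values are assumed to be plain floating-point numbers.

```python
from typing import Sequence


def pre_wake_single(first_confidence: float, first_threshold: float) -> bool:
    """Rule of [0187]: succeed only if the confidence is strictly greater
    than the first threshold (equal counts as unsuccessful)."""
    return first_confidence > first_threshold


def pre_wake_multi(second_confidences: Sequence[float],
                   first_threshold: float) -> bool:
    """Rule of [0188]: succeed if ANY per-source confidence is strictly
    greater than the first threshold; fail if all are <= threshold."""
    return any(c > first_threshold for c in second_confidences)
```

Only when one of these functions returns `True` would the cached multi-channel feature data and the first separation data be passed on to step 306.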
[0189] Step 306: Perform a second-level separation process based on the multi-channel feature data and the first separation data to obtain the second separation data.
[0190] The second-level separation process can also be called the second-level neural network separation process. The second-level separation process is a separation process based on a neural network model, that is, the second-level separation process includes calling the neural network model to perform sound source separation processing.
[0191] Optionally, the electronic device, based on the multi-channel feature data and the first separation data, calls a pre-trained second-level separation module to output the second separation data. The second-level separation module performs a second-level separation process, which is an offline sound source separation process.
[0192] Optionally, the first wake-up data also includes the directional information corresponding to the wake-up word. The electronic device calls the second-level separation module to output the second separation data based on the multi-channel feature data, the first separation data and the directional information corresponding to the wake-up word.
[0193] It should be noted that the descriptions of the first separation data, multi-channel feature data, and first wake-up data can be found in the relevant descriptions in the steps above, and will not be repeated here. For ease of explanation, the following example will be used to illustrate how an electronic device uses multi-channel feature data and the first separation data to call a pre-trained second-level separation module to output the second separation data.
[0194] Optionally, the second-level separation module adopts the dpconformer network structure.
[0195] The electronic device, based on multi-channel feature data and the first separation data, calls a pre-trained second-level separation module to output the second separation data, including but not limited to the following two possible implementation methods:
[0196] In one possible implementation, the second-level separation module includes a second-level separation model. The electronic device splices the multi-channel features and the first separation data, and inputs the spliced data into the second-level separation model to output the second separation data.
[0197] In another possible implementation, the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model. The electronic device inputs multi-channel feature data and first separation data into the second-level multi-feature fusion model and outputs second single-channel feature data; the second single-channel feature data is then input into the second-level separation model and outputs second separation data. For ease of explanation, the following description uses only the second possible implementation as an example. This application does not limit the scope of the implementation.
[0198] Optionally, the second-level multi-feature fusion model is the conformer feature fusion model.
[0199] The second-level separation model is a neural network model, meaning it is a model trained using a neural network. Optionally, the second-level separation model uses the dpconformer network structure. Alternatively, it can use any one of the following network structures: Deep Neural Networks (DNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), Conv-TasNet (fully convolutional temporal audio separation network), or Recurrent Neural Networks (RNN). It should be noted that the second-level separation model can also use other network structures suitable for offline scenarios, and this embodiment does not limit this approach.
[0200] The separation task design of the second-level separation module can be a single-task design of offline sound source separation task, or a multi-task design of offline sound source separation task and other tasks. Optionally, the other tasks include the orientation estimation tasks corresponding to multiple sound sources and / or the sound source object recognition tasks corresponding to multiple sound sources.
[0201] In one possible implementation, a second-level separation module is used to perform blind separation of multiple sound source data, and the second separation data includes the separated multiple sound source data.
[0202] In another possible implementation, a second-level separation module is used to extract the sound source data of the target object from multiple sound source data, and the second separation data includes the extracted sound source data of the target object.
[0203] In another possible implementation, the second-level separation module is used to extract the sound source data of the target object from multiple sound source data based on video information, and the second separation data includes the extracted sound source data of the target object.
[0204] In another possible implementation, the second-level separation module is used to extract at least one sound source data in the target direction from multiple sound source data, and the second separation data includes at least one sound source data in the target direction.
[0205] It should be noted that the fusion of multi-channel features, the selection of network structure, the design of separation tasks, the use of cost functions, and the use of separation results can be compared with the relevant descriptions of the first-level separation process, and will not be repeated here.
[0206] Step 307: Perform a second-level wake-up process based on the multi-channel feature data, the first separation data, and the second separation data to obtain the second wake-up data.
[0207] Optionally, the electronic device, based on multi-channel feature data, first separation data, and second separation data, invokes a pre-trained second-level wake-up module to output second-level wake-up data. The second-level wake-up module performs second-level wake-up processing, which is an offline sound source wake-up process.
[0208] Optionally, the first wake-up data also includes the directional information corresponding to the wake-up word. The electronic device calls the second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, the second separation data, and the directional information corresponding to the wake-up word.
[0209] It should be noted that the descriptions of multi-channel feature data, first separation data, and second separation data can be found in the relevant descriptions in the steps above, and will not be repeated here.
[0210] Optionally, the electronic device inputs multi-channel feature data, first separation data, and second separation data into the second-level wake-up module to output second wake-up data.
[0211] Optionally, the second-level wake-up module uses a fixed wake-up word for modeling. In this case the second-level wake-up module is a multi-input single-output (MISO) wake-up model, meaning the wake-up scheme is a streaming MISO-KWS (multi-input single-output wake-up scheme). Alternatively, the second-level wake-up module uses phoneme modeling. In this case the second-level wake-up module includes a multi-input multiple-output (MIMO) wake-up model and a second post-processing module (such as a decoder), meaning the wake-up scheme is a streaming MIMO-KWS (multi-input multiple-output wake-up scheme).
[0212] Optionally, the second-level wake-up module adopts a dpconformer network structure. Alternatively, the second-level wake-up module can adopt any of the DNN, LSTM, or CNN network structures. It should be noted that the second-level wake-up module can also adopt other network structures suitable for offline scenarios. For the network structure of the second-level wake-up module, reference may be made to the network structure of the second-level separation module; this application embodiment does not limit it.
[0213] The wake-up task design of the second-level wake-up module can be a single-task design of the wake-up task or a multi-task design of the wake-up task and other tasks. Optionally, other tasks include orientation estimation tasks and / or sound source object recognition tasks.
[0214] Optionally, the second wake-up data includes a third confidence level, which indicates the probability that the original first microphone data includes a preset wake-up word.
[0215] Optionally, the second wake-up data includes a fourth confidence level corresponding to each of the multiple sound source data. The fourth confidence level of the sound source data is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word. For ease of explanation, the following explanation will only use the example of the second wake-up data including a third confidence level, which is used to indicate the probability that the preset wake-up word is included in the original first microphone data.
[0216] Optionally, the second wake-up data may also include the location information corresponding to the wake-up event and / or the object information of the wake-up object.
[0217] Step 308: Determine the wake-up result based on the second wake-up data.
[0218] The electronic device determines the wake-up result based on the second wake-up data, which may be either a successful wake-up or a failed wake-up.
[0219] Optionally, the electronic device sets a second threshold value for the second-level wake-up module. The second threshold value is a threshold that allows the electronic device to be successfully woken up. Illustratively, the second threshold value is greater than the first threshold value.
[0220] In one possible implementation, the second wake-up data includes a third confidence level, which indicates the probability that the original first microphone data includes a preset wake-up word. When the third confidence level in the second wake-up data is greater than a second threshold, the electronic device determines that the wake-up result is successful. When the third confidence level is less than or equal to the second threshold, the electronic device determines that the wake-up result is unsuccessful and terminates the process.
[0221] In another possible implementation, the second wake-up data includes a fourth confidence level corresponding to each of multiple sound source data. The fourth confidence level of the sound source data is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word. When any fourth confidence level in the second wake-up data is greater than the second threshold value, the electronic device determines that the wake-up result is successful. When all fourth confidence levels in the second wake-up data are less than or equal to the second threshold value, the electronic device determines that the wake-up result is unsuccessful and terminates the process.
[0222] Optionally, when the second wake-up data indicates successful wake-up, the electronic device outputs a wake-up success flag; or, it outputs a wake-up success flag and other information. The wake-up success flag indicates successful wake-up, and the other information includes the location information corresponding to the wake-up event and the object information of the wake-up object.
[0223] It should be noted that, to ensure a high wake-up rate while minimizing false wake-ups, this application embodiment designs a two-level wake-up processing module. After a successful first-level wake-up, a more complex second-level wake-up module is invoked to perform offline wake-up confirmation on the data that passed the first-level wake-up. To better support this two-level wake-up check, the separation module is also designed in two levels. The first-level separation scheme is streaming and needs to run continuously, so the first-level separation module requires a causal streaming design. Streaming designs generally sacrifice some separation performance, so after a successful first-level wake-up, the second-level separation scheme can be applied to the output data. Since it is an offline scenario, the second-level wake-up scheme can adopt an offline design. At the same time, the data already output by the first level can also be used by the second-level separation scheme, ultimately achieving better separation performance and thus better supporting the two-level wake-up.
[0224] In an illustrative example, as shown in Figure 4, the electronic device includes a first-level separation module 41 (including a first-level separation model), a first-level wake-up module 42, a second-level separation module 43 (including a second-level separation model), and a second-level wake-up module 44. The electronic device inputs the raw first microphone data to a preprocessing module for preprocessing (e.g., acoustic echo cancellation, dereverberation, and beam filtering) to obtain multi-channel feature data; it then inputs the multi-channel feature data to the first-level separation module 41 for first-level separation processing to obtain first-level separation data; and finally, it inputs the multi-channel feature data and the first-level separation data to the first-level wake-up module 42 for first-level wake-up processing to obtain first-level wake-up data. Based on the first-level wake-up data, the electronic device determines whether pre-wake-up is successful. If pre-wake-up is successful, the multi-channel feature data and the first-level separation data are input to the second-level separation module 43 for second-level separation processing to obtain second-level separation data; and the multi-channel feature data, the first-level separation data, and the second-level separation data are input to the second-level wake-up module 44 for second-level wake-up processing to obtain second-level wake-up data. The electronic device then determines whether wake-up is successful based on the second-level wake-up data.
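The control flow of the four modules can be sketched as a simple cascade. This is a hedged illustration, not the application's implementation: `two_level_wake`, `WakeResult`, and the callables `sep1`, `wake1`, `sep2`, `wake2` are hypothetical names standing in for modules 41 to 44, and each wake-up module is assumed to return a single confidence value.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class WakeResult:
    pre_woken: bool                       # first-level (streaming) decision
    woken: bool                           # second-level (offline) decision
    second_confidence: Optional[float] = None


def two_level_wake(features: Any,
                   sep1: Callable[[Any], Any],
                   wake1: Callable[[Any, Any], float],
                   sep2: Callable[[Any, Any], Any],
                   wake2: Callable[[Any, Any, Any], float],
                   first_threshold: float,
                   second_threshold: float) -> WakeResult:
    # First level: streaming separation followed by streaming wake-up.
    first_sep = sep1(features)
    first_conf = wake1(features, first_sep)
    if first_conf <= first_threshold:
        # Pre-wake-up failed: the second level is never invoked.
        return WakeResult(pre_woken=False, woken=False)

    # Second level: offline separation reuses the first-level output.
    second_sep = sep2(features, first_sep)
    second_conf = wake2(features, first_sep, second_sep)
    return WakeResult(pre_woken=True,
                      woken=second_conf > second_threshold,
                      second_confidence=second_conf)
```

The early return reflects the point made in the text that the more expensive second-level modules run only on data that has already passed the first-level check.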
[0225] The voice wake-up method provided in this application is optimized from two perspectives, multi-source sound separation technology and wake-up technology, and can effectively address the aforementioned technical problems. The multi-source sound separation technology and wake-up technology involved in this application embodiment are described below.
[0226] Before introducing the multi-source separation and wake-up techniques, the dpconformer network structure is introduced first. As shown in Figure 5, the dpconformer network consists of an encoding layer, a separation layer, and a decoding layer.
[0227] 1. Encoding layer: The dpconformer network receives single-channel feature data and passes it through a one-dimensional convolution (1-D Conv) layer to obtain intermediate feature data, for example a two-dimensional matrix.
[0228] Optionally, a one-dimensional convolution operation is performed on the input single-channel feature data, transforming the input time-domain data into a latent space using the following formula: $X = \mathrm{ReLU}(x * W)$, where $x$ is the single-channel feature data in the time domain and $W$ is the weight coefficient of the encoding transformation. $x$ is convolved with $W$ in one dimension using a fixed convolution kernel size and convolution stride, finally yielding the encoded intermediate feature data $X \in \mathbb{R}^{N \times I}$, where $N$ is the encoding dimension and $I$ is the total number of frames in the time domain. The intermediate feature data $X$ is thus an $N \times I$ two-dimensional matrix.
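The encoding formula X = ReLU(x*W) can be illustrated with a small pure-Python sketch. The function name `encode`, the layout of `W` as N kernels of length L, and the absence of padding are all assumptions made for illustration; as in most deep-learning frameworks, the "convolution" here is computed as a sliding dot product without flipping the kernel.

```python
from typing import List, Sequence


def encode(x: Sequence[float], W: Sequence[Sequence[float]],
           stride: int) -> List[List[float]]:
    """Map a time-domain signal x (length T) to an N x I matrix.

    W holds N kernels of length L (assumed layout). With no padding,
    the number of frames is I = 1 + (T - L) // stride.
    """
    kernel_size = len(W[0])
    num_frames = 1 + (len(x) - kernel_size) // stride
    X = []
    for kernel in W:                       # one output row per kernel
        row = []
        for i in range(num_frames):
            frame = x[i * stride: i * stride + kernel_size]
            value = sum(w * v for w, v in zip(kernel, frame))
            row.append(max(0.0, value))    # ReLU
        X.append(row)
    return X
```

For example, a signal of 8 samples encoded with 2 kernels of length 2 and stride 2 gives a 2 x 4 matrix, that is, N = 2 and I = 4.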
[0229] 2. Separation layer: includes a data segmentation module, an intra-block conformer layer, and an inter-block conformer layer.
[0230] (1) Data Segmentation Module
[0231] The data segmentation module takes intermediate feature data as input and outputs a three-dimensional tensor. Specifically, it represents the intermediate feature data as a three-dimensional tensor according to the data frame-segmentation method, corresponding to intra-block features, inter-block features, and feature dimensions.
[0232] Optionally, the N*I dimensional two-dimensional matrix is divided into N*K*P dimensional three-dimensional tensors in equal blocks, where N is the feature dimension, K is the number of blocks, P is the length of the blocks, and the overlap between blocks is P / 2.
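The equal-block segmentation with an overlap of P / 2 can be sketched as below. The function name `segment` is hypothetical, and zero-padding the tail so that the last block is full is an assumption; the application does not state how the sequence ends are handled.

```python
from typing import List, Sequence


def segment(X: Sequence[Sequence[float]], P: int) -> List[List[List[float]]]:
    """Split an N x I matrix into an N x K x P tensor with hop P // 2.

    P is assumed to be even and at least 2. The tail is zero-padded
    (an assumption) so that every block has exactly P frames.
    """
    hop = P // 2
    I = len(X[0])
    # Smallest K such that (K - 1) * hop + P covers all I frames.
    K = max(1, -(-(I - P) // hop) + 1)
    padded_len = (K - 1) * hop + P
    tensor = []
    for row in X:
        padded = list(row) + [0.0] * (padded_len - I)
        tensor.append([padded[k * hop: k * hop + P] for k in range(K)])
    return tensor
```

With I = 6 and P = 4 this yields K = 2 blocks per feature row, and consecutive blocks share P / 2 = 2 frames.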
[0233] (2) Intra-block conformer layer
[0234] The input parameters of the intra-block conformer layer are the three-dimensional tensors output by the data segmentation module, and the output parameters are the first intermediate parameters.
[0235] Optionally, the conformer layer includes at least one of a linear layer, a multi-head self-attention (MHSA) layer, and a convolutional layer.
[0236] Optionally, the conformer within each of the K blocks of length P can be calculated using the following formula:
[0237]
[0238] Here, b denotes the b-th dpconformer submodule, and there are B dpconformer submodules in total. Each dpconformer submodule includes an intra-block conformer layer and an inter-block conformer layer, and B is a positive integer.
[0239] It should be noted that the calculation method of the conformer layer within a block is the same in both streaming and offline scenarios.
[0240] (3) Inter-block conformer layer
[0241] The input parameters of the conformer layer between blocks are the first intermediate parameters output by the conformer layer within the block, and the output parameters are the second intermediate parameters.
[0242] Optionally, in offline scenarios, the inter-block conformer is calculated across the K blocks, separately for each of the P positions within a block, using the following formula:
[0243]
[0244] In offline scenarios, the conformer layer between blocks calculates attention on all features of the entire sentence. However, in streaming scenarios, in order to control latency, a mask mechanism is used to calculate attention only for the current block and previous time steps to ensure causality.
[0245] Optionally, in a streaming scenario, if the current block is t, the inter-block conformer calculation for block t depends only on block t and the blocks preceding it, and is independent of block t+1. The inter-block conformer calculation is then performed using the following formula:
[0246]
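The difference between the offline and streaming inter-block attention can be illustrated with a block-level mask. This sketch shows only the masking idea, not the conformer formula itself; the function names `inter_block_mask` and `masked_softmax` are hypothetical.

```python
import math
from typing import List, Sequence


def inter_block_mask(K: int, causal: bool) -> List[List[bool]]:
    """mask[t][s] is True if block t may attend to block s.

    Offline: every block sees all K blocks of the utterance.
    Streaming: block t sees only blocks s <= t, never block t + 1.
    """
    return [[(s <= t) if causal else True for s in range(K)]
            for t in range(K)]


def masked_softmax(scores: Sequence[float],
                   allowed: Sequence[bool]) -> List[float]:
    """Softmax over the allowed positions; masked positions get weight 0.

    At least one position must be allowed, which always holds for the
    masks above because every block may attend to itself.
    """
    peak = max(v for v, ok in zip(scores, allowed) if ok)
    exps = [math.exp(v - peak) if ok else 0.0
            for v, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

Applying `masked_softmax` with row t of the causal mask guarantees that attention weights for future blocks are exactly zero, which is what keeps the streaming first level causal.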
[0247] The calculation passes through B layers of intra-block and inter-block conformer layers; that is, the intra-block and inter-block conformer calculations are repeated B times.
[0248] Then, the three-dimensional N*K*P tensor is passed through a 2-D Conv layer and converted into C two-dimensional N*I matrices, which correspond to the masking matrices M of the C sound sources, where C is the preset number of sound sources to be separated.
[0249] 3. Decoding layer
[0250] Based on the masking matrix M of each sound source and the latent space representation of each sound source, a one-dimensional convolutional layer is used to finally obtain the separation result, that is, the separated multiple sound source data.
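The masking and decoding step can be sketched as follows. This is an illustration under assumptions: each source's latent representation is taken to be the element-wise product of its mask with the encoder output X, and the decoder is modelled as a transposed one-dimensional convolution (overlap-add of learned basis vectors), as is common in mask-based separation networks. The application only states that a one-dimensional convolutional layer is used; the names `apply_masks`, `decode`, and the basis `D` are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def apply_masks(X: Matrix, masks: Sequence[Matrix]) -> List[List[List[float]]]:
    """Element-wise product of each source mask (N x I) with X (N x I)."""
    return [[[m * v for m, v in zip(mask_row, x_row)]
             for mask_row, x_row in zip(mask, X)]
            for mask in masks]


def decode(Y: Matrix, D: Matrix, stride: int) -> List[float]:
    """Overlap-add decoder: Y is N x I, D holds N basis vectors of length L.

    Output length is (I - 1) * stride + L.
    """
    num_frames, basis_len = len(Y[0]), len(D[0])
    out = [0.0] * ((num_frames - 1) * stride + basis_len)
    for y_row, basis in zip(Y, D):
        for i, coeff in enumerate(y_row):
            for l, d in enumerate(basis):
                out[i * stride + l] += coeff * d
    return out
```

Calling `decode` once per masked representation returned by `apply_masks` yields one time-domain signal per sound source.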
[0251] The multi-source separation scheme provided in this application is a two-stage separation scheme. Taking as an example the case in which the multi-feature fusion model and the separation module both adopt the dpconformer network structure shown in Figure 5, this two-stage separation scheme is as shown in Figure 6.
[0252] The first-level streaming separation module includes a Conformer feature fusion model 61 and a dpconformer separation model 62, and the second-level offline separation module includes a Conformer feature fusion model 63 and a dpconformer separation model 64. The first-level streaming separation module can be the first-level separation module 41 described above, and the second-level offline separation module can be the second-level separation module 43 described above.
[0253] The electronic device inputs multi-channel feature data into the Conformer feature fusion model 61, which outputs single-channel feature data; it then inputs the single-channel feature data into the dpconformer separation model 62, which outputs the first separation data. When pre-wake-up is successful, the multi-channel feature data and the first separation data are input into the Conformer feature fusion model 63, which outputs single-channel feature data; the single-channel feature data is then input into the dpconformer separation model 64, which outputs the second separation data.
[0254] It should be noted that, for ease of explanation, only the first-stage separation scheme of the two-stage separation scheme is described below as an example. The second-stage separation scheme is analogous and will not be elaborated on further.
[0255] In one possible implementation, the first-level separation scheme includes blind separation technology, which includes, but is not limited to, the following aspects, as shown in Figure 7:
[0256] (1) Feature Input Section: Includes multi-channel feature data. In a multi-microphone scenario, the multi-channel feature data includes multiple sets of multi-channel feature data. Optionally, the multi-channel feature data includes at least one set of multi-channel feature data from the following: raw time-domain data of multiple microphones, corresponding multi-channel transform-domain data, multiple sets of IPD data, output data of multiple fixed beams in preset directions, and directional feature data of each preset direction. For example, the feature input section includes three sets of multi-channel feature data, namely multi-channel feature data 1, multi-channel feature data 2, and multi-channel feature data 3. The embodiments of this application do not limit the number of sets of multi-channel feature data.
[0257] (2) Conformer Feature Fusion Model 71: Used to fuse multiple sets of multi-channel feature data into single-channel feature data. First, each set of multi-channel feature data is calculated based on the conformer layer to obtain the first attention feature data between channels within the set; then, the first attention feature data between channels in each set is uniformly passed through another conformer layer, namely the full-channel attention layer 72, to obtain the second attention feature data of each set, and then passed through a pooling layer or a projection layer to obtain the intermediate feature representation of a single channel, namely the single-channel feature data.
[0258] (3) dpconformer separation model 73: The fused multi-channel feature data (i.e., the single-channel feature data) is input into the dpconformer separation model 73, which outputs M estimated sound source data, where M is a positive integer. For example, the M estimated sound source data may include sound source data 1, sound source data 2, sound source data 3, and sound source data 4. This application embodiment does not limit this aspect.
[0259] (4) Cost Function Design: When training with the cost function, there is a permutation ambiguity between the multiple output sound source data and their corresponding annotations. Therefore, the Permutation Invariant Training (PIT) criterion is needed: all possible annotation orders for the multiple sound source data are determined, the loss value corresponding to each annotation order is calculated based on the annotation orders and the model's output parameters, and the gradient is calculated based on the annotation order with the smallest loss value. Alternatively, a fixed order can be set using prior information about the multiple sound source data, which avoids the high computational cost of evaluating the loss for every order as the number of sound source data increases. The prior information of the sound source data includes the start time of the sound source data, and the multiple sound source data are arranged in order of start time from earliest to latest.
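The PIT criterion and the fixed-order alternative described in item (4) can be sketched as below. Mean squared error is used as the per-source loss purely as an assumption; the application does not specify the loss. The function names `mse`, `pit_loss`, and `sort_by_start_time` are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates: Sequence[Sequence[float]],
             references: Sequence[Sequence[float]]
             ) -> Tuple[float, Tuple[int, ...]]:
    """Try every annotation order and keep the one with the smallest loss.

    Returns (loss, order), where order[i] is the index of the reference
    assigned to estimates[i]. The cost grows factorially with the number
    of sources, which motivates the fixed-order alternative below.
    """
    best_loss, best_order = None, None
    for order in permutations(range(len(references))):
        loss = sum(mse(est, references[j])
                   for est, j in zip(estimates, order)) / len(estimates)
        if best_loss is None or loss < best_loss:
            best_loss, best_order = loss, order
    return best_loss, best_order


def sort_by_start_time(references: Sequence[Sequence[float]],
                       start_times: Sequence[float]) -> List[Sequence[float]]:
    """Fixed-order alternative: arrange annotations from earliest to latest
    start time, so that only a single order needs to be evaluated."""
    paired = sorted(zip(start_times, references), key=lambda p: p[0])
    return [ref for _, ref in paired]
```

In training, the gradient would be taken only through the loss of the selected order; with the start-time ordering, `pit_loss` is not needed and the loss is computed once against the sorted annotations.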
[0260] In another possible implementation, the first-level separation scheme includes person-specific extraction technology, which is another major technical solution in multi-source interference scenarios. This first-level separation scheme includes, but is not limited to, the following aspects, as shown in Figure 8:
[0261] (1) Feature Input Section: Includes multi-channel feature data and registered speech data. Unlike the first-level separation scheme shown in Figure 7, in person-specific extraction scenarios the target object needs to be registered, and the registered speech data of the target object is used as additional input feature data. For example, the feature input includes multi-channel feature data 1, multi-channel feature data 2, and registered speech data. This application embodiment does not limit the number of sets of multi-channel feature data.
[0262] (2) Conformer Feature Fusion Model 81: This model is used to fuse multiple sets of multi-channel feature data and registered speech data into single-channel feature data. First, each set of multi-channel feature data is calculated based on the Conformer layer to obtain the first attention feature data between channels within the set. Then, the first attention feature data between channels in each set and the speaker representation feature data of the target object are uniformly passed through the full-channel attention layer 82. The full-channel attention layer 82 is used to calculate the correlation between the speaker representation feature data of the target object and other multi-channel feature data, and fuses the output to obtain single-channel features.
[0263] Optionally, the registered speech data of the target object is input into the speaker representation model, and the output is the embedding representation of the target object, i.e., the speaker representation feature data. The speaker representation model is pre-trained and obtained through standard speaker recognition training methods.
[0264] Optionally, the speaker representation feature data of the target object can be pre-stored in the electronic device in vector form.
[0265] (3) dpconformer separation model 83: The single-channel feature data is input into the dpconformer separation model 83, and the output is the sound source data of the target object. That is, the output parameter of the dpconformer separation model 83 is a single output parameter, and the expected output parameter is the sound source data of the target object. For example, the sound source data of the target object is sound source data 1.
[0266] (4) Cost function design: The above introduction to cost function can be used as a reference, and will not be repeated here.
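The role of the full-channel attention layer in item (2) above, correlating the speaker representation with the multi-channel features and fusing them into a single channel, can be illustrated with a deliberately simplified stand-in. This is not a conformer layer: it is a single scaled dot-product attention step in which the speaker embedding acts as the query. The function name `attention_fuse` and the assumption that every channel feature and the embedding share one dimension D are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def attention_fuse(channel_features: Sequence[Sequence[float]],
                   query: Sequence[float]
                   ) -> Tuple[List[float], List[float]]:
    """Fuse per-channel feature vectors into one vector.

    Each channel is weighted by the softmax of its scaled dot product
    with the query (here, the target speaker embedding).
    Returns (fused_vector, attention_weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, feat)) / scale
              for feat in channel_features]
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * feat[d] for w, feat in zip(weights, channel_features))
             for d in range(len(query))]
    return fused, weights
```

Channels whose features correlate more strongly with the speaker embedding receive larger weights, so the fused single-channel feature is biased toward the registered target speaker.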
[0267] In another possible implementation, the first-level separation scheme includes visual data-assisted person-specific extraction techniques, which include, but are not limited to, the following aspects, as shown in Figure 9:
[0268] (1) Feature Input Section: This includes multi-channel feature data and target person visual data. In certain scenarios, such as televisions, mobile phones, robots, and in-vehicle devices equipped with cameras, these electronic devices can acquire visual data of the target object, i.e., target person visual data, through the camera. In these scenarios, target person visual data can be used to assist in specific person extraction tasks. For example, the feature input section includes multi-channel feature data 1, multi-channel feature data 2, and target person visual data. The embodiments of this application do not limit the number of sets of multi-channel feature data.
[0269] (2) Conformer Feature Fusion Model 91: This model is used to fuse multiple sets of multi-channel feature data and visual data into single-channel feature data. First, each set of multi-channel feature data is calculated based on the Conformer layer to obtain the first attention feature data between channels within the set. Then, the first attention feature data between channels in each set and the visual representation feature data of the target object are uniformly passed through the full-channel attention layer 92. The full-channel attention layer 92 is used to calculate the correlation between the visual representation feature data of the target object and other multi-channel feature data, and fuses the output to obtain single-channel features.
[0270] Optionally, the electronic device uses the target person's visual data to call a pre-trained visual classification model to output a vector representation of the target object, i.e., visual representation feature data. For example, the visual classification model may include a lip-reading recognition model, and the target person's visual data may include visual data of lip movements. This application does not limit this aspect.
[0271] (3) dpconformer separation model 93: The single-channel feature data is input into the dpconformer separation model 93, and the output is the sound source data of the target object. That is, the output parameter of the dpconformer separation model 93 is a single output parameter, and the expected output parameter is the sound source data of the target object. For example, the sound source data of the target object is sound source data 1.
[0272] (4) Cost function design: The above introduction to cost function can be used as a reference, and will not be repeated here.
[0273] In another possible implementation, the first-level separation scheme includes a specific direction extraction technique, which is a technique for extracting sound source data in a preset target direction under multi-source interference scenarios. This first-level separation scheme includes, but is not limited to, the following aspects, as shown in Figure 10:
[0274] (1) Feature Input Section: Includes multi-channel feature data and target orientation data. By analogy with the person-specific extraction technology shown in Figure 8, in this scenario the target orientation data is used as additional input feature data. For example, the feature input includes multi-channel feature data 1, multi-channel feature data 2, multi-channel feature data 3, and target orientation data. This application embodiment does not limit the number of sets of multi-channel feature data.
[0275] (2) Conformer Feature Fusion Model 101: This model is used to fuse multiple sets of multi-channel feature data and target direction data into single-channel feature data. First, each set of multi-channel feature data is calculated based on the Conformer layer to obtain the first attention feature data between channels within the set. Then, the first attention feature data between channels and the direction feature data of the target direction data are uniformly passed through the full-channel attention layer 102. The full-channel attention layer 102 is used to calculate the correlation between the direction feature data of the target direction data and other multi-channel feature data, and fuses the output to obtain single-channel features.
[0276] Optionally, directional feature data of the target direction data can be calculated based on the target direction data and the microphone position information of the microphone array.
[0277] Optionally, the directional feature data of the target direction data can be pre-stored in an electronic device.
[0278] (3) dpconformer separation model 103: The single-channel feature data is input into the dpconformer separation model 103, and the output is at least one sound source data in the target direction. That is, the output parameters of the dpconformer separation model 103 are single output parameters or multiple output parameters, and the expected output parameters are at least one sound source data in the target direction. For example, at least one sound source data in the target direction includes sound source data 1 and sound source data 2.
[0279] (4) Cost function design: The above introduction to cost function can be used as a reference, and will not be repeated here.
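The source does not specify how the directional feature data is derived from the target direction data and the microphone position information. As a minimal sketch of one common choice, the following assumes a far-field source on the horizontal plane and computes the per-microphone arrival delay relative to a reference microphone. The function name `direction_features`, the 2-D coordinates, and the speed-of-sound constant are illustrative assumptions, not taken from this application.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def direction_features(azimuth_deg, mic_positions, ref_index=0):
    """Return per-microphone delays (seconds) relative to a reference microphone.

    Assumes a far-field (plane-wave) source at the given horizontal azimuth and
    microphone positions given as (x, y) coordinates in metres.
    """
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector pointing toward the source
    # A microphone displaced toward the source hears the wavefront earlier,
    # so its arrival-time offset is negative.
    arrival = [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mic_positions]
    ref = arrival[ref_index]
    return [t - ref for t in arrival]


if __name__ == "__main__":
    mics = [(0.0, 0.0), (0.1, 0.0)]  # hypothetical two-microphone array, 10 cm apart
    print(direction_features(0.0, mics))
    print(direction_features(90.0, mics))
```

Because the result depends only on the target direction and the array geometry, such a vector could also be computed once and pre-stored in the electronic device, as the optional variant above describes.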
[0280] It should be noted that the above-mentioned first-level separation schemes can be combined in pairs, in any group of three, or all together. This application does not limit this.
[0281] In another possible implementation, the first-level separation scheme includes a multi-task design technique combining blind separation and multi-source localization. As shown in Figure 11, this first-level separation scheme includes, but is not limited to, the following aspects:
[0282] (1) Feature input part: including multi-channel feature data.
[0283] (2) Conformer feature fusion model 111 (including full-channel attention layer 112): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0284] (3) dpconformer separation model 113, sound source separation layer 114, and direction estimation layer 115: Single-channel feature data is input into dpconformer separation model 113 to obtain intermediate parameters. The intermediate parameters are input into sound source separation layer 114 to obtain sound source separation results. The intermediate parameters are input into direction estimation layer 115 to obtain azimuth estimation results. The sound source separation results include the separated m sound source data, and the azimuth estimation results include the azimuth information corresponding to each of the m sound source data. For example, the output parameters include sound source data 1 and sound source data 2, as well as the azimuth information of sound source data 1 and sound source data 2.
[0285] The sound source separation layer 114 and the direction estimation layer 115 can be set as separate modules outside the dpconformer separation model 113, that is, the sound source separation layer 114 and the direction estimation layer 115 are set at the output of the dpconformer separation model 113. Illustratively, the i-th azimuth information output by the direction estimation layer 115 is the azimuth information of the i-th sound source data separated by the sound source separation layer 114, where i is a positive integer.
[0286] Optionally, the azimuth information is an azimuth label in one-hot vector format. For example, in multi-source localization technology, the 360-degree horizontal azimuth is divided into 360 / gamma = 36 parts at a resolution of gamma = 10 degrees, so the output has 36 dimensions and the azimuth information is a 36-dimensional one-hot vector.
[0287] (4) Cost function design
[0288] Optionally, the cost functions for both the separation task and the direction estimation task adopt the PIT criterion.
[0289] It should be noted that the above descriptions can be compared with the relevant descriptions in the above embodiments, and will not be repeated here.
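The one-hot azimuth label described for the direction estimation layer can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and mapping each angle to the sector whose lower edge is a multiple of gamma is an assumption about how the 36 sectors are laid out.

```python
def azimuth_to_one_hot(azimuth_deg, gamma=10):
    """Encode a horizontal azimuth as a one-hot vector with 360 / gamma entries."""
    if gamma <= 0 or 360 % gamma != 0:
        raise ValueError("gamma must be a positive divisor of 360")
    num_bins = 360 // gamma
    # Wrap the angle into [0, 360) and pick the sector it falls in; the final
    # modulo guards against floating-point wrap-around landing exactly on 360.
    index = int((azimuth_deg % 360) // gamma) % num_bins
    label = [0] * num_bins
    label[index] = 1
    return label


def one_hot_to_azimuth(label, gamma=10):
    """Return the lower edge (degrees) of the sector marked in a one-hot label."""
    return label.index(1) * gamma


if __name__ == "__main__":
    label = azimuth_to_one_hot(95)
    print(len(label), label.index(1), one_hot_to_azimuth(label))
```

With gamma = 10 the label has 36 entries, matching the output dimension given above; a finer resolution only changes the vector length.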
[0290] In another possible implementation, the first-level separation scheme includes a multi-task design technique combining person-specific extraction and person-specific orientation estimation. As shown in Figure 12, this first-level separation scheme includes, but is not limited to, the following aspects:
[0291] (1) Feature input part: including multi-channel feature data and registered voice data.
[0292] (2) Conformer feature fusion model 121 (including full-channel attention layer 122): used to fuse multiple sets of multi-channel feature data and registered speech data into single-channel feature data.
[0293] (3) dpconformer separation model 123, person-specific extraction layer 124, and person-specific orientation estimation layer 125: Single-channel feature data is input into the dpconformer separation model 123 to obtain intermediate parameters. The intermediate parameters are input into the person-specific extraction layer 124 to obtain the sound source data of the target object. The intermediate parameters are then input into the person-specific orientation estimation layer 125 to obtain the orientation information of the sound source data of the target object. For example, the output parameters include the sound source data 1 of the target object and the orientation information of the sound source data 1. Optionally, the orientation information is an orientation label in one-hot vector form.
[0294] Given the registered speech data of the target object, speaker representation feature data and other multi-channel feature data are used to design one-hot vector-based directional labels through a dpconformer network structure, and the network is trained using a cross-entropy (CE) cost function. The technique for multi-task design of speaker extraction and speaker directional estimation involves sharing multi-channel feature data, registered speech data, a Conformer feature fusion model 121, and a dpconformer separation model 123 between the two tasks. The output of the dpconformer separation model 123 is configured with a speaker extraction layer 124 and a speaker directional estimation layer 125, and multi-task training is performed using a weighted average of the cost functions for the separation task and the directional estimation task, respectively.
[0295] (4) Cost function design
[0296] It should be noted that the above descriptions can be compared with the relevant descriptions in the above embodiments, and will not be repeated here.
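The multi-task training described above, a weighted average of the separation cost and the cross-entropy (CE) cost of the orientation estimation, can be sketched as follows. The mean squared error used for the separation term is only a stand-in, since this passage does not fix the separation cost, and the weight `alpha` and all function names are illustrative assumptions.

```python
import math


def cross_entropy(probs, target_index, eps=1e-12):
    """CE between a predicted distribution and a one-hot orientation label."""
    return -math.log(max(probs[target_index], eps))


def mse(estimate, reference):
    """Stand-in separation cost: mean squared error between two signals."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def multitask_loss(sep_estimate, sep_reference, dir_probs, dir_target, alpha=0.5):
    """Weighted average of the separation cost and the orientation CE cost.

    alpha is a hypothetical weight in [0, 1] given to the separation task.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    separation_term = mse(sep_estimate, sep_reference)
    direction_term = cross_entropy(dir_probs, dir_target)
    return alpha * separation_term + (1.0 - alpha) * direction_term


if __name__ == "__main__":
    print(multitask_loss([0.9, 0.1], [1.0, 0.0], [0.1, 0.8, 0.1], 1, alpha=0.7))
```

Because both tasks share the fusion model and the separation model, a single scalar of this form is enough to back-propagate through the shared layers and both output layers.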
[0297] In another possible implementation, the first-level separation scheme includes a multi-task design technique combining blind separation and multi-speaker recognition. This technique involves separating multiple sound source data from microphone data and identifying the object information corresponding to each of the multiple sound source data. The object information is used to indicate the object identity of the sound source data. Optionally, the electronic device stores the correspondence between multiple sample sound source data and multiple object information. As shown in Figure 13, this first-level separation scheme includes, but is not limited to, the following aspects:
[0298] (1) Feature input part: including multi-channel feature data.
[0299] (2) Conformer feature fusion model 131 (including full-channel attention layer 132): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0300] (3) dpconformer separation model 133, sound source separation layer 134, and object recognition layer 135: Single-channel feature data is input into the dpconformer separation model 133 to obtain intermediate parameters. The intermediate parameters are input into the sound source separation layer 134 to obtain the sound source separation result. The intermediate parameters are input into the object recognition layer 135 to obtain the object recognition result. The sound source separation result includes the separated m sound source data, and the object recognition result includes the object information corresponding to each of the m sound source data. For example, the output parameters include sound source data 1 and sound source data 2, as well as the object information of sound source data 1 and the object information of sound source data 2.
[0301] The separation task and the object recognition task share multi-channel feature data, a Conformer feature fusion model 131, and a dpconformer separation model 133. A sound source separation layer 134 and an object recognition layer 135 are set at the output of the dpconformer separation model 133. The sound source separation layer 134 separates multiple sound source data. After frame-level feature calculation, the object recognition layer 135 performs segment-level feature fusion to obtain segment-level multi-object representations. The object representation of each segment outputs the object identity of that segment, and the corresponding object information is a one-hot vector used to indicate the object identity. Optionally, the dimension of the one-hot vector is the number of objects. In the one-hot vector corresponding to a sound source data, the position corresponding to that sound source data is 1, used to indicate the speaking order of the object among multiple objects, and other positions are 0.
[0302] The i-th object information output by the object recognition layer 135 is the object information of the i-th sound source data separated by the sound source separation layer 134, where i is a positive integer.
[0303] (4) Cost function design
[0304] Optionally, the cost functions for both the separation task and the object recognition task adopt the PIT criterion.
[0305] It should be noted that the above descriptions can be compared with the relevant descriptions in the above embodiments, and will not be repeated here.
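The PIT (permutation invariant training) criterion mentioned above can be illustrated with a brute-force sketch: the cost is evaluated for every assignment of network outputs to reference sources, and the smallest value is kept. The mean squared error pair cost and the function names are assumptions for illustration; this passage does not specify the per-pair cost.

```python
from itertools import permutations


def mse(estimate, reference):
    """Mean squared error between one estimated and one reference signal."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def pit_loss(estimates, references, pair_loss=mse):
    """Return (lowest average pair cost, best permutation).

    best_perm[i] is the index of the reference assigned to estimates[i].
    """
    n = len(references)
    if len(estimates) != n:
        raise ValueError("number of estimates and references must match")
    best_loss, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(pair_loss(estimates[i], references[p]) for i, p in enumerate(perm)) / n
        if best_loss is None or total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm


if __name__ == "__main__":
    refs = [[1.0, 0.0], [0.0, 1.0]]
    ests = [[0.0, 1.0], [1.0, 0.0]]  # correct signals, but in swapped output order
    print(pit_loss(ests, refs))
```

The exhaustive search grows factorially with the number of sources, so it is only practical for the small source counts typical of these examples. For label outputs such as the object information, `pair_loss` could be replaced with a cross-entropy term.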
[0306] In another possible implementation, the first-level separation scheme includes a multi-task design technique for person-specific extraction and person-specific verification. The person-specific extraction task uses the registered voice data of the target object to extract the target object's sound source data from the microphone data. However, a standalone person-specific extraction task might not output sound source data for the target object in the microphone data; therefore, a person-specific verification task is needed to verify the extracted sound source data. The person-specific verification task verifies whether the extracted sound source data is identical to the target object's registered voice data, or whether the object corresponding to the extracted sound source data contains the target object. The multi-task design technique for person-specific extraction and person-specific verification determines the object recognition result of the sound source data while extracting it. Similarly, this task is designed offline. As shown in Figure 14, this first-level separation scheme includes, but is not limited to, the following aspects:
[0307] (1) Feature input part: including multi-channel feature data and registered voice data.
[0308] (2) Conformer feature fusion model 141 (including full-channel attention layer 142): used to fuse multiple sets of multi-channel feature data and registered speech data into single-channel feature data.
[0309] (3) dpconformer separation model 143, person-specific extraction layer 144, and person-specific confirmation layer 145: Single-channel feature data is input into the dpconformer separation model 143 to obtain intermediate parameters. The intermediate parameters are input into the person-specific extraction layer 144 to obtain the sound source data of the target object. The intermediate parameters are input into the person-specific confirmation layer 145 to obtain the object recognition result of the sound source data. The object recognition result is used to indicate the acoustic feature similarity between the output sound source data and the registered speech data. Optionally, the object recognition result includes the probability that the object corresponding to the output sound source data is the target object. For example, the output parameters include the sound source data 1 of the target object and the object recognition result of the sound source data 1.
[0310] The specific person extraction and specific person confirmation tasks share multi-channel feature data, a conformer feature fusion model 141 and a dpconformer separation model 143, and a specific person extraction layer 144 and a specific person confirmation layer 145 are set at the output of the dpconformer separation model 143.
[0311] (4) Cost function design
[0312] It should be noted that the above descriptions can be compared with the relevant descriptions in the above embodiments, and will not be repeated here.
[0313] The wake-up scheme involved in this application is a two-stage wake-up scheme. Both the first-stage and second-stage wake-up modules in the two-stage scheme are multi-input wake-up model structures, such as any one of the following network structures: DNN, LSTM, CNN, Transformer, and Conformer. It should be noted that other network structures can also be used for the wake-up model structure; for ease of explanation, only the first and second-stage wake-up modules in the two-stage wake-up scheme are described using multi-input wake-up models. Taking the dpconformer network structure provided in Figure 5 as an example, this two-stage wake-up scheme is shown in Figure 15.
[0314] The electronic device inputs multi-channel feature data and first separation data into the dpconformer wake-up module 151 and outputs first wake-up data; when the first wake-up data indicates that the pre-wake-up is successful, it inputs multi-channel feature data, first separation data and second separation data into the dpconformer wake-up module 152 and outputs second wake-up data; the wake-up result is determined based on the second wake-up data.
[0315] It should be noted that, for ease of explanation, only the first-level wake-up scheme in the two-stage wake-up scheme will be used as an example. The second-level wake-up scheme can be compared and referenced, and will not be elaborated further.
[0316] In one possible implementation, the first-level wake-up scheme provided in this application embodiment includes a wake-up technology based on multi-input single-output whole-word modeling. The first-level wake-up module is a multi-input single-output whole-word modeling wake-up module, as shown in Figure 16, including but not limited to the following aspects:
[0317] (1) Feature input section: includes multiple sets of multi-channel feature data. Among them, the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing the first-level separation processing.
[0318] (2) Conformer feature fusion model 161 (including full-channel attention layer 162): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0319] (3) dpconformer separation model 163: Input the single-channel feature data into the dpconformer separation model 163 and output the first confidence level. The first confidence level is used to indicate the probability that the original first microphone data includes a preset wake word. The preset wake word is a fixed wake word set by default.
[0320] For example, if there are N pre-defined wake words, the first confidence score output by the dpconformer separation model 163 is an N+1 dimensional vector. The N dimensions of the N+1 dimensional vector correspond to the N wake words, and the other dimension corresponds to the category that does not belong to any of the N wake words. The value of each dimension in the N+1 dimensional vector is a probability value between 0 and 1, which is used to indicate the wake-up probability of the wake word at the corresponding position.
[0321] (4) Cost function design
[0322] It should be noted that the above descriptions can be compared with the relevant descriptions in the first-stage separation scheme, and will not be repeated here.
[0323] In this embodiment, the output parameter of the dpconformer separation model 163 is a single output parameter, the number of modeling units is the number of wake words plus one, and the extra unit is a garbage unit, which is used to output the probability value of other words besides the wake words. The output parameter of the dpconformer separation model 163 is the first confidence level.
[0324] Optionally, the two preset wake-up words are preset wake-up word 1 and preset wake-up word 2. The probability value of each modeling unit is one of a first value, a second value, and a third value. When the probability value is the first value, it indicates that the sound source data does not include the preset wake-up word; when the probability value is the second value, it indicates that the sound source data includes preset wake-up word 1; and when the probability value is the third value, it indicates that the sound source data includes preset wake-up word 2. For example, preset wake-up word 1 is "Xiaoyi Xiaoyi", preset wake-up word 2 is "Nihao Xiaoyi", the first value is 0, the second value is 1, and the third value is 2. This application embodiment does not limit this.
[0325] The first-level wake-up module performs real-time calculations. For the current input of multiple sets of multi-channel feature data, the first-level wake-up module determines in real time whether it includes a fixed wake-up word. When the output first confidence level is greater than the first threshold, the pre-wake-up is considered successful. When the electronic device determines through the first-level wake-up module that the pre-wake-up is successful, it has received the complete wake-up word information. It determines the current time as the wake-up time, which is used to provide time point reference information to the second-level separation module and the second-level wake-up module, and then starts the second-level offline separation module.
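The whole-word pre-wake-up decision described above can be sketched as follows, assuming the model emits N+1 raw scores that are normalized with a softmax and that the garbage unit is the last entry. Both the softmax normalization and the position of the garbage unit are assumptions for illustration, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax; returns probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pre_wake_decision(logits, first_threshold):
    """Decide pre-wake-up from N+1 scores: N wake words, then one garbage unit.

    Returns (pre_wake_ok, index of best wake word, its confidence).
    """
    probs = softmax(logits)
    wake_probs = probs[:-1]  # drop the garbage unit (assumed to be last)
    best = max(range(len(wake_probs)), key=wake_probs.__getitem__)
    confidence = wake_probs[best]
    return confidence > first_threshold, best, confidence


if __name__ == "__main__":
    # Two preset wake words plus the garbage unit, so three scores in total.
    print(pre_wake_decision([5.0, 0.0, 0.0], first_threshold=0.9))
    print(pre_wake_decision([0.0, 0.0, 5.0], first_threshold=0.9))
```

In a streaming setting this check would run on every new block of features, and the time of the first positive decision would serve as the wake-up time passed to the second-level modules.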
[0326] In another possible implementation, the wake-up scheme provided in this application includes a multi-input multi-output (MIMO) phoneme modeling wake-up technology, wherein the first-level wake-up module is a MIMO phoneme modeling wake-up module, as shown in Figure 17, including but not limited to the following aspects:
[0327] (1) Feature input section: includes multiple sets of multi-channel feature data. Among them, the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing the first-level separation processing.
[0328] (2) Conformer feature fusion model 171 (including full-channel attention layer 172): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0329] (3) dpconformer separation model 173: The single-channel feature data is input into the dpconformer separation model 173, and the output is a phoneme set. This phoneme set includes the phoneme sequence information corresponding to each of the multiple sound source data. Optionally, the phoneme sequence information is the posterior probability of the phoneme sequence, which is the product of the posterior probability values of each phoneme corresponding to the sound source data. For example, the output parameters of the dpconformer separation model 173 include the phoneme sequence information 1 of sound source data 1 and the phoneme sequence information 2 of sound source data 2.
[0330] (4) Cost function design
[0331] It should be noted that the above descriptions can be compared with the relevant descriptions in the first-stage separation scheme, and will not be repeated here.
[0332] For the multi-input multi-output phoneme modeling wake-up module, the output parameters of the dpconformer separation model 173 are the phoneme sequence information corresponding to each of the multiple sound source data. The multiple phoneme sequence information is input into the decoder respectively, and finally the second confidence level corresponding to each of the multiple phoneme sequence information is output.
[0333] The phoneme sequence information corresponding to the sound source data is used to indicate the probability distribution of multiple phonemes in the sound source data; that is, the phoneme sequence information includes the probability values corresponding to each of the multiple phonemes. For each phoneme sequence information, the decoder is called once to obtain the second confidence level corresponding to that phoneme sequence information. The second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake word. The decoder part cannot participate in the model calculation. When the model cannot determine which isolated sound source data is the preset wake word, it needs to calculate the phoneme sequence information corresponding to each of the multiple sound source data.
[0334] In this embodiment, the modeling unit is a phoneme, which is a representation form of a basic speech unit. For example, for the wake-up word "Xiaoyi Xiaoyi", the corresponding phoneme sequence can be "x i ao y i x i ao y i", with the phonemes separated by spaces. In a multi-source interference scenario, the phoneme sequence 1 corresponding to the sound source data 1 is "x i ao y i x i ao y i", while the speech content corresponding to the sound source data 2 can be "What's the weather like", and the corresponding phoneme sequence 2 is "t i an q i z en m o yang". The output parameters of the dpconformer separation model 173 include two phoneme sequence information, that is, the probability value of the phoneme sequence 1 "x i ao y i x i ao y i" corresponding to the sound source data 1, and the probability value of the phoneme sequence 2 "t i an q i z en m o yang" corresponding to the sound source data 2.
[0335] For the first-level wake-up module, taking the output parameters including two phoneme sequence information as an example, one phoneme sequence information can be the probability distribution of each phoneme corresponding to the sound source data 1, and the other phoneme sequence information can be the probability distribution of each phoneme corresponding to the sound source data 2. For example, if the size of the phoneme set is 100, then each of the two phoneme sequence information is a 100-dimensional vector whose values are greater than or equal to 0 and less than or equal to 1, and the sum of the 100 values is 1. In this example, the probability value corresponding to the position of "x" in the first phoneme sequence information is the highest, and the probability value corresponding to the position of "t" in the second phoneme sequence information is the highest.
[0336] After the two phoneme sequence information are determined, the output probabilities of the phoneme sequence "x i ao y i x i ao y i" of the preset wake-up word are calculated in each of them and geometrically averaged, giving the second confidence level corresponding to each of the two phoneme sequence information. When any one of the second confidence levels is greater than the first threshold, it is determined that the pre-wake-up is successful.
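The geometric averaging step can be sketched as follows. For simplicity the sketch assumes that each sound source already provides exactly one posterior vector per phoneme of the wake-up word, in order. A real decoder would also have to align frames to phonemes, which is not shown here. The names `sequence_confidence` and `pre_wake`, and the tiny five-phoneme inventory in the usage example, are hypothetical.

```python
import math

WAKE_PHONEMES = "x i ao y i x i ao y i".split()


def sequence_confidence(posteriors, phoneme_index, target=WAKE_PHONEMES, eps=1e-12):
    """Geometric mean of the wake-word phoneme probabilities for one sound source.

    posteriors: one probability vector per target phoneme (pre-aligned; an assumption).
    phoneme_index: maps a phoneme symbol to its position in each vector.
    """
    if len(posteriors) != len(target):
        raise ValueError("expected one posterior vector per wake-word phoneme")
    # Sum logs instead of multiplying probabilities to avoid numerical underflow.
    log_sum = sum(
        math.log(max(frame[phoneme_index[p]], eps))
        for frame, p in zip(posteriors, target)
    )
    return math.exp(log_sum / len(target))


def pre_wake(per_source_posteriors, phoneme_index, threshold):
    """Pre-wake-up succeeds if any source's confidence exceeds the threshold."""
    scores = [sequence_confidence(p, phoneme_index) for p in per_source_posteriors]
    return any(s > threshold for s in scores), scores


if __name__ == "__main__":
    idx = {"x": 0, "i": 1, "ao": 2, "y": 3, "t": 4}  # toy phoneme inventory
    source_1 = [[0.9 if i == idx[p] else 0.025 for i in range(5)] for p in WAKE_PHONEMES]
    source_2 = [[0.2] * 5 for _ in WAKE_PHONEMES]
    print(pre_wake([source_1, source_2], idx, threshold=0.5))
```

Working in the log domain keeps the product of many small probabilities stable, and the geometric mean makes the score independent of the wake word's length.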
[0337] In another possible implementation, the wake-up scheme provided in the embodiments of the present application includes a multi-task design technique combining multi-input single-output whole-word modeling wake-up and direction estimation. The first-level wake-up module is a multi-input single-output whole-word modeling wake-up module, as shown in Figure 18, including but not limited to the following aspects:
[0338] (1) Feature input part: including multiple groups of multi-channel feature data. Among them, the multiple groups of multi-channel feature data include the multi-channel feature data obtained by preprocessing the first microphone data and the first separation data obtained by performing the first-level separation processing.
[0339] (2) Conformer feature fusion model 181 (including full-channel attention layer 182): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0340] (3) dpconformer separation model 183, wake word detection layer 184 and orientation estimation layer 185: The single-channel feature data is input into the dpconformer separation model 183 to obtain intermediate parameters. The intermediate parameters are input into the wake word detection layer 184 to obtain wake-up information. The intermediate parameters are input into the orientation estimation layer 185 to obtain the orientation information of the wake-up event. The wake-up information includes the first confidence level of each of the separated sound source data. For example, the orientation information adopts the one-hot vector form.
[0341] For the wake-up task, the model calculates the probabilities of each wake-up event and of garbage words, while the direction estimation task only outputs the directional information corresponding to the wake-up event. Therefore, the directional information is the output parameter of the direction estimation task corresponding to a successful wake-up.
[0342] The wake-up word detection layer 184 and the orientation estimation layer 185 can be additional network modules, set at the output of the dpconformer separation model 183, such as a single-layer DNN or LSTM, followed by a linear layer and a softmax layer of the corresponding dimension. For the wake-up task, the output parameters (i.e., wake-up information) of the wake-up word detection layer 184 are the detection probabilities of the wake-up word. For the orientation estimation task, the output parameters (i.e., orientation information) of the orientation estimation layer 185 are the probability distribution of the orientation estimation vector.
[0343] (4) Cost function design
[0344] It should be noted that the above descriptions can be compared with the relevant descriptions in the first-stage separation scheme, and will not be repeated here.
[0345] In another possible implementation, the wake-up scheme provided in this application includes a multi-task design technique combining multi-input multi-output phoneme modeling wake-up and direction estimation. The first-level wake-up module is a multi-input multi-output phoneme modeling wake-up module, as shown in Figure 19, including but not limited to the following aspects:
[0346] (1) Feature input section: includes multiple sets of multi-channel feature data. Among them, the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing the first-level separation processing.
[0347] (2) Conformer feature fusion model 191 (including full-channel attention layer 192): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
[0348] (3) dpconformer separation model 193, multi-wake-up phoneme sequence layer 194, and orientation estimation layer 195: Single-channel feature data is input into the dpconformer separation model 193 to obtain intermediate parameters. The intermediate parameters are input into the multi-wake-up phoneme sequence layer 194 to obtain wake-up information. The intermediate parameters are input into the orientation estimation layer 195 to obtain orientation estimation results. The wake-up information includes phoneme sequence information corresponding to each of the multiple sound source data. The orientation estimation results include orientation information corresponding to each of the multiple phoneme sequence information. Optionally, the phoneme sequence information is the posterior probability of the phoneme sequence, which is the product of the posterior probability values of each phoneme corresponding to the sound source data. For example, the output parameters include phoneme sequence information 1 of sound source data 1, phoneme sequence information 2 of sound source data 2, orientation information of phoneme sequence information 1, and orientation information of phoneme sequence information 2.
[0349] Among them, the multi-wake-up phoneme sequence layer 194 and the orientation estimation layer 195 can be additional network modules set at the output of the dpconformer separation model 193.
[0350] (4) Cost function design
[0351] It should be noted that the above descriptions can be compared with the relevant descriptions in the first-stage separation scheme, and will not be repeated here.
[0352] The wake-up task and the orientation estimation task share the same feature input, the Conformer feature fusion model 191, and the dpconformer separation model 193. The output parameters of the wake-up task include the phoneme sequence information corresponding to each of the multiple sound source data, and the output parameters of the orientation estimation task include the orientation information corresponding to each of the multiple phoneme sequence information. Finally, the wake-up result, i.e., the first confidence level, is obtained by decoding each phoneme sequence information.
[0353] It should be noted that the above-mentioned first-level wake-up schemes can be combined in pairs, in any group of three, or all together. This application does not limit this.
[0354] The following uses several illustrative examples to introduce the voice wake-up method provided in the embodiments of this application.
[0355] In an illustrative example, the electronic device is a device with a single microphone, and the voice wake-up method is a single-channel two-level separation and two-level wake-up approach. This method can be used in near-field wake-up scenarios for electronic devices, ensuring a high wake-up rate while reducing the false wake-up rate when users use the wake-up function of the electronic device in noisy environments.
[0356] As shown in Figure 20, the electronic device includes a first-level separation module 201, a first-level wake-up module 202, a second-level separation module 203, and a second-level wake-up module 204. The electronic device collects raw first microphone data (such as background music, echo, voice 1, voice 2, voice K, and ambient noise) through a single microphone. This first microphone data is input to a preprocessing module 205 for preprocessing to obtain multi-channel feature data. The multi-channel feature data is then input to the first-level separation module 201 for first-level separation processing to obtain first separation data. The multi-channel feature data and the first separation data are input to the first-level wake-up module 202 for first-level wake-up processing to obtain first wake-up data. Based on the first wake-up data, the electronic device determines whether to pre-wake up. If pre-wake-up is successful, the multi-channel feature data and the first separation data are input to the second-level separation module 203 for second-level separation processing to obtain second separation data. The multi-channel feature data, the first separation data, and the second separation data are then input to the second-level wake-up module 204 for second-level wake-up processing to obtain second wake-up data. The electronic device then determines whether wake-up is successful based on the second wake-up data.
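The control flow of this two-level separation and two-level wake-up pipeline can be sketched as follows. The modules are passed in as plain callables so that the sketch stays independent of any particular network; the function name, the dictionary result, and the use of a second threshold to interpret the second wake-up data are illustrative assumptions.

```python
def two_stage_wake(mic_data, preprocess, separate_1, wake_1, separate_2, wake_2,
                   first_threshold, second_threshold):
    """Run the two-level separation and wake-up flow on one block of microphone data.

    Each module argument is a callable standing in for the corresponding module;
    wake_1 and wake_2 are assumed to return a scalar confidence.
    """
    features = preprocess(mic_data)               # multi-channel feature data
    first_separation = separate_1(features)       # first-level (streaming) separation
    first_confidence = wake_1(features, first_separation)
    if first_confidence <= first_threshold:
        # Pre-wake-up failed: the second-level modules are never started.
        return {"pre_wake": False, "wake": False}
    second_separation = separate_2(features, first_separation)  # second-level (offline)
    second_confidence = wake_2(features, first_separation, second_separation)
    return {"pre_wake": True, "wake": second_confidence > second_threshold}


if __name__ == "__main__":
    result = two_stage_wake(
        "raw audio",
        preprocess=lambda mic: "features",
        separate_1=lambda f: "sep1",
        wake_1=lambda f, s1: 0.7,
        separate_2=lambda f, s1: "sep2",
        wake_2=lambda f, s1, s2: 0.95,
        first_threshold=0.5,
        second_threshold=0.8,
    )
    print(result)
```

Gating the second level on the first keeps the more expensive offline separation and confirmation idle until a pre-wake-up has actually occurred.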
[0357] Based on the voice wake-up method provided in Figure 20, some steps can be replaced to obtain the following possible implementations.
[0358] Optionally, the preprocessing module includes an acoustic echo cancellation module. The output parameters of the acoustic echo cancellation module are used as multi-channel feature data and input to the subsequent separation and wake-up modules.
[0359] Optionally, the preprocessing module includes an acoustic echo cancellation module and a dereverberation module. The output parameters of the acoustic echo cancellation module are input to the dereverberation module, and the output parameters of the dereverberation module are used as multi-channel feature data and input to the subsequent separation module and wake-up module.
[0360] Optionally, both the first-level wake-up module and the second-level wake-up module can be the aforementioned multi-input single-output whole-word modeling wake-up module. Alternatively, both the first-level wake-up module and the second-level wake-up module can be the aforementioned multi-input multi-output phoneme modeling wake-up module.
[0361] Optionally, when the scenario requires support for wake-up by a specific user, the two-level wake-up module needs to support a specific-user confirmation function. In one possible implementation, based on the example provided in Figure 20 and as shown in Figure 21, the multiple sound source data output by the second-level separation module 203 and the registered voice data (i.e., registered speech) of the target object are input to the speaker identification (SID) module 210 to confirm whether the separated multiple sound source data includes the registered voice data. The speaker identification module 210 is a separate network module, distinct from the second-level wake-up module 204. If the second wake-up data output by the second-level wake-up module 204 indicates successful wake-up, and the speaker identification module 210 confirms that the separated multiple sound source data includes the registered voice data, then the wake-up is considered successful; otherwise, the wake-up fails.
[0362] In another possible implementation, based on the example provided in Figure 20 and as shown in Figure 22, the speaker confirmation module 210 is integrated into the second-level wake-up module 204. The multiple sound source data output by the first-level separation module 201, the multiple sound source data output by the second-level separation module 203, and the registered voice data (i.e., registered speech) of the target object are input into the second-level wake-up module 204 (which includes the speaker confirmation module 210), which outputs the second wake-up data and the object confirmation result. When the second wake-up data indicates that the wake-up is successful and the object confirmation result indicates that the sound source data of the target object exists in the output sound source data, the wake-up is determined to be successful; otherwise, the wake-up fails.
[0363] Optionally, the object confirmation result is used to indicate whether the output sound source data contains sound source data of the target object; that is, the object confirmation result is used to indicate whether the current wake-up event is caused by the target object. Illustratively, the object confirmation result includes one of a first identifier and a second identifier. The first identifier indicates that the output sound source data contains sound source data of the target object, and the second identifier indicates that the output sound source data does not contain sound source data of the target object. When the second wake-up data indicates successful wake-up and the object confirmation result is the first identifier, the wake-up is determined to be successful; otherwise, the wake-up fails.
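The decision rule described above can be written down directly. The following is a small sketch; the names `ObjectConfirmation` and `wake_up_succeeded` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class ObjectConfirmation(Enum):
    # Hypothetical encoding of the two identifiers named in the text.
    FIRST_IDENTIFIER = 1   # target object's sound source data is present
    SECOND_IDENTIFIER = 2  # target object's sound source data is absent


def wake_up_succeeded(second_wake_indicates_success: bool,
                      confirmation: ObjectConfirmation) -> bool:
    """Wake-up succeeds only if both conditions hold; otherwise it fails."""
    return (second_wake_indicates_success
            and confirmation is ObjectConfirmation.FIRST_IDENTIFIER)
```

Requiring both conditions means that a correctly spoken wake-up word from a non-registered speaker is still rejected.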
[0364] In another possible implementation, based on the example provided in Figure 22 and as shown in Figure 23, the first-level separation module 201 is replaced by the first-level specific person extraction module 231, and the second-level separation module 203 is replaced by the second-level specific person extraction module 232. The multi-channel feature data and the registered speech data are input into the first-level specific person extraction module 231, which outputs the first sound source data of the target object. The multi-channel feature data and the first sound source data of the target object are input into the first-level wake-up module 202, which outputs the first wake-up data. When the first wake-up data indicates successful pre-wake-up, the multi-channel feature data, the first sound source data of the target object, and the registered speech data (i.e., registered speech) of the target object are input into the second-level specific person extraction module 232, which outputs the second sound source data of the target object. The multi-channel feature data, the first sound source data, the second sound source data, and the registered speech data of the target object are input into the second-level wake-up module 204 (which includes the speaker confirmation module 210), which outputs the second wake-up data and the object confirmation result. When the second wake-up data indicates successful wake-up and the object confirmation result indicates that the target object's sound source data exists in the output sound source data, wake-up is confirmed as successful; otherwise, wake-up fails.
[0365] It should be noted that this scenario also supports techniques such as person-specific extraction, visual data-assisted person-specific extraction, direction-specific extraction, multi-task design combining blind separation and multi-source localization, multi-task design combining person-specific extraction and person-specific orientation estimation, multi-task design combining blind separation and multi-speaker recognition, and multi-task design combining wake-up and orientation estimation, etc. The implementation details of each step can be found in the relevant descriptions in the above embodiments, and will not be repeated here.
[0366] In another illustrative example, the electronic device is a device with multiple microphones, and the voice wake-up method is a multi-channel, two-level separation and two-level wake-up method. This method can be used in electronic devices with multiple microphones, which respond to a preset wake-up word.
[0367] As shown in Figure 24, the electronic device includes a first-level separation module 241, a first-level wake-up module 242, a second-level separation module 243, and a second-level wake-up module 244. The electronic device collects raw first microphone data (such as background music, echo, voices 1 and 2 from the same direction, voice K, and ambient noise) through multiple microphones. The first microphone data is input to a preprocessing module 245 for preprocessing to obtain multi-channel feature data. The multi-channel feature data is then input to the first-level separation module 241 for first-level separation processing to obtain first separation data. The multi-channel feature data and the first separation data are input to the first-level wake-up module 242 for first-level wake-up processing to obtain first wake-up data. Based on the first wake-up data, the electronic device determines whether pre-wake-up is successful. If pre-wake-up is successful, the multi-channel feature data and the first separation data are input to the second-level separation module 243 for second-level separation processing to obtain second separation data. The multi-channel feature data, the first separation data, and the second separation data are then input to the second-level wake-up module 244 for second-level wake-up processing to obtain second wake-up data. The electronic device then determines whether wake-up is successful based on the second wake-up data.
[0368] Based on the voice wake-up method provided in Figure 24, some steps can be replaced to obtain the following possible implementations.
[0369] Optionally, the preprocessing module includes an acoustic echo cancellation module. Alternatively, the preprocessing module may include both an acoustic echo cancellation module and a dereverberation module.
[0370] Optionally, the preprocessing module includes an acoustic echo cancellation module, a dereverberation module, and a beam filtering module. After echo cancellation and dereverberation of the original first microphone data, beam filtering is performed in multiple directions to obtain multiple sets of multi-channel feature data, such as multi-beam filter output parameters, dereverberated multi-microphone data, and scene IPD, which are then input to the subsequent separation module and wake-up module.
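As an illustration of one of the feature types listed above, the sketch below computes a phase difference between two microphone channels per frequency bin. It assumes that IPD denotes the inter-channel phase difference commonly used as a spatial feature, and that each channel is already available as complex STFT bins; the function name `ipd` is hypothetical and neither assumption is confirmed by this section.

```python
import cmath
from typing import List


def ipd(ref_bins: List[complex], other_bins: List[complex]) -> List[float]:
    """Phase of `other` relative to `ref` for each frequency bin, in [-pi, pi].

    Multiplying by the conjugate of the reference bin subtracts its phase,
    so the result is already wrapped and needs no manual unwrapping.
    """
    return [cmath.phase(o * r.conjugate()) for r, o in zip(ref_bins, other_bins)]
```

For example, `ipd([1 + 0j], [1j])` gives a quarter-turn phase lead of the second channel over the reference.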
[0371] Optionally, both the first-level wake-up module and the second-level wake-up module can be the aforementioned multi-input single-output whole-word modeling wake-up module. Alternatively, both the first-level wake-up module and the second-level wake-up module can be the aforementioned multi-input multi-output phoneme modeling wake-up module.
[0372] Optionally, in a multi-task scenario involving separation, wake-up, and localization, the separation task can be designed jointly with the localization task, and the wake-up task can also be designed jointly with the localization task. Optionally, the execution entity of the localization task is a direction feature extractor, which can be integrated into the separation module or the wake-up module, ultimately outputting multiple separated sound source data and the directional information corresponding to each sound source data. For related details, please refer to the description of the multi-task design including the localization task in the above embodiments, which will not be repeated here.
[0373] In scenarios requiring multi-task design, the following are some possible multi-task design approaches, including but not limited to:
[0374] 1. Multi-task design of first-level streaming separation and orientation estimation. The output parameters of the first-level separation module include multiple sound source data from streaming separation and the orientation information corresponding to each sound source data. The output parameters of the first-level separation module can be provided to the first-level wake-up module, the second-level separation module, and the second-level wake-up module. The multiple sound source data output by the first-level separation module can also be provided to the acoustic event detection module to determine whether the current sound source data contains a specific acoustic event, or simultaneously provided to the speaker confirmation module to determine the identity information corresponding to the current sound source data. The multiple orientation information output by the first-level separation module can be provided to the system interactive control module to display the orientation of each sound source data in real time.
[0375] 2. Multi-task design of first-level streaming wake-up, speaker recognition, and location estimation. The output parameters of the first-level wake-up module include multiple stream-separated sound source data, the location information corresponding to each sound source data, and the object confirmation result. This can be used to determine whether the current wake-up event is caused by the target object, and the location information corresponding to the wake-up time. The multiple location information output by the first-level wake-up module can be provided to the backend system to determine the main location of the target object. For example, it can be provided to the beamforming module to perform real-time enhancement of the sound source data in that location, and then perform speech recognition on the enhanced sound source data.
[0376] 3. Multi-task design for second-level offline separation, speaker recognition, and location estimation. Speaker recognition and location estimation results are more accurate in offline scenarios. The output parameters of the second-level separation module include multiple sound source data separated offline, the location information corresponding to each sound source, and the object confirmation result. The output parameters of the second-level separation module can be used for system debugging to determine the quality of the separation results.
[0377] 4. Multi-task design for second-level offline wake-up, speaker recognition, and location estimation: Offline wake-up outperforms real-time streaming wake-up. The output parameters of the second-level wake-up module include offline separated sound source data, the location information corresponding to each sound source, and the object confirmation result. The location information can serve as supplementary information for the wake-up event, used for subsequent wake-up direction enhancement tasks and speech recognition.
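The multi-task outputs listed above (separated sound source data, per-source location information, and an object confirmation result) can be carried in one record per sound source. The following is a sketch only: the field names, the representation of location as an azimuth angle, the per-source confidence, and the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceResult:
    samples: List[float]    # separated sound source data
    azimuth_deg: float      # location information (assumed to be an azimuth)
    wake_confidence: float  # per-source wake-up confidence (assumed)
    is_target_object: bool  # object confirmation result for this source


def wake_direction(results: List[SourceResult],
                   threshold: float = 0.8) -> Optional[float]:
    """Azimuth of the most confident target-object source above the threshold.

    Returns None when no source both belongs to the target object and
    passes the (assumed) confidence threshold.
    """
    candidates = [r for r in results
                  if r.is_target_object and r.wake_confidence >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.wake_confidence).azimuth_deg
```

A downstream beamforming or speech recognition stage could use the returned direction as the supplementary wake-up information mentioned in item 4.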
[0378] In one possible implementation, based on the example provided in Figure 24, Figure 25 shows a schematic diagram of a multi-task design for second-level offline wake-up and wake-up location estimation. As shown in Figure 25, the second-level wake-up module 244 can adopt a wake-up model in multiple-input multiple-output or multiple-input single-output form, and finally outputs multiple sound source data and the directional information corresponding to each of the multiple sound source data.
[0379] In another possible implementation, based on the example provided in Figure 24, Figure 26 shows a schematic diagram of a multi-task design for second-level offline wake-up and speaker confirmation. As shown in Figure 26, the speaker confirmation module 261 is integrated into the second-level wake-up module 244. The multiple sound source data output by the first-level separation module 241, the multiple sound source data output by the second-level separation module 243, and the registered voice data (i.e., registered speech) of the target object are input into the second-level wake-up module 244 (which includes the speaker confirmation module 261), which outputs the second wake-up data and the object confirmation result. When the second wake-up data indicates that the wake-up is successful and the object confirmation result indicates that the sound source data of the target object exists in the output sound source data, the wake-up is determined to be successful; otherwise, the wake-up fails.
[0380] Optionally, this scenario also supports the combined use of neural network-based separation and traditional beamforming techniques. Besides inputting the first separation data into the first-level wake-up module and the first and second separation data into the second-level wake-up module, the first and second separation data can also be input into an adaptive beamforming module, such as a minimum variance distortionless response (MVDR) beam filter, to calculate the noise interference covariance matrix, thereby achieving better spatial interference suppression. The output parameters after beam filtering of multiple sound source data can be used as new sound source data, and simultaneously as additional feature data input into the first-level and / or second-level wake-up modules to enhance the wake-up effect.
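To make the MVDR step concrete, the sketch below computes the standard MVDR weights w = R⁻¹d / (dᴴR⁻¹d) for a two-microphone array at a single frequency bin, where R is the noise-interference covariance matrix and d is the steering vector toward the desired source. It is restricted to two channels so that the matrix inverse has a closed form. How the noise frames are derived from the separation data is not specified in this section, so the inputs, the diagonal loading, and all function names are assumptions.

```python
from typing import List, Tuple

Vec2 = Tuple[complex, complex]
Mat2 = Tuple[Vec2, Vec2]


def noise_covariance(noise_frames: List[Vec2], loading: float = 1e-6) -> Mat2:
    """Average n n^H over frames; diagonal loading (assumed) keeps R invertible."""
    n = len(noise_frames)
    r00 = sum(abs(a) ** 2 for a, _ in noise_frames) / n + loading
    r11 = sum(abs(b) ** 2 for _, b in noise_frames) / n + loading
    r01 = sum(a * b.conjugate() for a, b in noise_frames) / n
    return ((r00, r01), (r01.conjugate(), r11))


def mvdr_weights(r: Mat2, steering: Vec2) -> Vec2:
    """w = R^{-1} d / (d^H R^{-1} d), using the closed-form 2x2 inverse."""
    (r00, r01), (r10, r11) = r
    det = r00 * r11 - r01 * r10
    x0 = (r11 * steering[0] - r01 * steering[1]) / det   # first row of R^{-1} d
    x1 = (-r10 * steering[0] + r00 * steering[1]) / det  # second row of R^{-1} d
    denom = steering[0].conjugate() * x0 + steering[1].conjugate() * x1
    return (x0 / denom, x1 / denom)


def apply_beamformer(w: Vec2, frame: Vec2) -> complex:
    """Beamformer output w^H y for one time-frequency bin."""
    return w[0].conjugate() * frame[0] + w[1].conjugate() * frame[1]
```

By construction the weights pass a signal arriving along the steering vector with unit gain (the distortionless constraint) while minimizing the output power contributed by the estimated noise and interference.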
[0381] In one possible implementation, based on the example provided in Figure 24 and as shown in Figure 27, the first separation data is input into the adaptive beamforming module 271 to obtain the first filtered data. The multi-channel feature data, the first separation data, and the first filtered data are input into the first-level wake-up module 242 to obtain the first wake-up data. When the first wake-up data indicates that the pre-wake-up is successful, the multi-channel feature data and the first separation data are input into the second-level separation module 243 to obtain the second separation data. The second separation data is input into the adaptive beamforming module 272 to obtain the second filtered data. The multi-channel feature data, the first separation data, the second separation data, and the second filtered data are input into the second-level wake-up module 244 to obtain the second wake-up data. The second wake-up data is used to determine whether the wake-up is successful.
[0382] Optionally, this scenario also supports a multi-source wake-up scheme that uses a full neural network. No preprocessing module is used; the raw first microphone data and the calculated multi-channel feature data are input directly into the subsequent separation and wake-up modules. Optionally, the first-level and second-level separation modules need to account for echo scenarios, so they need to receive echo reference signals to handle the echo problem. In this implementation, the voice wake-up method can run on a chip equipped with a dedicated neural network accelerator, such as a GPU or a Tensor Processing Unit (TPU), thereby achieving better algorithm acceleration.
[0383] In one possible implementation, based on the example provided in Figure 24 and as shown in Figure 28, the preprocessing module 245 is not used. The original first microphone data, the calculated multi-channel feature data, and the echo reference signal are input to the first-level separation module 241, which outputs the first separation data. The first microphone data, the multi-channel feature data, and the first separation data are input to the first-level wake-up module 242, which outputs the first wake-up data. When the first wake-up data indicates that the pre-wake-up is successful, the first microphone data, the multi-channel feature data, the first separation data, and the echo reference signal are input to the second-level separation module 243, which outputs the second separation data. The first microphone data, the multi-channel feature data, the first separation data, and the second separation data are input to the second-level wake-up module 244, which outputs the second wake-up data. The second wake-up data is used to determine whether the wake-up is successful.
[0384] It should be noted that this scenario also supports techniques such as person-specific extraction, visual data-assisted person-specific extraction, direction-specific extraction, multi-task design combining blind separation and multi-source localization, multi-task design combining person-specific extraction and person-specific orientation estimation, multi-task design combining blind separation and multi-speaker recognition, and multi-task design combining wake-up and orientation estimation, etc. The implementation details of each step can be found in the relevant descriptions in the above embodiments, and will not be repeated here.
[0385] In summary, in one aspect, the voice wake-up method provided in this application embodiment provides a dual-path Conformer network structure based on the self-attention network layer modeling technology of the Conformer. By alternating Conformer layer computations within blocks and between blocks, it can model long sequences while avoiding the increase in computation caused by applying the Conformer directly. Furthermore, owing to the strong modeling capability of the Conformer network, the separation effect can be significantly improved.
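The alternation of intra-block and inter-block computation can be sketched as follows. This is a structural illustration only: `intra` and `inter` are placeholder callables standing in for Conformer layers, each sequence element is a single float standing in for a feature frame, and non-overlapping blocks with zero padding are an assumption (the block layout is not specified in this paragraph).

```python
from typing import Callable, List

Seq = List[float]
Layer = Callable[[Seq], Seq]  # stand-in for a length-preserving Conformer layer


def dual_path(sequence: Seq, block_size: int,
              intra: Layer, inter: Layer, repeats: int = 1) -> Seq:
    """Alternate intra-block and inter-block passes over a long sequence."""
    length = len(sequence)
    pad = (-length) % block_size
    padded = sequence + [0.0] * pad
    blocks = [padded[i:i + block_size] for i in range(0, len(padded), block_size)]
    for _ in range(repeats):
        # Intra-block pass: each block is modelled on its own (local context).
        blocks = [intra(b) for b in blocks]
        # Inter-block pass: the same position across all blocks (global context).
        columns = [inter([b[k] for b in blocks]) for k in range(block_size)]
        blocks = [[columns[k][j] for k in range(block_size)]
                  for j in range(len(blocks))]
    flat = [x for b in blocks for x in b]
    return flat[:length]  # drop the padding again
```

With a sequence of length T and block size K, each layer only ever sees K or roughly T/K positions. For self-attention, whose cost grows quadratically with length, the intra pass costs on the order of T·K and the inter pass on the order of T²/K, compared with T² for attention over the whole sequence.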
[0386] In another aspect, a Conformer-based fusion mechanism for multiple groups of multi-channel feature data is provided. For multiple groups of multi-channel features, the first attention feature data within each group is calculated first, and then the second attention feature data between the groups is calculated. This allows the model to better learn the contribution of each feature to the final separation effect, further ensuring the subsequent separation effect.
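The two-stage order (attention within each group first, then attention between groups) can be illustrated with a parameter-free toy version. The sketch swaps in plain dot-product attention pooling with the mean vector as the query, which is an assumption made for brevity; the actual model would use learned Conformer layers, and the names `attend` and `fuse` are hypothetical. All feature vectors are assumed to share one dimension.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors: List[Vector]) -> Vector:
    """Attention-pool vectors, using their mean as the query (assumed choice)."""
    dim = len(vectors[0])
    query = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scale = math.sqrt(dim)
    weights = softmax([sum(q * x for q, x in zip(query, v)) / scale
                       for v in vectors])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def fuse(groups: List[List[Vector]]) -> Vector:
    first_attention = [attend(group) for group in groups]  # within each group
    return attend(first_attention)                         # between groups
```

The attention weights at each stage play the role of the per-feature contributions mentioned above: features that agree more strongly with the pooled query receive larger weights in the fused output.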
[0387] In another aspect, a two-stage separation scheme is provided, namely a streaming separation process for the first-level wake-up and an offline separation process for the second-level wake-up. Since the second-level separation module can additionally use the first separation data output by the first-level separation module as an input parameter, the separation effect is further enhanced.
[0388] In another aspect, a wake-up module with multiple inputs is provided. Compared with the single-input wake-up module in related technologies, it saves computation by avoiding the significant extra cost and waste caused by repeatedly calling the wake-up model, and it also greatly improves wake-up performance by making better use of the correlation between the various input parameters.
[0389] In another aspect, a multi-task design scheme is provided that combines the sound source wake-up task with other tasks. The other tasks include at least one of the following: sound source localization, specific person extraction, specific direction extraction, and specific person confirmation. The sound source wake-up result can be associated with the other information and provided to downstream tasks, improving the output of the wake-up module (i.e., the first-level wake-up module and / or the second-level wake-up module). For example, if the other task is sound source localization, the output wake-up data includes multiple sound source data and the directional information corresponding to each sound source data. The wake-up module can therefore provide more accurate directional information together with the sound source wake-up result, giving a more accurate direction estimate than related technologies that directly use multiple fixed beams in space. As another example, if the other task is specific person extraction, the output wake-up data includes the sound source data of the target object, so that the electronic device responds only to wake-ups from a specific person (i.e., the target object), further reducing the false wake-up rate. Similarly, if the other task is specific direction extraction, the output wake-up data includes at least one sound source data in the target direction, so that the electronic device responds only to wake-ups from a specific direction (i.e., the target direction), further reducing the false wake-up rate. As a further example, take a robot as the execution subject of the voice wake-up method provided in this application embodiment, with the other tasks being specific person extraction and sound source localization. The output wake-up data includes the sound source data of the target object and the location information of that sound source data, so that the robot responds only to wake-ups from the specific person (i.e., the target object) and determines the location of the specific person at the moment it is woken up. The robot can then adjust its own orientation to face the specific person, so that it can better receive the instructions subsequently issued by that person.
[0390] Please refer to Figure 29, which illustrates a flowchart of a voice wake-up method provided in another exemplary embodiment of this application. This embodiment is illustrated by applying the method to the electronic device shown in Figure 2. The method includes the following steps.
[0391] Step 2901: Obtain the raw first microphone data.
[0392] Step 2902: Perform first-level processing based on the first microphone data to obtain first wake-up data. The first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model.
[0393] Step 2903: When the first wake-up data indicates that the pre-wake-up is successful, perform second-level processing based on the first microphone data to obtain second wake-up data. The second-level processing includes second-level separation processing and second-level wake-up processing based on a neural network model.
[0394] Step 2904: Determine the wake-up result based on the second wake-up data.
[0395] It should be noted that the relevant descriptions of each step in this embodiment can be found in the above method embodiments, and will not be repeated here.
[0396] The following are embodiments of the apparatus described in this application, which can be used to execute the embodiments of the method described in this application. For details not disclosed in the apparatus embodiments of this application, please refer to the embodiments of the method described in this application.
[0397] Please refer to Figure 30, which illustrates a block diagram of a voice wake-up apparatus provided in an exemplary embodiment of this application. The apparatus can be implemented, through software, hardware, or a combination of both, as one or more chips, as a voice wake-up system, or as all or part of the electronic device provided in Figure 2. The apparatus may include an acquisition module 3010, a first-level processing module 3020, a second-level processing module 3030, and a determination module 3040.
[0398] The acquisition module 3010 is used to acquire the raw first microphone data;
[0399] The first-level processing module 3020 is used to perform first-level processing based on the first microphone data to obtain first wake-up data. The first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model.
[0400] The second-level processing module 3030 is used to perform second-level processing based on the first microphone data to obtain second wake-up data when the first wake-up data indicates that the pre-wake-up is successful. The second-level processing includes second-level separation processing and second-level wake-up processing based on a neural network model.
[0401] The determination module 3040 is used to determine the wake-up result based on the second wake-up data.
[0402] In one possible implementation, the device further includes a preprocessing module, and the first-level processing module 3020 further includes a first-level separation module and a first-level wake-up module;
[0403] The preprocessing module is used to preprocess the first microphone data to obtain multi-channel feature data;
[0404] The first-level separation module is used to perform first-level separation processing based on multi-channel feature data and output the first separation data.
[0405] The first-level wake-up module is used to perform first-level wake-up processing based on multi-channel feature data and first separation data, and output the first wake-up data.
[0406] In another possible implementation, the second-level processing module 3030 also includes a second-level separation module and a second-level wake-up module;
[0407] The second-level separation module is used to perform second-level separation processing based on multi-channel feature data and first separation data when the first wake-up data indicates that the pre-wake-up is successful, and output the second separation data.
[0408] The second-level wake-up module is used to perform second-level wake-up processing based on multi-channel feature data, first separation data, and second separation data, and output the second wake-up data.
[0409] In another possible implementation, the first-level separation process is a streaming sound source separation process, and the first-level wake-up process is a streaming sound source wake-up process; and / or,
[0410] The second-level separation processing is offline sound source separation processing, and the second-level wake-up processing is offline sound source wake-up processing.
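The split between a streaming first level and an offline second level can be sketched as a loop that scores audio chunk by chunk and, only on a successful pre-wake-up, hands a buffered segment to the offline stage. This is a simplified illustration: the two callables are placeholders, the first level is treated as stateless per chunk, and the buffer length and both thresholds are assumed values that do not come from the text.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional

Chunk = List[float]


def streaming_then_offline(
    chunks: Iterable[Chunk],
    first_level: Callable[[Chunk], float],         # streaming: one chunk at a time
    second_level: Callable[[List[float]], float],  # offline: whole buffered segment
    history_chunks: int = 4,          # assumed buffer length
    pre_wake_threshold: float = 0.5,  # assumed value
    wake_threshold: float = 0.8,      # assumed value
) -> Optional[int]:
    """Return the index of the chunk at which wake-up is confirmed, or None."""
    history: Deque[Chunk] = deque(maxlen=history_chunks)
    for index, chunk in enumerate(chunks):
        history.append(chunk)
        if first_level(chunk) < pre_wake_threshold:
            continue  # no pre-wake-up: the offline stage is not invoked
        # Pre-wake-up succeeded: the offline stage sees the buffered audio at once.
        segment = [sample for c in history for sample in c]
        if second_level(segment) >= wake_threshold:
            return index
    return None
```

Because the offline stage receives the complete buffered segment rather than a causal stream, it can use context on both sides of every frame, which is consistent with the statement above that offline processing yields better results than streaming processing.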
[0411] In another possible implementation,
[0412] The first-level wake-up module includes a wake-up model in either multi-input single-output (MISO) or multi-input multi-output (MIMO) form; and / or,
[0413] The second-level wake-up module includes wake-up models in the form of multiple inputs and single outputs or multiple inputs and multiple outputs.
[0414] In another possible implementation, the first-level separation module and / or the second-level separation module adopt a conformer network structure with dual paths.
[0415] In another possible implementation, the first-level separation module and / or the second-level separation module are separation modules for performing at least one task, the at least one task including a separate sound source separation task, or including a sound source separation task and other tasks;
[0416] Other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0417] In another possible implementation, the first-level wake-up module and / or the second-level wake-up module are wake-up modules for performing at least one task, the at least one task including a separate wake-up task, or including a wake-up task and other tasks;
[0418] Other tasks include at least one of the following: sound source localization task, specific person extraction task, specific direction extraction task, and specific person confirmation task.
[0419] In another possible implementation, the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the first-level separation module is also used for:
[0420] The multi-channel feature data is input into the first-level multi-feature fusion model and the first single-channel feature data is output.
[0421] The first single-channel feature data is input into the first-level separation model to obtain the first separation data.
[0422] In another possible implementation, the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the second-level separation module is also used for:
[0423] The multi-channel feature data and the first separated data are input into the second-level multi-feature fusion model to output the second single-channel feature data;
[0424] The second single-channel feature data is input into the second-level separation model to obtain the second separation data.
[0425] In another possible implementation, the first-level wake-up module includes a first wake-up model in the form of multiple-input single-output. The first-level wake-up module is also used for:
[0426] The multi-channel feature data and the first separation data are input into the first wake-up model to output the first wake-up data. The first wake-up data includes a first confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0427] In another possible implementation, the first-level wake-up module includes a first wake-up model in the form of multiple-input multiple-output and a first post-processing module. The first-level wake-up module is also used for:
[0428] The multi-channel feature data and the first separation data are input into the first wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0429] The phoneme sequence information corresponding to each of the multiple sound source data is input into the first post-processing module, and the first wake-up data is output. The first wake-up data includes the second confidence level corresponding to each of the multiple sound source data. The second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
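A post-processing step of this kind maps one phoneme sequence per sound source to one confidence per sound source. The sketch below swaps in a normalized edit distance against the wake-up word's phoneme sequence as the similarity measure; this is an illustrative substitute and not the measure defined by this application, and the function names and phoneme symbols are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def source_confidences(phoneme_seqs: List[Sequence[str]],
                       wake_word: Sequence[str]) -> List[float]:
    """One confidence per sound source: 1.0 for an exact match, lower otherwise."""
    scores = []
    for seq in phoneme_seqs:
        longest = max(len(seq), len(wake_word), 1)
        scores.append(1.0 - edit_distance(seq, wake_word) / longest)
    return scores
```

Producing one score per separated source lets the device attribute a wake-up to a particular sound source instead of to the mixture as a whole.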
[0430] In another possible implementation, the second-level wake-up module includes a second wake-up model in the form of multiple-input single-output. The second-level wake-up module is also used for:
[0431] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model to output the second wake-up data. The second wake-up data includes a third confidence level, which is used to indicate the probability that the original first microphone data includes a preset wake-up word.
[0432] In another possible implementation, the second-level wake-up module includes a second wake-up model in the form of multiple-input multiple-output and a second post-processing module. The second-level wake-up module is also used for:
[0433] The multi-channel feature data, the first separation data, and the second separation data are input into the second wake-up model, and the phoneme sequence information corresponding to each of the multiple sound source data is output.
[0434] The phoneme sequence information corresponding to each of the multiple sound source data is input into the second post-processing module, and the second wake-up data is output. The second wake-up data includes the fourth confidence level corresponding to each of the multiple sound source data. The fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
[0435] It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is used only as an example. In actual applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
[0436] This application provides an electronic device, which includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the method described above when executing the instructions.
[0437] This application provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the method described above that is performed by the electronic device.
[0438] This application provides a voice wake-up system, which is used to perform the method described above that is executed by an electronic device.
[0439] This application provides a non-volatile computer-readable storage medium storing computer program instructions thereon, which, when executed by a processor, implement the method described above that is executed by an electronic device.
[0440] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), a digital video disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or raised structures in a groove on which instructions are stored, and any suitable combination of the foregoing.
[0441] The computer-readable program instructions or code described herein can be downloaded from a computer-readable storage medium to respective computing / processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and / or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing / processing device.
[0442] The computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized using state information of the computer-readable program instructions. These electronic circuits can execute the computer-readable program instructions to implement various aspects of this application.
[0443] Various aspects of this application are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0444] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0445] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0446] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved.
[0447] It should also be noted that each block in the block diagram and / or flowchart, as well as combinations of blocks in the block diagram and / or flowchart, can be implemented using hardware (such as circuits or ASICs (Application Specific Integrated Circuits)) that performs the corresponding function or action, or using a combination of hardware and software, such as firmware.
[0448] Although this application has been described herein in conjunction with various embodiments, those skilled in the art, in carrying out the claimed application, can understand and implement other variations of the disclosed embodiments by reviewing the accompanying drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude multiple instances. A single processor or other unit can implement several functions listed in the claims. The fact that certain measures are recited in mutually different dependent claims does not mean that these measures cannot be combined to good effect.
[0449] The various embodiments of this application have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A voice wake-up method, characterized in that the method includes: acquiring original first microphone data; performing first-level processing based on the first microphone data to obtain first wake-up data, the first-level processing including first-level separation processing and first-level wake-up processing based on a neural network model; when the first wake-up data indicates that pre-wake-up is successful, performing second-level processing based on the first microphone data to obtain second wake-up data, the second-level processing including second-level separation processing and second-level wake-up processing based on a neural network model; and determining a wake-up result based on the second wake-up data; wherein the first-level separation processing is streaming sound source separation processing and the first-level wake-up processing is streaming sound source wake-up processing; and / or the second-level separation processing is offline sound source separation processing and the second-level wake-up processing is offline sound source wake-up processing.
2. The method according to claim 1, characterized in that performing first-level processing based on the first microphone data to obtain the first wake-up data includes: preprocessing the first microphone data to obtain multi-channel feature data; calling a pre-trained first-level separation module to output first separation data based on the multi-channel feature data, the first-level separation module being used to perform the first-level separation processing; and calling a pre-trained first-level wake-up module to output the first wake-up data based on the multi-channel feature data and the first separation data, the first-level wake-up module being used to perform the first-level wake-up processing.
3. The method according to claim 2, characterized in that, when the first wake-up data indicates that pre-wake-up is successful, performing second-level processing based on the first microphone data to obtain the second wake-up data includes: when the first wake-up data indicates that pre-wake-up is successful, calling a pre-trained second-level separation module to output second separation data based on the multi-channel feature data and the first separation data, the second-level separation module being used to perform the second-level separation processing; and calling a pre-trained second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data, the second-level wake-up module being used to perform the second-level wake-up processing.
4. The method according to claim 3, characterized in that the first-level wake-up module includes a wake-up model in multiple-input single-output or multiple-input multiple-output form; and / or the second-level wake-up module includes a wake-up model in multiple-input single-output or multiple-input multiple-output form.
5. The method according to claim 3, characterized in that the first-level separation module and / or the second-level separation module adopts a dual-path conformer network structure.
6. The method according to claim 3, characterized in that the first-level separation module and / or the second-level separation module is a separation module for performing at least one task, the at least one task including only a sound source separation task, or including the sound source separation task and other tasks; the other tasks include at least one of the following: a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
7. The method according to claim 3, characterized in that the first-level wake-up module and / or the second-level wake-up module is a wake-up module for performing at least one task, the at least one task including only a wake-up task, or including the wake-up task and other tasks; the other tasks include at least one of the following: a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
8. The method according to claim 2, characterized in that the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; calling the pre-trained first-level separation module to output the first separation data based on the multi-channel feature data includes: inputting the multi-channel feature data into the first-level multi-feature fusion model to output first single-channel feature data; and inputting the first single-channel feature data into the first-level separation model to output the first separation data.
9. The method according to claim 3, characterized in that the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; calling the pre-trained second-level separation module to output the second separation data based on the multi-channel feature data and the first separation data includes: inputting the multi-channel feature data and the first separation data into the second-level multi-feature fusion model to output second single-channel feature data; and inputting the second single-channel feature data into the second-level separation model to output the second separation data.
10. The method according to claim 2, characterized in that the first-level wake-up module includes a first wake-up model in multiple-input single-output form; calling the pre-trained first-level wake-up module to output the first wake-up data based on the multi-channel feature data and the first separation data includes: inputting the multi-channel feature data and the first separation data into the first wake-up model to output the first wake-up data, wherein the first wake-up data includes a first confidence level, and the first confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
11. The method according to claim 2, characterized in that the first-level wake-up module includes a first wake-up model in multiple-input multiple-output form and a first post-processing module; calling the pre-trained first-level wake-up module to output the first wake-up data based on the multi-channel feature data and the first separation data includes: inputting the multi-channel feature data and the first separation data into the first wake-up model to output phoneme sequence information corresponding to each of multiple sound source data; and inputting the phoneme sequence information corresponding to each of the multiple sound source data into the first post-processing module to output the first wake-up data, wherein the first wake-up data includes a second confidence level corresponding to each of the multiple sound source data, and the second confidence level is used to indicate the acoustic feature similarity between the sound source data and a preset wake-up word.
12. The method according to claim 3, characterized in that the second-level wake-up module includes a second wake-up model in multiple-input single-output form; calling the pre-trained second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data includes: inputting the multi-channel feature data, the first separation data, and the second separation data into the second wake-up model to output the second wake-up data, wherein the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
13. The method according to claim 3, characterized in that the second-level wake-up module includes a second wake-up model in multiple-input multiple-output form and a second post-processing module; calling the pre-trained second-level wake-up module to output the second wake-up data based on the multi-channel feature data, the first separation data, and the second separation data includes: inputting the multi-channel feature data, the first separation data, and the second separation data into the second wake-up model to output phoneme sequence information corresponding to each of multiple sound source data; and inputting the phoneme sequence information corresponding to each of the multiple sound source data into the second post-processing module to output the second wake-up data, wherein the second wake-up data includes a fourth confidence level corresponding to each of the multiple sound source data, and the fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and a preset wake-up word.
14. A voice wake-up apparatus, characterized in that the apparatus includes: an acquisition module, a first-level processing module, a second-level processing module, and a determination module; the acquisition module is used to acquire original first microphone data; the first-level processing module is used to perform first-level processing based on the first microphone data to obtain first wake-up data, the first-level processing including first-level separation processing and first-level wake-up processing based on a neural network model; the second-level processing module is used to perform second-level processing based on the first microphone data to obtain second wake-up data when the first wake-up data indicates that pre-wake-up is successful, the second-level processing including second-level separation processing and second-level wake-up processing based on a neural network model; the determination module is used to determine a wake-up result based on the second wake-up data; wherein the first-level separation processing is streaming sound source separation processing and the first-level wake-up processing is streaming sound source wake-up processing; and / or the second-level separation processing is offline sound source separation processing and the second-level wake-up processing is offline sound source wake-up processing.
15. The apparatus according to claim 14, characterized in that the apparatus further includes a preprocessing module, and the first-level processing module includes a first-level separation module and a first-level wake-up module; the preprocessing module is used to preprocess the first microphone data to obtain multi-channel feature data; the first-level separation module is used to perform the first-level separation processing based on the multi-channel feature data and to output first separation data; the first-level wake-up module is used to perform the first-level wake-up processing based on the multi-channel feature data and the first separation data, and to output the first wake-up data.
16. The apparatus according to claim 15, characterized in that the second-level processing module includes a second-level separation module and a second-level wake-up module; the second-level separation module is used to perform the second-level separation processing based on the multi-channel feature data and the first separation data when the first wake-up data indicates that pre-wake-up is successful, and to output second separation data; the second-level wake-up module is used to perform the second-level wake-up processing based on the multi-channel feature data, the first separation data, and the second separation data, and to output the second wake-up data.
17. The apparatus according to claim 16, characterized in that the first-level wake-up module includes a wake-up model in multiple-input single-output or multiple-input multiple-output form; and / or the second-level wake-up module includes a wake-up model in multiple-input single-output or multiple-input multiple-output form.
18. The apparatus according to claim 16, characterized in that the first-level separation module and / or the second-level separation module adopts a dual-path conformer network structure.
19. The apparatus according to claim 16, characterized in that the first-level separation module and / or the second-level separation module is a separation module for performing at least one task, the at least one task including only a sound source separation task, or including the sound source separation task and other tasks; the other tasks include at least one of the following: a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
20. The apparatus according to claim 16, characterized in that the first-level wake-up module and / or the second-level wake-up module is a wake-up module for performing at least one task, the at least one task including only a wake-up task, or including the wake-up task and other tasks; the other tasks include at least one of the following: a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
21. The apparatus according to claim 15, characterized in that the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the first-level separation module is further used to: input the multi-channel feature data into the first-level multi-feature fusion model to output first single-channel feature data; and input the first single-channel feature data into the first-level separation model to output the first separation data.
22. The apparatus according to claim 16, characterized in that the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the second-level separation module is further used to: input the multi-channel feature data and the first separation data into the second-level multi-feature fusion model to output second single-channel feature data; and input the second single-channel feature data into the second-level separation model to output the second separation data.
23. The apparatus according to claim 15, characterized in that the first-level wake-up module includes a first wake-up model in multiple-input single-output form; the first-level wake-up module is further used to: input the multi-channel feature data and the first separation data into the first wake-up model to output the first wake-up data, wherein the first wake-up data includes a first confidence level, and the first confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
24. The apparatus according to claim 15, characterized in that the first-level wake-up module includes a first wake-up model in multiple-input multiple-output form and a first post-processing module; the first-level wake-up module is further used to: input the multi-channel feature data and the first separation data into the first wake-up model to output phoneme sequence information corresponding to each of multiple sound source data; and input the phoneme sequence information corresponding to each of the multiple sound source data into the first post-processing module to output the first wake-up data, wherein the first wake-up data includes a second confidence level corresponding to each of the multiple sound source data, and the second confidence level is used to indicate the acoustic feature similarity between the sound source data and a preset wake-up word.
25. The apparatus according to claim 16, characterized in that the second-level wake-up module includes a second wake-up model in multiple-input single-output form; the second-level wake-up module is further used to: input the multi-channel feature data, the first separation data, and the second separation data into the second wake-up model to output the second wake-up data, wherein the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
26. The apparatus according to claim 16, characterized in that the second-level wake-up module includes a second wake-up model in multiple-input multiple-output form and a second post-processing module; the second-level wake-up module is further used to: input the multi-channel feature data, the first separation data, and the second separation data into the second wake-up model to output phoneme sequence information corresponding to each of multiple sound source data; and input the phoneme sequence information corresponding to each of the multiple sound source data into the second post-processing module to output the second wake-up data, wherein the second wake-up data includes a fourth confidence level corresponding to each of the multiple sound source data, and the fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and a preset wake-up word.
27. An electronic device, characterized in that the electronic device includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the method according to any one of claims 1-13 when executing the instructions.
28. A non-volatile computer-readable storage medium storing computer program instructions thereon, characterized in that, when the computer program instructions are executed by a processor, the method according to any one of claims 1-13 is implemented.
29. A voice wake-up system, characterized in that the voice wake-up system is used to perform the method according to any one of claims 1-13.