Speech recognition method and apparatus, storage medium, and electronic device
By collecting and recognizing voice audio, determining compensation audio, and sending the resulting audio to a server for instruction recognition, the method removes the waiting time after the user wakes up the voice device and provides a voice interaction experience without waiting.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QINGDAO HAIER TECH
- Filing Date
- 2021-09-27
- Publication Date
- 2026-06-23
AI Technical Summary
Users have to wait a certain amount of time after waking up the voice device before they can give voice commands, resulting in a poor user experience.
Collect a first voice audio and a second voice audio within a preset time period, identify the first voice audio and determine the compensation audio if the identification result indicates the target wake-up word, and send the compensation audio and the second voice audio to the target server for instruction recognition.
There's no need to wait for the voice device to wake up before speaking the compensation audio; the audio to be recognized is sent directly to the server for recognition, improving the user experience.
Figure CN115881114B_ABST
Description
Technical Field
[0001] The present invention relates to the field of communications, and more specifically, to a speech recognition method, apparatus, storage medium, and electronic device.
Background Technology
[0002] With the continuous maturation of intelligent voice application technology, more and more home devices are using intelligent voice technology. If you want to listen to music, check the weather, or control home appliances, you only need to speak the command to the home appliance, and you no longer need to touch the device or take out your mobile phone to control it.
[0003] In related technologies, voice devices typically employ a wake-up, question, and response interaction model. For example, a user wakes up a smart device by saying "Xiao You Xiao You," and the device responds with a voice prompt or a light to indicate that it has entered wake-up mode. The user then asks, "What's the weather like today?" and the device, after checking, announces, "The weather is sunny today..." However, the wake-up engine requires computation time. It must detect the attenuation point of the sound energy before sending the preceding audio segment for calculation, and then match that segment against the acoustic features of the local wake-up model to calculate the wake-up score. In addition, the playback amplifier takes over 100 ms to start. As a result, the user must wait briefly after waking the device before giving a voice command. If the command is given without waiting for the wake-up response, the first part of the command may not be recognized. For example, if a user says, "Xiao You Xiao You, turn off the living room air conditioner," the recognition result might be only "living room air conditioner," so the device cannot tell whether the air conditioner should be turned on or off, leading to a poor user experience.
[0004] Therefore, it can be seen that the relevant technology has the problem that users need to wait for a certain period of time after waking up the voice device before they can give voice commands, resulting in a poor user experience.
[0005] There are currently no effective solutions to the problems existing in the relevant technologies.
Summary of the Invention
[0006] This invention provides a voice recognition method, apparatus, storage medium, and electronic device to at least solve the problem in related technologies where users need to wait a certain amount of time after waking up a voice device before they can give voice commands, resulting in a poor user experience.
[0007] According to an embodiment of the present invention, a speech recognition method is provided, comprising: acquiring a first speech audio and a second speech audio within a preset time period after the first speech audio; recognizing the first speech audio to obtain a recognition result; if the recognition result indicates that the first speech audio includes a target wake-up word, determining a compensation audio based on the acquisition time of the target wake-up word; determining an audio to be recognized based on the compensation audio and the second speech audio, and sending the audio to be recognized to a target server to instruct the target server to recognize an instruction included in the audio to be recognized.
[0008] According to another embodiment of the present invention, a speech recognition device is provided, comprising: an acquisition module for acquiring a first speech audio and a second speech audio within a preset time period after the first speech audio; a first recognition module for recognizing the first speech audio to obtain a recognition result; a determination module for determining a compensation audio based on the acquisition time of the target wake-up word when the recognition result indicates that the first speech audio includes a target wake-up word; and a second recognition module for determining an audio to be recognized based on the compensation audio and the second speech audio, and for sending the audio to be recognized to a target server to instruct the target server to recognize the instruction included in the audio to be recognized.
[0009] According to yet another embodiment of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, wherein the computer program, when executed by a processor, implements the steps of the method described in any of the preceding claims.
[0010] According to yet another embodiment of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
[0011] In this invention, a first speech audio and a second speech audio within a preset time period after the first speech audio are acquired. The first speech audio is recognized to obtain a recognition result. If the recognition result indicates that the first speech audio includes a target wake-up word, a compensation audio is determined based on the acquisition time of the target wake-up word. The audio to be recognized is then determined based on the compensation audio and the second speech audio, and is sent to a target server to instruct the target server to recognize the instruction it contains. Because the first speech audio includes the target wake-up word, the audio to be recognized, which is determined from the compensation audio, can be sent directly to the target server, and the target server can recognize the instruction in it. The user therefore does not need to confirm that the device has been woken up before speaking the content captured as compensation audio. This solves the problem in related technologies where users need to wait a certain amount of time after waking up a voice device before speaking a voice command, which results in a poor user experience, and achieves wake-up without waiting.
Attached Figure Description
[0012] Figure 1 is a hardware structure block diagram of a mobile terminal for a speech recognition method according to an embodiment of the present invention;
[0013] Figure 2 is a flowchart of a speech recognition method according to an embodiment of the present invention;
[0014] Figure 3 is a schematic diagram illustrating the determination of compensation audio according to an exemplary embodiment of the present invention;
[0015] Figure 4 is a flowchart of a speech recognition method according to a specific embodiment of the present invention;
[0016] Figure 5 is a structural block diagram of a speech recognition device according to an embodiment of the present invention.
Detailed Implementation
[0017] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
[0018] It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0019] To make human-computer interaction with smart home appliances smoother and more natural, it is desirable for the user to be able to speak the voice command together with the wake-up word. For example, a user can say, "Xiao You Xiao You, turn on the air conditioner," and the air conditioner turns on directly. Related technologies typically attempt to achieve this in the following ways:
[0020] Option 1 involves using a superior CPU to improve hardware computing performance, thereby reducing the wake-up engine's computation time.
[0021] Option 2 involves optimizing the wake-up engine algorithm to find a better computation path.
[0022] Option 3 treats the device as woken up as soon as the first few characters of the wake-up phrase are detected. This is in effect a trick that shortens the original "Xiao You Xiao You" to "Xiao You Xiao".
[0023] However, the above options have the following drawbacks:
[0024] Option 1: Increases hardware costs, which is inconsistent with the company's cost reduction strategy.
[0025] Option 2: Optimizing the wake-up engine algorithm requires significant R&D resources and has a long development cycle, and it inevitably affects the wake-up rate and the false wake-up rate. Most importantly, the achievable improvement is limited, so the return on investment is low.
[0026] Option 3: This is a workaround that inevitably increases the probability of false wake-ups. For example, "Xiao You Xiao Mei" should not wake the device, but under this option it does, because it matches "Xiao You Xiao". False wake-ups not only disturb users but also risk leaking user privacy.
[0027] To address the aforementioned problems in the relevant technologies, the following embodiments are proposed:
[0028] The methods and embodiments provided in this application can be executed on a mobile terminal, computer terminal, or similar computing device. Taking running on a mobile terminal as an example, Figure 1 is a hardware structure block diagram of a mobile terminal for a speech recognition method according to an embodiment of the present invention. As shown in Figure 1, a mobile terminal may include one or more processors 102 (only one is shown in Figure 1; a processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the mobile terminal described above. For example, the mobile terminal may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.
[0029] The memory 104 can be used to store computer programs, such as application software programs and modules, like the computer program corresponding to the speech recognition method in this embodiment of the invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the above-described method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0030] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the mobile terminal's communication provider. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module used for wireless communication with the Internet.
[0031] This embodiment provides a speech recognition method. Figure 2 is a flowchart of a speech recognition method according to an embodiment of the present invention. As shown in Figure 2, the process includes the following steps:
[0032] Step S202: Acquire a first speech audio and a second speech audio within a preset time period after the first speech audio;
[0033] Step S204: Recognize the first speech audio to obtain the recognition result;
[0034] Step S206: If the recognition result indicates that the first speech audio includes a target wake-up word, determine the compensation audio based on the acquisition time of the target wake-up word;
[0035] Step S208: Determine the audio to be recognized based on the compensation audio and the second speech audio, and send the audio to be recognized to the target server to instruct the target server to recognize the instruction included in the audio to be recognized.
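Steps S202 to S208 can be illustrated with a minimal Python sketch. It is not the patented implementation: `WakeResult`, `recognize_flow`, and the `detect` and `send` callables are hypothetical names introduced here, audio is modelled as a plain list of integer samples, and the captured audio is assumed to be buffered up to the moment the recognition result is reported.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WakeResult:
    """Hypothetical output of the local wake-up engine (step S204)."""
    has_wake_word: bool
    word_end: int     # sample index at which the wake-up word ends
    reported_at: int  # sample index at which the result was reported


def recognize_flow(captured: List[int],
                   second_audio: List[int],
                   detect: Callable[[List[int]], WakeResult],
                   send: Callable[[List[int]], None]) -> bool:
    """Run steps S204 to S208 on audio already acquired in step S202."""
    result = detect(captured)                  # S204: local recognition
    if not result.has_wake_word:
        return False                           # nothing leaves the device
    # S206: audio captured between the end of the wake-up word and the report
    compensation = captured[result.word_end:result.reported_at]
    # S208: compensation audio goes first, followed by the later speech
    send(compensation + second_audio)
    return True


# Usage with stand-in callables (assumed values, for illustration only)
sent: List[List[int]] = []
woke = recognize_flow(
    captured=[10, 11, 12, 13, 14, 15],
    second_audio=[16, 17],
    detect=lambda audio: WakeResult(True, word_end=4, reported_at=6),
    send=sent.append,
)
```

In this sketch nothing is passed to `send` unless the wake-up word is found, which mirrors the local-recognition behaviour of the method.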
[0036] In the above embodiments, the first and second audio recordings can be audio collected by a sound acquisition device, which can be a sound acquisition device integrated into the voice device, such as a microphone. The voice device can include smart speakers, smart home devices, and devices with voice interaction functions, such as mobile phones, tablets, and smart wearable devices with voice interaction functions.
[0037] In the above embodiments, when a user wants to issue a command through the voice device, they can directly say the wake word and the voice containing the command, such as "Xiao You, Xiao You, turn on the air conditioner." After the sound acquisition device acquires the voice, it can recognize the voice through the processor included in the voice device to obtain the recognition result. Specifically, whether the voice contains the target wake word can be recognized locally on the voice device, without needing to upload the acquired voice to the target server in real time. Only when the target wake word is recognized is the compensation audio uploaded to the target server, thus protecting the user's privacy.
[0038] In the above embodiments, when the recognition result indicates that the speech includes the target wake-up word, the compensation audio can be determined based on the acquisition time of the target wake-up word. The target wake-up word may be, for example, "Xiao You Xiao You". The target wake-up word can also be customized: users can issue a command to the voice device to change the system's default wake-up word to a custom one.
[0039] In the above embodiments, the compensation audio can be collected after the target wake-up word, and the second collection time period for collecting the compensation audio is adjacent to the first collection time period for collecting the target wake-up word. For example, the target wake-up word can be the speech collected from 1 minute 40 seconds to 1 minute 42 seconds, and the compensation audio can be the speech collected from 1 minute 42 seconds to 1 minute 44 seconds.
[0040] In the above embodiments, the audio to be recognized includes the compensation audio and other audio besides the compensation audio. The other audio is the speech captured after the compensation audio, and the compensation audio precedes the other audio in the audio to be recognized. When a user says "Xiao You Xiao You, turn on the air conditioner," the sound acquisition device can capture the user's speech, perform recognition, and obtain a recognition result. When it is determined that the speech contains "Xiao You Xiao You," the recognition result is that the speech contains the target wake-up word. Between the moment the device starts recognizing and the moment it obtains the recognition result, the user may already have uttered part of the voice command, such as "turn on." If capture of the voice command started only after the recognition result was obtained, only "air conditioner" might be captured. Therefore, the captured "turn on" can be used as the compensation audio, and "air conditioner" can be used as the other audio. During uploading, the compensation audio is placed before the other audio to obtain the audio to be recognized, which is then sent to the target server. Upon receiving the audio, the target server can analyze it and recognize the command it contains, such as "turn on the air conditioner." The other audio can be determined based on the second speech audio. As a result, the user does not need to wait for the voice device to respond before uttering the voice command, which improves the user experience.
[0041] In the above embodiments, users can speak the wake word and voice command consecutively, without having to say the wake word first and then speak the voice command after the voice device is woken up. This reduces waiting time and improves the user experience.
[0042] Optionally, the entity performing the above steps may be a voice device, a background processor, or other devices with similar processing capabilities. It may also be a machine that integrates at least a sound acquisition device and a data processing device. The sound acquisition device may include a sound acquisition module such as a microphone, and the data processing device may include a terminal such as a computer or a mobile phone, but is not limited thereto.
[0043] In this invention, a first speech audio and a second speech audio within a preset time period after the first speech audio are acquired. The first speech audio is recognized to obtain a recognition result. If the recognition result indicates that the first speech audio includes a target wake-up word, a compensation audio is determined based on the acquisition time of the target wake-up word. The audio to be recognized is then determined based on the compensation audio and the second speech audio, and is sent to a target server to instruct the target server to recognize the instruction it contains. Because the first speech audio includes the target wake-up word, the audio to be recognized, which is determined from the compensation audio, can be sent directly to the target server, and the target server can recognize the instruction in it. The user therefore does not need to confirm that the device has been woken up before speaking the content captured as compensation audio. This solves the problem in related technologies where users need to wait a certain amount of time after waking up a voice device before speaking a voice command, which results in a poor user experience, and achieves wake-up without waiting.
[0044] In an exemplary embodiment, the method further includes: determining the sound features of the target wake-up word and the acquisition time of the target wake-up word based on the recognition result; wherein determining the compensation audio based on the acquisition time of the target wake-up word includes: determining the acquisition end time of the target wake-up word based on the sound features and the acquisition time; determining the result reporting time at which the recognition result is obtained; and determining the audio acquired during the time period from the acquisition end time to the result reporting time as the compensation audio. In this embodiment, the wake-up engine can determine the sound features of the target wake-up word and the acquisition time of the target wake-up word according to the recognition result, wherein the sound features may include sound waveforms, etc. The acquisition start time and acquisition end time of the target wake-up word can be determined based on the sound features and the acquisition time. During the speech recognition process, the result reporting time at which the recognition result is obtained can be determined. The time period from the acquisition end time to the result reporting time is determined as the time period for acquiring the compensation audio, and the audio acquired during this time period is determined as the compensation audio. The compensation audio may be part of the audio acquired by the sound acquisition device. For example, if the voice device is playing audio during the period from the acquisition end time to the result reporting time, the audio acquired by the sound acquisition device will include both the audio played by the voice device and the compensation audio. Therefore, after the compensation audio is acquired, it can be preprocessed to eliminate the audio played by the voice device.
[0045] In the above embodiments, the time period from the acquisition end time to the result reporting time is not a fixed duration; that is, the audio to be compensated is not of fixed length. This is because the wake-up duration differs from one wake-up to the next. It is affected by factors such as the user's speech rate, environmental noise, and distributed wake-up, and the differences can be significant. If too little audio is compensated, words will still be missed; if too much is compensated, part of the wake-up word, such as "Xiao You," may be included, causing recognition errors and affecting the interaction. Therefore, a compensation calculation module can be introduced. Using the acoustic information obtained from the wake-up engine, the start time of the wake-up word, the end time of the wake-up word (i.e., the acquisition end time), and the wake-up decision result reporting time (i.e., the result reporting time) can be accurately obtained. The length of the audio to be compensated is obtained by subtracting the end time of the wake-up word from the wake-up decision result reporting time. The compensation algorithm is therefore dynamic: for each wake-up event it takes the interval from the end of the wake-up word to the reporting time of that event and calculates the length of the audio to be compensated from it. A schematic diagram of determining the compensation audio is shown in Figure 3.
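The subtraction described above can be written as a short Python sketch. The function name `compensation_span`, the millisecond time base, and the 16 kHz sample rate are assumptions made for illustration; the source does not specify units or a sample rate.

```python
from typing import Tuple


def compensation_span(word_end_ms: int, report_ms: int,
                      sample_rate_hz: int = 16000) -> Tuple[int, int]:
    """Return (start, stop) sample indices of the compensation audio.

    The span runs from the end of the wake-up word to the moment the
    wake-up decision was reported, so its length varies per wake-up.
    """
    if report_ms < word_end_ms:
        raise ValueError("result cannot be reported before the word ends")
    start = word_end_ms * sample_rate_hz // 1000
    stop = report_ms * sample_rate_hz // 1000
    return start, stop


# Wake-up word ends at 1 min 42 s; result reported at 1 min 44 s
start, stop = compensation_span(102_000, 104_000)
length_in_samples = stop - start
```

With the timing example given earlier (wake-up word ending at 1 minute 42 seconds, result at 1 minute 44 seconds), the span covers two seconds of audio, or 32,000 samples at the assumed rate.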
[0046] In one exemplary embodiment, the method further includes: caching the acquired speech audio to obtain cached audio; wherein caching the acquired speech audio to obtain cached audio includes at least one of the following steps: caching the acquired first speech audio and second speech audio in real time, and determining the first speech audio and second speech audio as the cached audio; determining the speech audio cached in real time before obtaining the recognition result as the cached audio. In this embodiment, the sound acquisition device can cache the acquired audio in real time while acquiring audio, and can also cache the acquired first speech audio and second speech audio in real time.
[0047] In the above embodiments, without changing existing hardware resources or wake-up algorithms, a recording buffer can be added to the voice device. When a wake-up occurs, the data in the recording buffer can be used to compensate the beginning of the instruction-bearing second speech audio, so that no part of the instruction is lost. If no wake-up occurs, the data in the recording buffer is not uploaded to the cloud, which protects user privacy. By caching the recording and then compensating the audio to be recognized after wake-up, wake-up-without-wait interaction is achieved.
[0048] In one exemplary embodiment, the method further includes: deleting the cached audio when the recognition result indicates that the first speech audio does not include a target wake-up word; and determining the compensation audio from the cached audio when the recognition result indicates that the first speech audio includes a target wake-up word. In this embodiment, when the recognition result indicates that the first speech audio does not include a target wake-up word, the cached audio can be deleted to save storage space. When the recognition result indicates that the first speech audio includes a target wake-up word, the compensation audio can be determined from the cached audio. This prevents the situation where complete compensation audio cannot be obtained because no audio was cached.
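The caching and deletion behaviour described above could be sketched with a bounded ring buffer. This is an illustrative assumption, not the structure used in the patent: `RecordingBuffer` is a hypothetical class, samples are plain integers, and the fixed capacity is a choice made here to bound memory use.

```python
from collections import deque
from typing import Iterable, List


class RecordingBuffer:
    """Keeps only the most recent `capacity` samples (assumed design)."""

    def __init__(self, capacity: int) -> None:
        self._samples = deque(maxlen=capacity)
        self._total = 0  # number of samples ever written

    def write(self, chunk: Iterable[int]) -> None:
        chunk = list(chunk)
        self._samples.extend(chunk)   # oldest samples fall off the front
        self._total += len(chunk)

    def read(self, start: int, stop: int) -> List[int]:
        """Return samples with absolute indices in [start, stop) still held."""
        oldest = self._total - len(self._samples)
        lo = max(start, oldest) - oldest
        hi = max(min(stop, self._total), oldest) - oldest
        return list(self._samples)[lo:hi]

    def clear(self) -> None:
        """Delete the cached audio, e.g. when no wake-up word was found."""
        self._samples.clear()


buf = RecordingBuffer(capacity=4)
buf.write([0, 1, 2])
buf.write([3, 4, 5])             # buffer now holds samples 2..5
compensation = buf.read(3, 5)    # taken only if the wake-up word was found
buf.clear()                      # otherwise the cache is simply dropped
```

Absolute sample indices let the compensation span be requested directly from the wake-up engine's timestamps, even after older samples have been overwritten.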
[0049] In one exemplary embodiment, determining the audio to be recognized based on the compensation audio and the second speech audio includes: identifying the sound features of the second speech audio; if it is determined from the sound features of the second speech audio that the second speech audio includes a voice command, determining a sub-audio in the second speech audio that includes the voice command, and determining the compensation audio and the sub-audio as the audio to be recognized; if it is determined from the sound features of the second speech audio that the second speech audio does not include a voice command, determining the compensation audio as the audio to be recognized. In this embodiment, the second speech audio is audio acquired after the target wake-up word. When determining the audio to be recognized, it can be determined whether the second speech audio contains a sub-audio that includes a voice command. If the sub-audio exists, the compensation audio and the sub-audio are combined to obtain the audio to be recognized. If the sub-audio does not exist, the compensation audio alone is determined as the audio to be recognized.
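The two branches above can be sketched as follows. The source does not say which sound features indicate a voice command, so this sketch substitutes a simple frame-energy threshold as a stand-in; `find_command_span`, `audio_to_recognize`, the threshold, and the frame size are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_command_span(samples: List[int], threshold: int = 500,
                      frame: int = 4) -> Optional[Tuple[int, int]]:
    """Return (start, stop) covering the first to last loud frame, or None.

    A frame counts as loud when its mean absolute amplitude reaches the
    threshold. This is a stand-in for real voice-command detection.
    """
    loud = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        if sum(abs(s) for s in chunk) / len(chunk) >= threshold:
            loud.append(i)
    if not loud:
        return None
    return loud[0], min(loud[-1] + frame, len(samples))


def audio_to_recognize(compensation: List[int],
                       second_audio: List[int]) -> List[int]:
    span = find_command_span(second_audio)
    if span is None:
        return list(compensation)              # no command in second audio
    start, stop = span
    return list(compensation) + second_audio[start:stop]  # compensation first


with_command = audio_to_recognize(
    [900, 800], [0, 0, 0, 0, 700, 900, 800, 600, 0, 0, 0, 0])
without_command = audio_to_recognize([900, 800], [0] * 8)
```

A production system would more likely use a trained voice activity detector than a fixed threshold; the branching logic would stay the same.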
[0050] In one exemplary embodiment, the method further includes: preprocessing the buffered audio to obtain preprocessed buffered audio; the preprocessing includes at least one of the following: echo cancellation processing, automatic gain control processing; wherein, determining the compensation audio from the buffered audio includes: determining the compensation audio based on the preprocessed buffered audio. In this embodiment, the buffered audio can be preprocessed to determine the compensation audio from the preprocessed buffered audio.
[0051] In the above embodiments, since the voice device may capture speech while it is itself playing audio, the captured speech can be preprocessed to eliminate the sound played by the voice device and enhance the user's spoken voice. The preprocessing may include one or more of echo cancellation processing and automatic gain control processing.
[0052] In the above embodiments, preprocessing can be performed on all speech collected by the sound acquisition device, and preprocessing can also be performed on the speech collected after the target wake-up word, i.e., the audio to be recognized. Since the audio to be recognized collected after the target wake-up word contains speech commands, preprocessing only the audio to be recognized does not affect the effect of recognizing speech commands, and can save the processing resources of the voice device and improve the processing speed.
[0053] In the above embodiments, echo cancellation processing can be performed on the cached audio. This process eliminates sounds played by the voice device and ambient noise. Echo cancellation can be performed using an AEC (Acoustic Echo Cancellation) module. After echo cancellation, the audio can be further processed by automatic gain control (AGC), which enhances the user's spoken voice. This enhancement can be achieved using an AGC module. The cached audio data may be the audio that has been processed by both the AEC and AGC modules, which filters out the device's own sound and prevents the wake-up response prompt from being included.
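As a rough illustration of the gain-control stage only, the sketch below scales a block of 16-bit samples toward a target peak. It is a deliberately simplified block-level gain, not the AGC module of the patent, and echo cancellation is not sketched because it requires the playback reference signal and an adaptive filter. The function name and all parameter values are assumptions.

```python
from typing import List


def apply_gain_control(samples: List[int], target_peak: int = 20000,
                       max_gain: float = 8.0, limit: int = 32767) -> List[int]:
    """Scale a block so its peak approaches target_peak (simplified AGC)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)                     # silence: nothing to scale
    gain = min(target_peak / peak, max_gain)     # cap the boost on quiet input
    # Clamp to the 16-bit range so loud samples cannot overflow
    return [max(-limit, min(limit, round(s * gain))) for s in samples]


quiet = apply_gain_control([1000, -2000, 500])   # boost limited by max_gain
normal = apply_gain_control([10000, -5000])      # scaled to the target peak
```

Real AGC implementations adapt the gain smoothly over time rather than per block, which avoids audible jumps between blocks.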
[0054] In one exemplary embodiment, after sending the audio to be recognized to the target server, the method further includes: receiving a control command sent by the target server, the control command being generated based on the audio to be recognized; executing the target operation indicated by the control command; and deleting the cached audio. In this embodiment, after sending the audio to be recognized to the target server, the target server can recognize the audio to be recognized, determine the speech recognition result, determine the control command based on the speech recognition result, and send the control command. After receiving the control command, the voice device can determine the target operation corresponding to the control command and execute the target operation. After receiving the control command, the voice device can also delete the locally cached audio to save storage space.
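A minimal sketch of the device-side handling described above follows. The command format (a dictionary with `operation` and `target` keys), the dispatch table, and the function name are assumptions for illustration; the source does not define the control command's structure.

```python
from typing import Callable, Dict, List


def handle_control_command(command: Dict[str, str],
                           operations: Dict[str, Callable[[str], None]],
                           cached_audio: List[int]) -> bool:
    """Execute the operation named by the command, then drop the cache."""
    handler = operations.get(command.get("operation", ""))
    executed = handler is not None
    if executed:
        handler(command.get("target", ""))   # perform the target operation
    cached_audio.clear()                     # cached audio is no longer needed
    return executed


# Usage with a stand-in operation that just records its target
log: List[str] = []
cache = [1, 2, 3]
ok = handle_control_command(
    {"operation": "turn_on", "target": "air_conditioner"},
    {"turn_on": log.append},
    cache,
)
```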
[0055] The speech recognition method will be explained below with reference to specific implementation methods:
[0056] Figure 4 is a flowchart of a speech recognition method according to a specific embodiment of the present invention. As shown in Figure 4, the method includes:
[0057] Step S402, the microphone (corresponding to the above-mentioned sound acquisition device) picks up sound.
[0058] Step S404: Perform AEC (Acoustic Echo Cancellation) processing on the picked-up audio.
[0059] Step S406: Perform AGC (Automatic Gain Control) processing on the AEC-processed speech.
[0060] Step S408: Wake-up determination. If the device is determined to have been woken up, execute step S410; otherwise, return to step S402.
[0061] Step S410: The compensation module performs sound compensation, placing recording B (corresponding to the compensation audio above) before recording A (corresponding to the other audio above) to obtain recording C (corresponding to the audio to be recognized above).
[0062] Step S412: Upload the recording C to the server.
[0063] Step S414: Recording C is recognized using automatic speech recognition (ASR) technology.
[0064] In the aforementioned embodiments, audio buffering is used to retrieve buffered data after wake-up, achieving a wake-up-without-wait effect. Since it requires neither improved hardware performance nor optimization of the wake-up algorithm, it is characterized by low cost and rapid implementation, making it highly scalable. It achieves the effect of completing "wake-up + command issuance" with a single sentence, without waiting for wake-up before issuing commands.
[0065] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0066] This embodiment also provides a voice recognition device for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and / or hardware that performs a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
[0067] Figure 5 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Figure 5, the device includes:
[0068] The acquisition module 52 is used to acquire a first speech audio and a second speech audio within a preset time period after the first speech audio;
[0069] The first recognition module 54 is used to recognize the first voice audio and obtain a recognition result;
[0070] The determining module 56 is used to determine compensation audio based on the acquisition time of the target wake-up word when the recognition result indicates that the first speech audio includes a target wake-up word;
[0071] The second recognition module 58 is used to determine the audio to be recognized based on the compensated audio and the second speech audio, and send the audio to be recognized to the target server to instruct the target server to recognize the instructions included in the audio to be recognized.
[0072] In the above embodiments, the first and second audio recordings can be audio collected by a sound acquisition device, which can be a sound acquisition device integrated into the voice device, such as a microphone. The voice device can include smart speakers, smart home devices, and devices with voice interaction functions, such as mobile phones, tablets, and smart wearable devices with voice interaction functions.
[0073] In the above embodiments, when a user wants to issue a command through the voice device, they can directly say the wake word and the voice containing the command, such as "Xiao You, Xiao You, turn on the air conditioner." After the sound acquisition device acquires the voice, it can recognize the voice through the processor included in the voice device to obtain the recognition result. Specifically, whether the voice contains the target wake word can be recognized locally on the voice device, without needing to upload the acquired voice to the target server in real time. Only when the target wake word is recognized is the compensation audio uploaded to the target server, thus protecting the user's privacy.
[0074] In the above embodiments, when the recognition result indicates that the speech includes the target wake-up word, the compensation audio can be determined based on the acquisition time of the target wake-up word. The target wake-up word may be, for example, "Xiao You Xiao You". The target wake-up word can also be customized: users can issue a command to the voice device to change the system's default wake-up word to a custom one.
[0075] In the above embodiments, the compensation audio can be collected after the target wake-up word, and the second collection time period for collecting the compensation audio is adjacent to the first collection time period for collecting the target wake-up word. For example, the target wake-up word can be the speech collected from 1 minute 40 seconds to 1 minute 42 seconds, and the compensation audio can be the speech collected from 1 minute 42 seconds to 1 minute 44 seconds.
[0076] In the above embodiments, the target speech (that is, the audio to be recognized) includes the compensation audio and other audio besides the compensation audio. The other audio is the speech captured after the compensation audio, and in the target speech the compensation audio precedes the other audio. When a user says "Xiao You Xiao You, turn on the air conditioner," the sound acquisition device can capture the user's speech, and the voice device can perform recognition and obtain a recognition result. Once it is determined that the speech contains the target wake-up word "Xiao You Xiao You," the recognition result is "contains the target wake-up word." Between the moment the device begins recognition and the moment the recognition result is obtained, the user may have already uttered part of the voice command, such as "turn on." If capture of the voice command started only after the recognition result was obtained, possibly only "air conditioner" would be captured. Therefore, the captured "turn on" can be used as the compensation audio, and "air conditioner" can be used as the other audio. During uploading, the compensation audio is placed before the other audio to obtain the target speech, which is then sent to the target server. After receiving the audio to be recognized, the target server can analyze it and identify the instructions included in the audio, such as "turn on the air conditioner". The other audio can be audio determined based on the second speech audio. As a result, there is no need to wait for the voice device to respond before uttering the voice command, which improves the user experience.
[0077] In the above embodiments, users can speak the wake word and voice command consecutively, without having to say the wake word first and then speak the voice command after the voice device is woken up. This reduces waiting time and improves the user experience.
[0078] This invention involves acquiring a first voice audio file and a second voice audio file within a preset time period following the first voice audio file. The first voice audio file is then recognized to obtain a recognition result. If the recognition result indicates that the first voice audio file contains a target wake-up word, a compensation audio file is determined based on the acquisition time of the target wake-up word. The compensation audio file and the second voice audio file are then used to determine the audio file to be recognized. This audio file is then sent to a target server to instruct the target server to recognize the command contained within it. Since the first voice audio file contains the target wake-up word, the audio file to be recognized, determined based on the compensation audio file, can be directly sent to the target server. The target server can then recognize the command within the audio file without needing to confirm that the device has been woken up before speaking the compensation audio file. Therefore, this invention solves the problem in related technologies where users need to wait a certain amount of time after waking up the voice device before speaking a voice command, resulting in a poor user experience. It achieves the effect of waking up without waiting, thus improving the user experience.
[0079] In an exemplary embodiment, the apparatus is further configured to: determine the sound features of the target wake-up word and the acquisition time of the target wake-up word based on the recognition result; wherein determining compensation audio based on the acquisition time of the target wake-up word includes: determining the acquisition end time of the target wake-up word based on the sound features and the acquisition time; determining the result reporting time of obtaining the recognition result; and determining the audio acquired during the time period from the end time to the result reporting time as the compensation audio. In this embodiment, the sound features of the target wake-up word and the acquisition time of the target wake-up word can be determined by a wake-up engine based on the recognition result, wherein the sound features may include sound waveforms, etc. The acquisition start time and acquisition end time of the target wake-up word can be determined based on the sound features and the acquisition time. During the speech recognition process, the result reporting time of obtaining the recognition result can be determined. The time period from the end time to the result reporting time is determined as the time period for acquiring compensation audio, and the audio acquired during this time period is determined as compensation audio. The compensation audio may be part of the audio acquired by the sound acquisition device. For example, if the voice device is playing audio during the period from the end time to the result reporting time, the audio acquired by the sound acquisition device will include both the audio played by the voice device and the compensation audio. Therefore, after the audio is acquired, it can be preprocessed to eliminate the audio played by the voice device.
[0080] In the above embodiments, the time period from the end time to the result reporting time is not a fixed duration. That is, the audio to be compensated is not of fixed duration. This is because the wake-up duration is not fixed each time. The wake-up duration is affected by factors such as user speech rate, environmental noise, and distributed wake-up, resulting in different wake-up durations, and this difference can be significant. If the compensation is too little, words will still be missed; if the compensation is too much, the wake-up word "Xiao You" may be included, causing recognition errors and affecting interaction. Therefore, a compensation calculation module can be introduced. By utilizing the acoustic information obtained from the wake-up engine, the start time of the wake-up word, the end time of the wake-up word (i.e., the end time of acquisition), and the wake-up decision result reporting time (i.e., the result reporting time) can be accurately obtained. The length of the audio to be compensated can be accurately obtained by subtracting the end time of the wake-up word from the wake-up decision result reporting time. The compensation algorithm uses a dynamic calculation method, obtaining the accurate time interval from the end of the wake-up word to the wake-up event reporting time from the wake-up event, thereby accurately calculating the length of the audio to be compensated. A schematic diagram of determining the second acquisition time period can be found in Figure 3.
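The dynamic calculation described above can be illustrated with a minimal sketch. It assumes that the wake-up engine reports three timestamps in seconds and that the buffered recording is a list of samples at 16 kHz; the names `WakeEvent`, `compensation_duration`, and `extract_compensation` are hypothetical and do not come from the embodiments.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono samples


@dataclass
class WakeEvent:
    """Hypothetical wake-up event; the field names are illustrative only."""
    word_start_s: float  # start time of the wake-up word
    word_end_s: float    # acquisition end time of the wake-up word
    reported_s: float    # time at which the wake-up decision was reported


def compensation_duration(event: WakeEvent) -> float:
    """Length to compensate = result reporting time - wake-word end time."""
    return max(0.0, event.reported_s - event.word_end_s)


def extract_compensation(samples, buffer_start_s, event, sample_rate=SAMPLE_RATE):
    """Return the samples recorded between the wake-word end and the report.

    buffer_start_s is the timestamp of samples[0] (assumed to be known).
    """
    first = max(0, round((event.word_end_s - buffer_start_s) * sample_rate))
    last = max(first, round((event.reported_s - buffer_start_s) * sample_rate))
    return samples[first:last]
```

Because the slice starts at the wake-word end time rather than at a fixed offset, the wake-up word itself is excluded and the compensated length follows the actual decision latency of each wake-up.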
[0081] In an exemplary embodiment, the device can be used to cache the acquired speech audio to obtain cached audio; wherein, caching the acquired speech audio to obtain cached audio can be achieved through at least one of the following steps: caching the acquired first speech audio and second speech audio in real time, and determining the first speech audio and second speech audio as the cached audio; determining the speech audio cached in real time before obtaining the recognition result as the cached audio. In this embodiment, the sound acquisition device can cache the acquired audio in real time while acquiring audio, and can also cache the second speech audio if it is determined that the first speech audio includes a target wake-up word. It can also cache the acquired first and second speech audio in real time.
[0082] In the above embodiments, without changing existing hardware resources and wake-up algorithms, a recording buffer can be added to the voice device. When wake-up occurs, the data in the wake-up buffer can be used to compensate for the beginning of the audio in the second voice audio, including the instruction, thus ensuring no recognition instructions are lost. If no wake-up occurs, the data in the recording buffer will not be uploaded to the cloud, guaranteeing user privacy. By caching the recording and then compensating for the audio to be recognized after wake-up, the goal of wake-up-without-wait interaction is achieved.
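A recording buffer of this kind could be sketched as a fixed-capacity ring buffer. The sketch below assumes samples are plain integers and uses Python's `collections.deque` with `maxlen`; the class name `RecordingBuffer` and its methods are hypothetical, and the embodiments do not prescribe a particular data structure.

```python
from collections import deque


class RecordingBuffer:
    """Fixed-capacity buffer that keeps only the most recent samples."""

    def __init__(self, capacity_samples):
        # A deque with maxlen silently discards the oldest items when full.
        self._samples = deque(maxlen=capacity_samples)

    def write(self, frame):
        """Append one captured frame (an iterable of samples)."""
        self._samples.extend(frame)

    def tail(self, n):
        """Return the newest n samples, oldest first."""
        n = min(n, len(self._samples))
        if n <= 0:
            return []
        return list(self._samples)[-n:]

    def clear(self):
        """Drop everything, e.g. when no wake-up occurred."""
        self._samples.clear()

    def __len__(self):
        return len(self._samples)
```

The capacity bounds memory use on the device; it only needs to cover the longest expected interval between the end of the wake-up word and the reporting of the wake-up decision.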
[0083] In one exemplary embodiment, the apparatus is further configured to: delete the cached audio when the recognition result indicates that the first speech audio does not include a target wake-up word; and determine the compensation audio from the cached audio when the recognition result indicates that the first speech audio includes a target wake-up word. In this embodiment, when the recognition result indicates that the first speech audio does not include a target wake-up word, the cached audio can be deleted to save storage space. When the recognition result indicates that the first speech audio includes a target wake-up word, the compensation audio can be determined from the cached audio. This prevents the situation where complete compensation audio cannot be obtained because no audio was cached.
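The two branches of this embodiment can be sketched as follows. The sketch assumes the cache is a mutable list of samples with the newest sample last, and that the wake-up decision has just been reported, so the compensation audio is the most recent `compensation_len` samples; the function name `resolve_cache` is hypothetical.

```python
def resolve_cache(cached_audio, contains_wake_word, compensation_len):
    """Return the compensation samples on wake-up; otherwise drop the cache.

    cached_audio: mutable list of samples, newest last (assumed layout).
    compensation_len: number of samples to compensate (computed elsewhere).
    """
    if not contains_wake_word:
        # No wake-up word: nothing is kept locally and nothing is uploaded.
        cached_audio.clear()
        return None
    n = max(0, min(compensation_len, len(cached_audio)))
    return cached_audio[len(cached_audio) - n:]
```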
[0084] In an exemplary embodiment, the second recognition module 58 can determine the audio to be recognized based on the compensation audio and the second speech audio in the following manner: recognizing the sound features of the second speech audio; if it is determined from the sound features of the second speech audio that the second speech audio includes a voice command, determining a sub-audio in the second speech audio that includes the voice command, and determining the compensation audio and the sub-audio as the audio to be recognized; if it is determined from the sound features of the second speech audio that the second speech audio does not include a voice command, determining the compensation audio as the audio to be recognized. In this embodiment, the second speech audio is the audio collected after the target wake-up word. When determining the audio to be recognized, it can be determined whether there is a sub-audio in the second speech audio that includes a voice command. If there is such a sub-audio, the compensation audio and the sub-audio are combined to obtain the audio to be recognized. If there is no such sub-audio, the compensation audio is determined as the audio to be recognized.
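The selection logic above can be sketched as follows. The embodiment does not specify which sound features are used, so this sketch substitutes a simple frame-energy threshold as a stand-in; the function names, the frame length, and the threshold value are all assumptions made for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one non-empty frame."""
    return sum(s * s for s in frame) / len(frame)


def find_command_span(samples, frame_len=160, threshold=1e6):
    """Return (start, end) sample indices of the voiced region, or None.

    Energy thresholding is only a placeholder for the unspecified
    sound-feature analysis in the embodiment.
    """
    voiced = [
        i for i in range(0, len(samples), frame_len)
        if frame_energy(samples[i:i + frame_len]) >= threshold
    ]
    if not voiced:
        return None
    return voiced[0], min(len(samples), voiced[-1] + frame_len)


def build_audio_to_recognize(compensation, second_audio, **detector_args):
    """Compensation audio first, then the command sub-audio if one exists."""
    span = find_command_span(second_audio, **detector_args)
    if span is None:
        return list(compensation)
    start, end = span
    return list(compensation) + list(second_audio[start:end])
```

Placing the compensation audio first preserves the spoken order, so that "turn on" precedes "the air conditioner" in the audio sent to the target server.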
[0085] In an exemplary embodiment, the apparatus can further be used to preprocess the buffered audio to obtain preprocessed buffered audio; the preprocessing includes at least one of the following: echo cancellation processing, automatic gain control processing; wherein, determining the compensation audio from the buffered audio includes: determining the compensation audio based on the preprocessed buffered audio. In this embodiment, the buffered audio can be preprocessed to determine the compensation audio from the preprocessed buffered audio.
[0086] In the above embodiments, since the voice device may capture speech while it is itself playing audio, the captured speech can be preprocessed to eliminate the sound played by the voice device itself and enhance the user's spoken voice. The preprocessing may include one or more of echo cancellation processing and automatic gain control processing.
[0087] In the above embodiments, preprocessing can be performed on all speech collected by the sound acquisition device, and preprocessing can also be performed on the speech collected after the target wake-up word, i.e., the audio to be recognized. Since the audio to be recognized collected after the target wake-up word contains speech commands, preprocessing only the audio to be recognized does not affect the effect of recognizing speech commands, and can save the processing resources of the voice device and improve the processing speed.
[0088] In the above embodiments, echo cancellation processing can be performed on the cached audio. This process eliminates sounds played by the voice device and ambient noise. Echo cancellation can be performed using an AEC (Acoustic Echo Cancellation) module. After echo cancellation, the audio can be further processed with Automatic Gain Control (AGC), which enhances the user's spoken voice. This enhancement can be achieved using an AGC module. The cached audio data may be the audio processed by both the AEC and AGC modules, which filters out the device's own sound and prevents wake-up response words from being included.
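As an illustration of the gain-control stage only, the sketch below scales a block of 16-bit samples toward a target RMS level. It is a deliberately simplified stand-in and not the AGC module of the embodiment; the function names, the target level, and the gain cap are assumptions. Echo cancellation is not sketched, since it additionally requires the playback reference signal.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def apply_gain_control(samples, target_rms=3000.0, max_gain=8.0, limit=32767):
    """Scale a block toward target_rms, capping the gain and clipping.

    Simplified block-wise gain: a practical AGC would smooth the gain
    over time instead of recomputing it independently per block.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # silence: nothing to amplify
    gain = min(max_gain, target_rms / level)
    return [max(-limit - 1, min(limit, round(s * gain))) for s in samples]
```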
[0089] In an exemplary embodiment, the device can further be used to receive a control command sent by the target server after sending the audio to be recognized to the target server, the control command being generated based on the audio to be recognized; execute the target operation indicated by the control command; and delete the cached audio. In this embodiment, after sending the audio to be recognized to the target server, the target server can recognize the audio to be recognized, determine the speech recognition result, determine the control command based on the speech recognition result, and send the control command. After receiving the control command, the voice device can determine the target operation corresponding to the control command and execute the target operation. After receiving the control command, the voice device can also delete the locally cached audio to save storage space.
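The handling of a control command could be sketched as below. The embodiment does not define a command format, so the dictionary layout with `operation` and `params` keys, the handler table, and the function name `handle_control_command` are hypothetical.

```python
def handle_control_command(command, handlers, cache):
    """Execute the operation named by the command, then delete cached audio.

    command:  dict such as {"operation": "set_power", "params": {...}}
              (assumed format, not defined by the embodiment).
    handlers: maps operation names to callables on the voice device.
    cache:    mutable buffer holding the locally cached audio.
    """
    operation = command["operation"]
    params = command.get("params", {})
    try:
        handler = handlers[operation]  # raises KeyError if unknown
        return handler(**params)
    finally:
        # The cache is released whether or not the operation succeeded.
        cache.clear()
```

Clearing the cache in a `finally` clause is a design choice of this sketch: it keeps local storage from accumulating audio even when a command cannot be executed.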
[0090] It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, but are not limited to: all the above modules are located in the same processor; or, the above modules are located in different processors in any combination.
[0091] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described in any of the preceding claims.
[0092] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0093] Embodiments of the present invention also provide an electronic device including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.
[0094] In one exemplary embodiment, the electronic device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor and the input / output device is connected to the processor.
[0095] Specific examples in this embodiment can be found in the examples described in the above embodiments and exemplary implementations, and will not be repeated here.
[0096] It is obvious to those skilled in the art that the modules or steps of the present invention described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. They can be implemented using computer-executable program code, and thus can be stored in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those described herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any particular combination of hardware and software.
[0097] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, or improvements made within the principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A speech recognition method, characterized in that it includes: collecting a first speech audio and a second speech audio within a preset time period after the first speech audio; recognizing the first speech audio to obtain a recognition result; if the recognition result indicates that the first speech audio includes a target wake-up word, determining compensation audio based on the acquisition time of the target wake-up word; determining the audio to be recognized based on the compensation audio and the second speech audio, and sending the audio to be recognized to the target server to instruct the target server to recognize the instructions included in the audio to be recognized; wherein the audio to be recognized includes the compensation audio and other audio, the other audio being audio collected after the compensation audio and determined based on the second speech audio; the method further includes: determining the sound features of the target wake-up word and the acquisition time of the target wake-up word based on the recognition result; wherein determining the compensation audio based on the acquisition time of the target wake-up word includes: determining the acquisition end time of the target wake-up word based on the sound features and the acquisition time; determining the result reporting time of obtaining the recognition result; and determining the audio acquired during the time period from the end time to the result reporting time as the compensation audio.
2. The method according to claim 1, characterized in that the method further includes: caching the acquired speech audio to obtain cached audio; wherein caching the acquired speech audio to obtain cached audio includes at least one of the following steps: caching the acquired first speech audio and second speech audio in real time, and determining the first speech audio and the second speech audio as the cached audio; determining the speech audio cached in real time before the recognition result is obtained as the cached audio.
3. The method according to claim 2, characterized in that, The method further includes: If the recognition result indicates that the target wake word is not included in the first voice audio, the cached audio is deleted; If the recognition result indicates that the first speech audio includes the target wake word, the compensation audio is determined from the cached audio.
4. The method according to any one of claims 1 to 3, characterized in that determining the audio to be recognized based on the compensation audio and the second speech audio includes: recognizing the sound features of the second speech audio; if it is determined, based on the sound features of the second speech audio, that the second speech audio includes a voice command, determining a sub-audio in the second speech audio that includes the voice command, and determining the compensation audio and the sub-audio as the audio to be recognized; if it is determined, based on the sound features of the second speech audio, that the second speech audio does not include a voice command, determining the compensation audio as the audio to be recognized.
5. The method according to any one of claims 2 to 4, characterized in that the method further includes: preprocessing the cached audio to obtain preprocessed cached audio, the preprocessing including at least one of the following: echo cancellation processing, automatic gain control processing; wherein determining the compensation audio from the cached audio includes: determining the compensation audio based on the preprocessed cached audio.
6. The method according to any one of claims 2 to 5, characterized in that, after sending the audio to be recognized to the target server, the method further includes: receiving a control command sent by the target server, the control command being generated based on the audio to be recognized; executing the target operation indicated by the control command; and deleting the cached audio.
7. A speech recognition device, characterized in that it includes: an acquisition module, used to acquire a first speech audio and a second speech audio within a preset time period after the first speech audio; a first recognition module, used to recognize the first speech audio and obtain a recognition result; a determination module, used to determine compensation audio based on the acquisition time of the target wake-up word when the recognition result indicates that the first speech audio includes a target wake-up word; a second recognition module, used to determine the audio to be recognized based on the compensation audio and the second speech audio, and send the audio to be recognized to the target server to instruct the target server to recognize the instructions included in the audio to be recognized; wherein the audio to be recognized includes the compensation audio and other audio, the other audio being audio collected after the compensation audio and determined based on the second speech audio; the device is further configured to: determine the sound features of the target wake-up word and the acquisition time of the target wake-up word based on the recognition result; wherein the determination module determines the compensation audio based on the acquisition time of the target wake-up word in the following manner: determining the acquisition end time of the target wake-up word based on the sound features and the acquisition time; determining the result reporting time of obtaining the recognition result; and determining the audio acquired during the time period from the end time to the result reporting time as the compensation audio.
8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described in any one of claims 1 to 6.
9. An electronic device comprising a memory and a processor, characterized in that, The memory stores a computer program, and the processor is configured to run the computer program to perform the method as described in any one of claims 1 to 6.