Voice detection method, storage medium, and terminal
The speech detection method divides a sound signal into frames, calculates the variance value and finds the peak value of each frame, and performs variance detection and peak detection on them. It addresses the high complexity and high power consumption of existing techniques and achieves high-accuracy speech detection with low complexity. It is applicable to terminals such as TWS earphones, smart speakers, and mobile phones.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN BLUETRUM TECH CO LTD
- Filing Date
- 2022-12-13
- Publication Date
- 2026-06-16
AI Technical Summary
Existing speech detection methods are complex to design and have high power consumption requirements for terminals.
By calculating the variance value and finding the peak value of the audio signal in frames, peak detection and variance detection are performed. Combined with multi-dimensional decision-making, the speech detection result is obtained, reducing the complexity of speech detection.
It effectively reduces terminal power consumption while improving the accuracy of voice detection, enabling precise recognition of voice signals and activation of subsequent functions.
Figure CN115985346B_ABST
Description
Technical Field
[0001] This invention relates to the field of speech signal processing technology, specifically to a speech detection method, storage medium, and terminal.
Background Technology
[0002] With the continuous development of speech signal processing, speech signals are widely used in voice-enabled terminals such as TWS (True Wireless Stereo) earphones, smart speakers, mobile phones, tablets, and computers. Speech detection, as a front-end operation in speech signal processing, aims to identify speech signals from noisy environments so that the terminal can perform corresponding operations based on the content of the speech signal, such as identifying whether it is a specific user based on voice timbre or executing corresponding commands based on keywords. However, current speech detection methods are relatively complex in design and place high demands on the power consumption of the terminal.
Summary of the Invention
[0003] The present invention aims to provide a voice detection method, storage medium, and terminal to solve the technical problems of the complexity of voice detection methods in existing terminals and the high power consumption requirements of terminals.
[0004] In a first aspect, embodiments of the present invention provide a speech detection method, comprising: dividing a sound signal into frames to obtain n sound frame signals; calculating the variance value corresponding to each of the n sound frame signals and finding a peak value; performing peak detection based on the peak value to obtain a peak detection result; performing variance detection based on the variance value to obtain a variance detection result; and obtaining a speech detection result based on the peak detection result and the variance detection result, wherein the speech detection result is used to indicate whether the sound signal is speech.
[0005] Optionally, the method further includes: performing burst sound detection based on the variance value to obtain burst sound detection results; and performing Gaussian white noise detection based on the variance value to obtain Gaussian white noise detection results.
[0006] Optionally, obtaining the speech detection result based on the peak detection result and the variance detection result includes obtaining the speech detection result based on the peak detection result, the variance detection result, the burst sound detection result, and the Gaussian white noise detection result.
[0007] Optionally, the peak detection based on the peak value to obtain the peak detection result includes: selecting m sound frame signals from the n sound frame signals; obtaining the target peak value and the peak average value based on the peak values corresponding to each of the m sound frame signals; obtaining the peak noise threshold value based on the peak average value; and comparing the target peak value with the peak noise threshold value to obtain the peak detection result.
[0008] Optionally, obtaining the target peak value and the average peak value based on the peak values corresponding to each of the m sound frame signals includes: smoothing the peak values corresponding to each of the m sound frame signals to obtain smoothed peak values corresponding to each of the m sound frame signals; selecting the smoothed peak value with the largest value as the target peak value; and calculating the average peak value based on the smoothed peak values corresponding to each of the m sound frame signals.
[0009] Optionally, the method further includes updating the peak noise threshold based on the peak detection result.
[0010] Optionally, the variance detection based on the variance value to obtain the variance detection result includes: selecting m sound frame signals from the n sound frame signals; obtaining a target variance value and a mean variance value based on the variance value corresponding to each of the m sound frame signals; obtaining a variance noise threshold based on the mean variance value; and comparing the target variance value with the variance noise threshold to obtain the variance detection result.
[0011] Optionally, obtaining the target variance value and the mean variance value based on the variance values corresponding to each of the m sound frame signals includes: smoothing the variance values corresponding to each of the m sound frame signals to obtain smoothed variance values corresponding to each of the m sound frame signals; selecting the smoothed variance value with the largest value as the target variance value; and calculating the mean variance value based on the smoothed variance values corresponding to each of the m sound frame signals.
[0012] Optionally, the method further includes updating the variance noise threshold based on the variance detection result.
[0013] Optionally, performing burst sound detection based on the variance value to obtain the burst sound detection result includes: selecting s sound frame signals from the n sound frame signals; determining whether the variance values corresponding to each of the s sound frame signals satisfy a preset set of burst sound detection conditions; if the variance values corresponding to each of the s sound frame signals satisfy any one of the preset set of burst sound detection conditions, then the burst sound detection result is determined to be true; if the variance values corresponding to each of the s sound frame signals do not satisfy the preset set of burst sound detection conditions, then the burst sound detection result is determined to be false.
[0014] Optionally, the method of performing Gaussian white noise detection based on the variance value to obtain the Gaussian white noise detection result includes: comparing the variance value with the variance noise threshold to obtain the Gaussian white noise detection result.
[0015] Optionally, obtaining the speech detection result based on the peak detection result, the variance detection result, the burst sound detection result, and the Gaussian white noise detection result includes: if a first preset condition is met, then the speech detection result is determined to be the variance detection result, where the first preset condition is that the Gaussian white noise detection result is true; if a second preset condition is met, then the speech detection result is determined to be the peak detection result, where the second preset condition is that the burst sound detection result is true and the Gaussian white noise detection result is false; if neither the first preset condition nor the second preset condition is met, then the speech detection result is determined to be the peak sound detection result.
[0016] In a second aspect, embodiments of the present invention provide a storage medium storing computer-executable instructions for causing a terminal to perform the aforementioned voice detection method.
[0017] In a third aspect, embodiments of the present invention provide a terminal, comprising:
[0018] At least one processor; and,
[0019] A memory communicatively connected to the at least one processor; wherein,
[0020] The memory stores instructions that can be executed by the at least one processor, which, when executed by the at least one processor, enables the at least one processor to perform the above-described speech detection method.
[0021] In the speech detection method provided in this embodiment of the invention, the sound signal is divided into frames to obtain n sound frame signals. The variance value corresponding to each of the n sound frame signals is calculated and the peak value is found. Peak detection is performed based on the peak value to obtain the peak detection result. Variance detection is performed based on the variance value to obtain the variance detection result. The speech detection result is obtained based on the peak detection result and the variance detection result. In this way, it is possible to determine whether there is a speech signal in the current noisy signal based on the variance and peak value of the sound frame. Thus, the terminal can determine whether to enable subsequent functions such as keyword recognition based on the speech detection result. This speech detection method has low complexity and can effectively reduce the power consumption of the terminal.
Attached Figure Description
[0022] One or more embodiments are illustrated by way of example with reference numerals in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings are denoted as similar elements. Unless otherwise stated, the figures in the drawings are not to be limited by scale.
[0023] Figure 1 is a flowchart illustrating a speech detection method provided in an embodiment of the present invention;
[0024] Figure 2 is a flowchart illustrating another speech detection method provided in an embodiment of the present invention;
[0025] Figure 3 is a flowchart illustrating another speech detection method provided in an embodiment of the present invention;
[0026] Figure 4 is a schematic diagram of the structure of a terminal provided in an embodiment of the present invention.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without inventive effort are within the scope of protection of this invention.
[0028] It should be noted that, unless otherwise specified, the various features in the embodiments of this invention can be combined with each other, all of which are within the protection scope of this invention. Furthermore, although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than the module division in the device or the order in the flowchart. Moreover, the terms "first," "second," and "third" used in this invention do not limit the data or execution order, but only distinguish identical or similar items with essentially the same function and effect.
[0029] This invention provides a speech detection method. It should be noted that the method can be executed by any voice-enabled terminal, such as TWS earphones, smart speakers, mobile phones, tablets, or computers; this invention does not impose any limitations on this. Referring to Figure 1, the speech detection method includes:
[0030] S11: Divide the sound signal into n sound frame signals, calculate the variance value of each of the n sound frame signals and find the peak value.
[0031] In this step, the terminal can acquire sound signals from the surrounding environment in real time and perform time-domain-based framing processing on the sound signals to obtain n sound frame signals. The length of each sound frame signal can be 256 sampling points, or it can be other numerical sampling points; this invention does not impose any limitations on this.
[0032] The terminal can perform peak lookup on the n audio frame signals, taking the maximum value of each frame as the peak value, which is the peak value corresponding to each of the n audio frame signals. At the same time, the terminal can perform variance calculation on the n audio frame signals to obtain the variance value corresponding to each of the n audio frame signals.
[0033] In one embodiment, obtaining the variance values corresponding to each of the n sound frame signals may include: setting the mean used for the initial frame signal to 0; starting from the second frame, using the mean of the previous frame to calculate the variance of the current frame, thereby obtaining the variance value of the current frame; and repeating the above variance calculation steps to obtain the variance values corresponding to each of the n sound frame signals.
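The per-frame computation of step S11 could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the invention: the function names `split_frames` and `frame_peaks_and_variances` are hypothetical, dropping a trailing partial frame is an assumption, and the variance formula is only one reading of paragraph [0033], in which the squared deviations of each frame are taken about the mean of the previous frame (0 for the initial frame).

```python
def split_frames(samples, frame_len=256):
    """Split samples into consecutive full frames of frame_len samples.

    Assumption: a trailing partial frame is simply dropped.
    """
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def frame_peaks_and_variances(frames):
    """Return (peaks, variances), one value per frame.

    The peak is the maximum sample value of the frame, as in [0032].
    The variance of frame k uses the mean of frame k-1 (0 for the first
    frame), which is one interpretation of [0033].
    """
    peaks, variances = [], []
    prev_mean = 0.0  # mean used for the initial frame
    for frame in frames:
        peaks.append(max(frame))
        variances.append(sum((x - prev_mean) ** 2 for x in frame) / len(frame))
        prev_mean = sum(frame) / len(frame)  # reused by the next frame
    return peaks, variances


frames = split_frames([1, 3, 2, 4, 9], frame_len=2)
peaks, variances = frame_peaks_and_variances(frames)
```

Under this reading, the variance of a frame can be accumulated in a single pass over its samples, since the mean it depends on is already known when the frame arrives.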
[0034] S12: Perform peak detection based on the peak value to obtain the peak detection result.
[0035] In this step, the peak detection result can be true or false. When the peak detection result is true, it means that the sound signal is initially considered to be speech in this peak detection step; when the peak detection result is false, it means that the sound signal is initially considered not to be speech in this peak detection step.
[0036] This article defines speech as the sound emitted by the end user that can be used by the terminal for subsequent speech signal processing operations such as keyword recognition.
[0037] S13: Perform variance detection based on the variance value to obtain the variance detection result.
[0038] In this step, the variance detection result can be true or false. When the variance detection result is true, it means that the sound signal is preliminarily considered to be speech in this variance detection step; when the variance detection result is false, it means that the sound signal is preliminarily considered not to be speech in this variance detection step.
[0039] S14: Obtain the speech detection results based on the peak detection results and variance detection results.
[0040] In this step, the speech detection result is used to indicate whether the sound signal is speech. The speech detection result can be true or false. When the speech detection result is true, it means that the terminal determines that the sound signal is speech; when the speech detection result is false, it means that the terminal determines that the sound signal is not speech.
[0041] In one embodiment, when the terminal determines that the voice detection result is true, it can activate functions such as keyword recognition to further process the voice signal, so that the terminal can perform corresponding operations based on the content of the voice signal.
[0042] In some embodiments, the peak detection based on the peak value to obtain the peak detection result includes:
[0043] S121: Select m sound frame signals from n sound frame signals.
[0044] S122: Obtain the target peak value and the average peak value based on the peak values corresponding to each of the m sound frame signals.
[0045] S123: Obtain the peak noise threshold based on the average peak value, and compare the target peak value with the peak noise threshold to obtain the peak detection result.
[0046] It should be noted that the value of m is less than or equal to the value of n. For example, when n = 20, m can take values such as 10 or 15, and this invention does not impose any restrictions on this.
[0047] For example, assuming m is 10, the terminal selects 10 audio frames from n audio frames, obtains the peak value of each of the 10 audio frames, sorts the 10 peak values, and selects the largest value as the target peak value. Simultaneously, the average of these 10 peak values is calculated to obtain the peak mean.
[0048] Optionally, the peak noise threshold can be obtained from the peak mean value using the following formula:
[0049] $NT_{peak} = a \cdot P_{mean}$
[0050] where $NT_{peak}$ represents the peak noise threshold, $P_{mean}$ represents the peak mean, and $a$ represents the preset noise parameter. For example, the value of $a$ can be 1.5, or another value such as 2 or 3.
[0051] Optionally, the target peak value can be compared with the peak noise threshold to obtain the peak detection result as follows: if the target peak value is greater than the peak noise threshold, the terminal can increment the peak truth counter by 1 (the initial value of the peak truth counter is 0), and when the peak truth counter is greater than the preset peak truth count value, the peak detection result is determined to be true; if the target peak value is less than or equal to the peak noise threshold, the terminal can increment the peak false value counter by 1 (the initial value of the peak false value counter is 0), and when the peak false value counter is greater than the preset peak false value count value, the peak detection result is determined to be false.
[0052] For example, suppose $NT_{peak} = 1.5 P_{mean}$ and n = 10. $P_{max}$ represents the target peak value, $P_{peak}$ represents the peak truth counter, $N_{peak}$ represents the peak false value counter, and $FLG_{peak}$ represents the peak detection result. If $P_{max} > NT_{peak}$, then $P_{peak} = P_{peak} + 1$, and when $P_{peak} > 10$, $FLG_{peak} = 1$, indicating that the peak detection result is true; otherwise, $N_{peak} = N_{peak} + 1$, and when $N_{peak} > 10$, $FLG_{peak} = 0$ and $P_{peak} = 0$.
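The counter-based decision in the example above can be sketched as a small stateful object. This is an illustrative Python sketch only: the class name `PeakDetector` and its attribute names are hypothetical, and because the text does not say whether the false value counter is ever reset, the sketch leaves it untouched.

```python
class PeakDetector:
    """Counter-based peak decision following the example in [0052]."""

    def __init__(self, noise_param=1.5, count_limit=10):
        self.noise_param = noise_param  # a
        self.count_limit = count_limit  # preset count value (10 in the example)
        self.true_count = 0             # P_peak, initial value 0
        self.false_count = 0            # N_peak, initial value 0
        self.flag = False               # FLG_peak

    def update(self, target_peak, peak_mean):
        """Feed one (P_max, P_mean) pair and return the current FLG_peak."""
        threshold = self.noise_param * peak_mean  # NT_peak = a * P_mean
        if target_peak > threshold:
            self.true_count += 1
            if self.true_count > self.count_limit:
                self.flag = True
        else:
            self.false_count += 1
            if self.false_count > self.count_limit:
                self.flag = False
                self.true_count = 0  # P_peak = 0, as stated in the example
        return self.flag


detector = PeakDetector()
flags = [detector.update(10.0, 4.0) for _ in range(11)]
```

With these inputs the threshold is 6.0, so every call counts as a hit, and the flag only becomes true on the eleventh call, when the truth counter first exceeds 10. Requiring several consecutive decisions before switching keeps a single loud frame from flipping the result.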
[0053] In some embodiments, obtaining the target peak value and the average peak value based on the peak values corresponding to each of the m sound frame signals includes:
[0054] S1221: Smooth the peak values corresponding to each of the m sound frame signals to obtain the smoothed peak values corresponding to each of the m sound frame signals.
[0055] S1222: Select the smooth peak value with the largest value as the target peak value.
[0056] S1223: Calculate the mean value of the peak values by performing a mean calculation on the smoothed peak values corresponding to each of the m sound frame signals.
[0057] In this embodiment, before obtaining the target peak value and the average peak value, the terminal can smooth the peak values corresponding to each of the m sound frame signals so that the difference between each peak value is within a reasonable range.
[0058] Specifically, a smoothing coefficient of b can be preset. The terminal sets the smoothed peak value of the first frame of m audio frames to the peak value of the first frame. Starting from the second frame, if the peak value of the current frame is greater than the peak value of the previous frame, then the smoothed peak value of the current frame is equal to the peak value of the current frame; if the peak value of the current frame is less than or equal to the peak value of the previous frame, then the smoothed peak value of the current frame is equal to b * the smoothed peak value of the previous frame. After obtaining the smoothed peak values corresponding to each of the m audio frame signals, the smoothed peak value with the largest value is selected as the target peak value. At the same time, the average value of the m smoothed peak values is calculated to obtain the peak mean.
[0059] For example, assuming m = 10 and the smoothing coefficient = 31 / 32, the terminal can smooth the peak values of these 10 frames. The peak value of the first frame is directly used as the smoothed peak value of the first frame. Starting from the second frame, if the peak value of the current frame is greater than or equal to the peak value of the previous frame, then the smoothed peak value of the current frame = the peak value of the current frame; if the peak value of the current frame is less than the peak value of the previous frame, then the smoothed peak value of the current frame = 31 / 32 * the smoothed peak value of the previous frame. This smoothing process is repeated for each of the 10 frames to obtain the smoothed peak value corresponding to each of the 10 audio frames. Then, these 10 smoothed peak values are sorted and their average is calculated to obtain the largest smoothed peak value $P_{max}$ and the peak mean $P_{mean}$ of the 10 frames.
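The smoothing rule of this example can be sketched as follows. This is a minimal Python sketch, assuming the "greater than or equal to" comparison used in the example (paragraph [0058] states the comparison as strictly greater); the function names `smooth_peaks` and `target_and_mean` are hypothetical.

```python
def smooth_peaks(peaks, b=31 / 32):
    """Smooth per-frame peak values with smoothing coefficient b.

    The first frame is copied as is. A later frame keeps its own peak when
    that peak is >= the previous frame's peak; otherwise it takes
    b * (smoothed peak of the previous frame).
    """
    if not peaks:
        return []
    smoothed = [peaks[0]]
    for i in range(1, len(peaks)):
        if peaks[i] >= peaks[i - 1]:
            smoothed.append(peaks[i])
        else:
            smoothed.append(b * smoothed[-1])
    return smoothed


def target_and_mean(smoothed):
    """Return (P_max, P_mean) for a non-empty list of smoothed values."""
    return max(smoothed), sum(smoothed) / len(smoothed)


smoothed = smooth_peaks([32.0, 16.0, 8.0, 64.0])
p_max, p_mean = target_and_mean(smoothed)
# smoothed is [32.0, 31.0, 30.03125, 64.0]
```

In this run the two falling frames decay slowly from 32.0 instead of dropping to 16.0 and 8.0, which keeps the smoothed values within a narrower range.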
[0060] In some embodiments, after obtaining a peak noise threshold based on the peak mean and comparing the target peak with the peak noise threshold to obtain a peak detection result, the method further includes:
[0061] S124: Update the peak noise threshold based on the peak detection results.
[0062] Optionally, if the peak detection result is false, the peak noise threshold can be updated based on the peak mean and the current peak noise threshold. The peak noise threshold can be calculated using the following formula:
[0063] $NT_{peak}' = 10\% \cdot a \cdot P_{mean} + 90\% \cdot NT_{peak}$
[0064] where $NT_{peak}'$ represents the updated peak noise threshold, $a$ represents the preset noise parameter, $P_{mean}$ represents the peak mean, and $NT_{peak}$ represents the current peak noise threshold.
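The threshold update is a weighted blend of the freshly computed threshold and the current one. A brief Python sketch follows; the function name `update_noise_threshold` and the `weight` parameter are hypothetical, with the default of 0.10 taken from the 10% / 90% split of the formula.

```python
def update_noise_threshold(current, mean, noise_param=1.5, weight=0.10):
    """Blend a * mean into the current threshold.

    With weight = 0.10 this is
    NT' = 10% * a * P_mean + 90% * NT.
    """
    return weight * noise_param * mean + (1.0 - weight) * current


# Starting from NT_peak = 10.0 with P_mean = 4.0 and a = 1.5, the fresh
# threshold would be 6.0; one update moves NT_peak a tenth of the way there.
updated = update_noise_threshold(10.0, 4.0)
```

Because only a small fraction of the new estimate is mixed in on each update, the threshold follows slow changes in the noise level without jumping on a single batch of frames.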
[0065] In some embodiments, the variance detection based on the variance value to obtain the variance detection result includes:
[0066] S131: Select m sound frame signals from n sound frame signals.
[0067] S132: Obtain the target variance value and the mean variance value based on the variance values corresponding to each of the m sound frame signals.
[0068] S133: Obtain the variance noise threshold based on the mean variance value, and compare the target variance value with the variance noise threshold to obtain the variance detection result.
[0069] For example, assuming m is 10, the terminal selects 10 sound frames from n sound frames. First, the variance of each of these 10 sound frames can be calculated. Then, these 10 variance values can be sorted, and the largest variance value can be selected as the target variance value. Simultaneously, the mean of these 10 variance values can be calculated; that is, the average of the 10 variance values is obtained.
[0070] Optionally, the variance noise threshold can be obtained from the mean variance value using the following formula:
[0071] $NT_{var} = a \cdot V_{mean}$
[0072] where $NT_{var}$ represents the variance noise threshold, $V_{mean}$ represents the mean variance value, and $a$ represents the preset noise parameter. For example, the value of $a$ can be 1.5, or another value such as 2 or 3.
[0073] Optionally, the method for comparing the target variance value with the variance noise threshold to obtain the variance detection result can be as follows: if the target variance value > the variance noise threshold, the terminal can increment the count value of the variance truth counter by 1 (the initial value of the variance truth counter is 0), and when the variance truth counter > the preset variance truth count value, the variance detection result is determined to be true; if the target variance value ≤ the variance noise threshold, the terminal can increment the count value of the variance false counter by 1 (the initial value of the variance false counter is 0), and when the variance false counter > the preset variance false count value, the variance detection result is determined to be false.
[0074] For example, suppose $NT_{var} = 1.5 V_{mean}$ and n = 10. $V_{max}$ represents the target variance value, $P_{var}$ represents the variance truth counter, $N_{var}$ represents the variance false value counter, and $FLG_{var}$ represents the variance detection result. If $V_{max} > NT_{var}$, then $P_{var} = P_{var} + 1$, and when $P_{var} > 10$, $FLG_{var} = 1$, indicating that the variance detection result is true; otherwise, $N_{var} = N_{var} + 1$, and when $N_{var} > 10$, $FLG_{var} = 0$ and $P_{var} = 0$. The initial values of $P_{var}$ and $N_{var}$ are both 0.
[0075] In some embodiments, obtaining the target variance value and the mean variance value based on the variance values corresponding to each of the m sound frame signals includes:
[0076] S1321: Smooth the variance values corresponding to each of the m sound frame signals to obtain the smoothed variance values corresponding to each of the m sound frame signals.
[0077] S1322: Select the smoothed variance value with the largest numerical value as the target variance value.
[0078] S1323: Calculate the mean of the variances by performing a mean calculation on the smoothed variances of each of the m sound frame signals.
[0079] In this embodiment, before obtaining the target variance value and the mean variance value, the terminal can smooth the variance values corresponding to each of the m audio frame signals so that the difference between the variance values is within a reasonable range.
[0080] Specifically, a smoothing coefficient of b can be preset. The terminal can default to the smoothing variance value of the starting frame of the m sound frames being equal to the variance value of the starting frame. Starting from the second frame, if the variance value of the current frame is greater than the smoothing variance value of the previous frame, then the smoothing variance value of the current frame is equal to the variance value of the current frame; if the variance value of the current frame is less than or equal to the smoothing variance value of the previous frame, then the smoothing variance value of the current frame is equal to b * the smoothing variance value of the previous frame. After obtaining the smoothing variance values corresponding to the m sound frame signals, the smoothing variance value with the largest value is selected as the target variance value. At the same time, the mean of the m smoothing variance values is calculated to obtain the mean variance value.
[0081] For example, assuming m = 10 and the smoothing coefficient = 31 / 32, the terminal can smooth the variance values of these 10 frames. The variance value of the first frame can be directly used as the smoothed variance value of the first frame. Starting from the second frame, if the variance value of the current frame is greater than or equal to the smoothed variance value of the previous frame, then the smoothed variance value of the current frame = the variance value of the current frame; if the variance value of the current frame is less than the smoothed variance value of the previous frame, then the smoothed variance value of the current frame = 31 / 32 * the smoothed variance value of the previous frame. This smoothing process is repeated for each of the 10 frames to obtain the smoothed variance value corresponding to each of the 10 audio frames. Then, these 10 smoothed variance values are sorted and their mean is calculated to obtain the largest smoothed variance value $V_{max}$ and the mean variance value $V_{mean}$ of the 10 frames.
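Unlike the peak case, the variance rule compares each frame against the smoothed value of the previous frame, so the smoothed sequence behaves like a slowly decaying envelope. A short Python sketch of that reading follows; the function name `smooth_variances` is hypothetical, and the "greater than or equal to" comparison is taken from this example.

```python
def smooth_variances(variances, b=31 / 32):
    """Smooth per-frame variance values with smoothing coefficient b.

    The first frame is copied as is. A later frame keeps its own variance
    when it is >= the previous smoothed value; otherwise it takes
    b * (previous smoothed value).
    """
    smoothed = []
    for value in variances:
        if not smoothed or value >= smoothed[-1]:
            smoothed.append(value)
        else:
            smoothed.append(b * smoothed[-1])
    return smoothed


smoothed = smooth_variances([32.0, 16.0, 20.0, 8.0])
v_max = max(smoothed)
v_mean = sum(smoothed) / len(smoothed)
# smoothed is [32.0, 31.0, 30.03125, 29.0927734375]
```

Here the third frame (20.0) exceeds the raw value before it but not the smoothed value 31.0, so the envelope keeps decaying rather than following the small rise.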
[0082] In some embodiments, after obtaining a variance noise threshold based on the mean variance value and comparing the target variance value with the variance noise threshold to obtain a variance detection result, the method further includes:
[0083] S134: Update the variance noise threshold based on the variance detection results.
[0084] Optionally, if the variance detection result is false, the variance noise threshold can be updated based on the mean variance value and the current variance noise threshold. The variance noise threshold can be calculated using the following formula:
[0085] $NT_{var}' = 15\% \cdot a \cdot V_{mean} + 85\% \cdot NT_{var}$
[0086] where $NT_{var}'$ represents the updated variance noise threshold, $a$ represents the preset noise parameter, $V_{mean}$ represents the mean variance value, and $NT_{var}$ represents the current variance noise threshold.
[0087] As can be seen, by implementing the sound detection method provided in this embodiment of the invention, the terminal can divide the sound signal into n sound frame signals, calculate the variance value corresponding to each of the n sound frame signals and find the peak value, perform peak detection based on the peak value to obtain the peak detection result, perform variance detection based on the variance value to obtain the variance detection result, and obtain the speech detection result based on the peak detection result and the variance detection result. In this way, it is possible to determine whether the current noisy signal has a speech signal based on the variance and peak value of the sound frame, so that the terminal can determine whether to enable subsequent speech processing functions such as keyword recognition based on the speech detection result. The complexity of this speech detection method is low and can effectively reduce the power consumption of the terminal.
[0088] Referring to Figure 2, another speech detection method provided by an embodiment of the present invention adds Gaussian white noise detection and burst sound detection to the embodiment shown in Figure 1, which can further improve the accuracy of speech detection. The method includes:
[0089] S21: Divide the sound signal into n sound frame signals, calculate the variance value of each of the n sound frame signals and find the peak value.
[0090] S22: Perform peak detection based on the peak value to obtain the peak detection result.
[0091] S23: Perform variance detection based on the variance value to obtain the variance detection result.
[0092] It should be noted that, for steps S21-S23, reference can be made to the detailed explanations of steps S11-S13 shown in Figure 1, which are not repeated here.
[0093] S24: Perform burst sound detection based on the variance value to obtain the burst sound detection result.
[0094] In this step, burst sound detection can refer to detecting sounds such as those produced by tapping a keyboard, clicking a mouse, or knocking on a table or chair. These burst sounds are typically short in duration. The terminal can obtain the variance values of s audio frames from n audio frames for burst sound detection. It then determines whether the variance values of these s audio frames satisfy a burst sound condition set. If any one condition in the burst sound condition set is satisfied, the burst sound detection result is determined to be true; otherwise, if none of the conditions in the burst sound condition set are satisfied, the burst sound detection result is determined to be false.
[0095] S25: Gaussian white noise detection is performed based on the variance value to obtain the Gaussian white noise detection result.
[0096] In this step, the terminal can sequentially perform Gaussian white noise detection on the n audio frame signals to obtain the Gaussian white noise detection result.
[0097] S26: Obtain the speech detection results based on the peak detection results, variance detection results, burst sound detection results, and Gaussian white noise detection results.
[0098] In this step, the speech detection results can be obtained from four dimensions: peak detection results, variance detection results, burst sound detection results, and Gaussian white noise detection results. The multi-dimensional judgment ensures the accuracy of the speech detection results to a certain extent.
[0099] In some embodiments, the burst sound detection based on variance values to obtain burst sound detection results includes:
[0100] S241: Select s sound frame signals from n sound frame signals.
[0101] S242: Determine whether the variance values corresponding to each of the s sound frame signals meet the preset set of burst sound detection conditions.
[0102] S243: If the variance values corresponding to each of the s sound frame signals satisfy any one of the preset burst sound detection conditions, then the burst sound detection result is determined to be true.
[0103] S244: If the variance values corresponding to each of the s sound frame signals do not meet the preset set of sudden sound detection conditions, then the sudden sound detection result is determined to be false.
[0104] Optionally, the value of s is less than or equal to the value of n, and less than or equal to the value of m. For example, when n = 20, m can be 10, and s can be 5.
[0105] Optionally, the preset set of burst sound detection conditions can be set based on the variance values of the selected s sound frames.
[0106] For example, suppose the variance values of the first 5 sound frames out of the n sound frames are selected as the burst sound detection sample, denoted TICK_reg, where TICK_reg = {S_var(n-4), S_var(n-3), S_var(n-2), S_var(n-1), S_var(n)}. The preset set of burst sound detection conditions can include the following four:
[0107] ① S_var(n-1) > 4*S_var(n) & S_var(n-1) > 4*S_var(n-2)
[0108] ② S_var(n-1) > 4*S_var(n) & S_var(n-1) > 4*S_var(n-3)
[0109] ③ S_var(n-2) > 4*S_var(n) & S_var(n-2) > 4*S_var(n-3)
[0110] ④ S_var(n-2) > 4*S_var(n) & S_var(n-2) > 4*S_var(n-4)
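The four conditions above can be sketched as a single check. This is a minimal illustration rather than the patented implementation: the function name `burst_condition_met`, the `ratio` parameter (set to the factor 4 from the conditions), and the oldest-to-newest ordering of the input are assumptions made for the example.

```python
from typing import Sequence


def burst_condition_met(tick_reg: Sequence[float], ratio: float = 4.0) -> bool:
    """Return True if any of the four burst sound conditions holds.

    tick_reg holds five variance values ordered oldest to newest:
    [S_var(n-4), S_var(n-3), S_var(n-2), S_var(n-1), S_var(n)].
    The name and the ordering are assumptions made for this sketch.
    """
    if len(tick_reg) != 5:
        raise ValueError("tick_reg must contain exactly 5 variance values")
    v4, v3, v2, v1, v0 = tick_reg  # v_k stands for S_var(n-k)
    return (
        (v1 > ratio * v0 and v1 > ratio * v2)     # condition 1
        or (v1 > ratio * v0 and v1 > ratio * v3)  # condition 2
        or (v2 > ratio * v0 and v2 > ratio * v3)  # condition 3
        or (v2 > ratio * v0 and v2 > ratio * v4)  # condition 4
    )
```

For instance, a single loud frame surrounded by quiet ones, such as `[1, 1, 1, 10, 1]`, satisfies condition ①, whereas a steady level such as `[10, 10, 10, 10, 10]` satisfies none of the conditions.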
[0111] If TICK_reg satisfies any one of the burst sound detection conditions, the terminal determines the burst sound detection result FLG_tick to be true, i.e., FLG_tick = 1. The burst sound detection sample TICK_reg can then be updated: the variance value of the 6th frame is added, and the variance value of the 1st frame is removed. Each time the sample is updated (i.e., for each new sound frame signal that enters the burst sound detection sample), the preset burst sound count value is decremented by 1 (the initial value of the burst sound count value can be preset to 20). When the preset burst sound count value reaches 0, FLG_tick = 1 is output.
[0112] If TICK_reg does not satisfy the set of burst sound detection conditions, the terminal initially determines the burst sound detection result FLG_tick to be false, i.e., FLG_tick = 0. The burst sound detection sample TICK_reg can then be updated: the variance value of the 6th frame is added, and the variance value of the 1st frame is removed. Each time the sample is updated, the terminal can perform burst sound detection based on the updated burst sound detection sample. If the burst sound detection condition set is still not met, the preset burst sound count value is decremented by 1, and the sample is updated again. When the preset burst sound count value reaches 0, FLG_tick = 0 is output.
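The sample update described above (the newest variance enters, the oldest leaves, and the count value drops by 1) can be sketched with a fixed-length queue. The class name `BurstWindow` and its attributes are hypothetical, and the choice to decrement the count only once the window already holds 5 values is an assumption of this sketch, not something the description states.

```python
from collections import deque
from typing import List


class BurstWindow:
    """Sliding burst sound detection sample with a countdown (illustrative only)."""

    def __init__(self, size: int = 5, initial_count: int = 20) -> None:
        # deque(maxlen=size) discards the oldest entry automatically on append.
        self.frames: deque = deque(maxlen=size)
        self.count = initial_count  # preset burst sound count value

    def push(self, variance: float) -> List[float]:
        """Add the newest variance value and return the current sample."""
        was_full = len(self.frames) == self.frames.maxlen
        self.frames.append(variance)
        # Assumption: only a true update (one value in, one value out)
        # decrements the count, and the count never goes below 0.
        if was_full and self.count > 0:
            self.count -= 1
        return list(self.frames)
```

After five values have been pre-stored, pushing a sixth value returns the sample made of frames 2 to 6 and leaves `count` at 19.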
[0113] In some embodiments, the Gaussian white noise detection based on the variance value to obtain the Gaussian white noise detection result includes:
[0114] S251: Compare the variance value with the variance noise threshold to obtain the Gaussian white noise detection result.
[0115] For example, if the variance of the current frame is less than the variance noise threshold and the peak detection result is true, then the Gaussian white noise detection result of the current frame is true. If the Gaussian white noise detection results of more than q sound frames are all true, then the Gaussian white noise detection result of the sound signal is confirmed to be true; otherwise, the Gaussian white noise detection result of the sound signal is confirmed to be false. Here, the value of q is less than the value of n.
[0116] In some embodiments, obtaining the speech detection result based on the peak detection result, variance detection result, burst sound detection result, and Gaussian white noise detection result includes:
[0117] S261: If the first preset condition is met, then the speech detection result is determined to be the variance detection result, and the first preset condition is that the Gaussian white noise detection result is true;
[0118] S262: If the second preset condition is met, the speech detection result is determined to be the peak detection result. The second preset condition is that the sudden sound detection result is true and the Gaussian white noise detection result is false.
[0119] S263: If neither the first preset condition nor the second preset condition is met, then the speech detection result is determined to be the peak detection result.
[0120] In this embodiment, the speech detection result is obtained by comprehensively considering the peak detection result, variance detection result, burst sound detection result, and Gaussian white noise detection result. If the burst sound detection result is true and the Gaussian white noise detection result is false, then the speech detection result is the peak detection result; that is, if the peak detection result is true, then the speech detection result is true, and if the peak detection result is false, then the speech detection result is false. If the Gaussian white noise detection result is true, then the speech detection result is the variance detection result; that is, if the variance detection result is true, then the speech detection result is true, and if the variance detection result is false, then the speech detection result is false. In any other case, the speech detection result is the peak detection result.
[0121] For example, please refer to Figure 3, which is a schematic diagram of a speech detection process provided by an embodiment of the present invention. The speech detection process shown in Figure 3 includes the following steps:
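The decision rules S261 to S263 can be written as a small function. This is a sketch under the assumption that the four detection results are available as booleans; the function and parameter names are hypothetical.

```python
def speech_decision(flg_peak: bool, flg_var: bool,
                    flg_tick: bool, flg_white: bool) -> bool:
    """Combine the four detection results into the speech detection result."""
    if flg_white:
        # First preset condition (S261): Gaussian white noise detected,
        # so the variance detection result is used.
        return flg_var
    if flg_tick and not flg_white:
        # Second preset condition (S262): burst sound detected and no
        # Gaussian white noise, so the peak detection result is used.
        return flg_peak
    # Neither condition met (S263): the peak detection result is used.
    return flg_peak
```

Under the rules as stated, the second preset condition and the fallback both return the peak detection result, so in this sketch the output is determined by the Gaussian white noise result together with the variance and peak results.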
[0122] Step 1: Receive the sound signal and perform frame segmentation on the sound signal to obtain n sound frames S(n).
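Frame segmentation can be illustrated as follows. The description does not give a frame length or say whether frames overlap, so this sketch assumes fixed-length, non-overlapping frames and drops any trailing partial frame; `split_into_frames` and `frame_len` are hypothetical names.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split a sampled signal into consecutive, non-overlapping frames.

    Assumption of this sketch: samples that do not fill a whole
    frame at the end of the signal are discarded.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n = len(samples) // frame_len  # number of complete frames
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n)]
```

With 10 samples and `frame_len=4`, for example, this yields two frames and discards the last two samples.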
[0123] Step 2: Variance Calculation. Calculate the variance for each sound frame. The terminal can default the mean used in the variance calculation to 0 for the first frame. Starting from the second frame, the mean of the previous frame is used as the mean when calculating the variance of the current frame, thus obtaining the variance S_var(n) of each frame.
[0124] Step 3: Peak Search. Perform a peak search for each sound frame and take the maximum value of each frame as its peak value S_max(n).
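One possible reading of this step is that the variance of each frame is computed around the mean of the previous frame (0 for the first frame), which avoids a second pass over the current frame. The sketch below follows that reading; the interpretation, the division by the frame length, and the name `frame_variances` are assumptions, not details confirmed by the description.

```python
from typing import List, Sequence


def frame_variances(frames: Sequence[Sequence[float]]) -> List[float]:
    """Per-frame variance using the previous frame's mean as the reference."""
    variances = []
    prev_mean = 0.0  # the first frame uses a mean of 0
    for frame in frames:
        if len(frame) == 0:
            raise ValueError("frames must not be empty")
        # Mean squared deviation from the previous frame's mean (assumed form).
        var = sum((x - prev_mean) ** 2 for x in frame) / len(frame)
        variances.append(var)
        prev_mean = sum(frame) / len(frame)  # reference for the next frame
    return variances
```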
[0125] Step 4: Variance smoothing and sorting. The terminal can preset the smoothing coefficient b = 31/32, select the variances corresponding to m sound frames from the n sound frames, and smooth the variances of the m sound frames. The terminal can record the smoothed variance value of the first of the m sound frames as V(1); by default, V(1) is the variance value of the first frame. Starting from the second frame, if the variance value S_var(n) of the current frame is ≥ the smoothed variance V(n-1) of the previous frame, then V(n) = S_var(n); if S_var(n) < V(n-1), then V(n) = 31/32 * V(n-1). The smoothed variance values of these m frames are sorted and averaged to obtain the maximum smoothed variance V_max (i.e., the target variance value) and the variance mean V_mean among the 10 frames. When the signal has fewer than 10 frames, V_max and V_mean can be calculated based on the actual number of frames.
[0126] Step 5: Peak Smoothing and Sorting. Similar to the variance smoothing process, the terminal can record the smoothed peak value of the first of the m sound frames as P(1); by default, P(1) is the peak value of the first frame. Starting from the second frame, if the peak value S_max(n) of the current frame is ≥ the smoothed peak value P(n-1) of the previous frame, then P(n) = S_max(n); if S_max(n) < P(n-1), then P(n) = 31/32 * P(n-1). The smoothed peak values of these m frames are sorted and averaged to obtain the largest smoothed peak value P_max (i.e., the target peak value) and the peak mean P_mean among the 10 frames. When the signal has fewer than 10 frames, P_max and P_mean can be calculated based on the actual number of frames.
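The smoothing rule in this step rises immediately and decays slowly: a larger value is taken over at once, while a smaller one only lets the smoothed value shrink by the factor 31/32. A minimal sketch follows, with hypothetical function names and with `max()` used in place of an explicit sort to pick the largest smoothed value.

```python
from typing import List, Sequence, Tuple


def smooth_track(values: Sequence[float], decay: float = 31 / 32) -> List[float]:
    """Apply the rise-immediately, decay-slowly smoothing to a sequence."""
    if not values:
        raise ValueError("values must not be empty")
    smoothed = [values[0]]  # the first smoothed value is the first raw value
    for x in values[1:]:
        prev = smoothed[-1]
        smoothed.append(x if x >= prev else decay * prev)
    return smoothed


def target_and_mean(values: Sequence[float], decay: float = 31 / 32) -> Tuple[float, float]:
    """Return (largest smoothed value, mean of the smoothed values)."""
    s = smooth_track(values, decay)
    return max(s), sum(s) / len(s)
```

For the raw values `[32.0, 0.0, 64.0]`, the smoothed sequence is `[32.0, 31.0, 64.0]`, so the target value is 64.0. The same two functions would apply unchanged to the per-frame peak values.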
[0127] Step 6: Burst Sound Detection. The terminal can pre-store the variance values corresponding to 5 sound frame signals to form a burst sound detection sample TICK_reg, where TICK_reg = {S_var(n-4), S_var(n-3), S_var(n-2), S_var(n-1), S_var(n)}. If TICK_reg satisfies any one of the burst sound detection conditions, then FLG_tick = 1, and the burst sound detection sample TICK_reg can then be updated: the variance value of the 6th frame is added, and the variance value of the 1st frame is removed. Each time the sample is updated, that is, each time a new sound frame signal enters the burst sound detection sample, the preset burst sound count value CNT_tick is decremented by 1 (the initial value of CNT_tick can be preset to 20). When CNT_tick = 0, FLG_tick = 0 is output. The preset set of burst sound detection conditions may include the following four: ① S_var(n-1) > 4*S_var(n) & S_var(n-1) > 4*S_var(n-2); ② S_var(n-1) > 4*S_var(n) & S_var(n-1) > 4*S_var(n-3); ③ S_var(n-2) > 4*S_var(n) & S_var(n-2) > 4*S_var(n-3); ④ S_var(n-2) > 4*S_var(n) & S_var(n-2) > 4*S_var(n-4).
[0128] Step 7: Variance detection and adaptive update of variance noise threshold. The terminal can preset the noise parameter a = 1.5 and initialize the variance noise threshold NT_var = 1.5 * V_mean for the first 10 frames. The decision can be based on a comparison with the variance noise threshold: if V_max > NT_var, the variance true-value counter P_var = P_var + 1, and when P_var > 10, the variance detection result FLG_var = 1, indicating that the variance detection result is true; otherwise, the variance false-value counter N_var = N_var + 1, and when N_var > 10, FLG_var = 0 and P_var = 0. The initial values of P_var and N_var are both 0. When FLG_var = 0, the variance noise threshold can be updated using the following formula: NT_var' = 15% * 1.5 * V_mean + 85% * NT_var, where NT_var' indicates the updated variance noise threshold and NT_var indicates the current variance noise threshold.
[0129] Step 8: Peak detection and adaptive update of peak noise threshold. Initialize the peak noise threshold NT_peak = 1.5 * P_mean for the first 10 frames. The decision can be based on a comparison with the peak noise threshold: if P_max > NT_peak, the peak true-value counter P_peak = P_peak + 1, and when P_peak > 10, FLG_peak = 1, indicating that the peak detection result is true; otherwise, the peak false-value counter N_peak = N_peak + 1, and when N_peak > 10, FLG_peak = 0 and P_peak = 0. The initial values of P_peak and N_peak are both 0. When FLG_peak = 0, the peak noise threshold can be updated using the following formula: NT_peak' = 10% * 1.5 * P_mean + 90% * NT_peak, where NT_peak' indicates the updated peak noise threshold and NT_peak indicates the current peak noise threshold.
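The counter logic and threshold update of this step can be sketched as a small stateful detector. The class name `ThresholdDetector` and its attribute names are hypothetical. Two points are assumptions of the sketch: the threshold is initialized from the mean passed to the first call, and the false-value counter is never reset, since the description does not mention a reset.

```python
from typing import Optional


class ThresholdDetector:
    """Counter-based detection with an adaptive noise threshold (sketch)."""

    def __init__(self, scale: float = 1.5, new_weight: float = 0.15,
                 hold: int = 10) -> None:
        self.scale = scale            # noise parameter a
        self.new_weight = new_weight  # 15% weight on the new estimate
        self.hold = hold              # a counter must exceed this value
        self.threshold: Optional[float] = None  # NT_var
        self.true_count = 0           # P_var
        self.false_count = 0          # N_var
        self.flag = False             # FLG_var

    def update(self, target: float, mean: float) -> bool:
        """Feed V_max and V_mean for the current window; return FLG_var."""
        if self.threshold is None:
            self.threshold = self.scale * mean  # assumed initialization
        if target > self.threshold:
            self.true_count += 1
            if self.true_count > self.hold:
                self.flag = True
        else:
            self.false_count += 1
            if self.false_count > self.hold:
                self.flag = False
                self.true_count = 0
        if not self.flag:
            # The threshold adapts only while the result is false.
            self.threshold = (self.new_weight * self.scale * mean
                              + (1.0 - self.new_weight) * self.threshold)
        return self.flag
```

In this sketch the flag turns true only on the 11th consecutive window whose target value exceeds the threshold, which mirrors the `P_var > 10` condition.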
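The variance and peak threshold updates share one form, which can be written as a weighted average of the new noise estimate and the current threshold. The symbols λ and X_mean are introduced here only for illustration: λ = 15% with X_mean = V_mean for the variance threshold, and λ = 10% with X_mean = P_mean for the peak threshold, with a = 1.5 in both cases.

```latex
NT' = \lambda \, a \, X_{mean} + (1 - \lambda) \, NT

% Subtracting a X_{mean} from both sides:
NT' - a X_{mean} = (1 - \lambda) \left( NT - a X_{mean} \right)

% Hence, if X_{mean} stays constant over k successive updates:
NT_k - a X_{mean} = (1 - \lambda)^k \left( NT_0 - a X_{mean} \right)
```

Read this way, each update moves the threshold a fraction λ of the remaining distance toward a · X_mean, so under a constant noise level the threshold approaches 1.5 times the mean geometrically, and the peak threshold (λ = 10%) adapts more slowly than the variance threshold (λ = 15%).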
[0130] Step 9: Gaussian white noise detection. If (FLG_peak = 1) & (S_var(n) < NT_var) is satisfied, the Gaussian white noise count value W_cnt = W_cnt + 1; if W_cnt is greater than 60, FLG_white = 1 is output. If (FLG_peak = 1) & (S_var(n) < NT_var) is not satisfied, FLG_white = 0 is output.
[0131] Step 10: Voice detection decision. If (FLG_peak = 1) & (S_var(n) > NT_var), the voice decision result FLG_vad = FLG_peak. If FLG_white = 1, FLG_vad = FLG_var. In all other cases besides the two mentioned above, FLG_vad = FLG_peak.
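The counting rule of this step can be sketched as follows. The description does not say what is output while the condition holds but the count is still at or below 60, nor whether the count is reset when the condition fails; this sketch assumes an output of 0 in the first case and leaves the count unchanged in the second. The class and parameter names are hypothetical.

```python
class WhiteNoiseDetector:
    """Frame-by-frame Gaussian white noise flag based on a counter (sketch)."""

    def __init__(self, count_limit: int = 60) -> None:
        self.count_limit = count_limit
        self.count = 0  # W_cnt

    def update(self, flg_peak: bool, s_var: float, nt_var: float) -> bool:
        """Return FLG_white for the current frame."""
        if flg_peak and s_var < nt_var:
            self.count += 1
            # Assumption: the flag stays false until the count exceeds the limit.
            return self.count > self.count_limit
        # Condition not satisfied: output 0 (count left unchanged by assumption).
        return False
```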
[0132] As can be seen, the speech detection method provided in this embodiment of the invention can perform speech detection based on the peak value and variance of the time-domain sound frame. It has low complexity and low power consumption requirements for the terminal, effectively saving the power consumption of the terminal. At the same time, the speech detection method has the functions of filtering sudden sounds such as keyboard tapping and Gaussian white noise, so that the terminal can more accurately identify whether there is a speech signal in noisy sounds, thus improving the accuracy of speech detection.
[0133] It should be noted that in the above embodiments, the steps do not necessarily follow a fixed order. Those skilled in the art can understand from the description of the embodiments of the present invention that the above steps may have different execution orders in different embodiments; that is, they may be executed in parallel or in turn, etc.
[0134] The following describes a terminal provided by an embodiment of the present invention. Please refer to Figure 4, which is a schematic diagram of the structure of a terminal provided in an embodiment of the present invention. As shown in Figure 4, terminal 400 includes one or more processors 41 and a memory 42. Figure 4 takes one processor 41 as an example.
[0135] Processor 41 and memory 42 can be connected via a bus or other means. Figure 4 takes connection via a bus as an example.
[0136] The memory 42, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions / modules corresponding to the speech detection method in the embodiments of the present invention. The processor 41 implements the function of the speech detection method provided in the above method embodiments by running the non-volatile software programs, instructions, and modules stored in the memory 42.
[0137] Memory 42 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 42 may optionally include memory remotely located relative to processor 41, which can be connected to processor 41 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0138] The program instructions / modules are stored in the memory 42 and, when executed by one or more processors 41, perform the speech detection method in any of the above method embodiments.
[0139] This invention also provides a storage medium storing computer-executable instructions which, when executed by one or more processors, for example one processor 41 in Figure 4, enable the one or more processors to execute the speech detection method in any of the above method embodiments.
[0140] This invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium. The computer program includes program instructions, which, when executed by a terminal, cause the terminal to perform any of the speech detection methods described above.
[0141] The device or equipment embodiments described above are merely illustrative. The unit modules described as separate components may or may not be physically separate. The components shown as module units may or may not be physical units; that is, they may be located in one place or distributed across multiple network module units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0142] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software plus a general-purpose hardware platform, or of course, using hardware. Based on this understanding, the above technical solutions, in essence or the parts that contribute to the related technology, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0143] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; under the concept of the present invention, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the present invention as described above, which are not provided in detail for the sake of brevity; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A speech detection method, characterized in that it comprises: dividing a sound signal into frames to obtain n sound frame signals; calculating the variance value and finding the peak value corresponding to each of the n sound frame signals; performing peak detection based on the peak value to obtain a peak detection result; performing variance detection based on the variance value to obtain a variance detection result; performing sudden sound detection based on the variance value to obtain a sudden sound detection result; performing Gaussian white noise detection based on the variance value to obtain a Gaussian white noise detection result; if a first preset condition is met, determining the speech detection result to be the variance detection result, wherein the first preset condition is that the Gaussian white noise detection result is true; if a second preset condition is met, determining the speech detection result to be the peak detection result, wherein the second preset condition is that the sudden sound detection result is true and the Gaussian white noise detection result is false; and if neither the first preset condition nor the second preset condition is met, determining the speech detection result to be the peak detection result; wherein the speech detection result is used to indicate whether the sound signal is speech.
2. The method as described in claim 1, characterized in that the step of performing peak detection based on the peak value to obtain the peak detection result includes: selecting m sound frame signals from the n sound frame signals; obtaining the target peak value and the average peak value based on the peak values corresponding to the m sound frame signals; and obtaining the peak noise threshold based on the average peak value, and comparing the target peak value with the peak noise threshold to obtain the peak detection result.
3. The method as described in claim 2, characterized in that the step of obtaining the target peak value and the average peak value based on the peak values corresponding to the m sound frame signals includes: smoothing the peak values corresponding to each of the m sound frame signals to obtain the smoothed peak values corresponding to each of the m sound frame signals; selecting the smoothed peak value with the largest value as the target peak value; and calculating the mean of the smoothed peak values corresponding to each of the m sound frame signals to obtain the average peak value.
4. The method as described in claim 2 or 3, characterized in that the method further includes: updating the peak noise threshold based on the peak detection result.
5. The method as described in claim 1, characterized in that the step of performing variance detection based on the variance value to obtain the variance detection result includes: selecting m sound frame signals from the n sound frame signals; obtaining the target variance value and the mean variance value based on the variance values corresponding to each of the m sound frame signals; and obtaining the variance noise threshold based on the mean variance value, and comparing the target variance value with the variance noise threshold to obtain the variance detection result.
6. The method as described in claim 5, characterized in that the step of obtaining the target variance value and the mean variance value based on the variance values corresponding to each of the m sound frame signals includes: smoothing the variance values corresponding to each of the m sound frame signals to obtain the smoothed variance values corresponding to each of the m sound frame signals; selecting the smoothed variance value with the largest value as the target variance value; and calculating the mean of the smoothed variance values corresponding to each of the m sound frame signals to obtain the mean variance value.
7. The method as described in claim 5 or 6, characterized in that the method further includes: updating the variance noise threshold based on the variance detection result.
8. The method as described in claim 1, characterized in that the step of performing sudden sound detection based on the variance value to obtain the sudden sound detection result includes: selecting s sound frame signals from the n sound frame signals; determining whether the variance values corresponding to the s sound frame signals meet the preset set of sudden sound detection conditions; if the variance values corresponding to the s sound frame signals satisfy any one of the preset sudden sound detection conditions, determining the sudden sound detection result to be true; and if the variance values corresponding to the s sound frame signals do not meet the preset set of sudden sound detection conditions, determining the sudden sound detection result to be false.
9. The method as described in claim 5, characterized in that the step of performing Gaussian white noise detection based on the variance value to obtain the Gaussian white noise detection result includes: comparing the variance value with the variance noise threshold to obtain the Gaussian white noise detection result.
10. A storage medium, characterized in that it stores computer-executable instructions for causing a terminal to perform the speech detection method as described in any one of claims 1 to 9.
11. A terminal, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech detection method as described in any one of claims 1 to 9.