[0032] The present invention is described in further detail below with reference to examples and specific implementations. This should not be understood as limiting the scope of the above subject matter of the present invention to the following examples; all technologies implemented on the basis of the content of the present invention fall within the scope of the present invention.
[0033] As shown in Figure 1, a broadband background noise and speech separation detection system comprises: a current-frame time-frequency-domain energy calculation circuit; a background noise calculation circuit, a time-domain speech detection long/short-term average energy comparison circuit, and a frequency-domain speech detection long/short-term frequency-domain energy comparison circuit, each connected to the current-frame time-frequency-domain energy calculation circuit; a noise comparison circuit connected to the background noise calculation circuit, the time-domain speech detection long/short-term average energy comparison circuit, and the frequency-domain speech detection long/short-term frequency-domain energy comparison circuit; a sub-band energy distribution uniformity speech detection circuit connected to both the time-domain speech detection long/short-term average energy comparison circuit and the frequency-domain speech detection long/short-term frequency-domain energy comparison circuit; and a speech frame count statistics circuit connected to the sub-band energy distribution uniformity speech detection circuit. The background noise calculation circuit is additionally connected to the sub-band energy distribution uniformity speech detection circuit, the speech frame count statistics circuit, the time-domain speech detection long/short-term average energy comparison circuit, and the frequency-domain speech detection long/short-term frequency-domain energy comparison circuit. The speech frame count statistics circuit consists of a time-width filter, which counts the number of speech frames. In this embodiment there is one time-width filter, and it is implemented as a speech frame counter.
[0034] As shown in Figure 2, a broadband background noise and speech separation detection method comprises the following eleven steps:
[0035] Step 1: Load the sound data and process it frame by frame. The sound data is time-domain speech data, and the frame duration is configurable, usually between 10 milliseconds and 50 milliseconds;
[0036] Step 2: Calculate the time-domain short-term energy and the time-domain long-term average energy. The time-domain short-term energy is the sum of the energy of the current frame of time-domain speech data. The time-domain long-term average energy is obtained by accumulating the time-domain short-term energy over multiple frames and dividing by the number of frames accumulated;
[0037] Step 3: Apply a fast Fourier transform (FFT) to the current frame of time-domain speech data, transforming it into frequency-domain sub-band speech data;
[0038] Step 4: Calculate the frequency-domain short-term energy and the frequency-domain long-term average energy. The frequency-domain short-term energy is obtained by accumulating, for the current frame of frequency-domain sub-band speech data, the sub-band energies within the frequency range in which the energy of the human voice is mainly distributed. The frequency-domain long-term average energy is obtained by accumulating the frequency-domain short-term energy over multiple frames and dividing by the number of frames accumulated;
[0039] Step 5: Accumulate the background noise. The time-domain short-term energy of non-speech frames is sent to the background noise estimation unit for accumulation, and a new background noise value is output each time the accumulation reaches a set number of frames;
[0040] Step 6: Compare the background noise with a set threshold 1. If the background noise is greater than threshold 1, proceed to step 7; if it is less than threshold 1, proceed to step 8;
[0041] Step 7: Perform frequency-domain speech detection, which compares the frequency-domain short-term energy with the frequency-domain long-term average energy. If the frequency-domain short-term energy exceeds the frequency-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. If the frame is speech, proceed to step 9; if it is non-speech, proceed to step 5 and step 11;
[0042] Step 8: Perform time-domain speech detection, which compares the time-domain short-term energy with the time-domain long-term average energy. If the time-domain short-term energy exceeds the time-domain long-term average energy by a certain margin, the frame is speech; otherwise it is non-speech. If the frame is speech, proceed to step 9; if it is non-speech, proceed to step 5 and step 11;
[0043] Step 9: Perform frequency-domain sub-band energy distribution uniformity detection. If the detected uniformity is high, the frame is speech; if it is low, the frame is non-speech. If the frame is speech, proceed to step 10; if it is non-speech, proceed to step 5 and step 11;
[0044] Step 10: The time-width filter counts the speech frames produced by step 9, that is, the number of consecutive frames in which the sound data is speech, and compares this count with a set threshold 2. If the count is greater than threshold 2, the frames are speech and processing proceeds directly to step 11; if the count is less than threshold 2, the frames are non-speech and processing proceeds to step 5 and step 11;
[0045] Step 11: Output the detection result; detection ends.
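The energy comparison used in steps 2, 7 and 8, together with the routing of step 6, can be sketched as follows. This is only an illustrative sketch: the names `LongShortComparator`, `history`, `ratio` and `select_detector` are hypothetical, the long-term average is assumed to cover the frames preceding the current one, and a background noise exactly equal to threshold 1 is assumed to take the time-domain path, since the steps above do not specify these details.

```python
from collections import deque


def short_term_energy(frame):
    """Time-domain short-term energy: sum of squared samples of one frame."""
    return sum(x * x for x in frame)


class LongShortComparator:
    """Compares a frame's short-term energy with a long-term average.

    `history` (frames averaged) and `ratio` (required margin) are assumed
    tuning parameters; the text only speaks of "a certain" margin.
    """

    def __init__(self, history=50, ratio=2.0):
        self.energies = deque(maxlen=history)
        self.ratio = ratio

    def is_speech(self, energy):
        # Long-term average = accumulated short-term energies / frame count.
        # Assumption: the average covers the frames before the current one.
        if self.energies:
            long_term = sum(self.energies) / len(self.energies)
        else:
            long_term = energy
        self.energies.append(energy)
        return energy > self.ratio * long_term


def select_detector(background_noise, threshold_1):
    """Step 6: high noise selects frequency-domain detection (step 7),
    otherwise time-domain detection (step 8). Equality is an assumption."""
    return "frequency" if background_noise > threshold_1 else "time"
```

The same comparator class can serve both step 7 and step 8; only the energy fed to it (frequency-domain or time-domain short-term energy) differs.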
[0046] Whenever any of steps 7 to 10 determines that the result is non-speech, step 5 is run on the non-speech data to generate the new background noise.
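The background noise accumulation of step 5 might look like the sketch below. The class name `BackgroundNoiseEstimator` and the parameter `frames_per_update` are hypothetical, and reporting the mean of the accumulated energies as the new background noise is an assumption; the description only states that a new value is output once a certain number of non-speech frames has been accumulated.

```python
class BackgroundNoiseEstimator:
    """Accumulates the short-term energy of non-speech frames (step 5)."""

    def __init__(self, frames_per_update=4, initial_noise=0.0):
        self.frames_per_update = frames_per_update  # assumed frame count
        self.noise = initial_noise                  # current estimate
        self._acc = 0.0
        self._count = 0

    def feed_non_speech(self, energy):
        """Add one non-speech frame and return the current noise estimate."""
        self._acc += energy
        self._count += 1
        if self._count == self.frames_per_update:
            # Assumption: the new background noise is the mean energy of
            # the accumulated frames; the accumulator then restarts.
            self.noise = self._acc / self._count
            self._acc = 0.0
            self._count = 0
        return self.noise
```

The estimate stays constant between updates, so the comparison against threshold 1 in step 6 always sees the most recently completed estimate.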
[0047] In this embodiment, the calculation process of step 3 is as follows:
[0048] Assuming that the number of frequency-domain sub-bands is N, the average sub-band energy is $E_{avg} = \frac{E_{total}}{N} = \frac{1}{N}\sum_{i=1}^{N} E_i$, where $E_{avg}$ is the average sub-band energy, $E_{total}$ is the sum of all sub-band energies, and $E_i$ is the energy of the i-th sub-band, i = 1, 2, ..., N. In the frequency domain, the energy of a sub-band equals the sum of the square of its real part and the square of its imaginary part.
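As a rough illustration of the sub-band energy calculation, the sketch below treats each transform bin as one sub-band, which is an assumption; the description does not state how bins are grouped into sub-bands. A naive discrete Fourier transform is used only to keep the example self-contained; an FFT produces the same values. The function names are hypothetical.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def subband_energies(spectrum):
    """Energy per bin: real part squared plus imaginary part squared."""
    return [c.real ** 2 + c.imag ** 2 for c in spectrum]


def average_subband_energy(energies):
    """Average sub-band energy: total energy divided by sub-band count."""
    return sum(energies) / len(energies)
```

For a single-impulse frame such as `[1.0, 0.0, 0.0, 0.0]`, every bin has the same energy, so the average equals each individual sub-band energy.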
[0049] In this embodiment, the calculation process of step 9 is as follows:
[0050] The non-uniformity is found using the mean square error method. With the energy of each sub-band denoted $E_i$, the formula is $nU = \frac{1}{N}\sum_{i=1}^{N}(E_i - E_{avg})^2$, where $nU$ is the non-uniformity. With the threshold Th_nu set as the non-uniformity threshold, the frame is judged as speech when $nU < Th\_nu$, and as non-speech otherwise.
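A minimal sketch of the mean square error test follows. It assumes the usual form of the mean square error, nU = (1/N) Σ (Ei − Eavg)², and assumes that a frame counts as speech when nU is below Th_nu, which is consistent with step 9, where high uniformity indicates speech. The function names and the parameter `th_nu` are illustrative only.

```python
def non_uniformity_mse(energies):
    """Mean squared deviation of the sub-band energies from their average."""
    n = len(energies)
    e_avg = sum(energies) / n
    return sum((e - e_avg) ** 2 for e in energies) / n


def is_speech_by_uniformity(energies, th_nu):
    """Assumed decision rule: low non-uniformity means speech."""
    return non_uniformity_mse(energies) < th_nu
```

Because nU scales with the square of the signal energy, a practical threshold would have to be chosen relative to the signal level; the description does not say how Th_nu is set.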
[0051] In other embodiments, the following two methods can be used for calculation:
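[0052] 1. Use the average of the absolute differences from the average sub-band energy. The formula is $nU = \frac{1}{N}\sum_{i=1}^{N}|E_i - E_{avg}|$, where $nU$ is the non-uniformity and the threshold Th_nu is the non-uniformity threshold; the frame is judged as speech when $nU < Th\_nu$, and as non-speech otherwise.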
[0053] 2. Count the sub-bands whose energy is close to the average sub-band energy. If many sub-band energies lie near the average energy, the frame is speech; otherwise it is non-speech. Specifically, if the number of sub-bands for which $|E_i - E_{avg}|$ falls within a set tolerance exceeds the threshold Th_u, the frame is judged as speech; otherwise it is non-speech.
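The two alternative methods can be sketched as below. Both are assumptions about formulas that are not fully legible in the description: method 1 is taken to be the mean absolute deviation from the average sub-band energy, and method 2 is taken to count the sub-bands within a hypothetical `tolerance` of the average and compare that count with Th_u. All function and parameter names are illustrative.

```python
def non_uniformity_abs(energies):
    """Method 1 (assumed form): mean absolute deviation from the average."""
    n = len(energies)
    e_avg = sum(energies) / n
    return sum(abs(e - e_avg) for e in energies) / n


def count_near_average(energies, tolerance):
    """Method 2: number of sub-bands with |Ei - Eavg| below `tolerance`."""
    e_avg = sum(energies) / len(energies)
    return sum(1 for e in energies if abs(e - e_avg) < tolerance)


def is_speech_by_count(energies, tolerance, th_u):
    """Assumed rule: speech when more than th_u sub-bands are near average."""
    return count_near_average(energies, tolerance) > th_u
```

Method 1 avoids the squaring of the mean square error approach, and method 2 reduces the decision to an integer count, which may suit a hardware implementation.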
[0054] The detailed calculation process of step 10 in this embodiment is as follows:
[0055] A speech frame counter is set. It is initialized to 0, cleared to zero whenever a non-speech frame is encountered, and incremented by 1 whenever a speech frame is encountered. On a transition from a non-speech frame to a speech frame, the sequence number of the first speech frame is recorded as the start address of the speech segment. When the value of the speech frame counter exceeds threshold 2, all consecutive frames from the first speech frame onward are speech frames, until a non-speech frame appears. If a non-speech frame arrives while the counter value is still below threshold 2, the preceding speech frames are also judged as non-speech frames.
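The counter logic can be sketched in batch form as follows. The function name `time_width_filter` and its list-based interface are hypothetical; a real-time implementation would need to buffer or delay its output, since the first frames of a run are only confirmed as speech once the counter passes threshold 2.

```python
def time_width_filter(frame_flags, threshold_2):
    """Keep only runs of consecutive speech frames longer than threshold_2.

    frame_flags: per-frame booleans from the preceding detection step.
    Returns a list of the same length with short runs relabeled non-speech.
    """
    result = [False] * len(frame_flags)
    counter = 0  # speech frame counter, cleared on every non-speech frame
    start = 0    # sequence number of the first speech frame of the run
    for i, is_speech in enumerate(frame_flags):
        if not is_speech:
            counter = 0
            continue
        if counter == 0:
            start = i
        counter += 1
        if counter == threshold_2 + 1:
            # Run just exceeded the threshold: confirm it from its start.
            result[start:i + 1] = [True] * (i + 1 - start)
        elif counter > threshold_2:
            result[i] = True
    return result
```

With `threshold_2 = 2`, a run of two speech frames is discarded as a spurious burst, while a run of four is kept in full, including its first frames.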