Audio processing method and device, electronic equipment and storage medium

By using spatial filters and gain processing techniques in audio recording equipment, the problems of poor quality and stability in directional audio pickup were solved, achieving higher fidelity and stability in audio signal processing and improving the user experience.

CN116634329B · Active · Publication Date: 2026-06-12 · BEIJING XIAOMI MOBILE SOFTWARE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING XIAOMI MOBILE SOFTWARE CO LTD
Filing Date: 2022-02-14
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, devices such as mobile phones and headphones exhibit poor audio quality and stability when picking up directional audio.

Method used

By using at least two microphones in an audio recording device, utilizing spatial filters for directional audio processing, and combining gain processing and noise cancellation techniques, the fidelity and stability of the audio signal are improved.

Benefits of technology

It improves the fidelity and stability of directional audio signal pickup, enhancing the user experience in voice calls and human-computer voice interaction scenarios.

Abstract

The present disclosure relates to an audio processing method, device, electronic equipment and storage medium. The method is applied to an audio recording device having at least two microphones. The method comprises: obtaining original audio signals collected by each microphone of the at least two microphones; performing directional audio processing on the original audio signals collected by each microphone using a spatial filter according to a target sound source direction and an end-fire direction of each microphone to obtain a first audio signal; and performing gain processing on the first audio signal according to an original audio signal collected by a target microphone of the at least two microphones and the first audio signal to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the target sound source direction.

Description

Technical Field

[0001] This disclosure relates to the field of audio processing technology, specifically to an audio processing method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Current mobile phones, headsets, and other devices can all record audio, a function used in scenarios such as voice calls and human-computer voice interaction. Mobile phones and headsets contain microphone arrays, in which each microphone can collect audio. This audio needs to be processed through delay estimation, beamforming, and noise cancellation to achieve directional audio pickup. However, in the related technologies, the quality and stability of the directionally picked-up audio obtained by processing the microphone signals are generally poor.

Summary of the Invention

[0003] To overcome the problems existing in the related technologies, this disclosure provides an audio processing method, apparatus, electronic device, and storage medium to solve the defects in the related technologies.

[0004] According to a first aspect of the present disclosure, an audio processing method is provided, applied to an audio recording device having at least two microphones, the method comprising:

[0005] Acquire the raw audio signal captured by each of the at least two microphones;

[0006] Based on the direction of the target sound source and the end-fire direction of each microphone, a spatial filter is used to perform directional audio processing on the raw audio signals collected by each microphone to obtain a first audio signal;

[0007] Based on the original audio signal acquired by the target microphone among the at least two microphones and the first audio signal, the first audio signal is subjected to gain processing to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source.

[0008] In one embodiment, the step of performing directional audio processing on the raw audio signals acquired by each of the microphones using a spatial filter, based on the direction of the target sound source and the end-fire direction of each microphone, to obtain a first audio signal, includes:

[0009] Following the order from the target microphone to the far-end microphone, the raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering to obtain a mixed signal. The end-fire direction of the far-end microphone is opposite to that of the target microphone. The mixed signal includes a first audio signal and a noise signal.

[0010] The raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering in the order from the far-end microphone to the target microphone to obtain the noise signal.

[0011] The mixed signal and the noise signal are input into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

[0012] In one embodiment, the step of sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering includes:

[0013] The initial input of the original audio signal is used as the target signal and input into the spatial filter. Based on the input order of the remaining original audio signals, each of the original audio signals is used as an interference signal at each level and input into the spatial filter for spatial filtering.

[0014] In one embodiment, the filtering function within the spatial filter is generated based on the distance between adjacent microphones.

[0015] In one embodiment, the step of performing gain processing on the first audio signal to obtain a second audio signal based on the original audio signal acquired by the target microphone among the at least two microphones and the first audio signal includes:

[0016] The gain adjustment value is determined based on the root mean square value of the original audio signal acquired by the target microphone of the at least two microphones in the reference frequency band, and the root mean square value of the first audio signal in the reference frequency band.

[0017] The first audio signal is subjected to gain processing based on the gain adjustment value to obtain the second audio signal.

[0018] In one embodiment, the method further includes:

[0019] The upper and lower frequency limits of the reference frequency band are determined based on the reference center frequency and the bandwidth ratio.

[0020] In one embodiment, the method further includes:

[0021] The second audio signal in time domain form is used as the target audio signal of the audio recording device.

[0022] In one embodiment, the method further includes:

[0023] The frequency points to be compensated are determined based on the distance between adjacent microphones and the preset number of frequency points to be compensated.

[0024] Based on the value of the original audio signal collected by the target microphone at the frequency point to be compensated, and the value of the second audio signal at the frequency point to be compensated, the frequency compensation value of the frequency point to be compensated is determined;

[0025] Based on the frequency compensation value of the frequency point to be compensated, the value of the second audio signal at the frequency point to be compensated is frequency compensated to obtain the third audio signal.

[0026] In one embodiment, frequency compensation is performed on the value of the second audio signal at the frequency point to be compensated, based on the frequency compensation value of the frequency point to be compensated, including:

[0027] If the frequency point to be compensated is less than a preset frequency point threshold, the value of the second audio signal at the frequency point to be compensated is frequency compensated according to the frequency compensation value of the frequency point to be compensated.

[0028] If the frequency point to be compensated is greater than or equal to the frequency point threshold, the value of the second audio signal at the frequency point to be compensated remains unchanged.

[0029] In one embodiment, the method further includes:

[0030] The third audio signal in time domain form is used as the target audio signal of the audio recording device.

[0031] According to a second aspect of the present disclosure, an audio processing apparatus is provided, applied to an audio recording device, the audio recording device having at least two microphones, the apparatus comprising:

[0032] An acquisition module is used to acquire the raw audio signal collected by each of the at least two microphones;

[0033] The directional processing module is used to perform directional audio processing on the raw audio signals collected by each of the microphones using a spatial filter, based on the direction of the target sound source and the end-fire direction of each microphone, to obtain a first audio signal;

[0034] A gain processing module is used to perform gain processing on the first audio signal based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source.

[0035] In one embodiment, the directional processing module is specifically used for:

[0036] Following the order from the target microphone to the far-end microphone, the raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering to obtain a mixed signal. The end-fire direction of the far-end microphone is opposite to that of the target microphone. The mixed signal includes a first audio signal and a noise signal.

[0037] The raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering in the order from the far-end microphone to the target microphone to obtain the noise signal.

[0038] The mixed signal and the noise signal are input into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

[0039] In one embodiment, when the directional processing module sequentially inputs the raw audio signals collected by each microphone into the spatial filter for spatial filtering, it is specifically used for:

[0040] The initial input of the original audio signal is used as the target signal and input into the spatial filter. Based on the input order of the remaining original audio signals, each of the original audio signals is used as an interference signal at each level and input into the spatial filter for spatial filtering.

[0041] In one embodiment, the filtering function within the spatial filter is generated based on the distance between adjacent microphones.

[0042] In one embodiment, the gain processing module is specifically used for:

[0043] The gain adjustment value is determined based on the root mean square value of the original audio signal acquired by the target microphone of the at least two microphones in the reference frequency band, and the root mean square value of the first audio signal in the reference frequency band.

[0044] The first audio signal is subjected to gain processing based on the gain adjustment value to obtain the second audio signal.

[0045] In one embodiment, the gain processing module is further configured to:

[0046] The upper and lower frequency limits of the reference frequency band are determined based on the reference center frequency and the bandwidth ratio.

[0047] In one embodiment, the apparatus further includes a first target module, configured to:

[0048] The second audio signal in time domain form is used as the target audio signal of the audio recording device.

[0049] In one embodiment, a compensation module is further included, for:

[0050] The frequency points to be compensated are determined based on the distance between adjacent microphones and the preset number of frequency points to be compensated.

[0051] Based on the value of the original audio signal collected by the target microphone at the frequency point to be compensated, and the value of the second audio signal at the frequency point to be compensated, the frequency compensation value of the frequency point to be compensated is determined;

[0052] Based on the frequency compensation value of the frequency point to be compensated, the value of the second audio signal at the frequency point to be compensated is frequency compensated to obtain the third audio signal.

[0053] In one embodiment, the compensation module is used to perform frequency compensation on the value of the second audio signal at the frequency point to be compensated based on the frequency compensation value of the frequency point to be compensated, specifically for:

[0054] If the frequency point to be compensated is less than a preset frequency point threshold, the value of the second audio signal at the frequency point to be compensated is frequency compensated according to the frequency compensation value of the frequency point to be compensated.

[0055] If the frequency point to be compensated is greater than or equal to the frequency point threshold, the value of the second audio signal at the frequency point to be compensated remains unchanged.

[0056] In one embodiment, the apparatus further includes a second target module for:

[0057] The third audio signal in time domain form is used as the target audio signal of the audio recording device.

[0058] According to a third aspect of the present disclosure, an electronic device is provided, the electronic device including a memory and a processor, the memory being used to store computer instructions executable on the processor, and the processor being used to implement the audio processing method described in the first aspect when executing the computer instructions.

[0059] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method described in the first aspect.

[0060] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects:

[0061] This disclosure acquires the raw audio signals from each of at least two microphones in an audio recording device, and performs directional audio processing on the raw audio signals from each microphone using a spatial filter based on the direction of the target sound source and the end-fire direction of each microphone to obtain a first audio signal. Finally, based on the raw audio signal acquired by the target microphone among the at least two microphones and the first audio signal obtained above, gain processing is performed on the first audio signal to obtain a second audio signal. Because the first audio signal is gain-processed, the fidelity and stability of the directionally picked-up second audio signal are improved, thereby enhancing the user experience of the audio recording device in scenarios such as voice calls and human-computer voice interaction.

Brief Description of the Drawings

[0062] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

[0063] Figure 1 is a flowchart of an audio processing method according to an exemplary embodiment of this disclosure;

[0064] Figure 2 is a schematic diagram of a spatial filter for two microphones according to an exemplary embodiment of this disclosure;

[0065] Figure 3 is a schematic diagram of a spatial filter for three microphones according to an exemplary embodiment of this disclosure;

[0066] Figure 4 is a flowchart of directional audio processing according to an exemplary embodiment of this disclosure;

[0067] Figure 5 is a flowchart of an audio processing method according to another exemplary embodiment of this disclosure;

[0068] Figure 6 is a flowchart of an audio processing method according to yet another exemplary embodiment of this disclosure;

[0069] Figure 7 is a schematic structural diagram of an audio processing apparatus according to an exemplary embodiment of this disclosure;

[0070] Figure 8 is a structural block diagram of an electronic device according to an exemplary embodiment of this disclosure.

Detailed Description

[0071] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.

[0072] The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The singular forms "a," "an," and "the" as used in this disclosure and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

[0073] It should be understood that although the terms first, second, third, etc., may be used in this disclosure to describe various information, such information should not be limited to these terms. These terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of this disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0074] In the related technologies, audio picked up directionally by devices such as mobile phones and headphones is prone to volume fluctuations, and also suffers from frequency deficiencies and degraded sound quality.

[0075] Based on this, in a first aspect, at least one embodiment of this disclosure provides an audio processing method. Referring to Figure 1, which illustrates the flow of the method, the method includes steps S101 to S103.

[0076] This audio processing method is applied to audio recording devices such as mobile phones and headphones. The audio recording device has at least two microphones, which can form a microphone array or be set individually. Each microphone has a different end-fire direction, which is the direction of the end of the audio recording device corresponding to that microphone. For example, if a mobile phone has two microphones at the top and two at the bottom, the end-fire direction of the top microphone is the top direction, and the end-fire direction of the bottom microphone is the bottom direction. Each microphone can be used to collect audio. This audio processing method is used to process the collected audio to directionally pick up audio from the direction of the target sound source. Since the audio collected by the microphone is not used as the final output or saved audio, the audio collected by the microphone is referred to as the raw audio signal below.

[0077] In step S101, the raw audio signal captured by each of the at least two microphones is acquired.

[0078] Microphones in audio recording devices such as mobile phones and headphones can capture raw audio signals in real time or in specific modes. For example, a mobile phone microphone can capture raw audio signals in call mode or human-computer interaction mode, while a headphone microphone can capture raw audio signals when the host device to which the headphone is connected is in headphone mode.

[0079] The original audio signal can be a time-domain signal, that is, an audio signal in time-domain form.

[0080] In step S102, based on the direction of the target sound source and the end-fire direction of each microphone, a spatial filter is used to perform directional audio processing on the original audio signals collected by each microphone to obtain a first audio signal.

[0081] The filtering function H(ω) of the spatial filter can be expressed by the following formula:

[0082] H(ω) = H_L(ω) · a(ω, θ);

[0083] H_L(ω) is the low-frequency compensation filter coefficient. H_L(ω) is a function of the distance between adjacent microphones, so the filtering function in the spatial filter is generated based on the distance between adjacent microphones. a(ω, θ) is the steering vector of the microphone array.

[0084] The filtering process of a spatial filter can be represented by the following formula:

[0085] Y = S * H(ω);

[0086] Y is the frequency domain output signal of the spatial filter, which can be converted into the time domain signal y by inverse Fourier transform; S is the input signal of the spatial filter, which is a vector composed of the target signal in frequency domain form and the interference signals in frequency domain form at each level. The target signal and the interference signal are both the original audio signals collected by the microphone.
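As an illustration of the filtering step Y = S * H(ω), the following minimal sketch reads the product as a per-frequency-bin weighted sum over the inputs. This reading, the function name apply_spatial_filter, and the list-of-lists layout are assumptions made for illustration only. The weights H are supplied by the caller as placeholders; the actual expressions for H_L(ω) and a(ω, θ) are not reproduced here.

```python
from typing import List, Sequence


def apply_spatial_filter(S: Sequence[Sequence[complex]],
                         H: Sequence[Sequence[complex]]) -> List[complex]:
    """Return Y[k] = sum_i S[i][k] * H[i][k] for every frequency bin k.

    S[i] is the spectrum of input i (target signal first, then the
    interference signals level by level). H[i] holds hypothetical
    per-input filter weights; how they are derived is out of scope here.
    """
    if len(S) != len(H) or not S:
        raise ValueError("S and H must have the same non-zero number of rows")
    n_bins = len(S[0])
    if any(len(row) != n_bins for row in list(S) + list(H)):
        raise ValueError("all rows must have the same number of bins")
    return [sum(S[i][k] * H[i][k] for i in range(len(S)))
            for k in range(n_bins)]


# Two inputs, two frequency bins, with placeholder weights (+1 and -1).
S_demo = [[1 + 0j, 2 + 0j], [1 + 0j, 1j]]
H_demo = [[1, 1], [-1, -1]]
Y_demo = apply_spatial_filter(S_demo, H_demo)
```

The time-domain signal y would then be obtained from Y by an inverse Fourier transform, as stated above.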

[0087] In one possible scenario, the audio recording device has two microphones, and the spatial filter is as shown in Figure 2: the raw audio signals s1 and s2 collected by the two microphones are input into the spatial filter, and the output is the directional audio processing result, namely the first audio signal.

[0088] In this scenario with two microphones, the low-frequency compensation filter coefficient H_L(ω) of the spatial filter can be:

[0089] Where j is the imaginary unit, ω represents the digital angular frequency, and τ represents the delay.

[0090] In this scenario with two microphones, the steering vector a(ω,θ) of the microphone array can be:

[0091] Where c is the speed of sound;

[0092] In the scenario with two microphones, the input signal S of the spatial filter can be a vector [S1 S2], where S1 is the original audio signal s1 collected by one of the microphones converted from the time domain to the frequency domain, and S2 is the original audio signal s2 collected by the other microphone converted from the time domain to the frequency domain. S1 is the target signal, and S2 is the interference signal.

[0093] In another possible scenario, the audio recording device has three microphones, and the spatial filter is as shown in Figure 3: the raw audio signals s1, s2, and s3 collected by the three microphones are input into the spatial filter, and the output is the directional processing result, namely the first audio signal.

[0094] In this scenario with three microphones, the low-frequency compensation filter coefficient H_L(ω) of the spatial filter can be:

[0095]

[0096] In a scenario with three microphones, the steering vector a(ω,θ) of the microphone array can be:

[0097] Where α_{2,1} = -1 and α_{2,2} = 0.

[0098] In the scenario with three microphones, the input signal S of the spatial filter can be a vector [S1 S2 S3], where S1 is the original audio signal s1 collected by the first microphone converted from the time domain to the frequency domain, S2 is the original audio signal s2 collected by the second microphone converted from the time domain to the frequency domain, and S3 is the original audio signal s3 collected by the third microphone converted from the time domain to the frequency domain. S1 is the target signal, S2 is the first-level interference signal, and S3 is the second-level interference signal.

[0099] Based on the structure and parameters of the spatial filter described above, this step can be performed as shown in Figure 4, including sub-steps S1021 to S1023.

[0100] In sub-step S1021, the original audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering in the order from the target microphone to the far-end microphone to obtain a mixed signal. The end-fire direction of the far-end microphone is opposite to that of the target microphone. The mixed signal includes a first audio signal and a noise signal.

[0101] In this context, the end-fire direction of the target microphone matches the direction of the target sound source. That is, the target microphone is the microphone located at the end of the audio recording device corresponding to the direction of the target sound source. Specifically, a preset angle range on both sides of the microphone's end-fire direction can be set as the matching range. When the direction of the target sound source is within the matching range of a certain microphone, the end-fire direction of that microphone matches the direction of the target sound source. For example, a mobile phone has microphones at its top and bottom. When recording with the bottom of the phone facing the sound source, the target sound source direction is the bottom, so the microphone located at the bottom, with its end-fire direction at the bottom, is the target microphone. As another example, an earphone has microphones at its head and tail. When a user wears the earphone for a voice call, the target sound source direction is the tail of the earphone, so the microphone located at the tail, with its end-fire direction at the tail, is the target microphone.
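The matching rule described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes directions are expressed as angles in degrees, uses a hypothetical preset half-range of 45 degrees, and picks the closest microphone when several match. The names angular_difference and select_target_microphone are illustrative.

```python
from typing import Dict, Optional


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in [0, 180]."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def select_target_microphone(end_fire_deg: Dict[str, float],
                             source_deg: float,
                             half_range_deg: float = 45.0) -> Optional[str]:
    """Return the microphone whose end-fire direction matches the source.

    A microphone matches when the source direction lies within
    half_range_deg on either side of its end-fire direction. Returns
    None when no microphone matches.
    """
    best_name: Optional[str] = None
    best_diff: Optional[float] = None
    for name, direction in end_fire_deg.items():
        diff = angular_difference(direction, source_deg)
        if diff <= half_range_deg and (best_diff is None or diff < best_diff):
            best_name, best_diff = name, diff
    return best_name


# Phone with one microphone at the top (90 deg) and one at the bottom (270 deg).
mics_demo = {"top": 90.0, "bottom": 270.0}
target_demo = select_target_microphone(mics_demo, source_deg=250.0)
```

In the phone example, a source near the bottom of the device selects the bottom microphone, while a source broadside to the device (for example at 0 degrees) matches neither microphone under the assumed 45 degree half-range.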

[0102] In this context, the end of the audio recording device containing the far-end microphone is opposite to the end of the audio recording device corresponding to the direction of the target sound source. In other words, the far-end microphone is the microphone furthest from the target microphone. For example, if a mobile phone has microphones at its top and bottom, and the phone is recording with the bottom facing the sound source, then the microphone located at the top (with its end-fire direction pointing towards the top) is the far-end microphone. Similarly, if an earphone has microphones at its head and tail, and the user is making a voice call while wearing the earphone, the target sound source is at the tail of the earphone, so the microphone located at the head (with its end-fire direction pointing towards the head) is the far-end microphone.

[0103] The initial input of the original audio signal can be used as the target signal and input into the spatial filter. Based on the input order of the remaining original audio signals, each of the original audio signals can be used as an interference signal at each level and input into the spatial filter for spatial filtering.

[0104] When the audio recording device has two microphones, the original audio signal collected by the target microphone can be input to the spatial filter as the target signal, and the original audio signal collected by the other microphone can be input to the spatial filter as the interference signal. The input signal S of the spatial filter can then be a vector [S1 S2], where S1 is the original audio signal s1 collected by the target microphone converted from the time domain to the frequency domain, and S2 is the original audio signal s2 collected by the other microphone converted from the time domain to the frequency domain.

[0105] Since the input order of each original audio signal matches the distance of each microphone from the target microphone, when the audio recording device has at least three microphones, the original audio signal collected by the target microphone can be input to the spatial filter as the target signal. Based on the distance of each of the other microphones from the target microphone, the original audio signals collected by those microphones can be input into the spatial filter as interference signals of various levels. That is, the smaller the distance from the target microphone, the higher the interference level. For example, each microphone can be numbered along the direction from the target microphone to the far-end microphone. The target microphone is numbered 1, the interference level of the original audio signal collected by microphone number 2 is 1, and so on for the other microphones. The input signal S of the spatial filter can be a vector [S1 … Sn], where S1 is the original audio signal s1 collected by the target microphone converted from the time domain to the frequency domain, Sn is the original audio signal sn collected by the far-end microphone converted from the time domain to the frequency domain, and n≥3. The far-end microphone is the microphone furthest from the target microphone, and its end-fire direction is opposite to that of the target microphone.

[0106] In sub-step S1022, the original audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering in the order from the far-end microphone to the target microphone to obtain the noise signal.

[0107] The input signal of the spatial filter in sub-step S1021 can be reversed in order and used as the input signal of the spatial filter in this sub-step.

[0108] The initial input of the original audio signal can be used as the target signal and input into the spatial filter. Based on the input order of the remaining original audio signals, each of the original audio signals can be used as an interference signal at each level and input into the spatial filter for spatial filtering.

[0109] When the audio recording device has two microphones, the original audio signal collected by the far-end microphone can be input to the spatial filter as the target signal, and the original audio signal collected by the other microphone can be input to the spatial filter as the interference signal. The input signal S of the spatial filter can then be a vector [S2 S1], where S1 is the original audio signal s1 collected by the target microphone converted from the time domain to the frequency domain, and S2 is the original audio signal s2 collected by the far-end microphone converted from the time domain to the frequency domain.

[0110] Since the input order of each original audio signal matches the distance of each microphone from the target microphone, when the audio recording device has at least three microphones, the original audio signal collected by the far-end microphone can be input to the spatial filter as the target signal. Furthermore, based on the distance of each of the other microphones from the far-end microphone, the original audio signals collected by those microphones can be input to the spatial filter as interference signals of various levels. That is, the smaller the distance from the far-end microphone, the higher the interference level. For example, each microphone can be numbered along the direction from the target microphone to the far-end microphone. The target microphone is numbered 1, and the interference level of the original audio signal collected by microphone number 2 is 1, and so on for the other microphones. The input signal S of the spatial filter can then be a vector [Sn … S1], where S1 is the original audio signal s1 collected by the target microphone converted from the time domain to the frequency domain, Sn is the original audio signal sn collected by the far-end microphone converted from the time domain to the frequency domain, and n≥3.
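The two input orders used in sub-steps S1021 and S1022 can be sketched as below. This is a minimal illustration under stated assumptions: each microphone's spectrum is a list of complex values, the distance of each microphone from the target microphone is known, and the function name order_inputs is hypothetical.

```python
from typing import List, Sequence, Tuple

Spectrum = Sequence[complex]


def order_inputs(spectra: Sequence[Spectrum],
                 distances_from_target: Sequence[float]
                 ) -> Tuple[List[Spectrum], List[Spectrum]]:
    """Build the spatial-filter input vectors for S1021 and S1022.

    forward:  target microphone first, far-end microphone last ([S1 ... Sn]),
              used to obtain the mixed signal.
    backward: the same vector reversed ([Sn ... S1]), used to obtain the
              noise signal.
    """
    if len(spectra) != len(distances_from_target):
        raise ValueError("one distance is required per microphone")
    order = sorted(range(len(spectra)),
                   key=lambda i: distances_from_target[i])
    forward = [spectra[i] for i in order]
    backward = list(reversed(forward))
    return forward, backward


# Three microphones listed in arbitrary order: far-end, target, middle.
spectra_demo = [[3 + 0j], [1 + 0j], [2 + 0j]]
distances_demo = [0.14, 0.0, 0.07]  # metres from the target microphone (assumed)
forward_demo, backward_demo = order_inputs(spectra_demo, distances_demo)
```

In both vectors the first element plays the role of the target signal and the remaining elements are the interference signals at successive levels.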

[0111] In sub-step S1023, the mixed signal and the noise signal are input into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

[0112] Optionally, the noise signal is used as the noise reference signal of the noise cancellation filter to perform adaptive noise cancellation on the mixed signal, thereby obtaining the first audio signal.

[0113] The first audio signal obtained in this step has undergone directional beamforming and noise cancellation, resulting in a relatively clean audio signal, denoted as Sig_c. The first audio signal can be a time-domain signal or a frequency-domain signal.
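
The source does not name the adaptive algorithm used by the noise cancellation filter. As one possible illustration, the sketch below uses a normalized least-mean-squares (NLMS) canceller operating on time-domain samples. The function names, the tap count, the step size `mu`, and the helper `pseudo_noise` (a deterministic demo signal generator) are all assumptions and are not taken from the disclosure.

```python
def pseudo_noise(n, seed=1):
    """Deterministic pseudo-random samples in [-1, 1) from a simple LCG (demo only)."""
    out, state = [], seed
    for _ in range(n):
        state = (1103515245 * state + 12345) % (2 ** 31)
        out.append(state / 2 ** 30 - 1.0)
    return out


def nlms_cancel(mixed, noise_ref, taps=8, mu=0.5, eps=1e-8):
    """Remove from `mixed` the component linearly predictable from `noise_ref`.

    Returns the error signal, which plays the role of the cleaned output
    (Sig_c in the text) under the NLMS assumption made here.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(mixed, noise_ref):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


# Demo: the "mixed" signal contains only scaled reference noise, so the
# canceller output should decay towards zero once the weights have adapted.
reference = pseudo_noise(2000)
mixed_demo = [0.8 * x for x in reference]
cleaned_demo = nlms_cancel(mixed_demo, reference, taps=4)
```

In a real signal the mixed input also contains the target sound, which the canceller leaves largely intact as long as the target is uncorrelated with the noise reference.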

[0114] In step S103, based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal, the first audio signal is subjected to gain processing to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source.

[0115] Specifically, the upper and lower limits of the reference frequency band can be determined in advance based on the reference center frequency and the bandwidth ratio. For example, a miniature microphone has a relatively flat frequency response at 1 kHz, so the center frequency can be set to 1 kHz, and the bandwidth can be set to 1/3 octave. In this case, the upper limit frequency f2 of the reference frequency band is 1420 Hz, and the lower limit frequency f1 of the reference frequency band is 710 Hz.
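
The source does not state the formula that maps the center frequency and bandwidth ratio to the band limits. A common convention, assumed here, places the limits symmetrically about the center frequency on a logarithmic scale: f1 = fc / 2^(b/2) and f2 = fc * 2^(b/2), where b is the bandwidth in octaves. Under that assumption, limits close to the 710 Hz and 1420 Hz quoted in the text correspond to a full-octave band (b = 1) around 1 kHz, whereas b = 1/3 gives roughly 891 Hz and 1122 Hz. The function name `band_limits` is hypothetical.

```python
def band_limits(center_hz, octaves):
    """Lower and upper band-edge frequencies for a band `octaves` wide.

    Assumes edges symmetric about the center on a log-frequency axis.
    """
    half_span = 2.0 ** (octaves / 2.0)
    return center_hz / half_span, center_hz * half_span


full_octave = band_limits(1000.0, 1.0)         # about (707.1, 1414.2) Hz
third_octave = band_limits(1000.0, 1.0 / 3.0)  # about (890.9, 1122.5) Hz
```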

[0116] In this step, the root mean square value amp_ref of the original audio signal Sig_1 collected by the target microphone in the reference frequency band can be determined first according to the following formula: amp_ref=rms(Sig_1(f1:f2)). Then, the root mean square value amp_real of the first audio signal Sig_c in the reference frequency band can be determined according to the following formula: amp_real=rms(Sig_c(f1:f2)).

[0117] Then, the gain adjustment value gain_all is determined from the root mean square value amp_ref of the original audio signal collected by the target microphone in the reference frequency band and the root mean square value amp_real of the first audio signal in the reference frequency band, according to the following formula: gain_all = amp_ref / amp_real.

[0118] Finally, gain processing is performed on the first audio signal according to the gain adjustment value to obtain the second audio signal Sig_g, according to the following formula: Sig_g = Sig_c * gain_all.
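
The three formulas above (amp_ref, amp_real, gain_all) can be sketched for a frequency-domain representation as follows. The function names, the list-of-complex-bins representation, and the `freqs` array giving the frequency of each bin are assumptions made for illustration. The handling of an empty band or a zero amp_real is also an assumption, since the source does not address those cases.

```python
import math


def band_rms(spectrum, freqs, f1, f2):
    """Root mean square of the bin magnitudes with f1 <= frequency <= f2."""
    mags = [abs(x) for x, f in zip(spectrum, freqs) if f1 <= f <= f2]
    if not mags:
        raise ValueError("no frequency bins inside the reference band")
    return math.sqrt(sum(m * m for m in mags) / len(mags))


def apply_reference_gain(sig_1, sig_c, freqs, f1, f2):
    """Return (Sig_g, gain_all) with Sig_g = Sig_c * amp_ref / amp_real."""
    amp_ref = band_rms(sig_1, freqs, f1, f2)
    amp_real = band_rms(sig_c, freqs, f1, f2)
    if amp_real == 0.0:
        raise ValueError("first audio signal is silent in the reference band")
    gain_all = amp_ref / amp_real
    return [x * gain_all for x in sig_c], gain_all


# Toy spectra: inside 710-1420 Hz the target microphone is twice as loud.
freqs_demo = [500.0, 800.0, 1000.0, 1200.0, 2000.0]
sig_1_demo = [1.0, 2.0, 2.0, 2.0, 1.0]
sig_c_demo = [0.5, 1.0, 1.0, 1.0, 0.1]
sig_g_demo, gain_demo = apply_reference_gain(
    sig_1_demo, sig_c_demo, freqs_demo, 710.0, 1420.0
)
```

Note that a single broadband gain is applied to every bin of Sig_c, so the spectral shape of the first audio signal is preserved and only its overall level is matched to the target microphone in the reference band.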

[0119] The second audio signal can be a time-domain signal or a frequency-domain signal. When the second audio signal is a time-domain signal, it can be directly used as the target audio signal of the audio recording device, i.e., the audio signal that the device ultimately acquires, stores, and transmits. When the second audio signal is a frequency-domain signal, it can be converted from the frequency domain to the time domain, and the time-domain form of the second audio signal can be used as the target audio signal of the audio recording device. In either case, the target audio signal can be used, for example, as audio recorded during a mobile phone call and sent to the other party's phone, or as audio recorded for semantic recognition during human-computer voice interaction using headphones connected to a terminal device.

[0120] This disclosure acquires the raw audio signal from each of at least two microphones in an audio recording device, and performs directional audio processing on the raw audio signals acquired by each microphone using a spatial filter based on the direction of the target sound source and the end-fire direction of each microphone to obtain a first audio signal. Finally, based on the raw audio signal acquired by the target microphone among the at least two microphones and the first audio signal obtained above, a gain processing is performed on the first audio signal to obtain a second audio signal. Because the first audio signal is gain-processed, the fidelity of the directionally picked-up second audio signal is improved. Furthermore, the gain processing can enhance the signal at some distorted frequency points, thus improving the stability of the overall audio signal. This, in turn, enhances the user experience of the audio recording device in scenarios such as voice calls and human-computer voice interaction.

[0121] In some embodiments of this disclosure, after the second audio signal is obtained, frequency compensation can also be performed on the second audio signal in the manner shown in Figure 5, including steps S501 to S503.

[0122] In step S501, the frequency points to be compensated are determined based on the distance between adjacent microphones and the preset number of frequency points to be compensated.

[0123] The beam response B(ω,θ0) of the spatial filter in step S102 is:

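[0124] [Formula not reproduced in the source text.]

[0125] Substituting [expressions not reproduced in the source text] into the above formula for the beam response gives:

[0126] [Formula not reproduced in the source text.]

[0127] Evidently, B(ω,θ) = 0 when [condition not reproduced in the source text], where f denotes the frequency and i is an integer taking values from 1 to ∞. That is, the frequency value f_z that produces a zero point is:

[0128] [Formula not reproduced in the source text.]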

[0129] Since spatial filters can create zeros at specific high-frequency points, frequency compensation is needed at these frequencies. That is, the frequency values f_z that produce the zero points are determined to be the frequency points to be compensated.
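
The beam-response formulas are not reproduced in this text, so the exact expressions of the disclosure cannot be shown. Purely to illustrate how zero points can depend on the spacing between adjacent microphones, the sketch below assumes a first-order delay-and-subtract beamformer with internal delay d/c, whose response is B(f, θ) = 1 - exp(-j·2πf·(d/c)·(1 + cos θ)). Under that assumption the on-axis (θ = 0) zeros fall at f_z(i) = i·c/(2d). The model, the speed of sound, the spacing value, and all function names are assumptions and may differ from the patent's actual formulas.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def beam_response(f, theta, d, c=SPEED_OF_SOUND):
    """Assumed delay-and-subtract response for spacing d (m) and angle theta (rad)."""
    phase = 2.0 * math.pi * f * (d / c) * (1.0 + math.cos(theta))
    return 1.0 - cmath.exp(-1j * phase)


def zero_frequencies(d, count, c=SPEED_OF_SOUND):
    """First `count` on-axis zeros of the assumed response: f_z(i) = i*c/(2*d)."""
    return [i * c / (2.0 * d) for i in range(1, count + 1)]


# Hypothetical 2 cm spacing with three frequency points to be compensated.
fz_demo = zero_frequencies(0.02, 3)
```

With a spacing of a few centimetres, the zeros under this assumed model lie in the upper part of the audio band, which is consistent with the statement that the zeros occur at high-frequency points.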

[0130] In step S502, the frequency compensation value of the frequency to be compensated is determined based on the value of the original audio signal collected by the target microphone at the frequency to be compensated and the value of the second audio signal at the frequency to be compensated.

[0131] Optionally, the frequency compensation value fcomp(i) for each frequency point to be compensated is calculated according to the following formula: fcomp(i) = abs(Sig_1(f_z(i))) / abs(Sig_g(f_z(i))), where abs(Sig_1(f_z(i))) is the absolute value of the original audio signal collected by the target microphone at the frequency point to be compensated, and abs(Sig_g(f_z(i))) is the absolute value of the second audio signal at the frequency point to be compensated.

[0132] In step S503, the value of the second audio signal at the frequency point to be compensated is frequency compensated according to the frequency compensation value of the frequency point to be compensated, so as to obtain the third audio signal.

[0133] Optionally, if the frequency point to be compensated is less than a preset frequency threshold, the value of the second audio signal at the frequency point to be compensated is frequency compensated according to the frequency compensation value of that frequency point; if the frequency point to be compensated is greater than or equal to the frequency threshold, the value of the second audio signal at that frequency point remains unchanged. The frequency threshold can be 20000 Hz.

[0134] Frequency compensation can be performed on the value at the frequency point to be compensated according to the following formula to obtain the compensated value Sig_fcomp(f_z(i)): Sig_fcomp(f_z(i)) = Sig_g(f_z(i)) * fcomp(i).
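
Steps S502 and S503 can be sketched together as follows. The function name, the use of bin indices to identify the frequency points to be compensated, and the `bin_freqs` list are assumptions made for illustration. Leaving a bin unchanged when the second audio signal is exactly zero there is also an assumption, since the ratio is undefined in that case and the source does not say how to handle it.

```python
def compensate_zero_points(sig_1, sig_g, zero_bins, bin_freqs, threshold_hz=20000.0):
    """Return Sig_fcomp: Sig_g with the bins in `zero_bins` rescaled.

    For each bin k below the threshold:
        fcomp = abs(Sig_1[k]) / abs(Sig_g[k])
        Sig_fcomp[k] = Sig_g[k] * fcomp
    Bins at or above the threshold are left unchanged.
    """
    compensated = list(sig_g)
    for k in zero_bins:
        if bin_freqs[k] >= threshold_hz:
            continue  # at or above the threshold: keep the original value
        magnitude = abs(sig_g[k])
        if magnitude == 0.0:
            continue  # ratio undefined; left unchanged (assumption)
        fcomp = abs(sig_1[k]) / magnitude
        compensated[k] = sig_g[k] * fcomp
    return compensated


# Toy example with three candidate bins; the last lies above 20000 Hz.
bin_freqs_demo = [8575.0, 17150.0, 25725.0]
sig_1_bins = [2 + 0j, 3j, 4 + 0j]
sig_g_bins = [0.5 + 0j, 0.1j, 0.2 + 0j]
sig_f_bins = compensate_zero_points(sig_1_bins, sig_g_bins, [0, 1, 2], bin_freqs_demo)
```

Because fcomp(i) is a real, non-negative ratio of magnitudes, the compensated bin takes the magnitude of the target microphone's signal while keeping the phase of the second audio signal.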

[0135] The third audio signal can be a time-domain signal or a frequency-domain signal. When the third audio signal is a time-domain signal, it can be directly used as the target audio signal of the audio recording device. When the third audio signal is a frequency-domain signal, it can be converted from the frequency domain to the time domain, and the time-domain form of the third audio signal can be used as the target audio signal of the audio recording device. In either case, the target audio signal can be used, for example, as audio recorded during a mobile phone call and sent to the other party's phone, or as audio recorded for semantic recognition during human-computer voice interaction by headphones connected to a terminal device.

[0136] In this embodiment, by performing frequency compensation on the zero-point frequency in the second audio signal, the zero-point in the audio signal can be eliminated, thereby improving the fidelity and stability of the third audio signal, and thus improving the quality of the target audio signal of the audio recording device.

[0137] Figure 6 illustrates a complete audio processing flow according to an embodiment of the present disclosure. As can be seen from Figure 6, the audio recording device in this embodiment has two microphones: a first microphone and a second microphone. First, the raw audio signal collected by the first microphone is input to a spatial filter as the first microphone input signal, and the raw audio signal collected by the second microphone is input to the spatial filter as the second microphone input signal. The mixed signal and the noise signal output by the spatial filter are input to an adaptive noise cancellation filter for noise cancellation, resulting in a first audio signal. Then, audio compensation is performed on the first audio signal. Specifically, the first audio signal can be gain-processed to obtain a second audio signal, which is then used as the target direction output signal, i.e., the target audio signal. Alternatively, after gain processing, the second audio signal can undergo further frequency compensation to obtain a third audio signal, which is then used as the target direction output signal, i.e., the target audio signal. Because the first audio signal undergoes gain processing, the fidelity of the directionally picked-up second audio signal is improved. Furthermore, the values of the second audio signal at the zero-point frequencies are compensated, which improves the stability of the third audio signal. This enhances the user experience in scenarios such as voice calls and human-computer voice interaction.

[0138] According to a second aspect of the present disclosure, an audio processing apparatus is provided, applied to an audio recording device having at least two microphones. Referring to Figure 7, the apparatus includes:

[0139] Acquisition module 701 is used to acquire the raw audio signal collected by each of the at least two microphones;

[0140] The directional processing module 702 is used to perform directional audio processing on the original audio signals collected by each of the microphones according to the direction of the target sound source and the end-fire direction of each microphone, using a spatial filter to obtain a first audio signal;

[0141] The gain processing module 703 is used to perform gain processing on the first audio signal based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source.

[0142] In some embodiments of this disclosure, the directional processing module is specifically used for:

[0143] Following the order from the target microphone to the far-end microphone, the raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering to obtain a mixed signal. The end-fire direction of the far-end microphone is opposite to that of the target microphone. The mixed signal includes a first audio signal and a noise signal.

[0144] The raw audio signals collected by each microphone are sequentially input into the spatial filter for spatial filtering in the order from the far-end microphone to the target microphone to obtain the noise signal.

[0145] The mixed signal and the noise signal are input into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

[0146] In some embodiments of this disclosure, the step of sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering includes:

[0147] The initial input of the original audio signal is used as the target signal and input into the spatial filter. Based on the input order of the remaining original audio signals, each of the original audio signals is used as an interference signal at each level and input into the spatial filter for spatial filtering.

[0148] In some embodiments of this disclosure, the filtering function within the spatial filter is generated based on the distance between adjacent microphones.

[0149] In some embodiments of this disclosure, the gain processing module is specifically used for:

[0150] The gain adjustment value is determined based on the root mean square value of the original audio signal acquired by the target microphone of the at least two microphones in the reference frequency band, and the root mean square value of the first audio signal in the reference frequency band.

[0151] The first audio signal is subjected to gain processing based on the gain adjustment value to obtain the second audio signal.

[0152] In some embodiments of this disclosure, the gain processing module is further configured to:

[0153] The upper and lower frequency limits of the reference frequency band are determined based on the reference center frequency and the bandwidth ratio.

[0154] In some embodiments of this disclosure, the apparatus further includes a first target module for:

[0155] The second audio signal in time domain form is used as the target audio signal of the audio recording device.

[0156] In some embodiments of this disclosure, a compensation module is also included, for:

[0157] The frequency points to be compensated are determined based on the distance between adjacent microphones and the preset number of frequency points to be compensated.

[0158] Based on the value of the original audio signal collected by the target microphone at the frequency point to be compensated, and the value of the second audio signal at the frequency point to be compensated, the frequency compensation value of the frequency point to be compensated is determined;

[0159] Based on the frequency compensation value of the frequency point to be compensated, the value of the second audio signal at the frequency point to be compensated is frequency compensated to obtain the third audio signal.

[0160] In some embodiments of this disclosure, when the compensation module performs frequency compensation on the value of the second audio signal at the frequency point to be compensated based on the frequency compensation value of the frequency point to be compensated, it is specifically used for:

[0161] If the frequency point to be compensated is less than a preset frequency point threshold, the value of the second audio signal at the frequency point to be compensated is frequency compensated according to the frequency compensation value of the frequency point to be compensated.

[0162] If the frequency point to be compensated is greater than or equal to the frequency point threshold, the value of the second audio signal at the frequency point to be compensated remains unchanged.

[0163] In some embodiments of this disclosure, the apparatus further includes a second target module for:

[0164] The third audio signal in time domain form is used as the target audio signal of the audio recording device.

[0165] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments of the method in the first aspect, and will not be elaborated upon here.

[0166] According to a third aspect of the embodiments of this disclosure, referring to Figure 8, a block diagram of an electronic device is shown by way of example. For instance, device 800 could be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.

[0167] Referring to Figure 8, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

[0168] Processing component 802 typically controls the overall operation of device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the methods described above. Furthermore, processing component 802 may include one or more modules to facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.

[0169] Memory 804 is configured to store various types of data to support the operation of device 800. Examples of this data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, etc. Memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0170] The power supply component 806 provides power to the various components of the device 800. The power supply component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device 800.

[0171] Multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

[0172] Audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a microphone (MIC) configured to receive external audio signals when device 800 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 804 or transmitted via communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

[0173] I/O interface 812 provides an interface between processing component 802 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.

[0174] Sensor component 814 includes one or more sensors for providing status assessments of various aspects of device 800. For example, sensor component 814 can detect the on/off state of device 800, the relative positioning of components such as the display and keypad of device 800, changes in the position of device 800 or a component of device 800, the presence or absence of user contact with device 800, orientation or acceleration/deceleration of device 800, and temperature changes of device 800. Sensor component 814 may also include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor component 814 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.

[0175] Communication component 816 is configured to facilitate wired or wireless communication between device 800 and other devices. Device 800 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, 4G or 5G, or combinations thereof. In one exemplary embodiment, communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0176] In an exemplary embodiment, device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the audio processing method described above.

[0177] According to a fourth aspect, in exemplary embodiments, this disclosure also provides a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, which can be executed by the processor 820 of device 800 to complete the audio processing method described above. For example, the non-transitory computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

[0178] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.

[0179] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. An audio processing method, characterized in that the method is applied to an audio recording device having at least two microphones, and the method comprises: acquiring the raw audio signal collected by each of the at least two microphones; performing, based on the direction of the target sound source and the end-fire direction of each microphone, directional audio processing on the raw audio signals collected by each microphone using a spatial filter to obtain a first audio signal; and performing gain processing on the first audio signal, based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal, to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source; the method further comprising: using the second audio signal in time-domain form as the target audio signal of the audio recording device.

2. The audio processing method according to claim 1, characterized in that performing directional audio processing on the raw audio signals collected by each microphone using a spatial filter, based on the direction of the target sound source and the end-fire direction of each microphone, to obtain the first audio signal comprises: sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering, in the order from the target microphone to the far-end microphone, to obtain a mixed signal, wherein the end-fire direction of the far-end microphone is opposite to that of the target microphone, and the mixed signal includes a first audio signal and a noise signal; sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering, in the order from the far-end microphone to the target microphone, to obtain the noise signal; and inputting the mixed signal and the noise signal into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

3. The audio processing method according to claim 2, characterized in that sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering comprises: inputting the first-input original audio signal into the spatial filter as the target signal, and, based on the input order of the remaining original audio signals, inputting each of the remaining original audio signals into the spatial filter as an interference signal at a respective level for spatial filtering.

4. The audio processing method according to claim 2, characterized in that the filtering function within the spatial filter is generated based on the distance between adjacent microphones.

5. The audio processing method according to claim 1, characterized in that performing gain processing on the first audio signal, based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal, to obtain the second audio signal comprises: determining a gain adjustment value based on the root mean square value, in a reference frequency band, of the original audio signal collected by the target microphone among the at least two microphones and the root mean square value of the first audio signal in the reference frequency band; and performing gain processing on the first audio signal based on the gain adjustment value to obtain the second audio signal.

6. The audio processing method according to claim 5, characterized in that the method further comprises: determining the upper and lower frequency limits of the reference frequency band based on the reference center frequency and the bandwidth ratio.

7. The audio processing method according to any one of claims 1 to 6, characterized in that the method further comprises: determining the frequency points to be compensated based on the distance between adjacent microphones and a preset number of frequency points to be compensated; determining the frequency compensation value of each frequency point to be compensated based on the value of the original audio signal collected by the target microphone at the frequency point to be compensated and the value of the second audio signal at the frequency point to be compensated; and performing frequency compensation on the value of the second audio signal at the frequency point to be compensated, based on the frequency compensation value of the frequency point to be compensated, to obtain a third audio signal.

8. The audio processing method according to claim 7, characterized in that performing frequency compensation on the value of the second audio signal at the frequency point to be compensated, based on the frequency compensation value of the frequency point to be compensated, comprises: if the frequency point to be compensated is less than a preset frequency point threshold, performing frequency compensation on the value of the second audio signal at the frequency point to be compensated according to the frequency compensation value of the frequency point to be compensated; and if the frequency point to be compensated is greater than or equal to the frequency point threshold, leaving the value of the second audio signal at the frequency point to be compensated unchanged.

9. The audio processing method according to claim 7, characterized in that the method further comprises: using the third audio signal in time-domain form as the target audio signal of the audio recording device.

10. An audio processing apparatus, characterized in that the apparatus is applied to an audio recording device having at least two microphones, and the apparatus comprises: an acquisition module, used to acquire the raw audio signal collected by each of the at least two microphones; a directional processing module, used to perform directional audio processing on the raw audio signals collected by each of the microphones using a spatial filter, based on the direction of the target sound source and the end-fire direction of each microphone, to obtain a first audio signal; and a gain processing module, used to perform gain processing on the first audio signal, based on the original audio signal collected by the target microphone among the at least two microphones and the first audio signal, to obtain a second audio signal, wherein the end-fire direction of the target microphone matches the direction of the target sound source; the apparatus further comprising a first target module, used to use the second audio signal in time-domain form as the target audio signal of the audio recording device.

11. The audio processing apparatus according to claim 10, characterized in that the directional processing module is specifically used for: sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering, in the order from the target microphone to the far-end microphone, to obtain a mixed signal, wherein the end-fire direction of the far-end microphone is opposite to that of the target microphone, and the mixed signal includes a first audio signal and a noise signal; sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering, in the order from the far-end microphone to the target microphone, to obtain the noise signal; and inputting the mixed signal and the noise signal into a noise cancellation filter for noise cancellation processing to obtain the first audio signal.

12. The audio processing apparatus according to claim 11, characterized in that, when sequentially inputting the raw audio signals collected by each microphone into the spatial filter for spatial filtering, the directional processing module is specifically used for: inputting the first-input original audio signal into the spatial filter as the target signal, and, based on the input order of the remaining original audio signals, inputting each of the remaining original audio signals into the spatial filter as an interference signal at a respective level for spatial filtering.

13. The audio processing apparatus according to claim 11, characterized in that the filtering function within the spatial filter is generated based on the distance between adjacent microphones.

14. The audio processing apparatus according to claim 10, characterized in that the gain processing module is specifically configured to: determine a gain adjustment value based on the root mean square value, within a reference frequency band, of the original audio signal collected by the target microphone among the at least two microphones, and the root mean square value of the first audio signal within the reference frequency band; and perform gain processing on the first audio signal based on the gain adjustment value to obtain the second audio signal.
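
The gain step of claim 14 can be sketched as follows. The claim only says that the gain adjustment value is determined from the two in-band root mean square values; taking their ratio is an assumption made here for illustration, and the function names `band_rms` and `apply_band_gain` are hypothetical. The inputs are frequency-domain bins with their frequencies in Hz.

```python
import math


def band_rms(spectrum, freqs_hz, f_low, f_high):
    """RMS magnitude of the bins whose frequency lies in [f_low, f_high]."""
    powers = [abs(x) ** 2 for x, f in zip(spectrum, freqs_hz)
              if f_low <= f <= f_high]
    if not powers:
        raise ValueError("reference band contains no frequency bins")
    return math.sqrt(sum(powers) / len(powers))


def apply_band_gain(target_spec, first_spec, freqs_hz, f_low, f_high):
    """Scale first_spec so its in-band RMS matches the target microphone's.

    Assumption: gain = RMS(target) / RMS(first). The patent claim does not
    state the exact formula.
    """
    reference = band_rms(target_spec, freqs_hz, f_low, f_high)
    current = band_rms(first_spec, freqs_hz, f_low, f_high)
    gain = reference / current if current > 0 else 1.0
    return gain, [gain * x for x in first_spec]


if __name__ == "__main__":
    freqs = [0.0, 500.0, 1000.0, 1500.0]
    target = [1.0, 2.0, 2.0, 1.0]   # original signal of the target microphone
    first = [0.5, 1.0, 1.0, 0.5]    # output of directional processing
    gain, second = apply_band_gain(target, first, freqs, 400.0, 1100.0)
    print(gain, second)
```

The gain is computed from the reference band only but applied to every bin, so the whole first audio signal is rescaled to the level the target microphone observed in that band.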

15. The audio processing apparatus according to claim 14, characterized in that the gain processing module is further configured to: determine the upper and lower frequency limits of the reference frequency band based on a reference center frequency and a bandwidth ratio.
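
One possible reading of claim 15 is sketched below. The claim does not define the bandwidth ratio, so this example assumes it is the ratio of the upper limit to the lower limit, with the reference center frequency as their geometric mean (the convention used for octave bands). The function name `band_limits` is hypothetical.

```python
import math


def band_limits(center_hz, bandwidth_ratio):
    """Return (f_low, f_high) around center_hz.

    Assumption: bandwidth_ratio = f_high / f_low and
    center_hz = sqrt(f_low * f_high). Other definitions are possible.
    """
    if center_hz <= 0 or bandwidth_ratio < 1:
        raise ValueError("need center_hz > 0 and bandwidth_ratio >= 1")
    half_step = math.sqrt(bandwidth_ratio)
    return center_hz / half_step, center_hz * half_step


if __name__ == "__main__":
    # A ratio of 2 gives a one-octave band around 1 kHz.
    print(band_limits(1000.0, 2.0))
```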

16. The audio processing apparatus according to any one of claims 10 to 15, characterized in that it further comprises a compensation module configured to: determine the frequency points to be compensated based on the distance between adjacent microphones and a preset number of frequency points to be compensated; determine a frequency compensation value for each frequency point to be compensated based on the value of the original audio signal collected by the target microphone at that frequency point and the value of the second audio signal at that frequency point; and perform frequency compensation on the value of the second audio signal at each frequency point to be compensated, based on the frequency compensation value of that frequency point, to obtain a third audio signal.

17. The audio processing apparatus according to claim 16, characterized in that, when performing frequency compensation on the value of the second audio signal at the frequency point to be compensated based on the frequency compensation value of that frequency point, the compensation module is specifically configured to: if the frequency point to be compensated is less than a preset frequency point threshold, perform frequency compensation on the value of the second audio signal at that frequency point according to its frequency compensation value; and if the frequency point to be compensated is greater than or equal to the frequency point threshold, leave the value of the second audio signal at that frequency point unchanged.
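
The thresholded compensation of claims 16 and 17 can be sketched as follows. Several details are assumptions: the frequency points are represented as bin indices and passed in directly (the claims do not give the rule that derives them from the microphone spacing), and the compensation value is taken to be the magnitude ratio between the target microphone's bin and the second audio signal's bin. The function name `compensate` is hypothetical.

```python
def compensate(second_spec, target_spec, bins_to_fix, threshold_bin):
    """Return the third audio signal as a list of frequency-domain bins.

    Bins below threshold_bin are rescaled to the target microphone's
    magnitude (assumed compensation rule); bins at or above the threshold
    are left unchanged, as in claim 17.
    """
    third_spec = list(second_spec)
    for k in bins_to_fix:
        if k >= threshold_bin:
            continue  # at or above the threshold: keep the value as is
        current = abs(second_spec[k])
        if current == 0:
            continue  # nothing to rescale in an empty bin
        compensation = abs(target_spec[k]) / current
        third_spec[k] = second_spec[k] * compensation
    return third_spec


if __name__ == "__main__":
    second = [1.0, 0.5, 0.25, 2.0]
    target = [1.0, 1.0, 1.0, 1.0]
    print(compensate(second, target, bins_to_fix=[1, 2, 3], threshold_bin=3))
```

Scaling by a real magnitude ratio keeps the phase of the second audio signal, so only the level at the selected frequency points changes.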

18. The audio processing apparatus according to claim 17, characterized in that the apparatus further comprises a second target module configured to use the third audio signal in time-domain form as the target audio signal of the audio recording device.

19. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the audio processing method according to any one of claims 1 to 9 when executing the computer instructions.

20. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method according to any one of claims 1 to 9.

Citation Information

Patent Citations

Audio processing method and electronic equipment
CN111050269A
Pickup method and device and electronic equipment
CN113496708A