Audio signal processing method, apparatus, device, and storage medium

By performing two processing steps on the song's audio signal—voice cancellation in the left and right channels and background noise removal—the problem of incomplete voice removal was solved, achieving a purer voice removal effect.

CN115550819B · Active · Publication Date: 2026-06-16 · SHENZHEN BLUETRUM TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHENZHEN BLUETRUM TECH CO LTD
Filing Date: 2022-10-21
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing vocal removal technologies suffer from incomplete vocal removal, especially when removing vocals from songs, where some vocals often remain.

Method used

By canceling the human voice signals of the left and right channels of the target audio signal to each other, a human voice cancellation signal is obtained. This signal is then input into the background noise cancellation system for background noise cancellation. Finally, the human voice cancellation signal is canceled with the residual human voice signal to achieve two-stage human voice cancellation.

Benefits of technology

It achieves maximum elimination of human voices, making the eliminated signal cleaner and purer. Through adaptive filtering and frequency compensation, it ensures that the background sound system matches the target audio signal, thereby improving the elimination effect.


Figure CN115550819B_ABST

Abstract

The application provides an audio signal processing method, device and equipment and a storage medium. The method comprises the following steps: obtaining a target audio signal, wherein the target audio signal comprises a left channel audio signal and a right channel audio signal, and the target audio signal is an audio signal containing a human voice and background sound; performing mutual cancellation of left channel human voice signals and right channel human voice signals on the target audio signal to obtain a human voice cancellation signal corresponding to the target audio signal; inputting the human voice cancellation signal into a background sound elimination system to eliminate the background sound, so as to obtain a human voice residual signal corresponding to the target audio signal; performing signal cancellation on the human voice cancellation signal and the human voice residual signal to obtain a human voice elimination signal corresponding to the target audio signal. The technical scheme can maximize the elimination of the human voice, so that the obtained human voice elimination signal is cleaner and purer.

Description

Technical Field

[0001] This application relates to the field of signal processing, and more particularly to audio signal processing methods, apparatus, devices, and storage media.

Background Technology

[0002] Singing apps, music content community applications that have emerged in recent years with the development of the internet, offer background music playback and recording functions, allowing users to sing online via their mobile phones. For singing apps, accompaniment is crucial; in some cases, it's necessary to process songs containing vocals, removing the vocals to obtain the accompaniment.

[0003] Current voice removal technology generally relies on the fact that the human voice is essentially the same in the left and right channels. It removes the voice by inverting the signal of one channel and adding it to the other. However, since the human voices in the two channels are not exactly the same, some voice remains after removal, so the removal is incomplete.

Summary of the Invention

[0004] This application provides an audio signal processing method, apparatus, device, and storage medium to solve the technical problem of incomplete voice removal in existing voice removal solutions.

[0005] In a first aspect, an audio signal processing method is provided, including:

[0006] Acquire a target audio signal, which includes a left channel audio signal and a right channel audio signal, and the target audio signal is an audio signal containing human voices and background sounds;

[0007] The target audio signal is subjected to mutual cancellation of the left and right channel human voice signals to obtain the human voice cancellation signal corresponding to the target audio signal;

[0008] The human voice cancellation signal is input into the background noise cancellation system to eliminate background noise, so as to obtain the human voice residual signal corresponding to the target audio signal. The background noise cancellation system is used to eliminate background noise.

[0009] The human voice cancellation signal and the human voice residual signal are canceled against each other to obtain the human voice elimination signal corresponding to the target audio signal.

[0010] In this technical solution, after acquiring the left and right channel audio signals containing human voice and background noise, the human voice signals in the left and right channels are first canceled to obtain a human voice cancellation signal, achieving initial elimination of the human voice signal. Then, the human voice cancellation signal is input into a background noise elimination system to eliminate background noise, resulting in a residual human voice signal in the human voice cancellation signal. Finally, the human voice cancellation signal and the residual human voice signal are canceled together to obtain a second human voice elimination signal, achieving a second elimination of the residual human voice signal. Through this two-stage human voice elimination process, the elimination of human voice can be maximized, resulting in a cleaner and purer human voice elimination signal.

[0011] In conjunction with the first aspect, in one possible implementation, before inputting the voice cancellation signal into the background noise cancellation system for background noise cancellation to obtain the voice residual signal corresponding to the target audio signal, the method further includes: canceling the voice cancellation signal with the target audio signal to obtain the voice signal corresponding to the target audio signal; and determining the background noise cancellation system based on the target audio signal and the voice signal. Determining the background noise cancellation system using the audio signal and the voice signal within the audio signal allows the background noise cancellation system to perfectly match the target audio signal, thereby enabling the background noise cancellation system to better eliminate background noise in the voice cancellation signal and obtain a more accurate voice residual signal.

[0012] In conjunction with the first aspect, in one possible implementation, determining the background noise cancellation system based on the target audio signal and the human voice signal includes: using the target audio signal as an input signal and the human voice signal as an output signal, performing adaptive filtering and fitting to obtain a target function, whereby the target function characterizes the correlation between the input signal and the output signal; and using the target function as the transfer function of the background noise cancellation system. Determining the transfer function between the audio signal and the human voice within the audio signal using an adaptive filtering and fitting algorithm enables optimization of the background noise cancellation system, ensuring that the system can better eliminate background noise.

[0013] In conjunction with the first aspect, in one possible implementation, after canceling the human voice cancellation signal with the residual human voice signal to obtain the human voice-cancelled signal corresponding to the target audio signal, the method further includes: performing frequency compensation on the target audio signal to obtain a frequency compensation signal corresponding to the target audio signal; and mixing the human voice-cancelled signal and the frequency compensation signal to obtain a background sound signal corresponding to the target audio signal. After obtaining the human voice-cancelled signal, by performing frequency compensation on the audio signal and mixing the frequency compensation signal with the human voice-cancelled signal to obtain the background sound signal in the audio signal, frequency compensation of the human voice-cancelled signal can be achieved, resulting in less loss of the background sound signal and improving the integrity of the background sound signal.

[0014] In conjunction with the first aspect, in one possible implementation, the step of frequency compensation of the target audio signal to obtain a frequency compensation signal corresponding to the target audio signal includes: inputting the target audio signal to a first filter to obtain a first frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the first filter is less than a first preset cutoff frequency; and / or inputting the target audio signal to a second filter to obtain a second frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the second filter is greater than a second preset cutoff frequency; the second preset cutoff frequency is greater than the first preset cutoff frequency. By obtaining the first frequency compensation signal and the second frequency compensation signal through low-frequency and high-frequency compensation respectively, the missing low-frequency and high-frequency components in the voice cancellation signal can be compensated, making the spectrum of the background sound signal sufficiently complete.

[0015] In conjunction with the first aspect, in one possible implementation, before mixing the voice cancellation signal and the frequency compensation signal to obtain the background sound signal corresponding to the target audio signal, the method further includes: adjusting the gain of the voice cancellation signal and the frequency compensation signal to obtain a voice cancellation gain signal and a frequency compensation gain signal corresponding to the target audio signal; the mixing of the voice cancellation signal and the frequency compensation signal to obtain the background sound signal corresponding to the target audio signal includes: mixing the voice cancellation gain signal and the frequency compensation gain signal to obtain the background sound signal corresponding to the target audio signal. By adjusting the gain of the voice cancellation signal and the frequency compensation signal before mixing, the resulting background sound signal can be made more natural and complete.

[0016] In conjunction with the first aspect, in one possible implementation, the gain adjustment of the voice cancellation signal and the frequency compensation signal to obtain the voice cancellation gain signal and frequency compensation gain signal corresponding to the target audio signal includes: adjusting the gain of the voice cancellation signal and the frequency compensation signal based on the signal correlation between the left channel audio signal and the right channel audio signal to obtain the voice cancellation gain signal and frequency compensation gain signal corresponding to the target audio signal. Adjusting the gain of the voice cancellation signal and the frequency compensation signal based on the signal correlation between the left and right channel audio signals ensures that the adjusted voice cancellation gain signal and frequency compensation gain signal conform to the signal characteristics.

[0017] In a second aspect, an audio signal processing apparatus is provided, comprising:

[0018] An acquisition module is used to acquire a target audio signal, the target audio signal including a left channel audio signal and a right channel audio signal, the target audio signal being an audio signal containing human voices and background sounds;

[0019] The first cancellation module is used to cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal.

[0020] The second cancellation module is used to input the human voice cancellation signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal. The background noise cancellation system is used to eliminate background noise.

[0021] The third cancellation module is used to cancel the human voice cancellation signal against the human voice residual signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0022] In a third aspect, an audio device is provided, including a memory and one or more processors, the memory being connected to the one or more processors, the one or more processors being configured to execute one or more computer programs stored in the memory, the one or more processors causing the audio device to implement the audio signal processing method of the first aspect described above when executing the one or more computer programs.

[0023] In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program, the computer program including program instructions, which, when executed by a processor, cause the processor to perform the audio signal processing method of the first aspect.

[0024] This application can achieve the following technical effects: by performing voice cancellation twice, the voice can be eliminated to the greatest extent, resulting in a cleaner and purer voice elimination signal.

Attached Figure Description

[0025] Figure 1 is a flowchart of an audio signal processing method provided in an embodiment of this application;

[0026] Figure 2 is a flowchart of another audio signal processing method provided in an embodiment of this application;

[0027] Figure 3 is a flowchart of another audio signal processing method provided in an embodiment of this application;

[0028] Figure 4 is a schematic diagram of the structure of an audio signal processing apparatus provided in an embodiment of this application;

[0029] Figure 5 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0030] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.

[0031] The technical solution of this application can be applied to audio processing scenarios, specifically to scenarios where audio containing human voices needs to be processed into audio without human voices. For example, it can be applied to processing audio containing original vocals into accompaniment audio in a karaoke scenario, or it can be applied to processing audio containing human voices into pure accompaniment audio for playback in an audio playback scenario, and so on, not limited to the examples here.

[0032] The technical solution of this application can be applied to audio devices with audio processing functions, including but not limited to mobile phones, laptops, microphones, etc.

[0033] The general technical principle of this application is as follows: After acquiring a two-channel audio signal containing human voice and background noise, the audio signal is first subjected to a first human voice cancellation by canceling the human voice signals in the left and right channels, resulting in a human voice cancellation signal, thus eliminating identical human voice signals in the left and right channels. Then, the human voice cancellation signal is input into a background noise cancellation system, which eliminates the background noise in the human voice cancellation signal, thereby separating the residual human voice signal from the human voice cancellation signal. Finally, the human voice cancellation signal and the residual human voice signal are canceled together to eliminate the different human voice signals in the left and right channels. Because the residual human voice signal in the human voice cancellation signal (i.e., the different human voice signals in the left and right channels) is separated by the background noise cancellation system, eliminating identical human voice signals in the left and right channels followed by eliminating the different human voice signals in the left and right channels almost completely removes the human voice signal, resulting in a cleaner and purer human voice cancellation signal.
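The principle above can be illustrated with a short worked derivation for the left channel. This is only a sketch under an assumed additive signal model: the decomposition and the symbols $V_c$ (voice common to both channels), $D_L$, $D_R$ (channel-specific voice components) and $B_L$, $B_R$ (background sound) are introduced here for illustration and do not appear in the application. $X_{L1}$, $V_{L2}$ and $X_{L2}$ denote the left channel voice cancellation signal, voice residual signal, and voice elimination signal, respectively.

```latex
\begin{aligned}
S_L    &= V_c + D_L + B_L, \\
S_R    &= V_c + D_R + B_R, \\
X_{L1} &= S_L - S_R = (D_L - D_R) + (B_L - B_R), \\
V_{L2} &\approx D_L - D_R \quad \text{(output of the background noise cancellation system)}, \\
X_{L2} &= X_{L1} - V_{L2} \approx B_L - B_R .
\end{aligned}
```

Under this model the first subtraction removes $V_c$ exactly, and the second subtraction removes the residual voice only to the extent that the background noise cancellation system isolates it.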

[0034] The technical solution of this application is described in detail below.

[0035] Referring to Figure 1, Figure 1 is a flowchart of an audio signal processing method provided in an embodiment of this application. This method can be applied to the aforementioned audio devices. As shown in Figure 1, the method includes the following steps:

[0036] S101, acquire the target audio signal.

[0037] The target audio signal refers to an audio signal containing both vocals and background noise; it can be understood as an audio signal where vocals and background noise are mixed together. The target audio signal includes the left channel audio signal and the right channel audio signal. The target audio signal can be an audio signal containing the singer's original voice in the audio playback scene.

[0038] S102, cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal.

[0039] Here, canceling the left and right channel voice signals of the target audio signal to obtain the corresponding voice cancellation signal means subtracting the left and right channel audio signals to obtain the left channel voice cancellation signal and the right channel voice cancellation signal.

[0040] Specifically, the left channel audio signal can be used as the main signal, and the right channel audio signal can be subtracted from it to eliminate the vocals that are identical in both channels, resulting in the left channel vocal cancellation signal, i.e., $X_{L1} = S_L - S_R$. Likewise, the right channel audio signal can be used as the main signal, and the left channel audio signal can be subtracted from it, resulting in the right channel vocal cancellation signal, i.e., $X_{R1} = S_R - S_L$. Here, $X_{L1}$ is the left channel voice cancellation signal, $X_{R1}$ is the right channel voice cancellation signal, $S_L$ is the left channel audio signal, and $S_R$ is the right channel audio signal.

[0041] In practice, the left and right channel audio signals can be input into a subtractor to subtract the left and right channel audio signals. By subtracting the left and right channel audio signals, the identical vocal parts in the left and right channel audio signals can be eliminated, achieving initial elimination of the vocal signals.
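The subtraction in step S102 can be sketched in a few lines of NumPy. This is a minimal illustration, not the application's implementation: the function name `cancel_common_vocals` and the synthetic test signals are assumptions made for the example.

```python
import numpy as np

def cancel_common_vocals(s_left, s_right):
    """Return (X_L1, X_R1) = (S_L - S_R, S_R - S_L) for two equal-length channels."""
    s_left = np.asarray(s_left, dtype=np.float64)
    s_right = np.asarray(s_right, dtype=np.float64)
    x_l1 = s_left - s_right   # left channel is the main signal
    x_r1 = s_right - s_left   # right channel is the main signal
    return x_l1, x_r1

# Synthetic example (assumed data): identical vocal in both channels,
# different background sound in each channel.
t = np.arange(8) / 8.0
vocal = np.sin(2 * np.pi * t)
bg_left = 0.5 * np.cos(2 * np.pi * t)
bg_right = -0.25 * np.cos(2 * np.pi * t)
x_l1, x_r1 = cancel_common_vocals(vocal + bg_left, vocal + bg_right)
```

In this synthetic case the vocal is exactly identical in both channels, so it vanishes from `x_l1` and only the difference of the two backgrounds remains.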

[0042] S103, input the human voice cancellation signal corresponding to the target audio signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal.

[0043] Here, the background noise cancellation system refers to a transmission system used to eliminate background noise. By eliminating the background noise signal in an audio signal, the background noise cancellation system can obtain the human voice signal in that audio signal.

[0044] A background noise cancellation system can be any transmission system capable of eliminating background noise while preserving vocals. Specifically, it can eliminate background noise from the left channel vocal cancellation signal, resulting in a left channel residual vocal signal, and similarly, eliminate background noise from the right channel vocal cancellation signal, resulting in a right channel residual vocal signal. By inputting the left and right channel vocal cancellation signals separately into the background noise cancellation system, background noise can be removed from both channels, thereby extracting the distinct portions of the vocal signals from the left and right channels.

[0045] S104, cancel the human voice cancellation signal corresponding to the target audio signal with the human voice residual signal corresponding to the target audio signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0046] Here, this cancellation means subtracting the human voice residual signal corresponding to the target audio signal from the human voice cancellation signal corresponding to the target audio signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0047] Specifically, the left channel voice residual signal can be subtracted from the left channel voice cancellation signal to obtain the left channel voice elimination signal, i.e., $X_{L2} = X_{L1} - V_{L2}$, where $X_{L2}$ is the left channel voice elimination signal and $V_{L2}$ is the left channel voice residual signal. Similarly, the right channel voice residual signal can be subtracted from the right channel voice cancellation signal to obtain the right channel voice elimination signal, i.e., $X_{R2} = X_{R1} - V_{R2}$, where $X_{R2}$ is the right channel voice elimination signal and $V_{R2}$ is the right channel voice residual signal.

[0048] In practice, the left channel voice cancellation signal and the left channel voice residual signal can be input into a subtractor to obtain the left channel voice elimination signal; similarly, the right channel voice cancellation signal and the right channel voice residual signal can be input into a subtractor to obtain the right channel voice elimination signal. By subtracting the left channel voice residual signal from the left channel voice cancellation signal, and the right channel voice residual signal from the right channel voice cancellation signal, the parts of the voice that differ between the left and right channels can be eliminated, achieving secondary cancellation of the voice signal.

[0049] In the technical solution corresponding to Figure 1, after acquiring the left and right channel audio signals containing human voice and background noise, the human voice signals in the left and right channels are first canceled to obtain a human voice cancellation signal, achieving initial elimination of the human voice signal. Then, the human voice cancellation signal is input into a background noise elimination system to eliminate background noise, yielding the residual human voice signal contained in the human voice cancellation signal. Finally, the human voice cancellation signal and the residual human voice signal are canceled against each other to obtain the human voice elimination signal, achieving a second elimination that removes the residual human voice. Through this two-stage process, maximum elimination of the human voice can be achieved, resulting in a cleaner and purer human voice elimination signal.
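Steps S101 to S104 can be summarized as a small pipeline. The sketch below is hedged: the application leaves the background noise cancellation system open ("any transmission system capable of eliminating background noise while preserving vocals"), so it is modeled here as a caller-supplied function. The names `two_stage_vocal_removal`, `Eliminator` and `no_residual` are hypothetical, and `no_residual` is only a placeholder that reports no residual voice.

```python
from typing import Callable, Tuple
import numpy as np

# Hypothetical type: maps a voice cancellation signal to a voice residual signal.
Eliminator = Callable[[np.ndarray], np.ndarray]

def two_stage_vocal_removal(
    s_left, s_right, eliminate_bg_left: Eliminator, eliminate_bg_right: Eliminator
) -> Tuple[np.ndarray, np.ndarray]:
    s_left = np.asarray(s_left, dtype=np.float64)
    s_right = np.asarray(s_right, dtype=np.float64)
    # S102: first stage, cancel the voice common to both channels.
    x_l1 = s_left - s_right
    x_r1 = s_right - s_left
    # S103: extract the voice residual from each cancellation signal.
    v_l2 = eliminate_bg_left(x_l1)
    v_r2 = eliminate_bg_right(x_r1)
    # S104: second stage, subtract the residual from the cancellation signal.
    return x_l1 - v_l2, x_r1 - v_r2

def no_residual(x: np.ndarray) -> np.ndarray:
    """Placeholder system (assumption): pretends no voice residual is present."""
    return np.zeros_like(x)

s_l = np.array([1.0, 2.0, 3.0])
s_r = np.array([0.5, 0.5, 0.5])
x_l2, x_r2 = two_stage_vocal_removal(s_l, s_r, no_residual, no_residual)
```

With the placeholder system the second stage is a no-op, so the output equals the plain channel difference; a real system would be substituted for `no_residual`.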

[0050] Referring to Figure 2, Figure 2 is a flowchart of another audio signal processing method provided in an embodiment of this application. This method can be applied to the aforementioned audio devices. As shown in Figure 2, the method includes the following steps:

[0051] S201, acquire the target audio signal.

[0052] S202, cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal.

[0053] The specific implementation of steps S201 to S202 can be found in the description of steps S101 to S102 above, and will not be repeated here.

[0054] S203, cancel the human voice cancellation signal corresponding to the target audio signal with the target audio signal to obtain the human voice signal corresponding to the target audio signal.

[0055] Here, canceling the target audio signal with the corresponding human voice cancellation signal to obtain the corresponding human voice signal means subtracting the target audio signal from the human voice cancellation signal to obtain the human voice signal.

[0056] Specifically, the left channel audio signal can be subtracted from the left channel voice cancellation signal to obtain the left channel voice signal, i.e., $V_{L1} = X_{L1} - S_L$, where $V_{L1}$ is the left channel voice signal. Similarly, the right channel audio signal can be subtracted from the right channel voice cancellation signal to obtain the right channel voice signal, i.e., $V_{R1} = X_{R1} - S_R$, where $V_{R1}$ is the right channel voice signal.

[0057] In practice, the left channel voice cancellation signal and the left channel audio signal can be input into a subtractor to obtain the left channel voice signal; similarly, the right channel voice cancellation signal and the right channel audio signal can be input into a subtractor to obtain the right channel voice signal. By subtracting the left channel audio signal from the left channel voice cancellation signal, and the right channel audio signal from the right channel voice cancellation signal, the voice signals in the left and right channels can be extracted.

[0058] S204, determine the background noise cancellation system based on the target audio signal and the corresponding human voice signal.

[0059] Here, determining the background noise cancellation system based on the target audio signal and its corresponding human voice signal means fitting the target audio signal and its corresponding human voice signal to obtain a transmission system that represents the transmission relationship between them. The background noise cancellation system is essentially a filter.

[0060] In this process, the target audio signal can be used as the input signal, and the corresponding human voice signal can be used as the output signal. Adaptive filtering and fitting are then performed to obtain the target function that characterizes the correlation between the input and output signals. The target function is then used as the transfer function of the background noise cancellation system.

[0061] The target functions include a left channel target function and a right channel target function. The left channel audio signal can be used as the input signal and the left channel vocal signal as the output signal, and adaptive filtering and fitting can be performed to obtain the left channel target function. Similarly, the right channel audio signal can be used as the input signal and the right channel vocal signal as the output signal, and adaptive filtering and fitting can be performed to obtain the right channel target function.

[0062] In one specific implementation, the target function can be obtained by adaptively filtering and fitting the target audio signal and the corresponding human voice signal based on the normalized least mean square (NLMS) algorithm.

[0063] The vector form of the weight update in the NLMS algorithm is as follows:

[0064] $w(n+1) = w(n) + 2\mu(n)\,x(n)\,e(n)$ (Formula 1)

[0065] $\mu(n) = \dfrac{\alpha}{\delta + P_x(n)}$ (Formula 2)

[0066] $e(n) = y(n) - w(n)\,x(n)$ (Formula 3)

[0067] Here, w(n) is the weight vector at the nth iteration, and w(n+1) is the weight vector updated based on w(n); each weight coefficient in w(0) is 0. x(n) is the input vector at the nth iteration, obtained by sampling the left or right channel audio signal. y(n) is the expected output vector at the nth iteration, obtained by sampling the left or right channel human voice signal. e(n) is the error between the filter output w(n)x(n) and the expected output y(n) at the nth iteration. μ(n) is the step size factor; $P_x(n)$ represents the estimated signal power at time n, $P_x(n) = x^2(n)$; α is the correction step size constant, 0 < α < 2; and δ is a very small constant, δ > 0, which can be set to 0.000001.

[0068] In practical implementation, the left channel audio signal can be sampled as the input signal x1(n), and the left channel human voice signal can be sampled as the output signal y1(n). Then, according to Formulas 1 to 3 above, the left channel weight vector W1(z) that minimizes $e^2(n)$ is solved for through multiple iterations, and W1(z) is used as the target function for the left channel. Similarly, the right channel audio signal can be sampled as the input signal x2(n), and the right channel human voice signal can be sampled as the output signal y2(n). Then, according to Formulas 1 to 3 above, the right channel weight vector W2(z) that minimizes $e^2(n)$ is solved for through multiple iterations, and W2(z) is used as the target function for the right channel.

[0069] By using an adaptive filtering fitting algorithm to determine the transfer function between the audio signal and the human voice in the audio signal, the background noise cancellation system can be optimized, ensuring that the background noise cancellation system can better eliminate background noise.
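The adaptive fitting described above can be sketched as a sample-by-sample NLMS loop. Several points are assumptions of this sketch rather than statements of the application: the step size is taken in the usual NLMS form α / (δ + P_x(n)), P_x(n) is computed as the energy of the current input vector, the filter is a short FIR with a hypothetical length `num_taps`, and the function name `nlms_fit` is invented. Because Formula 1 carries a factor of 2, the effective normalized step is 2α, so the sketch keeps α below 1.

```python
import numpy as np

def nlms_fit(x, y, num_taps=8, alpha=0.25, delta=1e-6):
    """Fit FIR weights w so that the filter output w . x_vec(n) tracks y(n)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w = np.zeros(num_taps)        # every coefficient of w(0) is 0
    x_vec = np.zeros(num_taps)    # most recent input samples, newest first
    for n in range(len(x)):
        x_vec = np.roll(x_vec, 1)
        x_vec[0] = x[n]
        e = y[n] - w @ x_vec              # Formula 3: error against expected output
        p_x = x_vec @ x_vec               # assumed power estimate of the input vector
        mu = alpha / (delta + p_x)        # assumed normalized step size (Formula 2)
        w = w + 2.0 * mu * x_vec * e      # Formula 1: weight update
    return w

# Demonstration on synthetic data (assumed): recover a known 3-tap system.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal(2000)
h_true = np.array([0.5, -0.3, 0.1])                   # hypothetical system to identify
y_demo = np.convolve(x_demo, h_true)[: len(x_demo)]   # its noise-free output
w_est = nlms_fit(x_demo, y_demo, num_taps=3)
```

In the method of Figure 2, `x` would be the sampled left (or right) channel audio signal and `y` the sampled left (or right) channel human voice signal, and the returned weights would play the role of W1(z) (or W2(z)).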

[0070] S205, input the human voice cancellation signal corresponding to the target audio signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal.

[0071] Here, inputting the voice cancellation signal corresponding to the target audio signal into the background noise cancellation system for background noise cancellation to obtain the voice residual signal corresponding to the target audio signal means filtering the voice cancellation signal through a filter in the background noise cancellation system to remove the background noise signal and obtain the voice residual signal. The filter in the background noise cancellation system is characterized by the transfer function of the background noise system.

[0072] Specifically, the left channel voice cancellation signal can be convolved with the transfer function of the background noise cancellation system to obtain the left channel voice residual signal, i.e., $V_{L2} = X_{L1} * W1(z)$; likewise, the right channel voice cancellation signal can be convolved with the transfer function of the background noise cancellation system to obtain the right channel voice residual signal, i.e., $V_{R2} = X_{R1} * W2(z)$, where $*$ denotes convolution.
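As a minimal sketch of this filtering step, the fitted weights can be treated as FIR coefficients and applied with a discrete convolution, truncated to the input length. Treating the weight vector as an FIR impulse response is an assumption of this example, and the function name and the numeric values are illustrative only.

```python
import numpy as np

def apply_background_eliminator(x_cancel, w):
    """Convolve a voice cancellation signal with FIR weights, keeping the input length."""
    x_cancel = np.asarray(x_cancel, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    return np.convolve(x_cancel, w)[: len(x_cancel)]

x_l1 = np.array([1.0, 0.0, 0.0, 2.0])          # assumed left channel voice cancellation signal
w1 = np.array([0.5, 0.25])                     # assumed fitted weights standing in for W1(z)
v_l2 = apply_background_eliminator(x_l1, w1)   # left channel voice residual signal
x_l2 = x_l1 - v_l2                             # second-stage subtraction
```

For these inputs `v_l2` is `[0.5, 0.25, 0.0, 1.0]`, and subtracting it from `x_l1` gives `[0.5, -0.25, 0.0, 1.0]`.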

[0073] S206, cancel the human voice cancellation signal corresponding to the target audio signal with the human voice residual signal corresponding to the target audio signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0074] The specific implementation of step S206 can be found in the description of step S104 above, and will not be repeated here.

[0075] In the technical solution corresponding to Figure 2, after the voice cancellation signal is obtained through preliminary voice removal, the voice cancellation signal corresponding to the target audio signal is first canceled with the target audio signal to obtain the voice signal corresponding to the target audio signal, thus extracting the voice from the target audio signal. Then, a background noise removal system is determined based on the target audio signal and its corresponding voice signal, and the voice cancellation signal is input into this background noise removal system for background noise removal, resulting in a residual voice signal. Finally, the voice cancellation signal is canceled with the residual voice signal to obtain the voice removal signal, thus achieving a further removal of the residual voice. Because the background noise removal system is determined based on the target audio signal and its corresponding voice signal, the system is better matched to the target audio signal and removes background noise more effectively, yielding a more complete residual voice signal, which is conducive to the complete removal of the voice signal.

[0076] Referring to Figure 3, Figure 3 is a flowchart of another audio signal processing method provided in an embodiment of this application. This method can be applied to the aforementioned audio devices. As shown in Figure 1, the method includes the following steps:

[0077] S301, acquire the target audio signal.

[0078] S302, cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal.

[0079] S303, input the human voice cancellation signal corresponding to the target audio signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal.

[0080] S304, cancel the human voice cancellation signal corresponding to the target audio signal with the human voice residual signal corresponding to the target audio signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0081] The specific implementation methods of steps S301 to S304 can be found in the description of steps S101 to S104 above, and will not be repeated here.

[0082] S305, perform frequency compensation on the target audio signal to obtain the frequency compensation signal corresponding to the target audio signal.

[0083] Here, frequency compensation of the target audio signal to obtain the corresponding frequency compensation signal refers to inputting the target audio signal into a filter for filtering to obtain the target audio signal within a preset frequency band, which serves as the frequency compensation signal. The frequency compensation signal includes a left channel compensation signal and a right channel compensation signal.

[0084] In one feasible implementation, the target audio signal can be input to a first filter to obtain a first frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the first filter is less than a first preset cutoff frequency. The first preset cutoff frequency can be the lowest frequency in the frequency range of human voice.

[0085] Specifically, the left channel audio signal can be convolved with the filtering function corresponding to the first filter to obtain the left channel first frequency compensation signal, i.e., $X_{Llp} = S_L H_{lp}(z)$; the right channel audio signal can be convolved with the filtering function corresponding to the first filter to obtain the right channel first frequency compensation signal, i.e., $X_{Rlp} = S_R H_{lp}(z)$, where $X_{Llp}$ is the left channel first frequency compensation signal, $X_{Rlp}$ is the right channel first frequency compensation signal, and $H_{lp}(z)$ is the filtering function corresponding to the first filter. By inputting the target audio signal into the first filter, signals with frequencies higher than the first preset cutoff frequency in the target audio signal can be removed, thereby retaining low-frequency signals with frequencies lower than the first preset cutoff frequency in the target audio signal.
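One possible realization of the first filter is a windowed-sinc low-pass FIR filter, sketched below. The filter design method, the sample rate, the 100 Hz cutoff, and the tap count are all illustrative assumptions; the text only requires that the cutoff lie below the first preset cutoff frequency. `gain_at` is a hypothetical helper that evaluates the magnitude response so the pass band and stop band can be checked.

```python
import math
from typing import List, Sequence


def lowpass_taps(cutoff_hz: float, fs: float, num_taps: int = 201) -> List[float]:
    """Hamming-windowed sinc low-pass FIR, normalised to unity gain at DC."""
    fc = cutoff_hz / fs                     # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def gain_at(taps: Sequence[float], freq_hz: float, fs: float) -> float:
    """Magnitude of the FIR frequency response at freq_hz."""
    re = sum(h * math.cos(2 * math.pi * freq_hz * n / fs) for n, h in enumerate(taps))
    im = sum(h * math.sin(2 * math.pi * freq_hz * n / fs) for n, h in enumerate(taps))
    return math.hypot(re, im)


def apply_fir(s: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Convolve a channel signal with the filter taps (causal, same length as s)."""
    return [sum(h * s[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(s))]


FS = 8000.0                                  # hypothetical sample rate
h_lp = lowpass_taps(cutoff_hz=100.0, fs=FS)  # hypothetical cutoff
```

Calling `apply_fir(s_l, h_lp)` on a left channel signal `s_l` would then play the role of $X_{Llp} = S_L H_{lp}(z)$.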

[0086] In another feasible implementation, the target audio signal can be input to a second filter to obtain a second frequency compensation signal corresponding to the target audio signal. The cutoff frequency of the second filter is greater than a second preset cutoff frequency, which is greater than a first preset cutoff frequency. The second preset cutoff frequency can be the maximum frequency in the frequency range of human voice.

[0087] Specifically, the left channel audio signal can be convolved with the filtering function corresponding to the second filter to obtain the left channel second frequency compensation signal, i.e., $X_{Lhp} = S_L H_{hp}(z)$; the right channel audio signal can be convolved with the filtering function corresponding to the second filter to obtain the right channel second frequency compensation signal, i.e., $X_{Rhp} = S_R H_{hp}(z)$, where $X_{Lhp}$ is the left channel second frequency compensation signal, $X_{Rhp}$ is the right channel second frequency compensation signal, and $H_{hp}(z)$ is the filtering function corresponding to the second filter. By inputting the target audio signal into the second filter, signals with frequencies lower than the second preset cutoff frequency in the target audio signal can be removed, thereby retaining high-frequency signals with frequencies higher than the second preset cutoff frequency in the target audio signal.
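For the second filter, a simple stand-in is a first-order recursive high-pass filter, sketched below. This particular filter structure, the 44.1 kHz sample rate, and the 4 kHz cutoff are assumptions chosen for illustration; a practical design would likely use a steeper filter. The demo feeds in a constant signal (0 Hz) and an alternating signal (the highest representable frequency) to show which one survives.

```python
import math
from typing import List, Sequence


def highpass(x: Sequence[float], cutoff_hz: float, fs: float) -> List[float]:
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / fs)
    y: List[float] = []
    prev_x = 0.0
    prev_y = 0.0
    for xn in x:
        yn = alpha * (prev_y + xn - prev_x)
        y.append(yn)
        prev_x, prev_y = xn, yn
    return y


FS = 44100.0       # hypothetical sample rate
CUTOFF = 4000.0    # hypothetical value above the voice band

dc_out = highpass([1.0] * 500, CUTOFF, FS)                          # constant input
hf_out = highpass([(-1.0) ** n for n in range(500)], CUTOFF, FS)    # alternating input
```

The constant input decays toward zero while the alternating input keeps most of its amplitude, which is the behavior expected of $H_{hp}(z)$.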

[0088] In another feasible implementation, the target audio signal can be input to a first filter to obtain a first frequency compensation signal corresponding to the target audio signal, and the target audio signal can be input to a second filter to obtain a second frequency compensation signal corresponding to the target audio signal.

[0089] S306, mix the human voice elimination signal and the frequency compensation signal corresponding to the target audio signal to obtain the background sound signal corresponding to the target audio signal.

[0090] Here, mixing the voice elimination signal and the frequency compensation signal corresponding to the target audio signal means adding the two signals together to obtain the background sound signal corresponding to the target audio signal.

[0091] Specifically, the left channel voice elimination signal can be added to the left channel frequency compensation signal to obtain the left channel background sound signal, i.e., $\mathrm{out}_L = X_{L2} + X_{Lp}$, where $\mathrm{out}_L$ is the left channel background sound signal and $X_{Lp}$ is the left channel frequency compensation signal. The left channel frequency compensation signal can be the aforementioned left channel first frequency compensation signal, i.e., $\mathrm{out}_L = X_{L2} + X_{Llp}$; or the aforementioned left channel second frequency compensation signal, i.e., $\mathrm{out}_L = X_{L2} + X_{Lhp}$; or both the left channel first and second frequency compensation signals, i.e., $\mathrm{out}_L = X_{L2} + X_{Llp} + X_{Lhp}$.

[0092] Specifically, the right channel voice elimination signal can be added to the right channel frequency compensation signal to obtain the right channel background sound signal, i.e., $\mathrm{out}_R = X_{R2} + X_{Rp}$, where $\mathrm{out}_R$ is the right channel background sound signal and $X_{Rp}$ is the right channel frequency compensation signal. The right channel frequency compensation signal can be the aforementioned right channel first frequency compensation signal, i.e., $\mathrm{out}_R = X_{R2} + X_{Rlp}$; or the aforementioned right channel second frequency compensation signal, i.e., $\mathrm{out}_R = X_{R2} + X_{Rhp}$; or both the right channel first and second frequency compensation signals, i.e., $\mathrm{out}_R = X_{R2} + X_{Rlp} + X_{Rhp}$.

[0093] In the technical solution corresponding to Figure 3, after the voice elimination signal, in which the human voice has been removed to the greatest extent, is obtained, the target audio signal is input into a filter to obtain a frequency compensation signal that lies outside the frequency range of the human voice. The frequency compensation signal is then mixed with the voice elimination signal to obtain the background sound signal. This compensates for the low-frequency and/or high-frequency parts missing from the voice elimination signal, so that the spectrum of the background sound signal is complete and high-quality background sound audio is obtained.

[0094] Optionally, in some possible cases, before the voice elimination signal and the frequency compensation signal corresponding to the target audio signal are mixed, their gains can be adjusted to obtain the voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal.

[0095] Specifically, the gain of the left channel voice elimination signal can be adjusted to obtain the left channel voice elimination gain signal, i.e., $c X_{L2}$; the gain of the right channel voice elimination signal can be adjusted to obtain the right channel voice elimination gain signal, i.e., $c X_{R2}$, where $c$ is the gain coefficient.

[0096] In practice, the left channel voice elimination signal can be input into a multiplier to obtain the left channel voice elimination gain signal; the right channel voice elimination signal can be input into a multiplier to obtain the right channel voice elimination gain signal.

[0097] Specifically, the gain of the left channel frequency compensation signal can be adjusted to obtain the left channel frequency compensation gain signal, i.e., $a X_{Llp}$ and/or $b X_{Lhp}$; the gain of the right channel frequency compensation signal can be adjusted to obtain the right channel frequency compensation gain signal, i.e., $a X_{Rlp}$ and/or $b X_{Rhp}$, where $a$ and $b$ are the low-frequency gain coefficient and the high-frequency gain coefficient, respectively.

[0098] In practice, the left channel frequency compensation signal can be input into the multiplier to obtain the left channel frequency compensation gain signal; the right channel frequency compensation signal can be input into the multiplier to obtain the right channel frequency compensation gain signal. It should be noted that if the frequency compensation signal includes the aforementioned first and second frequency compensation signals, then there are two multipliers, used to adjust the gain of the first and second frequency compensation signals respectively.

[0099] In some possible cases, the gain of the voice elimination signal and the frequency compensation signal corresponding to the target audio signal can be adjusted based on the signal correlation between the left and right channel audio signals to obtain the voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal. Specifically, the gain coefficient of each multiplier can be adjusted based on the signal correlation between the left and right channel audio signals.

[0100] Specifically, when the frequency compensation signal includes the aforementioned first frequency compensation signal and second frequency compensation signal, the values of $a$ and $b$ can be adjusted based on the signal correlation between the left channel audio signal and the right channel audio signal. The formula for calculating the signal correlation is as follows:

[0101]

[0102] By adjusting the gain of the voice elimination signal and the frequency compensation signal based on the signal correlation between the left and right audio channels, the adjusted voice elimination gain signal and frequency compensation gain signal can conform to the characteristics of the signal.
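As a rough illustration of measuring correlation between the two channels, the sketch below computes a normalized zero-lag cross-correlation. This measure is an assumption chosen for the example and should not be read as the correlation formula of this application, which is not reproduced in the text above; the function name is hypothetical. How the resulting value maps to the gain coefficients is not specified in this passage and is left out of the sketch.

```python
import math
from typing import Sequence


def channel_correlation(s_l: Sequence[float], s_r: Sequence[float]) -> float:
    """Normalized zero-lag cross-correlation of two channels, in [-1, 1]."""
    num = sum(l * r for l, r in zip(s_l, s_r))
    den = math.sqrt(sum(l * l for l in s_l) * sum(r * r for r in s_r))
    return num / den if den > 0 else 0.0


same = channel_correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])        # identical channels
opposite = channel_correlation([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]) # inverted channel
unrelated = channel_correlation([1.0, 0.0], [0.0, 1.0])             # orthogonal channels
```

Values near 1 indicate that most content is shared between the channels, while values near 0 indicate largely independent channels.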

[0103] After the voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal are obtained, they can be mixed to obtain the background sound signal corresponding to the target audio signal.

[0104] Specifically, the left channel voice elimination gain signal and the left channel frequency compensation gain signal can be added together to obtain the left channel background sound signal, i.e., $\mathrm{out}_L = c X_{L2} + a X_{Llp}$, or $\mathrm{out}_L = c X_{L2} + b X_{Lhp}$, or $\mathrm{out}_L = c X_{L2} + b X_{Lhp} + a X_{Llp}$. The right channel voice elimination gain signal and the right channel frequency compensation gain signal can be added together to obtain the right channel background sound signal, i.e., $\mathrm{out}_R = c X_{R2} + a X_{Rlp}$, or $\mathrm{out}_R = c X_{R2} + b X_{Rhp}$, or $\mathrm{out}_R = c X_{R2} + b X_{Rhp} + a X_{Rlp}$.
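The gain-weighted mixing formulas above can be sketched as a single function. The function name and the sample values are hypothetical; passing `None` for a band omits that compensation term, which covers the three variants listed for each channel.

```python
from typing import List, Optional, Sequence


def mix_background(x2: Sequence[float],
                   x_lp: Optional[Sequence[float]] = None,
                   x_hp: Optional[Sequence[float]] = None,
                   c: float = 1.0, a: float = 1.0, b: float = 1.0) -> List[float]:
    """Return c*X2 + a*Xlp + b*Xhp sample by sample; a band set to None is skipped."""
    out = [c * v for v in x2]
    if x_lp is not None:
        out = [o + a * lo for o, lo in zip(out, x_lp)]
    if x_hp is not None:
        out = [o + b * hi for o, hi in zip(out, x_hp)]
    return out


x_l2 = [1.0, 2.0]     # hypothetical second-stage output for the left channel
x_llp = [0.5, 0.5]    # hypothetical low-band compensation signal
x_lhp = [0.1, 0.2]    # hypothetical high-band compensation signal

out_l = mix_background(x_l2, x_llp, x_lhp, c=0.5, a=2.0, b=10.0)
out_l_low_only = mix_background(x_l2, x_llp, None, c=0.5, a=2.0)
```

With `c = a = b = 1.0` the function reduces to the plain addition used when no gain adjustment is applied.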

[0105] By adjusting the gain of the voice elimination signal and the frequency compensation signal before mixing them, the resulting background sound signal can be made more natural and complete.

[0106] The method of this application has been described above; the apparatus of this application will be described below.

[0107] Referring to Figure 4, Figure 4 is a schematic diagram of the structure of an audio signal processing device provided in an embodiment of this application. This audio signal processing device can be one of the aforementioned audio devices. As shown in Figure 4, the audio signal processing device 40 includes:

[0108] The acquisition module 401 is used to acquire a target audio signal, the target audio signal including a left channel audio signal and a right channel audio signal, the target audio signal being an audio signal containing human voices and background sounds;

[0109] The first cancellation module 402 is used to cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal.

[0110] The second cancellation module 403 is used to input the human voice cancellation signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal. The background noise cancellation system is used to eliminate background noise.

[0111] The third cancellation module 404 is used to cancel the human voice cancellation signal with the human voice residual signal to obtain the human voice elimination signal corresponding to the target audio signal.

[0112] In one possible design, the audio signal processing device 40 further includes a transmission system determination module 405, which is used to cancel the human voice cancellation signal with the target audio signal to obtain the human voice signal corresponding to the target audio signal; and to determine the background noise cancellation system based on the target audio signal and the human voice signal.

[0113] In one possible design, the determination module 405 is specifically used to: take the target audio signal as the input signal and the human voice signal as the output signal, and perform adaptive filtering fitting to obtain a target function, the target function being used to characterize the correlation between the input signal and the output signal; and use the target function as the transfer function of the background noise cancellation system.

[0114] In one possible design, the audio signal processing device 40 further includes a frequency compensation module 406 for performing frequency compensation on the target audio signal to obtain a frequency compensation signal corresponding to the target audio signal; and a signal mixing module 407 for mixing the human voice elimination signal and the frequency compensation signal to obtain a background sound signal corresponding to the target audio signal.

[0115] In one possible design, the frequency compensation module 406 is specifically used to: input the target audio signal to a first filter to obtain a first frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the first filter is less than a first preset cutoff frequency; and / or input the target audio signal to a second filter to obtain a second frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the second filter is greater than a second preset cutoff frequency; the second preset cutoff frequency is greater than the first preset cutoff frequency.

[0116] In one possible design, the audio signal processing device 40 further includes a gain adjustment module 408, used to adjust the gain of the human voice elimination signal and the frequency compensation signal to obtain the human voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal; the signal mixing module 407 is specifically used to mix the human voice elimination gain signal and the frequency compensation gain signal to obtain the background sound signal corresponding to the target audio signal.

[0117] In one possible design, the gain adjustment module 408 is specifically used to adjust the gain of the voice elimination signal and the frequency compensation signal based on the signal correlation between the left channel audio signal and the right channel audio signal, so as to obtain the voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal.

[0118] It should be noted that, for any content not mentioned in the embodiment corresponding to Figure 4, please refer to the description of the foregoing method embodiments, which will not be repeated here.

[0119] The aforementioned device, after acquiring the left and right channel audio signals containing human voice and background noise, first cancels out the human voice signals in the left and right channels to obtain a human voice cancellation signal, achieving initial elimination of the human voice signal. Then, the human voice cancellation signal is input into a background noise elimination system to eliminate background noise, resulting in a residual human voice signal in the human voice cancellation signal. Finally, the human voice cancellation signal and the residual human voice signal are canceled out to obtain a second human voice elimination signal, achieving a second elimination of the residual human voice signal. Through this two-stage human voice elimination process, the elimination of human voice can be maximized, resulting in a cleaner and purer human voice elimination signal.
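The three cancellation stages described above can be arranged as one small class, sketched below. Several points are assumptions made for this example: the first stage is modeled as a plain channel difference (left minus right, and right minus left), "cancel" is read as sample-wise subtraction, the transfer functions are short FIR filters that have already been fitted, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Signal = List[float]


def fir(x: Sequence[float], w: Sequence[float]) -> Signal:
    """Causal FIR filtering of x with coefficients w, same length as x."""
    return [sum(wk * x[n - k] for k, wk in enumerate(w) if n - k >= 0)
            for n in range(len(x))]


@dataclass
class AudioSignalProcessor:
    w1: Sequence[float]   # left channel transfer function, assumed already fitted
    w2: Sequence[float]   # right channel transfer function, assumed already fitted

    def first_cancel(self, s_l: Signal, s_r: Signal) -> Tuple[Signal, Signal]:
        """Stage 1 (assumed form): a voice shared by both channels cancels out."""
        x_l1 = [l - r for l, r in zip(s_l, s_r)]
        x_r1 = [r - l for l, r in zip(s_l, s_r)]
        return x_l1, x_r1

    def second_cancel(self, x_l1: Signal, x_r1: Signal) -> Tuple[Signal, Signal]:
        """Stage 2: filter the stage-1 signals to estimate the residual voice."""
        return fir(x_l1, self.w1), fir(x_r1, self.w2)

    def third_cancel(self, x1: Signal, v2: Signal) -> Signal:
        """Stage 3: subtract the residual voice estimate."""
        return [a - b for a, b in zip(x1, v2)]

    def process(self, s_l: Signal, s_r: Signal) -> Tuple[Signal, Signal]:
        x_l1, x_r1 = self.first_cancel(s_l, s_r)
        v_l2, v_r2 = self.second_cancel(x_l1, x_r1)
        return self.third_cancel(x_l1, v_l2), self.third_cancel(x_r1, v_r2)


voice = [1.0, -1.0, 0.5]              # identical in both channels
bg_l, bg_r = [0.2, 0.0, 0.1], [0.0, 0.3, 0.1]
s_l = [v + b for v, b in zip(voice, bg_l)]
s_r = [v + b for v, b in zip(voice, bg_r)]

proc = AudioSignalProcessor(w1=[0.0], w2=[0.0])   # zero filters: stage 2 finds no residual
out_l, out_r = proc.process(s_l, s_r)
```

In this toy input the voice is perfectly shared between channels, so the first stage alone removes it and the later stages leave the result unchanged; with real recordings the fitted filters would carry the second-stage correction.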

[0120] Referring to Figure 5, Figure 5 is a schematic diagram of the structure of an audio device provided in an embodiment of this application. The audio device 50 includes a processor 501 and a memory 502. The memory 502 is connected to the processor 501, for example, via a bus.

[0121] Processor 501 is configured to support the audio device 50 in performing the corresponding functions in the methods described in the above method embodiments. Processor 501 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The aforementioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

[0122] Memory 502 is used to store program code, etc. Memory 502 may include volatile memory (VM), such as random access memory (RAM); memory 502 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid-state drive (SSD); memory 502 may also include combinations of the above types of memory.

[0123] Optionally, the audio device 50 may also include audio peripherals such as a microphone and a speaker.

[0124] Processor 501 can call the program code to perform the following operations:

[0125] Acquire a target audio signal, which includes a left channel audio signal and a right channel audio signal, and the target audio signal is an audio signal containing human voices and background sounds;

[0126] The target audio signal is subjected to mutual cancellation of the left and right channel human voice signals to obtain the human voice cancellation signal corresponding to the target audio signal;

[0127] The human voice cancellation signal is input into the background noise cancellation system to eliminate background noise, so as to obtain the human voice residual signal corresponding to the target audio signal. The background noise cancellation system is used to eliminate background noise.

[0128] The human voice cancellation signal and the human voice residual signal are canceled with each other to obtain the human voice elimination signal corresponding to the target audio signal.

[0129] This application also provides a computer-readable storage medium storing a computer program, the computer program including program instructions, which, when executed by a computer, cause the computer to perform the method described in the foregoing embodiments.

[0130] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.

[0131] The above-disclosed embodiments are merely preferred embodiments of this application and should not be construed as limiting the scope of this application. Therefore, any equivalent variations made in accordance with the claims of this application shall still fall within the scope of this application.

Claims

1. An audio signal processing method, characterized in that it comprises: acquiring a target audio signal, the target audio signal including a left channel audio signal and a right channel audio signal and being an audio signal containing human voices and background sounds; performing mutual cancellation of the left and right channel human voice signals on the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal; canceling the human voice cancellation signal with the target audio signal to obtain the human voice signal corresponding to the target audio signal; taking the target audio signal as the input signal and the human voice signal as the output signal, and performing adaptive filtering fitting to obtain an objective function, the objective function being used to characterize the correlation between the input signal and the output signal; using the objective function as the transfer function of a background noise cancellation system to determine the background noise cancellation system; inputting the human voice cancellation signal into the background noise cancellation system to eliminate background noise, so as to obtain the human voice residual signal corresponding to the target audio signal, the background noise cancellation system being used to eliminate background noise; and canceling the human voice cancellation signal with the human voice residual signal to obtain the human voice elimination signal corresponding to the target audio signal.

2. The method according to claim 1, characterized in that, after canceling the human voice cancellation signal with the human voice residual signal to obtain the human voice elimination signal corresponding to the target audio signal, the method further includes: performing frequency compensation on the target audio signal to obtain the frequency compensation signal corresponding to the target audio signal; and mixing the human voice elimination signal and the frequency compensation signal to obtain the background sound signal corresponding to the target audio signal.

3. The method according to claim 2, characterized in that the step of performing frequency compensation on the target audio signal to obtain a frequency compensation signal corresponding to the target audio signal includes: inputting the target audio signal to a first filter to obtain a first frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the first filter is less than a first preset cutoff frequency; and/or inputting the target audio signal to a second filter to obtain a second frequency compensation signal corresponding to the target audio signal, wherein the cutoff frequency of the second filter is greater than a second preset cutoff frequency, and the second preset cutoff frequency is greater than the first preset cutoff frequency.

4. The method according to claim 2, characterized in that, before mixing the human voice elimination signal and the frequency compensation signal to obtain the background sound signal corresponding to the target audio signal, the method further includes: performing gain adjustment on the human voice elimination signal and the frequency compensation signal to obtain the human voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal; and the step of mixing the human voice elimination signal and the frequency compensation signal to obtain the background sound signal corresponding to the target audio signal includes: mixing the human voice elimination gain signal and the frequency compensation gain signal to obtain the background sound signal corresponding to the target audio signal.

5. The method according to claim 4, characterized in that the step of performing gain adjustment on the human voice elimination signal and the frequency compensation signal to obtain the human voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal includes: adjusting the gain of the human voice elimination signal and the frequency compensation signal based on the signal correlation between the left channel audio signal and the right channel audio signal, to obtain the human voice elimination gain signal and the frequency compensation gain signal corresponding to the target audio signal.

6. An audio signal processing device, characterized in that it comprises: an acquisition module, used to acquire a target audio signal, the target audio signal including a left channel audio signal and a right channel audio signal and being an audio signal containing human voices and background sounds; a first cancellation module, used to cancel out the left and right channel human voice signals of the target audio signal to obtain the human voice cancellation signal corresponding to the target audio signal; a determination module, used to cancel the human voice cancellation signal with the target audio signal to obtain the human voice signal corresponding to the target audio signal, to take the target audio signal as the input signal and the human voice signal as the output signal and perform adaptive filtering fitting to obtain an objective function, the objective function being used to characterize the correlation between the input signal and the output signal, and to use the objective function as the transfer function of a background noise cancellation system to determine the background noise cancellation system; a second cancellation module, used to input the human voice cancellation signal into the background noise cancellation system to perform background noise cancellation, so as to obtain the human voice residual signal corresponding to the target audio signal, the background noise cancellation system being used to eliminate background noise; and a third cancellation module, used to cancel the human voice cancellation signal with the human voice residual signal to obtain the human voice elimination signal corresponding to the target audio signal.

7. An audio device, characterized in that the device includes a memory and a processor, the memory being connected to the processor, the processor being configured to execute one or more computer programs stored in the memory, and the processor, when executing the one or more computer programs, causing the audio device to perform the method as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method as described in any one of claims 1-5.