Method and system for stereo source cancellation
By combining source separation with smoothed processing application coefficients, the target audio source is suppressed. This addresses the acoustic artifacts of existing techniques, achieves high-quality audio signal processing, and preserves the spatial and timbre characteristics of the original audio signal.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DOLBY LABORATORIES LICENSING CORP
- Filing Date
- 2024-09-09
- Publication Date
- 2026-06-19
Description
Cross-reference to related applications
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/583,198, filed September 15, 2023, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] The present invention relates to a method and system for processing audio signals, and particularly for performing source suppression processing.
Background Technology
[0003] In some audio processing scenarios, it is often desirable to suppress certain types of audio sources in a mixed audio signal that includes multiple audio sources. For example, the audio signal captured at a concert may include a mixture of music and noise from the crowd, where it may be desirable to process the audio signal to suppress the crowd noise while retaining the music.
[0004] As another example, dubbing audio signals associated with a film or video involves replacing the original speech in the first language with an alternative speech in the second language. In most modern professionally generated content (PGC), dialogue typically resides in a separate audio channel, while other audio content (music, sound effects, ambient sounds) is included in other audio channels. Dubbing this type of content where speech resides in a separate channel may simply involve replacing the speech channel with a different speech channel that is recorded and / or processed to synchronize with the original video content and other audio content (music, sound effects, ambient sounds).
[0005] However, for many types of audio content, speech is mixed with other audio content in one or more channels, where each channel contains a mixture of speech and other audio content. This is often the case with user-generated content (UGC), where, for example, a user records a scene with their smartphone, capturing a stereo audio signal that includes a mixture of speech, music, sound effects, and ambient sounds. To dub an audio signal that includes original speech mixed with other audio content, the audio signal is typically processed first to remove or at least suppress the original speech before adding alternative speech, for example, by mixing an audio signal containing only the alternative speech with a processed audio signal in which the original speech has been suppressed.
[0006] In a simplified implementation, the volume of the original audio signal, which contains a mixture of the original speech and other audio content, is reduced (for example, completely muted) whenever the original speech is active. This ensures that the original speech is suppressed, while the fact that the other audio content is also attenuated or muted is not very noticeable to the listener once the alternative speech is mixed in at the corresponding points in time.
Summary of the Invention
[0007] The problem with the aforementioned solutions for dubbing audio signals is that the audio processing introduces acoustic artifacts that are noticeable and disruptive to some listeners. Therefore, the object of this disclosure is to provide an audio processing method and system that offers enhanced performance for suppressing a specified target audio source while still preserving the spatial and/or timbre characteristics of the original audio signal, thereby producing a source-suppressed output audio signal with reduced or completely eliminated acoustic artifacts.
[0008] According to a first inventive concept of the present invention, a method for processing audio is provided, the method comprising: obtaining an input audio signal including two channels, the input audio signal including a plurality of consecutive segments; for each segment of the input audio signal, determining a segment-specific source activity index indicating whether a predetermined target audio source is active in the segment; and, for a specific segment whose segment-specific source activity index indicates that the target audio source is inactive, determining a processing application coefficient by smoothing the segment-specific source activity index over a group of segments associated with the specific segment, wherein the group of segments includes both segments whose segment-specific source activity index indicates that the target audio source is inactive and segments whose segment-specific source activity index indicates that the target audio source is active. The method further includes, for the specific segment: extracting a side audio signal from the input audio signal; extracting a difference audio signal based on the difference between the side audio signal and the input audio signal; extracting a source-suppressed difference audio signal from the difference audio signal using a source separator configured to determine the degree to which the specific segment contains audio content associated with the target audio source or residual audio content; and forming the source-suppressed difference audio signal based on the degree to which the specific segment contains audio content associated with the target audio source. The method further includes: weighting and summing the difference audio signal and the source-suppressed difference audio signal based on the processing application coefficient to form a modified difference audio signal; and combining the modified difference audio signal with the side audio signal to form an output audio signal in which the target audio source is suppressed.
[0009] The segment group comprises at least two segments. The association of the segment group with a specific (inactive) segment means either that the specific segment is included in the segment group along with at least one active segment, or that the segment-specific source activity indices of the segment group are used to extrapolate or interpolate the processing application coefficient for the specific segment, for example a segment located in time between, before, or after the segments in the group. Because the group includes both active and inactive segments, the smoothing is applied where the target source transitions from active to inactive or vice versa. In some embodiments, the smoothed processing application coefficient is determined only for inactive segments, while active segments use a processing application coefficient equal to the segment-specific source activity index.
[0010] In some implementations, the segment-specific source activity index is binary, taking a high value (e.g., 1) when the target source is active and a low value (e.g., 0) when the target source is inactive. Therefore, the segment-specific source activity index exhibits abrupt changes when the target audio source becomes active or inactive. The processing application coefficient can be a smoothed version of the segment-specific source activity index that satisfies, for example, a minimum rise or fall rate. For example, when the segment-specific source activity index is binary (0 or 1), the processing application coefficient can take values of 0, 1, or fractions between 0 and 1. As another example, the processing application coefficient for a specific segment is found by running a smoothing window across the segment-specific source activity index, where the smoothing window includes one or more look-ahead segments and / or one or more look-back segments for the specific segment.
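The smoothing-window variant described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the window lengths, and the choice of a plain moving average are assumptions made for the example. Active segments keep a coefficient equal to their binary index, and only inactive segments receive a smoothed value.

```python
def smooth_activity(activity, look_back=2, look_ahead=2):
    """Moving average of a binary activity index around each segment.

    look_back / look_ahead are hypothetical window sizes (in segments).
    """
    n = len(activity)
    smoothed = []
    for i in range(n):
        lo = max(0, i - look_back)
        hi = min(n, i + look_ahead + 1)
        window = activity[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def processing_application_coefficients(activity, look_back=2, look_ahead=2):
    """Smoothed coefficient for inactive segments, raw index for active ones."""
    smoothed = smooth_activity(activity, look_back, look_ahead)
    return [float(a) if a == 1 else s for a, s in zip(activity, smoothed)]


# Hypothetical binary index: the target source is active in segments 4 to 6.
activity = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
pac = processing_application_coefficients(activity)
```

With this window, inactive segments adjacent to the active run receive fractional coefficients, so the processing fades in before the source becomes active and fades out after it stops.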
[0011] According to the first aspect of the invention, the method generates an output audio signal comprising the audio content of the side audio signal and of a weighted combination of the source-suppressed difference audio signal and the difference audio signal, wherein the combination is based on the processing application coefficient. Most types of target audio sources (e.g., speech, music, music from a specific instrument) are present in both channels of the input audio signal and are therefore largely absent from the side audio signal, which is based on the difference between the channels. The side audio signal is thus expected to contain background audio content. The terms "background audio content" and "residual audio content" refer to any audio content not associated with the target audio source.
[0012] Furthermore, the difference audio signal, based on the difference between the side audio signal and the input audio signal, includes all audio content present in the input audio signal but not in the side audio signal. The difference audio signal may contain the target audio source, but also some background audio content. Therefore, a source separator is used to process the difference audio signal to extract any residual audio content. The residual content from the side audio signal and the difference audio signal is then combined to form an output audio signal in which the target audio source is removed, but most or all of the residual audio content remains intact.
[0013] For most types of audio content, the target audio source is not always active (i.e., throughout all segments of the audio signal). This means that at certain moments (when the target audio source is inactive), there is no target audio source to eliminate. To preserve the content and spatial characteristics from the input audio signal as much as possible, the input audio signal is used as the output audio signal for the inactive segments. For example, this could involve using a difference audio signal as a modified difference audio signal, thereby perfectly reconstructing the input audio signal to form the output audio signal.
[0014] On the other hand, during segments in which the target audio source is active, the target audio source should be completely suppressed so that only background audio content is included in the output audio signal. This is achieved, for example, by using the source-suppressed difference audio signal as the modified difference audio signal, whereby the output audio signal is a combination of the side audio signal and the source-suppressed difference audio signal.
[0015] However, preferably, the transition between using the difference audio signal and using the source-suppressed difference audio signal as the modified difference audio signal should not be too rapid, as this can cause undesirable acoustic artifacts. Therefore, when transitioning from a segment where the target audio source is active to a segment where the target audio source is inactive (or vice versa), the modified difference audio signal for a specific segment is formed as a weighted sum of the difference audio signal and the source-suppressed difference audio signal. That is, for a specific segment, the modified difference audio signal is a combination of the difference audio signal and the source-suppressed difference audio signal, and the contribution from each of these audio signals is non-zero. The specific segments forming the weighted combination are segments where the target audio source is inactive, which ensures that the target audio source is not reintroduced through the weighted sum. For example, the modified difference audio signal based on the weighted combination is formed only for segments associated with target source activity indicators indicating that the target audio source is inactive.
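The weighted sum can be illustrated with a short sketch. It assumes that the processing application coefficient lies in [0, 1] and weights the source-suppressed signal, with the complementary weight on the unprocessed difference signal; the function names and the example coefficients are hypothetical.

```python
def modified_difference(diff, diff_suppressed, pac):
    """Cross-fade between the difference signal and its source-suppressed version.

    pac = 0 returns the unprocessed difference signal (inactive source),
    pac = 1 returns the source-suppressed signal (active source).
    """
    if not 0.0 <= pac <= 1.0:
        raise ValueError("pac must lie in [0, 1]")
    return [pac * ds + (1.0 - pac) * d for d, ds in zip(diff, diff_suppressed)]


def output_channel(side, diff, diff_suppressed, pac):
    """Add the side signal back to the modified difference signal."""
    modified = modified_difference(diff, diff_suppressed, pac)
    return [m + s for m, s in zip(modified, side)]


# Hypothetical coefficients of one channel of one segment.
side = [0.1j, 0.2]
diff = [1.0, 2.0]
diff_suppressed = [0.5, 0.0]
transition = output_channel(side, diff, diff_suppressed, pac=0.5)
```

At pac = 0 the output equals the difference signal plus the side signal, which is the input itself; at intermediate values both signals contribute.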
[0016] In some implementations, smoothing the segment-specific source activity index to form the processing application coefficient (PAC) may include at least one of the following: enforcing a maximum rise rate and/or fall rate of the PAC; and identifying a sufficiently short sequence of segments in which the target audio source is inactive and setting the PAC for these segments to the high value associated with an active source.
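Both smoothing operations can be sketched in a few lines. The function names, the gap length, and the rate limits below are assumptions chosen for illustration. The rate limiter shown uses a forward pass for the fall rate and a backward pass for the rise rate, so that the coefficient never drops below the activity index; this is one possible design, not necessarily the one used in the disclosure.

```python
def fill_short_gaps(activity, max_gap=2):
    """Mark inactive runs of at most max_gap segments, enclosed by active
    segments on both sides, as active."""
    filled = list(activity)
    n = len(filled)
    i = 0
    while i < n:
        if activity[i] == 0:
            j = i
            while j < n and activity[j] == 0:
                j += 1
            # The inactive run covers segments i .. j-1.
            if i > 0 and j < n and (j - i) <= max_gap:
                for k in range(i, j):
                    filled[k] = 1
            i = j
        else:
            i += 1
    return filled


def limit_rate(values, max_rise=0.5, max_fall=0.25):
    """Bound the per-segment rise and fall without going below the input."""
    out = [float(v) for v in values]
    # Forward pass: the value may fall by at most max_fall per segment.
    for i in range(1, len(out)):
        out[i] = max(out[i], out[i - 1] - max_fall)
    # Backward pass: ramp up ahead of an onset, at most max_rise per segment.
    for i in range(len(out) - 2, -1, -1):
        out[i] = max(out[i], out[i + 1] - max_rise)
    return out


activity = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
pac = limit_rate(fill_short_gaps(activity))
```

Because the backward pass starts the ramp before the onset, active segments still receive a coefficient of 1 and only inactive segments take fractional values.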
[0017] According to a second inventive concept, a method for processing audio is provided, the method comprising: obtaining an input audio signal including two channels, the input audio signal comprising a plurality of consecutive segments; and for each segment of the input audio signal, determining a segment-specific source activity index indicating whether a predetermined target audio source is active in that segment. The method further comprises, for each segment: extracting a side audio signal from the input audio signal; extracting a difference audio signal based on a difference between the side audio signal and the input audio signal; and extracting a source-suppressed difference audio signal from the difference audio signal by processing the difference audio signal with a source separator. The method further includes: determining an inactivity energy level based on the spectral energy of the difference audio signal of at least one segment whose segment-specific source activity index indicates that the target audio source is inactive; determining an activity energy level based on the spectral energy of the source-suppressed difference audio signal of at least one segment whose segment-specific source activity index indicates that the target audio source is active; for the at least one segment in which the target audio source is active, applying a gain to at least one of the side audio signal and the source-suppressed difference audio signal, the gain being based on the difference between the inactivity energy level and the activity energy level; and combining the source-suppressed difference audio signal with the side audio signal, with the gain applied to at least one of them, to form an output audio signal in which the target audio source is suppressed.
[0018] Since a portion of the audio content is removed when the target audio source is eliminated, a perceptible difference in spectral energy (i.e., perceived volume) may exist between the active and inactive segments of the input audio signal in the output audio signal. This spectral energy difference can be reduced or eliminated by determining and applying a compensation gain using the method of the second inventive concept.
[0019] According to some embodiments of the second inventive concept, the active energy and / or inactive energy are respectively determined as the mean and / or average value over a plurality of active and inactive segments.
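One way to picture the compensation gain is the sketch below. It assumes that the energy levels are plain means of per-segment energies and that the gain is the level difference in decibels converted to a linear amplitude factor; neither formula is given in the text, and all names are hypothetical.

```python
import math


def segment_energy(coeffs):
    """Spectral energy of one segment: sum of squared coefficient magnitudes."""
    return sum(abs(c) ** 2 for c in coeffs)


def compensation_gain(diff_segments, suppressed_segments, activity, eps=1e-12):
    """Linear gain that lifts active segments toward the inactive energy level.

    Assumption: inactive level = mean energy of the difference signal over
    inactive segments; active level = mean energy of the source-suppressed
    difference signal over active segments.
    """
    inactive = [segment_energy(d) for d, a in zip(diff_segments, activity) if a == 0]
    active = [segment_energy(s) for s, a in zip(suppressed_segments, activity) if a == 1]
    if not inactive or not active:
        return 1.0  # nothing to compare against, leave the level unchanged
    e_inactive = sum(inactive) / len(inactive)
    e_active = sum(active) / len(active)
    gain_db = 10.0 * math.log10((e_inactive + eps) / (e_active + eps))
    return 10.0 ** (gain_db / 20.0)


# Hypothetical single-band segments; the source is active in the middle one.
activity = [0, 1, 0]
diff_segments = [[2.0], [3.0], [2.0]]
suppressed_segments = [[2.0], [1.0], [2.0]]
gain = compensation_gain(diff_segments, suppressed_segments, activity)
```

The resulting factor would then be applied to the source-suppressed difference signal (or the side signal) of the active segments before the two are combined.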
[0020] According to a third inventive concept, a method for processing audio is provided, the method comprising: obtaining an input audio signal including two channels, the input audio signal comprising a plurality of consecutive segments; and, for each segment, extracting a side audio signal from the input audio signal and extracting a difference audio signal based on the difference between the side audio signal and the input audio signal. The method further comprises: processing the difference audio signal with a source separator to extract a source-suppressed difference audio signal from the difference audio signal; combining the source-suppressed difference audio signal with the side audio signal to form an output audio signal in which the target audio source is suppressed; and determining image parameters and/or phase parameters for multiple frequency bands of each segment of the output audio signal. The method further includes: scaling the image parameter of each frequency band and segment using an image scaling function, the image scaling function being based on the deviation between the image parameter and a centered image, to obtain an adjusted image parameter that deviates more from the centered image than the image parameter does; and/or determining a reference phase for each frequency band and segment, and scaling the phase parameter of each frequency band and segment using a phase scaling function, the phase scaling function being based on the deviation between the phase parameter and the reference phase, to obtain an adjusted phase parameter that deviates more from the reference phase than the phase parameter does. The method further includes: forming a modified output audio signal using the adjusted image parameter and/or the adjusted phase parameter.
[0021] For many types of audio content, the target audio source to be eliminated dominates the content, and its image is centered and / or its phase difference is approximately zero. After eliminating the target audio source, the remaining background audio content will inherit the same spatial characteristics (image and / or phase) from the target audio source. However, if the spatial characteristics are more widely distributed, the background audio content will be perceived as more realistic and believable.
[0022] This is achieved by determining the spatial characteristics of the output audio signal and scaling (and optionally smoothing) these spatial characteristics to cover a wider range of image and / or phase parameters, and using the adjusted spatial parameters to form a modified output audio signal.
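As an illustration of the image scaling, the sketch below widens an image parameter by multiplying its deviation from the centered image. It borrows the convention from the detailed description that the image parameter θ lies in [0, π/2] with π/4 as the centered image; the linear scaling function and the factor of 1.5 are assumptions, since the text does not specify the scaling function.

```python
import math

CENTER = math.pi / 4  # centered image


def widen_image(theta, factor=1.5):
    """Scale the deviation of an image parameter from the centered image.

    factor > 1 pushes the parameter further from center; the result is
    clipped to the valid range [0, pi/2].
    """
    deviation = theta - CENTER
    return min(max(CENTER + factor * deviation, 0.0), math.pi / 2)


# Hypothetical per-band image parameters of one output segment.
thetas = [CENTER, CENTER + 0.1, CENTER - 0.2, 0.0]
adjusted = [widen_image(t) for t in thetas]
```

A centered band stays centered, while bands that already deviate from center are moved further out, which spreads the remaining background content across the stereo image.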
[0023] The source separator in the second and/or third inventive concept may be the same source separator used in the first inventive concept. However, it should be understood that the source separator in the second and/or third inventive concept may also be different. For example, it is envisioned that the source separator attenuates (e.g., completely mutes) only the difference audio signal of segments whose segment-specific source activity indicators indicate that the target audio source is inactive.
[0024] The second and third inventive concepts can be used in conjunction with the method of the first aspect of the invention, or as an alternative thereto. Each inventive concept relates to an audio processing technique that generates a source-cancelled audio signal while simultaneously enhancing the perceived quality of the resulting output audio signal with the target source suppressed. The inventive concepts listed above can be used individually or in combination as needed.
[0025] According to another aspect of the invention, an apparatus is provided, comprising one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform one or more of the methods described according to the inventive concept.
[0026] According to another aspect of the invention, a non-transitory computer-readable storage medium is provided, having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform the method according to any one of the inventive concepts.
Brief Description of the Drawings
[0027] Various aspects of the invention will be described in more detail with reference to the accompanying drawings, which illustrate exemplary embodiments.
[0028] Figure 1 is a block diagram illustrating an audio processing system according to some embodiments.
[0029] Figure 2 is a flowchart describing a method for processing audio signals according to some embodiments.
[0030] Figure 3A is a graph showing a segment-specific source activity index as a function of time according to some embodiments.
[0031] Figure 3B is a graph of the processing application coefficient PAC obtained by smoothing the segment-specific source activity index of Figure 3A according to some embodiments.
[0032] Figure 4A is a block diagram illustrating how a source-suppressed difference audio signal is combined with a difference audio signal to form a modified difference audio signal according to some embodiments.
[0033] Figure 4B is a block diagram illustrating an alternative arrangement for determining the modified difference audio signal according to some embodiments.
[0034] Figure 5 is a block diagram illustrating details of a source suppressor according to some embodiments.
[0035] Figure 6A shows an example of a soft mask determined by a source separator according to some embodiments.
[0036] Figure 6B shows an example of a more inclusive modified soft mask generated from the soft mask of Figure 6A according to some embodiments.
[0037] Figure 7A is a graph illustrating the spectral energy difference between the input audio signal and the side audio signal for a segment in which the target audio source is inactive, according to some embodiments.
[0038] Figure 7B is a graph illustrating the spectral energy difference for the source-suppressed difference audio signal of a segment in which the target audio source is active, according to some embodiments.
[0039] Figure 8 is a block diagram illustrating a brick-wall filtering system that can be used with an audio processing system according to some embodiments.
[0040] Figure 9 is a block diagram illustrating how the output audio signal is mixed with an alternative audio signal according to some embodiments.
Detailed Description
[0041] The systems and methods disclosed in this application can be implemented as software, firmware, hardware, or a combination thereof. In hardware implementations, the division of tasks does not necessarily correspond to the division of physical units; on the contrary, a physical component can have multiple functions, and a task can be performed collaboratively by several physical components.
[0042] Computer hardware can be, for example, a server computer, client computer, personal computer (PC), tablet PC, set-top box (STB), personal digital assistant (PDA), cellular phone, smartphone, AR/VR wearable, automotive infotainment system, web device, network router, switch or bridge, or any machine capable of (sequentially or otherwise) executing instructions specifying actions to be taken by said computer hardware. Furthermore, this disclosure relates to any collection of computer hardware that individually or in combination executes instructions to perform any one or more of the concepts discussed herein.
[0043] Some or all of the components may be implemented by one or more processors that accept computer-readable (also known as machine-readable) code containing a set of instructions that, when executed by the one or more processors, perform at least one of the methods described herein. This includes any processor capable of executing (sequentially or otherwise) a set of instructions specifying an action to be taken. Thus, an example is a typical processing system (e.g., computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further include a memory subsystem, including hard disk drives, SSDs, RAM, and / or ROM. A bus subsystem may be included for communication between components. Software may reside within the memory subsystem and / or the processors during execution by the computer system.
[0044] The one or more processors can operate as independent devices or can be connected to other processors, such as a network connecting to other processors. Such a network can be built on a variety of different network protocols and can be the Internet, a wide area network (WAN), a local area network (LAN), or any combination thereof.
[0045] Software can be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is well known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, various forms of physical (non-transitory) storage media, such as EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile optical disc (DVD) or other optical disc storage devices, magnetic tape cassettes, magnetic tape, disk storage devices or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, as is well known to those skilled in the art, (transitory) communication media typically embody computer-readable instructions, data structures, program modules or other data in the form of modulated data signals such as carrier waves or other transmission mechanisms, and includes any information transmission medium.
[0046] Figure 1 is a schematic block diagram of an audio processing system 1 according to some embodiments.
[0047] Audio processing system 1 is configured to perform audio processing that removes a target audio source (e.g., speech, music, or a specific instrument such as a guitar) from a two-channel input audio signal IN to generate an output audio signal OUT. This output audio signal removes the target audio source but otherwise preserves the spatial and timbre characteristics of the input audio signal. Preserving the spatial and timbre characteristics of the input audio signal IN after the target source has been removed is a challenging task, and this disclosure presents various audio processing techniques that can be used individually or in arbitrary combinations to produce an output audio signal OUT that retains the spatial and timbre characteristics of the input audio signal IN even when the target audio source has been suppressed or completely eliminated.
[0048] The input audio signal IN comprises two channels and can be, for example, a stereo audio signal or a binaural audio signal. However, the audio processing system 1 can also process audio content in other formats. For example, a mono audio signal or an audio signal with more than two channels can be converted to a stereo format using any suitable known upmixing or downmixing process, thereby processing the stereo input audio signal with the audio processing system and optionally converting it back to the original format after processing. As a further alternative, an input audio signal IN with three or more channels can be processed by the audio processing system 1, wherein the audio processing system operates on pairs of these three or more channels.
[0049] What all of the audio processing techniques have in common is that the audio processing system 1 operates on a difference audio signal D and a side audio signal S extracted from the input audio signal IN. The operation of the audio processing system 1 will now be described with further reference to the flowchart in Figure 2.
[0050] The input audio signal IN comprises two channels (e.g., a stereo input audio signal comprising a left channel L and a right channel R) and is obtained by the audio processing system 1 at step S1. The input audio signal is provided to a difference-side extractor 11, which extracts the difference audio signal D and the side audio signal S from the input audio signal IN at step S3. For some audio processing techniques, at step S2, a source activity index α is extracted from the input audio signal. The source activity index α is described in further detail below and can be used to control the audio processing technique.
[0051] Typically, a mid audio signal M and a side audio signal S are extracted from the left channel L and right channel R of the input audio signal IN by determining the sum and the difference, respectively, of the left audio channel L and the right audio channel R. In short, the mid audio signal M and the side audio signal S can be determined by the following equations:
$$M = \beta\,(L + R) \qquad \text{(Equation 1)}$$
$$S = \beta\,(L - R) \qquad \text{(Equation 2)}$$
where the coefficient $\beta$ equals $1/2$ or $1/\sqrt{2}$. When $\beta$ equals $1/\sqrt{2}$, energy is preserved, meaning that the total spectral energy of the mid audio signal M and the side audio signal S is equal to the total spectral energy of the left audio signal L and the right audio signal R.
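The sum-and-difference extraction and its energy-preservation property can be checked numerically with a short sketch. The function names and the example coefficients are hypothetical; β = 1/√2 is used here because it is the scaling for which a sum/difference pair preserves total energy.

```python
import math


def mid_side(left, right, beta=1 / math.sqrt(2)):
    """Return (mid, side) as the scaled sum and difference of two channels."""
    mid = [beta * (l + r) for l, r in zip(left, right)]
    side = [beta * (l - r) for l, r in zip(left, right)]
    return mid, side


def energy(signal):
    """Total energy: the sum of squared magnitudes of all coefficients."""
    return sum(abs(x) ** 2 for x in signal)


# Hypothetical STFT coefficients of one segment (three frequency bands).
left = [1.0 + 0.5j, 0.2 - 0.1j, -0.3 + 0.0j]
right = [0.9 + 0.4j, -0.2 + 0.3j, -0.3 + 0.0j]
mid, side = mid_side(left, right)
```

Content that is identical in both channels, such as the third band above, cancels in the side signal, which is why a centered target source ends up in the mid signal.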
[0052] Using Equations 1 and 2 above, a centered mid audio signal M and a side audio signal S are extracted. For many types of audio content, a centered mid-side pair is appropriate because many types of target audio sources (such as speech) are typically centered, meaning that the target audio source will be included in the mid audio signal while any background audio content will remain in the side audio signal. More generally, however, the mid audio signal M and the side audio signal S are extracted using an arbitrary image parameter θ according to the following equations: (Equation 3) (Equation 4) where 0 ≤ θ ≤ π/2, and θ = π/4 represents a centered image. The image parameter θ can be fixed, for example specified by the user, or it can vary dynamically with time and frequency, for example determined as a detected image parameter, as explained in International Application No. PCT/US23/63717, filed March 3, 2023, entitled “TARGET MID-SIDE SIGNALS FOR AUDIO APPLICATIONS”, which is incorporated herein by reference in its entirety. In short, this reference discloses determining a detected image parameter for each time-frequency tile and averaging the detected image parameters over multiple tiles (in time and/or frequency) to form a target image parameter. The target image parameter is then used to form a target mid audio signal. Such a dynamic mid audio signal can, for example, capture multiple sources with different images, provided they are separated in frequency and/or time.
[0053] In the following text, it will be assumed that the audio signal is represented in the short-time Fourier transform (STFT) domain, where any audio signal is represented by multiple time-frequency tiles, each tile including multiple STFT coefficients representing a specific frequency band of a particular segment.
[0054] When extracting the difference audio signal D and the side audio signal S, the difference-side extractor 11 uses Equation 4 above to extract the side audio signal S. For each time-frequency tile, the difference-side extractor 11 also determines at least the phase parameter φ and the amplitude U of the side audio signal S, and optionally also the image parameter θ. In particular, the difference-side extractor 11 determines the phase and amplitude of the side audio signal S according to the following equations: (Equation 5) (Equation 6) The image parameter θ of the side audio signal S, however, is the image parameter of the input audio signal IN. The difference-side extractor 11 therefore determines the image parameter θ of each time-frequency tile of the side audio signal S from the corresponding time-frequency tile of the input audio signal IN according to the following equation: (Equation 7) The image, phase, and amplitude parameters are collectively referred to as spatial parameters or spatial-level filtering (SLF) parameters.
[0055] At this point the side audio signal S is a mono audio signal; however, it can be converted into a stereo side audio signal using the SLF parameters according to the following equations: (Equation 8) (Equation 9) where S1 and S2 represent the two channels of the stereo representation of the side audio signal S. In step S3, the difference-side extractor 11 subtracts the stereo side audio signal from the input audio signal IN (which also includes two channels) according to the following equation to form the difference audio signal D: (Equation 10) where D1 and D2 represent the two channels of the difference audio signal D. The difference audio signal D is called a "difference" audio signal because it includes any audio content of the input audio signal IN that is not included in the side audio signal S. Combining the difference audio signal D with the side audio signal S in equal proportions therefore perfectly reconstructs the input audio signal IN.
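The reconstruction property can be demonstrated for the simplest case. The sketch below assumes a centered image with β = 1/2, for which the stereo representation of the side signal is taken to be (S, −S); this is a simplification chosen for illustration and is not the general SLF-based conversion of Equations 8 and 9. All names are hypothetical.

```python
def split_centered(left, right):
    """Split a stereo segment into a stereo side part and a difference part.

    Assumption: centered image, beta = 1/2, stereo side = (S, -S).
    """
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    side_stereo = (side, [-s for s in side])                     # (S1, S2)
    diff = ([l - s1 for l, s1 in zip(left, side_stereo[0])],     # D1 = L - S1
            [r - s2 for r, s2 in zip(right, side_stereo[1])])    # D2 = R - S2
    return side_stereo, diff


def recombine(side_stereo, diff):
    """Add each difference channel to the matching stereo side channel."""
    return tuple([d + s for d, s in zip(dc, sc)]
                 for dc, sc in zip(diff, side_stereo))


# Hypothetical STFT coefficients of one segment.
left = [1.0 + 0.5j, 0.2 - 0.1j]
right = [0.9 + 0.4j, -0.2 + 0.3j]
side_stereo, diff = split_centered(left, right)
rebuilt_left, rebuilt_right = recombine(side_stereo, diff)
```

In this centered case both difference channels equal the mid signal (L + R)/2, so all centered content lands in the difference signal and is available to the source separator.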
[0056] In many cases, the target audio source is image-centered. For most input audio signals IN that include speech, the speech (when active) is the dominant audio source, which determines the image of the input audio signal IN. Since speech is usually image-centered, the input audio signal IN will be image-centered, meaning that the side audio signal S essentially does not contain speech, while all speech is present in the difference audio signal D.
[0057] In some cases, the image of the dominant audio source is off-center or changes over time. For these cases, the identified sound image parameter θ can be used to form an adaptive side audio signal S, as described in Equation 4. The adaptive side audio signal is converted to a two-channel representation using Equations 8 and 9, and the difference audio signal D is then extracted using Equation 10. With the adaptive side audio signal S, the dominant audio source can be assumed to be included in the difference audio signal D and excluded from the side audio signal S even if its image is off-center and / or changes over time.
[0058] However, the assumption that the difference audio signal D contains only the target audio source and no other audio content is accurate only to a certain extent, and in some cases the difference audio signal D includes audio content unrelated to the target audio source in addition to the target audio source. For example, if the target audio source is speech, noise or background audio may also be present in the difference audio signal D alongside the speech. Therefore, the difference audio signal D is passed to the target source separator 12, which in step S4 extracts a source-suppressed difference audio signal D_suppressed from the difference audio signal D. The source-suppressed difference audio signal D_suppressed is a processed version of the difference audio signal D in which any content associated with the target audio source is removed and / or attenuated.
[0059] In some implementations, the target source separator 12 includes a denoising module (e.g., implemented as a trained neural network) configured to determine a soft mask for suppressing any audio content that does not belong to the target audio source. The determined soft mask is applied to the difference audio signal D to generate a source audio signal D_source containing the audio content associated with the target audio source. The source audio signal D_source can then be subtracted from the difference audio signal D to form the source-suppressed difference audio signal D_suppressed according to the following equation: (Equation 11) It should further be understood that the soft mask used to suppress audio content that does not belong to the target audio source can be inverted by a trivial operation to form an inverse soft mask. This inverse soft mask can be applied to the difference audio signal D to form the source-suppressed difference audio signal D_suppressed directly. Alternatively, the target source separator 12 can be configured to determine the inverse soft mask directly.
[0060] Further details of the target source separator 12 are described below with reference to Figure 5.
[0061] The source-suppressed difference audio signal D_suppressed and the side audio signal S are passed to subsequent modules in the audio processing system 1, where one or more of the following audio processing techniques are performed to form an output audio signal OUT, or a modified output audio signal OUT_M whose spatial and timbre characteristics are restored.
[0062] Audio processing technique 1 involves a smooth transition from perfect reconstruction of the input audio signal IN when the target audio source is inactive to reconstruction using the source-suppressed difference audio signal D_suppressed when the target audio source is active, or vice versa. This technique involves the target source identifier 18, the target source separator 12, and the combiner 14. Audio processing technique 1 includes step S5: forming a modified difference audio signal D_modified as a weighted combination of the difference audio signal D and the source-suppressed difference audio signal D_suppressed, such that, for time segments in which the target source is active, the weighted combination equals D_suppressed; for time segments in which the target source is inactive, the weighted combination equals the difference audio signal D; and for intermediate time segments in which the target source transitions from active to inactive or vice versa, the weighted combination is an interpolation between D_suppressed and D. Then, in step S6, the modified difference audio signal D_modified is combined with the side audio signal S to form the output audio signal OUT.
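The relationship between the soft mask, the source signal, and the source-suppressed signal can be illustrated with a minimal sketch. It assumes that the difference audio signal is available as a grid of time-frequency tiles (one complex value per tile) and that the soft mask holds one gain in the range 0 to 1 per tile; the function names and the data layout are hypothetical and are not taken from the specification.

```python
def split_by_soft_mask(diff_tiles, mask):
    """Split tiles of D into (D_source, D_suppressed).

    diff_tiles[t][f] is one tile of the difference signal D (assumed layout),
    mask[t][f] is the soft-mask gain in [0, 1] for that tile.
    """
    # D_source: tiles of D weighted by the soft mask.
    d_source = [[g * x for g, x in zip(g_row, x_row)]
                for g_row, x_row in zip(mask, diff_tiles)]
    # D_suppressed = D - D_source (the subtraction described for Equation 11).
    d_suppressed = [[x - s for x, s in zip(x_row, s_row)]
                    for x_row, s_row in zip(diff_tiles, d_source)]
    return d_source, d_suppressed


def apply_inverse_mask(diff_tiles, mask):
    """Form D_suppressed directly by applying the inverse mask (1 - g)."""
    return [[(1.0 - g) * x for g, x in zip(g_row, x_row)]
            for g_row, x_row in zip(mask, diff_tiles)]
```

Under these assumptions both routes give the same result, since D - g * D = (1 - g) * D for every tile.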
[0063] Audio processing technique 2 involves biasing the extraction of the source-suppressed audio signal in the target source separator 12 at step S4 to exclude the target audio source with an additional safety margin by simultaneously excluding some audio content that does not belong to the target audio source. At the cost of suppressing some audio content unrelated to the target audio source, audio processing system 1 reduces or completely eliminates the risk of the target audio source leaking into the output audio signal OUT. This is particularly beneficial when the target audio source is speech, as any "residual" speech found in the output audio signal OUT can be easily identified by users who would perceive it as interference.
[0064] Audio processing technique 3 involves determining a compensation gain C_gain and applying it to the side audio signal S and / or the source-suppressed difference audio signal D_suppressed in segments where the target audio source is active, so that they are balanced against the side audio signal S and the difference audio signal D in segments where the target audio source is inactive. This technique uses a level analyzer 17 and a gain compensator 15. Suppressing the target audio source during active segments, and reintroducing additional background audio content in the form of the source-suppressed difference audio signal D_suppressed, may change the perceived spectral energy of the background audio content between active and inactive segments. To minimize this perceptual effect, the compensation gain C_gain is determined and applied to the side audio signal S or the source-suppressed difference audio signal D_suppressed to compensate for the difference.
[0065] Audio processing technique 4 involves reconstructing the spatial characteristics of the output audio signal OUT using a spatial parameter adjuster 13. Since the extracted difference audio signal D and the side audio signal S are biased towards the centered sound image and / or a phase difference of approximately zero, it has been found that stretching the sound image and / or phase difference to form a wider distribution significantly enhances the spatial fidelity of the audio signal.
[0066] Audio processing technique 5 involves processing a predetermined frequency band of an original audio signal IN_original using any of the audio processing techniques described above, while leaving the remaining frequency band unprocessed. Using, for example, a brick-wall filter, the original input audio signal IN_original can be divided into two parts in frequency. The first part is used as the input audio signal IN and is processed by the audio processing system 1, while the second part is not processed and is mixed with the output audio signal OUT to form the final output audio signal OUT_final. The frequency band of the first part is selected to cover the expected frequency range of the target audio source, while the frequency band of the second part is expected to contain virtually no audio content associated with the target audio source.
[0067] The audio processing techniques 1 through 5 outlined above will now be described in detail below.
[0068] Audio processing technique 1 involves using the combiner 14 to form a weighted combination of the source-suppressed difference audio signal D_suppressed and the difference audio signal D based on a segment-specific target source activity index α obtained from the target source identifier 18.
[0069] The segment-specific target source activity index α is extracted by the target source identifier 18 for each segment of the input audio signal IN or the difference audio signal D, and indicates whether the target audio source is active in the current segment. The target source identifier 18 can be any type of identifier configured to identify any type of audio source. For example, the target source identifier 18 can be configured to determine whether speech or music content is included in the input audio signal IN or the difference audio signal D. The target source identifier 18 can be or include a voice activity detection (VAD) module.
[0070] The segment-specific target source activity index α can be binary (e.g., indicating whether the target audio source is active or inactive). The segment-specific target source activity index α can vary between subsequent time segments as the target audio source becomes active or inactive. For example, speech may be active in some segments and inactive in others.
[0071] A segment-specific target source activity index α can be derived from a continuous value (e.g., the probability that the target audio source is active) by comparing it to a threshold α_T. For example, a continuous value indicating the probability of source activity is determined for each segment, where the probability is a value between 0 and 1, i.e., in the range of 0% to 100%. The segment-specific target source activity index is the result of comparing the continuous value to the threshold α_T. If the continuous value exceeds the threshold α_T, the segment-specific source activity index takes a high value α2 (e.g., one), while if the continuous value is below the threshold, it takes a low value α1 (e.g., zero). For example, the threshold α_T is 0.5, i.e., 50%.
[0072] With further reference to Figure 3A, an example of a segment-specific source activity index α is shown, which takes a high value α2 (e.g., one) when the target audio source is active and a low value α1 (e.g., zero) when the target audio source is inactive. Of course, the audio processing system 1 can be designed to employ a segment-specific source activity index α defined in a different way.
[0073] The segment-specific target source activity index α is provided to the combiner 14, which combines the source-suppressed difference audio signal D_suppressed with the difference audio signal D to form the modified difference audio signal D_modified according to the following equation: (Equation 12) This means that when α is one and the target audio source is active, D_modified equals D_suppressed, and when α is zero and the target audio source is inactive, D_modified equals D. This allows perfect reconstruction of the difference audio signal D when the target audio source is inactive, and includes in the modified difference audio signal D_modified the audio content of the difference audio signal D that does not belong to the target audio source when the target audio source is active.
[0074] However, in some implementations, the segment-specific source activity index α may change rapidly, and the resulting rapid changes in the weights of the difference audio signal D and the source-suppressed difference audio signal D_suppressed can cause noticeable acoustic artifacts. To address this, the source activity index α of each segment can be smoothed over time across a segment group comprising at least two segments to form a processing application coefficient (PAC). The PAC controls the combination of the difference audio signal D and the source-suppressed difference audio signal D_suppressed. The PAC is extracted from the segment-specific source activity index α by the PAC extractor 19.
[0075] The PAC extractor 19 smooths the segment-specific source activity index α over time to form the PAC, which may include smoothing α over at least two segments that include at least one active segment and at least one inactive segment. For example, smoothing may be performed over one or more look-back segments and / or one or more look-ahead segments relative to the segment for which the PAC is to be determined. For segment sequences in which the target audio source remains active or inactive (e.g., between t1 and t2 in Figure 3A), the PAC equals the segment-specific source activity index α. In the region surrounding each transition from active to inactive or vice versa, however, the PAC deviates from α. Importantly, for all active segments the PAC equals α, and any smoothing used to form a smooth transition occurs in segments where the target audio source is inactive. This prevents audio content associated with the target audio source from leaking into the modified difference audio signal D_modified.
[0076] Smoothing of the segment-specific source activity index α by the PAC extractor 19 may additionally or alternatively include: processing α to exclude short periods in which α is below a predetermined threshold; and / or processing α to reflect predetermined minimum start and / or release times. This smoothing can produce the processing application coefficient PAC shown in Figure 3B.
[0077] Processing the segment-specific source activity index α to exclude short segment sequences below a predetermined threshold may include: determining the number of consecutive segments (or the duration of the corresponding portion of the input audio signal IN or the difference audio signal D) associated with a segment-specific source activity index α below the predetermined threshold α_T (e.g., α_T = α1 / 2). If the number of segments and / or the duration is less than a predetermined maximum number of segments N_max or a maximum duration, the PAC for these segments is set to the high value α2. In practice, this means that the combiner 14, which is controlled by the PAC, ignores short periods of source inactivity and thus avoids performing perfect reconstruction during very short periods of inactivity.
[0078] An example of a PAC obtained through this type of smoothing is depicted in Figure 3A and Figure 3B between times t2 and t3, where the segment-specific source activity index α is low for a period from t2 to t3 that is shorter than the predetermined maximum number of segments N_max or the maximum duration. In Figure 3A, the segment-specific source activity index α briefly drops to α1 (e.g., because the target audio source is inactive for a short period of time), but to avoid changing the weighting of the difference audio signal D and the source-suppressed difference audio signal D_suppressed too frequently, this drop is ignored: as can be seen in Figure 3B, the PAC remains at α2 from t2 to t3. The PAC is thus biased toward maintaining a state that indicates activity of the target audio source.
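The thresholding step can be sketched as follows. This is a minimal illustration, assuming one activity probability per segment; the function name and the treatment of a value exactly equal to the threshold (mapped to the low value here) are assumptions, since the description only covers values above and below the threshold.

```python
def activity_index(probabilities, threshold=0.5, low=0.0, high=1.0):
    """Map per-segment activity probabilities to a binary activity index.

    threshold corresponds to alpha_T, low to alpha1 and high to alpha2.
    A probability exactly equal to the threshold is treated as inactive
    (assumption, not stated in the description).
    """
    return [high if p > threshold else low for p in probabilities]


# Example: three segments with rising speech probability.
alpha = activity_index([0.1, 0.5, 0.9])
```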
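Equation 12 is not reproduced in this text. A linear combination that matches the two stated end points (α equal to one gives D_suppressed, α equal to zero gives D) is sketched below. The linear form, the function name, and the per-segment list layout are assumptions made for illustration.

```python
def combine_segments(d, d_suppressed, alpha):
    """Weighted combination of D and D_suppressed, one weight per segment.

    d[k] and d_suppressed[k] hold the samples (or tiles) of segment k,
    alpha[k] is the activity index of segment k, assumed to lie in [0, 1].
    Assumed form: D_modified = alpha * D_suppressed + (1 - alpha) * D.
    """
    out = []
    for seg_d, seg_s, a in zip(d, d_suppressed, alpha):
        out.append([a * s + (1.0 - a) * x for x, s in zip(seg_d, seg_s)])
    return out
```

With this form an active segment (α = 1) passes only the source-suppressed signal, an inactive segment (α = 0) passes the difference signal unchanged, and intermediate weights interpolate between the two.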
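The gap-filling behavior between t2 and t3 can be sketched as below. The sketch assumes a per-segment activity index, counts runs of consecutive segments below the threshold, and raises runs shorter than N_max to the high value. Restricting the fill to gaps that have active segments on both sides is an added assumption, so that leading and trailing silence are left untouched; all names are hypothetical.

```python
def fill_short_gaps(alpha, n_max, threshold=0.5, high=1.0):
    """Set short inactive runs (fewer than n_max segments) to the high value."""
    out = list(alpha)
    n = len(out)
    i = 0
    while i < n:
        if out[i] < threshold:
            # Find the end of this run of inactive segments.
            j = i
            while j < n and out[j] < threshold:
                j += 1
            # Fill only short gaps enclosed by active segments (assumption).
            if (j - i) < n_max and i > 0 and j < n:
                out[i:j] = [high] * (j - i)
            i = j
        else:
            i += 1
    return out
```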
[0079] Processing the segment-specific source activity index α to reflect predetermined minimum start and release times includes obtaining the minimum time to transition from the low value α1 to the high value α2 (minimum start time) and the minimum time to transition from the high value α2 to the low value α1 (minimum release time). The minimum start and minimum release times can be different or the same. For example, both the minimum start time and the minimum release time can be at least one second, at least two seconds, or at least three seconds.
[0080] For example, a comparison of Figure 3A and Figure 3B shows that, whereas the segment-specific source activity index α increases almost instantaneously from the low value α1 to the high value α2 at time t1, the PAC transitions more gradually from α1 to α2 between times t0 and t1. This is because the minimum start time limits the maximum rate of change of the PAC. Similarly, because of the minimum release time, whereas α decreases almost instantaneously from α2 to α1 at time t4, the PAC takes much longer to transition from α2 to α1 between times t4 and t5.
[0081] A further requirement can be imposed on PAC: for at least all segments where the segment-specific source activity index is equal to a high value α2, PAC should be equal to the high state indicating an active target source. In other words, this means that any gradation or smoothing occurs on segments where the target audio source is inactive.
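One way to obtain ramps that lie entirely in inactive segments is sketched below. It assumes a binary activity index (0 or 1), start and release times expressed as a number of segments, and linear ramps. A backward pass places the start ramp before each onset (which requires look-ahead) and a forward pass places the release ramp after each offset. The ramp shape and all names are assumptions and are not taken from the specification.

```python
def smooth_activity(alpha, start_segments, release_segments):
    """Form a PAC from a binary activity index with linear ramps.

    The PAC equals 1 wherever alpha equals 1; ramps occupy inactive
    segments only, so no active segment receives a weight below 1.
    """
    n = len(alpha)
    # Forward pass: after activity ends, decay by 1/release per segment.
    release = [0.0] * n
    level = 0.0
    for i in range(n):
        level = max(float(alpha[i]), level - 1.0 / release_segments)
        release[i] = level
    # Backward pass: before activity begins, rise by 1/start per segment.
    start = [0.0] * n
    level = 0.0
    for i in range(n - 1, -1, -1):
        level = max(float(alpha[i]), level - 1.0 / start_segments)
        start[i] = level
    return [max(r, s) for r, s in zip(release, start)]
```

The per-segment change of the result never exceeds 1 / start_segments when rising or 1 / release_segments when falling, which is how the minimum start and release times bound the rate of change.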
[0082] The PAC can be used in place of the segment-specific source activity index α in Equation 12 above to achieve a smoothly varying weighting of the source-suppressed difference audio signal D_suppressed and the difference audio signal D. This is shown in Figure 4A, which illustrates additional details of the combiner 14 that forms the modified difference audio signal according to the following equation: (Equation 13) In some implementations, the combiner 14 is replaced with the combiner 14' shown in Figure 4B, which uses the PAC to weight the source-suppressed difference audio signal D_suppressed and the source difference audio signal D_source to form the modified difference audio signal D_modified. The combiner 14' is used in combination with the target source separator 12' (see Figure 5), which outputs both the source difference audio signal D_source and the source-suppressed difference audio signal D_suppressed.
[0083] More specifically, the combiner 14' generates the modified difference audio signal according to the following equation: (Equation 14) In other words, the source-suppressed difference audio signal D_suppressed is always included in the modified difference audio signal D_modified, and when the target audio source is inactive, the source difference audio signal D_source is smoothly weighted in so that D_modified approaches D. This includes forming the complete difference audio signal D (i.e., perfect reconstruction) as the combination of D_suppressed and D_source.
[0084] Although the weighting details in the combiner 14 of Figure 4A and the combiner 14' of Figure 4B differ, it should be understood that the overall processing performed by the block diagrams of Figure 4A and Figure 4B is equivalent. Since all audio content of the difference audio signal D is divided between the source difference audio signal D_source and the source-suppressed difference audio signal D_suppressed, it holds that D_source = D - D_suppressed. Replacing D_source in Equation 14 with D - D_suppressed yields the weighting of Equation 13, which means that although the equations look different, the weighting is the same.
[0085] Additionally, the target source separator 12 in Figure 4A can be the same as the target source separator 12', the only difference being that D_source is not provided as an output of the target source separator 12.
[0086] The weights applied to the difference audio signal D and the source-suppressed difference audio signal D_suppressed are controlled by the PAC, which in turn can be a smoothed version of the segment-specific target source activity index α. This smoothing is biased to preserve the indication of target audio source activity. For example, the modified difference audio signal D_modified is formed as a weighted combination only for segments whose target source activity index indicates that the target audio source is inactive. For any segment whose target source activity index indicates that the target audio source is active, the modified difference audio signal is the source-suppressed difference audio signal and is unaffected by the difference audio signal. In this way, the effect of the target source separator 12, 12' fades out smoothly during inactive segments as more and more of the difference audio signal is introduced.
[0087] Turning to Figure 5, further details of the target source separator 12' will now be described. The difference audio signal D is provided to a background suppressor 121, which determines a soft mask having gain values for suppressing background audio content in the difference audio signal D. This soft mask has a gain value for each of multiple frequency bands in each segment. For example, if the target audio source is speech, the background audio content may include one or more of music, stationary noise and / or non-stationary noise, rain sounds, or wind sounds. On the other hand, if the target audio source is music, the background audio content may include speech but not music.
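Equations 13 and 14 are not reproduced in this text. Assuming linear forms consistent with the description (a PAC in the range 0 to 1, with D_suppressed always present in Equation 14 and D_source weighted by 1 - PAC), the substitution described above can be written out as follows. The exact form of the original equations is an assumption.

```latex
\begin{align}
D_{\text{modified}}
  &= D_{\text{suppressed}} + (1 - \mathrm{PAC})\, D_{\text{source}}
     && \text{(assumed form of Equation 14)} \\
  &= D_{\text{suppressed}} + (1 - \mathrm{PAC})\,\bigl(D - D_{\text{suppressed}}\bigr)
     && \text{using } D_{\text{source}} = D - D_{\text{suppressed}} \\
  &= \mathrm{PAC}\, D_{\text{suppressed}} + (1 - \mathrm{PAC})\, D
     && \text{(assumed form of Equation 13)}
\end{align}
```

For PAC = 1 both forms reduce to D_suppressed, and for PAC = 0 both reduce to D.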
[0088] Background suppressor 121 may include a trained neural network that has been trained to predict a gain mask for suppressing background audio content of a target audio source. For example, the gain mask may be a soft mask.
[0089] Figure 6A schematically illustrates a soft mask 125, in which the difference audio signal is divided into multiple consecutive segments with corresponding timestamps T (the columns in Figure 6A), where each segment includes multiple frequency bands F (the rows in Figure 6A). For each tile (a single table element in Figure 6A), the background suppressor 121 determines a gain used to suppress any audio content not associated with the target audio source. In Figure 6A, the gain value is a value between zero and one, where zero results in the tile being completely muted and one leaves the tile unchanged. In the depicted example, the first and last columns of the soft mask 125 generally have low gain values, indicating that the background suppressor 121 has found no target audio source in these tiles. In the third and fourth columns, on the other hand, some tiles are associated with much higher gains, and some are even assigned the maximum gain of one, indicating that the background suppressor 121 has found content associated with the target audio source in these tiles. The gains of the soft mask 125 therefore indicate the degree to which each frequency band includes audio content associated with the target audio source or residual audio content.
[0090] In some implementations, a simplified soft mask can be used, which includes a single value for each segment indicating the extent to which the segment includes audio content or residual audio content associated with the target audio source. That is, dividing each segment into multiple frequency bands, each with a separate gain value, can enhance performance, but this is not necessary for all implementations.
[0091] The soft mask 125 can be smoothed over time to produce a smooth soft mask. For example, the soft mask can be averaged over ten or more segments, such as approximately 35 segments. Smoothing can be performed by the soft mask processor 122.
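Temporal smoothing of the soft mask can be sketched as a moving average along the segment axis, computed separately for each frequency band. The description gives only the approximate number of segments; the centered window, its truncation at the edges, and the names used here are assumptions.

```python
def smooth_mask_over_time(mask, window):
    """Average each band's gain over a window of neighboring segments.

    mask[t][f] is the gain of segment t and band f. The window is centered
    on segment t and shortened at the start and end of the signal.
    """
    n = len(mask)
    half = window // 2
    smoothed = []
    for t in range(n):
        lo = max(0, t - half)
        hi = min(n, t + half + 1)
        count = hi - lo
        smoothed.append([
            sum(mask[u][f] for u in range(lo, hi)) / count
            for f in range(len(mask[t]))
        ])
    return smoothed
```

A call such as smooth_mask_over_time(mask, 35) would correspond to the example of approximately 35 segments.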
[0092] Audio processing technique 2 involves further processing of the soft mask 125, or a smoothed soft mask, to make the soft mask “more inclusive,” i.e., biased towards recognizing more content as associated with the target source. This further processing can be performed by the soft mask processor 122. Most background suppressors 121 are configured (e.g., trained (if they utilize neural networks)) to recognize content associated with the target source with a balanced false positive and false negative rate. However, for the purposes of this audio processing method, it is beneficial if the background suppressor has very few or even no false negatives (i.e., provides low gain to tiles that are actually associated with the target source). False negatives would mean that some of the target source is identified as background, and the user may hear the misclassified target source because background content may be present in the output audio signal. This is especially evident when the target audio source is speech, as users are generally very sensitive to any speech content, even if the speech content is limited in frequency and duration.
[0093] To circumvent these issues, the soft mask 125 or a smoothed soft mask is processed to form the modified soft mask 125' shown in Figure 6B. Modifying the soft mask 125 or a smoothed soft mask may include scaling the gain value of each tile by a positive scaling factor C1 greater than one, and / or increasing each gain value by an adjustment factor C2 greater than zero. That is, the gain value g can be modified as follows to form the modified gain value g_modified: g_modified = C1 * g + C2. For example, C1 = 1.5 and C2 = 0.15. Optionally, after the gain values are modified, an upper limit is applied so that any gain value exceeding one is set to one.
[0094] As shown in Figure 6B, compared to the original soft mask 125 of Figure 6A, the modified soft mask 125' is "more inclusive" because more tiles are associated with higher gain values indicating that the content belongs to the target audio source. For example, the third, fourth, and fifth columns of the modified soft mask 125' have two, three, and three tiles with the maximum gain of 1.000, respectively, while the corresponding columns of the soft mask 125 have zero, two, and two such tiles, respectively. Similarly, other tiles of the modified soft mask 125' are also associated with higher gain values than in the soft mask 125, indicating that the modified soft mask 125' associates more content with the target source.
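The gain modification g_modified = C1 * g + C2 with the optional upper limit can be sketched as follows, using the example values C1 = 1.5 and C2 = 0.15. The function name and the list-of-lists mask layout are assumptions.

```python
def make_more_inclusive(mask, c1=1.5, c2=0.15):
    """Scale and offset every soft-mask gain, then cap the result at one."""
    return [[min(1.0, c1 * g + c2) for g in row] for row in mask]


# Example: a gain of 0.5 becomes about 0.9, and a gain of 0.8 is capped at 1.0.
modified = make_more_inclusive([[0.0, 0.5, 0.8]])
```

Because every gain is raised, tiles that the background suppressor rated as borderline are more likely to be treated as target content, which is the safety margin described for audio processing technique 2.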
[0095] Typically, soft masking gain can be defined over the interval [g1, g2], where the minimum value g1 indicates that the associated segment or frequency band consists only of residual audio content or at least mostly residual audio content, and the value g2 indicates that the associated segment or frequency band is entirely composed of target source audio content or at least mostly target source audio content, or vice versa. The proportion of the segment or frequency band constituting the residual audio content or target source content can be defined as the ratio of the spectral energy of the residual audio content or target source content to the total spectral energy of the segment or frequency band.
[0096] The modified soft mask 125' is provided to a soft mask applicator, which applies the modified soft mask 125' to the difference audio signal D to form a source difference audio signal D_source containing the isolated target audio source. The source difference audio signal D_source is provided together with the difference audio signal D to the difference calculator 124, which determines the source-suppressed difference audio signal D_suppressed simply as the difference D - D_source. The skilled person will further recognize that different operations can be used to obtain D_suppressed. For example, the modified soft mask 125' can be inverted and applied to the difference audio signal D to obtain the source-suppressed difference audio signal D_suppressed directly.
[0097] In the embodiment shown in Figure 5, both the source-suppressed difference audio signal D_suppressed and the source difference audio signal D_source are provided as outputs; however, in some implementations, only one of D_suppressed and D_source is provided as output.
[0098] Alternatively, a background suppressor 121 can be used to directly determine the soft mask that captures residual or background audio content. In this implementation, the soft mask's coverage of the background is reduced through a similar process (to avoid accidentally capturing the target audio source). For example, the gain value of the soft mask is scaled with a positive scaling factor less than one, and / or adjusted by subtracting an adjustment factor greater than zero and less than one to form an adjusted soft mask.
[0099] Audio processing technique 2 provides an enhanced determination of the source-suppressed difference audio signal D_suppressed, which reduces or mitigates the risk of the target audio source leaking through the audio processing system. Audio processing technique 2 can therefore be used in any implementation of the source separator used in the concepts described herein.
[0100] Audio processing technique 3 includes adjusting the gain of at least one of the side audio signal and the source-suppressed difference audio signal when the target audio source is active. When the target audio source is eliminated, the total spectral energy of the audio signal decreases. Therefore, without applying compensation gain, the output audio signal may be perceived as quieter in segments where the target audio source is suppressed, which is generally undesirable. Audio processing technique 3 involves determining and applying appropriate compensation gain to avoid or mitigate the problem of quieter segments.
[0101] Referring to Figure 7A, the total spectral energy E of the input audio signal IN and of the side audio signal S are compared for a segment in which the target audio source is inactive. The total spectral energy E of the side audio signal S is lower than the total spectral energy of the input audio signal IN, and the difference Δ1 is the spectral energy included in the difference audio signal D. For inactive segments of the target audio source, the difference audio signal D is combined with the side audio signal S to reconstruct the input audio signal IN.
[0102] Figure 7B illustrates the addition of the source-suppressed difference audio signal D_suppressed to the side audio signal S during a segment in which the target audio source is active. The difference Δ2 between the spectral energy of the combined signal and the spectral energy of the side audio signal S constitutes the spectral energy included in the source-suppressed difference audio signal D_suppressed.
[0103] To avoid using the original difference audio signal D from inactive segments and the source-suppressed difference audio signal D 抑制 The resulting acoustic artifacts will 1 and 2. Compare to determine whether to reduce or completely eliminate. 1 and The gain C of the difference between 2 增益 The gain is determined and applied in gain compensator 15 to the side audio signal S and the source-suppressed differential audio signal D during the active segment. 抑制 At least one of them.
[0104] Gain C 增益 On a linear scale, it can be greater than or less than one, and therefore is either positive or negative when expressed in decibels. Thus, although C... 增益 This is referred to as gain, but it should be understood that the term gain in this context encompasses both attenuation and amplification of the relevant audio signal. For example, if the spectral energy difference... 1 is 5 dB and If 2 is 10 dB, then the difference C 增益 Determined as 1- 2, i.e., 5 - 10 = -5 dB, is applied to the side audio signal S or the source-suppressed differential audio signal D. 抑制 In order to provide compensation.
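The worked example above can be reproduced with a short sketch. It assumes that each difference is an energy ratio expressed in decibels (10 * log10 of the ratio) and that the resulting gain is converted to a linear amplitude factor with 10 ** (dB / 20) before being applied to signal samples; both conventions, and all function names, are assumptions rather than details given in the description.

```python
import math


def energy_difference_db(energy_total, energy_side):
    """Energy difference between two signals in dB (assumed 10*log10 ratio)."""
    return 10.0 * math.log10(energy_total / energy_side)


def compensation_gain_db(delta1_db, delta2_db):
    """C_gain in dB as the difference between the two energy differences."""
    return delta1_db - delta2_db


def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor (assumed convention)."""
    return 10.0 ** (gain_db / 20.0)


# Example from the text: differences of 5 dB and 10 dB give a gain of -5 dB.
c_gain_db = compensation_gain_db(5.0, 10.0)
scale = db_to_amplitude(c_gain_db)  # roughly 0.56 as a linear factor
```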
[0105] As described above, the compensation gain to be applied by the gain compensator 15 is based on the difference determined for at least one segment of the target audio source that is inactive. 1 and the difference determined for at least one segment of the target audio source activity. It was determined by 2. Because of the difference 1 and 2 may vary between different inactive segments and different active segments, therefore it is preferable to determine the difference between the mean or median across multiple inactive / active segments. 1 and / or 2, and based on the mean or median. 1 and / or 2. Determine C 增益 .
[0106] For example, identifying multiple inactive segments of a target audio source and determining the difference for each of these segments. 1. Regarding 1. All instances of the difference determine the median difference (e.g., the median of each ratio expressed in decibels). Additionally or alternatively, the average... The difference was determined to be the sum of the squares of the spectral energy of the input audio signal IN divided by the sum of the squares of the spectral energy of the side audio signal. This average can be expressed in decibels. Further analysis revealed that the median... 1 and the mean The average of the two provides a good estimate of the difference in D1.
[0107] The mean and / or median can be similarly determined for multiple segments of the target audio source activity. 2. Difference (or mean and average) 2. The average of the two).
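The estimation of the two energy differences and of the compensation gain can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: the function names are hypothetical, the inputs are taken to be one spectral energy value per segment, and the "sum of squares" ratio is interpreted as a ratio of summed per-segment energies. The symbols delta1 and delta2 stand for the differences determined over inactive and active segments, respectively.

```python
import numpy as np

def robust_difference_db(num_energy, den_energy):
    """Average of the median and the mean energy difference, in dB.

    num_energy / den_energy: per-segment spectral energies (assumed inputs),
    e.g. energy of IN and of S for the inactive segments.
    """
    num = np.asarray(num_energy, dtype=float)
    den = np.asarray(den_energy, dtype=float)
    # Median of the per-segment ratios expressed in decibels.
    median_db = np.median(10.0 * np.log10(num / den))
    # Mean difference: ratio of the summed energies, expressed in decibels.
    mean_db = 10.0 * np.log10(num.sum() / den.sum())
    return 0.5 * (median_db + mean_db)

def compensation_gain_db(delta1_db, delta2_db):
    """Compensation gain in dB as the difference delta1 - delta2."""
    return delta1_db - delta2_db

# Worked example from the text: delta1 = 5 dB, delta2 = 10 dB.
gain_db = compensation_gain_db(5.0, 10.0)
linear_amplitude_gain = 10.0 ** (gain_db / 20.0)
```

With the values of the worked example, the gain is -5 dB, i.e. an attenuation of the signal to which it is applied.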
[0108] In some implementations, the compensation gain C 增益 The determination is performed offline. In the offline implementation, the entire input audio signal IN, the side audio signal S, and the source-suppressed differential audio signal D are... 抑制 It can be available, which means that the compensation gain C 增益 It can be based on the mean and / or median covering all inactive and active segments, respectively. 1 and The difference is used to determine this.
[0109] Audio processing technique 3 can also be applied online. For example, determining the mean and / or median over a window area covering many segments. 1 and 2 differences, where the window region follows the current online segment. Preferably, the compensation gain C 增益 It was processed to not allow changes to be too rapid over time, as this could result in noticeable and disruptive rapid volume changes. Therefore, C 增益 It can be limited to a variation of no more than 1 dB or 2 dB within a time period of 5 or 10 seconds.
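One possible way to keep the online compensation gain from changing too quickly is a per-segment slew limit, sketched below. The function name, the per-segment formulation, and the default values are assumptions made for illustration; the text only states the overall limit (for example 1 dB within 5 seconds).

```python
def limit_gain_change(target_db, seg_dur_s, max_change_db=1.0, period_s=5.0):
    """Slew-limit a sequence of per-segment target gains given in dB.

    The gain may move by at most max_change_db within period_s seconds,
    which is enforced here as a fixed maximum step per segment.
    """
    max_step = max_change_db * seg_dur_s / period_s
    limited = []
    current = target_db[0]
    for target in target_db:
        # Move towards the target, but never by more than max_step.
        step = min(max(target - current, -max_step), max_step)
        current += step
        limited.append(current)
    return limited
```

For 0.1 second segments and a limit of 1 dB per 5 seconds, a sudden 5 dB drop in the target gain is followed at 0.02 dB per segment, so the applied gain takes 25 seconds to reach the new target.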
[0110] In some implementations, interpolation is performed for multiple frequency bands. 1 and 2 and the difference C corresponding to the compensation gain. 增益 The frequency band can be a quasi-octave band. For example, the difference... 1 and 2 and the difference C 增益Determined and applied independently for the following frequency bands: 0 - 400 Hz, 400 - 800 Hz, 800 - 1600 Hz, 1600 - 3200 Hz, 3200 - 6400 Hz, 6400 - 13200 Hz, and 13200 - 24000 Hz.
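The band-wise application of the compensation gain can be sketched as below, using the band edges listed above. The helper names and the representation of the signal as a complex spectrum with an accompanying frequency axis are assumptions for illustration only.

```python
import numpy as np

# Band edges in Hz, taken from the example bands listed in the text.
BAND_EDGES_HZ = [0, 400, 800, 1600, 3200, 6400, 13200, 24000]

def band_index(freqs_hz):
    """Map each frequency to the index (0..6) of the band containing it."""
    return np.searchsorted(BAND_EDGES_HZ[1:-1], freqs_hz, side="right")

def apply_band_gains(spectrum, freqs_hz, gains_db):
    """Scale each frequency bin by the dB gain of its band."""
    gains_db = np.asarray(gains_db, dtype=float)
    # A gain of g dB corresponds to an amplitude factor of 10^(g / 20).
    linear = 10.0 ** (gains_db[band_index(freqs_hz)] / 20.0)
    return np.asarray(spectrum) * linear
```

Each band receives its own gain, so a band in which the two energy differences already agree can be left at 0 dB while another band is attenuated or amplified.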
[0111] Audio processing technique 4 includes modifying the spatial characteristics of the output audio signal OUT. For many types of audio content, the target source (speech) is centered in the sound image and constitutes most of the audio content when active. The side audio signal S extracted by the difference and side extractor 11 reflects this, since the sound image of the side audio signal S is the sound image of the input audio signal IN, which is close to the centered sound image, i.e., θ = π / 4. The sound image is concentrated at the center because of the target source, which has now been eliminated; with the target source removed, the output audio signal OUT should be modified to achieve a sound image better suited to the remaining audio content.
[0112] According to some embodiments, the spatial parameter adjuster 13 obtains the output audio signal OUT and determines the sound image parameter θ for multiple frequency bands of each segment of the output audio signal OUT. The spatial parameter adjuster 13 then applies the sound image scaling function f_pan(θ) to each sound image parameter θ to determine the adjusted sound image parameter θ' = f_pan(θ). The sound image scaling function f_pan(•) is configured to generate, for any sound image parameter θ, an adjusted sound image parameter θ' that deviates more from the centered sound image than the input sound image parameter θ.
[0113] In some embodiments, the sound image scaling function is given by the following equation: (equation not reproduced). For example, after a sound image parameter of 0.5 radians is processed with f_pan(•), the adjusted sound image parameter is approximately 0.36 radians, and its normalized distance from the centered sound image is approximately 0.54. In other words, by moving from 0.5 radians to 0.36 radians, the sound image parameter has been stretched further away from the centered sound image at π / 4.
[0115] By providing this stretching of the panning parameter θ relative to the centered panning, the adjusted panning parameter θ' will more realistically approximate the panning parameters of the actual background audio for various audio content types.
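The stretching away from the centered position can be illustrated with a simple power-law function. This form is an assumption and is not taken from the disclosure: the scaling equation itself is not available in this text. The exponent p = 0.6 is borrowed from the values given for the phase stretching in paragraph [0121], and with it the sketch happens to reproduce the worked numbers above (0.5 radians maps to roughly 0.36 radians).

```python
import math

CENTER = math.pi / 4  # centered sound image

def stretch_pan(theta, p=0.6):
    """Assumed power-law stretch of a panning parameter away from the center.

    theta is expected in [0, pi/2]; 0 < p < 1 pushes values outwards.
    """
    # Normalized distance from the centered sound image, in [0, 1].
    distance = abs(theta - CENTER) / CENTER
    stretched = distance ** p
    # Keep the side (left or right of center) of the original parameter.
    return CENTER + math.copysign(stretched * CENTER, theta - CENTER)
```

A parameter exactly at the center stays at the center, while parameters on either side move towards the corresponding extreme.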
[0116] Stretching the sound image parameters can cause some disturbing artifacts, because small differences in θ are amplified. To reduce or remove these artifacts, the adjusted sound image parameters are smoothed over multiple segments. In one implementation, a Hamming window spanning ten or more segments is used to smooth the sound image parameters, for example, a Hamming window spanning 31 or 41 segments.
[0117] The (optionally smoothed) adjusted sound image parameter θ' can be used to replace the sound image parameter θ to form a modified output audio signal OUT_modified with enhanced spatial characteristics.
[0118] Spatial parameter adjuster 13 can perform a similar process to expand the phase parameter φ of the output audio signal OUT. For many types of target sources, the phase difference of the target source is close to zero. In the same way as for the sound image described above, audio content dominated by the target audio source will leave the (predominant) phase difference of the target audio source in the output audio signal, even though the target audio source has been eliminated.
[0119] Therefore, the spatial parameter adjuster 13 can determine a reference phase parameter φ_ref for each segment and frequency band of the output audio signal. The reference phase φ_ref can be the phase of any channel of the input audio signal. In some implementations, the reference phase φ_ref is the phase of any channel of the input audio signal scaled using a scaling factor based on θ, θ', or a smoothed version thereof. For example, the reference phase is determined by the following equation: (Equation 17) or (Equation 18), where φ_L and φ_R are the phases of the left and right channels of the input audio signal, respectively. In Equations 17 and 18, θ' can be replaced by θ or its smoothed version.
[0120] Using the reference phase φ_ref, the phase parameter φ is determined as the difference between the reference phase φ_ref and the phase of the output audio signal. The phase scaling function f_phase(φ) is used to determine the adjusted phase parameter φ':
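Smoothing a per-band sequence of adjusted parameters with a Hamming window can be sketched as follows. The function name, the edge padding, and the normalization of the window to unit sum are illustrative choices that the text does not specify; an odd window length is assumed so that the output has one value per segment.

```python
import numpy as np

def smooth_over_segments(values, window_len=31):
    """Smooth one band's parameter track with a normalized Hamming window.

    values: one parameter value per segment; window_len: odd number of
    segments spanned by the window (e.g. 31 or 41).
    """
    values = np.asarray(values, dtype=float)
    window = np.hamming(window_len)
    window /= window.sum()  # unit gain, so constant tracks are unchanged
    half = window_len // 2
    # Repeat the first and last value so the output keeps its length.
    padded = np.pad(values, half, mode="edge")
    return np.convolve(padded, window, mode="valid")
```

Normalizing the window keeps a constant parameter track unchanged, while isolated jumps caused by the stretching are spread over neighbouring segments.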
[0121] where p is a value satisfying 0 < p < 1, for example, p = 0.6 or p = 0.4. Thus, in a manner similar to the stretching of the pan parameter, the adjusted phase parameter is stretched to better represent the background audio signal.
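A phase stretch with an exponent p between 0 and 1 could look like the sketch below. The exact phase scaling equation is not available in this text, so the power-law form, the normalization by π, and the function name are all assumptions made purely for illustration.

```python
import math

def stretch_phase(phi, p=0.6):
    """Assumed power-law stretch of a phase difference away from zero.

    phi is the phase parameter relative to the reference phase, in radians.
    """
    # Wrap the phase difference into [-pi, pi].
    phi = math.atan2(math.sin(phi), math.cos(phi))
    # Normalize by pi, stretch with exponent p, and restore the sign.
    stretched = math.pi * (abs(phi) / math.pi) ** p
    return math.copysign(stretched, phi)
```

With 0 < p < 1, small phase differences are enlarged the most, while a phase difference of zero stays at zero and a difference of π stays at π.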
[0122] Optionally, smooth the adjusted phase parameter φ' over two or more segments. For example, use a Hamming window to smooth the adjusted phase parameter φ' for each frequency band and segment. In some embodiments, the Hamming window covers ten or more segments, such as 21 segments or 31 segments.
[0123] The (optionally smoothed) adjusted phase parameter φ' replaces the phase parameter φ of the output audio signal to form a spatially enhanced modified output audio signal.
[0124] It should be noted that although the above spatial parameter adjuster 13 performs stretching of both the pan parameter and the phase parameter, it is envisioned that the spatial parameter adjuster according to some embodiments performs stretching of only one of these parameters.
[0125] In any of the above audio processing techniques 1 to 4, it should be understood that the processing can be performed separately and independently for one or more frequency bands. For processing technique 1, this can involve determining a segment-specific source activity metric α for each frequency band, determining the corresponding PAC for each frequency band, and forming a weighted combination of the difference audio signal D and the source-suppressed difference audio signal D_suppressed using a separate PAC for each frequency band. For audio processing technique 2, this can involve modifying (and optionally smoothing) a single soft mask for each frequency band. For processing technique 3, this can involve determining and applying a single compensation gain for each frequency band. For processing technique 4, this can involve determining and modifying the sound image and / or phase parameters separately for each of the multiple frequency bands.
[0126] Turning to the block diagram in Figure 8, audio processing technique 5 is now described.
[0127] As described above, audio processing technique 1 involves minimal or no processing of the input audio signal when the target source is inactive, and full processing when the target source is active, while using weighted combination to smoothly transition between the two states. This allows the input audio signal to remain unchanged when the target source is inactive. Audio processing technique 5 has the same general purpose, namely, avoiding processing portions of the audio content where the target audio source is inactive, but audio processing technique 5 is implemented entirely separately from any other audio processing technique.
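The weighted combination recalled above for audio processing technique 1 can be sketched as a crossfade driven by a smoothed activity track. This is a sketch under assumptions: the activity metric is taken as 0 for inactive and 1 for active segments, a moving average stands in for the smoothing, an odd smoothing length is assumed, and the function names and the step limit are hypothetical.

```python
import numpy as np

def processing_application_coefficients(activity, n_smooth=5, max_step=0.25):
    """Smooth a 0/1 activity track into coefficients in [0, 1].

    n_smooth: odd number of segments in the moving average (assumed).
    max_step: largest allowed change between consecutive segments.
    """
    activity = np.asarray(activity, dtype=float)
    kernel = np.ones(n_smooth) / n_smooth
    padded = np.pad(activity, n_smooth // 2, mode="edge")
    smoothed = np.convolve(padded, kernel, mode="valid")
    pac = np.empty_like(smoothed)
    pac[0] = smoothed[0]
    for i in range(1, len(smoothed)):
        # Limit the change between two consecutive segments.
        step = np.clip(smoothed[i] - pac[i - 1], -max_step, max_step)
        pac[i] = pac[i - 1] + step
    return pac

def blend(d, d_suppressed, pac):
    """Weighted sum of the original and source-suppressed difference signals."""
    return (1.0 - pac) * d + pac * d_suppressed
```

A coefficient of 0 leaves the difference audio signal untouched, a coefficient of 1 uses only the source-suppressed version, and intermediate values give the smooth transition between the two states.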
[0128] The original audio signal IN_original is obtained and passed to a set of frequency-domain brick-wall filters 21 and 22. The brick-wall filters 21 and 22 are complementary, meaning that any frequency blocked by one filter is passed by the other. At least one brick-wall filter 21 is configured to filter the original audio signal IN_original to form the input audio signal IN. The brick-wall filter 21 associated with the input audio signal IN is configured to pass frequencies associated with the target audio source while blocking other frequencies.
[0129] Furthermore, at least one brick-wall filter 22 is provided as a complementary filter to the one or more filters associated with the input audio signal IN. The complementary brick-wall filter 22 is configured to pass any frequencies blocked by the brick-wall filter 21 associated with the input audio signal IN. The signal passing through the brick-wall filter 22 is referred to as the unprocessed audio signal IN_unprocessed, because this audio signal bypasses the audio processing system 1.
[0130] The input audio signal IN is provided to the audio processing system 1 and processed according to at least one of the other audio processing techniques described above to form a modified output audio signal OUT_modified (or, if spatial stretching is skipped, the output audio signal OUT). The modified output audio signal OUT_modified is provided to signal combiner 23, which combines the modified output audio signal OUT_modified with the unprocessed audio signal IN_unprocessed to form the final audio signal OUT_final.
[0131] Therefore, only the frequency bands expected to contain the target audio source are passed to audio processing system 1 for source cancellation. Conversely, frequency bands that are not expected to contain any audio content associated with the target audio source bypass audio processing system 1. This means that a portion of the original audio signal IN_original is left unprocessed, and the final audio signal OUT_final will be very close to the original audio signal IN_original, except that the target audio source has been removed from the frequency range where it was expected to be located.
[0132] As an example, if the target audio source is speech, the brick-wall filter 21 forming the input audio signal IN can be configured to pass frequencies in the range of 80–14000 Hz (the band covering the most common speech frequencies) while blocking frequencies below 80 Hz and above 14000 Hz. The complementary brick-wall filter 22 passes frequencies below 80 Hz and above 14000 Hz, forming the unprocessed audio signal IN_unprocessed, which contains the very low and very high frequency content of the original audio signal IN_original. The audio processing system 1 processes the input audio signal IN, which contains frequencies from 80 to 14000 Hz, to form a modified output audio signal OUT_modified in which the target audio source is suppressed. The modified output audio signal OUT_modified is then combined with the unprocessed audio signal IN_unprocessed (which is expected not to include any audio content associated with the target audio source) to form a final audio signal OUT_final with a frequency range of 0 to 20000 Hz or higher.
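The complementary brick-wall split for the speech example can be sketched as follows. The sketch applies a single FFT to the whole signal for brevity, which is a simplification; the function name and the default band limits (taken from the 80 to 14000 Hz example) are illustrative assumptions.

```python
import numpy as np

def split_brickwall(x, sample_rate, lo_hz=80.0, hi_hz=14000.0):
    """Split x into an in-band signal and its complementary bypass signal."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    passband = (freqs >= lo_hz) & (freqs <= hi_hz)
    # Filter 21: keep only the band expected to contain the target source.
    in_band = np.fft.irfft(np.where(passband, spectrum, 0.0), n=len(x))
    # Filter 22: keep exactly the bins that filter 21 blocks.
    bypass = np.fft.irfft(np.where(passband, 0.0, spectrum), n=len(x))
    return in_band, bypass
```

Because every frequency bin goes to exactly one of the two outputs, adding the in-band signal (after processing) to the bypass signal restores the full frequency range, and with no processing applied the sum equals the original signal.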
[0133] The passbands of the different brick-wall filters 21 and 22 can be set based on the type of the target audio source. In the example above, the passbands of filters 21 and 22 are set to capture speech in the input audio signal; however, many other alternatives are possible. For example, if the speech is known to be male or female, the frequency range of 80–14000 Hz can be adjusted to exclude higher frequencies for male speech and lower frequencies for female speech. As a further example, if an instrument is to be removed, the passband of brick-wall filter 21 associated with the input audio signal is set to the typical frequency range of the instrument, and brick-wall filter 22 is set as a complementary filter.
[0134] The audio processing system 1, together with the brick wall filters 21 and 22, forms the frequency selective audio processing system 1'.
[0135] Figure 9 is a block diagram illustrating how the final audio signal OUT_final is mixed with an alternative audio signal IN_alternative to form an alternative output audio signal OUT_alternative. The frequency-selective audio processing system 1' has already suppressed the target audio source, and in some embodiments it is desirable to replace the suppressed source with an alternative source. The alternative source is included in the alternative audio signal IN_alternative. The final output audio signal OUT_final and the alternative input audio signal IN_alternative are provided to signal combiner 2, which mixes them to obtain an alternative output audio signal OUT_alternative, in which the target audio source of the original audio signal IN_original has been replaced with the alternative audio source present in IN_alternative.
[0136] In one example implementation, the target audio source to be suppressed is speech, and the alternative audio signal IN_alternative contains speech in a different language. Effectively, this achieves dubbing, where the speech content is replaced with different speech content.
[0137] As another example, the target audio source is a specific musical instrument, and the alternative audio signal IN_alternative includes different music played on the same type of instrument or music from a different instrument. In this way, for example, a guitar solo can be replaced with a different guitar solo.
[0138] It should be understood that the frequency-selective audio processing system 1' can be replaced with an implementation, as in Figure 1, of one or more of the audio processing techniques detailed above.
[0139] Unless otherwise specifically stated, and as is apparent from the following discussion, it should be understood that throughout this disclosure terms such as “processing,” “computing,” “calculating,” “determining,” and “analyzing” refer to the actions and / or processes by which computer hardware, a computing system, or a similar electronic computing device manipulates and / or transforms data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.
[0140] It should be understood that in the foregoing description of exemplary embodiments of the invention, various features are sometimes grouped together in a single embodiment, figure, or description thereof to simplify the disclosure and aid in understanding one or more of the various inventive aspects. However, the approach of this disclosure should not be construed as reflecting an intention to require more features than expressly recited in each claim. Rather, as reflected in the following claims, the inventive aspect lies in fewer than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, wherein each claim is an independent embodiment of the invention. Furthermore, while some embodiments described herein include some features included in other embodiments but not others included in other embodiments, as will be understood by those skilled in the art, combinations of features from different embodiments are intended to be within the scope of the invention and form different embodiments. For example, any of the claimed embodiments in the following claims may be used in any combination.
[0141] Furthermore, certain embodiments herein are described as methods or combinations of method elements that can be implemented by a processor of a computer system or by other means of performing functions. Thus, a processor having instructions for performing such methods or method elements forms means for performing methods or method elements. It should be noted that when a method comprises multiple elements (e.g., several steps), no particular order of these elements is implied unless specifically stated otherwise. Furthermore, the elements of the apparatus embodiments described herein are examples of means for performing the functions performed by the elements to carry out embodiments of the invention. Numerous specific details are set forth in the description provided herein. However, it should be understood that embodiments of the invention can be practiced without these specific details. In other instances, well-known methods, structures, and techniques have not been shown in detail in order to avoid obscuring the understanding of this specification.
[0142] Therefore, although specific embodiments of the invention have been described, those skilled in the art will recognize that other and further modifications can be made thereto without departing from the spirit of the invention, and all such changes and modifications falling within the scope of the invention are intended to be claimed.
[0143] Various aspects of the invention can be understood from the following enumerated example embodiments (EEEs): EEE 1. A method for processing audio, the method comprising: obtaining an input audio signal comprising two channels, the input audio signal comprising multiple consecutive segments; determining, for each segment of the input audio signal, a segment-specific source activity index that indicates whether a predetermined target audio source is active in the segment; for each segment: extracting a side audio signal from the input audio signal; extracting a difference audio signal based on the difference between the side audio signal and the input audio signal; and extracting a source-suppressed difference audio signal from the difference audio signal by processing the difference audio signal with a source separator; determining an inactivity energy level based on the spectral energy of the difference audio signal of at least one segment associated with a segment-specific activity index indicating inactivity of the target audio source; determining an activity energy level based on the spectral energy of the source-suppressed difference audio signal of at least one segment associated with a segment-specific activity index indicating activity of the target audio source; applying, for the at least one segment associated with a segment-specific activity index indicating activity of the target audio source, a gain based on the difference between the inactivity energy level and the activity energy level to at least one of the side audio signal and the source-suppressed difference audio signal; and combining the source-suppressed difference audio signal with the side audio signal, the gain having been applied to at least one of them, to form an output audio signal in which the target audio source is suppressed.
[0144] EEE 2. The method according to EEE 1, further comprising: The inactivity energy level is determined as the average spectral energy of the difference audio signals of at least two segments associated with a segment-specific activity index indicating inactivity of the target audio source.
[0145] EEE 3. The method according to EEE 1 or EEE 2 further includes: The activity energy level is determined as the average spectral energy of source-suppressed audio signals from at least two segments associated with a segment-specific activity index indicating the activity of the target audio source.
[0146] EEE 4. The method according to any one of the preceding EEEs, wherein, for each segment, the output audio signal comprises a plurality of frequency bands, and wherein each frequency band is associated with a sound image parameter, the method further comprising: scaling the sound image parameter of each frequency band and segment using a sound image scaling function that is based on the deviation between the sound image parameter and the centered sound image, to obtain an adjusted sound image parameter that deviates more from the centered sound image than the sound image parameter; and forming a modified output audio signal using the adjusted sound image parameters.
[0147] EEE 5. The method according to EEE 4, further comprising: smoothing the adjusted sound image parameters for each frequency band and segment over time.
[0148] EEE 6. The method according to any one of the preceding EEEs, wherein, for each segment, the output audio signal comprises a plurality of frequency bands, and wherein each frequency band is associated with a phase parameter, the method further comprising: Determine the reference phase for each frequency band and segment; The phase parameters of each frequency band and segment are scaled using a phase scaling function, which is based on the deviation between the phase parameters and the reference phase, to obtain adjusted phase parameters that have a greater degree of deviation than the phase parameters relative to the reference phase; and The modified output audio signal is formed using the adjusted phase parameters.
[0149] EEE 7. The method according to EEE 6 further includes: The adjusted phase parameters for each frequency band and segment are smoothed over time.
[0150] EEE 8. The method according to EEE 6 or EEE 7 when subordinate to EEE 4, wherein determining the reference phase for each frequency band and segment includes: The phase parameters of the corresponding frequency band of one channel of the input audio signal are scaled using the adjusted image parameters.
[0151] EEE 9. A method for processing audio, the method comprising: obtaining an input audio signal comprising two channels, the input audio signal comprising multiple consecutive segments; for each segment: extracting a side audio signal from the input audio signal; extracting a difference audio signal based on the difference between the side audio signal and the input audio signal; extracting a source-suppressed difference audio signal from the difference audio signal by processing the difference audio signal with a source separator; and combining the source-suppressed difference audio signal with the side audio signal to form an output audio signal in which the target audio source is suppressed; determining, for each segment of the output audio signal, sound image parameters and / or phase parameters across multiple frequency bands; scaling the sound image parameter of each frequency band and segment using a sound image scaling function that is based on the deviation between the sound image parameter and the centered sound image, to obtain an adjusted sound image parameter that deviates more from the centered sound image than the sound image parameter; and / or determining a reference phase for each frequency band and segment, and scaling the phase parameter of each frequency band and segment using a phase scaling function that is based on the deviation between the phase parameter and the reference phase, to obtain an adjusted phase parameter that deviates more from the reference phase than the phase parameter; and forming a modified output audio signal using the adjusted sound image parameters and / or the adjusted phase parameters.
[0152] EEE 10. The method according to EEE 9 further includes: smoothing the adjusted acoustic image parameters for each frequency band and segment in time.
[0153] EEE 11. The method according to EEE 9 or EEE 10 further includes: smoothing the adjusted phase parameters for each frequency band and segment in time.
[0154] EEE 12. The method according to any one of EEE 9 to 11, wherein determining the reference phase for each frequency band and segment comprises: The phase parameters of the corresponding frequency band of one channel of the input audio signal are scaled using the adjusted image parameters.
[0155] EEE 13. The method according to any one of the preceding EEE, wherein extracting the source-suppressed difference audio signal by processing the difference audio signal with a source separator comprises: - Using a source separator, the source separator is configured to determine the extent to which the segment includes audio content or residual audio content associated with the target audio source, and - The source-suppressed differential audio signal is formed based on the degree to which each corresponding segment contains audio content associated with the target audio source.
[0156] EEE 14. The method according to EEE 13, wherein the source separator is configured to determine, for each segment, a gain mask for removing the target audio source from the differential audio signal.
[0157] EEE 15. The method according to EEE 14 further includes: Smooth the gain mask over at least two segments to form a smooth gain mask; and The smoothed gain mask is applied to the difference audio signal to form the source-suppressed difference audio signal.
[0158] EEE 16. The method according to EEE 14 or EEE 15, wherein each gain mask includes a value for each segment, the value being in the range [g1, g2], wherein g1 and g2 are positive values greater than or equal to zero, wherein g1 indicates that the corresponding segment contains audio content associated with the target source, and g2 indicates that the corresponding segment contains residual audio content, the method further comprising: Each value of the gain mask is scaled by a positive scaling factor less than one, and / or each value of the gain mask is reduced by subtracting a positive adjustment value greater than zero, to form a modified gain mask; and The modified gain mask is applied to the difference audio signal to form the source-suppressed difference audio signal.
[0159] EEE 17. The method according to EEE 16, wherein each gain mask includes the value of each of the multiple frequency bands of each segment.
[0160] EEE 18. The method according to EEE 16 or EEE 17, wherein g1 indicates that the corresponding segment or frequency band contains audio content dominated by the target source, and g2 indicates that the corresponding segment or frequency band contains audio content dominated by the residual audio content.
[0161] EEE 18. The method according to any one of the preceding EEEs further comprises: determining a processing application coefficient for a specific segment associated with a segment-specific source activity index indicating inactivity of the target audio source by smoothing the segment-specific source activity index over a group of segments associated with the specific segment, wherein the group of segments includes segments associated with the segment-specific source activity index indicating inactivity of the target audio source and segments associated with the segment-specific source activity index indicating activity of the target audio source; For the specific segment, based on the processing application coefficients, the difference audio signal and the source-suppressed difference audio signal are weighted and summed to form a modified difference audio signal; and For the specific segment, the modified difference audio signal is combined with the side audio signal to form an output audio signal in which the target audio source is suppressed.
[0162] EEE 18. The method according to EEE 17, wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes: limiting the maximum difference between the processing application coefficients between two consecutive segments.
[0163] EEE 19. The method according to EEE 17, wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes: smoothing the segment-specific source activity index over N_smooth segments, where N_smooth is greater than or equal to two, and where the total duration of the N_smooth segments is at least 1 second, preferably at least 2 seconds, or most preferably at least 2.5 seconds.
[0164] EEE 20. The method according to any one of EEE 17 to 19, wherein the segment-specific source activity index for each segment is α1 or α2, wherein α2 indicates that the target source is active and α1 indicates that the target source is inactive, and wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes: determining a number N of consecutive segments for which the segment-specific source activity index is α1, where N is bounded by a given maximum number N_max; and using the source-suppressed difference audio signal as the modified difference audio signal for each of the N consecutive segments.
[0165] EEE 21. The method according to EEE 20, wherein N_max is chosen such that the total duration N_max · T of the segments is less than or equal to 1 second, preferably less than or equal to 2 seconds, and most preferably less than or equal to 3 seconds, where T is the duration of each segment.
[0166] EEE 22. The method according to any one of the preceding EEEs, further comprising: obtaining the original audio signal; Based on frequency content, the original audio signal is divided into two separate signals: the input audio signal and the auxiliary audio signal; and The auxiliary audio signal is combined with the output audio signal.
[0167] EEE 23. The method according to any one of the preceding EEEs, wherein the target audio source is selected from the group consisting of speech, music and the sound of a particular instrument.
[0168] EEE 24. The method according to any one of the preceding EEE, wherein extracting the side audio signal and the difference audio signal comprises: The side audio signal is extracted based on the difference between the two channels in the input audio signal; Determine the acoustic image parameters of the input audio signal; The side audio signal is converted into a stereo side audio signal using the imaging parameters of the input audio signal; and The difference audio signal is defined as the difference between the stereo side audio signal and the two-channel input audio signal.
[0169] EEE 25. An apparatus comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of the preceding EEEs.
[0170] EEE 26. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the method according to any one of EEE 1 to 24.
Claims
1. A method for processing audio, the method comprising: obtaining an input audio signal comprising two channels, the input audio signal comprising a plurality of consecutive segments; for each segment of the input audio signal, determining a segment-specific source activity index indicating whether a predetermined target audio source is active in the segment; for a specific segment associated with a segment-specific source activity index indicating inactivity of the target audio source, determining a processing application coefficient by smoothing the segment-specific source activity index over a group of segments associated with the specific segment, wherein the group of segments includes segments associated with a segment-specific source activity index indicating inactivity of the target audio source and segments associated with a segment-specific source activity index indicating activity of the target audio source; and, for the specific segment: extracting a side audio signal from the input audio signal; extracting a difference audio signal based on the difference between the side audio signal and the input audio signal; extracting a source-suppressed difference audio signal from the difference audio signal by using a source separator configured to determine the extent to which the specific segment contains audio content associated with the target audio source or residual audio content, and forming the source-suppressed difference audio signal based on the extent to which the specific segment contains audio content associated with the target audio source; weighting and summing the difference audio signal and the source-suppressed difference audio signal, based on the processing application coefficient, to form a modified difference audio signal; and combining the modified difference audio signal with the side audio signal to form an output audio signal in which the target audio source is suppressed.
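By way of a non-limiting illustration, the smoothing, weighting, and recombination steps of claim 1 might be sketched as follows. The sketch assumes that activity is encoded as 1.0 (active) or 0.0 (inactive), that smoothing is a centered moving average, that a coefficient of 1.0 selects the source-suppressed signal, and that each signal is a plain list of samples for one channel. The function names are hypothetical and do not appear in the application.

```python
def smooth_activity(activity, half_window):
    """Centered moving average of per-segment activity values.

    Assumption: 1.0 marks an active segment, 0.0 an inactive one.
    """
    n = len(activity)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(sum(activity[lo:hi]) / (hi - lo))
    return smoothed


def mix_difference(diff, suppressed, coeff):
    """Weighted sum for one segment; coeff = 1.0 keeps only the suppressed signal."""
    return [(1.0 - coeff) * d + coeff * s for d, s in zip(diff, suppressed)]


def combine_with_side(modified_diff, side):
    """Assumed recombination: sample-wise sum of the two signals."""
    return [m + s for m, s in zip(modified_diff, side)]
```

Under these assumptions, an inactive segment adjacent to active segments receives an intermediate coefficient, so the suppression fades in and out rather than switching abruptly at segment boundaries.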
2. The method according to claim 1, wherein the source separator is configured to determine the extent to which each of a plurality of frequency bands of the specific segment contains audio content associated with the target audio source or residual audio content, and wherein the source-suppressed difference audio signal is formed by combining, based on the extent to which each corresponding frequency band contains audio content associated with the target audio source, all frequency bands associated with the residual audio content.
3. The method according to claim 1 or claim 2, wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes limiting the maximum difference between the processing application coefficients of two consecutive segments.
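As a non-limiting illustration of the limitation in claim 3, the step between consecutive coefficients might be bounded as in the following sketch. The function name `limit_step` and the parameter `max_step` are hypothetical, and the first coefficient is assumed to pass through unchanged.

```python
def limit_step(coeffs, max_step):
    """Limit the change between consecutive coefficients to at most max_step."""
    if not coeffs:
        return []
    limited = [coeffs[0]]  # assumption: the first value is taken as-is
    for target in coeffs[1:]:
        previous = limited[-1]
        # Clamp the requested change to the interval [-max_step, +max_step].
        delta = max(-max_step, min(max_step, target - previous))
        limited.append(previous + delta)
    return limited
```

With `max_step = 0.5`, a jump from 0.0 to 1.0 would be spread over two segments in this sketch.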
4. The method according to any one of the preceding claims, wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes smoothing the segment-specific source activity indices over N_smooth segments, wherein N_smooth is greater than or equal to two, and wherein the total duration of the N_smooth segments is at least 1 second, preferably at least 2 seconds, or most preferably at least 2.5 seconds.
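As a non-limiting arithmetic illustration of claim 4, the smallest N_smooth that satisfies both conditions for a given segment duration might be computed as follows. The function name and the example segment durations are hypothetical.

```python
import math


def segments_for_duration(min_total_seconds, segment_seconds):
    """Smallest segment count, at least two, whose total duration reaches the minimum."""
    return max(2, math.ceil(min_total_seconds / segment_seconds))
```

For example, under the assumption of 0.5-second segments, a total duration of at least 2.5 seconds would require five segments in this sketch.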
5. The method according to any one of the preceding claims, wherein the segment-specific source activity index takes the value α1 or α2, where α2 indicates that the target audio source is active and α1 indicates that the target audio source is inactive, and wherein smoothing the segment-specific source activity index to determine the processing application coefficient includes: for a number N of consecutive segments whose segment-specific source activity index is α1, where N is bounded by a given maximum number N_max, using the source-suppressed difference audio signal as the modified difference audio signal for each of the N consecutive segments.
6. The method according to claim 5, wherein N_max is chosen such that the total duration N_max·T of N_max segments is less than or equal to 1 second, preferably less than or equal to 2 seconds, and most preferably less than or equal to 3 seconds, where T is the duration of each segment.
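As a non-limiting illustration of claims 5 and 6, short runs of inactive segments might be identified as in the sketch below, so that the source-suppressed difference audio signal can be used for every segment in such a run. The sketch assumes that activity is given as booleans, that a run qualifies when its length is at most `n_max`, and that N_max is the largest count with N_max·T within the chosen limit. The function names are hypothetical.

```python
import math


def max_gap_segments(max_gap_seconds, segment_seconds):
    """Largest segment count whose total duration stays within max_gap_seconds."""
    return math.floor(max_gap_seconds / segment_seconds)


def full_suppression_flags(active, n_max):
    """Flag inactive segments lying in a run of at most n_max inactive segments."""
    flags = [False] * len(active)
    i = 0
    while i < len(active):
        if active[i]:
            i += 1
            continue
        # Find the end of the current run of inactive segments.
        j = i
        while j < len(active) and not active[j]:
            j += 1
        if j - i <= n_max:
            for k in range(i, j):
                flags[k] = True
        i = j
    return flags
```

A flagged segment would then be processed with full suppression, which avoids briefly releasing the suppression during short pauses of the target audio source.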
7. The method according to any one of the preceding claims, wherein the source separator is configured to determine, for each segment, a gain mask for removing the target audio source from the difference audio signal.
8. The method of claim 7, further comprising: smoothing the gain mask over at least two segments to form a smoothed gain mask; and applying the smoothed gain mask to the difference audio signal to form the source-suppressed difference audio signal.
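As a non-limiting illustration of claim 8, a gain mask might be smoothed over segments with a trailing average, as sketched below. The sketch assumes that each mask is a list of per-band gains of equal length and that the average covers the current segment and the preceding ones. The function name and the window choice are hypothetical.

```python
def smooth_masks(masks, n_segments=2):
    """Average each band's gain over the current and up to n_segments - 1 earlier segments."""
    smoothed = []
    for i in range(len(masks)):
        window = masks[max(0, i - n_segments + 1): i + 1]
        # zip(*window) groups the gains of one band across the windowed segments.
        smoothed.append([sum(band) / len(window) for band in zip(*window)])
    return smoothed
```

The first segment has no predecessor in this sketch, so its mask is returned unchanged.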
9. The method according to claim 7 or claim 8, wherein each gain mask includes a value for each segment, the value lying in the range [g1, g2], where g1 and g2 are values greater than or equal to zero, g1 indicates that the corresponding segment contains audio content associated with the target audio source, and g2 indicates that the corresponding segment contains residual audio content, the method further comprising: forming a modified gain mask by scaling each value of the gain mask by a positive scaling factor less than one and/or by reducing each value of the gain mask by subtracting a positive adjustment value; and applying the modified gain mask to the difference audio signal to form the source-suppressed difference audio signal.
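As a non-limiting illustration of claim 9, the gain mask modification might be sketched as follows. The sketch assumes g1 = 0 and g2 = 1, and it clamps modified gains at zero; the clamp and the function names are assumptions rather than features recited in the claim.

```python
def modify_gain_mask(mask, scale=1.0, offset=0.0):
    """Scale each gain and/or subtract an offset; negative results are clamped to zero."""
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must lie in (0, 1]")
    if offset < 0.0:
        raise ValueError("offset must be non-negative")
    return [max(0.0, gain * scale - offset) for gain in mask]


def apply_mask(values, mask):
    """Multiply each value (per segment or per band) by its gain."""
    return [value * gain for value, gain in zip(values, mask)]
```

Lowering every gain in this way makes the suppression more aggressive than the separator's own estimate, at the cost of also attenuating some residual audio content.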
10. The method according to claim 9, wherein each gain mask includes such a value for each of a plurality of frequency bands in each segment.
11. The method according to claim 9 or claim 10, wherein g1 indicates that the corresponding segment or frequency band contains audio content dominated by the target audio source, and g2 indicates that the corresponding segment or frequency band contains audio content dominated by the residual audio content.
12. The method according to any one of the preceding claims, further comprising: determining an inactive energy level based on the spectral energy of the difference audio signal of at least one segment associated with a segment-specific source activity index indicating inactivity of the target audio source; determining an active energy level based on the spectral energy of the source-suppressed audio signal of at least one segment associated with a segment-specific source activity index indicating activity of the target audio source; and, based on the difference between the inactive energy level and the active energy level of the at least one segment associated with a segment-specific source activity index indicating activity of the target audio source, applying a gain to at least one of the side audio signal and the source-suppressed difference audio signal.
13. The method of claim 12, further comprising: determining the inactive energy level as the average spectral energy of the difference audio signals of at least two segments associated with a segment-specific source activity index indicating inactivity of the target audio source.
14. The method according to claim 12 or claim 13, further comprising: determining the active energy level as the average spectral energy of the source-suppressed audio signals of at least two segments associated with a segment-specific source activity index indicating activity of the target audio source.
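As a non-limiting illustration of claims 12 to 14, a level-matching gain might be derived from the two energy levels as sketched below. The sketch assumes that the mean squared value of each segment stands in for its spectral energy, and that the gain is the square root of the energy ratio, which corresponds to the level difference expressed in decibels. The function names and the `eps` guard are hypothetical.

```python
import math


def mean_energy(segments):
    """Average, over segments, of each segment's mean squared value."""
    energies = [sum(x * x for x in segment) / len(segment) for segment in segments]
    return sum(energies) / len(energies)


def level_matching_gain(inactive_segments, active_suppressed_segments, eps=1e-12):
    """Amplitude gain bringing the active energy level to the inactive energy level."""
    inactive_level = mean_energy(inactive_segments)
    active_level = mean_energy(active_suppressed_segments)
    # eps guards against division by zero for silent segments.
    return math.sqrt((inactive_level + eps) / (active_level + eps))
```

Applying such a gain during active segments would keep the residual content at a level comparable to that observed while the target audio source is inactive.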
15. The method according to any one of the preceding claims, wherein, for each segment, the output audio signal includes a plurality of frequency bands, each frequency band being associated with a sound image parameter, the method further comprising: scaling the sound image parameter of each frequency band and segment using an image scaling function that is based on the deviation between the sound image parameter and a centered sound image, to obtain an adjusted sound image parameter that deviates more from the centered sound image than the sound image parameter does; and forming a modified output audio signal using the adjusted sound image parameters.
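As a non-limiting illustration of claim 15, an image scaling function might be sketched as follows. The sketch assumes that the sound image parameter lies in [-1, 1] with 0 as the centered sound image, that the scaling is linear in the deviation, and that the result is clamped to the valid range. The function name and the default factor are hypothetical.

```python
def widen_image(pan, factor=1.5, center=0.0, limit=1.0):
    """Scale the deviation from the centered image by factor and clamp the result."""
    deviation = pan - center
    adjusted = center + factor * deviation
    return max(center - limit, min(center + limit, adjusted))
```

With a factor greater than one, a centered band stays centered while off-center bands move further toward the sides.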
16. The method of claim 15, further comprising smoothing the adjusted sound image parameters of each frequency band and segment in time.
17. The method according to any one of the preceding claims, wherein, for each segment, the output audio signal comprises a plurality of frequency bands, each frequency band being associated with a phase parameter, the method further comprising: determining a reference phase for each frequency band and segment; scaling the phase parameter of each frequency band and segment using a phase scaling function that is based on the deviation between the phase parameter and the reference phase, to obtain an adjusted phase parameter that deviates more from the reference phase than the phase parameter does; and forming a modified output audio signal using the adjusted phase parameters.
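As a non-limiting illustration of claim 17, a phase scaling function might be sketched as follows. The sketch assumes phases in radians, a linear scaling of the wrapped deviation from the reference phase, and wrapping of the result to [-pi, pi). The function names and the default factor are hypothetical.

```python
import math


def wrap_phase(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def scale_phase(phase, reference, factor=1.5):
    """Scale the wrapped deviation from the reference phase by factor."""
    deviation = wrap_phase(phase - reference)
    # Note: for large deviations the final wrap can fold the result back.
    return wrap_phase(reference + factor * deviation)
```

A band whose phase equals the reference phase is left unchanged in this sketch, while other bands are pushed further from the reference.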
18. The method of claim 17, further comprising smoothing the adjusted phase parameters for each frequency band and segment in time.
19. The method according to claim 17 or claim 18 when dependent on claim 15, wherein determining the reference phase for each frequency band and segment includes scaling the phase parameter of the corresponding frequency band of one channel of the input audio signal using the adjusted sound image parameter.
20. The method according to any one of the preceding claims, further comprising: obtaining an original audio signal; dividing the original audio signal, based on frequency content, into two separate signals, namely the input audio signal and an auxiliary audio signal; and combining the auxiliary audio signal with the output audio signal.
21. The method according to any one of the preceding claims, wherein the target audio source is selected from the group consisting of speech, music, and the sound of a specific instrument.
22. The method according to any one of the preceding claims, wherein extracting the side audio signal and the difference audio signal includes: extracting the side audio signal based on the difference between the two channels of the input audio signal; determining sound image parameters of the input audio signal; converting the side audio signal into a stereo side audio signal using the sound image parameters of the input audio signal; and defining the difference audio signal as the difference between the stereo side audio signal and the two-channel input audio signal.
23. An apparatus comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of the preceding claims.
24. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 22.