Real-time level management with content recognition for audio content.

A framework for audio level management using sound source classification and gain control modules addresses the challenges of mixed audio environments, enhancing audio sources by applying time-varying gains for improved audio quality.

JP2026521033A · Pending · Publication Date: 2026-06-25 · DOLBY LABORATORIES LICENSING CORP +1


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: DOLBY LABORATORIES LICENSING CORP
Filing Date: 2024-06-19
Publication Date: 2026-06-25

Application Information

Patent Timeline

19 Jun 2024: Application
25 Jun 2026: Publication (JP2026521033A)

IPC: G10L21/034; G10L25/81; G10L25/84

AI Tagging

Application Domain

Speech analysis

Technology Topics

Engineering; Signal correlation


AI Technical Summary

Technical Problem

Existing audio enhancement algorithms struggle to effectively manage audio levels in mixed content environments, particularly when signal-to-noise ratio is low, and fail to distinguish between different audio sources, leading to suboptimal output and computational complexity.

Method used

A framework comprising a sound source classification module, trust activation tracker module, long-term energy estimation module, and gain control module to identify and manage audio sources like speech, music, and noise, applying time-varying gains for real-time level management.

Benefits of technology

Enhances audio sources by effectively boosting or attenuating each sound source to achieve desired energy levels, reducing computational complexity and improving audio output quality in diverse applications.

✦ Generated by Eureka AI based on patent content.


Abstract

A system and method for content-aware real-time level control of audio content. One example provides a method for real-time level control of audio content, the method comprising the steps of: receiving an input audio signal; generating a short-time signal based on the input audio signal; identifying one or more sources of interest associated with the short-time signal using a source classifier; estimating the long-term root mean square energy (long-term RMS) for each of the one or more sources of interest using a long-term energy estimator; estimating a set of short-time gains based at least partially on the long-term RMS of at least one of the sources of interest using automatic gain control; and applying the short-time gains to the short-time signal.


Description

[Technical Field]

[0001] This application broadly relates to audio signal processing, and more specifically to content-aware level management by estimating and applying time-varying gain values to boost or attenuate each sound source.

[Background Technology]

[0002] Unless otherwise indicated herein, the materials described in this section are not prior art to the claims of this application, and their inclusion in this section does not make them prior art.

[0003] Existing enhancement algorithms for noisy speech perform poorly when applied to content mixed with speech, music, noise, etc. For example, known leveling techniques based on analyzing the distribution of short-duration energy are not very effective when the signal-to-noise ratio (SNR) is low. Furthermore, known leveling techniques cannot distinguish between different audio sources based solely on energy distribution.

[0004] Known leveling techniques that rely on voice activity detectors (VADs) treat non-speech sources, such as music and noise sources, as a single source, which can result in suboptimal audio output (e.g., boosted noise). Furthermore, known leveling techniques that rely on source separation and subsequent single-source leveling algorithms are computationally complex and may not be suitable for diverse implementations.

[0005] The disclosures made herein are presented with respect to these and other considerations.

[Summary of the Invention]

[Problems that the invention aims to solve]

[0006] As the diversity of real-time communication, real-time streaming, user-generated content creation, and other similar applications increases, content-aware processing provides the ability to enhance audio sources associated with user interest. For example, real-time communication for music lessons requires enhancement of both speech and musical audio content. Instrumental performances are typically louder than speech. Therefore, the content-aware real-time processing algorithms described herein can reduce the music volume and increase the speech volume. Since the majority of live content creators do not have dedicated sound engineers or the ability to manually balance audio levels in real time, the embodiments described herein provide automated processing tools to assist with real-time content such as cloud-based video conferencing using laptop devices and live broadcasts using smartphone devices.

[0007] Techniques for processing audio signals are described. The examples found herein provide systems, devices, and methods for processing audio content, more specifically, for estimating and applying time-varying gains such that each audio source is boosted or attenuated to achieve a certain energy level.

[Means for solving the problem]

[0008] Several examples provide a framework including a sound source classification module, a trust activation tracker module, a long-term energy estimation module, and a gain control module. The sound source classification module is configured to receive an input audio signal and identify one or more sound sources of interest associated with the input audio signal. The trust activation tracker module is configured to monitor the output of the sound source classification module for multiple frames and to verify the output of the sound source classification module. The long-term energy estimation module is configured to estimate the long-term RMS for each of the one or more sound sources of interest. The automatic gain control module is configured to estimate a set of short-term gains, at least partially based on the long-term RMS of at least one of the sound sources of interest.

[0009] In some aspects, the sound source classification module identifies and classifies audio sources within the input audio signal, such as speech sources, music sources, and noise sources. For short frames, the sound source classification module outputs a sound source activation probability for each identified audio source. The sound source classification module may be, for example, a machine learning model or some other audio analysis algorithm.

[0010] In some aspects, the trust activation tracker module receives source activation probabilities for each audio source from the sound source classification module. The trust activation tracker module may verify whether an audio source is activated based on its respective source activation probability exceeding a threshold for a predetermined number of frames. The predetermined number of frames may be unique for each audio source. In some aspects, when the source activation probability of an audio source falls below the threshold, the trust activation tracker module may determine that the audio source is inactive and reset the tracking for that audio source.

[0011] In some aspects, the long-term energy estimation module operates only on audio sources indicated as active and valid by the trust activation tracker module. The long-term energy estimation module can estimate both instantaneous and cumulative energy for active audio sources. Instantaneous energy may be stored in an energy history buffer associated with each audio source, while cumulative energy is the sum of the instantaneous energies.

[0012] In some aspects, the long-term energy estimation module calculates the long-term RMS of an audio source by dividing the cumulative energy by the number of frames. The long-term energy estimation module can track the audio source with the highest long-term RMS.

[0013] In some aspects, the automatic gain control module updates the short-time gain based on the long-term RMS and the target RMS. For example, the initial short-time gain may be set to 1. The short-time gain may then be updated by adding a gain increment to it. In some cases, the target RMS is based on the long-term RMS of the sound source with the highest sound source activation probability.

[0014] In some aspects, the rate of gain increment is controlled by attack time and release time parameters. The attack time parameter may be used to calculate the gain increment when the short-time gain is less than the target gain. The release time parameter may be used to calculate the gain increment when the short-time gain is greater than the target gain.
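
The attack and release behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the source says the increment rate is controlled by attack and release times but does not give the smoothing law, so a one-pole exponential smoother is assumed here, and the function name `gain_increment`, the 20 ms frame duration, and the time constants are hypothetical.

```python
import math


def gain_increment(gain, target_gain, frame_s=0.020, attack_s=0.5, release_s=2.0):
    """Return the increment to add to `gain` for one frame.

    Assumption: one-pole exponential smoothing. The source only states that
    the attack time is used while the gain is below the target and the
    release time while it is above.
    """
    time_const = attack_s if gain < target_gain else release_s
    coeff = 1.0 - math.exp(-frame_s / time_const)
    return (target_gain - gain) * coeff


gain = 1.0  # initial short-time gain, as stated in the text
for _ in range(50):  # 50 frames of 20 ms, i.e. one second of audio
    gain += gain_increment(gain, target_gain=2.0)
```

With these hypothetical values the gain rises toward the target faster than it would fall back, because the attack time constant is shorter than the release time constant.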

[0015] In some aspects, maximum allowable gain is implemented to prevent clipping. For example, the maximum allowable gain may be stored in a local history buffer, from which the minimum value is used to limit the current short-time gain. The maximum allowable gain can be calculated by dividing the maximum allowable amplitude by the maximum amplitude of the short-time signal.
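
A minimal sketch of this clipping guard is shown below. The class name `ClipGuard`, the history length, and the full-scale amplitude of 1.0 are assumptions for illustration; only the ratio (maximum allowable amplitude divided by the frame's peak amplitude) and the use of the minimum over a local history come from the text.

```python
import math
from collections import deque


class ClipGuard:
    """Limits the short-time gain using a history of maximum allowable gains."""

    def __init__(self, history_len=5, max_amplitude=1.0):
        self.max_amplitude = max_amplitude        # assumed full-scale amplitude
        self.history = deque(maxlen=history_len)  # local history buffer

    def limit(self, gain, frame):
        peak = max((abs(x) for x in frame), default=0.0)
        # A silent frame imposes no constraint, so it is recorded as infinity.
        allowable = self.max_amplitude / peak if peak > 0.0 else math.inf
        self.history.append(allowable)
        # The minimum over the history limits the current short-time gain.
        return min(gain, min(self.history))
```

Taking the minimum over several frames, rather than only the current one, keeps the limit from relaxing immediately after a loud frame.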

[0016] In some aspects, the short-time gain is a frequency-dependent gain based on available estimates of the noise spectrum. For example, a frequency-dependent gain can be derived for noise bins (bins with amplitudes smaller than or equal to the noise spectrum) and non-noise bins (bins with amplitudes larger than the noise spectrum). The short-time gain can be applied to non-noise bins and / or noise bins.
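
The split into noise bins and non-noise bins can be illustrated with a short sketch. The function name `per_bin_gains` and the choice to leave noise bins at unity gain by default are assumptions; the source states only how the bins are distinguished and that the gain may be applied to either or both groups.

```python
def per_bin_gains(magnitudes, noise_spectrum, short_time_gain, noise_gain=1.0):
    """Return one gain per frequency bin.

    A bin counts as a noise bin when its magnitude is less than or equal to
    the noise estimate for that bin. Assumption: noise bins receive
    `noise_gain` (unity by default) and all other bins receive
    `short_time_gain`.
    """
    if len(magnitudes) != len(noise_spectrum):
        raise ValueError("magnitudes and noise_spectrum must have equal length")
    return [
        noise_gain if mag <= noise else short_time_gain
        for mag, noise in zip(magnitudes, noise_spectrum)
    ]


gains = per_bin_gains([0.1, 0.5, 0.2], [0.2, 0.2, 0.2], short_time_gain=2.0)
```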

[0017] Another example provides a method for real-time level control of audio content, which includes receiving an input audio signal, generating a short-time signal based on the input audio signal, identifying one or more sources of interest associated with the short-time signal using a source classifier, estimating a long-term root mean square energy (long-term RMS) for each of the one or more sources of interest using a long-term energy estimator, estimating a set of short-time gains based at least partially on the long-term RMS of at least one of the sources of interest using automatic gain control, and applying the short-time gains to the short-time signal.

[0018] Various aspects of this disclosure provide audio signal processing and bring about improvements in technical fields such as audio processing, audio encoding, audio decoding, virtual reality, and object-based audio.

[0019] Embodiments described herein may generally be described as "technical," and the term "technical" may refer to systems, devices, methods, computer-readable instructions, modules, components, hardware logic, and / or operations, as suggested by the context in which it applies herein.

[0020] Features and technical advantages other than those explicitly described above will become apparent from the following detailed description and consideration of the related drawings. This summary is provided to introduce a selection of the technology in a simplified form and is not intended to identify the key or essential features of the claimed subject matter, which is defined by the appended claims.

Brief Description of the Drawings

[0021] These and other more detailed and specific features of the various embodiments will be more fully disclosed in the following description with reference to the accompanying drawings.

[0022] [Figure 1] FIG. 1 is a block diagram of an exemplary audio coding system in which various aspects of the present disclosure may be implemented.

[0023] [Figure 2] FIG. 2 shows a block diagram of a framework for audio level management according to some aspects of the present disclosure.

[0024] [Figure 3] FIG. 3 shows a block diagram of another framework for audio level management according to some aspects of the present disclosure.

[0025] [Figure 4] FIG. 4 shows block diagrams of various exemplary methods for audio level management that may be performed by the audio coding system of FIG. 1 according to various aspects of the present disclosure.

[0026] [Figure 5A] FIG. 5A shows a schematic block diagram of an exemplary device architecture that may be used to implement various aspects of the present disclosure.

[0027] [Figure 5B] FIG. 5B is a schematic block diagram of an exemplary CPU implemented in the device architecture of FIG. 5A, which may be used to implement various aspects of this disclosure.

[Modes for carrying out the invention]

[0028] The following description includes numerous details, such as audio device configuration, timing, and operation, to provide an understanding of one or more aspects of the present disclosure. It will be readily apparent to those skilled in the art that these specific details are merely examples and are not intended to limit the scope of this application.

[0029] Where used herein, the term “including” and its variations shall be read as non-restrictive terms meaning “including, but not limited to.” The term “or” shall be read as “and / or” unless the context explicitly indicates otherwise. Such terms should be read as having an inclusive meaning. For example, “A and B” could mean at least: “both A and B,” or “at least both A and B.” As another example, “A or B” could mean at least: “at least A,” “at least B,” “both A and B,” or “at least both A and B.” As yet another example, “A and / or B” could mean at least: “A and B,” or “A or B.” Where an exclusive OR is intended, it shall be specifically stated (e.g., “either A or B,” or “at least one of A and B”). The term “based on…” should be read as “at least….” The terms “one exemplary implementation” and “a certain exemplary implementation” should be read as “at least one exemplary implementation.” The term “another implementation” should be read as “at least one other implementation.” The terms “determined,” “determine,” or “to determine” should be read as “to obtain,” “to receive,” “to calculate,” “to estimate,” “to predict,” or “to derive.” Furthermore, in the following description and claims, unless otherwise defined, all technical and scientific terms used herein have the same meaning as those generally understood by those skilled in the art to which this disclosure belongs.

[0030] The following is a list of various acronyms that may appear throughout this disclosure and in the relevant claims and / or drawings. Other commonly used acronyms and technical terms may be omitted from this list for brevity. A short list of acronyms is provided below for the reader's convenience.

IVAS - Immersive Voice and Audio Services
RMS - Root Mean Square
STG - Short-Time Leveling Gain
AT - Attack Time
RT - Release Time

[0031] Figure 1 shows a block diagram of an audio coding system that can incorporate various aspects of the present invention. An exemplary audio coding system 100 includes an encoder 110 and a decoder 120. The input of encoder 110 corresponds to a first signal path 105, and the output of encoder 110 corresponds to a second signal path 115. The input of decoder 120 corresponds to the second signal path 115, and the output of decoder 120 corresponds to a third signal path 125.

[0032] Encoder 110 is configured to receive one or more streams of audio information representing one or more channels of an audio signal from a first signal path 105. Encoder 110 is further configured to process the received streams of audio information and generate an encoded signal that can be output to a second signal path 115. In the second signal path 115, the encoded signal can be stored (e.g., captured, buffered, and / or recorded) or transmitted (e.g., via a wired or wireless communication medium). Decoder 120 is configured to receive the encoded signal from the second signal path 115. Decoder 120 is further configured to process the received encoded signal and generate a decoded signal that can be output to a third signal path 125. The decoded signal generated by decoder 120 corresponds to a replica of the audio information previously received by encoder 110 from the first signal path 105. In the third signal path, the decoded signal can be stored (e.g., captured and / or recorded), transmitted (e.g., via wireless or wired electronic communication media), or output to a listening device (e.g., a receiver, speaker, soundbar, game system, portable multifunction device (e.g., mobile phone or tablet), smart glasses, or audio processing device such as an XR / VR / AR / MR headset). The audio coding system 100 may be an audio system capable of implementing audio codec standards such as the Immersive Voice and Audio Services (IVAS) standard. In such cases, the encoded signal in the second signal path 115 may correspond to an IVAS bitstream.

[0033] In the various examples described herein, the terms “replica” and “replica signal” are not intended to mean that the stream of audio information is “identical.” Instead, the term “replica” may indicate that the stream of audio information is substantially the same as the original audio information (for example, a reconstruction representing the original audio information). For example, if encoder 110 uses lossless encoding techniques to produce the encoded signal, decoder 120 can, in principle, reconstruct a lossless version from the stream that is substantially the same as the original audio information. However, in examples where encoder 110 uses lossy encoding techniques, the content of the reconstructed replica signal may not be identical to the content of the original stream but may be perceptually indistinguishable from the original content. Thus, the terms “replica” and “replica signal” are intended to cover both lossless and lossy encoding techniques as used herein.

[0034] The examples described herein provide real-time content-aware level management (e.g., loudness level) of audio content. In some cases, a time-varying gain is estimated and applied to audio sources (e.g., audio objects) within the audio content. In some cases, the time-varying gain may boost or attenuate the audio sources to achieve a constant loudness or energy level of the audio content. The examples described herein primarily refer to three types of audio contained in audio content: music, speech, and noise. However, the exemplary systems, methods, and devices described herein may also be used in other scenarios involving different audio types, fewer audio types, or more audio types. Audio types may be referred to herein as audio sources or sound sources. Noise sources may be undesirable noise (e.g., audio without speech or music), such as background noise, ambient noise, and other disruptive noise. In some cases, quiet or obviously loud music or speech may be considered noise. The noise referred to herein generally includes sounds that may not be further processed by framework 200 (see Figure 2) or framework 300 (see Figure 3).

[0035] Figure 2 shows a block diagram of a framework 200 for audio level management, according to some aspects of the present disclosure. The framework 200 may be implemented, for example, by an encoder 110, a decoder 120, or a combination thereof. The framework 200 includes a sound source classification module 202, a trust activation tracker module 204, a verification module 206, a long-term energy estimation module 208, and a gain control module 212. The long-term energy estimation module 208 may include an energy history buffer 210. The gain control module 212 may include a maximum gain history buffer 214.

[0036] The input of the sound source classification module 202 corresponds to the first path 201. The output of the sound source classification module 202 corresponds to the second path 203. The sound source classification module 202 is coupled to the trust activation tracker module 204 via the second path 203. The sound source classification module 202 is configured to receive input audio content (for example, audio information received via the first signal path 105 in Figure 1) via the first path 201.

[0037] The sound source classification module 202 processes the input audio content to identify and classify the audio sources within it. For example, the sound source classification module 202 includes an audio analysis algorithm, such as a trained machine learning model, that identifies the audio source or sound of interest within the input audio content. The audio source may include, for example, speech, music, noise, etc. In some cases, the audio source may be more specific, such as the type of music, a specific instrument, or a specific speaker.

[0038] The sound source classification module 202 generates a source activation probability, which is a value indicating the likelihood (e.g., probability or confidence value) that the received input audio content contains the associated audio source. For example, for one frame of input audio content, the sound source classification module 202 may determine (e.g., by computation, estimation, or using an audio analysis algorithm) that there is a 90% chance that the input audio content contains speech, a 32% chance that the input audio content contains music, and a 22% chance that the input audio content contains noise. In some cases, the sound source classification module 202 operates in real time (e.g., per frame). In other cases, the sound source classification module 202 operates using historical data (e.g., using previous frames). The sound source classification module 202 is configured to provide the source activation probability to the trust activation tracker module 204 via the second path 203.

[0039] The input of the trust activation tracker module 204 corresponds to the second path 203. The trust activation tracker module 204 is configured to receive the source activation probability from the sound source classification module 202 via the second path 203. The output of the trust activation tracker module 204 corresponds to the third path 205. The trust activation tracker module 204 is coupled to the verification module 206 via the third path 205.

[0040] The trust activation tracker module 204 processes the source activation probabilities from the sound source classification module 202 to determine the activation status of each sound source identified by the sound source classification module 202. For example, the trust activation tracker module 204 performs buffer operations and stores indications of whether each sound source is active during each frame. The trust activation tracker module 204 may include, for example, a sliding window buffer. A separate buffer may be provided for the source activation probabilities associated with each sound source identified by the sound source classification module 202, thereby monitoring each sound source separately.

[0041] To determine if a sound source is active, the trust activation tracker module 204 may compare the source activation probability to a confidence threshold in each frame. For example, consider a speech audio source. If the source activation probability of the speech audio source exceeds a confidence threshold (e.g., 90% confidence) over a predetermined number of frames (e.g., 10 frames), the trust activation tracker module 204 determines that the speech audio source is active for that frame. If the source activation probability of the speech audio source falls below the confidence threshold for a single frame, the trust activation tracker module 204 determines that the speech audio source is not active (e.g., inactive) for that frame.

[0042] In addition, the trust activation tracker module 204 can determine whether an active sound source is valid across multiple frames. For example, if a speech audio source is active for a predetermined number of frames (e.g., 10 frames), the trust activation tracker module 204 determines that the activation of the speech audio source is valid. If the speech audio source is inactive for a given frame, the trust activation tracker module 204 determines that the activation of the speech audio source is invalid and may resume monitoring the speech audio source.

[0043] The confidence threshold and the predetermined number of frames implemented by the trust activation tracker module 204 may be set separately for different audio sources. In this way, the confidence threshold and predetermined number of frames for a music audio source may differ from the confidence threshold and predetermined number of frames for a speech audio source, and each audio source may be treated differently (for example, using content-specific thresholds for activation tracking by the trust activation tracker module 204).
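
The activation tracking described in the preceding paragraphs can be sketched as a per-source counter of consecutive frames above the threshold. The class name `ActivationTracker` and the music settings are hypothetical; the 90% threshold and 10-frame count for speech are the examples given in the text. A simple run-length counter is assumed here in place of the sliding window buffer the text mentions.

```python
class ActivationTracker:
    """Tracks whether one audio source has been confidently active long enough."""

    def __init__(self, threshold=0.9, required_frames=10):
        self.threshold = threshold
        self.required_frames = required_frames
        self.run_length = 0  # consecutive frames above the threshold

    def update(self, probability):
        """Feed one frame's activation probability; return True if activation is valid."""
        if probability > self.threshold:
            self.run_length += 1
        else:
            self.run_length = 0  # source judged inactive: reset the tracking
        return self.run_length >= self.required_frames


# One tracker per source, each with its own content-specific settings.
trackers = {
    "speech": ActivationTracker(threshold=0.9, required_frames=10),
    "music": ActivationTracker(threshold=0.8, required_frames=20),  # hypothetical values
}
```

A single frame below the threshold invalidates the activation, so the source must again stay above the threshold for the full frame count before it is treated as valid.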

[0044] Such operations are performed for each audio source (e.g., a music audio source). In some implementations, the trust activation tracker module 204 ignores the source activation probability associated with a noise audio source. The trust activation tracker module 204 provides the source activation validity (e.g., whether each sound source is valid) to the verification module 206 via a third path 205.

[0045] In some cases, the input audio content received by the trust activation tracker module 204 is a short-time signal containing short-time frames (e.g., subframes of the audio content). A short-time frame may be, for example, a 20 ms waveform represented by coding coefficients. A short-time frame may also be a waveform of other lengths, such as a 10 ms waveform or a 25 ms waveform. In other cases, the trust activation tracker module 204 receives an input audio signal and generates a short-time signal based on the input audio signal.
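
As a minimal illustration of generating short-time frames from an input signal, the sketch below cuts a list of samples into non-overlapping 20 ms frames. The function name `split_into_frames`, the 48 kHz sample rate, and the decision to drop a trailing partial frame are assumptions; the source does not specify them, and it describes frames represented by coding coefficients rather than raw samples.

```python
def split_into_frames(samples, sample_rate=48000, frame_ms=20):
    """Split `samples` into consecutive, non-overlapping frames of `frame_ms`.

    Assumption: a trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


frames = split_into_frames(list(range(2000)))  # 960 samples per frame at 48 kHz
```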

[0046] The inputs of the verification module 206 correspond to the first path 201 and the third path 205. The verification module 206 is coupled to the trust activation tracker module 204 via the third path 205. The verification module 206 receives input audio content via the first path 201. The verification module 206 receives source activation status from the trust activation tracker module 204 via the third path 205. The output of the verification module 206 corresponds to the fourth path 207. The verification module 206 is coupled to the long-term energy estimation module 208 via the fourth path 207.

[0047] The verification module 206 is configured to provide input audio content (as transferred input audio content) to the long-term energy estimation module 208 via the fourth path 207 when the source activation status from the trust activation tracker module 204 indicates that one of the audio sources is valid. When one of the audio sources is invalid, the verification module 206 does not provide input audio content to the long-term energy estimation module 208. That is, as shown in Figure 2, the transferred input audio content provided to the long-term energy estimation module 208 by the verification module 206 via the fourth path 207 may be a subset of the input audio content received by the sound source classification module 202 via the first path 201 (for example, the verification module 206 passes only the verified portion of the input audio content to the long-term energy estimation module 208), or it may include all of the input audio content initially received. In some cases, the operation of the verification module 206 may be performed by the trust activation tracker module 204 or the long-term energy estimation module 208, rather than by a separate module.

[0048] Since the sound source classification module 202 and the trust activation tracker module 204 can operate in real time, the input audio content may be noisy. However, the operation of the sound source classification module 202 and the trust activation tracker module 204 allows the use of noisy audio while rejecting inaccurate detection of activated sound sources, thereby reducing the complexity in audio analysis.

[0049] The input of the long-term energy estimation module 208 corresponds to the fourth path 207. The long-term energy estimation module 208 is coupled to the verification module 206 via the fourth path 207. The long-term energy estimation module 208 is configured to receive input audio content from the verification module 206 via the fourth path 207. The output of the long-term energy estimation module 208 corresponds to the fifth path 209. The long-term energy estimation module 208 is coupled to the gain control module 212 via the fifth path 209.

[0050] The long-term energy estimation module 208 tracks the instantaneous and cumulative energy of audio sources within the input audio content. The energy history buffer 210 can store the cumulative energy, which is updated by adding the instantaneous energy for each frame to the energy history buffer 210. The long-term energy estimation module 208 can track the instantaneous and cumulative energy of audio sources as real-time root mean square (RMS) values. For example, when the trust activation tracker module 204 indicates that an audio source is active, the long-term energy estimation module 208 first adds the instantaneous energy of the audio source to the energy history buffer 210. The long-term RMS value of the audio source is calculated by the long-term energy estimation module 208 by dividing the cumulative energy in the energy history buffer 210 by the number of "active" frames (e.g., active frame count).
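
The accumulation described above can be sketched as follows. Two points are assumptions rather than statements from the source: the instantaneous energy of a frame is taken as the mean square of its samples, and a square root is applied to the averaged energy so that the result is in amplitude units. The source itself specifies only that the cumulative energy is divided by the active frame count. The class name `LongTermEnergy` is hypothetical.

```python
import math


class LongTermEnergy:
    """Accumulates per-frame energy for one source over its active frames."""

    def __init__(self):
        self.cumulative_energy = 0.0
        self.active_frames = 0

    def update(self, frame, is_active):
        """Add the frame's energy when the source is active; return the long-term RMS."""
        if is_active and frame:
            instantaneous = sum(x * x for x in frame) / len(frame)  # assumed mean square
            self.cumulative_energy += instantaneous
            self.active_frames += 1
        return self.long_term_rms()

    def long_term_rms(self):
        if self.active_frames == 0:
            return 0.0
        return math.sqrt(self.cumulative_energy / self.active_frames)
```

Frames in which the source is not active leave both the cumulative energy and the frame count unchanged, so silence or other content does not dilute the estimate.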

[0051] In some cases, the long-term energy estimation module 208 also tracks which audio source has the highest long-term RMS value. For example, the long-term energy estimation module 208 stores the highest long-term RMS value and an indicator of the associated audio source. After calculating the long-term RMS value for the active audio source in each frame, the long-term energy estimation module 208 determines whether the long-term RMS value exceeds the highest long-term RMS value currently stored. If the long-term RMS value for the active audio source in the current frame exceeds the highest long-term RMS value stored, the long-term energy estimation module 208 overwrites the highest long-term RMS value.
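
The overwrite rule for the highest long-term RMS can be written in a few lines. The function name `update_highest` and the tuple layout are hypothetical.

```python
def update_highest(current_best, source, long_term_rms):
    """Return the (source, rms) pair with the highest long-term RMS seen so far.

    `current_best` is None before any source has been evaluated.
    """
    if current_best is None or long_term_rms > current_best[1]:
        return (source, long_term_rms)  # overwrite the stored value and indicator
    return current_best


best = None
for name, rms in [("speech", 0.1), ("music", 0.3), ("speech", 0.2)]:
    best = update_highest(best, name, rms)
```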

[0052] The long-term energy estimation module 208 refers only to audio content that has an “active” audio source. If the continuous activation of the audio source is not maintained (for example, if the trust activation tracker module 204 indicates that the activation of the audio source is “inactive”), the operation of the long-term energy estimation module 208 is not performed for the relevant frame. In some examples, the energy history buffer 210 is a rolling window buffer such that the oldest value of accumulated energy is replaced with the most recent value of accumulated energy. The long-term energy estimation module 208 is configured to provide the long-term RMS value to the gain control module 212 via a fifth path 209. The long-term energy estimation module 208 may also provide the highest stored long-term RMS value and an indicator of the relevant audio source to the gain control module 212 via the fifth path 209.
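
A rolling window variant of the energy history buffer might look like the sketch below, where a bounded `collections.deque` discards the oldest entry once the window is full. The class name `RollingEnergy` and the window length are assumptions, and the buffer is assumed to hold per-frame energies supplied by the caller.

```python
from collections import deque


class RollingEnergy:
    """Keeps only the most recent `window_frames` energy values."""

    def __init__(self, window_frames=500):
        self.buffer = deque(maxlen=window_frames)  # oldest value is dropped when full

    def add(self, energy):
        self.buffer.append(energy)

    def mean_energy(self):
        return sum(self.buffer) / len(self.buffer) if self.buffer else 0.0


rolling = RollingEnergy(window_frames=2)
for value in (1.0, 3.0, 5.0):
    rolling.add(value)
```

Compared with an unbounded running sum, a bounded window lets the estimate follow level changes that occur later in a long session.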

[0053] The input to the gain control module 212 corresponds to the fifth path 209. The gain control module 212 is coupled to the long-term energy estimation module 208 via the fifth path 209. The gain control module 212 is configured to receive long-term RMS values from the long-term energy estimation module 208 via the fifth path 209. The output of the gain control module 212 corresponds to the sixth path 211.

[0054] The gain control module 212 is configured to determine (e.g., estimate or calculate) a short-time leveling gain to apply to audio content based on the long-term RMS value provided by the long-term energy estimation module 208 and a predetermined target RMS. The gain control module 212 may determine the short-time leveling gain once a predetermined number of frames have been processed by the long-term energy estimation module 208 and the associated audio source is “ready” for gain updating.

[0055] When updating the short-time leveling gain, the gain control module 212 considers all audio sources that are "ready" (for example, active for a predetermined number of frames). The target gain (in dB) can be calculated by the gain control module 212 according to Equation 1:

$$\mathrm{TargetGain}_{dB} = \mathrm{targetRMS}_{dB} - \mathrm{LTRMS}_{dB} \tag{1}$$

Here, $\mathrm{targetRMS}_{dB}$ is the predetermined target RMS, and $\mathrm{LTRMS}_{dB}$ is the long-term RMS value provided by the long-term energy estimation module 208.
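As a worked illustration of Equation 1, the target gain is the difference between the two levels in dB. The function names and the numeric values below are assumptions chosen for the example; the conversion to a linear amplitude factor uses the conventional 20·log10 relation, which the disclosure does not state explicitly.

```python
def target_gain_db(target_rms_db, lt_rms_db):
    # Equation 1: TargetGain_dB = targetRMS_dB - LTRMS_dB
    return target_rms_db - lt_rms_db


def db_to_linear(gain_db):
    # Assumption: gains are amplitude gains, so 20 dB corresponds to a factor of 10.
    return 10.0 ** (gain_db / 20.0)


# Example: a source measured at -30 dB with a -23 dB target needs +7 dB of gain.
example_gain_db = target_gain_db(-23.0, -30.0)
```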

[0056] In some cases, when multiple audio sources are ready for gain updating, the gain control module 212 calculates a single long-term RMS value from the multiple audio sources. For example, the gain control module 212 may select the audio source that has the highest source activation probability for a given frame (or the highest average source activation probability over multiple frames). In another example, the gain control module 212 selects the audio source associated with the highest long-term RMS value, such as that provided by the long-term energy estimation module 208. In yet another example, the gain control module 212 uses the source activation probability values associated with the audio sources to calculate a weighted sum of long-term RMS values for multiple audio sources.
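The three options above can be sketched as follows. The dictionary keys, the function names, the normalization of the weights by their sum, and the choice to combine values in the dB domain are all assumptions made for this illustration.

```python
def select_by_probability(sources):
    # Option 1: use the source with the highest activation probability.
    return max(sources, key=lambda s: s["prob"])["lt_rms_db"]


def select_highest_rms(sources):
    # Option 2: use the source with the highest long-term RMS value.
    return max(s["lt_rms_db"] for s in sources)


def weighted_rms(sources):
    # Option 3: probability-weighted sum of long-term RMS values.
    # Assumption: weights are normalized so that they sum to one.
    total = sum(s["prob"] for s in sources)
    return sum(s["prob"] * s["lt_rms_db"] for s in sources) / total


ready_sources = [
    {"name": "speech", "prob": 0.75, "lt_rms_db": -30.0},
    {"name": "music", "prob": 0.25, "lt_rms_db": -20.0},
]
```

For the two example sources, the first option yields the speech level, the second yields the music level, and the third yields a value between the two.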

[0057] In some implementations, additional parameters are implemented to control the short-time leveling gain (STG) calculation in order to prevent rapid gain changes. For example, the attack time (AT) parameter can indicate the number of frames required to increase from the STG to the target gain. The release time (RT) parameter can indicate the number of frames required to decrease from the STG to the target gain. The AT and RT parameters can be unique for each audio source. Using these parameters, assuming an initial gain is set to a value of 1, the STG can be updated according to Equations 2 - 4:

$$\mathrm{STG}_{dB} = \mathrm{STG}_{dB} + \Delta G_{dB} \tag{2}$$

Here, ΔG is the change in STG.

[Equations (3) and (4), which define ΔG, are not reproduced in this text.]
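Because the definition of ΔG is not available in this text, the following is only a sketch under an explicit assumption: ΔG is taken to be the average per-frame increment needed to move from the current STG to the target gain, using AT frames when the gain must increase and RT frames when it must decrease. This assumption is drawn from the enumerated example embodiments and is not a reproduction of Equations 3 and 4. All function and parameter names are hypothetical.

```python
def gain_increment_db(stg_db, target_db, attack_frames, release_frames):
    # Assumed form of the increment: average step per frame toward the target.
    diff = target_db - stg_db
    if diff > 0:
        return diff / attack_frames   # gain must rise: spread over AT frames
    if diff < 0:
        return diff / release_frames  # gain must fall: spread over RT frames
    return 0.0


def ramp(stg_db, target_db, attack_frames, release_frames):
    """Return the STG after each frame while moving toward the target gain."""
    delta = gain_increment_db(stg_db, target_db, attack_frames, release_frames)
    if delta == 0.0:
        return []
    frames = attack_frames if delta > 0 else release_frames
    trajectory = []
    for _ in range(frames):
        stg_db += delta  # Equation 2: STG_dB = STG_dB + dG_dB
        trajectory.append(stg_db)
    return trajectory
```

Starting from an initial gain of 1 (0 dB), a 6 dB target with AT = 4 frames is reached in four equal steps of 1.5 dB in this sketch.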

[0058] In some implementations, AT and RT are time-varying parameters. For example, the AT and RT parameters can be adapted by the gain control module 212 such that ΔG remains constant, or a fixed amount, during gain boosting or attenuation. Further, the gain control module 212 can include a maximum gain history buffer 214 that stores the maximum allowable STG. In some cases, the maximum allowable STG can be the result of dividing 1 by the maximum amplitude of the short-time waveform of the audio content. To prevent clipping, the maximum allowable STG can be tracked by the gain control module 212 and implemented according to Equation 5:

$$\mathrm{STG}_{dB} = \min(\mathrm{STG}_{dB}, \mathrm{MSTG}_{dB}) \tag{5}$$

Here, MSTG is the maximum allowable STG.
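A minimal sketch of this clipping guard is given below. It assumes a full-scale amplitude of 1 and amplitude gains expressed as 20·log10; the function names and the handling of an all-zero frame are assumptions made for the example.

```python
import math


def max_allowable_gain_db(frame, max_amplitude=1.0):
    # MSTG: the largest gain that keeps the frame peak at or below full scale.
    peak = max(abs(x) for x in frame)
    if peak == 0.0:
        return math.inf  # assumption: a silent frame imposes no limit
    return 20.0 * math.log10(max_amplitude / peak)


def limit_gain_db(stg_db, mstg_db):
    # Equation 5: STG_dB = min(STG_dB, MSTG_dB)
    return min(stg_db, mstg_db)
```

For a frame whose peak is 0.5, the maximum allowable gain is about 6.02 dB, so a requested 10 dB is reduced while a requested 3 dB passes unchanged.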

[0059] In some cases, if the maximum allowable STG is updated in each short frame, the resulting STG may fluctuate. However, if the maximum allowable STG is updated as a global minimum, the resulting STG may not be effective in gain leveling. Therefore, the gain control module 212 can track the maximum allowable STG in the maximum gain history buffer 214 and update the maximum allowable STG locally. The duration for which the maximum gain history buffer 214 is updated can be selected for each audio source to achieve effective leveling without clipping.
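The local update described above can be sketched as a minimum taken over a short history of maximum allowable gains, rather than over a single frame or over the whole signal. The class name and the window length are hypothetical.

```python
from collections import deque


class MaxGainHistory:
    """Hypothetical local history of maximum allowable gains (in dB)."""

    def __init__(self, n_frames):
        # Only the most recent n_frames limits are retained.
        self.history = deque(maxlen=n_frames)

    def limit(self, stg_db, mstg_db):
        # Store the newest limit, then cap the STG by the local minimum.
        self.history.append(mstg_db)
        return min(stg_db, min(self.history))
```

A longer window makes the limit steadier at the cost of holding the gain down for longer after a loud frame, which is consistent with selecting the duration per audio source.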

[0060] In some cases, only gain boosting is performed (no attenuation is performed). In such cases, weaker audio sources are boosted to the minimum target level, resulting in sufficient loudness. In such cases, the target gain may be constrained by the running maximum of the target gain.
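One reading of the running-maximum constraint is that the target gain is never allowed to decrease. The sketch below follows that reading; the class name and the choice to start the running maximum at 0 dB (unity gain), so that no attenuation occurs, are assumptions.

```python
class BoostOnlyTarget:
    """Hypothetical running maximum of the target gain (in dB)."""

    def __init__(self):
        # Assumption: start at 0 dB so the constrained target never attenuates.
        self.running_max_db = 0.0

    def constrain(self, target_db):
        # The constrained target can only stay the same or increase.
        self.running_max_db = max(self.running_max_db, target_db)
        return self.running_max_db
```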

[0061] The gain control module 212 applies the resulting STG to the input short-time audio signal to generate a leveled output signal. The gain control module 212 provides the leveled output signal via a sixth path 211. In Figure 1, the leveled output signal can be encoded by the encoder 110 and provided within the encoded signal via a second signal path 115.
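Applying the STG to a short-time signal can be illustrated as a per-sample multiplication by the linear equivalent of the gain. The function name and the 20·log10 amplitude convention are assumptions made for this sketch.

```python
def apply_gain(frame, stg_db):
    # Convert the dB gain to a linear amplitude factor and scale each sample.
    gain = 10.0 ** (stg_db / 20.0)
    return [gain * x for x in frame]
```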

[0062] In some cases, noise estimation and reduction may be performed to achieve frequency-dependent STG. Figure 3 shows a block diagram of a framework 300 for audio level management in some aspects of this disclosure. Framework 300 may be implemented, for example, by an encoder 110, a decoder 120, or a combination thereof. Framework 300 includes a sound source classification module 202, a trust activation tracker module 204, a verification module 206, a long-term energy estimation module 208, a gain control module 312, and a noise estimation and reduction module 302. The sound source classification module 202, the trust activation tracker module 204, the verification module 206, and the long-term energy estimation module 208 may operate substantially as described with respect to framework 200. The gain control module 312 may include a frequency-dependent gain buffer 304.

[0063] The input of the noise estimation and reduction module 302 corresponds to the first path 201. The noise estimation and reduction module 302 receives input audio content via the first path 201. The output of the noise estimation and reduction module 302 corresponds to the seventh path 303. The noise estimation and reduction module 302 is coupled to the gain control module 312 via the seventh path 303.

[0064] The noise estimation and reduction module 302 is configured to process the received input audio content and classify the observed spectral bins of the input audio content into noise bins (having spectral amplitudes lower than the estimated noise spectrum) and non-noise bins (having spectral amplitudes higher than the estimated noise spectrum). The estimated noise spectrum may be a predetermined (e.g., pre-loaded) spectrum stored by the noise estimation and reduction module 302. The noise estimation and reduction module 302 is configured to provide the noise bins and non-noise bins to the gain control module 312.

[0065] The inputs to the gain control module 312 correspond to the seventh path 303 and the fifth path 209. The gain control module 312 is coupled to the long-term energy estimation module 208 via the fifth path 209. The gain control module 312 receives long-term RMS values from the long-term energy estimation module 208 via the fifth path 209. The gain control module 312 is coupled to the noise estimation and reduction module 302 via the seventh path 303. The gain control module 312 receives noise bins and non-noise bins from the noise estimation and reduction module 302 via the seventh path 303.

[0066] The gain control module 312 may operate in a similar manner to the gain control module 212 described with respect to Figure 2. The gain control module 312 may also calculate the STG based on the noise bins and non-noise bins received from the noise estimation and reduction module 302. For example, for a noise bin, using a noise parameter α, the gain control module 312 may apply $(1-\alpha)\cdot\mathrm{STG}_{dB}$ for gain boosting ($\mathrm{STG}_{dB} \geq 0$) and $(1+\alpha)\cdot\mathrm{STG}_{dB}$ for gain attenuation ($\mathrm{STG}_{dB} < 0$), where $0 \leq \alpha \leq 1$. In this way, the gain control module 312 prevents the re-boosting of noise that may have been suppressed by the noise estimation and reduction module 302. Thus, the SNR can be further improved by implementing the noise estimation and reduction module 302. In some examples, the STG is smoothed at each frequency by a single-pole low-pass filter.
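The per-bin scaling can be sketched as follows. The function names are hypothetical, the treatment of a bin whose amplitude equals the noise estimate as a noise bin follows the enumerated example embodiments, and leaving the STG unchanged for non-noise bins is likewise taken from those embodiments.

```python
def classify_bins(magnitudes, noise_spectrum):
    # True marks a noise bin: amplitude at or below the estimated noise level.
    return [m <= n for m, n in zip(magnitudes, noise_spectrum)]


def bin_gain_db(stg_db, is_noise_bin, alpha):
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not is_noise_bin:
        return stg_db                 # non-noise bins receive the STG as is
    if stg_db >= 0.0:
        return (1.0 - alpha) * stg_db  # boost: noise bins are boosted less
    return (1.0 + alpha) * stg_db      # attenuation: noise bins are cut more
```

With α = 0.5, a 6 dB boost becomes 3 dB on a noise bin, and a 6 dB cut becomes a 9 dB cut.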

[0067] Since the estimated noise spectrum can vary between frames, the gain control module 312 may apply a simple smoothing to calculate the frequency-dependent STG (FD-STG), as given by Equation 6:

$$\text{FD-STG}[i,b] = \beta \cdot \text{FD-STG}[i-1,b] + (1-\beta) \cdot \text{FD-STG}[i,b] \tag{6}$$

Here, β is the smoothing parameter, i is the time index, and b is the bin index. The FD-STG can be stored in the frequency-dependent gain buffer 304.
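Equation 6 is a first-order recursive smoother applied independently to each bin. A minimal sketch, with a hypothetical function name and with the gains of one frame held in a list indexed by bin, is:

```python
def smooth_fd_stg(previous, current, beta):
    # Equation 6 per bin b: beta * FD-STG[i-1, b] + (1 - beta) * FD-STG[i, b]
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]
```

A value of β close to 1 favors the previous frame and gives a slowly varying gain, while β = 0 disables smoothing.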

[0068] In some implementations, the gain control module 312 also performs the operation of the noise estimation and reduction module 302. In these implementations, noise reduction and FD-STG may be combined, as given by Equation 7, so that the gain processing is performed in a single pass by the gain control module 312:

$$\text{FD-STG}_{dB} = \text{FD-NRG}_{dB} + \text{FD-STG}_{dB} \tag{7}$$

Here, $\text{FD-NRG}_{dB}$ is the frequency-dependent noise reduction gain.
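Because both gains are expressed in dB, combining them is a per-bin addition, which corresponds to multiplying the linear gains. The function names below are hypothetical, and the amplitude convention of 20·log10 is an assumption.

```python
def combine_db(fd_nrg_db, fd_stg_db):
    # Equation 7 per bin: add the noise reduction gain to the leveling gain.
    return [n + s for n, s in zip(fd_nrg_db, fd_stg_db)]


def db_to_linear(gain_db):
    # Assumption: amplitude gains, so 20 dB corresponds to a factor of 10.
    return 10.0 ** (gain_db / 20.0)
```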

[0069] The illustrated blocks of modules in Framework 200 in Figure 2 and Framework 300 in Figure 3 are merely examples for real-time level management of audio content. In other examples, Frameworks 200 and 300 may include additional blocks, omit blocks, combine the functions of blocks, or split parts of blocks into additional blocks.

[0070] Figure 4 shows block diagrams of various exemplary methods 400 for real-time level control of audio content, which may be performed by the framework 200 of Figure 2. Although primarily described in relation to the framework 200, methods 400 may also be performed by the framework 300 of Figure 3. Method 400 may be performed by a processor that may be configured to perform method 400 via machine-executable instructions. Method 400 may be divided into various blocks or partitions, such as blocks 405, 410, 415, 420, 425, and 430. The various process blocks shown in Figure 4 provide examples of the various methods disclosed herein, and it is understood that some blocks may be removed, added, combined, or modified without departing from the spirit of this disclosure. In some examples, the processing of the various blocks, which may be described as processes, methods, steps, blocks, actions, or functions, may begin in block 405.

[0071] In block 405, “Receiving Input Audio Signal,” exemplary method 400 may include receiving an input audio block contained within a frame. For example, referring to Figure 2, the sound source classification module 202 receives an input audio signal in the frame via a first path 201, as previously described. The input audio signal contains an input audio block. Processing can proceed from block 405 to block 410.

[0072] In block 410, “Generating a short-duration signal based on an input audio signal,” exemplary method 400 may include generating a short-duration signal based on an input audio signal. For example, referring to Figure 2, the sound source classification module 202 generates a short-duration signal based on an input audio signal, as described above. In other examples, the input audio signal may be a short-duration signal. Processing can proceed from block 410 to block 415.

[0073] In block 415, “Identifying one or more sources of interest,” the exemplary method 400 may include identifying one or more sources of interest associated with a short-duration signal. For example, referring to Figure 2, the sound source classification module 202 identifies and classifies audio sources in the input audio signal, as previously described. The process can then proceed from block 415 to block 420.

[0074] In block 420, “Estimating Long-Term RMS for Each Source of Interest,” exemplary method 400 may include estimating the long-term RMS for each of the one or more sources of interest, as previously described with respect to the trust activation tracker module 204, the verification module 206, and the long-term energy estimation module 208. The process can proceed from block 420 to block 425.

[0075] In block 425, “Estimating a set of short-time gains,” exemplary method 400 may include estimating a set of short-time gains (STGs) based at least partially on the long-term RMS of at least one of the respective sources of interest, as previously described with respect to the long-term energy estimation module 208 and the gain control module 212. The process can then proceed from block 425 to block 430.

[0076] In block 430, “Applying Short-Time Gain to Short-Time Signals,” the exemplary method 400 may include applying short-time gain to short-time signals, as previously described with respect to the gain control module 212.

[0077] Figure 5A shows a schematic block diagram of an exemplary device architecture 500 (e.g., Apparatus 500) that may be used to implement various aspects of the present disclosure. Architecture 500 includes, but is not limited to, server and client devices, systems, and methods as described with reference to Figures 1 to 4. As illustrated, architecture 500 includes a central processing unit (CPU) 501 that can execute various processes according to, for example, a program stored in read-only memory (ROM) 502, or a program loaded from, for example, a storage unit 508 into random access memory (RAM) 503. The CPU 501 may be, for example, an electronic processor 501. RAM 503 also appropriately stores data required when the CPU 501 executes various processes. The CPU 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output interface 505 is also connected to the bus 504.

[0078] The following components are connected to the I / O interface 505: an input unit 506 which may include a keyboard, mouse, etc.; an output unit 507 which may include a display such as a liquid crystal display (LCD) and one or more speakers; a storage unit 508 which includes a hard disk or another suitable storage device; and a communication unit 509 which includes a network interface card (e.g., wired or wireless).

[0079] In some implementations, the input unit 506 includes one or more microphones located at different positions (depending on the host device) that enable the capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other appropriate formats).

[0080] In some implementations, the output unit 507 includes a system with varying numbers of speakers. The output unit 507 can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other appropriate formats) (depending on the capabilities of the host device).

[0081] In some embodiments, the communication unit 509 is configured to communicate with other devices (for example, via a network). The drive 510 is also connected to the I / O interface 505, if necessary. A removable medium 511, such as a magnetic disk, optical disk, magneto-optical disk, flash drive, or other suitable removable medium, is mounted on the drive 510, so that computer programs read therefrom are installed in the storage unit 508, if necessary. Those skilled in the art will understand that although the apparatus 500 is described as including the above-described components, in actual applications it is possible to add, remove, and / or replace some of these components, and all such modifications or changes will fall within the scope of this disclosure.

[0082] According to exemplary embodiments of the present disclosure, the processes described above may be implemented as a computer software program or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product which includes a computer program tangibly embodied on a machine-readable medium, the computer program which includes program code for performing the method. In such embodiments, the computer program may be downloaded from a network via a communication unit 509 and mounted, and / or installed from a removable medium 511, as shown in Figure 5A.

[0083] Figure 5B shows a schematic block diagram of an exemplary CPU 501 implemented in the device architecture 500 of Figure 5A, which may be used to implement various aspects of the present disclosure. The CPU 501 includes an electronic processor 520 and a memory 521. The electronic processor 520 is electrically and / or communicatively connected to the memory 521 for bidirectional communication. The memory 521 stores leveling software 522. In some examples, the memory 521 may be located inside the electronic processor 520, such as an internal cache memory or some other internally located ROM, RAM, or flash memory. In other examples, the memory 521 may be located outside the electronic processor 520, such as a ROM 502, RAM 503, flash memory, or removable media 511, or another non-transitory computer-readable medium contemplated for the device architecture 500. In some examples, the electronic processor 520 may implement leveling software 522 stored in memory 521 to perform one of the methods 400 shown in Figure 4, among other things.

[0084] In general, various exemplary embodiments of the present disclosure may be implemented in hardware or dedicated circuitry (e.g., control circuits), software, logic, or any combination thereof. For example, the units and modules described above may be executed by a control circuit (e.g., CPU 501 in combination with other components of Figure 5A), and thus the control circuit may perform the actions described in the present disclosure. Some aspects may be implemented in hardware, while others may be implemented in firmware or software that can be executed by a controller, microprocessor, or other computing device (e.g., control circuits). Various aspects of the exemplary embodiments of the present disclosure are illustrated and described using block diagrams, flowcharts, or any other pictorial representation, but it will be understood that the blocks, apparatus, systems, techniques, or methods described herein may, in non-limiting examples, be implemented in hardware, software, firmware, dedicated circuitry or logic, general-purpose hardware or controllers or other computing devices, or any combination thereof.

[0085] Furthermore, the various blocks shown in the flowchart can be viewed as method steps and / or actions resulting from the operation of computer program code and / or as a group of coupled logic circuit elements constructed to perform related functions. For example, embodiments of the present disclosure include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, the computer program including program code configured to perform the methods described above.

[0086] In the context of this disclosure, a machine-readable medium may be any tangible medium that contains or can store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples of machine-readable storage media include electrical connections having one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0087] Computer program code for performing the methods of this disclosure may be written in any combination of one or more programming languages. The computer program code may be provided to a processor of a general-purpose computer, a dedicated computer, or another programmable data processing device having a control circuit, so that when executed by the processor of the computer or other programmable data processing device, the program code performs the functions / operations specified in the flowcharts and / or block diagrams. The program code may be executed entirely on a computer, partially on a computer, as a standalone software package, partially on a computer and partially on a remote computer, entirely on a remote computer or server, or distributed across one or more remote computers and / or servers.

[0088] Those skilled in the art will understand that the present invention is by no means limited to the embodiments described above. Rather, many modifications and variations are possible and are considered to be within the scope of the appended claims. Various aspects and implementations of this disclosure can also be understood from the following enumerated example embodiments (EEEs), which are not claims. These can all represent systems, methods, and devices configured in accordance with the aspects of this disclosure.

[0089] [EEE1] A method for real-time level control of audio content, the method comprising: receiving an input audio signal; generating a short-time signal based on the input audio signal; identifying one or more sources of interest associated with the short-time signal using a source classifier; estimating a long-term root mean square energy (long-term RMS) for each of the one or more sources of interest using a long-term energy estimator; estimating a set of short-time gains based at least partially on the long-term RMS of at least one of the sources of interest using automatic gain control; and applying the short-time gains to the short-time signal.

[0090] [EEE2] The method of EEE1, wherein, while the sound source classifier is used to identify one or more sound sources of interest associated with the short-time signal, the output of the sound source classifier is processed using a trust activation tracker.

[0091] [EEE3] The method according to EEE1 or 2, wherein the sound source classifier is trained to classify sources that include at least one of speech sources, music sources, and noise sources.

[0092] [EEE4] The method according to any one of EEE1 to 3, wherein the sound source classifier outputs an instantaneous sound source activation probability for each of the sound sources of interest for each short-time frame of the short-time signal.

[0093] [EEE5] The method according to any one of EEE1 to 4, further comprising the step of determining the source activation status based in part on a pre-selected probability threshold for each sound source of interest.

[0094] [EEE6] The method according to any one of EEE2 to 5, wherein the trust activation tracker counts consecutive active frames from each source.

[0095] [EEE7] The method according to any one of EEE2 to 6, wherein the trust activation tracker indicates that a sound source of interest is active when the continuous active frame count for each of the sound sources of interest is greater than a predefined number of frames.

[0096] [EEE8] The method according to EEE7, wherein the predefined number of frames is defined for each sound source of interest.

[0097] [EEE9] The method according to any one of EEE6 to 8, wherein the continuous active frame count is reset according to the determination that each sound source of interest is inactive.

[0098] [EEE10] The method according to any one of EEE1 to 9, wherein the long-term energy estimator estimates the instantaneous energy and the cumulative energy for each of the one or more sound sources of interest.

[0099] [EEE11] The method according to EEE9 or 10, wherein the instantaneous energy is stored in an energy history buffer according to the determination that each of the sources is active.

[0100] [EEE12] The method according to EEE11, wherein the cumulative energy is the sum of the instantaneous energies.

[0101] [EEE13] The method according to any one of EEE2 to 12, wherein the long-term energy estimator updates the long-term RMS estimate in accordance with the determination of a valid activation by the trust activation tracker.
[EEE14] The method according to any one of EEE10 to 13, wherein the long-term RMS is calculated by dividing the cumulative energy by the number of active frames.
[EEE15] The method according to any one of EEE11 to 14, wherein the cumulative energy is updated by adding the energy value stored in the energy history buffer.

[0102] [EEE16] The method according to any one of EEE1 to 15, wherein the long-term energy estimator tracks the maximum value of the long-term RMS.

[0103] [EEE17] The method according to any one of EEE1 to 16, wherein the automatic gain control updates the short-time gain based on the long-term RMS and the target RMS.

[0104] [EEE18] The method according to any one of EEE2 to EEE17, wherein the automatic gain control is considered "ready" to be updated when the trust activation tracker first reports a valid activation.

[0105] [EEE19] The method according to any one of EEE1 to 18, wherein the initial short-time gain is set to 1.

[0106] [EEE20] The method according to any one of EEE1 to 19, wherein the target gain is calculated using the difference between the target RMS and the long-term RMS.

[0107] [EEE21] The method according to EEE20, wherein the long-term RMS is based on the long-term RMS of the source having the maximum sound source activation probability.

[0108] [EEE22] The method according to EEE20, wherein the long-term RMS is based on the source having the largest long-term RMS.

[0109] [EEE23] The method according to EEE20, wherein the long-term RMS is based on the average of the long-term RMS among all "active" sources.

[0110] [EEE24] The method according to any one of EEE1 to 23, wherein the short-time gain is updated by adding a gain increment to the short-time gain.

[0111] [EEE25] The method according to any one of EEE20 to 24, wherein the target gain is constrained as a running maximum in order to allow only gain boosting.

[0112] [EEE26] The method according to EEE24 or 25, wherein the gain increment is calculated as the average increment per frame for the change from the short-time gain to the target gain.

[0113] [EEE27] The method according to any one of EEE24 to 26, wherein the rate of gain increment is controlled by the attack time and release time.

[0114] [EEE28] The method according to any one of EEE24 to 27, wherein when the short-time gain is smaller than the target gain, the attack time is used to calculate the gain increment.

[0115] [EEE29] The method according to EEE27 or 28, wherein when the short-time gain is greater than the target gain, the release time is used to calculate the gain increment.

[0116] [EEE30] The method according to any one of EEE1 to 29, wherein the aforementioned short-time gain is further adapted to the maximum allowable gain for preventing clipping.

[0117] [EEE31] The method according to EEE30, wherein the maximum allowable gain is obtained by dividing the maximum allowable amplitude by the maximum amplitude of the short-time signal.

[0118] [EEE32] The method according to EEE30 or 31, wherein the maximum allowable gain is held in a local history buffer, and the minimum value from there is used to limit the current short-time gain.

[0119] [EEE33] The method according to EEE32, wherein the duration of the local history buffer is defined for each source of interest.

[0120] [EEE34] The method according to any one of EEE1 to 33, wherein the short-time leveling gain is converted to a frequency-dependent gain based on an available estimate of the noise spectrum.

[0121] [EEE35] The method according to EEE34, wherein the frequency-dependent gain is derived for the noise bin and the non-noise bin.

[0122] [EEE36] The method according to EEE35, wherein the noise bin has an amplitude smaller than or equal to that of the noise spectrum.

[0123] [EEE37] The method according to EEE36, wherein the non-noise bin has a larger amplitude compared to the noise spectrum.

[0124] [EEE38] The method according to any one of EEE34 to 37, wherein the short-time leveling gain is applied to the non-noise bin.

[0125] [EEE39] The method according to any one of EEE34 to 38, wherein the short-time leveling gain applied to the noise bin is multiplied by a factor of (1-α) when boosting the gain and by a factor of (1+α) when attenuating the gain, and α is between 0 and 1.

[0126] [EEE40] The method according to any one of EEE34 to 39, wherein the frequency-dependent gain at each frequency is smoothed by a single-pole low-pass filter.

[0127] [EEE41] The method according to any one of EEE34 to 40, wherein the resulting frequency-dependent gain is combined with a noise reduction gain and applied within a single process.

[0128] [EEE42] An electronic device comprising one or more processors and a memory containing one or more programs configured to be executed by the one or more processors, wherein the one or more programs contain instructions for performing the method described in any one of EEE1 to EEE41.

[0129] [EEE43] A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, wherein the one or more programs include instructions for performing the method described in any one of EEE1 to 41.
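By way of a non-limiting illustration, the activation tracking recited in EEE5 to EEE9 (a per-source probability threshold, a count of consecutive active frames, a per-source minimum frame count, and a reset on inactivity) can be sketched as follows. The class name, the parameter names, and the use of a greater-than-or-equal comparison against the threshold are assumptions made for this sketch.

```python
class TrustActivationTracker:
    """Hypothetical per-source tracker of consecutive active frames."""

    def __init__(self, prob_threshold, min_frames):
        self.prob_threshold = prob_threshold  # pre-selected threshold per source
        self.min_frames = min_frames          # predefined frame count per source
        self.consecutive = 0                  # current run of active frames

    def update(self, activation_prob):
        # Assumption: a frame is active when its probability meets the threshold.
        if activation_prob >= self.prob_threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # reset when the source is found inactive
        # The source is reported active once the run exceeds the minimum.
        return self.consecutive > self.min_frames
```

With a threshold of 0.5 and a minimum of two frames, a source is reported active on its third consecutive active frame and loses that status as soon as one frame falls below the threshold.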

[0130] With respect to the processes, systems, methods, heuristics, etc., described herein, the steps of such processes, etc., are described as being performed in a certain ordered sequence, but it should be understood that such processes may be carried out with the described steps performed in an order other than that described herein. Furthermore, it should be understood that certain steps may be performed simultaneously, other steps may be added, or certain steps described herein may be replaced, modified, or omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments and should not be construed as limiting the scope of the claims.

[0131] Therefore, it should be understood that the above description is illustrative and not limiting. Many embodiments and uses other than those provided will be apparent from reading the above description. The scope should not be determined by reference to the above description, but rather by reference to the appended claims, along with the entire scope of equivalents to which such claims are entitled. It is expected and intended that there will be future developments in the art described herein, and that the disclosed systems and methods will be incorporated into such future embodiments. In short, it should be understood that this application is capable of modification and variation.

[0132] All terms used in the claims are intended to be given their broadest reasonable interpretation and their ordinary meaning as understood by a person familiar with the art described herein, unless expressly otherwise stated herein. In particular, the use of singular articles such as “a,” “the,” and “said” should be read to describe one or more of the elements indicated, unless a claim recites an explicit limitation to the contrary.

[0133] The abstract of the disclosure is provided to enable readers to quickly ascertain the nature of the technical disclosure. The abstract is submitted with the understanding that it is not to be used to interpret or limit the scope or meaning of the claims. Furthermore, it can be seen that in the preceding detailed description, various features in various embodiments are grouped together for the purpose of streamlining the disclosure. This method of disclosure should not be interpreted as reflecting an intention that the claimed embodiments incorporate more features than are expressly described in each claim. Rather, as reflected in the following claims, the subject matter of the invention lies in less than all the features of a single disclosed embodiment. Thus, the following claims are incorporated herein into the detailed description, and each claim stands alone as separately claimed subject matter.

Claims

1. A method for real-time level management of audio content, the method comprising: receiving an input audio signal; generating a short-time signal based on the input audio signal; identifying, using a sound source classifier (202), one or more sound sources of interest associated with the short-time signal; estimating, using a long-term energy estimator (208), a long-term RMS (root mean square) energy for each of the one or more sound sources of interest; estimating, using automatic gain control (212), a set of short-time gains based at least in part on the long-term RMS of at least one of the sound sources of interest; and applying the short-time gains to the short-time signal.

2. The method according to claim 1, wherein identifying the one or more sound sources of interest associated with the short-time signal using the sound source classifier includes processing the output of the sound source classifier using a reliable activation tracker (204).

3. The method according to claim 1, wherein the sound source classifier is trained to classify sources including at least one of speech sources, music sources, and noise sources.

4. The method according to claim 1, wherein the sound source classifier outputs an instantaneous sound source activation probability for each of the sound sources of interest for each short-time frame of the short-time signal.

5. The method according to claim 1, further comprising the step of determining the source activation status based in part on a pre-selected probability threshold for each sound source of interest.

6. The method according to claim 2, wherein the reliable activation tracker counts consecutive active frames for each source.

7. The method according to claim 2, wherein the reliable activation tracker indicates that a sound source of interest is active when the consecutive active frame count for that sound source of interest is greater than a predefined number of frames.

8. The method according to claim 7, wherein the predefined number of frames is defined for each sound source of interest.

9. The method according to claim 6, wherein the consecutive active frame count is reset upon a determination that the respective sound source of interest is inactive.

10. The method according to claim 1, wherein the long-term energy estimator estimates the instantaneous energy and cumulative energy for each of the one or more sound sources of interest.

11. The method according to claim 9, wherein the instantaneous energy is stored in an energy history buffer (210) upon a determination that the respective source is active.

12. The method according to claim 11, wherein the cumulative energy is the sum of the instantaneous energies.

13. The method according to claim 2, wherein the long-term energy estimator updates the long-term RMS estimate in accordance with the determination of a valid activation by the reliable activation tracker.

14. The method according to claim 10, wherein the long-term RMS is calculated by dividing the cumulative energy by the number of active frames.

15. The method according to claim 11, wherein the cumulative energy is updated by adding the energy values stored in the energy history buffer.

16. The method according to claim 1, wherein the long-term energy estimator tracks the maximum value of the long-term RMS.

17. The method according to claim 1, wherein the automatic gain control (212) updates the short-time gain based on the long-term RMS and the target RMS.

18. The method according to claim 2, wherein the automatic gain control is considered "ready" to be updated when the reliable activation tracker first reports a valid activation.

19. The method according to claim 1, wherein the initial short-time gain is set to 1.

20. The method according to claim 1, wherein the target gain is calculated using the difference between the target RMS and the long-term RMS.

21. The method according to claim 20, wherein the long-term RMS is based on the long-term RMS of a source having the maximum sound source activation probability.

22. The method according to claim 20, wherein the long-term RMS is based on a source having the largest long-term RMS.

23. The method according to claim 20, wherein the long-term RMS is based on the average of the long-term RMSs among all "active" sources.

24. The method according to claim 1, wherein the short-time gain is updated by adding a gain increment to the short-time gain.

25. The method according to claim 20, wherein the target gain is constrained to a running maximum value so that only gain boosting is permitted.

26. The method according to claim 24, wherein the gain increment is calculated as the average increment per frame for changing from the short-time gain to the target gain.

27. The method according to claim 24, wherein the rate of gain increment is controlled by the attack time and the release time.

28. The method according to claim 24, wherein when the short-time gain is smaller than the target gain, the attack time is used to calculate the gain increment.

29. The method according to claim 27, wherein when the short-time gain is greater than the target gain, the release time is used to calculate the gain increment.

30. The method according to claim 1, wherein the short-time gain is further adjusted according to a maximum allowable gain for preventing clipping.

31. The method according to claim 30, wherein the maximum allowable gain is obtained by dividing the maximum allowable amplitude by the maximum amplitude of the short-time signal.

32. The method according to claim 30, wherein the maximum allowable gain is held in a local history buffer (214), and the minimum value therefrom is used to limit the current short-time gain.

33. The method according to claim 32, wherein the duration of the local history buffer is defined for each source of interest.

34. The method according to claim 1, wherein the short-time leveling gain is converted to a frequency-dependent gain based on an available estimate of the noise spectrum.

35. The method according to claim 34, wherein the frequency-dependent gain is derived with respect to the noise bin and the non-noise bin.

36. The method according to claim 35, wherein the noise bin has an amplitude smaller than or equal to that of the noise spectrum.

37. The method according to claim 36, wherein the non-noise bin has a larger amplitude compared to the noise spectrum.

38. The method according to claim 34, wherein the short-time leveling gain is applied to a non-noise bin.

39. The method according to claim 34, wherein the short-time leveling gain applied to the noise bin is multiplied by a factor of (1-α) when boosting the gain and by a factor of (1+α) when attenuating the gain, and α is between 0 and 1.

40. The method according to claim 34, wherein the frequency-dependent gain at each frequency is smoothed by a single-pole low-pass filter.

41. The method according to claim 34, wherein the resulting frequency-dependent gain is combined with a noise reduction gain and applied within a single process.

42. An electronic device comprising one or more processors and a memory containing one or more programs configured to be executed by the one or more processors, wherein the one or more programs contain instructions for performing the method according to any one of claims 1 to 41.

43. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, wherein the one or more programs include instructions for performing the method according to any one of claims 1 to 41.
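
For illustration only, the processing chain recited in the method claims (activation tracking in claims 5 to 9, long-term energy estimation in claims 10 to 15, gain update in claims 17 to 29, and clipping protection in claims 30 to 32) might be sketched for a single sound source as follows. This is a minimal sketch and not the claimed implementation. All class names, function names, and default values (probability threshold, frame counts, attack and release times, target RMS) are hypothetical. Three points are assumptions not stated in the claims: the square root taken when forming the long-term RMS, the discarding of buffered energies when the source becomes inactive, and the computation of the target gain as a difference in decibels. The classifier itself is not modeled; its per-frame activation probabilities are taken as input.

```python
import math
from collections import deque


def _db(x):
    """Convert a positive linear amplitude to decibels."""
    return 20.0 * math.log10(x)


class ReliableActivationTracker:
    """Counts consecutive active frames for one source (hypothetical name)."""

    def __init__(self, prob_threshold=0.5, min_frames=3):
        self.prob_threshold = prob_threshold  # assumed default
        self.min_frames = min_frames          # assumed default
        self.count = 0
        self.active = False

    def update(self, prob):
        self.active = prob >= self.prob_threshold
        # The count is reset as soon as the source is judged inactive.
        self.count = self.count + 1 if self.active else 0
        return self.count > self.min_frames


class LongTermEnergyEstimator:
    """Accumulates frame energies once an activation is confirmed."""

    def __init__(self):
        self.history = []      # energy history buffer of pending frames
        self.cum_energy = 0.0  # cumulative energy
        self.n_frames = 0      # number of frames accumulated so far

    def update(self, frame, active, reliable):
        if active:
            self.history.append(sum(x * x for x in frame) / len(frame))
        else:
            self.history.clear()  # assumption: drop unconfirmed energies
        if reliable:
            self.cum_energy += sum(self.history)
            self.n_frames += len(self.history)
            self.history.clear()
        if self.n_frames == 0:
            return None
        # Assumption: square root so the result is an amplitude-domain RMS.
        return math.sqrt(self.cum_energy / self.n_frames)


class LevelingGain:
    """Moves a short-time gain toward a target gain and limits clipping."""

    def __init__(self, target_rms=0.1, frame_s=0.02, attack_s=0.2,
                 release_s=1.0, max_amp=1.0, limit_history=5):
        self.gain = 1.0  # initial short-time gain
        self.target_rms = target_rms
        self.attack_frames = max(1, round(attack_s / frame_s))
        self.release_frames = max(1, round(release_s / frame_s))
        self.max_amp = max_amp
        self.limits = deque(maxlen=limit_history)  # local history buffer

    def update(self, frame, lt_rms):
        if lt_rms:  # no update until a long-term RMS estimate exists
            target_gain = 10.0 ** ((_db(self.target_rms) - _db(lt_rms)) / 20.0)
            n = self.attack_frames if self.gain < target_gain else self.release_frames
            self.gain += (target_gain - self.gain) / n  # per-frame increment
        peak = max(abs(x) for x in frame)
        self.limits.append(self.max_amp / peak if peak > 0 else float("inf"))
        # The smallest recent maximum allowable gain caps the applied gain.
        return min(self.gain, min(self.limits))


def level_frames(frames, probs, target_rms=0.1):
    """Apply the sketch to a sequence of frames for a single source."""
    tracker = ReliableActivationTracker()
    estimator = LongTermEnergyEstimator()
    agc = LevelingGain(target_rms=target_rms)
    out, gains = [], []
    for frame, prob in zip(frames, probs):
        reliable = tracker.update(prob)
        lt_rms = estimator.update(frame, tracker.active, reliable)
        g = agc.update(frame, lt_rms)
        gains.append(g)
        out.append([g * x for x in frame])
    return out, gains
```

In this sketch, calling `level_frames(frames, probs)` with frames whose RMS is 0.05 and a target RMS of 0.1 leaves the gain at 1 until the activation is confirmed, after which the gain moves gradually toward 2. Because the increment is recomputed every frame from the remaining distance to the target, the approach is exponential rather than strictly linear, which is one of several possible readings of the per-frame increment.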