Audio Encoder with Signal Enhancement Module, Active Phase Audio Encoder and Inactive Phase Audio Encoder, Audio Decoder with Mixer, Encoding Method, Decoding Method, Computer Program and Bitstream

The audio encoder with a signal enhancement module efficiently encodes speech signals by using noise reduction information for signal activity detection, addressing the challenge of background noise in neural audio codecs, enhancing speech quality and reducing computational complexity.

WO2026124795A1 · PCT designated stage · Publication Date: 2026-06-18 · FRAUNHOFER GESELLSCHAFT ZUR FORDERUNG DER ANGEWANDTEN FORSCHUNG EV

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
FRAUNHOFER GESELLSCHAFT ZUR FORDERUNG DER ANGEWANDTEN FORSCHUNG EV
Filing Date
2024-12-13
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing neural audio codecs struggle to effectively handle background noise during speech coding, leading to increased computational complexity and reduced speech quality, as they require larger and more complex networks to maintain quality in noisy environments.

Method used

An audio encoder with a signal enhancement module that applies noise reduction and uses noise reduction information to determine signal activity, allowing for efficient encoding by differentiating between active and inactive phases, using either an active or inactive phase encoder based on the signal activity, thereby reducing computational load and improving encoding efficiency.

Benefits of technology

The proposed solution enables efficient encoding of speech signals with background noise by leveraging noise reduction information for signal activity detection, resulting in improved speech quality and reduced computational complexity.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure EP2024086401_18062026_PF_FP_ABST
Patent Text Reader

Abstract

Embodiments comprise an audio encoder for providing an encoded representation on the basis of an input audio signal, wherein the audio encoder comprises a signal enhancement module, configured to apply a noise reduction to the input audio signal, or to a processed version thereof, to obtain an enhanced audio signal, wherein the audio encoder comprises an active phase audio encoder, configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof, wherein the audio encoder comprises an inactive phase audio encoder, configured to obtain a noise reduction information about the noise reduction applied to the input audio signal, or to the processed version thereof, by the signal enhancement module and to provide the encoded representation on the basis of the noise reduction information, and / or wherein the inactive phase audio encoder is configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof. Furthermore, a decoder with a mixer, encoding methods, computer programs and bitstreams are disclosed.

Description

[0001] Audio Encoder with Signal Enhancement Module, Active Phase Audio Encoder and Inactive Phase Audio Encoder, Audio Decoder with Mixer, Encoding Method, Decoding Method, Computer Program and Bitstream

[0002] Description

[0003] Technical Field

[0004] Embodiments comprise audio encoders with signal enhancement modules, active phase audio encoders and inactive phase audio encoders, audio decoders with mixers, encoding methods, decoding methods, computer programs and bitstreams.

[0005] Embodiments comprise methods and apparatuses for controlling VAD / DTX-CNG of a neural audio coder by a noise reduction module.

[0006] Background of the Invention

[0007] Discontinuous Transmission (DTX) is an efficient way to drastically reduce the transmission rate of a communication codec in the absence of voice input [2]. In this mode, most frames that are determined to consist of background noise only are dropped from transmission and replaced by Comfort Noise Generation (CNG) in the decoder.
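As a non-limiting illustration of the DTX principle described above, the following minimal sketch labels each frame as transmitted, dropped, or replaced by a short noise update. The function name `dtx_schedule`, the label strings, the notion of a periodic noise-description ("SID") update and the default interval of 8 frames are assumptions made for this example and are not taken from the text.

```python
from typing import List, Optional


def dtx_schedule(active_flags: List[bool], sid_interval: int = 8) -> List[str]:
    """Label each frame as ACTIVE, SID (noise description update) or NO_DATA.

    active_flags: per-frame activity decisions (True = voice present).
    sid_interval: hypothetical spacing, in frames, between noise updates
                  during an inactive run.
    """
    labels: List[str] = []
    # None means that no noise update has been sent in the current inactive run.
    frames_since_sid: Optional[int] = None
    for active in active_flags:
        if active:
            labels.append("ACTIVE")          # frame is encoded and transmitted
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid >= sid_interval - 1:
            labels.append("SID")             # short noise description is sent
            frames_since_sid = 0
        else:
            labels.append("NO_DATA")         # frame is dropped; decoder runs CNG
            frames_since_sid += 1
    return labels


if __name__ == "__main__":
    print(dtx_schedule([True, False, False, False, True], sid_interval=2))
```

In such a scheme, the decoder would generate comfort noise for every frame labelled `SID` or `NO_DATA`, using the most recently received noise description.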

[0008] Recently, significant progress has been made in the field of speech coding with the introduction of Deep Neural Network-based models. The resulting neural speech and audio codecs derive a compact, discrete, learned representation of the input signal, which forms the bitstream to be transmitted. In real-life conditions, the speech signals to be transmitted are recorded with background noise in addition to the desired signal. The expected background noise then has to be considered during training of the neural codec, where the aim might be to reproduce or to remove the background noise at the receiver side [1]. However, in any case, considering background noise in the training of the neural codec consumes model capacity and hence requires larger and more complex neural networks to maintain the speech quality achievable for clean speech.

[0009] FV - ACr - Aspect B+F - FH241204PCT-2024363614.DOCX Hence, regarding prior art approaches, there is a need for an improved audio coding concept taking into account background noise, which achieves an improved compromise regarding a quality of a rendered audio signal, a computational complexity and a flexibility of the concept.

[0010] This is achieved by the subject matter of the independent claims of the present application.

[0011] Further embodiments according to the invention are defined by the subject matter of the dependent claims of the present application.

[0012] Summary of the Invention

[0013] In the following, embodiments according to the invention are discussed grouped in different aspects of the invention. However, it is to be noted that this grouping is not to be understood in a limiting manner, but shall facilitate understanding the spirit and scope of the invention.

[0014] Hence, embodiments according to the first aspect of the invention may comprise any or all of the features, functionalities and details as disclosed in the context of embodiments according to the second aspect of the invention, both individually or taken in combination.

[0015] In line with this, embodiments according to the second aspect of the invention may comprise any or all of the features, functionalities and details as disclosed in the context of embodiments according to the first aspect of the invention, both individually or taken in combination.

[0016] The same applies for embodiments disclosed in the context of an inventive decoder, e.g. according to the third aspect of the invention. Embodiments according to the third aspect of the invention may comprise any or all of the features, functionalities and details as disclosed in the context of embodiments according to the first and / or second aspect of the invention, both individually or taken in combination, for example, in a direct or corresponding manner (e.g. an encoder functionality to provide an encoded representation having a specific information encoded therein corresponding to a decoder functionality for decoding said information and optionally to act on the basis of said information).

[0017] Furthermore, the same applies to features, functionalities and details disclosed in the context of a method or bitstream with regard to the respective bitstream or method, as well as with regard to a respective decoder and / or encoder.

[0018] Embodiments according to the first aspect of the invention comprise an audio encoder (e.g. a speech encoder) for providing an encoded representation on the basis of an input audio signal (e.g. an input speech signal, e.g. X, e.g. X), wherein the audio encoder comprises a signal enhancement module (e.g. a noise reduction module or a speech enhancement module), configured to apply a noise reduction (e.g. comprising a compression, transformation and / or masking step) to the input audio signal, or to a processed version thereof, to obtain an enhanced audio signal (e.g. S, e.g. S, e.g. S), wherein the audio encoder comprises a signal activity detector (e.g. a voice activity detector), configured to obtain (e.g. receive from the signal enhancement module) a signal activity information (e.g. a voice activity signal) (e.g. about the input audio signal), using a noise reduction information (e.g. a mask parameter, e.g. an information about a mask, e.g. M, e.g. $M_m$, e.g. $M_r$ and $M_i$, e.g. which is applied to the input audio signal or a processed version thereof, in order to obtain the enhanced audio signal, e.g. a difference between the input audio signal and the enhanced audio signal, e.g. N = X - S) (which is, for example, provided by the signal enhancement module) (e.g. a signal to noise ratio information) about (e.g. describing) the noise reduction (e.g. in terms of noise reduction functionality; e.g. in terms of an information about one or more processing parameters (e.g. mask parameters, e.g. an information about a mask M) used by the signal enhancement module in order to reduce noise; e.g. in terms of a masking applied to obtain the enhanced audio signal; e.g. in terms of an amount of removed noise; e.g. in terms of a difference between the input audio signal and the enhanced audio signal; e.g. in terms of an estimate of an intensity (e.g. magnitude) of a noisy part in the input audio signal or in the processed version of the input audio signal) applied to the input audio signal, or to the processed version thereof, by the signal enhancement module, and wherein the audio encoder comprises an active phase audio encoder, configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof (e.g. to encode the enhanced audio signal, or a processed version thereof, in order to obtain the encoded representation), in dependence of the signal activity information (e.g. in case the signal activity information indicates that the enhanced audio signal is an active phase audio signal, e.g. in case the signal activity information indicates that the enhanced audio signal is a speech signal or a signal comprising mainly speech, e.g. in case the signal activity information indicates that the input audio signal based on which the enhanced audio signal is obtained, is an active phase audio signal, e.g. in case the signal activity information indicates that the input audio signal based on which the enhanced audio signal is obtained, is a speech signal or a signal comprising mainly speech).

[0019] The enhanced audio signal or the processed version thereof may be selectively encoded using the active phase audio encoder, which may be specifically designed to encode an active phase audio signal, such as a speech signal (e.g. an audio signal predominantly comprising speech), efficiently.

[0020] Hence, an efficient encoding scheme may be provided by using said active phase encoder if the signal is classified as an active phase audio signal. If the signal is not classified as an active phase audio signal, but for example instead as an inactive phase audio signal, a different encoder, such as an inactive phase audio encoder, may optionally be used, or the encoding may be skipped.

[0021] In any case, it was recognized that the determination of the signal activity information may be performed in a very efficient manner if it is obtained using a noise reduction information about a noise reduction applied to the input audio signal or a processed version thereof, in order to obtain the enhanced audio signal. Hence, for example, instead of analyzing the input audio signal or instead of analyzing the processed version of the input audio signal or instead of analyzing the enhanced audio signal, an information about the noise reduction processing may be exploited.

[0022] As the noise reduction is performed to obtain the enhanced audio signal, the noise reduction information may be available as a by-product of the signal enhancement. Hence, resources for a separate signal analysis step may be saved.

[0023] Beyond that, it was recognized that the signal activity information may be provided in a particularly reliable and robust manner by exploiting the noise reduction information as an information about the portion of the input audio signal that is removed in order to obtain the enhanced audio signal.

[0024] Here, it is to be noted that embodiments are not limited to a specific form of the noise reduction information. This information may range from a precise characterization of noise removed from the audio signal, e.g. a noise level, hence an information about a noise energy, e.g. a noise shape information, such as a differentiated information about noise levels in subbands, to intermediate results from the noise reduction, e.g. also referred to as signal enhancement, such as a mask information, e.g. a mask for reducing noise using a masking.

[0025] Accordingly, the “double” use of noise reduction information, such as intermediate results, for example of the signal enhancement module, allows reducing a computational load on the signal activity detector and allows obtaining the signal activity information in a more precise and robust manner.
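The reuse of the noise reduction information for signal activity detection may be sketched as follows. This is a minimal example under stated assumptions, not the claimed detector: it assumes a real-valued mask M in [0, 1] applied per frequency bin, forms the removed portion as N = X - S, and compares the resulting signal-to-noise ratio with a threshold. The function name `activity_from_mask` and the default threshold of 0 dB are hypothetical.

```python
import numpy as np


def activity_from_mask(X: np.ndarray, M: np.ndarray, snr_threshold_db: float = 0.0):
    """Classify one frame as active or inactive using only the mask.

    X: complex spectrum of one frame (one value per frequency bin).
    M: real-valued mask per bin, assumed to lie in [0, 1].
    Returns (is_active, snr_db).
    """
    eps = 1e-12                       # avoids division by zero and log(0)
    S = M * X                         # enhanced spectrum obtained by masking
    N = X - S                         # removed portion, N = X - S
    energy_kept = np.sum(np.abs(S) ** 2)
    energy_removed = np.sum(np.abs(N) ** 2)
    snr_db = 10.0 * np.log10((energy_kept + eps) / (energy_removed + eps))
    return bool(snr_db > snr_threshold_db), float(snr_db)


if __name__ == "__main__":
    X = np.ones(4, dtype=complex)
    print(activity_from_mask(X, np.full(4, 0.9)))   # mask keeps most energy
    print(activity_from_mask(X, np.full(4, 0.1)))   # mask removes most energy
```

No separate analysis of the input audio signal is performed here; the decision follows from quantities that the signal enhancement already produces.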

[0026]-[0027] According to embodiments of the first aspect, the audio encoder comprises an inactive phase audio encoder, configured to obtain (e.g. receive from the signal enhancement module) a noise reduction information or the noise reduction information (e.g. a mask parameter, e.g. an information about a mask, e.g. M, e.g. $M_m$, e.g. $M_r$ and $M_i$, e.g. which is applied to the input audio signal or a processed version thereof in order to obtain the enhanced audio signal, e.g. a difference between the input audio signal and the enhanced audio signal, e.g. N = X - S) (e.g. the noise reduction information obtained or used by the signal activity detector or an information differing from said information or some same and some different information portions thereof) (e.g. provided by the signal enhancement module) about (e.g. describing) the noise reduction (e.g. in terms of noise reduction functionality; e.g. in terms of an information about one or more processing parameters (e.g. mask parameters, e.g. an information about a mask M) used by the signal enhancement module in order to reduce noise; e.g. in terms of a masking applied to obtain the enhanced audio signal; e.g. in terms of an amount of removed noise; e.g. in terms of a difference between the input audio signal and the enhanced audio signal) applied to the input audio signal, or to the processed version thereof, by the signal enhancement module and to provide the encoded representation on the basis of the noise reduction information (e.g. in the form of an absolute noise level / energy, for example, along with a spectral shape of the noise estimate, e.g. derived from the noise reduction information, e.g. by determining LSID subband energies $E_b$ using the noise reduction information) in dependence of the signal activity information (e.g. in case the signal activity information indicates that the input audio signal is a background audio signal or an inactive phase audio signal, e.g. in case the signal activity information indicates that the input audio signal is a noise signal or a signal comprising mainly noise, e.g. compared to a speech portion of the signal).

[0028] Alternatively or in addition, the inactive phase audio encoder is configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof (e.g. encode the enhanced audio signal, or a processed version thereof) in dependence of the signal activity information (e.g. in case the signal activity information indicates that the input audio signal is a background audio signal or an inactive phase audio signal, e.g. in case the signal activity information indicates that the input audio signal is a noise signal or a signal comprising mainly noise, e.g. compared to a speech portion of the signal).

[0029] Alternatively or in addition, the inactive phase audio encoder is configured to provide the encoded representation on the basis of the input audio signal in dependence of the signal activity information (e.g. in case the signal activity information indicates that the input audio signal is a background audio signal or an inactive phase audio signal, e.g. in case the signal activity information indicates that the input audio signal is a noise signal or a signal comprising mainly noise, e.g. compared to a speech portion of the signal).

[0030] According to embodiments of the first aspect, the signal activity detector is configured to obtain the signal activity information using the noise reduction information and using the input audio signal or the processed version thereof (e.g. so that inputs of the VAD can include other domain-specific features computed outside the NR module, directly from the input speech).

[0031] Embodiments according to the first aspect of the invention comprise an audio encoder (e.g. a speech encoder) for providing an encoded representation on the basis of an input audio signal (e.g. an input speech signal, e.g. X, e.g. X), wherein the audio encoder comprises a signal enhancement module (e.g. a noise reduction module or a speech enhancement module), configured to apply a noise reduction (e.g. comprising a power law compression, transformation and / or masking step) to the input audio signal, or to a processed version thereof, to obtain an enhanced audio signal (e.g. S, e.g. S, e.g. S).

[0032] Furthermore, the audio encoder comprises an active phase audio encoder, configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof (e.g. to encode the enhanced audio signal, or a processed version thereof).

[0033] Furthermore, the audio encoder comprises an inactive phase audio encoder, configured to obtain (e.g. receive from the signal enhancement module) a noise reduction information (e.g. a mask parameter, e.g. an information about a mask, e.g. M, e.g. $M_m$, e.g. $M_r$, e.g. which is applied to the input audio signal or a processed version thereof in order to obtain the enhanced audio signal, e.g. a difference between the input audio signal and the enhanced audio signal, e.g. N = X - S) (e.g. provided by the signal enhancement module) about (e.g. describing) the noise reduction (e.g. in terms of noise reduction functionality; e.g. in terms of an information about one or more processing parameters (e.g. mask parameters, e.g. an information about a mask M) used by the signal enhancement module in order to reduce noise; e.g. in terms of a masking applied to obtain the enhanced audio signal; e.g. in terms of an amount of removed noise; e.g. in terms of a difference between the input audio signal and the enhanced audio signal) applied to the input audio signal, or to the processed version thereof, by the signal enhancement module and to provide the encoded representation on the basis of the noise reduction information (e.g. in the form of an absolute noise level / energy, for example, along with a spectral shape of the noise estimate, e.g. derived from the noise reduction information, e.g. by determining LSID subband energies $E_b$ using the noise reduction information).
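The derivation of subband energies from the noise reduction information may, for example, be sketched as follows. The sketch assumes that the noise estimate is the difference N = X - S in the spectral domain and that the subband energies E_b are mean bin powers expressed in dB; the function name `noise_subband_energies_db`, the band layout and the use of a logarithmic scale are assumptions for illustration and are not specified in the text.

```python
import numpy as np


def noise_subband_energies_db(X: np.ndarray, S: np.ndarray, band_edges) -> np.ndarray:
    """Describe the removed noise by one energy value per subband.

    X: complex input spectrum of one frame.
    S: complex enhanced spectrum of the same frame.
    band_edges: hypothetical bin indices delimiting the subbands,
                e.g. [0, 2, 4, 8] for three bands.
    """
    N = X - S                                   # noise estimate, N = X - S
    power = np.abs(N) ** 2                      # per-bin noise power
    energies = [
        np.mean(power[lo:hi])                   # mean power within band b
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]
    return 10.0 * np.log10(np.asarray(energies) + 1e-12)


if __name__ == "__main__":
    X = 2.0 * np.ones(8, dtype=complex)
    S = np.ones(8, dtype=complex)
    print(noise_subband_energies_db(X, S, [0, 2, 4, 8]))
```

A handful of such values per frame conveys both an overall noise level and a coarse spectral shape, which is far less data than an encoded active-phase frame.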

[0034] Alternatively or in addition, the inactive phase audio encoder is configured to provide the encoded representation on the basis of the enhanced audio signal, or a processed version thereof (e.g. encode the enhanced audio signal, or a processed version thereof).

[0035] Hence, the encoded representation may be provided using the active phase audio encoder, which may be specifically designed to encode an active phase audio signal, such as a speech signal (e.g. an audio signal predominately comprising speech) or using the inactive phase audio encoder, which may be specifically designed to encode an inactive phase audio signal (e.g. a background signal, e.g. an audio signal predominately background noise), e.g. with a significantly reduced transmission effort compared to the active phase audio encoder.

[0036] On the one hand, it was recognized that the enhanced audio signal or processed version thereof may be provided to the active phase audio encoder and / or to the inactive phase audio encoder, in order to obtain the encoded representation.

[0037] On the other hand, e.g. as an alternative to encoding the enhanced audio signal or processed version thereof, the inactive phase audio encoder may be configured to provide the encoded representation on the basis of the noise reduction information.

[0038] It was recognized that a representation of an inactive phase audio signal (e.g. a background audio signal) may be obtained in a very efficient manner using a noise reduction information about a noise reduction applied to the input audio signal or a processed version thereof in order to obtain the enhanced audio signal. Hence, for example, instead of analyzing the input audio signal or instead of analyzing the processed version of the input audio signal or instead of analyzing the enhanced audio signal, an information about the processing (e.g. the signal enhancement) may be used in order to determine the encoded representation of the input audio signal, e.g. in case of the input audio signal being an inactive phase audio signal, e.g. comprising primarily background noise.

[0039]-[0040] Hence, as an example, the input audio signal or a processed version thereof may be provided to the signal enhancement module. The signal enhancement module may be configured to apply the noise reduction to the input audio signal or the processed version thereof. For example, in case the noise reduction fulfills a first criterion, e.g. surpasses a threshold, e.g. so that a predominant portion of the signal energy (as being noise) is removed from the input audio signal or the processed version thereof, in order to obtain the enhanced audio signal, the noise reduction information, comprising an information about said removed noise, may be provided to the inactive phase audio encoder in order to represent the input audio signal (or at least the predominant background portion).

[0041] On the other hand, for example, in case the noise reduction fulfills a second criterion or does not fulfill the first criterion, e.g. does not surpass a threshold, e.g. so that only a minor portion of the signal energy (as being noise) is removed from the input audio signal or the processed version thereof, in order to obtain the enhanced audio signal, the enhanced audio signal or a processed version thereof may be provided to the active phase audio encoder in order to represent the input audio signal.
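The two cases described above may be sketched as a simple routing step. In this minimal example, the first criterion is taken to be "the removed energy exceeds a given fraction of the input energy"; this particular criterion, the default fraction of 0.5 and the function name `route_frame` are assumptions made for illustration only.

```python
import numpy as np


def route_frame(X: np.ndarray, S: np.ndarray, removed_fraction_threshold: float = 0.5):
    """Decide which encoder represents the frame and what it receives.

    X: complex input spectrum, S: complex enhanced spectrum.
    Returns ("inactive", noise_information) or ("active", enhanced_signal).
    """
    N = X - S                                           # removed noise
    energy_in = float(np.sum(np.abs(X) ** 2)) + 1e-12
    removed_fraction = float(np.sum(np.abs(N) ** 2)) / energy_in
    if removed_fraction > removed_fraction_threshold:
        # First criterion fulfilled: a predominant portion was removed as noise,
        # so the noise reduction information represents the frame.
        return "inactive", N
    # Otherwise only a minor portion was removed: encode the enhanced signal.
    return "active", S


if __name__ == "__main__":
    X = np.ones(4, dtype=complex)
    print(route_frame(X, 0.1 * X)[0])
    print(route_frame(X, 0.9 * X)[0])
```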

[0042] As the noise reduction is performed to obtain the enhanced audio signal, the noise reduction information may be available as a by-product of the signal enhancement. Hence, resources for a separate signal analysis step (e.g. in the inactive phase audio encoder) may be saved.

[0043] Hence, in other words it was recognized that an accurate representation of an input audio signal comprising primarily noise may be obtained efficiently by exploiting the noise reduction information as an information about a portion of the audio input signal, which is removed, in order to obtain the enhanced audio signal.

[0044] Here again, it is to be noted that embodiments are not limited to a specific form of the noise reduction information. This information may range from a precise characterization of noise removed from the audio signal, e.g. a noise level, hence an information about a noise energy, e.g. a noise shape information, such as a differentiated information about noise levels in subbands, to intermediate results from the noise reduction, e.g. also referred to as signal enhancement, such as a mask information, e.g. a mask for reducing noise using a masking.

[0045] Accordingly, the “double” use of noise reduction information, such as intermediate results, for example of the signal enhancement module, allows reducing a computational load on the inactive phase audio encoder and allows obtaining the encoded representation in a more precise and robust manner.

[0046]-[0047] According to embodiments of the first aspect, the audio encoder comprises a signal activity detector (e.g. a voice activity detector), configured to obtain (e.g. receive from the signal enhancement module) a signal activity information (e.g. a voice activity signal) (e.g. about the input audio signal), using a noise reduction information or the noise reduction information (e.g. a mask parameter, e.g. an information about a mask, e.g. M, e.g. $M_m$, e.g. $M_r$, e.g. which is applied to the input audio signal or a processed version thereof, in order to obtain the enhanced audio signal, e.g. a difference between the input audio signal and the enhanced audio signal, e.g. N = X - S) (e.g. the noise reduction information obtained or used by the inactive phase audio encoder or an information differing from said information or some same and some different information portions thereof) (e.g. a signal to noise ratio information) (e.g. provided by the signal enhancement module) about (e.g. describing) the noise reduction (e.g. in terms of noise reduction functionality; e.g. in terms of an information about one or more processing parameters (e.g. mask parameters, e.g. an information about a mask M) used by the signal enhancement module in order to reduce noise; e.g. in terms of a masking applied to obtain the enhanced audio signal; e.g. in terms of an amount of removed noise; e.g. in terms of a difference between the input audio signal and the enhanced audio signal) applied to the input audio signal, or to a processed version thereof, by the signal enhancement module. Furthermore, the active phase audio encoder is configured to provide the encoded representation in dependence of the signal activity information (e.g. in case the signal activity information indicates that the enhanced audio signal is an active phase audio signal, e.g. in case the signal activity information indicates that the enhanced audio signal is a speech signal or a signal comprising mainly speech, e.g. in case the signal activity information indicates that the input audio signal based on which the enhanced audio signal is obtained, is an active phase audio signal, e.g. in case the signal activity information indicates that the input audio signal based on which the enhanced audio signal is obtained, is a speech signal or a signal comprising mainly speech). In addition, the inactive phase audio encoder is configured to provide the encoded representation in dependence of the signal activity information (e.g. in case the signal activity information indicates that the input audio signal is a background audio signal or an inactive phase audio signal, e.g. in case the signal activity information indicates that the input audio signal is a noise signal or a signal comprising mainly noise, e.g. compared to a speech portion of the signal).

[0048] According to embodiments of the first and / or second aspect, the active phase encoder comprises (or, for example, is) a neural audio encoder.

[0049] In view of the above, it is to be highlighted that embodiments may comprise an exploitation of the noise reduction information, e.g. as a by-product of the signal enhancement module, for signal activity detection and / or for inactive phase audio encoding. Accordingly, a same or a different noise reduction information may be provided to an optional signal activity detector or an optional inactive phase audio encoder.

[0050] The noise reduction information may be provided in the form of an intermediate result of the speech enhancement module or as a fully processed information, such as a noise estimate, to the one or the other or to both (signal activity detector and inactive phase audio encoder).

[0051] Hence, the savings in computational costs may be doubled.

[0052] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain (e.g. to determine, e.g. to provide) a spectral domain representation (e.g. X, e.g. a frequency domain representation; e.g. as a result of a Fourier Transform, e.g. as a result of a Short Term Fourier Transform (STFT), e.g. as a processed version of the input audio signal) on the basis of the input audio signal. Furthermore, the signal enhancement module is configured to obtain (e.g. to determine, e.g. to provide) a mask information (e.g. a mask parameter, e.g. an information about a mask, e.g. M, e.g. $M_m$) on the basis of the spectral domain representation. Furthermore, the signal enhancement module is configured to obtain (e.g. to determine, e.g. to provide) the enhanced audio signal (e.g. S, e.g. S, e.g. S) using a masking (e.g. in the form of $S_{r/i} = \tilde{X}_{r/i} \cdot M_{r/i}$) of the spectral domain representation or of a processed version thereof (e.g. $\tilde{X}$), using the mask information (e.g. by applying a mask to the input audio signal or a processed version thereof, in order to obtain the enhanced audio signal); and the signal enhancement module is configured to obtain (e.g. to determine, e.g. to provide) the noise reduction information on the basis of the mask information or to provide the mask information as the noise reduction information (e.g. comprising or being a mask parameter, e.g. comprising or being an information about a mask, e.g. M, e.g. $M_m$, e.g. $M_r$ and $M_i$, e.g. on the basis of a difference between the input audio signal and the enhanced audio signal, e.g. N = X - S, with the enhanced audio signal being obtained on the basis of the mask information).

[0053] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to non-linearly scale the spectral domain representation (e.g. X) in order to obtain the processed version of the spectral domain representation (e.g. $\tilde{X}$), and the signal enhancement module is configured to obtain the mask information on the basis of the processed version of the spectral domain representation. Furthermore, the signal enhancement module is configured to obtain the enhanced audio signal using the masking of the processed version (e.g. $\tilde{X}$) of the spectral domain representation using the mask information (e.g. according to $S = \tilde{X} \cdot M$).
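As a short worked relation, and under the assumptions that the mask M is applied element-wise per frequency bin and that the difference between input and enhanced signal is formed in the same (processed) domain, the removed portion follows directly from the mask. The symbol $\tilde{X}$ denotes the processed (non-linearly scaled) spectral domain representation; the symbol $\tilde{N}$ is introduced here for illustration only.

```latex
\begin{aligned}
S &= \tilde{X} \cdot M, \\
\tilde{N} &= \tilde{X} - S = \tilde{X} \cdot (1 - M).
\end{aligned}
```

Under these assumptions, the mask together with the processed spectrum fully determines the removed noise, which is why the mask information may serve as the noise reduction information without a separate noise analysis.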

[0054] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to apply a power law compression to the spectral domain representation, in order to obtain the processed version of the spectral domain representation.

[0055] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to apply (e.g. separately apply) a power law compression to a real part and to an imaginary part (e.g. $X_{r/i}$) of the spectral domain representation, in order to obtain the processed version (e.g. to obtain a real and imaginary part of the processed version, e.g. $\tilde{X}_{r/i}$) of the spectral domain representation (e.g. according to $\tilde{X}_{r/i} = \operatorname{sign}(X_{r/i})\,|X_{r/i}|^{\alpha}$).

[0056] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the real part $\tilde{X}_r$ and the imaginary part $\tilde{X}_i$ of the processed version of the spectral domain representation according to

$$\tilde{X}_r = \operatorname{sign}(X_r)\,|X_r|^{\alpha}, \qquad \tilde{X}_i = \operatorname{sign}(X_i)\,|X_i|^{\alpha},$$

wherein $X_r$ is a real part of the spectral domain representation; wherein $X_i$ is an imaginary part of the spectral domain representation; sign is the signum function; and $\alpha$ is a power law factor in the interval (0, 1).
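The power law compression of the real and imaginary parts may be sketched as follows. The compression itself follows the relation given above; the default value of 0.3 for the power law factor and the inverse function `power_law_expand` are assumptions added for illustration and are not taken from the text.

```python
import numpy as np


def power_law_compress(X: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Apply sign(x) * |x|**alpha separately to the real and imaginary part."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    real = np.sign(X.real) * np.abs(X.real) ** alpha
    imag = np.sign(X.imag) * np.abs(X.imag) ** alpha
    return real + 1j * imag


def power_law_expand(X_tilde: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Hypothetical inverse of power_law_compress (exponent 1 / alpha)."""
    real = np.sign(X_tilde.real) * np.abs(X_tilde.real) ** (1.0 / alpha)
    imag = np.sign(X_tilde.imag) * np.abs(X_tilde.imag) ** (1.0 / alpha)
    return real + 1j * imag


if __name__ == "__main__":
    X = np.array([4.0 - 9.0j, -1.0 + 0.0j])
    print(power_law_compress(X, alpha=0.5))
```

Because the exponent is smaller than one, large spectral values are reduced more strongly than small ones, which narrows the dynamic range seen by the subsequent mask estimation.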

[0057] According to embodiments of the first and / or second aspect, the signal enhancement module comprises a neural network; and the neural network is configured to obtain the mask information (e.g. a complex-valued mask; e.g. a real magnitude mask $M_m$) on the basis of the processed version (e.g. $\tilde{X}$; e.g. $\tilde{X}_{r/i}$, e.g. $\tilde{X}_m$, e.g. $\tilde{X}_p$) of the spectral domain representation (e.g. on the basis of spectral values of the processed version of the spectral domain representation).

[0058] According to embodiments of the first and / or second aspect, the neural network is configured to obtain the mask information (e.g. a complex-valued mask; e.g. a real magnitude mask Mm, e.g. values representing a frequency dependent mask; e.g. mask values for a plurality of frequency bins) on the basis of a magnitude of the processed version (e.g. Xm) of the spectral domain representation or on the basis of (e.g. respective) magnitudes of (e.g. respective) spectral values of the processed version of the spectral domain representation (e.g. using a frequency bin by frequency bin processing).

[0059] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain (e.g. determine) an intermediate magnitude mask (e.g. a real magnitude mask $M_m$, e.g. comprising respective mask values for respective frequency bins) on the basis of a magnitude of the processed version (e.g. $\tilde{X}_m$) of the spectral domain representation or on the basis of (e.g. respective) magnitudes of (e.g. respective) spectral values of the processed version of the spectral domain representation (e.g. using a frequency bin by frequency bin processing). Furthermore, the signal enhancement module is configured to obtain (e.g. determine) an intermediate representation (e.g. $Y_r$, e.g. $Y_i$; e.g. an intermediate real and imaginary feature representation) in dependence on the intermediate magnitude mask (e.g. a real valued magnitude mask $M_m$) and on the basis of a (e.g. respective) phase of the processed version (e.g. $\tilde{X}_p$) of the spectral domain representation or on the basis of (e.g. respective) phases of (e.g. respective) spectral values of the processed version of the spectral domain representation (e.g. using a frequency bin by frequency bin processing). In addition, the signal enhancement module is configured to obtain a mask (e.g. $M_r$ and $M_i$) on the basis of the intermediate representation (e.g. $Y_r$, e.g. $Y_i$), and the signal enhancement module is configured to obtain (e.g. determine) the enhanced audio signal (e.g. S, e.g. S, e.g. S) using a masking (e.g. in the form of $\tilde{S}_{r/i} = \tilde{X}_{r/i} \cdot M_{r/i}$) of the processed version thereof (e.g. $\tilde{X}$), using the mask (e.g. $M_r$ and $M_i$).

[0060] According to embodiments of the first and / or second aspect, the signal enhancement module comprises a first stage comprising a first neural network and the signal enhancement module comprises a second stage comprising a second neural network. The first neural network has a higher computational complexity (e.g. comprises more layers, e.g. comprises more neural network parameters, e.g. requires more computational resources) than the second neural network and the first stage is configured to process a magnitude of the processed version (e.g. Xm) of the spectral domain representation in order to obtain (e.g. determine) an intermediate magnitude mask (e.g. a real magnitude mask Mm).

[0061] Furthermore, the signal enhancement module is configured to obtain an intermediate representation (e.g. $Y_r$, e.g. $Y_i$) in dependence on the intermediate magnitude mask (e.g. a real magnitude mask $M_m$) and in dependence on a (e.g. respective) phase of the processed version (e.g. $\tilde{X}_p$) of the spectral domain representation or in dependence on (e.g. respective) phases of (e.g. respective) spectral values of the processed version of the spectral domain representation (e.g. using a frequency bin by frequency bin processing); and the second stage is configured to obtain the enhanced audio signal (e.g. S, e.g. S, e.g. S) in dependence on the intermediate representation (e.g. $Y_r$, e.g. $Y_i$).
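Merely as an example, the data flow of such a two-stage arrangement may be sketched in Python as follows. The callables `magnitude_mask_fn` and `complex_mask_fn` are hypothetical stand-ins for the first and the second neural network, and the way the intermediate representation is formed (masked magnitude combined with the unchanged phase) is an assumption made for this sketch, not a statement of the actual network design.

```python
import cmath


def two_stage_enhance(x_tilde, magnitude_mask_fn, complex_mask_fn):
    """Sketch of two-stage masking for one frame of power law domain bins.

    magnitude_mask_fn and complex_mask_fn are hypothetical stand-ins for the
    first (larger) and second (smaller) neural network.
    """
    mags = [abs(v) for v in x_tilde]            # magnitude of the processed version
    phases = [cmath.phase(v) for v in x_tilde]  # phase of the processed version
    m_m = magnitude_mask_fn(mags)               # stage 1: real magnitude mask
    # Assumed combination: masked magnitude re-attached to the original phase.
    y = [cmath.rect(m * a, p) for m, a, p in zip(m_m, mags, phases)]
    masks = complex_mask_fn(y)                  # stage 2: one (M_r, M_i) pair per bin
    # Masking is applied separately to the real and the imaginary part.
    s = [complex(v.real * m_r, v.imag * m_i) for v, (m_r, m_i) in zip(x_tilde, masks)]
    return m_m, y, s
```

With trivial stand-ins (for instance a constant magnitude mask and an all-ones second-stage mask) the function simply returns the input bins as the enhanced signal, which is convenient for checking the plumbing before real networks are attached.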

[0062] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to perform a channel-wise feature reorientation on the basis of the processed version of the spectral domain representation (e.g. a magnitude of the processed version, e.g. Xm, of the spectral domain representation) in order to obtain the mask information (e.g. and in order to obtain the enhanced audio signal).

[0063] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the mask information and / or the enhanced audio signal on the basis of the processed version of the spectral domain representation (e.g. a magnitude of the processed version, e.g. Xm, of the spectral domain representation) using a processing in a power law domain.

[0064] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain a complex-valued mask (e.g. having real and imaginary parts $M_r$ and $M_i$) on the basis of the processed version of the spectral domain representation, in order to obtain the mask information (e.g. $M_r$ and $M_i$, wherein the complex valued mask may form the mask information, or wherein the mask information may be derived from the complex valued mask) and / or in order to obtain the enhanced audio signal.

[0065] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain a real magnitude mask (e.g. $M_m$) on the basis of the processed version of the spectral domain representation (e.g. $\tilde{X}$), in order to obtain the complex-valued mask.

[0066] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain a mask (e.g. a complex valued mask $M$, e.g. having real and imaginary parts $M_r$ and $M_i$; e.g. a real valued magnitude mask $M_m$) in dependence on the processed version (e.g. $\tilde{X}_m$) of the spectral domain representation. Furthermore, the signal enhancement module is configured to determine a complement (e.g. $1 - |M|$, e.g. $1 - M_m$) of the mask (e.g. a magnitude complement, e.g. a real complement, e.g. a complement with respect to an absolute value of a mask), in order to obtain the noise reduction information (e.g. an information on the basis of a noise estimate information, e.g. an information on the basis of $|N|$ or $N_m$, e.g. $|N|$, e.g. $N_m$) and / or the signal enhancement module is configured to determine the complement of the mask in order to obtain the noise reduction information or, for example, the mask information in order to obtain the noise reduction information (e.g. in the form of the complement).

[0067] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to apply a mask to a representation (e.g. $|\tilde{X}|$) of the audio input signal, or of the processed version thereof, in a non-linearly scaled domain (e.g. in a power law domain) (e.g. the processed version of the input audio signal), using the complement (e.g. $1 - |M|$, e.g. $1 - M_m$) of the mask, in order to obtain the noise reduction information (e.g. an information on the basis of a noise estimate information, e.g. an information on the basis of $|N|$ or $N_m$, e.g. $|N|$; e.g. $N_m$, e.g. an information describing an estimated intensity (e.g. magnitude) of a noisy part of the input audio signal or of the processed version thereof).

[0068] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to determine a complement or a difference (e.g. $\tilde{N} = \tilde{X} - \tilde{S}$) on the basis of representations of the enhanced audio signal (e.g. the intermediate representation (e.g. $Y_r$, e.g. $Y_i$)) and of the input audio signal or the processed version thereof, in a non-linearly scaled domain (e.g. in a power law domain), in order to obtain the noise reduction information.

[0069] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the noise reduction information (e.g. $N$) and / or the enhanced audio signal (e.g. $S$) using an inverse non-linear scaling or using a power law decompression (e.g. on the basis of $\tilde{S}$, e.g. on the basis of the processed version of the spectral domain representation, e.g. $\tilde{X}$).

[0070] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain a mask (e.g. a complex valued mask $M$, e.g. having real and imaginary parts $M_r$ and $M_i$; e.g. a real valued magnitude mask $M_m$) on the basis of the processed version of the spectral domain representation, in order to obtain the mask information. Furthermore, the signal enhancement module is configured to determine a complement (e.g. $1 - |M|$, e.g. $1 - M_m$) of the mask, in order to obtain the noise reduction information (e.g. an information on the basis of a noise estimate information, e.g. an information on the basis of $|N|$ or $N_m$, e.g. $|N|$, e.g. $N_m$, e.g. an information describing an estimated intensity (e.g. magnitude) of a noisy part of the input audio signal or of the processed version thereof), and the signal enhancement module is configured to mask a representation (e.g. $|\tilde{X}|$) of the audio input signal, or of the processed version thereof, in a power law domain (e.g. the processed version of the input audio signal) using the complement (e.g. $1 - |M|$, e.g. $1 - M_m$) of the mask, in order to obtain a power law domain noise estimate $\tilde{N}$. In addition, the signal enhancement module is configured to obtain the noise reduction information on the basis of a noise magnitude information $N_m$, which is determined according to

[0071] $N_m = \tilde{N}_m^{1/\alpha}$, wherein $\tilde{N}_m$ is the magnitude of the power law domain noise estimate.
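By way of illustration only, the mask-complement route to a noise magnitude may be sketched in Python as follows. The sketch assumes a real magnitude mask with values between 0 and 1 and assumes that the power law decompression raises the power law domain magnitude to the power 1/α; the function name and the default α = 0.5 are hypothetical.

```python
def noise_magnitude_from_mask(x_tilde_mag, mask_mag, alpha=0.5):
    """Estimate per-bin noise magnitudes from the complement of a magnitude mask.

    x_tilde_mag: magnitudes of the power law domain spectrum, one per bin.
    mask_mag:    real magnitude mask values, assumed to lie in [0, 1].
    alpha:       power law factor (0.5 is an illustrative assumption).
    """
    # Complement masking in the power law domain: (1 - M_m) * |X~|
    n_tilde_mag = [(1.0 - m) * a for m, a in zip(mask_mag, x_tilde_mag)]
    # Assumed decompression back to the linear domain: magnitude ** (1 / alpha)
    return [v ** (1.0 / alpha) for v in n_tilde_mag]
```

A mask value of 1 (bin kept entirely as speech) therefore yields a noise magnitude of 0, whereas a mask value of 0 attributes the whole bin to noise.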

[0072] Alternatively, the signal enhancement module is configured to obtain a noise estimate information $\tilde{N}$ in the power law domain, having a real part $\tilde{N}_r$ and an imaginary part $\tilde{N}_i$, using a subtraction between the preprocessed version of the input audio signal (e.g. $\tilde{X}$) and a version of the enhanced signal in the power law domain (e.g. $\tilde{S}$), and the signal enhancement module is configured to obtain the noise reduction information (e.g. $N_{r/i}$) on the basis of the noise estimate information $\tilde{N}$ in the power law domain, having a real part $\tilde{N}_r$ and an imaginary part $\tilde{N}_i$, according to

[0073] $N_{r/i} = \operatorname{sign}(\tilde{N}_{r/i})\,|\tilde{N}_{r/i}|^{1/\alpha}$, wherein $\tilde{N}_r$ is the real part of the power law domain noise estimate; wherein $\tilde{N}_i$ is the imaginary part of the power law domain noise estimate; sign is the signum function; and $\alpha$ is a power law factor in the interval (0, 1).
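The subtraction-based alternative may, for example, be sketched in Python as follows: the noise estimate is formed as a difference in the power law domain and then decompressed separately for the real and imaginary parts. The function name and the default α = 0.5 are assumptions made for the illustration.

```python
import math


def noise_from_subtraction(x_tilde, s_tilde, alpha=0.5):
    """Noise estimate via subtraction in the power law domain, then decompression.

    x_tilde: complex bins of the power law compressed input spectrum.
    s_tilde: complex bins of the enhanced spectrum in the power law domain.
    alpha:   power law factor (0.5 is an illustrative assumption).
    """
    def expand(v: float) -> float:
        # Inverse of sign(v) * |v|**alpha, applied to one real-valued component
        return math.copysign(abs(v) ** (1.0 / alpha), v)

    noise = []
    for x, s in zip(x_tilde, s_tilde):
        n_tilde = x - s  # power law domain noise estimate
        noise.append(complex(expand(n_tilde.real), expand(n_tilde.imag)))
    return noise
```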

[0074] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the spectral domain representation (e.g. a frequency domain representation; e.g. as a result of a Fourier Transform, e.g. as a result of a Short Term Fourier Transform, e.g. as a processed version of the input audio signal, e.g. X) on the basis of the input audio signal and / or on the basis of the processed version thereof, using a windowing. Furthermore, the signal enhancement module is configured to obtain the noise reduction information on the basis of the mask information with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of a windowing of the time domain to spectral domain transform, e.g. with a time resolution equal to the hop size of a windowing of the STFT) (wherein the signal enhancement module is, for example, configured to obtain the enhanced audio signal with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of the time domain to spectral domain transform, e.g. with a time resolution equal to the hop size of a windowing of the STFT)).

[0075] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the spectral domain representation (e.g. a frequency domain representation; e.g. as a result of a Fourier Transform, e.g. as a result of a Short Term Fourier Transform, e.g. as a processed version of the input audio signal, e.g. X) on the basis of the input audio signal and / or on the basis of the processed version thereof, using a windowing. Furthermore, the signal enhancement module is configured to obtain the enhanced audio signal with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of the time domain to spectral domain transform, e.g. with a time resolution equal to the hop size of a windowing of the STFT). In addition, the signal enhancement module is configured to obtain the noise reduction information with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of the time domain to spectral domain transform, e.g. with a time resolution equal to the hop size of a windowing of the STFT).

[0077] Alternatively or in addition, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to obtain a signal to noise ratio information (e.g. a segmental SNR, e.g. in the form of $\mathrm{segSNR} = 10 \log_{10}\left(\frac{\sum_k S_k^2}{\sum_k N_k^2}\right)$) with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of the spectral domain transform, e.g. with a time resolution equal to the hop size of a windowing of the STFT) (e.g. on the basis of the enhanced audio signal and the mask information), in order to obtain the noise reduction information (e.g. with a time size resolution equal to the hop size of the spectral domain transform; e.g. in the form of the signal to noise ratio information).

[0078] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to obtain the signal to noise ratio information segSNR according to $\mathrm{segSNR} = 10 \log_{10}\left(\frac{\sum_k S_k^2}{\sum_k N_k^2}\right)$, wherein $S_k$ is a clean speech estimate (e.g. after inverse non-linear scaling, e.g. after a power law decompression); wherein $N_k$ is a noise estimate (e.g. after inverse non-linear scaling, e.g. after a power law decompression); wherein $k$ is a frequency index.
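As a non-limiting sketch, a per-frame segmental SNR of this kind may be computed in Python as follows. The small constant `eps`, which avoids a division by zero or a logarithm of zero for silent frames, is an assumption of the example and is not taken from the text.

```python
import math


def seg_snr_db(speech_bins, noise_bins, eps=1e-12):
    """Segmental SNR in dB for one frame (one hop of the windowing).

    speech_bins: clean speech estimate per frequency index k.
    noise_bins:  noise estimate per frequency index k.
    eps:         hypothetical regularisation constant for silent frames.
    """
    speech_energy = sum(abs(v) ** 2 for v in speech_bins)
    noise_energy = sum(abs(v) ** 2 for v in noise_bins)
    return 10.0 * math.log10((speech_energy + eps) / (noise_energy + eps))
```

Because one value is produced per frame, the result has a temporal granularity equal to the hop size of the transform and may be compared to a threshold to derive a signal activity decision.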

[0079] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to obtain a time domain signal to noise ratio information and to provide or to obtain the noise reduction information on the basis of the time domain signal to noise ratio information (e.g. after an inverse transformation (e.g. inverse FFT)).

[0080] According to embodiments of the first and / or second aspect, the temporal granularity which is associated with the windowing is a time size resolution equal to the hop size of the windowing.

[0081] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to compare the signal to noise ratio information to a threshold, in order to provide a comparison result for usage in the signal activity detector and / or for usage in the inactive phase audio encoder (e.g. in order to obtain the signal activity information, e.g. to enable the signal activity detector to determine the signal activity information) (e.g. in order to deduce, e.g. to directly deduce, a Voice Activity Detection (VAD decision)).

[0082] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the enhanced audio signal with a temporal granularity which is associated with the windowing (e.g. with a time resolution equal to the hop size of the spectral domain transform). Furthermore, the signal enhancement module is configured to obtain a noise information (e.g. a noise estimate, e.g. an information about noise included in the input audio signal or a processed version thereof, e.g. an information about a noise energy and / or a shape of the noise) with a temporal granularity which is associated with the windowing, on the basis of the mask information and on the basis of the spectral domain representation. In addition, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to aggregate (e.g. to smoothen, e.g. to average, e.g. in a temporal direction, e.g. over time) the noise information, in order to obtain a noise level information (e.g. an estimate of the background noise level) (e.g. by applying some moving average (MA) and exponential moving average (EMA)). Furthermore, the audio encoder (e.g. the signal enhancement module, e.g. the signal activity detector) is configured to obtain the noise reduction information (e.g. in the form of a comparison result of a comparison of the enhanced audio signal and the noise level information) (e.g. in order to obtain the signal activity information, e.g. to enable the signal activity detector to determine the signal activity information) (e.g. in order to deduce, e.g. to directly deduce, a Voice Activity Detection (VAD decision)) on the basis of the enhanced audio signal and the noise level information (e.g. by comparing of the enhanced audio signal, e.g. a short-term information, and the noise level information, e.g. a long term information).
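For illustration, the aggregation of the noise information into a long-term noise level and its comparison with the short-term energy of the enhanced signal may be sketched in Python as follows. The exponential moving average, the smoothing constant and the decision margin are hypothetical choices for this example; the text leaves the kind of averaging and the comparison rule open.

```python
def vad_from_noise_level(speech_energy, noise_energy, smoothing=0.9, margin=2.0):
    """Per-frame activity decisions from short-term speech and long-term noise.

    speech_energy: per-frame energy of the enhanced audio signal (short term).
    noise_energy:  per-frame energy of the noise information.
    smoothing:     hypothetical EMA coefficient for the noise level.
    margin:        hypothetical factor by which speech must exceed the level.
    """
    noise_level = None
    decisions = []
    for e_speech, e_noise in zip(speech_energy, noise_energy):
        if noise_level is None:
            noise_level = e_noise  # initialise with the first frame
        else:
            # Exponential moving average over time
            noise_level = smoothing * noise_level + (1.0 - smoothing) * e_noise
        decisions.append(e_speech > margin * noise_level)
    return decisions
```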

[0083] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the enhanced audio signal with a temporal granularity which is associated with the windowing (e.g. with a time size resolution equal to the hop size of the spectral domain transform). Furthermore, the signal enhancement module is configured to obtain a noise information (e.g. a noise estimate, e.g. an information about noise included in the input audio signal or a processed version thereof, e.g. an information about a noise energy and / or a shape of the noise) with a temporal granularity which is associated with the windowing in dependence on the mask information and on the basis of the spectral domain representation. In addition, the audio encoder is configured to obtain the noise reduction information (e.g. an intermediate quantity usable to obtain the signal activity information; e.g. a signal activity decision information, e.g. a Voice Activity Detection (VAD decision) information, e.g. in order to obtain the signal activity information, e.g. to enable the signal activity detector to determine the signal activity information) (e.g. in order to deduce, e.g. to directly deduce, a Voice Activity Detection (VAD decision)) on the basis of the enhanced audio signal and the noise information using machine learning.

[0085] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain the noise reduction information (e.g. an intermediate quantity usable to obtain the signal activity information; e.g. a signal activity decision information, e.g. a Voice Activity Detection (VAD decision) information, e.g. for enabling the signal activity detector to obtain the signal activity information) on the basis of a comparison of a short-term estimate of the enhanced audio signal with a long-term estimate of a noise included in the input audio signal or with a long-term estimate of a noise reduced from the enhanced audio signal when compared to the input audio signal (e.g. so that the signal enhancement module is configured to compare an instantaneous energy of the clean speech and noisy part estimates, e.g. to derive the VAD decision).

[0086] According to embodiments of the first and / or second aspect, the noise reduction information comprises one or more of

[0087] • an estimate of a noise portion of the audio input signal (e.g. an estimate of the background noise) or of a spectral domain representation thereof,

[0088] • an information about (e.g. describing) a noise portion reduced from the enhanced audio signal compared to the input audio signal,

[0089] • an estimate of a residual noise portion of the enhanced audio signal (e.g. an estimate of the residual background noise in the enhanced signal) or of a spectral domain representation thereof,

[0090] • the mask information,

[0091] • an intermediate representation (e.g. $Y_r$, e.g. $Y_i$) (e.g. an intermediate representation of SEM),

[0092] • an intermediate magnitude mask (e.g. a real magnitude mask Mm) (e.g. an intermediate representation of SEM), and / or

[0093] • a decision information (e.g. obtained using machine learning, e.g. obtained using a neural network) (e.g. a signal activity decision information, e.g. a Voice Activity Detection (VAD decision) information).

[0094] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to obtain (e.g. to receive, e.g. to determine) an environmental information (e.g. an information about an expected noise level or about an expected signal to noise ratio or about environmental characteristics, such as a windy environment) and / or a user preference information (e.g. an application specific information) (hence, e.g. an information about recording environment conditions and / or user preferences), and the signal enhancement module is configured to adjust the application of the noise reduction (e.g. comprising a compression, transformation and / or masking step) to the input audio signal or the processed version thereof to obtain an enhanced audio signal (e.g. S, e.g. S, e.g. S) on the basis of the environmental information and / or the user preference information.

[0095] According to embodiments of the first and / or second aspect, the signal enhancement module is configured to provide the noise reduction information (e.g. to a switch, e.g. to a mixer, e.g. to the inactive phase audio encoder, e.g. to the active phase audio encoder, e.g. to the signal activity detector).

[0096] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to determine noise generation parameters to be included in the encoded representation based on the noise reduction information.

[0097] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to determine a noise energy information (e.g. a noise level information) and / or a noise shape information on the basis of the noise reduction information (e.g. a noise energy information describing an amount of energy of noise in the input audio signal, e.g. the noise shape information describing a distribution of the amount of energy of the noise in the input audio signal); and the inactive phase encoder is configured to provide the encoded representation on the basis of the noise energy information (e.g. a noise level information) and / or on the basis of the noise shape information (e.g. a spectral shape of the noise estimate reflecting the energy distribution of the noise estimate, for example, in LSID frequency sub-bands) (e.g. such that the encoded representation comprises an encoded information describing the noise energy and / or an encoded representation describing a noise spectral shape, e.g. of a noisy portion of the audio input signal).

[0099] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to determine a plurality of subband energies (e.g. Eb) for the input audio signal, in order to determine the noise shape information.

[0100] According to embodiments of the first and / or second aspect, the subbands are distributed (e.g. in a frequency domain) non-uniformly.

[0101] According to embodiments of the first and / or second aspect, the subbands are distributed according to a psychoacoustic model of a human (e.g. the subband distribution following or following roughly the Bark scale).

[0102] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to determine a plurality of logarithmic spectral information decomposition, LSID, subband energies, in order to determine the noise shape information.

[0103] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to determine the logarithmic spectral information decomposition, LSID, subband energies according to [equation not reproduced in the source text], wherein $F_b$ is a cardinality of $I_b$, with $I_b$ being a set of bin indices of the spectral domain representation associated with (e.g. belonging to) subband index $b$ (e.g. where $F_b$ is the cardinality of $I_b$, the set of FFT bin indices belonging to the subband index $b$); wherein $N$ is a noise estimate information; and wherein $k$ is a frequency index.
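Purely as an illustration of how logarithmic subband energies could be derived from a noise estimate, the following Python sketch averages the squared magnitudes of the bins of each subband and converts the result to decibels. The exact expression is an assumption of this sketch (in particular the averaging over the bins of a subband and the factor of 10), as are the band edges and the constant `eps`.

```python
import math


def subband_log_energies(noise_bins, band_edges, eps=1e-12):
    """Log-domain energy per subband of a noise estimate (assumed formula).

    noise_bins: noise estimate per frequency index k.
    band_edges: hypothetical, possibly non-uniform bin boundaries; subband b
                covers the bin indices band_edges[b] .. band_edges[b + 1] - 1.
    eps:        hypothetical constant that avoids log10(0).
    """
    energies = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        bins = range(lo, hi)   # index set of the subband
        count = len(bins)      # cardinality of the index set
        mean_power = sum(abs(noise_bins[k]) ** 2 for k in bins) / count
        energies.append(10.0 * math.log10(mean_power + eps))
    return energies
```

Non-uniform band edges, for instance roughly following the Bark scale, can be passed in directly through `band_edges`.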

[0104] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to aggregate (e.g. to smooth or to average) subband energies of the plurality of subband energies (e.g. by applying some moving average (MA) and exponential moving average (EMA)) in order to determine the noise energy information (e.g. a noise level information) and / or the noise shape information.

[0105] According to embodiments of the first and / or second aspect, the audio encoder (e.g. the signal enhancement module and / or the inactive phase encoder) is configured to normalize (e.g. after a conversion, e.g. after a conversion in a logarithmic domain, e.g. after conversion to dB) the plurality of subband energies in order to determine the noise shape information (e.g. using a global gain information, e.g. using the noise energy information).
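One conceivable normalization, given only as an example, separates the subband energies in dB into a global gain and a gain-free shape. Using the mean over the subbands as the global gain is an assumption of this Python sketch.

```python
def normalize_subband_energies(energies_db):
    """Split dB subband energies into a global gain and a normalized shape.

    The global gain is assumed here to be the mean over all subbands; the
    shape is what remains after subtracting that gain in the dB domain.
    """
    gain_db = sum(energies_db) / len(energies_db)
    shape_db = [e - gain_db for e in energies_db]
    return gain_db, shape_db
```

The gain then plays the role of the noise energy information and the shape that of the noise shape information, so that both can be quantized separately.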

[0106] According to embodiments of the first and / or second aspect, the inactive phase encoder comprises a multi-stage vector quantizer comprising a plurality of stages; and the multi-stage vector quantizer is configured to obtain quantized representations of the noise energy information and / or of the noise shape information; and the inactive phase encoder is configured to provide the encoded representation on the basis of the quantized representations of the noise energy information and / or of the noise shape information.

[0107] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to obtain the quantized representation of the noise energy information using a first stage of the multi-stage vector quantizer; and the inactive phase encoder is configured to obtain the quantized representation of the noise shape information using a second stage, which is subsequent to the first stage; and the inactive phase encoder is configured to selectively adjust a number of bits allocated for the quantization of the noise shape information in the second stage.
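The staged quantization may, for example, be sketched in Python as follows: a first stage selects a codebook entry for the noise energy, and a second stage selects an entry for the noise shape, with the bit budget of the second stage being adjustable. The codebooks, the nearest-neighbour search and the convention that `shape_bits` bits address the first `2 ** shape_bits` entries are hypothetical choices made for this sketch.

```python
def nearest_index(codebook, vector):
    """Index of the codebook entry with the smallest squared error."""
    def squared_error(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vector))
    return min(range(len(codebook)), key=squared_error)


def quantize_noise_parameters(gain, shape, gain_codebook, shape_codebook, shape_bits):
    """Two-stage quantization: stage 1 for the energy, stage 2 for the shape.

    shape_bits selects how many bits are spent on the shape; 0 bits means
    that the shape is not quantized for this frame (hypothetical convention).
    """
    gain_index = nearest_index(gain_codebook, [gain])      # first stage
    if shape_bits == 0:
        return gain_index, None
    usable = shape_codebook[: 2 ** shape_bits]              # adjustable resolution
    shape_index = nearest_index(usable, shape)              # second stage
    return gain_index, shape_index
```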

[0108] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to skip an encoding of the noise shape information or of the quantized representation of the noise shape information for a predetermined number of frames (e.g. to discontinuously transmit the noise shape information or the quantized representation of the noise shape information; e.g. so that the SID is transmitted at most every 8 frames in (2)).

[0109] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to provide a Silence Insertion Descriptor in order to provide the inactive phase audio signal.

[0110] According to embodiments of the first and / or second aspect, the inactive phase encoder comprises (or, for example, is) a discontinuous transmission, DTX, encoder (e.g. for transmitting an SID to a corresponding CNG decoder on a decoder side).
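A discontinuous transmission schedule of this kind may be sketched in Python as follows. The frame labels and the rule that an SID is sent on the first inactive frame after an active phase are assumptions of the example; the interval of 8 frames follows the example value mentioned above.

```python
def sid_schedule(active_flags, sid_interval=8):
    """Label each frame as ACTIVE, SID or NO_DATA.

    active_flags: per-frame signal activity decisions (True = active phase).
    sid_interval: minimum spacing of SID frames during an inactive phase.
    """
    frames_since_sid = None  # None: no SID sent yet in the current inactive phase
    labels = []
    for active in active_flags:
        if active:
            labels.append("ACTIVE")
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid + 1 >= sid_interval:
            labels.append("SID")       # transmit noise generation parameters
            frames_since_sid = 0
        else:
            labels.append("NO_DATA")   # encoding skipped for this frame
            frames_since_sid += 1
    return labels
```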

[0111] According to embodiments of the first and / or second aspect, the active phase encoder is configured to skip an encoding of frames in dependence on the noise reduction information.

[0112] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to selectively encode noise generation parameters in dependence on the noise reduction information (e.g. when the active phase encoder is skipping an encoding of a frame) (e.g. when the noise reduction information indicates that speech or foreground speech is substantially inactive in the input audio signal; e.g. when the noise reduction information indicates that background noise is dominant in the input audio signal; e.g. when the noise reduction information indicates the input audio signal as being a background audio signal or an inactive phase audio signal; e.g. when the noise reduction information indicates that an amount of noise removed from the enhanced audio signal compared to the input audio signal is above a threshold).

[0113] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to selectively encode noise generation parameters, in case the noise reduction information indicates that an amount of noise removed from the enhanced audio signal compared to the input audio signal is above a threshold.

[0114] According to embodiments of the first and / or second aspect, the inactive phase encoder is configured to determine an information about (e.g. describing)

[0115] • a noise contribution included in the input audio signal on the basis of the noise reduction information; and / or

[0116] • a noise contribution included in the enhanced audio signal; and / or

[0117] • a start time and / or an end time of an interval (e.g. portion) in the audio signal, wherein a noise contribution fulfills a criterion; wherein the inactive phase encoder is configured to provide the encoded representation on the basis of said information.

[0118] According to embodiments of the first and / or second aspect, the audio encoder comprises a mixer, wherein the mixer is configured to obtain a mixed audio signal on the basis of the enhanced audio signal and a further noisy signal (e.g. the input audio signal or the processed version thereof, e.g. a signal derived on the basis of the input audio signal or the processed version thereof in dependence on the noise reduction information; e.g. a noise signal) in dependence on at least one of

[0119] • the noise reduction information,

[0120] • the input audio signal or the processed version thereof, and / or

[0121] • the signal activity information.

[0122] According to embodiments of the first and / or second aspect, the mixer is configured to obtain a mixed audio signal on the basis of a weighted mixing of the further noisy signal and of the enhanced audio signal, wherein a weighting of the further noisy signal and of the enhanced audio signal in the mixed audio signal is controlled in dependence on (e.g. by) the signal activity information (e.g. so that an (attenuated) version of the background noise estimated by the SEM, e.g. in inactive phases detected by the VAD, is conveyed to the inactive phase audio encoder, e.g. a DTX-CNG encoder) (e.g. so that the clean speech or a mix between the (attenuated) estimated noise and the estimated clean speech, in active phases detected by the VAD, is conveyed to the (neural) audio encoder).
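By way of example, an activity-controlled weighting of this kind may be sketched in Python as follows. The two gain values and the function name are hypothetical; the sketch only shows that the signal activity information selects which weights are applied.

```python
def mix_for_encoder(enhanced, noise, active,
                    noise_gain_active=0.1, noise_gain_inactive=0.5):
    """Encoder-side mixing of enhanced speech and estimated noise.

    enhanced: samples (or bins) of the enhanced audio signal.
    noise:    samples (or bins) of the estimated background noise.
    active:   signal activity information for the current frame.
    The gains are illustrative assumptions, not values from the text.
    """
    if active:
        # Active phase: speech plus an attenuated share of the noise
        return [s + noise_gain_active * n for s, n in zip(enhanced, noise)]
    # Inactive phase: only an attenuated version of the background noise
    return [noise_gain_inactive * n for n in noise]
```

Setting `noise_gain_active` to 0 conveys only the clean speech estimate to the active phase audio encoder.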

[0123] According to embodiments of the first and / or second aspect, the mixer is configured to provide the mixed audio signal to the active phase audio encoder or to the inactive phase audio encoder.

[0124] According to embodiments of the first and / or second aspect, the audio encoder is configured to provide the encoded representation on the basis of the signal activity information.

[0125] Embodiments according to the third aspect of the invention comprise an audio decoder (e.g. a speech decoder) for decoding an encoded representation of an input audio signal (e.g. an input speech signal, e.g. X, e.g. X), wherein the audio decoder is configured to selectively switch between an active phase operating mode and an inactive phase operating mode as a selected mode, on the basis of a signal activity information (e.g. provided via data stream, e.g. a VAD information).

[0126] Furthermore, the audio decoder comprises an active phase audio decoder, configured to provide an active phase audio signal (e.g. foreground signal, e.g. a speech signal or a signal comprising mainly speech) on the basis of the encoded representation, in case of the selected mode being the active phase operating mode.

[0127] In addition, the audio decoder comprises an inactive phase audio decoder, configured to provide a background audio signal (e.g. an inactive phase audio signal e.g. a background signal, e.g. a noise signal or a signal comprising mainly noise) on the basis of the encoded representation, in case of the selected mode being the inactive phase operating mode.

[0128] Furthermore, the audio decoder is configured to obtain (e.g. to receive) a (decoder-sided, e.g. a user-sided, e.g. originating from a decoder-side, e.g. originating from a user-side) mix control signal, indicating a weighting of an active phase audio signal and of a background audio signal or of a scaled (e.g. attenuated) version thereof for a mixed audio signal.

[0129] Furthermore, the inactive phase audio decoder is configured to provide a background audio signal or an attenuated version thereof on the basis of the encoded representation, (e.g. in dependence on the mix control signal,) when the audio decoder operates in the active phase operating mode (e.g. irrespective of the selected mode) (e.g. irrespective of the selected mode being the inactive phase operating mode).

[0130] In addition, the audio decoder comprises a mixer, configured to mix, in the active phase operating mode, an active phase audio signal, provided using the active phase decoder in the active phase operating mode and a background audio signal or an attenuated version thereof, provided using the inactive phase audio decoder in the active phase operating mode, according to the weighting indicated by the mix control signal, in order to obtain the mixed audio signal.

[0131] It was recognized that based on a mix control signal, which may, for example, be provided decoder-sided (hence, for example, not being signaled in a bitstream), an improved decoded version of an audio signal may be obtained by mixing the active phase audio signal and the background audio signal or a scaled version thereof.

[0132] In particular, a seamless transition between the different operating modes may be achieved. On the one hand, during intervals in which mostly background noise is present in the encoded audio signal, the inactive phase audio decoder may be used, to limit the use of transmission resources and to achieve a pleasant hearing experience, e.g. instead of utter silence. On the other hand, during intervals in which mostly a useful signal, such as speech, is present in the encoded audio signal, the active phase audio decoder may be used, in order to obtain the best possible quality of the reconstructed signal.

[0133] However, in some cases, the active phase audio signal may be “too clean” and may hence result in an uncomfortable hearing experience. Therefore, the active phase audio signal may be mixed with the background audio signal or the scaled version thereof, which may, in addition, allow cross-fading when the operating mode is switched. For example, instead of jumping from crystal clear speech to background noise only, some background noise may already be sprinkled into the clear speech during the active phase operating mode, so that the transition to background noise only in the inactive phase operating mode feels smoother for a listener.

[0134] This is achieved by the, for example, decoder-sided, mix control signal.

[0135] Hence, the benefits of the twofold decoding of active phase decoder and inactive phase decoder, which may be optimized for their respective primary use cases, may be achieved without setbacks in the form of unpleasant transitions between the two.

[0136] In particular, the addition of the background noise to the active phase audio signal may be scalable, on the basis of the mix control signal.
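A minimal sketch of such a scalable addition, assuming the mix control signal is reduced to a single weight between 0 and 1 that scales the background audio signal before it is added to the active phase audio signal. The function name and the scalar representation of the mix control signal are hypothetical.

```python
def mix_decoder_output(active_signal, background, mix_weight):
    """Add a scaled background signal to the active phase signal for one frame.

    mix_weight is a hypothetical scalar form of the mix control signal:
    0.0 adds no background noise, 1.0 adds it at its full level.
    """
    if not 0.0 <= mix_weight <= 1.0:
        raise ValueError("mix_weight must lie in [0, 1]")
    return [s + mix_weight * b for s, b in zip(active_signal, background)]
```

Because the weight is obtained at the decoder side, it could be changed at any time, for instance from a user setting, without any change to the bitstream.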

[0137] As an example, the audio signal provided by the inactive phase audio decoder in the inactive phase operating mode may be considered an inactive phase audio signal, and the audio signal provided by the inactive phase audio decoder in the active phase operating mode may be considered a background audio signal. The inactive phase audio signal and the background audio signal may be provided in the same manner, but, for example, just in different phases.

[0138] According to embodiments of the third aspect, the inactive phase audio decoder is configured to obtain (e.g. to determine, e.g. to update, e.g. to estimate), when operating in the inactive phase operating mode, a set of parameters (e.g. an information about a noise energy or a noise level and / or e.g. an information about a noise shaping, e.g. a noise shape information) on the basis of the encoded representation for providing the background audio signal (e.g. the inactive phase audio signal); and the inactive phase audio decoder is configured to provide, when operating in the active phase operating mode, a background audio signal or a scaled (e.g. an attenuated) version thereof, using the set of parameters obtained when operating in the inactive phase operating mode.

[0139] According to embodiments of the third aspect, the active phase audio decoder and the inactive phase audio decoder are configured to exchange a parameter information (e.g. when the selected mode is switched from active phase operating mode to inactive phase operating mode, e.g. when the selected mode is switched from inactive phase operating mode to active phase operating mode; e.g. an information for buffer initialization).

[0140] According to embodiments of the third aspect, the inactive phase audio decoder is configured to obtain, when operating in the inactive phase operating mode, a parameter information for providing the background audio signal (e.g. the inactive phase audio signal), based on the active phase audio signal provided by the active phase audio decoder when operating in the active phase operating mode (e.g. so that the decoded signal from the active phase audio decoder, e.g. a neural audio decoder, may serve to determine, or to improve or to refine, the background audio signal (e.g. the inactive phase audio signal) and / or a parameter information, such as a noise estimate at the decoder side, for obtaining the background audio signal (e.g. the inactive phase audio signal), for example, using a Minimum Statistics method, e.g. as proposed in [2]).
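As a rough illustration of how a decoder-side noise estimate might be tracked from decoded frames, the sketch below follows only the basic idea behind minimum statistics: the frame power is smoothed recursively and the minimum over a sliding window is taken as the noise floor. It is a simplification that omits the bias compensation and the per-frequency processing of the full method, and the function name, window length and smoothing factor are hypothetical.

```python
from collections import deque


def min_stats_noise_floor(frame_powers, window=8, smoothing=0.8):
    """Track a noise floor as the sliding minimum of recursively smoothed power.

    Simplified, broadband variant for illustration only; window and
    smoothing are hypothetical values.
    """
    smoothed = None
    history = deque(maxlen=window)  # keeps only the last `window` smoothed values
    floor = []
    for power in frame_powers:
        if smoothed is None:
            smoothed = power
        else:
            smoothed = smoothing * smoothed + (1.0 - smoothing) * power
        history.append(smoothed)
        floor.append(min(history))
    return floor
```

Short bursts of speech raise the smoothed power but not the window minimum, so the floor follows the background noise level rather than the speech level.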

[0142] According to embodiments of the third aspect, the audio decoder is configured to cross-fade a transition from the inactive phase operating mode to the active phase operating mode using the mixer (e.g. so as to successively decrease a significance of the background audio signal, e.g. of the inactive phase audio signal in the mixed audio signal).
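A minimal sketch of such a cross-fade over one frame, assuming a linear ramp that successively lowers the weight of the background audio signal while raising the weight of the active phase audio signal. The linear ramp and the function name are assumptions; the text does not prescribe a fade shape.

```python
def crossfade(background, active):
    """Linearly fade from the background frame to the active phase frame."""
    n = len(background)
    if n != len(active):
        raise ValueError("frames must have equal length")
    mixed = []
    for i in range(n):
        # Weight of the active signal rises from 0 to 1 across the frame.
        gain_in = i / (n - 1) if n > 1 else 1.0
        mixed.append((1.0 - gain_in) * background[i] + gain_in * active[i])
    return mixed
```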

[0143] According to embodiments of the third aspect, the audio decoder is configured to select between (e.g. selectively switch between) (e.g. using different settings of the mixer)

[0144] • providing, as an output signal, silence (e.g. in order to not restitute at all a background noise or a level of background noise, e.g. as included in the original audio signal, e.g. prior to encoding; e.g. so that nothing from the inactive phase audio decoder, e.g. a DTX / CNG decoder, is output, e.g. so that silence is generated in the inactive phase operating mode) in the inactive phase operating mode;

[0145] • providing, as an output signal, an attenuated version of the background audio signal (e.g. the inactive phase audio signal) (e.g. in order to restitute back a certain level of background noise, e.g. as included in the original audio signal, e.g. prior to encoding; e.g. for pleasantness and listening comfort, e.g. so that a background audio signal, e.g. comfort noise generated by the inactive phase audio decoder, e.g. DTX / CNG decoder, is attenuated and, for example, provided solely in the inactive phase operating mode) in the inactive phase operating mode;

[0146] • providing, as an output signal, a background audio signal (e.g. an unattenuated version thereof; e.g. an inactive phase audio signal) in the inactive phase operating mode; on the basis of the selected mode and / or on the basis of the mix control signal.

[0147] According to embodiments of the third aspect, the audio decoder is configured to select between (e.g. selectively switch between) (e.g. using different settings of the mixer)

[0148] • providing, as an output signal, a mixed audio signal comprising a combination of an attenuated version of a background audio signal (e.g. an output signal of the inactive phase audio decoder, e.g. generated when the audio decoder operates in the active phase operating mode) and the active phase audio signal, in the active phase operating mode;

[0149] • providing, as an output signal, a mixed audio signal comprising a (e.g. weighted) combination of an (e.g. non-attenuated) background audio signal, e.g. inactive phase audio signal, (e.g. an output signal of the inactive phase audio decoder, e.g. generated when the audio decoder operates in the active phase operating mode) and the active phase audio signal (e.g. so as to restitute entirely the original level of background noise, e.g. as included in the original audio signal, e.g. prior to encoding; e.g. so that the background audio signal, e.g. inactive phase audio signal, is not attenuated and optionally mixed with the active phase audio signal, e.g. the audio decoder output during active phases), in the active phase operating mode; and

[0150] • providing, as an output signal, an active phase audio signal (e.g. without adding a contribution of the background audio signal, e.g. inactive phase audio signal; e.g. without adding a contribution of an output signal of the inactive phase audio decoder) in the active phase operating mode, on the basis of the selected mode and / or on the basis of the mix control signal.
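The selectable output settings listed above for both operating modes can be sketched as a mapping from a discrete setting to a background gain. All names and the attenuation value are hypothetical; the sketch only illustrates that the same three settings yield silence, attenuated noise or full noise in the inactive phase, and speech only, speech plus attenuated noise or speech plus full noise in the active phase.

```python
from enum import Enum


class BackgroundSetting(Enum):
    NONE = "none"              # silence (inactive) or speech only (active)
    ATTENUATED = "attenuated"  # background noise at a reduced level
    FULL = "full"              # background noise at its decoded level


def background_gain(setting, attenuation=0.25):
    """Map a mixer setting to a gain; the attenuation value is hypothetical."""
    if setting is BackgroundSetting.NONE:
        return 0.0
    if setting is BackgroundSetting.ATTENUATED:
        return attenuation
    return 1.0


def output_frame(is_active_mode, setting, speech, background, attenuation=0.25):
    """Produce one output frame for the selected mode and mixer setting."""
    gain = background_gain(setting, attenuation)
    if is_active_mode:
        return [s + gain * b for s, b in zip(speech, background)]
    return [gain * b for b in background]
```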

[0151] According to embodiments of the third aspect, the audio decoder is configured to obtain (e.g. to receive) the (decoder-sided, e.g. a user-sided, e.g. originating from a decoder-side, e.g. originating from a user-side) mix control signal on the basis of a user input, an application input, and / or an environmental information (wherein, for example, the audio decoder comprises a user interface for inputting and / or for varying the mix control signal).

[0152] According to embodiments of the third aspect, the inactive phase decoder comprises (or, for example, is) a discontinuous transmission, DTX, decoder; and / or the inactive phase decoder comprises (or, for example, is) a Comfort Noise Generation, CNG, decoder.

[0153] According to embodiments of the third aspect, the active phase decoder comprises (or, for example, is) a neural audio decoder.

[0154] Brief Description of the Drawings

[0155] The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:

[0156] Fig. 1 shows a schematic view of an audio encoder according to embodiments of the first aspect of the invention, with additional, optional features;

[0157] Fig. 2 shows a schematic view of an audio encoder according to embodiments of the second aspect of the invention, with additional, optional features;

[0158] Fig. 3 shows a schematic view of a signal enhancement module according to embodiments of the first and / or second aspect of the invention;

[0159] Fig. 4 shows a schematic view of an audio decoder according to embodiments of the third aspect of the invention;

[0160] Fig. 5 shows a schematic view of an encoder, having a neural audio / speech encoder and a DTX / CNG encoder, according to embodiments of the invention;

[0161] Fig. 6 shows a schematic view of lower level block diagrams of an encoder and a decoder, according to embodiments of the invention; and

[0162] Fig. 7 shows a schematic view of optional processing steps for a signal enhancement module according to embodiments.

[0163] Detailed Description of the Embodiments

[0164] Equal or equivalent elements or elements with equal or equivalent functionality are denoted in the following description by equal or equivalent reference numerals even if occurring in different figures.

[0165] In the following description, a plurality of details is set forth to provide a more thorough explanation of embodiments of the present invention. However, it will be apparent to those skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form rather than in detail in order to avoid obscuring embodiments of the present invention.

[0166] In addition, features of the different embodiments described hereinafter may be combined with each other, unless specifically noted otherwise.

[0167] Fig. 1 shows a schematic view of an audio encoder according to embodiments of the first aspect of the invention, with additional, optional features.

[0168] Encoder 100 comprises a signal enhancement module (SEM) 110, a signal activity detector (SAD) 120 and an active phase audio encoder 130.

[0169] Encoder 100 may receive the input audio signal or processed version thereof 101. This input information 101 is provided to the signal enhancement module (SEM) 110, which is configured to apply a noise reduction to signal 101, to obtain an enhanced audio signal 102. The signal enhancement module (SEM) 110 may optionally be configured to further process the enhanced audio signal, in order to provide a processed version of the enhanced audio signal at its output.

[0170] The signal activity detector 120 is configured to obtain a signal activity information 103 using a noise reduction information 104 about the noise reduction applied to the input audio signal, or to the processed version thereof 101, by the signal enhancement module 110.

[0171] The noise reduction information 104 may, for example, be provided by the signal enhancement module 110, or information 104 may be derived by the signal activity detector 120 itself. The noise reduction information 104 may, for example, be an information describing the noise reduction applied to signal 101. As an example, information 104 may describe a portion of signal 101 which is removed in order to obtain signal 102. Information 104 may hence, for example, be an intermediate result of the signal enhancement module 110. Information 104 may, for example, describe the processing of signal 101 within the signal enhancement module 110.

[0172] Furthermore, the active phase audio encoder 130 is configured to provide the encoded representation 105 on the basis of the enhanced audio signal, or a processed version thereof 102, in dependence on the signal activity information 103, for example with the signal activity information indicating that signal 102 is to be transmitted, since signal 101 is predominantly a useful signal, such as a speech signal.

[0173] As an optional feature, encoder 100 further comprises an inactive phase audio encoder 140. The inactive phase audio encoder 140 may, for example, be configured to obtain a noise reduction information 106 (e.g. identical to or differing from the noise reduction information 104 or corresponding to noise reduction information 104) about the noise reduction applied to the input audio signal, or to the processed version thereof 101, by the signal enhancement module 110 and to provide an encoded representation 107 on the basis of the noise reduction information, in dependence on the signal activity information 103.

[0175] Alternatively or in addition, the inactive phase audio encoder 140 may, for example, be configured to provide the encoded representation 107 on the basis of the enhanced audio signal, or a processed version thereof 102, in dependence on the signal activity information 103. Alternatively or in addition, the inactive phase audio encoder 140 may, for example, be configured to provide the encoded representation 107 on the basis of the input audio signal 101 in dependence on the signal activity information 103.

[0176] Fig. 2 shows a schematic view of an audio encoder according to embodiments of the second aspect of the invention, with additional, optional features.

[0177] Encoder 200 comprises a signal enhancement module (SEM) 210, an active phase audio encoder 230 and an inactive phase audio encoder 240.

[0178] Encoder 200 may receive the input audio signal or processed version thereof 201. This input information 201 is provided to the signal enhancement module (SEM) 210, which is configured to apply a noise reduction to signal 201, to obtain an enhanced audio signal 202. The signal enhancement module (SEM) 210 may optionally be configured to further process the enhanced audio signal, in order to provide a processed version of the enhanced audio signal at its output.

[0179] The active phase audio encoder 230 is configured to provide an encoded representation 205 on the basis of the enhanced audio signal, or a processed version thereof 202.

[0180] Furthermore, the inactive phase audio encoder 240 is configured to obtain a noise reduction information 206 about the noise reduction applied to the input audio signal, or to the processed version thereof 201, by the signal enhancement module 210 and to provide the encoded representation 207 on the basis of the noise reduction information 206.

[0181] Alternatively or in addition, the inactive phase audio encoder 240 is configured to provide the encoded representation 207 on the basis of the enhanced audio signal, or a processed version thereof 202.

[0182] As optional features, encoder 200 may comprise a signal activity detector 220, which may have a corresponding functionality as discussed with respect to signal activity detector 120 of Fig. 1. Hence, noise reduction information 104 and 106 may correspond to information 204 and 206.

[0183] Fig. 3 shows a schematic view of a signal enhancement module according to embodiments of the first and / or second aspect of the invention.

[0184] Signal enhancement module 300 may, for example, correspond to signal enhancement module 110 shown in Fig. 1 or signal enhancement module 210 shown in Fig. 2.

[0185] Signal enhancement module 300 comprises a spectral domain transformer 310, a mask information determiner 320, a noise reduction information determiner 330 and a masking unit 340. As discussed in the context of Fig. 1 and 2, the signal enhancement module 300 may be provided with the input audio signal or a processed version thereof 301 (e.g. corresponding to signals 101, 201).

[0186] The signal enhancement module 300 is configured to, here as an example using spectral domain transformer 310, obtain a spectral domain representation 305 on the basis of the input audio signal (or the processed version thereof) 301. Optionally, signal enhancement module 300 may further comprise a scaling unit 350, which may be configured to scale the spectral domain representation in order to obtain a processed version thereof (indicated in Fig. 3 by reference sign 305 as well).

[0187] Furthermore, the signal enhancement module 300 is configured to, here as an example using mask information determiner 320, obtain a mask information 303 on the basis of the spectral domain representation (e.g. on the basis of the processed version thereof) 305. Furthermore, the signal enhancement module 300 is configured to, here as an example using masking unit 340, obtain the enhanced audio signal 302 using a masking of the spectral domain representation or of a processed version thereof 305, using the mask information 303.

[0188] In addition, the signal enhancement module is configured to, here as an example using noise reduction information determiner 330, obtain the noise reduction information 304 on the basis of the mask information 303 or to provide the mask information 303 as the noise reduction information 304.
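The masking step and one possible way of deriving a noise reduction information from the mask can be sketched as follows, assuming a real-valued mask with one gain per frequency bin. Taking the removed magnitude per bin as the noise reduction information is only one hypothetical choice; the text equally allows the mask itself to be passed on. Function names are illustrative.

```python
def apply_mask(spectrum, mask):
    """Scale each spectral bin by its mask gain to obtain the enhanced spectrum."""
    return [m * x for m, x in zip(mask, spectrum)]


def removed_magnitude(spectrum, mask):
    """Hypothetical noise reduction information: magnitude removed per bin."""
    return [(1.0 - m) * abs(x) for m, x in zip(mask, spectrum)]
```

For example, a bin with value 2 and a mask gain of 0.5 yields an enhanced value of 1 and a removed magnitude of 1.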

[0189] Fig. 4 shows a schematic view of an audio decoder according to embodiments of the third aspect of the invention.

[0190] Decoder 400 comprises an operating mode selector 410, an active phase audio decoder 430, an inactive phase audio decoder 440 and a mixer 420.

[0191] The decoder 400 is configured to obtain, e.g. to receive, a signal activity information 401 and an encoded representation 402, which may be both included in a bitstream. Optionally, the signal activity information 401 may, for example, be included in the encoded representation 402 and may hence be extracted or separated in a preprocessing step.

[0192] The audio decoder, here as shown as an example using the operating mode selector 410, is configured to selectively switch between an active phase operating mode and an inactive phase operating mode as a selected mode, on the basis of the signal activity information 401. As an example, the operating mode selector 410 may hence provide an information 403 about the selected operating mode.

[0193] Furthermore, the active phase audio decoder 430 is configured to provide an active phase audio signal 404 on the basis of the encoded representation 402, in case of the selected mode (e.g. as indicated by information 403) being the active phase operating mode. In addition, the inactive phase audio decoder 440 is configured to provide a background audio signal 405 on the basis of the encoded representation 402, in case of the selected mode (e.g. as indicated by information 403) being the inactive phase operating mode.

[0194] Furthermore, the audio decoder is configured to obtain a mix control signal 406, e.g. a decoder-sided mix control signal, indicating a weighting of an active phase audio signal 404 and of a background audio signal or of a scaled version thereof 405 for a mixed audio signal 407.

[0195] Furthermore, the inactive phase audio decoder is configured to provide a background audio signal or a scaled, e.g. attenuated, version thereof 405 on the basis of the encoded representation 402, when the audio decoder operates in the active phase operating mode.

[0196] In addition, the mixer 420 is configured to mix, in the active phase operating mode, an active phase audio signal 404, provided using the active phase decoder 430 in the active phase operating mode, and a background audio signal or an attenuated version thereof 405, provided using the inactive phase audio decoder 440 in the active phase operating mode, according to the weighting indicated by the mix control signal 406, in order to obtain the mixed audio signal 407.

[0198] In the following, further preferred embodiments are discussed. In particular, embodiments comprising signal enhancement modules in the form of speech enhancement modules (SEM), embodiments comprising active phase audio codecs in the form of neural audio codecs and inactive phase audio codecs in the form of DTX / CNG codecs are disclosed. However, respective specific implementations of signal enhancement modules, active phase audio codecs and inactive phase audio codecs are to be understood as examples and not in a limiting manner. Hence, in the following, embodiments relating to neural audio codecs may be implemented using any suitable form of active phase audio codec, embodiments relating to DTX / CNG codecs may be implemented using any suitable form of inactive phase audio codec and embodiments relating to speech enhancement modules may be implemented using any suitable form of signal enhancement module.

[0199] In general, according to embodiments of the invention, a concatenation of a speech enhancement module (SEM) and a neural audio codec may be used, where the neural audio codec can, for example, be trained solely on clean speech. The two modules, the SEM and the neural audio codec, can, for example, be trained separately or, for example for best performance, aligned with each other, e.g. by training the neural audio codec using the SEM output as input, and / or e.g. by adapting the SEM to the neural encoder and / or input conditions (e.g. recording environment conditions, user preferences...).

[0200] This interplay between SEM and neural audio codec can be even further extended according to embodiments in a DTX transmission mode as discussed in this disclosure.

[0201] Preferred embodiment

[0202] According to a preferred embodiment of the invention, a signal enhancement module, an active phase audio codec and an inactive phase audio codec, for example in the form of a SEM, a neural audio / speech codec and a DTX / CNG coding may be combined. An example for such a combination is pictured schematically in Fig. 5 for the encoder side.

[0203] Fig. 5 shows a schematic view of an encoder, having a neural audio / speech encoder and a DTX / CNG encoder, according to embodiments of the invention. Encoder 500 comprises a signal enhancement module 510 (here as an example a speech enhancement module), a voice activity detection unit 520, which, as an optional feature, comprises a switch unit (e.g. a switch), an active phase audio encoder (here as an example a neural audio encoder) 530 and an inactive phase audio encoder (here as an example, a DTX / CNG encoder) 540.

[0204] In other words, Fig. 5 shows a combination of SEM, VAD (and optional switch unit), and the two active and inactive phase encoders.

[0205] The speech enhancement module 510 is at the front-end and enhances the signal 501 to be coded, for example by attenuating or reducing the background noise. The enhanced signal 502 may then, for example, be coded by either an active or an inactive voice coding scheme (e.g. using encoder 530 or 540).

[0206] For controlling the two coding modes (active and inactive coding modes, e.g. corresponding to an active phase operating mode and an inactive phase operating mode), a signal activity detector, e.g. as shown in Fig. 5 in the form of a Voice Activity Detection (VAD), e.g. 520, and optionally a switch unit may be used or may, for example, even be needed. The VAD, e.g. 520, may, for example, detect the start and end of the active and inactive phases and may, for example, control which signal(s) to convey to, for example, either the active voice encoder, e.g. 530, (e.g. neural audio encoder) or the inactive voice encoder, e.g. 540, (e.g. DTX / CNG encoder). In a simple or even the simplest embodiment, the enhanced signal, e.g. 502, may, for example, be conveyed either to the active voice encoder, e.g. 530, during active phases or to the inactive voice encoder, e.g. 540, during inactive phases. According to some embodiments, during inactive phases, an estimate of the background noise or an internal representation of the SEM (e.g. in the form of a noise reduction information, e.g. comprising or being a masking information) may, for example, be directly conveyed to the inactive voice encoder 540.
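The VAD-controlled routing described above can be sketched as a small switch. The two encoders are passed in as placeholder callables, and the function name, the string tags and the `use_noise_estimate` flag are all hypothetical; the sketch only illustrates that the enhanced signal goes to the active encoder during active phases, while either the enhanced signal or a noise estimate from the SEM goes to the inactive encoder during inactive phases.

```python
def route_frame(enhanced, noise_estimate, is_active,
                active_encoder, inactive_encoder, use_noise_estimate=True):
    """Convey one frame to the encoder selected by the VAD decision.

    active_encoder and inactive_encoder are placeholder callables standing
    in for, e.g., a neural audio encoder and a DTX / CNG encoder.
    """
    if is_active:
        return "active", active_encoder(enhanced)
    # In inactive phases, optionally pass the SEM noise estimate directly.
    payload = noise_estimate if use_noise_estimate else enhanced
    return "inactive", inactive_encoder(payload)
```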

[0207] In a preferred embodiment, a neural audio encoder is responsible for transmitting the information during active phases and is combined with a DTX / CNG encoder, which is used during inactive phases. During inactive phases, an inactive bitstream, e.g. 505, may be generated, for example, describing a background noise or an attenuated version of it. If the background noise is completely removed or strongly attenuated, the inactive bitstream, e.g. 505, can optionally be reduced to a signaling of the start and / or (e.g. eventually) the end of an inactive phase.

[0208] Embodiments, and in particular the above-discussed preferred embodiments may have the following advantages:

[0209] • Reduce the required bit-rate of an inactive phase audio encoder, e.g. in the form of a neural audio coder, since the inactive bit-stream, e.g. 507, comprises or consists of simple signaling, and, for example, alternatively or in addition a very parametric / compact description of the background noise (for example, in the form of a so-called Silence Insertion Descriptor, SID), which may, for example, be transmitted at a transmission rate lower, or even much lower, than the transmission rate during active phases (for example, a SID is transmitted at most every 8 frames in [2]).

[0211] • Reduce the algorithmic complexity of the whole speech coder (e.g. the active phase audio encoder), since the inactive phase audio encoder, e.g. the DTX / CNG encoder / decoder, e.g. 540, may be less or even much less complex than the active phase audio encoder / decoder (e.g. neural audio encoder / decoder), e.g. 530.

[0212] • Enhance the VAD and / or switch unit, e.g. 520, for example by exploiting the information and intermediate representations of the SEM.

[0213] • Enhance the inactive phase audio encoder (e.g. DTX / CNG encoder), e.g. 540, for example by exploiting the information and intermediate representations of the SEM (e.g. background noise estimate).
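The bit-rate advantage of a sparse inactive bitstream can be illustrated with a simple frame scheduler. The sketch assumes that a SID frame is sent on the first inactive frame and then at most every `sid_interval` frames, with nothing transmitted in between. The interval of 8 follows the example given above; the immediate SID at the start of an inactive phase, the function name and the frame labels are hypothetical.

```python
def dtx_schedule(vad_flags, sid_interval=8):
    """Label each frame as ACTIVE, SID (noise update sent) or NO_DATA."""
    schedule = []
    frames_since_sid = None  # None means no SID sent in the current inactive phase
    for is_active in vad_flags:
        if is_active:
            schedule.append("ACTIVE")
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid + 1 >= sid_interval:
            schedule.append("SID")
            frames_since_sid = 0
        else:
            schedule.append("NO_DATA")
            frames_since_sid += 1
    return schedule
```

Under these assumptions, a long inactive phase costs one compact SID frame per eight frames instead of one fully coded frame per frame.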

[0214] Detailed description of embodiments

[0215] A more detailed description of an embodiment according to the invention is illustrated in Fig. 6, for encoder and decoder sides. Fig. 6 shows a schematic view of lower level block diagrams of an encoder and a decoder, according to embodiments of the invention.

[0216] Encoder 600 comprises a signal enhancement module 610, here as an example in the form of a noise reduction unit, a signal activity detector 620, here as an example in the form of a voice activity detector, an optional unit 625, which may comprise a switch and / or a mixer, an active phase audio encoder (here as an example a neural audio encoder) 630 and an inactive phase audio encoder (here as an example, a DTX / CNG encoder) 640.

[0217] Decoder 700 comprises an active phase audio decoder (here as an example a neural audio decoder) 730, an inactive phase audio decoder (here as an example, a DTX / CNG decoder) 740 and a mixer 720, which may optionally comprise a switch.

[0218] This time (e.g. compared to the embodiment shown in Fig. 5) the VAD and switch unit, e.g. 520, is split into the VAD, e.g. 620, and the switch / mixer 1 unit, e.g. 625, and potential connections between the modules are drawn.

[0219] Embodiments according to the invention might be based on, but are not limited to, previous technologies, including the neural speech codec NESC [3] for the neural audio coder and the Ultra-Low Complexity Noise Suppressor (ULCNet) as described in [4] for the SEM module.

[0220] The signal activity detector, e.g. VAD, e.g. 620, can, for example, take several signals as inputs. It can use the input signal, e.g. 601, and, based on a local estimate of the Signal-to-Noise Ratio (segmental SNR, segSNR) and some heuristic rules, detect active phases and inactive phases. According to embodiments of the present invention, the VAD can, for example, be assisted by the SEM module, e.g. 610, for example by using an intermediate representation of it, e.g. 604, such as an estimated mask used to filter the input signal in the frequency domain. This mask, compared to the input signal magnitude spectrum, can, for example, be used to estimate a segSNR.

[0221] More specifically, the SEM module may, for example, be used or implemented as introduced in [4], which is illustrated in Fig. 7. Fig. 7 shows a schematic view of optional processing steps for a signal enhancement module according to embodiments. Fig. 7, taken from [4], illustrates, as an example, the Ultra Low Complexity DNN Model for noise suppression.

[0222] The input features provided to the DNN may, for example, be the magnitude and phase features computed from the power law compressed real and imaginary parts of the noisy signal's STFT. A power law compression method may, for example, be applied to the real part $X_r$ and the imaginary part $X_i$ of the noisy signal $X$, e.g. with a power law factor $\alpha$ in the range [0,1], for example, as follows:

$$\tilde{X}_r = \operatorname{sign}(X_r)\,|X_r|^{\alpha}, \qquad \tilde{X}_i = \operatorname{sign}(X_i)\,|X_i|^{\alpha}$$
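The compression step described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of [4]: the function names `power_law_compress` and `dnn_input_features` are hypothetical, and the value of `alpha` in the usage lines is an arbitrary choice within [0,1].

```python
import numpy as np

def power_law_compress(X: np.ndarray, alpha: float) -> np.ndarray:
    """Compress the real and imaginary parts of a complex STFT separately.

    Each part keeps its sign while its absolute value is raised to alpha.
    """
    real = np.sign(X.real) * np.abs(X.real) ** alpha
    imag = np.sign(X.imag) * np.abs(X.imag) ** alpha
    return real + 1j * imag

def dnn_input_features(X_c: np.ndarray):
    """Magnitude and phase of the compressed spectrum (assumed DNN inputs)."""
    return np.abs(X_c), np.angle(X_c)

# Usage: one frame with two frequency bins, alpha chosen only for illustration.
X = np.array([[4.0 - 9.0j, 0.0 + 1.0j]])
X_c = power_law_compress(X, alpha=0.5)
magnitude, phase = dnn_input_features(X_c)
```

Compressing the two parts separately keeps the sign information, which is what allows the decompression to be applied part by part later on.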

[0223] The DNN may, for example, then predict in the STFT domain, from the power law compressed real and imaginary parts of the noisy signal, a complex-valued mask which may, for example, then be applied to the noisy signal for estimating the clean speech S. This may, for example, be done in two steps, first predicting a real magnitude mask $M_m$ before estimating a complex-valued mask, for example by estimating the real and imaginary parts $M_r$ and $M_i$.

[0224] From these masks it is possible, in the STFT domain, to estimate both the clean speech and the magnitude of the noisy part (below still in the power law compressed domain).

[0225] Alternatively, the magnitude of noise can be estimated from the intermediate real magnitude mask:

[0226] Alternatively, the noise can be estimated by a simple subtraction:

[0227] $$\tilde{N} = \tilde{X} - \tilde{S}$$

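[0228] Both the clean speech and noise estimates can be recovered after a power law decompression of their real and imaginary parts:

$$S_{r/i} = \operatorname{sign}(\tilde{S}_{r/i})\,|\tilde{S}_{r/i}|^{1/\alpha}, \qquad N_{r/i} = \operatorname{sign}(\tilde{N}_{r/i})\,|\tilde{N}_{r/i}|^{1/\alpha}$$

or only the noise magnitude part:

$$|N| = \tilde{N}_m^{1/\alpha}$$

where the tilde marks quantities in the power law compressed domain and $\tilde{N}_m$ is the magnitude of the compressed noise estimate.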
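A minimal sketch of the subtraction-based noise estimate and of the decompression follows. It assumes that the mask is applied by element-wise multiplication in the compressed domain, which may differ from the exact mask application in [4]; the names `power_law_decompress` and `estimate_speech_and_noise` are hypothetical.

```python
import numpy as np

def power_law_decompress(Z_c: np.ndarray, alpha: float) -> np.ndarray:
    """Invert the power law compression on real and imaginary parts."""
    real = np.sign(Z_c.real) * np.abs(Z_c.real) ** (1.0 / alpha)
    imag = np.sign(Z_c.imag) * np.abs(Z_c.imag) ** (1.0 / alpha)
    return real + 1j * imag

def estimate_speech_and_noise(X_c: np.ndarray, mask: np.ndarray, alpha: float):
    """Return clean speech, noise, and a magnitude-only noise estimate.

    X_c  : power law compressed noisy spectrum
    mask : mask predicted by the enhancement network (assumed element-wise)
    """
    S_c = mask * X_c                  # compressed clean speech estimate
    N_c = X_c - S_c                   # compressed noise by simple subtraction
    S = power_law_decompress(S_c, alpha)
    N = power_law_decompress(N_c, alpha)
    # Magnitude-only alternative: decompress the compressed noise magnitude.
    # This is a separate estimate and need not equal np.abs(N).
    N_mag = np.abs(N_c) ** (1.0 / alpha)
    return S, N, N_mag

# Usage with a single bin and a real-valued mask of 0.5.
S, N, N_mag = estimate_speech_and_noise(
    np.array([2.0 - 3.0j]), np.array([0.5]), alpha=0.5
)
```

The magnitude-only path avoids handling the phase of the noise, which is sufficient when only noise energies are needed downstream.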

[0229] Since the SEM operates on short-term Fourier transforms (STFT), according to embodiments, an estimate of the clean speech and of the noisy part can be obtained, for example, at every hop, optionally with a time resolution equal to the hop size, which is 10 ms according to a preferred embodiment. A local SNR, called segmental SNR, can easily be deduced:

[0230] where k is the frequency index. This segmental SNR can, for example, be used to directly deduce a Voice Activity Detection (VAD) decision, e.g. by thresholding its values. Alternatively, one can, for example, smooth the energy of the noise or the noise magnitude estimates to get an estimate of the background noise level in the signal. This level could serve as a reference against which to compare the instantaneous estimated clean speech energy, for example, for detecting the voice activity phases. Another possibility is to design a machine learning algorithm that, for example, takes as input the instantaneous energy of the clean speech and / or noisy part estimates to derive the VAD decision.
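The thresholding and smoothing options described above can be sketched as follows. The exact segSNR definition is not reproduced here, so the sketch assumes a per-frame ratio of summed squared magnitudes expressed in dB; the threshold, the smoothing factor `beta` and all function names are hypothetical.

```python
import numpy as np

def segmental_snr_db(S: np.ndarray, N: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-frame SNR in dB for spectra shaped (frames, frequency bins)."""
    speech_energy = np.sum(np.abs(S) ** 2, axis=-1)
    noise_energy = np.sum(np.abs(N) ** 2, axis=-1)
    return 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))

def vad_by_threshold(seg_snr_db: np.ndarray, threshold_db: float = 3.0) -> np.ndarray:
    """Mark a frame as active when its segmental SNR exceeds the threshold."""
    return seg_snr_db > threshold_db

def smoothed_noise_level(N: np.ndarray, beta: float = 0.9) -> np.ndarray:
    """Exponential moving average of the per-frame noise energy."""
    energy = np.sum(np.abs(N) ** 2, axis=-1)
    level = np.empty_like(energy)
    acc = energy[0]                      # initialise with the first frame
    for t, e in enumerate(energy):
        acc = beta * acc + (1.0 - beta) * e
        level[t] = acc
    return level

# Usage: frame 0 is speech dominated, frame 1 is noise dominated.
S = np.array([[1.0, 1.0], [0.1, 0.1]])
N = np.array([[0.1, 0.1], [1.0, 1.0]])
snr = segmental_snr_db(S, N)
active = vad_by_threshold(snr)
background = smoothed_noise_level(N)
```

A learned detector could consume the same per-frame energies as input features instead of applying a fixed threshold.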

[0231] The optional switch / mixer 1 module conveys, for example, the output(s) of the SEM to the subsequent coders depending on the VAD decision. In active phases, the enhanced signal is, for example, conveyed to the active phase coder, e.g. a neural audio encoder. In inactive phases, the same enhanced signal can, for example, be conveyed to the inactive phase encoder.

[0232] Alternatively, in order to enhance and simplify the inactive phase encoder, an estimate of the background noise or of the residual background noise in the enhanced signal can, for example, be directly sent as input to the inactive phase encoder. In another embodiment, a frequency domain representation of the background noise or of the residual background noise in the enhanced signal can, for example, be conveyed.

[0233] Indeed, such a representation can be easily derived from a SEM like in [4] and can advantageously replace the Minimum Statistics processing done in [2].

[0234] Finally, the noise estimate or an intermediate representation of the SEM can, for example, also be transmitted to the inactive phase encoder.
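The routing performed by the switch / mixer 1 unit can be illustrated with a small sketch. The function `route_frame` and its callable parameters are hypothetical stand-ins for the actual coders; the sketch only shows the decision logic of conveying either the enhanced signal or a noise estimate depending on the VAD decision.

```python
from typing import Any, Callable, Optional, Tuple

def route_frame(
    vad_active: bool,
    enhanced_frame: Any,
    noise_estimate: Optional[Any] = None,
    *,
    encode_active: Callable[[Any], Any],
    encode_inactive: Callable[[Any], Any],
) -> Tuple[str, Any]:
    """Send one frame to the active or the inactive phase encoder.

    In active phases the enhanced signal goes to the active phase encoder.
    In inactive phases the inactive phase encoder receives the noise
    estimate when one is provided, otherwise the enhanced signal.
    """
    if vad_active:
        return "active", encode_active(enhanced_frame)
    inactive_input = enhanced_frame if noise_estimate is None else noise_estimate
    return "inactive", encode_inactive(inactive_input)

# Usage with placeholder encoders that simply tag their input.
kind, payload = route_frame(
    False, "enhanced", "noise",
    encode_active=lambda f: ("A", f),
    encode_inactive=lambda f: ("I", f),
)
```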

[0235] More specifically, for example, by considering the SEM module of [4], the magnitude noise estimate $|N|$ can optionally be used to derive the parameters of the inactive phase encoder. Using the inactive coding as described in [2], the noise estimation in the frequency domain can, for example, be advantageously replaced by the estimate(s) derived by the SEM module. The inactive phase encoder may be configured to transmit, or its operation may for example consist of transmitting, an absolute noise level / energy, optionally along with a spectral shape of the noise estimate reflecting the energy distribution of the noise estimate, for example, in $L_{SID}$ frequency sub-bands. The sub-bands are, for example, obtained by grouping FFT bins into non-overlapping spectral partitions. The sub-band grouping is, for example, not uniform, optionally follows roughly the Bark scale, and can, for example, be seen as critical bands in the perceptual sense. For example, 24 sub-bands are obtained at the encoder and transmitted. According to an embodiment, the $L_{SID}$ sub-band energies of the noise estimate may, for example, be obtained directly from the SEM module:

[0236] where $F_b$ is the cardinality of $I_b$, the set of FFT bin indices belonging to the sub-band index $b$. It is also possible to capture the longer-term characteristics of the noise estimate by smoothing the noise estimate or the sub-band energies, for example, by applying a moving average (MA) and / or an exponential moving average (EMA).
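A possible way to compute such sub-band energies is sketched below. It assumes that each sub-band energy is the mean of the squared noise magnitudes over the bins of the band, i.e. the sum normalised by the cardinality of the bin set; the band edges in the usage lines are illustrative and are not the partition used in [2]. All names are hypothetical.

```python
import numpy as np

def subband_energies(noise_mag: np.ndarray, band_edges) -> np.ndarray:
    """Mean squared magnitude per sub-band.

    noise_mag  : noise magnitude per FFT bin for one frame
    band_edges : increasing bin boundaries; band b covers
                 bins band_edges[b] .. band_edges[b + 1] - 1
    """
    power = np.asarray(noise_mag, dtype=float) ** 2
    return np.array([
        power[lo:hi].mean()               # sum over the band divided by its size
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ])

def ema_smooth(previous: np.ndarray, current: np.ndarray, beta: float = 0.9) -> np.ndarray:
    """One exponential moving average update of the sub-band energies."""
    return beta * previous + (1.0 - beta) * current

# Usage: six bins grouped into a narrow and a wider band.
energies = subband_energies(np.array([1.0, 1.0, 2.0, 2.0, 2.0, 2.0]), [0, 2, 6])
smoothed = ema_smooth(np.zeros(2), energies)
```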

[0237] The $L_{SID}$ sub-band energies $E_b$ are, for example, first converted into dB and optionally normalized, for example, by a global gain, for example, to capture the shape information of the spectrum. The global gain is, for example, quantized on e.g. 7 bits, whereas the normalized vector is, for example, encoded by a Multi-Stage Vector Quantizer (MSVQ) with e.g. 4 stages. As an example, 6 bits are allocated to each of the first 2 stages, and 5 bits are allocated to each of the remaining 2 stages. The SID then comprises or contains 29 bits (7 + 6 + 6 + 5 + 5), for example, along with some side information, which corresponds, for example, to 1.6 kbps. The MSVQ can, for example, also be truncated, for example, by not transmitting some lower layers, for example, to further reduce the SID payload. It is worth noting that the SID is, for example, not transmitted at every frame but optionally rather every N frames, e.g. N=8.
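The gain / shape split and the multi-stage quantization can be sketched as follows. The codebooks here are random placeholders rather than trained ones, the global gain is assumed to be the mean of the dB values, and the greedy stage-by-stage search is only one possible search strategy; all names are hypothetical.

```python
import numpy as np

GAIN_BITS = 7
STAGE_BITS = (6, 6, 5, 5)   # bits per MSVQ stage as in the example above

def sid_payload_bits() -> int:
    """Total SID bits without side information."""
    return GAIN_BITS + sum(STAGE_BITS)

def shape_and_gain_db(energies, eps: float = 1e-12):
    """Convert energies to dB and remove a global gain (assumed: mean in dB)."""
    energies_db = 10.0 * np.log10(np.asarray(energies, dtype=float) + eps)
    gain_db = energies_db.mean()
    return energies_db - gain_db, gain_db

def make_codebooks(dim: int, rng: np.random.Generator):
    """Random placeholder codebooks, one per stage, 2**bits entries each."""
    return [rng.standard_normal((2 ** bits, dim)) / (stage + 1)
            for stage, bits in enumerate(STAGE_BITS)]

def msvq_encode(vector, codebooks):
    """Greedy search: each stage quantizes the residual of the previous one."""
    residual = np.array(vector, dtype=float)
    indices = []
    for codebook in codebooks:
        idx = int(np.argmin(np.sum((codebook - residual) ** 2, axis=1)))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices

def msvq_decode(indices, codebooks):
    """Sum the selected codewords; fewer indices give a truncated decode."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```

Passing only the first indices to `msvq_decode` mimics a truncated MSVQ in which the last stages are not transmitted.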

[0238] At the decoder side, the transmitted VAD information may, for example, dictate which decoder needs to decode the current packet or, in case of DTX non-transmission, to generate a signal. It is possible that the active and inactive phase decoders exchange some information, e.g. information 709 as shown in Fig. 6, for example for buffer initialization during transitions. During active phases, the decoded signal from the neural audio decoder can, for example, also serve to refine the noise estimate at the decoder side, for example using a Minimum Statistics method as proposed in [2].

[0239] The switch-mixer 2 unit outputs the appropriate decoded signal, e.g. 707, based on the VAD decision, e.g. 703, but can also decide to cross-fade the generated signals, or, for example, to mix them in active phases, for example, since the DTX / CNG decoder is able to generate a background noise or an attenuated background noise artificially even during active phases. Different scenarios are possible according to embodiments:

[0240] • Do not restore the background noise at all: nothing from the DTX / CNG decoder is, for example, output, and silence is, for example, generated in inactive phases.

[0241] • Restore a certain level of background noise for pleasantness and listening comfort: the comfort noise generated by the DTX / CNG decoder is, for example, attenuated and played (optionally solely) during inactive phases, and possibly mixed with the audio decoder output, for example, during active phases.

[0242] • Restore the original level of background noise entirely: the output of the DTX / CNG decoder is, for example, then not attenuated, and can, for example, optionally be mixed with the audio decoder output, for example, during active phases.
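The three scenarios above can be expressed with a single noise gain, as in the following sketch. The function names, the linear cross-fade and the parameter `mix_in_active` are illustrative assumptions and not part of the described decoder.

```python
import numpy as np

def mix_output(vad_active: bool, decoded_speech: np.ndarray,
               comfort_noise: np.ndarray, noise_gain: float,
               mix_in_active: bool = True) -> np.ndarray:
    """Combine decoded speech and generated comfort noise for one frame.

    noise_gain = 0.0      : no background noise (silence in inactive phases)
    0 < noise_gain < 1.0  : attenuated comfort noise
    noise_gain = 1.0      : original background noise level
    """
    if not vad_active:
        return noise_gain * comfort_noise
    if mix_in_active:
        return decoded_speech + noise_gain * comfort_noise
    return decoded_speech

def crossfade(previous_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Linear cross-fade between two frames at a phase transition."""
    fade_in = np.linspace(0.0, 1.0, len(previous_frame))
    return (1.0 - fade_in) * previous_frame + fade_in * next_frame

# Usage: attenuated comfort noise mixed under active speech.
speech = np.ones(4)
noise = np.full(4, 0.5)
mixed = mix_output(True, speech, noise, noise_gain=0.5)
```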

[0243] In the following, embodiments are summarized:

[0244] First additional embodiment: Audio processor, comprising:

[0245] • an audio enhancement module (e.g. as an example of a signal enhancement module) configured to process an input audio signal to obtain an enhanced audio signal (ES); and

[0246] • a signal activity detector, e.g. a voice activity detector, optionally with a switch functionality (e.g. a VAD and switch unit) configured to generate active and inactive signals, and to detect active and inactive phases; and

[0247] • an active phase coder processing (e.g. configured to process) the active signal during the active phases and generating an active bitstream

[0248] • an inactive phase coder processing (e.g. configured to process) the inactive signal during the inactive phases and generating an inactive bitstream

[0249] Second additional embodiment: Audio processor according to the first additional embodiment, wherein: the active phase coder comprises at least one learnable layer or is a neural audio coder

[0250] Third additional embodiment: Audio processor according to the second additional embodiment, wherein: the SEM comprises at least one learnable layer or is a DNN-based speech enhancer

[0251] Optional aspects of embodiments (e.g. encoder)

[0252] Embodiments according to the invention, e.g. in particular encoders according to embodiments may be configured to perform any or all of the following functionalities, both individually or taken in combination:

[0253] • Deduce a Voice Activity Decision using internal representation(s) or output(s) of the noise reduction module, where VAD can include additional heuristic or ML.
o VAD can include heuristic rules and / or ML
o Inputs of VAD can include the estimated noise, the input speech or the estimated clean speech
o Inputs of VAD can include any internal representation of the NR module, or features extracted from the NR module like subband energies.
o Inputs of VAD can include other domain specific features computed outside the NR module, directly from the input speech
o VAD derived from NR module output(s) or internal representation(s), steering the switching between active speech coding by a (neural) speech coder and comfort noise parametric coder or / and between continuous and discontinuous transmission (DTX mode on / off).

[0254] • Switch-mixer 1 controlled by a VAD producing different mixes of a noise and a clean speech signal estimated by a NR module, switching between active audio coding ((neural) audio encoder) and an inactive coding (DTX / CNG encoder).
o Conveying to DTX-CNG encoder an (attenuated) version of the background noise estimated by the SEM in inactive phases detected by the VAD
o Conveying to (neural) audio encoder the clean speech or a mix between the (attenuated) estimated noise and the estimated clean speech in active phases detected by the VAD.

[0255] • DTX / CNG encoder (Silence Descriptor, SID), transmitting a parametric or a very compact representation of the background noise, using an output of NR module

[0256] o Using the estimated noise or an attenuated version of the estimated noise as input to encode
o Coding a spectral envelope or subband energies of the background noise to encode, subband energies could be directly derived from the NR module.
o Coding the background noise with the help of VQ or a DNN-based quantizer.

[0257] Optional aspects of embodiments (e.g. encoder-decoder)

[0258] Embodiments according to the invention, e.g. encoders and / or decoders according to embodiments may be configured to perform any or all of the following functionalities, both individually or taken in combination:

[0259] • Switch-mixer 1 + Switch-mixer 2, where Switch-mixer 2 is controlled by a transmitted VAD information producing different mixes of a generated comfort noise and a decoded active speech signal.
o Outputting the generated comfort noise during inactive phases of the VAD
o Outputting the decoded active speech during the active phases of the VAD, or a mix of the generated comfort noise and the decoded active speech.
o Inter-play between attenuation / mixing done in Switch-mixer 1 and mixing done in Switch-mixer 2.

[0260] Hence, according to a preferred embodiment, a speech enhancer (e.g. as an example of a signal enhancement module), a neural audio / speech coder (e.g. as an example of an active phase audio encoder) and a DTX / CNG coder (e.g. as an example of an inactive phase audio encoder) may be combined to improve the transmission of speech recorded in real-world environment. The neural audio encoder may, for example, be responsible for coding active voice phases, while the DTX / CNG encoder may, for example, handle inactive voice phases. For example, by exploiting the available internal or external representations of the speech enhancer, the voice activity detection and background noise estimation used or for example required by the DTX / CNG module can be improved for a more efficient transmission.

[0261] Hence, a core concept according to embodiments may comprise controlling VAD / DTX-CNG of a neural audio coder by a noise reduction module.

[0262] Accordingly, embodiments may be used in the field of neural speech coding, NESC, DTX / CNG and / or Noise suppression. In particular, embodiments may be implemented for Communication Speech Coding, e.g. in particular in NESC (e.g. new space communication standard).

[0263] With regard to the above-discussion of the invention, it is to be highlighted again that for the sake of brevity some embodiments were disclosed by the way of example. However, in view of this, the invention is not to be limited by the exact examples.

[0264] Accordingly, as previously described, equal or equivalent elements or elements with equal or equivalent functionality are denoted in the description by equal or equivalent reference numerals even if occurring in different figures.

[0265] Hence, for example, the signal enhancement module 110 may correspond to signal enhancement module 210, which may both correspond to signal enhancement module 510 (e.g. as an example for a signal enhancement module) and / or to noise reduction unit 610 (e.g. as an example for a signal enhancement module). Hence, features, functionalities and details as disclosed for one of these modules may be implemented or used with any of the other modules, in particular with any of the further features, functionalities and details as disclosed in the corresponding embodiment.

[0266] The same applies for signal activity detectors 120, 220 and voice activity detectors 520 and 620, as well as for active phase audio encoders 130, 230 and (neural) audio encoders 530, 630, as well as for inactive phase audio encoders 140, 240 and DTX / CNG encoders 540, 640.

[0267] In line with this, decoder entities may comprise same or corresponding features, functionalities or details as their encoder-sided counterparts, e.g. active phase audio decoder 430 (e.g. corresponding to (neural) audio decoder 730) having features, functionalities and details corresponding to active phase audio encoders 130, 230 and (neural) audio encoders 530, 630, and e.g. inactive phase audio decoder 440 (e.g. corresponding to DTX / CNG decoder 740) having features, functionalities and details corresponding to inactive phase audio encoders 140, 240 and DTX / CNG encoders 540, 640.

[0268] Accordingly, input signal 101 may correspond to input signal 201 or 301, which may correspond to input speech signals 501 and / or 601.

[0269] In addition, outputs of the disclosed encoders may correspond to inputs of corresponding decoders, e.g. encoded representation 105 and / or 107 or encoded representations 205 and / or 207 and / or inactive bitstream 507 and / or active bitstream 505 and / or coding info 607 and / or CNG info 605 corresponding to encoded representation 402 and / or corresponding to CNG info 705 and / or coding info 707.

[0270] Accordingly, the signal activity information 103, 104, 401, and VAD info 603 and 703 may comprise corresponding information.

[0271] Implementation Alternatives

[0272] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.

[0273] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.

[0274] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

[0275] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.

[0276] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.

[0277] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

[0278] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and / or non-transitionary.

[0279] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.

[0280] A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.

[0281] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

[0282] A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.

[0283] In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.

[0284] The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

[0285] The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and / or in software.

[0286] The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

[0287] The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and / or by software. The above-described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

[0288] References

[0289] [1] Haici Yang, Kai Zhen, Seungkwon Beack, Minje Kim, “Source-Aware Neural Speech Coding for Noisy Speech Compression”.

[0290] [2] A. Lombard, S. Wilde, E. Ravelli, S. Dohla, G. Fuchs, and M. Dietz, “Frequency-domain Comfort Noise Generation for Discontinuous Transmission in EVS,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia: IEEE, Apr. 2015, pp. 5893-5897. doi: 10.1109/ICASSP.2015.7179102.

[0291] [3] N. Pia, K. Gupta, S. Korse, M. Multrus, and G. Fuchs, “NESC: Robust Neural End-2-End Speech Coding with GANs,” Jul. 07, 2022, arXiv:2207.03282. Accessed: Oct. 24, 2024. [Online]. Available: http://arxiv.org/abs/2207.03282

[0292] [4] S. S. Shetu, S. Chakrabarty, O. Thiergart, and E. Mabande, “Ultra Low Complexity Deep Learning Based Noise Suppression,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2024, pp. 466-470. doi: 10.1109/ICASSP48485.2024.10448353.

[0293] FH241204PCT-2024363614. DOCX

Claims

1. An audio encoder (100, 200, 500, 600) for providing an encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) on the basis of an input audio signal (101, 201, 301, 501, 601), wherein the audio encoder comprises a signal enhancement module (110, 210, 300, 510, 610), configured to apply a noise reduction to the input audio signal, or to a processed version thereof, to obtain an enhanced audio signal (102, 202, 302, 502, 602, 606), wherein the audio encoder comprises an active phase audio encoder (130, 230, 530, 630), configured to provide the encoded representation (105, 205, 402, 505, 605, 705) on the basis of the enhanced audio signal, or a processed version thereof, wherein the audio encoder comprises an inactive phase audio encoder (140, 240, 540, 640), configured to obtain a noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) about the noise reduction applied to the input audio signal, or to the processed version thereof, by the signal enhancement module and to provide the encoded representation (107, 207, 402, 507, 607, 707) on the basis of the noise reduction information, and / or wherein the inactive phase audio encoder is configured to provide the encoded representation (107, 207, 402, 507, 607, 707) on the basis of the enhanced audio signal, or a processed version thereof.

2. The audio encoder (100, 200, 500, 600) according to claim 1, wherein the audio encoder comprises a signal activity detector (120, 220, 520, 620), configured to obtain a signal activity information (103, 203, 401, 603, 703), using a noise reduction information or the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) about the noise reduction applied to the input audio signal (101, 201, 301, 501, 601), or to a processed version thereof, by the signal enhancement module (110, 210, 300, 510, 610), wherein the active phase audio encoder (130, 230, 530, 630) is configured to provide the encoded representation (105, 205, 402, 505, 605, 705) in dependence of the signal activity information; and wherein the inactive phase audio encoder (140, 240, 540, 640) is configured to provide the encoded representation (107, 207, 402, 507, 607, 707) in dependence of the signal activity information.

3. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the active phase encoder comprises a neural audio encoder.

4. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain a spectral domain representation (305) on the basis of the input audio signal (101, 201, 301, 501, 601); wherein the signal enhancement module is configured to obtain a mask information (303) on the basis of the spectral domain representation; wherein the signal enhancement module is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) using a masking of the spectral domain representation or of a processed version thereof, using the mask information; and wherein the signal enhancement module is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of the mask information or to provide the mask information as the noise reduction information.

5. The audio encoder (100, 200, 500, 600) according to claim 4, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to non-linearly scale the spectral domain representation (305) in order to obtain the processed version of the spectral domain representation (305), and wherein the signal enhancement module is configured to obtain the mask information (303) on the basis of the processed version of the spectral domain representation, wherein the signal enhancement module is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) using the masking of the processed version of the spectral domain representation using the mask information.

6. The audio encoder (100, 200, 500, 600) according to claim 4 or 5, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to apply a power law compression to the spectral domain representation (305), in order to obtain the processed version of the spectral domain representation (305).

7. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 6, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to apply a power law compression to a real part and to an imaginary part of the spectral domain representation (305), in order to obtain the processed version of the spectral domain representation (305).

8. The audio encoder (100, 200, 500, 600) according to claim 7, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the real part $\tilde{X}_r$ and the imaginary part $\tilde{X}_i$ of the processed version of the spectral domain representation (305) according to

$$\tilde{X}_r = \operatorname{sign}(X_r)\,|X_r|^{\alpha}, \qquad \tilde{X}_i = \operatorname{sign}(X_i)\,|X_i|^{\alpha}$$

wherein $X_r$ is a real part of the spectral domain representation; wherein $X_i$ is an imaginary part of the spectral domain representation; sign is the signum function; and $\alpha$ is a power law factor between [0,1].

9. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 8, wherein the signal enhancement module (110, 210, 300, 510, 610) comprises a neural network; and wherein the neural network is configured to obtain the mask information (303) on the basis of the processed version of the spectral domain representation (305).

10. The audio encoder (100, 200, 500, 600) according to claim 9, wherein the neural network is configured to obtain the mask information (303) on the basis of a magnitude of the processed version of the spectral domain representation (305) or on the basis of magnitudes of spectral values of the processed version of the spectral domain representation.

11. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 10, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain an intermediate magnitude mask on the basis of a magnitude of the processed version of the spectral domain representation (305) or on the basis of magnitudes of spectral values of the processed version of the spectral domain representation (305); wherein the signal enhancement module is configured to obtain an intermediate representation in dependence on the intermediate magnitude mask and on the basis of a phase of the processed version of the spectral domain representation or on the basis of phases of spectral values of the processed version of the spectral domain representation; wherein the signal enhancement module is configured to obtain a mask on the basis of the intermediate representation; wherein the signal enhancement module is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) using a masking of the processed version thereof, using the mask.

12. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 11, wherein the signal enhancement module (110, 210, 300, 510, 610) comprises a first stage comprising a first neural network; wherein the signal enhancement module comprises a second stage comprising a second neural network; wherein the first neural network has a higher computational complexity than the second neural network; wherein the first stage is configured to process a magnitude of the processed version of the spectral domain representation (305) in order to obtain an intermediate magnitude mask; wherein the signal enhancement module is configured to obtain an intermediate representation in dependence on the intermediate magnitude mask and in dependence on a phase of the processed version of the spectral domain representation or in dependence on phases of spectral values of the processed version of the spectral domain representation; and wherein the second stage is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) in dependence on the intermediate representation.

13. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 12, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to perform a channel-wise feature reorientation on the basis of the processed version of the spectral domain representation (305) in order to obtain the mask information (303).

14. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 13, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the mask information (303) and / or the enhanced audio signal (102, 202, 302, 502, 602, 606) on the basis of the processed version of the spectral domain representation (305) using a processing in a power law domain.

15. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 14, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain a complex-valued mask on the basis of the processed version of the spectral domain representation (305), in order to obtain the mask information (303) and / or in order to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606).

16. The audio encoder according to claim 15, wherein the signal enhancement module is configured to obtain a real magnitude mask on the basis of the processed version of the spectral domain representation (305) in order to obtain the complex-valued mask.

17. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 16, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain a mask in dependence on the processed version of the spectral domain representation (305); and wherein the signal enhancement module is configured to determine a complement of the mask, in order to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) and / or wherein the signal enhancement module is configured to determine the complement of the mask in order to obtain the noise reduction information.

18. The audio encoder (100, 200, 500, 600) according to claim 17, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to apply a mask to a representation of the audio input signal, or of the processed version thereof, in a non-linearly scaled domain, using the complement of the mask, in order to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

19. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 18, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to determine a complement or a difference on the basis of representations of the enhanced audio signal (102, 202, 302, 502, 602, 606) and of the input audio signal (101 , 201 , 301 , 501 , 601) or the processed version thereof, in a non-linearly scaled domain, in order to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

20. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 19, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) and / or the enhanced audio signal (102, 202, 302, 502, 602, 606) using an inverse non-linear scaling or using a power law decompression.

21. The audio encoder (100, 200, 500, 600) according to claim 20, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain a mask on the basis of the processed version of the spectral domain representation (305), in order to obtain the mask information (303), wherein the signal enhancement module is configured to determine a complement of the mask, in order to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606), wherein the signal enhancement module is configured to mask a representation of the audio input signal, or of the processed version thereof, in a power law domain using the complement of the mask, in order to obtain a power law domain noise estimate $N$, wherein the signal enhancement module is configured to obtain the noise reduction information on the basis of a noise magnitude information $\hat{N}_m$, which is determined according to

$$\hat{N}_m = N_m^{1/\alpha}$$

wherein $N_m$ is the magnitude of the power law domain noise estimate; or wherein the signal enhancement module is configured to obtain a noise estimate information $N$ in the power law domain, having a real part $N_r$ and an imaginary part $N_i$, using a subtraction between the preprocessed version of the input audio signal and a version of the enhanced signal in the power law domain, and wherein the signal enhancement module is configured to obtain the noise reduction information on the basis of the noise estimate information $N$ in the power law domain, having a real part $N_r$ and an imaginary part $N_i$, according to

$$\hat{N}_{r/i} = \operatorname{sign}(N_{r/i})\,\lvert N_{r/i}\rvert^{1/\alpha}$$

wherein $N_r$ is the real part of the power law domain noise estimate; wherein $N_i$ is the imaginary part of the power law domain noise estimate; sign is the signum function; and $\alpha$ is a power law factor in the interval [0, 1].

22. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 21, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the spectral domain representation (305) on the basis of the input audio signal (101, 201, 301, 501, 601) and / or on the basis of the processed version thereof, using a windowing; wherein the signal enhancement module is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of the mask information (303) with a temporal granularity which is associated with the windowing.
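The first alternative of claim 21 (mask complement applied in the power law domain, followed by power law decompression) can be illustrated with a minimal sketch. It assumes a real magnitude mask with values in [0, 1], a compression that raises only the magnitude to the power alpha while keeping the phase, and alpha = 0.5; the function names `compress` and `noise_estimate` are hypothetical and do not appear in the claims.

```python
import cmath

def compress(x: complex, alpha: float) -> complex:
    """Power-law compress one complex bin: magnitude ** alpha, phase kept (assumed form)."""
    mag, phase = cmath.polar(x)
    return cmath.rect(mag ** alpha, phase)

def noise_estimate(spectrum, mask, alpha=0.5):
    """Noise spectrum from the complement (1 - m) of a real mask m in [0, 1]."""
    noise = []
    for x, m in zip(spectrum, mask):
        # Mask with the complement in the power-law domain.
        n_pl = (1.0 - m) * compress(x, alpha)
        # Power-law decompression of the magnitude: N_m ** (1 / alpha).
        mag, phase = cmath.polar(n_pl)
        noise.append(cmath.rect(mag ** (1.0 / alpha), phase))
    return noise

if __name__ == "__main__":
    bins = [4 + 0j, 3 + 4j, 1j]
    print(noise_estimate(bins, [0.5, 0.0, 1.0]))
```

With a mask value of 1 the bin is treated as pure speech and the noise estimate is zero; with a mask value of 0 the whole bin is returned as noise.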

23. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 22, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the spectral domain representation (305) on the basis of the input audio signal (101 , 201 , 301 , 501 , 601) and / or on the basis of the processed version thereof, using a windowing; wherein the signal enhancement module is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) with a temporal granularity which is associated with the windowing; wherein the signal enhancement module is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) with a temporal granularity which is associated with the windowing; and / or wherein the audio encoder is configured to obtain a signal to noise ratio information with a temporal granularity which is associated with the windowing, in order to obtain the noise reduction information.

24. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 23, wherein the audio encoder is configured to obtain the signal to noise ratio information segSNR according to [equation not reproduced in this text], wherein $S_k$ is a clean speech estimate; wherein $N_k$ is a noise estimate; wherein $k$ is a frequency index.

25. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 24, wherein the audio encoder is configured to obtain a time domain signal to noise ratio information and to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of the time domain signal to noise ratio information.

26. The audio encoder (100, 200, 500, 600) according to any of claims 22 to 25, wherein the temporal granularity which is associated with the windowing is a time size resolution equal to the hop size of the windowing.

27. The audio encoder (100, 200, 500, 600) according to any of claims 23 to 26, wherein the audio encoder is configured to compare the signal to noise ratio information to a threshold, in order to provide a comparison result for usage in the signal activity detector (120, 220, 520, 620) and / or for usage in the inactive phase audio encoder (140, 240, 540, 640).
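The per-frame signal to noise ratio and threshold comparison of claims 24 to 27 can be sketched as follows. The segSNR equation itself is not reproduced in this text, so the energy ratio in decibels used here is an assumption, as are the names `seg_snr_db` and `is_active` and the default threshold of 0 dB.

```python
import math

def seg_snr_db(speech_bins, noise_bins, eps=1e-12):
    """Assumed segSNR: 10*log10 of speech energy over noise energy across bins k."""
    s = sum(abs(x) ** 2 for x in speech_bins)
    n = sum(abs(x) ** 2 for x in noise_bins)
    return 10.0 * math.log10((s + eps) / (n + eps))

def is_active(speech_bins, noise_bins, threshold_db=0.0):
    """Comparison result for one frame (one hop of the windowing)."""
    return seg_snr_db(speech_bins, noise_bins) > threshold_db

if __name__ == "__main__":
    print(seg_snr_db([10.0, 0.0], [1.0, 0.0]))
    print(is_active([0.1], [1.0]))
```

One comparison result is produced per frame, which matches a temporal granularity equal to the hop size of the windowing.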

28. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 27, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) with a temporal granularity which is associated with the windowing; wherein the signal enhancement module is configured to obtain a noise information with a temporal granularity which is associated with the windowing, on the basis of the mask information (303) and on the basis of the spectral domain representation (305); wherein the audio encoder is configured to aggregate the noise information, in order to obtain a noise level information; wherein the audio encoder is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of the enhanced audio signal and the noise level information.

29. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 28, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the enhanced audio signal (102, 202, 302, 502, 602, 606) with a temporal granularity which is associated with the windowing; wherein the signal enhancement module is configured to obtain a noise information with a temporal granularity which is associated with the windowing in dependence on the mask information (303) and on the basis of the spectral domain representation (305); wherein the audio encoder is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of the enhanced audio signal and the noise information using machine learning.

30. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 29, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) on the basis of a comparison of a short-term estimate of the enhanced audio signal (102, 202, 302, 502, 602, 606) with a long-term estimate of a noise included in the input audio signal (101, 201, 301, 501, 601) or with a long-term estimate of a noise reduced from the enhanced audio signal when compared to the input audio signal (101, 201, 301, 501, 601).

31. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 30, wherein the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) comprises one or more of
• an estimate of a noise portion of the audio input signal or of a spectral domain representation (305) thereof,
• an information about a noise portion reduced from the enhanced audio signal (102, 202, 302, 502, 602, 606) compared to the input audio signal (101, 201, 301, 501, 601),
• an estimate of a residual noise portion of the enhanced audio signal (102, 202, 302, 502, 602, 606) or of a spectral domain representation (305) thereof,
• the mask information (303),
• an intermediate representation,
• an intermediate magnitude mask, and / or
• a decision information.

32. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 31 , wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to obtain an environmental information and / or a user preference information, and wherein the signal enhancement module is configured to adjust the application of the noise reduction to the input audio signal (101 , 201 , 301 , 501 , 601) or the processed version thereof to obtain an enhanced audio signal (102, 202, 302, 502, 602, 606) on the basis of the environmental information and / or the user preference information.

33. The audio encoder (100, 200, 500, 600) according to any of claims 4 to 32, wherein the signal enhancement module (110, 210, 300, 510, 610) is configured to provide the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

34. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the inactive phase encoder is configured to determine noise generation parameters to be included in the encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) based on the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

35. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the audio encoder is configured to determine a noise energy information and / or a noise shape information on the basis of the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606); and wherein the inactive phase encoder is configured to provide the encoded representation (107, 207, 402, 507, 607, 707) on the basis of the noise energy information and / or on the basis of the noise shape information.

36. The audio encoder (100, 200, 500, 600) according to claim 35, wherein the audio encoder is configured to determine a plurality of subband energies for the input audio signal (101 , 201 , 301 , 501 , 601), in order to determine the noise shape information.

37. The audio encoder (100, 200, 500, 600) according to claim 36, wherein the subbands are distributed non-uniformly.

38. The audio encoder (100, 200, 500, 600) according to any of claims 36 to 37, wherein the subbands are distributed according to a psychoacoustic model of a human.

39. The audio encoder (100, 200, 500, 600) according to any of claims 35 to 38, wherein the audio encoder is configured to determine a plurality of logarithmic spectral information decomposition, LSID, subband energies, in order to determine the noise shape information.

40. The audio encoder (100, 200, 500, 600) according to claim 39, wherein the audio encoder is configured to determine the logarithmic spectral information decomposition, LSID, subband energies according to [equation not reproduced in this text], wherein $|I_b|$ is a cardinality of $I_b$, with $I_b$ being a set of bin indices of the spectral domain representation (305) associated with subband index $b$; wherein $N$ is a noise estimate information; and wherein $k$ is a frequency index.

41. The audio encoder (100, 200, 500, 600) according to any of claims 35 to 40, wherein the audio encoder is configured to aggregate subband energies of the plurality of subband energies in order to determine the noise energy information and / or the noise shape information.

42. The audio encoder (100, 200, 500, 600) according to any of claims 35 to 41 , wherein the audio encoder is configured to normalize the plurality of subband energies in order to determine the noise shape information.
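Claims 36 to 42 describe computing subband energies over non-uniform bands, aggregating them into a noise energy information and normalizing them into a noise shape information. The following is a minimal sketch of one way this could look. Because the LSID equation of claim 40 is not reproduced in this text, the per-band mean power (division by the cardinality of the band) followed by a base-10 logarithm is an assumption, as are the function names and the choice of the mean as the aggregate.

```python
import math

def subband_log_energies(noise_bins, band_edges, eps=1e-12):
    """Log of the mean power per subband; band_edges holds bin indices (assumed form)."""
    energies = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = noise_bins[lo:hi]
        mean_power = sum(abs(x) ** 2 for x in band) / len(band)
        energies.append(math.log10(mean_power + eps))
    return energies

def energy_and_shape(log_energies):
    """Aggregate to one energy value and normalize the rest into a shape vector."""
    energy = sum(log_energies) / len(log_energies)
    shape = [e - energy for e in log_energies]
    return energy, shape

if __name__ == "__main__":
    noise = [1.0, 1.0, 10.0, 10.0, 10.0, 10.0]
    edges = [0, 2, 6]  # non-uniform bands: 2 bins, then 4 bins
    print(energy_and_shape(subband_log_energies(noise, edges)))
```

Separating a scalar energy from a zero-mean shape vector lets the two be quantized independently, as the later claims on the multi-stage vector quantizer describe.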

43. The audio encoder (100, 200, 500, 600) according to any of claims 35 to 42, wherein the inactive phase encoder comprises a multi-stage vector quantizer comprising a plurality of stages; and wherein the multi-stage vector quantizer is configured to obtain quantized representations of the noise energy information and / or of the noise shape information; and wherein the inactive phase encoder is configured to provide the encoded representation (107, 207, 402, 507, 607, 707) on the basis of the quantized representations of the noise energy information and / or of the noise shape information.

44. The audio encoder (100, 200, 500, 600) according to claim 43, wherein the inactive phase encoder is configured to obtain the quantized representation of the noise energy information using a first stage of the multi-stage vector quantizer; and wherein the inactive phase encoder is configured to obtain the quantized representation of the noise shape information using a second stage, which is subsequent to the first stage; and wherein the inactive phase encoder is configured to selectively adjust a number of bits allocated for the quantization of the noise shape information in the second stage.
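The two-stage arrangement of claim 44 (energy in the first stage, shape in the second stage with an adjustable bit budget) can be sketched with a nearest-neighbour codebook search. The codebooks, the function names `nearest` and `encode_sid`, and the convention that a budget of zero bits skips the shape are illustrative assumptions, not details taken from the claims.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared error)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def encode_sid(energy, shape, energy_cb, shape_cb, shape_bits):
    """Stage 1 quantizes the energy; stage 2 quantizes the shape with shape_bits bits."""
    idx_energy = nearest(energy_cb, [energy])
    if shape_bits == 0:
        return idx_energy, None  # assumed convention: no shape sent for this frame
    usable = shape_cb[: 2 ** shape_bits]  # fewer bits, fewer searchable entries
    return idx_energy, nearest(usable, shape)

if __name__ == "__main__":
    energy_cb = [[-2.0], [0.0], [2.0]]  # hypothetical toy codebooks
    shape_cb = [[0.0, 0.0], [-1.0, 1.0], [1.0, -1.0], [-0.5, 0.5]]
    print(encode_sid(1.2, [-0.9, 1.1], energy_cb, shape_cb, shape_bits=2))
```

Restricting the search to the first 2^bits entries is one simple way to vary the number of bits spent on the shape from frame to frame.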

45. The audio encoder (100, 200, 500, 600) according to any of claims 35 to 44, wherein the inactive phase encoder is configured to skip an encoding of the noise shape information or of the quantized representation of the noise shape information for a predetermined number of frames.

46. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the inactive phase encoder is configured to provide a Silence Insertion Descriptor in order to provide the inactive phase audio signal.

47. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the inactive phase encoder comprises a discontinuous transmission, DTX, encoder.

48. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the active phase encoder is configured to skip an encoding of frames in dependence on the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

49. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the inactive phase encoder is configured to selectively encode noise generation parameters in dependence on the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606).

50. The audio encoder according to claim 48 or 49, wherein the inactive phase encoder is configured to selectively encode noise generation parameters, in case the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) indicates that an amount of noise removed from the enhanced audio signal (102, 202, 302, 502, 602, 606) compared to the input audio signal (101 , 201 , 301 , 501 , 601) is above a threshold.

51. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the inactive phase encoder is configured to determine an information about
• a noise contribution included in the input audio signal (101, 201, 301, 501, 601) on the basis of the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606); and / or
• a noise contribution included in the enhanced audio signal (102, 202, 302, 502, 602, 606); and / or
• a start time and / or an end time of an interval in the audio signal, wherein a noise contribution fulfills a criterion;
wherein the inactive phase encoder is configured to provide the encoded representation (107, 207, 402, 507, 607, 707) on the basis of said information.

52. The audio encoder (100, 200, 500, 600) according to any of the preceding claims, wherein the audio encoder comprises a mixer (625), wherein the mixer is configured to obtain a mixed audio signal on the basis of the enhanced audio signal (102, 202, 302, 502, 602, 606) and a further noisy signal in dependence on at least one of
• the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606),
• the input audio signal (101, 201, 301, 501, 601) or the processed version thereof, and / or
• the signal activity information (103, 203, 401, 603, 703).

53. The audio encoder (100, 200, 500, 600) according to claim 52, wherein the mixer (625) is configured to obtain a mixed audio signal on the basis of a weighted mixing of the further noisy signal and of the enhanced audio signal (102, 202, 302, 502, 602, 606), wherein a weighting of the further noisy signal and of the enhanced audio signal in the mixed audio signal is controlled in dependence on the signal activity information (103, 203, 401, 603, 703).
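The weighted mixing of claim 53 can be illustrated with a short sketch. It assumes that the signal activity information is available as a value in [0, 1] and that a higher activity favours the enhanced signal; the direction of that mapping and the function name `mix` are assumptions made only for illustration.

```python
def mix(enhanced, noisy, activity):
    """Blend two frames sample by sample; activity in [0, 1] weights the enhanced signal."""
    w = min(max(activity, 0.0), 1.0)  # clamp the weight to the valid range
    return [w * e + (1.0 - w) * n for e, n in zip(enhanced, noisy)]

if __name__ == "__main__":
    print(mix([1.0, 1.0], [0.0, 2.0], 0.5))
```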

54. The audio encoder (100, 200, 500, 600) according to any of claims 52 to 53, wherein the mixer (625) is configured to provide the mixed audio signal to the active phase audio encoder (130, 230, 530, 630) or to the inactive phase audio encoder (140, 240, 540, 640).

55. The audio encoder (100, 200, 500, 600) according to any of the preceding claims , wherein the audio encoder is configured to provide the encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) on the basis of the signal activity information (103, 203, 401 , 603, 703).

56. An audio decoder (400, 700) for decoding an encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) of an input audio signal (101, 201, 301, 501, 601), wherein the audio decoder is configured to selectively switch between an active phase operating mode and an inactive phase operating mode as a selected mode, on the basis of a signal activity information (103, 203, 401, 603, 703); wherein the audio decoder comprises an active phase audio decoder (430, 730), configured to provide an active phase audio signal on the basis of the encoded representation, in case of the selected mode being the active phase operating mode; wherein the audio decoder comprises an inactive phase audio decoder (440, 740), configured to provide a background audio signal on the basis of the encoded representation, in case of the selected mode being the inactive phase operating mode; wherein the audio decoder is configured to obtain a mix control signal (403, 703), indicating a weighting of an active phase audio signal and of a background audio signal or of a scaled version thereof for a mixed audio signal (407, 707); and wherein the inactive phase audio decoder (440, 740) is configured to provide a background audio signal or a scaled version thereof on the basis of the encoded representation, when the audio decoder operates in the active phase operating mode; and wherein the audio decoder comprises a mixer (420, 720), configured to mix, in the active phase operating mode, an active phase audio signal, provided using the active phase decoder in the active phase operating mode and a background audio signal or a scaled version thereof, provided using the inactive phase audio decoder in the active phase operating mode, according to the weighting indicated by the mix control signal (403, 703), in order to obtain the mixed audio signal (407, 707).

57. The audio decoder (400, 700) according to claim 56, wherein the inactive phase audio decoder (440, 740) is configured to obtain, when operating in the inactive phase operating mode, a set of parameters on the basis of the encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) for providing the background audio signal; and wherein the inactive phase audio decoder (440, 740) is configured to provide, when operating in the active phase operating mode, a background audio signal or a scaled version thereof, using the set of parameters obtained when operating in the inactive phase operating mode.

58. The audio decoder (400, 700) according to any of claims 56 or 57, wherein the active phase audio decoder (430, 730) and the inactive phase audio decoder (440, 740) are configured to exchange a parameter information (709).

59. The audio decoder (400, 700) according to any of claims 56 to 58, wherein the inactive phase audio decoder (440, 740) is configured to obtain, when operating in the inactive phase operating mode, a parameter information for providing the background audio signal, based on the active phase audio signal provided by the active phase audio decoder when operating in the active phase operating mode.

60. The audio decoder (400, 700) according to any of claims 56 to 59, wherein the audio decoder is configured to cross-fade a transition from the inactive phase operating mode to the active phase operating mode using the mixer (420, 720).
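The cross-fade of claim 60 can be sketched as a per-sample ramp applied by the mixer over one transition frame. The linear ramp, the one-frame duration, and the function name `cross_fade` are assumptions; the claim does not specify the fade shape or length.

```python
def cross_fade(background, active):
    """Linear fade from the background frame to the active frame over one frame."""
    n = len(active)
    out = []
    for i, (b, a) in enumerate(zip(background, active)):
        g = (i + 1) / n  # gain of the active signal ramps up to 1 at the last sample
        out.append((1.0 - g) * b + g * a)
    return out

if __name__ == "__main__":
    print(cross_fade([1.0] * 4, [0.0] * 4))
```

The last sample of the frame carries only the active phase signal, so the following frames can be output without further mixing if no background is requested.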

61. The audio decoder (400, 700) according to any of claims 56 to 60, wherein the audio decoder is configured to select between providing, as an output signal, silence in the inactive phase operating mode; providing, as an output signal, an attenuated version of the background audio signal in the inactive phase operating mode; providing, as an output signal, a background audio signal in the inactive phase operating mode; on the basis of the selected mode and / or on the basis of the mix control signal (403, 703).

62. The audio decoder (400, 700) according to any of claims 56 to 61, wherein the audio decoder is configured to select between providing, as an output signal, a mixed audio signal (407, 707) comprising a combination of an attenuated version of a background audio signal and the active phase audio signal, in the active phase operating mode; providing, as an output signal, a mixed audio signal (407, 707) comprising a combination of a background audio signal and the active phase audio signal, in the active phase operating mode; and providing, as an output signal, an active phase audio signal in the active phase operating mode, on the basis of the selected mode and / or on the basis of the mix control signal (403, 703).

63. The audio decoder (400, 700) according to any of claims 56 to 62, wherein the audio decoder is configured to obtain a mix control signal (403, 703) on the basis of a user input, an application input, and / or an environmental information.

64. The audio decoder (400, 700) according to any of claims 56 to 63, wherein the inactive phase decoder comprises a discontinuous transmission, DTX, decoder; and / or wherein the inactive phase encoder comprises a Comfort Noise Generation, CNG, decoder.

65. The audio decoder (400, 700) according to any of claims 56 to 64, wherein the active phase decoder comprises a neural audio decoder.

66. A method for providing an encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) on the basis of an input audio signal (101, 201, 301, 501, 601), wherein the method comprises applying a noise reduction to the input audio signal, or to a processed version thereof, to obtain an enhanced audio signal (102, 202, 302, 502, 602, 606), wherein the method comprises providing the encoded representation on the basis of the enhanced audio signal, or a processed version thereof, wherein the method comprises obtaining a noise reduction information (104, 106, 204, 206, 304, 502, 604, 606) about the noise reduction applied to the input audio signal, or to the processed version thereof and providing the encoded representation on the basis of the noise reduction information (104, 106, 204, 206, 304, 502, 604, 606), and / or wherein the method comprises providing the encoded representation on the basis of the enhanced audio signal, or a processed version thereof.

67. A method for decoding an encoded representation (105, 107, 205, 207, 402, 505, 507, 605, 607, 705, 707) of an input audio signal (101, 201, 301, 501, 601), wherein the method comprises selectively switching between an active phase operating mode and an inactive phase operating mode as a selected mode, on the basis of a signal activity information (103, 203, 401, 603, 703); wherein the method comprises providing an active phase audio signal on the basis of the encoded representation, in case of the selected mode being the active phase operating mode; wherein the method comprises providing a background audio signal on the basis of the encoded representation, in case of the selected mode being the inactive phase operating mode; wherein the method comprises obtaining a mix control signal (403, 703), indicating a weighting of an active phase audio signal and of a background audio signal or of a scaled version thereof for a mixed audio signal (407, 707); and wherein the method comprises providing a background audio signal or a scaled version thereof on the basis of the encoded representation, when the selected mode is the active phase operating mode; and wherein the method comprises mixing, in the active phase operating mode, an active phase audio signal, provided in the active phase operating mode and a background audio signal or a scaled version thereof, provided in the active phase operating mode, according to the weighting indicated by the mix control signal (403, 703), in order to obtain the mixed audio signal (407, 707).

68. A computer program for performing the method according to any of claims 66 or 67, when the computer program runs on a computer.

69. Bitstream, having encoded therein an input audio signal (101, 201, 301, 501, 601) using the method according to claim 66.