Methods for processing audio signals, signal processing units, binaural renderers, audio encoders, and audio decoders.
By separately processing audio signals using the initial and late reverberation portions of a room impulse response and applying signal-dependent scaling, the method addresses the inaccuracies in conventional methods, achieving perceptually equivalent results with reduced computational complexity and improved sound quality in binaural rendering.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- FRAUNHOFER GESELLSCHAFT ZUR FORDERUNG DER ANGEWANDTEN FORSCHUNG EV
- Filing Date
- 2026-04-24
- Publication Date
- 2026-06-25
Abstract
Description
[Technical Field]
[0001] The present invention relates to the field of audio coding / decoding, and more particularly to spatial audio coding and spatial audio object coding, for example, to the field of 3D audio codec systems. Embodiments of the present invention relate to a signal processing unit, a binaural renderer, an audio encoder and an audio decoder for processing an audio signal according to a room impulse response.
[Background Art]
[0002] Spatial audio coding tools are well known in the art and are standardized, for example, in the MPEG Surround standard. Spatial audio coding starts from multiple original input channels, e.g., five or seven channels, which are identified by their placement in the playback setup, for example, as left channel, center channel, right channel, left surround channel, right surround channel, and low-frequency enhancement channel. A spatial audio encoder can derive one or more downmix channels from the original channels and can additionally derive parametric data relating to spatial cues, such as inter-channel level differences, inter-channel phase differences, inter-channel time differences, and inter-channel coherence values. The one or more downmix channels are sent, together with the parametric side information indicating the spatial cues, to a spatial audio decoder, which decodes the downmix channels and the associated parametric data to finally obtain output channels that are approximate versions of the original input channels. The arrangement of channels in the output setup can be fixed and may be, for example, a 5.1 format, a 7.1 format, etc.
[0003] Furthermore, spatial audio object coding tools are well known in the art and are standardized, for example, in the MPEG SAOC standard (SAOC = Spatial Audio Object Coding). In contrast to spatial audio coding, which starts from original channels, spatial audio object coding starts from audio objects that are not automatically dedicated to a particular rendering playback setup. Rather, the placement of the audio objects in the playback scene is flexible and can be set by the user, for example, by inputting certain rendering information into a spatial audio object coding decoder. Alternatively or additionally, rendering information may be transmitted as additional side information or metadata; the rendering information may include information on the position in the playback setup at which a given audio object is to be placed (e.g., over time). To obtain a certain data compression, a number of audio objects are coded using a SAOC encoder, which computes one or more transport channels from the input objects by downmixing the objects according to certain downmix information. In addition, the SAOC encoder computes parametric side information representing inter-object cues, such as object level differences (OLD) and object coherence values. As in the case of SAC (SAC = Spatial Audio Coding), the inter-object parametric data is calculated for each individual time / frequency tile. For a certain frame of the audio signal (e.g., 1024 or 2048 samples), multiple frequency bands (e.g., 24, 32, or 64 bands) are considered, so that parametric data is provided for each frame and each frequency band. For example, when an audio piece has 20 frames and each frame is subdivided into 32 frequency bands, the number of time / frequency tiles is 640.
[0004] In 3D audio systems, it is sometimes desirable to provide a spatial impression of an audio signal as if it were being heard in a specific room. In such situations, the room impulse response of the specific room is provided, for example, based on a measurement, and is used to process the audio signal when presenting it to the listener. In such a presentation, it may be desirable to process the direct sound and early reflections separately from the late reverberation.
[Summary of the Invention]
[Problem to Be Solved by the Invention]
[0005] The underlying objective of this invention is to provide an improved method for separately processing an audio signal using the initial portion and the late reverberation of a room impulse response, so that the result is as perceptually equivalent as possible to the result of convolving the audio signal with the complete impulse response.
[Means for Solving the Problem]
[0006] This objective is achieved by the method of claim 1, the signal processing unit of claim 19, the binaural renderer of claim 23, the audio encoder of claim 24, and the audio decoder of claim 25.
[0007] This invention is based on the inventors' finding that conventional methods have a problem: when an audio signal is processed using a room impulse response, the result of processing the audio signal separately with respect to the initial portion and the late reverberation deviates from the result obtained by convolution with the complete impulse response. This invention is further based on the inventors' finding that the appropriate level of reverberation depends on both the input audio signal and the impulse response, because the influence of the input audio signal on the reverberation is not fully preserved when, for example, a synthetic reverberation method is used. The influence of the impulse response can be taken into account by using known reverberation characteristics as input parameters. The influence of the input signal can be taken into account by a signal-dependent scaling that adapts the level of reverberation based on the input audio signal. This method has been found to better match the perceived level of reverberation obtained when a complete convolution is used for binaural rendering.
[0008] (1) The present invention provides a method for processing an audio signal according to a room impulse response, the method including:
• processing the audio signal separately using the initial portion and the late reverberation of the room impulse response, wherein processing the late reverberation includes generating a scaled reverberation signal, the scaling depending on the audio signal; and
• combining the audio signal processed using the initial portion of the room impulse response with the scaled reverberation signal.
Compared to the conventional methods described above, the method of the present invention is advantageous because it allows the late reverberation to be scaled without computing a complete convolution result and without applying an extensive and inaccurate auditory model. Embodiments of the method provide a simple way of scaling artificial late reverberation so that it sounds like the reverberation obtained with a complete convolution. The scaling is based on the input signal and requires neither an additional auditory model nor a target reverberation loudness. The scaling factor can be derived in the time-frequency domain, which is a further advantage because audio material in the encoder / decoder chain is often available in that domain as well.
[0009] (2) According to the embodiment, the scaling may depend on the state of one or more input channels of the audio signal (e.g., the number of input channels, the number of active input channels and / or the activity in the input channels). This is advantageous because the scaling can be easily determined from the input audio signal with reduced computational overhead. For example, the scaling can be determined simply from the number of channels of the original audio signal that are downmixed into the currently considered downmix channel, which contains a reduced number of channels compared to the original audio signal. Alternatively, the number of active channels (channels that exhibit some activity in the current audio frame) that are downmixed into the currently considered downmix channel can form the basis for scaling the reverberation signal.
[0010] (3) According to the embodiment, scaling (as an addition or alternative to input channel states) depends on a predefined or calculated correlation measure of the audio signals. Using a predefined correlation measure is advantageous because it reduces computational complexity in the process. A predefined correlation measure can have a fixed value, for example, within the range of 0.1 to 0.9, and can be empirically determined based on the analysis of multiple audio signals. On the other hand, if it is desirable to obtain a more accurate measure for each audio signal currently being processed, it is advantageous to compute the correlation measure, despite the additional computational resources required.
[0011] (4) According to the embodiment, generating a scaled reverberation signal includes applying a gain factor, which is determined based on the state of one or more input channels of the audio signal and / or based on a predefined or calculated correlation measure for the audio signal, and the gain factor may be applied before, during, or after processing the late reverberation of the audio signal. This is advantageous because the gain factor can be easily calculated from the parameters above and can be applied flexibly relative to the reverberator in the processing chain, depending on the implementation details.
[0012] (5) According to the embodiment, the gain factor is determined as follows:
$$ g = c_u + \rho \cdot (c_c - c_u) $$
where
$\rho$ = a predefined or calculated correlation measure for the audio signal, and
$c_u$, $c_c$ = factors indicating the state of one or more input channels of the audio signal, with $c_u$ referring to totally uncorrelated channels and $c_c$ referring to totally correlated channels.
This is advantageous because the factor scales over time with the number of active channels in the audio signal.
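The interpolation in the gain formula can be illustrated with a minimal sketch. The function name `reverb_gain` and the numeric values for the two factors are hypothetical and chosen only for illustration; they are not taken from the embodiments.

```python
def reverb_gain(rho: float, c_u: float, c_c: float) -> float:
    """Gain factor g = c_u + rho * (c_c - c_u).

    rho : correlation measure of the audio signal (predefined or calculated)
    c_u : factor for totally uncorrelated channels
    c_c : factor for totally correlated channels
    """
    return c_u + rho * (c_c - c_u)


# Hypothetical factors, for illustration only.
C_U, C_C = 2.0, 4.0

# rho = 0 selects the uncorrelated factor, rho = 1 the correlated factor,
# and values in between interpolate linearly.
for rho in (0.0, 0.5, 1.0):
    print(rho, reverb_gain(rho, C_U, C_C))
```

With these example factors, the gain moves from 2.0 at $\rho = 0$ to 4.0 at $\rho = 1$, so a more strongly correlated input results in a higher reverberation level.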
[0013] (6) According to the embodiment, $c_u$ and $c_c$ are determined as follows.
[Equations for $c_u$ and $c_c$: images not reproduced in this text]
[0014] (7) According to an embodiment, the gain factor is low-pass filtered over a plurality of audio frames, and the gain factor can be low-pass filtered as follows.
[Equations for the low-pass filtering of the gain factor: images not reproduced in this text]
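Smoothing a per-frame gain over several audio frames can be sketched as follows. Because the filter equations are not reproduced in this text, the sketch assumes a generic first-order recursive low-pass filter whose coefficient is derived from a time constant and the frame rate; the function names, parameters, and the coefficient formula are assumptions and may differ from the embodiment.

```python
import math


def smoothing_coefficients(time_constant_s: float, frame_rate_hz: float):
    """Return (c_old, c_new) for a first-order recursive smoother (assumed form)."""
    c_old = math.exp(-1.0 / (time_constant_s * frame_rate_hz))
    return c_old, 1.0 - c_old


def smooth_gains(gains, c_old: float, c_new: float, initial=None):
    """Low-pass filter a sequence of per-frame gain factors.

    Each output is c_old * previous_output + c_new * current_gain.
    The filter state starts at `initial`, or at the first gain if omitted.
    """
    if not gains:
        return []
    previous = gains[0] if initial is None else initial
    smoothed = []
    for g in gains:
        previous = c_old * previous + c_new * g
        smoothed.append(previous)
    return smoothed


# A step in the raw gain is spread over several frames.
print(smooth_gains([0.0, 0.0, 1.0, 1.0, 1.0], c_old=0.5, c_new=0.5, initial=0.0))
```

A longer time constant moves `c_old` closer to 1, so the gain reacts more slowly to frame-to-frame changes in the input.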
[0015] (8) According to the embodiment, generating a scaled reverberation signal includes a correlation analysis of the audio signal. The correlation analysis may include determining a composite correlation measure for an audio frame of the audio signal. The composite correlation measure may be calculated by combining the correlation coefficients for multiple channel combinations of one audio frame, each audio frame including one or more time slots, and combining the correlation coefficients may include averaging multiple correlation coefficients of the audio frame. This is advantageous because the correlation can be described by a single value for the overall correlation of one audio frame; there is no need to handle multiple frequency-dependent values.
[0016] (9) According to one embodiment, determining a composite correlation measure may include (i) calculating an overall mean value for each channel of a single audio frame, (ii) calculating a zero-mean audio frame by subtracting the mean value from the corresponding channel, (iii) calculating correlation coefficients for multiple channel combinations, and (iv) calculating the composite correlation measure as the average of the multiple correlation coefficients. As mentioned above, this is advantageous because only one overall correlation value is calculated for each frame (a simple process), and the calculation can be done in the same way as for the "standard" Pearson correlation coefficient, which also uses the zero-mean signals and their standard deviations.
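Steps (i) to (iv) above can be sketched as follows. This is a minimal illustration under stated assumptions: the frame is a list of channels holding real-valued samples (time-frequency samples would generally be complex), all channel pairs are used as the channel combinations, and a pair containing a constant channel contributes 0. The function name is hypothetical.

```python
import math
from itertools import combinations


def composite_correlation(frame):
    """Average Pearson correlation over all channel pairs of one audio frame.

    frame: list of channels, each a list of real-valued samples of equal length.
    """
    # (i) + (ii): remove each channel's overall mean to get a zero-mean frame.
    zero_mean = []
    for channel in frame:
        mean = sum(channel) / len(channel)
        zero_mean.append([s - mean for s in channel])

    # (iii): correlation coefficient for every channel combination (pair).
    coefficients = []
    for a, b in combinations(zero_mean, 2):
        numerator = sum(x * y for x, y in zip(a, b))
        denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        # Assumption: a constant (zero-variance) channel yields coefficient 0.
        coefficients.append(numerator / denominator if denominator > 0.0 else 0.0)

    # (iv): one composite value per frame; 0 if there is only one channel.
    return sum(coefficients) / len(coefficients) if coefficients else 0.0


print(composite_correlation([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]))
```

Two identical channels yield a composite correlation of 1, and a channel paired with its mirrored copy yields -1; mixed material falls in between.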
[0017] (10) According to the embodiment, the correlation coefficient for a channel combination is determined as follows.
[Equation for the correlation coefficient: image not reproduced in this text]
[0018] (11) According to one embodiment, processing the late reverberation of an audio signal includes downmixing the audio signal and applying the downmixed audio signal to a reverberator. This is advantageous because the processing, for example in the reverberator, needs to handle fewer channels, and the downmix process can be directly controlled.
[0019] (12) The present invention provides a signal processing unit comprising: an input for receiving an audio signal; an initial part processor for processing the received audio signal according to the initial part of a room impulse response; a late reverberation processor for processing the received audio signal according to the late reverberation of a room impulse response, wherein the late reverberation processor is configured or programmed to generate a scaled reverberation signal dependent on the received audio signal; and an output for combining the audio signal processed using the initial part of the room impulse response and the scaled reverberation signal into an output audio signal.
[0020] (13) According to one embodiment, the late reverberation processor comprises a reverberator that receives an audio signal and generates a reverberation signal; a correlation analyzer that generates a gain factor dependent on the audio signal; and a gain stage coupled to the input or output of the reverberator and controlled by the gain factor provided by the correlation analyzer.
[0021] (14) According to one embodiment, the signal processing unit further comprises at least one of a low-pass filter coupled between the correlation analyzer and the gain stage, and a delay element coupled between the gain stage and an adder, the adder being further coupled to the initial part processor and the output.
[0022] (15) The present invention provides a binaural renderer equipped with the signal processing unit of the present invention.
[0023] (16) The present invention provides an audio encoder for coding an audio signal, comprising a signal processing unit of the present invention or a binaural renderer of the present invention for processing the audio signal before coding.
[0024] (17) The present invention provides an audio decoder for decoding an encoded audio signal, comprising a signal processing unit of the present invention or a binaural renderer of the present invention for processing the decoded audio signal. Embodiments of the present invention will be described with reference to the accompanying drawings.
[Brief Description of the Drawings]
[0025]
[Figure 1] An overview of the 3D audio encoder of a 3D audio system.
[Figure 2] An overview of the 3D audio decoder of a 3D audio system.
[Figure 3] An example of implementing a format converter that can be implemented in the 3D audio decoder of Figure 2.
[Figure 4] One embodiment of a binaural renderer that can be implemented in the 3D audio decoder of Figure 2.
[Figure 5] An example of a room impulse response h(t).
[Figure 6(a)] One of the different possibilities for processing an audio input signal using a room impulse response: processing the entire audio signal according to the room impulse response.
[Figure 6(b)] Another of the different possibilities for processing an audio input signal using a room impulse response: separate processing of the initial portion and the late reverberation portion.
[Figure 7] A block diagram of a signal processing unit, such as a binaural renderer, operating according to the teachings of the present invention.
[Figure 8] A schematic view of the binaural processing of an audio signal in a binaural renderer according to one embodiment of the present invention.
[Figure 9] A schematic view of the processing in the frequency-domain reverberator of the binaural renderer of Figure 8 according to one embodiment of the present invention.
[Modes for Carrying Out the Invention]
[0026] Next, embodiments of the method of the present invention will be described. The following description will begin with a system overview of a 3D audio codec system in which the method of the present invention may be implemented.
[0027] Figures 1 and 2 show algorithmic blocks of a 3D audio system according to an embodiment. More specifically, Figure 1 shows an overview of the 3D audio encoder 100. The audio encoder 100 receives input signals at a pre-renderer / mixer circuit 102, which may be optionally provided. More specifically, it receives multiple input channels that provide the audio encoder 100 with multiple channel signals 104, multiple object signals 106, and corresponding object metadata 108. The object signals 106 processed by the pre-renderer / mixer 102 (see signal 110) may be provided to the SAOC encoder 112 (SAOC = Spatial Audio Object Coding). The SAOC encoder 112 generates SAOC transport channels 114 that are provided to the USAC encoder 116 (USAC = Unified Speech and Audio Coding). Furthermore, a signal SAOC-SI 118 (SAOC-SI = SAOC Side Information) is also provided to the USAC encoder 116. The USAC encoder 116 further receives object signals 120 directly from the pre-renderer / mixer, as well as channel signals and pre-rendered object signals 122. Object metadata information 108 is applied to the OAM encoder 124 (OAM = Object Metadata), which provides compressed object metadata information 126 to the USAC encoder. Based on the input signals described above, the USAC encoder 116 generates a compressed output signal mp4, as shown at 128.
[0028] Figure 2 shows an overview of the 3D audio decoder 200 of the 3D audio system. The encoded signal 128 (mp4) generated by the audio encoder 100 in Figure 1 is received by the audio decoder 200, more specifically by the USAC decoder 202. The USAC decoder 202 decodes the received signal 128 into a channel signal 204, a pre-rendered object signal 206, an object signal 208, and a SAOC transport channel signal 210. Furthermore, compressed object metadata information 212 and the signal SAOC-SI 214 are output by the USAC decoder 202. The object signal 208 is provided to the object renderer 216, which outputs the rendered object signal 218. The SAOC transport channel signal 210 is supplied to the SAOC decoder 220, which outputs the rendered object signal 222. The compressed object metadata 212 is supplied to the OAM decoder 224, which outputs control signals to the object renderer 216 and the SAOC decoder 220, respectively, to generate the rendered object signals 218 and 222. The decoder further comprises a mixer 226 that receives the input signals 204, 206, 218 and 222 and outputs a channel signal 228, as shown in Figure 2. The channel signal may be output directly to a loudspeaker, for example, a 32-channel loudspeaker, as shown at 230. The signal 228 may also be provided to a format conversion circuit 232 that receives, as a control input, a reproduction layout signal indicating how the channel signal 228 should be converted. In the embodiment shown in Figure 2, it is assumed that the conversion is performed in such a manner that the signal can be provided to a speaker system, as shown at 234. Furthermore, the channel signal 228 may be provided to the binaural renderer 236, which generates two output signals, for example, for headphones, as shown at 238.
[0029] In one embodiment of the present invention, the encoding / decoding system shown in Figures 1 and 2 is based on the MPEG-D USAC codec for coding channel signals and object signals (see signals 104 and 106). To increase efficiency when coding a large number of objects, MPEG SAOC technology may be used. Three types of renderers may perform the tasks of rendering objects to channels, rendering channels to headphones, or rendering channels to different loudspeaker setups (see Figure 2, reference numerals 230, 234, and 238). When object signals are explicitly transmitted or parametrically encoded using SAOC, the corresponding object metadata information 108 is compressed (see signal 126) and multiplexed into the 3D audio bitstream 128. The algorithmic blocks of the entire 3D audio system shown in Figures 1 and 2 will be described in more detail below.
[0030] A pre-renderer / mixer 102 may be optionally provided to convert the channel + object input scene into a channel scene before encoding. Functionally, the pre-renderer / mixer 102 is equivalent to the object renderer / mixer described below. Object pre-rendering may be desired to ensure deterministic signal entropy at the encoder input, which is essentially independent of the number of simultaneously active object signals. Object pre-rendering does not require the transmission of object metadata. Discrete object signals are rendered into a channel layout configured for use by the encoder. Object weights for each channel are obtained from the associated object metadata (OAM).
[0031] The USAC encoder 116 is a core codec for loudspeaker channel signals, discrete object signals, object downmix signals, and pre-rendered signals. The USAC encoder 116 is based on MPEG-D USAC technology. The USAC encoder 116 handles the coding of the above signals by creating channel and object mapping information based on geometric and semantic information of input channels and object assignments. This mapping information describes how input channels and objects are mapped to USAC channel elements such as channel pair elements (CPE), single channel elements (SCE), low-frequency effects (LFE), and quad-channel elements (QCE), and the CPE, SCE, and LFE, as well as the corresponding information, are sent to the decoder. All additional payloads, such as SAOC data 114, 118, or object metadata 126, are considered in the encoder's rate control. Object coding is possible in various ways depending on the rate / distortion and interactivity requirements for the renderer. According to embodiments, the following object coding variants are possible.
[0032] • Pre-rendered objects: Object signals are pre-rendered and mixed into a 22.2-channel signal before encoding. The subsequent coding chain sees only the 22.2-channel signal.
[0033] • Discrete object waveform: Objects are supplied to the encoder as monophonic waveforms. The encoder uses a single-channel element (SCE) to transmit the objects in addition to the channel signal. The decoded objects are rendered and mixed at the receiver. Compressed object metadata information is sent to the receiver / renderer.
[0034] • Parametric object waveforms: Object properties and their relationships to each other are described by SAOC parameters. The downmix of the object signals is coded using USAC. The parametric information is transmitted alongside it. The number of downmix channels is selected based on the number of objects and the overall data rate. Compressed object metadata information is sent to the SAOC renderer.
[0035] The SAOC encoder 112 and SAOC decoder 220 for object signals are based on MPEG SAOC technology. The system is capable of recreating, modifying, and rendering several audio objects based on a smaller number of transmitted channels and additional parametric data such as OLD, IOC (Inter-Object Coherence), and DMG (Downmix Gain). The additional parametric data exhibits a significantly lower data rate than required to transmit all objects individually, thereby making coding extremely efficient. The SAOC encoder 112 takes the object / channel signal as a monophonic waveform as input and outputs parameter information (packed into the 3D audio bitstream 128) and a SAOC transport channel (encoded and transmitted using a single channel element). The SAOC decoder 220 reconstructs the object / channel signal from the decoded SAOC transport channel 210 and parameter information 214 to generate an output audio scene based on the playback layout, restored object metadata information, and optionally user interaction information.
[0036] The object metadata codec (see OAM encoder 124 and OAM decoder 224) provides, for each object, relevant metadata specifying the geometric position and volume of the object in 3D space, which is efficiently coded by quantization of object properties in time and space. The compressed object metadata cOAM 126 is transmitted to the receiver 200 as side information.
[0037] The object renderer 216 utilizes compressed object metadata to generate object waveforms according to a given playback format. Each object is rendered to a certain output channel according to its metadata. The output of this block is the sum of the partial results. If both channel-based content and discrete / parametric objects are decoded, the channel-based waveform and the rendered object waveform are mixed by the mixer 226, and the resulting waveform 228 is then output, or the resulting waveform 228 is fed to a post-processor module such as the binaural renderer 236 or the loudspeaker renderer module 232.
[0038] The binaural renderer module 236 generates a binaural downmix of multi-channel audio material so that each input channel is represented by a virtual sound source. Processing is performed frame by frame in the QMF (Quadrature Mirror Filterbank) domain, and binauralization is based on the measured binaural room impulse response.
[0039] The loudspeaker renderer 232 converts between the transmitted channel configuration 228 and the desired playback format. It is sometimes called a "format converter." The format converter performs a conversion to a lower number of output channels, i.e., it results in a downmix.
[0040] Figure 3 shows an example of implementing a format converter 232. Also called a loudspeaker renderer, the format converter 232 converts between the transmitted channel configuration and the desired playback format. The format converter 232 performs the conversion to a lower number of output channels, i.e., it performs the downmix (DMX) process 240. Preferably operating in the QMF domain, the downmixer 240 receives the mixer output signal 228 and outputs the loudspeaker signal 234. A configurator 242, also called a controller, may be provided, which receives, as control inputs, a signal 246 indicating the mixer output layout, i.e., the layout for which the data represented by the mixer output signal 228 is determined, and a signal 248 indicating the desired playback layout. Based on this information, the controller 242 generates, preferably automatically, optimized downmix matrices for the given combination of input and output formats and applies these matrices in the downmixer 240. The format converter 232 supports both standard loudspeaker configurations and random configurations with non-standard loudspeaker positions.
[0041] Figure 4 shows one embodiment of the binaural renderer 236 of Figure 2. The binaural renderer module can provide a binaural downmix of multi-channel audio material. Binauralization can be based on a measured binaural room impulse response. The room impulse response can be considered a "fingerprint" of the acoustic properties of a real room. The room impulse response can be measured and stored, and this "fingerprint" can be applied to any acoustic signal, thereby simulating for the listener the acoustic properties of the room associated with the room impulse response. The binaural renderer 236 can be configured or programmed to render the output channels into two binaural channels using head-related transfer functions or binaural room impulse responses (BRIR). For example, in mobile devices, binaural rendering is desirable for headphones or loudspeakers attached to such devices. In such mobile devices, constraints may make it necessary to limit the complexity of the decoder and the rendering. In such processing scenarios, in addition to omitting uncorrelated signals, it may be preferable to first perform a downmix using the downmixer 250 to an intermediate downmix signal 252, i.e., to a lower number of output channels, thereby obtaining a lower number of input channels for the actual binaural converter 254. For example, 22.2-channel material may be downmixed to a 5.1 intermediate downmix by the downmixer 250, or alternatively, the intermediate downmix may be calculated directly by the SAOC decoder 220 in Figure 2 in a kind of "shortcut" mode. In that case, binaural rendering only requires applying 10 HRTF (Head-Related Transfer Function) or BRIR functions to render the 5 individual channels at different positions, as opposed to applying 44 HRTF or BRIR functions if the 22.2 input channels were rendered directly. The convolution operations required for binaural rendering demand a lot of processing power, and therefore reducing this processing power while still achieving acceptable audio quality is particularly useful for mobile devices. The binaural renderer 236 generates a binaural downmix 238 of the multi-channel audio material 228 such that each input channel (except the LFE channel) is represented by a virtual sound source. Processing may be performed frame by frame in the QMF domain. Binauralization is based on the measured binaural room impulse response; the direct sound and early reflections may be imprinted on the audio material by a convolution technique in a pseudo-FFT domain, using fast convolution on top of the QMF domain, while the late reverberation may be processed separately.
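The filter counts quoted above (10 for a 5.1 intermediate downmix, 44 for direct rendering of 22.2 material) follow from applying one filter per ear to every channel except the LFE channels. A minimal sketch of this arithmetic, with a hypothetical function name:

```python
def binaural_filter_count(total_channels: int, lfe_channels: int, ears: int = 2) -> int:
    """Number of HRTF/BRIR filters: one per ear for each non-LFE channel."""
    return ears * (total_channels - lfe_channels)


# 5.1 layout: 6 channels, 1 of them LFE.
print(binaural_filter_count(6, 1))
# 22.2 layout: 24 channels, 2 of them LFE.
print(binaural_filter_count(24, 2))
```

The intermediate downmix therefore reduces the number of convolutions by a factor of 4.4 in this example.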
[0042] Figure 5 shows an example of a room impulse response h(t) 300. The room impulse response includes three components: direct sound 301, early reflections 302, and late reverberation 304. In this way, the room impulse response describes the reflection behavior of an enclosed reverberant acoustic space when an impulse is emitted. The early reflections 302 are individual reflections with increasing density, and the portion of the impulse response where the individual reflections can no longer be distinguished is called late reverberation 304. The direct sound 301 can be easily identified in the room impulse response and separated from the early reflections, but the transition from the early reflections 302 to the late reverberation 304 is less obvious.
[0043] As explained above, in binaural renderers, such as the one shown in Figure 2, various techniques are known for processing multi-channel audio input signals according to the room impulse response.
[0044] Figure 6 illustrates different possibilities for processing an audio input signal using a room impulse response. Figure 6(a) shows processing of the entire audio signal according to the room impulse response, and Figure 6(b) shows separate processing of the initial portion and the late reverberation portion. As shown in Figure 6(a), an input signal 400, for example a multi-channel audio input signal, is received and applied to a processor 402, which is configured or programmed to perform a full convolution of the multi-channel audio input signal 400 with the room impulse response (see Figure 5), which in the illustrated embodiment produces a two-channel audio output signal 404. As mentioned above, this method is considered disadvantageous because convolution with the entire impulse response is computationally very costly. Therefore, according to an alternative method shown in Figure 6(b), instead of applying a full convolution with the room impulse response as described with respect to Figure 6(a), the processing is separated with respect to the initial parts 301, 302 (see Figure 5) of the room impulse response 300 and the late reverberation part 304. More specifically, as shown in Figure 6(b), the multi-channel audio input signal 400 is received and applied in parallel to a first processor 406 for processing the initial part, i.e., for processing the audio signal according to the direct sound 301 and early reflections 302 in the room impulse response 300 shown in Figure 5. The multi-channel audio input signal 400 is also applied to a processor 408 for processing the audio signal according to the late reverberation 304 of the room impulse response 300. In the embodiment shown in Figure 6(b), the multi-channel audio input signal may also be applied to a downmixer 410 that downmixes the multi-channel signal 400 into a signal with a lower number of channels. The output of the downmixer 410 is then applied to the processor 408. The outputs of processor 406 and processor 408 are combined at 412 to generate a two-channel audio output signal 404'.
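The relationship between Figure 6(a) and Figure 6(b) can be illustrated with a minimal single-channel sketch. It assumes that the late branch still uses the true tail of the impulse response (no synthetic reverberator and no downmix), in which case the two-branch result is identical to the full convolution. The function name `split_convolve` and the transition index `n_trans` are hypothetical and chosen only for illustration.

```python
import numpy as np

def split_convolve(x, h, n_trans):
    """Convolve x with h in two parallel branches split at sample n_trans."""
    early = np.convolve(x, h[:n_trans])   # direct sound and early reflections
    late = np.convolve(x, h[n_trans:])    # late reverberation tail
    y = np.zeros(len(x) + len(h) - 1)
    y[:len(early)] += early
    # The late branch is delayed by the transition time before it is added.
    y[n_trans:n_trans + len(late)] += late
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                                  # test signal
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)   # decaying toy impulse response
full = np.convolve(x, h)                                      # Figure 6(a): full convolution
split = split_convolve(x, h, n_trans=20)                      # Figure 6(b): separate branches
```

In this idealized case the saving comes only from what replaces the late branch; once the tail is synthesized from a downmix, the two results are no longer identical, which is the level mismatch discussed in the following paragraphs.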
[0045] In binaural renderers, as mentioned above, it is sometimes desirable to process direct sound and early reflections separately from late reverberation, primarily to reduce computational complexity. Direct sound and early reflections can be imprinted on the audio signal by a convolution performed, for example, by processor 406 (see Figure 6(b)), while the late reverberation can be replaced by a synthetic reverberation provided by processor 408. The overall binaural output signal 404' is then a combination of the convolution result provided by processor 406 and the synthetic reverberation signal provided by processor 408.
[0046] This process is also described in prior art literature [1]. The result of the method described above should be perceptually as close as possible to the result of a full convolution with the complete impulse response, i.e., the full-convolution approach described with respect to Figure 6(a). However, when an audio signal, or more generally audio material, is convolved with the direct sound and early reflection portions of the impulse response, the different resulting channels are summed to form an overall sound signal associated with the playback signal for one ear of the listener. The reverberation, however, is not calculated from this overall signal, but is generally a reverberated signal of one channel or of a downmix of the original input audio signal. The inventors of the present invention have therefore determined that the late reverberation does not adequately fit the convolution result provided by processor 406. It has been found that the appropriate level of reverberation depends on both the input audio signal and the room impulse response 300. The influence of the impulse response is taken into account by using reverberation characteristics as input parameters for a reverberator, which may be part of processor 408. These input parameters are obtained from an analysis of measured impulse responses, e.g., the frequency-dependent reverberation time and a frequency-dependent energy measure. These measures can generally be determined from a single impulse response, for example by calculating the energy and the RT60 reverberation time in an octave filter bank analysis, or they are mean values of the results of multiple impulse response analyses.
[0047] However, despite these input parameters provided to the reverberator, it has been found that the influence of the input audio signal on the reverberation is not sufficiently preserved when using synthetic reverberation techniques such as those described with respect to Figure 6(b). For example, the downmix used to generate the synthetic reverberation tail causes the influence of the input audio signal to be lost. The resulting level of reverberation is therefore not perceptually equivalent to the result of a full convolution technique, especially when the input signal contains multiple channels.
[0048] To date, there is no known method for comparing the amount of late reverberation with the result of the full convolution approach or for matching it to the convolution result. There are several techniques that attempt to rate the quality of late reverberation or how natural it sounds. For example, one method defines a loudness measure for natural-sounding reverberation, which uses a loudness model to predict the perceived loudness of the reverberation. This method is described in prior art literature [2], and the level can be fitted to a target value. The drawback of this method is that it relies on a model of human hearing that is complex and not exact. It also requires a target loudness in order to provide a scaling factor for the late reverberation, and that target could only be found using a full convolution result.
[0049] Another method, described in prior art literature [3], uses a cross-correlation criterion for testing artificial reverberation quality. However, this is only applicable to testing different reverberation algorithms, and is not applicable to multi-channel audio, binaural audio, or to qualify late reverberation scaling.
[0050] Another possible approach is to use the number of input channels arriving at the considered ear as a scaling factor, but this does not give perceptually correct scaling because the perceived amplitude of the overall sound signal depends on the correlation of the different audio channels and not just on the number of channels.
[0051] Accordingly, the present invention provides a signal-dependent scaling method for adapting the reverberation level according to an input audio signal. As described above, the perceived level of reverberation is desired to match the level of reverberation when using a full convolution technique for binaural rendering, and therefore, determining a measure for an appropriate level of reverberation is important for achieving good sound quality. According to one embodiment, the audio signal is processed separately using the initial portion and the late reverberation of the room impulse response, and processing the late reverberation includes generating a scaled reverberation signal, the scaling of which depends on the audio signal. The processed initial portion of the audio signal and the scaled reverberation signal are combined into an output signal. According to one embodiment, the scaling depends on the state of one or more input channels of the audio signal (e.g., the number of input channels, the number of active input channels and / or activity in the input channels). According to another embodiment, the scaling depends on a predefined or calculated correlation measure for the audio signal. An alternative embodiment may perform scaling based on a combination of the state of one or more input channels and a predefined or calculated correlation measure.
[0052] According to one embodiment, a scaled reverberation signal may be generated by applying a gain factor determined based on the state of one or more input channels of the audio signal, or based on a predefined or calculated correlation measure for the audio signal, or based on a combination thereof.
[0053] According to one embodiment, processing the audio signal separately includes processing the audio signal with the early reflection portions 301, 302 of the room impulse response 300 during a first process, and processing the audio signal with the diffuse reverberation 304 of the room impulse response 300 during a second process, which is separate from the first process. The changeover from the first process to the second process occurs at a transition time. According to a further embodiment, in the second process, the diffuse (late) reverberation 304 may be replaced with a synthetic reverberation. In this case, the room impulse response applied in the first process includes only the early reflection portions 301, 302 (see Figure 5) and does not include the late diffuse reverberation 304.
[0054] One embodiment of the method of the present invention, in which the gain factor is calculated based on a correlation analysis of the input audio signal, is described in more detail below. Figure 7 shows a block diagram of a signal processing unit, such as a binaural renderer, operating according to the teachings of the present invention. The binaural renderer 500 comprises a first branch including a processor 502 that receives an audio signal x[k] containing N channels from input 504. When the processor 502 is part of a binaural renderer, it processes the input signal 504 to produce an output signal 506, x_conv[k]. More specifically, the processor 502 convolves the audio input signal 504 with the direct sound and early reflections of the room impulse response, which may be provided to the processor 502 from an external database 508 holding multiple recorded binaural room impulse responses. The processor 502 may operate based on the binaural room impulse response provided by the database 508, as described above, thereby generating the output signal 506 having only two channels. The output signal 506 is provided from the processor 502 to the adder 510. The input signal 504 is further provided to a reverberation branch 512, which includes a reverberator 514 and a downmixer 516. The downmixed input signal is provided to the reverberator 514, which generates at its output a reverberation signal r[k], which may contain only two channels, based on reverberation parameters such as the reverberation time RT60 and the reverberation energy held in databases 518 and 520, respectively. The parameters stored in databases 518 and 520 can be derived from the stored binaural room impulse responses by an appropriate analysis 522, as shown by the dashed lines in Figure 7.
[0055] The reverberation branch 512 further includes a correlation analysis processor 524, which receives the input signal 504 and generates a gain factor g at its output. Furthermore, a gain stage 526 is coupled between the reverberator 514 and the adder 510. The gain stage 526 is controlled by the gain factor g and thereby generates, at its output, the scaled reverberation signal r_g[k], which is applied to the adder 510. The adder 510 combines the early processed portion and the reverberation signal to provide the output signal y[k], which also contains two channels. Optionally, the reverberation branch 512 may include a low-pass filter 528 coupled between the processor 524 and the gain stage 526 to smooth the gain factor over several audio frames. Optionally, a delay element 530 may also be provided between the output of the gain stage 526 and the adder 510 to delay the scaled reverberation signal so that it matches the transition between the early reflections and the reverberation in the room impulse response.
[0056] As explained above, Figure 7 is a block diagram of a binaural renderer that processes direct sound and early reflections separately from late reverberation. As can be seen, the input signal x[k], processed with the direct sound and early reflections of the binaural room impulse response, yields the signal x_conv[k]. This signal, as shown in the figure, is passed to the adder 510 to be added to the reverberation signal component r_g[k]. The component r_g[k] is generated by supplying a downmix of the input signal x[k], for example a stereo downmix, to the reverberator 514, followed by the multiplier or gain stage 526, which receives the reverberated downmix signal r[k] and the gain factor g. The gain factor g is obtained by a correlation analysis of the input signal x[k] performed by the processor 524 and can be smoothed over time by the low-pass filter 528 as described above. The scaled or weighted reverberation component can optionally be delayed by the delay element 530 so that its start coincides with the transition point from early reflections to late reverberation, and the output signal y[k] is thus obtained at the output of the adder 510.
[0057] The multichannel binaural renderer shown in Figure 7 introduces a synthetic two-channel late reverberation. To overcome the shortcomings of the conventional methods described above, according to the present invention the synthetic late reverberation is scaled by a gain factor g so that its perception matches the result of the full convolution approach. The superposition of multiple channels (e.g., up to 22.2) at the ears of the listener is correlation-dependent. For this reason, the late reverberation can be scaled according to the correlation of the input signal channels, and embodiments of the present invention provide a correlation-based, time-dependent scaling method for determining an appropriate amplitude of the late reverberation.
[0058] To calculate the scaling factor, a correlation measure based on the correlation coefficient is introduced, which, according to the embodiment, is defined in the two-dimensional time-frequency domain, e.g., the QMF domain. A correlation value between -1 and 1 is calculated for each multidimensional audio frame, and each audio frame is defined by the number of frequency bands N, the number of time slots per frame M, and the number of audio channels A. One scaling factor is obtained per frame and per ear.
[0059] An embodiment of the method of the present invention will be described in more detail below. First, a reference is made to the correlation measure used in the correlation analysis processor 524 in Figure 7. According to this embodiment, the correlation measure is based on Pearson's moment coefficient (also known as the correlation coefficient), which is calculated by dividing the covariance of two variables X and Y by the product of their standard deviations, as follows:
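$$\rho_{\{X,Y\}} = \frac{E\{(X - \mu_X)\,(Y - \mu_Y)\}}{\sigma_X\,\sigma_Y}$$
where E{·} denotes the expected value operator, μ_X and μ_Y are the means of X and Y, and σ_X and σ_Y are their standard deviations.
[remaining equation images of this paragraph not reproduced in this text]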
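As an illustration of the correlation coefficient described in paragraph [0059] (covariance divided by the product of the standard deviations), the following is a minimal sketch for two real-valued sequences. The function name `pearson` is hypothetical, and the per-frame evaluation over frequency bands, time slots, and channels used by the correlation analysis processor 524 is not modelled here.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equally long real sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                       # remove the means
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0.0:
        return 0.0                          # assumption: treat a constant sequence as uncorrelated
    return float(np.sum(xc * yc) / denom)   # covariance over product of standard deviations
```

The result lies between -1 and 1, matching the range stated for the correlation value per audio frame.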
[0060] According to the embodiment described above, scaling was determined based on a calculated correlation measure for the audio signals. This is advantageous, for example, when it is desirable to obtain a correlation measure for each audio signal currently being processed, despite the additional computational resources required.
[0061] However, the present invention is not limited to such methods. According to other embodiments, a predefined correlation measure may also be used instead of calculating the correlation measure. Using a predefined correlation measure is advantageous because it reduces the computational complexity in the process. The predefined correlation measure may have a fixed value, for example, between 0.1 and 0.9, which can be empirically determined based on the analysis of multiple audio signals. In such cases, the correlation analysis 524 may be omitted, and the gain of the gain stage may be set by an appropriate control signal.
[0062] According to other embodiments, scaling may depend on the state of one or more input channels of the audio signal (e.g., the number of input channels, the number of active input channels, and / or activity in the input channels). This is advantageous because the scaling can be easily determined from the input audio signal with reduced computational overhead. For example, the scaling may be determined simply by counting the channels of the original audio signal that are downmixed into the currently considered downmix channel, the downmix having a reduced number of channels compared to the original audio signal. Alternatively, the number of active channels (channels that exhibit some activity in the current audio frame) that are downmixed into the currently considered downmix channel may form the basis for scaling the reverberation signal. This may be done in block 524.
[0063] The following describes in detail embodiments for determining the scaling of a reverberation signal based on the state of one or more input channels of an audio signal and a correlation measure (which is fixed or calculated as described above). According to such embodiments, the gain factor or gain or scaling factor g is defined as follows:
[equation images defining the gain factor g and the downmix matrix Q not reproduced in this text]
[0064] For example, consider the following input signals:
• Channels 1, 3, and 4 are downmixed into downmix channel 1 (see matrix Q above).
• In frame n, the active channels are channels 1, 2, and 4. K_in is the number of channels in the intersection {1, 4}, so K_in(n) = 2.
• In frame n+1, the active channels are channels 1, 2, 3, and 4. K_in is the number of channels in the intersection {1, 3, 4}, so K_in(n+1) = 3.
An audio channel (within a predefined frame) may be considered active if it has an amplitude or energy exceeding a preset threshold within that frame. For example, according to one embodiment, activity in an audio channel (within a predefined frame) may be defined as follows:
• the sum or the maximum of the absolute amplitudes of the signal in the frame (in the time domain, QMF domain, etc.) is greater than 0, or
• the sum or the maximum of the signal energy (the squared absolute value of the amplitudes in the time domain or QMF domain) in the frame is greater than 0.
Instead of 0, another threshold greater than 0 (relative to the maximum energy or amplitude), such as a threshold of 0.01, can also be used.
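The counting rule in paragraph [0064] can be sketched as a set intersection. This is only an illustration: the helper names `active_channels` and `count_k_in` are hypothetical, channels are numbered from 1 as in the text, activity is judged by the maximum absolute amplitude in the frame, and the threshold is treated as an absolute value.

```python
import numpy as np

def active_channels(frame, threshold=0.0):
    """Return the 1-based indices of channels whose peak amplitude exceeds the threshold.

    frame: array of shape (channels, samples) for one audio frame.
    """
    peak = np.max(np.abs(frame), axis=1)
    return {ch + 1 for ch, p in enumerate(peak) if p > threshold}

def count_k_in(downmix_members, frame, threshold=0.0):
    """Number of active channels that are also mapped into the considered downmix channel."""
    return len(set(downmix_members) & active_channels(frame, threshold))

members = {1, 3, 4}                 # channels downmixed into downmix channel 1
frame_n = np.zeros((4, 8))
frame_n[[0, 1, 3], 0] = 0.5         # frame n: channels 1, 2 and 4 are active
frame_n1 = np.full((4, 8), 0.5)     # frame n+1: channels 1, 2, 3 and 4 are active
k_n = count_k_in(members, frame_n)
k_n1 = count_k_in(members, frame_n1)
```

With these toy frames, `k_n` and `k_n1` reproduce the values K_in(n) = 2 and K_in(n+1) = 3 from the example.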
[0065] According to the embodiment, a gain factor is provided for each ear that depends on K_in, i.e., the (time-varying) number of active channels or the fixed number of channels included in the downmix channel (downmix matrix entries not equal to 0). The factor is assumed to increase linearly between the totally uncorrelated case and the totally correlated case. Totally uncorrelated means that there are no inter-channel dependencies (the correlation value is 0), while totally correlated means that the signals are weighted versions of each other (possibly with a phase difference or offset; the correlation value is 1).
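The linear increase between the two cases described in paragraph [0065] can be sketched as a simple interpolation. The endpoint values used below are an assumption that is not stated in the text here: K_in equal-level uncorrelated channels add in power, giving an amplitude factor of sqrt(K_in), while K_in fully correlated channels add in amplitude, giving a factor of K_in. The function name `reverb_gain` is hypothetical, and the sketch is meant for correlation values between 0 and 1.

```python
import math

def reverb_gain(k_in, rho):
    """Interpolate linearly between the uncorrelated (rho = 0) and correlated (rho = 1) case."""
    c_u = math.sqrt(k_in)   # assumed endpoint: incoherent sum of k_in equal-level channels
    c_c = float(k_in)       # assumed endpoint: coherent sum of k_in equal-level channels
    return c_u + rho * (c_c - c_u)
```

For four contributing channels this gives a factor of 2 for uncorrelated material and 4 for fully correlated material, which illustrates why the channel count alone is not sufficient to set the reverberation level.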
[0066] As described above, the gain or scaling factor g can be smoothed across audio frames by the low-pass filter 528. The low-pass filter 528 may have a time constant t_s, which yields a smoothed gain factor g_s(t) for a frame size k as follows:
[equation images not reproduced in this text]
where
t_s = time constant of the low-pass filter in [s],
t_i = audio frame at frame t_i,
g_s = smoothed gain factor,
k = frame size, and
f_s = sampling frequency in [Hz].
The frame size k can be the size of the audio frame in time-domain samples, for example, 2048 samples. The left-channel reverberation signal of the audio frame x(t_i) is then scaled by the factor g_{s,left}(t_i), and the right-channel reverberation signal is scaled by the factor g_{s,right}(t_i). The scaling factor is calculated once with K_in being the number of channels (active non-zero or total) present in the left channel of the stereo downmix fed to the reverberator, which yields g_{s,left}(t_i). It is then calculated once more with K_in being the number of channels (active non-zero or total) present in the right channel of the stereo downmix fed to the reverberator, which yields g_{s,right}(t_i). The reverberator returns a stereo reverberated version of the audio frame. The left channel of the reverberated version (or the left channel of the input to the reverberator) is scaled by g_{s,left}(t_i), and the right channel of the reverberated version (or the right channel of the input to the reverberator) is scaled by g_{s,right}(t_i). The scaled artificial (synthetic) late reverberation is applied to the adder 510 to be added to the signal 506, which has been processed using the direct sound and early reflections. As described above, the method of the present invention can, according to embodiments, be used in a binaural processor for binaural processing of audio signals. One embodiment of binaural processing of audio signals is described below. Binaural processing can be carried out as a decoder process that converts the decoded signal into a binaural downmix signal that provides a surround sound experience when listened to through headphones.
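The frame-wise smoothing of the gain factor by the low-pass filter 528 in paragraph [0066] can be sketched with a first-order recursive filter. The recursion below is a common one-pole form and is an assumption; it only uses the quantities listed in the text (t_s, k, f_s). The function names and the example values t_s = 0.1 s and f_s = 48000 Hz are illustrative and not taken from the source.

```python
import math

def smoothing_coeffs(t_s, f_s, k):
    """Coefficients of a one-pole low-pass that is updated once per frame of k samples."""
    frames_per_time_constant = f_s * t_s / k
    c_old = math.exp(-1.0 / frames_per_time_constant)   # weight of the previous smoothed gain
    return c_old, 1.0 - c_old                           # weight of the new gain

def smooth_gains(gains, t_s=0.1, f_s=48000.0, k=2048):
    """Return the smoothed gain g_s(t_i) for each frame-wise gain g."""
    c_old, c_new = smoothing_coeffs(t_s, f_s, k)
    g_s = gains[0]                  # assumption: start from the first observed gain
    smoothed = []
    for g in gains:
        g_s = c_old * g_s + c_new * g
        smoothed.append(g_s)
    return smoothed

step = smooth_gains([1.0, 3.0, 3.0, 3.0])   # gain jumps from 1 to 3 at the second frame
```

The smoothed values approach the new gain gradually, which avoids audible level jumps when the number of active channels or the correlation changes from one frame to the next. In the renderer the same smoothing would be run separately for the left and right gain factors.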
[0067] Figure 8 shows a schematic diagram of a binaural renderer 800 for binaural processing of an audio signal according to one embodiment of the present invention. Figure 8 also provides an overview of the QMF-domain processing in the binaural renderer. At input 802, the binaural renderer 800 receives the audio signal to be processed, for example, an input signal comprising N channels and 64 QMF bands. Furthermore, the binaural renderer 800 receives several input parameters to control the processing of the audio signal. The input parameters are the binaural room impulse responses (BRIRs) 804 for 2 × N channels and 64 QMF bands, an indicator K_max 806 of the maximum band used for the convolution of the audio input signal with the early reflection portion of the BRIRs 804, and the reverberation parameters 808 and 810 (RT60 and reverberation energy) mentioned above. The binaural renderer 800 includes a fast convolution processor 812 for processing the input audio signal 802 with the early portion of the received BRIRs 804. The processor 812 generates an early processed signal 814 comprising two channels and K_max of the 64 QMF bands. In addition to the early processing branch having the fast convolution processor 812, the binaural renderer 800 also includes a reverberation branch including two reverberators 816a and 816b, each receiving the RT60 information 808 and the reverberation energy information 810 as input parameters. The reverberation branch further includes a stereo downmix processor 818 and a correlation analysis processor 820, both of which also receive the input audio signal 802. Furthermore, two gain stages 821a and 821b are provided between the stereo downmix processor 818 and the respective reverberators 816a and 816b to control the gain of the downmix signal 822 provided by the stereo downmix processor 818. Based on the input signal 802, the stereo downmix processor 818 provides the downmix signal 822 having two channels and 64 QMF bands. The gains of the gain stages 821a and 821b are controlled by respective control signals 824a and 824b provided by the correlation analysis processor 820. The gain-controlled downmix signals are input to the respective reverberators 816a and 816b, which generate respective reverberation signals 826a and 826b. The early processed signal 814 and the reverberation signals 826a and 826b are received by the mixer 828, which combines them into an output audio signal 830 having two channels and 64 QMF bands. Furthermore, according to the present invention, the fast convolution processor 812 and the reverberators 816a and 816b receive an additional input parameter 832 indicating the transition in the room impulse response 804 from the early part to the late reverberation, determined as described above.
[0068] The binaural renderer module 800 (e.g., binaural renderer 236 in Figure 2 or Figure 4) has the decoded data stream as input 802. The signal is processed by a QMF analysis filter bank as outlined in ISO / IEC 14496-3:2009, section 4.B.18.2, with the modifications stated in ISO / IEC 14496-3:2009, section 8.6.4.2. The renderer module 800 can also process QMF-domain input data, in which case the analysis filter bank may be omitted. The binaural room impulse responses (BRIRs) 804 are represented as complex QMF-domain filters. The conversion from time-domain binaural room impulse responses to the complex QMF filter representation is outlined in ISO / IEC FDIS 23003-1:2006, Annex B. In the complex QMF domain, the BRIRs 804 are limited to a certain number of time slots such that they include only the early reflection portions 301, 302 (see Figure 5) and do not include the late diffuse reverberation 304. The transition point 832 from early reflections to late reverberation is determined as described above, for example by an analysis of the BRIRs 804 in a preprocessing step of the binaural processing. The QMF-domain audio signal 802 and the QMF-domain BRIRs 804 are then processed by a bandwise fast convolution 812 to perform the binaural processing. QMF-domain reverberators 816a, 816b are used to generate the 2-channel QMF-domain late reverberation 826a, 826b. The reverberation modules 816a, 816b use a set of frequency-dependent reverberation times 808 and energy values 810 to adapt the characteristics of the reverberation. The waveform of the reverberation is based on a stereo downmix 818 of the audio input signal 802, and its amplitude is adaptively scaled 821a, 821b according to a correlation analysis 820 of the multi-channel audio signal 802. The 2-channel QMF-domain convolution result 814 and the 2-channel QMF-domain reverberation 826a, 826b are then combined 828, and finally two QMF synthesis filter banks compute the binaural time-domain output signal 830 as outlined in ISO / IEC 14496-3:2009, section 4.6.18.4.2. The renderer can also generate QMF-domain output data, in which case the synthesis filter bank is omitted.
[0069] Definitions. The audio signal 802 supplied to the binaural renderer module 800 is referred to below as the input signal. The audio signal 830, which is the result of the binaural processing, is referred to as the output signal. The input signal 802 of the binaural renderer module 800 is the audio output signal of the core decoder (see signal 228 in Figure 2, for example). The following variable definitions are used. [Table 1]
[0070] Processing. The processing of the input signal is explained next. The binaural renderer module operates on contiguous, non-overlapping frames of the input audio signal with a length of L = 2048 time-domain samples and outputs one frame of L samples for each processed input frame of length L.
[0071] (1) Initialization and preprocessing The binaural processing block is initialized before the audio samples supplied by the core decoder (see, for example, decoder 200 in Figure 2) are processed. Initialization consists of several processing steps.
[0072] (a) Reading of analysis values. The reverberator modules 816a and 816b take a frequency-dependent set of reverberation times 808 and energy values 810 as input parameters. These values are read from an interface during the initialization of the binaural processing module 800. In addition, the transition time 832 from early reflections to late reverberation, in time-domain samples, is read. The values can be stored in a binary file, written as 32-bit float values per sample in little-endian order. The values read that are required for processing are listed in the table below. [Table 2]
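Reading values stored as 32-bit little-endian floats, as described in paragraph [0072], can be sketched as follows. The function name `read_float32_le` is hypothetical, and the sketch simply returns a flat list of values; how the list splits into reverberation times, energy values, and the transition time depends on the layout given in Table 2, which is not reproduced here.

```python
import os
import struct
import tempfile

def read_float32_le(path):
    """Read a binary file as a flat sequence of 32-bit little-endian float values."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // 4                                   # four bytes per value
    return list(struct.unpack("<%df" % count, data[:4 * count]))

# Round trip with a small temporary file; the three values are placeholders only.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(struct.pack("<3f", 0.5, 1.0, 2.0))
values = read_float32_le(tmp.name)
os.remove(tmp.name)
```

The explicit `<` in the format string fixes the byte order, so the file is read the same way on big-endian and little-endian hosts.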
[0073] (b) Reading and preprocessing of BRIR The binaural room impulse response 804 is read from two dedicated files that individually store the left-ear BRIR and the right-ear BRIR. The time-domain samples of the BRIR are stored in an integer wave file with a resolution of 24 bits per sample and 32 channels. The ordering of the BRIRs in the file is as described in the following table. [Table 3]
[0074] If there is no BRIR measured at one of the loudspeaker positions, the corresponding channel in the wave file contains a value of 0. The LFE channel is not used for binaural processing.
[0075] As a preprocessing step, the given set of binaural room impulse responses (BRIRs) is converted from time-domain filters to complex-valued QMF-domain filters. The implementation of the given time-domain filters in the complex-valued QMF domain is performed according to ISO / IEC FDIS 23003-1:2006, Annex B. The prototype filter coefficients for the filter conversion are used according to ISO / IEC FDIS 23003-1:2006, Annex B, Table B.1. The time-domain representation [equation image] with 1 ≤ v ≤ L_trans is processed to obtain the complex-valued QMF-domain filter [equation image] with 1 ≤ n ≤ L_trans,n.
(2) Audio signal processing
The audio processing block of the binaural renderer module 800 obtains the time-domain audio samples 802 for N_in input channels from the core decoder and generates a binaural output signal 830 consisting of N_out = 2 channels. The processing takes the following as input:
• the decoded audio data 802 from the core decoder,
• the complex QMF-domain representation of the early reflection portion of the BRIR set 804, and
• the frequency-dependent parameter sets 808, 810, and 832 used by the QMF-domain reverberators 816a and 816b to generate the late reverberation 826a and 826b.
[0076] (a) QMF analysis of the audio signal. As the first processing step, the binaural renderer module converts L = 2048 time-domain samples of the N_in-channel time-domain input signal [equation image] (coming from the core decoder) into an N_in-channel QMF-domain signal representation 802 of dimension L_n = 32 QMF time slots (slot index n) and K = 64 frequency bands (band index k). A QMF analysis as outlined in ISO / IEC 14496-3:2009, section 4.B.18.2, with the modifications stated in ISO / IEC 14496-3:2009, section 8.6.4.2, is performed on a frame of the time-domain signal [equation image] with 1 ≤ v ≤ L to obtain a frame of the QMF-domain signal [equation image] with 1 ≤ n ≤ L_n.
[0077] (b) Fast convolution of the QMF-domain audio signal and the QMF-domain BRIRs. Next, a bandwise fast convolution 812 is carried out to process the QMF-domain audio signal 802 and the QMF-domain BRIRs 804. An FFT analysis may be carried out for each QMF frequency band k for each channel of the input signal 802 and each BRIR 804. Because the QMF-domain values are complex, one FFT analysis is carried out on the real part of the QMF-domain signal representation and another on the imaginary part. The results are then combined to form the final bandwise complex-valued pseudo-FFT-domain signal [equation image] and the bandwise complex-valued BRIRs [equation image] for the left ear and [equation image] for the right ear. The length of the FFT transform is determined by the length L_trans,n of the complex-valued QMF-domain BRIR filters and the frame length L_n in QMF-domain time slots, such that L_FFT = L_trans,n + L_n − 1. The complex-valued pseudo-FFT-domain signals are then multiplied with the complex-valued pseudo-FFT-domain BRIR filters to form the fast convolution results. A vector m_conv is used to signal which channel of the input signal corresponds to which BRIR pair in the BRIR data set. This multiplication is done bandwise for all QMF frequency bands k with 1 ≤ k ≤ K_max. The maximum band K_max is determined by the QMF band representing a frequency of either 18 kHz or the maximum signal frequency present in the audio signal from the core decoder: f_max = min(f_max,decoder, 18 kHz). The multiplication results of each audio input channel with each BRIR pair are summed in each QMF frequency band k with 1 ≤ k ≤ K_max, resulting in an intermediate 2-channel, K_max-band pseudo-FFT-domain signal. [equation image] and [equation image] are the pseudo-FFT convolution results [equation image] in the QMF-domain frequency band k. Next, a bandwise FFT synthesis is carried out to transform the convolution result back to the QMF domain, resulting in an intermediate 2-channel, K_max-band QMF-domain signal [equation image] with L_FFT time slots, where 1 ≤ n ≤ L_FFT and 1 ≤ k ≤ K_max. For each QMF-domain input signal frame with L_n = 32 time slots, a convolution result signal frame with L_n = 32 time slots is returned. The remaining L_FFT − 32 time slots are stored, and overlap-add processing is carried out in the following frame(s).
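The bandwise fast convolution with overlap-add described in paragraph [0077] can be sketched for a single QMF band, a single input channel, and a single ear. The function name `band_fast_convolve` and the array layout are assumptions made for illustration; the summation over input channels, the mapping vector m_conv, and the loop over the bands 1 to K_max are omitted.

```python
import numpy as np

def band_fast_convolve(frames, h_band):
    """FFT-based convolution of one QMF band with its BRIR taps, frame by frame.

    frames: complex array (num_frames, L_n), the QMF time slots of each frame.
    h_band: complex array (L_trans_n,), the QMF-domain BRIR taps of this band.
    Returns a complex array (num_frames, L_n).
    """
    num_frames, l_n = frames.shape
    l_fft = len(h_band) + l_n - 1                  # L_FFT = L_trans,n + L_n - 1
    h_fft = np.fft.fft(h_band, l_fft)
    tail = np.zeros(l_fft - l_n, dtype=complex)    # slots carried over from earlier frames
    out = np.empty((num_frames, l_n), dtype=complex)
    for i, frame in enumerate(frames):
        # One FFT for the real part and one for the imaginary part, then recombined.
        x_fft = np.fft.fft(frame.real, l_fft) + 1j * np.fft.fft(frame.imag, l_fft)
        y = np.fft.ifft(x_fft * h_fft)             # L_FFT output slots for this frame
        y[:len(tail)] += tail                      # overlap-add of the stored slots
        out[i] = y[:l_n]                           # L_n slots are returned per frame
        tail = y[l_n:]                             # remaining slots are stored
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 32)) + 1j * rng.standard_normal((3, 32))
h_band = rng.standard_normal(40) + 1j * rng.standard_normal(40)
result = band_fast_convolve(frames, h_band)
reference = np.convolve(frames.reshape(-1), h_band)[:3 * 32]   # direct convolution
```

Because the FFT is linear, transforming the real and imaginary parts separately and recombining them gives the same spectrum as one complex FFT. The toy filter is deliberately longer than one frame, so the stored tail spans more than one following frame.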
[0078] (c) Generation of late reverberation. As second intermediate signals 826a and 826b, a reverberation signal [equation image] is generated by the frequency-domain reverberator modules 816a and 816b. The frequency-domain reverberators 816a and 816b take the following as input:
• a QMF-domain stereo downmix 822 of one frame of the input signal,
• a parameter set containing the frequency-dependent reverberation times 808 and energy values 810.
The frequency-domain reverberators 816a and 816b return a 2-channel QMF-domain late reverberation tail. The maximum number of bands used in the frequency-dependent parameter set is calculated from the maximum frequency.
[0079] First, a QMF-domain stereo downmix 818 of one frame of the input signal [equation image] is carried out to form the reverberator input by a weighted summation of the input signal channels. The weighting gains are contained in the downmix matrix M_DMX. They are real-valued and non-negative, and the downmix matrix has dimension N_out × N_in. It contains a non-zero value wherever an input signal channel is mapped to one of the two output channels.
[0080] The channels representing loudspeaker positions in the left hemisphere are mapped to the left output channel, and the channels representing loudspeakers in the right hemisphere are mapped to the right output channel. The signals of these channels are weighted by a coefficient of 1. The channels representing loudspeakers in the median plane are mapped to both output channels of the binaural signal. The input signals of these channels are weighted by the following coefficient: [equation image TIFF2026105125000036]
[0081] Furthermore, an energy equalization step is performed in the downmix. It adapts the band-wise energy of one downmix channel so that it equals the sum of the band-wise energies of the input signal channels contained in that downmix channel. This energy equalization is performed by a band-wise multiplication with the following real-valued coefficient: [equation image TIFF2026105125000037]
[0082] The factor $c_{eq,k}$ is restricted to the interval [0.5, 2]. A numerical constant ε is introduced to avoid division by zero. The downmix is also band-limited to the frequency $f_{max}$; the values in all higher frequency bands are set to 0.
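By way of non-limiting illustration, the downmix of one QMF band with energy equalization may be sketched as follows. The coefficient is given in the source only as an image; the sketch assumes it is the square root of the ratio between the summed input energy and the downmix energy plus ε, which is what the description of the step implies. The function name `downmix_band`, the use of unweighted input energies, and the default value of `eps` are likewise assumptions.

```python
import math

def downmix_band(channels, weights, eps=1e-9):
    """Downmix one QMF band to a single output channel and equalize its energy.

    channels: list of per-channel lists of complex QMF samples (one band).
    weights:  non-negative real gains, i.e. one row of the downmix matrix.
    Returns the equalized downmix and the clamped coefficient c_eq.
    """
    n_slots = len(channels[0])
    mix = [sum(w * ch[n] for w, ch in zip(weights, channels))
           for n in range(n_slots)]
    # Assumed definition: energy of all channels that contribute to this output.
    p_in = sum(abs(v) ** 2
               for w, ch in zip(weights, channels) if w > 0
               for v in ch)
    p_out = sum(abs(v) ** 2 for v in mix)
    c_eq = math.sqrt(p_in / (p_out + eps))   # eps avoids division by zero
    c_eq = min(2.0, max(0.5, c_eq))          # restrict to the interval [0.5, 2]
    return [c_eq * v for v in mix], c_eq
```

For two identical channels the plain sum doubles the amplitude, so the coefficient comes out near 0.707 and restores the summed input energy; for channels that cancel each other the coefficient would grow without bound and is held at 2.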
[0083] Figure 9 schematically shows the processing in the frequency domain reverberators 816a and 816b of the binaural renderer 800 according to one embodiment of the present invention.
[0084] In the frequency domain reverberator, a mono downmix of the stereo input is calculated using the input mixer 900. This is done incoherently, by applying a 90° phase shift to the second input channel.
[0085] This mono signal is then fed into a feedback delay loop 902 in each frequency band k, which creates a decaying sequence of impulses. It is followed by parallel FIR decorrelators that distribute the signal energy, in a decaying manner, into the intervals between the impulses and create incoherence between the output channels. A decaying filter tap density is applied to create the energy decay. The filter tap phase operations are restricted to four options in order to implement a sparse, multiplier-free decorrelator.
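By way of non-limiting illustration, the input mixer and the feedback delay loop may be sketched as follows for one frequency band. The sketch assumes that the 90° phase shift is realized as a multiplication by j in the complex QMF domain and that the loop gain is chosen so that the impulse sequence decays by 60 dB over the reverberation time; neither detail is stated in the embodiment. The names `mono_input`, `feedback_delay_loop`, `delay`, and `rt60_slots` are hypothetical, and the FIR decorrelators are not shown.

```python
def mono_input(left, right):
    """Incoherent mono downmix: second channel shifted by 90 degrees (times j)."""
    return [l + 1j * r for l, r in zip(left, right)]

def feedback_delay_loop(x, delay, rt60_slots):
    """Recursive comb filter y[n] = x[n] + g * y[n - delay].

    delay:      loop length in QMF time slots.
    rt60_slots: reverberation time of this band, in QMF time slots.
    The gain g is chosen so that the amplitude falls by 60 dB (a factor of
    10**-3) after rt60_slots, i.e. g ** (rt60_slots / delay) == 10**-3.
    """
    g = 10.0 ** (-3.0 * delay / rt60_slots)
    y = []
    for n, v in enumerate(x):
        fb = y[n - delay] if n >= delay else 0j
        y.append(v + g * fb)
    return y
```

Feeding a single impulse into the loop produces impulses at multiples of `delay` whose amplitudes are successive powers of g, which is the decaying impulse sequence that the decorrelators subsequently fill in.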
[0086] After the reverberation calculation, an inter-channel coherence (ICC) correction 904 is included in the reverberator module for each QMF frequency band. In the ICC correction step, frequency-dependent direct gains $g_{direct}$ and cross-mix gains $g_{cross}$ are used to adapt the ICC.
[0087] The amount of energy and reverberation time for different frequency bands are included in the input parameter set. The values are given at several frequency points that are internally mapped to K=64 QMF frequency bands.
[0088] Two instances of the frequency domain reverberator are used to calculate the final intermediate signal [equation image TIFF2026105125000038]. The signal [equation image TIFF2026105125000039] is the first output channel of the first reverberator instance, and [equation image TIFF2026105125000040] is the second output channel of the second reverberator instance. They are combined into a final reverberation signal frame with dimensions of 2 channels, 64 bands, and 32 time slots. To ensure correct scaling of the reverberator output, the stereo downmix 822 is scaled 821a,b both times according to a correlation measure 820 of the input signal frame. The scaling factor is defined as a value within the interval [equation image TIFF2026105125000041] that depends linearly on a correlation coefficient $c_{corr}$ between 0 and 1, with [equation image TIFF2026105125000042] and [equation image TIFF2026105125000043]
[0089] where [equation image TIFF2026105125000044] denotes the standard deviation over one time slot n of channel A, the operator {*} denotes the complex conjugate, and [equation image TIFF2026105125000045] is the zero-mean version of the QMF domain signal [equation image TIFF2026105125000046] in the actual signal frame.
[0090] $c_{corr}$ is calculated twice: once for the plurality of channels A, B that are active in the actual signal frame F and included in the left channel of the stereo downmix, and once for the plurality of channels A, B that are active in the actual signal frame F and included in the right channel of the stereo downmix. $N_{DMX,act}$ is the number of input channels that are downmixed to one downmix channel A (the number of matrix elements in the A-th row of the downmix matrix $M_{DMX}$ that are not equal to 0) and that are active in the current frame.
[0091] The scaling factors are then as follows: [equation image TIFF2026105125000047]. The scaling factors are smoothed across audio signal frames by a first-order low-pass filter, yielding the smoothed scaling factors [equation image TIFF2026105125000048].
[0092] The scaling factor is initialized in the first audio input data frame by time-domain correlation analysis using the same method.
[0093] The input to the first reverberator instance is scaled by the scaling factor [equation image TIFF2026105125000049], and the input to the second reverberator instance is scaled by the scaling factor [equation image TIFF2026105125000050].
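By way of non-limiting illustration, the signal-dependent scaling may be sketched as follows. The governing equations are available in the source only as images, so several details are assumptions: the correlation of two channels is taken as the magnitude of the normalized cross-correlation of their zero-mean signals over the whole frame, the coefficient is the mean over all channel pairs, and the interval is assumed to run from the square root of the number of active channels (fully uncorrelated) to that number itself (fully correlated), as the summation of equal-level signals suggests. All function names and the smoothing constant `alpha` are hypothetical.

```python
import math

def _zero_mean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def rho(a, b):
    """Magnitude of the normalized cross-correlation of two complex signals."""
    ya, yb = _zero_mean(a), _zero_mean(b)
    num = abs(sum(p * q.conjugate() for p, q in zip(ya, yb)))
    den = math.sqrt(sum(abs(p) ** 2 for p in ya) * sum(abs(q) ** 2 for q in yb))
    return num / den if den > 0.0 else 0.0

def correlation_coefficient(channels, skip_diagonal=False):
    """Mean of rho over all channel pairs; the diagonal may be left out."""
    vals = [rho(a, b)
            for i, a in enumerate(channels)
            for j, b in enumerate(channels)
            if not (skip_diagonal and i == j)]
    return sum(vals) / len(vals) if vals else 1.0

def scaling_factor(n_act, c_corr):
    """Linear interpolation between sqrt(n_act) and n_act (assumed interval)."""
    root = math.sqrt(n_act)
    return root + c_corr * (n_act - root)

def smooth(previous, current, alpha=0.9):
    """First-order low-pass across frames; alpha is an illustrative constant."""
    return alpha * previous + (1.0 - alpha) * current
```

In use, one coefficient would be computed per downmix channel from the input channels active in the current frame, converted to a scaling factor, smoothed against the previous frame's value, and applied to the input of the corresponding reverberator instance.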
[0094] (d) Combination of convolution results and late reverberation Next, the convolution result 814, [equation image TIFF2026105125000051], and the reverberator outputs 826a, 826b, [equation image TIFF2026105125000052], for one QMF domain audio input frame are combined by a mixing process 828 that sums the two signals band-wise. Note that the upper bands above $K_{max}$ are 0 in [equation image TIFF2026105125000053], because the convolution is performed only in the bands up to $K_{max}$. In the mixing process, the late reverberation output is delayed by $d = ((L_{trans} - 20 \cdot 64 + 1)/64 + 0.5) + 1$ time slots. The delay d takes into account the transition time from early reflections to late reflections in the BRIRs, an initial delay of the reverberator of 20 QMF time slots, and an analysis delay of 0.5 QMF time slots for the QMF analysis of the BRIRs, so that the late reverberation is inserted at a reasonable time slot. The combined signal [equation image TIFF2026105125000054] is calculated by [equation image TIFF2026105125000055].
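By way of non-limiting illustration, the delay computation and the band-wise mixing may be sketched as follows for one band and one ear. The sketch reads the delay as d = ((L_trans - 20*64 + 1)/64 + 0.5) + 1 with L_trans in time-domain samples, and rounds the bracketed term down to obtain a whole number of time slots; the rounding rule is an assumption, as are the function names `reverb_delay_slots` and `mix_band`.

```python
import math

def reverb_delay_slots(l_trans):
    """Delay of the late reverberation in QMF time slots.

    l_trans is assumed to be the early-to-late transition point in samples;
    20 * 64 samples correspond to the 20-slot initial delay of the reverberator,
    and 0.5 is the QMF analysis delay in slots. Flooring is an assumption.
    """
    return math.floor((l_trans - 20 * 64 + 1) / 64 + 0.5) + 1

def mix_band(conv, rev, d):
    """Sum the convolution result and the reverberation delayed by d slots."""
    out = []
    for n, c in enumerate(conv):
        m = n - d
        r = rev[m] if 0 <= m < len(rev) else 0j
        out.append(c + r)
    return out
```

For bands above K_max the convolution input to `mix_band` would simply be zero, so the output there consists of the delayed reverberation alone.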
[0095] (e) QMF synthesis of the binaural QMF domain signal One 2-channel frame of 32 time slots of the QMF domain output signal [equation image TIFF2026105125000056] is converted to a 2-channel time-domain signal frame of length L by QMF synthesis according to ISO/IEC 14496-3:2009, subclause 4.6.18.4.2, yielding the final time-domain output signal 830, [equation image TIFF2026105125000057].
[0096] According to the method of the present invention, synthesized or artificial late reverberation is scaled taking into account the characteristics of the input signal, thereby improving the quality of the output signal while taking advantage of the reduction in computational complexity obtained by separate processing. Furthermore, as can be seen from the above description, no additional auditory models or target reverberation loudness are required.
[0097] It should be noted that the present invention is not limited to the embodiments described above. For example, although the above embodiments were described in relation to the QMF domain, other time-frequency domains, such as the STFT domain, may also be used. Furthermore, the scaling factor may be calculated in a frequency-dependent manner, such that the correlation is not calculated over the entire number of frequency bands, i.e., over $i \in [1, N]$, but over a number S of subsets defined as follows: $i_1 \in [1, N_1]$, $i_2 \in [N_1+1, N_2]$, ..., $i_S \in [N_{S-1}+1, N]$.
[0098] Furthermore, smoothing may be applied across the frequency band, or the bands may be combined according to specific rules, for example, according to the frequency resolution of hearing. The smoothing may be adapted to various time constants, for example, depending on the frame size or listener preference.
[0099] The method of the present invention can also be applied to various frame sizes, and even a frame size of just one time slot in the time-frequency domain is possible.
[0100] According to the embodiment, various downmix matrices can be used for downmixing, such as symmetric downmix matrices or asymmetric matrices.
[0101] The correlation measure can be derived from parameters transmitted in the audio bitstream, for example, from inter-channel coherence in MPEG surround or SAOC. Furthermore, according to the embodiment, it is possible to exclude certain matrix values, such as miscalculated values or values on the principal diagonal, and autocorrelation values, from the mean calculation if necessary.
[0102] Instead of being performed in the binaural renderer on the decoder side, the processing can also be performed in the encoder, for example, when applying a low-complexity binaural profile. This results in some representation of the scaling factor, for example the scaling factor itself or a correlation measure between 0 and 1, and these parameters are sent in the bitstream from the encoder to the decoder for a fixed downmix matrix.
[0103] Furthermore, while the embodiments described above described applying gain after the reverberator 514, it should be noted that, according to other embodiments, the gain may also be applied before the reverberator 514 or within the reverberator by modifying the gain within the reverberator 514, for example. This is advantageous because it may require less computation.
[0104] While some embodiments have been described in the context of the apparatus, it is clear that these embodiments also represent descriptions of the corresponding methods, where blocks or devices correspond to method steps or features of method steps. Similarly, embodiments described in the context of method steps also represent descriptions of the corresponding blocks, items, or features of the corresponding apparatus. Some or all of the method steps may be performed by (or using) hardware devices, such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such devices.
[0105] Depending on certain implementation requirements, embodiments of the present invention may be implemented in hardware or software. The implementation may be carried out using a non-transitory storage medium such as a digital storage medium, for example a floppy disk, DVD, Blu-ray®, CD, ROM, PROM, EPROM, EEPROM, or FLASH® memory, which stores electronically readable control signals and which cooperates (or can cooperate) with a programmable computer system so that the respective method is carried out. Thus, the digital storage medium may be computer-readable.
[0106] Some embodiments of the present invention include a data carrier having an electronically readable control signal, which can cooperate with a programmable computer system so that one of the methods described herein is carried out.
[0107] In general, embodiments of the present invention may be implemented as a computer program product having program code, the program code being operable to perform one of the methods when the computer program product is running on a computer. The program code may be stored, for example, on a machine-readable carrier.
[0108] Other embodiments include a computer program stored in a machine-readable carrier for carrying out one of the methods described herein.
[0109] In other words, embodiments of the method of the present invention are, therefore, computer programs having program code for carrying out one of the methods described herein when the computer program is running on a computer.
[0110] A further embodiment of the method of the present invention is a data carrier (or digital storage medium, or computer-readable medium) having a computer program for carrying out one of the methods described herein recorded thereon. The data carrier, digital storage medium, or recording medium is typically tangible and / or non-transitory.
[0111] A further embodiment of the method of the invention is a data stream or sequence of signals representing a computer program for carrying out one of the methods described herein. The data stream or sequence of signals may be configured to be transmitted, for example, over a data communication connection, for example, over the Internet.
[0112] Further embodiments include processing means, such as a computer or a programmable logic device, configured or programmed to carry out one of the methods described herein.
[0113] Further embodiments include a computer on which a computer program for carrying out one of the methods described herein is installed.
[0114] Further embodiments of the present invention include an apparatus or system configured to transfer (e.g., electronically or optically) a computer program for carrying out one of the methods described herein to a receiver. The receiver may be, for example, a computer, a mobile device, a memory device, etc. The apparatus or system may include, for example, a file server for transferring the computer program to the receiver.
[0115] In some embodiments, a programmable logic device (e.g., a field-programmable gate array) may be used to carry out some or all of the functions of the methods described herein. In some embodiments, the field-programmable gate array may work with a microprocessor to carry out one of the methods described herein. In general, the methods are preferably carried out by any hardware device.
[0116] The embodiments described above are merely illustrative of the principles of the present invention. Modifications and variations of the configurations and details described herein will be obvious to those skilled in the art. Therefore, the invention is not intended to be limited by any specific details presented in the description and explanation of the embodiments herein, but only by the claims immediately following.
[0117] Literature
[1] M. R. Schroeder, "Digital Simulation of Sound Transmission in Reverberant Spaces," The Journal of the Acoustical Society of America, Vol. 47, pp. 424-431 (1970), further expanded in J. A. Moorer, "About This Reverberation Business," Computer Music Journal, Vol. 3, No. 2, pp. 13-28, MIT Press (1979).
[2] Uhle, Christian; Paulus, Jouni; Herre, Jurgen: "Predicting the Perceived Level of Late Reverberation Using Computational Models of Loudness," Proceedings, 17th International Conference on Digital Signal Processing (DSP), July 6-8, 2011, Corfu, Greece.
[3] Czyzewski, Andrzej: "A Method of Artificial Reverberation Quality Testing," J. Audio Eng. Soc., Vol. 38, No. 3, 1990.