Audio apparatus and method for generating a stereo signal

The audio apparatus and method address suboptimal audio rendering issues by using downmix and residual filters with complex conjugate relationships and adaptive filtering, enhancing audio quality and reducing complexity for improved user experiences.

WO2026125033A1 · PCT designated stage · Publication Date: 2026-06-18 · KONINKLIJKE PHILIPS NV

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: KONINKLIJKE PHILIPS NV
Filing Date: 2025-12-01
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Current audio rendering technologies exhibit suboptimal performance in terms of perceived quality, spatial perception, complexity, and resource usage, particularly in scenarios involving small and cheap portable devices, leading to a reduced user experience.

Method used

An audio apparatus and method that employs a combination of downmix filters, residual filters, and renderers to generate an output stereo signal, utilizing complex conjugate relationships between filter responses and adaptive filtering to reduce error values, allowing for improved audio quality, reduced complexity, and efficient resource usage.

Benefits of technology

The approach provides improved stereo audio generation with enhanced perceived quality, reduced computational burden, and efficient implementation, while maintaining flexibility and adaptability across various scenarios.

✦ Generated by Eureka AI based on patent content.

Figure EP2025084817_18062026_PF_FP_ABST

Abstract

An audio apparatus comprises a downmixer (107) generating a mono downmix by downmixing a stereo signal resulting from applying downmix filters (103, 105) to an input stereo signal. Residual filters (109, 113) generate compensation signals from the mono downmix, and residual signals are generated by compensators (111, 115) compensating the input channel signals by the compensation signals. A renderer (117) renders the mono downmix and residual signals to generate an output stereo signal. The filter responses of the downmix filter (103, 105) and the residual filter (109, 113) for a given channel have a fixed relationship, typically being complex conjugates in the frequency domain, and are updated by an adapter (119) to reduce error values determined from the residual signals. The error values may be energy or power level measures for the residual signals.

Description

[0001] 2024PF00555


[0003] AUDIO APPARATUS AND METHOD FOR GENERATING A STEREO SIGNAL

[0004] FIELD OF THE INVENTION

[0005] The invention relates to an audio apparatus and method for rendering a stereo signal.

[0006] BACKGROUND OF THE INVENTION

[0007] Spatial audio applications have become numerous and widespread and increasingly form part of many audiovisual experiences. New and improved spatial experiences and applications are continuously being developed which results in increased demands for the audio processing and rendering.

[0008] A lot of research and development effort has focused on providing efficient and high quality audio encoding and audio decoding for spatial audio. A frequently used class of spatial audio representations is multichannel audio, including stereo, and efficient encoding approaches for such multichannel audio, based on downmixing multichannel audio signals to downmix signals with fewer channels, have been developed. One of the main advances in low bit-rate audio coding has been the use of parametric multichannel coding where a downmix signal is generated together with parametric data that can be used to upmix the downmix signal to recreate the multichannel audio signal.

[0009] In addition to accurately reproducing a stereo signal, it has also been of interest to create high quality rendering, and specifically binaural rendering of (encoded) stereo signals to emulate a virtual loudspeaker playback.

[0010] Binaural rendering of content authored for multi-channel playback can be achieved by the sum of convolutions of the input channel signals with left and right Head Related Impulse Responses (HRIRs), where each HRIR pair corresponds to a measured / simulated impulse response from a loudspeaker location to the ears. This can be expressed compactly in the z-domain as:

[0011] $$Y_{L,R}(z) = \sum_{\forall c} X_c(z) \, H_{L,R,c}(z)$$

[0014] where $X_c(z)$ represents the z-transform of the time domain input signal $x_c[n]$ for channel $c$, $Y_{L,R}(z)$ represents the z-transform of the left and right time domain output signals $l[n]$ and $r[n]$, respectively, and $H_{L,R,c}(z)$ is the z-transform of the HRIR of the left and right channels, $h_l[n]$ and $h_r[n]$, for the angle (and distance) corresponding to loudspeaker position $\varphi_c$. Approaches for rendering binaural stereo are disclosed in WO2010/122455 A1 and WO2007/031896 A1.
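As a minimal illustration of the sum of convolutions described above, the following sketch renders a set of loudspeaker channel signals to a left and a right ear signal in the time domain. The function names and the short toy impulse responses are illustrative assumptions only; real HRIRs are measured or simulated and are much longer.

```python
def convolve(x, h):
    """Full linear convolution of two real sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binaural_render(channels, hrirs):
    """Sum over channels c of x_c[n] convolved with the HRIR pair for c.

    channels: list of input signals x_c[n]
    hrirs:    list of (h_l, h_r) pairs, one per channel; in this sketch
              h_l and h_r of a pair are assumed to have equal length
    """
    n_out = max(len(x) + len(h_l) - 1 for x, (h_l, _) in zip(channels, hrirs))
    left = [0.0] * n_out
    right = [0.0] * n_out
    for x, (h_l, h_r) in zip(channels, hrirs):
        for n, v in enumerate(convolve(x, h_l)):
            left[n] += v
        for n, v in enumerate(convolve(x, h_r)):
            right[n] += v
    return left, right


# Toy data (hypothetical, not measured HRIRs): two loudspeaker channels.
x_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hrir_pairs = [([1.0, 0.5], [0.0, 0.25]),   # loudspeaker 1 -> (left ear, right ear)
              ([0.0, 0.25], [1.0, 0.5])]   # loudspeaker 2 -> (left ear, right ear)
left, right = binaural_render(x_c, hrir_pairs)
```

Each ear signal receives a contribution from every loudspeaker channel, which is what makes the cost grow with both the number of channels and the HRIR length.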


[0016] However, whereas current approaches for audio rendering may provide acceptable performance in many applications and scenarios, they tend not to be ideal and may exhibit suboptimal behavior in some scenarios. In particular, they may result in suboptimal perceived quality and / or a reduced user experience with e.g. perceived suboptimal spatial perception / audio scene in some cases. Complexity and / or resource usage may also be higher than desired and may in some cases make the approach undesirable for some implementations, such as applications based on small and cheap portable devices.

[0017] Hence, an improved approach would be advantageous. In particular, an approach allowing increased flexibility, improved adaptability, improved performance, increased audio quality, improved perceived quality, an improved rendering of an audio scene, improved spatial representation, reduced complexity and / or resource usage, reduced computational load, facilitated implementation, improved user experience, and / or an improved spatial audio experience would be advantageous.

[0018] SUMMARY OF THE INVENTION

[0019] Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

[0020] According to an aspect of the invention there is provided an audio apparatus for generating an output stereo signal, the audio apparatus comprising: a receiver arranged to receive a stereo signal; a first downmix filter arranged to generate a first filtered signal by filtering a signal of a first channel of the stereo signal; a second downmix filter arranged to generate a second filtered signal by filtering a signal of a second channel of the stereo signal; a combiner arranged to generate a mono downmix audio signal by combining the first filtered signal and the second filtered signal; a first residual filter arranged to generate a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter having a fixed predetermined relationship with the filter response of the first residual filter in that a frequency representation of the filter response of the first downmix filter is a complex conjugate of a frequency representation of the filter response of the first residual filter; a first compensator arranged to generate a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal; a second residual filter arranged to generate a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter having a fixed predetermined relationship with the filter response of the second residual filter; a second compensator arranged to generate a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal; an adapter arranged to update the filter response of the first residual filter to reduce a magnitude of a first error value and to update the filter response of the second residual filter to reduce a magnitude of a second error value, the first error value being dependent on the first residual signal and the second error value being dependent 
on the second residual signal; and a renderer, the renderer comprising: a first renderer arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction; a second renderer arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and a combiner arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.
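The signal flow of the downmix filters, combiner, residual filters and compensators can be sketched for a single time frequency tile as follows. This is only a sketch under stated assumptions: the function name is hypothetical, each filter is reduced to one complex weight per tile, and "compensating" is assumed to mean subtracting the compensation signal.

```python
import math


def downmix_and_residuals(x1, x2, w1, w2):
    """Process one time frequency tile.

    x1, x2: complex subband samples of the first and second channel
    w1, w2: complex weights of the first and second residual filter
    The downmix filter weights are the complex conjugates of w1 and w2.
    """
    d = w1.conjugate() * x1 + w2.conjugate() * x2  # mono downmix
    c1 = w1 * d                                    # first compensation signal
    c2 = w2 * d                                    # second compensation signal
    r1 = x1 - c1   # assumption: compensation is a subtraction
    r2 = x2 - c2
    return d, r1, r2


# Fully correlated toy input: the second channel is half the first.
x1 = 1 + 1j
x2 = 0.5 * x1
scale = math.sqrt(1.25)
w1 = complex(1.0 / scale)   # weights matched to the 1 : 0.5 level ratio,
w2 = complex(0.5 / scale)   # normalized so that |w1|^2 + |w2|^2 = 1
d, r1, r2 = downmix_and_residuals(x1, x2, w1, w2)
```

With weights matched to the input, both residuals vanish for this fully correlated example, so all signal energy is carried by the mono downmix.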

[0023] The approach may provide an improved audio experience in many embodiments. For many signals and scenarios, the approach may provide improved rendering of a stereo audio signal allowing improved generation / reconstruction of a stereo audio signal with an improved perceived audio quality. The approach may provide improved representation of an audio scene by a stereo signal.

[0024] The approach may provide efficient implementation and may in many embodiments allow reduced complexity and / or resource usage. The approach may in many scenarios allow a reduced computational burden while providing a perceived high quality rendering of a stereo signal.

[0025] The processing may be in time frequency segments or tiles. Each time frequency segment / tile may represent a frequency interval in a time interval. In many embodiments, the mono downmix audio signal may be divided into time segments / intervals and a frequency representation of the signal in the time segment / interval may be provided by signal values representing different frequency segments of the signal in the time segment / interval. Some or all of the processing may be performed in the frequency domain / frequency subbands.

[0026] The first direction may be a desired / target rendering direction. A direction may be an angle and / or orientation from a listening position / in a stereo image.

[0027] The directional rendering may be a binaural rendering generating the first intermediate stereo signal as a binaural stereo signal comprising a point source positioned in the first direction, the binaural rendering comprising selecting directional transfer functions / binaural impulse response values as values for a directional transfer function / binaural impulse response for a sound source in the first direction. Directional transfer functions may be parameterized transfer functions, and specifically may be represented in the frequency domain as weights for each of a plurality of subbands. A weight may be provided for each stereo channel. The weights may typically be complex.
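A parameterized directional transfer function of this kind can be sketched as one complex weight per subband and per output channel, applied to the mono downmix. The function name and the weight values below are hypothetical placeholders, not values taken from any HRTF set.

```python
import cmath


def render_directional(downmix, weights_left, weights_right):
    """Apply per-subband complex weights to a mono downmix.

    downmix[k]:       complex downmix sample in subband k
    weights_left[k]:  complex weight for the left output channel
    weights_right[k]: complex weight for the right output channel
    """
    left = [wl * d for wl, d in zip(weights_left, downmix)]
    right = [wr * d for wr, d in zip(weights_right, downmix)]
    return left, right


# Two subbands; the right ear is quieter and phase-delayed (hypothetical values).
downmix = [1 + 0j, 0 + 1j]
weights_left = [0.8 + 0j, 0.8 + 0j]
weights_right = [0.6 * cmath.exp(-0.5j), 0.6 * cmath.exp(-1.0j)]
left, right = render_directional(downmix, weights_left, weights_right)
```

The magnitude of each weight sets the level at that ear and its phase sets the interaural phase offset for the subband.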

[0028] The directional transfer function / binaural impulse response values may be parametric values and may be frequency tile values. The directional transfer function / binaural impulse response values may be values representing any suitable binaural impulse response in any suitable way, including HRIR, HRTF, BRIR values etc.

[0029] The directional rendering may e.g. render the mono downmix audio signal from a position / direction determined from the spatial parameters, from a parameter received in a bitstream (also including the stereo signal and potentially the spatial parameters), or e.g. in dependence on a user input, etc.

[0030] In many embodiments, the second rendering may be a predetermined rendering.

[0031] The filter responses may be frequency domain responses (e.g. frequency domain transfer functions) or may be time domain responses (e.g. time domain impulse responses). The filters may be implemented in the frequency domain with a filter weight (typically a complex value) for each subband.

[0033] The filters may be implemented as frequency domain filters with filter weights typically being a complex value for each channel for each frequency subband for each time segment. A frequency subband value of a signal may specifically be a complex value / sample of the signal for each frequency subband and each time segment.

[0034] The filter weights of a residual filter and a downmix filter for the same frequency and subband (and often time segment) may be complex conjugates of each other. The filter weights may be complex valued.

[0035] The second rendering may be a non-directional rendering and may specifically be a non-point source rendering. The second rendering may be a diffuse rendering. The directional rendering may be a point source rendering.

[0036] In some embodiments, the adapter may be arranged to update the filter response of the first residual filter only to reduce the magnitude of the first error value. In some embodiments, the adapter may be arranged to update the filter response of the first residual filter based only on the first error value. In some embodiments, the adapter may be arranged to update the filter response of the first residual filter independently of / without considering the second error value.

[0037] In some embodiments, the adapter may be arranged to update the filter response of the second residual filter only to reduce the magnitude of the second error value. In some embodiments, the adapter may be arranged to update the filter response of the second residual filter based only on the second error value. In some embodiments, the adapter may be arranged to update the filter response of the second residual filter independently of / without considering the first error value.

[0038] The adapter may be arranged to separably and / or independently update the filter responses of the first residual filter and the second residual filter.

[0039] According to an optional feature of the invention, the first error value is monotonically increasing with an increasing power level of the first residual signal and the second error value is monotonically increasing with an increasing power level of the second residual signal.

[0040] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0041] In many embodiments, the first error value is monotonically increasing with a decreasing cross-correlation between the signal of a first channel of the stereo signal and the first compensation signal. In many embodiments, the second error value is monotonically increasing with a decreasing cross-correlation between the signal of a second channel of the stereo signal and the second compensation signal.

[0042] According to an optional feature of the invention, the first error value is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the first residual signal.

[0044] In many embodiments, the second error value is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the second residual signal.

[0045] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0046] According to an optional feature of the invention, a frequency representation of the filter response of the first downmix filter is a complex conjugate of a frequency representation of the filter response of the first residual filter.

[0047] This may provide an advantageous approach for many scenarios.

[0048] In many embodiments, a frequency representation of the filter response of the second downmix filter is a complex conjugate of a frequency representation of the filter response of the second residual filter.

[0049] In many embodiments, a time domain impulse response of the filter response of the first downmix filter is a time reversed version of a time domain impulse response of the filter response of the first residual filter.

[0050] In many embodiments, a time domain impulse response of the filter response of the second downmix filter is a time reversed version of a time domain impulse response of the filter response of the second residual filter.
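The link between time reversal and complex conjugation can be checked numerically. The sketch below assumes a real-valued impulse response and a circular time reversal over the DFT length; the impulse response values and helper names are hypothetical.

```python
import cmath


def dft(h):
    """Plain discrete Fourier transform of a short sequence."""
    n = len(h)
    return [sum(h[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def time_reverse(h):
    """Circular time reversal: g[t] = h[(-t) mod N]."""
    n = len(h)
    return [h[(-t) % n] for t in range(n)]


residual_ir = [1.0, 0.5, -0.25, 0.0]     # hypothetical residual filter response
downmix_ir = time_reverse(residual_ir)   # corresponding downmix filter response
H_res = dft(residual_ir)
H_dmx = dft(downmix_ir)
# For a real impulse response, H_dmx[k] equals the complex conjugate of H_res[k].
```

A plain (non-circular) reversal differs from this only by a pure delay, that is, by a linear phase term in the frequency domain.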

[0051] According to an optional feature of the invention, the adaptation is subject to a constraint on a combined energy measure for the filter response of the first residual filter and the filter response of the second residual filter.

[0052] This may be particularly advantageous in many embodiments and for many scenarios. The constraint may be one of the combined energy measure being constant / equal to a constant value.

[0053] For a frequency domain representation, the adaptation may be subject to a constraint on a sum of combined magnitude measure for weights of each subband for the two channels, and specifically on this being constant.
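One possible way to combine the error reduction with the energy constraint is sketched below for a single subband. The text does not specify an update rule, so the LMS-style gradient step, the step size and the function name are all assumptions. Each weight is moved using only its own residual, and the pair is then rescaled so that the combined energy of the two weights stays equal to one.

```python
import random


def adapt_weights(x1, x2, w1, w2, step=0.05):
    """One pass of a sample-by-sample update over one subband.

    x1, x2: complex subband samples of the two channels
    w1, w2: current complex residual filter weights
    """
    for a, b in zip(x1, x2):
        d = w1.conjugate() * a + w2.conjugate() * b   # mono downmix sample
        r1 = a - w1 * d                               # first residual
        r2 = b - w2 * d                               # second residual
        # Gradient step on |r1|^2 and |r2|^2, treating d as given (assumption).
        w1 += step * r1 * d.conjugate()
        w2 += step * r2 * d.conjugate()
        # Constraint: keep |w1|^2 + |w2|^2 equal to 1.
        norm = (abs(w1) ** 2 + abs(w2) ** 2) ** 0.5
        w1, w2 = w1 / norm, w2 / norm
    return w1, w2


random.seed(0)
x1 = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
x2 = [0.5 * a for a in x1]            # second channel fully correlated with the first
w1, w2 = adapt_weights(x1, x2, complex(2 ** -0.5), complex(2 ** -0.5))
```

For this fully correlated toy input the weights settle at a magnitude ratio of about 2 : 1, matching the level ratio of the two channels, while the constraint keeps the weights from shrinking or growing without bound.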

[0054] According to an optional feature of the invention, the second renderer is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and the combiner is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

[0055] This may be particularly advantageous in many embodiments and for many scenarios. In some embodiments, the first renderer is arranged to determine the first direction from a direction indication provided in a data signal also comprising the stereo signal.

[0056] According to an optional feature of the invention, the adapter is arranged to compensate the update of the filter response of the first residual filter and the update of the filter response of the second residual filter for a common signal shift property for the first residual filter and for the second residual filter.

[0059] This may provide an advantageous approach for many scenarios.

[0060] According to an optional feature of the invention, the first compensator comprises a delay for delaying the signal of the first channel of the stereo signal relative to the first compensation signal.

[0061] This may be particularly advantageous in many embodiments and for many scenarios. In some embodiments, the second compensator comprises a delay for delaying the signal of the second channel of the stereo signal relative to the second compensation signal.

[0062] According to an optional feature of the invention, the first downmix filter is a frequency domain filter comprising weights for each subband of a frequency domain representation of the signal of the first channel of the stereo signal; the first residual filter is a frequency domain filter comprising weights for each subband of a frequency domain representation of the mono downmix audio signal; and wherein weights of the first downmix filter are complex conjugates of weights of a same subband of the first residual filter.

[0063] This may provide an advantageous approach for many scenarios. The weights for each subband may be complex valued weights (typically for all signals / filters).

[0064] According to an optional feature of the invention, the audio apparatus comprises: a spatial parameter circuit arranged to provide sets of spatial parameters for the stereo signal, the sets of spatial parameters being indicative of relative signal properties of channels of the stereo signal; and the renderer is arranged to determine the first direction from the sets of spatial parameters.

[0065] This may be particularly advantageous in many embodiments and for many scenarios. The spatial parameters may comprise sets of spatial parameters, each set of spatial parameters comprising at least one of: a level difference parameter indicative of a level difference between channels of the multichannel audio signal; a correlation parameter indicative of a coherence between channels of the multichannel audio signal; a timing difference parameter indicative of a timing difference between channels of the multichannel audio signal, and a phase difference parameter indicative of a phase difference between channels of the multichannel audio signal.
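As an illustration of such a parameter set, the sketch below estimates a level difference, a coherence and a phase difference for one time frequency tile. The formulas are the definitions conventionally used in parametric stereo coding and are an assumption here; the text does not prescribe how the parameters are computed. Both channels are assumed to have non-zero energy in the tile.

```python
import cmath
import math


def spatial_parameters(x1, x2):
    """Estimate spatial parameters for one tile of complex subband samples."""
    e1 = sum(abs(a) ** 2 for a in x1)                        # energy of channel 1
    e2 = sum(abs(b) ** 2 for b in x2)                        # energy of channel 2
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))   # cross-spectrum
    level_difference_db = 10.0 * math.log10(e1 / e2)
    coherence = abs(cross) / math.sqrt(e1 * e2)
    phase_difference = cmath.phase(cross)
    return level_difference_db, coherence, phase_difference


# Toy tile: channel 2 is channel 1 at half amplitude with a 90 degree phase shift.
x1 = [1 + 0j, 0 + 1j]
x2 = [0.5j * a for a in x1]
ild_db, icc, ipd = spatial_parameters(x1, x2)
```

Here the level difference is about 6 dB, the coherence is 1 because the channels differ only by a gain and a phase, and the phase difference is a quarter cycle.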

[0066] In many embodiments, the receiver is arranged to receive a data signal comprising the stereo signal and the sets of spatial parameters, and the spatial parameter circuit is arranged to extract the sets of spatial parameters from the data signal.

[0067] According to an optional feature of the invention, the renderer is arranged to determine the first direction from a timing of a peak of a cross correlation between the filter response of the first downmix filter and the filter response of the second downmix filter.
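Finding the timing of such a peak can be sketched as a search over lags of the cross correlation between two impulse responses. The function name and the toy responses are hypothetical, the responses are assumed real-valued, and the subsequent mapping from lag to rendering direction is not shown because the text does not specify it.

```python
def peak_lag(h1, h2):
    """Lag (in samples) at which the cross correlation
    c[lag] = sum_n h1[n] * h2[n - lag] has the largest magnitude."""
    best_lag, best_value = 0, -1.0
    for lag in range(-(len(h2) - 1), len(h1)):
        value = sum(h1[n] * h2[n - lag]
                    for n in range(len(h1))
                    if 0 <= n - lag < len(h2))
        if abs(value) > best_value:
            best_lag, best_value = lag, abs(value)
    return best_lag


# Toy responses: h_first is h_second delayed by two samples.
h_second = [1.0, 0.5, 0.0, 0.0, 0.0]
h_first = [0.0, 0.0, 1.0, 0.5, 0.0]
lag = peak_lag(h_first, h_second)
```

The sign of the lag indicates which channel leads, and its size indicates by how many samples.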

[0068] This may be particularly advantageous in many embodiments and for many scenarios. In some embodiments, the first renderer is arranged to determine a point source direction in a stereo image of the stereo signal from the spatial parameters, and to determine the first direction by applying a mapping function to the point source direction.

[0070] In some embodiments, the first renderer is arranged to determine the first direction from a direction indication provided in a data signal also comprising the stereo signal.

[0071] According to an optional feature of the invention, the audio apparatus further comprises a decorrelator arranged to decorrelate the first residual signal to generate a decorrelated residual signal; and wherein the renderer is arranged to perform a third rendering being a rendering of the decorrelated residual signal to generate a third intermediate stereo signal, and wherein the combiner is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

[0072] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0073] The third rendering may be a predetermined rendering employing a predetermined mapping of the decorrelated residual signal to channel signals of the third intermediate stereo signal.

[0074] According to an optional feature of the invention, the first direction is dependent on a property of the stereo signal and the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.

[0075] This may be particularly advantageous in many embodiments and for many scenarios. The predetermined rendering employing a predetermined mapping may be independent of the stereo signal and properties thereof.

[0076] The processing may be performed in subbands. The processing may be performed in time segments. The processing in each subband may for some or all steps be performed separately / independently in each subband (with respect to the processing in other subbands). The processing in each time segment may for some (any) or all steps be performed separately / independently in each time segment (with respect to the processing in other time segments).

[0077] The processing may be time interval / segment based with all processing being performed for each time segment. Equivalently, the signal(s) for each segment may be considered a signal (and in particular, signals of different time segments may be considered different signals).

[0078] According to another aspect of the invention, there is provided a method of generating an output stereo signal, the method comprising: receiving a stereo signal; a first downmix filter generating a first filtered signal by filtering a signal of a first channel of the stereo signal; a second downmix filter generating a second filtered signal by filtering a signal of a second channel of the stereo signal; generating a mono downmix audio signal by combining the first filtered signal and the second filtered signal; a first residual filter generating a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter having a fixed predetermined relationship with the filter response of the first residual filter; generating a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal; a second residual filter generating a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter having a fixed predetermined relationship with the filter response of the second residual filter in that a frequency representation of the filter response of the first downmix filter is a complex conjugate of a frequency representation of the filter response of the first residual filter; generating a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal; updating the filter response of the first residual filter to reduce a magnitude of a first error value, the first error value being dependent on the first residual signal; updating the filter response of the second residual filter to reduce a magnitude of a second error value, the second error value being dependent on the second residual signal; rendering an output stereo signal, the rendering comprising: performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction; performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and combining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.

[0081] These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

[0082] BRIEF DESCRIPTION OF THE DRAWINGS

[0083] Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

[0084] FIG. 1 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention;

[0085] FIG. 2 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention;

[0086] FIG. 3 illustrates some elements of an example of a renderer for an audio apparatus in accordance with some embodiments of the invention;

[0087] FIG. 4 illustrates an example of an approach for generating two residual signals for an example of an audio apparatus in accordance with some embodiments of the invention;

[0088] FIG. 5 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention; and

[0089] FIG. 6 illustrates some elements of a possible arrangement of a processor for implementing elements of an audio apparatus in accordance with some embodiments of the invention.

[0090] DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION

[0091] FIG. 1 illustrates an audio apparatus, henceforth also referred to as the audio render apparatus, which is arranged to render an output stereo signal from an input stereo signal. Thus, the audio render apparatus may receive a stereo signal and from this proceed to perform a rendering process resulting in an output stereo signal that is typically perceived to have improved properties and which for many signals and scenarios may provide an improved user experience and perception. In many embodiments, the audio render apparatus may generate a binaural stereo signal providing an improved (“out-of-head”) experience when listened to using headphones.

[0093] The audio render apparatus comprises a receiver 101 which receives a data signal comprising an input stereo signal. The stereo signal may typically be encoded in accordance with a suitable encoding standard and the receiver 101 may be arranged to decode the encoded data. The input stereo signal may be one captured at the audio render apparatus and thus may be received from e.g. a set of stereo microphones. In many embodiments, it may be received from another source, or indeed may be an artificially generated stereo signal (e.g. it may be a virtual audio stereo signal).

[0094] The receiver 101 may be arranged to receive a time domain and / or a frequency domain audio signal version / representation of the input stereo signal. In some cases, the received data signal may include the stereo signal in only one representation, i.e. the data signal may include only one of the frequency domain stereo signal and the time domain stereo signal. In such cases, the received data signal may be transformed to the other domain as appropriate. Thus, in some cases, a received data signal may include a time domain audio signal being the time domain representation of the stereo signal, and a time to frequency domain transformer may from this generate the frequency domain stereo signal. In some cases, a received data signal may include a frequency domain stereo signal being the frequency domain representation of the stereo signal and a frequency to time domain transformer may from this generate the time domain stereo signal.

[0095] The following description will focus on embodiments where all or most of the operations are performed in the frequency domain, but it will be appreciated that in many embodiments some, or indeed all of the operations may be performed in the time domain.

[0096] The audio render apparatus may in many embodiments perform subband processing and accordingly the input stereo signal may be processed in the subband / frequency domain. In many cases, the input stereo signal may be directly received in a suitable frequency domain / subband representation. In many embodiments, the input stereo signal may be received as a time domain signal and the audio render apparatus may comprise functionality for transforming the input stereo signal to the frequency domain.

[0097] The receiver 101 may include a time to frequency domain transformer that generates a frequency domain / subband stereo signal from a received time domain representation. In particular, in some embodiments, the receiver may comprise a filter bank which is arranged to generate a frequency subband representation of a received time domain input stereo signal. The receiver 101 may comprise a filter bank that is applied to the input stereo signal such that this is divided into frequency subbands.

[0098] The filter bank may be a Quadrature Mirror Filter (QMF) bank or may e.g. be implemented by a Fast Fourier Transform (FFT), but it will be appreciated that many other filter banks and approaches for dividing an audio signal into a plurality of subband signals are known and may be used. The filter bank may specifically be a complex-valued pseudo QMF bank, resulting in e.g. 32 or 64 complex-valued sub-band signals.

[0101] The processing is furthermore typically performed in time segments or time slots / intervals. In most embodiments, the audio signal is divided into time intervals / segments with a conversion to the frequency / subband domain by applying e.g. an FFT or QMF filtering to the samples of each signal. For example, each channel of the downmix audio signal may be divided into time segments of e.g. 2048, 1024, or 512 samples. These signals may then be processed to generate samples for e.g. 64, 32 or 16 subbands. Thus, a set of samples may be determined for each subband of the input stereo signal.
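The division into time segments followed by a frequency transform can be sketched as below. As simplifying assumptions, the sketch uses non-overlapping segments without windowing and a plain DFT in place of an FFT or QMF bank; the function name is hypothetical.

```python
import cmath
import math


def time_frequency_tiles(signal, segment_length):
    """Split a signal into segments and transform each to the frequency domain.

    Returns tiles[segment][subband], one complex value per tile.
    """
    tiles = []
    for start in range(0, len(signal) - segment_length + 1, segment_length):
        segment = signal[start:start + segment_length]
        spectrum = [sum(segment[t] * cmath.exp(-2j * cmath.pi * k * t / segment_length)
                        for t in range(segment_length))
                    for k in range(segment_length)]
        tiles.append(spectrum)
    return tiles


# A cosine with two cycles per 8-sample segment, 16 samples in total.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
tiles = time_frequency_tiles(signal, 8)
```

The 16 samples yield two segments of eight frequency values each, with the energy of the cosine concentrated in subband 2 (and its mirror, subband 6).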

[0102] It should be noted that the number of time domain samples is not directly coupled to the number of subbands. Typically, for a so-called critically sampled filterbank of N bands, every N input samples will lead to N sub-band samples (one for every sub-band). An oversampled filterbank will produce more output samples. E.g. for every N input samples, it would generate k*N output samples, i.e., k consecutive samples for every band.
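The sample bookkeeping described above can be sketched as follows. This is only an illustration: the function name `subband_samples_per_band` and its parameters are assumptions introduced here, and the block length is assumed to be a multiple of the number of bands.

```python
def subband_samples_per_band(num_input_samples: int, num_bands: int,
                             oversampling: int = 1) -> int:
    """Samples produced in each subband for a block of time domain samples.

    A critically sampled bank (oversampling == 1) yields one sample per band
    for every num_bands input samples. An oversampled bank yields
    `oversampling` consecutive samples per band for the same input.
    """
    if num_input_samples % num_bands != 0:
        raise ValueError("block length must be a multiple of the band count")
    return oversampling * (num_input_samples // num_bands)


# A 1024-sample segment through a 64-band critically sampled bank.
slots = subband_samples_per_band(1024, 64)
# The same segment through a bank oversampled by a factor of 2.
slots_oversampled = subband_samples_per_band(1024, 64, oversampling=2)
```

For the critically sampled case the total number of subband samples (slots times bands) equals the number of input samples, which is the sense in which the segment length and the band count are not directly coupled.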

[0103] In some embodiments, the subbands are generated to have the same bandwidth but in other embodiments subbands are generated to have different bandwidths, e.g. reflecting the sensitivity of human hearing to different frequencies.

[0104] For example, the receiver 101 may employ a hybrid filterbank with logarithmic filter band center-frequency spacings that follow that of human perception similar to equivalent rectangular bandwidths (ERBs). In order to compensate for the delay of the filtering by the small filter bank, a delay may be introduced for higher frequency subbands.

[0105] As a specific example, a time-domain signal x[n] may be fed through a downsampled complex-exponential modulated QMF bank with K bands. Each frame of 64 time domain samples x[n] results in one slot of QMF samples X[k, l] with k = (0, ..., K - 1) at slot l. The lower bands may then be filtered by additional complex-modulated filterbanks splitting the lower bands further. The higher bands are delayed to ensure that the filtered input stereo signals of the lower bands are in sync with the higher bands, as the filtering introduces a delay. This finally results in a structure where for every 64 time-domain samples x[n], one slot of hybrid QMF samples X[k, l] is produced with k = (0, ..., M - 1) at slot l, e.g. with a total number of hybrid bands M = 77.

[0106] Thus, the signals and the processing may be performed in subbands and for individual segments. Such blocks of a frequency interval / subband in a given time interval / segment will also be referred to as time frequency segments / tiles.

[0107] The audio render apparatus is arranged to process the input stereo signal to generate a mono downmix audio signal for the stereo signal as well as at least two residual audio signals. The downmix and residual signals are then typically rendered differently, with the downmix being rendered as a directional signal component and with at least one of the residual signals typically being rendered using a more diffuse rendering, and typically using a predetermined rendering that is not adapted based on the signal properties of the intermediate stereo signal, the downmix, or the residual signals. The audio render apparatus uses a specific adaptive approach for generating the mono downmix audio signal and the residual signals.

[0110] The audio render apparatus further comprises a first downmix filter 103 which is arranged to generate a first filtered signal by filtering a signal of a first channel of the stereo signal. Similarly, the audio render apparatus further comprises a second downmix filter 105 which is arranged to generate a second filtered signal by filtering a signal of a second channel of the stereo signal. Specifically, the first downmix filter 103 and the second downmix filter 105 may respectively filter the left and right signals of the intermediate stereo signal. In some embodiments, the downmix filters 103, 105 may be implemented as time domain filters with a suitable impulse response. However, in many embodiments, the downmix filters 103, 105 may be implemented in the frequency domain, and specifically the filtering may be a subband operation where the channel signal for a subband is multiplied by a complex subband weight representing the response of the filter in that subband.

[0111] The audio render apparatus further comprises a downmixer 107 which is arranged to downmix the filtered stereo signal to generate a mono downmix audio signal. The downmixer 107 is arranged to combine the filtered stereo channel signals to generate the mono downmix audio signal, and specifically to sum the filtered stereo channel signals.

[0112] The downmixer 107 may specifically be arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the input stereo signal following the filtering / weighting by the downmix filters 103, 105. In many embodiments, the audio render apparatus is arranged to multiply / scale subband samples of the respective channels of the stereo signal by the respective (typically complex-valued) downmix filter subband weights and combining (specifically summing) the resulting subband values to generate subband values of the mono downmix audio signal, (also referred to simply as the mono downmix, downmix, or downmix signal).

[0113] FIG. 2 illustrates an example of the downmixing approach for one subband where the subband samples of the channel signals are multiplied by weights of the downmix filters before being summed. Specifically, a first scale block 201 multiplies input subband samples Xl[b] of the left channel of the input signal by a subband downmix filter weight wl*[b]. It further comprises a second scale block 203 which multiplies input subband samples Xr[b] of the right channel of the input signal by a subband downmix filter weight wr*[b]. The resulting subband samples are summed together by a summer / summation 205 to generate the subband samples of the mono downmix audio signal S[b].
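A minimal sketch of the per-subband downmix of FIG. 2, assuming each subband sample is a Python `complex` value. The function name `downmix_subband` and the numeric values are hypothetical; `w_l` and `w_r` denote the residual filter weights, so the downmix filter weights are their complex conjugates.

```python
def downmix_subband(x_l: complex, x_r: complex,
                    w_l: complex, w_r: complex) -> complex:
    """Mono downmix sample S[b] = conj(w_l) * X_l[b] + conj(w_r) * X_r[b]."""
    return w_l.conjugate() * x_l + w_r.conjugate() * x_r


# Hypothetical weights and input samples for a single subband.
w_l = complex(0.6, 0.0)
w_r = complex(0.0, 0.8)
s = downmix_subband(complex(1.0, 2.0), complex(3.0, -1.0), w_l, w_r)
```

In a full subband implementation the same operation would simply be repeated for every subband and time slot, with one weight pair per subband.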

[0114] The audio render apparatus further comprises functionality for generating two residual signals from the channels of the stereo signal. The audio render apparatus seeks to remove the component of the input channels that is included in the downmix. It seeks to generate the residual signals to have a reduced / small correlation with the channel signals. Specifically, it seeks to generate the residual signals to have zero correlation with the downmix signal.

[0116] The audio render apparatus specifically comprises a first residual filter 109 arranged to generate a first compensation signal by filtering the mono downmix audio signal. In particular, for frequency domain / subband processing, the first residual filter 109 may generate the compensation values by multiplying the subband samples of the downmix by the filter weights for the respective subband.

[0117] The first residual filter 109 has a filter response which has a fixed predetermined relationship with the filter response of the first downmix filter 103 (and vice versa, i.e. the first downmix filter 103 has a filter response which has a fixed predetermined relationship with the filter response of the first residual filter 109). The audio render apparatus is arranged to ensure that this relationship is always maintained, and thus if the filter response of one filter is changed, the filter response of the other changes correspondingly to maintain the same relationship.

[0118] Specifically, the time domain impulse response of the first downmix filter 103 may be a time inversed / reversed time domain impulse response of the first residual filter 109. The time domain impulse responses of the first downmix filter 103 and the first residual filter 109 may be time inversed / reversed versions of each other. Equivalently, the frequency domain transfer function of the first downmix filter 103 may be a complex conjugate of a frequency domain transfer function of the first residual filter 109 (and vice versa). The frequency domain transfer function of the first downmix filter 103 and the first residual filter 109 may be complex conjugates of each other.
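The equivalence between time reversal and frequency domain conjugation can be checked numerically. The sketch below makes two assumptions that are not stated in the text: the impulse response is real-valued, and the reversal is circular so that it matches a plain discrete Fourier transform. The helper names and the example response are illustrative only.

```python
import cmath


def dft(h):
    """Plain O(N^2) discrete Fourier transform of a sequence."""
    n = len(h)
    return [sum(h[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def circular_time_reverse(h):
    """Return h_rev[t] = h[(-t) mod N]."""
    n = len(h)
    return [h[(-t) % n] for t in range(n)]


h = [1.0, 0.5, -0.25, 0.125]          # hypothetical real impulse response
H = dft(h)
H_rev = dft(circular_time_reverse(h))

# Largest deviation between the reversed filter's response and conj(H).
max_dev = max(abs(a - b.conjugate()) for a, b in zip(H_rev, H))
```

Here `max_dev` is at the level of floating point rounding, i.e. the reversed response has the complex conjugate transfer function in every frequency bin.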

[0119] In particular, for frequency subband processing where the filters are implemented as a complex weight for each subband (and with the filtering operation corresponding to multiplying each input subband sample by the complex weight for the subband to generate the output subband sample for the subband), the complex weights of the first downmix filter 103 and the first residual filter 109 may for each subband be the complex conjugates of each other. Thus, in many embodiments, the subband filter weights of the first downmix filter 103 and the first residual filter 109 are generated as the complex conjugates of each other. This may result in a compensation value which estimates the component of the first channel that is included in / transferred to the downmix. The compensation signal may essentially be generated to reverse the operation of the downmix weight for the first channel but with the signal being generated as a phase inverted version of the component of the first channel that is represented in the downmix.

[0120] The audio render apparatus includes a first compensator 111 which generates a first residual signal by compensating the signal of the first channel of the input stereo signal by the first compensation signal. Specifically, the first compensator 111 may be arranged to add the (antiphase) first compensation signal to the signal of the first channel of the input stereo signal. The first compensator 111 seeks to compensate the first channel to remove or at least reduce the component / correlation (after weighting / filtering) between the mono downmix audio signal and the first residual signal. Specifically, the generated compensation value (which specifically may represent an antiphase signal of a component of the first channel included in the downmix) may be added to the first signal to generate the first residual signal.

[0123] Similar functionality is provided for the second channel of the stereo signal.

[0124] Thus, the audio render apparatus further comprises a second residual filter 113 arranged to generate a second compensation signal by filtering the mono downmix audio signal. In particular, for frequency domain / subband processing, the second residual filter 113 may generate the compensation values by multiplying the subband samples of the downmix by the filter weights for the respective subband.

[0125] The second residual filter 113 has a filter response which has a fixed predetermined relationship with the filter response of the second downmix filter 105 (and vice versa, i.e. the second downmix filter 105 has a filter response which has a fixed predetermined relationship with the filter response of the second residual filter 113). The audio render apparatus is arranged to ensure that this relationship is always maintained, and thus if the filter response of one filter is changed, the filter response of the other changes correspondingly to maintain the same relationship.

[0126] As for the first downmix filter 103 and the first residual filter 109, the time domain impulse response of the second downmix filter 105 may be a time reversed time domain impulse response of the second residual filter 113. The time domain impulse responses of the second downmix filter 105 and the second residual filter 113 may be time reversed versions of each other. Equivalently, the frequency domain transfer function of the second downmix filter 105 may be a complex conjugate of a frequency domain transfer function of the second residual filter 113 (and vice versa). The frequency domain transfer function of the second downmix filter 105 and the second residual filter 113 may be complex conjugates of each other.

[0127] In particular, for frequency subband processing where the filters are implemented as a complex weight for each subband (and with the filtering operation corresponding to multiplying each input subband sample by the complex weight for the subband to generate the output subband sample for the subband), the complex weights of the second downmix filter 105 and the second residual filter 113 may for each subband be the complex conjugates of each other. Thus, in many embodiments, the subband filter weights of the second downmix filter 105 and the second residual filter 113 are generated as the complex conjugates of each other. This may result in a compensation value which estimates the component of the second channel that is included in / transferred to the downmix. The compensation signal may essentially be generated to reverse the operation of the downmix weight for the second channel but with the signal being generated as a phase inverted version of the component of the second channel that is represented in the downmix.

[0128] The audio render apparatus includes a second compensator 115 which generates a second residual signal by compensating the signal of the second channel of the input stereo signal by the second compensation signal. Specifically, the second compensator 115 may be arranged to add the (antiphase) second compensation signal to the signal of the second channel of the input stereo signal. The second compensator 115 seeks to compensate the second channel to remove or at least reduce the component / correlation (after weighting / filtering) between the mono downmix audio signal and the second residual signal. Specifically, the generated compensation value (which specifically may represent an antiphase signal of a component of the second channel included in the downmix) may be added to the second signal to generate the second residual signal.

[0131] FIG. 2 illustrates an example of such functionality for generating the residual signals. The circuit comprises a first residual scale block 207 which applies subband filter weights wl[b] of the first residual filter 109, with the weights being complex conjugates of the subband downmix filter weights wl*[b] of the first downmix filter 103, to the subband samples of the downmix signal S[b]. The resulting compensation values are then added by a summation circuit 211 to the subband samples of the first signal Xl[b] to generate the subband signals of the first residual signal Dl[b].

[0132] Similarly, the circuit comprises a second residual scale block 209 which applies subband filter weights wr[b] of the second residual filter 113, which are complex conjugates of the subband downmix filter weights wr*[b] of the second downmix filter 105, to the subband samples of the downmix signal S[b]. The resulting compensation values are then added by a summation circuit 213 to the subband samples of the second signal Xr[b] to generate the subband signals of the second residual signal Dr[b].
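The residual generation of FIG. 2 can be sketched for a single subband sample as below. The compensation is written as a subtraction of w·S, which corresponds to adding the antiphase compensation signal. The function name and the numeric values are hypothetical, and the weights are assumed to have unit combined energy.

```python
def downmix_and_residuals(x_l: complex, x_r: complex,
                          w_l: complex, w_r: complex):
    """Return (S, D_l, D_r) for one subband sample.

    w_l and w_r are the residual filter weights. The downmix filter weights
    are their complex conjugates.
    """
    s = w_l.conjugate() * x_l + w_r.conjugate() * x_r   # mono downmix
    d_l = x_l - w_l * s                                 # first residual
    d_r = x_r - w_r * s                                 # second residual
    return s, d_l, d_r


# Hypothetical weights with |w_l|^2 + |w_r|^2 = 1 and arbitrary inputs.
w_l, w_r = complex(0.6, 0.0), complex(0.0, 0.8)
x_l, x_r = complex(1.0, 2.0), complex(3.0, -1.0)
s, d_l, d_r = downmix_and_residuals(x_l, x_r, w_l, w_r)

# Passing the residual pair through the downmix filters again gives
# (numerically) zero, i.e. no downmix component remains in the residuals.
leak = w_l.conjugate() * d_l + w_r.conjugate() * d_r
```

With unit-energy weights the residual pair is the part of the input that is orthogonal to the weight vector, which is why `leak` vanishes.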

[0133] The downmixer 107, first compensator 111, and typically the second compensator 115 are coupled to a renderer 117 which is arranged to generate an output stereo signal by rendering the mono downmix audio signal and at least one residual signal. Further, the renderer 117 is arranged to apply a different rendering approach to the mono downmix audio signal than to the residual signals. Specifically, whereas the mono downmix audio signal is rendered as a directional component, the residual signal(s) is (are) typically rendered as diffuse, non-directional (or at least less directional) signals. In some cases, the residual signals may e.g. be rendered from locations corresponding to the directions of virtual stereo loudspeakers. The rendering of the residual signals typically uses a predetermined rendering algorithm that is not adapted dependent on the signal properties. In many embodiments, binaural rendering may be used.

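[0134] The audio render apparatus may accordingly be arranged to, for each subband, generate the mono downmix audio signal as:

$$S = \mathbf{w}^H \begin{bmatrix} X_l \\ X_r \end{bmatrix} = w_l^* X_l + w_r^* X_r$$

where $\mathbf{w}^H = \begin{bmatrix} w_l^* & w_r^* \end{bmatrix}$ represents the downmix weights.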

[0137] The residual signals may be derived using the complex conjugate of the downmix weights:

$$\mathbf{D} = \begin{bmatrix} D_l \\ D_r \end{bmatrix} = \begin{bmatrix} X_l - w_l S \\ X_r - w_r S \end{bmatrix}$$

[0143] The triplet (S, Dl, Dr) may then be rendered and may specifically be processed by a binaural renderer which uses HRTF filters and optionally BRIR to produce left and right channels.

[0144] The renderer 117 may use a specific approach where parallel paths process the mono downmix audio signal and the residual signal(s) in different ways to generate different stereo signal components which are then combined to generate the output stereo signal.

[0145] The approach may consider a general signal model in which stereo signals can be represented as:

$$l = f_l e^{j\varphi_l} x + n_l$$

$$r = f_r e^{j\varphi_r} x + n_r$$

[0148] The directional signal component $x$ is phase shifted using two (frequency-dependent) parameters $\varphi_l$ and $\varphi_r$, and is further panned / positioned in the stereo image of the original stereo channels $l$ and $r$ by frequency-dependent positive gains $f_l$ and $f_r$. Furthermore, a residual signal component is represented by signal components $n_l$ and $n_r$ of the respective channels. It is noted that the signal model description does not necessarily refer to a time-domain signal, but rather can alternatively or additionally refer to individual (potentially relatively small) frequency subbands. For example, the described signal model may individually apply to each of the frequency subbands for which separate spatial parameters are provided.

[0151] The directional signal component $x$ is phase shifted using two parameters $\varphi_l$ and $\varphi_r$, and is further panned / positioned in the stereo image of the original stereo channels $l$ and $r$. The panning is to an angle represented by the panning angle y. Furthermore, a residual signal component is represented by (e.g. noise) signal components $n_l$ and $n_r$ of the respective channels. It is noted that the signal model description does not necessarily refer to a time-domain signal, but rather can alternatively or additionally refer to individual (potentially relatively small) frequency subbands. For example, the described signal model may individually apply to each of the frequency subbands for which separate spatial parameters are provided.

[0154] The renderer 117 may render the audio signal corresponding to an assumption / consideration that the mono downmix audio signal corresponds to the directional signal component $x$, and the residual signals correspond to the residual audio signals $n_l$ and $n_r$. It uses different rendering approaches to generate different intermediate stereo signals which may be considered estimates of the different signal components of the signal model, with these intermediate stereo signals being combined to generate an output stereo signal. The intermediate stereo signals may be considered estimates or approximations of respectively the directional signal $x$ and the residual signals $n_l$ and $n_r$ of the signal model but are generated using low resource demanding approaches. The approach provides an advantageous rendering in many scenarios, embodiments, and applications and in particular may often provide an advantageous audio and spatial perception while allowing low complexity and resource demanding implementation and operation.

[0157] The audio render apparatus further comprises an adapter 119 which is arranged to update the filter response of the first residual filter 109 to reduce a magnitude of a first error value where the first error value is determined from the first residual signal. The adapter 119 is further arranged to update the filter response of the second residual filter 113 to reduce a magnitude of a second error value where the second error value is determined from the second residual signal. The adapter 119 is further arranged to maintain the predetermined relationship (such as the frequency domain complex conjugation) between the filter responses of the first residual filter 109 and the first downmix filter 103, and between the filter responses of the second residual filter 113 and the second downmix filter 105, and accordingly the adapter 119 is also arranged to update the filter response of the first downmix filter 103 to reduce the magnitude of the first error value and to update the filter response of the second downmix filter 105 to reduce the magnitude of the second error value. The adaptation of the downmix filters 103, 105 is typically achieved by a direct copy / paste (with the appropriate conjugation or time reversal) of the filter responses of the first residual filter 109 and of the second residual filter 113 following the updates of these.

[0158] The error values may reflect a power / magnitude level of the corresponding residual signals, and specifically may be monotonically increasing for an increasing power / magnitude level of the corresponding residual signal.

[0159] Specifically, the first error value may be monotonically increasing with an increasing power level of the first residual signal and the second error value may be monotonically increasing with an increasing power level of the second residual signal.

[0160] Thus, in many embodiments, the adapter 119 may be arranged to adapt the filter responses to reduce the power level of the residual signals.

[0161] In many embodiments, the first error value may be monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the first residual signal. In many embodiments, the second error value may be monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the second residual signal.

[0162] In many embodiments, the first error value may include a contribution from, or even consist of, an error value that is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the first residual signal. In many embodiments, the second error value may include a contribution from, or even consist of, an error value that is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the second residual signal.

[0164] In many embodiments, the filter responses / weights are updated to reduce, and specifically remove, correlation between the input (or output) signal of the residual filters and the residual signal. The adapter 119 may seek to remove all downmix-correlated components from the channels of the input stereo signals.

[0165] The adaptation may typically update the filter responses to reduce the residual signal levels subject to constraints on the filter responses. In many embodiments, the adaptation is subject to a constraint on a combined energy measure for the filter response of the first residual filter and the filter response of the second residual filter. For example, in many embodiments, the filter responses may be constrained to always have the same total combined energy response. Thus, the adapter 119 may be arranged to adapt / update the filter responses but with the constraint that the total power / energy gain / transfer is unchanged.

[0166] The adaptation / updating may use any suitable approach for updating / adapting the filter responses based on the error signal. For example, in many embodiments, the adapter 119 may implement a gradient descent algorithm to minimize the error signal by adapting the filter responses.

[0167] In many embodiments, the filters may as described be implemented in the subband domain, and the adaptation may be performed individually in each subband. Specifically, the adapter 119 may be arranged to adapt the subband filter weights but such that the total power level for the weights is unchanged. Specifically, the weights may be updated under the constraint that the sum of the squared magnitudes of the weights is constant.

[0168] Specifically, for subband processing, the mono downmix audio signal may (for a given subband) be given by (ref. FIG. 2):

$$S = \mathbf{w}^H \begin{bmatrix} X_l \\ X_r \end{bmatrix}$$

[0171] It may often be desired that the mono downmix audio signal corresponds to the principal component of the stereo signal in which case W is given by the principal eigenvector of the 2x2 signal covariance matrix. The adaptation approach can be considered to seek to flexibly and dynamically adapt the filter responses to provide the mono downmix audio signal as an estimate of the principal component. The adapter 119 may adapt the weights to seek to determine the eigenvector of the 2x2 signal covariance matrix.
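For reference, the principal eigenvector of a 2x2 signal covariance matrix can also be computed directly from a block of subband samples. The sketch below uses power iteration, which is an illustrative choice and not the adaptive method of the apparatus. The synthetic signals, the function names and the iteration count are assumptions.

```python
import cmath
import math


def covariance_2x2(xl, xr):
    """Sample estimate of R = [[r_ll, r_lr], [conj(r_lr), r_rr]]."""
    n = len(xl)
    r_ll = sum(abs(a) ** 2 for a in xl) / n
    r_rr = sum(abs(b) ** 2 for b in xr) / n
    r_lr = sum(a * b.conjugate() for a, b in zip(xl, xr)) / n
    return r_ll, r_lr, r_rr


def principal_eigenvector(r_ll, r_lr, r_rr, iterations=100):
    """Power iteration on the 2x2 Hermitian matrix, unit-norm result."""
    w_l, w_r = complex(1.0), complex(1.0)
    for _ in range(iterations):
        v_l = r_ll * w_l + r_lr * w_r
        v_r = r_lr.conjugate() * w_l + r_rr * w_r
        norm = math.sqrt(abs(v_l) ** 2 + abs(v_r) ** 2)
        w_l, w_r = v_l / norm, v_r / norm
    return w_l, w_r


# Synthetic subband samples (assumed): a common source plus weak, nearly
# uncorrelated tones standing in for the residual components. The source is
# attenuated by 0.5 and phase shifted by -0.8 rad in the right channel.
src = [cmath.exp(0.7j * n) for n in range(200)]
xl = [s + 0.1 * cmath.exp(1.3j * n) for n, s in enumerate(src)]
xr = [0.5 * cmath.exp(-0.8j) * s + 0.1 * cmath.exp(2.1j * n)
      for n, s in enumerate(src)]

w_l, w_r = principal_eigenvector(*covariance_2x2(xl, xr))
relative_phase = cmath.phase(w_l * w_r.conjugate())
```

The relative phase between the two weights approximates the 0.8 rad inter-channel phase of the source, and the larger weight falls on the channel in which the source is stronger.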

[0172] The adaptation may specifically be performed based on the constraint $|w_l[k]|^2 + |w_r[k]|^2 = c$, where $c$ is some constant (which e.g. can be set to 1). Such an approach may ensure that the energy of the estimated source (i.e. the mono downmix audio signal $S[k] = w_l^*[k] X_l[k] + w_r^*[k] X_r[k]$) is constrained.

[0177] The first and second (left and right) residual signals at time-step $i$ may be calculated using adaptive filters (time- or frequency-domain) that remove the estimated point source $S[k]$ from the left and right stereo channels $X_l[k]$ and $X_r[k]$ of the input stereo signal, i.e.

$$D_l^i[k] = X_l[k-\delta] - w_l^i[k] S[k]$$

$$D_r^i[k] = X_r[k-\delta] - w_r^i[k] S[k]$$

[0181] The adaptive filter weights may specifically be updated using gradient descent at iteration $i$, e.g. according to

$$w_l^{i+1}[k] = w_l^i[k] + \mu S^*[k]\left(X_l[k-\delta] - w_l^i[k] S[k]\right) = w_l^i[k] + \mu S^*[k] e_l^i[k]$$

$$w_r^{i+1}[k] = w_r^i[k] + \mu S^*[k]\left(X_r[k-\delta] - w_r^i[k] S[k]\right) = w_r^i[k] + \mu S^*[k] e_r^i[k]$$

where $e_l^i[k]$ and $e_r^i[k]$ are the error values.

[0186] The iterative update may aim to reduce the error values which specifically may be the mean-square value of the residual signals, or equivalently the correlation between the estimated point source (the mono downmix audio signal) and left / right signals $X_l[k]$ and $X_r[k]$ of the stereo signal. After each update, the weights may be rescaled to satisfy the constraint $|w_l[k]|^2 + |w_r[k]|^2 = 1$. The coefficients of the adaptive filters after updating are then complex conjugated and copied to the downmix filters 103, 105 to estimate the downmix signal / point source at iteration i + 1.
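The update, rescaling and conjugate copy described above can be sketched for a single subband as follows. This is an illustration under stated assumptions rather than the apparatus itself: the delay δ is taken as zero, the step size `mu`, the initial weights, the synthetic signals and the function name `adapt_weights` are all hypothetical choices.

```python
import cmath
import math
import random


def adapt_weights(xl, xr, mu=0.05):
    """Sample-by-sample update of the residual filter weights (one subband)."""
    w_l = w_r = complex(1 / math.sqrt(2))
    for x_l, x_r in zip(xl, xr):
        # The downmix filters are the conjugates of the residual weights.
        s = w_l.conjugate() * x_l + w_r.conjugate() * x_r
        e_l = x_l - w_l * s          # first residual, used as error value
        e_r = x_r - w_r * s          # second residual, used as error value
        w_l += mu * s.conjugate() * e_l
        w_r += mu * s.conjugate() * e_r
        # Rescale so that |w_l|^2 + |w_r|^2 = 1.
        norm = math.sqrt(abs(w_l) ** 2 + abs(w_r) ** 2)
        w_l, w_r = w_l / norm, w_r / norm
    return w_l, w_r


# Synthetic input (assumed): a unit-magnitude source that is phase shifted by
# -delta in the right channel, plus weak independent noise in each channel.
random.seed(0)
delta = 0.8
xl, xr = [], []
for _ in range(2000):
    src = cmath.exp(1j * random.uniform(0.0, 2.0 * math.pi))
    xl.append(src + complex(random.gauss(0, 0.05), random.gauss(0, 0.05)))
    xr.append(src * cmath.exp(-1j * delta)
              + complex(random.gauss(0, 0.05), random.gauss(0, 0.05)))

w_l, w_r = adapt_weights(xl, xr)
relative_phase = cmath.phase(w_l * w_r.conjugate())
```

Substituting S = conj(w_l)·X_l + conj(w_r)·X_r shows that this update has the form of Oja's rule for the principal component, so for this input the weights settle near equal magnitudes with a relative phase close to `delta`.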

[0187] For the approach of FIG. 3 and using the constraint $|w_l|^2 + |w_r|^2 = 1$, it can be shown that the total energy in the input stereo signal equals the total energy of the mono downmix audio signal and the two residual signals:

$$|S[b]|^2 + |D_l[b]|^2 + |D_r[b]|^2 = |X_l[b]|^2 + |X_r[b]|^2.$$
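The energy relation can be checked numerically for one subband sample. The sketch assumes the downmix S = conj(w_l)·X_l + conj(w_r)·X_r and the residuals D = X - w·S, with hypothetical unit-energy weights and arbitrary input values.

```python
def energy_split(x_l: complex, x_r: complex, w_l: complex, w_r: complex):
    """Return (input energy, downmix plus residual energy) for one sample."""
    s = w_l.conjugate() * x_l + w_r.conjugate() * x_r
    d_l = x_l - w_l * s
    d_r = x_r - w_r * s
    e_in = abs(x_l) ** 2 + abs(x_r) ** 2
    e_out = abs(s) ** 2 + abs(d_l) ** 2 + abs(d_r) ** 2
    return e_in, e_out


# Weights satisfying |w_l|^2 + |w_r|^2 = 1.
w_l, w_r = complex(0.6, 0.0), complex(0.0, 0.8)
e_in, e_out = energy_split(complex(1.0, 2.0), complex(3.0, -1.0), w_l, w_r)

# The same inputs with weights that violate the constraint.
e_in_bad, e_out_bad = energy_split(complex(1.0, 2.0), complex(3.0, -1.0),
                                   2 * w_l, 2 * w_r)
```

With unit-energy weights the downmix and residual pair form an orthogonal split of the input vector, so the two energies agree. Scaling the weights away from the constraint breaks the equality.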

[0191] This approach is adaptive in that it estimates the filters $w_l$ and $w_r$ based on the correlation between the estimated point source, i.e. the mono downmix audio signal $S[k]$, and the left and right channels $X_l[k]$ and $X_r[k]$ of the input stereo signal. The approach results in an adaptation that provides estimates of a matched filter that represents the acoustic impulse response (phase and magnitude) between the point source and the capture positions for the left and right channels. This can be illustrated by considering a simple example where the left and right stereo channels are given by:

$$X_l[k] = S[k] + D_l[k]$$

$$X_r[k] = S[k] e^{-j\Delta[k]} + D_r[k]$$

[0195] i.e. where the point source signal is simply delayed in the right channel by the value $\Delta[k]$ relative to the left channel. It is also for the example assumed that the residual signals $D_l[k]$ and $D_r[k]$ are mutually uncorrelated and uncorrelated with $S[k]$. The SNR (Signal to Noise Ratio) of the left and right individual channels of the input signal is:

$$\mathrm{SNR}[k] = \sigma_S^2[k] / \sigma_D^2[k]$$

where $\sigma_S^2[k] = E\{S[k]S^*[k]\}$ and $\sigma_D^2[k] = E\{D_l[k]D_l^*[k]\} = E\{D_r[k]D_r^*[k]\}$. If the adaptive filters $w_l[k]$ and $w_r[k]$ converge to the principal eigenvector of the covariance matrix computed as

$$\mathbf{R}[k] = E\left\{ \begin{bmatrix} X_l[k] \\ X_r[k] \end{bmatrix} \begin{bmatrix} X_l^*[k] & X_r^*[k] \end{bmatrix} \right\}, \qquad \mathbf{w}[k] = \frac{1}{\sqrt{2}} \begin{bmatrix} e^{j\Delta[k]} \\ 1 \end{bmatrix}$$

then this means that the principal component or estimate of the phantom source, here denoted $\hat{S}[k]$, is given by

$$\hat{S}[k] = \mathbf{w}^H[k] \begin{bmatrix} X_l[k] \\ X_r[k] \end{bmatrix} = \frac{1}{\sqrt{2}}\left(e^{-j\Delta[k]} X_l[k] + X_r[k]\right) = \sqrt{2}\, S[k] e^{-j\Delta[k]} + \frac{1}{\sqrt{2}}\left(e^{-j\Delta[k]} D_l[k] + D_r[k]\right)$$

The resulting SNR of the principal component has increased by a factor of 2, or approximately 3 dB,

$$\mathrm{SNR}_{PC}[k] = 2\sigma_S^2[k] / \sigma_D^2[k]$$

assuming that $S$, $D_l$ and $D_r$ are uncorrelated. The higher SNR means that it is e.g. easier to binauralize the phantom source.
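The factor of 2 can be reproduced with a few lines of arithmetic. The sketch assumes the example model X_l = S + D_l, X_r = S·exp(-jΔ) + D_r with equal-power, mutually uncorrelated residuals, and the unit-norm principal eigenvector w = (1/√2)·[exp(jΔ), 1]. The function name and the value of Δ are illustrative.

```python
import cmath
import math


def snr_gain_db(delta: float) -> float:
    """SNR gain of the principal component over a single channel, in dB."""
    w_l = cmath.exp(1j * delta) / math.sqrt(2)
    w_r = complex(1 / math.sqrt(2))
    # Amplitude gain applied to the source: conj(w_l) * 1 + conj(w_r) * exp(-j*delta).
    source_gain = abs(w_l.conjugate() + w_r.conjugate() * cmath.exp(-1j * delta)) ** 2
    # Uncorrelated, equal-power residuals add in power.
    noise_gain = abs(w_l) ** 2 + abs(w_r) ** 2
    return 10 * math.log10(source_gain / noise_gain)


gain_db = snr_gain_db(0.8)
```

The source contributions add coherently (power gain 2) while the residual contributions add in power (gain 1), so the result is 10·log10(2), roughly 3 dB, independent of Δ.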

[0212] In other words, the left frequency-domain signal may first be phase-aligned (delayed) with respect to the right signal by $\Delta[k]$ before summing and normalizing, with the diffuse noise terms averaging out. This produces a scaled and delayed version of the point source. With $\hat{S}[k]$ denoting this principal component estimate, the resulting left and right residual signals, denoted $\hat{D}_l[k]$ and $\hat{D}_r[k]$, may then be given by:

$$\hat{D}_l[k] = X_l[k] - w_l[k]\hat{S}[k] = X_l[k] - \frac{1}{\sqrt{2}} e^{j\Delta[k]} \hat{S}[k] = \frac{1}{2} D_l[k] - \frac{1}{2} e^{j\Delta[k]} D_r[k] \qquad (18)$$

$$\hat{D}_r[k] = X_r[k] - w_r[k]\hat{S}[k] = X_r[k] - \frac{1}{\sqrt{2}} \hat{S}[k] = \frac{1}{2} D_r[k] - \frac{1}{2} e^{-j\Delta[k]} D_l[k]$$

[0218] In some embodiments, the adapter 119 may be arranged to compensate the update of the filter responses of the residual filters (and the downmix filters) to compensate for a common signal shift property for the filters. The common shift property may typically be a common delay for time domain filters and a common phase for frequency domain / subband filters. The common signal shift property may be common for the filter responses if no compensation is performed. Thus, the adapter 119 may be arranged to remove a common delay component that is present in both parallel filters (i.e. in both residual filters and / or in both downmix filters) and / or a common phase component that is present in both parallel filters.

[0219] For example, the sum of the filter weights $w_l[k] + w_r[k]$ can be decomposed into an all-pass and a minimum phase filter response, $w[k] = w_l[k] + w_r[k] = w_{ap}[k]\, w_{mp}[k]$, where the minimum phase component can be described as an impulse response where the energy is concentrated at the beginning of the response (minimum delay), and the all-pass component consists of a frequency-dependent delay response. The first and second filter weights can then be normalized for the common phase component by scaling with the factor $w_{mp}[k] / w[k]$ for each subband.
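A deliberately simplified per-subband sketch of removing a common phase is given below. It is not the all-pass / minimum phase decomposition described above, which requires the response across frequency. Instead it removes the entire phase of the sum w_l + w_r from both weights, i.e. it treats the minimum phase component as having zero phase in that subband. The function name and values are hypothetical.

```python
import cmath


def remove_common_phase(w_l: complex, w_r: complex):
    """Rotate both weights so that their sum becomes real and non-negative."""
    total = w_l + w_r
    if total == 0:
        return w_l, w_r                      # no common phase is defined
    rotation = cmath.exp(-1j * cmath.phase(total))
    return w_l * rotation, w_r * rotation


def compensation(x_l, x_r, w_l, w_r):
    """Compensation values (w_l * S, w_r * S) for one subband sample."""
    s = w_l.conjugate() * x_l + w_r.conjugate() * x_r
    return w_l * s, w_r * s


w_l, w_r = complex(0.3, 0.5), complex(-0.2, 0.7)
n_l, n_r = remove_common_phase(w_l, w_r)

x_l, x_r = complex(1.0, 2.0), complex(3.0, -1.0)
before = compensation(x_l, x_r, w_l, w_r)
after = compensation(x_l, x_r, n_l, n_r)
```

Because the residual filters and the conjugated downmix filters rotate in opposite directions, a common phase rotation leaves the compensation values, and hence the residual signals, unchanged. Only the phase of the downmix itself is affected.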

[0220] FIG. 3 shows examples of elements of the renderer 117.

[0221] The mono downmix audio signal is in the example fed to a first renderer 301 which is arranged to render the mono downmix audio signal to generate a first intermediate stereo signal. The rendering by the first renderer 301 (also referred to as a first rendering) is a directional rendering which renders the first intermediate stereo signal with a given direction / position in the stereo image of the first intermediate stereo signal. The first rendering may specifically render the mono downmix audio signal as a point source with a given direction / position in the stereo image.

[0222] The first renderer 301 is coupled to a direction determining circuit 303 which is arranged to determine a direction y’ which is fed to the first renderer 301, resulting in the mono downmix audio signal being rendered so as to be perceived from this position / direction. Thus, the first rendering is specifically such that the mono downmix audio signal in the first intermediate stereo signal is perceived as a point audio source positioned in the direction corresponding to the direction y’ determined by the direction determining circuit 303. The direction y’ will also be referred to as the rendering direction or rendering angle.

[0224] In some embodiments, the direction determining circuit 303 may be arranged to determine the direction y’ as a rendering direction y’ that is indicated in metadata provided for the input stereo signal. For example, in many embodiments, the receiver 101 may receive a data signal comprising the (encoded) input stereo signal as well as e.g. metadata indicating a position / direction of dominant single point sources and the direction determining circuit 303 may extract this information and use it as the rendering direction y’.

[0225] The first renderer 301 may accordingly proceed to render the mono downmix audio signal such that it is perceived from the given direction, and it may specifically achieve this directional rendering by applying a directional transfer function to the mono downmix audio signal, with the directional transfer function generating the intermediate stereo signal from the mono downmix audio signal. The directional transfer function may specifically include a sub-transfer function for each channel, i.e. it may include one (sub)transfer function for generating a left channel signal and one (sub)transfer function for generating the right channel signal.

[0226] In many cases, the directional transfer function may be provided as a set of complex weights for the different subbands of a frequency representation of the mono downmix audio signal. The audio apparatus may perform many or all of the operations in the frequency domain and thus the transfer function may also be expressed and applied in the frequency domain. For example, for each frequency subband of the representation of the mono downmix audio signal, the transfer function may provide a complex weight for each of the output channels and a frequency representation of the first intermediate stereo signal may be generated by applying / multiplying the subband samples of the mono downmix audio signal by these weights to generate the subband samples of the first intermediate stereo signal.
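
As an illustration of the subband multiplication described above, the following is a minimal sketch and not the claimed implementation. The function name `apply_directional_weights` and all numeric values are hypothetical; the sketch only assumes that the directional transfer function is available as one complex weight per subband and per output channel.

```python
def apply_directional_weights(mono_subbands, weights_left, weights_right):
    """Multiply each mono subband sample by the complex weight of each channel."""
    if not (len(mono_subbands) == len(weights_left) == len(weights_right)):
        raise ValueError("expected one weight per subband for each channel")
    left = [m * w for m, w in zip(mono_subbands, weights_left)]
    right = [m * w for m, w in zip(mono_subbands, weights_right)]
    return left, right


# Hypothetical subband samples of the mono downmix for one time segment.
mono = [1 + 0j, 0.5 - 0.5j, 0.25j]
# Hypothetical complex weights of the directional transfer function.
w_left = [0.8 + 0j, 0.6 + 0.2j, 0.3 - 0.1j]
w_right = [0.4 + 0j, 0.5 - 0.2j, 0.7 + 0.1j]

left, right = apply_directional_weights(mono, w_left, w_right)
```

The same routine would be called once per time segment, with the weights selected for the current rendering direction.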

[0227] The first transfer function is determined to correspond to the desired direction, i.e. it reflects the mapping from the mono downmix audio signal to the channels of the first intermediate stereo signal such that it is perceived as / corresponds to an audio source at a position in the stereo image corresponding to the rendering direction / angle.

[0228] For example, in some cases, the transfer function for a given direction may correspond to a panning of the mono downmix audio signal to the given direction in the stereo image.

[0229] In many embodiments, the first rendering may be a binaural rendering and the first intermediate stereo signal may be a binaural stereo signal providing an enhanced spatial experience / perception when heard through headphones. Thus, the first renderer 301 may specifically be a binaural audio renderer which generates binaural audio signals for the left and right ear of a user. Binaural audio signals are generated to provide a desired spatial experience and are typically reproduced by headphones or earphones that specifically may be part of a headset worn by a user (the headset typically also comprises left and right eye displays).

[0230] Thus, in many embodiments, the audio rendering by the first renderer 301 is a binaural render process using suitable binaural transfer functions to provide the desired spatial effect for a user wearing a headphone. For example, the first renderer 301 may be arranged to generate an audio component to be perceived to arrive from a specific position using binaural processing.

[0233] Binaural processing is known to be used to provide a spatial experience by virtual positioning of sound sources using individual signals for the listener’s ears. With an appropriate binaural rendering processing, the signals required at the eardrums in order for the listener to perceive sound from any desired direction can be calculated, and the signals can be rendered such that they provide the desired effect. These signals are then recreated at the eardrum using either headphones or a crosstalk cancelation method (suitable for rendering over closely spaced speakers). Binaural rendering can be considered to be an approach for generating signals for the ears of a listener resulting in tricking the human auditory system into perceiving that a sound is coming from the desired positions.

[0234] The binaural rendering is based on binaural transfer functions which vary from person to person due to the acoustic properties of the head, ears and reflective surfaces, such as the shoulders. Binaural transfer functions may therefore be personalized for an optimal binaural experience. For example, binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized by convolving each sound source with the pair of e.g., Head Related Impulse Responses (HRIRs) that correspond to the position of the sound source.

[0235] A well-known method to determine binaural transfer functions is binaural recording. It is a method of recording sound that uses a dedicated microphone arrangement and is intended for replay using headphones. The recording is made by either placing microphones in the ear canal of a subject or using a dummy head with built-in microphones, a bust that includes pinnae (outer ears). The use of such dummy head including pinnae provides a very similar spatial impression as if the person listening to the recordings was physically present during the recording.

[0236] By measuring e.g., the responses from a sound source at a specific location in 2D or 3D space to microphones placed in or near human ears, the appropriate binaural filters can be determined. Based on such measurements, binaural filters reflecting the acoustic transfer functions to the user’s ears can be generated. The binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized e.g., by convolving each sound source with the pair of measured impulse responses for a desired position of the sound source. In order to create the illusion that a sound source is moving around the listener, a large number of binaural filters is typically required with a certain spatial resolution, e.g., 10 degrees.
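
The convolution of sound sources with pairs of measured impulse responses can be sketched as follows. This is a simplified illustration only: the helper names `convolve` and `render_scene` are hypothetical, the impulse responses are made-up short lists rather than measured HRIRs, and a direct time domain convolution is assumed where a practical system may prefer a frequency domain implementation.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_scene(sources):
    """Sum the binaural contributions of (signal, hrir_left, hrir_right) tuples."""
    ears = [[], []]  # accumulated left-ear and right-ear signals
    for signal, hrir_left, hrir_right in sources:
        for ear, hrir in zip(ears, (hrir_left, hrir_right)):
            rendered = convolve(signal, hrir)
            # Grow the accumulator if this contribution is longer.
            ear.extend([0.0] * (len(rendered) - len(ear)))
            for n, value in enumerate(rendered):
                ear[n] += value
    return ears[0], ears[1]


# Two hypothetical point sources, each with its own impulse response pair.
left_ear, right_ear = render_scene([
    ([1.0], [1.0, 0.5], [0.5, 0.25]),
    ([1.0], [0.25], [1.0]),
])
```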

[0237] The head related binaural transfer functions may be represented e.g., as Head Related Impulse Responses (HRIRs), or equivalently as Head Related Transfer Functions (HRTFs), or as Binaural Room Impulse Responses (BRIRs). The (e.g., estimated or assumed) transfer function from a given position to the listener’s ears (or eardrums) may for example be represented in the frequency domain in which case it is typically referred to as an HRTF or BRTF, or in the time domain in which case it is typically referred to as an HRIR or BRIR. In some scenarios, the head related binaural transfer functions are determined to include aspects or properties of the acoustic environment and specifically of the environment in which the measurements are made, whereas in other examples only the user characteristics are considered. Examples of the first type of functions are the BRIRs and BRTFs.

[0240] In the example, the audio render apparatus comprises a store 305 which stores directional transfer functions for different directions. The directional transfer function for a given direction represents the mapping of a mono audio signal to stereo channels such that the mono audio signal is positioned in the given direction in a stereo image of the stereo channels. Thus, applying the directional transfer function for a given direction to the mono downmix audio signal may generate a stereo signal representing the mono downmix audio signal as an audio source positioned in the given direction. The mapping may in some cases be a time domain mapping (such as a gain, filter or other transfer function) or may in many cases be a frequency domain mapping, such as a set of parameter values / scale values (typically complex values) for different subbands. In the latter case, a frequency domain intermediate stereo signal may be generated by for each subband multiplying the subband sample of the mono downmix audio signal with respectively a complex value for that subband for a first channel of the intermediate stereo signal and with a complex value for that subband for a second channel of the intermediate stereo signal.

[0241] For example, in examples where a panning is performed in the horizontal 2D plane, the store 305 may comprise panning parameters for different directions. For example, panning parameters for azimuth angles in a 0-360° interval may be provided for each 1° angle increment. The first renderer 301 may be coupled to the store 305 and be arranged to extract the directional transfer function for the rendering direction and then proceed to perform the rendering using the extracted directional transfer function. The (potentially gain compensated) mono downmix audio signal may accordingly be rendered such that it is positioned / perceived in the stereo image to arrive from the rendering position.

[0242] It will be appreciated that the store 305 may not have a directional transfer function stored for the desired rendering direction. In such cases, the first renderer 301 may be arranged to retrieve the nearest directional transfer function from the store 305 and use this for rendering. In such cases, the rendering direction may be considered to correspond to the direction for the retrieved directional transfer function, i.e. the rendered direction may be a quantized value y’ of the desired rendering direction determined by the direction determining circuit 303.

[0243] In other embodiments, the first renderer 301 may be arranged to estimate a desired directional transfer function for a desired rendering direction by interpolating between two directional transfer functions from the store 305 corresponding to the two rendering angles nearest to the desired rendering direction determined by the direction determining circuit 303.
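
The two retrieval strategies, i.e. using the nearest stored direction or interpolating between the two nearest stored directions, may be sketched as below. The class `DirectionalStore` is a hypothetical stand-in for the store 305; linear interpolation of the complex weights is an assumption made for brevity (interpolating magnitude and phase separately is another option), and wrap-around at 360° is not handled.

```python
import bisect


class DirectionalStore:
    """Toy stand-in for the store: azimuth in degrees -> (left, right) weights."""

    def __init__(self, table):
        self.table = dict(table)
        self.angles = sorted(self.table)

    def nearest(self, angle):
        """Return the stored (quantized) direction closest to `angle` and its weights."""
        best = min(self.angles, key=lambda a: abs(a - angle))
        return best, self.table[best]

    def interpolated(self, angle):
        """Linearly interpolate the weights of the two neighbouring directions."""
        hi = bisect.bisect_left(self.angles, angle)
        if hi == 0:
            return self.table[self.angles[0]]
        if hi == len(self.angles):
            return self.table[self.angles[-1]]
        a0, a1 = self.angles[hi - 1], self.angles[hi]
        t = (angle - a0) / (a1 - a0)
        (l0, r0), (l1, r1) = self.table[a0], self.table[a1]
        left = [(1 - t) * x + t * y for x, y in zip(l0, l1)]
        right = [(1 - t) * x + t * y for x, y in zip(r0, r1)]
        return left, right


# Hypothetical single-subband weights stored for 0 and 10 degrees.
store = DirectionalStore({0: ([1 + 0j], [0j]), 10: ([0j], [1 + 0j])})
quantized_direction, weights = store.nearest(7)
left_weights, right_weights = store.interpolated(5)
```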

[0244] In most embodiments, the first renderer 301 is as mentioned arranged to perform a binaural rendering and the directional transfer functions stored in the store 305 are binaural transfer functions. Thus, the store may store data describing binaural transfer functions for different directions. The binaural transfer functions may for example be HRTFs, BRIRs, or HRIRs. The store 305 may specifically store frequency subband complex values for each channel for each frequency subband for a range of different frequencies. The first renderer 301 may thus perform the binaural rendering by multiplying the subband samples of the mono downmix audio signal with the corresponding subband coefficients / complex values of the selected binaural transfer function to generate subband sample values of the intermediate binaural stereo signal.

[0247] It will be appreciated that in many embodiments, the directional transfer functions may be stored as a plurality of functions linked with different directions. For example, the store 305 may be a look-up table which can receive the rendering direction as an index and provide a set of values of the directional transfer function for that direction. The directional transfer function may for example be represented by individual subband values / coefficients, or may in other embodiments be represented by e.g. parameter values defining the directional transfer function operation (e.g. coefficients for the transfer function), a mathematical description / function from which suitable values of the transfer function can be generated, etc.

[0248] Thus, the audio render apparatus comprises a processing path which generates an intermediate stereo signal comprising the mono downmix audio signal represented as an audio source at a specific position in the spatial image of the first intermediate stereo signal. The mono downmix audio signal may typically be represented as a point audio source at the given direction. The rendering may be adaptive with the direction being given by e.g. the spatial parameters and thus may be dynamically adapted to reflect the characteristics of the stereo signal.

[0249] In addition, the audio render apparatus comprises a second processing path which generates a second intermediate stereo signal.

[0250] The first residual signal is fed to a second renderer 307 which is arranged to perform a second rendering being a rendering of the first residual signal (and in many cases the second residual signal) to generate a second intermediate stereo signal (and typically a third intermediate stereo signal). However, in contrast to the first rendering process, the second rendering process is typically a predetermined rendering which is not dependent on the spatial parameters, and which typically does not depend on properties of the stereo signal. The second rendering may typically be a diffuse rendering seeking to generate the second intermediate stereo signal to provide a perception of a more diffuse and spatially less definite audio source. The second rendering is specifically a predetermined rendering employing a predetermined mapping of the residual signal to channel signals of the second intermediate stereo signal.

[0251] As a specific example, the second rendering may generate the second intermediate stereo signal by simply mapping the first residual signal to two phase inverse signals, i.e. the second intermediate stereo signal may be generated with the first decorrelated mono downmix audio signal being mapped to both channels but with a 180° phase offset between them (the first residual signal may specifically be inverted for one of the channels). For example, in some embodiments, the first residual signal may be mapped to the right and left signals of the second intermediate stereo signal but with the mapping being 180° out of phase for the two channels of the second intermediate stereo signal.
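
The phase inverse mapping may be sketched as follows. The function name `render_phase_inverse` is hypothetical, and which of the two channels receives the inverted copy is an arbitrary choice made here for illustration.

```python
def render_phase_inverse(residual):
    """Map a residual signal to two channels that are 180 degrees out of phase."""
    left = list(residual)
    right = [-x for x in residual]  # inverted copy for the other channel
    return left, right


# Hypothetical residual samples (time domain or subband values both work).
second_left, second_right = render_phase_inverse([0.5, -0.25, 0.125])
```

A property of this mapping is that the two channels cancel when summed, so the residual contributes to the perceived width but not to a mono sum of the output.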

[0254] The first renderer 301 and second renderer 307 are coupled to a combiner 309 which is arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate an output stereo signal. In many embodiments, the combiner 309 may be arranged to combine / sum the samples / values of the individual channels of the first and second intermediate stereo signals to generate the samples / values of the output stereo signal. In many cases, the combination may be performed by combining / summing time domain values of the intermediate stereo signals. In other embodiments, the combination may be performed in the frequency domain by combining / summing subband values of the intermediate stereo signals.

[0255] In many embodiments, the combination of the intermediate stereo signals may be by a (possibly weighted) combination / summation of corresponding channel signals for the first intermediate stereo signal and the second intermediate stereo signal.

[0256] The audio render apparatus accordingly generates an output stereo signal which is the combination of a directional rendering putting an audio source at a desired position as determined from the received spatial parameters, and of a predetermined rendering providing a more diffuse and decorrelated perception of the corresponding audio source. The approach provides two parallel rendering processes / paths for the mono downmix audio signal with the rendered results being combined to generate the output stereo signal.

[0257] In addition to the described flexible and adaptable generation of the output signal to provide an output stereo signal that includes both a directionally rendered component and a more diffuse / predetermined rendered component, the audio render apparatus may in some embodiments adapt / control the relative level between these components, e.g. by adapting weights of the combination.
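
A weighted combination of the intermediate stereo signals, with the weights controlling the relative level of the directional and diffuse components, could look like the sketch below. The function name `combine` and the example weights are hypothetical; the sketch assumes that all intermediate signals have the same length and domain (time domain samples or subband values).

```python
def combine(stereo_signals, weights=None):
    """Weighted sum of (left, right) pairs, channel by channel."""
    if weights is None:
        weights = [1.0] * len(stereo_signals)
    length = len(stereo_signals[0][0])
    out_left = [0.0] * length
    out_right = [0.0] * length
    for (left, right), weight in zip(stereo_signals, weights):
        if len(left) != length or len(right) != length:
            raise ValueError("all intermediate signals must have equal length")
        for n in range(length):
            out_left[n] += weight * left[n]
            out_right[n] += weight * right[n]
    return out_left, out_right


# Hypothetical directional and diffuse intermediate stereo signals.
directional = ([1.0, 0.5], [0.5, 0.25])
diffuse = ([0.5, 0.5], [-0.5, -0.5])
# The diffuse component is attenuated by a factor 0.5 relative to the directional one.
out_left, out_right = combine([directional, diffuse], weights=[1.0, 0.5])
```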

[0258] In many cases, the second renderer 307 may be arranged to render a plurality of residual signals. For example, the described rendering for a single residual signal may be repeated for each of a plurality of residual signals to generate an intermediate stereo signal for each residual signal. These intermediate stereo signals may then be combined into the second stereo signal which subsequently (or as part of the same operation) may be combined into the output stereo signal.

[0259] The renderer may accordingly perform a third rendering function which generates a third intermediate stereo signal by rendering of the second residual signal. The third intermediate signal may then be combined with the first intermediate stereo signal and the second intermediate stereo signal (e.g. in one or multiple steps).

[0260] The audio apparatus is accordingly arranged to generate an output stereo signal, and often an output binaural stereo signal from the received mono downmix audio signal and spatial parameters. The audio apparatus specifically implements two different rendering paths with one being a directional (binaural) rendering of a directional (e.g. a dominant) signal component while the other is a predetermined rendering / mapping of a decorrelated audio signal generated from the mono downmix audio signal. The rendering of the output stereo signal is not a conventional adaptive upmixing of the received and decorrelated mono signals, and is specifically not a conventional 2x2 matrix upmixing of the mono signal and a decorrelated signal, but rather is a direct generation of a stereo signal by parallel processing of respectively the mono downmix audio signal and one or more residual signals, with the former rendering being directional dependent on the spatial parameters and the latter rendering being a predetermined rendering.

[0263] The processing seeks to render the mono downmix audio signal as a direct / dominant / directional component using a direct rendering with a direction that is e.g. given by the spatial parameters. The rendering employs a directionally dependent transfer function to the left and right stereo output signal for that purpose. The approach further seeks to render a residual / remaining signal component as a more diffuse signal, and specifically it uses a predetermined rendering where a decorrelated signal is mapped directly to the channels of the output binaural signal using a transfer function. The mapping is predetermined and may specifically be such that it allows a more diffuse and non-directional perception of this signal component. The rendering process thus uses fundamentally different approaches to provide different signal components in the output binaural signal.

[0264] In some embodiments, the direction determining circuit 303 may be arranged to determine the rendering direction y’ from spatial parameters that are indicative of interchannel relationships between the channels of the input stereo signal.

[0265] The audio render apparatus specifically comprises a spatial parameter circuit 121 which is arranged to determine spatial parameters that are indicative of / reflect relative properties of the channel signals of the stereo audio signal. In particular, the spatial parameters may be indicative of at least one of relative intensities / levels of the stereo channels, relative (frequency domain) phases of the stereo channels, a relative time difference between the channels, and / or a correlation between the channels. Specifically, the spatial parameters may include one or more of an inter-channel intensity difference, inter-channel level difference, inter-channel time difference, inter-channel phase difference, and / or interchannel correlation.

[0266] The spatial parameter circuit 121 may specifically provide sets of frequency subband spatial parameters for the stereo signal where the sets of frequency subband spatial parameters are indicative of relative signal properties of the channels of the stereo signal. The frequency subband spatial parameters are provided for individual subbands of the stereo signal.

[0267] The spatial parameters may be spatial parameters as used for encoding a stereo signal using a Parametric Stereo (PS) encoding of the stereo signal.

[0268] A classical PS downmix is calculated as:

[0269] $$m = c\,(l + r)$$

[0271] where the parameter c is chosen such that the power of the stereo signal is preserved in the downmix, the power being defined using the 2-norm:

[0272] $$\|m\|^2 = \|l\|^2 + \|r\|^2$$

[0273] and thus e.g.:

[0274] $$c = \sqrt{\frac{\|l\|^2 + \|r\|^2}{\|l + r\|^2}}$$
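
The power-preserving scaling can be checked numerically with the following sketch. It assumes a real, non-negative scale factor c obtained by requiring the power of c(l + r) to equal the summed channel powers; the function name `ps_downmix` and the handling of fully out-of-phase input (returning silence) are illustrative assumptions.

```python
import math


def ps_downmix(left, right):
    """Return (c, m) with m = c * (l + r) and ||m||^2 = ||l||^2 + ||r||^2."""
    power_channels = sum(abs(x) ** 2 for x in left) + sum(abs(x) ** 2 for x in right)
    power_sum_signal = sum(abs(a + b) ** 2 for a, b in zip(left, right))
    if power_sum_signal == 0:
        # Assumption: l = -r leaves nothing to scale, so return silence.
        return 0.0, [0.0] * len(left)
    c = math.sqrt(power_channels / power_sum_signal)
    return c, [c * (a + b) for a, b in zip(left, right)]


# Identical channels: the plain sum would double the amplitude, c compensates.
c, m = ps_downmix([1.0, 1.0], [1.0, 1.0])
```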

[0275] The PS parameters are specifically an Inter-channel Intensity Difference IID, an Interchannel Correlation ICC, and in some cases an Inter-channel Phase Difference IPD parameter. These may specifically be defined / determined as:

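[0276] $$\mathrm{IID} = \frac{\|l\|^2}{\|r\|^2}$$

[0277] $$\mathrm{ICC} = \frac{|\langle l, r \rangle|}{\|l\|\,\|r\|}$$

[0279] $$\mathrm{IPD} = \arg \langle l, r \rangle$$

[0280] where the complex-valued inner product is defined as:

[0281] $$\langle l, r \rangle = \sum_{\forall i} l[i]\, r^*[i]$$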

[0283] The spatial parameters are typically provided for specific time frequency tiles, and thus specifically each parameter value is generated / provided for a given frequency subband and for a given time segment.

[0284] The spatial parameter circuit 121 may determine spatial parameters that are indicative of relative properties of the channels (channel signals) of the stereo signal.

[0285] In many embodiments, the spatial parameter circuit 121 may receive the input stereo signal and process / analyze this to generate the spatial parameters. Specifically, the spatial parameter circuit 121 may calculate the IID, ICC, and IPD values in accordance with the formulas indicated above. Thus, in many embodiments, the audio render apparatus may simply receive a stereo signal and therefrom generate spatial parameters. In such a scenario, e.g. instead of the IID, ICC and IPD, the following inner products may alternatively or additionally be used as spatial parameters:

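[0288] $$\langle l, l \rangle = \sum_{\forall i} l[i]\, l^*[i]$$

[0289] $$\langle r, r \rangle = \sum_{\forall i} r[i]\, r^*[i]$$

[0291] $$\langle l, r \rangle = \sum_{\forall i} l[i]\, r^*[i]$$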

[0292] In some embodiments, the spatial parameter circuit 121 may be arranged to generate the spatial parameters by extracting them from a received signal. For example, the receiver 101 may receive a data signal comprising both the input stereo signal and spatial parameter data for the input stereo signal. The spatial parameter circuit 121 may in this case simply determine the spatial parameters by extracting the spatial parameter values from the received data signal.

[0293] In many embodiments, the rendering direction y’ may be determined from spatial parameters such as specifically the following spatial parameters (per subband, where in the following k denotes the frequency bin, with a subband potentially including more than one frequency bin):

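[0294] $$\mathrm{IID}[b] = \frac{\sum_{k \in b} x_l[k]\, x_l^*[k]}{\sum_{k \in b} x_r[k]\, x_r^*[k]} \qquad (\mathrm{IID}[b] > 0)$$

[0297] $$\mathrm{IPD}[b] = \angle\Big(\sum_{k \in b} x_l[k]\, x_r^*[k]\Big)$$

[0298] $$\mathrm{ICC}[b] = \frac{\Big|\sum_{k \in b} x_l[k]\, x_r^*[k]\Big|}{\sqrt{\Big(\sum_{k \in b} x_l[k]\, x_l^*[k]\Big)\Big(\sum_{k \in b} x_r[k]\, x_r^*[k]\Big)}} \qquad (0 \le \mathrm{ICC}[b] \le 1)$$

[0301] where for coherent left-right channel components ICC[b] → 1, while for uncorrelated channels ICC[b] → 0.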
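
A per-subband evaluation of these parameters from complex frequency bins may be sketched as follows. The function name `band_parameters` and the input layout (one list of complex bins per channel plus the bin indices belonging to subband b) are assumptions for illustration; the sketch presumes non-zero energy in both channels of the subband and does not guard against division by zero.

```python
import cmath
import math


def band_parameters(x_left, x_right, band_bins):
    """Return (IID, ICC, IPD) for the frequency bins k listed in `band_bins`."""
    p_ll = sum(abs(x_left[k]) ** 2 for k in band_bins)
    p_rr = sum(abs(x_right[k]) ** 2 for k in band_bins)
    # Complex cross term: sum over k of x_l[k] * conj(x_r[k]).
    p_lr = sum(x_left[k] * x_right[k].conjugate() for k in band_bins)
    iid = p_ll / p_rr                        # linear power ratio, > 0
    icc = abs(p_lr) / math.sqrt(p_ll * p_rr)  # between 0 and 1
    ipd = cmath.phase(p_lr)                  # radians, in (-pi, pi]
    return iid, icc, ipd


# Hypothetical bins: the left channel is the right channel scaled by two.
x_l = [2 + 0j, 2j]
x_r = [1 + 0j, 1j]
iid, icc, ipd = band_parameters(x_l, x_r, band_bins=[0, 1])
```

For this fully coherent example the ICC evaluates to one, in line with the limit behaviour stated above.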

[0302] In the case of an ITD (interchannel time difference) being used, this can be estimated per band using cross-correlation methods using the maximum delay in metadata to limit the bounds wherein the peak search is performed.
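
A bounded cross-correlation peak search of this kind may be sketched as below. The function name `estimate_itd` is hypothetical, the inputs are assumed to be real-valued time domain samples of one band, and the sign convention (a positive lag meaning that the right channel is delayed relative to the left) is a choice made for this example.

```python
def estimate_itd(left, right, max_lag):
    """Return the lag in samples, within [-max_lag, max_lag], with peak cross-correlation."""
    best_lag, best_value = 0, float("-inf")
    # Visit lag 0 first so that ties resolve to the smallest delay.
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        value = sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Hypothetical band signals: the right channel lags the left by two samples.
itd = estimate_itd([0, 1, 0, 0, 0], [0, 0, 0, 1, 0], max_lag=3)
```

Here `max_lag` plays the role of the maximum delay taken from the metadata, limiting the range in which the peak is searched.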

[0303] In such a case, the rendering direction y’ may be determined as an angle of a principal component of the received signal, henceforth also referred to as a (per frequency) orientation direction y:

[0305] $$y = \tan^{-1}\left(\frac{1 - \mathrm{IID} + \sqrt{4\,\mathrm{IID}\,\mathrm{ICC}^2 + (\mathrm{IID} - 1)^2}}{2\,\mathrm{ICC}\,\sqrt{\mathrm{IID}}}\right)$$

[0309] Thus, in many embodiments, the rendering direction y’ may be determined as or from (e.g. using a predetermined mapping) a direction y of a principal component of the stereo signal.
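
The principal component angle can be evaluated from the IID and ICC as in the sketch below. The function name `orientation_angle` is hypothetical, the IID is assumed to be a linear power ratio, and `atan2` is used instead of a plain arctangent of the quotient purely so that the case ICC = 0 does not divide by zero; this substitution is an implementation choice of the example, not something stated in the text.

```python
import math


def orientation_angle(iid, icc):
    """Orientation direction y in degrees from a linear IID and an ICC in [0, 1]."""
    numerator = 1.0 - iid + math.sqrt(4.0 * iid * icc ** 2 + (iid - 1.0) ** 2)
    denominator = 2.0 * icc * math.sqrt(iid)
    return math.degrees(math.atan2(numerator, denominator))


centre = orientation_angle(1.0, 1.0)       # equal power, fully coherent
left_heavy = orientation_angle(4.0, 1.0)   # left amplitude twice the right
right_heavy = orientation_angle(0.25, 1.0)
```

With equal channel powers and full coherence the angle is 45°, and it moves towards 0° or 90° as one of the channels dominates.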

[0310] The spatial parameters may as previously mentioned often be calculated by the audio render apparatus for a given input stereo signal. However, in other embodiments, the spatial parameters may be received together with the input stereo signal, e.g. as part of a single data signal / bitstream.

[0311] The direction determining circuit 303 may accordingly be arranged to determine the direction from the spatial parameters. The spatial parameters provide information on the relationship between the channels of the stereo signal that are downmixed and as such provide information of the position / orientation of the audio, and specifically of a dominant signal component in the stereo image of the stereo signal. For example, the spatial parameters may provide information of the position of the dominant signal component in the stereo signal, and specifically it provides information of an orientation angle for the dominant signal.

[0312] The direction determining circuit 303 may specifically determine the rendering direction y’ from the spatial parameters. The rendering direction will be determined on a frequency tile basis, and specifically in frequency subbands and time segments matching those for which the spatial parameters are provided.

[0313] Different approaches for determining the rendering direction from the spatial parameters may be used in different embodiments. In particular, the signal model as indicated above is based on directional component x being at a direction y in the stereo image of the stereo signal, henceforth also referred to as the orientation direction y. In many embodiments, the direction determining circuit 303 may determine the orientation direction y and then determine the (desired) rendering direction y’ from the orientation direction y. The orientation direction y is accordingly an estimation of a point source direction in a stereo image. Indeed, in some embodiments or scenarios, the rendering direction y’ may simply be set equal to the orientation direction y.

[0314] The determination of the orientation direction y may be based on the signal model indicated above. The spatial parameters provide information on the relative properties of the channel signals of the stereo signal and specifically they may provide information on both the interchannel levels / intensity differences as well as on the interchannel correlation. Accordingly, the spatial parameters can be considered to provide information on the directional signal component x and on the position of this in the stereo image of the stereo signal, i.e. the spatial parameters provide information on the orientation direction y allowing this to be determined from the provided parameter values.

[0315] The direction determining circuit 303 may determine the orientation direction y, as a direction to a directional signal component in a stereo image of the stereo signal, from the spatial parameters, and may map this to a direction in a stereo image of the output stereo signal. The directional signal component may be a dominant signal component. The direction determining circuit 303 may be arranged to determine the orientation direction y as a direction of a dominant sound source in the stereo signal where the direction of the dominant sound source is represented by the spatial parameters.

[0318] The directional signal component may specifically be a signal component (estimated / determined) to originate from a point source. Specifically, the direction determining circuit 303 may be arranged to determine the orientation direction y as a direction for which a single point source audio source will result in spatial parameter values matching the spatial parameters of the data signal.

[0319] In some embodiments, the direction determining circuit may as mentioned be arranged to determine the first direction in line with:

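[0320] $$y = \arctan\left(\frac{1 - \mathrm{IID} + \sqrt{4\,\mathrm{IID}\,\mathrm{ICC}^2 + (\mathrm{IID} - 1)^2}}{2\,\mathrm{ICC}\,\sqrt{\mathrm{IID}}}\right)$$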

[0322] where IID is an interchannel intensity difference and ICC is an inter-channel cross-correlation, and specifically with these given by the equations provided above in connection with the equations for determining gains.

[0323] The orientation direction y may, as previously mentioned, in some embodiments be used directly as the rendering direction y’, i.e. y’ = y. However, in many embodiments, a mapping may be included which for at least some values of the orientation direction y may result in a different rendering direction y’.

[0324] Thus, in many embodiments, the direction determining circuit 303 may be arranged to apply a mapping function to the orientation direction y to determine the rendering direction y’.

[0325] For example, the mapping may map the position in the stereo image of the original stereo signal as represented by the orientation direction y to a desired position in the stereo image of the output stereo signal as represented by the rendering direction y’. In many cases, where the output stereo signal is a binaural signal, the mapping may include a consideration / determination of a distance to the audio sources. For example, a range of the orientation direction y in the interval of [0,180°] may be mapped to a location between two virtual stereo speakers in the audio scene created by the binaural rendering. Such speakers may for example be positioned at angles of -30° and +30° relative to a center direction for the binaural signal. Thus, in such situations, the direction determining circuit 303 may include a mapping between an orientation direction y in the range of [0,180°] to a rendering direction y’ in the range of [-30°, +30°].
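
One possible mapping between the two ranges is a simple linear one, sketched below. The text does not state that the mapping is linear, nor which end of the input range corresponds to which virtual speaker; both are assumptions of this example, as are the function name `map_to_rendering_direction` and the clamping of out-of-range inputs.

```python
def map_to_rendering_direction(y_deg, in_range=(0.0, 180.0), out_range=(-30.0, 30.0)):
    """Linearly map an orientation direction y to a rendering direction y'."""
    lo_in, hi_in = in_range
    lo_out, hi_out = out_range
    y_clamped = min(max(y_deg, lo_in), hi_in)  # keep the input inside its range
    t = (y_clamped - lo_in) / (hi_in - lo_in)
    return lo_out + t * (hi_out - lo_out)


centre_direction = map_to_rendering_direction(90.0)
```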

[0326] Thus, in some embodiments, the directional component (the mono downmix audio signal) may be rendered to a virtual angle in the range of a virtual loudspeaker angle range generated by a binaural rendering. The rendered directional component may be combined with a diffuse rendering of the residual signal(s).

[0329] In many embodiments, the direction determining circuit 303 may be arranged to map an orientation direction y representing an angle in one interval / range to a rendering direction y’ representing an angle in a different interval / range.

[0330] In the previous examples, the rendering has been based on one intermediate stereo signal representing the residual signal component. However, in many embodiments, there may be two (or possibly more) parallel paths for the rendering of the residual / non-directional signal components.

[0331] The predetermined rendering of the second renderer 307 may as previously mentioned simply be achieved by rendering the corresponding decorrelated signal in one channel of the corresponding intermediate stereo signal, and with no signal being included in the other channel. For example, the second residual signal may be rendered in the left channel of the second intermediate stereo signal and the first residual signal may be rendered in the right channel of the third intermediate stereo signal.

[0332] In some embodiments where binaural processing is used, each of the residual signals may be rendered from a specific position, such as each decorrelated signal being rendered from a different virtual position, such as for example from different virtual (loudspeaker) positions.

[0333] In some embodiments, the rendering for a residual signal, such as the rendering of the first residual signal, may be to position the signal at a specific position.

[0334] In many embodiments, the rendering of a residual signal may be performed by the second renderer 307 retrieving a set of directional transfer functions from the store 305 and rendering the residual signal using the retrieved transfer function(s).

[0335] In many embodiments, the renderer 301, 307 may be arranged to extract a directional transfer function for a single predetermined direction and to render the residual signal using this directional transfer function. Accordingly, the residual signal may be rendered from one predetermined direction / position, such as a direction / position corresponding to a virtual speaker position.

[0336] An example of subband parametric rendering may e.g. result in left and right signals:

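[0337] $$l = g_x \cdot m \cdot G_l[f(y)] \cdot e^{j\varphi_l[f(y)]} + g_n \cdot H_1\{m\} \cdot G_l[\theta_l] \cdot e^{j\varphi_l[\theta_l]} + g_n \cdot H_2\{m\} \cdot G_l[\theta_r] \cdot e^{j\varphi_l[\theta_r]}$$

[0339] $$r = g_x \cdot m \cdot G_r[f(y)] \cdot e^{j\varphi_r[f(y)]} + g_n \cdot H_1\{m\} \cdot G_r[\theta_l] \cdot e^{j\varphi_r[\theta_l]} + g_n \cdot H_2\{m\} \cdot G_r[\theta_r] \cdot e^{j\varphi_r[\theta_r]}$$

[0340] where $G_l$, $G_r$, $\varphi_l$, $\varphi_r$ form the parametric HRIRs, $f(y)$ is a mapping function converting the estimated angles (orientation direction y) to HRIR direction angles, $\theta_l$ and $\theta_r$ are two pre-determined angles and $H_1\{\cdot\}$ and $H_2\{\cdot\}$ are two mutually independent optional decorrelators (if no decorrelator is present, the responses $H_1\{\cdot\}$ and $H_2\{\cdot\}$ may simply be considered unity responses).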
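
For a single subband sample, a parametric rendering of this form may be sketched as below. The sketch assumes that each parametric HRIR is given as a magnitude and a phase per ear and per direction, that the same gain is applied to both diffuse terms, and that the outputs of the optional decorrelators are supplied by the caller. The names `render_subband`, `hrir_lookup`, `theta_l` and `theta_r` are hypothetical notation for this example.

```python
import cmath


def render_subband(m, h1_m, h2_m, hrir_lookup, direction, theta_l, theta_r,
                   g_x=1.0, g_n=1.0):
    """Return (l, r) for one subband sample.

    m           : mono downmix subband sample
    h1_m, h2_m  : outputs of the two (optional) decorrelators for this sample
    hrir_lookup : callable, angle -> ((G_l, phi_l), (G_r, phi_r))
    direction   : mapped HRIR direction angle f(y) of the directional component
    """
    terms = ((g_x, m, direction), (g_n, h1_m, theta_l), (g_n, h2_m, theta_r))
    outputs = []
    for ear in (0, 1):  # 0 = left ear, 1 = right ear
        total = 0j
        for gain, sample, angle in terms:
            magnitude, phase = hrir_lookup(angle)[ear]
            total += gain * sample * magnitude * cmath.exp(1j * phase)
        outputs.append(total)
    return outputs[0], outputs[1]


def toy_hrir(angle):
    """Hypothetical direction-independent response used only to run the example."""
    return (1.0, 0.0), (0.5, 0.0)


# Without decorrelators the same sample m is passed for both diffuse terms.
l, r = render_subband(1 + 0j, 1 + 0j, 1 + 0j, toy_hrir, 0.0, -30.0, 30.0, g_n=0.5)
```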

[0342] In some embodiments, the second renderer 307 may retrieve directional transfer functions for a plurality of predetermined directions and it may use multiple directional transfer functions in performing the predetermined rendering. For example, different directional transfer functions may be used for different frequency subbands. This may provide a more diffuse perception with the audio being generated such that it is perceived from different directions for different subbands, thereby resulting in a perception of a more distributed and spread audio source.

[0343] Such approaches may be used both in embodiments in which a single residual signal is generated and rendered, or indeed in cases where multiple residual signals are generated and rendered. In the latter case, the sets of predetermined directions for the different residual signals are different in order to enhance the perceived diffuseness of the non-directional signal component.

[0344] The second renderer 307 may generate the second intermediate stereo signal using a first set of directional transfer functions retrieved from the store 305 for a first set of predetermined directions, and may generate the third intermediate stereo signal using a second set of directional transfer functions retrieved from the store 305 for a second set of predetermined directions, where the first set of predetermined directions is different from the second set of predetermined directions.

[0345] In particular, the directional transfer functions may be binaural transfer functions and the second renderer 307 may be arranged to perform binaural rendering using binaural impulse response values for a first set of predetermined directions and may be arranged to perform binaural rendering using binaural impulse response values for a second set of predetermined directions, where the first set of predetermined directions is different from the second set of predetermined directions.

[0346] In many cases, the use of multiple directional transfer functions may be achieved by using directional transfer functions for different directions in different frequency subbands.

[0347] Thus, instead of rendering the residual / non-directional signals using fixed angles, e.g. mimicking a virtual stereo speaker setup, the residual signals may also be rendered using composite, e.g. pre-calculated HRIRs for many sources / directions, e.g. spread over a (part of a) circle, or (part of) a sphere.

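[0348] $l = g_x \cdot m \cdot G_l[f(y)] \cdot e^{j\phi_l[f(y)]} + g_n \cdot H_1\{m\} \cdot G_{l,comp} \cdot e^{j\phi_{l,comp}} + g_n \cdot H_2\{m\} \cdot G_{l,comp} \cdot e^{j\phi_{l,comp}}$

[0350] $r = g_x \cdot m \cdot G_r[f(y)] \cdot e^{j\phi_r[f(y)]} + g_n \cdot H_1\{m\} \cdot G_{r,comp} \cdot e^{j\phi_{r,comp}} + g_n \cdot H_2\{m\} \cdot G_{r,comp} \cdot e^{j\phi_{r,comp}}$

[0351] where e.g.: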

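[0352] $G_{l,comp} = g_{norm} \left| \sum_{p \in B_l} G_l[p] \cdot e^{j\phi_l[p]} \right|$

[0354] $\phi_{l,comp} = \angle \left\{ \sum_{p \in B_l} G_l[p] \cdot e^{j\phi_l[p]} \right\}$

[0356] $G_{r,comp} = g_{norm} \left| \sum_{p \in B_r} G_r[p] \cdot e^{j\phi_r[p]} \right|$

[0358] $\phi_{r,comp} = \angle \left\{ \sum_{p \in B_r} G_r[p] \cdot e^{j\phi_r[p]} \right\}$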

[0361] with $B_l$ being a set of angles at which the left residual signal is to be rendered, $B_r$ a set of angles at which the right residual signal is to be rendered, and $g_{norm}$ a normalisation factor.
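As a minimal sketch of how such a composite response might be pre-calculated for one ear and one subband, the following Python fragment sums complex parametric HRIR values over a set of angles and splits the result into a gain and a phase. It assumes that the composite is the normalised complex sum over the set; the function name `composite_response`, the dictionary layout and the default normalisation factor are illustrative assumptions and are not taken from the description.

```python
import cmath
import math

def composite_response(gains, phases, angles, g_norm=None):
    """Combine parametric HRIR values for several angles into one gain/phase pair.

    gains, phases: dicts mapping an angle (degrees) to the HRIR magnitude and
    phase (radians) for one ear and one subband.
    angles: the set of angles over which the residual signal is spread.
    """
    if g_norm is None:
        # Assumed normalisation: 1 / sqrt(number of directions).
        g_norm = 1.0 / math.sqrt(len(angles))
    total = sum(gains[p] * cmath.exp(1j * phases[p]) for p in angles)
    # Gain is the scaled magnitude of the complex sum, phase is its argument.
    return g_norm * abs(total), cmath.phase(total)

# Two directions with equal gain and phases 0 and pi/2.
g_comp, phi_comp = composite_response(
    {30: 1.0, 60: 1.0}, {30: 0.0, 60: math.pi / 2}, [30, 60])
```

Because the composite gain and phase depend only on the stored transfer functions and the chosen angle sets, they may be computed once and reused for every time segment.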

[0362] In some embodiments, a single residual signal component may be directly rendered onto left and right channels without any HRIR processing.

[0363] In many embodiments, the direction determining circuit 303 may be arranged to determine the rendering direction y’ from the filter responses of the downmix filters 103, 105 and / or from the residual filters 109, 113.

[0364] In particular, in some embodiments, the renderer 117 is arranged to determine the rendering direction y’ from a timing of a peak of a cross correlation between the filter response of the first downmix filter 103 and the filter response of the second downmix filter 105. Equivalently (due to the predetermined relationships between the filters), in some embodiments, the renderer 117 is arranged to determine the rendering direction y’ from a timing of a peak of a cross correlation between the filter response of the first residual filter 109 and the filter response of the second residual filter 113.

[0365] In particular, the filter responses will typically be adapted to have a peak corresponding to the arrival time of a dominant signal via typically a direct path. The time difference between the peaks may accordingly indicate the time difference between the direct paths to the capture positions for the different stereo channels, and accordingly may be indicative of the direction to the source of the dominant signal.

[0366] As a specific example, the direction of arrival $y_k$ may be determined based on the cross-correlation of $w_l$ and $w_r$:

[0368] $y_k = \cos^{-1}\left(\frac{\tau c}{d}\right)$

[0371] where $\tau$ corresponds to the delay extracted from the peak of the cross-correlation function (and can be positive or negative), c is the speed of sound (e.g., 343 m / s), and d is the microphone spacing (e.g. in meters).
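The estimation described above can be sketched as follows, assuming real-valued time-domain filter responses and the far-field relation between delay and angle. The function names, the sign convention for the lag (positive when the first response peaks later than the second) and the clamping step are illustrative assumptions.

```python
import math

def peak_lag(h_l, h_r):
    """Lag in samples that maximises sum_n h_l[n + lag] * h_r[n]."""
    n_l, n_r = len(h_l), len(h_r)
    best_lag, best_val = 0, -math.inf
    for lag in range(-(n_r - 1), n_l):
        val = sum(h_l[n + lag] * h_r[n]
                  for n in range(n_r) if 0 <= n + lag < n_l)
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def direction_from_lag(lag, fs, d, c=343.0):
    """Angle in radians from the delay tau = lag / fs via arccos(tau * c / d)."""
    x = (lag / fs) * c / d
    # Clamp so that noise or rounding cannot push the argument outside [-1, 1].
    x = max(-1.0, min(1.0, x))
    return math.acos(x)

# Two responses whose direct-path peaks are two samples apart.
h_left = [0.0] * 16
h_left[5] = 1.0
h_right = [0.0] * 16
h_right[3] = 1.0
angle = direction_from_lag(peak_lag(h_left, h_right), fs=48000, d=0.1)
```

A zero lag maps to an angle of 90 degrees, i.e. a source on the axis perpendicular to the line through the two capture positions.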

[0372] Depending on the underlying model, other alternatives can be used for direction of arrival estimation. For example, in another embodiment, a table of directions and corresponding ideal (complex) filter responses for the left and right channels, $w_{0l}$ and $w_{0r}$, that assumes some underlying model (far-field, tangent-pan model) may be used. The Hermitian inner products between the estimated filters $w_l$ and $w_r$ and the ideal filters can be computed per frequency and then added. The maximum result corresponds to the estimated direction $y_k$ of the assumed point audio source.
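A minimal sketch of such a table lookup is given below. The far-field model in `ideal_filters` (equal gains and a pure inter-channel delay), the use of the magnitude of each inner product so that a common phase rotation of the estimated filters does not affect the result, and all names are assumptions made for illustration only.

```python
import cmath
import math

def ideal_filters(angle, freqs, d, c=343.0):
    """Hypothetical far-field model: equal gains, inter-channel delay d*cos(angle)/c."""
    tau = d * math.cos(angle) / c
    g = 1.0 / math.sqrt(2.0)
    w0_l = [g * cmath.exp(-1j * math.pi * f * tau) for f in freqs]
    w0_r = [g * cmath.exp(+1j * math.pi * f * tau) for f in freqs]
    return w0_l, w0_r

def estimate_direction(w_l, w_r, table):
    """Return the table key whose ideal filters best match the estimated filters."""
    best_key, best_score = None, -math.inf
    for key, (w0_l, w0_r) in table.items():
        score = 0.0
        for b in range(len(w_l)):
            # Hermitian inner product of the two-channel weight vectors at frequency b.
            inner = w0_l[b].conjugate() * w_l[b] + w0_r[b].conjugate() * w_r[b]
            score += abs(inner)
        if score > best_score:
            best_key, best_score = key, score
    return best_key

freqs = [250.0 * k for k in range(1, 17)]
table = {deg: ideal_filters(math.radians(deg), freqs, 0.1) for deg in range(0, 181, 30)}
est_l, est_r = ideal_filters(math.radians(60), freqs, 0.1)
best = estimate_direction(est_l, est_r, table)
```

The angular resolution of the estimate is set by the spacing of the table entries, here 30 degrees purely to keep the example small.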

[0373] The detected direction of arrival may be considered the orientation direction y and thus may be used to determine the rendering direction y’, e.g. the estimated direction of arrival may in some embodiments be used directly as the rendering direction y’.

[0374] The downmix weights may in many embodiments be determined to ensure power preservation with the combined power / energy level of the intermediate signals corresponding to the power level of intermediate stereo signal, i.e. such that there is power preservation between the input stereo signal and the output stereo signal.

[0375] Specifically, in many embodiments, the determination of the downmix weights for a subband may be subject to the constraint / requirement that the sum square magnitude of the subband downmix weights is constant, and specifically is always one, i.e.

[0376] $|w_l|^2 + |w_r|^2 = 1$.

[0377] In practice, it might be advantageous to update the weights w[b] only when a directional source is deemed to be active. Since a directional source is considered a coherent point source, a simple activity test can be performed, e.g. by measuring the average coherence value over several subbands and thresholding this value. If the average coherence exceeds a given threshold, then the weights w[b] may be updated.
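The unit-power constraint and the activity test may be sketched as below. The coherence measure (normalised magnitude of the cross-spectrum over one analysis window), the threshold of 0.8 and the function names are assumptions chosen for illustration; the description does not prescribe them.

```python
import math

def subband_coherence(x_l, x_r):
    """Magnitude coherence of two complex subband sequences over one window."""
    cross = sum(a * b.conjugate() for a, b in zip(x_l, x_r))
    p_l = sum(abs(a) ** 2 for a in x_l)
    p_r = sum(abs(b) ** 2 for b in x_r)
    if p_l == 0.0 or p_r == 0.0:
        return 0.0
    return abs(cross) / math.sqrt(p_l * p_r)

def directional_source_active(left_bands, right_bands, threshold=0.8):
    """True if the coherence averaged over the given subbands exceeds the threshold."""
    coh = [subband_coherence(l, r) for l, r in zip(left_bands, right_bands)]
    return sum(coh) / len(coh) > threshold

def normalise_weights(w_l, w_r):
    """Scale a weight pair so that |w_l|^2 + |w_r|^2 = 1."""
    norm = math.sqrt(abs(w_l) ** 2 + abs(w_r) ** 2)
    return w_l / norm, w_r / norm

# A scaled and phase-shifted copy is fully coherent; an alternating sign is not.
left = [[1 + 1j, 2 + 0j, -1j]]
coherent_right = [[0.5j * v for v in left[0]]]
update_allowed = directional_source_active(left, coherent_right)
```

In an adaptive loop, the weight update for a time segment would simply be skipped whenever `directional_source_active` returns false.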

[0378] In many embodiments, the rendering of one (or both / all) residual signal(s) may include a decorrelation of the residual signal. In many embodiments, the first residual signal may be decorrelated and subsequently rendered by the second renderer 307. In many embodiments, the first residual signal may be generated as described above and the second residual signal may be generated by decorrelating the first residual signal. The two residual signals may then be rendered as the more diffuse components. Specifically, the previously described rendering approaches for two (or more) residual signals may be applied to two (or more) residual signals of which one or more is generated by decorrelation.

[0381] In particular, while $D_l[b]$ and $D_r[b]$ as indicated above are assumed to be uncorrelated, the expressions for $D_l[b]$ and $D_r[b]$ indicate that they may not be fully uncorrelated. Accordingly, in some embodiments, a decorrelator 401 may, as illustrated in FIG. 4, be applied to one of the residual signals. In this case, the previously described approach may be used but with $D_r[b] = H\{D_l\}[b]$, where $H\{\cdot\}$ denotes the decorrelator, so that the estimated directional source $S[b]$ and the pair of decorrelated residual signals generated by the decorrelation can be binauralized using HRTFs (and optionally binaural room impulse responses) and rendered. A standard decorrelator may be used, such as those used for decoding a PS signal.

[0384] In some embodiments, a delay may be introduced to the channel signals relative to the downmix and typically the residual weights. For example, as illustrated in FIG. 5, delays 501, 503 may be introduced to the first and second channel signals (and specifically to the subband samples of the first and second channels) before the compensation is applied to generate the residual signals.

[0385] The delays 501, 503 may be used to address causality issues. In particular, using the complex conjugate of the frequency subband / domain downmix weights $w_l$ and $w_r$ in (20) is equivalent to time-reversing the corresponding time-domain filter coefficients, i.e. it is equivalent to a time reversal of the impulse response. For relatively small time differences between the time signals, this may not be a problem for frequency domain / subband processing, but if there is a substantial lag between the signals (corresponding to a large value for an ITD parameter), the complex conjugation of weights and subsequent residual signal calculation may be subject to problems. In order to compensate for such a time reversal and ensure causality, the delays 501, 503 may be introduced. The delays may for example be set to be equal to a processing time segment / interval corresponding to the maximum signal delay between left and right channels. For example, for a communication scenario, this delay can correspond to the distance between the pair of microphones used to capture point source signals, such as local speakers, and the delay would correspond to an exponential given by $e^{-j\omega d/c}$, where d is the max distance and c is the speed of sound (e.g. for air ≈ 343 m / s). Of course, this also assumes that the frame length or time segment taken for the frequency-domain transform is large enough to capture such delays between left and right channels for the same time frame.
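The equivalence between complex conjugation in the frequency domain and time reversal of a real impulse response can be checked numerically. The sketch below uses a plain discrete Fourier transform and an arbitrary example response; it illustrates the general property only and is not the filterbank of the described apparatus.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform of a sequence."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def circular_time_reverse(h):
    """Return h_rev with h_rev[n] = h[(-n) mod N]."""
    n_pts = len(h)
    return [h[(-n) % n_pts] for n in range(n_pts)]

# Causal example response with its energy at n = 2 and n = 3.
h = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
conj_spectrum = [v.conjugate() for v in dft(h)]
reversed_spectrum = dft(circular_time_reverse(h))
max_diff = max(abs(a - b) for a, b in zip(conj_spectrum, reversed_spectrum))
```

In the reversed response the taps sit at n = 5 and n = 6, which in a circular sense are the anti-causal positions n = -3 and n = -2. This is the wrap-around that a delay on the channel signals is intended to absorb.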

[0386] In other embodiments the delays may be set as half the corresponding time-domain filter length, again assuming that half this length is sufficient to cover the expected maximum delay between left and right channels, i.e. $\delta = e^{-j\omega N/(2 f_s)}$, where N is the underlying time-domain filter length and $f_s$ is the sampling frequency. This, however, would also require delaying the subband residual weights by $e^{-j\omega N/(2 f_s)}$ and respectively advancing the subband downmix weights by $e^{j\omega N/(2 f_s)}$. This approach can be advantageous in situations where the subband downmix and residual weights also model early reflections between a directional source and the microphones in a communication setting to further improve the downmix signal to noise ratio.

[0391] In another embodiment of the invention, where the delays 501 and 503 are set to half the length in samples of the time-domain filter, the time-domain filters 207 and 209 may be initialized to a common all-pass / delay component, implemented as an impulse at time-delay corresponding to an integer delay of half the filter length and a scaling of 1 / 2. This allows the adaptive process based on gradient descent to find a solution near this common all-pass term and does not necessarily require the additional introduction of delay processing into the weights 207, 209, or 201, 203. This is advantageous since a common all-pass term can be set beforehand in accordance with the delays 501 and 503 and does not have to be removed by calculating an all-pass and minimum phase component of the total response.
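The initialisation described above amounts to a single scaled impulse at the centre of the filter. A minimal sketch, with the function name chosen here for illustration:

```python
def init_adaptive_filter(length):
    """Impulse of height 1/2 at the integer delay length // 2, zeros elsewhere."""
    h = [0.0] * length
    h[length // 2] = 0.5
    return h

# Both time-domain filters start from the same delayed, scaled impulse.
h_first = init_adaptive_filter(8)
h_second = init_adaptive_filter(8)
```

With this starting point the initial downmix is simply half the sum of the two channel signals, delayed by half the filter length.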

[0392] Therefore, in some embodiments, the receiver is further arranged to receive an indication of a maximum time offset between channels of the stereo signal. For example, the audio render apparatus may receive an indication of a maximum distance (which in some cases may be an actual distance) between the microphones capturing the stereo signal. The indication may for example be received as a user input, or e.g. may be received as metadata which is part of a data signal also comprising the input stereo signal.

[0393] The audio render apparatus may then be arranged to adapt the combination in response to the indication of the maximum time offset corresponding to such a maximum distance. For example, the combiner may be arranged to set a value of the delays depending on the maximum distance between microphones.

[0394] As another example, the maximum distance, and thus the maximum inter-channel delay / time difference, may be used to adapt the operation of the audio render apparatus, and specifically the combination, when the spatial parameter circuit 121 estimates the inter-channel time difference (ITD). ITD methods are commonly based on cross-correlation and on picking the maximum peak of the cross-correlation function, which corresponds to the inter-channel delay. The maximum distance can be translated into a maximum delay value via the relation $\tau_m = d / c$, and the maximum peak is selected within an interval of $\pm\tau_m$ around zero for the cross-correlation function. For azimuth direction of arrival estimation, the corresponding direction may then be estimated from the relationship

[0396] $y = \arccos\left(\frac{c\,\tau}{d}\right)$,

[0398] where $\tau$ is the delay corresponding to the maximum peak in the cross-correlation function.
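Restricting the peak search to the physically possible range of delays may be sketched as follows, assuming real-valued time-domain channel signals of equal length. The function names and the rounding of the maximum delay up to whole samples are illustrative assumptions.

```python
import math

def max_lag_samples(d, fs, c=343.0):
    """Maximum inter-channel delay d / c expressed in samples, rounded up."""
    return math.ceil(d / c * fs)

def estimate_itd(x_l, x_r, fs, d, c=343.0):
    """ITD in seconds from the cross-correlation peak within +/- d / c."""
    max_lag = max_lag_samples(d, fs, c)
    n_pts = len(x_l)
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        val = sum(x_l[n + lag] * x_r[n]
                  for n in range(n_pts) if 0 <= n + lag < n_pts)
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / fs

# Left channel is the right channel delayed by three samples.
x_right = [0.0] * 64
x_right[20] = 1.0
x_right[21] = -0.5
x_left = [0.0] * 3 + x_right[:-3]
itd = estimate_itd(x_left, x_right, fs=48000, d=0.1)
```

Limiting the search window both reduces the number of lags to evaluate and prevents spurious peaks at delays that the microphone spacing could not have produced.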

[0399] The processing may be performed in subbands. The processing may be performed in time segments. The processing in each subband may for some (any) or all steps be performed separately / independently in each subband (with respect to the processing in other subbands). The processing in each time segment may for some (any) or all steps be performed separately / independently in each time segment (with respect to the processing in other time segments).

[0402] The processing may be time interval / segment based with all processing being performed for each time segment. Equivalently, the signal(s) for each segment may be considered a signal (and in particular signals of different time segments may be considered different signals).

[0403] The audio apparatus(es) may specifically be implemented in one or more suitably programmed processors. An example of a suitable processor is provided in the following.

[0404] FIG. 6 is a block diagram illustrating an example processor 600 according to embodiments of the disclosure. Processor 600 may be used to implement one or more processors implementing an apparatus as previously described or elements thereof (including in particular one or more artificial neural networks). Processor 600 may be any suitable processor type including, but not limited to, a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) where the FPGA has been programmed to form a processor, a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC) where the ASIC has been designed to form a processor, or a combination thereof.

[0405] The processor 600 may include one or more cores 602. The core 602 may include one or more Arithmetic Logic Units (ALU) 604. In some embodiments, the core 602 may include a Floating Point Logic Unit (FPLU) 606 and / or a Digital Signal Processing Unit (DSPU) 608 in addition to or instead of the ALU 604.

[0406] The processor 600 may include one or more registers 612 communicatively coupled to the core 602. The registers 612 may be implemented using dedicated logic gate circuits (e.g., flip-flops) and / or any memory technology. In some embodiments the registers 612 may be implemented using static memory. The registers 612 may provide data, instructions and addresses to the core 602.

[0407] In some embodiments, processor 600 may include one or more levels of cache memory 610 communicatively coupled to the core 602. The cache memory 610 may provide computer-readable instructions to the core 602 for execution. The cache memory 610 may provide data for processing by the core 602. In some embodiments, the computer-readable instructions may have been provided to the cache memory 610 by a local memory, for example, local memory attached to the external bus 616. The cache memory 610 may be implemented with any suitable cache memory type, for example, Metal-Oxide Semiconductor (MOS) memory such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and / or any other suitable memory technology.

[0408] The processor 600 may include a controller 614, which may control input to the processor 600 from other processors and / or components included in a system and / or outputs from the processor 600 to other processors and / or components included in the system. Controller 614 may control the data paths in the ALU 604, FPLU 606 and / or DSPU 608. Controller 614 may be implemented as one or more state machines, data paths and / or dedicated control logic. The gates of controller 614 may be implemented as standalone gates, FPGA, ASIC or any other suitable technology.

[0410] The registers 612 and the cache 610 may communicate with controller 614 and core 602 via internal connections 620A, 620B, 620C and 620D. Internal connections may be implemented as a bus, multiplexer, crossbar switch, and / or any other suitable connection technology.

[0411] Inputs and outputs for the processor 600 may be provided via a bus 616, which may include one or more conductive lines. The bus 616 may be communicatively coupled to one or more components of processor 600, for example the controller 614, cache 610, and / or register 612. The bus 616 may be coupled to one or more components of the system.

[0412] The bus 616 may be coupled to one or more external memories. The external memories may include Read Only Memory (ROM) 632. ROM 632 may be a masked ROM, Erasable Programmable Read Only Memory (EPROM) or any other suitable technology. The external memory may include Random Access Memory (RAM) 633. RAM 633 may be a static RAM, battery backed up static RAM, Dynamic RAM (DRAM) or any other suitable technology. The external memory may include Electrically Erasable Programmable Read Only Memory (EEPROM) 635. The external memory may include Flash memory 634. The external memory may include a magnetic storage device such as disc 636. In some embodiments, the external memories may be included in a system.

[0413] The invention can be implemented in any suitable form including hardware, software, firmware, or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and / or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

[0414] Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.

[0415] Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and / or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

[0418] Generally, examples of an audio apparatus, a method of operation for an audio apparatus, and a computer program which implements the method are indicated by the embodiments below.

[0419] EMBODIMENTS:

[0420] Embodiment 1. An audio apparatus for generating an output stereo signal, the audio apparatus comprising:

[0421] a receiver (101) arranged to receive a stereo signal;

[0422] a first downmix filter (103) arranged to generate a first filtered signal by filtering a signal of a first channel of the stereo signal;

[0423] a second downmix filter (105) arranged to generate a second filtered signal by filtering a signal of a second channel of the stereo signal;

[0424] a combiner (107) arranged to generate a mono downmix audio signal by combining the first filtered signal and the second filtered signal;

[0425] a first residual filter (109) arranged to generate a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter (103) having a fixed predetermined relationship with the filter response of the first residual filter (109);

[0426] a first compensator (111) arranged to generate a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal;

[0427] a second residual filter (113) arranged to generate a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter (105) having a fixed predetermined relationship with the filter response of the second residual filter (113) in that a frequency representation of the filter response of the first downmix filter (103) is a complex conjugate of a frequency representation of the filter response of the first residual filter (109);

[0428] a second compensator (115) arranged to generate a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal;

[0429] an adapter (119) arranged to update the filter response of the first residual filter (109) to reduce a magnitude of a first error value and to update the filter response of the second residual filter (113) to reduce a magnitude of a second error value, the first error value being dependent on the first residual signal and the second error value being dependent on the second residual signal;

[0430] a renderer (117) arranged to render an output stereo signal, the renderer (117) comprising:

[0432] a first renderer (301) arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

[0433] a second renderer (307) arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

[0434] a render combiner (309) arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.

[0435] The first and the second filters are individually and separably updateable.

[0436] Embodiment 2. The audio apparatus of embodiment 1 wherein the first error value is monotonically increasing with an increasing power level of the first residual signal and the second error value is monotonically increasing with an increasing power level of the second residual signal.

[0437] Embodiment 3. The audio apparatus of embodiment 1 or 2 wherein the first error value is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the first residual signal.

[0438] Embodiment 4. The audio apparatus of any previous embodiment wherein the adaptation is subject to a constraint on a combined energy measure for the filter response of the first residual filter (109) and the filter response of the second residual filter (113).

[0439] Embodiment 5. The audio apparatus of any previous embodiment wherein the second renderer (307) is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and

[0440] the combiner (309) is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

[0441] Embodiment 6. The audio render apparatus of embodiment 5 wherein the adapter (119) is arranged to compensate the update of the filter response of the first residual filter (109) and the update of the filter response of the second residual filter (113) for a common signal shift property for the first residual filter (109) and for the second residual filter (113).

[0442] Embodiment 7. The audio apparatus of any previous embodiment wherein the first compensator (111) comprises a delay for delaying the signal of the first channel of the stereo signal relative to the first compensation signal.

[0444] Embodiment 8. The audio apparatus of any previous embodiment wherein the first downmix filter (103) is a frequency domain filter comprising weights for each subband of a frequency domain representation of the signal of the first channel of the stereo signal; the first residual filter (109) is a frequency domain filter comprising weights for each subband of a frequency domain representation of the mono downmix audio signal; and wherein weights of the first downmix filter (103) are complex conjugates of weights of a same subband of the first residual filter (109).

[0445] Embodiment 9. The audio apparatus of any previous embodiment further comprising:

[0446] a spatial parameter circuit (121) arranged to provide sets of spatial parameters for the stereo signal, the sets of spatial parameters being indicative of relative signal properties of channels of the stereo signal; and

[0447] the renderer (117) is arranged to determine the first direction from the sets of spatial parameters.

[0448] Embodiment 10. The audio render apparatus of any previous embodiment wherein the renderer (117) is arranged to determine the first direction from a timing of a peak of a cross correlation between the filter response of the first downmix filter (103) and the filter response of the second downmix filter (105).

[0449] Embodiment 11. The audio render apparatus of any previous embodiment further comprising a decorrelator arranged to decorrelate the first residual signal to generate a decorrelated residual signal; and wherein the renderer (117) is arranged to perform a third rendering being of the decorrelated residual signal to generate a third intermediate stereo signal, and wherein the combiner (309) is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

[0450] Embodiment 12. The audio render apparatus of any previous embodiment wherein the first direction is dependent on a property of the stereo signal and the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.

[0451] Embodiment 13. A method of generating an output stereo signal, the method comprising:

[0452] receiving a stereo signal;

[0453] a first downmix filter (103) generating a first filtered signal by filtering a signal of a first channel of the stereo signal;

[0455] a second downmix filter (105) generating a second filtered signal by filtering a signal of a second channel of the stereo signal;

[0456] generating a mono downmix audio signal by combining the first filtered signal and the second filtered signal;

[0457] a first residual filter (109) generating a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter (103) having a fixed predetermined relationship with the filter response of the first residual filter (109);

[0458] generating a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal;

[0459] a second residual filter (113) generating a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter (105) having a fixed predetermined relationship with the filter response of the second residual filter (113) in that a frequency representation of the filter response of the first downmix filter (103) is a complex conjugate of a frequency representation of the filter response of the first residual filter (109);

[0460] generating a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal;

[0461] updating the filter response of the first residual filter (109) to reduce a magnitude of a first error value, the first error value being dependent on the first residual signal;

[0462] updating the filter response of the second residual filter (113) to reduce a magnitude of a second error value, the second error value being dependent on the second residual signal;

[0463] rendering an output stereo signal, the rendering comprising:

[0464] performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

[0465] performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

[0466] combining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.
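The filtering, compensation and adaptation steps of the method can be sketched for a single subband as shown below. The downmix weights are taken as the complex conjugates of the residual weights, as in embodiment 8. The specific update rule (a normalised step along the residual multiplied by the conjugated downmix, followed by renormalisation to unit combined power) is an assumption chosen for illustration and is not presented as the adaptation used in the description; rendering is omitted.

```python
import math

def process_subband(x_l, x_r, mu=0.5, eps=1e-12):
    """Downmix, compensate and adapt for one subband of complex samples.

    Returns the mono downmix, both residual signals and the final weights.
    """
    w_l = w_r = complex(1.0 / math.sqrt(2.0))  # residual-filter weights
    m_out, d_l_out, d_r_out = [], [], []
    for l, r in zip(x_l, x_r):
        # Downmix weights are the complex conjugates of the residual weights.
        m = w_l.conjugate() * l + w_r.conjugate() * r
        # Compensation signals w * m are subtracted from the channel signals.
        d_l = l - w_l * m
        d_r = r - w_r * m
        m_out.append(m)
        d_l_out.append(d_l)
        d_r_out.append(d_r)
        # Assumed update: step that lowers the residual power, scaled by input power.
        step = mu / (abs(l) ** 2 + abs(r) ** 2 + eps)
        w_l += step * d_l * m.conjugate()
        w_r += step * d_r * m.conjugate()
        # Keep |w_l|^2 + |w_r|^2 = 1 so that power is preserved.
        norm = math.sqrt(abs(w_l) ** 2 + abs(w_r) ** 2)
        w_l, w_r = w_l / norm, w_r / norm
    return m_out, d_l_out, d_r_out, (w_l, w_r)

# Single coherent source panned with gains 0.8 and 0.6j into the two channels.
source = [complex(math.cos(0.3 * n), math.sin(0.7 * n)) for n in range(400)]
mono, res_l, res_r, weights = process_subband(
    [0.8 * v for v in source], [0.6j * v for v in source])
```

For a single coherent source the residual signals decay towards zero as the weights align with the panning gains, so that essentially all signal power ends up in the mono downmix. With the unit-power constraint, the power of the downmix plus that of the two residual signals equals the power of the two input channel samples.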

[0467] Embodiment 14. A computer program product comprising computer program code means adapted to perform all the steps of embodiment 13 when said program is run on a computer.

Claims

Claim 1. An audio apparatus for generating an output stereo signal, the audio apparatus comprising:
a receiver (101) arranged to receive a stereo signal;
a first downmix filter (103) arranged to generate a first filtered signal by filtering a signal of a first channel of the stereo signal;
a second downmix filter (105) arranged to generate a second filtered signal by filtering a signal of a second channel of the stereo signal;
a combiner (107) arranged to generate a mono downmix audio signal by combining the first filtered signal and the second filtered signal;
a first residual filter (109) arranged to generate a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter (103) having a fixed predetermined relationship with the filter response of the first residual filter (109);
a first compensator (111) arranged to generate a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal;
a second residual filter (113) arranged to generate a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter (105) having a fixed predetermined relationship with the filter response of the second residual filter (113) in that a frequency representation of the filter response of the first downmix filter (103) is a complex conjugate of a frequency representation of the filter response of the first residual filter (109);
a second compensator (115) arranged to generate a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal;
an adapter (119) arranged to update the filter response of the first residual filter (109) to reduce a magnitude of a first error value and to update the filter response of the second residual filter (113) to reduce a magnitude of a second error value, the first error value being dependent on the first residual signal and the second error value being dependent on the second residual signal;
a renderer (117) arranged to render an output stereo signal, the renderer (117) comprising:
a first renderer (301) arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;
a second renderer (307) arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and
a render combiner (309) arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.

Claim 2. The audio apparatus of claim 1 wherein the first error value is monotonically increasing with an increasing power level of the first residual signal and the second error value is monotonically increasing with an increasing power level of the second residual signal.

Claim 3. The audio apparatus of claim 1 or 2 wherein the first error value is monotonically increasing with an increasing cross-correlation between the mono downmix audio signal and the first residual signal.

Claim 4. The audio apparatus of any previous claim wherein the adaptation is subject to a constraint on a combined energy measure for the filter response of the first residual filter (109) and the filter response of the second residual filter (113).

Claim 5. The audio apparatus of any previous claim wherein the second renderer (307) is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and
the combiner (309) is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

Claim 6. The audio render apparatus of claim 5 wherein the adapter (119) is arranged to compensate the update of the filter response of the first residual filter (109) and the update of the filter response of the second residual filter (113) for a common signal shift property for the first residual filter (109) and for the second residual filter (113).

Claim 7. The audio apparatus of any previous claim wherein the first compensator (111) comprises a delay for delaying the signal of the first channel of the stereo signal relative to the first compensation signal.

Claim 8. The audio apparatus of any previous claim wherein the first downmix filter (103) is a frequency domain filter comprising weights for each subband of a frequency domain representation of the signal of the first channel of the stereo signal; the first residual filter (109) is a frequency domain filter comprising weights for each subband of a frequency domain representation of the mono downmix audio signal; and wherein weights of the first downmix filter (103) are complex conjugates of weights of a same subband of the first residual filter (109).

Claim 9. The audio apparatus of any previous claim further comprising:
a spatial parameter circuit (121) arranged to provide sets of spatial parameters for the stereo signal, the sets of spatial parameters being indicative of relative signal properties of channels of the stereo signal; and
the renderer (117) is arranged to determine the first direction from the sets of spatial parameters.

Claim 10. The audio render apparatus of any previous claim wherein the renderer (117) is arranged to determine the first direction from a timing of a peak of a cross correlation between the filter response of the first downmix filter (103) and the filter response of the second downmix filter (105).

Claim 11. 
The audio render apparatus of any previous claim further comprising a decorrelator arranged to decorrelate the first residual signal to generate a decorrelated residual signal; and wherein the Tenderer (117) is arranged to perform a third rendering being of the decorrelated residual signal to generate a third intermediate stereo signal, and wherein the combiner (309) is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.Claim 12. The audio render apparatus of any previous claim wherein the first direction is dependent on a property of the stereo signal and the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.Claim 13. A method of generating an output stereo signal, the method comprising:receiving a stereo signal;a first downmix filter (103) generating a first filtered signal by filtering a signal of a first channel of the stereo signal;a second downmix filter (105) generating a second filtered signal by filtering a signal of a second channel of the stereo signal;generating a mono downmix audio signal by combining the first filtered signal and the second filtered signal;a first residual filter (109) generating a first compensation signal by filtering the mono downmix audio signal, a filter response of the first downmix filter (103) having a fixed predetermined relationship with the filter response of the first residual filter (109);generating a first residual signal by compensating the signal of the first channel of the stereo signal by the first compensation signal;2024PF0055546a second residual filter (113) generating a second compensation signal by filtering the mono downmix audio signal, a filter response of the second downmix filter (105) having a fixed predetermined relationship with the filter response of the 
second residual filter (113) in that a frequency representation of the filter response of the first downmix filter (103) is a complex conjugate of a frequency representation of the filter response of the first residual filter (109);generating a second residual signal by compensating the signal of a second channel of the stereo signal by the second compensation signal;updating the filter response of the first residual filter (109) to reduce a magnitude of a first error value, the first error value being dependent on the first residual signal;updating the filter response of the second residual filter (113) to reduce a magnitude of a second error value, the second error value being dependent on the second residual signal;rendering an output stereo signal, the rendering comprising:performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; andcombining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.Claim 14. A computer program product comprising computer program code means adapted to perform all the steps of claim 13 when said program is run on a computer.
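
Illustrative sketch (not part of the claims). The signal chain of claims 1, 8 and 13 can be pictured for a single frequency subband as follows. This is a minimal sketch under stated assumptions: the function name process_subband, the starting weights, the step size mu, the regulariser eps and the normalised LMS-style update are all hypothetical choices made for illustration. The claims only require that the residual filter responses are updated to reduce error values that depend on the residual signals, and do not prescribe this update rule.

```python
import random


def process_subband(left, right, mu=0.1, eps=1e-9):
    """Downmix, residual generation and adaptation for one subband.

    left, right: lists of complex subband samples, one per time frame.
    Returns (mono, res1, res2, (w1, w2)), where w1 and w2 are the final
    weights of the first and second residual filters.
    """
    # Hypothetical starting weights (assumption, not from the source).
    w1 = w2 = complex(2 ** -0.5, 0.0)
    mono, res1, res2 = [], [], []
    for l, r in zip(left, right):
        # Downmix filter weights are the complex conjugates of the
        # residual filter weights of the same subband (claim 8).
        m = w1.conjugate() * l + w2.conjugate() * r
        # Compensation signals are the filtered mono downmix; each
        # residual is the channel signal minus its compensation signal.
        e1 = l - w1 * m
        e2 = r - w2 * m
        mono.append(m)
        res1.append(e1)
        res2.append(e2)
        # Assumed update: normalised LMS-style step that lowers the
        # residual power while treating the current downmix as given.
        norm = abs(m) ** 2 + eps
        w1 += mu * e1 * m.conjugate() / norm
        w2 += mu * e2 * m.conjugate() / norm
    return mono, res1, res2, (w1, w2)


# Toy input: one source panned with gains 0.8 and 0.6 to the two channels.
rng = random.Random(0)
src = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
left = [0.8 * s for s in src]
right = [0.6 * s for s in src]

mono, res1, res2, (w1, w2) = process_subband(left, right)
tail_power = sum(abs(e) ** 2 for e in res1[-50:] + res2[-50:]) / 50
print("residual power over the last 50 frames:", tail_power)
```

In this toy case both channels carry the same source, so the residual signals decay towards zero as the weights adapt and the mono downmix ends up carrying essentially all of the signal energy. With real stereo content the residuals would retain the part of each channel that the downmix cannot predict, which is what the second rendering of claim 1 operates on.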