Audio apparatus and method of generating a stereo signal

The audio apparatus enhances stereo signal rendering by using frequency subband spatial parameters to align phases and reduce complexity, improving perceived quality and spatial representation in portable devices.

WO2026125128A1 · PCT designated stage · Publication Date: 2026-06-18 · KONINKLIJKE PHILIPS NV

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
KONINKLIJKE PHILIPS NV
Filing Date
2025-12-04
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing audio rendering technologies for stereo signals exhibit suboptimal performance in terms of perceived quality, spatial perception, complexity, resource usage, and computational load, particularly in applications involving small and cheap portable devices.

Method used

An audio apparatus and method that utilizes frequency subband spatial parameters to determine downmix weights, generates a mono downmix audio signal and residual signals, and applies directional and predetermined renderings to combine these signals, aligning phase and reducing complexity while enhancing perceived audio quality and spatial representation.

🎯Benefits of technology

The approach provides improved stereo audio generation with reduced complexity and resource usage, offering enhanced perceived quality and spatial representation, suitable for various scenarios including binaural rendering on portable devices.

✦ Generated by Eureka AI based on patent content.


Abstract

An audio apparatus renders a stereo signal and comprises a weight processor (105) determining frequency subband downmix weights from spatial parameters for the stereo signal. A downmixer (107) determines a mono downmix by downmixing the stereo signal using the downmix weights. A residual circuit (109) is arranged to generate at least a first residual signal from a first channel of the stereo signal by generating frequency subband residual weights from the downmix weights; generating frequency subband compensation values from the downmix and the residual weights; and generating frequency subband values of the residual signal by compensation of the channel by the compensation values. A renderer (111) renders an output stereo signal by a directional rendering of the downmix to generate a first intermediate stereo signal and a second rendering of the residual signals to generate a second intermediate stereo signal, with the output stereo signal being generated by combining the intermediate stereo signals.

Description

[0001] AUDIO APPARATUS AND METHOD OF GENERATING A STEREO SIGNAL

[0002] FIELD OF THE INVENTION

[0003] The invention relates to an audio apparatus and method for rendering a stereo signal.

[0004] BACKGROUND OF THE INVENTION

[0005] Spatial audio applications have become numerous and widespread and increasingly form part of many audiovisual experiences. New and improved spatial experiences and applications are continuously being developed which result in increased demands on the audio processing and rendering.

[0006] A lot of research and development effort has focused on providing efficient and high quality audio encoding and audio decoding for spatial audio. A frequently used spatial audio representation is the multichannel audio representation, including the stereo representation, and efficient encoding of such multichannel audio, based on downmixing multichannel audio signals to downmix signals with fewer channels, has been developed. One of the main advances in low bit-rate audio coding has been the use of parametric multichannel coding where a downmix signal is generated together with parametric data that can be used to upmix the downmix signal to recreate the multichannel audio signal.

[0007] In addition to accurately reproducing a stereo signal, it has also been of interest to create high quality rendering, and specifically binaural rendering of (encoded) stereo signals to emulate a virtual loudspeaker playback.

[0008] Binaural rendering of content authored for multi-channel playback can be achieved by the sum of convolutions of the input channel signals with left and right Head Related Impulse Responses (HRIRs), where each HRIR pair corresponds to a measured / simulated impulse response from a loudspeaker location to the ears. This can be expressed compactly in the z-domain as:

[0009] $$Y_{L,R}(z) = \sum_{\forall c} X_c(z)\, H_{L,R}^{\varphi_c}(z)$$

[0012] where $X_c(z)$ represents the z-transform of the time domain input signal $x_c[n]$ of channel $c$, $Y_{L,R}(z)$ represents the z-transform of the left and right time domain output signals $l[n]$ and $r[n]$, respectively, and $H_{L,R}^{\varphi_c}(z)$ is the z-transform of the HRIRs $h_l[n]$ and $h_r[n]$ of the left and right channels for the angle (and distance) corresponding to loudspeaker position $\varphi_c$.

[0014] Approaches for rendering binaural stereo are disclosed in WO2010/122455 A1 and WO2007/031896 A1. However, whereas current approaches for audio rendering may provide acceptable performance in many applications and scenarios, they tend not to be ideal and may exhibit suboptimal behavior in some scenarios. In particular, they may result in suboptimal perceived quality and / or a reduced user experience with e.g. perceived suboptimal spatial perception / audio scene in some cases. Complexity and / or resource usage may also be higher than desired and may in some cases make the approach undesired for some implementations, such as applications based on small and cheap portable devices.
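The sum-of-convolutions expression above can be illustrated with a minimal time-domain sketch. The function names `convolve` and `binaural_render` and the two-tap impulse responses are hypothetical toy values chosen for illustration; they are not measured HRIRs and are not taken from the application.

```python
def convolve(x, h):
    """Full linear convolution of two real-valued sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def binaural_render(channels, hrirs):
    """Sum over channels c of x_c[n] convolved with that channel's HRIR pair.

    channels: list of sample lists, one per loudspeaker channel (equal length)
    hrirs:    list of (h_left, h_right) pairs, one per channel (equal length)
    """
    out_len = len(channels[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * out_len
    right = [0.0] * out_len
    for x, (h_l, h_r) in zip(channels, hrirs):
        for n, v in enumerate(convolve(x, h_l)):
            left[n] += v
        for n, v in enumerate(convolve(x, h_r)):
            right[n] += v
    return left, right


# Toy data (assumption): each loudspeaker reaches the near ear directly and
# the far ear one sample later at half amplitude.
x_left = [1.0, 0.0, 0.0]
x_right = [0.0, 1.0, 0.0]
toy_hrirs = [([1.0, 0.0], [0.0, 0.5]),   # left loudspeaker
             ([0.0, 0.5], [1.0, 0.0])]   # right loudspeaker
left, right = binaural_render([x_left, x_right], toy_hrirs)
```

The cost of this direct form grows with the number of channels and the HRIR length, which is one way to see the complexity concern raised in the text.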

[0015] Hence, an improved approach would be advantageous. In particular an approach allowing increased flexibility, improved adaptability, improved performance, increased audio quality, improved perceived quality, an improved rendering of an audio scene, improved spatial representation, reduced complexity and / or resource usage, reduced computational load, facilitated implementation, improved user experience, and / or an improved spatial audio experience would be advantageous.

[0016] SUMMARY OF THE INVENTION

[0017] Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

[0018] According to an aspect of the invention there is provided an audio apparatus comprising: a receiver arranged to receive a stereo signal; a spatial parameter circuit arranged to provide sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of channels of the stereo signal; a weight processor arranged to determine frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters; a downmixer arranged to determine a mono downmix audio signal by downmixing the stereo signal, the downmixer being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights; a residual circuit arranged to generate at least a first residual signal from a first channel of the stereo signal, the residual circuit being arranged to: generate frequency subband residual weights for the first channel from frequency subband downmix weights for the first channel, the frequency subband residual weights being determined as complex conjugates of the frequency subband downmix weights; generate first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and generate frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values; a renderer arranged to render an output stereo signal, the renderer comprising: a first renderer arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction; a second renderer arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and a combiner arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal; and wherein the weight processor (105) is arranged to determine the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal.
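The downmix and residual steps of this aspect can be sketched for a single time frequency tile. This is a minimal sketch under stated assumptions: the compensation value is taken to be the product of the residual weight and the downmix value, and the compensation is taken to be a subtraction. The text only says the compensation values are generated "from" those quantities, so both choices, along with the function name `downmix_and_residual` and the numeric values, are hypothetical.

```python
import cmath


def downmix_and_residual(l, r, w_l, w_r):
    """One tile: complex channel values l, r and complex downmix weights."""
    m = w_l * l + w_r * r                # mono downmix value
    residual_weight = w_l.conjugate()    # residual weight = conjugate of downmix weight
    compensation = residual_weight * m   # assumption: product form
    residual = l - compensation          # assumption: compensation by subtraction
    return m, residual


# Toy tile (assumption): both channels carry the same source s, with the
# right channel scaled and phase-shifted, and unit-norm downmix weights.
phi = 0.5
w_l = complex(0.8)
w_r = 0.6 * cmath.exp(1j * phi)
s = 1 + 1j
l = w_l.conjugate() * s
r = w_r.conjugate() * s
m, res = downmix_and_residual(l, r, w_l, w_r)
```

In this toy case the weighted terms `w_l * l` and `w_r * r` share the same phase, the downmix recovers `s`, and the first residual vanishes because the single source is fully captured by the downmix.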

[0019] The approach may provide an improved audio experience in many embodiments. For many signals and scenarios, the approach may provide improved rendering of a stereo audio signal allowing improved generation / reconstruction of a stereo audio signal with an improved perceived audio quality. The approach may provide improved representation of an audio scene by a stereo signal.

[0020] The approach may provide efficient implementation and may in many embodiments allow reduced complexity and / or resource usage. The approach may in many scenarios allow a reduced computational burden while providing a perceived high quality rendering of a stereo signal.

[0021] The processing may be in time frequency segments or tiles. Each time frequency segment / tile may represent a frequency interval in a time interval. In many embodiments, the mono downmix audio signal may be divided into time segments / intervals and a frequency representation of the signal in the time segment / interval may be provided by signal values representing different frequency segments of the signal in the time segment / interval. Some or all of the processing may be performed in the frequency domain / frequency subbands. In some cases, some processing may be time domain processing, and in particular some processing may be a combination of frequency domain / subbands and time-domain processing.

[0022] The spatial parameters may comprise sets of spatial parameters, each set of spatial parameters comprising at least one of: a level difference parameter indicative of a level difference between channels of the multichannel audio signal; a correlation parameter indicative of a coherence between channels of the multichannel audio signal; a timing difference parameter indicative of a timing difference between channels of the multichannel audio signal, and a phase difference parameter indicative of a phase difference between channels of the multichannel audio signal.

[0023] The first direction may be a desired / target rendering direction. A direction may be an angle and / or orientation from a listening position / in a stereo image.

[0024] The directional rendering may be a binaural rendering generating the first intermediate stereo signal as a binaural stereo signal comprising a point source positioned in the first direction, the binaural rendering comprising selecting directional transfer functions / binaural impulse response values as values for a directional transfer function / binaural impulse response for a sound source in the first direction. Directional transfer functions may be parameterized transfer functions, and specifically may be represented in the frequency domain as weights for each of a plurality of subbands. A weight may be provided for each stereo channel. The weights may typically be complex.
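A directional rendering expressed as one complex weight per subband and per output channel can be sketched as follows. The function name `render_directional` and the weight values are hypothetical placeholders, not HRTF data; how the weights are obtained for a given direction is outside this sketch.

```python
def render_directional(mono_subbands, h_left, h_right):
    """Apply per-subband complex weights to the mono downmix.

    mono_subbands: complex downmix values, one per subband (single time segment)
    h_left, h_right: complex directional weights, one per subband
    """
    left = [m * hl for m, hl in zip(mono_subbands, h_left)]
    right = [m * hr for m, hr in zip(mono_subbands, h_right)]
    return left, right


# Toy values (assumption): two subbands, source placed towards the left,
# so the right ear receives a weaker and phase-shifted copy.
mono = [1 + 0j, 0.5j]
h_left = [1 + 0j, 0.5 + 0j]
h_right = [-0.5j, 0.25 + 0j]
first_left, first_right = render_directional(mono, h_left, h_right)
```

The result plays the role of the first intermediate stereo signal; it would then be combined with the rendering of the residual signal(s).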

[0025] The directional transfer function / binaural impulse response values may be parametric values and may be frequency tile values. The directional transfer function / binaural impulse response values may be values representing any suitable binaural impulse response in any suitable way, including HRIR, HRTF, BRIR values etc. The directional rendering may e.g. render the mono downmix audio signal from a position / direction determined from the spatial parameters, from a parameter received in a bitstream (also including the stereo signal and potentially the spatial parameters), or e.g. in dependence on a user input, etc.

[0026] In many embodiments, the second rendering may be a predetermined rendering.

[0027] The frequency subband downmix weights may typically be a complex value for each channel for each frequency subband for each time segment. A frequency subband value of a signal may specifically be a complex value / sample of the signal for each frequency subband and each time segment.

[0028] The frequency subband residual weights are complex conjugates of the frequency subband downmix weights. In many embodiments, the frequency subband downmix weights and the frequency subband residual weights are complex values. In many embodiments, the frequency subband values of the mono downmix audio signal, the frequency subband values of channels of the stereo signal, the frequency subband compensation values for the first channel, and / or the frequency subband values of the first residual signal are complex values.

[0029] In some embodiments, the weight processor may seek to determine the frequency subband downmix weights to have magnitudes that are dependent on a level of correlation between the mono downmix audio signal and the channels of the stereo signal.

[0030] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0031] In many embodiments, the weight for a first channel of the signal is monotonically increasing with a level of correlation between the first channel and the mono downmix audio signal.

[0032] According to an optional feature of the invention, the weight processor is arranged to determine the frequency subband downmix weights to have levels that meet a combined level constraint.

[0033] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0034] The levels may be power / amplitude / magnitude / energy levels.

[0035] According to an optional feature of the invention, the weight processor is arranged to determine the frequency subband downmix weights to achieve at least one of: maximizing a power level of the mono downmix audio signal; minimizing a power level of the first residual signal; or minimizing a correlation of the mono downmix audio signal and the first residual signal.

[0036] This may be particularly advantageous in many embodiments.

[0037] According to an optional feature of the invention, the weight processor is arranged to determine the frequency subband downmix weights to generate the mono downmix audio signal as a principal component of the stereo signal. This may be particularly advantageous in many embodiments and for many scenarios. Effectively, in connection with a combined level constraint this may mean that the power of the mono downmix audio signal is maximized.

[0038] In many embodiments, the weight processor is arranged to determine the frequency subband downmix weights from a principal component analysis of the stereo signal.
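One way to obtain principal-component downmix weights for a subband is the closed-form eigenvector of the 2x2 covariance matrix built from the channel inner products. This is a sketch under assumptions, not the application's stated procedure: the helper names, the unit-norm constraint on the weights, and the test signal are all hypothetical.

```python
import cmath
import math


def inner(x, y):
    """Complex inner product: sum of x_i times the conjugate of y_i."""
    return sum(a * b.conjugate() for a, b in zip(x, y))


def pca_downmix_weights(l, r, eps=1e-12):
    """Unit-norm downmix weights (w_l, w_r) for m = w_l*l + w_r*r."""
    a = inner(l, l).real          # power of the left channel
    b = inner(r, r).real          # power of the right channel
    c = inner(l, r)               # cross term
    # Largest eigenvalue of the Hermitian matrix [[a, c], [conj(c), b]].
    lam = 0.5 * (a + b) + math.sqrt((0.5 * (a - b)) ** 2 + abs(c) ** 2)
    if abs(c) > eps:
        v = (complex(c), complex(lam - a))   # eigenvector for lam
    else:
        # Uncorrelated channels: pick the stronger channel.
        v = (1 + 0j, 0j) if a >= b else (0j, 1 + 0j)
    norm = math.sqrt(abs(v[0]) ** 2 + abs(v[1]) ** 2)
    # Projecting onto v means weighting the channels by conj(v).
    return (v[0] / norm).conjugate(), (v[1] / norm).conjugate()


# Toy subband (assumption): one source with a fixed gain and phase per channel.
src = [1 + 0j, 0.5j, -0.25 + 0.25j]
l = [0.8 * x for x in src]
r = [0.6 * cmath.exp(0.5j) * x for x in src]
w_l, w_r = pca_downmix_weights(l, r)
m = [w_l * a + w_r * b for a, b in zip(l, r)]
downmix_power = sum(abs(x) ** 2 for x in m)
total_power = inner(l, l).real + inner(r, r).real
```

For this single-source toy signal the downmix carries all of the stereo power, which matches the remark that a principal-component downmix under a combined level constraint maximizes downmix power.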

[0039] According to an optional feature of the invention, the first renderer is arranged to determine a point source direction in a stereo image of the stereo signal from the spatial parameters, and to determine the first direction by applying a mapping function to the point source direction.

[0040] This may be particularly advantageous in many embodiments and for many scenarios. In some embodiments, the first renderer is arranged to determine the first direction from a direction indication provided in a data signal also comprising the stereo signal.

[0041] According to an optional feature of the invention, the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.

[0042] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0043] According to an optional feature of the invention, the second rendering includes a decorrelation.

[0044] This may be particularly advantageous in many embodiments and for many scenarios. According to an optional feature of the invention, the residual circuit is arranged to generate a second residual signal from a second channel of the stereo signal, the residual circuit being arranged to generate second frequency subband residual weights for the second channel from frequency subband downmix weights for the second channel; generate second frequency subband compensation values for the second channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the second channel; and generate frequency subband values of the second residual signal from a compensation of frequency subband values of the second channel of the stereo signal by the second frequency subband compensation values; and the second renderer is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and the combiner is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

[0045] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal. According to an optional feature of the invention, the audio apparatus comprises a delay for delaying the frequency subband values of the stereo signal relative to the frequency subband downmix weights.

[0046] This may be particularly advantageous in many embodiments and for many scenarios. In some embodiments, the receiver is further arranged to receive an indication of a maximum time offset between channels of the stereo signal; and the audio render apparatus is arranged to adapt the combination in dependence on the indication of the maximum time offset.

[0047] This may be particularly advantageous in many embodiments and for many scenarios. According to an optional feature of the invention, the spatial parameter circuit is arranged to analyze the stereo signal to generate the sets of spatial parameters.

[0048] This may provide an advantageous approach for many scenarios, including e.g. providing an advantageous trade-off between complexity, computational resources, data rate and / or the perceived audio quality of the generated output stereo signal.

[0049] In many embodiments, the receiver is arranged to receive a data signal comprising the stereo signal and the sets of spatial parameters, and the spatial parameter circuit is arranged to extract the sets of spatial parameters from the data signal.

[0050] According to an optional feature of the invention, the spatial parameters include an interchannel intensity difference parameter, an interchannel phase difference parameter, and an interchannel correlation parameter.

[0051] This may be particularly advantageous in many embodiments and for many scenarios. The processing may be performed in subbands. The processing may be performed in time segments. The processing in each subband may for some (any) or all steps be performed separately / independently in each subband (with respect to the processing in other subbands). The processing in each time segment may for some (any) or all steps be performed separately / independently in each time segment (with respect to the processing in other time segments).

[0053] The processing may be time interval / segment based with all processing being performed for each time segment. Equivalently, the signal(s) for each segment may be considered a signal (and in particular signals of different time segments, may be considered different signals).

[0054] According to another aspect of the invention, there is provided a method of generating an output stereo signal, the method comprising: receiving a stereo signal; providing sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of frequency subbands of channels of the stereo signal; determining frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters; determining a mono downmix audio signal by downmixing the stereo signal, the downmixer being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights; generating at least a first residual signal from a first channel of the stereo signal, the generating including: generating frequency subband residual weights for the first channel as complex conjugates of the frequency subband downmix weights for the first channel; generating first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and generating frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values; rendering an output stereo signal, the rendering comprising: performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction; performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and combining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal; and wherein determining frequency subband downmix weights comprises determining the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal.

[0055] These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

[0056] BRIEF DESCRIPTION OF THE DRAWINGS

[0057] Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

[0058] FIG. 1 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention;

[0059] FIG. 2 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention;

[0060] FIG. 3 illustrates some elements of an example of a renderer for an audio apparatus in accordance with some embodiments of the invention;

[0061] FIG. 4 illustrates an example of an approach for generating two residual signals for an example of an audio apparatus in accordance with some embodiments of the invention;

[0062] FIG. 5 illustrates some elements of an example of an audio apparatus in accordance with some embodiments of the invention; and

[0063] FIG. 6 illustrates some elements of a possible arrangement of a processor for implementing elements of an audio apparatus in accordance with some embodiments of the invention.

[0064] DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION

FIG. 1 illustrates an audio apparatus, henceforth also referred to as the audio render apparatus, which is arranged to render an output stereo signal from an input stereo signal. Thus, the audio render apparatus may receive a stereo signal and from this proceed to perform a rendering process resulting in an output stereo signal that is typically perceived to have improved properties and which for many signals and scenarios may provide an improved user experience and perception. In many embodiments, the audio render apparatus may generate a binaural stereo signal providing an improved (“out-of-head”) experience when listened to using headphones.

[0065] The audio render apparatus comprises a receiver 101 which receives a data signal comprising an input stereo signal. The stereo signal may typically be encoded in accordance with a suitable encoding standard and the receiver 101 may be arranged to decode the encoded data. The input stereo signal may be one captured at the audio render apparatus 103 and thus may be received from e.g. a set of stereo microphones. In many embodiments, it may be received from another source, or indeed may be an artificially generated stereo signal (e.g. it may be a virtual audio stereo signal).

[0066] The audio render apparatus is arranged to perform subband processing and accordingly the input stereo signal may be processed in the subband / frequency domain. In many cases, the input stereo signal may be directly received in a suitable frequency domain / subband representation. In many embodiments, the input stereo signal may be received as a time domain signal and the audio render apparatus may comprise functionality for transforming the input stereo signal to the frequency domain.

[0067] The receiver 101 may include a time to frequency domain transformer that generates a frequency domain / subband stereo signal from a received time domain representation. In particular, in some embodiments, the receiver may comprise a filter bank which is arranged to generate a frequency subband representation of a received time domain input stereo signal. The receiver 101 may comprise a filter bank that is applied to the input stereo signal such that this is divided into frequency subbands.

[0068] The filter bank may be a Quadrature Mirror Filter (QMF) bank or may e.g. be implemented by a Fast Fourier Transform (FFT), but it will be appreciated that many other filter banks and approaches for dividing an audio signal into a plurality of subband signals are known and may be used. The filterbank may specifically be a complex-valued pseudo QMF bank, resulting in e.g. 32 or 64 complex-valued sub-band signals.

[0069] The processing is furthermore typically performed in time segments or time slots / intervals. In most embodiments, the audio signal is divided into time intervals / segments with a conversion to the frequency / subband domain by applying e.g. an FFT or QMF filtering to the samples of each signal. For example, each channel of the downmix audio signal may be divided into time segments of e.g. 2048, 1024, or 512 samples. These signals may then be processed to generate samples for e.g. 64, 32 or 16 subbands. Thus, a set of samples may be determined for each subband of the input stereo signal.

[0070] It should be noted that the number of time domain samples is not directly coupled to the number of subbands. Typically, for a so-called critically sampled filterbank of N bands, every N input samples will lead to N sub-band samples (one for every sub-band). An oversampled filterbank will produce more output samples. E.g. for every N input samples, it would generate k*N output samples, i.e., k consecutive samples for every band. In some embodiments, the subbands are generated to have the same bandwidth but in other embodiments subbands are generated to have different bandwidths, e.g. reflecting the sensitivity of human hearing to different frequencies.
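The sample-count relationship described above can be checked with a few lines of arithmetic. The function name `subband_sample_counts` is hypothetical, and the sketch assumes the segment length is a whole multiple of the number of bands.

```python
def subband_sample_counts(num_input_samples, n_bands, oversampling=1):
    """Return (samples per band, total subband samples) for one time segment.

    oversampling=1 models a critically sampled filterbank (N in, N out);
    oversampling=k models k consecutive samples per band for every N inputs.
    """
    if num_input_samples % n_bands != 0:
        raise ValueError("segment length must be a multiple of n_bands")
    blocks = num_input_samples // n_bands
    per_band = blocks * oversampling
    return per_band, per_band * n_bands


critical = subband_sample_counts(2048, 64)                    # 32 per band, 2048 in total
oversampled = subband_sample_counts(2048, 64, oversampling=2)  # 64 per band, 4096 in total
```

With 2048-sample segments and 64 bands, the critically sampled case yields as many subband samples as input samples, while an oversampling factor of 2 doubles the total.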

[0071] For example, the receiver 101 may employ a hybrid filterbank with logarithmic filter band center-frequency spacings that follow that of human perception similar to equivalent rectangular bandwidths (ERBs). In order to compensate for the delay of the filtering by the small filter bank, a delay may be introduced for higher frequency subbands.

[0072] As a specific example, a time-domain signal x[n] may be fed through a downsampled complex-exponential modulated QMF bank with K bands. Each frame of 64 time-domain samples x[n] results in one slot of QMF samples X[k, l] with k = (0, ..., K − 1) at slot l. The lower bands may then be filtered by additional complex-modulated filterbanks splitting the lower bands further. The higher bands are delayed, ensuring that the filtered input stereo signals of the lower bands are in sync with the higher bands, as the filtering introduces a delay. This finally results in a structure where for every 64 time-domain samples x[n], one slot of hybrid QMF samples Y[k, l] is produced with k = (0, ..., L − 1) at slot l, e.g. with a total number of hybrid bands M = 77.

[0073] Thus, the signals and the processing may be performed in subbands and for individual segments. Such blocks of a frequency interval / subband in a given time interval / segment will also be referred to as time frequency segments / tiles.

[0074] The audio render apparatus further comprises a spatial parameter circuit 103 which provides sets of frequency subband spatial parameters for the stereo signal where the sets of frequency subband spatial parameters are indicative of relative signal properties of the channels of the stereo signal. The frequency subband spatial parameters are provided for individual subbands of the stereo signal.

[0075] The spatial parameters are indicative of / reflect relative properties of the channel signals of the stereo audio signal. In particular, the spatial parameters may be indicated to include parameters that are indicative of at least one of relative intensities / levels of the stereo channels, relative (frequency domain) phases of the stereo channels, a relative time difference between the channels, and / or a correlation between the channels. Specifically, the spatial parameters may include one or more of an inter-channel intensity difference, inter-channel level difference, inter-channel time difference, interchannel phase difference, and / or inter-channel correlation.

[0076] The spatial parameters may specifically be spatial parameters as used for encoding a stereo signal using a Parametric Stereo (PS) encoding of the stereo signal.

[0077] A classical PS downmix is calculated as:

[0078] $$m = c\,(l + r)$$

[0079] where the parameter c is chosen such that the power of the stereo signal is preserved in the downmix, the power being defined using the 2-norm:

$$\|m\|^2 = \|l\|^2 + \|r\|^2$$

[0080] and thus e.g.:

[0081] $$c = \sqrt{\frac{\|l\|^2 + \|r\|^2}{\|l + r\|^2}}$$
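The classical power-preserving downmix above can be sketched for one subband and time segment as follows. The function name `ps_downmix` and the sample values are hypothetical.

```python
import math


def ps_downmix(l, r):
    """Classical PS downmix m = c * (l + r) with power-preserving gain c.

    l, r: complex subband samples of the two channels for one tile.
    """
    num = sum(abs(x) ** 2 for x in l) + sum(abs(x) ** 2 for x in r)
    den = sum(abs(a + b) ** 2 for a, b in zip(l, r))
    if den == 0.0:
        # The channels cancel exactly, so no finite gain preserves the power.
        raise ValueError("l + r is zero; gain c is undefined")
    c = math.sqrt(num / den)
    return c, [c * (a + b) for a, b in zip(l, r)]


l = [1 + 0j, 0.5j]
r = [0.5 + 0j, -0.25j]
c, m = ps_downmix(l, r)
stereo_power = sum(abs(x) ** 2 for x in l) + sum(abs(x) ** 2 for x in r)
downmix_power = sum(abs(x) ** 2 for x in m)
```

Note that the gain grows without bound as the channels approach exact phase opposition, since the denominator tends to zero.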

[0085] The PS parameters are specifically an Inter-channel Intensity Difference IID, an Interchannel Correlation ICC, and in some cases an Inter-channel Phase Difference IPD parameter. These may specifically be defined / determined as:

[0086]

[0087] $$\mathrm{IPD} = \arg \langle l, r \rangle$$

[0088] where the complex-valued inner product is defined as:

[0089] $$\langle l, r \rangle = \sum_{\forall i} l_i\, r_i^*$$

[0090] The spatial parameters are typically provided for specific time frequency tiles, and thus specifically each parameter value is generated / provided for a given frequency subband and for a given time segment.

[0091] The spatial parameter circuit 103 may determine spatial parameters that are indicative of relative properties of the channels (channel signals) of the stereo signal.

[0092] In many embodiments, the spatial parameter circuit 103 may receive the input stereo signal and process / analyze this to generate the spatial parameters. Specifically, the spatial parameter circuit 103 may calculate the IID, ICC, and IPD values in accordance with the formulas indicated above. Thus, in many embodiments, the audio render apparatus may simply receive a stereo signal and therefrom may generate spatial parameters. In such a scenario, e.g. instead of the IID, ICC and IPD, the following inner products may alternatively or additionally be used as spatial parameters:

$$\langle l, r \rangle = \sum_{\forall i} l_i\, r_i^*$$

[0094]

[0095] In some embodiments, the spatial parameter circuit 103 may be arranged to generate the spatial parameters by extracting them from a received signal. For example, the receiver 101 may receive a data signal comprising both the input stereo signal as well as spatial parameter data for the input stereo signal. The spatial parameter circuit 103 may in this case simply determine the spatial parameters by extracting the spatial parameter values from the received data signal.

[0096] The audio render apparatus of Fig. 1 is arranged to process the input stereo signal to generate a downmix signal for the stereo signal as well as at least one residual signal. The downmix and residual signal are then rendered differently with the downmix being rendered as a directional signal component and with the residual signal typically being rendered using a more diffuse rendering, and typically using a predetermined rendering that is not adapted based on the signal properties of the intermediate stereo signal, the downmix, or the residual signals.

[0097] The audio render apparatus of Fig. 1 comprises a weight processor 105 which is arranged to determine frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters. Specifically, for each subband, the weight processor 105 generates a weight for a first channel of the input stereo signal and a weight for the second channel of the input stereo signal as a function of the spatial parameter values for that subband. It will be appreciated that the set of frequency subband weights for each channel may correspond to a frequency representation of a filter that is applied to a channel of the stereo signal.

[0098] The audio render apparatus of Fig. 1 further comprises a downmixer 107 which is arranged to downmix the input stereo signal to generate a mono downmix. The downmixer 107 is specifically arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the input stereo signal dependent on the frequency subband downmix weights. In many embodiments, the downmixer 107 is arranged to multiply / scale subband samples of the respective channels of the input stereo signal by the respective (typically complex) downmix weights and combine (specifically sum) the resulting subband values to generate subband values of the downmix.

[0099] FIG. 2 illustrates an example where the downmixer 107 comprises a first scale block 201 which multiplies input subband samples Xl[b] of the left channel of the input signal by subband downmix weights wl*[b]. It further comprises a second scale block 203 which multiplies input subband samples Xr[b] of the right channel of the input signal by subband downmix weights wr*[b]. The resulting subband samples are summed together by a summer 205 to generate the subband samples of the mono downmix audio signal Ŝ[b] (also referred to simply as the mono downmix, downmix signal or downmix).

[0100] The audio render apparatus of Fig. 1 further comprises a residual circuit 109 arranged to generate a first residual signal from a first channel of the stereo signal. The residual circuit 109 seeks to remove the component of the first channel signal that corresponds to the downmix. It seeks to generate a residual signal with a reduced / small correlation with the mono downmix audio signal, and specifically it seeks to generate a residual signal with close to zero correlation with the mono downmix audio signal.

[0101] The residual circuit 109 is specifically arranged to generate frequency subband residual weights for the first channel from frequency subband downmix weights for the first channel and then to generate compensation values for the frequency subbands from the subband samples of the mono downmix modified by the residual weights. Specifically, the compensation values may be generated by multiplying the subband samples of the downmix by the residual weights for that subband sample.

[0102] The residual weights are generated as the complex conjugates of the subband downmix weights. This may result in a compensation value which estimates the component of the first channel that is included in / transferred to the downmix. The compensation signal may essentially be generated to reverse the operation of the downmix weight for the first channel but with the signal being generated as a phase inverted version of the component of the first channel that is represented in the downmix.

[0103] The residual circuit 109 may then compensate the first channel to remove or at least reduce this component. Specifically, the generated compensation value (which specifically may represent an antiphase signal of a component of the first channel included in the downmix) may be added to the first signal to generate the first residual signal.

[0104] In many embodiments, the residual circuit 109 may further be arranged to generate a second residual signal from the second channel of the input stereo signal. The residual circuit 109 may use the same approach as described for the first channel. In particular, the (subband) compensation weight(s) may be generated from the (subband) downmix weight(s) for the second channel (specifically as the complex conjugates thereof) and these may be applied to the subband samples (and specifically be added thereto for antiphase signals) to generate the second residual signal.

[0105] An example of such a residual circuit 109 is illustrated in FIG. 2. The residual circuit 109 comprises a first residual scale block 207 which applies subband residual weights wl[b], which are complex conjugates of the subband downmix weights wl*[b], to the subband samples of the downmix signal Ŝ[b]. The resulting compensation values are then by a summation circuit 211 added to the subband samples of the first signal Xl[b] to generate the subband signals of the first residual signal Dl[b]. The approach may correspond to subtracting the part of the channel signals that can be estimated / are represented by the mono downmix audio signal from the channel signals to generate residual signals. Similarly, the residual circuit 109 comprises a second residual scale block 209 which applies subband residual weights wr[b] which are complex conjugates of the subband downmix weights wr*[b] to the subband samples of the downmix signal Ŝ[b]. The resulting compensation values are then by a summation circuit 213 added to the subband samples of the second signal Xr[b] to generate the subband signals of the second residual signal Dr[b].
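The signal flow of FIG. 2 can be sketched per subband as follows. This is a minimal illustration only: the function name `downmix_and_residuals` and the example weight values are hypothetical, and the unit-norm condition |wl|² + |wr|² = 1 used in the example is an assumption, since this passage does not state how the weights are normalized.

```python
def downmix_and_residuals(xl, xr, wl, wr):
    """Per-subband downmix S = conj(wl)*Xl + conj(wr)*Xr and residuals D = X - w*S.

    xl, xr: complex subband samples of the left / right channel.
    wl, wr: complex residual weights; their conjugates act as downmix weights.
    """
    s, dl, dr = [], [], []
    for l, r, a, b in zip(xl, xr, wl, wr):
        sb = a.conjugate() * l + b.conjugate() * r  # scale blocks 201 / 203, summer 205
        s.append(sb)
        dl.append(l - a * sb)  # scale block 207, summation 211 (anti-phase addition)
        dr.append(r - b * sb)  # scale block 209, summation 213
    return s, dl, dr

# One subband with hypothetical unit-norm weights (0.36 + 0.64 = 1).
s, dl, dr = downmix_and_residuals([1 + 2j], [0.5 - 1j], [0.6j], [0.8 + 0j])
```

Under the unit-norm assumption, applying the downmix weights to the two residuals gives zero, so in this sketch the residual pair carries none of the component that was transferred to the downmix.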

[0106] The downmixer 107 and residual circuit 109 are coupled to a renderer 111 which is arranged to generate an output stereo signal by rendering the mono downmix audio signal and at least one residual signal. Further, the renderer 111 is arranged to apply a different rendering approach to the mono downmix audio signal than to the residual signals. Specifically, whereas the mono downmix audio signal is rendered as a directional component, the residual signal(s) is(are) typically rendered as diffuse, non-directional (or at least less directional) signals or e.g. rendered from locations corresponding to the directions of virtual stereo loudspeakers. The rendering of the residual signals is typically using a predetermined rendering algorithm that is not adapted dependent on the signal properties. In many embodiments, binaural rendering may be used.

[0107] The audio render apparatus may accordingly be arranged to, for each subband, generate the mono downmix audio signal as:

[0108] $$\hat{S} = [\,l \;\; r\,]\, w^{H}$$

[0109] where $w^{H}$ represents the downmix weights.

[0110] The residual signals may be derived using the complex conjugate of the downmix weights:

[0111] $$\hat{D} = \begin{bmatrix} \hat{D}_l \\ \hat{D}_r \end{bmatrix} = \begin{bmatrix} l - w_l \hat{S} \\ r - w_r \hat{S} \end{bmatrix}$$

[0115] The triplet (Ŝ, D̂l, D̂r) may then be rendered and may specifically be processed by a binaural renderer which uses HRTF filters and optionally BRIR to produce left and right channels.

[0116] The renderer 111 may use a specific approach where parallel paths process the mono downmix audio signal and the residual signal(s) in different ways to generate different stereo signal components which are then combined to generate the output stereo signal.

[0117] A general signal model for stereo signals can be represented as:

[0118] $$l = f_l\, e^{j\varphi_l}\, x + n_l$$

[0119] $$r = f_r\, e^{j\varphi_r}\, x + n_r$$

The directional signal component x is phase shifted using two (frequency-dependent) parameters φl and φr, and is further panned / positioned in the stereo image of the original stereo channels l and r by frequency-dependent positive gains fl and fr. Furthermore, a diffuse signal component is represented by signal components nl and nr of the respective channels. It is noted that the signal model description does not necessarily refer to a time-domain signal, but rather can alternatively or additionally refer to individual (potentially relatively small) frequency subbands. For example, the described signal model may individually apply to each of the frequency subbands for which separate spatial parameters are provided.

[0120] The renderer 111 may render the audio signal corresponding to an assumption / consideration that the mono downmix audio signal corresponds to the directional signal component x, and the residual signals correspond to the diffuse audio signals nl and nr. It uses different rendering approaches to generate different intermediate stereo signals which may be considered estimates of the different signal components of the signal model, with these intermediate stereo signals being combined to generate an output stereo signal. The intermediate stereo signals may be considered estimates or approximations of respectively the directional signal x and the residual signals nl and nr of the signal model but are generated using approaches with a low resource demand. The approach provides an advantageous rendering in many scenarios, embodiments, and applications and in particular may often provide an advantageous audio and spatial perception while allowing an implementation and operation with low complexity and low resource demand.

[0121] FIG. 3 shows examples of elements of the renderer 111.

[0122] The mono downmix audio signal is in the example fed to a first renderer 301 which is arranged to render the mono downmix audio signal to generate a first intermediate stereo signal. The rendering by the first renderer 301 (also referred to as a first rendering) is a directional rendering which renders the first intermediate stereo signal with a given direction / position in the stereo image of the first intermediate stereo signal. The first rendering may specifically render the mono downmix audio signal as a point source with a given direction / position in the stereo image.

[0123] The first renderer 301 is coupled to a direction determining circuit 303 which is arranged to determine a direction γ’ which is fed to the first renderer 301, resulting in a rendering in which the mono downmix audio signal is perceived from this position / direction. Thus, the first rendering is specifically such that the mono downmix audio signal in the first intermediate stereo signal is perceived as a point audio source positioned in the direction corresponding to the direction γ’ determined by the direction determining circuit 303. The direction γ’ will also be referred to as the rendering direction or rendering angle.

[0124] The first renderer 301 may accordingly proceed to render the mono downmix audio signal such that it is perceived from the given direction, and it specifically achieves this directional rendering by applying a directional transfer function to the mono downmix audio signal with the directional transfer function generating the intermediate stereo signal from the mono downmix audio signal. The directional transfer function may specifically include a sub-transfer function for each channel, i.e. it may include one (sub)transfer function for generating a left channel signal and one (sub)transfer function for generating the right channel signal.

[0125] In many cases, the directional transfer function may be provided as a set of complex weights for the different subbands of a frequency representation of the mono downmix audio signal. The audio apparatus may perform many or all of the operations in the frequency domain and thus the transfer function may also be expressed and applied in the frequency domain. For example, for each frequency subband of the representation of the mono downmix audio signal, the transfer function may provide a complex weight for each of the output channels and a frequency representation of the first intermediate stereo signal may be generated by applying / multiplying the subband samples of the mono downmix audio signal by these weights to generate the subband samples of the first intermediate stereo signal.

[0126] The first transfer function is determined to correspond to the desired direction, i.e. it reflects the mapping from the mono downmix audio signal to the channels of the first intermediate stereo signal such that it is perceived as / corresponds to an audio source at a position in the stereo image corresponding to the rendering direction / angle.

[0127] For example, in some cases, the transfer function for a given direction may correspond to a panning of the mono downmix audio signal to the given direction in the stereo image.
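A minimal sketch of a panning-type directional transfer function applied to subband samples is given below. The constant-power sine / cosine pan law, the angle convention (0° hard left, 90° hard right), and the names `panning_weights` and `render_directional` are assumptions chosen for illustration; the text does not specify a particular pan law.

```python
import math

def panning_weights(angle_deg, num_subbands):
    """Per-subband weights for an assumed constant-power pan law.

    angle_deg: 0 = hard left, 90 = hard right (assumed convention).
    Returns (left_weights, right_weights), one complex weight per subband.
    """
    a = math.radians(angle_deg)
    gl, gr = math.cos(a), math.sin(a)
    return [complex(gl)] * num_subbands, [complex(gr)] * num_subbands

def render_directional(s, h_left, h_right):
    """Multiply each downmix subband sample by the weight for each output channel."""
    left = [h * v for h, v in zip(h_left, s)]
    right = [h * v for h, v in zip(h_right, s)]
    return left, right

# Pan a two-subband downmix to the centre of the stereo image.
h_left, h_right = panning_weights(45.0, 2)
left, right = render_directional([1 + 1j, 2j], h_left, h_right)
```

The same `render_directional` step applies unchanged when the weights are frequency-dependent complex values, for example subband coefficients of a binaural transfer function.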

[0128] In many embodiments, the first rendering may be a binaural rendering, and the first intermediate stereo signal may be a binaural stereo signal providing an enhanced spatial experience / perception when heard through headphones. Thus, the first renderer 301 may specifically be a binaural audio renderer which generates binaural audio signals for the left and right ear of a user. Binaural audio signals are generated to provide a desired spatial experience and are typically reproduced by headphones or earphones that specifically may be part of a headset worn by a user (the headset typically also comprises left and right eye displays).

[0129] Thus, in many embodiments, the audio rendering by the first renderer 301 is a binaural render process using suitable binaural transfer functions to provide the desired spatial effect for a user wearing a headphone. For example, the first renderer 301 may be arranged to generate an audio component to be perceived to arrive from a specific position using binaural processing.

[0130] Binaural processing is known to be used to provide a spatial experience by virtual positioning of sound sources using individual signals for the listener’s ears. With an appropriate binaural rendering processing, the signals required at the eardrums in order for the listener to perceive sound from any desired direction can be calculated, and the signals can be rendered such that they provide the desired effect. These signals are then recreated at the eardrum using either headphones or a crosstalk cancelation method (suitable for rendering over closely spaced speakers). Binaural rendering can be considered to be an approach for generating signals for the ears of a listener resulting in tricking the human auditory system into perceiving that a sound is coming from the desired positions. The binaural rendering is based on binaural transfer functions which vary from person to person due to the acoustic properties of the head, ears and reflective surfaces, such as the shoulders. Binaural transfer functions may therefore be personalized for an optimal binaural experience. For example, binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized by convolving each sound source with the pair of e.g., Head Related Impulse Responses (HRIRs) that correspond to the position of the sound source.

[0131] A well-known method to determine binaural transfer functions is binaural recording. It is a method of recording sound that uses a dedicated microphone arrangement and is intended for replay using headphones. The recording is made by either placing microphones in the ear canal of a subject or using a dummy head with built-in microphones, a bust that includes pinnae (outer ears). The use of such dummy head including pinnae provides a very similar spatial impression as if the person listening to the recordings was physically present during the recording.

[0132] By measuring e.g., the responses from a sound source at a specific location in 2D or 3D space to microphones placed in or near human ears, the appropriate binaural filters can be determined. Based on such measurements, binaural filters reflecting the acoustic transfer functions to the user’s ears can be generated. The binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized e.g., by convolving each sound source with the pair of measured impulse responses for a desired position of the sound source. In order to create the illusion that a sound source is moving around the listener, a large number of binaural filters is typically required with a certain spatial resolution, e.g., 10 degrees.

[0133] The head related binaural transfer functions may be represented e.g., as Head Related Impulse Responses (HRIRs), or equivalently as Head Related Transfer Functions (HRTFs), or as Binaural Room Impulse Responses (BRIRs). The (e.g., estimated or assumed) transfer function from a given position to the listener’s ears (or eardrums) may for example be represented in the frequency domain in which case it is typically referred to as an HRTF or BRTF, or in the time domain in which case it is typically referred to as an HRIR or BRIR. In some scenarios, the head related binaural transfer functions are determined to include aspects or properties of the acoustic environment and specifically of the environment in which the measurements are made, whereas in other examples only the user characteristics are considered. Examples of the first type of functions are the BRIRs and BRTFs.

[0134] In the example, the audio render apparatus comprises a store 305 which stores directional transfer functions for different directions. The directional transfer function for a given direction represents the mapping of a mono audio signal to stereo channels such that the mono audio signal is positioned in the given direction in a stereo image of the stereo channels. Thus, applying the directional transfer function for a given direction to the mono downmix audio signal may generate a stereo signal representing the mono downmix audio signal as an audio source positioned in the given direction. The mapping may in some cases be a time domain mapping (such as a gain, filter or other transfer function) or may in many cases be a frequency domain mapping, such as a set of parameter values / scale values (typically complex values) for different subbands. In the latter case, a frequency domain intermediate stereo signal may be generated by for each subband multiplying the subband sample of the mono downmix audio signal with respectively a complex value for that subband for a first channel of the intermediate stereo signal and with a complex value for that subband for a second channel of the intermediate stereo signal.

[0135] For example, in examples where a panning is performed in the horizontal 2D plane, the store 305 may comprise panning parameters for different directions. For example, panning parameters for azimuth angles in a 0-360° interval may be provided for each 1° angle increment. The first renderer 301 may be coupled to the store 305 and be arranged to extract the directional transfer function for the rendering direction and then proceed to perform the rendering using the extracted directional transfer function. The (potentially gain compensated) mono downmix audio signal may accordingly be rendered such that it is positioned / perceived in the stereo image to arrive from the rendering position.

[0136] It will be appreciated that the store 305 may not have a directional transfer function stored for the desired rendering direction. In such cases, the first renderer 301 may be arranged to retrieve the nearest directional transfer function from the store 305 and use this for rendering. In such cases, the rendering direction may be considered to correspond to the direction for the retrieved directional transfer function, i.e. the rendered direction may be a quantized value γ’ of the desired rendering direction determined by the direction determining circuit 303.

[0137] In other embodiments, the first renderer may be arranged to estimate a desired directional transfer function for a desired rendering direction by interpolating between two directional transfer functions from the store 305 corresponding to the two rendering angles nearest to the desired rendering direction determined by the direction determining circuit 303.
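The two retrieval strategies, nearest stored direction and interpolation between neighbouring directions, can be sketched as follows. The store is modelled here as a plain dictionary mapping an angle in degrees to a pair of per-subband weight lists; this layout, the function names, and the use of simple linear interpolation of complex weights are assumptions for illustration. The sketch also ignores the wrap-around at 0° / 360°.

```python
def nearest_direction(store, angle):
    """Return (quantized_angle, transfer_function) for the stored angle closest to `angle`."""
    key = min(store, key=lambda a: abs(a - angle))
    return key, store[key]

def interpolated_transfer(store, angle):
    """Linearly interpolate the weights of the two stored angles bracketing `angle`."""
    angles = sorted(store)
    lo = max((a for a in angles if a <= angle), default=angles[0])
    hi = min((a for a in angles if a >= angle), default=angles[-1])
    if lo == hi:
        # Exact hit, or `angle` lies outside the stored range: no interpolation.
        return store[lo]
    t = (angle - lo) / (hi - lo)
    (lo_left, lo_right), (hi_left, hi_right) = store[lo], store[hi]
    left = [(1 - t) * p + t * q for p, q in zip(lo_left, hi_left)]
    right = [(1 - t) * p + t * q for p, q in zip(lo_right, hi_right)]
    return left, right

# Hypothetical single-subband store with entries at 0 and 10 degrees.
store = {0: ([1 + 0j], [0j]), 10: ([0j], [1 + 0j])}
quantized_angle, _ = nearest_direction(store, 3.0)
left, right = interpolated_transfer(store, 2.5)
```

Linear interpolation of complex weights is the simplest choice and can smear phase between neighbouring measured responses; interpolating magnitude and phase separately is a common alternative when that matters.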

[0138] In most embodiments, the first renderer 301 is as mentioned arranged to perform a binaural rendering and the directional transfer functions stored in the store 305 are binaural transfer functions. Thus, the store may store data describing binaural transfer functions for different directions. The binaural transfer functions may for example be HRTFs, BRIRs, or HRIRs. The store 305 may specifically store frequency subband complex values for each channel for each frequency subband for a range of different frequencies. The first renderer 301 may thus perform the binaural rendering by multiplying the subband samples of the mono downmix audio signal with the corresponding subband coefficients / complex values of the selected binaural transfer function to generate subband sample values of the intermediate binaural stereo signal.

[0139] It will be appreciated that in many embodiments, the directional transfer functions may be stored as a plurality of functions linked with different directions. For example, the store 305 may be a look-up table which can receive the rendering direction as an index and provide a set of values of the directional transfer function for that direction. The directional transfer function may for example be represented by individual subband values / coefficients, or may in other embodiments be represented by e.g. parameter values defining the directional transfer function operation (e.g. coefficients for the transfer function), a mathematical description / function from which suitable values of the transfer function can be generated etc.

[0140] Thus, the audio render apparatus of Fig. 1 comprises a processing path which generates an intermediate stereo signal comprising the mono downmix audio signal represented as an audio source at a specific position in the spatial image of the first intermediate stereo signal. The mono downmix audio signal may typically be represented as a point audio source at the given direction. The rendering may be adaptive with the direction being given by e.g. the spatial parameters and thus may be dynamically adapted to reflect the characteristics of the stereo signal.

[0141] In addition, the audio render apparatus of Fig. 1 comprises a second processing path which generates a second intermediate stereo signal.

[0142] The first residual signal is fed to a second renderer 307 which is arranged to perform a second rendering being a rendering of the first residual signal (and in many cases the second residual signal) to generate a second intermediate stereo signal (and typically a third intermediate stereo signal). However, in contrast to the first rendering process, the second rendering process is typically a predetermined rendering which is not dependent on the spatial parameters, and which typically is not depending on properties of the stereo signal. The second rendering may typically be a diffuse rendering seeking to generate the second intermediate stereo signal to provide a perception of a more diffuse and spatially less definite audio source. The second rendering is specifically a predetermined rendering employing a predetermined mapping of the residual signal to channel signals of the second intermediate stereo signal.

[0143] As a specific example, the second rendering may generate the second intermediate stereo signal by simply mapping the first residual signal to two phase inverse signals, i.e. the second intermediate stereo signal may be generated with the first decorrelated mono downmix audio signal being mapped to both channels but with a 180° phase offset between them (the first residual signal may specifically be inverted for one of the channels). For example, in some embodiments, the first residual signal may be mapped to the right and left signals of the second intermediate stereo signal but with the mapping being 180° out of phase for the two channels of the second intermediate stereo signal.

[0144] The first renderer 301 and second renderer 307 are coupled to a combiner 309 which is arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate an output stereo signal. In many embodiments, the combiner 309 may be arranged to combine / sum the samples / values of the individual channels of the first and second intermediate stereo signals to generate the samples / values of the output stereo signal. In many cases, the combination may be performed by combining / summing subband values of the intermediate stereo signals. In other embodiments, the combination may be performed in the time domain by combining / summing time-domain values of the intermediate stereo signals. In many embodiments, the combination of the intermediate stereo signals may be by a (possibly weighted) combination / summation of corresponding channel signals for the first intermediate stereo signal and the second intermediate stereo signal.
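The phase-inverse mapping of a residual signal and the (possibly weighted) summation in the combiner can be sketched as below. The function names, the `gain` parameter, and the `weights` argument are hypothetical and only illustrate the structure; the values used are arbitrary.

```python
def render_residual_antiphase(d, gain=1.0):
    """Map a residual to both channels with a 180 degree phase offset between them."""
    left = [gain * v for v in d]
    right = [-gain * v for v in d]  # inverted copy for the other channel
    return left, right

def combine(*stereo_signals, weights=None):
    """Weighted sum of corresponding channels of several (left, right) signals."""
    if weights is None:
        weights = [1.0] * len(stereo_signals)
    n = len(stereo_signals[0][0])
    left = [sum(w * sig[0][k] for w, sig in zip(weights, stereo_signals)) for k in range(n)]
    right = [sum(w * sig[1][k] for w, sig in zip(weights, stereo_signals)) for k in range(n)]
    return left, right

# One subband: a centred directional component plus an anti-phase residual.
directional = ([1 + 0j], [1 + 0j])
diffuse = render_residual_antiphase([0.5j])
out_left, out_right = combine(directional, diffuse)
```

Because the mapping of the residual is fixed, it needs no per-tile analysis; only the directional path depends on the spatial parameters.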

[0145] The audio render apparatus in Fig. 1 accordingly generates an output stereo signal which is the combination of a directional rendering putting an audio source at a desired position as determined from the received spatial parameters, and of a predetermined rendering providing a more diffuse and decorrelated perception of the corresponding audio source. The approach provides two parallel rendering processes / paths for the mono downmix audio signal with the rendered results being combined to generate the output stereo signal.

[0146] In addition to the described flexible and adaptable generation of the output signal to provide an output stereo signal that includes both a directionally rendered component and a more diffuse / predeterminedly rendered component, the audio render apparatus may in some embodiments adapt / control the relative level between these components, e.g. by adapting weights of the combination.

[0147] In many cases, the second renderer 307 may be arranged to render a plurality of residual signals. For example, the described rendering for a single residual signal may be repeated for each of a plurality of residual signals to generate an intermediate stereo signal. These intermediate stereo signals may then be combined into the second stereo signal which subsequently (or as part of the same operation) may be combined into the output stereo signal.

[0148] The renderer may accordingly perform a third rendering function which generates a third intermediate stereo signal by rendering of the second residual signal. The third intermediate signal may then be combined with the first intermediate stereo signal and the second intermediate stereo signal (e.g. in one or multiple steps).

[0149] The audio apparatus is accordingly arranged to generate an output stereo signal, and often an output binaural stereo signal from the received mono downmix audio signal and spatial parameters. The audio apparatus specifically implements two different rendering paths with one being a directional (binaural) rendering of a directional (e.g. a dominant) signal component while the other is a predetermined rendering / mapping of a decorrelated audio signal generated from the mono downmix audio signal. The rendering of the output stereo signal is not a conventional adaptive upmixing of the received and decorrelated mono signals, and is specifically not a conventional 2x2 matrix upmixing of the mono signal and a decorrelated signal, but rather is a direct generation of a stereo signal by parallel processing of respectively the mono downmix audio signal and one or more residual signals, with the former rendering being directional dependent on the spatial parameters and the latter rendering being a predetermined rendering.

[0150] The processing seeks to render the mono downmix audio signal as a direct / dominant / directional component using a direct rendering with a direction that is e.g. given by the spatial parameters. The rendering employs a directionally dependent transfer function to the left and right stereo output signal for that purpose. The approach further seeks to render a residual / remaining signal component as a more diffuse signal, and specifically it uses a predetermined rendering where a decorrelated signal is mapped directly to the channels of the output binaural signal using a transfer function. The mapping is predetermined and may specifically be such that it allows a more diffuse and non-directional perception of this signal component. The rendering process thus uses fundamentally different approaches to provide different signal components in the output binaural signal.

[0151] The rendering direction determined by the direction determining circuit 303 may in many embodiments be based on the downmix weights (or equivalently in many scenarios the residual weights). The downmix weights are dependent on the spatial parameters and accordingly the rendering direction γ’ may often be determined from the spatial parameters.

[0152] In particular, in many embodiments, the rendering direction γ’ may be determined from spatial parameters such as specifically the following spatial parameters (per subband, where in the following k denotes the frequency bin, with a subband potentially including more than one frequency bin):

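[0153] $$\mathrm{IID}[b] = \frac{\sum_{k \in b} |x_l[k]|^2}{\sum_{k \in b} |x_r[k]|^2}$$

[0154] $$\mathrm{IPD}[b] = \arg\Big(\sum_{k \in b} x_l[k]\, x_r^{*}[k]\Big)$$

[0155] $$\mathrm{ICC}[b] = \frac{\big|\sum_{k \in b} x_l[k]\, x_r^{*}[k]\big|}{\sqrt{\sum_{k \in b} |x_l[k]|^2 \cdot \sum_{k \in b} |x_r[k]|^2}}, \qquad 0 \le \mathrm{ICC} \le 1$$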

[0158] where for coherent left-right channel components ICC[b] → 1, while for uncorrelated channels ICC[b] → 0.
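The per-subband definitions above can be sketched in Python as follows. The function name `spatial_parameters`, the representation of subbands as (start, stop) bin ranges, and the `eps` guard against division by zero are assumptions made for this example.

```python
import cmath
import math

def spatial_parameters(xl, xr, band_edges, eps=1e-12):
    """Compute (IID, IPD, ICC) per subband from complex frequency bins.

    xl, xr: complex bins of the left / right channel for one time segment.
    band_edges: list of (start, stop) bin index ranges, stop exclusive.
    """
    params = []
    for start, stop in band_edges:
        bl, br = xl[start:stop], xr[start:stop]
        pl = sum(abs(v) ** 2 for v in bl)                 # left power in the band
        pr = sum(abs(v) ** 2 for v in br)                 # right power in the band
        cross = sum(a * b.conjugate() for a, b in zip(bl, br))
        iid = pl / max(pr, eps)                           # power ratio, not in dB
        ipd = cmath.phase(cross)                          # radians in [-pi, pi]
        icc = abs(cross) / max(math.sqrt(pl * pr), eps)
        params.append((iid, ipd, icc))
    return params

# Two subbands: identical channels, then a 90 degree offset with a level difference.
xl = [1 + 0j, 1j, 2 + 0j]
xr = [1 + 0j, 1j, -1j]
params = spatial_parameters(xl, xr, [(0, 2), (2, 3)])
```

A subband holding a single bin always yields ICC = 1 in this sketch, so a meaningful correlation estimate needs several bins per subband or averaging over time.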

[0159] In the case of an ITD (interchannel time difference) being used, this can be estimated per band using cross-correlation methods using the maximum delay in metadata m to limit the bounds wherein the peak search is performed.

[0160] In such a case, the rendering direction γ’ may be determined as an angle of a principal component of the received signal, henceforth also referred to as a (per frequency band) orientation direction γ:

[0161] $$\gamma = \tan^{-1}\left(\frac{1 - \mathrm{IID} + \sqrt{4\,\mathrm{IID}\,\mathrm{ICC}^2 + (\mathrm{IID} - 1)^2}}{2\,\mathrm{ICC}\,\sqrt{\mathrm{IID}}}\right)$$
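A minimal sketch of the orientation angle computation is shown below. It evaluates the numerator and denominator of the expression above and uses `math.atan2` in place of a plain arctangent so that a zero denominator (ICC = 0 or IID = 0) does not raise an error; that substitution and the function name `orientation_angle` are choices made for this example. IID is taken as a linear power ratio, as in the per-subband definition.

```python
import math

def orientation_angle(iid, icc):
    """Angle (radians) of the principal component for a power ratio IID and correlation ICC."""
    num = 1.0 - iid + math.sqrt(4.0 * iid * icc ** 2 + (iid - 1.0) ** 2)
    den = 2.0 * icc * math.sqrt(iid)
    # num >= 0 and den >= 0, so the result lies in [0, pi/2].
    return math.atan2(num, den)

# Equal-level, fully coherent channels.
gamma_centre = orientation_angle(1.0, 1.0)
# Left channel 20 dB stronger than the right, fully coherent.
gamma_left = orientation_angle(100.0, 1.0)
```

For equal levels and full coherence the angle is π/4, and it moves towards 0 as the left channel dominates and towards π/2 as the right channel dominates. For IID = 1 and ICC = 0 no direction is defined, and this sketch returns 0 by virtue of `atan2(0, 0)`.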

[0165] Thus, in many embodiments, the rendering direction γ’ may be determined as or from (e.g. using a predetermined mapping) a direction γ of a principal component of the stereo signal. The spatial parameters may as previously mentioned often be calculated by the audio render apparatus for a given input stereo signal. However, in other embodiments, the spatial parameters may be received together with the input stereo signal, e.g. as part of a single data signal / bitstream.

[0166] The direction determining circuit 303 may accordingly be arranged to determine the direction from the spatial parameters. The spatial parameters provide information on the relationship between the channels of the stereo signal that are downmixed and as such provide information of the position / orientation of the audio, and specifically of a dominant signal component in the stereo image of the stereo signal. For example, the spatial parameters may provide information of the position of the dominant signal component in the stereo signal, and specifically it provides information of an orientation angle for the dominant signal.

[0167] The direction determining circuit 303 may specifically determine the rendering direction y’ from the spatial parameters. The rendering direction will be determined on a frequency tile basis, and specifically in frequency subbands and time segments matching those for which the spatial parameters are provided.

[0168] Different approaches for determining the rendering direction from the spatial parameters may be used in different embodiments. In particular, the signal model as indicated above is based on directional component x being at a direction y in the stereo image of the stereo signal, henceforth also referred to as the orientation direction y. In many embodiments, the direction determining circuit 303 may determine the orientation direction y and then determine the (desired) rendering direction y’ from the orientation direction y. The orientation direction y is accordingly an estimation of a point source direction in a stereo image. Indeed, in some embodiments or scenarios, the rendering direction y’ may simply be set equal to the orientation direction y.

[0169] The determination of the orientation direction y may be based on the signal model indicated above. The spatial parameters provide information on the relative properties of the channel signals of the stereo signal and specifically they may provide information on both the interchannel levels / intensity differences as well as on the interchannel correlation. Accordingly, the spatial parameters can be considered to provide information on the directional signal component x and on the position of this in the stereo image of the stereo signal, i.e. the spatial parameters provide information on the orientation direction y allowing this to be determined from the provided parameter values.

[0170] The direction determining circuit 303 may determine the orientation direction y as a direction to a directional signal component in a stereo image of the stereo signal from the spatial parameters, and map this to a direction in a stereo image of the output stereo signal. The directional signal component may be a dominant signal component. The direction determining circuit 303 may be arranged to determine the orientation direction y as a direction of a dominant sound source in the stereo signal where the direction of the dominant sound source is represented by the spatial parameters.

[0171] The directional signal component may specifically be a signal component (estimated / determined) to originate from a point source. Specifically, the direction determining circuit 303 may be arranged to determine the orientation direction y as a direction for which a single point source audio source will result in spatial parameter values matching the spatial parameters of the data signal.

[0172] In some embodiments, the direction determining circuit may as mentioned be arranged to determine the first direction in line with:

[0173] γ = arctan((1 - IID + √((IID - 1)² + 4·ICC²·IID)) / (2·ICC·√IID))

[0175] where IID is an interchannel intensity difference and ICC is an inter-channel cross-correlation, and specifically with these given by the equations provided above in connection with the equations for determining gains.
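
The arctangent expression for the orientation direction can be evaluated per subband as in the following minimal sketch. The function name `orientation_direction` is hypothetical, and the use of `math.atan2` instead of a plain arctangent of the quotient is a choice made here so that the case ICC = 0 (zero denominator) does not raise an error; for a positive denominator both give the same value.

```python
import math

def orientation_direction(iid, icc):
    """Orientation direction gamma in radians from IID and ICC of one subband.

    Hypothetical helper; iid > 0 and 0 <= icc <= 1 are assumed.
    """
    num = 1.0 - iid + math.sqrt((iid - 1.0) ** 2 + 4.0 * icc ** 2 * iid)
    den = 2.0 * icc * math.sqrt(iid)
    # num >= 0 and den >= 0, so atan2 equals arctan(num / den) whenever den > 0.
    return math.atan2(num, den)

gamma_centre = orientation_direction(1.0, 1.0)   # equal levels, fully coherent
gamma_left = orientation_direction(4.0, 1.0)     # left channel 6 dB stronger
```

With equal channel levels the direction is 45 degrees, i.e. the middle of the 0 to 90 degree range this expression can produce, and a stronger left channel moves it towards 0.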

[0176] The orientation direction y determined by the direction determining circuit 303 may, as previously mentioned, in some embodiments be used directly as the rendering direction y’, i.e. y’ = y. However, in many embodiments, a mapping may be included which for at least some values of the orientation direction y may result in a different rendering direction y’.

[0177] Thus, in many embodiments, the direction determining circuit 303 may be arranged to apply a mapping function to the orientation direction y to determine the rendering direction y’.

[0178] For example, the mapping may map the position in the stereo image of the original stereo signal as represented by the orientation direction y to a desired position in the stereo image of the output stereo signal as represented by the rendering direction y’. In many cases, where the output stereo signal is a binaural signal, the mapping may include a consideration / determination of a distance to the audio sources. For example, a range of the orientation direction y in the interval of [0,180°] may be mapped to a location between two virtual stereo speakers in the audio scene created by the binaural rendering. Such speakers may for example be positioned at angles of -30° and +30° relative to a center direction for the binaural signal. Thus, in such situations, the direction determining circuit 303 may include a mapping between an orientation direction y in the range of [0, 180°] to a rendering direction y’ in the range of [-30°, +30°].
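
One possible mapping of this kind is sketched below. The text does not specify the shape of the mapping, so the linear interpolation, the clamping of out-of-range inputs, the choice of which end of the input range maps to -30°, and the name `map_direction` are all assumptions made for illustration.

```python
def map_direction(gamma_deg, in_range=(0.0, 180.0), out_range=(-30.0, 30.0)):
    """Map an orientation direction (degrees) linearly onto a rendering direction.

    Hypothetical helper: a linear map is only one of many possible mappings.
    """
    lo_in, hi_in = in_range
    lo_out, hi_out = out_range
    g = min(max(gamma_deg, lo_in), hi_in)          # clamp to the input interval
    return lo_out + (g - lo_in) * (hi_out - lo_out) / (hi_in - lo_in)

centre = map_direction(90.0)     # middle of the input range
edge = map_direction(180.0)      # upper end of the input range
```

A non-linear or table-based mapping could be substituted without changing the surrounding processing.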

[0179] Thus, in some embodiments, the directional component (the mono downmix audio signal) may be rendered to a virtual angle in the range of a virtual loudspeaker angle range generated by a binaural rendering. The rendered directional component may be combined with a diffuse rendering of the residual signal(s).

[0180] In many embodiments, the direction determining circuit 303 may be arranged to map an orientation direction y representing an angle in one interval / range to a rendering direction y’ representing an angle in a different interval / range. In the previous examples, the rendering has been based on one intermediate stereo signal representing the diffuse signal component. However, in many embodiments, there may be two (or possibly more) parallel paths for the rendering of the residual / non-directional signal components.

[0181] The predetermined rendering of the second renderer 307 may as previously mentioned simply be achieved by rendering the corresponding decorrelated signal in one channel of the corresponding intermediate stereo signal, and with no signal being included in the other channel. For example, the second residual signal may be rendered in the left channel of the second intermediate stereo signal and the first residual signal may be rendered in the right channel of the third intermediate stereo signal.

[0182] In some embodiments where binaural processing is used, each of the residual signals may be rendered from a specific position, such as each decorrelated signal being rendered from a different virtual position, such as for example from different virtual (loudspeaker) positions.

[0183] In some embodiments, the rendering for a residual signal, such as the rendering of the first residual signal, may be to position the signal at a specific position.

[0184] In many embodiments, the rendering of a residual signal may be performed by the second renderer 307 retrieving a set of directional transfer functions from the store 305 and rendering the residual signal using the retrieved transfer function(s).

[0185] In many embodiments, the renderer 301, 307 may be arranged to extract a directional transfer function for a single predetermined direction and to render the residual signal using this directional transfer function. Accordingly, the residual signal may be rendered from one predetermined direction / position, such as a direction / position corresponding to a virtual speaker position.

[0186] An example of subband parametric rendering may e.g. result in left and right signals:

[0187] l = g_x · m · G_l[f(γ)] · e^{jφ_l[f(γ)]} + g_n · H₁{m} · G_l[β_l] · e^{jφ_l[β_l]} + g_n · H₂{m} · G_l[β_r] · e^{jφ_l[β_r]}

[0188]

[0189] r = g_x · m · G_r[f(γ)] · e^{jφ_r[f(γ)]} + g_n · H₁{m} · G_r[β_l] · e^{jφ_r[β_l]} + g_n · H₂{m} · G_r[β_r] · e^{jφ_r[β_r]}

[0190] where G_l, G_r, φ_l, φ_r form the parametric HRIRs, f(γ) is a mapping function converting the estimated angles (orientation direction γ) to HRIR direction angles, β_l and β_r are two pre-determined angles, and H₁{.} and H₂{.} are two mutually independent optional decorrelators (if no decorrelator is present, the responses H₁{.} and H₂{.} may simply be considered unity responses).
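
The two output equations can be read as the following per-subband computation. This is only a sketch: the parametric HRIRs are represented by hypothetical callables `hrtf_l` and `hrtf_r` that return a (gain, phase) pair for an angle, the decorrelator outputs H₁{m} and H₂{m} are passed in as already computed values, and all names are assumptions for this example.

```python
import cmath

def hrtf_term(hrtf, angle):
    """Complex factor G[angle] * exp(j * phi[angle]) from a (gain, phase) lookup."""
    gain, phase = hrtf(angle)
    return gain * cmath.exp(1j * phase)

def render_subband(m, h1_m, h2_m, g_x, g_n, hrtf_l, hrtf_r, f_gamma, beta_l, beta_r):
    """Return (l, r) for one subband value m of the mono downmix.

    h1_m, h2_m: decorrelator outputs H1{m}, H2{m} (pass m itself if no decorrelator).
    f_gamma: already mapped rendering angle f(gamma); beta_l, beta_r: fixed angles.
    """
    out = []
    for hrtf in (hrtf_l, hrtf_r):
        out.append(g_x * m * hrtf_term(hrtf, f_gamma)
                   + g_n * h1_m * hrtf_term(hrtf, beta_l)
                   + g_n * h2_m * hrtf_term(hrtf, beta_r))
    return out[0], out[1]

# Example with a trivial, direction-independent response (unit gain, zero phase).
flat = lambda angle: (1.0, 0.0)
l, r = render_subband(1 + 1j, 0.5, -0.5j, 2.0, 0.5, flat, flat, 0.0, -30.0, 30.0)
```

With real HRIR data the two callables would differ per ear and per angle, which is what produces the binaural cues.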

[0193] In some embodiments, the second renderer 307 may retrieve directional transfer functions for a plurality of predetermined directions and it may use multiple directional transfer functions in performing the predetermined rendering. For example, different directional transfer functions may be used for different frequency subbands. This may provide a more diffuse perception with the audio being generated such that it is perceived from different directions for different subbands thereby resulting in a perception of a more distributed and spread audio source. Such approaches may be used both in embodiments in which a single residual signal is generated and rendered, or indeed in cases where multiple residual signals are generated and rendered. In the latter case, the sets of predetermined directions for the different residual signals are different in order to enhance the perceived diffuseness of the non-directional signal component.

[0194] The second renderer 307 may generate the second intermediate stereo signal using a first set of directional transfer functions retrieved from the store 305 for a first set of predetermined directions, and may generate the third intermediate stereo signal using a second set of directional transfer functions retrieved from the store 305 for a second set of predetermined directions where the first set of predetermined directions is different from the second set of predetermined directions.

[0195] In particular, the directional transfer functions may be binaural transfer functions and the second renderer 307 may be arranged to perform binaural rendering using binaural impulse response values for a first set of predetermined directions and may be arranged to perform binaural rendering using binaural impulse response values for a second set of predetermined directions where the first set of predetermined directions is different from the second set of predetermined directions.

[0196] In many cases, the use of multiple directional transfer functions may be achieved by using directional transfer functions for different directions in different frequency subbands.

[0197] Thus, instead of rendering the diffuse / non-directional signals using fixed angles, e.g. mimicking a virtual stereo speaker setup, the diffuse signals may also be rendered using composite, e.g. pre-calculated HRIRs for many sources / directions, e.g. spread over a (part of a) circle, or (part of) a sphere.

[0198] l = g_x · m · G_l[f(γ)] · e^{jφ_l[f(γ)]} + g_n · H₁{m} · G_l,comp · e^{jφ_l,comp} + g_n · H₂{m} · G_l,comp · e^{jφ_l,comp}

r = g_x · m · G_r[f(γ)] · e^{jφ_r[f(γ)]} + g_n · H₁{m} · G_r,comp · e^{jφ_r,comp} + g_n · H₂{m} · G_r,comp · e^{jφ_r,comp}

[0199] where e.g.:

[0200] G_l,comp = g_norm · |Σ_{β∈B_l} G_l[β] · e^{jφ_l[β]}|

[0203] G_r,comp = g_norm · |Σ_{β∈B_r} G_r[β] · e^{jφ_r[β]}|

[0206] with B_l being a set of angles at which the left diffuse signal is to be rendered, B_r a set of angles at which the right diffuse signal is to be rendered, and g_norm a normalisation factor.
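
A composite gain of this form could be pre-calculated as in the sketch below. The (gain, phase) lookup `hrtf`, the function name `composite_gain` and the example normalisation of one over the number of angles are assumptions; the text does not state how g_norm is chosen.

```python
import cmath

def composite_gain(hrtf, angles, g_norm):
    """Return g_norm * | sum over beta in angles of G[beta] * exp(j * phi[beta]) |.

    hrtf: hypothetical callable mapping an angle to a (gain, phase) pair.
    """
    total = 0j
    for beta in angles:
        gain, phase = hrtf(beta)
        total += gain * cmath.exp(1j * phase)
    return g_norm * abs(total)

# Four directions with identical unit responses add coherently.
flat = lambda beta: (1.0, 0.0)
g_comp = composite_gain(flat, [30.0, 60.0, 90.0, 120.0], g_norm=0.25)

# Two directions with opposite phase cancel.
opposite = lambda beta: (1.0, 0.0) if beta < 0 else (1.0, cmath.pi)
g_cancel = composite_gain(opposite, [-30.0, 30.0], g_norm=0.5)
```

Because the magnitude is taken after summation, responses with differing phases partially cancel, which is why a normalisation factor is needed.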

[0207] In some embodiments, a single residual signal component may be directly rendered onto left and right channels without any HRIR processing.

[0208] The weight processor 105 may be arranged to determine the downmix weights based on the subband spatial parameters in accordance with different equations and functions in different embodiments.

[0209] In many embodiments, the weight processor 105 may be arranged to determine the downmix weights such that the resulting mono downmix audio signal is a principal signal component for the stereo signal. In particular, the weight processor 105 may perform a principal component analysis / calculation based on the spatial parameters within each subband to determine the downmix weights that will recreate the downmix signal as the principal signal component for that subband.

[0210] The weight processor 105 may accordingly exploit that the principal component for the stereo signal may be determined as function of the spatial parameters. The approach may further be based on a consideration that the principal signal component corresponds to the directional signal component for the above indicated signal model, i.e. it may correspond to the signal component x.

[0211] The weight processor 105 may specifically be arranged to generate the mono downmix audio signal to maximize coherent stereo components by phase / time-aligning these components.

[0212] In many embodiments, the weight processor (105) may be arranged to determine the frequency subband downmix weights to maximize a power level of the mono downmix audio signal; minimize a power level of the first residual signal; and / or minimize a correlation of the mono downmix audio signal and the first residual signal.

[0213] For example, the power of the downmix signal is maximized as a result of rotating the first and second input stereo channels along the principal eigenvector, which corresponds to the component of maximum variance (largest eigenvalue) of the input signal. Since the first residual signal is generated by removing this principal component from the first input channel, the power level of the first residual signal is correspondingly reduced.

[0214] The weight processor 105 may be arranged to determine the downmix weights such that they maximize the signal power of the downmix signal under a magnitude / power / energy constraint. The weight processor 105 may proceed to do so by determining the weights to be a combination of phase values (phase alignment) and amplitude values that maximizes the energy of the downmix (under the constraint). The principal component may be determined by a principal component analysis of the input stereo signal.

[0215] The weight processor 105 may accordingly determine the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal. The spatial parameters reflect interchannel parameters / properties and thus can provide information of the relative phase differences for the subbands. Accordingly, the values of the downmix weights resulting in phase alignment can be determined from the spatial parameters.

[0216] In many embodiments, the weight processor 105 may be arranged to determine the frequency subband downmix weights to have magnitudes that are dependent on the level of correlation between the mono downmix audio signal / principal component and the channels of the stereo signal.

[0217] For example, if the principal component is mainly aligned with the left channel signal, the magnitude of the weight for the left channel signal is higher than for the right channel signal, and vice versa. The weight processor 105 may be arranged to determine the weights such that the larger the relative contribution to the principal component of one channel, the larger the relative magnitude of the weight for that channel. Further, the larger the correlation between the channel signal and the principal component, the larger the contribution for that channel to the principal component. In many embodiments, the weight for a first (or each) channel of the signal is monotonically increasing with a level of correlation between the first (or each) channel and the mono downmix audio signal / principal component of the stereo signal.

[0218] The determination of the downmix weights may as mentioned be subject to an energy / power / level constraint on the downmix weights.

[0219] The downmix weights may in many embodiments be determined to ensure power preservation with the combined power / energy level of the intermediate signals corresponding to the power level of intermediate stereo signal, i.e. such that there is power preservation between the input stereo signal and the output stereo signal.

[0220] Specifically, in many embodiments, the determination of the downmix weights for a subband may be subject to the constraint / requirement that the sum square magnitude of the subband downmix weights is one, i.e.

[0221] |w_l|² + |w_r|² = 1.

[0222] In the following, an example of a derivation of downmix weights corresponding to the principal component from the spatial parameters will be provided. The description uses a simpler notation with the subband / frequency bin index being omitted for brevity. The values expressed in the derivation are per subband.

[0223] We start by expressing the covariance matrix R between left and right stereo channels in terms of the IID, IPD / ITD, and ICC, and use l and r to denote the left and right subband signals for simplicity. Furthermore, we use the notation ⟨x, y⟩ = y^H x to denote the Hermitian inner product between x and y,

[0224] R = [l; r] · [l* r*]

[0226] Dividing the matrix by √(⟨l, l⟩⟨r, r⟩),

[0227] R / √(⟨l, l⟩⟨r, r⟩) = [√(⟨l, l⟩ / ⟨r, r⟩), ⟨l, r⟩ / √(⟨l, l⟩⟨r, r⟩); ⟨r, l⟩ / √(⟨l, l⟩⟨r, r⟩), √(⟨r, r⟩ / ⟨l, l⟩)]

[0232] If we equivalently redefine the IID as

[0233] IID = ⟨l, l⟩ / ⟨r, r⟩

[0234] and the complex normalized cross-correlation ρ as

[0235] ρ = ⟨l, r⟩ / √(⟨l, l⟩⟨r, r⟩)

[0237] then the normalized matrix becomes

[0238] [√IID, ρ; ρ*, √(1 / IID)]

[0241] The complex normalized cross-correlation can further be expressed in terms of the IPD and ICC,

[0242] ρ = ICC·e^{jIPD}

allowing us to finally express the covariance matrix as

[0243] [√IID, ICC·e^{jIPD}; ICC·e^{-jIPD}, √(1 / IID)]

[0244] Performing a principal component analysis on the matrix R, we obtain the following eigenvalues:

[0245] λ_{1,2} = ((IID + 1) ± √(4·IID·ICC² + (IID - 1)²)) / (2√IID)

[0249] where the maximum eigenvalue is

[0250] λ_max = ((IID + 1) + √(4·IID·ICC² + (IID - 1)²)) / (2√IID)

[0251] (plus sign between the two terms in the numerator) given the possible range of values for the IID. The corresponding eigenvector is given by

[0252] v = [2ICC√IID·e^{jIPD} / (1 - IID + √(4·IID·ICC² + (IID-1)²)); 1]

[0255] It is noted that without loss of generality, an overall phase term can be added to both the elements of v:

[0256] v = [2ICC√IID·e^{j(IPD+φ)} / (1 - IID + √(4·IID·ICC² + (IID-1)²)); e^{jφ}]

[0257]

[0258] For energy preservation, the principal eigenvector is normalized,

[0260] w = [w_l; w_r] = v / ‖v‖

[0261] such that |w_l|² + |w_r|² = 1.
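
The derivation can be condensed into a few lines of code, as in this minimal sketch. The function name `downmix_weights` is hypothetical, and the sketch assumes that the denominator of the eigenvector expression is non-zero, which excludes the degenerate cases with ICC = 0 and IID ≥ 1.

```python
import cmath
import math

def downmix_weights(iid, ipd, icc):
    """Normalised principal eigenvector (w_l, w_r) from IID, IPD and ICC.

    Hypothetical helper; assumes 1 - IID + sqrt(4*IID*ICC^2 + (IID-1)^2) != 0.
    """
    root = math.sqrt(4.0 * iid * icc ** 2 + (iid - 1.0) ** 2)
    denom = 1.0 - iid + root
    v_l = 2.0 * icc * math.sqrt(iid) * cmath.exp(1j * ipd) / denom   # first element of v
    v_r = 1.0                                                        # second element of v
    norm = math.sqrt(abs(v_l) ** 2 + v_r ** 2)                       # ||v||
    return v_l / norm, v_r / norm

# Equal levels, fully coherent, 0.7 rad phase difference.
w_l, w_r = downmix_weights(1.0, 0.7, 1.0)
# Left channel 6 dB stronger, in phase.
u_l, u_r = downmix_weights(4.0, 0.0, 1.0)
```

In the first example both weights have magnitude 1/√2 and the left weight carries the phase e^{j0.7}; in the second the weights are proportional to [2; 1].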

[0262] Thus, in many embodiments a ratio between the downmix weights for the two channels may be determined as:

[0263] w_l / w_r = 2ICC√IID·e^{jIPD} / (1 - IID + √(4·IID·ICC² + (IID-1)²))

and the corresponding eigenvalue of the principal / directional component may be given by:

[0264] λ = ((IID + 1) + √(4·IID·ICC² + (IID - 1)²)) / (2√IID)

[0267] In the following some specific considerations of the performance / operation are provided based on the specific approach of FIG. 2.

[0268] Given the structure of FIG. 2 and

[0269] S = w^H · [l; r], D_l = l - w_l·S, D_r = r - w_r·S,

[0271] w^H·w = |w_l|² + |w_r|² = 1,

[0272] it follows that the total energy in the input equals that of the total energy of the outputs (the estimated directional component and two residual signals), i.e.,

[0273] |S[b]|² + |D_l[b]|² + |D_r[b]|² = |X_l[b]|² + |X_r[b]|²

[0277] Furthermore, based on the downmix weights and the estimation of residual components, it can be shown that the effective transformation (I - WW^H) from stereo input channels to residuals is orthogonal to the downmix weight transformation W^H, i.e. W^H(I - WW^H) = 0 under the constraint W^H·W = 1.
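
Both properties, energy preservation and orthogonality of the residuals to the downmix weights, can be checked numerically for a single subband value. The sketch below uses a hypothetical helper `decompose` and arbitrary example values; it only illustrates the relations stated above for unit-norm weights.

```python
import cmath
import math

def decompose(x_l, x_r, w_l, w_r):
    """Return (S, D_l, D_r) with S = w^H [x_l; x_r] and D = x - w * S."""
    s = complex(w_l).conjugate() * x_l + complex(w_r).conjugate() * x_r
    d_l = x_l - w_l * s
    d_r = x_r - w_r * s
    return s, d_l, d_r

# Unit-norm example weights and arbitrary subband values.
w_l = cmath.exp(0.7j) / math.sqrt(2)
w_r = 1 / math.sqrt(2)
x_l, x_r = 0.3 - 1.2j, -0.8 + 0.5j

s, d_l, d_r = decompose(x_l, x_r, w_l, w_r)
energy_in = abs(x_l) ** 2 + abs(x_r) ** 2
energy_out = abs(s) ** 2 + abs(d_l) ** 2 + abs(d_r) ** 2
# w^H applied to the residuals vanishes because w^H (I - w w^H) = 0.
projection = w_l.conjugate() * d_l + w_r * d_r
```

The equality of `energy_in` and `energy_out` relies on the unit-norm constraint; with unnormalised weights it would not hold.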

[0280] In the approach, the filters / weights w_l and w_r are determined based on the time-frequency characteristics of the stereo signal as described by the IID, IPD / ITD, and ICC parameters, time-aligning and scaling the point source estimates per subband. In fact, the approach estimates a filter to match the acoustic response between a point source and the capture points for respectively the left and right channels of the stereo signal.

[0281] To understand this better, an example can be considered where IID = 1. Furthermore, assume that the left and right stereo channels are given by (this example can be seen as a simplified far-field example):

[0282] X_l[b] = S[b] + D_l[b], X_r[b] = S[b]·e^{-jIPD[b]} + D_r[b].

[0283] Accordingly, the point source signal is simply delayed on the right channel. Assuming that the additive residual signals D_l[b] and D_r[b] are mutually uncorrelated and uncorrelated with S[b], the SNR (Signal to Noise Ratio) of the left and right individual channels is given by:

[0284] SNR_l[b] = SNR_r[b] = σ²_S[b] / σ²_D[b]

[0285]

[0286] where σ²_S[b] = E{S[b]S*[b]} and σ²_D[b] = E{D_l[b]D*_l[b]} = E{D_r[b]D*_r[b]}.

[0287] In the following, two scenarios are considered, namely high and low SNR relative to the background residual noise signal.

[0288] High SNR

[0289] For subbands where the SNR is high (e.g. during onsets or high energy portions of the directional source, where the ICC → 1), the resulting principal eigenvector w as determined above corresponds to the direction of the audio source,

[0290] w[b] ≈ (1 / √2) · [e^{jIPD[b]}; 1]

[0291] This means that the principal component or estimate of the directional audio source is given by

[0292] S[b] = [X_l[b] X_r[b]]·w*[b] = (1 / √2)·(e^{-jIPD[b]}·X_l[b] + X_r[b])

[0293]

[0294] The resulting SNR of the principal component has increased by a factor of 2, or approximately 3 dB,

[0295] SNR_PC[b] = 2·σ²_S[b] / σ²_D[b], assuming that S, D_l, and D_r are uncorrelated. The higher SNR means that it is easier to binauralize the directional audio source.

[0296] In other words, the left subband signal is first phase-aligned (delayed) with respect to the right subband signal by IPD[b] before summing and normalizing and the diffuse noise terms average out. This produces a scaled and delayed version of the phantom source. The resulting left and right residual subband signals are given by:

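[0297] D̂_l[b] = X_l[b] - w_l[b]·S[b] = X_l[b] - (1 / √2)·e^{jIPD[b]}·S[b] = ½·D_l[b] - ½·e^{jIPD[b]}·D_r[b]

[0298] D̂_r[b] = X_r[b] - w_r[b]·S[b] = X_r[b] - (1 / √2)·S[b] = ½·D_r[b] - ½·e^{-jIPD[b]}·D_l[b]

where D̂_l[b] and D̂_r[b] denote the estimated residuals and S[b] on the right-hand side of the subtraction is the principal component estimate given above.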
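
The high-SNR case can be verified with concrete numbers. The sketch below builds the two channels from the signal model X_l = S + D_l, X_r = S·e^{-jIPD} + D_r, applies the weights w = (1/√2)·[e^{jIPD}; 1], and returns the principal component estimate and the two estimated residuals. The function name and the example values are assumptions for illustration.

```python
import cmath
import math

def high_snr_decomposition(s, d_l, d_r, ipd):
    """Apply w = (1/sqrt(2)) [exp(j*ipd); 1] to the far-field signal model.

    Returns (s_hat, res_l, res_r): principal component and estimated residuals.
    """
    x_l = s + d_l
    x_r = s * cmath.exp(-1j * ipd) + d_r
    w_l = cmath.exp(1j * ipd) / math.sqrt(2)
    w_r = 1 / math.sqrt(2)
    s_hat = w_l.conjugate() * x_l + w_r * x_r      # S = [X_l X_r] w*
    return s_hat, x_l - w_l * s_hat, x_r - w_r * s_hat

s, d_l, d_r, ipd = 1.0 + 0.5j, 0.1 - 0.2j, -0.05 + 0.3j, 0.4
s_hat, res_l, res_r = high_snr_decomposition(s, d_l, d_r, ipd)

# Closed forms that follow from the model: the source term cancels in both residuals.
expected_l = 0.5 * d_l - 0.5 * cmath.exp(1j * ipd) * d_r
expected_r = 0.5 * d_r - 0.5 * cmath.exp(-1j * ipd) * d_l
```

The source amplitude in `s_hat` is scaled by √2 while the two noise terms are each scaled by 1/√2, which is the origin of the factor of 2 in the SNR.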

[0301] Low SNR

[0302] For subbands where the background noise dominates (ICC → 0), when the IID = 1, the principal eigenvector equals,

[0303] w[b] = (1 / √2)·[1; 1]

[0304] and

[0305] S[b] = [X_l[b] X_r[b]]·w*[b] = (1 / √2)·(X_l[b] + X_r[b]) ≈ (1 / √2)·(D_l[b] + D_r[b])

[0306] D̂_l[b] = X_l[b] - w_l[b]·S[b] ≈ ½·D_l[b] - ½·D_r[b]

[0307] D̂_r[b] = X_r[b] - w_r[b]·S[b] ≈ ½·D_r[b] - ½·D_l[b]

[0309] If the background noise dominates (ICC → 0), but the IID ≠ 1, then theoretically,

[0312] w[b] = [0; 1], i.e. the weights may be generated to select the right channel only. To circumvent this, a small constant ε can be added to the principal eigenvector,

[0314] w[b] = (1 / √(2ε² + 2ε + 1))·[ε; 1 + ε]

[0315] In this case, the downmixer will focus on the channel with the higher energy (for IID < 1, the right channel, and for IID > 1, the left channel).

[0316] In practice, it might be advantageous to update weights w[b] only when a directional source is deemed to be active. Since a directional source is considered a coherent point source, a simple activity test can be performed, e.g. by measuring the average coherence value over several subbands and thresholding this value. If the average coherence exceeds a given threshold, then the weights w[b] may be updated.
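
A gated update of this kind might look as follows. The threshold value of 0.6, the plain arithmetic mean over subbands and the function name `gated_weight_update` are assumptions; the text only states that an average coherence is compared against a threshold.

```python
def gated_weight_update(current_weights, candidate_weights, icc_per_band, threshold=0.6):
    """Return the weights to use for this time segment.

    The candidate weights are adopted only when the mean ICC over the given
    subbands exceeds the (hypothetical) threshold; otherwise the previous
    weights are kept.
    """
    mean_icc = sum(icc_per_band) / len(icc_per_band)
    if mean_icc > threshold:
        return candidate_weights
    return current_weights

previous = (1.0, 0.0)
candidate = (0.6, 0.8)
active = gated_weight_update(previous, candidate, [0.9, 0.8, 0.7])    # coherent segment
inactive = gated_weight_update(previous, candidate, [0.2, 0.3, 0.1])  # diffuse segment
```

Holding the weights during diffuse segments avoids steering the downmix towards noise when no directional source is present.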

[0317] The following alternative formulation of the weights also results in orthogonalization of the downmix and the residual signal(s):

[0318] w = [cos(α)·e^{jφ_l}; sin(α)·e^{jφ_r}], with

[0321]

[0322] and the phases chosen such that:

[0323] mod(φ_r − φ_l, 2π) = IPD

[0324] In many embodiments, the rendering of one (or both / all) residual signal(s) may include a decorrelation of the residual signal. In many embodiments, the first residual signal may be decorrelated and subsequently rendered by the second renderer 307. In many embodiments, the first residual signal may be generated as described above and the second residual signal may be generated by decorrelating the first residual signal. The two residual signals may then be rendered as the more diffuse components. Specifically, the previously described rendering approaches for two (or more) residual signals may be applied to two (or more) residual signals of which one or more is generated by decorrelation. In particular, while D_l[b] and D_r[b] as indicated above are assumed to be uncorrelated, the expressions for the estimated residuals for the left and right channels indicate that these may not be fully uncorrelated. Accordingly, in some embodiments, a decorrelator 401 may as illustrated in FIG. 3 be applied to one of the residual signals. In this case, the previously described approach may be used but with D_r[b] = D_l[b], so that the estimated directional source S[b] and the pair of decorrelated residual signals generated by the decorrelation can be binauralized using HRTFs (and optionally binaural room impulse responses) and rendered. A standard decorrelator may be used, such as those used for decoding a PS signal.

[0325] In some embodiments, a delay may be introduced to the channel signals relative to the downmix and typically the residual weights. For example, as illustrated in FIG. 5, delays 501, 503 may be introduced to the first and second channel signals (and specifically to the subband samples of the first and second channels) before the compensation is applied to generate the residual signals.

[0326] The delays 501, 503 may be used to address causality issues. In particular, using the complex conjugate of the frequency subband / domain downmix weights w_l and w_r in (20) is equivalent to time-reversing the corresponding time-domain filter coefficients, i.e. it is equivalent to a time reversal of the impulse response. For relatively small time differences between the time signals, this may not be a problem for a frequency domain / subband processing, but if there is a substantial lag between the signals (corresponding to a large value for an ITD parameter), the complex conjugation of weights and subsequent residual signal calculation may be subject to problems. In order to compensate for such a time reversal and ensure causality, the delays 501, 503 may be introduced. The delays may for example be set to be equal to a processing time segment / interval corresponding to the maximum signal delay between left and right channels. For example, for the communication scenario, this delay can correspond to the distance between the pair of microphones used to capture point source signals such as local speakers, and the delay would correspond to an exponential given by e^{-jωd/c}, where d is the max distance and c is the speed of sound (e.g. for air ≈ 343 m / s). Of course, this also assumes that the frame length or time segment taken for the frequency-domain transform is large enough to capture such delays between left and right channels for the same time frame.

[0329] In other embodiments the delays may be set as half the length of the corresponding time domain filter length, again assuming that half this length is sufficient to cover the expected maximum delay between left and right channels, i.e. δ = e^{-jωN/(2f_s)}, where N is the underlying time-domain filter length and f_s is the sampling frequency. This, however, would also require delaying the subband residual weights by e^{-jωN/(2f_s)} and respectively advancing the subband downmix weights by e^{jωN/(2f_s)}.

[0331] This approach can be advantageous in situations where the subband downmix and residual weights also model early reflections between a directional source and the microphones in a communication setting to further improve the downmix signal to noise ratio. Therefore, in some embodiments, the receiver is further arranged to receive an indication of a maximum time offset between channels of the stereo signal. For example, the audio render apparatus may receive an indication of a maximum distance (which in some cases may be an actual distance) between the microphones capturing the stereo signal. The indication may for example be received as a user input, or e.g. may be received as metadata which is part of a data signal also comprising the input stereo signal.

[0332] The audio render apparatus may then be arranged to adapt the combination in response to the indication of the maximum time offset corresponding to such a maximum distance. For example, the combiner may be arranged to set a value of the delays depending on the maximum distance between microphones.

[0333] As another example, the maximum distance, and thus the maximum interchannel delay / time difference, may be used to adapt the operation of the audio render apparatus, and specifically the combination, in the estimation of the inter-channel time difference (ITD) by the spatial parameter estimator 103. ITD methods are commonly based on cross-correlation methods and maximum peak picking of the cross-correlation function corresponding to the inter-channel delay. The maximum distance can be translated into a maximum delay value via the relation τ_m = d / c, and the maximum peak is selected within an interval of ±τ_m around zero for the cross-correlation function. For azimuth direction of arrival estimation, the corresponding direction may then be estimated from the relationship,

[0335] γ = arccos((c / d)·τ),

[0337] where τ is the delay corresponding to the maximum peak in the cross-correlation function.
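
A bounded peak search of this kind can be sketched as follows. The sketch works on short real-valued time-domain segments with a direct (non-FFT) cross-correlation, uses the convention that a positive lag means the right channel is delayed relative to the left, and converts the delay to an angle with the standard far-field relation arccos(c·τ / d). The function names, the sign convention and the clamping of the arccos argument are assumptions for this example.

```python
import math

def estimate_itd(left, right, fs, d, c=343.0):
    """Delay (seconds) of the cross-correlation peak within +/- d/c.

    left, right: equal-length real sample sequences; fs: sampling rate in Hz;
    d: maximum microphone distance in metres; c: speed of sound in m/s.
    """
    max_lag = int(math.floor(d / c * fs))            # tau_m expressed in samples
    n = len(left)
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):         # search only inside +/- tau_m
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag / fs

def direction_from_itd(tau, d, c=343.0):
    """Azimuth in radians from a delay, assuming a far-field source."""
    ratio = max(-1.0, min(1.0, c * tau / d))         # guard against rounding
    return math.acos(ratio)

# Impulse arriving 3 samples later on the right channel.
left = [0.0] * 32
right = [0.0] * 32
left[10] = 1.0
right[13] = 1.0
tau = estimate_itd(left, right, fs=48000, d=0.05)
angle = direction_from_itd(tau, d=0.05)
```

Restricting the lag range to ±τ_m both reduces the computation and rejects spurious correlation peaks that cannot correspond to a physical delay for the given microphone spacing.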

[0338] The processing is performed in subbands. The processing may be performed in time segments. The processing in each subband may for some (any) or all steps be performed separately / independently in each subband (with respect to the processing in other subbands). The processing in each time segment may for some (any) or all steps be performed separately / independently in each time segment (with respect to the processing in other time segments).

[0340] The processing may be time interval / segment based with all processing being performed for each time segment. Equivalently, the signal(s) for each segment may be considered a signal (and in particular signals of different time segments, may be considered different signals).

[0341] The audio apparatus(s) may specifically be implemented in one or more suitably programmed processors. An example of a suitable processor is provided in the following.

[0342] FIG. 6 is a block diagram illustrating an example processor 600 according to embodiments of the disclosure. Processor 600 may be used to implement one or more processors implementing an apparatus as previously described or elements thereof (including in particular one or more artificial neural networks). Processor 600 may be any suitable processor type including, but not limited to, a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) where the FPGA has been programmed to form a processor, a Graphical Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC) where the ASIC has been designed to form a processor, or a combination thereof.

[0343] The processor 600 may include one or more cores 602. The core 602 may include one or more Arithmetic Logic Units (ALU) 604. In some embodiments, the core 602 may include a Floating Point Logic Unit (FPLU) 606 and / or a Digital Signal Processing Unit (DSPU) 608 in addition to or instead of the ALU 604.

[0344] The processor 600 may include one or more registers 612 communicatively coupled to the core 602. The registers 612 may be implemented using dedicated logic gate circuits (e.g., flip-flops) and / or any memory technology. In some embodiments the registers 612 may be implemented using static memory. The registers 612 may provide data, instructions and addresses to the core 602.

[0345] In some embodiments, processor 600 may include one or more levels of cache memory 610 communicatively coupled to the core 602. The cache memory 610 may provide computer-readable instructions to the core 602 for execution. The cache memory 610 may provide data for processing by the core 602. In some embodiments, the computer-readable instructions may have been provided to the cache memory 610 by a local memory, for example, local memory attached to the external bus 616. The cache memory 610 may be implemented with any suitable cache memory type, for example, Metal-Oxide Semiconductor (MOS) memory such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and / or any other suitable memory technology.

[0346] The processor 600 may include a controller 614, which may control input to the processor 600 from other processors and / or components included in a system and / or outputs from the processor 600 to other processors and / or components included in the system. Controller 614 may control the data paths in the ALU 604, FPLU 606 and / or DSPU 608. Controller 614 may be implemented as one or more state machines, data paths and / or dedicated control logic. The gates of controller 614 may be implemented as standalone gates, FPGA, ASIC or any other suitable technology.

[0347] The registers 612 and the cache 610 may communicate with controller 614 and core 602 via internal connections 620A, 620B, 620C and 620D. Internal connections may be implemented as a bus, multiplexer, crossbar switch, and / or any other suitable connection technology.

[0348] Inputs and outputs for the processor 600 may be provided via a bus 616, which may include one or more conductive lines. The bus 616 may be communicatively coupled to one or more components of processor 600, for example the controller 614, cache 610, and / or register 612. The bus 616 may be coupled to one or more components of the system.

[0349] The bus 616 may be coupled to one or more external memories. The external memories may include Read Only Memory (ROM) 632. ROM 632 may be a masked ROM, Erasable Programmable Read Only Memory (EPROM) or any other suitable technology. The external memory may include Random Access Memory (RAM) 633. RAM 633 may be a static RAM, battery backed up static RAM, Dynamic RAM (DRAM) or any other suitable technology. The external memory may include Electrically Erasable Programmable Read Only Memory (EEPROM) 635. The external memory may include Flash memory 634. The external memory may include a magnetic storage device such as disc 636. In some embodiments, the external memories may be included in a system.

[0350] The invention can be implemented in any suitable form including hardware, software, firmware, or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and / or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

[0351] Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.

[0352] Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and / or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

[0353] Generally, examples of an audio apparatus, a method of operation therefor, and a computer program which implements the method are indicated by below embodiments.

[0354] EMBODIMENTS: Embodiment 1. An audio apparatus comprising:

[0355] a receiver (101) arranged to receive a stereo signal;

[0356] a spatial parameter circuit (103) arranged to provide sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of frequency subbands of channels of the stereo signal;

[0357] a weight processor (105) arranged to determine frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters;

[0358] a downmixer (107) arranged to determine a mono downmix audio signal by downmixing the stereo signal, the downmixer (107) being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights;

[0359] a residual circuit (109) arranged to generate at least a first residual signal from a first channel of the stereo signal, the residual circuit (109) being arranged to:

[0360] generate frequency subband residual weights for the first channel from frequency subband downmix weights for the first channel;

[0361] generate first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and

[0362] generate frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values;

[0363] a renderer (111) arranged to render an output stereo signal, the renderer comprising: a first renderer (301) arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

[0364] a second renderer (307) arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

[0365] a combiner (309) arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.
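The downmix and residual steps of Embodiment 1 can be illustrated for a single frequency subband with the following minimal Python sketch. It assumes that the residual weight is the complex conjugate of the downmix weight (as recited in claim 1) and that the compensation is a subtraction; the function names, the subtraction and the example values are illustrative assumptions and are not taken from the description.

```python
import math


def downmix(x_left, x_right, w_left, w_right):
    """Mono downmix value of one subband: weighted sum of the channel values."""
    return w_left * x_left + w_right * x_right


def residual(x_ch, w_ch, m):
    """Residual value of one channel in one subband.

    Assumptions: the residual weight is the complex conjugate of the downmix
    weight, and the compensation subtracts (residual weight * downmix value).
    """
    v_ch = w_ch.conjugate()      # frequency subband residual weight
    compensation = v_ch * m      # frequency subband compensation value
    return x_ch - compensation


# Hypothetical example: a single source s present in both channels with
# different gain and phase (x_left = a * s, x_right = b * s).
s = 0.5 + 0.25j
a = 0.8 + 0j
b = 0.6j
x_left, x_right = a * s, b * s

# Phase-aligning weights with |w_left|^2 + |w_right|^2 = 1.
norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
w_left, w_right = a.conjugate() / norm, b.conjugate() / norm

m = downmix(x_left, x_right, w_left, w_right)
r_left = residual(x_left, w_left, m)
r_right = residual(x_right, w_right, m)
```

In this single-source example the downmix captures the whole source and both residuals vanish; with additional uncorrelated content in the channels the residuals would be non-zero.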

[0366] Embodiment 2. The audio apparatus of embodiment 1 wherein the weight processor (105) is arranged to determine the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal.

Embodiment 3. The audio apparatus of any previous embodiment wherein the weight processor (105) is arranged to determine the frequency subband downmix weights to have levels that meet a combined level constraint.

[0367] Embodiment 4. The audio apparatus of any previous embodiment wherein the weight processor (105) is arranged to determine the frequency subband downmix weights to achieve at least one of:

[0368] maximizing a power level of the mono downmix audio signal;

[0369] minimizing a power level of the first residual signal; or

[0370] minimizing a correlation of the mono downmix audio signal and the first residual signal.

[0371] Embodiment 5. The audio apparatus of any previous embodiment wherein the weight processor (105) is arranged to determine the frequency subband downmix weights for the mono downmix audio signal to be a principal component of the stereo signal.
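One way in which weights with the properties of Embodiments 2 to 5 could be obtained is from the principal eigenvector of the 2x2 subband covariance matrix. The sketch below is an illustration under stated assumptions, not necessarily the computation used in the description: the inputs `p_l`, `p_r` (channel powers) and `c` (cross term, the sum of left times conjugated right values) and the function name are hypothetical, and the unit-norm constraint is one possible choice of combined level constraint.

```python
import math


def principal_component_weights(p_l, p_r, c):
    """Downmix weights from the principal eigenvector of [[p_l, c], [conj(c), p_r]].

    p_l, p_r: real subband powers of the left and right channel.
    c: complex cross term, sum of left * conj(right).
    Returns (w_left, w_right) with |w_left|^2 + |w_right|^2 = 1.
    """
    c = complex(c)
    # Largest eigenvalue of the Hermitian 2x2 covariance matrix.
    lam = 0.5 * (p_l + p_r) + math.sqrt((0.5 * (p_l - p_r)) ** 2 + abs(c) ** 2)
    # Unnormalized eigenvector; the form is chosen to stay well conditioned.
    if p_l >= p_r:
        u_l, u_r = complex(lam - p_r), c.conjugate()
    else:
        u_l, u_r = c, complex(lam - p_l)
    norm = math.sqrt(abs(u_l) ** 2 + abs(u_r) ** 2)
    if norm == 0.0:
        # Degenerate case (equal powers, no correlation): equal weights.
        g = math.sqrt(0.5)
        return complex(g), complex(g)
    # The downmix is m = w_left * x_left + w_right * x_right, so the weights
    # are the conjugated eigenvector components, which aligns the phases.
    return (u_l / norm).conjugate(), (u_r / norm).conjugate()


# Hypothetical example: unit-power source with x_left = 0.8*s, x_right = 0.6j*s.
w_left, w_right = principal_component_weights(0.64, 0.36, 0.8 * (0.6j).conjugate())
```

For this example the weighted channel contributions `w_left * 0.8` and `w_right * 0.6j` are both real and positive, i.e. phase aligned before summation.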

[0372] Embodiment 6. The audio apparatus of any previous embodiment wherein the first renderer (301) is arranged to determine a point source direction in a stereo image of the stereo signal from the spatial parameters, and to determine the first direction by applying a mapping function to the point source direction.
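The two steps of Embodiment 6 might be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the point source position is derived from an interchannel intensity difference in dB via an inverse sine/cosine panning law, the mapping function is a linear scaling, negative values denote the left side, and the function names and the default of 30 degrees are hypothetical.

```python
import math


def image_position_from_iid(iid_db):
    """Position in the stereo image, from -1 (fully left) to +1 (fully right).

    Assumption: iid_db is the left/right level difference in dB and the
    source was panned with a sine/cosine (constant-power) law.
    """
    ratio = 10.0 ** (iid_db / 20.0)      # left/right amplitude ratio
    theta = math.atan2(1.0, ratio)       # 0 = fully left, pi/2 = fully right
    return (theta - math.pi / 4.0) / (math.pi / 4.0)


def map_to_azimuth(position, max_azimuth_deg=30.0):
    """Hypothetical mapping function: linear scaling of the position to degrees."""
    return position * max_azimuth_deg


# A centered source maps to 0 degrees; a 6 dB louder left channel maps to the left.
center_deg = map_to_azimuth(image_position_from_iid(0.0))
left_deg = map_to_azimuth(image_position_from_iid(6.0))
```

A larger `max_azimuth_deg` would widen the rendered image relative to the original stereo image; a non-linear mapping function could be substituted without changing the structure.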

[0373] Embodiment 7. The audio apparatus of any previous embodiment wherein the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.

[0374] Embodiment 8. The audio apparatus of any previous embodiment wherein the second rendering includes a decorrelation.

[0375] Embodiment 9. The audio apparatus of any previous embodiment wherein the residual circuit (109) is arranged to generate a second residual signal from a second channel of the stereo signal, the residual circuit (109) being arranged to

[0376] generate second frequency subband residual weights for the second channel from frequency subband downmix weights for the second channel;

[0377] generate second frequency subband compensation values for the second channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the second channel; and

[0378] generate frequency subband values of the second residual signal from a compensation of frequency subband values of the second channel of the stereo signal by the second frequency subband compensation values; and the second renderer (307) is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and the combiner is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.
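The combination of three intermediate stereo signals in Embodiment 9, together with a predetermined mapping as in Embodiment 7, can be sketched for one subband value as below. This is a simplified illustration under assumptions: constant-power amplitude panning stands in for the directional rendering (which could, for example, be a binaural rendering), the first residual is mapped only to the left output and the second only to the right output, and all names are hypothetical.

```python
import math


def pan_gains(position):
    """Constant-power gains for a position in [-1, 1] (-1 = left, +1 = right)."""
    theta = (position + 1.0) * math.pi / 4.0   # 0 .. pi/2
    return math.cos(theta), math.sin(theta)


def render_output(m, r_left, r_right, position):
    """Combine three intermediate stereo signals into one output pair."""
    g_l, g_r = pan_gains(position)
    first = (g_l * m, g_r * m)     # directional rendering of the mono downmix
    second = (r_left, 0.0)         # predetermined mapping of the first residual
    third = (0.0, r_right)         # predetermined mapping of the second residual
    out_l = first[0] + second[0] + third[0]
    out_r = first[1] + second[1] + third[1]
    return out_l, out_r


# Hypothetical subband values: downmix rendered at the center of the image.
out_left, out_right = render_output(1.0 + 0j, 0.1 + 0j, -0.2 + 0j, 0.0)
```

Because the residual mapping is fixed, only the rendering of the downmix depends on the estimated direction, which keeps the per-subband work small.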

[0379] Embodiment 10. The audio apparatus of any previous embodiment further comprising a delay (501, 503) for delaying the frequency subband values of the stereo signal relative to the frequency subband downmix weights.
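A delay as in Embodiment 10 could be realized as a simple frame delay line on the subband values, so that weights determined from a later frame are applied to earlier signal values. The sketch below is a minimal illustration; the class name, the frame representation and the fill value are hypothetical.

```python
from collections import deque


class FrameDelay:
    """Delays a stream of subband frames by a fixed number of frames."""

    def __init__(self, delay_frames, fill=None):
        # Pre-fill so that the first `delay_frames` outputs are the fill value.
        self._buffer = deque([fill] * delay_frames)

    def push(self, frame):
        """Store the newest frame and return the frame from `delay_frames` ago."""
        self._buffer.append(frame)
        return self._buffer.popleft()


# With a delay of 2 frames, the weights computed for frame n would be
# applied to the subband values of frame n - 2.
delay = FrameDelay(2)
delayed = [delay.push(n) for n in range(4)]
```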

[0380] Embodiment 11. The audio apparatus of any previous embodiment wherein the receiver (101) is further arranged to receive an indication of a maximum time offset between channels of the stereo signal; and the audio render apparatus is arranged to adapt the combination in dependence on the indication of the maximum time offset.

[0381] Embodiment 12. The audio apparatus of any previous embodiment wherein the spatial parameter circuit (103) is arranged to analyze the stereo signal to generate the sets of spatial parameters.

[0382] Embodiment 13. The audio apparatus of any previous embodiment wherein the spatial parameters include an interchannel intensity difference parameter, an interchannel phase difference, and an interchannel correlation parameter.
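For reference, the three parameters named in Embodiment 13 are commonly estimated per subband from the complex subband values as sketched below. These are widely used textbook definitions and are given here as an assumption; the description may define or quantize the parameters differently, and the function name and the small regularization constant `eps` are hypothetical.

```python
import cmath
import math


def spatial_parameters(left, right, eps=1e-12):
    """Estimate (IID in dB, IPD in radians, ICC) for one subband.

    left, right: sequences of complex subband values of the two channels.
    """
    p_l = sum(abs(v) ** 2 for v in left)
    p_r = sum(abs(v) ** 2 for v in right)
    cross = sum(l * complex(r).conjugate() for l, r in zip(left, right))
    iid_db = 10.0 * math.log10((p_l + eps) / (p_r + eps))   # intensity difference
    ipd = cmath.phase(cross)                                # phase difference
    icc = abs(cross) / math.sqrt((p_l + eps) * (p_r + eps)) # correlation
    return iid_db, ipd, icc


# Hypothetical example: right channel is the left channel scaled by 0.5
# and advanced in phase by 90 degrees, so the channels are fully coherent.
left = [1 + 0j, 1j]
right = [0.5j * v for v in left]
iid_db, ipd, icc = spatial_parameters(left, right)
```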

[0383] Embodiment 14. A method of generating an output stereo signal, the method comprising:

[0384] receiving a stereo signal;

[0385] providing sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of frequency subbands of channels of the stereo signal;

[0386] determining frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters;

[0387] determining a mono downmix audio signal by downmixing the stereo signal, the downmixer (107) being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights;

[0388] generating at least a first residual signal from a first channel of the stereo signal, the generating including:

[0389] generating frequency subband residual weights for the first channel from frequency subband downmix weights for the first channel; generating first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and

[0390] generating frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values;

[0391] rendering an output stereo signal, the rendering comprising:

[0392] performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

[0393] performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

[0394] combining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal.

[0395] More specifically, the invention is defined by the appended CLAIMS.

Claims

Claim 1. An audio apparatus comprising:

a receiver (101) arranged to receive a stereo signal;

a spatial parameter circuit (103) arranged to provide sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of frequency subbands of channels of the stereo signal;

a weight processor (105) arranged to determine frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters;

a downmixer (107) arranged to determine a mono downmix audio signal by downmixing the stereo signal, the downmixer (107) being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights;

a residual circuit (109) arranged to generate at least a first residual signal from a first channel of the stereo signal, the residual circuit (109) being arranged to:

generate frequency subband residual weights for the first channel from frequency subband downmix weights for the first channel, the frequency subband residual weights being determined as complex conjugates of the frequency subband downmix weights;

generate first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and

generate frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values;

a renderer (111) arranged to render an output stereo signal, the renderer comprising: a first renderer (301) arranged to perform a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

a second renderer (307) arranged to perform a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

a combiner (309) arranged to combine at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal; and wherein the weight processor (105) is arranged to determine the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal.

Claim 2. The audio apparatus of any previous claim wherein the weight processor (105) is arranged to determine the frequency subband downmix weights to have levels that meet a combined level constraint.

Claim 3. The audio apparatus of any previous claim wherein the weight processor (105) is arranged to determine the frequency subband downmix weights to achieve at least one of:

maximizing a power level of the mono downmix audio signal;

minimizing a power level of the first residual signal; or

minimizing a correlation of the mono downmix audio signal and the first residual signal.

Claim 4. The audio apparatus of any previous claim wherein the weight processor (105) is arranged to determine the frequency subband downmix weights for the mono downmix audio signal to be a principal component of the stereo signal.

Claim 5. The audio apparatus of any previous claim wherein the first renderer (301) is arranged to determine a point source direction in a stereo image of the stereo signal from the spatial parameters, and to determine the first direction by applying a mapping function to the point source direction.

Claim 6. The audio apparatus of any previous claim wherein the second rendering is a predetermined rendering employing a predetermined mapping of the first residual signal to channel signals of the second intermediate stereo signal.

Claim 7. The audio apparatus of any previous claim wherein the second rendering includes a decorrelation.

Claim 8. The audio apparatus of any previous claim wherein the residual circuit (109) is arranged to generate a second residual signal from a second channel of the stereo signal, the residual circuit (109) being arranged to

generate second frequency subband residual weights for the second channel from frequency subband downmix weights for the second channel;

generate second frequency subband compensation values for the second channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the second channel; and

generate frequency subband values of the second residual signal from a compensation of frequency subband values of the second channel of the stereo signal by the second frequency subband compensation values; and the second renderer (307) is arranged to perform a third rendering being a rendering of the second residual signal to generate a third intermediate stereo signal; and the combiner is arranged to combine at least the first intermediate stereo signal, the second intermediate stereo signal, and the third intermediate stereo signal to generate the output stereo signal.

Claim 9. The audio apparatus of any previous claim further comprising a delay (501, 503) for delaying the frequency subband values of the stereo signal relative to the frequency subband downmix weights.

Claim 10. The audio apparatus of any previous claim wherein the spatial parameter circuit (103) is arranged to analyze the stereo signal to generate the sets of spatial parameters.

Claim 11. The audio apparatus of any previous claim wherein the spatial parameters include an interchannel intensity difference parameter, an interchannel phase difference, and an interchannel correlation parameter.

Claim 12. A method of generating an output stereo signal, the method comprising:

receiving a stereo signal;

providing sets of frequency subband spatial parameters for the stereo signal, the sets of frequency subband spatial parameters being indicative of relative signal properties of frequency subbands of channels of the stereo signal;

determining frequency subband downmix weights for the channels of the stereo signal from the sets of frequency subband spatial parameters;

determining a mono downmix audio signal by downmixing the stereo signal, the downmixer (107) being arranged to generate frequency subband values of the mono downmix audio signal by combining frequency subband values of channels of the stereo signal dependent on the frequency subband downmix weights;

generating at least a first residual signal from a first channel of the stereo signal, the generating including:

generating frequency subband residual weights for the first channel as complex conjugates of the frequency subband downmix weights for the first channel;

generating first frequency subband compensation values for the first channel from the frequency subband values of the mono downmix audio signal and the frequency subband residual weights for the first channel; and generating frequency subband values of the first residual signal from a compensation of frequency subband values of the first channel by the first frequency subband compensation values;

rendering an output stereo signal, the rendering comprising:

performing a first rendering of the mono downmix audio signal to generate a first intermediate stereo signal, the first rendering being a directional rendering arranged to render the mono downmix audio signal from a first direction;

performing a second rendering being a rendering of the first residual signal to generate a second intermediate stereo signal; and

combining at least the first intermediate stereo signal and the second intermediate stereo signal to generate the output stereo signal; and wherein determining frequency subband downmix weights comprises determining the frequency subband downmix weights such that the downmixing aligns a phase of the frequency subband values of the channels of the stereo signal.

Claim 13. A computer program product comprising computer program code means adapted to perform all the steps of claim 12 when said program is run on a computer.