Rendering of multiple higher order ambisonic signals

By estimating modified metadata based on microphone array geometry and audio source positions, the method improves sound localization accuracy in spatial audio rendering, addressing issues of incorrect direction perception and diffuseness in existing technologies.

WO2026139212A1 · PCT designated stage · Publication Date: 2026-07-02 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2025-12-08
Publication Date
2026-07-02

Abstract

An apparatus at least to perform: obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtain a listener position within an audio scene comprising one or more areas having one or more inside and outside regions in relation to the audio signal set positions, wherein the inside region is defined by the audio signal set positions and the audio scene comprises one or more audio sources; determine a projected listener position based on a geometry of the inside region; obtain metadata associated with the projected listener position; obtain a projection shape; obtain information based on a position of the one or more audio sources relative to the projection shape; estimate modified metadata for the projected listener position based on the listener position and the information.

Description

RENDERING OF MULTIPLE HIGHER ORDER AMBISONIC SIGNALS

FIELD

[0001] The present application relates to a method, apparatus, system and computer program for rendering of multi-point Ambisonic signals and in particular but not exclusively to a method, apparatus, system and computer program for rendering of multi-point higher order Ambisonic signals.

BACKGROUND

[0002] Spatial audio capture approaches attempt to capture an audio environment or audio scene such that the audio environment or audio scene can be perceptually recreated to a listener in an effective manner and furthermore may permit a listener to move and/or rotate within the recreated audio environment. For spatial audio capture and recording spatial sound linearly at one position in the recording space, a high-end microphone array is needed. One such microphone array is the spherical 32-microphone Eigenmike. From the high-end microphone array, higher-order Ambisonics (HOA) signals can be obtained and used for rendering. With the HOA audio signals, the spatial audio can be rendered so that sounds arriving from different directions are satisfactorily separated in a reasonable auditory bandwidth. In some systems multiple microphone locations enable a multi-point HOA (MPHOA) capture system where there are multiple HOA audio signals at locations within an audio scene. In some embodiments even a basic microphone array for up to first order Ambisonics (FOA) may also be used for recording the audio scene. In some other embodiments, the audio scene may comprise two or more synthetic FOA or HOA sources.

[0003] Audio rendering, where the captured audio signals are presented to a listener, can be part of a virtual reality (VR) or augmented reality (AR) system. The audio rendering furthermore can be performed as part of a VR or AR system where the listener can freely move within the environment or audio scene and rotate their head, which is known as a 6 degrees of freedom (6DoF) configuration. Furthermore, the audio rendering can be Multi-Point HOA (MPHOA) audio rendering where the audio scene comprises multiple HOA audio signal recordings which are rendered to a user in a 6DoF manner. That is, the user is able to listen to the recorded scene from positions that may be other than the positions of the recorded HOA sources.

SUMMARY

[0004] According to a first aspect, there is provided an apparatus for generating a spatialized audio output based on a listener position, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the system at least to perform: obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtain a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtain, based on the one or more audio signal sets, metadata associated with the projected listener position; obtain a projection shape; obtain information based on a position of the one or more audio sources relative to the projection shape; and estimate modified metadata for the projected listener position based on the listener position and the information.

[0005] The modified metadata may comprise at least one of: a modified energy metadata parameter; and a modified directional metadata parameter.

[0006] The apparatus caused to estimate modified metadata for the projected listener position based on the listener position and the information based on the position of the one or more audio sources relative to the projection shape may be further caused to: determine at least one audio position with respect to the projected listener position, wherein the modified metadata for the projected listener position comprises a direction parameter representing a direction from the projected listener position to one of the at least one audio position; determine spatial metadata for the listener position based on the at least one audio signal set position with respect to the projected listener position, wherein the spatial metadata comprises a spatial direction parameter representing a direction from the listener position to the one of the at least one audio position.
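
The relationship between the two direction parameters in the paragraph above can be illustrated with a minimal Python sketch. It is not taken from the application: the function names, the azimuth/elevation convention (radians, azimuth measured in the x-y plane) and the use of a single assumed source distance are all hypothetical choices made for illustration. The sketch places an audio position along the direction seen from the projected listener position, then recomputes the direction as seen from the actual listener position.

```python
import math

def direction_to_unit(azimuth, elevation):
    """Convert azimuth/elevation (radians) to a unit vector (assumed convention)."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )

def remap_direction(projected_pos, listener_pos, azimuth, elevation, distance):
    """Re-express a direction given at projected_pos as a direction from listener_pos.

    `distance` is an assumed distance from the projected listener position to
    the audio position (for example a projection radius); direction metadata
    alone does not provide it.
    """
    unit = direction_to_unit(azimuth, elevation)
    # Audio position implied by the direction and the assumed distance.
    audio_pos = tuple(p + distance * u for p, u in zip(projected_pos, unit))
    # Vector from the actual listener position to that audio position.
    dx, dy, dz = (a - l for a, l in zip(audio_pos, listener_pos))
    new_azimuth = math.atan2(dy, dx)
    new_elevation = math.atan2(dz, math.hypot(dx, dy))
    return new_azimuth, new_elevation

# A source straight ahead of the projected position, heard from one metre to the side.
az, el = remap_direction((0.0, 0.0, 0.0), (0.0, -1.0, 0.0), 0.0, 0.0, 1.0)
print(math.degrees(az), math.degrees(el))
```

In this example the source that lies at azimuth 0 from the projected listener position appears at roughly 45 degrees azimuth from the displaced listener position, which is the kind of parallax shift the two direction parameters are meant to capture.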

[0007] The apparatus caused to obtain one or more audio signal sets may be caused to obtain the one or more audio signal sets from microphone arrangements, wherein each microphone arrangement is at a respective position and comprises one or more microphones.

[0008] The apparatus caused to obtain the one or more audio signal sets may be caused to obtain one or more higher order ambisonics sources.

[0009] The inside regions in relation to the respective audio signal set positions for one higher order ambisonics source may define a position associated with the higher order ambisonics source.

[0010] The projected listener position may be the position associated with the higher order ambisonics source.

[0011] The apparatus caused to obtain a listener position may be caused to obtain the listener position from a further apparatus.

[0012] The apparatus caused to obtain, for the at least one of the one or more audio signal sets, metadata based on a processing of the at least one audio signals of the at least one of the one or more audio signal sets may be caused to determine a directional parameter based on the processing of the at least one audio signals.

[0013] The apparatus caused to determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region may be caused to determine the projected listener position at a location of one of: within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions and the listener position; within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions within an associated inside region; on an edge or surface defined by the one of the one or more audio signal set positions; and at a closest of the one or more audio signal set positions.
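
Two of the options listed above, a projected listener position on an edge defined by the audio signal set positions and a projected listener position at the closest audio signal set position, can be sketched together in Python. This is an illustrative assumption rather than the method of the application: it uses an orthogonal projection onto a straight edge between two positions and clamps to the nearer end point when the projection falls outside the edge. The function name and the tuple representation of positions are hypothetical.

```python
def project_onto_edge(listener_pos, pos_a, pos_b):
    """Project listener_pos onto the edge between two audio signal set positions.

    Returns the closest point on the segment pos_a..pos_b. When the orthogonal
    projection falls beyond an end point, the result is that end point, i.e.
    the closest audio signal set position on this edge.
    """
    edge = [b - a for a, b in zip(pos_a, pos_b)]
    to_listener = [l - a for a, l in zip(pos_a, listener_pos)]
    edge_len_sq = sum(c * c for c in edge)
    if edge_len_sq == 0.0:
        # Degenerate edge: both positions coincide.
        return tuple(float(a) for a in pos_a)
    # Normalised position along the edge, clamped to the segment.
    t = sum(x * y for x, y in zip(to_listener, edge)) / edge_len_sq
    t = max(0.0, min(1.0, t))
    return tuple(a + t * c for a, c in zip(pos_a, edge))

# Listener outside the inside region, beside the edge (0,0)-(4,0).
print(project_onto_edge((1.0, 2.0), (0.0, 0.0), (4.0, 0.0)))
# Listener beyond the end of the edge: falls back to the closest position.
print(project_onto_edge((-3.0, 1.0), (0.0, 0.0), (4.0, 0.0)))
```

The same function works unchanged for three-dimensional positions; projecting onto a surface or into a volume would need a different geometric routine.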

[0014] The apparatus caused to estimate modified metadata for the projected listener position based on the listener position and the information related to a relationship between a position of the one or more audio sources and the projection shape may be caused to: generate at least one interpolation weights based on the audio signal set positions and the projected listener position; apply the at least one interpolation weights to respective audio signal set audio metadata to generate interpolated audio metadata; and combine the interpolated audio metadata to generate the modified metadata for the projected listener position.
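
The three steps in the paragraph above (generate weights, apply them, combine the results) can be sketched in Python as follows. The application does not state a weighting rule at this point, so normalised inverse-distance weighting is used here purely as an assumption, and the scalar metadata value (for example a direct-to-total energy ratio per audio signal set) is likewise a hypothetical stand-in.

```python
import math

def interpolation_weights(set_positions, projected_pos, eps=1e-9):
    """Normalised inverse-distance weights (an assumed weighting rule)."""
    dists = [math.dist(p, projected_pos) for p in set_positions]
    for i, d in enumerate(dists):
        if d < eps:
            # Projected position coincides with one set position: use it alone.
            return [1.0 if j == i else 0.0 for j in range(len(dists))]
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]

def interpolate_metadata(weights, per_set_values):
    """Apply the weights to each set's metadata value, then combine by summing."""
    weighted = [w * v for w, v in zip(weights, per_set_values)]
    return sum(weighted)

positions = [(0.0, 0.0), (4.0, 0.0)]
weights = interpolation_weights(positions, (1.0, 0.0))
print(weights)
print(interpolate_metadata(weights, [0.8, 0.4]))
```

Scalar parameters such as energies or ratios can be combined this way directly. Direction parameters would more plausibly be combined as weighted direction vectors rather than as weighted angles, to avoid wrap-around problems.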

[0015] The apparatus caused to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may be caused to map the modified metadata based on the second listener position to a cartesian co-ordinate system.

[0016] The apparatus caused to obtain information based on a position of the one or more audio sources relative to the projection shape may be caused to indicate whether the position of the one or more audio sources is on the boundary of the projection shape.

[0017] The apparatus caused to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may be further caused to: estimate a modified direction of arrival and energy based on the information indicating the position of the one or more audio sources is on the projection shape; estimate a modified direction of arrival, modified direct-to-total energy ratio and directional weighting based on the information indicating the position of the one or more audio sources is otherwise not on the projection shape.

[0018] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0019] The apparatus caused to obtain the projection shape may be caused to obtain a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0020] The apparatus caused to obtain information may be caused to obtain a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.
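
For the simplest projection shape named above, a projection sphere with a defined radius, the information about whether an audio source lies on the boundary can be derived with a short check. The following Python sketch is an assumption for illustration only: the function name, the explicit centre argument and the numerical tolerance are hypothetical and do not appear in the application.

```python
import math

def source_on_sphere_boundary(source_pos, centre, radius, tolerance=1e-6):
    """Return True when source_pos lies on the sphere boundary within a tolerance."""
    return abs(math.dist(source_pos, centre) - radius) <= tolerance

# A source five metres from the centre of a five-metre projection sphere.
print(source_on_sphere_boundary((3.0, 4.0, 0.0), (0.0, 0.0, 0.0), 5.0))
# A source well inside the same sphere.
print(source_on_sphere_boundary((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 5.0))
```

A boolean of this kind could serve as the scene parameter described above. Regular and irregular projection shapes would need their own boundary tests.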

[0021] According to a second aspect there is provided an apparatus for assisting the generation of a spatialized audio output based on a listener position, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the system at least to perform: obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtain for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtain a projection shape; obtain information based on a position of the one or more audio sources relative to the projection shape; and transmit to a further apparatus the one or more audio signal sets and the information.

[0022] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0023] The apparatus caused to obtain the projection shape may be caused to obtain a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0024] The apparatus caused to obtain information may be caused to obtain a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

[0025] According to a third aspect, there is provided an apparatus comprising means configured to: obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtain a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtain, based on the one or more audio signal sets, metadata associated with the projected listener position; obtain a projection shape; obtain information based on a position of the one or more audio sources relative to the projection shape; estimate modified metadata for the projected listener position based on the listener position and the information.

[0026] The modified metadata may comprise at least one of: a modified energy metadata parameter; and a modified directional metadata parameter.

[0027] The means configured to estimate modified metadata for the projected listener position based on the listener position and the information based on the position of the one or more audio sources relative to the projection shape may be further configured to: determine at least one audio position with respect to the projected listener position, wherein the modified metadata for the projected listener position comprises a direction parameter representing a direction from the projected listener position to one of the at least one audio position; determine spatial metadata for the listener position based on the at least one audio signal set position with respect to the projected listener position, wherein the spatial metadata comprises a spatial direction parameter representing a direction from the listener position to the one of the at least one audio position.

[0028] The means configured to obtain one or more audio signal sets may be configured to obtain the one or more audio signal sets from microphone arrangements, wherein each microphone arrangement is at a respective position and comprises one or more microphones.

[0029] The means configured to obtain the one or more audio signal sets may be configured to obtain one or more higher order ambisonics sources.

[0030] The inside regions in relation to the respective audio signal set positions for one higher order ambisonics source may define a position associated with the higher order ambisonics source.

[0031] The projected listener position may be the position associated with the higher order ambisonics source.

[0032] The means configured to obtain a listener position may be configured to obtain the listener position from a further apparatus.

[0033] The means configured to obtain, for the at least one of the one or more audio signal sets, metadata based on a processing of the at least one audio signals of the at least one of the one or more audio signal sets may be configured to determine a directional parameter based on the processing of the at least one audio signals.

[0034] The means configured to determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region may be configured to determine the projected listener position at a location of one of: within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions and the listener position; within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions within an associated inside region; on an edge or surface defined by the one of the one or more audio signal set positions; and at a closest of the one or more audio signal set positions.

[0035] The means configured to estimate modified metadata for the projected listener position based on the listener position and the information related to a relationship between a position of the one or more audio sources and the projection shape may be configured to: generate at least one interpolation weights based on the audio signal set positions and the projected listener position; apply the at least one interpolation weights to respective audio signal set audio metadata to generate interpolated audio metadata; and combine the interpolated audio metadata to generate the modified metadata for the projected listener position.

[0036] The means configured to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may be configured to map the modified metadata based on the second listener position to a cartesian co-ordinate system.

[0037] The means configured to obtain information based on a position of the one or more audio sources relative to the projection shape may be configured to indicate whether the position of the one or more audio sources is on the boundary of the projection shape.

[0038] The means configured to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may be further configured to: estimate a modified direction of arrival and energy based on the information indicating the position of the one or more audio sources is on the projection shape; estimate a modified direction of arrival, modified direct-to-total energy ratio and directional weighting based on the information indicating the position of the one or more audio sources is otherwise not on the projection shape.

[0039] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0040] The means configured to obtain the projection shape may be configured to obtain a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0041] The means configured to obtain information may be configured to obtain a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

[0042] According to a fourth aspect there is provided an apparatus for assisting the generation of a spatialized audio output based on a listener position, the apparatus comprising means configured to: obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtain for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtain a projection shape; obtain information based on a position of the one or more audio sources relative to the projection shape; and transmit to a further apparatus the one or more audio signal sets and the information.

[0043] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0044] The means configured to obtain the projection shape is configured to obtain a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0045] The means configured to obtain information may be configured to obtain a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

[0046] According to a fifth aspect, there is provided a method for an apparatus for generating a spatialized audio output based on a listener position, the method comprising at least: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtaining, based on the one or more audio signal sets, metadata associated with the projected listener position; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; and estimating modified metadata for the projected listener position based on the listener position and the information.

[0047] The modified metadata may comprise at least one of: a modified energy metadata parameter; and a modified directional metadata parameter.

[0048] Estimating modified metadata for the projected listener position based on the listener position and the information based on the position of the one or more audio sources relative to the projection shape may further comprise: determining at least one audio position with respect to the projected listener position, wherein the modified metadata for the projected listener position comprises a direction parameter representing a direction from the projected listener position to one of the at least one audio position; determining spatial metadata for the listener position based on the at least one audio signal set position with respect to the projected listener position, wherein the spatial metadata comprises a spatial direction parameter representing a direction from the listener position to the one of the at least one audio position.

[0049] Obtaining one or more audio signal sets may further comprise obtaining the one or more audio signal sets from microphone arrangements, wherein each microphone arrangement is at a respective position and comprises one or more microphones.

[0050] Obtaining the one or more audio signal sets may further comprise obtaining one or more higher order ambisonics sources.

[0051] The inside regions in relation to the respective audio signal set positions for one higher order ambisonics source may define a position associated with the higher order ambisonics source.

[0052] The projected listener position may be the position associated with the higher order ambisonics source.

[0053] Obtaining a listener position may further comprise obtaining the listener position from a further apparatus.

[0054] Obtaining, for the at least one of the one or more audio signal sets, metadata based on a processing of the at least one audio signals of the at least one of the one or more audio signal sets may further comprise determining a directional parameter based on the processing of the at least one audio signals.

[0055] Determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region may further comprise determining the projected listener position at a location of one of: within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions and the listener position; within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions within an associated inside region; on an edge or surface defined by the one of the one or more audio signal set positions; and at a closest of the one or more audio signal set positions.

[0056] Estimating modified metadata for the projected listener position based on the listener position and the information related to a relationship between a position of the one or more audio sources and the projection shape may further comprise: generating at least one interpolation weights based on the audio signal set positions and the projected listener position; applying the at least one interpolation weights to respective audio signal set audio metadata to generate interpolated audio metadata; and combining the interpolated audio metadata to generate the modified metadata for the projected listener position.

[0057] Estimating modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may further comprise mapping the modified metadata based on the second listener position to a cartesian coordinate system.

[0058] Obtaining information based on a position of the one or more audio sources relative to the projection shape may further comprise indicating whether the position of the one or more audio sources is on the boundary of the projection shape.

[0059] Estimating modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape may further comprise: estimating a modified direction of arrival and energy based on the information indicating the position of the one or more audio sources is on the projection shape; estimating a modified direction of arrival, modified direct-to-total energy ratio and directional weighting based on the information indicating the position of the one or more audio sources is otherwise not on the projection shape.

[0060] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0061] Obtaining the projection shape may further comprise obtaining a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0062] Obtaining information may further comprise obtaining a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

[0063] According to a sixth aspect there is provided a method for an apparatus for assisting the generation of a spatialized audio output based on a listener position, the method comprising at least: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; transmitting to a further apparatus the one or more audio signal sets and the information.

[0064] The projection shape may be one of: a projection sphere having a defined projection radius boundary; a regular projection shape having a regular defined projection boundary; and an irregular projection shape having an arbitrary projection boundary.

[0065] Obtaining the projection shape may further comprise obtaining a parameter defining at least one of: the projection shape; a radius associated with the projection shape; and at least one dimension associated with the projection shape.

[0066] Obtaining information may further comprise obtaining a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

[0067] According to a seventh aspect, there is provided a computer readable medium comprising instructions which, when executed by an apparatus for generating a spatialized audio output based on a listener position, cause the apparatus to perform at least the following: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtaining, based on the one or more audio signal sets, metadata associated with the projected listener position; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; and estimating modified metadata for the projected listener position based on the listener position and the information.

[0068] According to an eighth aspect, there is provided a computer readable medium comprising instructions which, when executed by an apparatus for assisting the generation of a spatialized audio output based on a listener position, cause the apparatus to perform at least the following: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; transmitting to a further apparatus the one or more audio signal sets and the information.

[0069] According to a ninth aspect, there is an apparatus for generating a spatialized audio output based on a listener position, the apparatus comprising: obtaining circuitry configured to obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining circuitry configured to obtain a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determining circuitry configured to determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtaining circuitry configured to obtain, based on the one or more audio signal sets, metadata associated with the projected listener position; obtaining circuitry configured to obtain a projection shape; obtaining circuitry configured to obtain information based on a position of the one or more audio sources relative to the projection shape; and estimating circuitry configured to estimate modified metadata for the projected listener position based on the listener position and the information.

[0070] According to a tenth aspect, there is an apparatus for assisting the generation of a spatialized audio output based on a listener position, the apparatus comprising: obtaining circuitry configured to obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining circuitry configured to obtain for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtaining circuitry configured to obtain a projection shape; obtaining circuitry configured to obtain information based on a position of the one or more audio sources relative to the projection shape; transmitting circuitry configured to transmit to a further apparatus the one or more audio signal sets and the information.

[0071] According to an eleventh aspect, there is an apparatus for generating a spatialized audio output based on a listener position, the apparatus comprising: means for obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; means for obtaining a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; means for determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; means for obtaining, based on the one or more audio signal sets, metadata associated with the projected listener position; means for obtaining a projection shape; means for obtaining information based on a position of the one or more audio sources relative to the projection shape; and means for estimating modified metadata for the projected listener position based on the listener position and the information.

[0072] According to a twelfth aspect, there is an apparatus for assisting the generation of a spatialized audio output based on a listener position, the apparatus comprising: means for obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; means for obtaining for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; means for obtaining a projection shape; means for obtaining information based on a position of the one or more audio sources relative to the projection shape; means for transmitting to a further apparatus the one or more audio signal sets and the information.

[0073] According to a thirteenth aspect, there is provided a non-transitory computer readable medium comprising program instructions that, when executed by an apparatus, cause the apparatus to perform at least the method according to any of the preceding aspects.

[0074] According to a fourteenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for defining a file format carriage for generating a spatialized audio output based on a listener position, the apparatus caused to perform at least the following: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources; determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region; obtaining, based on the one or more audio signal sets, metadata associated with the projected listener position; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; and estimating modified metadata for the projected listener position based on the listener position and the information.

[0075] According to a fifteenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for defining a file format carriage for assisting the generation of a spatialized audio output based on a listener position, the apparatus caused to perform at least the following: obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position; obtaining for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources; obtaining a projection shape; obtaining information based on a position of the one or more audio sources relative to the projection shape; transmitting to a further apparatus the one or more audio signal sets and the information.

[0076] In the above, many different embodiments have been described. It should be appreciated that further embodiments may be provided by the combination of any two or more of the embodiments described above.

DESCRIPTION OF FIGURES

[0077] Embodiments will now be described, by way of example only, with reference to the accompanying Figures in which:

[0078] Figs.1a to 1c show example audio scenes within which a user can freely move according to some embodiments;

[0079] Figs.2a and 2b show schematically an example spatial metadata modifier according to some embodiments;

[0080] Fig.3 shows a flow diagram showing an operation of the example spatial metadata modifier as shown in Fig.2b for implementing some embodiments;

[0081] Fig.4 shows an example renderer within which some embodiments can be implemented;

[0082] Fig.5 shows a flow diagram of an example renderer shown in Fig.4 according to some embodiments;

[0083] Figs.6a and 6b show schematically a further example spatial metadata modifier according to some embodiments;

[0084] Fig.7 shows an example system within which the renderer can be implemented;

[0085] Fig.8 shows a flow diagram of an example decoder / renderer shown in Fig.7 according to some embodiments;

[0086] Figs.9a to 9c show example shapes for DOA mapping;

[0087] Fig.10 shows an effect of an example directional weighting according to some embodiments; and

[0088] Fig.11 shows apparatus suitable for implementing some embodiments wherein a capture apparatus can be separate from the rendering apparatus elements.

DETAILED DESCRIPTION

[0089] The following relates to apparatus, methods and computer programs for rendering of audio scenes.

[0090] As discussed above, 6DoF is presently commonplace in virtual reality, such as VR games, where movement within the audio scene is straightforward to render as all spatial information is readily available (i.e., the position of each sound source as well as the audio signal of each source separately).

[0091] In the following examples the audio signal sets are generated by microphones (or microphone arrays). For example a microphone arrangement may comprise one or more microphones and generate for the audio signal set one or more audio signals. In some embodiments the audio signal set comprises audio signals which are virtual or generated audio signals (for example a virtual speaker audio signal with an associated virtual speaker location). In some embodiments the microphone-arrays are furthermore separate from or physically located away from any processing apparatus, however this does not preclude examples where the microphones are located on the processing apparatus or are physically connected to the processing apparatus.

[0092] The current MPEG-I Immersive audio standard (ISO / IEC 23090-4 WD3) renderer such as described in GB2007710.8 supports 6 degrees-of-freedom (6DoF) rendering of audio scenes comprising multiple first order or higher-order Ambisonics (FOA, HOA) microphone recordings or synthesized signals. The renderer is able to provide a binaural signal at the listener position (and orientation) based on the recorded FOA / HOA signals and their positions. That is, the renderer is able to provide a binaural signal at a non-sampled position in the scene, thus providing a 6DoF experience for the listener.

[0093] For example, Figure 1a shows an example scene 100 with microphone positions m1 101, m2 103, m3 105 forming a first triangle (with one side in common with the second triangle and another side in common with the third triangle), microphone positions m4 107, m2 103, m3 105 forming a second triangle (with one side in common with the first triangle and another side in common with the fourth triangle), microphone positions m1 101, m5 109, m3 105 forming a third triangle (with one side in common with the first triangle and another side in common with the fourth triangle), and microphone positions m4 107, m5 109, m3 105 forming a fourth triangle (with one side in common with the second triangle and another side in common with the third triangle).

[0094] Additionally shown is an example listener position pL 111, which in Figure 1a is within the first triangle.

[0095] The (MPEG-I) immersive audio renderer can be configured to utilize information (e.g., position) about audio sources present in the scene (e.g., during the recording), as shown in UK patent application GB2020239.6. The positions of any of the audio sources present in the recorded or synthetic audio scene may be provided as input information to the renderer in addition to the audio signals and positions of the FOA or HOA microphones. Using this additional information about the audio sources allows the renderer to provide improved sound quality through improved localization and distance attenuation behaviour for the known sources. To achieve this, the renderer is configured to perform beamforming (from a FOA / HOA signal) towards a known source to estimate the source properties (energy at different frequency bands). Based on this and the position of the known source, the renderer is able to determine from which direction, from the listener’s perspective, and in which frequency bands, how much energy is being contributed by the known source.

[0096] An example of exterior rendering is shown in Figures 1b and 1c and is described in further detail in EP21201766.9. The interior rendering method determines spatial parameters at the listener position based on weighted interpolation of spatial parameters calculated from the microphone signals of the microphones defining the triangle that the listener is in. The closer a listener is to a microphone, the more weight is given to the spatial parameters calculated at that microphone position. However, when the listener moves outside of the microphone area, and thus is not inside any of the triangles, a weighted interpolation method such as described above cannot be used.

[0097] Previously, the rendering of exterior listening positions employed spatial parameters interpolated from the spatial parameters calculated at the microphones defining the outside edge that the listener is closest to, and modified those parameters based on the distance of the listener to the edge and an exterior rendering radius. In some situations the direction of arrival and diffuseness parameters (which are a subset of the spatial parameters) can be modified based on an assumption that audio sources outside of the capture area are approximately at a distance from the microphone array area equal to the exterior rendering radius.

[0098] For example, as shown in Figs.1b and 1c, the listener position pL 141 is projected to the side of the triangle connecting the two closest microphone positions, in other words to point 143 between microphone positions m1 101, m2 103, with the exterior rendering radius modified based on the distance of the known sources from the microphone area and the position of the listener. The exterior rendering radius is set to the distance of the closest informed source to the listener from the edge of the microphone area. Figure 1c for example shows the effect of application of the sound direction, where the sound direction-of-arrival is adjusted based on an assumption that sound sources are on the circle 153 (also known as the projection sphere). In these embodiments the distance to the sound source is not known; only direction information for frequency bands is known. The radius of the circle can be an estimate of the distance of outside audio sources. In the current draft international standard (DIS) version of the MPEG-I immersive audio standard (ISO / IEC 23090-4), the radius may be set by the content creator and provided to the renderer in the bitstream. A default value of 4.0 m is used if no information is provided in the bitstream. In this example the direction of arrival of the sound source changes as the listener moves, and this can be implemented by adjusting the spatial parameters estimated for the projected listener position and using the adjusted spatial parameters as the estimate for the spatial parameters at the listener position.
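By way of a non-limiting illustration, the projection of an exterior listener position onto the closest edge of the microphone area may be sketched as follows. This is a minimal two-dimensional sketch only; the function names `project_to_segment` and `project_listener` and the example coordinates are hypothetical and are not taken from the renderer described herein.

```python
import math

def project_to_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate edge: both microphones at the same position
    # Parameter of the orthogonal projection, clamped to stay on the edge.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def project_listener(p_listener, edges):
    """Return (projected position, distance) for the closest outside edge."""
    best = None
    for a, b in edges:
        q = project_to_segment(p_listener, a, b)
        d = math.dist(p_listener, q)
        if best is None or d < best[1]:
            best = (q, d)
    return best

# Hypothetical microphone positions forming one triangle.
m1, m2, m3 = (0.0, 0.0), (4.0, 0.0), (2.0, 3.0)
p_proj, dist = project_listener((1.0, -2.0), [(m1, m2), (m2, m3), (m3, m1)])
```

In this sketch the listener at (1, -2) is projected onto the edge between m1 and m2, and the returned distance is the listener-to-edge distance used when modifying the parameters.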

[0099] In other words for ‘exterior’ rendering spatial parameters are interpolated from the spatial parameters calculated at the microphones defining the outside edge that the listener is closest to and then these parameters are modified based on the distance of the listener to the edge and an exterior rendering radius.

[0100] For example, this can be shown with respect to Fig.2a. In this example the listener position $p_L$ 208 is projected to $p_{L,proj}$ 206 on the side of the triangle between the outside 200 and inside 202 microphone areas. The direction of arrival parameters $\theta, \varphi$ (also $u_{DOA}$ 212) and the diffuseness parameters, which are a subset of the spatial parameters $M = \{\theta, \varphi, r, e\}$ 204, are modified by the spatial metadata modifier 201 based on an assumption that audio sources outside of the capture area are approximately at a distance from the microphone array area equal to the exterior rendering radius, and based on $R_{proj}$ 214 and $p_L$ 208, to generate modified parameters $M_{new} = \{\theta_{new}, \varphi_{new}, r_{new}, e\}$ 210 such as the modified DOA $u_{DOA,mod}$ 210.

[0101] Thus when performing exterior rendering, an assumption is made that the sound sources are approximately at a similar distance from the microphone area. This assumption is made because the positions of the sound sources are not defined exactly. Furthermore the spatial parameters are also modified to obfuscate any localization errors caused by mapping of the sound sources to the projection sphere. Namely:
Diffuseness is increased if the listener moves closer to the mapped sounds; and
Spatial parameters are modified by a different amount based on the directions from which the sounds are estimated to be coming. Sounds from inside the microphone array area are modified less and sounds coming from the outside are modified more.

[0102] An example scenario of such a scene could be a recording of a couple of musicians playing on a busy street. There are people around the musicians as well as people walking on the street. Some people on the street are listening and contributing to the audio scene by applauding or talking. The song is recorded using multiple Ambisonics mics, placed some distance apart. The positions of at least some of the musicians are known. The scene is captured and then subsequently turned into a VR scene by the content creator. The content creator creates a scene with an associated scene description with the positions of the recording microphones and any of the known sources. The scene description can be in any suitable format, including Encoder Input Format (EIF). The scene description is then encoded into a bitstream along with the audio and is provided to the listener for consumption via VR, for example.

[0103] The modification of spatial parameters to obfuscate the localization errors described above performs well for general situations where the position of all sound sources is not well defined. The further a sound source is away from the exterior rendering projection sphere, the more erroneous the localization is. In a scene, some sound sources might be at a known distance from the microphone arrays, some might be closer, and some might be farther.

[0104] In cases where all sound sources in the scene are located at the same distance from the microphone area, this obfuscation is unnecessary and limits the localization performance of the exterior rendering.

[0105] In EP 21201766.9, the energy parameter is not modified. This is reasonable, since the exact positions of the sound sources are not known, and thus the distance between the listener and the sound sources is not known (so an energy modification or compensation cannot be determined). As a result, during exterior rendering the sound source energy does not change with the listener’s distance from the source. This is especially noticeable when the listener moves further away from the microphone area, as the sounds do not attenuate as the listener would expect.

[0106] Thus, in situations like these, when normal exterior rendering is used, the listener perceives localization errors as well as sound sources sounding more diffuse than they should be. In other words the listener is likely to experience errors because of the two following reasons.

[0107] Firstly, the diffuseness is due to the modification (decrease) of the direct-to-total ratio as the listener moves closer to a mapped (to the sphere) sound source. As the diffuseness increases, the direction from which the sound arrives becomes increasingly unclear. This is a good approach when the sound sources are not actually on the sphere but are still mapped to the sphere (such as described in detail in EP 21201766.9). However, if the sources are actually on the sphere the modification is unnecessary.

[0108] Secondly, for the listener, localization errors are heard as sound sources not perceived to be arriving from the correct direction. Sound source positions may appear not to coincide with the direction of their visual counterparts (VR situation). In EP 21201766.9, a directional weighting is performed so that the original DOA vector is averaged with the DOA vector mapped onto the exterior sphere. This causes a source exactly on the sphere to have an incorrect DOA. This can be shown, for example, in Fig.10, where the final DOA vector after modification will be $u_{DOA,mod,weighted}$ 1015, which is a weighted average of $u_{DOA}$ 1007 and $u_{DOA,mod}$ 1005. As can be seen in the figure, $u_{DOA,mod,weighted}$ 1015 does not point toward the sound source 1009 from the listener position $p_L$ 1003, but is only approximately correct. The listener will therefore perceive or hear the sound source from the wrong direction.

[0109] The concept as discussed in further detail herein with respect to the following embodiments and examples relates to a binaural (or otherwise spatial audio signals compatible) rendering of 6DoF audio scenes captured with one or more higher-order Ambisonics microphones (HOA) where there is provided apparatus and methods for estimating spatial parameters at a listener position when the listener is exterior to the microphone area and when sound sources are assumed to be located at a set distance from the microphone area.

[0110] In some embodiments this can be enabled via a defined or specific signaling of information in order to aim to achieve high-quality rendering of the audio scene without localization errors (i.e. without wrong direction and increased diffuseness).

[0111] This approach as described herein can in some embodiments be implemented as a parallel or optional method to the one described in EP 21201766.9 for estimating spatial parameters at listener positions exterior to the microphone area.

[0112] The method as described in EP 21201766.9 and as detailed in the following disclosure first estimates spatial parameters at the edge of the microphone area and then modifies the estimated spatial parameters at the listener position based on a projection radius (or projection sphere or suitable projection dimension).

[0113] In some embodiments the apparatus and method are configured to implement the methods as described in EP 21201766.9 to produce a good solution for general situations when the sound source positions are not known and furthermore implement the following methods for a specific, but important, situation where all sound sources in a scene are known to be at the same distance (or substantially same or approximately similar distance) away from the microphone area whereas EP 21201766.9 methods can be applied in scenarios where sources are located both inside and outside of the microphone area and at more relaxed distances from the microphone area.

[0114] The aim of the embodiments described in further detail herein is to therefore provide better localization performance than the methods described in EP 21201766.9 for the situation when the scene is recorded with a single HOA microphone and all sources are around this microphone at set distances. The methods presented in EP 21201766.9 are likely to produce better localization performance in situations where the above assumptions do not hold. In some embodiments a content creator instructs (or informs or controls) a renderer as to which method to employ by indicating the method or approach in a bitstream.

[0115] In the proposed method spatial parameters are modified (when instructed to do so in the bitstream) as follows:
Spatial parameters $\{\theta, \varphi, r, e\}$ are calculated at the edge of the microphone array area;
New direction parameters at the listener position $\theta_{new}, \varphi_{new}$ are obtained based on a mapping of the directions $\theta, \varphi$ to a position on a sphere with a set radius; and
New energy parameters $e_{new}$ are calculated based on the distance of the listener to the position where the direction parameters were mapped.

[0116] In other words, the differences between the parameter modifications performed in some embodiments herein and in EP 21201766.9 are the following:
No diffuseness modification;
No directional weighting;
New energy compensation.
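Purely as an illustrative sketch of the three steps above, the mapping and energy compensation could look as follows. Several points here are assumptions rather than features taken from the present disclosure: the sphere centre `center` is supplied by the caller, the mapped position is taken as the intersection of the DOA ray from the projected listener position with the sphere, and the energy is compensated with an inverse-square law. The names `map_to_sphere` and `modify_for_listener` are hypothetical.

```python
import math

def map_to_sphere(p_proj, u_doa, center, radius):
    """Intersect the ray p_proj + t * u_doa (t >= 0) with the sphere.

    u_doa is assumed to be a unit vector.
    """
    oc = [p - c for p, c in zip(p_proj, center)]
    b = sum(o * u for o, u in zip(oc, u_doa))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        raise ValueError("DOA ray does not reach the projection sphere")
    t = -b + math.sqrt(disc)  # far root; positive when p_proj is inside
    if t < 0.0:
        raise ValueError("projection sphere lies behind the DOA ray")
    return [p + t * u for p, u in zip(p_proj, u_doa)]

def modify_for_listener(p_listener, p_proj, u_doa, energy,
                        center, radius, min_dist=1e-3):
    """Return (new DOA vector, new energy) at the listener position."""
    mapped = map_to_sphere(p_proj, u_doa, center, radius)
    d_proj = max(math.dist(mapped, p_proj), min_dist)
    d_listener = max(math.dist(mapped, p_listener), min_dist)
    # New direction: from the listener towards the mapped position.
    u_new = [(m - p) / d_listener for m, p in zip(mapped, p_listener)]
    # Assumed inverse-square compensation relative to the projected position.
    e_new = energy * (d_proj / d_listener) ** 2
    return u_new, e_new

u_new, e_new = modify_for_listener(
    p_listener=(0.0, -3.0, 0.0), p_proj=(0.0, 0.0, 0.0),
    u_doa=(1.0, 0.0, 0.0), energy=1.0,
    center=(0.0, 0.0, 0.0), radius=4.0)
```

Note that the direct-to-total ratio is left untouched and no directional weighting is applied, in keeping with the differences listed above.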

[0117] In addition, in some embodiments a new bitstream parameter is defined for the MPEG-I immersive audio bitstream which is used by the renderer to decide which of the above two exterior rendering methods (EP 21201766.9 or those described in further detail herein) to use during rendering.

[0118] For example Fig.3 shows a flow diagram describing the operations described in further detail herein when the listener is outside of the microphone array area.

[0119] In some embodiments, as shown in Fig.3 by 300, there is received information or a determination with respect to the relationship between the audio sources or sound sources and the projection sphere. In this example, the relationship is defined by the receipt and check of the sourcesOnProjectionSphere information (which can be a binary flag or indicator).

[0120] For example where the sourcesOnProjectionSphere information is false and there are no sources on the projection sphere then the operations follow the path shown in the left hand side of the Fig.3.

[0121] Thus there are obtained spatial parameters $\{\theta, \varphi, r, e\}$ at the edge of the microphone area, as shown in Fig.3 by 301.

[0122] Then there is a modification, as shown in Fig.3 by 303, of the direction of arrival (DOA) $\theta, \varphi$, for example applying the methods shown in EP 21201766.9 to generate $\theta_{new}, \varphi_{new}$.

[0123] Following this is a modification, as shown in Fig.3 by 305, of the direct to total energy ratio (DTR) $r$, for example applying the methods shown in EP 21201766.9 to generate $r_{new}$.

[0124] Additionally, the generation of direction weights is shown in Fig.3 by 307.

[0125] Thus the modified metadata parameters $\{\theta_{new}, \varphi_{new}, r_{new}, e\}$ are then used by the renderer to render a suitable output audio signal as shown in Fig.3 by 309.

[0126] Furthermore, where the sourcesOnProjectionSphere information is true and there are sources on or near the projection sphere, the operations follow the path shown in the right hand side of Fig.3.

[0127] Thus there are obtained spatial parameters $\{\theta, \varphi, r, e\}$ at the edge of the microphone area, as shown in Fig.3 by 311.

[0128] Then there is a modification, as shown in Fig.3 by 313, of the direction of arrival (DOA) $\theta, \varphi$, for example applying the methods described herein to generate $\theta_{new}, \varphi_{new}$.

[0129] Following this is a modification, as shown in Fig.3 by 315, of the energy $e$, for example applying the methods described in further detail herein to generate $e_{new}$.

[0130] Thus the modified metadata parameters $\{\theta_{new}, \varphi_{new}, r, e_{new}\}$ are then used by the renderer to render a suitable output audio signal as shown in Fig.3 by 309.
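The two paths of Fig.3 may be summarized by the following sketch. It is illustrative only: the `SpatialMetadata` container, the `ops` dictionary and the placeholder modification functions are hypothetical stand-ins for the actual parameter modifications, and the direction weighting of step 307 is omitted for brevity.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SpatialMetadata:
    theta: float  # azimuth
    phi: float    # elevation
    r: float      # direct-to-total energy ratio
    e: float      # energy

def modify_exterior_metadata(meta, sources_on_projection_sphere, ops):
    """Select the exterior modification path from the signalled flag."""
    if not sources_on_projection_sphere:
        theta, phi = ops["doa_general"](meta)        # step 303
        r = ops["dtr_general"](meta)                 # step 305
        return replace(meta, theta=theta, phi=phi, r=r)
    theta, phi = ops["doa_on_sphere"](meta)          # step 313
    e = ops["energy_on_sphere"](meta)                # step 315
    return replace(meta, theta=theta, phi=phi, e=e)

# Placeholder operations, for demonstration only.
ops = {
    "doa_general": lambda m: (m.theta + 0.1, m.phi),
    "dtr_general": lambda m: m.r * 0.5,
    "doa_on_sphere": lambda m: (m.theta + 0.2, m.phi),
    "energy_on_sphere": lambda m: m.e * 0.25,
}
meta = SpatialMetadata(theta=1.0, phi=0.0, r=0.8, e=2.0)
general = modify_exterior_metadata(meta, False, ops)
on_sphere = modify_exterior_metadata(meta, True, ops)
```

In the first path the energy is passed through unchanged, whereas in the second path the direct-to-total ratio is passed through unchanged.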

[0131] The proposed embodiments can be employed for scenes where a single microphone array is used to record several sound sources located around the microphone array. For example, an acoustic band is playing around a 360 camera with an integrated HOA microphone.

[0132] In the following the projection dimension is defined as a projection sphere (in other words equal in all 3 dimensions) or projection radius (for example defining a two dimensional projection). It would however be understood that the projection sphere is more generally a projection displacement or dimensional parameter and could define more than a simple radius term. In other words the projection can be irregular and not uniform according to the direction.

[0133] With respect to Fig.4 is shown a schematic view of an example apparatus within which some embodiments can be employed. The example in Fig.4 shows how embodiments can be implemented on a MPEG-I Immersive audio system for 6DoF Multi-Point HOA rendering of audio scenes comprising HOA / FOA content (recorded or synthesized). It would be understood that the embodiments can be employed to render such a scene using other encoded implementations.

[0134] In some embodiments the renderer comprises a pre-processor 401. The pre-processor 401 is configured to initialize the rendering after receiving information of the scene such as positions of HOA / FOA sources $p_{1 \ldots N_s}$ 402 and also receiving head-related impulse responses (HRIRs) 400.

[0135] The pre-processor 401 is configured to obtain the positions 402 and perform Delaunay triangulation to provide a set of triangles $T_{1 \ldots N_T}$ 412 that partition the scene into triangle sections. This triangulation is for example shown in Figures 1a to 1c. The pre-processor 401 is also configured to convert the head-related impulse responses (HRIRs) 400 to frequency domain head-related transfer functions (HRTFs) 420. In some embodiments this can be implemented by a short-time Fourier transform (STFT). In a MPEG-I Immersive audio case, the alias-free STFT algorithm is used such as described in Pulkki, V., S. Delikaris-Manias, and A. Politis, Parametric Time-frequency Domain Spatial Audio. 2018: John Wiley & Sons, Incorporated. In some embodiments the HRTFs 420 can be used to calculate an Ambisonics-to-binaural transform matrix $M_{HOA2bin}(b)$ for each frequency band $b$.

[0136] In some embodiments the renderer further comprises a position pre-processor 403. The position pre-processor 403 is configured to determine HOA / FOA sources that are close to the listener for later processing purposes.

[0137] During rendering, for each input frame $j$, the position pre-processor 403 is configured to take as an input the listener position $p_L$ 404, the HOA / FOA source positions $p_{1 \ldots N_s}$ 402 and the set of triangles $T_{1 \ldots N_T}$ 412 created in the pre-processor 401. Based on the input, the position pre-processor 403 determines interpolation weights $w_c(i,j)$ 416 for the HOA / FOA sources $p_i$ 402. This is done by determining which triangle the listener is in and calculating the barycentric coordinates for that triangle and the listener position. The barycentric coordinates are used as the weights. The weights sum to one and the closer the listener is to an HOA / FOA source, the higher the weight will be. At an edge of a triangle, the weight for the HOA / FOA source that is not part of that edge is 0. The weights $w_c(i,j)$ 416 and the triangle that the listener is in (active triangle) $T_A(j)$ 414 are provided as outputs.
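As an illustration of the weight calculation described above, the barycentric coordinates of a listener position within a two-dimensional triangle may be computed as in the following sketch. The function name `barycentric_weights` and the example coordinates are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0.0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb  # the three weights sum to one

# Hypothetical source positions and listener positions.
a, b, c = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
w_inside = barycentric_weights((1.0, 1.0), a, b, c)
w_on_edge = barycentric_weights((2.0, 0.0), a, b, c)
```

For the listener at (1, 1) the source at `a` is the closest and receives the largest weight. For the listener on the edge between `a` and `b`, the weight of `c` is zero, consistent with the behaviour described for a listener at an edge of a triangle.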

[0138] In some embodiments, when the listener moves from one triangle to another, the switching of the active triangle may be delayed, and only performed after a few frames of audio. This is because processing of the audio for HOA / FOA sources for the new triangle requires a few frames of audio to provide meaningful output. Thus, the active triangle is not always the triangle that the listener is in. The position pre-processor 403 can also determine the HOA / FOA source that is closest to the listener.

[0139] In some embodiments the renderer comprises a spatial analyzer 411. The spatial analyzer 411 provides spatial metadata parameters for frames of HOA / FOA signals that are positioned near the listener. These are later used, in the metadata interpolator 409, to estimate spatial metadata parameters at the listener position.
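One possible way to delay the switching of the active triangle is sketched below. This is an assumption-laden example: the class name `ActiveTriangleTracker` and the `hold_frames` parameter (the number of consecutive frames the listener must spend in a new triangle before it becomes active) are hypothetical and do not reflect a value given in this disclosure.

```python
class ActiveTriangleTracker:
    """Switch the active triangle only after a few consecutive frames."""

    def __init__(self, hold_frames=3):
        self.hold_frames = hold_frames
        self.active = None
        self._candidate = None
        self._count = 0

    def update(self, containing):
        """containing: index of the triangle the listener is currently in."""
        if self.active is None or containing == self.active:
            self.active, self._candidate, self._count = containing, None, 0
            return self.active
        if containing != self._candidate:
            # Listener entered a different triangle: restart the count.
            self._candidate, self._count = containing, 0
        self._count += 1
        if self._count >= self.hold_frames:
            self.active, self._candidate, self._count = containing, None, 0
        return self.active

tracker = ActiveTriangleTracker(hold_frames=3)
history = [tracker.update(t) for t in [0, 0, 1, 1, 1, 1]]
```

Here the listener enters triangle 1 at the third frame, but the active triangle only changes once three consecutive frames have been observed in the new triangle.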

[0140] The spatial analyzer 411 is configured to receive as an input a frame (for example, 456 samples) of HOA / FOA source signals $s_{ESD}(i,j)$ 410 for each source $i$ and frame $j$ and the current active triangle $T_A(j)$ 414. For each HOA / FOA source $i$ belonging to the triangle $T_A(j)$ 414, the spatial analyzer 411 performs STFT processing to obtain time-frequency domain signals $S(i,j,k)$ 422, where $k$ refers to a sub-frame (for example, 128 samples). The spatial analyzer 411 then calculates spatial metadata from the time-frequency domain signals. The spatial metadata comprises the energy, direction information (azimuth and elevation) and diffuseness information (direct-to-total energy ratio) and is obtained as follows:

First, a signal analysis vector $\mathbf{s}(i,j,k,b)$ is calculated:

$$\mathbf{s}(i,j,k,b) = \begin{bmatrix} S_{b,1}(i,j,k) \cdot 1.0 \\ S_{b,2}(i,j,k) \cdot 0.5774 \\ S_{b,3}(i,j,k) \cdot 0.5774 \\ S_{b,4}(i,j,k) \cdot 0.5774 \end{bmatrix}$$

where $S_{b,c}(i,j,k)$ is the value in matrix $S(i,j,k)$ corresponding to channel $c$ and frequency bin $b$.

From the signal analysis vector, a signal intensity vector is calculated:

$$\mathbf{i}(i,j,k,b) = \mathrm{Re}\left\{ \begin{bmatrix} s^*_{[\ldots]}(i,j,k,b)\, s_1(i,j,k,b) \\ s^*_{[\ldots]}(i,j,k,b)\, s_2(i,j,k,b) \\ [\ldots] \end{bmatrix} \right\}$$

where $s_c^*(i,j,k,b)$ denotes the complex conjugate of $s_c(i,j,k,b)$.

Signal energy is then calculated as follows:

$$e(i,j,k,b) = \frac{1}{[\ldots]} \sum_{c=1}^{4} s_c^*(i,j,k,b)\, s_c(i,j,k,b)$$

An average intensity vector and average energy are calculated as follows:

$$e(i,j,b) = \frac{1}{N_{sf}} \sum_{k=1}^{N_{sf}} e(i,j,k,b)$$

$$\mathbf{i}(i,j,b) = \frac{1}{N_{sf}} \sum_{k=1}^{N_{sf}} \mathbf{i}(i,j,k,b)$$

The direction data (azimuth and elevation) are calculated as follows:

$$\theta(i,j,b) = \operatorname{atan2}\big(i_2(i,j,b),\, i_1(i,j,b)\big)$$

$$\varphi(i,j,b) = \operatorname{atan2}\big([\ldots]\big)$$

where $i_n(i,j,b)$ is the nth element of the average intensity vector $\mathbf{i}(i,j,b)$, $\theta(i,j,b)$ is the azimuth and $\varphi(i,j,b)$ is the elevation.

The direct-to-total energy ratio is calculated as follows:

$$r(i,j,b) = [\ldots]$$

The energy for the subframes $k$ is then obtained as follows:

$$e(i,j,k,b) = e(i,j,b), \quad k \in 1 \ldots N_{sf}$$

Thus the output also comprises $\theta(i,j,b)$ 428, $\varphi(i,j,b)$ 430, $r(i,j,b)$ 432 and $e(i,j,k,b)$ 424.

[0141] The renderer furthermore comprises a spatial metadata interpolator 409 which is configured to provide an estimate of the spatial metadata at the listener position based on the spatial metadata for the HOA / FOA sources belonging to the triangle that the listener is in, and the interpolation weights calculated in the position pre-processing block.

[0142] First, the spatial metadata is converted into vector form:
$$v(i,j,b) = \begin{bmatrix} -\sin(\theta(i,j,b)) \cos(\varphi(i,j,b)) \\ \sin(\varphi(i,j,b)) \\ -\cos(\theta(i,j,b)) \cos(\varphi(i,j,b)) \end{bmatrix} r(i,j,b)$$

[0143] The vectors are rotated according to the listener’s head and source orientations:v(j,y,b) = k / iead (f)R source

[0144] An interpolated spatial metadata vector is then calculated by a weighted average of spatial metadata vectors:
$$v(j,k,b) = \sum_{i=1}^{N_s} w_{c_i}(j,k)\, v(i,j,b)$$

[0145] where the interpolation weights are the barycentric coordinates calculated in the position preprocessing block.

[0146] And finally, the interpolated spatial metadata vector is converted to spatial metadata parameters as follows:

[0147] Azimuth:
$$\theta(j,k,b) = \operatorname{atan2}\big(-v_1(j,k,b),\ -v_3(j,k,b)\big)$$

[0148] Elevation:
$$\varphi(j,k,b) = \operatorname{atan2}\Big(v_2(j,k,b),\ \sqrt{(-v_1(j,k,b))^2 + (-v_3(j,k,b))^2}\Big)$$

[0149] Direct-to-total energy ratio:
$$r(j,k,b) = \sqrt{(v_1(j,k,b))^2 + (v_2(j,k,b))^2 + (v_3(j,k,b))^2}$$

[0150] Energy:
$$e(j,k,b) = \sum_{i=1}^{N_s} w_{c_i}(j,k)\, e(i,j,k,b)$$

[0151] The interpolated values $\theta(j,k,b)$ 436, $\varphi(j,k,b)$ 438, $r(j,k,b)$ 440 and $e(j,k,b)$ 442 can be output.
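The interpolation of paragraphs [0142] to [0150] can be illustrated for a single time-frequency tile with the following minimal Python sketch. The function names and the tuple layout are hypothetical, the rotation step of [0143] is left out, and the elevation is computed with the sign that inverts the vector form of [0142]; it is an illustration under these assumptions rather than the renderer's implementation.

```python
import math

def direction_to_vector(azimuth, elevation, ratio):
    """Vector form of [0142]: direction (radians) scaled by the direct-to-total ratio."""
    return (
        -math.sin(azimuth) * math.cos(elevation) * ratio,
        math.sin(elevation) * ratio,
        -math.cos(azimuth) * math.cos(elevation) * ratio,
    )

def interpolate_metadata(sources, weights):
    """Interpolate spatial metadata for one (j, k, b) tile.

    sources: list of (azimuth, elevation, ratio, energy), one per HOA source
    weights: interpolation weights (assumed barycentric, summing to 1)
    """
    v = [0.0, 0.0, 0.0]
    energy = 0.0
    for (az, el, r, e), w in zip(sources, weights):
        vec = direction_to_vector(az, el, r)
        for n in range(3):
            v[n] += w * vec[n]          # weighted average of vectors, [0144]
        energy += w * e                 # weighted average of energies, [0150]
    azimuth = math.atan2(-v[0], -v[2])                      # [0147]
    elevation = math.atan2(v[1], math.hypot(v[0], v[2]))    # assumed sign, see lead-in
    ratio = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)    # [0149]
    return azimuth, elevation, ratio, energy
```

Because the ratio is the length of the averaged vector, two sources that report opposite directions yield a low direct-to-total ratio at the listener position, i.e. a more diffuse rendering.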

[0152] The signal interpolator 405 is configured to provide an “interpolated signal” at the listener position, that is, an estimate of the signal at the listener position. This is used later (in the mixer 407) in conjunction with the interpolated metadata at the listener position to provide the final binaural output.

[0153] The interpolated signal $S_{b,c}(j,k)$ is obtained by taking the HOA / FOA signal that is closest to the listener and applying an EQ on the signal:
$$S_{b,c}(j,k) = G_{eq}(j,k,b)\, S_{b,c}(m_c(j), j, k)$$

[0154] where $m_c(j)$ is the index of the HOA source chosen for interpolation (closest to the listener in most cases) and:
$$G_{eq}(j,k,b) = \min\left(\sqrt{\frac{e(j,k,b)}{e(m_c(j),j,k,b) + \varepsilon}},\ G_{eqMax}\right)$$

[0155] The value $S(j,k,b)$ 418 can be output to the mixer 407.
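As a rough illustration of the EQ step in [0153] and [0154], the sketch below scales the closest source's bins by a capped gain. It assumes that the gain is the square root of the energy ratio; the default values of `g_eq_max` and `eps` and the function names are hypothetical and not taken from the specification.

```python
import math

def eq_gain(e_interp, e_closest, g_eq_max=4.0, eps=1e-12):
    """Capped gain matching the closest source's energy to the interpolated energy.

    The square root (energy ratio -> amplitude gain) is an assumption;
    g_eq_max and eps are illustrative values only.
    """
    return min(math.sqrt(e_interp / (e_closest + eps)), g_eq_max)

def interpolate_signal(closest_bins, e_interp, e_closest, g_eq_max=4.0):
    """Apply the per-bin EQ gain to one channel of the closest HOA / FOA source.

    closest_bins: complex STFT values over frequency bins b
    e_interp:     interpolated energies e(j, k, b) over the same bins
    e_closest:    energies of the closest source over the same bins
    """
    return [
        eq_gain(ei, ec, g_eq_max) * s
        for s, ei, ec in zip(closest_bins, e_interp, e_closest)
    ]
```

The cap keeps the gain bounded when the closest source has almost no energy in a bin, which would otherwise amplify noise.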

[0156] The renderer can further comprise a mixer 407, which takes as an input the interpolated spatial metadata at the listener position and the interpolated signal at the listener position and provides the binaural time-frequency domain output signal. The mixer 407 can create a signal covariance matrix from the interpolated spatial metadata which describes the desired (or target) spatial characteristics of the signal at the listener position. An optimal mixing algorithm is then used to obtain a mixing matrix which, when multiplied with the interpolated signal, yields a signal in line with the desired spatial characteristics.

[0157] First, a binaural prototype signal is created from the interpolated signal:
$$B(j,b) = M_{HOA2bin}(b) \cdot R_{sh}(j) \cdot \begin{bmatrix} S_{b,1}(j,1) & \cdots & S_{b,1}(j,N_{sf}) \\ \vdots & & \vdots \\ S_{b,N_{ch}}(j,1) & \cdots & S_{b,N_{ch}}(j,N_{sf}) \end{bmatrix}$$

[0158] where $R_{sh}(j)$ is a rotation matrix taking into account the listener and HOA / FOA source rotation and $M_{HOA2bin}(b)$ is the Ambisonics to binaural matrix.

[0159] Then, a signal covariance matrix $C_x$ is calculated for the prototype signal:
$$C_x^{new}(j,b) = B(j,b)\, B^H(j,b)$$

[0160] Recursive averaging is applied to get the signal covariance matrix for frame j:
$$C_x(j,b) = (1 - d)\, C_x^{new}(j,b) + d\, C_x(j-1,b)$$

[0161] where d = 0.9.

[0162] Next, a signal covariance matrix $C_y$ is calculated from the interpolated spatial metadata at the listener position. First the direct portion of $C_y$ is calculated:
$$C_y^{direct}(j,b) = \sum_{k=1}^{N_{sf}} e(j,k,b)\, r(j,k,b)\, H(b,d)\, H^H(b,d)$$

[0163] where H(b, d) refers to the HRTF value at frequency bin b in direction d.

[0164] Second, the diffuse portion of $C_y$ is calculated:
$$C_y^{diffuse}(j,b) = \sum_{k=1}^{N_{sf}} \big(1 - r(j,k,b)\big)\, e(j,k,b)\, C_{dif}(b)$$

[0165] where:C^W =- -

[0166] $C_y$ is then obtained as follows:
$$C_y^{new}(j,b) = C_y^{direct}(j,b) + C_y^{diffuse}(j,b)$$

[0167] And recursive averaging is applied:
$$C_y(j,b) = (1 - d)\, C_y^{new}(j,b) + d\, C_y(j-1,b)$$

[0168] where d = 0.9.
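The construction of the target covariance in [0162] to [0168] can be sketched for one frequency bin as below. This is a minimal illustration assuming a two-channel (left / right) HRTF pair per sub-frame direction and a diffuse-field matrix `c_dif` supplied by the caller; the function names are hypothetical.

```python
def outer_hermitian(h):
    """h h^H for a length-2 complex vector (left and right HRTF values)."""
    return [[h[r] * h[c].conjugate() for c in range(2)] for r in range(2)]

def target_covariance(tiles, c_dif):
    """Direct plus diffuse covariance for one frame j and bin b.

    tiles: list over sub-frames k of (energy, ratio, hrtf_pair), where
           hrtf_pair is the HRTF for that sub-frame's interpolated direction
    c_dif: 2x2 diffuse-field covariance matrix for this bin (given as input)
    """
    c_new = [[0j, 0j], [0j, 0j]]
    for energy, ratio, hrtf in tiles:
        hh = outer_hermitian(hrtf)
        for r in range(2):
            for c in range(2):
                c_new[r][c] += energy * ratio * hh[r][c]                # direct part
                c_new[r][c] += (1.0 - ratio) * energy * c_dif[r][c]     # diffuse part
    return c_new

def smooth(c_new, c_prev, d=0.9):
    """Recursive averaging with the previous frame's matrix, d = 0.9 as in [0168]."""
    return [
        [(1.0 - d) * c_new[r][c] + d * c_prev[r][c] for c in range(2)]
        for r in range(2)
    ]
```

With d = 0.9 each new frame contributes only a tenth of the smoothed matrix, so the target characteristics change gradually from frame to frame.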

[0169] The signal covariance matrices are then used to obtain mixing matrices, which are used to obtain the binaural output as follows:
$$O(j,k,b) = M(j,k,b) * B(j-1,k,b) * D(j,b)$$

[0170] where $D(j,b)$ is a decorrelated time-frequency domain signal obtained from a buffer of previous binaural signals B.

[0171] The matrices $M(j,k,b)$ and $M_r(j,k,b)$ are obtained from the optimal mixing procedure outlined in Vilkamo, J., Bäckström, T., and Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. After applying the mixing matrices on the binaural prototype signal B, the resulting output O 444 has the spatial characteristics of the spatial metadata at the listener position.

[0172] The renderer can furthermore comprise an output processor 413 which is configured to receive the obtained binaural output signal $O(j,k,b)$ and perform an inverse STFT on it to produce the final time domain binaural output signal $s_{out}(i,j)$ 446.

[0173] Fig.5 shows a flow diagram of the example operations of the renderer shown in Fig.4 according to some embodiments.

[0174] For example there is shown in Fig.5 by 501, obtaining HRIR and positions p / .

[0175] Then, as shown in Fig.5 by 503, pre-processing is performed to generate the triangles $T_1 \dots T_{N_T}$ and HRTFs.

[0176] Furthermore then is shown obtaining p / in Fig.5 by 505.

[0177] Then, as shown in Fig.5 by 507, position pre-processing is performed to generate $T_A(j)$ and $w_c(i,j)$.

[0178] The source signals $s_{ESD}(i,j)$ are also obtained, as shown in Fig.5 by 509.

[0179] Following this, as shown in Fig.5 by 511, spatial analysis is performed to generate $S(i,j,k,b)$, $e(i,j,k,b)$, $\theta(i,j,b)$, $\varphi(i,j,b)$ and $r(i,j,b)$.

[0180] This can be followed by spatial metadata interpolation to generate $e(j,k,b)$, $\theta(j,k,b)$, $\varphi(j,k,b)$ and $r(j,k,b)$, as shown in Fig.5 by 515.

[0181] Signal interpolation can then be performed to generate $S(j,k,b)$, as shown in Fig.5 by 517.

[0182] Then, as shown in Fig.5 by 519, mixing is performed to generate $O(j,k,b)$.

[0183] This is followed by the outputting of spatialized audio $s_{out}(i,j)$ (e.g., binaural, surround loudspeakers, Ambisonics), as shown in Fig.5 by 521.

[0184] When the listener moves outside of the microphone area, the spatial metadata interpolator 409 is configured to adjust the interpolated spatial metadata ($\theta(j,k,b)$, $\varphi(j,k,b)$, $r(j,k,b)$) based on listener distance to the microphone area and the exterior rendering radius to give the listener a 6DoF listening experience outside of the microphone area.

[0185] Furthermore when implementing exterior rendering the pre-processor 201 and the spatial analyzer 411 can be configured to implement the following.

[0186] In the position pre-processing block, the triangle in which the listener is located is determined. This, as mentioned earlier, can be implemented by calculating barycentric coordinates for the listener position for all triangles $T_1 \dots T_{N_T}$ found by the Delaunay triangulation. The listener is inside a triangle if all barycentric coordinates are positive. If there is no triangle for which all barycentric coordinates are positive, it is determined that the listener is outside of the microphone area. In such cases, the following can be implemented:

[0187] Firstly, the closest HOA source ($m_k$) and edge ($e_{kl}$) to the listener position ($p_L$) are determined.

[0188] The distance from the listener to each HOA source is obtained by calculating the squared horizontal distance as follows:
$$D_{m_k} = (p_{m_k,x} - p_{L,x})^2 + (p_{m_k,y} - p_{L,y})^2$$

[0189] The closest edge can then be determined by finding the distance Dldfrom the listener position pLto all edges ekiPL,mk’ ^-kl > if dotvn0-> / PL, mi. ’ VfcADij = < PL,mk■ nkt-dotvnJ- j — '■ -, elsedotv’n~ dot^f

[0190] where $p_{L,m_k}$ is the position of the listener in HOA source $m_k$ space:
$$p_{L,m_k} = p_L - p_{m_k}$$

[0191] and where dotv nis the dot product between the normalized edge vector vk(and the normal of the edge nij.dotv n’ W-kl'

[0192] If the closest HOA source to the listener is closer than any of the outside edges, the listener position is projected to the HOA source position:
$$p_{L,proj} = p_{m_{closest}}$$

[0193] When this is not the case, the listener is projected on to the closest exterior edge ($e_{kl}$):
$$p_{L,proj} = w_{kl}\, p_{m_k} + (1 - w_{kl})\, p_{m_l}$$

[0194] The weights wi7are obtained by:. PL,mk’ *kl, if dotv n— 0|Vfcj|~ j PL,mk’ ^Kl \Wkl = PL,mk■ Vkl ~ Idotv nI1 - — - 11 vkl|, elsedotv’n~ dotv n
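The projection described in [0187] to [0194] can be approximated with a standard clamped point-to-segment projection, sketched below in Python. This swaps in the textbook technique and does not reproduce the exact distance and weight formulas of the specification; the function name, the two-dimensional (horizontal plane) coordinates and the dictionary of weights are assumptions made for illustration.

```python
def project_listener(listener, sources, exterior_edges):
    """Project a listener outside the HOA source area onto its boundary.

    listener:       (x, y) horizontal position
    sources:        list of (x, y) HOA source positions
    exterior_edges: list of (k, l) index pairs into sources (non-degenerate)
    Returns (projected_point, weights), where weights maps a source index to
    its interpolation weight (w_kl for k and 1 - w_kl for l on an edge).
    """
    best = None  # (squared distance, projected point, weights)
    for k, l in exterior_edges:
        (ax, ay), (bx, by) = sources[k], sources[l]
        ex, ey = bx - ax, by - ay
        # Position of the orthogonal projection along the edge, 0 at k and 1 at l.
        t = ((listener[0] - ax) * ex + (listener[1] - ay) * ey) / (ex * ex + ey * ey)
        if t <= 0.0:
            point, weights = (ax, ay), {k: 1.0}      # closest point is source k
        elif t >= 1.0:
            point, weights = (bx, by), {l: 1.0}      # closest point is source l
        else:
            point = (ax + t * ex, ay + t * ey)
            weights = {k: 1.0 - t, l: t}
        dist2 = (listener[0] - point[0]) ** 2 + (listener[1] - point[1]) ** 2
        if best is None or dist2 < best[0]:
            best = (dist2, point, weights)
    return best[1], best[2]
```

When the closest point of every nearby edge is one of its end points, the listener is effectively projected to that HOA source and the source receives a weight of 1, which corresponds to the first case in [0192].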

[0195] In the spatial metadata interpolator 409, the interpolation weights can be determined when implementing exterior rendering and an adjustment to the obtained spatial metadata is performed as described below:

[0196] Determining Interpolation weights

[0197] If the listener was projected on a HOA source, the interpolation weight for that source is set to 1 and the interpolation weight for all other sources is set to 0.

[0198] If the listener was projected to an outside edge, the interpolation weights are calculated as follows (for frame j):
$$w(k,j) = w_{kl}, \qquad w(l,j) = 1 - w_{kl}$$

[0199] where k and l are the indices of the HOA sources defining the edge that the listener is closest to.

[0200] Adjusting spatial metadata

[0201] Spatial metadata is interpolated as described above, for the non-exterior rendering case (except with the interpolation weights calculated as directly above).

[0202] After interpolation the spatial metadata is adjusted as follows. Fig.2b and the left hand side of Fig.3 show an example situation, where $p_L$ 403 is located external to the triangle.

[0203] First a DOA vector $u_{DOA}$ 212 is calculated (for each frequency band) from the interpolated spatial metadata:
$$u_{DOA} = \begin{bmatrix} -\sin(\theta) \cos(\varphi) \\ \sin(\varphi) \\ -\cos(\theta) \cos(\varphi) \end{bmatrix}$$

[0204] where $\theta$ and $\varphi$ are the interpolated azimuth and elevation parameters. Note that the audio frame (j) and frequency band (b) indices are not shown for clarity. A weighted DOA vector is also calculated:
$$\tilde{u}_{DOA} = u_{DOA}\, R_{proj}$$

[0205] where $R_{proj}$ 214 is the exterior projection radius (projection sphere or projection dimension). A modified DOA vector $u_{DOA,mod}$ 216 is calculated based on the projected listener position $p_{L,proj}$ 208 and the listener position $p_L$ 206:
$$u_{DOA,mod} = p_{L,proj} + \tilde{u}_{DOA} - p_L$$

[0206] A unit length version is also calculated:
$$\hat{u}_{DOA,mod} = \frac{u_{DOA,mod}}{|u_{DOA,mod}|}$$

[0207] A modified direct-to-total energy ratio is also calculated:, „ ^OO / l,modLrm0d = min (r, — - - )Kpro j

[0208] A directional weighting is applied to the modified parameters to modify parameters more heavily from sound sources outside the capturing region, while applying less modifications to sources inside the capturing region:12Wdir 2^^ ”1” ^L,proj ’ ^DO / 1)

[0209] where $n_{L,proj}$ is a “listener normal”, which is a vector pointing away from the HOA source area. The weighted, modified DOA vector is then:
$$u_{DOA,mod,weighted} = w_{dir}\, \hat{u}_{DOA,mod} + (1 - w_{dir})\, u_{DOA}$$

[0210] Modified spatial metadata values are then obtained from $u_{DOA,mod,weighted}$ and processing is continued as in the non-exterior rendering case:

[0211] Azimuth:
$$\theta = \operatorname{atan2}\big(-u_{DOA,mod,weighted,1},\ -u_{DOA,mod,weighted,3}\big)$$

[0212] Elevation:
$$\varphi = \operatorname{atan2}\Big(u_{DOA,mod,weighted,2},\ \sqrt{(u_{DOA,mod,weighted,1})^2 + (u_{DOA,mod,weighted,3})^2}\Big)$$

[0213] Direct-to-total energy ratio:
$$r = r_{mod,weighted}$$

[0214] The exterior projection radius is defined in the MPEG-I immersive audio bitstream as shown below:

[0215] The two exterior projection radius related bitstream elements are described as follows in the specification:

    Syntax                                                          No. of bits  Mnemonic
    hoaGroups() {
      hoaGroupsCount = GetCountOrIndex();
      for (int i = 0; i < hoaGroupsCount; i++) {
        hoaGroupId = GetID();
        hoaGroupHasRegion;                                          1            bslbf
        if (hoaGroupHasRegion) {
          hoaGroupRegionId = GetID();
        }
        coSourceCount = GetCountOrIndex();
        for (int j = 0; j < coSourceCount; j++) {
          coSourceId = GetID();
        }
        FreqBandConfig();
        exteriorRenderingProjectionRadius = GetDistance(isSmallScene);
        hoaGroupHasInformedSources;                                 1            bslbf
        if (hoaGroupHasInformedSources == True) {
          informedSourceCount;                                      16           uimsbf
          maxSimulInformedSources;                                  8            uimsbf
          for (j = 0; j < informedSourceCount; j++) {
            InformedSourceInfoStruct()
          }
          adaptiveExteriorRenderingProjectionRadius;                1            bslbf
        }
        hoaGroupHasLowProfileConfig;                                1            bslbf
        if (hoaGroupHasLowProfileConfig == True) {
          LowProfileConfig()
        }
      }
    }

[0216] exteriorRenderingProjectionRadius Indicates the exterior projection radius for exterior 6DoF HOA rendering.

[0217] adaptiveExteriorRenderingProjectionRadius Indicates whether the exterior projection radius for 6DoF HOA rendering is modified based on informed source positions.

[0218] In some embodiments, when it is known or determined that sound or audio sources are on the sphere (or more generally region edge) defined by the exterior projection radius or dimension, the above-described exterior rendering process can be modified and further configured to modify the MPEG-I immersive audio bitstream (hoaGroups() element). This further modification is explained next:

[0219] Firstly, in some embodiments a new bitstream element (sourcesOnProjectionSphere) can be defined and added to the hoaGroups() element. When this bit is set to ‘true’, a special “sources on projection sphere” mode of exterior rendering is employed during rendering. This mode assumes that all sound sources are positioned around a single HOA source at a set distance (=exteriorRenderingProjectionRadius). When this bit is set to ‘false’, “normal” exterior rendering as described above is used.

[0220] This can be defined as follows:

    Syntax                                                          No. of bits  Mnemonic
    hoaGroups() {
      hoaGroupsCount = GetCountOrIndex();
      for (int i = 0; i < hoaGroupsCount; i++) {
        hoaGroupId = GetID();
        hoaGroupHasRegion;                                          1            bslbf
        if (hoaGroupHasRegion) {
          hoaGroupRegionId = GetID();
        }
        coSourceCount = GetCountOrIndex();
        for (int j = 0; j < coSourceCount; j++) {
          coSourceId = GetID();
        }
        FreqBandConfig();
        exteriorRenderingProjectionRadius = GetDistance(isSmallScene);
        hoaGroupHasInformedSources;                                 1            bslbf
        if (hoaGroupHasInformedSources == True) {
          informedSourceCount;                                      16           uimsbf
          maxSimulInformedSources;                                  8            uimsbf
          for (j = 0; j < informedSourceCount; j++) {
            InformedSourceInfoStruct()
          }
          adaptiveExteriorRenderingProjectionRadius;                1            bslbf
        }
        sourcesOnProjectionSphere;                                  1            bslbf
        hoaGroupHasLowProfileConfig;                                1            bslbf
        if (hoaGroupHasLowProfileConfig == True) {
          LowProfileConfig()
        }
      }
    }

[0221] sourcesOnProjectionSphere Indicates whether to render the scene with the assumption that sound sources are at a specific distance (exteriorRenderingProjectionRadius) from the HOA source.

[0222] The following describes exterior rendering processing according to the invention; the changes are in the “Adjusting spatial metadata” section. The main changes are that, when sourcesOnProjectionSphere=true, no diffuseness modification is done and no directional weighting is applied. Also, the energy parameter is modified.

[0223] In these embodiments the Spatial metadata is interpolated in the same manner as described above, for the non-exterior rendering case (except with the interpolation weights calculated as directly above).

[0224] After interpolation the spatial metadata is adjusted as follows (as indicated in Fig.6b and the right side of Fig.3):

[0225] First a DOA vector is calculated (for each frequency band) from the interpolated spatial metadata:
$$u_{DOA} = \begin{bmatrix} -\sin(\theta) \cos(\varphi) \\ \sin(\varphi) \\ -\cos(\theta) \cos(\varphi) \end{bmatrix}$$

[0226] where $\theta$ and $\varphi$ are the interpolated azimuth and elevation parameters. Note that the audio frame (j) and frequency band (b) indices are left out here for clarity. A weighted DOA vector 212 is also calculated:
$$\tilde{u}_{DOA} = u_{DOA}\, R_{proj}$$

[0227] where $R_{proj}$ 214 is the exterior projection radius. A modified DOA vector 216 is calculated:
$$u_{DOA,mod} = p_{L,proj} + \tilde{u}_{DOA} - p_L$$

[0228] A unit length version is also calculated:
$$\hat{u}_{DOA,mod} = \frac{u_{DOA,mod}}{|u_{DOA,mod}|}$$

[0229] When sourcesOnProjectionSphere is false, a modified direct-to-total energy ratio is also calculated:, „ ^OO / l,modLrmod= min (r, — - - )Kpro j

[0230] When sourcesOnProjectionSphere is false, a directional weighting is applied to the modified parameters to modify parameters more heavily from sound sources outside the capturing region, while applying less modifications to sources inside the capturing region:12Wdir 2^^ ”1” ^L,proj ’ ^DO / 1)

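[0231] where $n_{L,proj}$ is a “listener normal”, which is a vector pointing away from the HOA source area. The weighted, modified DOA vector and direct-to-total energy ratio are then:
$$u_{DOA,mod,weighted} = w_{dir}\, \hat{u}_{DOA,mod} + (1 - w_{dir})\, u_{DOA}$$
$$r_{mod,weighted} = \min\big(1,\ w_{dir}\, r_{mod} + (1 - w_{dir})\, r\big)$$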

[0232] When sourcesOnProjectionSphere is false, modified spatial metadata values are then obtained from $u_{DOA,mod,weighted}$ and processing is continued as in the non-exterior rendering case:

[0233] Azimuth:
$$\theta_{new} = \operatorname{atan2}\big(-u_{DOA,mod,weighted,1},\ -u_{DOA,mod,weighted,3}\big)$$

[0234] Elevation:
$$\varphi_{new} = \operatorname{atan2}\Big(u_{DOA,mod,weighted,2},\ \sqrt{(u_{DOA,mod,weighted,1})^2 + (u_{DOA,mod,weighted,3})^2}\Big)$$

[0235] When sourcesOnProjectionSphere is true, the following is used:

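[0236] Azimuth:
$$\theta_{new} = \operatorname{atan2}\big(-u_{DOA,mod,1},\ -u_{DOA,mod,3}\big)$$

[0237] Elevation:
$$\varphi_{new} = \operatorname{atan2}\Big(u_{DOA,mod,2},\ \sqrt{(u_{DOA,mod,1})^2 + (u_{DOA,mod,3})^2}\Big)$$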

[0238] When sourcesOnProjectionSphere is false, the direct-to-total energy ratio is the modified, weighted value:
$$r_{new} = r_{mod,weighted}$$

[0239] When sourcesOnProjectionSphere is true, the original, unmodified direct-to-total energy ratio is used as is:
$$r_{new} = r$$

[0240] When sourcesOnProjectionSphere is true, the modified energy is calculated as followst'-pro j^new &min(dT, |uDOAmod|)

[0241] where $d_T$ is a minimum distance threshold to limit the increase in energy as the listener moves closer to the sphere defined by the exterior projection radius.
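The direction adjustment used when sourcesOnProjectionSphere is true ([0225] to [0228], [0236] and [0237]) can be sketched as follows. The sketch assumes that positions are three-dimensional and expressed in the same coordinate frame as the DOA vector, and it computes the angles with the signs that invert the vector form of [0225]; the function name is hypothetical and the energy modification of [0240] is not covered.

```python
import math

def adjust_direction(azimuth, elevation, r_proj, p_proj, p_listener):
    """Re-aim an interpolated direction for a listener outside the source area.

    azimuth, elevation: interpolated direction at the projected position (radians)
    r_proj:             exterior projection radius
    p_proj, p_listener: (x, y, z) projected and actual listener positions
    Returns (new_azimuth, new_elevation, distance), where distance is the
    length of the modified DOA vector.
    """
    # Unit DOA vector from the interpolated angles, [0225].
    u = (
        -math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        -math.cos(azimuth) * math.cos(elevation),
    )
    # Place the source on the projection sphere and view it from the listener, [0227].
    u_mod = tuple(pp + r_proj * un - pl for pp, un, pl in zip(p_proj, u, p_listener))
    distance = math.sqrt(sum(c * c for c in u_mod))
    new_azimuth = math.atan2(-u_mod[0], -u_mod[2])
    new_elevation = math.atan2(u_mod[1], math.hypot(u_mod[0], u_mod[2]))
    return new_azimuth, new_elevation, distance
```

When the listener coincides with the projected position the direction is returned unchanged and the distance equals the projection radius, so the adjustment only takes effect once the listener has left the source area.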

[0242] An example system employing some embodiments is described in Fig.7. The figure illustrates an end to end system overview for an audio scene comprising multiple HOA sources, which is rendered according to the above examples. The renderer receives the scene description and audio bitstreams and performs rendering accordingly.

[0243] The system can comprise a content creator 701 which can be implemented on any suitable computer or processing device. The content creator 701 comprises an (MPEG-I Immersive audio) encoder 703 which is configured to receive the audio scene description 700 and the audio signals or data 702. The audio scene description 700 can be provided in the MPEG-I Immersive audio Encoder Input Format (EIF) or in another suitable format. Generally, the audio scene description contains an acoustically relevant description of the contents of the audio scene, and contains, for example, the scene geometry as a mesh or voxel, acoustic materials, acoustic environments with reverberation parameters, positions of sound sources, and other audio element related parameters such as whether reverberation is to be rendered for an audio element or not. The content creator 701 furthermore can be configured to define or set the parameters (exteriorRenderingProjectionRadius, sourcesOnProjectionSphere) based on the scene being recorded. For the situation in which these embodiments are particularly beneficial (a single HOA source, with sound sources at the same distance from the microphone used to record the scene), the content creator 701 can be configured to set sourcesOnProjectionSphere=true, and exteriorRenderingProjectionRadius should be set to the distance of the sound sources.

[0244] The MPEG-I Immersive audio encoder 703 is configured to output encoded data 704.

[0245] The content creator 701 furthermore in some embodiments comprises an MPEG-H encoder 705 which generates an MPEG-H audio bitstream 706.

[0246] The MPEG-H audio bitstream 706 and 6DoF audio bitstream 704 in some embodiments can be streamed to end-user devices or made available for download or stored.

[0247] Additionally the system comprises a server 711 configured to obtain the bitstreams, and store them (in bitstream storage 713) and supply them (for example as a six degrees of freedom (6-DoF) audio bitstream 712) to the player 721.

[0248] The relevant bitstream 712 is retrieved by the player 721. In some embodiments other implementation options are feasible such as broadcast, multicast.

[0249] The player 721 in some embodiments comprises a playback device 723 configured to obtain or receive the 6DoF audio bitstream 712, and furthermore can be configured to receive or otherwise obtain the 6DoF tracking information (listener orientation or position information) 726 from a suitable listener user interface, for example from the head mounted device (HMD) 729. These can for example be generated by sensors within the HMD 729 or from sensors in the environment sensing the orientation or position of the listener.

[0250] In some embodiments the playback device 723 comprises a bitstream parser 725 configured to obtain the encoded bitstream 712 and decode these in an opposite or inverse operation to the encoders 703, 705 to generate audio and metadata 724 which can be passed to a MPEG-I Immersive audio renderer 727.

[0251] In some embodiments the playback device 723 comprises the MPEG-I Immersive audio renderer 727 configured to implement the rendering operations as described above and generate audio output signals 728 which can be output to the head mounted device 729.

[0252] The playback device 723 can be implemented in different form factors depending on the application. In some embodiments the playback device is equipped with its own listener position tracking apparatus or receives the listener position information from an external apparatus. The playback device can in some embodiments also be equipped with a headphone connector to deliver the rendered binaural audio output to the headphones.

[0253] Fig.8 shows an example flow diagram of operations of the decoder / renderer according to some embodiments.

[0254] For example, the operation of obtaining the bitstream is shown in Fig.8 by 801.

[0255] Then is shown in Fig.8 by 803 the operation of decoding or otherwise extracting the scene data comprising the information exteriorRenderingProjectionRadius defining the projection dimensions, or projection sphere radius and sourcesOnProjectionSphere defining the relationship between the audio source or sound source and the projection boundary.

[0256] Following this is shown in Fig.8 by 805 the operation of triangulating the HOA source space.

[0257] Then can be the operation as shown in Fig.8 by 807 of obtaining the user or listener position.

[0258] Having this information then is the operation, as shown in Fig.8 by 809 of calculating or otherwise determining spatial metadata at the listener position.

[0259] A further operation as shown in Fig.8 by 809, is one of determining whether the user is inside the HOA source space. In other words determining whether interior or exterior rendering is to be employed.

[0260] Where the user is inside the HOA source space, and interior rendering is to be applied then, as shown in Fig.8 by 813, there is a rendering of audio signals according to spatial metadata at the listener position.

[0261] Where the user is outside the HOA source space and exterior rendering is to be applied, a further determination of the relationship between the audio source and the projection space or projection sphere is made, as shown in Fig.8 by 815, to determine whether the audio source or sources are on the projection sphere or projection edge.

[0262] Where it is determined that sourcesOnProjectionSphere is false then, as shown in Fig.8 by 819, the spatial metadata is modified or adjusted based on exteriorRenderingProjectionRadius in the manner described formerly above (method 1) to generate new direction and ratio values.

[0263] Where it is determined that sourcesOnProjectionSphere is true then, as shown in Fig.8 by 817, the spatial metadata is modified or adjusted based on exteriorRenderingProjectionRadius in the manner described latterly above (method 2) to generate new direction and energy values.

[0264] Following the modification of the spatial metadata according to either method, audio signals are rendered according to the modified or adjusted spatial metadata at the listener position, as shown in Fig.8 by 821.
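The decision flow of Fig.8 can be summarised with the short Python sketch below, which combines a barycentric inside test with the sourcesOnProjectionSphere flag. The function names and the returned mode labels are hypothetical, the triangles are assumed to be given as triples of two-dimensional source positions, and points on a triangle edge are treated as inside, which is a choice made here for the illustration.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c) in 2-D."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b

def select_rendering(listener, triangles, sources_on_projection_sphere):
    """Choose between interior rendering and the two exterior methods.

    Returns (mode, weights): weights are the barycentric interpolation
    weights for interior rendering and None otherwise.
    """
    for tri in triangles:
        weights = barycentric(listener, *tri)
        if all(w >= 0.0 for w in weights):
            return "interior", weights
    if sources_on_projection_sphere:
        return "exterior_method_2", None   # new direction and energy values
    return "exterior_method_1", None       # new direction and ratio values
```

For the interior case the returned barycentric coordinates can be used directly as the interpolation weights of the three HOA sources that define the triangle.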

[0265] Figs.9a to 9c show example projections which are not uniform radius spheres, including ovals (Fig.9a), block or rectangular shapes (Fig.9b) and irregular shapes (Fig.9c). In other words the projection can be any suitable shape with any suitable boundary and is not limited to projection spheres and projection radiuses.

[0266] With respect to Fig.11 an example electronic device which may be used as the computer, encoder processor, decoder processor or any of the functional blocks described herein is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1600 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.

[0267] In some embodiments the device 1600 comprises at least one processor or central processing unit 1607. The processor 1607 can be configured to execute various program codes such as the methods such as described herein.

[0268] In some embodiments the device 1600 comprises a memory 1611. In some embodiments the at least one processor 1607 is coupled to the memory 1611. The memory 1611 can be any suitable storage means. In some embodiments the memory 1611 comprises a program code section for storing program codes implementable upon the processor 1607. Furthermore in some embodiments the memory 1611 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1607 whenever needed via the memory-processor coupling.

[0269] In some embodiments the device 1600 comprises a user interface 1605. The user interface 1605 can be coupled in some embodiments to the processor 1607. In some embodiments the processor 1607 can control the operation of the user interface 1605 and receive inputs from the user interface 1605. In some embodiments the user interface 1605 can enable a user to input commands to the device 1600, for example via a keypad. In some embodiments the user interface 1605 can enable the user to obtain information from the device 1600. For example the user interface 1605 may comprise a display configured to display information from the device 1600 to the user. The user interface 1605 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1600 and further displaying information to the user of the device 1600.

[0270] In some embodiments the device 1600 comprises an input / output port 1609. The input / output port 1609 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1607 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and / or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

[0271] The transceiver can communicate with further apparatus by any suitable known communications protocol.

[0272] The transceiver input / output port 1609 may be configured to transmit / receive the audio signals, the bitstream and in some embodiments perform the operations and methods as described above by using the processor 1607 executing suitable code.

[0273] It should be understood that the apparatuses may comprise or be coupled to other units or modules used in or for transmission and / or reception. Although the apparatuses have been described as one entity, different modules and memory may be implemented in one or more physical or logical entities.

[0274] It is noted that whilst some embodiments have been described in relation to 5G networks, similar principles can be applied in relation to other networks and communication systems. Therefore, although certain embodiments were described above by way of example with reference to certain example architectures for wireless networks, technologies and standards, embodiments may be applied to any other suitable forms of communication systems than those illustrated and described herein.

[0275] It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.

[0276] As used herein, “at least one of the following: ” and “at least one of ” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

[0277] In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

[0278] As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and / or digital circuitry); and
(b) combinations of hardware circuits and software, such as (as applicable):
(i) a combination of analog and / or digital hardware circuit(s) with software / firmware; and
(ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and
(c) hardware circuit(s) and / or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.

[0279] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and / or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.

[0280] The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or program, also called program product, including software routines, applets and / or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.

[0281] Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as DVD and the data variants thereof, CD. The physical media is a non-transitory media.

[0282] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

[0283] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.

[0284] Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

[0285] The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.

[0286] The foregoing description has provided, by way of non-limiting examples, a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. Nevertheless, all such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

Claims

1. An apparatus for generating a spatialized audio output based on a listener position, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to:
obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtain a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources;
determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region;
obtain, based on the one or more audio signal sets, metadata associated with the projected listener position;
obtain a projection shape;
obtain information based on a position of the one or more audio sources relative to the projection shape; and
estimate modified metadata for the projected listener position based on the listener position and the information.

2. The apparatus as claimed in claim 1, wherein the modified metadata comprises at least one of:
a modified energy metadata parameter; and
a modified directional metadata parameter.

3. The apparatus as claimed in any of claim 1 or 2, caused to estimate modified metadata for the projected listener position based on the listener position and the information based on the position of the one or more audio sources relative to the projection shape is further caused to:
determine at least one audio position with respect to the projected listener position, wherein the modified metadata for the projected listener position comprises a direction parameter representing a direction from the projected listener position to one of the at least one audio position;
determine spatial metadata for the listener position based on the at least one audio signal set position with respect to the projected listener position, wherein the spatial metadata comprises a spatial direction parameter representing a direction from the listener position to the one of the at least one audio position.

4. The apparatus as claimed in any of claims 1 to 3, caused to obtain one or more audio signal sets is caused to obtain the one or more audio signal sets from microphone arrangements, wherein each microphone arrangement is at a respective position and comprises one or more microphones.

5. The apparatus as claimed in any of claims 1 to 3, caused to obtain the one or more audio signal sets is caused to obtain one or more higher order ambisonics sources.

6. The apparatus as claimed in claim 5, wherein the inside regions in relation to the respective audio signal set positions for one higher order ambisonics source defines a position associated with the higher order ambisonics source.

7. The apparatus as claimed in any of claim 5 or 6, wherein the projected listener position is the position associated with the higher order ambisonics source.

8. The apparatus as claimed in any of claims 1 to 7, caused to obtain a listener position is caused to obtain the listener position from a further apparatus.

9. The apparatus as claimed in any of claims 1 to 8, caused to obtain, for the at least one of the one or more audio signal sets, metadata based on a processing of the at least one audio signals of the at least one of the one or more audio signal sets is caused to determine a directional parameter based on the processing of the at least one audio signals.

10. The apparatus as claimed in any of claims 1 to 9, caused to determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region is caused to determine the projected listener position at a location of one of:
within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions and the listener position;
within a plane or volume at least partially defined by an edge or surface linking the one of the one or more audio signal set positions within an associated inside region;
on an edge or surface defined by the one of the one or more audio signal set positions; and
at a closest of the one or more audio signal set positions.

11. The apparatus as claimed in any of the claims 1 to 10, caused to estimate modified metadata for the projected listener position based on the listener position and the information related to a relationship between a position of the one or more audio sources and the projection shape is caused to:
generate at least one interpolation weights based on the audio signal set positions and the projected listener position;
apply the at least one interpolation weights to respective audio signal set audio metadata to generate interpolated audio metadata; and
combine the interpolated audio metadata to generate the modified metadata for the projected listener position.

12. The apparatus as claimed in claim 11, caused to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape is caused to map the modified metadata based on the second listener position to a cartesian co-ordinate system.

13. The apparatus as claimed in any of claims 1 to 12, caused to obtain information based on a position of the one or more audio sources relative to the projection shape is further caused to indicate whether the position of the one or more audio sources is on the boundary of the projection shape.

14. The apparatus as claimed in claim 13, caused to estimate modified metadata for the projected listener position based on the listener position and the information based on a position of the one or more audio sources relative to the projection shape is further caused to:
estimate a modified direction of arrival and energy based on the information indicating the position of the one or more audio sources is on the projection shape;
estimate a modified direction of arrival, modified direct-to-total energy ratio and directional weighting based on the information indicating the position of the one or more audio sources is otherwise not on the projection shape.

15. The apparatus as claimed in any of claims 1 to 14, wherein the projection shape is one of:
a projection sphere having a defined projection radius boundary;
a regular projection shape having a regular defined projection boundary; and
an irregular projection shape having an arbitrary projection boundary.

16. The apparatus as claimed in any of claims 1 to 15, caused to obtain the projection shape is caused to obtain a parameter defining at least one of:
the projection shape;
a radius associated with the projection shape; and
at least one dimension associated with the projection shape.

17. The apparatus as claimed in any of claims 1 to 16, caused to obtain information is caused to obtain a scene parameter indicating whether the one or more audio source is positioned at an edge or boundary of the projection shape.

18. An apparatus for assisting generation of a spatialized audio output based on a listener position, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to:
obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtain for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources;
obtain a projection shape;
obtain information based on a position of the one or more audio sources relative to the projection shape; and
transmit to a further apparatus the one or more audio signal sets and the information.

19. The apparatus as claimed in claim 18, wherein the projection shape is one of:
a projection sphere having a defined projection radius boundary;
a regular projection shape having a regular defined projection boundary; and
an irregular projection shape having an arbitrary projection boundary.

20. The apparatus as claimed in any of claims 18 or 19, caused to obtain the projection shape is caused to obtain a parameter defining at least one of:
the projection shape;
a radius associated with the projection shape; and
at least one dimension associated with the projection shape.

21. A method for generating a spatialized audio output based on a listener position, the method comprising at least:
obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtaining a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources;
determining, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region;
obtaining, based on the one or more audio signal sets, metadata associated with the projected listener position;
obtaining a projection shape;
obtaining information based on a position of the one or more audio sources relative to the projection shape; and
estimating modified metadata for the projected listener position based on the listener position and the information.

22. A method for assisting generation of a spatialized audio output based on a listener position, the method comprising at least:
obtaining one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtaining for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources;
obtaining a projection shape;
obtaining information based on a position of the one or more audio sources relative to the projection shape; and
transmitting to a further apparatus the one or more audio signal sets and the information.

23. An apparatus for generating a spatialized audio output based on a listener position, the apparatus comprising means configured to:
obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtain a listener position within an audio scene, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions, wherein the inside region is defined by the respective audio signal set positions and the audio scene comprises one or more audio sources;
determine, for the listener position within an audio environment outside the inside region, a projected listener position based on a geometry of the inside region;
obtain, based on the one or more audio signal sets, metadata associated with the projected listener position;
obtain a projection shape;
obtain information based on a position of the one or more audio sources relative to the projection shape; and
estimate modified metadata for the projected listener position based on the listener position and the information.

24. An apparatus for assisting generation of a spatialized audio output based on a listener position, the apparatus comprising means configured to:
obtain one or more audio signal sets, wherein each of the one or more audio signal sets is associated with a respective audio signal set position;
obtain for an audio scene at least one position associated with one or more audio source, wherein the audio scene comprises one or more areas having one or more inside and outside regions in relation to the respective audio signal set positions and the audio scene comprises the one or more audio sources;
obtain a projection shape;
obtain information based on a position of the one or more audio sources relative to the projection shape; and
transmit to a further apparatus the one or more audio signal sets and the information.
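
Illustrative sketch (not part of the claims). By way of a non-limiting illustration only, some of the operations recited in claims 3, 10, 11, 13 and 15 may be sketched as follows. The sketch assumes a spherical projection shape, a known audio source position, projection to the closest audio signal set position, and normalized inverse-distance interpolation weights. None of these choices is prescribed by the claims, and all function and parameter names (closest_position, interpolation_weights, interpolate_energy, on_sphere_boundary, azimuth_deg, eps, tol) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def closest_position(listener: Vec, set_positions: Sequence[Vec]) -> Vec:
    """Project the listener to the closest audio signal set position
    (one of the options listed in claim 10)."""
    return min(set_positions, key=lambda p: math.dist(p, listener))


def interpolation_weights(projected: Vec, set_positions: Sequence[Vec],
                          eps: float = 1e-9) -> List[float]:
    """ASSUMPTION: normalized inverse-distance weights. The claims only
    require weights based on the set positions and the projected position."""
    inverse = [1.0 / (math.dist(p, projected) + eps) for p in set_positions]
    total = sum(inverse)
    return [w / total for w in inverse]


def interpolate_energy(weights: Sequence[float],
                       energies: Sequence[float]) -> float:
    """Apply the weights to per-set energy metadata and combine them."""
    return sum(w * e for w, e in zip(weights, energies))


def on_sphere_boundary(source: Vec, centre: Vec, radius: float,
                       tol: float = 1e-6) -> bool:
    """ASSUMPTION: spherical projection shape. True if the source lies on
    the sphere surface to within the tolerance tol."""
    return abs(math.dist(source, centre) - radius) <= tol


def azimuth_deg(from_pos: Vec, to_pos: Vec) -> float:
    """Horizontal direction, in degrees, from from_pos towards to_pos."""
    return math.degrees(math.atan2(to_pos[1] - from_pos[1],
                                   to_pos[0] - from_pos[0]))


# Hypothetical scene: three capture positions, a listener outside them.
set_positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
listener = (3.0, 0.0, 0.0)
source = (2.0, 2.0, 0.0)

projected = closest_position(listener, set_positions)
weights = interpolation_weights(projected, set_positions)
# The direction seen from the projected position differs from the direction
# seen from the actual listener position, which is why it is re-estimated.
direction_at_projection = azimuth_deg(projected, source)
direction_at_listener = azimuth_deg(listener, source)
```

In this hypothetical scene the listener is projected onto the capture position at (2, 0, 0), and the source direction changes between the projected position and the actual listener position. That change is the kind of mismatch the modified directional metadata is intended to correct.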