Spatial audio rendering

The method addresses spatial audio rendering challenges by recalculating source geometries and beamforming coefficients to adapt to moving audio sources, ensuring accurate audio playback for listeners with six degrees of freedom.

WO2026131023A1 · PCT designated stage · Publication Date: 2026-06-25 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2025-11-26
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing spatial audio rendering technologies struggle to accurately account for the movement of multiple audio sources within a listening space, leading to localization errors and incorrect audio rendering when listeners move with six degrees of freedom.

Method used

A method and apparatus that detect movement of audio sources within a scene, recalculating source geometries to de-emphasize moving sources and re-emphasize stationary ones, using pre-calculated geometries and beamforming coefficients to ensure accurate rendering based on listener position.

Benefits of technology

Enables accurate spatial audio rendering with six degrees of freedom by dynamically adapting to audio source movements, reducing localization errors and ensuring correct audio playback.

✦ Generated by Eureka AI based on patent content.


Abstract

Examples of the disclosure relate to spatial audio rendering that allows for six degrees of freedom of movement of a listener. In examples a source geometry of an audio scene is obtained. Respective sections of the source geometry of the audio scene comprise multiple audio sources. When movement of at least one of the multiple audio sources in the audio scene is detected an updated source geometry for the audio scene is obtained. The updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected. The audio scene is rendered based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

Description

[0001] TITLE

[0002] Spatial Audio Rendering

[0003] TECHNOLOGICAL FIELD

[0004] Examples of the disclosure relate to spatial audio rendering. Some relate to spatial audio rendering that allows for six degrees of freedom of movement of a listener.

[0005] BACKGROUND

[0006] Audio signal sets, such as Higher Order Ambisonics (HOAs), can be used to enable spatial rendering of a listening space. This can enable a listener to perceive accurate spatial aspects of audio scenes within the listening space. The position of the listener relative to audio sources within the listening space is used to provide the appropriate spatial aspects. If one or more of the audio sources moves, this has to be accounted for.

[0007] BRIEF SUMMARY

[0008] According to various, but not necessarily all, examples of the disclosure there is provided an apparatus for six degrees of freedom audio rendering comprising:

[0009] at least one processor;

[0010] and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform:

[0011] obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;

[0012] detecting movement of at least one of the multiple audio sources in the audio scene; obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and

[0013] enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

[0014] An audio source which has been at least partially de-emphasized might not be used for the updated source geometry.

[0015] If it is detected that multiple audio sources are moving then audio sources that are moving the most can be de-emphasized.

[0016] The processor and memory may also be arranged to cause the apparatus to perform: detecting that the at least one of the multiple audio sources that have been at least partially de-emphasized have stopped moving;

[0017] obtaining a further source geometry wherein the further source geometry re-emphasizes the at least one of the multiple audio sources that have stopped moving; and

[0018] enabling updated rendering of the audio scene based on the further source geometry and the listener position after the at least one of the multiple audio sources that were at least partially de-emphasized have stopped moving.

[0019] The re-emphasized audio sources may be used for the further source geometry.

[0020] At least one of the multiple audio sources for which movement is detected may be located within a section of the source geometry corresponding to the listener position.

[0021] Detecting movement of at least one of the multiple audio sources in the audio scene may comprise obtaining advance information of the movement of at least one of the multiple audio sources.

[0022] The processor and memory may also be arranged to cause the apparatus to perform:

[0023] using the advance information of the movement of at least one of the multiple audio sources to obtain the updated source geometry of the audio scene; and

[0024] enabling rendering of the audio scene based on the updated source geometry and the listener position prior to movement of at least one of the multiple audio sources.

[0025] The processor and memory may also be arranged to cause the apparatus to perform:

[0026] using the advance information of the movement of at least one of the multiple audio sources to obtain the further source geometry of the audio scene; and

[0027] enabling updated rendering of the audio scene based on the further source geometry and the listener position prior to the at least one of the multiple audio sources that have been at least partially de-emphasized stopping movement.

[0028] The respective source geometries may be precalculated based on the advance information.

[0029] The processor and memory may also be arranged to cause the apparatus to perform obtaining initial source information.

[0030] The initial source information may comprise one or more informed source beamforming directions.

[0031] The source geometries may comprise a triangulation. The processor and memory may also be arranged to cause the apparatus to perform applying a cross fade for a transition between respective source geometries.

[0032] The rendering may comprise using a first audio source in a section of the source geometry for signal interpolation and the first audio source and two other audio sources for spatial metadata interpolation.

[0033] Movement of the one or more audio sources may comprise at least one of:

[0034] change in location;

[0035] change in orientation.

[0036] The one or more audio sources may comprise higher order ambisonics sources.

[0037] According to various, but not necessarily all, examples of the disclosure there is provided a method comprising:

[0038] obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;

[0039] detecting movement of at least one of the multiple audio sources in the audio scene; obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and

[0040] enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

[0041] According to various, but not necessarily all, examples of the disclosure there is provided a computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform:

[0042] obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;

[0043] detecting movement of at least one of the multiple audio sources in the audio scene; obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and

[0044] enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

According to various, but not necessarily all, embodiments there is provided an apparatus comprising:

[0045] at least one processor; and

[0046] at least one memory including computer program code;

[0047] the at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus to perform at least a part of one or more methods described herein.

[0048] According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for performing at least part of one or more methods described herein. The description of a function and / or action should additionally be considered to also disclose any means suitable for performing that function and / or action. Functions and / or actions described herein can be performed in any suitable way using any suitable method.

[0049] According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.

[0050] While the above examples of the disclosure and optional features are described separately, it is to be understood that their provision in all possible combinations and permutations is contained within the disclosure. It is to be understood that various examples of the disclosure can comprise any or all the features described in respect of other examples of the disclosure, and vice versa. Also, it is to be appreciated that any one or more or all the features, in any combination, may be implemented by / comprised in / performable by an apparatus, a method, and / or computer program instructions as desired, and as appropriate. The description of a function should additionally be considered to also disclose any means suitable for performing that function.

[0051] BRIEF DESCRIPTION

[0052] Some examples will now be described with reference to the accompanying drawings in which:

FIGS. 1A and 1B show an audio scene and a source geometry for an audio scene;

[0053] FIGS. 2A and 2B show a moving audio source;

[0054] FIG. 3 shows a method;

[0055] FIGS. 4A to 4C show movement of an audio source;

[0056] FIGS. 5A to 5C show movement of an audio source;

[0057] FIG. 6 shows an example immersive audio renderer;

[0058] FIG. 7 shows an example immersive audio renderer;

[0059] FIG. 8 shows a method;

[0060] FIGS. 9A and 9B show a method and example triangulations;

FIG. 10 shows a system; and

[0061] FIG. 11 shows a controller.

[0062] The figures are not necessarily to scale. Certain features and views of the figures can be shown schematically or exaggerated in scale in the interest of clarity and conciseness. For example, the dimensions of some elements in the figures can be exaggerated relative to other elements to aid explication. Corresponding reference numerals are used in the figures to designate corresponding features. For clarity, all reference numerals are not necessarily displayed in all figures.

[0063] DETAILED DESCRIPTION

[0064] Fig. 1A shows an audio scene 100 that can be used in examples of the disclosure.

[0065] The audio scene 100 comprises multiple audio sources 102. In this example the audio sources comprise Higher Order Ambisonics (HOA) sources. Other types of audio sources could be used in other examples.

[0066] The audio scene 100 in Fig. 1A comprises five audio sources 102. Other numbers of audio sources 102 could be used in other examples of the disclosure. Each of the audio sources 102 has a position and orientation within the audio scene 100.

[0067] The audio scene 100 could represent a virtual reality environment. For example, it could be a gaming environment or any other suitable type of environment. The audio sources 102 within the audio scene could comprise recordings and / or synthesized signals. In the example of Fig. 1A the microphones that capture the audio signals and provide the audio sources 102 are provided at positions m1 to m5.

[0068] A listener 104 is positioned within the audio scene 100. The listener 104 can move freely within the audio scene 100. The listener 104 can move with six degrees of freedom (6DoF) within the audio scene 100. That is, the listener 104 can change both their position and orientation within the audio scene 100. In this example the listener 104 is at position pt.

[0069] An example of an audio scene 100 could be a recording of multiple musicians playing on a busy street. There are people around the musicians as well as people walking on the street. Some people on the street can be listening to the musicians and can contribute to the audio scene 100 by applauding or talking. The music from the musicians is recorded using multiple microphones, such as Ambisonics microphones, placed some distance apart. The positions of at least some of the musicians are known. The audio scene 100 is captured and then subsequently turned into a mediated reality scene by a content creator. The content creator creates a scene description with the positions of the recording microphones and any informed sources. An informed source is a sound emitting entity for which information is available. The available information for an informed source can be the position and orientation of the sound emitting entity but does not need to comprise the audio signal. The position and / or orientation of the informed sources can be known relative to the positions of the microphones. The scene description is then encoded into a bitstream along with the audio and is provided to the listener 104 for consumption as spatial audio.

[0070] For 6DoF rendering, a renderer can provide a spatial audio signal, such as a binaural signal, for the listener position 104. The renderer can provide the spatial audio signal for a non-sampled position of the audio scene 100.

[0071] In order to enable 6DoF rendering for an audio scene 100 with multiple audio sources 102 the audio data, or source signals, for all of the audio sources 102 needed for the rendering is required.

[0072] The audio sources 102 needed for the rendering for a particular position of the listener 104 can be determined from a source geometry of the audio scene 100. The source geometry determines the positions of the audio sources 102 within the audio scene 100. The source geometry can partition the audio scene 100 into sections. The sections can be determined such that respective sections comprise the audio sources 102 that are needed for rendering when the listener 104 is positioned within the section.

[0073] Fig. 1B schematically shows an example source geometry for the audio scene 100. In this example the source geometry comprises a triangulation. The triangulation partitions the audio scene 100 into triangular sections. The vertices of the respective triangular sections are defined by the positions of the audio sources 102. When a listener 104 is positioned in a triangular section the audio sources 102 needed for the rendering are the audio sources 102 provided at the vertices of the triangular section.
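The section lookup described above can be sketched as follows. This is a minimal illustration only, assuming a two-dimensional scene; the function names, the source layout and the list of triangles are hypothetical and are not taken from the disclosure.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[int, int, int]  # indices into the list of source positions


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); its sign gives the side of o->a that b is on."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def contains(tri: Triangle, positions: Sequence[Point], p: Point) -> bool:
    """True if p lies inside the triangle or on its boundary."""
    a, b, c = (positions[i] for i in tri)
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)  # same side of all three edges


def sources_for_listener(
    triangles: Sequence[Triangle], positions: Sequence[Point], listener: Point
) -> Optional[Triangle]:
    """Return the vertex sources of the section containing the listener, if any."""
    for tri in triangles:
        if contains(tri, positions, listener):
            return tri
    return None


# Hypothetical layout: four corner sources and one central source.
POSITIONS = [(0.0, 0.0), (4.0, 0.0), (2.0, 2.0), (0.0, 4.0), (4.0, 4.0)]
TRIANGLES = [(0, 1, 2), (0, 2, 3), (1, 2, 4), (2, 3, 4)]

print(sources_for_listener(TRIANGLES, POSITIONS, (2.0, 0.5)))  # prints (0, 1, 2)
```

A listener outside every section yields `None` in this sketch; how a renderer handles that case is not covered here.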

[0074] Figs. 2A and 2B show a moving audio source within the audio scene 100. In this example the third audio source 102_3 moves from a central position of the audio scene 100 to a position towards the left side of the audio scene 100 as indicated by the arrow 200.

[0075] Fig. 2A shows what happens if the source geometry is not recalculated to account for the movement of the audio source 102_3. In this example the source geometry is invalid. If the beamformer parameters are not recalculated after the audio source 102_3 moves then the beamformer points in the wrong direction and does not capture the informed source at the correct position. This will cause wrong audio to be rendered if the listener 104 is at the location of the third audio source 102_3 which leads to localization errors.

[0076] Fig. 2B shows the source geometry being recalculated. In this case the beamformer parameters are recalculated after the audio source 102_3 moves so the beamformer points in the correct direction. The recalculated beamformer captures the informed source at the correct position. This enables the correct audio to be rendered if the listener 104 is at the location of the third audio source 102_3.

[0077] When an audio source 102 is moving this needs to be accounted for in the processing and rendering of the audio scene. Examples of the disclosure provide methods for accounting for the movement of one or more audio sources.

[0078] Fig. 3 shows a method that can be implemented in examples of the disclosure. The method could be implemented by an apparatus such as a controller. The apparatus could be provided in an encoder or any other suitable device or entity.

[0079] At block 300 the method comprises obtaining a source geometry of an audio scene 100 wherein respective sections of the source geometry of the audio scene 100 comprise multiple audio sources 102. The source geometry can comprise one or more sections.

[0080] The source geometry that is obtained at block 300 can be used for an initial rendering of the audio scene 100. The source geometry and rendering can be changed when an audio source 102 starts moving. The source geometry that is obtained at block 300 can be an initial source geometry.

[0081] The source geometry could comprise a triangulation or any other partitioning of the audio scene 100. The triangulation is the division of the audio scene 100 into triangular sections with an audio source at each vertex of each triangle.

[0082] At block 302 the method comprises detecting movement of at least one of the multiple audio sources 102 in the audio scene 100. The movement of the one or more audio sources 102 can comprise a change in location and / or orientation of one or more of the audio sources 102.

[0083] At least one of the multiple audio sources for which movement is detected can be located within a section of the source geometry corresponding to the listener position. Detecting movement of at least one of the multiple audio sources 102 in the audio scene 100 can comprise obtaining advance information of the movement of at least one of the multiple audio sources 102. The advance information could be provided in the bitstream or by any other suitable means. This can enable the respective source geometries to be pre-calculated. The precalculation can occur during initialising of the system or during any other suitable time. The precalculation can occur before the rendering of the audio. The pre-calculation does not occur in real time with the rendering and playback of the audio.

[0084] When advance information of the movement of at least one of the multiple audio sources 102 is obtained the advance information can be used to obtain the updated source geometry of the audio scene 100 and enable rendering of the audio scene 100 based on the updated source geometry and the listener position prior to movement of at least one of the multiple audio sources 102.

[0085] Similarly, the advance information of the movement of at least one of the multiple audio sources 102 can be used to obtain the further source geometry of the audio scene 100 and to enable updated rendering of the audio scene based on the further source geometry and the listener position. The obtaining of the further source geometry and the updated rendering can occur prior to the stopping of the movement of the audio sources that have been at least partially de-emphasized.

[0086] At block 304 the method comprises obtaining an updated source geometry for the audio scene 100 wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources 102 for which movement is detected.

[0087] In the updated source geometry an audio source which has been at least partially de-emphasized is not used.

[0088] If it is detected that multiple audio sources 102 are moving then audio sources 102 that are moving the most are de-emphasized. For example, audio sources that are moving with a speed or rate of change that is above a threshold could be de-emphasized while no change would be applied to the audio sources that are moving with a speed or rate of change that is below the threshold.
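The threshold test described above might look like the following sketch. It assumes that source positions are sampled at a known interval and that only positional speed is compared against the threshold; the function name, the identifiers and the threshold value are hypothetical. A rate of change of orientation could be tested in the same way.

```python
import math
from typing import Dict, Set, Tuple

Position = Tuple[float, float, float]


def sources_to_deemphasize(
    previous: Dict[str, Position],
    current: Dict[str, Position],
    dt_s: float,
    speed_threshold_m_s: float,
) -> Set[str]:
    """Return the ids of sources whose speed exceeds the threshold."""
    selected = set()
    for source_id, pos_now in current.items():
        speed = math.dist(previous[source_id], pos_now) / dt_s
        if speed > speed_threshold_m_s:
            selected.add(source_id)
    return selected


# Hypothetical positions sampled 0.1 s apart.
before = {"s1": (0.0, 0.0, 0.0), "s2": (1.0, 0.0, 0.0), "s3": (2.0, 0.0, 0.0)}
after = {"s1": (0.0, 0.0, 0.0), "s2": (1.01, 0.0, 0.0), "s3": (2.5, 0.0, 0.0)}

print(sources_to_deemphasize(before, after, dt_s=0.1, speed_threshold_m_s=0.5))
```

In this example only the third source moves fast enough to be selected; the slowly drifting second source is left unchanged.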

[0089] At block 306 the method comprises enabling rendering of the audio scene 100 based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources 102. In the rendering the updated source geometry can be used in place of the source geometry that was used before the movement of the audio source 102 was detected. The rendering can comprise using a first audio source in a section of the source geometry for signal interpolation and the first audio source and two other audio sources for spatial metadata interpolation.

[0090] Fig. 3 shows an example method that can be implemented when it is detected that an audio source 102 has started moving. In examples of the disclosure the source geometry can also be adapted when it is detected that the moving audio source 102 has stopped moving. In such cases the method could comprise detecting that the at least one of the multiple audio sources 102 that have been at least partially de-emphasized have stopped moving. In some examples the audio source 102 could still be moving but could have slowed down so that the speed or rate of change is now below a threshold. The method could then comprise obtaining a further source geometry wherein the further source geometry re-emphasizes the at least one of the multiple audio sources 102 that have stopped moving and enabling updated rendering of the audio scene 100 based on the further source geometry and the listener position after the at least one of the multiple audio sources 102 that were at least partially de-emphasized have stopped moving.

[0091] In the further source geometry an audio source which has been re-emphasized is used.

[0092] In some examples the method can comprise additional blocks that are not shown in Fig. 3. For example, the method could also comprise obtaining initial source information. The initial source information comprises one or more informed source beamforming directions. The initial source information can be valid for the time period before one or more of the audio sources 102 are detected to be moving.

[0093] In some examples a cross-fade for a transition between respective source geometries can be applied.
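One possible form of such a cross-fade is sketched below. The disclosure does not specify the shape of the fade; a linear fade over one block of samples is assumed here, and the function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def crossfade(old_block: Sequence[float], new_block: Sequence[float]) -> List[float]:
    """Linearly fade from audio rendered with the old geometry to the new one."""
    if len(old_block) != len(new_block):
        raise ValueError("blocks must have the same length")
    n = len(old_block)
    out = []
    for k in range(n):
        t = k / (n - 1) if n > 1 else 1.0  # fade position, 0.0 at start to 1.0 at end
        out.append((1.0 - t) * old_block[k] + t * new_block[k])
    return out


# Hypothetical three-sample blocks rendered with each geometry.
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # prints [1.0, 0.5, 0.0]
```

A linear fade keeps the level constant when the two renderings are strongly correlated; other fade shapes could equally be used.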

[0094] Figs. 4A to 4C show movement of an audio source 102 and how this can affect the source geometries.

[0095] Fig. 4A shows an audio scene 100 comprising five audio sources 102 and an informed source 400. An informed source is a sound emitting entity for which information is available. The available information for an informed source can be the position and orientation of the sound emitting entity but does not need to comprise the audio signal. Fig. 4A shows the determining of the closest audio source 102 for the informed source 400 and determining beamforming coefficients for beamformers pointing from the audio sources towards the informed sources. The closest audio source 102 and beamforming coefficients can be determined for each of the informed sources 400 in the audio scene.

[0096] In Fig. 4B one of the audio sources 102_3 has moved. The audio source 102_3 has changed position. This means that the precalculated vector pointing from the audio source 102_3 to the informed source 400 does not point towards the informed source 400 anymore. Any beamforming coefficients calculated based on this vector would be incorrect.

[0097] Fig. 4C shows the recalculation of the vector pointing from the audio source 102_3 in the new position towards the informed source 400. This new vector can be used to calculate beamforming coefficients.

[0098] In examples of the disclosure, beamforming coefficients calculated from the initial source positions as shown in Fig. 4A would initially be used. When the audio source 102_3 is moving the audio source 102_3 would be de-emphasized. When the audio source 102_3 has finished moving the beamforming coefficients calculated from the final source positions as shown in Fig. 4C would be used.

[0099] Figs. 5A to 5C show movement of an audio source 102_3 and how this can affect the source geometry.

[0100] Fig. 5A shows an audio scene 100 comprising five audio sources 102. At this point none of the audio sources 102 are moving. The triangulation for the audio scene is determined based on the current positions of the audio sources 102.

[0101] In Fig. 5B one of the audio sources 102_3 is moving as indicated by the arrow 500. The audio source 102_3 has changed position. When the audio source 102_3 is moving it is de-emphasized so that it is not included in the triangulation of the audio scene 100. Therefore in Fig. 5B the triangulation is recalculated and the moving audio source 102_3 is not used. This provides an intermediate source geometry that can be used while the audio source 102_3 is moving between locations.

[0102] In Fig. 5C the audio source 102_3 has stopped moving and is now located at a new position. The triangulation is recalculated and the audio source 102_3 that moved can now be re-emphasized. In this case the audio source 102_3 that moved can be included back in the triangulation but now the triangulation is based on the new position. This provides a new source geometry that can be used while the audio source 102_3 is at the new location.

[0103] Fig. 6 shows an example immersive audio renderer 600 that could be used in some examples of the disclosure. The immersive audio renderer 600 could be configured to implement the method of Fig. 3 or any other suitable methods. The immersive audio renderer 600 shown in Fig. 6 provides 6DoF Multi-Point HOA rendering of audio scenes as described in MPEG-I immersive audio specification (ISO / IEC 23090-4). The audio scenes that are rendered can comprise HOA / FOA content that can be recorded or synthesized. Modifications to this implementation and other implementations can be used in examples of the disclosure.

[0104] The immersive audio renderer 600 comprises an active source determiner 602. The active source determiner 602 receives information of the audio scene 604 and orientations of the audio sources 606 as inputs.

[0105] The information of the audio scene 604 can comprise the positions of the audio sources 102 within an audio scene 100. The audio sources can be HOA or FOA audio sources or any other suitable type of sources. The positions of the audio sources 102 can be denoted p1..N. The orientations of the audio sources 606 can be denoted ot.

[0106] The positions and / or orientations of the audio sources 102 can change during rendering. The active source determiner 602 is configured to monitor changes in positions and / or orientations of the audio sources 102 and determine which audio sources 102 should be used for rendering. The audio sources 102 that are to be used for rendering are the active sources. The determination of which audio sources 102 should be used for rendering is based on changes in position and / or orientation of one or more audio sources 102.

[0107] The active source determiner 602 provides a set of audio sources 608 as an output. The set of audio sources 608 can be denoted P(t). The set of audio sources 608 indicates which audio sources 102 should be used for rendering. For example, it can exclude audio sources 102 that are to be de-emphasized or can indicate which of the audio sources 102 are to be de-emphasized.

[0110] The active source determiner 602 can use any suitable process to determine which audio sources 102 should be used for rendering. In some examples dynamic updates to the position and / or orientation of one or more audio sources 102 can be made. In such examples the changes to the position and / or orientation of one or more audio sources 102 can happen at any time during rendering. The immersive audio renderer 600 does not have prior information about the timing or the trajectory of changes to the position and / or orientation of one or more audio sources 102. In such cases all audio sources can initially be comprised within a list of active sources. When a change to the position and / or orientation of one or more audio sources 102 is detected the audio sources 102 that have moved can be de-emphasized. For example, the audio sources 102 that have moved can be removed from the list of active audio sources 102.

[0111] Once an audio source 102 has been de-emphasized it continues to be de-emphasized until a threshold duration of time has expired from the last detected movement. For example, if an audio source 102 has been removed from the list of active audio sources 102 it will not be added back to the list of active audio sources 102 until the threshold time has expired. The threshold time could be one second or any other suitable duration of time.
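The hold-off behaviour described above, for the case where no prior information about movement is available, can be sketched as follows. The class and method names are hypothetical, and the one-second default follows the example threshold given in the text.

```python
from typing import Dict, Iterable, Set


class ActiveSourceTracker:
    """Tracks which sources are active, given reports of detected movement."""

    def __init__(self, source_ids: Iterable[str], hold_time_s: float = 1.0) -> None:
        self._all_sources = set(source_ids)  # all sources start as active
        self._hold_time_s = hold_time_s
        self._last_movement_s: Dict[str, float] = {}

    def report_movement(self, source_id: str, now_s: float) -> None:
        """Record that a position or orientation change was detected."""
        self._last_movement_s[source_id] = now_s

    def active_sources(self, now_s: float) -> Set[str]:
        """Sources that never moved, or whose hold time has expired."""
        return {
            sid
            for sid in self._all_sources
            if sid not in self._last_movement_s
            or now_s - self._last_movement_s[sid] >= self._hold_time_s
        }


tracker = ActiveSourceTracker(["a", "b", "c"])
tracker.report_movement("b", now_s=2.0)
print(sorted(tracker.active_sources(now_s=2.5)))  # prints ['a', 'c']
```

Each new movement report restarts the hold time, so a source that keeps moving stays de-emphasized until it has been still for the full threshold duration.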

[0112] In other examples the immersive audio renderer 600 can have prior information about the timing or the trajectory of changes to the position and / or orientation of one or more audio sources 102. For example, this information could be received in the bitstream. In this case the immersive audio renderer 600 can be aware of movement of the audio sources 102 before the movement occurs. In this example, if the active source determiner 602 detects that movement of one or more of the audio sources 102 is going to happen then the active source determiner 602 can de-emphasize the audio sources 102. The de-emphasis of the one or more audio sources 102 can occur preemptively. For example, it can occur slightly before the audio source 102 starts to move. In some examples the de-emphasis of the audio sources 102 can occur within one second of the start of the movement of the audio sources or within any other suitable time threshold. This can reduce audible artefacts for the listener 104.

[0113] In cases where the immersive audio renderer 600 has prior information about the movement of the audio sources 102 an audio source 102 can be re-emphasized immediately after it stops moving, or a short period of time after it stops moving. A short period of time could be around 100ms. For example, an audio source 102 that has been removed from the list of active audio sources 102 can be added back to the list.
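Where the timing of a movement is known in advance, the de-emphasis window could be derived as in the sketch below. The names `PlannedMove`, `lead_s` and `settle_s` are hypothetical. The 0.5 s lead is an assumed value within the one-second bound mentioned above, and the 0.1 s settle time follows the "around 100ms" example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedMove:
    """Start and stop times of a movement signalled in advance."""

    start_s: float
    stop_s: float


def is_deemphasized(
    move: PlannedMove, now_s: float, lead_s: float = 0.5, settle_s: float = 0.1
) -> bool:
    """True from lead_s before the move starts until settle_s after it stops."""
    return move.start_s - lead_s <= now_s < move.stop_s + settle_s


move = PlannedMove(start_s=10.0, stop_s=12.0)
for t in (9.0, 9.6, 11.0, 12.05, 12.2):
    print(t, is_deemphasized(move, t))
```

Because the window is a pure function of the advance information, the corresponding source geometries could be prepared before playback rather than in real time.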

[0114] The immersive audio renderer 600 also comprises a pre-processing block 610. The pre-processing block 610 receives information of the audio scene 604, head-related impulse responses (HRIRs) 612, and the set of audio sources 608 as inputs.

[0115] The pre-processing block 610 can be configured to determine a source geometry for the audio scene 100. The source geometry can comprise a partition of the audio scene into different sections, beamformer parameters and / or any other geometry or information. For example, the pre-processing block 610 can be configured to perform triangulation on the positions of the audio sources 102. The pre-processing block 610 can perform Delaunay triangulation or any other suitable type of triangulation. The triangulation that is performed by the pre-processing block 610 provides a set of triangles T1..MT 614 as an output. The set of triangles T1..MT 614 partitions the audio scene 100 into triangular sections.
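A Delaunay triangulation of a small set of source positions can be illustrated with the brute-force sketch below. It keeps every non-degenerate triple of sources whose circumcircle contains no other source. This is an illustration only, not the renderer's algorithm: it assumes a two-dimensional scene, scales poorly with the number of sources, and can return overlapping triangles when four or more sources lie on a common circle with no other source inside it. The function names and the layout are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
EPS = 1e-9


def _orient(a: Point, b: Point, c: Point) -> float:
    """Positive if a, b, c are counter-clockwise, negative if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _in_circumcircle(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if d lies strictly inside the circle through a, b and c."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = (
        (ax * ax + ay * ay) * (bx * cy - by * cx)
        - (bx * bx + by * by) * (ax * cy - ay * cx)
        + (cx * cx + cy * cy) * (ax * by - ay * bx)
    )
    if _orient(a, b, c) < 0:  # the determinant's sign flips for clockwise triples
        det = -det
    return det > EPS


def delaunay(points: Sequence[Point]) -> List[Tuple[int, int, int]]:
    """Return index triples whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if abs(_orient(a, b, c)) < EPS:
            continue  # collinear sources do not form a triangle
        others = (m for m in range(len(points)) if m not in (i, j, k))
        if any(_in_circumcircle(a, b, c, points[m]) for m in others):
            continue
        triangles.append((i, j, k))
    return triangles


# Hypothetical layout: four corner sources and one central source.
SOURCES = [(0.0, 0.0), (4.0, 0.0), (2.0, 2.0), (0.0, 4.0), (4.0, 4.0)]
print(delaunay(SOURCES))  # prints [(0, 1, 2), (0, 2, 3), (1, 2, 4), (2, 3, 4)]
```

Excluding a de-emphasized source amounts to calling the function with only the positions of the active sources.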

[0116] If informed audio sources are used then the pre-processing block 610 can also determine beamformer parameters for the informed audio sources. Any suitable process can be used to determine the beamformer parameters. As an example:

[0117] First, vectors from each informed source position $p_{S_m}$ to an audio source position $p_i$ are determined. The vectors can be as shown in Figs. 4A to 4C.

[0120] $v(i,m) = p_{S_m} - p_i$

[0121] The vectors are then rotated according to the orientation of the audio source:

[0122] $v_{rot}(i,m) = R_{HOA}(i)\, v(i,m)$

[0123] where $R_{HOA}(i)$ is the rotation matrix for the orientation of the audio source i.

[0124] The time-delay (in samples) between the pairs of audio sources and informed sources can be calculated as follows:

[0125] $d_{samples}(i,m) = \dfrac{\lVert v(i,m) \rVert}{c}\, f_s$

[0127] where c is the speed of sound and $f_s$ is the sampling frequency.

[0128] The time-delay between two audio sources for an informed source can be obtained with:

[0130] $d_{samples}(i_1,i_2,m) = d_{samples}(i_2,m) - d_{samples}(i_1,m)$

[0131] The delay in short-time Fourier transform (STFT) hop sizes can be obtained with:

[0133] $d_{hop}(i_1,i_2,m) = \mathrm{round}\!\left(\dfrac{d_{samples}(i_1,i_2,m)}{L_{hop}}\right)$

[0136] where $L_{hop}$ is the set hop length.

[0137] In addition, for each informed source, the closest audio source $i_{closest}(m)$ is determined based on the lengths of $v(i,m)$. The attenuation gain between the closest audio source $i_{closest}(m)$ and some other audio source i for informed source m can be obtained with:

[0139] $g_{att}(i,m) = \dfrac{\lVert v(i_{closest}(m),m) \rVert}{\lVert v(i,m) \rVert}$

[0141] In addition, beamforming weights $w(i_{closest}(m),m)$ that focus a beam pattern from the closest audio source towards the informed source m can be calculated:

[0143] $w(i_{closest}(m),m) = \dfrac{Y(i_{closest}(m),m)}{\left(n_{i_{closest}(m)} + 1\right)^2}$

[0146] where $Y(i_{closest}(m),m)$ is the set of real spherical harmonic functions for the pair $(i_{closest}(m),m)$ for all the orders $n \in [0, n_i(m)]$ and degrees $k \in [-n,n]$. Real spherical harmonics are defined

[0149] $Y_{nk}(\theta,\sigma) = \sqrt{\dfrac{(2n+1)\,(n-|k|)!}{(n+|k|)!}}\; P_{nk}(\cos\sigma) \begin{cases} \sqrt{2}\,\sin(|k|\theta), & \text{if } k < 0 \\ 1, & \text{if } k = 0 \\ \sqrt{2}\,\cos(|k|\theta), & \text{if } k > 0 \end{cases}$

[0152] for each audio source i and informed source m pair. The outer product of $Y(i,m)$ and $Y(i,m)^T$ can also be calculated up to order 1:

[0153] $yy(i,m) = Y(i,m)\,Y(i,m)^T, \quad \text{for } n \le 1$
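The geometric beamformer parameters above (delay in samples, delay in hops and attenuation gain) can be sketched as follows. This is a hedged illustration rather than the disclosed implementation: the function names, the default speed of sound (343 m/s), the sampling rate (48 kHz) and the hop length (128 samples) are assumptions, and the order of the difference in `hop_delay` is one possible reading of the pairwise delay.

```python
import math


def delay_samples(p_informed, p_audio, c=343.0, fs=48000.0):
    """Propagation delay in samples between an informed source and an audio source."""
    return math.dist(p_informed, p_audio) / c * fs


def hop_delay(p_informed, p_audio_1, p_audio_2, hop=128, c=343.0, fs=48000.0):
    """Delay between two audio sources for one informed source, in STFT hops.

    ASSUMPTION: the difference is taken as (source 2) - (source 1).
    Note that Python's round() rounds exact halves to the nearest even integer.
    """
    d = (delay_samples(p_informed, p_audio_2, c, fs)
         - delay_samples(p_informed, p_audio_1, c, fs))
    return round(d / hop)


def attenuation_gains(p_informed, audio_positions, eps=1e-9):
    """Return (index of closest audio source, gain per audio source)."""
    dists = [math.dist(p_informed, p) for p in audio_positions]
    closest = min(range(len(dists)), key=dists.__getitem__)
    # eps guards against an audio source coinciding with the informed source.
    gains = [dists[closest] / max(d, eps) for d in dists]
    return closest, gains
```

With an informed source at the origin and audio sources 3.43 m and 6.86 m away, the gains come out as 1 and 0.5, so the more distant audio source receives the weaker contribution.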

[0154] The pre-processing block 610 also receives the set of audio sources 608 from the active source determiner 602. This indicates which audio sources 102 should be used for rendering. The pre-processing block 610 can determine if the set of audio sources 608 has changed and, if there are any changes, the pre-processing block 610 can obtain an updated source geometry. The updated source geometry at least partially de-emphasizes audio sources 102 that are moving or are about to move. The updated source geometry can comprise triangulation and/or beamformer parameters.

[0155] To update the source geometry the pre-processing block 610 can re-calculate the values for $v(i,m)$, $d_{samples}(i,m)$, $d_{hop}(i_1,i_2,m)$, $w(i_{closest}(m),m)$ and $yy(i,m)$. The calculations used to recalculate the values can be similar to those used to obtain the original beamformer parameters as described above, except that the vectors $v(i,m)$ would be determined from informed source positions $p_{S_m}$ to active audio source positions $p_i$. That is, the vectors $v(i,m)$ would be determined for audio sources 102 that have not been de-emphasized.

[0160] The pre-processing block 610 is also configured to convert the HRIRs 612 to frequency domain head-related transfer functions (HRTFs) 616. The pre-processing block 610 can use a short-time Fourier transform to perform the conversion. For MPEG-I Audio cases, an alias-free STFT algorithm can be used. The HRTFs 616 can be used to calculate an Ambisonics-to-binaural transform matrix $M_{HOA2bin}(b)$ for each frequency band b. The HRTFs 616 can be provided as an output of the pre-processing block 610. The immersive audio renderer 600 also comprises a position pre-processing block 616. The position pre-processing block 616 is configured to determine audio sources 102 that are close to the position of the listener 104.

[0161] During rendering, for each input frame j, the position pre-processing block 616 takes as input the listener position $p_l$ 618, positions of the audio sources $p_{1..N}$ or other information of the audio scene 604, and the set of triangles $T_{1..M_T}$ 614 that were created in the pre-processing block 610. Based on these inputs, the position pre-processing block 616 determines interpolation weights $w_c(i,j)$ 620 for the audio sources $p_i$. The position pre-processing block 616 can also determine the audio source 102 that is closest to the listener 104.

[0162] To determine the interpolation weights $w_c(i,j)$, an active triangle $T_A(j)$ 622 is determined. The active triangle $T_A(j)$ can be the triangle from the set of triangles $T_{1..M_T}$ 614 that the listener 104 is in. Barycentric coordinates for the active triangle $T_A(j)$ and the listener position are then calculated. The barycentric coordinates are used as the interpolation weights $w_c(i,j)$ 620. The interpolation weights $w_c(i,j)$ 620 sum to one and the closer the listener 104 is to an audio source 102, the higher the weight for that audio source 102 will be. At an edge of a triangle, the interpolation weight for the audio source 102 that is not part of that edge is 0.
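The barycentric weighting described above can be sketched in a few lines. The function names and the tolerance are illustrative assumptions, and the sketch works in two dimensions with the triangle given by its three audio source positions.

```python
def barycentric_weights(triangle, p):
    """Barycentric coordinates of point p with respect to a 2D triangle.

    The three weights sum to one; a weight of zero means p lies on the
    edge opposite the corresponding vertex.
    """
    (x1, y1), (x2, y2), (x3, y3) = triangle
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def find_active_triangle(triangles, positions, p, eps=1e-9):
    """Return (triangle indices, weights) for the triangle containing p, if any."""
    for tri in triangles:
        weights = barycentric_weights([positions[i] for i in tri], p)
        if all(w >= -eps for w in weights):
            return tri, weights
    return None, None
```

For a triangle with vertices (0, 0), (4, 0) and (0, 4) and a listener at (1, 1), the weights are 0.5, 0.25 and 0.25, so the nearest vertex dominates as the text describes.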

[0163] The interpolation weights $w_c(i,j)$ 620 and the active triangle $T_A(j)$ 622 are provided as outputs of the position pre-processing block 616.

[0164] If the listener 104 moves from one triangle to another then the position pre-processing block 616 can switch the active triangle $T_A(j)$. In some examples the switching of the active triangle $T_A(j)$ can be delayed, and only switched after a few frames of audio. The delay of the switching can be beneficial because processing (STFT) of the audio signals representing the audio sources 102 for the new triangle takes a few frames of audio to provide meaningful output. During the delayed switch, the interpolation weights $w_c(i,j)$ are not updated until the STFT is ready. After the STFT is ready, the interpolation weights $w_c(i,j)$ are calculated for the current listener position. A crossfade can be performed between the new interpolation weights $w_c(i,j)$ and the interpolation weights $w_c(i,j)$ used during the delayed switch of the active triangle $T_A(j)$. Therefore, the active triangle $T_A(j)$ is not always the triangle that the listener 104 is in.

[0165] Any suitable process can be used to perform the crossfade. In some examples, after a delayed switch of the active triangle $T_A(j)$, the interpolation weights $w_c(i,j)$ are cross-faded over a duration of 24 frames. On the first frame, the interpolation weights $w_c(i,j)$ used during the delayed switch are used. For the second frame, the interpolation weights $w_c(i,j)$ calculated based on the current listener position are used for a small subset of the frequency bands. For the rest of the frequency bands, the interpolation weights $w_c(i,j)$ used during the delayed switch are used. For the third frame, interpolation weights $w_c(i,j)$ calculated based on the current listener position are used for more frequency bands than for the previous frame. This continues until, after 24 frames, the interpolation weights $w_c(i,j)$ are all obtained based on the listener position for all frequency bands.
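The band-by-band crossfade can be pictured with the following sketch. It is only one possible reading of the description: the function name is hypothetical, and the choice to hand over the lowest-indexed bands first and to grow the subset linearly per frame are assumptions, since the text does not say which bands switch first.

```python
def crossfade_weights(old_w, new_w, frame_idx, n_bands, n_frames=24):
    """Per-band interpolation weights during a crossfade.

    frame_idx 0 is the first frame after the delayed switch (all bands still
    use old_w); from frame_idx == n_frames onwards all bands use new_w.
    ASSUMPTION: bands are handed over in ascending order, linearly in time.
    """
    n_new = min(n_bands, max(0, frame_idx) * n_bands // n_frames)
    return [tuple(new_w) if band < n_new else tuple(old_w)
            for band in range(n_bands)]
```

With 48 bands, two additional bands switch to the new weights on each frame, and the handover completes at frame 24.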

[0166] The immersive audio renderer 600 also comprises a spatial analysis block 624. The spatial analysis block 624 can receive at least one audio signal 626 as an input. The at least one audio signal 626 can be denoted sESD(i,j). The spatial analysis block 624 is configured to provide spatial metadata for the audio sources 102. The spatial metadata can then be used by a spatial metadata interpolation block 632, to estimate spatial metadata parameters at the listener position.

[0167] The spatial analysis block 624 takes as input a frame (for example, 256 samples) of audio signals sESD(i,j) 626 for each audio source i and frame j.

[0168] The spatial analysis block 624 performs STFT processing to obtain time-frequency domain signals 628. The time-frequency domain signals 628 can be denoted S(i,j,k) where k refers to a sub-frame (for example, 128 samples). The spatial analysis block 624 calculates spatial metadata from the time-frequency domain signals 628. The spatial metadata can comprise energy parameters and directional parameters. In this example the spatial metadata can comprise the energy, direction information (azimuth and elevation) and diffuseness information (direct-to-total energy ratio). In some examples the spatial metadata can be obtained as follows:

[0169] First, a signal analysis vector $s(i,j,k,b)$ is calculated:

[0172] $s(i,j,k,b) = \begin{bmatrix} S_{b,1}(i,j,k) \cdot 1.0 \\ S_{b,2}(i,j,k) \cdot 0.5774 \\ S_{b,3}(i,j,k) \cdot 0.5774 \\ S_{b,4}(i,j,k) \cdot 0.5774 \end{bmatrix}$

[0177] where $S_{b,c}(i,j,k)$ is the value in matrix $S(i,j,k)$ corresponding to channel c and frequency bin b.

[0178] From the signal analysis vector, a signal intensity vector is calculated:

[0180] $\mathbf{i}(i,j,k,b) = \mathrm{Re}\begin{bmatrix} s_1^*(i,j,k,b)\, s_4(i,j,k,b) \\ s_1^*(i,j,k,b)\, s_2(i,j,k,b) \\ s_1^*(i,j,k,b)\, s_3(i,j,k,b) \end{bmatrix}$

[0183] where $s_c^*(i,j,k,b)$ denotes the complex conjugate of $s_c(i,j,k,b)$.

[0184] Signal energy is then calculated as follows:

[0185] $e(i,j,k,b) = \dfrac{1}{2} \sum_{c=1}^{4} s_c^*(i,j,k,b)\, s_c(i,j,k,b)$

[0188] An average intensity vector and average energy are calculated as follows:

[0189] $\tilde{e}(i,j,b) = \dfrac{1}{N_{sf}} \sum_{k=1}^{N_{sf}} e(i,j,k,b)$

[0191] $\tilde{\mathbf{i}}(i,j,b) = \dfrac{1}{N_{sf}} \sum_{k=1}^{N_{sf}} \mathbf{i}(i,j,k,b)$

[0193] The direction data (azimuth and elevation) are calculated as follows:

[0194] $\tilde{\theta}(i,j,b) = \mathrm{atan2}\big(\tilde{i}_2(i,j,b),\ \tilde{i}_1(i,j,b)\big)$

[0196] $\tilde{\varphi}(i,j,b) = \mathrm{atan2}\Big(\tilde{i}_3(i,j,b),\ \sqrt{|\tilde{i}_1(i,j,b)|^2 + |\tilde{i}_2(i,j,b)|^2}\Big)$

[0199] where $\tilde{i}_n(i,j,b)$ is the nth element of the average intensity vector $\tilde{\mathbf{i}}(i,j,b)$, $\tilde{\theta}(i,j,b)$ is the azimuth and $\tilde{\varphi}(i,j,b)$ is the elevation.

[0200] The direct-to-total energy ratio is calculated as follows:

[0201] $\tilde{r}(i,j,b) = \dfrac{\lVert \tilde{\mathbf{i}}(i,j,b) \rVert}{\tilde{e}(i,j,b)}$

[0204] The energy for the subframes k is then obtained as follows:

[0205] $e(i,j,k,b) = \tilde{e}(i,j,b), \quad k \in 1..N_{sf}$

[0206] Other processes for calculating the spatial metadata can be used in other examples.
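The example analysis above can be sketched for a single audio source, frame and frequency bin as follows. This is a rough illustration, not the disclosed implementation: the function name and the input layout are assumptions, and the factor of 1/2 in the energy is an assumption chosen so that a single plane wave yields a direct-to-total ratio of one.

```python
import math

# Channel weights taken from the signal analysis vector in the text.
CHANNEL_SCALE = (1.0, 0.5774, 0.5774, 0.5774)


def analyse_band(subframes):
    """Estimate (azimuth, elevation, ratio, energy) for one source, frame and bin.

    `subframes` holds one 4-channel tuple of complex STFT values per sub-frame.
    """
    n_sf = len(subframes)
    i_avg = [0.0, 0.0, 0.0]
    e_avg = 0.0
    for bins in subframes:
        s = [complex(x) * g for x, g in zip(bins, CHANNEL_SCALE)]
        s1_conj = s[0].conjugate()
        # Intensity rows follow the text: channels 4, 2 and 3 against channel 1.
        intensity = ((s1_conj * s[3]).real,
                     (s1_conj * s[1]).real,
                     (s1_conj * s[2]).real)
        for n in range(3):
            i_avg[n] += intensity[n] / n_sf
        # ASSUMPTION: energy is half the summed squared magnitudes.
        e_avg += 0.5 * sum(abs(x) ** 2 for x in s) / n_sf
    azimuth = math.atan2(i_avg[1], i_avg[0])
    elevation = math.atan2(i_avg[2], math.hypot(i_avg[0], i_avg[1]))
    norm = math.sqrt(sum(v * v for v in i_avg))
    ratio = norm / e_avg if e_avg > 0.0 else 0.0
    return azimuth, elevation, ratio, e_avg
```

Feeding the sketch a signal that is present only in channels 1 and 2 gives an azimuth of 90 degrees, zero elevation and a ratio close to one.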

[0207] The spatial metadata 630 can comprise direction-of-arrival (DOA) data including an azimuth direction θ(i,j,b) and an elevation angle φ(i,j,b), a direct-to-total energy ratio r(i,j,b) and an energy for the subframes e(i,j, k, b). Other parameters could be used in other examples.

[0208] The spatial metadata 630 and time-frequency domain signals 628 are provided as outputs of the spatial analysis block 624.

[0209] The immersive audio renderer 600 also comprises a spatial metadata interpolation block 632. The spatial metadata interpolation block 632 receives the spatial metadata 630 and the interpolation weights $w_c(i,j)$ as inputs. The spatial metadata interpolation block 632 can also receive the orientations of the audio sources $o_i$ 606 and the orientation of the listener’s head $o_l$ 634 as inputs.

[0210] The spatial metadata interpolation block 632 is configured to provide an estimate of the spatial metadata for the position of the listener 104. The spatial metadata for the position of the listener 104 is estimated based on the spatial metadata 630 for the audio sources 102 comprising the active triangle $T_A(j)$ and the interpolation weights $w_c(i,j)$ 620 that have been calculated in the position pre-processing block 616.

[0211] Any suitable process can be used to perform the spatial metadata interpolation. As an example process, the spatial metadata can be converted into vector form:

[0213] $\tilde{v}(i,j,b) = \tilde{r}(i,j,b) \begin{bmatrix} -\sin(\theta(i,j,b))\cos(\varphi(i,j,b)) \\ \sin(\varphi(i,j,b)) \\ -\cos(\theta(i,j,b))\cos(\varphi(i,j,b)) \end{bmatrix}$

[0216] The vectors are rotated according to the orientations of the audio sources $o_i$ 606 and the orientation of the listener’s head $o_l$ 634.

[0217] $\tilde{v}(i,j,b) = R_{head}(j)\, R_{source}(i,j)\, \tilde{v}(i,j,b)$

[0218] An interpolated spatial metadata vector is then calculated by a weighted average of spatial metadata vectors:

[0220] $\hat{v}(j,k,b) = \sum_{i=1}^{N_s} w_c(i,j)\, \tilde{v}(i,j,b)$

[0223] where the interpolation weights $w_c(i,j)$ are the barycentric coordinates calculated in the position pre-processing block 616.

[0224] The interpolated spatial metadata vector is then converted to spatial metadata parameters as follows:

[0225] Azimuth:

[0228] $\hat{\theta}(j,k,b) = \mathrm{atan2}\big(-\hat{v}_1(j,k,b),\ -\hat{v}_3(j,k,b)\big)$

[0229] Elevation:

[0232] $\hat{\varphi}(j,k,b) = \mathrm{atan2}\Big(\hat{v}_2(j,k,b),\ \sqrt{(-\hat{v}_1(j,k,b))^2 + (-\hat{v}_3(j,k,b))^2}\Big)$

[0233] Direct-to-total energy ratio:

[0235] $\hat{r}(j,k,b) = \sqrt{(\hat{v}_1(j,k,b))^2 + (\hat{v}_2(j,k,b))^2 + (\hat{v}_3(j,k,b))^2}$

[0236] Energy:

[0238] $\hat{e}(j,k,b) = \sum_{i=1}^{N_s} w_c(i,j)\, e(i,j,k,b)$

[0241] The spatial metadata interpolation block 632 provides the interpolated spatial metadata vector 636 as an output. The spatial metadata interpolation block 632 can also provide the energy parameter ê(j,k,b) 638 as an output.
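The interpolation of spatial metadata at the listener position can be sketched as below for one frequency bin. The sketch is illustrative only: the function names are hypothetical, the rotation by the source and head orientations is omitted for brevity, and the elevation is recovered with the sign that inverts the vector form used in `to_vector`.

```python
import math


def to_vector(azimuth, elevation, ratio):
    """Convert (azimuth, elevation, ratio) into the vector form used above."""
    return (-math.sin(azimuth) * math.cos(elevation) * ratio,
            math.sin(elevation) * ratio,
            -math.cos(azimuth) * math.cos(elevation) * ratio)


def interpolate_metadata(metadata, energies, weights):
    """Weighted average of per-source metadata for the listener position.

    metadata: one (azimuth, elevation, ratio) tuple per audio source of the
    active triangle; energies and weights are given per source as well.
    Rotation matrices are omitted in this sketch (identity assumed).
    """
    vectors = [to_vector(*m) for m in metadata]
    v = [sum(w * vec[n] for w, vec in zip(weights, vectors)) for n in range(3)]
    azimuth = math.atan2(-v[0], -v[2])
    elevation = math.atan2(v[1], math.hypot(v[0], v[2]))
    ratio = math.sqrt(sum(x * x for x in v))
    energy = sum(w * e for w, e in zip(weights, energies))
    return azimuth, elevation, ratio, energy
```

Because the averaging is done on vectors scaled by the direct-to-total ratio, sources that disagree on direction partially cancel and produce a lower interpolated ratio, which is the behaviour one would expect from a more diffuse estimate.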

[0242] In the above description of the spatial analysis block 624, it is mentioned that spatial analysis is performed for audio sources that belong to the triangle that the listener 104 is in. In addition, spatial analysis can also be performed for audio sources 102 which are the closest audio sources 102 to active informed sources.

[0243] An informed source is active when certain conditions hold. The conditions could be that the listener is closer than a set threshold to the informed source and/or that the closest audio source is closer than a set threshold to the informed source. The respective thresholds can be different.
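The activity test for an informed source can be sketched as a pair of distance checks. The function name and the choice to require both conditions are assumptions; the text allows either condition or both, and does not give threshold values.

```python
import math


def informed_source_active(p_listener, p_informed, p_closest_audio,
                           listener_threshold, audio_threshold):
    """True if the informed source is treated as active.

    ASSUMPTION: both distance conditions must hold; the two thresholds
    may differ, as noted in the text.
    """
    near_listener = math.dist(p_listener, p_informed) < listener_threshold
    near_audio = math.dist(p_closest_audio, p_informed) < audio_threshold
    return near_listener and near_audio
```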

[0244] The spatial analysis for the audio sources 102 which are the closest audio sources to active informed sources can comprise obtaining signals via beamforming for each informed source. The beamformed signals can be post-filtered and time-aligned at the audio source positions and at the position of the listener 104.

[0245] For each active informed source for the current frame, beamforming can be performed as follows:

[0248] $S_{src}(j,k,b,m) = w(i_{closest}(m),m)^T\, S(j,k,b,i_{closest}(m))$

[0249] The obtained signal can be post-filtered:

[0252] $\hat{S}_{src}(j,k,b,m) = G_{post}(j,k,b,m)\, S_{src}(j,k,b,m)$

[0253] where a cross-pattern coherence (CroPaC) based post-filter is used.

[0254] Time-alignment at audio source positions can be done as follows:

[0257] $\hat{S}_{src}(j,k,b,i,m) = \hat{S}_{src}(t(j,i,m),k,b,m)$

[0258] where

[0261] $t(j,i,m) = j - \min\big(d_{hop}(i, i_{closest}(m), m),\ d_{hop,max}\big)$

[0262] Time-alignment at listener position can be done as follows:

[0265] $\hat{S}_{l,src}(j,k,b,m) = \hat{S}_{src}(t(j, i_{l,closest}, m),k,b,m)$

Spatial metadata parameters that only have contribution from the informed sources can be estimated for the active audio sources 102.

[0266] The signal covariance matrix from the time-aligned beamformed signals $\hat{S}_{src}(j,k,b,i,m)$ for the active audio sources is determined:

[0267] $C_{src}(j,b,i) = \sum_{m \in m_{active}} g_{att}(i,m)^2\, \hat{E}_{src}(j,b,i,m)\, yy(i,m)$

[0270] The signal covariance matrix from the listener 104 towards an audio source can be obtained as follows:

[0271] $C_l(j,b) = \sum_{m \in m_{active}} g_{l,att}(j,i_{closest}(m),m)^2\, \hat{E}_{l,src}(j,b,m)\, \mathrm{HRTF}(v_{l,rot}(j,m),b)\, \mathrm{HRTF}(v_{l,rot}(j,m),b)^H$

[0273] where $\mathrm{HRTF}(v_{l,rot}(j,m),b)$ is the HRTF from the listener 104 towards the informed source m, $g_{l,att}(j,i_{closest}(m),m)$ is an attenuation gain and $\hat{E}_{l,src}(j,b,m)$ is the energy of the time-aligned signal $\hat{S}_{l,src}(j,b,m)$.

[0274] Now, $C_l$ is the signal covariance matrix at the position of the listener 104 with contribution only from the informed sources. However, for the output the contribution for all other audio sources 102 present in the audio scene 100 is still needed. The signal covariance matrix $C_y$ as calculated above comprises the contribution from all audio sources in the audio scene 100 (informed and non-informed). This was calculated based on interpolated spatial metadata at the position of the listener 104. Therefore it can be beneficial to be able to calculate interpolated spatial metadata at the position of the listener 104 without the contribution of the informed sources. This can be done as follows:

[0275] First, the intensity vectors and energy at the audio sources from the signal covariance matrices $C_{src}(j,b,i)$ (informed source contribution at the audio sources) can be calculated:

[0277] $i_{src}(j,k,b,i) = \dfrac{1}{\sqrt{3}\, N_{sf}} \begin{bmatrix} C_{src}(j,b,i,1,4) \\ C_{src}(j,b,i,1,2) \\ C_{src}(j,b,i,1,3) \end{bmatrix}$

[0280] $E_{src}(j,k,b,i) = \dfrac{\sum_{n=1}^{4} C_{src}(j,b,i,n,n)}{4\, N_{sf}}$

[0283] Similarly, intensity $i_{HOA}(j,k,b,i)$ and energy $E_{HOA}(j,k,b,i)$ can be calculated for the audio signals. These contain the contribution from all sources in the scene. Next, intensity and energy which have only the contribution from non-informed sources are obtained by subtraction:

[0284] $i_{res}(j,k,b,i) = i_{HOA}(j,k,b,i) - i_{src}(j,k,b,i)$

[0287] $E_{res}(j,k,b,i) = E_{HOA}(j,k,b,i) - E_{src}(j,k,b,i)$

From these, spatial metadata can be obtained for audio source positions that only have contribution from the non-informed sources:

[0288] Azimuth:

[0291] $\theta_{res}(j,k,b,i) = \mathrm{atan2}\big(i_{res}(j,k,b,i,2),\ i_{res}(j,k,b,i,1)\big)$

[0292] Elevation:

[0295] $\varphi_{res}(j,k,b,i) = \mathrm{atan2}\Big(i_{res}(j,k,b,i,3),\ \sqrt{i_{res}(j,k,b,i,1)^2 + i_{res}(j,k,b,i,2)^2}\Big)$

Direct-to-total energy ratio:

[0296] $r_{res}(j,k,b,i) = \dfrac{\sqrt{i_{res}(j,k,b,i,1)^2 + i_{res}(j,k,b,i,2)^2 + i_{res}(j,k,b,i,3)^2}}{E_{res}(j,k,b,i)}$

[0299] These can then be interpolated in the same manner as in the spatial metadata interpolation block 632 as described above. This results in $E_{res,intrp}(j,k,b)$, $\theta_{res,intrp}(j,k,b)$, $\varphi_{res,intrp}(j,k,b)$ and $r_{res,intrp}(j,k,b)$.

[0300] The immersive audio renderer 600 also comprises a signal interpolation block 640. The signal interpolation block 640 provides an interpolated signal $\hat{S}_{b,c}(j,k)$ 642 for the listener position. The interpolated signal $\hat{S}_{b,c}(j,k)$ 642 is an estimate of the signal at the listener position. The interpolated signal can be used later in conjunction with the interpolated metadata 636 at the position of the listener 104 to provide the final binaural output.

[0301] The interpolated signal $\hat{S}_{b,c}(j,k)$ 642 is obtained by taking the audio signal for the audio source that is closest to the listener and applying equalisation to the signal.

[0303] $\hat{S}_{b,c}(j,k) = G_{eq}(j,k,b)\, S_{b,c}(m_c(j), j, k)$

[0304] where $m_c(j)$ is the index of the audio source chosen for interpolation (closest to the listener in most cases) and:

[0306] $G_{eq}(j,k,b) = \sqrt{\dfrac{\hat{e}(j,k,b)}{e(m_c(j),j,k,b)}}$

[0308] The immersive audio renderer 600 also comprises a mixing block 644. The mixing block 644 receives inputs comprising the interpolated spatial metadata vector 636 at the listener position and the interpolated signal $\hat{S}_{b,c}(j,k)$ 642 at the position of the listener 104. The mixing block 644 also receives the HRTFs 616, the orientations of the audio sources $o_i$ 606 and the orientation of the listener’s head $o_l$ 634 as inputs. The mixing block 644 processes the inputs to provide the binaural time-frequency domain signal 646 as an output.

[0309] The mixing block 644 can create a signal covariance matrix from the interpolated spatial metadata vector 636. The covariance matrix describes the desired (or target) spatial characteristics of the signal at the listener position. An optimal mixing algorithm can then be used to obtain a mixing matrix that, when multiplied with the interpolated signal $\hat{S}_{b,c}(j,k)$ 642, provides a resulting signal that is in accordance with the desired spatial characteristics.

[0310] First, a binaural prototype signal is created from the interpolated spatial metadata vector 636:

[0311] $B(j,b) = M_{HOA2bin}(b)\, R_{sh}(j) \begin{bmatrix} \hat{S}_{b,1}(j,1) & \cdots & \hat{S}_{b,1}(j,N_{sf}) \\ \vdots & & \vdots \\ \hat{S}_{b,N_{ch}}(j,1) & \cdots & \hat{S}_{b,N_{ch}}(j,N_{sf}) \end{bmatrix}$

[0313] where $R_{sh}(j)$ is a rotation matrix taking into account the orientation of the listener and the orientations of the audio sources and $M_{HOA2bin}(b)$ is the Ambisonics-to-binaural matrix.

[0314] Then, a signal covariance matrix $C_x$ is calculated for the prototype signal:

[0315] $C_x^{new}(j,b) = B(j,b)\, B(j,b)^H$

[0316] Recursive averaging is applied to get the signal covariance matrix for frame j:

[0317] $C_x(j,b) = (1 - d)\, C_x^{new}(j,b) + d\, C_x(j-1,b)$

[0318] where d = 0.9.
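The recursive averaging of the covariance matrix is a first-order smoother and can be sketched as follows. The function name is hypothetical and matrices are represented as nested lists for self-containment; entries can be real or complex.

```python
def smooth_covariance(c_prev, c_new, d=0.9):
    """One step of recursive averaging: (1 - d) * c_new + d * c_prev.

    c_prev is the smoothed matrix from the previous frame and c_new is the
    matrix computed for the current frame.
    """
    return [[(1.0 - d) * new + d * prev for new, prev in zip(row_new, row_prev)]
            for row_new, row_prev in zip(c_new, c_prev)]
```

With d = 0.9 the current frame contributes only a tenth of the result, so a sudden change in the per-frame covariance takes on the order of ten frames to be mostly reflected in the smoothed estimate.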

[0319] Next, a signal covariance matrix $C_y$ is calculated from the interpolated spatial metadata at the listener position. First the direct portion of $C_y$ is calculated:

[0321] $C_y^{direct}(j,b) = \sum_{k=1}^{N_{sf}} \hat{e}(j,k,b)\, \hat{r}(j,k,b)\, H(b,d)\, H(b,d)^H$

[0324] where H(b,d) refers to the HRTF value at frequency bin b, in direction d.

[0325] Secondly, the diffuse portion of $C_y$ is calculated:

[0327] $C_y^{diffuse}(j,b) = \sum_{k=1}^{N_{sf}} \big(1 - \hat{r}(j,k,b)\big)\, \hat{e}(j,k,b)\, C_{dif}(j)$

[0330] where:

[0331] c

[0332]

[0333] <“'m =- -

[0334] $C_y$ is then obtained as follows:

[0337] $C_y^{new}(j,b) = C_y^{direct}(j,b) + C_y^{diffuse}(j,b)$

[0338] And recursive averaging:

[0339] $C_y(j,b) = (1 - d)\, C_y^{new}(j,b) + d\, C_y(j-1,b)$, where d = 0.9.

[0340] Where informed sources are used, a signal covariance matrix can be calculated from the interpolated metadata in the same manner as $C_y$ was calculated above. This gives a signal covariance matrix $C_{res}$ with contribution from only the non-informed sources. The final target covariance matrix is then obtained as follows:

[0341] $C_y(j,b) = C_l(j,b) + C_{res}(j,b)$

[0342] The signal covariance matrices are then used to obtain mixing matrices which are used to obtain the binaural output as follows:

[0343] $O(j,k,b) = M(j,k,b)\, B(j,k,b) + M_r(j,k,b)\, D(j,b)$

[0344] where D(j,b) is a decorrelated time-frequency domain signal obtained from a buffer of previous binaural signals B.

[0345] The matrices $M(j,k,b)$ and $M_r(j,k,b)$ can be obtained from an optimal mixing procedure. After applying the mixing matrices on the binaural prototype signal B, the resulting output O is the binaural time-frequency domain signal 646 and has the spatial characteristics of the spatial metadata at the position of the listener 104.

[0346] The binaural time-frequency domain signal 646 is provided to an output block 648. The output block 648 takes the obtained binaural time-frequency domain signal 646 and performs an inverse STFT on it to produce the final time domain binaural output signal Ss(i,j) 650.

[0347] Fig. 7 shows another example immersive audio renderer 600. The immersive audio renderer 600 can comprise blocks as shown in Fig. 6 and described above. However, in Fig. 7 the immersive audio renderer 600 is configured to retain an audio source 102 in the list of active sources even if the audio source 102 is moving. For example, if an audio scene 100 consists of just one or two audio sources 102 then the audio sources would need to be used for rendering even if they were moving.

[0348] In the example of Fig. 7 the active source determiner 602 provides an additional re-do preprocessing output 700. The re-do preprocessing output 700 is provided to the pre-processing block 610 and indicates that the pre-processing needs to be redone to account for the new position of a moving audio source 102.

[0349] Other means for retaining an audio source 102 in the list of active sources even if the audio source 102 is moving can be used. For instance, if an audio source 102 is moving slowly then the moving audio source 102 can be kept in the list of active sources. An audio source 102 can be determined to be moving slowly if the rate of change of position is below a threshold. The source geometry can then be updated periodically, for example, every 10 or 20 frames. In such cases the calculation would not be done for every frame but, because the audio source 102 is only moving slowly, the errors in the source geometries do not become too large.
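The handling of slowly moving sources can be sketched with a speed test and a periodic update flag. The function names, the per-frame position sampling and the default period of 10 frames are illustrative assumptions; the text gives 10 or 20 frames as examples and leaves the threshold open.

```python
import math


def is_slow(prev_pos, pos, frame_duration_s, speed_threshold):
    """True if the rate of change of position is below the threshold (m/s)."""
    return math.dist(prev_pos, pos) / frame_duration_s < speed_threshold


def needs_geometry_update(frame_idx, slow, period_frames=10):
    """For a slow source, request a source geometry update every period_frames frames."""
    return slow and frame_idx % period_frames == 0
```

For a frame of 256 samples at 48 kHz (assumed values), a displacement of 1 mm per frame corresponds to roughly 0.19 m/s, which would count as slow against a threshold of 0.5 m/s.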

[0350] Fig. 8 shows a method that can be used in examples of the disclosure. The method could be implemented using an immersive audio renderer 600 as shown in Figs. 6 or 7 or any other suitable means. The method of Fig. 8 could be used where the movement of the audio sources 102 is not known beforehand.

[0351] At block 800 the method comprises obtaining initial audio source positions for an audio scene. The initial audio source positions can be received as information of the audio scene 604 as shown in Figs. 6 and 7.

[0352] An initial set of active audio sources 102 is obtained at block 802. The set of active audio sources 102 can be determined by an active source determiner 602 as shown in Figs. 6 and 7 or by any other suitable means. The active sources are the sources that are to be used for rendering.

[0353] At blocks 804 and 806 the initial source geometry for the audio scene is obtained. At block 804 a triangulation for the audio scene is calculated. Other partitions of the audio scene 100 could be used in other examples. At block 806 beamformer parameters are calculated.

[0354] After the source geometry has been determined, new audio source positions can be obtained at block 808. The new audio source positions can be the positions of the audio sources 102 at a later point in time. At block 810 it is determined if audio source positions have changed. The determining of whether audio source positions have changed could be performed by the active source determiner 602 such as shown in Figs. 6 and 7 or by any other suitable means.

[0355] If it is determined that one or more of the audio source positions have changed then the method proceeds to block 812. At block 812 the active audio sources for the new positions of the audio sources are determined. Once the new set of active audio sources is determined, the source geometry is updated for the new audio source positions. At block 814 the triangulation for the audio scene is re-calculated based on the new audio source positions. Similarly, at block 816 the beamformer parameters are recalculated for the new audio source positions. Once the source geometry has been determined for the new audio source positions then the method proceeds to block 818 and the spatial audio is rendered based on the updated source geometry.

[0356] If at block 810 it is determined that the audio source positions have not changed then the method proceeds straight to block 818 and the spatial audio is rendered based on the initial source geometry.
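The decision flow of Fig. 8 can be sketched as a single update step. This is an illustration under stated assumptions: `compute_geometry` is a hypothetical callback standing in for the triangulation and beamformer parameter calculations, positions are keyed by a source identifier, and sources whose position changed are left out of the active set, in line with the de-emphasis described earlier.

```python
import math


def update_geometry(prev_positions, new_positions, geometry,
                    compute_geometry, tol=1e-6):
    """One pass through blocks 808 to 816 of the flow described above.

    Returns (geometry to render with, set of source ids that moved).
    """
    # Block 810: have any audio source positions changed?
    moved = {sid for sid, pos in new_positions.items()
             if math.dist(pos, prev_positions[sid]) > tol}
    if not moved:
        return geometry, moved  # go straight to rendering (block 818)
    # Block 812: ASSUMPTION - moving sources are removed from the active set.
    active = {sid: new_positions[sid]
              for sid in new_positions if sid not in moved}
    # Blocks 814 and 816: recompute triangulation and beamformer parameters.
    return compute_geometry(active), moved
```

A caller would then render with the returned geometry and keep `new_positions` as the reference for the next pass.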

[0357] Fig. 9A shows a method that can be used in examples of the disclosure where the movement of the audio sources 102 is known beforehand. In this case the source geometries can be determined before the spatial audio is rendered and then selected for use at an appropriate time. The method of Fig. 9A could be implemented during initialization of the renderer or at any other suitable time.

[0358] Fig. 9B shows example triangulations that could be used for an audio scene 100 that implements the method of Fig. 9A, or other suitable methods.

[0359] In the example of Fig. 9A, at block 900, information about the movement of the audio sources 102 is obtained. This information can be received in the bitstream or via any other suitable means. This information about the movement of the audio sources 102 can comprise any updates to the audio sources 102 or any other suitable information.

[0360] In some examples the information of audio source movement could comprise timed position updates related to the audio sources 102 in the update packets in the MPEG-I immersive audio bitstream. The information that is obtained can comprise identification of one or more audio sources 102, a time stamp, and a new position and/or orientation of the audio source 102.

[0361] At block 902 the timings of movements of the respective audio sources 102 are determined. This can establish the time intervals for which source geometries can be applied.

[0362] Any suitable process can be used to determine the timings of the movements of the audio sources. For example, the information about the movement of the audio sources 102 can be analysed to determine sections of time where the audio sources 102 are stationary. The start and end times of these sections can then be used as time interval borders. An audio source 102 can be defined to be stationary for a time instant t if there are no timed updates related to it for a time window around t (t ± t0). The bitstream can also carry updates related to the active state of an audio source 102. These can also be used to define time intervals. Fig. 9B shows example time intervals. The audio starts at a first time 910. At this time there are four audio sources and none of them are moving. It is determined, from the information about the movement of the audio sources 102, that at a second time 912 one of the audio sources 102 starts to move. The first time 910 and the second time 912 define a first time interval T0. In the first time interval T0 there is no movement of the audio sources 102.

[0363] It is also determined, from the information about the movement of the audio sources 102, that at a third time 914 the movement of the audio source 102 has stopped. The second time 912 and the third time 914 define a second time interval T1. In the second time interval T1 there is movement of the audio sources 102.

[0364] Similarly, it is determined that at a fourth time 916 movement of an audio source 102 begins and at a fifth time 918 movement of that audio source 102 stops. This defines a third time interval T2 and a fourth time interval T3. In the third time interval T2 there is no movement of the audio sources 102 and in the fourth time interval T3 there is movement of the audio sources 102.

[0365] At block 904 the active audio sources are determined. The active audio sources can be the audio sources that are not moving and so can be used for defining the source geometries. The active audio sources can be determined for the time intervals that were established at block 902. For instance, in the example of Fig. 9B the active audio sources would be determined for the first time interval T0, the second time interval T1, the third time interval T2, the fourth time interval T3, and the fifth time interval T4.

[0366] At block 906 the source geometries are calculated for the time intervals. The source geometries can comprise triangulations, beamformer parameters, and any other suitable geometries. The source geometries can be calculated based on the movement of the audio sources 102 and the set of active sources 102 for the respective time interval so that audio sources that are moving can be de-emphasized. The calculated source geometries can then be stored and retrieved as needed.
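The derivation of time intervals from timed position updates can be sketched as follows. The function names are hypothetical, the sketch handles a single audio source, and it assumes the update times fall inside the timeline; a source is treated as moving within a window of plus or minus `t0` around each update, matching the stationarity definition given above.

```python
def movement_intervals(update_times, t0):
    """Merge timed updates into (start, end) intervals during which the source moves."""
    merged = []
    for t in sorted(update_times):
        if merged and t - merged[-1][1] <= 2 * t0:
            merged[-1][1] = t  # windows overlap, extend the current interval
        else:
            merged.append([t, t])
    return [(a - t0, b + t0) for a, b in merged]


def split_timeline(update_times, t0, start, end):
    """Return (interval start, interval end, moving) tuples covering the timeline."""
    out, cursor = [], start
    for a, b in movement_intervals(update_times, t0):
        a, b = max(a, start), min(b, end)
        if a > cursor:
            out.append((cursor, a, False))
        out.append((a, b, True))
        cursor = b
    if cursor < end:
        out.append((cursor, end, False))
    return out
```

Two bursts of updates produce five intervals that alternate between stationary and moving, similar to the intervals T0 to T4 in the example of Fig. 9B; a source geometry could then be pre-calculated and stored per interval.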

[0367] The respective time intervals can be different for different parameters. For example, the time intervals used for defining active audio sources could be different to the time intervals used for defining a triangulation and/or a set of beamformer parameters. Fig. 10 shows a system 1000 that can be used to implement examples of the disclosure. The system 1000 comprises a content creation workflow 1002, a server 1004 and a player side functionality 1006.

[0368] The content creation workflow 1002 enables the creation of the audio scene 100. For example, it enables the audio scene to be recorded or otherwise generated. The audio data 1010 can be captured by microphones or generated using any other suitable means. The content creation workflow 1002 also enables an encoder input format (EIF) audio scene description to be generated.

[0369] The audio data is provided to an MPEG-H encoder 1014 to generate an MPEG-H audio bitstream 1018. The audio data and the EIF audio scene description are provided to an MPEG-I encoder 1012 to generate a 6DoF audio bitstream 1016.

[0370] The MPEG-H audio bitstream 1018 and the 6DoF audio bitstream 1016 are provided from the respective encoders to the server 1004. The MPEG-H audio bitstream 1018 and the 6DoF audio bitstream 1016 are stored in bitstream storage 1020 at the server 1004.

[0371] The player side functionality 1006 can comprise a playback device 1022. The playback device 1022 can be configured to retrieve the 6DoF audio bitstream 1016 from the bitstream storage 1020.

[0372] The playback device 1022 comprises a bitstream parsing module 1024 which parses the retrieved 6DoF audio bitstream 1016 and provides audio and metadata 1026 as an output.

[0373] The audio and metadata 1026 is provided to an audio renderer 1028 such as an MPEG-I audio renderer. The audio renderer 1028 receives the audio and metadata 1026 as an input and also receives 6DoF tracking information 1030 from the listener 104. The 6DoF tracking information 1030 can be received from sensors in a playback device worn by the listener 104 or from any other suitable means.

[0374] The audio renderer 1028 uses the 6DoF tracking information 1030 to render the audio and metadata 1026 for the position of the listener and provide the audio 1032 to the listener 104.
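One way a renderer might use the listener position together with a triangulated source geometry is sketched below in Python. This is only an assumption for illustration: the passage above does not state how the renderer interpolates, and the function names and the use of barycentric weights are hypothetical. The sketch finds the triangle that contains the tracked listener position and returns a weight for each of the three sources at its vertices.

```python
def barycentric_weights(p, a, b, c):
    """Return barycentric weights of point p with respect to triangle (a, b, c).

    Returns None for a degenerate (zero-area) triangle.
    """
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if abs(det) < 1e-12:
        return None
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def locate_listener(listener_xy, positions, triangles):
    """Find the triangle containing the listener and weight its three sources.

    positions maps source id to (x, y); triangles is a list of id triples.
    Returns {source id: weight}, or None if the listener is outside every triangle.
    """
    for triangle in triangles:
        weights = barycentric_weights(listener_xy, *(positions[s] for s in triangle))
        if weights is not None and all(w >= -1e-9 for w in weights):
            return dict(zip(triangle, weights))
    return None
```

For sources at (0, 0), (4, 0) and (0, 4), a listener at (1, 1) falls inside the triangle and the weights sum to one; a listener at (5, 5) falls outside and the function returns `None`, which a real renderer would need to handle by some fallback not covered here.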

[0375] Fig. 11 shows an example controller 1100. The controller 1100 could be provided within an encoder or any other suitable entity. Implementation of the controller 1100 may be as controller circuitry. The controller 1100 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware). The controller 1100 can provide an apparatus for implementing the disclosure or could be provided as part of an apparatus that implements the disclosure.

[0376] As illustrated in Fig. 11 the controller 1100 can be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 1106 in a general-purpose or special-purpose processor 1102 that may be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 1102.

[0377] The processor 1102 is configured to read from and write to the memory 1104. The processor 1102 may also comprise an output interface via which data and / or commands are output by the processor 1102 and an input interface via which data and / or commands are input to the processor 1102.

[0378] The memory 1104 stores a computer program 1106 comprising computer program instructions (computer program code) that control the operation of the apparatus when loaded into the processor 1102. The computer program instructions, of the computer program 1106, provide the logic and routines that enable the apparatus to perform the methods illustrated in the Figs. The processor 1102 by reading the memory 1104 is able to load and execute the computer program 1106.

[0379] In some examples where the controller 1100 is provided within an apparatus, the controller therefore comprises means for:

[0380] obtaining 300 a source geometry of an audio scene 100 wherein respective sections of the source geometry of the audio scene comprise multiple audio sources 102;

[0381] detecting 302 movement of at least one of the multiple audio sources 102 in the audio scene 100;

[0382] obtaining 304 an updated source geometry for the audio scene 100 wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources 102 for which movement is detected; and

[0383] enabling 306 rendering of the audio scene 100 based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources 102.
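The four operations listed above (obtaining 300, detecting 302, obtaining 304 and enabling 306) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the source geometry is represented here, hypothetically, as a per-source emphasis weight between 0 and 1, movement is detected by comparing positions between two frames, and the threshold and `full_exclusion_at` parameters are invented for the example.

```python
from math import dist


def detect_movement(previous, current, threshold=0.05):
    """Return {source id: displacement} for sources that moved more than threshold."""
    moving = {}
    for sid, position in current.items():
        if sid in previous:
            displacement = dist(previous[sid], position)
            if displacement > threshold:
                moving[sid] = displacement
    return moving


def update_geometry(geometry, moving, full_exclusion_at=0.5):
    """Return an updated geometry that at least partially de-emphasizes moving sources.

    geometry maps source id to an emphasis weight. A moving source is scaled
    down in proportion to its displacement and reaches zero emphasis once the
    displacement is at least full_exclusion_at (an assumed parameter).
    """
    updated = {}
    for sid, emphasis in geometry.items():
        if sid in moving:
            emphasis *= max(0.0, 1.0 - moving[sid] / full_exclusion_at)
        updated[sid] = emphasis
    return updated


def enable_rendering(geometry, listener_xy):
    """Package the sources that still have nonzero emphasis with the listener position."""
    sources = {sid: e for sid, e in geometry.items() if e > 0.0}
    return {"listener": listener_xy, "sources": sources}


# Example: source B drifts slightly, source C moves a long way.
geometry = {"A": 1.0, "B": 1.0, "C": 1.0}                        # obtaining 300
previous = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0)}
current = {"A": (0.0, 0.0), "B": (4.25, 0.0), "C": (0.0, 5.0)}
moving = detect_movement(previous, current)                      # detecting 302
updated = update_geometry(geometry, moving)                      # obtaining 304
render_input = enable_rendering(updated, (1.0, 1.0))             # enabling 306
```

In this example source A keeps full emphasis, source B is partially de-emphasized, and source C is dropped from the set handed to the renderer, which corresponds to the case where a de-emphasized source is not used for the updated source geometry.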

[0384] The computer program 1106 may arrive at the apparatus via any suitable delivery mechanism 1108. The delivery mechanism 1108 may be, for example, a machine-readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, an article of manufacture that comprises or tangibly embodies the computer program 1106. The delivery mechanism may be a signal configured to reliably transfer the computer program 1106. The apparatus may propagate or transmit the computer program 1106 as a computer data signal.

[0385] The computer program 1106 can comprise computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:

[0386] obtaining 300 a source geometry of an audio scene 100 wherein respective sections of the source geometry of the audio scene comprise multiple audio sources 102;

[0387] detecting 302 movement of at least one of the multiple audio sources 102 in the audio scene 100;

[0388] obtaining 304 an updated source geometry for the audio scene 100 wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources 102 for which movement is detected; and

[0389] enabling 306 rendering of the audio scene 100 based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources 102.

[0390] The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine-readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.

[0391] Although the memory 1104 is illustrated as a single component / circuitry it may be implemented as one or more separate components / circuitry some or all of which may be integrated / removable and / or may provide permanent / semi-permanent / dynamic / cached storage.

[0392] Although the processor 1102 is illustrated as a single component / circuitry it may be implemented as one or more separate components / circuitry some or all of which may be integrated / removable. The processor 1102 may be a single core or multi-core processor.

[0393] References to “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc. or a “controller”, “computer”, “processor” etc. should be understood to encompass not only computers having different architectures such as single / multiprocessor architectures and sequential (Von Neumann) / parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.

[0394] As used in this application, the term “circuitry” can refer to one or more or all of the following:

[0395] (a) hardware-only circuitry implementations (such as implementations in only analog and / or digital circuitry) and

[0396] (b) combinations of hardware circuits and software, such as (as applicable):

[0397] (i) a combination of analog and / or digital hardware circuit(s) with software / firmware and

[0398] (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and

[0399] (c) hardware circuit(s) and / or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software might not be present when it is not needed for operation.

[0400] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and / or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.

[0401] The blocks illustrated in the Figs. can represent steps in a method and / or sections of code in the computer program 1106. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks can be varied. Furthermore, it can be possible for some blocks to be omitted.

[0402] The above-described examples find application as enabling components of:

[0403] automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and / or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.

The apparatus can be provided in an electronic device, for example, a mobile terminal, according to an example of the present disclosure. It should be understood, however, that a mobile terminal is merely illustrative of an electronic device that would benefit from examples of implementations of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure to the same. While in certain implementation examples, the apparatus can be provided in a mobile terminal, other types of electronic devices, such as, but not limited to: mobile communication devices, hand portable electronic devices, wearable computing devices, portable digital assistants (PDAs), pagers, mobile computers, desktop computers, televisions, gaming devices, laptop computers, cameras, video recorders, GPS devices and other types of electronic systems, can readily employ examples of the present disclosure. Furthermore, devices can readily employ examples of the present disclosure regardless of their intent to provide mobility.

[0404] The term ‘comprise’ is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use ‘comprise’ with an exclusive meaning then it will be made clear in the context by referring to ‘comprising only one...’ or by using ‘consisting.’

[0405] In this description, the wording ‘connect’, ‘couple’ and ‘communication’ and their derivatives mean operationally connected / coupled / in communication. It should be appreciated that any number or combination of intervening components can exist (including no intervening components), i.e., to provide direct or indirect connection / coupling / communication. Any such intervening components can include hardware and / or software components.

[0406] As used herein, the term "determine / determining" (and grammatical variants thereof) can include, not least: calculating, computing, processing, deriving, measuring, investigating, identifying, looking up (for example, looking up in a table, a database, or another data structure), ascertaining and the like. Also, "determining" can include receiving (for example, receiving information), accessing (for example, accessing data in a memory), obtaining and the like. Also, "determine / determining" can include resolving, selecting, choosing, establishing, and the like.

[0407] In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term ‘example’ or ‘for example’ or ‘can’ or ‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus ‘example’, ‘for example’, ‘can’, or ‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.

[0408] As used herein, “at least one of the following: ” and “at least one of ” and similar wording, where the list of two or more elements are joined by “and” or “or” mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

[0409] Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.

[0410] Features described in the preceding description may be used in combinations other than the combinations explicitly described above.

[0411] Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.

[0412] The description of a feature, such as an apparatus or a component of an apparatus, configured to perform a function, or for performing a function, should additionally be considered to also disclose a method of performing that function. For example, description of an apparatus configured to perform one or more actions, or for performing one or more actions, should additionally be considered to disclose a method of performing those one or more actions with or without the apparatus.

[0413] Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.

[0414] The term ‘a’, ‘an’ or ‘the’ is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising a / an / the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use ‘a’, ‘an’ or ‘the’ with an exclusive meaning then it will be made clear in the context. In some circumstances the use of ‘at least one’ or ‘one or more’ may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning. The presence of a feature (or combination of features) in a claim is a reference to that feature (or combination of features) itself and to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.

[0415] In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.

[0416] The above description describes some examples of the present disclosure; however, those of ordinary skill in the art will be aware of possible alternative structures and method features which offer equivalent functionality to the specific examples of such structures and features described herein above and which for the sake of brevity and clarity have been omitted from the above description. Nonetheless, the above description should be read as implicitly including reference to such alternative structures and method features which provide equivalent functionality unless such alternative structures or method features are explicitly excluded in the above description of the examples of the present disclosure.

[0417] Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and / or shown in the drawings whether or not emphasis has been placed thereon.

[0418] I / we claim:

Claims

1. An apparatus for six degrees of freedom audio rendering comprising:
at least one processor; and
at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform:
obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;
detecting movement of at least one of the multiple audio sources in the audio scene;
obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and
enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

2. An apparatus as claimed in claim 1, wherein an audio source which has been at least partially de-emphasized is not used for the updated source geometry.

3. An apparatus as claimed in any preceding claim, wherein if it is detected that multiple audio sources are moving then audio sources that are moving the most are de-emphasized.

4. An apparatus as claimed in any preceding claim, wherein the processor and memory are also arranged to cause the apparatus to perform:
detecting that the at least one of the multiple audio sources that have been at least partially de-emphasized have stopped moving;
obtaining a further source geometry wherein the further source geometry re-emphasizes the at least one of the multiple audio sources that have stopped moving; and
enabling updated rendering of the audio scene based on the further source geometry and the listener position after the at least one of the multiple audio sources that were at least partially de-emphasized have stopped moving.

5. An apparatus as claimed in claim 4, wherein the re-emphasized audio sources are used for the further source geometry.

6. An apparatus as claimed in any preceding claim, wherein at least one of the multiple audio sources for which movement is detected are located within a section of the source geometry corresponding to the listener position.

7. An apparatus as claimed in any preceding claim, wherein detecting movement of at least one of the multiple audio sources in the audio scene comprises obtaining advance information of the movement of at least one of the multiple audio sources.

8. An apparatus as claimed in claim 7, wherein the processor and memory are also arranged to cause the apparatus to perform:
using the advance information of the movement of at least one of the multiple audio sources to obtain the updated source geometry of the audio scene; and
enabling rendering of the audio scene based on the updated source geometry and the listener position prior to movement of at least one of the multiple audio sources.

9. An apparatus as claimed in any of claims 7 or 8, wherein the processor and memory are also arranged to cause the apparatus to perform:
using the advance information of the movement of at least one of the multiple audio sources to obtain the further source geometry of the audio scene; and
enabling updated rendering of the audio scene based on the further source geometry and the listener position prior to the at least one of the multiple audio sources that have been at least partially de-emphasized stopping movement.

10. An apparatus as claimed in any of claims 7 to 9, wherein the respective source geometries are precalculated based on the advance information.

11. An apparatus as claimed in any preceding claim, wherein the processor and memory are also arranged to cause the apparatus to perform obtaining initial source information.

12. An apparatus as claimed in claim 11, wherein the initial source information comprises one or more informed source beamforming directions.

13. An apparatus as claimed in any preceding claim, wherein the source geometries comprise a triangulation.

14. An apparatus as claimed in any preceding claim, wherein the processor and memory are also arranged to cause the apparatus to perform applying a cross fade for a transition between respective source geometries.

15. An apparatus as claimed in any preceding claim, wherein the rendering comprises using a first audio source in a section of the source geometry for signal interpolation and the first audio source and two other audio sources for spatial metadata interpolation.

16. An apparatus as claimed in any preceding claim, wherein movement of the one or more audio sources comprises at least one of:
change in location; and
change in orientation.

17. An apparatus as claimed in any preceding claim, wherein the one or more audio sources comprise higher order ambisonics sources.

18. A method comprising:
obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;
detecting movement of at least one of the multiple audio sources in the audio scene;
obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and
enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

19. A method as claimed in claim 18, wherein an audio source which has been at least partially de-emphasized is not used for the updated source geometry.

20. A computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform:
obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;
detecting movement of at least one of the multiple audio sources in the audio scene;
obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and
enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.

21. An apparatus for six degrees of freedom audio rendering comprising means for:
obtaining a source geometry of an audio scene wherein respective sections of the source geometry of the audio scene comprise multiple audio sources;
detecting movement of at least one of the multiple audio sources in the audio scene;
obtaining an updated source geometry for the audio scene wherein the updated source geometry at least partially de-emphasizes the at least one of the multiple audio sources for which movement is detected; and
enabling rendering of the audio scene based on the updated source geometry and the listener position during movement of the at least one of the multiple audio sources.