A method and apparatus for 6DOF rendering

By optimizing bitrate allocation for spatial metadata and audio transport channels in IVAS streams based on listener position, the method addresses inefficiencies in 6DoF rendering, achieving reduced bitrate usage and improved audio quality.

WO2026139208A1 · PCT designated stage · Publication Date: 2026-07-02 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2025-12-05
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Conventional bitrate allocation in immersive audio codecs like IVAS is suboptimal for six-degrees-of-freedom (6DoF) rendering, leading to inefficient use of transmission bitrate and reduced audio quality due to unequal importance of spatial metadata and audio transport channels based on listener position.

Method used

Implement differentiated bitrate budget allocation for spatial metadata and audio transport channels in IVAS streams, prioritizing higher budgets for spatial metadata in 6DoF rendering scenarios, adjusting bitrates based on listener position and microphone array configurations.

Benefits of technology

Achieves reduced total transmission bitrate and enhanced subjective audio quality for 6DoF rendering by optimizing bitrate distribution for spatial metadata and audio transport channels.



Abstract

An apparatus, comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

Description

A METHOD AND APPARATUS FOR 6DOF RENDERING

FIELD

[0001] The present application relates to a method and apparatus for six-degrees-of-freedom rendering, and in particular, but not exclusively, to a method and apparatus for six-degrees-of-freedom rendering employing immersive voice and audio services (IVAS).

BACKGROUND

[0002] Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the immersive voice and audio services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G / 5G network. Such immersive services include uses for example in immersive voice and audio for applications such as virtual reality (VR), augmented reality (AR) and mixed reality (MR) as well as spatial voice communication including teleconferencing. This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.

[0003] For example, the input signals can be presented to an IVAS encoder in one of the supported formats (and in some allowed combinations of the formats). Similarly, the decoder can be configured to output the audio in any of the supported formats. The main supported input formats for IVAS are stereo, multichannel (MC), object-based audio (ISM), scene-based audio (SBA), and Metadata-assisted spatial audio (MASA). In addition, the following combinations are supported: Objects with MASA (OMASA) and Objects with SBA (OSBA). IVAS furthermore includes the EVS codec for mono input operation. The IVAS output formats include mono, stereo, multi-channel (including custom loudspeaker layouts), FOA, HOA2, HOA3, and binaural. This flexibility is because as a spatial audio codec supporting at least three degrees of rotation freedom (yaw, pitch, roll) for all spatial inputs, the IVAS codec is expected to be used in a variety of scenarios, all of which cannot be known beforehand.

[0004] In addition, a so-called pass-through operation is possible allowing, e.g., MASA output for MASA input, in other words where the audio could be provided in its original format after transmission (encoding / decoding).

[0005] Additionally, RTP (Real-Time Transport Protocol) is intended for an end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of the RTP implementations are built on top of the User Datagram Protocol (UDP). Other transport protocols may also be utilized. RTP is used together with other protocols such as H.323 and Real Time Streaming Protocol (RTSP).

[0006] The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and its companion protocol (RTCP) is used to periodically send control information and QoS (Quality of Service) parameters.
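As an illustration of the RTP facilities mentioned above (sequence numbers for detecting loss and out-of-order delivery, timestamps for jitter compensation), the following minimal sketch unpacks the 12-byte fixed RTP header laid out in RFC 3550. The function name `parse_rtp_header` and the sample packet values are illustrative assumptions and are not taken from the application.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source identifier.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,   # gaps or reordering reveal loss / out-of-order delivery
        "timestamp": timestamp,   # used by the receiver for jitter compensation
        "ssrc": ssrc,
    }

# Hypothetical packet: version 2, dynamic payload type 96, sequence number 1000.
example = struct.pack("!BBHII", 0x80, 96, 1000, 160000, 0x1234ABCD) + b"payload"
header = parse_rtp_header(example)
```

A receiver could compare consecutive `sequence_number` values to flag missing or reordered packets before handing the payload to a decoder.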

[0007] RTP sessions are typically initiated between client and server or between client and another client (or a multi-party topology) using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols typically use the Session Description Protocol (SDP), such as defined by RFC 8866, to specify parameters for the sessions.

SUMMARY

[0008] According to a first aspect, there is provided an apparatus for controlling a generation of encoded immersive streams, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0009] The apparatus may be further caused to: generate at least two encoded immersive streams based on the control; and transmit the at least two encoded immersive streams.

[0010] The apparatus caused to generate the at least two encoded immersive streams based on the control may be caused to: create / generate at least two immersive streams; and encode the at least two immersive streams based on the control to generate the at least two encoded immersive streams.

[0011] The apparatus caused to determine the use associated with each of the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position may be caused to determine the use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering.
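The use determination of paragraph [0011] can be sketched as follows. The passage above states only that the use depends on the microphone array positions and the listener position; the nearest-array rule, the `StreamUse` enum, and the function name `determine_uses` are assumptions made for illustration.

```python
import math
from enum import Enum

class StreamUse(Enum):
    METADATA_ONLY = "spatial metadata determination"
    METADATA_AND_AUDIO = "spatial metadata determination and audio signal rendering"

def determine_uses(listener_pos, array_positions):
    """Assign one of the two uses to the stream of each microphone array.

    Assumed rule (not from the application): the array closest to the
    listener supplies audio and metadata; all other arrays supply
    metadata only.
    """
    if len(array_positions) < 2:
        raise ValueError("at least two microphone array positions are required")
    distances = [math.dist(listener_pos, p) for p in array_positions]
    nearest = distances.index(min(distances))
    return [
        StreamUse.METADATA_AND_AUDIO if i == nearest else StreamUse.METADATA_ONLY
        for i in range(len(array_positions))
    ]

# Hypothetical scene: one listener and three arrays, coordinates in metres.
uses = determine_uses(
    (1.0, 0.5, 0.0),
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 3.0, 0.0)],
)
```

Other selection rules (for example, several nearby arrays contributing audio) would fit the same interface; only the mapping from positions to uses would change.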

[0012] The use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering may be one of: spatial metadata interpolation; and spatial metadata interpolation and audio signal rendering.

[0013] The apparatus caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be caused to determine: a first, higher, bitrate for the use of spatial metadata determination and audio signal rendering; and a second, lower, bitrate for the use of spatial metadata determination.

[0014] The apparatus caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be caused to determine: a first, higher, bitrate for the use of audio transport channel compared to a normal bitrate allocation mode; and a second, lower, bitrate for the use of audio transport channel compared to the normal bitrate allocation mode.

[0015] The apparatus, caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be caused to: determine a total bitrate for all of the at least two immersive streams; determine a first bitrate for each of the at least two immersive streams based on the total bitrate; and modify or adjust the determined first bitrate for each of the at least two immersive streams based on the use associated with each of the at least two microphone arrays.
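The three steps of paragraph [0015] (total bitrate, first per-stream bitrate, adjustment by use) might look like the sketch below. The equal initial split, the `metadata_only_scale` factor, the `redistribute` flag, and the string labels for the two uses are all assumptions; the application does not give these values here.

```python
def allocate_stream_bitrates(total_bps, uses, metadata_only_scale=0.5, redistribute=False):
    """Split a total bitrate across streams, then adjust each by its use.

    uses: list of "metadata" or "metadata+audio" (hypothetical labels).
    metadata_only_scale: assumed reduction factor for metadata-only streams.
    redistribute: if True, the saved bitrate is handed to audio-carrying streams.
    """
    if len(uses) < 2:
        raise ValueError("at least two immersive streams are required")
    # Step 2: a first bitrate per stream, here an equal share of the total.
    first = total_bps / len(uses)
    # Step 3: lower the bitrate of streams used for metadata determination only.
    adjusted = [first * metadata_only_scale if u == "metadata" else first for u in uses]
    if redistribute:
        audio_idx = [i for i, u in enumerate(uses) if u == "metadata+audio"]
        if audio_idx:
            saved = total_bps - sum(adjusted)
            for i in audio_idx:
                adjusted[i] += saved / len(audio_idx)
    return adjusted

stream_uses = ["metadata+audio", "metadata", "metadata"]
reduced = allocate_stream_bitrates(96_000, stream_uses)
same_total = allocate_stream_bitrates(96_000, stream_uses, redistribute=True)
```

With `redistribute=False` the total transmitted bitrate drops; with `redistribute=True` the total is kept and the audio-carrying stream receives the difference. These two modes loosely mirror the two outcomes named in the figure list (reduced total bitrate at equal quality, and equal total bitrate at increased quality).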

[0016] The apparatus caused to determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be caused to: determine a bitrate for a first of the at least two immersive streams; determine an audio bitrate based on the use for the first of the at least two immersive streams; determine a spatial metadata bitrate based on the use for the first of the at least two immersive streams, wherein the audio bitrate and spatial metadata bitrate combined are equal to or less than the bitrate for the first of the at least two immersive streams.

[0017] The apparatus caused to determine the audio bitrate based on the use for the first of the at least two immersive streams may be caused to determine: a first, higher, bitrate for the audio bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, lower or zero, bitrate for the audio bitrate when the use is only the spatial metadata determination.

[0018] The apparatus caused to determine the spatial metadata bitrate based on the use for the first of the at least two immersive streams may be caused to determine: a first bitrate for the spatial metadata bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, higher, bitrate for the spatial metadata bitrate when the use is only the spatial metadata determination.
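Paragraphs [0016] to [0018] describe splitting one stream's bitrate between audio transport and spatial metadata. A minimal sketch of such a split is given below; the percentage shares, the function name `split_stream_bitrate`, and the string labels for the two uses are illustrative assumptions rather than values from the application.

```python
def split_stream_bitrate(stream_bps, use, metadata_pct_rendering=20, metadata_pct_only=100):
    """Return (audio_bps, metadata_bps) for one immersive stream.

    use: "metadata" or "metadata+audio" (hypothetical labels).
    The percentage shares are assumptions chosen only for illustration.
    """
    if use == "metadata+audio":
        pct = metadata_pct_rendering
    elif use == "metadata":
        pct = metadata_pct_only
    else:
        raise ValueError(f"unknown use: {use!r}")
    # Integer arithmetic keeps the two parts within the stream budget.
    metadata_bps = stream_bps * pct // 100
    audio_bps = stream_bps - metadata_bps
    return audio_bps, metadata_bps

# Hypothetical budgets: 32 kbps for a rendered stream, 16 kbps for a metadata-only stream.
rendered = split_stream_bitrate(32_000, "metadata+audio")
metadata_only = split_stream_bitrate(16_000, "metadata")
```

In this example the metadata-only stream carries no audio and a higher metadata bitrate than the rendered stream, even though its overall budget is smaller, which is the relationship the paragraphs above describe.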

[0019] The apparatus caused to obtain the listener position may be caused to obtain the listener position from a further apparatus.

[0020] The apparatus may be further caused to transmit to at least one further apparatus at least one of: information associated with the use for at least two immersive streams from the at least two microphone arrays; and information associated with the control.

[0021] According to a second aspect there is provided an apparatus for processing encoded immersive streams, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; render at least one output audio signal based on the decoded at least two immersive streams and the listener position.
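One way to read "decode ... based on a determined use" in paragraph [0021] is that a receiver decodes only the parts of each stream that its use requires. The sketch below shows that reading; the `EncodedStream` container, its fields, and the `select_payloads` function are hypothetical and do not represent the IVAS bitstream format or any decoder API.

```python
from dataclasses import dataclass

@dataclass
class EncodedStream:
    stream_id: int
    use: str                  # hypothetical labels: "metadata" or "metadata+audio"
    metadata_payload: bytes
    audio_payload: bytes = b""

def select_payloads(streams):
    """Return, per stream, which payloads a decoder would need to decode."""
    if len(streams) < 2:
        raise ValueError("at least two encoded immersive streams are required")
    plan = {}
    for s in streams:
        parts = ["metadata"]                  # metadata is needed for every use
        if s.use == "metadata+audio" and s.audio_payload:
            parts.append("audio")             # audio only when it will be rendered
        plan[s.stream_id] = parts
    return plan

plan = select_payloads([
    EncodedStream(0, "metadata+audio", b"\x01", b"\x02\x03"),
    EncodedStream(1, "metadata", b"\x04"),
])
```

A renderer would then combine the decoded metadata of all streams with the decoded audio of the selected stream, according to the listener position.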

[0022] The apparatus further may be caused to obtain information associated with the determined use associated with each of the at least two encoded immersive streams.

[0023] The apparatus caused to obtain information associated with the determined use associated with each of the at least two encoded immersive streams may be caused to receive the information associated with the determined use associated with each of the at least two encoded immersive streams from a further apparatus.

[0024] The apparatus caused to obtain at least two encoded immersive audio coded streams may be caused to receive the at least two encoded immersive streams from a further apparatus.

[0025] There is provided according to a third aspect an apparatus for controlling a generation of encoded immersive streams, the apparatus comprising means configured to: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0026] The means may be further configured to: generate at least two encoded immersive streams based on the control; and transmit the at least two encoded immersive streams.

[0027] The means configured to generate the at least two encoded immersive streams based on the control may be configured to: create / generate at least two immersive streams; and encode the at least two immersive streams based on the control to generate the at least two encoded immersive streams.

[0028] The means configured to determine the use associated with each of the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position may be configured to determine the use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering.

[0029] The use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering may be one of: spatial metadata interpolation; and spatial metadata interpolation and audio signal rendering.

[0030] The means configured to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be configured to determine: a first, higher, bitrate for the use of spatial metadata determination and audio signal rendering; and a second, lower, bitrate for the use of spatial metadata determination.

[0031] The means configured to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be configured to determine: a first, higher, bitrate for the use of audio transport channel compared to a normal bitrate allocation mode; and a second, lower, bitrate for the use of audio transport channel compared to the normal bitrate allocation mode.

[0032] The means configured to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be configured to: determine a total bitrate for all of the at least two immersive streams; determine a first bitrate for each of the at least two immersive streams based on the total bitrate; and modify or adjust the determined first bitrate for each of the at least two immersive streams based on the use associated with each of the at least two microphone arrays.

[0033] The means configured to determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may be configured to: determine a bitrate for a first of the at least two immersive streams; determine an audio bitrate based on the use for the first of the at least two immersive streams; determine a spatial metadata bitrate based on the use for the first of the at least two immersive streams, wherein the audio bitrate and spatial metadata bitrate combined are equal to or less than the bitrate for the first of the at least two immersive streams.

[0034] The means configured to determine the audio bitrate based on the use for the first of the at least two immersive streams may be configured to determine: a first, higher, bitrate for the audio bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, lower or zero, bitrate for the audio bitrate when the use is only the spatial metadata determination.

[0035] The means configured to determine the spatial metadata bitrate based on the use for the first of the at least two immersive streams may be configured to determine: a first bitrate for the spatial metadata bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, higher, bitrate for the spatial metadata bitrate when the use is only the spatial metadata determination.

[0036] The means configured to obtain the listener position may be configured to obtain the listener position from a further apparatus.

[0037] The means may be further configured to transmit to at least one further apparatus at least one of: information associated with the use for at least two immersive streams from the at least two microphone arrays; and information associated with the control.

[0038] According to a fourth aspect there is provided an apparatus for processing encoded immersive streams, the apparatus comprising means configured to: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; render at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0039] The means further may be configured to obtain information associated with the determined use associated with each of the at least two encoded immersive streams.

[0040] The means configured to obtain information associated with the determined use associated with each of the at least two encoded immersive streams may be configured to receive the information associated with the determined use associated with each of the at least two encoded immersive streams from a further apparatus.

[0041] The means configured to obtain at least two encoded immersive audio coded streams may be configured to receive the at least two encoded immersive streams from a further apparatus.

[0042] According to a fifth aspect, there is provided a method for an apparatus for controlling a generation of encoded immersive streams, the method comprising: obtaining a listener position; obtaining at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determining a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determining a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and controlling an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0043] The method may further comprise: generating at least two encoded immersive streams based on the control; and transmitting the at least two encoded immersive streams.

[0044] Generating the at least two encoded immersive streams based on the control may comprise: creating / generating at least two immersive streams; and encoding the at least two immersive streams based on the control to generate the at least two encoded immersive streams.

[0045] Determining the use associated with each of the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position may comprise determining the use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering.

[0046] The use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering may be one of: spatial metadata interpolation; and spatial metadata interpolation and audio signal rendering.

[0047] Determining the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may comprise determining: a first, higher, bitrate for the use of spatial metadata determination and audio signal rendering; and a second, lower, bitrate for the use of spatial metadata determination.

[0048] Determining the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may comprise determining: a first, higher, bitrate for the use of audio transport channel compared to a normal bitrate allocation mode; and a second, lower, bitrate for the use of audio transport channel compared to the normal bitrate allocation mode.

[0049] Determining the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may comprise: determining a total bitrate for all of the at least two immersive streams; determining a first bitrate for each of the at least two immersive streams based on the total bitrate; and modifying or adjusting the determined first bitrate for each of the at least two immersive streams based on the use associated with each of the at least two microphone arrays.

[0050] Determining a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use may comprise: determining a bitrate for a first of the at least two immersive streams; determining an audio bitrate based on the use for the first of the at least two immersive streams; determining a spatial metadata bitrate based on the use for the first of the at least two immersive streams, wherein the audio bitrate and spatial metadata bitrate combined are equal to or less than the bitrate for the first of the at least two immersive streams.

[0051] Determining the audio bitrate based on the use for the first of the at least two immersive streams may comprise determining: a first, higher, bitrate for the audio bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, lower or zero, bitrate for the audio bitrate when the use is only the spatial metadata determination.

[0052] Determining the spatial metadata bitrate based on the use for the first of the at least two immersive streams may comprise determining: a first bitrate for the spatial metadata bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, higher, bitrate for the spatial metadata bitrate when the use is only the spatial metadata determination.

[0053] Obtaining the listener position may comprise obtaining the listener position from a further apparatus.

[0054] The method may further comprise transmitting to at least one further apparatus at least one of: information associated with the use for at least two immersive streams from the at least two microphone arrays; and information associated with the control.

[0055] According to a sixth aspect there is provided a method for an apparatus for processing encoded immersive streams, the method comprising: obtaining at least two encoded immersive streams; obtaining a listener position; decoding the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; rendering at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0056] The method may further comprise obtaining information associated with the determined use associated with each of the at least two encoded immersive streams.

[0057] Obtaining information associated with the determined use associated with each of the at least two encoded immersive streams may comprise receiving the information associated with the determined use associated with each of the at least two encoded immersive streams from a further apparatus.

[0058] Obtaining at least two encoded immersive audio coded streams may comprise receiving the at least two encoded immersive streams from a further apparatus.

[0059] According to a seventh aspect, there is provided a computer readable medium comprising instructions which, when executed by an apparatus for controlling a generation of encoded immersive streams, cause the apparatus to perform at least the following: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0060] According to an eighth aspect, there is provided a computer readable medium comprising instructions which, when executed by an apparatus for processing encoded immersive streams, cause the apparatus to perform at least the following: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; render at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0061] According to a ninth aspect, there is an apparatus for controlling a generation of encoded immersive streams, the apparatus comprising: obtaining circuitry configured to obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determining circuitry configured to determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determining circuitry configured to determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and controlling circuitry configured to control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0062] According to a tenth aspect, there is an apparatus for processing encoded immersive streams, the apparatus comprising: obtaining circuitry configured to obtain a listener position; decoding circuitry configured to decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; rendering circuitry configured to render at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0063] According to an eleventh aspect, there is an apparatus for controlling a generation of encoded immersive streams the apparatus comprising: means for obtaining a listener position; means for obtaining at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; means for determining a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; means for determining a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and means for controlling an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0064] According to a twelfth aspect, there is an apparatus for processing encoded immersive streams, the apparatus comprising: means for obtaining at least two encoded immersive streams; means for obtaining a listener position; means for decoding the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; means for rendering at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0065] According to a thirteenth aspect, there is provided a non-transitory computer readable medium comprising program instructions that, when executed by an apparatus, cause the apparatus to perform at least the method according to any of the preceding aspects.

[0066] According to a fourteenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for defining a file format carriage for controlling a generation of encoded immersive streams, the apparatus caused to perform at least the following: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0067] According to a fifteenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for defining a file format carriage for processing encoded immersive streams, the apparatus caused to perform at least the following: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; render at least one output audio signal based on the decoded at least two immersive streams and the listener position.

[0068] In the above, many different embodiments have been described. It should be appreciated that further embodiments may be provided by the combination of any two or more of the embodiments described above.

DESCRIPTION OF FIGURES

[0069] Embodiments will now be described, by way of example only, with reference to the accompanying Figures in which:

[0070] Fig.1 shows a representation of an inter-network communications system according to some example embodiments;

[0071] Fig.2 shows a schematic representation of an example IVAS RTP packet structure;

[0072] Fig.3 shows a schematic representation of an example system implementing some embodiments;

[0073] Fig.4 shows a flow diagram representation of the example system shown in Fig.3 according to some embodiments;

[0074] Fig.5 shows a schematic representation of an example front end processor as shown in Fig.3 in further detail according to some embodiments;

[0075] Fig.6 shows a flow diagram representation of the example front end processor shown in Fig.5 according to some embodiments;

[0076] Fig.7 shows a schematic representation of an example encoder as shown in Fig.3 in further detail according to some embodiments;

[0077] Fig.8 shows a flow diagram representation of the example encoder shown in Fig.7 according to some embodiments;

[0078] Fig.9 shows a schematic representation of an example decoder as shown in Fig.3 in further detail according to some embodiments;

[0079] Fig.10 shows a flow diagram representation of the example decoder shown in Fig.9 according to some embodiments;

[0080] Fig.11 shows a schematic representation of an example 6DoF renderer as shown in Fig.3 in further detail according to some embodiments;

[0081] Fig.12 shows a flow diagram representation of the example 6DoF renderer shown in Fig.11 according to some embodiments;

[0082] Fig.13 shows an example capture and rendering scenario according to some embodiments;

[0083] Fig.14 shows a flow diagram for reduced total bitrate and equal rendering quality compared to non-modified bitstreams according to some embodiments;

[0084] Fig.15 shows a flow diagram for equal total bitrate and increased rendering quality compared to non-modified bitstreams according to some embodiments;

[0085] Fig.16 shows a flow diagram showing a summary of the operations according to some embodiments; and

[0086] Fig.17 shows an example device suitable for implementing the apparatus described herein.

DETAILED DESCRIPTION

[0087] The following relates to apparatus, methods and computer programs for 6DoF rendering with IVAS (or a similar codec). A differentiated bitrate budget allocation is proposed for the spatial metadata and the audio transport channels in a subset of the IVAS streams, which provides a higher bitrate budget for the spatial metadata than the budget allocated to spatial metadata for a given IVAS stream bitrate in the case of 3DoF rendering.

[0088] Fig.1 shows an example teleconferencing system within which some embodiments can be implemented. In this example two sites or rooms are shown, Room A 100 and Room B 102. Room A 100 comprises a ‘talker’ or user, Talker 1 103. Room B 102 comprises one ‘talker’ or user, Talker RX 141.

[0089] In the following example within room A is a suitable teleconference apparatus (or more generally telecommunications apparatus 110) configured to spatially capture and encode the audio environment and furthermore is configured to render a spatial audio signal to the room. The apparatus can in some embodiments be implemented by a user equipment (UE) operating within a cellular communications system or accessing any suitable access network. Within each of the other rooms may be a suitable teleconference apparatus (or more generally telecommunications apparatus such as apparatus 120 within room B) configured to render a spatial audio signal to the room and furthermore is configured to capture and encode at least a mono audio and optionally configured to spatially capture and encode the audio environment.

[0090] In the following examples each room is provided with the means to spatially capture, encode spatial audio signals, receive spatial audio signals and render these to a suitable listener. It would be understood that there may be other embodiments where the system comprises some apparatus configured to capture and encode audio signals (in other words the apparatus is a ‘transmit’ apparatus), and other apparatus configured to receive and render audio signals (in other words the apparatus is a ‘receive’ only apparatus). In such embodiments the system within which embodiments may be implemented may comprise apparatus with varying abilities to capture / render audio signals.

[0091] The teleconference apparatus (for each site or room) 110, 120 can be configured to call into a teleconference controlled by and implemented over a server 111.

[0092] In some embodiments the communications or teleconferencing system comprises a (peer-to-peer) communications system (rather than the server based system shown in Fig.1) within which some embodiments can be implemented. Thus, for example, two or more UEs can be configured to interact directly with each other (for example to implement an immersive audio phone call between users). In such a scenario one of the UEs can be configured to deliver spatial ambience as one stream and employ a close-up microphone (for example a Lavalier microphone) to capture the speech as an audio object or audio source. The sender UE can be configured to encode the spatial ambience audio signals in a MASA format stream and the close-up microphone audio signal as an object format stream. The two audio streams can then be delivered as separated IVAS streams. The sender UE, in addition, can be configured to encode processing information during the encoding to deliver the PI frames together with the IVAS frames to the receiver UE.

[0093] The teleconference apparatus can be configured to spatially capture and encode the audio environment and furthermore can be configured to render a spatial audio signal to the room. In this example only the communications or signalling path from the Room A 100 to the Room B 102 is shown for simplicity but a duplex or multipoint communication system comprising multiple signalling paths can be implemented using the methods as described herein without significant inventive input.

[0094] The teleconference apparatus (for each site or room) 110, 120 is further configured to communicate with each other to implement a teleconference function.

[0095] As shown in Fig.1, the apparatus 110, 120 and server 111 can comprise suitable encoder and decoder functionality. For example, the apparatus 110 is shown comprising an (IVAS) encoder 101, the server 111 is shown comprising a (IVAS) decoder and encoder 121 and the apparatus 120 is shown comprising an (IVAS) decoder 131. In such a manner audio signals representing the user or talker 1 103 can be captured by microphone arrays 115a, 115b which are passed to the encoder 101 to be encoded which generates a bitstream 106 to be passed to a server 111. The encoder 101 can be configured to encode each bitstream individually for each microphone array 115a and 115b.

[0096] The server 111 can then decode the audio signals (optionally mixing them with other objects and otherwise processing them) and encode them to generate the bitstream 108 to be passed to the apparatus 120. The apparatus 120 can then decode the audio signals and present them to the user or talker ‘Talker RX’ 141.

[0097] Although this example shows a teleconference application the encoder / decoder functionality can be applied to the streaming of any suitable media.

[0098] The IVAS decoder / renderer for each of the teleconference apparatus 102 can be furthermore configured to handle multiple input streams that may each originate from a different encoder.

[0099] The IVAS codec algorithm which is employed in the above example is described in 3GPP TS 26.253 (Codec for Immersive Voice and Audio Services; Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions), currently at v2.0.0 (SP-240030).

[0100] Furthermore the IVAS codec floating-point C code is provided in 3GPP TS 26.258.

[0101] The IVAS codec bitstream contains information on the contents of each bitstream frame. This information includes the aforementioned input format and additional information on the present format, i.e., sub-format signaling (e.g., order of Ambisonics, channel layout of MC, number of transport channels, etc.). This is in addition to the encoded audio signal and possible metadata. Furthermore, IVAS bitstream frames are designed to be standalone in such a way that an IVAS decoder can start decoding from any valid IVAS frame (containing full data for the format specified in the bitstream) and produce good quality output.

[0102] As discussed previously RTP is intended for an end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP is furthermore designed to carry a multitude of multimedia formats, which permit the transport of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may therefore require a profile and payload format specifications.

[0103] The profile is configured to define the codec used to encode the payload data and the mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header.

[0104] For example, the RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using the Session Description Protocol (SDP). The latter mechanism is used for newer video codecs, for example by the RTP payload format for H.264 Video defined in RFC 6184 or the RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798.

[0105] The IVAS RTP payload format is currently being enhanced as part of IVAS_Codec_Ph2 work item in 3GPP SA4, and the latest state is described in TS 26.253 Annex A (v2.0.0, SP-240030). Recent changes are also described in a CR document S4-241325.

[0106] An RTP session can be established for each multimedia stream. Audio and video streams may be implemented which use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification can furthermore be configured to recommend port numbers for RTP, and furthermore to recommend the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.

[0107] Each RTP stream can comprise RTP packets, and each RTP packet in turn can comprise an RTP header and payload pair.

[0108] Fig.2 shows an example IVAS RTP packet structure (following S4-241325) 200. The RTP Header (and possible RTP Header Extension) 201 follows a known RTP design. The structure 200 further comprises an IVAS payload 211. The IVAS payload 211 comprises a payload header 203 section, an (IVAS) frame data 205 section and an optional PI (processing information) data 207 section.

[0109] The payload header 203 section can comprise different types of header bytes, for example: ToC (Table of Content) and E-bytes (Extra bytes, or further bytes which can be used to define or signal aspects of the payload).

[0110] The ToC bytes can be employed to describe the content of the frame data section (by indicating the size / bitrate for the data frames). The ToC bytes also can differentiate IVAS frames from EVS frames, in situations where the IVAS is operating in mono EVS mode.

[0111] The E-bytes can signal additional information, such as Codec Mode Requests (CMR) which indicate a request to change the bitrate (and possibly other configurations like format and bandwidth) of an incoming IVAS stream. E-bytes may also be used to explicitly indicate the presence of the PI data section at the end of the payload.

[0112] The frame data 205 section can include the IVAS data frames (and EVS data frames in a case of employing IVAS in mono mode). The data frames represent the encoded IVAS bitstreams. The bitstream includes the encoded IVAS audio data with possible additional metadata. The bitstream can also include initialization data or information required to initialize an IVAS decoder, for example information about the input / coded format and sub-format of the encoded data (e.g., multichannel 5.1 format). Any EVS frames do not include such format data.

[0113] The PI data 207 section can include (PI) processing information data and related headers. PI data can be used to transmit any (non-audio) data that can be used to assist the rendering or processing of the audio data, such as, for example, scene and device orientation data. The PI data can also be used to request something from the other session participant, such as, for example to mute an incoming stream or increase noise suppression. Feedback data can also be transmitted, for example, the head orientation of a listener.

[0114] Currently, the IVAS specification as defined in 3GPP SA4 does not enable six degrees of freedom rendering with MASA format as the input format (nor with any other input format).

[0115] 6DoF rendering with the help of spatial metadata and audio transport channels in IVAS streams can be performed with principles that are analogous to those utilized in 6DoF rendering with multiple HOA sources in MPEG-I immersive audio standard (ISO / IEC 23090-4).

[0116] The IVAS streams with the metadata-assisted spatial audio (MASA) format are currently optimized for 3DoF rendering. Consequently, the bitrate budget for MASA metadata and audio transport channels is optimized for solitary rendering of an IVAS stream with three degrees of rotational freedom (i.e., the rotation of a listener’s head or without any head tracking).

[0117] However, such a default IVAS bitrate allocation is not optimal for performing 6DoF rendering with two or more IVAS streams corresponding to the different microphone array positions. This is because the spatial metadata has a greater role in 6DoF rendering compared to 3DoF rendering, for some of the IVAS streams, and the audio transport channels have a less significant role or use. This change in role or use depends on the listener position with respect to the microphone array positions.

[0118] Consequently, the transmission of all the IVAS streams with default bitrate allocation for the audio transport channels and the spatial metadata is suboptimal, leading to suboptimal audio quality. Furthermore, in some scenarios, this can lead to inefficient use of the total transmission bitrate budget for the two or more IVAS streams combined (for rendering with 6DoF).

[0119] For example, this can result in a wasteful use of bitrate and a lower audio quality if the IVAS bitrate budget allocation within an IVAS stream is not modified to enable appropriate allocation of the bitrate for spatial metadata (e.g., MASA) and audio transport channels, in the case of 6DoF rendering.

[0120] In summary, conventional or known bitrate allocation in the encoding process can produce reduced subjective audio quality and / or inefficient transmission bitrate distribution especially in low bitrate environments.

[0121] As described above, the concept discussed in the following embodiments relates to rendering of immersive audio signals with IVAS (or a similar codec) for 6DoF, where apparatus and methods are proposed for differentiated bitrate budget allocation based on a determined use for the input streams. For example, a differentiated bitrate budget allocation for spatial metadata and audio transport channels in a subset of IVAS streams can provide a higher bitrate budget for spatial metadata compared to the bitrate budget allocation for spatial metadata for a given IVAS stream bitrate in the case of 3DoF rendering. This in some embodiments can achieve at least one of the following:
  • Reduced total transmission bitrate for 6DoF rendering; and
  • Higher subjective audio quality for 6DoF rendering with two or more IVAS streams.

[0122] In some embodiments this can be implemented as follows:
Encoder (total transmission bitrate reduction embodiment):
  • receive two or more microphone array positions used for performing 6DoF rendering with IVAS;
  • receive a listener position (with respect to the microphone array positions);
  • receive two or more IVAS streams corresponding to the microphone array positions;
  • determine the use or role of each IVAS stream corresponding to each microphone array position for 6DoF rendering based on the current listener position. The use or role of a stream can indicate higher importance for the spatial metadata than for audio (referred to as “Spatial metadata IVAS streams”), or equal importance for the spatial metadata and audio (referred to as “Audio signal IVAS streams”);
  • decrease the total bitrate of the “Spatial metadata IVAS streams”, while increasing the relative bitrate budget allocation for the spatial metadata encoding for these streams. As a result of this adjustment, the bitrate for the audio signal encoding is decreased, whereas the bitrate for the spatial metadata encoding can, e.g., be kept the same;
  • encode the “Spatial metadata IVAS streams” with the adjusted bitrate budgets (i.e., more relative spatial metadata bitrate budget compared to the relative spatial metadata bitrate budget used in IVAS for 3DoF rendering); and
  • encode the “Audio signal IVAS streams” with regular bitrate allocation (used for 3DoF rendering).
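
The encoder-side steps above can be sketched as follows. This is a minimal illustration rather than the described method itself: it assumes that the array nearest to the listener supplies the "Audio signal IVAS stream" and that all other arrays supply "Spatial metadata IVAS streams". The nearest-array rule, the function names and the example coordinates are assumptions made for this sketch; the bitrates are borrowed from the numerical example given later in the description.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]

AUDIO_SIGNAL = "audio_signal"          # regular (3DoF-style) allocation
SPATIAL_METADATA = "spatial_metadata"  # metadata-weighted allocation


def determine_roles(listener: Position, arrays: Sequence[Position]) -> List[str]:
    """Assumed rule: the nearest array carries audio, the others metadata."""
    distances = [math.dist(listener, a) for a in arrays]
    nearest = distances.index(min(distances))
    return [AUDIO_SIGNAL if i == nearest else SPATIAL_METADATA
            for i in range(len(arrays))]


def assign_bitrates(roles: Sequence[str], regular_kbps: float,
                    reduced_kbps: float) -> List[float]:
    """Keep the regular bitrate for audio streams, reduce the others."""
    return [regular_kbps if r == AUDIO_SIGNAL else reduced_kbps for r in roles]


# Hypothetical positions in metres (X, Y, Z).
arrays = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
listener = (0.5, 0.2, 0.0)
roles = determine_roles(listener, arrays)
bitrates = assign_bitrates(roles, regular_kbps=128.0, reduced_kbps=24.4)
print(roles, bitrates, round(sum(bitrates), 1))
```

Re-running `determine_roles` whenever the listener position changes lets the roles, and therefore the per-stream bitrates, follow the listener through the scene.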

[0123] With respect to the decoder:
  • receive two or more encoded IVAS streams based on a given listener position;
  • extract and decode spatial metadata and audio transport channel audio data from the two or more IVAS streams for 6DoF IVAS rendering; and
  • perform 6DoF rendering with the decoded spatial metadata and transport channel audio data from the two or more IVAS streams.

[0124] The reduction in transmission bitrate for the “Spatial metadata IVAS streams” can achieve a total transmission bitrate reduction. In such embodiments this maintains the 6DoF rendering subjective quality by improving spatial metadata quality while reducing total transmission bitrate.

[0125] In some other embodiments (higher subjective audio quality embodiment), the reduction in transmission bitrate for the “Spatial metadata IVAS streams” can be reallocated to the “Audio signal IVAS streams” to increase the overall subjective audio quality at the total transmission bitrate budget. In other words a transmission bitrate of the “Audio signal IVAS streams” is increased by the amount that was saved by the reduction of the transmission bitrate of the “Spatial metadata IVAS streams”. In such implementations the apparatus and methods aim to increase the 6DoF rendering subjective quality by improving spatial metadata quality as well as audio transport channel quality, while maintaining the same total transmission bitrate.
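
The reallocation in this embodiment amounts to simple bookkeeping. The sketch below uses hypothetical function and variable names, spreads the saved bitrate evenly over the audio streams as an assumption, and ignores the fact that a real codec typically supports only a discrete set of bitrates.

```python
from typing import List, Sequence


def reallocate(bitrates_kbps: Sequence[float], is_metadata_stream: Sequence[bool],
               reduced_kbps: float) -> List[float]:
    """Reduce metadata streams and give the saved bitrate to audio streams."""
    saved = sum(b - reduced_kbps
                for b, m in zip(bitrates_kbps, is_metadata_stream) if m)
    n_audio = sum(1 for m in is_metadata_stream if not m)
    if n_audio == 0:
        raise ValueError("at least one audio signal stream is required")
    bonus = saved / n_audio
    return [reduced_kbps if m else b + bonus
            for b, m in zip(bitrates_kbps, is_metadata_stream)]


before = [64.0, 64.0, 64.0]
after = reallocate(before, [False, True, True], reduced_kbps=24.4)
# The total budget is unchanged; only its distribution differs.
assert abs(sum(after) - sum(before)) < 1e-9
```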

[0126] In some embodiments, energy information is derived from the audio signals of the “Spatial metadata IVAS streams” (which have a reduced relative bitrate budget for the audio signal encoding). The energy information is used in the 6DoF rendering.

[0127] In some further embodiments, the decoder / renderer that is configured to receive the two or more IVAS streams, determines the spatial metadata bitrate budget for each IVAS stream depending on microphone array positions and the listener position, and signals the bitrate budget allocation to the media sender (encoders of the two or more IVAS streams).

[0128] In some embodiments, the media sender (or encoder or capture apparatus) is configured to determine a bitrate budget allocation for spatial metadata depending on the listener position information with respect to the microphone array positions.

[0129] Furthermore, in some embodiments, an entity on the media sender side determines the bitrate budget allocation for spatial metadata depending on the listener position information with respect to the microphone array positions.

[0130] The bitrate budget allocation for spatial metadata can, in some embodiments, be one or more of the following:
  • Normal IVAS frame (Type 0);
  • IVAS frame with more MASA metadata bitrate allocation (Type 1);
  • IVAS frame with only MASA metadata (Type 2).

[0131] The bitrate budget allocation difference is described here with three steps or levels (Type 0, Type 1 and Type 2), but in some embodiments it can be defined or further modified with more granular steps or levels. In other words, in some embodiments there can be more than three levels of bitrate budget allocation.

[0132] In such embodiments a feedback method or control method can comprise:
  • The receiver can control the delivery of Type 0 or Type 1 streams as feedback having stream ID and type, based on listener position information with respect to the IVAS MASA capture positions (i.e., the microphone array positions); and
  • The sender can control the delivery of Type 0 or Type 1 streams based on listener position feedback to the IVAS stream senders.

[0133] In some embodiments when implemented the following impacts can be observed:

[0134] Bitrate saving scenario:

[0135] Without the embodiments: Total transmission bitrate for 3 x IVAS streams at 128 kbps per IVAS stream = 384 kbps.

[0136] With the embodiments: The same 6DoF spatial audio quality can be achieved at (128 + 24.4*2) = 176.8 kbps, i.e., a saving of 207.2 kbps or a 54% bitrate reduction.
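
The arithmetic of the bitrate saving scenario can be reproduced with a few lines of Python. The variable names are chosen for this illustration only; the figures are those stated above.

```python
regular_kbps = 128.0   # per-stream bitrate without the embodiments
reduced_kbps = 24.4    # bitrate of each of the two reduced streams
n_streams = 3

baseline = n_streams * regular_kbps
modified = regular_kbps + (n_streams - 1) * reduced_kbps
saving = baseline - modified
print(f"{baseline:.1f} -> {modified:.1f} kbps, saving {saving:.1f} kbps "
      f"({100 * saving / baseline:.0f}%)")
# prints: 384.0 -> 176.8 kbps, saving 207.2 kbps (54%)
```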

[0137] Quality improvement example:

[0138] Assume a total bitrate of 192 kbps for all three IVAS streams (3 x IVAS streams at 64 kbps per IVAS stream).

[0139] Without the embodiments: 3 x 64 kbps IVAS streams can be used. This means 9 kbps for the spatial metadata, and 55 kbps for the audio for each stream.

[0140] With the embodiments: One 128 kbps and two 24.4 kbps IVAS streams can be used. This means 17.5 kbps for the spatial metadata, and 110.5 kbps for the audio (of the microphone array that is used for rendering).

[0141] Thus, by implementing the embodiments described herein in further detail, significantly higher bitrates can be obtained for both the (meaningful) audio and the spatial metadata without increasing the total bitrate, which results in clear quality improvements.
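
The figures of the quality improvement example can be checked in the same way. The sketch below only restates the numbers given above; the dictionary keys and variable names are hypothetical, and the gains refer to the stream of the microphone array that is used for rendering.

```python
# Figures taken from the example above (all in kbps).
total_budget = 192.0
without = {"stream": 64.0, "metadata": 9.0, "audio": 55.0}
with_emb = {"stream": 128.0, "metadata": 17.5, "audio": 110.5}
reduced_stream = 24.4

# Each metadata/audio split adds up to its stream bitrate.
assert without["metadata"] + without["audio"] == without["stream"]
assert with_emb["metadata"] + with_emb["audio"] == with_emb["stream"]

total_with = with_emb["stream"] + 2 * reduced_stream
audio_gain = with_emb["audio"] / without["audio"]
metadata_gain = with_emb["metadata"] / without["metadata"]
print(f"total with embodiments: {total_with:.1f} of {total_budget:.1f} kbps")
print(f"audio x{audio_gain:.2f}, metadata x{metadata_gain:.2f}")
```

With these numbers the modified configuration stays within the 192 kbps budget while roughly doubling both the audio and the spatial metadata bitrate of the rendered stream.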

[0142] Fig.3 shows schematically an example system suitable for implementing some embodiments. The system comprises two sides: a capture side 390 and the renderer side 392. The sound scene is captured in the capture side 390, and it is transmitted to the renderer side 392, where the sound scene is rendered to a listener in 6 degrees of freedom (6DoF) (i.e., the listener can rotate their head as well as move in the sound scene).

[0143] With respect to the capture side 390, the input to the system is N sets of microphone array signals, e.g., from microphone arrays integrated on mobile phones (where N is two or more).

[0144] In Fig.3, there are shown three sets of microphone array signals, microphone array signals 1 304, microphone array signals 2 306, and microphone array signals N 308. In addition, the microphone array positions 300 and transmission bitrate 302 are obtained as inputs in the capture side 390.

[0145] In some embodiments the microphone array positions 300 contain the locations of the microphone arrays (e.g., as X, Y, Z coordinates or any other suitable coordinate system) and their orientations (e.g., as yaw, pitch, roll angles).

[0146] The transmission bitrate 302 in some embodiments is the total bitrate used for transmitting the 6DoF sound scene from the capture side 390 to the renderer side 392.

[0147] For simplicity, in this example, the transmission bitrate 302 does not include the overhead bits from the transmission protocols, such as IP or RTP header, for example. The transmission bitrate 302, in this example only includes the bitrate used for the IVAS frames.

[0148] In the renderer side 392, the input to the system is the listener position 340, which contains the location of the listener (e.g., as X, Y, Z coordinates or other suitable coordinate system) and their orientation (e.g., as yaw, pitch, roll angles).

[0149] In some embodiments the microphone array signals 304, 306, 308 are forwarded to respective front-end processors 305, 307, 309.

[0150] In some embodiments the front-end processors 305, 307, 309 are configured to determine a MASA stream 314, 316, 318 using the microphone array signals 304, 306, 308.

[0151] In some embodiments the MASA stream 314, 316, 318 comprises MASA transport audio signals and MASA spatial metadata.

[0152] Each of the N sets of microphone array signals 304, 306, 308 can be processed with its own front-end processor 305, 307, 309, and as a result N MASA streams 314, 316, 318 are obtained.

[0153] The microphone array positions 300, the transmission bitrate 302, and the listener position 340 can be forwarded to a controller 301. The controller 301 in some embodiments is configured to determine control information 312 for controlling the encoding of the MASA streams 314, 316, 318. For example the control information 312 can contain a stream bitrate and a bitrate allocation mode, for each stream.

[0154] The stream bitrate in some embodiments comprises information which defines a bitrate to be used for coding a certain MASA stream.

[0155] The bitrate allocation mode in some embodiments comprises information which defines the use or mode in which the bitrate is allocated between the spatial metadata and the audio signal encoding. As discussed later, there can, for example, be defined the following modes:
  • Normal coding frame, i.e., default allocation of bitrate between spatial metadata and audio signal encoding (Type 0);
  • Coding frame with more spatial metadata bitrate allocation (Type 1); and
  • Coding frame with only spatial metadata encoded (Type 2).
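
One possible in-memory representation of the control information 312 is sketched below, assuming one record per stream that holds the stream bitrate and the bitrate allocation mode. The class, field and function names are hypothetical, and the budget check is an illustrative addition rather than a described feature.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class AllocationMode(IntEnum):
    NORMAL = 0          # Type 0: default metadata/audio split
    MORE_METADATA = 1   # Type 1: larger share for spatial metadata
    METADATA_ONLY = 2   # Type 2: only spatial metadata is encoded


@dataclass(frozen=True)
class StreamControl:
    stream_id: int
    bitrate_kbps: float
    mode: AllocationMode


def within_budget(controls: List[StreamControl], transmission_kbps: float) -> bool:
    """Check that the per-stream bitrates fit the total transmission bitrate."""
    return sum(c.bitrate_kbps for c in controls) <= transmission_kbps + 1e-9


control_info = [
    StreamControl(1, 128.0, AllocationMode.NORMAL),
    StreamControl(2, 24.4, AllocationMode.MORE_METADATA),
    StreamControl(3, 24.4, AllocationMode.MORE_METADATA),
]
```

Using an integer-valued enumeration keeps the mode compact if it has to be signalled between the receiver and the sender.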

[0156] The MASA streams, for example MASA stream 1 314, MASA stream 2 316, and MASA stream N 318, are forwarded to respective encoders 315, 317, 319. The encoders 315, 317, 319 are also configured to receive control information 312.

[0157] Then the encoders 315, 317, 319, based on the control information 312 (e.g., the stream bitrate and the bitrate allocation mode), are configured to encode the MASA streams 314, 316, 318 and form respective bitstreams, bitstream 1 324, bitstream 2 326, and bitstream N 328, that are then transmitted from the capture side 390 to the renderer side 392.

[0158] A separate bitstream (bitstream 1 324, bitstream 2 326, and bitstream N 328) can be generated for each MASA stream (MASA stream 1 314, MASA stream 2 316, and MASA stream N 318); in other words there are N bitstreams.

[0159] With respect to the renderer side 392, the bitstreams, bitstream 1 324, bitstream 2 326, and bitstream N 328, are received and passed to respective decoders 325, 327, 329. The decoders 325, 327, 329 are then configured to decode the bitstreams 324, 326, 328 and produce decoded MASA streams, decoded MASA stream 1 334, decoded MASA stream 2 336, and decoded MASA stream N 338 as an output.

[0160] The decoded MASA streams 334, 336, 338 can then be forwarded to the 6DoF renderer 331 or more generally renderer, which is configured to also receive the listener position 340 (containing the location and the orientation of the listener). In some embodiments the 6DoF renderer 331 is further configured to receive the microphone array positions 300.

[0161] The 6DoF renderer 331 is configured to generate a spatial audio output 350 (e.g., binaural audio signals, multichannel loudspeaker audio signals, etc.).

[0162] The rendering is based on 6DoF listener tracking, so the listener (head) movement and rotation is taken into account. As a result, the listener can feel as if they are actually in the space where the sound scene was captured and move around the sound scene.

[0163] Although this example and the following describes the receiving of microphone array audio signals, and then the generation of suitable MASA streams, it would be understood that in some embodiments the MASA streams are generated or synthesized and obtained by any suitable means, for example by a suitable computer generated audio source based audio signal and spatial metadata.

[0164] With respect to Fig.4 an example flow diagram of the operations of the example system shown in Fig.3 with respect to the capture side 390 and the renderer side 392 is described.

[0165] For example, with respect to the capture side 390, as shown in Fig.4 by 401 is the operation of receiving or otherwise obtaining the microphone array audio signals, microphone array positions, transmission rate, listener position.

[0166] Then as shown in Fig.4 by 403 is the operation of determining control information based on the microphone array positions, transmission rate, and listener position.

[0167] Following this is the operation as shown in Fig.4 by 405 of performing front-end processing of the microphone array audio to generate MASA spatial audio signals.

[0168] Then is the operation, as shown in Fig.4 by 407, of encoding the MASA spatial audio signals based on the control information (for example based on the mode or determined use) to generate encoded immersive streams.

[0169] The encoded immersive streams (and microphone array position information) can then be output as shown in Fig.4 by 409.

[0170] With respect to the renderer side 392, as shown in Fig.4 by 411 is the operation of receiving the encoded immersive streams, microphone array positions, and listener position.

[0171] Following this is the operation as shown in Fig.4 by 413 of decoding the encoded immersive streams.

[0172] Then is the operation of rendering a spatial audio signal based on the decoded streams, listener position and in some embodiments the microphone array positions as shown in Fig.4 by 415.

[0173] With respect to Fig.5 is shown an example front-end processor in further detail. In this example the front-end processor shown is the front-end processor associated with the microphone array signal 1 304, but would be understood to cover any and all of the front-end processors. The front-end processor thus is shown obtaining or receiving as an input, the microphone array signal 1 304 and produces the MASA stream 1 314 as an output.

[0174] In some embodiments the microphone array signal 1 304 is forwarded to a transport signal generator 501. The transport signal generator 501 is configured to generate MASA transport audio signals 502, which typically are stereo signals but can be any suitable transport audio signal format. In some embodiments if the microphone array signals originate from a mobile phone in a landscape orientation, the generation of the MASA transport audio signals 502 may simply be determined by selecting the two channels that correspond to the microphones at the left and right edges of the device. In some configurations, the processing may involve beamforming to left and right directions. For example, if the microphone array signals are based on a set of cardioid microphones closely placed, or if the microphone array signals are a first-order Ambisonic signal, the transport signal generator 501 can be configured to generate cardioid-shaped beams towards left and right directions to generate the MASA transport audio signals.
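
Both options for generating the transport audio signals can be sketched briefly. The function names are hypothetical, and the cardioid example assumes a first-order Ambisonic input in which W is an omnidirectional signal with unit gain and Y is a dipole pointing to the left (an SN3D-style normalization); other conventions would need different weights.

```python
from typing import List, Sequence, Tuple


def select_edge_channels(mics: Sequence[Sequence[float]],
                         left_idx: int, right_idx: int
                         ) -> Tuple[List[float], List[float]]:
    """Pick the microphones at the left and right edges as stereo transport."""
    return list(mics[left_idx]), list(mics[right_idx])


def foa_cardioids(w: Sequence[float], y: Sequence[float]
                  ) -> Tuple[List[float], List[float]]:
    """Cardioid beams towards left (+Y) and right (-Y) from FOA W and Y.

    Assumes W is an omnidirectional signal with unit gain and Y is the
    left-pointing dipole (SN3D-style normalization).
    """
    left = [0.5 * (wi + yi) for wi, yi in zip(w, y)]
    right = [0.5 * (wi - yi) for wi, yi in zip(w, y)]
    return left, right


# A source located hard left gives W = s and Y = s under these assumptions,
# so the left beam passes the source and the right beam cancels it.
s = [1.0, -0.5, 0.25]
left, right = foa_cardioids(w=s, y=s)
```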

[0175] The microphone array signals 304 in some embodiments are also forwarded to the metadata determiner 503. The metadata determiner 503 is configured to generate the MASA spatial metadata 504. The determining of the MASA spatial metadata 504 can be implemented according to any suitable method, with the most suitable method selected based on the microphone arrangement or configuration. For example, if the device is a smart phone with two or more microphones, the method described in UK patent application GB1619573.7 can be employed, which uses a delay analysis between the microphones to determine a direction parameter in frequency bands, and correlation analysis to determine the direct-to-total energy ratio for that direction. In some embodiments if the microphone signals are a first-order Ambisonic (FOA) signal, or if they can be converted to a FOA signal, the metadata determiner 503 can use methods that are based on the Directional Audio Coding (DirAC) to determine the direction and ratio parameters. The analyzed metadata directions and ratios form the MASA spatial metadata 504. Other MASA parameters can be set to zero or to other suitable values.
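
A DirAC-style analysis of a first-order Ambisonic signal can be sketched as below. This is a broadband, time-domain simplification written for illustration: a practical analysis works in frequency bands, and the sketch assumes SN3D-normalized W, X, Y, Z signals with X pointing front, Y left and Z up. The function name is hypothetical, and the sketch does not represent the method of GB1619573.7.

```python
import math
from typing import Sequence, Tuple


def dirac_analysis(w: Sequence[float], x: Sequence[float],
                   y: Sequence[float], z: Sequence[float]
                   ) -> Tuple[float, float, float]:
    """Return (azimuth_deg, elevation_deg, direct_to_total_ratio)."""
    n = len(w)
    # Time-averaged intensity-like vector pointing towards the source.
    ix = sum(wi * xi for wi, xi in zip(w, x)) / n
    iy = sum(wi * yi for wi, yi in zip(w, y)) / n
    iz = sum(wi * zi for wi, zi in zip(w, z)) / n
    # Time-averaged energy of the sound field.
    energy = sum(0.5 * (wi * wi + xi * xi + yi * yi + zi * zi)
                 for wi, xi, yi, zi in zip(w, x, y, z)) / n
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    ratio = min(1.0, norm / energy) if energy > 0.0 else 0.0
    return azimuth, elevation, ratio


# Single plane wave from 90 degrees azimuth (left), 0 degrees elevation.
s = [math.sin(0.1 * k) for k in range(200)]
zeros = [0.0] * len(s)
az, el, ratio = dirac_analysis(w=s, x=zeros, y=s, z=zeros)
```

For a single plane wave the estimated ratio approaches one, whereas uncorrelated signals in the four channels drive the averaged intensity, and hence the ratio, towards zero.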

[0176] The MASA transport audio signals 502 and the MASA spatial metadata 504 can therefore form together the MASA stream, which as shown in this example is the MASA stream 1 314, which is the output of the front-end processor 305.

[0177] With respect to Fig.6 an example flow diagram of the operations of the example front end processor shown in Fig.5 is described.

[0178] For example, as shown in Fig.6 by 601 is the operation of receiving or otherwise obtaining the microphone array audio signals.

[0179] Then as shown in Fig.6 by 603 is the operation of generating transport audio signals based on the microphone array audio signals.

[0180] Following this is the operation as shown in Fig.6 by 605 of generating spatial metadata based on the microphone array audio signals.

[0181] Then is shown the operation, in Fig.6 by 607 of outputting MASA spatial audio signals comprising the spatial metadata and transport audio signals.

[0182] The controller 301 is configured to allocate bitrate budgets for each IVAS stream depending on the use or role in the 6DoF rendering.

[0183] For example where the role for a stream is only spatial metadata interpolation, a higher relative bitrate budget is provided for spatial metadata compared to a regular IVAS stream at any transmission bitrate per stream. Furthermore, the spatial metadata bitrate budget should be greater than or equal to the bitrate budget which would have been available for a regular IVAS stream which is optimized for 3DoF rendering.

[0184] Furthermore, if the spatial metadata bitrate budget is retained while reducing the transmission bitrate of an IVAS stream used only for spatial metadata interpolation, this can maintain overall quality while reducing bitrate.

[0185] In some embodiments where the spatial metadata bitrate budget is increased while reducing the transmission bitrate of an IVAS stream used only for spatial metadata interpolation, this can increase overall quality while reducing bitrate.

[0186] In order to achieve a spatial metadata bitrate budget that is greater than or equal to the spatial metadata bitrate budget of regular IVAS streams, the spatial metadata bitrate budget share can be increased while reducing the share of the audio transport channel bitrate budget in the modified IVAS stream compared to the regular IVAS stream.

[0187] In some embodiments the controller is configured to implement the following:
Receive a total transmission bitrate budget for 6DoF rendering with MASA format in IVAS. This can be specified as transmission bitrate budget for each IVAS stream. For a typical scenario of 6DoF rendering, three IVAS streams are utilized to enable six degrees of freedom. For example, the total transmission budget can be specified as 384 kbps with each IVAS stream at 128 kbps.
Receive location information for each of the microphone arrays capturing the audio scene.
Receive location information for a listener.
Determine which microphone arrays capturing the audio scene are used only for spatial metadata interpolation and which ones are used for spatial metadata interpolation and audio signal rendering.
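
The last of these controller steps can be sketched as below. The sketch assumes the simple rule used in the later examples, namely that the microphone array closest to the listener provides the rendered audio and all other arrays are used only for spatial metadata interpolation. The function name, role labels and coordinates are hypothetical.

```python
import math


def assign_stream_roles(array_positions, listener_position):
    """Label the closest array 'audio_and_metadata', the rest 'metadata_only'."""
    if not array_positions:
        raise ValueError("at least one microphone array position is required")
    distances = {
        name: math.dist(position, listener_position)
        for name, position in array_positions.items()
    }
    closest = min(distances, key=distances.get)
    return {
        name: "audio_and_metadata" if name == closest else "metadata_only"
        for name in array_positions
    }


arrays = {"m1": (0.0, 0.0), "m2": (4.0, 0.0), "m3": (2.0, 3.0)}
roles = assign_stream_roles(arrays, listener_position=(0.5, 0.5))
```

Here the listener is nearest to m1, so m1 keeps the audio role and m2 and m3 become spatial metadata streams.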

[0188] To reduce the total transmission bitrate for 6DoF rendering, a modified encoder can be utilized for the IVAS streams used only for spatial metadata interpolation. Thus, the controller 301 can be configured to determine modified bitrate budget allocations.

[0189] The modified encoder uses the same or increased spatial metadata bitrate budget compared to a regular IVAS stream for the specified IVAS stream bitrate. This spatial metadata bitrate budget is used as the base value to select the new transmission bitrate.

[0190] Subsequently, a bitrate budget for the audio transport channel is selected such that it is at a quality that it can at least be used to obtain energy values, which are used in the 6DoF rendering, as described further below.

[0191] Furthermore, the audio transport channels may be used for audio rendering for short time instances, e.g., if the listener position changes quickly. In this case, the audio rendering may be performed for some time using the spatial metadata IVAS streams, before the bitrate allocation modes have been updated to reflect the new listener position. Thus, a bitrate budget allowing this should be allocated.

[0192] The sum of the spatial metadata bitrate budget and minimum credible audio transport channel bitrate budget is used to select the modified IVAS stream transmission bitrate.
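
One possible way to turn this sum into a transmission bitrate is to pick the lowest supported IVAS bitrate that can carry it. The following is a minimal sketch under that assumption; the rounding-up rule, the function name and the listed bitrate set are assumptions for illustration and are not stated in the text above.

```python
# Assumed set of IVAS transmission bitrates in kbps.
IVAS_BITRATES_KBPS = (13.2, 16.4, 24.4, 32, 48, 64, 80, 96, 128, 160, 192, 256, 384, 512)


def select_modified_bitrate(metadata_kbps, min_audio_kbps):
    """Return the lowest supported bitrate that fits metadata plus minimum audio."""
    required = metadata_kbps + min_audio_kbps
    for bitrate in IVAS_BITRATES_KBPS:
        if bitrate >= required:
            return bitrate
    raise ValueError(f"no supported bitrate can carry {required} kbps")


# Metadata budget of a regular 64 kbps stream (9 kbps) plus a hypothetical
# 13.4 kbps audio floor needs 22.4 kbps, which fits in a 24.4 kbps stream.
modified = select_modified_bitrate(metadata_kbps=9.0, min_audio_kbps=13.4)
```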

[0193] The microphone arrays used for spatial metadata interpolation and audio rendering are encoded with the regular IVAS encoder.

[0194] In an embodiment where the goal is to improve 6DoF rendering subjective audio quality for a constant total transmission bitrate, the bitrate savings obtained from the reduction in the transmission bitrate of IVAS streams used only for spatial metadata interpolation are used to boost the transmission bitrate for the IVAS streams corresponding to the microphone arrays used for spatial metadata interpolation and audio rendering. This results in higher quality audio transport channels as well as higher quality spatial metadata interpolation, which boosts the 6DoF rendering subjective quality.

[0195] In an implementation embodiment, for every regular IVAS stream transmission bitrate a list of favorable modified IVAS stream allocation tables can be prepared and made available to the modified IVAS stream encoder.

[0196] The following table presents example MASA bitrate allocations in IVAS for the metadata and audio (transport channels) parts across IVAS bitrates.

[0197] The regular columns indicate the bitrates used in regular IVAS streams. The presented values may vary slightly in coded IVAS frames, for example the format bits (3 for MASA) are not taken into account in the below numbers. However, the following table provides a good example estimate regarding how the total bitrate is distributed for the metadata and audio in IVAS MASA frames.

[0198] The IVAS frames of Type 0 (regular IVAS streams) follow the bitrate distribution in “Regular” columns. The IVAS frames of Type 1 (modified IVAS streams) follow the bitrate distribution in “Modified” columns. Please note that these values are provided only as examples.
* The bitrate allocations for bitrates 384 and 512 kbps follow a different coding scheme for regular IVAS streams; the presented values are examples and in other embodiments can be different values.

[0199] The Modified columns in the table present example modified bitrate distributions between metadata and audio for a modified IVAS stream (i.e., for a Spatial metadata IVAS stream). For the lowest IVAS bitrate of 13.2 kbps, a bitrate of 7.2 kbps can be used for the audio part (7.2 kbps is the lowest bitrate used in EVS Primary coding). The rest of the bitrate (6 kbps) is left for the metadata.

[0200] For the highest bitrates, (almost) transparent metadata coding bitrates can be used. The metadata bitrate can be set to 200 kbps (or even higher) at the highest IVAS bitrates (384 and 512 kbps). The rest of the bitrate is left for the audio (184 and 312 kbps, respectively). The bitrate allocations between the lowest and highest bitrates can be distributed in increasing order, so that, for example, the metadata bitrate at each IVAS bitrate is greater than or equal to the metadata bitrate at the next lower IVAS bitrate.

[0201] Example:
Case 1: 3 x Type 0 streams at 64 kbps = 192 kbps total bitrate (9 kbps for metadata, 55 kbps for audio)
Case 2: 1 x Type 0 stream at 128 kbps + 2 x Type 1 streams at 32 kbps = 192 kbps (17.5 kbps for metadata for the Type 0 stream, 18.6 kbps for metadata for the Type 1 streams, 110.5 kbps for audio)
Total bitrate stays the same, but the meaningful metadata and audio bitrates are doubled

[0202] In an example scenario, three bitstreams are transmitted with a total bitrate of 192 kbps for all the streams.

[0203] In Case 1, all the bitstreams are transmitted as regular IVAS streams (Type 0) with equal bitrate distributions. This means that each stream is transmitted as 64 kbps streams with 9 kbps allocated to MASA metadata and 55 kbps allocated to the audio transports. The Case 1 example is presented in the following table as the case 1 column.

[0204] In Case 2, the bitrate budgets for the streams are re-distributed based on the role of the streams. Stream 1 is used for the main audio signal processing in the 6DoF rendering (e.g., the listener is located closest to the microphone array that is used to capture Stream 1). Streams 2 and 3 are mostly used for metadata interpolation in the rendering. In this case, the bitrate budget for audio in Streams 2 and 3 is not that important and the audio bitrate can be lowered. For Stream 1, the audio bitrate can be increased because that stream is the most important one audio-wise.

[0205] One example bitrate re-distribution could be to assign 128 kbps bitrate to Stream 1 and 32 kbps bitrates to Streams 2 and 3. Stream 1 is not modified further (i.e., Type 0 is kept), which gives 17.5 kbps bitrate for metadata and 110.5 kbps bitrate for audio.

[0206] The metadata and audio bitrates for Streams 2 and 3 are furthermore adjusted so that the metadata bitrate is increased, and the audio bitrate is reduced (i.e., the streams become Type 1 streams). These adjustments follow the “Modified” bitrate columns in Table 1. For 32 kbps modified bitrates, the metadata is assigned 18.6 kbps bitrate and the audio is assigned 13.4 kbps bitrate. These adjustments for the streams are presented in column Case 2 in the following table.

[0207] As presented above and in the following table, the bitrates for the meaningful metadata and audio transports for 6DoF rendering are (roughly) doubled, while the total bitrate is retained (192 kbps). Rendering the Case 2 bitstreams gives the listener a higher quality 6 DoF listening experience compared to rendering the Case 1 bitstreams.
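
The "roughly doubled" claim can be checked with a few lines of arithmetic on the per-stream values quoted above. The sketch below treats the audio of Stream 1 as the only rendered audio, which is an interpretation of "meaningful" rather than a definition from the text; the variable and function names are hypothetical.

```python
# (metadata_kbps, audio_kbps) per stream, taken from the example in the text.
case_1 = [(9.0, 55.0), (9.0, 55.0), (9.0, 55.0)]      # 3 x Type 0 at 64 kbps
case_2 = [(17.5, 110.5), (18.6, 13.4), (18.6, 13.4)]  # Type 0 at 128, 2 x Type 1 at 32


def summarize(streams):
    """Return total bitrate, total metadata bitrate and Stream 1 audio bitrate."""
    total = sum(meta + audio for meta, audio in streams)
    metadata = sum(meta for meta, _ in streams)
    main_audio = streams[0][1]  # only Stream 1 audio is rendered
    return total, metadata, main_audio


total_1, meta_1, audio_1 = summarize(case_1)
total_2, meta_2, audio_2 = summarize(case_2)
print(f"total: {total_1:.1f} -> {total_2:.1f} kbps")
print(f"metadata gain: {meta_2 / meta_1:.2f}x, audio gain: {audio_2 / audio_1:.2f}x")
```

Both cases sum to 192 kbps, while the metadata total grows from 27 to 54.7 kbps and the rendered audio from 55 to 110.5 kbps.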

[0208] Fig.7 shows schematically an example encoder. In this example the encoder is the encoder 315 associated with the first microphone array, however the following can be applied to any and all of the encoders for any of the streams.

[0209] The input to the encoder 315 is a MASA stream, the MASA stream 1 314. The MASA stream comprises the MASA transport audio signals 502 and the MASA spatial metadata 504. Furthermore the encoder 315 is configured to receive as an input the control information 312, which in some embodiments comprises the stream bitrate and the bitrate allocation mode (or the stream use information).

[0210] The control information 312 in some embodiments is forwarded to the bitrate allocator 701, which is configured to determine the audio bitrate 712 and the metadata bitrate 714 based on the control information. For example the bitrate allocator 701 is configured to employ a look up table similar to that described in the above tables to determine the audio and metadata bitrates based on the stream bitrate and the bitrate allocation mode.
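
A look-up based allocator of this kind could be sketched as follows. The table holds only the example allocations quoted in the preceding paragraphs, so it is deliberately incomplete; the mode labels, table layout and function name are hypothetical.

```python
# (stream bitrate in kbps, allocation mode) -> (metadata kbps, audio kbps).
# Only the entries quoted in the text are included.
ALLOCATION_TABLE = {
    (64, "regular"): (9.0, 55.0),
    (128, "regular"): (17.5, 110.5),
    (13.2, "modified"): (6.0, 7.2),
    (32, "modified"): (18.6, 13.4),
    (384, "modified"): (200.0, 184.0),
    (512, "modified"): (200.0, 312.0),
}


def allocate_bitrates(stream_kbps, mode):
    """Return (metadata_kbps, audio_kbps) for a stream bitrate and mode."""
    try:
        return ALLOCATION_TABLE[(stream_kbps, mode)]
    except KeyError:
        raise ValueError(
            f"no allocation for {stream_kbps} kbps in mode {mode!r}"
        ) from None


metadata_kbps, audio_kbps = allocate_bitrates(32, "modified")
```

Keeping the split in a table lets the encoder and the controller agree on an allocation from the stream bitrate and mode alone.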

[0211] In some embodiments the bitrate allocator (or, earlier, the controller) can furthermore be configured to determine a first, higher, bitrate for the audio transport channel compared to a normal bitrate allocation mode; and a second, lower, bitrate for the audio transport channel compared to the normal bitrate allocation mode. The ability to employ either a higher or a lower audio transport channel bitrate than in the normal bitrate allocation mode can be beneficial, because the importance of the audio transport channel depends on the use of the stream.

[0212] The above can be useful to reduce the audio transport channel bitrate because, for some uses or modes, the audio transport channel with reduced bitrate is used only as an emergency fallback in case of a sudden change in user position (e.g., with teleportation).

[0213] The MASA transport audio signals 502 can then be input to a transport audio signal encoder 703, which applies suitable encoding based on the audio bitrate 712. For example the transport audio signal encoder 703 is configured to employ an IVAS core encoder. The resulting encoded transport audio signals 702 can then be forwarded to a multiplexer or MUX 707.

[0214] The MASA metadata 504 can similarly be configured to be input to a metadata encoder 705, which applies suitable encoding based on the metadata bitrate 714. The metadata encoder 705 can, e.g., employ the IVAS MASA encoding tools to generate the encoded metadata 704. The resulting encoded metadata 704 can also be forwarded to the MUX 707.

[0215] The multiplexer or MUX 707 is configured to multiplex the encoded transport audio signals 702 and the encoded metadata 704 to generate the bitstream, which in this example is bitstream 1 324 which is the output.

[0216] With respect to Fig.8 an example flow diagram of the operations of the example encoder shown in Fig.7 is described.

[0217] For example, as shown in Fig.8 by 801 is the operation of receiving or otherwise obtaining MASA spatial audio signals comprising the spatial metadata and transport audio signals and control information (for example indicating a determined use or mode).

[0218] Then as shown in Fig.8 by 803 is the operation of determining bitrate allocations for the transport audio signals and the spatial metadata.

[0219] Following this is the operation as shown in Fig.8 by 805 of encoding transport audio signals and spatial metadata based on the bitrate allocations.

[0220] Then is shown the operation, in Fig.8 by 807 of outputting encoded MASA spatial audio signals.

[0221] Fig.9 shows schematically an example decoder 325 according to some embodiments. The decoder 325 shown in Fig.9 is the decoder associated with stream 1; however, the following applies to any and all of the decoders. The input to the decoder 325 is the bitstream, which in this example is bitstream 1 324. The bitstream 1 324 is input to a demultiplexer or DEMUX 901, which demultiplexes the bitstream into encoded transport audio signals 902 and encoded metadata 904. The encoded transport audio signals 902 can then be input to transport audio signal decoder 903, which decodes the audio signals to generate decoded transport audio signals 902. The decoding can be performed using methods corresponding to the encoding methods applied in the transport audio signal encoder.

[0222] Additionally the encoded metadata 904 is input to a metadata decoder 905, which decodes the metadata to generate decoded spatial metadata 914. The decoding can be performed using methods corresponding to the encoding methods applied in the metadata encoder. The decoded transport audio signals 902 and the decoded spatial metadata 914 form the decoded MASA stream, which in this example is the Decoded MASA stream 1 334 which can be output.

[0223] With respect to Fig.10 an example flow diagram of the operations of the example decoder shown in Fig.9 is described.

[0224] For example, as shown in Fig.10 by 1001 is the operation of receiving or otherwise obtaining the data stream.

[0225] Then as shown in Fig.10 by 1003 is the operation of demultiplexing the stream to generate encoded transport audio signals and encoded spatial metadata.

[0226] Following this is the operation as shown in Fig.10 by 1005 of decoding the encoded transport audio signals and encoded spatial metadata.

[0227] Then is shown the operation, in Fig.10 by 1007 of outputting the decoded transport audio signals and decoded spatial metadata in the form of the decoded MASA spatial audio signals.

[0228] Fig.11 shows schematically an example 6DoF renderer 331. The renderer 331 can in some embodiments be based on the rendering process presented in GB2007710.8 and the associated US2023079683, which is adopted in MPEG-I immersive audio 6DoF HOA rendering (ISO / IEC 23090-4).

[0229] The difference between these renderers and the renderer employed in these embodiments is that the spatial metadata is obtained directly from the MASA streams, whereas in MPEG-I 6DoF rendering the metadata is analyzed from the input HOA signals. Additionally, instead of HOA input audio signals, the renderer employs MASA transport audio signals as an input.

[0230] In some embodiments the renderer 331 comprises a position pre-processor 1101 configured to receive inputs in the form of the Microphone array positions 300 and the listener position 340. The position pre-processor 1101 is configured, based on the locations of the microphone arrays and the listener, to determine processing weights for each of the microphone arrays.

[0231] This can be implemented, for example, by triangulating the area covered by the microphone arrays and calculating the weights based on barycentric coordinates and the listener position. The processing weights are then passed as interpolation data 1106 to the signal interpolator 1103 and to the metadata interpolator 1105. The interpolation data 1106 in some embodiments further comprises other data, such as the closest microphone array index, the indexes of the microphone arrays encapsulating the listener and the orientations of the microphone arrays and the listener.
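
The barycentric weighting can be illustrated with a short two-dimensional sketch. It assumes the three microphone arrays of one triangle and the listener are given as (x, y) positions; the function names and coordinates are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p in triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0.0:
        raise ValueError("microphone arrays are collinear")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def encloses(weights, tolerance=1e-9):
    """True when the point lies inside or on the edge of the triangle."""
    return all(w >= -tolerance for w in weights)


m1, m2, m3 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
weights = barycentric_weights((1.0, 1.0), m1, m2, m3)
```

The three weights sum to one, and a negative weight indicates that the listener is outside the triangle, which is one way to find the arrays that encapsulate the listener.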

[0232] The signal interpolator 1103 in some embodiments receives the N decoded transport audio signals 1102 and the interpolation data 1106. First, the signal interpolator 1103 is configured to transform the audio signals into the time-frequency domain, e.g., by utilizing STFT (Short-Time Fourier Transform), complex-modulated quadrature mirror filter (QMF) bank, complex low delay filter bank (CLDFB), or through some other means. Then, the energies are calculated for the transformed signals. Based on the processing weights from the interpolation data 1106 and the signal energies, the transformed signal of the closest microphone array to the listener is equalized. The equalized signal is the interpolated audio signal 1112 output from the signal interpolator 1103.

[0233] In this rendering example, only the signal from the closest microphone array to the listener is equalized and output, while the other microphone array signals are only used for obtaining their energies. As such, the signal from the closest microphone array to the listener is the most important for the rendering quality.
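
The equalization step can be sketched for one frequency band as below. The text does not specify the equalization rule, so the sketch assumes that the target energy is the weighted sum of the per-array energies and that the closest array's signal is scaled to match it; the function names are hypothetical.

```python
import math


def band_energy(bins):
    """Energy of one frequency band given its complex time-frequency bins."""
    return sum(abs(b) ** 2 for b in bins)


def equalize_closest(signals, weights, closest):
    """Scale the closest array's band so its energy matches the weighted target.

    The target (weighted sum of per-array energies) is an assumed rule.
    """
    energies = [band_energy(s) for s in signals]
    target = sum(w * e for w, e in zip(weights, energies))
    if energies[closest] == 0.0:
        return list(signals[closest])
    gain = math.sqrt(target / energies[closest])
    return [gain * b for b in signals[closest]]


signals = [[1 + 0j, 0 + 1j], [2 + 0j, 0 + 2j], [0j, 0j]]
interpolated = equalize_closest(signals, weights=(0.5, 0.25, 0.25), closest=0)
```

Only the energies of the other arrays enter the computation, which is why their transport audio can tolerate a lower bitrate than that of the closest array.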

[0234] In some embodiments, the signal energies could be fed directly to the signal interpolator 1103. In this case only the signal of the closest microphone array to the listener would be needed to be transformed to time-frequency domain in the signal interpolator 1103.

[0235] The metadata interpolator 1105 in some embodiments receives the N decoded spatial metadata streams 1104 and the interpolation data 1106. The spatial metadata contains (at least) the direction-of-arrival (DOA) and the direct-to-total energy ratios (DTR). The metadata interpolator 1105 is configured to rotate or otherwise modify the DOA values accordingly based on the orientations of the microphone arrays and the listener and interpolates the DOA and DTR values based on the processing weights from the interpolation data 1106. A single interpolated metadata 1114 is then output to the synthesis processor 1107.
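
A simplified version of this interpolation for a single frequency band is sketched below. It handles azimuth only, compensates orientation as a plain yaw rotation, and averages directions as weighted unit vectors so that angles near the +/-180 degree wrap-around combine correctly. These choices, and the linear interpolation of the ratios, are assumptions for illustration; the function name is hypothetical.

```python
import math


def interpolate_metadata(azimuths_deg, ratios, weights, array_yaws_deg, listener_yaw_deg):
    """Interpolate azimuth DOA and direct-to-total ratio for one band.

    Each azimuth is rotated from its array's frame into the listener's frame,
    then averaged as weighted unit vectors to avoid wrap-around errors.
    """
    x = y = 0.0
    for azimuth, weight, yaw in zip(azimuths_deg, weights, array_yaws_deg):
        rotated = math.radians(azimuth + yaw - listener_yaw_deg)
        x += weight * math.cos(rotated)
        y += weight * math.sin(rotated)
    azimuth_out = math.degrees(math.atan2(y, x))
    ratio_out = sum(w * r for w, r in zip(weights, ratios))
    return azimuth_out, ratio_out


azimuth, ratio = interpolate_metadata(
    azimuths_deg=[170.0, -170.0, 180.0],
    ratios=[0.8, 0.6, 0.4],
    weights=[0.5, 0.25, 0.25],
    array_yaws_deg=[0.0, 0.0, 0.0],
    listener_yaw_deg=0.0,
)
```

A naive weighted mean of the three angles would land near 87.5 degrees, far from all inputs, whereas the vector average stays close to the rear direction.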

[0236] The synthesis processor 1107 is configured to receive or otherwise obtain the interpolated audio signal 1112 and interpolated metadata 1114 as well as the microphone array positions 300 and the listener position 340. The synthesis processor 1107 is then configured to generate a spatial audio output 350 which can be played to a listener to create an immersive 6DoF listening experience.

[0237] With respect to Fig.12 an example flow diagram of the operations of the example 6DoF renderer shown in Fig.11 is described.

[0238] For example, as shown in Fig.12 by 1201 is the operation of receiving or otherwise obtaining decoded MASA spatial audio signals comprising decoded transport audio signals and decoded metadata, and listener position and microphone array positions.

[0239] Then as shown in Fig.12 by 1203 is the operation of implementing position pre-processing to generate interpolation data (information).

[0240] Following this is the operation as shown in Fig.12 by 1205 of performing signal interpolation and metadata interpolation.

[0241] Then is the operation as shown in Fig.12 by 1207 of synthesis processing based on the interpolated audio signals and metadata to generate spatial audio signals.

[0242] Then is shown the operation, in Fig.12 by 1209 of outputting the spatial audio signals.

[0243] More details on suitable 6DoF rendering can be found in applications GB2007710.8, US2023079683, and MPEG-I immersive audio 6DoF HOA rendering (ISO / IEC 23090-4).

[0244] Fig.13 shows an example capture and render scenario for the system. At the capture side 1300, a sound scene is captured with five microphone arrays m1-m5. These microphone arrays encapsulate a capturing area 1340. At the render side 1360, a listener 1380 is listening to the sound scene captured by the capture side 1300. The sound scene is rendered to the listener in 6DoF fashion, i.e., the listener can move in the playback area 1350 and turn their head. The sound scene adjusts accordingly to the listener’s movements.

[0245] From the render side 1360, the position 1330 of the listener 1380 is transmitted to the capture side 1300 (this is illustrated by the grey listener 1370 in the capture area). The microphone arrays encapsulating the listener (m1, m2 and m3) form the arrays that are used in the 6DoF rendering for the current listener position. Therefore, only the MASA bitstreams 1310 originating from those arrays are needed to be transmitted to the render side (with the microphone array positions 1320).

[0246] The microphone array m1 is closest to the listener’s current position, and the audio signal from that array is used for the main audio signal processing in the 6DoF rendering. The other transmitted bitstreams (from m2 and m3) are mostly used for metadata interpolation in the 6DoF rendering. For a higher listening quality of experience, the audio bitrate should be prioritized for m1 and the metadata bitrates should be prioritized for m2 and m3.

[0247] For example, the prioritizing of the bitrates can follow the example presented in the above tables. If a 192 kbps total bitrate is allocated for the three bitstreams, m1 can have 128 kbps bitrate allocation and m2 and m3 can have 32 kbps bitrate allocations. The type of the m1 bitstream can be left unmodified (Type 0 bitstream), which would give a bitrate of 17.5 kbps for the metadata and a bitrate of 110.5 kbps for the audio. The m2 and m3 bitstreams can be modified to Type 1 bitstreams, which would give a bitrate of 18.6 kbps for the metadata and a bitrate of 13.4 kbps for the audio for the two respective streams.

[0248] Fig.14 shows a flow diagram of example method steps for some embodiments, where the total bitrate is reduced. The reduction is achieved by modifying the bitrates for the spatial metadata and the audio parts for the spatial metadata streams. The rendering of the modified bitstreams results in equal or higher quality for the listener compared to rendering the original non-modified bitstreams.

[0249] With respect to the capture side 1400, in Fig.14 by 1401 is shown receiving the listener position from the renderer side and determining the microphone array positions.

[0250] Then as shown in Fig.14 by 1403 is the operation of determining the roles or use of the IVAS (MASA) streams originating from the microphone arrays based on the listener position (spatial metadata or audio signal streams).

[0251] Following this as shown in Fig.14 by 1405 is the operation of increasing the bitrate budget for the spatial metadata and reducing the bitrate budget for the audio for the spatial metadata streams.

[0252] Then as shown in Fig.14 by 1407 is the operation of encoding the adjusted spatial metadata and the original audio signal streams.

[0253] Finally as shown in Fig.14 by 1409 is the operation of transmitting the encoded streams with reduced total bitrate to the render side.

[0254] Furthermore with respect to the render side 1450 are the following operations for this example scenario.

[0255] For example, as shown in Fig.14 by 1451 is the operation of receiving or otherwise obtaining encoded IVAS (MASA) streams from the capture side.

[0256] Then as shown in Fig.14 by 1453 is the operation of decoding the streams and extracting the spatial metadata and audio transport channel data from the decoded streams.

[0257] Following this is the operation as shown in Fig.14 by 1455 of performing 6DoF rendering with the decoded data providing equal rendering quality to the listener compared to rendering the non-modified original bitstreams.

[0258] Fig.15 shows a further flow diagram which presents example method steps for the approach, where the total bitrate is maintained, and the spatial metadata and the audio transport channel bitrates are modified. This results in higher quality rendering for the listener compared to rendering the original non-modified bitstreams.

[0259] With respect to the capture side 1400, in Fig.15 by 1501 is shown receiving the listener position from the renderer side and determining the microphone array positions.

[0260] Then as shown in Fig.15 by 1503 is the operation of determining the roles or use of the IVAS (MASA) streams originating from the microphone arrays based on the listener position (spatial metadata or audio signal streams).

[0261] Following this as shown in Fig.15 by 1505 is the operation of increasing the bitrate budget for the spatial metadata and reducing the bitrate budget for the audio for the spatial metadata streams.

[0262] Also as shown in Fig.15 by 1506 is the operation of increasing the bitrate budget for the audio signal streams according to the saved bitrate budget from the adjusted spatial metadata streams.

[0263] Then as shown in Fig.15 by 1507 is the operation of encoding the adjusted spatial metadata and the original audio signal streams.

[0264] Finally as shown in Fig.15 by 1509 is the operation of transmitting the encoded streams to the render side with equal total bitrate compared to the non-modified bitstreams.

[0265] Furthermore with respect to the render side 1450 are the following operations for this example scenario.

[0266] For example, as shown in Fig.15 by 1551 is the operation of receiving or otherwise obtaining encoded IVAS (MASA) streams from the capture side.

[0267] Then as shown in Fig.15 by 1553 is the operation of decoding the streams and extracting the spatial metadata and audio transport channel data from the decoded streams.

[0268] Following this is the operation as shown in Fig.15 by 1555 of performing 6DoF rendering with the decoded data providing increased rendering quality to the listener compared to rendering the non-modified original bitstreams.
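
The two capture-side strategies of Fig.14 (reduced total bitrate) and Fig.15 (constant total bitrate) can be contrasted in a small planning sketch. It assumes all streams start from the same regular bitrate and that any saved bitrate is shared equally among the audio signal streams; the function name, role labels and example values are hypothetical, and a real controller would additionally need to map the result onto supported IVAS bitrates.

```python
def plan_stream_bitrates(roles, regular_kbps, modified_kbps, keep_total):
    """Assign a transmission bitrate to each stream according to its role.

    Metadata-only streams drop to modified_kbps. With keep_total=True the
    saved bitrate is moved to the audio stream (Fig.15 style); otherwise
    the audio stream keeps regular_kbps and the total shrinks (Fig.14 style).
    """
    plan = {
        name: modified_kbps if role == "metadata_only" else regular_kbps
        for name, role in roles.items()
    }
    if keep_total:
        saved = regular_kbps * len(roles) - sum(plan.values())
        audio_streams = [n for n, r in roles.items() if r != "metadata_only"]
        for name in audio_streams:
            plan[name] += saved / len(audio_streams)
    return plan


roles = {"m1": "audio_and_metadata", "m2": "metadata_only", "m3": "metadata_only"}
reduced = plan_stream_bitrates(roles, regular_kbps=64, modified_kbps=32, keep_total=False)
constant = plan_stream_bitrates(roles, regular_kbps=64, modified_kbps=32, keep_total=True)
```

Starting from three 64 kbps streams, the reduced plan totals 128 kbps, while the constant plan keeps 192 kbps by raising m1 to 128 kbps, matching the Case 2 example.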

[0269] Fig.16 furthermore shows a flow diagram which summarizes the embodiments.

[0270] In Fig.16 by 1601 is shown obtaining or receiving the listener position (from the renderer side).

[0271] Then as shown in Fig.16 by 1603 is obtaining at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene and determining the microphone array positions.

[0272] Then as shown in Fig.16 by 1605 is the operation of determining a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position.

[0273] Following this as shown in Fig.16 by 1607 is the operation of determining a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use.

[0274] Then as shown in Fig.16 by 1609 is the operation of controlling an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

[0275] With respect to the render side, shown in Fig.16 by 1611 is the operation of obtaining at least two encoded immersive audio coded streams.

[0276] Additionally as shown in Fig.16 by 1613 is the operation of receiving or otherwise obtaining a listener position.

[0277] Then as shown in Fig.16 by 1615 is the operation of decoding the at least two encoded immersive streams based on a determined use associated with each of the at least two encoded immersive streams, to generate at least two decoded immersive streams.

[0278] Following this is the operation as shown in Fig.16 by 1617 of performing 6DoF rendering of at least one output audio signal based on the at least two decoded immersive streams and the listener position.

[0279] It should be understood that the apparatuses may comprise or be coupled to other units or modules used in or for transmission and / or reception. Although the apparatuses have been described as one entity, different modules and memory may be implemented in one or more physical or logical entities.

[0280] It is noted that whilst some embodiments have been described in relation to 5G networks, similar principles can be applied in relation to other networks and communication systems. Therefore, although certain embodiments were described above by way of example with reference to certain example architectures for wireless networks, technologies and standards, embodiments may be applied to any other suitable forms of communication systems than those illustrated and described herein.

[0281] It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.

[0282] As used herein, “at least one of the following: ” and “at least one of ” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

[0283] In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

[0284] As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and / or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
(c) a combination of analog and / or digital hardware circuit(s) with software / firmware and
(i) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions); and
(ii) hardware circuit(s) and or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.

[0285] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and / or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.

[0286] The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or program, also called program product, including software routines, applets and / or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.

[0287] Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as DVD and the data variants thereof, CD. The physical media is a non-transitory media.

[0288] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

[0289] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.

[0290] Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

[0291] The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.

[0292] The foregoing description has provided, by way of non-limiting examples, a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

Claims

1. An apparatus for controlling a generation of encoded immersive streams, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

2. The apparatus as claimed in claim 1, further caused to: generate at least two encoded immersive streams based on the control; and transmit the at least two encoded immersive streams.

3. The apparatus as claimed in claim 2, caused to generate the at least two encoded immersive streams based on the control is caused to: create / generate at least two immersive streams; and encode the at least two immersive streams based on the control to generate the at least two encoded immersive streams.

4. The apparatus as claimed in any of claims 1 to 3, caused to determine the use associated with each of the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position is caused to determine the use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering.

5. The apparatus as claimed in claim 4, wherein the use as one of: spatial metadata determination; and spatial metadata determination and audio signal rendering is one of: spatial metadata interpolation; and spatial metadata interpolation and audio signal rendering.

6. The apparatus as claimed in any of claims 4 or 5, caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use is caused to determine: a first, higher, bitrate for the use of spatial metadata determination and audio signal rendering; a second, lower, bitrate for the use of spatial metadata determination.
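
By way of non-limiting illustration only, the use determination and the use-based bitrate selection recited in claims 4 to 6 may be sketched in Python as follows. The sketch assumes, as a hypothetical rule that is not stated in the claims, that the microphone array nearest to the listener position is assigned the use of spatial metadata determination and audio signal rendering, and that every other array is assigned spatial metadata determination only. The names `Use`, `determine_uses` and `determine_bitrates`, and the bitrate values, are illustrative assumptions.

```python
import math
from enum import Enum


class Use(Enum):
    METADATA_ONLY = "spatial metadata determination"
    METADATA_AND_AUDIO = "spatial metadata determination and audio signal rendering"


def determine_uses(listener_pos, array_positions):
    """Assign a use to each array from the listener and array positions.

    Assumption (not from the claims): the nearest array also supplies
    the audio signals for rendering; all others supply metadata only.
    """
    distances = [math.dist(listener_pos, p) for p in array_positions]
    nearest = distances.index(min(distances))
    return [
        Use.METADATA_AND_AUDIO if i == nearest else Use.METADATA_ONLY
        for i in range(len(array_positions))
    ]


# Hypothetical bitrates in bits per second: a first, higher, bitrate and
# a second, lower, bitrate.
BITRATE_BY_USE = {
    Use.METADATA_AND_AUDIO: 64_000,
    Use.METADATA_ONLY: 16_000,
}


def determine_bitrates(uses):
    """Look up the bitrate for each immersive stream based on its use."""
    return [BITRATE_BY_USE[use] for use in uses]
```

In this sketch a listener moving between arrays changes which stream receives the higher bitrate, while the remaining streams stay at the lower bitrate.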

7. The apparatus as claimed in any of claims 4 to 6, caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use is caused to determine: a first, higher, bitrate for the use of audio transport channel compared to a normal bitrate allocation mode; and a second, lower, bitrate for the use of audio transport channel compared to the normal bitrate allocation mode.

8. The apparatus as claimed in any of claims 1 to 7, caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use is caused to: determine a total bitrate for all of the at least two immersive streams; determine a first bitrate for each of the at least two immersive streams based on the total bitrate; and modify or adjust the determined first bitrate for each of the at least two immersive streams based on the use associated with each of the at least two microphone arrays.
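
As a non-limiting sketch of the three steps recited in claim 8, the following Python example starts from a total bitrate, derives a first bitrate for each stream as an equal share, and then adjusts each first bitrate according to the use. The equal split, the integer weights and the function name `allocate_stream_bitrates` are assumptions made for illustration and are not taken from the claims.

```python
def allocate_stream_bitrates(total_bitrate, renders_audio,
                             audio_weight=3, metadata_weight=1):
    """Return one adjusted bitrate (bits per second) per immersive stream.

    renders_audio[i] is True when stream i is used for spatial metadata
    determination and audio signal rendering, and False when it is used
    for spatial metadata determination only. The weights are hypothetical.
    """
    n = len(renders_audio)
    # First bitrate: an equal share of the total bitrate (assumed rule).
    first_bitrate = total_bitrate // n
    # Adjustment: scale each first bitrate by its use-dependent weight,
    # normalised so that the sum never exceeds the total bitrate.
    weights = [audio_weight if r else metadata_weight for r in renders_audio]
    weight_sum = sum(weights)
    return [first_bitrate * w * n // weight_sum for w in weights]
```

Integer arithmetic is used so that rounding can only lower the adjusted bitrates, which keeps their sum within the total bitrate.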

9. The apparatus as claimed in any of claims 1 to 8, caused to determine the bitrate for the at least two immersive streams from the at least two microphone arrays based on the use is caused to: determine a bitrate for a first of the at least two immersive streams; determine an audio bitrate based on the use for the first of the at least two immersive streams; determine a spatial metadata bitrate based on the use for the first of the at least two immersive streams, wherein the audio bitrate and spatial metadata bitrate combined are equal to or less than the bitrate for the first of the at least two immersive streams.

10. The apparatus as claimed in claim 9, caused to determine the audio bitrate based on the use for the first of the at least two immersive streams is caused to determine: a first, higher, bitrate for the audio bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, lower or zero, bitrate for the audio bitrate when the use is only the spatial metadata determination.

11. The apparatus as claimed in any of claims 9 or 10, caused to determine the spatial metadata bitrate based on the use for the first of the at least two immersive streams is caused to determine: a first bitrate for the spatial metadata bitrate when the use is the spatial metadata determination and audio signal rendering use; and a second, higher, bitrate for the spatial metadata bitrate when the use is only the spatial metadata determination.
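
The division of a single stream bitrate into an audio bitrate and a spatial metadata bitrate, as recited in claims 9 to 11, may be illustrated by the following non-limiting Python sketch. The one-quarter metadata share for a rendering stream, the choice of a zero audio bitrate for a metadata-only stream, and the name `split_stream_bitrate` are hypothetical choices made only for this example.

```python
def split_stream_bitrate(stream_bitrate, renders_audio):
    """Split one stream bitrate into (audio_bitrate, metadata_bitrate).

    The two parts never sum to more than stream_bitrate.
    """
    if renders_audio:
        # Use: spatial metadata determination and audio signal rendering.
        # Assumed split: one quarter for metadata, the rest for audio.
        metadata_bitrate = stream_bitrate // 4
        audio_bitrate = stream_bitrate - metadata_bitrate
    else:
        # Use: spatial metadata determination only. The audio bitrate is
        # set to zero here and the whole budget goes to spatial metadata.
        audio_bitrate = 0
        metadata_bitrate = stream_bitrate
    return audio_bitrate, metadata_bitrate
```

For the same stream bitrate, the metadata-only use yields the higher spatial metadata bitrate and the rendering use yields the higher audio bitrate.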

12. The apparatus as claimed in any of claims 1 to 11, caused to obtain the listener position is caused to obtain the listener position from a further apparatus.

13. The apparatus as claimed in any of claims 1 to 12, further caused to transmit to at least one further apparatus at least one of: information associated with the use for at least two immersive streams from the at least two microphone arrays; and information associated with the control.

14. An apparatus for processing encoded immersive streams, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; and render at least one output audio signal based on the decoded at least two immersive streams and the listener position.
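
As a non-limiting sketch of decoding based on a determined use, as recited in claim 14, the following Python example keeps the spatial metadata of every stream but produces audio only for streams whose use includes audio signal rendering. The `EncodedStream` and `DecodedStream` types and the pass-through "decoding" are placeholders assumed for illustration; they do not represent an actual codec interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncodedStream:
    metadata_payload: bytes
    audio_payload: bytes  # may be empty when the audio bitrate is zero


@dataclass
class DecodedStream:
    metadata: bytes
    audio: Optional[bytes]  # None when the audio is not decoded


def decode_streams(encoded: List[EncodedStream],
                   renders_audio: List[bool]) -> List[DecodedStream]:
    """Decode each stream according to its determined use.

    Placeholder logic: payloads are passed through unchanged where a
    real decoder would be invoked.
    """
    decoded = []
    for stream, needs_audio in zip(encoded, renders_audio):
        audio = stream.audio_payload if needs_audio else None
        decoded.append(DecodedStream(stream.metadata_payload, audio))
    return decoded
```

A renderer would then combine the spatial metadata of all decoded streams with the audio of the rendering streams and the listener position.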

15. The apparatus as claimed in claim 14, further caused to obtain information associated with the determined use associated with each of the at least two encoded immersive streams.

16. The apparatus as claimed in claim 15, caused to obtain information associated with the determined use associated with each of the at least two encoded immersive streams is caused to receive the information associated with the determined use associated with each of the at least two encoded immersive streams from a further apparatus.

17. The apparatus as claimed in any of claims 13 to 16, caused to obtain at least two encoded immersive audio coded streams is caused to receive the at least two encoded immersive streams from a further apparatus.

18. A method for controlling a generation of encoded immersive streams, the method comprising at least: obtaining a listener position; obtaining at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determining a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determining a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and controlling an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

19. The method as claimed in claim 18, further comprising: generating at least two encoded immersive streams based on the control; and transmitting the at least two encoded immersive streams.

20. The method as claimed in claim 19, wherein generating the at least two encoded immersive streams based on the control comprises: creating / generating at least two immersive streams; and encoding the at least two immersive streams based on the control to generate the at least two encoded immersive streams.

21. A method for processing encoded immersive streams, the method comprising at least: obtaining at least two encoded immersive streams; obtaining a listener position; decoding the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; and rendering at least one output audio signal based on the decoded at least two immersive streams and the listener position.

22. An apparatus for controlling a generation of encoded immersive streams, the apparatus comprising means configured to: obtain a listener position; obtain at least two microphone array positions, each of the at least two microphone array positions associated with a respective microphone array configured to capture audio signals representing an audio scene; determine a use for at least two immersive streams from the at least two microphone arrays, the use determined based on the at least two microphone array positions and the listener position; determine a bitrate for the at least two immersive streams from the at least two microphone arrays based on the use; and control an encoding of the at least two immersive streams based on the audio signals and the determined bitrate.

23. An apparatus for processing encoded immersive streams, the apparatus comprising means configured to: obtain at least two encoded immersive streams; obtain a listener position; decode the at least two encoded immersive streams, based on a determined use associated with each of the at least two encoded immersive streams, to generate decoded at least two immersive streams; and render at least one output audio signal based on the decoded at least two immersive streams and the listener position.