Spatial metadata post enhancement with machine learning during bitrate switch

A machine-learning based apparatus and method addresses bitrate switch-induced artefacts in spatial audio by using two models to generate a compatible history context, enhancing spatial metadata and maintaining audio quality during coding scheme changes.

WO2026139199A1 · PCT designated stage · Publication Date: 2026-07-02 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: NOKIA TECHNOLOGIES OY
Filing Date: 2025-12-04
Publication Date: 2026-07-02

Application Information

IPC: G10L19/008; G10L25/30; G10L19/22

AI Tagging

Technology Topics

Computer hardware · Audio frequency

AI Technical Summary

Technical Problem

Existing spatial metadata processing systems face challenges in maintaining quality during bitrate switches, leading to artefacts and inaccuracies in spatial audio due to incompatible input features across different coding settings.

Method used

A machine-learning based apparatus and method that processes decoded spatial metadata and audio signals before and after a bitrate switch, using two machine learning models to generate a compatible history context and enhance spatial metadata to match the new coding scheme, thereby reducing artefacts and improving quality.

Benefits of technology

Enhances spatial metadata during bitrate switches, ensuring accurate and natural sound reproduction by generating a compatible history context for the new coding scheme, thus suppressing artefacts and maintaining audio quality.

✦ Generated by Eureka AI based on patent content.

Abstract

An apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

Description

[0001] SPATIAL METADATA POST ENHANCEMENT WITH MACHINE LEARNING DURING BITRATE SWITCH

[0002] Field

[0003] The present application relates to apparatus and methods for metadata processing for a spatial audio stream, and in particular, but not exclusively, to post-enhancement spatial metadata processing for bitrate-switched or otherwise encoding-switched streams.

[0004] Background

[0005] Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the 3GPP Immersive Voice and Audio Services (IVAS) codec, which is designed to be suitable for use over a communications network such as a 4G / 5G network, including use in such immersive services as, for example, immersive voice and audio for virtual reality (VR). This audio codec handles the encoding, decoding and rendering of speech, music, and generic audio. It supports a variety of input formats, such as channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources, as well as MASA (Metadata-assisted spatial audio) inputs. IVAS operates with low latency to enable conversational services and supports high error robustness under various transmission conditions. The IVAS codec operates over a wide range of bit rates, from very low (13.2 kbps) to relatively high (512 kbps).

[0006] Additionally, the application of machine learning (ML), and more specifically the application of artificial or deep neural networks (ANNs, DNNs) to assist in processing operations is known. A neural network (NN) model is composed of a number of interconnected layers, each layer representing a set of operations (e.g., matrix multiplications, additions, convolutions, or non-linear operations) defining a graph of computational operations. These operations may have processing coefficients (or parameters, or weights) that are adapted during the model training phase based on the training data. A benefit of ML methods, especially those of DNNs, is that they are able to model highly complex relationships in the data without the need for manually describing those relationships. In many cases the relationships are so complex that it is currently not feasible or even possible to describe them manually. Instead, the ML methods are able to determine and model the complex relationships based on the training data examples with method inputs and expected model outputs.
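To make the description of layers and weights concrete, the following is a minimal sketch of a two-layer feed-forward pass in plain Python. The weight values, layer sizes and function names are illustrative assumptions and are not taken from the application; in a real model the weights would be adapted during training.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: matrix multiplication plus bias addition."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    """Element-wise non-linear operation."""
    return [max(0.0, a) for a in v]


# Hypothetical fixed weights; a trained model would learn these from data.
W1: Matrix = [[1.0, -1.0], [0.5, 0.5]]
B1: Vector = [0.0, 0.0]
W2: Matrix = [[1.0, 2.0]]
B2: Vector = [0.1]


def forward(x: Vector) -> Vector:
    """Graph of operations: dense layer, ReLU, then a second dense layer."""
    hidden = relu(dense(x, W1, B1))
    return dense(hidden, W2, B2)
```

Calling `forward([2.0, 1.0])` passes the input through both layers and returns a single-element output vector.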

[0007] Summary

[0008] There is provided according to a first aspect an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising means configured to: obtain a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0009] The means configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may be configured to: obtain a machine-learning model for enhancing the at least one spatial metadata parameter, the machine-learning model for spatial audio content associated with the coding configuration, the machine-learning model configured to output at least one predicted spatial metadata property; and generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property.

[0010] The means configured to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property may be configured to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property and the at least one spatial metadata parameter. The means configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may be further configured to: obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter based on the coding configuration, the at least one input feature being based on the coding configuration.
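As a rough illustration of obtaining a model for a given coding configuration and forming the enhanced parameter from both the predicted property and the decoded parameter, consider the sketch below. Everything in it is an assumption made for illustration: the stand-in "models" are simple functions rather than trained networks, the registry is keyed by bitrate only, and the linear blend with weight `blend` is just one possible way to combine the two values.

```python
from typing import Callable, Dict, List

Model = Callable[[List[float]], List[float]]


def smoothing_model(features: List[float]) -> List[float]:
    """Stand-in for a model associated with a low bitrate (hypothetical)."""
    mean = sum(features) / len(features)
    return [0.5 * f + 0.5 * mean for f in features]


def passthrough_model(features: List[float]) -> List[float]:
    """Stand-in for a model associated with a high bitrate (hypothetical)."""
    return list(features)


# Hypothetical registry: one model per coding configuration (bitrate in bit/s).
MODELS: Dict[int, Model] = {13200: smoothing_model, 512000: passthrough_model}


def enhance(decoded: List[float], bitrate_bps: int, blend: float = 0.5) -> List[float]:
    """Blend the decoded parameters with the model's predicted properties."""
    model = MODELS[bitrate_bps]      # model selected by coding configuration
    predicted = model(decoded)       # predicted spatial metadata property
    return [(1.0 - blend) * d + blend * p for d, p in zip(decoded, predicted)]
```

With `blend=1.0` the enhanced parameter is the prediction alone; lower values keep more of the decoded parameter.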

[0011] The means configured to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may be configured to determine a switch mode of operation based on the coding configuration, the switch mode of operation indicating that there is a change in the coding configuration.

[0012] The means configured to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may be configured to determine a normal mode of operation based on the coding configuration, the normal mode of operation indicating that there is a consistent coding configuration.
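The distinction between the switch mode and the normal mode can be sketched as a comparison of each frame's coding configuration with that of the preceding frame. The representation of a configuration as a tuple, the example values, and the choice to treat only the first frame after a change as being in switch mode are assumptions made for this sketch; the text above does not state how long the switch mode lasts.

```python
from typing import Hashable, Iterable, List


def label_modes(configs: Iterable[Hashable]) -> List[str]:
    """Label each frame 'switch' if its coding configuration differs from the
    previous frame's configuration, otherwise 'normal'."""
    modes: List[str] = []
    previous = None
    for index, config in enumerate(configs):
        if index > 0 and config != previous:
            modes.append("switch")
        else:
            modes.append("normal")
        previous = config
    return modes


# Hypothetical per-frame configurations: (bitrate in bit/s, bands, subframes).
frames = [(24400, 5, 1), (24400, 5, 1), (32000, 8, 4), (32000, 8, 4)]
modes = label_modes(frames)
```

Here only the third frame, where the configuration changes, is labelled as switch mode.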

[0013] The means configured to obtain the at least one input feature for the normal mode of operation based on the coding configuration may be configured to generate the at least one input feature based on at least one of: the at least one spatial metadata parameter; and the at least one audio signal.

[0014] The means configured to obtain at least one input feature for the switch mode of operation based on the coding configuration may be configured to: generate at least one history audio feature based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.
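A minimal sketch of assembling the switch-mode input from a history audio feature and a current audio feature follows. The feature definition (metadata values followed by the mean energy of the audio frame) and the combination by concatenation are assumptions chosen for brevity; the text above does not specify either.

```python
from typing import List


def audio_feature(metadata: List[float], audio_frame: List[float]) -> List[float]:
    """Hypothetical feature: the metadata values plus the frame's mean energy."""
    energy = sum(s * s for s in audio_frame) / len(audio_frame)
    return list(metadata) + [energy]


def switch_mode_input(prev_enhanced: List[float], prev_audio: List[float],
                      cur_metadata: List[float], cur_audio: List[float]) -> List[float]:
    """Combine history and current features into one model input vector."""
    # History feature: previous enhanced metadata and the earlier audio frame.
    history = audio_feature(prev_enhanced, prev_audio)
    # Current feature: newly decoded metadata and the current audio frame.
    current = audio_feature(cur_metadata, cur_audio)
    return history + current  # concatenation is an assumed combination rule
```

In normal mode, by contrast, the input would be built from the current metadata and audio alone, for example `audio_feature(cur_metadata, cur_audio)` under the same assumptions.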

[0015] The means configured to obtain at least one input feature for the switch mode of operation based on the coding configuration may be configured to: generate at least one history audio feature based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.

[0016] The means configured to obtain at least one input feature for the switch mode of operation based on the coding configuration may be configured to: generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0017] The means configured to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may be configured to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous enhanced spatial metadata parameter based on the frequency band limits.

[0018] The means configured to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may be configured to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous enhanced spatial metadata parameter based on the subframe limits.
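The use of frequency band limits and subframe limits to derive a history parameter can be sketched as a remapping of a time-frequency grid of previous enhanced values onto the coarser grid of the new coding configuration. The evenly spaced limits, the plain averaging within each group, and all names below are assumptions for illustration; an actual codec would likely take its limits from tables tied to the coding configuration, and the sketch only covers a switch to fewer bands or subframes.

```python
from typing import List

Grid = List[List[float]]  # grid[subframe][band]


def limits(num_old: int, num_new: int) -> List[int]:
    """Evenly spaced group boundaries mapping num_old items onto num_new groups."""
    if not 0 < num_new <= num_old:
        raise ValueError("this sketch only combines items into fewer groups")
    return [(i * num_old) // num_new for i in range(num_new + 1)]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def to_history(prev: Grid, num_bands: int, num_subframes: int) -> Grid:
    """Remap previous enhanced values to the new band and subframe layout."""
    band_lim = limits(len(prev[0]), num_bands)
    sub_lim = limits(len(prev), num_subframes)
    # Step 1: combine old bands that fall inside each new band.
    per_band = [[mean(row[lo:hi]) for lo, hi in zip(band_lim, band_lim[1:])]
                for row in prev]
    # Step 2: combine old subframes that fall inside each new subframe.
    return [[mean([per_band[t][b] for t in range(start, end)])
             for b in range(num_bands)]
            for start, end in zip(sub_lim, sub_lim[1:])]
```

For example, a previous frame with 2 subframes and 4 bands can be reduced with `to_history(prev, 2, 1)` to a single subframe with 2 bands. Plain averaging is not appropriate for every parameter type (direction angles, for instance, wrap around), so a parameter-specific combination rule would be needed in practice.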

[0019] The means configured to obtain at least one input feature for the switch mode of operation based on the coding configuration may be configured to: generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0020] The means configured to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may be configured to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous predicted spatial metadata property based on the frequency band limits.

[0021] The means configured to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may be configured to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous predicted spatial metadata property based on the subframe limits.

[0022] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.
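One way to represent such a coding configuration in software is a small immutable record in which each of the three items is optional but at least one must be present. The class name, field names and validation rule below are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodingConfig:
    """Hypothetical container for the coding configuration of a bitstream."""
    bitrate_bps: Optional[int] = None    # bitrate value for the encoding
    num_bands: Optional[int] = None      # number of frequency bands
    num_subframes: Optional[int] = None  # number of temporal subframes

    def __post_init__(self) -> None:
        # "At least one of": reject a configuration with nothing set.
        if (self.bitrate_bps is None and self.num_bands is None
                and self.num_subframes is None):
            raise ValueError("coding configuration needs at least one field")


low = CodingConfig(bitrate_bps=13200)
high = CodingConfig(bitrate_bps=512000, num_bands=24, num_subframes=4)
```

Because the dataclass is frozen and compares by value, two configurations can be compared directly with `==` to detect a change between frames. The band and subframe counts in `high` are made-up example values.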

[0023] According to a second aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising means configured to: obtain spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generate a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmit the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.
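The encoder-side behaviour can be illustrated with a toy frame format that carries the coding configuration in a header followed by coarsely quantised metadata values. The byte layout, the 8-level quantiser and the assumption of one value per frequency band are hypothetical and are not the IVAS bitstream format; the point is only that the decoded parameter is a lossy version of the original, which the receiving apparatus may then enhance using the signalled configuration.

```python
import struct
from typing import List, Tuple

# Hypothetical header: bitrate (uint32), band count (uint8), subframe count (uint8).
HEADER = struct.Struct(">IBB")
LEVELS = 7  # quantiser steps for values in [0, 1]; an assumed resolution


def encode_frame(ratios: List[float], bitrate_bps: int, num_subframes: int) -> bytes:
    """Quantise the original parameters and prepend the coding configuration."""
    quantised = [round(min(max(r, 0.0), 1.0) * LEVELS) for r in ratios]
    header = HEADER.pack(bitrate_bps, len(ratios), num_subframes)
    return header + bytes(quantised)


def decode_frame(data: bytes) -> Tuple[Tuple[int, int, int], List[float]]:
    """Recover the coding configuration and the (lossy) decoded parameters."""
    config = HEADER.unpack_from(data)
    payload = data[HEADER.size:]
    return config, [q / LEVELS for q in payload]


frame = encode_frame([0.0, 0.4, 1.0], bitrate_bps=24400, num_subframes=1)
config, decoded = decode_frame(frame)
```

The decoded middle value differs from the original 0.4 because of quantisation, which is the kind of coding error the enhancement step is meant to reduce.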

[0024] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

[0025] According to a third aspect there is provided a method for an apparatus for enhancing at least one spatial metadata parameter, the method comprising: obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0026] Enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may comprise: obtaining a machine-learning model for enhancing the at least one spatial metadata parameter, the machine-learning model for spatial audio content associated with the coding configuration, the machine-learning model configured to output at least one predicted spatial metadata property; and generating at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property.

[0027] Generating at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property may comprise generating at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property and the at least one spatial metadata parameter.

[0028] Enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may further comprise: obtaining at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter based on the coding configuration, the at least one input feature being based on the coding configuration.

[0029] Obtaining at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may comprise determining a switch mode of operation based on the coding configuration, the switch mode of operation indicating that there is a change in the coding configuration.

[0030] Obtaining at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may comprise determining a normal mode of operation based on the coding configuration, the normal mode of operation indicating that there is a consistent coding configuration. Obtaining the at least one input feature for the normal mode of operation based on the coding configuration may comprise generating the at least one input feature based on at least one of: the at least one spatial metadata parameter; and the at least one audio signal.

[0031] Obtaining at least one input feature for the switch mode of operation based on the coding configuration may comprise: generating at least one history audio feature based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generating at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generating the at least one input feature based on the at least one history audio feature and the at least one audio feature.

[0032] Obtaining at least one input feature for the switch mode of operation based on the coding configuration may comprise: generating at least one history audio feature based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generating at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generating the at least one input feature based on the at least one history audio feature and the at least one audio feature.

[0033] Obtaining at least one input feature for the switch mode of operation based on the coding configuration may comprise: generating at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generating the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0034] Generating at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may comprise: determining frequency band limits based on the coding configuration; generating frequency combined spatial metadata parameters from the at least one previous enhanced spatial metadata parameter based on the frequency band limits.

[0035] Generating at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may comprise: determining subframe limits based on the coding configuration; and generating the at least one history spatial metadata parameter from the at least one previous enhanced spatial metadata parameter based on the subframe limits.

[0036] Obtaining at least one input feature for the switch mode of operation based on the coding configuration may comprise: generating at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generating the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0037] Generating at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may comprise: determining frequency band limits based on the coding configuration; generating frequency combined spatial metadata parameters from the at least one previous predicted spatial metadata property based on the frequency band limits.

[0038] Generating at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may comprise: determining subframe limits based on the coding configuration; and generating the at least one history spatial metadata parameter from the at least one previous predicted spatial metadata property based on the subframe limits.

[0039] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

[0040] According to a fourth aspect there is provided a method for an apparatus for enhancing at least one spatial metadata parameter, the method comprising: obtaining spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generating a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmitting the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0041] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

[0042] According to a fifth aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0043] The apparatus caused to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may be caused to: obtain a machine-learning model for enhancing the at least one spatial metadata parameter, the machine-learning model for spatial audio content associated with the coding configuration, the machine-learning model configured to output at least one predicted spatial metadata property; and generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property.

[0044] The apparatus caused to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property may be configured to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property and the at least one spatial metadata parameter.

[0045] The apparatus caused to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream may be further caused to: obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter based on the coding configuration, the at least one input feature being based on the coding configuration.

[0046] The apparatus caused to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may be further caused to determine a switch mode of operation based on the coding configuration, the switch mode of operation indicating that there is a change in the coding configuration.

[0047] The apparatus caused to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter may be further caused to determine a normal mode of operation based on the coding configuration, the normal mode of operation indicating that there is a consistent coding configuration.

[0048] The apparatus caused to obtain the at least one input feature for the normal mode of operation based on the coding configuration may be further caused to generate the at least one input feature based on at least one of: the at least one spatial metadata parameter; and the at least one audio signal.

[0049] The apparatus caused to obtain at least one input feature for the switch mode of operation based on the coding configuration may be further caused to: generate at least one history audio feature based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.

[0050] The apparatus caused to obtain at least one input feature for the switch mode of operation based on the coding configuration may be further caused to: generate at least one history audio feature based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.

[0051] The apparatus caused to obtain at least one input feature for the switch mode of operation based on the coding configuration may be further caused to: generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0052] The apparatus caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may be caused to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous enhanced spatial metadata parameter based on the frequency band limits.

[0053] The apparatus caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal may be caused to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous enhanced spatial metadata parameter based on the subframe limits.

[0054] The apparatus caused to obtain at least one input feature for the switch mode of operation based on the coding configuration may be further caused to: generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

[0055] The apparatus caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may be caused to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous predicted spatial metadata property based on the frequency band limits.

[0056] The apparatus caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal may be caused to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous predicted spatial metadata property based on the subframe limits.

[0057] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

[0058] According to a sixth aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generate a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmit the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0059] The coding configuration associated with the encoding of the spatial bitstream may comprise at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

[0060] According to a seventh aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising: means for obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and means for enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0061] According to an eighth aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising: means for obtaining spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; means for generating a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and means for transmitting the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0062] According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generating a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmitting the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0063] According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0064] According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generating a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmitting the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0065] According to a thirteenth aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising: obtaining circuitry configured to obtain a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing circuitry configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0066] According to a fourteenth aspect there is provided an apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising: obtaining circuitry configured to obtain spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generating circuitry configured to generate a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmitting circuitry configured to transmit the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0067] According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for enhancing at least one spatial metadata parameter, to perform at least the following: obtaining spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generating a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmitting the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

[0068] An apparatus comprising means for performing the actions of the method as described above.

[0069] An apparatus configured to perform the actions of the method as described above.

[0070] A computer program comprising program instructions for causing a computer to perform the method as described above.

[0071] A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

[0072] An electronic device may comprise apparatus as described herein.

[0073] A chipset may comprise apparatus as described herein.

[0074] Embodiments of the present application aim to address problems associated with the state of the art.

[0075] Summary of the Figures

[0076] For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Fig.1 shows an example system of apparatus suitable for implementing some embodiments;

[0077] Fig.2 shows a flow diagram of the operation of the apparatus shown in Fig.1 according to some embodiments;

[0078] Fig.3 shows schematically an example encoder as shown in Fig.1 in further detail according to some embodiments;

[0079] Fig.4 shows a flow diagram of the operations of the example encoder shown in Fig.3 according to some embodiments;

[0080] Fig.5 shows schematically an example decoder as shown in Fig.1 in further detail according to some embodiments;

[0081] Fig.6 shows a flow diagram of the operations of the example decoder shown in Fig.5 according to some embodiments;

[0082] Fig.7 shows schematically an example metadata enhancer as shown in Fig.5 in further detail according to some embodiments;

[0083] Fig.8 shows a flow diagram of the operations of the example metadata enhancer shown in Fig.7 according to some embodiments;

[0084] Figs.9 and 10 show tables of example modes based feature computer inputs;

[0085] Fig.11 shows example conceptual level differences between building blocks employed in the ML model when temporal context is utilised;

[0086] Fig.12 shows schematically an example history metadata generator as shown in Fig.7 in further detail according to some embodiments;

[0087] Fig.13 shows a flow diagram of the operations of the example history metadata generator shown in Fig.12 according to some embodiments;

[0088] Figs.14 and 15 show example traces showing the effect of the applications of some embodiments;

[0089] Fig.16 shows schematically a further example metadata enhancer as shown in Fig.5 in further detail according to some embodiments;

[0090] Fig.17 shows a flow diagram of the operations of the further example metadata enhancer shown in Fig.16 according to some embodiments;

[0091] Fig.18 shows schematically an example device suitable for implementing the apparatus shown herein;

Figs.19 and 20 show an example ML model structure suitable for implementing some embodiments;

[0092] Fig.21 shows an example structure of the ResUNet; and

[0093] Fig.22 shows an example Dilated ResBlock structure according to some embodiments.

[0094] Embodiments of the Application

[0095] The concept as discussed herein in further detail with respect to the following embodiments is related to encoding and decoding parametric spatial audio. In the following examples an IVAS codec is used to show practical implementations or examples of the concept. However, it would be appreciated that the embodiments presented herein may be extended to other codecs without inventive input.

[0096] As discussed earlier the metadata-assisted spatial audio (MASA) is one of the input formats supported by IVAS. It uses audio signal(s) together with corresponding spatial metadata (containing, e.g., directions and direct-to-total energy ratios in frequency bands). The MASA stream can, e.g., be obtained by capturing spatial audio with microphones of, e.g., a mobile device, where the set of spatial metadata is estimated based on the microphone signals. The MASA stream can be obtained also from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (e.g., 5.1 multichannel mix), or other content by means of a suitable format conversion.

[0097] MASA spatial metadata values are available for each time-frequency tile (TF-tile); there can, for example, be 24 frequency bands and 4 temporal subframes in each frame. The frame size in IVAS is 20 ms (and thus the temporal subframe is 5 ms). In addition, MASA supports 1 or 2 directions for each time-frequency tile (i.e., there are 1 or 2 direction index values, 1 or 2 associated direct-to-total energy ratios, and spread coherence parameters for each time-frequency tile). Other parameters can also be defined.
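By way of illustration only, the tile layout described above can be sketched in Python as follows. The container `TileParams`, the helper `empty_frame` and the nested-list layout are hypothetical names chosen for this sketch and are not taken from the IVAS specification; only the example figures (24 bands, 4 subframes, 20 ms frames, 1 or 2 directions) come from the text.

```python
from dataclasses import dataclass

FRAME_MS = 20        # IVAS frame length stated in the text
NUM_BANDS = 24       # example band count from the text
NUM_SUBFRAMES = 4    # example subframe count from the text


@dataclass
class TileParams:
    """Hypothetical container for one direction of one TF-tile."""
    direction_index: int
    direct_to_total: float    # energy ratio, assumed to lie in [0, 1]
    spread_coherence: float   # assumed to lie in [0, 1]


def empty_frame(num_directions=1):
    """Build frame[subframe][band] -> list of 1 or 2 TileParams."""
    if num_directions not in (1, 2):
        raise ValueError("MASA supports 1 or 2 directions per tile")
    return [
        [[TileParams(0, 0.0, 0.0) for _ in range(num_directions)]
         for _ in range(NUM_BANDS)]
        for _ in range(NUM_SUBFRAMES)
    ]


frame = empty_frame(num_directions=2)
subframe_ms = FRAME_MS / NUM_SUBFRAMES        # duration of one subframe
tiles_per_frame = NUM_SUBFRAMES * NUM_BANDS   # number of TF-tiles per frame
```

With these example figures each frame carries 96 TF-tiles of 5 ms each; a lower-resolution coding mode would simply use smaller band and subframe counts.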

[0098] Additionally, IVAS also supports audio objects (independent streams with metadata, ISM) as an input. The audio objects contain, for each object, an audio signal and associated metadata (e.g., the direction of the object). The metadata supported by IVAS can comprise not only parameters such as direction(s), energy ratio(s), etc. but also extended metadata. The extended metadata parameters, for example, can comprise parameters such as a yaw, a pitch, and a radius. In IVAS, the coding of the extended metadata is supported for higher bitrates, for example, for ISM format input encoding the extended metadata is supported for brateIVAS ≥ 64 kbps.

[0099] Furthermore, IVAS supports a variety of bitrates, ranging from 13.2 kbps to 512 kbps. This bitrate is shared by various signaling bits, the audio signal coding, and the spatial metadata coding. With the MASA input, the bitrate allocated for the spatial metadata coding varies from about 2.25 kbps to about 65 kbps, depending on the total IVAS bitrate and the actual spatial metadata content.

[0100] Spatial metadata encoding and decoding operates on defined or engineered rules. For example, in IVAS, the MASA metadata is compressed using various data reduction methods, such as representing the direction parameter with sparser resolution when the direct-to-total energy ratio is small; or applying non-uniform encoding codebooks to the parameters that emphasize precision of the parameters at some data ranges over others.

[0101] As described in the following embodiments machine learning methods can be applied to assist the design of efficient compression. The machine learning methods as described herein do not directly address the compression but attempt to enhance the data degraded by compression. The concepts as discussed in the following embodiments focus on controlling the utilization of the machine learning method based on an encoding bitrate or level of encoding.

[0102] The model or network typically needs to be trained before it can generate useful outputs. In each training step, a set of input data elements (from the training dataset) is provided to the network and the network performs the defined computational operations, resulting in the model output. The model output is compared against a known reference (i.e., ground truth) with a loss function. Then the network coefficients are adjusted such that the model output is closer to the reference target. If the comparison uses an error function or loss, this adaptation aims at reducing the value of the error or loss. If the comparison uses a similarity or fitness function, the adaptation aims at increasing this. The network training is finished when the network has converged, i.e., the error is no longer reduced or the fitness is no longer improved, or the magnitude of the change is considered too small, or a given maximum number of training steps is reached. This decision may also be made using a separate validation dataset that is not used for adjusting the model parameters, but only to assess the performance. Common framework tools for defining a neural network model include PyTorch and TensorFlow.
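The training procedure outlined above can be illustrated with a deliberately small sketch. It assumes a one-coefficient model y = w * x trained by plain gradient descent on toy data; the function names, learning rate and stopping threshold are hypothetical, and a practical network would be defined in a framework such as PyTorch or TensorFlow rather than in plain Python.

```python
def mse(w, data):
    """Mean squared error (loss) of the one-coefficient model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(train_set, val_set, lr=0.01, max_steps=10_000, min_delta=1e-12):
    """Gradient descent with a stopping decision made on validation data."""
    w = 0.0                                   # the single model coefficient
    best_val = mse(w, val_set)
    for step in range(1, max_steps + 1):
        # Gradient of the training loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in train_set) / len(train_set)
        w -= lr * grad                        # move the output toward the reference
        val = mse(w, val_set)                 # validation data never adjusts w
        if best_val - val < min_delta:        # improvement considered too small
            return w, step
        best_val = val
    return w, max_steps                       # maximum number of steps reached


# Toy data whose ground truth is y = 3 * x.
train_set = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
val_set = [(x, 3.0 * x) for x in (1.5, 2.5)]
w, steps = train(train_set, val_set)
```

Both stopping criteria named in the text appear here: the change in the validation loss falling below a threshold, and the cap on the number of training steps.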

[0103] Using the trained network is referred to as inference. During this stage, the model parameters are usually no longer adjusted. The inference can be done using the same framework which was used for defining and training the model, or the model can be saved into another known network definition format, such as the Open Neural Network Exchange (ONNX) format or the TensorFlow Lite format. These formats may include a software library for the inference, which is capable of performing computational operations according to the format definitions.

[0104] 3GPP TS 26.253 IVAS describes a set of techniques to convey the spatial metadata at a multitude of bitrates, and encoding and decoding signals with machine learning is known. For example, a machine-learning based audio encoder and decoder was described in Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., & Tagliasacchi, M. (2022). SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 495-507, where it was possible to obtain high audio quality at very low bitrates.

[0105] The state-of-the-art parametric spatial metadata coding methods (such as the MASA coding methods in IVAS) control the level of compression depending on the bitrate at the cost of the reconstruction accuracy of the spatial metadata at the decoder. With lower bitrates, a higher compression factor is needed, and this is obtained by trading off some of the reconstruction accuracy. Especially at very low bitrates (such as 13.2 kbps with IVAS), the bitrate that can be allocated to metadata encoding is so low (such as 2.25 kbps with IVAS) that it affects the perceptual quality of the spatial audio output at the decoder in a negative manner. This includes instability and inaccuracies of the perceived sound sources.

[0106] Methods have been proposed, for example in GB2411721, to improve the spatial metadata quality using a machine-learning based post-processing block. The post-processing block employs input features from a temporal context preceding (and in some embodiments also following) the current (sub)frame to be enhanced, in addition to feature values in the current (sub)frame. Although it is possible to make the enhancement work without temporal context, the quality is significantly worse. It was found that using a history context of approximately 200 ms provides a good enhancement performance because there are typically some correlations in the metadata values over time. It is understood that the value of 200 ms is an example value that was found to work for a specific range of scenarios, and that other values may be employed based on the scenario or implementation.

[0107] In real-world communication codecs, the transport capability of the physical layer of the network may vary over time, and codec bitrate switching is employed to adapt to the available capacity or capability of the system. When the network capacity is lower the coding may use a lower bitrate than when the network capacity is higher. Bitrate switching significantly affects the coding scheme applied to the transport audio signals and spatial metadata. For example, in metadata coding the number of frequency bands and the number of temporal subframes (i.e., the time-frequency resolution) are dependent on the bitrate. Thus, when the bitrate changes, the number of coding bands and subframes typically changes as well (apart from certain subsets of bitrates which use the same number of bands and subframes).
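The dependency between bitrate and time-frequency resolution described above can be sketched as a simple lookup. The table below is purely illustrative: its thresholds and band/subframe pairs are assumptions made for this sketch and do not reproduce the actual IVAS configuration tables, and the function names are hypothetical.

```python
# (upper bitrate limit in kbps, frequency bands, temporal subframes)
# Illustrative values only, not the IVAS tables.
RESOLUTION_TABLE = [
    (16.4, 5, 1),
    (32.0, 8, 1),
    (64.0, 12, 4),
    (512.0, 24, 4),
]


def resolution_for(bitrate_kbps):
    """Return the (bands, subframes) pair assumed for a given total bitrate."""
    for limit, bands, subframes in RESOLUTION_TABLE:
        if bitrate_kbps <= limit:
            return bands, subframes
    raise ValueError("bitrate above the supported range")


def switch_changes_resolution(old_kbps, new_kbps):
    """True when a bitrate switch also changes the time-frequency resolution."""
    return resolution_for(old_kbps) != resolution_for(new_kbps)


same_subset = switch_changes_resolution(13.2, 16.4)   # both map to the first row
different = switch_changes_resolution(13.2, 24.4)     # first row versus second row
```

A decoder-side enhancer could use a check of this kind to decide whether the stored history context is still compatible after a switch.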

[0108] This bitrate switching can cause issues in post-processing systems for metadata enhancement, such as GB2411721, as the spatial metadata before and after the bitrate switch differ, for example, in having a different number of frequency bands and / or temporal subframes. Thus, employing data with one time-frequency resolution on a machine learning model trained for a different time-frequency resolution can result in incorrect or inaccurate outputs being generated. For example, feeding history data with 5 frequency bands to an ML method designed and trained to utilize 8 frequency bands from another encoding bitrate by zero-padding the history data significantly deteriorates the quality, as each ML model has been trained to operate with a dedicated time-frequency resolution. Furthermore, data originating from different bitrates may also differ in other relevant properties that the models learn. In other words, the input features to the ML models can be incompatible across the different coding settings.

The following examples focus on encoding metadata, but the same compatibility issue can be present with transport audio signal encoding and post processing. The features from the transport audio signals used by GB2411721 are such that they are similar provided that the number of transport channels and the signal bandwidth remain the same. If, for example, the number of transport audio signals changes, similar compatibility issues with the models can be experienced.
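The zero-padding example mentioned above (5-band history data fed to a model trained for 8 bands) can be made concrete with a small sketch. The two band groupings below are hypothetical partitions of 24 fine bands chosen for illustration; they are not the IVAS band tables. The sketch only shows that, after padding, values sit in slots that cover different frequency ranges than the model expects.

```python
# Hypothetical groupings of 24 fine bands into coded bands (start, end exclusive).
LAYOUT_5 = [(0, 2), (2, 5), (5, 9), (9, 15), (15, 24)]
LAYOUT_8 = [(0, 1), (1, 2), (2, 4), (4, 6), (6, 9), (9, 13), (13, 18), (18, 24)]


def zero_pad(values, target_len):
    """Append zeros so that a short feature vector fits a longer model input."""
    return list(values) + [0.0] * (target_len - len(values))


def misaligned_slots(src_layout, dst_layout):
    """Input slots whose padded source band covers a different range
    of fine bands than the band the model was trained on."""
    padded = list(src_layout) + [None] * (len(dst_layout) - len(src_layout))
    return [i for i, (src, dst) in enumerate(zip(padded, dst_layout)) if src != dst]


history_5 = [0.9, 0.7, 0.5, 0.3, 0.1]          # e.g. energy ratios in 5 bands
padded = zero_pad(history_5, len(LAYOUT_8))    # shape fits, meaning does not
bad_slots = misaligned_slots(LAYOUT_5, LAYOUT_8)
```

With these assumed layouts every one of the 8 input slots is misaligned, which is one way of seeing why padding alone does not produce a compatible history.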

[0109] One suggestion to overcome such compatibility issues could be to employ ML models that do not utilize history, but process all (sub)frames independently. Overcoming the issue of input incompatibility comes at the cost of performance of the ML network during all time instants other than the switching points, as it has been found that employing temporal context improves the performance of the post-enhancement block.

[0110] Another suggestion to overcome these compatibility issues is to employ an ML model with temporal context but set the history at the bitrate switch location to zero. This creates a situation similar to when no history context is available. When the training data for the ML model contains these kinds of examples, it learns to produce some output at those locations. However, the quality of the output is decreased, as the enhancement is missing temporal context during any switches.
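The zero-history alternative described above can be sketched as a history buffer that discards its contents whenever the coding configuration changes. The class name, the `(bands, subframes)` configuration tuple and the per-frame feature vectors are hypothetical; the 10-frame depth follows from the example 200 ms history and the 20 ms frame length given in the text.

```python
from collections import deque

NUM_HISTORY_FRAMES = 200 // 20   # example 200 ms context at 20 ms per frame


class HistoryBuffer:
    """Keeps recent feature frames and zeroes the context at a switch."""

    def __init__(self, num_frames=NUM_HISTORY_FRAMES):
        self.num_frames = num_frames
        self.config = None
        self.frames = deque(maxlen=num_frames)

    def push(self, config, features):
        if config != self.config:      # bitrate switch changed the resolution
            self.config = config
            self.frames.clear()        # incompatible history is discarded
        self.frames.append(list(features))

    def context(self):
        """Return num_frames rows, oldest first, zero-filled where missing."""
        width = len(self.frames[-1])
        missing = self.num_frames - len(self.frames)
        zeros = [[0.0] * width for _ in range(missing)]
        return zeros + [list(f) for f in self.frames]


buf = HistoryBuffer()
for _ in range(3):
    buf.push((5, 1), [0.1] * 5)        # frames coded with 5 bands
buf.push((8, 2), [0.3] * 8)            # first frame after the switch
ctx = buf.context()
```

Immediately after the switch the context holds nine all-zero rows and one real frame, which is the loss of temporal context that the text identifies as the drawback of this approach.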

[0111] The concept as discussed in the following examples attempts to implement an ML-based metadata post-enhancement apparatus or method that prevents the situation where, during bitrate switches, the enhancement cannot improve the metadata quality (in other words, cannot make the decoded metadata closer to the original) but instead produces artefacts in the metadata (e.g., the directions and / or the direct-to-total energy ratios have significantly wrong values at least at some frequencies). These wrong values and artefacts are perceived as sounds suddenly arriving from wrong directions, suddenly changing spaciousness, and / or certain frequencies having a different spatial impression than other frequencies. This makes the sound scene be perceived as unnatural and unpleasant to listen to. These artefacts are caused because the enhancement uses temporal context, which is disturbed by the changes in the coding scheme at different bitrates (e.g., using different time-frequency resolution for the coded metadata), and as a result the enhancement produces artefacts instead of the desired improvements in the metadata.

The following examples and embodiments are related to ML-based post-enhancement of spatial metadata of a coded parametric spatial audio stream (containing transport audio signal(s) and associated spatial metadata) in a system consisting of encoding and decoding devices. In other words, the embodiments describe processing of the decoded spatial metadata in order to attempt to make the enhanced decoded spatial metadata more similar to the original spatial metadata by employing an ML model, for example, such as disclosed in GB2411721.

[0112] In these embodiments apparatus and methods are described which enable spatial metadata enhancement during bitrate switches (which cause changes in the underlying coding scheme, such as different time-frequency resolution used for the coded metadata). The apparatus and methods described in further detail hereafter enable good quality spatial metadata enhancement during the bitrate switches (in other words, enable improvement over the non-enhanced decoded spatial metadata and aim to suppress the creation of artefacts), by generating a compatible history context (in other words, obtaining history data with a suitable, for example the correct, time-frequency resolution) for a second (machine learning) method for spatial metadata enhancement after the switching point using the output of a first (machine learning) method for spatial metadata enhancement before the switching point.

[0113] In some embodiments this can be achieved by processing decoded spatial metadata and decoded audio signal(s) of frames before the switching point (in the following examples the switching point is from bitrate A to bitrate B) using a first machine learning model (designed to operate on input data coded at bitrate A) to obtain spatial metadata in an enhanced time-frequency resolution, for example, the time-frequency resolution of the original spatial metadata before encoding.

[0114] In these embodiments the apparatus and methods are configured to determine generated spatial metadata having time-frequency resolution corresponding to the coding scheme employed at bitrate B by replicating or emulating the metadata encoding performed at bitrate B, using the enhanced spatial metadata of the previous frames (from bitrate A) as the input to this processing.

Furthermore, these embodiments describe apparatus and methods configured to employ the determined generated spatial metadata as the history data (from bitrate A) for the current frame (of bitrate B), and process the current frame together with the determined generated spatial metadata for the previous frames using a second machine learning model (designed for bitrate B) to obtain enhanced decoded spatial metadata.
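The sequence described above (enhance the pre-switch frames with the first model, emulate the bitrate B coding resolution, then run the second model on the current frame with the regenerated history) can be sketched as follows. This is a minimal sketch under strong assumptions: both "models" are trivial placeholders rather than trained networks, only one metadata value per band is handled, the decoded audio input to the first model is omitted, the band layouts are hypothetical, and the bitrate B encoding is emulated by plain averaging rather than by the actual codec rules.

```python
# Hypothetical groupings of 24 fine bands (start, end exclusive).
LAYOUT_A = [(0, 2), (2, 5), (5, 9), (9, 15), (15, 24)]                      # 5 bands
LAYOUT_B = [(0, 1), (1, 2), (2, 4), (4, 6), (6, 9), (9, 13), (13, 18), (18, 24)]


def model_a_enhance(coded_frame):
    """Placeholder for the first ML model (bitrate A): returns values at the
    original resolution by repeating each coded value over its band."""
    if len(coded_frame) != len(LAYOUT_A):
        raise ValueError("frame is not in the bitrate A resolution")
    full = []
    for value, (start, end) in zip(coded_frame, LAYOUT_A):
        full.extend([value] * (end - start))
    return full


def emulate_bitrate_b(full_frame):
    """Emulate the bitrate B band reduction by averaging (an assumption)."""
    return [sum(full_frame[s:e]) / (e - s) for s, e in LAYOUT_B]


def model_b_enhance(current_frame, history):
    """Placeholder for the second ML model (bitrate B): blends the current
    frame with the mean of the compatible history."""
    if any(len(f) != len(LAYOUT_B) for f in history + [current_frame]):
        raise ValueError("history is not in the bitrate B resolution")
    if not history:
        return list(current_frame)
    n = len(history)
    return [0.5 * cur + 0.5 * sum(h[i] for h in history) / n
            for i, cur in enumerate(current_frame)]


def enhance_after_switch(history_a, current_b):
    """Regenerate the history in the bitrate B resolution, then enhance."""
    history_b = [emulate_bitrate_b(model_a_enhance(f)) for f in history_a]
    return model_b_enhance(current_b, history_b), history_b


history_a = [[0.5] * 5, [0.25] * 5]    # decoded frames before the switch
current_b = [0.75] * 8                 # first decoded frame after the switch
enhanced, history_b = enhance_after_switch(history_a, current_b)
```

The point of the sketch is the data flow: the second model never sees 5-band data, because every history frame is first lifted to the full resolution and then reduced to the 8-band layout it was designed for.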

[0115] These embodiments therefore can be configured to render suitable format spatial audio signals (e.g., binaural audio signals) using the decoded audio signal(s) and the enhanced decoded spatial metadata.

[0116] In some embodiments, instead of generating emulated spatial metadata for the history context and computing input features for the ML model from this, the apparatus and methods are configured to generate directly emulated input features for the ML model.

[0117] In some further embodiments, in addition to generating the emulated spatial metadata for the history context or metadata-based model input features, features computed from the transport audio signals are generated for the history context. The audio-based features can, for example, be generated using enhanced decoded spatial metadata and transport audio signal prior to the switching point. These features computed or obtained from the transport audio signals can be employed in examples where the number of transport audio channels changes. For example, in situations where there is a switch from one to two transport audio signals the inter-channel features cannot be determined for the history using the mono signal alone. Thus, the inter-channel features need to be approximated with the aid of the enhanced spatial metadata. Furthermore, if there is a switch from two to one transport audio signals, only mono features are used (as only a mono signal is available after the switch). These mono features can be obtained trivially from the stereo signals.
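For the two-to-one case mentioned above, obtaining mono features from stereo history can be sketched as follows. The equal-weight downmix and the frame-energy feature are assumptions made for this illustration; the text does not specify which mono features the models use or how the downmix is formed.

```python
def downmix(left, right):
    """Equal-weight mono downmix of a stereo frame (an assumed rule)."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [0.5 * (l + r) for l, r in zip(left, right)]


def frame_energy(samples):
    """Hypothetical mono feature: energy of one frame of samples."""
    return sum(s * s for s in samples)


# Stereo history frame before a switch from two transport signals to one.
left = [1.0, -1.0, 0.5, 0.0]
right = [1.0, 1.0, 0.5, 0.0]
mono = downmix(left, right)
mono_feature = frame_energy(mono)
```

The reverse case (one to two transport signals) has no such direct computation, which is why the text relies on the enhanced spatial metadata to approximate the inter-channel features.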

[0118] Fig.1 presents an example system suitable for implementing some embodiments as described in further detail herein. In this example the transport audio signals (or more generally input audio signals) 100 are passed to an encoder 101, which is furthermore configured to receive or otherwise obtain spatial metadata 104 and a bitrate 102.

The bitrate 102 can comprise information indicating the bitrate at which the encoder is configured to encode the transport audio signals 100 and spatial metadata 104 into a bitstream 110.

[0119] In some embodiments the bitrate 102 is information identifying a value of the actual bitrate, or a bitrate level or coding mode based on the bitrate. In some embodiments the bitrate 102 is configured to identify a capacity or capability associated with the encoder, for example, identifying an available processing capability or capacity enabling a coding mode to be selected or determined. The bitrate 102 may change over time. For example, if the transport capability of the network changes over time, the bitrate 102 can be adjusted to compensate for this. As an example, if the transport capability decreases, the bitrate 102 can be decreased in order to mitigate the probability of losing packets (losing packets may cause audible artifacts, such as snaps and clicks in the audio). And, vice versa, if the transport capability increases, the bitrate 102 can be increased in order to get better audio quality (without a significant risk of major packet losses). Thus, in typical practical use cases, the bitrate is not expected to be constant over time.
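The adaptation of the bitrate 102 to the transport capability can be sketched as selecting the highest ladder entry that fits the currently available capacity. Only the 13.2 kbps and 512 kbps endpoints come from the text; the intermediate ladder values, the `headroom` safety factor and the function name are assumptions made for this sketch.

```python
from bisect import bisect_right

# Illustrative bitrate ladder in kbps; intermediate values are assumed.
LADDER = [13.2, 24.4, 32.0, 48.0, 64.0, 128.0, 256.0, 512.0]


def select_bitrate(capacity_kbps, headroom=0.9):
    """Pick the highest ladder bitrate not exceeding the usable capacity.

    headroom is a hypothetical safety factor that leaves margin against
    packet loss; the lowest bitrate is used when nothing else fits.
    """
    usable = capacity_kbps * headroom
    idx = bisect_right(LADDER, usable)
    return LADDER[max(idx - 1, 0)]


falling = [select_bitrate(c) for c in (1000.0, 100.0, 30.0, 10.0)]
```

As the capacity falls, the selected bitrate steps down the ladder, and each such step is a potential switching point for the metadata enhancement discussed in this document.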

[0120] In some embodiments the bitrate information is furthermore incorporated into the bitstream 110 in some form. For example, the bitrate 102 can be encoded by the encoder 101.

[0121] The bitstream 110 can then be obtained and processed by a decoder and renderer 111 to generate output audio signals 114. The output audio signals or spatial audio output can be any suitable output format, for example, binaural audio signals.

[0122] In embodiments where there is no spatial metadata 104 input then the transport audio signals (or input audio signals) 100 can be analysed to determine the spatial metadata. In some other embodiments the spatial metadata 104 can be input with the transport audio signals 100 as a combined input audio signals input passed to the encoder 101.

[0123] In other words, the input to the encoder 101 can be a spatial audio stream comprising transport audio signals 100 and spatial metadata 104. The spatial audio stream can, for example, be in a metadata-assisted spatial audio (MASA) format or any suitable input format.

With respect to Fig.2 is shown a flow diagram showing example operations of the system shown in Fig.1.

[0124] For example, as shown in Fig.2 by 201 is the operation of obtaining transport audio signals and spatial metadata (which may be known as the spatial audio streams).

[0125] Additionally, as shown in Fig.2 by 202 is the operation of obtaining bitrate control information.

[0126] Then as shown in Fig.2 by 203 is the operation of encoding the spatial audio streams (the transport audio signals and spatial metadata) based on the bitrate control information. Furthermore, in some embodiments there is an encoding or incorporation of the bitrate into a bitstream comprising the encoded spatial audio streams.

[0127] Then is the operation of transmitting / receiving (or storing / retrieving) the bitstream comprising the encoded spatial audio streams and the bitrate as shown in Fig.2 by 205.

[0128] The received / retrieved bitstream can then be parsed or demultiplexed into encoded transport audio signals and encoded spatial metadata as shown in Fig.2 by 207.

[0129] The encoded transport audio signals and encoded spatial metadata can then be decoded to generate decoded transport audio signals and decoded spatial metadata as shown in Fig.2 by 209.

[0130] Then based on the decoded transport audio signals and decoded spatial metadata there is as shown in Fig.2 by 211 a rendering of spatial or output audio signals.

[0131] Finally, the rendered spatial or output audio signals are output as shown in Fig.2 by 213.

[0132] Fig.3 furthermore shows a schematic example of the encoder as shown in Fig.1 in further detail. The inputs to the encoder 101, as shown in Fig.3 and Fig.1, are the transport audio signals 100, the spatial metadata 104, and the bitrate 102.

[0133] As shown in Fig.3, the spatial metadata 104 is forwarded to a metadata encoder 305, which encodes spatial metadata, based on the obtained bitrate 102. The encoding can use any suitable method, such as the MASA encoding methods of the IVAS encoder. In some embodiments the bitrate 102 is the total bitrate of the bitstream, and some defined or determined amount of the total bitrate is assigned to the metadata encoder 305, to be used for the encoding of the spatial metadata 104 to generate encoded spatial metadata 306 to be passed to the multiplexer 307.

[0134] Furthermore, as shown in Fig.3, the transport audio signals are forwarded to an audio encoder 303, which encodes the transport audio signals 100 using any suitable audio-signal encoder, such as the IVAS core coder, EVS, or AAC, based on the given bitrate 102.

[0135] In a manner similar to above, the bitrate 102 can be the total bitrate of the bitstream, and some defined or determined amount of the total bitrate is assigned to the audio encoder 303, to be used for the encoding of the transport audio signals 100 to generate encoded transport audio signals 304 to be passed to the multiplexer 307.

[0136] In some encoders the split of the total bitrate to the amounts provided to encode the spatial metadata and transport audio signals can be signal-adaptive. For example, in IVAS encoder with MASA format input the spatial metadata is encoded first and any bit budget excess or deficit affects the amount allocated to the encoding of the transport audio signals.
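The signal-adaptive split described above can be illustrated with a minimal sketch. It assumes a 20 ms frame length and that the audio encoder simply receives whatever remains after the metadata has been coded; the function and parameter names are hypothetical, and the actual IVAS bit allocation is considerably more involved.

```python
def split_frame_budget(total_bitrate_bps, metadata_bits_used, frame_ms=20):
    """Return (metadata_bits, audio_bits) for a single frame.

    Assumption: the metadata is encoded first, and any excess or deficit in
    its bit usage directly changes the budget left for the transport audio.
    """
    total_bits = total_bitrate_bps * frame_ms // 1000
    if metadata_bits_used > total_bits:
        raise ValueError("metadata exceeds the frame bit budget")
    return metadata_bits_used, total_bits - metadata_bits_used


# Illustrative numbers only: 13.2 kbps with 20 ms frames is 264 bits per frame.
meta_bits, audio_bits = split_frame_budget(13200, metadata_bits_used=50)
```

With this arrangement a frame whose metadata codes cheaply leaves more bits for the transport audio signals, and an expensive metadata frame leaves fewer.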

[0137] The bitrate 102 affects the coding method. For example, in the case of IVAS coding of the MASA format, the bitrate given to the metadata coding may vary from about 2.5 kbps to 65 kbps (corresponding to the total bitrate from 13.2 kbps to 512 kbps). This means that the number of frequency bands used for coding can vary from 5 to 24, and the number of temporal subframes can vary from 1 to 4, depending on the bitrate.

[0138] The multiplexer 307 or “MUX” is configured to receive the encoded transport audio signals 304, the encoded spatial metadata 306, and the bitrate 102, and multiplexes them to a bitstream 110, which is the output of the encoder 101.

[0139] With respect to Fig.4 is shown a flow diagram showing example operations of the encoder shown in Fig.3.

[0140] For example, as shown in Fig.4 by 401 is the operation of obtaining transport audio signals, spatial metadata, and bitrate. Additionally, as shown in Fig.4 by 403 is the operation of generating encoded spatial metadata based on spatial metadata and bitrate.

[0141] Then as shown in Fig.4 by 405 is the operation of generating encoded transport audio signals based on transport audio signals and bitrate.

[0142] Then is the operation as shown in Fig.4 by 407 of multiplexing the encoded transport audio signal, bitrate, encoded spatial metadata to generate bitstream.

[0143] Finally, the bitstream is output as shown in Fig.4 by 409.

[0144] Fig.5 furthermore shows a schematic example of the decoder as shown in Fig.1 in further detail. The input to the decoder 111, is as shown in Fig.5 and Fig.1 the bitstream 110.

[0145] The bitstream 110 is forwarded to a demultiplexer 501 or "DEMUX", which demultiplexes the received or obtained bitstream and generates encoded transport audio signals 502, encoded spatial metadata 506, and the bitrate 504. The encoded transport audio signals 502 and the bitrate 504 are forwarded to an audio decoder 503, which decodes the encoded audio signals, based on the bitrate, using a decoder that is compatible with the encoder used for encoding the audio signals. The output of the audio decoder 503 is decoded transport audio signals 510.
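As an illustration of the multiplexing performed by the multiplexer 307 and the demultiplexing performed by the demultiplexer 501, the following sketch packs the bitrate and the two encoded payloads into one byte string and recovers them again. The header layout (three big-endian 32-bit fields) is purely hypothetical and is not the IVAS bitstream format.

```python
import struct

HEADER = ">III"  # hypothetical layout: bitrate, audio length, metadata length
HEADER_SIZE = struct.calcsize(HEADER)


def mux(bitrate_bps, encoded_audio, encoded_metadata):
    """Combine the bitrate and both encoded payloads into one bitstream."""
    header = struct.pack(HEADER, bitrate_bps, len(encoded_audio), len(encoded_metadata))
    return header + encoded_audio + encoded_metadata


def demux(bitstream):
    """Recover (bitrate, encoded audio, encoded metadata) from a bitstream."""
    bitrate_bps, n_audio, n_meta = struct.unpack(HEADER, bitstream[:HEADER_SIZE])
    audio_end = HEADER_SIZE + n_audio
    encoded_audio = bitstream[HEADER_SIZE:audio_end]
    encoded_metadata = bitstream[audio_end:audio_end + n_meta]
    return bitrate_bps, encoded_audio, encoded_metadata
```

Carrying the bitrate in the bitstream is what allows the decoder side to select the matching audio decoder, metadata decoder, and enhancement configuration for each frame.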

[0146] The encoded spatial metadata 506 and the bitrate 504 are forwarded to a metadata decoder 505, which decodes the encoded spatial metadata 506, based on the bitrate 504, using a decoder that is compatible with the encoder used for encoding the spatial metadata. The output of the metadata decoder 505 is decoded spatial metadata 508.

[0147] The decoded spatial metadata 508, the decoded transport audio signals 510 and the bitrate 504 can be passed to the metadata enhancer 507. The metadata enhancer 507 can comprise a post-processing metadata enhancer which produces enhanced decoded spatial metadata 512 that attempts to be more similar to the original spatial metadata 104 than the decoded spatial metadata 508.

[0148] The metadata enhancer 507 is configured to process the decoded spatial metadata 508 using a machine learning model with the aid of the decoded transport audio signals 510, based on the bitrate 504. As discussed above, the bitrate 504 affects the coding that is applied to the spatial metadata, and thus knowing the bitrate improves, or is critical for, the enhancement that is performed.

[0149] The details of the metadata enhancer 507 according to some embodiments are described further below with respect to Figs.7 and 17. The output of the metadata enhancer, the enhanced decoded spatial metadata 512, is in some embodiments output to the spatial synthesizer 509.

[0150] The decoded transport audio signals 510 and the enhanced decoded spatial metadata 512 can be forwarded to the spatial synthesizer 509, which is configured to render a spatial audio output 114 or output audio signals (for example, binaural audio signals) based at least in part on the decoded transport audio signals 510 and the enhanced decoded spatial metadata 512. The spatial synthesizer 509 can be any suitable spatial synthesizer implementation.

[0151] With respect to Fig.6 is shown a flow diagram showing example operations of the decoder shown in Fig.5.

[0152] For example, as shown in Fig.6 by 601 is the operation of obtaining the bitstream.

[0153] Additionally, as shown in Fig.6 by 603 is the operation of demultiplexing the bitstream to generate encoded spatial metadata, encoded transport audio signals, and bitrate.

[0154] Then as shown in Fig.6 by 605 is the operation of generating a decoded transport audio signal based on encoded transport audio signals and bitrate.

[0155] Also as shown in Fig.6 by 607 is the operation of generating a decoded spatial metadata based on encoded spatial metadata and bitrate.

[0156] Then is the operation as shown in Fig.6 by 609 of generating enhanced decoded spatial metadata based on decoded spatial metadata, decoded transport audio signal, and bitrate.

[0157] Furthermore, as shown in Fig.6 by 611 is the operation of spatially synthesizing (rendering) an output audio signal or spatial audio signal based on the enhanced decoded spatial metadata and the decoded transport audio signals.

[0158] Finally, the output audio signals, the spatial audio signals are output as shown in Fig.6 by 613.

[0159] Fig.7 shows schematically an example metadata enhancer 507 according to some embodiments. In the following examples the metadata enhancer 507 is configured to enhance the decoded version of the encoded metadata. However, in some embodiments the enhancement is applied to the encoded metadata. In such embodiments the enhanced encoded spatial metadata can then be decoded. Furthermore, in some embodiments the metadata enhancer 507 can be configured to enhance the spatial metadata (encoded or decoded) based on either decoded transport audio signals 510 or encoded transport audio signals 502.

[0160] The input to the metadata enhancer 507 is the decoded spatial metadata 508, bitrate 504, and decoded transport audio signals 510. Both of the decoded spatial metadata 508 and decoded transport audio signals 510 have been encoded and possibly decoded (partially or fully, and according to the bitrate 504), so the data that the metadata enhancer 507 receives has been compressed in some way in information theoretic sense.

[0161] Moreover, the decoded spatial metadata 508 contains the information at a low time-frequency (TF) resolution due to, for example, being coded by a low-bitrate IVAS codec. The TF resolution depends on the bitrate 504, so the bitrate is given or obtained as an input.

[0162] In some embodiments the metadata enhancer 507 is configured to operate in at least two modes. A first mode is when the bitrate 504 has not changed during the last N frames (this can be referred to as the normal mode), and a second mode when the bitrate 504 has changed during the last N frames (this can be referred to as the bitrate switch mode). In some embodiments the value of N is 10 frames but can be any suitable value and can be based on the length of the history context used by the ML model 705 and / or feature computer 701.
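The mode decision described above can be sketched as a check over the last N frame bitrates. This is a minimal illustration under the assumption that the window of N frames includes the current frame; the function name and the mode labels are hypothetical.

```python
def select_mode(bitrates, n=10):
    """Pick the enhancer mode from per-frame bitrates (oldest first, current last).

    Assumption: the bitrate switch mode applies whenever the last n frames,
    including the current one, do not all share the same bitrate.
    """
    window = bitrates[-n:]
    return "normal" if len(set(window)) <= 1 else "bitrate_switch"


history = [13200] * 9 + [128000]   # a switch on the current frame
mode = select_mode(history)        # "bitrate_switch"
```

Once N consecutive frames at the new bitrate have been received, the window is uniform again and the check falls back to the normal mode.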

[0163] The metadata enhancer 507 when operating in the first or normal mode can be configured to operate in a manner similar to the metadata enhancer described in GB2411721. This is summarized below.

[0164] The metadata enhancer 507 in some embodiments comprises a feature computer 701 or feature determiner which is configured to receive or otherwise obtain as inputs the bitrate 504, the decoded transport audio signals 510, and the decoded spatial metadata 508. In the first or normal mode of operation the feature computer 701 is not configured to receive or obtain the generated history spatial metadata 706 or not configured to employ or use the generated history spatial metadata 706 if it receives or obtains it.

[0165] The feature computer 701 is configured to determine at least one feature 702 which describes relevant properties of the sound scene described by the decoded spatial metadata 508 and decoded transport audio signals 510. The feature computer 701 determination in some embodiments is based on the information from the bitrate 504. The feature computer 701 can be implemented in some embodiments using expert-designed features, or it may be implemented as a machine learning algorithm trained alongside with the ML model 705.

[0166] The at least one feature 702 can comprise features determined from the decoded spatial metadata 508 and also include features determined from the decoded transport audio signals 510.

[0167] In some embodiments the at least one feature 702 may be represented as a sequence of features, where each sequence element corresponds to one temporal frame, sub-frame, slot, or some other temporal unit in the metadata and audio stream.

[0168] In an example embodiment, for the decoded spatial metadata 508 these features are obtained by transforming the directional MASA metadata into Cartesian XYZ-representation (3D-coordinates in X, Y, Z -axes). The relevant fields of the (low-resolution) decoded spatial metadata 508 are denoted with $\operatorname{azi}(k,n)$, $\operatorname{ele}(k,n)$, and $r(k,n)$ corresponding to the azimuth angle, elevation angle, and direct-to-total energy ratio in TF-tile $(k,n)$, where $k$ is the parameter band index, $k = 1, \ldots,$ n_bands_meta, and $n$ is the (sub-)frame index. This spherical representation is transformed into Cartesian vector representation with

[0169] $$\mathbf{v}(k,n) = \begin{bmatrix} v_x(k,n) \\ v_y(k,n) \\ v_z(k,n) \end{bmatrix} = r(k,n) \begin{bmatrix} \cos(\operatorname{azi}(k,n)) \cos(\operatorname{ele}(k,n)) \\ \sin(\operatorname{azi}(k,n)) \cos(\operatorname{ele}(k,n)) \\ \sin(\operatorname{ele}(k,n)) \end{bmatrix}$$

[0171] The metadata feature is the decoded spatial metadata 508 represented with such vectors for each TF-tile. In some embodiments the metadata feature can be considered as a 3-dimensional tensor with the shape (n_bands_meta, n_frames, n_features_meta), where n_features_meta = 3 is the number of input features per tile (in other words, corresponding to the 3 elements of the vector $\mathbf{v}(k,n)$), n_frames is the number of spatial metadata (sub-)frames that are processed at once, and n_bands_meta = 5 corresponds to the 5 parameter bands of the low-resolution metadata in the case of 13.2 kbps (8 would be used at 128 kbps).
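The spherical-to-Cartesian transform and the resulting feature tensor can be sketched as follows. This is an illustrative sketch only: it assumes the angles are given in degrees and that the metadata is held in nested lists indexed [band][frame]; the function names are hypothetical.

```python
import math


def masa_to_xyz(azi_deg, ele_deg, ratio):
    """Map (azimuth, elevation, direct-to-total ratio) to a vector (vx, vy, vz)."""
    azi = math.radians(azi_deg)
    ele = math.radians(ele_deg)
    return (
        ratio * math.cos(azi) * math.cos(ele),
        ratio * math.sin(azi) * math.cos(ele),
        ratio * math.sin(ele),
    )


def metadata_feature(azi, ele, ratio):
    """Build a tensor of shape (n_bands_meta, n_frames, 3) from [band][frame] inputs."""
    return [
        [list(masa_to_xyz(a, e, r)) for a, e, r in zip(azi_row, ele_row, ratio_row)]
        for azi_row, ele_row, ratio_row in zip(azi, ele, ratio)
    ]


# 5 parameter bands and 2 (sub-)frames, all pointing left with ratio 0.5.
feature = metadata_feature([[90.0] * 2] * 5, [[0.0] * 2] * 5, [[0.5] * 2] * 5)
```

Because the vector length equals the direct-to-total energy ratio, a single 3-element vector carries both the direction and how directional the tile is.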

[0172] Similarly, a number of features can, in some embodiments, be determined from decoded transport audio signals 510. These decoded transport audio signal-based features can comprise at least one of:

[0173] features that describe the inter-channel (of two channels of the transport audio signal) cross-correlation properties in frequency bands; features describing the inter-channel level differences in frequency bands;

[0174] features describing local signal energy evolution over time in frequency bands;

[0175] features describing band-energy ratios; and

[0176] features describing local signal energy evolution over time in frequency bands with frequency-dependent weighting.

[0177] These features can be referred to as the Cov features 2000 that refer to transport audio signal covariance properties. The Cov features 2000 can be determined as described in further detail in GB2411721.

[0178] In some examples decoded transport audio signal-based features can also comprise SPAC-based features 2002 that refer to spatial audio capture. The SPAC-based audio features 2002 can be determined as described in further detail in GB2411721.

[0179] It would be understood the actual set of features computed or determined (or selected) is implementation dependent.

[0180] The at least one feature 702 can then be provided to the ML model 705. The ML model implements a machine learning method, using the at least one feature 702 to determine at least one predicted metadata property 704. An example of the at least one predicted metadata property determined or generated by the ML model 705 can comprise MASA directional information (azimuth, elevation, and direct-to-total energy ratio) in a suitable representation, e.g., in Cartesian vector format.

[0181] In some embodiments the operation of the ML model 705 depends on the bitrate 504, since different bitrates affect the number of coded frequency bands and temporal subframes. The bitrate 504 thus affects the structure of the features determined from the decoded spatial metadata 508. For example, the temporal feature sequence may have different number of feature frames per time unit (from different temporal framing), or there may be a different number of feature bands in each feature frame (from different number of coded frequency bands).

[0182] In other words, in some embodiments, a different ML model configuration can be employed by the ML model 705 in situations where there are different feature representations. Thus, in some embodiments specific bitrates or ranges of bitrate 504 values may have dedicated ML models designed solely for them, while some bitrate values or ranges may share the same ML model, for example, bitrates that share the same number of frequency bands and temporal subframes. In other words, the different bitrates 504 may result in different internal structures of ML models 705.
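One way to picture the sharing of models between bitrates is a registry keyed by the coding configuration rather than by the bitrate itself. The sketch below is hypothetical: the class, the table name, and the `build_model` callback are assumptions, and the table lists only the two configurations stated in this description (13.2 kbps and 128 kbps).

```python
# bitrate in bps -> (frequency bands, temporal subframes); only the two
# configurations named in the description are listed, others are omitted.
CODING_CONFIG = {13200: (5, 1), 128000: (8, 4)}


class ModelRegistry:
    """Hypothetical cache that shares one model per (bands, subframes) pair."""

    def __init__(self, build_model):
        self._build_model = build_model   # callback: (bands, subframes) -> model
        self._models = {}

    def get(self, bitrate_bps):
        config = CODING_CONFIG[bitrate_bps]
        if config not in self._models:
            self._models[config] = self._build_model(*config)
        return self._models[config]


registry = ModelRegistry(lambda bands, subframes: {"bands": bands, "subframes": subframes})
model_128k = registry.get(128000)
```

Two bitrates that map to the same (bands, subframes) pair would receive the same model object, which corresponds to the shared-model case in the text.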

[0183] The original decoded spatial metadata 508 with a low TF-resolution and the at least one predicted metadata property 704 (alongside the bitrate 504) are provided as an input to the metadata determiner 707. The metadata determiner is configured to employ the at least one predicted metadata property 704 for enhancing the decoded spatial metadata (and further based on the bitrate 504 information). In other words, the metadata determiner 707 can be configured to process the decoded spatial metadata 508 parameters based on the at least one predicted metadata property 704 and the bitrate 504 to attempt to reduce any error between the output spatial metadata parameters (which are the enhanced decoded spatial metadata 512) and the original spatial metadata parameters 104 input to the encoder 101 as shown in Fig.1. In some embodiments this can result in an increase in the effective TF-resolution of the decoded spatial metadata, as well as an increase of the de-quantization accuracy of the parameter values.

[0184] The output of the metadata determiner 707 can be the enhanced decoded spatial metadata 512, which can be output from the metadata enhancer 507.

[0185] In some embodiments the ML model 705 and / or feature computer 701 may employ a number of surrounding temporal frames for processing a specific frame. For example, when processing latency is not of importance, the ML model 705 and / or feature computer 701 may employ both past and future (neighboring) frames.

[0186] In some embodiments where processing latency should be kept as low as possible then the ML model 705 and / or feature computer 701 is configured to employ only past (history) frames. The exact length of the history context depends on the internal structure of ML model 705 and is an implementation design parameter. Some investigations have indicated that a history or time range of approximately 200 ms produces suitable results, however this is an example value and other values can be employed in other embodiments.

[0187] During initialization there is no history context available. In some embodiments the enhancement processing is activated only once enough history is available. In other words, no enhancement is performed at the beginning, leading to the worst output quality. Alternatively, in some embodiments a constant value, for example, zeros or mean over the training data, may be used to fill the history that is not available. The ML model 705 in some embodiments is configured to recognize this situation and allocate a lower or less importance weighting to those initial frames. As a result, the ML model 705 is configured to produce some enhancement during initialization or at the beginning of receiving a signal, but these enhancements may be inferior to a full or correct history context based enhancement.
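The constant-fill alternative for the missing history can be sketched as below. The validity mask is an assumption added for illustration, as one possible way for a model to recognize padded frames; the description does not specify how the ML model 705 detects them.

```python
def pad_history(frames, n, fill_frame):
    """Return exactly n frames (oldest first) plus a per-frame validity mask.

    Frames missing at start-up are replaced by fill_frame, which could hold
    zeros or a mean over the training data. The mask is a hypothetical aid
    for down-weighting the padded frames.
    """
    recent = frames[-n:]
    n_missing = n - len(recent)
    padded = [fill_frame] * n_missing + recent
    valid = [False] * n_missing + [True] * len(recent)
    return padded, valid


# Two real frames received so far, history context of four frames.
context, mask = pad_history([[0.1, 0.2], [0.3, 0.4]], n=4, fill_frame=[0.0, 0.0])
```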

[0188] The second or bitrate switch mode for the metadata enhancer 507, as described above, can be activated when the value of bitrate 504 changes (compared to the previous frame). After being activated, the second or bitrate switch mode is maintained or kept activated at least for a number of, N, frames.

[0189] Let us first consider the first frame after the bitrate has switched, in other words, the bitrate is different for the current frame than for the previous frame.

[0190] As the bitrate has changed, there may have been a change in the coding of the spatial metadata and / or transport audio signals within the encoder. This furthermore may have changed the encoding in such a way that the decoded spatial metadata 508 received for the previous frames is not compatible with the model employed by the ML model 705 and employed to enhance the decoded spatial metadata 508 received for this frame. As discussed earlier, this difference can be reflected, for example, by a difference in number of frequency bands for the current frame compared to previous frames. As an example, for the previous frame, IVAS coding of the MASA format at the total bitrate of 13.2 kbps may have been applied, which uses 5 frequency bands (and one temporal sub-frame) for the MASA spatial metadata. And, as an example, the total bitrate for the current frame may be 128 kbps, which uses 8 frequency bands (and 4 temporal sub-frames). Thus, the ML model used at the 128 kbps is not compatible with the decoded spatial metadata received for the previous frames.

[0191] Compared to the operation of the first or normal mode, the second or bitrate switch mode receives or employs generated history spatial metadata 706 obtained from the history metadata generator 703. This history spatial metadata 706 can be arranged or provided in a form compatible with the ML model 705 or feature computer 701 used for this frame. For example, the generated history spatial metadata 706 can have the correct number of frequency bands (which would be 8 frequency bands for our current example, where the bitrate 504 value indicates that the total bitrate is 128 kbps for this frame).

[0192] In this second or bitrate switch mode, the feature computer 701 is configured to concatenate or otherwise combine the decoded spatial metadata 508 and the generated history spatial metadata 706, so that for the current frame it employs the decoded spatial metadata 508, and for the previous N - 1 frames the feature computer 701 employs or processes the generated history spatial metadata 706.

[0193] This concatenated or combined current frame decoded spatial metadata 508 and previous frame generated history spatial metadata 706 is then processed by the feature computer to generate at least one feature 702, which is passed to the ML model 705 to generate the predicted metadata properties 704, which are then furthermore employed by the metadata determiner 707 to process the decoded spatial metadata 508.

[0194] For the next frame, the metadata enhancer in the second or bitrate switch mode is configured to generate a concatenation or combination based on spatial metadata taken for the current and one previous frame from the decoded spatial metadata 508 and spatial metadata for the remaining N - 2 frames from the generated history spatial metadata 706.

[0195] This concatenation or combination process continues for subsequent frames, each of which takes one more frame of metadata from the decoded spatial metadata 508 and one fewer from the generated history spatial metadata 706, until the decoded spatial metadata 508 contains spatial metadata in the correct configuration or arrangement (e.g., the number of frequency bands is correct for the current ML model) for the full length of history context of N frames.

[0196] Thus, by generating compatible history spatial metadata, and concatenating the history spatial metadata with the received spatial metadata, it is possible to utilize the feature computer 701 and ML model 705 in order to enhance the spatial metadata, even when the coding configuration changes resulting in incompatible decoded transport audio signals or decoded spatial metadata for the new ML model.
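The frame-by-frame hand-over from generated history to decoded metadata can be sketched as follows. This is a minimal illustration with hypothetical names, assuming both inputs are lists of per-frame metadata ordered oldest first and that the generated history ends at the frame just before the switch.

```python
def assemble_context(decoded_since_switch, generated_history, n=10):
    """Build the n-frame feature-computer input in bitrate switch mode.

    decoded_since_switch: decoded frames at the new bitrate, current frame last.
    generated_history: generated frames in the new configuration covering the
    frames before the switch.
    """
    decoded = decoded_since_switch[-n:]
    n_history = n - len(decoded)
    # Guard the zero case explicitly: a slice of [-0:] would return everything.
    history = generated_history[-n_history:] if n_history > 0 else []
    return history + decoded


generated = ["gen%d" % i for i in range(9)]           # N - 1 generated frames
first = assemble_context(["dec0"], generated)          # 9 generated + 1 decoded
second = assemble_context(["dec0", "dec1"], generated) # 8 generated + 2 decoded
```

After N frames at the new bitrate the context consists of decoded metadata only, at which point the generated history is no longer needed.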

[0197] With respect to Fig.8 is shown a flow diagram showing example operations of the metadata enhancer shown in Fig.7.

[0198] For example, as shown in Fig.8 by 801 is the operation of obtaining decoded spatial metadata, decoded transport audio signals, and bitrate.

[0199] Additionally, as shown in Fig.8 by 802 is the operation of, in the first or normal mode, selecting or using decoded spatial metadata for feature determination.

[0200] Optionally, as shown in Fig.8 by 803 is the operation of, in the second or bitrate switch mode, generating or obtaining history spatial metadata based on enhanced decoded spatial metadata, decoded transport audio signals and bitrate.

[0201] Also, in the second or bitrate switch mode, is the operation of generating concatenation or combination of history and decoded spatial metadata for feature determination as shown in Fig.8 by 804.

[0202] Then, as shown in Fig.8 by 805 is the operation of generating or determining at least one feature based on the concatenation (in second mode) as generated in 804, or decoded spatial metadata (in first mode) as obtained in 802 and the bitrate (and decoded transport audio signal).

[0203] Also as shown in Fig.8 by 807 is the operation of generating predicted metadata properties based on the determined at least one feature and bitrate. Then is the operation as shown in Fig.8 by 809 of generating enhanced decoded spatial metadata based on decoded spatial metadata, predicted metadata properties, and bitrate.

[0204] Finally, the enhanced decoded spatial metadata is output as shown in Fig.8 by 811.

[0205] With respect to Figs.9 and 10 are shown example tables of the inputs to the feature computer in various examples.

[0206] For example, Fig.9 shows example inputs for the feature computer for the first or normal mode and for a current frame and 5 previous frames. In the first or normal mode, the decoded spatial metadata is used for all N frames, as the time-frequency resolution is the same for all of them.

[0207] Then, in Fig.10, is shown example inputs for the feature computer for the second or bitrate switch mode and where the current frame shows a 128 kbps bitrate frame and the previous frames show 13.2 kbps bitrate frames. The upper table shows that without the combination or concatenation then the previous frame feature computer inputs have 5 bands but the current frame input has 8 bands (a discrepancy in the time-frequency resolution when the bitrate changes) and thus the aforementioned incompatibility issues can occur.

[0208] Fig.10 furthermore shows that in the second or bitrate switch mode, any discrepancy or inconsistency in the time-frequency resolution is avoided by using the generated history spatial metadata for the previous frames with 8 bands and the current frame decoded spatial metadata with 8 bands.

[0209] In some embodiments if the bitrate changes during or while the bitrate switch mode is active then the bitrate switch mode is kept enabled or restarted for N frames, starting from the last bitrate switch. Otherwise, the above operations are implemented. In this case the history metadata generator is configured to determine generated history spatial metadata based on the enhanced decoded spatial metadata in such a form that the ML model for the current new switched bitrate can operate using this configuration or arrangement.

[0210] With respect to Fig.11 is shown an example ML model 705 configuration as shown in Fig.7 in further detail. Fig.11 exemplifies the temporal operations of a single Dilated Residual Block (DRB) 2056A-2056D, as shown in Fig.21. The DRBs are building blocks that can be used in the ResU-Net structure shown in Fig.20 and Fig.21, and that can be used in an example embodiment of ML model 705 as illustrated in Fig.19.

[0211] Different structures can be used for the machine learning models 705 in different examples. Figs.19 and 20 show an example structure for a machine learning model 705 that can be used in some examples. Other structures for the machine learning model 705 can be used in other examples.

[0212] The input features are provided or obtained as an input to a DNN that produces predicted metadata properties, which can then be used to determine enhanced spatial metadata. The exact architecture of the DNN is not critical and any suitable arrangement or implementation can be employed. The disclosure of GB2411721 uses a DNN consisting of separate U-Net -type sub-models for different feature inputs followed by one more U-Net -type model for producing the actual predicted metadata properties based on the outputs of the feature sub-models. The U-Net -type model used in this example is similar to ResU-Net as disclosed in Zhang, Z., Liu, Q. & Wang, Y. (2018). Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters, vol. 15, issue 5, pp. 749-753. DOI: 10.1109/LGRS.2018.2802944. In this implementation each macro layer consists of ResNet blocks as disclosed in He, K., Zhang, X., Ren, S. & Sun, J. (2015). Deep residual learning for image recognition. arXiv: 1512.03385. Normal ResNet blocks in the literature are non-causal when applied to a TF-representation of a signal. This means that they utilize values from future time instants for determining the output for the current time instant. However, it is possible to modify the block to operate in a causal manner, and this is used in the current example embodiment.
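The difference between causal and non-causal operation can be illustrated with a one-dimensional convolution along the time axis. This is a simplified sketch, not the block used in the embodiment: the actual blocks use learned 2D convolutions, and the zero padding before the first frame and the `dilation` parameter handling are assumptions.

```python
def causal_conv1d(x, kernel, dilation=1):
    """Causal convolution: y[t] = sum_i kernel[i] * x[t - i * dilation].

    Only the current and past samples contribute to y[t]; samples before the
    start of the signal are treated as zeros (an assumption of this sketch).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            src = t - i * dilation
            if src >= 0:
                acc += weight * x[src]
        y.append(acc)
    return y


signal = [1.0, 2.0, 3.0, 4.0]
smoothed = causal_conv1d(signal, [1.0, 1.0, 1.0])          # 3-tap kernel
dilated = causal_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)
```

A larger dilation widens the span of past frames that influence each output without adding kernel taps, which is one reason dilated blocks suit a long history context at low latency.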

[0213] In the example of Fig.19 the machine learning model 705 comprises a W-Net structure. The structure is referred to as a W-Net because the path from the input to the output goes through two U-Net structures. Fig.20 shows an example U-Net structure that can be used in the machine learning model 705.

[0214] The U-Net structure comprises a downsampling part 2010 followed by an upsampling part 2014. The downsampling part 2010 comprises a sequence of downsampling layers. The respective downsampling layers can comprise convolutional operations such as 2D convolutional layers applying convolutional operation over the spatial dimensions of the data. The downsampling layers of the downsampling part 2010 can reduce the dimensions of an input along at least some axes.

[0215] The output of the downsampling part 2010 has a smaller number of data elements in at least one axis compared to the original input.

[0216] The output of the downsampling part 2010 is provided as an input to the upsampling part 2014. The upsampling part 2014 is configured to produce output data. The upsampling part 2014 comprises a sequence of upsampling layers. The upsampling part 2014 can comprise X upsampling layers where X can be also the number of downsampling layers in the downsampling part 2010. The upsampling layers can comprise transposed convolutional operations and / or upsampling operations possibly followed or preceded by convolutional operations. The upsampling layers of the upsampling part 2014 can increase the dimensions of an input along at least some axes.

[0217] The U-Net structures in this example also comprise skip connections 2012. The skip connections 2012 are configured to relay skip connection signals from respective downsampling layers to corresponding upsampling layers. The skip connection signals can reintroduce features from the downsampling part 2010 back into corresponding layers of the upsampling part 2014. The upsampling layers can comprise operations such as concatenating operations to combine data from a skip connection signal with input data. The upsampling layers can also comprise operations to increase the dimensions of data in at least one axis compared to the input that is input to the upsampling part 2014.

[0218] The upsampling part 2014 provides output data. The output data of the upsampling part 2014 typically has the same number of data elements in at least one dimension as the input data that is originally provided to the downsampling part 2010.
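The way the data size evolves through the downsampling part 2010 and the upsampling part 2014 can be followed with simple bookkeeping along one axis. The sketch assumes each downsampling layer divides the size by a fixed stride (rounding up) and each upsampling layer restores the size of its matching skip connection; both are illustrative assumptions, and the function name is hypothetical.

```python
def unet_axis_sizes(n_in, n_layers, stride=2):
    """Track the size of one axis through a U-Net with n_layers down and up.

    Returns (down, up): sizes after each downsampling layer (input first) and
    after each upsampling layer (bottleneck first).
    """
    down = [n_in]
    for _ in range(n_layers):
        down.append(-(-down[-1] // stride))   # ceiling division
    up = [down[-1]]
    for skip_size in reversed(down[:-1]):
        # Each upsampling layer matches the size carried by its skip connection.
        up.append(skip_size)
    return down, up


down_sizes, up_sizes = unet_axis_sizes(24, n_layers=3)
```

The last entry of the upsampling list equals the input size, matching the statement that the output has the same number of data elements as the original input in at least one dimension.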

[0219] In the example of Figs.19 and 20 the output of the downsampling part 2010 is provided as an input to the upsampling part 2014. In other examples there can be one or more intervening components such as a bottleneck and / or any other suitable operations or combinations of operations.

[0220] In the example of Fig.19 the machine learning model 705 comprises three pre-networks 2020A, 2020B, 2020C. Each of the pre-networks 2020A, 2020B, 2020C comprises a U-Net. The U-Net can be as shown in Fig.20 (some of the reference numbers are omitted in Fig.19 for clarity) or can have any other suitable arrangement.

[0221] The machine learning model 705 receives the features 702 as an input. In this example the features 702 can comprise the metadata features 2004 and the audio features. The audio features can comprise SPAC features 2002 and Cov features 2000. The respective feature inputs are provided to different pre-networks 2020. The metadata features 2004 are provided as an input to a first pre-network 2020A, the SPAC features 2002 are provided as an input to a second pre-network 2020B, and the Cov features 2000 are provided as an input to a third pre-network 2020C.

[0222] The respective pre-networks 2020A, 2020B, 2020C are arranged to process the respective inputs to provide intermediate representations 2022A, 2022B, 2022C as an output. The settings and weights of the respective pre-networks 2020A, 2020B, 2020C are specific for each of the input features. The first pre-network 2020A processes the input metadata features 2004 to provide a metadata intermediate representation 2022A as an output, the second pre-network 2020B processes the input SPAC features 2002 to provide a SPAC intermediate representation 2022B as an output, and the third pre-network 2020C processes the input Cov features 2000 to provide a Cov intermediate representation 2022C as an output.

[0223] The intermediate representations 2022A, 2022B, 2022C are provided to a concatenation block 2024. The concatenation block 2024 is configured to combine the intermediate representations 2022A, 2022B, 2022C. The concatenation block 2024 can concatenate the intermediate representations 2022A, 2022B, 2022C along the feature axis or perform any other suitable combination. The concatenation block 2024 provides a combined intermediate representation 2026 as an output.
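Concatenation along the feature axis can be sketched as below. The [band][frame][feature] layout of the intermediate representations is an assumption borrowed from the metadata feature tensor shape; the function name is hypothetical.

```python
def concat_features(*tensors):
    """Concatenate tensors indexed [band][frame][feature] along the last axis.

    Assumption: all tensors share the same number of bands and frames, while
    the number of features per tile may differ between them.
    """
    first = tensors[0]
    combined = []
    for band in range(len(first)):
        row = []
        for frame in range(len(first[band])):
            merged = []
            for tensor in tensors:
                merged.extend(tensor[band][frame])
            row.append(merged)
        combined.append(row)
    return combined


meta_repr = [[[1, 2]]]      # 1 band, 1 frame, 2 features
spac_repr = [[[3]]]         # 1 band, 1 frame, 1 feature
cov_repr = [[[4, 5, 6]]]    # 1 band, 1 frame, 3 features
combined_repr = concat_features(meta_repr, spac_repr, cov_repr)
```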

[0224] The combined intermediate representation 2026 is provided as an input to a combined prediction network 2028. The combined prediction network 2028 can comprise another U-Net structure. The weights and settings of the U-Net structure of each of the four networks, the combined prediction network 2028 and the pre-networks 2020A, 2020B, 2020C can be different. The combined prediction network 2028 provides pre-scale predicted metadata properties 2030 as an output.

[0225] The pre-scale predicted metadata properties 2030 are provided as an input to an XYZ scale block 2032. The XYZ scale block 2032 is configured to apply appropriate scaling to the pre-scale predicted metadata properties 2030. The XYZ scale block 2032 provides predicted spatial metadata properties 704 as an output.

[0226] In the following, the structure and settings of the ML model 705 are described.

[0227] An example metadata features pre-network 2020A (for metadata from, for example, 13.2 kbps IVAS coding with 5 frequency bands) is shown with respect to Fig.21 and described as follows. Corresponding structures can be used for the SPAC features pre-network 2020B, the Cov features pre-network 2020C, and the combined prediction network 2028.

[0228] The input in this case is the metadata features. This input has shape (n_bands_meta, n_frames, n_features_meta). The input is provided to a dimension adjustment 2050 which comprises a linear layer 2052. The linear layer 2052 can be fully connected. The linear layer 2052 can operate on the input dimension that corresponds to the frequency bands. In this case this is the first dimension of the input. In this description a single input is processed. In examples of the disclosure multiple inputs can be processed in parallel as a batch. In such examples the stacking of multiple inputs adds one dimension in front of the actual data dimensions.

[0229] The operation of the linear layer 2052 provides an intermediate feature tensor Y 2054 as an output.
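As a rough illustration of the dimension adjustment 2050, the sketch below applies a fully connected layer along the frequency-band axis of an input with shape (n_bands_meta, n_frames, n_features_meta). The function name `adjust_band_dimension`, the omission of a bias term, and the weight values are assumptions made for illustration only; the description does not specify them.

```python
def adjust_band_dimension(x, weight):
    """Apply a linear map along the first (frequency band) axis.

    x:      nested list with shape (n_bands_in, n_frames, n_features)
    weight: nested list with shape (n_bands_out, n_bands_in)
    returns a nested list with shape (n_bands_out, n_frames, n_features)
    """
    n_bands_in = len(x)
    n_frames = len(x[0])
    n_features = len(x[0][0])
    y = []
    for row in weight:
        if len(row) != n_bands_in:
            raise ValueError("weight row length must equal n_bands_in")
        # Each output band is a weighted sum over all input bands.
        band = [[sum(row[b] * x[b][t][f] for b in range(n_bands_in))
                 for f in range(n_features)]
                for t in range(n_frames)]
        y.append(band)
    return y


# Example: 5 coded bands mapped to 24 bands, 2 frames, 3 features per tile.
# The weights are hypothetical (a plain average); trained weights would differ.
x = [[[float(b)] * 3 for _ in range(2)] for b in range(5)]
weight = [[1.0 / 5] * 5 for _ in range(24)]
y = adjust_band_dimension(x, weight)
shape = (len(y), len(y[0]), len(y[0][0]))
```

Only the first dimension changes; the frame and feature dimensions pass through untouched, which is why the same layer type can serve inputs with 5 or 8 coded bands.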

[0230] The metadata features pre-network 2020A can further comprise multiple residual blocks (ResBlock) or dilated residual blocks (DRB). The ResBlocks and DRBs 2056 comprise a stack of layers that is arranged so that the output of a given layer is taken and added to a subsequent layer deeper within the ResBlock or DRB. An example DRB is shown in Fig.22.

[0231] Each DRB 2056 has stride and dilation settings (str, dil)=(x, y) given within the block, for example, (str, dil)=(2, 2). The first number corresponds to the stride along the frequency axis (first dimension) and the second number corresponds to the convolutional kernel dilation factor parameter along the temporal axis (second dimension). The number of output channels (third dimension) from each DRB is given as the last number in the triplet following the respective DRBs. The DRBs in the downsampling part may have the convolutional kernel parameters of (3, 3).

[0232] The metadata features pre-network 2020A can comprise a sequence of downwards DRBs 2056A-2056D and a sequence of upsampling or upwards DRBs 2056E-2056H possibly paired with a nearest neighbor upsampling layer. The sequence of downwards DRBs 2056A-2056D can provide a downsampling part of the metadata features pre-network. The sequence of upwards DRBs 2056E-2056H and upsampling blocks 2062E-2062G can provide an upsampling or upwards part of the metadata features pre-network 2020A.

[0233] For example, the first DRB 2056A has (str, dil)=(1, 1) with 8 output channels. The second DRB 2056B has (str, dil)=(2, 2) with 16 output channels. The second DRB 2056B is arranged to perform factor 2 sub-sampling in the spatial (time and frequency) dimensions. The third DRB 2056C has (str, dil)=(2,2) with 32 output channels. The third DRB 2056C is arranged to perform factor 2 sub-sampling in the spatial dimensions. The fourth DRB 2056D has (str, dil)=(2, 2) with 64 output channels. The fourth DRB 2056D is arranged to perform factor 2 sub-sampling in the spatial dimensions. All DRBs (first through fourth) of the downsampling path have the convolutional kernel size parameter of (3, 3).

[0234] The output of the last downwards DRB 2056D is passed through a convolutional block (ConvBlock) 2058. The ConvBlock 2058 comprises a two-dimensional convolution with kernel size of (3, 1) for (frequency, time) with 128 output channels, followed by a Batch Normalization (BatchNorm) and a Rectified Linear Unit (ReLU) activation (not shown in Fig. for clarity).

[0235] The output of the ConvBlock 2058 is passed through Transposed Convolution (TransConv) 2060 block. The TransConv 2060 block comprises a two-dimensional transposed convolution with kernel size of (3, 1) and 64 output channels. This effectively up-samples along the frequency axis. This is followed by a BatchNorm and ReLU activation (also not shown in Fig. for clarity).

[0236] The output of the TransConv 2060 block (including the BatchNorm and activation) is concatenated with the output of the last downward DRB 2056D. The concatenation of the output of the TransConv 2060 with the output of the last downward DRB 2056D is performed along the channel dimension. The resulting tensor is provided as an input to the upsampling part of the metadata features pre-network 2020A.

[0237] The upsampling part comprises a fifth DRB 2056E with (str, dil)=(1, 1) followed by an Upsampling (2, 1) 2062E block. The Upsampling 2062E block is arranged to upsample the layer input with the spatial size scaling factor given in the parentheses (frequency, time) using, for example, the nearest neighbor upsampling method. Therefore Upsampling (2, 1) increases the size of the spatial dimension corresponding to frequency by a factor of two using nearest neighbor upsampling. The output of the Upsampling is the output of this layer of the upsampling part of the metadata features pre-network 2020A.

[0238] The output of the Upsampling block 2062E is concatenated with a corresponding matching skip connection tensor. This comprises data from the matching level downsampling layer 2056C. The concatenation is performed along the feature dimension (third dimension in this example). The output of the concatenation is provided to a sixth DRB 2056F with (str, dil)=(1, 1) and following Upsampling (2, 1) block 2062F.

[0239] The output of the Upsampling block 2062F is concatenated with a corresponding matching skip connection tensor from the downward DRB 2056B. The concatenation is performed along the feature dimension (third dimension in this example). The output of this concatenation is provided to a seventh DRB 2056G with (str, dil)=(1, 1) and following Upsampling (2, 1) block 2062G.

[0240] The convolutions in the upsampling part of the metadata features pre-network 2020A have the number of output channels of 32, 16, and 8. The convolutions in the upsampling part of the metadata features pre-network 2020A have the kernel size parameters of (3, 1), corresponding to kernel size along the axes (frequency, time).

[0241] The output of the Upsampling block 2062G is concatenated with a corresponding matching skip connection tensor from the downward DRB 2056A. The last layer of the upsampling part comprises an eighth DRB 2056H with (str, dil)=(1, 1). The last DRB 2056H has five output channels and no following Upsampling blocks. The output of the last DRB 2056H is the output of the metadata features pre-network, which now has shape (24, n_frames, 5). In this example the output of the metadata features pre-network is the intermediate metadata features. Other U-Net structures can be arranged in a similar manner but would have different inputs and outputs.
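To check that the stride and upsampling settings above are mutually consistent, the following sketch traces the size of the frequency axis through the metadata features pre-network 2020A. It assumes "same"-style padding, so that a DRB with stride s maps a size n to ceil(n / s), and it assumes that the ConvBlock 2058 and TransConv 2060 pair together leave the frequency size unchanged (which the channel-wise concatenation with the output of DRB 2056D implies). The helper name `trace_frequency_size` is hypothetical.

```python
import math


def trace_frequency_size(n_in, down_strides, up_factors):
    """Return the frequency-axis size after each stage of a U-Net.

    n_in:         size of the frequency axis at the U-Net input
    down_strides: frequency strides of the downsampling DRBs, in order
    up_factors:   frequency factors of the Upsampling blocks, in order
    """
    sizes = [n_in]
    for stride in down_strides:
        # Assumed 'same' padding: only the stride changes the size.
        sizes.append(math.ceil(sizes[-1] / stride))
    for factor in up_factors:
        # Nearest neighbor upsampling multiplies the size by the factor.
        sizes.append(sizes[-1] * factor)
    return sizes


# Metadata features pre-network: strides (1, 2, 2, 2), three Upsampling (2, 1).
meta_sizes = trace_frequency_size(24, [1, 2, 2, 2], [2, 2, 2])
down_sizes = meta_sizes[1:5]   # outputs of DRBs 2056A..2056D
up_sizes = meta_sizes[5:]      # outputs of Upsampling 2062E..2062G
```

Under these assumptions the sizes after the three Upsampling blocks match the outputs of DRBs 2056C, 2056B and 2056A respectively, which is what the skip-connection concatenations require.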

[0242] Fig.22 shows an example structure for a dilated residual block (DRB). The DRB could be used in the U-Net structures of the machine learning model. The example DRB could be used in a downsampling part. A DRB with the same internal structure could also be used in the upsampling part but different stride and kernel size settings could be used in the upsampling part.

[0243] The internal structure of the DRB has a pre-activation ordering. This means that BatchNorm layers 2212, 2218, and the ReLU activation layers 2214, 2220 are before the convolution operations 2266, 2282.

[0244] The DRB comprises two paths for an input 2200. The first path 2230 is shown on the left side of Fig.22 and the second path 2240 is shown on the right side of Fig.22. The path on the left side corresponds to the residual or by-pass path 2230. The path on the right side corresponds to the convolutional core path 2240.

[0245] The residual or by-pass path 2230 comprises convolution operations 2262 and a BatchNorm layer 2204. The convolution operations 2262 comprises a Conv2D layer with the kernel size of (1, 1) and striding str_f along the first dimension corresponding to frequency axis matching the stride settings of the DRB. The convolution operations 2262 adjust the size of the feature dimension of the input to match the feature dimension of the last convolution in the convolution core.

[0246] The convolution operations 2262 are followed by the BatchNorm layer 2204. The BatchNorm layer 2204 comprises a BatchNorm2D layer that operates on the feature dimension.

[0247] The convolutional core path 2240 consists of a sequence of blocks. The sequence comprises BatchNorm layers 2212, 2218 followed by ReLU activation layers 2214, 2220 which are then followed by convolution operations 2266, 2282. In this example the first BatchNorm layer 2212 is a BatchNorm2D that operates on the feature dimension (third dimension). The first ReLU activation layer 2214 is applied on each element from the BatchNorm layer 2212. The output of the ReLU activation layer 2214 is passed to the first convolution operations 2266. The first convolution operations 2266 comprise a Conv2D with the kernel size corresponding to the kernel size of the DRB, for example, (3, 3) and the stride setting str_f matching the frequency axis stride settings of the DRB.

[0248] The output of the first convolution operations 2266 is passed through a second BatchNorm layer 2218 and a second ReLU activation layer 2220 before being passed to the second convolution operations 2282. The second convolution operations 2282 processes the input with a Conv2D with kernel size corresponding to the kernel size of the DRB, for example, (3, 3), and stride of 1 along the first dimension corresponding to the frequency axis, and dilation corresponding to the dil_t setting of the DRB, for example, 2. In this example, the padding corresponding to the convolutional shrinkage due to both convolutions 2266, 2282 along the first dimension, corresponding to the frequency axis, is applied as reflection padding along the frequency axis before the first convolution 2266.

[0249] The output of the residual or by-pass path and the output of the convolutional core are provided to an addition block 2250. The addition block 2250 adds the output of the residual or by-pass path and the output of the convolutional core in an element-wise manner. The output of the addition block 2250 is the output 2252 of the DRB.

[0250] The SPAC feature pre-network can have a similar structure to the metadata features pre-network described above however different dimensions and settings could be used.

[0251] The input to the SPAC feature pre-network would be the SPAC features. This input has shape (60, n_frames, 33). This input can be provided to a dimension adjustment as described above. The dimension adjustment can also comprise a fully connected linear layer. The dimension adjustment provides an intermediate feature tensor Y with shape (48, n_frames, 33).

[0252] In the example SPAC feature pre-network 2020B the downsampling part also comprises four DRBs. All DRBs in the downsampling path may have the kernel size parameter (3, 3). The first DRB with (str, dil)=(2, 1) has 64 output channels. The second DRB with (str, dil)=(2, 2) has 128 output channels. The third DRB with (str, dil)=(2, 2) has 256 output channels. The fourth DRB with (str, dil)=(2, 2) has 512 output channels. In the example SPAC feature pre-network 2020B the output of the fourth DRB is passed through a convolutional block (ConvBlock). The ConvBlock comprises a two-dimensional convolution with kernel size of (3, 1) for (frequency, time) with 1024 output channels, followed by a TransConv block. The TransConv block has a corresponding kernel size of (3, 1) and 256 output channels.

[0253] In the example SPAC feature pre-network 2020B the upsampling part also comprises a further four DRBs and corresponding Upsampling blocks. The DRBs in the upsampling part may have kernel size parameters of (3, 1). In this example the fifth DRB with (str, dil)=(1, 1) has 256 output channels followed by an Upsampling (2, 1) block. The Upsampling block does not provide any upsampling in the temporal axis. The sixth DRB with (str, dil)=(1, 1) has 128 output channels followed by an Upsampling (2, 1) block. The seventh DRB with (str, dil)=(1, 1) has 64 output channels followed by an Upsampling (2, 1) block. The eighth DRB with (str, dil)=(1, 1) has 20 output channels.

[0254] The output of the SPAC feature pre-network 2020B has shape (24, n_frames, 20).

[0255] The Cov feature pre-network 2020C can have a similar structure to the metadata features pre-network 2020A and also the SPAC feature pre-network 2020B, however different dimensions and settings could be used. For the case of the Cov feature pre-network 2020C the dimension adjustment block can be omitted and the input to the U-Net structure would be an input of shape (2, n_subframes, 5).

[0256] In the example Cov feature pre-network 2020C the downsampling part also comprises four DRBs. The first DRB with (str, dil)=(1, 1) has 8 output channels. The second DRB with (str, dil)=(2, 2) has 16 output channels. The third DRB with (str, dil)=(2, 2) has 32 output channels. The fourth DRB with (str, dil)=(2, 2) has 64 output channels. The DRBs in the downsampling part may have the convolutional kernel parameters of (3, 3).

[0257] In the example Cov feature pre-network 2020C the output of the fourth DRB is passed through a ConvBlock. The ConvBlock comprises a two-dimensional convolution with kernel size of (3, 1) with 128 output channels, followed by a TransConv block. The TransConv block has a corresponding kernel size of (3, 1) and 64 output channels.

[0258] In the example Cov feature pre-network 2020C the upsampling portion also comprises a further four DRBs and corresponding Upsampling blocks. In this example the fifth DRB with (str, dil)=(1, 1) has 32 output channels followed by an Upsampling (2, 1) block. The sixth DRB with (str, dil)=(1, 1) has 16 output channels followed by an Upsampling (2, 1) block. The seventh DRB with (str, dil)=(1, 1) has 8 output channels followed by an Upsampling (2, 1) block. The eighth DRB with (str, dil)=(1, 1) has 8 output channels. There is no Upsampling block following the eighth DRB in the Cov feature pre-network 2020C. The DRBs in the upsampling part may have the convolutional kernel parameters of (3, 1).

[0259] The output of the Cov feature pre-network has shape (24, n_frames, 8). The metadata feature pre-network 2020A can provide a metadata intermediate representation 2022A as an output, the SPAC feature pre-network 2020B can provide a SPAC intermediate representation 2022B as an output, and the Cov feature pre-network 2020C can provide a Cov intermediate representation 2022C as an output. Using the example U-Net structures as described above the metadata intermediate representation 2022A has a shape (24, n_frames, 5), the SPAC intermediate representation 2022B has shape (24, n_frames, 20), and the Cov intermediate representation 2022C has shape (24, n_frames, 8).

[0260] The intermediate representations are provided to a concatenation block 2024. The concatenation block 2024 concatenates the intermediate representations 2022A, 2022B, 2022C along the feature axis. The concatenation block provides a combined intermediate representation 2026 as an output. The combined intermediate representation 2026 has a shape (24, n_frames, 33).
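The shape bookkeeping of the concatenation block 2024 can be sketched as follows. The helper `concat_along_features` is a hypothetical name, and the value of `n_frames` is only an example; the check simply confirms that 5 + 20 + 8 feature channels yield the 33 channels stated above.

```python
def concat_along_features(shapes):
    """Return the shape obtained by concatenating tensors along the last axis.

    All axes except the last (feature) axis must agree.
    """
    lead = shapes[0][:-1]
    for shape in shapes:
        if shape[:-1] != lead:
            raise ValueError("all axes except the feature axis must match")
    return lead + (sum(shape[-1] for shape in shapes),)


n_frames = 200  # example value only
combined_shape = concat_along_features([
    (24, n_frames, 5),   # metadata intermediate representation 2022A
    (24, n_frames, 20),  # SPAC intermediate representation 2022B
    (24, n_frames, 8),   # Cov intermediate representation 2022C
])
```

This is also why each pre-network has to deliver 24 frequency bands and the same number of frames: a mismatch on either axis would make the concatenation ill-defined.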

[0261] The combined intermediate representation 2026 is provided as an input to a fourth U-Net structure 2028. The fourth U-Net structure 2028 is the combined prediction network. The combined prediction network 2028 has a similar structure to the metadata features pre-network 2020A with some differences. For the case of the combined prediction network the dimension adjustment block can be omitted and the input to the U-Net structure would be an input of shape (24, n_frames, 33). In the example combined prediction network the downsampling portion also comprises four DRBs. The first DRB with (str, dil)=(1, 1) has 32 output channels. The second DRB with (str, dil)=(2, 2) has 48 output channels. The third DRB with (str, dil)=(2, 2) has 64 output channels. The fourth DRB with (str, dil)=(2, 2) has 128 output channels. All four DRBs of the downsampling portion may have the convolution kernel size parameter of (3, 3).

[0262] In the example combined prediction network the output of the fourth DRB is passed through a ConvBlock. The ConvBlock comprises a two-dimensional convolution with kernel size of (3, 1) and 512 output channels, followed by a TransConv block with a corresponding kernel size of (3, 1) and 256 output channels.

[0263] In the example combined prediction network the upsampling part also comprises a further four DRBs and corresponding Upsampling blocks. In this example the fifth DRB with (str, dil)=(1, 1) has 64 output channels followed by an Upsampling (2, 1) block. The sixth DRB with (str, dil)=(1, 1) has 48 output channels followed by an Upsampling (2, 1) block. The seventh DRB with (str, dil)=(1, 1) has 12 output channels followed by an Upsampling (2, 1) block. The eighth DRB with (str, dil)=(1, 1) has 3 output channels. There is no Upsampling block following the eighth DRB in the combined prediction network 2028.

[0264] The output 2030 of the combined prediction network 2028 has shape (24, n_frames, 3). The output 2030 of the combined prediction network provides pre-scale predicted metadata properties 2030. These represent the directional metadata in each of the (24, n_frames) TF-tiles in the XYZ vector representation $v_{prescale}(k,n)$.

[0265] The pre-scale predicted metadata properties are provided as an input to an XYZ scale block 2032. The XYZ scale block 2032 is configured to apply appropriate scaling to the pre-scale predicted metadata properties 2030. The XYZ scale block 2032 can apply final constraints on the length of the vector.

[0266] In some examples the scaling applied by the XYZ scale block 2032 can comprise determining the length of the input vectors. The length of the input vectors can be determined with

[0267] $$r_{in}(k,n) = \sqrt{v_{prescale,x}^2(k,n) + v_{prescale,y}^2(k,n) + v_{prescale,z}^2(k,n)}$$

[0269] This is passed through a hyperbolic tangent and scaled with a constant, for example, $c_{xyz} = 1.1$, for obtaining the scaled length

[0272] $$r_{scaled}(k,n) = c_{xyz} \tanh(r_{in}(k,n))$$

[0273] Without the scaling, the output of the hyperbolic tangent would require an infinite value of the input for the output to reach the value of 1.0. When the model is used for inference the value of $r_{scaled}(k,n)$ is limited to the range of 0...1.0 with

$$r'_{scaled}(k,n) = \max(0, \min(1, r_{scaled}(k,n)))$$

and this value is used in place of $r_{scaled}(k,n)$.

[0276] A scaling value $s(k,n)$ can be determined from these two lengths with

[0277] $$s(k,n) = \frac{r_{scaled}(k,n)}{r_{in}(k,n)}$$

[0280] The output 704 of the XYZ scale block 2032 is the input 2030 multiplied by this scaling value

[0281] $$v_{pred}(k,n) = s(k,n)\, v_{prescale}(k,n)$$

[0282] The XYZ scale block 2032 provides predicted spatial metadata properties 704 as an output which is the output of the machine learning model 705.
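The scaling performed by the XYZ scale block 2032 can be sketched for a single time-frequency tile as below. The function name `xyz_scale`, the `inference` flag, and the small `eps` guard against a zero-length input are assumptions; the description does not state how an input of length zero is handled.

```python
import math

C_XYZ = 1.1  # example constant given in the description


def xyz_scale(v, c_xyz=C_XYZ, inference=True, eps=1e-12):
    """Rescale one pre-scale XYZ vector to length c_xyz * tanh(|v|).

    v:         (x, y, z) pre-scale prediction for one TF-tile
    inference: if True, clamp the scaled length to the range 0...1.0
    eps:       assumed guard against division by zero (not in the source)
    """
    x, y, z = v
    r_in = math.sqrt(x * x + y * y + z * z)
    r_scaled = c_xyz * math.tanh(r_in)
    if inference:
        r_scaled = max(0.0, min(1.0, r_scaled))
    s = r_scaled / max(r_in, eps)
    return (s * x, s * y, s * z)


# A long pre-scale vector is shortened to unit length at inference time,
# while its direction is preserved.
v_pred = xyz_scale((3.0, 0.0, 4.0))
```

Because the factor of 1.1 lets the tanh output exceed 1.0 for finite inputs, a direct-to-total energy ratio of exactly 1.0 becomes reachable, and the clamp then keeps the ratio within its valid range.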

[0283] The machine learning model can be trained using any suitable process. In some examples the machine learning model can be trained using a set of spatial audio items in MASA format comprising an audio signal such as a transport audio signal and MASA spatial metadata with one directional field. The audio signal can comprise two channels.

[0284] It is assumed that the original spatial metadata has high temporal resolution and all four sub-frames in each frame may contain unique values. The training data comprises 4530 items that are 4 seconds in length (that is, 200 frames or 800 sub-frames). The directional spatial metadata consisting of azimuth $azi(k,n)$, elevation $ele(k,n)$, and direct-to-total energy ratio $r(k,n)$ are transformed into a Cartesian XYZ vector representation using

[0285] $$v_{target}(k,n) = r(k,n) \begin{bmatrix} \cos(azi(k,n)) \cos(ele(k,n)) \\ \sin(azi(k,n)) \cos(ele(k,n)) \\ \sin(ele(k,n)) \end{bmatrix}$$

[0287] This is the reference or target data during the training. The validation data of 799 items is selected from this same pool of items. The description of the metadata features pre-network 2020A above may be used for spatial metadata coded, for example, with the 13.2 kbps IVAS codec. Other bitrates may result in a different metadata feature representation 2004 being employed. For example, IVAS coding with 128 kbps may result in embodiments employing 8 distinct frequency bands. The ML model 705 for processing content encoded with, for example, 128 kbps, may be otherwise similar to the ML model 705 described above, but the metadata features pre-network 2020A may be adjusted. For example, n_bands_meta may be 8, changing the size of the input 2004 in the first dimension. The dimension adjustment layer 2050 may still use a linear layer 2052 and adjust the size of the first dimension to, for example, 24 (or any suitable number). The rest of the metadata feature pre-network 2020A may be similar to that described above, or it may feature other adjustments or changes. Similarly, the Cov feature pre-network 2020C, the SPAC feature pre-network 2020B, and the combined prediction network 2028 may be configured differently for different bitrates, or as in the example embodiment, similar configurations can be used for different bitrates.

[0288] For the input to the ML model 705, the training items are passed through an IVAS codec (encoder and decoder) with a specific bitrate, for example, 13.2 kbps or 128 kbps, and the decoded MASA metadata is obtained using the external renderer (EXT) output mode of the IVAS decoder. The IVAS codec may reduce the frequency resolution of the spatial metadata from the original 24 bands to 5 bands (for the bitrate of 13.2 kbps) or 8 bands (for 128 kbps) and apply quantization to the values. The IVAS codec may additionally reduce the temporal resolution of the spatial metadata by assigning the same parameter value to each of the four sub-frames.

[0289] This metadata is transformed into the Cartesian XYZ vector representation in the same way as the reference data. The training uses batch size of 32, that is, the parameters of the model are adjusted after each 32 training examples. AdaDelta optimizer is used with learning rate of 1.0. The training is run for the maximum of 1000 epochs or until early stopping is triggered. The early stopping is triggered when the per-epoch validation loss is not lower than the best per-epoch validation loss in 50 consecutive epochs. The validation batch size is 8 items. The loss can be computed as follows. First, the direct-to-total energy ratios of the target and the predicted data are computed

[0290] $$r_{target}(k,n) = \sqrt{v_{target,x}^2(k,n) + v_{target,y}^2(k,n) + v_{target,z}^2(k,n)}$$

[0291] $$r_{pred}(k,n) = \sqrt{v_{pred,x}^2(k,n) + v_{pred,y}^2(k,n) + v_{pred,z}^2(k,n)}$$

[0293] Then, the absolute value of the energy ratio difference is computed

[0295] $$r_{absdiff}(k,n) = \left| r_{pred}(k,n) - r_{target}(k,n) \right|$$

[0296] Then, unit-length direction vectors are computed for the target and the predicted data

[0297] $$v_{target,unitlen}(k,n) = \frac{v_{target}(k,n)}{r_{target}(k,n)}$$

[0300] $$v_{pred,unitlen}(k,n) = \frac{v_{pred}(k,n)}{r_{pred}(k,n)}$$

[0303] Then, direction error vector is computed by

[0306] $$v_{error}(k,n) = v_{pred,unitlen}(k,n) - v_{target,unitlen}(k,n)$$

and the length of the direction error vector is computed by

[0307] $$l_{error}(k,n) = \sqrt{v_{error,x}^2(k,n) + v_{error,y}^2(k,n) + v_{error,z}^2(k,n)}$$

[0309] This length is weighted by the target direct-to-total energy ratio

[0311] $$l_{error,w}(k,n) = l_{error}(k,n)\, r_{target}(k,n)$$

[0312] Using the determined absolute value of the energy ratio difference and the determined weighted length of the direction error vector, the combined error measure is determined by

[0315] $$l_{comb}(k,n) = r_{absdiff}(k,n) + l_{error,w}(k,n)$$

[0316] Then, an energy-weighting metric is determined. The energy-weighting metric is for weighting the loss based on the energies of the corresponding time-frequency tiles (that is, time-frequency tiles having larger energy should have a larger effect on the loss). The energy-weighting metric can be formulated in any suitable manner. In some examples, the local signal energy evolution over time in frequency bands with frequency dependent weighting (referred to as $Cov_5(n_{sf},k)$) as computed by the feature computation block may be used as the weight. As $Cov_5(n_{sf},k)$ is computed in subframes $n_{sf}$, and the loss is computed in frames $n$, the mean of the values $Cov_5(n_{sf},k)$ for the corresponding frame $n$ are computed and set as the weight

[0317] $$E_{loss,w}(k,n) = \frac{1}{N_{sf}} \sum_{n_{sf} \in n} Cov_5(n_{sf},k)$$

[0320] where $n$ is the frame index, $n_{sf}$ the subframe index, and $N_{sf} = 4$ the number of subframes in a frame.

[0321] Then, the energy-weighted combined error measure is determined by

[0323] $$l_{comb,w}(k,n) = l_{comb}(k,n)\, E_{loss,w}(k,n)$$

Using $l_{comb,w}(k,n)$ the loss is determined by computing the mean over time and frequency over the entire training example

[0324] $$L_{comb,w} = \frac{1}{KN} \sum_{k} \sum_{n} l_{comb,w}(k,n)$$

[0327] which is the loss that is output from the loss function.
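The loss computation described above can be sketched in plain Python for a small grid of time-frequency tiles. The function names, the nested-list data layout, and the `eps` guard are illustrative assumptions. The sketch also assumes that the combined error measure is the sum of the energy ratio difference and the weighted direction error; the operator is not legible in the source text, so this choice should be treated as an assumption.

```python
import math


def _length(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in v))


def frame_weights(cov5, n_sf=4):
    """Average Cov5 over the n_sf subframes of each frame.

    cov5 is indexed [band][subframe]; the result is indexed [band][frame].
    """
    return [[sum(band[i:i + n_sf]) / n_sf for i in range(0, len(band), n_sf)]
            for band in cov5]


def combined_loss(v_pred, v_target, weights, eps=1e-12):
    """Energy-weighted combined loss, averaged over all (band, frame) tiles.

    v_pred, v_target: indexed [band][frame] -> (x, y, z)
    weights:          indexed [band][frame] -> energy weight
    """
    total, count = 0.0, 0
    for k, (pred_row, target_row) in enumerate(zip(v_pred, v_target)):
        for n, (vp, vt) in enumerate(zip(pred_row, target_row)):
            r_pred, r_target = _length(vp), _length(vt)
            r_absdiff = abs(r_pred - r_target)
            # Unit-length direction vectors (eps guard is an assumption).
            u_pred = [c / max(r_pred, eps) for c in vp]
            u_target = [c / max(r_target, eps) for c in vt]
            l_error = _length([a - b for a, b in zip(u_pred, u_target)])
            l_error_w = l_error * r_target
            l_comb = r_absdiff + l_error_w  # assumed to be a sum
            total += l_comb * weights[k][n]
            count += 1
    return total / count


# One band, one frame: prediction points along x, target along y.
weights = frame_weights([[2.0, 2.0, 2.0, 2.0]])
loss = combined_loss([[(0.5, 0.0, 0.0)]], [[(0.0, 1.0, 0.0)]], weights)
```

Weighting the direction error by the target ratio means that tiles which are mostly diffuse in the reference contribute little direction error, since their direction is perceptually less important.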

[0328] The training of the ML model 705 can be performed with similar procedures for different target bitrates 102. The ML model 705 may be configured differently for different bitrates 102, for example, with respect to the number of frequency bands in the metadata feature 2004 and the settings of the metadata feature pre-network 2020A.
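The stopping rule used in the training procedure (at most 1000 epochs, stopping once the per-epoch validation loss has failed to improve on the best value for 50 consecutive epochs) can be sketched as follows. The function `run_training` is hypothetical, and the sequence of validation losses passed in stands in for the real training and validation steps, which are not shown.

```python
def run_training(validation_losses, max_epochs=1000, patience=50):
    """Return (epochs_run, best_loss) under the early stopping rule.

    validation_losses: iterable giving the validation loss of each epoch;
                       a placeholder for actually training and validating.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses):
        if epoch >= max_epochs:
            break
        epochs_run += 1
        if loss < best:
            # Strict improvement resets the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best


# The loss improves for two epochs and then stalls, so training stops
# 50 epochs after the best value was seen.
stalled = run_training([1.0, 0.5] + [0.6] * 100)
```

A loss equal to the best value counts as "not lower", so a plateau triggers the stop in the same way as a worsening loss.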

[0329] In the example metadata enhancer 507, the metadata determiner block 707 receives the decoded spatial metadata 508 and predicted metadata properties 704 as an input. The decoded spatial metadata 508 can comprise all MASA metadata variables. The metadata determiner block 707 is arranged to enhance the decoded spatial metadata 508 based on the predicted metadata properties 704 by updating the direction (azimuth and elevation) and the direct-to-total energy ratio based on the predicted metadata properties 704 that are generated by the machine learning model 705.

[0330] In some examples the process of enhancing the decoded spatial metadata 508 can comprise transforming the Cartesian XYZ vector representation of the model prediction $v_{pred}(k,n)$ into azimuth $azi_{dec}(k,n)$, elevation $ele_{dec}(k,n)$, and direct-to-total energy ratio $r_{dec}(k,n)$ representation with

[0331] $$azi_{dec}(k,n) = \operatorname{atan2}\left(v_{pred,y}(k,n),\, v_{pred,x}(k,n)\right)$$

[0333] $$ele_{dec}(k,n) = \operatorname{atan2}\left(v_{pred,z}(k,n),\, \sqrt{v_{pred,x}^2(k,n) + v_{pred,y}^2(k,n)}\right)$$

[0334] $$r_{dec}(k,n) = \sqrt{v_{pred,x}^2(k,n) + v_{pred,y}^2(k,n) + v_{pred,z}^2(k,n)}$$

[0336] Here atan2(·) is the inverse tangent function resolving the correct quadrant. If the predicted metadata properties 704 are produced at the temporal rate of per-frame, the temporal resolution of predicted metadata properties 704 can be adjusted to the per-sub-frame temporal resolution of MASA metadata by repeating the same value for all four sub-frames in a frame. The diffuse-to-total energy ratio $r_{diff}(k,n)$ in the spatial metadata is adjusted to reflect the new direct-to-total energy ratio by

[0337] $$r_{diff}(k,n) = 1 - r_{dec}(k,n)$$
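The conversion from the predicted XYZ vector back to direction and energy ratios can be sketched per tile as below, together with the forward transform used for the training targets so that the round trip can be checked. The function names are hypothetical, and angles are handled in radians here as an assumption; the description does not state the angle unit used in the metadata.

```python
import math


def masa_to_xyz(azi, ele, ratio):
    """Direction (radians) and direct-to-total ratio to an XYZ vector."""
    return (ratio * math.cos(azi) * math.cos(ele),
            ratio * math.sin(azi) * math.cos(ele),
            ratio * math.sin(ele))


def xyz_to_masa(v):
    """XYZ vector to (azimuth, elevation, direct ratio, diffuse ratio)."""
    x, y, z = v
    azi = math.atan2(y, x)
    ele = math.atan2(z, math.sqrt(x * x + y * y))
    r_dec = math.sqrt(x * x + y * y + z * z)
    r_diff = 1.0 - r_dec
    return azi, ele, r_dec, r_diff


def repeat_per_subframe(frame_values, n_sf=4):
    """Repeat each per-frame value for all n_sf sub-frames of the frame."""
    return [value for value in frame_values for _ in range(n_sf)]


azi, ele, r_dec, r_diff = xyz_to_masa(masa_to_xyz(math.pi / 6, -math.pi / 4, 0.8))
per_subframe = repeat_per_subframe([1, 2])
```

Since the vector length carries the direct-to-total energy ratio, the diffuse-to-total ratio stays non-negative as long as the predicted vector length does not exceed 1.0.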

[0338] In this example, the surround and spread coherence parameter values are not adjusted, but the values from the low-resolution decoded spatial metadata are used instead, by replicating the, for example, 5 or 8 values to cover all 24 bands. In some other examples the surround and / or spread coherence parameter values can be adjusted. In some examples all parameter values can be obtained from the predicted metadata properties. In these examples the metadata determiner block 707 does not need to have the decoded spatial metadata 508 as an input because all spatial metadata values may be obtained from the predicted metadata properties 704.

[0339] With respect to Fig.11 is shown the temporal behavior of dilated residual block (DRB) described in Fig. 22 in the configuration for the ML model 705.

[0340] In some embodiments the at least one features input 1101 comprises three groups of features determined from both metadata and transport audio signals: metadata features and two sets of features from audio.

[0341] However, these are example input features and in some other embodiments other features can be used or employed as inputs.

[0342] The operation of a DRB block is conceptually illustrated in Fig.11 which shows an example simplified arrangement to show only the data axis corresponding to time. Thus, Fig.11 shows the input features 1101, a first convolution 1163 to the intermediate features 1167, a second convolution 1169 to the output features 1171, and a by-pass 1105 from the input features 1101 to the output features 1171. The example embodiment also contains convolutional operations along the axis corresponding to frequency which are not illustrated in the figure. The input features 1101 boxes at the top row are the 7 consecutive input feature frames. These are processed with a first convolution 1163 with a kernel of size 3 and dilation 1 (or any suitable values) resulting in the intermediate features 1167 on the middle row. These are then processed with a second convolution 1169 with a kernel of size 3 and dilation factor 2 (or any suitable values). This result is combined with the by-pass 1105 (which may contain further convolutions) coming from the input features 1101 to obtain the output features 1171 on the bottom row. However, these convolutions do not directly affect the feature generation and operation modes described above.

[0343] The illustrated DRB block in Fig.11 is configured to operate in causal mode, in other words, the DRB block does not access future inputs. Only the current and past inputs (indices $n-6, n-5, \ldots, n$) are used for determining the output for time instant $n$. The output in the current frame depends on the current input $n$ and a number (here, 6) of earlier inputs. As there are multiple blocks of this kind stacked in the U-Net, the temporal dependencies accumulate and each output depends on a large number (e.g., 20-40) of earlier inputs in addition to the current input. This history context has proven to be useful in the enhancement process.
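The temporal reach of one DRB can be illustrated with a one-dimensional causal dilated convolution along the time axis. This is a simplified sketch: it ignores the frequency axis, normalization, activations and the by-pass (which in this simplified view does not extend the temporal reach), uses all-ones kernels purely to make the dependency pattern visible, and assumes zero-valued history before the first frame. The function names are hypothetical.

```python
def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution along time.

    Output n uses x[n], x[n - dilation], x[n - 2 * dilation], ...
    kernel[0] weights the current input; inputs before the start of the
    sequence are treated as zero (an assumption for this sketch).
    """
    out = []
    for n in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = n - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(layers):
    """Number of input frames seen by a stack of (kernel_size, dilation)."""
    return 1 + sum((k - 1) * d for k, d in layers)


# An impulse at frame 0 shows which later outputs depend on that input.
impulse = [1.0] + [0.0] * 9
hidden = causal_dilated_conv(impulse, [1.0, 1.0, 1.0], dilation=1)
output = causal_dilated_conv(hidden, [1.0, 1.0, 1.0], dilation=2)
frames_seen = receptive_field([(3, 1), (3, 2)])
```

With kernel size 3 and dilations 1 and 2, the input at frame 0 still influences the output at frame 6 but not at frame 7, which corresponds to each output depending on the current input and 6 earlier inputs.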

[0346] With respect to Fig.12 is shown schematically an example history metadata generator 703 as shown in Fig.7 in further detail according to some embodiments. The history metadata generator 703 is configured to emulate encoding and decoding the original spatial metadata into decoded spatial metadata. However, instead of employing the unavailable original spatial metadata the history metadata generator 703 is configured to employ the enhanced decoded spatial metadata 512 of the earlier frames as an input and produces a generated history spatial metadata 706.

[0347] The history metadata generator 703 receives or obtains the enhanced decoded spatial metadata 512, the bitrate 504, and optionally the decoded transport audio signals as an input.

[0348] The enhanced decoded spatial metadata 512 has the spatial metadata in the time-frequency resolution of the original MASA spatial metadata. In other words, in the current example there are 24 frequency bands and 4 temporal subframes. Then, based on the bitrate 504, the history metadata generator 703 is configured to convert the obtained enhanced decoded spatial metadata 512 to the form that is compatible with the current ML model.

[0349] In some embodiments the history metadata generator 703 can furthermore employ the optional decoded transport audio signals 504 input for determining transport audio signal energy in TF-tiles. This in turn can be used in a frequency band combiner 1211 and subframe combiner 1213 for applying additional weighting in the operations to emulate the operations of the (IVAS) encoder more closely.

[0350] The bitrate 504 in some embodiments is first forwarded to a frequency band limits determiner 1201, which is configured to determine the number of the frequency bands for that bitrate, and the limits or ranges 1202 of the frequency bands (in other words which original MASA frequency bands belong to each coded frequency band).

[0351] This information and the frequency band limits are forwarded as frequency band limits 1202 to a frequency band combiner 1211.

[0352] The frequency band combiner 1211 is configured to receive or obtain the enhanced decoded spatial metadata 512 and the frequency band limits 1202 as an input and combines the input 24 frequency bands to the target coding frequency bands (e.g., 8 frequency bands in our example), with optionally the aid of the decoded transport audio signals 504. The conversion can be implemented using the same or similar methods as is done in the (IVAS) encoder in order to have similar features for the spatial metadata as has been used in the training of the ML model (which has been trained using (IVAS) encoded spatial metadata). For example, the methods presented in EP4082009 may be used for determining the 8 combined frequency bands from the original 24 bands. The output of the block is frequency combined spatial metadata 1212.
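As a hedged illustration of the frequency band combiner, the following sketch merges 24 bands into 8 for a single subframe. It assumes the direction metadata of each band is held as a Cartesian (x, y, z) vector and that an energy-weighted average is an acceptable stand-in for the encoder's merging rule; it is not the method of EP4082009 or of the IVAS encoder. The band grouping `BAND_LIMITS_8` and the function name are hypothetical.

```python
# Hypothetical grouping of the 24 MASA bands into 8 coded bands, given as
# half-open index ranges. The real grouping is defined by the codec.
BAND_LIMITS_8 = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 8), (8, 12), (12, 18), (18, 24)]


def combine_bands(vectors, energies, band_limits):
    """Merge per-band direction vectors into coded bands for one subframe.

    vectors:  one (x, y, z) tuple per original band
    energies: one non-negative transport-signal energy per original band
    """
    combined = []
    for lo, hi in band_limits:
        total = sum(energies[lo:hi])
        if total > 0.0:
            weights = [e / total for e in energies[lo:hi]]
        else:
            # No energy information: fall back to a plain average.
            weights = [1.0 / (hi - lo)] * (hi - lo)
        merged = tuple(
            sum(w * v[axis] for w, v in zip(weights, vectors[lo:hi]))
            for axis in range(3)
        )
        combined.append(merged)
    return combined
```

When the optional transport audio energies are not available, passing equal energies for all bands reduces the sketch to an unweighted average.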

[0353] The bitrate 504 can furthermore be forwarded to the subframe limit determiner 1203, which determines the number of the subframes for that bitrate 504, and the limits or ranges of the subframes (in other words, which original MASA subframes belong to each coded subframe). In the example, the current bitrate after the bitrate switch is 128 kbps, which means that the spatial metadata has 4 subframes. This information and the subframe limits are forwarded as the subframe limits 1204 to a subframe combiner 1213.
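The frequency band limits determiner 1201 and the subframe limit determiner 1203 can be pictured as a lookup keyed by bitrate. In the sketch below the band and subframe counts follow the two example bitrates used in this document (5 bands and 1 subframe at 13.2 kbps, 8 bands and 4 subframes at 128 kbps), whereas the band edges, the table name and the function names are assumptions made for illustration.

```python
# Counts follow the examples in the text; the band edges are made up.
# All ranges are half-open index ranges over the 24 bands / 4 subframes.
CODING_CONFIGS = {
    13200: {
        "bands": [(0, 2), (2, 5), (5, 9), (9, 15), (15, 24)],
        "subframes": [(0, 4)],
    },
    128000: {
        "bands": [(0, 1), (1, 2), (2, 3), (3, 5), (5, 8), (8, 12), (12, 18), (18, 24)],
        "subframes": [(0, 1), (1, 2), (2, 3), (3, 4)],
    },
}


def limits_for_bitrate(bitrate_bps):
    """Return (frequency band limits, subframe limits) for a bitrate."""
    config = CODING_CONFIGS[bitrate_bps]
    return config["bands"], config["subframes"]


def tiles_exactly(limits, total):
    """Check that the ranges cover 0..total with no gap and no overlap."""
    position = 0
    for lo, hi in limits:
        if lo != position or hi <= lo:
            return False
        position = hi
    return position == total
```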

[0354] The subframe combiner 1213 is configured to receive or obtain the frequency combined spatial metadata 1212 and the subframe limits 1204 as an input and combines the 4 input subframes to the target coding subframes (4 subframes in this example), with the optional aid of decoded transport audio signals 504. The conversion can be implemented employing the same or similar methods as performed in the (IVAS) encoder in order to have similar features for the spatial metadata as has been used in the training of the ML model (which has been trained using (IVAS) encoded spatial metadata). For example, the methods presented in EP4082009 can be used for determining combined subframes. However, in this example, as the original MASA spatial metadata and the target spatial metadata both have 4 subframes, the combination can be skipped, and the frequency combined spatial metadata can be passed to the output. The output of the block is generated history spatial metadata 706.
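A minimal sketch of the subframe combiner, including the pass-through case described above, could look as follows. It assumes each subframe holds one (x, y, z) vector per coded band and uses an energy-weighted average as a stand-in for the encoder's rule; the function name and data layout are hypothetical.

```python
def combine_subframes(frame, energies, subframe_limits):
    """Merge subframes of one frame into the coded subframes.

    frame:    list of subframes, each a list of per-band (x, y, z) tuples
    energies: same layout as frame, one non-negative energy per tile
    subframe_limits: half-open (lo, hi) ranges over the input subframes
    """
    # Every coded subframe maps to exactly one input subframe: skip merging.
    if all(hi - lo == 1 for lo, hi in subframe_limits):
        return [list(sub) for sub in frame]

    num_bands = len(frame[0])
    combined = []
    for lo, hi in subframe_limits:
        merged = []
        for b in range(num_bands):
            weights = [energies[t][b] for t in range(lo, hi)]
            total = sum(weights)
            if total > 0.0:
                weights = [w / total for w in weights]
            else:
                weights = [1.0 / (hi - lo)] * (hi - lo)
            merged.append(tuple(
                sum(w * frame[t][b][axis] for w, t in zip(weights, range(lo, hi)))
                for axis in range(3)
            ))
        combined.append(merged)
    return combined
```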

[0355] In addition to the frequency band combiner 1211 and the subframe combiner 1213, any other suitable processing can be performed or implemented as well in order to generate or determine data more resembling the decoded spatial metadata at the current bitrate.

[0356] The output of the history metadata generator 703 is the generated history spatial metadata 706, that has the correct properties in the spatial metadata, for example, in this example 8 frequency bands and 4 subframes.

[0357] The above example history metadata generator 703 is based on an IVAS codec. In some embodiments, when another codec or with different metadata resolution reduction method is employed, the history metadata generator is configured to emulate the operations performed by the codec on the metadata employed in the encoder.

[0358] In some embodiments the obtained generated history spatial metadata can be obtained by employing a metadata encoding and decoding using the employed codec within the encoder.

[0359] With respect to Fig.13 is shown a flow diagram of the operations of the example history metadata generator as shown in Fig.12. For example as shown in Fig.13 by 1301 is the operation of obtaining bitrate, enhanced decoded spatial metadata, and decoded transport audio signals.

[0360] Additionally as shown in Fig.13 by 1303 is the operation of generating frequency band limits based on bitrate.

[0361] Furthermore as shown in Fig.13 by 1305 is the operation of generating frequency combined spatial metadata based on the frequency band limits, enhanced decoded spatial metadata and decoded transport audio signals.

[0362] Also, there is the operation of generating subframe limits based on bitrate as shown in Fig.13 by 1307.

[0363] Then, as shown in Fig.13 by 1309 is the operation of generating history spatial metadata based on the frequency combined spatial metadata, subframe limits and decoded transport audio signals.

[0364] Then is the operation as shown in Fig.13 by 1311 of outputting the history spatial metadata.

[0365] The spatial synthesizer 509 as shown in Fig.5 is configured to receive the enhanced decoded spatial metadata 512 and decoded transport audio signals 510, and employ any suitable spatial synthesis to generate the spatial audio signals or output audio signals. For example, the spatial synthesizer 509 may operate according to the principles described in PCT application WO2019086757A1. The cited reference describes also synthesis based on spatial and surround coherence metadata parameters, but in some embodiments they can be assumed or set to be zero. Furthermore, even though the enhancement of the coherence parameters are not discussed herein, it would be understood that coherence parameters could be enhanced in a similar manner. In such implementations if the decoded spatial metadata has the coherence parameters, but they are not enhanced, then the decoded coherence parameters can be directly output (i.e., there would not be enhancement for those parameters). These values would then be used in the spatial synthesizer.

[0366] Furthermore, in some embodiments the spatial synthesizer 509 can implement operations as described in UK application GB2218103.6, and the synthesis methods in 3GPP TS 26.253 IVAS specification. Examples of the effect of implementing some embodiments as described herein are shown with respect to Figs.14 and 15. To demonstrate the effect of implementing the above examples a bitrate switching scenario with postprocessing spatial metadata enhancement is simulated with the following setup. Two target IVAS bitrates of 13.2 kbps and 128 kbps are used. These differ in the TF-resolution used in the coded domain such that 13.2 kbps has only 1 temporal sub-frame (i.e., all 4 sub-frames contain the same parameter values) and 5 frequency bands, while 128 kbps has 4 sub-frames and 8 frequency bands. Furthermore, two machine learning models for enhancing the spatial metadata are trained, one for each target bitrate. Both models use the same underlying structure, but they differ in the size of the metadata feature input. The models are based on the disclosure GB2411721.

[0367] The effect of bitrate switching is simulated every 25 frames by toggling between the two bitrates. For the evaluation, the outputs are cut and re-organized so that the whole output is intended for one bitrate. In other words, all the outputs of 13.2 kbps model are concatenated into one output and all outputs of the 128 kbps model are concatenated into second output for evaluation.
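The simulated switching pattern and the regrouping of outputs per bitrate can be sketched as follows. Which of the two bitrates is used first is not stated, so the order in `rates` is an assumption, and both function names are hypothetical.

```python
def bitrate_schedule(num_frames, period=25, rates=(13200, 128000)):
    """Bitrate in bps for each frame, toggling every `period` frames."""
    return [rates[(n // period) % 2] for n in range(num_frames)]


def regroup_by_bitrate(outputs, schedule):
    """Concatenate per-frame outputs into one sequence per bitrate."""
    grouped = {}
    for output, rate in zip(outputs, schedule):
        grouped.setdefault(rate, []).append(output)
    return grouped


# Frame indices stand in for the per-frame enhancer outputs here.
schedule = bitrate_schedule(100)
grouped = regroup_by_bitrate(list(range(100)), schedule)
```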

[0368] The error in the spatial metadata is computed for each case. The error is measured as root-mean-squared difference from the original spatial metadata when it is represented in Cartesian XYZ-format.

[0369] The per sub-frame RMSE of the 24-band XYZ-representation (RMS error in each TF-tile, averaged over 24 bands) of the directional spatial metadata is evaluated for three cases:

[0370] “input” referring to “decoded spatial metadata”;

[0371] “base” where no bitrate switching is done and the whole signal is processed continuously with the same “ML model”. This is the ideal performance level in view of the current invention; and

[0372] “zero” where the history context is zeroed at each switch as is done in the current state-of-the-art.
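The per sub-frame error measure used for these cases can be sketched as below. The sketch assumes that each TF-tile is represented as a direction unit vector scaled by the direct-to-total energy ratio and that the RMS is taken over the three Cartesian components of a tile before averaging over the 24 bands; the exact conventions of the evaluation are not given here, and the function names are illustrative.

```python
import math


def to_xyz(azimuth_deg, elevation_deg, ratio):
    """Direction (degrees) and energy ratio to a Cartesian vector (assumed convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        ratio * math.cos(el) * math.cos(az),
        ratio * math.cos(el) * math.sin(az),
        ratio * math.sin(el),
    )


def subframe_rmse(reference, estimate):
    """RMS error per TF-tile over x, y, z, averaged over the bands of one sub-frame."""
    per_tile = [
        math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / 3.0)
        for ref, est in zip(reference, estimate)
    ]
    return sum(per_tile) / len(per_tile)
```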

[0373] Fig.14 illustrates the metadata error for an example signal. Without bitrate switching the spatial metadata enhancement processing is able to reduce the difference from the original spatial metadata (error of “base” 1503 < error of “input” 1501). When bitrate switching is introduced and the history context of the enhancement is reset, this results in increased errors at and directly after the switching locations. The additional error is shown as line 1505 with spikes at each bitrate switching location. The plots show the RMSE of the XYZ-representation metadata compared to the original spatial metadata in the encoder input. Thus 1500 shows an error 1501 of the decoded spatial metadata, and 1510 shows error 1503 after spatial metadata enhancement with a method without bitrate switches. Panel 1520 shows error 1505 after spatial metadata enhancement when there is a bitrate switch every 25 frames (i.e., 100 sub-frames) erasing the history context of the enhancement method. Line 1507 shows the difference between this case and the case of using correct history context (second panel) and reveals the increased error at each switching point.

[0374] Fig.15 shows the panel 1520 again but further shows the same example with the application of some embodiments at the locations of the bitrate switches. In this example the history metadata generator receives the enhanced decoded spatial metadata from the other model as the input, i.e., the 13.2 kbps model receives the features generated from the output from 128 kbps enhancement, and vice versa. The error of the resulting enhanced spatial metadata is shown in the 1610 panel (referred to as “init” 1605). The line 1607 shows the difference between this case and the case of using the baseline using the correct history context.
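The difference between the “zero” case and the “init” case can be sketched as two ways of filling the history context of the model that takes over after the switch. The sketch below is a simplification: each frame is a flat list of 24 scalar per-band values, bands are merged with a plain average, and all names are hypothetical.

```python
def merge_bands(frame, band_limits):
    """Plain average of the scalar per-band values inside each coded band."""
    return [sum(frame[lo:hi]) / (hi - lo) for lo, hi in band_limits]


def history_for_new_model(enhanced_frames, band_limits, mode):
    """Build the history context handed to the model used after the switch.

    enhanced_frames: earlier enhanced frames at the original 24-band resolution
    mode: "zero" discards the history, "init" converts it to the new resolution
    """
    converted = [merge_bands(frame, band_limits) for frame in enhanced_frames]
    if mode == "zero":
        return [[0.0] * len(frame) for frame in converted]
    if mode == "init":
        return converted
    raise ValueError("mode must be 'zero' or 'init'")
```

In both modes the history has the shape expected by the new model; only the “init” mode carries over the information from the frames before the switch.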

[0375] The plots shown in Figs.14 and 15 demonstrate that the state-of-the-art enhancement improves the metadata over what can be decoded from the bitstream (error of “base” 1503 < error of “input” 1501) when the bitrate is constant. Furthermore, as shown in these figures, when the history context is zeroed at the simulated rate switching location, the error increases significantly at the switching locations (error of “base” 1503 < error of “zero” 1505).

[0376] Furthermore, when the history context is initialized using the method proposed in this disclosure, the enhancement works again well without switching artifacts (error of “base” 1503 « error of “init” 1605 < error of “zero” 1505, and the line curve 1607 of bottom panel 1610 is effectively zero).

[0377] In some embodiments, in addition to generating the generated history spatial metadata based on the enhanced decoded spatial metadata generated by the first ML model before the bitrate switch, the apparatus and methods also generate history features that can be determined or computed from the decoded transport audio signals. These embodiments can be useful if the number of transport audio channels changes, for example, from one channel to two channels in the bitrate switching.

[0378] For example, there could be an example situation where the apparatus or methods cannot determine all of the audio features used by the second ML model from a mono transport audio signal before the bitrate switching location alone. However, in some embodiments as discussed herein it is possible to generate approximate inter-channel audio features based on the enhanced spatial metadata (from the first “ML model”) and the single-channel audio features.

[0379] Fig.16 shows schematically an example apparatus which is similar to the example apparatus as shown in Fig.7 and otherwise operates as the example embodiment presented in Fig.7 but with the history metadata generator 703 replaced by a history audio feature generator 1703, which is configured to create at least one generated history audio feature 1706. The generated history audio feature 1706 is employed to replace the determined or computed audio-based features 702 computed from decoded transport audio signals 504.

[0380] In some embodiments a combination of these embodiments as shown in Fig.16 and the embodiments presented in Fig.7 can be implemented. In other words, the metadata enhancer comprises both the history metadata generator and the history audio feature generator.

[0381] With respect to Fig.17 is shown a flow diagram showing example operations of the metadata enhancer shown in Fig.16.

[0382] For example, as shown in Fig.17 by 801 is the operation of obtaining decoded spatial metadata, decoded transport audio signals, and bitrate.

[0383] Additionally, as shown in Fig.17 by 1802 is the operation of, in the first or normal mode, selecting or using decoded transport audio signals for feature determination.

[0384] Optionally as shown in Fig.17 by 1803 is the operation of, in the second or bitrate switch mode, generating or obtaining generated history audio feature based on enhanced decoded spatial metadata, decoded transport audio signals, and bitrate. Also, in the second or bitrate switch mode, is the operation of selecting the generated history audio features or generating a combination of history audio features and decoded transport audio signals for feature determination as shown in Fig.17 by 1804.

[0385] Then, as shown in Fig.17 by 1805 is the operation of generating or determining at least one feature based on the selection or combination (in second mode) as generated in 1804, or decoded transport audio signals (in first mode) as obtained in 1802 and the bitrate (and decoded spatial metadata).

[0386] Also as shown in Fig.17 by 807 is the operation of generating predicted metadata properties based on the determined at least one feature and bitrate.

[0387] Then is the operation as shown in Fig.17 by 809 of generating enhanced decoded spatial metadata based on decoded spatial metadata, predicted metadata properties and bitrate.

[0388] Finally, the enhanced decoded spatial metadata is output as shown in Fig.17 by 811.

[0389] In some embodiments, the bitrate switch mode is not activated in changes in between all bitrates, but only switching between certain bitrates. For example, if the two bitrates use the same number of frequency bands and temporal subframes, the bitrate switch mode may not be needed, but, instead, the first or normal mode can be maintained. Furthermore, in these embodiments, the bitrate switch mode is only activated when the number of frequency bands and / or temporal subframes changes.
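The activation rule of these embodiments can be sketched as a comparison of the coded time-frequency resolution before and after the change. The `CodingConfig` type and the function name are hypothetical, and the example resolutions are those given for 13.2 kbps and 128 kbps in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingConfig:
    """Coded resolution of the spatial metadata (hypothetical container)."""
    num_bands: int
    num_subframes: int


def needs_switch_mode(previous, current):
    """Activate the switch mode only if the band or subframe count changes."""
    return (
        previous.num_bands != current.num_bands
        or previous.num_subframes != current.num_subframes
    )


low = CodingConfig(num_bands=5, num_subframes=1)    # 13.2 kbps example
high = CodingConfig(num_bands=8, num_subframes=4)   # 128 kbps example
```

Two bitrates that map to the same `CodingConfig` keep the enhancer in the first or normal mode, since the history data remains compatible.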

[0390] In some embodiments, the bitrate changes between frames, which causes differences in the coding configuration, for example, the number of frequency bands and / or temporal subframes changes, which in turn causes an incompatibility in the history data for the ML model. However, in some embodiments, a change in the coding configuration (such as the number of frequency bands and / or temporal subframes) could be caused by any suitable trigger or event. In these examples, the same issues as discussed above are present because of the generated incompatible history data for the current frame.

[0391] The changes in the coding configuration can occur while the bitrate value is constant. For example, the changes can be caused by adaptive determination of the number of frequency bands. For example, there can be an adaptive determination of how many frequency bands are needed for a certain quality, and thus the number of frequency bands could change over time, even though the bitrate is constant.

[0392] These embodiments can be implemented as presented, for example, in relation to Figs.7 and / or 16 but instead of obtaining the bitrate a different input, for example a coding configuration input, could be obtained. The coding configuration input could, for example, comprise at least one of: number of frequency bands; and number of temporal subframes. Then, the same operations as presented above could be applied based on the coding configuration input.

[0393] In some embodiments, the history metadata generator is configured to apply the actual metadata encoding and decoding (for example, similar to IVAS) (at the current bitrate) on the enhanced decoded spatial metadata in order to obtain the generated history spatial metadata. This approach would typically be computationally more complex than the approach of merging the frequency bands, but may in some cases produce a more accurate approximation of the decoded spatial metadata.

[0394] In some embodiments, the history metadata generator is not configured to generate history metadata but generates suitable features which are supplied as generated history metadata features analogous to the generated history audio features as shown in Fig.16. In some embodiments, the history metadata generator of Fig.7 is furthermore configured to directly obtain or receive predicted metadata properties rather than the enhanced decoded spatial metadata.

[0395] For example, if the metadata features are in the Cartesian vector representation described earlier, the relevant content in the spatial metadata and the metadata features represent the same information, but only in a different representation.

[0396] Thus, in some embodiments, either enhanced decoded spatial metadata or predicted metadata properties can be employed as an input for the history generation, as they effectively contain the same information. Correspondingly, the output of the history generation can be either generated history spatial metadata or generated history metadata features, as they practically comprise the same information. In some embodiments, the ML model employs a U-Net or some other structure conditioned to produce compatible generated history spatial metadata or generated history metadata features as the output based on the enhanced decoded spatial metadata or predicted metadata properties as the input. In some further embodiments, the history audio feature generator can be implemented with an ML model that has been conditioned to obtain or receive enhanced decoded spatial metadata or predicted metadata properties and decoded transport audio signals or features determined from decoded transport audio signals as the input and to produce generated history audio features as the output.

[0397] With respect to Fig.18 an example electronic device which may be used as the computer, encoder processor, decoder processor, or any of the functional blocks described herein is shown. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 1900 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, a laptop, or a teleconferencing system.

[0398] In some embodiments the device 1900 comprises at least one processor or central processing unit (CPU or processor) 1907. The processor 1907 can be configured to execute various program codes such as the methods such as described herein.

[0399] The device 1900 furthermore comprises a transceiver 1909 which is configured to receive the bitstream and provide it to the processor 1907. Typically, the data is received wirelessly from a remote device or a server; however, in some embodiments the bitstream is received via a wired connection or read from a local memory of the device. The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver can use a suitable 5G New Radio (5G NR) protocol, a Wi-Fi protocol such as, for example, IEEE 802.11be, a suitable short-range radio frequency communication protocol such as Bluetooth, or Li-Fi.

[0400] The device may furthermore comprise a user interface (UI) 1905 which may display to the user an interface for interacting with the device.

[0401] The device 1900 may further comprise memory (MEM) 1911 which is coupled to the processor 1907. In some embodiments the memory 1911 comprises the program code 1921 which is executed by the processor 1907. The program code may involve instructions to perform the operations of the spatial synthesizer described above. The processor 1907 can then be configured to output the spatial audio signals, which in this example was a binaural output, to a digital to analogue converter (DAC) / Bluetooth 1901 converter.

[0402] The combination of the processor, CPU, 1907 and memory, MEM, can implement the IVAS decoder 1931 functionality described above.

[0403] The DAC / Bluetooth 1901 is configured to convert the spatial audio signals to an analogue form if the headphones are conventional wired (analogue) headphones. For wireless connections, the DAC / Bluetooth 1901 may be a Bluetooth transceiver.

[0404] The DAC / Bluetooth 1901 block provides (either wired or wirelessly) the spatial audio to be played back with the headphones 1903 to the user. In some embodiments, the headphones 1903 may have a head tracker which may provide orientation and / or position information of the user’s head to the processor 1907 of the rendering apparatus, so that user’s head orientation is accounted for at the spatial synthesizer.

[0405] It should be understood that the apparatuses may comprise or be coupled to other units or modules used in or for transmission and / or reception. Although the apparatuses have been described as one entity, different modules and memory may be implemented in one or more physical or logical entities.

[0406] It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.

[0407] As used herein, “at least one of the following: ” and “at least one of ” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

[0408] In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

[0409] As used in this application, the term “circuitry” may refer to one or more or all of the following:

[0410] (a) hardware-only circuit implementations (such as implementations in only analog and / or digital circuitry) and

[0411] (b) combinations of hardware circuits and software, such as (as applicable):

[0412] (i) a combination of analog and / or digital hardware circuit(s) with software / firmware and

[0413] (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and / or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.

[0414] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and / or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.

[0415] The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or program, also called program product, including software routines, applets and / or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.

[0416] Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as DVD and the data variants thereof, CD. The physical media is a non-transitory media.

[0417] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

[0418] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.

[0419] Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

[0420] The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.

[0421] The foregoing description has provided by way of non-limiting examples a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

[0422] List of abbreviations

[0423] AAC - Advanced Audio Coding

[0424] ANN - artificial neural network

[0425] BN - Batch Normalization

[0426] DNN - deep neural network

[0427] EVS - 3GPP Enhanced Voice Services

[0428] FSAC - Future Speech and Audio Codec

[0429] IVAS - 3GPP Immersive Voice and Audio Services

[0430] kbps - kilobits per second

[0431] MASA - Metadata-Assisted Spatial Audio

[0432] MDCT - modified discrete cosine transform

[0433] ML - machine learning

[0434] NN - neural network

[0435] ONNX - Open Neural Network eXchange

[0436] STFT - Short-time Fourier transform

[0437] TF - time / frequency

[0438] VR - virtual reality

Claims

CLAIMS:

1. An apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

2. The apparatus as claimed in claim 1, caused to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream is further caused to: obtain a machine-learning model for enhancing the at least one spatial metadata parameter, the machine-learning model for spatial audio content associated with the coding configuration, the machine-learning model configured to output at least one predicted spatial metadata property; and generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property.

3. The apparatus as claimed in claim 2, caused to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property is further caused to generate at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property and the at least one spatial metadata parameter.

4. The apparatus as claimed in any of claims 2 or 3, caused to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream is further caused to: obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter based on the coding configuration, the at least one input feature being based on the coding configuration.

5. The apparatus as claimed in claim 4, caused to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter is further caused to determine a switch mode of operation based on the coding configuration, the switch mode of operation indicating that there is a change in the coding configuration.

6. The apparatus as claimed in claim 5, caused to obtain at least one input feature for the machine-learning model for enhancing the at least one spatial metadata parameter is further caused to determine a normal mode of operation based on the coding configuration, the normal mode of operation indicating that there is a consistent coding configuration.

7. The apparatus as claimed in claim 6, caused to obtain the at least one input feature for the normal mode of operation based on the coding configuration is further caused to generate the at least one input feature based on at least one of: the at least one spatial metadata parameter; and the at least one audio signal.

8. The apparatus as claimed in any of claims 5 to 7, caused to obtain at least one input feature for the switch mode of operation based on the coding configuration is further caused to: generate at least one history audio feature based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.

9. The apparatus as claimed in any of claims 5 to 7, caused to obtain at least one input feature for the switch mode of operation based on the coding configuration is further caused to: generate at least one history audio feature based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate at least one audio feature based on at least one of: at least one spatial metadata parameter; and the at least one audio signal; and generate the at least one input feature based on the at least one history audio feature and the at least one audio feature.

10. The apparatus as claimed in any of claims 5 to 9, caused to obtain at least one input feature for the switch mode of operation based on the coding configuration is further caused to: generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

11. The apparatus as claimed in claim 10, caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal is further caused to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous enhanced spatial metadata parameter based on the frequency band limits.
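As an illustration of the frequency combination in claim 11, the following minimal sketch merges per-band values from a finer band grid onto a coarser one. Several things here are assumptions rather than part of the claim: the function name `combine_bands`, the representation of band limits as frequency-bin edges, the requirement that every new edge coincides with an old edge, and the use of energy weighting (with band energies taken from the audio signal). Linear averaging suits ratio-like parameters; direction parameters such as azimuth would need a different combination rule.

```python
from typing import List, Sequence


def combine_bands(
    values: Sequence[float],
    energies: Sequence[float],
    old_limits: Sequence[int],
    new_limits: Sequence[int],
) -> List[float]:
    """Merge per-band values onto a coarser band grid.

    old_limits and new_limits are bin-index band edges (length = bands + 1).
    Every new edge is assumed to coincide with an old edge.
    """
    if len(values) != len(old_limits) - 1 or len(energies) != len(values):
        raise ValueError("values and energies need one entry per old band")
    if any(edge not in old_limits for edge in new_limits):
        raise ValueError("new band edges must align with old band edges")

    combined = []
    for lo, hi in zip(new_limits[:-1], new_limits[1:]):
        # Old bands that lie entirely inside the new band [lo, hi).
        members = [
            b for b in range(len(values))
            if old_limits[b] >= lo and old_limits[b + 1] <= hi
        ]
        total_energy = sum(energies[b] for b in members)
        if total_energy > 0.0:
            value = sum(values[b] * energies[b] for b in members) / total_energy
        else:
            # Silent band: fall back to a plain mean.
            value = sum(values[b] for b in members) / len(members)
        combined.append(value)
    return combined


# Four old bands merged into two new bands (illustrative numbers).
old_limits = [0, 4, 8, 16, 32]
new_limits = [0, 8, 32]
ratios = [0.8, 0.4, 0.6, 0.2]
band_energies = [1.0, 3.0, 1.0, 1.0]
merged = combine_bands(ratios, band_energies, old_limits, new_limits)
print(merged)
```

In this example the first new band is dominated by the louder old band (energy 3.0), so its combined value sits closer to 0.4 than to 0.8, while the second new band is a plain average of two equally loud bands.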

12. The apparatus as claimed in any of claims 10 or 11, caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous enhanced spatial metadata parameter; and the at least one audio signal is further caused to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous enhanced spatial metadata parameter based on the subframe limits.
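The temporal counterpart in claim 12 can be sketched in the same spirit. The helper below, with the hypothetical name `resample_subframes`, maps the per-subframe values of one frame onto the subframe count of the new coding configuration. It assumes that subframes split the frame evenly, so the subframe limits follow directly from the count, and that one count is an integer multiple of the other. Averaging when reducing and repeating when expanding are illustrative choices for ratio-like parameters, not rules taken from the claim.

```python
from typing import List, Sequence


def resample_subframes(values: Sequence[float], new_count: int) -> List[float]:
    """Map per-subframe values of one frame onto new_count subframes."""
    old_count = len(values)
    if old_count == 0 or new_count <= 0:
        raise ValueError("subframe counts must be positive")

    if new_count <= old_count:
        if old_count % new_count:
            raise ValueError("old count must be a multiple of new count")
        step = old_count // new_count
        # Average the old subframes that fall inside each new subframe.
        return [
            sum(values[i:i + step]) / step
            for i in range(0, old_count, step)
        ]

    if new_count % old_count:
        raise ValueError("new count must be a multiple of old count")
    repeat = new_count // old_count
    # Hold each old value for all new subframes it covers.
    return [v for v in values for _ in range(repeat)]


# History stored at 4 subframes, new configuration uses 1 subframe.
reduced = resample_subframes([0.2, 0.4, 0.6, 0.8], 1)
# History stored at 1 subframe, new configuration uses 4 subframes.
expanded = resample_subframes([0.5], 4)
print(reduced, expanded)
```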

13. The apparatus as claimed in any of claims 5 to 9, caused to obtain at least one input feature for the switch mode of operation based on the coding configuration is further caused to: generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal; generate the at least one input feature based on the at least one history spatial metadata parameter and the at least one spatial metadata parameter.

14. The apparatus as claimed in claim 13, caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal is further caused to: determine frequency band limits based on the coding configuration; generate frequency combined spatial metadata parameters from the at least one previous predicted spatial metadata property based on the frequency band limits.

15. The apparatus as claimed in any of claims 13 or 14, caused to generate at least one history spatial metadata parameter based on at least one of: at least one previous predicted spatial metadata property; and the at least one audio signal is further caused to: determine subframe limits based on the coding configuration; and generate the at least one history spatial metadata parameter from the at least one previous predicted spatial metadata property based on the subframe limits.

16. The apparatus as claimed in any of claims 1 to 15, wherein the coding configuration associated with the encoding of the spatial bitstream comprises at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

17. An apparatus for enhancing at least one spatial metadata parameter, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain spatial audio content comprising: at least one original spatial metadata parameter; and at least one audio signal; generate a spatial bitstream, the spatial bitstream comprising: at least one spatial metadata parameter associated with the at least one audio signal, the at least one spatial metadata parameter being an encoded version of the at least one original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and transmit the spatial bitstream to at least one further apparatus, wherein the at least one further apparatus is configured to enhance the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

18. The apparatus as claimed in claim 17, wherein the coding configuration associated with the encoding of the spatial bitstream comprises at least one of: a bitrate value for the encoding of the spatial bitstream; a number of frequency bands for the encoding of the spatial bitstream; a number of temporal subframes for the encoding of the spatial bitstream.

19. A method for an apparatus for enhancing at least one spatial metadata parameter, the method comprising: obtaining a spatial bitstream, the spatial bitstream defining spatial audio content and comprising: at least one spatial metadata parameter associated with at least one audio signal, the at least one spatial metadata parameter being an encoded version of an original spatial metadata parameter; and a coding configuration associated with the encoding of the spatial bitstream; and enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream.

20. The method as claimed in claim 19, wherein enhancing the at least one spatial metadata parameter based on the coding configuration associated with the encoding of the spatial bitstream further comprising: obtaining a machine-learning model for enhancing the at least one spatial metadata parameter, the machine-learning model for spatial audio content associated with the coding configuration, the machine-learning model configured to output at least one predicted spatial metadata property; and generating at least one enhanced spatial metadata parameter based on the at least one predicted spatial metadata property.
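The two steps of claim 20 (obtaining a model associated with the coding configuration, then generating the enhanced parameter from the model output) can be sketched as follows. Everything concrete here is an assumption made for illustration: the registry `MODELS`, its key layout (bitrate, number of bands, number of subframes), the stand-in `toy_model` that takes the place of a trained network, and the interpretation of the predicted spatial metadata property as an additive correction to a ratio-like parameter in [0, 1]. The claim itself does not fix any of these.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical key: (bitrate, number of bands, number of subframes).
ConfigKey = Tuple[int, int, int]
Model = Callable[[Sequence[float]], List[float]]


def toy_model(scale: float) -> Model:
    """Stand-in for a trained network: predicts a per-band correction."""
    def predict(features: Sequence[float]) -> List[float]:
        return [scale * (0.5 - f) for f in features]
    return predict


# One model per coding configuration (illustrative entries only).
MODELS: Dict[ConfigKey, Model] = {
    (32000, 5, 1): toy_model(0.2),
    (64000, 8, 4): toy_model(0.1),
}


def enhance(decoded: Sequence[float], config: ConfigKey) -> List[float]:
    """Enhance decoded per-band parameters with the model for this config."""
    model = MODELS[config]      # model associated with the coding configuration
    predicted = model(decoded)  # predicted spatial metadata property
    # Assumed combination rule: add the correction and clamp to [0, 1].
    return [min(1.0, max(0.0, d + p)) for d, p in zip(decoded, predicted)]


enhanced = enhance([0.0, 0.5, 1.0], (32000, 5, 1))
print(enhanced)
```

Keying the registry on the full configuration keeps the lookup explicit; a real system might instead share one model across several configurations or condition a single model on the configuration.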