Metadata for spatial audio

By decomposing and analyzing audio signals into distinct categories with tailored transforms, the method addresses spatial audio quality issues in complex sound scenes, enhancing immersion and reducing artifacts.

WO2026124964A1 · PCT designated stage · Publication Date: 2026-06-18 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2025-11-24
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing spatial audio technologies suffer from reduced sound quality and immersion when the spatial metadata has low temporal and/or frequency resolution in complex sound scenes containing multiple audio signal categories. This leads to artifacts such as pre-echo and post-reverb, especially when encoding and decoding transient and continuous sounds.

Method used

The method involves decomposing audio signals into different categories (tonal, transient, noise) and analyzing each category separately to obtain specific spatial metadata, which are then combined into a single spatial metadata stream, using tailored transforms based on energy density characteristics.

Benefits of technology

This approach enhances spatial audio quality by accurately representing various audio categories, reducing artifacts and improving immersion in spatial audio rendering.

✦ Generated by Eureka AI based on patent content.


Abstract

Examples of the disclosure relate to metadata for spatial audio where the spatial audio comprises components in different audio signal categories. Two or more audio signals comprising multiple audio components are obtained. A first component and a second component are determined from the two or more audio signals. The first component represents a first audio signal category and the second component represents a second audio signal category. First spatial metadata is obtained based on analyzing the first component. The first spatial metadata comprises a first spatial audio parameter. Second spatial metadata is obtained based on analyzing the second component. The second spatial metadata comprises a second spatial audio parameter. The first spatial metadata and the second spatial metadata are combined into a single spatial metadata stream. The first spatial audio parameter and the second spatial audio parameter are represented inside the single spatial metadata stream.

Description

[0001] TITLE

[0002] Metadata for Spatial Audio

[0003] TECHNOLOGICAL FIELD

[0004] Examples of the disclosure relate to metadata for spatial audio. Some relate to metadata for spatial audio where the spatial audio comprises components in different audio signal categories.

[0005] BACKGROUND

[0006] Complex sound scenes can comprise multiple different types of audio signal categories. For example, they can comprise audio signals with components that have different temporal characteristics, such as transient and tonal sounds. If the spatial metadata for the complex sound scene has a low temporal and / or frequency resolution this can result in reduced sound quality and / or reduced immersion for the rendered spatial audio.

[0007] BRIEF SUMMARY

[0008] According to various, but not necessarily all, examples of the disclosure there may be provided an apparatus comprising:

[0009] at least one processor;

[0010] and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform:

[0011] obtaining two or more audio signals comprising multiple audio components;

[0012] determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category;

[0013] obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter;

[0014] obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and

[0015] combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

[0016] Combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream may comprise at least one of:

[0017] providing the first spatial metadata and the second spatial metadata in the single spatial metadata stream;

processing the first spatial metadata and the second spatial metadata into further spatial metadata; and

storing the first spatial metadata and the second spatial metadata into a single structure that represents the spatial metadata stream.

Combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream may comprise assigning first spatial metadata to a first time / frequency tile and assigning second spatial metadata to a second time / frequency tile.

[0018] Combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream may comprise assigning first spatial metadata to a first direction and assigning second spatial metadata to a second direction.

[0019] A first process may be used to analyze the first component to obtain first spatial metadata and a second, different process may be used to analyze the second component to obtain second spatial metadata.

[0020] The first process may comprise a first type of transform and the second process may comprise a second type of transform.

[0021] Different window lengths may be used for the different transforms based on a characteristic distribution of energy density for the audio signal categories of the respective components.

[0022] The processor and memory may also be configured to cause the apparatus to perform determining a third component of the two or more audio signals that represents a third audio signal category.

[0023] An energy density of a component that represents the second audio signal category may be predominantly comprised within a shorter time frame compared to an energy density of a component that represents the first audio signal category.

[0024] The first audio signal category may comprise predominantly tonal audio and the second audio signal category may comprise predominantly transient audio.

[0025] Respective audio signal categories may comprise two or more of:

[0026] tonal audio;

[0027] harmonic audio;

[0028] noise;

[0029] transient audio;

[0030] onset audio;

[0031] speech;

[0032] remainder;

[0033] non-transient remainder;

[0034] transient remainder;

music;

[0035] ambience.

[0036] Determining respective components from the two or more audio signals may comprise at least one of:

[0037] decomposing the two or more audio signals into the respective components;

[0038] splitting the two or more audio signals into the respective components;

[0039] dividing the two or more audio signals into the respective components;

[0040] extracting the respective components from the two or more audio signals;

[0041] weighting the respective components from the two or more audio signals;

[0042] identifying the respective components from the two or more audio signals;

[0043] emphasizing at least one component from the two or more audio signals.

[0044] The processor and memory may also be configured to determine spatial metadata for the two or more audio signals and use the spatial metadata for the two or more audio signals in place of the single spatial metadata stream if one or more criteria for the spatial metadata for the two or more audio signals are satisfied.

[0045] Respective components of the two or more audio signals may be determined using at least one of:

[0046] signal decomposition;

[0047] a machine learning model;

[0048] a classifier.

[0049] The spatial metadata may comprise at least one of:

[0050] one or more directional parameters;

[0051] one or more energy parameters; or

[0052] one or more ratio parameters.

[0053] The first spatial metadata and the second spatial metadata may be used to render a spatial audio output and the spatial audio output comprises one of:

[0054] binaural audio signals;

[0055] stereo audio signals;

[0056] multi-channel audio signals; or

[0057] Ambisonics signals.

[0058] According to various, but not necessarily all, examples of the disclosure there may be provided a method comprising:

[0059] obtaining two or more audio signals comprising multiple audio components;

[0060] determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category;

obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter;

[0061] obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and

[0062] combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

[0063] According to various, but not necessarily all, examples of the disclosure there may be provided a computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform:

[0064] obtaining two or more audio signals comprising multiple audio components;

[0065] determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category;

[0066] obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter;

[0067] obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and

[0068] combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

[0069] According to various, but not necessarily all, embodiments there is provided an apparatus comprising:

[0070] at least one processor; and

[0071] at least one memory including computer program code;

[0072] the at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus to perform at least a part of one or more methods described herein.

[0073] According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for performing at least part of one or more methods described herein. The description of a function and / or action should additionally be considered to also disclose any means suitable for performing that function and / or action. Functions and / or actions described herein can be performed in any suitable way using any suitable method.

[0074] According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.

[0075] While the above examples of the disclosure and optional features are described separately, it is to be understood that their provision in all possible combinations and permutations is contained within the disclosure. It is to be understood that various examples of the disclosure can comprise any or all the features described in respect of other examples of the disclosure, and vice versa. Also, it is to be appreciated that any one or more or all the features, in any combination, may be implemented by / comprised in / performable by an apparatus, a method, and / or computer program instructions as desired, and as appropriate. The description of a function should additionally be considered to also disclose any means suitable for performing that function.

[0076] BRIEF DESCRIPTION

[0077] Some examples will now be described with reference to the accompanying drawings in which:

[0078] FIG. 1 shows an example system;

[0079] FIG. 2 shows an example method;

[0080] FIG. 3 shows an example encoder;

[0081] FIG. 4 shows an example decoder;

[0082] FIG. 5 shows an example metadata combiner;

[0083] FIG. 6 shows an example metadata combiner;

[0084] FIG. 7 shows an alternative example system;

[0085] FIG. 8 shows an example front end;

[0086] FIG. 9 shows an example encoder;

[0087] FIG. 10 shows an example method; and

[0088] FIG. 11 shows an example controller.

[0089] The figures are not necessarily to scale. Certain features and views of the figures can be shown schematically or exaggerated in scale in the interest of clarity and conciseness. For example, the dimensions of some elements in the figures can be exaggerated relative to other elements to aid explication. Corresponding reference numerals are used in the figures to designate corresponding features. For clarity, all reference numerals are not necessarily displayed in all figures.

[0090] DETAILED DESCRIPTION

[0091] Fig. 1 shows an example system 100 that can be used to provide parametric spatial audio and implement examples of the disclosure. The example system 100 comprises an encoder 104 and a decoder 108. In implementations of the disclosure the system 100 can comprise additional components that are not shown in Fig. 1.

[0092] The encoder 104 receives input audio signals 102. The input audio signals 102 can comprise microphone-array signals, Ambisonic signals, multi-channel audio signals, or any other suitable type of signals. Microphone-array signals could be received from a mobile device or any other suitable type of device. The multi-channel audio signals could be 5.1, 7.1+4 or any other suitable type of signals. The encoder 104 is configured to process the input audio signals 102 to obtain a parametric spatial audio stream. The parametric spatial audio stream can be in any suitable format such as metadata-assisted spatial audio (MASA) format. In some examples the encoder 104 can receive spatial metadata alongside the input audio signals 102. In other examples the encoder 104 can process the input audio signals 102 to obtain the spatial metadata.

[0093] The encoder 104 is configured to encode the obtained parametric spatial stream to form a bitstream 106.

[0094] The bitstream 106 is passed from the encoder 104 to the decoder 108. The bitstream 106 can be passed from the encoder 104 to the decoder 108 via any suitable communication link. The decoder 108 is configured to decode the bitstream 106 and render spatial audio output 110. The spatial audio output 110 can comprise any suitable type of spatial audio signals such as binaural audio signals.

[0095] In example systems 100 if the spatial metadata has a high temporal and frequency resolution then the encoder 104 can encode the spatial metadata in a compressed manner. Within a single frame, there can be different types of audio signal categories. For instance, there could be continuous sounds (such as tones) and transient sounds (such as a basketball bouncing, clicks, claps). If transient sounds are present, there may be redundancy in the spatial metadata on the frequency axis. If continuous sounds are present, there may be redundancy in the spatial metadata on the temporal axis. Some encoding schemes, such as IVAS (Immersive Voice and Audio Services), can take this redundancy into account so that low bitrates can be achieved while providing the spatial metadata to the decoder 108 in an accurate manner. The decoder 108 can then render the different audio signal categories (such as continuous and transient) to their appropriate directions, based on the decoded spatial metadata.

[0096] If the high temporal and frequency resolution of the spatial metadata is not available this can cause problems if different types of audio signal categories are present. For example, if the input audio signals 102 are obtained from a spaced microphone array of a mobile phone, the spatial metadata may be analyzed based on a delay analysis between the microphones. This procedure requires relatively long temporal windows (for example 20 millisecond windows). These long temporal windows can cause issues in spatial metadata encoding and decoding schemes.
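
As a rough illustration of the delay analysis mentioned above, the following sketch estimates the delay between two microphone signals over one 20 millisecond window by picking the lag with the highest cross-correlation, and then maps that delay to an arrival angle. The disclosure does not specify the analysis method; cross-correlation, the 0.14 m microphone spacing, the far-field assumption, and all function names are assumptions made here for illustration only.

```python
import math
import random


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) at which sig_b best matches sig_a.

    A positive result means sig_b is a delayed copy of sig_a.
    """
    n = len(sig_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_azimuth(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Map a delay in samples to an arrival angle in degrees (far-field model)."""
    sin_theta = (lag / sample_rate) * speed_of_sound / mic_distance
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against rounding
    return math.degrees(math.asin(sin_theta))


SAMPLE_RATE = 48000
WINDOW = SAMPLE_RATE * 20 // 1000   # 20 ms analysis window, in samples
MIC_DISTANCE = 0.14                 # metres, assumed spacing
MAX_LAG = 20                        # just above the largest physical delay
TRUE_LAG = 5                        # delay built into the test signal

rng = random.Random(0)
sig_a = [rng.uniform(-1.0, 1.0) for _ in range(WINDOW)]
sig_b = [0.0] * TRUE_LAG + sig_a[:WINDOW - TRUE_LAG]

estimated_lag = estimate_delay(sig_a, sig_b, MAX_LAG)
azimuth = delay_to_azimuth(estimated_lag, SAMPLE_RATE, MIC_DISTANCE)
```

Because the correlation is accumulated over the whole window, a short transient inside the window influences the single delay estimate for the full 20 milliseconds, which is the behaviour the paragraph above describes as problematic.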

[0097] As an illustrative example, a sound scene could comprise sounds in a first audio signal category and a second audio signal category. The first audio signal category could comprise continuous sounds such as speech, a waterfall, a musical instrument, a car engine, or any other suitable type of sound. The second audio signal category could comprise transient sound such as a basketball hitting the ground. In this sound scene the transient sound may dominate the spatial metadata for a long time, significantly longer than the length of the transient itself. If the temporal resolution of the spatial metadata is low, both the transient sounds and the continuous sounds contribute to the spatial metadata in an averaged manner. The spatial metadata does not represent at least one of the audio signal categories very well. In some examples the spatial metadata might not represent any of the audio signal categories very well. When spatial metadata obtained in these circumstances is encoded, decoded, and then used in the spatial audio rendering, it can result in artifacts for the transient sounds. The artifacts could be perceived as pre-echo and / or post-reverb. The issues with the spatial metadata can also result in other sounds being reproduced at or near the direction of the transient sound, which results in these sounds being perceived to be spatially unstable. Spatial instabilities that could be perceived include mutual spatial attraction of direct sound signals, instability of an ambient portion among occasional directional sound, increased reverberance, and artifact-sounding pre-echoes and post-reverbs. The presence of any of these artifacts results in reduced sound quality and / or reduced immersion. Such spatial artifacts can also be amplified when the determined spatial metadata is encoded and, at low bitrates, may become more audible due to artifacts present in the audio signal coding path.

[0098] Examples of the disclosure provide systems and methods that can address these issues and enable good-quality spatial metadata to be obtained that can also represent complex sound scenes that comprise multiple different types of audio signal categories.

[0099] Fig. 2 shows an example method that can be used in examples of the disclosure. The method could be implemented by an apparatus within an encoder 104 or any other suitable entity.

[0100] The method comprises, at block 200, obtaining two or more audio signals comprising multiple audio components. The audio signals can represent a complex audio scene with different types of audio signal categories.

[0101] At block 202 the method comprises determining at least a first component and a second component from the two or more audio signals. The first component represents a first audio signal category and the second component represents a second audio signal category.

[0102] The different audio signal categories can have different temporal characteristics or energy densities. For example, an energy density of a component that represents the second audio signal category can be predominantly comprised within a shorter time frame compared to an energy density of a component that represents the first audio signal category. In such cases the first audio signal category could comprise predominantly tonal audio while the second audio signal category could comprise predominantly transient audio. Other types of audio signal categories could be used in other examples. Other audio signal categories that can be used comprise: tonal audio, harmonic audio, noise, transient audio, onset audio, speech, remainder, non-transient remainder, transient remainder, music, ambience, or any other suitable type of audio.

[0103] The respective audio signal categories can be defined by distinctive properties of the audio signals. For example, a transient audio signal category has an energy density that is predominantly concentrated in a short time frame. The onset audio signal category is similar to the transient audio signal category in the sense that there is a short-term change in audio signal energy. However, in the onset category, this change is an increase in energy or, more generally, an activation of a sound event. The noise audio signal category comprises an audio signal that is characterized by a random noise signal. The source of such noise signals can be capture noise, wind noise, noise-like components of complex sounds, or any other signal with inherent randomness to it. The speech audio signal category comprises predominantly human speech but could also contain singing without other music components or other suitable vocal noises. The music audio signal category contains predominantly music components such as real and synthetic instruments, possibly including singing. The ambience audio signal category comprises predominantly reverberation and non-prominent ambient sound sources such as wind. The remainder, non-transient remainder, and transient remainder audio signal categories represent special audio signal categories which are produced as a result of signal processing. For example, if a tonal category signal is removed from a complete audio signal with signal processing methods, the remaining signal can be considered to be of the remainder audio signal category. This remainder signal category can then be further processed into transient and non-transient remainder signal categories with signal processing methods, where the former contains predominantly the transient category signal of the remainder and the latter contains predominantly the non-transient category signal of the remainder.
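
One simple way to express "energy density concentrated in a short time frame" numerically is the fraction of a frame's energy that falls inside its most energetic short sub-window. The sketch below is an illustrative measure only, not the method of the disclosure; the 20 ms frame, the 1 ms sub-window, and the function name are assumptions.

```python
import math


def energy_concentration(frame, sub_len):
    """Fraction of the frame energy inside its most energetic sub-window."""
    energies = [x * x for x in frame]
    total = sum(energies)
    if total == 0.0:
        return 0.0
    # Slide a window of sub_len samples across the frame.
    window_energy = sum(energies[:sub_len])
    best = window_energy
    for i in range(sub_len, len(energies)):
        window_energy += energies[i] - energies[i - sub_len]
        best = max(best, window_energy)
    return best / total


SAMPLE_RATE = 48000
FRAME = 960   # 20 ms frame (assumed)
SUB = 48      # 1 ms sub-window (assumed)

# A steady 1 kHz tone: energy spread evenly over the frame.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(FRAME)]
# A decaying click starting at sample 400: energy packed into about 1 ms.
click = [math.exp(-(n - 400) / 10.0) if n >= 400 else 0.0 for n in range(FRAME)]

tone_score = energy_concentration(tone, SUB)
click_score = energy_concentration(click, SUB)
```

A score near 1 suggests a transient-like component, while a score near `SUB / FRAME` suggests a continuous one. Any threshold between the two would have to be tuned and is not given in the source.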

[0104] Any suitable processes can be used to determine the respective components from the two or more audio signals. The processes could comprise decomposing the two or more audio signals into the respective components, splitting the two or more audio signals into the respective components, dividing the two or more audio signals into the respective components, extracting the respective components from the two or more audio signals, weighting the respective components from the two or more audio signals, identifying the respective components from the two or more audio signals, emphasizing at least one component from the two or more audio signals, or any other suitable process.

[0105] "Splitting", "dividing", and "extracting" could be considered alternative definitions of "decomposing", as the two or more audio signals are processed into separate components where each component predominantly contains only one audio signal category.

[0106] "Weighting" and "emphasizing" could be considered to be an alternative process to "decomposing" where the two or more audio signals are processed in such a way that the relative prominence of specific audio signal categories is increased to produce the component audio signals representing audio signal categories. The difference is that no audio signal content is removed; instead, the presence of the desired audio signal category is emphasized in the produced audio signal component in relation to audio signals of another category.

[0107] "Identifying" is a case where the audio signal components representing audio signal categories are already separated in the two or more audio signals. In this case, the audio signals belonging to a specific category are identified and assigned to be a component representing the category. For example, such a situation arises if the obtained two or more audio signals are the result of a decomposition process that has happened before the obtaining step. In some examples a preprocessor can perform a transient, sine, noise decomposition to produce three sets of audio signals from a complete audio signal, and these are given to a processor for spatial analysis and further processing. In this case, the components representing categories are identified from the audio signals and assigned accordingly. The respective components of the two or more audio signals can be determined using signal decomposition or a machine learning model or any other suitable means.

[0108] In some examples the components of the two or more audio signals can be determined using a classifier. The classifier could be comprised in an audio codec such as EVS, IVAS, Opus or any other suitable audio codec. Such classifiers can classify audio signals as speech, music, noise or other suitable categories.

[0109] At block 204 the method comprises obtaining first spatial metadata based on analyzing the first component and at block 206 the method comprises obtaining second spatial metadata based on analyzing the second component. The first spatial metadata comprises at least one first spatial audio parameter and the second spatial metadata comprises at least one second spatial audio parameter.

[0110] The spatial metadata can comprise directional parameters, energy parameters, ratio parameters or any other suitable parameters.
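
For illustration, the parameters for one time / frequency tile could be held in a small record such as the one below. The field names, the use of degrees, and the validation rules are assumptions for this sketch; the disclosure only states that directional, energy, and ratio parameters can be included.

```python
from dataclasses import dataclass


@dataclass
class TileMetadata:
    """Hypothetical container for the spatial parameters of one tile."""
    azimuth_deg: float      # directional parameter
    elevation_deg: float    # directional parameter
    energy: float           # energy parameter
    direct_to_total: float  # ratio parameter, expected in [0, 1]

    def __post_init__(self):
        if self.energy < 0.0:
            raise ValueError("energy must be non-negative")
        if not 0.0 <= self.direct_to_total <= 1.0:
            raise ValueError("direct_to_total must lie in [0, 1]")


tile = TileMetadata(azimuth_deg=30.0, elevation_deg=0.0,
                    energy=0.8, direct_to_total=0.7)
```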

[0111] Different processes can be used to analyze the respective components. For example a first process can be used to analyze the first component to obtain first spatial metadata and a second, different process can be used to analyze the second component to obtain second spatial metadata.

[0112] In some examples the different processes used to analyze the different components can comprise different types of transforms. In such cases the first process comprises a first type of transform and the second process comprises a second type of transform. In some examples the different transforms can have different window lengths. The window lengths that are used for a component can be selected based on a characteristic distribution of energy density for the audio signal categories of the respective components.

[0113] At block 208 the method comprises combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream. The at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

[0114] In some examples combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream can comprise providing the first spatial metadata and the second spatial metadata in the single spatial metadata stream. In some examples the combining can comprise processing the first spatial metadata and the second spatial metadata into a further spatial metadata. The processing could comprise an averaging, a selection, or a merger or any other suitable processing. In some examples the combining can comprise storing the first spatial metadata and the second spatial metadata into a single structure that represents the spatial metadata stream.
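
The averaging and selection options named above can be sketched for a single tile as follows. The energy-weighted vector average and the "largest direct energy wins" rule are assumptions chosen for illustration; the disclosure does not prescribe a formula. Each parameter set is a hypothetical tuple of (azimuth in degrees, energy, direct-to-total ratio).

```python
import math


def merge_directions(params):
    """Average several parameter sets into one by summing direction vectors.

    Each direction is weighted by its direct energy (energy * ratio).
    Returns (azimuth_deg, total_energy, direct_to_total).
    """
    x = y = total_energy = 0.0
    for azimuth_deg, energy, ratio in params:
        weight = energy * ratio
        x += weight * math.cos(math.radians(azimuth_deg))
        y += weight * math.sin(math.radians(azimuth_deg))
        total_energy += energy
    if total_energy == 0.0:
        return (0.0, 0.0, 0.0)
    azimuth = math.degrees(math.atan2(y, x))
    # Opposing directions cancel, which lowers the merged ratio.
    ratio = math.hypot(x, y) / total_energy
    return (azimuth, total_energy, ratio)


def select_dominant(params):
    """Selection: keep the parameter set with the largest direct energy."""
    return max(params, key=lambda p: p[1] * p[2])


merged = merge_directions([(30.0, 1.0, 1.0), (-30.0, 1.0, 1.0)])
dominant = select_dominant([(30.0, 1.0, 0.5), (-60.0, 4.0, 0.9)])
```

In this sketch, two equally strong sources at +30 and -30 degrees merge to a single direction at 0 degrees with a reduced ratio, whereas selection keeps one source's parameters untouched.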

[0115] In some examples combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises assigning first spatial metadata to a first time / frequency tile and assigning second spatial metadata to a second time / frequency tile. In some examples combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises assigning first spatial metadata to a first direction and assigning second spatial metadata to a second direction.
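
The two assignment options above can be sketched as follows, assuming both metadata sets have already been brought onto a common grid of time / frequency tiles. The rule that the more energetic component owns a tile, the two-slot layout, and all names are assumptions made for illustration. Each grid is indexed as `grid[subframe][band]` and holds hypothetical (azimuth in degrees, energy) pairs.

```python
def assign_tiles(first, second):
    """Give each tile to whichever component has more energy in it."""
    stream = []
    for row_first, row_second in zip(first, second):
        row = []
        for tile_first, tile_second in zip(row_first, row_second):
            if tile_first[1] >= tile_second[1]:
                row.append((tile_first[0], tile_first[1], "first"))
            else:
                row.append((tile_second[0], tile_second[1], "second"))
        stream.append(row)
    return stream


def assign_directions(first, second):
    """Keep both: first metadata in direction slot 0, second in slot 1."""
    return [
        [(tile_first, tile_second)
         for tile_first, tile_second in zip(row_first, row_second)]
        for row_first, row_second in zip(first, second)
    ]


# Two subframes by three bands. The tonal component is steady at +30 degrees;
# the transient component is only active in the first subframe, at -60 degrees.
tonal = [[(30.0, 1.0)] * 3 for _ in range(2)]
transient = [[(-60.0, 5.0)] * 3, [(-60.0, 0.0)] * 3]

by_tile = assign_tiles(tonal, transient)
by_direction = assign_directions(tonal, transient)
```

With these inputs the first subframe carries the transient direction in every band and the second subframe carries the tonal direction, so neither component's direction is averaged into the other's tiles.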

[0116] In some examples the audio signals can comprise more than two components. In such cases the method can comprise additional blocks such as determining a third component of the two or more audio signals that represents a third audio signal category and obtaining third spatial metadata based on analyzing the third component wherein the third spatial metadata comprises at least one third spatial audio parameter.

[0117] An example of a third audio signal category could be noise. Noise could be differentiated from tonal or harmonic audio signal categories because noise would be broader in frequency and would occupy multiple frequency bins whereas tonal or harmonic audio would predominantly occupy a single frequency bin or two frequency bins (for each component of a harmonic sound).
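
The bin-occupancy distinction can be illustrated by counting how many frequency bins are needed to hold most of a frame's energy. This is a minimal sketch using a direct DFT; the 90% energy fraction, the 256-sample frame, and the function names are assumptions, and the test tone is deliberately placed at a bin centre so that spectral leakage does not blur the picture.

```python
import cmath
import math
import random


def magnitude_spectrum(frame):
    """One-sided DFT magnitudes, computed directly (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def occupied_bins(frame, energy_fraction=0.9):
    """Smallest number of bins that together hold energy_fraction of the power."""
    power = sorted((m * m for m in magnitude_spectrum(frame)), reverse=True)
    total = sum(power)
    acc = 0.0
    for count, p in enumerate(power, start=1):
        acc += p
        if acc >= energy_fraction * total:
            return count
    return len(power)


N = 256
tone = [math.sin(2 * math.pi * 16 * t / N) for t in range(N)]  # centred on bin 16
rng = random.Random(1)
noise = [rng.uniform(-1.0, 1.0) for _ in range(N)]

tone_bins = occupied_bins(tone)
noise_bins = occupied_bins(noise)
```

A tone that falls between bin centres would spread over more bins unless an analysis window is applied, so a practical classifier would need a more tolerant rule than this one.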

[0118] The spatial metadata that is obtained using the example methods can be used to render a spatial audio output. The spatial audio output can comprise binaural audio signals, stereo audio signals, multi-channel audio signals, Ambisonics signals, or any other suitable spatial output.

[0119] In some examples spatial metadata for the two or more audio signals can be determined. That is, spatial metadata can be determined for the original input audio signals 102. The spatial metadata for the original input audio signals 102 can be used in place of the single spatial metadata stream if one or more criteria for the spatial metadata for the original input audio signals 102 are satisfied. One such criterion could be that the direct-to-total energy ratio determined using the original input audio signals 102 is larger than any of the adjusted direct-to-total energy ratios determined using the decomposed portions. Any other suitable criteria could be used.
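
The example criterion above can be written as a small decision function. This sketch reads "larger than any of" as "larger than every one of" the component ratios, which is an interpretation rather than something the source states; the function and argument names are hypothetical.

```python
def choose_metadata(original_ratio, component_ratios,
                    original_metadata, combined_metadata):
    """Fall back to the metadata of the undecomposed signals when their
    direct-to-total ratio exceeds that of every decomposed component."""
    if all(original_ratio > ratio for ratio in component_ratios):
        return original_metadata
    return combined_metadata


# The original signals look more directional than each component:
picked_original = choose_metadata(0.9, [0.6, 0.7, 0.2], "original", "combined")
# One component is more directional than the original analysis:
picked_combined = choose_metadata(0.5, [0.6, 0.3, 0.2], "original", "combined")
```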

[0120] Fig. 3 shows an example encoder 104 that can be used in examples of the disclosure. The encoder 104 can be part of a system 100 as shown in Fig. 1 or any other suitable type of system 100. The encoder 104 is configured to determine respective components of the audio signals, analyze the respective components to obtain spatial metadata for the respective components and combine the spatial metadata for the respective components into a single spatial metadata stream.

[0121] The encoder 104 receives input audio signals 102 as an input. The input audio signals 102 can comprise two or more audio signals comprising multiple audio components.

[0122] The input audio signals 102 are provided to a decomposer 300. The decomposer 300 is configured to determine the respective components of the input audio signals 102. In this example the decomposer 300 decomposes the input audio signals 102 into the components. Other means for determining the components could be used in other examples. The other means could comprise splitting the input audio signals 102, dividing the input audio signals 102, extracting components from the input audio signals 102, weighting components from the input audio signals 102, identifying the components from the input audio signals 102, emphasizing at least one component from the input audio signals 102, or any other suitable means.

[0123] In this example the input audio signals 102 are decomposed into three components. The components represent different audio signal categories. In this example the first audio signal component 302 can be categorized as tonal audio, the second audio signal component 308 can be categorized as transient audio, and the third audio signal component 314 can be categorized as noise. Other audio signal categories, and numbers of audio signal categories, could be used in other examples.

[0124] The audio signal components 302, 308, 314 are provided as the output of the decomposer 300. The audio signal components 302, 308, 314 can be provided in any suitable format. The audio signal components 302, 308, 314 could be provided in the time-frequency domain if the decomposer 300 has operated in the time-frequency domain with the same resolution as is used by the spatial analyzers 304, 310, 316 of the encoder 104. In other examples the audio signal components 302, 308, 314 could be provided in the time domain.

[0125] The audio signal components 302, 308, 314 are provided to spatial analyzers 304, 310, 316. The spatial analyzers 304, 310, 316 are configured to analyze the audio signal components 302, 308, 314 to obtain spatial metadata for the respective audio signal components 302, 308, 314. Each of the respective audio signal components 302, 308, 314 is provided to its own spatial analyzer 304, 310, 316. This can enable the spatial metadata to be obtained separately or independently for the different audio signal components 302, 308, 314.

[0126] In some examples each of the spatial analyzers 304, 310, 316 could be the same so that the same process is used to analyze each of the audio signal components 302, 308, 314 to obtain the spatial metadata. In other examples one or more of the spatial analyzers 304, 310, 316 could be different. This can enable different processes to be used to analyze different audio signal components 302, 308, 314 to obtain the spatial metadata. The different processes could be used based on different characteristics of the audio signal components 302, 308, 314. For example, different transforms could be used and the window lengths for the different transforms could be selected based on a characteristic distribution of energy for the audio signal category that the audio signal component represents. For instance, a long window could be used for the time-frequency transform of a tonal audio signal component 302 so as to obtain accurate spectral features for the tones while a short window could be used for the time-frequency transform of a transient audio signal component 308 so as to obtain accurate temporal features for the transient sounds.
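The trade-off behind the window choice can be made concrete with a little arithmetic. The sketch below assumes a 48 kHz sample rate (the text does not state one) and uses FFT sizes of 256 and 2048 with 50 % overlap as illustrative short and long windows; `resolution` is a hypothetical helper, not part of the disclosure.

```python
def resolution(sample_rate_hz, fft_size, hop_size):
    """Return (bin spacing in Hz, hop duration in ms) for an STFT setting."""
    bin_spacing_hz = sample_rate_hz / fft_size
    hop_ms = 1000.0 * hop_size / sample_rate_hz
    return bin_spacing_hz, hop_ms


SAMPLE_RATE_HZ = 48000  # assumed; the text does not give a sample rate

# Short window: coarse in frequency, fine in time (suits transients).
short_window = resolution(SAMPLE_RATE_HZ, fft_size=256, hop_size=128)
# Long window: fine in frequency, coarse in time (suits tones).
long_window = resolution(SAMPLE_RATE_HZ, fft_size=2048, hop_size=1024)

print("short window: %.2f Hz bins, %.2f ms hop" % short_window)
print("long window:  %.2f Hz bins, %.2f ms hop" % long_window)
```

Under these assumptions the long window separates tones roughly 23 Hz apart but updates only about every 21 ms, while the short window updates about every 2.7 ms at the cost of bins that are roughly 188 Hz wide.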

[0127] The implementation of the spatial analyzers 304, 310, 316 depends on the type of input audio signals 102, for instance, whether the input audio signals 102 comprise microphone array signals, Ambisonic audio signals, or multi-channel audio signals, or a different type of signals. The spatial analyzers 304, 310, 316 provide spatial metadata as an output. In the example of Fig. 3, the first spatial analyzer 304 provides first spatial metadata 306 as an output. The first spatial metadata 306 comprises at least one first spatial audio parameter. The first spatial metadata 306 comprises metadata for the tonal audio signal component 302 in this example. Also in Fig. 3 the second spatial analyzer 310 provides second spatial metadata 312 as an output. The second spatial metadata 312 comprises at least one second spatial audio parameter. The second spatial metadata 312 comprises metadata for the transient audio signal component 308 in this example. Similarly, the third spatial analyzer 316 provides third spatial metadata 318 as an output. The third spatial metadata 318 comprises at least one third spatial audio parameter. The third spatial metadata 318 comprises metadata for the noise audio signal component 314 in this example.

[0128] The spatial metadata parameters can comprise directional parameters, energy parameters or any other suitable parameters.

[0129] The respective spatial metadata 306, 312, 318 are provided to a metadata combiner 320. The metadata combiner 320 also receives the audio signal components 302, 308, 314 as inputs for aiding the processing. The metadata combiner 320 combines the respective spatial metadata 306, 312, 318 into a single spatial metadata stream 322. The single spatial metadata stream 322 comprises combined spatial metadata. An example metadata combiner 320 is shown in Fig. 5.

[0130] The single spatial metadata stream 322 is provided to a metadata encoder 324. The metadata encoder 324 is configured to encode the single spatial metadata stream 322. The single spatial metadata stream 322 can be encoded using methods, such as the MASA encoding methods of the IVAS encoder or any other suitable methods.

[0131] The metadata encoder 324 provides encoded metadata 326 as an output. The encoded metadata 326 is provided to a multiplexer 336.

[0132] The input audio signals 102 are also provided to a transport audio signal determiner 328. The transport audio signal determiner 328 determines the transport audio signals 330. Any suitable process can be used to determine the transport audio signals 330. The process that is used can be dependent upon the type of input audio signals 102. For example, if the input audio signals 102 comprise microphone-array signals from a mobile device, the process for determining the transport audio signals can comprise selecting a microphone signal from the left side of the device as the left transport signal and another one from the right side of the device as the right transport signal. If the input audio signals 102 comprise Ambisonic inputs, the process for determining the transport audio signals may comprise determining signals with cardioid directional patterns pointing to ±90 degrees. If the input audio signals 102 comprise multi-channel input, the process for determining the transport audio signals may comprise determining a stereo downmix of the input signals. Any other suitable approaches may also be used for other use cases. In some examples the transport audio signal determiner 328 can include operations to delay the transport audio signals 330, to match any potential delay caused by the decomposer 300 or spatial analyzers 304, 310 or 316.
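For the Ambisonic case, the cardioid transport signals can be sketched as below. This assumes first-order Ambisonics with SN3D normalization, where W is the omnidirectional channel and Y is the left-right figure-of-eight channel, and the common convention that +90 degrees azimuth is to the left. The text does not specify these conventions, and `cardioid_transport` is a hypothetical name.

```python
import numpy as np


def cardioid_transport(w, y):
    """Form two cardioids pointing to +90 and -90 degrees azimuth.

    Assumes SN3D first-order Ambisonics: w is omnidirectional, y is the
    left-right dipole. A cardioid is half the sum of omni and dipole.
    """
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return left, right


# Encode a plane wave arriving from +90 degrees (hard left) on the horizon.
s = np.sin(2.0 * np.pi * 440.0 * np.arange(480) / 48000.0)
azimuth = np.deg2rad(90.0)
w = s                      # omni channel carries the source unchanged
y = s * np.sin(azimuth)    # dipole gain is sin(azimuth) on the horizon

left, right = cardioid_transport(w, y)
```

With the source hard left, the left cardioid carries the full signal and the right cardioid is in its null, which is the behaviour expected of transport signals that preserve left-right separation.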

[0133] The transport audio signals 330 are provided as an output of the transport audio signal determiner 328 and forwarded to an audio encoder 332. The audio encoder 332 is configured to encode the transport audio signals 330 using any suitable audio-signal encoder, such as the IVAS core coder, EVS, or AAC.

[0134] The audio encoder 332 provides encoded transport audio signals 334 as an output. The encoded transport audio signals 334 are forwarded to the multiplexer 336.

[0135] The multiplexer 336 receives the encoded transport audio signals 334 and the encoded metadata 326 and multiplexes them into a bitstream 106. The bitstream 106 is the output of the encoder 104.

[0136] The decomposer 300 can use any suitable process. In some examples the decomposer 300 can use the methods described in, or methods based on those described in, FIERRO, LEONARDO. "Enhanced Fuzzy Decomposition of Sound Into Sines, Transients, and Noise." J. Audio Eng. Soc. 71.7/8 (2023): 468-480.

[0137] The decomposer 300 can comprise a function, denoted here as

[0138] x's(b, n, i), x't(b, n, i), x'n(b, n, i) = STN(x(b, n, i), g1, g2, mt, mf), where x(b, n, i) is an STFT (short-time Fourier transform) domain signal, b denotes the bin index, n denotes the temporal index and i denotes the channel index. The remaining parameters g1, g2, mt and mf are configuration parameters, the uses of which are described below. The three signals on the left-hand side are the outputs of the function. The STN function in this example is called with different STFT resolutions, and as such the signal format x(b, n, i) in this notation can refer to any of them.

[0140] First, the mean amplitude can be formulated by

$$x_a(b, n) = \frac{1}{N_{ch}} \sum_{i=1}^{N_{ch}} |x(b, n, i)|,$$

where $N_{ch}$ is the number of channels.

[0144] Then, the following steps are performed for each temporal step n and for each bin b:

[0145] Median filtering along the frequency axis to obtain xmedf(b, n), which is the median value of a range of xa(b, n) along axis b, ranging from (b − mf / 2) to (b + mf / 2), where these limits are rounded to the nearest integer and truncated if they exceed the data range defined in xa(b, n).

[0148] Median filtering along time axis to obtain xmedt(b, n), which is the median value of a range of xa(b, n) along axis n, ranging from (n − mt+ 1) to (n), where the first limit is truncated if it exceeds the data range defined in xa(b, n).

[0149] Formulating an STN ratio value xr(b, n) = xmedt(b, n) / (xmedt(b, n) + xmedf(b, n)), where the denominator may be regularized with an epsilon value to avoid divisions by zero.

[0151] Formulating

$$S(b, n) = \begin{cases} 0 & \text{if } x_r(b, n) < g_1 \\ 1 & \text{if } x_r(b, n) > g_2 \\ \sin^2\left(0.5\pi \cdot \dfrac{x_r(b, n) - g_1}{g_2 - g_1}\right) & \text{otherwise} \end{cases}$$

$$T(b, n) = \begin{cases} 0 & \text{if } (1 - x_r(b, n)) < g_1 \\ 1 & \text{if } (1 - x_r(b, n)) > g_2 \\ \sin^2\left(0.5\pi \cdot \dfrac{(1 - x_r(b, n)) - g_1}{g_2 - g_1}\right) & \text{otherwise} \end{cases}$$

[0165] Formulating the output signals by

[0166] x's(b, n, i) = S(b, n) · x(b, n, i)

[0167] x't(b, n, i) = T(b, n) · x(b, n, i)

[0168] x'n(b, n, i) = (1 − S(b, n) − T(b, n)) · x(b, n, i)
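The steps of the STN function can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the array layout `(bins, frames, channels)`, the default `eps` value, and rounding ties upwards for the frequency window limits are all choices made here, and the names `soft_mask` and `stn` are hypothetical.

```python
import numpy as np


def soft_mask(r, g1, g2):
    """0 below g1, 1 above g2, and a sin^2 ramp in between."""
    ramp = np.sin(0.5 * np.pi * (r - g1) / (g2 - g1)) ** 2
    return np.where(r < g1, 0.0, np.where(r > g2, 1.0, ramp))


def stn(x, g1, g2, mt, mf, eps=1e-12):
    """Split an STFT array x (bins, frames, channels) into three parts.

    Returns the tonal, transient and noise signals, which sum to x.
    """
    xa = np.abs(x).mean(axis=2)  # mean amplitude over channels
    n_bins, n_frames = xa.shape
    xmedf = np.empty_like(xa)
    xmedt = np.empty_like(xa)

    for b in range(n_bins):
        # Window limits rounded to the nearest integer (ties rounded up,
        # which is an assumption) and truncated to the valid bin range.
        lo = max(0, int(np.floor(b - mf / 2 + 0.5)))
        hi = min(n_bins - 1, int(np.floor(b + mf / 2 + 0.5)))
        xmedf[b] = np.median(xa[lo:hi + 1], axis=0)

    for n in range(n_frames):
        # Causal window covering at most the last mt frames.
        lo = max(0, n - mt + 1)
        xmedt[:, n] = np.median(xa[:, lo:n + 1], axis=1)

    # STN ratio, regularized to avoid division by zero.
    xr = xmedt / (xmedt + xmedf + eps)

    s_mask = soft_mask(xr, g1, g2)[:, :, None]
    t_mask = soft_mask(1.0 - xr, g1, g2)[:, :, None]
    return s_mask * x, t_mask * x, (1.0 - s_mask - t_mask) * x


# Usage: a steady tone in one bin should end up in the tonal output.
tone = np.zeros((8, 20, 2), dtype=complex)
tone[3] = 1.0
tonal, transient, noise = stn(tone, g1=0.7, g2=0.8, mt=9, mf=3)
```

A steady tone has a high median along time and a low median along frequency, so the ratio approaches 1 and the tonal mask opens. A click that occupies all bins in a single frame gives the opposite result and is routed to the transient output.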

[0169] The operation of the decomposer 300 using the above function is described below. First, the time-domain input audio signals 102, denoted x(t, i), are transformed using a short-time Fourier transform (STFT) with a hop size of 128, an FFT size of 256, and a square-root Hann window to obtain x128(b, n, i). Then, it is decomposed by x's,128(b, n, i), x't,128(b, n, i), x'n,128(b, n, i) = STN(x128(b, n, i), g1, g2, mt, mf) with parameter values g1 = 0.75, g2 = 0.85, mt = 75, and mf = 3.

[0172] Then, x't,128(b, n, i) is inverse transformed, with an inverse STFT, to obtain xt(t, i), which is the transient audio signal component 308 output by the decomposer 300.

[0173] Then, the sum x's,128(b, n, i) + x'n,128(b, n, i) is inverse transformed to obtain x'remainder(t, i), which in turn is forward transformed with an STFT with a hop size of 1024 and an FFT size of 2048 to obtain x'remainder,1024(b, n, i). Then, this signal is decomposed by

[0174] x's,1024(b, n, i), x't,1024(b, n, i), x'n,1024(b, n, i) = STN(x'remainder,1024(b, n, i), g1, g2, mt, mf) with parameter values g1 = 0.7, g2 = 0.8, mt = 9, and mf = 21.

[0177] Then, x's,1024(b, n, i) is inverse transformed, with an inverse STFT, to obtain xs(t, i) that is the tonal audio signal component 302 output by the decomposer 300.

[0178] Finally, the sum x't,1024(b, n, i) + x'n,1024(b, n, i) is inverse transformed, with an inverse STFT, to obtain xn(t, i) that is the noise audio signal component 314 output by the decomposer 300. Depending on the implementation, look-ahead may or may not be available. If it is not available, the forward and inverse STFT can cause delay to the signal. In the above, the transient audio signal component 308 passes through one STFT operation fewer than the other two components; therefore, if delay is caused to the other components, the transient audio signal component 308 is delayed by the same amount.
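The two-stage structure described above can be sketched as follows. The sketch assumes an offline setting with zero padding, so that a square-root Hann window at 50 % overlap reconstructs the signal exactly and no delay compensation is needed; a real-time implementation would differ. The STN function is passed in as a callable, and `fake_stn` in the usage lines is a trivial stand-in used only to exercise the plumbing. All function names are hypothetical.

```python
import numpy as np


def sqrt_hann(n_fft):
    # Square root of a periodic Hann window; applied at analysis and
    # synthesis, its squares sum to one at 50 % overlap.
    n = np.arange(n_fft)
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft))


def stft(x, n_fft):
    """x: (samples, channels) -> complex array (bins, frames, channels)."""
    hop = n_fft // 2
    w = sqrt_hann(n_fft)[:, None]
    ch = x.shape[1]
    pad = np.concatenate([np.zeros((hop, ch)), x, np.zeros((n_fft, ch))])
    n_frames = (pad.shape[0] - n_fft) // hop + 1
    frames = np.stack(
        [pad[k * hop:k * hop + n_fft] * w for k in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)


def istft(X, n_fft, length):
    """Inverse of stft(); returns (length, channels)."""
    hop = n_fft // 2
    w = sqrt_hann(n_fft)[:, None, None]
    frames = np.fft.irfft(X, n=n_fft, axis=0) * w
    out = np.zeros((hop * (X.shape[1] - 1) + n_fft, X.shape[2]))
    for k in range(X.shape[1]):
        out[k * hop:k * hop + n_fft] += frames[:, k]  # overlap-add
    return out[hop:hop + length]


def two_stage_split(x, stn, params_128, params_1024):
    """Return (tonal, transient, noise) time-domain components of x."""
    length = x.shape[0]
    # Stage 1: short window (hop 128, FFT 256) isolates the transients.
    s1, t1, n1 = stn(stft(x, 256), *params_128)
    x_t = istft(t1, 256, length)
    remainder = istft(s1 + n1, 256, length)
    # Stage 2: long window (hop 1024, FFT 2048) isolates the tones.
    s2, t2, n2 = stn(stft(remainder, 2048), *params_1024)
    x_s = istft(s2, 2048, length)
    x_n = istft(t2 + n2, 2048, length)
    return x_s, x_t, x_n


def fake_stn(X, g1, g2, mt, mf):
    # Stand-in with the STN call signature; NOT a real decomposition.
    return 0.5 * X, 0.3 * X, 0.2 * X


x = np.random.default_rng(0).standard_normal((4096, 2))
x_s, x_t, x_n = two_stage_split(
    x, fake_stn, (0.75, 0.85, 75, 3), (0.7, 0.8, 9, 21))
```

Because every stage splits a signal into parts that sum back to that signal, and the transforms are linear with exact reconstruction, the three components add up to the input in this sketch.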

[0179] Fig. 4 shows an example decoder 108 that can be used in examples of the disclosure. The decoder 108 can be part of a system 100 as shown in Fig. 1 or any other suitable type of system 100.

[0180] The decoder 108 receives the bitstream 106 as an input. The bitstream 106 is provided to a demultiplexer 400. The demultiplexer 400 demultiplexes the bitstream 106 into encoded transport audio signals 402 and the encoded metadata 408.

[0181] The encoded transport audio signals 402 are forwarded to an audio decoder 404. The audio decoder 404 is configured to decode the encoded transport audio signals 402. The audio decoder 404 is compatible with the audio encoder 332 that was used for encoding the transport audio signals 330. The audio decoder 404 provides decoded transport audio signals 406 as an output.

[0182] The encoded metadata 408 is forwarded to a metadata decoder 410. The metadata decoder 410 is configured to decode the encoded metadata 408. The metadata decoder 410 is compatible with the metadata encoder 324 that was used for encoding the single spatial metadata stream 322. The metadata decoder 410 provides decoded spatial metadata 412 as an output.

[0183] The decoded transport audio signals 406 and the decoded spatial metadata 412 are forwarded to a spatial synthesizer 414. The spatial synthesizer 414 uses the decoded transport audio signals 406 and the decoded spatial metadata 412 to render the spatial audio output 110. The spatial audio output 110 can comprise binaural audio or any other suitable type of audio. The spatial synthesizer can use any suitable process to render the spatial audio output 110.

[0184] Fig. 5 shows an example metadata combiner 320 that can be used in examples of the disclosure.

[0185] The metadata combiner 320 receives the audio signal components 302, 308, 314 and the corresponding spatial metadata 306, 312, 318 as inputs.

[0186] In this example the audio signal components 302, 308, 314 comprise a tonal audio signal component 302, a transient audio signal component 308 and a noise audio signal component 314. Other types of audio components could be used in other examples.

[0187] In this example the spatial metadata can comprise energy ratios and directional parameters. The energy ratios and directional parameters are received for each of the spatial metadata 306, 312, 318 for the different audio components. In the example of Fig. 5 the spatial metadata comprises an energy ratio 510 and a directional parameter 516 for the tonal audio signal component 302, an energy ratio 512 and a directional parameter 518 for the transient audio signal component 308, and an energy ratio 514 and a directional parameter 520 for the noise audio signal component 314. The energy ratios 510, 512, 514 could be direct-to-total energy ratios r(k, m). The directional parameters 516, 518, 520 could comprise azimuth θ(k, m) and elevation φ(k, m), or a spherical index, where k is the frequency band index and m the temporal subframe index. The spatial metadata can comprise other parameters.

[0188] The audio signal components 302, 308, 314 are provided to a time-frequency transform 500. The time-frequency transform 500 transforms the audio signal components 302, 308, 314 to the time-frequency domain audio signals 502. The time-frequency domain audio signals can be denoted Xt(b, n, i), Xs(b, n, i), and Xn(b, n, i). The time-frequency transform 500 can use any suitable process such as STFT, or complex low-delay filter bank (CLDFB).

[0189] The time-frequency transform 500 provides three time-frequency domain audio signals 502 as an output, one corresponding to each of the input audio signal components 302, 308, 314. The time-frequency domain audio signals 502 therefore comprise a tonal component, a transient component and a noise component.

[0190] The time-frequency domain audio signals 502 are forwarded to the energy determiner 504. The energy determiner 504 is configured to determine the energies of the different components with the time-frequency resolution corresponding to the time-frequency resolution of the spatial metadata. The energies can be determined by

[0191] $E_t(k, m) = \sum_{i=1}^{N_{ch}} \sum_{b=b_1(k)}^{b_2(k)} \sum_{n=n_1(m)}^{n_2(m)} |X_t(b, n, i)|^2$

[0194] $E_s(k, m) = \sum_{i=1}^{N_{ch}} \sum_{b=b_1(k)}^{b_2(k)} \sum_{n=n_1(m)}^{n_2(m)} |X_s(b, n, i)|^2$

[0197] $E_n(k, m) = \sum_{i=1}^{N_{ch}} \sum_{b=b_1(k)}^{b_2(k)} \sum_{n=n_1(m)}^{n_2(m)} |X_n(b, n, i)|^2$

[0200] where b1(k) is the lowest frequency bin of the frequency band k, b2(k) is the highest frequency bin of the frequency band k, n1(m) is the first temporal slot of the subframe m, and n2(m) is the last temporal slot of the subframe m.
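The triple sum over channels, bins and temporal slots can be sketched as below. The array layout `(bins, slots, channels)` and the representation of band and subframe limits as inclusive index pairs are assumptions of this sketch, and `tile_energies` is a hypothetical name.

```python
import numpy as np


def tile_energies(X, band_edges, subframe_edges):
    """Energy per (band k, subframe m) of one component.

    X: complex array (bins, slots, channels).
    band_edges[k] = (b1, b2), inclusive lowest and highest bin of band k.
    subframe_edges[m] = (n1, n2), inclusive first and last slot of subframe m.
    """
    power = np.abs(X) ** 2
    E = np.zeros((len(band_edges), len(subframe_edges)))
    for k, (b1, b2) in enumerate(band_edges):
        for m, (n1, n2) in enumerate(subframe_edges):
            # Sum over bins of the band, slots of the subframe, all channels.
            E[k, m] = power[b1:b2 + 1, n1:n2 + 1, :].sum()
    return E


# Usage with a toy layout: 6 bins, 8 slots, 2 channels of unit amplitude.
X_t = np.ones((6, 8, 2), dtype=complex)
E_t = tile_energies(X_t, band_edges=[(0, 1), (2, 5)],
                    subframe_edges=[(0, 3), (4, 7)])
```

The same function would be applied to each of the three time-frequency domain signals to obtain the three energies.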

[0201] The energy determiner 504 provides three energies 506 as an output, one corresponding to each of the input audio signal components 302, 308, 314. The energies 506 therefore comprise a tonal component, a transient component and a noise component. The energies 506 can be denoted Et(k, m), Es(k, m), En(k, m).

[0202] The energies 506 are provided as an input to the energy ratio adjuster 508. The energy ratio adjuster 508 also receives the energy ratios 510, 512, 514 from the spatial metadata as an input. The energy ratios can be denoted rt(k, m), rs(k, m), and rn(k, m). The received energy ratios 510, 512, 514 can be determined in relation to the respective component of the whole sound scene (for example the transient, tonal or noise component). Thus, these energy ratios 510, 512, 514 are not in relation to the total energy of the sound scene, but only to the total energy of that component. The energy ratios can therefore be considered to be direct-to-total ratios for a given component rather than for the whole audio scene.

[0203] The energy ratio adjuster 508 is configured to adjust the energy ratios 510, 512, 514 so that they are in relation to the whole sound scene. In some examples the adjustment can be performed by

[0205] $r_{t,adj}(k, m) = \dfrac{r_t(k, m) E_t(k, m)}{E_t(k, m) + E_s(k, m) + E_n(k, m)}$

[0207] $r_{s,adj}(k, m) = \dfrac{r_s(k, m) E_s(k, m)}{E_t(k, m) + E_s(k, m) + E_n(k, m)}$

[0209] $r_{n,adj}(k, m) = \dfrac{r_n(k, m) E_n(k, m)}{E_t(k, m) + E_s(k, m) + E_n(k, m)}$

[0212] which computes for each component the directional energy of the component and then divides it with the sum energy of all components.
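A short numerical sketch of this adjustment follows. The small `eps` term that guards against silent tiles is an assumption added here and is not part of the text; `adjust_ratios` is a hypothetical name.

```python
import numpy as np


def adjust_ratios(r_t, r_s, r_n, E_t, E_s, E_n, eps=1e-12):
    """Rescale per-component direct-to-total ratios to the whole scene.

    Each numerator is the directional energy of one component; the common
    denominator is the summed energy of all three components.
    """
    total = E_t + E_s + E_n + eps  # eps is an assumed safeguard
    return r_t * E_t / total, r_s * E_s / total, r_n * E_n / total


# One tile: the tonal component carries half of the scene energy.
r_t_adj, r_s_adj, r_n_adj = adjust_ratios(
    r_t=np.array(0.8), r_s=np.array(0.5), r_n=np.array(0.2),
    E_t=np.array(1.0), E_s=np.array(2.0), E_n=np.array(1.0))
```

In this tile the transient component is highly directional within itself (0.8) but holds only a quarter of the scene energy, so its adjusted ratio is 0.2, below the 0.25 of the tonal component. The adjusted ratios of a tile can never sum to more than one.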

[0213] The energy ratio adjuster 508 provides three adjusted energy ratios 522 as an output, one corresponding to each of the input audio signal components 302, 308, 314. The adjusted energy ratios 522 therefore comprise a tonal component, a transient component and a noise component. The adjusted energy ratios 522, which can be denoted rt,adj(k, m), rs,adj(k, m), and rn,adj(k, m), are output from the energy ratio adjuster 508.

[0216] The adjusted energy ratios 522 are provided as an input to a combined metadata determiner 524. The combined metadata determiner 524 also receives the directional parameters 516, 518, 520 as an input. The directional parameters 516, 518, 520 can be denoted θt(k,m), φt(k, m), θs(k,m), φs(k, m), θn(k,m), and φn(k, m). The combined metadata determiner 524 determines a single spatial metadata stream from these inputs.

[0217] Any suitable process can be used to determine the single spatial metadata stream. The process that is used can be dependent upon the spatial metadata format, for example if it is MASA or a different format, or any other factor.

[0218] An example process for determining the single spatial metadata stream is presented below, but it is merely one example, and other processes can be used. In this example process, the input spatial metadata is one-directional MASA metadata, that is, there is only one concurrent direction per time-frequency tile. In addition, in this example process, the output spatial metadata is two-directional MASA metadata, that is, there are two concurrent directions per time-frequency tile. Therefore there are three metadata values for each time-frequency tile in the input spatial metadata, and only two metadata values for each time-frequency tile in the output spatial metadata. This means that not all of the input values can be put into the output stream. In this example this is solved as follows.

[0219] First, the spatial metadata related to the tonal audio signal component 302 is always put to the first direction of the output spatial metadata, for example,

[0220] θcomb(k, m, 1) = θs(k, m), φcomb(k, m, 1) = φs(k, m)

[0221] rcomb(k, m, 1) = rs,adj(k, m)

[0224] The reason for this is that the tonal audio signal component 302 predominantly comprises tones, for example sinusoidal signals, which are continuous signals. Therefore, the spatial metadata for the tonal audio signal component 302 should also be continuous, in order to avoid any abrupt changes in the rendering of the tonal audio signal component 302. Abrupt changes in the rendering of the tonal audio signal component 302 could cause artefacts.

[0225] Then, for the second direction of the output spatial metadata, either the spatial metadata from the transient audio signal component 308 or the noise audio signal component 314 is selected. The selection is made independently for all time-frequency tiles. The selection can comprise selecting the spatial metadata that has the larger adjusted direct-to-total energy ratio, for example,

[0226] if rt,adj(k, m) > rn,adj(k, m):

[0227] θcomb(k, m, 2) = θt(k, m)

[0228] φcomb(k, m, 2) = φt(k, m)

[0229] rcomb(k, m, 2) = rt,adj(k, m)

[0232] else:

[0233] θcomb(k, m, 2) = θn(k, m)

[0234] φcomb(k, m, 2) = φn(k, m)

[0235] rcomb(k, m, 2) = rn,adj(k, m)
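The two-direction rule can be sketched in vectorized form as below. The dictionary layout keyed by `"s"` (tonal), `"t"` (transient) and `"n"` (noise), the array shape `(bands, subframes)`, and the function name are assumptions of this sketch.

```python
import numpy as np


def combine_two_directions(az, el, r_adj):
    """Build two-direction metadata from three one-direction inputs.

    az, el, r_adj: dicts keyed "s", "t", "n" holding arrays of shape
    (bands, subframes). Returns three arrays of shape
    (bands, subframes, 2); index 0 is the first direction.
    """
    # Per tile: does the transient component beat the noise component?
    pick_t = r_adj["t"] > r_adj["n"]

    def second(param):
        return np.where(pick_t, param["t"], param["n"])

    # First direction is always tonal; second is transient or noise.
    az_comb = np.stack([az["s"], second(az)], axis=-1)
    el_comb = np.stack([el["s"], second(el)], axis=-1)
    r_comb = np.stack([r_adj["s"], second(r_adj)], axis=-1)
    return az_comb, el_comb, r_comb


# Usage: one band, two subframes.
az = {"s": np.array([[10.0, 10.0]]), "t": np.array([[90.0, 90.0]]),
      "n": np.array([[-90.0, -90.0]])}
el = {key: np.zeros((1, 2)) for key in ("s", "t", "n")}
r_adj = {"s": np.array([[0.3, 0.3]]), "t": np.array([[0.4, 0.1]]),
         "n": np.array([[0.2, 0.3]])}
az_comb, el_comb, r_comb = combine_two_directions(az, el, r_adj)
```

In the first subframe the transient metadata fills the second direction, and in the second subframe the noise metadata does, while the first direction stays tonal throughout.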

[0238] The combined metadata determiner 524 provides the resulting combined energy ratio 526 and combined direction 528 as an output. The combined energy ratio can be denoted rcomb(k, m, i), and the combined direction can be denoted θcomb(k, m, i), φcomb(k, m, i), where i here denotes the direction index.

[0239] Fig. 6 shows another example metadata combiner 320 that can be used in examples of the disclosure. In this example the metadata combiner 320 receives the energies of the respective components as an input. In this case the metadata combiner 320 receives an energy 600 for the tonal component, an energy 602 for the transient component, and an energy 604 for the noise component. The energies 600, 602, 604 can be provided directly to the energy ratio adjuster 508. The energy ratio adjuster 508, and the other components of the metadata combiner 320, can perform as described in relation to Fig. 5. The approach shown in Fig. 6 can be used when the energies 600, 602, 604 have already been computed as part of some previous processing step.

[0240] Fig. 7 shows an alternative example system 100 that can be used in some examples of the disclosure. In this example system the input audio signals 102 are provided to a front end block 700. The front end block 700 is configured to generate a parametric audio stream comprising transport audio signals 704 and combined metadata 702. The combined metadata 702 can comprise a single spatial metadata stream that comprises spatial metadata for multiple respective components of the input audio signals 102. The spatial metadata is obtained independently for the respective components and then combined into the single spatial metadata stream.

[0241] The transport audio signals 704 and combined metadata 702 are provided as an input to the encoder 104. The encoder 104 encodes the transport audio signals 704 and combined metadata 702 to generate the bitstream 106. The bitstream 106 is forwarded to the decoder 108. The decoder 108 decodes the bitstream 106 and renders the spatial audio output 110. The spatial audio output 110 can comprise binaural audio signals, or any other suitable type of spatial audio.

[0242] The use of the front end block 700 can enable the examples of the disclosure to be implemented with existing audio codecs such as IVAS.

[0243] Fig. 8 shows an example front end 700 that could be used in the system 100 of Fig. 7. The front end comprises a decomposer 300, multiple spatial analyzers 304, 310, 316, a metadata combiner 320 and a transport audio signal determiner 328. These components of the front end 700 can perform the same functions as the corresponding components of the encoder 104 shown in Fig. 3 and described above. The front end 700 receives the input audio signals 102 as an input and provides a single spatial metadata stream 322 and transport audio signals 330 as an output.

[0244] Fig. 9 shows an example encoder 104 that could be used in the system 100 of Fig. 7. The encoder 104 in Fig. 9 could be used with a front end 700 as shown in Fig. 8. The encoder 104 comprises a metadata encoder 324, an audio encoder 332, and a multiplexer 336. These components of the encoder 104 in Fig. 9 can perform the same functions as the corresponding components of the encoder 104 shown in Fig. 3 and described above. The encoder 104 receives the single spatial metadata stream 322 and transport audio signals 330 as an input and provides the bitstream 106 as an output.

[0245] Fig. 10 shows an example method that could be used in some examples. In the examples shown in Figs. 3 to 9 a decomposer 300 determines the components of the audio signals and then the respective components are provided to spatial analyzers 304, 310, 316. In some examples the spatial synthesizer 414 of the decoder 108 could also perform a decomposition or determining that is the same or similar to the decomposition or determining that is performed by the encoder 104. Fig. 10 shows an example method that could be implemented by a spatial synthesizer in such cases.

[0246] At block 1000 the spatial synthesizer 414 decomposes the decoded transport audio signals 406 to the components. In this example the decoded transport audio signals 406 are decomposed into three components. The components represent different audio signal categories. In this example the first audio signal component can be categorized as tonal audio, the second audio signal component can be categorized as transient audio, and the third audio signal component can be categorized as noise. Other audio signal categories, and numbers of audio signal categories, could be used in other examples.

[0247] At block 1002 the spatial synthesizer 414 performs spatial synthesis processing of the respective audio signal components. The spatial synthesis processing is performed based on the decoded spatial metadata 412. In some examples the decoded spatial metadata 412 can comprise multiple simultaneous spatial metadata parameters. The multiple simultaneous spatial metadata parameters can comprise multiple simultaneous directional parameters per time-frequency region, where each element corresponds to one of the audio signal components. Therefore, as an example, the tonal audio signal component could be spatially synthesized based on the metadata that was determined for tonal components, and correspondingly for the other components.

[0248] At block 1004 the spatially processed components are combined to provide the spatial audio output 110. The spatially processed components can be combined using a sum operation or any other suitable process.

[0249] Having the spatial synthesizer 414 perform a decomposition and perform the spatial synthesis separately on the respective portions enables the spatial synthesis to be configured differently for the different components. This can potentially provide better spatial audio quality. For example, the spatial processing of the transient audio signal components could be performed with fast temporal smoothing operations so that the transient sounds are accurately rendered to their positions. The spatial processing of the tonal audio signal components could use slower temporal smoothing to avoid any artefacts due to fast-changing processing operators. Similarly, other processing procedures such as decorrelation could be differently configured for different signal components.
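The idea of component-specific smoothing can be illustrated with a one-pole smoother applied to per-frame rendering gains. The text does not prescribe a smoothing method, so the one-pole filter, the two coefficient values and the name `smooth_gains` are all assumptions chosen for illustration.

```python
import numpy as np


def smooth_gains(target, alpha):
    """One-pole temporal smoothing along the first (frame) axis.

    target: array (frames, ...) of instantaneous gains.
    alpha:  smoothing coefficient in (0, 1]; larger reacts faster.
    """
    out = np.empty(target.shape, dtype=float)
    state = np.zeros(target.shape[1:], dtype=float)
    for f in range(target.shape[0]):
        state = alpha * target[f] + (1.0 - alpha) * state
        out[f] = state
    return out


ALPHA_TRANSIENT = 0.9  # hypothetical: fast tracking for transients
ALPHA_TONAL = 0.1      # hypothetical: slow tracking for tones

# A gain that jumps from 0 to 1, as when a source changes position.
step = np.ones((5, 1))
fast = smooth_gains(step, ALPHA_TRANSIENT)
slow = smooth_gains(step, ALPHA_TONAL)
```

With the fast coefficient the gain is close to its target after a single frame, which keeps a transient at its analyzed position. The slow coefficient changes the gain gradually, which avoids audible modulation of a sustained tone.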

[0250] In another example the spatial metadata analysis can be performed on the audio signal components, and the audio signal components and the corresponding spatial metadata can then be provided directly to a spatial synthesizer 414 that is configured to implement the method of Fig. 10. However, in this case the spatial synthesizer 414 would not need to perform signal decomposition because the signals are already decomposed. In such examples the decomposer 300 and spatial analyzers 304, 310, 316 of an encoder 104 or front end 700 would be used but not necessarily the metadata combiner 320 or metadata encoder 324. Such an example system could be used in a capture device, such as a phone, where the spatial analysis and synthesis is performed on the signal portions to enable higher quality spatial audio output (for example a binaural output), for example, while capturing a video. In the examples described above the audio signals were decomposed into tonal, transient and noise components. Other types of components could be used in other examples. As an example, a machine learning network could be applied to determine gain values gML(b, n) between 0 and 1, for instance, a spectral mask, that is used to decompose the input audio signals 102 into speech and remainder portions:

[0251] xspeech(b, n, i) = gML(b, n) · x(b, n, i)

[0254] xremainder(b, n, i) = (1 − gML(b, n)) · x(b, n, i)

[0255] Then, all processing in any of the examples would use these components instead of the tonal, transient and noise components. The processing of the respective components would otherwise be the same. The decomposer 300 could use any other suitable decomposition.
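A mask-based split of this kind can be sketched as follows. The mask would come from a trained network; here a random array stands in for it purely to exercise the arithmetic. The array layouts and the name `split_with_mask` are assumptions of this sketch.

```python
import numpy as np


def split_with_mask(x, g_ml):
    """Split an STFT array into speech and remainder portions.

    x:    complex array (bins, frames, channels).
    g_ml: real array (bins, frames) of gains, expected in [0, 1].
    """
    # Clip defensively and share the same gain across channels.
    g = np.clip(g_ml, 0.0, 1.0)[:, :, None]
    return g * x, (1.0 - g) * x


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4, 2)) + 1j * rng.standard_normal((5, 4, 2))
g_ml = rng.uniform(0.0, 1.0, size=(5, 4))  # stand-in for a network output
x_speech, x_remainder = split_with_mask(x, g_ml)
```

Because the two gains sum to one in every time-frequency tile, the two portions add back to the input, just as the tonal, transient and noise components do.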

[0256] In some examples each of the respective audio signal components can be analyzed for spatial metadata using a time-frequency resolution that is suited for the category of the audio signal component. This can improve the accuracy of the spatial metadata for the audio signal components, in addition to the improvement in accuracy that results from decomposing into separate components.

[0257] Also, it is not necessary for the spatial metadata for the different audio signal components to have the same time-frequency resolution. The spatial metadata can be produced with any time-frequency resolution that is optimal, or substantially optimal, for the audio signal category for the corresponding component. The time-frequency resolution can then be matched when the spatial metadata is combined into a single spatial metadata stream. For examples that are compliant with IVAS MASA the time-frequency resolution would be 24 frequency bands and 4 temporal subframes.

[0258] In the examples described above the single spatial metadata stream 322 provided as an output of the metadata combiner 320 comprised two direction MASA. The single spatial metadata stream 322 could comprise other numbers of simultaneous directions in other examples. As an example, single direction MASA metadata can be created as follows.

[0259] First, the adjusted direct-to-total energy ratios (rs,adj(k, m), rt,adj(k, m), rn,adj(k, m)) are determined as described above. Then, the directions θcomb(k, m), φcomb(k, m) and the ratios rcomb(k, m) for the time-frequency tile k, m are selected from the component (for example, tonal, transient, or noise) having the largest adjusted direct-to-total energy ratio. This selection is performed separately for each time-frequency tile.
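The per-tile selection for single-direction output can be sketched as below. Stacking the three components along a leading axis in the order tonal, transient, noise is an assumption of this sketch, as is the function name. Ties are resolved in favour of the first component in that order, which the text does not specify.

```python
import numpy as np


def select_single_direction(az, el, r_adj):
    """Pick, per tile, the metadata of the component with the largest ratio.

    az, el, r_adj: arrays (3, bands, subframes), ordered tonal,
    transient, noise. Returns (az, el, r, winner index) per tile.
    """
    winner = np.argmax(r_adj, axis=0)[None]  # shape (1, bands, subframes)

    def pick(param):
        return np.take_along_axis(param, winner, axis=0)[0]

    return pick(az), pick(el), pick(r_adj), winner[0]


# Usage: one band, two subframes.
az = np.array([[[10.0, 10.0]], [[90.0, 90.0]], [[-90.0, -90.0]]])
el = np.zeros((3, 1, 2))
r_adj = np.array([[[0.3, 0.1]], [[0.4, 0.1]], [[0.2, 0.5]]])
az_comb, el_comb, r_comb, winner = select_single_direction(az, el, r_adj)
```

Here the transient component wins the first subframe and the noise component wins the second.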

[0260] In some examples, additional requirements can be set for the selection of single direction MASA content. For example, since it is known that transient sounds can unfavorably interfere with a tonal component, a limit can be placed on how frequently metadata originating from the transient component is allowed to be selected for the single spatial metadata stream 322. For example, only one appearance might be allowed in a defined time period, or alternatively appearances in successive frames might not be allowed. As above, this selection can also be done separately for each time-frequency tile.
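One of the restrictions mentioned above, disallowing transient metadata in successive frames, can be sketched as follows. The sketch simplifies each frame to one value per band, and falling back to the next-largest component when the transient component is blocked is an assumption made here. The names are hypothetical.

```python
import numpy as np

TRANSIENT = 1  # index of the transient component in (tonal, transient, noise)


def select_with_transient_limit(r_adj_frames):
    """Select a winning component per frame and band.

    r_adj_frames: float array (frames, 3, bands) of adjusted ratios.
    The transient component is never selected in two successive frames
    for the same band; the next-largest component is used instead.
    """
    n_frames, _, n_bands = r_adj_frames.shape
    winners = np.zeros((n_frames, n_bands), dtype=int)
    prev_transient = np.zeros(n_bands, dtype=bool)
    for f in range(n_frames):
        r = r_adj_frames[f].astype(float)  # astype returns a copy
        # Block the transient component where it won the previous frame.
        r[TRANSIENT, prev_transient] = -np.inf
        winners[f] = np.argmax(r, axis=0)
        prev_transient = winners[f] == TRANSIENT
    return winners


# Usage: the transient ratio is the largest in every frame of one band.
ratios = np.tile(np.array([[0.1], [0.9], [0.2]]), (3, 1, 1))
winners = select_with_transient_limit(ratios)
```

Even though the transient component has the largest ratio throughout, it is selected only in every other frame, and the noise component fills the gaps.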

[0261] The selection of the number of directions in the spatial metadata output can also be adaptive and could vary over time. As an example, the IVAS MASA standard supports this kind of change in the number of directions from one frame to another.

[0262] The examples described above only refer to the directional parameters and the direct-to-total energy ratios of the spatial metadata. In some examples, other parameters could be used in addition to or instead of the mentioned parameters of the spatial metadata. For example, the spread coherence and surround coherence parameters used in MASA could be used. These parameters could be determined in the spatial analyzers 304, 310, 316 as were the other parameters. Similarly these parameters could be combined from the different components in the same manner as the direction parameters are combined in the examples described above. For example, with the two-direction output, the spatial metadata parameters from the tonal component would be set to the first MASA direction, and the spatial metadata parameters from the transient or noise component would be set to the second MASA direction based on which one of the components has the larger adjusted direct-to-total energy ratio.

[0263] In the examples described above, the single spatial metadata stream 322 was constructed using only the spatial metadata analyzed using the decomposed signal portions. In some examples, spatial metadata analyzed using the original input audio signals 102 can be used when determining the single spatial metadata stream 322. For example, when determining one direction MASA metadata, if the direct-to-total energy ratio rorig(k, m) determined using the original input audio signals 102 is larger than any of the adjusted direct-to-total energy ratios determined using the decomposed portions (rs,adj(k, m), rt,adj(k, m), rn,adj(k, m)), the spatial metadata from the original input audio signals 102 can be set to the output (for example, rcomb(k, m) = rorig(k, m)). Otherwise the spatial metadata from the respective components could be used as described above.

[0264] Fig. 11 shows an example controller 1100. The controller 1100 could be provided within an encoder 104 or any other suitable entity. Implementation of the controller 1100 may be as controller circuitry. The controller 1100 may be implemented in hardware alone, may have certain aspects in software (including firmware) alone, or can be a combination of hardware and software (including firmware). The controller 1100 can provide an apparatus for implementing the disclosure or could be provided as part of an apparatus that implements the disclosure.

[0265] As illustrated in Fig. 11 the controller 1100 can be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 1106 in a general-purpose or special-purpose processor 1102 that may be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 1102. The processor 1102 is configured to read from and write to the memory 1104. The processor 1102 may also comprise an output interface via which data and / or commands are output by the processor 1102 and an input interface via which data and / or commands are input to the processor 1102.

[0266] The memory 1104 stores a computer program 1106 comprising computer program instructions (computer program code) that control the operation of the apparatus when loaded into the processor 1102. The computer program instructions, of the computer program 1106, provide the logic and routines that enable the apparatus to perform the methods illustrated in the Figs. The processor 1102, by reading the memory 1104, is able to load and execute the computer program 1106.

[0267] In some examples where the controller 1100 is provided within an apparatus, the controller therefore comprises means for:

[0268] obtaining 200 two or more audio signals comprising multiple audio components;

[0269] determining 202 at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category;

[0270] obtaining 204 first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter;

[0271] obtaining 206 second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and

[0272] combining 208 the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.
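The control flow of blocks 200 to 208 listed above can be sketched as follows. This is only a structural illustration under stated assumptions: the function names, the type aliases, and the toy decomposition, analysis, and combination stand-ins are hypothetical and carry no real signal processing.

```python
from typing import Callable, Dict, List, Sequence

Signals = Sequence[Sequence[float]]   # channels x samples
Metadata = Dict[str, object]


def build_metadata_stream(
    signals: Signals,
    decompose: Callable[[Signals], Dict[str, Signals]],
    analyzers: Dict[str, Callable[[Signals], Metadata]],
    combine: Callable[[Dict[str, Metadata]], List[Metadata]],
) -> List[Metadata]:
    # Block 200: obtain two or more audio signals.
    if len(signals) < 2:
        raise ValueError("at least two audio signals are required")
    # Block 202: determine one component per audio signal category.
    components = decompose(signals)
    # Blocks 204 and 206: analyze each component with its own analyzer.
    per_component = {name: analyzers[name](comp) for name, comp in components.items()}
    # Block 208: combine everything into a single spatial metadata stream.
    return combine(per_component)


# Toy stand-ins, present only to make the sketch executable.
def toy_decompose(signals):
    return {"tonal": signals, "transient": signals}


toy_analyzers = {
    "tonal": lambda comp: {"azimuth_deg": 30.0, "ratio": 0.6},
    "transient": lambda comp: {"azimuth_deg": -45.0, "ratio": 0.3},
}


def toy_combine(per_component):
    # One entry per component, tagged with its category.
    return [dict(meta, category=name) for name, meta in per_component.items()]


stream = build_metadata_stream(
    [[0.0, 1.0], [1.0, 0.0]], toy_decompose, toy_analyzers, toy_combine
)
```

Passing the analyzers as a per-category mapping mirrors the idea that each component can be analyzed with a different process.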

[0273] The computer program 1106 may arrive at the apparatus via any suitable delivery mechanism 1108. The delivery mechanism 1108 may be, for example, a machine-readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, an article of manufacture that comprises or tangibly embodies the computer program 1106. The delivery mechanism may be a signal configured to reliably transfer the computer program 1106. The apparatus may propagate or transmit the computer program 1106 as a computer data signal.

[0274] The computer program 1106 can comprise computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:

[0275] obtaining 200 two or more audio signals comprising multiple audio components;

[0276] determining 202 at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category;

obtaining 204 first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter;

[0277] obtaining 206 second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and

[0278] combining 208 the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

[0279] The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine-readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.

[0280] Although the memory 1104 is illustrated as a single component / circuitry it may be implemented as one or more separate components / circuitry some or all of which may be integrated / removable and / or may provide permanent / semi-permanent / dynamic / cached storage.

[0281] Although the processor 1102 is illustrated as a single component / circuitry it may be implemented as one or more separate components / circuitry some or all of which may be integrated / removable. The processor 1102 may be a single core or multi-core processor.

[0282] References to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc. or a "controller", "computer", "processor" etc. should be understood to encompass not only computers having different architectures such as single / multi-processor architectures and sequential (Von Neumann) / parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.

[0283] As used in this application, the term "circuitry" can refer to one or more or all of the following:

[0284] (a) hardware-only circuitry implementations (such as implementations in only analog and / or digital circuitry) and

[0285] (b) combinations of hardware circuits and software, such as (as applicable):

[0286] (I) a combination of analog and / or digital hardware circuit(s) with software / firmware and

[0287] (II) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and

(c) hardware circuit(s) and / or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that require software (e.g. firmware) for operation, but the software might not be present when it is not needed for operation.

[0288] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and / or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.

[0289] The blocks illustrated in the Figs. can represent steps in a method and / or sections of code in the computer program 1106. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks can be varied. Furthermore, it can be possible for some blocks to be omitted.

[0290] The above-described examples find application as enabling components of:

[0291] automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and / or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.

[0292] The apparatus can be provided in an electronic device, for example, a mobile terminal, according to an example of the present disclosure. It should be understood, however, that a mobile terminal is merely illustrative of an electronic device that would benefit from examples of implementations of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure to the same. While in certain implementation examples, the apparatus can be provided in a mobile terminal, other types of electronic devices, such as, but not limited to: mobile communication devices, hand portable electronic devices, wearable computing devices, portable digital assistants (PDAs), pagers, mobile computers, desktop computers, televisions, gaming devices, laptop computers, cameras, video recorders, GPS devices and other types of electronic systems, can readily employ examples of the present disclosure. Furthermore, devices can readily employ examples of the present disclosure regardless of their intent to provide mobility.

[0293] The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to 'comprising only one...' or by using 'consisting'. In this description, the wording 'connect', 'couple' and 'communication' and their derivatives mean operationally connected / coupled / in communication. It should be appreciated that any number or combination of intervening components can exist (including no intervening components), i.e., to provide direct or indirect connection / coupling / communication. Any such intervening components can include hardware and / or software components.

[0294] As used herein, the term "determine / determining" (and grammatical variants thereof) can include, not least: calculating, computing, processing, deriving, measuring, investigating, identifying, looking up (for example, looking up in a table, a database, or another data structure), ascertaining and the like. Also, "determining" can include receiving (for example, receiving information), accessing (for example, accessing data in a memory), obtaining and the like. Also, "determine / determining" can include resolving, selecting, choosing, establishing, and the like.

[0295] In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can', or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.

[0296] As used herein, "at least one of the following:" and "at least one of" and similar wording, where the list of two or more elements is joined by "and" or "or", mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

[0297] Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.

[0298] Features described in the preceding description may be used in combinations other than the combinations explicitly described above.

[0299] Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.

[0300] The description of a feature, such as an apparatus or a component of an apparatus, configured to perform a function, or for performing a function, should additionally be considered to also disclose a method of performing that function. For example, description of an apparatus configured to perform one or more actions, or for performing one or more actions, should additionally be considered to disclose a method of performing those one or more actions with or without the apparatus.

[0301] Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.

[0302] The term 'a', 'an' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a / an / the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a', 'an' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.

[0303] The presence of a feature (or combination of features) in a claim is a reference to that feature (or combination of features) itself and to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.

[0304] In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.

[0305] The above description describes some examples of the present disclosure; however, those of ordinary skill in the art will be aware of possible alternative structures and method features which offer equivalent functionality to the specific examples of such structures and features described hereinabove and which for the sake of brevity and clarity have been omitted from the above description. Nonetheless, the above description should be read as implicitly including reference to such alternative structures and method features which provide equivalent functionality unless such alternative structures or method features are explicitly excluded in the above description of the examples of the present disclosure.

[0306] Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and / or shown in the drawings whether or not emphasis has been placed thereon.

[0307] I / we claim:

Claims

1. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining two or more audio signals comprising multiple audio components; determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category; obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter; obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

2. An apparatus as claimed in claim 1, wherein combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises at least one of: providing the first spatial metadata and the second spatial metadata in the single spatial metadata stream; processing the first spatial metadata and the second spatial metadata into a further spatial metadata; and storing the first spatial metadata and the second spatial metadata into a single structure that represents the spatial metadata stream.

3. An apparatus as claimed in any preceding claim, wherein combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises at least one of: assigning first spatial metadata to a first time / frequency tile and assigning second spatial metadata to a second time / frequency tile; and assigning first spatial metadata to a first direction and assigning second spatial metadata to a second direction.

4. An apparatus as claimed in any preceding claim, wherein a first process is used to analyze the first component to obtain first spatial metadata and a second, different process is used to analyze the second component to obtain second spatial metadata.

5. An apparatus as claimed in claim 4, wherein the first process comprises a first type of transform and the second process comprises a second type of transform.

6. An apparatus as claimed in claim 5, wherein different window lengths are used for the different transforms based on a characteristic distribution of energy density for the audio signal categories of the respective components.

7. An apparatus as claimed in any preceding claim, wherein the processor and memory are also configured to cause the apparatus to perform determining a third component of the two or more audio signals that represents a third audio signal category.

8. An apparatus as claimed in any preceding claim, wherein an energy density of a component that represents the second audio signal category is predominantly comprised within a shorter time frame compared to an energy density of a component that represents the first audio signal category.

9. An apparatus as claimed in any preceding claim, wherein the first audio signal category comprises predominantly tonal audio and the second audio signal category comprises predominantly transient audio.

10. An apparatus as claimed in any preceding claim, wherein respective audio signal categories comprise two or more of: tonal audio; harmonic audio; noise; transient audio; onset audio; speech; remainder; non-transient remainder; transient remainder; music; and ambience.

11. An apparatus as claimed in any preceding claim, wherein determining respective components from the two or more audio signals comprises at least one of: decomposing the two or more audio signals into the respective components; splitting the two or more audio signals into the respective components; dividing the two or more audio signals into the respective components; extracting the respective components from the two or more audio signals; weighting the respective components from the two or more audio signals; identifying the respective components from the two or more audio signals; and emphasizing at least one component from the two or more audio signals.

12. An apparatus as claimed in any preceding claim, wherein the processor and memory are also configured to determine spatial metadata for the two or more audio signals and use the spatial metadata for the two or more audio signals in place of the single spatial metadata stream if one or more criteria for the spatial metadata for the two or more audio signals are satisfied.

13. An apparatus as claimed in any preceding claim, wherein respective components of the two or more audio signals are determined using at least one of: signal decomposition; a machine learning model; and a classifier.

14. An apparatus as claimed in any preceding claim, wherein the spatial metadata comprises at least one of: one or more directional parameters; one or more energy parameters; and one or more ratio parameters.

15. An apparatus as claimed in any preceding claim, wherein the first spatial metadata and the second spatial metadata are used to render a spatial audio output and the spatial audio output comprises one of: binaural audio signals; stereo audio signals; multi-channel audio signals; and Ambisonics signals.

16. A method comprising: obtaining two or more audio signals comprising multiple audio components; determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category; obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter; obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

17. A method as claimed in claim 16, wherein combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises at least one of: providing the first spatial metadata and the second spatial metadata in the single spatial metadata stream; processing the first spatial metadata and the second spatial metadata into a further spatial metadata; and storing the first spatial metadata and the second spatial metadata into a single structure that represents the spatial metadata stream.

18. A method as claimed in claim 16 or 17, wherein combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream comprises at least one of: assigning first spatial metadata to a first time / frequency tile and assigning second spatial metadata to a second time / frequency tile; and assigning first spatial metadata to a first direction and assigning second spatial metadata to a second direction.

19. A method as claimed in any of claims 16 to 18, wherein a first process is used to analyze the first component to obtain first spatial metadata and a second, different process is used to analyze the second component to obtain second spatial metadata.

20. A method as claimed in any of claims 16 to 19, wherein the method further comprises determining a third component of the two or more audio signals that represents a third audio signal category.

21. A method as claimed in any of claims 16 to 20, wherein an energy density of a component that represents the second audio signal category is predominantly comprised within a shorter time frame compared to an energy density of a component that represents the first audio signal category.

22. A method as claimed in any of claims 16 to 21, wherein the first audio signal category comprises predominantly tonal audio and the second audio signal category comprises predominantly transient audio.

23. A computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform: obtaining two or more audio signals comprising multiple audio components; determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category; obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter; obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.

24. An apparatus comprising means for: obtaining two or more audio signals comprising multiple audio components; determining at least a first component and a second component from the two or more audio signals wherein the first component represents a first audio signal category and the second component represents a second audio signal category; obtaining first spatial metadata based on analyzing the first component wherein the first spatial metadata comprises at least one first spatial audio parameter; obtaining second spatial metadata based on analyzing the second component wherein the second spatial metadata comprises at least one second spatial audio parameter; and combining the first spatial metadata and the second spatial metadata into a single spatial metadata stream wherein the at least one first spatial audio parameter and the at least one second spatial audio parameter are represented inside the single spatial metadata stream.