Harmonization of spatial metadata orientation
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NOKIA TECHNOLOGIES OY
- Filing Date
- 2024-02-29
- Publication Date
- 2026-07-02
AI Technical Summary
Existing immersive audio codecs face issues with spatial metadata orientation inconsistencies due to shuffling or reordering of directional metadata parameters, leading to suboptimal encoding and decoding of spatial audio, particularly at lower bitrates, resulting in degraded perceptual quality.
A method and apparatus for determining and correcting ordering errors in directional metadata parameters by reordering them based on similarity and difference measures within a grid of time-frequency tiles, ensuring accurate alignment of spatial metadata with audio sources.
Improves encoding efficiency and perceptual quality of spatial audio by maintaining alignment of spatial metadata with audio sources, even at lower bitrates, thereby enhancing the performance of immersive audio codecs.
Abstract
Description
[Technical Field]
[0001] This application relates to an apparatus and method for harmonizing spatial metadata orientation.
[Background Art]
[0002] Parametric spatial audio capture from inputs such as microphone arrays and other sources is a typical and effective way to estimate, from the input (microphone array signals), a set of parameters such as the directions of sound in frequency bands and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to represent well the perceptual spatial characteristics of the captured sound at the position of the microphone array. They can therefore be used to synthesize spatial sound for binaural headphones, loudspeakers, or other formats such as Ambisonics.
[0003] The directions and the direct-to-total and diffuse-to-total energy ratios in frequency bands are therefore a particularly effective parameterization for spatial audio capture.
[0004] A parameter set consisting of a direction parameter in frequency bands (indicating the direction of the sound) and an energy ratio parameter in frequency bands can also be used as spatial metadata for an audio codec (the metadata may also include other parameters such as surround coherence, spread coherence, number of directions, and distance). For example, these parameters can be estimated from audio signals captured by a microphone array, and a transport audio signal, for example stereo or mono, can be generated from the microphone array signals and conveyed along with the spatial metadata.
[0005] Immersive audio codecs are implemented supporting a range of operating points, from low-bitrate operation to transparency. One example of such a codec is the Immersive Voice and Audio Services (IVAS) codec, designed for use over communication networks such as 3GPP 4G / 5G, including use in immersive services such as immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding, and rendering of speech, music, and general-purpose audio. It is also expected to support channel-based audio, object-based audio, and scene-based audio input including spatial information about sound fields and sources. Furthermore, this codec is expected to operate with low latency to enable conversational services and support high error robustness under various communication conditions.
[0006] Transport audio signals can be encoded using, for example, the IVAS audio core codec, or using AAC (Advanced Audio Coding) or EVS (Enhanced Voice Services) encoders. Decoders can decode audio signals into PCM (Pulse code modulation) signals and process the sound in the frequency band (using spatial metadata) to obtain spatial outputs such as binaural output.
[0007] The aforementioned immersive audio codecs are particularly well-suited for encoding spatial sound captured from microphone arrays (e.g., in mobile phones, VR cameras, and standalone microphone arrays). However, such encoders may have other input types, such as loudspeaker signals, audio object signals, or ambisonic signals.
[Summary of the Invention]
[0008] According to a first aspect, an apparatus is provided which includes means for obtaining ordered directional metadata parameters with respect to at least two sources in an audio scene, wherein the ordered directional metadata parameters are associated with at least two sources, the directional metadata parameters identify the direction-of-arrival for at least two sources, and are arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; determining an ordering error with respect to at least one time-frequency tile directional metadata parameter, wherein the ordering error is configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different order index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same order index; and reordering the determined at least one time-frequency tile directional metadata parameter to a different order index.
[0009] Means for determining the ordering error with respect to at least one time-frequency tile directivity metadata parameter may be for determining whether the similarity measure between at least one adjacent time-frequency tile directivity metadata parameter associated with a different ordinal index and at least one time-frequency tile directivity metadata is greater than or equal to the similarity measure between at least one adjacent time-frequency tile directivity metadata parameter associated with the same ordinal index and at least one time-frequency tile directivity metadata.
[0010] Means for determining an ordering error with respect to at least one time-frequency tile directivity metadata parameter may be for determining that a difference measure between at least one adjacent time-frequency tile directivity metadata parameter associated with a different ordinal index and at least one time-frequency tile directivity metadata is smaller than or equal to a difference measure between at least one adjacent time-frequency tile directivity metadata parameter associated with the same ordinal index and at least one time-frequency tile directivity metadata.
[0011] At least one adjacent time-frequency tile directivity metadata parameter may be at least one of the following: a preceding time-frequency tile directivity metadata parameter, a succeeding time-frequency tile directivity metadata parameter, a preceding frequency time-frequency tile directivity metadata parameter, a succeeding frequency time-frequency tile directivity metadata parameter, a preceding time and frequency time-frequency tile directivity metadata parameter, a succeeding time and frequency time-frequency tile directivity metadata parameter, a preceding time and succeeding frequency time-frequency tile directivity metadata parameter, and a succeeding time and preceding frequency time-frequency tile directivity metadata parameter.
[0012] Means for reordering the determined at least one time-frequency tile directivity metadata parameter to a different ordinal index may be means for reassigning the determined at least one time-frequency tile directivity metadata parameter to a different ordinal index.
[0013] Means for reassigning the determined at least one time-frequency tile directivity metadata parameter to a different ordinal index may include determining which of the at least one adjacent time-frequency tile directivity metadata parameters in the other ordinal index is more similar and / or less different than the at least one time-frequency tile directivity metadata parameter associated with the same ordinal index, and for reassigning the at least one subframe metadata parameter to the determined different ordinal index.
[0014] According to a second aspect, a method is provided which includes obtaining ordered directional metadata parameters with respect to at least two sources in an audio scene, wherein the ordered directional metadata parameters are associated with at least two sources, and the directional metadata parameters identify the direction of arrival for at least two sources and are arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; determining an ordering error with respect to at least one time-frequency tile directional metadata parameter, wherein the ordering error is configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different ordering index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same ordering index; and reordering the determined at least one time-frequency tile directional metadata parameter to a different ordering index.
[0015] Determining the ordering error for at least one time-frequency tile directivity metadata parameter may involve determining whether a similarity measure between at least one adjacent time-frequency tile directivity metadata parameter associated with a different ordinal index and at least one time-frequency tile directivity metadata is greater than or equal to a similarity measure between at least one adjacent time-frequency tile directivity metadata parameter associated with the same ordinal index and at least one time-frequency tile directivity metadata.
[0016] Determining the ordering error for at least one time-frequency tile directivity metadata parameter may involve determining that a difference measurement between at least one adjacent time-frequency tile directivity metadata parameter associated with a different ordinal index and at least one time-frequency tile directivity metadata is smaller than or equal to a difference measurement between at least one adjacent time-frequency tile directivity metadata parameter associated with the same ordinal index and at least one time-frequency tile directivity metadata.
[0017] At least one adjacent time-frequency tile directivity metadata parameter may be at least one of the following: a time-frequency tile directivity metadata parameter for a preceding time, a time-frequency tile directivity metadata parameter for a succeeding time, a time-frequency tile directivity metadata parameter for a preceding frequency, a time-frequency tile directivity metadata parameter for a succeeding frequency, a time-frequency tile directivity metadata parameter for a preceding time and frequency, a time-frequency tile directivity metadata parameter for a succeeding time and frequency, a time-frequency tile directivity metadata parameter for a preceding time and a succeeding frequency, and a time-frequency tile directivity metadata parameter for a succeeding time and a preceding frequency.
[0018] Reordering at least one determined time-frequency tile directivity metadata parameter to a different ordinal index may involve reassigning at least one determined time-frequency tile directivity metadata parameter to a different ordinal index.
[0019] Reassigning at least one determined time-frequency tile directivity metadata parameter to a different ordinal index may involve determining which of at least one adjacent time-frequency tile directivity metadata parameters in the other ordinal index is more similar and / or less different than at least one time-frequency tile directivity metadata parameter associated with the same ordinal index, and reassigning at least one subframe metadata parameter to the determined different ordinal index.
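The detection and reordering steps of the method aspect can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes exactly two ordered directions per tile, uses the great-circle angle between directions as one possible difference measure, compares only against the preceding temporal subframe, and swaps only on a strict inequality (the text also allows "less than or equal"). The function names `has_ordering_error` and `harmonize_subframes` are hypothetical.

```python
import math

def unit_vector(azimuth_deg, elevation_deg):
    """Convert a direction in degrees to a Cartesian unit vector."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def angle_between_deg(a, b):
    """Difference measure (assumed): angle in degrees between two directions."""
    dot = sum(p * q for p, q in zip(unit_vector(*a), unit_vector(*b)))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def has_ordering_error(tile, neighbour, index=0):
    """tile and neighbour are [dir_at_index_0, dir_at_index_1].

    Returns True when the neighbour's direction at the OTHER order index is
    less different from tile[index] than the neighbour's direction at the
    SAME order index.
    """
    same = angle_between_deg(tile[index], neighbour[index])
    other = angle_between_deg(tile[index], neighbour[1 - index])
    return other < same

def harmonize_subframes(subframes):
    """Reorder the two directions of each subframe of one frequency band so
    that each order index follows the preceding (already corrected) subframe."""
    out = [list(subframes[0])]
    for tile in subframes[1:]:
        tile = list(tile)
        if has_ordering_error(tile, out[-1]):
            tile.reverse()  # reassign each direction to the other order index
        out.append(tile)
    return out

# Two sources near +30 and -60 degrees azimuth; the second subframe is shuffled.
band = [[(30, 0), (-60, 0)], [(-58, 0), (32, 0)], [(31, 0), (-61, 0)]]
fixed = harmonize_subframes(band)
```

In this example only the second subframe is swapped, so order index 0 tracks the source near +30 degrees throughout. A fuller version might also consult neighbouring frequency bands and diagonal neighbours, as the adjacency options listed above allow.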
[0020] According to a third aspect, there is provided an apparatus comprising at least one processor and at least one memory storing instructions, which when executed by the at least one processor, cause the system to at least obtain directional metadata parameters ordered with respect to at least two sources within an audio scene, the ordered directional metadata parameters being associated with the at least two sources, the directional metadata parameters identifying the directions of arrival with respect to the at least two sources and being arranged within a frame arranged as a grid of time-frequency tiles with respect to a time axis and a frequency axis; determine an ordering error for at least one time-frequency tile directional metadata parameter, the ordering error being configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different order index is more similar and / or less different than at least one adjacent time-frequency tile directional metadata parameter associated with the same order index with respect to at least one time-frequency tile directional metadata parameter; and reorder the determined at least one time-frequency tile directional metadata parameter to a different order index.
[0021] The apparatus caused to determine an ordering error for at least one time-frequency tile directional metadata parameter may be caused to determine that a similarity measurement between at least one adjacent time-frequency tile directional metadata parameter associated with a different order index and at least one time-frequency tile directional metadata is greater than or equal to a similarity measurement between at least one adjacent time-frequency tile directional metadata parameter associated with the same order index and at least one time-frequency tile directional metadata.
[0022] An apparatus that is caused to determine an ordering error for at least one time-frequency tile directivity metadata parameter may be further caused to determine that a difference measurement between at least one adjacent time-frequency tile directivity metadata parameter associated with a different order index and at least one time-frequency tile directivity metadata is less than or equal to a difference measurement between at least one adjacent time-frequency tile directivity metadata parameter associated with the same order index and at least one time-frequency tile directivity metadata.
[0023] At least one adjacent time-frequency tile directivity metadata parameter can be at least one of a time-frequency tile directivity metadata parameter of a preceding time, a time-frequency tile directivity metadata parameter of a subsequent time, a time-frequency tile directivity metadata parameter of a preceding frequency, a time-frequency tile directivity metadata parameter of a subsequent frequency, a time-frequency tile directivity metadata parameter of a preceding time and frequency, a time-frequency tile directivity metadata parameter of a subsequent time and frequency, a time-frequency tile directivity metadata parameter of a preceding time and a subsequent frequency, and a time-frequency tile directivity metadata parameter of a subsequent time and a preceding frequency.
[0024] An apparatus that is caused to reorder at least one determined time-frequency tile directivity metadata parameter to a different order index may be caused to reassign at least one determined time-frequency tile directivity metadata parameter to a different order index.
[0025] An apparatus capable of performing the reassignment of at least one determined time-frequency tile directivity metadata parameter to a different ordinal index may be capable of determining which of at least one adjacent time-frequency tile directivity metadata parameters in the different ordinal index is more similar and / or less different than at least one time-frequency tile directivity metadata parameter associated with the same ordinal index, and of reassigning at least one subframe metadata parameter to a different ordinal index.
[0026] According to a fourth aspect, an acquisition circuit mechanism is provided which is configured to acquire ordered directional metadata parameters with respect to at least two sources in an audio scene, wherein the ordered directional metadata parameters are associated with at least two sources, the directional metadata parameters identify the direction of arrival with respect to at least two sources, and are arranged in a frame which are arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; a determination circuit mechanism configured to determine an ordering error with respect to at least one time-frequency tile directional metadata parameter, wherein the ordering error is configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different order index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same order index; and a reordering circuit mechanism configured to reorder the determined at least one time-frequency tile directional metadata parameter to a different order index.
[0027] According to a fifth aspect, a computer program is provided, which includes instructions [or a computer-readable medium containing program instructions] for causing the device to perform at least: obtain ordered directional metadata parameters with respect to at least two sources in an audio scene, the ordered directional metadata parameters being associated with at least two sources, the directional metadata parameters identifying the direction of arrival with respect to at least two sources, and being arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; determine an ordering error with respect to at least one time-frequency tile directional metadata parameter, the ordering error being configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different ordering index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same ordering index; and reorder the determined at least one time-frequency tile directional metadata parameter to a different ordering index.
[0028] According to a sixth aspect, a non-transitory computer-readable medium is provided, which includes program instructions for causing the device to perform at least the following: obtain ordered directional metadata parameters with respect to at least two sources in an audio scene, the ordered directional metadata parameters being associated with at least two sources, the directional metadata parameters identifying the direction of arrival for at least two sources, and being arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; determine an ordering error with respect to at least one time-frequency tile directional metadata parameter, the ordering error being configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different ordering index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same ordering index; and reorder the determined at least one time-frequency tile directional metadata parameter to a different ordering index.
[0029] According to the seventh aspect, an apparatus is provided comprising means for obtaining ordered directional metadata parameters with respect to at least two sources in an audio scene, wherein the ordered directional metadata parameters are associated with at least two sources, the directional metadata parameters identify the direction of arrival with respect to at least two sources, and are arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; means for determining an ordering error with respect to at least one time-frequency tile directional metadata parameter, wherein the ordering error is configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different ordering index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same ordering index; and means for reordering the determined at least one time-frequency tile directional metadata parameter to a different ordering index.
[0030] According to the eighth aspect, a computer-readable medium is provided, which includes program instructions for causing the device to perform at least: obtain ordered directional metadata parameters with respect to at least two sources in an audio scene, the ordered directional metadata parameters being associated with at least two sources, the directional metadata parameters identifying the direction of arrival for at least two sources, and being arranged in a frame arranged as a grid of time-frequency tiles with respect to the time axis and frequency axis; determine an ordering error with respect to at least one time-frequency tile directional metadata parameter, the ordering error being configured to identify that at least one adjacent time-frequency tile directional metadata parameter associated with a different ordering index is more similar and / or less different with respect to at least one time-frequency tile directional metadata parameter than at least one adjacent time-frequency tile directional metadata parameter associated with the same ordering index; and reorder the determined at least one time-frequency tile directional metadata parameter to a different ordering index.
[0031] The apparatus includes means for carrying out the operation of the above method.
[0032] The apparatus is configured to perform the actions of the method described above.
[0033] A computer program includes program instructions that cause the computer to perform the above-described method.
[0034] A computer program product stored on a medium can cause the device to perform the methods described herein.
[0035] The electronic device may be equipped with the apparatus described herein.
[0036] The chipset may include the device described herein.
[0037] The embodiments of this application aim to address problems associated with the state of the art.
[0038] For a better understanding of this application, reference is now made, by way of example, to the accompanying drawings.
[Brief Description of the Drawings]
[0039]
[Figure 1] A schematic diagram of an apparatus for extracting MASA metadata.
[Figure 2] A schematic diagram of the subframe structure of an exemplary MASA metadata frame.
[Figure 3] A schematic representation of the time-frequency structure of an exemplary MASA metadata frame.
[Figure 4] A schematic representation of an exemplary system of apparatus suitable for carrying out several embodiments.
[Figure 5] A schematic representation of a known metadata analyzer, metadata encoder, and audio encoder.
[Figure 6] An example of input data in which each of the two direction fields corresponds to one physical direction.
[Figure 7] An example of input data in which each of the two direction fields corresponds to one physical direction and the direction fields are shuffled in the subbands.
[Figure 8] An example of the forcing of a low spatial resolution mode caused by direction mixing that results from inconsistency between temporally consecutive subframes.
[Figure 9] A schematic diagram of an exemplary metadata analyzer, metadata encoder, and audio encoder according to several embodiments.
[Figure 10] A schematic diagram of an exemplary metadata analyzer, metadata encoder, and audio encoder according to several embodiments.
[Figure 11] A flowchart illustrating the operation of the exemplary metadata analyzer, metadata encoder, and audio encoder according to several embodiments, as shown in Figures 9 and 10.
[Figure 12] A flowchart illustrating the operation of the exemplary metadata analyzer, metadata encoder, and audio encoder according to several embodiments, as shown in Figures 9 and 10.
[Figure 13] An exemplary device suitable for implementing the apparatus shown in the preceding figures.
[Modes for Carrying Out the Invention]
[0040] Further details of suitable devices and possible mechanisms for encoding parametric spatial audio signals, including transport audio signals and spatial metadata, are described below. As mentioned above, immersive audio codecs (such as 3GPP IVAS) supporting various operating points from low bitrate operation to transparency are planned.
[0041] Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
[0042] This can be thought of as an audio representation consisting of "N channels + spatial metadata." This is a scene-based audio format particularly well-suited for spatial audio capture on actual devices such as smartphones. This concept represents a sound scene with respect to time- and frequency-varying sound source directions, as well as, for example, energy ratios. Sound energy in a scene that is not defined (not represented) by direction is represented as diffuse (coming from all directions).
[0043] As discussed above, the spatial metadata associated with an audio signal may include multiple parameters for each time-frequency tile (such as multiple directions and, for each direction (or directivity value), an associated direct-to-total energy ratio, spread coherence, distance, etc.). The spatial metadata may also include, or be associated with, other parameters that are considered non-directional (such as surround coherence, diffuse-to-total energy ratio, and remainder-to-total energy ratio) but that, when combined with the directional parameters, can be used to define the characteristics of the audio scene. For example, one reasonable design choice that can produce good-quality output is for the spatial metadata to include one or more directions for each time-frequency subframe (with a direct-to-total ratio, spread coherence, distance value, etc., determined for each direction).
[0044] An exemplary MASA analyzer 101 is shown with respect to Figure 1. The MASA analyzer 101 is configured to receive and analyze an input audio signal 100 in order to generate a transport audio signal 102 and spatial metadata 104.
[0045] Examples of MASA spatial metadata are presented in the table below. These values are available for each time-frequency tile (TF tile). In other words, the metadata is arranged as a frame containing several TF tiles or time-frequency elements that can be placed within a "grid" of TF tiles or TF elements, and the grid is placed on the time axis and frequency axis. In some embodiments, the frame is subdivided into 24 frequency bands and 4 temporal subframes. In other embodiments, other divisions of frequency and time may be used. Furthermore, in some embodiments (e.g., as implemented in IVAS), the frame size is 20 ms (and therefore the temporal subframes are 5 ms). However, similarly, other frame lengths may be used in other embodiments. In some embodiments, the MASA analyzer is configured to determine one or two directions for each time-frequency tile (i.e., there are one or two direction indices, a direct-to-total energy ratio, and a spread coherence parameter for each time-frequency tile). However, in some embodiments, the analyzer is configured to generate three or more directions for a time-frequency tile.
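The frame layout described in this paragraph can be pictured with a minimal Python sketch. The 24 bands, 4 subframes, and 20 ms frame length come from the text; the class name `DirectionParams`, its field names, and the nested-list layout are illustrative assumptions and do not reflect the actual IVAS data structures.

```python
from dataclasses import dataclass

NUM_BANDS = 24       # frequency bands per frame (from the text)
NUM_SUBFRAMES = 4    # temporal subframes per frame (from the text)
FRAME_MS = 20.0      # frame length used in IVAS (from the text)

@dataclass
class DirectionParams:
    """Per-direction parameters of one TF tile (hypothetical field names)."""
    azimuth_deg: float
    elevation_deg: float
    direct_to_total: float
    spread_coherence: float

def make_frame(num_directions=2):
    """Build frame[subframe][band][order_index] filled with neutral values."""
    return [
        [
            [DirectionParams(0.0, 0.0, 0.0, 0.0) for _ in range(num_directions)]
            for _ in range(NUM_BANDS)
        ]
        for _ in range(NUM_SUBFRAMES)
    ]

frame = make_frame()
subframe_ms = FRAME_MS / NUM_SUBFRAMES       # 5 ms per temporal subframe
tiles_per_frame = NUM_BANDS * NUM_SUBFRAMES  # 96 TF tiles per frame
```

The innermost list index plays the role of the order index: position 0 holds the first direction of a tile and position 1 the second.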
[0046] [Table 1]
[0047] The MASA stream can be rendered to various outputs, such as multi-channel loudspeaker signals (e.g., 5.1) or binaural signals.
[0048] A directional index is an encoded form of the direction of the azimuth and elevation angles of arrival of a sound or source (or other directional values such as a Cartesian two-dimensional vector, a three-dimensional vector, or a polar coordinate-based vector).
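The relationship between an azimuth/elevation pair and a Cartesian direction vector can be sketched as follows. This is only the coordinate conversion, not the index encoding itself, which is not specified here. The axis convention (x to the front, y to the left, z up, azimuth measured counter-clockwise) is an assumption.

```python
import math

def direction_to_vector(azimuth_deg, elevation_deg):
    """Azimuth/elevation in degrees to a unit vector (assumed axis convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def vector_to_direction(x, y, z):
    """Inverse conversion: Cartesian vector to (azimuth, elevation) in degrees."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation

# Round trip for a direction 30 degrees to the left and 20 degrees up.
az, el = vector_to_direction(*direction_to_vector(30.0, 20.0))
```

The vector form is convenient when directions need to be compared or averaged, because it avoids the wrap-around of azimuth at ±180 degrees.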
[0049] As discussed above, the frame size in IVAS is 20ms. An example of the (IVAS) frame structure is shown in Figure 2, where metadata frame 201 contains four temporal subframes, each 5ms long. Figure 2 shows, for example, the current metadata frame 201, which includes metadata subframe 4 200 of the previous frame, followed by metadata subframe 1 202, metadata subframe 2 204, metadata subframe 3 206, and metadata subframe 4 208. This is followed by metadata subframe 1 210 of the subsequent or next frame.
[0050] Furthermore, as shown in Figure 3, an exemplary raw high-resolution metadata frame 300 having both high-frequency resolution and high-temporal resolution is shown. Frame 300 is shown with respect to a TF tile positioned along time 303 and frequency axis 301. Thus, the TF tile is positioned according to time axis 303 having four subframes or segments on the time axis, namely metadata subframe 1 302, metadata subframe 2 304, metadata subframe 3 306, and metadata subframe 4 308. In addition, a series of bands or segments on frequency axis 301 are shown (but these are not individually labeled in Figure 3). Thus, for a particular TF tile 350, there may be adjacent time TF tiles 360, 370, or adjacent frequency TF tiles 353, 354.
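As an illustration of adjacency within the grid, the following Python sketch lists the neighbours of a TF tile. The function name `adjacent_tiles` and the `(subframe, band)` indexing are assumptions. The sketch stays within a single frame, whereas the frame structure described above also allows the last subframe of the previous frame and the first subframe of the next frame to act as temporal neighbours.

```python
def adjacent_tiles(subframe, band, num_subframes=4, num_bands=24,
                   include_diagonal=False):
    """Return (subframe, band) indices of tiles adjacent to the given tile."""
    # Preceding/succeeding time, then preceding/succeeding frequency.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if include_diagonal:
        # Combined time-and-frequency neighbours.
        offsets += [(-1, -1), (1, 1), (-1, 1), (1, -1)]
    result = []
    for dt, df in offsets:
        t, f = subframe + dt, band + df
        if 0 <= t < num_subframes and 0 <= f < num_bands:
            result.append((t, f))
    return result

corner = adjacent_tiles(0, 0)                            # tile at a grid corner
interior = adjacent_tiles(1, 5, include_diagonal=True)   # interior tile
```

A corner tile has two in-frame neighbours, while an interior tile has eight when diagonal neighbours are included.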
[0051] Furthermore, the IVAS codec is expected to operate over a wide range of bitrates, from very low (e.g., 13.2 kbps) to relatively high (e.g., 512 kbps or even 768 kbps). Since the raw bitrate of MASA metadata is approximately 300-500 kbps (depending on whether one or two simultaneous directions are encoded), the metadata must be significantly compressed (especially at the lowest bitrates).
[0052] One form of compression may be a method of reducing the temporal and / or frequency resolution of metadata (which may be used in conjunction with other methods for compressing data).
[0053] For example, a raw high-resolution metadata frame may contain 24 frequency bands on the frequency axis and 4 temporal subframes (subframes 1-4) on the time axis, which together constitute a time-frequency tile (also called a TF tile). Known methods for reducing the number of time-frequency tiles transmitted and thus significantly reducing the required bitrate can be based on those described in UKIPO patent applications 1919130.3 and 1919131.1 and WO2021 / 130405, which present methods that allow for the concatenation of metadata from multiple frequency bands and / or temporal subframes to reduce the number of frequency bands and / or temporal subframes.
[0054] For example, depending on the bitrate, 5 to 24 frequency bands and 1 to 4 subframes may be transmitted.
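To make the idea of merging tiles concrete, here is a minimal Python sketch that combines the directions of several tiles into one by a weighted vector average. This is a generic technique chosen for illustration; the merging methods of the cited applications are not described in this text, and the function name `merge_directions` and the equal weights are assumptions.

```python
import math

def merge_directions(directions, weights):
    """Merge (azimuth_deg, elevation_deg) pairs by weighted vector averaging."""
    x = y = z = 0.0
    for (az_deg, el_deg), w in zip(directions, weights):
        az, el = math.radians(az_deg), math.radians(el_deg)
        x += w * math.cos(el) * math.cos(az)
        y += w * math.cos(el) * math.sin(az)
        z += w * math.sin(el)
    return (math.degrees(math.atan2(y, x)),
            math.degrees(math.atan2(z, math.hypot(x, y))))

ones = [1.0] * 4
# Four subframes of one order index, consistently tracking one source.
consistent_az, _ = merge_directions([(29, 0), (30, 0), (31, 0), (32, 0)], ones)
# The same order index when two subframes carry the other source instead.
shuffled_az, _ = merge_directions([(30, 0), (-60, 0), (31, 0), (-61, 0)], ones)
```

With consistent ordering the merged azimuth stays at about 30.5 degrees. With shuffled ordering it lands near -15 degrees, which corresponds to neither source. Under the assumptions of this sketch, that suggests why the order of directions should be consistent before the resolution is reduced.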
[0055] Such a method therefore uses a metadata resolution selector configured to select and generate at least one of a 1sf metadata frame (high frequency resolution, low time resolution) and a 4sf metadata frame (low frequency resolution, high time resolution), which can then be encoded and output.
[0056] Since MASA streams can be generated from various types of devices (e.g., microphone arrays on mobile devices and dedicated ambisonic microphone arrays such as Eigenmike), the methods used to determine spatial metadata can vary significantly between implementations. Some methods may have high temporal resolution but lower frequency resolution, while others may have low temporal resolution but higher frequency resolution.
[0057] To improve coding efficiency for both types of time-frequency resolution, it has been proposed that MASA metadata can be encoded in two different modes, as shown in PCT application WO2021250312. The first metadata frame resolution is a low time-resolution (1sf) mode having only one temporal subframe mode but with high frequency resolution, and the other metadata frame resolution is a high time-resolution (4sf) mode having four temporal subframes but with low frequency resolution.
[0058] In this example, the low time resolution mode (1sf) is selected when the encoder receives spatial metadata that is determined or detected to be identical (or substantially identical or similar) across all subframes of a frame.
[0059] If the spatial metadata is not identical (or not substantially identical or similar) across all subframes, the high time-resolution (4sf) mode is used instead.
[0060] For example, at a given bitrate, the low time-resolution mode (1sf) may transmit 18 frequency bands and 1 subframe (in other words, a total of 18 TF tiles), while the high time-resolution mode (4sf) may transmit 5 frequency bands and 4 subframes (in other words, a total of 20 TF tiles), which are roughly equivalent in size to the data transmitted at the same overall bitrate.
[0061] PCT application WO2019105575 proposes the use of a variable input metadata time-frequency resolution. This achieves a trade-off similar to that of the method in PCT application WO2021250312. However, the decision is made outside the codec and may be based on the specific capture algorithm for the microphone array being used.
[0062] The method described above therefore demonstrates how encoding quality can be maintained, with temporal and frequency resolution adjusted or modulated for the audio input.
[0063] Figure 4 shows an exemplary system in which several embodiments may be implemented. A transport audio signal 102 and spatial metadata 104 are present as inputs. The transport audio signal 102 and spatial metadata 104 are passed to an encoder 401, which generates an encoded bitstream 402. The encoded bitstream 402 is received by a decoder 403, which is configured to generate a spatial audio output 404.
[0064] As discussed above, the system inputs, namely the transport audio signal 102 and the spatial metadata 104, may be acquired in the form of a MASA stream. The MASA stream may originate from, for example, a mobile device (containing a microphone array), or, as an alternative example, it may be created by an audio server that has in some way processed the MASA stream.
[0065] In some embodiments, the encoder 401 may be an IVAS encoder.
[0066] In some embodiments, the decoder 403 may be configured to directly output a spatial audio output 404 which is rendered by an external renderer or edited / processed by an audio server. In some embodiments, the decoder 403 includes a suitable renderer which is configured to render the output into a suitable form, such as a binaural audio signal or a multi-channel loudspeaker signal (such as in a 5.1 or 7.1+4 channel format), which are also examples of spatial audio output 404.
[0067] Further details of encoder 401 are shown in Figure 5. In this example, encoder 401 includes a spatial metadata encoder configured to use the 4sf mode, which has high temporal resolution but low frequency resolution, when the four subframes carry different metadata.
[0068] The spatial metadata encoder is configured to receive spatial metadata 104. The spatial metadata 104 is passed to a subframe analyzer 501, which is configured to analyze the subframes in the spatial metadata 104 to detect whether all four subframes are similar and whether the 1sf coding mode is available.
[0069] In some embodiments, an exemplary similarity test compares the spatial metadata fields element by element. If the difference in values in any field is greater than a given threshold, the spatial metadata is considered different. Otherwise, the metadata is considered similar.
[0070] For example, the following can be performed as a similarity check.
[0071] Check the input directional spatial metadata field (with one or two directions active).
[0072] Check the spatial metadata parameters for each time-frequency tile.
[0073] If the difference in the azimuth parameter is greater than a given threshold, for example, 0.5 degrees, the metadata is considered different.
[0074] If the difference in the elevation parameter is greater than a given threshold, for example, 0.5 degrees, the metadata is considered different.
[0075] If the difference in the direct-to-total energy ratio parameter is greater than a given threshold, for example, 0.1, then the metadata is different.
[0076] If the difference in the spread coherence parameter is greater than a given threshold, for example, 0.1, the metadata is considered different.
[0077] If the difference in surround coherence parameters is greater than a given threshold, for example, 0.1, the metadata is considered different.
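As a non-limiting illustration, the similarity check listed above can be sketched as follows. This is a minimal sketch and not the codec's implementation: the names TileParams, THRESHOLDS, tiles_differ and subframes_similar are hypothetical, the thresholds are the example values given above, and the wrap-around handling of azimuth is an added assumption that the text does not state.

```python
from dataclasses import dataclass


@dataclass
class TileParams:
    """Spatial metadata of one time-frequency tile (hypothetical container)."""
    azimuth: float             # degrees
    elevation: float           # degrees
    energy_ratio: float        # direct-to-total energy ratio, 0..1
    spread_coherence: float    # 0..1
    surround_coherence: float  # 0..1


# Example thresholds taken from the text above.
THRESHOLDS = {
    "azimuth": 0.5,
    "elevation": 0.5,
    "energy_ratio": 0.1,
    "spread_coherence": 0.1,
    "surround_coherence": 0.1,
}


def azimuth_difference(a, b):
    """Absolute azimuth difference in degrees with wrap-around (assumption)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def tiles_differ(a, b):
    """Return True if any field differs by more than its threshold."""
    if azimuth_difference(a.azimuth, b.azimuth) > THRESHOLDS["azimuth"]:
        return True
    for name in ("elevation", "energy_ratio",
                 "spread_coherence", "surround_coherence"):
        if abs(getattr(a, name) - getattr(b, name)) > THRESHOLDS[name]:
            return True
    return False


def subframes_similar(subframes):
    """subframes: list of subframes, each a list of TileParams in a fixed
    band/direction order. Every subframe is compared with the first one."""
    reference = subframes[0]
    return all(
        not tiles_differ(ref_tile, tile)
        for subframe in subframes[1:]
        for ref_tile, tile in zip(reference, subframe)
    )
```

In this sketch, a True result from subframes_similar would correspond to the 1sf coding mode being available, and a False result to the 4sf mode being used.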
[0078] However, any appropriate similarity test can be performed. For example, the direction and the direct-to-total energy ratio can be compared using importance measures such as those presented in UKIPO patent applications 1919130.3 and 1919131.1, i.e., importance measures that compare direction vectors whose length is the direct-to-total energy ratio.
[0079] The analysis results 502 and spatial metadata 104 can be passed to a coherence detector and 2dir analyzer 503 configured to inspect the input and determine the presence of significant coherence metadata. The coherence detector and 2dir analyzer 503 can be further configured to analyze the spatial metadata and determine on a band-by-band basis whether one or two directions should be used.
[0080] The analysis results 504 and spatial metadata 104 can then be passed to a metadata codec constructor 505, which is used to generate configuration information 506 that can be passed to a metadata reducer (metadata encoder) 507.
[0081] The encoder is further configured to receive transport audio signals 102 and pass them to the audio encoder 511 and also to the metadata reducer (metadata encoder) 507.
[0082] The configuration information 506, transport audio signal 102, and spatial metadata 104 can then be passed to the metadata reducer 507. The metadata reducer is configured to reduce the amount of metadata and generate encoded metadata 508.
[0083] Furthermore, the encoder includes an audio and metadata combiner (multiplexer) 513 configured to receive an encoded transport audio signal 512 and encoded metadata 508, and to generate an output bitstream 514 from them.
[0084] In some embodiments, there may be two or more directional fields present within the (MASA) spatial metadata.
[0085] Some (MASA) capture and analysis systems do not have a clear assignment between the source direction (of physical sound) and metadata direction field assignment or ordering. In each parameter TF tile, the capture may, in simplified terms, analyze the direction of a source with the highest energy (in the TF tile) and assign this to a first direction field, then analyze the dominant direction of the remaining sound fields (e.g., the direction of another sound source) and assign this to a second direction field. In other situations, the capture and analysis system may divide the space into non-overlapping regions, analyze the source direction in each region, and assign each region to a dedicated direction field.
[0086] When the relative levels of sound sources change over time (and frequency), the spatial parameters associated with each physical sound source may be distributed across both (or, where present, three or more) direction fields. Exemplary methods for performing such analyses are presented in EP application EP3791605 and UK patent application GB2114186.6.
[0087] For example, consider a room with two sound sources speaking simultaneously from different directions. In practice, the direction field (first or second) that carries the direction of each source may change rapidly over time and frequency.
[0088] Furthermore, while a source can be a physical source, in other words a physical origin of an audio signal such as a speaker or instrument, it should be understood that a source need not be a physical source. It may instead be the result of a capture or capture analysis that assigns or orders some metadata (with a low energy ratio) to a certain direction, and in some situations it may represent a group of sound sources with low energy ratios. This situation may occur, for example, when a capture analysis produces two directions although no significant second physical source exists.
[0089] Furthermore, even if the (MASA) capture and analysis system is capable of assigning physical directions to specific metadata direction fields, this arrangement can be disrupted by data processing. For example, the (IVAS) encoder may reorder the metadata direction fields so that the direction with the higher (or highest) direct-to-total energy ratio is assigned to the first direction field (e.g., the coding function ivas_qmetadata_reorder_2dir_bands() in the IVAS encoder). This is not a problem during the first coding round. However, if the decoder outputs transport audio signals and spatial metadata (as a so-called external output) which are used as input to a second encoder (so-called tandem coding), the original directions in the parameter TF tiles may have been reassigned or shuffled to other direction fields by the first coding round.
[0090] Such shuffled directional data (regardless of where the shuffling occurs) may be suboptimal for encoding algorithms for several reasons. A notable reason is that if the bitrate does not allow for transmitting spatial metadata at full TF resolution, the TF grid is made coarser (to include fewer TF tiles) by combining TF tiles over time and / or frequency as described above. The combining can be done using energy-weighted averaging or some other method. When combining or averaging TF tiles containing spatial metadata from different directions, the metadata is effectively smeared, and any resulting decoded output will have a much lower perceptual quality than could have been achieved at the same bitrate if the directions were aligned.
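As a non-limiting illustration of this smearing, the following minimal sketch combines four subframes of one direction field by energy-weighted averaging of direction vectors. The vector-based averaging and the names to_vector and energy_weighted_direction are assumptions made for illustration only; the text above states merely that energy-weighted averaging or some other method can be used.

```python
import math


def to_vector(azimuth_deg, elevation_deg):
    """Convert a direction in degrees to a unit vector (x, y, z)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def energy_weighted_direction(tiles):
    """tiles: list of (azimuth_deg, elevation_deg, energy).
    Returns the (azimuth_deg, elevation_deg) of the energy-weighted
    sum of the direction vectors."""
    x = y = z = 0.0
    for azimuth, elevation, energy in tiles:
        vx, vy, vz = to_vector(azimuth, elevation)
        x += energy * vx
        y += energy * vy
        z += energy * vz
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation


# Direction field 1 over four subframes. Source 1 sits near +30 degrees
# azimuth and source 2 near -60 degrees (illustrative values).
aligned = [(30.0, 0.0, 1.0), (31.0, 0.0, 1.0), (29.0, 0.0, 1.0), (30.0, 0.0, 1.0)]
# Same field, but the third subframe carries source 2 after shuffling.
shuffled = [(30.0, 0.0, 1.0), (31.0, 0.0, 1.0), (-60.0, 0.0, 1.0), (30.0, 0.0, 1.0)]

aligned_azimuth, _ = energy_weighted_direction(aligned)
shuffled_azimuth, _ = energy_weighted_direction(shuffled)
```

With aligned input, the combined azimuth stays at the position of source 1. With the shuffled input, the combined azimuth is pulled toward source 2 and no longer corresponds to either physical source.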
[0091] A further coding-related drawback is that some metadata encoding systems use differential encoding to further reduce the bitrate of the encoded metadata. In such situations, the first value is encoded as is, while subsequent values are encoded based on the difference from the previous value. When the changes between values are small or slow, this allows for a highly efficient encoding scheme because it changes the distribution of the data to be encoded. However, differential encoding is likely to perform poorly when the spatial metadata changes significantly, as is the case when shuffling makes consecutive values of a field relate to different elements (e.g., different physical directions).
[0092] For example, a (MASA) audio scene can have two direct sound sources, each of which remains at roughly the same spatial location.
[0093] An example scene can be analyzed and determined to have two direction fields, with each physical direction assigned to one metadata direction field. In Figures 6 and 7 below, solid lines or solid boxes correspond to the first direct source, and dashed lines or dashed boxes correspond to the second direct sound source.
[0094] For example, Figure 6 shows a series of metadata frames, namely frame #1 600, frame #2 602, and frame #3 604, together with the parameters 610 for direction 1 and the parameters 620 for direction 2. The directional parameters are shown as $(\theta, \phi, \gamma)_{srcIdx,\,sfIdx}$, where θ is the azimuth angle of the source, φ is the elevation angle of the source, γ is the direct-to-total energy ratio of the source, srcIdx is an index indicating the physical source (corresponding to the visual representation by solid / dashed lines), and sfIdx is the parameter subframe index within the parameter frame. In the following, it is assumed that each source has a nearly constant physical location (or at least moves slowly). This assumption can be expressed as $(\theta, \phi)_{1,\,sfIdx} \approx (\theta, \phi)_{1,\,sfIdx+1} \neq (\theta, \phi)_{2,\,sfIdx} \approx (\theta, \phi)_{2,\,sfIdx+1}$.
[0095] Furthermore, Figure 7 shows an example in which, following a shuffle operation on frames or subframes, the data in the spatial metadata direction fields is ordered such that at least one subframe is swapped between the two directions. Thus, for example, Figure 7 shows a series of metadata frames, namely frame #1 700, frame #2 702, and frame #3 704. These metadata frames differ from those shown in Figure 6 in that some of the subframes of parameters for the first source or direction 610, i.e., direction 1, are located in the parameter field 720 for direction 2, and some of the subframes of parameters for the second source or direction 620, i.e., direction 2, are located in the parameter field 710 for direction 1.
[0096] Therefore, for example, with respect to frame #1 700, the parameter 710 for direction 1 comprises the first subframe 701 of the first direction, the second subframe 703 of the first direction, the third subframe 721 of the second direction, and the fourth subframe 705 of the first direction. The parameter 720 for direction 2 comprises the first subframe 751 of the second direction, the second subframe 753 of the second direction, the third subframe 723 of the first direction, and the fourth subframe 755 of the second direction.
With respect to frame #2 702, the parameter 710 for direction 1 comprises the fifth subframe 761 of the second direction, the sixth subframe 731 of the first direction, the seventh subframe 763 of the second direction, and the eighth subframe 765 of the second direction. The parameter 720 for direction 2 comprises the fifth subframe 771 of the first direction, the sixth subframe 733 of the second direction, the seventh subframe 773 of the first direction, and the eighth subframe 775 of the first direction.
With respect to frame #3 704, the parameter 710 for direction 1 comprises the ninth subframe 741 of the first direction, the 10th subframe 781 of the second direction, the 11th subframe 783 of the second direction, and the 12th subframe 785 of the second direction. The parameter 720 for direction 2 comprises the ninth subframe 743 of the second direction, the 10th subframe 791 of the first direction, the 11th subframe 793 of the first direction, and the 12th subframe 795 of the first direction.
[0097] The reordering or shuffling can be caused, for example, by an encoding system that assigns the direction with the larger direct-to-total energy ratio γ to direction 1. Furthermore, as mentioned above, some capture algorithms may not assign directions to direction fields based on physical directions, and the metadata they generate may directly resemble that shown in Figure 7.
[0098] For example, when exemplary (MASA) metadata with such direction-shuffled subframes is input to the encoder at a relatively limited bitrate, the encoder may use the low time-resolution coding mode (1sf mode), which combines each group of four consecutive subframes. The result, shown in Figure 8, is a heavily obscured representation of the direction parameters.
[0099] For example, Figure 8 shows the 1sf-mode combination in which the parameter 810 for direction 1 of metadata frame #1 800 is a function f(.) of the first subframe 701 of the first direction, the second subframe 703 of the first direction, the third subframe 721 of the second direction, and the fourth subframe 705 of the first direction. Similarly, the parameter 820 for direction 2 of metadata frame #1 800 is a function of the first subframe 751 of the second direction, the second subframe 753 of the second direction, the third subframe 723 of the first direction, and the fourth subframe 755 of the second direction.
[0100] The parameter 810 for direction 1 of metadata frame #2 802 is a function of the fifth subframe 761 of the second direction, the sixth subframe 731 of the first direction, the seventh subframe 763 of the second direction, and the eighth subframe 765 of the second direction. Similarly, the parameter 820 for direction 2 of metadata frame #2 802 is a function of the fifth subframe 771 of the first direction, the sixth subframe 733 of the second direction, the seventh subframe 773 of the first direction, and the eighth subframe 775 of the first direction.
[0101] Furthermore, the parameter 810 for direction 1 of metadata frame #3 804 is a function of the ninth subframe 741 of the first direction, the 10th subframe 781 of the second direction, the 11th subframe 783 of the second direction, and the 12th subframe 785 of the second direction. Similarly, the parameter 820 for direction 2 of metadata frame #3 804 is a function of the ninth subframe 743 of the second direction, the 10th subframe 791 of the first direction, the 11th subframe 793 of the first direction, and the 12th subframe 795 of the first direction. The function f(.) can be any suitable combining function.
[0102] As can be seen, this generates a 1sf frame, where the obtained parameters for direction 1 and direction 2 are obscured because the obtained parameters for direction 1 encompass a portion of the original values for direction 2 and vice versa.
[0103] This example illustrates the problem that the embodiments discussed herein attempt to overcome when applied along the time axis of the data. Similar problems may arise for spatial parameters that are aggregated across the frequency axis, to which some embodiments may be applied in a manner similar to that discussed in the following example. This is because, when spatial metadata is encoded at a limited bitrate, the number of frequency bands is typically reduced (from 24 to 5 at the coarsest resolution in the case of MASA). When the bands to be combined in encoding correspond to clearly different spatial directions, the operation of combining them may distort the resulting spatial representation.
[0104] The concept, which will be discussed in more detail with respect to the following embodiments and examples, relates to the encoding of parametric spatial audio (i.e., audio signals and spatial metadata) in which spatial metadata is coded into frames and subframes that encompass two (or more) directional fields.
[0105] In these embodiments, the apparatus and method are configured to pre-process spatial metadata with respect to direction field assignment (ordering), thereby allowing any subsequent coding operation to maintain better direction accuracy than without pre-processing.
[0106] In some embodiments, this can be implemented by: obtaining two consecutive (sub)frames of spatial metadata; comparing the direction field values; and determining the ordering of the metadata direction fields such that the difference between the two consecutive (sub)frames is minimized.
[0107] In some embodiments, this may be additionally or otherwise carried out by obtaining spatial metadata from two spatial metadata direction fields and two adjacent bands, comparing the values of the direction fields, and determining the ordering of the metadata direction fields such that the difference between the two adjacent bands is minimized.
[0108] As discussed in the following embodiments, a processing step is provided that attempts to align (or harmonize) the (MASA) metadata direction fields such that the sum of direction differences between adjacent parameter tiles is minimized. Harmonization can be carried out in several embodiments over the time dimension as described below (and can thus be useful when combining multiple subframes of data, such as when operating in the low time-resolution 1sf coding mode).
[0109] Harmonization may be further performed over the frequency dimension in some embodiments (useful when reducing frequency resolution). Furthermore, in some embodiments, harmonization may be performed together over both the time and frequency dimensions depending on the TF tile grouping used in the encoding.
[0110] In some embodiments, aligning the spatial metadata orientation in grouped TF tiles attempts to offer the advantage that any averaging causes less distortion of the underlying data than when the orientation of the underlying data is shuffled. Furthermore, aligning the spatial metadata orientation across groups (or generally across TF regions) may have the advantage that data encoding can be more efficient due to the reduction of any variation within the data.
[0111] With respect to Figure 9, an exemplary encoder is shown that is based on the encoder shown in Figure 5, but includes metadata orientation alignment processing of the input spatial metadata, and as a result provides aligned spatial metadata for further operation.
[0112] In some embodiments, the encoder 991 includes an audio encoder 511 configured to encode an audio signal and generate an encoded transport audio signal 512.
[0113] In this example, the encoder 991 is configured to receive spatial metadata 104. The spatial metadata 104 is passed to the metadata direction aligner 901, which generates aligned spatial metadata 904.
[0114] The encoder 991 further comprises a subframe analyzer 501 configured to analyze subframes in the aligned spatial metadata 904 to detect whether all four subframes are similar and whether the 1sf coding mode is available.
[0115] The analysis results 502 and the aligned spatial metadata 904 can be passed to a coherence detector and 2dir analyzer 503 configured to inspect the input and determine the presence of significant coherence metadata. The coherence detector and 2dir analyzer 503 can be further configured to analyze the spatial metadata and determine on a band-by-band basis whether one or two directions should be used.
[0116] The analysis results 504 and the aligned spatial metadata 904 can then be passed to a metadata codec constructor 505, which is used to generate configuration information 506 that can be passed to a metadata reducer (metadata encoder) 507.
[0117] The encoder is further configured to receive transport audio signals 102 and pass them to the audio encoder 511 and also to the metadata reducer (metadata encoder) 507.
[0118] The configuration information 506, the transport audio signal 102, and the aligned spatial metadata 904 can then be passed to the metadata reducer 507. The metadata reducer is configured to reduce the amount of metadata and generate encoded metadata 508.
[0119] Furthermore, the encoder includes an audio and metadata combiner (multiplexer) 513 configured to receive an encoded transport audio signal 512 and encoded metadata 508, and to generate an output bitstream 514 from them.
[0120] Furthermore, Figure 10 shows a further exemplary encoder 1091 modified from the encoder shown in Figure 5. In the example shown in Figure 10, the metadata direction aligner may be located adjacent to the metadata reducer or metadata encoder, or within the coding chain or path inside the metadata reducer or metadata encoder. In this configuration, the metadata direction aligner receives the metadata coding configuration as additional information and uses it to determine the axis on which to operate (the time axis across subframes and frames, the frequency axis across parameter bands, or both). The aligner is configured to perform metadata direction field harmonization along that axis.
[0121] In this example, encoder 1091 is configured to receive spatial metadata 104. The spatial metadata 104 is passed to a subframe analyzer 501 configured to analyze the subframes in the spatial metadata 104 to detect whether all four subframes are similar and whether the 1sf coding mode is available.
[0122] The analysis results 502 and spatial metadata 104 can be passed to a coherence detector and 2dir analyzer 503 configured to inspect the input and determine the presence of significant coherence metadata. The coherence detector and 2dir analyzer 503 can be further configured to analyze the spatial metadata and determine on a band-by-band basis whether one or two directions should be used.
[0123] The analysis results 504 and spatial metadata 104 can then be passed to a metadata codec constructor 505, which is used to generate configuration information 506 that can be passed to a metadata reducer (metadata encoder) 507 and a metadata direction aligner 1001.
[0124] The metadata direction aligner 1001 is configured to receive the configuration 506 and spatial metadata 104 and generate aligned spatial metadata 1004 from them.
[0125] The encoder is further configured to receive transport audio signals 102 and pass them to the audio encoder 511 and also to the metadata reducer (metadata encoder) 507.
[0126] In some embodiments, the encoder 1091 includes an audio encoder 511 configured to encode an audio signal and generate an encoded transport audio signal 512.
[0127] The configuration information 506, the transport audio signal 102, and the aligned spatial metadata 1004 can then be passed to the metadata reducer 507. The metadata reducer is configured to reduce the amount of metadata and generate encoded metadata 508.
[0128] Furthermore, the encoder includes an audio and metadata combiner (multiplexer) 513 configured to receive an encoded transport audio signal 512 and encoded metadata 508, and to generate an output bitstream 514 from them.
[0129] Regarding Figure 11, an illustrative flowchart showing the operation of the encoder shown in Figure 9 is provided.
[0130] Therefore, the operation of acquiring the transport audio signal is indicated by 1101.
[0131] The encoding of the transport audio signal is shown by 1102.
[0132] Furthermore, the operation to retrieve spatial metadata is demonstrated by 1103.
[0133] Next, the operation of aligning spatial metadata is shown by 1105.
[0134] The operation of analyzing subframes following the alignment of spatial metadata is demonstrated by 1107.
[0135] Next, the operation of analyzing the coherence and the use of a second direction (2dir) is shown by 1109.
[0136] The metadata codec configuration is then shown by 1111.
[0137] Next, the encoding / reduction of metadata based on the configuration is shown by 1113.
[0138] As shown in 1115, the encoded audio and metadata can then be combined to generate a bitstream.
[0139] Next, the output of the bitstream is shown by 1117.
[0140] Regarding Figure 12, an exemplary flowchart illustrating the operation of the encoder shown in Figure 10 is provided.
[0141] Therefore, the operation of acquiring the transport audio signal is indicated by 1201.
[0142] The encoding of the transport audio signal is indicated by 1202.
[0143] Furthermore, the operation to retrieve spatial metadata is demonstrated by 1203.
[0144] Next, the operation to analyze the subframe is shown by 1205.
[0145] Next, the operation of analyzing the coherence and the use of a second direction (2dir) is shown by 1207.
[0146] The metadata codec configuration is then shown by 1209.
[0147] Next, the operation of aligning spatial metadata is shown by 1211.
[0148] Next, the encoding / reduction of metadata based on the configuration is shown by 1213.
[0149] As shown in 1215, the encoded audio and metadata can then be combined to generate a bitstream.
[0150] Next, the output of the bitstream is indicated by 1217.
[0151] The operation of the metadata direction aligners 901 and 1001 can be carried out in several embodiments in the following manner.
0. Obtain or receive a given spatial metadata subframe having directional information $(\theta, \phi, \gamma)_{1,\,sfIdx}$ in direction field 1 and directional information $(\theta, \phi, \gamma)_{2,\,sfIdx}$ in direction field 2. This is the initial setting, and the output metadata subframe is the direction field [equation not reproduced in this text].
[0152] to [0156] [Equations and the remaining steps of this procedure are not reproduced in this text.]
[0157] In this example, the measurement is a "difference" measurement, but in some embodiments, a "similarity" measurement may be used instead. In these embodiments, the smaller-than comparison shown in step 3 above is replaced with a greater-than comparison.
[0158] Furthermore, in some embodiments, a smaller-than-or-equal-to comparison may be used instead of a smaller-than comparison (or, similarly, a greater-than-or-equal-to comparison may be used instead of a greater-than comparison for similarity measurements).
[0159] In some embodiments, the difference measurement D(·) is, for example, the angular distance (sensitive only to direction),
$$D\big((\theta_1, \phi_1, \gamma_1), (\theta_2, \phi_2, \gamma_2)\big) = \cos^{-1}\big(\cos(\phi_1)\cos(\phi_2)\cos(\theta_1 - \theta_2) + \sin(\phi_1)\sin(\phi_2)\big),$$
or the Cartesian distance (which also considers radius or distance).
[0160] [Equation not reproduced in this text.]
[0161] The Cartesian distance can be preferred in some situations because it is closer to what may occur in some embodiments of parameter aggregation in subframe grouping. In such embodiments, the parameters used are the transport audio signal energy E and the direct-to-total energy ratio.
[0162] [Equations not reproduced in this text.]
[0163] The above examples illustrate possible measurements, and other difference measurements may be performed in some embodiments. Furthermore, it should be understood that difference measurements can also be known as distance measures (for example, the values of the difference measurements above are determined based on a function of distance).
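As a non-limiting illustration, the two-direction harmonization over time can be sketched as follows. Because the equations of the procedure are not reproduced in this text, the sketch is an assumption based on the outline given earlier: for each new subframe, the summed difference to the previously output subframe is computed for the unchanged ordering and for the swapped ordering, and the ordering with the smaller difference is kept. The names angular_distance and align_two_directions are hypothetical, and the angular distance is used as the difference measurement D(·).

```python
import math


def angular_distance(a, b):
    """Angular distance in radians between two directions.
    a, b: (azimuth_deg, elevation_deg, energy_ratio); the ratio is ignored."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    c = (math.cos(el1) * math.cos(el2) * math.cos(az1 - az2)
         + math.sin(el1) * math.sin(el2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, c)))


def align_two_directions(subframes, distance=angular_distance):
    """subframes: list of (dir1, dir2) pairs for one band over time.
    Returns a new list in which each subframe's ordering is chosen to
    minimise the summed difference to the previous output subframe."""
    if not subframes:
        return []
    out = [subframes[0]]  # the first subframe initialises the output ordering
    for dir1, dir2 in subframes[1:]:
        prev1, prev2 = out[-1]
        keep = distance(prev1, dir1) + distance(prev2, dir2)
        swap = distance(prev1, dir2) + distance(prev2, dir1)
        # Assumed comparison: swap only if swapping is strictly smaller.
        out.append((dir2, dir1) if swap < keep else (dir1, dir2))
    return out
```

A similarity measurement could be substituted for the difference measurement by passing another function as distance and reversing the comparison, as noted above.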
[0164] The above examples and embodiments show two direction fields per subframe / frame. In some embodiments, this can be extended to a larger number of simultaneous direction fields. In such embodiments, instead of determining a difference / similarity measurement for two candidate direction fields, the difference / similarity measurement is evaluated for all available candidate orderings of the N directions. For example, the following actions may be performed:
0. Initialize N direction fields $(\theta, \phi, \gamma)_{dirIdx,\,sfIdx}$, where dirIdx ∈ [1, N].
1. Retrieve the following spatial metadata subframe having direction information $(\theta, \phi, \gamma)_{dirIdx,\,sfIdx+1}$, where dirIdx ∈ [1, N].
2. Generate the candidate orderings, for example by enumerating all possible combinations of the N directions.
3. Difference measurement [equations not reproduced in this text].
[0165] to [0166] [Equations not reproduced in this text.]
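As a non-limiting illustration, the extension to N direction fields can be sketched by enumerating all orderings of the next subframe and keeping the one with the smallest summed difference to the previous output subframe. The equations of this procedure are not reproduced in this text, so the summed angular distance used as the selection criterion is an assumption, and the function names are hypothetical.

```python
import math
from itertools import permutations


def angular_distance(a, b):
    """Angular distance in radians; a, b: (azimuth_deg, elevation_deg, energy_ratio)."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    c = (math.cos(el1) * math.cos(el2) * math.cos(az1 - az2)
         + math.sin(el1) * math.sin(el2))
    return math.acos(max(-1.0, min(1.0, c)))


def align_n_directions(subframes, distance=angular_distance):
    """subframes: list of subframes, each a list of N direction tuples.
    Returns a new list with each subframe reordered to best match the
    previously output subframe."""
    if not subframes:
        return []
    out = [list(subframes[0])]  # the first subframe initialises the ordering
    for current in subframes[1:]:
        previous = out[-1]
        # permutations() yields the unchanged ordering first, and min()
        # returns the first minimum, so ties keep the input ordering.
        best = min(
            permutations(current),
            key=lambda order: sum(distance(p, c) for p, c in zip(previous, order)),
        )
        out.append(list(best))
    return out
```

Exhaustive enumeration evaluates N! orderings per subframe, which is inexpensive for the small N considered here but grows quickly for larger N.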
[0167] The above examples focus on MASA spatial metadata having two direction fields. A real-world capture system, however, can switch between analyzing one direction and analyzing two directions, which results in codec inputs that are inconsistent in the number of directions. For example, a capture system may be configured to capture two directions, but due to the spatial signal characteristics, only one candidate direction may be found. The system may then output only one direction for a frame, or set the second direction to zero (i.e., set the direct-to-total energy ratio for that direction to zero). The latter can even occur for individual TF tiles.
[0168] In some embodiments, energy-based averaging can adequately handle such data (zero-energy components, for example, do not bias the averaged directional data). In other embodiments, implementations may maintain consistency of (track) the directional data across subframes and frames based purely on the directional values themselves, without energy weighting. In such embodiments, as part of metadata direction alignment, zero-energy directions should be reset based on extrapolation (e.g., duplication of the previous directional data) or interpolation (e.g., averaging between the previous and next directional data). This provides consistent 2dir MASA metadata even when the original input data switches between 1dir and 2dir. This step may also be important in tandem coding operations, where the previous quantization of zero-energy directions may have resulted in low-energy directions after decoding.
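As a non-limiting illustration, resetting zero-energy directions by extrapolation (duplication of the previous directional data) can be sketched as follows. The function name, the eps threshold for treating a ratio as zero, and the choice to leave leading zero-energy entries untouched are assumptions made for this sketch. Interpolation between the previous and next valid directions would be an alternative.

```python
def fill_zero_energy_directions(track, eps=1e-6):
    """track: list of (azimuth_deg, elevation_deg, energy_ratio) tuples for
    one direction field over consecutive subframes.
    Entries whose energy ratio is (near) zero take the angles of the most
    recent valid entry; their energy ratio is left unchanged."""
    out = []
    last_valid = None  # (azimuth, elevation) of the latest non-zero entry
    for azimuth, elevation, ratio in track:
        if ratio > eps:
            last_valid = (azimuth, elevation)
            out.append((azimuth, elevation, ratio))
        elif last_valid is not None:
            out.append((last_valid[0], last_valid[1], ratio))
        else:
            # No earlier valid direction exists yet; keep the entry as is.
            out.append((azimuth, elevation, ratio))
    return out
```

Because only the angles are replaced and the energy ratio stays at zero, the reset entries would not bias any later energy-weighted combination, while an alignment that compares direction values alone sees a continuous track.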
[0169] As described above, this example can be extended to three or more directions. (Focusing on two direction fields, the transition between 1 and 2 is determined by the IVAS MASA format specification, which is part of the IVAS design constraints (Tdoc S4-221619), and the implementation of the corresponding IVAS codec is anticipated for IVAS candidate submissions.)
[0170] In the embodiments described above, the directions are harmonized across the time dimension (across subframes and frames). This is useful when the subsequent operations benefit from a temporally consistent direction field. When encoding combines several frequency bands into fewer frequency bands (in the case of MASA, the highest resolution is 24 frequency bands and the lowest resolution is 5 bands), it is more beneficial to harmonize the spatial metadata across the bands that are grouped together in the subsequent processing. Harmonization may be performed in a manner similar to the embodiments described above for harmonizing data over time, but using (θ, φ, γ)_{dirIdx, bandIdx} instead of (θ, φ, γ)_{dirIdx, sfIdx}, where bandIdx is the frequency band index. Harmonization can be performed across all 24 bands of spatial metadata (in the case of MASA) or within each subset of bands that are grouped together at a lower frequency resolution.
[0171] In some embodiments, it is beneficial not to use a fixed frequency band to determine the alignment starting point. Instead, embodiments may select the frequency band with the highest energy (determined, for example, from a transport audio signal) and then apply the alignment method starting from this band toward higher and lower frequencies.
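As an illustration of starting the alignment from the highest-energy band, the following Python sketch handles two direction fields per band. The great-circle `angular_distance` is an assumed difference measurement, `order_like` and `align_from_peak_band` are hypothetical names, and the greater-than-or-equal swap rule follows the wording of the claims.

```python
import math


def angular_distance(a, b):
    """Great-circle angle in radians between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def order_like(ref_pair, cand_pair):
    """Order cand_pair so that it best continues the already aligned ref_pair."""
    keep = (angular_distance(ref_pair[0], cand_pair[0])
            + angular_distance(ref_pair[1], cand_pair[1]))
    swap = (angular_distance(ref_pair[0], cand_pair[1])
            + angular_distance(ref_pair[1], cand_pair[0]))
    if keep >= swap:
        return [cand_pair[1], cand_pair[0]]
    return [cand_pair[0], cand_pair[1]]


def align_from_peak_band(band_dirs, band_energy):
    """Align two direction fields across bands, starting at the highest-energy band."""
    aligned = [list(pair) for pair in band_dirs]
    start = max(range(len(band_energy)), key=lambda b: band_energy[b])
    # Walk from the starting band toward higher frequencies.
    for band in range(start + 1, len(aligned)):
        aligned[band] = order_like(aligned[band - 1], aligned[band])
    # Walk from the starting band toward lower frequencies.
    for band in range(start - 1, -1, -1):
        aligned[band] = order_like(aligned[band + 1], aligned[band])
    return aligned


band_dirs = [
    [(100.0, 0.0), (20.0, 0.0)],
    [(22.0, 0.0), (98.0, 0.0)],
    [(25.0, 0.0), (95.0, 0.0)],
    [(93.0, 0.0), (27.0, 0.0)],
]
band_energy = [0.1, 0.2, 1.0, 0.3]  # e.g. derived from the transport audio signal
aligned = align_from_peak_band(band_dirs, band_energy)
```

Here band 2 carries the most energy, so its ordering is kept as the reference and bands 0 and 3 are swapped to match it.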
[0172] The presented embodiment determines the direction field ordering step by step in each subframe. This can be extended to consider several consecutive subframes simultaneously and determine the direction field ordering for several consecutive subframes simultaneously.
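One possible reading of the joint determination is an exhaustive search over keep / swap decisions for K consecutive subframes, as in the Python sketch below. This is an assumption about how such an extension might be realized, not the described method: `joint_ordering` is a hypothetical name, and the great-circle distance again stands in for the difference measurement.

```python
import itertools
import math


def angular_distance(a, b):
    """Great-circle angle in radians between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def joint_ordering(ref_pair, subframes):
    """Pick keep/swap flags for all subframes at once, minimising the summed difference.

    ref_pair: already aligned (dir1, dir2) preceding the block of subframes.
    subframes: list of (dir1, dir2) pairs in their received order.
    Returns a tuple of booleans, True meaning the pair is swapped.
    """
    best_flags, best_cost = None, math.inf
    for flags in itertools.product((False, True), repeat=len(subframes)):
        prev, cost = ref_pair, 0.0
        for swapped, pair in zip(flags, subframes):
            cur = (pair[1], pair[0]) if swapped else (pair[0], pair[1])
            cost += angular_distance(prev[0], cur[0]) + angular_distance(prev[1], cur[1])
            prev = cur
        if cost < best_cost:
            best_flags, best_cost = flags, cost
    return best_flags


ref_pair = ((0.0, 0.0), (90.0, 0.0))
subframes = [
    ((88.0, 0.0), (2.0, 0.0)),
    ((4.0, 0.0), (86.0, 0.0)),
    ((84.0, 0.0), (6.0, 0.0)),
]
flags = joint_ordering(ref_pair, subframes)
```

The search cost doubles with every added subframe, so a block of a few subframes (for example, one frame) is the realistic scope for this kind of sketch.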
[0173] Furthermore, embodiments such as those shown in Figure 9 present the method as a pre-processing step, typically applied before the metadata is encoded and transmitted. However, the method can also be applied as a post-processing step after the metadata has been decoded from a bitstream, if the metadata is intended to be output as part of the MASA format output from the codec. This ensures, just as the pre-processing does, that any subsequent codec or renderer receives the MASA format in an optimal form. In general, it is beneficial to apply the presented methods to the metadata at least once in any chain of operations that uses the MASA format.
[0174] The presented embodiment determines alignment using directional information (azimuth and elevation angles) in spatial metadata. This is merely one possibility, and other embodiments may (similarly) consider other spatial metadata fields, such as spread coherence, when determining a total difference measure for ordering candidates in a directional field.
[0175] Furthermore, the above example considers the three-dimensional directional representation used in MASA's spatial metadata. This is based on the azimuth angle (angle to the left or right on the horizontal plane) and the elevation angle (angle from the horizontal plane) of the direction in a spherical coordinate system. This should be considered only as an illustrative embodiment. All operations can be performed using other directional parameterizations, such as azimuth and polar angle (angle from the vertical axis), or, in the case of two-dimensional directions, using only the azimuth or only the elevation angle.
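The equivalence of the two parameterizations can be checked with a short Python sketch. It assumes the common convention in which the polar angle is measured from the upward vertical axis, so that polar = 90° − elevation; the axis convention (x to the front, y to the left, z up) and all function names are likewise assumptions made for illustration.

```python
import math


def elevation_to_polar(elevation_deg):
    """Polar angle from the vertical axis for a given elevation (degrees)."""
    return 90.0 - elevation_deg


def polar_to_elevation(polar_deg):
    """Elevation from the horizontal plane for a given polar angle (degrees)."""
    return 90.0 - polar_deg


def unit_vector_from_elevation(az_deg, el_deg):
    """Unit vector for an (azimuth, elevation) direction."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def unit_vector_from_polar(az_deg, polar_deg):
    """Unit vector for an (azimuth, polar angle) direction."""
    az, pol = math.radians(az_deg), math.radians(polar_deg)
    return (math.sin(pol) * math.cos(az), math.sin(pol) * math.sin(az), math.cos(pol))


# Both parameterizations describe the same point on the unit sphere.
v_el = unit_vector_from_elevation(45.0, 30.0)
v_pol = unit_vector_from_polar(45.0, elevation_to_polar(30.0))
```

Since both forms map to the same unit vector, a difference measurement defined on unit vectors (or on the angle between them) gives the same ordering decision regardless of which parameterization the metadata uses.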
[0176] Furthermore, the encoders shown in Figures 9 and 10 illustrate two possible locations for processing within the encoder. Processing (alignment) can also be applied at other locations in the processing chain, or even at multiple locations simultaneously. For example, one instance of processing can be located near the input of the metadata encoder, as in Figure 9, and operate along the time axis. In addition, a second instance of the present invention can exist, operating along the frequency axis near the metadata encoder, as in Figure 10. Other configurations are also possible.
[0177] Figure 13 shows an exemplary electronic device that may be used as any part of the apparatus of the system described above. The device may be any suitable electronic device or apparatus. For example, in some embodiments, device 2200 is a mobile device, user equipment, tablet computer, computer, audio playback device, etc. The device may be configured to implement, for example, an encoder and / or decoder, or any of the functional blocks described above.
[0178] In some embodiments, the device 2200 comprises at least one processor or central processing unit 2207. The processor 2207 may be configured to execute various program codes, such as those described herein.
[0179] In some embodiments, the device 2200 comprises at least one memory 2211. In some embodiments, at least one processor 2207 is coupled to the memory 2211. The memory 2211 can be any suitable storage means. In some embodiments, the memory 2211 comprises a program code section for storing program code that can be executed on the processor 2207. Furthermore, in some embodiments, the memory 2211 may further comprise a stored data section for storing data such as data that has been processed or will be processed according to the embodiments described herein, for example. The executed program code stored in the program code section and the data stored in the stored data section can be retrieved by the processor 2207 whenever needed via the memory-processor coupling.
[0180] In some embodiments, device 2200 includes a user interface 2205. The user interface 2205 may be coupled to a processor 2207 in some embodiments. In some embodiments, the processor 2207 can control the operation of the user interface 2205 and receive input from it. In some embodiments, the user interface 2205 allows a user to input commands to device 2200, for example, via a keypad. In some embodiments, the user interface 2205 allows a user to retrieve information from device 2200. For example, the user interface 2205 may include a display configured to show information from device 2200 to the user. In some embodiments, the user interface 2205 includes a touchscreen or touch interface capable of both allowing information to be entered into device 2200 and further displaying the information to the user of device 2200. In some embodiments, the user interface 2205 may be a user interface for communication.
[0181] In some embodiments, device 2200 includes an input / output port 2209. In some embodiments, the input / output port 2209 includes a transceiver. In such embodiments, the transceiver may be coupled to the processor 2207 and configured to enable communication with other apparatus or electronic devices, for example, via a wireless communication network. The transceiver, or any suitable transceiver or transmitter and / or receiver means, may, in some embodiments, be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
[0182] The transceiver can communicate with further devices by any suitable known communication protocol. For example, in some embodiments, the transceiver can use a suitable radio access architecture based on Long-Term Evolution Advanced (LTE Advanced, LTE-A) or New Radio (NR) (sometimes called 5G), Universal Mobile Telecommunications System (UMTS) radio access network (UTRAN or E-UTRAN), Long-Term Evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), Worldwide Interoperability for Microwave Access (WiMAX), Bluetooth®, Personal Communications Services (PCS), ZigBee®, Wideband Code Division Multiple Access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad hoc networks (MANETs), cellular Internet of Things (IoT) RAN and Internet Protocol Multimedia Subsystems (IMS), any other suitable option and / or any combination thereof.
[0183] The transceiver input / output port 2209 can be configured to receive signals.
[0184] In some embodiments, device 2200 may be used as at least part of a composite device. The input / output port 2209 may be coupled to headphones or similar devices (which may be head-tracking or non-head-tracking headphones) and a loudspeaker.
[0185] In general, various embodiments of the present invention may be implemented in hardware, special-purpose circuits, software, logic, or any combination thereof. For example, some embodiments may be implemented in hardware, while others may be implemented in firmware or software that can be executed by a controller, microprocessor, or other computing device, but the present invention is not limited thereto. Various embodiments of the present invention may be shown and described using block diagrams, flowcharts, or other graphic representations, but it should be understood that these blocks, devices, systems, techniques, or methods described herein may, in non-limiting examples, be implemented in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers, or other computing devices, or a combination thereof.
[0186] Embodiments of the present invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. In this regard, it should be noted that any block of the logic flow as shown in the figures may represent a program step, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or a memory block implemented within the processor, on a magnetic medium such as a hard disk or floppy disk, or on an optical medium such as a DVD and its data variants, or a CD.
[0187] Memory may be of any type suitable for the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. Data processors may be of any type suitable for the local technical environment and may include, in non-limiting examples, one or more of general-purpose computers, special-purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), gate-level circuits, and processors based on multi-core processor architectures.
[0188] Embodiments of the present invention can be implemented in various components, such as integrated circuit modules. The design of integrated circuits is generally a highly automated process. Complex and powerful software tools are available to translate logic-level designs into semiconductor circuit designs ready to be etched and formed on a semiconductor substrate.
[0189] For example, programs such as those offered by Synopsys, Inc. in Mountain View, California, and Cadence Design in San Jose, California, automatically route conductors and position components on a semiconductor chip using well-established design rules and a library of pre-stored design modules. Once the design for the semiconductor circuit is complete, the resulting design in a standardized electronic format (e.g., Opus or GDSII) can be sent to a semiconductor manufacturing facility or "fab" for production.
[0190] As used in this application, the term "circuit mechanism" may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations using only analog and / or digital circuit mechanisms); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and / or digital hardware circuits with software / firmware, and (ii) any portions of hardware processors with software (including digital signal processors), software, and memory that work together to cause a device, such as a mobile phone or server, to perform various functions; and (c) hardware circuits and / or processors, such as microprocessors or portions of microprocessors, that require software (e.g., firmware) for operation, although the software may not be present when it is not needed for operation.
[0191] This definition of circuit mechanism applies to all uses of the term in this application, including in any claim. As a further example, as used in this application, the term circuit mechanism also covers an implementation of merely a hardware circuit or processor (or multiple processors), or a portion of a hardware circuit or processor, together with its accompanying software and / or firmware. The term circuit mechanism also covers, for example and where applicable to a particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device, or a similar integrated circuit in a server, a cellular network device, or another computing or networking device.
[0192] As used herein, the term “non-transitory” refers to a limitation of the medium itself (i.e., that it is tangible and not a signal) and not to a limitation on the persistence of data storage (e.g., RAM vs. ROM).
[0193] As used herein, the terms “at least one of the following <list of two or more elements>” and “at least one of the <list of two or more elements>” and similar expressions mean at least any one of the elements, or at least any two or more of the elements, or at least all of the elements, when the lists of two or more elements are joined by “and” or “or”.
[0194] The foregoing description, by means of non-limiting examples, provides a full and informative description of exemplary embodiments of the present invention. However, when read in conjunction with the accompanying drawings and claims, various modifications and adaptations may become apparent to those skilled in the art in consideration of the foregoing description. Nevertheless, all such and similar modifications of the teachings of the present invention remain within the scope of the invention as defined in the accompanying claims.
Claims
1. An apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining ordered directional metadata parameters for at least two sound sources in an audio scene, wherein the ordered directional metadata parameters are associated with the at least two sound sources, identify the direction of arrival for the at least two sound sources, and are arranged in a plurality of frames arranged as a grid of time-frequency tiles on a time axis and a frequency axis, each tile comprising at least two sets of at least one directional metadata parameter, one set for each of the at least two sound sources, and each set being associated with an order index that defines the order between the at least two sets in the tile, thereby providing the ordered directional metadata parameters; and, for a first time-frequency tile and an adjacent second time-frequency tile, wherein the first tile has a first set of at least one directional metadata parameter having a first order index and a second set of at least one directional metadata parameter having a second order index, and the second tile has a third set of at least one directional metadata parameter having the first order index and a fourth set of at least one directional metadata parameter having the second order index: determining a first difference measurement by determining the difference between the first set and the third set and the difference between the second set and the fourth set; determining a second difference measurement by determining the difference between the first set and the fourth set and the difference between the second set and the third set; and, if the first difference measurement is greater than or equal to the second difference measurement, swapping or determining the order index of the first set and the second set.
2. The apparatus according to claim 1, wherein each of the first set, the second set, the third set, and the fourth set includes an azimuth parameter for the direction of arrival and an elevation parameter for the direction of arrival.
3. The apparatus according to claim 2, wherein the apparatus is caused to determine the first difference measurement by [equation not reproduced in the source text] and the second difference measurement by [equation not reproduced in the source text], where φ_{1_1} is the elevation parameter of the first set, φ_{1_2} is the elevation parameter of the second set, φ_{2_1} is the elevation parameter of the third set, φ_{2_2} is the elevation parameter of the fourth set, θ_{1_1} is the azimuth parameter of the first set, θ_{1_2} is the azimuth parameter of the second set, θ_{2_1} is the azimuth parameter of the third set, and θ_{2_2} is the azimuth parameter of the fourth set.
4. The apparatus according to claim 1, wherein each of the first set, the second set, the third set, and the fourth set is associated with the direct-to-total energy ratio and spread coherence.
5. The apparatus according to claim 1, wherein the apparatus is further caused, after swapping the order index, to perform: combining the at least one directional metadata parameter having the first order index in the first time-frequency tile with the at least one directional metadata parameter having the first order index in the second time-frequency tile; and combining the at least one directional metadata parameter having the second order index in the first time-frequency tile with the at least one directional metadata parameter having the second order index in the second time-frequency tile.
6. The apparatus according to claim 1, wherein the adjacent time-frequency tile is at least one of: a time-frequency tile preceding in time; a time-frequency tile succeeding in time; a time-frequency tile preceding in frequency; a time-frequency tile succeeding in frequency; a time-frequency tile preceding in time and in frequency; a time-frequency tile succeeding in time and in frequency; a time-frequency tile preceding in time and succeeding in frequency; and a time-frequency tile succeeding in time and preceding in frequency.
7. The apparatus according to any one of claims 1 to 6, wherein the frames are arranged temporally consecutively, the first time-frequency tile is included in a first frame, the adjacent second time-frequency tile is included in a second frame, and one of the following applies: the first time-frequency tile is the last time-frequency tile in the first frame, the second time-frequency tile is the first time-frequency tile in the second frame, and the second frame follows the first frame in time; or the first time-frequency tile is the first time-frequency tile in the first frame, the second time-frequency tile is the last time-frequency tile in the second frame, and the first frame immediately follows the second frame in time.
8. A method for an apparatus, the method comprising: obtaining ordered directional metadata parameters for at least two sound sources in an audio scene, wherein the ordered directional metadata parameters are associated with the at least two sound sources, identify the direction of arrival for the at least two sound sources, and are arranged in a plurality of frames arranged as a grid of time-frequency tiles on a time axis and a frequency axis, each tile comprising at least two sets of at least one directional metadata parameter, one set for each of the at least two sound sources, and each set being associated with an order index that defines the order between the at least two sets in the tile, thereby providing the ordered directional metadata parameters; and, for a first time-frequency tile and an adjacent second time-frequency tile, wherein the first tile has a first set of at least one directional metadata parameter having a first order index and a second set of at least one directional metadata parameter having a second order index, and the second tile has a third set of at least one directional metadata parameter having the first order index and a fourth set of at least one directional metadata parameter having the second order index: determining a first difference measurement by determining the difference between the first set and the third set and the difference between the second set and the fourth set; determining a second difference measurement by determining the difference between the first set and the fourth set and the difference between the second set and the third set; and, if the first difference measurement is greater than or equal to the second difference measurement, swapping or determining the order index of the first set and the second set.
9. The method according to claim 8, wherein each of the first set, the second set, the third set, and the fourth set includes an azimuth parameter for the direction of arrival and an elevation parameter for the direction of arrival.
10. The method according to claim 9, comprising determining the first difference measurement by [equation not reproduced in the source text] and the second difference measurement by [equation not reproduced in the source text], where φ_{1_1} is the elevation parameter of the first set, φ_{1_2} is the elevation parameter of the second set, φ_{2_1} is the elevation parameter of the third set, φ_{2_2} is the elevation parameter of the fourth set, θ_{1_1} is the azimuth parameter of the first set, θ_{1_2} is the azimuth parameter of the second set, θ_{2_1} is the azimuth parameter of the third set, and θ_{2_2} is the azimuth parameter of the fourth set.
11. The method according to claim 8, wherein each of the first set, the second set, the third set, and the fourth set is associated with the direct-to-total energy ratio and spread coherence.
12. The method according to claim 8, further comprising, after swapping the order index: combining the at least one directional metadata parameter having the first order index in the first time-frequency tile with the at least one directional metadata parameter having the first order index in the second time-frequency tile; and combining the at least one directional metadata parameter having the second order index in the first time-frequency tile with the at least one directional metadata parameter having the second order index in the second time-frequency tile.
13. The method according to claim 8, wherein the adjacent time-frequency tile is at least one of: a time-frequency tile preceding in time; a time-frequency tile succeeding in time; a time-frequency tile preceding in frequency; a time-frequency tile succeeding in frequency; a time-frequency tile preceding in time and in frequency; a time-frequency tile succeeding in time and in frequency; a time-frequency tile preceding in time and succeeding in frequency; and a time-frequency tile succeeding in time and preceding in frequency.
14. The method according to any one of claims 8 to 13, wherein the frames are arranged temporally consecutively, the first time-frequency tile is included in a first frame, the adjacent second time-frequency tile is included in a second frame, and one of the following applies: the first time-frequency tile is the last time-frequency tile in the first frame, the second time-frequency tile is the first time-frequency tile in the second frame, and the second frame follows the first frame in time; or the first time-frequency tile is the first time-frequency tile in the first frame, the second time-frequency tile is the last time-frequency tile in the second frame, and the first frame immediately follows the second frame in time.
15. A non-transitory computer-readable medium comprising instructions that, when executed by an apparatus, cause the apparatus at least to perform: obtaining ordered directional metadata parameters for at least two sound sources in an audio scene, wherein the ordered directional metadata parameters are associated with the at least two sound sources, identify the direction of arrival for the at least two sound sources, and are arranged in a plurality of frames arranged as a grid of time-frequency tiles on a time axis and a frequency axis, each tile comprising at least two sets of at least one directional metadata parameter, one set for each of the at least two sound sources, and each set being associated with an order index that defines the order between the at least two sets in the tile, thereby providing the ordered directional metadata parameters; and, for a first time-frequency tile and an adjacent second time-frequency tile, wherein the first tile has a first set of at least one directional metadata parameter having a first order index and a second set of at least one directional metadata parameter having a second order index, and the second tile has a third set of at least one directional metadata parameter having the first order index and a fourth set of at least one directional metadata parameter having the second order index: determining a first difference measurement by determining the difference between the first set and the third set and the difference between the second set and the fourth set; determining a second difference measurement by determining the difference between the first set and the fourth set and the difference between the second set and the third set; and, if the first difference measurement is greater than or equal to the second difference measurement, swapping or determining the order index of the first set and the second set.
16. The non-transitory computer-readable medium according to claim 15, wherein each of the first set, the second set, the third set, and the fourth set includes an azimuth parameter for the direction of arrival and an elevation parameter for the direction of arrival.
17. The non-transitory computer-readable medium according to claim 16, wherein the instructions, when executed, further cause the apparatus to determine the first difference measurement by [equation not reproduced in the source text] and the second difference measurement by [equation not reproduced in the source text], where φ_{1_1} is the elevation parameter of the first set, φ_{1_2} is the elevation parameter of the second set, φ_{2_1} is the elevation parameter of the third set, φ_{2_2} is the elevation parameter of the fourth set, θ_{1_1} is the azimuth parameter of the first set, θ_{1_2} is the azimuth parameter of the second set, θ_{2_1} is the azimuth parameter of the third set, and θ_{2_2} is the azimuth parameter of the fourth set.
18. The non-transitory computer-readable medium according to claim 15, wherein each of the first set, the second set, the third set, and the fourth set is associated with a direct-to-total energy ratio and spread coherence.
19. The non-transitory computer-readable medium according to claim 15, wherein the instructions, when executed, further cause the apparatus, after swapping the order index, to perform: combining the at least one directional metadata parameter having the first order index in the first time-frequency tile with the at least one directional metadata parameter having the first order index in the second time-frequency tile; and combining the at least one directional metadata parameter having the second order index in the first time-frequency tile with the at least one directional metadata parameter having the second order index in the second time-frequency tile.
20. The non-transitory computer-readable medium according to claim 15, wherein the adjacent time-frequency tile is at least one of: a time-frequency tile preceding in time; a time-frequency tile succeeding in time; a time-frequency tile preceding in frequency; a time-frequency tile succeeding in frequency; a time-frequency tile preceding in time and in frequency; a time-frequency tile succeeding in time and in frequency; a time-frequency tile preceding in time and succeeding in frequency; and a time-frequency tile succeeding in time and preceding in frequency.
21. The non-transitory computer-readable medium according to any one of claims 15 to 20, wherein the frames are arranged temporally consecutively, the first time-frequency tile is included in a first frame, the adjacent second time-frequency tile is included in a second frame, and one of the following applies: the first time-frequency tile is the last time-frequency tile in the first frame, the second time-frequency tile is the first time-frequency tile in the second frame, and the second frame follows the first frame in time; or the first time-frequency tile is the first time-frequency tile in the first frame, the second time-frequency tile is the last time-frequency tile in the second frame, and the first frame immediately follows the second frame in time.