Quantizing spatial audio parameters

By converting and quantizing the energy ratio parameter in spatial audio coding, the coding process is optimized, solving the problem of excessive bit count requirements in existing technologies and achieving more efficient coding and storage.

CN116508098B · Active · Publication Date: 2026-06-30 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2021-08-19
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively compress and encode spatial audio parameters, especially for multi-channel loudspeaker and audio object signals, resulting in excessive bit requirements and hindering efficient transmission and storage.

Method used

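The energy ratios associated with a time-frequency block are converted into an additional energy ratio parameter, which is quantized with a first quantizer. The quantized value is then used to select a further quantizer for the distribution factor of the energy ratios, which reduces the bit rate. Direction-associated parameters such as the direction index and distance may also be swapped between the two directions when the second energy ratio is greater than the first.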

Benefits of technology

It effectively reduces the bit rate requirement of spatial audio parameters, improves coding efficiency, and is suitable for the transmission and storage of multi-channel speaker and audio object signals.

✦ Generated by Eureka AI based on patent content.


Abstract

In particular, an apparatus for spatial audio coding is disclosed, the apparatus being configured to convert two or more energy ratios associated with time-frequency blocks of one or more audio signals into additional energy ratio parameters associated with the two or more energy ratios; quantize the additional energy ratio parameters using a first quantizer; determine a distribution factor of the energy ratios based on the ratio of the first energy ratio to the sum of the two or more energy ratios; select an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameters; and quantize the distribution factor of the energy ratios using the selected additional quantizer.

Description

Technical Field

[0001] This application relates to apparatus and methods for encoding sound-field-related parameters, in particular, but not exclusively, for encoding time-frequency domain direction-related parameters for audio encoders and decoders.

Background Technology

[0002] Parametric spatial audio processing is a field of audio signal processing where a set of parameters is used to describe the spatial aspects of sound. For example, in parametric spatial audio capture with a microphone array, estimating a set of parameters (such as the direction of sound in a frequency band and the ratio between the directional and non-directional components of the captured sound in that band) from the microphone array signals is a typical and effective choice. These parameters are known to describe well the perceptual spatial properties of the sound captured at the location of the microphone array. They can then be used in the synthesis of spatial sound for binaural headphones, loudspeakers, or other formats such as Ambisonics.

[0003] The direction and the direct-to-total energy ratio in frequency bands are therefore a particularly effective parameterization for spatial audio capture.

[0004] A set of parameters, consisting of a direction parameter and an energy ratio parameter (indicating the directionality of sound) within a frequency band, can also be used as spatial metadata for an audio codec (which may also include other parameters such as surround coherence, spread coherence, number of directions, distance, etc.). For example, these parameters can be estimated from audio signals captured by a microphone array, and stereo or mono signals can be generated from the microphone array signals to be conveyed along with the spatial metadata. For instance, a stereo signal can be encoded using an AAC encoder, and a mono signal can be encoded using an EVS encoder. The decoder can then decode the audio signal into a PCM signal and process the sound within the frequency bands (using the spatial metadata) to obtain a spatial output, such as a binaural output.

[0005] The above solution is particularly well-suited for encoding spatial sound captured from microphone arrays (e.g., mobile phones, VR cameras, standalone microphone arrays). However, it may be desirable for such an encoder to also accept other input types besides the signals captured by the microphone array, such as speaker signals, audio object signals, or stereo signals.

[0006] Analysis of first-order Ambisonics (FOA) inputs for spatial metadata extraction has been extensively documented in scientific literature related to DirAC (Directional Audio Coding) and Harpex (High Angular Resolution Planewave Expansion). This is because microphone arrays directly provide FOA signals (more precisely, their variant, B-format signals), and therefore analyzing such inputs has become a focus of research in this field. Furthermore, analysis of higher-order Ambisonics (HOA) inputs for multi-directional spatial metadata extraction has also been documented in scientific literature related to Higher-Order DirAC (HO-DirAC).

[0007] Further inputs for the encoder are multi-channel loudspeaker inputs, such as 5.1 or 7.1 channel surround inputs, and audio objects.

[0008] However, regarding the components of spatial metadata, the compression and encoding of spatial audio parameters (such as the direct-to-total energy ratio) is of considerable interest in order to minimize the total number of bits required to represent the spatial audio parameters.

Summary of the Invention

[0009] According to a first aspect, there exists an apparatus for spatial audio coding, comprising: a component configured to: convert two or more energy ratios associated with time-frequency blocks of one or more audio signals into additional energy ratio parameters associated with the two or more energy ratios; quantize the additional energy ratio parameters using a first quantizer; determine a distribution factor of the energy ratios based on the ratio of the first energy ratio to the sum of the two or more energy ratios; select an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameters; and quantize the distribution factor of the energy ratios using the selected additional quantizer.

[0010] The two or more energy ratios may be two direct-to-total energy ratios.

[0011] The additional energy ratio parameter may be a diffuse-to-total energy ratio.

[0012] The diffuse-to-total energy ratio may be one minus the sum of the two direct-to-total energy ratios.

[0013] Alternatively, the additional energy ratio parameter may be the sum of the two direct-to-total energy ratios.

[0014] The distribution factor of the energy ratio can include the ratio of the first direct-to-total energy ratio to the sum of the two direct-to-total energy ratios.

[0015] The component for selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter may include a component for: comparing the quantized additional energy ratio parameter with a threshold; and selecting an additional quantizer from a plurality of additional quantizers based on the comparison.
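The steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the uniform quantizers, the bit allocations (3 bits for the first quantizer, 3 bits or 1 bit for the distribution factor), the threshold of 0.75, and the function names are all hypothetical. The text states only that the additional quantizer is selected by comparing the quantized energy ratio parameter with a threshold.

```python
def uniform_quantize(value, bits, lo=0.0, hi=1.0):
    """Quantize value on a uniform grid over [lo, hi]; return (index, reconstruction)."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    index = round((clamped - lo) / (hi - lo) * levels)
    return index, lo + index * (hi - lo) / levels


def encode_energy_ratios(r1, r2, threshold=0.75):
    """Encode two direct-to-total ratios; return (diffuse index, factor bits, factor index)."""
    # Step 1: convert the two ratios into the additional energy ratio parameter.
    r_diff = 1.0 - (r1 + r2)
    # Step 2: quantize it with the first quantizer (assumed 3-bit uniform grid).
    diff_idx, r_diff_q = uniform_quantize(r_diff, bits=3)
    # Step 3: distribution factor = first ratio / sum of the ratios.
    total = r1 + r2
    dr = r1 / total if total > 0.0 else 0.5
    # Step 4: select the additional quantizer by comparing the QUANTIZED
    # parameter with a threshold (assumed: coarser grid for diffuse blocks).
    dr_bits = 1 if r_diff_q > threshold else 3
    # Step 5: quantize the distribution factor with the selected quantizer.
    dr_idx, _ = uniform_quantize(dr, bits=dr_bits)
    return diff_idx, dr_bits, dr_idx
```

With these assumed settings, `encode_energy_ratios(0.6, 0.2)` returns `(1, 3, 5)` and `encode_energy_ratios(0.08, 0.02)` returns `(6, 1, 1)`. Because the selection depends only on the quantized value, a decoder that has the first index could presumably repeat the same choice without extra signalling.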

[0016] The first direct-to-total energy ratio of the two direct-to-total energy ratios can be associated with a first direction of the sound wave, and the second direct-to-total energy ratio of the two direct-to-total energy ratios can be associated with a second direction of the sound wave. The apparatus may further include a component for: determining that the second direct-to-total energy ratio is greater than the first direct-to-total energy ratio; swapping the first direct-to-total energy ratio to be associated with the second direction; and swapping the second direct-to-total energy ratio to be associated with the first direction.

[0017] The first direction index, first spread coherence, and first distance associated with the time-frequency block can each be associated with the first direction of the sound wave, and the second direction index, second spread coherence, and second distance associated with the time-frequency block can each be associated with the second direction of the sound wave. If it is determined that the second direct-to-total energy ratio is greater than the first direct-to-total energy ratio, the apparatus may further include components for at least one of: swapping the first direction index to be associated with the second direction and swapping the second direction index to be associated with the first direction; swapping the first distance to be associated with the second direction and swapping the second distance to be associated with the first direction; and swapping the first spread coherence to be associated with the second direction and swapping the second spread coherence to be associated with the first direction.
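The swap described above can be sketched as a reordering of two per-direction parameter records. The record layout and names below are hypothetical, and the sketch swaps all direction-associated parameters together, whereas the text requires only at least one of them to be swapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectionParams:
    """Per-direction parameters of one TF block (field names are illustrative)."""
    energy_ratio: float     # direct-to-total energy ratio
    direction_index: int
    coherence: float        # per-direction coherence parameter
    distance: float


def order_directions(first, second):
    """Return the pair ordered so that the first direction has the larger ratio."""
    if second.energy_ratio > first.energy_ratio:
        # Swap every direction-associated parameter by swapping whole records.
        return second, first
    return first, second
```

After this reordering the first ratio is never smaller than the second, so the distribution factor (first ratio divided by the sum of both) is at least 0.5 whenever the sum is positive.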

[0018] According to the second aspect, there is a method for spatial audio coding, comprising: converting two or more energy ratios associated with time-frequency blocks of one or more audio signals into additional energy ratio parameters associated with the two or more energy ratios; quantizing the additional energy ratio parameters using a first quantizer; determining a distribution factor of the energy ratios based on the ratio of the first energy ratio to the sum of the two or more energy ratios; selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameters; and quantizing the distribution factor of the energy ratios using the selected additional quantizer.

[0019] The two or more energy ratios may be two direct-to-total energy ratios.

[0020] The additional energy ratio parameter may be a diffuse-to-total energy ratio.

[0021] The diffuse-to-total energy ratio may be one minus the sum of the two direct-to-total energy ratios.

[0022] Alternatively, the additional energy ratio parameter may be the sum of the two direct-to-total energy ratios.

[0023] The distribution factor of the energy ratio can include the ratio of the first direct-to-total energy ratio to the sum of the two direct-to-total energy ratios.

[0024] Selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter may include: comparing the quantized additional energy ratio parameter with a threshold; and selecting an additional quantizer from a plurality of additional quantizers based on the comparison.

[0025] The method may further include: determining that the second direct-to-total energy ratio, associated with a second direction of the sound wave, is greater than the first direct-to-total energy ratio, associated with a first direction of the sound wave; swapping the first direct-to-total energy ratio to be associated with the second direction; and swapping the second direct-to-total energy ratio to be associated with the first direction.

[0026] The first direction index, first spread coherence, and first distance associated with the time-frequency block can each be associated with the first direction of the sound wave, and the second direction index, second spread coherence, and second distance associated with the time-frequency block can each be associated with the second direction of the sound wave. If it is determined that the second direct-to-total energy ratio is greater than the first direct-to-total energy ratio, the method may further include at least one of the following: swapping the first direction index to be associated with the second direction and swapping the second direction index to be associated with the first direction; swapping the first distance to be associated with the second direction and swapping the second distance to be associated with the first direction; and swapping the first spread coherence to be associated with the second direction and swapping the second spread coherence to be associated with the first direction.

[0027] According to a third aspect, there is an apparatus for spatial audio coding, comprising at least one processor and at least one memory, the memory including computer program code, the at least one memory and the computer program code being configured together with the at least one processor to cause the apparatus to at least perform: converting two or more energy ratios associated with time-frequency blocks of one or more audio signals into additional energy ratio parameters associated with the two or more energy ratios; quantizing the additional energy ratio parameters using a first quantizer; determining a distribution factor of the energy ratios based on the ratio of the first energy ratio among the two or more energy ratios to the sum of the two or more energy ratios; selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameters; and quantizing the distribution factor of the energy ratios using the selected additional quantizer.

[0028] A computer program product stored on a medium can enable a device to perform the methods described herein.

[0029] Electronic devices may include devices as described herein.

[0030] Chipsets may include devices as described herein.

[0031] The embodiments of this application are intended to solve problems associated with the prior art.

Description of the Drawings

[0032] To better understand this application, reference will now be made to the accompanying drawings by way of example, wherein:

[0033] Figure 1 schematically shows a system of apparatus suitable for implementing some embodiments;

[0034] Figure 2 schematically shows a metadata encoder according to some embodiments;

[0035] Figure 3 shows a flowchart of the operation of the metadata encoder shown in Figure 2, according to some embodiments; and

[0036] Figure 4 schematically shows an example device suitable for implementing the illustrated apparatus.

Detailed Description

[0037] The following describes in more detail suitable apparatus and possible mechanisms for providing metadata parameters derived from effective spatial analysis. In the discussion below, multi-channel systems are discussed with reference to a multi-channel microphone implementation. However, as discussed above, the input format can be any suitable input format, such as multi-channel loudspeaker, Ambisonics (FOA / HOA), etc. Furthermore, the output of the example system is a multi-channel loudspeaker arrangement. However, it should be understood that the output can be rendered to the user by means other than loudspeakers. Furthermore, the multi-channel loudspeaker signals can be generalized as two or more playback audio signals. Such a system is currently being standardized by the 3GPP standardization body as Immersive Voice and Audio Services (IVAS). IVAS is intended to be an extension of the existing 3GPP Enhanced Voice Services (EVS) codec to facilitate immersive voice and audio services on existing and future mobile (cellular) and fixed-line networks. One application of IVAS could be providing immersive voice and audio services over 3GPP fourth-generation (4G) and fifth-generation (5G) networks. Furthermore, the IVAS codec, as an extension of EVS, can be used in store-and-forward applications where audio and voice content is encoded and stored in files for playback. It should be understood that IVAS can be used in conjunction with other audio and speech coding techniques that are capable of encoding samples of audio and speech signals.

[0038] For each time-frequency (TF) block or tile under consideration (in other words, each time / frequency subband), the metadata may include at least the spherical direction (elevation, azimuth), at least one energy ratio for the resulting direction, the spread coherence, and the direction-independent surround coherence. In general, IVAS may have many different types of metadata parameters for each time-frequency (TF) block. The types of spatial audio parameters that can constitute the metadata of IVAS are shown in Table 1 below.

[0039] This data can be encoded by the encoder and transmitted (or stored) so that the spatial signal can be reconstructed at the decoder.

[0040] Furthermore, in some instances, Metadata-Assisted Spatial Audio (MASA) can support up to two directions per TF block, which would require encoding and transmitting the aforementioned parameters for each direction on a per-TF block basis. According to Table 1 below, this could potentially double the required bitrate.

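[0041]-[0042] [Table 1: types of spatial audio parameters that can constitute the IVAS metadata. The table is not reproduced in this text.]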

[0043] This data can be encoded by the encoder and transmitted (or stored) so that the spatial signal can be reconstructed at the decoder.

[0044] In practical immersive audio communication codecs, the bit rate allocated to metadata can vary considerably. A typical overall operating bit rate for a codec might leave only 2 to 10 kbps for the transmission / storage of spatial metadata. However, some alternative implementations may allow up to 30 kbps or more. The encoding of direction parameters and energy ratios, as well as coherence data, has already been examined. However, regardless of the transmission / storage bit rate assigned to spatial metadata, there is always a need to represent these parameters with as few bits as possible, especially when a TF block can support multiple directions corresponding to different sound sources in the spatial audio scene.

[0045] The concept discussed below is to quantize the direct-to-total energy ratios of all directions of a TF block in the form of the diffuse-to-total energy ratio of the TF block and a ratio based on the direct-to-total energy ratios.

[0046] The invention therefore proceeds from the consideration that the bit rate required to transmit MASA data (or spatial metadata, i.e. spatial audio parameters) can be reduced by quantizing the direct-to-total energy ratio corresponding to each direction on a per-TF-block basis, using as few bits as possible in order to facilitate the transmission and storage of the encoded audio signal.

[0047] In this regard, Figure 1 depicts an example apparatus and system for implementing embodiments of this application. System 100 is shown having an "analysis" section 121 and a "synthesis" section 131. The "analysis" section 121 is the portion from receiving the multi-channel signals up to encoding the metadata and downmixed signals, and the "synthesis" section 131 is the portion from decoding the encoded metadata and downmixed signals to presenting the regenerated signal (e.g., in multi-channel loudspeaker form).

[0048] The input to system 100 and the "analysis" section 121 is a multi-channel signal 102. The following example describes a microphone channel signal input, but any suitable input (or synthesized multi-channel) format can be implemented in other embodiments. For example, in some embodiments, the spatial analyzer and spatial analysis can be implemented externally to the encoder. For example, in some embodiments, spatial metadata associated with the audio signal can be provided to the encoder as a separate bitstream. In some embodiments, spatial metadata can be provided as a set of spatial (direction) index values. These are examples of metadata-based audio input formats.

[0049] The multi-channel signal is transmitted to the transmission signal generator 103 and the analysis processor 105.

[0050] In some embodiments, the transmission signal generator 103 is configured to receive a multi-channel signal and generate a suitable transmission signal comprising a defined number of channels, and output a transmission signal 104. For example, the transmission signal generator 103 may be configured to generate a 2-channel audio mix of a multi-channel signal. The defined number of channels can be any suitable number of channels. In some embodiments, the transmission signal generator is configured to otherwise select or combine, for example, by beamforming, the input audio signal to a defined number of channels and output these as transmission signals.

[0051] In some embodiments, the transmission signal generator 103 is optional and the multi-channel signal is passed to the encoder 107 unprocessed in the same manner as the transmission signal in this example.

[0052] In some embodiments, the analysis processor 105 is also configured to receive and analyze the multi-channel signals to generate metadata 106 associated with the multi-channel signals and therefore with the transmission signal 104. The analysis processor 105 may be configured to generate metadata that, for each time-frequency analysis interval, may include a direction parameter 108, an energy ratio parameter 110 (including the direct-to-total energy ratio and the diffuse-to-total energy ratio for each direction), and a coherence parameter 112. The direction, energy ratio, and coherence parameters may be considered spatial audio parameters in some embodiments. In other words, spatial audio parameters include parameters designed to characterize the sound field created / captured by the multi-channel signals (or generally two or more audio signals).

[0053] In some embodiments, the generated parameters may vary by frequency band. Thus, for example, all parameters may be generated and transmitted in frequency band X, while only one parameter may be generated and transmitted in frequency band Y, and no parameters may be generated or transmitted in frequency band Z. A practical example of this could be that for some frequency bands, such as the highest frequency band, certain parameters are not needed for perceptual reasons. Transmission signal 104 and metadata 106 can be passed to encoder 107.

[0054] Encoder 107 may include an audio encoder core 109 configured to receive the transmission (e.g., downmix) signals 104 and generate a suitable encoding of these audio signals. In some embodiments, encoder 107 may be a computer (running suitable software stored on memory and on at least one processor), or alternatively, a specific device utilizing, for example, an FPGA or ASIC. Encoding can be implemented using any suitable scheme. Encoder 107 may also include a metadata encoder / quantizer 111 configured to receive metadata and output information in encoded or compressed form. In some embodiments, encoder 107 may further interleave, multiplex to a single data stream, or embed the metadata within the encoded downmix signals before transmission or storage, as shown by the dashed line in Figure 1. Multiplexing can be implemented using any suitable scheme.

[0055] On the decoder side, the received or retrieved data (stream) can be received by the decoder / demultiplexer 133. The decoder / demultiplexer 133 can demultiplex the encoded stream and pass the audio encoded stream to the transport extractor 135, which is configured to decode the audio signal to obtain the transport signal. Similarly, the decoder / demultiplexer 133 may include a metadata extractor 137, which is configured to receive encoded metadata and generate metadata. In some embodiments, the decoder / demultiplexer 133 may be a computer (running appropriate software stored in memory and at least one processor), or alternatively, a specific device utilizing, for example, an FPGA or ASIC.

[0056] The decoded metadata and transmitted audio signals can be passed to the synthesis processor 139.

[0057] The “synthesis” section 131 of system 100 also shows a synthesis processor 139, which is configured to receive the transmitted signal and metadata and recreate the synthesized spatial audio in the form of a multi-channel signal 110 based on the transmitted signal and metadata in any suitable format (depending on the use case, these may be a multi-channel speaker format or, in some embodiments, any suitable output format, such as a binaural or stereo signal).

[0058] Therefore, in summary, the system (analysis section) is first configured to receive multi-channel audio signals.

[0059] The system (analysis section) is then configured to generate appropriate transmitted audio signals (e.g., by selecting or downmixing some of the audio signal channels) and spatial audio parameters as metadata.

[0060] The system is then configured to encode the transmitted signals and metadata for storage / transmission.

[0061] After that, the system can store / transmit the encoded transmission signals and metadata.

[0062] The system can retrieve / receive encoded transmission signals and metadata.

[0063] The system is then configured to extract the transmission signals and metadata from the encoded transmission signals and metadata parameters, such as demultiplexing and decoding the encoded transmission signals and metadata parameters.

[0064] The system (synthesis section) is configured to synthesize and output multi-channel audio signals based on the extracted transmitted audio signals and metadata.

[0065] With respect to Figure 2, the example analysis processor 105 and metadata encoder / quantizer 111 (as shown in Figure 1) according to some embodiments are described in further detail.

[0066] Figures 1 and 2 depict the metadata encoder / quantizer 111 and the analysis processor 105 as being coupled together. However, it should be understood that some embodiments may not couple these two processing entities so tightly, and therefore the analysis processor 105 may exist on a different device than the metadata encoder / quantizer 111. Thus, a device including the metadata encoder / quantizer 111 can be presented with the transmission signals and metadata streams for processing and encoding independently of the capture and analysis process.

[0067] In some embodiments, the analysis processor 105 includes a time-frequency domain converter 201.

[0068] In some embodiments, the time-frequency domain converter 201 is configured to receive the multi-channel signal 102 and apply a suitable time-to-frequency domain transformation, such as a short-time Fourier transform (STFT), to convert the input time-domain signal into a suitable time-frequency signal. These time-frequency signals can then be passed to the spatial analyzer 203.

[0069] Therefore, for example, the time-frequency signal 202 can be represented in the time-frequency domain as follows:

[0070] $s_i(b,n)$,

[0071] where b is the frequency bin index, n is the time-frequency block (frame) index, and i is the channel index. In other words, n can be considered a time index with a lower sampling rate than that of the original time-domain signal. The frequency bins can be grouped into subbands, each grouping one or more bins under a band index k = 0, ..., K-1. Each subband k has a lowest bin $b_{k,\mathrm{low}}$ and a highest bin $b_{k,\mathrm{high}}$, and contains all bins from $b_{k,\mathrm{low}}$ to $b_{k,\mathrm{high}}$. The widths of the subbands can approximate any suitable distribution, for example the Equivalent Rectangular Bandwidth (ERB) scale or the Bark scale.
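The grouping of bins into subbands can be illustrated with a short sketch that sums per-bin energies over each inclusive bin range. The function name and the band edges used in the usage note are hypothetical; real edges would follow an ERB-like or Bark-like layout.

```python
def subband_energies(bin_energies, band_edges):
    """Sum per-bin energies into subbands.

    bin_energies[b] is the energy of frequency bin b for one channel and one
    time index n. band_edges[k] is the pair (b_low, b_high), both inclusive,
    that delimits subband k.
    """
    return [sum(bin_energies[b_low:b_high + 1]) for b_low, b_high in band_edges]
```

For example, with seven bins of energy `[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]` and assumed edges `[(0, 0), (1, 2), (3, 6)]`, the three subband energies are `[1.0, 5.0, 22.0]`.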

[0072] Therefore, a time-frequency (TF) block (or tile) is a specific subband within a subframe of a frame.

[0073] It can be understood that the number of bits required to represent spatial audio parameters can depend at least in part on the TF (time-frequency) block resolution (i.e., the number of TF subframes or blocks). For example, a 20ms audio frame can be divided into four 5ms time-domain subframes, and each time-domain subframe can have up to 24 frequency subbands in the frequency domain according to the Bark scale and its approximation, or any other suitable division. In this particular example, the audio frame can be divided into 96 TF subframes / blocks, in other words, four time-domain subframes with 24 frequency subbands. Therefore, the number of bits required to represent the spatial audio parameters of the audio frame may depend on the TF block resolution. For example, if each TF block is to be encoded according to the distribution in Table 1 above, then each TF block will require 64 bits (for each TF block with one source direction) and 104 bits (for each TF block with two source directions, taking into account parameters independent of source direction).
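The arithmetic in this example can be checked with a few lines. The sketch uses only the figures stated in the text (20 ms frames, four subframes, 24 subbands, 64 or 104 bits per TF block); the function and constant names are illustrative.

```python
FRAME_MS = 20
SUBFRAMES_PER_FRAME = 4
SUBBANDS_PER_SUBFRAME = 24
BITS_ONE_DIRECTION = 64
BITS_TWO_DIRECTIONS = 104


def raw_metadata_kbps(bits_per_block):
    """Uncompressed metadata rate in kbps for the TF resolution given above."""
    blocks_per_frame = SUBFRAMES_PER_FRAME * SUBBANDS_PER_SUBFRAME  # 96 TF blocks
    bits_per_frame = blocks_per_frame * bits_per_block
    frames_per_second = 1000 / FRAME_MS                             # 50 frames/s
    return bits_per_frame * frames_per_second / 1000
```

This gives 307.2 kbps for one direction per block and 499.2 kbps for two directions, far above the 2 to 10 kbps that paragraph [0044] mentions as a typical metadata budget, which is why compact quantization of the parameters matters.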

[0074] In one embodiment, the analysis processor 105 may include a spatial analyzer 203. The spatial analyzer 203 may be configured to receive time-frequency signals 202 and estimate direction parameters 108 based on these signals. The direction parameters may be determined based on any audio-based "direction" determination.

[0075] For example, in some embodiments, the spatial analyzer 203 is configured to estimate the direction of a sound source using two or more signal inputs.

[0076] The spatial analyzer 203 can therefore be configured to provide at least one azimuth and elevation angle, denoted as azimuth φ(k,n) and elevation θ(k,n), for each frequency band and time-frequency block within a frame of the audio signal. The direction parameter 108 of the time subframe can also be passed to the spatial parameter set encoder 207.

[0077] The spatial analyzer 203 can also be configured to determine the energy ratio parameter 110. The energy ratio can be considered as a determination of the energy of an audio signal that can be considered to arrive from one direction. The direct-to-total energy ratio r(k,n) can be estimated, for example, using a stability metric for direction estimation, or using any relevant metric or any other suitable method to obtain the ratio parameter. Each direct-to-total energy ratio corresponds to a specific spatial direction and describes how much energy comes from that specific spatial direction compared to the total energy. This value can also be represented individually for each time-frequency block. The spatial direction parameter and the direct-to-total energy ratio describe how much of the total energy for each time-frequency block comes from a specific direction. Generally, the spatial direction parameter can also be considered as the direction of arrival (DOA).

[0078] In an embodiment, the direct-to-total energy ratio parameter can be estimated based on the normalized cross-correlation parameter cor′(k,n) between a microphone pair at frequency band k, where the value of the cross-correlation parameter lies between -1 and 1. The direct-to-total energy ratio parameter r(k,n) can be determined by comparing the normalized cross-correlation parameter with a diffuse-field normalized cross-correlation parameter $\mathrm{cor}'_D(k,n)$ (the equation is not reproduced in this text). The direct-to-total energy ratio is explained further in PCT Publication WO2017/005978, which is incorporated herein by reference. The energy ratio can be passed to the spatial parameter set encoder 207.
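The comparison equation itself is missing from this text. Purely as an illustration, the sketch below assumes one plausible form: a linear rescaling that maps a correlation equal to the diffuse-field value to a ratio of 0 and a correlation of 1 to a ratio of 1, clamped to [0, 1]. This form and the function name are assumptions and have not been checked against this patent or WO2017/005978.

```python
def direct_to_total_ratio(cor, cor_diffuse):
    """Hypothetical estimate: rescale cor from [cor_diffuse, 1] to [0, 1]."""
    if cor_diffuse >= 1.0:
        # Degenerate case: no headroom between diffuse-field value and 1.
        return 0.0
    ratio = (cor - cor_diffuse) / (1.0 - cor_diffuse)
    # Clamp, since cor may fall below the diffuse-field value.
    return min(max(ratio, 0.0), 1.0)
```

Under this assumption a fully correlated pair yields a ratio of 1, and any correlation at or below the diffuse-field value yields 0.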

[0079] The spatial analyzer 203 can also be configured to determine a plurality of coherence parameters 112, which may include the surround coherence (γ(k,n)) and the spread coherence (ζ(k,n)), both of which are analyzed in the time-frequency domain.

[0080] The term audio source can refer to the dominant direction of the propagating sound waves, or it can encompass the actual direction of the sound source.

[0081] Therefore, for each subband k, there will exist a set (or collection) of spatial audio parameters associated with the subband and subframe n. In this instance, each subband k and subframe n (in other words, each TF block) can have the following spatial audio parameters associated with it on a per-audio-source-direction basis: at least one azimuth and elevation angle, denoted as azimuth φ(k,n) and elevation θ(k,n), the spread coherence (ζ(k,n)), and a direct-to-total energy ratio parameter r(k,n). If a TF block has more than one direction, it can have each of the parameters listed above for each source direction. Additionally, the set of spatial audio parameters can also include the surround coherence (γ(k,n)) and the diffuse-to-total energy ratio $r_{\mathrm{diff}}(k,n)$.

[0082] In an embodiment, the diffusion-to-total energy ratio r_diff(k,n) is the energy ratio of non-directional sound over the surrounding directions, and typically each TF block has a single diffusion-to-total energy ratio (and a single surrounding coherence (γ(k,n))). The diffusion-to-total energy ratio can be considered as the energy ratio remaining after subtracting the direct-to-total energy ratios (for each direction) from one. Hereafter, the above parameters are referred to as a set of spatial audio parameters (or spatial audio parameter set) for a specific TF block.

[0083] In an embodiment, in addition to the direction parameter 108 and the coherence parameter 112, the spatial parameter set encoder 207 can be arranged to quantize the energy ratio parameter 110. The energy ratio parameter 110, including the direct-to-total energy ratio parameter r(k,n) for each direction, can be quantized on a per-direction basis in terms of the diffusion-to-total energy ratio r_diff(k,n) and a further parameter. The further parameter may be the ratio of one of the direct-to-total energy ratio parameters to the sum of the direct-to-total energy ratios of all directions. The further parameter may be called dr(k,n).

[0084] In some alternative embodiments, the sum of the direct-to-total energy ratios can be quantized instead of the diffusion-to-total energy ratio r_diff(k,n), where the sum of the direct-to-total energy ratios over the directions d can be expressed as:

[0085] r_sum(k,n) = ∑_d r_d(k,n)

[0086] For TF blocks assigned two audio source directions, the direct-to-total energy ratio parameter r1(k,n) for the first direction and the direct-to-total energy ratio parameter r2(k,n) for the second direction of the TF block (k,n) can be quantized in terms of the diffusion-to-total energy ratio r_diff(k,n) of the TF block and dr(k,n).

[0087] In an embodiment, the first direct-to-total energy ratio parameter r1(k,n) and the second direct-to-total energy ratio parameter r2(k,n) can be quantized by first determining the diffusion-to-total energy ratio r_diff(k,n) as

[0088] r_diff(k,n) = 1 - r1(k,n) - r2(k,n)
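As an illustration only, the conversion of two direct-to-total energy ratios into the diffusion-to-total energy ratio could be sketched as follows. The function name, the range checks and the clamping of rounding errors are assumptions made for this sketch and are not taken from the description.

```python
def diffuse_to_total(r1: float, r2: float) -> float:
    """Return r_diff = 1 - r1 - r2 for one TF block with two directions."""
    if not (0.0 <= r1 <= 1.0 and 0.0 <= r2 <= 1.0):
        raise ValueError("energy ratios must lie in [0, 1]")
    if r1 + r2 > 1.0 + 1e-9:
        raise ValueError("direct-to-total ratios must not sum to more than 1")
    # Clamp tiny negative results caused by floating-point rounding (assumption).
    return max(0.0, 1.0 - r1 - r2)


# Example: 50% of the energy from direction 1, 30% from direction 2.
print(round(diffuse_to_total(0.5, 0.3), 6))
```

The remaining energy (here about 0.2) is what the later steps treat as the non-directional share of the TF block.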

[0089] In some alternative embodiments, the diffusion-to-total energy ratio r_diff(k,n) can be provided as part of the MASA input metadata, rather than being calculated at runtime as outlined above. In this case, the spatial parameter set encoder 207 can obtain the additional energy ratio parameter (or diffusion-to-total energy ratio) associated with the two or more energy ratios of the time-frequency block.

[0090] The step of determining the diffusion-to-total energy ratio r_diff(k,n) is shown as processing step 301 in Figure 3.

[0091] The value of r_diff(k,n) can then be scalar quantized to give a quantized value of r_diff(k,n). In an embodiment, this can be performed using a non-uniform scalar quantizer.
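A non-uniform scalar quantizer of this kind can be sketched as a nearest-neighbour search in a sorted codebook. The codebook values below are purely hypothetical (the description does not list them), as are the names `R_DIFF_CODEBOOK` and `quantize_nonuniform`.

```python
import bisect

# Hypothetical 3-bit non-uniform codebook for r_diff (illustrative values only).
R_DIFF_CODEBOOK = [0.0, 0.05, 0.12, 0.22, 0.35, 0.52, 0.72, 0.95]


def quantize_nonuniform(value, codebook=R_DIFF_CODEBOOK):
    """Return (index, reconstruction) of the codebook entry nearest to value."""
    pos = bisect.bisect_left(codebook, value)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(codebook)]
    index = min(candidates, key=lambda i: abs(codebook[i] - value))
    return index, codebook[index]


index, r_diff_q = quantize_nonuniform(0.2)
print(index, r_diff_q)
```

Only the index would need to be transmitted; the decoder would hold the same codebook and read the reconstruction value from it.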

[0092] The step of quantizing r_diff(k,n) is shown as processing step 305 in Figure 3.

[0093] In some embodiments, the value of the diffusion-to-total energy ratio parameter r_diff(k,n) can be used to determine the size of the quantizer subsequently used in the process. For example, if r_diff(k,n) is higher than a selected value, a quantizer of a first size can be selected; otherwise, a quantizer of a second size can be selected. In an embodiment, this step can be written as follows:

[0094] If r_diff(k,n) > N_q

[0095] a. Quant_size = Q1 (number of bits, value 1)

[0096] Else

[0097] b. Quant_size = Q2 (number of bits, value 2)

[0098] End

[0099] In other words, if r_diff(k,n) > N_q (where N_q is the selected value), then the quantizer size Q1 is chosen; otherwise, the quantizer size Q2 is chosen. Q1 and Q2 can express the quantizer size as a number of bits.

[0100] In an embodiment, N_q takes a value between 0 and 1. For example, one operating point found for N_q is 0.6.

[0101] In a specific example of one embodiment, the above steps may have the following numerical values.

[0102] If r_diff(k,n) > 0.6

[0103] a. Quant_size = 2 (number of bits, value 1)

[0104] Else

[0105] b. Quant_size = 3 (number of bits, value 2)

[0106] End
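The numerical example above maps directly onto a small selection function. This is a minimal sketch; the function and parameter names are assumptions, while the threshold 0.6 and the sizes of 2 and 3 bits are the values given in the example.

```python
def select_quant_bits(r_diff, threshold=0.6, bits_above=2, bits_otherwise=3):
    """Return Quant_size in bits: 2 if r_diff > 0.6, else 3 (example values)."""
    return bits_above if r_diff > threshold else bits_otherwise


for r_diff in (0.1, 0.6, 0.8):
    print(r_diff, select_quant_bits(r_diff))
```

Note that the comparison is strict, so a value exactly equal to the threshold falls into the second (3-bit) branch.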

[0107] In some embodiments, the quantized diffusion-to-total energy ratio parameter may be used in the above processing steps. This can have the advantage that the quantizer size (Quant_size) need not be signalled as part of the bitstream. Instead, the quantizer size can be determined at the decoder by checking the quantized value of r_diff(k,n).
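The point of using the quantized value can be illustrated with a short sketch in which the encoder and the decoder both derive Quant_size from the transmitted index alone. The codebook values and function names are hypothetical; only the threshold 0.6 and the 2-bit and 3-bit sizes come from the example above.

```python
# Hypothetical codebook shared by encoder and decoder (illustrative values only).
R_DIFF_CODEBOOK = [0.0, 0.05, 0.12, 0.22, 0.35, 0.52, 0.72, 0.95]
THRESHOLD = 0.6  # N_q operating point from the example


def bits_from_index(r_diff_index):
    """Derive Quant_size from the r_diff index, as a decoder could do."""
    r_diff_q = R_DIFF_CODEBOOK[r_diff_index]
    return 2 if r_diff_q > THRESHOLD else 3


def encoder_bits(r_diff):
    """Quantize r_diff, then select Quant_size from the quantized value."""
    index = min(range(len(R_DIFF_CODEBOOK)),
                key=lambda i: abs(R_DIFF_CODEBOOK[i] - r_diff))
    return index, bits_from_index(index)


# 0.61 exceeds the threshold, but its quantized value (0.52) does not.
index, bits = encoder_bits(0.61)
print(index, bits, bits_from_index(index))
```

Had the encoder selected the size from the unquantized 0.61, it would have chosen 2 bits while the decoder, seeing only 0.52, would have assumed 3 bits; selecting from the quantized value keeps both sides in agreement without extra signalling.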

[0108] The step of using the quantized value of r_diff(k,n) to determine the size of the quantizer is shown as processing step 303 in Figure 3.

[0109] Embodiments can then determine the ratio of the first direct-to-total energy ratio parameter to the sum of the first and second direct-to-total energy ratio parameters; in other words, they determine the distribution factor of the energy ratios.

[0110] The distribution factor of the energy ratios can be expressed as:

[0111] dr(k,n) = r1(k,n) / (r1(k,n) + r2(k,n))

[0112] The step of determining the above ratio dr(k,n) is shown as processing step 307 in Figure 3.
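For two directions, the distribution factor is the first direct-to-total energy ratio divided by the sum of the first and second. A minimal sketch follows; the function name and the handling of the zero-energy case are assumptions, since the description does not say what happens when both ratios are zero.

```python
def distribution_factor(r1, r2):
    """Return dr = r1 / (r1 + r2) for one TF block with two directions."""
    total = r1 + r2
    if total <= 0.0:
        # Assumption: with no directional energy the split is undefined,
        # so an arbitrary neutral value is returned.
        return 0.5
    return r1 / total


print(distribution_factor(0.5, 0.3))
```

Together with r_diff(k,n), this single value carries the same information as the pair r1(k,n), r2(k,n).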

[0113] For a TF block having three direct-to-total energy ratio parameters, the diffusion-to-total energy ratio parameter r_diff(k,n) can be expressed as

[0114] r_diff(k,n) = 1 - (r1(k,n) + r2(k,n) + r3(k,n))

[0115] Furthermore, the distribution factor of the energy ratio can be given as

[0116]

[0117] and

[0118]

[0119] Naturally, the above scheme can be extended to any number of direct-to-total energy ratio parameters per TF block.

[0120] The value of the ratio dr(k, n) can now be quantized using a scalar quantizer. In an embodiment, one of several quantizers can be selected to quantize dr(k, n).

[0121] As described above, the quantizer used to quantize the ratio dr(k,n) can be selected based on the result of processing step 303. In other words, processing step 303 can be used to determine the size of the scalar quantizer used to quantize dr(k,n) to give its quantized value.

[0122] The step of selecting the quantizer for quantizing dr(k,n) is shown as step 309 in Figure 3.

[0123] In some embodiments, dr(k,n) can be quantized using a quantizer selected from several uniform scalar quantizers. In the example above, dr(k,n) can be quantized using one of two uniform scalar quantizers, as indicated by the Quant_size bits. Taking the specific example above, dr(k,n) can be quantized using a 2-bit or a 3-bit scalar quantizer.
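A uniform scalar quantizer whose size is given in bits might look like the sketch below. The placement of the reconstruction points (evenly spaced, including both end points of the range) and the function names are assumptions; the description only states that the quantizers are uniform and have 2 or 3 bits in the example.

```python
def quantize_uniform(value, bits, lo=0.0, hi=1.0):
    """Quantize value on [lo, hi] with 2**bits evenly spaced levels."""
    levels = 1 << bits
    step = (hi - lo) / (levels - 1)
    index = round((value - lo) / step)
    index = max(0, min(levels - 1, index))  # clamp out-of-range inputs
    return index, lo + index * step


def dequantize_uniform(index, bits, lo=0.0, hi=1.0):
    """Return the reconstruction value for an index of the same quantizer."""
    step = (hi - lo) / ((1 << bits) - 1)
    return lo + index * step


# The same dr value quantized with the 2-bit and the 3-bit quantizer.
print(quantize_uniform(0.625, 2))
print(quantize_uniform(0.625, 3))
```

The `lo` and `hi` arguments allow the same sketch to be used on a narrower range if the values of dr(k,n) are known to be restricted.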

[0124] The step of quantizing dr(k,n) is shown as step 311 in Figure 3.

[0125] The indices corresponding to the two quantized parameters, namely the quantized r_diff(k,n) and the quantized dr(k,n), can be encoded using a fixed or variable rate encoding scheme.

[0126] Alternatively, the indices corresponding to the two quantized parameters can be combined to form a primary index, and the primary index can then be encoded using entropy coding (such as Golomb-Rice or Huffman coding).
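The description does not specify how the two indices are combined, so the sketch below assumes a simple fixed-radix packing and then applies a textbook Golomb-Rice code to the resulting primary index. All function names, the radix of 8 and the choice of Rice parameter k are illustrative assumptions.

```python
MAX_DR_LEVELS = 8  # assumption: radix large enough for the 3-bit dr quantizer


def combine_indices(r_diff_index, dr_index):
    """Pack the two indices into one primary index (assumed fixed radix)."""
    return r_diff_index * MAX_DR_LEVELS + dr_index


def split_index(primary):
    """Inverse of combine_indices."""
    return divmod(primary, MAX_DR_LEVELS)


def golomb_rice_encode(value, k):
    """Golomb-Rice code: quotient in unary (ones ended by a zero), then k remainder bits."""
    bits = "1" * (value >> k) + "0"
    if k > 0:
        bits += format(value & ((1 << k) - 1), f"0{k}b")
    return bits


def golomb_rice_decode(bits, k):
    """Decode a single Golomb-Rice codeword produced by golomb_rice_encode."""
    q = bits.index("0")
    r = int(bits[q + 1:q + 1 + k], 2) if k > 0 else 0
    return (q << k) | r


primary = combine_indices(3, 4)
code = golomb_rice_encode(primary, 2)
print(primary, code, split_index(golomb_rice_decode(code, 2)))
```

Golomb-Rice coding gives short codewords to small values, so a practical scheme would presumably order the primary indices so that the most frequent combinations receive the smallest numbers; that ordering is not addressed here.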

[0127] In some embodiments, the quantization of the direct-to-total-energy ratio parameter may include an additional preprocessing step, wherein for each TF block, it is checked whether there are actually two direct-to-total-energy ratios r1(k,n) and r2(k,n) (associated with the first and second directions). The presence of a second direct-to-total-energy ratio indicates that the TF block (k,n) has at least two concurrent directions.

[0128] If a TF block is determined to have two concurrent directions, and the direct-to-total energy ratio r1(k,n) of the first direction is less than the direct-to-total energy ratio r2(k,n) of the second direction, the spatial audio parameters associated with the two directions can be swapped. In an embodiment, the spatial audio parameters associated with a particular audio direction may include the following parameters (from Table 1 above): direction index, direct-to-total energy ratio, diffusion coherence, and distance. In other words, the preprocessing step may have the following form.

[0129] 1. Check whether the TF block has two concurrent directions, i.e., check for the presence of the second direct-to-total energy ratio r2(k,n).

[0130] 2. If there are concurrent directions, then check whether r1(k,n) < r2(k,n).

[0131] 3. If r1(k,n) < r2(k,n), then swap the spatial audio parameters associated with the first direction with the spatial audio parameters associated with the second direction. Thus, this step may include swapping at least one of the direction index, the direct-to-total energy ratio r1(k,n), the diffusion coherence ζ1(k,n), and the distance associated with the first direction of the TF block with the corresponding direction index, direct-to-total energy ratio r2(k,n), diffusion coherence ζ2(k,n), and distance associated with the second direction of the TF block.

[0132] The above process effectively sorts the directions such that the direction with the larger direct-to-total energy ratio is always the first direction, and the direction with the smaller direct-to-total energy ratio is always the second direction.

[0133] The advantage of the above preprocessing step is that it enables a more efficient quantizer, because dr is then always between 0.5 and 1 (rather than between 0 and 1, as it would be without the above swapping mechanism). Thus, the same accuracy can be obtained with approximately half the number of codewords.
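The three preprocessing steps can be sketched as a single ordering function over per-direction parameter records. The `DirectionParams` container and its field names are hypothetical; the fields mirror the parameters listed above (direction index, direct-to-total energy ratio, diffusion coherence, distance).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DirectionParams:
    """Hypothetical container for the per-direction parameters of a TF block."""
    direction_index: int
    ratio: float       # direct-to-total energy ratio
    coherence: float   # diffusion coherence (zeta)
    distance: float


def order_directions(
    first: DirectionParams, second: Optional[DirectionParams]
) -> Tuple[DirectionParams, Optional[DirectionParams]]:
    """Swap the two directions when a second one exists and r1 < r2."""
    if second is not None and first.ratio < second.ratio:
        return second, first
    return first, second


a = DirectionParams(direction_index=10, ratio=0.2, coherence=0.1, distance=1.0)
b = DirectionParams(direction_index=42, ratio=0.5, coherence=0.3, distance=2.0)
first, second = order_directions(a, b)
dr = first.ratio / (first.ratio + second.ratio)
print(first.direction_index, second.direction_index, dr)
```

Because the whole record is swapped, every parameter stays attached to its own direction, and the resulting dr is at least 0.5, so a quantizer for dr only needs to cover the range from 0.5 to 1.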

[0134] Any further processing performed by the spatial parameter set encoder 207 can use the quantized direct-to-total energy ratios obtained from the quantized r_diff(k,n) and the quantized dr(k,n).

[0135] The above quantization scheme has been described in terms of the energy ratios of the TF block. However, those skilled in the art will appreciate that the above can equally be applied to other ratio parameters of the signal, such as amplitude ratios and power ratios.

[0136] The metadata encoder / quantizer 111 may also include a direction encoder. The direction encoder is configured to receive direction parameters (such as azimuth φ and elevation θ) (and in some embodiments the expected bit allocation) and thereby generate a suitable encoded output. In some embodiments, the encoding is based on a sphere arrangement forming a spherical grid, the spherical grid being arranged in rings on the "surface" sphere, which is defined by a look-up table defined by the determined quantization resolution. In other words, the idea of using a spherical grid is to cover a sphere with smaller spheres and consider the centers of the smaller spheres as points defining a grid of almost equidistant directions. Thus, the smaller spheres define cones or solid angles around the center point, which can be indexed according to any suitable indexing algorithm. Although spherical quantization is described here, any suitable linear or non-linear quantization can be used.
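The ring-based spherical grid can be illustrated with a simplified sketch. This is not the look-up-table grid of the description: the number of rings, the number of points per ring (chosen here proportional to the cosine of the elevation so that points are roughly equally spaced) and the indexing order are all assumptions made for illustration.

```python
import math


def build_ring_grid(num_rings=9, points_at_equator=16):
    """Return a list of (elevation_deg, n_azimuth_points), one entry per ring."""
    rings = []
    for i in range(num_rings):
        elevation = -90.0 + 180.0 * i / (num_rings - 1)
        # Fewer points on rings near the poles keeps the spacing roughly even.
        n = max(1, round(points_at_equator * math.cos(math.radians(elevation))))
        rings.append((elevation, n))
    return rings


def quantize_direction(azimuth_deg, elevation_deg, rings):
    """Return (grid index, (quantized azimuth, quantized elevation))."""
    ring_idx = min(range(len(rings)), key=lambda i: abs(rings[i][0] - elevation_deg))
    elevation_q, n = rings[ring_idx]
    step = 360.0 / n
    az_idx = round((azimuth_deg % 360.0) / step) % n
    # Single index: all points on lower rings, plus the position on this ring.
    offset = sum(count for _, count in rings[:ring_idx])
    return offset + az_idx, (az_idx * step, elevation_q)


rings = build_ring_grid()
print(sum(count for _, count in rings))
print(quantize_direction(30.0, 10.0, rings))
```

A finer or coarser grid, corresponding to a different quantization resolution, would be obtained by changing the two parameters of `build_ring_grid`.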

[0137] Similarly, the metadata encoder / quantizer 111 may also include a coherence encoder configured to receive the surrounding coherence value γ and the diffusion coherence value ζ and to determine a suitable encoding for compressing the surrounding and diffusion coherence values.

[0138] The encoded direction, energy ratio, and coherence value can be passed to the combiner. The combiner can be configured to receive the encoded (or quantized / compressed) direction parameters, energy ratio parameters, and coherence parameters and combine them to generate a suitable output (e.g., a metadata bitstream, which can be combined with the transmitted signal or transmitted or stored separately from the transmitted signal).

[0139] In some embodiments, the encoded data stream is passed to a decoder / demultiplexer 133. The decoder / demultiplexer 133 demultiplexes the encoded quantized spatial audio parameter set of the frame and passes it to a metadata extractor 137. In some embodiments, the decoder / demultiplexer 133 may also extract the transmitted audio signal to a transport extractor for decoding and extraction.

[0140] In an embodiment, the metadata extractor 137 can be configured to extract, for each TF block, the indices of the quantized r_diff(k,n) and the quantized dr(k,n).

[0141] The index associated with the quantized r_diff(k,n) can be read to give the corresponding quantized value.

[0142] The quantized value of r_diff(k,n) can then be used to determine, from the multiple quantizers, the specific quantizer (or quantization table) to be used at the decoder to dequantize the value of dr(k,n). In other words, the quantized r_diff(k,n) is used to select a quantization table at the decoder (from multiple quantization tables). The quantized value of dr(k,n) can then be obtained by using its associated index to read from the selected quantization table. The direct-to-total energy ratios can then be determined using the reverse of the process applied at the encoder. From the example above, the quantized values of r1(k,n) and r2(k,n), written here with the subscript q, can be obtained from the quantized values r_diff_q(k,n) and dr_q(k,n) as follows:

[0143] r1_q(k,n) = dr_q(k,n) · (1 - r_diff_q(k,n)) and

[0144] r2_q(k,n) = (1 - dr_q(k,n)) · (1 - r_diff_q(k,n))
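Reversing the encoder-side relations r_diff = 1 - r1 - r2 and dr = r1 / (r1 + r2) gives r1 = dr · (1 - r_diff) and r2 = (1 - dr) · (1 - r_diff). A minimal decoder-side sketch, with a hypothetical function name, is:

```python
def reconstruct_ratios(r_diff_q, dr_q):
    """Recover the two direct-to-total energy ratios from r_diff and dr."""
    direct_sum = 1.0 - r_diff_q        # energy share of both directions together
    r1_q = dr_q * direct_sum           # share of the first direction
    r2_q = (1.0 - dr_q) * direct_sum   # share of the second direction
    return r1_q, r2_q


r1_q, r2_q = reconstruct_ratios(0.2, 0.625)
print(round(r1_q, 6), round(r2_q, 6))
```

With unquantized inputs the round trip is exact up to floating-point rounding; with quantized inputs the reconstruction error depends on the two quantizers used.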

[0145] The decoded spatial audio parameters can then be used to form decoded metadata output from metadata extractor 137 and passed to synthesis processor 139 to form multichannel signal 110.

[0146] Figure 4 shows an example electronic device that can be used as an analysis or synthesis device. This device can be any suitable electronic device or apparatus. For example, in some embodiments, device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback device, etc.

[0147] In some embodiments, device 1400 includes at least one processor or central processing unit 1407. Processor 1407 may be configured to execute various program codes, such as the methods described herein.

[0148] In some embodiments, device 1400 includes memory 1411. In some embodiments, at least one processor 1407 is coupled to memory 1411. Memory 1411 can be any suitable storage component. In some embodiments, memory 1411 includes a program code portion for storing program code that can be implemented on processor 1407. Furthermore, in some embodiments, memory 1411 may also include a storage data portion for storing data, such as data that has been processed or is to be processed according to the embodiments described herein. The implemented program code stored in the program code portion and the data stored in the storage data portion can be retrieved by processor 1407 via memory-processor coupling when needed.

[0149] In some embodiments, device 1400 includes a user interface 1405. In some embodiments, user interface 1405 may be coupled to processor 1407. In some embodiments, processor 1407 may control the operation of user interface 1405 and receive input from user interface 1405. In some embodiments, user interface 1405 may enable a user to input commands to device 1400, for example, via a keypad. In some embodiments, user interface 1405 may enable a user to obtain information from device 1400. For example, user interface 1405 may include a display configured to display information from device 1400 to a user. In some embodiments, user interface 1405 may include a touchscreen or touch interface capable of inputting information to device 1400 and further displaying the information to a user of device 1400. In some embodiments, user interface 1405 may be a user interface for communicating with a location determiner described herein.

[0150] In some embodiments, device 1400 includes an input / output port 1409. In some embodiments, input / output port 1409 includes a transceiver. In such embodiments, the transceiver may be coupled to processor 1407 and configured to enable communication with other devices or electronic equipment, for example, via a wireless communication network. In some embodiments, the transceiver, or any suitable transceiver or transmitter and / or receiver component, may be configured to communicate with other electronic equipment or devices via a wire or wired coupling.

[0151] The transceiver can communicate with other devices using any suitable known communication protocol. For example, in some embodiments, the transceiver may use a suitable Universal Mobile Telecommunications System (UMTS) protocol, a Wireless Local Area Network (WLAN) protocol (such as IEEE 802.X), a suitable Short Range Radio Frequency Communication (SRF) protocol (such as Bluetooth), or an Infrared Data Communication Path (IRDA).

[0152] The transceiver input / output port 1409 can be configured to receive signals and, in some embodiments, determine the parameters described herein by executing appropriate code using processor 1407. Furthermore, the device can generate appropriate downmixed signals and parameter outputs for transmission to a synthesis device.

[0153] In some embodiments, device 1400 can be used as at least part of a synthesis device. Thus, input / output port 1409 can be configured to receive a downmixing signal and, in some embodiments, parameters determined at a capture or processing device as described herein, and to generate a suitable audio signal format output by executing appropriate code using processor 1407. Input / output port 1409 can be coupled to any suitable audio output, such as to a multi-channel speaker system and / or headphones or the like.

[0154] Generally, various embodiments of the present invention can be implemented in hardware or dedicated circuitry, software, logic, or any combination thereof. For example, some aspects can be implemented in hardware, while others can be implemented in firmware or software, which can be executed by a controller, microprocessor, or other computing device, but the invention is not limited thereto. Although various aspects of the invention may be illustrated and described as block diagrams, flowcharts, or other graphical representations, it should be understood that, by way of non-limiting example, the blocks, apparatuses, systems, techniques, or methods described herein can be implemented in hardware, software, firmware, dedicated circuitry or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.

[0155] Embodiments of the present invention can be implemented by computer software executable by a data processor, such as in a processor entity, or by hardware, or by a combination of software and hardware. Further in this respect, it should be noted that any block of the logical flow as shown in the figures can represent a program step, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. Software can be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs and their data variants, and CDs.

[0156] The memory can be of any type suitable for the local technical environment and can be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. The data processor can be of any type suitable for the local technical environment and, by way of non-limiting example, can include one or more of the following: general-purpose computers, special-purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), gate-level circuits, and processors based on multi-core processor architectures.

[0157] Embodiments of the present invention can be practiced in a variety of components, such as integrated circuit modules. The design of integrated circuits is essentially a highly automated process. Complex and powerful software tools can be used to transform logic-level designs into semiconductor circuit designs ready to be etched and formed on semiconductor substrates.

[0158] Such programs can use well-established design rules and a library of pre-stored design modules to route conductors and position components on a semiconductor chip. Once the design for the semiconductor circuit is complete, the final design, in a standardized electronic format, can be transferred to a semiconductor fabrication facility or "fab" for production.

[0159] The foregoing description has provided a complete and informative description of exemplary embodiments of the invention by way of exemplary and non-limiting examples. However, various modifications and adaptations will become apparent to those skilled in the art when read in conjunction with the accompanying drawings and appended claims, given the foregoing description. Nevertheless, all such and similar modifications to the teachings of this invention will still fall within the scope of the invention as defined in the appended claims.

Claims

1. An apparatus for spatial audio coding, comprising a component, said component being used for: converting two or more energy ratios associated with a time-frequency block of one or more audio signals into an additional energy ratio parameter associated with the two or more energy ratios; quantizing the additional energy ratio parameter using a first quantizer; determining a distribution factor of the energy ratios based on the ratio of a first energy ratio among the two or more energy ratios to the sum of the two or more energy ratios; selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter; and quantizing the distribution factor of the energy ratios using the selected additional quantizer.

2. The apparatus of claim 1, wherein the two or more energy ratios are two direct-to-total energy ratios.

3. The apparatus of claim 1, wherein the additional energy ratio parameter is a diffusion-to-total energy ratio.

4. The apparatus of claim 3, wherein the diffusion-to-total energy ratio comprises one minus the sum of the two direct-to-total energy ratios.

5. The apparatus of claim 2, wherein the additional energy ratio parameter is the sum of the two direct-to-total energy ratios.

6. The apparatus of claim 2, wherein the distribution factor of the energy ratios comprises the ratio of the first direct-to-total energy ratio to the sum of the two direct-to-total energy ratios.

7. The apparatus of claim 2, wherein the component for selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter comprises components for: comparing the quantized additional energy ratio parameter with a threshold; and selecting, based on the comparison, the additional quantizer from the plurality of additional quantizers.

8. The apparatus according to any one of claims 2 to 7, wherein the first direct-to-total energy ratio of the two direct-to-total energy ratios is associated with a first direction of the sound wave, and the second direct-to-total energy ratio of the two direct-to-total energy ratios is associated with a second direction of the sound wave, wherein the apparatus further comprises components for: determining that the second direct-to-total energy ratio is greater than the first direct-to-total energy ratio; swapping the first direct-to-total energy ratio of the two direct-to-total energy ratios to be associated with the second direction; and swapping the second direct-to-total energy ratio of the two direct-to-total energy ratios to be associated with the first direction.

9. The apparatus of claim 8, wherein a first direction index, a first extended coherence, and a first distance associated with the time-frequency block are each associated with the first direction of the sound wave, and wherein a second direction index, a second extended coherence, and a second distance associated with the time-frequency block are each associated with the second direction of the sound wave, wherein, when it is determined that the second direct-to-total energy ratio of the two direct-to-total energy ratios is greater than the first direct-to-total energy ratio of the two direct-to-total energy ratios, the apparatus further comprises components for at least one of the following: swapping the first direction index to be associated with the second direction and the second direction index to be associated with the first direction; swapping the first distance to be associated with the second direction and the second distance to be associated with the first direction; and swapping the first extended coherence to be associated with the second direction and the second extended coherence to be associated with the first direction.

10. A method for spatial audio coding, comprising: converting two or more energy ratios associated with a time-frequency block of one or more audio signals into an additional energy ratio parameter associated with the two or more energy ratios; quantizing the additional energy ratio parameter using a first quantizer; determining a distribution factor of the energy ratios based on the ratio of a first energy ratio among the two or more energy ratios to the sum of the two or more energy ratios; selecting an additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter; and quantizing the distribution factor of the energy ratios using the selected additional quantizer.

11. The method of claim 10, wherein the two or more energy ratios are two direct-to-total energy ratios.

12. The method of claim 10, wherein the additional energy ratio parameter is a diffusion-to-total energy ratio.

13. The method of claim 12, wherein the diffusion-to-total energy ratio comprises one minus the sum of the two direct-to-total energy ratios.

14. The method of claim 11, wherein the additional energy ratio parameter is the sum of the two direct-to-total energy ratios.

15. The method of claim 11, wherein the distribution factor of the energy ratios comprises the ratio of the first direct-to-total energy ratio to the sum of the two direct-to-total energy ratios.

16. The method of claim 11, wherein selecting the additional quantizer from a plurality of additional quantizers using the quantized additional energy ratio parameter comprises: comparing the quantized additional energy ratio parameter with a threshold; and selecting, based on the comparison, the additional quantizer from the plurality of additional quantizers.

17. The method of any one of claims 11 to 16, wherein the first direct-to-total energy ratio of the two direct-to-total energy ratios is associated with a first direction of the sound wave, and the second direct-to-total energy ratio of the two direct-to-total energy ratios is associated with a second direction of the sound wave, wherein the method further comprises: determining that the second direct-to-total energy ratio is greater than the first direct-to-total energy ratio; swapping the first direct-to-total energy ratio of the two direct-to-total energy ratios to be associated with the second direction; and swapping the second direct-to-total energy ratio of the two direct-to-total energy ratios to be associated with the first direction.

18. The method of claim 17, wherein a first direction index, a first extended coherence, and a first distance associated with the time-frequency block are each associated with the first direction of the sound wave, and wherein a second direction index, a second extended coherence, and a second distance associated with the time-frequency block are each associated with the second direction of the sound wave, wherein, when it is determined that the second direct-to-total energy ratio of the two direct-to-total energy ratios is greater than the first direct-to-total energy ratio of the two direct-to-total energy ratios, the method further comprises at least one of the following: swapping the first direction index to be associated with the second direction and the second direction index to be associated with the first direction; swapping the first distance to be associated with the second direction and the second distance to be associated with the first direction; and swapping the first extended coherence to be associated with the second direction and the second extended coherence to be associated with the first direction.