Apparatus and method for encoding multiple audio objects, or apparatus and method for decoding using two or more related audio objects.
By defining associated audio objects and using directional information for downmixing and covariance synthesis, the encoding of multiple audio objects at low bitrates is enhanced, ensuring high-quality audio reproduction with reduced computational cost.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- FRAUNHOFER GESELLSCHAFT ZUR FORDERUNG DER ANGEWANDTEN FORSCHUNG EV
- Filing Date
- 2021-10-12
- Publication Date
- 2026-06-22
- Estimated Expiration
- Not applicable · inactive patent
AI Technical Summary
Existing audio encoding technologies face challenges in efficiently encoding multiple audio objects at low bitrates, particularly when the number of objects increases, leading to significant bit consumption and degradation of output signal quality.
The proposed solution involves defining at least two associated audio objects per frequency bin and using parameter data, including directional information, to perform a downmix that generates transport channels with adjustable characteristics, and employing covariance synthesis or explicit contribution calculations to enhance audio quality and reduce computational cost.
This approach achieves high-quality and efficient audio encoding/decoding by minimizing bitrate requirements while maintaining superior audio quality, even with an increasing number of objects, through the use of directional information and advanced covariance synthesis.
Abstract
Description
Technical Field
[0001] The present invention relates to the encoding of audio signals, such as audio objects, and the decoding of encoded audio signals, such as encoded audio objects.
Background Art
[0002] Introduction. This document describes a parametric approach for encoding and decoding object-based audio content at low bitrates using Directional Audio Coding (DirAC). The presented embodiments operate as part of the 3GPP (registered trademark) Immersive Voice and Audio Services (IVAS) codec, where they provide an advantageous low-bitrate alternative to the Independent Stream with Metadata (ISM) mode, which is a discrete coding approach.
[0003] Prior Art: Discrete Coding of Objects. The simplest way to code object-based audio content is to code the objects individually and transmit them together with the corresponding metadata. The main drawback of this approach is that the bit consumption required to encode the objects becomes prohibitively large as the number of objects increases. A simple solution to this problem is a "parametric approach", in which a few relevant parameters are calculated from the input signals, quantized, and transmitted together with a suitable downmix signal that combines several object waveforms.
[0004] Spatial Audio Object Coding (SAOC). Spatial Audio Object Coding [SAOC_STD, SAOC_AES] is a parametric approach in which the encoder calculates a downmix signal based on a downmix matrix D and a set of parameters, and both are sent to the decoder. The parameters represent the psychoacoustically relevant properties of all individual objects. At the decoder, a rendering matrix R is used to render the downmix to a specific loudspeaker layout.
[0005] The main parameter of SAOC is the object covariance matrix E of size N by N, where N represents the number of objects. This parameter is transferred to the decoder as object level differences (OLD) and optional inter-object covariance (IOC).
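[0006] The individual elements e_{i,j} of the matrix E are given by the following equation.
[0007]
$$e_{i,j} = \sqrt{\mathrm{OLD}_i \, \mathrm{OLD}_j} \; \mathrm{IOC}_{i,j}$$
[0008] The object level differences (OLD) are defined as follows.
[0009]
$$\mathrm{OLD}_i^{l,m} = \frac{\mathrm{nrg}_{i,i}^{l,m}}{\max_j \left( \mathrm{nrg}_{j,j}^{l,m} \right)}$$
[0010] where
[0011]
$$\mathrm{nrg}_{i,j}^{l,m} = \frac{\sum_{n \in l} \sum_{k \in m} x_i^{n,k} \left( x_j^{n,k} \right)^{*}}{\sum_{n \in l} \sum_{k \in m} 1} + \varepsilon$$
[0012] and the absolute object energy (NRG) is described as follows.
[0013]
$$\mathrm{NRG}^{l,m} = \max_i \left( \mathrm{nrg}_{i,i}^{l,m} \right)$$
[0014] and
[0015]
(equation not recoverable from the source text)
[0016] In these formulas, i and j are the object indices of the objects x_i and x_j, respectively, n represents the time index, and k represents the frequency index. l represents a set of time indices and m represents a set of frequency indices. ε is an additive constant to avoid division by zero, for example, ε = 10^{-9}.
[0017] The similarity of the input objects (IOC) can be given, for example, by the cross-correlation.
[0018]
$$\mathrm{IOC}_{i,j}^{l,m} = \mathrm{Re} \left\{ \frac{\mathrm{nrg}_{i,j}^{l,m}}{\sqrt{\mathrm{nrg}_{i,i}^{l,m} \, \mathrm{nrg}_{j,j}^{l,m}}} \right\}$$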
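The SAOC parameters described above can be illustrated with a small numerical sketch. It assumes the usual SAOC definitions (mean cross-energy per parameter tile plus a small constant, OLD normalized to the strongest object, IOC as the real part of the normalized cross-energy); the function names `nrg` and `saoc_parameters`, the value of `EPS`, and the toy input are illustrative assumptions and do not come from the source.

```python
import math

EPS = 1e-9  # assumed additive constant that avoids division by zero


def nrg(x_i, x_j):
    """Mean cross-energy of two objects over one parameter tile, plus EPS.

    x_i, x_j: equally long lists of complex subband samples that cover the
    time indices in l and the frequency indices in m.
    """
    total = sum(a * b.conjugate() for a, b in zip(x_i, x_j))
    return total / len(x_i) + EPS


def saoc_parameters(objects):
    """Return (OLD, IOC, E) for a list of objects in one parameter tile."""
    n = len(objects)
    energy = [nrg(x, x).real for x in objects]
    peak = max(energy)
    old = [e / peak for e in energy]  # level relative to the strongest object
    ioc = [[(nrg(objects[i], objects[j])
             / math.sqrt(energy[i] * energy[j])).real
            for j in range(n)] for i in range(n)]
    cov = [[math.sqrt(old[i] * old[j]) * ioc[i][j] for j in range(n)]
           for i in range(n)]
    return old, ioc, cov


# Toy tile: object 1 is object 0 at half amplitude, object 2 is unrelated.
tile = [
    [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j],
    [0.5 + 0j, 0 + 0.5j, -0.5 + 0j, 0 - 0.5j],
    [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j],
]
old, ioc, cov = saoc_parameters(tile)
```

In this toy tile the scaled copy yields an IOC close to 1 with object 0, while the unrelated object yields an IOC close to 0, which is the behavior the covariance matrix E is meant to capture.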
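[0019] The downmix matrix D of size N_dmx × N is defined by its elements d_{i,j}, where i refers to the channel index of the downmix signal and j refers to the object index. In the case of a stereo downmix (N_dmx = 2), the elements are calculated from the parameters DMG and DCLD as follows (in the formulas below, the second subscript i is the object index):
[0020]
$$d_{1,i} = 10^{0.05 \, \mathrm{DMG}_i} \sqrt{\frac{10^{0.1 \, \mathrm{DCLD}_i}}{1 + 10^{0.1 \, \mathrm{DCLD}_i}}}, \qquad d_{2,i} = 10^{0.05 \, \mathrm{DMG}_i} \sqrt{\frac{1}{1 + 10^{0.1 \, \mathrm{DCLD}_i}}}$$
[0021] where DMG_i and DCLD_i are given by the following formulas.
[0022]
$$\mathrm{DMG}_i = 10 \log_{10} \left( d_{1,i}^2 + d_{2,i}^2 + \varepsilon \right), \qquad \mathrm{DCLD}_i = 10 \log_{10} \left( \frac{d_{1,i}^2}{d_{2,i}^2} \right)$$
[0023] In the case of a mono downmix (N_dmx = 1), the elements are calculated from the DMG parameters alone as follows:
[0024]
$$d_{1,i} = 10^{0.05 \, \mathrm{DMG}_i}$$
[0025] where
[0026]
$$\mathrm{DMG}_i = 10 \log_{10} \left( d_{1,i}^2 + \varepsilon \right)$$
[0027] applies.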
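As a quick plausibility check of the relation between downmix gains and the DMG and DCLD parameters, the following sketch converts parameters to stereo gains and back. It assumes the standard SAOC relations (DMG as overall gain in dB, DCLD as left/right level difference in dB); the function names and `EPS` are illustrative assumptions.

```python
import math

EPS = 1e-9  # assumed additive constant that avoids log10(0)


def stereo_gains(dmg_db, dcld_db):
    """Stereo downmix gains (d1, d2) of one object from DMG and DCLD in dB."""
    gain = 10 ** (0.05 * dmg_db)
    ratio = 10 ** (0.1 * dcld_db)  # power ratio between channel 1 and 2
    return (gain * math.sqrt(ratio / (1 + ratio)),
            gain * math.sqrt(1 / (1 + ratio)))


def stereo_params(d1, d2):
    """Inverse mapping; d2 must be non-zero for DCLD to be defined."""
    dmg = 10 * math.log10(d1 * d1 + d2 * d2 + EPS)
    dcld = 10 * math.log10((d1 * d1) / (d2 * d2))
    return dmg, dcld


def mono_gain(dmg_db):
    """Mono downmix gain of one object from DMG alone."""
    return 10 ** (0.05 * dmg_db)


d1, d2 = stereo_gains(-3.0, 6.0)
dmg, dcld = stereo_params(d1, d2)
```

The round trip recovers the original parameters up to the small offset introduced by the additive constant, which shows that the two parameters fully describe a pair of positive stereo gains.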
[0028] Spatial Audio Object Coding 3D (SAOC-3D). Spatial audio object coding for 3D audio reproduction (SAOC-3D) [MPEGH_AES, MPEGH_IEEE, MPEGH_STD, SAOC_3D_PAT] is an extension of the above MPEG SAOC technology that compresses and renders both channel signals and object signals in a highly bitrate-efficient manner.
[0029] The main differences from SAOC are as follows:
• While the original SAOC supports at most two downmix channels, SAOC-3D can map the multi-object input to any number of downmix channels (and associated side information).
• Rendering to multi-channel output is performed directly, in contrast to classical SAOC, which used MPEG Surround as the multi-channel output processor.
• Some tools, such as the residual coding tool, have been removed.
[0030] Despite these differences, SAOC-3D is the same as SAOC from a parameter perspective. The SAOC-3D decoder, like the SAOC decoder, receives the multi-channel downmix X, the covariance matrix E, the rendering matrix R, and the downmix matrix D.
[0031] The rendering matrix R is defined by the input channels and input objects, and is received from the format converter (channel) and the object renderer (object), respectively.
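[0032] The downmix matrix D is defined by its elements d_{i,j}, where i refers to the channel index of the downmix signal and j refers to the object index. The elements are calculated from the downmix gains (DMG) as follows.
[0033]
$$d_{i,j} = 10^{0.05 \, \mathrm{DMG}_{i,j}}$$
[0034] where
[0035]
$$\mathrm{DMG}_{i,j} = 10 \log_{10} \left( d_{i,j}^2 + \varepsilon \right)$$
[0036] applies.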
[0037] The output covariance matrix C of size N_out × N_out is defined as C = R E R^*, where R^* denotes the conjugate transpose of the rendering matrix R.
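To make the dimensions concrete, here is a minimal sketch that evaluates the product of a rendering matrix, the object covariance matrix, and the conjugate transpose of the rendering matrix for two objects and two output channels. The helper names and the matrix values are made-up illustrations.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conj_transpose(a):
    """Conjugate transpose of a nested-list matrix."""
    return [[complex(a[i][j]).conjugate() for i in range(len(a))]
            for j in range(len(a[0]))]


def output_covariance(r, e):
    """C = R E R^* with R of size N_out x N and E of size N x N."""
    return matmul(matmul(r, e), conj_transpose(r))


E = [[1.0, 0.0], [0.0, 0.25]]  # two uncorrelated objects (illustrative)
R = [[1.0, 0.5], [0.0, 0.5]]   # object 0 to channel 0, object 1 to both
C = output_covariance(R, E)
```

The diagonal of C holds the target output channel energies, and the off-diagonal entries hold the target inter-channel covariance that results from object 1 being rendered to both channels.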
[0038] Related schemes. There are several other schemes that are essentially similar to SAOC as described above, but with slight differences.
• Binaural Cue Coding (BCC) for objects is described, for example, in [BCC2001] and is the predecessor of the SAOC technology.
• Joint Object Coding (JOC) and Advanced Joint Object Coding (A-JOC) perform functions similar to SAOC, delivering roughly separated objects on the decoder side without rendering them to a specific output loudspeaker layout [JOC_AES, AC4_AES]. This technique transmits the elements of the upmix matrix from the downmix to the separated objects as parameters (rather than OLDs).
[0039] Directional Audio Coding (DirAC). Another parametric approach is Directional Audio Coding. DirAC [Pulkki2009] is a perceptually motivated spatial sound reproduction technique. It assumes that, at any given time instant and for one critical band, the spatial resolution of the human auditory system is limited to decoding one cue for direction and another cue for interaural coherence.
[0040] Based on these assumptions, DirAC represents the spatial sound in one frequency band by crossfading two streams: an omnidirectional diffuse stream and a directional non-diffuse stream. The DirAC processing is performed in two phases, analysis and synthesis, as shown in Figures 12a and 12b.
[0041] In the DirAC analysis stage, a first-order coincident microphone in B-format is considered as the input, and the diffuseness and the direction of arrival of the sound are analyzed in the frequency domain.
[0042] In the DirAC synthesis stage, the sound is split into two streams: a non-diffuse stream and a diffuse stream. The non-diffuse stream is reproduced as point sources using amplitude panning, which can be done using vector-based amplitude panning (VBAP) [Pulkki1997]. The diffuse stream, which is responsible for the sensation of envelopment, is generated by feeding mutually decorrelated signals to the loudspeakers.
[0043] The analysis stage in Figure 12a comprises a band filter 1000, an energy estimator 1001, an intensity estimator 1002, time-averaging elements 999a and 999b, a diffuseness calculator 1003, and a direction calculator 1004. The calculated spatial parameters are a diffuseness value between 0 and 1 for each time / frequency tile and a direction-of-arrival parameter for each time / frequency tile, generated by block 1004. In Figure 12a, the direction parameter comprises an azimuth angle and an elevation angle and indicates the direction of arrival of the sound relative to the reference or listening position, in particular relative to the position of the microphone from which the four component signals input to the band filter 1000 are captured. These component signals are first-order Ambisonics components, comprising an omnidirectional component W and three directional components X, Y, and Z, as shown in Figure 12a.
[0044] The DirAC synthesis stage shown in Figure 12b includes a bandpass filter 1005 that generates a time / frequency representation of the B-format microphone signals W, X, Y, and Z. The signals corresponding to the individual time / frequency tiles are input to a virtual microphone stage 1006 that generates a virtual microphone signal for each channel. In particular, to generate the virtual microphone signal for the center channel, for example, a virtual microphone is directed towards the center channel, and the resulting signal is the corresponding component signal for the center channel. The signal is then processed via a direct signal branch 1015 and a diffuse signal branch 1014. Both branches are equipped with corresponding gain adjusters or amplifiers that are controlled by diffuseness values derived from the original diffuseness parameter in blocks 1007 and 1008 and further processed in blocks 1009 and 1010 to obtain a specific microphone compensation.
[0045] The component signal in the direct signal branch 1015 is also gain-adjusted using a gain parameter derived from the direction parameter consisting of azimuth and elevation angles. In particular, these angles are input to a VBAP (vector base amplitude panning) gain table 1011. The result is input to a loudspeaker gain averaging stage 1012 and a further normalizer 1013 for each channel, and the resulting gain parameter is forwarded to the amplifier or gain adjuster in the direct signal branch 1015. The diffuse signal generated at the output of the decorrelator 1016 is combined with the direct signal or non-diffuse stream in the combiner 1017, and the other subbands are then added in a further combiner 1018, which may be, for example, a synthesis filter bank. Thus, a loudspeaker signal for a particular loudspeaker is generated, and the same procedure is performed for the other channels of the other loudspeakers 1019 in a particular loudspeaker configuration.
[0046] Figure 12b shows the high-quality version of DirAC synthesis. Here, the synthesizer receives all B-format signals, from which a virtual microphone signal is calculated for each loudspeaker direction. The directional pattern used is typically a dipole. The virtual microphone signals are then modified non-linearly according to the metadata, as described for branches 1016 and 1015. The low-bitrate version of DirAC is not shown in Figure 12b. In this version, only one audio channel is transmitted, and the difference in processing is that all virtual microphone signals are replaced by this single received audio channel. The virtual microphone signals are split into two streams, a diffuse stream and a non-diffuse stream, which are processed separately. Vector-based amplitude panning (VBAP) is used to reproduce the non-diffuse sound as a point source. In panning, a monophonic sound signal is applied to a subset of the loudspeakers after being multiplied by loudspeaker-specific gain coefficients. The gain coefficients are calculated from the loudspeaker setup and the specified panning direction. In the low-bitrate version, the input signal is simply panned to the direction implied by the metadata. In the high-quality version, each virtual microphone signal is multiplied by the corresponding gain coefficient, which achieves the same effect as panning but is less prone to nonlinear artifacts.
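The computation of loudspeaker-specific gain coefficients from the loudspeaker setup and the panning direction can be sketched for the simplest case, a single loudspeaker pair in the horizontal plane. This is a minimal two-dimensional illustration of the general VBAP idea, not the gain table 1011 of the figure; the function name, the default loudspeaker angles of ±30 degrees, and the energy normalization are assumptions.

```python
import math


def vbap_pair_gains(pan_deg, left_deg=30.0, right_deg=-30.0):
    """Gains (g_left, g_right) that place a source at pan_deg between a pair.

    Solves p = g_left * l_left + g_right * l_right for the unit vectors of
    the pan direction and the two loudspeakers, then normalizes the gains
    to unit energy. Directions outside the pair give a negative gain.
    """
    def unit(deg):
        rad = math.radians(deg)
        return math.cos(rad), math.sin(rad)

    (l1x, l1y), (l2x, l2y) = unit(left_deg), unit(right_deg)
    px, py = unit(pan_deg)
    det = l1x * l2y - l2x * l1y
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker directions must not be collinear")
    g1 = (px * l2y - l2x * py) / det  # Cramer's rule for the 2x2 system
    g2 = (l1x * py - px * l1y) / det
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm


center = vbap_pair_gains(0.0)
hard_left = vbap_pair_gains(30.0)
```

A source panned to the middle of the pair receives equal gains, and a source panned exactly to one loudspeaker is reproduced by that loudspeaker alone.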
[0047] The purpose of diffuse sound synthesis is to create the perception of sound surrounding the listener. In the low-bitrate version, the diffuse stream is reproduced by decorrelating the input signal and playing it back through all loudspeakers. In the high-quality version, the virtual microphone signals of the diffuse stream are already incoherent to some degree and need to be decorrelated only mildly.
[0048] The DirAC parameters, also known as spatial metadata, consist of tuples of diffuseness and direction, where the direction is represented in spherical coordinates by two angles: azimuth and elevation. When both the analysis and synthesis stages are performed on the decoder side, the time-frequency resolution of the DirAC parameters can be the same as that of the filter bank used for DirAC analysis and synthesis, i.e., a distinct parameter set can be selected for each time slot and frequency bin of the filter bank representation of the audio signal.
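The split of one time / frequency tile into the two streams can be sketched as follows. The sketch assumes the commonly used energy-preserving weights, the square root of one minus the diffuseness for the direct part and the square root of the diffuseness for the diffuse part; the source only states that the gain adjusters are controlled by the diffuseness, so these weights and the function name are assumptions.

```python
import math


def split_streams(sample, diffuseness):
    """Return (direct, diffuse) parts of one tile value.

    diffuseness must lie in [0, 1]; 0 means fully directional sound and
    1 means fully diffuse sound. The diffuse part would still have to be
    decorrelated before playback, which is omitted here.
    """
    if not 0.0 <= diffuseness <= 1.0:
        raise ValueError("diffuseness must be between 0 and 1")
    direct = math.sqrt(1.0 - diffuseness) * sample
    diffuse = math.sqrt(diffuseness) * sample
    return direct, diffuse


direct, diffuse = split_streams(2.0, 0.25)
```

With these weights the summed energy of the two streams equals the energy of the input tile for any diffuseness value.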
[0049] To make the DirAC paradigm usable in spatial audio coding and teleconferencing scenarios, some work was done to reduce the size of the metadata [Hirvonen2009].
[0050] [WO2019068638] introduced a universal spatial audio coding system based on DirAC. In contrast to conventional DirAC, which was designed for B-format (first-order Ambisonics) input, this system can accept first-order or higher-order Ambisonics, multi-channel, or object-based audio inputs, and also supports mixed-type input signals. All signal types are efficiently coded and transmitted either individually or in combination. The former combines the different representations at the renderer (decoder side), while the latter uses an encoder-side combination of the different audio representations in the DirAC domain.
[0051] Compatibility with the DirAC framework. This embodiment is based on the unified framework for arbitrary input types presented in [WO2019068638] and aims to address the problem that the DirAC parameters (direction and diffuseness) cannot be applied efficiently to object inputs (similar to what [WO2020249815] does for multi-channel content). In fact, the diffuseness parameter is not needed at all, but it has been found that a single direction cue per time / frequency unit is insufficient to reproduce high-quality object content. Therefore, this embodiment proposes using multiple direction cues per time / frequency unit and thus introduces an adapted parameter set that replaces the conventional DirAC parameters in the case of object inputs.
[0052] A flexible system with a low bitrate. In contrast to DirAC, which uses scene-based representation from the listener's perspective, SAOC and SAOC-3D are designed for channel and object-based content, with parameters describing the relationships between channels and objects. To use scene-based representation for object inputs, maintain compatibility with the DirAC renderer, and ensure efficient representation and high-quality reproduction, a tailored parameter set is required to signal multiple directional cues.
[0053] A key objective of this embodiment was to find a way to code object inputs efficiently at a low bitrate while scaling well to an increasing number of objects. Coding each object signal individually would not provide such scalability, because each additional object would significantly increase the overall bitrate. If the increasing number of objects exceeds what the available bitrate can accommodate, the output signal degrades directly and audibly. This degradation is another argument supporting this embodiment. [Prior art documents] [Patent Documents]
[0054] [Patent Document 1] WO2019068638 [Patent Document 2] WO2020249815 [Summary of the invention] [Problems that the invention aims to solve]
[0055] The object of the present invention is to provide an improved concept for encoding multiple audio objects or decoding encoded audio signals.
[0056] This objective is achieved by the encoding device of claim 1, the decoder of claim 18, the encoding method of claim 28, the decoding method of claim 29, the computer program of claim 30, or the encoded audio signal of claim 31. [Means for solving the problem]
[0057] In one aspect of the present invention, the present invention is based on the discovery that at least two associated audio objects are defined for one or more frequency bins among a plurality of frequency bins, and parameter data associated with these at least two associated objects is included on the encoder side and used on the decoder side to obtain a high-quality and efficient audio encoding / decoding concept.
[0058] According to a further aspect of the present invention, the present invention is based on the discovery that a specific downmix adapted to the directional information associated with the audio objects can be performed. Each object has associated directional information that is valid for the entire object, i.e., for all frequency bins within a time frame, and this information is used to downmix the object into a number of transport channels. The use of directional information is equivalent, for example, to generating the transport channels as virtual microphone signals with specific adjustable characteristics.
[0059] On the decoder side, in certain embodiments, a specific covariance synthesis is performed that is particularly suited to high-quality synthesis because it is not plagued by artifacts introduced by a decorrelator. In other embodiments, an advanced covariance synthesis is used that relies on specific improvements over standard covariance synthesis to improve audio quality and / or reduce the computational cost of calculating the mixing matrix used within the covariance synthesis.
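A direction-dependent downmix of this kind can be sketched with two virtual microphones. The sketch assumes cardioid patterns pointing to the left (+90 degrees azimuth) and right (-90 degrees azimuth) and ignores elevation; the pattern, the microphone orientations, and the function names are illustrative assumptions and are not taken from the source.

```python
import math


def cardioid_weight(azimuth_deg, mic_azimuth_deg):
    """Gain of a virtual cardioid microphone for a source at azimuth_deg."""
    offset = math.radians(azimuth_deg - mic_azimuth_deg)
    return 0.5 + 0.5 * math.cos(offset)


def downmix(objects, azimuths_deg, mic_azimuths_deg=(90.0, -90.0)):
    """Mix object waveforms into one transport channel per virtual microphone.

    objects: list of equally long sample lists, one per object.
    azimuths_deg: one azimuth per object, valid for the whole frame, so the
    same weight applies to every frequency bin of that frame.
    """
    length = len(objects[0])
    channels = []
    for mic in mic_azimuths_deg:
        weights = [cardioid_weight(az, mic) for az in azimuths_deg]
        channels.append([sum(w * obj[t] for w, obj in zip(weights, objects))
                         for t in range(length)])
    return channels


objects = [[1.0, 1.0], [0.0, 2.0]]  # two toy objects, two samples each
left, right = downmix(objects, azimuths_deg=[90.0, 0.0])
```

Here the object at +90 degrees ends up only in the left transport channel, while the frontal object is split equally between both channels. Changing the pattern or the orientation of the virtual microphones changes the characteristics of the transport channels.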
[0060] However, even in more classical synthesis, where audio rendering is performed by explicitly determining the individual contributions within the time / frequency bins based on transmitted selection information, the audio quality is superior to the conventional object coding or channel downmixing approaches. In such situations, each time / frequency bin has object identification information, and when performing audio rendering, i.e., when considering the directional contribution of each object, this object identification is used to look up the direction associated with this object information in order to determine the gain value of the individual output channel for each time / frequency bin. Thus, if there is only one object associated with a time / frequency bin, then only the gain value of this single object for each time / frequency bin is determined based on the object ID and the "codebook" of the directional information of the associated object.
[0061] However, if there are multiple related objects in a time / frequency bin, the gain values of each related object are calculated, and the corresponding time / frequency bins of the transport channels are distributed to the corresponding output channels, which are governed by a user-provided output format such as a specific channel format, for example stereo or 5.1. The gain values may be used for covariance synthesis, i.e., for applying a mixing matrix that mixes the transport channels into the output channels. Alternatively, the individual contribution of each object in the time / frequency bin may be determined explicitly by multiplying the gain values by the corresponding time / frequency bins of one or more transport channels and then summing the contributions for each output channel in the corresponding time / frequency bin, possibly enhanced by adding diffuse signal components. In either case, the output audio quality is improved because of the flexibility gained by determining one or more related objects per frequency bin.
[0062] This determination can be made very efficiently because only one or more object IDs per time / frequency bin need to be encoded and sent to the decoder, together with the directional information per object, which can also be transmitted very efficiently. This is because, for a single frame, there is only a single piece of directional information per object for all frequency bins.
[0063] Therefore, regardless of whether the synthesis is performed using preferably enhanced covariance synthesis or using a combination of explicit transport channel contributions for each object, a highly efficient and high-quality object downmix is obtained, which is preferably enhanced by using a specific object-direction-dependent downmix that depends on downmix weights that reflect the generation of the transport channels as virtual microphone signals.
[0064] A configuration involving two or more related objects per time / frequency bin can preferably be combined with a configuration that performs a specific direction-dependent downmix of the objects to the transport channel. However, both configurations can also be applied independently of each other. Furthermore, in certain embodiments, covariance synthesis is performed with two or more related objects per time / frequency bin, but advanced covariance synthesis and advanced transport channel-to-output channel upmixing can also be performed by sending only one object ID per time / frequency bin.
[0065] Furthermore, regardless of whether there are one or more associated objects per time / frequency bin, upmixing can be performed either by calculating the mixing matrix within a standard or enhanced covariance synthesis, or by individually determining the contributions in the time / frequency bins based on the object identification, which is used to retrieve the specific directional information from a directional "codebook" and thereby to determine the gain values of the corresponding contributions. If there are two or more associated objects per time / frequency bin, these contributions are summed to obtain the complete contribution for each time / frequency bin. The output of this summing step is equivalent to the output of the mixing matrix application, and a final synthesis filter bank operation is performed to generate the time-domain output channel signals in the corresponding output format.
[0066] Preferred embodiments of the present invention are described below with reference to the accompanying drawings. [Brief explanation of the drawing]
[0067]
[Figure 1a] This figure shows an implementation of an audio encoder according to a first aspect, having at least two associated objects for each time / frequency bin.
[Figure 1b] This figure shows an implementation of an encoder according to a second embodiment, which has a direction-dependent downmix of the objects.
[Figure 2] This figure shows a preferred implementation of the encoder according to the second embodiment.
[Figure 3] This figure shows a preferred implementation of the encoder according to the first embodiment.
[Figure 4] This figure shows preferred implementations of the decoder according to the first and second embodiments.
[Figure 5] This figure shows a preferred implementation of the covariance synthesis process shown in Figure 4.
[Figure 6a] This figure shows an implementation of the decoder according to the first embodiment.
[Figure 6b] This figure shows a decoder according to the second embodiment.
[Figure 7a] This is a flowchart showing the determination of parameter information according to the first embodiment.
[Figure 7b] This figure shows a preferred implementation for further determining parametric data.
[Figure 8] (a) A diagram showing the time / frequency representation of a high-resolution filter bank. (b) A diagram showing the transmission of the relevant side information of frame J in preferred implementations of the first and second embodiments. (c) A diagram showing the "direction codebook" contained in the encoded audio signal.
[Figure 9a] This figure shows a preferred encoding method according to the second embodiment.
[Figure 9b] This figure shows an implementation of a static downmix according to the second embodiment.
[Figure 9c] This figure shows an implementation of a dynamic downmix according to the second embodiment.
[Figure 9d] This figure shows a further embodiment of the second aspect.
[Figure 10a] This figure shows a flowchart of a preferred implementation of the decoder side in the first embodiment.
[Figure 10b] This figure shows a preferred implementation of the output channel calculation in Figure 10a, which has a sum of contributions for each output channel.
[Figure 10c] This figure shows a preferred method for determining power values for multiple objects according to the first embodiment.
[Figure 10d] This figure shows an embodiment of the output channel calculation in Figure 10a, which uses covariance synthesis that depends on the calculation and application of a mixing matrix.
[Figure 11] This figure shows several embodiments of the advanced calculation of the mixing matrix for a time / frequency bin.
[Figure 12a] This figure shows a conventional DirAC encoder.
[Figure 12b] This figure shows a conventional DirAC decoder.
[Modes for carrying out the invention]
[0068] Figure 1a shows a device for encoding multiple audio objects, which receives at its input the raw audio objects and / or metadata for the audio objects. The encoder includes an object parameter calculator 100 that provides parameter data for at least two related audio objects in a time / frequency bin, and this data is transferred to an output interface 200. In particular, the object parameter calculator calculates parameter data for at least two related audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, where specifically the number of the at least two related audio objects is less than the total number of the multiple audio objects. Thus, the object parameter calculator 100 actually performs a selection and does not simply indicate that all objects are related. In a preferred embodiment, the selection is made by relevance, which is determined by an amplitude-related measure such as amplitude, power, loudness, or another measure obtained by raising the amplitude to a power different from 1, preferably greater than 1. Then, if a certain number of related objects is allowed for a time / frequency bin, the objects with the most relevant characteristics, i.e., the objects with the highest power among all objects, are selected, and data about these selected objects is included in the parameter data.
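The selection by an amplitude-related measure can be sketched for a single time / frequency bin as follows. The function name, the default of two selected objects, and the default exponent of 2 (power) are illustrative assumptions; the source leaves the exact measure open.

```python
def select_relevant_objects(bin_values, num_relevant=2, exponent=2.0):
    """Pick the objects with the largest amplitude-related measure in one bin.

    bin_values: one (possibly complex) spectral value per object.
    exponent: power applied to the amplitude; 2.0 corresponds to power.
    Returns (object indices, measures), strongest object first.
    """
    measure = [abs(v) ** exponent for v in bin_values]
    ranked = sorted(range(len(measure)), key=lambda i: measure[i],
                    reverse=True)
    chosen = ranked[:num_relevant]
    return chosen, [measure[i] for i in chosen]


ids, powers = select_relevant_objects([0.1, 2.0, -1.0 + 0j, 0.5])
```

Only the indices of the selected objects and data derived from their measures would then enter the parameter data, so the side information does not grow with the total number of objects.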
[0069] The output interface 200 is configured to output an encoded audio signal containing information about the parameter data for the at least two associated audio objects in one or more frequency bins. Depending on the implementation, the output interface may receive other data and introduce it into the encoded audio signal, such as a downmix of the objects, one or more transport channels representing a downmix of the objects, or additional parameters or object waveform data, either in a mixed representation in which several objects are downmixed or in a separate representation for other objects. In the latter case, an object is directly introduced or "copied" into a corresponding transport channel.
[0070] Figure 1b shows a preferred implementation of an apparatus for encoding multiple audio objects according to a second aspect, where the audio objects are received along with associated object metadata indicating directional information for the multiple audio objects, i.e., one item of directional information for each object or, if a group of objects is associated with the same directional information, for each group of objects. The audio objects are input to a downmixer 400 that downmixes the multiple audio objects to obtain one or more transport channels. Furthermore, a transport channel encoder 300 is provided that encodes the one or more transport channels to obtain one or more encoded transport channels, which are input to an output interface 200. In particular, the downmixer 400 is connected to an object directional information provider 110 that receives at its input any data from which the object metadata can be derived and outputs the directional information actually used by the downmixer 400. The directional information transferred from the object directional information provider 110 to the downmixer 400 is preferably dequantized directional information, i.e., the same directional information that is available on the decoder side. For this purpose, the object directional information provider 110 is configured to derive, extract, or obtain unquantized object metadata and then to quantize the object metadata to derive quantized object metadata that, in a preferred embodiment, represents the quantization index provided to the output interface 200 within the "other data" shown in Figure 1b. Furthermore, the object directional information provider 110 is configured to dequantize the quantized object directional information to obtain the actual directional information transferred from block 110 to the downmixer 400.
[0071] Preferably, the output interface 200 is configured to receive the parameter data of the audio objects, object waveform data, an identification of one or more related objects per time / frequency bin, and, as described above, the quantized directional data.
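The reason for feeding dequantized directions to the downmixer can be illustrated with a small sketch. It assumes a uniform scalar quantizer with a 5 degree step for azimuth and elevation; the step size, the index layout, and the function names are hypothetical, since the source does not specify the quantizer.

```python
def quantize_direction(azimuth_deg, elevation_deg, step_deg=5.0):
    """Map a direction to integer indices (hypothetical uniform quantizer).

    Azimuth is expected in [-180, 180] and elevation in [-90, 90] degrees.
    Note that -180 and 180 degrees get different indices in this sketch
    although they describe the same direction.
    """
    az_idx = round((azimuth_deg + 180.0) / step_deg)
    el_idx = round((elevation_deg + 90.0) / step_deg)
    return az_idx, el_idx


def dequantize_direction(az_idx, el_idx, step_deg=5.0):
    """Reconstruct the direction exactly as a decoder would see it."""
    return az_idx * step_deg - 180.0, el_idx * step_deg - 90.0


indices = quantize_direction(32.0, -7.0)      # sent to the output interface
used_by_downmixer = dequantize_direction(*indices)
```

Because the encoder downmixes with the reconstructed values rather than the original metadata, the encoder and the decoder operate on identical directions, and the quantization error does not cause a mismatch between downmix and rendering.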
[0072] Next, further embodiments are presented. A parametric approach is presented for coding audio object signals, enabling efficient transmission at low bitrates and high-quality playback at the consumer side. Based on the DirAC principle of considering one directional cue per critical band and time instant (referred to as a time / frequency tile), the most dominant object is determined for each time / frequency tile of the time / frequency representation of the input signals. Since this proved insufficient for object inputs, an additional, second most dominant object is determined for each time / frequency tile, and power ratios are calculated based on these two objects to determine the influence of each of the two objects on the time / frequency tile considered. Note: It is also conceivable to consider more than two most dominant objects per time / frequency unit, especially when the number of input objects increases. For simplicity, the following explanation is based in most cases on two dominant objects per time / frequency unit.
[0073] Therefore, the parametric side information sent to the decoder includes the following:
• Power ratios calculated for a subset of the relevant (dominant) objects in each time / frequency tile (or parameter band).
• Object indices representing the subset of the relevant objects for each time / frequency tile (or parameter band).
• Directional information associated with the object indices and provided for each frame (each time-domain frame contains multiple parameter bands, and each parameter band contains multiple time / frequency tiles).
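The three kinds of side information listed above can be sketched as a small data structure. The sketch assumes that the power ratios are normalized over the selected objects of a tile so that they sum to 1; this normalization, the class and function names, and the toy values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TileSideInfo:
    object_ids: list    # indices of the dominant objects in this tile
    power_ratios: list  # one ratio per selected object, summing to 1


def power_ratios(powers, floor=1e-12):
    """Normalize the powers of the selected objects (assumed definition)."""
    total = sum(powers)
    if total < floor:  # silent tile: fall back to equal ratios
        return [1.0 / len(powers)] * len(powers)
    return [p / total for p in powers]


def frame_side_info(tile_powers, directions_deg, num_relevant=2):
    """Collect side information for one frame.

    tile_powers: per tile, a list with one power value per object.
    directions_deg: per object, one (azimuth, elevation) pair per frame.
    """
    tiles = []
    for powers in tile_powers:
        ids = sorted(range(len(powers)), key=lambda i: powers[i],
                     reverse=True)[:num_relevant]
        tiles.append(TileSideInfo(ids, power_ratios([powers[i] for i in ids])))
    return {"tiles": tiles, "directions": list(directions_deg)}


info = frame_side_info(
    tile_powers=[[4.0, 1.0, 0.0], [0.0, 1.0, 3.0]],
    directions_deg=[(30.0, 0.0), (-30.0, 0.0), (110.0, 10.0)],
)
```

The directions are stored once per object and frame, while object indices and power ratios are stored per tile, which mirrors the different update rates of the three items.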
[0074] Directional information is available via an input metadata file associated with the audio object signal. The metadata may be specified, for example, on a frame-by-frame basis. In addition to the side information, a downmix signal combining the input object signals is also sent to the decoder.
[0075] During the rendering phase, the transmitted directional information (derived via the object index) is used to pan the transmitted downmix signal (more commonly the transport channel) in the appropriate direction. The downmix signal is distributed to the two relevant object directions based on the transmitted power ratio, which is used as a weighting factor. This process is performed for each time / frequency tile of the time / frequency representation of the decoded downmix signal.
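The rendering of one tile can be sketched for a mono transport channel and a stereo output. The sketch assumes that the square root of each power ratio is used as the amplitude weight and that a simple sine/cosine pan law maps the azimuth to stereo gains; both choices, as well as the function names, are assumptions and stand in for whatever panning rule the actual output format requires.

```python
import math


def stereo_pan_gains(azimuth_deg):
    """Hypothetical pan law: +90 deg is fully left, -90 deg fully right."""
    az = max(-90.0, min(90.0, azimuth_deg))
    angle = math.radians((90.0 - az) / 2.0)  # 0 to 90 degrees
    return math.cos(angle), math.sin(angle)  # (left, right)


def render_tile(downmix_value, object_ids, ratios, directions_deg):
    """Distribute one downmix tile to the directions of its dominant objects.

    directions_deg maps an object index to (azimuth, elevation); elevation
    is ignored in this stereo sketch.
    """
    out = [0j, 0j]
    for obj_id, ratio in zip(object_ids, ratios):
        azimuth, _elevation = directions_deg[obj_id]
        weight = math.sqrt(ratio)  # power ratio to amplitude weight
        for channel, gain in enumerate(stereo_pan_gains(azimuth)):
            out[channel] += weight * gain * downmix_value
    return out


directions = {0: (90.0, 0.0), 1: (-90.0, 0.0)}
left, right = render_tile(1 + 0j, [0, 1], [0.75, 0.25], directions)
```

In this example the object with the larger power ratio dominates the left channel and the other object the right channel, and the total output energy equals the energy of the downmix tile.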
[0076] This section outlines the encoder-side processing, followed by a detailed explanation of parameter and downmix calculations. The audio encoder receives one or more audio object signals. Each audio object signal is associated with a metadata file describing its object properties. In this embodiment, the object properties described in the associated metadata file correspond to directional information provided in frames, where one frame corresponds to 20 milliseconds. Each frame is identified by a frame number, which is also included in the metadata file. The directional information is given as azimuth and elevation information, with the azimuth taking a value of [-180, 180] degrees and the elevation taking a value of [-90, 90] degrees. Other properties provided in the metadata include distance, spread, and gain. These characteristics are not considered in this embodiment.
[0077] The information provided in the metadata file is used together with the actual audio object files to create a set of parameters that is sent to the decoder and used to render the final audio output file. More specifically, the encoder estimates the parameters, i.e., power ratios, of a dominant subset of objects for each given time / frequency tile. This dominant subset of objects is represented by object indices, which are also used to identify the direction of the objects. These parameters are sent to the decoder along with the transport channels and the directional metadata.
[0078] An overview of the encoder is shown in Figure 2, where the transport channel includes a downmix signal calculated from the input object files and the directional information provided in the input metadata. The number of transport channels is always less than the number of input object files. In one embodiment of the encoder, the encoded audio signal is represented by the encoded transport channels, and the encoded parametric side information is represented by encoded object indices, encoded power ratios, and encoded directional information. The encoded transport channels and the encoded parametric side information together form the bitstream output by the multiplexer 220. In particular, the encoder includes a filter bank 102 that receives the input object audio files. Furthermore, the object metadata files are provided to a directional information extraction block 110a. The output of block 110a is input to a directional information quantization block 110b, which outputs the directional information to a downmixer 400 that performs the downmix calculation. Furthermore, the quantized directional information, i.e., the quantization indices, is transferred from block 110b to a directional information encoding block 202, which preferably performs some kind of entropy encoding to further reduce the required bitrate.
[0079] Furthermore, the output of filter bank 102 is input to signal power calculation block 104, and the output of signal power calculation block 104 is input to object selection block 106 and also to power ratio calculation block 108. Power ratio calculation block 108 is also connected to object selection block 106 so that the power ratios, i.e., the combined values, are calculated only for the selected objects. In block 210, the calculated power ratios or combined values are quantized and encoded. As will be outlined later, power ratios are preferred because they save the transmission of one power data item. However, in other embodiments where this saving is not necessary, the actual signal powers determined by block 104, or other values derived from the signal powers, can be input to the quantizer and encoder under the control of object selector 106 instead of the power ratios. In that case, power ratio calculation 108 is not necessary, and object selection 106 ensures that only the relevant parametric data, i.e., the power-related data of the relevant objects, is input to block 210 for quantization and encoding.
[0080] Comparing Figure 1a with Figure 2, blocks 102, 104, 110a, 110b, 106, and 108 are preferably included in the object parameter calculator 100 of Figure 1a, and blocks 202, 210, and 220 are preferably included in the output interface block 200 of Figure 1a.
[0081] Furthermore, the corecoder 300 in Figure 2 corresponds to the transport channel encoder 300 in Figure 1b, the downmix calculation block 400 corresponds to the downmixer 400 in Figure 1b, and the object direction information provider 110 in Figure 1b corresponds to blocks 110a and 110b in Figure 2. In addition, the output interface 200 in Figure 1b is preferably implemented in the same manner as the output interface 200 in Figure 1a and includes blocks 202, 210, and 220 in Figure 2.
[0082] Figure 3 shows a modified encoder where the downmix calculation is optional and independent of the input metadata. In this modification, the input audio files can be fed directly to the corecoder, which creates the transport channels from them. In that case, the number of transport channels corresponds to the number of input object files. This is particularly interesting when the number of input objects is one or two. For a larger number of objects, a downmix signal is still used to reduce the amount of data transmitted.
[0083] In Figure 3, similar reference numerals refer to similar functions in Figure 2. This is valid not only for Figures 2 and 3 but also for all other figures described herein. Unlike Figure 2, Figure 3 performs the downmix calculation 400 without directional information. Thus, the downmix calculation can be, for example, a static downmix using a known downmix matrix, or an energy-dependent downmix that does not depend on the directional information associated with the objects contained in the input object audio file. Nevertheless, the directional information is extracted in block 110a, quantized in block 110b, and the quantized values are transferred to the directional information encoder 202 for the purpose of having the directional information encoded in an encoded audio signal, which is, for example, a binary-encoded audio signal forming a bitstream.
[0084] If the number of input audio object files is not very large, or if there is sufficient available transmission bandwidth, the downmix calculation block 400 can be omitted, allowing the input audio object files to directly represent the transport channels encoded by the core encoder. In such an implementation, blocks 104, 106, 108, and 210 are also not required. However, a preferred implementation is a mixed implementation in which some objects are directly introduced into transport channels, while others are downmixed into one or more transport channels. In such a situation, all the blocks shown in Figure 3 are required to generate a bitstream having one or more objects directly in the encoded transport channels and one or more transport channels generated by the downmixer 400 in either Figure 2 or Figure 3.
[0085] Parameter calculation Time-domain audio signals, including all input object signals, are converted to the time / frequency domain using a filter bank. For example, the CLDFB (Complex Low-Delay Filter Bank) analysis filter converts a 20-millisecond frame (equivalent to 960 samples at a 48 kHz sampling rate) into 16x60 time / frequency tiles with 16 time slots and 60 frequency bands. For each time / frequency unit, the instantaneous signal power is calculated as follows: $P_i(k,n) = |X_i(k,n)|^2$, where k is the frequency band index, n is the time slot index, and i is the object index. Since sending the parameters for each time / frequency tile is very costly in terms of the final bitrate, grouping is used to calculate the parameters for a reduced number of time / frequency tiles. For example, 16 time slots can be grouped into one time slot, and 60 frequency bands can be grouped into 11 bands based on a psychoacoustic scale. This reduces the initial size of 16x60 to 1x11, which corresponds to 11 so-called parameter bands. The instantaneous signal power values are summed based on the grouping to obtain the signal power in the reduced dimension.
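[0086]
$$P_i(l,m) = \sum_{k=B_S(l)}^{B_E(l)} \sum_{n=0}^{T} P_i(k,n)$$
[0087] In the formula, l is the parameter band index, m is the grouped time slot index, T corresponds to 15 in this example, and $B_S$ and $B_E$ define the boundaries of the parameter band.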
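The grouping described above can be sketched as follows. This is a minimal illustration, not the codec's implementation: the function name `band_powers`, the inclusive band borders, and the toy 4x2 input are assumptions made for the example.

```python
def band_powers(X, band_edges):
    """Sum |X(k, n)|^2 over all time slots and over each parameter band.

    X          -- X[k][n], complex filter-bank samples of one object
                  (k: frequency band index, n: time slot index)
    band_edges -- list of (B_S, B_E) band borders, assumed inclusive
    """
    powers = []
    for b_start, b_end in band_edges:
        total = 0.0
        for k in range(b_start, b_end + 1):
            for sample in X[k]:
                total += abs(sample) ** 2   # instantaneous power P_i(k, n)
        powers.append(total)
    return powers


# Toy input: 4 frequency bands x 2 time slots instead of 60 x 16.
X = [[1 + 0j, 1j], [2 + 0j, 0j], [0j, 0j], [3 + 4j, 1 + 0j]]
edges = [(0, 1), (2, 3)]        # two hypothetical parameter bands
P = band_powers(X, edges)
```

In the example of the text, the same loop would run over 60 bands and 16 slots and return 11 values per object.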
[0088] To determine the most dominant subset of objects for calculating the parameters, the instantaneous signal power values of all N input audio objects are sorted in descending order. In this embodiment, two most dominant objects are determined, and their corresponding object indices, ranging from 0 to N-1, are stored as part of the transmitted parameters. Furthermore, a power ratio relating the two dominant object signals is calculated.
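[0089]
$$PR_1(l,m) = \frac{P_1(l,m)}{P_1(l,m) + P_2(l,m)}, \qquad PR_2(l,m) = \frac{P_2(l,m)}{P_1(l,m) + P_2(l,m)}$$
[0090] Or, in a more general expression not limited to two objects:
[0091]
$$PR_i(l,m) = \frac{P_i(l,m)}{\sum_{j=1}^{S} P_j(l,m)}$$
[0092] In this context, S represents the number of dominant objects considered, the indices 1 to S refer to the dominant objects, and
[0093]
$$\sum_{i=1}^{S} PR_i(l,m) = 1$$
[0094] holds.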
[0095] For two dominant objects, a power ratio of 0.5 for each of the two objects means that both objects are equally present within the respective parameter band, while power ratios of 1 and 0 mean that one of the two objects is absent. These power ratios are stored as the second part of the transmitted parameters. Since the power ratios sum to 1, it is sufficient to transmit S-1 values instead of S.
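As a minimal sketch of the selection and power ratio calculation for one parameter band: the function name `dominant_objects` and the handling of an all-zero band are assumptions for illustration, not taken from the embodiment.

```python
def dominant_objects(powers, num_dominant=2):
    """Return (indices, ratios) of the strongest objects in one parameter band.

    powers -- list with one grouped signal power per input object
    """
    # Sort object indices by descending power and keep the strongest ones.
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    indices = order[:num_dominant]
    total = sum(powers[i] for i in indices)
    if total == 0.0:
        # Assumption: split silence evenly so the ratios still sum to 1.
        ratios = [1.0 / len(indices)] * len(indices)
    else:
        ratios = [powers[i] / total for i in indices]
    return indices, ratios


indices, ratios = dominant_objects([0.5, 6.0, 2.0, 0.1])
# Only the first S-1 ratios need to be transmitted; the last one follows
# from the fact that all ratios sum to 1.
transmitted = ratios[:-1]
```

Here objects 1 and 2 are selected, and a single ratio is enough to describe their relation.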
[0096] In addition to the object indices and power ratio values for each parameter band, the directional information of each object, extracted from the input metadata file, must be sent. This is done per frame, as the information is originally provided on a frame-by-frame basis (each frame consists of 11 parameter bands, or a total of 16x60 time / frequency tiles in the example described). Thus, the object index indirectly represents the direction of the object. Note: Since the power ratios sum to 1, the number of power ratios sent per parameter band can be reduced by 1. For example, if two relevant objects are considered, it is sufficient to send only one power ratio value.
[0097] Both the directional information and the power ratio values are quantized and combined with the object indices to form the parametric side information. This parametric side information is then encoded and multiplexed with the encoded transport channels / downmix signal into the final bitstream representation. A suitable trade-off between output quality and bitrate consumption is achieved, for example, by quantizing the power ratios using 3 bits per value. As one practical example, the directional information may be provided with a 5-degree angular resolution and then quantized with 7 bits per azimuth value and 6 bits per elevation value.
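The bit budget in this example can be checked with a small sketch. A 5-degree grid gives 73 azimuth values over [-180, 180] and 37 elevation values over [-90, 90], which fit into 7 and 6 bits. The uniform quantizers and all function names below are assumptions for illustration; the text does not specify the actual quantizer design.

```python
def quantize_angle(angle_deg, lowest_deg, step_deg=5.0):
    """Map an angle to the index of the nearest grid point."""
    return int(round((angle_deg - lowest_deg) / step_deg))


def dequantize_angle(index, lowest_deg, step_deg=5.0):
    """Map a grid index back to its angle in degrees."""
    return lowest_deg + index * step_deg


def quantize_ratio(ratio, bits=3):
    """Uniform quantizer on [0, 1]; the uniform grid is an assumption."""
    levels = (1 << bits) - 1
    return int(round(ratio * levels))


def dequantize_ratio(index, bits=3):
    """Map a quantizer index back to a power ratio in [0, 1]."""
    return index / ((1 << bits) - 1)


az_index = quantize_angle(47.0, -180.0)   # azimuth grid: -180, -175, ..., 180
el_index = quantize_angle(-12.0, -90.0)   # elevation grid: -90, -85, ..., 90
pr_index = quantize_ratio(0.75)
```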
[0098] Downmix calculation All input audio object signals are combined into a downmix signal containing one or more transport channels. The number of transport channels is less than the number of input object signals. Note: In this embodiment, a single transport channel occurs only when there is only one input object, which means that the downmix calculation is skipped.
[0099] If the downmix contains two transport channels, this stereo downmix is calculated, for example, as a virtual cardioid microphone signal. The virtual cardioid microphone signals are determined by applying the directional information provided for each frame in the metadata file (assuming all elevation values are zero): $w_L = 0.5 + 0.5 \cos(\mathrm{azimuth} - \pi/2)$, $w_R = 0.5 + 0.5 \cos(\mathrm{azimuth} + \pi/2)$
[0100] Here, the virtual cardioids are positioned at 90° and -90°. In this way, a weight for each of the two transport channels (left and right) is determined for each object and applied to the corresponding audio object signal.
[0101]
$$DMX_L = \sum_{i=1}^{N} w_{L,i} \, x_i, \qquad DMX_R = \sum_{i=1}^{N} w_{R,i} \, x_i$$
[0102] In this context, $x_i$ is the i-th input object signal, $w_{L,i}$ and $w_{R,i}$ are its left and right cardioid weights, and N is the number of input objects, which is 2 or more. If the virtual cardioid weights are updated frame by frame, a dynamic downmix is employed that adapts to the directional information. Another possibility is to employ a fixed downmix, where each object is assumed to be in a static position. This static position may correspond, for example, to the object's initial direction, so that the same static virtual cardioid weights are obtained in all frames.
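A minimal sketch of the stereo cardioid downmix, with the cardioids pointing to +90° (left) and -90° (right) as stated above and all elevations assumed to be zero. The function names and the time-domain sample lists are assumptions for illustration.

```python
import math


def cardioid_weights(azimuth_deg):
    """Weights of virtual cardioids pointing to +90 (left) and -90 (right) degrees."""
    az = math.radians(azimuth_deg)
    w_left = 0.5 + 0.5 * math.cos(az - math.pi / 2)
    w_right = 0.5 + 0.5 * math.cos(az + math.pi / 2)
    return w_left, w_right


def stereo_downmix(objects, azimuths_deg):
    """Sum the weighted object signals into a left and a right transport channel."""
    length = len(objects[0])
    left = [0.0] * length
    right = [0.0] * length
    for signal, azimuth in zip(objects, azimuths_deg):
        w_left, w_right = cardioid_weights(azimuth)
        for t, sample in enumerate(signal):
            left[t] += w_left * sample
            right[t] += w_right * sample
    return left, right
```

An object at +90° ends up only in the left channel, an object at -90° only in the right channel, and a frontal object is split equally. Calling `cardioid_weights` once per frame gives the dynamic downmix; calling it once with the initial direction gives the fixed downmix.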
[0103] If the target bitrate allows, more than two transport channels are possible. With three transport channels, the cardioids are positioned uniformly, for example, at 0°, 120°, and -120°. When using four transport channels, the fourth cardioid can be oriented upwards, or the four cardioids can again be positioned uniformly in the horizontal plane. The cardioid positions can also be adapted to the object positions. The resulting downmix signal is processed by the corecoder and converted into a bitstream representation along with the encoded parametric side information.
[0104] Alternatively, the input object signals may be supplied to the corecoder without being combined into a downmix signal. In this case, the resulting number of transport channels corresponds to the number of input object signals. Typically, a maximum number of transport channels is specified that correlates with the total bitrate. The downmix signal is used only if the number of input object signals exceeds this maximum number of transport channels.
[0105] Figure 6a shows a decoder for decoding an encoded audio signal, such as the signal output by Figure 1a, Figure 2, or Figure 3, which includes one or more transport channels and directional information for multiple audio objects. Furthermore, the encoded audio signal includes parameter data for at least two associated audio objects for one or more frequency bins in a time frame, and the number of at least two associated objects is less than the total number of multiple audio objects. In particular, the decoder includes an input interface for providing one or more transport channels in a spectral representation having multiple frequency bins in a time frame. This represents a signal transferred from the input interface block 600 to the audio renderer block 700. In particular, the audio renderer 700 is configured to render one or more transport channels into a number of audio channels using the directional information contained in the encoded audio signal, the number of audio channels being preferably two channels for a stereo output format or three or more channels for a larger number of output formats such as 3 channels, 5 channels, 5.1 channels, etc. In particular, the audio renderer 700 is configured to calculate the contribution from one or more transport channels for each of one or more frequency bins, according to first directional information associated with a first audio object among at least two associated audio objects, and according to second directional information associated with a second of at least two associated objects. In particular, the directional information for multiple audio objects includes first directional information associated with the first object and second directional information associated with the second object.
[0106] Figure 8b shows, in a preferred embodiment, directional information 810 for multiple audio objects, and additionally, parameter data for a frame consisting of the power ratios of a certain number of parameter bands shown in 812, and one, preferably two or more object indices for each parameter band shown in block 814. In particular, the directional information for multiple audio objects 810 is shown in more detail in Figure 8c. Figure 8c shows a table with a first column having specific object IDs from 1 to N, where N is the number of multiple audio objects. Furthermore, a second column is provided having the directional information for each object, preferably as azimuth and elevation values, or only as azimuth values in the case of a two-dimensional situation. This is shown in 818. Thus, Figure 8c shows a “directional codebook” contained in the encoded audio signal input to the input interface 600 in Figure 6a. The directional information from column 818 is uniquely associated with a specific object ID from column 816 and is valid for the “whole” object in the frame, i.e., all frequency bands in the frame. Therefore, regardless of the number of time / frequency tiles in a high-resolution representation or the number of frequency bins in a time / parameter band in a low-resolution representation, only a single directional piece of information is transmitted and used by the input interface for each object identification.
[0107] In this context, Figure 8a shows the time / frequency representation generated by the filter bank 102 in Figure 2 or Figure 3 when it is implemented as the aforementioned CLDFB (Complex Low-Delay Filter Bank). For a frame for which directional information is given as described with respect to Figures 8b and 8c, the filter bank generates 16 time slots from 0 to 15 and 60 frequency bands from 0 to 59 in Figure 8a. Thus, one time slot and one frequency band represent a time / frequency tile 802 or 804. Nevertheless, in order to reduce the bitrate of the side information, it is preferable to convert the high-resolution representation to the low-resolution representation shown in Figure 8b, where only a single time bin exists and the 60 frequency bands are converted to 11 parameter bands as shown at 812 in Figure 8b. Thus, as shown in Figure 10c, the high-resolution representation is indicated by time slot index n and frequency band index k, and the low-resolution representation is given by grouped time slot index m and parameter band index l. Nevertheless, in the context of this specification, a time / frequency bin may be either a high-resolution time / frequency tile 802, 804 of Figure 8a or a low-resolution time / frequency unit identified by the grouped time slot index and the parameter band index at the input of block 731c in Figure 10c.
[0108] In the embodiment shown in Figure 6a, the audio renderer 700 is configured to calculate contributions from one or more transport channels for each of one or more frequency bins, according to first directional information associated with a first of at least two related audio objects and according to second directional information associated with a second of at least two related audio objects. In the embodiment shown in Figure 8b, block 814 has object indices for each related object within a parameter band, i.e., two or more object indices such that there are two contributions for each time-frequency bin.
[0109] As will be outlined later with respect to Figure 10a, the contributions can be calculated indirectly via the mixing matrix: the gain values of each relevant object are determined first and are then used to calculate the mixing matrix. Alternatively, as shown in Figure 10b, the contributions can be calculated explicitly using the gain values, and the explicitly calculated contributions are then summed for each output channel in a given time / frequency bin. Thus, regardless of whether the contributions are calculated explicitly or implicitly, the audio renderer uses the directional information to render the one or more transport channels into the number of audio channels such that, for each of the one or more frequency bins, the contribution from the one or more transport channels according to the first directional information associated with the first of the at least two relevant audio objects, and according to the second directional information associated with the second of the at least two relevant audio objects, is included in the number of audio channels.
[0110] Figure 6b shows, in a second embodiment, a decoder for decoding an encoded audio signal that includes one or more transport channels, directional information for multiple audio objects, and parameter data of the audio objects for one or more frequency bins of a time frame. Here again, the decoder comprises an input interface 600 for receiving the encoded audio signal and an audio renderer 700 for rendering the one or more transport channels into a number of audio channels using the directional information. In particular, the audio renderer is configured to calculate direct response information for each frequency bin of the multiple frequency bins from the one or more audio objects relevant in that frequency bin and from the directional information associated with these objects. This direct response information preferably includes gain values used for covariance synthesis or advanced covariance synthesis, or for explicit calculation of contributions from the one or more transport channels.
[0111] Preferably, the audio renderer is configured to compute covariance synthesis information using direct response information of one or more related audio objects within a time / frequency band and information about the number of audio channels. Furthermore, the covariance synthesis information, which is preferably a mixture matrix, is applied to one or more transport channels to obtain the number of audio channels. In a further implementation, the direct response information is a direct response vector for one or more audio objects, the covariance synthesis information is a covariance synthesis matrix, and the audio renderer is configured to perform matrix operations per frequency bin when applying the covariance synthesis information.
[0112] Furthermore, the audio renderer 700 is configured to derive direct response vectors for one or more audio objects in the calculation of direct response information, and to calculate a covariance matrix from each direct response vector for one or more audio objects. In addition, a target covariance matrix is calculated in the calculation of covariance synthesis information. However, instead of the target covariance matrix, related information of the target covariance matrix, namely the direct response matrices or vectors of one or more most dominant objects and a diagonal matrix of direct powers, indicated as E, determined by the application of power ratios, can be used.
[0113] Therefore, target covariance information does not necessarily have to be an explicit target covariance matrix, but can be derived from the covariance matrix of a single audio object, or from the covariance matrices of multiple audio objects in a time / frequency bin, from the power information of one or more audio objects in each time / frequency bin, and from the power information derived from one or more transport channels in one or more time / frequency bins.
[0114] The bitstream representation is read by the decoder, and the encoded transport channel and the encoded parametric side information contained therein become available for further processing. The parametric side information includes the following: • Directional information as quantized azimuth and elevation values (per frame) • Object indices (for each parameter band) that show a subset of related objects. • Quantized power ratios (per parameter band) that relate related objects to each other.
[0115] All processing is performed frame by frame, and each frame consists of one or more subframes. A frame may consist of, for example, four subframes, in which case each subframe has a duration of 5 milliseconds. Figure 4 shows a simplified overview of the decoder.
[0116] Figure 4 shows an audio decoder implementing the first and second embodiments. The input interface 600 shown in Figures 6a and 6b comprises a demultiplexer 602, a core decoder 604, a decoder 608 for decoding the object indices, a decoder 610 for decoding and dequantizing the power ratios, and a decoder 612 for decoding and dequantizing the directional information. Furthermore, the input interface comprises a filter bank 606 for providing the transport channels in a time / frequency representation.
[0117] The audio renderer 700 comprises a direct response calculator 704, a prototype matrix provider 702 controlled by an output configuration received, for example, via a user interface, a covariance synthesis block 706, and a synthesis filter bank 708, in order to ultimately provide an output audio file including the number of audio channels in the channel output format.
[0118] Therefore, items 602, 604, 606, 608, 610, and 612 are preferably included in the input interface of Figures 6a and 6b, and items 702, 704, 706, and 708 in Figure 4 are part of the audio renderer in Figure 6a or Figure 6b, indicated by reference no. 700.
[0119] The encoded parametric side information is decoded, and the quantized power ratio values, the quantized azimuth and elevation values (directional information), and the object indices are retrieved. The one power ratio value that is not transmitted is obtained by exploiting the fact that all power ratio values sum to 1. The resolution (l,m) of these parameters corresponds to the time / frequency tile grouping used on the encoder side. In further processing steps where a finer time / frequency resolution (k,n) is used, the parameters of a parameter band are valid for all time / frequency tiles contained in this parameter band, which corresponds to an expansion (l,m) → (k,n).
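The two decoder-side steps of this paragraph, recovering the ratio that was not transmitted and expanding band parameters to the fine tile grid, can be sketched as follows. The function names and the inclusive band borders are assumptions for illustration.

```python
def complete_ratios(transmitted):
    """Append the ratio that was not transmitted (all ratios sum to 1)."""
    return list(transmitted) + [1.0 - sum(transmitted)]


def expand_to_tiles(band_values, band_edges, num_bands, num_slots):
    """Copy one value per parameter band to every tile (k, n) inside that band."""
    tiles = [[None] * num_slots for _ in range(num_bands)]
    for value, (b_start, b_end) in zip(band_values, band_edges):
        for k in range(b_start, b_end + 1):   # band borders assumed inclusive
            for n in range(num_slots):
                tiles[k][n] = value
    return tiles


ratios = complete_ratios([0.75])                       # two dominant objects
tiles = expand_to_tiles(["band0", "band1"], [(0, 1), (2, 3)], 4, 2)
```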
[0120] The encoded transport channels are decoded by the core decoder. Using a filter bank (matching the one employed in the encoder), each frame of the decoded audio signal is converted to a time / frequency representation whose resolution is typically finer than (but at least equal to) the resolution used for the parametric side information.
[0121] Rendering / compositing of output signals The following explanation applies to one frame of an audio signal. ^T indicates the transpose operator.
[0122] Using the decoded transport channels $x = x(k,n) = [X_1(k,n), X_2(k,n)]^T$, i.e., the time-frequency representation of the audio signal (consisting of two transport channels in this case), and the parametric side information, a mixing matrix M is derived for each subframe (or for each frame, to reduce computational complexity) in order to synthesize the time-frequency output signal $y = y(k,n) = [Y_1(k,n), Y_2(k,n), Y_3(k,n), \ldots]^T$, which includes several output channels (e.g., 5.1, 7.1, 7.1+4, etc.).
[0123] For every (input) object, the transmitted direction of the object is used to determine so-called direct response values, which describe the panning gains used for the output channels. These direct response values are specific to the target layout, i.e., the number and position of the loudspeakers (provided as part of the output configuration). Examples of panning methods include vector-based amplitude panning (VBAP) [Pulkki 1997] and edge-fading amplitude panning (EFAP) [Borss 2014]. Each object has an associated vector of direct response values $dr_i$ (containing as many elements as there are loudspeakers). These vectors are calculated once per frame. Note: If an object's position corresponds to a loudspeaker's position, the vector contains the value 1 for that loudspeaker, and all other values are 0. If an object is located between two (or three) loudspeakers, the number of corresponding non-zero vector elements is 2 (or 3).
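To make the note concrete, the following sketch computes a direct response vector with a simple horizontal-only pairwise amplitude panner. It only stands in for the VBAP or EFAP methods cited above and is not the embodiment's panner; the function names, the energy normalization, and the example layout are assumptions.

```python
import math


def unit_vector(azimuth_deg):
    """Unit vector of a direction in the horizontal plane."""
    a = math.radians(azimuth_deg)
    return math.cos(a), math.sin(a)


def pan_2d(object_azimuth_deg, speaker_azimuths_deg):
    """Return one gain per loudspeaker for a horizontal-only layout."""
    p = unit_vector(object_azimuth_deg)
    order = sorted(range(len(speaker_azimuths_deg)),
                   key=lambda i: speaker_azimuths_deg[i])
    gains = [0.0] * len(speaker_azimuths_deg)
    # Try every pair of neighbouring loudspeakers, including the wrap-around pair.
    for a, b in zip(order, order[1:] + order[:1]):
        la = unit_vector(speaker_azimuths_deg[a])
        lb = unit_vector(speaker_azimuths_deg[b])
        det = la[0] * lb[1] - la[1] * lb[0]
        if abs(det) < 1e-9:
            continue
        # Solve p = g_a * la + g_b * lb for the two gains.
        g_a = (p[0] * lb[1] - p[1] * lb[0]) / det
        g_b = (la[0] * p[1] - la[1] * p[0]) / det
        if g_a >= -1e-9 and g_b >= -1e-9:
            g_a, g_b = max(g_a, 0.0), max(g_b, 0.0)
            norm = math.hypot(g_a, g_b)       # normalize to unit energy
            gains[a], gains[b] = g_a / norm, g_b / norm
            return gains
    raise ValueError("object direction is not enclosed by any loudspeaker pair")


layout = [30.0, -30.0, 0.0, 110.0, -110.0]    # hypothetical 5.0 layout
dr_on_speaker = pan_2d(30.0, layout)
dr_between = pan_2d(15.0, layout)
```

An object located exactly at the 30° loudspeaker yields a gain of 1 for that loudspeaker only, while an object at 15° yields two equal non-zero gains.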
[0124] The actual synthesis step (in this embodiment, covariance synthesis [Vilkamo2013]) includes the following substeps (see Figure 5 for a visualization):
○ For each parameter band, the object indices describing the dominant subset of the input objects in the time / frequency tiles grouped into this parameter band are used to extract the subset of vectors $dr_i$ required for further processing. For example, since only two relevant objects are considered, only the two vectors $dr_i$ associated with these two objects are needed.
○ Next, from the direct response values $dr_i$, a covariance matrix $C_i$ of size output channels × output channels is calculated for each relevant object: $C_i = dr_i \, dr_i^T$
○ For each time / frequency tile (within the parameter band), the audio signal power $P(k,n)$ is determined. In the case of two transport channels, the signal power of the first channel is added to the signal power of the second channel. This signal power is then multiplied by each power ratio value to obtain one direct power value for each relevant / dominant object i: $DP_i(k,n) = PR_i(k,n) \, P(k,n)$
○ For each frequency band k, the final target covariance matrix $C_Y$ of size output channels × output channels is obtained by summing over all slots n in the (sub)frame and over all relevant objects i:
[0125]
$$C_Y = \sum_{i} \sum_{n} DP_i(k,n) \, C_i$$
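The accumulation of the target covariance matrix for one frequency band can be sketched as follows, assuming real-valued direct response vectors and a power ratio that is constant over the slots of the (sub)frame. The function names and the toy values are assumptions for illustration.

```python
def outer(v):
    """Outer product v v^T of a real-valued vector."""
    return [[a * b for b in v] for a in v]


def target_covariance(direct_responses, power_ratios, tile_powers):
    """Accumulate C_Y = sum_i sum_n PR_i * P(k, n) * dr_i dr_i^T for one band k.

    direct_responses -- one panning-gain vector dr_i per relevant object
    power_ratios     -- one power ratio PR_i per relevant object
    tile_powers      -- transport-channel power P(k, n) for each slot n
    """
    size = len(direct_responses[0])
    c_y = [[0.0] * size for _ in range(size)]
    for dr, ratio in zip(direct_responses, power_ratios):
        c_i = outer(dr)                                      # C_i = dr_i dr_i^T
        direct_power = sum(ratio * p for p in tile_powers)   # sum_n DP_i(k, n)
        for r in range(size):
            for c in range(size):
                c_y[r][c] += direct_power * c_i[r][c]
    return c_y


# Two relevant objects sitting exactly on output channels 0 and 1 of three.
C_Y = target_covariance([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                        [0.75, 0.25],
                        [2.0, 2.0])
```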
[0126] Figure 5 provides a detailed overview of the covariance synthesis step performed in block 706 of Figure 4. In particular, the embodiment of Figure 5 includes a signal power calculation block 721, a direct power calculation block 722, a covariance matrix calculation block 723, a target covariance matrix calculation block 724, an input covariance matrix calculation block 726, a mixing matrix calculation block 725, and a rendering block 727 which, in Figure 5, additionally includes the filter bank block 708 of Figure 4, so that the output signal of block 727 preferably corresponds to the time-domain output signal. However, if block 708 is not included in the rendering block of Figure 5, the result is a spectral-domain representation of the corresponding audio channels.
[0127] (The following steps are part of the state of the art [Vilkamo2013] and are included for clarity.)
○ For each (sub)frame and each frequency band, the input covariance matrix $C_x = x x^T$ of size transport channels × transport channels is calculated from the decoded audio signal. Optionally, only the entries on the main diagonal are used, in which case all other non-zero entries are set to zero.
○ A prototype matrix of size output channels × transport channels is defined, describing the mapping of the transport channels to the output channels (provided as part of the output configuration). The number of output channels is given by the target output format (e.g., the target loudspeaker layout). This prototype matrix can be static or change per frame. For example, if only a single transport channel is transmitted, this transport channel is mapped to each output channel. If two transport channels are transmitted, the left (first) channel is mapped to all output channels located within (+0°, +180°), i.e., the "left" channels, and the right (second) channel is mapped to all output channels located within (-0°, -180°), i.e., the "right" channels. (Note: 0° represents the position in front of the listener, positive angles represent positions to the left of the listener, and negative angles represent positions to the right of the listener. If a different convention is adopted, the signs of the angles must be adjusted accordingly.)
○ Using the input covariance matrix $C_x$, the target covariance matrix $C_Y$, and the prototype matrix, a mixing matrix is calculated for each (sub)frame and each frequency band [Vilkamo2013]. For example, 60 mixing matrices are obtained for each (sub)frame.
○ The mixing matrices are interpolated (for example, linearly) between (sub)frames, which corresponds to temporal smoothing.
○ Finally, the output channels y are synthesized band by band by multiplying the final set of mixing matrices M, each of size output channels × transport channels, by the corresponding band of the time / frequency representation of the decoded transport channels x: $y = Mx$. Note that the residual signal r defined in [Vilkamo2013] is not used.
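The prototype matrix mapping and the final matrix application can be sketched as follows. The function names are assumptions, as is the treatment of a loudspeaker at exactly 0° (fed equally from both transport channels), which the text does not specify. The prototype matrix is applied here only as a stand-in to show the shape of the $y = Mx$ step; it is not the mixing matrix obtained from covariance synthesis.

```python
def prototype_matrix(speaker_azimuths_deg, num_transport):
    """Rows: output channels, columns: transport channels (1 or 2 supported)."""
    rows = []
    for azimuth in speaker_azimuths_deg:
        if num_transport == 1:
            rows.append([1.0])               # single channel feeds every output
        elif azimuth > 0.0:
            rows.append([1.0, 0.0])          # left half plane -> first channel
        elif azimuth < 0.0:
            rows.append([0.0, 1.0])          # right half plane -> second channel
        else:
            rows.append([0.5, 0.5])          # assumption: frontal channel gets both
    return rows


def apply_matrix(m, x):
    """Compute y = M x for one time/frequency tile."""
    return [sum(m_rc * x_c for m_rc, x_c in zip(row, x)) for row in m]


layout = [30.0, -30.0, 0.0, 110.0, -110.0]   # hypothetical 5.0 layout
Q = prototype_matrix(layout, 2)
y = apply_matrix(Q, [2.0, 4.0])
```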
[0128] The output signal y is transformed into a time-domain representation y(t) using a filter bank.
[0129] Optimized covariance synthesis Owing to how the input covariance matrix $C_x$ and the target covariance matrix $C_Y$ are calculated in this embodiment, specific optimizations of the optimal mixing matrix calculation using the covariance synthesis of [Vilkamo2013] can be achieved, which significantly reduce the computational complexity of the mixing matrix calculation. Note that in this section the Hadamard operator ○ denotes an element-wise operation on a matrix: instead of following rules such as those of matrix multiplication, the respective operation is performed individually on each element. For example, the Hadamard product of matrices A and B does not correspond to the matrix multiplication AB = C, but to the element-wise operation a_ij * b_ij = c_ij.
[0130] SVD(·) denotes the singular value decomposition. The algorithm of [Vilkamo2013], presented there as a Matlab function (Listing 1), is as follows (prior art).
[0131] [Table 1A]
[0132] [Table 1B]
[0133] As mentioned in the previous section, optionally only the main diagonal elements of C_x are used, and all other entries are set to zero. In this case, C_x is a diagonal matrix, and a valid decomposition satisfying equation (3) of [Vilkamo2013] is K_x = C_x^{○1/2}, so that the SVD in line 3 of the conventional algorithm is no longer necessary.
[0134] Considering the formulas from the previous section that generate the target covariance from the direct responses dr_i and the direct powers (or direct energies),
[0135]
[Equation]
[0136] the last expression can be rearranged and written as follows:
[0137]
[Equation]
[0138] If we now define
[0139]
[Equation]
[0140] we therefore obtain
[0141]
[Equation]
[0142] It can then be seen that, if the direct responses of the k most dominant objects are arranged in a direct response matrix R = [dr_1 … dr_k] and the direct powers are arranged in a diagonal matrix E with e_{i,i} = E_i, C_y can also be expressed as C_y = R E R^H, and a valid decomposition of C_y satisfying equation (3) of [Vilkamo2013] is given by K_y = R E^{○1/2}.
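The matrix form and its decomposition can be written out as a short worked derivation. This is a sketch that uses only the quantities named in the text (the direct response matrix R and the diagonal matrix E with e_{i,i} = E_i); it assumes that the direct powers E_i are real and non-negative and that equation (3) of [Vilkamo2013] is the requirement K K^H = C.

```latex
\begin{align}
C_y &= R\,E\,R^{H}
     = \sum_{i=1}^{k} E_i \, dr_i \, dr_i^{H},
     \qquad R = [\,dr_1 \;\dots\; dr_k\,], \quad E = \operatorname{diag}(E_1,\dots,E_k) \\
K_y &= R\,E^{\circ 1/2} \\
K_y K_y^{H} &= R\,E^{\circ 1/2}\,\bigl(E^{\circ 1/2}\bigr)^{H} R^{H}
             = R\,E^{\circ 1/2}\,E^{\circ 1/2}\,R^{H}
             = R\,E\,R^{H} = C_y
\end{align}
```

Because E is diagonal, the element-wise square root coincides with the matrix square root, which is why no SVD is needed to obtain K_y.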
[0143] Therefore, SVD from line 1 of the conventional algorithm is no longer necessary.
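Both SVD-free decompositions can be checked numerically. The sketch below uses hypothetical helper names and small hand-picked values; it assumes that a valid decomposition K of a covariance matrix C is one with K K^H = C.

```python
import math


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hermitian(A):
    """Conjugate transpose."""
    return [[complex(v).conjugate() for v in col] for col in zip(*A)]


def decompose_diag_input(cx_diag):
    """K_x = C_x^{o1/2} for a diagonal C_x given by its main diagonal."""
    m = len(cx_diag)
    return [[math.sqrt(cx_diag[i]) if i == j else 0.0 for j in range(m)]
            for i in range(m)]


def decompose_target(R, direct_powers):
    """K_y = R E^{o1/2}: column i of R is scaled by sqrt(E_i)."""
    return [[r * math.sqrt(e) for r, e in zip(row, direct_powers)] for row in R]


def target_covariance(R, direct_powers):
    """C_y = R E R^H, formed explicitly only for the comparison."""
    k = len(direct_powers)
    E = [[direct_powers[i] if i == j else 0.0 for j in range(k)] for i in range(k)]
    return matmul(matmul(R, E), hermitian(R))


# Three output channels, two dominant objects (illustrative values).
R = [[1.0, 0.0], [0.5, 0.5j], [0.0, 1.0]]
E = [4.0, 9.0]
K_y = decompose_target(R, E)
C_y = target_covariance(R, E)
K_x = decompose_diag_input([4.0, 0.25])
```

In the optimized algorithm itself, C_y would not have to be formed; it appears here only to confirm that K_y K_y^H reproduces it.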
[0144] This leads to an optimized algorithm for covariance synthesis within this embodiment, which also takes into account that the energy compensation option is always used and that the residual target covariance C_r is therefore not required.
[0145] [Table 2A]
[0146] [Table 2B]
[0147] A careful comparison of the conventional algorithm with the proposed algorithm reveals that the former requires three SVDs, of matrices of size m×m, n×n, and m×n, respectively, where m is the number of downmix channels and n is the number of output channels to which the objects are rendered.
[0148] The proposed algorithm requires only one SVD, of a matrix of size m×k, where k is the number of dominant objects. Furthermore, since k is usually much smaller than n, this matrix is smaller than the corresponding matrix in the conventional algorithm.
[0149] The complexity of a standard SVD implementation is approximately O(c_1 m^2 n + c_2 n^3) for an m×n matrix [Golub2013], where c_1 and c_2 are constants that depend on the algorithm used. Thus, a significant reduction in the computational complexity of the proposed algorithm is achieved compared with the prior-art algorithm.
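The order-of-magnitude difference can be made concrete by evaluating the cost model above for both variants. This is only an illustrative estimate: the constants c_1 = c_2 = 1 and the example sizes (two transport channels, twelve output channels, two dominant objects) are assumptions and are not taken from the text.

```python
def svd_cost(m, n, c1=1.0, c2=1.0):
    """Cost model c1*m^2*n + c2*n^3 for the SVD of an m x n matrix."""
    return c1 * m * m * n + c2 * n ** 3


def conventional_cost(m, n):
    """Three SVDs: m x m, n x n and m x n."""
    return svd_cost(m, m) + svd_cost(n, n) + svd_cost(m, n)


def proposed_cost(m, k):
    """A single SVD of size m x k."""
    return svd_cost(m, k)


m, n, k = 2, 12, 2          # ASSUMED example sizes
conventional = conventional_cost(m, n)
proposed = proposed_cost(m, k)
```

With these assumed numbers the model gives 5248 units for the conventional algorithm against 16 for the proposed one; the exact ratio depends on the constants and on the layout.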
[0150] Furthermore, preferred embodiments relating to the encoder side of the first embodiment are discussed with reference to Figures 7a and 7b. In addition, preferred implementations of the encoder side implementation of the second embodiment are discussed with reference to Figures 9a to 9d.
[0151] Figure 7a shows a preferred implementation of the object parameter calculator 100 of Figure 1a. In block 120, the audio objects are converted to spectral representations. This is done by the filter bank 102 in Figure 2 or Figure 3. Next, in block 122, the selection information is calculated, for example, as shown in block 104 in Figure 2 or Figure 3. For this purpose, amplitude-related measures can be used, such as the amplitude itself, power, energy, or other amplitude-related measures obtained by raising the amplitude to a power other than 1. The result of block 122 is a set of selection information for each object in the corresponding time / frequency bin. Next, in block 124, object IDs are derived for each time / frequency bin. In the first embodiment, two or more object IDs are derived for each time / frequency bin. According to the second embodiment, the number of object IDs for each time / frequency bin may be only a single object ID, so as to identify the most important, strongest, or most relevant object in block 124 among the information provided by block 122. Block 124 outputs information about the parameter data and includes one or more indices for the one or more most relevant objects.
[0152] If there are two or more relevant objects per time / frequency bin, block 126 calculates amplitude-related measures characterizing these objects in the time / frequency bin. These amplitude-related measures may be the same as those already calculated for the selection information in block 122 or, preferably, combined values are calculated using the information already calculated by block 102, as indicated by the dashed line between block 122 and block 126. The amplitude-related measures or the one or more combined values calculated in block 126 are then forwarded to the quantizer and encoder block 212 in order to obtain encoded amplitude-related values or encoded combined values as additional parametric side information. In the embodiments of Figure 2 or 3, these are the "encoded power ratios" included in the bitstream together with the "encoded object indices". If there is only one object ID per time / frequency bin, the calculation, quantization, and encoding of power ratios are unnecessary, and the index of the most relevant object in the time / frequency bin is sufficient to perform the decoder-side rendering.
[0153] Figure 7b shows a preferred implementation of the calculation of the selection information 102 in Figure 7b. As shown in block 123, the signal power is calculated for each object and each time / frequency bin as the selection information. Next, in block 125, which is a preferred implementation of block 124 in Figure 7a, the object ID of the single object or, preferably, of the two or more objects with the highest power is extracted and output. If there are two or more relevant objects, a power ratio is calculated as shown in block 127, which is a preferred implementation of block 126: for each extracted object ID, the power ratio relates the power of that object to the total power of all extracted objects whose object IDs were found by block 125. This procedure is advantageous because only one fewer combined value than the number of objects in the time / frequency bin needs to be transmitted, since there is a rule known to the decoder that the power ratios of all objects must sum to 1. Preferably, the functions of blocks 120, 122, 124, 126 in Figure 7a and / or 123, 125, 127 in Figure 7b are implemented by the object parameter calculator 100 in Figure 1a, and the function of block 212 in Figure 7a is implemented by the output interface 200 in Figure 1a.
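The chain of blocks 123, 125 and 127 can be sketched for a single time / frequency bin as follows. The function names are hypothetical, quantization and encoding (block 212) are omitted, and the handling of a bin in which all selected objects are silent (equal ratios) is an assumption.

```python
def tile_power(spectral_values):
    """Block 123: power of one object in one time/frequency bin."""
    return sum(abs(v) ** 2 for v in spectral_values)


def object_parameters(object_powers, num_relevant=2):
    """Blocks 125 and 127: pick the strongest objects and their power ratios.

    Returns (object_ids, transmitted_ratios). The last ratio is not returned
    because the decoder can derive it from the sum-to-one rule.
    """
    order = sorted(range(len(object_powers)),
                   key=lambda i: object_powers[i], reverse=True)
    ids = order[:num_relevant]
    total = sum(object_powers[i] for i in ids)
    if total == 0.0:
        ratios = [1.0 / len(ids)] * len(ids)   # ASSUMPTION: silent bin
    else:
        ratios = [object_powers[i] / total for i in ids]
    return ids, ratios[:-1]


# Four objects in one bin; objects 1 and 2 dominate.
ids, transmitted = object_parameters([0.1, 4.0, 1.0, 0.0])
```

For two relevant objects, a single ratio per bin is transmitted alongside the two object indices.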
[0154] Subsequently, the apparatus for encoding according to the second embodiment shown in Figure 1b is described in more detail with respect to several embodiments. In step 110a, directional information is extracted from the input signal or by reading or parsing metadata information contained in a metadata portion or metadata file, for example, as shown with respect to Figure 12a. In step 110b, the directional information is quantized per frame and per audio object, and the quantization indices per object and per frame are forwarded to an encoder or an output interface such as the output interface 200 in Figure 1b. In step 110c, the directional quantization indices are dequantized to obtain dequantized values, which in certain implementations may also be output directly by block 110b. Next, based on the dequantized directional indices, block 422 calculates the weights for each transport channel and each object based on a specific virtual microphone setup. This virtual microphone setup may include two virtual microphone signals with different orientations placed at the same location, or it may be a setup with two different positions with respect to a reference position or orientation, such as a virtual listener position or orientation. A setup with two virtual microphone signals results in weights for two transport channels for each object.
[0155] When generating three transport channels, the virtual microphone setup can be considered to include three virtual microphone signals from microphones with different orientations located at the same position, or from microphones located at three different positions relative to a reference position or orientation, where the reference position for this orientation can be the position or orientation of a virtual listener.
[0156] Alternatively, the four transport channels can be generated based on a virtual microphone setup that generates four virtual microphone signals from microphones positioned at the same location but with different orientations, or from four virtual microphone signals positioned at four different locations relative to a reference position or direction, where the reference position or direction can be a virtual listener position or virtual listener direction.
[0157] Furthermore, for the purpose of calculating the weights w_L and w_R for each object and each transport channel in the two-channel example, the virtual microphone signals are signals derived from a virtual first-order microphone, a virtual cardioid microphone, a virtual figure-of-eight, dipole, or bidirectional microphone, a virtual directional microphone, a virtual subcardioid microphone, a virtual unidirectional microphone, a virtual hypercardioid microphone, or a virtual omnidirectional microphone.
[0158] In this context, it should be noted that the actual microphone placement is not necessary for calculating the weights. Instead, the rules for calculating the weights change depending on the virtual microphone configuration, i.e., the virtual microphone placement and characteristics.
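As an illustration of how a virtual microphone configuration turns a direction into weights, the sketch below assumes two coincident virtual cardioid microphones pointing at +90° (left) and -90° (right). This particular configuration and the function name are assumptions made for the example; the only fixed ingredient is the standard cardioid pattern 0.5 + 0.5·cos(θ), where θ is the angle between the object direction and the microphone axis.

```python
import math


def cardioid_weights(azimuth_deg):
    """Return (w_L, w_R) for an object at the given azimuth.

    Positive azimuths are to the left of the listener, 0 degrees is in front.
    """
    azi = math.radians(azimuth_deg)
    w_left = 0.5 + 0.5 * math.cos(azi - math.pi / 2)    # cardioid aimed at +90 deg
    w_right = 0.5 + 0.5 * math.cos(azi + math.pi / 2)   # cardioid aimed at -90 deg
    return w_left, w_right


w_left_side, w_right_side = cardioid_weights(90.0)   # object hard left
w_left_front, w_right_front = cardioid_weights(0.0)  # object straight ahead
```

A different virtual microphone type or orientation would only change the two formulas inside the function, which is the sense in which the calculation rule depends on the virtual configuration.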
[0159] In block 404 of Figure 9a, weights are applied to the objects, and for each object, if the weight is not zero, the object's contribution to a particular transport channel is obtained. Thus, block 404 receives the object signals as input. Next, in block 406, the contributions are summed for each transport channel, for example, the contributions from the objects to the first transport channel are added together, and the contributions of the objects to the second transport channel are added together. As shown in block 406, the output of block 406 is, for example, the transport channels in the time domain.
[0160] Preferably, the object signals input to block 404 are time-domain object signals containing full-bandwidth information, and the weighting in block 404 and the summation in block 406 are performed in the time domain. In other embodiments, however, these steps can also be performed in the spectral domain.
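The weighting of block 404 and the summation of block 406 amount to a weighted sum of the object signals per transport channel. A minimal time-domain sketch, with a hypothetical function name and hand-picked weights:

```python
def downmix(object_signals, weights):
    """Mix object signals into transport channels.

    object_signals: one list of samples per object (all of equal length).
    weights: one tuple per object holding one weight per transport channel.
    """
    num_channels = len(weights[0])
    num_samples = len(object_signals[0])
    transport = [[0.0] * num_samples for _ in range(num_channels)]
    for signal, object_weights in zip(object_signals, weights):
        for ch, w in enumerate(object_weights):
            if w == 0.0:
                continue                      # zero weight: no contribution
            for t, sample in enumerate(signal):
                transport[ch][t] += w * sample
    return transport


# Two objects, two transport channels.
channels = downmix([[1.0, 2.0], [10.0, 20.0]], [(1.0, 0.0), (0.5, 0.5)])
```

The same loop applies unchanged in the spectral domain if the samples are replaced by the spectral values of one band.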
[0161] Figure 9b shows a further embodiment in which static downmix is implemented. For this purpose, orientation information for the first frame is extracted in block 130, and weights are calculated according to the first frame, as shown in block 403a. Then, in order to implement static downmix, the weights are left as they are for the other frames, as shown in block 408.
[0162] Figure 9c shows an alternative implementation in which a dynamic downmix is calculated. For this purpose, block 132 extracts orientation information for each frame, and the weights are updated for each frame as shown in block 403b. Next, in block 405, the updated weights are applied to the frames, implementing a dynamic downmix that changes from frame to frame. Implementations between the extreme cases of Figures 9b and 9c are equally useful: for example, the weights are updated only every second, every third, or every n-th frame, and / or the weights are smoothed over time, so that the antenna characteristics do not change too much over time for the purpose of downmixing according to the orientation information. Figure 9d shows another implementation of the downmixer 400 controlled by the object orientation information provider 110 in Figure 1b. In block 410, the downmixer is configured to analyze the orientation information of all objects in the frame, and in block 412, the microphones are positioned according to the analysis result for the purpose of calculating the weights w_L and w_R of the stereo example. Microphone positioning refers to the position and / or directivity of the microphones. In block 414, the microphones are either kept for the other frames, similar to the static downmix discussed for block 408 in Figure 9b, or updated as discussed for block 405 in Figure 9c. With respect to the functionality of block 412, the microphones can be positioned to obtain good separation, such that the first virtual microphone "looks" toward a first group of objects and the second virtual microphone "looks" toward a second group of objects. The second group differs from the first group preferably in that, as far as possible, no object of one group is included in the other group. Alternatively, the analysis of block 410 can be enhanced by other parameters, and the positioning can also be controlled by other parameters.
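The range between a static and a fully dynamic downmix can be sketched with one scheduling function. The function name, the `update_every` parameter and the one-pole recursive smoother are assumptions chosen for illustration; the text only states that weights may be held, updated every n-th frame, and / or smoothed over time.

```python
def scheduled_weights(per_frame_weights, update_every=1, smoothing=0.0):
    """Return the weights actually applied in each frame.

    update_every=None keeps the first frame's weights (static downmix),
    update_every=1 updates every frame (dynamic downmix),
    update_every=n updates only every n-th frame.
    smoothing in [0, 1) blends the previous applied weights with the target.
    """
    target = per_frame_weights[0]
    current = list(target)
    applied = []
    for index, frame_weights in enumerate(per_frame_weights):
        if update_every is not None and index % update_every == 0:
            target = frame_weights
        current = [smoothing * c + (1.0 - smoothing) * t
                   for c, t in zip(current, target)]
        applied.append(list(current))
    return applied


frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
static = scheduled_weights(frames, update_every=None)
every_second = scheduled_weights(frames, update_every=2)
smoothed = scheduled_weights(frames, update_every=1, smoothing=0.5)
```

Larger update intervals or stronger smoothing trade tracking accuracy for a more stable virtual microphone characteristic.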
[0163] Next, preferred implementations of the decoder according to the first or second embodiment are discussed with respect to, for example, Figures 6a and 6b, and are given with respect to Figures 10a, 10b, 10c, 10d and 11 below.
[0164] In block 613, the input interface 600 is configured to retrieve individual object orientation information associated with an object ID. This procedure corresponds to the functionality of block 612 in Figure 4 or Figure 5, resulting in the “frame codebook” illustrated and described with respect to Figure 8b, and especially Figure 8c.
[0165] Furthermore, in block 609, one or more object IDs for each time / frequency bin are retrieved, regardless of whether these data are available for low-resolution parameter bands or high-resolution frequency tiles. The result of block 609, which corresponds to the procedure in block 608 in Figure 4, is the specific IDs of the one or more relevant objects within the time / frequency bin. Next, in block 611, the specific object orientation information for the specific one or more IDs in each time / frequency bin is retrieved from the “frame codebook,” i.e., from the exemplary table shown in Figure 8c. Then, in block 704, gain values for the one or more relevant objects are calculated for each output channel, as determined by the output format, for each time / frequency bin. Next, in blocks 730 or 706, 708, the output channels are calculated. The output channels can be calculated by explicitly calculating the contributions from the one or more transport channels, as shown in Figure 10b, or by indirectly calculating and using the transport channel contributions, as shown in Figure 10d or Figure 11. Figure 10b shows the functionality in which power values or power ratios are retrieved in block 610, corresponding to the functionality in Figure 4. These power values are then applied to the individual transport channels for each relevant object, as shown in blocks 733 and 735. Since these power values are applied to the individual transport channels in addition to the gain values determined by block 704, blocks 733 and 735 yield object-specific contributions of the transport channels such as transport channels ch1, ch2, ... Next, in block 737, these explicitly calculated transport channel contributions are added together for each output channel per time / frequency bin.
[0166] Next, depending on the implementation, a diffuse signal calculator 741 can be provided that generates a diffuse signal in the time / frequency bins for each output channel ch1, ch2, ..., and the diffuse signal and the contribution results of block 737 are combined such that a complete channel contribution is obtained for each time / frequency bin. This signal corresponds to the input to the filter bank 708 in Figure 4 if the covariance synthesis additionally relies on a diffuse signal. However, if the covariance synthesis 706 does not rely on a diffuse signal and uses only processing without a decorrelator, at least the energy of the output signal for each time / frequency bin corresponds to the energy of the channel contribution at the output of block 739 in Figure 10b. Furthermore, if the diffuse signal calculator 741 is not used, the result of block 739 corresponds to the result of block 706 in that it provides a complete channel contribution for each time / frequency bin, which can be converted individually for each output channel ch1, ch2, ... to finally obtain an output audio file with time-domain output channels that can be stored or forwarded to loudspeakers or any kind of rendering device.
[0167] Figure 10c shows a preferred implementation of the function of block 610 in Figure 10b or Figure 4. In step 610a, one or more combined (power) values are retrieved for a particular time / frequency bin. In block 610b, the corresponding value for the other relevant object in the time / frequency bin is calculated based on the calculation rule that all combined values must sum to 1.
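The sum-to-one rule of block 610b can be sketched in a few lines. The function name is hypothetical, and clamping a slightly negative remainder to zero (which could arise from quantization) is an assumption.

```python
def complete_power_ratios(transmitted_ratios):
    """Append the ratio that was not transmitted, using the sum-to-one rule."""
    remainder = 1.0 - sum(transmitted_ratios)
    return list(transmitted_ratios) + [max(remainder, 0.0)]  # ASSUMPTION: clamp at 0


ratios = complete_power_ratios([0.8])
```

For two relevant objects, one transmitted ratio therefore yields both power ratios at the decoder.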
[0168] Next, the result is preferably a low-resolution representation with two power ratios for each grouped time slot index and for each parameter band index. These represent low time / frequency resolution. In block 610c, the time / frequency resolution can be extended to high time / frequency resolution so that there are power values for time / frequency tiles having high-resolution time slot index n and high-resolution frequency band index k. The extension may involve a simple use of the exact same low-resolution index for corresponding time slots within grouped time slots and corresponding frequency bands within parameter bands.
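The expansion in block 610c can be pictured as repeating each low-resolution value over the time slots of its slot group and the frequency bands of its parameter band. The sketch below uses hypothetical names and an assumed grouping (slot group sizes and parameter band widths are passed in explicitly).

```python
def expand_resolution(low_res, slot_group_sizes, band_widths):
    """Map low_res[group][parameter_band] to high_res[n][k].

    slot_group_sizes[g] is the number of time slots in group g,
    band_widths[b] is the number of frequency bands in parameter band b.
    """
    high_res = []
    for group, num_slots in enumerate(slot_group_sizes):
        row = []
        for band, width in enumerate(band_widths):
            row.extend([low_res[group][band]] * width)   # repeat over bands
        for _ in range(num_slots):
            high_res.append(list(row))                   # repeat over slots
    return high_res


# One slot group of 2 slots; parameter bands covering 1 and 3 frequency bands.
tiles = expand_resolution([[0.8, 0.3]], slot_group_sizes=[2], band_widths=[1, 3])
```

Each high-resolution tile (n, k) then carries the value of the slot group and parameter band it belongs to.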
[0169] Figure 10d shows a preferred implementation of the function for calculating covariance synthesis information in block 706 of Figure 4, represented by a mixing matrix 725 used to mix two or more input transport channels into two or more output signals. Thus, for example, if there are two transport channels and six output channels, the size of the mixing matrix for each individual time / frequency bin will be 6 x 2. In block 723, corresponding to the function in block 723 of Figure 5, the gain value or direct response value for each object in each time / frequency bin is received and the covariance matrix is calculated. In block 722, the power value or ratio is received and the direct power value for each object in the time / frequency bin is calculated, with block 722 in Figure 10d corresponding to block 722 in Figure 5.
[0170] The results from both blocks 721 and 722 are input to the target covariance matrix calculator 724. Additionally or alternatively, an explicit calculation of the target covariance matrix C_y is not required. Instead, the relevant information contained in the target covariance matrix, namely the direct response value information indicated in matrix R and the direct power values of the two or more relevant objects indicated in matrix E, is input to block 725a for the calculation of the mixing matrix per time / frequency bin. Furthermore, the mixing matrix calculation 725a receives the prototype matrix Q and the input covariance matrix C_x, which is derived from the two or more transport channels as shown in block 726, corresponding to block 726 in Figure 5. The mixing matrix per time / frequency bin and frame can undergo temporal smoothing as shown in block 725b, and in block 727, which corresponds to at least a portion of the rendering block in Figure 5, the mixing matrix is applied, in either unsmoothed or smoothed form, to the transport channels in the corresponding time / frequency bins to obtain, at the output of block 739, a full channel contribution per time / frequency bin that is substantially similar to the corresponding full contribution discussed earlier with respect to Figure 10b. Thus, Figure 10b shows an implementation of the explicit calculation of the transport channel contributions, while Figure 10d shows a procedure for implicitly calculating the transport channel contributions per time / frequency bin and per relevant object within each time / frequency bin, either via the target covariance matrix C_y or via the relevant information R and E of blocks 723 and 722, which is introduced directly into the mixing matrix calculation block 725a.
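One simple way to realize the temporal smoothing of block 725b is a linear cross-fade between the mixing matrix of the previous (sub)frame and that of the current one. The sketch below is an assumption in its details (function name, slot-wise ramp that reaches the new matrix at the last slot); it only illustrates the idea of interpolating the matrix before it is applied.

```python
def interpolate_mixing(M_prev, M_curr, num_slots):
    """Return one mixing matrix per time slot, ramping from M_prev to M_curr."""
    matrices = []
    for slot in range(1, num_slots + 1):
        a = slot / num_slots                      # 0 < a <= 1
        matrices.append([[(1.0 - a) * p + a * c for p, c in zip(row_p, row_c)]
                         for row_p, row_c in zip(M_prev, M_curr)])
    return matrices


# A 1 x 2 mixing matrix faded over four time slots.
ramp = interpolate_mixing([[0.0, 0.0]], [[1.0, 2.0]], 4)
```

Applying the per-slot matrices to the transport channels avoids abrupt changes at (sub)frame boundaries.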
[0171] Next, Figure 11 shows a preferred optimized algorithm for covariance synthesis. It should be noted that all steps shown in Figure 11 are computed within the covariance synthesis 706 in Figure 4, or within the mixing matrix calculation block 725 in Figure 5 or 725a in Figure 10d. In step 751, the first decomposition result K_y is calculated. This decomposition result can be calculated easily without an explicit calculation of the covariance matrix because, as shown in Figure 10d, the information on the gain values contained in matrix R and the information on the two or more relevant objects, in particular the direct power information contained in matrix E, are used directly. Thus, since a specific singular value decomposition is no longer necessary, the first decomposition result in block 751 can be calculated directly and with little effort.
[0172] In step 752, the second decomposition result K_x is calculated. This decomposition result can also be calculated without an explicit singular value decomposition because the input covariance matrix is treated as a diagonal matrix, with the off-diagonal elements ignored.
[0173] Next, in step 753, a first regularized result is calculated based on the first regularization parameter α, and in step 754, a second regularized result is calculated based on the second regularization parameter β. Since K_x is, in a preferred implementation, a diagonal matrix, the calculation of the first regularized result in step 753 is simplified compared with the prior art, because the calculation of S_x is merely a parameter change rather than a decomposition as in the prior art.
[0174] Furthermore, with respect to the calculation of the second regularized result in block 754, the first step is merely a renaming of parameters rather than a multiplication by the matrix U_x^H S as in the prior art.
[0175] Furthermore, in step 755, the normalization matrix G_y is calculated, and based on step 755, the unitary matrix P is calculated in step 756 from K_x, the prototype matrix Q, and the information on K_y obtained by block 751. The fact that the matrix Λ is not needed here simplifies the calculation of the unitary matrix P compared with the available prior art.
[0176] Next, in step 757, the mixing matrix without energy compensation, M_opt, is calculated using the unitary matrix P, the result of block 754, and the result of block 751. Then, in block 758, energy compensation is performed using the compensation matrix G. Since energy compensation is performed, no residual signal derived from a decorrelator is needed. Instead of performing energy compensation, one could alternatively add a residual signal with enough energy to fill the energy gap left by the mixing matrix M_opt without energy compensation. For the purposes of the present invention, however, decorrelated signals are not relied upon, in order to avoid artifacts introduced by a decorrelator; energy compensation as shown in step 758 is preferred.
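The effect of the energy compensation in step 758 can be illustrated with a simplified sketch: each output channel of M_opt is rescaled so that its energy matches the corresponding diagonal entry of the target covariance. The function name, the restriction to a diagonal input covariance, the small floor `eps`, and the absence of any gain limit are assumptions; the exact form of the compensation matrix G in the tables is not reproduced here.

```python
import math


def energy_compensate(M_opt, cx_diag, cy_diag, eps=1e-12):
    """Return M = G * M_opt with a diagonal gain matrix G.

    M_opt:   n x m mixing matrix (real or complex entries)
    cx_diag: m input channel powers (main diagonal of C_x)
    cy_diag: n target output powers (main diagonal of C_y)
    """
    compensated = []
    for row, target in zip(M_opt, cy_diag):
        # Energy produced by this row: diagonal entry of M C_x M^H.
        achieved = sum(abs(m) ** 2 * c for m, c in zip(row, cx_diag))
        gain = math.sqrt(target / max(achieved, eps))
        compensated.append([gain * m for m in row])
    return compensated


M = energy_compensate([[0.5, 0.0], [0.0, 1.0]], cx_diag=[4.0, 1.0], cy_diag=[4.0, 9.0])
```

After compensation, the output energies equal the target energies without any decorrelated residual signal being added.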
[0177] Therefore, the optimized algorithm for covariance synthesis offers advantages in steps 751, 752, 753, 754, and in step 756 for calculating the unitary matrix P. It should be emphasized that the optimized algorithm offers advantages over the prior art even when only one of steps 751, 752, 753, 754, 756, or only a subgroup of those steps, is performed as shown, while the other steps are performed as in the prior art. This is because the improvements are not interdependent and can be applied independently of each other. However, the more improvements are implemented, the lower the implementation complexity becomes. Therefore, the full implementation of the embodiment in Figure 11 is preferred because it provides the greatest reduction in complexity; nevertheless, even when only one of steps 751, 752, 753, 754, 756 is performed according to the optimized algorithm and the other steps are performed as in the prior art, a reduction in complexity is obtained without any loss of quality.
[0178] Embodiments of the present invention can also be considered as a procedure for generating comfort noise in a stereo signal by mixing three Gaussian noise sources, one for each channel and a third common noise source, to create correlated background noise, or, in addition to or separately, to control the mixing of the noise sources with the coherence values transmitted in the SID frame.
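The mixing of three Gaussian noise sources can be sketched as follows. The mixing gains sqrt(1 - c) and sqrt(c) are an assumption: they are one standard choice for which the normalized correlation between the two channels equals the value c, here standing in for the transmitted coherence. The function names are hypothetical.

```python
import math
import random


def correlated_noise(num_samples, coherence, seed=0):
    """Mix two individual noise sources and one common source into a stereo pair."""
    rng = random.Random(seed)
    own = math.sqrt(1.0 - coherence)     # gain of the channel-specific source
    common = math.sqrt(coherence)        # gain of the shared source
    left, right = [], []
    for _ in range(num_samples):
        n1, n2, n3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        left.append(own * n1 + common * n3)
        right.append(own * n2 + common * n3)
    return left, right


def sample_correlation(x, y):
    """Normalized cross-correlation of two zero-mean sequences."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


left, right = correlated_noise(20000, coherence=0.6)
measured = sample_correlation(left, right)
```

With a coherence of 0 the two channels are independent, and with a coherence of 1 they are identical; intermediate values give partially correlated background noise.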
[0179] It should be noted here that all the alternatives or embodiments described above and below, and all embodiments defined by the independent claims or examples below, can be used individually, i.e., without any alternative, object, or independent claim other than the contemplated one. However, in other embodiments, two or more of the alternatives, embodiments, or independent claims can be combined with each other, and in yet other embodiments, all embodiments or alternatives and all independent claims can be combined with each other.
[0180] The signals encoded by the present invention can be stored on a digital storage medium or a non-transitory storage medium, or transmitted over a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
[0181] While some embodiments are described in the context of an apparatus, it is clear that these embodiments also represent a description of the corresponding method, where a block or device corresponds to a method step or the function of a method step. Similarly, embodiments described in the context of a method step also represent a description of the function of the corresponding block or item or the corresponding apparatus.
[0182] Depending on the specific implementation requirements, embodiments of the present invention can be implemented in hardware or software. The implementation can be carried out using a digital storage medium such as a floppy disk, DVD, CD, ROM, PROM, EPROM, EEPROM, or flash memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
[0183] Some embodiments of the present invention include a data carrier having an electronically readable control signal that can cooperate with a programmable computer system, thereby enabling one of the methods described herein to be performed.
[0184] Generally, embodiments of the present invention can be implemented as a computer program product having program code, so that when the computer program product is executed on a computer, the program code operates to execute one of the methods. The program code can be stored, for example, in a machine-readable carrier.
[0185] Other embodiments include computer programs stored on a machine-readable carrier or non-transitory storage medium for performing one of the methods described herein.
[0186] Therefore, in other words, embodiments of the methods of the present invention are computer programs having program code for performing one of the methods described herein when the computer program is executed on a computer.
[0187] Therefore, a further embodiment of the method of the present invention is a data carrier (or digital storage medium, or computer-readable medium) containing a computer program recorded thereon for performing one of the methods described herein.
[0188] Therefore, a further embodiment of the method of the present invention is a data stream or sequence of signals representing a computer program for performing one of the methods described herein. The data stream or sequence of signals may be configured to be transmitted over a data communication connection, such as the Internet.
[0189] Further embodiments include processing means, such as a computer or programmable logic device, configured or adapted to perform one of the methods described herein.
[0190] Further embodiments include a computer on which a computer program for performing one of the methods described herein is installed.
[0191] In some embodiments, a programmable logic device (such as a field-programmable gate array) can be used to perform some or all of the functions of the methods described herein. In some embodiments, a field-programmable gate array can be used in conjunction with a microprocessor to perform one of the methods described herein. Generally, these methods are preferably performed by any hardware device.
[0192] The embodiments described above are merely illustrative of the principles of the present invention. Modifications and variations of the arrangements and details described herein will be apparent to those skilled in the art. It is therefore intended that the invention be limited only by the scope of the appended claims and not by the specific details presented herein by way of description and explanation of the embodiments.
[0193] Aspects (to be used independently of each other, together with all other aspects, or together with only a subgroup of the other aspects).
[0194] A device, method, or computer program that includes one or more of the following functions:
[0195] Inventive examples relating to novel aspects:
• Combination of the multi-wave idea with object coding (use of more than one directional cue per T / F tile).
• An object coding approach that is as close as possible to the DirAC paradigm, in order to allow all kinds of input types in IVAS (object content has not been covered so far).
[0196] Inventive examples of parameterization (encoder):
• For each T / F tile: selection information for the n most relevant objects within this T / F tile, plus the power ratios between the contributions of those n most relevant objects.
• For each frame and each object: one direction.
[0197] Inventive examples of rendering (decoder):
• Obtain the direct response values for each relevant object from the transmitted object indices and direction information and from the target output layout.
• Obtain the covariance matrix from the direct responses.
• For each relevant object, calculate the direct power from the downmix signal power and the transmitted power ratios.
• Obtain the final target covariance matrix from the direct powers and the covariance matrix.
• Use only the diagonal elements of the input covariance matrix.
• Optimized covariance synthesis.
[0198] Supplementary notes on the differences from SAOC:
• Instead of all objects, the n dominant objects are considered. → Thus, the power ratios are related to the OLDs but are calculated in a different way.
• SAOC does not use directions in the encoder. → Direction information (rendering matrix) is introduced only in the decoder. → The SAOC-3D decoder receives object metadata for the rendering matrix.
• SAOC employs a downmix matrix and transmits the downmix gains.
• Diffuseness is not considered in the embodiments of the present invention.
[0199] Subsequently, further examples of the present invention are summarized.
[0200] 1. An apparatus for encoding a plurality of audio objects and associated metadata indicating direction information regarding the plurality of audio objects, comprising: A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels; A transport channel encoder (300) for encoding the one or more transport channels to obtain one or more encoded transport channels; An output interface (200) for outputting an encoded audio signal including the one or more encoded transport channels; wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the direction information of the plurality of audio objects. Apparatus.
[0201] 2. The downmixer (400) Generate two transport channels as two virtual microphone signals located at the same position or orientation with respect to a reference position or orientation such as the position or direction of a virtual listener, but with different orientations, or at two different positions, or Generate three transport channels as three virtual microphone signals located at the same position or orientation with respect to a reference position or orientation such as the position or direction of a virtual listener, but with different orientations, or three different positions. It generates four transport channels as four virtual microphone signals that are positioned at the same location with respect to a reference location or direction, such as the position or orientation of a virtual listener, but with different orientations, or at four different locations. It is configured in such a way, A virtual microphone signal is a virtual primary microphone signal, or a virtual cardioid microphone signal, or a virtual figure-eight, dipole, or bidirectional microphone signal, or a virtual directional microphone signal, or a virtual subcardioid microphone signal, or a virtual unidirectional microphone signal, or a virtual hypercardioid microphone signal, or a virtual omnidirectional microphone signal. The apparatus described in Example 1.
[0202] 3. The apparatus according to Example 1 or 2, wherein the downmixer (400) is configured to derive (402), for each audio object of the plurality of audio objects, weighting information for each transport channel using the direction information of the corresponding audio object; to weight (404) the corresponding audio object using the weighting information of that audio object for a specific transport channel, thereby obtaining an object contribution for the specific transport channel; and to combine (406) the object contributions for the specific transport channel from the plurality of audio objects to obtain the specific transport channel.
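The three steps of Example 3 (derive weights, weight each object, combine the contributions) can be illustrated with a minimal Python sketch. It assumes two virtual cardioid microphones at +90 and -90 degrees and a time-domain, sample-by-sample downmix; the function names `cardioid_gain` and `downmix`, the cardioid formula used as the weighting rule, and the sample values are illustrative assumptions, not taken from the specification.

```python
import math

def cardioid_gain(object_azimuth_deg, mic_azimuth_deg):
    """Gain of a virtual cardioid pointing at mic_azimuth_deg (assumed pattern)."""
    angle = math.radians(object_azimuth_deg - mic_azimuth_deg)
    return 0.5 * (1.0 + math.cos(angle))

def downmix(objects, azimuths_deg, mic_azimuths_deg=(90.0, -90.0)):
    """Return one transport channel per virtual microphone."""
    num_samples = len(objects[0])
    channels = []
    for mic_az in mic_azimuths_deg:
        # Step 402: one weight per object for this transport channel.
        weights = [cardioid_gain(az, mic_az) for az in azimuths_deg]
        # Steps 404 and 406: weight each object, then sum the contributions.
        channel = [
            sum(w * obj[n] for w, obj in zip(weights, objects))
            for n in range(num_samples)
        ]
        channels.append(channel)
    return channels

# Two hypothetical objects: one at +90 degrees, one at -90 degrees.
objects = [[1.0, 0.5, -0.5], [0.2, 0.2, 0.2]]
azimuths = [90.0, -90.0]
left, right = downmix(objects, azimuths)
```

With these positions each object lies on the axis of one microphone and in the null of the other, so the two transport channels carry the two objects almost unchanged. Objects at intermediate azimuths would be distributed over both channels.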
[0203] 4. The apparatus according to any one of Examples 1 to 3, wherein the downmixer (400) is configured to calculate the one or more transport channels as one or more virtual microphone signals that are located at the same position but have different orientations, or that are located at different positions, with respect to a reference position or orientation, such as the position or orientation of a virtual listener, to which the direction information relates; wherein the different positions or orientations are on or to the left of a centerline and on or to the right of the centerline, or are evenly or unevenly distributed over horizontal positions or orientations such as +90 degrees and -90 degrees relative to the centerline, or -120 degrees, 0 degrees, and +120 degrees relative to the centerline, or include at least one position or orientation directed upward or downward with respect to the horizontal plane in which the virtual listener is positioned; and wherein the direction information of the plurality of audio objects relates to the position of the virtual listener or to the reference position or orientation.
[0204] 5. The apparatus according to any one of Examples 1 to 4, further comprising a parameter processor (110) for quantizing the metadata indicating the direction information of the plurality of audio objects to obtain quantized direction items for the plurality of audio objects, wherein the downmixer (400) is configured to operate in response to the quantized direction items as the direction information, and wherein the output interface (200) is configured to introduce information about the quantized direction items into the encoded audio signal.
[0205] 6. The downmixer (400) is configured to perform directional information analysis of multiple audio objects and to position one or more virtual microphones to generate transport channels according to the results of the analysis. The apparatus described in any one of Examples 1 to 5.
[0206] 7. The apparatus according to any one of Examples 1 to 6, wherein the downmixer (400) is configured to downmix (408) using a downmix rule that is static over a plurality of time frames, or wherein the direction information is variable over a plurality of time frames and the downmixer (400) is configured to downmix (405) using a downmix rule that is variable over the plurality of time frames.
[0207] 8. The apparatus according to any one of Examples 1 to 7, wherein the downmixer (400) is configured to downmix in the time domain using sample-by-sample weighting and the combination of samples of multiple audio objects.
[0208] 9. The apparatus according to any one of Examples 1 to 8, further comprising an object parameter calculator (100) configured to calculate parameter data for at least two related audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the number of the at least two related audio objects is less than the total number of the plurality of audio objects, and wherein the output interface (200) is configured to introduce information about the parameter data of the at least two related audio objects for the one or more frequency bins into the encoded audio signal.
[0209] 10. The apparatus according to Example 9, wherein the object parameter calculator (100) is configured to convert (120) each audio object of the plurality of audio objects into a spectral representation having the plurality of frequency bins; to calculate (122) selection information from each audio object for the one or more frequency bins; and to derive (124), based on the selection information, object identifications as the parameter data indicating the at least two related audio objects; and wherein the output interface (200) is configured to introduce information about the object identifications into the encoded audio signal.
[0210] 11. The apparatus according to Example 9 or 10, wherein the object parameter calculator (100) is configured to quantize and encode (212), as the parameter data, one or more amplitude-related measurements of the one or more related audio objects in the one or more frequency bins, or one or more combined values derived from the amplitude-related measurements, and wherein the output interface (200) is configured to introduce the quantized one or more amplitude-related measurements or the quantized one or more combined values into the encoded audio signal.
[0211] 12. The apparatus according to Example 10 or 11, wherein the selection information is an amplitude-related measurement such as an amplitude value, a power value, a loudness value, or an amplitude raised to a power different from one for the audio object; wherein the object parameter calculator (100) is configured to calculate (127) a combined value, such as a ratio, from the amplitude-related measurement of one related audio object and the sum of the amplitude-related measurements of two or more of the related audio objects; and wherein the output interface (200) is configured to introduce information about the combined value into the encoded audio signal, the number of information items about the combined value in the encoded audio signal being at least one and less than the number of related audio objects for the one or more frequency bins.
[0212] 13. The object parameter calculator (100) is configured to select object identification based on the order of the selection information of the plurality of audio objects within one or more frequency bins. The apparatus according to any one of Examples 10 to 12.
[0213] 14. The apparatus according to any one of Examples 10 to 13, wherein the object parameter calculator (100) is configured to calculate (122) a signal power as the selection information; to derive (124), individually for each frequency bin of the one or more frequency bins, the object identifications of the two or more audio objects having the largest signal power values in that frequency bin; to calculate (126), as the parameter data, a power ratio between the signal power of at least one of the audio objects having the derived object identifications and the sum of the signal powers of the two or more audio objects having the largest signal power values; and to quantize and encode (212) the power ratio; and wherein the output interface (200) is configured to introduce the quantized and encoded power ratio into the encoded audio signal.
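The per-bin selection of Example 14 can be sketched in Python as follows. The sketch assumes two related objects per bin and a uniform 3-bit quantizer for the power ratio; the function names, the bit depth, and the spectral values are hypothetical and only illustrate the order of the steps (122), (124), (126), and (212).

```python
def select_dominant_objects(bin_values, num_related=2):
    """bin_values: one complex spectral value per object for a single frequency bin."""
    # Step 122: signal power of each object as the selection information.
    powers = [abs(v) ** 2 for v in bin_values]
    # Step 124: object identifications of the objects with the largest power.
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    ids = order[:num_related]
    # Step 126: power of each selected object relative to the selected sum.
    total = sum(powers[i] for i in ids)
    if total > 0.0:
        ratios = [powers[i] / total for i in ids]
    else:
        ratios = [1.0 / num_related] * num_related
    return ids, ratios

def quantize_ratio(ratio, bits=3):
    """Step 212: uniform quantizer on [0, 1]; the bit depth is an assumption."""
    levels = (1 << bits) - 1
    index = round(ratio * levels)
    return index, index / levels

# Four hypothetical objects in one frequency bin.
ids, ratios = select_dominant_objects([0.1 + 0j, 2 + 0j, 1j, 0.5 + 0j])
index, dequantized = quantize_ratio(ratios[0])
```

Because the ratios of the selected objects sum to one, only `num_related - 1` of them need to be transmitted; a decoder can recover the last one as one minus the sum of the others, which is consistent with the count of combined-value items stated in Example 12.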
[0214] 15. The apparatus according to any one of Examples 10 to 14, wherein the output interface (200) is configured to introduce into the encoded audio signal one or more encoded transport channels, and as parameter data, two or more encoded object identifiers of associated audio objects for each of one or more frequency bins of a plurality of frequency bins in a time frame, and one or more encoded combined values or encoded amplitude-related measurements, and quantized and encoded directional data for each audio object in a time frame, which is constant for all frequency bins of one or more frequency bins.
[0215] 16. The apparatus according to any one of Examples 9 to 15, wherein the object parameter calculator (100) is configured to calculate the parameter data for at least the most dominant object and the second most dominant object in the one or more frequency bins, or wherein the plurality of audio objects comprises three or more audio objects including a first audio object, a second audio object, and a third audio object, and the object parameter calculator (100) is configured to calculate, as the related audio objects for a first frequency bin of the one or more frequency bins, only a first group of audio objects, such as the first audio object and the second audio object, and to calculate, as the related audio objects for a second frequency bin of the one or more frequency bins, only a second group of audio objects, such as the second audio object and the third audio object, or the first audio object and the third audio object, wherein the first group of audio objects differs from the second group of audio objects with respect to at least one group member.
[0216] 17. The apparatus according to any one of Examples 9 to 16, wherein the object parameter calculator (100) is configured to calculate raw parametric data with a first time or frequency resolution, to combine the raw parametric data into combined parametric data having a second time or frequency resolution lower than the first, and to calculate the parameter data for the at least two related audio objects with respect to the combined parametric data having the second time or frequency resolution; or to determine parameter bands having a second time or frequency resolution different from a first time or frequency resolution used in a time or frequency decomposition of the plurality of audio objects, and to calculate the parameter data for the at least two related audio objects for the parameter bands having the second time or frequency resolution.
[0217] 18. A decoder for decoding an encoded audio signal comprising one or more transport channels, direction information for a plurality of audio objects, and, for one or more frequency bins of a time frame, parameter data for an audio object, the decoder comprising: an input interface (600) for providing the one or more transport channels in a spectral representation having a plurality of frequency bins in the time frame; and an audio renderer (700) for rendering the one or more transport channels into a number of audio channels using the direction information, wherein the audio renderer (700) is configured to calculate direct response information (704) from the one or more audio objects for each frequency bin of the plurality of frequency bins and from the direction information (810) associated with the one or more related audio objects in the frequency bin.
[0218] 19. The decoder according to Example 18, wherein the audio renderer (700) is configured to calculate (706) covariance synthesis information using the direct response information and information about the number of audio channels (702), and to apply (727) the covariance synthesis information to the one or more transport channels to obtain the number of audio channels; wherein the direct response information (704) is a direct response vector for each of the one or more audio objects, the covariance synthesis information is a covariance synthesis matrix, and the audio renderer (700) is configured to perform a matrix operation per frequency bin when applying (727) the covariance synthesis information.
[0219] 20. The decoder according to Example 18 or 19, wherein the audio renderer (700) is configured, in calculating the direct response information (704), to derive a direct response vector for each of the one or more audio objects and to calculate, for the one or more audio objects, a covariance matrix from each direct response vector; and, in calculating the covariance synthesis information, to derive (724) target covariance information from the covariance matrix of one audio object or the covariance matrices of a plurality of audio objects, from power information for the one or more audio objects, and from power information derived from the one or more transport channels.
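As an illustration of how a target covariance could be assembled from the quantities named in Example 20, the following Python sketch builds one covariance matrix per object from its direct response vector and sums them, weighted by an object power estimate. Several elements are assumptions made for the sketch and are not stated in this example: a two-channel output, a simple sine/cosine pan law standing in for the direct response, and the object power estimated as the transmitted power ratio times the transport channel power.

```python
import math

def direct_response(azimuth_deg):
    """Hypothetical stereo pan law: +90 degrees is fully left, -90 fully right."""
    theta = math.radians((90.0 - azimuth_deg) / 2.0)  # 0 .. pi/2
    return [math.cos(theta), math.sin(theta)]

def outer(v):
    """Outer product v v^T, used as the covariance matrix of one object."""
    return [[a * b for b in v] for a in v]

def target_covariance(azimuths_deg, power_ratios, transport_power):
    """Sum the per-object covariance matrices, weighted by object power."""
    size = 2
    c_y = [[0.0] * size for _ in range(size)]
    for az, ratio in zip(azimuths_deg, power_ratios):
        c_obj = outer(direct_response(az))
        energy = ratio * transport_power  # assumed object power estimate
        for i in range(size):
            for j in range(size):
                c_y[i][j] += energy * c_obj[i][j]
    return c_y

# Two related objects in one frequency bin, with hypothetical values.
c_y = target_covariance([90.0, -90.0], [0.8, 0.2], transport_power=10.0)
```

For objects panned hard left and hard right the result is close to a diagonal matrix whose entries are the two object powers; objects at intermediate directions would add off-diagonal terms, which is the inter-channel correlation the synthesis step is meant to reproduce.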
[0220] 21. The decoder according to Example 20, wherein the audio renderer (700) is configured, in calculating the direct response information, to derive a direct response vector for each of the one or more audio objects and to calculate (723), for each of the one or more audio objects, a covariance matrix from the respective direct response vector; to derive (726) input covariance information from the transport channels; to derive (725a, 725b) mixing information from the target covariance information, the input covariance information, and the information about the number of channels; and to apply (727) the mixing information to the transport channels for each frequency bin in the time frame.
[0221] 22. The decoder according to Example 21, wherein the result of applying mixing information to each frequency bin within a time frame is converted to the time domain (708), and the number of audio channels in the time domain is obtained.
[0222] 23. The decoder according to any one of Examples 18 to 22, wherein the audio renderer (700) is configured as follows: in a decomposition (752) of an input covariance matrix derived from the transport channels, only the main diagonal elements of the input covariance matrix are used; a decomposition (751) of a target covariance matrix is performed using a direct response matrix and a power matrix of the objects or transport channels; the input covariance matrix is decomposed (752) by taking the square root of each main diagonal element of the input covariance matrix; a regularized inverse of the decomposed input covariance matrix is calculated (753); a singular value decomposition is performed (756), without an extended identity matrix, when calculating an optimum matrix used for energy compensation.
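The simplified handling of the input covariance matrix in steps (752) and (753) can be sketched in Python. Only the main diagonal is used, so the decomposition reduces to element-wise square roots and the inverse to element-wise reciprocals. The flooring rule used here to keep the inverse bounded, and its threshold `rel_floor`, are assumptions chosen for illustration; the function names are hypothetical.

```python
import math

def decompose_diagonal(c_x):
    """Step 752: K_x = diag(sqrt(C_x[i][i])); off-diagonal elements are ignored."""
    return [math.sqrt(max(c_x[i][i], 0.0)) for i in range(len(c_x))]

def bounded_inverse(k_x, rel_floor=1e-3):
    """Step 753: invert the diagonal factor, flooring small entries.

    The floor is relative to the largest entry (an assumed rule), so that
    near-silent transport channels do not produce unbounded gains.
    """
    largest = max(k_x)
    floor = rel_floor * largest if largest > 0.0 else 1.0
    return [1.0 / max(k, floor) for k in k_x]

# Hypothetical input covariance of two transport channels; the second is silent.
c_x = [[4.0, 1.0], [1.0, 0.0]]
k_x = decompose_diagonal(c_x)
k_x_inv = bounded_inverse(k_x)
```

Working on the diagonal alone avoids a full matrix factorization per frequency bin, which matches the stated aim of reducing computational cost.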
[0223] 24. The parameter data of one or more audio objects includes the parameter data of at least two related audio objects, and the number of at least two related audio objects is less than the total number of audio objects. The audio renderer (700) is configured to calculate the contribution from one or more transport channels for each of one or more frequency bins, according to a first directional information associated with a first of at least two related audio objects, and according to a second directional information associated with a second of at least two related audio objects. A decoder according to any one of Examples 18 to 23.
[0224] 25. The decoder according to Example 24, wherein the audio renderer (700) is configured to ignore, for the one or more frequency bins, the direction information of any audio object other than the at least two related audio objects.
[0225] 26. The encoded audio signal includes amplitude-related measurements of each associated audio object, or combined values related to at least two associated audio objects in the parameter data. The audio renderer (700) is configured to operate such that the contribution from one or more transport channels is considered according to a first directional information associated with a first of at least two related audio objects and according to a second directional information associated with a second of at least two related audio objects, or to determine the quantitative contribution of one or more transport channels according to amplitude-related measurements or combined values. The decoder described in Example 24 or 25.
[0226] 27. The decoder according to Example 26, wherein the encoded audio signal includes the combined value in the parameter data; wherein the audio renderer (700) is configured to determine the contribution of the one or more transport channels using the combined value for one of the related audio objects and the direction information for that related audio object; and wherein the audio renderer (700) is configured to determine the contribution of the one or more transport channels using a value derived from the combined value for another of the related audio objects in the one or more frequency bins and the direction information of the other related audio object.
[0227] 28. The decoder according to any one of Examples 24 to 27, wherein the audio renderer (700) is configured to calculate, for each frequency bin of the plurality of frequency bins, direct response information (704) from the related audio objects and from the direction information associated with the related audio objects in the frequency bin.
[0228] 29. The decoder according to Example 28, wherein the audio renderer (700) is configured to determine (741) a diffuse signal for each frequency bin of the plurality of frequency bins using diffuseness information, such as a diffuseness parameter contained in the metadata, or a decorrelation rule, and to combine the diffuse signal with a direct response determined by the direct response information to obtain a rendered spectral-domain signal for one channel of the number of channels.
[0229] 30. A method for encoding a plurality of audio objects and associated metadata indicating direction information about the plurality of audio objects, the method comprising: downmixing the plurality of audio objects to obtain one or more transport channels; encoding the one or more transport channels to obtain one or more encoded transport channels; and outputting an encoded audio signal that includes the one or more encoded transport channels, wherein the downmixing comprises downmixing the plurality of audio objects in response to the direction information of the plurality of audio objects.
[0230] 31. A method for decoding an encoded audio signal comprising one or more transport channels, direction information for one or more audio objects, and, for one or more frequency bins of a time frame, parameter data for the audio objects, the method comprising: providing the one or more transport channels in a spectral representation having a plurality of frequency bins in the time frame; and rendering the one or more transport channels into a number of audio channels using the direction information, wherein the rendering comprises calculating direct response information from the one or more audio objects for each frequency bin of the plurality of frequency bins and from the direction information associated with the one or more related audio objects in the frequency bin.
[0231] 32. A computer program for performing the method of Example 30 or the method of Example 31 when running on a computer or processor.
[0232] (References)
[Pulkki2009] V. Pulkki, M-V. Laitinen, J. Vilkamo, J. Ahonen, T. Lokki, and T. Pihlajamaeki, “Directional audio coding - perception-based reproduction of spatial sound”, International Workshop on the Principles and Application on Spatial Hearing, Nov. 2009, Zao, Miyagi, Japan.
[SAOC_STD] ISO/IEC, “MPEG audio technologies Part 2: Spatial Audio Object Coding (SAOC).” ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard 23003-2.
[SAOC_AES] J. Herre, H. Purnhagen, J. Koppens, O. Hellmuth, J. Engdegaard, J. Hilpert, L. Villemoes, L. Terentiv, C. Falch, A. Hoelzer, M. L. Valero, B. Resch, H. Mundt, and H. Oh, “MPEG spatial audio object coding - the ISO/MPEG standard for efficient coding of interactive audio scenes,” J. AES, vol. 60, no. 9, pp. 655 - 673, Sep. 2012.
[MPEGH_AES] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, “MPEG-H audio - the new standard for universal spatial / 3D audio coding,” in Proc. 137th AES Conv., Los Angeles, CA, USA, 2014.
[MPEGH_IEEE] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, “MPEG-H 3D Audio - The New Standard for Coding of Immersive Spatial Audio”, IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, Aug. 2015.
[MPEGH_STD] Text of ISO/MPEG 23008-3/DIS 3D Audio, Sapporo, ISO/IEC JTC1/SC29/WG11 N14747, Jul. 2014.
[SAOC_3D_PAT] Apparatus and method for enhanced spatial audio object coding, WO 2015/011024 A1.
[Pulkki1997] V. Pulkki, “Virtual sound source positioning using vector base amplitude panning,” J. Audio Eng. Soc., vol. 45, no. 6, pp. 456 - 466, Jun. 1997.
[DELAUNAY] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, “The quickhull algorithm for convex hulls,” in Proc. ACM Trans. Math. Software (TOMS), New York, NY, USA, Dec. 1996, vol. 22, pp. 469 - 483.
[Hirvonen2009] T. Hirvonen, J. Ahonen, and V. Pulkki, “Perceptual compression methods for metadata in Directional Audio Coding applied to audiovisual teleconference”, AES 126th Convention 2009, May 7 - 10, Munich, Germany.
[Borss2014] C. Borss, “A Polygon-Based Panning Method for 3D Loudspeaker Setups”, AES 137th Convention 2014, October 9 - 12, Los Angeles, USA.
[WO2019068638] Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC based spatial audio coding, 2018.
[WO2020249815] Parameter encoding and decoding for multichannel audio using DirAC, 2019.
[BCC2001] C. Faller, F. Baumgarte, “Efficient representation of spatial audio using perceptual parametrization”, Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575).
[JOC_AES] H. Purnhagen, T. Hirvonen, L. Villemoes, J. Samuelsson, J. Klejsa, “Immersive Audio Delivery Using Joint Object Coding”, 140th AES Convention, Paper Number: 9587, Paris, May 2016.
[AC4_AES] K. Kjoerling, J. Roeden, M. Wolters, J. Riedmiller, A. Biswas, P. Ekstrand, A. Groeschel, P. Hedelin, T. Hirvonen, H. Hoerich, J. Klejsa, J. Koppens, K. Krauss, H-M. Lehtonen, K. Linzmeier, H. Muesch, H. Mundt, S. Norcross, J. Popp, H. Purnhagen, J. Samuelsson, M. Schug, L. Sehlstroem, R. Thesing, L. Villemoes, and M. Vinton, “AC-4 - The Next Generation Audio Codec”, 140th AES Convention, Paper Number: 9491, Paris, May 2016.
[Vilkamo2013] J. Vilkamo, T. Baeckstroem, A. Kuntz, “Optimized covariance domain framework for time-frequency processing of spatial audio”, Journal of the Audio Engineering Society, 2013.
[Golub2013] G. H. Golub and C. F. Van Loan, “Matrix Computations”, Johns Hopkins University Press, 4th edition, 2013.
[Explanation of symbols]
[0233]
100 Object parameter calculator
110 Parameter processor
200 Output interface
212 Encoding
300 Transport channel encoder
400 Downmixer
405 Downmix
600 Input interface
700 Audio renderer
704 Direct response information
810 Direction information
812 Amplitude-related measurements
Claims
1. An apparatus for encoding a plurality of audio objects, comprising: an object parameter calculator (100) configured to calculate, for one or more frequency bins of a plurality of frequency bins associated with a time frame, parameter data for at least two related audio objects among the plurality of audio objects, wherein the object parameter calculator (100) is configured to perform a selection among the plurality of audio objects to obtain the at least two related audio objects, such that not all of the plurality of audio objects are indicated as the at least two related audio objects; and an output interface (200) for outputting an encoded audio signal that includes information about the parameter data of the at least two related audio objects, wherein the object parameter calculator (100) is configured to quantize and encode (212), as the parameter data, one or more amplitude-related measurements of the at least two related audio objects in the one or more frequency bins, or one or more combined values derived from the amplitude-related measurements, and wherein the output interface (200) is configured to introduce the quantized one or more amplitude-related measurements or the quantized one or more combined values into the encoded audio signal.
2. The apparatus according to claim 1, wherein the object parameter calculator (100) is configured to convert (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins; to calculate (122) selection information from each of the plurality of audio objects for the one or more frequency bins, the selection information being the amplitude-related measurement of the audio object; and to derive (124), based on the selection information, object identifications as the parameter data indicating the at least two related audio objects; and wherein the output interface (200) is configured to introduce information about the object identifications into the encoded audio signal.
3. An apparatus for encoding a plurality of audio objects, comprising: an object parameter calculator (100) configured to calculate, for one or more frequency bins of a plurality of frequency bins associated with a time frame, parameter data for at least two related audio objects among the plurality of audio objects, wherein the object parameter calculator (100) is configured to perform a selection among the plurality of audio objects to obtain the at least two related audio objects, such that not all of the plurality of audio objects are indicated as the at least two related audio objects; and an output interface (200) for outputting an encoded audio signal that includes information about the parameter data of the at least two related audio objects, wherein the object parameter calculator (100) is configured to convert (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins; to calculate (122) selection information from each of the plurality of audio objects for the one or more frequency bins, the selection information being an amplitude-related measurement of the audio object; and to derive (124), based on the selection information, object identifications as the parameter data indicating the at least two related audio objects; wherein the amplitude-related measurement is an amplitude value, a power value, a loudness value, or an amplitude raised to a power different from one for the audio object; wherein the object parameter calculator (100) is configured to calculate (127) a combined value from the amplitude-related measurement of one of the at least two related audio objects and the sum of the amplitude-related measurements of two or more of the at least two related audio objects; and wherein the output interface (200) is configured to introduce information about the combined value and information about the object identifications into the encoded audio signal, the number of information items about the combined value in the encoded audio signal being at least one and less than the number of the at least two related audio objects for the one or more frequency bins.
4. An apparatus for encoding a plurality of audio objects, comprising: an object parameter calculator (100) configured to calculate, for one or more frequency bins of a plurality of frequency bins associated with a time frame, parameter data for at least two related audio objects among the plurality of audio objects, wherein the object parameter calculator (100) is configured to perform a selection among the plurality of audio objects to obtain the at least two related audio objects, such that not all of the plurality of audio objects are indicated as the at least two related audio objects; and an output interface (200) for outputting an encoded audio signal that includes information about the parameter data of the at least two related audio objects, wherein the object parameter calculator (100) is configured to convert (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins; to calculate (122) selection information from each of the plurality of audio objects for the one or more frequency bins, the selection information being an amplitude-related measurement of the audio object; and to derive (124), based on the selection information, object identifications as the parameter data indicating the at least two related audio objects; wherein the output interface (200) is configured to introduce information about the object identifications into the encoded audio signal; and wherein the object parameter calculator (100) is configured to select the object identifications based on an order of the selection information of the plurality of audio objects in the one or more frequency bins.
5. An apparatus for encoding a plurality of audio objects, comprising: an object parameter calculator (100) configured to calculate, for one or more frequency bins of a plurality of frequency bins associated with a time frame, parameter data for at least two related audio objects among the plurality of audio objects, wherein the object parameter calculator (100) is configured to perform a selection among the plurality of audio objects to obtain the at least two related audio objects, such that not all of the plurality of audio objects are indicated as the at least two related audio objects; and an output interface (200) for outputting an encoded audio signal that includes information about the parameter data of the at least two related audio objects, wherein the object parameter calculator (100) is configured to convert (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins; to calculate (122) selection information from each of the plurality of audio objects for the one or more frequency bins, the selection information being an amplitude-related measurement of the audio object; and to derive (124), based on the selection information, object identifications as the parameter data indicating the at least two related audio objects; wherein the output interface (200) is configured to introduce information about the object identifications into the encoded audio signal; wherein the object parameter calculator (100) is configured to calculate (122) a signal power as the selection information for each of the plurality of audio objects; to derive (124), individually for each frequency bin of the one or more frequency bins, the object identifications of the two or more audio objects having the two or more largest signal power values among the signal power values of all audio objects of the plurality of audio objects, these two or more audio objects being the at least two related audio objects; to calculate (126) a power ratio between the signal power of one of the at least two related audio objects and the sum of the signal powers of the at least two related audio objects; and to quantize and encode (212) the power ratio; and wherein the output interface (200) is configured to introduce the quantized and encoded power ratio and the information about the object identifications into the encoded audio signal.
6. An apparatus for encoding a plurality of audio objects, comprising: an object parameter calculator (100) configured to calculate, for one or more frequency bins of a plurality of frequency bins associated with a time frame, parameter data for at least two related audio objects among the plurality of audio objects, wherein the object parameter calculator (100) is configured to perform a selection among the plurality of audio objects to obtain the at least two related audio objects, such that not all of the plurality of audio objects are indicated as the at least two related audio objects; and an output interface (200) for outputting an encoded audio signal that includes information about the parameter data of the at least two related audio objects, wherein the output interface (200) is configured to introduce into the encoded audio signal: one or more encoded transport channels; as the parameter data, for each of the one or more frequency bins of the plurality of frequency bins in the time frame, two or more encoded object identifications of the at least two related audio objects and one or more encoded combined values or encoded amplitude-related measurements; and quantized and encoded direction data for each of the plurality of audio objects in the time frame, the direction data being constant for all frequency bins of the one or more frequency bins.
7. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The object parameter calculator (100) is configured to calculate parameter data for at least the most dominant object and the second most dominant object in one or more frequency bins of the plurality of frequency bins, such that the most dominant object and the second most dominant object represent at least two related audio objects among the plurality of audio objects, or The number of the aforementioned plurality of audio objects is three or more, and the plurality of audio objects includes a first audio object, a second audio object, and a third audio object. 
The object parameter calculator (100) is configured to calculate only a first group of audio objects, including the first and second audio objects, as the at least two associated audio objects for a first frequency bin of the one or more frequency bins, and to calculate only a second group of audio objects, including the second and third audio objects, or the first and third audio objects, as the at least two associated audio objects for a second frequency bin of one or more frequency bins of the multiple frequency bins, wherein the first group of audio objects differs from the second group of audio objects with respect to at least one group member. Device.
8. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The object parameter calculator (100) The process involves calculating raw parametric data with a first time resolution or frequency resolution, combining the raw parametric data to obtain combined parametric data having a second time resolution or frequency resolution lower than the first time resolution or frequency resolution, and calculating parameter data for at least two related speech objects among the plurality of speech objects with respect to the combined parametric data having the second time resolution or frequency resolution, or Determine a parameter bandwidth having a second time resolution or frequency resolution different from the first time resolution or frequency resolution used in the time resolution or frequency resolution of the plurality of audio objects, and calculate the parameter data of at least two related audio objects among the plurality of audio objects with respect to the parameter bandwidth having the second time resolution or frequency resolution. A device configured in such a way.
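The first alternative of claim 8, combining raw parametric data into data of lower frequency resolution, might look as follows. This is only a sketch under the assumption that the raw data are per-bin powers and that combining means summing over contiguous bins; the name `combine_into_bands` and the band-edge layout are hypothetical.

```python
def combine_into_bands(bin_powers, band_edges):
    """Sum per-bin powers into coarser parameter bands.

    band_edges: ascending bin indices; band b covers the half-open
    range [band_edges[b], band_edges[b + 1]).
    """
    return [
        sum(bin_powers[lo:hi])
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]
```

For example, `combine_into_bands([1, 2, 3, 4, 5, 6], [0, 2, 6])` groups six bins into two parameter bands, and the dominant-object selection would then be run once per band instead of once per bin.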
9. The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device further comprises: a downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects; and a transport channel encoder (300) for encoding the one or more transport channels to obtain one or more encoded transport channels, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The apparatus according to any one of claims 1 to 8.
10. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. 
The downmixer (400) is configured to generate two transport channels as two virtual microphone signals positioned at the same location but with different orientations, or positioned at two different locations, relative to a reference position or orientation, or to generate three transport channels as three virtual microphone signals positioned at the same location but with different orientations, or at three different locations, relative to a reference position or orientation, or to generate four transport channels as four virtual microphone signals positioned at the same location but with different orientations, or at four different locations, relative to a reference position or orientation, The virtual microphone signal is a virtual first-order microphone signal, or a virtual cardioid microphone signal, or a virtual figure-eight, dipole, or bidirectional microphone signal, or a virtual directional microphone signal, or a virtual subcardioid microphone signal, or a virtual unidirectional microphone signal, or a virtual hypercardioid microphone signal, or a virtual omnidirectional microphone signal. Device.
11. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The aforementioned down mixer (400) For each of the plurality of audio objects, weighting information for each transport channel is derived using the direction information of the corresponding audio object (402), The corresponding audio object is weighted using the weighting information of the audio object for a specific transport channel (404), and the object contribution of the specific transport channel is obtained. 
The object contributions of the plurality of audio objects to the specific transport channel are combined (406) to obtain the specific transport channel. A device configured in this way.
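The weighting and combining steps of claim 11 (402, 404, 406) can be sketched in the time domain as below. The sketch assumes two virtual cardioid microphones pointing to azimuths of +90 and -90 degrees and a horizontal-only direction per object; the cardioid gain formula, the microphone orientations, and the function names are illustrative assumptions, not details fixed by the claim.

```python
import math


def cardioid_gain(object_azimuth_deg, mic_azimuth_deg):
    """Gain of a virtual cardioid microphone for a source at the given azimuth."""
    return 0.5 + 0.5 * math.cos(math.radians(object_azimuth_deg - mic_azimuth_deg))


def downmix(objects, azimuths_deg, mic_azimuths_deg=(90.0, -90.0)):
    """Downmix object waveforms into one transport channel per virtual microphone.

    objects: list of equal-length sample lists, one per audio object.
    azimuths_deg: one azimuth per object (assumed direction metadata).
    """
    num_samples = len(objects[0])
    channels = []
    for mic_az in mic_azimuths_deg:
        # Step 402: derive one weight per object from its direction.
        weights = [cardioid_gain(az, mic_az) for az in azimuths_deg]
        channel = [0.0] * num_samples
        for weight, obj in zip(weights, objects):
            for n in range(num_samples):
                # Steps 404 and 406: weight the object and add its contribution.
                channel[n] += weight * obj[n]
        channels.append(channel)
    return channels
```

An object located exactly at a microphone's orientation is passed with gain 1, and an object on the opposite side is suppressed, so the two transport channels in this sketch retain a left/right separation of the scene.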
12. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The downmixer (400) is configured to calculate the one or more transport channels as one or more virtual microphone signals that are located at the same position as the reference position or orientation to which the directional information is associated, but have different orientations or are located at different positions. 
The different positions or orientations are on or to the left of the center line, and on or to the right of the center line, or the different positions or orientations are evenly or unevenly distributed across the horizontal positions or orientations, or the different positions or orientations include at least one position or orientation that is oriented upward or downward with respect to the horizontal plane in which the virtual listener is positioned, and the directional information relating to the plurality of sound objects is associated with the position or reference position or orientation of the virtual listener. Device.
13. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The system further comprises a parameter processor (110) that quantizes the metadata indicating the direction information relating to the plurality of sound objects and obtains quantized direction items relating to the plurality of sound objects. 
The downmixer (400) is configured to operate in response to the quantized direction items as the direction information, and the output interface (200) is configured to introduce information regarding the quantized direction items into the encoded audio signal. Device.
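The quantization performed by the parameter processor (110) can be pictured with a small sketch. It assumes a uniform quantizer with a hypothetical step size of 5 degrees for both azimuth and elevation; the claim does not specify the quantizer, and the function name is illustrative.

```python
def quantize_direction(azimuth_deg, elevation_deg, step_deg=5.0):
    """Uniformly quantize a direction (assumed scheme, hypothetical step size).

    Returns the integer indices that would be written to the bitstream and
    the dequantized angles that the downmixer would operate on.
    """
    az_index = round(azimuth_deg / step_deg)
    el_index = round(elevation_deg / step_deg)
    return (az_index, el_index), (az_index * step_deg, el_index * step_deg)
```

Feeding the dequantized angles, rather than the original metadata, to the downmixer presumably keeps the encoder-side downmix consistent with the directions that the decoder later reads from the encoded audio signal.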
14. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The downmixer (400) is configured to perform an analysis of the directional information relating to the plurality of audio objects (410), and to position one or more virtual microphones (412) to generate the transport channel according to the results of the analysis. Device.
15. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The downmixer (400) is configured to downmix (408) using static downmix rules over multiple time frames, or The directional information is variable over the plurality of time frames, and the downmixer (400) is configured to downmix (405) using downmixing rules that are variable over the plurality of time frames. Device.
16. A device for encoding multiple audio objects, An object parameter calculator (100) configured to calculate parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the object parameter calculator (100) is configured to perform a selection on the plurality of audio objects to obtain the at least two related audio objects among the plurality of audio objects, and not to show all of the plurality of audio objects as the at least two related audio objects among the plurality of audio objects, An output interface (200) for outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Equipped with, The plurality of audio objects include associated metadata indicating directional information (810) relating to the plurality of audio objects, The aforementioned device A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the directional information of the plurality of audio objects, A transport channel encoder (300) for encoding one or more transport channels and obtaining one or more encoded transport channels, Furthermore, The output interface (200) is configured to introduce the one or more encoded transport channels into the encoded audio signal. The apparatus is configured such that the downmixer (400) is downmixed in the time domain using sample-by-sample weighting and the combination of samples of the plurality of audio objects.
17. A decoder for decoding an encoded audio signal, comprising one or more transport channels, directional information of a plurality of audio objects, and parameter data of at least two associated audio objects among the plurality of audio objects for one or more frequency bins among a plurality of frequency bins of a time frame, wherein the decoder An input interface (600) for providing parameter data for one or more transport channels and at least two associated audio objects among the plurality of audio objects in a spectral representation having a plurality of frequency bins within the time frame, wherein the number of at least two associated audio objects among the plurality of audio objects is less than the total number of the plurality of audio objects, where the at least two associated audio objects among the plurality of audio objects are selected from the plurality of audio objects and are for obtaining the at least two associated audio objects among the plurality of audio objects, and all audio objects among the plurality of audio objects are not shown as the at least two associated audio objects of the plurality of audio objects, A voice renderer (700) for rendering one or more transport channels into a plurality of voice channels, such that the contribution from one or more transport channels is taken into consideration according to a first directional information associated with a first of the at least two related voice objects among the plurality of voice objects, and according to a second directional information associated with a second of the at least two related voice objects among the plurality of voice objects, using the directional information, Equipped with, The audio renderer (700) is configured to calculate the contribution from one or more transport channels for each of the one or more frequency bins of the plurality of frequency bins, according to a first direction information associated with a first of the at least two related audio 
objects among the plurality of audio objects, and according to a second direction information associated with a second of the at least two related audio objects among the plurality of audio objects. decoder.
18. The audio renderer (700) is configured to ignore, for the one or more frequency bins of the plurality of frequency bins, the direction information of audio objects other than the at least two associated audio objects among the plurality of audio objects. The decoder according to claim 17.
19. The encoded audio signal includes amplitude-related measurements (812) of each related audio object among the plurality of audio objects, or combined values (812) related to the at least two related audio objects among the plurality of audio objects in the parameter data. The audio renderer (700) is configured to determine the quantitative contribution of one or more transport channels (704) according to the amplitude-related measurement value or the combined value. The decoder according to claim 18.
20. The encoded signal includes the combined value in the parameter data, The audio renderer (700) is configured to determine the contribution of the one or more transport channels using the combined value for one related audio object of the at least two related audio objects among the plurality of audio objects and the direction information of that one related audio object (704, 733), The audio renderer (700) is configured to determine the contribution of the one or more transport channels using a value derived from the combined value for another related audio object of the at least two related audio objects in the one or more frequency bins of the plurality of frequency bins, and the direction information of the other related audio object. The decoder according to claim 19.
21. The audio renderer (700) is configured to calculate (704), for each frequency bin of the plurality of frequency bins, direct response information from the at least two related audio objects among the plurality of audio objects for that frequency bin and from the direction information associated with the at least two related audio objects in that frequency bin. The decoder according to any one of claims 17 to 20.
22. The audio renderer (700) is configured to determine (741), for each frequency bin of the plurality of frequency bins, a diffuse signal using diffuseness information, such as a diffuseness parameter included in the metadata or a decorrelation rule, and to combine the direct response determined by the direct response information with the diffuse signal to obtain a spectral-domain rendered signal for an audio channel among the plurality of audio channels, or The audio renderer (700) is configured to calculate covariance synthesis information (706) using the direct response information (704) and the information regarding the number of audio channels (702), and to apply the covariance synthesis information to the one or more transport channels (727) to obtain the number of audio channels, The direct response information (704) is a direct response vector for each of the at least two related audio objects among the plurality of audio objects, the covariance synthesis information is a covariance synthesis matrix, and the audio renderer (700) is configured to perform a matrix operation for each frequency bin when applying the covariance synthesis information (727). The decoder according to claim 21.
23. The audio renderer (700) is configured such that: In the calculation of the direct response information (704), a direct response vector is derived for each of the at least two related audio objects among the plurality of audio objects, and a covariance matrix is calculated from each direct response vector for each of the at least two related audio objects among the plurality of audio objects. In the calculation of the covariance synthesis information, target covariance information is derived (724) from: The covariance matrix from each of the at least two related audio objects among the plurality of audio objects, The power information of each of the at least two related audio objects among the plurality of audio objects, and Power information derived from the one or more transport channels. The decoder according to claim 21 or 22.
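The derivation of the target covariance information (724) for one frequency bin can be sketched as follows. The sketch assumes real-valued direct response vectors (for example panning gains towards each output channel) and assumes that each object's power is obtained by scaling the transport-channel power with that object's power ratio; both assumptions, and the function name, are illustrative rather than taken from the claim.

```python
def target_covariance(direct_responses, power_ratios, transport_power):
    """Target covariance for one frequency bin (illustrative sketch).

    direct_responses: one real-valued gain vector per related audio object,
                      with one gain per output audio channel.
    power_ratios:     share of the bin power attributed to each object.
    transport_power:  power of the bin derived from the transport channels.
    """
    num_channels = len(direct_responses[0])
    cov = [[0.0] * num_channels for _ in range(num_channels)]
    for response, ratio in zip(direct_responses, power_ratios):
        object_power = ratio * transport_power
        for a in range(num_channels):
            for b in range(num_channels):
                # Outer product of the direct response vector, scaled by object power.
                cov[a][b] += object_power * response[a] * response[b]
    return cov
```

Under these assumptions the result is the sum of the per-object covariance matrices weighted by the object powers, which would then serve as the target when mixing information is derived from the input covariance of the transport channels.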
24. The aforementioned audio renderer (700) In the calculation of the direct response information (704), a direct response vector is derived for each related audio object of at least two related audio objects among the plurality of audio objects, and a covariance matrix is calculated from each direct response vector for each related audio object of at least two related audio objects among the plurality of audio objects (723), Input covariance information is derived from the transport channel (726), Mixing information is derived from the target covariance information, the input covariance information, and the information regarding the number of audio channels (725a, 725b), The mixing information is applied to the transport channel of each frequency bin within the time frame (727). The decoder according to claim 23, configured as follows.
25. The decoder according to claim 24, wherein the result of applying the mixing information to each frequency bin within the time frame is converted to a time domain (708) and the number of audio channels in the time domain is obtained.
26. The audio renderer (700) is configured such that: In the decomposition of the input covariance matrix (752), only the main diagonal elements of the input covariance matrix derived from the transport channels are used; A decomposition (751) of the target covariance matrix is performed using the direct response matrix and the power matrix of the audio objects or transport channels; The input covariance matrix is decomposed by taking the root of each main diagonal element of the input covariance matrix (752); A regularized inverse of the decomposed input covariance matrix is calculated (753); A singular value decomposition is performed when calculating the optimal matrix used for energy compensation without the extended identity matrix (756). The decoder according to any one of claims 21 to 25.
27. A method for encoding multiple audio objects, A step of calculating parameter data for at least two related audio objects among a plurality of audio objects for one or more frequency bins of a plurality of frequency bins associated with a time frame, wherein the number of at least two related audio objects among the plurality of audio objects is less than the total number of audio objects in the plurality of audio objects, and the calculation step includes a step of performing a selection on the plurality of audio objects in order to obtain at least two related audio objects among the plurality of audio objects, such that all audio objects among the plurality of audio objects are not shown as at least two related audio objects among the plurality of audio objects, The steps include outputting an encoded audio signal that includes information about the parameter data of at least two related audio objects among the plurality of audio objects, Includes, The calculation step includes quantizing and encoding (212) one or more amplitude-related measurements or one or more combined values derived from the amplitude-related measurements of at least two related audio objects in one or more frequency bins of the plurality of frequency bins as parameter data, and the output step includes introducing the quantized one or more amplitude-related measurements or the quantized one or more combined values into the encoded audio signal. 
or
the calculation step comprises: converting (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins; calculating (122) selection information from each of the plurality of audio objects for the one or more frequency bins of the plurality of frequency bins, wherein the selection information is an amplitude-related measurement of the audio object; and deriving (124), based on the selection information, object identification as the parameter data indicating the at least two related audio objects among the plurality of audio objects, wherein the amplitude-related measurement is an amplitude value, a power value, or a loudness value, or an amplitude multiplied by a factor different from the amplitude of the audio object,
the calculation step comprises calculating (127) a combined value from an amplitude-related measurement associated with the at least two related audio objects among the plurality of audio objects and the sum of two or more amplitude-related measurements of the at least two related audio objects, and the output step comprises introducing information about the combined value into the encoded audio signal, wherein the number of information items about the combined value in the encoded audio signal is at least one and less than the number of the at least two related audio objects in the one or more frequency bins of the plurality of frequency bins,
the calculation step comprises selecting the object identification based on the order of the selection information of the plurality of audio objects in the one or more frequency bins of the plurality of frequency bins,
or
the calculation step comprises: calculating (122) a signal power as the selection information for each of the plurality of audio objects; deriving (124), individually for each frequency bin of the one or more frequency bins, the object identification of the two or more audio objects having the highest signal power values among the signal power values of all audio objects of the plurality of audio objects, wherein the two or more audio objects having the highest signal power values are the at least two related audio objects among the plurality of audio objects; calculating (126) a power ratio between the sum of the signal powers of the at least two related audio objects and the signal power of one of the at least two related audio objects; and quantizing and encoding (212) the power ratio, wherein the output step comprises introducing the quantized and encoded power ratio into the encoded audio signal,
or
the output step comprises introducing, into the encoded audio signal: one or more encoded transport channels; as the parameter data, for each of the one or more frequency bins of the plurality of frequency bins within the time frame, two or more encoded object identifications of the at least two related audio objects among the plurality of audio objects and one or more encoded combined values or encoded amplitude-related measurements; and quantized and encoded directional data for each of the plurality of audio objects within the time frame, wherein the directional data is constant for all frequency bins of the one or more frequency bins of the plurality of frequency bins,
or
the calculation step comprises calculating the parameter data for at least a most dominant object and a second most dominant object in the one or more frequency bins of the plurality of frequency bins, wherein the most dominant object and the second most dominant object represent the at least two related audio objects among the plurality of audio objects,
or
the number of the plurality of audio objects is three or more, and the plurality of audio objects includes a first audio object, a second audio object, and a third audio object, wherein the calculation step comprises: calculating, for a first frequency bin of the one or more frequency bins, only a first group of audio objects from the plurality of audio objects, including the first audio object and the second audio object, as the at least two related audio objects; and calculating, for a second frequency bin of the one or more frequency bins, only a second group of audio objects, including the second audio object and the third audio object, or the first audio object and the third audio object, as the at least two related audio objects, wherein the first group of audio objects differs from the second group of audio objects with respect to at least one group member,
or
the calculation step comprises: calculating raw parametric data with a first time resolution or frequency resolution; combining the raw parametric data to obtain combined parametric data having a second time resolution or frequency resolution lower than the first time resolution or frequency resolution; and calculating the parameter data for the at least two related audio objects among the plurality of audio objects with respect to the combined parametric data having the second time resolution or frequency resolution,
or
the calculation step comprises: determining parameter bands having a second time resolution or frequency resolution different from a first time resolution or frequency resolution used for the time resolution or frequency resolution of the plurality of audio objects; and calculating the parameter data of the at least two related audio objects among the plurality of audio objects with respect to the parameter bands having the second time resolution or frequency resolution.
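For illustration only, and not as part of the claim language, the alternative with steps (122), (124), (126) and (212) can be sketched as follows. The sketch assumes that each audio object is already available as a list of complex spectral values for one time frame, that exactly two related audio objects are kept per frequency bin, and that the power ratio is quantized uniformly with 3 bits. The function names, the 3-bit resolution, and the fallback for a silent bin are hypothetical choices and are not taken from the specification.

```python
from typing import List, Sequence, Tuple


def dominant_objects_per_bin(
    spectra: Sequence[Sequence[complex]], num_related: int = 2
) -> List[Tuple[List[int], float]]:
    """For every frequency bin, return (object identifications, power ratio).

    spectra[k][b] is the spectral value of audio object k in frequency bin b.
    The power ratio is the power of the most dominant object divided by the
    summed power of the selected (related) objects.
    """
    num_bins = len(spectra[0])
    params = []
    for b in range(num_bins):
        # Step (122): signal power of each object as selection information.
        powers = [abs(obj[b]) ** 2 for obj in spectra]
        # Step (124): object identifications of the highest-power objects.
        order = sorted(range(len(powers)), key=lambda k: powers[k], reverse=True)
        ids = order[:num_related]
        # Step (126): ratio between one object's power and the summed power.
        total = sum(powers[k] for k in ids)
        # Assumption: a silent bin gets an equal split between the objects.
        ratio = powers[ids[0]] / total if total > 0.0 else 1.0 / num_related
        params.append((ids, ratio))
    return params


def quantize_ratio(ratio: float, bits: int = 3) -> int:
    """Step (212), sketched as uniform quantization (assumed, 3 bits)."""
    levels = (1 << bits) - 1
    clamped = min(max(ratio, 0.0), 1.0)
    return round(clamped * levels)


def dequantize_ratio(index: int, bits: int = 3) -> float:
    """Inverse of quantize_ratio, as a decoder would apply it."""
    return index / ((1 << bits) - 1)


if __name__ == "__main__":
    # Three objects, two frequency bins (made-up values).
    example = [[1 + 0j, 0.1 + 0j], [2 + 0j, 0.2 + 0j], [0.5 + 0j, 3 + 0j]]
    for ids, ratio in dominant_objects_per_bin(example):
        print(ids, quantize_ratio(ratio))
```

With two related audio objects per frequency bin, only one ratio needs to be transmitted, because the second ratio follows as one minus the first. This corresponds to the condition that the number of information items about the combined value is at least one and less than the number of related audio objects.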
28. A method for decoding an encoded audio signal comprising one or more transport channels and direction information for a plurality of audio objects, and parameter data of at least two associated audio objects for one or more frequency bins of a plurality of frequency bins of a time frame, the decoding method comprising:
providing the one or more transport channels and the parameter data for the at least two associated audio objects among the plurality of audio objects in a spectral representation having the plurality of frequency bins within the time frame, wherein the number of the at least two associated audio objects is less than the total number of the plurality of audio objects, wherein the at least two associated audio objects are selected from the plurality of audio objects so as to obtain the at least two associated audio objects, and wherein not all audio objects of the plurality of audio objects are indicated as the at least two associated audio objects; and
audio rendering the one or more transport channels into a plurality of audio channels using the direction information,
wherein the audio rendering comprises calculating, for each of the one or more frequency bins of the plurality of frequency bins, a contribution from the one or more transport channels according to first direction information associated with a first one of the at least two associated audio objects and according to second direction information associated with a second one of the at least two associated audio objects, or such that a contribution from the one or more transport channels according to the first direction information associated with the first one of the at least two associated audio objects and according to the second direction information associated with the second one of the at least two associated audio objects is taken into account.
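For illustration only, and not as part of the claim language, the per-bin contribution of a transport channel according to two pieces of direction information can be sketched as follows. The sketch assumes a single mono transport channel, a two-channel output, a sine/cosine panning law driven by an azimuth between -90 and +90 degrees (positive to the left), and amplitude weights taken as the square roots of the transmitted power ratio and its complement. All of these, including the function names, are hypothetical simplifications and are not the rendering or covariance synthesis described in the specification.

```python
import math
from typing import Sequence, Tuple


def pan_gains(azimuth_deg: float) -> Tuple[float, float]:
    """Assumed sine/cosine panning law: returns (left gain, right gain)."""
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.sin(theta), math.cos(theta)


def render_bin(
    transport_value: complex,
    ids: Tuple[int, int],
    ratio: float,
    azimuths_deg: Sequence[float],
) -> Tuple[complex, complex]:
    """Render one frequency bin of a mono transport channel to two channels.

    ids holds the object identifications of the first and second associated
    audio objects of this bin; ratio is the power ratio of the first object.
    """
    first, second = ids
    # Power ratios are converted to amplitude weights (assumption).
    w_first = math.sqrt(ratio)
    w_second = math.sqrt(1.0 - ratio)
    # Direction information of each associated object selects its gains.
    g_first = pan_gains(azimuths_deg[first])
    g_second = pan_gains(azimuths_deg[second])
    left = transport_value * (w_first * g_first[0] + w_second * g_second[0])
    right = transport_value * (w_first * g_first[1] + w_second * g_second[1])
    return left, right


if __name__ == "__main__":
    # Object 0 hard left, object 1 hard right, object 0 dominant in this bin.
    print(render_bin(1.0 + 0j, (0, 1), 0.8, [90.0, -90.0]))
```

Because the object identifications may change from one frequency bin to the next, the same transport channel can be steered toward different pairs of directions in different bins, even though the direction information itself is transmitted once per time frame.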
29. A computer program for performing the method according to claim 27 or the method according to claim 28, when executed on a computer or processor.