Accurate performance optimization for audio decoding and rendering

The integrated rendering algorithm using linear and polynomial interpolation synchronizes JOC upmixer and OAR matrices, addressing computational complexity and distortion issues in audio decoding and rendering, achieving over 30% optimization and accurate signal reconstruction.

WO2026128437A1 · PCT designated stage · Publication Date: 2026-06-18 · DOLBY LABORATORIES LICENSING CORP

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: DOLBY LABORATORIES LICENSING CORP
Filing Date: 2025-12-09
Publication Date: 2026-06-18

Application Information

IPC: H04S3/00; G10L19/008

AI Tagging

Application Domain

Speech analysis; Stereophonic systems

Technology Topics

Algorithm Engineering

Abstract

The disclosure relates to a method of reconstructing an audio program for rendering at a playback device. The method includes receiving a downmix signal indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program, the downmix signal comprising at least one channel; obtaining upmixing metadata defining a time-variant upmixing matrix for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal; obtaining object metadata defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal, wherein the time-variant upmixing matrix and the time-variant rendering matrix are dividable into a plurality of sampling frames each having a predetermined number of slots; and determining, for each slot of a sampling frame, an integrated upmixing and rendering matrix based on the time-variant upmixing and rendering matrices.

Description

ACCURATE PERFORMANCE OPTIMIZATION FOR AUDIO DECODING AND RENDERING

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application is related to United States Provisional Patent Application No. 63/931,481, filed on 04 December 2025, and International PCT Application No. PCT/CN2024/138613 filed on 11 December 2024, the entire contents of each of which are incorporated herein by reference.

Technical Field

[0002] The present disclosure relates to techniques of audio processing and, in particular, to audio decoding and rendering using optimization algorithms with accurate performance.

Background

[0003] Conventional audio systems employ channel-based audio (CBA) coding, where each channel may represent the content intended for one speaker (or one speaker array). For example, a set of tracks may be implicitly assigned to specific loudspeakers by associating the set of tracks with a channel configuration. For a playback speaker configuration different from the coded channel configuration, downmixing or upmixing specifications are required to redistribute audio to the available speakers.

[0004] On the other hand, object-based audio (OBA) coding applies rendering to objects that include the object audio essence (e.g., audio signal) in conjunction with associated metadata that contains individually assigned object properties (e.g., positional metadata, object width metadata, directivity metadata, etc.). In this way, in case of changing speaker configuration, audio can be reproduced based on better information regarding how to render to fewer (or more) speakers, so as to adapt audio rendering to a new speaker configuration.

[0005] The transmission of OBA, i.e., audio programs which include a plurality of audio objects, may require a relatively large bandwidth. In order to reduce the bandwidth of such audio programs, the plurality of audio objects may be downmixed to a limited number of audio channels (e.g., the audio objects may be downmixed to two audio channels (e.g., a stereo downmix signal), to 5+1 audio channels (e.g., a 5.1 downmix signal) or to 7+1 audio channels (e.g., a 7.1 downmix signal)). Such downmixed CBA content is to be transmitted using standardized formats (e.g., enhanced AC-3 (E-AC-3) defined in ETSI TS 102 366 (also known as the Dolby Digital Plus (DD+) audio compression scheme)).

[0006] To ensure compatibility with pre-existing devices, joint object coding (JOC) may be used in conjunction with the standardized CBA formats to transport OBA. In particular, JOC delivers immersive audio at low bitrates, which is achieved by conveying a multi-channel downmix (i.e., CBA) of the immersive content using perceptual audio coding algorithms together with parametric side information that enables the reconstruction of the audio objects (i.e., OBA) from the downmix in the decoder. In other words, the CBA content may be represented (e.g., by a JOC) as OBA content (i.e., audio objects) so that the content is compatible with OBA playback devices for rendering. The conversion of the CBA content into OBA content is known as an audio upmixing process, which may be implemented by e.g., a JOC upmixer. At the rendering stage, the converted audio objects are to be rendered (e.g., at an object audio renderer, OAR) based on the associated metadata.

[0007] In order to reduce the computational complexity of the decoder, an integrated rendering solution has been proposed that incorporates the audio upmixing process into the rendering stage. However, since the standardized bitstream formats for CBA and OBA are not entirely compatible, additional distortion may be introduced by the integrated rendering solution. Hence, there is a need for an improved algorithm which can provide accurate rendering results, in order to solve the current complexity bottlenecks in audio decoding and rendering processes.

Summary

[0008] In view of this need, the present disclosure provides computer-implemented methods for audio decoding and rendering, apparatus for audio decoding and rendering, computer programs, and computer-readable storage media, to reconstruct an audio program for rendering at a playback device, having the features of the respective independent claims.

[0009] The present disclosure aims at reducing computational complexity in decoding and rendering processes by using integrated rendering to reconstruct the audio signal. The approaches / algorithms proposed in the present disclosure may be applied to e.g., the Dolby Digital Plus (DD+) digital audio compression scheme in combination with the joint object coding (JOC). For example, the current DD+ JOC implementation uses the processing of core decoding followed by the JOC decoding (e.g., via a JOC upmixer), which may be followed by a frequency-band renderer for generating a QMF (Quadrature Mirror Filter)-domain (i.e., frequency-domain) rendered output. The present disclosure is thus directed to integrating a JOC upmixer (e.g., a JOC upmixer gain matrix) with a QMF-domain object audio renderer (OAR) (e.g., an OAR rendering matrix) to form an integrated rendering block that jointly performs the upmixing and rendering processes to reduce the required computational efforts.

[0010] However, as mentioned above, the upmixer and the OAR might not be entirely compatible. Especially, it is not trivial to integrate these two matrices (i.e., the upmixing matrix and the rendering matrix), because they are not always synchronized. The use of linear interpolation by itself may produce an inaccurate reconstruction by introducing additional distortions to the rendered signal. Accordingly, the present disclosure proposes to use combined linear and polynomial interpolation, which can (1) reduce the number of combined matrices that need to be calculated and (2) reduce distortion introduced by linear interpolation in the current implementation. For example, the use of polynomial interpolation as proposed by the present disclosure allows for a compensated delta value (see Table 1 and Table 2 below) for use in linear interpolation, which can effectively fix the introduced distortion. As a result, the proposed solution can achieve over a 30% optimization in comparison with the current implementation.
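One way to see why linear interpolation of a combined matrix falls short, and why a polynomial correction can repair it, is the worked derivation below. It is an illustrative sketch under assumptions that are not stated in the text: both matrices are taken to ramp linearly across one frame with aligned frame boundaries, and the symbols U_0, R_0, \Delta U, \Delta R, C_0, C_1 and the normalized position t are introduced here for the example only.

```latex
\begin{aligned}
U(t) &= U_0 + t\,\Delta U, \qquad R(t) = R_0 + t\,\Delta R, \qquad 0 \le t \le 1,\\
C_0 &= R_0 U_0, \qquad C_1 = (R_0 + \Delta R)(U_0 + \Delta U),\\
R(t)\,U(t) &= R_0 U_0 + t\,(R_0\,\Delta U + \Delta R\,U_0) + t^{2}\,\Delta R\,\Delta U,\\
(1-t)\,C_0 + t\,C_1 &= R_0 U_0 + t\,(R_0\,\Delta U + \Delta R\,U_0) + t\,\Delta R\,\Delta U,\\
R(t)\,U(t) - \bigl[(1-t)\,C_0 + t\,C_1\bigr] &= -\,t\,(1-t)\,\Delta R\,\Delta U .
\end{aligned}
```

Under these assumptions the residual is quadratic in t, vanishes at both frame boundaries, and is largest mid-frame, so one additional term per slot would be enough to recover the exact product from two per-frame combined matrices.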

[0011] The described solution according to the present disclosure may be delivered with the UDC (Unified Decoder Converter) 1.15 (CIDK v4.12) release and may benefit all DD+ JOC downstream products, including: MS 12, Harmonious, DAX, DAS, DAA, and DCX, as well as extend the lifespan of the Dolby ATMOS technology.

[0012] One aspect of the present disclosure relates to a computer-implemented method of reconstructing an audio program for rendering at a playback device. The method may include receiving a downmix signal indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program. In particular, the downmix signal may include at least one channel (e.g., a multi-channel downmix signal). The method may further include obtaining upmixing metadata defining a time-variant upmixing matrix for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal (i.e., for the audio upmixing process). Also, the method may include obtaining object metadata defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal (i.e., for the rendering stage). In addition, the method may also include determining, for each slot of a sampling frame, an integrated upmixing and rendering matrix based on the time-variant upmixing matrix and the time-variant rendering matrix.

[0013] Notably, the time-variant upmixing matrix and the time-variant rendering matrix may be dividable into a plurality of sampling frames each having a predetermined number of slots. Moreover, the integrated upmixing and rendering matrix may be applicable to a corresponding slot (i.e., a slot sample) of the downmix signal for generating the rendered output.

[0014] Specifically, determining the integrated upmixing and rendering matrix may include:

[0015] determining, for the sampling frame, a combined matrix based on the time-variant upmixing matrix and the time-variant rendering matrix corresponding to that sampling frame,

[0016] determining a slot offset value indicative of a number of slots by which a first frame boundary with respect to the time-variant upmixing matrix is misaligned with a second frame boundary with respect to the time-variant rendering matrix, and

[0017] determining, for each slot of the sampling frame, an interpolated matrix based on the determined combined matrix and the slot offset value.

[0018] Configured as above, by incorporating the time-variant upmixing matrix and the time-variant rendering matrix, integrated rendering combining the audio upmixing process and the rendering stage may be achieved to reduce computational complexity in audio decoding and rendering processes. The proposed algorithm further solves the problem of incompatibility between the upmixing process and the rendering stage by applying accurate interpolation to each slot sample of the downmix signal (i.e., for the determination of the interpolated matrix). On the other hand, a multiplication operation is conducted on a per-frame basis (i.e., for the calculation of the combined matrix), which allows for enhanced rendering accuracy with reduced computational efforts.
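The three determining steps above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the helper names (`matmul`, `slot_offset`, `interpolated_matrix`), the convention that frame boundaries are given as absolute slot indices, the toy matrices, and the use of plain linear blending across the offset span are all assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def slot_offset(upmix_boundary, render_boundary, slots_per_frame):
    """Slots by which the rendering boundary trails the upmixing boundary."""
    return (render_boundary - upmix_boundary) % slots_per_frame


def interpolated_matrix(c1, c2, slot, offset):
    """Blend c1 (at the first boundary, slot 0) into c2 (at slot == offset).

    Assumption: plain linear blending; no compensation term is applied here.
    """
    if offset == 0:  # boundaries coincide, nothing to blend
        return [row[:] for row in c1]
    w2 = slot / offset
    w1 = 1.0 - w2
    return [[w1 * x + w2 * y for x, y in zip(r1, r2)]
            for r1, r2 in zip(c1, c2)]


# Hypothetical toy data: 2 downmix channels, 3 objects, 2 output channels.
U1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # upmixing matrix, first boundary
R1 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # rendering matrix, first boundary
U2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # upmixing matrix, second boundary
R2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # rendering matrix, second boundary

# Step 1: combined matrices, one matrix product per boundary.
C1 = matmul(R1, U1)
C2 = matmul(R2, U2)

# Step 2: slot offset between the two frame boundaries (8 slots per frame
# is a hypothetical frame length).
offset = slot_offset(upmix_boundary=16, render_boundary=19, slots_per_frame=8)

# Step 3: one interpolated matrix per slot between the two boundaries.
per_slot = [interpolated_matrix(C1, C2, s, offset) for s in range(offset + 1)]
```

In this sketch the only matrix products are the two in step 1; every per-slot matrix is obtained by element-wise blending, which is the source of the complexity saving described above.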

[0019] In some embodiments, the method may further include acquiring, from the upmixing metadata, a sampled upmixing matrix for a first sampling frame associated with the time-variant upmixing matrix and for a second sampling frame associated with the time-variant rendering matrix. Also, the method may include acquiring, from the object metadata, a sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix. In such cases, determining the combined matrix may include:

[0020] computing, for the first sampling frame associated with the time-variant upmixing matrix, a first combined matrix based on the acquired sampled upmixing matrix corresponding to that first sampling frame and the acquired sampled rendering matrix corresponding to that first sampling frame, and

[0021] computing, for the second sampling frame associated with the time-variant rendering matrix, a second combined matrix based on the acquired sampled upmixing matrix corresponding to that second sampling frame and the acquired sampled rendering matrix corresponding to that second sampling frame.

[0022] In some embodiments, the interpolated matrix for each slot may be determined based on at least one of the computed first combined matrix and the computed second combined matrix.

[0023] In some embodiments, the sampled upmixing matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix may be acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively. Also, the sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix may be acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively.

[0024] In some embodiments, the proposed algorithm may introduce polynomial interpolation to slot samples. Specifically, the method may further include determining, for each slot of a sampling frame, a first weighting factor for the first combined matrix, a second weighting factor for the second combined matrix, and a compensation value based on the determined slot offset value. In such cases, the interpolated matrix may be generated by a linear combination of the first combined matrix multiplied with the first weighting factor, the second combined matrix multiplied with the second weighting factor, and the compensation value. In some embodiments, along processing of slots, the first weighting factor may ramp down for the first combined matrix, and the second weighting factor may ramp up for the second combined matrix.

[0025] In some embodiments, the compensation value may comprise a compensation weighting factor in dependence of the determined slot offset value and a current slot position. In particular, along processing of slots, the compensation weighting factor may ramp up away from the first and second frame boundary and may ramp down towards the first and second frame boundary.

[0026] In some embodiments, the compensation value may be determined further based on an upmixing ramping step of the time-variant upmixing matrix between slots, and a rendering ramping step of the time-variant rendering matrix between the slots. In such cases, at a current slot, the rendering ramping step of the time-variant rendering matrix may be determined based on a slot position of the current slot with respect to the first frame boundary and the second frame boundary.
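The weighted combination with a compensation value can be illustrated numerically. The sketch below is not the disclosed algorithm: it assumes the simplest case of a zero slot offset (aligned frame boundaries) and matrices that ramp linearly over a hypothetical frame of 8 slots. Under those assumptions the exact product at slot s equals w1·C1 + w2·C2 − s·(N − s)·(ΔR/N)(ΔU/N), where the last term is built from the per-slot ramping steps. All function names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def weighted_sum(terms):
    """Element-wise sum of (weight, matrix) pairs of equal shape."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(w * m[i][j] for w, m in terms) for j in range(cols)]
            for i in range(rows)]


def frame_terms(r1, u1, r2, u2, n_slots):
    """Per-frame work: two combined matrices plus the product of ramping steps."""
    c1 = matmul(r1, u1)
    c2 = matmul(r2, u2)
    step_u = weighted_sum([(1.0 / n_slots, u2), (-1.0 / n_slots, u1)])
    step_r = weighted_sum([(1.0 / n_slots, r2), (-1.0 / n_slots, r1)])
    return c1, c2, matmul(step_r, step_u)


def integrated_matrix(c1, c2, cross, slot, n_slots):
    """Per-slot work: a linear blend plus a quadratic compensation term."""
    w2 = slot / n_slots                 # ramps up for the second combined matrix
    w1 = 1.0 - w2                       # ramps down for the first combined matrix
    w_comp = -slot * (n_slots - slot)   # zero at both boundaries, peaks mid-frame
    return weighted_sum([(w1, c1), (w2, c2), (w_comp, cross)])


def exact_matrix(r1, u1, r2, u2, slot, n_slots):
    """Reference: ramp both matrices to the slot, then multiply them."""
    t = slot / n_slots
    r = weighted_sum([(1.0 - t, r1), (t, r2)])
    u = weighted_sum([(1.0 - t, u1), (t, u2)])
    return matmul(r, u)


# Hypothetical toy data: 2 downmix channels, 3 objects, 2 output channels.
N_SLOTS = 8
U1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
R1 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
U2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
R2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

C1, C2, CROSS = frame_terms(R1, U1, R2, U2, N_SLOTS)
compensated = [integrated_matrix(C1, C2, CROSS, s, N_SLOTS)
               for s in range(N_SLOTS + 1)]
linear_only = [weighted_sum([(1.0 - s / N_SLOTS, C1), (s / N_SLOTS, C2)])
               for s in range(N_SLOTS + 1)]
```

With this toy data the compensated matrices match the slot-by-slot reference product, while the purely linear blend deviates most at the middle of the frame. A non-zero slot offset would need a more elaborate weighting than the one assumed here.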

[0027] In some embodiments, the rendering ramping step of the time-variant rendering matrix between the slots may vary in response to an update of rendering gains in the rendering matrix at the second frame boundary.

[0028] In some embodiments, the first combined matrix may be determined by multiplying the acquired sampled upmixing matrix corresponding to that first sampling frame with the acquired sampled rendering matrix corresponding to that first sampling frame. In some embodiments, the second combined matrix may be determined by multiplying the acquired sampled upmixing matrix corresponding to that second sampling frame with the acquired sampled rendering matrix corresponding to that second sampling frame.

[0029] In some embodiments, for a slot located at the first frame boundary, the first combined matrix may be determined as the integrated upmixing and rendering matrix. Alternatively, for a slot located at the second frame boundary, the second combined matrix may be determined as the integrated upmixing and rendering matrix. In some embodiments, for a slot not located at a frame boundary, the interpolated matrix may be determined as the integrated upmixing and rendering matrix.

[0030] In some embodiments, the time-variant upmixing matrix may contain a plurality of upmixing coefficients for generating the multi-channel upmix signal from the at least one channel of the downmix signal (i.e., the upmixing process). In some embodiments, the time-variant rendering matrix may contain a plurality of rendering gains for generating the rendered output from the multi-channel upmix signal (i.e., the rendering process). In some embodiments, the method may further include transforming the downmix signal from time domain to frequency domain. In such cases, the rendering matrix may comprise a frequency-domain rendering matrix for rendering the plurality of audio objects in the frequency domain.

[0031] In some embodiments, the method may further include, during an initial sampling frame, applying, for each slot within the initial sampling frame, the time-variant upmixing matrix and the time-variant rendering matrix in sequence to a corresponding slot of the downmix signal.

[0032] In some embodiments, the downmix signal may comprise a 5.1 channel signal, or a 7.1 channel signal.

[0033] Configured as above, by combining linear and polynomial interpolation, the number of combined matrices that need to be calculated can be significantly reduced, i.e., the calculation for the combined matrices may only take place on a per-frame basis. Furthermore, the distortion introduced by the linear interpolation (as in the current implementation) can be effectively eliminated. In particular, the proposed algorithm according to the present disclosure with the use of polynomial interpolation allows for a compensated delta value for use in the linear interpolation, which can reduce the introduced distortion. As a result, the solution can achieve e.g., over a 30% optimization in comparison with the current implementation.

[0034] In view of the above, the present disclosure proposes a computationally efficient and accurate method for audio reconstruction, especially for decoding and rendering an audio signal in a decoding system (e.g., a decoder) suffering from synchronization issues. The proposed algorithm is particularly beneficial to decoding systems receiving e.g., DD+ JOC bitstreams where the JOC upmixing matrix and the OAR rendering matrix are not always synchronized, and it is not possible to calculate the combined matrix only once per frame. By performing (polynomial) interpolation for each slot, compensation of a delta value for linear interpolation can fix the distortion caused by the linear interpolation, thereby accurately reconstructing the audio signal for rendering. In other words, the present disclosure proposes an accurate and performance-optimized solution to reconstruct an audio signal by an integrated renderer, which jointly applies linear and polynomial interpolation to achieve higher complexity reduction and accurate reconstruction of audio signals.

[0035] In addition to solving the above-mentioned complexity bottlenecks in e.g., the DDPlus JOC decoding and rendering processes, the present disclosure further improves the seamless transition effect by outputting the legacy downmix output and the rendered output simultaneously, which allows the system layer to implement more complex seamless transition behavior.

[0036] According to another aspect, an apparatus for reconstructing an audio program for rendering at a playback device is provided. The apparatus may include a processor and a memory coupled to the processor and storing instructions for the processor. The processor may be configured to perform all steps of the methods according to preceding aspects and their embodiments.

[0037] According to a further aspect, a computer program is described. The computer program may comprise executable instructions for performing the methods or method steps outlined throughout the present disclosure when executed by a computing device (e.g., processor).

[0038] According to another aspect, a computer-readable storage medium is described. The storage medium may store a computer program adapted for execution on a computing device (e.g., processor) and for performing the methods or method steps outlined throughout the present disclosure when carried out on the computing device.

[0039] It should be noted that the methods and systems including their preferred embodiments as outlined in the present disclosure may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in the present disclosure may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.

[0040] It will be appreciated that apparatus features and method steps may be interchanged in many ways. In particular, the details of the disclosed method(s) can be realized by the corresponding apparatus, and vice versa, as the skilled person will appreciate. Moreover, any of the above statements made with respect to the method(s) (and, e.g., their steps) are understood to likewise apply to the corresponding apparatus (and, e.g., their blocks, stages, units), and vice versa.

Brief Description of the Drawings

[0041] The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein

[0042] Fig. 1(a) schematically illustrates an exemplary block diagram for a decoder 100a using a conventional time-domain object audio renderer;

[0043] Fig. 1(b) schematically illustrates an exemplary block diagram for a decoder 100b using an integrated frequency-band object audio renderer in accordance with embodiments of the present disclosure;

[0044] Fig. 1(c) schematically illustrates an exemplary block diagram for a decoder 100c using an integrated frequency-band object audio renderer in accordance with embodiments of the present disclosure;

[0045] Fig. 2 schematically illustrates possible implementations of an audio decoder / renderer 200 using an integrated processing unit in accordance with embodiments of the present disclosure;

[0046] Fig. 3 schematically illustrates an exemplary architecture for a decoder 300 for reconstructing and rendering an audio program using an integrated frequency-band object audio renderer in accordance with embodiments of the present disclosure;

[0047] Fig. 4 shows an example flowchart of a method 400 of reconstructing an audio program for rendering at a playback device in accordance with embodiments of the disclosure; and

[0048] Fig. 5 schematically illustrates an exemplary apparatus for implementing the proposed audio reconstruction methods in accordance with embodiments of the disclosure.

Detailed Description

[0049] In the following, example embodiments of the disclosure will be described with reference to the appended figures. Identical elements in the figures may be indicated by identical reference numbers, and repeated description thereof may be omitted.

[0050] In general, reconstructing audio signals (in the form of audio objects) for rendering may require upmixing and rendering processes at a decoder. Fig. 1(a) schematically illustrates an exemplary block diagram for a decoder 100a using a conventional time-domain object audio renderer (TDOAR). The decoder 100a comprises a core decoding element 101, an upmixing element 102a, and a rendering element 103a. The core decoding element 101 is configured to receive and decode a downmix signal from an encoder (not shown), and then to provide the decoded downmix signal to the upmixing element 102a. The upmixing element 102a may comprise a pre-processing unit 10 that performs a frequency-domain analysis (e.g., a QMF analysis) and 90-degree phase filtering to transform the downmix signal to the frequency domain, an upmixing unit 12 that converts the downmix signal to a plurality of audio objects to be rendered, and a post-processing unit 13 that performs a time-domain analysis via e.g., a QMF synthesis to obtain the audio objects in the time domain. In the rendering element 103a, rendering gains for rendering the audio objects may be estimated by a computation unit 14, followed by a time-domain rendering unit 15 that renders the audio objects in the time domain. In the decoder 100a, the core decoding element 101 and the upmixing element 102a may be implemented by an integrated processing unit, e.g., a unified decoder converter (UDC), while the rendering element 103a may be located outside the integrated processing unit (e.g., the UDC) as a separate part within the decoder 100a.

[0051] Specifically, the upmixing unit 12 may perform the conversion of the downmix signal into the audio objects via an upmixing matrix, and the rendering unit 15 may perform the rendering using a rendering matrix. In general, the processing flow of the decoder 100a as shown in Fig. 1(a) may require around 65Mcps (millions of cycles per second) at the core decoding element 101, around 80Mcps at the upmixing element 102a (including around 20Mcps for the QMF analysis and 90-degree phase filtering in case of 5.1 channels, around 20Mcps for the calculation of the upmixing matrix with a dimension 5x15, and around 40Mcps for the QMF synthesis in case of 5.1 channels), and around 30Mcps at the rendering element 103a (including around 20Mcps for rendering gain computation and around 10Mcps for the calculation of the rendering matrix with a dimension 15x7 in the time domain, in case that the rendering gains are computed 4 times per frame).

[0052] In order to reduce the computational requirements for the processing in the decoder, the calculation of the rendering matrix (and the computation of the rendering gains) may be integrated with the upmixing process into the UDC. Fig. 1(b) schematically illustrates an exemplary block diagram for a decoder 100b using an integrated frequency-band object audio renderer (FBOAR) in accordance with embodiments of the present disclosure. Similar to decoder 100a, the decoder 100b comprises a core decoding element 101, and an upmixing element 102b. As illustrated above, the core decoding element 101 is configured to receive and decode a downmix signal from an encoder (not shown), and then to provide the decoded downmix signal to the upmixing element 102b. Similar to the upmixing element 102a of the decoder 100a, the upmixing element 102b of the decoder 100b may comprise the pre-processing unit 10 and the upmixing unit 12 to perform the QMF analysis / phase filtering and the downmix-to-objects conversion, respectively, as illustrated above.

[0053] Different from the decoder 100a, the processing functions in the decoder 100b for estimating the rendering gains and calculating the rendering matrix may be integrated into the upmixing element 102b. That is to say, the upmixing element 102b may further comprise, in addition to the pre-processing unit 10 and the upmixing unit 12 for performing the upmixing process of the decoder, the computation unit 14 (as in the decoder 100a) and a rendering unit 15’ for performing the rendering process. Unlike the rendering unit 15 of the decoder 100a that calculates the rendering matrix in the time domain, it is noted that the rendering unit 15’ of the decoder 100b may calculate the rendering matrix in the frequency domain, as a frequency-domain rendering unit 15’ (also referred to as a frequency-band object audio renderer, FBOAR). Thus, in the decoder 100b, the integrated processing unit (e.g., the UDC) may also contain the computation unit 14 and the frequency-domain rendering unit 15’ for the rendering process, while the post-processing unit 13 for conducting the QMF synthesis as in the decoder 100a may be moved to outside the integrated processing unit as a separate part within the decoder 100b to obtain the audio objects to be rendered in the time domain.

[0054] By integrating the computation unit 14 and the frequency-domain rendering unit 15’ for the rendering process into the upmixing element 102b (or into the UDC), the computational efforts for the processing flow of the decoder 100b as shown in Fig. 1(b) may be decreased. For example, the decoder 100b may require around 65Mcps (millions of cycles per second) at the core decoding element 101, around 80Mcps at the upmixing element 102b (including around 20Mcps for the QMF analysis and 90-degree phase filtering in case of 5.1 channels, around 20Mcps for the calculation of the upmixing matrix with a dimension 5x15, around 20Mcps for rendering gain computation and around 20Mcps for the calculation of the rendering matrix with a dimension 15x7 in the frequency domain), and around 20Mcps at the post-processing unit 13 for the QMF synthesis in case of 5.1.2 channels.

[0055] It is noted that calculating the rendering matrix in the time domain (by the rendering unit 15 of the decoder 100a) may require additional computational efforts that use 50% of the rendering complexity for calculating the rendering matrix in the frequency domain (i.e., the frequency-band object audio renderer, FBOAR, by the rendering unit 15’ of the decoder 100b). However, the total computational efforts of the decoder 100b may be reduced (-27%) compared to the decoder 100a.

[0056] Fig. 1(c) schematically illustrates an exemplary block diagram for a decoder 100c using an integrated frequency-band object audio renderer (FBOAR) in accordance with embodiments of the present disclosure. Similar to decoder 100a and 100b, the decoder 100c comprises a core decoding element 101, and an upmixing element 102c. It is noted that the core decoding element 101 may also perform functions similar to the core decoding element 101 of the decoder 100a and 100b, and the upmixing element 102c may also contain the pre-processing unit 10 and the computation unit 14 performing similar functions as in decoder 100a and 100b.

[0057] However, in the decoder 100c, the upmixing element 102c may further contain an upmixing-rendering combined unit 12’ to replace the upmixing unit 12 and the rendering unit 15’ of the decoder 100b (and thus to replace the upmixing unit 12 and the rendering unit 15 of the decoder 100a). In other words, the upmixing-rendering combined unit 12’ of the decoder 100c may apply a single combined matrix that jointly performs the upmixing and rendering computation (i.e., the functions of the upmixing unit 12 and the rendering unit 15’ of decoder 100b) to generate the audio objects to be rendered. Similar to the decoder 100b, the processing functions for estimating the rendering gains and calculating the rendering matrix at the decoder 100c may be integrated into the upmixing element 102c, and the calculation of the rendering matrix may be conducted in the frequency domain. However, compared to the decoder 100b, the decoder 100c may combine the calculation of the rendering matrix with the calculation of the upmixing matrix using a single matrix, instead of using two separate matrices independently.
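The equivalence between the two-matrix path and a single combined matrix can be checked with a small sketch. It is illustrative only: the channel counts 5, 15 and 7 follow the matrix dimensions quoted in the text, while the row/column orientation, the integer test entries, the helper names, and the multiply-accumulate (MAC) count as a cost measure are assumptions of this example and do not map directly onto the Mcps figures.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(matrix, vector):
    """Apply a matrix to one slot sample (a vector of channel values)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mac_count(matrix):
    """Multiply-accumulate operations needed to apply the matrix once."""
    return len(matrix) * len(matrix[0])


N_DOWNMIX, N_OBJECTS, N_OUTPUTS = 5, 15, 7

# Arbitrary integer test entries (hypothetical, chosen so results are exact).
upmix = [[(i + 2 * j) % 3 for j in range(N_DOWNMIX)]
         for i in range(N_OBJECTS)]           # objects x downmix channels
render = [[(p * i + 1) % 4 for i in range(N_OBJECTS)]
          for p in range(N_OUTPUTS)]          # outputs x objects

combined = matmul(render, upmix)              # outputs x downmix channels

sample = [1, 2, 3, 4, 5]                      # one downmix slot sample
two_step = apply(render, apply(upmix, sample))  # upmix, then render
one_step = apply(combined, sample)              # single combined matrix

macs_two_step = mac_count(upmix) + mac_count(render)
macs_one_step = mac_count(combined)
```

Both paths give the same output vector, and applying the combined matrix touches far fewer coefficients per sample. The saving is only realized if the combined matrix itself does not have to be recomputed by a full matrix product for every slot.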

[0058] Thus, in the decoder 100c, the integrated processing unit (e.g., the UDC) may contain the core decoding unit 10, the computation unit 14, and the upmixing-rendering combined unit 12’, while the post-processing unit 13 (which performs a similar function as in the decoders 100a and 100b) may be moved outside the integrated processing unit as a separate part within the decoder 100c to obtain the audio objects to be rendered in the time domain, as in the decoder 100b.

[0059] By combining the calculation of the rendering matrix with the calculation of the upmixing matrix using a single matrix, the computational complexity for rendering the audio objects by the decoder 100c as shown in Fig. 1(c) may be further reduced. For example, the decoder 100c may require around 65Mcps (millions of cycles per second) at the core decoding element 101, around 50Mcps at the upmixing element 102c (including around 20Mcps for the QMF analysis and 90-degree phase filtering in case of 5.1 channels, around 20Mcps for rendering gain computation, and around 10Mcps for the calculation of an upmixing-rendering combined matrix with a dimension 15x7 in the frequency domain), and around 20Mcps at the post-processing unit 13 for the QMF synthesis in case of 5.1.2 channels.
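As a quick check of the figures quoted above, the per-stage estimates for the decoder 100c can be summed. This is only an arithmetic sketch; the dictionary keys and the helper name are illustrative, and the values are the approximate Mcps numbers from the text.

```python
# Approximate per-stage cost of decoder 100c in Mcps, as quoted in the text.
# The key names are illustrative labels, not identifiers from the disclosure.
UPMIX_PARTS = {
    "qmf_analysis_and_phase_filtering": 20,
    "rendering_gain_computation": 20,
    "combined_matrix_calculation": 10,
}

STAGES = {
    "core_decoding_element_101": 65,
    "upmixing_element_102c": sum(UPMIX_PARTS.values()),
    "post_processing_unit_13": 20,
}


def total_mcps(stages):
    """Sum the per-stage estimates into one decoder-level figure."""
    return sum(stages.values())


print(STAGES["upmixing_element_102c"], total_mcps(STAGES))
```

Under these figures the three parts of the upmixing element add up to the quoted 50Mcps, and the decoder as a whole comes to roughly 135Mcps.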

[0060] According to the present disclosure, the integrated rendering algorithm as shown in Fig. 1(c) may be considered as two stages of integrated rendering extending from the above mentioned TDOAR as shown in Fig. 1(a): FBOAR using the dual-matrix approach based on the decoder 100b of Fig. 1(b) as the first-stage integrated rendering, and FBOAR using the (single) combined matrix solution based on the decoder 100c of Fig. 1(c) as the second-stage integrated rendering. In a use case where 15 dynamic objects are used, optimization of the upmixing / rendering matrix for the first-stage integrated rendering may reduce complexity by around 27% compared to the decoder 100a, and optimization of the single combined upmixing-rendering matrix for the second-stage combined matrix rendering may reduce complexity by around a further 20%. Further complexity reduction may be achieved by, e.g., reducing the number of channels for the QMF synthesis (at the post-processing unit 13).

[0061] The decoder architectures as shown in Fig. 1(a) to 1(c) may be summarized in Fig. 2, which schematically illustrates possible implementations of an audio decoder / renderer 200 using an integrated processing unit such as a unified decoder converter (UDC) in accordance with embodiments of the present disclosure. The decoder / renderer 200 may comprise one or more processors to perform the required functions for decoding / rendering an audio signal, such as the upmixing and rendering calculations as described above. The decoder / renderer 200 may support inputs comprising multi-channel signals such as DD+JOC, DD+ 5.1, DD+ 7.1, DD+ 2.0 and DD 5.1. The decoder / renderer 200 may also support output configurations including 5.1.2, 5.1.4 and 7.1.4, for example. However, other types of multi-channel signals and / or output configurations may also be supported by the decoder / renderer 200 according to the present disclosure.

[0062] Similar to the decoder architectures as shown in Fig. 1(a) to 1(c), the decoder / renderer 200 also comprises a core decoding element 201 and an upmixing element 202. In the embodiment of Fig. 2, the core decoding element 201 may be a DD+ core decoder configured to receive a DD+ bitstream 211 containing a downmix signal (and associated metadata), decode the downmix signal and provide the decoded downmix signal 212 to the upmixing element 202. The upmixing element 202 may be a JOC upmixer configured to convert the downmix signal 212 to a plurality of audio objects 213 to be rendered. The decoder / renderer 200 further comprises a rendering element 203 (e.g., an object audio renderer, OAR) for generating a rendered output 214 from the plurality of audio objects 213. In addition to the downmix signal 212, the DD+ core decoder may provide upmixing metadata 215 associated with the downmix signal (e.g., the JOC side information) to the JOC upmixer, and also provide object metadata (e.g., object audio metadata (OAMD)) 216 to the rendering element 203. The upmixing metadata 215 may be time-variant and may contain a plurality of upmix parameters for converting the downmix signal 212 to the audio objects 213 (e.g., upmixing coefficients for generating an upmix signal). The object metadata 216 may also be time-variant and may contain necessary information for rendering the audio objects 213 (such as rendering gains for generating the rendered output as well as information describing a (time-varying) position of an audio source associated with the audio objects 213 within a 3-dimensional rendering environment).

[0063] Similar to the decoder 100a as described above, the core decoding element 201 and the upmixing element 202 may be implemented by the integrated processing unit (e.g., a UDC) 20 of the decoder / renderer 200, while the rendering element 203 may be outside the integrated processing unit 20 as a separate part from the UDC 20 within the decoder / renderer 200. The decoder / renderer 200 implemented by the UDC 20 may correspond to the TDOAR approach as described above. On the other hand, the decoder / renderer 200 may be implemented by the UDC 20’, which moves the rendering element 203 from outside to inside the integrated processing unit 20’, and switches to using QMF-domain (frequency-domain) OAR to render the output, similar to the decoder 100b (and 100c) corresponding to the FBOAR approach as described above (i.e., the first and second stage of the integrated rendering). In some use cases, the UDC 20 without the rendering element 203 may be implemented by, e.g., the UDC CIDK v4.9, while the UDC 20’ containing the rendering element 203 may be implemented by an updated version, such as the UDC CIDK v4.10.

[0064] Fig. 3 schematically illustrates an exemplary architecture for a decoder 300 for reconstructing and rendering an audio program using an integrated frequency-band object audio renderer (FBOAR) in accordance with embodiments of the present disclosure. The decoder 300 may correspond to the decoders 100b and 100c based on the FBOAR approach as illustrated in Fig. 1(b) and 1(c). Specifically, the decoder 300 may comprise a core decoding element 301, a JOC upmixer 302 and a frequency-band renderer 303. For implementation of the decoder 100b, the decoder 300 may conduct the processing of the core decoding 301 followed by the JOC decoding / upmixing 302, which may be followed by the frequency-band renderer 303 to generate a QMF-domain rendered output. In contrast, for implementation of the decoder 100c, the decoder 300 may conduct the processing of the core decoding 301 followed by a combined JOC and rendering matrix 312 (i.e., the upmixing-rendering combined matrix as described above) to generate the QMF-domain rendered output. Similar to the decoders 100b and 100c, the decoder 300 may also conduct a QMF analysis 315 and 90-degree phase filtering 316 to transform a downmix signal 311 to the frequency domain, and may perform a time-domain analysis via a QMF synthesis 317.

[0065] According to the present disclosure, the combined matrix may refer to a joint matrix which multiplies an upmixing gain matrix and an OAR rendering gain matrix, so that an accurate and performance-optimized solution may be used to reconstruct an audio signal by an integrated renderer (e.g., the integrated renderer as illustrated in Fig. 3). For example, for a typical DD+ 5.1 bitstream with 16 input objects, the OAR gain matrix may have a dimension of 12x16 for 7.1.4 output configurations, which corresponds to 12 output channels x 16 input objects. On the other hand, the JOC upmixing gain matrix may have a dimension of 15x23x5 (without the low-frequency-effect (LFE) channel), which corresponds to 15 input objects x 23 bands x 5 downmix channels. Accordingly, the combined matrix may have a dimension of 12x23x5, which corresponds to 12 output channels x 23 bands x 5 downmix channels.
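The dimension bookkeeping above can be illustrated with a small pure-Python sketch. Several things here are assumptions rather than details from the disclosure: the rendering matrix is taken as 12x15 (15 objects, so that its inner dimension matches the JOC matrix without the LFE channel), the rendering gains are taken as band-independent, the product is formed as rendering times upmixing, and the random values and helper names are purely illustrative.

```python
import random

# Sizes from the example in the text; N_OBJ = 15 is an assumption made so the
# inner dimensions of the two matrices match (the text also mentions 16).
N_OUT, N_OBJ, N_BANDS, N_DMX = 12, 15, 23, 5


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def combine(render, upmix_per_band):
    """Return one (N_OUT x N_DMX) combined matrix per band."""
    return [matmul(render, upmix) for upmix in upmix_per_band]


rng = random.Random(0)
render = [[rng.random() for _ in range(N_OBJ)] for _ in range(N_OUT)]
upmix = [[[rng.random() for _ in range(N_DMX)] for _ in range(N_OBJ)]
         for _ in range(N_BANDS)]

combined = combine(render, upmix)
shape = (len(combined[0]), len(combined), len(combined[0][0]))

# Multiplications per band and per time sample, counting only the matrix
# application itself (interpolation and metadata handling are ignored).
separate_macs = N_OBJ * N_DMX + N_OUT * N_OBJ
combined_macs = N_OUT * N_DMX

print(shape, separate_macs, combined_macs)
```

The multiplication counts only hint at why applying a single matrix is cheaper; they are not a substitute for the Mcps measurements quoted elsewhere in the description.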

[0066] It is noted that the aforementioned exemplary signal types and output configurations are non-limiting examples for implementing the decoder / renderer as explicitly illustrated in the present disclosure, and other signal types and output configurations are feasible and within the scope of the present disclosure.

[0067] The optimization gain for this approach may be estimated by prototyping integrated rendering, which indicates that around 20% of complexity reduction is the potential limit for an aggressive strategy. This serves as roughly the upper limit for the expectation. This upper limit also requires that the same linear interpolation be used when processing samples through this combined matrix. However, the JOC upmixing matrix and the OAR rendering matrix are not synchronized in the DD+ JOC bitstream syntax. For example, the OAMD may have an offset of 18 slots, while the JOC upmixing gain may have no offset. Thus, it may not be possible to calculate the combined matrix only once per frame (containing 24 slots). Also, linear interpolation may cause inaccurate reconstruction of the audio signal (with respect to the design of the JOC algorithm and the OAMD ramp), which means that additional distortion may be introduced to the rendered signal during the upmixing and rendering processes.

[0068] The present disclosure proposes an accurate combined matrix solution, which can compensate a delta value (see Table 1 and Table 2 below) for linear interpolation when performing interpolation for each slot, and can fix the additional distortion and thereby accurately reconstruct the signal.

[0069] Referring back to Fig. 3, the core decoding element 301 may be configured to receive a downmix signal 311 indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program. In particular, the downmix signal 311 may have at least one channel. For example, the downmix signal may be a multi-channel signal such as a 5.1 channel signal, as shown in Fig. 3 (indicated by L, R, C, Ls, Rs). Besides, the architecture of Fig. 3 may be implemented with 7.1.4 output configurations, and the downmix signal 311 may be, e.g., a DD+ 5.1 bitstream, or a bitstream containing JOC content which has been encoded using a DD+ JOC encoder including audio and metadata information. However, other types of multi-channel signals (e.g., a 7.1 channel signal) and / or output configurations are also possible and within the scope of the present disclosure. In addition, the core decoding element 301 may also obtain and provide upmixing metadata (e.g., the upmixing metadata 215) defining a time-variant upmixing matrix to the JOC upmixer 302 for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal. The renderer 303 may be configured to obtain (from the core decoding element 301) object metadata (i.e., the object metadata 216) defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal. The time-variant upmixing matrix may contain a plurality of upmixing coefficients for generating the multi-channel upmix signal from the at least one channel of the downmix signal. The time-variant rendering matrix may contain a plurality of rendering gains for generating the rendered output from the multi-channel upmix signal.

[0070] Notably, the time-variant upmixing matrix and the time-variant rendering matrix may be dividable into a plurality of sampling frames each having a predetermined number of slots. For signal smoothing purposes, one frame may be divided into 24 slots. For example, when one frame contains 1536 samples, one slot may contain 64 samples as the minimum processing length. Table 1 shows an exemplary algorithm of computing the upmixing-rendering combined matrix for generating audio objects to be rendered in accordance with embodiments of the present disclosure. Hereafter, the upmixing-rendering combined matrix may be referred to as an “integrated upmixing and rendering matrix” which is determined based on the time-variant upmixing matrix (i.e., indicated by the upmixing metadata 215) and the time-variant rendering matrix (i.e., indicated by the object metadata 216) for each slot of a sampling frame.
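The frame and slot arithmetic in this example (1536 samples per frame, 24 slots, hence 64 samples per slot) can be sketched as follows. The function name and the 1-based frame and slot numbering are illustrative choices that mirror the slot labels used in the description.

```python
FRAME_SAMPLES = 1536    # samples per frame, from the example in the text
SLOTS_PER_FRAME = 24    # slots per frame, from the example in the text
SLOT_SAMPLES = FRAME_SAMPLES // SLOTS_PER_FRAME  # minimum processing length


def locate(sample_index):
    """Map a 0-based sample index to a 1-based (frame, slot) pair."""
    frame, within_frame = divmod(sample_index, FRAME_SAMPLES)
    return frame + 1, within_frame // SLOT_SAMPLES + 1


print(SLOT_SAMPLES, locate(0), locate(1535), locate(1536))
```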

[0071] Specifically, the JOC upmixing gain and the OAR gain may (all) be linearly interpolated for each of the 24 slots, while the OAR gain has a latency of 18 slots, which may correspond to an OAMDI offset value. To overcome this misalignment, two different linear interpolations and compensations for the combined matrix may be applied, as illustrated by Table 1. In Table 1, two frames (frame 1 and frame 2) are shown each having 24 slots (slot 1 to slot 24) together with the corresponding upmixing matrix (e.g., the JOC upmixing gain coefficients) and rendering matrix (e.g., the OAR gain), as well as the integrated upmixing and rendering matrix (indicated by the “combined matrix” and the “interpolation and compensation”). The integrated upmixing and rendering matrix may be applicable to a corresponding slot of the downmix signal for generating the rendered output. It is appreciated that the exemplary algorithm shown in Table 1 may be applied to the decoders 100b, 100c, 200 and 300 as illustrated above.

[0072] According to the present disclosure, the integrated upmixing and rendering matrix may be determined by determining a combined matrix (on a frame basis) and an interpolated matrix (on a slot basis). Namely, the combined matrix may be determined, for the sampling frame, based on the time-variant upmixing matrix and the time-variant rendering matrix corresponding to that sampling frame, while the interpolated matrix may be determined, for each slot of the sampling frame, based on the determined combined matrix and a slot offset value indicative of slot latency between the upmixing matrix and the rendering matrix. In particular, the slot offset value may represent a number of slots by which a first frame boundary with respect to the time-variant upmixing matrix is misaligned with a second frame boundary with respect to the time-variant rendering matrix. In Table 1, the combined matrix is indicated by the “combined matrix” and the interpolated matrix is indicated by the “interpolation and compensation”.

[0073] As shown in the algorithm of Table 1, for the first frame, the JOC upmixing gain and the OAR gain may be applied in sequence, similar to the integrated rendering solution in relation to the decoder 100b (shown in Fig. 1(b)). For the second frame, the JOC upmixing gain may ramp from X2_1 to X2_24 with a step size of ΔX. The OAR gain may have two linear interpolations having different step sizes ΔY1 and ΔY2: the 18 slots from Y1_7 to Y1_24 are interpolated with step ΔY1, while the 6 slots from Y2_1 to Y2_6 are interpolated with step ΔY2. The highlighted rows show the slot positions where the calculation / determination of the combined matrix is conducted. In the example of Table 1, the combined matrix is calculated / determined at a frame boundary, e.g., at the first frame boundary with respect to the time-variant upmixing matrix and at the second frame boundary with respect to the time-variant rendering matrix. Due to the misalignment between the upmixing matrix and the rendering matrix, the first frame boundary with respect to the upmixing matrix may correspond to somewhere within a frame with respect to the rendering matrix. In the example of Table 1, since the OAR gain has a latency of 18 slots (i.e., the slot offset value = 18), the first frame boundary (i.e., slot 24) with respect to the upmixing matrix corresponds to slot 18 with respect to the rendering matrix.
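The two-step ramp of the OAR gain across one upmixing frame can be sketched with scalar gains. This is a simplified illustration under stated assumptions: a single scalar stands in for each gain matrix, the helper names are hypothetical, and the label mapping follows the slot ranges quoted above (Y1_7 to Y1_24, then Y2_1 to Y2_6) for an offset of 18 slots.

```python
def oar_gain_per_slot(y_at_boundary, dy1, dy2, offset=18, slots=24):
    """Rendering gain at upmix-frame slots 1..slots (scalar sketch).

    The gain advances by dy1 per slot up to the rendering frame boundary
    (slot `offset`) and by dy2 per slot afterwards.
    """
    gains = []
    y = y_at_boundary
    for slot in range(1, slots + 1):
        y += dy1 if slot <= offset else dy2
        gains.append(y)
    return gains


def render_label(slot, offset=18, slots=24):
    """Name of the rendering-matrix sample aligned with an upmix-frame slot."""
    if slot <= offset:
        return f"Y1_{slot + slots - offset}"
    return f"Y2_{slot - offset}"


gains = oar_gain_per_slot(0, 1, 3)
print(gains[17], gains[18], gains[23])
print(render_label(1), render_label(18), render_label(19), render_label(24))
```

With the illustrative steps 1 and 3, the gain reaches 18 at the rendering frame boundary and then continues with the steeper second step for the remaining 6 slots.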

[0074] In Table 1, the “JOC upmixing gain” may indicate a sampled upmixing matrix (e.g., X1_1 to X1_24 and X2_1 to X2_24) acquired for a first sampling frame associated with the time-variant upmixing matrix and for a second sampling frame associated with the time-variant rendering matrix, while the “OAR gain” may indicate a sampled rendering matrix (e.g., Y1_1 to Y1_24 and Y2_1 to Y2_24) acquired for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix. Hence, the combined matrix may be determined by computing, for the first sampling frame associated with the time-variant upmixing matrix, a first combined matrix (e.g., CM1=X1_24*Y1_6 and CM3=X2_24*Y2_6) based on the acquired sampled upmixing matrix corresponding to that first sampling frame and the acquired sampled rendering matrix corresponding to that first sampling frame. Also, the combined matrix may be determined by computing, for the second sampling frame associated with the time-variant rendering matrix, a second combined matrix (e.g., CM2=X2_18*Y1_24) based on the acquired sampled upmixing matrix corresponding to that second sampling frame and the acquired sampled rendering matrix corresponding to that second sampling frame. Accordingly, the interpolated matrix for each slot is determined based on at least one of the computed first combined matrix and the computed second combined matrix (e.g., CM1, CM2 or CM3).

[0075] Table 1 further shows that the sampled upmixing matrix for the first sampling frame associated with the time-variant upmixing matrix (e.g., X1_24 and X2_24) and for the second sampling frame associated with the time-variant rendering matrix (e.g., X2_18) is acquired at the first frame boundary with respect to the time-variant upmixing matrix (i.e., slot 24 with respect to the upmixing matrix X) and the second frame boundary with respect to the time-variant rendering matrix (i.e., slot 24 with respect to the rendering matrix Y, corresponding to slot 18 with respect to the upmixing matrix X), respectively. Besides, the sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix (e.g., Y1_6 and Y2_6) and for the second sampling frame associated with the time-variant rendering matrix (e.g., Y1_24) is acquired at the first frame boundary with respect to the time-variant upmixing matrix (i.e., slot 24 with respect to the upmixing matrix X) and the second frame boundary with respect to the time-variant rendering matrix (i.e., slot 24 with respect to the rendering matrix Y, corresponding to slot 18 with respect to the upmixing matrix X), respectively.

[0076] As to the determination of the interpolated matrix, the calculation under “interpolation and compensation” in Table 1 shows the use of weighting factors for the respective combined matrix. In particular, for each slot of a sampling frame, a first weighting factor for the first combined matrix, a second weighting factor for the second combined matrix, and a compensation value may be determined based on the (determined) slot offset value. Accordingly, the interpolated matrix may be generated by a linear combination of the first combined matrix multiplied with the first weighting factor, the second combined matrix multiplied with the second weighting factor, and the compensation value. As shown in the example of Table 1, the interpolated matrix corresponding to slot 1 of frame 2 is determined as CM1*17/18 + CM2*1/18 - 17*1*ΔX*ΔY1, where 17/18 represents the first weighting factor for the first combined matrix CM1, 1/18 represents the second weighting factor for the second combined matrix CM2, and 17*1*ΔX*ΔY1 represents the compensation value. As shown in Table 1, along processing of slots, the first weighting factor may ramp down (e.g., from 17/18 at slot 1 to 1/18 at slot 17) for the first combined matrix, and the second weighting factor may ramp up (e.g., from 1/18 at slot 1 to 17/18 at slot 17) for the second combined matrix. The variation of the compensation value with respect to the slot position may be found in an example shown in Table 2 (i.e., the delta value).
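The slot 1 expression above can be checked numerically. The following is a minimal sketch with scalar gains and exact rational arithmetic, assuming both gains ramp linearly between the two positions where the combined values are computed; the starting values, step sizes, and function names are arbitrary illustrations, not values from the disclosure.

```python
from fractions import Fraction


def interpolated(cm_a, cm_b, dx, dy, k, n):
    """Linear blend of two combined values minus the compensation term."""
    return (cm_a * Fraction(n - k, n) + cm_b * Fraction(k, n)
            - (n - k) * k * dx * dy)


def compensation_factors(n):
    """Compensation weighting factors (n - k) * k for slots k = 1 .. n - 1."""
    return [(n - k) * k for k in range(1, n)]


N = 18                                    # slot offset value from the example
x0, y0 = Fraction(3, 10), Fraction(7, 10)   # illustrative gains at CM1
dx, dy1 = Fraction(1, 50), Fraction(-1, 100)  # illustrative ramping steps

cm1 = x0 * y0
cm2 = (x0 + N * dx) * (y0 + N * dy1)

# True product of the two ramps at every intermediate slot.
exact = [(x0 + k * dx) * (y0 + k * dy1) for k in range(1, N)]
approx = [interpolated(cm1, cm2, dx, dy1, k, N) for k in range(1, N)]

print(approx == exact, compensation_factors(N)[:2], max(compensation_factors(N)))
```

In this scalar setting the compensated blend reproduces the slot-by-slot product exactly, and the factors follow the 17*1, 16*2, ... pattern with a maximum of 9*9. For matrices, the step product would presumably need to keep the same multiplication order as the combined matrix.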

[0077] Specifically, the compensation value may contain a compensation weighting factor (e.g., 17*1 at slot 1, 16*2 at slot 2, ..., 1*17 at slot 17, etc.) in dependence on the determined slot offset value and a current slot position, as clearly shown in Table 1 and Table 2. In particular, along processing of slots, the compensation weighting factor may ramp up away from the first frame boundary (e.g., from 17*1 at slot 1 to 9*9 at slot 9, which moves away from the boundary between frame 1 and frame 2) and second frame boundary, and may ramp down towards the first and second frame boundary (e.g., from 9*9 at slot 9 to 1*17 at slot 17, which moves towards the next frame boundary, e.g., the boundary between frame 2 and frame 3 (not shown)).

Table 1:

[0078] In addition, the compensation value may be determined further based on an upmixing ramping step of the time-variant upmixing matrix between slots (e.g., ΔX), and a rendering ramping step of the time-variant rendering matrix between the slots (e.g., ΔY1, ΔY2). At a current slot, such a rendering ramping step of the time-variant rendering matrix may be determined based on a slot position of the current slot with respect to the first frame boundary and the second frame boundary. For example, ΔY1 may be taken as the rendering ramping step when the current slot is located between the first frame boundary with respect to the upmixing matrix and the second frame boundary with respect to the rendering matrix (corresponding to slot 24 of frame 1 and slot 18 of frame 2, respectively), and ΔY2 may be taken as the rendering ramping step when the current slot is located between the second frame boundary with respect to the rendering matrix and the next first frame boundary with respect to the upmixing matrix (corresponding to slot 18 and slot 24 of frame 2, respectively). Notably, the rendering ramping step of the time-variant rendering matrix between the slots may vary in response to an update of rendering gains in the rendering matrix at the second frame boundary, for example.

Table 2:

[0079] Depending on the slot position, the integrated upmixing and rendering matrix may be determined either by the combined matrix (i.e., the first combined matrix or the second combined matrix), or by the interpolated matrix. For example, for a slot located at the first frame boundary (e.g., slot 24), the first combined matrix (e.g., CM1, CM3) may be determined as the integrated upmixing and rendering matrix, or for a slot located at the second frame boundary (e.g., slot 18), the second combined matrix (e.g., CM2) may be determined as the integrated upmixing and rendering matrix. Alternatively, for a slot not located at a frame boundary (e.g., slot 1 to slot 17, slot 19 to slot 23, etc.), the interpolated matrix may be determined as the integrated upmixing and rendering matrix.
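The per-slot selection described above can be sketched as a small planning function that returns the two weighting factors and the compensation weighting factor for a given slot. This is an illustrative sketch: the function name is hypothetical, and the treatment of slots 19 to 23 (a second segment of 6 slots between CM2 and CM3, following the same pattern as the first segment) is an assumption, since the corresponding table rows are not reproduced here.

```python
from fractions import Fraction


def slot_plan(slot, offset=18, slots=24):
    """Return (weight_prev, weight_next, comp_factor, segment) for a 1-based slot.

    Segment 1 blends CM1 and CM2 with rendering step dY1; segment 2 is assumed
    to blend CM2 and CM3 with rendering step dY2 in the same way.
    """
    if slot <= offset:
        n, k, segment = offset, slot, 1
    else:
        n, k, segment = slots - offset, slot - offset, 2
    return Fraction(n - k, n), Fraction(k, n), (n - k) * k, segment


for s in (1, 17, 18, 19, 24):
    print(s, slot_plan(s))
```

At the boundary slots 18 and 24 the plan degenerates to weights (0, 1) with a zero compensation factor, so the integrated matrix is simply the combined matrix computed at that boundary, consistent with the case distinction in the paragraph above.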

[0080] For illustrative purposes, the example of Table 1 shows two frames each having 24 slots, and a latency of 18 slots between the upmixing matrix and the rendering matrix. However, these are non-limiting examples for computing the upmixing-rendering combined matrix (i.e., the integrated upmixing and rendering matrix) as explicitly illustrated in the present disclosure, and shall not be regarded as limiting the scope of implementation according to the present disclosure. Other numbers of frames / slots as well as other offset values are feasible and within the scope of the present disclosure. Also, the slot position where the combined matrix shall be calculated (i.e., matrix multiplications) may vary depending on practical implementations and shall not be limited to the frame boundaries. Moreover, the way of determining the weighting factors and the compensation value (delta value) for the interpolated matrix may deviate from the examples shown in Table 1 and Table 2, as long as the dependence on the offset value is satisfied.
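For a general segment length N (e.g., N = 18 in the example of Table 1), the form of the weighting factors and the compensation value can be motivated by a short derivation. This is a sketch under the assumption that both matrices ramp linearly over the segment, starting from X_a and Y_a with per-slot steps ΔX and ΔY; the symbols X_a, Y_a, CM_a and CM_b are introduced here for illustration only.

```latex
% Combined matrices at the two ends of a segment of N slots
CM_a = X_a Y_a, \qquad CM_b = (X_a + N\,\Delta X)(Y_a + N\,\Delta Y)

% Exact product at slot k (0 \le k \le N)
(X_a + k\,\Delta X)(Y_a + k\,\Delta Y)
  = X_a Y_a + k\,(X_a\,\Delta Y + \Delta X\,Y_a) + k^2\,\Delta X\,\Delta Y

% Plain linear interpolation of the combined matrices
\frac{N-k}{N}\,CM_a + \frac{k}{N}\,CM_b
  = X_a Y_a + k\,(X_a\,\Delta Y + \Delta X\,Y_a) + k N\,\Delta X\,\Delta Y

% The difference is the compensation value, hence
(X_a + k\,\Delta X)(Y_a + k\,\Delta Y)
  = \frac{N-k}{N}\,CM_a + \frac{k}{N}\,CM_b - (N-k)\,k\,\Delta X\,\Delta Y
```

For N = 18 and k = 1 this gives the weights 17/18 and 1/18 and the compensation weighting factor 17*1, matching the slot 1 entry discussed above.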

[0081] In view of the above, the proposed algorithm for reconstructing an audio program at a decoder / renderer may allow for complexity reduction in the decoding and rendering processes. Besides, the proposed decoder architecture according to the present disclosure further enables outputting the legacy downmix output and the rendered output simultaneously, in order to satisfy the requirements of a long cross-fade duration, thereby allowing the system layer to implement more complex seamless transition behavior. Accordingly, real seamless transition between channel-based content and object-based content may be achieved. It is further appreciated that no glue code for the decoder (e.g., UDC) and OAR is needed for the integration layer, as long as an output configuration is set, so that the UDC will output the rendered channel, which simplifies the UDC and OAR integration work for the integrator.

[0082] Fig. 4 shows an example flowchart of a method 400 for reconstructing an audio program for rendering at a playback device in accordance with embodiments of the disclosure. Method 400 may be implemented in software, hardware, or combinations thereof, e.g., at an audio decoding / rendering device, as or as part of a computing device, a server, or a distributed system between a computing device and a server which is suitable for decoding / rendering audio signals. Specific implementations may include a headset, computer, mobile phone, etc., or any other audio decoding / rendering devices for decoding / rendering audio. The method 400 comprises processing chains formed by steps S410 through S440 for reconstructing audio signals (in particular audio objects) to be further processed by a further rendering stage (e.g., for output by a playback device). More specifically, step S440 may be implemented by further processing chains formed by steps S450 through S470 for determining the integrated upmixing and rendering matrix at step S440.

[0083] The method 400 comprises step S410 of receiving a downmix signal indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program. In particular, the downmix signal may comprise at least one channel. The method further comprises step S420 of obtaining upmixing metadata defining a time-variant upmixing matrix for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal. In addition, the method comprises step S430 of obtaining object metadata defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal. It is noted that the time-variant upmixing matrix and the time-variant rendering matrix may be dividable into a plurality of sampling frames each having a predetermined number of slots. Also, the method further comprises step S440 of determining, for each slot of a sampling frame, an integrated upmixing and rendering matrix based on the time-variant upmixing matrix and the time-variant rendering matrix.

[0084] For determining the integrated upmixing and rendering matrix, the method further comprises step S450 of determining, for the sampling frame, a combined matrix based on the time-variant upmixing matrix and the time-variant rendering matrix corresponding to that sampling frame, step S460 of determining a slot offset value indicative of a number of slots by which a first frame boundary with respect to the time-variant upmixing matrix is misaligned with a second frame boundary with respect to the time-variant rendering matrix, and step S470 of determining, for each slot of the sampling frame, an interpolated matrix based on the determined combined matrix and the slot offset value.
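Steps S450 through S470 can be tied together in an end-to-end sketch for one upmixing frame. Everything below is a simplified illustration under stated assumptions: scalars stand in for the matrices, both gains ramp linearly with known steps, the function names merely echo the step labels, and the second segment after the rendering frame boundary is assumed to follow the same interpolation pattern as the first.

```python
from fractions import Fraction

SLOTS = 24  # slots per frame, from the example in the description


def s460_slot_offset(upmix_boundary_slot, render_boundary_slot, slots=SLOTS):
    """Slots from the upmix frame boundary to the next rendering frame boundary."""
    return (render_boundary_slot - upmix_boundary_slot) % slots


def s450_combined(x, y):
    """Combined value at one slot position (scalar stand-in for a matrix product)."""
    return x * y


def s470_interpolated(cm_a, cm_b, dx, dy, k, n):
    """Weighted blend of two combined values minus the compensation value."""
    return (cm_a * Fraction(n - k, n) + cm_b * Fraction(k, n)
            - (n - k) * k * dx * dy)


def frame_values(x0, dx, y0, dy1, dy2, offset, slots=SLOTS):
    """Integrated value for slots 1..slots of one upmix frame."""
    rest = slots - offset
    x_mid, y_mid = x0 + offset * dx, y0 + offset * dy1
    x_end, y_end = x0 + slots * dx, y_mid + rest * dy2
    cm1 = s450_combined(x0, y0)
    cm2 = s450_combined(x_mid, y_mid)
    cm3 = s450_combined(x_end, y_end)
    out = []
    for slot in range(1, slots + 1):
        if slot <= offset:
            out.append(s470_interpolated(cm1, cm2, dx, dy1, slot, offset))
        else:
            out.append(s470_interpolated(cm2, cm3, dx, dy2, slot - offset, rest))
    return out


offset = s460_slot_offset(24, 18)
x0, dx = Fraction(1, 2), Fraction(1, 100)
y0, dy1, dy2 = Fraction(1, 4), Fraction(1, 50), Fraction(-1, 25)
integrated = frame_values(x0, dx, y0, dy1, dy2, offset)

# Reference: multiply the two ramps slot by slot.
reference = []
for slot in range(1, SLOTS + 1):
    x = x0 + slot * dx
    if slot <= offset:
        y = y0 + slot * dy1
    else:
        y = y0 + offset * dy1 + (slot - offset) * dy2
    reference.append(x * y)

print(offset, integrated == reference)
```

In this scalar setting, three products per frame plus the per-slot blend reproduce the values obtained by multiplying the two ramps at every slot.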

[0085] It is further noted that steps S410 through S470 may be performed for each of a plurality of processing cycles of an audio reconstruction device / apparatus / system and do not need to be performed in the order shown in Fig. 4.

[0086] While methods and processing chains have been described above, it is understood that the present disclosure likewise relates to apparatus (e.g., computer apparatus or apparatus having processing capability in general) for implementing these methods and processing chains (or techniques in general).

[0087] An example of such apparatus 500 is schematically illustrated in Fig. 5. The apparatus 500 comprises a processor 501 and a memory 502 coupled to the processor 501. The memory 502 may store instructions for execution by the processor 501. The processor 501 may be adapted to implement the processing chains described throughout the disclosure and / or to perform methods (e.g., methods of reconstructing an audio program for rendering at a playback device) described throughout the disclosure. The apparatus 500 may receive inputs (e.g., a downmix signal comprising at least one channel, or a multi-channel signal) and generate rendered outputs (e.g., audio objects to be rendered at the playback device), and may be used for implementing the above described decoders 100b, 100c, 200 and 300 as illustrated in Figs. 1(b), 1(c), 2 and 3.

[0088] Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment (e.g., server or cloud environment) for processing digital or digitized audio files. Portions of these systems may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.

[0089] One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and / or as data and / or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and / or other characteristics. Computer-readable media in which such formatted data and / or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.

[0090] Specifically, it should be understood that embodiments may include hardware, software, and electronic components or modules that, for purposes of discussion, may be illustrated and described as if the majority of the components were implemented solely in hardware. However, one of ordinary skill in the art, and based on a reading of this detailed description, would recognize that, in at least one embodiment, the electronic-based aspects may be implemented in software (e.g., stored on non-transitory computer-readable medium) executable by one or more electronic processors, such as a microprocessor and / or application specific integrated circuits (“ASICs”). As such, it should be noted that a plurality of hardware and software-based devices, as well as a plurality of different structural components, may be utilized to implement the embodiments. For example, computer-implemented neural networks described herein can include one or more electronic processors, one or more computer-readable medium modules, one or more input / output interfaces, and various connections (e.g., a system bus) connecting the various components.

[0091] While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

[0092] Also, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having” and variations thereof are meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless specified or limited otherwise, the terms “mounted,” “connected,” “supported,” and “coupled” and variations thereof are used broadly and encompass both direct and indirect mountings, connections, supports, and couplings.

[0093] Various aspects and implementations of the invention may also be appreciated from the following enumerated example embodiments (EEEs), which are not claims.

EEE1. A computer-implemented method of reconstructing an audio program for rendering at a playback device, the method comprising: receiving a downmix signal indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program, the downmix signal comprising at least one channel; obtaining upmixing metadata defining a time-variant upmixing matrix for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal; obtaining object metadata defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal, wherein the time-variant upmixing matrix and the time-variant rendering matrix are dividable into a plurality of sampling frames each having a predetermined number of slots; and determining, for each slot of a sampling frame, an integrated upmixing and rendering matrix based on the time-variant upmixing matrix and the time-variant rendering matrix, wherein the integrated upmixing and rendering matrix is applicable to a corresponding slot of the downmix signal for generating the rendered output, and wherein determining the integrated upmixing and rendering matrix comprises: determining, for the sampling frame, a combined matrix based on the time-variant upmixing matrix and the time-variant rendering matrix corresponding to that sampling frame; determining a slot offset value indicative of a number of slots by which a first frame boundary with respect to the time-variant upmixing matrix is misaligned with a second frame boundary with respect to the time-variant rendering matrix; and determining, for each slot of the sampling frame, an interpolated matrix based on the determined combined matrix and the slot offset value.

EEE2. The method according to EEE1, further comprising: acquiring, from the upmixing metadata, a sampled upmixing matrix for a first sampling frame associated with the time-variant upmixing matrix and for a second sampling frame associated with the time-variant rendering matrix; and acquiring, from the object metadata, a sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix, wherein determining the combined matrix comprises: computing, for the first sampling frame associated with the time-variant upmixing matrix, a first combined matrix based on the acquired sampled upmixing matrix corresponding to that first sampling frame and the acquired sampled rendering matrix corresponding to that first sampling frame; and computing, for the second sampling frame associated with the time-variant rendering matrix, a second combined matrix based on the acquired sampled upmixing matrix corresponding to that second sampling frame and the acquired sampled rendering matrix corresponding to that second sampling frame.

EEE3. The method according to EEE2, wherein the interpolated matrix for each slot is determined based on at least one of the computed first combined matrix and the computed second combined matrix.

EEE4. The method according to EEE2 or EEE3, wherein the sampled upmixing matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix is acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively, and wherein the sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix is acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively.

EEE5. The method according to any one of EEE2 to EEE4, further comprising determining, for each slot of a sampling frame, a first weighting factor for the first combined matrix, a second weighting factor for the second combined matrix, and a compensation value based on the determined slot offset value.

EEE6. The method according to EEE5, wherein the interpolated matrix is generated by a linear combination of the first combined matrix multiplied with the first weighting factor, the second combined matrix multiplied with the second weighting factor, and the compensation value.

EEE7. The method according to EEE5 or EEE6, wherein, along processing of slots, the first weighting factor ramps down for the first combined matrix, and the second weighting factor ramps up for the second combined matrix.

EEE8. The method according to any one of EEE5 to EEE7, wherein the compensation value comprises a compensation weighting factor in dependence of the determined slot offset value and a current slot position, and wherein, along processing of slots, the compensation weighting factor ramps up away from the first and second frame boundary and ramps down towards the first and second frame boundary.

EEE9. The method according to any one of EEE5 to EEE8, wherein the compensation value is determined further based on an upmixing ramping step of the time-variant upmixing matrix between slots, and a rendering ramping step of the time-variant rendering matrix between the slots.

EEE10. The method according to EEE9, wherein, at a current slot, the rendering ramping step of the time-variant rendering matrix is determined based on a slot position of the current slot with respect to the first frame boundary and the second frame boundary.

EEE11. The method according to EEE9 or EEE10, wherein the rendering ramping step of the time-variant rendering matrix between the slots varies in response to an update of rendering gains in the rendering matrix at the second frame boundary.

EEE12. The method according to any one of EEE2 to EEE11, wherein the first combined matrix is determined by multiplying the acquired sampled upmixing matrix corresponding to that first sampling frame with the acquired sampled rendering matrix corresponding to that first sampling frame.

EEE13. The method according to any one of EEE2 to EEE12, wherein the second combined matrix is determined by multiplying the acquired sampled upmixing matrix corresponding to that second sampling frame with the acquired sampled rendering matrix corresponding to that second sampling frame.

EEE14. The method according to any one of EEE2 to EEE13, wherein, for a slot located at the first frame boundary, the first combined matrix is determined as the integrated upmixing and rendering matrix, or for a slot located at the second frame boundary, the second combined matrix is determined as the integrated upmixing and rendering matrix, and wherein, for a slot not located at a frame boundary, the interpolated matrix is determined as the integrated upmixing and rendering matrix.

EEE15. The method according to any one of EEE1 to EEE14, wherein the time-variant upmixing matrix contains a plurality of upmixing coefficients for generating the multi-channel upmix signal from the at least one channel of the downmix signal.

EEE16. The method according to any one of EEE1 to EEE15, wherein the time-variant rendering matrix contains a plurality of rendering gains for generating the rendered output from the multi-channel upmix signal.

EEE17. The method according to any one of EEE1 to EEE16, further comprising transforming the downmix signal from time domain to frequency domain, wherein the rendering matrix comprises a frequency-domain rendering matrix for rendering the plurality of audio objects in the frequency domain.

EEE18. The method according to any one of EEE1 to EEE17, further comprising, during an initial sampling frame, applying, for each slot within the initial sampling frame, the time-variant upmixing matrix and the time-variant rendering matrix in sequence to a corresponding slot of the downmix signal.

EEE19. The method according to any one of EEE1 to EEE18, wherein the downmix signal comprises a 5.1 channel signal, or a 7.1 channel signal.

EEE20. An apparatus for reconstructing an audio program for rendering at a playback device, comprising a processor and a memory coupled to the processor and storing instructions for the processor, wherein the processor is configured to perform all steps of the method according to any one of EEE1 to EEE19.

EEE21. A computer program comprising instructions that, when executed by a computing device, cause the computing device to perform all steps of the method according to any one of EEE1 to EEE19.

EEE22. A computer-readable storage medium storing the computer program according to EEE21.
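As an illustration only, the weighting scheme of EEE5 to EEE9 can be sketched for the simplified case in which both matrices ramp linearly across the same frame, i.e. the slot offset value is zero. The sketch below is not taken from the disclosure: the function names, the example gains, the multiplication order (rendering matrix times upmixing matrix) and the exact form of the compensation term are all assumptions. It builds the integrated matrix for a slot from the two combined matrices plus a compensation term, and `sequential_matrix` gives the reference result obtained by interpolating each matrix separately and then multiplying.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sub(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lerp(m0: Matrix, m1: Matrix, alpha: float) -> Matrix:
    """Element-wise linear interpolation from m0 (alpha=0) to m1 (alpha=1)."""
    return [[(1 - alpha) * x + alpha * y for x, y in zip(r0, r1)]
            for r0, r1 in zip(m0, m1)]


def integrated_matrix(r0: Matrix, r1: Matrix, c0: Matrix, c1: Matrix,
                      slot: int, num_slots: int) -> Matrix:
    """Integrated matrix for one slot, assuming aligned frame boundaries.

    r0, r1: rendering matrices sampled at the start and end boundary.
    c0, c1: upmixing matrices sampled at the start and end boundary.
    """
    alpha = slot / num_slots
    first = matmul(r0, c0)            # combined matrix at the first boundary
    second = matmul(r1, c1)           # combined matrix at the second boundary
    # Product of the two full-frame ramps (assumed compensation matrix).
    cross = matmul(sub(r1, r0), sub(c1, c0))
    w_comp = alpha * (1 - alpha)      # zero at both boundaries, largest mid-frame
    return [[(1 - alpha) * f + alpha * s - w_comp * x
             for f, s, x in zip(rf, rs, rx)]
            for rf, rs, rx in zip(first, second, cross)]


def sequential_matrix(r0: Matrix, r1: Matrix, c0: Matrix, c1: Matrix,
                      slot: int, num_slots: int) -> Matrix:
    """Reference: interpolate each matrix separately, then multiply."""
    alpha = slot / num_slots
    return matmul(lerp(r0, r1, alpha), lerp(c0, c1, alpha))


# Hypothetical example: 2 downmix channels -> 3 objects -> 2 output channels.
C0 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
C1 = [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]]
R0 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
R1 = [[0.8, 0.2, 0.3], [0.2, 0.8, 0.7]]
```

Under these assumptions the per-frame work is two matrix products for the combined matrices and one for the compensation matrix; each slot then needs only a weighted sum of three small matrices, and the result is applied directly to the downmix slot without forming the intermediate upmix signal. A non-zero slot offset is not handled by this sketch.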

Claims

1. A computer-implemented method of reconstructing an audio program for rendering at a playback device, the method comprising: receiving a downmix signal indicative of a plurality of audio objects relating to a plurality of spatially diverse audio signals of the audio program, the downmix signal comprising at least one channel; obtaining upmixing metadata defining a time-variant upmixing matrix for generating a multi-channel upmix signal indicative of the plurality of audio objects from the at least one channel of the downmix signal; obtaining object metadata defining a time-variant rendering matrix for generating a rendered output from the multi-channel upmix signal, wherein the time-variant upmixing matrix and the time-variant rendering matrix are dividable into a plurality of sampling frames each having a predetermined number of slots; and determining, for each slot of a sampling frame, an integrated upmixing and rendering matrix based on the time-variant upmixing matrix and the time-variant rendering matrix, wherein the integrated upmixing and rendering matrix is applicable to a corresponding slot of the downmix signal for generating the rendered output, and wherein determining the integrated upmixing and rendering matrix comprises: determining, for the sampling frame, a combined matrix based on the time-variant upmixing matrix and the time-variant rendering matrix corresponding to that sampling frame; determining a slot offset value indicative of a number of slots by which a first frame boundary with respect to the time-variant upmixing matrix is misaligned with a second frame boundary with respect to the time-variant rendering matrix; and determining, for each slot of the sampling frame, an interpolated matrix based on the determined combined matrix and the slot offset value.

2. The method according to claim 1, further comprising: acquiring, from the upmixing metadata, a sampled upmixing matrix for a first sampling frame associated with the time-variant upmixing matrix and for a second sampling frame associated with the time-variant rendering matrix; and acquiring, from the object metadata, a sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix, wherein determining the combined matrix comprises: computing, for the first sampling frame associated with the time-variant upmixing matrix, a first combined matrix based on the acquired sampled upmixing matrix corresponding to that first sampling frame and the acquired sampled rendering matrix corresponding to that first sampling frame; and computing, for the second sampling frame associated with the time-variant rendering matrix, a second combined matrix based on the acquired sampled upmixing matrix corresponding to that second sampling frame and the acquired sampled rendering matrix corresponding to that second sampling frame.

3. The method according to claim 2, wherein the interpolated matrix for each slot is determined based on at least one of the computed first combined matrix and the computed second combined matrix.

4. The method according to claim 2 or 3, wherein the sampled upmixing matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix is acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively, and wherein the sampled rendering matrix for the first sampling frame associated with the time-variant upmixing matrix and for the second sampling frame associated with the time-variant rendering matrix is acquired at the first frame boundary with respect to the time-variant upmixing matrix and the second frame boundary with respect to the time-variant rendering matrix, respectively.

5. The method according to any one of claims 2 to 4, further comprising: determining, for each slot of a sampling frame, a first weighting factor for the first combined matrix, a second weighting factor for the second combined matrix, and a compensation value based on the determined slot offset value.

6. The method according to claim 5, wherein the interpolated matrix is generated by a linear combination of the first combined matrix multiplied with the first weighting factor, the second combined matrix multiplied with the second weighting factor, and the compensation value.

7. The method according to claim 5 or 6, wherein, along processing of slots, the first weighting factor ramps down for the first combined matrix, and the second weighting factor ramps up for the second combined matrix.

8. The method according to any one of claims 5 to 7, wherein the compensation value comprises a compensation weighting factor in dependence of the determined slot offset value and a current slot position, and wherein, along processing of slots, the compensation weighting factor ramps up away from the first and second frame boundary and ramps down towards the first and second frame boundary.

9. The method according to any one of claims 5 to 8, wherein the compensation value is determined further based on an upmixing ramping step of the time-variant upmixing matrix between slots, and a rendering ramping step of the time-variant rendering matrix between the slots.

10. The method according to claim 9, wherein, at a current slot, the rendering ramping step of the time-variant rendering matrix is determined based on a slot position of the current slot with respect to the first frame boundary and the second frame boundary.

11. The method according to claim 9 or 10, wherein the rendering ramping step of the time-variant rendering matrix between the slots varies in response to an update of rendering gains in the rendering matrix at the second frame boundary.

12. The method according to any one of claims 2 to 11, wherein the first combined matrix is determined by multiplying the acquired sampled upmixing matrix corresponding to that first sampling frame with the acquired sampled rendering matrix corresponding to that first sampling frame.

13. The method according to any one of claims 2 to 12, wherein the second combined matrix is determined by multiplying the acquired sampled upmixing matrix corresponding to that second sampling frame with the acquired sampled rendering matrix corresponding to that second sampling frame.

14. The method according to any one of claims 2 to 13, wherein, for a slot located at the first frame boundary, the first combined matrix is determined as the integrated upmixing and rendering matrix, or for a slot located at the second frame boundary, the second combined matrix is determined as the integrated upmixing and rendering matrix, and wherein, for a slot not located at a frame boundary, the interpolated matrix is determined as the integrated upmixing and rendering matrix.

15. The method according to any one of the preceding claims, wherein the time-variant upmixing matrix contains a plurality of upmixing coefficients for generating the multi-channel upmix signal from the at least one channel of the downmix signal.

16. The method according to any one of the preceding claims, wherein the time-variant rendering matrix contains a plurality of rendering gains for generating the rendered output from the multi-channel upmix signal.

17. The method according to any one of the preceding claims, further comprising transforming the downmix signal from time domain to frequency domain, wherein the rendering matrix comprises a frequency-domain rendering matrix for rendering the plurality of audio objects in the frequency domain.

18. The method according to any one of the preceding claims, further comprising, during an initial sampling frame, applying, for each slot within the initial sampling frame, the time-variant upmixing matrix and the time-variant rendering matrix in sequence to a corresponding slot of the downmix signal.

19. The method according to any one of the preceding claims, wherein the downmix signal comprises a 5.1 channel signal, or a 7.1 channel signal.

20. An apparatus for reconstructing an audio program for rendering at a playback device, the apparatus comprising a processor and a memory coupled to the processor and storing instructions for the processor, wherein the processor is configured to perform all steps of the method according to any one of claims 1 to 19.

21. A computer program comprising instructions that, when executed by a computing device, cause the computing device to perform all steps of the method according to any one of claims 1 to 19.

22. A computer-readable storage medium storing the computer program according to claim 21.
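
The structure recited in claims 5 to 9 (two weighting factors plus a compensation value tied to the ramping steps) can be motivated by a short derivation. The following is a sketch under stated assumptions rather than the formula of the disclosure: it assumes linear ramps, a slot offset of zero, a frame of N slots, and a rendered output formed as the rendering matrix times the upmixing matrix. The symbols R_0, R_1, C_0, C_1, the per-slot steps \delta_R and \delta_C, and the slot index s are introduced here for illustration only.

```latex
% Assumed linear ramps over a frame of N slots, slot index s = 0, ..., N:
%   R(s) = R_0 + s\,\delta_R, \qquad \delta_R = (R_1 - R_0)/N   (rendering ramping step)
%   C(s) = C_0 + s\,\delta_C, \qquad \delta_C = (C_1 - C_0)/N   (upmixing ramping step)
\begin{align*}
R(s)\,C(s)
  &= R_0 C_0 + s\,\bigl(R_0\,\delta_C + \delta_R\,C_0\bigr) + s^2\,\delta_R\,\delta_C \\
R_1 C_1
  &= R_0 C_0 + N\,\bigl(R_0\,\delta_C + \delta_R\,C_0\bigr) + N^2\,\delta_R\,\delta_C \\
\Bigl(1 - \tfrac{s}{N}\Bigr) R_0 C_0 + \tfrac{s}{N}\,R_1 C_1
  &= R_0 C_0 + s\,\bigl(R_0\,\delta_C + \delta_R\,C_0\bigr) + sN\,\delta_R\,\delta_C \\
\Rightarrow\quad
R(s)\,C(s)
  &= \Bigl(1 - \tfrac{s}{N}\Bigr) R_0 C_0 + \tfrac{s}{N}\,R_1 C_1 - s\,(N - s)\,\delta_R\,\delta_C
\end{align*}
```

Read this way, the first weighting factor 1 - s/N ramps down, the second weighting factor s/N ramps up, and the compensation weight s(N - s) vanishes at both frame boundaries and is largest mid-frame. With a non-zero slot offset the rendering ramp would change slope partway through the upmixing frame, so the same expansion would presumably have to be applied piecewise; that case is not worked here.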