A device for generating a bitstream, a device for generating rendered audio, a method for generating a bitstream, and a method for generating rendered audio.

The bitstream generation apparatus addresses the challenge of efficiently representing audio environments in VR and AR by generating metadata for dynamic audio rendering, improving flexibility and reducing data rates while enhancing the audio experience.

JP7876588B2 · Active · Publication Date: 2026-06-19 · KONINKLIJKE PHILIPS NV

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: KONINKLIJKE PHILIPS NV
Filing Date: 2024-10-09
Publication Date: 2026-06-19

IPC: G10L19/00; H04S7/00

AI Tagging

Application Domain

Speech analysis; Stereophonic systems

AI Technical Summary

Technical Problem

Existing technologies face challenges in efficiently communicating audio data for immersive applications like VR and AR, particularly in providing flexible, dynamic, and low-data-rate representations of audio environments that adapt to changing listener positions and environments.

Method used

A bitstream generation apparatus that includes a metadata generator for generating metadata describing acoustic environment characteristics, both static and dynamic, which is used to create a bitstream containing audio data for sound sources, enabling flexible and dynamic rendering of audio environments.

Benefits of technology

The solution improves audio rendering in immersive applications by offering high flexibility, reduced complexity, and lower data rates, facilitating dynamic adaptation and customization, and enhancing the audio experience.

✦ Generated by Eureka AI based on patent content.

Abstract

To provide a bitstream representing an acoustic environment, and an apparatus for generating such a bitstream and an apparatus for processing such a bitstream.

SOLUTION: An apparatus includes a metadata generator 203 that generates metadata for audio data for a plurality of audio elements representing audio sources in an environment. The metadata includes acoustic environment data for the environment, and the acoustic environment data describes properties affecting sound propagation for the audio sources in the environment, at least part of which is applicable to a plurality of listening poses in the environment, and the properties include both static properties and dynamic properties. A bitstream generator 205 generates a bitstream including metadata. The bitstream often further includes audio data representing the audio elements for the audio sources in the environment. A decoding apparatus includes a receiver that receives the bitstream, and a renderer that renders audio for the audio environment on the basis of the acoustic environment data and audio data for the audio elements.

SELECTED DRAWING: Figure 2

Description

Technical Field

[0001] The present invention relates to a bitstream representing an audio environment, as well as an apparatus for generating such a bitstream and an apparatus for processing such a bitstream, and in particular, but not exclusively, to a bitstream representing a virtual audio environment for virtual reality applications.

Background Art

[0002] In recent years, the diversity and scope of experiences based on audiovisual content have greatly increased, and new services and methods for using and consuming such content have been continuously developed and introduced. In particular, many spatial interactive services, applications, and experiences have been developed to provide users with more engaging and immersive experiences.

[0003] Examples of such applications include virtual reality (VR) applications, augmented reality (AR) applications, and mixed reality (MR) applications that are rapidly becoming mainstream, and many solutions are targeting the consumer market. Also, various standards have been developed by various standardization bodies. Such standardization activities are actively developing standards for various aspects of VR / AR / MR systems, such as, for example, streaming, broadcasting, and rendering.

[0004] VR applications tend to provide a user experience that corresponds to a user in a different world / environment / scene, while AR (including mixed reality / MR) applications tend to provide a user experience that corresponds to a user in the current environment with added information or virtual objects. Therefore, VR applications tend to provide a fully immersive artificial world / scene, while AR applications tend to provide a partially artificial world / scene that is overlaid on the user's physically existing real-world scene. However, these terms are often used synonymously and overlap considerably. Below, the term virtual reality / VR will be used to refer to both virtual reality and augmented / mixed reality.

[0005] Communicating audiovisual data, and in particular audio data, that describes an environment (especially an audio environment) in a flexible representation that supports adaptation to the user, for example to provide a VR experience, is an extremely difficult task. Preferably, the communicated data describes the environment such that it can be used locally to render a dynamic experience that reflects changes in the (virtual) listening position and changes in the environment itself.

[0006] Much research has been conducted to obtain advantageous approaches for efficiently communicating data representing such environments. Various proposals have been made regarding appropriate data streams and formats, most of which involve individualized models in which individual sound sources are presented separately and associated with metadata describing various characteristics such as the location of the sound source. In addition, general data describing the audio environment, such as data describing reverberation and decay, may be provided.

[0007] However, defining a bitstream format that provides efficient (e.g., reduced data rate) communication of such information is extremely difficult, and achieving a favorable approach requires careful consideration and balancing of numerous issues, characteristics, and trade-offs. The Moving Picture Experts Group (MPEG) initiated a standardization approach to develop a standard called MPEG-I for bitstreams suitable for VR and similar experiences.

[0008] Therefore, improved approaches and data formats / bitstreams for supporting audio in immersive applications and services such as VR and AR would be advantageous. In particular, approaches / bitstreams / formats that enable improved operation, increased flexibility, reduced complexity, easier implementation, enhanced audio experience, reduced computational load, improved audio quality, lower data rates, improved trade-offs, and / or improved performance and / or operation would be advantageous.

Summary of the Invention

[0009] Therefore, the present invention aims to suitably mitigate, reduce, or eliminate one or more of the above-mentioned drawbacks, either individually or in any combination.

[0010] According to aspects and optional features of the present invention, an apparatus for generating a bitstream is provided, the apparatus comprising a metadata generator (203) that generates metadata for audio data of a plurality of audio elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from sound sources in the environment, and at least a portion of the acoustic environment data is applicable to a plurality of listening poses in the environment, and the characteristics include both static and dynamic characteristics; and a bitstream generator that generates a bitstream containing the metadata.

[0011] This approach can improve the performance and behavior of many applications, including immersive, flexible, and dynamic audiovisual applications (e.g., many VR and AR applications). It can often resolve trade-offs between different requirements, such as the trade-off between providing accurate, complete, and / or dynamic data of the environment and providing a low data-rate bitstream. This approach often offers a high degree of flexibility, potentially facilitating, improving, and even enabling adaptation and customization on the rendering side. It may facilitate and / or improve the rendering of audio in the environment, specifically facilitating and / or improving the dynamic rendering of changing environments and / or changing listening positions.

[0012] This approach specifically offers the potential to provide a carefully adapted audiovisual bitstream, particularly well-suited for dynamic applications, which provides a carefully selected and at least partially optimized data representation of the audio environment. This data representation includes both the acoustic properties of the sound source and the environment, and possibly even individual objects within the environment.

[0013] In some embodiments, the apparatus may further include an audio data generator configured to generate audio data for multiple audio elements representing sound sources in the environment, and a bitstream generator may be configured to include the audio data within a bitstream.

[0014] Static characteristics can be time-invariant (at least within the time interval for which the characteristic value is considered). Dynamic characteristics can be time-varying (at least within the time interval for which the characteristic value is considered).

[0015] In many embodiments, at least one characteristic is dependent on position and orientation. In many embodiments, the acoustic environment data of at least one characteristic exhibits position and / or orientation dependence. Position and / or orientation dependence may be dependence on the orientation and / or position of the sound source, and / or the orientation and / or position of the listening pose.

[0016] In some embodiments, the acoustic environment data in the bitstream is divided into a series of data groups, each of which provides characteristic values for properties that affect sound propagation.

[0017] According to an optional feature of the present invention, the acoustic environment data includes a data group that describes a data format for at least a portion of the representation of a characteristic value of at least one of the characteristics that affect sound propagation, and a plurality of data groups, each containing data that describes at least one characteristic value using the representation.

[0018] In some embodiments, the acoustic environment data includes a data group that describes the data format of a characteristic among the characteristics that affect sound propagation, and a plurality of data groups, each containing data that describes the characteristic value of that characteristic according to the data format.

[0019] A data group can be one or more data values, specifically a set of bits. A data group can be a set of data, specifically one or more data fields of a bitstream.

[0020] In some embodiments, the acoustic environment data includes a data group describing a frequency grid and a plurality of data groups, each containing data describing frequency-dependent characteristics of the characteristics using the frequency grid.

[0021] A frequency grid may be a subdivision of a frequency range into frequency subranges, for example, by defining the center frequencies of the frequency subranges.

[0022] In some embodiments, the bitstream includes an indicator indicating whether the bitstream contains data groups describing a frequency grid.

[0023] In some embodiments, the data group includes an indication of the data format that describes the frequency grid.

[0024] In some embodiments, the acoustic environment data includes a data group describing a frequency grid and a plurality of data groups, each containing data describing frequency-dependent characteristics of the characteristics using the frequency grid, the bitstream includes an indicator indicating whether the bitstream contains a data group describing a frequency grid, and the data groups include an indication of the format of the data describing the frequency grid.

[0025] In some embodiments, the data group includes at least one of the following: data indicating a predetermined default grid; data indicating the starting frequencies and frequency ranges of at least several sub-ranges of the frequency grid; and data indicating individual frequencies.

[0026] In some embodiments, at least one subrange or individual frequency indicated by the data in the data group is aligned with a fraction of an octave band.
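
As an illustration of a frequency grid aligned with fractional octave bands, the following minimal sketch computes band centre frequencies. The function name, the 1 kHz reference frequency, and the base-2 spacing are assumptions made for illustration; the text above does not specify how such a grid is computed.

```python
import math


def fractional_octave_centres(f_min, f_max, fraction=3, f_ref=1000.0):
    """Return centre frequencies (Hz) spaced 1/fraction octave apart in [f_min, f_max].

    Assumption: centres lie at f_ref * 2**(k / fraction) for integer k.
    """
    k_lo = math.ceil(fraction * math.log2(f_min / f_ref))
    k_hi = math.floor(fraction * math.log2(f_max / f_ref))
    return [f_ref * 2.0 ** (k / fraction) for k in range(k_lo, k_hi + 1)]


# Full-octave grid and one-third-octave grid over the same frequency range.
octave_grid = fractional_octave_centres(125.0, 8000.0, fraction=1)
third_octave_grid = fractional_octave_centres(125.0, 8000.0, fraction=3)
```

Each centre frequency then identifies one frequency subrange of the grid, and frequency-dependent characteristic values can be listed in grid order.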

[0027] According to an optional feature of the present invention, the acoustic environment data includes a data group describing a frequency grid and a plurality of data groups each including data describing frequency-dependent characteristics among the characteristics using the frequency grid, the bitstream includes an indicator indicating whether the bitstream includes a data group describing the frequency grid, the data group includes an indication of the format of the data describing the frequency grid, and the data group includes at least one of data indicating a predetermined default grid, data indicating start frequencies and frequency ranges of at least some partial ranges of the frequency grid, and data indicating individual frequencies.
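
The three ways of signalling a frequency grid could be handled on the receiving side roughly as follows. This is a sketch under stated assumptions: the dictionary keys (`has_freq_grid`, `format`, `default_id`, `ranges`, `frequencies`) and the contents of the default grid table are hypothetical and are not field names from the bitstream described here.

```python
# Hypothetical table of predetermined default grids (centre frequencies in Hz).
DEFAULT_GRIDS = {0: [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]}


def resolve_frequency_grid(metadata):
    """Return the frequency grid described by already-parsed metadata, or None."""
    # Indicator: does the bitstream carry a frequency grid data group at all?
    if not metadata.get("has_freq_grid", False):
        return None
    group = metadata["freq_grid"]
    fmt = group["format"]  # indication of the format used to describe the grid
    if fmt == "default":
        return list(DEFAULT_GRIDS[group["default_id"]])
    if fmt == "ranges":
        # Sub-ranges given as (start frequency, frequency range) pairs,
        # returned here as (low, high) band edges.
        return [(start, start + width) for start, width in group["ranges"]]
    if fmt == "individual":
        return sorted(group["frequencies"])
    raise ValueError(f"unknown frequency grid format: {fmt!r}")


example = {
    "has_freq_grid": True,
    "freq_grid": {"format": "ranges", "ranges": [(100.0, 50.0), (150.0, 100.0)]},
}
bands = resolve_frequency_grid(example)
```

A renderer could fall back to its own grid when the function returns `None`, which mirrors the role of the indicator described above.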

[0028] In some embodiments, the acoustic environment data includes a data group describing an orientation representation format for representing orientation characteristics and at least one data group including data describing orientation characteristics among the characteristics using the orientation representation format.

[0029] The bitstream may include an indicator indicating whether the bitstream includes a data group describing the orientation representation format.

[0030] In some embodiments, the data group includes at least one of data indicating a predetermined default orientation representation, data indicating a set of predetermined angles, and data indicating an angle on a quantization grid.

[0031] According to an optional feature of the present invention, the acoustic environment data includes a data group describing an orientation representation format for representing orientation characteristics and at least one data group including data describing orientation characteristics among the characteristics using the orientation representation format, and the data group includes at least one of data indicating a predetermined default orientation representation, data indicating a set of predetermined angles, and data indicating an angle on a quantization grid.
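
To illustrate what "an angle on a quantization grid" might look like in practice, the sketch below maps an angle to an integer index on a uniform grid and back. The uniform 5 degree step and both function names are assumptions for illustration only.

```python
def quantize_angle(angle_deg, step_deg=5.0):
    """Map an angle in degrees to an index on a uniform grid covering [0, 360)."""
    n_steps = round(360.0 / step_deg)
    # Wrap into [0, 360), round to the nearest grid point, and wrap the index
    # so that angles just below 360 degrees map to index 0.
    return round((angle_deg % 360.0) / step_deg) % n_steps


def dequantize_angle(index, step_deg=5.0):
    """Recover the grid angle in degrees for a given index."""
    return index * step_deg


index = quantize_angle(92.0)
restored = dequantize_angle(index)
```

Only the index needs to be transmitted once the grid step is known to both sides, which is where the saving in bits comes from.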

[0032] According to an optional feature of the present invention, the acoustic environment data includes a first data field for a first bit representing the value of a first characteristic among the characteristics that affect sound propagation, and a second data field indicating whether or not the acoustic environment data includes extended data for a second bit representing the value of the first characteristic.

[0033] According to an optional feature of the present invention, the second bit extends the range of values that the first characteristic can take.

[0034] The second bit may be a more significant bit than the first bit of the data word representing the value of the first characteristic.

[0035] According to an optional feature of the present invention, the second bit increases the resolution of the values that the first characteristic can take.

[0036] The second bit may be a less significant bit than the first bit of the data word representing the value of the first characteristic. The second bit can extend the precision of the values that the first characteristic can take.
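
The two roles of the extension bits (wider range versus finer resolution) can be sketched as follows. The function name, its parameters, and the treatment of the fields as unsigned integers are assumptions for illustration, not a definition of the actual data fields.

```python
def decode_with_extension(base_bits, base_width, ext_bits=None, ext_width=0, mode="range"):
    """Combine a base field with an optional extension field.

    mode="range":      extension bits are more significant, so larger values fit.
    mode="resolution": extension bits are less significant, so steps become finer.
    ext_bits=None models a bitstream whose flag says no extension data is present.
    """
    if ext_bits is None:
        return float(base_bits)
    if mode == "range":
        return float((ext_bits << base_width) | base_bits)
    if mode == "resolution":
        return base_bits + ext_bits / (1 << ext_width)
    raise ValueError(f"unknown mode: {mode!r}")


plain = decode_with_extension(0b1010, 4)
wider = decode_with_extension(0b1010, 4, ext_bits=0b11, ext_width=2, mode="range")
finer = decode_with_extension(0b1010, 4, ext_bits=0b11, ext_width=2, mode="resolution")
```

In both modes a decoder that ignores the extension still obtains a usable value from the base field alone: exact in the range case when the extension bits are zero, and rounded down in the resolution case.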

[0037] In some embodiments, the first characteristic is selected from the group including temporal characteristics, spatial characteristics, quantity, gain characteristics, volume characteristics, frequency characteristics, index characteristics, and identity characteristics.

[0038] According to an optional feature of the present invention, the metadata generator (203) generates the acoustic environment data to include a global indicator indicating that the environment is a spatially constrained environment, and, when the global indicator indicates that the environment is spatially constrained, the metadata generator restricts at least one data value of the acoustic environment data to conform to a predetermined restricted format.

[0039] A spatially constrained environment may be an environment with a spatial extent below a threshold. The predetermined restricted format may be a data format for at least some values of the properties affecting sound propagation that uses fewer bits than the data format used for the bitstream when the environment is not spatially constrained.
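
As a sketch of how a global "spatially constrained" indicator could shorten individual fields, the example below encodes a coordinate with fewer bits when the indicator is set. The field widths (8 and 16 bits), the 0.1 m step, and the restriction to non-negative coordinates are all assumptions chosen for illustration.

```python
def position_field_width(spatially_constrained):
    """Hypothetical field widths: a small scene needs fewer bits per coordinate."""
    return 8 if spatially_constrained else 16


def encode_coordinate(value_m, spatially_constrained, step_m=0.1):
    """Quantize a non-negative coordinate (metres) and return it as a bit string."""
    width = position_field_width(spatially_constrained)
    quantized = round(value_m / step_m)
    if not 0 <= quantized < (1 << width):
        raise ValueError("coordinate does not fit the selected format")
    return format(quantized, f"0{width}b")


short_field = encode_coordinate(2.5, spatially_constrained=True)
long_field = encode_coordinate(2.5, spatially_constrained=False)
```

The same value occupies half as many bits in the restricted format, at the cost of a smaller representable range.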

[0040] The global indicator may be an optional indicator. In some embodiments, the global indicator may indicate whether the environment is a spatially constrained environment or a less spatially constrained environment (or an environment without spatial constraints).

[0041] In some embodiments, the acoustic environment data includes an animation indication of at least one first sound element, which indicates whether at least one characteristic of the first sound element changes during a time interval.

[0042] According to an optional feature of the present invention, the acoustic environment data includes an animation indication of at least one first sound element, the animation indication showing whether at least one characteristic of the first sound element changes over time, and for an animation indication showing that the first sound element has at least one changing characteristic, the acoustic environment data includes data describing the variation of the at least one changing characteristic.

[0043] In some embodiments, if the animation indication indicates that at least one characteristic of the first sound element changes during a time interval, the acoustic environment data further includes further animation indications for each of at least two characteristics, the further animation indications indicating whether the corresponding characteristic is animated or not; if the animation indication indicates that there are no characteristics of the first sound element that change during a time interval, the acoustic environment data does not include further animation indications for the at least two characteristics.

[0044] In some embodiments, if the animation indication indicates that at least one characteristic of the first audio element changes during a time interval, further animation indications are present for at least two characteristics, each indicating whether the corresponding characteristic is animated; if the animation indication indicates that no characteristic of the first audio element changes during the time interval, the further animation indications are omitted.
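
The two-level animation signalling described in the preceding paragraphs can be sketched as a flag writer: one top-level flag, followed by per-characteristic flags only when the top-level flag is set. The property names and their order are hypothetical.

```python
# Hypothetical, fixed order in which per-characteristic flags are written.
PROPERTY_ORDER = ("position", "orientation", "gain")


def write_animation_flags(animated):
    """Return the list of flag bits for one sound element.

    `animated` maps each property name in PROPERTY_ORDER to True or False.
    """
    any_animated = any(animated[name] for name in PROPERTY_ORDER)
    bits = [1 if any_animated else 0]  # top-level animation indication
    if any_animated:
        # Further indications are only present when something is animated.
        bits.extend(1 if animated[name] else 0 for name in PROPERTY_ORDER)
    return bits


static_flags = write_animation_flags({"position": False, "orientation": False, "gain": False})
moving_flags = write_animation_flags({"position": True, "orientation": False, "gain": False})
```

A fully static element therefore costs a single bit, which is the data-rate benefit of omitting the further indications.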

[0045] In some embodiments, the acoustic environment data includes at least two values of at least one characteristic that changes during a time interval, and interpolation data that describes a temporal interpolation to interpolate between these at least two values.
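
A renderer might evaluate such a time-varying characteristic as sketched below. The two interpolation modes (`"linear"` and `"step"`) and the function signature are assumptions; the text above only states that interpolation data describes how to interpolate between the transmitted values.

```python
def interpolate(t, t0, v0, t1, v1, mode="linear"):
    """Evaluate a characteristic at time t between keyed values (t0, v0) and (t1, v1)."""
    if t <= t0:
        return v0
    if t >= t1:
        return v1
    if mode == "step":
        return v0  # hold the earlier value until the next keyed time
    if mode == "linear":
        alpha = (t - t0) / (t1 - t0)
        return v0 + alpha * (v1 - v0)
    raise ValueError(f"unknown interpolation mode: {mode!r}")


halfway_linear = interpolate(0.5, 0.0, 0.0, 1.0, 2.0)
halfway_step = interpolate(0.5, 0.0, 0.0, 1.0, 2.0, mode="step")
```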

[0046] According to an optional feature of the present invention, the audio elements include a plurality of sound effect elements, and the acoustic environment data includes data that links a user-controlled change to the environment with a first sound effect element among the plurality of sound effect elements.

[0047] In some embodiments, the acoustic environment data is arranged as consecutive datasets, with each dataset containing data for a time interval.

[0048] The time interval may differ from dataset to dataset.

[0049] According to an optional feature of the present invention, the acoustic environment data is arranged as a plurality of consecutive datasets, each dataset containing data over a time interval, the first dataset of the consecutive datasets containing a first characteristic value of at least one characteristic affecting sound, and a time indication of the first characteristic value indicating the time within the time interval represented by the first dataset.

[0050] In some embodiments, a first dataset of a series of datasets includes at least two characteristic values of at least one characteristic affecting sound, and a time indication of these at least two characteristic values, the time indication indicating time within the time interval represented by the first dataset.

[0051] According to an optional feature of the present invention, acoustic environment data is arranged as a plurality of consecutive datasets, each dataset containing data over a time interval, and the bitstream generator determines whether a characteristic value of a first characteristic among the characteristics affecting sound propagation is provided for a default time within the time interval represented by the first dataset, and if so, includes the first characteristic value in the first dataset without time indication, and otherwise includes the first characteristic value in the first dataset with time indication for the first characteristic value.
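
The default-time rule can be sketched as follows: a value that falls on the default time of its dataset is stored without a time indication, and any other value carries an explicit time offset. The dictionary keys and the choice of the interval start as the default time are assumptions for illustration.

```python
def pack_value(value, time_s, interval_start_s, default_offset_s=0.0):
    """Pack one characteristic value for the dataset starting at interval_start_s."""
    offset = time_s - interval_start_s
    if offset == default_offset_s:
        # Value applies at the default time: no time indication is needed.
        return {"has_time": False, "value": value}
    return {"has_time": True, "time_offset": offset, "value": value}


at_default = pack_value(0.8, time_s=10.0, interval_start_s=10.0)
mid_interval = pack_value(0.8, time_s=10.5, interval_start_s=10.0)
```

Values that change only at dataset boundaries then cost no timing overhead, while mid-interval changes remain expressible.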

[0052] In some embodiments, the acoustic environment data includes several complete rendering datasets that contain all the data necessary to render the environment's sound, and several partial rendering datasets that require additional data from other datasets to render the environment's sound.

[0053] In some embodiments, the acoustic environment data for at least some elements of the environment includes element identification data and parent identification data of the environment's scene graph.

[0054] At least some of these elements could be, for example, objects, sound sources (sound elements), and / or acoustic properties of an environment.
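
Element and parent identifiers are enough to reconstruct a scene graph on the receiving side, as the following sketch shows. The element names and the use of `None` as the parent of a root element are assumptions for illustration.

```python
def build_children(elements):
    """Map each parent ID to the IDs of its direct children.

    `elements` is an iterable of (element_id, parent_id) pairs; a parent_id
    of None marks a root of the scene graph.
    """
    children = {}
    for element_id, parent_id in elements:
        children.setdefault(parent_id, []).append(element_id)
    return children


def descendants(children, element_id):
    """Return all elements below element_id, in depth-first order."""
    found = []
    for child in children.get(element_id, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


scene = [("room", None), ("door", "room"), ("radio", "room"), ("speaker", "radio")]
tree = build_children(scene)
below_room = descendants(tree, "room")
```

With such a structure, a change applied to a parent (for example moving the radio) can be propagated to everything attached to it.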

[0055] According to an optional feature of the present invention, the acoustic environment data of the first sound element includes indications of a first applicability region and a second applicability region for a first characteristic value of a first characteristic among the characteristics that affect sound propagation, wherein the first applicability region indicates the region of the location of the first sound element to which the first characteristic value is applied, and the second applicability region indicates the region of the listening position to which the first characteristic value is applied.
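
A characteristic value with two applicability regions could be checked as sketched below: the value is used only when the sound source lies in the first region and the listener lies in the second. Representing the regions as axis-aligned boxes, and the class and field names, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region given by its minimum and maximum corners (metres)."""
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(a <= p <= b for a, p, b in zip(self.lo, point, self.hi))


@dataclass(frozen=True)
class RegionalValue:
    """A characteristic value with source-side and listener-side applicability regions."""
    value: float
    source_region: Box
    listener_region: Box

    def applies(self, source_pos, listener_pos):
        return (self.source_region.contains(source_pos)
                and self.listener_region.contains(listener_pos))


attenuation = RegionalValue(
    value=-6.0,
    source_region=Box((0.0, 0.0, 0.0), (2.0, 2.0, 3.0)),
    listener_region=Box((5.0, 0.0, 0.0), (8.0, 4.0, 3.0)),
)
in_both = attenuation.applies((1.0, 1.0, 1.5), (6.0, 2.0, 1.7))
listener_outside = attenuation.applies((1.0, 1.0, 1.5), (3.0, 2.0, 1.7))
```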

[0056] According to aspects and optional features of the present invention, an apparatus for generating rendered audio is provided, the apparatus comprising: a first receiver (303) that receives audio data of a plurality of audio elements representing sound sources in an environment; a second receiver (305) that receives a bitstream containing metadata of the audio data of the plurality of audio elements representing sound sources in an environment, the metadata of which includes acoustic environment data of the environment, the acoustic environment data describing characteristics that affect the propagation of sound from sound sources in the environment, and at least a portion of the acoustic environment data is applicable to a plurality of listening poses in the environment, the characteristics including both static and dynamic characteristics; and a renderer (307) that generates output audio data of the environment in response to the audio data and acoustic environment data.

[0057] According to aspects and optional features of the present invention, a bitstream is provided which includes metadata of audio data of a plurality of sound elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, which describes characteristics that affect the propagation of sound from the sound sources in the environment, and at least a portion of the acoustic environment data is applicable to a plurality of listening poses in the environment, and the characteristics include both static and dynamic characteristics.

[0058] According to aspects and optional features of the present invention, a method for generating a bitstream is provided, the method comprising the steps of generating metadata for audio data of a plurality of audio elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from the sound sources in the environment, and at least a portion of the acoustic environment data is applicable to a plurality of listening poses in the environment, and the characteristics include both static and dynamic characteristics; and generating a bitstream containing the metadata.

[0059] According to aspects and optional features of the present invention, a method for generating rendered audio is provided, the method comprising: receiving audio data of a plurality of audio elements representing sound sources in an environment; receiving a bitstream containing metadata of the audio data of the plurality of audio elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from sound sources in the environment, and at least a portion of the acoustic environment data is applicable to a plurality of listening poses in the environment, the characteristics include both static and dynamic characteristics; and generating output audio data of the environment in response to the audio data and acoustic environment data.

[0060] The above and other aspects, features, and advantages of the present invention will be described and made apparent with reference to the embodiments described below.

Brief Description of the Drawings

[0061] Hereinafter, embodiments of the present invention, which are merely examples, will be described with reference to the following drawings.
[Figure 1] Figure 1 shows an example of an audiovisual distribution system.
[Figure 2] Figure 2 shows an example of elements of an encoding device according to some embodiments of the present invention.
[Figure 3] Figure 3 shows an example of elements of a decoding device according to some embodiments of the present invention.
[Figure 4] Figure 4 shows an example of a bitstream data structure according to some embodiments of the present invention.
[Figure 5] Figure 5 shows an example of a changing characteristic value represented in a bitstream according to some embodiments of the present invention.
[Figure 6] Figure 6 shows a bitstream data structure according to some embodiments of the present invention.

Modes for Carrying Out the Invention

[0062] The following explanation focuses on audiovisual applications such as virtual reality (VR) applications, but it will be understood that the principles and concepts described can be used in many other applications and embodiments.

[0063] The following description focuses on generating a bitstream that provides an auditory representation of an environment. In many examples, the auditory representation is complemented by a visual representation of the environment, and VR applications are configured to generate both audio and visuals to present to the user. The auditory representation may complement the visual representation of the virtual scene / environment, or the bitstream may provide a representation of the real world or a hybrid environment, for example. The bitstream may contain data representing individual elements within the environment / scene, such as sound sources or objects. Furthermore, it may provide more general information, such as general acoustic or visual data (e.g., data describing reverberation or background color).

[0064] However, it will be understood that the principles and concepts described can be used in many other applications and embodiments.

[0065] Virtual reality experiences can enable devices such as computers to generate virtual experiences for users by rendering three-dimensional audio and video for a virtual scene and presenting it to the user. Since users are likely to move around, the observer / listening pose may change dynamically. In many embodiments, the virtual scene / environment may be a dynamic scene in which, for example, objects move or change shape, and multiple sound sources emit sound at different times.

[0066] In this technical field, the terms position and pose are used as general terms to describe location and / or orientation. For example, a combination of the location and orientation of an object, camera, head, or view may be called a pose or position. Thus, an indication of position or pose can have six values / components / degrees of freedom. Each value / component typically describes an individual characteristic of the location or orientation of the corresponding object. Naturally, in many situations, position or pose can be considered or represented with fewer components, for example, when one or more components are considered constant or irrelevant (for example, if all objects are considered to have the same height and their orientation is horizontal, it may be possible to fully represent the pose of an object with four components). Hereinafter, the term pose will be used to refer to a location and / or orientation that can be represented by 1 to 6 values (corresponding to the maximum possible degrees of freedom). The term pose can be replaced with the term position, or with the term location and / or orientation. It can also be replaced with the terms "position and orientation" (when the pose provides information on both position and orientation), "position" (when the pose provides information on position, possibly only position), or "orientation" (when the pose provides information on orientation, possibly only orientation).

[0067] In many approaches, VR applications may be delivered locally to the viewer by a standalone device that does not use a remote VR server at all, or in some cases does not even have access to a remote VR server. However, in other applications, VR applications may be based on data received from a remote or central server. For example, audio or visual data may be provided to the VR device from a remote central server, processed locally, and used to generate the desired VR experience.

[0068] Figure 1 shows an example of a VR system in which a remote VR client device 101 communicates with a VR server 103 via a network 105, such as the internet. The server 103 may be configured to simultaneously support a large number of client devices 101.

[0069] The VR server 103 can support a virtual reality experience by, for example, sending data to the client device 101 that defines elements of the virtual environment and virtual objects. The data may specifically describe the visual and geometric characteristics of numerous virtual objects that can be used by the client device 101 to generate graphics that can be presented to the user. In some embodiments, the data may also include various information that can be presented to the user. Furthermore, the server 103 can provide the client device 101 with audio data that can be used to locally generate virtual sounds / voices that may further enhance the user experience, particularly immersion.

[0070] The following description focuses on generating audio bitstreams that provide representations of acoustic scenes and environments (typically including both representations of sound sources and representations of the acoustic properties of the environment).

[0071] Figure 2 shows an example of a device that generates a bitstream, which may specifically be (or be included in) the server 103 in Figure 1. The device may specifically be an encoder / transmitter. Figure 3 shows an example of a device that receives and processes a bitstream, such as the bitstream generated by the device in Figure 2. The device in Figure 3 may be a decoder / receiver, which may specifically be (part of) the client device 101 in Figure 1. Hereafter, the device in Figure 2 will also be called an encoder, and the device in Figure 3 will also be called a decoder.

[0072] In this example, the encoder generates a bitstream describing the audio environment. In this specific example, the bitstream includes both audio data of the sound produced by sound sources in the environment and metadata describing the acoustic environment (typically including both metadata for individual sound sources and metadata for the acoustic environment). However, in some embodiments, the audio data and metadata may be provided in separate bitstreams; specifically, a separate bitstream containing only the metadata for the audio data may be generated, and the actual audio data may be provided separately in another bitstream. In fact, in some embodiments, the audio data may be provided from one source / provider, and the additional metadata from another source / provider.

[0073] The encoder includes an audio data generator 201 configured to generate audio data for multiple audio elements representing multiple sound sources in the environment. The audio generator may generate audio data from received audio data describing the sound from individual sound sources or audio elements representing sounds picked up by, for example, a microphone, or it may generate the audio data itself based on, for example, an audio model of the scene. In some embodiments, the audio data generator 201 may extract audio data for a particular sound source from a local store containing, for example, appropriate representations of the sound from various sound sources. In other embodiments, instead or in addition, the audio data may be generated from, for example, a microphone input that captures live audio.

[0074] The audio data generator 201 can generate audio data in accordance with any suitable data format, and any suitable encoding and representation may be used. Therefore, the audio data can be generated in any suitable way, and the audio elements can be represented in any suitable way.

[0075] An audio element can be an audio object, an audio clip, an audio channel, first-order or higher-order Ambisonics (FOA, HOA), or another audio element. Specifically, each audio element may be represented by a set of audio data that characterizes the audio that can be generated in the environment. The audio data is generally generated to contain multiple sets of audio data for multiple different audio elements, specifically for multiple different sound sources represented by the individual audio elements.

[0076] The encoder further includes a metadata generator 203 configured to generate metadata. The metadata includes acoustic environment data of the environment, which describes properties that affect the propagation of sound from sound sources within the environment. The properties of the acoustic environment data may include the acoustic properties of the environment (e.g., reverberation, reflection properties), properties of objects within the environment that may affect the propagation of sound (e.g., location, orientation, size, material, attenuation, reflections, etc.), or properties of sound sources / audio elements (e.g., location, orientation, size, volume, etc.).

[0077] A bitstream may contain several data groups of data symbols / bits that together provide indications of characteristic values for various properties that affect the sound perceived in the environment. It may also contain data groups that provide various other data, such as definitions of auxiliary parameters or formats for other data of the acoustic environment data.

[0078] A data group may be a simple bit sequence indicating a data value / format, or it may be a more complex combination of data that provides appropriate information. In many scenarios, for example, a data group can be considered to correspond to one or more data fields of a bitstream. A data field may contain sub-data fields, i.e., a hierarchical structure of data fields may be applied, where each sub-data field is itself a combination of data fields.

[0079] Acoustic environment data may include metadata describing, for example, the pose (position and / or orientation) of one or more sound sources represented by sound elements, the acoustic properties of the environment or individual objects within the environment (e.g., attenuation (potentially frequency-dependent), reverberation properties, object shape that may affect sound wave propagation, material properties such as sound absorption, acoustic reflection, acoustic scattering, acoustic coupling, or acoustic transfer parameters), signal references, reference distances, rendering control flags, acoustic effect properties, and user interaction metadata.

[0080] Acoustic environment data may include data that can be applied to multiple listening poses, specifically, data that the renderer can use to adapt the rendering of audio elements so that it depends on different listener poses (specifically, different positions and / or orientations) (so that it differs for each listener pose). Furthermore, acoustic environment data may include data for both static and dynamic properties. Specifically, acoustic environment data may be time-independent and may include data that describes at least time-invariant properties (values) in the time interval for which the data is provided (thus such properties may be static). Acoustic environment data may also include data that is time-dependent and may further include data that describes at least time-varying properties (values) in the time interval for which the data is provided (thus such properties may be dynamic). For at least one property, acoustic environment data may include data that describes the time variation of the property (value). The property may be one of the properties of sound sources, audio elements, and / or sound propagation in the environment.

[0081] The encoder further includes a bitstream generator 205 configured to generate a bitstream containing audio data and metadata (or metadata only in some embodiments). The bitstream may be generated to satisfy one or more aspects and characteristics of a particular data format, which will be described in detail later.

[0082] The bitstream generator 205 is coupled to an output processor 207 configured to output the generated bitstream. The output processor 207 may have functions necessary to communicate the bitstream according to the requirements of a particular application. For example, the output processor may have a network interface, wireless functionality, a WiFi circuit, or internet connectivity.

[0083] The decoder shown in Figure 3 includes a receiver or input processor 301 configured to receive a bitstream generated by the encoder. The input processor 301 may have functions necessary to receive the bitstream depending on the requirements of a particular application. For example, the input processor 301 may have a network interface, wireless functionality, a WiFi circuit, internet connectivity, etc. In many embodiments, the input processor 301 may have functions that complement the functions of the encoder's output processor 207.

[0084] The input processor 301 is coupled to an audio data processor 303 configured to extract audio data from a bitstream. Thus, the audio data processor 303 is configured to extract and often process audio data representing numerous audio elements of the environment. Typically, each audio element may correspond to one or more sound sources in the environment.

[0085] In the described example, the bitstream contains audio data describing the audio elements. In other embodiments, the audio data may not be included in the bitstream and may be received from another source, such as an internal source of the decoder / client device. For example, internal memory may store the audio data, and server 103 may provide additional metadata that can provide an enhanced experience, for example, by providing information that enhances the audio to present a dynamic animation of the sound source moving within the environment. In some embodiments, the input processor 301 may be configured to receive audio data from an external source different from the server, for example, the audio data may be received as part of a different bitstream provided by a server different from server 103 that provides the metadata.

[0086] The decoder further includes a metadata processor 305 configured to extract metadata from the bitstream. Thus, the metadata processor 305 is configured to extract and often process metadata of audio elements / environment. The metadata processor 305 may be configured to extract data and generate appropriate characteristic values for one, more, or all of the characteristics described by the metadata.

[0087] In this example, the decoder has a processor that processes audio data based on metadata and characteristic values. Specifically, the decoder may have a renderer 307 configured to render one or more audio elements based on the value of at least one characteristic among the characteristics represented by the metadata, where the value of the characteristic is determined from the metadata. Typically, the renderer 307 may be configured to render an audio scene corresponding to an environment by rendering one or more audio elements based on the metadata, for example, poses (and changes thereof) determined by the metadata, frequency-dependent decay based on metadata describing the environment and objects within the environment, and reverberation representing the reverberation characteristics of the environment described by the metadata.

[0088] Many algorithms, approaches, and techniques for rendering audio data based on environmental and contextual data are known to those skilled in the art (including, for example, HRTF rendering and reverberation modeling) and, for brevity and clarity, are not described in further detail here.

[0089] The encoder is configured to generate a bitstream, and the decoder is configured to receive, decode, and process the bitstream, according to a data format that includes one or more of the features and aspects described below. While the approach will be described with reference to a specific bitstream / format that may include most or all of the specific features and characteristics, it will be understood that in many embodiments, the generated bitstream may include only some of the features or approaches, for example. In fact, the multiple different functions and elements of the described bitstream are not necessarily used together, but are individual functions that can be used individually or combined in any way with other described functions.

[0090] In a concrete example, a metadata bitstream is generated that contains acoustic environment data structured as a series of consecutive datasets (a dataset may also be called a data frame). Each dataset contains data for a given time interval. The acoustic environment data may be structured so that data values for a given time interval are grouped together; specifically, all acoustic environment data applicable to only a single time interval may be grouped in a single dataset of the series, and that dataset does not contain acoustic environment data applicable only to other time intervals. To obtain all data specific to a given time interval, the decoder therefore only needs to decode that dataset. In some embodiments, acoustic environment data for a given time interval may also apply to other time intervals (for example, some static acoustic environment data may apply to all time intervals). Such multi-interval data may be contained within the dataset for one of the time intervals, within the datasets for two or more of the time intervals, within all datasets, or outside the datasets (for example, in a common dataset provided separately from the datasets for individual time intervals).

[0091] Therefore, a bitstream can contain datasets at multiple different time intervals. In the following, each dataset will also be referred to as a data point in the bitstream.

[0092] The metadata in the generated bitstream is often relatively static and may include metadata with relatively slow animation / changes in properties (e.g., position, orientation, signal gain). Therefore, in this example, the bitstream is not organized as frames with durations in the range of milliseconds, but rather as datasets or data points representing time intervals on a much longer time scale. The time intervals of the datasets / data points may be, for example, 0.5 seconds, 1 second, 2 seconds, or 5 seconds or more.

[0093] A bitstream may contain independent data points that contain enough data to begin decoding the bitstream, for example when decoding starts at a random access point or where multiple bitstreams are combined. The data in such independent data points is independent of the data in preceding data points. An example of a bitstream using this approach is shown in Figure 4.

[0094] The duration of the time interval represented by each dataset / data point may be flexible and may differ from one data point to the next. Typically, the duration may be only a few seconds. A data point may be defined relative to a timestamp, specifically the start, end, or midpoint of its time interval. The metadata of a data point may represent data about the properties of the acoustic environment for the time interval (e.g., data indicating specific characteristic values). For example, the value indicated may be the value at the end of the time interval (e.g., where it is taken over by the next data point, or where the scene ends). Furthermore, if the value changes over time within the time interval, data may be included that allows the characteristic value to be reconstructed at other points in time within the time interval. For example, the value within the time interval may be determined by interpolation.

[0095] In this specific example, the time interval may be represented by the data values of the bitstream. Similarly, data values within a time interval may be referenced and represented so as to apply to a particular time interval represented by a data field / value. Specifically, a value referenced as targetOffset[n] may be used, where n is the index.

[0096] This approach can support time-dependent changes, such as an animation starting, stopping, or changing speed within the time interval, using several methods:
- Waypoints
- Update data points

[0097] If a metadata field is animated and the animation needs to be updated multiple times within the time interval of a data point, multiple waypoints can be included by including multiple targetOffsets within that time interval. For example:
- TargetOffset[0]=48000
- TargetOffset[1]=20000
- TargetOffset[2]=38643

[0098] A data point may contain target values for one or more properties, each linked to a specific point in time within the time interval. This point in time is indicated by a time indication, specifically in the form of a targetOffset field. Multiple waypoints can be generated, each providing one or more target values paired with a targetOffset reference; values between waypoints are then obtained by interpolation. Figure 5 illustrates how this can be used to provide varying parameter values.
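
The waypoint mechanism can be illustrated with a minimal Python sketch. It assumes linear interpolation, offsets counted from the start of the data point, and a known value at offset 0. The function name `value_at` and the tuple layout are hypothetical and not part of the described syntax.

```python
from bisect import bisect_left

def value_at(t, start_value, waypoints):
    """Linearly interpolate a property value at time offset t.

    waypoints: list of (target_offset, target_value) pairs. They are
    sorted here because the order in the bitstream is not assumed sorted.
    start_value: value at offset 0 (assumed known from earlier data).
    """
    points = [(0, start_value)] + sorted(waypoints)
    offsets = [p[0] for p in points]
    if t <= 0:
        return start_value
    if t >= offsets[-1]:
        return points[-1][1]  # hold the last target after the final waypoint
    i = bisect_left(offsets, t)  # first waypoint at or after t
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Offsets taken from the example above; the target values are made up.
waypoints = [(48000, 4.0), (20000, 1.0), (38643, 2.0)]
halfway_to_first = value_at(10000, 0.0, waypoints)
```

Sorting inside the function is a design choice of this sketch; a decoder could equally require waypoints to arrive in time order.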

[0099] Alternatively, or additionally, dynamic changes may be supported by multiple update data points, which may be multiple data points that are not independent of each other. An update data point may not provide enough data to enable a complete rendering of its time interval and may rely on data located elsewhere in the bitstream. An update data point may contain only a subset of acoustic environment data that describes the characteristics of the time interval represented by the update data point. In some embodiments, an update data point may always be linked to at least one independent data point.

[0100] An update data point may, in some cases, contain only data relating to time-varying characteristics (data that changes during a time interval). An update data point may also have associated data indicating the time interval to which the update data point applies, such as a start time, end time, or midpoint. An update data point may have a maximum time interval duration that does not exceed the next independent data point.
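
The relationship between independent and update data points can be sketched as a state merge on the decoder side. This is only an illustration under assumptions: the dictionary layout (`independent`, `fields`) and the function name `apply_data_point` are hypothetical and do not reflect the actual bitstream syntax.

```python
def apply_data_point(state, data_point):
    """Return the decoder state after processing one data point.

    data_point: {"independent": bool, "fields": {name: value}}
    (hypothetical layout used only for this sketch).
    """
    if data_point["independent"]:
        # An independent data point carries everything needed: start afresh.
        return dict(data_point["fields"])
    if state is None:
        raise ValueError("update data point received before any independent data point")
    merged = dict(state)
    merged.update(data_point["fields"])  # only the transmitted subset is overwritten
    return merged

state = apply_data_point(None, {"independent": True,
                                "fields": {"position": (0, 0, 0), "gain": 1.0}})
state_after_update = apply_data_point(state, {"independent": False,
                                              "fields": {"position": (1, 0, 0)}})
```

The update changes only the position, and the gain is carried over from the independent data point.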

[0101] The advantage of update data points can be that they are useful for live streaming of scene elements (e.g., other users' movements and actions) where the corresponding fields are transmitted at a higher rate than independent data points.

[0102] Therefore, a bitstream may include multiple consecutive datasets / data points containing one or more characteristic values for environmental properties that affect sound propagation. Furthermore, it may include time indications (e.g., targetOffset) to indicate when a particular characteristic value is appropriate. In some embodiments, a dataset may include two or more characteristic values for one or more properties. For variable properties, the dataset may include time indications to show when multiple characteristic values should be applied.

[0103] In some embodiments, time indications may be included only if they differ from the default time within a time interval. For example, a characteristic value is indicated for the end of a time interval by default, and a characteristic value provided at the end of the time interval may not have a specific time indication. However, a characteristic value provided for a different time within the time interval may be associated with a time indication.

[0104] Therefore, in some embodiments, the bitstream generator 205 is configured to determine whether the characteristic value of a given property is provided for the default time within the time interval represented by the data point. If so, the characteristic value is included in the data point without a time indication; otherwise, the characteristic value is included together with a time indication.
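
A minimal sketch of this rule, assuming the default time is the end of the time interval. The field names `hasTime` and `targetOffset` in the dictionaries are placeholders for whatever the real syntax uses, and the dictionaries merely stand in for coded fields.

```python
def encode_value(value, time_offset, interval_end):
    """Return a dict standing in for the coded fields of one characteristic value."""
    if time_offset == interval_end:
        # Default time (end of interval): no time indication is written.
        return {"hasTime": False, "value": value}
    return {"hasTime": True, "targetOffset": time_offset, "value": value}

def decode_value(fields, interval_end):
    """Recover (time_offset, value), applying the default time when none was sent."""
    offset = fields["targetOffset"] if fields["hasTime"] else interval_end
    return offset, fields["value"]

at_default = encode_value(0.8, 48000, 48000)   # no time indication needed
mid_interval = encode_value(0.5, 20000, 48000)  # carries an explicit offset
```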

[0105] The bitstream may be generated to include both datasets that are independent and do not require additional data for rendering, and datasets that are not independent and require acoustic environment data from other datasets to enable complete rendering of the acoustic environment. Thus, the acoustic environment data may include several complete rendering datasets that contain all the data necessary to render the environment's audio, and several partial rendering datasets that rely on and require additional data from other datasets to render the environment's audio.

[0106] Given that data points are independent and can cover gaps of up to several seconds, it may be useful to include both the start and end points for the time interval of a data point (e.g., indicating the start position at the data point and the end position of the interpolation interval). This is optional, as in many cases it is acceptable to specify only the end point; any deviation can then occur only for the duration of the interval of the first data point that is decoded when the decoder joins the stream.

[0107] In many cases, the endpoint of the source pose may be considered the most important information, and this may include position and orientation data. In addition, an optional starting position / orientation may be provided. To save bitrate through more efficient coding, this may be differentially coded with respect to the endpoint. For non-interpolable data (e.g., flags, identifiers, indices), there may be an offset where the change occurs instantaneously, and the initial value of the data point may be indicated (and therefore not differential). In the case of flags, the initial value may not be indicated, as it is considered to be the opposite of the value following the provided indication.

[0108] An example of this approach in a JSON structure is as follows:
- ObjectSource[]
  · ID
  · SignalIndex
  · PreGain
    • PreGainInterpMethod (optional)
    • PreGainInterpLengthIdx (optional)
    • PreGainDelta (optional): % Presence indicates that there is animation within that data point.
  · Position
    • PositionInterpMethod (optional)
    • PositionInterpLengthIdx (optional)
    • PositionDelta (optional): % Presence indicates that there is animation within that data point.
  · Orientation
    • OrientationInterpMethod (optional)
    • OrientationInterpLengthIdx (optional)
    • OrientationDelta (optional): % Presence indicates that there is animation within that data point.
  · Render
    • RenderUpdateOffsetIdx (optional): % Presence indicates that there is animation within that data point.
  · DirectivityID
    • DirectivityIDUpdateOffsetIdx (optional)
    • DirectivityIDStart (optional): % Presence indicates that there is animation within that data point.

[0109] Here, the endpoint is [Position, Orientation] and the starting point is [Position - PositionDelta, Orientation - OrientationDelta]. This means that the endpoint can be provided as either an absolute or relative value, allowing the starting point to be reconstructed as needed.
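
The start-point reconstruction can be sketched as follows. The sketch assumes that positions and orientations are plain numeric tuples (for example, orientation as three angles) so that the delta can be subtracted component by component; the function name and argument layout are hypothetical.

```python
def start_point(position, orientation, position_delta=None, orientation_delta=None):
    """Reconstruct the start pose of a data point from its end pose and optional deltas.

    A missing delta means the property is not animated, so start == end.
    """
    def minus(end, delta):
        if delta is None:
            return tuple(end)
        return tuple(e - d for e, d in zip(end, delta))

    return minus(position, position_delta), minus(orientation, orientation_delta)

# Only the position is animated in this made-up example.
start_pos, start_ori = start_point((1.0, 2.0, 3.0), (90.0, 0.0, 0.0),
                                   position_delta=(0.5, 0.0, -1.0))
```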

[0110] For example, various interpolation methods can be indicated by a field such as InterpMethod, which defines one of the following:
- 'linear' (default)
- 'instant' (the value changes instantaneously at the target timestamp, without interpolation; useful for changing, for example, SignalIndex, ExtentID, or any Boolean field)
- 'spherical'
- 'logarithmic'
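
The text does not define the exact formulas behind these method names, so the following Python sketch shows one plausible reading under assumptions: 'logarithmic' is taken as linear interpolation in the log domain of strictly positive values, and 'spherical' as spherical linear interpolation (slerp) between unit vectors. The function names are hypothetical.

```python
import math

def interpolate(method, v0, v1, u):
    """Interpolate a scalar from v0 (u = 0) to v1 (u = 1) using the named method."""
    if method == "instant":
        return v1 if u >= 1.0 else v0  # jump at the target time
    if method == "linear":
        return v0 + (v1 - v0) * u
    if method == "logarithmic":
        # Assumed meaning: linear in the log domain; needs strictly positive values.
        return math.exp(math.log(v0) + (math.log(v1) - math.log(v0)) * u)
    raise ValueError(f"unsupported method: {method}")

def slerp(a, b, u):
    """Assumed meaning of 'spherical': slerp between two unit vectors a and b.

    Antipodal inputs are not handled; the path between them is not unique.
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    if theta < 1e-9:
        return tuple(a)  # vectors coincide
    s = math.sin(theta)
    wa, wb = math.sin((1 - u) * theta) / s, math.sin(u * theta) / s
    return tuple(wa * x + wb * y for x, y in zip(a, b))

quarter = interpolate("linear", 0.0, 10.0, 0.25)
diagonal = slerp((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5)
```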

[0111] The endpoint target can be the starting point of the next independent data point, an arbitrary intermediate update data point, or an intermediate waypoint. Multiple potential targets may be enumerated within the general part of the bitstream, and other parts of the bitstream may efficiently reference them to indicate a suitable target for the data covered by the other parts. The target may also be the target value / intended value at a given point in time.

[0112] The following describes an example of the data format and syntax for a single data point / dataset. The description follows the approach used to describe bitstream formats in MPEG Audio standards such as MPEG-H 3D Audio (ISO / IEC 23008-3). The syntax description is structured as pseudocode, and a function call indicates that the syntax described under that function is inserted where the call is made. Fields occupying bits in the bitstream are shown in bold, and the second and third columns use mnemonics to indicate the number of bits and the bit format. Some fields may have a variable number of bits, depending on the value they represent. These fields are associated with a lookup table that lists the codewords and their corresponding values. The codewords are designed so that, when reading bit by bit, a shorter codeword never matches the leading bits of a longer codeword (i.e., the code is prefix-free). This is known to those skilled in the art as lossless data coding or entropy coding, for example, Huffman coding, run-length coding, or arithmetic coding. Data read earlier in the syntax may be used later in the syntax to control the decoding of subsequent parts, for example by signaling the number of bits, the number of elements, the coding method, or the presence of specific data.
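
Reading a variable-length field with such a lookup table can be sketched as below. The table contents are invented for illustration; only the bit-by-bit matching against a prefix-free table follows the description above.

```python
def decode_vlc(bits, table):
    """Decode one value from an iterator of bits using a prefix-free codeword table.

    table maps codeword strings such as '10' to values (hypothetical contents).
    Because no codeword is a prefix of another, the first match is the only match.
    """
    code = ""
    for bit in bits:
        code += "1" if bit else "0"
        if code in table:
            return table[code]
    raise ValueError("bitstream ended inside a codeword")

# Hypothetical table: shorter codewords for the more frequent values.
table = {"0": 0, "10": 1, "110": 2, "111": 3}
stream = iter([1, 1, 0, 0, 1, 0])
decoded = [decode_vlc(stream, table) for _ in range(3)]
```

Passing an iterator lets consecutive calls continue where the previous codeword ended.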

[0113] The following specific mnemonics are used in the description (quoted from ISO / IEC 23008-3, MPEG-H 3D Audio specification):
- bslbf - Bit string, left bit first. "Left" refers to the order in which bit strings are written according to ISO / IEC 14496. Bit strings are written as strings of 1s and 0s enclosed in single quotes, for example, '1000 0001'. Whitespace within bit strings is for readability and has no particular meaning.
  • Used for boolean
- tcimsbf - Two's complement integer, most significant (sign) bit first.
- uimsbf - Unsigned integer, most significant bit first.
  • Used for uint
- vlclbf - Variable length code, left bit first. "Left" refers to the order in which the variable length code is written.
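
The bit formats can be illustrated with a small MSB-first bit reader. This is a sketch for orientation only: the class `BitReader` and its method names simply mirror the mnemonics and are not an API of any MPEG reference software.

```python
class BitReader:
    """Reads fields MSB-first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, in bits

    def uimsbf(self, n):
        """Unsigned integer, most significant bit first."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value

    def tcimsbf(self, n):
        """Two's complement integer, most significant (sign) bit first (n >= 1)."""
        raw = self.uimsbf(n)
        return raw - (1 << n) if raw & (1 << (n - 1)) else raw

    def bslbf(self, n):
        """Bit string, left bit first, returned as text such as '1000' (n >= 1)."""
        return format(self.uimsbf(n), f"0{n}b")

reader = BitReader(bytes([0b10000001, 0b11110000]))
fields = (reader.bslbf(4), reader.uimsbf(4), reader.tcimsbf(4), reader.uimsbf(4))
```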

[0114] Datasets / points can be provided according to the following format / syntax: DataPoint() [Table 1]

[0115] The following sections will describe in more detail the various fields / parts / elements of such data points, but it will be understood that the fields / parts / elements described are not limited to specific data points or syntax, and may be used individually or in any combination within any bitstream containing acoustic environment data.

[0116] Further explanation of the functions / fields / sub-elements of data points.

[0117] dpType Indicates the type of the data point. [Table 2]

[0118] reservedBits Reserved bits. These can be used to introduce extension mechanisms.

[0119] Therefore, a data point can begin with two bits that indicate a specific type of data point. Specifically, these two bits indicate whether the data point is an independent data point or an update point.

[0120] GeneralData() The GeneralData() field may provide data giving information on several general configuration or rendering parameters. [Table 3]

[0121] bsVersion Bitstream version. [Table 4]

[0122] isSmallScene If set to true, this indicates that no coordinates exist outside the range of [-100..100] meters.

[0123] In some embodiments, the metadata generator 203 is configured to generate acoustic environment data that includes a global indicator indicating that the environment is a spatially constrained environment. For example, the global indicator may be in the form of isSmallScene, which can be set to true or false. When isSmallScene is set to true, the scene is restricted to coordinates within the range of [-100..100] meters.

[0124] When a global indicator is set to represent a spatially constrained environment, many data values are restricted to conform to a predetermined restricted format for data values. Specifically, spatial coordinates (specifically position values) may be restricted so as not to exceed a threshold, such as being limited to the interval of [-100..100] meters. In some embodiments, when the global indicator is set to represent a spatially constrained environment, other parameters may be restricted to a limited range. For example, in the case of a small environment, the maximum duration or time constant of the reverberation (e.g., the T60 value) may be limited.

[0125] A spatially constrained environment may be an environment with a spatial extent below a threshold. A given constrained format may be a data format for at least some values of the properties affecting sound propagation, using fewer bits than the data format used for the bitstream when the environment is not spatially constrained.
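
Why a restricted coordinate range permits a shorter format can be shown with a small calculation. The sketch assumes uniform quantization at a fixed step; the 1 cm resolution and the 10 km range used for comparison are illustrative assumptions, not values from the described format.

```python
import math

def bits_needed(max_abs_m, resolution_m):
    """Bits for a uniform grid covering [-max_abs_m, +max_abs_m] at the given step."""
    levels = round(2 * max_abs_m / resolution_m) + 1
    return math.ceil(math.log2(levels))

def quantize(coord_m, max_abs_m, resolution_m):
    """Map a coordinate to an unsigned index; reject values outside the range."""
    if abs(coord_m) > max_abs_m:
        raise ValueError("coordinate outside the range allowed by the scene flag")
    return round((coord_m + max_abs_m) / resolution_m)

def dequantize(index, max_abs_m, resolution_m):
    """Inverse of quantize, up to the quantization error."""
    return index * resolution_m - max_abs_m

small_scene_bits = bits_needed(100, 0.01)    # range implied by isSmallScene
large_scene_bits = bits_needed(10000, 0.01)  # hypothetical unrestricted range
```

Under these assumptions the restricted range needs fewer bits per coordinate at the same resolution, which is the kind of saving the constrained format is meant to capture.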

[0126] In some embodiments, a global flag or indicator may be included that indicates a limit on one or more parameter values.

[0127] usacSamplingFrequencyIndex An index of the sample rate used for audio signals, based on the definition in ISO / IEC 23003-3. [Table 5]

[0128] usacSamplingFrequency When usacSamplingFrequencyIndex=0, the decoder's output sampling frequency is coded as an unsigned integer value.

[0129] useDefaultSpeedOfSound A flag indicating whether to use the default speed of sound (343 m / s at 20°C) for media without material, or to specify a custom value.

[0130] speedOfSound A custom sound velocity value for materialless media. Materialless media are considered to be spaces not occupied by geometry to which different materials are assigned.

[0131] TargetData() The field TargetData() may contain the target offset for the attribute's animation, which is referenced by index from other parts of the data point. [Table 6]

[0132] FreqGridData() The field may contain a frequency grid definition that is referenced by an index from other parts of the data point. This typically informs some bitstream parsers of the number of frequency-dependent elements to be coded next.

[0133] The field FreqGridData() may provide data that provides information about the frequency grid definition, which is referenced by index from other parts of the data point. Specifically, it may describe a frequency grid / subdivision into multiple frequency ranges, and other parameters / other data of the bitstream may be provided according to the defined frequency grid. For example, different frequency-dependent filtering can be provided by indicating attenuation for multiple different ranges of the defined frequency of the frequency grid. Thus, the filter can be simply described as a set of attenuation values without having to explicitly describe the frequencies corresponding to these attenuation values.
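
A short sketch of how a filter coded as a bare list of attenuation values can be interpreted against a shared grid. The nearest-band lookup on a logarithmic scale is an assumption of this sketch, as are the function names and the example values.

```python
import math

def band_index(freq_hz, center_freqs_hz):
    """Index of the grid band whose center is nearest to freq_hz on a log scale."""
    return min(range(len(center_freqs_hz)),
               key=lambda i: abs(math.log(freq_hz / center_freqs_hz[i])))

def attenuation_db(freq_hz, grid, values_db):
    """Look up a frequency-dependent value coded against a shared frequency grid."""
    if len(values_db) != len(grid):
        raise ValueError("one value per grid band is expected")
    return values_db[band_index(freq_hz, grid)]

grid = [125.0, 250.0, 500.0, 1000.0]    # defined once, referenced by index
wall_filter_db = [-1.0, -2.0, -4.0, -8.0]  # no frequencies repeated here
at_300_hz = attenuation_db(300.0, grid, wall_filter_db)
```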

[0134] In some embodiments, the acoustic environment data may be configured to include one or more data groups / data fields describing a frequency grid, and multiple data groups / data fields each containing data that describes a frequency-dependent characteristic of a property using the frequency grid. This can be done, for example, by providing data values for one or more of the frequency ranges and / or frequency values defined by the frequency grid.

[0135] A frequency grid may be a subdivision of a frequency range into frequency subranges, for example, by defining the center frequencies of the frequency subranges.

[0136] In some embodiments, the bitstream may include an indicator indicating whether the bitstream contains a data group describing a frequency grid. For example, a single bit, bFgdPresent, may indicate whether a frequency grid definition is included (e.g., in the following bits).

[0137] In some embodiments, the bitstream may include an indication of the format of the data describing the frequency grid. For example, the bitstream may include an indication of whether the frequency grid is described by a reference to a predefined grid, for example, by indicating the index of the grid among a given / predefined set of frequency grids. As another example, the bitstream may include data indicating the start frequency and frequency range for at least some subranges of the frequency grid, and typically for all subranges of the frequency grid. In some embodiments, the frequency grid may be indicated by a set of transition frequencies indicating the boundary frequencies of frequency ranges / frequency intervals. For example, a first frequency interval / range may be indicated by data indicating the start frequency of the first frequency interval / range and the start frequency of the next frequency interval / range.
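
When a grid is given as transition frequencies, finding the band for a frequency reduces to a search over the boundaries. The sketch below assumes ascending boundaries and half-open bands; the function name and example boundaries are hypothetical.

```python
from bisect import bisect_right

def band_of(freq_hz, transitions_hz):
    """Return the index of the band containing freq_hz, or None if outside the grid.

    transitions_hz: ascending boundary frequencies; band i spans
    [transitions_hz[i], transitions_hz[i + 1]).
    """
    if not transitions_hz[0] <= freq_hz < transitions_hz[-1]:
        return None
    return bisect_right(transitions_hz, freq_hz) - 1

transitions = [88.0, 177.0, 355.0, 710.0]  # three bands, made-up boundaries
band_for_100_hz = band_of(100.0, transitions)
```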

[0138] In many embodiments, the frequency range may be constant in the logarithmic representation of the frequency scale.

[0139] Frequency banding and division into ranges / intervals may be based on octaves. A difference of one octave represents doubling the frequency (e.g., 125, 250, 500, 1000 Hz). The bitstream indicates, for example, whether there is banding in an octave band or another subdivision (e.g., a 1 / 3 octave band with two more values between the octave bands (125, 160, 200, 250, 315, 400, 500, 630, 800, 1000)). In some embodiments, data aligned to fractions of octave bands may indicate at least one subrange or individual frequency.
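The octave and fractional-octave banding described above can be sketched as follows. This is only an illustrative sketch: the function name and parameters are hypothetical and not part of the bitstream syntax. It computes exact center frequencies with a constant frequency ratio between bands; the values quoted in the text (160, 200, 315, ...) are the conventional rounded labels of these exact values.

```python
def fractional_octave_centers(start_hz, bands_per_octave, count):
    """Return `count` exact center frequencies starting at `start_hz`.

    Hypothetical helper: bands are equally spaced on a logarithmic
    frequency scale, with `bands_per_octave` bands per doubling.
    """
    hop = 2.0 ** (1.0 / bands_per_octave)  # constant ratio between bands
    return [start_hz * hop ** k for k in range(count)]


# Octave bands: each band doubles the frequency.
octave = fractional_octave_centers(125.0, 1, 4)
print([round(f) for f in octave])  # [125, 250, 500, 1000]

# 1/3-octave bands: two additional values between neighbouring octave bands.
third = fractional_octave_centers(125.0, 3, 10)
print([round(f, 1) for f in third])
```

In the 1/3-octave case the second exact value is about 157.5 Hz, which corresponds to the nominal label 160 Hz used in the text.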

[0140] In some embodiments, the bitstream may include data indicating individual frequencies, for example, a set of multiple individual frequencies. In this case, other frequency-dependent characteristics within the bitstream may simply be represented by a set of characteristic values for these individual frequencies, without the need to explicitly describe each of these individual frequencies for each characteristic.
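As a minimal sketch of this idea, the following example defines frequency grids once and lets a frequency-dependent characteristic carry only a grid index and a list of values. The container layout, the function name and the numeric values are assumptions made for illustration; they are not taken from the syntax tables.

```python
# Hypothetical decoded grids: each grid is defined once and referenced by index.
freq_grids = [
    [125.0, 250.0, 500.0, 1000.0],  # grid 0: octave bands (Hz)
    [100.0, 1000.0, 10000.0],       # grid 1: individual frequencies (Hz)
]


def expand_filter(freq_grid_idx, attenuations_db, grids=freq_grids):
    """Pair each attenuation value with the frequency it implicitly refers to."""
    grid = grids[freq_grid_idx]
    if len(attenuations_db) != len(grid):
        raise ValueError("one attenuation value is expected per grid frequency")
    return list(zip(grid, attenuations_db))


# The characteristic itself carries only the index and the values.
pairs = expand_filter(0, [0.0, -1.5, -3.0, -6.0])
print(pairs[2])  # (500.0, -3.0)
```

The frequencies are transmitted once with the grid, so each further characteristic costs only its values plus a grid index.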

[0141] In many embodiments, the frequency grid may be described / defined by bitstream data. For example, different modes of the frequency grid may be used, and the data may indicate the mode used to describe the frequency grid.

[0142] For example, the field fgdMethod may indicate which of the following modes will be used:
- Default grid. For example, it is aligned to the fraction of an octave band.
- Starting frequency + Frequency hop size + Band size. For example, it is aligned to the fraction of an octave band.
- Individual frequencies. For example, it is aligned to the fraction of an octave band.

[0143] Examples of frequency grid formats / syntax are as follows: [Table 7]

[0144] bFgdPresent A flag indicating whether a frequency grid is defined.

[0145] fgdMethod The way in which the frequency grid is coded. [Table 8]

[0146] fgdCenterFreq The center frequencies of each band in each frequency grid are shown in Hz.

[0147] frequencyHopCode A code showing the hop coefficient for frequency banding. [Table 9]

[0148] fgdDefaultBanding defines one or several default banding schemes. [Table 10]

[0149] Therefore, in some embodiments, the bitstream is generated to include a data group describing a frequency grid used in other data groups to describe the frequency-dependent characteristics of acoustic environment data. Thus, the frequency grid is provided within the datastream and used to describe frequency variations. Furthermore, the bitstream includes a specific indicator indicating whether the bitstream contains a description of the frequency grid. Thus, a flexible approach is provided in which bitstream generation can be adapted to use an optional and customizable frequency grid that can be optimized for the specific characteristics and frequency dependencies being communicated. As a simple example, the frequency resolution or frequency range may be flexibly adaptable, and in practice, the entire frequency grid description may be optional and omitted by appropriately setting an indicator of whether or not the description is included.

[0150] Furthermore, the data group may be configured to not only include a description of the frequency grid, but also to specifically indicate the format used for describing the frequency grid. Thus, an indication may be provided that enables the receiver to interpret the frequency grid description data, thereby allowing the bitstream generator to freely select a format for describing the frequency grid that is particularly suitable for specific characteristics and the specific frequency dependence of those characteristics.

[0151] A data group describing a frequency grid may include a predetermined default grid, the starting frequencies and frequency ranges of at least a portion of the frequency grid's sub-ranges, and / or data indicating individual frequencies.

[0152] Therefore, this approach makes it possible to include frequency-dependent characteristics in the bitstream by using a specific, narrow approach to encode and represent such acoustic environment data within the bitstream. This approach can provide a highly efficient, adaptable, and flexible approach to including the frequency-dependent acoustic environment characteristics described above in the bitstream.

[0153] AcousticElementData() This part of the syntax covers the data describing acoustic elements. These are typically generic elements that function as hierarchical elements in the scene graph. Typically, the properties of such elements are inherited by their child nodes; for example, the pre-gain value of an acoustic element also applies to sound sources organized under the corresponding acoustic element. The position and orientation of these nodes transform the position and orientation of their child nodes, unless the node has a pose dependency set to ignore the parent's orientation or pose.

[0154] In some embodiments, the acoustic environment data includes animation indications for at least one sound element. An animation indication indicates whether the characteristics of the sound element change during a given time interval, and thus whether the characteristics are dynamic or static. Specifically, for a given data point / group, a flag or indication may be included that indicates whether the sound element is static or time-varying within the time interval of that data point / group.

[0155] If animation indications show that the corresponding audio changes over time, the acoustic environment data may further include data describing the changes in the changing properties. Such acoustic environment data for an audio element may include data indicating whether at least one property of the audio element is a time-varying property. Therefore, it may include indications about properties that show which properties change over time and which do not (within the time interval of the dataset / point).

[0156] In many embodiments, the acoustic environment data may include data describing how characteristic values are determined at different times within a time interval; specifically, the acoustic environment data may include data describing interpolation approaches applied to determine characteristic values at different times.

[0157] Acoustic environment data may contain two or more values for a particular time-varying characteristic, each value potentially representing a specific point in time. Characteristic values for other times may be determined by interpolation. Encoded audio data streams, particularly individual data points, may contain data describing the characteristics of the temporal interpolation to be performed for the time-varying characteristics.
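A minimal sketch of such temporal interpolation is given below. It assumes linear interpolation between two timed values, which is only one possible method; the function name and argument layout are hypothetical, and the interpolation codes actually signalled in the bitstream are not reproduced here.

```python
def interpolate_linear(t, t0, v0, t1, v1):
    """Linearly interpolate a characteristic value at time `t`.

    (t0, v0) and (t1, v1) are two timed values of the same characteristic.
    Times outside [t0, t1] are clamped to the nearest given value.
    """
    if t1 == t0:
        return v1
    a = (t - t0) / (t1 - t0)
    a = min(max(a, 0.0), 1.0)  # clamp to the interval
    return v0 + a * (v1 - v0)


# Pre-gain moving from -6 dB at t = 0 s to 0 dB at t = 1 s.
print(interpolate_linear(0.5, 0.0, -6.0, 1.0, 0.0))  # -3.0
```

Only the two end values and the interpolation method need to be transmitted; the renderer derives the intermediate values itself.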

[0158] In some cases, one or more of the values used for interpolation within a data point's time interval may be provided outside the time interval / data point, or determined from provided values. For example, one or more values may be derived from data in another data point, which may have been transmitted earlier, for example, within a bitstream. Such data values may be associated with a specific point in time (e.g., indicated by a timestamp) that may be before or after the time interval of the data point to which interpolation is applied (and for which interpolation may be defined).

[0159] In some embodiments, different interpolation may be described for each data point, and therefore different interpolation may be applied for each time interval. Furthermore, different interpolation may be described for different characteristics (including different characteristics for the same data point / time interval).

[0160] The interpolation method can be indicated, for example, by a code represented by the flag *InterpMethod. The code may be as shown in Table 1 below. [Table 11]

[0161] In some embodiments, the acoustic environment data for at least some elements of the environment may include element identification data and parent identification data for the environment's scene graph. These at least some elements may be, for example, objects, sound sources (sound elements), and / or acoustic properties of the environment.

[0162] For example, multiple different objects and / or sound sources may be placed within the scene graph, and the acoustic environment data may include data that represents this scene graph. This can be achieved specifically by providing identification data and parent identification information for individual elements, because this allows for the reconstruction of the scene graph, and thus it can be configured to represent a hierarchy.
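The reconstruction of the hierarchy from element and parent identifiers can be sketched as follows. The dictionary keys and function names are hypothetical, and a parent ID of `None` is assumed here to mark the root of the scene graph.

```python
from collections import defaultdict


def build_children(elements):
    """Map each parent ID to the IDs of its direct children."""
    children = defaultdict(list)
    for element in elements:
        children[element["parent_id"]].append(element["id"])
    return dict(children)


def path_to_root(element_id, parents):
    """Return the chain of IDs from an element up to the root."""
    path = [element_id]
    seen = {element_id}
    while parents.get(path[-1]) is not None:
        parent = parents[path[-1]]
        if parent in seen:
            raise ValueError("cycle in parent references")
        path.append(parent)
        seen.add(parent)
    return path


# Hypothetical decoded (ID, parent ID) pairs.
elements = [
    {"id": 1, "parent_id": None},
    {"id": 2, "parent_id": 1},
    {"id": 3, "parent_id": 2},
    {"id": 4, "parent_id": 1},
]
parents = {e["id"]: e["parent_id"] for e in elements}

print(build_children(elements))  # {None: [1], 1: [2, 4], 2: [3]}
print(path_to_root(3, parents))  # [3, 2, 1]
```

The path to the root is what a renderer would walk to apply inherited properties, such as parent poses, to a child node.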

[0163] In some embodiments, the animation of elements may be shown (or possibly described) within the acoustic environment data.

[0164] For example, acoustic environment data can include ElementAnimated, SourceAnimated, and AttributeAnimated flags / indications to indicate whether each element / source is animated (within a time segment). In that case, a flag / indication can be included for each attribute / characteristic to indicate whether that attribute / characteristic is animated. In that case, further data describing the animation / temporal changes may be included.

[0165] As an example of the above functionality, the following format may be used. [Table 12]

[0166] aedPresent A flag indicating whether or not an acoustic element is defined.

[0167] aedNrElements It notifies the number of defined acoustic elements.

[0168] ElementAnimated A flag indicating whether the corresponding element at that data point is animated or not.

[0169] Animated A flag indicating whether the corresponding attribute is animated at that data point. It is also used to indicate whether data exists for more waypoints.

[0170] aedID ID of the acoustic element.

[0171] aedParentID The ID of the parent of the sound element in the scene graph.

[0172] aedPoseDependency This indicates what the pose of the sound element is relative to.

[0173] aedPosition The position coordinates (x, y, z) of the acoustic elements are shown in meters. If the same element appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0174] aedPositionInterpMethod The interpolation methods used for positional animation are shown (see Table 1).

[0175] aedPositionInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding position target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0176] aedPositionDelta A position delta value that allows for the reconstruction of the position value at the timestamp of a data point: PositionAtDPStart = aedPosition - aedPositionDelta. If multiple target values exist, the first target value will be used as the baseline.

[0177] aedOrientation The orientation of the acoustic element (yaw, pitch, roll) is shown in radians. If the same element appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0178] aedOrientationInterpMethod The interpolation method used for orientation animation is shown (see Table 1).

[0179] aedOrientationInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding orientation target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0180] aedOrientationDelta An orientation delta value that allows for the reconstruction of the orientation value at the timestamp of a data point: OrientationAtDPStart = aedOrientation - aedOrientationDelta. If multiple target values exist, the first target value will be used as the baseline.

[0181] aedPregain The pre-gain values (dB) of all sources hierarchically arranged below the corresponding acoustic elements. If the same element appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0182] aedPregainInterpMethod The interpolation methods used for pre-gain animation are shown (see Table 1).

[0183] aedPregainInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding pre-gain target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0184] aedPregainDelta A pre-gain delta value that allows for the reconstruction of the pre-gain value at the timestamp of a data point: PregainAtDPStart = aedPregain - aedPregainDelta. If multiple target values exist, the first target value will be used as the baseline.

[0185] aedRender Rendering flag. If false, it indicates that all sources hierarchically placed under the corresponding acoustic element will not be rendered.

[0186] aedRenderUpdateTargetIdx The index to TargetOffset. Indicates the offset from the data point's timestamp, which causes the rendering flag to invert to the preceding aedRender value. Multiple target indices may be provided, such that there is one target index for each waypoint within the range of data points. Each target index represents a binary inversion of the flag state.
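The delta fields above allow the value at the start of a data point to be recovered from the transmitted target values. The following is a minimal sketch of that subtraction, with the first target value used as the baseline when several waypoints exist. The function name and the tuple representation are assumptions made for illustration.

```python
def value_at_dp_start(targets, delta):
    """Reconstruct the value at the data point timestamp.

    `targets` holds the transmitted target values in order (one per
    waypoint); the first one is the baseline. Each value and `delta`
    is a tuple, e.g. (x, y, z) for a position or (gain,) for a pre-gain.
    """
    baseline = targets[0]
    return tuple(v - d for v, d in zip(baseline, delta))


# Position with two waypoints: PositionAtDPStart = aedPosition - aedPositionDelta
position_targets = [(2.0, 0.0, 1.0), (2.0, 4.0, 1.0)]
print(value_at_dp_start(position_targets, (2.0, 0.0, 0.0)))  # (0.0, 0.0, 1.0)

# Scalar pre-gain in dB: PregainAtDPStart = aedPregain - aedPregainDelta
print(value_at_dp_start([(-3.0,)], (3.0,)))  # (-6.0,)
```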

[0187] AudioSourceData() This part of the syntax collects various types of sound sources. [Table 13]

[0188] ObjectSourceData() This part of the syntax describes the characteristics of the object source. [Table 14]

[0189] osdPresent A flag indicating the presence or absence of an object source.

[0190] osdNrElements Indicates the number of subsequent object source elements.

[0191] sourceAnimated A flag indicating whether the corresponding source at that data point is animated or not.

[0192] aedID ID of the acoustic element.

[0193] aedParentID The ID of the parent of the sound element in the scene graph.

[0194] osdSignalIndex This indicates the index of the signal corresponding to the source within the signal input buffer.

[0195] osdIsContinuousSource A flag indicating whether it is associated with a continuous signal or with an acoustic effect triggered by the decoder.

[0196] osdReferenceDistance This indicates the reference distance to the object source.

[0197] osdPoseDependency This shows what the source's pose is relative to.

[0198] Animated A flag indicating whether the corresponding attribute is animated at that data point. It is also used to indicate whether data exists for more waypoints.

[0199] osdPosition The object source's position coordinates (x, y, z) are shown in meters. If the same source appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0200] osdPositionInterpMethod The interpolation methods used for positional animation are shown (see Table 1).

[0201] osdPositionInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding position target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0202] osdPositionDelta A position delta value that allows for the reconstruction of the object source position value at the timestamp of a data point: PositionAtDPStart = osdPosition - osdPositionDelta. If multiple target values exist, the first target value will be used as the baseline.

[0203] osdOrientation The orientation (yaw, pitch, roll) of the object source is shown in radians. If the same source appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0204] osdOrientationInterpMethod The interpolation method used for orientation animation is shown (see Table 1).

[0205] osdOrientationInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding orientation target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0206] osdOrientationDelta An orientation delta value that allows for the reconstruction of the orientation value of the object source at the timestamp of a data point: OrientationAtDPStart = osdOrientation - osdOrientationDelta. If multiple target values exist, the first target value will be used as the baseline.

[0207] osdPregain Pre-gain value (dB) of the object source. If the same element appears multiple times, this indicates that there are multiple waypoints within the range of the data point.

[0208] osdPregainInterpMethod The interpolation methods used for pre-gain animation are shown (see Table 1).

[0209] osdPregainInterpTargetIdx An index to TargetOffset. Indicates the offset from the timestamp of the data point for which the preceding pre-gain target value is valid. Multiple target indexes may be provided such that there is one target index for each waypoint within the range of data points.

[0210] osdPregainDelta A pre-gain delta value that allows for the reconstruction of the object source's pre-gain value at the data point's timestamp: PregainAtDPStart = osdPregain - osdPregainDelta. If multiple target values exist, the first target value will be used as the baseline.

[0211] osdRender Rendering flag. If false, it indicates that the object source should not be rendered.

[0212] osdRenderUpdateTargetIdx The index to TargetOffset. Indicates the offset from the timestamp of the data point where the rendering flag is reversed to the preceding osdRender value. Multiple target indices may be provided, such that there is one target index for each waypoint within the range of data points. Each target index represents a binary inversion of the flag state.

[0213] osdIsOmniDirectional A flag indicating whether the source is omnidirectional.

[0214] osdDirectivityPatternID The ID of the directional pattern to be used.
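The rendering flag updates described above, where each target index represents a binary inversion of the flag state, can be sketched as follows. This is one possible reading only: it assumes that the rendering flag gives the state at the start of the data point and that each referenced offset toggles it. The function name and the use of seconds for offsets are hypothetical.

```python
def render_state_at(offset, initial_render, toggle_offsets):
    """Return the rendering flag at `offset` after the data point timestamp.

    Assumption: `initial_render` is the flag state at the data point
    timestamp, and every entry in `toggle_offsets` inverts the state once.
    """
    toggles = sum(1 for t in toggle_offsets if t <= offset)
    return initial_render if toggles % 2 == 0 else not initial_render


# Source rendered at first, switched off at 0.5 s and on again at 1.5 s.
offsets = [0.5, 1.5]
print(render_state_at(0.0, True, offsets))  # True
print(render_state_at(1.0, True, offsets))  # False
print(render_state_at(2.0, True, offsets))  # True
```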

[0215] SoundEffectData() This part of the syntax describes the characteristics of the sound effects. These represent short segments of audio signals that are triggered by user interaction, etc., rather than continuously rendered audio signals.

[0216] In some embodiments, acoustic environment data can provide metadata associated with sound effects, clips, and other audio elements. In particular, the metadata may include data for sound effect audio elements associated with specific user actions or interactions. For example, an acoustic effect element may be provided for a door that creaks when it is opened. Such an acoustic effect may be associated with a user action so that the renderer can extract and render the sound effect of a creaking door when the user provides input corresponding to a door opening in the environment.

[0217] In some embodiments, the audio element may include a number of acoustic effect elements, and the acoustic environment data may include data that links an environment change controlled by the user to one (or more) of such acoustic effect elements. The environment change controlled by the user may be, for example, a user input or user interaction, or may be determined based on (or derived from) these.

[0218] The acoustic environment data can indicate changes for a specific user interaction, for example, by referring to a target ID (which identifies an element), target attributes, and / or corresponding target values. The list of these changes may be executed according to a trigger ID.

[0219] As a specific example, the start of rendering an acoustic effect may be indicated by using a SoundEffectID as an attribute of the sound source, and the acoustic effect ID of the acoustic effect to be played may be included as a target value.

[0220] The acoustic effect data can be provided, for example, according to the following syntax.

[Table 15]

[0221] sedPresent A flag indicating the presence or absence of acoustic effect data.

[0222] sedNrElements Indicates the number of acoustic effect elements defined next.

[0223] sedID The ID of the acoustic effect.

[0224] acdParentID [...] The duration of the sound effect.

[0226] sedPregain Pre-gain value for sound effects.

[0227] In some embodiments, acoustic environment data may include data associated with specific regions. For example, characteristic values may be provided along with indications of specific regions to which the characteristic values apply. For instance, with respect to a given sound source, a characteristic (e.g., frequency-dependent attenuation) may depend on whether the listener is in the same room as the sound source or whether the user is in a different room. Thus, the characteristic may have indications of linked regions within which the listening position must lie for the value to be valid. If the listening position is outside the application zone, the characteristic cannot be used properly. Instead, for example, another characteristic value with a different scope of application, including the current listening position, may be included.

[0228] In many embodiments, metadata may include multiple linked regions for one or more characteristics, specifically including a first application region and a second application region. The first region may be provided for the location of the audio element / sound source, and the second region may be provided for the listening location. Thus, a characteristic value may be associated with two regions, one relating to the listening location and the other to the sound source location. In this case, the decoder / renderer evaluates whether the listening location and sound source location are within the appropriate regions. If not, a different characteristic value may be used, for example, a default value, a value associated with other validity regions including the listening location and sound source location, or an alternative characteristic value provided as a substitute for the original value and indicated for use when the validity region is not met.

[0229] As a concrete example, a bitstream can contain fields / data indicated by `applicableRegion` and `internalSourceRegion`, which represent the effective regions for the listening position and the sound source / audio element position, respectively.

[0230] For example, a bitstream may contain data for a first characteristic value that depends on the listening position and whether the listening position is within the region indicated by the value of the data field `applicableRegion`. In this case, this value is further directly applicable to sound sources within the region indicated by the value in the data field `internalSourceRegion`, and a different characteristic value is applied if the source is outside the region indicated by the data field `internalSourceRegion`.
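The region-dependent selection of a characteristic value can be sketched as follows. The sketch assumes that each region is an axis-aligned box given by two opposite corners and that a default value is used when no entry matches; the dictionary keys, function names and numeric values are hypothetical.

```python
def in_box(point, corner1, corner2):
    """True if `point` lies inside the axis-aligned box spanned by two corners."""
    return all(min(a, b) <= p <= max(a, b)
               for p, a, b in zip(point, corner1, corner2))


def select_value(listener_pos, source_pos, entries, default):
    """Pick the first value whose two regions contain listener and source."""
    for entry in entries:
        if (in_box(listener_pos, *entry["applicable_region"])
                and in_box(source_pos, *entry["internal_source_region"])):
            return entry["value"]
    return default


# One room spanning (0, 0, 0) to (5, 4, 3), used for both regions.
room = ((0.0, 0.0, 0.0), (5.0, 4.0, 3.0))
entries = [{"value": 0.8,
            "applicable_region": room,
            "internal_source_region": room}]

print(select_value((1.0, 1.0, 1.5), (4.0, 3.0, 1.0), entries, default=0.2))  # 0.8
print(select_value((1.0, 1.0, 1.5), (9.0, 3.0, 1.0), entries, default=0.2))  # 0.2
```

In the second call the source lies outside the source region, so the value of the entry is not applied and the fallback is used instead.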

[0231] An example of bitstream data using this approach is shown below.

[0232] AcousticEnvironmentData() This part of the syntax describes the (general) acoustic environment / surrounding (overall) characteristics. Specifically, reverberation characteristics. [Table 16]

[0233] acdPresent A flag indicating the presence or absence of acoustic environment data.

[0234] acdNrElements This indicates the number of defined acoustic environments.

[0235] acdID The ID for defining the acoustic environment.

[0236] acdParentID The ID of the parent of the acoustic environment definition in the scene graph.

[0237] applicableRegionID The ID of the geometric element that describes the region to which the parameter applies. Unless the pose dependency is global, the position and orientation are offset by the parent of the acoustic environment.

[0238] internalSourceRegionID The ID of the geometric element that describes the region where all the energy of all included sources contributes to the reverberation. If the pose-dependency is not global, the position and orientation are offset by the parent of the acoustic environment.

[0239] freqGridIdx Index of the list of frequency grids defined within FreqGridData().

[0240] dsrOffset Offset (in seconds) in the RIR from which the DSR is calculated. An offset of 0 corresponds to the moment of emission at the source, so the offset should be greater than 0.

[0241] T60 T60 time calculated from the 0 to -30 dB points of the linear part of the EDC after the initial decay.

[0242] DSR Diffuse to Source energy Ratio. The diffuse reverberation energy is calculated after the RIR lag dsrOffset for a single user's sample point. The source energy is the total emitted source energy that causes that diffuse energy.

[0243] dsrCode Code indicating the DSR value.

[Table 17]

[0244] GeometricElementData() This part of the syntax describes the geometric element. [Table 18]

[0245] gedPresent A flag indicating the presence or absence of geometric element data.

[0246] gedNrElements This indicates the number of defined geometric elements.

[0247] gedID ID of a geometric element.

[0248] gedType Defines the type of the geometric element that is transmitted next. [Table 19]

[0249] cornerPos1 The first corner of a box aligned with the coordinate axes. The position is given in global coordinates.

[0250] cornerPos2 The second corner of the box aligned with the coordinate axes, diagonally opposite cornerPos1. The position is given in global coordinates.

[0251] boxParentID The ID of the parent node of the corresponding box geometric element.

[0252] boxPoseDependency This indicates what the pose of the sound element is relative to.

[0253] boxPosition The position coordinates (x, y, z) of the box geometry are shown in meters.

[0254] boxOrientation The orientation (yaw, pitch, roll) of the box geometry is shown in radians.

[0255] boxXDim Dimensions of the box geometry along the x-axis before rotation.

[0256] boxYDim Dimensions of the box geometry along the y-axis before rotation.

[0257] boxZDim Dimensions of the box geometry along the z-axis before rotation.

[0258] UserInteractionData() This part of the syntax describes user interaction data. It describes how user interactions can be triggered and what changes should be made in response to those triggers by describing which elements, which attributes need to be changed to what values, or whether these values are provided by an external entity (for example, in the case of fully user-controlled interactions such as picking up a source and moving it around in the scene).

[0259] A more semantic definition of user interaction, covering all aspects of 6DoF media rendering such as visual, auditory, and haptic, can be defined at a higher bitstream level. User interaction can influence all or some of these aspects. This system layer (covered by the MPEG Systems Working Group WG03 in the case of MPEG-I) defines how a user can trigger a particular interaction. This may mean defining the activation of a controller's trigger button within a specific spatial area of the scene, or it may mean a more abstract meaning where other layers link abstract meanings to hardware-dependent controls. For example, if the user opens door 5, user interaction G is activated.

[0260] Such a system-level user interaction trigger may send a dedicated user interaction trigger to the respective media renderer. For example, a system-level user interaction G may be linked to an audio user interaction 12, and the audio renderer may then perform the changes associated with the user interaction having triggerID=12.
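A minimal sketch of how a renderer might apply the changes linked to a trigger is shown below. The scene and update structures, the attribute names and the function name are hypothetical; only the idea of executing a list of (target ID, target attribute, target value) changes for a given trigger ID comes from the text.

```python
# Hypothetical renderer state: element ID -> attribute values.
scene = {7: {"render": False, "pregain_db": 0.0}}

# Hypothetical decoded updates: trigger ID -> list of changes.
updates = {12: [(7, "render", True), (7, "pregain_db", -3.0)]}


def apply_trigger(trigger_id, scene, updates):
    """Apply all changes registered for `trigger_id`; return how many were applied."""
    changes = updates.get(trigger_id, [])
    for target_id, attribute, value in changes:
        scene[target_id][attribute] = value
    return len(changes)


applied = apply_trigger(12, scene, updates)
print(applied, scene[7])  # 2 {'render': True, 'pregain_db': -3.0}
```

An unknown trigger ID leaves the scene unchanged in this sketch; how a real renderer treats that case is not specified here.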

[0261] In many embodiments, such user interaction triggers may be accompanied by further parameters for more immersive user interaction. For example, the position coordinates of a sound source being picked up and moved by the user may be provided.

[0262] Such user interactions, triggered from outside the audio renderer, may also be called external triggers. Other user interactions may rely on being triggered by the audio renderer itself. Distinctions between such triggers may be indicated by the `triggerType` property in the bitstream that describes the change in user interaction. [Table 20]

[0263] uidPresent A flag indicating the presence or absence of user interaction data.

[0264] triggerType Defines how user interactions trigger scene updates. [Table 21]

[0265] triggerIdx An index used within user interaction messages from an external source to trigger updates to this specific scene.

[0266] TriggerID The ID of the state element that describes the state that describes this specific scene update.

[0267] condTransition An update is triggered when the state changes to a specified transition value.

[0268] updatedelay The time by which the update is delayed after the trigger.

[0269] nrChanges This shows the number of changes in this scene update.

[0270] usePreviousID A flag indicating whether the change is for the same ID as the previous change.

[0271] immediateChange A flag indicating whether the change is immediate or interpolated to the target value over a specific period.

[0272] [...] The interpolation period for the change.

[0273] targetAttribute The name of the attribute to be changed.

[0274] attribCode A code indicating a modifiable attribute. [Table 22]

[0275] externalParameter A flag indicating whether the parameter value is provided by an external process.

[0276] parameterIdx The index of the parameter in the message from the external process that is mapped to this attribute.

[0277] paramIdxCode A code that shows the parameter index. [Table 23]

[0278] moreParameters A flag indicating the presence or absence of additional parameters.

[0279] TargetValue The target value for the change. Depending on the type of target attribute, the value may be coded differently according to the switch statement in the syntax.

[0280] Support elements A bitstream can include various supporting elements that can support the provision of data representing characteristic values.

[0281] Furthermore, in many embodiments, the number of bits used to indicate a characteristic value may be variable within the bitstream. In particular, a characteristic value may be indicated by a field containing a predetermined number of bits. However, a flag indicator may be included to indicate that one or more extension fields providing additional bits for the characteristic value are included. Specifically, the extension fields may provide additional bits to extend the range of the characteristic value. In particular, the extension fields may include most significant (or higher) bits that are combined with the data bits of the default field to generate a characteristic value with a larger (dynamic) range, i.e. they may provide additional bits that enable the representation of higher values.

[0282] In other scenarios, the extended field may specifically provide additional bits to extend the precision of the characteristic value. In particular, the extended field may include additional least significant bits that are combined with the data bits of the default field to produce a more accurate characteristic value.

[0283] In some embodiments, acoustic environment data may include a first data field providing a first bit representing a value of a given characteristic, and a second data field that may provide an indication for the first data field / given characteristic. This indication may include whether or not a further extended data field is included that provides further bits representing a value of the first characteristic. This indication may be, for example, an indication that the extended field includes bits to extend the range of data values provided, and / or an indication that it includes bits to increase the precision / resolution of the data values provided. Specifically, this indication may indicate that the value is represented by a larger data word obtained from a combination of bits from both the first default field and the extended field.

[0284] In some embodiments, additional bits extend the range of values the characteristic can take. These additional bits may be higher-order bits of the data word representing the characteristic value. In some embodiments, additional bits increase the resolution of the values the characteristic can take. These additional bits may be lower-order bits of the data word representing the first characteristic value. These additional bits may extend the precision of the values the first characteristic can take.
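As a minimal sketch of the two extension modes described above, assume that the default field and the extension field are plain unsigned integers combined by concatenation. The function names and bit widths below are hypothetical and are not taken from the bitstream syntax.

```python
def extend_range(default_bits: int, default_width: int, ext_bits: int) -> int:
    """Place the extension bits above the default field (higher-order bits)."""
    return (ext_bits << default_width) | default_bits


def extend_precision(default_bits: int, ext_bits: int, ext_width: int) -> float:
    """Place the extension bits below the default field (fractional, lower-order bits)."""
    combined = (default_bits << ext_width) | ext_bits
    return combined / (1 << ext_width)


# A 4-bit default value 0b0101 (5) extended with two higher-order bits 0b11.
wide = extend_range(0b0101, 4, 0b11)
# The same default value refined with two lower-order bits 0b10 (adds 0.5).
fine = extend_precision(0b0101, 0b10, 2)
```

In the first case the value range grows while the step size stays the same; in the second case the range stays the same while the step size shrinks.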

[0285] In many embodiments, the default bits and the additional bits may each represent values that are not necessarily combined by concatenating bits but may be combined in other ways.

[0286] Therefore, in some embodiments, the acoustic environment data may include an indication / flag showing that additional bits are used to determine the characteristic value. This approach can be used for a variety of characteristic values, such as spatial characteristics, quantities, gain characteristics, volume characteristics, frequency characteristics, index characteristics, and identifier characteristics.

[0287] For example, the following fields may be used to indicate that a wider range is available.
- For time scales: addSeconds
- For spatial scales: addHectometers
- For a measure of quantity: isLargerNumber
- For (integer) ID numbers: largerValue

[0288] The following fields may be used to indicate that more accurate (higher resolution) values are provided.
- For time scales: addMilliseconds
- For spatial scales: addCentimeters
- For frequency scales: moreAccuracy
- For angle scales: addFineAngle
- For gain scales: addFineGain

[0289] The following are various support elements for bitstreams. As mentioned above, support elements may use variable range / resolution values.

[0290] GetID() Returns an integer ID. [Table 24]

[0291] idVal ID value or partial ID value.

[0292] largerValue A flag indicating whether the ID value is larger than the specified value.

[0293] GetCountOrIndex() Returns a number within the range [0..1023]. [Table 25]

[0294] countOrIndexLoCode A code that indicates the lower bits of the count or index value. [Table 26]

[0295] isLargerNumber A flag indicating whether or not more bits are sent to represent a larger number.

[0296] countOrIndexHiCode A code that indicates the upper bits of the count or index value. [Table 27]
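The following is a minimal sketch of how a decoder might read such a count or index. The field order (lower code, flag, upper code) follows the listing above, but the 5-bit widths, the direct binary coding of the codes, and the `BitReader` helper are assumptions made for illustration; the actual layout is defined by the referenced tables.

```python
class BitReader:
    """Hypothetical helper that reads unsigned integers from a string of '0'/'1'."""

    def __init__(self, bits: str):
        self.bits = bits
        self.pos = 0

    def read(self, n: int) -> int:
        chunk = self.bits[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("bitstream exhausted")
        self.pos += n
        return int(chunk, 2)


LO_BITS = 5  # assumed width of countOrIndexLoCode
HI_BITS = 5  # assumed width of countOrIndexHiCode (5 + 5 bits cover 0..1023)


def get_count_or_index(reader: BitReader) -> int:
    lo = reader.read(LO_BITS)          # countOrIndexLoCode
    if reader.read(1):                 # isLargerNumber
        hi = reader.read(HI_BITS)      # countOrIndexHiCode
        return (hi << LO_BITS) | lo
    return lo


small = get_count_or_index(BitReader("00111" + "0"))            # 6 bits used
large = get_count_or_index(BitReader("00111" + "1" + "00001"))  # 11 bits used
```

Under these assumptions, small values cost 6 bits and only values of 32 or more pay for the upper code.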

[0297] GetDuration() Returns a time duration in samples. [Table 28]

[0298] LUT() Performs a query on the lookup table corresponding to the field whose value is given as the argument.

[0299] deciSecondsCode A code indicating a decisecond duration offset. [Table 29]

[0300] addMilliseconds A flag indicating whether a millisecond duration offset is sent next.

[0301] milliSecondsCode A code indicating a millisecond duration offset. [Table 30]

[0302] addSamples A flag indicating whether a sample-based duration offset is sent next.

[0303] samplesCode A code indicating a sample-based duration offset. [Table 31]

[0304] addSeconds A flag indicating whether a seconds duration offset is sent next.

[0305] secondsCode A code indicating a seconds duration offset. [Table 32]
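As a rough sketch of how the duration offsets above could be combined, assume that each code has already been mapped through its lookup table to a plain non-negative integer in its own unit, and that the offsets are simply added. The function name, the parameter names, and the 48 kHz sample rate are assumptions, not part of the described syntax.

```python
def duration_in_samples(deciseconds: int,
                        milliseconds: int = 0,
                        samples: int = 0,
                        seconds: int = 0,
                        sample_rate: int = 48000) -> int:
    """Sum the decoded duration offsets and express the result in samples."""
    total_ms = seconds * 1000 + deciseconds * 100 + milliseconds
    # Integer arithmetic keeps the result exact when sample_rate is a multiple of 1000.
    return total_ms * sample_rate // 1000 + samples


coarse = duration_in_samples(3)                                       # 0.3 s
refined = duration_in_samples(3, milliseconds=25, samples=7, seconds=2)
```

The default decisecond field alone gives 100 ms resolution; the optional offsets add range (seconds) or resolution (milliseconds, samples) only when they are signalled.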

[0306] GetFrequency() Returns a frequency (Hz) within the range [16...49717]. [Table 33]

[0307] LUT() Performs a query on the lookup table corresponding to the field whose value is given as the argument.

[0308] frequencyCode A code indicating the center frequency (Hz) of a 1/3-octave band. [Table 34]

[0309] moreAccuracy A flag indicating whether or not data for a more precise frequency is being transmitted.

[0310] frequencyRefine A field that indicates a value for adjusting the frequency.

[0311] GetPosition() Returns the position [x, y, z] in meters. [Table 35]

[0312] GetPositionDelta() Returns the position Δ[dx,dy,dz] in meters. [Table 36]

[0313] GetDistance() Returns the distance in meters. [Table 37]

[0314] LUT() Performs a query on the lookup table corresponding to the field whose value is given as the argument.

[0315] metersCode A code indicating a meter coordinate offset. [Table 38]

[0316] addHectometers A flag indicating whether a hectometer coordinate offset is sent next.

[0317] hectometersCode A code indicating a hectometer coordinate offset. [Table 39]

[0318] addKilometers A flag indicating whether a kilometer coordinate offset is sent next.

[0319] kilometersCode A code indicating a kilometer coordinate offset. For distances exceeding 10 km, multiple occurrences may be provided. [Table 40]

[0320] addCentimeters A flag indicating whether a centimeter coordinate offset is sent next.

[0321] centimetersCode A code indicating a centimeter coordinate offset. [Table 41]

[0322] addMillimeters A flag indicating whether a millimeter coordinate offset is sent next.

[0323] millimetersCode A code indicating a millimeter coordinate offset. [Table 42]
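The coordinate offsets above can be combined in a similar way. The sketch below assumes that each code has already been decoded to a non-negative integer in its own unit, that the offsets are summed, and that repeated kilometer offsets are passed as a sequence. The function and parameter names are hypothetical, and sign handling for coordinates is left out.

```python
from typing import Sequence


def distance_in_meters(meters: int,
                       hectometers: int = 0,
                       kilometers: Sequence[int] = (),
                       centimeters: int = 0,
                       millimeters: int = 0) -> float:
    """Sum the decoded offsets in millimeters, then convert to meters."""
    total_mm = (sum(kilometers) * 1_000_000
                + hectometers * 100_000
                + meters * 1_000
                + centimeters * 10
                + millimeters)
    return total_mm / 1000


near = distance_in_meters(42)
precise = distance_in_meters(42, hectometers=3, centimeters=7, millimeters=5)
far = distance_in_meters(0, kilometers=(10, 2))  # two kilometer offsets
```

Accumulating in integer millimeters avoids rounding drift when several fine offsets are added to a large coarse value.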

[0324] Representation of directional characteristics
Acoustic environment data in many cases, and indeed almost always, includes values for one or more directional characteristics.

[0325] In some embodiments, this can be advantageously achieved by acoustic environment data that includes a representation of orientation values.

[0326] In some embodiments, acoustic environment data includes data groups / data fields that describe orientation representation formats for representing orientation characteristics. In many embodiments, acoustic environment data may include data (within the same or multiple different datasets / points) that describe or define multiple different orientation representation formats. Each orientation representation format may provide a (data / bit) format for representing orientation values. In that case, multiple datasets / points may include data that describes orientation characteristics by using one of the defined orientation representation formats.

[0327] In some embodiments, the acoustic environment data may include an indicator showing whether the bitstream contains a data group / data field describing the orientation representation format. For example, a flag may be set to indicate that a data field describing the orientation representation format is included.

[0328] Furthermore, a flag / indicator may be provided for individual orientation values to indicate whether or not the orientation value is provided according to the orientation representation format defined within the acoustic environment data. The flag / indicator may be included for individual values, for example, to indicate which of multiple orientation representation formats is used for that particular value.

[0329] The orientation representation format may include one or more of the following, for example:

[0330] An indication of a predetermined default orientation representation. The number of default orientation representations may be predetermined (e.g., by a standard definition). Data referencing such default orientations may be included. For example, the field orientationCode may indicate the default orientation.

[0331] A set of predetermined angles. The orientation representation format may define a set of predetermined angles, and the orientation value may be indicated, for example, by simply referencing one of such predetermined angles. For example, each predetermined angle may be represented by an index, and a given orientation value may be represented by indicating the appropriate index. For example, each angle of the orientation value may be indicated by a default angle from a narrow range of angles.

[0332] A set of angles (for example, on a quantization grid). The angles may be represented by explicit angle values. The angle values may be represented by a given word length and therefore can have a given level of quantization. Thus, the angle for each direction is represented by one of a larger range of angles.

[0333] Below are some examples of approaches to providing direction values, following the example above.

[0334] GetOrientation() Returns the direction [yaw, pitch, roll] in radians. [Table 43]

[0335] LUT() Performs a query on the lookup table corresponding to the field whose value is given as the argument.

[0336] orientationCode An orientation code indicating either a default orientation or one of two escape values for which the orientation is defined by further data. [Table 44]

[0337] defaultYawCode A code for the default yaw angle. [Table 45]

[0338] defaultPitchCode A code for the default pitch angle. [Table 46]

[0339] defaultRollCode A code for the default roll angle. [Table 47]

[0340] coarseAngleCode A code for a coarse angle indication in increments of π/36. [Table 48]

[0341] addFineAngle A flag indicating whether or not finer-grained angle data is sent.

[0342] fineAngleCode A code for a fine angle indication in increments of π/1800. [Table 49]
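As an illustration of the coarse and fine angle steps, the sketch below assumes that both codes are plain non-negative step counts and that the fine angle is added to the coarse angle. The function name and this additive combination are assumptions; the step sizes of π/36 (5 degrees) and π/1800 (0.1 degrees) are taken from the field descriptions above.

```python
import math
from typing import Optional

COARSE_STEP = math.pi / 36    # 5 degrees per coarse step
FINE_STEP = math.pi / 1800    # 0.1 degrees per fine step


def angle_in_radians(coarse_code: int, fine_code: Optional[int] = None) -> float:
    """Combine a coarse step count with an optional fine step count."""
    angle = coarse_code * COARSE_STEP
    if fine_code is not None:      # corresponds to addFineAngle being set
        angle += fine_code * FINE_STEP
    return angle


quarter_turn = angle_in_radians(18)        # 18 coarse steps of 5 degrees
refined = angle_in_radians(18, 25)         # plus 25 fine steps of 0.1 degrees
```

With these step sizes, 50 fine steps span exactly one coarse step, so the fine code only needs to cover the gap between two coarse angles.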

[0343] GetGain() Returns the gain value (dB). [Table 50]

[0344] coarseGainCode A code for a coarse gain value (dB). [Table 51]

[0345] addFineGain A flag indicating whether or not additional data is sent to provide finer resolution gain values.

[0346] fineGainCode A code for a finer-resolution gain value (1 dB resolution). [Table 52]

[0347] GetGainDelta() Returns the gain delta value (dB). [Table 53]

[0348] GetRenderingConditions() Includes information on how to render the source. [Table 54]

[0349] isNormalConditions A flag indicating whether the rendering conditions are normal. Normal conditions mean that no acoustic features are explicitly turned off, and the rendering of the source, or of specific acoustic features of the source, is determined by the decoder.

[0350] doReverb A flag indicating whether or not to render the reverberation of the corresponding source.

[0351] doEarlyReflections A flag indicating whether or not to render the early reflections of the corresponding source.

[0352] doDoppler A flag indicating whether or not to render the Doppler effect of the corresponding source.

[0353] doDistanceAtt A flag indicating whether or not to render distance attenuation for the corresponding source.

[0354] doDirectPath A flag indicating whether or not to render the direct path to the corresponding source.

[0355] regionDependentActivation A flag indicating whether further data follows that specifies activation of the source depending on the user's location.

[0356] activateGoingIn A flag that, if true, indicates that the source is activated when the user moves into the specified area.

[0357] regionDependentDeactivation A flag indicating whether further data follows that specifies deactivation of the source depending on the user's location.

[0358] deactivateGoingIn A flag that, if true, indicates that the source is deactivated when the user moves into the specified area.
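The flags above might be parsed along the following lines. This is only a sketch: the referenced table is not reproduced here, so the flag order, the nesting under isNormalConditions, and the omission of the data describing the specified area are all assumptions. The name `activate_going_in` is chosen by analogy with deactivateGoingIn, and `None` is used to mean that the decision is left to the decoder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RenderingConditions:
    # None means "not signalled": the decoder decides.
    do_reverb: Optional[bool] = None
    do_early_reflections: Optional[bool] = None
    do_doppler: Optional[bool] = None
    do_distance_att: Optional[bool] = None
    do_direct_path: Optional[bool] = None
    activate_going_in: Optional[bool] = None
    deactivate_going_in: Optional[bool] = None


def parse_rendering_conditions(bits: str) -> RenderingConditions:
    """Read one flag per character from a string of '0'/'1' (assumed layout)."""
    flags = iter(bits)

    def read_flag() -> bool:
        return next(flags) == "1"

    rc = RenderingConditions()
    if not read_flag():                 # isNormalConditions is 0
        rc.do_reverb = read_flag()
        rc.do_early_reflections = read_flag()
        rc.do_doppler = read_flag()
        rc.do_distance_att = read_flag()
        rc.do_direct_path = read_flag()
    if read_flag():                     # regionDependentActivation
        rc.activate_going_in = read_flag()
    if read_flag():                     # regionDependentDeactivation
        rc.deactivate_going_in = read_flag()
    return rc


normal = parse_rendering_conditions("100")
custom = parse_rendering_conditions("0" + "10110" + "1" + "1" + "0")
```

Under this layout a source rendered with normal conditions and no region dependence costs three bits.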

[0359] In the above, the terms "audio" and "audio source" are used, but it should be understood that these are equivalent to the terms "sound" and "sound source." References to the term "audio" can be replaced with references to the term "sound."

[0360] For clarity, the above description describes embodiments of the invention in relation to different functional circuits, units, and processors. However, it will be understood that functions can be appropriately distributed among different functional circuits, units, or processors without impairing the invention. For example, a function described as being performed by multiple separate processors or controllers may be performed by the same processor or controller. Therefore, references to specific functional units or circuits should be considered not as indicating a strict logical or physical structure or organization, but as references to appropriate means for providing the described function.

[0361] The present invention can be implemented in any suitable form, including hardware, software, firmware, or any combination thereof. The present invention may be implemented at least partially as computer software running on one or more data processors and / or digital signal processors. Elements and components of embodiments of the present invention can be implemented physically, functionally, and logically in any suitable way. In practice, the functionality may be implemented as a single unit, multiple units, or as part of other functional units. Thus, the present invention may be implemented as a single unit, or it may be physically and functionally distributed among multiple different units, circuits, and processors.

[0362] Although the present invention has been described in relation to several embodiments, the present invention is not limited to the specific forms described in the specification. The scope of the present invention is limited only by the appended claims. Furthermore, even if a certain feature appears to be described in relation to a particular embodiment, those skilled in the art will recognize that various features of the above embodiments can be combined in accordance with the present invention. In the claims, terms such as "comprising" and "including" do not exclude the presence of other elements or steps.

[0363] Furthermore, even if listed individually, multiple means, elements, circuits, or method steps may be carried out by, for example, a single circuit, unit, or processor. Moreover, even if individual features are included in different claims, they can be suitably combined, and their inclusion in different claims does not mean that the combination of features is impossible and / or unfavorable. Also, the inclusion of a feature in one claim category does not mean that the feature is limited to that category; features are equally applicable to other claim categories as appropriate. Furthermore, the order of features in a claim does not indicate a specific order in which the features should act, and in particular, the order of individual steps in a method claim does not mean that the steps must be performed in that order. The steps can be performed in any suitable order. Also, singular expressions do not preclude plural forms; therefore, expressions such as "a," "an," "first," and "second" do not preclude plurals. Reference numerals in the claims are merely examples for clarity and do not in any way limit the scope of the claims.

Claims

1. A device for generating a bitstream, the device comprising: a metadata generator that generates metadata for audio data of multiple audio elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from the sound sources in the environment, at least a portion of the acoustic environment data is applicable to multiple listening poses in the environment, and the characteristics include both static and dynamic characteristics; and a bitstream generator that generates the bitstream containing the metadata, wherein the metadata generator generates the acoustic environment data to include a global indicator indicating that the environment is a spatially constrained environment, and the metadata generator restricts the data values of the acoustic environment data to conform to a predetermined restricted format for data values when the global indicator indicating that the environment is spatially constrained is set.

2. The apparatus according to claim 1, wherein the acoustic environment data includes a data group describing a data format for at least a portion of the representation of a characteristic value of at least one of the characteristics that affect sound propagation, and a plurality of data groups, each containing data describing at least one characteristic value using the representation.

3. The apparatus according to claim 2, wherein the acoustic environment data includes a data group describing a frequency grid, and a plurality of data groups, each containing data describing frequency-dependent characteristics among the characteristics using the frequency grid; the bitstream includes an indicator indicating whether the bitstream contains the data group describing the frequency grid; the data group describing the frequency grid includes an indication of the format of the data describing the frequency grid; and the data group describing the frequency grid comprises at least one of: data indicating a predetermined default grid; data indicating the starting frequency and frequency range of at least some sub-ranges of the frequency grid; and data indicating individual frequencies.

4. The apparatus according to claim 1, wherein the acoustic environment data includes a first data field for first bits representing a value of a first characteristic among the characteristics that affect sound propagation, and a second data field indicating whether or not the acoustic environment data includes an extended data field for second bits representing the value of the first characteristic.

5. The apparatus according to claim 4, wherein the second bits extend the range of values that the first characteristic can take.

6. The apparatus according to claim 4 or 5, wherein the second bits increase the resolution of the values that the first characteristic can take.

7. The apparatus according to claim 1, wherein the acoustic environment data includes an animation indication for at least a first audio element, the animation indication showing whether at least one characteristic of the first audio element changes over time, and for an animation indication showing that the first audio element has at least one changing characteristic, the acoustic environment data includes data describing the variation of the at least one changing characteristic.

8. The apparatus according to claim 1, wherein the audio elements include a plurality of sound effect elements, and the acoustic environment data includes data that links a user-controlled change in the environment with a first sound effect element among the plurality of sound effect elements.

9. The apparatus according to claim 1, wherein the acoustic environment data is arranged in a plurality of consecutive datasets, each dataset includes data over a time interval, and the first dataset among the consecutive datasets includes a first characteristic value of at least one of the characteristics that affect sound, and a time indication of the first characteristic value indicating time within the time interval represented by the first dataset.

10. The apparatus according to claim 1, wherein the acoustic environment data is arranged in a plurality of consecutive datasets, each dataset containing data over a time interval, and the bitstream generator determines whether a characteristic value of a first characteristic among the characteristics affecting sound propagation is provided for a default time within the time interval represented by the first dataset, and if provided, includes the first characteristic value in the first dataset without time indication, and if not provided, includes the first characteristic value in the first dataset with time indication for the first characteristic value.

11. The apparatus according to claim 1, wherein the acoustic environment data of a first audio element includes indications of a first applicability region and a second applicability region for a first characteristic value of a first characteristic among the characteristics that affect sound propagation, the first applicability region indicating a region of positions of the first audio element to which the first characteristic value applies, and the second applicability region indicating a region of listening positions to which the first characteristic value applies.

12. A device for generating rendered audio, the device comprising: a first receiver that receives audio data of multiple audio elements representing sound sources in an environment; a second receiver that receives a bitstream containing metadata for the audio data of the multiple audio elements representing sound sources in the environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from the sound sources in the environment, at least a portion of the acoustic environment data is applicable to multiple listening poses in the environment, and the characteristics include both static and dynamic characteristics; and a renderer that generates output audio data for the environment in response to the audio data and the acoustic environment data, wherein the acoustic environment data includes a global indicator indicating that the environment is a spatially constrained environment, and at least one value of the acoustic environment data is restricted to conform to a predetermined restricted format for data values when the global indicator indicating that the environment is spatially constrained is set, and wherein the renderer: if the global indicator indicates that the environment is a spatially constrained environment, performs audio rendering processing by interpreting the acoustic environment data on the premise that at least one value of the acoustic environment data conforms to the predetermined restricted format; and if the global indicator does not indicate that the environment is a spatially constrained environment, performs audio rendering processing by interpreting the acoustic environment data without assuming that at least one value of the acoustic environment data conforms to the predetermined restricted format.

13. A method for generating a bitstream, the method comprising: a step of generating metadata for audio data of multiple audio elements representing sound sources in an environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from the sound sources in the environment, at least a portion of the acoustic environment data is applicable to multiple listening poses in the environment, and the characteristics include both static and dynamic characteristics; and a step of generating the bitstream containing the metadata, wherein the step of generating the metadata includes a step of generating the acoustic environment data such that it includes a global indicator indicating that the environment is a spatially constrained environment, and a step of restricting the data values of the acoustic environment data to conform to a predetermined restricted format for data values when the global indicator indicating that the environment is spatially constrained is set.

14. A method for generating rendered audio, the method comprising: a step of receiving audio data of multiple audio elements representing sound sources in an environment; a step of receiving a bitstream containing metadata for the audio data of the multiple audio elements representing sound sources in the environment, wherein the metadata includes acoustic environment data of the environment, the acoustic environment data describes characteristics that affect the propagation of sound from the sound sources in the environment, at least a portion of the acoustic environment data is applicable to multiple listening poses in the environment, and the characteristics include both static and dynamic characteristics; and a rendering step of generating output audio data for the environment in response to the audio data and the acoustic environment data, wherein the acoustic environment data includes a global indicator indicating that the environment is a spatially constrained environment, and at least one value of the acoustic environment data is restricted to conform to a predetermined restricted format for data values when the global indicator indicating that the environment is spatially constrained is set, and wherein the rendering step includes: if the global indicator indicates that the environment is a spatially constrained environment, performing audio rendering processing by interpreting the acoustic environment data on the premise that at least one value of the acoustic environment data conforms to the predetermined restricted format; and if the global indicator does not indicate that the environment is a spatially constrained environment, performing audio rendering processing by interpreting the acoustic environment data without assuming that at least one value of the acoustic environment data conforms to the predetermined restricted format.