Split rendering of extended reality data over a 5G network

By initializing a number of streaming sessions equal to the number of dynamic virtual objects in an XR scene, and retrieving and rendering media data according to the QoS and billing information configured for those sessions, the disclosure addresses inflexible inter-device collaboration in XR data transmission and enables efficient rendering of dynamic virtual objects.

CN117256154B · Active · Publication Date: 2026-06-19 · QUALCOMM INC


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: QUALCOMM INC
Filing Date: 2022-05-12
Publication Date: 2026-06-19

Patent Timeline

12 May 2022: Application
19 Jun 2026: Publication (CN117256154B)

IPC: H04N21/63; H04L65/1066; G06T19/00; G06F3/01; H04N21/234; H04N21/61; H04N21/81; H04L65/80

AI Tagging

Application Domain

Input/output for user-computer interaction · Image data processing


Similar Technology Patents

User interface display system, method, computer device and storage medium
US12657756B2 · Input/output for user-computer interaction · Image analysis
Electronic devices with finger sensors
US12656914B2 · Input/output for user-computer interaction · Details for portable computers
Semiconductor inventory equipment maintenance system and method
CN120087937Blower requirement · Easy to carry out · Input/output for user-computer interaction · Data processing applications
Device for work support in a predefined work area within an assigned spatial profile
DE102013201309B4 · Input/output for user-computer interaction · Measuring points marking
AR head-mounted device, and AR head-mounted device and terminal device combination system
CN114967926B · Input/output for user-computer interaction · Graph reading


AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently transmit and render media data for dynamic virtual objects when processing extended reality (XR) data, especially in scenarios involving split rendering across multiple devices, where service quality and billing information configuration is not flexible enough.

Method used

The entry point data of the XR scene is parsed, a number of streaming sessions equal to the number of dynamic virtual objects is initialized, and media data is retrieved and rendered according to the QoS and billing information configured for those sessions. Split rendering is used so that client devices and server devices share the rendering work.

Benefits of technology

It enables efficient media data transmission and rendering of dynamic virtual objects in XR scenarios involving multiple devices, ensuring flexible configuration of service quality and billing information, and improving user experience.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

An example device for processing extended reality (XR) data includes a processor configured to: parse entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to or greater than the number of dynamic virtual objects; configure quality of service (QoS) and billing information for the streaming sessions; retrieve media data for the dynamic virtual objects via the streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, so as to include the retrieved media data at corresponding locations within the XR scene.


Description

[0001] This application claims priority to U.S. Patent Application No. 17/742,168, filed May 11, 2022, and U.S. Provisional Application No. 63/187,840, filed May 12, 2021, the entire contents of each of which are incorporated herein by reference. U.S. Patent Application No. 17/742,168, filed May 11, 2022, claims the benefit of U.S. Provisional Application No. 63/187,840, filed May 12, 2021.

Technical Field

[0002] This disclosure relates to the storage and transport of media data.

Background

[0003] Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, video teleconferencing devices, and the like. Digital video devices implement video compression techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4 Part 10, Advanced Video Coding (AVC), ITU-T H.265 (also known as High Efficiency Video Coding (HEVC)), and extensions of such standards, to transmit and receive digital video information more efficiently.

[0004] After video and other media data have been encoded, the media data can be packetized for transmission or storage. The media data can be assembled into a video file conforming to any of a variety of standards, such as the International Organization for Standardization (ISO) base media file format and its extensions.

Summary

[0005] In general, this disclosure describes techniques for processing extended reality (XR) data, for example using split rendering. Specifically, the techniques relate to processing media data that includes multiple dynamic virtual objects. A client device can be configured to initialize a corresponding streaming session for each dynamic virtual object. That is, there can be a one-to-one correspondence between streaming sessions and dynamic virtual objects. In this way, media data for each dynamic virtual object can be streamed via the corresponding streaming session. Each streaming session can have its own Quality of Service (QoS) and billing information, configured according to, for example, the type of the corresponding dynamic virtual object.
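The one-to-one mapping between dynamic virtual objects and streaming sessions can be illustrated with a minimal sketch. Everything named here (`DynamicObject`, `StreamingSession`, `POLICY_BY_TYPE`, the policy strings) is hypothetical and chosen only for illustration; the disclosure does not define these structures or values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicObject:
    """A dynamic virtual object listed in the scene entry point (hypothetical shape)."""
    object_id: str
    media_type: str  # e.g. "video", "audio", "mesh"


@dataclass(frozen=True)
class StreamingSession:
    """One streaming session with its own QoS and billing configuration."""
    object_id: str
    qos_profile: str
    billing_policy: str


# Hypothetical per-type policy table; real values would come from the network side.
POLICY_BY_TYPE = {
    "video": ("low-latency-high-bitrate", "premium"),
    "audio": ("low-latency-low-bitrate", "standard"),
}
DEFAULT_POLICY = ("best-effort", "standard")


def init_sessions(objects):
    """Create exactly one session per dynamic object, configured by object type."""
    sessions = []
    for obj in objects:
        qos, billing = POLICY_BY_TYPE.get(obj.media_type, DEFAULT_POLICY)
        sessions.append(StreamingSession(obj.object_id, qos, billing))
    return sessions


scene_objects = [
    DynamicObject("avatar-1", "video"),
    DynamicObject("voice-1", "audio"),
    DynamicObject("prop-7", "mesh"),
]
sessions = init_sessions(scene_objects)
```

The number of sessions always equals the number of dynamic objects, and an object type without a table entry falls back to the default policy.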

[0006] In one example, a method for processing extended reality (XR) data includes: parsing entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initializing a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein initializing the streaming sessions includes initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieving media data for each of the dynamic media components of the dynamic virtual objects via one of the corresponding number of streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene to include the retrieved media data at a corresponding location within the XR scene.

[0007] In another example, an apparatus for processing extended reality (XR) data includes: a memory configured to store XR data and media data; and one or more processors implemented in circuitry and configured to: parse entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein, in order to initialize the streaming sessions, the one or more processors are configured to initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each dynamic media component of the dynamic media components for the dynamic virtual objects via one of the corresponding number of streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene to include the retrieved media data at a corresponding location within the XR scene.

[0008] In another example, a computer-readable storage medium has instructions stored thereon that, when executed, cause a processor to: parse entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects comprising at least one dynamic media component for which media data is to be retrieved; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the instructions causing the processor to initialize the number of streaming sessions include instructions causing the processor to: initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each dynamic media component of the dynamic virtual objects via one of the corresponding number of streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0009] In another example, an apparatus for processing extended reality (XR) data includes: a unit for parsing entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; a unit for initializing a number of streaming sessions, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the unit for initializing the number of streaming sessions includes a unit for initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; a unit for retrieving media data for each dynamic media component of the dynamic virtual objects via one of the corresponding number of streaming sessions; and a unit for sending the retrieved media data to a rendering unit to render the XR scene, so as to include the retrieved media data at a corresponding location within the XR scene.

[0010] Details of one or more examples are set forth in the accompanying drawings and the following description. Other features, objects, and advantages will be apparent from the description, the drawings, and the claims.

Brief Description of the Drawings

[0011] Figure 1 is a block diagram illustrating an example system that implements techniques for streaming media data over a network.

[0012] Figure 2 is a block diagram illustrating an example computing system that can perform the techniques described in this disclosure.

[0013] Figure 3 is a block diagram illustrating an example client device configured as a 5G STandalone AR (STAR) user equipment (UE) according to the techniques of this disclosure.

[0014] Figure 4 is a block diagram illustrating another example client device configured as a 5G EDGe-Dependent AR (EDGAR) user equipment (UE) according to the techniques of this disclosure.

[0015] Figure 5 is a flowchart illustrating an example augmented reality session for a STAR UE using the techniques of this disclosure.

[0016] Figure 6 is a flowchart illustrating an example augmented reality session for an EDGAR UE using the techniques of this disclosure.

[0017] Figure 7 is a flowchart illustrating an example method for processing XR data according to the techniques of this disclosure.

Detailed Description

[0018] OpenXR is an application programming interface (API) for developing extended reality (XR) applications that run on a variety of XR devices. XR refers to combined real-and-virtual environments and the human-machine interactions generated by computer technology. XR includes technologies such as virtual reality (VR), augmented reality (AR), and mixed reality (MR). OpenXR is the interface between an application and an XR runtime. The XR runtime handles functions such as frame composition, user-triggered actions, and tracking information.

[0019] OpenXR is designed as a layered API, meaning that a user or application can insert API layers between the application and the runtime implementation. These API layers provide additional functionality by intercepting OpenXR functions from the layer above and then performing actions different from those that would be performed without the layer. In the simplest case, a layer simply calls the next layer down with the same arguments, but a more complex layer can implement API functionality that does not exist in the layers below it or in the runtime. This mechanism is essentially an architected "function shimming" or "intercept" feature that is designed into OpenXR and intended to replace more informal methods of "hooking" API calls.
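The interception idea can be sketched in a language-neutral way. The following Python model is not the OpenXR API: `runtime_create_session`, `passthrough_layer`, `logging_layer`, and `build_chain` are hypothetical names used only to show how each layer wraps the next one down.

```python
def runtime_create_session(name):
    """Stand-in for a runtime function at the bottom of the chain (hypothetical)."""
    return f"session:{name}"


def passthrough_layer(next_fn):
    """Simplest layer: forwards the call downward with the same arguments."""
    def fn(name):
        return next_fn(name)
    return fn


def logging_layer(next_fn, log):
    """Intercepting layer: records the call, then forwards it."""
    def fn(name):
        log.append(f"create_session({name!r})")
        return next_fn(name)
    return fn


def build_chain(runtime_fn, layers):
    """Wrap the runtime function so that layers[0] is the first one the application hits."""
    fn = runtime_fn
    for layer in reversed(layers):
        fn = layer(fn)
    return fn


calls = []
create_session = build_chain(
    runtime_create_session,
    [lambda nxt: logging_layer(nxt, calls), passthrough_layer],
)
result = create_session("demo")
```

The application calls `create_session` as if it were the runtime function; the logging layer sees the call first, the pass-through layer forwards it unchanged, and the runtime stand-in produces the result.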

[0020] An application can determine which API layers are available to it by calling the `xrEnumerateApiLayerProperties` function to obtain a list of available API layers. The application can then select the desired API layer from this list and provide it to the `xrCreateInstance` function when creating an instance.

[0021] The API layer can implement OpenXR functions, which may or may not be supported by the underlying runtime. To expose these new features, the API layer must expose this functionality as an OpenXR extension. It cannot expose new OpenXR functions that do not have an associated extension.

[0022] An OpenXR instance is an object that allows OpenXR applications to communicate with the OpenXR runtime. Applications accomplish this communication by calling `xrCreateInstance` and receiving a handle to the resulting `XrInstance` object.

[0023] The XrInstance object stores and tracks OpenXR-related application state without storing any such state in the application's global address space. This allows an application to create multiple instances and to safely encapsulate its OpenXR state, because the object is opaque to the application. OpenXR runtimes may limit the number of XrInstance objects that can be created and used simultaneously, but they must support the creation and use of at least one XrInstance object per process.

[0024] Spaces are represented by XrSpace handles, which the application creates and then uses in API calls. Whenever an application calls a function that returns coordinates, it provides an XrSpace to specify the frame of reference in which those coordinates will be expressed. Similarly, when providing coordinates to a function, the application specifies which XrSpace the runtime should use to interpret those coordinates.

[0025] OpenXR defines a set of well-known reference spaces that applications use to bootstrap their spatial reasoning. These reference spaces are VIEW, LOCAL, and STAGE. Each reference space has a well-defined meaning, which establishes where its origin is positioned and how its axes are oriented.

[0026] A runtime whose tracking system improves its understanding of the world over time can track spaces independently. For example, even though the LOCAL space and the STAGE space each map their origin to a static position in the world, a runtime with an inside-out tracking system can continuously introduce slight adjustments to the origin of each space to keep each origin in place.

[0027] Beyond the well-known reference spaces, runtimes also expose other independently tracked spaces, such as a pose action space that tracks the pose of a motion controller over time.

[0028] According to the techniques of this disclosure, XR data can be rendered in a split-rendering manner. That is, two or more devices, for example a client device and a server device, can participate in rendering the XR data. Multiple client and/or server devices can participate in an XR split-rendering session. Typically, the server can use a streaming network protocol, such as Dynamic Adaptive Streaming over HTTP (DASH) or HTTP Live Streaming (HLS), to stream media data to the client.

[0029] In HTTP streaming, frequently used operations include HEAD, GET, and partial GET. The HEAD operation retrieves the header of a file associated with a given Uniform Resource Locator (URL) or Uniform Resource Name (URN), without retrieving the payload associated with the URL or URN. The GET operation retrieves the whole file associated with a given URL or URN. The partial GET operation receives a byte range as an input parameter and retrieves a continuous number of bytes of the file, where the number of bytes corresponds to the received byte range. Thus, movie fragments can be provided for HTTP streaming, because a partial GET operation can retrieve one or more individual movie fragments. A movie fragment can contain several track fragments of different tracks. In HTTP streaming, a media presentation can be a structured collection of data that is accessible to the client. The client can request and download media data information to present a streaming service to a user.
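As a rough illustration of the three operations, the sketch below builds (but does not send) the corresponding requests with Python's standard `urllib.request.Request`, and models the server side of a partial GET as an inclusive byte slice. The URL and the helper names are placeholders, not taken from the disclosure.

```python
from urllib.request import Request

URL = "https://example.com/media/segment1.m4s"  # placeholder URL


def head_request(url):
    """HEAD: ask only for the headers of the resource."""
    return Request(url, method="HEAD")


def get_request(url):
    """GET: ask for the entire resource."""
    return Request(url, method="GET")


def partial_get_request(url, first_byte, last_byte):
    """Partial GET: a GET with a Range header; both byte offsets are inclusive."""
    return Request(url, method="GET",
                   headers={"Range": f"bytes={first_byte}-{last_byte}"})


def apply_range(body, first_byte, last_byte):
    """Model what a server returns for an inclusive byte range."""
    return body[first_byte:last_byte + 1]


req = partial_get_request(URL, 100, 199)
fragment = apply_range(bytes(range(256)), 100, 199)
```

Because HTTP byte ranges are inclusive at both ends, `bytes=100-199` selects 100 bytes, which is why the slice ends at `last_byte + 1`.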

[0030] In examples that use HTTP streaming to stream 3GPP data, there can be multiple representations for the video and/or audio data of multimedia content. As explained below, different representations can correspond to different coding characteristics (e.g., different profiles or levels of a video coding standard), different coding standards or extensions of coding standards (such as multiview and/or scalable extensions), or different bit rates. The manifest of such representations can be defined in a Media Presentation Description (MPD) data structure. A media presentation can correspond to a structured collection of data that is accessible to an HTTP streaming client device. The HTTP streaming client device can request and download media data information to present a streaming service to a user of the client device. A media presentation can be described in the MPD data structure, which can include updates of the MPD.

[0031] A media presentation may contain a sequence of one or more periods. Each period may extend until the start of the next period or, in the case of the last period, until the end of the media presentation. Each period may contain one or more representations of the same media content. A representation may be one of several alternative encoded versions of audio, video, timed text, or other such data. Representations may differ by encoding type, for example by bit rate, resolution, and/or codec for video data, and by bit rate, language, and/or codec for audio data. The term "representation" can be used to refer to a section of encoded audio or video data that corresponds to a particular period of the multimedia content and is encoded in a particular way.

[0032] Representations of a particular period can be assigned to a group indicated by an attribute in the MPD that identifies the adaptation set to which the representations belong. Representations in the same adaptation set are generally considered alternatives to each other, in that a client device can dynamically and seamlessly switch between them, for example to perform bandwidth adaptation. For instance, each representation of video data for a particular period can be assigned to the same adaptation set, so that any of these representations can be selected for decoding to present media data, such as video or audio data, of the multimedia content for the corresponding period. In some examples, the media content within one period can be represented either by one representation from group 0, if present, or by a combination of at most one representation from each non-zero group. Timing data for each representation of a period can be expressed relative to the start time of the period.
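Bandwidth adaptation within one adaptation set can be sketched as a simple selection rule. The rule below (highest bit rate that fits, otherwise the lowest available) is one common heuristic assumed for illustration; the disclosure does not prescribe a selection algorithm, and the `Representation` class and its values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    rep_id: str
    bandwidth_bps: int  # bit rate needed to stream this representation


def select_representation(adaptation_set, available_bps):
    """Pick the highest-bandwidth representation that fits; else the lowest one."""
    fitting = [r for r in adaptation_set if r.bandwidth_bps <= available_bps]
    if fitting:
        return max(fitting, key=lambda r: r.bandwidth_bps)
    return min(adaptation_set, key=lambda r: r.bandwidth_bps)


video_set = [
    Representation("v-low", 1_000_000),
    Representation("v-mid", 3_000_000),
    Representation("v-high", 8_000_000),
]
choice_fast = select_representation(video_set, 5_000_000)
choice_slow = select_representation(video_set, 500_000)
```

Because all three representations belong to the same adaptation set, the client can re-run the selection whenever its bandwidth estimate changes and switch between them at segment boundaries.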

[0033] A representation may include one or more segments. Each representation may include an initialization segment, or each segment of a representation may be self-initializing. When present, the initialization segment may contain initialization information for accessing the representation. In general, the initialization segment does not contain media data. A segment may be uniquely referenced by an identifier, such as a Uniform Resource Locator (URL), Uniform Resource Name (URN), or Uniform Resource Identifier (URI). The MPD may provide the identifier for each segment. In some examples, the MPD may also provide byte ranges in the form of a range attribute, which may correspond to the data for a segment within a file accessible by the URL, URN, or URI.

[0034] Different representations can be selected for substantially simultaneous retrieval of different types of media data. For example, a client device can select an audio representation, a video representation, and a timed text representation from which to retrieve segments. In some examples, the client device can select particular adaptation sets for performing bandwidth adaptation. That is, the client device can select an adaptation set that includes video representations, an adaptation set that includes audio representations, and/or an adaptation set that includes timed text. Alternatively, the client device can select adaptation sets for certain types of media (e.g., video) and directly select representations for other types of media (e.g., audio and/or timed text).

[0035] Figure 1 is a block diagram illustrating an example system 10 that implements techniques for streaming media data over a network. In this example, system 10 includes a content preparation device 20, a server device 60, and a client device 40. Server device 60 and client device 40 can participate in an extended reality (XR) split rendering process, discussed in more detail below. Client device 40 and server device 60 are communicatively coupled via a network 74, which may include the Internet. In some examples, content preparation device 20 and server device 60 may also be coupled via network 74 or another network, or they may be directly communicatively coupled. In some examples, content preparation device 20 and server device 60 may be the same device.

[0036] In the example of Figure 1, content preparation device 20 includes an audio source 22 and a video source 24. Audio source 22 may include, for example, a microphone that produces electrical signals representing captured audio data to be encoded by audio encoder 26. Alternatively, audio source 22 may include a storage medium storing previously recorded audio data, an audio data generator (such as a computerized synthesizer), or any other source of audio data. Video source 24 may include a video camera that produces video data to be encoded by video encoder 28, a storage medium encoded with previously recorded video data, a video data generation unit (such as a computer graphics source), or any other source of video data. Content preparation device 20 is not necessarily communicatively coupled to server device 60 in all examples, but may instead store multimedia content on a separate medium that is read by server device 60.

[0037] The raw audio and video data may include analog or digital data. Analog data may be digitized before being encoded by audio encoder 26 and / or video encoder 28. Audio source 22 may acquire audio data from the speaker while the speaker is speaking, and video source 24 may acquire video data of the speaker simultaneously. In other examples, audio source 22 may include a computer-readable storage medium containing stored audio data, and video source 24 may include a computer-readable storage medium containing stored video data. In this way, the techniques described in this disclosure can be applied to live, streaming, real-time audio and video data or to archived, pre-recorded audio and video data.

[0038] An audio frame that corresponds to a video frame is generally an audio frame containing audio data that was captured (or generated) by audio source 22 at the same time as the video data that was captured (or generated) by video source 24 and is contained within the video frame. For example, while a speaker generates audio data by speaking, audio source 22 captures the audio data, and video source 24 captures video data of the speaker at the same time, that is, while audio source 22 is capturing the audio data. Hence, an audio frame can temporally correspond to one or more particular video frames. Accordingly, an audio frame corresponding to a video frame generally describes a situation in which audio data and video data were captured at the same time, and in which the audio frame and the video frame contain, respectively, the audio data and the video data that were captured at that time.

[0039] In some examples, audio encoder 26 may encode a timestamp indicating the time when audio data for each encoded audio frame was recorded into the encoded audio frame, and similarly, video encoder 28 may encode a timestamp indicating the time when video data for each encoded video frame was recorded into the encoded video frame. In such examples, the audio frame corresponding to the video frame may include an audio frame containing a timestamp and a video frame containing the same timestamp. Content preparation device 20 may include an internal clock, which audio encoder 26 and / or video encoder 28 may use to generate timestamps, or audio source 22 and video source 24 may use the internal clock to associate audio data and video data with timestamps respectively.
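The timestamp-based correspondence between audio and video frames can be sketched as a grouping step. The `(timestamp, payload)` tuple layout and the function name are assumptions made for this example only.

```python
from collections import defaultdict


def pair_by_timestamp(audio_frames, video_frames):
    """Group frames that carry the same timestamp.

    Each frame is a (timestamp, payload) tuple; this layout is an assumption.
    Returns {timestamp: (audio_payloads, video_payloads)} for timestamps
    present in both streams.
    """
    audio = defaultdict(list)
    video = defaultdict(list)
    for ts, payload in audio_frames:
        audio[ts].append(payload)
    for ts, payload in video_frames:
        video[ts].append(payload)
    common = sorted(audio.keys() & video.keys())
    return {ts: (audio[ts], video[ts]) for ts in common}


audio_frames = [(0, "a0"), (40, "a1"), (80, "a2")]
video_frames = [(0, "v0"), (40, "v1"), (120, "v3")]
pairs = pair_by_timestamp(audio_frames, video_frames)
```

Frames whose timestamp appears in only one stream (80 and 120 here) have no counterpart and are left out of the result.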

[0040] In some examples, audio source 22 may send data to audio encoder 26 corresponding to the time the audio data was recorded, while video source 24 may send data to video encoder 28 corresponding to the time the video data was recorded. In some examples, audio encoder 26 may encode sequence identifiers into the encoded audio data to indicate the relative temporal order of the encoded audio data, but not necessarily the absolute time the audio data was recorded; similarly, video encoder 28 may use sequence identifiers to indicate the relative temporal order of the encoded video data. Similarly, in some examples, sequence identifiers may be mapped or otherwise associated with timestamps.

[0041] Audio encoder 26 generally produces a stream of encoded audio data, while video encoder 28 produces a stream of encoded video data. Each individual stream of data (whether audio or video) can be referred to as an elementary stream. An elementary stream is a single, digitally coded (possibly compressed) component of a representation. For example, the coded video or audio part of the representation can be an elementary stream. An elementary stream can be converted into a packetized elementary stream (PES) before being encapsulated within a video file. Within the same representation, a stream ID can be used to distinguish the PES packets belonging to one elementary stream from those belonging to another. The basic unit of data of an elementary stream is a PES packet. Thus, coded video data generally corresponds to elementary video streams. Similarly, audio data corresponds to one or more respective elementary streams.

[0042] Many video coding standards, such as ITU-T H.264/AVC and the upcoming High Efficiency Video Coding (HEVC) standard, define the syntax, semantics, and decoding process for error-free bitstreams, any of which conform to a certain profile or level. Video coding standards typically do not specify the encoder, but the encoder is tasked with guaranteeing that the generated bitstreams are standard-compliant for a decoder. In the context of video coding standards, a "profile" corresponds to a subset of algorithms, features, or tools and the constraints that apply to them. As defined by the H.264 standard, for example, a "profile" is a subset of the entire bitstream syntax that is specified by the H.264 standard. A "level" corresponds to limitations on decoder resource consumption, such as decoder memory and computation, which are related to picture resolution, bit rate, and block processing rate. A profile can be signaled with a `profile_idc` (profile indicator) value, while a level can be signaled with a `level_idc` (level indicator) value.

[0043] The H.264 standard, for example, recognizes that, within the bounds imposed by the syntax of a given profile, it is still possible to require a large variation in the performance of encoders and decoders, depending on the values taken by syntax elements in the bitstream, such as the specified size of the decoded pictures. The H.264 standard further recognizes that, in many applications, it is neither practical nor economical to implement a decoder capable of dealing with all hypothetical uses of the syntax within a particular profile. Accordingly, the H.264 standard defines a “level” as a specified set of constraints imposed on the values of the syntax elements in the bitstream. These constraints may be simple limits on values. Alternatively, these constraints may take the form of constraints on arithmetic combinations of values (e.g., picture width multiplied by picture height multiplied by the number of pictures decoded per second). The H.264 standard further provides that individual implementations may support a different level for each supported profile.
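A constraint on an arithmetic combination of values can be made concrete with a small worked check. The limits in `LEVEL_LIMITS` are commonly cited H.264 figures for three levels, included here as illustrative values that should be verified against Table A-1 of the standard; the function names are hypothetical, and the sketch ignores the other per-level limits (bit rate, buffer sizes, and so on).

```python
import math

# Illustrative limits for a few H.264 levels:
# (max frame size in macroblocks, max macroblocks per second).
# Verify against Table A-1 of the standard before relying on them.
LEVEL_LIMITS = {
    "3": (1620, 40500),
    "3.1": (3600, 108000),
    "4": (8192, 245760),
}


def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover a picture."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def lowest_level(width, height, fps):
    """Return the lowest listed level whose limits admit this stream, or None."""
    frame_mbs = macroblocks(width, height)
    mbs_per_second = frame_mbs * fps
    for level, (max_fs, max_mbps) in LEVEL_LIMITS.items():
        if frame_mbs <= max_fs and mbs_per_second <= max_mbps:
            return level
    return None


level_720p30 = lowest_level(1280, 720, 30)
level_1080p30 = lowest_level(1920, 1080, 30)
```

For 1280x720 at 30 pictures per second, the picture needs 80 x 45 = 3600 macroblocks and 108000 macroblocks per second, which exactly meets the listed limits for level 3.1. For 1920x1080 at 30 pictures per second, 120 x 68 = 8160 macroblocks and 244800 macroblocks per second fit within the listed limits for level 4.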

[0044] A decoder conforming to a profile ordinarily supports all the features defined in the profile. For example, as a coding feature, B-picture coding is not supported in the baseline profile of H.264/AVC but is supported in other profiles of H.264/AVC. A decoder conforming to a level should be capable of decoding any bitstream that does not require resources beyond the limitations defined in the level. Definitions of profiles and levels can be helpful for interoperability. For example, during video transmission, a pair of profile and level definitions can be negotiated and agreed upon for a whole transmission session. More specifically, in H.264/AVC, a level can define limitations on the number of macroblocks that need to be processed, the decoded picture buffer (DPB) size, the coded picture buffer (CPB) size, the vertical motion vector range, the maximum number of motion vectors per two consecutive macroblocks (MBs), and whether a B-block can have sub-macroblock partitions smaller than 8x8 pixels. In this way, a decoder can determine whether it is capable of properly decoding the bitstream.

[0045] In the example of Figure 1, encapsulation unit 30 of content preparation device 20 receives elementary streams comprising coded video data from video encoder 28 and elementary streams comprising coded audio data from audio encoder 26. In some examples, video encoder 28 and audio encoder 26 may each include a packetizer for forming PES packets from the encoded data. In other examples, video encoder 28 and audio encoder 26 may each interface with a respective packetizer for forming PES packets from the encoded data. In still other examples, encapsulation unit 30 may include packetizers for forming PES packets from the encoded audio and video data.

[0046] Video encoder 28 can encode video data of multimedia content in various ways to produce different representations of the multimedia content at various bit rates and with various characteristics (such as pixel resolution, frame rate, conformance to various coding standards, conformance to various profiles and/or levels of profiles for various coding standards, representations with one or more views (e.g., for two-dimensional or three-dimensional playback), or other such characteristics). Representations as used in this disclosure may include audio data, video data, text data (e.g., for closed captions), or other such data. Representations may include elementary streams, such as audio elementary streams or video elementary streams. Each PES packet may include a stream_id identifying the elementary stream to which the PES packet belongs. Encapsulation unit 30 is responsible for assembling the elementary streams into video files (e.g., segments) of the various representations.

[0047] Encapsulation unit 30 receives PES packets for the elementary streams of a representation from audio encoder 26 and video encoder 28, and forms corresponding Network Abstraction Layer (NAL) units from the PES packets. Coded video segments can be organized into NAL units, which provide a “network-friendly” video representation addressing applications such as video telephony, storage, broadcasting, or streaming. NAL units can be classified as Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL units may contain the output of the core compression engine and may include block, macroblock, and/or slice-level data. Other NAL units may be non-VCL NAL units. In some examples, a coded picture in one time instance, typically presented as a primary coded picture, may be contained in an access unit, which may include one or more NAL units.

[0048] Non-VCL NAL units can include parameter set NAL units, SEI NAL units, and other units. Parameter sets can contain sequence-level header information (in the Sequence Parameter Set (SPS)) and infrequently changing picture-level header information (in the Picture Parameter Set (PPS)). With parameter sets (e.g., PPS and SPS), infrequently changing information need not be repeated for each sequence or picture; thus, coding efficiency can be improved. Furthermore, using parameter sets allows for out-of-band transmission of important header information, thereby avoiding the need for redundant transmissions for error resilience. In an out-of-band transmission example, parameter set NAL units can be transmitted on a different channel than other NAL units (such as SEI NAL units).
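The VCL / non-VCL distinction can be sketched in code. In H.264/AVC the NAL unit header is a single byte carrying a 1-bit forbidden_zero_bit, a 2-bit nal_ref_idc, and a 5-bit nal_unit_type. The following minimal sketch parses that byte and classifies the unit; the function names are hypothetical, and extension unit types (for example, those used by SVC or MVC) are deliberately ignored and reported as other non-VCL units.

```python
def parse_nal_header(first_byte: int) -> dict:
    """Split the 1-byte H.264/AVC NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": first_byte & 0x1F,
    }


# Non-VCL unit types named in the text above.
NON_VCL_NAMES = {6: "SEI", 7: "SPS", 8: "PPS"}


def classify(first_byte: int) -> str:
    """Return 'VCL' for coded slice data (types 1-5), else a non-VCL label."""
    unit_type = parse_nal_header(first_byte)["nal_unit_type"]
    if 1 <= unit_type <= 5:
        return "VCL"
    return NON_VCL_NAMES.get(unit_type, "non-VCL (other)")
```

For instance, a header byte of 0x67 carries nal_unit_type 7 and is classified as an SPS, whereas 0x65 carries nal_unit_type 5 (a slice of an IDR picture) and is classified as VCL.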

[0049] Supplemental Enhancement Information (SEI) may contain information that is not necessary for decoding the coded picture samples from VCL NAL units, but may assist in processes related to decoding, display, error resilience, and other purposes. SEI messages can be contained in non-VCL NAL units. SEI messages are a normative part of some standard specifications, and thus are not always mandatory for standards-compliant decoder implementations. SEI messages can be sequence-level SEI messages or picture-level SEI messages. Some sequence-level information can be contained in SEI messages, such as the scalability information SEI message in the example of SVC and the view scalability information SEI message in MVC. These example SEI messages can convey information about, for example, the extraction and characteristics of operation points. Additionally, the encapsulation unit 30 can form a manifest file, such as a Media Presentation Description (MPD) describing the characteristics of the representations. The encapsulation unit 30 can format the MPD according to Extensible Markup Language (XML).

[0050] The encapsulation unit 30 can provide data for one or more representations of multimedia content, along with a manifest file (e.g., MPD), to the output interface 32. The output interface 32 may include a network interface or an interface for writing to storage media (such as a Universal Serial Bus (USB) interface, a CD or DVD burner, an interface to magnetic or flash storage media, or other interfaces for storing or transmitting media data). The encapsulation unit 30 can provide data for each representation of the multimedia content to the output interface 32, which can then transmit the data to the server device 60 via a network or storage medium. In the example of Figure 1, server device 60 includes storage medium 62 for storing various multimedia content 64, each multimedia content including a corresponding manifest file 66 and one or more representations 68A-68N (representations 68). In some examples, output interface 32 can also send data directly to network 74.

[0051] In some examples, representations 68 can be divided into adaptation sets. That is, each subset of representations 68 may include a corresponding set of common characteristics, such as: codec; profile and level; resolution; number of views; file format used for segments; text type information that can identify the language or other characteristics of the text to be displayed along with the representation and/or of the audio data to be decoded and presented, for example, by a speaker; camera angle information that can describe the camera angle or real-world perspective of the scene for the representations in the adaptation set; rating information describing the suitability of the content for a particular audience; and so on.

[0052] Manifest file 66 may include data indicating a subset of representations 68 corresponding to a particular adaptation set, as well as data on common characteristics of the adaptation set. Manifest file 66 may also include data indicating individual characteristics of individual representations within the adaptation set, such as bit rate. In this way, adaptation sets can provide simplified network bandwidth adaptation. Sub-elements in the adaptation set elements of manifest file 66 can be used to indicate representations within the adaptation set.
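The split between common characteristics on the adaptation set and individual characteristics on each representation can be illustrated with a small, hypothetical MPD fragment. The following minimal sketch uses Python's standard XML parser; the representation identifiers, codec strings, and bandwidth values are invented for illustration, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified MPD fragment (all values are invented).
MPD_XML = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet mimeType="video/mp4" codecs="avc1.64001f">
      <Representation id="v-low" bandwidth="500000"/>
      <Representation id="v-high" bandwidth="3000000"/>
    </AdaptationSet>
    <AdaptationSet mimeType="audio/mp4" codecs="mp4a.40.2" lang="en">
      <Representation id="a-en" bandwidth="128000"/>
    </AdaptationSet>
  </Period>
</MPD>"""

NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}


def adaptation_sets(mpd_xml: str) -> list:
    """List each adaptation set's common attributes and per-representation bit rates."""
    root = ET.fromstring(mpd_xml)
    result = []
    for aset in root.findall("mpd:Period/mpd:AdaptationSet", NS):
        bitrates = {rep.get("id"): int(rep.get("bandwidth"))
                    for rep in aset.findall("mpd:Representation", NS)}
        result.append({"common": dict(aset.attrib), "representations": bitrates})
    return result
```

In this sketch the codec and media type are read once per adaptation set, while the bit rate is read per representation, which is what allows a client to adapt bandwidth within a set without re-checking decoder compatibility.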

[0053] Server device 60 includes a request processing unit 70 and a network interface 72. In some examples, server device 60 may include multiple network interfaces. Furthermore, any or all features of server device 60 may be implemented on other devices in the content delivery network, such as routers, bridges, proxy devices, switches, or other devices. In some examples, intermediate devices in the content delivery network may cache data for multimedia content 64 and include components substantially consistent with those of server device 60. Typically, network interface 72 is configured to send and receive data via network 74.

[0054] Request processing unit 70 is configured to receive network requests for data on storage medium 62 from a client device, such as client device 40. For example, request processing unit 70 may implement Hypertext Transfer Protocol (HTTP) version 1.1 as described in RFC 2616 (R. Fielding et al., “Hypertext Transfer Protocol -- HTTP/1.1”, Network Working Group, IETF, June 1999). That is, request processing unit 70 may be configured to receive HTTP GET or partial GET requests and, in response to said requests, provide data for multimedia content 64. A request may specify a segment of one of representations 68 (e.g., using a URL of that segment). In some examples, the request may also specify one or more byte ranges of the segment, and thus constitute a partial GET request. Request processing unit 70 may also be configured to service HTTP HEAD requests to provide header data for a segment of one of representations 68. In any case, request processing unit 70 may be configured to process the request to provide the requested data to the requesting device, such as client device 40.
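The three request types mentioned above (GET, partial GET, and HEAD) can be sketched from the client side with Python's standard library. The helper names are hypothetical and any segment URL passed in is an invented placeholder. The sketch only constructs request objects and does not contact a server.

```python
from urllib.request import Request


def segment_get(url: str) -> Request:
    """Full GET for a whole segment."""
    return Request(url, method="GET")


def segment_partial_get(url: str, first_byte: int, last_byte: int) -> Request:
    """Partial GET: a GET carrying a Range header for one inclusive byte range."""
    byte_range = f"bytes={first_byte}-{last_byte}"
    return Request(url, headers={"Range": byte_range}, method="GET")


def segment_head(url: str) -> Request:
    """HEAD request: asks only for the header data of the segment."""
    return Request(url, method="HEAD")
```

A request built this way could then be passed to `urllib.request.urlopen` to actually retrieve the data; for example, `segment_partial_get(url, 0, 1023)` would ask for the first 1024 bytes of the segment at `url`.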

[0055] Alternatively, request processing unit 70 may be configured to deliver media data via a broadcast or multicast protocol such as eMBMS. Content preparation device 20 may create DASH segments and/or sub-segments in substantially the same manner as described, but server device 60 may use eMBMS or another broadcast or multicast network transport protocol to deliver these segments or sub-segments. For example, request processing unit 70 may be configured to receive multicast group join requests from client device 40. That is, server device 60 may advertise to client devices, including client device 40, the Internet Protocol (IP) address associated with a multicast group that is associated with specific media content (e.g., a broadcast of a live event). Client device 40 may then submit a request to join the multicast group. This request may be propagated throughout network 74 (e.g., through the routers making up network 74), causing the routers to direct traffic destined for the IP address associated with the multicast group to subscribing client devices (such as client device 40).

[0056] As shown in the example of Figure 1, multimedia content 64 includes a manifest file 66, which may correspond to a Media Presentation Description (MPD). The manifest file 66 may contain descriptions of different alternative representations 68 (e.g., video services with different qualities), and this description may include, for example, codec information, profile values, level values, bitrate, and other descriptive characteristics of representations 68. Client device 40 may retrieve the MPD of the media presentation to determine how to access segments of representations 68.

[0057] Specifically, the retrieval unit 52 can retrieve configuration data (not shown) of the client device 40 to determine the decoding capabilities of the video decoder 48 and the rendering capabilities of the video output 44. The configuration data may also include any or all of the following: language preferences selected by the user of the client device 40, one or more camera angles corresponding to depth preferences set by the user of the client device 40, and / or rating preferences selected by the user of the client device 40. The retrieval unit 52 may include, for example, a web browser or media client configured to submit HTTP GET and partial GET requests. The retrieval unit 52 may correspond to software instructions executed by one or more processors or processing units (not shown) of the client device 40. In some examples, all or part of the functions described regarding the retrieval unit 52 may be implemented in hardware, or a combination of hardware, software, and / or firmware, wherein the necessary hardware may be provided to execute instructions for the software or firmware.

[0058] The retrieval unit 52 can compare the decoding and rendering capabilities of the client device 40 with the characteristics of the representation 68 indicated by the information in the manifest file 66. The retrieval unit 52 can initially retrieve at least a portion of the manifest file 66 to determine the characteristics of the representation 68. For example, the retrieval unit 52 can request a portion of the manifest file 66 describing the characteristics of one or more adaptation sets. The retrieval unit 52 can select a subset (e.g., an adaptation set) of representations 68 having characteristics that can be satisfied by the decoding and rendering capabilities of the client device 40. The retrieval unit 52 can then determine the bit rate for the representations in the adaptation set, determine the amount of currently available network bandwidth, and retrieve segments from one of the representations having a bit rate that the network bandwidth can satisfy.

[0059] Generally, higher bitrate representations produce higher quality video playback, while lower bitrate representations provide sufficient quality video playback when available network bandwidth is reduced. Accordingly, when available network bandwidth is relatively high, retrieval unit 52 can retrieve data from a relatively high bitrate representation, and when available network bandwidth is low, retrieval unit 52 can retrieve data from a relatively low bitrate representation. In this way, client device 40 can stream multimedia data over network 74 while adapting to varying network bandwidth availability.
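The bandwidth adaptation described above can be sketched as a simple selection rule: pick the highest-bitrate representation that the measured bandwidth can sustain, and fall back to the lowest-bitrate one when none fits. The function name, the fallback policy, and the example identifiers are assumptions made for illustration; the disclosure does not prescribe a particular selection algorithm.

```python
def select_representation(bitrates: dict, available_bps: int) -> str:
    """Return the id of the highest-bitrate representation that fits the bandwidth.

    `bitrates` maps a representation id to its bit rate in bits per second.
    If no representation fits, the lowest-bitrate one is returned (assumed policy).
    """
    affordable = {rid: bps for rid, bps in bitrates.items() if bps <= available_bps}
    if affordable:
        return max(affordable, key=affordable.get)
    return min(bitrates, key=bitrates.get)
```

For example, with representations at 0.5, 1.5, and 3 Mbit/s, a measured bandwidth of 2 Mbit/s selects the 1.5 Mbit/s representation, and the choice can be re-evaluated before each segment request as the measured bandwidth changes.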

[0060] Alternatively, the retrieval unit 52 can be configured to receive data according to a broadcast or multicast network protocol such as eMBMS or IP multicast. In such an example, the retrieval unit 52 can submit a request to join a multicast network group associated with specific media content. After joining the multicast group, the retrieval unit 52 can receive data from that multicast group without issuing additional requests to the server device 60 or the content preparation device 20. When the multicast group's data is no longer needed, the retrieval unit 52 can submit a request to leave the multicast group, for example, to stop playback or change the channel to a different multicast group.

[0061] Network interface 54 can receive data of segments of the selected representation and provide it to retrieval unit 52, which in turn can provide the segments to decapsulation unit 50. Decapsulation unit 50 can decapsulate the elements of the video file into constituent PES streams, depacketize the PES streams to retrieve the encoded data, and send the encoded data to audio decoder 46 or video decoder 48, depending on whether the encoded data is part of an audio stream or a video stream (e.g., as indicated by the PES packet header of the stream). Audio decoder 46 decodes the encoded audio data and sends the decoded audio data to audio output 42, while video decoder 48 decodes the encoded video data and sends the decoded video data (which may include multiple views of the stream) to video output 44.

[0062] The video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, encapsulation unit 30, retrieval unit 52, and decapsulation unit 50 can all be implemented as any of a variety of suitable processing circuits, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic circuits, software, hardware, firmware, or any combination thereof. Each of the video encoder 28 and video decoder 48 can be included in one or more encoders or decoders, either of which can be integrated as part of a combined video encoder / decoder (CODEC). Similarly, each of the audio encoder 26 and audio decoder 46 can be included in one or more encoders or decoders, either of which can be integrated as part of a combined CODEC. The apparatus including the video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, encapsulation unit 30, retrieval unit 52, and / or decapsulation unit 50 can include integrated circuits, microprocessors, and / or wireless communication devices (such as cellular phones).

[0063] Client device 40, server device 60, and / or content preparation device 20 may be configured to operate according to the techniques of this disclosure. For illustrative purposes, this disclosure describes these techniques with respect to client device 40 and server device 60. However, it should be understood that content preparation device 20 may be configured to perform these techniques in place of (or in addition to) server device 60.

[0064] Encapsulation unit 30 can form NAL units, which include a header identifying the program to which the NAL unit belongs and a payload (e.g., audio data, video data, or data describing the transmission or program stream corresponding to the NAL unit). For example, in H.264 / AVC, a NAL unit includes a 1-byte header and a variable-size payload. NAL units that include video data in their payload can include video data at various granularities. For example, a NAL unit can include blocks of video data, multiple blocks, slices of video data, or entire frames of video data. Encapsulation unit 30 can receive encoded video data from video encoder 28 in the form of PES packets of the elementary stream. Encapsulation unit 30 can associate each elementary stream with its corresponding program.

[0065] Encapsulation unit 30 can also assemble access units from multiple NAL units. Typically, an access unit may include one or more NAL units representing a frame of video data and the corresponding audio data (when such audio data is available). Access units typically include all NAL units for an output time instance, such as all audio and video data for a time instance. For example, if each view has a frame rate of 20 frames per second (fps), each time instance may correspond to a time interval of 0.05 seconds. During this time interval, specific frames for all views of the same access unit (same time instance) can be rendered simultaneously. In one example, an access unit may include a coded picture in one time instance, which can be presented as a primary coded picture.

[0066] Accordingly, an access unit may include all audio and video frames of a common time instance, for example, all views corresponding to time X. This disclosure also refers to the encoded image of a particular view as a "view component." That is, a view component may include an encoded image (or frame) for a particular view at a particular time. Accordingly, an access unit may be defined as including all view components of a common time instance. The decoding order of the access units does not necessarily need to be the same as the output or display order.
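The relationship between frame rate, time instances, and access units can be sketched as follows. The sketch computes the interval covered by one time instance and groups view components that share a time instance into one access unit. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def time_instance_interval(fps: float) -> float:
    """Duration in seconds covered by one time instance at the given frame rate."""
    return 1.0 / fps


def group_into_access_units(view_components) -> dict:
    """Group (time_index, view_id, payload) tuples by common time instance.

    Returns {time_index: {view_id: payload}}, ordered by time index. Note that
    this is output order; decoding order may differ, as stated above.
    """
    units = defaultdict(dict)
    for time_index, view_id, payload in view_components:
        units[time_index][view_id] = payload
    return dict(sorted(units.items()))
```

At 20 fps, `time_instance_interval(20)` gives 0.05 seconds, matching the example above, and a stereo stream yields access units that each hold two view components.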

[0067] A media presentation may include a Media Presentation Description (MPD), which may contain descriptions of different alternative representations (e.g., video services with different qualities), and this description may include, for example, codec information, profile values, and level values. An MPD is an example of a manifest file, such as manifest file 66. Client device 40 may retrieve the MPD of a media presentation to determine how to access the movie fragments of the various representations. Movie fragments may reside within a movie fragment box (moof box) of a video file.

[0068] The manifest file 66 (which may include, for example, an MPD) can announce the availability of segments of representations 68. That is, the MPD may include information indicating the clock time when a first segment of one of representations 68 becomes available, and information indicating the duration of segments within representations 68. In this way, the retrieval unit 52 of the client device 40 can determine when each segment is available based on the start time and duration of segments preceding a particular segment.
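A minimal sketch of this availability calculation follows, under the simplifying assumption that all segments of a representation have the same duration. The function and parameter names are hypothetical and the timestamp in the note is invented.

```python
from datetime import datetime, timedelta, timezone


def segment_available_at(first_available: datetime,
                         segment_duration_s: float,
                         index: int) -> datetime:
    """Clock time at which segment `index` (0-based) becomes available.

    Assumes equal-duration segments, so the segment at `index` becomes
    available `index` segment durations after the first segment.
    """
    return first_available + timedelta(seconds=segment_duration_s * index)
```

For instance, if the first segment becomes available at 12:00:00 UTC and each segment lasts 2 seconds, the segment with index 3 becomes available at 12:00:06 UTC, so a client need not request it before that time.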

[0069] After the encapsulation unit 30 has assembled the NAL units and / or access units into a video file based on the received data, the encapsulation unit 30 passes the video file to the output interface 32 for output. In some examples, the encapsulation unit 30 may store the video file locally or send the video file to a remote server via the output interface 32, instead of sending the video file directly to the client device 40. The output interface 32 may include, for example, a transmitter, transceiver, a device for writing data to a computer-readable medium (such as, for example, an optical drive, a magnetic media drive (e.g., a floppy disk drive)), a universal serial bus (USB) port, a network interface, or other output interface. The output interface 32 outputs the video file to a computer-readable medium, such as, for example, a transmission signal, magnetic media, optical media, memory, flash memory, or other computer-readable media.

[0070] Network interface 54 can receive NAL units or access units via network 74 and provide NAL units or access units to decapsulation unit 50 via retrieval unit 52. Decapsulation unit 50 can decapsulate the elements of the video file into constituent PES streams, depacketize the PES streams to retrieve encoded data, and send the encoded data to audio decoder 46 or video decoder 48 (depending on whether the encoded data is part of an audio stream or a video stream, as indicated by the PES packet header of the stream). Audio decoder 46 decodes the encoded audio data and sends the decoded audio data to audio output 42, while video decoder 48 decodes the encoded video data and sends the decoded video data (which may include multiple views of the stream) to video output 44.

[0071] According to the techniques of this disclosure, client device 40 can be configured to perform standalone rendering or split rendering using a separate device. For example, in some examples, video output 44 may be a device separate from client device 40, such as shown in Figure 2. Typically, according to the techniques of this disclosure, client device 40 can be configured to render extended reality (XR) data. Specifically, the XR data may correspond to a scene including an XR scene. The XR scene may include one or more desired virtual objects, which may include dynamic virtual objects. Dynamic virtual objects are typically animated objects that can move as the XR scene is presented to the user. For example, in an augmented reality (AR) use case for a virtual gym, dynamic virtual objects may include a coach or another student. Dynamic virtual objects may be represented by dynamic meshes, animated meshes, or point clouds. Dynamic virtual objects may include one or more dynamic media components (e.g., textures for 3D virtual objects) and zero or more static components. For example, the geometry for the virtual object may be static, but the texture may be dynamic. Client device 40 may be configured to retrieve entry point data for the scene from, for example, server device 60. The entry point data may include information about desired virtual objects, which include dynamic virtual objects and dynamic media components.

[0072] Using the entry point data, client device 40 can initialize (e.g., with server device 60) a number of streaming sessions, the number of which equals the number of dynamic virtual objects (or the number of dynamic media components for each of the dynamic virtual objects). That is, client device 40 can initialize one streaming session for each dynamic virtual object or each of its dynamic media components. Therefore, for example, if there are three dynamic virtual objects, client device 40 can initialize three streaming sessions, one for each of the dynamic virtual objects.
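The one-session-per-object rule can be sketched as follows. The shape of the entry point data (a dictionary with an "objects" list and "dynamic" / "dynamic_components" keys) and all names are hypothetical, since the disclosure does not fix a concrete data format here. The sketch only plans which sessions to open and does not perform any network signalling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingSession:
    object_id: str
    component_id: Optional[str] = None  # set only when sessions are per component


def plan_sessions(entry_point: dict, per_component: bool = False) -> list:
    """Plan one session per dynamic virtual object, or per dynamic media component."""
    sessions = []
    for obj in entry_point["objects"]:
        if not obj.get("dynamic"):
            continue  # static objects need no streaming session
        if per_component:
            for component in obj.get("dynamic_components", []):
                sessions.append(StreamingSession(obj["id"], component))
        else:
            sessions.append(StreamingSession(obj["id"]))
    return sessions
```

With three dynamic virtual objects and one static object in the entry point data, `plan_sessions` returns three sessions, matching the example above.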

[0073] As part of initializing a streaming session, client device 40 can configure Quality of Service (QoS) and billing information for the streaming session. For example, QoS and billing information can be handled by a Policy Control Function (PCF). Each of the dynamic virtual objects can be of a specific object type that carries its own streaming requirements, and different types of objects can be associated with different QoS requirements. For example, dynamic virtual objects can be two-dimensional (2D) or three-dimensional (3D) objects. Typically, 3D objects may require a higher bitrate stream compared to 2D objects because 3D objects require at least two distinct images (left-eye image and right-eye image) to be displayed correctly in 3D. Higher bitrates may also result in higher billing costs due to higher bandwidth consumption.

[0074] As another example, media streams can have varying bitrates and consume significant bandwidth, for instance, due to the rendered size / resolution of the corresponding dynamic virtual object. Different qualities may be available for the dynamic virtual object. Therefore, QoS and billing information can vary based on the size of the dynamic virtual object and / or the quality of the corresponding media stream.

[0075] In some cases, the location of dynamic virtual objects may need to be precisely positioned relative to the user's position within the XR scene. For example, if a user is interacting with a dynamic virtual object (e.g., in a virtual meeting or video game), one or more dynamic virtual objects may require precise, accurate user location information for the streaming session associated with that object. Therefore, QoS and billing information may need to take into account the need for accurate user location information. For instance, streaming sessions requiring accurate user location information may be assigned a higher QoS compared to streaming sessions associated with other dynamic virtual objects, such as head-up display (HUD) elements. For example, if the game is a sports game, a dynamic virtual object representing a ball, such as a baseball or soccer ball, may require accurate user location information so that the user can interact with the ball (hit, catch, throw, etc.).

[0076] As described above, in some cases, client device 40 can be configured to perform split rendering of XR data. When performing split rendering, an ultra-low latency coding structure can be used to code the media data. For example, video frames can be coded using an IPPP structure, where the first frame is coded using intra prediction, while subsequent frames are coded using unidirectional inter prediction. Frames in an IPPP structure are not coded using bidirectional inter prediction, and therefore the bitrate used for such a coding structure can be approximately 30% higher than for other coding structures such as IBP. Furthermore, each coded frame can be frame-packed, meaning that a frame can include data for both the left-eye and right-eye views, packaged together as a single frame. Therefore, QoS and billing requirements can take this coding structure into account. For example, when client device 40 is configured to perform split rendering, the bitrate (such as the minimum bitrate) used for the split-rendered media stream can be higher than the bitrate (such as the minimum bitrate) used for the non-split-rendered media stream.
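The effect of the coding structure on the requested bitrate can be sketched with simple arithmetic. The sketch applies the approximate 30% overhead mentioned above when split rendering is in use; the constant name, the function name, and the idea of scaling a single baseline figure are assumptions for illustration rather than a prescribed QoS procedure.

```python
# Approximate overhead of IPPP relative to structures such as IBP, taken from
# the "approximately 30%" figure in the text; treat it as a rough estimate.
IPPP_OVERHEAD = 0.30


def min_bitrate_bps(base_bps: int, split_rendering: bool) -> int:
    """Minimum bitrate to request for a media stream.

    `base_bps` is a hypothetical baseline for a non-split-rendered stream.
    """
    if split_rendering:
        return round(base_bps * (1 + IPPP_OVERHEAD))
    return base_bps
```

With a baseline of 10 Mbit/s, the split-rendered stream would be provisioned at about 13 Mbit/s, which is the kind of difference the QoS and billing configuration may need to reflect.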

[0077] Figure 2 is a block diagram illustrating an example computing system 100 capable of performing the techniques described herein. In this example, the computing system 100 includes an extended reality (XR) server device 110, a network 130, an XR client device 140, and a display device 152. The XR server device 110 includes an XR scene generation unit 112, an XR viewport pre-rendering rasterization unit 114, a 2D media encoding unit 116, an XR media content delivery unit 118, and a 5G system (5GS) delivery unit 120. The XR server device 110 may also include components that perform the functions attributed to the content preparation device 20 and the server device 60 of Figure 1. For example, the 5GS delivery unit 120 may correspond to the network interface 72 of Figure 1, the XR scene generation unit 112 and the XR viewport pre-rendering rasterization unit 114 can correspond to the video source 24 of Figure 1, the 2D media encoding unit 116 can correspond to the video encoder 28, and the XR media content delivery unit 118 can correspond to the encapsulation unit 30 and the request processing unit 70 of Figure 1.

[0078] Network 130 can generally correspond to network 74 of Figure 1. Network 130 can correspond to any network of computing devices that communicate according to one or more network protocols (such as the Internet). Specifically, network 130 may include a 5G radio access network (RAN), which includes access devices to which XR client device 140 connects in order to access network 130 and XR server device 110. In other examples, other types of networks, such as other types of RANs, may be used.

[0079] XR client device 140 includes a 5GS delivery unit 150, a tracking / XR sensor 146, an XR viewport rendering unit 142, a 2D media decoder 144, and an XR media content delivery unit 148. XR client device 140 also interacts with display device 152 to present XR media data to a user (not shown). XR client device 140 may include components that perform the functions attributed to the client device 40 of Figure 1. For example, the 5GS delivery unit 150 may correspond to the network interface 54 of Figure 1, and the XR media content delivery unit 148 can correspond to the retrieval unit 52 of Figure 1.

[0080] In some examples, the XR scene generation unit 112 may correspond to an interactive media entertainment application (such as a video game), which may be executed by one or more processors implemented in the circuitry of the XR server device 110. The XR viewport pre-rendering rasterization unit 114 may format the scene data generated by the XR scene generation unit 112 into pre-rendered two-dimensional (2D) media data (e.g., video data) for the viewport of the user of the XR client device 140. The 2D media encoding unit 116 may encode the formatted scene data from the XR viewport pre-rendering rasterization unit 114, for example, using a video coding standard such as ITU-T H.264 / Advanced Video Coding (AVC), ITU-T H.265 / High Efficiency Video Coding (HEVC), ITU-T H.266 / Versatile Video Coding (VVC), etc. In this example, the XR media content delivery unit 118 represents a content delivery transmitter. In this example, the XR media content delivery unit 148 represents a content delivery receiver, and the 2D media decoder 144 can perform error handling.

[0081] Typically, the XR client device 140 can determine the user's viewport, such as the direction the user is looking and the user's physical location, which can correspond to the orientation and geographic location of the XR client device 140. A tracking / XR sensor 146 can determine such location and orientation data, for example, using a camera, accelerometer, magnetometer, gyroscope, etc. The tracking / XR sensor 146 provides the location and orientation data to the XR viewport rendering unit 142 and the 5GS delivery unit 150. The XR client device 140 provides tracking and sensor information 132 to the XR server device 110 via network 130. The XR server device 110 then receives the tracking and sensor information 132 and provides this information to the XR scene generation unit 112 and the XR viewport pre-rendering rasterization unit 114. In this way, the XR scene generation unit 112 can generate scene data for the user's viewport and location, and then use the XR viewport pre-rendering rasterization unit 114 to pre-render 2D media data for the user's viewport. Therefore, the XR server device 110 can deliver encoded, pre-rendered 2D media data 134 to the XR client device 140 via network 130 (e.g., using a 5G radio configuration).

[0082] The XR scene generation unit 112 can receive data representing the type of multimedia application (e.g., video game type), the application's state, multiple user actions, etc. The XR viewport pre-rendering rasterization unit 114 can format the rasterized video signal. The 2D media encoding unit 116 can be configured with a specific encoder / decoder (codec), bitrate for media encoding, rate control algorithm and corresponding parameters, data for forming slices of pictures of the video data, low-latency coding parameters, error resilience parameters, intra prediction parameters, etc. The XR media content delivery unit 118 can be configured with Real-Time Transport Protocol (RTP) parameters, rate control parameters, error resilience information, etc. The XR media content delivery unit 148 can be configured with feedback parameters, error concealment algorithms and parameters, post-correction algorithms and parameters, etc.

[0083] Raster-based split rendering refers to a scenario where XR server device 110 runs an XR engine (e.g., XR scene generation unit 112) to generate an XR scene based on information from an XR device (e.g., XR client device 140) and tracking and sensor information 132. XR server device 110 can rasterize the XR viewport and perform XR pre-rendering using XR viewport pre-rendering rasterization unit 114.

[0084] In the example of Figure 2, the viewport is primarily rendered on the XR server device 110, but the XR client device 140 is capable of performing correction to the latest pose, such as using asynchronous time warp (ATW) or other XR pose correction methods to address pose variations. XR graphics workloads can be divided into rendering workloads on the powerful XR server device 110 (in the cloud or at the edge) and pose corrections (such as ATW) on the XR client device 140. Low motion-to-photon latency is maintained through on-device ATW or other pose correction methods performed by the XR client device 140.

[0085] In some examples, the latency between the XR server device 110 rendering the video data and the XR client device 140 receiving such pre-rendered video data can be in the range of 50 milliseconds (ms). Although the latency for the XR client device 140 to provide location and position (e.g., pose) information can be lower (e.g., 20 ms), the XR server device 110 may perform asynchronous time warp to compensate for the latest pose in the XR client device 140.

[0086] The following call flow is an example that emphasizes the steps involved in performing these techniques:

[0087] 1) XR client device 140 connects to network 130 and joins an XR application (e.g., executed by XR scene generation unit 112).

[0088] a) XR client device 140 sends static device information and capabilities (supported decoders, viewports).

[0089] 2) Based on this information, the XR server device 110 sets the encoder and format.

[0090] 3) Loop:

[0091] a) The XR client device 140 uses the tracking / XR sensor 146 to collect XR poses (or predicted XR poses).

[0092] b) The XR client device 140 sends XR pose information to the XR server device 110 in the form of tracking and sensor information 132.

[0093] c) The XR server device 110 uses tracking and sensor information 132 to pre-render the XR viewport via the XR scene generation unit 112 and the XR viewport pre-rendering rasterization unit 114.

[0094] d) The 2D media encoding unit 116 encodes the XR viewport.

[0095] e) The XR media content delivery unit 118 and the 5GS delivery unit 120 send the compressed media, along with data representing the viewport's XR pose for rendering, to the XR client device 140.

[0096] f) The XR client device 140 uses the 2D media decoder 144 to decompress video data.

[0097] g) The XR client device 140 uses XR pose data provided along with video frames and the actual XR pose from the tracking / XR sensor 146 to improve predictions and correct local poses, for example, using ATW performed by the XR viewport rendering unit 142.
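The loop of step 3 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes a yaw-only pose and approximates the pose correction of step g) as a horizontal pixel shift at the image center under a pinhole camera model. A real asynchronous time warp reprojects the whole frame. The names `Pose`, `EncodedFrame`, `server_render`, and `client_correct`, as well as the default resolution and field of view, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    yaw_deg: float  # head rotation about the vertical axis, in degrees


@dataclass
class EncodedFrame:
    render_pose: Pose  # pose the server used for pre-rendering (sent in step e)
    payload: bytes     # stand-in for the compressed viewport


def server_render(pose: Pose) -> EncodedFrame:
    """Steps c) and d): pre-render and encode the viewport for the reported pose."""
    return EncodedFrame(render_pose=pose, payload=b"frame")


def client_correct(frame: EncodedFrame, latest: Pose,
                   width_px: int = 1920, hfov_deg: float = 90.0) -> int:
    """Step g): horizontal shift (in pixels) that compensates for the yaw change
    between the pose used for rendering and the latest locally sensed pose."""
    delta_deg = latest.yaw_deg - frame.render_pose.yaw_deg
    # Focal length in pixels for a pinhole camera with the given field of view.
    focal_px = (width_px / 2) / math.tan(math.radians(hfov_deg / 2))
    return round(focal_px * math.tan(math.radians(delta_deg)))


# One loop iteration: the client reports a pose (steps a and b), the server
# renders for it, and the client corrects using the pose sensed at display time.
frame = server_render(Pose(yaw_deg=10.0))
shift_px = client_correct(frame, Pose(yaw_deg=12.0))
```

Sending the render pose along with each frame (step e) is what lets the client compute the correction locally, without another round trip to the server.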

[0098] Various types of client devices (also referred to as "user equipment" or "UE") can perform XR. XR client device 140 can conform to one or more of these different types. Table 1 below describes several different types of client devices that can perform split rendering of XR data over a 5G network. Typically, split rendering refers to rendering an image through two or more different devices. In one example, split rendering can be defined as follows:

[0099] Tethered devices or external entities (such as cloud or edge devices) perform some preprocessing (e.g., pre-rendering the viewport based on sensor and pose information), and the XR device and / or tethered devices perform rendering (e.g., applying pose correction) while taking into account the latest sensor information. There can be varying degrees of splitting between the different devices and entities. Similarly, vision engine functions and other XR / AR / MR functions (such as AR / MR media reconstruction, encoding, and decoding) can be subject to split computation.

[0100] Table 1 shows various examples of device types participating in XR, how these devices connect to access information, where the 5G Uu modem is expected to be placed, where the basic AR functionality is placed, where the AR / MR functionality is placed, where the AR / MR application runs, and where the power supply / battery is located. In all glasses device types, the sensors, camera, and microphone are assumed to be located on the device (UE) itself.

[0101] Table 1: Types of 5G Augmented Reality Devices

[0102]

[0103]

[0104] Type 1 5G standalone AR (STAR) UE can have the following characteristics:

[0105] • STAR UE is a standard 5G UE. It provides 5G connectivity through an embedded 5G modem.

[0106] • User controls are local and obtained from sensors, audio inputs, or video inputs.

[0107] • AR / MR functionality may be integrated into AR / MR devices or separated from them.

[0108] • Some devices may have limited support for immersive media decoding and rendering, and may need to rely on 5G cloud / edge.

[0109] • In this case, the STAR UE can be assisted by the edge.

[0110] • AR / MR applications reside on the device.

[0111] • Due to the required processing power, such devices are likely to require higher power consumption compared to other types of devices.

[0112] • Functionality is more important than design.

[0113] • Because the device includes all UE functions, applications reside on and primarily execute on this device, and all basic AR / MR functions are available for typical media processing use cases, this device is referred to as a standalone AR (STAR) UE.

[0114] Type 2 5G EDGe-dependent AR (EDGAR) UEs can have the following characteristics:

[0115] • The 5G EDGAR UE is a standard 5G UE. It provides 5G connectivity through an embedded 5G modem.

[0116] • User controls are local and obtained from sensors, audio inputs, or video inputs.

[0117] • Media processing is local; the device needs to embed all the media codecs required to decode the pre-rendered viewport.

[0118] • Basic AR functionality is local to the AR / MR device, while AR / MR functionality resides on the 5G cloud / edge.

[0119] • The main AR / MR applications reside in the cloud / edge, but basic application functions reside on the UE to support regular UE functions as well as launch services and applications.

[0120] • The power consumption of such glasses must be low enough to accommodate the form factor. Heat dissipation is crucial.

[0121] • Design is often more important than function.

[0122] • While the EDGAR UE can have additional features (such as those available in the STAR UE), for media-centric use case processing, edge support is typically required.

[0123] Type 3 5G wireless tethered AR UEs can have the following characteristics:

[0124] • 5G connectivity is provided through a tethered device with an embedded 5G modem. The wireless tethering connection is via WiFi or 5G sidelink. BLE (Bluetooth Low Energy) connectivity can be used for audio.

[0125] • User controls are primarily provided locally to AR / MR devices; some remote user interactions can also be initiated from the tethered device.

[0126] • AR / MR functions (including SLAM / registration and pose correction) can be integrated into the AR / MR device or separated.

[0127] • While media processing (for 2D media) can be done locally on the AR glasses, heavy AR / MR media processing can be done on the AR / MR tethered device or split.

[0128] • Some devices may have limited support for immersive media decoding and rendering, and may need to rely on 5G cloud / edge.

[0129] • While these devices will likely use significantly less processing power compared to Type 1 5G STAR devices by leveraging the processing capabilities of tethered devices, they can still support a substantial amount of local media and AR / MR processing. Such devices are expected to offer 8-10 hours of battery life while maintaining a much lower weight.

[0130] • The tethered glasses themselves are not a regular 5G UE, but the combination of the glasses and the mobile phone achieves a regular 5G UE.

[0131] An augmented reality (AR) use case for a virtual gym could be as follows: A user launches a virtual coach application on AR glasses (e.g., client device 140). The AR glasses present a list of available training routines. The user selects a routine for morning exercise from these routines. The AR glasses then present a virtual coach and another student in the user's room. Background music is played via a virtual speaker through a physical speaker (e.g., built into the AR glasses or another device in the user's room). The AR glasses then present the virtual coach and the other student beginning their workout, along with voice instructions provided by the virtual coach.

[0132] Various components of the XR client device 140 can form part of one of an AR runtime, a scene manager, or a 5G media client. For example, the tracking / XR sensor 146 can represent the AR runtime, the XR viewport rendering unit 142 can represent the scene manager, and the 5GS delivery unit 150 can represent the 5G media client. Typically, the AR runtime can expose access to the AR device's functionality via an API, the scene manager can provide the functionality to parse a scene description and then use it to retrieve media, process input, and render the scene, and the 5G media client can represent a set of functions that enable access to media and (e.g., from XR server device 110) request resources to support an AR session.

[0133] XR data can include entry points, dynamic virtual objects, static virtual objects, and spatial audio. Entry points can include scene descriptions of objects within the scene. Dynamic virtual objects can be dynamic meshes, animated meshes, point clouds, etc. Typically, dynamic virtual objects can move within the XR scene, and sound can originate from a corresponding location of the dynamic virtual object. Static objects can be static meshes and can represent locations from which audio can originate. Spatial audio can represent vocalizations (e.g., speech) from people (represented as dynamic virtual objects) and / or other sound elements (e.g., music, white noise, etc.) whose sources are static or dynamic virtual objects.

[0134] Figure 3 is a block diagram illustrating an example client device configured as a 5G standalone AR (STAR) user equipment (UE) device 160 according to the technology of this disclosure. The XR client device 140 of Figure 2 can be configured according to the example of Figure 3.

[0135] In the example of Figure 3, the 5G STAR UE device 160 includes a sensor 162, a camera 164, a vision engine 166, a user interface 180, an AR / MR application 182, a 5G media streaming downlink (5GMSd)-aware application 184, a media session handler (MSH) 186, a scene graph processing unit 176, an access client 188, an immersive media decoder 190, an immersive visual renderer 192, an immersive audio renderer 194, a compositing unit 178, a pose correction unit 172, a sound field mapping unit 174, a display 168, and a speaker 170. Each of these various units can be implemented in hardware, software, or firmware, or a combination thereof. When implemented in software or firmware, instructions for the software or firmware can be stored in hardware memory and executed by the necessary hardware processing circuitry.

[0136] Sensor 162 may be, for example, a gyroscope sensor, configured to detect user pose information. Sensor 162 and camera 164 collect pose information and images, and pass the pose and image data to vision engine 166. Vision engine 166 may provide pose information to pose correction unit 172, compositing unit 178, immersive visual renderer 192, and access client 188. User interface 180 may include, for example, a game controller, buttons, joysticks, etc., for collecting user input. User interface 180 may pass user input to augmented reality / mixed reality (AR / MR) application 182.

[0137] AR / MR application 182 and 5GMSd-aware application 184 can be the same application or separate applications communicating with each other. Typically, AR / MR application 182 can obtain user input from user interface 180 and from other users and / or from servers associated with the application (such as 5GMSd+AR / MR application provider 200) via 5GMSd-aware application 184. AR / MR application 182 can determine what to present to the user of 5G STAR UE device 160 based on various inputs from servers, users, and other users, such as virtual objects to be displayed, animations to be applied to dynamic virtual objects, etc. Animations can be streamed as dynamic media components for dynamic virtual objects. 5GMSd-aware application 184 can pass information to media session handler (MSH) 186, which can also receive information from 5GMSd application function (AF) 202.

[0138] MSH 186 can provide this information to access client 188, which can also receive one or more media streams from 5GMSd application server (AS) 210. Specifically, according to the technology of this disclosure, each of the various dynamic virtual objects can be associated with a corresponding media streaming session, e.g., with 5GMSd AS 210. That is, media data for each of the various dynamic virtual objects can be sent to 5G STAR UE device 160 via corresponding different media streaming sessions. Each of the media streaming sessions can have a corresponding manifest file (e.g., an MPD) received from manifest server 212 and media data provided by segment server 214.

[0139] Access client 188 can initialize each of the various media streaming sessions, for example, as described below with respect to Figures 5 and 6. According to the technology of this disclosure, each of the media streaming sessions can have a corresponding associated Quality of Service (QoS) and charging configuration, for example, depending on the type of dynamic virtual object. For example, the QoS and charging configuration may depend on whether the corresponding dynamic virtual object is a 2D or 3D object, whether accurate user location information is required, the amount of bandwidth required for the media streaming session, whether the 5G STAR UE device 160 is configured to perform split rendering (in the example of Figure 3, the 5G STAR UE device 160 does not perform split rendering), etc. The access client 188 can receive media data for each media streaming session and provide the media data to the immersive media decoder 190.

[0140] By configuring QoS and charging information for each dynamic virtual object, the 5G STAR UE device 160 can deliver media data for certain dynamic virtual objects with a higher priority than other media data. For example, if a dynamic virtual object requires accurate location information for a user of the 5G STAR UE device 160 (e.g., for a dynamic virtual object for which collision detection with the user of the 5G STAR UE device 160 is enabled), then such location information may need to be provided to the 5GMSd AS 210 more urgently than other data. Providing media data for different dynamic virtual objects in separate, corresponding media streams allows for the priority delivery of certain media data (and the inputs used to generate the media data) and the non-priority (best-effort) delivery of other media data, which increases flexibility in utilizing available network bandwidth. Furthermore, by configuring QoS and charging information separately, QoS for all streaming sessions can be more easily implemented.
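The effect of prioritized versus best-effort delivery can be illustrated with a small ordering sketch. It is only an application-level analogy under stated assumptions: in a 5G system, prioritization is enforced by the network through QoS flows, not by a queue like this one. The function name `delivery_order`, the numeric priority levels, and the item names are hypothetical.

```python
import heapq
from itertools import count


def delivery_order(items):
    """Return item names in delivery order.

    `items` is an iterable of (priority, name) pairs, where a lower number
    means a higher priority. Items with equal priority keep arrival order.
    """
    arrival = count()  # tie-breaker that preserves arrival order
    heap = []
    for priority, name in items:
        heapq.heappush(heap, (priority, next(arrival), name))
    ordered = []
    while heap:
        _, _, name = heapq.heappop(heap)
        ordered.append(name)
    return ordered


# Priority 0: data tied to an object with collision detection enabled.
# Priority 1: best-effort data for other objects.
queue = [
    (1, "student_mesh"),
    (0, "user_location_for_coach"),
    (1, "background_music"),
    (0, "coach_mesh"),
]
print(delivery_order(queue))
# ['user_location_for_coach', 'coach_mesh', 'student_mesh', 'background_music']
```

Because each dynamic virtual object has its own stream, a priority can be attached per object instead of to the XR scene as a whole.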

[0141] Each element in the media stream can be associated with a corresponding element in the immersive media decoder 190. The immersive media decoder 190 can decode audio and video media data and pass the decoded audio data to the immersive audio renderer 194 and the decoded video data to the immersive visual renderer 192. The immersive visual renderer 192 can render video data for each of the various dynamic virtual objects and provide the rendered media data to the compositing unit 178. The compositing unit 178 can compose a single frame (or multiple frames, e.g., left-eye and right-eye frames for 3D rendering) including data for each of the various dynamic virtual objects and provide these frames to the pose correction unit 172. The pose correction unit 172 can modify the composed frames according to the current user pose (e.g., by rotating or translating the image in the frame) and then provide the pose-corrected frames to the display 168 for display to the user.

[0142] The immersive audio renderer 194 can render audio data and provide the rendered audio data to the sound field mapping unit 174. The sound field mapping unit 174 can use pose information to modify the received rendered audio data, for example, based on the user's pose and the user's position relative to the object from which the audio is rendered.

[0143] Figure 4 is a block diagram illustrating another example client device configured as a 5G EDGe-dependent AR (EDGAR) user equipment (UE) device 220 according to the technology of this disclosure. The XR client device 140 of Figure 2 can be configured according to the example of Figure 4. In this example, the 5G EDGAR UE device 220 is configured to perform split rendering together with the 5G EDGE server device 250.

[0144] In this example, the 5G EDGAR UE device 220 includes a sensor 222, a camera 224, a microphone 226, a vision engine 228, an encoder 230, a decoder 232, a compositing unit 234, a pose correction unit 236, a display 238, a speaker 240, a 5G system 242, a user interface 244, and an AR / MR application 246. The 5G EDGE server device 250 in this example includes a 5GMSd application 252, an MSH 254, an access client 256, a decoder 258, a rendering unit 260, a compositing unit 262, a decoder 264, an encoder 266, and a 5G system 268. Each of these various units can be implemented in hardware, software, or firmware, or a combination thereof. When implemented in software or firmware, instructions for the software or firmware can be stored in hardware memory and executed by the necessary hardware processing circuitry.

[0145] Typically, the various components of the 5G EDGAR UE device 220 and the 5G EDGE server device 250 operate in essentially the same manner as the corresponding components of the 5G STAR UE device 160 of Figure 3, except that together they perform split rendering. That is, the 5G EDGE server device 250 performs a first rendering process and provides the result of the first rendering process to the 5G EDGAR UE device 220, which then performs a second rendering process to finally output the rendered video and audio data via the display 238 and the speaker 240.

[0146] In this example, the 5G EDGE server device 250 communicates with the 5GMSd+AR / MR application provider 270, 5GMSd 272, and 5GMSd AS 280. According to the technology of this disclosure, the 5G EDGE server device 250 initializes corresponding media streaming sessions with the 5GMSd AS 280 for dynamic virtual objects in an XR scene. As discussed above with respect to Figure 3, each of the media streaming sessions may include a corresponding manifest file provided by manifest server 282 and media data segments provided by segment server 284.

[0147] According to the technology disclosed herein, each of the media streaming sessions can have a corresponding associated Quality of Service (QoS) and charging configuration, for example, depending on the type of dynamic virtual object. For instance, the QoS and charging configuration may depend on whether the corresponding dynamic virtual object is a 2D or 3D object, whether accurate user location information is required, the amount of bandwidth required for the media streaming session, whether the 5G EDGAR UE device 220 is configured to perform split rendering (in the example of Figure 4, the 5G EDGAR UE device 220 is configured to perform split rendering), etc. The access client 256 can receive media data for each media streaming session and provide the media data to the decoder 258.

[0148] In this example, the first rendering process includes decoding the media data for each media streaming session by decoder 258. Rendering unit 260 can render video data for each of the media streaming sessions, and compositing unit 262 can compose frames including rendered data for each dynamic virtual object. In this example, encoder 266 can then encode the rendered frames, and 5G EDGE server device 250 can transmit the encoded frames to 5G EDGAR UE device 220 via 5G system 268.

[0149] In this example, the 5G EDGAR UE device 220 performs a second rendering process after receiving an encoded rendered frame via the 5G system 242. The decoder 232 decodes the rendered frame, and the compositing unit 234 can further composite the frame to include data for one or more additional virtual objects. The pose correction unit 236 can then modify the composited frame to take into account updated user pose information collected by the sensor 222 and / or camera 224. Finally, the 5G EDGAR UE device 220 can output the frame via the display 238. Similarly, the 5G EDGAR UE device 220 can output audio data via the speaker 240. Although not shown in the example of Figure 4, the 5G EDGAR UE device 220 may further include a sound field mapping unit, as shown in the example of Figure 3, that can modify audio data based on updated pose information.

[0150] Figure 5 is a call flow diagram illustrating an example augmented reality session for a STAR user device according to the technology described in this disclosure. When the XR client device 140 is configured according to the example of Figure 3, the XR client device 140 can perform certain aspects of the call flow of Figure 5, such as those functions attributed to AR / MR application 182, its AR engine, immersive media decoder 190, the scene description handler (e.g., scene graph processing unit 176), and media session handler (MSH) 186. When the XR client device 140 is configured according to the example of Figure 4, the XR client device 140 can perform the split rendering discussed above with respect to Figure 4.

[0151] In the example of Figure 5, initially, the user launches the application. The application connects to the cloud to retrieve a list of workout routines for the user (400).

[0152] The application provider (AP) sends a list of routines to the application (402). Each routine is associated with an entry point used for that routine. The entry point is typically a scene description that describes objects in the scene and anchors the scene using world space.

[0153] The application receives a routine selection from the user (404).

[0154] The application retrieves the scene description for the selected routine from the application provider (406). The application also uses the entry point to initialize the Immersive Scene Renderer (ISR) (408).

[0155] The scene description handler parses the entry point to extract information about the desired objects in the scene and provides media access information to the application (410). In the example use case above, the coach, student, and speaker are three objects that will be rendered in the scene. The coach and student are examples of dynamic virtual objects. The speaker is an example of a static virtual object.

[0156] The application notifies the MSH that it will initiate two streaming sessions for two dynamic virtual objects (412). For example, each of the two streaming sessions can be a Protocol Data Unit (PDU) session, according to the PDU Session User Plane Protocol.

[0157] The MSH shares information with the AF, and based on the application provider's existing settings, the AF can request QoS and charging modifications to the PDU session (414). For example, the AF can notify the Policy Control Function (PCF) of this request. The PCF can then initiate or modify the PDU session. In some implementations, the anchor point for the PDU session can be the User Plane Function (UPF). The PCF can then ensure that the appropriate QoS flow is assigned to the appropriate PDU session via the UPF. By sharing this information, the MSH can be configured with streaming sessions that conform to the appropriate QoS and charging information.

[0158] The application creates a new XR session and anchors the scene to a selected space within the XR session, then begins media exchange. Specifically, in the example use case above, the application retrieves data for static objects in the scene (the speaker in this example) (418). The application then retrieves a manifest for object 1 (420) and a manifest for object 2 (422). In the example use case above, object 1 is the dynamic virtual object for the coach, while object 2 is the dynamic virtual object for the other student.

[0159] The application then configures the immersive video decoder (424) based on the components of each object. The application then retrieves the media segments (426) for each component of each object. The media decoder decodes the media segments (428) and passes the decoded media data to the immersive media renderer (430).

[0160] The immersive visual renderer periodically renders frames by iteratively determining the user's latest pose (432) and reconstructing each object and rendering it as a swap chain image. The swap chain image is then passed to the compositor for rendering (434).
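The per-object retrieval of steps 420 to 426 can be sketched as follows, treating the per-object list retrieved in steps 420 and 422 as a manifest (e.g., an MPD, as noted earlier). This is a minimal sketch under stated assumptions: plain dictionaries stand in for the manifest server and the segment server, and the manifest layout (a `"segments"` list of URLs), the `example.invalid` URLs, and the function name `retrieve_object_media` are hypothetical.

```python
def retrieve_object_media(object_ids, manifest_server, segment_server):
    """Fetch each object's manifest, then the media segments that manifest lists.

    Returns a dict mapping object id to its list of segment payloads, ready to
    be handed to a decoder.
    """
    media = {}
    for object_id in object_ids:
        manifest = manifest_server[object_id]  # steps 420 / 422
        media[object_id] = [
            segment_server[url] for url in manifest["segments"]  # step 426
        ]
    return media


# In-memory stand-ins for the manifest server and the segment server.
manifests = {
    "coach": {"segments": ["https://example.invalid/coach/1", "https://example.invalid/coach/2"]},
    "student": {"segments": ["https://example.invalid/student/1"]},
}
segments = {
    "https://example.invalid/coach/1": b"c1",
    "https://example.invalid/coach/2": b"c2",
    "https://example.invalid/student/1": b"s1",
}
media = retrieve_object_media(["coach", "student"], manifests, segments)
```

Keeping one manifest per object means each object's segments can be requested, decoded, and rendered independently of the others.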

[0161] Figure 6 is a call flow diagram illustrating an example augmented reality session for an EDGAR user device according to the technology described in this disclosure. When the XR client device 140 is configured according to the example of Figure 4, the XR client device 140 can perform certain aspects of the call flow of Figure 6, such as those functions attributed to the 5GMSd application 252 and the corresponding AR engine. An edge server (such as the 5G EDGE server device 250 of Figure 4) can participate in the split rendering. The call flow of Figure 6 also involves the immersive media decoder 258, a scene description handler, and the media session handler (MSH) 254.

[0162] The user launches the application. The application connects to the cloud to retrieve a list of media programs for the user (e.g., exercise routines) (440).

[0163] The application provider (AP) sends a list of programs (e.g., routines) to the application (442). Each routine is associated with an entry point used for that routine. The entry point is typically a scene description that describes objects in the scene and anchors the scene using world space.

[0164] The application receives a routine selection from the user (444).

[0165] The application sends a request for an entry point to the selected content (446). The application provider responds with the scene description entry point and a list of requirements for optimal processing of the scene. The application determines that edge support is required and sends a request to the MSH to discover an appropriate edge application server (AS) that can serve the application (448).

[0166] MSH sends a request to AF and receives a list of candidate edge application servers (EAS) (450).

[0167] MSH selects the appropriate EAS (452) from the candidate list.

[0168] The MSH provides the location of the EAS to the application (454).

[0169] The application connects to EAS and provides initialization information (456). Initialization information includes: the URL to the scene description entry point or the actual scene description, its current processing capabilities, supported formats and protocols, etc.

[0170] EAS configures the server application accordingly and generates a custom entry point (458) for the client. The format can depend on the UE's capabilities. EAS adjusts the amount of processing performed by EAS based on the application's current capabilities. For example, EAS can perform scene lighting and ray tracing, and then generate a simplified 3D scene description for the application. A less capable UE can receive a more planar scene that includes a stereoscopic viewpoint and some depth information.

[0171] The remaining steps are similar to steps 410 to 434 of the STAR call flow of Figure 5.
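The edge server discovery and selection of steps 448 to 454 can be sketched as a filter followed by a choice. The disclosure does not state the selection criteria, so this sketch assumes two illustrative ones: the EAS must offer every capability the scene requires, and among suitable candidates the one with the lowest measured latency is preferred. The `EdgeServer` fields, the capability names, and the `example.invalid` URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class EdgeServer:
    url: str                      # location handed back to the application (454)
    capabilities: FrozenSet[str]  # processing the server can take over
    latency_ms: float             # assumed to be known to the MSH


def select_eas(candidates: Iterable[EdgeServer],
               required: FrozenSet[str]) -> Optional[EdgeServer]:
    """Step 452: pick the lowest-latency candidate offering all required capabilities."""
    suitable = [c for c in candidates if required <= c.capabilities]
    return min(suitable, key=lambda c: c.latency_ms, default=None)


# Candidate list as it might be returned by the AF in step 450.
candidates = [
    EdgeServer("https://eas-a.example.invalid", frozenset({"lighting"}), 8.0),
    EdgeServer("https://eas-b.example.invalid", frozenset({"lighting", "ray_tracing"}), 15.0),
    EdgeServer("https://eas-c.example.invalid", frozenset({"lighting", "ray_tracing"}), 11.0),
]
chosen = select_eas(candidates, frozenset({"lighting", "ray_tracing"}))
```

Returning `None` when no candidate qualifies leaves the fallback policy, such as rendering entirely on the device, to the caller.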

[0172] In this way, the call flows of Figures 5 and 6 represent examples of a method for processing extended reality (XR) data, including: parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects greater than 1; initializing a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein initializing the number of streaming sessions includes configuring quality of service (QoS) and billing information for the streaming sessions; retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene, so as to include the retrieved media data at corresponding locations within the XR scene.

[0173] Figure 7 is a flowchart illustrating an example method for processing XR data according to the technology disclosed herein. The method of Figure 7 can typically be performed by an XR client device, such as client device 40 of Figure 1 or XR client device 140 of Figure 2. The XR client device can be configured to perform standalone rendering according to the example of the 5G STAR UE device 160 of Figure 3, or to perform split rendering according to the example of the 5G EDGAR UE device 220 of Figure 4. For illustrative purposes, the method of Figure 7 is explained with respect to the XR client device 140 of Figure 2.

[0174] Initially, the XR client device 140 may determine one or more dynamic virtual objects (500) for the XR scene. For example, the XR client device 140 may receive and parse a scene description including entry point data of the XR scene. The XR client device 140 may extract information about one or more desired virtual objects for the XR scene. The desired virtual objects may include one or more dynamic virtual objects, i.e., virtual objects designed to change over time. In some examples, the desired virtual objects may also include static virtual objects.

[0175] XR client device 140 can initialize (502) a media streaming session for the XR scene and an additional media streaming session for each of the dynamic virtual objects. Therefore, if there are N dynamic virtual objects, XR client device 140 can initialize N+1 media streaming sessions, one for the XR scene and one for each of the dynamic virtual objects.
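The N+1 session count described in this paragraph can be sketched as below. The sketch assumes a hypothetical, simplified scene description in which each object carries an `"id"` and a `"kind"` of either `"dynamic"` or `"static"`; an actual entry point format may differ. The function name `plan_streaming_sessions` is likewise hypothetical.

```python
def plan_streaming_sessions(scene_description: dict) -> list:
    """Return one session label for the scene plus one per dynamic virtual object."""
    sessions = ["scene"]  # the session for the XR scene itself
    for obj in scene_description.get("objects", []):
        if obj.get("kind") == "dynamic":
            sessions.append(obj["id"])  # one session per dynamic virtual object
    return sessions


# The virtual gym use case: two dynamic objects and one static object.
scene = {
    "objects": [
        {"id": "coach", "kind": "dynamic"},
        {"id": "student", "kind": "dynamic"},
        {"id": "speaker", "kind": "static"},
    ]
}
print(plan_streaming_sessions(scene))  # ['scene', 'coach', 'student']
```

With N = 2 dynamic virtual objects, the sketch plans N+1 = 3 sessions; the static speaker does not receive a session of its own.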

[0176] In addition, XR client device 140 can determine the type of each dynamic virtual object and configure the Quality of Service (QoS) and charging for the corresponding streaming session (504). For example, XR client device 140 can determine QoS and charging based on whether the media data used for the dynamic virtual object is 2D or 3D, the amount of bandwidth required for the media streaming session, whether accurate user location information is required for the media streaming session, and / or whether XR client device 140 is configured to perform standalone rendering or split rendering.
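A decision of this kind can be sketched as a function of the four factors named above. The disclosure does not specify how the factors map to a configuration, so the rules, priority labels, and charging labels below are illustrative assumptions only, and the names `StreamRequirements` and `choose_qos_and_charging` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamRequirements:
    """Per-object inputs, named after the factors listed in the text."""
    is_3d: bool
    bandwidth_mbps: float
    needs_accurate_location: bool
    split_rendering: bool


def choose_qos_and_charging(req: StreamRequirements) -> dict:
    """Map the requirements to an illustrative QoS and charging configuration."""
    if req.needs_accurate_location:
        priority = "high"         # e.g., collision detection with the user
    elif req.is_3d or req.split_rendering:
        priority = "medium"
    else:
        priority = "best_effort"
    # Assumed rule: only prioritized streams reserve their required bandwidth.
    guaranteed = req.bandwidth_mbps if priority != "best_effort" else 0.0
    charging = "premium" if priority == "high" else "standard"
    return {"priority": priority, "guaranteed_mbps": guaranteed, "charging": charging}


coach = choose_qos_and_charging(StreamRequirements(True, 25.0, True, False))
overlay = choose_qos_and_charging(StreamRequirements(False, 0.3, False, False))
```

In the call flow of Figure 5, a configuration like this corresponds to what the AF requests for each PDU session in step 414.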

[0177] Then, the XR client device 140 can retrieve media data for the XR scene and dynamic virtual objects via the corresponding media streaming sessions (506). The XR client device 140 can decode the media data received via each of the media streaming sessions (508). The XR client device 140 can also render the received media data (510). The XR client device 140 can also compose video frames including the rendered media data (512). In some cases, the XR client device 140 can determine the current user pose information (514) and use the pose information to update the composed frame (516). Finally, the XR client device 140 can display the frame.

[0178] In this way, the method of Figure 7 represents an example of a method for processing extended reality (XR) data, comprising: parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects greater than 1; initializing a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein initializing a number of streaming sessions includes configuring quality of service (QoS) and billing information for the streaming sessions; retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene to include the retrieved media data at appropriate locations within the XR scene.

[0179] Various example techniques of this disclosure are summarized in the following clauses:

[0180] Clause 1: A method for processing extended reality (XR) data, the method comprising: parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initializing a number of streaming sessions, the number of streaming sessions being equal to the number of dynamic virtual objects; retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene to include the retrieved media data at a corresponding location within the XR scene.

[0181] Clause 2: The method described in Clause 1 further includes: creating an XR session; and anchoring the XR scene to a real-world space for the XR session.

[0182] Clause 3: The method according to any one of Clauses 1 and 2, wherein the desired virtual object further includes one or more static virtual objects, the method further includes: retrieving media data for each of the one or more static virtual objects, and wherein rendering the XR scene further includes: rendering the XR scene to include the retrieved media data for the one or more static virtual objects at a corresponding location within the XR scene.

[0183] Clause 4: The method according to any one of Clauses 1-3, wherein retrieving the media data for each of the number of dynamic virtual objects comprises: retrieving a manifest file for each of the number of dynamic virtual objects; and using the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

[0184] Clause 5: The method described in Clause 4, wherein the manifest file includes a Media Presentation Description (MPD).
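For illustration, one way a client might turn a per-object MPD into media segment requests is sketched below using Python's standard `xml.etree.ElementTree`. The sample MPD, the representation identifiers, and the bandwidth-based selection rule are assumptions; only the DASH namespace and the `$RepresentationID$` and `$Number$` template identifiers come from the DASH manifest format itself.

```python
import xml.etree.ElementTree as ET
from typing import List

DASH_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical manifest for a single dynamic virtual object.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <SegmentTemplate media="$RepresentationID$/seg_$Number$.m4s" startNumber="1"/>
      <Representation id="mesh_hi" bandwidth="8000000"/>
      <Representation id="mesh_lo" bandwidth="2000000"/>
    </AdaptationSet>
  </Period>
</MPD>"""


def segment_urls(mpd_xml: str, base_url: str, max_bandwidth: int, count: int) -> List[str]:
    """Return URLs of the first `count` media segments of one representation."""
    root = ET.fromstring(mpd_xml)
    aset = root.find("mpd:Period/mpd:AdaptationSet", DASH_NS)
    template = aset.find("mpd:SegmentTemplate", DASH_NS)
    reps = aset.findall("mpd:Representation", DASH_NS)
    # Assumed rule: best representation that fits, else the lowest one.
    fitting = [r for r in reps if int(r.get("bandwidth")) <= max_bandwidth]
    if fitting:
        chosen = max(fitting, key=lambda r: int(r.get("bandwidth")))
    else:
        chosen = min(reps, key=lambda r: int(r.get("bandwidth")))
    start = int(template.get("startNumber", "1"))
    media = template.get("media").replace("$RepresentationID$", chosen.get("id"))
    return [
        base_url + media.replace("$Number$", str(number))
        for number in range(start, start + count)
    ]


urls = segment_urls(SAMPLE_MPD, "https://example.com/obj1/", 3_000_000, 2)
```

A client following Clause 4 would repeat this once per dynamic virtual object, each with its own manifest.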

[0185] Clause 6: The method according to any one of Clauses 1-5 further includes: configuring an immersive video decoder for each of the number of dynamic virtual objects.

[0186] Clause 7: The method according to any one of Clauses 1-6 further comprises: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; and retrieving a scene description for the selected XR session among the available XR sessions, the scene description including the entry point data associated with the selected XR session among the available XR sessions.

[0187] Clause 8: The method according to any one of Clauses 1-6 further comprises: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; requesting the entry point data associated with the selected XR session among the available XR sessions; receiving the requested entry point data and data representing a requirement for optimal processing of a scene for the selected XR session among the available XR sessions; in response to determining that the requirement includes edge support, requesting data representing an edge application server (AS) for the selected XR session among the available XR sessions; sending initialization information for the selected XR session among the available XR sessions to the edge AS; and receiving custom entry point data for the selected XR session among the available XR sessions from the edge AS.
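The message flow of Clause 8 can be sketched with in-memory stand-ins for the network entities. Every class, method, field, and session name below is hypothetical and exists only to make the order of the steps concrete; no real 5G or edge API is implied.

```python
from typing import Dict, List


class FakeEdgeAS:
    """In-memory stand-in for an edge application server (hypothetical)."""

    def initialize(self, session_id: str, init_info: Dict) -> Dict:
        # Returns custom entry point data tailored to the client.
        return {"entry_point": f"edge/{session_id}/scene.json",
                "customized_for": init_info["device"]}


class FakeProvider:
    """In-memory stand-in for the XR application provider (hypothetical)."""

    def list_sessions(self) -> List[str]:
        return ["museum-tour", "chess-match"]

    def get_entry_point(self, session_id: str) -> Dict:
        needs_edge = session_id == "museum-tour"
        return {"entry_point": f"{session_id}/scene.json",
                "requirements": ["edge"] if needs_edge else []}

    def find_edge_as(self, session_id: str) -> FakeEdgeAS:
        return FakeEdgeAS()


def select_session(provider: FakeProvider, choice: int, init_info: Dict) -> Dict:
    sessions = provider.list_sessions()            # retrieve the list
    selected = sessions[choice]                    # receive the selection
    reply = provider.get_entry_point(selected)     # entry point + requirements
    if "edge" in reply["requirements"]:            # edge support required?
        edge_as = provider.find_edge_as(selected)  # request the edge AS
        return edge_as.initialize(selected, init_info)  # custom entry point
    return {"entry_point": reply["entry_point"]}


with_edge = select_session(FakeProvider(), 0, {"device": "hmd"})
without_edge = select_session(FakeProvider(), 1, {"device": "hmd"})
```

When no edge support is required, the sketch simply falls back to the entry point data originally returned for the selected session.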

[0188] Clause 9: The method according to any one of Clauses 1-8, wherein the entry point data includes a scene description, the scene description including information about the one or more desired virtual objects for the XR scene.

[0189] Clause 10: The method according to any one of Clauses 1-9, wherein the dynamic virtual object includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

[0190] Clause 11: The method according to any one of Clauses 1-10 further includes: retrieving audio data for at least one of the number of dynamic virtual objects, and presenting the retrieved audio data.

[0191] Clause 12: An apparatus for processing extended reality (XR) data, the apparatus comprising one or more units for performing the method according to any one of Clauses 1-11.

[0192] Clause 13: The device according to Clause 12, wherein the one or more units include one or more processors implemented in a circuit.

[0193] Clause 14: The device according to any one of Clauses 12 and 13 further includes: a display configured to display the XR data.

[0194] Clause 15: The device according to any one of Clauses 12-14, wherein the device includes one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.

[0195] Clause 16: The device according to any one of Clauses 12-15 further includes: a memory configured to store the XR data.

[0196] Clause 17: A computer-readable storage medium having instructions stored thereon, which, when executed, cause a processor of an apparatus for decoding video data to perform the method according to any one of Clauses 1-11.

[0197] Clause 18: An apparatus for processing extended reality (XR) data, the apparatus comprising: a unit for parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; a unit for initializing a number of streaming sessions, the number of streaming sessions being equal to the number of dynamic virtual objects; a unit for retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and a unit for sending the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.

[0198] Clause 19: A method for processing extended reality (XR) data, the method comprising: parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initializing a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein initializing the number of streaming sessions comprises initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0199] Clause 20: The method according to Clause 19, wherein configuring the QoS and charging information for the streaming session includes: for each of the dynamic virtual objects: determining the type of the dynamic virtual object; and determining the QoS and charging information based on the type of the dynamic virtual object.

[0200] Clause 21: The method according to Clause 20 further includes: for at least one of the dynamic virtual objects: determining whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is two-dimensional (2D) media data or three-dimensional (3D) media data; and determining the QoS and billing information based on whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is the 2D media data or the 3D media data.

[0201] Clause 22: The method according to Clause 19, wherein configuring the QoS and charging information for the streaming session includes: for each of the dynamic virtual objects: determining the amount of bandwidth required for the media data associated with the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session for the dynamic virtual object based on the required amount of bandwidth.

[0202] Clause 23: The method according to Clause 19, wherein configuring the QoS and charging information for the streaming session includes: for each of the dynamic virtual objects: determining that accurate user location information is required for the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session for the dynamic virtual object based on the determination that the accurate user location information is required.

[0203] Clause 24: The method according to Clause 19, wherein configuring the QoS and charging information for the streaming session includes: determining whether the rendering unit is configured to perform split rendering of the media data; when the rendering unit is not configured to perform split rendering, determining a first bit rate for the streaming session; and when the rendering unit is configured to perform split rendering, determining a second bit rate for the streaming session, the second bit rate being higher than the first bit rate.

[0204] Clause 25: The method described in Clause 19 further includes: creating an XR session; and anchoring the XR scene to a real-world space for the XR session.

[0205] Clause 26: The method according to Clause 19, wherein the desired virtual object further includes one or more static virtual objects, the method further includes: retrieving media data for each of the one or more static virtual objects, and wherein rendering the XR scene further includes: rendering the XR scene to include the retrieved media data for the one or more static virtual objects at a corresponding location within the XR scene.

[0206] Clause 27: The method according to Clause 19, wherein retrieving the media data for each of the number of dynamic virtual objects comprises: retrieving a manifest file for each of the number of dynamic virtual objects; and using the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

[0207] Clause 28: The method described in Clause 27, wherein the manifest file includes a Media Presentation Description (MPD).

[0208] Clause 29: The method described in Clause 1 further includes: configuring an immersive video decoder for each of the number of dynamic virtual objects.

[0209] Clause 30: The method according to Clause 1 further includes: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; and retrieving a scene description for the selected XR session among the available XR sessions, the scene description including the entry point data associated with the selected XR session among the available XR sessions.

[0210] Clause 31: The method according to Clause 1 further comprises: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; requesting the entry point data associated with the selected XR session among the available XR sessions; receiving the requested entry point data and data representing a requirement for optimal processing of a scene for the selected XR session among the available XR sessions; in response to determining that the requirement includes edge support, requesting data representing an edge application server (AS) for the selected XR session among the available XR sessions; sending initialization information for the selected XR session among the available XR sessions to the edge AS; and receiving custom entry point data for the selected XR session among the available XR sessions from the edge AS.

[0211] Clause 32: The method according to Clause 1, wherein the entry point data includes a scene description, the scene description including information about the one or more desired virtual objects for the XR scene.

[0212] Clause 33: The method according to Clause 1, wherein the dynamic virtual object includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

[0213] Clause 34: The method according to Clause 1 further includes: retrieving audio data for at least one of the number of dynamic virtual objects, and presenting the retrieved audio data.

[0214] Clause 35: An apparatus for processing extended reality (XR) data, the apparatus comprising: a memory configured to store XR data and media data; and one or more processors implemented in circuitry and configured to: parse entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein, in order to initialize the number of streaming sessions, the one or more processors are configured to initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each of the number of dynamic virtual objects via one of the streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0215] Clause 36: The device according to Clause 35, wherein, in order to configure the QoS and charging information for the streaming session, the one or more processors are configured to: for each of the dynamic virtual objects: determine the type of the dynamic virtual object; and determine the QoS and charging information based on the type of the dynamic virtual object.

[0216] Clause 37: The device according to Clause 35, wherein the one or more processors are further configured to: create an XR session; and anchor the XR scene to a real-world space for the XR session.

[0217] Clause 38: The device according to Clause 35, wherein the desired virtual object further includes one or more static virtual objects, and wherein the one or more processors are further configured to: retrieve media data for each of the one or more static virtual objects, and wherein, in order to render the XR scene, the one or more processors are further configured to: render the XR scene to include the retrieved media data for the one or more static virtual objects at a corresponding location within the XR scene.

[0218] Clause 39: The device according to Clause 35, wherein, in order to retrieve the media data for each of the number of dynamic virtual objects, the one or more processors are configured to: retrieve a manifest file for each of the number of dynamic virtual objects; and use the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

[0219] Clause 40: The device according to Clause 35, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of the available XR sessions having associated entry point data; receive a selection of one of the available XR sessions; and retrieve a scene description for the selected XR session among the available XR sessions, the scene description including the entry point data associated with the selected XR session among the available XR sessions.

[0220] Clause 41: The device according to Clause 35, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of the available XR sessions having associated entry point data; receive a selection of one of the available XR sessions; request the entry point data associated with the selected XR session among the available XR sessions; receive the requested entry point data and data representing a requirement for optimal processing of a scene for the selected XR session among the available XR sessions; in response to determining that the requirement includes edge support, request data representing an edge application server (AS) for the selected XR session among the available XR sessions; send initialization information for the selected XR session among the available XR sessions to the edge AS; and receive custom entry point data for the selected XR session among the available XR sessions from the edge AS.

[0221] Clause 42: The device according to Clause 35, wherein the entry point data includes a scene description, the scene description including information about the one or more desired virtual objects for the XR scene.

[0222] Clause 43: The device according to Clause 35, wherein the dynamic virtual object includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

[0223] Clause 44: The device according to Clause 35 further includes: a display configured to display the XR data.

[0224] Clause 45: The device according to Clause 35, wherein the device includes one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.

[0225] Clause 46: A computer-readable storage medium having instructions stored thereon, which, when executed, cause a processor to: parse entry point data of a scene to extract information about one or more desired virtual objects for an XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the instructions causing the processor to initialize the number of streaming sessions include instructions causing the processor to: initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each of the number of dynamic virtual objects via one of the streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0226] Clause 47: An apparatus for processing extended reality (XR) data, the apparatus comprising: a unit for parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; a unit for initializing a number of streaming sessions, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the unit for initializing the number of streaming sessions includes a unit for initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; a unit for retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and a unit for sending the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.

[0227] Clause 48: A method for processing extended reality (XR) data, the method comprising: parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initializing a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein initializing the number of streaming sessions comprises initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0228] Clause 49: The method according to Clause 48, wherein configuring the QoS and charging information for the streaming session includes: for each of the dynamic virtual objects: determining the type of the dynamic virtual object; and determining the QoS and charging information based on the type of the dynamic virtual object.

[0229] Clause 50: The method according to Clause 49 further includes: for at least one of the dynamic virtual objects: determining whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is two-dimensional (2D) media data or three-dimensional (3D) media data; and determining the QoS and billing information based on whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is the 2D media data or the 3D media data.

[0230] Clause 51: The method according to any one of Clauses 48-50, wherein configuring the QoS and charging information for the streaming session comprises: for each of the dynamic virtual objects: determining the amount of bandwidth required for the media data associated with the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session for the dynamic virtual object according to the required amount of bandwidth.

[0231] Clause 52: The method according to any one of Clauses 48-51, wherein configuring the QoS and charging information for the streaming session comprises: for each of the dynamic virtual objects: determining that accurate user location information is required for the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session for the dynamic virtual object based on the determination that the accurate user location information is required.

[0232] Clause 53: The method according to any one of Clauses 48-52, wherein configuring the QoS and charging information for the streaming session includes: determining whether the rendering unit is configured to perform split rendering of the media data; when the rendering unit is not configured to perform split rendering, determining a first minimum bit rate for the streaming session; and when the rendering unit is configured to perform split rendering, determining a second minimum bit rate for the streaming session, the second minimum bit rate being higher than the first minimum bit rate.

[0233] Clause 54: The method according to any one of Clauses 48-53 further includes: creating an XR session; and anchoring the XR scene to a real-world space for the XR session.

[0234] Clause 55: The method according to any one of Clauses 48-54, wherein the desired virtual object further includes one or more static virtual objects, the method further includes: retrieving media data for each of the one or more static virtual objects, and wherein rendering the XR scene further includes: rendering the XR scene to include the retrieved media data for the one or more static virtual objects at a corresponding location within the XR scene.

[0235] Clause 56: The method according to any one of Clauses 48-55, wherein retrieving the media data for each of the number of dynamic virtual objects comprises: retrieving a manifest file for each of the number of dynamic virtual objects; and using the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

[0236] Clause 57: The method described in Clause 56, wherein the manifest file includes a Media Presentation Description (MPD).

[0237] Clause 58: The method according to any one of Clauses 48-57 further includes: configuring an immersive video decoder for each of the number of dynamic virtual objects.

[0238] Clause 59: The method according to any one of Clauses 48-58 further comprises: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; and retrieving a scene description for the selected XR session among the available XR sessions, the scene description including the entry point data associated with the selected XR session among the available XR sessions.

[0239] Clause 60: The method according to any one of Clauses 48-59 further comprises: retrieving a list of available XR sessions, each of the available XR sessions having associated entry point data; receiving a selection of one of the available XR sessions; requesting the entry point data associated with the selected XR session among the available XR sessions; receiving the requested entry point data and data representing a requirement for optimal processing of a scene for the selected XR session among the available XR sessions; in response to determining that the requirement includes edge support, requesting data representing an edge application server (AS) for the selected XR session among the available XR sessions; sending initialization information for the selected XR session among the available XR sessions to the edge AS; and receiving custom entry point data for the selected XR session among the available XR sessions from the edge AS.

[0240] Clause 61: The method according to Clause 60, wherein the entry point data includes a scene description, the scene description including information about the one or more desired virtual objects for the XR scene.

[0241] Clause 62: The method according to any one of Clauses 48-61, wherein the dynamic virtual object includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

[0242] Clause 63: The method according to any one of Clauses 48-62 further includes: retrieving audio data for at least one of the number of dynamic virtual objects, and presenting the retrieved audio data.

[0243] Clause 64: An apparatus for processing extended reality (XR) data, the apparatus comprising: a memory configured to store XR data and media data; and one or more processors implemented in circuitry and configured to: parse entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein, in order to initialize the number of streaming sessions, the one or more processors are configured to initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each of the number of dynamic virtual objects via one of the streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0244] Clause 65: The apparatus according to Clause 64, wherein, in order to configure the QoS and charging information for the streaming session, the one or more processors are configured to: for each of the dynamic virtual objects: determine the type of the dynamic virtual object; and determine the QoS and charging information based on the type of the dynamic virtual object.

[0245] Clause 66: The device according to any one of Clauses 64 and 65, wherein the one or more processors are further configured to: create an XR session; and anchor the XR scene to a real-world space for the XR session.

[0246] Clause 67: The device according to any one of Clauses 64-66, wherein the desired virtual object further comprises one or more static virtual objects, and wherein the one or more processors are further configured to: retrieve media data for each of the one or more static virtual objects, and wherein, in order to render the XR scene, the one or more processors are further configured to: render the XR scene to include the retrieved media data for the one or more static virtual objects at a corresponding location within the XR scene.

[0247] Clause 68: The device according to any one of Clauses 64-67, wherein, in order to retrieve the media data for each of the number of dynamic virtual objects, the one or more processors are configured to: retrieve a manifest file for each of the number of dynamic virtual objects; and use the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

[0248] Clause 69: The device according to any one of Clauses 64-68, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of the available XR sessions having associated entry point data; receive a selection of one of the available XR sessions; and retrieve a scene description for the selected XR session among the available XR sessions, the scene description including the entry point data associated with the selected XR session among the available XR sessions.

[0249] Clause 70: The device according to any one of Clauses 64-69, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of the available XR sessions having associated entry point data; receive a selection of one of the available XR sessions; request the entry point data associated with the selected XR session among the available XR sessions; receive the requested entry point data and data representing a requirement for optimal processing of a scene for the selected XR session among the available XR sessions; in response to determining that the requirement includes edge support, request data representing an edge application server (AS) for the selected XR session among the available XR sessions; send initialization information for the selected XR session among the available XR sessions to the edge AS; and receive custom entry point data for the selected XR session among the available XR sessions from the edge AS.

[0250] Clause 71: The device according to any one of Clauses 64-70, wherein the entry point data includes a scene description, the scene description including information about the one or more desired virtual objects for the XR scene.

[0251] Clause 72: The device according to any one of Clauses 64-71, wherein the dynamic virtual object includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

[0252] Clause 73: The device according to any one of Clauses 64-72 further includes: a display configured to display the XR data.

[0253] Clause 74: The device according to any one of Clauses 64-73, wherein the device includes one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.

[0254] Clause 75: A computer-readable storage medium having instructions stored thereon, which, when executed, cause a processor to: parse entry point data of a scene to extract information about one or more desired virtual objects for an XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; initialize a number of streaming sessions using the entry point data, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the instructions causing the processor to initialize the number of streaming sessions include instructions causing the processor to: initialize the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; retrieve media data for each of the number of dynamic virtual objects via one of the streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, including the retrieved media data at a corresponding location within the XR scene.

[0255] Clause 76: An apparatus for processing extended reality (XR) data, the apparatus comprising: a unit for parsing entry point data of a scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects comprising a number of dynamic virtual objects greater than one; a unit for initializing a number of streaming sessions, the number of streaming sessions being equal to the number of dynamic virtual objects, wherein the unit for initializing the number of streaming sessions includes a unit for initializing the streaming sessions according to configured Quality of Service (QoS) and billing information for the streaming sessions; a unit for retrieving media data for each of the number of dynamic virtual objects via one of the streaming sessions; and a unit for sending the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.

[0256] In one or more examples, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, which include any medium that facilitates the transfer of a computer program from one place to another, for example according to a communication protocol. In this way, computer-readable media generally may correspond to (1) tangible computer-readable storage media that are non-transitory, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.

[0257] By way of example, and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies (such as infrared, radio, and microwave), then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies (such as infrared, radio, and microwave) are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but instead refer to non-transitory, tangible storage media. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0258] Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementing the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Furthermore, the techniques may be fully implemented in one or more circuits or logic elements.

[0259] The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

[0260] Various examples have been described. These and other examples are within the scope of the appended claims.

Claims

1. A method for processing extended reality (XR) data, the method comprising: parsing entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initializing, using the entry point data, a number of streaming sessions, the number of streaming sessions being equal to or greater than the number of dynamic virtual objects, wherein initializing the streaming sessions includes initializing the streaming sessions according to configured Quality of Service (QoS) and charging information for the streaming sessions; retrieving media data for each of the dynamic media components of the dynamic virtual objects via a corresponding one of the number of streaming sessions; and sending the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.
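
As a non-limiting illustration, the steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the dictionary layout of the entry point data, the names `VirtualObject`, `StreamingSession`, `parse_entry_point`, `init_sessions`, `fetch_media`, and `build_render_input`, and the stubbed network fetch are all hypothetical. The sketch opens exactly one session per dynamic virtual object, which is one way of satisfying "equal to or greater than".

```python
from dataclasses import dataclass, field


@dataclass
class VirtualObject:
    name: str
    dynamic: bool
    media_components: list = field(default_factory=list)  # ids of dynamic media components
    position: tuple = (0.0, 0.0, 0.0)  # location within the XR scene


@dataclass
class StreamingSession:
    object_name: str
    qos: dict       # configured QoS information (hypothetical layout)
    charging: dict  # configured charging information (hypothetical layout)


def parse_entry_point(entry_point: dict) -> list:
    """Extract the desired virtual objects from a hypothetical scene-description dict."""
    return [
        VirtualObject(
            name=node["name"],
            dynamic=node.get("dynamic", False),
            media_components=list(node.get("media", [])),
            position=tuple(node.get("position", (0.0, 0.0, 0.0))),
        )
        for node in entry_point.get("nodes", [])
    ]


def init_sessions(objects: list, qos: dict, charging: dict) -> dict:
    """Open one session per dynamic object, configured with QoS and charging info."""
    return {
        obj.name: StreamingSession(obj.name, dict(qos), dict(charging))
        for obj in objects
        if obj.dynamic
    }


def fetch_media(session: StreamingSession, component: str) -> bytes:
    """Stand-in for a network fetch; returns placeholder bytes (assumption)."""
    return f"{session.object_name}/{component}".encode()


def build_render_input(entry_point: dict, qos: dict, charging: dict):
    """Parse, initialize sessions, retrieve media, and package it for a rendering unit."""
    objects = parse_entry_point(entry_point)
    sessions = init_sessions(objects, qos, charging)
    render_input = []
    for obj in objects:
        if not obj.dynamic:
            continue
        session = sessions[obj.name]
        media = [fetch_media(session, c) for c in obj.media_components]
        render_input.append({"object": obj.name, "position": obj.position, "media": media})
    return sessions, render_input
```

In this sketch the rendering unit would receive each object's media together with its position, so that the media can be placed at the corresponding location in the scene.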

2. The method of claim 1, wherein configuring the QoS and charging information for the streaming sessions includes, for each of the dynamic virtual objects: determining a type of the dynamic virtual object; and determining the QoS and charging information based on the type of the dynamic virtual object.

3. The method of claim 2, further comprising, for at least one of the dynamic virtual objects: determining whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is two-dimensional (2D) media data or three-dimensional (3D) media data; and determining the QoS and charging information based on whether the media data for the streaming session associated with the type of the at least one of the dynamic virtual objects is 2D media data or 3D media data.

4. The method of claim 1, wherein configuring the QoS and charging information for the streaming sessions includes, for each of the dynamic virtual objects: determining an amount of bandwidth required for the media data associated with the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session of the dynamic virtual object according to the required amount of bandwidth.

5. The method of claim 1, wherein configuring the QoS and charging information for the streaming sessions includes, for each of the dynamic virtual objects: determining that accurate user location information is required for the streaming session for the dynamic virtual object; and configuring the QoS and charging information for the streaming session of the dynamic virtual object based on the determination that accurate user location information is required.

6. The method of claim 1, wherein configuring the QoS and charging information for the streaming sessions includes: determining whether the rendering unit is configured to perform split rendering of the media data; when the rendering unit is not configured to perform split rendering, determining a first minimum bit rate for the streaming session; and when the rendering unit is configured to perform split rendering, determining a second minimum bit rate for the streaming session, the second minimum bit rate being higher than the first minimum bit rate.
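
As a non-limiting illustration, the QoS and charging configuration described in claims 2 through 6 might be sketched as follows. The numeric floors, the split-rendering multiplier, the charging class labels, and the names `QosConfig` and `configure_qos` are assumptions made for this example and do not appear in the claims; the only property taken from claim 6 is that the minimum bit rate with split rendering is higher than without it.

```python
from dataclasses import dataclass

# Hypothetical values chosen only for illustration.
BASE_MIN_KBPS = {"2d": 2_000, "3d": 10_000}
SPLIT_RENDERING_FACTOR = 2.0


@dataclass(frozen=True)
class QosConfig:
    min_bitrate_kbps: int
    precise_location: bool
    charging_class: str


def configure_qos(is_3d: bool, required_kbps: int,
                  needs_precise_location: bool, split_rendering: bool) -> QosConfig:
    """Derive per-session QoS and charging settings for one dynamic virtual object."""
    # 2D versus 3D media sets the floor (claim 3); required bandwidth can raise it (claim 4).
    floor = BASE_MIN_KBPS["3d" if is_3d else "2d"]
    min_rate = max(floor, required_kbps)
    # Split rendering yields a higher minimum bit rate than local rendering (claim 6).
    if split_rendering:
        min_rate = int(min_rate * SPLIT_RENDERING_FACTOR)
    # Hypothetical charging rule: 3D media or precise location (claim 5) is billed as premium.
    charging = "premium" if (is_3d or needs_precise_location) else "standard"
    return QosConfig(min_rate, needs_precise_location, charging)
```

For example, under these assumed values a 2D object needing 1,500 kbps gets a 2,000 kbps minimum without split rendering and a 4,000 kbps minimum with it.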

7. The method of claim 1, further comprising: creating an XR session; and anchoring the XR scene to a real-world space for the XR session.

8. The method of claim 1, wherein the one or more desired virtual objects further include one or more static virtual objects, the method further comprising retrieving media data for each of the one or more static virtual objects, and wherein rendering the XR scene further includes rendering the XR scene to include the retrieved media data for the one or more static virtual objects at corresponding locations within the XR scene.

9. The method of claim 1, wherein retrieving the media data for each of the number of dynamic virtual objects includes: retrieving a manifest file for each of the number of dynamic virtual objects; and using the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

10. The method of claim 9, wherein the manifest file includes a Media Presentation Description (MPD).
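
As a non-limiting illustration of the manifest-driven retrieval in claims 9 and 10, the sketch below reads segment URLs from a deliberately small MPD and fetches them per object. It assumes an MPD that lists its segments explicitly in a `SegmentList`; real MPDs often use templates or other addressing instead. The sample manifest, the names `segment_urls` and `retrieve_object_media`, and the caller-supplied `fetch` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by DASH MPD documents.
NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical, heavily simplified manifest for one dynamic virtual object.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period><AdaptationSet><Representation id="mesh" bandwidth="8000000">
    <SegmentList>
      <SegmentURL media="mesh_0001.m4s"/>
      <SegmentURL media="mesh_0002.m4s"/>
    </SegmentList>
  </Representation></AdaptationSet></Period>
</MPD>"""


def segment_urls(mpd_xml: str) -> list:
    """Return the media segment URLs listed in SegmentList elements, in document order."""
    root = ET.fromstring(mpd_xml)
    return [seg.get("media")
            for seg in root.iterfind(".//mpd:SegmentList/mpd:SegmentURL", NS)]


def retrieve_object_media(manifests: dict, fetch) -> dict:
    """For each object name, fetch every segment named in that object's manifest."""
    return {name: [fetch(url) for url in segment_urls(xml)]
            for name, xml in manifests.items()}
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular transport or session object.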

11. The method of claim 1, further comprising configuring an immersive video decoder for each of the number of dynamic virtual objects.

12. The method of claim 1, further comprising: retrieving a list of available XR sessions, each of which has associated entry point data; receiving a selection of one of the available XR sessions; and retrieving a scene description for the selected XR session, the scene description including the entry point data associated with the selected XR session.

13. The method of claim 1, further comprising: retrieving a list of available XR sessions, each of which has associated entry point data; receiving a selection of one of the available XR sessions; requesting the entry point data associated with the selected XR session; receiving the requested entry point data and data representing requirements for optimal processing of the scene for the selected XR session; in response to determining that the request includes edge support, requesting data representing an edge application server (AS) for the selected XR session; sending initialization information to the edge AS for the selected XR session; and receiving custom entry point data from the edge AS for the selected XR session.
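
As a non-limiting illustration, the session selection and edge-assisted flow of claim 13 might be sketched as follows. The classes `FakeProvider` and `FakeEdgeAS`, the function `resolve_entry_point`, and the `edge_support` flag are hypothetical stand-ins. In particular, checking a flag in the returned requirements is only one assumed way of evaluating the edge-support condition.

```python
class FakeEdgeAS:
    """Hypothetical edge application server that returns customized entry point data."""

    def initialize(self, init_info: dict) -> dict:
        return {"scene": init_info["session"], "customized": True}


class FakeProvider:
    """Hypothetical service offering XR sessions and their entry point data."""

    def list_sessions(self) -> list:
        return ["museum", "stadium"]

    def entry_point(self, session_id: str):
        # Returns (entry point data, processing requirements) for the session.
        return {"scene": session_id}, {"edge_support": session_id == "stadium"}

    def edge_server(self, session_id: str) -> FakeEdgeAS:
        return FakeEdgeAS()


def resolve_entry_point(provider, choose, device_info: dict) -> dict:
    """List sessions, take a selection, and use the edge AS when edge support applies."""
    sessions = provider.list_sessions()
    selected = choose(sessions)  # stands in for receiving a user selection
    entry_point, requirements = provider.entry_point(selected)
    if requirements.get("edge_support"):
        edge = provider.edge_server(selected)
        # Send initialization information and receive custom entry point data.
        entry_point = edge.initialize({"session": selected, "device": device_info})
    return entry_point
```

When edge support does not apply, the sketch simply returns the entry point data received from the provider.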

14. The method of claim 1, wherein the entry point data includes a scene description, the scene description including the information about the one or more desired virtual objects for the XR scene.

15. The method of claim 1, wherein each of the dynamic virtual objects includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

16. The method of claim 1, further comprising: retrieving audio data for at least one of the number of dynamic virtual objects; and presenting the retrieved audio data.

17. An apparatus for processing extended reality (XR) data, the apparatus comprising: a memory configured to store XR data and media data; and one or more processors implemented in circuitry and configured to: parse entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initialize, using the entry point data, a number of streaming sessions, the number of streaming sessions being equal to or greater than the number of dynamic virtual objects, wherein, to initialize the streaming sessions, the one or more processors are configured to initialize the streaming sessions according to configured Quality of Service (QoS) and charging information for the streaming sessions; retrieve media data for each of the dynamic media components of the dynamic virtual objects via a corresponding one of the number of streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.

18. The apparatus of claim 17, wherein, to configure the QoS and charging information for the streaming sessions, the one or more processors are configured to, for each of the dynamic virtual objects: determine a type of the dynamic virtual object; and determine the QoS and charging information based on the type of the dynamic virtual object.

19. The apparatus of claim 17, wherein the one or more processors are further configured to: create an XR session; and anchor the XR scene to a real-world space for the XR session.

20. The apparatus of claim 17, wherein the one or more desired virtual objects further include one or more static virtual objects, wherein the one or more processors are further configured to retrieve media data for each of the one or more static virtual objects, and wherein, to render the XR scene, the one or more processors are further configured to render the XR scene to include the retrieved media data for the one or more static virtual objects at corresponding locations within the XR scene.

21. The apparatus of claim 17, wherein, to retrieve the media data for each of the number of dynamic virtual objects, the one or more processors are configured to: retrieve a manifest file for each of the number of dynamic virtual objects; and use the corresponding manifest file to retrieve media segments for each of the number of dynamic virtual objects.

22. The apparatus of claim 17, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of which has associated entry point data; receive a selection of one of the available XR sessions; and retrieve a scene description for the selected XR session, the scene description including the entry point data associated with the selected XR session.

23. The apparatus of claim 17, wherein the one or more processors are further configured to: retrieve a list of available XR sessions, each of which has associated entry point data; receive a selection of one of the available XR sessions; request the entry point data associated with the selected XR session; receive the requested entry point data and data representing requirements for optimal processing of the scene for the selected XR session; in response to determining that the request includes edge support, request data representing an edge application server (AS) for the selected XR session; send initialization information to the edge AS for the selected XR session; and receive custom entry point data from the edge AS for the selected XR session.

24. The apparatus of claim 17, wherein the entry point data includes a scene description, the scene description including the information about the one or more desired virtual objects for the XR scene.

25. The apparatus of claim 17, wherein each of the dynamic virtual objects includes at least one of a dynamic mesh, an animated mesh, or a point cloud.

26. The apparatus of claim 17, further comprising a display configured to display the XR data.

27. The apparatus of claim 17, wherein the apparatus includes one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.

28. A computer-readable storage medium having instructions stored thereon, the instructions, when executed, causing a processor to: parse entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; initialize, using the entry point data, a number of streaming sessions, the number of streaming sessions being equal to or greater than the number of dynamic virtual objects, wherein the instructions that cause the processor to initialize the number of streaming sessions include instructions that cause the processor to initialize the streaming sessions according to configured Quality of Service (QoS) and charging information for the streaming sessions; retrieve media data for each of the dynamic media components of the dynamic virtual objects via a corresponding one of the number of streaming sessions; and send the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.

29. An apparatus for processing extended reality (XR) data, the apparatus comprising: a unit for parsing entry point data of an XR scene to extract information about one or more desired virtual objects for the XR scene, the one or more desired virtual objects including a number of dynamic virtual objects equal to or greater than one, each of the dynamic virtual objects including at least one dynamic media component for which media data is to be retrieved; a unit for initializing a number of streaming sessions, the number of streaming sessions being equal to or greater than the number of dynamic virtual objects, wherein the unit for initializing the number of streaming sessions includes a unit for initializing the streaming sessions according to configured Quality of Service (QoS) and charging information for the streaming sessions; a unit for retrieving media data for each of the number of dynamic virtual objects via a corresponding one of the number of streaming sessions; and a unit for sending the retrieved media data to a rendering unit to render the XR scene, such that the retrieved media data is included at a corresponding location within the XR scene.