Adaptive streaming transmission method, apparatus and computer readable medium
By using an adaptive streaming method, the depth or priority of the transmission of light field/holographic display content is adjusted according to the network bandwidth and processing capabilities of the terminal client, solving the problems of high equipment costs and large bandwidth requirements, and achieving a smooth viewing experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT AMERICA LLC
- Filing Date
- 2022-10-22
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies for capturing and transmitting light field/holographic display content are costly and require huge bandwidth, which may cause buffering or interruption problems on the terminal client.
An adaptive streaming approach is adopted to adjust the depth or priority of the transmission scene based on the network bandwidth and processing capabilities of the terminal client, transmitting only high-priority objects or those within the available depth, and using predefined background images to provide a pleasant viewing experience.
It effectively reduces equipment costs and bandwidth requirements, avoids buffering and interruptions, and provides a smooth viewing experience under limited network conditions.
Figure CN116710852B_ABST
Abstract
Description
[0001] Related applications
[0002] This application claims priority to U.S. Provisional Patent Application No. 63/270,978 (filed October 22, 2021) and U.S. Patent Application No. 17/971,048 (filed October 21, 2022), both filed with the United States Patent and Trademark Office, the disclosures of which are incorporated herein by reference in their entirety.
Technical Field
[0003] Embodiments of this disclosure relate to image and video coding techniques. More specifically, embodiments of this disclosure relate to improvements in adaptive streaming of immersive media content for holographic or light field displays.
Background Technology
[0004] Immersive media involves immersive technologies that attempt to create or mimic the physical world through digital simulation, typically simulating any or all human sensory systems to create the user's perception of actually being in the scene.
[0005] Immersive media technologies can include Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR), and light field/holography. VR refers to a digital environment that replaces the user's physical environment by using a headset to place the user in a computer-generated world. AR uses clear visuals or a smartphone to overlay digital media onto the real world around the user. MR refers to merging the real and digital worlds to create an environment where technology and the physical world can coexist.
[0006] Light field display, or holographic display, technology is based on light rays in three-dimensional (3D) space, originating from every point and direction. The light rays can be described by a five-dimensional plenoptic function, where each ray is defined by three coordinates in 3D space and two angles that specify its direction. The concept of a light field display is based on the understanding that everything we see is illuminated by light from some light source, propagating through space and striking the surfaces of objects. Before reaching our eyes, some light is absorbed, and some is reflected to another surface. Which light rays reach our eyes depends on the user's precise position within the light field. As the user moves around, they perceive a portion of the light field and use it to determine the location of objects.
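The five-dimensional description of a ray can be made concrete with a small data structure. The following is a minimal sketch, not taken from the disclosure: the class name `Ray`, the field names, and the spherical-angle convention (polar angle measured from the +z axis) are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ray:
    """One light-field sample: a 3D position plus two direction angles.

    Hypothetical structure; the angle convention is an assumption.
    """
    x: float
    y: float
    z: float
    theta: float  # polar angle measured from the +z axis, in radians
    phi: float    # azimuth in the x-y plane, in radians

    def direction(self) -> Tuple[float, float, float]:
        """Unit direction vector implied by the two angles."""
        return (
            math.sin(self.theta) * math.cos(self.phi),
            math.sin(self.theta) * math.sin(self.phi),
            math.cos(self.theta),
        )
```

Three coordinates and two angles are sufficient because a direction in 3D space has only two degrees of freedom once its length is fixed to one.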
[0007] To capture 360-degree video content, a 360-degree camera is required; however, when it comes to capturing content for light field / holographic displays, an expensive setup consisting of multiple depth cameras or camera arrays is needed, depending on the field of view (FoV) of the scene to be rendered. Traditional cameras can only capture a two-dimensional (2D) representation of the light rays arriving at a given camera lens. Image sensors record the sum of the brightness and color of all the light rays arriving at each pixel, but not the direction of all the light rays arriving at the camera sensor. Therefore, equipment specifically designed to capture content for light field / holographic displays is prohibitively expensive.
[0008] Furthermore, the multimedia content, real-world content, or synthetic content used in such holographic or light field displays is enormous in size and is captured and stored on servers. Transmitting this media content to end clients requires significant bandwidth, even after data compression. Therefore, with limited bandwidth, clients may experience buffering or interruptions.
Summary of the Invention
[0009] According to an embodiment, a method for adaptive streaming of light field or holographic immersive media can be provided. The method can be executed by at least one processor and may include: determining features associated with a scene to be transmitted to an end client; adjusting at least a portion of the scene to be transmitted to the end client based on the determined features; and transmitting an adaptive stream of light field or holographic immersive media including the adjusted scene based on the determined features.
[0010] According to an embodiment, an apparatus for adaptive streaming of light field or holographic immersive media can be provided. The apparatus may include at least one memory configured to store program code; and at least one processor configured to read the program code and operate according to the instructions of the program code. The program code may include first determining code configured to cause the at least one processor to determine features associated with a scene to be transmitted to a terminal client; second determining code configured to cause the at least one processor to adjust at least a portion of the scene to be transmitted to the terminal client based on the determined features; and transmission code configured to cause the at least one processor to transmit the adaptive stream of light field or holographic immersive media including the adjusted scene based on the determined features.
[0011] According to an embodiment, a non-transitory computer-readable medium storing instructions may be provided. When executed by at least one processor of a device for adaptive streaming of light field or holographic immersive media, the instructions may cause the at least one processor to determine features associated with a scene to be transmitted to an end client; adjust at least a portion of the scene to be transmitted to the end client based on the determined features associated with the end client; and transmit an adaptive stream of light field or holographic immersive media including the adjusted scene based on the determined features.
Attached Figure Description
[0012] Figure 1 illustrates depth-based adaptive streaming of immersive media according to an embodiment of the present disclosure;
[0013] Figure 2 illustrates priority-based adaptive streaming of immersive media according to an embodiment of the present disclosure;
[0014] Figure 3A shows a flowchart of an adaptive streaming method for immersive media according to an embodiment of the present disclosure;
[0015] Figure 3B shows a flowchart of an adaptive streaming method for immersive media according to an embodiment of the present disclosure;
[0016] Figure 4 is a simplified block diagram of a communication system according to an embodiment of the present disclosure;
[0017] Figure 5 is a schematic diagram of the placement of video encoders and decoders in a streaming environment;
[0018] Figure 6 is a functional block diagram of a video decoder according to an embodiment of the present disclosure;
[0019] Figure 7 is a functional block diagram of a video encoder according to an embodiment of the present disclosure;
[0020] Figure 8 is a schematic diagram of a computer system according to an embodiment of the present disclosure.
Detailed Implementation
[0021] The aspects of the disclosed embodiments can be used individually or in combination. Embodiments of this disclosure relate to improvements in adaptive streaming techniques for light field or holographic immersive media, taking into account network and/or device capabilities.
[0022] Holographic/light field technology creates a virtual environment with precise depth and three-dimensionality without the need for a headset, thus avoiding side effects such as motion sickness. As mentioned above, capturing 360-degree video content requires a 360-degree camera; however, when it comes to capturing content for light field/holographic displays, depending on the field of view (FoV) of the scene to be captured, an expensive setup consisting of multiple depth cameras or camera arrays is required.
[0023] According to one aspect of this disclosure, a server or media distribution processor can use depth-based adaptive streaming for holographic or light field display media. Rather than rendering the entire scene at once, a bandwidth-based depth method is used when network bandwidth or processing power is low. When network capabilities are ideal, the terminal client can receive and render the entire scene at once. However, when network bandwidth or processing power is limited, the terminal client does not render the entire scene, but rather renders the scene up to a certain depth. Therefore, according to an embodiment, the depth is a function of the client's bandwidth. In an embodiment, after obtaining information about the terminal client's bandwidth, the server switches the streamed media among versions of the scene with different depths.
[0024] Referring to Figure 1, a depth-based method 100 for adaptive streaming of media associated with holographic or light field displays is illustrated. As shown in Figure 1, objects 101-103 are objects at different depths in the scene, wherein object 101 is located at a first depth 105, object 102 is located at a second depth 106, and object 103 is located at a third depth from the imaging device (also referred to as a camera or capture device). According to embodiments of this disclosure, objects at depths up to the first, second, or third depth may be included, depending on network bandwidth or the processing power of the terminal client. In some embodiments, if only objects at depths up to the second depth are included, objects at the third depth may be excluded from the scene being transmitted or streamed.
[0025] According to the embodiments, depth-based streaming is preferable to transmitting the entire scene at once because the scene depth can be adjusted based on the available network bandwidth, which avoids the buffering or interruptions that may otherwise occur during playback when the client's bandwidth is limited and cannot support rendering the entire scene.
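The depth-based selection described for Figure 1 can be sketched as follows. This is an illustrative sketch only: the bandwidth thresholds, the depth values in metres, and the names `SceneObject`, `depth_limit_for_bandwidth`, and `select_by_depth` are assumptions, since the disclosure states only that the depth is a function of the client's bandwidth.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    name: str
    depth_m: float  # distance from the capture device (assumed unit: metres)


def depth_limit_for_bandwidth(bandwidth_mbps: float) -> float:
    """Map measured bandwidth to a maximum scene depth.

    The thresholds below are made up for illustration.
    """
    if bandwidth_mbps >= 100:
        return float("inf")  # ideal network: stream the whole scene
    if bandwidth_mbps >= 25:
        return 10.0
    return 3.0


def select_by_depth(objects: List[SceneObject], bandwidth_mbps: float) -> List[SceneObject]:
    """Keep only the objects located within the depth the bandwidth allows."""
    limit = depth_limit_for_bandwidth(bandwidth_mbps)
    return [obj for obj in objects if obj.depth_m <= limit]


scene = [
    SceneObject("object_101", 2.0),
    SceneObject("object_102", 8.0),
    SceneObject("object_103", 20.0),
]
```

With this example scene, a mid-range bandwidth keeps the two nearer objects and drops the farthest one, mirroring the case in which objects at the third depth are excluded.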
[0026] According to one aspect of this disclosure, the server can assign a priority value to each asset (also called an object) and use that priority value for adaptive streaming of holographic or light field displays. Thus, a bandwidth-based prioritization method is considered, such that instead of rendering the entire scene at once, only a prioritized version of the scene is transmitted and rendered. When network capabilities are unrestricted, the end client can receive and render all of the scene's assets at once. However, when network bandwidth or processing power is limited, the end client can render assets with higher priority instead of rendering all assets in the scene. Therefore, the total number of assets and/or objects rendered is a function of the client's bandwidth. According to one embodiment, after obtaining information about the end client's bandwidth, the server switches the streamed media among versions of the scene with different assets.
[0027] Referring to Figure 2, a priority-based method 200 for adaptive streaming of media associated with holographic or light field displays is illustrated. As shown in Figure 2, objects 201-203 are objects in the scene at different depths and priorities, wherein object 201 is at a first priority, object 203 is at a second priority, and object 202 is at a third priority. In some embodiments, the priority of an object may be based on the identified object. In some embodiments, the priority of an object may be based on the distance between the object and the imaging device (also referred to as a camera or capturing device). According to embodiments of this disclosure, based on network bandwidth or the processing power of the terminal client, only objects with first, second, or third priorities may be included. In some embodiments, if objects up to the second priority are included, objects with first priority are also included, but objects with third priority may be excluded from the scene being transmitted or streamed.
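The priority-based selection for Figure 2 can be sketched in the same spirit. In this sketch a rank of 1 denotes the first (highest) priority; the bandwidth thresholds and the function names are hypothetical, and the rank assignments simply follow the Figure 2 description.

```python
from typing import Dict, List


def max_rank_for_bandwidth(bandwidth_mbps: float) -> int:
    """Lowest-priority rank still streamed at this bandwidth (assumed thresholds)."""
    if bandwidth_mbps >= 100:
        return 3
    if bandwidth_mbps >= 25:
        return 2
    return 1


def select_by_priority(ranks: Dict[str, int], bandwidth_mbps: float) -> List[str]:
    """Return the objects whose rank is within the cutoff, highest priority first."""
    cutoff = max_rank_for_bandwidth(bandwidth_mbps)
    chosen = [name for name, rank in ranks.items() if rank <= cutoff]
    return sorted(chosen, key=ranks.get)


# Rank 1 is the first priority, following the Figure 2 description.
ranks = {"object_201": 1, "object_202": 3, "object_203": 2}
```

Unlike the depth-based method, the cutoff here does not depend on where an object sits in the scene, so a distant but important object can survive a bandwidth drop.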
[0028] According to one aspect of this disclosure, the server may have a two-part content description: a Media Presentation Description (MPD) describing a list of available scenes, various alternatives, and other characteristics; and multiple scenes with different assets based on scene depth or asset priority. In one embodiment, when an end client first obtains the MPD to play any media content, it can parse the MPD and learn about the various scenes with different assets, scene timing, media content availability, media type, various encoding alternatives for the media content, supported minimum and maximum bandwidth, and other content characteristics. Using this information, the end client can appropriately select which scene to render, when, and under what bandwidth availability. The end client can continuously measure bandwidth fluctuations and/or processing power fluctuations, and depending on its analysis, the end client can determine how to adapt to the available bandwidth by obtaining alternative scenes with fewer or more assets.
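The client-side selection among alternative scenes can be sketched as below. The manifest here is a plain Python dictionary invented for illustration; it is not the MPD schema, and the field names `id`, `assets`, and `min_bandwidth_mbps` as well as the function `choose_scene` are assumptions.

```python
from typing import Any, Dict

# Hypothetical, simplified stand-in for a parsed manifest.
MANIFEST: Dict[str, Any] = {
    "scenes": [
        {"id": "full", "assets": 12, "min_bandwidth_mbps": 100},
        {"id": "mid", "assets": 6, "min_bandwidth_mbps": 40},
        {"id": "base", "assets": 2, "min_bandwidth_mbps": 10},
    ]
}


def choose_scene(manifest: Dict[str, Any], measured_mbps: float) -> Dict[str, Any]:
    """Pick the richest alternative scene that the measured bandwidth supports."""
    scenes = manifest["scenes"]
    affordable = [s for s in scenes if s["min_bandwidth_mbps"] <= measured_mbps]
    if not affordable:
        # Nothing fits: fall back to the cheapest alternative (an assumption).
        return min(scenes, key=lambda s: s["min_bandwidth_mbps"])
    return max(affordable, key=lambda s: s["assets"])
```

A client would call `choose_scene` again whenever its bandwidth measurement changes, which corresponds to obtaining alternative scenes with fewer or more assets.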
[0029] According to one aspect of this disclosure, when network bandwidth and / or processing power is limited, the server may stream higher-priority assets first, rather than lower-priority assets. In some embodiments, assets with a priority equal to or greater than a threshold may be included, and assets with a priority below the threshold may be excluded. In some embodiments, assets may be layered and compressed, including a base stream layer and layers with additional details such as materials. Thus, when network bandwidth and / or processing power is limited, only the base stream may be rendered, and as bandwidth increases, layers with more detail may be added. In some embodiments, the priority value and / or priority threshold of an asset may be defined by the server / sender and may be changed by the end client during a session, and vice versa.
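The layered approach, with a base stream plus detail layers added as bandwidth increases, might look like the following sketch. The layer names, per-layer costs, and the rule that the base layer is always sent are assumptions made for this example.

```python
from typing import List, Tuple


def layers_to_send(layers: List[Tuple[str, float]], budget_mbps: float) -> List[str]:
    """Choose layers in order, base layer first.

    `layers` is a list of (name, cost_mbps) pairs ordered from the base
    layer to the finest detail layer. The base layer is always sent
    (an assumption); further layers are added while the budget allows.
    """
    if not layers:
        return []
    base_name, used = layers[0]
    sent = [base_name]
    for name, cost in layers[1:]:
        if used + cost > budget_mbps:
            break  # a layer is only useful if all layers below it were sent
        sent.append(name)
        used += cost
    return sent


asset_layers = [("base", 5.0), ("materials", 10.0), ("fine_detail", 20.0)]
```

Stopping at the first layer that does not fit, rather than skipping it and trying the next one, reflects the assumption that each detail layer refines the layers beneath it.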
[0030] According to one aspect of this disclosure, the server can have a predefined flat background image. This predefined background can provide a pleasant viewing experience for the client when the client's bandwidth is limited and the end client cannot stream and / or render all assets in the scene. The background image can be updated periodically based on the scene being rendered. As an example, when bandwidth is very limited, a predefined 2D background video can be used. Therefore, when depth-based adaptive streaming is available, the scene is not rendered entirely as a 3D scene, but can be rendered as a 2D stream. Thus, a scene can be partially 3D and partially 2D.
[0031] Figure 3A shows a flowchart of an adaptive streaming process 300 for immersive media according to an embodiment of the present disclosure.
[0032] As shown in Figure 3A, in operation 305, network capabilities associated with the terminal client can be determined. As an example, the network capabilities associated with the client device can be determined by a server (which may be part of network 855) or a media distribution processor. In some embodiments, processing capabilities associated with the terminal client can also be determined. Based on the determined capabilities associated with the terminal client, a portion of the scene to be transmitted can be determined.
[0033] In operation 310, a portion of the scene to be transmitted to the end client can be determined based on the capabilities associated with the end client. As an example, a server or media distribution processor can determine the portion of the scene to be transmitted to the end client based on the capabilities associated with the end client.
[0034] According to one aspect, determining the scene to be transmitted may include determining a depth associated with the scene to be transmitted based on network capabilities; and adjusting the scene to be transmitted based on the depth to include one or more first objects in the scene, wherein the one or more first objects are located at a first distance within the depth. In some embodiments, it may further include adjusting the scene to be transmitted based on the depth to exclude one or more second objects in the scene, wherein the one or more second objects are located at a distance beyond the depth.
[0035] According to one aspect, determining the scene to be transmitted may include determining a threshold priority associated with one or more objects in the scene to be transmitted based on network capabilities; and adjusting the scene to be transmitted based on the threshold priority to include one or more first objects among the one or more objects in the scene, wherein the one or more first objects have a higher priority than the threshold priority. It may also include adjusting the scene to be transmitted based on the threshold priority to exclude one or more second objects among the one or more objects in the scene, wherein the one or more second objects have a lower priority than the threshold priority. In some embodiments, the priority of a corresponding object associated with one or more objects in the scene may be determined based on the distance between the corresponding object and the imaging device capturing the scene.
[0036] According to one aspect, determining the scene to be transmitted may include receiving a request from the terminal client for an alternative scene based on network capabilities associated with the terminal client, wherein the alternative scene has fewer objects than one or more objects in the scene; and adjusting the alternative scene to be transmitted to include one or more first objects among the one or more objects, wherein the one or more first objects have a priority higher than a threshold priority. It may also include adjusting the alternative scene to be transmitted to exclude one or more second objects among the one or more objects, wherein the one or more second objects have a priority lower than a threshold priority. In some embodiments, the corresponding priorities associated with one or more objects in the scene may be defined by the terminal client or the server.
[0037] In operation 315, an immersive media stream associated with the scene can be transmitted based on the determined portion. In some embodiments, the immersive media stream can be transmitted from a server or media distribution processor to the end client.
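Operations 305, 310, and 315 can be tied together in a short sketch. Everything specific here is an assumption: the function name `run_process_300`, expressing the client's processing capability as a rate (`decode_capacity_mbps`), taking the smaller of the two capabilities as the limit, and passing the scene as a mapping from object name to cost ordered from most to least important.

```python
from typing import Callable, Dict, List


def run_process_300(
    scene: Dict[str, float],
    network_mbps: float,
    decode_capacity_mbps: float,
    send: Callable[[List[str]], None],
) -> List[str]:
    """Sketch of process 300 under the stated assumptions."""
    # Operation 305: determine the capabilities associated with the client.
    # Assumption: the slower of network and processing is the bottleneck.
    capability = min(network_mbps, decode_capacity_mbps)

    # Operation 310: determine the portion of the scene to transmit.
    portion: List[str] = []
    used = 0.0
    for name, cost in scene.items():  # assumed ordered most important first
        if used + cost > capability:
            break
        portion.append(name)
        used += cost

    # Operation 315: transmit the stream for the determined portion.
    send(portion)
    return portion
```

For example, passing `sent.extend` as `send`, where `sent` is a list, records what would have been transmitted without needing a real network.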
[0038] Figure 3B shows a flowchart of an adaptive streaming process 350 for immersive media according to an embodiment of the present disclosure.
[0039] As shown in Figure 3B, in operation 355, features associated with a scene to be transmitted to the end client can be determined. As an example, the features associated with the scene to be transmitted to the end client can be determined by a server (which may be part of network 855) or a media distribution processor. In some embodiments, the determined features may include image and video features and encoded data associated with the immersive media stream. In some embodiments, the determined features may include depth or priority information associated with images, video, or the scene associated with the immersive media stream. In some embodiments, network capabilities/bandwidth and processing power associated with the end client can also be determined. The features can be determined based on the determined capabilities and/or the scene to be transmitted to the end client.
[0040] In operation 360, a portion of the scene to be transmitted to the end client can be determined or adjusted based on the determined features associated with the scene to be transmitted to the end client. As an example, a server or media distribution processor can determine at least a portion of the scene to be transmitted to the end client based on the determined features associated with the scene to be transmitted to the end client.
[0041] According to one aspect, adjusting the scene to be transmitted may include: determining a depth associated with the scene to be transmitted based on identified features associated with the scene to be transmitted to the terminal client; and adjusting the scene to be transmitted based on the depth to include one or more first objects in the scene, wherein the one or more first objects are located at a first distance within the depth. In some embodiments, it may further include adjusting the scene to be transmitted based on the depth to exclude one or more second objects in the scene, wherein the one or more second objects are located at a distance beyond the depth.
[0042] According to one aspect, adjusting the scene to be transmitted may include: determining a threshold priority associated with one or more objects in the scene to be transmitted, based on determined features associated with the scene to be transmitted to a terminal client; and adjusting the scene to be transmitted based on the threshold priority to include one or more first objects among the one or more objects in the scene, wherein the one or more first objects have a higher priority than the threshold priority. It may also include adjusting the scene to be transmitted based on the threshold priority to exclude one or more second objects among the one or more objects in the scene, wherein the one or more second objects have a lower priority than the threshold priority. In some embodiments, the priority of a corresponding object associated with one or more objects in the scene may be determined based on the distance between the corresponding object and the imaging device capturing the scene.
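Deriving priorities from the distance to the imaging device could be done as in this sketch. The disclosure says only that priority may be based on that distance; the rule that nearer objects receive higher priority (rank 1 being the highest) and the function name are assumptions.

```python
from typing import Dict


def priorities_from_distance(distances_m: Dict[str, float]) -> Dict[str, int]:
    """Assign rank 1 to the nearest object, rank 2 to the next, and so on.

    Nearer-means-more-important is an assumption made for illustration.
    """
    ordered = sorted(distances_m, key=distances_m.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}
```

The resulting ranks can then be compared against a threshold priority to include or exclude objects.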
[0043] According to one aspect, adjusting the scene to be transmitted may include: receiving a request from the terminal client for an alternative scene based on determined characteristics associated with the scene to be transmitted to the terminal client, wherein the alternative scene has fewer objects than one or more objects in the scene; and adjusting the alternative scene to be transmitted to include one or more first objects among the one or more objects, wherein the one or more first objects have a priority higher than a threshold priority. It may also include adjusting the alternative scene to be transmitted to exclude one or more second objects among the one or more objects, wherein the one or more second objects have a priority lower than a threshold priority. In some embodiments, the corresponding priorities associated with one or more objects in the scene may be defined by the terminal client or the server.
[0044] In operation 365, an adaptive stream of immersive media associated with the scene can be transmitted based on the determined portion. In some embodiments, the immersive media stream can be transmitted from a server or media distribution processor to the end client.
[0045] Although Figures 3A-3B illustrate example blocks of processes 300 and 350, in some implementations processes 300 and 350 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in Figures 3A-3B. Additionally or alternatively, two or more blocks of processes 300 and 350 may be executed in parallel.
[0046] Furthermore, the proposed methods can be implemented using processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, one or more processors execute a program stored in a non-transitory computer-readable medium to perform one or more of the proposed methods.
[0047] The above-described technology can be implemented as computer software using computer-readable instructions and physically stored in one or more computer-readable media. For example, Figure 8 shows a computer system 800 suitable for implementing certain embodiments of the disclosed subject matter.
[0048] Computer software can be encoded using any suitable machine code or computer language, and can be subject to assembly, compilation, linking, or similar mechanisms to create code containing instructions that can be executed by computer central processing units (CPUs), graphics processing units (GPUs), and the like, either directly or through interpretation, microcode execution, etc.
[0049] The instructions can be executed on various types of computers or their components, including, for example, personal computers, tablets, servers, smartphones, gaming devices, and Internet of Things (IoT) devices.
[0050] Figure 4 shows a simplified block diagram of a communication system 400 according to an embodiment of the present disclosure. The communication system 400 may include at least two terminals 410 and 420 interconnected via a network 450. For unidirectional data transmission, a first terminal 410 may encode video data at its local location for transmission to the other terminal 420 via the network 450. The second terminal 420 may receive the encoded video data of the other terminal from the network 450, decode the encoded data, and display the recovered video data. Unidirectional data transmission is common in media service applications and the like.
[0051] Figure 4 shows a second pair of terminals 430 and 440, provided to support bidirectional transmission of encoded video, for example, during video conferencing. For bidirectional data transmission, each terminal 430 and 440 can encode video data captured at a local location for transmission to the other terminal via network 450. Each terminal 430 and 440 can also receive encoded video data transmitted by the other terminal, decode the encoded data, and display the recovered video data on a local display device.
[0052] In Figure 4, terminals 410-440 may be illustrated as servers, personal computers, and smartphones, but the principles of this disclosure are not limited thereto. Embodiments of this disclosure are applicable to laptop computers, tablet computers, media players, and/or dedicated video conferencing equipment. Network 450 represents any number of networks transmitting encoded video data between terminals 410-440, including, for example, wired and/or wireless communication networks. Communication network 450 may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks (LANs), wide area networks (WANs), and/or the Internet. For the purposes of this discussion, the architecture and topology of network 450 may be immaterial to the operation of this disclosure, unless explained herein.
[0053] As an example of an application of the disclosed subject matter, Figure 5 illustrates the placement of video encoders and decoders in a streaming environment, such as streaming system 500. The disclosed subject matter can also be applied to other video-enabled applications, including, for example, video conferencing, digital television, and storing compressed video on digital media including Compact Discs (CDs), Digital Video Discs (DVDs), memory sticks, etc.
[0054] The streaming system may include a capture subsystem 513, which may include a video source 501, such as a digital camera, that creates, for example, an uncompressed video sample stream 502. The sample stream 502, depicted as a thick line to emphasize its high data volume compared to an encoded video bitstream, may be processed by an encoder 503 coupled to the video source 501, such as the camera. The encoder 503 may include hardware, software, or a combination thereof to enable or implement aspects of the disclosed subject matter, which are described in more detail below. The encoded video bitstream 504, depicted as a thin line to emphasize its lower data volume compared to the sample stream, may be stored on a streaming server 505 for future use. One or more streaming clients 506, 508 may access the streaming server 505 to retrieve copies of the encoded video bitstream 504, such as video bitstreams 507 and 509. Client 506 may include a video decoder 510 that decodes an input copy of the encoded video bitstream 507 and creates an output video sample stream 511 that can be presented on a display 512 or other presentation device (not shown). In some streaming systems, video bitstreams 504, 507, and 509 can be encoded according to certain video coding/compression standards. Examples of these standards include ITU-T Recommendation H.265 (High Efficiency Video Coding, HEVC). A video coding standard informally known as Versatile Video Coding (VVC) is under development. The disclosed subject matter can be used in the context of VVC.
[0055] Figure 6 may be a functional block diagram of the video decoder 510 according to an embodiment.
[0056] Receiver 610 can receive one or more encoded video sequences to be decoded by decoder 510; in the same or another embodiment, one encoded video sequence is received at a time, wherein the decoding of each encoded video sequence is independent of other encoded video sequences. Encoded video sequences can be received from channel 612, which can be a hardware/software link to a storage device storing the encoded video data. Receiver 610 can receive encoded video data and other data, such as encoded audio data and/or auxiliary data streams, which can be forwarded to their respective user entities (not shown). Receiver 610 can separate the encoded video sequences from the other data. To combat network jitter, buffer memory 615 can be coupled between receiver 610 and entropy decoder/parser 620 (hereinafter referred to as "parser"). Buffer 615 may be unnecessary, or can be small, when receiver 610 receives data from a store/forward device with sufficient bandwidth and controllability or from a synchronization network. For use on best-effort packet networks such as the Internet, buffer 615 may be required, can be relatively large, and can have an adaptive size.
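A buffer with an adaptive size, placed between a receiver and a parser, can be illustrated with a small sketch. This is not the design of buffer 615, which the disclosure does not detail; the class name `JitterBuffer`, the prime-then-play rule, and the policy of doubling the target depth after an underrun are assumptions.

```python
from collections import deque
from typing import Any, Deque, Optional


class JitterBuffer:
    """Holds packets until enough are queued, then releases them in order."""

    def __init__(self, target_depth: int = 2, max_depth: int = 8) -> None:
        self.queue: Deque[Any] = deque()
        self.target_depth = target_depth
        self.max_depth = max_depth
        self.primed = False  # True once enough packets have been queued

    def push(self, packet: Any) -> None:
        """Called by the receiver for each arriving packet."""
        self.queue.append(packet)
        if len(self.queue) >= self.target_depth:
            self.primed = True

    def pop(self) -> Optional[Any]:
        """Called by the parser; returns None while the buffer is refilling."""
        if not self.primed:
            return None
        if not self.queue:
            # Underrun: adapt by growing the target depth, then refill.
            self.target_depth = min(self.target_depth * 2, self.max_depth)
            self.primed = False
            return None
        return self.queue.popleft()
```

A larger target depth absorbs more arrival-time variation at the cost of added latency, which is why the size is worth adapting rather than fixing.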
[0057] Video decoder 510 may include parser 620 to reconstruct symbols 621 from an entropy-coded video sequence. These symbols include information for managing the operation of decoder 510 and, potentially, information for controlling a presentation device (e.g., display 512) that is not part of the decoder but may be coupled to it, as shown in Figure 6. The control information used for the presentation device may be in the form of Supplemental Enhancement Information (SEI) messages or Video Usability Information (VUI) parameter set fragments (not shown). The parser 620 can parse/entropy decode the received encoded video sequence. The encoding of the encoded video sequence can be based on video coding techniques or standards and can follow principles known to those skilled in the art, including variable-length coding, Huffman coding, arithmetic coding with or without context sensitivity, etc. The parser 620 can extract, from the encoded video sequence, a set of subgroup parameters for at least one subgroup of pixels, based on at least one parameter corresponding to the group. The subgroups may include Groups of Pictures (GOPs), pictures, tiles, slices, macroblocks, coding units (CUs), blocks, transform units (TUs), prediction units (PUs), etc. The entropy decoder/parser can also extract information from the encoded video sequence, such as transform coefficients, quantizer parameter (QP) values, motion vectors, etc.
[0058] The parser 620 can perform entropy decoding / parsing operations on the video sequence received from the buffer 615 to create symbols 621. The parser 620 can receive encoded data and selectively decode specific symbols 621. Furthermore, the parser 620 can determine whether a specific symbol 621 will be provided to the motion compensation prediction unit 653, the scaler / inverse transform unit 651, the intra-frame prediction unit 652, or the loop filter 656.
[0059] Depending on the type of encoded video picture or its components (e.g., inter- and intra-pictures, inter- and intra-blocks) and other factors, the reconstruction of symbol 621 can involve multiple distinct units. Which units are involved and how they are involved can be controlled by subgroup control information parsed from the encoded video sequence by parser 620. For clarity, the flow of this subgroup control information between parser 620 and the various units below is not described.
[0060] In addition to the functional blocks already mentioned, the decoder 510 can be conceptually subdivided into multiple functional units as described below. In a practical implementation operating under commercial constraints, many of these units interact closely with each other and can be integrated with each other at least partially. However, for the purpose of describing the disclosed subject matter, it is appropriate to conceptually subdivide it into the following functional units.
[0061] The first unit is the scaler / inverse transform unit 651. The scaler / inverse transform unit 651 receives the quantized transform coefficients and control information, including which transform to use, block size, quantization factor, quantization scaling matrix, etc., as symbols 621 from the parser 620. The scaler / inverse transform unit can output blocks containing sample values, which can be input into the aggregator 655.
[0062] In some cases, the output samples of the scaler / inverse transform unit 651 may belong to intra-coded blocks; that is, blocks that do not use prediction information from previously reconstructed images but can use prediction information from previously reconstructed portions of the current image. The intra-image prediction unit 652 can provide such prediction information. In some cases, the intra-image prediction unit 652 uses surrounding reconstructed information obtained from the current (partially reconstructed) image 658 to generate blocks with the same size and shape as the blocks in the reconstruction. In some cases, the aggregator 655 adds the prediction information already generated by the intra-prediction unit 652 to the output sample information provided by the scaler / inverse transform unit 651 based on each sample.
[0063] In other cases, the output samples of the scaler / inverse transform unit 651 may belong to inter-frame coded and possibly motion-compensated blocks. In this case, the motion compensation prediction unit 653 can access the reference image memory 657 to obtain samples for prediction. After the obtained samples have been motion compensated according to the symbols 621 belonging to the block, the aggregator 655 can add them to the output of the scaler / inverse transform unit 651, which in this case is referred to as residual samples or a residual signal, to generate output sample information. The addresses in the reference image memory from which the motion compensation unit obtains the prediction samples can be controlled by motion vectors, which are available to the motion compensation unit in the form of symbols 621 and may have, for example, X, Y, and reference image components. When sub-sample-accurate motion vectors are used, motion compensation may also include interpolation of sample values obtained from the reference image memory, motion vector prediction mechanisms, etc.
[0064] The output samples of aggregator 655 can undergo various loop filtering techniques in loop filter unit 656. Video compression techniques may include loop filtering techniques that are controlled by parameters contained in the encoded video bitstream and made available to loop filter unit 656 as symbols 621 from parser 620, but that may also respond to metadata obtained during decoding of previous (in decoding order) portions of the encoded picture or encoded video sequence, and to previously reconstructed and loop-filtered sample values.
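The reconstruction path described in paragraphs [0061] to [0063] can be illustrated with a minimal sketch. It assumes integer-sample motion vectors (no sub-sample interpolation), edge clamping for vectors that point outside the picture, and 8-bit samples; the function names `fetch_prediction` and `aggregate` are hypothetical and are not taken from this disclosure or from any standard.

```python
def fetch_prediction(reference, x, y, mv, size):
    """Fetch a size x size prediction block from a reference picture.

    The block at position (x, y) is displaced by an integer motion
    vector mv = (dx, dy). Clamping at the picture edges is an assumption
    of this sketch.
    """
    dx, dy = mv
    height, width = len(reference), len(reference[0])
    block = []
    for row in range(size):
        ry = min(max(y + dy + row, 0), height - 1)
        line = []
        for col in range(size):
            rx = min(max(x + dx + col, 0), width - 1)
            line.append(reference[ry][rx])
        block.append(line)
    return block


def aggregate(prediction, residual, bit_depth=8):
    """Add residual samples to prediction samples, sample by sample,
    and clip the result to the valid sample range."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]


# Usage: a 4x4 reference picture whose sample at (row r, column c) is 10*r + c.
reference = [[10 * r + c for c in range(4)] for r in range(4)]
prediction = fetch_prediction(reference, x=1, y=1, mv=(1, 0), size=2)
reconstructed = aggregate(prediction, [[1, -2], [300, -30]])
```

In this sketch the residual plays the role of the output of the scaler / inverse transform unit, and the clipping step keeps the reconstructed samples within the assumed bit depth.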
[0065] The output of the loop filter unit 656 can be a sample stream, which can be output to the presentation device 521 and stored in the reference image memory 657 for future inter-frame image prediction.
[0066] Once fully reconstructed, certain coded images can be used as reference images for future predictions. Once a coded image has been fully reconstructed and has been identified as a reference image (e.g., by parser 620), the current image 658 can become part of the reference image memory 657, and a new current image memory can be reallocated before the reconstruction of the next coded image begins.
[0067] The video decoder 510 can perform decoding operations according to a predetermined video compression technique, which may be documented in a standard such as ITU-T Rec. H.265. The encoded video sequence may conform to the syntax specified by the video compression technique or standard in use, in the sense that it adheres to the syntax of that technique or standard as specified in the corresponding technical document or standard, and in particular in the profiles defined therein. Conformance also requires the complexity of the encoded video sequence to be within the bounds defined by the level of the video compression technique or standard. In some cases, the level limits the maximum picture size, maximum frame rate, maximum reconstruction sample rate (e.g., measured in megasamples per second), maximum reference picture size, etc. In some cases, the limits set by the level can be further restricted by the hypothetical reference decoder (HRD) specification and by metadata for HRD buffer management signaled in the encoded video sequence.
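As a rough illustration of how a level bounds the complexity of an encoded video sequence, the following sketch checks a picture size and frame rate against a set of limits. The limit values are hypothetical round numbers chosen for the example and are not the limits of any H.265 level; the names `LevelLimits` and `conforms` are likewise assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelLimits:
    max_picture_samples: int  # largest allowed picture, in samples
    max_frame_rate: float     # largest allowed frame rate, in frames per second
    max_sample_rate: int      # largest allowed reconstruction rate, in samples per second


def conforms(width, height, frame_rate, limits):
    """Return True if the stream parameters stay within every level limit."""
    picture_samples = width * height
    return (
        picture_samples <= limits.max_picture_samples
        and frame_rate <= limits.max_frame_rate
        and picture_samples * frame_rate <= limits.max_sample_rate
    )


# Hypothetical limits, for illustration only.
EXAMPLE_LEVEL = LevelLimits(
    max_picture_samples=2_100_000,
    max_frame_rate=60.0,
    max_sample_rate=70_000_000,
)

ok_1080p30 = conforms(1920, 1080, 30, EXAMPLE_LEVEL)  # within all three limits
ok_1080p60 = conforms(1920, 1080, 60, EXAMPLE_LEVEL)  # exceeds the sample rate limit
ok_2160p30 = conforms(3840, 2160, 30, EXAMPLE_LEVEL)  # exceeds the picture size limit
```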
[0068] In one embodiment, receiver 610 may receive additional (redundant) data with encoded video. This additional data may be included as part of the encoded video sequence. Video decoder 510 may use the additional data to correctly decode the data and / or more accurately reconstruct the original video data. The additional data may be, for example, in the form of temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant images, forward error correction codes, etc.
[0069] Figure 7 may be a functional block diagram of a video encoder 503 according to an embodiment of this disclosure.
[0070] Encoder 503 can receive video samples from video source 501 (not part of the encoder), which can capture video images to be encoded by encoder 503.
[0071] Video source 501 can provide a source video sequence to be encoded by encoder 503 in the form of a digital video sample stream, which can have any suitable bit depth (e.g., 8-bit, 10-bit, 12-bit, ...), any color space (e.g., BT.601 Y CrCb, RGB, ...), and any suitable sampling structure (e.g., Y CrCb 4:2:0, Y CrCb 4:4:4). In a media service system, video source 501 can be a storage device storing previously prepared video. In a video conferencing system, video source 501 can be a camera capturing local image information as a video sequence. Video data can be provided as multiple individual pictures that impart motion when viewed in sequence. The pictures themselves can be organized as a spatial array of pixels, where each pixel can include one or more samples, depending on the sampling structure, color space, etc., in use. Those skilled in the art will readily understand the relationship between pixels and samples. The following description focuses on samples.
[0072] According to one embodiment, the video encoder 503 can encode and compress images of a source video sequence into an encoded video sequence 743 in real time or under any other time constraints required by the application. Implementing an appropriate encoding rate is a function of the controller 750. The controller controls and is functionally coupled to other functional units as described below. For clarity, coupling is not described. Parameters set by the controller may include rate control-related parameters (image skipping, quantizer, λ value of rate-distortion optimization techniques, etc.), image size, group of pictures (GOP) layout, maximum motion vector search range, etc. Other functions of the controller 750 can be readily identified by those skilled in the art, as they may be related to the video encoder 503 optimized for a particular system design.
[0073] Some video encoders operate within an “encoding loop” readily recognizable to those skilled in the art. As a simplified description, the encoding loop can consist of the encoding portion of the encoder, namely source encoder 730 (hereinafter referred to as the “source encoder”), which is responsible for creating symbols based on the input picture to be encoded and the reference picture(s), and a (local) decoder 733 embedded in encoder 503 that reconstructs the symbols to create the sample data that a (remote) decoder would also create (since any compression between the symbols and the encoded video bitstream is lossless in the video compression techniques considered in the disclosed subject matter). This reconstructed sample stream is input to reference picture memory 734. Since the decoding of the symbol stream leads to bit-exact results independent of the decoder location (local or remote), the contents of the reference picture memory are also bit-exact between the local encoder and the remote decoder. In other words, the prediction portion of the encoder “sees” as reference picture samples exactly the same sample values as a decoder would “see” when using prediction during decoding. The basic principles of reference picture synchronization (and the resulting drift, if synchronization cannot be maintained, for example, due to channel errors) are well known to those skilled in the art.
[0074] The operation of the "local" decoder 733 can be the same as that of the "remote" decoder 510, which has already been described in detail above in conjunction with Figure 6. Referring briefly also to Figure 7, however, since symbols are available and the encoding / decoding of symbols to and from the encoded video sequence by the entropy encoder 745 and the parser 620 can be lossless, the entropy decoding parts of the decoder 510 (including the channel 612, receiver 610, buffer 615 and parser 620) may not be fully implemented in the local decoder 733.
[0075] At this point, it can be observed that any decoder technique, other than the parsing / entropy decoding present in the decoder, must also exist in the corresponding encoder in essentially the same functional form. The description of encoder techniques can therefore be abbreviated, as these techniques are the inverse of the fully described decoder techniques. A more detailed description is required only in certain areas, and is provided below.
[0076] As part of its operation, the source encoder 730 can perform motion-compensated predictive coding, which predictively encodes the input frame by referencing one or more previously encoded frames from the video sequence designated as "reference frames". In this way, the encoding engine 732 encodes the differences between pixel blocks of the input frame and pixel blocks of the reference frame, which can be selected as the predictive reference for the input frame.
[0077] The local video decoder 733 can decode encoded video data of frames that can be designated as reference frames, based on symbols created by the source encoder 730. The operation of the encoding engine 732 can advantageously be a lossy process. When the encoded video data is decoded at a video decoder (not shown in Figure 7), the reconstructed video sequence can typically be a replica of the source video sequence with some errors. The local video decoder 733 replicates the decoding process that the video decoder performs on reference frames, and can store the reconstructed reference frames in the reference picture memory 734. In this way, the encoder 503 can locally store copies of the reconstructed reference frames that have the same content as the reconstructed reference frames that will be obtained by the remote video decoder (in the absence of transmission errors).
[0078] Predictor 735 can perform prediction searches for encoding engine 732. That is, for a new frame to be encoded, predictor 735 can search the reference image memory 734 for sample data (as candidate reference pixel blocks) or certain metadata, such as reference image motion vectors, block shapes, etc., which can serve as appropriate prediction references for the new image. Predictor 735 can operate on a sample-block-by-pixel-block basis to find suitable prediction references. In some cases, as determined by the search results obtained by predictor 735, the input image can have prediction references drawn from multiple reference images stored in reference image memory 734.
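The role of the local decoder can be illustrated with a deliberately simplified, one-dimensional analogy. This is a sketch under stated assumptions and not the encoder of this disclosure: each sample is predicted from the previous reconstructed sample, the prediction difference is quantized with a uniform step (the lossy part), and the names `encode_with_local_decoder` and `remote_decode` are hypothetical.

```python
def quantize(value, step):
    """Lossy step: map a prediction difference to an integer level."""
    return round(value / step)


def dequantize(level, step):
    """Inverse of quantize, up to the quantization error."""
    return level * step


def encode_with_local_decoder(samples, step):
    """Encode samples predictively and return (levels, local reconstruction).

    The prediction reference is the locally reconstructed previous sample,
    not the original one, so it matches what a remote decoder will hold.
    """
    levels, local_reconstruction = [], []
    reference = 0  # both sides are assumed to start from the same reference
    for sample in samples:
        level = quantize(sample - reference, step)
        levels.append(level)
        # Local decoder: rebuild the sample exactly as the remote decoder would.
        reference = reference + dequantize(level, step)
        local_reconstruction.append(reference)
    return levels, local_reconstruction


def remote_decode(levels, step):
    """Reconstruct samples from the transmitted levels only."""
    reconstruction, reference = [], 0
    for level in levels:
        reference = reference + dequantize(level, step)
        reconstruction.append(reference)
    return reconstruction


source = [10, 13, 19, 18, 25, 40]
levels, local = encode_with_local_decoder(source, step=4)
remote = remote_decode(levels, step=4)
```

Because the encoder predicts from its own reconstruction, `remote` equals `local` sample for sample, and the quantization error does not accumulate from one sample to the next. Predicting from the original samples instead would let the two sides drift apart.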
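One common way to carry out such a prediction search is exhaustive block matching with a sum-of-absolute-differences (SAD) cost. The sketch below is an assumption for illustration; the disclosure does not state which search strategy or cost predictor 735 uses, and the names `sad`, `extract` and `search` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract(picture, x, y, size):
    """Return the size x size block whose top-left sample is at (x, y)."""
    return [row[x:x + size] for row in picture[y:y + size]]


def search(reference, block, x, y, search_range):
    """Full search around (x, y); returns (best motion vector, best cost).

    Candidates that would fall outside the reference picture are skipped.
    """
    size = len(block)
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cx, cy = x + dx, y + dy
            if cx < 0 or cy < 0 or cx + size > width or cy + size > height:
                continue
            cost = sad(block, extract(reference, cx, cy, size))
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost


# Usage: the block to encode sits one sample to the right of its nominal position.
reference = [[r * 6 + c for c in range(6)] for r in range(6)]
block = extract(reference, 3, 2, 2)
vector, cost = search(reference, block, x=2, y=2, search_range=2)
```

The `search_range` argument corresponds loosely to the maximum motion vector search range that the controller is described as setting.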
[0079] The controller 750 can manage the encoding operations of the source encoder 730, including, for example, the setting of parameters and subgroup parameters for encoding video data.
[0080] The outputs of all the aforementioned functional units can undergo entropy encoding in the entropy encoder 745. The entropy encoder converts the symbols generated by the various functional units into an encoded video sequence by losslessly compressing the symbols according to techniques known to those skilled in the art, such as Huffman coding, variable-length coding, arithmetic coding, etc.
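As an illustration of lossless symbol compression, the following sketch builds a Huffman code for a short list of symbols and verifies the round trip. It is a textbook construction given under the assumption that symbols are plain hashable values; it is not the entropy coder of any particular video standard, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Return a dict mapping each distinct symbol to a prefix-free bit string."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [
        (weight, index, {symbol: ""})
        for index, (symbol, weight) in enumerate(counts.items())
    ]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def entropy_encode(symbols, code):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)


def entropy_decode(bits, code):
    """Recover the symbols; works because no code word prefixes another."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    return decoded


symbols = [0, 0, 0, 0, 1, 1, -1, 2]
code = build_huffman_code(symbols)
bits = entropy_encode(symbols, code)
restored = entropy_decode(bits, code)
```

Frequent symbols receive shorter code words, so `bits` is shorter than a fixed two-bit-per-symbol representation of the same four distinct symbols, while `restored` equals `symbols` exactly.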
[0081] Transmitter 740 can buffer the encoded video sequence created by entropy encoder 745 in preparation for transmission via communication channel 760, which can be a hardware / software link to a storage device that will store the encoded video data. Transmitter 740 can combine encoded video data from video encoder 730 with other data to be transmitted, such as encoded audio data and / or auxiliary data streams (source not shown).
[0082] The controller 750 can manage the operation of the encoder 503. During encoding, the controller 750 can assign a specific encoded image type to each encoded image, which can affect the encoding techniques that can be applied to the corresponding image. For example, an image can typically be specified as one of the following frame types:
[0083] An intra-picture (I-picture) can be a picture that is encoded and decoded without using any other frame in the sequence as a prediction source. Some video codecs allow different types of intra-pictures, including, for example, Independent Decoder Refresh (IDR) pictures. Those skilled in the art are aware of these variations of I-pictures and their corresponding applications and characteristics.
[0084] A predictive picture (P-picture) can be a picture that is encoded and decoded using intra-frame prediction or inter-frame prediction, with at most one motion vector and one reference index used to predict the sample values of each block.
[0085] A bidirectionally predictive picture (B-picture) can be a picture that is encoded and decoded using intra-frame prediction or inter-frame prediction, with at most two motion vectors and two reference indices used to predict the sample values of each block. Similarly, a multi-predictive picture can use more than two reference pictures and associated metadata to reconstruct a single block.
[0086] The source image can typically be spatially subdivided into multiple sample blocks (e.g., blocks of 4×4, 8×8, 4×8, or 16×16 samples each) and encoded on a block-by-block basis. Blocks can be predictively encoded by referencing other (already encoded) blocks, as determined by the encoding assignment applied to the image to which the blocks belong. For example, blocks of I-pictures can be encoded non-predictively, or predictively (spatial prediction or intra-frame prediction) by referencing already encoded blocks of the same image. Pixel blocks of P-pictures can be encoded predictively, via spatial prediction or via temporal prediction, by referencing one previously encoded reference image. Blocks of B-pictures can be encoded predictively, via spatial prediction or via temporal prediction, by referencing one or two previously encoded reference images.
[0087] The video encoder 503 can perform encoding operations according to a predetermined video coding technique or standard (e.g., ITU-T Rec. H.265). In its operation, the video encoder 503 can perform various compression operations, including predictive coding operations that exploit temporal and spatial redundancy in the input video sequence. Therefore, the encoded video data can conform to the syntax specified by the video coding technique or standard being used.
[0088] In one embodiment, transmitter 740 may transmit additional data along with the encoded video. Source encoder 730 may include such data as part of the encoded video sequence. The additional data may include temporal / spatial / SNR enhancement layers, other forms of redundant data (e.g., redundant images and slices), Supplemental Enhancement Information (SEI) messages, Video Usability Information (VUI) parameter set fragments, etc.
[0089] The components of the computer system 800 shown in Figure 8 are exemplary in nature and are not intended to impose any limitation on the scope or functionality of the computer software used to implement the embodiments of this disclosure. The configuration of the components should also not be construed as having any dependency or requirement on any component or combination of components shown in the exemplary embodiments of the computer system 800.
[0090] Computer system 800 may include certain human-machine interface input devices. Such human-machine interface input devices may respond to input from one or more human users via, for example, tactile input (e.g., keystrokes, swipes, data glove movements), audio input (e.g., voice, clapping), visual input (e.g., gestures), and olfactory input (not shown). The human-machine interface device may also be used to capture certain media that are not necessarily directly related to conscious human input, such as audio (e.g., speech, music, ambient sounds), images (e.g., scanned images, photographic images obtained from still image cameras), and video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).
[0091] The input human-machine interface device may include one or more of the following (only one of each is depicted): keyboard 801, mouse 802, trackpad 803, touch screen 810, joystick 805, microphone 806, scanner 807, and camera 808.
[0092] The computer system 800 may also include certain human-machine interface (HMI) output devices. Such HMI output devices can stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell / taste. These HMI output devices may include tactile output devices (e.g., tactile feedback via touchscreen 810, data glove 1204, or joystick 805, although there may also be tactile feedback devices that do not serve as input devices), audio output devices (e.g., speaker 809, headphones (not shown)), visual output devices (e.g., screen 810, including cathode ray tube (CRT) screens, liquid crystal display (LCD) screens, plasma screens, and organic light-emitting diode (OLED) screens, each with or without touchscreen input capability and each with or without tactile feedback capability; some of these can produce two-dimensional visual output or more than three-dimensional output by means such as stereoscopic output; virtual reality glasses (not shown), holographic displays, and smoke tanks (not shown)), and printers (not shown).
[0093] The computer system 800 may also include human-accessible storage devices and their associated media, such as optical media including CD / DVD ROM (Read-Only Memory) / RW (Read / Write) 820 with CD / DVD or similar media 821, thumb drives 822, removable hard disk drives or solid-state drives 823, conventional magnetic media such as magnetic tapes and floppy disks (not shown), and special-purpose ROM / ASIC (Application Specific Integrated Circuit) / PLD (Programmable Logic Device) devices such as security dongles (not shown).
[0094] Those skilled in the art should also understand that the term "computer-readable medium" as used in connection with the presently disclosed subject matter does not include transmission media, carrier waves, or other transient signals.
[0095] The computer system 800 may also include an interface to one or more communication networks 855. Network 855 may be, for example, wireless, wired, or optical. Network 855 may also be local area, wide area, metropolitan, vehicular, industrial, real-time, latency-tolerant, etc. Examples of network 855 include local area networks such as Ethernet and wireless local area networks (LANs); cellular networks including Global System for Mobile Communications (GSM), Third Generation (3G), Fourth Generation (4G), Fifth Generation (5G), and Long Term Evolution (LTE); cable or wireless wide area digital television networks including cable television, satellite television, and terrestrial broadcast television; and vehicular and industrial networks including Controller Area Network Bus (CANBus) technology. Some networks 855 typically require an external network interface adapter 854 to connect to certain general-purpose data ports or peripheral buses 849 (e.g., the Universal Serial Bus (USB) port of computer system 800); others are typically integrated into the core of computer system 800 via connection to system buses as described below (e.g., an Ethernet interface in a personal computer (PC) system or a cellular network interface in a smartphone system). Using any of these networks 855, computer system 800 can communicate with other entities. This communication can be unidirectional and receive-only (e.g., broadcast television), unidirectional and transmit-only (e.g., to a CANbus device), or bidirectional, e.g., to other computer systems using local area or wide area digital networks. As described above, certain protocols and protocol stacks can be used on each of these networks 855 and network interfaces 854.
[0096] The aforementioned human-machine interface devices, human-accessible storage devices, and network interfaces can be attached to the core 840 of the computer system 800.
[0097] Core 840 may include one or more central processing units (CPUs) 841, graphics processing units (GPUs) 842, dedicated programmable processing units in the form of field-programmable gate arrays (FPGAs) 843, task-specific hardware accelerators (e.g., accelerators 844), graphics adapters 844, etc. These devices, along with read-only memory (ROM) 845, random-access memory (RAM) 846, and internal mass storage 847 such as internal non-user-accessible hard disk drives and solid-state drives (SSDs), can be connected via system bus 899. In some computer systems, system bus 899 may be accessible in the form of one or more physical connectors to allow expansion with additional CPUs, GPUs, etc. Peripheral devices may be connected to the core's system bus 899 directly or via peripheral bus 849. Peripheral bus architectures include Peripheral Component Interconnect (PCI), USB, etc.
[0098] The CPU 841, GPU 842, FPGA 843, and accelerator 844 can execute certain instructions, which, when combined, constitute the aforementioned computer code. This computer code can be stored in ROM 845 or RAM 846. Transitional data can be stored in RAM 846, while permanent data can be stored, for example, in internal mass storage 847. Fast storage and retrieval of any storage device can be achieved by using a cache memory, which can be closely associated with one or more CPUs 841, GPUs 842, mass storage 847, ROM 845, RAM 846, etc.
[0099] Computer-readable media may contain computer code for performing operations of various computer implementations. The media and computer code may be specifically designed and constructed for the purposes of this disclosure, or may be of a type known and available to those skilled in the art of computer software.
[0100] By way of example and not limitation, a computer system having architecture 800, particularly core 840, can provide functionality as a result of a processor (including CPU, GPU, FPGA, accelerator, etc.) executing software contained in one or more tangible computer-readable media. Such computer-readable media can be media associated with user-accessible mass storage as described above, as well as certain storage of core 840 that is of a non-transitory nature, such as internal mass storage 847 or ROM 845. Software implementing various embodiments of this disclosure can be stored in such devices and executed by core 840. Depending on specific needs, the computer-readable medium may include one or more storage devices or chips. The software can cause core 840, and particularly the processors therein (including CPU, GPU, FPGA, etc.), to execute specific processes or specific portions of specific processes described herein, including defining data structures stored in RAM 846 and modifying such data structures according to software-defined processes. Additionally or alternatively, the computer system can provide functionality as a result of logic that is hard-wired or otherwise embodied in circuitry (e.g., accelerator 844), which can operate in place of or together with software to execute specific processes or specific portions of specific processes described herein. Where appropriate, references to software may include logic, and vice versa. Where appropriate, references to computer-readable media may include circuitry storing software for execution (e.g., an integrated circuit (IC)), circuitry embodying logic for execution, or both. This disclosure includes any suitable combination of hardware and software.
[0101] Although several exemplary embodiments have been described in this disclosure, there are changes, substitutions, and various alternative equivalents that fall within the scope of this disclosure. Therefore, it should be understood that those skilled in the art will be able to design many systems and methods that, while not expressly shown or described herein, embody the principles of this disclosure and are therefore within its spirit and scope.
Claims
1. A method for adaptive streaming of light field or holographic immersive media, the method being executed by one or more processors, the method comprising: identifying features associated with a scene to be transmitted to a terminal client; determining the currently available bandwidth; when the currently available bandwidth is limited, rendering, based on the determined features, at least a portion of the scene to be transmitted to the terminal client to a target scene depth corresponding to the currently available bandwidth, wherein the target scene depth indicates the distance between the camera and objects in the scene; and transmitting, based on the determined features, an adaptive stream of the light field or holographic immersive media for the rendered scene.
2. The method according to claim 1, characterized in that adjusting at least a portion of the scene to be transmitted includes: determining, based on the determined features, a depth associated with the scene to be transmitted; and adjusting the scene to be transmitted based on the depth to include one or more first objects in the scene, wherein the one or more first objects are located at a first distance within the depth.
3. The method according to claim 2, characterized in that adjusting at least a portion of the scene to be transmitted also includes: adjusting the scene to be transmitted based on the depth to exclude one or more second objects in the scene, wherein the one or more second objects are located at a distance beyond the depth.
4. The method according to claim 1, characterized in that adjusting at least a portion of the scene to be transmitted includes: determining, based on the determined features, a threshold priority associated with one or more objects in the scene to be transmitted; and adjusting the scene to be transmitted based on the threshold priority to include one or more first objects among the one or more objects in the scene, wherein the one or more first objects have a higher priority than the threshold priority.
5. The method according to claim 4, characterized in that adjusting at least a portion of the scene to be transmitted also includes: adjusting the scene to be transmitted based on the threshold priority to exclude one or more second objects from the one or more objects in the scene, wherein the one or more second objects have a lower priority than the threshold priority.
6. The method according to claim 5, characterized in that, The priority of a corresponding object associated with one or more objects in the scene is determined based on the distance between the corresponding object and the imaging device that captures the scene.
7. The method according to claim 1, characterized in that adjusting at least a portion of the scene to be transmitted includes: receiving, based on the determined features associated with the terminal client, a request for an alternative scene from the terminal client, wherein the alternative scene has fewer objects than the one or more objects in the scene; and adjusting the alternative scene to be transmitted to include one or more first objects among the one or more objects, wherein the one or more first objects have a higher priority than the threshold priority.
8. The method according to claim 7, characterized in that adjusting at least a portion of the scene to be transmitted also includes: adjusting the alternative scene to be transmitted to exclude one or more second objects from the one or more objects, wherein the one or more second objects have a lower priority than the threshold priority.
9. The method according to claim 8, characterized in that the corresponding priorities associated with the one or more objects in the scene are defined by the terminal client.
10. An apparatus for adaptive streaming of light field or holographic immersive media, characterized in that, The device includes: At least one memory configured to store program code; and At least one processor is configured to read the program code and operate according to the instructions of the program code, the program code comprising: A first determining code is configured to cause the at least one processor to determine features associated with the scenario to be transmitted to the terminal client, as well as the currently available bandwidth; A second determining code, configured to, when the currently available bandwidth is limited, cause the at least one processor to render at least a portion of the scene to be transmitted to the terminal client to a target scene depth corresponding to the currently available bandwidth, based on the determined features, wherein the target scene depth is used to indicate the distance between the camera and objects in the scene; and A transmission code configured to cause the at least one processor to transmit an adaptive stream of the light field of the rendered scene or the holographic immersive media based on the determined features.
11. The apparatus according to claim 10, characterized in that, The second determining code includes: A third determining code, configured to cause the at least one processor to determine a depth associated with the scene to be transmitted based on the determined features; and First adjustment code, configured to cause the at least one processor to adjust the scene to be transmitted based on the depth to include one or more first objects in the scene, wherein the one or more first objects are located at a first distance within the depth.
12. The apparatus according to claim 11, characterized in that the second determining code further includes: second adjustment code configured to cause the at least one processor to adjust the scene to be transmitted based on the depth to exclude one or more second objects in the scene, wherein the one or more second objects are located at a distance beyond the depth.
13. The apparatus according to claim 10, characterized in that the second determining code includes: fourth determining code configured to cause the at least one processor to determine a threshold priority associated with one or more objects in the scene to be transmitted, based on the determined features; and third adjustment code configured to cause the at least one processor to adjust the scene to be transmitted based on the threshold priority to include one or more first objects among the one or more objects in the scene, wherein the one or more first objects have a higher priority than the threshold priority.
14. The apparatus according to claim 13, characterized in that the second determining code further includes: fourth adjustment code configured to cause the at least one processor to adjust the scene to be transmitted based on the threshold priority to exclude one or more second objects among the one or more objects in the scene, wherein the one or more second objects have a lower priority than the threshold priority.
15. The apparatus according to claim 14, characterized in that the priority associated with each corresponding object among the one or more objects in the scene is determined based on the distance between the corresponding object and the imaging device that captures the scene.
16. A non-transitory computer-readable medium, characterized in that the medium stores instructions, the instructions including one or more instructions that, when executed by one or more processors of a device for adaptive streaming of light field or holographic immersive media, cause the one or more processors to: determine features associated with the scene to be transmitted to the terminal client, as well as the currently available bandwidth; when the currently available bandwidth is limited, adjust, based on the determined features associated with the terminal client, at least a portion of the scene to be transmitted to the terminal client to be rendered to a target scene depth corresponding to the currently available bandwidth, wherein the target scene depth is used to indicate the distance between the camera and objects in the scene; and transmit, based on the determined features, an adaptive stream of the light field of the rendered scene or the holographic immersive media.
17. The non-transitory computer-readable medium according to claim 16, characterized in that adjusting at least a portion of the scene to be transmitted includes: determining, based on the determined features, a depth associated with the scene to be transmitted; and adjusting the scene to be transmitted based on the depth to include one or more first objects in the scene, wherein the one or more first objects are located at a first distance within the depth.
18. The non-transitory computer-readable medium according to claim 17, characterized in that adjusting at least a portion of the scene to be transmitted further includes: adjusting the scene to be transmitted based on the depth to exclude one or more second objects in the scene, wherein the one or more second objects are located at a distance beyond the depth.
19. The non-transitory computer-readable medium according to claim 16, characterized in that adjusting at least a portion of the scene to be transmitted includes: determining, based on the determined features, a threshold priority associated with one or more objects in the scene to be transmitted; and adjusting the scene to be transmitted based on the threshold priority to include one or more first objects among the one or more objects in the scene, wherein the one or more first objects have a higher priority than the threshold priority.
20. The non-transitory computer-readable medium according to claim 19, characterized in that adjusting at least a portion of the scene to be transmitted further includes: adjusting the scene to be transmitted based on the threshold priority to exclude one or more second objects from the one or more objects in the scene, wherein the one or more second objects have a lower priority than the threshold priority.
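
Illustrative sketch (not part of the claims): the depth-based and priority-based adjustments recited above can be pictured as two simple filters over the objects in a scene. The Python below is a minimal sketch under stated assumptions, not the claimed implementation. The names `SceneObject`, `target_depth`, `select_by_depth`, and `select_by_priority` are hypothetical, the linear mapping from bandwidth to depth is an assumption (the claims only require that the target depth correspond to the available bandwidth), and larger priority numbers are assumed to mean higher priority. The claims do not say how an object exactly at the threshold is handled; the sketch keeps it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str
    distance: float  # distance from the capturing camera, arbitrary units
    priority: int    # assumption: a larger value means a higher priority


def target_depth(bandwidth: float, full_bandwidth: float, max_depth: float) -> float:
    """Hypothetical linear mapping from available bandwidth to scene depth."""
    fraction = min(max(bandwidth / full_bandwidth, 0.0), 1.0)
    return fraction * max_depth


def select_by_depth(objects, depth):
    """Keep objects located within the depth; drop those beyond it."""
    return [obj for obj in objects if obj.distance <= depth]


def select_by_priority(objects, threshold):
    """Keep objects at or above the threshold priority; drop lower ones."""
    return [obj for obj in objects if obj.priority >= threshold]


# Example scene with made-up values.
scene = [
    SceneObject("presenter", distance=2.0, priority=3),
    SceneObject("table", distance=4.0, priority=2),
    SceneObject("skyline", distance=50.0, priority=1),
]

# Half of the full bandwidth is available, so half of the maximum depth is used.
depth = target_depth(bandwidth=25.0, full_bandwidth=50.0, max_depth=60.0)
near_objects = select_by_depth(scene, depth)
important_objects = select_by_priority(scene, threshold=3)
```

With these made-up values, the depth filter keeps the presenter and the table and drops the skyline, while the priority filter keeps only the presenter. A real system would derive the depth or threshold from the determined features of the terminal client as well as the bandwidth.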