Method and apparatus for delivering volumetric video content
By encoding metadata for viewing boundaries and paths in volumetric video data streams, the challenges of restricted viewing and dizziness in immersive technologies are addressed, enhancing user immersion and rendering quality.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- INTERDIGITAL VC HOLDINGS INC
- Filing Date
- 2024-12-26
- Publication Date
- 2026-06-22
AI Technical Summary
Existing immersive video technologies, such as 3DoF and 6DoF, face challenges in providing consistent visual feedback during head translation, leading to dizziness and limited immersion due to restricted viewing angles and lack of parallax, necessitating a solution to guide users within volumetric video content effectively.
Encoding metadata within a data stream to signal viewing bounding boxes, curved paths, and viewing direction ranges, allowing for restricted guidance within the 3D space, ensuring high-quality rendering and preventing users from exceeding viewable boundaries.
Enhances user immersion by providing consistent visual feedback during head translation, preventing dizziness, and ensuring high-quality rendering by guiding users within defined viewing boundaries, thus improving the overall immersive experience.
Abstract
Description
[Technical Field]
[0001] The present principle generally relates to the domain of three-dimensional (3D) scenes and volumetric video content. This document is also understood in the context of encoding, formatting, and decoding data representing the texture and geometry of a 3D scene for rendering volumetric content on end-user devices such as mobile devices or head-mounted displays (HMDs). In particular, the present principle relates to the signaling and decoding of information representing restrictions on navigation within a volumetric video.
[Background Art]
[0002] This section is intended to introduce readers to various aspects of the art, which may relate to the various aspects of the disclosure described and / or claimed below. This discussion is intended to provide readers with background information to facilitate a better understanding of the various aspects of the principle. Therefore, it should be understood that these descriptions should be read in this context and not as acceptance of the prior art.
[0003] In recent years, the amount of available wide-field-of-view content (up to 360°) has increased. Such content is potentially not fully visible to a user viewing it on an immersive display device such as a head-mounted display, smart glasses, a PC screen, a tablet, or a smartphone. This means that at any given moment, the user may be viewing only part of the content. However, the user can typically navigate within the content by various means, such as head movement, mouse movement, touchscreen, and audio. It is usually desirable to encode and decode this content.
[0004] Immersive video, also known as 360° flat video, allows users to see everything around them by rotating their head around a fixed point of view. Rotation alone allows only a three-degrees-of-freedom (3DoF) experience. While 3DoF video may be sufficient for a first omnidirectional video experience, for example using a head-mounted display (HMD), it can quickly frustrate viewers who expect more freedom, for example by experiencing parallax. Furthermore, 3DoF video can cause dizziness: users never only rotate their heads but also translate them in three directions, and these translations are not reproduced in the 3DoF video experience.
[0005] Wide-field content can include, among other things, three-dimensional computer graphic image scenes (3D CGI scenes), point clouds, or immersive videos. Many terms may be used to designate such immersive videos: for example, virtual reality (VR) videos, 360-degree videos, panoramic videos, 4π steradian videos, immersive videos, omnidirectional videos, or wide-field videos.
[0006] Volumetric video (also known as 6-degree-of-freedom (6DoF) video) is an alternative to 3DoF video. When viewing 6DoF video, in addition to rotation, the user can also translate their head, and even their body, within the content they are viewing, experiencing parallax and even volume. Such video significantly enhances immersion and the sense of depth of the scene, and prevents dizziness by providing consistent visual feedback during head translation. The content is created by dedicated sensors that allow for the simultaneous recording of the color and depth of the target scene. The use of color camera equipment combined with photogrammetry techniques is a way to perform such recording, although technical difficulties remain.
[0007] 3DoF video comprises a sequence of images resulting from the un-mapping of a texture image (e.g., a spherical image encoded according to latitude/longitude projection mapping or equirectangular projection mapping), whereas 6DoF video frames embed information from several viewpoints. These video frames can be viewed as a temporal series of point clouds resulting from a 3D capture. Two kinds of volumetric video may be considered, depending on the viewing conditions. The first kind (i.e., full 6DoF) allows completely free navigation within the video content, while the second kind (also known as 3DoF+) restricts the user viewing space to a limited volume called a viewing bounding box, enabling a limited head translation and parallax experience. This second context is a valuable trade-off between free navigation and the passive viewing conditions of a seated viewer. The 4DoF+ case can be defined as an intermediate between the 3DoF+ and 6DoF experiences. In this case, the user's displacement is constrained along a curved (1D) path in the 3D scene, and horizontal, vertical, and depth translational movement around each path sample is limited. The user can move through a kind of tunnel along the path, within which good visual quality is guaranteed. If the user moves the virtual camera outside the tunnel, the data needed to reconstruct a good-quality 3D scene may be missing.
[0008] It may be necessary to limit the range of acceptable positions and viewing orientations of the user's virtual camera while navigating a given volumetric video content. Otherwise, the user may "leave" the 3D scene by requesting a viewport that cannot be fully rendered due to a lack of available visual data. Therefore, a solution is needed to inform the end-user device of suitable, preferred, and/or acceptable viewer (virtual) positions and viewing orientations when consuming a given volumetric video content.
[Summary of the Invention]
[0009] Below is a simplified overview of the present principle, intended to provide a basic understanding of some of its aspects. This overview is not a comprehensive summary of the present principle, and it is not intended to identify its key or critical elements. It merely presents some aspects of the present principle in a simplified form as a prelude to the more detailed description provided below.
[0010] The present principle relates to a method and a device for signaling information representing restrictions on navigation within a volumetric video. The method includes encoding metadata within a data stream containing video data representing the volumetric video. The metadata includes:
- data representing a viewing bounding box,
- data representing a curved path in the 3D space of the volumetric video, and
- data representing at least one viewing direction range associated with a point on the curved path.
[0011] In another embodiment, a second method and a second device are provided for decoding information representing restrictions on navigation within a volumetric video. The second method includes decoding metadata from a data stream containing video data representing the volumetric video. The metadata includes:
- data representing a viewing bounding box,
- data representing a curved path in the 3D space of the volumetric video, and
- data representing at least one viewing direction range associated with a point on the curved path.
According to another general aspect of at least one embodiment, a data stream is provided which includes video data and associated metadata generated according to any of the described encoding embodiments or variations.
[0012] According to another general aspect of at least one embodiment, a non-transitory computer-readable medium is provided which contains data content generated according to any of the described encoding embodiments or variations.
[0013] According to another general aspect of at least one embodiment, a computer program product is provided which includes instructions that, when the program is executed by a computer, cause the computer to perform any of the described decoding embodiments or variations.
[Brief Description of the Drawings]
[0014] The following description will help to better understand this disclosure and reveal other specific features and advantages, and this description refers to the attached drawings.
[0015]
[Figure 1] A three-dimensional (3D) model of an object and the points of a point cloud corresponding to the 3D model, according to a non-limiting embodiment of the present principle.
[Figure 2] A non-limiting example of encoding, transmitting, and decoding data representing a series of 3D scenes, according to a non-limiting embodiment of the present principle.
[Figure 3] An exemplary architecture of a device that may be configured to implement the methods described in relation to Figures 7 and 8, according to a non-limiting embodiment of the present principle.
[Figure 4] An example of one embodiment of the syntax of a stream when data is transmitted via a packet-based transmission protocol, according to a non-limiting embodiment of the present principle.
[Figure 5] A schematic illustration of the concept of restricted navigation in a virtual 3D scene, depicting a curved path as well as a spherical bounding box and a viewing orientation range at a given position along this path, according to a non-limiting embodiment of the present principle.
[Figure 5b] An example of an acquisition camera rig with five convergent cameras arranged on an arc, having a tubular viewing space inferred from the positions and orientations of the acquisition cameras, according to a non-limiting embodiment of the present principle.
[Figure 6] A specific case of a circular path for navigating around a 3D scene having an object of interest at a central position, according to a non-limiting embodiment of the present principle.
[Figure 7] A schematic illustration of a method 70 for signaling information representing restrictions on navigation within a volumetric video, according to a non-limiting embodiment of the present principle.
[Figure 8] A schematic illustration of a method 80 for decoding information representing restrictions on navigation within a volumetric video, according to a non-limiting embodiment of the present principle.
[Detailed Description of the Invention]
[0016] The present principle will be described more fully hereinafter with reference to the accompanying drawings in which examples of the present principle are shown. However, the present principle may be embodied in many alternative forms and should not be construed as limited to the examples set forth herein. Accordingly, while the present principle is capable of various modifications and alternative forms, specific examples thereof are shown by way of example in the drawings and will be described in detail herein. However, it is not intended to limit the present principle to the particular forms disclosed, but on the contrary, it is to be understood that the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present principle as defined by the claims.
[0017] The terminology used herein is for the purpose of describing particular examples only and is not intended to limit the principles. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the terms "comprises", "comprising", "includes", and / or "including" specify the presence of the stated features, integers, steps, operations, elements, and / or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Further, when an element is referred to as being "responsive" or "connected" to another element, it can be directly responsive or connected to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly responsive" or "directly connected" to another element, intervening elements are absent. As used herein, the term "and / or" includes any and all combinations of one or more of the associated listed items and may be abbreviated as " / ".
[0018] In this specification, terms such as first, second, etc. may be used to describe various elements, but it will be understood that these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the teachings of the principles, the first element may be referred to as the second element and, similarly, the second element may be referred to as the first element.
[0019] Some of the figures include arrows on the communication paths to indicate the main direction of communication, but it should be understood that communication may occur in a direction opposite to the drawn arrows.
[0020] Some examples are illustrated with block diagrams and operation flowcharts, where each block represents a circuit element, module, or portion of code containing one or more executable instructions to implement a specified logical function. Note that in other implementations, the functions described in a block may occur in a different order than those described. For example, two consecutively shown blocks may actually be executed substantially simultaneously, or blocks may sometimes be executed in reverse order depending on the functions they entail.
[0021] Any reference in this specification to “by example” or “in one example” means that a particular feature, structure, or characteristic described in relation to the example may be included in at least one implementation of the principle. The occurrence of the phrase “by example” or “in one example” in various places in this specification does not necessarily refer to the same example, nor do separate or alternative examples necessarily exclude each other.
[0022] Reference numerals appearing in the claims are illustrative only and shall not limit the scope of the claims. Although not expressly described, these examples and variations may be used in any combination or partial combination.
[0023] Figure 1 shows a three-dimensional (3D) model 10 of an object and the points of a point cloud 11 corresponding to the 3D model 10. The 3D model 10 and the point cloud 11 may correspond to possible 3D representations of an object in a 3D scene, for example, including other objects. Model 10 may be a 3D mesh representation, and the points of the point cloud 11 may be vertices of the mesh. The points of the point cloud 11 may also be points distributed on the surface of a mesh face. Model 10 may also be represented as a splatting version of the point cloud 11, where the surface of Model 10 is created by splatting the points of the point cloud 11. Model 10 may be represented by a number of different representations, such as voxels or splines. Figure 1 shows that a point cloud may be defined as a surface representation of a 3D object, and that the surface representation of a 3D object may be generated from the points of the cloud. As used herein, projecting the points of a 3D object onto an image (by extension points of a 3D scene) is equivalent to projecting any representation of this 3D object, such as a point cloud, mesh, spline model, or voxel model.
[0024] A point cloud can be represented in memory, for example, as a vector-based structure, where each point has its own coordinates within the reference frame of the viewpoint (e.g., 3D coordinates XYZ, or solid angle and distance (called depth) from/to the viewpoint) and one or more attributes, also called components. Examples of components are color components, which can be represented in various color spaces, for example, RGB (red, green, and blue) or YUV (where Y is the luma component and UV are the two chrominance components). A point cloud is a representation of a 3D scene containing objects. A 3D scene can be viewed from a given viewpoint or from various viewpoints. A point cloud can be obtained in many ways, for example:
- from the capture of real objects taken with a camera rig, optionally complemented by a depth-sensing active device,
- from the capture of virtual/synthetic objects taken by a virtual camera rig in a modeling tool,
- from a mixture of both real and virtual objects.
[0025] Figure 2 shows a non-limiting example of encoding, transmitting, and decoding data representing a series of 3D scenes. The encoding format may be compatible with, for example, simultaneous 3DoF, 3DoF+, and 6DoF decoding.
[0026] A series of 3D scenes 20 is acquired. Just as a series of pictures is a 2D video, a series of 3D scenes is a 3D (also called volumetric) video. The series of 3D scenes can be provided to a volumetric video rendering device for 3DoF, 3DoF+, or 6DoF rendering and display.
[0027] The series of 3D scenes 20 is provided to an encoder 21. The encoder 21 takes one 3D scene or a series of 3D scenes as input and provides a bitstream representing the input. The bitstream may be stored in a memory 22 and/or on an electronic data medium, and may be transmitted over a network 22. The bitstream representing the series of 3D scenes may be read from the memory 22 and/or received from the network 22 by a decoder 23. The decoder 23 takes the bitstream as input and provides the series of 3D scenes, for example, in point cloud format.
[0028] The encoder 21 may comprise several circuits that implement several steps. In a first step, the encoder 21 projects each 3D scene onto at least one 2D picture. 3D projection is any method of mapping three-dimensional points onto a two-dimensional plane. This type of projection is widely used, particularly in computer graphics, engineering, and drafting, because most current methods for displaying graphical data are based on planar two-dimensional media (pixel information from several bit planes). The projection circuit 211 provides at least one 2D frame 2111 for a 3D scene of the series 20. Frame 2111 contains color information and depth information representing the 3D scene projected onto frame 2111. In a variation, the color information and depth information are encoded within two separate frames 2111 and 2112.
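As a hedged illustration of the projection step, the sketch below maps a 3D point to a pixel position plus a depth using an equirectangular (latitude/longitude) mapping, and maps it back. The choice of mapping, the Z-up axis convention, the pixel orientation, and the function names `project_equirectangular` and `unproject_equirectangular` are assumptions made for illustration only; the projection circuit 211 is not limited to this mapping.

```python
import math
from typing import Tuple


def project_equirectangular(point: Tuple[float, float, float],
                            width: int, height: int) -> Tuple[float, float, float]:
    """Map a 3D point (viewpoint at the origin, Z up: assumed) to (u, v, depth)."""
    x, y, z = point
    depth = math.sqrt(x * x + y * y + z * z)
    if depth == 0.0:
        raise ValueError("the viewpoint itself cannot be projected")
    longitude = math.atan2(y, x)                            # in [-pi, pi]
    latitude = math.asin(max(-1.0, min(1.0, z / depth)))    # in [-pi/2, pi/2]
    u = (0.5 - longitude / (2.0 * math.pi)) * width         # column, as a float
    v = (0.5 - latitude / math.pi) * height                 # row, as a float
    return u, v, depth


def unproject_equirectangular(u: float, v: float, depth: float,
                              width: int, height: int) -> Tuple[float, float, float]:
    """Inverse mapping: recover the 3D point from (u, v, depth)."""
    longitude = (0.5 - u / width) * 2.0 * math.pi
    latitude = (0.5 - v / height) * math.pi
    return (depth * math.cos(latitude) * math.cos(longitude),
            depth * math.cos(latitude) * math.sin(longitude),
            depth * math.sin(latitude))
```

Storing `depth` alongside color is what lets a decoder place each pixel back in 3D space; without it, only the viewing direction of the pixel would be recoverable.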
[0029] Metadata 212 is used and updated by the projection circuit 211. Metadata 212 includes information about projection operations (e.g., projection parameters), as well as information about how color and depth information is organized within frames 2111 and 2112, as described in relation to Figures 5 to 7.
[0030] The video encoding circuit 213 encodes the series of frames 2111 and 2112 as a video. The pictures 2111 and 2112 of a 3D scene (or of a series of 3D scenes) are encoded into a stream by the video encoder 213. The video data and the metadata 212 are then encapsulated into a data stream by the data encapsulation circuit 214.
[0031] Encoder 213 is, for example, compliant with an encoder such as:
- JPEG, specification ISO/IEC 10918-1, ITU-T Recommendation T.81, https://www.itu.int/rec/T-REC-T.81/en,
- AVC, also known as MPEG-4 AVC or h264, specified in both ITU-T H.264 and ISO/IEC MPEG-4 Part 10 (ISO/IEC 14496-10), http://www.itu.int/rec/T-REC-H.264/en,
- HEVC (its specification can be found on the ITU website, T Recommendation, H Series, h265, http://www.itu.int/rec/T-REC-H.265-201612-I/en),
- 3D-HEVC (an extension of HEVC, the specification of which can be found on the ITU website, T Recommendation, H Series, h265, http://www.itu.int/rec/T-REC-H.265-201612-I/en, Annexes G and I),
- VP9, developed by Google, or
- AV1 (AOMedia Video 1), developed by the Alliance for Open Media.
[0032] The data stream is stored in a memory that is accessible to the decoder 23, for example, via the network 22. The decoder 23 comprises various circuits that implement the various steps of decoding. The decoder 23 takes the data stream generated by the encoder 21 as input and provides a series of 3D scenes 24 to be rendered and displayed by a volumetric video display device such as a head-mounted device (HMD). The decoder 23 obtains the stream from a source 22. For example, the source 22 belongs to a set that includes the following:
- a local memory, for example, video memory or RAM (random access memory), flash memory, ROM (read-only memory), or a hard disk,
- a storage interface, for example, an interface with mass storage, RAM, flash memory, ROM, an optical disc, or a magnetic support,
- a communication interface, for example, a wired interface (e.g., a bus interface, a wide area network interface, a local area network interface) or a wireless interface (e.g., an IEEE 802.11 interface or a Bluetooth® interface), and
- a user interface, such as a graphical user interface that allows a user to input data.
[0033] Decoder 23 includes a circuit 234 for extracting the data encoded in the data stream. Circuit 234 takes the data stream as input and provides metadata 232, corresponding to the metadata 212 encoded in the stream, and a 2D video. The video is decoded by video decoder 233, which provides a series of frames. The decoded frames include color information and depth information. In a variation, video decoder 233 provides two series of frames, one containing color information and the other containing depth information. Circuit 231 uses metadata 232 to un-project the color and depth information from the decoded frames and provide a series of 3D scenes 24. The series of 3D scenes 24 corresponds to the series of 3D scenes 20, with a possible loss of accuracy due to the encoding as 2D video and to video compression.
[0034] Figure 3 shows an exemplary architecture of device 30 that may be configured to implement the methods described in relation to Figures 7 and 8. The encoder 21 and / or decoder 23 of Figure 2 may implement this architecture. Alternatively, the encoder 21 and / or decoder 23 circuits may be linked to each other, for example, via their bus 31 and / or via the I / O interface 36, in a device according to the architecture of Figure 3.
[0035] Device 30 comprises the following elements, which are linked to each other by a data and address bus 31:
- a microprocessor 32 (or CPU), which is, for example, a DSP (digital signal processor),
- a ROM (read-only memory) 33,
- a RAM (random access memory) 34,
- a storage interface 35,
- an I/O interface 36 that receives the data to be transmitted from an application, and
- a power supply, for example, a battery.
[0036] According to one example, the power supply is external to the device. In each of the above-mentioned memories, the word "register" as used herein may correspond to a small area (a few bits) or a very large area (e.g., an entire program or a large amount of received or decoded data). ROM 33 contains at least a program and parameters. ROM 33 may store algorithms and instructions for performing the techniques according to the present principle. When switched on, CPU 32 uploads the program into RAM and executes the corresponding instructions.
[0037] RAM 34 contains, in registers, the program that is executed by CPU 32 and uploaded after device 30 is switched on, input data, intermediate data for different states of the method, and other variables used to execute the method.
[0038] The implementations described herein may be implemented, for example, in a method or process, an apparatus, a computer program product, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., only as a method or a device), the implementation of the features discussed may also be implemented in other forms (e.g., a program). An apparatus may be implemented, for example, in appropriate hardware, software, and firmware. The methods may be implemented, for example, in an apparatus such as a processor, which refers to processing devices in general, including, for example, computers, microprocessors, integrated circuits, or programmable logic devices. Processors also include communication devices, such as computers, mobile phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate the communication of information between end users.
[0039] For example, device 30 is configured to implement the methods described in relation to Figures 7 and 8 and belongs to a set including:
- a mobile device,
- a communication device,
- a game device,
- a tablet (or tablet computer),
- a laptop,
- a still camera,
- a video camera,
- an encoding chip,
- a server (for example, a broadcast server, a video-on-demand server, or a web server).
[0040] Figure 4 shows an example of one embodiment of the syntax of a stream when data is transmitted via a packet-based transmission protocol. Figure 4 shows an exemplary structure 4 of a volumetric video stream. The structure consists of a container that organizes the stream into independent elements of syntax. The structure may include a header section 41, which is a set of data common to all syntax elements of the stream. For example, the header section includes some metadata about the syntax elements, describing the nature and role of each of them. The header section may also include some of the metadata 212 of Figure 2, for example, the coordinates of the central viewpoint used to project points of a 3D scene onto frames 2111 and 2112. The structure includes a payload containing an element of syntax 42 and at least one element of syntax 43. The syntax element 42 includes data representing the color and depth frames. The images may have been compressed according to a video compression method.
[0041] The elements of syntax 43 are part of the payload of the data stream and may contain metadata about how the frames of the elements of syntax 42 are encoded, such as parameters used to project and pack points of a 3D scene into frames. Such metadata may be associated with each frame of the video or with a group of frames (also known as a group of pictures (GoP) in video compression standards).
[0042] Figure 5 schematically illustrates the concept of restricted guidance within a virtual 3D scene by depicting a curved path, as well as the spherical bounding box and viewing orientation range at a given position along this path. Restrictions to guidance are determined on a per-content basis during the content creation phase and have the added benefit of potentially adding subjective (e.g., artistic) constraints to the objective constraints of the 3D geometry.
[0043] According to the present principle, an arrangement of high-level syntax elements explicitly describes a subset of viewing positions and orientations associated with a given volumetric video content representing a 3D scene 50. The signaled information includes the following elements:
- a curved path 51 (or a set of curved paths) in 3D space,
- a basic bounding box volume 52,
- a set of viewing orientation ranges 53 indexed by position along the curved path.
The combination of these elements describes a virtual displacement along a path that is suitable, preferred, and/or acceptable for high-quality 3D scene reconstruction. At each position along the path, small translational movements within the bounding box are possible, and the viewing orientation is restricted to a given angular range. This can be described as a 4DoF+ virtual navigation "tunnel".
[0044] According to the first embodiment, a set of curved guiding paths is defined by a set of 3D points with the following syntax and semantics:
aligned(8) class NavigationPathSet() {
    ViewingBoundingBox;
    unsigned int(8) num_paths;
    for (n = 0; n < num_paths; n++) {
        unsigned int(32) num_points[n];
        for (i = 0; i < num_points[n]; i++) {
            unsigned int(32) X[n][i];
            unsigned int(32) Y[n][i];
            unsigned int(32) Z[n][i];
            signed int(32) phi_min[n][i];
            signed int(32) phi_max[n][i];
            signed int(32) theta_min[n][i];
            signed int(32) theta_max[n][i];
        }
    }
}
aligned(8) class ViewingBoundingBox() {
    unsigned int(8) shape_type;
    unsigned int(32) first_dimension;
    unsigned int(32) second_dimension;
    unsigned int(32) third_dimension;
}
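To make the field layout of this syntax concrete, the following is a minimal parsing sketch. It assumes that the fields are serialized in the order listed, with no padding, and in big-endian byte order; neither the byte order nor the function name `parse_navigation_path_set` is stated in this description, so both are assumptions for illustration.

```python
import struct
from typing import Any, Dict, List

BOX_FORMAT = ">BIII"        # shape_type + three 32-bit dimensions (assumed big-endian)
POINT_FORMAT = ">IIIiiii"   # X, Y, Z (unsigned) then phi/theta min/max (signed)


def parse_navigation_path_set(buf: bytes) -> Dict[str, Any]:
    """Parse a NavigationPathSet payload into plain Python containers."""
    offset = 0
    shape_type, dim1, dim2, dim3 = struct.unpack_from(BOX_FORMAT, buf, offset)
    offset += struct.calcsize(BOX_FORMAT)
    (num_paths,) = struct.unpack_from(">B", buf, offset)
    offset += 1
    paths: List[List[Dict[str, int]]] = []
    for _ in range(num_paths):
        (num_points,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        points = []
        for _ in range(num_points):
            x, y, z, phi_min, phi_max, theta_min, theta_max = struct.unpack_from(
                POINT_FORMAT, buf, offset)
            offset += struct.calcsize(POINT_FORMAT)
            points.append({"X": x, "Y": y, "Z": z,
                           "phi_min": phi_min, "phi_max": phi_max,
                           "theta_min": theta_min, "theta_max": theta_max})
        paths.append(points)
    box = {"shape_type": shape_type, "first_dimension": dim1,
           "second_dimension": dim2, "third_dimension": dim3}
    return {"viewing_bounding_box": box, "paths": paths}
```

The values are returned as raw integers; converting them to scene units or degrees is left to the caller, since that depends on the fixed-point conventions of each field.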
[0045] `num_paths` specifies the number of curved guidance paths defined to guide the content. A value of 0 indicates that the entire 3D space can be guided.
[0046] `num_points` specifies the number of 3D points sampled along the curved path.
[0047] X[n][i], Y[n][i], Z[n][i] are fixed-point values (e.g., in 16.16 format) that define the 3D coordinates of the i-th sample along the n-th path in the global coordinate system of the 3D scene. The points are ordered according to their curvilinear abscissa along the path. Thus, the n-th path is defined between two 3D endpoints (X[n][0], Y[n][0], Z[n][0]) and (X[n][num_points[n]-1], Y[n][num_points[n]-1], Z[n][num_points[n]-1]). For example, a curved path may be a piecewise-linear path made of a straight segment between each pair of consecutive points in the list, and a straight segment from the last point in the list to the first point. In variations, a curved path may be determined by using quadratic or cubic Bézier curves over three or four consecutive points.
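The sketch below illustrates one way a renderer might turn these samples into positions: it converts 16.16 fixed-point values to floating point and interpolates along an open piecewise-linear path by normalized arc length. The helper names (`fixed_16_16_to_float`, `decode_path`, `point_on_path`) are hypothetical, and the open (non-closed) path and the linear interpolation are assumptions; a Bézier variant would replace the interpolation step.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def fixed_16_16_to_float(raw: int) -> float:
    """Convert a 16.16 fixed-point integer to a float (16 fractional bits)."""
    return raw / 65536.0


def decode_path(raw_points: Sequence[Sequence[int]]) -> List[Point]:
    """Convert raw (X, Y, Z) triples of one path to floating-point coordinates."""
    return [(fixed_16_16_to_float(x), fixed_16_16_to_float(y), fixed_16_16_to_float(z))
            for x, y, z in raw_points]


def point_on_path(points: Sequence[Point], s: float) -> Point:
    """Return the position at normalized arc length s in [0, 1] along the polyline."""
    if not points:
        raise ValueError("a path needs at least one point")
    s = min(max(s, 0.0), 1.0)
    segments = list(zip(points, points[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)
    if total == 0.0:
        return points[0]
    target = s * total
    for (a, b), length in zip(segments, lengths):
        if length > 0.0 and target <= length:
            t = target / length
            return (a[0] + t * (b[0] - a[0]),
                    a[1] + t * (b[1] - a[1]),
                    a[2] + t * (b[2] - a[2]))
        target -= length
    return points[-1]  # guards against floating-point rounding at s == 1
```

Parameterizing by arc length rather than by sample index keeps the virtual camera speed uniform even when the path samples are unevenly spaced.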
[0048] phi_min[n][i], phi_max[n][i] and theta_min[n][i], theta_max[n][i] are the minimum and maximum azimuth and elevation angles, in units of 2^-16 degrees, that delimit the viewing orientation at the i-th point along the n-th path. The azimuth angle values are, for example, in the range of -180*2^16 to 180*2^16 - 1, inclusive. The elevation angle values are, for example, in the range of -90*2^16 to 90*2^16, inclusive. The azimuth and elevation values may be expressed in different units of measurement, such as radians.
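As a small worked example of these angle units, the sketch below converts between degrees and units of 2^-16 degrees, and clamps a requested viewing orientation to the signaled range. The function names are hypothetical, and the sketch assumes that each minimum does not exceed its maximum, i.e. that the azimuth range does not wrap around ±180 degrees.

```python
from typing import Tuple

UNITS_PER_DEGREE = 1 << 16  # one unit is 2^-16 degree


def degrees_to_units(degrees: float) -> int:
    """Convert an angle in degrees to integer units of 2^-16 degrees."""
    return round(degrees * UNITS_PER_DEGREE)


def units_to_degrees(raw: int) -> float:
    """Convert an angle in units of 2^-16 degrees to degrees."""
    return raw / UNITS_PER_DEGREE


def clamp_orientation(phi: int, theta: int,
                      phi_min: int, phi_max: int,
                      theta_min: int, theta_max: int) -> Tuple[int, int]:
    """Clamp (azimuth, elevation) into the allowed range; all values in raw units."""
    if phi_min > phi_max or theta_min > theta_max:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    return (min(max(phi, phi_min), phi_max),
            min(max(theta, theta_min), theta_max))


def orientation_allowed(phi: int, theta: int,
                        phi_min: int, phi_max: int,
                        theta_min: int, theta_max: int) -> bool:
    """True when the requested orientation already lies inside the range."""
    return clamp_orientation(phi, theta, phi_min, phi_max,
                             theta_min, theta_max) == (phi, theta)
```

For instance, 45 degrees is encoded as 45 * 65536 = 2949120 units, which fits comfortably in a signed 32-bit field.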
[0049] `shape_type` specifies the shape of the viewing bounding box within which the user is allowed to move slightly at a given position along the guidance path. A `shape_type` of 0 indicates a sphere, the radius of which is specified by `first_dimension`. This syntax allows for the definition of more complex ellipsoidal or cuboidal 3D volume shapes.
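Sweeping a spherical bounding box along a path yields the "tunnel" described earlier. The sketch below tests whether a virtual camera position lies inside such a tunnel by comparing its distance to the nearest path segment with the sphere radius. It assumes a piecewise-linear path, a radius already converted to the same units as the coordinates, and only handles `shape_type` 0; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
SHAPE_SPHERE = 0  # the only shape_type value defined in the text


def distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment [a, b]."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    squared_length = sum(c * c for c in ab)
    if squared_length == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        t = sum(x * y for x, y in zip(ap, ab)) / squared_length
        t = max(0.0, min(1.0, t))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def inside_tunnel(p: Point, path: Sequence[Point],
                  shape_type: int, first_dimension: float) -> bool:
    """True when p is within the sphere radius of some point of the path."""
    if shape_type != SHAPE_SPHERE:
        raise NotImplementedError("only the spherical shape is sketched here")
    if len(path) == 1:
        return math.dist(p, path[0]) <= first_dimension
    return any(distance_to_segment(p, a, b) <= first_dimension
               for a, b in zip(path, path[1:]))
```

A renderer could use such a test to stop, or to pull back, a virtual camera that is about to leave the region where good reconstruction quality is expected.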
[0050] According to another embodiment, a lightweight specification of the viewing space 54 is obtained by inferring the guidance path 51 from the parameters of the acquisition cameras associated with the volumetric content. The 3D points sampling the guidance path are the 3D positions of the acquisition cameras. The viewing direction (central azimuth and elevation) at each sample position is the viewing direction of the acquisition camera at that position. Since the extrinsic parameters of the acquisition cameras are already part of the metadata associated with the volumetric video content, the additional metadata sent to specify the viewing space is reduced to the shape and size of the viewing box at each sample position, as well as the ranges of azimuth and elevation around the viewing direction. An example of the syntax and associated semantics is as follows:
aligned(8) class NavigationPathFromCameraRig() {
    ViewingBoundingBox;
    unsigned int(8) num_paths;
    for (n = 0; n < num_paths; n++) {
        unsigned int(32) num_cams[n];
        for (i = 0; i < num_cams[n]; i++) {
            unsigned int(32) cam_idx[n][i];
            unsigned int(32) phi_range[n][i];
            unsigned int(32) theta_range[n][i];
        }
    }
}
[0051] num_cams[n] is the size of the subset of acquisition cameras used to sample the nth guidance path. cam_idx[n][i] is the index of the i-th camera along the nth guidance path (out of the list of all acquisition cameras). phi_range[n][i] and theta_range[n][i] are the range of deviations of the azimuth and elevation angles at the i-th position along the nth path, around the azimuth and elevation angles of the cam_idx[n][i]-th acquisition camera.
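The sketch below shows how a decoder might expand this compact description into explicit path samples, given camera extrinsics that are already available. Several points are assumptions not settled by the text: `phi_range` and `theta_range` are taken to be in units of 2^-16 degrees, and each range is interpreted as a total width centered on the camera's viewing direction. The `Camera` and `PathSample` types and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

UNITS_PER_DEGREE = 65536  # assumed: ranges expressed in units of 2^-16 degrees


@dataclass(frozen=True)
class Camera:
    position: Tuple[float, float, float]
    azimuth_deg: float
    elevation_deg: float


@dataclass(frozen=True)
class PathSample:
    position: Tuple[float, float, float]
    phi_min_deg: float
    phi_max_deg: float
    theta_min_deg: float
    theta_max_deg: float


def path_from_camera_rig(cameras: Sequence[Camera],
                         entries: Sequence[Tuple[int, int, int]]) -> List[PathSample]:
    """Expand (cam_idx, phi_range, theta_range) entries into explicit samples."""
    samples = []
    for cam_idx, phi_range, theta_range in entries:
        cam = cameras[cam_idx]
        # Assumed interpretation: the range is centered on the camera direction.
        half_phi = phi_range / UNITS_PER_DEGREE / 2.0
        half_theta = theta_range / UNITS_PER_DEGREE / 2.0
        samples.append(PathSample(cam.position,
                                  cam.azimuth_deg - half_phi,
                                  cam.azimuth_deg + half_phi,
                                  cam.elevation_deg - half_theta,
                                  cam.elevation_deg + half_theta))
    return samples
```

Because only an index and two ranges are sent per sample, the per-sample metadata shrinks from seven 32-bit fields in the first embodiment to three.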
[0052] Such a viewing space specification is particularly suitable for volumetric content captured by a camera rig positioned along a circular arc. Figure 5b shows an example of a viewing space 54 associated with 3D content captured from such a rig, where 10 cameras are arranged as 5 pairs of converging cameras 55 to capture the 3D scene, and the tubular viewing space 54 is specified by sampling 5 ellipsoids at the locations of every other camera.
[0053] Figure 6 shows a specific example of a circular path for guiding around a 3D scene with a target object at its center. In this use case, the user can walk around the 3D scene 50 on a circular path 61 in the horizontal plane at a given height in the global coordinate system, using an inward-facing field of view 63. Displacement within the bounding box 52 is permitted, as in the first embodiment.
[0054] Examples of syntax and semantics may include the following:
aligned(8) class CircularNavigationPath() {
    ViewingBoundingBox;
    signed int(32) center_x;
    signed int(32) center_y;
    signed int(32) center_z;
    unsigned int(32) radius;
    signed int(32) delta_phi;
    signed int(32) theta_min;
    signed int(32) theta_max;
}
[0055] center_x, center_y, and center_z are fixed-point values (for example, in 16.16 format) that define the 3D coordinates of the center of the circular path.
[0056] radius is a fixed-point value that defines the radius of the circular path.
[0057] delta_phi is the angular range, in units of 2^-16 degrees, that defines the azimuth viewing orientation at any point on the circular path (relative to the radial direction). The value of delta_phi is in the range of 0 to 360*2^16 - 1, inclusive.
[0058] theta_min and theta_max are the minimum and maximum values of the viewing elevation angle at any point on the circular path, in units of 2^-16 degrees. The values of the elevation angle are in the range of -90*2^16 to 90*2^16, inclusive.
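The circular path parameters can be illustrated with the following Python sketch, which computes a viewing position on the path and limits a requested elevation angle. It assumes 16.16 fixed-point values for the center and radius, angles in units of 2^-16 degrees, and a horizontal plane spanned by the x and y axes with z as the height. The axis convention, the angle_deg parameter, and the function names are assumptions made for illustration.

```python
import math

FIXED_ONE = 1 << 16  # assumed scale for 16.16 values and 2^-16 degree units


def point_on_circular_path(center_x, center_y, center_z, radius, angle_deg):
    """Return the (x, y, z) position at `angle_deg` along the circle.

    Inputs other than angle_deg are raw fixed-point integers.
    Assumption: the circle lies in the x-y plane at height center_z.
    """
    cx, cy, cz, r = (v / FIXED_ONE for v in (center_x, center_y, center_z, radius))
    a = math.radians(angle_deg)
    return (cx + r * math.cos(a), cy + r * math.sin(a), cz)


def clamp_elevation(theta_deg, theta_min, theta_max):
    """Clamp an elevation in degrees to the raw [theta_min, theta_max] range."""
    low = theta_min / FIXED_ONE
    high = theta_max / FIXED_ONE
    return max(low, min(high, theta_deg))
```

With a radius of 2 and a center at height 1, the position at angle 0 is (2.0, 0.0, 1.0) under this convention.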
[0059] More generally, the curved guidance path can be defined by a parametric 3D curve as X = f(s), Y = g(s), Z = h(s), where s is a scalar value. A variety of such functional parameterizations are suitable according to this principle. This general approach is particularly suitable for typical 4DoF+ experiences, such as the rendering of volumetric videos of sports events or concert events. For such 4DoF+ videos, suitable, preferred, and/or acceptable paths have simple shapes that can be described with a few parameters, such as an ellipse following a stadium stand or a rectangle around a stage or field.
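As one possible illustration of such a parameterization, the following Python sketch defines an elliptical path X = f(s), Y = g(s), Z = h(s) with s in [0, 1) and samples viewpoints along it. The choice of an ellipse in a horizontal plane, the parameter names, and the functions make_ellipse and sample_curve are assumptions for illustration and are not part of the described syntax.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
Curve = Callable[[float], Point]


def make_ellipse(cx: float, cy: float, cz: float, a: float, b: float) -> Curve:
    """Build a parametric curve s -> (X, Y, Z) tracing an ellipse.

    (cx, cy, cz) is the center, a and b are the semi-axes along x and y.
    Assumption: the ellipse lies in a horizontal plane at height cz.
    """
    def curve(s: float) -> Point:
        t = 2.0 * math.pi * s
        return (cx + a * math.cos(t), cy + b * math.sin(t), cz)
    return curve


def sample_curve(curve: Curve, count: int) -> List[Point]:
    """Sample `count` evenly spaced viewpoints for s in [0, 1)."""
    return [curve(i / count) for i in range(count)]
```

Only five numbers (the center and the two semi-axes) are needed to describe such a path, which is consistent with the observation that these paths can be parameterized compactly.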
[0060] FIG. 7 schematically shows a method 70 for signaling information representing the guidance limitations in a volumetric video. In step 71, volumetric video data is acquired. Data representing the curved path and the viewing direction range at points on the curved path is acquired at the same time. The curved path, together with its associated viewing bounding box, represents the guidance limitations in the 3D space that includes the 3D scene of the volumetric video. According to an embodiment of this principle, these data can be represented via different data structures, as described in connection with FIGS. 5 and 6. In step 72, the volumetric video data is encoded into a data stream in association with metadata including the guidance limitations acquired in step 71. In step 73, the data stream encoded in step 72 can be stored on a non-transitory medium or transmitted to a client device.
[0061] According to one embodiment, the proposed restricted guidance path message is encoded in a dedicated Supplemental Enhancement Information (SEI) message within the video stream. According to another embodiment, the proposed restricted guidance path message is encoded at the container level using the ISO Base Media File Format. Adding such guidance path messages to the metadata of a volumetric video stream allows the renderer to restrict virtual guidance to viewing positions and orientations that match the encoded 3D scene content, thereby ensuring the quality of the immersive experience.
[0062] Figure 8 schematically illustrates a method 80 for decoding information representing guidance limitations in a volumetric video. In step 81, a data stream containing video data of the volumetric video is obtained from a source. The data stream also includes metadata associated with the volumetric video, including a representation of the guidance limitations in the 3D space containing the 3D scene of the volumetric video. In step 82, the video data and metadata are decoded. The metadata includes data representing the viewing bounding box, data representing the curved path in the 3D space of the volumetric video, and data representing at least one viewing direction range associated with a point on the curved path. This data may be represented by different data structures according to embodiments of the present principle, as shown in relation to Figures 5 and 6. In step 83, information representing the guidance limitations in the 3D space of the 3D scene is retrieved using the decoded metadata, and this information is used by the renderer. For example, the renderer may warn the user when they are about to leave the tunnel within which good rendering is guaranteed. The renderer may also prevent the virtual camera from moving outside the described volume, or may change the rendering (for example, by fading the image) when the user moves away from the preferred path.
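The renderer behavior described in step 83 might be sketched as follows in Python, for the case of a spherical viewing bounding box. This is a minimal illustration under stated assumptions: the function name constrain_to_sphere and the warn_fraction threshold are hypothetical, and a real renderer could equally fade the image rather than clamp the position.

```python
import math


def constrain_to_sphere(position, center, radius, warn_fraction=0.9):
    """Keep a virtual camera inside a spherical viewing bounding box.

    Returns (constrained_position, warn), where `warn` is True when the
    requested position is near or beyond the boundary of the sphere.
    `warn_fraction` is an assumed threshold, not a signaled value.
    """
    offset = [p - c for p, c in zip(position, center)]
    dist = math.sqrt(sum(d * d for d in offset))
    if dist <= radius:
        # Inside the volume: keep the position, warn only near the edge.
        return tuple(position), dist >= warn_fraction * radius
    # Outside the volume: project back onto the sphere surface.
    scale = radius / dist
    clamped = tuple(c + d * scale for c, d in zip(center, offset))
    return clamped, True
```

A position well inside the sphere is returned unchanged without a warning, while a position outside is projected back onto the boundary along the line to the center.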
[0063] The implementations described herein may be implemented, for example, in methods or processes, apparatus, computer program products, data streams, or signals. Even when considered only in the context of a single form of implementation (e.g., only as a method or device), the implementations of the features considered may also be implemented in other forms (e.g., programs). Apparatus may be implemented, for example, in appropriate hardware, software, and firmware. For example, these methods may be implemented in apparatus such as processors, which broadly refer to processing devices including computers, microprocessors, integrated circuits, or programmable logic devices. Processors also include communication devices such as smartphones, tablets, computers, mobile phones, portable / personal digital assistants ("PDAs"), and other devices that facilitate the transmission of information between end users.
[0064] The various processes and features described herein may be embodied in a variety of different devices or applications, specifically, for example, in devices or applications associated with data encoding, data decoding, view generation, texture processing, and other image processing, as well as related texture information and / or depth information. Examples of such devices include encoders, decoders, post-processors that process the output from decoders, pre-processors that supply inputs to encoders, video coders, video decoders, video codecs, web servers, set-top boxes, laptops, personal computers, mobile phones, PDAs, and other communication devices. As should be obvious, the devices may be portable and may even be mounted on mobile vehicles.
[0065] Furthermore, the method may be implemented by instructions executed by the processor, and such instructions (and/or data values generated by the implementation) may be stored in a processor-readable medium such as an integrated circuit or a software carrier, or in other storage devices such as a hard disk, a compact diskette ("CD"), an optical disc (such as a DVD, often referred to as a digital versatile disc or digital video disc), random access memory ("RAM"), or read-only memory ("ROM"). The instructions may form an application program that is tangibly embodied in the processor-readable medium. Instructions may be found, for example, in hardware, firmware, software, or a combination thereof. Instructions may be found, for example, in an operating system, a separate application, or a combination of the two. Thus, a processor may be characterized as both, for example, a device configured to perform processing and a device including a processor-readable medium (such as a storage device) having instructions for performing processing. Furthermore, the processor-readable medium may store data values generated by the implementation in addition to or instead of instructions.
[0066] As will be apparent to those skilled in the art, implementations can generate a variety of signals formatted to carry information, which can, for example, be stored or transmitted. The information may include, for example, instructions for performing a method, or data generated by one of the implementations described. For example, a signal may be formatted as data to convey rules for writing or reading the syntax of the embodiment described, or as data to convey actual syntax values described by the embodiment described. Such a signal may be formatted, for example, as an electromagnetic wave (e.g., using the radio frequency portion of the spectrum) or as a baseband signal. Formatting may include, for example, encoding a data stream and modulating a carrier wave with the encoded data stream. The information carried by the signal may be, for example, analog or digital information. The signal may be transmitted over a wide variety of different wired or wireless links, as is known. The signal may be stored in a processor-readable medium.
[0067] Several implementations have been described. Nevertheless, it will be understood that various modifications are possible. For example, elements of different implementations may be combined, supplemented, modified, or deleted to produce other implementations. Furthermore, those skilled in the art will understand that other structures and processes may be substituted for the disclosed structures and processes, and that the resulting implementations may perform at least substantially the same function(s) in at least substantially the same way(s) as the disclosed implementations to achieve at least substantially the same result(s). Accordingly, these and other implementations are contemplated by this application.
Claims
1. A method for rendering a volumetric video in accordance with induction limitations, the method comprising: decoding metadata from a data stream containing video data representing the volumetric video, wherein the metadata includes data representing a curved path in the three-dimensional (3D) space of the volumetric video, data representing at least two 3D viewpoints on the curved path associated with a viewing direction range, and data representing a viewing bounding box centered on a first viewpoint which is one of the at least two 3D viewpoints; and rendering, within the limits of the associated viewing direction range, the volumetric video from a viewpoint within the viewing bounding box associated with the first viewpoint among the at least two viewpoints.
2. The method according to claim 1, wherein, in response to an action from the user, the first viewpoint moves to a second viewpoint among the at least two viewpoints, and the rendering is performed in accordance with the data associated with the second viewpoint.
3. The method according to claim 1, wherein, in response to an action from the user, the first viewpoint moves to a third viewpoint located on the curved path and between two of the at least two viewpoints, and the rendering is performed according to data calculated in accordance with the data associated with two of the at least two viewpoints.
4. The method according to claim 1, wherein the data representing the curved path includes parameters representing a parametric 3D curve, and data representing at least one 3D point is associated with one origin of the curved path.
5. The method according to claim 1, wherein the metadata includes data representing at least two curved paths in the 3D space of the volumetric video, and a user action moves the first viewpoint of the first curved path to the first viewpoint of the second curved path.
6. A device for rendering volumetric video according to induction limits, comprising a processor, wherein the processor is configured to: decode metadata from a data stream containing video data representing the volumetric video, wherein the metadata includes data representing a curved path in the three-dimensional (3D) space of the volumetric video, data representing at least two 3D viewpoints on the curved path associated with a viewing direction range, and data representing a viewing bounding box centered on a first viewpoint which is one of the at least two 3D viewpoints; and render, within the limits of the associated viewing direction range, the volumetric video from a viewpoint within the viewing bounding box associated with the first viewpoint among the at least two viewpoints.
7. The device according to claim 6, wherein, in response to an action from the user, the first viewpoint moves to a second viewpoint among the at least two viewpoints, and the rendering is performed in accordance with the data associated with the second viewpoint.
8. The device according to claim 6, wherein, in response to an action from the user, the first viewpoint moves to a third viewpoint located on the curved path and between two of the at least two viewpoints, and the rendering is performed according to data calculated in accordance with the data associated with two of the at least two viewpoints.
9. The device according to claim 6, wherein the data representing the curved path includes parameters representing a parametric 3D curve, and data representing at least one 3D point is associated with one origin of the curved path.
10. The device according to claim 6, wherein the metadata includes data representing at least two curved paths in the 3D space of the volumetric video, and a user action moves the first viewpoint of the first curved path to the first viewpoint of the second curved path.