Methods and devices for coding and decoding a sequence of scenes, and corresponding computer programs
By encoding 3D scenes using predictors and correctors to leverage temporal redundancy, the method addresses the challenge of large data sizes in 3D video sequences, enabling efficient compression and real-time streaming on mobile devices.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ORANGE SA
- Filing Date
- 2025-12-04
- Publication Date
- 2026-07-02
AI Technical Summary
Existing methods for reconstructing and compressing 3D video sequences, such as 3D Gaussian Splatting, result in large data sizes that are unsuitable for real-time streaming and decompression on mobile devices due to high data rates and complexity, limiting their application to wireless networks and devices with limited resources.
A method for encoding 3D scenes by leveraging temporal redundancy between frames, using predictors and correctors to encode actions and differences between objects, reducing data volume through optimized quantization and encoding techniques.
Enables efficient compression and real-time streaming of 3D scene sequences on mobile devices, reducing bandwidth requirements and energy consumption while maintaining high-quality rendering.
Description
[0001] DESCRIPTION
[0002] Title: Methods and devices for encoding and decoding a sequence of scenes and corresponding computer program.
[0003] 1. Scope of the invention
[0004] The field of the invention is that of the encoding and decoding of digital signals. More specifically, the invention relates to the compression of data representing a sequence of multidimensional scenes, for example for the purpose of their transmission and / or storage, and the reconstruction of such a sequence.
[0005] In particular, the invention proposes a solution enabling real-time decompression and / or rendering of a sequence of multidimensional scenes on equipment with limited resources in terms of RAM (Random Access Memory), CPU (Central Processing Unit) and / or GPU (Graphics Processing Unit), for example on a mobile terminal such as a smartphone.
[0006] The proposed solution has applications in all areas where having multiple perspectives can be beneficial, such as education, sports, entertainment, maintenance, etc.
2. Prior art
[0007] Numerous solutions have been proposed for reconstructing a 3D scene (also called a 3D model) from photos taken from different viewpoints. When this reconstruction is obtained from video content, by taking the same instant from each video clip, a temporal sequence of 3D scenes is obtained, also called 3D video or volumetric video, which can be viewed from any point of view.
[0008] Kerbl et al. presented a technique for representing a 3D scene as a set of 3D Gaussian primitives, which can be projected onto the camera's image plane. These projections are called "splat" or "Gaussian splat." This technique, known as "3D Gaussian Splatting" or 3DGS, is notably presented in the paper "3D Gaussian Splatting for Real-Time Radiance Field Rendering," SIGGRAPH 2023. In simplified terms, a point cloud is generated from input images corresponding to different viewpoints. Each point is then converted into a Gaussian that can be split, cloned, deleted, etc. The resulting 3D splat can be considered a spherical object possessing spatial, geometric, and colorimetric properties. The properties of a 3D splat can be described by a set of properties or descriptive information, an example of which is shown in the table below:
[0009] [Table 1]
[0012] Combinations of these properties can produce a wide variety of visual representations. For example, it is estimated that a few tens of thousands of splats can represent a 3D scene in a photorealistic way.
[0013] Reconstructing a 3D video at a specific point in time using a technique like "3D Gaussian Splatting" offers good visual quality, but relies on processing very large amounts of data. For example, even if only the central subject of a video is extracted, frames on the order of 30 MB (30,000,000 bytes), composed of a multitude of splats, are obtained from multiple video clips using the same point in time for each video. Reconstructing a sequence of frames from videos, for example, reconstructing a series of 30 3D scenes per second, requires a data rate of around 7 Gbps (gigabits per second).
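The 7 Gbps figure follows directly from the frame size and frame rate quoted above. The short Python check below reproduces the arithmetic; the constant and function names are illustrative only.

```python
# Raw (uncompressed) data rate of a sequence of 3DGS frames,
# using the orders of magnitude quoted in paragraph [0013].
FRAME_SIZE_BYTES = 30_000_000  # about 30 MB per reconstructed 3D scene
FRAMES_PER_SECOND = 30         # target playback rate


def raw_bitrate_gbps(frame_size_bytes: int, fps: int) -> float:
    """Return the uncompressed data rate in gigabits per second."""
    bits_per_second = frame_size_bytes * 8 * fps
    return bits_per_second / 1e9


if __name__ == "__main__":
    # 30e6 bytes * 8 bits * 30 frames/s = 7.2e9 bits/s
    print(f"{raw_bitrate_gbps(FRAME_SIZE_BYTES, FRAMES_PER_SECOND):.1f} Gbps")
```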
[0014] Such a large amount of data must be compressed before a transfer to a playback device can be considered, especially when a wireless transmission technique is used (for example, Wifi®, 5G or other).
[0015] It is also necessary to decompress and render such a large amount of data quickly.
[0016] Although "3D Gaussian Splatting" type techniques allow for the reconstruction of realistic 3D scenes from multiple viewpoints, current compression techniques do not allow for the streaming of a time series of 3D scenes on mobile devices, due to the size of the data and the complexity of decoding and rendering.
[0017] There is therefore a need for a solution that allows for the efficient compression of data representing a sequence of 3D scenes, for example in order to be able to transfer them via a wireless network (streaming) and / or display them in real time (for example at a frequency of around 30 frames per second), particularly on a mobile terminal such as a smartphone.
[0018] 3. Description of the invention
[0019] The invention proposes a solution in the form of a method for encoding at least one scene from a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame and, at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame. According to the invention, said method implements an encoding of the second frame, comprising:
[0020] - obtaining a list of predictors and a list of correctors, said list of predictors comprising, for a current object of said second frame, a corresponding object of said first frame, called the selected object, and said list of correctors comprising, for said current object of the second frame, a difference between at least one piece of descriptive information of said selected object and at least one piece of descriptive information of said current object,
[0021] - the construction of a list of actions associating at least one action with at least one object from said first frame, taking into account said list of predictors,
[0022] - a coding of at least one action from said list of actions and at least one difference from said list of correctors.
[0023] The invention thus proposes a new solution for compressing content represented by a plurality of multidimensional elementary objects at different times, for example a 3D video made up of a succession of 3D scenes. By way of example, such objects can be non-Gaussian splats, Gaussian splats, polygons, edges and vertices of a 3D mesh, etc.
[0024] More specifically, the solution proposes to take into account the similarities between two objects belonging to scenes at different points in time, in order to reduce the amount of data to be encoded before storage and / or transmission. It thus proposes to exploit the temporal redundancy between successive, but not necessarily consecutive, scenes. A list of actions can then be constructed to represent the objects in the predictor list, which allows actions to be encoded rather than objects, thereby further reducing the amount of data to be encoded.
[0025] In particular, reducing the amount of data to be coded opens up possibilities for new applications, for example:
[0026] by offering a solution compatible with lower bandwidth networks, such as a 4G network or an overloaded network (for example when a large number of terminals are connected to the same antenna),
[0027] by enabling the streaming of scenes represented by a very large number of objects, for example more than 100,000, depending on the power of the users' CPUs and GPUs (which continues to grow with the emergence of new technologies),
[0028] by reducing the energy required for data transmission,
[0029] etc.
[0030] Thus, according to at least one embodiment, the proposed technique makes it possible to compress at least one multidimensional scene sufficiently to ensure streaming of the time sequence over a wireless network, such as a Wifi® or 5G network or any other network, while guaranteeing real-time processing and display compatible with the capabilities of a mobile terminal such as a smartphone.
[0031] Information enabling the identification of the first frame, for example, an index of the first frame within the list of frames in the time sequence, can also be encoded. In a particular embodiment, the method also implements the encoding of the first frame. For example, the first frame can be encoded using the compression method described in French patent application FR 2408093 filed on July 23, 2024.
[0032] The first frame can be, in particular, an intra-frame (I-frame), or a previously encoded and decoded frame (P-frame). The encoded first frame can thus be stored and / or transmitted, for example in the same stream as the encoded second frame.
[0033] In a particular embodiment, for at least one object of said first frame, said action belongs to the group comprising:
[0034] - an "ignore" action if the object in question does not belong to the list of predictors,
[0035] - a "copy" type action for the first occurrence of said object in said predictor list,
[0036] - a "duplication" type action for each subsequent occurrence of said object in said predictor list.
[0037] In particular, the number of different actions is restricted, which limits the number of bits needed to encode an action. This further reduces the amount of data to be encoded. For example, such an action can be encoded using only two bits: '00' for an "ignore" action, '01' for a "copy" action, and '10' for a "duplicate" action, for example.
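As an illustration of the two-bit codes of paragraph [0037], the sketch below packs four actions per byte and unpacks them. The code values follow the example given above; the bit order (first action in the most significant bits) and the function names are assumptions made for this sketch, not a format defined by the text. Because unused slots in the last byte read back as '00' ("ignore"), the decoder must know the number of actions, passed here as `count`.

```python
# Example 2-bit codes from paragraph [0037]; the packing layout is an assumption.
ACTION_CODES = {"ignore": 0b00, "copy": 0b01, "duplicate": 0b10}
CODE_NAMES = {code: name for name, code in ACTION_CODES.items()}


def pack_actions(actions: list[str]) -> bytes:
    """Pack actions four per byte, the first action in the two most significant bits."""
    out = bytearray()
    for start in range(0, len(actions), 4):
        byte = 0
        for slot, action in enumerate(actions[start:start + 4]):
            byte |= ACTION_CODES[action] << (6 - 2 * slot)
        out.append(byte)  # unused slots stay at 0b00
    return bytes(out)


def unpack_actions(data: bytes, count: int) -> list[str]:
    """Inverse of pack_actions; `count` is the number of actions to recover."""
    actions = []
    for index in range(count):
        byte = data[index // 4]
        code = (byte >> (6 - 2 * (index % 4))) & 0b11
        actions.append(CODE_NAMES[code])
    return actions
```

For example, `["copy", "ignore", "duplicate", "copy", "ignore"]` packs into the two bytes `0x49 0x00`.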
[0038] According to one embodiment, the coding process includes ordering said list of predictors and said list of correctors according to an ordering criterion, prior to the construction of the list of actions.
[0039] In a particular embodiment, said ordering criterion is an order of appearance of objects in said first frame.
[0040] According to this embodiment, the objects selected in the first frame are, for example, arranged in a predictor list sorted by their order of appearance in the first frame. The differences associated with the selected objects are then arranged in a corrector list in the same order.
[0041] Alternatively, the list of predictors and the list of correctors are constructed so that a predictor (object of the first frame) and the corresponding correction (difference) (to be applied to the predictor to predict an object of the second frame) are located in the same position in the list of predictors and in the list of correctors.
[0042] The list of actions can also be constructed according to this same criterion, for example by traversing the objects of the first frame one by one, according to their order of appearance in the first frame, and checking for each object of the first frame whether it is present in the list of predictors, and if so, whether there is one or more occurrences of this object in the list of predictors. In one embodiment, an object of the first frame is selected according to a proximity criterion in terms of Euclidean distance between said at least one descriptive information of said current object and said at least one descriptive information of said selected object.
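A minimal sketch of the ordering of paragraphs [0038] to [0040] and of the action-list traversal of paragraph [0042] follows. It assumes that the predictor list is represented by first-frame indices (one per object of the second frame) and that an object appearing k times in the predictor list yields one "copy" followed by k - 1 "duplicate" actions. These representation choices and the function names are hypothetical; the figures of the application may use a different layout.

```python
from collections import Counter


def order_lists(predictors, correctors):
    """Sort (predictor, corrector) pairs by predictor index, i.e. by order of
    appearance in the first frame. sorted() is stable, so pairs sharing a
    predictor keep their relative order in both lists."""
    pairs = sorted(zip(predictors, correctors), key=lambda pair: pair[0])
    return [p for p, _ in pairs], [c for _, c in pairs]


def build_actions(num_first_frame_objects, ordered_predictors):
    """Traverse the first frame in order and emit one action per slot."""
    occurrences = Counter(ordered_predictors)
    actions = []
    for index in range(num_first_frame_objects):
        count = occurrences[index]  # Counter returns 0 for absent indices
        if count == 0:
            actions.append("ignore")  # object not used as a predictor
        else:
            actions.append("copy")  # first occurrence
            actions.extend(["duplicate"] * (count - 1))  # subsequent occurrences
    return actions


if __name__ == "__main__":
    preds, corrs = order_lists([2, 0, 2, 3], ["d0", "d1", "d2", "d3"])
    print(preds, corrs)
    print(build_actions(4, preds))
```

Sorting discards the original order of the second-frame objects, which is acceptable because, as stated in the general principle, the order of the objects within a frame is irrelevant.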
[0043] For example, at least one descriptive piece of information about an object (current object or selected object), also called a property, belongs to the group comprising:
[0044] - a position of the center of the object (x, y, z),
[0045] - a color associated with the object (r, g, b),
[0046] - a density associated with the object (d),
[0047] - a scale factor associated with the object (sx, sy, sz),
[0048] - a rotation factor associated with the object (ox, oy, oz),
[0049] - a rotation angle associated with the object (oa),
[0050] - spherical harmonics associated with the object (sr0, sg0, sb0, sr1, sg1, sb1, ...).
[0051] The selection criterion may take into account one of these pieces of descriptive information or a combination of them, for example all of them, particularly when the objects considered are splats.
[0052] Thus, according to one embodiment, a combination of these pieces of descriptive information is considered, expressed in the following form:
[0053] In particular, said at least one descriptive piece of information for one of said objects is expressed in the following form:
[0054]
[0055] with:
[0056] (x, y, z) a position of the center of said object,
[0057] (r, g, b) a color associated with said object,
[0058] d a density associated with said object,
[0059] (sx, sy, sz) a scaling factor associated with said object,
[0060] (ox, oy, oz) a rotation factor associated with said object,
[0061] oa an angle of rotation associated with said object,
[0062] a1, a2, a3, a4 coefficients depending on the type of descriptive information.
[0063] The coefficients a1, a2, a3, a4 reflect the size of the representation space required to maintain rendering quality. As an example, the following values are chosen for the coefficients: a1 = 2, a2 = 1, a3 = 2, a4 = 1. Indeed, as described in the aforementioned French patent application FR 2408093, descriptive information such as the center position or the scale factor can, once quantized, be represented on two bytes. For example, two bytes are used to quantize the position of the center in x, two bytes in y and two bytes in z; likewise, two bytes are used to quantize the scale factor in x, two bytes in y and two bytes in z, hence the choice of the coefficients a1 = 2 and a3 = 2. Twelve bytes can thus be allocated to the quantized center position and scale factor. Furthermore, descriptive information such as color, density or rotation factor can, once quantized, be represented on one byte. For example, one byte each is used to quantize the red, green and blue color components, the density, the rotation factor in x, in y and in z, and the rotation angle, hence the choice of the coefficients a2 = 1 and a4 = 1. Eight bytes can thus be provided for the quantized color, density and rotation information.
[0064] Of course, the values of these coefficients a1, a2, a3, a4 can be adapted to take into account the orders of magnitude of the scene.
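The expression of paragraph [0054] is not available in this text, so the following sketch assumes one plausible form of the proximity criterion: a Euclidean distance in which the squared per-field differences are weighted by a1 (position), a2 (color and density), a3 (scale) and a4 (rotation factor and angle), with the example values a1 = 2, a2 = 1, a3 = 2, a4 = 1. The grouping of the density with the color and of the rotation angle with the rotation factor, the `Splat` class and the function names are all assumptions for illustration.

```python
import math
from dataclasses import dataclass

# Example coefficient values from paragraph [0063].
A1, A2, A3, A4 = 2, 1, 2, 1


@dataclass(frozen=True)
class Splat:
    """Hypothetical container for the descriptive information used by the criterion."""
    pos: tuple[float, float, float]    # (x, y, z) center position
    color: tuple[float, float, float]  # (r, g, b)
    density: float                     # d
    scale: tuple[float, float, float]  # (sx, sy, sz)
    rot: tuple[float, float, float]    # (ox, oy, oz) rotation factor
    angle: float                       # oa rotation angle


def _sq_dist(u, v) -> float:
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance(p: Splat, q: Splat) -> float:
    """Assumed weighted Euclidean distance; spherical harmonics are ignored."""
    return math.sqrt(
        A1 * _sq_dist(p.pos, q.pos)
        + A2 * (_sq_dist(p.color, q.color) + (p.density - q.density) ** 2)
        + A3 * _sq_dist(p.scale, q.scale)
        + A4 * (_sq_dist(p.rot, q.rot) + (p.angle - q.angle) ** 2)
    )


def select_predictor(current: Splat, first_frame: list[Splat]) -> int:
    """Index of the first-frame object closest to `current` (exhaustive search)."""
    return min(range(len(first_frame)), key=lambda i: distance(current, first_frame[i]))
```

The exhaustive search evaluates one distance per pair of objects; a spatial index over positions could reduce that cost, but the text does not say whether one is used.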
[0065] It should also be noted that the selection criterion does not necessarily take into account descriptive information such as spherical harmonics. Indeed, such spherical harmonics can be omitted for processing by the GPU. Therefore, it is not necessary to quantify descriptive information such as spherical harmonics, which further reduces the amount of data to be encoded, nor to consider it for object selection in the first frame.
[0066] In a particular embodiment, said at least one descriptive piece of information of the selected object and said at least one descriptive piece of information of the current object are quantified over an integer number of bytes.
[0067] In this way, the quantified descriptive information retains a structure that can be efficiently used by the GPU since it is quantified over an integer number of bytes.
[0068] In a particular embodiment, said at least one quantified descriptive information of the selected object being optimized by modifying the value of N bits in said at least one quantified descriptive information of the selected object, with N an integer greater than or equal to 1, said method comprises optimizing said at least one quantified descriptive information of the current object by modifying N bits in said at least one quantified descriptive information of the current object.
[0069] According to this embodiment, the first frame is considered to have been previously compressed using the compression method described in the aforementioned French patent application FR 2408093, or any other compression technique. According to the aforementioned French patent application FR 2408093, a compression method for at least one multidimensional scene represented by a plurality of multidimensional elementary objects implements, for at least one of said elementary objects, referred to as the current object:
[0070] for at least one descriptive piece of information of said current object, called current descriptive information:
- quantification of said current descriptive information over an integer number of bytes,
[0071] - optimization of the quantification of said current descriptive information including, for at least one bit of said current quantified descriptive information, called current bit:
[0072] • obtaining a representative reference image of said scene,
[0073] • modification of a value of said current bit, relative to an initial value, providing a modified current quantified descriptive information,
[0074] • obtaining a degraded image representative of said scene, from said current modified quantified descriptive information,
[0075] • measurement of distortion between said degraded and reference images,
[0076] • if said distortion between said degraded and reference images is less than a threshold, storage of said modified value of said current bit, otherwise reset to said initial value of said current bit.
[0077] The optimization of the quantization of the current descriptive information can be iterated over the bits of the current quantized descriptive information as long as the distortion is below the threshold.
[0078] Thus, for at least one descriptive piece of information for at least one elementary object, one or more bits of the quantized descriptive information can be modified. This modification is performed in a supervised manner, by checking for each modified "candidate" bit whether the resulting degraded image is sufficiently close to the reference image. If so, the "candidate" modification is accepted. Otherwise, the "candidate" modification is rejected. The aim is to maintain a high-quality rendering of a multidimensional scene while reducing the data quantization space.
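The supervised optimization of paragraphs [0070] to [0077] can be sketched as follows for a single quantized field. `render` and `distortion` are hypothetical callables standing in for the rendering of a degraded image and the distortion measure. The sketch assumes, as in paragraph [0079], that each candidate modification forces a bit to 0, starting from the least significant bit, and that iteration stops at the first rejected bit; the returned count is then the number N of low-order bits guaranteed to be 0.

```python
def optimize_quantized_value(values, index, num_bits, render, distortion, threshold):
    """Force low-order bits of values[index] to 0 while the rendering stays close
    to the reference; return the optimized copy and the number of forced bits."""
    reference = render(values)         # reference image from the unmodified data
    optimized = list(values)
    forced = 0                         # low-order bits now guaranteed to be 0
    for bit in range(num_bits):        # least significant bit first
        initial = optimized[index]
        optimized[index] = initial & ~(1 << bit)  # candidate: force current bit to 0
        degraded = render(optimized)   # degraded image from the modified data
        if distortion(degraded, reference) < threshold:
            forced += 1                # keep the modified value
        else:
            optimized[index] = initial  # reset to the initial value and stop
            break
    return optimized, forced


if __name__ == "__main__":
    # Toy stand-ins: the "image" is the list of values itself and the
    # distortion is the largest absolute difference.
    result, n = optimize_quantized_value(
        [183], 0, 8,
        render=lambda v: list(v),
        distortion=lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
        threshold=8,
    )
    print(result, n)  # [176] 4
```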
[0079] According to a particular embodiment of the invention, the first frame is compressed using the compression method described above or any other compression method that reduces the quantization space of the descriptive information. The quantized descriptive information of the selected object is thus optimized by modifying the value of its N least significant bits, for example, by forcing them to 0, as long as the resulting degraded image is sufficiently close to the reference image. By analogy, the quantized descriptive information of the current object can also be optimized by modifying the value of its N least significant bits, for example, by forcing them to 0.
[0080] In this way, the proposed solution according to this embodiment relies on optimizing the quantization of the descriptive information of at least one current object, by reducing, in a supervised manner, the quantization space of the descriptive information, i.e. the number of possible values it can take.
[0081] Such a reduction in the data quantization space allows for fast entropy compression and / or decompression. Furthermore, the optimized quantized descriptive information retains a structure that can be efficiently processed by the GPU.
[0082] In a particular embodiment, at least one difference is determined by comparing quantized descriptive information or optimized quantized descriptive information. Specifically, a difference can be determined by comparing, pairwise, descriptive information of the same type after quantization and / or optimization: for example, by comparing the x-position of the current object with the x-position of the selected object, then the y-position of the current object with the y-position of the selected object, and so on. Using the examples of descriptive information listed above, it is possible, for instance, to obtain 14 difference values (by comparing the x, y, z, r, g, b, d, sx, sy, sz, ox, oy, oz, oa values of the current object and the selected object). Alternatively, a difference can be determined by comparing a set of descriptive information of the same type, after quantization and / or optimization: for example, by comparing the (x, y, z) position of the current object with the (x, y, z) position of the selected object.
[0083] In a particular embodiment, the binary representation of said at least one difference is shifted N bits to the right before encoding.
[0084] Thus, the binary representation of a difference can be shifted by a number of bits corresponding to the number of bits modified in said at least one optimized quantified descriptive information of the selected object. This makes it possible to further reduce the amount of data to be encoded.
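Once the N least significant bits of both the selected and the current quantized values are 0, their difference is a multiple of 2^N, so shifting it N bits to the right loses nothing and the decoder can restore it with a left shift (paragraph [0109]). A minimal Python sketch with hypothetical function names, where each difference is taken as current minus selected so that the decoder adds it back to the predictor:

```python
def force_lsbs_to_zero(value: int, n: int) -> int:
    """Clear the n least significant bits of a non-negative quantized value."""
    return value & ~((1 << n) - 1)


def encode_differences(current, selected, n):
    """Per-field differences (current - selected), shifted n bits to the right.
    Both inputs must already have their n low-order bits at 0."""
    mask = (1 << n) - 1
    shifted = []
    for c, s in zip(current, selected):
        if c & mask or s & mask:
            raise ValueError("low-order bits must be 0 before shifting")
        shifted.append((c - s) >> n)  # exact: c - s is a multiple of 2**n
    return shifted


def decode_fields(selected, shifted_diffs, n):
    """Shift each difference back to the left and apply it to the predictor."""
    return [s + (d << n) for s, d in zip(selected, shifted_diffs)]


if __name__ == "__main__":
    n = 2
    cur = [force_lsbs_to_zero(v, n) for v in [1023, 517]]  # [1020, 516]
    sel = [force_lsbs_to_zero(v, n) for v in [1000, 530]]  # [1000, 528]
    diffs = encode_differences(cur, sel, n)                 # [5, -3]
    print(diffs, decode_fields(sel, diffs, n) == cur)
```

Python's `>>` on a negative integer is an arithmetic shift, so negative differences such as -12 become -3 and shift back to -12 exactly.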
[0085] Such a shift value associated with a difference can in particular be encoded before storage and / or transmission in the stream.
[0086] In a particular embodiment, said at least one difference associated with at least one descriptive information being represented on said integer number of bytes, said encoding of said at least one difference is implemented on at most half of said integer number of bytes.
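The half-width rule of paragraph [0086] amounts to checking that each difference fits in a signed integer of half the field width: 8 bits for two-byte fields, 4 bits for one-byte fields. The sketch below, with hypothetical function names, performs that check and packs the differences in two's complement, one per byte for two-byte fields and two signed half-bytes per byte for one-byte fields. How an out-of-range difference would be handled is not specified in the text; this sketch simply raises an error.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def pack_int8(diffs):
    """Encode differences of two-byte fields on one signed byte each."""
    for d in diffs:
        if not fits_signed(d, 8):
            raise ValueError(f"difference {d} does not fit in a signed byte")
    return bytes(d & 0xFF for d in diffs)


def pack_nibbles(diffs):
    """Encode differences of one-byte fields on a signed half-byte each,
    two per byte, the first value in the high nibble."""
    for d in diffs:
        if not fits_signed(d, 4):
            raise ValueError(f"difference {d} does not fit in a signed half-byte")
    padded = list(diffs) + [0] * (len(diffs) % 2)  # pad odd length with a 0 nibble
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(data: bytes, count: int):
    """Inverse of pack_nibbles; `count` is the number of differences to recover."""
    values = []
    for i in range(count):
        byte = data[i // 2]
        nibble = byte >> 4 if i % 2 == 0 else byte & 0xF
        values.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return values
```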
[0087] According to this embodiment, if quantized descriptive information is represented on two bytes (for example, descriptive information such as the center position or scale factor of a current or selected object), then the difference between two pieces of descriptive information of the same type can be expressed on two bytes, but only one byte is encoded: a signed byte is then sufficient to encode such a difference. Similarly, if quantized descriptive information is represented on one byte (for example, descriptive information such as the color, density or rotation factor of a current or selected object), then the difference between two pieces of descriptive information of the same type can be expressed on one byte, but only half a byte is encoded: a signed half-byte is then sufficient to encode such a difference.
In another embodiment, the encoding of the second frame includes, for at least one of said objects of said second frame, called the current object:
[0088] - a selection, in the first frame, of an object corresponding to the current object, called the selected object, according to a selection criterion taking into account at least one piece of descriptive information of the current object and at least one piece of descriptive information of the corresponding object,
[0089] - a determination of at least one difference between at least one piece of descriptive information of the selected object and at least one piece of descriptive information of the current object,
- an ordering of the selected objects in an ordered predictor list and of the associated differences in an ordered corrector list, according to an ordering criterion,
[0090] - a construction of a list of actions associating at least one action with at least one object of said first frame, taking into account said ordered list of predictors,
[0091] - a coding of at least one action from said list of actions and at least one difference from said ordered list of correctors.
[0092] The information obtained from the coding process, particularly action, difference, and possibly offset values, can be stored (in a file, memory, etc.) and / or transmitted. It can, in particular, be encoded by an entropy coder before transmission and / or storage. Information allowing the identification of the first frame, for example, an index of the first frame within the list of frames in the time sequence, can also be encoded, stored, and / or transmitted.
[0093] In particular, such information can be streamed over a wireless network. For example, such entropy coding belongs to the group comprising:
[0094] - an LZFSE type coding (“Lempel-Ziv Finite State Entropy”),
[0095] - an LZ4 type coding (“Lempel-Ziv 4”),
[0096] - an LZMA type coding (“Lempel-Ziv Markov chain algorithm”),
[0097] - a ZLIB type encoding,
[0098] - LZBITMAP type encoding (“Lempel-Ziv Bitmap”).
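The five formats listed above match the algorithm names offered by Apple's Compression framework. Python's standard library only provides zlib and LZMA, so the following sketch uses those two to entropy-code an already packed payload (for example, packed actions and differences); the wrapper names are hypothetical.

```python
import lzma
import zlib


def entropy_code(payload: bytes, method: str = "zlib") -> bytes:
    """Compress a packed payload with one of the two stdlib codecs."""
    if method == "zlib":
        return zlib.compress(payload, 9)  # level 9: best compression
    if method == "lzma":
        return lzma.compress(payload)     # .xz container, default preset
    raise ValueError(f"unsupported method: {method}")


def entropy_decode(data: bytes, method: str = "zlib") -> bytes:
    """Inverse of entropy_code."""
    if method == "zlib":
        return zlib.decompress(data)
    if method == "lzma":
        return lzma.decompress(data)
    raise ValueError(f"unsupported method: {method}")
```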
[0099] In another embodiment, the invention relates to a corresponding coding device.
[0100] Such a coding device is particularly well-suited to implementing the coding process described above. It may, of course, incorporate the various features of the coding process according to the invention, which may be combined or considered individually. Thus, the characteristics and advantages of the coding device are the same as those of the process described above. Consequently, they are not described in further detail.
[0101] Such a coding device is for example integrated into a computer, a tablet, a mobile terminal such as a smartphone, a PDA, a virtual or augmented reality headset or glasses, etc. It can be a hardware entity or a software entity, which can be distributed over one or more network functions or hosted by one or more hardware devices.
[0102] In another embodiment, the invention relates to a method for decoding a stream of coded data representing at least one scene from a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame and, at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame.
[0103] According to the invention, said method implements a decoding of said second frame, comprising:
- obtaining a list of predictors, from the decoding of at least one action associated with at least one object of said first frame,
[0104] - obtaining a list of correctors associated with said list of predictors, from the decoding of at least one difference between at least one descriptive piece of information of an object in said list of predictors and at least one descriptive piece of information of a corresponding object to be predicted in said second frame,
[0105] - a reconstruction of said object to be predicted, by applying said decoded difference to said at least one descriptive information of said object from said list of predictors.
[0106] Such a decoding method is particularly suited to receiving or reading a signal generated using the encoding method described above. It may, of course, include the various features of the encoding method according to the invention. Thus, the characteristics and advantages of the decoding method are the same as those of the encoding method described previously.
[0107] In particular, such a process makes it possible to reconstruct at least one object from a second frame, and for example, all the objects of the second frame, from a first frame that has been previously decoded or reconstructed. As already mentioned, such a first frame can be, in particular, a type I or type P frame. It can be identified in the data stream, for example, by means of an index.
[0108] Such a first frame, used in coding to construct the list of actions from a list of predictors, can be used with the decoded list of actions to reconstruct the list of predictors. By combining an object from the list of predictors and the associated difference (the list of predictors and the list of correctors being associated, for example, constructed or ordered according to the same criterion), it is possible to reconstruct the different objects of the second frame one by one, and thus to reconstruct the scene at a second time point.
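On the decoding side, the predictor list can be rebuilt from the decoded actions and the first frame, and each object of the second frame is then obtained by adding its decoded difference to its predictor. The sketch below assumes the same hypothetical action semantics as on the encoder side ("ignore" and "copy" advance to the next first-frame object, "duplicate" reuses the current one), objects represented as lists of quantized integer fields, and differences that have already been shifted back to the left where applicable. The function names are illustrative.

```python
def rebuild_predictors(actions):
    """Return first-frame indices in predictor order from the decoded actions."""
    predictors = []
    index = -1  # index of the current first-frame object
    for action in actions:
        if action in ("ignore", "copy"):
            index += 1  # move on to the next first-frame object
        elif action != "duplicate":
            raise ValueError(f"unknown action: {action}")
        if action in ("copy", "duplicate"):
            if index < 0:
                raise ValueError("'duplicate' cannot precede the first object")
            predictors.append(index)
    return predictors


def reconstruct_second_frame(first_frame, actions, correctors):
    """Apply each decoded difference to its predictor, field by field."""
    predictors = rebuild_predictors(actions)
    if len(predictors) != len(correctors):
        raise ValueError("predictor and corrector lists must have the same length")
    return [
        [field + diff for field, diff in zip(first_frame[i], corrector)]
        for i, corrector in zip(predictors, correctors)
    ]


if __name__ == "__main__":
    first = [[10, 20], [30, 40], [50, 60], [70, 80]]
    acts = ["copy", "ignore", "copy", "duplicate", "copy"]
    print(reconstruct_second_frame(first, acts, [[1, -1], [0, 2], [-3, 0], [5, 5]]))
```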
[0109] In a particular embodiment, the method also includes decoding a shift value associated with at least one optimized quantified descriptive information of said selected object, and according to which said at least one decoded difference is shifted left by a number of bits corresponding to said shift value prior to said application. Such a shift value associated with a difference can in particular be determined at the encoding stage and transmitted in the stream.
[0110] In one embodiment, said method also includes displaying said time sequence on a mobile terminal.
[0111] In another embodiment, the decoding of the second frame includes:
[0112] decoding at least one action from a list of actions associating at least one action with at least one object from the previously obtained first frame,
[0113] reconstruction of an ordered predictor list from said at least one decoded action and said first frame, said ordered predictor list comprising objects from said first frame ordered according to an ordering criterion,
[0114] decoding at least one difference from an ordered corrector list, said ordered corrector list comprising differences between at least one descriptive piece of information of an object from said ordered predictor list and at least one descriptive piece of information of an object to be predicted from said second frame, said differences being ordered according to said ordering criterion,
[0115] reconstruction of at least one object from said second frame, called the current object, comprising:
[0116] o selection of an object from said ordered list of predictors corresponding to said current object,
[0117] o application to said at least one descriptive information of said selected object of said at least one corresponding decoded difference.
[0118] In another embodiment, the invention relates to a corresponding decoding device.
[0119] Such a decoding device is particularly well-suited to implementing the decoding process described above. It may, of course, incorporate the various features of the decoding process according to the invention, which may be combined or considered individually. Thus, the features and advantages of the decoding device are the same as those of the process described above. Consequently, they are not described in further detail.
[0120] Such a decoding device is for example integrated into a mobile terminal such as a smartphone, tablet, PDA, virtual or augmented reality headset or glasses, etc.
[0121] It can be a hardware entity or a software entity, which can be distributed across one or more network functions or hosted by one or more hardware devices.
[0122] The invention further relates to one or more computer programs comprising instructions for implementing a process as described above when this or these programs are executed by at least one processor.
[0123] The invention also relates to a computer-readable information carrier containing instructions for a computer program as mentioned above.
4. List of figures
[0124] Other features and advantages of the invention will become more apparent upon reading the following description of a particular embodiment, given by way of simple illustration and not limitation, and the accompanying drawings, among which:
[0125] Figure 1 illustrates the main steps implemented by a coding process according to one embodiment of the invention.
[0126] Figure 2 illustrates the main steps implemented by a decoding process according to one embodiment of the invention.
[0127] Figure 3 presents an example of constructing a list of predictors,
[0128] Figure 4 shows an example of ordering a list of predictors and a list of correctors,
[0129] Figure 5 illustrates the construction of an action list according to a particular embodiment; Figure 6 illustrates an example of the reconstruction of a second frame.
[0130] Figure 7 presents an example of a time sequence compression scheme with prediction.
[0131] Figure 8 presents another, unclaimed, embodiment for encoding a second frame,
[0132] Figure 9 presents another, unclaimed, embodiment for the reconstruction of a second frame.
[0133] Figures 10 and 11 illustrate the simplified structure of a corresponding encoding device and decoding device.
[0134] 5. Description of an embodiment
[0135] 5.1 General Principle
[0136] The general principle of the invention is based on taking into account similarities between multidimensional objects belonging to distinct scenes of a sequence of multidimensional scenes, so as to reduce the amount of data that must be coded to reconstruct at least one scene of the sequence.
[0137] We first consider at least two multidimensional scenes at different times, one scene represented by a plurality of multidimensional elementary objects forming a first frame at a first time, and the other scene represented by a plurality of multidimensional elementary objects forming a second frame at a second time. We assume that the number of objects in the first and second frames can be different, and that the order of the objects in the first and second frames is irrelevant. In particular, the proposed solution relies on a process for selecting objects from the first frame, allowing the objects of the first frame and the objects of the second frame to be aligned in such a way as to minimize the corrections required to reconstruct the second frame.
[0138] We will subsequently consider a multidimensional scene to be a 3D scene and a multidimensional object to be a 3D object, for example, a Gaussian splat. The proposed solution then allows for the efficient compression of a sequence of 3D scenes represented by 3D splats in terms of compression gain, decompression time, and / or 3D rendering speed by the GPU. Other dimensions, notably 2D or 4D, or types of objects are of course conceivable. Thus, the proposed solution could also be applied to image encoding or decoding.
[0139] Figure 1 illustrates the main steps of a coding process according to one embodiment of the invention.
[0140] Such a process takes as input a sequence of at least two 3D scenes, each represented by a plurality of 3D objects. As described in relation to prior art, a 3D scene can be obtained from photographs captured from different viewpoints, or from videos taken from different viewpoints at the same time. Alternatively, a 3D scene can be obtained from images generated by a graphics processing unit.
[0141] Consider a 3D scene at a time index t1 and a 3D scene at a time index t2, where 1 ≤ t1, t2 ≤ T and t1 ≠ t2. The 3D scene at time t1 is represented by a plurality of 3D objects forming a first frame, also called the reference frame Tref. The 3D scene at time t2 is represented by a plurality of 3D objects forming a second frame, also called the frame to be predicted, Tpred. The reference frame Tref thus comprises a set of I 3D objects O_ref,i, with 1 ≤ i ≤ I, where I is an integer that can reach several hundred thousand. Similarly, the frame to be predicted Tpred comprises a set of J 3D objects O_pred,j, with 1 ≤ j ≤ J, where J is an integer that can reach several hundred thousand. The 3D objects can be represented or described by at least one piece of descriptive information, or property.
[0142] The aim here is to encode the frame to be predicted, Tpred, from the reference frame. The reference frame can be, in particular, an intra-frame (I-frame), or a previously encoded and decoded frame (P-frame).
[0143] To do this, we consider one of the objects in the frame to be predicted, Tpred, called the current object.
[0144] In a first step 10, a list of predictors BM and a list of correctors PATCH is obtained. The list of predictors includes, for a current object in the frame to be predicted Tpred, a corresponding object in the reference frame Tref, called the selected object, and the list of correctors includes, for the current object in the frame to be predicted Tpred, a difference between at least one descriptive piece of information in the selected object and at least one descriptive piece of information in the current object. To do this, for example, in a selection step 11, an object corresponding to the current object is selected in the reference frame Tref, according to a selection criterion that takes into account at least one descriptive piece of information in the current object and at least one descriptive piece of information in the corresponding object. The object selected in the reference frame can optionally be inserted into the list of predictors BM (in English, "BestMatch") for temporary storage.
[0145] During a step 12 of determining at least one difference, at least one difference is determined between at least one piece of descriptive information of the selected object and at least one piece of descriptive information of the same type of the current object. The difference(s) associated with a selected object can be inserted into a corrector list PATCH.
[0146] These different steps can be repeated for several objects in the frame to be predicted Tpred, and in particular for all the objects in the frame to be predicted Tpred.
[0147] At the end of these steps, we therefore obtain at least one selected object, stored in a list of BM predictors, and at least one associated difference, stored in a list of PATCH correctors.
[0148] Note that the same object from the reference frame can be selected for several objects in the frame to be predicted, depending on the selection criterion. An object from the reference frame may never be selected, depending on the selection criterion. The list of BM predictors containing the selected objects can therefore include zero, one, or more occurrences of an object from the reference frame.
[0149] Optionally, during an ordering step 13, the selected objects can be ordered into an ordered predictor list BM_ord and the associated differences into an ordered corrector list PATCH_ord, according to the same ordering criterion.
[0150] During construction step 14, an action list AL is constructed, associating at least one action with at least one object of the reference frame, taking into account the (possibly ordered) predictor list. More precisely, at least one action is associated with the various objects of the reference frame, depending on whether the object in question appears in the predictor list zero, one, or several times.
[0151] Finally, at least one action from the AL action list and at least one difference from the corrector list (possibly ordered) can be coded during a coding step 15.
[0152] In one particular embodiment, steps 10 to 15 are iterated for all objects in the frame to be predicted, Tpred. Alternatively, steps 10 to 15 are iterated for only some of the objects in the frame to be predicted, Tpred. Step 12, which determines a difference, can be implemented as soon as a new object of the reference frame is selected, or after all the objects of the reference frame useful for predicting the objects of the frame to be predicted have been selected. Similarly, the ordering step 13, the action list construction step 14, and / or the coding step 15 can be implemented progressively, as soon as a new object of the reference frame is selected or a difference is determined, or after all the objects of the reference frame useful for predicting the objects of the frame to be predicted have been selected, or after all the differences useful for predicting those objects have been determined.
[0153] In particular, the coding step for said at least one difference can be implemented as soon as a difference is obtained, or when the set of differences has been obtained.
[0154] Specifically, if several 3D objects have identical descriptive information / properties, it is possible to process the plurality of 3D objects simultaneously, rather than each object individually. The coding process, according to one embodiment, thus allows the frame to be predicted (Tpred) to be encoded from the reference frame (Tref), generating a list of actions and a list of correctors, which can be encoded with an entropy coder, for example.
[0155] Once compressed, the size of the action list and the corrector list can be on the order of 25% to 50% of the size of a scene compressed independently, i.e., without exploiting the temporal correlation between successive frames.
[0156] According to at least one embodiment, exploiting temporality to encode a frame to be predicted from a reference frame makes it possible to divide the bandwidth required by a factor of about 3 compared to independent compression of each scene.
[0157] This gain can be combined with that obtained for the encoding of the reference frame, which can reach a compression factor or ratio of the order of 20 if the reference frame is compressed by implementing the compression process described in patent application FR 2408093.
[0158] At the end of these various steps, a signal or stream F is obtained, representing at least one scene from a temporal sequence of 3D scenes represented by a plurality of 3D objects, intended for transmission and / or storage. Such a signal carries at least one coded action and at least one coded difference for a current object of the frame to be predicted, Tpred. Such a signal may also carry information allowing the identification of the reference frame, or even carry the coded reference frame itself.
[0159] We now describe, in relation to figure 2, the main steps of a decoding process according to an embodiment of the invention.
[0160] The aim here is to reconstruct at least one scene from a temporal sequence of multidimensional scenes, represented by a plurality of multidimensional elementary objects forming a predictable frame, from another scene in the temporal sequence, represented by a plurality of multidimensional elementary objects forming a reference frame, previously decoded or reconstructed. As already mentioned, the reference frame can be an I-frame or a P-frame.
[0161] The decoding / reconstruction of the reference frame is not the subject of this patent application. For example, such a reference frame is obtained by implementing the reconstruction process described in patent application FR 2408093. As illustrated in Figure 2, the process according to the invention implements a preliminary step 21 for obtaining a reference frame Tref. Such a reference frame can be identified by reading and / or decoding information present in the stream, for example, an index of the reference frame in the list of frames in the time sequence. Such a reference frame can be directly decoded beforehand (if it is of the intra-frame type) or reconstructed from a previously decoded frame (if it is of the predicted frame type).
[0162] The process according to the invention also implements a decoding step 22 of the frame to be predicted Tpred.
[0163] To do this, at least one action from a list of actions associating at least one action with at least one object of said reference frame is decoded during a decoding step 221. For example, the list of actions AL is obtained at the end of this decoding step 221.
[0164] During a step 222, a predictor list BM is reconstructed from said at least one decoded action of the action list AL.
[0165] At least one difference from a list of correctors is then decoded during a decoding step 223. Such a list of correctors includes differences between at least one descriptive piece of information of an object from the list of predictors (obtained from an object selection in the reference frame) and at least one descriptive piece of information of an object to be predicted from the frame to be predicted of the same type.
[0166] To reconstruct at least one object to be predicted from the frame to be predicted, called the current object, we apply to at least one descriptive piece of information of an object from said list of predictors, the corresponding difference.
[0167] According to a particular embodiment, the reconstruction 226 includes a selection 224 of an object from the list of predictors corresponding to the current object. For example, if the current object to be reconstructed is the first object in the frame to be predicted, the first object in the list of predictors is selected. If the current object to be reconstructed is the k-th object in the frame to be predicted, the k-th object in the list of predictors is selected.
[0168] Next, at least one decoded difference corresponding to at least one descriptive piece of information about the selected object is applied during an application step 225. For example, if the selected object is the first object in the predictor list, the first difference from the corrector list is applied to it. If the selected object is the k-th object in the predictor list, the k-th difference from the corrector list is applied to it.
[0169] This results in a reconstruction of the current object O_pred,j. As with the encoding, steps 224 and 225 can be iterated for all objects in the frame to be predicted Tpred. Alternatively, steps 224 and 225 can be iterated for only a subset of the objects in the frame to be predicted Tpred.
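The index-based pairing of steps 224 and 225 (k-th predictor, k-th difference) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: each object is assumed to be a list of quantized integer properties in a fixed order, each decoded difference is assumed to have been computed as current minus selected, and the function name `reconstruct_frame` is hypothetical.

```python
from typing import List, Sequence

# Hypothetical layout: an object is a list of quantized integer properties
# (x, y, z, r, g, b, d, ...) stored in the same order for every object.
Obj = List[int]


def reconstruct_frame(predictors: Sequence[Sequence[int]],
                      correctors: Sequence[Sequence[int]]) -> List[Obj]:
    """The k-th object of the frame to be predicted is obtained by applying
    the k-th decoded difference to the k-th predictor (steps 224 and 225)."""
    if len(predictors) != len(correctors):
        raise ValueError("predictor and corrector lists must have equal length")
    frame: List[Obj] = []
    for k, (pred, diff) in enumerate(zip(predictors, correctors)):
        if len(pred) != len(diff):
            raise ValueError(f"object {k}: one difference per property expected")
        # Assumed sign convention: difference = current - selected.
        frame.append([p + d for p, d in zip(pred, diff)])
    return frame


# The same reference object may serve as predictor for two current objects.
print(reconstruct_frame([[148, 20], [148, 20]], [[12, -1], [0, 3]]))
# [[160, 19], [148, 23]]
```

Because the pairing is purely positional, no object identifiers need to be transmitted; only the order of the two lists has to match.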
[0170] In particular, the decoding step 223 of at least one difference can be implemented to decode all the differences, or as a current object of the frame to be predicted is decoded. In this case, the decoding step 223 can implement a decoding of at least one difference between at least one descriptive information of the object selected during step 224 and at least one descriptive information of the current object.
[0171] Similarly, the steps of decoding at least one action from a list of actions 221 and / or reconstructing a list of predictors 222 can be implemented progressively, for example during the step of reconstructing an object to be predicted.
[0172] In particular, such a reconstruction process can be implemented by equipment with limited resources in terms of RAM, CPU and / or GPU, for example a mobile terminal such as a "smartphone".
[0173] Such a process can thus implement a step of rendering, or displaying, a temporal succession of 3D scenes thus reconstructed.
[0174] As indicated above, certain steps of the encoding process or the decoding process according to an embodiment may be carried out in a different order than that illustrated in Figures 1 and 2, or even in parallel.
[0175] 5.2 Example of implementation
[0176] An example of an implementation of the invention for encoding and decoding at least one scene from a temporal sequence of 3D scenes, each represented by a plurality of 3D objects of the Gaussian splat type, is presented below. Of course, other 3D objects can be considered, such as non-Gaussian splats, or polygons, edges and vertices of a mesh, etc. Similarly, objects / scenes of other dimensions can be considered. Thus, although the detailed example below is described for splats, it can be applied to any multidimensional object.
[0177] Using splat objects is advantageous because a list of splats representing a scene can be unordered. This is because the graphics rendering process reorders the splats by depth, which is computed according to the virtual camera used to project the user's viewpoint. Therefore, when the user moves through a scene represented by a list of splats, for example by rotating the viewpoint, the splat list can take a new order with each frame displayed. The order in which the splats are processed for prediction can thus be decided arbitrarily.
[0178] The different steps of Figure 1 are described in more detail below, for the encoding of a frame to be predicted Tpred from a reference frame Tref composed of a multitude of splats.
[0179] 5.2.1 Obtaining a list of predictors
[0180] As illustrated in Figure 3, the reference frame Tref comprises a plurality of splats: splat A1, splat B1, splat C1, ..., splat J1, .... The frame to be predicted Tpred also comprises a plurality of splats: splat A2, splat B2, splat C2, ..., splat J2, .... It should be noted that the number of splats in the reference frame may differ from the number of splats in the frame to be predicted. It should also be noted that the reference frame can be any previous frame, regardless of its type: intra-frame or previously encoded and decoded frame.
[0181] As described in relation to Figure 1, for at least one current splat of the frame to be predicted, and for example for each splat of the frame to be predicted, a splat corresponding to the current splat is selected from the reference frame according to a selection criterion that takes into account at least one property of the current splat and at least one property of the corresponding splat. Thus, for at least one current splat of the frame to be predicted, and for example for each splat of the frame to be predicted, the best match is sought in the reference frame.
[0182] For example, such a selection criterion is a proximity criterion in terms of Euclidean distance between at least one property of the current splat and at least one property of the selected object.
[0183] Thus, the correspondence between a splat of the frame to be predicted and a splat of the reference frame is calculated from the Euclidean distance between properties of the splats.
[0184] As indicated in Table 1 cited in prior art, a splat can be described by different properties. For example, consider the following properties of a splat of the reference frame or the frame to be predicted:
[0185] (x, y, z) a position of the center of the splat,
[0186] (r, g, b) a color associated with the splat,
[0187] d a density associated with the splat,
[0188] (sx, sy, sz) a scale factor associated with the splat,
[0189] (ox, oy, oz) a rotation factor associated with the splat,
[0190] oa an angle of rotation associated with the splat.
[0191] In particular, a "global property" of a splat can be determined from one or more properties in the following form:
[0192]
[0193] with a1, a2, a3, a4 coefficients depending on the type of splat property considered.
[0194] The coefficients a1, a2, a3, and a4 reflect the size of the representation space needed to maintain good rendering quality. For example, the following values are chosen for the coefficients: a1 = 2, a2 = 1, a3 = 2, a4 = 1.
[0195] The value of a property can be quantized and stored in one or more bytes. To do this, we determine, for example, the minimum (min) and maximum (max) possible values of the property in question. The quantized value q of a property value p, in a space of size space, is then obtained by applying the following equation:
[0196]-[0197]
$$q = \frac{p - \min}{\max - \min} \times \text{space}$$
[0198] For example, the space of size space corresponds to 256 values if a property is quantized on one byte, 65536 values if it is quantized on two bytes, etc.
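The quantization equation can be illustrated with the following minimal sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the result is rounded to the nearest integer, and it is clamped to space − 1 so that it fits in log2(space) bits (the formula alone gives q = space when p = max).

```python
def quantize(p: float, vmin: float, vmax: float, space: int) -> int:
    """q = (p - min) / (max - min) * space, rounded to the nearest integer.

    Assumption: q is clamped to [0, space - 1] so that it fits in log2(space)
    bits; the formula alone would give q = space when p == max.
    """
    if vmax <= vmin:  # degenerate range: every value maps to 0
        return 0
    q = round((p - vmin) / (vmax - vmin) * space)
    return max(0, min(space - 1, q))


def dequantize(q: int, vmin: float, vmax: float, space: int) -> float:
    """Inverse mapping (up to the rounding error of quantize)."""
    return vmin + q * (vmax - vmin) / space


# One byte (space = 256) over a 1 m range: one quantization step is 1/256 m.
print(dequantize(1, 0.0, 1.0, 256))   # 0.00390625, i.e. about 4 mm
print(quantize(0.5, 0.0, 1.0, 256))   # 128
print(quantize(1.0, 0.0, 1.0, 256))   # 255 (clamped)
```

The first printed value is the precision figure used in the next paragraph: a 1 m scene quantized on 256 values gives a step of roughly 4 mm.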
[0199] For a property such as center position or scale factor, quantization can be implemented using two bytes, that is, a value from a space containing 65,536 values. Indeed, the required precision can be high because precise values are encoded within an interval of several meters. For a property such as color, density, or rotation factor, quantization can be implemented using one byte, that is, a value from a space containing 256 values. If, for example, we consider a scene of 1 m quantized using 256 values, we obtain a precision on the order of 0.004 m, or 4 mm. Such precision is not necessarily required if we are trying to reconstruct objects in this scene larger than 10 cm, for example.
[0200] We can therefore adapt the values of the coefficients a1, a2, a3, a4 according to the orders of magnitude of the scene.
[0201] During selection step 11, if we consider the current splat of the frame to be predicted to be splat A2, we calculate the Euclidean distance between the global property value of splat A2 and the global property values of splats A1, B1, ... J1 of the reference frame. Of course, the global property values can be read from memory if they have been previously determined.
[0202] The splat of the reference frame closest to the current splat of the frame to be predicted is then selected. In the example shown in Figure 3, splat C1 is considered the closest to splat A2, and is selected.
[0203] Such a selection step 11 can be implemented for the different splats of the frame to be predicted. For example, as illustrated in Figure 3, splat B1 is considered closest to splat B2. Splat B1 is also considered closest to splat C2. Splat G1 is considered closest to splat D2. Splat G1 is also considered closest to splat E2. Splat F1 is considered closest to splat F2. Splat C1 is considered closest to splat G2. Splat J1 is considered closest to splat H2. Splat J1 is also considered closest to splat I2. Splat C1 is considered closest to splat J2.
[0204] The splats thus selected in the reference frame can be inserted into a list of BM predictors (in English "BestMatch").
[0205] Taking the traversal order A2, B2, C2, ..., J2 of the splats of the frame to be predicted (ascending order), we thus obtain a predictor list BM comprising the splats C1, B1, B1, G1, G1, F1, C1, J1, J1, C1. At the end of this best-match search, we therefore obtain a new frame (the predictor list BM) composed of splats of the reference frame whose order corresponds to that of the frame to be predicted.
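The best-match search of selection step 11 can be sketched as a brute-force nearest-neighbour search. This is an illustrative sketch only: it assumes each splat's global property is available as a numeric vector (the combining formula of paragraph [0192] is not reproduced in this excerpt), breaks ties toward the lowest reference index, and the name `best_matches` is hypothetical. The search costs O(I × J) distance evaluations, so with hundreds of thousands of splats a spatial index such as a k-d tree would probably be needed in practice.

```python
import math
from typing import List, Sequence


def best_matches(ref: Sequence[Sequence[float]],
                 cur: Sequence[Sequence[float]]) -> List[int]:
    """For each object of the frame to be predicted (in traversal order),
    return the index of the closest reference object (Euclidean distance).

    The result is the predictor list BM expressed as reference indices;
    a reference index may appear zero, one or several times.
    """
    if not ref:
        raise ValueError("the reference frame is empty")
    bm: List[int] = []
    for c in cur:
        # Exhaustive search; min() keeps the lowest index on ties.
        bm.append(min(range(len(ref)), key=lambda i: math.dist(ref[i], c)))
    return bm


ref = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
cur = [[9.0, 9.0], [1.0, 0.0], [6.0, 5.0], [0.0, 1.0]]
print(best_matches(ref, cur))  # [1, 0, 2, 0]
```

As in Figure 3, reference object 0 is selected twice and nothing forces every reference object to be used.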
[0206] There is no one-to-one correspondence between this list of predictors and the frame to be predicted: the same splat of the reference frame can be the best match for several splats of the frame to be predicted.
5.2.2 Obtaining a list of correctors
[0207] When a splat is selected (step 11), or once the predictor list BM has been created, it is possible to determine (step 12) a difference between at least one property of the current splat and at least one property of the same type of the selected splat.
[0208] For example, if we consider the current splat A2 and the corresponding selected splat C1, we determine a difference between one or more properties of the current splat A2 and one or more properties of the selected splat C1.
[0209] To reduce computation time, again with real-time operation in mind, this determination can be performed by comparing quantized properties. Indeed, since the quantized space is linearly related to the unquantized space, the differences can be expressed in either space. Consequently, the quantization information does not need to be carried in the encoded frame to be predicted, since the quantization information of the reference frame can be reused.
[0210]
[0211]
[0212]
[0213] The same procedure is used to determine at least one difference between one or more properties of a current splat and of a selected splat, for example between the properties listed above: the x, y, and z position of the splat's center, the r, g, and b color associated with the splat, the density d associated with the splat, the x, y, and z scale factor associated with the splat, the x, y, and z rotation factor associated with the splat, and the rotation angle associated with the splat. We denote by ΔC1 − A2 the set of differences between the current splat A2 and the selected splat C1 (for example, 14 differences if 14 properties are compared).
[0214] Such a step of determining a difference 12 can be implemented for the different splats of the frame to be predicted.
[0215] The differences thus determined can be inserted into a corrector list PATCH. Thus, as illustrated in Figure 3, the corrector list PATCH includes the following values: ΔC1 − A2, ΔB1 − B2, ΔB1 − C2, ΔG1 − D2, ΔG1 − E2, ΔF1 − F2, ΔC1 − G2, ΔJ1 − H2, ΔJ1 − I2, ΔC1 − J2.
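Building the corrector list from the predictor list can be sketched as follows, assuming (as in paragraph [0209]) that properties are compared as quantized integers, and assuming a current-minus-selected sign convention. The function name `build_correctors` and the sample values are hypothetical.

```python
from typing import List, Sequence


def build_correctors(ref: Sequence[Sequence[int]],
                     cur: Sequence[Sequence[int]],
                     bm: Sequence[int]) -> List[List[int]]:
    """PATCH[k] holds, property by property, the difference between current
    object k and its selected reference object ref[bm[k]] (quantized values)."""
    if len(bm) != len(cur):
        raise ValueError("one predictor is expected per current object")
    return [[c - r for c, r in zip(cur[k], ref[bm[k]])]
            for k in range(len(cur))]


ref = [[148, 200], [40, 10]]            # quantized (x, r) of two reference splats
cur = [[160, 198], [41, 10], [148, 201]]
bm = [0, 1, 0]                          # reference splat 0 is selected twice
print(build_correctors(ref, cur, bm))   # [[12, -2], [1, 0], [0, 1]]
```

PATCH[k] is aligned with BM[k], which is what lets the decoder pair them by position.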
[0216] 5.2.3 Coding the difference
[0217] Since the difference between two objects is minimized a priori because it is obtained after selecting the best match between a current splat of the frame to be predicted and the splats of the reference frame, it contains less information than the property of the current object associated with it, and can therefore be coded with fewer bits.
[0218] Several variations for encoding at least one difference are presented below, which can be combined. In particular, such encoding can be implemented as soon as a difference is determined, or after the set of differences useful for predicting the splats of the frame to be predicted has been determined.
[0219] Thus, according to a particular embodiment, in an attempt to achieve both low decoding complexity and minimal distortion, the encoding of a difference can be implemented using at most half the bits of the binary representation of the difference normally required to encode the corresponding property. For example, if two bytes are used to represent a property of the x, y, or z positional component of the splat's center, one signed byte or less is sufficient to encode the difference associated with this property. Similarly, if one byte is used to represent a color component property associated with a splat, half a signed byte or less can be used to encode the difference associated with this property without significantly impacting reconstruction efficiency.
[0220] If the binary representation of the difference uses 16 bits, the difference is therefore encoded using at most 8 bits. If the difference exceeds what this reduced encoding can represent (for example, a difference of 256 when the 8 bits allow a difference between 0 and 255), the encoded difference is set to the maximum value representable on these 8 bits. The error thus introduced can be corrected when encoding a subsequent frame.
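The reduced-width encoding with saturation can be sketched as follows. The text mentions both "one signed byte" and a 0 to 255 range; this sketch assumes a signed two's-complement range and returns the residual error that a subsequent frame would have to correct. The names `encode_difference` and `to_unsigned_byte` are hypothetical.

```python
from typing import Tuple


def encode_difference(diff: int, bits: int = 8) -> Tuple[int, int]:
    """Saturate a difference to a signed `bits`-bit range.

    Returns (coded value, residual error). Assumptions: two's-complement
    signed range; the residual is what a later frame would have to correct.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    coded = max(lo, min(hi, diff))
    return coded, diff - coded


def to_unsigned_byte(coded: int) -> int:
    """Two's-complement byte actually written to the stream (bits = 8)."""
    return coded & 0xFF


print(encode_difference(-5))    # (-5, 0)
print(encode_difference(300))   # (127, 173)
print(to_unsigned_byte(-1))     # 255
```

Saturating rather than wrapping keeps the reconstructed value on the correct side of the true value, so the residual stays a plain correction for the next frame.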
[0221] In order to further reduce the amount of data to be coded, it is also possible to reduce the quantization space of the properties of a current object of the frame to be predicted.
[0222] To do this, it is assumed, according to a particular embodiment, that the reference frame was coded with the compression process of the aforementioned patent application FR 2408093.
[0223] For a splat of the reference frame, at least one optimized quantized property has been determined by modifying N bits of the quantized property, for example, the N lowest bits, also called masking bits. By analogy, this embodiment allows us to modify N bits of a quantized property of a current object.
[0224] Taking the previous example where the x-position value of the selected splat C1 is equal to 151, and considering that the quantization space of the relevant property of the selected splat of the reference frame can be reduced by forcing the two least significant bits to 0, the optimized quantized binary representation of the x-position value of the selected splat C1 is expressed as:
[0225] 0000 0000 1001 0111 (151) → 0000 0000 1001 0100 (148)
[0226] According to this embodiment, the quantization space of the same property of the current splat of the frame to be predicted can be similarly reduced by forcing the two least significant bits to 0. The optimized quantized binary representation of the x-position value of the current splat A2 is then expressed as:
[0227]
[0228] In particular, the binary representation of the difference can be shifted N bits to the right to remove the bits forced to 0.
[0229]
[0230] As indicated above, according to a particular embodiment, only half of the bits needed to encode the corresponding property are encoded. In other words, the following difference is encoded:
[0231]
[0232] Such a difference, along with the associated shift value (here 2 bits), can be encoded, stored and / or transmitted.
[0233] In other words, if bit masking is used to optimize the encoding of the reference frame values according to the compression method described in the aforementioned French patent application FR 2408093, the determined difference is based on the bits actually used. This is done by indicating, for example in the stream header, the shifts to be applied based on the masking performed in the reference frame. For example, if N masking bits are used, the difference is determined on the existing values (i.e., those containing bits forced to 0), and then shifted N bits to the right.
[0234] Such a shift notably increases precision, especially when the value of the difference exceeds 255 and therefore could not be coded exactly on 8 bits.
[0235] Indeed, if the value of the difference is equal to 256:
[0236] 256 = 0000 0001 0000 0000 (16-bit binary representation)
[0237] and if we consider two masking bits, the binary representation of the difference can be shifted two bits to the right to remove the bits forced to 0:
[0238] 0000 0000 0100 0000 = 64
[0239] We then code the following difference:
[0240] 0100 0000 (64, on 8 bits)
[0241] This particular embodiment allows, in particular, the maximization of the useful value contained in the difference. It should be noted that such operations implemented for the encoding of at least one difference require only simple operations (shifts, masking and addition), which are computationally inexpensive.
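The masking, shifting and reconstruction operations described above can be sketched with plain integer bit operations. This is a minimal illustration, not the method of FR 2408093 itself: the function names are hypothetical, and since the x-value of A2 is not given in this excerpt, the current values 404 and 163 below are hypothetical too (404 is chosen so that the masked difference equals the 256 of the worked example).

```python
def mask_low_bits(q: int, n: int) -> int:
    """Force the n least significant (masking) bits of a quantized value to 0."""
    return q & ~((1 << n) - 1)


def shifted_difference(q_cur: int, q_sel: int, n: int) -> int:
    """Difference between the masked values, shifted n bits to the right.

    Both masked values are multiples of 2**n, so the shift is exact
    (Python's >> is an arithmetic shift, so negative differences work too).
    """
    return (mask_low_bits(q_cur, n) - mask_low_bits(q_sel, n)) >> n


def apply_shifted_difference(q_sel: int, d: int, n: int) -> int:
    """Decoder side: undo the shift and add the difference to the selected value."""
    return mask_low_bits(q_sel, n) + (d << n)


print(mask_low_bits(151, 2))                 # 148 (0b10010111 -> 0b10010100)
# A masked difference of 256 needs 9 bits; after a 2-bit shift it is 64.
print(shifted_difference(404, 151, 2))       # 64
print(apply_shifted_difference(151, 64, 2))  # 404
```

Only an AND, a subtraction and shifts are involved, which matches the low-complexity claim of the paragraph above.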
[0242] The same approach can be used to code one or more differences ΔC1 − A2 associated with the different properties of the current splat A2 and the selected splat C1, for example, the position in x, y and z of the center of the splat, the color in r, g and b associated with the splat, the density d associated with the splat, the scale factor in x, y and z associated with the splat, the rotation factor in x, y and z associated with the splat, and the rotation angle associated with the splat.
[0243] Similarly, the differences associated with the other current splat / selected splat pairs can be coded in the same way: ΔB1 − B2, ΔB1 − C2, ...
[0244] 5.2.4 Building an action list
[0245] As illustrated in Figure 1, the encoding of the frame to be predicted, Tpred, can implement an ordering (step 13) of the selected splats of the predictor list BM into an ordered predictor list BM_ord, as well as an ordering of the differences of the corrector list PATCH into an ordered corrector list PATCH_ord. Both lists are ordered in the same way. Alternatively, the method according to the invention directly constructs an ordered predictor list BM_ord and an ordered corrector list PATCH_ord without first constructing a predictor list BM and a corrector list PATCH.
[0246] Indeed, since the order of the splats has little or no influence on the rendering, this reordering of the predictor list only affects the corrector list. Therefore, if the predictor list is sorted, the corrector list must be sorted in the same way to maintain a strict equivalence of the indices in both lists.
[0247] For example, a criterion for ordering the predictor list is the order in which objects appear in the reference frame. The corrector list is ordered in the same way, with each difference associated with a current splat, which is itself associated with a selected object in the reference frame.
[0248] Thus, the selected splats present in the predictor list BM can be ordered according to the order of appearance of the splats in the reference frame A1, B1, C1, ..., J1 (ascending order). Of course, a different order could be considered, such as the reverse order (descending order).
[0249] Alternatively, the predictor list and the corrector list are constructed so that a predictor (the object of the first frame) and its corresponding correction (difference) are located in the same position in both the predictor list and the corrector list. For example, the corrector list can be built as the predictor list is constructed, thus maintaining equivalence between the two lists.
[0250] In the example illustrated in Figure 4, such an ordering is applied. The resulting ordered predictor list BM_ord is: B1, B1, C1, C1, C1, F1, G1, G1, J1, J1. The associated ordered corrector list PATCH_ord is: ΔB1 − B2, ΔB1 − C2, ΔC1 − A2, ΔC1 − G2, ΔC1 − J2, ΔF1 − F2, ΔG1 − D2, ΔG1 − E2, ΔJ1 − H2, ΔJ1 − I2.
[0251] The encoding of the frame to be predicted (Tpred) then implements a construction (14) of an action list AL associating at least one action with at least one splat of the reference frame, taking into account the (possibly ordered) predictor list. Such an action list allows, in particular, the frame to be predicted to be reconstructed from the reference frame during decoding.
[0252] For example, the actions belong to the group comprising:
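The joint ordering of the predictor and corrector lists can be sketched as a single permutation applied to both lists. Assumptions for this sketch: predictors are stored as reference-frame indices (A1..J1 numbered 0..9), and ties are kept in traversal order by Python's stable `sorted`, which is consistent with, but not stated by, Figure 4. The function name `order_lists` is hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def order_lists(bm: Sequence[int], patch: Sequence[T]) -> Tuple[List[int], List[T]]:
    """Sort BM by order of appearance in the reference frame (reference index)
    and apply the same permutation to PATCH so that indices stay aligned.
    sorted() is stable, so equal predictors keep their traversal order."""
    perm = sorted(range(len(bm)), key=lambda k: bm[k])
    return [bm[k] for k in perm], [patch[k] for k in perm]


# Figure 3 example, with A1..J1 numbered 0..9 in the reference frame.
bm = [2, 1, 1, 6, 6, 5, 2, 9, 9, 2]
patch = ["C1-A2", "B1-B2", "B1-C2", "G1-D2", "G1-E2",
         "F1-F2", "C1-G2", "J1-H2", "J1-I2", "C1-J2"]
bm_ord, patch_ord = order_lists(bm, patch)
print(bm_ord)     # [1, 1, 2, 2, 2, 5, 6, 6, 9, 9]
print(patch_ord)
# ['B1-B2', 'B1-C2', 'C1-A2', 'C1-G2', 'C1-J2', 'F1-F2', 'G1-D2', 'G1-E2', 'J1-H2', 'J1-I2']
```

Sorting a permutation rather than each list separately is what guarantees the strict index equivalence between BM_ord and PATCH_ord.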
[0253] - an "ignore" type action (IGN) if the splat of the reference frame does not belong to the predictor list (possibly ordered),
[0254] - a "copy" (COPY) type action for the first occurrence of the splat in the predictor list (possibly ordered),
[0255] - a "duplication" type action (DUP) for each subsequent occurrence of the splat in the predictor list (possibly ordered).
[0256] The action list is constructed following the same order as that used for the construction or ordering of the BM predictor list, for example the order of appearance of the splats in the reference frame A1, B1, C1, ..., J1 (ascending order).
[0257] For example, the splats A1, then B1, then C1, ... of the reference frame are considered in turn while the predictor list is traversed:
- the first splat in the predictor list is B1. Since the predictors are ordered according to the order of appearance of the splats in the reference frame, this means that A1 is not in the predictor list, so an IGN action is inserted into the action list. A COPY action is then inserted (first occurrence of B1),
- the second splat in the predictor list is B1, so a DUP action is inserted (second occurrence of B1),
- the third splat is C1, so a COPY action is inserted (first occurrence of C1),
- the fourth splat is C1, so a DUP action is inserted (second occurrence of C1),
- the fifth splat is C1, so a DUP action is inserted (third occurrence of C1),
- the sixth splat is F1, which means that splats D1 and E1 are not in the predictor list: two IGN actions are inserted, followed by a COPY action (first occurrence of F1),
- the seventh splat is G1, so a COPY action is inserted (first occurrence of G1),
- the eighth splat is G1, so a DUP action is inserted (second occurrence of G1),
- the ninth splat is J1, which means that splats H1 and I1 are not in the predictor list: two IGN actions are inserted, followed by a COPY action (first occurrence of J1),
- the tenth splat is J1, so a DUP action is inserted (second occurrence of J1).
The traversal continues in the same way until the end of the predictor list.
[0258] Figure 5 illustrates the resulting list of actions. Note that the presence of the indices A, B, C ... J is optional and included to facilitate understanding of the invention.
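The construction of the action list can be sketched as a single walk over the reference frame while a cursor advances through the ordered predictor list. This is a minimal sketch, assuming splats are identified by labels and that the predictor list is sorted in the reference-frame order; the function name `build_actions` is hypothetical. As an assumption, it also emits an IGN action for every reference splat left after the last predictor; the description does not say whether such trailing actions are written.

```python
IGN, COPY, DUP = "IGN", "COPY", "DUP"


def build_actions(ref_frame, bm_ord):
    """Build the action list by walking the reference frame in order while
    traversing a predictor list sorted in that same order."""
    actions = []
    pos = 0  # current position in the ordered predictor list
    for splat in ref_frame:
        if pos < len(bm_ord) and bm_ord[pos] == splat:
            actions.append(COPY)  # first occurrence of this reference splat
            pos += 1
            while pos < len(bm_ord) and bm_ord[pos] == splat:
                actions.append(DUP)  # each subsequent occurrence
                pos += 1
        else:
            actions.append(IGN)  # reference splat not used as a predictor
    if pos != len(bm_ord):
        raise ValueError("predictor list is not sorted like the reference frame")
    return actions


ref_frame = ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1", "J1"]
bm_ord = ["B1", "B1", "C1", "C1", "C1", "F1", "G1", "G1", "J1", "J1"]
print(build_actions(ref_frame, bm_ord))
```

For the example, this yields 15 actions: IGN, COPY, DUP, COPY, DUP, DUP, IGN, IGN, COPY, COPY, DUP, IGN, IGN, COPY, DUP.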
[0259] In the absence of ordering during encoding (optional step 13), the list of actions obtained according to this example could be: COPY C1, COPY B1, DUP B1, COPY G1, DUP G1, COPY F1, COPY C1, COPY J1, DUP J1, COPY C1.
[0260] If the BM predictor list is constructed or ordered in a different order, for example the reverse order of appearance of the splats in the reference frame, or any other order, the action list can be constructed in the same order. In descending order, the splats J1, then I1, then H1, ... of the reference frame are considered to construct the action list.
[0261] 5.2.5 Coding an action
[0262] An embodiment for encoding at least one action is presented below. In particular, such encoding can be implemented as soon as an action is determined, or after the set of actions useful for predicting the splats of the frame to be predicted has been determined.
[0263] Because there are only three types of action (IGN, COPY and DUP), an action can be coded on only two bits.
[0264] Thus, although there are approximately as many actions as splats in the predictor list, each action needs only two bits, so the cost of the action list is almost negligible.
5.2.6 Data stream
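Two-bit coding of the actions can be sketched by packing four action codes into each byte. This is a minimal sketch under assumptions: the specific code values (IGN = 00, COPY = 01, DUP = 10), the least-significant-bits-first packing, and the helper names are hypothetical, since the description only states that two bits suffice.

```python
# Hypothetical 2-bit codes: the description only states that two bits suffice.
CODES = {"IGN": 0b00, "COPY": 0b01, "DUP": 0b10}
NAMES = {code: name for name, code in CODES.items()}


def pack_actions(actions):
    """Pack four 2-bit action codes per byte, least significant bits first."""
    out = bytearray((len(actions) + 3) // 4)  # NumCodes / 4 bytes, rounded up
    for i, action in enumerate(actions):
        out[i // 4] |= CODES[action] << (2 * (i % 4))
    return bytes(out)


def unpack_actions(data, num_codes):
    """Inverse of pack_actions; num_codes tells how many codes are real,
    since the padding bits of the last byte would otherwise read as IGN."""
    return [NAMES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(num_codes)]


actions = ["IGN", "COPY", "DUP", "COPY", "DUP", "DUP", "IGN", "IGN",
           "COPY", "COPY", "DUP", "IGN", "IGN", "COPY", "DUP"]
packed = pack_actions(actions)
print(len(packed))  # 4
```

The need to ignore padding bits in the last byte is one reason why the number of actions has to be known by the decoder.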
[0266] According to one embodiment, the header of the data stream obtained at the end of the coding process contains the following information:
[0267] Information allowing identification of the reference frame, for example an index of the reference frame (in the list of frames in the sequence), for example 8 bytes long,
[0268] a number NumCodes of actions in the action list, for example on 4 bytes,
[0269] a number NumSplats of splats in the frame to be predicted, for example on 4 bytes, and possibly a list of offset values associated with each of the properties, for example with the 14 properties: the x, y and z position of the center of the splat, the r, g and b color associated with the splat, the density d associated with the splat, the x, y and z scale factors associated with the splat, the x, y and z rotation factors associated with the splat, and the rotation angle associated with the splat.
[0270] In addition, as previously stated, at least one action from the AL action list and at least one difference from the corrector list can be coded.
[0271] Thus, the header can be followed by:
[0272] the AL action list, for example on NumCodes / 4 bytes, rounded up (an action can be coded on two bits, i.e. a quarter of a byte per action),
[0273] the list of correctors, possibly ordered, for example on NumSplats * 10 bytes. If the 14 properties mentioned above are considered, an element of the corrector list can have 14 differences, one per property.
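The stream layout above can be sketched with Python's `struct` module to show the byte budget of a frame to be predicted before entropy coding. This is a sketch under assumptions: the little-endian byte order, the unsigned field types, the 10-byte corrector element and the helper names are illustrative choices, and the optional per-property offset values are left out.

```python
import struct

# Assumed byte layout (little-endian): 8-byte reference frame index, then the
# 4-byte counts NumCodes and NumSplats. The optional per-property offset
# values are left out of this sketch.
HEADER = struct.Struct("<QII")
BYTES_PER_CORRECTOR = 10  # example value from the description


def write_header(ref_index, num_codes, num_splats):
    return HEADER.pack(ref_index, num_codes, num_splats)


def payload_size(num_codes, num_splats):
    """Size in bytes before entropy coding: header + actions + correctors."""
    actions = (num_codes + 3) // 4  # 2 bits per action
    correctors = num_splats * BYTES_PER_CORRECTOR
    return HEADER.size + actions + correctors


header = write_header(ref_index=0, num_codes=15, num_splats=10)
print(HEADER.size, HEADER.unpack(header))  # 16 (0, 15, 10)
print(payload_size(15, 10))                # 16 + 4 + 100 = 120
```

With the ten-splat example, the corrector list dominates the size, which is consistent with the action list being almost negligible.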
[0274] In particular, the actions in the action list and the differences in the corrector list can be encoded or compressed by an entropy coder before transmission. Thus, once the reduced data is obtained and stored in a buffer, for example, global entropy compression can be applied to it.
[0275] Entropy coding can be used to compress the action list and the corrector list while still allowing real-time decompression. Specifically, a coding technique is chosen that allows fast decompression, enabling, for example, playback on a terminal with limited RAM, CPU and / or GPU resources.
[0276] For example, in one embodiment, the differences in the list of correctors are encoded splat by splat: the 14 differences associated with a first splat are encoded (one per property), then the 14 differences associated with a second splat, and so on. Since the size of the data to be encoded for a frame to be predicted can be approximately half the size of the data to be encoded for a reference frame, the decompression time can also be reduced by a factor of two, the time saved thus at least partially offsetting the additional cost introduced by processing the actions and differences. For example, LZFSE encoding is used.
[0277] According to a second embodiment, the differences in the list of correctors are coded property by property: the differences associated with a property of type x position of the center of the splat are coded for the J splats, then the differences associated with a property of type y position of the center of the splat for the J splats, then the differences associated with a property of type z position of the center of the splat for the J splats, then the differences associated with a property of type color r for the J splats, etc.
[0278] This allows us to take advantage of the fact that the set of values for a property is relatively homogeneous, and therefore benefit from greater compression because there is less dispersion to manage. We can then use Huffman coding, rather than LZFSE coding.
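The two serialization orders (splat by splat, then property by property) can be contrasted with the following sketch. It is illustrative only: `zlib` (DEFLATE) stands in for LZFSE and Huffman coding, which are not in the Python standard library; each difference is stored on one byte for simplicity, which is an assumption (the 10-byte figure given for the corrector list implies a tighter packing); and the differences are synthetic. The printed sizes depend on the data, so no outcome is claimed here.

```python
import random
import zlib

NUM_PROPS = 14  # x, y, z, r, g, b, d, sx, sy, sz, ox, oy, oz, angle


def splat_major(diffs):
    """Serialize differences splat by splat: all 14 of splat 0, then splat 1, ..."""
    return bytes(d & 0xFF for splat in diffs for d in splat)


def property_major(diffs):
    """Serialize differences property by property: property 0 of every splat,
    then property 1 of every splat, ..."""
    return bytes(diffs[s][p] & 0xFF
                 for p in range(NUM_PROPS) for s in range(len(diffs)))


# Synthetic differences (assumption): each property has its own typical
# value, and many differences are zero, as for stabilized scenes.
rng = random.Random(0)
diffs = [[rng.choice([0, 0, 0, p % 3 - 1]) for p in range(NUM_PROPS)]
         for _ in range(1000)]

for name, layout in (("splat by splat", splat_major),
                     ("property by property", property_major)):
    raw = layout(diffs)
    packed = zlib.compress(raw, 9)
    assert zlib.decompress(packed) == raw  # lossless round trip
    print(f"{name}: {len(raw)} -> {len(packed)} bytes")
```

Both layouts contain exactly the same bytes; only their order changes, which is what lets the property-by-property order group homogeneous values together.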
[0279] Furthermore, the trend is towards the creation of stabilized scenes in which stable areas are represented by the same group of splats. This improves the proposed compression, because the differences then contain many zero values, which can be compressed particularly efficiently.
[0281] 5.2.7 Reconstruction
[0282] As illustrated in Figure 2, different steps are implemented for the reconstruction of a frame to be predicted from a reference frame.
[0283] The reconstruction process is quite simple. First, at least one action from the action list is decoded in order to reconstruct the predictor list. If the action list was constructed during encoding following the order of appearance of the splats in the reference frame A1, B1, C1, ..., J1 (ascending order), this same order is used to reconstruct the predictor list.
[0284] As illustrated in Figure 6, the current splat is initially the first splat A1 of the reference frame, and the decoded actions are processed as follows:
- the first decoded action is of type IGN: splat A1 of the reference frame does not belong to the predictor list. It is ignored, and the process moves to the next action and to the next splat B1 of the reference frame,
- the second action is of type COPY: a first occurrence of splat B1 is added to the predictor list being reconstructed, that is, the current splat is copied,
- the third action is of type DUP: a second occurrence of splat B1 is added, that is, the last splat added is duplicated,
- the fourth action is of type COPY: a first occurrence of splat C1 is added,
- the fifth and sixth actions are of type DUP: a second and a third occurrence of splat C1 are added,
- the seventh action is of type IGN: splat D1 does not belong to the predictor list and is ignored; the process moves to the next splat E1,
- the eighth action is of type IGN: splat E1 is ignored; the process moves to the next splat F1,
- the ninth action is of type COPY: a first occurrence of splat F1 is added,
- the tenth action is of type COPY: a first occurrence of splat G1 is added,
- the eleventh action is of type DUP: a second occurrence of splat G1 is added,
- the twelfth action is of type IGN: splat H1 is ignored; the process moves to the next splat I1,
- the thirteenth action is of type IGN: splat I1 is ignored; the process moves to the next splat J1,
- the fourteenth action is of type COPY: a first occurrence of splat J1 is added,
- the fifteenth action is of type DUP: a second occurrence of splat J1 is added.
[0285] At the end of the decoding of the action list, the predictor list is therefore obtained, for example ordered in increasing order: B1, B1, C1, C1, C1, F1, G1, G1, J1, J1.
[0286] In other words, the reference frame can be duplicated into a new list corresponding to the predictor list, this new list being altered by deleting, copying and duplicating certain splats. The process iterates through the splats of the reference frame, with a current splat of the reference frame that initially points to the first splat of this new list. Applying the action list thus progressively populates the new list.
[0287] In the absence of ordering at the encoding stage (optional step 13), the resulting list of predictors would be, for example, C1, B1, B1, G1, G1, F1, C1, J1, J1, C1.
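The decoding of the action list into the ordered predictor list can be sketched as follows. This is a minimal sketch, assuming splats are identified by labels (real splats would carry their property values) and assuming, as in the walkthrough of Figure 6, that the current reference splat advances after each IGN and each COPY action but not after a DUP action; the function name `decode_actions` is hypothetical.

```python
def decode_actions(ref_frame, actions):
    """Rebuild the ordered predictor list from the action list."""
    predictors = []
    current = 0  # index of the current splat in the reference frame
    for action in actions:
        if action == "IGN":
            current += 1                           # splat not used: skip it
        elif action == "COPY":
            predictors.append(ref_frame[current])  # first occurrence
            current += 1
        elif action == "DUP":
            predictors.append(predictors[-1])      # duplicate last added splat
        else:
            raise ValueError(f"unknown action {action!r}")
    return predictors


ref_frame = ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1", "J1"]
actions = ["IGN", "COPY", "DUP", "COPY", "DUP", "DUP", "IGN", "IGN",
           "COPY", "COPY", "DUP", "IGN", "IGN", "COPY", "DUP"]
print(decode_actions(ref_frame, actions))
# ['B1', 'B1', 'C1', 'C1', 'C1', 'F1', 'G1', 'G1', 'J1', 'J1']
```

This sketch covers the ordered case only; it is the inverse of building the action list from a predictor list sorted in the reference-frame order.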
[0288] Next, the decoded differences from the corrector list can be combined with the splats from the predictor list, with both lists being constructed or ordered according to the same criterion.
[0289] Next, the first decoded difference ΔB1-B2 from the corrector list can be added to the first splat B1 of the predictor list to reconstruct splat B2, and so on.
[0290] In particular, as already indicated, a difference can be determined property by property. For example, the first decoded difference ΔB1-B2 is considered to contain the elements (e.g., 14 differences) needed to modify each property of the reference splat B1 in the predictor list.
[0291] The same procedure is used to reconstruct all the splats of the frame to be predicted.
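Combining the predictor list with the corrector list can be sketched as a position-by-position, property-by-property addition. This is a minimal sketch under assumptions: each splat is a tuple of 14 quantized integers, each corrector element is a tuple of 14 integer differences, and the property values are made up. The optional `shifts` parameter is hypothetical; it illustrates the left shift of a decoded difference by a shift value before application, as recited in claim 12.

```python
NUM_PROPS = 14


def apply_corrections(bm_ord, patch_ord, shifts=None):
    """Add each decoded difference to the predictor at the same position,
    property by property. `shifts` (hypothetical parameter) holds one left
    shift per property, applied to the difference before it is added;
    all zero by default."""
    if len(bm_ord) != len(patch_ord):
        raise ValueError("predictor and corrector lists must have the same length")
    shifts = shifts if shifts is not None else [0] * NUM_PROPS
    frame = []
    for predictor, diff in zip(bm_ord, patch_ord):
        frame.append(tuple(p + (d << s) for p, d, s in zip(predictor, diff, shifts)))
    return frame


b1 = tuple(range(100, 100 + NUM_PROPS))  # quantized properties of B1 (made up)
delta_b1_b2 = (1, -2) + (0,) * (NUM_PROPS - 2)
delta_b1_c2 = (0,) * (NUM_PROPS - 1) + (3,)
predicted = apply_corrections([b1, b1], [delta_b1_b2, delta_b1_c2])
print(predicted[0][:3], predicted[1][-1])  # (101, 99, 102) 116
```

Because both lists follow the same order, no index needs to be transmitted: position k in the corrector list always refers to position k in the predictor list.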
[0292] 5.3 Advantages
[0293] The proposed solution, in at least one embodiment, offers certain advantages, for example among those listed below:
[0294] It helps to reduce the size required to encode a frame to be predicted, thanks to the use of temporal redundancies between the objects of a reference frame representing a scene at a first instant and the objects of the frame to be predicted representing a scene at a second instant,
- it makes it possible to obtain a compression factor or ratio greater than or equal to 20,
[0295] - it facilitates the distribution of the stream by significantly reducing the data rate,
[0296] - it allows real-time streaming, particularly on a mobile device,
[0297] - decompression can be implemented in real time,
[0298] - decompression can produce a structure directly usable by GPUs.
[0299] The proposed decoding process thus makes it possible, according to at least one embodiment, to read data from a storage volume or a network efficiently enough for it to be decompressed and displayed in real time, for example on a mobile device. It is therefore possible to create volumetric videos and transmit them via streaming, for example at bitrates of around 100 Mbit/s.
[0300] 5.4 Variants
[0301] Figure 7 illustrates a temporal sequence of multidimensional scenes, each represented by a plurality of objects forming a frame (Frame A, Frame B, ... Frame F).
[0302] These frames can be intra-frames, which can be directly encoded / decoded (Intra A, Intra B, ... Intra F), or some frames can be intra-frames (Intra A, Intra C) and others can be frames to be predicted from another frame (Pred B, Pred D, Pred E, Pred F). As mentioned above, the reference frame can be an intra-frame. Note that the number of frames to be predicted from such a reference frame is not predefined. Thus, according to the example shown in Figure 7, frames Pred D and Pred E are encoded / decoded from the reference frame Intra C. Therefore, there can be an arbitrary number of frames to be predicted after an intra-frame. The reference frame can also be a previously encoded / decoded frame. Thus, according to the example illustrated in figure 7, the Pred F frame is encoded / decoded from the previously predicted Pred E reference frame.
[0303] Furthermore, the prediction does not necessarily apply only to the previous frame but can apply to any preceding frame. Thus, according to the example illustrated in Figure 7, the frame Pred E can be encoded / decoded from the reference frame Intra C, even if these two frames do not represent consecutive scenes.
[0304] According to another unclaimed embodiment, the encoding of a frame to be predicted implements step 10 of obtaining a list of predictors BM and a list of correctors PATCH, for example in the form of the steps of selecting 11 an object from the reference frame and determining 12 at least one difference, but does not implement the steps of ordering 13, constructing a list of actions 14, and a fortiori of encoding at least one action from the list.
[0305] According to this embodiment, information enabling the identification of a selected object, for example its index in the reference frame, is associated with a specific difference between at least one descriptive piece of information about the selected object and at least one descriptive piece of information about the current object. Thus, a pair (index, difference) can be associated with a current object in the frame to be predicted. Such a pair allows:
[0306] to identify an object in the reference frame from the index (selected object), and to indicate the correction to be made to the selected object to predict the current object.
[0307] In particular, as illustrated in figure 8, a corrector list IDX_PATCH can be constructed, carrying the pairs (index, difference) associated with the different objects of the frame to be predicted: the pair (idxC1, ΔC1-A2) is associated with object A2, the pair (idxB1, ΔB1-B2) is associated with object B2, etc. For a pair (index, difference), it suffices to take the object of the reference frame specified by the index and apply the difference to it to obtain the object of the frame to be predicted.
[0308] For example, the pair at index 5 (idxG1, ΔG1-E2) allows the object at index 5 (E2) of the frame to be predicted to be reconstructed, by taking the object of the reference frame indicated by the index idxG1 (G1) and applying the corresponding difference (ΔG1-E2).
[0309] Figure 9 details these different steps for a splat-type object:
[0310] 1) reconstruction of a list of predictors (best matches) using the indices, 2) application, to the different properties of a splat of the list of predictors, of at least one difference corresponding to the index considered.
[0311] Note that, for optimization purposes, this process can advantageously be performed splat by splat by applying at least one difference to the splat of the reference frame defined by the index, without going through an intermediate list, according to the following algorithm:
[0312] given a reference frame and a corrector list:
[0313] create an empty reconstructed predicted frame,
[0314] as long as there is still information in the corrector list:
[0315] read the next pair (index, difference) from the corrector list,
[0316] copy the splat of the reference frame corresponding to the index (selected splat) and apply at least one difference to it,
[0317] add it to the reconstructed predicted frame.
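The splat-by-splat algorithm above can be written as a short Python loop. This is a minimal sketch, assuming splats are tuples of quantized integers (three properties instead of 14, for brevity), differences are tuples of the same length, and indices are 0-based Python list indices; the function name `reconstruct_from_pairs` is hypothetical.

```python
def reconstruct_from_pairs(ref_frame, idx_patch):
    """Reconstruct the frame to be predicted from (index, difference) pairs,
    splat by splat, without an intermediate predictor list."""
    predicted = []                         # empty reconstructed predicted frame
    for index, diff in idx_patch:          # next (index, difference) pair
        selected = ref_frame[index]        # selected splat of the reference frame
        predicted.append(tuple(p + d for p, d in zip(selected, diff)))
    return predicted


# Toy splats with 3 properties instead of 14 (assumption for brevity).
ref_frame = [(10, 10, 10), (20, 20, 20), (30, 30, 30)]
idx_patch = [(2, (1, 0, -1)), (1, (0, 0, 0)), (1, (2, 2, 2))]
print(reconstruct_from_pairs(ref_frame, idx_patch))
# [(31, 30, 29), (20, 20, 20), (22, 22, 22)]
```

The same reference splat (index 1 here) can serve several objects of the frame to be predicted, which is the case handled by DUP actions in the action-based variant.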
[0318] This method, based on pairs (index, difference), has the advantage of halving the amount of data to be encoded compared with encoding the properties of the splats as raw data, as for a reference frame: it goes from 20 bytes to be encoded per splat (after quantization of the different properties of the splat) to 10 bytes per pair.
[0319] However, it requires an index to be introduced for each object, and this index must be encoded. Given the number of objects in a frame, which can be on the order of several hundred thousand, encoding such an index may require an additional 4 bytes per object, giving a final compression ratio of approximately 1.4 instead of 2.
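The ratios quoted for the pair-based method follow from simple arithmetic, assuming 20 bytes per quantized splat, 10 bytes per difference and a 4-byte index per object:

```latex
r_{\text{difference only}} = \frac{20}{10} = 2,
\qquad
r_{(\text{index},\,\text{difference})} = \frac{20}{10 + 4} = \frac{20}{14} \approx 1.43
```

The action-based variant avoids the per-object index and replaces it with a 2-bit action per predictor, which is why it keeps a ratio close to 2.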
[0320] 5.5 Devices
[0321] Finally, in relation to figures 10 and 11, simplified structures of an encoding device and a decoding device are presented according to at least one embodiment of the invention.
[0322] As illustrated in Figure 10, a coding device comprises at least one memory 101, at least one processing unit 102, equipped for example with a programmable computing machine or a dedicated computing machine, for example a processor P, and controlled by the computer program 103, implementing the steps of the coding process according to at least one embodiment of the invention. At initialization, the code instructions of the program 103 are for example loaded into a RAM memory before being executed by the processor of the processing unit 102.
[0323] The processor of the processing unit 102 implements steps of the coding process described above on at least one scene of a time sequence of multidimensional scenes, according to the instructions of the computer program 103, to encode a frame to be predicted.
[0324] As illustrated in Figure 11, a decoding device includes at least one memory 111, at least one processing unit 112, equipped for example with a programmable computing machine or a dedicated computing machine, for example a processor P, and controlled by the computer program 113, implementing the steps of the decoding process according to at least one embodiment of the invention.
[0325] At initialization, the code instructions of program 113 are, for example, loaded into RAM memory before being executed by the processor of processing unit 112.
[0326] The processor of the processing unit 112 implements steps of the decoding process described above, from information encoded in the data stream representative of at least one scene of a time sequence of multidimensional scenes, according to the instructions of the computer program 113, to reconstruct the frame to be predicted.
Claims
1. A method for encoding at least one frame representing a scene in a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame, and at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame, according to which said method implements an encoding of said second frame comprising: - obtaining (10) a list of predictors and a list of correctors, said list of predictors comprising, for a current object of said second frame, a corresponding object of said first frame, called the selected object, and said list of correctors comprising, for said current object of the second frame, a difference between at least one descriptive information of said selected object and at least one descriptive information of said current object, - a construction (14) of a list of actions associating at least one action with at least one object of said first frame, taking into account said list of predictors, - a coding (15) of at least one action from said list of actions and of at least one difference from said list of correctors.
2. A method according to claim 1, wherein, for at least one object of said first frame, said action belongs to the group comprising: - an "ignore" action if the object in question does not belong to the list of predictors, - a "copy" type action for the first occurrence of said object in said predictor list, - a "duplication" type action for each subsequent occurrence of said object in said predictor list.
3. A method according to any one of the preceding claims, comprising an ordering (13) of said list of predictors and said list of correctors according to an ordering criterion prior to the construction of said list of actions.
4. A method according to claim 3, wherein said ordering criterion is an order of appearance of objects in said first frame.
5. A method according to any one of the preceding claims, wherein said selected object of said first frame is selected according to a proximity criterion in terms of Euclidean distance between said at least one descriptive information of said current object and said at least one descriptive information of said selected object.
6. A method according to any one of the preceding claims, wherein said at least one descriptive piece of information about one of said objects is expressed in the following form: with: - (x, y, z) a position of the center of said object, - (r, g, b) a color associated with said object, - d a density associated with said object, - (sx, sy, sz) a scaling factor associated with said object, - (ox, oy, oz) a rotation factor associated with said object, - oa an angle of rotation associated with said object, - a1, a2, a3, a4 are coefficients depending on the type of descriptive information.
7. A method according to any one of the preceding claims, wherein said at least one descriptive piece of information of the selected object and said at least one descriptive piece of information of the current object are quantized over an integer number of bytes.
8. A method according to claim 7, wherein, said at least one quantized descriptive information of the selected object being optimized by modifying the value of N bits in said at least one quantized descriptive information of the selected object, with N an integer greater than or equal to 1, said method comprises optimizing said at least one quantized descriptive information of the current object by modifying N bits in said at least one quantized descriptive information of the current object.
9. A method according to any one of claims 7 and 8, wherein said difference is determined by comparing said descriptive information quantized according to claim 7 or said optimized quantized descriptive information according to claim 8.
10. A method according to any one of claims 7 to 9, wherein said difference associated with at least one descriptive piece of information is represented on said integer number of bytes, and the encoding of said at least one difference is implemented on at most half of said integer number of bytes.
11. A method for decoding a stream of encoded data representing at least one scene of a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame and, at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame, wherein said method implements a decoding (22) of said second frame comprising: - obtaining (222) a list of predictors, from the decoding (221) of at least one action associated with at least one object of said first frame, - obtaining a list of correctors associated with said list of predictors, from the decoding (223) of at least one difference between at least one descriptive piece of information of an object in said list of predictors and at least one descriptive piece of information of a corresponding object to be predicted in said second frame, - a reconstruction (226) of said object to be predicted, by applying (225) said decoded difference to said at least one descriptive information of said object from said list of predictors.
12. A method according to claim 11, further comprising decoding a shift value associated with at least one optimized quantized descriptive information of said selected object, wherein said at least one decoded difference is shifted left by a number of bits corresponding to said shift value prior to said application.
13. A device for encoding at least one frame representing a scene in a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame, and at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame, wherein said device comprises at least one encoding processing unit for said second frame, configured to: - obtain a list of predictors and a list of correctors, said list of predictors comprising, for a current object of said second frame, a corresponding object of said first frame, called the selected object, and said list of correctors comprising, for said current object of the second frame, a difference between at least one descriptive information of said selected object and at least one descriptive information of said current object, - construct a list of actions associating at least one action with at least one object of said first frame, taking into account said list of predictors, - code at least one action from said list of actions and at least one difference from said list of correctors.
14. A device for decoding a stream of encoded data representing at least one scene from a sequence of scenes comprising, at a first instant, a scene represented by a plurality of multidimensional elementary objects forming a first frame and, at a second instant, a scene represented by a plurality of multidimensional elementary objects forming a second frame, wherein said device comprises at least one decoding processing unit (22) for said second frame, configured to: - obtain a list of predictors, based on the decoding of at least one action associated with at least one object of said first frame, - obtain a list of correctors associated with said list of predictors, from the decoding of at least one difference between at least one descriptive piece of information of an object in said list of predictors and at least one descriptive piece of information of a corresponding object to be predicted in said second frame, - reconstruct said object to be predicted, by applying said decoded difference to said at least one descriptive information of said object from said list of predictors.
15. Computer program comprising instructions for carrying out a method according to any one of claims 1 to 12 when this program is executed by a processor.