Generator-nested codec
The generator-nested codec architecture splits a generator into two portions: a first portion (for example, a first neural network) generates an output that an encoder converts into a bitstream, and a decoder decodes the bitstream so that a second portion (for example, a second neural network) can generate the final output. The architecture targets neural network-based image and video coding.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NOKIA TECHNOLOGIES OY
- Filing Date
- 2025-12-09
- Publication Date
- 2026-06-25
AI Technical Summary
Existing image and video coding technologies face challenges in efficiently utilizing neural networks for encoding and decoding processes, particularly in generating and decoding bitstreams effectively.
A generator-nested codec architecture is employed, utilizing a first neural network for generating an output that is encoded and signaled as a bitstream, which is then decoded by a second neural network to produce a final output, with components like latent generators and synthesis operations enhancing the encoding and decoding processes.
This approach improves the efficiency and effectiveness of encoding and decoding processes by leveraging neural networks, allowing for more sophisticated image and video processing capabilities.
Description
GENERATOR-NESTED CODEC
TECHNICAL FIELD
[0001] The example and non-limiting embodiments relate generally to data encoding and decoding and, more particularly, to a generator-nested codec.
BACKGROUND
[0002] It is known, in image and video coding, to use neural networks to perform encoding and decoding functions as part of a codec.
SUMMARY
[0003] The following summary is merely intended to be illustrative. The summary is not intended to limit the scope of the claims.
[0004] Example 1: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving input for a first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
[0005] Example 2: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
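The encoder-side flow of Example 1 and the decoder-side flow of Example 2 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the seeded pseudo-random "first portion", the sample-repeating "second portion", and the use of zlib as the lossless encoder and decoder are all assumptions made for the sketch.

```python
import random
import struct
import zlib


def first_portion(seed, n=16):
    """Hypothetical first portion of the generator: maps a noise seed
    to a small integer latent (values in -8..7)."""
    rng = random.Random(seed)
    return [rng.randint(-8, 7) for _ in range(n)]


def encode(latent):
    """Stand-in encoder: pack the latent as signed bytes, then compress."""
    return zlib.compress(struct.pack(f"{len(latent)}b", *latent))


def decode(bitstream):
    """Stand-in decoder: inverse of encode()."""
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"{len(raw)}b", raw))


def second_portion(latent):
    """Hypothetical second portion of the generator: shifts each latent
    value into a pixel-like range and repeats it twice (2x upsampling)."""
    output = []
    for value in latent:
        pixel = (value + 8) * 16
        output.extend([pixel, pixel])
    return output


# Encoder side (Example 1): generate, encode, signal the bitstream.
latent = first_portion(seed=0)
bitstream = encode(latent)

# Decoder side (Example 2): decode, then run the second portion.
decoded = decode(bitstream)
output = second_portion(decoded)
```

Only the bitstream crosses from the encoder side to the decoder side; the second portion never sees the original input, only the decoded output of the first portion.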
[0006] Example 3: The apparatus of any of the examples 1 or 2, wherein the first portion of the generator comprises a first neural network and the second portion comprises a second neural network.
[0007] Example 4: The apparatus of example 3, wherein the first neural network comprises a latent generator and the second neural network comprises a synthesis operation.
[0008] Example 5: The apparatus of any of the examples 1 to 4, wherein the first portion of the generator comprises a first component and a second component.
[0009] Example 6: The apparatus of example 5, wherein the first component maps or transforms an input to a latent space and the second component processes the mapped or transformed input in latent space; or the first component comprises a neural network that maps or transforms an input image to features and the second component comprises a first diffusion model that performs an iterative denoising process on the features.
[0010] Example 7: The apparatus of example 6, wherein an input to the first component comprises one of the following: a noise signal; or the noise signal and an image.
[0011] Example 8: The apparatus of any of examples 6 or 7, wherein an input to the second component comprises a noise signal; or features and the noise signal.
[0012] Example 9: The apparatus of any of the examples 1 to 8, wherein the second portion of the generator comprises a neural network that maps the generated latent data to an output in image domain; or the second portion of the generator comprises a third component and a fourth component, wherein the third component comprises a denoising process and the fourth component comprises a neural network that converts the output of the third component into a final output in image domain, and wherein the denoising process denoises an input to the denoising process.
[0013] Example 10: The apparatus of example 9, wherein the denoising process comprises a neural network, or a second diffusion model for performing an iterative process to denoise the input generated latent data.
[0014] Example 11: The apparatus of any of the examples 6 to 10, wherein the first diffusion model in the first portion of the generator is different or substantially different from the second diffusion model in the second portion of the generator.
[0015] Example 12: The apparatus of any of the examples 1 to 11, wherein the input to the first portion of the generator comprises one or more of the following: one or more noise samples, one or more prompts, one or more images; one or more videos; one or more indications of respective one or more locations or positions, one or more indications of respective one or more time instants, one or more indications of respective one or more spatio-temporal positions, one or more indications of respective one or more data items or data points to be generated.
[0016] Example 13: The apparatus of any of the examples 1 to 12, wherein the input to the first portion of the generator or to the second portion of the generator comprises one or more controlling signals.
[0017] Example 14: The apparatus of example 13, wherein the one or more controlling signals comprise an additional conditioning prompt for the generation process, or data derived from an additional conditioning prompt.
[0018] Example 15: The apparatus of any of the examples 13 or 14, wherein the one or more controlling signals received by the first portion of the generator is different or at least partially different from the one or more controlling signals received by the second portion of the generator.
[0019] Example 16: The apparatus of any of the examples 1 to 15, wherein a processing unit or processing circuit for the first portion of the generator is different or at least partially different from a processing unit or a processing circuit for the second portion of the generator.
[0020] Example 17: The apparatus of any of the examples 13 to 16, wherein the one or more controlling signals are combined with: one or more intermediate features of the latent generator; one or more intermediate features of the first portion of the generator; one or more intermediate features of the synthesis operation; or one or more intermediate features of the second portion of the generator.
[0021] Example 18: The apparatus of any of the examples 13 to 16, wherein the one or more controlling signals are combined with: parameters from one or more layers of the latent generator; parameters from one or more layers of the first portion of the generator; parameters from one or more layers of the synthesis operation; or parameters from one or more layers of the second portion of the generator.
[0022] Example 19: The apparatus of any of the examples 13 to 18, wherein the apparatus is further caused to perform signaling or receiving: one or more controlling signals that are for the second portion of the generator, or data from which the one or more controlling signals are derived.
[0023] Example 20: The apparatus of any of the examples 13 to 18, wherein the one or more controlling signals are determined at a decoder side that comprises the second portion of the generator.
[0024] Example 21: The apparatus of any of the examples 13 to 18, wherein the one or more controlling signals are determined externally to a decoder side that comprises the second portion of the generator and are provided to the second portion of the generator.
[0025] Example 22: The apparatus of any of the examples 6 to 21, wherein an output of the synthesis operation or of the second portion of the generator comprises a generated output.
[0026] Example 23: The apparatus of example 22, wherein when an input to the first portion of the generator is an image, or when an output of the first portion of the generator comprises features of the image, an output of the second portion of the generator comprises a generated image.
[0027] Example 24: The apparatus of any of the examples 6, 22, or 23, wherein the synthesis operation comprises a neural network (synthesis neural network).
[0028] Example 25: The apparatus of example 24, wherein the apparatus is further caused to perform: training the generator by using the synthesis neural network.
[0029] Example 26: The apparatus of example 24, wherein to perform the training, the apparatus is further caused to perform: providing an output of the latent generator to the synthesis neural network to obtain a synthesized output; and using the synthesized output for computing a training loss based at least on ground-truth data.
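The training step of Example 26 can be illustrated with a toy setup. In the sketch below, the one-parameter "latent generator" and "synthesis network", the mean squared error loss, the learning rate, and the plain gradient descent update are all assumptions chosen to keep the example small; the disclosure does not prescribe them.

```python
def latent_generator(noise, w):
    """Hypothetical one-parameter latent generator: scales the noise by w."""
    return [w * n for n in noise]


def synthesis(latent, v):
    """Hypothetical one-parameter synthesis network: scales the latent by v."""
    return [v * z for z in latent]


def mse(pred, target):
    """Mean squared error between the synthesized output and ground truth."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


noise = [1.0, -2.0, 0.5]
ground_truth = [2.0, -4.0, 1.0]  # toy ground-truth data (twice the noise)
w, v = 1.0, 1.0                  # initial parameters
learning_rate = 0.05
losses = []

for _ in range(50):
    # Provide the latent generator output to the synthesis network.
    synthesized = synthesis(latent_generator(noise, w), v)
    # Compute the training loss against the ground truth.
    losses.append(mse(synthesized, ground_truth))
    # Analytic gradients of the MSE with respect to w and v.
    residuals = [s - t for s, t in zip(synthesized, ground_truth)]
    grad_w = sum(2 * r * v * n for r, n in zip(residuals, noise)) / len(noise)
    grad_v = sum(2 * r * w * n for r, n in zip(residuals, noise)) / len(noise)
    # Update both parameters jointly from the same loss.
    w, v = w - learning_rate * grad_w, v - learning_rate * grad_v
```

Because the loss is computed on the synthesized output, the gradient reaches the latent generator only through the synthesis network, which is why the synthesis network is needed during training of the generator.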
[0030] Example 27: The apparatus of any of the examples 1 to 26, wherein the codec comprises a lossless codec, the encoder comprises a lossless encoder, and the decoder comprises a lossless decoder.
[0031] Example 28: The apparatus of example 27, wherein the lossless codec comprises a learned probability model that estimates one or more parameters of a probability distribution, and wherein the one or more parameters are used for encoding one or more inputs to the lossless encoder and / or for decoding one or more inputs to the lossless decoder.
[0032] Example 29: The apparatus of example 28, wherein the encoder comprises a first learned probability model and the decoder comprises a second learned probability model, and wherein the first and second learned probability models are same, substantially same, or are two copies of the learned probability model.
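Examples 28 and 29 can be illustrated with a toy probability model. The sketch below assumes (this is not stated in the disclosure) that the estimated parameters are the mean and scale of a Gaussian discretized to integer bins, and it reports only the ideal code length, i.e. the sum of -log2 p over the symbols, rather than running an actual arithmetic coder. All function names are hypothetical.

```python
import math


def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol, mean, scale):
    """Probability mass of an integer symbol under a Gaussian
    discretized to unit-width bins centred on the integers."""
    upper = gaussian_cdf(symbol + 0.5, mean, scale)
    lower = gaussian_cdf(symbol - 0.5, mean, scale)
    return max(upper - lower, 1e-12)  # floor avoids log2(0)


def estimated_bits(symbols, params):
    """Ideal code length in bits, given per-symbol (mean, scale) pairs."""
    return sum(
        -math.log2(symbol_probability(s, mean, scale))
        for s, (mean, scale) in zip(symbols, params)
    )


symbols = [0, 1, -1, 0]
well_matched = [(0.0, 1.0)] * len(symbols)    # parameters close to the data
poorly_matched = [(5.0, 1.0)] * len(symbols)  # parameters far from the data

bits_good = estimated_bits(symbols, well_matched)
bits_poor = estimated_bits(symbols, poorly_matched)
```

A lossless encoder and decoder driven by such a model must evaluate identical parameters for every symbol, which is consistent with Example 29 requiring the two models to be the same, substantially the same, or copies of one learned model.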
[0033] Example 30: The apparatus of any of the examples 1 to 26, wherein the codec comprises a lossy codec, the encoder comprises a lossy encoder, and the decoder comprises a lossy decoder.
[0034] Example 31: The apparatus of example 30, wherein when the codec comprises the lossy codec, the lossy encoder comprises: a neural network based encoder (NN encoder), a quantization operation, a first probability model, and a lossless encoder that uses an output of the first probability model; and the lossy decoder comprises a second probability model, a lossless decoder that uses an output of the second probability model, a dequantization operation, and a neural network based decoder (NN decoder), wherein the first probability model and the second probability model are same, substantially same, or are two copies of the same learned probability model.
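The order of operations in the lossy codec of Example 31 can be sketched as below. Everything concrete here is an assumption for illustration: the fixed scaling functions stand in for the NN encoder and NN decoder, the step size is arbitrary, and zlib stands in for the lossless encoder and decoder, so the probability models are not modelled at all.

```python
import struct
import zlib

STEP = 0.5  # hypothetical quantization step size


def nn_encoder(x):
    """Stand-in for the NN encoder: a fixed linear scaling."""
    return [v * 0.1 for v in x]


def quantize(y):
    """Quantization: map each value to the nearest integer index."""
    return [round(v / STEP) for v in y]


def lossless_encode(q):
    """Stand-in lossless encoder: 16-bit little-endian packing plus zlib."""
    return zlib.compress(struct.pack(f"<{len(q)}h", *q))


def lossless_decode(bitstream):
    """Stand-in lossless decoder: inverse of lossless_encode()."""
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def dequantize(q):
    """Dequantization: map integer indices back to values."""
    return [v * STEP for v in q]


def nn_decoder(y_hat):
    """Stand-in for the NN decoder: inverse of the encoder scaling."""
    return [v * 10.0 for v in y_hat]


x = [12.0, 47.5, 200.0, 3.3]  # e.g. an output of the latent generator

# Lossy encoder: NN encoder -> quantization -> lossless encoder.
q = quantize(nn_encoder(x))
bitstream = lossless_encode(q)

# Lossy decoder: lossless decoder -> dequantization -> NN decoder.
decoded_q = lossless_decode(bitstream)
x_hat = nn_decoder(dequantize(decoded_q))
```

The quantized indices survive the lossless stage exactly; the only loss comes from the quantization step, which in this sketch bounds the reconstruction error by half a step scaled by the decoder gain.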
[0035] Example 32: The apparatus of any of the examples 1 to 31, wherein one or more of the following components or operations are trained jointly: the latent generator, the NN encoder, the learned probability model, the NN decoder, or the synthesis operation.
[0036] Example 33: The apparatus of any of the examples 1 to 31, wherein the latent generator and the synthesis operation comprise neural networks and are pretrained to obtain a pretrained latent generator and a pretrained synthesis neural network, and wherein the pretrained latent generator and the pretrained synthesis neural network are combined with an end-to-end learned codec.
[0037] Example 34: The apparatus of any of the examples 1 to 33, wherein the apparatus is further caused to perform: applying a post-processing filter on the output of the second portion of the generator or the output of the synthesis operation.
[0038] Example 35: The apparatus of any of the examples 1 to 33, wherein the second portion of the generator or the synthesis operation is used as part of the decoder, and wherein in this configuration, the bitstream is input to a lossless decoder to obtain a lossless-decoded signal, the lossless decoded signal is input to a dequantization operation to obtain a dequantized signal, and the dequantized signal is input to the second portion of the generator.
[0039] Example 36: The apparatus of any of the examples 27 to 35, wherein the apparatus is further caused to perform: using one or more additional controlling signals with the lossy or lossless codec, and when the one or more additional controlling signals are used, the one or more additional controlling signals or the data derived from the one or more additional controlling signals is comprised in the bitstream or sent in an auxiliary bitstream.
[0040] Example 37: The apparatus of any of the examples 27 to 36, wherein the apparatus is further caused to perform: applying different controlling signals of the one or more controlling signals to the second portion of the generator or the synthesis operation in order to generate different outputs from the bitstream.
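Example 37 describes producing different outputs from one bitstream by varying the controlling signals at the second portion of the generator. A minimal sketch, assuming a hypothetical controlling signal made of a gain and an offset (a real system might use a conditioning prompt or data derived from one):

```python
def second_portion(latent, control):
    """Hypothetical synthesis step conditioned on a controlling signal.

    `control` is assumed to be a dict with 'gain' and 'offset' entries;
    this form is chosen for the sketch and is not taken from the disclosure.
    """
    return [control["gain"] * z + control["offset"] for z in latent]


# The same decoded bitstream content is used in both calls.
decoded_latent = [1, -2, 3, 0]

bright = second_portion(decoded_latent, {"gain": 2.0, "offset": 10.0})
dark = second_portion(decoded_latent, {"gain": 0.5, "offset": -10.0})
```

No additional bits are needed for the latent itself; only the controlling signal differs between the two outputs.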
[0041] Example 38: The apparatus of any of the examples 27 to 37, wherein the apparatus is further caused to perform: receiving the one or more controlling signals by the second portion of the generator or signaling the one or more controlling signals from the first portion of the generator or from an external source.
[0042] Example 39: The apparatus of any of the examples 27 to 37, wherein the one or more controlling signals or data derived from the one or more controlling signals are inferred or derived at the decoder side.
[0043] Example 40: A method comprising: receiving input for a first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
[0044] Example 41: A method comprising: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
[0045] Example 42: The method of any of the examples 40 or 41, wherein the first portion of the generator comprises a first neural network and the second portion comprises a second neural network.
[0046] Example 43: The method of example 42, wherein the first neural network comprises a latent generator and the second neural network comprises a synthesis operation.
[0047] Example 44: The method of any of the examples 40 to 43, wherein the first portion of the generator comprises a first component and a second component.
[0048] Example 45: The method of example 44, wherein the first component maps or transforms an input to a latent space and the second component processes the mapped or transformed input in latent space; or the first component comprises a neural network that maps or transforms an input image to features and the second component comprises a first diffusion model that performs an iterative denoising process on the features.
[0049] Example 46: The method of example 45, wherein an input to the first component comprises one of the following: a noise signal; or the noise signal and an image.
[0050] Example 47: The method of any of examples 45 or 46, wherein an input to the second component comprises a noise signal; or features and the noise signal.
[0051] Example 48: The method of any of the examples 40 to 47, wherein the second portion of the generator comprises a neural network that maps the generated latent data to an output in image domain; or the second portion of the generator comprises a third component and a fourth component, wherein the third component comprises a denoising process and the fourth component comprises a neural network that converts the output of the third component into a final output in image domain, and wherein the denoising process denoises an input to the denoising process.
[0052] Example 49: The method of example 48, wherein the denoising process comprises a neural network, or a second diffusion model for performing an iterative process to denoise the input generated latent data.
[0053] Example 50: The method of any of the examples 45 to 49, wherein the first diffusion model in the first portion of the generator is different or substantially different from the second diffusion model in the second portion of the generator.
[0054] Example 51: The method of any of the examples 40 to 50, wherein the input to the first portion of the generator comprises one or more of the following: one or more noise samples, one or more prompts, one or more images; one or more videos; one or more indications of respective one or more locations or positions, one or more indications of respective one or more time instants, one or more indications of respective one or more spatio-temporal positions, one or more indications of respective one or more data items or data points to be generated.
[0055] Example 52: The method of any of the examples 40 to 51, wherein the input to the first portion of the generator or to the second portion of the generator comprises one or more controlling signals.
[0056] Example 53: The method of example 52, wherein the one or more controlling signals comprise an additional conditioning prompt for the generation process, or data derived from an additional conditioning prompt.
[0057] Example 54: The method of any of the examples 52 or 53, wherein the one or more controlling signals received by the first portion of the generator is different or at least partially different from the one or more controlling signals received by the second portion of the generator.
[0058] Example 55: The method of any of the examples 40 to 54, wherein a processing unit or processing circuit for the first portion of the generator is different or at least partially different from a processing unit or a processing circuit for the second portion of the generator.
[0059] Example 56: The method of any of the examples 42 to 55, wherein the one or more controlling signals are combined with: one or more intermediate features of the latent generator; one or more intermediate features of the first portion of the generator; one or more intermediate features of the synthesis operation; or one or more intermediate features of the second portion of the generator.
[0060] Example 57: The method of any of the examples 42 to 55, wherein the one or more controlling signals are combined with: parameters from one or more layers of the latent generator; parameters from one or more layers of the first portion of the generator; parameters from one or more layers of the synthesis operation; or parameters from one or more layers of the second portion of the generator.
[0061] Example 58: The method of any of the examples 42 to 57 further comprising: signaling or receiving: one or more controlling signals that are for the second portion of the generator, or data from which the one or more controlling signals are derived.
[0062] Example 59: The method of any of the examples 42 to 57, wherein the one or more controlling signals are determined at a decoder side that comprises the second portion of the generator.
[0063] Example 60: The method of any of the examples 42 to 57, wherein the one or more controlling signals are determined externally to a decoder side that comprises the second portion of the generator and are provided to the second portion of the generator.
[0064] Example 61: The method of any of the examples 45 to 60, wherein an output of the synthesis operation or of the second portion of the generator comprises a generated output.
[0065] Example 62: The method of example 61, wherein when an input to the first portion of the generator is an image, or when an output of the first portion of the generator comprises features of the image, an output of the second portion of the generator comprises a generated image.
[0066] Example 63: The method of any of the examples 45, 61, or 62, wherein the synthesis operation comprises a neural network (synthesis neural network).
[0067] Example 64: The method of example 63 further comprising: training the generator by using the synthesis neural network.
[0068] Example 65: The method of example 63, wherein to perform the training, the method further comprises: providing an output of the latent generator to the synthesis neural network to obtain a synthesized output; and using the synthesized output for computing a training loss based at least on ground-truth data.
[0069] Example 66: The method of any of the examples 40 to 65, wherein the codec comprises a lossless codec, the encoder comprises a lossless encoder, and the decoder comprises a lossless decoder.
[0070] Example 67: The method of example 66, wherein the lossless codec comprises a learned probability model that estimates one or more parameters of a probability distribution, and wherein the one or more parameters are used for encoding one or more inputs to the lossless encoder and / or for decoding one or more inputs to the lossless decoder.
[0071] Example 68: The method of example 67, wherein the encoder comprises a first learned probability model and the decoder comprises a second learned probability model, and wherein the first and second learned probability models are same, substantially same, or are two copies of the learned probability model.
[0072] Example 69: The method of any of the examples 40 to 65, wherein the codec comprises a lossy codec, the encoder comprises a lossy encoder, and the decoder comprises a lossy decoder.
[0073] Example 70: The method of example 69, wherein when the codec comprises the lossy codec, the lossy encoder comprises: a neural network based encoder (NN encoder), a quantization operation, a first probability model, and a lossless encoder that uses an output of the first probability model; and the lossy decoder comprises a second probability model, a lossless decoder that uses an output of the second probability model, a dequantization operation, and a neural network based decoder (NN decoder), wherein the first probability model and the second probability model are same, substantially same, or are two copies of the same learned probability model.
[0074] Example 71: The method of any of the examples 40 to 70, wherein one or more of the following components or operations are trained jointly: the latent generator, the NN encoder, the learned probability model, the NN decoder, or the synthesis operation.
[0075] Example 72: The method of any of the examples 40 to 70, wherein the latent generator and the synthesis operation comprise neural networks and are pretrained to obtain a pretrained latent generator and a pretrained synthesis neural network, and wherein the pretrained latent generator and the pretrained synthesis neural network are combined with an end-to-end learned codec.
[0076] Example 73: The method of any of the examples 40 to 72 further comprising: applying a postprocessing filter on the output of the second portion of the generator or the output of the synthesis operation.
[0077] Example 74: The method of any of the examples 40 to 72, wherein the second portion of the generator or the synthesis operation is used as part of the decoder, and wherein in this configuration, the bitstream is input to a lossless decoder to obtain a lossless-decoded signal, the lossless decoded signal is input to a dequantization operation to obtain a dequantized signal, and the dequantized signal is input to the second portion of the generator.
[0078] Example 75: The method of any of the examples 66 to 74 further comprising: using one or more additional controlling signals with the lossy or lossless codec, and when the one or more additional controlling signals are used, the one or more additional controlling signals or the data derived from the one or more additional controlling signals is comprised in the bitstream or sent in an auxiliary bitstream.
[0079] Example 76: The method of any of the examples 66 to 75 further comprising: applying different controlling signals of the one or more controlling signals to the second portion of the generator or the synthesis operation in order to generate different outputs from the bitstream.
[0080] Example 77: The method of any of the examples 66 to 76 further comprising: receiving the one or more controlling signals by the second portion of the generator or signaling the one or more controlling signals from the first portion of the generator or from an external source.
[0081] Example 78: The method of any of the examples 66 to 76, wherein the one or more controlling signals or data derived from the one or more controlling signals are inferred or derived at the decoder side.
[0082] Example 79: A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving input for a first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
[0083] Example 80: A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
[0084] Example 81: The computer readable medium of any of the examples 79 or 80, wherein the apparatus is further caused to perform methods as described in any of the examples 42 to 78.
[0085] Example 82: The computer readable medium of any of the examples 79 to 81, wherein the computer readable medium comprises a non-transitory computer readable medium.
[0086] Example 83: An apparatus comprising: means for receiving input for a first portion of a generator; means for generating, based at least on the input, an output of the first portion of the generator; means for providing the output of the first portion of the generator to an encoder; means for obtaining a bitstream based at least on the output of the first portion of the generator; and means for signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
[0087] Example 84: An apparatus comprising: means for receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; means for decoding the bitstream to generate a decoded bitstream; means for providing the decoded bitstream to a second portion of the generator; and means for generating an output of the second portion of the generator.
[0088] Example 85: The apparatus of any of the examples 83 or 84, wherein the apparatus further comprises means for performing methods as described in any of the examples 42 to 78.
BRIEF DESCRIPTION OF THE DRAWINGS
[0089] The foregoing examples and other features are explained in the following description, taken in connection with the accompanying drawings, wherein:
[0090] FIG. 1 is a block diagram of one possible and non-limiting example system in which the example embodiments may be practiced;
[0091] FIG. 2 is a diagram illustrating features as described herein;
[0092] FIG. 3 is a diagram illustrating features as described herein;
[0093] FIG. 4 is a diagram illustrating features as described herein;
[0094] FIG. 5 is a diagram illustrating features as described herein;
[0095] FIG. 6 is a diagram illustrating features as described herein;
[0096] FIG. 7 is a diagram illustrating features as described herein;
[0097] FIG. 8 is a diagram illustrating features as described herein;
[0098] FIGs. 9a and 9b are diagrams illustrating features as described herein;
[0099] FIG. 10 is a diagram illustrating features as described herein;
[0100] FIG. 11 is a diagram illustrating features as described herein;
[0101] FIG. 12 is a diagram illustrating features as described herein;
[0102] FIG. 13 is a diagram illustrating features as described herein;
[0103] FIG. 14 is a diagram illustrating features as described herein;
[0104] FIG. 15 is a diagram illustrating features as described herein;
[0105] FIG. 16 is a diagram illustrating features as described herein;
[0106] FIG. 17 is a diagram illustrating features as described herein;
[0107] FIG. 18 is a diagram illustrating an example apparatus, which may be implemented in hardware, configured to implement the examples described herein;
[0108] FIG. 19 is a diagram illustrating an example of non-volatile memory media used to store instructions that implement the examples described herein;
[0109] FIG. 20 is a flowchart illustrating an example method as described herein; and
[0110] FIG. 21 is a flowchart illustrating another example method as described herein.
DETAILED DESCRIPTION OF EMBODIMENTS
[0111] The following abbreviations that may be found in the specification and / or the drawing figures are defined as follows:
3GPP third generation partnership project
4G fourth generation
5G fifth generation
5GC 5G core network
APS adaptation parameter set
AR augmented reality
CABAC context-adaptive binary arithmetic coding
CDMA code division multiple access
CPU central processing unit
cRAN cloud radio access network
DCT discrete cosine transform
E2E end-to-end
eNB (or e Node B) evolved Node B (e.g., an LTE base station)
EN-DC E-UTRA-NR dual connectivity
en-gNB or En-gNB node providing NR user plane and control plane protocol terminations towards the UE, and acting as secondary node in EN-DC
E-UTRA evolved universal terrestrial radio access, i.e., the LTE radio access technology
FDMA frequency division multiple access
GAN generative adversarial network
gNB (or g Node B) base station for 5G / NR, i.e., a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC
GPU graphical processing unit
GSM global systems for mobile communications
HMD head-mounted display
IBC intra block copy
IEEE Institute of Electrical and Electronics Engineers
IMD integrated messaging device
IMS instant messaging service
IoT Internet of Things
JVET Joint Video Expert Team
LTE long term evolution
MAE mean absolute error
mAP mean average precision
MMS multimedia messaging service
MPEG-I Moving Picture Experts Group immersive codec family
MR mixed reality
MSE mean squared error
MS-SSIM multiscale structure similarity index measure
NAL network abstraction layer
ng or NG new generation
ng-eNB or NG-eNB new generation eNB
NN neural network
NNC neural network coding
NR new radio
N / W or NW network
O-RAN open radio access network
PC personal computer
PDA personal digital assistant
PSNR peak signal-to-noise ratio
QP quantization parameter
ROI region of interest
SEI supplemental enhancement information
SGD stochastic gradient descent
SMS short messaging service
SSIM structure similarity index measure
TCP-IP transmission control protocol-internet protocol
TDMA time division multiple access
UE user equipment (e.g., a wireless, typically mobile device)
UMTS universal mobile telecommunications system
USB universal serial bus
VCM video coding for machines
VMAF Video Multimethod Assessment Fusion
VNR virtualized network function
VR virtual reality
VVC volumetric video coding
WLAN wireless local area network
[0112] The following describes suitable apparatus and possible mechanisms for practicing example embodiments of the present disclosure. Accordingly, reference is first made to FIG. 1, which shows an example block diagram of an apparatus 50 (for example, an electronic device or a user equipment). The apparatus may be configured to perform various functions such as, for example, gathering information by one or more sensors, encoding and / or decoding information, receiving and / or transmitting information, analyzing information gathered or received by the apparatus, or the like. A device configured to encode a video scene may (optionally) comprise one or more microphones for capturing the scene and / or one or more sensors, such as cameras, for capturing information about the physical environment in which the scene is captured. Alternatively, a device configured to encode a video scene may be configured to receive information about an environment in which a scene is captured and / or a simulated environment. A device configured to decode and / or render the video scene may be configured to receive a Moving Picture Experts Group immersive codec family (MPEG-I) bitstream comprising the encoded video scene. A device configured to decode and / or render the video scene may comprise one or more speakers / audio transducers and / or displays, and / or may be configured to transmit a decoded scene or signals to a device comprising one or more speakers / audio transducers and / or displays. A device configured to decode and / or render the video scene may comprise a user equipment, a head-mounted display, or another device capable of rendering to a user an AR, VR and / or MR experience.
[0113] The apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system. Alternatively, the electronic device may be a computer or part of a computer that is not mobile. It should be appreciated that example embodiments of the present disclosure may be implemented within any electronic device or apparatus which may process data. The apparatus 50 may comprise a device that can access a network and / or cloud through a wired or wireless connection. The apparatus 50 may comprise one or more processors / controllers 56, one or more memories 58, and one or more radio interface circuitry 52 interconnected through one or more buses. The one or more processors / controllers 56 may comprise a central processing unit (CPU) and / or a graphical processing unit (GPU). Each of the one or more radio interface circuitry 52 includes a receiver and a transmitter. The one or more buses may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. A “circuit” may include dedicated hardware or hardware in association with software executable thereon. The one or more transceivers may be connected to one or more antennas 44. The one or more memories 58 may include computer program code. The one or more memories 58 and the computer program code may be configured to, with the one or more processors / controllers 56, cause the apparatus 50 to perform one or more of the operations as described herein.
[0114] The apparatus 50 may connect to a node of a network. The network node may comprise one or more processors, one or more memories, and one or more transceivers interconnected through one or more buses. Each of the one or more transceivers includes a receiver and a transmitter. The one or more buses may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. The one or more transceivers may be connected to one or more antennas. The one or more memories may include computer program code. The one or more memories and the computer program code may be configured to, with the one or more processors, cause the network node to perform one or more of the operations as described herein.
[0115] The apparatus 50 may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device 38 which in example embodiments of the present disclosure may be any one of: an earpiece, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other example embodiments of the present disclosure the device may be powered by any suitable mobile energy device such as solar cell, fuel cell, or clockwork generator). The apparatus 50 may further comprise a camera 42 or other sensor capable of recording or capturing images and / or video. Additionally or alternatively, the apparatus 50 may further comprise a depth sensor. The apparatus 50 may further comprise a display 32. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other example embodiments of the present disclosure the apparatus 50 may further comprise any suitable short-range communication solution such as for example a BLUETOOTH™ wireless connection or a USB / firewire wired connection.
[0116] It should be understood that an apparatus 50 configured to perform example embodiments of the present disclosure may have fewer and / or additional components, which may correspond to what processes the apparatus 50 is configured to perform. For example, an apparatus configured to encode a video might not comprise a speaker or audio transducer and may comprise a microphone, while an apparatus configured to render the decoded video might not comprise a microphone and may comprise a speaker or audio transducer.
[0117] Referring now to FIG. 1, the apparatus 50 may comprise processors / controllers 56, a processor, or processor circuitry for controlling the apparatus 50. The processors / controllers 56 may be connected to memory 58 which in example embodiments of the present disclosure may store both data in the form of image and audio data and / or may also store instructions for implementation on the processors / controllers 56. The processors / controllers 56 may further be connected to codec circuitry 54 suitable for carrying out coding and / or decoding of audio and / or video data or assisting in coding and / or decoding carried out by the controller.
[0118] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC and UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the apparatus 50 at a network. The apparatus 50 may further comprise an input device 34, such as a keypad, one or more input buttons, or a touch screen input device, for providing information to the processors / controllers 56.
[0119] The apparatus 50 may comprise a radio interface circuitry 52 (for example, transceivers) connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and / or for receiving radio frequency signals from other apparatus(es).
[0120] The apparatus 50 may comprise a microphone 36, camera 42, and / or other sensors capable of recording or detecting audio signals, image / video signals, and / or other information about the local / virtual environment, which are then passed to the codec circuitry 54 or the processors / controllers 56 for processing. The apparatus 50 may receive the audio / image / video signals and / or information about the local / virtual environment for processing from another device prior to transmission and / or storage. The apparatus 50 may also receive either wirelessly or by a wired connection the audio / image / video signals and / or information about the local / virtual environment for encoding / decoding. The structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.
[0121] The memory 58 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The memory 58 may be a non-transitory memory. The memory 58 may be means for performing storage functions. The processors / controllers 56 may be or comprise one or more processors, which may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The processors / controllers 56 may be means for performing functions.
[0122] The apparatus 50 may be configured to perform capture of a volumetric scene according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a camera 42 or other sensor capable of recording or capturing images and / or video. The apparatus 50 may also comprise one or more radio interface circuitry 52 to enable transmission of captured content for processing at another device. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.
[0123] The apparatus 50 may be configured to perform processing of volumetric video content according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a processors / controllers 56 for processing images to produce volumetric video content, a processors / controllers 56 for processing volumetric video content to project 3D information into 2D information, patches, and auxiliary information, and / or a codec circuitry 54 for encoding 2D information, patches, and auxiliary information into a bitstream for transmission to another device with radio interface circuitry 52. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.
[0124] The apparatus 50 may be configured to perform encoding or decoding of 2D information representative of volumetric video content according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a codec circuitry 54 for encoding or decoding 2D information representative of volumetric video content. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.
[0125] The apparatus 50 may be configured to perform rendering of decoded 3D volumetric video according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a controller for projecting 2D information to reconstruct 3D volumetric video, and / or a display 32 for rendering decoded 3D volumetric video. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.
[0126] With respect to FIG. 2, an example of a system within which example embodiments of the present disclosure can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM, UMTS, E-UTRA, LTE, CDMA, 4G, 5G, 6G network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a BLUETOOTH™ personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and / or the Internet. A wireless network may implement network virtualization, which is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network. Network virtualization involves platform virtualization, often combined with resource virtualization. Network virtualization is categorized as either external, combining many networks, or parts of networks, into a virtual unit, or internal, providing network-like functionality to software containers on a single system. For example, a network may be deployed in a tele cloud, with virtualized network functions (VNF) running on, for example, data center servers. For example, network core functions and / or radio access network(s) (e.g. CloudRAN, O-RAN, edge cloud) may be virtualized. Note that the virtualized entities that result from the network virtualization are still implemented, at some level, using hardware such as processors and memories, and also such virtualized entities create technical effects.
[0127] It may also be noted that operations of example embodiments of the present disclosure may be carried out by a plurality of cooperating devices (e.g. cRAN).
[0128] The system 10 may include both wired and wireless communication devices and / or electronic devices suitable for implementing example embodiments of the present disclosure.
[0129] For example, the system shown in FIG. 2 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
[0130] The example communication devices shown in the system 10 may include, but are not limited to, an apparatus 15, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a head-mounted display (HMD) 17. The apparatus 50 may comprise any of those example communication devices. In an example embodiment of the present disclosure, more than one of these devices, or a plurality of one or more of these devices, may perform the disclosed process(es). These devices may connect to the internet 28 through a wireless connection 2.
[0131] The example embodiments of the present disclosure may also be implemented in a set-top box; i.e. a digital TV receiver, which may / may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and / or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and / or embedded systems offering hardware / software based coding. The example embodiments of the present disclosure may also be implemented in cellular telephones such as smart phones, tablets, personal digital assistants (PDAs) having wireless communication capabilities, portable computers having wireless communication capabilities, image capture devices such as digital cameras having wireless communication capabilities, gaming devices having wireless communication capabilities, music storage and playback appliances having wireless communication capabilities, Internet appliances permitting wireless Internet access and browsing, tablets with wireless communication capabilities, as well as portable units or terminals that incorporate combinations of such functions.
[0132] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24, which may be, for example, an eNB, gNB, access point, access node, other node, etc. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.
[0133] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time division multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), BLUETOOTH™, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology. A communications device involved in implementing various example embodiments of the present disclosure may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
[0134] In telecommunications and data networks, a channel may refer either to a physical channel or to a logical channel. A physical channel may refer to a physical transmission medium such as a wire, whereas a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels. A channel may be used for conveying an information signal, for example a bitstream, which may be a MPEG-I bitstream, from one or several senders (or transmitters) to one or several receivers.
[0135] Having thus introduced one suitable but non-limiting technical context for the practice of the example embodiments of the present disclosure, example embodiments will now be described with greater specificity.
[0136] Fundamentals of neural networks
[0137] Features as described herein may generally relate to neural networks. A neural network (NN) may be described as a computation graph consisting of several layers of computation. In an example of a NN, each layer may consist of one or more units, where each unit may perform an elementary computation. A unit may be connected to one or more other units, and the connection may be associated with a weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch normalization layers. Example embodiments of the present disclosure may or may not relate to, or involve, NN comprising multiple layers of computation.
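As a minimal sketch of the layer-and-unit description above, the following computes the output of one fully connected layer. The bias term, the ReLU nonlinearity, the function name `dense_layer`, and all numeric values are illustrative assumptions and are not taken from the disclosure.

```python
def dense_layer(inputs, weights, biases):
    """One layer of units: each unit scales its inputs by the connection
    weights, sums them, adds a bias, and applies a ReLU (assumed here)."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, weighted_sum))
    return outputs


# Two inputs feeding a layer of two units (hypothetical values).
x = [1.0, 2.0]
hidden = dense_layer(x, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, 0.5])
print(hidden)
```

Stacking several such calls, each consuming the previous output, gives the multi-layer computation graph referred to in the paragraph.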
[0138] In some neural networks, such as convolutional neural networks for image classification, initial layers (those close to the input data) may extract semantically low-level features such as edges and textures in images, whereas intermediate layers may extract higher-level features. After the feature extraction layers, there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc. Example embodiments of the present disclosure may or may not relate to, or involve, convolutional neural networks.
[0139] Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.
[0140] One property of neural nets / networks (and other machine learning tools) is that they are able to learn properties from input data, e.g., in a supervised way or in an unsupervised way. Such learning may be a result of a training algorithm, or may be achieved by means of another neural network providing the training signal (sometimes, this latter approach may be referred to as “meta learning”).
[0141] In general, the training algorithm may consist of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network may be used to derive a class or category index which may indicate the class or category to which the object in the input image belongs. Training may comprise minimizing or decreasing the output’s error, also referred to as the loss or loss function. Examples of losses are mean squared error, cross-entropy, etc. Example embodiments of the present disclosure may or may not relate to, or involve, neural networks trained according to a training algorithm.
[0142] In recent deep learning techniques, training may be an iterative process, where at each iteration the algorithm may modify the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss, for example by means of a gradient descent technique. In one example, at each training iteration, gradients of the loss function with respect to one or more weights or parameters of the NN may be computed, for example by a backpropagation technique; the computed gradients may then be used by an optimization routine, such as Adam or Stochastic Gradient Descent (SGD) to obtain an update to the one or more weights or parameters.
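The iterative update described above can be sketched with a one-parameter model. This is an illustration only: it uses plain full-batch gradient descent with a hand-derived gradient rather than backpropagation, Adam, or SGD, and the model `y = w * x`, the learning rate, and the data are assumptions.

```python
def train_linear(xs, ys, lr=0.05, steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    losses = []
    n = len(xs)
    for _ in range(steps):
        preds = [w * x for x in xs]
        # Loss: mean squared error between predictions and targets.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Gradient of the loss with respect to the single weight w.
        grad = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # Update step: move the weight against the gradient.
        w -= lr * grad
        losses.append(loss)
    return w, losses


w, losses = train_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 4), losses[0], losses[-1])
```

Each iteration makes a small change to the weight, so the recorded loss decreases gradually, which is the behaviour the paragraph describes.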
[0143] In the present disclosure, the terms “model”, “neural network”, “neural net” and “network” are used interchangeably. In the present disclosure, the weights of neural networks may sometimes be referred to as learnable parameters or simply as parameters.
[0144] Training a neural network may be regarded as an optimization process, but the final goal may be different from the typical goal of optimization. In optimization, the main goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is at least partially different from the training set. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set may be monitored during the training process to understand the following:
[0145] - If the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting.
[0146] - If the network is learning to generalize - in this case, also the validation set error needs to decrease and to be not too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model may be in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set, but performs poorly on a set not used for tuning its parameters.
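The two checks above can be sketched as a simple rule over recorded error curves. The function name `diagnose`, the comparison of first and last values, and the `gap_ratio` threshold for "not too much higher" are hypothetical choices made for illustration; the disclosure does not specify any of them.

```python
def diagnose(train_errors, val_errors, gap_ratio=1.5):
    """Classify a training run from its training and validation error curves."""
    # Is the network learning at all? The training error should decrease.
    if not train_errors[-1] < train_errors[0]:
        return "underfitting"
    # Is it learning to generalize? The validation error should also decrease
    # and stay within an assumed factor of the training error.
    val_decreased = val_errors[-1] < val_errors[0]
    val_close = val_errors[-1] <= gap_ratio * train_errors[-1]
    if val_decreased and val_close:
        return "generalizing"
    return "overfitting"


print(diagnose([1.0, 0.9, 1.0], [1.0, 1.0, 1.0]))
print(diagnose([1.0, 0.5, 0.2], [1.0, 0.6, 0.25]))
print(diagnose([1.0, 0.3, 0.05], [1.0, 0.8, 0.9]))
```

In practice the curves would be smoothed and inspected over many iterations; comparing only endpoints keeps the sketch short.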
[0147] Fundamentals of video / image coding
[0148] Features as described herein may generally relate to video or image coding. A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage / transmission, and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
[0149] Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted, for example by motion compensation means (i.e. finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded), or by spatial means (i.e. using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (i.e. picture quality) and the size of the resulting coded video representation (i.e. file size or transmission bitrate).
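The second phase described above (transform, quantize, reconstruct) can be sketched on a one-dimensional block of four samples. This is a toy illustration under several assumptions: an orthonormal DCT-II, a uniform quantizer with step size `step`, made-up sample values, and no entropy coding. Real codecs operate on two-dimensional blocks with integer transforms.

```python
import math


def dct(block):
    """Orthonormal DCT-II of a 1-D block."""
    n = len(block)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(block)))
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples


def code_block(original, predicted, step):
    """Code the prediction error and return (quantized levels, reconstruction)."""
    residual = [o - p for o, p in zip(original, predicted)]
    levels = [round(c / step) for c in dct(residual)]  # values an entropy coder would see
    decoded_residual = idct([level * step for level in levels])
    reconstructed = [p + r for p, r in zip(predicted, decoded_residual)]
    return levels, reconstructed


original = [52.0, 55.0, 61.0, 66.0]
predicted = [50.0, 54.0, 60.0, 68.0]
fine_levels, fine_recon = code_block(original, predicted, step=0.5)
coarse_levels, coarse_recon = code_block(original, predicted, step=4.0)
print(fine_levels, coarse_levels)
```

A larger `step` gives smaller quantized levels, which are cheaper to code, at the price of a larger reconstruction error. This is the fidelity versus size trade-off mentioned in the paragraph.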
[0150] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures).
[0151] In temporal inter prediction, the sources of prediction are previously decoded pictures in the same scalable layer. In intra block copy (IBC; a.k.a. intra-block-copy prediction), prediction may be applied similarly to temporal inter prediction, but the reference picture is the current picture, and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal inter prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal inter prediction only, while in other cases inter prediction may refer collectively to temporal inter prediction and any of intra block copy, inter-layer prediction, and inter-view prediction, provided that they are performed with the same or similar process as temporal prediction. Inter prediction, temporal inter prediction, or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
[0152] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
[0153] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors, and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
[0154] The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (e.g. using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (e.g. inverse operation of the prediction error coding recovering the quantized prediction error signal in the spatial pixel domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and / or storing it as prediction reference for the forthcoming frames in the video sequence.
[0155] In typical video codecs, the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, those are typically coded differentially with respect to block specific predicted motion vectors. In typical video codecs, the predicted motion vectors are created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and / or co-located blocks in the temporal reference pictures, and to signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of previously coded / decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and / or co-located blocks in the temporal reference picture. Moreover, typical high efficiency video codecs employ an additional motion information coding / decoding mechanism, often called merging / merge mode, where all the motion field information, which includes the motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification / correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and / or co-located blocks in the temporal reference pictures, and the used motion field information is signaled among a list of motion field candidates filled with motion field information of available adjacent / co-located blocks.
[0156] In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that, often, there still exists some correlation within the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
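The median-based prediction and differential coding of motion vectors mentioned above can be sketched as follows. The component-wise median over three neighbouring blocks, the function names, and the vector values are illustrative assumptions; actual codecs define the neighbour set and rounding rules normatively.

```python
from statistics import median


def predict_mv(neighbors):
    """Component-wise median of the neighbouring blocks' motion vectors."""
    return (median([mv[0] for mv in neighbors]),
            median([mv[1] for mv in neighbors]))


def encode_mv(mv, neighbors):
    """Return the motion vector difference that would be written to the bitstream."""
    px, py = predict_mv(neighbors)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbors):
    """Rebuild the motion vector from the difference and the same predictor."""
    px, py = predict_mv(neighbors)
    return (mvd[0] + px, mvd[1] + py)


neighbors = [(4, -2), (6, 0), (5, -1)]  # already coded adjacent blocks (hypothetical)
mv = (5, 0)
mvd = encode_mv(mv, neighbors)
print(mvd, decode_mv(mvd, neighbors))
```

Because encoder and decoder derive the predictor from the same already coded neighbours, only the typically small difference needs to be coded.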
[0157] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired Macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
C = D + λR
[0158] where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
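A minimal sketch of mode selection with the Lagrangian cost defined above, where the weighting factor is written as `lam`. The candidate mode names, their distortion and rate values, and the function name `best_mode` are hypothetical and chosen only to show how the weighting factor shifts the decision.

```python
def best_mode(candidates, lam):
    """Pick the candidate minimizing the cost C = D + lam * R."""
    return min(candidates, key=lambda m: m["distortion"] + lam * m["rate"])


# Hypothetical candidates: distortion as a squared-error measure, rate in bits.
modes = [
    {"name": "intra", "distortion": 40.0, "rate": 120},
    {"name": "inter", "distortion": 55.0, "rate": 60},
    {"name": "skip", "distortion": 90.0, "rate": 2},
]

for lam in (0.1, 0.5, 2.0):
    print(lam, best_mode(modes, lam)["name"])
```

A small weighting factor favours low distortion, while a large one favours low rate, so the same candidates can yield different choices at different operating points.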
[0159] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI network abstraction layer (NAL) units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike, and the latter type can end a picture unit or alike. An SEI NAL unit contains one or more SEI messages which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264 / AVC, H.265 / HEVC, H.266 / VVC, and H.274 / VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages, but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically, and hence interoperate. System specifications may require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient may be specified.
[0160] Information on neural network based image / video coding
[0161] Features as described herein may generally relate to use of NN to code images and / or videos. Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.
[0162] In a first approach, NNs are used to replace one or more of the components of a traditional codec, such as a VVC / H.266-compliant codec. Here, “traditional” or “legacy” means those codecs whose components and their parameters are typically not learned from data by means of machine learning techniques. Examples of components that may be implemented as neural networks are: an in-loop filter, for example a NN that works as an additional in-loop filter with respect to the traditional loop filters, or a NN that works as the only additional in-loop filter, thus replacing any other in-loop filter; intra-frame prediction; inter-frame prediction; transform and / or inverse transform; probability model for lossless coding; etc.
[0163] In a second approach, commonly referred to as “end-to-end learned compression” (or end-to-end learned codec), NNs are used as the main components of the image / video codecs. However, the codec may still comprise components which are not based on machine learning techniques. In this second approach, two design options are as follows:
[0164] - Option 1: re-use the traditional video coding pipeline, but replace most or all the components with NNs. Referring now to FIG. 3, illustrated is an example of an end-to-end learned codec that includes NNs replacing some components of the traditional video coding pipeline. Input signal (x) (302) may be combined (303) with other information and provided to a neural transform (304), which may also receive input from an encoder parameter control (306). Output of the neural transform (304) may be provided for quantization (308), and then for inverse quantization / neural transform (310) as well as entropy coding (312) to a bitstream (314). Entropy coding (312) may be performed based on input from the encoder parameter control (306).
[0165] The output of the inverse quantization / neural transform (310) may be combined with other information, and provided to a neural intra codec (316) and to a deep loop filter (324). The neural intra codec (316) may also receive input from the encoder parameter control (306), and may comprise an encoder (318), intra coding (320), and a decoder (322).
[0166] The deep loop filter (324) may also receive input from the encoder parameter control (306), and may provide output to a decode picture buffer (326), which may produce an enhanced reference frame (328) based, at least partially, on one or more reconstructed frames (330). The decode picture buffer (326) may provide output for inter prediction (332), which may provide output based, at least partially, on input from the encoder parameter control (306) and ME / MC (336), Gnet(Cnet( )) (334).
[0167] In the example of FIG. 3, the forward and inverse transforms were replaced with two neural networks (304, 310), the neural intra codec (316) comprises a neural network, and the loop filter (324) is a neural network.
[0168] - Option 2: re-design the whole pipeline as a neural network auto-encoder with a quantization and lossless coding in the middle part. This option may also be referred to as end-to-end learned coding. The codec may comprise the following:
[0169] - Encoder NN (also referred to as a neural network based encoder, or NN encoder): performs a non-linear transformation of the input. The output is typically referred to as a latent tensor.
[0170] - Quantization and lossless encoding of the encoder NN’s output.
[0171] - Lossless decoding and dequantization.
[0172] - Decoder NN (also referred to as a neural network based decoder, or NN decoder): performs a non-linear inverse transformation from dequantized latent tensor to a reconstructed input.
[0173] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned / trained from data, such as the arithmetic codec.
[0174] Further information on neural network-based end-to-end learned video coding
[0175] Features as described herein may generally relate to NN-based end-to-end (E2E) learned video codecs. Referring now to FIG. 4, illustrated is an example of neural network-based end-to-end learned coding, such as an end-to-end learned video coding system or an end-to-end learned image coding system.
[0176] Even though some examples are provided with respect to coding images or videos, it is to be understood that other types of data may be coded in a similar way, such as audio, speech, text, features, etc. As shown in FIG. 4, a typical neural network-based end-to-end learned coding system comprises an encoder (405) and a decoder (460).
[0177] The encoder (405) comprises an encoder NN (415), a quantizer or quantization operation (425), a probability model (435), a lossless encoder (445) (for example arithmetic encoder). The decoder (460) comprises a lossless decoder (455) (for example, an arithmetic decoder), a probability model (465), a dequantizer or dequantization operation (475), and a decoder NN (485).
[0178] It is to be noted that the probability model (435) present at encoder side and the probability model (465) present at decoder side may be the same or substantially the same. For example, they may be two copies of the same probability model. The probability model (435, 465) may also be a neural network and / or may mainly comprise neural network components, and may be referred to as a neural network based probability model or learned probability model.
[0179] The lossless encoder (445) and the lossless decoder (455) form a lossless codec (440). A lossless codec may be an entropy-based lossless codec. An example of a lossless codec is an arithmetic codec, such as a context-adaptive binary arithmetic coding (CABAC) codec. Sometimes, the term lossless codec may refer to a system that also comprises the probability model, in addition to, for example, an arithmetic encoder and an arithmetic decoder.
[0180] The encoder NN (415) and the decoder NN (485) may typically be two neural networks, or may mainly comprise neural network components.
[0181] The quantization operation (425), dequantization operation (475) and lossless codec (440) are typically not based on neural network components, but may potentially comprise neural network components.
[0182] In the example of FIG. 4, the encoder NN (415) may take an input x (410), which may comprise, for example, an image to be compressed. The encoder NN (415) may output a latent tensor z (420). In one example, the latent tensor may be a 3D tensor, where the three dimensions of such tensor may represent a channel dimension, a vertical dimension (also sometimes referred to as height dimension) and a horizontal dimension (also sometimes referred to as width dimension). In another example, the latent tensor may be a 4D tensor, where the four dimensions of such tensor may represent a sample dimension (also sometimes referred to as batch dimension, which is the dimension along which different samples of data can be placed), a channel dimension, a vertical dimension (also sometimes referred to as height dimension) and a horizontal dimension (also sometimes referred to as width dimension). In yet another example, in the case of compressing a signal with a temporal dimension such as a video, the latent tensor may be a 4D tensor, where the four dimensions of such tensor may represent a channel dimension, a vertical dimension (also sometimes referred to as height dimension), a horizontal dimension (also sometimes referred to as width dimension), and a temporal dimension. The latent tensor (420) may be input to a quantization operation (425), obtaining a quantized latent tensor zq (430). The quantized latent tensor zq (430) may be lossless-encoded into a bitstream b (450) by the lossless encoder (445), based also on an output of the probability model (435). In particular, the probability model may take as input at least part of the quantized latent tensor zq (430) and may output an estimate of a probability, or an estimate of a probability distribution, or an estimate of one or more parameters of a probability distribution, for one or more elements of the quantized latent tensor. The bitstream (450) may represent an encoded or compressed version of the input x (410).
[0183] The bitstream (450) may be lossless-decoded by the lossless decoder (455) also based on an output of the probability model (465) present at decoder side, obtaining a quantized latent tensor zq (470). The quantized latent tensor may be dequantized by a dequantization operation (475), obtaining a reconstructed latent tensor z (480). The reconstructed latent tensor (480) may be input to a decoder NN (485), obtaining a reconstructed input x (490), i.e., a reconstructed version of the input x (410). The reconstructed input (490) may also be referred to as reconstructed data, or reconstruction, or decoded data, or decoded input, or decoded output, and the like.
[0184] FIG. 4 presents a simplified description of an end-to-end learned codec; more sophisticated designs, or variations of this design, are possible.
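As a non-limiting illustration, the data flow of FIG. 4 may be sketched as follows. This is a minimal sketch under stated assumptions and not the codec of the present disclosure: the functions `encoder_nn` and `decoder_nn` are hypothetical fixed linear transforms standing in for trained neural networks, the probability model is omitted, and `zlib` stands in for the arithmetic codec.

```python
import struct
import zlib


def encoder_nn(x):
    # Hypothetical stand-in for the encoder NN (415): a fixed linear transform.
    return [v / 8.0 for v in x]


def decoder_nn(z_rec):
    # Hypothetical stand-in for the decoder NN (485): the matching inverse.
    return [v * 8.0 for v in z_rec]


def quantize(z):
    # Quantization operation (425): round each latent element to an integer.
    return [int(round(v)) for v in z]


def dequantize(zq):
    # Dequantization operation (475): map integer levels back to floats.
    return [float(v) for v in zq]


def lossless_encode(zq):
    # Stand-in for the lossless encoder (445): int16 packing plus DEFLATE.
    return zlib.compress(struct.pack(f"<{len(zq)}h", *zq))


def lossless_decode(bitstream):
    # Stand-in for the lossless decoder (455).
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def encode(x):
    return lossless_encode(quantize(encoder_nn(x)))


def decode(bitstream):
    return decoder_nn(dequantize(lossless_decode(bitstream)))


x = [16.0, 18.0, 40.0, 41.0, 200.0, 8.0]
bitstream = encode(x)
x_rec = decode(bitstream)
```

In this sketch the only loss of information comes from the quantization step; the lossless stage returns the quantized latent exactly.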
[0185] The neural network components, or a subset of the neural network components, of an end-to-end learned codec may be trained by minimizing a rate-distortion loss function:
L = D + λR
[0186] where D is a distortion loss term, R is a rate loss term, and λ is a weight that controls the balance between the two losses. The distortion loss term may be referred to also as reconstruction loss term, or simply reconstruction loss. The rate loss term may be referred to simply as rate loss.
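As a non-limiting illustration, the rate-distortion loss may be computed as in the following sketch. It assumes that the distortion term is a mean squared error and that the rate term is the estimated number of bits derived from the probabilities a probability model assigns to the quantized latent elements; the function names and the variable `lam` (the weight balancing the two terms) are hypothetical.

```python
import math


def mse(x, x_rec):
    # Distortion term D: mean squared error between input and reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def estimated_rate_bits(probs):
    # Rate term R: sum of -log2(p) over the probabilities assigned to the
    # quantized latent elements (an estimate of the coded size in bits).
    return -sum(math.log2(p) for p in probs)


def rd_loss(x, x_rec, probs, lam):
    # Rate-distortion loss: D plus the weighted rate term.
    return mse(x, x_rec) + lam * estimated_rate_bits(probs)


x = [1.0, 2.0, 3.0, 4.0]
x_rec = [1.0, 2.0, 2.0, 4.0]
probs = [0.5, 0.25, 0.25]
loss = rd_loss(x, x_rec, probs, lam=0.01)
```

With these example values, D is 0.25, R is 5 bits, and the loss is 0.25 + 0.01 × 5 = 0.30.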
[0187] The distortion loss term measures the quality of the reconstructed or decoded output, and may comprise (but may not be limited to) one or more of the following:
[0188] - Mean square error (MSE)
[0189] - Structural similarity index measure (SSIM)
[0190] - Multiscale structural similarity index measure (MS-SSIM)
[0191] - Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm.
[0192] - Losses derived from the use of a neural network that is trained (substantially) simultaneously with the end-to-end learned codec. For example, adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of Generative Adversarial Networks (GANs) and their variants.
[0193] - Loss that is related to a performance of one or more machine analysis tasks or to an estimated performance of one or more machine analysis tasks, where the one or more machine analysis tasks may comprise classification, object detection, image segmentation, instance segmentation, etc. In one example, the estimated performance of one or more machine analysis tasks may comprise a distortion computed based at least on a first set of features extracted from an output of the decoder, and a second set of features extracted from a respective ground truth data, where the first set of features and the second set of features are output by one or more layers of a pretrained feature-extraction neural network.
[0194] Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM.
[0195] The rate loss term may be used to train the encoder NN to output a low-entropy latent tensor, or a latent tensor such that the quantized latent tensor has low entropy, or a latent tensor such that the probability distribution of the quantized latent tensor may be better estimated or predicted by the probability model.
[0196] The rate loss term may be used to train the probability model to better estimate or predict the probability distribution of the quantized latent tensor.
[0197] Examples of the rate loss terms include the following:
[0198] - In one example, the rate loss term may be derived from the output of the probability model, and it may represent the estimated entropy of the quantized latent representation, which may indicate the number of bits necessary to represent the quantized latent tensor.
[0199] - A sparsification loss, i.e., a loss that encourages the quantized latent tensor to comprise many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm.
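As a non-limiting illustration, the sparsification measures named above may be evaluated on a quantized latent tensor as in the following sketch. The tensor is flattened to a list and the function names are hypothetical. The L0 count is not differentiable, so in this form it is suited to evaluation rather than to gradient-based training.

```python
import math


def l0_norm(zq):
    # Number of non-zero elements.
    return sum(1 for v in zq if v != 0)


def l1_norm(zq):
    # Sum of absolute values.
    return sum(abs(v) for v in zq)


def l1_over_l2(zq):
    # Ratio of L1 norm to L2 norm; smaller when fewer elements carry the energy.
    l2 = math.sqrt(sum(v * v for v in zq))
    return l1_norm(zq) / l2 if l2 > 0 else 0.0


dense = [1, -1, 1, -1]
sparse = [2, 0, 0, 0]
```

Both example tensors have the same L2 norm, but the sparse one yields the lower value under each of the three measures.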
[0200] In order to train the neural network components, or a subset of the neural network components, of an end-to-end learned codec, one or more reconstruction losses may be used, and one or more rate losses may be used. In one example, the one or more reconstruction losses and / or one or more rate losses may be combined by means of a weighted sum. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion performance. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less, but to reconstruct with higher accuracy (e.g. as measured by a metric that correlates with the reconstruction losses). These weights are usually considered to be hyper-parameters of the training process, and may be set manually by the person designing the training process, or automatically, for example by grid search or by using additional neural networks.
[0201] In one case, the training process may be performed jointly with respect to the distortion loss D and the rate loss R. In another case, the training process may be performed in two alternating phases, where in a first phase only the distortion loss D may be used, and in a second phase only the rate loss R may be used.
[0202] For lossless video / image compression, the system may only comprise the probability model and lossless encoder and lossless decoder. The loss function would comprise only the rate loss, since the distortion loss is always zero (i.e., no loss of information).
[0203] In the present disclosure, the terms inference phase, inference stage, inference time, and test time refer to the phase when a neural network or a codec is used for its purpose, such as encoding and decoding an input image.
[0204] Information on Video Coding for Machines (VCM)
[0205] Features as described herein may generally relate to video coding for machines (VCM). Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e. consuming / watching the decoded images or videos. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (i.e., autonomous agents) that analyze data independently from humans, and may even make decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. For example, such analysis tasks may be performed by neural networks.
[0206] It is likely that the device where the analysis takes place has multiple “machines” or neural networks (NNs). These multiple machines may be used in a certain combination which is, for example, determined by an orchestrator sub-system. The multiple machines may be used, for example, in succession, based on the output of the previously used machine, and / or in parallel. For example, a video may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.
[0207] Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc. In addition to image and video data, automatic analysis and processing is increasingly being performed for other types of data, such as audio, speech, text.
[0208] Compressing (and decompressing) data where the end user comprises machines (e.g., neural networks) is commonly referred to as compression or coding for machines. In the case of video data, it is referred to as video compression or coding for machines (VCM). Compressing for machines may differ from compressing for humans, for example, with respect to the algorithms and technology used in the codec, or the training losses used to train any neural network components of the codec, or the evaluation methodology of codecs.
[0209] It is to be understood that, when considering the case of coding for machines, the term “receiver-side” or “decoder-side” refers to the physical or abstract entity or device which comprises one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, the “encoder-side device”.
[0210] Referring now to FIG. 5, illustrated is an example of a pipeline of video coding for machines. A VCM encoder (510) may encode the input video (505) into a bitstream (515). A bitrate (525) may be computed (520) from the bitstream (515), as a measure of the size of the bitstream. A VCM decoder (530) may decode the bitstream (515) that was produced by the VCM encoder (510).
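As a non-limiting illustration, the bitrate computation (520) may be sketched as follows. The sketch assumes that the bitrate is expressed in bits per second over the duration of the encoded video; the function and parameter names are hypothetical.

```python
def bitrate_bps(bitstream: bytes, num_frames: int, fps: float) -> float:
    # Size of the bitstream in bits divided by the video duration in seconds.
    duration_s = num_frames / fps
    return 8 * len(bitstream) / duration_s


# Example: a 125000-byte bitstream covering 50 frames at 25 frames per second.
rate = bitrate_bps(bytes(125000), num_frames=50, fps=25.0)
```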
[0211] The output of the VCM decoder (530) may be referred to as “Decoded data for machines” (535). This data may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder. For example, this data may not be easily understandable by a human by simply rendering the data onto a screen, if such rendering is possible.
[0212] The output or decoded data for machines (535) of the VCM decoder (530) may then be input to one or more task neural networks (540, 545, 550, 555). In FIG. 5, for the sake of illustrating that there may be any number of task-NNs, there are three example task-NNs, and a non-specified one (Task-NN X, 555). One goal of VCM may be to obtain a low bitrate while guaranteeing that the task-NNs still perform well (580, 585, 590, 595) in terms of the evaluation metric associated to each task (560, 565, 570, 575).
[0213] It is to be understood that, in some cases, the VCM decoder may not be present. In one example, the machines may be run directly on the bitstream. In some other cases, the VCM decoder may comprise only a lossless decoding stage, and the lossless decoded data may be provided as input to the machines. In yet some other cases, the VCM decoder may comprise a lossless decoding stage followed by a dequantization operation, and the lossless-decoded and dequantized data may be provided as input to the machines.
[0214] When a conventional video encoder, such as a H.266 / VVC encoder, is used as a VCM encoder, one or more of the following approaches may be used to adapt the encoding to be suitable to machine analysis tasks:
[0215] - One or more regions of interest (ROIs) may be detected. An ROI detection method may be used. For example, ROI detection may be performed using a task NN, such as an object detection NN. In some cases, ROI boundaries of a group of pictures or an intra period may be spatially overlaid and rectangular areas may be formed to cover the ROI boundaries. The detected ROIs (or rectangular areas, likewise) may be used in one or more of the following ways: the quantization parameter (QP) may be adjusted spatially in a manner that ROIs are encoded using finer quantization step size(s) than other regions. For example, QP may be adjusted CTU-wise; the video may be preprocessed to contain only the ROIs, while the other areas may be replaced by one or more constant values or removed; the video may be preprocessed so that the areas outside the ROIs are blurred or filtered; or, a grid may be formed in a manner that a single grid cell covers a ROI. Grid rows or grid columns that contain no ROIs may be down-sampled as preprocessing to encoding.
[0216] - Quantization parameter of the highest temporal sublayer(s) may be increased (i.e. coarser quantization is used) when compared to practices for human watchable video.
[0217] - The original video may be temporally down-sampled as preprocessing prior to encoding. A frame rate up-sampling method may be used as postprocessing subsequent to decoding, if machine analysis at the original frame rate is desired.
[0218] - A filter may be used to preprocess the input to the conventional encoder. The filter may be a machine learning based filter, such as a convolutional neural network.
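As a non-limiting illustration of the first approach in the list above, a CTU-wise QP adjustment driven by detected ROIs may be sketched as follows. The sketch assumes rectangular ROIs given as (x0, y0, x1, y1) in luma samples with exclusive right and bottom edges, and assumes the usual convention that a lower QP yields a finer quantization step size; the function name, the parameter names and the example values are hypothetical.

```python
def ctu_qp_map(width, height, ctu_size, rois, base_qp, roi_qp_offset):
    # Number of CTU columns and rows needed to cover the picture.
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    qp_map = [[base_qp] * cols for _ in range(rows)]
    for x0, y0, x1, y1 in rois:
        # Every CTU that overlaps the ROI rectangle receives the ROI QP.
        for r in range(y0 // ctu_size, (y1 - 1) // ctu_size + 1):
            for c in range(x0 // ctu_size, (x1 - 1) // ctu_size + 1):
                if 0 <= r < rows and 0 <= c < cols:
                    qp_map[r][c] = base_qp + roi_qp_offset
    return qp_map


# A 256x128 picture with 64x64 CTUs and one ROI; ROI CTUs get QP 31, others QP 37.
qp_map = ctu_qp_map(256, 128, 64, [(70, 10, 130, 60)],
                    base_qp=37, roi_qp_offset=-6)
```

In this example the ROI spans the second and third CTU columns of the first CTU row, so only those two CTUs receive the lower QP.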
[0219] It is to be understood that, in the context of video coding for machines, the terms “machine vision”, “machine vision task”, “machine task”, “machine analysis”, “machine analysis task”, “computer vision”, “computer vision task”, "task network" and “task” may be used interchangeably. Also, it is to be understood that, in the context of video coding for machines, the terms “machine consumption” and “machine analysis” may be used interchangeably.
[0220] Neural network based filtering
[0221] A neural network may be used for filtering or processing input data. Such a neural network may be referred to as a neural network based filter, or simply as a NN filter. A NN filter may comprise one or more neural networks, and / or one or more components that may not be categorized as neural networks (i.e. may be categorized as traditional or legacy components that are not trained based on data using machine learning techniques). The purpose of a NN filter may comprise (but may not be limited to) visual enhancement, colorization, up-sampling, super-resolution, inpainting, temporal extrapolation, generating content, or the like.
[0222] In some video codecs, a neural network may be used as filter in the encoding and decoding loop (also referred to simply as coding loop), and it may be referred to as a neural network loop filter, or a neural network in-loop filter. The NN loop filter may replace all other loop filters of an existing video codec, or may represent an additional loop filter with respect to the already present loop filters in an existing video codec.
[0223] A neural network filter may be used as a post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts.
[0224] In one example, a codec is a modified VVC / H.266 compliant codec (e.g., a VVC / H.266 compliant codec that has been modified and thus it may not be compliant to the VVC / H.266) that comprises one or more NN loop filters. An input to the one or more NN loop filters may comprise at least a reconstructed block or frame (simply referred to as reconstruction) or data derived from a reconstructed block or frame (e.g., the output of a conventional loop filter). The reconstruction may be obtained based on predicting a block or frame (e.g., by means of intra-frame prediction or inter-frame prediction) and performing residual compensation. The one or more NN loop filters may enhance the quality of at least one of their inputs, so that a rate-distortion loss is decreased. The rate may indicate a bitrate (estimate or real) of the encoded video. The distortion may indicate a pixel fidelity distortion such as the following:
[0225] - Mean-squared error (MSE).
[0226] - Mean absolute error (MAE).
[0227] - Mean Average Precision (mAP) computed based on the output of a task NN (such as an object detection NN) when the input is the output of the post-processing NN.
[0228] - Other machine task-related metric, for tasks such as object tracking, video activity classification, video anomaly detection, etc.
[0229] The enhancement may result in a coding gain, which may be expressed for example in terms of BD-rate or BD-PSNR (peak signal-to-noise ratio).
[0230] A neural network filter may be used as a post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts. In one example, the NN filter may be used as a post-processing filter where the input comprises data that is output by or is derived from an output of a traditional decoder, such as a decoder that is compliant with the VVC / H.266 standard. In another example, the NN filter may be used as a post-processing filter where the input comprises data that is output by or is derived from an output of a decoder of an end-to-end learned codec.
[0231] Input to a NN filter
[0232] Various inputs may be provided to a NN filter. In the case of filtering images, a filter may take as input at least one or more first images to be filtered and may output at least one or more second images, where the one or more second images are the filtered version of the one or more first images. In one example, the filter may take as input one image, and output one image. In another example, the filter may take as input more than one image, and output one image. In another example, the filter may take as input more than one image, and output more than one image.
[0233] It is to be understood that a filter may take as input also other data (also referred to as auxiliary data, or extra data) besides the data that is to be filtered, such as data that may aid the filter to perform a better filtering than if no auxiliary data was provided as input. In one example, the auxiliary data may comprise information about prediction data, and / or information about the picture type, and / or information about the slice type, and / or information about a Quantization Parameter (QP) used for encoding, and / or information about boundary strength, etc. In one example, the filter may take as input one image and other data associated to that image, such as information about the quantization parameter (QP) used for quantizing and / or dequantizing that image, and output one image.
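As a non-limiting illustration, one possible way of providing QP information as auxiliary data is to append a constant plane holding the normalized QP to the image that is to be filtered. The following sketch assumes a single-channel image stored as a list of rows and a maximum QP of 63; the function name, the normalization and the plane layout are hypothetical choices, not requirements of the present disclosure.

```python
def build_filter_input(image, qp, max_qp=63):
    # image: list of rows of sample values (one channel).
    height, width = len(image), len(image[0])
    # Auxiliary plane: the QP, normalized to [0, 1], repeated at every position.
    qp_plane = [[qp / max_qp] * width for _ in range(height)]
    # The filter input is a stack of planes: [image plane, QP plane].
    return [image, qp_plane]


planes = build_filter_input([[0.1, 0.2], [0.3, 0.4]], qp=21)
```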
[0234] Information on overfitting a neural network filter
[0235] Features as described herein may generally relate to adaptation of a NN. A NN filter may be adapted at test time based at least on part of the data to be encoded and / or decoded and / or post-processed. Such operation may be referred to, for example, with one of the following terms, when their meaning is clear from the context: adaptation, content adaptation, overfitting, finetuning, optimization, specialization, and the like.
[0236] Although, for simplicity, the case of a NN filter is being considered herein, similar adaptation may be performed for other coding tools and / or post-processing tools that are based on neural network technology. For example, a neural network based intra-frame prediction, or a neural network based inter-frame prediction, etc.
[0237] The NN filter that results from the adaptation process may be referred to, for example, with one of the following terms: adapted filter, content-adapted filter, overfitted filter, finetuned filter, optimized filter, specialized filter, and the like.
[0238] At the encoder side, the adaptation process may start with an initial NN filter. In one example, the initial NN filter may be a pretrained NN filter that was pretrained during an offline stage on a sufficiently large dataset. In another example, the initial NN filter may be a randomly initialized NN filter.
[0239] In the adaptation, one or more parameters of the NN filter may be adapted. Examples of such parameters may include (but may not be limited to) the following: the bias terms of a convolutional neural network; multiplier parameters that multiply one or more tensors produced by the NN filter, such as one or more feature tensors that are output by respective one or more layers of the NN filter; parameters of the kernels of a convolutional neural network; parameters of an adapter layer; or one or more arrays or tensors that are used as input to respective one or more layers of the NN filter.
[0240] The adaptation may be performed by means of a training process, e.g., by minimizing a loss function until a stopping criterion is met. The data used for this training process may comprise one or more pictures or blocks of input to the NN filter and associated respective one or more pictures or blocks of ground-truth data. In one example where the filter is an in-loop filter, the input to the NN filter may be reconstruction data, after prediction and residual compensation; the ground-truth data may be the uncompressed data that is given as input to the encoder. In one example where the filter is a post-processing filter, the input to the NN filter may be decoded data (e.g., the output of a video decoder); the ground-truth data may be the uncompressed data that is given as input to the encoder.
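As a non-limiting illustration, the adaptation by training may be sketched with a deliberately small filter. The sketch assumes a hypothetical NN filter with a single adaptable bias term, a mean squared error loss, plain gradient descent, and a stopping criterion based on the loss improvement; all names and hyper-parameter values are hypothetical.

```python
def nn_filter(x, bias):
    # Hypothetical filter with one adaptable parameter: adds a bias to the input.
    return [v + bias for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def overfit_bias(x, ground_truth, bias=0.0, lr=0.25, tol=1e-12, max_steps=1000):
    # Minimize the MSE between filter output and ground truth over the bias.
    prev_loss = mse(nn_filter(x, bias), ground_truth)
    for _ in range(max_steps):
        out = nn_filter(x, bias)
        # Gradient of the MSE with respect to the bias.
        grad = 2.0 * sum(o - g for o, g in zip(out, ground_truth)) / len(x)
        bias -= lr * grad
        loss = mse(nn_filter(x, bias), ground_truth)
        # Stopping criterion: the loss no longer improves by more than tol.
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return bias


reconstruction = [10.0, 20.0, 30.0]   # input to the filter (e.g. decoded data)
ground_truth = [12.0, 21.0, 33.0]     # uncompressed data at the encoder
adapted_bias = overfit_bias(reconstruction, ground_truth)
```

For this example the adapted bias converges toward 2.0, the mean difference between the ground truth and the reconstruction.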
[0241] The loss function used during the training process may comprise one or more distortion loss functions (also referred to as reconstruction loss functions) and zero or more rate loss functions. A rate loss function may measure, for example, the cost in terms of bitrate of signaling any adaptation signal, such as updates to the parameters of the NN filter. A distortion loss function may comprise one of MSE, MS-SSIM, Video Multimethod Assessment Fusion (VMAF), etc.
[0242] The adaptation signal may be derived or determined based on the adapted NN filter and on the original NN filter (i.e., the NN filter before the overfitting process). In one example, the adaptation signal comprises an update to one or more parameters of the NN filter. Such an update may also be referred to as weight update, or parameter update. Such update may be computed, for example, by subtracting the values of the original parameters (i.e., the parameters of the original NN filter) from the corresponding values of the adapted parameters (i.e., the parameters of the adapted NN filter), so that adding the update to the original parameters yields the adapted parameters. In another example, the adaptation signal may comprise the parameters (of the NN filter) that were adapted, also referred to as updated parameters, or adapted parameters, or adapted weights, or overfitted parameters, and the like.
[0243] In order to keep the size of the adaptation signal low, the adaptation signal may go through one or more compression steps, such as sparsification, quantization and lossless coding, etc. In one example, an encoder that compresses the adaptation signal into a bitstream that is compliant with a neural network compression standard, such as MPEG neural network coding (NNC), may be used.
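As a non-limiting illustration, the compression steps named above may be sketched as follows. The sketch assumes magnitude thresholding for sparsification, uniform scalar quantization to 8-bit levels, and `zlib` as a stand-in for the lossless coding stage; it does not produce a bitstream compliant with MPEG NNC, and the function names and parameter values are hypothetical.

```python
import struct
import zlib


def compress_update(update, threshold, step):
    # Sparsification: set small-magnitude updates to zero.
    sparse = [u if abs(u) >= threshold else 0.0 for u in update]
    # Quantization: uniform step size, clamped to the signed 8-bit range.
    levels = [max(-128, min(127, int(round(u / step)))) for u in sparse]
    # Lossless coding: pack the levels as int8 and deflate them.
    return zlib.compress(struct.pack(f"<{len(levels)}b", *levels))


def decompress_update(blob, step):
    # Lossless decoding followed by dequantization.
    raw = zlib.decompress(blob)
    levels = struct.unpack(f"<{len(raw)}b", raw)
    return [lv * step for lv in levels]


update = [0.001, -0.26, 0.0, 0.49, -0.004]
blob = compress_update(update, threshold=0.01, step=0.125)
restored = decompress_update(blob, step=0.125)
```

The receiver needs the quantization step size in order to dequantize, so in this sketch the step would have to be signaled or otherwise known at the decoder side.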
[0244] The compressed adaptation signal may be signaled from encoder to decoder in or along a bitstream that represents encoded image or video data. In one example, the compressed adaptation signal may be signaled in an Adaptation Parameter Set (APS) syntax structure of a video coding bitstream. In another example, the compressed adaptation signal may be signaled in a Supplemental Enhancement Information (SEI) message of a video coding bitstream. Signaling may comprise also other information which is associated with the adaptation signal and that may be required for correctly parsing and / or decompressing and / or using the adaptation signal, such as any quantization parameters.
[0245] Referring now to FIG. 6, illustrated is an example of an overfitting process (605) at the encoder side. The overfitting process (605) may be performed at the encoder side based on a training process. Input (610) may be provided to a NN filter (615) to determine an output (620). Loss (640) may be computed (630) between ground truth (625) and the output (620). The loss (640) may be provided to determine overfitting (635), which may be provided to the NN filter (615).
[0246] The resulting overfitted filter (645) may then be used to derive an overfitting signal (655), or adaptation signal (660). The overfitting signal may be derived or determined based partially on the original NN filter (650). The adaptation signal (660) may be compressed (665) to determine a compressed adaptation signal (670) and then signaled (675) from the encoder to the decoder, in or along a bitstream that represents encoded data, such as an encoded image or video.
[0247] In the example of FIG. 6, x (610) represents an input to the NN filter, x (620) represents an output of the NN filter (615), x (625) represents ground-truth data associated with x (610), “Compute loss” (630) may compute a training loss I (640) in order to overfit the NN filter, and “Overfit” (635) may use I (640) to overfit the NN filter (615). As a result of the overfitting process (605), an overfitted NN filter may be obtained (645), which may be used (655), together with the original NN filter (650), to derive an adaptation signal (660). The adaptation signal may be compressed (665) and signaled (675) to a decoder or receiver.
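The encoder-side loop of FIG. 6 could be sketched as follows. The sketch rests on several assumptions that are not taken from the disclosure: the NN filter is replaced by a toy two-parameter linear filter, the training loss is a mean squared error, the overfitting uses plain gradient descent, and the adaptation signal is taken to be the difference between the overfitted and the original parameters.

```python
def overfit_filter(weights, inputs, targets, lr=0.5, steps=200):
    """Overfit a toy filter y = w0 * x + w1 to (inputs, targets)."""
    w0, w1 = weights
    n = len(inputs)
    for _ in range(steps):
        # "Compute loss": gradients of the mean squared error w.r.t. w0 and w1.
        g0 = sum(2 * (w0 * x + w1 - t) * x for x, t in zip(inputs, targets)) / n
        g1 = sum(2 * (w0 * x + w1 - t) for x, t in zip(inputs, targets)) / n
        # "Overfit": gradient-descent update of the filter parameters.
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]


def derive_adaptation_signal(original, overfitted):
    """Assumed form of the adaptation signal: a per-parameter weight update."""
    return [o - p for o, p in zip(overfitted, original)]
```

With this form of the signal, a receiver that holds the original parameters can recover the overfitted ones by adding the update, up to any loss introduced by compression.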
[0248] At the decoder or receiver side, the signaled compressed adaptation signal may be received and decompressed. The decompressed adaptation signal may then be used to update the NN filter. In one example, where the adaptation signal may comprise a weight update, where the weight update may comprise one or more updates to respective one or more parameters of the NN filter, the one or more updates may be added to the one or more parameters. In another example, where the adaptation signal may comprise one or more updated or adapted parameters, the one or more updated or adapted parameters may be used to replace respective one or more parameters of the NN filter.
[0249] Once the NN filter has been updated based on the adaptation signal, the updated NN filter may be used for its purpose, for example, for filtering an input picture or an input block.
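The two update modes described above (adding a weight update, or replacing parameters) could be sketched as follows. The function name, the `mode` argument and the dictionary layout used for replacement are hypothetical choices made for illustration.

```python
def update_filter(params, adaptation, mode="add"):
    """Return updated filter parameters from a decompressed adaptation signal."""
    if mode == "add":
        # The adaptation signal is a weight update: add it parameter by parameter.
        return [p + d for p, d in zip(params, adaptation)]
    if mode == "replace":
        # The adaptation signal carries adapted parameters, keyed by parameter index.
        updated = list(params)
        for index, value in adaptation.items():
            updated[index] = value
        return updated
    raise ValueError(f"unknown update mode: {mode}")
```

In the replacement mode only the listed parameters change, so a signal that adapts a small subset of the filter stays small.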
[0250] Referring now to FIG. 7, illustrated is an example of use of an adaptation signal for overfitting at the decoder or receiver side. A compressed adaptation signal (710) may be decompressed (720) to derive a decompressed adaptation signal (730). At the decoder side, the overfitting signal (730), or a signal derived or determined from the overfitting signal, may be used to update (750) the NN filter (740). The updated NN filter (760) may then be used to filter one or more pictures, or one or more blocks.
[0251] In the examples of FIGs. 6-7, the NN filter that is obtained from the overfitting process at encoder side may be different from the NN filter that is obtained from the updating process at decoder side. For example, one reason may be that the adaptation signal may be compressed in a lossy way. Thus, the former NN filter may be referred to as overfitted filter or adapted filter (or other similar terms, see above), and the latter NN filter may be referred to as updated filter.
[0252] In the present disclosure, the terms frame, picture and image may be used interchangeably. For example, the input and output to an end-to-end learned codec may be pictures. The input and output of a NN filter may be pictures. It is to be understood that also the term block, when it means a portion of a picture, may be simply referred to as frame or picture or image. In other words, at least some of the embodiments herein, even when described as applied to a picture, may be applicable also to a block, e.g., to a portion of a picture.
[0253] Example embodiments of the present disclosure may consider image and video as the data types. However, this is not limiting; the example embodiments may be extended to other types of data, such as audio.
[0254] In at least some of the embodiments described herein, image and video data may be collectively referred to as visual data, and it is to be understood that visual data may refer to either image data or video data or both. The proposed embodiments can also be extended to data that are associated or related to visual data such as depth measurements and transparency information.
[0255] In the present disclosure, the terms signal, data, tensor and information may be used interchangeably to indicate an input or an output.
[0256] In the present disclosure, an end-to-end learned codec may be referred to also as E2E learned codec, or learned codec, or E2E codec.
[0257] In the present disclosure, neural network layers may be simply referred to as layers, or as a set of layers.
[0258] In at least some embodiments, a generator or generative neural network may refer to a neural network that may have generative capabilities, or a neural network that may be considered as generative artificial intelligence (AI), or a neural network that is capable of generating new content, or a neural network that is capable of extrapolating data, for example, with respect to a training distribution. An example of a generator is a neural network trained based on the Generative Adversarial Network (GAN) algorithm or paradigm, such as an image generator. Another example generator is a neural network trained based on diffusion modelling (also referred to as denoising score matching, or denoising diffusion probabilistic modelling), such as an image diffusion model. Another example generator is a neural network that represents a 3-dimensional scene as a neural radiance field (NeRF) and is capable of synthesizing new views for the scene.
[0259] In some embodiments, terms domain and space may be used interchangeably when referring to a domain or space of some data, such as of images. Domain or space may refer to some characteristics of some data, or a type of data. For example, image space and image domain may be used interchangeably; and feature space and feature domain may be used interchangeably.
[0260] In some embodiments, terms pixel domain, pixel space, picture domain, picture space, image domain, image space may be used interchangeably.
[0261] In some embodiments, terms latent and feature may be used interchangeably. For example, terms latent space and feature space may be used interchangeably.
[0262] It is to be understood that one or more operations performed by a data decoder, such as an image decoder, may be comprised in a data encoder, such as an image encoder. In an example, all the operations of an image decoder may be present in an image encoder.
[0263] EMBODIMENTS
[0264] Referring now to FIG. 8, illustrated is an example of a generator-nested codec according to an example embodiment of the present disclosure. In this embodiment, an output of a latent generator 801, or of a first portion of a generator 802, may be input to an encoder 804 to obtain a bitstream 806. The bitstream 806 is input to a decoder 808 to obtain a decoded output 810, and the decoded output 810 is input to a synthesis operation or to a second portion of the generator 812 to generate an output of the second portion of the generator 814.
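The data flow of FIG. 8 could be sketched as follows. Every component here is a toy stand-in chosen for illustration: the first portion draws a latent vector from a seeded random generator, the codec is a lossless fixed-length packing, and the second portion maps each latent value to two 8-bit pixels. None of these choices is taken from the disclosure.

```python
import random
import struct


def first_portion(seed, size=8):
    """Toy latent generator: maps a noise seed to a latent vector in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def encode(latent):
    """Toy lossless encoder: packs the latent as 64-bit floats into a bitstream."""
    return struct.pack(f"<{len(latent)}d", *latent)


def decode(bitstream):
    """Toy lossless decoder: recovers the latent from the bitstream."""
    return list(struct.unpack(f"<{len(bitstream) // 8}d", bitstream))


def second_portion(decoded):
    """Toy synthesis: maps each latent value to two identical 8-bit pixels."""
    pixels = []
    for v in decoded:
        level = max(0, min(255, round((v + 1.0) * 127.5)))
        pixels.extend([level, level])
    return pixels


# Encoder side: generate the latent and obtain the bitstream.
bitstream = encode(first_portion(seed=0))
# Decoder side: decode the bitstream and run the second portion of the generator.
output = second_portion(decode(bitstream))
```

Only the bitstream crosses from the encoder side to the decoder side; the generated latent itself is never transmitted in uncompressed form.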
[0265] Referring now to FIGs. 9a and 9b, illustrated is an example of a generator-nested codec according to an example embodiment of the present disclosure. In this embodiment, the generator 900 may comprise a first neural network and a second neural network, where the first neural network comprises a latent generator 902 or the first portion of the generator 904, and where the second neural network comprises the second portion of the generator 906 or the synthesis operation 908.
[0266] In an embodiment, the encoder and the decoder are comprised in or form a codec.
[0267] In an embodiment, the latent generator or the first portion of the generator comprises a first component and a second component, where the first component may map or transform an input to a latent space (e.g., extract features), and where the second component processes the mapped or transformed input in latent space. In an example, the first component comprises a neural network that maps or transforms an input image to features, and the second component comprises a diffusion model that performs an iterative denoising process on the features. In an additional embodiment, an input to the first component may comprise a noise signal. In an example, an input to the first component comprises a noise signal and an image. In an additional embodiment, an input to the second component may comprise a noise signal. In an example, an input to the second component comprises features and a noise signal.
[0268] In an embodiment, the second portion of the generator or frame synthesis may comprise a neural network that maps the generated latent data to an output in image domain. In another embodiment, the second portion of the generator or frame synthesis may comprise a first component and a second component. The first component may comprise a denoising process and the second component may be a neural network that converts the output of the first component into a final output in image domain. The denoising process may denoise an input to the denoising process, such as generated latent data, or improve a quality of the input to the denoising process according to one or more quality metrics. In an additional embodiment, the denoising process is a neural network. In another additional embodiment, the denoising process is a diffusion model, where the diffusion model may perform an iterative process to denoise the input generated latent data. The diffusion model in the second portion of the generator may be different or substantially different from the diffusion model in the first portion of the generator.
[0269] Inputs and outputs
[0270] In an embodiment, an input to the latent generator or to the first portion of the generator may comprise one or more of the following: one or more noise samples; one or more prompts, such as a text prompt; one or more images; one or more videos; one or more indications of respective one or more locations or positions, such as indications of positions of pixels to be generated; one or more indications of respective one or more time instants, such as indications of indexes of video frames to be generated; one or more indications of respective one or more spatio-temporal positions; one or more indications of respective one or more data items or data points to be generated.
[0271] Referring now to FIG. 10, illustrated is an example of a generator-nested codec according to an example embodiment of the present disclosure. In this embodiment, an input to the latent generator, or to the first portion of the generator 1002, or to the synthesis operation, or to the second portion of the generator 1004 may comprise one or more controlling signals 1006. An example for the controlling signal may be an additional conditioning prompt for the generation process, or data derived from an additional conditioning prompt.
[0272] Referring now to FIG. 11, illustrated is an example of a generator-nested codec according to an example embodiment of the present disclosure. FIG. 11 illustrates another example of the embodiment described in FIG. 10, where the one or more controlling signals 1102 are processed by a processing unit 1104, such as a neural network. The outputs of the processing unit 1104 may be the same or different for its targeted components. Examples for the targeted components of the processing unit include: one or more intermediate features in the first portion of the generator 1106, or one or more intermediate features in the second portion of the generator 1108.
[0273] In an additional embodiment, the controlling signal received by the first portion of the generator may be different or at least partially different from the controlling signal received by the second portion of the generator. In yet another embodiment, the processing unit for the first portion of the generator may be different or at least partially different from the processing unit for the second portion of the generator.
[0274] In an additional embodiment, one or more controlling signals may be combined with the one or more intermediate features of the latent generator, or one or more intermediate features of the first portion of the generator, or one or more intermediate features of the synthesis operation, or one or more intermediate features of the second portion of the generator. Examples for the combination operation could be element-wise summation, element-wise multiplication, linear combination, or concatenation.
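The combination operations listed above could be sketched on flat feature vectors as follows. The function name, the mode labels and the weighting parameter `alpha` are hypothetical, and the sketch assumes the controlling signal already has the same length as the features for the element-wise modes.

```python
def combine(features, control, mode, alpha=0.5):
    """Combine intermediate features with a controlling signal."""
    if mode == "sum":
        # Element-wise summation.
        return [f + c for f, c in zip(features, control)]
    if mode == "product":
        # Element-wise multiplication.
        return [f * c for f, c in zip(features, control)]
    if mode == "linear":
        # Linear combination with an assumed scalar weight alpha.
        return [alpha * f + (1.0 - alpha) * c for f, c in zip(features, control)]
    if mode == "concat":
        # Concatenation along the (only) feature axis.
        return list(features) + list(control)
    raise ValueError(f"unknown combination mode: {mode}")
```

Concatenation is the only mode here that changes the feature dimension, so the layer that consumes the combined features would need to expect the larger size.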
[0275] In another additional embodiment, one or more controlling signals may be combined with the parameters from one or more layers of the latent generator, or of the first portion of the generator, or of the synthesis operation, or of the second portion of the generator. The combination may be element-wise summation, element-wise multiplication, linear combination, pruning or expanding.
[0276] In one embodiment, the controlling signal that is used by or for the second portion of the generator, or data from which such controlling signal is derived, may be determined at an encoder side and signaled, in or along the bitstream, to a decoder side that comprises the second portion of the generator. In another embodiment, the controlling signal may be determined at a decoder side that comprises the second portion of the generator. In yet another embodiment, the controlling signal may be determined externally to a decoder side that comprises the second portion of the generator and may be provided to the second portion of the generator.
[0277] In an embodiment, the latent generator or the first portion of the generator may be a latent-space generator and may generate latent data, such as a latent tensor, or features, or a feature tensor. In an example, the latent generator may be a latent diffusion model (LDM), such as a Diffusion Transformer in latent space.
[0278] In an embodiment, an output of the synthesis operation or of the second portion of the generator may comprise a generated output. In an example, when an input to the first portion of the generator is an image, or when an output of the first portion of the generator comprises features of an image, an output of the second portion of the generator is a generated image.
[0279] Synthesis operation
[0280] In an embodiment, the synthesis operation may be a neural network and may be referred to also as a synthesis neural network. In an example, the synthesis neural network may have been used when training the generator, e.g., an output of the latent generator is input to the synthesis neural network to obtain a synthesized output, where the synthesized output is used to compute a training loss based also on ground-truth data. The neural network representing the synthesis operation may sometimes be referred to as “neural network decoder” or simply “decoder” in some of the machine learning literature, but it needs to be understood that the meaning of the term “decoder” is different from what is generally referred to in the present embodiments, e.g., in some of the machine learning literature and / or computer vision literature and / or artificial intelligence literature, “decoder” or “neural network decoder” may refer to a process or operation or neural network that maps or transforms features to a target domain such as visual data, image data, video data, audio data, etc., whereas in the present embodiments and / or in some image processing literature and / or data compression literature and / or information theory literature, “decoder” or “neural network decoder” may refer to a decompression process that may be part of a codec, such as a decoder of an image codec or a neural network decoder of an end-to-end learned image codec. It is further to be understood that the two meanings or uses of the term “decoder” described here may also overlap; a neural network that acts as a decoder or part of a decoder of an end-to-end learned codec may map features to a target domain such as images.
[0281] Lossless codec
[0282] Referring now to FIG. 12, illustrated is an example of a lossless codec according to an example embodiment of the present disclosure. In this embodiment, the codec is a lossless codec, the encoder is a lossless encoder, and the decoder is a lossless decoder.
[0283] Referring now to FIG. 13, illustrated is an example of a lossless codec according to an example embodiment of the present disclosure. In this additional embodiment, where the codec is a lossless codec, the codec comprises a learned probability model 1302 (may be referred to also as learned entropy model) that estimates one or more parameters of a probability distribution. In an example, the one or more parameters (or the probability distribution parameterized by the one or more parameters) are used for encoding one or more inputs to the lossless encoder 1202 and / or for decoding one or more inputs to the lossless decoder 1204. The learned probability model may be a neural network. The encoder comprises a first learned probability model 1302a and the decoder comprises a second learned probability model 1302b. In an example, the first and second learned probability models may be same or substantially same or may be two copies of the same learned probability model.
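How a probability model's estimated parameters relate to the bitstream size could be sketched as follows. The sketch assumes, purely for illustration, that the model outputs the mean and scale of a Gaussian that is discretized to integer symbols, and it computes the ideal code length that an arithmetic coder would approach rather than implementing the coder itself.

```python
import math


def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian with the given parameters."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(q, mean, scale):
    """Probability mass assigned to the integer symbol q (the bin [q - 0.5, q + 0.5])."""
    return gaussian_cdf(q + 0.5, mean, scale) - gaussian_cdf(q - 0.5, mean, scale)


def ideal_code_length(symbols, mean, scale):
    """Ideal number of bits needed to code the symbols under the model."""
    return sum(-math.log2(symbol_probability(q, mean, scale)) for q in symbols)


symbols = [0, 1, -1, 0]
bits_matched = ideal_code_length(symbols, mean=0.0, scale=1.0)
bits_mismatched = ideal_code_length(symbols, mean=0.0, scale=10.0)
```

The encoder and the decoder must evaluate the same probabilities for decoding to succeed, which is consistent with the two models being copies of the same learned probability model.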
[0284] In an additional embodiment, where the codec is a lossless codec, the codec comprises an arithmetic encoder and an arithmetic decoder, such as a Context-Adaptive Binary Arithmetic Coding (CABAC) encoder and a CABAC decoder.
[0285] Lossy Codec
[0286] Referring now to FIG. 14, illustrated is an example of a lossy codec according to an example embodiment of the present disclosure. In this embodiment, the codec is a lossy codec, the encoder is a lossy encoder 1402, and the decoder is a lossy decoder 1404.
[0287] Referring now to FIG. 15, illustrated is an example of a lossy codec according to an example embodiment of the present disclosure. In this additional embodiment, the codec is a lossy codec and comprises a lossy encoder 1502 and a lossy decoder 1504. The lossy encoder 1502 comprises a quantization operation 1506, a first probability model (e.g., a learned probability model 1508a), and a lossless encoder 1510 that uses an output of the first probability model. The lossy decoder 1504 may comprise a second probability model (e.g., a learned probability model 1508b), a lossless decoder 1512, and a dequantization operation 1514. In an example, the first probability model 1508a and the second probability model 1508b may be same or substantially same or may be two copies of the same learned probability model.
[0288] Referring now to FIG. 16, illustrated is an example of a lossy codec according to an example embodiment of the present disclosure. In this additional embodiment, the codec is a lossy codec and comprises a lossy encoder 1602 and a lossy decoder 1604. The lossy encoder may comprise a neural network based encoder (NN encoder 1606), a quantization operation 1506, a first probability model (e.g., the learned probability model 1508a), and a lossless encoder 1510 that uses an output of the first probability model (e.g., the learned probability model 1508a). The lossy decoder 1604 may comprise a second probability model (e.g., the learned probability model 1508b), a lossless decoder 1512 that uses an output of the second probability model, a dequantization operation 1514, and a neural network based decoder (NN decoder 1608). In an example, the first probability model 1508a and the second probability model 1508b may be same or substantially same or may be two copies of the same learned probability model.
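The order of operations in the lossy codec of FIG. 16 could be sketched as follows. All components are toy stand-ins chosen for illustration: pairwise averaging and sample repetition replace the NN encoder and the NN decoder, and `zlib` replaces the lossless encoder and decoder together with their probability models.

```python
import struct
import zlib


def nn_encoder(x):
    """Toy analysis transform: average neighbouring pairs (expects an even length)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def nn_decoder(y):
    """Toy synthesis transform: repeat each value twice."""
    return [v for v in y for _ in range(2)]


def lossy_encode(x, step):
    latent = nn_encoder(x)
    indices = [round(v / step) for v in latent]           # quantization
    packed = struct.pack(f"<{len(indices)}h", *indices)
    return zlib.compress(packed)                          # lossless encoding


def lossy_decode(bitstream, step):
    packed = zlib.decompress(bitstream)                   # lossless decoding
    indices = struct.unpack(f"<{len(packed) // 2}h", packed)
    latent_hat = [i * step for i in indices]              # dequantization
    return nn_decoder(latent_hat)


reconstruction = lossy_decode(lossy_encode([0.1, 0.3, 0.8, 0.6], step=0.25), step=0.25)
```

The lossless stage is exactly invertible, so in this sketch all of the reconstruction error comes from the transforms and from the quantization step size.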
[0289] Such lossy codec may represent or comprise an end-to-end learned codec. However, it is to be noted that when training an end-to-end learned codec, the components that are trained may comprise the NN encoder, the learned probability model and the NN decoder, and may not comprise components that may not allow for training other components or that may not be possible to train, such as the lossless encoder and the lossless decoder.
[0290] It is to be noted that, in some examples or embodiments, a probability model (such as a learned probability model) may be considered to be part of or comprised in a lossless codec (e.g., in a lossless encoder, or in a lossless decoder), whereas in some other cases a probability model (such as a learned probability model) may be considered not to be part of (e.g., may be considered to be external to) a lossless codec (e.g., not part of a lossless encoder, or not part of a lossless decoder).
[0291] Training
[0292] In an embodiment, one or more of the following components or operations may be trained jointly: the latent generator, the NN encoder, the learned probability model, the NN decoder, and / or the synthesis operation.
[0293] In an embodiment, one or more of the following components or operations may be trained from scratch (e.g., from a random initialization): the latent generator, the NN encoder, the learned probability model, the NN decoder, and / or the synthesis operation.
[0294] In an embodiment, one or more of the following components or operations may be trained or finetuned from pretrained parameters (e.g., from a learned or pretrained initialization): the latent generator, the NN encoder, the learned probability model, the NN decoder, and / or the synthesis operation.
[0295] In an example, the latent generator and the synthesis operation are neural networks and are pretrained, for example, for the task of generating images based on an input noise, to obtain a pretrained latent generator and a pretrained synthesis neural network. The pretrained latent generator and the pretrained synthesis neural network are then kept frozen (e.g., unmodified, not trained further) and combined with an end-to-end learned codec (e.g., as in one or more of the previous embodiments), and the NN encoder, learned probability model and NN decoder of the end-to-end learned codec are trained from scratch (e.g., from a random initialization).
[0296] In another example, the latent generator and the synthesis operation are neural networks and are pretrained, for example, for the task of generating images based on an input noise, to obtain a pretrained latent generator and a pretrained synthesis neural network. The pretrained latent generator and the pretrained synthesis neural network are then combined with an end-to-end learned codec (e.g., as in one or more of the previous embodiments). The NN encoder, learned probability model and NN decoder of the end-to-end learned codec are trained from scratch (e.g., from a random initialization), whereas the latent generator and the synthesis neural network are finetuned. In an example, the training of the NN encoder, learned probability model and NN decoder and the finetuning of the latent generator and the synthesis neural network are performed jointly.
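The difference between the two training examples above (frozen generator versus jointly finetuned generator) could be sketched as a choice of which components receive parameter updates. The sketch reduces each component to a single scalar parameter and uses a plain gradient step; the set names and the helper function are hypothetical.

```python
COMPONENTS = ["latent_generator", "nn_encoder", "probability_model",
              "nn_decoder", "synthesis"]

# Frozen-generator example: only the codec components are updated.
FROZEN_GENERATOR_SETUP = {"nn_encoder", "probability_model", "nn_decoder"}

# Joint example: the codec is trained while the generator parts are finetuned.
JOINT_FINETUNE_SETUP = set(COMPONENTS)


def sgd_step(params, grads, trainable, lr=0.01):
    """Apply one gradient step to trainable components; leave the others unmodified."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }
```

In both setups the loss would still be computed through the frozen components; freezing only means that their parameters are not modified by the update.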
[0297] In an example, one or more components of the codec, such as the learned probability model, the NN encoder, the NN decoder, may be trained jointly with the second portion of the generator or the synthesis operation.
[0298] Further example embodiments
[0299] In an embodiment, a post-processing filter, such as a neural network based post-processing filter, may be applied on an output of the second portion of the generator or of the synthesis operation. In an additional embodiment, the post-processing filter may be trained jointly with one or more components of the codec, such as jointly with the learned probability model, the NN encoder, the NN decoder.
[0300] Referring now to FIG. 17, illustrated is an example of a generator-nested codec according to an example embodiment of the present disclosure. In this embodiment, the second portion of the generator 812 (or the synthesis operation) may be used as part of the decoder 1702, e.g., as a NN decoder. In this configuration, the bitstream 806 is input to a lossless decoder 1512 to obtain a lossless-decoded signal, and the lossless decoded signal is input to a dequantization operation 1514 to obtain a dequantized signal, and the dequantized signal is input to the second portion of the generator 812.
[0301] In an additional embodiment, one or more components of the encoder may be trained jointly with the second portion of the generator. In an example, the first portion of the generator and the second portion of the generator are pretrained; an end-to-end learned codec comprises a NN encoder, a quantization operation, a lossless encoder, a probability model, a lossless decoder, a dequantization operation, and the second portion of the generator, where the second portion of the generator represents or has a similar function of a NN decoder. One or more components of the end-to-end learned codec are trained jointly. For example, the NN encoder, the learned probability model and the second portion of the generator are trained jointly, while other learnable components, such as the first portion of the generator, are kept frozen or unmodified.
[0302] In an embodiment, when additional controlling signals are used with a lossy or lossless codec, the controlling signals or the data derived from them may be comprised in the main bitstream, or it may be sent in an auxiliary bitstream.
[0303] In one additional embodiment, in order to generate different outputs from one main bitstream, different controlling signals may be applied to the second portion of the generator or the synthesis operation. The controlling signals may be signaled from the first portion of the generator or provided to the second portion of the generator from an external source.
[0304] In an additional embodiment, the controlling signals or the data derived from them are inferred or derived on the decoder side, for example, brightness level, contrast, frame rate or other on-demand content modulations.
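Generating different outputs from one main bitstream by changing the controlling signal could be sketched as follows. The sketch assumes a toy second portion of the generator that maps decoded latent values in [0, 1] to 8-bit pixels, with hypothetical `brightness` and `contrast` controls applied at the decoder side.

```python
def synthesize(decoded_latent, brightness=0.0, contrast=1.0):
    """Toy second portion of the generator, modulated by controlling signals."""
    pixels = []
    for v in decoded_latent:
        # Apply contrast around mid-grey, then a brightness offset.
        adjusted = (v - 0.5) * contrast + 0.5 + brightness
        # Map to the 8-bit range and clip.
        pixels.append(max(0, min(255, round(adjusted * 255))))
    return pixels


decoded = [0.0, 0.5, 1.0]                      # same decoded output in both cases
default_output = synthesize(decoded)
brighter_output = synthesize(decoded, brightness=0.2)
```

Because the controls are applied after decoding, the main bitstream does not change when a different output is requested.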
[0305] FIG. 18 is a diagram illustrating an example apparatus 1800, which may be implemented in hardware, configured to implement the examples described herein. The apparatus 1800 comprises at least one processor 1802 (e.g., an FPGA and / or CPU), at least one memory 1804 including computer program code 1805, the computer program code 1805 having instructions to carry out the methods described herein, wherein the at least one memory 1804 and the computer program code 1805 are configured to, with the at least one processor 1802, cause the apparatus 1800 to implement circuitry, a process, component, module, or function (implemented with control module 1806) to implement the examples described herein, including implementing a nested codec, for example, a generator-nested codec. Optionally included encoder 1808 of the control module 1806 implements encoding based on the examples described herein, and optionally included decoder 1810 implements decoding based on the examples described herein. The at least one memory 1804 may be a non-transitory memory, a transitory memory, a volatile memory (e.g. RAM), or a non-volatile memory (e.g., ROM).
[0306] The apparatus 1800 includes a display and / or I / O interface 1812, which includes user interface (UI) circuitry and elements, that may be used to display features or a status of the methods described herein (e.g., as one of the methods is being performed or at a subsequent time), or to receive input from a user such as with using a keypad, camera, touchscreen, touch area, microphone, biometric recognition, one or more sensors, etc. The apparatus 1800 includes one or more communication, e.g., network (N / W), interfaces (I / F(s)) 1814. The communication I / F(s) 1814 may be wired and / or wireless and communicate over the Internet / other network(s) via any communication technique including via one or more links 1816. The communication I / F(s) 1814 may comprise one or more transmitters or one or more receivers.
[0307] The transceiver 1818 comprises one or more transmitters 1820 and one or more receivers 1822. The transceiver 1818 and / or communication I / F(s) 1814 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder / decoder circuitries and one or more antennas, such as antennas 1824 used for communication over wireless link 1826.
[0308] The control module 1806 of the apparatus 1800 comprises one of or both parts 1806-1 and / or 1806-2, which may be implemented in a number of ways. The control module 1806 may be implemented in hardware as control module 1806-1, such as being implemented as part of the at least one processor 1802. The control module 1806-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the control module 1806 may be implemented as control module 1806-2, which is implemented as computer program code (having corresponding instructions) 1805 and is executed by the at least one processor 1802. For instance, the at least one memory 1804 stores instructions that, when executed by the at least one processor 1802, cause the apparatus 1800 to perform one or more of the operations as described herein. Furthermore, the at least one processor 1802, the at least one memory 1804, and example algorithms (e.g., as flowcharts and / or signaling diagrams), encoded as instructions, programs, or code, are means for causing performance of the operations described herein.
[0309] The apparatus 1800 to implement the functionality of control module 1806 may correspond to any of the apparatuses depicted herein. Alternatively, apparatus 1800 and its elements may not correspond to any of the other apparatuses depicted herein, as apparatus 1800 may be part of a self-organizing / optimizing network (SON) node or other node, such as a node in a cloud.
[0310] The apparatus 1800 may also be distributed throughout the network including within and between apparatus 1800 and any network element (such as a base station and / or terminal device and / or user equipment).
[0311] Interface 1828 enables data communication and signaling between the various items of apparatus 1800, as shown in FIG. 18. For example, the interface 1828 may be one or more buses such as address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. Computer program code (e.g. instructions) 1805, including control module 1806, may comprise object-oriented software configured to pass data or messages between objects within computer program code 1805. The apparatus 1800 need not comprise each of the features mentioned, or may comprise other features as well. The various components of apparatus 1800 may at least partially reside in a housing 1830, or a subset of the various components of apparatus 1800 may at least partially be located in different housings, which different housings may include housing 1830.
[0312] FIG. 19 is a diagram illustrating a representation of non-volatile memory media 1900a (e.g. computer / compact disc (CD) or digital versatile disc (DVD)) and 1900b (e.g. universal serial bus (USB) memory stick) and 1900c (e.g. cloud storage for downloading instructions and / or parameters 1902 or receiving emailed instructions and / or parameters 1902) storing instructions and / or parameters 1902 which, when executed by a processor, allow the processor to perform one or more of the operations of the methods described herein. Instructions and / or parameters 1902 may represent or correspond to a non-transitory computer readable medium.
[0313] FIG. 20 is a flowchart illustrating an example method 2000 as described herein. At 2002, the method 2000 includes receiving input for first portion of a generator. At 2004, the method 2000 includes generating, based at least on the input, an output of the first portion of the generator. At 2006, the method 2000 includes providing the output of the first portion of the generator to an encoder. At 2008, the method 2000 includes obtaining a bitstream based at least on the output of the first portion of the generator. At 2010, the method 2000 includes signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
[0314] The method 2000 may be performed with an encoding apparatus, such as the apparatus 1800, or any encoding apparatus described herein.
[0315] FIG. 21 is a flowchart illustrating an example method 2100 as described herein. At 2102, the method 2100 includes receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator. At 2104, the method 2100 includes decoding the bitstream to generate a decoded bitstream. At 2106, the method 2100 includes providing the decoded bitstream to a second portion of the generator. At 2108, the method 2100 includes generating an output of the second portion of the generator.
[0316] The method 2100 may be performed with a decoding apparatus, such as the apparatus 1800, or any decoding apparatus described herein.
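As a non-limiting illustration, the decoder-side flow of method 2100 can be sketched in Python as below. This is a minimal sketch under stated assumptions: `decode`, `second_portion`, and `method_2100` are hypothetical names, `zlib` stands in for the decoder, and the second portion of the generator is replaced by a toy offset mapping rather than a synthesis neural network. The example bitstream is built inline only so that the block is self-contained.

```python
import struct
import zlib
from typing import List


def decode(bitstream: bytes) -> List[int]:
    """Stand-in lossless decoder (2104): zlib, then unpack int8 values."""
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"{len(raw)}b", raw))


def second_portion(latents: List[int]) -> List[int]:
    """Hypothetical stand-in for the second portion of the generator (2108).

    A real system would run a synthesis neural network; this toy version
    shifts each signed 8-bit latent into the 0..255 sample range.
    """
    return [v + 128 for v in latents]


def method_2100(bitstream: bytes) -> List[int]:
    decoded = decode(bitstream)      # 2104: decode the received bitstream
    return second_portion(decoded)   # 2106, 2108: provide to and run the second portion


# 2102: a bitstream such as an encoder side might have produced (illustrative values).
example_latents = [-128, -1, 0, 1, 127]
bitstream = zlib.compress(struct.pack("5b", *example_latents))
output = method_2100(bitstream)
```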
[0317] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e. tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
[0318] It should be understood that the foregoing description is only illustrative. Various alternatives and modifications can be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.
Claims
What is claimed is:
1. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving input for first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
2. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
3. The apparatus of any of the claims 1 or 2, wherein the first portion of the generator comprises a first neural network and the second portion comprises a second neural network.
4. The apparatus of claim 3, wherein the first neural network comprises a latent generator and the second neural network comprises a synthesis operation.
5. The apparatus of any of the claims 1 to 4, wherein the first portion of the generator comprises a first component and a second component.
6. The apparatus of claim 5, wherein the first component maps or transforms an input to a latent space and the second component processes the mapped or transformed input in latent space; or the first component comprises a neural network that maps or transforms an input image to features and the second component comprises a first diffusion model that performs an iterative denoising process on the features.
7. The apparatus of claim 6, wherein an input to the first component comprises one of following: a noise signal; or the noise signal and an image.
8. The apparatus of any of claims 6 or 7, wherein an input to the second component comprises a noise signal; or features and the noise signal.
9. The apparatus of any of the claims 1 to 8, wherein the second portion of the generator comprises a neural network that maps the generated latent data to an output in image domain; or the second portion of the generator comprises a third component and a fourth component, wherein the third component comprises a denoising process and the fourth component comprises a neural network that converts the output of the third component into a final output in image domain, and wherein the denoising process denoises an input to the denoising process.
10. The apparatus of claim 9, wherein the denoising process comprises a neural network, or a second diffusion model for performing an iterative process to denoise the input generated latent data.
11. The apparatus of any of the claims 6 to 10, wherein the first diffusion model in the first portion of the generator is different or substantially different from the second diffusion model in the second portion of the generator.
12. The apparatus of any of the claims 1 to 11, wherein the input to the first portion of the generator comprises one or more of the following: one or more noise samples, one or more prompts, one or more images; one or more videos; one or more indications of respective one or more locations or positions, one or more indications of respective one or more time instants, one or more indications of respective one or more spatio-temporal positions, one or more indications of respective one or more data items or data points to be generated.
13. The apparatus of any of the claims 1 to 12, wherein the input to the first portion of the generator or to the second portion of the generator comprises one or more controlling signals.
14. The apparatus of claim 13, wherein the one or more controlling signals comprise an additional conditioning prompt for the generation process, or data derived from the additional conditioning prompt.
15. The apparatus of any of the claims 13 or 14, wherein the one or more controlling signals received by the first portion of the generator is different or at least partially different from the one or more controlling signals received by the second portion of the generator.
16. The apparatus of any of the claims 1 to 15, wherein a processing unit or processing circuit for the first portion of the generator is different or at least partially different from a processing unit or a processing circuit for the second portion of the generator.
17. The apparatus of any of the claims 13 to 16, wherein the one or more controlling signals are combined with: one or more intermediate features of the latent generator; one or more intermediate features of the first portion of the generator; one or more intermediate features of the synthesis operation; or one or more intermediate features of the second portion of the generator.
18. The apparatus of any of the claims 13 to 16, wherein the one or more controlling signals are combined with: parameters from one or more layers of the latent generator; parameters from one or more layers of the first portion of the generator; parameters from one or more layers of the synthesis operation; or parameters from one or more layers of the second portion of the generator.
19. The apparatus of any of the claims 13 to 18, wherein the apparatus is further caused to perform signaling or receiving: one or more controlling signals that are for the second portion of the generator, or data from which the one or more controlling signals are derived.
20. The apparatus of any of the claims 13 to 18, wherein the one or more controlling signals are determined at a decoder side that comprises the second portion of the generator.
21. The apparatus of any of the claims 13 to 18, wherein the one or more controlling signals are determined externally to a decoder side that comprises the second portion of the generator and are provided to the second portion of the generator.
22. The apparatus of any of the claims 6 to 21, wherein an output of the synthesis operation or of the second portion of the generator comprises a generated output.
23. The apparatus of claim 22, wherein when an input to the first portion of the generator is an image, or when an output of the first portion of the generator comprises features of the image, an output of the second portion of the generator comprises a generated image.
24. The apparatus of any of the claims 6, 22, or 23, wherein the synthesis operation comprises a neural network (synthesis neural network).
25. The apparatus of claim 24, wherein the apparatus is further caused to perform: training the generator by using the synthesis neural network.
26. The apparatus of claim 24, wherein to perform the training, the apparatus is further caused to perform: providing an output of the latent generator to the synthesis neural network to obtain a synthesized output; and using the synthesized output for computing a training loss based at least on ground-truth data.
27. The apparatus of any of the claims 1 to 26, wherein a codec comprises a lossless codec, the encoder comprises a lossless encoder, and the decoder comprises a lossless decoder.
28. The apparatus of claim 27, wherein the lossless codec comprises a learned probability model that estimates one or more parameters of a probability distribution, and wherein the one or more parameters are used for encoding one or more inputs to the lossless encoder and / or for decoding one or more inputs to the lossless decoder.
29. The apparatus of claim 28, wherein the encoder comprises a first learned probability model and the decoder comprises a second learned probability model, and wherein the first and second learned probability models are same, substantially same, or are two copies of the learned probability model.
30. The apparatus of any of the claims 1 to 26, wherein a codec comprises a lossy codec, the encoder comprises a lossy encoder, and the decoder comprises a lossy decoder.
31. The apparatus of claim 30, wherein when the codec comprises the lossy codec, the lossy encoder comprises: a neural network based encoder (NN encoder), a quantization operation, a first probability model, and a lossless encoder that uses an output of the first probability model; and the lossy decoder comprises a second probability model, a lossless decoder that uses an output of the second probability model, a dequantization operation, and a neural network based decoder (NN decoder), wherein the first probability model and the second probability model are same, substantially same, or are two copies of the same learned probability model.
32. The apparatus of any of the claims 1 to 31, wherein one or more of the following components or operations are trained jointly: the latent generator, the NN encoder, the learned probability model, the NN decoder, or the synthesis operation.
33. The apparatus of any of the claims 1 to 31, wherein the latent generator and the synthesis operation comprises neural networks and are pretrained to obtain a pretrained latent generator and a pretrained synthesis neural network, and wherein the pretrained latent generator and the pretrained synthesis neural network are combined with an end-to-end learned codec.
34. The apparatus of any of the claims 1 to 33, wherein the apparatus is further caused to perform: applying a post-processing filter on the output of the second portion of the generator or the output of the synthesis operation.
35. The apparatus of any of the claims 1 to 33, wherein the second portion of the generator or the synthesis operation is used as part of the decoder, and wherein in this configuration, the bitstream is input to a lossless decoder to obtain a lossless-decoded signal, the lossless decoded signal is input to a dequantization operation to obtain a dequantized signal, and the dequantized signal is input to the second portion of the generator.
36. The apparatus of any of the claims 27 to 35, wherein the apparatus is further caused to perform: using one or more additional controlling signals with the lossy or lossless codec, and when the one or more additional controlling signals are used, the one or more additional controlling signals or the data derived from the one or more additional controlling signals is comprised in the bitstream or sent in an auxiliary bitstream.
37. The apparatus of any of the claims 27 to 36, wherein the apparatus is further caused to perform: applying different controlling signals of the one or more controlling signals to the second portion of the generator or the synthesis operation in order to generate different outputs from the bitstream.
38. The apparatus of any of the claims 27 to 37, wherein the apparatus is further caused to perform: receiving the one or more controlling signals by the second portion of the generator or signaling the one or more controlling signals from the first portion of the generator or from an external source.
39. The apparatus of any of the claims 27 to 37, wherein the one or more controlling signals or data derived from the one or more controlling signals are inferred or derived at the decoder side.
40. A method comprising: receiving input for first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
41. A method comprising: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
42. The method of any of the claims 40 or 41, wherein the first portion of the generator comprises a first neural network and the second portion comprises a second neural network.
43. The method of claim 42, wherein the first neural network comprises a latent generator and the second neural network comprises a synthesis operation.
44. The method of any of the claims 40 to 43, wherein the first portion of the generator comprises a first component and a second component.
45. The method of claim 44, wherein the first component maps or transforms an input to a latent space and the second component processes the mapped or transformed input in latent space; or the first component comprises a neural network that maps or transforms an input image to features and the second component comprises a first diffusion model that performs an iterative denoising process on the features.
46. The method of claim 45, wherein an input to the first component comprises one of following: a noise signal; or the noise signal and an image.
47. The method of any of claims 45 or 46, wherein an input to the second component comprises a noise signal; or features and the noise signal.
48. The method of any of the claims 40 to 47, wherein the second portion of the generator comprises a neural network that maps the generated latent data to an output in image domain; or the second portion of the generator comprises a third component and a fourth component, wherein the third component comprises a denoising process and the fourth component comprises a neural network that converts the output of the third component into a final output in image domain, and wherein the denoising process denoises an input to the denoising process.
49. The method of claim 48, wherein the denoising process comprises a neural network, or a second diffusion model for performing an iterative process to denoise the input generated latent data.
50. The method of any of the claims 45 to 49, wherein the first diffusion model in the first portion of the generator is different or substantially different from the second diffusion model in the second portion of the generator.
51. The method of any of the claims 40 to 50, wherein the input to the first portion of the generator comprises one or more of the following: one or more noise samples, one or more prompts, one or more images; one or more videos; one or more indications of respective one or more locations or positions, one or more indications of respective one or more time instants, one or more indications of respective one or more spatio-temporal positions, one or more indications of respective one or more data items or data points to be generated.
52. The method of any of the claims 40 to 51, wherein the input to the first portion of the generator or to the second portion of the generator comprises one or more controlling signals.
53. The method of claim 52, wherein the one or more controlling signals comprise an additional conditioning prompt for the generation process, or data derived from an additional conditioning prompt.
54. The method of any of the claims 52 or 53, wherein the one or more controlling signals received by the first portion of the generator is different or at least partially different from the one or more controlling signals received by the second portion of the generator.
55. The method of any of the claims 40 to 54, wherein a processing unit or processing circuit for the first portion of the generator is different or at least partially different from a processing unit or a processing circuit for the second portion of the generator.
56. The method of any of the claims 42 to 55, wherein the one or more controlling signals are combined with: one or more intermediate features of the latent generator; one or more intermediate features of the first portion of the generator; one or more intermediate features of the synthesis operation; or one or more intermediate features of the second portion of the generator.
57. The method of any of the claims 42 to 55, wherein the one or more controlling signals are combined with: parameters from one or more layers of the latent generator; parameters from one or more layers of the first portion of the generator; parameters from one or more layers of the synthesis operation; or parameters from one or more layers of the second portion of the generator.
58. The method of any of the claims 42 to 57 further comprising: signaling or receiving: one or more controlling signals that are for the second portion of the generator, or data from which the one or more controlling signals are derived.
59. The method of any of the claims 42 to 57, wherein the one or more controlling signals are determined at a decoder side that comprises the second portion of the generator.
60. The method of any of the claims 42 to 57, wherein the one or more controlling signals are determined externally to a decoder side that comprises the second portion of the generator and are provided to the second portion of the generator.
61. The method of any of the claims 45 to 60, wherein an output of the synthesis operation or of the second portion of the generator comprises a generated output.
62. The method of claim 61, wherein when an input to the first portion of the generator is an image, or when an output of the first portion of the generator comprises features of the image, an output of the second portion of the generator comprises a generated image.
63. The method of any of the claims 45, 61, or 62, wherein the synthesis operation comprises a neural network (synthesis neural network).
64. The method of claim 63 further comprising: training the generator by using the synthesis neural network.
65. The method of claim 63, wherein to perform the training, the method further comprises: providing an output of the latent generator to the synthesis neural network to obtain a synthesized output; and using the synthesized output for computing a training loss based at least on ground-truth data.
66. The method of any of the claims 40 to 65, wherein the codec comprises a lossless codec, the encoder comprises a lossless encoder, and the decoder comprises a lossless decoder.
67. The method of claim 66, wherein the lossless codec comprises a learned probability model that estimates one or more parameters of a probability distribution, and wherein the one or more parameters are used for encoding one or more inputs to the lossless encoder and / or for decoding one or more inputs to the lossless decoder.
68. The method of claim 67, wherein the encoder comprises a first learned probability model and the decoder comprises a second learned probability model, and wherein the first and second learned probability models are same, substantially same, or are two copies of the learned probability model.
69. The method of any of the claims 40 to 65, wherein the codec comprises a lossy codec, the encoder comprises a lossy encoder, and the decoder comprises a lossy decoder.
70. The method of claim 69, wherein when the codec comprises the lossy codec, the lossy encoder comprises: a neural network based encoder (NN encoder), a quantization operation, a first probability model, and a lossless encoder that uses an output of the first probability model; and the lossy decoder comprises a second probability model, a lossless decoder that uses an output of the second probability model, a dequantization operation, and a neural network based decoder (NN decoder), wherein the first probability model and the second probability model are same, substantially same, or are two copies of the same learned probability model.
71. The method of any of the claims 40 to 70, wherein one or more of the following components or operations are trained jointly: the latent generator, the NN encoder, the learned probability model, the NN decoder, or the synthesis operation.
72. The method of any of the claims 40 to 70, wherein the latent generator and the synthesis operation comprises neural networks and are pretrained to obtain a pretrained latent generator and a pretrained synthesis neural network, and wherein the pretrained latent generator and the pretrained synthesis neural network are combined with an end-to-end learned codec.
73. The method of any of the claims 40 to 72 further comprising: applying a post-processing filter on the output of the second portion of the generator or the output of the synthesis operation.
74. The method of any of the claims 40 to 72, wherein the second portion of the generator or the synthesis operation is used as part of the decoder, and wherein in this configuration, the bitstream is input to a lossless decoder to obtain a lossless-decoded signal, the lossless decoded signal is input to a dequantization operation to obtain a dequantized signal, and the dequantized signal is input to the second portion of the generator.
75. The method of any of the claims 66 to 74 further comprising: using one or more additional controlling signals with the lossy or lossless codec, and when the one or more additional controlling signals are used, the one or more additional controlling signals or the data derived from the one or more additional controlling signals is comprised in the bitstream or sent in an auxiliary bitstream.
76. The method of any of the claims 66 to 75 further comprising: applying different controlling signals of the one or more controlling signals to the second portion of the generator or the synthesis operation in order to generate different outputs from the bitstream.
77. The method of any of the claims 66 to 76 further comprising: receiving the one or more controlling signals by the second portion of the generator or signaling the one or more controlling signals from the first portion of the generator or from an external source.
78. The method of any of the claims 66 to 76, wherein the one or more controlling signals or data derived from the one or more controlling signals are inferred or derived at the decoder side.
79. A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving input for a first portion of a generator; generating, based at least on the input, an output of the first portion of the generator; providing the output of the first portion of the generator to an encoder; obtaining a bitstream based at least on the output of the first portion of the generator; and signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
80. A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; decoding the bitstream to generate a decoded bitstream; providing the decoded bitstream to a second portion of the generator; and generating an output of the second portion of the generator.
81. The computer readable medium of any of the claims 79 or 80, wherein the apparatus is further caused to perform methods as claimed in any of the claims 42 to 78.
82. The computer readable medium of any of the claims 79 to 81, wherein the computer readable medium comprises a non-transitory computer readable medium.
83. An apparatus comprising: means for receiving input for a first portion of a generator; means for generating, based at least on the input, an output of the first portion of the generator; means for providing the output of the first portion of the generator to an encoder; means for obtaining a bitstream based at least on the output of the first portion of the generator; and means for signaling the bitstream to a decoder, wherein the bitstream is intended to be used by the decoder to generate a decoded output.
84. An apparatus comprising: means for receiving a bitstream, wherein the bitstream has been generated by an encoder based at least on an output of a first portion of a generator; means for decoding the bitstream to generate a decoded bitstream; means for providing the decoded bitstream to a second portion of the generator; and means for generating an output of the second portion of the generator.
85. The apparatus of any of the claims 83 or 84, wherein the apparatus further comprises means for performing methods as claimed in any of the claims 42 to 78.
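As a non-limiting illustration of the lossy codec arrangement recited in claims 31 and 70, the quantization operation, the two copies of a probability model, and the dequantization operation can be sketched in Python as below. This is a minimal sketch under stated assumptions: all names, the step size `STEP`, and the fixed two-sided geometric distribution are hypothetical; the NN encoder and NN decoder are omitted; and the entropy coder is replaced by the ideal code length (the sum of -log2 p over the symbols) that a lossless encoder using the model output would approach.

```python
import math
from typing import Dict, List

STEP = 0.5  # hypothetical quantization step size


def quantize(values: List[float]) -> List[int]:
    """Quantization operation: map each value to the nearest integer symbol."""
    return [round(v / STEP) for v in values]


def dequantize(symbols: List[int]) -> List[float]:
    """Dequantization operation: map integer symbols back to reconstruction levels."""
    return [s * STEP for s in symbols]


def probability_model(support: range) -> Dict[int, float]:
    """Toy fixed model (assumption): normalized two-sided geometric distribution.

    A learned probability model would instead estimate distribution
    parameters from data or from side information.
    """
    weights = {s: 0.5 ** abs(s) for s in support}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def ideal_code_length_bits(symbols: List[int], model: Dict[int, float]) -> float:
    """Bits an ideal entropy coder would spend on the symbols under the model."""
    return sum(-math.log2(model[s]) for s in symbols)


# Encoder and decoder each hold a copy of the same probability model.
encoder_model = probability_model(range(-8, 9))
decoder_model = probability_model(range(-8, 9))

latents = [0.1, -0.6, 1.2, 0.0]           # illustrative values only
symbols = quantize(latents)               # encoder side
bits = ideal_code_length_bits(symbols, encoder_model)
reconstructed = dequantize(symbols)       # decoder side, after lossless decoding
```

The two models must match exactly for the lossless stage to decode the same symbols, which is why the sketch builds both from the same function; the only loss comes from quantization and is bounded by half the step size.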