Generative learned codec

A processor-based system enhances image and video coding by using generator inputs and neural networks to improve encoding and decoding efficiency and quality, addressing existing challenges in neural network utilization.

WO2026133293A1 · PCT designated stage · Publication Date: 2026-06-25 · NOKIA TECHNOLOGIES OY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOKIA TECHNOLOGIES OY
Filing Date
2025-12-19
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing image and video coding technologies face challenges in efficiently utilizing neural networks for encoding and decoding processes, particularly in generating high-quality outputs from decoded latent tensors and prompts.

Method used

A system comprising a processor and memory that executes instructions to receive generator inputs, generate outputs based on these inputs, and derive final outputs using generators and neural networks, incorporating features like noise signals, residual data, and external inputs for enhanced encoding and decoding.

Benefits of technology

The system effectively generates high-quality images and videos by leveraging generator inputs and neural networks, improving encoding and decoding efficiency and quality.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure IB2025063273_25062026_PF_FP_ABST

Abstract

An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

Description

GENERATIVE LEARNED CODEC

TECHNICAL FIELD

[0001] The example and non-limiting embodiments relate generally to data encoding and decoding and, more particularly, to learned coding.

BACKGROUND

[0002] It is known, in image and video coding, to use neural networks to perform encoding and decoding functions as part of a codec.

SUMMARY

[0003] The following summary is merely intended to be illustrative. The summary is not intended to limit the scope of the claims.

[0004] Example 1: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.
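The three steps recited in Example 1 (receive a generator input, generate a generator output, derive the apparatus output) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the claimed implementation: `GeneratorInput`, `run_decoder`, `toy_generator`, and `clip_to_unit` are hypothetical names, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Tensor = List[float]  # stand-in for a real tensor type (assumption)


@dataclass
class GeneratorInput:
    """Hypothetical container: a prompt, a decoded latent tensor, or both."""
    prompt: Optional[str] = None
    decoded_latent: Optional[Tensor] = None


def run_decoder(gen_input: GeneratorInput,
                generator: Callable[[GeneratorInput], Tensor],
                derive: Callable[[Tensor], Tensor]) -> Tensor:
    """Receive the generator input, generate, then derive the final output."""
    if gen_input.prompt is None and gen_input.decoded_latent is None:
        raise ValueError("generator input needs a prompt or a decoded latent tensor")
    generator_output = generator(gen_input)   # generating step
    return derive(generator_output)           # deriving step


def toy_generator(gen_input: GeneratorInput) -> Tensor:
    """Toy generator (assumption): doubles each latent value."""
    base = gen_input.decoded_latent if gen_input.decoded_latent is not None else [0.0]
    return [2.0 * v for v in base]


def clip_to_unit(t: Tensor) -> Tensor:
    """Toy derivation step (assumption): clip values to the range [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in t]


result = run_decoder(GeneratorInput(decoded_latent=[0.1, 0.4, 0.9]),
                     toy_generator, clip_to_unit)
```

A real generator would be a neural network; the callables are passed in here only so the control flow of the three steps is visible.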

[0005] Example 2: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

[0006] Example 3: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

[0007] Example 4: An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

[0008] Example 5: The apparatus of example 4, wherein the signal comprises a decoded latent tensor or a prompt, and the data derived from the signal comprises data derived from the decoded latent tensor or data derived from the prompt.

[0009] Example 6: The apparatus of any of the examples 1, 2 or 5, wherein the decoded latent tensor comprises a lossless-decoded latent tensor that is comprised in an output of a lossless decoder.

[0010] Example 7: The apparatus of any of the examples 1, 2, 5, or 6, wherein the data derived from the decoded latent tensor comprises an output of a neural network decoder that is comprised in a decoder, and wherein the lossless-decoded latent tensor is an input to the neural network decoder.

[0011] Example 8: The apparatus of any of the examples 1, 3, or 5, wherein the prompt comprises one of following: a text prompt; an image; a prompt in feature space or domain; a prompt inferred from an output of a neural network decoder; or a prompt inferred from an output of a lossless decoder.

[0012] Example 9: The apparatus of example 8, wherein the text prompt comprises a prompt in natural language domain in form of: a text string, a tokenized text string; features extracted from a tokenized text string; or features extracted from a text string.

[0013] Example 10: The apparatus of any of the examples 1 to 9, wherein the generator input further comprises a noise signal.

[0014] Example 11: The apparatus of example 10, wherein the noise signal is generated at the decoder side.

[0015] Example 12: The apparatus of example 10, wherein the noise signal or one or more parameters for generating the noise signal are received from an encoder side.

[0016] Example 13: The apparatus of any of the examples 1, 2, 5, 6, 7, or 10 to 12, wherein the generator output comprises an input to a neural network decoder comprised in the apparatus, and wherein the neural network decoder decodes the decoded latent tensor based on the generator output.

[0017] Example 14: The apparatus of any of the examples 1 to 13, wherein the generator output comprises a decoded image.

[0018] Example 15: The apparatus of any of the examples 1 to 14, wherein the apparatus is further caused to perform: receiving, from an encoder, a residual or data from which a residual is derived; and combining the generator output with the residual by using at least a combination operation.

[0019] Example 16: The apparatus of example 15, wherein the combination operation comprises one of a summation operation, a tensor concatenation operation, or one or more neural network layers.
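The three combination operations named in Example 16 can be illustrated as follows. This is a sketch only: the function names are hypothetical, lists stand in for tensors, and a single linear layer is used as the simplest possible instance of "one or more neural network layers".

```python
from typing import List

Tensor = List[float]  # stand-in for a real tensor type (assumption)


def combine_sum(generator_output: Tensor, residual: Tensor) -> Tensor:
    """Summation: element-wise addition of generator output and residual."""
    if len(generator_output) != len(residual):
        raise ValueError("summation needs tensors of equal shape")
    return [g + r for g, r in zip(generator_output, residual)]


def combine_concat(generator_output: Tensor, residual: Tensor) -> Tensor:
    """Tensor concatenation along the (single) axis of these toy tensors."""
    return list(generator_output) + list(residual)


def combine_linear_layer(generator_output: Tensor, residual: Tensor,
                         weights: List[List[float]], bias: List[float]) -> Tensor:
    """One linear layer applied to the concatenated input (illustrative choice)."""
    x = combine_concat(generator_output, residual)
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


gen_out = [1.0, 2.0]
res = [0.5, -0.5]
summed = combine_sum(gen_out, res)
stacked = combine_concat(gen_out, res)
# These hand-picked weights make the linear layer reproduce the summation.
learned = combine_linear_layer(gen_out, res,
                               weights=[[1.0, 0.0, 1.0, 0.0],
                                        [0.0, 1.0, 0.0, 1.0]],
                               bias=[0.0, 0.0])
```

In a trained codec the layer weights would be learned; the values above are chosen by hand only to show that summation is a special case of the layer-based combination.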

[0020] Example 17: The apparatus of any of examples 15 or 16, wherein the residual and the generator output are in feature space or domain; or the residual and the generator output are in picture space or domain, pixel space or domain, or image space or domain.

[0021] Example 18: The apparatus of any of the examples 1 to 17, wherein the generator output is used as one of following: an intra-frame prediction or inter-frame prediction of: an image encoder and / or decoder or a video encoder and / or decoder; a reference picture in the video encoder and / or decoder; a temporal extrapolation in the video encoder and / or decoder; and a spatial extrapolation in the image encoder and / or decoder or the video encoder and / or decoder.

[0022] Example 19: The apparatus of any of the examples 1 to 18, wherein when a generator comprises a group of generators comprising two or more generator modules, an output of one generator module in the group, or data derived therefrom, is comprised in an input to other one or more generator modules in the group.

[0023] Example 20: The apparatus of example 19, wherein a generator group of the group of generators comprises a global generator and a local generator, and wherein the global generator learns to generate global embeddings from previous decoded latent tensors, and wherein the local generator directly takes the generated embeddings as input to generate output in an image space.

[0024] Example 21: The apparatus of example 20, wherein the apparatus is further caused to perform: combining the global embeddings with the decoded latent tensor to obtain an input of the local generator.
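The global and local generator arrangement of Examples 20 and 21 can be sketched as a two-stage pipeline. Everything concrete here is an assumption made for illustration: the global generator is reduced to an element-wise mean over previous decoded latent tensors, the combination is a concatenation, and the local generator is a simple clipping function. In the described system both generators would be learned.

```python
from typing import List

Tensor = List[float]  # stand-in for a real tensor type (assumption)


def global_generator(previous_latents: List[Tensor]) -> Tensor:
    """Toy global embedding: element-wise mean of previous decoded latents."""
    if not previous_latents:
        raise ValueError("need at least one previous decoded latent tensor")
    n = len(previous_latents)
    return [sum(column) / n for column in zip(*previous_latents)]


def local_generator_input(global_embedding: Tensor, decoded_latent: Tensor) -> Tensor:
    """Combine the global embedding with the current decoded latent (concatenation)."""
    return list(global_embedding) + list(decoded_latent)


def local_generator(x: Tensor) -> Tensor:
    """Toy local generator: maps its input to 'image space' by clipping to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in x]


embedding = global_generator([[0.0, 2.0], [1.0, 4.0]])
local_in = local_generator_input(embedding, [0.25])
image = local_generator(local_in)
```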

[0025] Example 22: The apparatus of any of the examples 1 to 21, wherein the apparatus is further caused to perform: receiving one or more external inputs with respect to the apparatus; and using the one or more external inputs as an input to the generator, using the one or more external inputs for processing the prompt, or using the one or more external inputs for modifying the prompt; wherein the one or more external inputs are not comprised in a bitstream decoded by the decoder.

[0026] Example 23: The apparatus of example 22, wherein the one or more external inputs comprise one or more of following: an audio signal that is derived from an audio track associated with one or more pictures that are input to a codec comprising the decoder; or ambient information.

[0027] Example 24: The apparatus of any of the examples 1, 3, 5, or 8 to 23, wherein: the prompt is determined by an encoder and / or by a process which is at encoder side but external to the encoder; the prompt is generated by an encoder based on a first picture as an input; the prompt is inferred by a neural network based on a second picture as an input, wherein the second picture and the inferred prompt are provided as input to the encoder; the prompt is comprised in an input to the encoder; the prompt is encoded by an encoder in a lossless way or in a lossy way, and wherein the encoded prompt is received by the apparatus as an input or as a signal; the prompt is used by the generator to generate an image, and wherein the prompt and the generated image is provided as an input to the encoder; the prompt is elaborated or processed by the encoder or by a process that is external to the encoder, and wherein the elaborated or processed prompt or data derived therefrom is received by the decoder or the generator; the prompt is used to generate an image to be encoded; or the prompt is inferred based on an image to be encoded.

[0028] Example 25: The apparatus of example 24, wherein the first picture and / or the second picture are encoded by a first encoder and the prompt is encoded by a second encoder.

[0029] Example 26: The apparatus of example 24, wherein the first picture, the second picture, and / or the prompt are encoded by the same encoder.

[0030] Example 27: The apparatus of any of the examples 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: receiving an encoded inferred prompt, wherein the encoded inferred prompt is an encoded version of an inferred prompt, and wherein the inferred prompt is generated based on an image generated by using the prompt; decoding the encoded inferred prompt to generate a decoded inferred prompt; providing the decoded inferred prompt as an input to the generator to obtain the generator output, wherein the generator is available at both the encoder side and the decoder side; providing the generator output as an input to the encoder and to the decoder, wherein a bitstream representing an encoded image is generated by the encoder based on an image and the generator output; receiving the bitstream; and generating a decoded image based on the bitstream and the generator output.

[0031] Example 28: The apparatus of any of the examples 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: decoding at least one of one or more bitstreams to obtain a decoded prompt, wherein a prompt and a picture are encoded into the one or more bitstreams; and using the decoded prompt or data derived from the decoded prompt as an input to the generator.

[0032] Example 29: The apparatus of example 24, wherein the elaborated or processed prompt is obtained based on a third picture that is input to the encoder.

[0033] Example 30: The apparatus of any of the examples 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: receiving an encoded elaborated or processed prompt and an encoded picture as an input, wherein an elaborated or processed prompt is generated based on a picture and the prompt, and wherein the picture is generated based on the prompt, and wherein the encoded picture is an encoded version of the picture; decoding the encoded elaborated or processed prompt to obtain a decoded elaborated or processed prompt; and providing the decoded elaborated or processed prompt as an input to the generator.

[0034] Example 31: The apparatus of example 24, wherein the elaborated prompt is determined based on the generator output when the input to the generator is the prompt from which the elaborated prompt is derived.

[0035] Example 32: The apparatus of any of the examples 1 to 31, wherein the apparatus is trained based at least on a prompt loss, and wherein the prompt loss comprises a training loss, or training objective, that is determined or computed based on a first prompt and a second prompt or based on data derived from the first prompt and data derived from the second prompt, wherein the first prompt is determined or inferred based on the generator output or on data derived from the generator output, and the second prompt is one of following: a prompt that was used to generate a picture that was input to the encoder; or a prompt that was determined or inferred based on the picture that was input to the encoder.

[0036] Example 33: The apparatus of example 32, wherein the prompt loss is computed by providing the first prompt and the second prompt as an input to a neural network and running the neural network to obtain an output that represents the prompt loss or data from which the prompt loss is derived.

[0037] Example 34: The apparatus of example 32, wherein the prompt loss is computed by computing a metric based on the first prompt and the second prompt, or based on first features extracted from the first prompt and second features extracted from the second prompt.
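One possible metric for the prompt loss of Example 34 is a cosine distance between feature vectors extracted from the two prompts. The source does not name a metric, so the choice of cosine distance and the name `cosine_prompt_loss` are assumptions made for this sketch; feature extraction itself is left out.

```python
import math
from typing import Sequence


def cosine_prompt_loss(first_features: Sequence[float],
                       second_features: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the feature vectors are aligned."""
    if len(first_features) != len(second_features):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(first_features, second_features))
    norm_first = math.sqrt(sum(a * a for a in first_features))
    norm_second = math.sqrt(sum(b * b for b in second_features))
    if norm_first == 0.0 or norm_second == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_first * norm_second)


# First prompt: inferred from the generator output.
# Second prompt: the prompt associated with the picture input to the encoder.
loss_same = cosine_prompt_loss([0.2, 0.4, 0.4], [0.2, 0.4, 0.4])
loss_orthogonal = cosine_prompt_loss([1.0, 0.0], [0.0, 1.0])
```

A low value indicates that the prompt inferred from the generator output stays close to the reference prompt, which is what the training objective rewards.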

[0038] Example 35: The apparatus of any of the examples 1 to 31, wherein the apparatus is trained, finetuned, or overfitted based at least on the prompt loss at encoder side when encoding a fourth picture to obtain an update to the generator, and wherein the update or a signal derived from the update is received by the apparatus. In an example, the first picture, the second picture, the third picture, and the fourth picture are the same or substantially the same.

[0039] Example 36: The apparatus of example 35, wherein the apparatus is further caused to perform: updating the generator by using the update, or a signal derived from the update.

[0040] Example 37: The apparatus of any of the examples 1 to 36, wherein: the generator is pretrained and is frozen during the training of one or more other components of a codec; the generator is pretrained and then finetuned jointly with one or more other components of the codec; or the generator is trained jointly with one or more components of the codec from scratch or from an initialization of parameters of the generator.

[0041] Example 38: The apparatus of any of the examples 1 to 36, wherein the apparatus is further caused to perform: receiving a noise signal as an input, wherein the noise signal guides generation of a certain content.

[0042] Example 39: The apparatus of example 38, wherein the noise signal is generated at the decoder side; or the noise signal or parameters that generate the noise signal are derived at the encoder side.

[0043] Example 40: The apparatus of any of the examples 38 or 39, wherein the apparatus is further caused to perform: processing the noise signal; and providing the processed noise signal to the decoder.

[0044] Example 41: The apparatus of any of the examples 39 or 40, wherein the noise signal at the decoder side is a combination of the noise signal generated by the noise parameters received from the encoder and a random noise generated at the decoder side.

[0045] Example 42: The apparatus of example 39, wherein the noise signal at the decoder side is derived from a decoded bitstream.

[0046] Example 43: The apparatus of any of the examples 38 to 42, wherein, in order to make the noise generation deterministic and reproducible at the decoder, the apparatus is further caused to perform: receiving additional parameters that control a noise generation process.
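Examples 41 and 43 can be illustrated with a seeded pseudo-random generator: a seed signalled by the encoder makes the noise reproducible at the decoder, and locally generated random noise can optionally be mixed in. The use of a seed as the "additional parameter", the additive mixing rule, and the names `noise_from_params` and `decoder_noise` are assumptions made for this sketch.

```python
import random
from typing import List, Optional


def noise_from_params(seed: int, length: int, std: float = 1.0) -> List[float]:
    """Deterministic Gaussian noise: the same seed always yields the same samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(length)]


def decoder_noise(signalled_seed: int, length: int,
                  local_weight: float = 0.0,
                  local_rng: Optional[random.Random] = None) -> List[float]:
    """Combine encoder-parameterised noise with decoder-side random noise."""
    signalled = noise_from_params(signalled_seed, length)
    if local_weight == 0.0:
        return signalled  # fully reproducible path
    rng = local_rng if local_rng is not None else random.Random()
    return [s + local_weight * rng.gauss(0.0, 1.0) for s in signalled]


encoder_side = noise_from_params(seed=42, length=8)
decoder_side = decoder_noise(signalled_seed=42, length=8)
mixed = decoder_noise(signalled_seed=42, length=8, local_weight=0.5)
```

With `local_weight` set to zero the decoder reproduces the encoder-side noise exactly; a non-zero weight trades that reproducibility for additional decoder-side randomness.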

[0047] Example 44: The apparatus of any of the examples 38 to 43, wherein the generator performs one or more steps to generate the generator output, and the apparatus is further caused to perform: receiving a signal comprising information about at least one step to which the noise signal needs to be applied.
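For a generator that runs several steps, the signalling of Example 44 can be pictured as a set of step indices at which noise is injected. The sketch below is illustrative only: the refinement step (halving each value) is a toy stand-in for a real generator step, and `run_generator` and `noise_steps` are hypothetical names.

```python
from typing import List, Set


def run_generator(state: List[float], num_steps: int,
                  noise_steps: Set[int], noise: List[float]) -> List[float]:
    """Run a multi-step generator, adding noise only at the signalled steps."""
    if len(state) != len(noise):
        raise ValueError("state and noise must have the same length")
    x = list(state)
    for step in range(num_steps):
        x = [0.5 * v for v in x]            # toy refinement step (assumption)
        if step in noise_steps:             # step index signalled to the decoder
            x = [v + n for v, n in zip(x, noise)]
    return x


with_noise = run_generator([8.0], num_steps=3, noise_steps={1}, noise=[1.0])
without_noise = run_generator([8.0], num_steps=3, noise_steps=set(), noise=[1.0])
```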

[0048] Example 45: The apparatus of any of the examples 1, 3, or 5, wherein the apparatus is further caused to perform: receiving an edited or modified picture, wherein one or more regions of the picture had been edited, modified, or generated by another generator based on an input prompt to obtain the edited or modified picture.

[0049] Example 46: The apparatus of example 45, wherein at least one of the one or more regions had been encoded differently than other regions of the edited or modified picture.

[0050] Example 47: The apparatus of example 46, wherein as compared to the other regions, the one or more regions are: encoded in lower quality or lower resolution, partially coded, or not coded.

[0051] Example 48: The apparatus of example 46 or 47, wherein the apparatus is further caused to perform: receiving an encoded mask or an encoded opacity-level map.

[0052] Example 49: The apparatus of example 48, wherein the mask or the opacity-level map had been obtained in one or more of the following ways: the mask or the opacity-level map is an output of the another generator; the mask or the opacity-level map had been determined based on the edited or the modified picture; the mask or the opacity-level map had been determined based on the edited or modified picture and the picture; the mask or the opacity-level map had been determined based on the input prompt that was an input to the another generator; or the mask or the opacity-level map had been determined based on a prompt that had been inferred based on the edited or modified picture.
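The source does not state how a decoded mask or opacity-level map is applied at the decoder. One plausible use, shown here purely as an assumption, is per-sample alpha blending: regions that were coded normally are taken from the decoded picture, and regions that were coded at lower quality or not coded are taken from the generator output. The name `blend_with_opacity` is hypothetical.

```python
from typing import List


def blend_with_opacity(decoded: List[float], generated: List[float],
                       opacity: List[float]) -> List[float]:
    """Alpha-blend generated content over decoded content, sample by sample.

    opacity 1.0 -> take the generated sample; opacity 0.0 -> keep the decoded one.
    """
    if not (len(decoded) == len(generated) == len(opacity)):
        raise ValueError("all inputs must have the same length")
    if any(a < 0.0 or a > 1.0 for a in opacity):
        raise ValueError("opacity levels must lie in [0, 1]")
    return [a * g + (1.0 - a) * d
            for d, g, a in zip(decoded, generated, opacity)]


blended = blend_with_opacity(decoded=[0.0, 1.0, 0.5],
                             generated=[1.0, 0.0, 0.5],
                             opacity=[1.0, 0.0, 0.5])
```

A binary mask is the special case in which every opacity level is either 0 or 1.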

[0053] Example 50: The apparatus of any of the examples 45 to 49, wherein the one or more regions are determined by the encoder or a process external to the encoder.

[0054] Example 51: The apparatus of any of the examples 45 to 50, wherein at least one of the one or more regions is determined to be a region that the generator at the decoder side or the decoder that comprises the generator is capable of generating or reconstructing with a sufficient quality with respect to a quality threshold or other criterion.

[0055] Example 52: The apparatus of any of the examples 1, 4, or 5, wherein the apparatus comprises a decoder comprising a neural network based decoder (NN decoder), and wherein an input to the NN decoder comprises the decoded latent tensor, and wherein an output of the NN decoder is input to the generator, and wherein the apparatus is further caused to perform: receiving an update to the NN decoder, where the update is determined by the encoder; using the update to update the NN decoder to obtain an updated NN decoder; using the updated NN decoder to process the lossless decoded latent tensor to generate a processed lossless decoded latent tensor; providing the processed lossless decoded latent tensor to the generator; and generating a final decoded image based at least on the processed lossless decoded latent tensor.

[0056] Example 53: The apparatus of any of the examples 38 to 44, wherein when the input to the generator comprises noise, the generator output comprises hyper-prior latents, wherein the hyper-prior latents are used to derive one or more probability distribution parameters of a latent tensor, and wherein the one or more probability distribution parameters are used to decode the latent tensor.
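How probability distribution parameters derived from hyper-prior latents feed into decoding can be sketched with a discretised Gaussian model, a common choice in learned compression. The source does not specify the distribution family or the mapping from hyper-prior latents to parameters, so both are assumptions here: the first half of the hyper-prior vector is read as means, the second half as log-scales, and the probability of an integer symbol is the Gaussian mass on the unit interval around it. An entropy decoder would consume these probabilities; that part is omitted.

```python
import math
from typing import List, Tuple


def params_from_hyper_latents(hyper: List[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical mapping: first half -> means, second half -> log-scales."""
    if len(hyper) % 2 != 0:
        raise ValueError("expected an even number of hyper-prior values")
    half = len(hyper) // 2
    means = hyper[:half]
    scales = [math.exp(v) for v in hyper[half:]]  # exp keeps scales positive
    return means, scales


def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol: int, mean: float, scale: float) -> float:
    """Probability mass of a quantised latent value under the Gaussian model."""
    return (gaussian_cdf(symbol + 0.5, mean, scale)
            - gaussian_cdf(symbol - 0.5, mean, scale))


means, scales = params_from_hyper_latents([0.0, 2.0, 0.0, 0.0])
p_at_mean = symbol_probability(0, means[0], scales[0])
p_off_mean = symbol_probability(1, means[0], scales[0])
total_mass = sum(symbol_probability(s, means[0], scales[0]) for s in range(-10, 11))
```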

[0057] Example 54: The apparatus of any of the examples 1 to 53, wherein the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0058] Example 55: The apparatus of example 54, wherein two of the one or more generators generate data comprising different features or characteristics of data to be decoded.

[0059] Example 56: The apparatus of example 55, wherein when a codec comprises a video codec, a first generator of the one or more generators generates image texture data or data from which image texture data is derived and a second generator of the one or more generators generates motion data or data from which motion data is derived.

[0060] Example 57: The apparatus of example 55, wherein when a codec comprises a video codec, a first generator of the one or more generators generates motion data or data from which motion data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

[0061] Example 58: The apparatus of example 55, wherein when a codec comprises an image codec or a video codec, a first generator of the one or more generators generates prediction data or data from which prediction data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

[0062] Example 59: The apparatus of example 55, wherein a generator group of the one or more generators comprises a global generator and a local generator, wherein the global generator learns to generate global embeddings from previous inputs, and the local generator utilizes the global embeddings or data derived from the global embeddings to generate a target data.

[0063] Example 60: A method comprising: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0064] Example 61: A method comprising: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0065] Example 62: A method comprising: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0066] Example 63: A method comprising: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0067] Example 64: The method of example 63, wherein the signal comprises a decoded latent tensor or a prompt, and the data derived from the signal comprises data derived from the decoded latent tensor or data derived from the prompt.

[0068] Example 65: The method of any of the examples 60, 61 or 64, wherein the decoded latent tensor comprises a lossless-decoded latent tensor that is comprised in an output of a lossless decoder.

[0069] Example 66: The method of any of the examples 60, 61, 64, or 65, wherein the data derived from the decoded latent tensor comprises an output of a neural network decoder that is comprised in the decoder, and wherein the lossless-decoded latent tensor is an input to the neural network decoder.

[0070] Example 67: The method of any of the examples 60, 62, or 64, wherein the prompt comprises one of following: a text prompt; an image; a prompt in feature space or domain; a prompt inferred from an output of a neural network decoder; or a prompt inferred from an output of a lossless decoder.

[0071] Example 68: The method of example 67, wherein the text prompt comprises a prompt in natural language domain in form of: a text string, a tokenized text string; features extracted from a tokenized text string; or features extracted from a text string.

[0072] Example 69: The method of any of the examples 60 to 68, wherein the generator input further comprises a noise signal.

[0073] Example 70: The method of example 69, wherein the noise signal is generated at the decoder side.

[0074] Example 71: The method of example 69, wherein the noise signal or one or more parameters for generating the noise signal are received from an encoder side.

[0075] Example 72: The method of any of the examples 60, 61, 64, 65, 66, or 69 to 71, wherein the generator output comprises an input to a neural network decoder comprised in the apparatus, and wherein the neural network decoder decodes the decoded latent tensor based on the generator output.

[0076] Example 73: The method of any of the examples 60 to 72, wherein the generator output comprises a decoded image.

[0077] Example 74: The method of any of the examples 60 to 73 further comprising: receiving, from an encoder, a residual or data from which a residual is derived; and combining the generator output with the residual by using at least a combination operation.

[0078] Example 75: The method of example 74, wherein the combination operation comprises one of a summation operation, a tensor concatenation operation, or one or more neural network layers.

[0079] Example 76: The method of any of examples 74 or 75, wherein the residual and the generator output are in feature space or domain; or the residual and the generator output are in picture space or domain, pixel space or domain, or image space or domain.

[0080] Example 77: The method of any of the examples 60 to 76, wherein the generator output is used as one of following: an intra-frame prediction or inter-frame prediction of: an image encoder and / or decoder or a video encoder and / or decoder; a reference picture in the video encoder and / or decoder; a temporal extrapolation in the video encoder and / or decoder; and a spatial extrapolation in the image encoder and / or decoder or the video encoder and / or decoder.

[0081] Example 78: The method of any of the examples 60 to 77, wherein when the generator comprises a group of generators comprising two or more generator modules, an output of one generator module in the group, or data derived therefrom, is comprised in an input to other one or more generator modules in the group.

[0082] Example 79: The method of example 78, wherein a generator group of the group of generators comprises a global generator and a local generator, and wherein the global generator learns to generate global embeddings from previous decoded latent tensors, and wherein the local generator directly takes the generated embeddings as input to generate output in an image space.

[0083] Example 80: The method of example 79 further comprising: combining the global embeddings with the decoded latent tensor to obtain an input of the local generator.

[0084] Example 81: The method of any of the examples 60 to 80 further comprising: receiving one or more external inputs with respect to the apparatus; and using the one or more external inputs as an input to the generator, using the one or more external inputs for processing the prompt, or using the one or more external inputs for modifying the prompt; wherein the one or more external inputs are not comprised in a bitstream decoded by the decoder.

[0085] Example 82: The method of example 81, wherein the one or more external inputs comprise one or more of following: an audio signal that is derived from an audio track associated with one or more pictures that are input to a codec comprising the decoder; or ambient information.

[0086] Example 83: The method of any of the examples 60, 62, 64, or 67 to 82, wherein: the prompt is determined by an encoder and / or by a process which is at encoder side but external to the encoder; the prompt is generated by an encoder based on a first picture as an input; the prompt is inferred by a neural network based on a second picture as an input, wherein the second picture and the inferred prompt are provided as input to the encoder; the prompt is comprised in an input to the encoder; the prompt is encoded by an encoder in a lossless way or in a lossy way, and wherein the encoded prompt is received by the apparatus as an input or as a signal; the prompt is used by the generator to generate an image, and wherein the prompt and the generated image is provided as an input to the encoder; the prompt is elaborated or processed by the encoder or by a process that is external to the encoder, and wherein the elaborated or processed prompt or data derived therefrom is received by the decoder or the generator; the prompt is used to generate an image to be encoded; or the prompt is inferred based on an image to be encoded.

[0087] Example 84: The method of example 83, wherein the first picture and / or the second picture are encoded by a first encoder and the prompt is encoded by a second encoder.

[0088] Example 85: The method of example 83, wherein the first picture, the second picture, and / or the prompt are encoded by the same encoder.

[0089] Example 86: The method of any of the examples 60, 62, 64, or 67 to 82 further comprising: receiving an encoded inferred prompt, wherein the encoded inferred prompt is an encoded version of an inferred prompt, and wherein the inferred prompt is generated based on an image generated by using the prompt; decoding the encoded inferred prompt to generate a decoded inferred prompt; providing the decoded inferred prompt as an input to the generator to obtain the generator output, wherein the generator is available at both the encoder side and the decoder side; providing the generator output as an input to the encoder and to the decoder, wherein a bitstream representing an encoded image is generated by the encoder based on an image and the generator output; receiving the bitstream; and generating a decoded image based on the bitstream and the generator output.

[0090] Example 87: The method of any of the examples 60, 62, 64, or 67 to 82 further comprising: decoding at least one of one or more bitstreams to obtain a decoded prompt, wherein a prompt and a picture are encoded into the one or more bitstreams; and using the decoded prompt or data derived from the decoded prompt as an input to the generator.

[0091] Example 88: The method of example 83, wherein the elaborated or processed prompt is obtained based on a third picture that is input to the encoder.

[0092] Example 89: The method of any of the examples 60, 62, 64, or 67 to 82 further comprising: receiving an encoded elaborated or processed prompt and an encoded picture as an input, wherein an elaborated or processed prompt is generated based on a picture and the prompt, and wherein the picture is generated based on the prompt, and wherein the encoded picture is an encoded version of the picture; decoding the encoded elaborated or processed prompt to obtain a decoded elaborated or processed prompt; and providing the decoded elaborated or processed prompt as an input to the generator.

[0093] Example 90: The method of example 83, wherein the elaborated prompt is determined based on the generator output when the input to the generator is the prompt from which the elaborated prompt is derived.

[0094] Example 91: The method of any of the examples 60 to 90, wherein the apparatus is trained based at least on a prompt loss, and wherein the prompt loss comprises a training loss, or training objective, that is determined or computed based on a first prompt and a second prompt or based on data derived from a first prompt and data derived from a second prompt, wherein the first prompt is determined or inferred based on the generator output or on data derived from the generator output, and the second prompt is one of following: a prompt that was used to generate a picture that was input to the encoder; or a prompt that was determined or inferred based on the picture that was input to the encoder.

[0095] Example 92: The method of example 91, wherein the prompt loss is computed by providing the first prompt and the second prompt as an input to a neural network and running the neural network to obtain an output that represents the prompt loss or data from which the prompt loss is derived.

[0096] Example 93: The method of example 91, wherein the prompt loss is computed by computing a metric based on the first prompt and the second prompt, or based on first features extracted from the first prompt and second features extracted from the second prompt.
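In a non-limiting illustration of Example 93, the following sketch computes a prompt loss as a cosine distance between features extracted from the first prompt and the second prompt. The bag-of-words feature extractor, the names extract_features and prompt_loss, and the choice of cosine distance are assumptions made for illustration only; the disclosure does not specify a particular metric or feature extractor.

```python
import math
from collections import Counter


def extract_features(prompt: str) -> Counter:
    """Hypothetical feature extractor: lower-cased bag-of-words counts."""
    return Counter(prompt.lower().split())


def prompt_loss(first_prompt: str, second_prompt: str) -> float:
    """Cosine distance (1 - cosine similarity) between prompt features.

    Close to 0.0 for identical feature vectors and 1.0 for prompts that
    share no words. The metric is an assumption, not the disclosed method.
    """
    f1 = extract_features(first_prompt)
    f2 = extract_features(second_prompt)
    dot = sum(f1[w] * f2[w] for w in f1.keys() & f2.keys())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0  # an empty prompt is treated as maximally dissimilar
    return 1.0 - dot / (norm1 * norm2)


# First prompt: inferred from the generator output (example value).
# Second prompt: used to generate the picture that was input to the encoder.
loss = prompt_loss("a red car on a road", "a red car on a wet road")
```

In a training setup, a differentiable variant of such a metric (for example, computed on learned text embeddings) would be needed for gradient-based optimization; the sketch above only shows the shape of the computation.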

[0097] Example 94: The method of any of the examples 60 to 90, wherein the method is trained, finetuned, or overfitted based at least on the prompt loss at encoder side when encoding a fourth picture to obtain an update to the generator, and wherein the update or a signal derived from the update is received by the apparatus. In an example, the first picture, the second picture, the third picture, and the fourth picture are the same or substantially the same.

[0098] Example 95: The method of example 94 further comprising: updating the generator by using the update, or a signal derived from the update.

[0099] Example 96: The method of any of the examples 60 to 95, wherein: the generator is pretrained and is frozen during the training of one or more other components of a codec; the generator is pretrained and then finetuned jointly with one or more other components of the codec; or the generator is trained jointly with one or more components of the codec from scratch or from an initialization of parameters of the generator.

[0100] Example 97: The method of any of the examples 60 to 95 further comprising: receiving a noise signal as an input, wherein the noise signal guides generation of a certain content.

[0101] Example 98: The method of example 97, wherein the noise signal is generated at the decoder side; or the noise signal or parameters that generate the noise signal are derived at the encoder side.

[0102] Example 99: The method of any of the examples 97 or 98, further comprising: processing the noise signal; and providing the processed noise signal to the decoder.

[0103] Example 100: The method of any of the examples 98 or 99, wherein the noise signal at the decoder side is a combination of the noise signal generated by the noise parameters received from the encoder and a random noise generated at the decoder side.

[0104] Example 101: The method of example 98, wherein the noise signal at the decoder side is derived from a decoded bitstream.

[0105] Example 102: The method of any of the examples 97 to 101, wherein, in order to make the noise generation deterministic and reproducible at the decoder, the method further comprises: receiving additional parameters that control a noise generation process.
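The combination described in Example 100 and the reproducibility aim of Example 102 may be sketched as follows. The sketch assumes that the noise parameters received from the encoder consist of a seed, a standard deviation, and a length, and that the two noise components are combined by a weighted sum. The names NoiseParams, noise_from_params, and decoder_noise, and the mixing weight alpha, are hypothetical and are not taken from the disclosure.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NoiseParams:
    """Hypothetical noise parameters signalled by the encoder."""
    seed: int
    sigma: float
    length: int


def noise_from_params(params: NoiseParams) -> List[float]:
    """Regenerate Gaussian noise deterministically from the signalled seed."""
    rng = random.Random(params.seed)
    return [rng.gauss(0.0, params.sigma) for _ in range(params.length)]


def decoder_noise(params: NoiseParams, alpha: float,
                  local_seed: Optional[int] = None) -> List[float]:
    """Weighted sum of encoder-controlled noise and decoder-side random noise.

    With alpha = 1.0 only the encoder-controlled component remains, so the
    result is reproducible. With alpha < 1.0 a decoder-side component is
    mixed in; it is reproducible only if local_seed is fixed.
    """
    signalled = noise_from_params(params)
    rng = random.Random(local_seed)  # None seeds from system entropy
    local = [rng.gauss(0.0, params.sigma) for _ in range(params.length)]
    return [alpha * s + (1.0 - alpha) * r for s, r in zip(signalled, local)]


params = NoiseParams(seed=1234, sigma=1.0, length=4)
reproducible = decoder_noise(params, alpha=1.0)
```

Bit-exact reproducibility across different decoder implementations would additionally require that the encoder and the decoder agree on the pseudo-random number generator, which this sketch takes for granted by using the same library on both sides.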

[0106] Example 103: The method of any of the examples 97 to 102, wherein the generator performs one or more steps to generate the generator output, and wherein the method further comprises: receiving a signal comprising information about at least one step to which the noise signal needs to be applied.

[0107] Example 104: The method of any of the examples 60, 62, 64 further comprising: receiving an edited or modified picture, wherein one or more regions of the picture had been edited, modified, or generated by another generator based on an input prompt to obtain the edited or modified picture.

[0108] Example 105: The method of example 104, wherein at least one of the one or more regions had been encoded differently than other regions of the edited or modified picture.

[0109] Example 106: The method of example 105, wherein as compared to the other regions, the one or more regions are: encoded in lower quality or lower resolution, partially coded, or not coded.

[0110] Example 107: The method of example 105 or 106 further comprising: receiving an encoded mask or an encoded opacity-level map.

[0111] Example 108: The method of example 107, wherein the mask or the opacity-level map had been obtained in one or more of the following ways: the mask or the opacity-level map is an output of the another generator; the mask or the opacity-level map had been determined based on the edited or the modified picture; the mask or the opacity-level map had been determined based on the edited or modified picture and the picture; the mask or the opacity-level map had been determined based on the input prompt that was an input to the another generator; the mask or the opacity-level map had been determined based on a prompt that had been inferred based on the edited or modified picture.

[0112] Example 109: The method of any of the examples 104 to 108, wherein the one or more regions are determined by the encoder or a process external to the encoder.

[0113] Example 110: The method of any of the examples 104 to 109, wherein at least one of the one or more regions are determined to be a region that the generator at the decoder side or the decoder that comprises the generator is capable of generating or reconstructing with a sufficient quality with respect to a quality threshold or other criterion.
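As one possible, non-limiting reading of Example 110, an encoder-side process may compare each candidate region with a trial reconstruction produced by the generator and keep the regions whose quality meets a threshold. The sketch below uses PSNR as the quality measure; the use of PSNR, the function names psnr and select_generatable_regions, and the sample values are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized regions."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def select_generatable_regions(originals: Dict[str, Sequence[float]],
                               generated: Dict[str, Sequence[float]],
                               threshold_db: float) -> List[str]:
    """Return the names of regions the generator reproduces at or above the threshold."""
    return [name for name, pixels in originals.items()
            if psnr(pixels, generated[name]) >= threshold_db]


# Example sample values per region (flattened pixels), chosen for illustration.
originals = {"sky": [200, 201, 199, 200], "face": [120, 80, 60, 140]}
generated = {"sky": [200, 200, 200, 200], "face": [100, 100, 100, 100]}
regions = select_generatable_regions(originals, generated, threshold_db=35.0)
```

Regions selected in this way could then be coded at lower quality, partially coded, or not coded, as described in Example 106.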

[0114] Example 111: The method of any of the examples 60, 63, 64, wherein the apparatus comprises a decoder comprising a neural network based decoder (NN decoder), and wherein an input to the NN decoder comprises the decoded latent tensor, and wherein an output of the NN decoder is input to the generator, and wherein the method further comprises: receiving an update to the NN decoder, where the update is determined by the encoder; using the update to update the NN decoder to obtain an updated NN decoder; using the updated NN decoder to process the lossless decoded latent tensor to generate a processed lossless decoded latent tensor; providing the processed lossless decoded latent tensor to the generator; and generating a final decoded image based at least on the processed lossless decoded latent tensor.

[0115] Example 112: The method of any of the examples 97 to 103, wherein when the input to the generator comprises noise, the generator output comprises hyper-prior latents, wherein the hyper-prior latents are used to derive one or more probability distribution parameters of a latent tensor, and wherein the one or more probability distribution parameters are used to decode the latent tensor.
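The role of the probability distribution parameters in Example 112 may be illustrated with a sketch that is common in learned compression, although it is not stated in the disclosure: each quantized latent element is modelled by a Gaussian whose mean and scale are derived from the hyper-prior latents, and the probability of an integer symbol is the Gaussian mass over a unit-width bin. The Gaussian model and the names gaussian_cdf, symbol_probability, and estimated_bits are assumptions; the entropy decoder that would consume these probabilities is omitted.

```python
import math
from typing import Sequence


def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    """Cumulative distribution function of a Gaussian with the given mean and scale."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol: int, mean: float, scale: float) -> float:
    """Probability mass of an integer symbol: Gaussian integrated over [s - 0.5, s + 0.5]."""
    return gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)


def estimated_bits(symbols: Sequence[int], means: Sequence[float],
                   scales: Sequence[float]) -> float:
    """Ideal code length in bits of the symbols under the per-element model."""
    total = 0.0
    for s, m, sc in zip(symbols, means, scales):
        p = max(symbol_probability(s, m, sc), 1e-12)  # floor avoids log2(0)
        total += -math.log2(p)
    return total


# Means and scales stand in for parameters derived from hyper-prior latents.
bits = estimated_bits([0, 1, -2], means=[0.1, 0.8, -1.5], scales=[0.5, 1.0, 2.0])
```

The same probabilities must be available at the encoder and the decoder, which is why the parameters are derived from data that both sides can obtain.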

[0116] Example 113: The method of any of the examples 60 to 112, wherein the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0117] Example 114: The method of example 113, wherein two of the one or more generators generate data comprising different features or characteristics of data to be decoded.

[0118] Example 115: The method of example 114, wherein when a codec comprises a video codec, a first generator of the one or more generators generates image texture data or data from which image texture data is derived and a second generator of the one or more generators generates motion data or data from which motion data is derived.

[0119] Example 116: The method of example 114, wherein when a codec comprises a video codec, a first generator of the one or more generators generates motion data or data from which motion data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

[0120] Example 117: The method of example 114, wherein when a codec comprises an image codec or a video codec, a first generator of the one or more generators generates prediction data or data from which prediction data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.
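As a minimal sketch of how the outputs of the two generators of Example 117 might be combined, the following assumes that the prediction data and the residual data are sample arrays of equal size that are added and clipped to the valid sample range, as is conventional in predictive coding. The additive combination, the function name reconstruct, and the bit depth are assumptions not stated in the disclosure.

```python
from typing import List


def reconstruct(prediction: List[int], residual: List[int],
                bit_depth: int = 8) -> List[int]:
    """Add residual samples to prediction samples and clip to [0, 2**bit_depth - 1]."""
    max_value = (1 << bit_depth) - 1
    return [min(max(p + r, 0), max_value) for p, r in zip(prediction, residual)]


prediction = [100, 250, 3]   # stands in for the output of the first generator
residual = [5, 10, -7]       # stands in for the output of the second generator
reconstructed = reconstruct(prediction, residual)
```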

[0121] Example 118: The method of example 114, wherein a generator group of the one or more generators comprises a global generator and a local generator, wherein the global generator learns to generate global embeddings from previous inputs, and the local generator utilizes the global embeddings or data derived from the global embeddings to generate a target data.

[0122] Example 119: A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0123] Example 120: A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0124] Example 121: A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0125] Example 122: A computer readable medium comprising instructions, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

[0126] Example 123: The computer readable medium of any of the examples 119 or 122, wherein the apparatus is further caused to perform methods as described in any of the examples 64 to 118.

[0127] Example 124: The computer readable medium of any of the examples 119 to 123, wherein the computer readable medium comprises a non-transitory computer readable medium.

[0128] Example 125: An apparatus comprising: means for receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

[0129] Example 126: An apparatus comprising: means for receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

[0130] Example 127: An apparatus comprising: means for receiving a generator input comprising a prompt; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

[0131] Example 128: An apparatus comprising: means for receiving a generator input comprising a signal or data derived from the signal; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

[0132] Example 129: The apparatus of any of the examples 125 or 128, wherein the apparatus is further caused to perform methods as described in any of the examples 64 to 118.

BRIEF DESCRIPTION OF THE DRAWINGS

[0133] The foregoing examples and other features are explained in the following description, taken in connection with the accompanying drawings, wherein:

[0134] FIG. 1 is a block diagram of one possible and non-limiting example system in which the example embodiments may be practiced;

[0135] FIG. 2 is a diagram illustrating features as described herein;

[0136] FIG. 3 is a diagram illustrating features as described herein;

[0137] FIG. 4 is a diagram illustrating features as described herein;

[0138] FIG. 5 is a diagram illustrating features as described herein;

[0139] FIG. 6 is a diagram illustrating features as described herein;

[0140] FIG. 7 is a diagram illustrating features as described herein;

[0141] FIG. 8 is a diagram illustrating features as described herein;

[0142] FIG. 9 is a diagram illustrating features as described herein;

[0143] FIG. 10 is a diagram illustrating features as described herein;

[0144] FIG. 11 is a diagram illustrating features as described herein;

[0145] FIG. 12 is a diagram illustrating features as described herein;

[0146] FIG. 13 is a diagram illustrating features as described herein;

[0147] FIG. 14 is a diagram illustrating features as described herein;

[0148] FIG. 15 is a diagram illustrating features as described herein;

[0149] FIG. 16 is a diagram illustrating features as described herein;

[0150] FIG. 17 is a diagram illustrating features as described herein;

[0151] FIG. 18 is a diagram illustrating features as described herein;

[0152] FIG. 19 is a diagram illustrating features as described herein;

[0153] FIG. 20 is a diagram illustrating features as described herein;

[0154] FIG. 21 is a diagram illustrating features as described herein;

[0155] FIG. 22 is a diagram illustrating features as described herein;

[0156] FIG. 23 is a diagram illustrating an example apparatus, which may be implemented in hardware, configured to implement the examples described herein;

[0157] FIG. 24 is a diagram illustrating an example of non-volatile memory media used to store instructions that implement the examples described herein;

[0158] FIG. 25 is a flowchart illustrating an example method as described herein;

[0159] FIG. 26 is a flowchart illustrating another example method as described herein;

[0160] FIG. 26 is a flowchart illustrating yet another example method as described herein; and

[0161] FIG. 27 is a flowchart illustrating still another example method as described herein.

DETAILED DESCRIPTION OF EMBODIMENTS

[0162] The following abbreviations that may be found in the specification and / or the drawing figures are defined as follows:
3GPP: third generation partnership project
4G: fourth generation
5G: fifth generation
5GC: 5G core network
APS: adaptation parameter set
AR: augmented reality
CABAC: context-adaptive binary arithmetic coding
CDMA: code division multiple access
CPU: central processing unit
cRAN: cloud radio access network
DCT: discrete cosine transform
E2E: end-to-end
eNB (or eNodeB): evolved Node B (e.g., an LTE base station)
EN-DC: E-UTRA-NR dual connectivity
en-gNB or En-gNB: node providing NR user plane and control plane protocol terminations towards the UE, and acting as secondary node in EN-DC
E-UTRA: evolved universal terrestrial radio access, i.e., the LTE radio access technology
FDMA: frequency division multiple access
GAN: generative adversarial network
gNB (or gNodeB): base station for 5G / NR, i.e., a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC
GPU: graphical processing unit
GSM: global systems for mobile communications
HMD: head-mounted display
IBC: intra block copy
IEEE: Institute of Electrical and Electronics Engineers
IMD: integrated messaging device
IMS: instant messaging service
IoT: Internet of Things
JVET: Joint Video Expert Team
LTE: long term evolution
MAE: mean absolute error
mAP: mean average precision
MMS: multimedia messaging service
MPEG-I: Moving Picture Experts Group immersive codec family
MR: mixed reality
MSE: mean squared error
MS-SSIM: multiscale structure similarity index measure
NAL: network abstraction layer
ng or NG: new generation
ng-eNB or NG-eNB: new generation eNB
NN: neural network
NNC: neural network coding
NR: new radio
N / W or NW: network
O-RAN: open radio access network
PC: personal computer
PDA: personal digital assistant
PSNR: peak signal-to-noise ratio
QP: quantization parameter
ROI: region of interest
SEI: supplemental enhancement information
SGD: stochastic gradient descent
SMS: short messaging service
SSIM: structure similarity index measure
TCP-IP: transmission control protocol-internet protocol
TDMA: time division multiple access
UE: user equipment (e.g., a wireless, typically mobile device)
UMTS: universal mobile telecommunications system
USB: universal serial bus
VCM: video coding for machines
VMAF: Video Multimethod Assessment Fusion
VNR: virtualized network function
VR: virtual reality
VVC: volumetric video coding
WLAN: wireless local area network

[0163] The following describes suitable apparatus and possible mechanisms for practicing example embodiments of the present disclosure. Accordingly, reference is first made to FIG. 1, which shows an example block diagram of an apparatus 50 (e.g., a user / electronic device). The apparatus may be configured to perform various functions such as, for example, gathering information by one or more sensors, encoding and / or decoding information, receiving and / or transmitting information, analyzing information gathered or received by the apparatus, or the like. A device configured to encode a video scene may (optionally) comprise one or more microphones for capturing the scene and / or one or more sensors, such as cameras, for capturing information about the physical environment in which the scene is captured. Alternatively, a device configured to encode a video scene may be configured to receive information about an environment in which a scene is captured and / or a simulated environment. A device configured to decode and / or render the video scene may be configured to receive a Moving Picture Experts Group immersive codec family (MPEG-I) bitstream comprising the encoded video scene. A device configured to decode and / or render the video scene may comprise one or more speakers / audio transducers and / or displays, and / or may be configured to transmit a decoded scene or signals to a device comprising one or more speakers / audio transducers and / or displays. A device configured to decode and / or render the video scene may comprise a user equipment, a head-mounted display, or another device capable of rendering to a user an AR, VR and / or MR experience.

[0164] The apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system. Alternatively, the electronic device may be a computer or part of a computer that is not mobile. It should be appreciated that example embodiments of the present disclosure may be implemented within any electronic device or apparatus which may process data. The apparatus 50 may comprise a device that can access a network and / or cloud through a wired or wireless connection. The apparatus 50 may comprise one or more controllers 56 (e.g., processors), one or more memories 58, and one or more radio interface circuitry 52 interconnected through one or more buses. The one or more controllers 56 may comprise a central processing unit (CPU) and / or a graphical processing unit (GPU). Each of the one or more radio interface circuitry 52 includes a receiver and a transmitter. The one or more buses may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. A “circuit” may include dedicated hardware or hardware in association with software executable thereon. The one or more transceivers may be connected to one or more antennas 44. The one or more memories 58 may include computer program code. The one or more memories 58 and the computer program code may be configured to, with the one or more controllers 56, cause the apparatus 50 to perform one or more of the operations as described herein.

[0165] The apparatus 50 may connect to a node of a network. The network node may comprise one or more processors, one or more memories, and one or more transceivers interconnected through one or more buses. Each of the one or more transceivers includes a receiver and a transmitter. The one or more buses may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. The one or more transceivers may be connected to one or more antennas. The one or more memories may include computer program code. The one or more memories and the computer program code may be configured to, with the one or more processors, cause the network node to perform one or more of the operations as described herein.

[0166] The apparatus 50 may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device 38 which in example embodiments of the present disclosure may be any one of: an earpiece, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other example embodiments of the present disclosure the device may be powered by any suitable mobile energy device such as solar cell, fuel cell, or clockwork generator). The apparatus 50 may further comprise a camera 42 or other sensor capable of recording or capturing images and / or video. Additionally or alternatively, the apparatus 50 may further comprise a depth sensor. The apparatus 50 may further comprise a display 32. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other example embodiments of the present disclosure the apparatus 50 may further comprise any suitable short-range communication solution such as for example a BLUETOOTH™ wireless connection or a USB / firewire wired connection.

[0167] It should be understood that an apparatus 50 configured to perform example embodiments of the present disclosure may have fewer and / or additional components, which may correspond to what processes the apparatus 50 is configured to perform. For example, an apparatus configured to encode a video might not comprise a speaker or audio transducer and may comprise a microphone, while an apparatus configured to render the decoded video might not comprise a microphone and may comprise a speaker or audio transducer.

[0168] Referring now to FIG. 1, the apparatus 50 may comprise one or more controllers 56, processors, or processor circuitry for controlling the apparatus 50. The controllers 56 may be connected to memory 58 which in example embodiments of the present disclosure may store data in the form of image and audio data and / or may also store instructions for implementation on the controllers 56. The controllers 56 may further be connected to codec circuitry 54 suitable for carrying out coding and / or decoding of audio and / or video data or assisting in coding and / or decoding carried out by the controllers 56.

[0169] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC and UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the apparatus 50 at a network. The apparatus 50 may further comprise an input device 34, such as a keypad, one or more input buttons, or a touch screen input device, for providing information to the controllers 56.

[0170] The apparatus 50 may comprise radio interface circuitry 52 (e.g., transceivers) connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and / or for receiving radio frequency signals from other apparatus(es).

[0171] The apparatus 50 may comprise a microphone 36, camera 42, and / or other sensors capable of recording or detecting audio signals, image / video signals, and / or other information about the local / virtual environment, which are then passed to the codec circuitry 54 or the controllers 56 for processing. The apparatus 50 may receive the audio / image / video signals and / or information about the local / virtual environment for processing from another device prior to transmission and / or storage. The apparatus 50 may also receive either wirelessly or by a wired connection the audio / image / video signals and / or information about the local / virtual environment for encoding / decoding. The structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.

[0172] The memory 58 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The memory 58 may be a non-transitory memory. The memory 58 may be means for performing storage functions. The controllers 56 may be or comprise one or more processors, which may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The controllers 56 may be means for performing functions.

[0173] The apparatus 50 may be configured to perform capture of a volumetric scene according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a camera 42 or other sensor capable of recording or capturing images and / or video. The apparatus 50 may also comprise one or more radio interface circuitry 52 to enable transmission of captured content for processing at another device. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.

[0174] The apparatus 50 may be configured to perform processing of volumetric video content according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise controllers 56 for processing images to produce volumetric video content, controllers 56 for processing volumetric video content to project 3D information into 2D information, patches, and auxiliary information, and / or codec circuitry 54 for encoding 2D information, patches, and auxiliary information into a bitstream for transmission to another device with radio interface circuitry 52. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.

[0175] The apparatus 50 may be configured to perform encoding or decoding of 2D information representative of volumetric video content according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise codec circuitry 54 for encoding or decoding 2D information representative of volumetric video content. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.

[0176] The apparatus 50 may be configured to perform rendering of decoded 3D volumetric video according to example embodiments of the present disclosure. For example, the apparatus 50 may comprise a controller for projecting 2D information to reconstruct 3D volumetric video, and / or a display 32 for rendering decoded 3D volumetric video. Such an apparatus 50 may or may not include all the modules illustrated in FIG. 1.

[0177] With respect to FIG. 2, an example of a system within which example embodiments of the present disclosure can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM, UMTS, E-UTRA, LTE, CDMA, 4G, 5G, 6G network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a BLUETOOTH™ personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and / or the Internet. A wireless network may implement network virtualization, which is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network. Network virtualization involves platform virtualization, often combined with resource virtualization. Network virtualization is categorized as either external, combining many networks, or parts of networks, into a virtual unit, or internal, providing network-like functionality to software containers on a single system. For example, a network may be deployed in a tele cloud, with virtualized network functions (VNF) running on, for example, data center servers. For example, network core functions and / or radio access network(s) (e.g. CloudRAN, O-RAN, edge cloud) may be virtualized. Note that the virtualized entities that result from the network virtualization are still implemented, at some level, using hardware such as processors and memories, and also such virtualized entities create technical effects.

[0178] It may also be noted that operations of example embodiments of the present disclosure may be carried out by a plurality of cooperating devices (e.g. cRAN).

[0179] The system 10 may include both wired and wireless communication devices and / or electronic devices suitable for implementing example embodiments of the present disclosure.

[0180] For example, the system shown in FIG. 2 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.

[0181] The example communication devices shown in the system 10 may include, but are not limited to, an apparatus 15, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a head-mounted display (HMD) 17. The apparatus 50 may comprise any of those example communication devices. In an example embodiment of the present disclosure, more than one of these devices, or a plurality of one or more of these devices, may perform the disclosed process(es). These devices may connect to the internet 28 through a wireless connection 2.

[0182] The example embodiments of the present disclosure may also be implemented in a set-top box; i.e. a digital TV receiver, which may / may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and / or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and / or embedded systems offering hardware / software based coding. The example embodiments of the present disclosure may also be implemented in cellular telephones such as smart phones, tablets, personal digital assistants (PDAs) having wireless communication capabilities, portable computers having wireless communication capabilities, image capture devices such as digital cameras having wireless communication capabilities, gaming devices having wireless communication capabilities, music storage and playback appliances having wireless communication capabilities, Internet appliances permitting wireless Internet access and browsing, tablets with wireless communication capabilities, as well as portable units or terminals that incorporate combinations of such functions.

[0183] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24, which may be, for example, an eNB, gNB, access point, access node, other node, etc. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.

[0184] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), BLUETOOTH™, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology. A communications device involved in implementing various example embodiments of the present disclosure may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.

[0185] In telecommunications and data networks, a channel may refer either to a physical channel or to a logical channel. A physical channel may refer to a physical transmission medium such as a wire, whereas a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels. A channel may be used for conveying an information signal, for example a bitstream, which may be an MPEG-I bitstream, from one or several senders (or transmitters) to one or several receivers.

[0186] Having thus introduced one suitable but non-limiting technical context for the practice of the example embodiments of the present disclosure, example embodiments will now be described with greater specificity.

[0187] Fundamentals of neural networks

[0188] Features as described herein may generally relate to neural networks. A neural network (NN) may be described as a computation graph consisting of several layers of computation. In an example of a NN, each layer may consist of one or more units, where each unit may perform an elementary computation. A unit may be connected to one or more other units, and the connection may be associated with a weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers. Example embodiments of the present disclosure may or may not relate to, or involve, NN comprising multiple layers of computation.
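By way of a non-limiting illustration, the layered computation described above may be sketched as follows. The sketch assumes fully connected layers in which each unit scales its incoming signals by the connection weights, sums them, adds a bias and applies a ReLU non-linearity; the bias, the ReLU, the function name `dense_layer` and all numeric values are assumptions made for the example and are not taken from the disclosure.

```python
def dense_layer(inputs, weights, biases):
    """One layer of units.

    weights[u][i] is the weight of the connection from input i to unit u.
    Each unit computes a weighted sum of its inputs plus a bias, then ReLU.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, weighted_sum))  # ReLU (an assumption)
    return outputs


# A toy two-layer network: 2 inputs -> 2 hidden units -> 1 output unit.
hidden = dense_layer([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.0])
output = dense_layer(hidden, [[2.0, 0.5]], [0.1])
```

In a trained network the weights and biases would be learned from data rather than written by hand.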

[0189] In some neural networks, such as convolutional neural networks for image classification, initial layers (those close to the input data) may extract semantically low-level features such as edges and textures in images, whereas intermediate layers may extract higher-level features. After the feature extraction layers, there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc. Example embodiments of the present disclosure may or may not relate to, or involve, convolutional neural networks.

[0190] Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.

[0191] One property of neural nets / networks (and other machine learning tools) is that they are able to learn properties from input data, e.g., in a supervised way or in an unsupervised way. Such learning may be a result of a training algorithm, or may be achieved by means of another neural network providing the training signal (sometimes, this latter approach may be referred to as “meta learning”).

[0192] In general, the training algorithm may consist of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network may be used to derive a class or category index which may indicate the class or category to which the object in the input image belongs. Training may comprise minimizing or decreasing the output’s error, also referred to as the loss or loss function. Examples of losses are mean squared error, cross-entropy, etc. Example embodiments of the present disclosure may or may not relate to, or involve, neural networks trained according to a training algorithm.

[0193] In recent deep learning techniques, training may be an iterative process, where at each iteration the algorithm may modify the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss, for example by means of a gradient descent technique. In one example, at each training iteration, gradients of the loss function with respect to one or more weights or parameters of the NN may be computed, for example by a backpropagation technique; the computed gradients may then be used by an optimization routine, such as Adam or Stochastic Gradient Descent (SGD) to obtain an update to the one or more weights or parameters.
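As a non-limiting sketch of such an iterative process, the following example applies plain gradient descent to a model with a single weight. It assumes a linear model y = w * x, a mean squared error loss and a hand-derived gradient in place of backpropagation; the names `train_step` and `lr`, the learning rate and the toy data are hypothetical.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Analytic gradient of mse_loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train_step(w, xs, ys, lr=0.05):
    """One training iteration: move the weight against the gradient."""
    return w - lr * mse_grad(w, xs, ys)


# Toy training data generated by y = 3x (an assumption for the example).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w = 0.0
losses = []
for _ in range(50):
    losses.append(mse_loss(w, xs, ys))  # loss before this iteration's update
    w = train_step(w, xs, ys)
```

With these values the weight approaches 3 and the recorded loss shrinks from one iteration to the next. An optimizer such as Adam would replace the fixed-step update in `train_step`.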

[0194] In the present disclosure, the terms “model”, “neural network”, “neural net” and “network” are used interchangeably. In the present disclosure, the weights of neural networks may sometimes be referred to as learnable parameters or simply as parameters.

[0195] Training a neural network may be regarded as an optimization process, but the final goal may be different from the typical goal of optimization. In optimization, the main goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is at least partially different from the training set. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set may be monitored during the training process to understand the following:

[0196] - If the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting.

[0197] - If the network is learning to generalize - in this case, also the validation set error needs to decrease and to be not too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model may be in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set, but performs poorly on a set not used for tuning its parameters.
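The two checks above may be illustrated with a simple monitoring routine. This is only a sketch: the function name `diagnose`, the comparison of first and last epochs, and the `gap_tolerance` threshold are assumptions chosen for the example, and a practical system may use different criteria.

```python
def diagnose(train_errors, val_errors, gap_tolerance=0.5):
    """Classify a training run from per-epoch training and validation errors.

    gap_tolerance is a hypothetical threshold: the largest accepted relative
    excess of the final validation error over the final training error.
    """
    train_improved = train_errors[-1] < train_errors[0]
    val_improved = val_errors[-1] < val_errors[0]
    if not train_improved:
        # The training set error did not decrease: the network is not learning.
        return "underfitting"
    gap = (val_errors[-1] - train_errors[-1]) / max(train_errors[-1], 1e-12)
    if not val_improved or gap > gap_tolerance:
        # Good on the training set only: the training set was memorized.
        return "overfitting"
    return "generalizing"
```

For example, `diagnose([1.0, 0.5, 0.1], [1.0, 0.9, 1.1])` reports overfitting, because the validation error ends higher than it started while the training error keeps falling.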

[0198] Fundamentals of video / image coding

[0199] Features as described herein may generally relate to video or image coding. A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage / transmission, and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).

[0200] Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted, for example by motion compensation means (i.e. finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded), or by spatial means (i.e. using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (i.e. picture quality) and the size of the resulting coded video representation (i.e. file size or transmission bitrate).
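The second phase may be illustrated as follows. The sketch assumes a one-dimensional block of eight samples, an orthonormal DCT-II, and uniform scalar quantization with a single step size; entropy coding is omitted, and the sample values, the flat prediction and the function names are hypothetical.

```python
import math


def _scale(k, n):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(block):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(block)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(block))
            for k in range(n)]


def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def code_block(original, prediction, step):
    """Transform and quantize the prediction error, then reconstruct the block."""
    residual = [o - p for o, p in zip(original, prediction)]
    levels = [round(c / step) for c in dct(residual)]   # would be entropy coded
    decoded_residual = idct([lvl * step for lvl in levels])
    reconstruction = [p + r for p, r in zip(prediction, decoded_residual)]
    return levels, reconstruction


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


original = [52, 55, 61, 66, 70, 61, 64, 73]
prediction = [50] * 8          # stand-in for a motion-compensated or spatial prediction

fine_levels, fine_recon = code_block(original, prediction, step=1.0)
coarse_levels, coarse_recon = code_block(original, prediction, step=16.0)
```

The coarser step leaves fewer non-zero levels to code but gives a larger reconstruction error, which is the quality versus size balance controlled by the quantization fidelity.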

[0201] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures).

[0202] In temporal inter prediction, the sources of prediction are previously decoded pictures in the same scalable layer. In intra block copy (IBC; a.k.a. intra-block-copy prediction), prediction may be applied similarly to temporal inter prediction, but the reference picture is the current picture, and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal inter prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal inter prediction only, while in other cases inter prediction may refer collectively to temporal inter prediction and any of intra block copy, inter-layer prediction, and inter-view prediction, provided that they are performed with the same or similar process as temporal prediction. Inter prediction, temporal inter prediction, or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.

[0203] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.

[0204] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors, and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.

[0205] The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (e.g. using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (e.g. inverse operation of the prediction error coding recovering the quantized prediction error signal in the spatial pixel domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and / or storing it as prediction reference for the forthcoming frames in the video sequence.

[0206] In typical video codecs, the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, those are typically coded differentially with respect to block specific predicted motion vectors. In typical video codecs, the predicted motion vectors are created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and / or co-located blocks in the temporal reference pictures, and signaling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of previously coded / decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and / or co-located blocks in the temporal reference picture. Moreover, typical high efficiency video codecs employ an additional motion information coding / decoding mechanism, often called merging / merge mode, where all the motion field information, which includes the motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification / correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and / or co-located blocks in the temporal reference pictures, and the used motion field information is signaled among a list of motion field candidates filled with motion field information of available adjacent / co-located blocks.
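The median-based prediction and differential coding of motion vectors may be sketched as follows. The example assumes three adjacent blocks, integer motion vectors given as (horizontal, vertical) pairs, and a component-wise median; the function names and the numeric values are hypothetical.

```python
from statistics import median


def predict_mv(neighbor_mvs):
    """Component-wise median of the motion vectors of adjacent blocks."""
    return (median(mv[0] for mv in neighbor_mvs),
            median(mv[1] for mv in neighbor_mvs))


def encode_mv(mv, neighbor_mvs):
    """Encoder side: code only the difference to the predicted motion vector."""
    px, py = predict_mv(neighbor_mvs)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mv_difference, neighbor_mvs):
    """Decoder side: form the same prediction and add the decoded difference."""
    px, py = predict_mv(neighbor_mvs)
    return (mv_difference[0] + px, mv_difference[1] + py)


neighbors = [(4, -2), (5, -1), (3, -2)]   # already coded adjacent blocks
current = (5, -2)                         # motion vector of the block being coded

difference = encode_mv(current, neighbors)
recovered = decode_mv(difference, neighbors)
```

Because the decoder derives the same predictor from blocks it has already decoded, only the small difference needs to be carried in the bitstream.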

[0207] In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that, often, there still exists some correlation within the residual, and a transform can in many cases help reduce this correlation and provide more efficient coding.

[0208] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired Macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: C = D + λR

[0209] where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
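A mode decision based on this cost may be sketched as follows, with the weighting factor written as `lam`. The candidate modes and their distortion and rate figures are hypothetical values chosen for the example, not measurements.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Cost of one candidate: distortion plus weighted rate."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the candidate mode with the lowest Lagrangian cost.

    candidates maps a mode name to a (distortion, rate_bits) pair.
    """
    return min(candidates, key=lambda mode: lagrangian_cost(*candidates[mode], lam))


# Hypothetical candidates for one block: (distortion, bits needed).
candidates = {
    "intra": (40.0, 120),
    "inter": (55.0, 60),
    "skip": (150.0, 2),
}
```

With these figures a small weighting factor (0.1) selects the intra mode, a moderate one (1.0) selects the inter mode, and a large one (5.0) selects the skip mode, since a larger factor penalizes bits more heavily.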

[0210] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI network abstraction layer (NAL) units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike, and the latter type can end a picture unit or alike. An SEI NAL unit contains one or more SEI messages which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, postprocessing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264 / AVC, H.265 / HEVC, H.266 / VVC, and H.274 / VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages, but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically, and hence interoperate. System specifications may require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient may be specified.

[0211] Information on neural network based image / video coding

[0212] Features as described herein may generally relate to use of NN to code images and / or videos. Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.

[0213] In a first approach, NNs are used to replace one or more of the components of a traditional codec, such as a VVC / H.266-compliant codec. Here, “traditional” or “legacy” refers to codecs whose components and their parameters are typically not learned from data by means of machine learning techniques. Examples of components that may be implemented as neural networks are: an in-loop filter, for example a NN that works as an additional in-loop filter with respect to the traditional loop filters, or a NN that works as the only in-loop filter, thus replacing any other in-loop filter; intra-frame prediction; inter-frame prediction; transform and / or inverse transform; probability model for lossless coding; etc.

[0214] In a second approach, commonly referred to as “end-to-end learned compression” (or end-to-end learned codec), NNs are used as the main components of the image / video codecs. However, the codec may still comprise components which are not based on machine learning techniques. In this second approach, two design options are as follows:

[0215] - Option 1: re-use the traditional video coding pipeline, but replace most or all the components with NNs. Referring now to FIG. 3, illustrated is an example of an end-to-end learned codec that includes NNs replacing some components of the traditional video coding pipeline. Input signal (x) (302) may be combined (303) with other information and provided to a neural transform (304), which may also receive input from an encoder parameter control (306). Output of the neural transform (304) may be provided for quantization (308), and then for inverse quantization / neural transform (310) as well as entropy coding (312) to a bitstream (314). Entropy coding (312) may be performed based on input from the encoder parameter control (306).

[0216] The output of the inverse quantization / neural transform (310) may be combined with other information, and provided to a neural intra codec (316) and to a deep loop filter (324). The neural intra codec (316) may also receive input from the encoder parameter control (306), and may comprise an encoder (318), intra coding (320), and a decoder (322).

[0217] The deep loop filter (324) may also receive input from the encoder parameter control (306), and may provide output to a decode picture buffer (326), which may produce an enhanced reference frame (328) based, at least partially, on one or more reconstructed frames (330). The decode picture buffer (326) may provide output for inter prediction (332), which may provide output based, at least partially, on input from the encoder parameter control (306) and ME / MC (336), Gnet(Cnet(-)) (334).

[0218] In the example of FIG. 3, the forward and inverse transforms were replaced with two neural networks (304, 310), the intra codec comprises a neural intra codec (316), and the deep loop filter (324) is a neural network.

[0219] - Option 2: re-design the whole pipeline as a neural network auto-encoder with a quantization and lossless coding in the middle part. This option may also be referred to as end-to-end learned coding. The codec may comprise the following:

[0220] - Encoder NN (also referred to as a neural network based encoder, or NN encoder): performs a non-linear transformation of the input. The output is typically referred to as a latent tensor.

[0221] - Quantization and lossless encoding of the encoder NN’s output.

[0222] - Lossless decoding and dequantization.

[0223] - Decoder NN (also referred to as a neural network based decoder, or NN decoder): performs a non-linear inverse transformation from dequantized latent tensor to a reconstructed input.

[0224] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned / trained from data, such as the arithmetic codec.

[0225] Further information on neural network-based end-to-end learned video coding

[0226] Features as described herein may generally relate to NN-based end-to-end (E2E) learned video codecs. Referring now to FIG. 4, illustrated is an example of neural network-based end-to-end learned coding, such as an end-to-end learned video coding system or an end-to-end learned image coding system.

[0227] Even though some examples are provided with respect to coding images or videos, it is to be understood that other types of data may be coded in a similar way, such as audio, speech, text, features, etc. As shown in FIG. 4, a typical neural network-based end-to-end learned coding system comprises an encoder (405) and a decoder (460).

[0228] The encoder (405) comprises an encoder NN (415), a quantizer or quantization operation (425), a probability model (435), a lossless encoder (445) (for example arithmetic encoder). The decoder (460) comprises a lossless decoder (455) (for example, an arithmetic decoder), a probability model (465), a dequantizer or dequantization operation (475), and a decoder NN (485).

[0229] It is to be noted that the probability model (435) present at encoder side and the probability model (465) present at decoder side may be the same or substantially the same. For example, they may be two copies of the same probability model. The probability model (435, 465) may also be a neural network and / or may mainly comprise neural network components, and may be referred to as a neural network based probability model or learned probability model.

[0230] The lossless encoder (445) and the lossless decoder (455) form a lossless codec (440). A lossless codec may be an entropy-based lossless codec. An example of a lossless codec is an arithmetic codec, such as one based on context-adaptive binary arithmetic coding (CABAC). Sometimes, the term lossless codec may refer to a system that also comprises the probability model, in addition to, for example, an arithmetic encoder and an arithmetic decoder.

[0231] The encoder NN (415) and the decoder NN (485) may typically be two neural networks, or may mainly comprise neural network components.

[0232] The quantizer or quantization operation (425), dequantizer or dequantization operation (475) and lossless codec (440) are typically not based on neural network components, but may potentially comprise neural network components.

[0233] In the example of FIG. 4, the encoder NN (415) may take an input x (410), which may comprise, for example, an image to be compressed. The encoder NN (415) may output a latent tensor z (420). In one example, the latent tensor may be a 3D tensor, where the three dimensions of such tensor may represent a channel dimension, a vertical dimension (also sometimes referred to as height dimension) and a horizontal dimension (also sometimes referred to as width dimension). In another example, the latent tensor may be a 4D tensor, where the four dimensions of such tensor may represent a sample dimension (also sometimes referred to as batch dimension, which is the dimension along which different samples of data can be placed), a channel dimension, a vertical dimension (also sometimes referred to as height dimension) and a horizontal dimension (also sometimes referred to as width dimension). In yet another example, in the case of compressing a signal with a temporal dimension such as a video, the latent tensor may be a 4D tensor, where the four dimensions of such tensor may represent a channel dimension, a vertical dimension (also sometimes referred to as height dimension), a horizontal dimension (also sometimes referred to as width dimension), and a temporal dimension. The latent tensor (420) may be input to the quantizer or quantization operation (425), obtaining a quantized latent tensor zq (430). The quantized latent tensor zq (430) may be lossless-encoded into a bitstream b (450) by the lossless encoder (445), based also on an output of the probability model (435). In particular, the probability model may take as input at least part of the quantized latent tensor zq (430) and may output an estimate of a probability, or an estimate of a probability distribution, or an estimate of one or more parameters of a probability distribution, for one or more elements of the quantized latent tensor. The bitstream (450) may represent an encoded or compressed version of the input x (410).

[0234] The bitstream (450) may be lossless-decoded by the lossless decoder (455) also based on an output of the probability model (465) present at decoder side, obtaining a quantized latent tensor zq (470). The quantized latent tensor may be dequantized by the dequantizer or dequantization operation (475), obtaining a reconstructed latent tensor ẑ (480). The reconstructed latent tensor (480) may be input to a decoder NN (485), obtaining a reconstructed input x̂ (490), i.e., a reconstructed version of the input x (410). The reconstructed input x̂ (490) may also be referred to as reconstructed data, or reconstruction, or decoded data, or decoded input, or decoded output, and the like.
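The data flow of FIG. 4 may be sketched end to end as follows. This is a toy illustration under strong assumptions: a fixed scaling by `SCALE` stands in for the trained encoder NN and decoder NN, rounding is used as the quantization operation, and the general-purpose zlib compressor stands in for the arithmetic codec. The probability model is omitted entirely. All function names and sample values are hypothetical.

```python
import struct
import zlib

SCALE = 0.25  # stand-in for learned encoder/decoder parameters (assumption)


def encoder_nn(x):
    """Stand-in for the encoder NN (415): input samples -> latent tensor."""
    return [SCALE * v for v in x]


def quantize(z):
    """Quantization operation (425): round each latent element to an integer."""
    return [round(v) for v in z]


def lossless_encode(zq):
    """Stand-in for the lossless encoder (445): 16-bit integers, then zlib."""
    return zlib.compress(struct.pack(f"<{len(zq)}h", *zq))


def lossless_decode(bitstream):
    """Stand-in for the lossless decoder (455)."""
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def dequantize(zq):
    """Dequantization operation (475): back to floating point values."""
    return [float(v) for v in zq]


def decoder_nn(z_hat):
    """Stand-in for the decoder NN (485): latent tensor -> reconstruction."""
    return [v / SCALE for v in z_hat]


def encode(x):
    return lossless_encode(quantize(encoder_nn(x)))


def decode(bitstream):
    return decoder_nn(dequantize(lossless_decode(bitstream)))


x = [10, 12, 200, 203, 90, 91, 91, 94]
bitstream = encode(x)
x_hat = decode(bitstream)
```

The lossless stage returns the quantized latent exactly, so the only loss of information comes from the rounding step; with this scaling each reconstructed sample differs from the input by at most 2.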

[0235] FIG. 4 presents a simplified description of an end-to-end learned codec; more sophisticated designs, or variations of this design, are possible.

[0236] The neural network components, or a subset of the neural network components, of an end-to-end learned codec may be trained by minimizing a rate-distortion loss function: L = D + λR

[0237] where D is a distortion loss term, R is a rate loss term, and λ is a weight that controls the balance between the two losses. The distortion loss term may be referred to also as reconstruction loss term, or simply reconstruction loss. The rate loss term may be referred to simply as rate loss.

[0238] The distortion loss term measures the quality of the reconstructed or decoded output, and may comprise (but may not be limited to) one or more of the following:

[0239] - Mean square error (MSE)

[0240] - Structural similarity index measure (SSIM)

[0241] - Multi-scale structural similarity index measure (MS-SSIM)

[0242] - Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm.

[0243] - Losses derived from the use of a neural network that is trained (substantially) simultaneously with the end-to-end learned codec. For example, adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of Generative Adversarial Networks (GANs) and their variants.

[0244] - Loss that is related to a performance of one or more machine analysis tasks or to an estimated performance of one or more machine analysis tasks, where the one or more machine analysis tasks may comprise classification, object detection, image segmentation, instance segmentation, etc. In one example, the estimated performance of one or more machine analysis tasks may comprise a distortion computed based at least on a first set of features extracted from an output of the decoder, and a second set of features extracted from a respective ground truth data, where the first set of features and the second set of features are output by one or more layers of a pretrained feature-extraction neural network.

[0245] Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM.

[0246] The rate loss term may be used to train the encoder NN to output a low-entropy latent tensor, or a latent tensor such that the quantized latent tensor has low entropy, or a latent tensor such that the probability distribution of the quantized latent tensor may be better estimated or predicted by the probability model.

[0247] The rate loss term may be used to train the probability model to better estimate or predict the probability distribution of the quantized latent tensor.

[0248] Examples of the rate loss terms include the following:

[0249] - In one example, the rate loss term may be derived from the output of the probability model, and it may represent the estimated entropy of the quantized latent representation, which may indicate the number of bits necessary to represent the quantized latent tensor.

[0250] - A sparsification loss, i.e., a loss that encourages the quantized latent tensor to comprise many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm.
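The rate loss terms listed above may be illustrated as follows. The sketch assumes that the probability model has already produced, for each element of the quantized latent tensor, the probability of the value that element actually takes; the flat list layout, the function names and the numeric values are assumptions made for the example.

```python
import math


def estimated_bits(probabilities):
    """Estimated number of bits: sum of -log2(p) over the latent elements."""
    return sum(-math.log2(p) for p in probabilities)


def l0_norm(zq):
    """Number of non-zero elements."""
    return sum(1 for v in zq if v != 0)


def l1_norm(zq):
    """Sum of absolute values."""
    return sum(abs(v) for v in zq)


def l1_over_l2(zq):
    """L1 norm divided by L2 norm; defined as 0 for an all-zero tensor."""
    l2 = math.sqrt(sum(v * v for v in zq))
    return l1_norm(zq) / l2 if l2 > 0 else 0.0


zq = [0, 3, 0, 0, -4, 0]                       # flattened quantized latent
probs = [0.5, 0.125, 0.5, 0.5, 0.125, 0.5]     # model probability of each value
```

Here the zeros are predicted with probability 0.5 (1 bit each) and the two non-zero values with probability 0.125 (3 bits each), giving an estimate of 10 bits; a latent with more zeros or better-predicted values would lower every one of these terms.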

[0251] In order to train the neural network components, or a subset of the neural network components, of an end-to-end learned codec, one or more reconstruction losses may be used, and one or more rate losses may be used. In one example, the one or more reconstruction losses and / or one or more rate losses may be combined by means of a weighted sum. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion performance. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less, but to reconstruct with higher accuracy (e.g. as measured by a metric that correlates with the reconstruction losses). These weights are usually considered to be hyper-parameters of the training process, and may be set manually by the person designing the training process, or automatically, for example by grid search or by using additional neural networks.
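The effect of the weights may be illustrated with a small numeric sketch. It assumes one distortion term (an MSE value) and one rate term (a bit count) per operating point; the two operating points, their values and the weights are hypothetical.

```python
def total_loss(weighted_distortions, weighted_rates):
    """Weighted sum of loss terms; each argument is a list of (weight, value) pairs."""
    return (sum(w * v for w, v in weighted_distortions)
            + sum(w * v for w, v in weighted_rates))


# Two hypothetical behaviours the codec could converge to.
high_quality = {"mse": 2.0, "bits": 900.0}        # compresses less, reconstructs better
high_compression = {"mse": 12.0, "bits": 300.0}   # compresses more, reconstructs worse


def loss_for(point, distortion_weight, rate_weight):
    return total_loss([(distortion_weight, point["mse"])],
                      [(rate_weight, point["bits"])])
```

With a distortion weight of 100 and a rate weight of 1, the high-quality point has the lower loss (1100 versus 1500); with a distortion weight of 10 it is the high-compression point (420 versus 920), so the weights steer which behaviour training favours.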

[0252] In one case, the training process may be performed jointly with respect to the distortion loss D and the rate loss R. In another case, the training process may be performed in two alternating phases, where in a first phase only the distortion loss D may be used, and in a second phase only the rate loss R may be used.

[0253] For lossless video / image compression, the system may only comprise the probability model and lossless encoder and lossless decoder. The loss function would comprise only the rate loss, since the distortion loss is always zero (i.e., no loss of information).

[0254] In the present disclosure, inference phase, or inference stage, or inference time, or test time, refer to the phase when a neural network or a codec is used for its purpose, such as encoding and decoding an input image.

[0255] Information on Video Coding for Machines (VCM)

[0256] Features as described herein may generally relate to video coding for machines (VCM). Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e. consuming / watching the decoded images or videos. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (i.e., autonomous agents) that analyze data independently from humans, and may even make decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. For example, such analysis tasks may be performed by neural networks.

[0257] It is likely that the device where the analysis takes place has multiple “machines” or neural networks (NNs). These multiple machines may be used in a certain combination which is, for example, determined by an orchestrator sub-system. The multiple machines may be used, for example, in succession, based on the output of the previously used machine, and / or in parallel. For example, a video may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.

[0258] Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc. In addition to image and video data, automatic analysis and processing is increasingly being performed for other types of data, such as audio, speech, text.

[0259] Compressing (and decompressing) data where the end user comprises machines (e.g., neural networks) is commonly referred to as compression or coding for machines. In the case of video data, it is referred to as video compression or coding for machines (VCM). Compressing for machines may differ from compressing for humans, for example, with respect to the algorithms and technology used in the codec, or the training losses used to train any neural network components of the codec, or the evaluation methodology of codecs.

[0260] It is to be understood that, when considering the case of coding for machines, the terms “receiver-side” or “decoder-side” refer to the physical or abstract entity or device which comprises one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, the “encoder-side device”.

[0261] Referring now to FIG. 5, illustrated is an example of a pipeline of video coding for machines. A VCM encoder (510) may encode the input video (505) into a bitstream (515). A bitrate (525) may be computed (520) from the bitstream (515), as a measure of the size of the bitstream. A VCM decoder (530) may decode the bitstream (515) that was produced by the VCM encoder (510).
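As a minimal sketch of the "compute bitrate" step (520), the size of the bitstream may be expressed in kilobits per second or in bits per pixel. The function names and the choice of units are assumptions; FIG. 5 only states that the bitrate is a measure of the size of the bitstream.

```python
def bitrate_kbps(bitstream: bytes, num_frames: int, frame_rate: float) -> float:
    """Average bitrate in kilobits per second for a coded video."""
    duration_s = num_frames / frame_rate
    return len(bitstream) * 8 / duration_s / 1000.0


def bits_per_pixel(bitstream: bytes, width: int, height: int, num_frames: int = 1) -> float:
    """Average number of coded bits spent per pixel."""
    return len(bitstream) * 8 / (width * height * num_frames)
```

For example, a 125000-byte bitstream covering 30 frames at 30 frames per second corresponds to 1000 kbit/s.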

[0262] The output of the VCM decoder (530) may be referred to as “Decoded data for machines” or output (535). This data may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder. For example, this data may not be easily understandable by a human by simply rendering the data onto a screen, if such rendering is possible.

[0263] The output (535) of the VCM decoder (530) may then be input to one or more task neural networks (540, 545, 550, 555). In FIG. 5, for the sake of illustrating that there may be any number of task-NNs, there are three example task-NNs, and a non-specified one (Task-NN X, 555). One goal of VCM may be to obtain a low bitrate while guaranteeing that the task-NNs still perform well (580, 585, 590, 595) in terms of the evaluation metric associated to each task (560, 565, 570, 575).

[0264] It is to be understood that, in some cases, the VCM decoder may not be present. In one example, the machines may be run directly on the bitstream. In some other cases, the VCM decoder may comprise only a lossless decoding stage, and the lossless-decoded data may be provided as input to the machines. In yet some other cases, the VCM decoder may comprise a lossless decoding stage followed by a dequantization operation, and the lossless-decoded and dequantized data may be provided as input to the machines.

[0265] When a conventional video encoder, such as a H.266 / VVC encoder, is used as a VCM encoder, one or more of the following approaches may be used to adapt the encoding to be suitable to machine analysis tasks:

[0266] - One or more regions of interest (ROIs) may be detected. An ROI detection method may be used. For example, ROI detection may be performed using a task NN, such as an object detection NN. In some cases, ROI boundaries of a group of pictures or an intra period may be spatially overlaid and rectangular areas may be formed to cover the ROI boundaries. The detected ROIs (or rectangular areas, likewise) may be used in one or more of the following ways: the quantization parameter (QP) may be adjusted spatially in a manner that ROIs are encoded using finer quantization step size(s) than other regions (for example, QP may be adjusted CTU-wise); the video may be preprocessed to contain only the ROIs, while the other areas may be replaced by one or more constant values or removed; the video may be preprocessed so that the areas outside the ROIs are blurred or filtered; or a grid may be formed in a manner that a single grid cell covers an ROI, and grid rows or grid columns that contain no ROIs may be down-sampled as preprocessing to encoding.

[0267] - The quantization parameter of the highest temporal sublayer(s) may be increased (i.e., coarser quantization is used) when compared to practices for human-watchable video.

[0268] - The original video may be temporally down-sampled as preprocessing prior to encoding. A frame rate up-sampling method may be used as postprocessing subsequent to decoding, if machine analysis at the original frame rate is desired.

[0269] - A filter may be used to preprocess the input to the conventional encoder. The filter may be a machine learning based filter, such as a convolutional neural network.
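The CTU-wise QP adjustment mentioned in the first approach above may be sketched as follows: every CTU that overlaps a detected ROI rectangle receives a lower QP (finer quantization) than the rest of the picture. The base QP, the QP offset, the CTU size of 128 and the clamping range of 0 to 63 are assumptions chosen for illustration, and the function is not part of any encoder API.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in luma samples


def ctu_qp_map(width: int, height: int, rois: Sequence[Rect], base_qp: int = 37,
               roi_qp_offset: int = -6, ctu_size: int = 128) -> List[List[int]]:
    """Return one QP per CTU, lowered for CTUs that overlap an ROI."""
    cols = -(-width // ctu_size)   # ceiling division
    rows = -(-height // ctu_size)
    qp_map = [[base_qp] * cols for _ in range(rows)]
    roi_qp = max(0, min(63, base_qp + roi_qp_offset))
    for x, y, w, h in rois:
        if w <= 0 or h <= 0:
            continue
        # Clamp the range of covered CTUs to the picture area.
        first_col = max(x // ctu_size, 0)
        last_col = min((x + w - 1) // ctu_size, cols - 1)
        first_row = max(y // ctu_size, 0)
        last_row = min((y + h - 1) // ctu_size, rows - 1)
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                qp_map[r][c] = roi_qp
    return qp_map
```

For a 256x256 picture with one ROI at (130, 10) of size 20x20, only the top-right CTU receives the lowered QP.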

[0270] It is to be understood that, in the context of video coding for machines, the terms “machine vision”, “machine vision task”, “machine task”, “machine analysis”, “machine analysis task”, “computer vision”, “computer vision task”, "task network" and “task” may be used interchangeably. Also, it is to be understood that, in the context of video coding for machines, the terms “machine consumption” and “machine analysis” may be used interchangeably.

[0271] Neural network based filtering

[0272] A neural network may be used for filtering or processing input data. Such a neural network may be referred to as a neural network based filter, or simply as a NN filter. A NN filter may comprise one or more neural networks, and / or one or more components that may not be categorized as neural networks (i.e. may be categorized as traditional or legacy components that are not trained based on data using machine learning techniques). The purpose of a NN filter may comprise (but may not be limited to) visual enhancement, colorization, up-sampling, super-resolution, inpainting, temporal extrapolation, generating content, or the like.

[0273] In some video codecs, a neural network may be used as filter in the encoding and decoding loop (also referred to simply as coding loop), and it may be referred to as a neural network loop filter, or a neural network in-loop filter. The NN loop filter may replace all other loop filters of an existing video codec, or may represent an additional loop filter with respect to the already present loop filters in an existing video codec.

[0274] A neural network filter may be used as a post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts.

[0275] In one example, a codec is a modified VVC / H.266 compliant codec (e.g., a VVC / H.266 compliant codec that has been modified and thus may not be compliant with the VVC / H.266 standard) that comprises one or more NN loop filters. An input to the one or more NN loop filters may comprise at least a reconstructed block or frame (simply referred to as reconstruction) or data derived from a reconstructed block or frame (e.g., the output of a conventional loop filter). The reconstruction may be obtained based on predicting a block or frame (e.g., by means of intra-frame prediction or inter-frame prediction) and performing residual compensation. The one or more NN loop filters may enhance the quality of at least one of their inputs, so that a rate-distortion loss is decreased. The rate may indicate a bitrate (estimate or real) of the encoded video. The distortion may indicate a pixel fidelity distortion or a machine-task-related measure, such as one of the following:

[0276] - Mean-squared error (MSE).

[0277] - Mean absolute error (MAE).

[0278] - Mean Average Precision (mAP) computed based on the output of a task NN (such as an object detection NN) when the input is the output of the post-processing NN.

[0279] - Another machine task-related metric, for tasks such as object tracking, video activity classification, video anomaly detection, etc.

[0280] The enhancement may result in a coding gain, which may be expressed for example in terms of BD-rate or BD-PSNR (peak signal-to-noise ratio).
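The pixel fidelity measures listed above, and the PSNR on which BD-PSNR is based, may be computed as in the following minimal sketch. Pictures are represented as flat lists of sample values and a peak value of 255 (8-bit samples) is assumed; BD-rate and BD-PSNR themselves require several rate points and are not shown.

```python
import math
from typing import Sequence


def mse(ref: Sequence[float], test: Sequence[float]) -> float:
    """Mean-squared error between two equally sized sample lists."""
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)


def mae(ref: Sequence[float], test: Sequence[float]) -> float:
    """Mean absolute error between two equally sized sample lists."""
    return sum(abs(a - b) for a, b in zip(ref, test)) / len(ref)


def psnr(ref: Sequence[float], test: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(ref, test)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)
```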

[0281] A neural network filter may be used as a post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts. In one example, the NN filter may be used as a post-processing filter where the input comprises data that is output by or is derived from an output of a traditional decoder, such as a decoder that is compliant with the VVC / H.266 standard. In another example, the NN filter may be used as a post-processing filter where the input comprises data that is output by or is derived from an output of a decoder of an end-to-end learned codec.

[0282] Input to a NN filter

[0283] Various inputs may be provided to a NN filter. In the case of filtering images, a filter may take as input at least one or more first images to be filtered and may output at least one or more second images, where the one or more second images are the filtered version of the one or more first images. In one example, the filter may take as input one image, and output one image. In another example, the filter may take as input more than one image, and output one image. In another example, the filter may take as input more than one image, and output more than one image.

[0284] It is to be understood that a filter may take as input also other data (also referred to as auxiliary data, or extra data) besides the data that is to be filtered, such as data that may aid the filter to perform a better filtering than if no auxiliary data was provided as input. In one example, the auxiliary data may comprise information about prediction data, and / or information about the picture type, and / or information about the slice type, and / or information about a Quantization Parameter (QP) used for encoding, and / or information about boundary strength, etc. In one example, the filter may take as input one image and other data associated to that image, such as information about the quantization parameter (QP) used for quantizing and / or dequantizing that image, and output one image.
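One possible way to provide the QP as auxiliary data, sketched below, is to expand it into a constant plane of the same size as the picture and stack it with the picture as an extra input channel. This arrangement, the normalization by a maximum QP of 63, and the function name are assumptions for illustration; the disclosure does not prescribe how auxiliary data is presented to the filter.

```python
from typing import List

Plane = List[List[float]]


def build_filter_input(picture: Plane, qp: int, max_qp: int = 63) -> List[Plane]:
    """Stack the picture with a constant, normalized QP plane (channels first)."""
    qp_plane = [[qp / max_qp for _ in row] for row in picture]
    return [picture, qp_plane]
```

The filter would then receive a two-channel input in which every position carries both the sample value and the quantization level that produced it.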

[0285] Information on overfitting a neural network filter

[0286] Features as described herein may generally relate to adaptation of a NN. A NN filter may be adapted at test time based at least on part of the data to be encoded and / or decoded and / or postprocessed. Such operation may be referred to, for example, with one of the following terms, when their meaning is clear from the context: adaptation, content adaptation, overfitting, finetuning, optimization, specialization, and the like.

[0287] Although, for simplicity, the case of a NN filter is being considered herein, similar adaptation may be performed for other coding tools and / or post-processing tools that are based on neural network technology. For example, a neural network based intra-frame prediction, or a neural network based inter-frame prediction, etc.

[0288] The NN filter that results from the adaptation process may be referred to, for example, with one of the following terms: adapted filter, content-adapted filter, overfitted filter, finetuned filter, optimized filter, specialized filter, and the like.

[0289] At the encoder side, the adaptation process may start with an initial NN filter. In one example, the initial NN filter may be a pretrained NN filter that was pretrained during an offline stage on a sufficiently large dataset. In another example, the initial NN filter may be a randomly initialized NN filter.

[0290] In the adaptation, one or more parameters of the NN filter may be adapted. Examples of such parameters may include (but may not be limited to) the following: the bias terms of a convolutional neural network; multiplier parameters that multiply one or more tensors produced by the NN filter, such as one or more feature tensors that are output by respective one or more layers of the NN filter; parameters of the kernels of a convolutional neural network; parameters of an adapter layer; or one or more arrays or tensors that are used as input to respective one or more layers of the NN filter.

[0291] The adaptation may be performed by means of a training process, e.g., by minimizing a loss function until a stopping criterion is met. The data used for this training process may comprise one or more pictures or blocks of input to the NN filter and associated respective one or more pictures or blocks of ground-truth data. In one example where the filter is an in-loop filter, the input to the NN filter may be reconstruction data, after prediction and residual compensation; the ground-truth data may be the uncompressed data that is given as input to the encoder. In one example where the filter is a post-processing filter, the input to the NN filter may be decoded data (e.g., the output of a video decoder); the ground-truth data may be the uncompressed data that is given as input to the encoder.

[0292] The loss function used during the training process may comprise one or more distortion loss functions (also referred to as reconstruction loss functions) and zero or more rate loss functions. A rate loss function may measure, for example, the cost in terms of bitrate of signaling any adaptation signal, such as updates to the parameters of the NN filter. A distortion loss function may comprise one of MSE, MS-SSIM, Video Multimethod Assessment Fusion (VMAF), etc.

[0293] The adaptation signal may be derived or determined based on the adapted NN filter and on the original NN filter (i.e., the NN filter before the overfitting process). In one example, the adaptation signal comprises an update to one or more parameters of the NN filter. Such an update may also be referred to as weight update, or parameter update. Such an update may be computed, for example, by subtracting the values of the original parameters (i.e., the parameters of the original NN filter) from the corresponding values of the adapted parameters (i.e., the parameters of the adapted NN filter), so that adding the update to the original parameters yields the adapted parameters. In another example, the adaptation signal may comprise the parameters (of the NN filter) that were adapted, also referred to as updated parameters, or adapted parameters, or adapted weights, or overfitted parameters, and the like.
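A minimal sketch of deriving a weight update is given below. It uses the sign convention under which the update, when added to the original parameters at the decoder side, reproduces the adapted parameters. The dictionary-of-lists representation and the parameter names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def weight_update(original: Params, adapted: Params) -> Params:
    """Per-parameter difference (adapted minus original) for the adapted tensors only."""
    return {
        name: [a - o for a, o in zip(values, original[name])]
        for name, values in adapted.items()
    }


# Hypothetical filter with a bias vector and a kernel; only the bias was adapted.
original_params = {"bias": [1.0, 2.0], "kernel": [0.5]}
adapted_params = {"bias": [1.5, 2.0]}
update = weight_update(original_params, adapted_params)
```

Only the parameters that were actually adapted appear in the update, which keeps the adaptation signal small when, for example, only bias terms are overfitted.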

[0294] In order to keep the size of the adaptation signal low, the adaptation signal may go through one or more compression steps, such as sparsification, quantization and lossless coding, etc. In one example, an encoder that compresses the adaptation signal into a bitstream that is compliant with a neural network compression standard, such as MPEG neural network coding (NNC), may be used.
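The compression steps named above may be sketched as a chain of sparsification, uniform quantization and lossless coding. In this sketch `zlib` is only a stand-in for the lossless coding stage and does not produce an NNC-compliant bitstream; the threshold, the quantization step size and the 16-bit level range are assumptions.

```python
import struct
import zlib
from typing import List


def compress_update(update: List[float], threshold: float = 0.01,
                    step: float = 0.005) -> bytes:
    """Sparsify, quantize to 16-bit integer levels, then losslessly code."""
    levels = []
    for value in update:
        if abs(value) < threshold:       # sparsification: drop small updates
            value = 0.0
        level = round(value / step)      # uniform quantization
        levels.append(max(-32768, min(32767, level)))
    raw = struct.pack(f"<{len(levels)}h", *levels)
    return zlib.compress(raw)            # lossless coding (stand-in)


def decompress_update(payload: bytes, step: float = 0.005) -> List[float]:
    """Inverse of compress_update, up to sparsification and quantization error."""
    raw = zlib.decompress(payload)
    levels = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [level * step for level in levels]
```

The quantization step size must be known to the decoder in order to dequantize, which is one example of information that accompanies the adaptation signal.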

[0295] The compressed adaptation signal may be signaled from encoder to decoder in or along a bitstream that represents encoded image or video data. In one example, the compressed adaptation signal may be signaled in an Adaptation Parameter Set (APS) syntax structure of a video coding bitstream. In another example, the compressed adaptation signal may be signaled in a Supplemental Enhancement Information (SEI) message of a video coding bitstream. Signaling may comprise also other information which is associated with the adaptation signal and that may be required for correctly parsing and / or decompressing and / or using the adaptation signal, such as any quantization parameters.

[0296] Referring now to FIG. 6, illustrated is an example of an overfitting process (605) at the encoder side. The overfitting process (605) may be performed at the encoder side based on a training process. Input (610) may be provided to a NN filter (615) to determine an output (620). Loss (640) may be computed (630) between ground truth (625) and the output (620). The loss (640) may be provided to determine overfitting (635), which may be provided to the NN filter (615).

[0297] The resulting overfitted filter (645) may then be used to derive an overfitting signal (655), or adaptation signal (660). The overfitting signal may be derived or determined based partially on the original NN filter (650). The adaptation signal (660) may be compressed (665) to determine a compressed adaptation signal (670) and then signaled (675) from the encoder to the decoder, in or along a bitstream that represents encoded data, such as an encoded image or video.

[0298] In the example of FIG. 6, element 610 represents an input to the NN filter, element 620 represents an output of the NN filter (615), element 625 represents ground-truth data associated with the input (610), “Compute loss” (630) may compute a training loss (640) in order to overfit the NN filter, and “Overfit” (635) may use the training loss (640) to overfit the NN filter (615). As a result of the overfitting process (605), an overfitted NN filter may be obtained (645), which may be used (655), together with the original NN filter (650), to derive an adaptation signal (660). The adaptation signal may be compressed (665) and signaled (675) to a decoder or receiver.

[0299] At the decoder or receiver side, the signaled compressed adaptation signal may be received and decompressed. The decompressed adaptation signal may then be used to update the NN filter. In one example, where the adaptation signal comprises a weight update that comprises one or more updates to respective one or more parameters of the NN filter, the one or more updates may be added to the one or more parameters. In another example, where the adaptation signal comprises one or more updated or adapted parameters, the one or more updated or adapted parameters may be used to replace respective one or more parameters of the NN filter.
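The two decoder-side cases above may be sketched as a single function with a mode switch. The mode names "update" and "replace", the dictionary representation and the parameter names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def apply_adaptation(params: Params, signal: Params, mode: str) -> Params:
    """Return updated filter parameters; parameters absent from the signal are kept."""
    updated = {name: list(values) for name, values in params.items()}
    for name, values in signal.items():
        if mode == "update":
            # The signal carries differences that are added to the parameters.
            updated[name] = [p + u for p, u in zip(params[name], values)]
        elif mode == "replace":
            # The signal carries the adapted parameters themselves.
            updated[name] = list(values)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return updated
```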

[0300] Once the NN filter has been updated based on the adaptation signal, the updated NN filter may be used for its purpose. For example, for filtering an input picture or an input block.

[0301] Referring now to FIG. 7, illustrated is an example of use of an adaptation signal for overfitting at the decoder or receiver side. A compressed adaptation signal (710) may be decompressed (720) to derive a decompressed adaptation signal (730). At the decoder side, the overfitting signal (730), or a signal derived or determined from the overfitting signal, may be used to update (750) the NN filter (740). The updated NN filter (760) may then be used to filter one or more pictures, or one or more blocks.

[0302] In the examples of FIGs. 6-7, the NN filter that is obtained from the overfitting process at encoder side may be different from the NN filter that is obtained from the updating process at decoder side. For example, one reason may be that the adaptation signal may be compressed in a lossy way. Thus, the former NN filter may be referred to as overfitted filter or adapted filter (or other similar terms, see above), and the latter NN filter may be referred to as updated filter.

[0303] In the present disclosure, the terms frame, picture and image may be used interchangeably. For example, the input and output to an end-to-end learned codec may be pictures. The input and output of a NN filter may be pictures. It is to be understood that also the term block, when it means a portion of a picture, may be simply referred to as frame or picture or image. In other words, at least some of the embodiments herein, even when described as applied to a picture, may be applicable also to a block, e.g., to a portion of a picture.

[0304] Example embodiments of the present disclosure may consider image and video as the data types. However, this is not limiting; the example embodiments may be extended to other types of data, such as audio.

[0305] At least some of the embodiments described herein are applied to image or picture data; however, it is to be understood that at least some of the embodiments described herein are also applicable to or valid for other types of data, such as video, audio, 3D images, 3D video, depth, opacity level data, and the like.

[0306] In some embodiments, image and video data may be collectively referred to as visual data, and it is to be understood that visual data may refer to either image data or video data or both.

[0307] In the present disclosure, the terms signal, data, tensor and information may be used interchangeably to indicate an input or an output.

[0308] In the present disclosure, an end-to-end learned codec may be referred to also as E2E learned codec, or learned codec, or E2E codec.

[0309] In the present disclosure, neural network layers may be simply referred to as layers, or as a set of layers.

[0310] In at least some embodiments, a generator or generative neural network may refer to a neural network that may have generative capabilities, or a neural network that may be considered as generative artificial intelligence (AI), or a neural network that is capable of generating new content, or a neural network that is capable of extrapolating data for example with respect to a training distribution. An example of a generator is a neural network trained based on the Generative Adversarial Network (GAN) algorithm or paradigm, such as an image generator. Another example generator is a neural network trained based on diffusion modelling (also referred to as denoising score matching, or denoising diffusion probabilistic modelling), such as an image diffusion model.

[0311] In some embodiments, terms domain and space may be used interchangeably when referring to a domain or space of some data, such as of images. Domain or space may refer to some characteristics of some data, or a type of data. For example, image space and image domain may be used interchangeably; and feature space and feature domain may be used interchangeably.

[0312] In some embodiments, terms pixel domain, pixel space, picture domain, picture space, image domain, image space may be used interchangeably.

[0313] In some embodiments, terms latent and feature may be used interchangeably. For example, terms latent space and feature space may be used interchangeably.

[0314] It is to be understood that one or more operations performed by a data decoder, such as an image decoder, may be comprised in a data encoder, such as an image encoder. In an example, all the operations of an image decoder may be present in an image encoder.

[0315] In an embodiment, a codec, such as an end-to-end learned codec, comprises an encoder and a decoder, where the encoder may encode an input data item into a bitstream and the decoder may decode the bitstream into a decoded output, and where the decoder may comprise one or more generators, where the one or more generators may generate data in one or more of a latent domain, or a target domain of interest such as image domain.

[0316] In one or more embodiments and examples at least one of the one or more generators may be referred to as “the generator”.

[0317] In an example, the input data item is a picture or image to be coded, the codec is an image codec, the decoded output is a decoded picture or decoded image.

[0318] Input to the generators

[0319] In an embodiment, an input to the generator, which may be referred to as a generator input, may comprise a signal received from the encoder or data derived therefrom.

[0320] Referring now to FIG. 8, illustrated is an example embodiment, in which the generator input is a lossless-decoded latent tensor 814. As shown in FIG. 8, an image 802 is provided as an input to an NN encoder 804, which generates a latent tensor 806 as an output. The latent tensor 806 is provided as an input to a lossless encoder 808, which generates a bitstream 810. The bitstream 810 is received by a lossless decoder 812, which generates a lossless-decoded latent tensor 814. The lossless-decoded latent tensor 814 is provided as an input to a generator 816, which is comprised in a decoder 818. The generator 816 generates a decoded image 820.
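The data flow of FIG. 8 may be sketched with simple stand-ins for each block. The NN encoder and the generator are trained neural networks in the embodiment; here they are replaced by trivial pair averaging and sample repetition so that the sketch is runnable, and `zlib` stands in for the lossless encoder and decoder. All function bodies are assumptions, and only the order of operations follows the figure.

```python
import zlib
from typing import List


def nn_encoder(image: List[int]) -> List[int]:
    """Stand-in for NN encoder 804: average neighbouring sample pairs into a latent."""
    return [(image[i] + image[i + 1]) // 2 for i in range(0, len(image) - 1, 2)]


def lossless_encode(latent: List[int]) -> bytes:
    """Stand-in for lossless encoder 808 (latent values must be in 0..255)."""
    return zlib.compress(bytes(latent))


def lossless_decode(bitstream: bytes) -> List[int]:
    """Stand-in for lossless decoder 812."""
    return list(zlib.decompress(bitstream))


def generator(latent: List[int]) -> List[int]:
    """Stand-in for generator 816: repeat each latent value twice."""
    return [value for value in latent for _ in range(2)]


image = [10, 12, 200, 202, 50, 54]             # image 802 (one row of 8-bit samples)
latent = nn_encoder(image)                     # latent tensor 806
bitstream = lossless_encode(latent)            # bitstream 810
decoded_latent = lossless_decode(bitstream)    # lossless-decoded latent tensor 814
decoded_image = generator(decoded_latent)      # decoded image 820
```

The lossless stage returns the latent unchanged, so any difference between the input image and the decoded image arises in the NN encoder and the generator.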

[0321] Referring now to FIG. 9, illustrated is another example embodiment, in which the generator input is an output of a neural network decoder 902 that is comprised in the decoder. In this example, the input to the neural network decoder 902 is a lossless-decoded latent tensor 904. Thus, the lossless-decoded latent tensor 904 is input to the neural network decoder 902 to obtain an output of the neural network decoder 902, and the output of the neural network decoder 902 is input to the generator 906. In an embodiment, the generator 906 may be considered as a post-processing operation.

[0322] In yet another example embodiment, the generator input is a text prompt, such as a prompt in natural language domain and in the form of a text string, or in the form of a tokenized text string, or in the form of features extracted from a tokenized text string, or in the form of features extracted from a text string.

[0323] In yet another example embodiment, the generator input is a prompt in feature space or domain.

[0324] In yet another example, the generator input is a prompt inferred from an output of the NN decoder.

[0325] In yet another example embodiment, the generator input is a prompt inferred from an output of the lossless decoder.

[0326] In some embodiments, the input to a generator may comprise a noise signal. In one example, the noise signal may be generated at the decoder side by an algorithm. In another example, the noise signal or one or more parameters to generate the noise signal may be derived at the encoder side and signaled to the decoder side. In one example embodiment, the one or more parameters that generate the noise signal may comprise mean and / or variance parameters of a Gaussian-distributed noise.
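A minimal sketch of decoder-side noise generation from signaled mean and variance parameters follows. The optional seed is an assumption that is not stated in the disclosure; it is included only because a shared seed is one way for an encoder and a decoder to obtain the same noise signal.

```python
import math
import random
from typing import List, Optional


def generate_noise(num_samples: int, mean: float = 0.0, variance: float = 1.0,
                   seed: Optional[int] = None) -> List[float]:
    """Draw Gaussian noise samples from the given mean and variance."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    rng = random.Random(seed)          # local generator, does not touch global state
    std = math.sqrt(variance)
    return [rng.gauss(mean, std) for _ in range(num_samples)]
```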

[0327] Output of generators

[0328] In an embodiment, an output of the generator, which may be referred to as generator output, may comprise a signal from which an output of the decoder may be determined or derived.

[0329] Referring now to FIG. 10, illustrated is an example embodiment, in which the generator (1006) output is input to a neural network decoder 1002 that is comprised in the decoder. The neural network decoder 1002 decodes a lossless decoded latent tensor 1004 based on the generator (1006) output to generate decoded image 1008.

[0330] In another example, the generator output is an output of the decoder, such as a decoded picture.

[0331] Referring now to FIG. 11, illustrated is yet another example embodiment, in which the generator (1102) output comprises features 1104 that are input to a synthesis neural network 1106. The features may comprise a feature tensor and an output of the synthesis neural network 1106 may comprise an output of the decoder such as a decoded picture / image 1108.

[0332] Referring now to FIG. 12, illustrated is an example embodiment, in which an output of the generator 1202 is combined with a residual 1204. The output of the generator 1202 may be referred to as generator output and may be combined with the residual 1204 by using at least a combination operation 1206, where the residual 1204 or data from which the residual 1204 is obtained is signaled from the encoder. In an example, the combination operation 1206 comprises one or more of a summation operation, a tensor concatenation operation, one or more neural network layers. In another example, the residual 1204 and the generator (1202) output are in feature space or domain. In yet another example, the residual 1204 and the generator (1202) output are in picture or pixel or image space or domain.
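Two of the combination operations named above, summation and concatenation, may be sketched on flat lists of values as follows. The mode names are hypothetical, and the neural network layer option is not shown.

```python
from typing import List


def combine(generator_output: List[float], residual: List[float],
            mode: str = "sum") -> List[float]:
    """Combine a generator output with a residual (combination operation 1206)."""
    if mode == "sum":
        if len(generator_output) != len(residual):
            raise ValueError("summation needs inputs of equal size")
        return [g + r for g, r in zip(generator_output, residual)]
    if mode == "concat":
        return list(generator_output) + list(residual)
    raise ValueError(f"unknown mode: {mode}")
```

Summation keeps the size of the generator output, whereas concatenation grows it and would typically be followed by further layers that reduce it again.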

[0333] In an embodiment, the output of the generator may be used as an intra-frame prediction or inter-frame prediction of an image encoder and / or decoder or of a video encoder and / or decoder. The intra-frame prediction or the inter-frame prediction may comprise data in a latent space (e.g., features) or data in a target space (e.g., image).

[0334] In an embodiment, the output of the generator may be used as a reference picture in a video encoder and / or decoder.

[0335] In an embodiment, the output of the generator may be used as a temporal extrapolation in a video encoder and / or decoder.

[0336] In an embodiment, the output of the generator may be used as a spatial extrapolation in an image or video encoder and / or decoder.

[0337] In an embodiment, when the generator is a group of generators that comprises two or more generator modules, the output of one generator module in the group, or data derived therefrom, may be comprised in an input to other one or more generator modules in the group.

[0338] In an example, the generator is a 2-step generator, in which each step is handled by a separate generator module, and an output of a first generator module in the 2-step generator is an input of a second generator module in the 2-step generator.
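The chaining of generator modules described above may be sketched as a simple composition, where each module consumes the output of the previous one. The two modules below are trivial placeholders for trained networks, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Module = Callable[[List[float]], List[float]]


def run_generator_group(modules: Sequence[Module],
                        generator_input: List[float]) -> List[float]:
    """Feed the output of each generator module into the next one."""
    data = generator_input
    for module in modules:
        data = module(data)
    return data


def first_module(x: List[float]) -> List[float]:
    """Placeholder for the first step of a 2-step generator."""
    return [2.0 * value for value in x]


def second_module(x: List[float]) -> List[float]:
    """Placeholder for the second step of a 2-step generator."""
    return [value + 1.0 for value in x]


two_step_output = run_generator_group([first_module, second_module], [1.0, 2.0])
```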

[0339] In another example, a generator group may comprise a global generator and a local generator. The global generator may learn to generate global embeddings from previous decoded latent tensors, and the local generator directly takes the generated embeddings as input to generate output in the image space. In yet another example, the global embeddings are combined with the current decoded latent tensor to be the input of the local generator. The ‘Different generators for generating different features’ section provides additional information for an example setup.

[0340] External conditioning

[0341] In an embodiment, one or more external inputs with respect to the decoder may be used as an input to the generator, or may be used to process or modify an input prompt to the generator, where the one or more external inputs may not be comprised in a bitstream decoded by the decoder. For example, the one or more external inputs may be information that is available at decoder side, such as information provided by an application or device that comprises or uses the decoder, or information that is decoded from a bitstream by means of another decoder.

[0342] Referring now to FIG. 13, illustrated is an example embodiment, in which audio 1302 is used as an external input to a generator 1304. In this example, the one or more external inputs comprises an audio signal that is derived from an audio track associated with one or more pictures that are input to the encoder 1306. For example, when a voice is dubbed to a certain language that is different from a language spoken in the original video that was input to the encoder 1306, the voice may be comprised in the one or more external inputs; the dubbed voice may then be used to generate data (e.g., face expressions and lips appearance in a picture or video) that are consistent or aligned with the dubbed voice.

[0343] In another example, the one or more external inputs may comprise ambient information, such as viewing conditions. In a further example, when a decoded picture is to be viewed in low light conditions, the generator may generate content that is more suitable or more pleasant to watch in low light conditions, such as soft illumination, soft colors, more details, etc. In another example, when a decoded picture is to be viewed in high light conditions, the generator may generate content that is more suitable or more pleasant to watch in high light conditions, such as strong illumination, strong colors, less details, etc.

[0344] Prompt

[0345] In an embodiment, an input to the generator comprises a prompt, such as a text prompt, that may be determined by one or more of the following: the encoder; or a process which is at encoder side but external to the encoder.

[0346] In an example, an input to the encoder comprises a picture, and a neural network takes as input the picture and outputs the text prompt.

[0347] Referring now to FIG. 14, illustrated is another example embodiment, in which a picture 1402 is input to a neural network. In this example, the neural network may be referred to as an infer prompt neural network 1404 and outputs a prompt 1406 for the picture. The picture 1402 and the inferred prompt 1406 are provided as input to the encoder 1408.

[0348] In an embodiment, a prompt that is used by the generator, such as a text prompt, or data from which a prompt that is used by the generator is derived, may be comprised in an input to the encoder. In an embodiment, the encoder may encode a prompt that is used by the generator (e.g., a text prompt) or data derived therefrom, in a lossless way or in a lossy way, to obtain an encoded prompt (e.g., an encoded text prompt). The encoded prompt may then be signaled or input to the decoder.

[0349] In an embodiment, a picture and a prompt that is used by the generator, such as a text prompt, may be encoded by two separate encoders. In another embodiment, a picture and a prompt that is used by the generator, such as a text prompt, may be encoded by the same encoder.

[0350] In an embodiment, a picture and data from which a prompt that is used by the generator is derived may be encoded by two separate encoders. In another embodiment, a picture and data from which a prompt that is used by the generator is derived may be encoded by the same encoder.

[0351] Referring now to FIG. 15, illustrated is an example embodiment, in which an image 1502 is generated by an image generator 1504 based on a prompt 1506. In this example, the prompt 1506 is not available to the encoder 1508 and a process referred to as an infer prompt 1510 (e.g., a neural network) determines another prompt that may be referred to as an inferred prompt 1512. The inferred prompt 1512 is encoded by a prompt encoder 1514 to obtain an encoded inferred prompt. The encoded inferred prompt is decoded by a prompt decoder 1516 to obtain a decoded inferred prompt. The decoded inferred prompt is input to the generator 1518, where the generator 1518 is available at both encoder and decoder sides. An output (a generated image 1519) of the generator 1518 is input to the encoder 1508 and to a decoder 1520. The encoder 1508 also gets as input the image 1502 to be coded, and outputs a bitstream 1522 representing an encoded image. The bitstream 1522 is input to the decoder 1520, together with the output of the generator, to generate an output. The output of the decoder comprises a decoded image 1524.

[0352] Referring now to FIG. 16, illustrated is another example embodiment, in which an input to the encoder 1602 may comprise a picture 1604 and a text prompt 1606. The picture 1604 in this embodiment is obtained by using an image generator 1608 and where an input to the image generator 1608 is the text prompt 1606.

[0353] Referring now to FIG. 17, illustrated is yet another example embodiment, in which a prompt 1702 and a picture 1704 are encoded into one or more bitstreams 1706. At least one of the one or more bitstreams 1706 is decoded to obtain a decoded prompt, and the decoded prompt 1708 or data derived therefrom is used as an input to the generator 1710.

[0354] In an embodiment, the encoder or a process that is external to the encoder, may elaborate or process a prompt (e.g., a text prompt), to obtain an elaborated or processed prompt. The elaborated or processed prompt, or data derived therefrom, may then be signaled or input to the decoder or to the generator. In an additional embodiment, the elaborated or processed prompt may be obtained based also on the input data item, such as an input picture to be encoded by the encoder. In another additional embodiment, the input data item may be generated by another generator based at least on the prompt.

[0355] Referring now to FIG. 18, illustrated is an example embodiment, in which an encoder 1802 receives a picture 1804 and an associated text prompt 1806 as an input. In this example, the associated text prompt 1806 is used by an image generator 1808 (e.g., a high complexity image generator) to obtain the picture 1804. Optionally, the encoder 1802 may also receive information indicative of one or more characteristics of the image generator, such as a complexity. The picture 1804 and the associated text prompt 1806 are input to a vision-language model 1807 (e.g., elaborate prompt circuit or module), comprised in the encoder 1802, that outputs an elaborated or processed text prompt 1809. The elaborated or processed text prompt 1809 and the picture 1804 are encoded by the encoder 1802. The picture 1804 may first be transformed to a latent tensor by using a NN encoder, then the latent tensor may be quantized and lossless-coded, to obtain an encoded picture. The encoded elaborated or processed text prompt and the encoded picture are input to the decoder 1810. The encoded picture may be first lossless decoded, to obtain a latent tensor or quantized latent tensor; the latent tensor or quantized latent tensor may be dequantized to obtain a dequantized latent tensor; the latent tensor, or the quantized latent tensor, or the dequantized latent tensor may be input to a NN decoder and an output of the NN decoder may be input to the generator, or the latent tensor, or the quantized latent tensor, or the dequantized latent tensor may be input to the generator. The encoded elaborated or processed text prompt is decoded to obtain a decoded elaborated or processed text prompt 1812 that is input to the generator 1814. The decoded elaborated or processed text prompt 1812 may be more suitable for being input to the generator 1814 comprised in the decoder 1810, for example, because the generator 1814 is of lower complexity compared to the image generator 1808 that generated the picture 1804. An output of the generator may comprise an output of the decoder, such as an output picture.
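As an illustration of the quantization and dequantization steps mentioned for FIG. 18, the following is a minimal sketch that assumes uniform scalar quantization with a hypothetical step size `step`. The description does not specify the quantizer; the function names and the flat list representation of the latent tensor are illustrative assumptions only.

```python
from typing import List


def quantize(latent: List[float], step: float) -> List[int]:
    """Map each latent value to the index of its nearest quantization level."""
    if step <= 0.0:
        raise ValueError("step must be positive")
    # Python's round() breaks exact ties toward the even integer.
    return [round(v / step) for v in latent]


def dequantize(indices: List[int], step: float) -> List[float]:
    """Map quantization indices back to reconstructed latent values."""
    return [i * step for i in indices]


latent = [0.26, -1.1, 2.0]
indices = quantize(latent, step=0.5)       # integers that would be lossless-coded
reconstructed = dequantize(indices, step=0.5)
```

Under this assumption, the reconstruction error of each value is bounded by half the step size, so a larger step trades reconstruction accuracy for fewer distinct symbols to code.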

[0356] In an embodiment, the elaborated or processed text prompt 1809 may be determined also based on an output of the image generator 1808 when an input to the image generator 1808 is a prompt from which the elaborated or processed text prompt 1809 is derived. In other words, an input to the process that elaborates a prompt may comprise an output of the generator that is at a decoder side, when the prompt is provided as input to that generator. The prompt may be a prompt used to generate an image to be encoded or a prompt that was inferred based on an image to be encoded.

[0357] Prompt loss

[0358] In an embodiment, the generator may be trained based at least on a prompt loss. The prompt loss may be a training loss, or training objective, that is determined or computed based on a first prompt and a second prompt or based on data derived from a first prompt and data derived from a second prompt. The first prompt may be determined or inferred based on an output of the generator or on data derived therefrom, such as a decoded picture. The second prompt may be one of the following: a prompt that was used to generate a picture that was input to the encoder; or a prompt that was determined or inferred based on a picture that was input to the encoder.

[0359] In an example, computing the prompt loss may comprise inputting the first prompt and the second prompt to a neural network, such as a language model, running the neural network to obtain an output that represents the prompt loss or data from which the prompt loss is derived.

[0360] In another example, computing the prompt loss may comprise computing a metric, such as an error, based on the first prompt and the second prompt, or based on first features extracted from the first prompt and second features extracted from the second prompt.
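As a non-limiting illustration of a metric computed on features extracted from the first prompt and the second prompt, the following sketch assumes a hypothetical bag-of-words feature extractor and uses cosine distance as the error. Neither choice is prescribed by the description; a learned text encoder could take the place of `prompt_features`.

```python
import math
from collections import Counter


def prompt_features(prompt: str) -> Counter:
    """Hypothetical feature extractor: lowercase bag-of-words counts."""
    return Counter(prompt.lower().split())


def prompt_loss(first_prompt: str, second_prompt: str) -> float:
    """Cosine distance between the two feature vectors (near 0.0 when they match)."""
    a = prompt_features(first_prompt)
    b = prompt_features(second_prompt)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty prompt as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


# First prompt: inferred from the generator output; second: associated with the encoder input.
loss = prompt_loss("a red fox in snow", "a red fox in the snow")
```

This metric is not differentiable with respect to the generator output, so in a training setup it would serve as a monitoring signal or would need to be replaced by a differentiable feature extractor.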

[0361] In an embodiment, the generator may be trained or finetuned or overfitted based at least on the prompt loss at encoder side when encoding the input data item to obtain an update to the generator. The update or a signal derived therefrom, such as a compressed update, may be signaled to the decoder. The decoder may use the update, or a signal derived therefrom, such as a decompressed update, to update the generator. For example, the update may comprise one or more parameters of the generator, or one or more parameter updates for respective one or more parameters of the generator.
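The two forms of update described above (replacement parameters, or parameter updates applied to existing parameters) can be sketched as follows. The dictionary representation, the `is_delta` flag, and the function name are hypothetical and are used only to make the distinction concrete.

```python
from typing import Dict


def apply_update(params: Dict[str, float],
                 update: Dict[str, float],
                 is_delta: bool) -> Dict[str, float]:
    """Return updated generator parameters; entries absent from the update are kept."""
    new_params = dict(params)  # leave the caller's parameters untouched
    for name, value in update.items():
        if name not in new_params:
            raise KeyError(f"unknown generator parameter: {name}")
        # Delta mode adds to the current value; otherwise the value replaces it.
        new_params[name] = new_params[name] + value if is_delta else value
    return new_params


generator_params = {"w": 1.0, "b": 0.5}
decoded_update = {"w": 0.25}  # e.g., obtained by decompressing a signaled update
updated = apply_update(generator_params, decoded_update, is_delta=True)
```

Signaling only the parameters that change, as in this sketch, is one way the size of the update could be kept small.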

[0362] Training features

[0363] In an embodiment, the generator may be pretrained and then frozen (e.g., left unmodified) during the training of one or more other components of the codec.

[0364] In another embodiment, the generator may be pretrained and then finetuned jointly with one or more other components of the codec.

[0365] In yet another embodiment, the generator may be trained jointly with one or more components of the codec from scratch or from an initialization of its parameters such as from a random initialization.

[0366] Noise signal for the generator

[0367] In one embodiment, the generator at the decoder side may take a noise signal as input. The noise signal may guide the generation of certain content, for example, the texture or details of the fur of an animal or grasses. The noise signal may be generated at the decoder side with an algorithm. In another embodiment, the noise signal or the parameters that generate the noise signal may be derived at the encoder side and signaled to the decoder. In another example, the noise signal may be further processed by a processing component and the output of the processing component is the input of the generator.

[0368] Referring now to FIG. 19, illustrated is an example embodiment, in which parameters for the noise signal (e.g., noise parameters 1902) are determined at the encoder side and signaled to the decoder side. In this example, the received noise parameters are used to generate a noise signal that is input to a generator 1904.

[0369] In another embodiment, the noise signal at the decoder side may be a combination of noise signal generated by the noise parameters 1902 signaled from the encoder and a random noise generated at the decoder side by a noise generator 1906.

[0370] In another embodiment, the noise signal at the decoder side may be derived from the decoded bitstream.

[0371] In an additional embodiment, in order to make the noise generation deterministic and reproducible at the decoder 1908, additional parameters that control the noise generation process, such as random seed, may be signaled to the decoder 1908.
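A minimal sketch of seed-controlled noise generation, together with one possible combination of signaled and decoder-side noise, is given below. The Gaussian distribution, the weighted-sum combination, and all names are assumptions made for illustration; the description does not fix any of them.

```python
import random
from typing import List


def make_noise(seed: int, length: int) -> List[float]:
    """Draw `length` Gaussian samples from a private generator seeded with `seed`."""
    rng = random.Random(seed)  # private instance, unaffected by global random state
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def combine_noise(signaled: List[float], local: List[float], weight: float) -> List[float]:
    """Hypothetical combination: weighted sum of signaled and decoder-local noise."""
    if len(signaled) != len(local):
        raise ValueError("noise signals must have the same length")
    return [weight * s + (1.0 - weight) * l for s, l in zip(signaled, local)]


# The same signaled seed yields the same noise signal on every run of this code.
noise_a = make_noise(seed=1234, length=8)
noise_b = make_noise(seed=1234, length=8)
```

A seed alone makes the noise reproducible only when both sides run the same pseudo-random algorithm with identical arithmetic; an actual codec would therefore also need to specify the noise generation algorithm itself.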

[0372] In yet another embodiment, the generating process comprises performing one or more iterations or repetitions of an operation (e.g., a denoising operation), where the cardinality of the one or more iterations is a number N. In this embodiment, the encoder may signal the number N, representing the number of iterations, to the decoder. In one example, the decoder applies a denoising operation N times by using the noise signal. In another example, when a diffusion model comprises performing 10 iterations of a denoising operation, the encoder signals the number 10 to the decoder.
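The role of the signaled number N can be sketched as a loop that applies one denoising operation N times. The `denoise_step` function below is a toy smoothing filter standing in for a learned denoiser, and all names are hypothetical.

```python
from typing import List


def denoise_step(x: List[float]) -> List[float]:
    """Toy stand-in for one learned denoising step: 3-tap average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def run_generator(noise: List[float], num_iterations: int) -> List[float]:
    """Apply the denoising operation `num_iterations` times, starting from the noise signal."""
    if num_iterations < 0:
        raise ValueError("num_iterations must be non-negative")
    x = list(noise)
    for _ in range(num_iterations):
        x = denoise_step(x)
    return x


signaled_n = 10  # the number N received from the encoder
output = run_generator([0.0, 9.0, 0.0, 3.0], signaled_n)
```

Because the decoder reads N from the bitstream rather than using a fixed value, the encoder can choose, per content item, how much decoder-side computation to request.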

[0373] Generating regions

[0374] In an embodiment, in which one or more regions of a picture are edited or modified or generated by another generator based on an input prompt to obtain an edited or modified picture, the encoder may encode at least one of the one or more regions in a different way than other regions of the edited or modified picture, such as in one or more of the following ways: lower quality (e.g., by using a higher quantization parameter), a lower resolution, partially coded (e.g., only luma channel is coded, or only a key or important object in the input is coded, or only a spatial portion is coded), not coded at all. In an additional embodiment, a mask or an opacity-level map may be encoded and signaled to the decoder, where the mask or the opacity-level map may be obtained in one or more of the following ways: the mask or the opacity-level map is an output of the another generator; the mask or the opacity-level map is determined based on the edited or modified picture; the mask or the opacity-level map is determined based on the edited or modified picture and the picture (e.g., the original picture before the modifications performed by the another generator); the mask or the opacity-level map is determined based on a prompt that was input to the another generator; the mask or the opacity-level map is determined based on a prompt that was inferred based on the edited or modified picture.

[0375] Referring now to FIG. 20, illustrated is an example embodiment, in which some regions of an input picture or image 2002 are edited by an image editor 2004. In this example, the image editor 2004 outputs an edited image 2006 and a mask 2008 that indicates the edited regions, such as the location of the edited regions.

[0376] In an embodiment, the encoder 2010 or a process that is external to the encoder may determine one or more regions of an input picture or image 2002, where at least one of the one or more regions may be encoded by the encoder 2010 in a different way than other regions of the input picture or image 2002, for example, in one or more of the following ways: lower quality (e.g., by using a higher quantization parameter), a lower resolution, partially coded (e.g., only luma information is coded or only the key or important object in the input is coded, or only a spatial portion is coded), not coded at all. In an additional embodiment, the at least one of the one or more regions may be reconstructed by the decoder by means of using the generator. In an additional embodiment, the at least one of the one or more regions may be determined to be a region that the generator at decoder side or the decoder that comprises the generator can generate or reconstruct with a sufficient quality with respect to a quality threshold or other criterion.
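One way a decoder could use a signaled mask or opacity-level map is to blend generated content into the indicated regions while keeping conventionally decoded samples elsewhere. The per-sample linear blend below is an assumption made for illustration and is not stated in the description; pictures are represented as hypothetical lists of rows of sample values.

```python
from typing import List

Picture = List[List[float]]


def composite(decoded: Picture, generated: Picture, opacity: Picture) -> Picture:
    """Blend per sample: opacity 1.0 takes the generated sample, 0.0 keeps the decoded one."""
    if not (len(decoded) == len(generated) == len(opacity)):
        raise ValueError("inputs must have the same number of rows")
    out: Picture = []
    for d_row, g_row, a_row in zip(decoded, generated, opacity):
        if not (len(d_row) == len(g_row) == len(a_row)):
            raise ValueError("rows must have the same length")
        out.append([a * g + (1.0 - a) * d for d, g, a in zip(d_row, g_row, a_row)])
    return out


decoded = [[10.0, 20.0], [30.0, 40.0]]
generated = [[0.0, 100.0], [0.0, 100.0]]
mask = [[0.0, 1.0], [0.0, 0.5]]  # a binary mask is the special case of values 0.0 and 1.0
result = composite(decoded, generated, mask)
```

With a binary mask this reduces to selecting either the decoded or the generated sample per position, which corresponds to regions that are not coded at all being filled in entirely by the generator.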

[0377] Overfitting a decoder followed by a generator

[0378] In an embodiment, the decoder comprises a neural network based decoder (or simply NN decoder). In this embodiment: an input to the NN decoder may be a lossless decoded latent tensor, an output of the NN decoder is input to the generator, the encoder determines an update (may be referred to also as overfitting signal) to the NN decoder based at least on an input to the encoder (such as an input image to be encoded) and signals it to the decoder, and the update is used to update the NN decoder to obtain an updated NN decoder. The updated NN decoder processes the lossless decoded latent tensor and provides its output to the generator. The output of the NN decoder and of the updated NN decoder may be either data in a target domain (e.g., a decoded picture) or data in a feature domain. The output of the generator may be in a target domain, such as in image domain.

[0379] Referring now to FIG. 21, illustrated is an example embodiment, in which an encoder 2102 encodes an image 2104 into a bitstream 2106 including an encoded latent tensor, an encoded overfitting signal and an encoded prompt (e.g., a text prompt and/or a feature prompt). The overfitting signal, or a decoded overfitting signal 2108, is used to update or overfit the decoder 2110 or part of the decoder 2110 (e.g., a NN decoder). The output of the decoder 2110 (which has been overfitted) and the prompt 2112 are input to the generator 2114 to generate an output. In an example, the generator 2114 may be regarded as a post-processing operation with respect to the decoder. In an example, the output of the generator is a final decoded image 2116.

[0380] In another embodiment, the overfitting signal derived from the input image may update the generator at the decoder side, to obtain an updated generator. The updated generator takes the output of the decoder and a prompt as the input and outputs a decoded image.

[0381] In yet another embodiment, the overfitting signal may be in a form of text or feature prompt (and may be referred to also as overfitting prompt). The generator may be updated in terms of its network weights, internal state or any processing function according to the overfitting prompt. The updated or overfitted generator may then be used to generate a decoded image.

[0382] Generator as hyper-prior latent generator

[0383] In an embodiment, the generator may be used to generate hyper-prior latents, when an input to the generator comprises noise distribution parameters. The hyper-prior latents may be used to derive one or more probability distribution parameters of a latent tensor, and the one or more probability distribution parameters may be used to lossless-decode a bitstream to obtain a decoded latent tensor.
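To make concrete how probability distribution parameters can drive lossless decoding, the sketch below assumes that each quantized latent element is modeled by a Gaussian with a mean and a scale, and computes the probability mass that an entropy decoder would assign to each integer symbol. The Gaussian model, the integer quantization grid, and all names are assumptions; the description only refers to one or more probability distribution parameters.

```python
import math
from typing import List


def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    """Cumulative distribution function of a Gaussian with the given mean and scale."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol: int, mean: float, scale: float) -> float:
    """Probability mass of an integer symbol: Gaussian mass on [symbol - 0.5, symbol + 0.5]."""
    return gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)


def code_length_bits(symbols: List[int], means: List[float], scales: List[float]) -> float:
    """Ideal code length, in bits, of the symbols under the per-element distributions."""
    total = 0.0
    for s, m, sc in zip(symbols, means, scales):
        p = max(symbol_probability(s, m, sc), 1e-12)  # guard against log of zero
        total += -math.log2(p)
    return total


# Parameters that would be derived from the hyper-prior latents (values are made up).
bits = code_length_bits([2, -1, 0], means=[1.8, -0.7, 0.1], scales=[0.5, 1.0, 0.3])
```

The encoder and decoder must derive identical parameters for the entropy decoding to succeed, which is why the hyper-prior latents themselves have to be reproducible at the decoder side.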

[0384] In an additional embodiment, the encoder may comprise an inverted version of the generator present at decoder side, referred to as inverted generator, where the inverted generator is used as a hyper-prior encoder which maps from hyper-prior latents to noise distribution parameters.

[0385] In an additional embodiment, the generator may be a latent diffusion model.

[0386] In an additional embodiment, the generator comprises an architecture which is invertible, such as comprising coupled layers as used in normalizing flows architectures.
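The invertibility provided by coupling layers can be illustrated with an additive coupling layer, a common building block of normalizing flows. The sketch below is a generic example of that technique and is not taken from the description; the `shift` function is a hypothetical stand-in for a learned sub-network.

```python
import math
from typing import List


def shift(x1: List[float]) -> List[float]:
    """Hypothetical conditioner; it does not need to be invertible itself."""
    return [2.0 * math.tanh(v) for v in x1]


def coupling_forward(x: List[float]) -> List[float]:
    """Keep the first half unchanged and shift the second half by a function of the first."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + t for b, t in zip(x2, shift(x1))]


def coupling_inverse(y: List[float]) -> List[float]:
    """Undo the forward pass: the first half is available unchanged, so the shift is recomputable."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - t for b, t in zip(y2, shift(y1))]


x = [0.5, -1.0, 2.0, 3.0]
recovered = coupling_inverse(coupling_forward(x))  # equals x up to floating-point rounding
```

In the context of the inverted generator described above, the forward direction would play the role of the generator and the inverse direction the role of the hyper-prior encoder, or vice versa.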

[0387] Different generators for generating different features

[0388] In an embodiment, two of the one or more generators may generate data comprising different features, aspects or characteristics of data to be decoded.

[0389] In an additional embodiment, when the codec is a video codec, a first generator of the one or more generators may generate image texture data or data from which image texture data may be derived (such as texture features or texture latents) and a second generator of the one or more generators may generate motion data or data from which motion data may be derived (such as motion features or motion latents). For example, the first and second generators may be two latent diffusion models that generate texture latents and motion latents, respectively.

[0390] In an additional embodiment, when the codec is a video codec, a first generator of the one or more generators may generate motion data or data from which motion data may be derived (such as motion features or motion latents) and a second generator of the one or more generators may generate residual data or data from which residual data may be derived. For example, the first and second generators may be two latent diffusion models that generate motion latents and residual latents, respectively.

[0391] In an additional embodiment, when the codec is an image codec or a video codec, a first generator of the one or more generators may generate prediction data (such as an intra-frame prediction or an inter-frame prediction) or data from which prediction data may be derived (such as features of an intra-frame prediction or features of an inter-frame prediction) and a second generator of the one or more generators may generate residual data or data from which residual data may be derived. For example, the first and second generators may be two latent diffusion models that generate prediction latents and residual latents, respectively.

[0392] Referring now to FIG. 22, illustrated is another additional example embodiment, in which a generator group may comprise a global generator 2202 and a local generator 2204. The global generator 2202 may learn to generate global embeddings from previous inputs, and the local generator 2204 may utilize the global embeddings or data derived from the global embeddings to generate the target data. For example, when the codec is a video codec, the input data of the global generator 2202 may be derived from the decoded frames buffer, a buffer 2206 of previous and current decoded latent tensors 2208, a buffer 2210 of previous and current outputs of the NN decoder 2212, or from the output of a processing module 2214 that takes one or more of those buffers as input.

[0393] FIG. 23 is a diagram illustrating an example apparatus 2300, which may be implemented in hardware, configured to implement the examples described herein. The apparatus 2300 comprises at least one processor 2302 (e.g., an FPGA and/or CPU), at least one memory 2304 including computer program code 2305, the computer program code 2305 having instructions to carry out the methods described herein, wherein the at least one memory 2304 and the computer program code 2305 are configured to, with the at least one processor 2302, cause the apparatus 2300 to implement circuitry, a process, component, module, or function (implemented with control module 2306) to implement the examples described herein, including implementing generative learned coding, for example, a generative end-to-end learned coding. Optionally included encoder 2308 of the control module 2306 implements encoding based on the examples described herein, and optionally included decoder 2310 implements decoding based on the examples described herein. The at least one memory 2304 may be a non-transitory memory, a transitory memory, a volatile memory (e.g., RAM), or a non-volatile memory (e.g., ROM).

[0394] The apparatus 2300 includes a display and/or I/O interface 2312, which includes user interface (UI) circuitry and elements, that may be used to display features or a status of the methods described herein (e.g., as one of the methods is being performed or at a subsequent time), or to receive input from a user such as by using a keypad, camera, touchscreen, touch area, microphone, biometric recognition, one or more sensors, etc. The apparatus 2300 includes one or more communication (e.g., network (N/W)) interfaces (I/F(s)) 2314. The communication I/F(s) 2314 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique including via one or more links 2316. The communication I/F(s) 2314 may comprise one or more transmitters or one or more receivers.

[0395] The transceiver 2318 comprises one or more transmitters 2320 and one or more receivers 2322. The transceiver 2318 and/or communication I/F(s) 2314 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitries and one or more antennas, such as antennas 2324 used for communication over wireless link 2326.

[0396] The control module 2306 of the apparatus 2300 comprises one of or both parts 2306-1 and/or 2306-2, which may be implemented in a number of ways. The control module 2306 may be implemented in hardware as control module 2306-1, such as being implemented as part of the at least one processor 2302. The control module 2306-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the control module 2306 may be implemented as control module 2306-2, which is implemented as computer program code (having corresponding instructions) 2305 and is executed by the at least one processor 2302. For instance, the at least one memory 2304 stores instructions that, when executed by the at least one processor 2302, cause the apparatus 2300 to perform one or more of the operations as described herein. Furthermore, the at least one processor 2302, the at least one memory 2304, and example algorithms (e.g., as flowcharts and/or signaling diagrams), encoded as instructions, programs, or code, are means for causing performance of the operations described herein.

[0397] The apparatus 2300 to implement the functionality of control module 2306 may correspond to any of the apparatuses depicted herein. Alternatively, apparatus 2300 and its elements may not correspond to any of the other apparatuses depicted herein, as apparatus 2300 may be part of a self-organizing/optimizing network (SON) node or other node, such as a node in a cloud.

[0398] The apparatus 2300 may also be distributed throughout the network including within and between apparatus 2300 and any network element (such as a base station and/or terminal device and/or user equipment).

[0399] Interface 2328 enables data communication and signaling between the various items of apparatus 2300, as shown in FIG. 23. For example, the interface 2328 may be one or more buses such as address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. Computer program code (e.g. instructions) 2305, including control module 2306 may comprise object-oriented software configured to pass data or messages between objects within computer program code 2305. The apparatus 2300 need not comprise each of the features mentioned, or may comprise other features as well. The various components of apparatus 2300 may at least partially reside in a housing 2330, or a subset of the various components of apparatus 2300 may at least partially be located in different housings, which different housings may include housing 2330.

[0400] FIG. 24 is a diagram illustrating a representation of non-volatile memory media 2400a (e.g., computer/compact disc (CD) or digital versatile disc (DVD)) and 2400b (e.g., universal serial bus (USB) memory stick) and 2400c (e.g., cloud storage for downloading instructions and/or parameters 2402 or receiving emailed instructions and/or parameters 2402) storing instructions and/or parameters 2402 which, when executed by a processor, allow the processor to perform one or more of the operations of the methods described herein. Instructions and/or parameters 2402 may represent or correspond to a non-transitory computer readable medium.

[0401] FIG. 25 is an example method 2500 performed with an apparatus, based on the examples described herein. At 2502, the method 2500 includes receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor. At 2504, the method 2500 includes generating, based at least on the generator input, a generator output. At 2506, the method 2500 includes deriving, based at least on the generator output, an output of an apparatus.

[0402] In an embodiment, the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0403] In an embodiment, the method 2500 may be performed with a decoding apparatus, such as the apparatus 2300, or any other decoding apparatus described herein.

[0404] FIG. 26 is another example method 2600 performed with an apparatus, based on the examples described herein. At 2602, the method 2600 includes receiving a generator input comprising a decoded latent tensor, or data derived from the decoded latent tensor. At 2604, the method 2600 includes generating, based at least on the generator input, a generator output. At 2606, the method 2600 includes deriving, based at least on the generator output, an output of an apparatus.

[0405] In an embodiment, the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0406] In an embodiment, the method 2600 may be performed with a decoding apparatus, such as the apparatus 2300, or any other decoding apparatus described herein.

[0407] FIG. 27 is yet another example method 2700 performed with an apparatus, based on the examples described herein. At 2702, the method 2700 includes receiving a generator input comprising a prompt. At 2704, the method 2700 includes generating, based at least on the generator input, a generator output. At 2706, the method 2700 includes deriving, based at least on the generator output, an output of an apparatus.

[0408] In an embodiment, the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0409] In an embodiment, the method 2700 may be performed with a decoding apparatus, such as the apparatus 2300, or any other decoding apparatus described herein.

[0410] FIG. 28 is still another example method 2800 performed with an apparatus, based on the examples described herein. At 2802, the method 2800 includes receiving a generator input comprising a signal, or data derived from the signal. At 2804, the method 2800 includes generating, based at least on the generator input, a generator output. At 2806, the method 2800 includes deriving, based at least on the generator output, an output of an apparatus.

[0411] In an embodiment, the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

[0412] In an example embodiment, the signal comprises a decoded latent tensor or a prompt, and the data derived from the signal comprises data derived from the decoded latent tensor or data derived from the prompt.

[0413] In an embodiment, the method 2800 may be performed with a decoding apparatus, such as the apparatus 2300, or any other decoding apparatus described herein.

[0414] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e. tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

[0415] It should be understood that the foregoing description is only illustrative. Various alternatives and modifications can be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.

Claims

What is claimed is:

1. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

2. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

3. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

4. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of the apparatus.

5. The apparatus of claim 4, wherein the signal comprises a decoded latent tensor or a prompt, and the data derived from the signal comprises data derived from the decoded latent tensor or data derived from the prompt.

6. The apparatus of any of the claims 1, 2 or 5, wherein the decoded latent tensor comprises a lossless-decoded latent tensor that is comprised in an output of a lossless decoder.

7. The apparatus of any of the claims 1, 2, 5, or 6, wherein the data derived from the decoded latent tensor comprises an output of a neural network decoder that is comprised in a decoder, and wherein the lossless-decoded latent tensor is an input to the neural network decoder.

8. The apparatus of any of the claims 1, 3, or 5, wherein the prompt comprises one of the following: a text prompt; an image; a prompt in feature space or domain; a prompt inferred from an output of a neural network decoder; or a prompt inferred from an output of a lossless decoder.

9. The apparatus of claim 8, wherein the text prompt comprises a prompt in the natural language domain in the form of: a text string; a tokenized text string; features extracted from a tokenized text string; or features extracted from a text string.

10. The apparatus of any of the claims 1 to 9, wherein the generator input further comprises a noise signal.

11. The apparatus of claim 10, wherein the noise signal is generated at the decoder side.

12. The apparatus of claim 10, wherein the noise signal or one or more parameters for generating the noise signal are received from an encoder side.

13. The apparatus of any of the claims 1, 2, 5, 6, 7, or 10 to 12, wherein the generator output comprises an input to a neural network decoder comprised in the apparatus, and wherein the neural network decoder decodes the decoded latent tensor based on the generator output.

14. The apparatus of any of the claims 1 to 13, wherein the generator output comprises a decoded image.

15. The apparatus of any of the claims 1 to 14, wherein the apparatus is further caused to perform: receiving, from an encoder, a residual or data from which a residual is derived; and combining the generator output with the residual by using at least a combination operation.

16. The apparatus of claim 15, wherein the combination operation comprises one of a summation operation, a tensor concatenation operation, or one or more neural network layers.
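As a hedged illustration of the combination operations recited in claims 15 and 16, the sketch below combines a generator output with a residual by summation and by concatenation. The flat-list tensor representation and the function names `combine_sum` and `combine_concat` are assumptions made for illustration only and do not appear in the claims; the option of one or more neural network layers is not shown.

```python
from typing import List

# Hypothetical stand-in for a tensor in feature space or pixel space.
Tensor = List[float]


def combine_sum(generator_output: Tensor, residual: Tensor) -> Tensor:
    """Element-wise summation of the generator output and the residual."""
    if len(generator_output) != len(residual):
        raise ValueError("summation requires tensors of equal size")
    return [g + r for g, r in zip(generator_output, residual)]


def combine_concat(generator_output: Tensor, residual: Tensor) -> Tensor:
    """Concatenation along the single axis of this toy representation."""
    return generator_output + residual


generated = [0.5, 0.25, 1.0]   # illustrative generator output
residual = [0.5, -0.25, 0.0]   # illustrative residual received from an encoder

summed = combine_sum(generated, residual)
concatenated = combine_concat(generated, residual)
```

Summation keeps the size of the generator output, whereas concatenation doubles it and would typically be followed by further processing; which operation applies is left open by the claims.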

17. The apparatus of any of claims 15 or 16, wherein the residual and the generator output are in feature space or domain; or the residual and the generator output are in picture space or domain, pixel space or domain, or image space or domain.

18. The apparatus of any of the claims 1 to 17, wherein the generator output is used as one of the following: an intra-frame prediction or inter-frame prediction of: an image encoder and / or decoder or a video encoder and / or decoder; a reference picture in the video encoder and / or decoder; a temporal extrapolation in the video encoder and / or decoder; and a spatial extrapolation in the image encoder and / or decoder or the video encoder and / or decoder.

19. The apparatus of any of the claims 1 to 18, wherein when a generator comprises a group of generators comprising two or more generator modules, an output of one generator module in the group, or data derived therefrom, is comprised in an input to one or more other generator modules in the group.

20. The apparatus of claim 19, wherein a generator group of the group of generators comprises a global generator and a local generator, and wherein the global generator learns to generate global embeddings from previous decoded latent tensors, and wherein the local generator directly takes the generated embeddings as input to generate output in an image space.

21. The apparatus of claim 20, wherein the apparatus is further caused to perform: combining the global embeddings with the decoded latent tensor to obtain an input of the local generator.

22. The apparatus of any of the claims 1 to 21, wherein the apparatus is further caused to perform: receiving one or more external inputs with respect to the apparatus; and using the one or more external inputs as an input to the generator, using the one or more external inputs for processing the prompt, or using the one or more external inputs for modifying the prompt; wherein the one or more external inputs are not comprised in a bitstream decoded by the decoder.

23. The apparatus of claim 22, wherein the one or more external inputs comprise one or more of the following: an audio signal that is derived from an audio track associated with one or more pictures that are input to a codec comprising the decoder; or ambient information.

24. The apparatus of any of the claims 1, 3, 5, or 8 to 23, wherein: the prompt is determined by an encoder and / or by a process which is at encoder side but external to the encoder; the prompt is generated by an encoder based on a first picture as an input; the prompt is inferred by a neural network based on a second picture as an input, wherein the second picture and the inferred prompt are provided as input to the encoder; the prompt is comprised in an input to the encoder; the prompt is encoded by an encoder in a lossless way or in a lossy way, and wherein the encoded prompt is received by the apparatus as an input or as a signal; the prompt is used by the generator to generate an image, and wherein the prompt and the generated image are provided as an input to the encoder; the prompt is elaborated or processed by the encoder or by a process that is external to the encoder, and wherein the elaborated or processed prompt or data derived therefrom is received by the decoder or the generator; the prompt is used to generate an image to be encoded; or the prompt is inferred based on an image to be encoded.

25. The apparatus of claim 24, wherein the first picture and / or the second picture are encoded by a first encoder and the prompt is encoded by a second encoder.

26. The apparatus of claim 24, wherein the first picture, the second picture, and / or the prompt are encoded by the same encoder.

27. The apparatus of any of the claims 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: receiving an encoded inferred prompt, wherein the encoded inferred prompt is an encoded version of an inferred prompt, and wherein the inferred prompt is generated based on an image generated by using the prompt; decoding the encoded inferred prompt to generate a decoded inferred prompt; providing the decoded inferred prompt as an input to the generator to obtain the generator output, wherein the generator is available at both the encoder side and the decoder side; providing the generator output as an input to the encoder and to the decoder, wherein a bitstream representing an encoded image is generated by the encoder based on an image and the generator output; receiving the bitstream; and generating a decoded image based on the bitstream and the generator output.

28. The apparatus of any of the claims 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: decoding at least one of one or more bitstreams to obtain a decoded prompt, wherein a prompt and a picture are encoded into the one or more bitstreams; and using the decoded prompt or data derived from the decoded prompt as an input to the generator.

29. The apparatus of claim 24, wherein the elaborated or processed prompt is obtained based on a third picture that is input to the encoder.

30. The apparatus of any of the claims 1, 3, 5, or 8 to 23, wherein the apparatus is further caused to perform: receiving an encoded elaborated or processed prompt and an encoded picture as an input, wherein an elaborated or processed prompt is generated based on a picture and the prompt, and wherein the picture is generated based on the prompt, and wherein the encoded picture is an encoded version of the picture; decoding the encoded elaborated or processed prompt to obtain a decoded elaborated or processed prompt; and providing the decoded elaborated or processed prompt as an input to the generator.

31. The apparatus of claim 24, wherein the elaborated prompt is determined based on the generator output when the input to the generator is the prompt from which the elaborated prompt is derived.

32. The apparatus of any of the claims 1 to 31, wherein the apparatus is trained based at least on a prompt loss, and wherein the prompt loss comprises a training loss, or training objective, that is determined or computed based on a first prompt and a second prompt or based on data derived from the first prompt and data derived from the second prompt, wherein the first prompt is determined or inferred based on the generator output or on data derived from the generator output, and the second prompt is one of the following: a prompt that was used to generate a picture that was input to the encoder; or a prompt that was determined or inferred based on the picture that was input to the encoder.

33. The apparatus of claim 32, wherein the prompt loss is computed by providing the first prompt and the second prompt as an input to a neural network and running the neural network to obtain an output that represents the prompt loss or data from which the prompt loss is derived.

34. The apparatus of claim 32, wherein the prompt loss is computed by computing a metric based on the first prompt and the second prompt, or based on first features extracted from the first prompt and second features extracted from the second prompt.
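A minimal sketch of the metric-based prompt loss of claim 34 follows, assuming the first and second prompts have already been mapped to fixed-length feature vectors. The claims do not name a metric; cosine distance is chosen here purely as an example, and `cosine_distance` and `prompt_loss` are hypothetical names.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def prompt_loss(first_features: Sequence[float],
                second_features: Sequence[float]) -> float:
    """Assumed prompt loss: distance between features of the first prompt
    (inferred from the generator output) and features of the second prompt
    (used to generate, or inferred from, the picture input to the encoder)."""
    return cosine_distance(first_features, second_features)


identical = prompt_loss([1.0, 0.0], [1.0, 0.0])    # prompts agree
orthogonal = prompt_loss([1.0, 0.0], [0.0, 1.0])   # prompts unrelated
```

Under this assumed metric the loss is near zero when the prompt inferred from the generator output matches the reference prompt, and it grows as the two diverge.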

35. The apparatus of any of the claims 1 to 31, wherein the apparatus is trained, finetuned, or overfitted based at least on the prompt loss at encoder side when encoding a fourth picture to obtain an update to the generator, and wherein the update or a signal derived from the update is received by the apparatus.

36. The apparatus of claim 35, wherein the apparatus is further caused to perform: updating the generator by using the update, or a signal derived from the update.

37. The apparatus of any of the claims 1 to 36, wherein: the generator is pretrained and is frozen during the training of one or more other components of a codec; the generator is pretrained and then finetuned jointly with one or more other components of the codec; or the generator is trained jointly with one or more components of the codec from scratch or from an initialization of parameters of the generator.

38. The apparatus of any of the claims 1 to 36, wherein the apparatus is further caused to perform: receiving a noise signal as an input, wherein the noise signal guides generation of a certain content.

39. The apparatus of claim 38, wherein the noise signal is generated at the decoder side; or the noise signal or parameters that generate the noise signal are derived at the encoder side.

40. The apparatus of any of the claims 38 or 39, wherein the apparatus is further caused to perform: processing the noise signal; and providing the processed noise signal to the decoder.

41. The apparatus of any of the claims 39 or 40, wherein the noise signal at the decoder side is a combination of the noise signal generated by the noise parameters received from the encoder and a random noise generated at the decoder side.

42. The apparatus of claim 39, wherein the noise signal at the decoder side is derived from a decoded bitstream.

43. The apparatus of any of the claims 38 to 42, wherein, in order to make the noise generation deterministic and reproducible at the decoder, the apparatus is further caused to perform: receiving additional parameters that control a noise generation process.
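Claims 41 and 43 can be illustrated with the sketch below, in which a seed plays the role of the additional parameters that control the noise generation process. The use of a seed, the Gaussian distribution, the weighted-sum mixing rule, and the names `generate_noise` and `mix_noise` are all assumptions for illustration; the claims leave the parameters and the form of the combination open.

```python
import random
from typing import List


def generate_noise(seed: int, length: int, sigma: float = 1.0) -> List[float]:
    """Draw `length` Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.gauss(0.0, sigma) for _ in range(length)]


def mix_noise(signalled: List[float], local: List[float],
              weight: float = 0.5) -> List[float]:
    """Assumed combination: weighted sum of encoder-signalled and local noise."""
    if len(signalled) != len(local):
        raise ValueError("noise signals must have the same length")
    return [weight * s + (1.0 - weight) * n for s, n in zip(signalled, local)]


# The same signalled seed yields the same noise at encoder and decoder.
encoder_side = generate_noise(seed=1234, length=4)
decoder_side = generate_noise(seed=1234, length=4)

# Noise rebuilt from signalled parameters, combined with decoder-side noise.
combined = mix_noise(decoder_side, generate_noise(seed=99, length=4))
```

Python's `random` module reproduces a sequence for a given seed within one implementation; a deployed codec would presumably need a fully specified generator to guarantee bit-exact agreement across devices.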

44. The apparatus of any of the claims 38 to 43, wherein the generator performs one or more steps to generate the generator output, and the apparatus is further caused to perform: receiving a signal comprising information about at least one step to which the noise signal needs to be applied.

45. The apparatus of any of the claims 1, 3, or 5, wherein the apparatus is further caused to perform: receiving an edited or modified picture, wherein one or more regions of the picture had been edited, modified, or generated by another generator based on an input prompt to obtain the edited or modified picture.

46. The apparatus of claim 45, wherein at least one of the one or more regions had been encoded differently than other regions of the edited or modified picture.

47. The apparatus of claim 46, wherein as compared to the other regions, the one or more regions are: encoded in lower quality or lower resolution, partially coded, or not coded.

48. The apparatus of claim 46 or 47, wherein the apparatus is further caused to perform: receiving an encoded mask or an encoded opacity-level map.

49. The apparatus of claim 48, wherein the mask or the opacity-level map had been obtained in one or more of the following ways: the mask or the opacity-level map is an output of the another generator; the mask or the opacity-level map had been determined based on the edited or the modified picture; the mask or the opacity-level map had been determined based on the edited or modified picture and the picture; the mask or the opacity-level map had been determined based on the input prompt that was an input to the another generator; or the mask or the opacity-level map had been determined based on a prompt that had been inferred based on the edited or modified picture.

50. The apparatus of any of the claims 45 to 49, wherein the one or more regions are determined by the encoder or a process external to the encoder.

51. The apparatus of any of the claims 45 to 50, wherein at least one of the one or more regions is determined to be a region that the generator at the decoder side or the decoder that comprises the generator is capable of generating or reconstructing with a sufficient quality with respect to a quality threshold or other criterion.

52. The apparatus of any of the claims 1, 4, or 5, wherein the apparatus comprises a decoder comprising a neural network based decoder (NN decoder), and wherein an input to the NN decoder comprises the decoded latent tensor, and wherein an output of the NN decoder is input to the generator, and wherein the apparatus is further caused to perform: receiving an update to the NN decoder, where the update is determined by the encoder; using the update to update the NN decoder to obtain an updated NN decoder; using the updated NN decoder to process the lossless decoded latent tensor to generate a processed lossless decoded latent tensor; providing the processed lossless decoded latent tensor to the generator; and generating a final decoded image based at least on the processed lossless decoded latent tensor.

53. The apparatus of any of the claims 38 to 44, wherein when the input to the generator comprises noise, the generator output comprises hyperprior latents, wherein the hyperprior latents are used to derive one or more probability distribution parameters of a latent tensor, and wherein the one or more probability distribution parameters are used to decode the latent tensor.
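The following sketch shows one way probability distribution parameters could be used to decode a latent tensor, as recited in claim 53. It assumes, as is common in learned image compression but not stated in the claims, that each quantized latent element follows a Gaussian whose mean and scale are derived from the hyperprior latents; the values of `mean` and `scale` and the function names are hypothetical.

```python
import math
from typing import Dict


def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol: int, mean: float, scale: float) -> float:
    """Probability mass of an integer symbol: the Gaussian integrated over
    the quantization bin [symbol - 0.5, symbol + 0.5]."""
    upper = gaussian_cdf(symbol + 0.5, mean, scale)
    lower = gaussian_cdf(symbol - 0.5, mean, scale)
    return upper - lower


# Hypothetical parameters, standing in for values derived from hyperprior latents.
mean, scale = 0.3, 1.2

# Probability table that an entropy decoder could use for one latent element.
pmf: Dict[int, float] = {s: symbol_probability(s, mean, scale)
                         for s in range(-10, 11)}
most_likely = max(pmf, key=pmf.get)
```

An arithmetic or range decoder would consume a table such as `pmf` to recover each latent symbol from the bitstream; the entropy decoder itself is outside the scope of this sketch.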

54. The apparatus of any of the claims 1 to 53, wherein the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

55. The apparatus of claim 54, wherein two of the one or more generators generate data comprising different features or characteristics of data to be decoded.

56. The apparatus of claim 55, wherein when a codec comprises a video codec, a first generator of the one or more generators generates image texture data or data from which image texture data is derived and a second generator of the one or more generators generates motion data or data from which motion data is derived.

57. The apparatus of claim 55, wherein when a codec comprises a video codec, a first generator of the one or more generators generates motion data or data from which motion data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

58. The apparatus of claim 55, wherein when a codec comprises an image codec or a video codec, a first generator of the one or more generators generates prediction data or data from which prediction data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

59. The apparatus of claim 55, wherein a generator group of the one or more generators comprises a global generator and a local generator, wherein the global generator learns to generate global embeddings from previous inputs, and the local generator utilizes the global embeddings or data derived from the global embeddings to generate target data.

60. A method comprising: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

61. A method comprising: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

62. A method comprising: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

63. A method comprising: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

64. The method of claim 63, wherein the signal comprises a decoded latent tensor or a prompt, and the data derived from the signal comprises data derived from the decoded latent tensor or data derived from the prompt.

65. The method of any of the claims 60, 61 or 64, wherein the decoded latent tensor comprises a lossless-decoded latent tensor that is comprised in an output of a lossless decoder.

66. The method of any of the claims 60, 61, 64, or 65, wherein the data derived from the decoded latent tensor comprises an output of a neural network decoder that is comprised in the decoder, and wherein the lossless-decoded latent tensor is an input to the neural network decoder.

67. The method of any of the claims 60, 62, or 64, wherein the prompt comprises one of the following: a text prompt; an image; a prompt in feature space or domain; a prompt inferred from an output of a neural network decoder; or a prompt inferred from an output of a lossless decoder.

68. The method of claim 67, wherein the text prompt comprises a prompt in the natural language domain in the form of: a text string; a tokenized text string; features extracted from a tokenized text string; or features extracted from a text string.

69. The method of any of the claims 60 to 68, wherein the generator input further comprises a noise signal.

70. The method of claim 69, wherein the noise signal is generated at the decoder side.

71. The method of claim 69, wherein the noise signal or one or more parameters for generating the noise signal are received from an encoder side.

72. The method of any of the claims 60, 61, 64, 65, 66, or 69 to 71, wherein the generator output comprises an input to a neural network decoder comprised in the apparatus, and wherein the neural network decoder decodes the decoded latent tensor based on the generator output.

73. The method of any of the claims 60 to 72, wherein the generator output comprises a decoded image.

74. The method of any of the claims 60 to 73 further comprising: receiving, from an encoder, a residual or data from which a residual is derived; and combining the generator output with the residual by using at least a combination operation.

75. The method of claim 74, wherein the combination operation comprises one of a summation operation, a tensor concatenation operation, or one or more neural network layers.

76. The method of any of claims 74 or 75, wherein the residual and the generator output are in feature space or domain; or the residual and the generator output are in picture space or domain, pixel space or domain, or image space or domain.

77. The method of any of the claims 60 to 76, wherein the generator output is used as one of the following: an intra-frame prediction or inter-frame prediction of: an image encoder and / or decoder or a video encoder and / or decoder; a reference picture in the video encoder and / or decoder; a temporal extrapolation in the video encoder and / or decoder; and a spatial extrapolation in the image encoder and / or decoder or the video encoder and / or decoder.

78. The method of any of the claims 60 to 77, wherein when the generator comprises a group of generators comprising two or more generator modules, an output of one generator module in the group, or data derived therefrom, is comprised in an input to one or more other generator modules in the group.

79. The method of claim 78, wherein a generator group of the group of generators comprises a global generator and a local generator, and wherein the global generator learns to generate global embeddings from previous decoded latent tensors, and wherein the local generator directly takes the generated embeddings as input to generate output in an image space.

80. The method of claim 79 further comprising: combining the global embeddings with the decoded latent tensor to obtain an input of the local generator.

81. The method of any of the claims 60 to 80 further comprising: receiving one or more external inputs with respect to the apparatus; and using the one or more external inputs as an input to the generator, using the one or more external inputs for processing the prompt, or using the one or more external inputs for modifying the prompt; wherein the one or more external inputs are not comprised in a bitstream decoded by the decoder.

82. The method of claim 81, wherein the one or more external inputs comprise one or more of the following: an audio signal that is derived from an audio track associated with one or more pictures that are input to a codec comprising the decoder; or ambient information.

83. The method of any of the claims 60, 62, 64, or 67 to 82, wherein: the prompt is determined by an encoder and / or by a process which is at encoder side but external to the encoder; the prompt is generated by an encoder based on a first picture as an input; the prompt is inferred by a neural network based on a second picture as an input, wherein the second picture and the inferred prompt are provided as input to the encoder; the prompt is comprised in an input to the encoder; the prompt is encoded by an encoder in a lossless way or in a lossy way, and wherein the encoded prompt is received by the apparatus as an input or as a signal; the prompt is used by the generator to generate an image, and wherein the prompt and the generated image are provided as an input to the encoder; the prompt is elaborated or processed by the encoder or by a process that is external to the encoder, and wherein the elaborated or processed prompt or data derived therefrom is received by the decoder or the generator; the prompt is used to generate an image to be encoded; or the prompt is inferred based on an image to be encoded.

84. The method of claim 83, wherein the first picture and / or the second picture are encoded by a first encoder and the prompt is encoded by a second encoder.

85. The method of claim 83, wherein the first picture, the second picture, and / or the prompt are encoded by the same encoder.

86. The method of any of the claims 60, 62, 64, or 67 to 82 further comprising: receiving an encoded inferred prompt, wherein the encoded inferred prompt is an encoded version of an inferred prompt, and wherein the inferred prompt is generated based on an image generated by using the prompt; decoding the encoded inferred prompt to generate a decoded inferred prompt; providing the decoded inferred prompt as an input to the generator to obtain the generator output, wherein the generator is available at both the encoder side and the decoder side; providing the generator output as an input to the encoder and to the decoder, wherein a bitstream representing an encoded image is generated by the encoder based on an image and the generator output; receiving the bitstream; and generating a decoded image based on the bitstream and the generator output.

87. The method of any of the claims 60, 62, 64, or 67 to 82 further comprising: decoding at least one of one or more bitstreams to obtain a decoded prompt, wherein a prompt and a picture are encoded into the one or more bitstreams; and using the decoded prompt or data derived from the decoded prompt as an input to the generator.

88. The method of claim 83, wherein the elaborated or processed prompt is obtained based on a third picture that is input to the encoder.

89. The method of any of the claims 60, 62, 64, or 67 to 82 further comprising: receiving an encoded elaborated or processed prompt and an encoded picture as an input, wherein an elaborated or processed prompt is generated based on a picture and the prompt, and wherein the picture is generated based on the prompt, and wherein the encoded picture is an encoded version of the picture; decoding the encoded elaborated or processed prompt to obtain a decoded elaborated or processed prompt; and providing the decoded elaborated or processed prompt as an input to the generator.

90. The method of claim 83, wherein the elaborated prompt is determined based on the generator output when the input to the generator is the prompt from which the elaborated prompt is derived.

91. The method of any of the claims 60 to 90, wherein the apparatus is trained based at least on a prompt loss, and wherein the prompt loss comprises a training loss, or training objective, that is determined or computed based on a first prompt and a second prompt or based on data derived from a first prompt and data derived from a second prompt, wherein the first prompt is determined or inferred based on the generator output or on data derived from the generator output, and the second prompt is one of the following: a prompt that was used to generate a picture that was input to the encoder; or a prompt that was determined or inferred based on the picture that was input to the encoder.

92. The method of claim 91, wherein the prompt loss is computed by providing the first prompt and the second prompt as an input to a neural network and running the neural network to obtain an output that represents the prompt loss or data from which the prompt loss is derived.

93. The method of claim 91, wherein the prompt loss is computed by computing a metric based on the first prompt and the second prompt, or based on first features extracted from the first prompt and second features extracted from the second prompt.

94. The method of any of the claims 60 to 90, wherein the method is trained, finetuned, or overfitted based at least on the prompt loss at encoder side when encoding a fourth picture to obtain an update to the generator, and wherein the update or a signal derived from the update is received by the apparatus.

95. The method of claim 94 further comprising: updating the generator by using the update, or a signal derived from the update.

96. The method of any of the claims 60 to 95, wherein: the generator is pretrained and is frozen during the training of one or more other components of a codec; the generator is pretrained and then finetuned jointly with one or more other components of the codec; or the generator is trained jointly with one or more components of the codec from scratch or from an initialization of parameters of the generator.

97. The method of any of the claims 60 to 95 further comprising: receiving a noise signal as an input, wherein the noise signal guides generation of a certain content.

98. The method of claim 97, wherein the noise signal is generated at the decoder side; or the noise signal or parameters that generate the noise signal are derived at the encoder side.

99. The method of any of the claims 97 or 98, further comprising: processing the noise signal; and providing the processed noise signal to the decoder.

100. The method of any of the claims 98 or 99, wherein the noise signal at the decoder side is a combination of the noise signal generated by the noise parameters received from the encoder and a random noise generated at the decoder side.

101. The method of claim 98, wherein the noise signal at the decoder side is derived from a decoded bitstream.

102. The method of any of the claims 97 to 101, wherein, in order to make the noise generation deterministic and reproducible at the decoder, the method further comprises: receiving additional parameters that control a noise generation process.

103. The method of any of the claims 97 to 102, wherein the generator performs one or more steps to generate the generator output, and wherein the method further comprises: receiving a signal comprising information about at least one step to which the noise signal needs to be applied.

104. The method of any of the claims 60, 62, or 64 further comprising: receiving an edited or modified picture, wherein one or more regions of the picture had been edited, modified, or generated by another generator based on an input prompt to obtain the edited or modified picture.

105. The method of claim 104, wherein at least one of the one or more regions had been encoded differently than other regions of the edited or modified picture.

106. The method of claim 105, wherein as compared to the other regions, the one or more regions are: encoded in lower quality or lower resolution, partially coded, or not coded.

107. The method of claim 105 or 106 further comprising: receiving an encoded mask or an encoded opacity-level map.

108. The method of claim 107, wherein the mask or the opacity-level map had been obtained in one or more of the following ways: the mask or the opacity-level map is an output of the another generator; the mask or the opacity-level map had been determined based on the edited or the modified picture; the mask or the opacity-level map had been determined based on the edited or modified picture and the picture; the mask or the opacity-level map had been determined based on the input prompt that was an input to the another generator; or the mask or the opacity-level map had been determined based on a prompt that had been inferred based on the edited or modified picture.

109. The method of any of the claims 104 to 108, wherein the one or more regions are determined by the encoder or a process external to the encoder.

110. The method of any of the claims 104 to 109, wherein at least one of the one or more regions is determined to be a region that the generator at the decoder side or the decoder that comprises the generator is capable of generating or reconstructing with a sufficient quality with respect to a quality threshold or other criterion.

111. The method of any of the claims 60, 63, or 64, wherein the apparatus comprises a decoder comprising a neural network based decoder (NN decoder), and wherein an input to the NN decoder comprises the decoded latent tensor, and wherein an output of the NN decoder is input to the generator, and wherein the method further comprises: receiving an update to the NN decoder, where the update is determined by the encoder; using the update to update the NN decoder to obtain an updated NN decoder; using the updated NN decoder to process the lossless decoded latent tensor to generate a processed lossless decoded latent tensor; providing the processed lossless decoded latent tensor to the generator; and generating a final decoded image based at least on the processed lossless decoded latent tensor.

112. The method of any of the claims 97 to 103, wherein when the input to the generator comprises noise, the generator output comprises hyperprior latents, wherein the hyperprior latents are used to derive one or more probability distribution parameters of a latent tensor, and wherein the one or more probability distribution parameters are used to decode the latent tensor.

113. The method of any of the claims 60 to 112, wherein the apparatus comprises a decoder, one or more generators, or the decoder comprising the one or more generators.

114. The method of claim 113, wherein two of the one or more generators generate data comprising different features or characteristics of data to be decoded.

115. The method of claim 114, wherein when a codec comprises a video codec, a first generator of the one or more generators generates image texture data or data from which image texture data is derived and a second generator of the one or more generators generates motion data or data from which motion data is derived.

116. The method of claim 114, wherein when a codec comprises a video codec, a first generator of the one or more generators generates motion data or data from which motion data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.

117. The method of claim 114, wherein when a codec comprises an image codec or a video codec, a first generator of the one or more generators generates prediction data or data from which prediction data is derived and a second generator of the one or more generators generates residual data or data from which residual data is derived.
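One way the two generator outputs of claim 117 might be combined is sketched below. The additive combination and the clipping to the sample range follow common codec practice and are assumptions here; the claim does not state how prediction data and residual data are merged. The function name `reconstruct` is hypothetical, and its two inputs stand for the outputs of the first and second generators.

```python
def reconstruct(prediction, residual, bit_depth=8):
    """Add residual to prediction sample by sample and clip to the valid range.

    `prediction` stands for the first generator's output and `residual` for
    the second generator's output; additive merging is an assumption.
    """
    peak = (1 << bit_depth) - 1
    return [max(0, min(peak, p + r)) for p, r in zip(prediction, residual)]
```

For example, `reconstruct([100, 250, 3], [5, 10, -7])` yields `[105, 255, 0]`, with the last two samples clipped to the 8-bit range.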

118. The method of claim 114, wherein a generator group of the one or more generators comprises a global generator and a local generator, wherein the global generator learns to generate global embeddings from previous inputs, and the local generator utilizes the global embeddings or data derived from the global embeddings to generate target data.
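The division of labour between the global and local generators in claim 118 can be illustrated with a toy sketch. Here the "learned" global embedding is replaced by a running mean of previous inputs, and the local generator simply adds a local detail term; both are placeholder assumptions rather than the claimed networks, and the class names are hypothetical.

```python
class GlobalGenerator:
    """Toy stand-in: the global embedding is the running mean of previous inputs."""

    def __init__(self, size):
        self.embedding = [0.0] * size
        self.count = 0

    def observe(self, values):
        """Fold one more input into the running mean and return the embedding."""
        self.count += 1
        self.embedding = [
            e + (v - e) / self.count for e, v in zip(self.embedding, values)
        ]
        return self.embedding


class LocalGenerator:
    """Toy stand-in: target data is the global embedding plus a local detail term."""

    def generate(self, embedding, local_detail):
        return [e + d for e, d in zip(embedding, local_detail)]
```

The global generator is fed earlier inputs one at a time, and the local generator then consumes the resulting embedding together with data specific to the current target.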

119. A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

120. A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

121. A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a prompt; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

122. A computer readable medium comprising instructions that, when executed by an apparatus, cause the apparatus to perform at least the following: receiving a generator input comprising a signal or data derived from the signal; generating, based at least on the generator input, a generator output; and deriving, based at least on the generator output, an output of an apparatus.

123. The computer readable medium of any of the claims 119 or 122, wherein the apparatus is further caused to perform methods as claimed in any of the claims 64 to 118.

124. The computer readable medium of any of the claims 119 to 123, wherein the computer readable medium comprises a non-transitory computer readable medium.

125. An apparatus comprising: means for receiving a generator input comprising a prompt, a decoded latent tensor, or data derived from the decoded latent tensor; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

126. An apparatus comprising: means for receiving a generator input comprising a decoded latent tensor or data derived from the decoded latent tensor; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

127. An apparatus comprising: means for receiving a generator input comprising a prompt; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

128. An apparatus comprising: means for receiving a generator input comprising a signal or data derived from the signal; means for generating, based at least on the generator input, a generator output; and means for deriving, based at least on the generator output, an output of an apparatus.

129. The apparatus of any of the claims 125 or 128, wherein the apparatus is further caused to perform methods as claimed in any of the claims 64 to 118.