Neural network based encoding and decoding methods

By integrating a downsampling chroma neural network to adaptively convert between different chroma formats, the inefficiencies in existing neural network-based encoding and decoding methods are addressed, leading to improved accuracy and efficiency in picture reconstruction.

WO2026130965A1 · PCT designated stage · Publication Date: 2026-06-25 · INTERDIGITAL CE PATENT HOLDINGS SAS

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: INTERDIGITAL CE PATENT HOLDINGS SAS
Filing Date: 2025-11-21
Publication Date: 2026-06-25

Application Information

Patent Timeline

21 Nov 2025

Application

25 Jun 2026

Publication

WO2026130965A1

IPC: H04N19/117; H04N19/172; H04N19/186; H04N19/59; H04N19/85

CPC: H04N19/117; H04N19/186; H04N19/85; H04N19/172; H04N19/59

AI Tagging

Application Domain

Digital video signal modification

Technology Topics

Pattern recognition · Decoding methods

AI Technical Summary

Technical Problem

Existing methods for picture encoding and decoding using neural networks face inefficiencies in handling different chroma formats, particularly when converting between YUV 4:4:4 and other formats like YUV 4:2:0 or YUV 4:2:2, leading to suboptimal training and increased high spatial frequencies.

Method used

Incorporating a downsampling chroma neural network to adaptively convert between different chroma formats, such as YUV 4:4:4 to YUV 4:2:0 or YUV 4:2:2, using lightweight neural networks to optimize the encoding and decoding process.

Benefits of technology

This approach improves the efficiency of neural network-based encoding and decoding by reducing computational overhead and enhancing the alignment of chroma components, resulting in more accurate and computationally efficient picture reconstruction.

✦ Generated by Eureka AI based on patent content.

Abstract

A method, e.g., a decoding method, is disclosed. A set of latent variables representative of a current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network are first obtained. The current picture is reconstructed based on the set of latent variables, the parameters of the set of at least one first neural network and the parameters of the downsampling chroma network.

Description

[0001] Docket No 2024P00850WQ

[0002] NEURAL NETWORK BASED ENCODING AND DECODING METHODS

[0003] CROSS REFERENCE TO RELATED APPLICATIONS

[0004] This application claims the benefit of European Application No 24307136.2, filed on December 17, 2024 which is incorporated herein by reference in its entirety.

[0005] BACKGROUND

[0006] The present application is related to methods and apparatuses for picture encoding and decoding using neural networks.

[0007] BRIEF SUMMARY

[0008] Briefly stated, in one embodiment, a method (e.g., a decoding method) is disclosed that comprises obtaining a set of latent variables representative of a current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network. The current picture is reconstructed based on the set of latent variables, the parameters of the set of at least one first neural network and the parameters of the downsampling chroma network. Another method (e.g., an encoding method) is disclosed that comprises training a neural network representation of a current picture to derive a set of latent variables representative of the current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network. The set of latent variables, the parameters of the set of at least one first neural network and at least one parameter of the downsampling chroma network are then encoded. Corresponding apparatus (e.g., encoding and decoding apparatuses) are also disclosed.

[0009] BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The following detailed description will be better understood when read in conjunction with the appended drawings, in which there are shown examples of one or more of the multiple embodiments of the present disclosure. It should be understood, however, that the embodiments described herein are not limited to the precise arrangements and instrumentalities shown in the drawings. In the drawings:

[0011] FIG. 1 is a block diagram illustrating an example system according to one or more embodiments of the present disclosure;

[0012] FIG. 2 illustrates an example of a neural network that can be used for implicit neural representation (INR);

[0013] FIG. 3 illustrates an example of a method to encode a signal using an INR;

[0014] FIG. 4 illustrates an example of a neural architecture that is an INR variant, called in the following Hybrid INR;

[0015] FIG. 5 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure;

[0016] FIG. 6 is a block diagram illustrating an example encoding method according to one or more embodiments of the present disclosure;

[0017] FIG. 7 is a block diagram illustrating an example encoding method according to one or more embodiments of the present disclosure;

[0018] FIGs 8 and 9 are block diagrams illustrating examples of neural networks for chroma downsampling according to one or more embodiments of the present disclosure;

[0019] FIG. 10 illustrates two examples of chroma phasing;

[0020] FIG. 11 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure;

[0021] FIG. 12 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure; and

[0022] FIG. 13 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure.

[0023] DETAILED DESCRIPTION

[0024] In describing the various embodiments of the present disclosure, certain terminology is used herein for convenience only and should not be considered as limiting such embodiments. In the drawings, the same reference numerals are employed for designating the same elements throughout the several figures and the present description.

[0025] Referring to the drawings, there is shown in FIG. 1 a block diagram illustrating an example system 100 in which embodiments of the present disclosure can be implemented. The system 100 may be an electronic device including, for example, a personal computer, laptop computer, mobile phone, tablet computer, multimedia set-top box, digital television receiver, personal video recording system, connected home appliance, vehicle control and / or entertainment system, and server. One or more elements of the system 100, singly or in combination, may be implemented as an integrated circuit (IC), multiple ICs, and / or discrete components. For example, in one embodiment, the processing, encoding and / or decoding elements of system 100 are distributed across multiple ICs and / or discrete components. In some embodiments, the system 100 is communicatively coupled to and / or in communication with other systems or devices, via, for example, a communications bus or dedicated input / output ports.

[0026] One or more of the elements of system 100 may be provided within an integrated housing, with such elements being interconnected and able to transmit data therebetween using any suitable connection arrangement 115 generally known in the art, including, for example, an internal bus (e.g., I2C bus), wiring, and printed circuit boards.

[0027] The system 100 includes at least one processor 110 configured to execute instructions for implementing the embodiments described herein, including signal / data coding and processing. The processor 110 may be a general-purpose processor or microprocessor, digital signal processor (DSP), one or more microprocessors in association with a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), a state machine, and the like. The processor 110 may include at least one central processing unit (CPU), embedded memory, input and output interfaces, and other circuitries.

[0028] The system 100 includes at least one memory 120, for example, a volatile memory device and / or a non-volatile memory device. The system 100 includes a storage device 140, that may be or include non-volatile memory and / or dynamic volatile memory, including EEPROM, ROM, PROM, RAM, DRAM, SRAM, DDR, flash, magnetic disk drives, solid state drives (SSD) and / or optical disk drives. The storage device 140 may be or include, for example, an internal storage device, an attached storage device, and / or a network accessible storage device. Although shown separately, the memory 120 and the storage device 140 may be collocated, integrated together, or otherwise combined.

[0029] The system 100 includes an encoder / decoder module 130 configured to process video data and to provide encoded video data or decoded video data. The encoder / decoder module 130 may include one or more processors and / or memory (not shown). Although FIG. 1 depicts the encoder / decoder module 130 as a separate element of system 100, it will be understood that the processor 110 and the encoder / decoder module 130 may be collocated and / or integrated together as a combination of hardware and / or software, e.g., in an electronic package or chip. The encoder / decoder module 130 may be or include one or more modules that may be included in one or more separate devices that perform encoding and / or decoding functions.

[0030] Instructions for execution by the processor 110 and / or the encoder / decoder module 130 may be stored in the storage device 140 and subsequently loaded into memory 120 for execution by the processor 110. In some embodiments, one or more of processor 110, memory 120, storage device 140, and encoder / decoder module 130 may store one or more items when performing the processes disclosed herein. Such items may include input video, decoded video or portions thereof, bitstreams, matrices, variables, operational logic, and intermediate and / or final results from processing of equations, formulas, or operations.

[0031] In some embodiments, the memory of the processor 110 and / or the encoder / decoder module 130 is used to store instructions and / or provide working memory for video encoding and decoding functions. In some embodiments, memory external to the processor 110 and / or the encoder / decoder module 130 (e.g., the memory 120 and / or the storage device 140) is used for one or more of these functions and / or, for example, to store the operating system of a television.

[0032] The system 100 may obtain or receive information via one or more input devices, interfaces, and / or ports as indicated in input block 105. Examples of the input devices include a radio frequency (RF) device for transmitting and / or receiving RF signals over various media, for example, RF signals received over the air from a broadcaster; component video (COMP) inputs; a Universal Serial Bus (USB) input; and / or a High-Definition Multimedia Interface (HDMI) input. Other examples include composite video input (not shown). In some embodiments, the input devices are associated with respective input processing elements, e.g., those generally known in the art. For example, the RF device may be associated with elements suitable for selecting a desired frequency (e.g., selecting or band-limiting a signal) or performing error correction on the signal. The USB and / or HDMI inputs may include respective interface processors and transceivers (or transmitters and receivers) for coupling the system 100 to other devices via USB and / or HDMI ports or connections. Various forms of input processing may be implemented, for example, by and / or within a separate input processing device or the processor 110.

[0033] The system 100 includes a communication interface 150 that enables wired and / or wireless communication with other devices, e.g., via a communication channel 190. The communication interface 150 may include one or more transceivers, modems, network cards and the like. The communication channel 190 may be or include wired and / or wireless mediums.

[0034] In some embodiments, data may be streamed to the system 100 via wired and / or wireless networks. Examples of such wireless networks include cellular, Bluetooth or Wi-Fi (e.g., IEEE 802.11) networks. The wired and / or wireless networks may include one or more base stations (e.g., cellular base stations, access points, etc.), and / or user equipment (e.g. cellular user equipment, stations, etc.), and / or other network elements that communicate with the system 100 via the communication interface 150 and communication channel 190, whereby the system 100 may obtain data streamed from streaming applications (e.g., OTT services) via various networks, including the Internet. In some embodiments, data is streamed to the system 100 via the input block 105 (e.g., using a set-top box that delivers data via the HDMI connection or the RF connection). In some embodiments, data is received by the system 100 in a non-streaming manner.

[0035] The system 100 may provide one or more output signals to one or more output devices. The output devices may include a display device 165 (e.g., touchscreen display, monitor, etc.), an audio device 175 (e.g., speakers), and other peripheral devices 185, including, for example, a stand-alone DVR, a disk player, a stereo system, a lighting system, and other devices that provide a function based on the output of the system 100. The display device 165 can be for a television, tablet, laptop, mobile phone, head-mounted display, or other device. In some embodiments, control signals are communicated between the system 100 and the display device 165, the audio device 175, and / or the peripheral devices 185, enabling device-to-device control with or without user intervention. The output devices may couple to and / or communicate with the system 100 via dedicated connections via respective display, audio, and peripheral interfaces 160, 170, 180. Alternatively, the output devices may couple to and / or communicate with the system 100 via the communication channel 190 and the communication interface 150.

[0036] The display device 165 and the audio device 175 may be collocated, integrated, or otherwise combined with the other components of system 100 in a single unit (e.g., a television). Alternatively, the display device 165 and the audio device 175 may be separate from one or more of the other components of the system 100. In embodiments in which the display device 165 and the audio device 175 are external components, the output signals may be provided via dedicated outputs and / or connections, including, for example, HDMI ports, USB ports, or COMP outputs.

[0037] FIG. 2 illustrates an example of a neural network that can be used for implicit neural representation (INR). Such a neural network can be referred to as an INR network. The INR network is also referred to as a Coordinates Neural Representation. Indeed, INRs are trained to map a (e.g., each) point in a given domain to the corresponding value of a signal at that point. For example, INRs for images learn to map the 2D coordinates of pixels to their corresponding RGB / YUV values. The INR network thus provides a compact representation of an input signal, an image for example. The INR network models the input signal by an overfitted Multi Layer Perceptron (MLP) that, in the case of an image for example, maps pixel coordinates to the corresponding RGB values.

[0038] The INR parameterizes a signal as a function 200, which takes coordinates 210 as input and outputs values 220 of a signal at these coordinates. INR has recently been applied to images, videos or 3D objects among other applications. In the image case, the inputs 210 can be pixel coordinates (c1, c2) and the INR may output 220 the color values (r, g, b) or (y, u, v) of the input pixel. The input coordinates 210 may be modified by a transformation before being used as input for the neural network. This transformation can be a Fourier mapping, coordinate transformation, normalization, etc.

[0039] The INR can be used to reconstruct a signal by computing the signal values for (e.g., every) necessary coordinate inputs. It can be used to upsample a signal by generating output for input coordinates corresponding to the upsampled pixels, for example the mean of the coordinates between two consecutive pixels for upsampling by a factor of 2.

[0040] An INR network 200 is typically a neural network, composed of multiple neural layers, such as fully connected layers. For example, in FIG. 2, the network has four layers. Intermediate outputs are represented by circles. Each neural layer can be described as a function that first multiplies the input by a tensor, adds a vector called the bias and then applies a nonlinear function on the resulting values. The shape (and other characteristics) of the tensor and the type of non-linear functions are called the architecture of the network. The values of the tensor and the bias are denoted by the term “weights”. The weights and, if applicable, the parameters of the non-linear functions, are called the parameters θ of the network. The architecture and the parameters define a “model”. We will use f_θ to denote an INR function parameterized by θ.
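The layer description above can be sketched in a few lines of Python. This is only an illustrative toy, not the network of FIG. 2: the layer sizes, the ReLU nonlinearity and the numeric weights in `theta` are assumptions chosen to make the arithmetic easy to follow.

```python
def layer(x, weights, bias, activation):
    """One neural layer: multiply the input by a weight matrix, add the bias,
    then apply a nonlinear function to each resulting value."""
    out = []
    for row, b in zip(weights, bias):
        s = sum(w * v for w, v in zip(row, x)) + b
        out.append(activation(s))
    return out


def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def inr(coords, params):
    """Evaluate f_theta at one coordinate pair.
    params is a list of (weights, bias, activation) tuples, one per layer."""
    h = list(coords)
    for weights, bias, activation in params:
        h = layer(h, weights, bias, activation)
    return h


# Hypothetical toy parameters: 2 inputs -> 2 hidden units -> 3 outputs (e.g., y, u, v).
theta = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, -0.5], relu),
    ([[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]], [0.0, 0.1, 0.0], identity),
]

yuv = inr((0.25, 0.75), theta)  # three output values for the coordinate (0.25, 0.75)
```

Here the "model" is the pair formed by the architecture (two layers, their shapes and activations) and the numeric values stored in `theta`.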

[0041] FIG. 3 illustrates an example of a method 399 to encode a signal 319 using an INR. This is done by optimizing 329 the parameters θ (or a subset of them) of the INR network to reconstruct the signal and optionally encoding 339 them to create the output bitstream 359. For an image x of size (M × N), the parameters θ can for example be optimized by minimizing the following loss function:

[0042]

$$\text{Loss} = D(x, f_\theta) + \lambda R(\theta)$$

$$D_{MSE}(x, f_\theta) = \frac{1}{MN} \sum_{c_1=0}^{M-1} \sum_{c_2=0}^{N-1} \left\lVert x(c_1, c_2) - f_\theta(c_1, c_2) \right\rVert^2$$

where D is a distortion which quantifies the difference between the image reconstructed by f_θ and the original image x, R is the bitrate of the encoded parameters and λ is a trade-off parameter between D and R. D could be any differentiable distortion measure, such as the mean squared error D_MSE as in the second equation. M and N are the width and height of the original image. Other metrics such as LPIPS (learned perceptual image patch similarity) can also be used in this case. The optimization of the parameters θ is typically performed by a machine learning approach such as a batch gradient descent method.
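As a worked illustration of the loss described above, the following sketch computes a mean squared error and combines it with a rate through the trade-off parameter. The function names, the 2 × 2 test image and the numeric rate are assumptions made for the example; in practice the rate would come from the actual encoding of the parameters.

```python
def mse(original, reconstructed):
    """Mean squared error between two images given as lists of rows."""
    count = sum(len(row) for row in original)
    total = sum(
        (a - b) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for a, b in zip(row_o, row_r)
    )
    return total / count


def rd_loss(distortion, rate, lam):
    """Rate-distortion loss: distortion plus lambda times rate."""
    return distortion + lam * rate


x = [[10.0, 20.0], [30.0, 40.0]]        # original image (toy values)
x_hat = [[11.0, 19.0], [30.0, 42.0]]    # reconstruction produced by the INR

d = mse(x, x_hat)                        # (1 + 1 + 0 + 4) / 4 = 1.5
loss = rd_loss(d, rate=100.0, lam=0.01)  # distortion plus 0.01 times 100 bits
```

A larger trade-off parameter penalizes the rate more heavily, which pushes the optimization toward cheaper parameters at the price of a higher distortion.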

[0043] To decompress the signal, f_θ is evaluated at all relevant coordinates. These coordinates can be selected at decoding. A typical choice would be all pixel coordinates for an image or video. As an example, for a 256x256 pixel image, these coordinates could be all pairs (c1, c2) for all c1 ∈ {0, 1, ..., 255} and c2 ∈ {0, 1, ..., 255}. Other choices are possible, for example to upsample, downsample or extend the original image.

[0044] The bitstream encoding a signal may thus be created by encoding the parameters of the neural network. This can be done by a neural compression codec such as Neural Network Coding (NNC), i.e., ISO/IEC 15938-17 or MPEG-7 part 17, or by quantizing the parameters and / or pruning some neurons from the network.
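The choice of decoding coordinates can be illustrated as follows. This is a minimal sketch: `toy_f` is a stand-in for a trained INR function, and treating c1 as the column index and c2 as the row index is an assumption, since the text does not fix the convention.

```python
def pixel_grid(width, height):
    """All integer pixel coordinates (c1, c2); c1 is taken as the column index."""
    return [(c1, c2) for c2 in range(height) for c1 in range(width)]


def upsampled_grid(width, height):
    """Original pixel positions plus the midpoints between consecutive pixels,
    i.e., the coordinates used for upsampling by a factor of 2."""
    xs = [i / 2 for i in range(2 * width - 1)]
    ys = [j / 2 for j in range(2 * height - 1)]
    return [(c1, c2) for c2 in ys for c1 in xs]


def decode(f, coords):
    """Reconstruct the signal by evaluating the INR at every requested coordinate."""
    return [f(c1, c2) for c1, c2 in coords]


def toy_f(c1, c2):
    # Stand-in for a trained INR: a smooth ramp over a 256x256 image.
    return (c1 + c2) / 510.0


coords = pixel_grid(256, 256)         # 256 * 256 coordinate pairs
samples = decode(toy_f, coords)       # one value per pixel
fine_coords = upsampled_grid(2, 2)    # 3 x 3 positions, including (0.5, 0.5)
```

Because the decoder only needs the function and a list of coordinates, the same parameters can serve several output resolutions.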

[0045] FIG. 4 illustrates an example of a neural architecture that is an INR variant, called in the following Hybrid INR. Hybrid INRs are conditioned on a latent vector, typically learned end-to-end as part of the model. This hybrid INR network may be applied by an INR-based encoder and an INR-based decoder. In the example of FIG. 4, during the encoding of input data (that is, the training of the network 400), the input data’s coordinates 410 are first mapped, by a mapping unit 420, into respective latent variables 425. The mapping can be implemented by a lookup table or a hash function, for example. The mapping may also involve any transformation, such as a Fourier transformation, a coordinate transformation, a normalization transformation, or a combination thereof. The latent variables 425 may optionally be upsampled, by an up-sampling unit 430, resulting in upsampled latent variables 435. The up-sampling is applied in a case where the latent variables are represented at different resolutions so that the up-sampled latent variables are at one and the same resolution. The upsampled latent variables 435 may be used as an input to an INR network 440, trained to produce the reconstructed data 450. In a hybrid-INR network 400 the latent variables 425 are trained together with the parameters θ of the INR network 440, resulting in optimal network parameters and optimal latent variables. Using such an architecture helps in handling the local attributes of the input data. Indeed, a group of latent variables that correspond to a given part of the data may be uncorrelated with other groups of latent variables that correspond to other parts of the data, and, thus, groups of latent variables can be tailored to (or be characteristic of) corresponding parts of the data.

[0046] Following the training of the hybrid INR network 400, the learned latent variables 425 and network parameters may be quantized and encoded (e.g., entropy coded) into the bitstream. Thus, during inference, the latent variables 425 and the network parameters of the trained INR network 440 are first decoded from the bitstream and dequantized. Then, the decoded and dequantized latent variables may be upsampled 430. To reconstruct the data 450, the up-sampled latent variables are fed into the trained INR network 440 using the decoded and dequantized network parameters of the INR network.
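The quantize and dequantize steps mentioned above can be sketched with a uniform scalar quantizer. The text does not specify the quantizer, so the uniform scheme and the step size of 0.5 are assumptions for illustration only.

```python
def quantize(values, step):
    """Map each real value to the index of the nearest multiple of `step`.
    Python's round() resolves exact ties to the nearest even integer."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Recover approximate values from the integer indices."""
    return [q * step for q in indices]


latents = [0.26, -1.1, 0.0, 2.49]   # hypothetical learned latent variables
step = 0.5                          # hypothetical quantization step

indices = quantize(latents, step)       # integers that would be entropy coded
restored = dequantize(indices, step)    # values used by the decoder
max_error = max(abs(a - b) for a, b in zip(latents, restored))
```

With this scheme the reconstruction error of each value is at most half the step, so a smaller step gives a more accurate reconstruction but more distinct symbols to encode.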

[0047] In an example, due to their large bit representation, efficient coding of the latent variables may use the estimation of their distributions. To that end, the distributions of respective latent variables may be learned using a probability prediction (PP) network, i.e., a network trained to produce parameters of distributions of respective latent variables. Based on these learned distribution parameters, the latent variables 425 may be entropy coded into the bitstream. The PP network is defined by PP network parameters, denoted by ψ, determined during the training of the PP network. These PP network parameters may be entropy coded into the bitstream. In an example, while entropy coding of the latent variables relies on their learned respective distributions, the entropy coding of the PP network parameters ψ and the INR network parameters θ relies on respective non-learned distributions.

[0048] In another example, for some (e.g., all) of the latent variables respective non-learned distributions can be used for entropy coding. These non-learned distributions may be fixed distributions or may be distributions that were learned with respect to other latent variables (e.g., latent variables representing data from previous frames).

[0049] An example of a hybrid-INR architecture, called COOL-CHIC, for image coding is described in Ladune T, Philippe P, Henry F, et al. (2023) Cool-chic: Coordinate-based low complexity hierarchical image codec. Proceedings of the International Conference on Computer Vision (ICCV), hereinafter “Ladune”. In Ladune as in many hybrid INR models, the latent variables are arranged in hierarchical layers (or channels) ranging from a low-resolution representation (that provides for compact representation of smooth image regions) to a high-resolution representation (that captures the fine details of the image).

[0050] The training of a hybrid INR network requires the original image, which is usually defined with three channels (e.g., RGB, YUV, YCbCr, etc.) having the same resolution, i.e., the same number of samples per channel. In other examples, the original image may also be a frame with a YUV 4:2:0 format, in which case the luminance is stored at full resolution whereas the chroma components, U and V, are stored at half resolution (in both vertical and horizontal dimensions), meaning that there are four times as many Y values (i.e., samples) stored as for either of the chroma channels.
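The sample counts for the chroma formats discussed here can be checked with a short calculation. The helper names are hypothetical, and the sketch assumes picture dimensions divisible by the subsampling factors.

```python
# Horizontal and vertical chroma subsampling factors for each format.
CHROMA_SUBSAMPLING = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
}


def plane_sizes(width, height, chroma_format):
    """Return ((luma_w, luma_h), (chroma_w, chroma_h)) for one frame."""
    sx, sy = CHROMA_SUBSAMPLING[chroma_format]
    return (width, height), (width // sx, height // sy)


def luma_to_chroma_ratio(width, height, chroma_format):
    """Number of Y samples divided by the number of samples in one chroma plane."""
    (lw, lh), (cw, ch) = plane_sizes(width, height, chroma_format)
    return (lw * lh) / (cw * ch)


luma_420, chroma_420 = plane_sizes(1920, 1080, "4:2:0")
ratio_420 = luma_to_chroma_ratio(1920, 1080, "4:2:0")
ratio_422 = luma_to_chroma_ratio(1920, 1080, "4:2:2")
ratio_444 = luma_to_chroma_ratio(1920, 1080, "4:4:4")
```

For a 1920x1080 frame in YUV 4:2:0, each chroma plane is 960x540, which gives the factor of four stated above; YUV 4:2:2 halves only the horizontal chroma resolution.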

[0051] During training of the hybrid INR, the objective is thus to find the latent variables ŷ and the INR network parameters θ that minimize the coding cost, as follows:

$$\min_{\hat{y},\theta} \; D(x, \hat{x}) + \lambda R(\hat{y}, \theta) \quad (1)$$

where ŷ denotes the quantized latent variables and ẑ denotes their up-sampled version; where f_θ denotes the INR network and θ denotes the INR network parameters; where D denotes a distortion metric measuring the distance between the image x and its reconstructed version x̂, as produced by the INR network f_θ from the up-sampled latent variables ẑ, that is, x̂ = f_θ(ẑ); and where R denotes the rate (in bits per pixel) measuring the number of bits that are required to represent a pixel in a bitstream, that is, the number of bits that are required to represent ŷ and θ.

[0052] The distortion D and the rate R are balanced by a scalar value denoted by λ. In a case where the up-sampling unit 430 is implemented by a neural network, the parameters of that network are also learned and coded into the bitstream to be used by the up-sampling unit when used in an inference mode.

[0053] Since the contribution of the INR network parameters θ to the rate R is not as significant as that of the latent variables ŷ, only the latter can be considered when minimizing the coding cost, that is, R(ŷ, θ) ≈ R(ŷ). Furthermore, R(ŷ) can be replaced by the cross entropy. Thus, equation (1) can be replaced by:

$$\min_{\hat{y},\theta} \; D(x, \hat{x}) - \lambda \log_2 P(\hat{y}) \quad (2)$$

where P(ŷ) is the joint distribution of the latent variables ŷ. According to Equation (2), minimizing the cost involves minimizing the rate associated with the latent variables. This can be achieved by reducing the amount of information contained in the latent variables, at the price of a less accurate reconstruction, as less information in ŷ is likely to increase the distortion D. Alternatively, minimizing the cost can be achieved by obtaining estimates of the distributions of the respective latent variables, as described herein.

[0054] In another example, the entropy coding of the set of latent variables may use a probability prediction (PP) network defined by parameters ψ. The PP network takes as input latent variables (e.g., located in a causal neighborhood of a current latent variable) and outputs parameters (e.g., mean and deviation in case of a Gaussian distribution) of distributions of respective latent variables. The probabilities output by the PP network are thus used for entropy coding (decoding respectively) the latent variables. In this latter case, the parameters ψ may be learned during the training of f_θ and the latent variables. In this case, during training of the hybrid INR, the objective is thus to find the latent variables ŷ, the INR network parameters θ and the PP network parameters ψ that minimize the coding cost, as follows:

[0055] Equation (2) may thus be replaced by the following equation:

[0056]

$$\min_{\hat{y},\theta,\psi} \left( D(x, \hat{x}) - \lambda \times \log_2 P_\psi(\hat{y}) \right)$$
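To make the rate term concrete, the sketch below estimates the number of bits for integer-valued latent variables when a PP network outputs a Gaussian mean and deviation for each of them. Several points are assumptions not stated in the text: the probability of an integer is taken as the Gaussian mass over a unit-wide interval centred on it, the joint probability is treated as a product of per-variable probabilities, and the means and deviations are supplied by hand rather than by a network.

```python
import math


def gaussian_cdf(v, mean, scale):
    """Cumulative distribution function of a Gaussian with the given mean and deviation."""
    return 0.5 * (1.0 + math.erf((v - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(q, mean, scale):
    """Probability mass assigned to the integer q: Gaussian integrated over [q - 0.5, q + 0.5]."""
    return gaussian_cdf(q + 0.5, mean, scale) - gaussian_cdf(q - 0.5, mean, scale)


def rate_bits(latents, means, scales):
    """Estimated code length in bits: sum of -log2 of each symbol probability."""
    return sum(
        -math.log2(symbol_probability(q, m, s))
        for q, m, s in zip(latents, means, scales)
    )


# Hypothetical predictions: both latents are predicted with mean 0 and deviation 1.
bits_at_mean = rate_bits([0], [0.0], [1.0])   # latent equal to the predicted mean
bits_far = rate_bits([3], [0.0], [1.0])       # latent far from the predicted mean
```

A latent variable that matches its predicted distribution costs few bits, which is why better distribution estimates lower the rate term without changing the distortion. Values extremely far from the mean can make the probability underflow to zero, so a practical implementation would clamp it to a small positive floor.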

[0057] The original image x can be a frame extracted from a video sequence. When the frame has a YUV 4:4:4 format, for which there is the same number of samples for the three channels, the frame can be directly used for the training without any difficulties.

[0058] In existing approaches, the neural network (e.g., the neural network 440 on FIG. 4) outputs channels having a same number of samples (i.e., YUV 4:4:4), even in a case where the original frame is in another format, such as YUV 4:2:0 or YUV 4:2:2. Consequently, the input YUV 4:2:0 frame is often transformed into a YUV 4:4:4 frame. A simple preprocessing, such as bicubic interpolation, is commonly used to upsample the U and V components to get the full resolution. However, this preprocessing stage does not consider that the Y and UV components are not necessarily well-aligned or phased. Therefore, this kind of approach may generate very high spatial frequencies and make the inter-component redundancy difficult to leverage. The training (e.g., that comprises computing the loss) is therefore non-optimal since it is performed with channels having the full resolution but with potentially more high frequencies than necessary.

[0059] In another approach, the loss function used during the training may be computed on data in the encoding resolution (format), e.g., YUV 4:2:0. To this aim, the neural network’s output (i.e., channels having a same number of samples, i.e., YUV 4:4:4) may be adapted (e.g., transformed) to an appropriate frame format, namely to the encoding format (a.k.a. encoding frame format).

[0060] A method is thus disclosed wherein at least one downsampling neural network (e.g., a chroma downsampling neural network) is overfitted such that downsampling operations are best suited to the incoming signal to encode.

[0061] More particularly, a downsampling neural network (e.g., a lightweight neural network such as a shallow network) is defined to convert (e.g., transform) YUV 4:4:4 into another format with fewer U and V chroma samples than luma samples Y, e.g., YUV 4:2:0 or YUV 4:2:2. This network is used at the encoder side (e.g., to perform the training and the loss computation in the encoding format) and at the decoder side (e.g., to output the decoded frame with the YUV 4:2:0 format).
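One very small form such a lightweight downsampling network could take is a single 2x2 filter applied with a stride of 2 to each chroma plane. This is an assumption made for illustration: the architectures of FIGs 8 and 9 are not reproduced here, and the kernel values below are hand-picked rather than learned.

```python
def downsample_plane(plane, kernel, bias=0.0):
    """Apply one 2x2 kernel with stride 2 to a plane given as a list of rows.
    Assumes the plane has even width and height."""
    out = []
    for r in range(0, len(plane), 2):
        row = []
        for c in range(0, len(plane[0]), 2):
            acc = bias
            for dr in range(2):
                for dc in range(2):
                    acc += kernel[dr][dc] * plane[r + dr][c + dc]
            row.append(acc)
        out.append(row)
    return out


def yuv444_to_yuv420(y, u, v, kernel_u, kernel_v):
    """Keep luma at full resolution and downsample both chroma planes by 2 in each dimension."""
    return y, downsample_plane(u, kernel_u), downsample_plane(v, kernel_v)


u_444 = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

averaging = [[0.25, 0.25], [0.25, 0.25]]   # chroma sample centred between four luma samples
top_left = [[1.0, 0.0], [0.0, 0.0]]        # chroma sample aligned with the top-left luma sample

u_centered = downsample_plane(u_444, averaging)
u_cosited = downsample_plane(u_444, top_left)
```

The two kernels produce different chroma planes from the same input, which illustrates why weights fitted to the signal and to its chroma alignment can matter. In an overfitted setting, the kernel entries would be trained together with the other parameters and used identically at the encoder and the decoder.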

[0062] In another example, a set of (e.g., pre-trained) weights that fits a given phase (alignment between Y and UV components) is defined, e.g., to reduce the computational training time and the amount of transmitted information. The set of (e.g., pre-trained / pre-computed) weights used for a current picture may be signaled in the bitstream. The set of all (e.g., pre-trained / pre-computed) weights is assumed to be known at the decoder side.

[0063] In all the examples described below, the formats of the frame are YUV 4:4:4, YUV 4:2:0 or YUV 4:2:2. However, the present principles are not limited to these formats and may be applied to other formats, e.g., a format with one luma channel and a plurality of chroma channels such as the Lab color space. Besides, the present principles are not limited to three channels and may be applied to two, three or more than three channels.

[0064] FIG. 5 is a block diagram illustrating an example decoding method 500 according to one or more embodiments of the present disclosure.

[0065] A set of latent variables representative of a current picture is obtained (502). Parameters of a first neural network (or of a set comprising at least one first neural network) and at least one parameter of a downsampling chroma network are also obtained (502). For example, the set of latent variables, the parameters of the first neural network and the at least one parameter of the downsampling chroma network are decoded (e.g., entropy decoded) from a bitstream.

[0066] The current picture is reconstructed (504) based on the set of latent variables, the parameters of the first neural network and the at least one parameter of the downsampling chroma network.

[0067] In an embodiment, the first neural network is configured to output a picture with a plurality of channels having one and the same resolution (e.g., a picture in YUV 4:4:4 format) using the latent variables as inputs.

[0068] In an embodiment, the downsampling chroma network is configured to output a picture with a plurality of channels, each channel having a same resolution as a corresponding channel in the current picture (e.g., a current picture in YUV 4:2:2 format), using the output of the first neural network as input. Said otherwise, the downsampling chroma network is configured to output a picture at the encoding format (a.k.a. encoding frame format), wherein the encoding format may be signaled in the bitstream. Weights of the downsampling chroma network may be decoded from the bitstream or selected (e.g., loaded in the network) in a set of weights based on a first parameter representative of a chroma phasing and a second parameter representative of a color space.

[0069] In an embodiment, a type of clipping may be further obtained and the reconstructed current picture may (optionally) be clipped based on the type of clipping.

[0070] In another embodiment, the first neural network is replaced by a set of first neural networks, wherein the set of first neural networks outputs a picture with a plurality of channels having one and the same resolution (e.g., a picture in YUV 4:4:4 format) using the latent variables as inputs.

[0071] FIG. 6 is a block diagram illustrating an example encoding method 600 according to one or more embodiments of the present disclosure.

[0072] A neural network representation of a current picture is trained (602) to derive a set of latent variables representative of the current picture, parameters of a first neural network (or of a set comprising at least one first neural network), and at least one parameter of a downsampling chroma network.

[0073] The set of latent variables, the parameters of the first neural network and the at least one parameter of the downsampling chroma network are encoded (604), e.g., entropy coded in a bitstream.

[0074] In an embodiment, the first neural network is configured to output a picture with a plurality of channels having one and the same resolution (e.g., a picture in YUV 4:4:4 format) using the latent variables as inputs.

[0075] In an embodiment, the downsampling chroma network is configured to output a picture with a plurality of channels, each channel having a same resolution as the corresponding channel in the current picture (e.g., a current picture in YUV 4:2:2 format), using the output of the first neural network as input. Said otherwise, the downsampling chroma network is configured to output a picture at the encoding format (a.k.a. encoding frame format), wherein the encoding format may be signaled in the bitstream. Weights of the downsampling chroma network may be encoded in the bitstream or selected (e.g., loaded in the network) in a set of weights based on a first parameter representative of a chroma phasing and a second parameter representative of a color space, in which case the first and second parameters are encoded in the bitstream.

[0076] In an embodiment, the reconstructed current picture may (optionally) be clipped based on a type of clipping and the type of clipping may be further encoded.

[0077] In another embodiment, the first neural network is replaced by a set of first neural networks, wherein the set of first neural networks outputs a picture with a plurality of channels having one and the same resolution (e.g., a picture in YUV 4:4:4 format) using the latent variables as inputs. This is applicable to all embodiments described below.

[0078] FIG. 7 is a block diagram illustrating an example encoding method according to one or more embodiments of the present disclosure. The encoding method comprises training the neural networks and encoding the given input frame.

[0079] Let F be a frame (e.g., an image frame or a picture) to be encoded. The format of the frame F may be YUV 4:4:4, YUV 4:2:2 or YUV 4:2:0 and is called the encoding format. The encoding format may be (e.g., should be) signaled (i.e., encoded) in the bitstream.
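As a minimal sketch of what the encoding format implies for the chroma channels, the helper below maps a format to the size of each chroma plane. The function name, the format strings and the round-up rule for odd dimensions are assumptions made for illustration; only the subsampling ratios of the three formats are standard.

```python
from typing import Dict, Tuple

# (horizontal, vertical) chroma subsampling factors for each format.
CHROMA_SUBSAMPLING: Dict[str, Tuple[int, int]] = {
    "YUV444": (1, 1),  # chroma at full resolution
    "YUV422": (2, 1),  # chroma halved horizontally only
    "YUV420": (2, 2),  # chroma halved horizontally and vertically
}


def chroma_plane_size(width: int, height: int, fmt: str) -> Tuple[int, int]:
    """Return (width, height) of each chroma plane for a luma plane of the given size.

    Odd dimensions are rounded up, which is an assumption of this sketch.
    """
    sx, sy = CHROMA_SUBSAMPLING[fmt]
    return (width + sx - 1) // sx, (height + sy - 1) // sy
```

For a 1920x1080 luma plane this gives 960x540 chroma planes in YUV 4:2:0 and 960x1080 in YUV 4:2:2.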

[0080] A set of latent variables is used to represent the signal to encode in a compact manner. A typical example of such an approach is presented in Ladune. The latent variables are then used as input to a first neural network NN1.

[0081] In the method of FIG. 7, two neural networks NN1 and NN2 are jointly trained (from left to right in FIG. 7) in addition to the set of latent variables (and possibly the PP network if such network is used). The main network NN1 ($f_\theta$) is parametrized by the parameters $\theta$. As input, this network NN1 takes the set of latent variables, whereas the output of this network is a frame of (e.g., three) channels having one and the same resolution, e.g., a YUV 4:4:4 frame. The second network NN2 (a.k.a. downsampling chroma network) is a neural network that makes it possible to downsample the chroma components (e.g., U and V) to recover the encoding format of the chroma components, i.e., a frame with chroma components having the same resolution as the chroma components of the frame F. This network NN2 is parametrized by $\theta_d$. The downsampling network NN2 may be bypassed if / when the encoding frame format is YUV 4:4:4.

[0082] The second network NN2 may be a lightweight network. Two different architectures are illustrated by FIGs. 8 and 9. FIG. 8 is a block diagram illustrating an example neural network for chroma downsampling conditioned by the Y component. The network input is the UV components at full resolution and the network output is the UV components at half resolution. The network comprises a 2D convolution layer (e.g., with stride equal to 2) followed by a concatenation and another 2D convolution layer. FIG. 9 is a block diagram illustrating an example neural network for chroma downsampling that leverages the cross-component redundancy. For that purpose, the YUV components are the input of the network. A 2DPixelShuffle operation is then used to rearrange channels and their resolutions in an appropriate format, such as:

[0083] (C, N, M) -> (4 * C, N / 2, M / 2)

[0084] Where C represents the number of channels (C=3 in our context).
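The rearrangement (C, N, M) -> (4 * C, N / 2, M / 2) can be illustrated with plain nested lists. This is a sketch only: the function name and the ordering of the four output channels produced per input channel are assumptions, since the text specifies the shapes but not the channel layout.

```python
from typing import List

Plane = List[List[float]]  # one channel stored as N rows of M samples


def rearrange_2x2(frame: List[Plane]) -> List[Plane]:
    """Rearrange a (C, N, M) frame into (4*C, N/2, M/2) without dropping samples.

    Output channel 4*c + 2*dy + dx holds the samples of input channel c located
    at offset (dy, dx) inside each 2x2 block (assumed ordering).
    """
    n, m = len(frame[0]), len(frame[0][0])
    if n % 2 or m % 2:
        raise ValueError("N and M must both be even")
    out: List[Plane] = []
    for plane in frame:
        for dy in (0, 1):
            for dx in (0, 1):
                out.append(
                    [[plane[2 * i + dy][2 * j + dx] for j in range(m // 2)]
                     for i in range(n // 2)]
                )
    return out
```

Every input sample appears exactly once in the output, so the spatial resolution is halved in each direction while the convolution layers that follow still have access to all the data.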

[0085] The 2DPixelShuffle operation is followed by a set of convolution layers to progressively reduce the number of channels to 3. The architectures illustrated by FIGs. 8 and 9 are just examples and other model architectures may be used, with different numbers and types of layers. A 2D convolution layer is assumed to be a convolution operation associated with a bias and followed by an activation function (e.g., ReLU).

[0086] The neural networks NN1 and NN2 are first trained. To this aim, a loss function is used to optimize the parameters $\{\theta, \theta_d\}$ and the latent variables. The loss function may be a classical loss function defined as follows:

[0087] Where $y$ denotes the quantized latent variables and $z$ denotes their up-sampled version, $D$ is a distortion which quantifies the difference between the reconstructed image $\hat{x} = f_{\theta_d}(f_\theta(z))$ and the original image $x$. The reconstructed image is obtained by applying sequentially the two neural networks, i.e., $f_\theta$ and $f_{\theta_d}$. $R$ is the bitrate of the encoded parameters and $\lambda$ is a trade-off parameter between $D$ and $R$. $D$ could be any differentiable distortion measure, such as mean squared error.
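A rate-distortion objective consistent with these definitions may be sketched as follows. This is offered under assumption: the usual Lagrangian form is assumed, the symbol $\mathcal{L}$ is introduced here for illustration, and the exact arguments of $R$ are left unspecified.

```latex
\mathcal{L} = D(\hat{x}, x) + \lambda R,
\qquad \hat{x} = f_{\theta_d}\bigl(f_\theta(z)\bigr)
```

Under this form, a larger $\lambda$ favors a lower bitrate at the cost of a higher distortion, and a smaller $\lambda$ does the opposite.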

[0088] In a variant, the loss function is used to optimize the parameters $\{\theta, \theta_d, \psi\}$ and the latent variables in a case where the entropy coding (700, 702) is performed by an entropy network.

[0089] For a YUV 4:2:0 frame, the distortion D may be given by:

[0090] $D(\hat{x}, x) = \lambda_1 \times D_Y(\hat{x}_Y, x_Y) + \lambda_2 \times D_U(\hat{x}_U, x_U) + \lambda_3 \times D_V(\hat{x}_V, x_V)$

[0091] Where $x_Y$, $x_U$, $x_V$ represent the Y, U, V components of the input signal. $\hat{x}_Y$, $\hat{x}_U$, $\hat{x}_V$ represent the Y, U, V components of the network’s output. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting coefficients to balance the importance of luma and chroma components.

[0092] When the distortion is the mean squared error, the following values may be used for a frame format YUV 4:2:0: $\lambda_1 = 4/6$, $\lambda_2 = \lambda_3 = 1/6$. Another set of parameters is $\lambda_2 = \lambda_3 = 1/12$, which balances the luma and chroma PSNR more fairly.
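The weighted distortion for a YUV 4:2:0 frame can be sketched in a few lines, assuming mean squared error as the per-component distortion, a luma weight of 4/6 and chroma weights of 1/6. The function names and the dictionary layout of the planes are hypothetical and chosen only for this example.

```python
from typing import Dict, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flattened planes of equal, non-zero length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("planes must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def weighted_distortion(
    rec: Dict[str, Sequence[float]],
    ref: Dict[str, Sequence[float]],
    weights: Tuple[float, float, float] = (4 / 6, 1 / 6, 1 / 6),
) -> float:
    """Weighted sum of the Y, U and V distortions.

    rec and ref map the keys "Y", "U" and "V" to flattened planes
    (reconstructed and original, respectively).
    """
    w_y, w_u, w_v = weights
    return (
        w_y * mse(rec["Y"], ref["Y"])
        + w_u * mse(rec["U"], ref["U"])
        + w_v * mse(rec["V"], ref["V"])
    )
```

With the weights 4/6, 1/6 and 1/6, the result equals the mean squared error computed over all samples of a YUV 4:2:0 frame pooled together, because the luma plane holds four times as many samples as each chroma plane.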

[0093] Before the computation of the loss, the results may (optionally) be clipped to bound the range of the values. Different types of clipping may be used, as listed below. These are examples and the present principles are not limited to these examples:

[0094] • clipping_type = limited, the clipping bounds are 64-940 for luma and 64-960 for chroma for a 10-bit signal;

[0095] • clipping_type = full, the range is 0-1023 for all components;

[0096] • clipping_type = full8, the range is 0-1020 for all components;

[0097] • clipping_type = custom, the boundaries are signaled in the bitstream.
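The clipping types listed above can be sketched as a lookup of per-component bounds for a 10-bit signal. The function names and the way custom bounds are passed in are assumptions for illustration; only the numeric ranges come from the list.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Bounds = Tuple[int, int]  # (lowest, highest) allowed sample value


def clipping_bounds(
    clipping_type: str, custom: Optional[Dict[str, Bounds]] = None
) -> Dict[str, Bounds]:
    """Return the (low, high) bounds of the Y, U and V components."""
    if clipping_type == "limited":
        return {"Y": (64, 940), "U": (64, 960), "V": (64, 960)}
    if clipping_type == "full":
        return {c: (0, 1023) for c in "YUV"}
    if clipping_type == "full8":
        return {c: (0, 1020) for c in "YUV"}
    if clipping_type == "custom":
        if custom is None:
            raise ValueError("custom bounds must be provided (signaled in the bitstream)")
        return custom
    raise ValueError(f"unknown clipping type: {clipping_type}")


def clip_samples(samples: Sequence[int], bounds: Bounds) -> List[int]:
    """Clamp every sample into the closed interval given by bounds."""
    low, high = bounds
    return [min(max(s, low), high) for s in samples]
```

For example, `clip_samples([0, 500, 1023], clipping_bounds("limited")["Y"])` clamps the luma samples into the 64-940 range.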

[0098] Once trained, the parameters of NN1 and NN2 are encoded (e.g., entropy coded). The encoding of a frame (e.g., a YUV frame in format YUV 4:4:4, YUV 4:2:2 or YUV 4:2:0) in a bitstream is thus performed as follows. The parameters $\theta$ of the neural network NN1 are encoded (700). The PP network parameters $\psi$ may be further encoded (entropy coded) if such a network is used. The compact representation (i.e., the latent variables) of the signal is encoded (702). In the case where a PP network is used, the latent variables are encoded (entropy coded) in the bitstream based on the distribution parameters output by the PP network.

[0099] The encoding format of the input frame may be signaled in the bitstream as well as the clipping type (i.e., they are both encoded (700) in the bitstream).

[0100] In the first example illustrated in FIG. 7, the parameters $\theta_d$ of the neural network NN2 are also encoded (700), e.g., if the encoding format of the input frame is not YUV 4:4:4.

[0101] Entropy coding is used to encode and transmit the set of trainable weights $\{\theta, \theta_d\}$ and possibly $\{\psi\}$. They can be classically encoded with an arithmetic coder or with a neural compression codec such as Neural Network Coding (NNC) / ISO/IEC 15938-17 or MPEG-7 part 17. The entropy coding is also used to encode a compact representation of the input signal. As in Ladune, such representation can be a set of latent variables.

[0102] In the first example of FIG. 7, the weights $\{\theta_d\}$ of the downsampling chroma network are entropy coded. In a second example, instead of training and encoding the weights, the weights $\{\theta_d\}$ may be loaded (e.g., selected) from a set of (e.g., predefined) weights. This set may have been obtained from off-line training for known UV phasing values in a given color space (e.g., gamut). The set may further comprise custom weights in the case where a fast training mode is desired. These custom weights may be set without any training, in which case they are signaled in the bitstream. The weights $\{\theta_d\}$ selected from the set of (e.g., predefined) weights may be signaled by two syntax elements, e.g., weights_downsampling_chroma and type_color_space.

[0103] Table 1 presents the appropriate syntax for the weights_downsampling_chroma value. This value may be coded over 8 bits and may refer to a specific chroma phasing (values 1 to 5 in Table 1). In an example, if / when the value is equal to 6 (UNSPECIFIED), the weights are inferred from the training and thus signaled in the bitstream.

Table 1: description of weights_downsampling_chroma

[0104] Two examples of chroma phasing are illustrated in FIG. 10. The top of FIG. 10 illustrates a top chroma phasing and the bottom of FIG. 10 illustrates a left chroma phasing.

[0105] Table 2 presents the gamut for which the transform is performed.

[0106] Table 2: description of type color space transformation

[0107] In this second example, weights_downsampling_chroma and type color space transformation parameters may be signaled (i.e., encoded) in the bitstream to indicate a specific set of weights. In yet another example, the first and second examples are combined.

[0108] The weights_downsampling_chroma and type color space transformation parameters may be signaled (i.e., encoded) if required (700), e.g., if the encoding format is not YUV 4:4:4. If the two aforementioned parameters are not signaled (not encoded) and if the encoding format of the input frame is not YUV 4:4:4, the parameters of the neural network $f_{\theta_d}$ are encoded. In another example, if the two aforementioned parameters are signaled and the value of weights_downsampling_chroma indicates UNSPECIFIED or Custom / generic weights, the parameters of the neural network $f_{\theta_d}$ are encoded.

[0109] FIG. 11 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure. The decoding is performed from a bitstream, e.g., provided by the encoding method of FIG. 7.

[0110] The parameters $\theta$ of the neural network NN1 are obtained (1200), e.g., decoded (entropy decoded) from the bitstream. The compact representation of the signal (i.e., the latent variables) is also obtained, e.g., decoded (entropy decoded) from the bitstream (1202). In a variant, the PP network parameters $\psi$ may further be obtained (e.g., entropy decoded) if such a network is used. In the latter case, based on the distribution parameters output by the PP network, the latent variables are decoded (entropy decoded) from the bitstream.

[0111] If / when the encoding frame format signaled in the bitstream is YUV 4:4:4, the decoded frame YUV 4:4:4 is reconstructed by using the neural network $f_\theta$ and the chroma downsampling operation is bypassed.

[0112] If / when the encoding frame format signaled in the bitstream is not YUV 4:4:4, the following applies. The parameters $\theta_d$ of the neural network NN2 (i.e., the downsampling chroma network) are obtained, e.g., decoded (1200) from the bitstream. The frame YUV 4:4:4 is reconstructed (1204) from the decoded latent variables by using the neural network NN1 ($f_\theta$) parameterized by $\{\theta\}$ and the chroma is finally downsampled by the network NN2 ($f_{\theta_d}$) parameterized by $\{\theta_d\}$. The samples of the reconstructed frame (e.g., YUV) may (optionally) be clipped (1205) based on a clipping type. The type of clipping (e.g., used during the encoding) may be decoded from the bitstream (1200).

[0113] FIG. 12 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure. The decoding is performed from a bitstream, e.g., provided by the encoding method of FIG. 7.

[0114] The parameters $\theta$ of the neural network NN1 are obtained (1200), e.g., decoded (entropy decoded) from the bitstream. The compact representation of the signal (i.e., the latent variables) is also obtained, e.g., decoded (entropy decoded) from the bitstream (1202). In a variant, the PP network parameters $\psi$ may further be obtained (e.g., entropy decoded) if such a network is used. In the latter case, based on the distribution parameters output by the PP network, the latent variables are decoded (entropy decoded) from the bitstream.

[0115] If / when the encoding (frame) format signaled in the bitstream is YUV 4:4:4, the decoded frame YUV 4:4:4 is reconstructed by using the neural network $f_\theta$ and the chroma downsampling operation is bypassed.

[0116] If / when the encoding frame format signaled in the bitstream is not YUV 4:4:4, the following applies. The weights_downsampling_chroma and type color space transformation parameters are signaled (i.e., are decoded from the bitstream at 1200), then the appropriate weights $\{\theta_d\}$ are loaded into the neural network NN2. For example, the appropriate weights $\{\theta_d\}$ may be derived from a Look-Up Table (LUT) associating weights with weights_downsampling_chroma and type color space transformation parameter values. The LUT may be known by both the encoder and decoder. The frame YUV 4:4:4 is reconstructed (1204) from the decoded latent variables by using the neural network NN1 parameterized by $\{\theta\}$ and the chroma is finally downsampled by the network NN2 parameterized by $\{\theta_d\}$.

[0117] The samples of the reconstructed frame (e.g., YUV) may (optionally) be clipped (1205) based on a clipping type. The type of clipping (e.g., used during the encoding) may be decoded from the bitstream (1200).

[0118] FIG. 13 is a block diagram illustrating an example decoding method according to one or more embodiments of the present disclosure. The decoding is performed from a bitstream, e.g., provided by the encoding method of FIG. 7.

[0119] The parameters $\theta$ of the neural network NN1 are obtained (1200), e.g., decoded (entropy decoded) from the bitstream. The compact representation of the signal (i.e., the latent variables) is also obtained, e.g., decoded (entropy decoded) from the bitstream (1202). In a variant, the PP network parameters $\psi$ may further be obtained (e.g., entropy decoded) if such a network is used. In the latter case, based on the distribution parameters output by the PP network, the latent variables are decoded (entropy decoded) from the bitstream.

[0120] If / when the encoding (frame) format signaled in the bitstream is YUV 4:4:4, the decoded frame YUV 4:4:4 is reconstructed by using the neural network $f_\theta$ and the chroma downsampling operation is bypassed.

[0121] If / when the encoding frame format signaled in the bitstream is not YUV 4:4:4, the following applies. If / when the weights_downsampling_chroma and type color space transformation parameters are signaled (i.e., are decoded from the bitstream at 1200), then the appropriate weights $\{\theta_d\}$ are loaded into the neural network NN2; otherwise (i.e., the parameters weights_downsampling_chroma and type_color_space are not signaled), the parameters $\{\theta_d\}$ of the neural network NN2 (i.e., the downsampling chroma network) are decoded (1200) from the bitstream.

[0122] The frame YUV 4:4:4 is reconstructed (1204) from the decoded latent variables by using the neural network NN1 parameterized by $\{\theta\}$ and the chroma is finally downsampled by the network NN2 parameterized by $\{\theta_d\}$. The samples of the reconstructed frame may (optionally) be clipped (1205) based on a clipping type. The type of clipping (e.g., used during the encoding) may be decoded from the bitstream (1200).
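The way a decoder could obtain the NN2 weights in the combined case can be sketched as a small decision function. Everything here is illustrative: the function name, the format strings, the LUT layout keyed by the two syntax element values, and the `decode_weights` callback standing in for entropy decoding are assumptions. The value 6 for UNSPECIFIED is taken from the description of weights_downsampling_chroma.

```python
from typing import Callable, Dict, List, Optional, Tuple

Weights = List[float]
UNSPECIFIED = 6  # weights inferred from training and carried in the bitstream


def select_nn2_weights(
    encoding_format: str,
    weights_downsampling_chroma: Optional[int],
    type_color_space: Optional[int],
    lut: Dict[Tuple[int, int], Weights],
    decode_weights: Callable[[], Weights],
) -> Optional[Weights]:
    """Return the NN2 weights, or None when chroma downsampling is bypassed.

    A syntax element set to None models "not signaled in the bitstream".
    """
    if encoding_format == "YUV444":
        return None  # NN2 is bypassed for YUV 4:4:4
    signaled = (
        weights_downsampling_chroma is not None and type_color_space is not None
    )
    if signaled and weights_downsampling_chroma != UNSPECIFIED:
        # Predefined weights shared by the encoder and the decoder.
        return lut[(weights_downsampling_chroma, type_color_space)]
    # Otherwise the weights themselves are decoded from the bitstream.
    return decode_weights()
```

Selecting predefined weights costs only the two syntax elements, whereas decoding trained weights costs more bits but lets the downsampling adapt to the current picture.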

[0123] The methods described above may be applied to each frame of a video to encode (respectively reconstruct) the whole video.

[0124] One or more embodiments provide a computer program comprising instructions which when executed by one or more processors cause such processors to perform the encoding and / or decoding methods according to any of the embodiments described above. One or more embodiments also provide a computer readable storage medium having stored thereon instructions for encoding or decoding video data according to the methods described above.

[0125] One or more embodiments provide a computer readable storage medium having stored thereon video data generated according to the methods described above. One or more embodiments also provide a method and apparatus for transmitting or receiving video data generated according to the methods described above.

[0126] The embodiments described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., as a method), the implementation of such features may also be implemented in other forms. An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. Corresponding methods may be implemented in, for example, a processor.

[0127] Various methods and aspects described herein can be used to modify one or more modules. For example, the intra predictors and inter predictors described with respect to FIGs. 2 and 3 may be implemented as one or more modules and modified according to the various embodiments of the present disclosure.

[0128] The various embodiments described herein provide at least the following features, devices or aspects, alone or in any combination, across various claim categories and types: i. Encoding, into coded video data, syntax elements that can enable the decoder to decode the coded video data, according to any of the embodiments described herein. ii. A bitstream that includes one or more of the described syntax elements, or variations thereof, whether transmitted, stored, or otherwise made available. iii. Creating, transmitting, receiving, and / or decoding of the bitstream. iv. An electronic device (e.g., TV, set-top box, mobile phone, tablet, etc.) that tunes a channel to receive a bitstream or that receives such bitstream over the air. The electronic device decodes the syntax elements from the bitstream, and, optionally, displays (e.g., via a monitor or other type of display) a resulting image.

[0129] Various numeric values are used in the present application. Such specific values are for example purposes and the embodiments described are not limited to these specific values.

[0130] Various methods are described herein, and such methods comprise one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for the proper operation of the method, the order and / or use of specific steps and / or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an order to the operations unless specifically required.

[0131] The present disclosure may refer to “determining” various pieces of information. Determining information may include one or more of, for example, estimating, calculating, predicting, or retrieving (e.g., from memory) the information.

[0132] The present disclosure may refer to “accessing” various pieces of information. Accessing information may include one or more of, for example, receiving, retrieving (e.g., from memory), storing, moving, copying, calculating, determining, predicting, or estimating the information. Similarly, the present disclosure may refer to “receiving” various pieces of information. Receiving information may include one or more of, for example, accessing or retrieving (e.g., from memory) the information.

[0133] “Decoding,” as used herein, encompasses all or part of the processes performed, for example, on an encoded sequence to produce an output suitable for display. In some embodiments, such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, etc. Whether the phrase “decoding process” is intended to refer to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific description and will be well understood by those skilled in the art.

[0134] “Encoding,” as used herein, encompasses all or part of the processes performed, for example, on input video data in order to produce an encoded bitstream. Additionally, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” or “coded” may be used interchangeably, the terms “frame,” “image,” “picture,” “sub-picture,” and “slice” may be used interchangeably, and the terms “pixel” and “sample” may be used interchangeably.

[0135] The present disclosure refers to information, for example, syntax elements, that can be transmitted or stored. Such information can be packaged or arranged in a variety of manners, including for example manners common in video standards such as putting the information into a sequence parameter set (SPS), a picture parameter set (PPS), a network abstraction layer (NAL) unit, a header (for example, a NAL unit header, or a slice header), or an SEI message. Other manners are also available, including, for example, manners that are common for system level or application-level standards such as signaling the information into one or more of the following: i. session description protocol (SDP), for example as described in RFCs and / or used in conjunction with real-time transport protocol (RTP) transmission. ii. hypertext transfer protocol (HTTP) Live Streaming (HLS) manifest transmitted over HTTP. iii. dynamic adaptive streaming over HTTP (DASH) media presentation description (MPD) descriptors, for example as used in DASH and transmitted over HTTP. iv. RTP header extensions, for example as used during RTP streaming. v. International Organization for Standardization (ISO) base media file format, for example, as used in Omnidirectional MediA Format (OMAF).

[0136] As used herein, “signal” and “signaling” refer to, among other things, indicating information to a decoder. For example, in some embodiments the encoder signals a quantization matrix for de-quantization, whereby the same parameter is used for both encoding and decoding. In some embodiments, the signaling may be explicit, such that information (e.g., a particular parameter) is transmitted to the decoder enabling the decoder to use the same particular parameter. In some embodiments, the signaling may be implicit, in that the information (e.g., a particular parameter) is indicated based on other information at or transmitted to the decoder or derived or selected by the decoder based on information available at the decoder. By not transmitting the information (e.g., the particular parameter), a bit savings is thus realized in some embodiments. In some embodiments, one or more syntax elements or flags are used to signal information to a decoder. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.

[0137] In some embodiments, signals may be produced that are formatted to carry information that may be stored or transmitted. Such information may include, for example, instructions for performing a method, or data produced by one of the described implementations (e.g., a bitstream of a described embodiment). Such a signal may be formatted, for example, as an electromagnetic wave or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links and may be stored on a processor-readable medium.

[0138] It is to be understood that use of any of the following “and / or” and “at least one of” is intended to encompass all possible selections of listed items, taken either individually or in any combination thereof.

[0139] While specific embodiments have been described in the foregoing description in connection with the accompanying drawings, it should be understood that embodiments described herein are examples only and should not be taken as limiting the scope of the present disclosure or the following claims. Although features and elements are described herein in particular combinations, those of ordinary skill in the art will appreciate that such features or elements may be used alone or in any combination with the other features and elements. It is understood, therefore, that the overall teachings of the present disclosure are not limited to the particular embodiments, implementations, and examples disclosed herein, but are intended to cover variations, modifications, and alternatives as defined by the appended claims and any and all equivalents thereof.

Claims

What is claimed is:

1. A method comprising: obtaining a set of latent variables representative of a current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network; and reconstructing the current picture based on the set of latent variables, the parameters of the set of at least one first neural network and the at least one parameter of the downsampling chroma network.

2. The method of claim 1, wherein the set of at least one first neural network is configured to output a picture with a plurality of channels having one and the same resolution using the set of latent variables as inputs.

3. The method of claim 1 or 2, wherein the downsampling chroma network is configured to output a picture with a plurality of channels, each channel having a same resolution as a corresponding channel in the current picture, using an output of the set of at least one first neural network as input.

4. The method of any one of claims 1 to 3, further comprising obtaining a type of clipping and clipping the reconstructed current picture based on the type of clipping.

5. The method of any one of claims 1 to 4, wherein obtaining at least one parameter of a downsampling chroma network comprises decoding weights of the downsampling chroma network.

6. The method of any one of claims 1 to 4, wherein obtaining at least one parameter of a downsampling chroma network comprises decoding a first parameter representative of a chroma phasing and a second parameter representative of a color space and selecting weights of the downsampling chroma network in a set of weights based on the first and second parameters.

7. A method comprising: training a neural network representation of a current picture to derive a set of latent variables representative of the current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network; and encoding the set of latent variables, the parameters of the set of at least one first neural network and at least one parameter of the downsampling chroma network.

8. The method of claim 7, wherein the set of at least one first neural network is configured to output a picture with a plurality of channels having one and the same resolution using the set of latent variables as inputs.

9. The method of claim 7 or 8, wherein the downsampling chroma network is configured to output a picture with a plurality of channels, each channel having a same resolution as a corresponding channel in the current picture, using an output of the set of at least one first neural network as input.

10. The method of any one of claims 7 to 9, further comprising clipping the output of the downsampling chroma network based on a type of clipping and encoding the type of clipping.

11. The method of any one of claims 1 to 4, wherein encoding at least one parameter of a downsampling chroma network comprises encoding weights of the downsampling chroma network.

12. The method of any one of claims 1 to 4, wherein weights of the downsampling chroma network being selected in a set of weights based on a first parameter representative of a chroma phasing and a second parameter representative of a color space, encoding at least one parameter of the downsampling chroma network comprises encoding the first and second parameters.

13. A decoding apparatus comprising one or more processors and at least one memory coupled to said one or more processors, wherein said one or more processors are configured to perform: obtaining a set of latent variables representative of a current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network; and reconstructing the current picture based on the set of latent variables, the parameters of the set of at least one first neural network and at least one parameter of the downsampling chroma network.

14. An encoding apparatus comprising one or more processors and at least one memory coupled to said one or more processors, wherein said one or more processors are configured to perform: training a neural network representation of a current picture to derive a set of latent variables representative of the current picture, parameters of a set of at least one first neural network, and at least one parameter of a downsampling chroma network; and encoding the set of latent variables, the parameters of the set of at least one first neural network and at least one parameter of the downsampling chroma network.

15. A computer program comprising program code instructions for implementing the method according to any one of claims 1-12 when executed by a processor.

16. A computer readable storage medium having stored thereon instructions for implementing the method of any one of claims 1-12.

17. A signal comprising a bitstream, formed by performing the method of any one of claims 7-