Neural network post filter for restoring the QOE in an end-to-end energy adaptation context
A neural network post-filter addresses the challenge of multi-dimensional energy adaptation in video encoding and decoding, enhancing Quality of Experience by adapting brightness, spatial resolution, and film grain, thereby optimizing energy consumption across the video transmission chain.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INTERDIGITAL CE PATENT HOLDINGS SAS
- Filing Date
- 2025-11-24
- Publication Date
- 2026-07-02
Abstract
Description
[0001] Docket No. 2024P00881WO
[0002] NEURAL NETWORK POST FILTER FOR RESTORING THE QOE IN AN END-TO-END ENERGY ADAPTATION CONTEXT
[0003] CROSS REFERENCE TO RELATED APPLICATIONS
[0004] This application claims the benefit of European Application No. 24307268.3, filed on December 23, 2024, which is incorporated herein by reference in its entirety.
[0005] BACKGROUND
[0006] The present application is related to adaptation of energy consumption of the video transmission chain, encompassing encoding, decoding and rendering.
[0007] BRIEF SUMMARY
[0008] Briefly stated, in one embodiment, a method of video decoding is presented, comprising: receiving at least a syntax element associated with energy adaptation; determining that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted based on the at least a syntax element; and adapting a neural network based filter to adjust the one or more visual dimensions.
[0009] In another embodiment, an apparatus for video decoding is presented, comprising one or more processors and at least one memory coupled to the one or more processors, wherein the one or more processors are configured to: receive at least a syntax element associated with energy adaptation; determine that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted based on the at least a syntax element; and adapt a neural network based filter to adjust the one or more visual dimensions.
[0010] In another embodiment, a method of video encoding is presented, comprising: determining that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted for energy adaptation; and encoding at least a syntax element to indicate that the one or more visual dimensions are to be adjusted by a neural network based filter.
[0011] In another embodiment, an apparatus for video encoding is presented, comprising one or more processors and at least one memory coupled to the one or more processors, wherein the one or more processors are configured to: determine that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted for energy adaptation; and encode at least a syntax element to indicate that the one or more visual dimensions are to be adjusted by a neural network based filter.
[0013] BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The following detailed description will be better understood when read in conjunction with the appended drawings, in which there are shown examples of one or more of the multiple embodiments of the present disclosure. It should be understood, however, that the embodiments described herein are not limited to the precise arrangements and instrumentalities shown in the drawings. In the drawings:
[0015] FIG. 1 is a block diagram illustrating an example system according to one or more embodiments of the present disclosure;
[0016] FIG. 2 is a block diagram illustrating an example video encoder according to one or more embodiments of the present disclosure;
[0017] FIG. 3 is a block diagram illustrating an example video decoder according to one or more embodiments of the present disclosure;
[0018] FIG. 4A illustrates a method of using the quality policy when the post-processing filter is applied to all visual dimensions, according to one or more embodiments of the present disclosure;
[0019] FIG. 4B illustrates a method of using the quality policy when the postfilter is limited to recovering the spatial resolution, according to one or more embodiments of the present disclosure;
[0020] FIG. 4C illustrates a method of using the quality policy when the postfilter is limited to recovering the brightness, according to one or more embodiments of the present disclosure; and
[0021] FIG. 5 illustrates a method of implementing an NNPF (Neural-Network Post-Filter) according to one or more embodiments of the present disclosure.
[0022] DETAILED DESCRIPTION
[0023] In describing the various embodiments of the present disclosure, certain terminology is used herein for convenience only and should not be considered as limiting such embodiments.
[0024] In the drawings, the same reference numerals are employed for designating the same elements throughout the several figures and the present description.
[0025] Referring to the drawings, there is shown in FIG. 1 a block diagram illustrating an example system 100 in which embodiments of the present disclosure can be implemented. The system 100 may be an electronic device including, for example, a personal computer, laptop computer, mobile phone, tablet computer, multimedia set-top box, digital television receiver, personal video recording system, connected home appliance, vehicle control and / or entertainment system, and server. One or more elements of the system 100, singly or in combination, may be implemented as an integrated circuit (IC), multiple ICs, and / or discrete components. For example, in one embodiment, the processing, encoding and / or decoding elements of system 100 are distributed across multiple ICs and / or discrete components. In some embodiments, the system 100 is communicatively coupled to and / or in communication with other systems or devices, via, for example, a communications bus or dedicated input / output ports.
[0026] One or more of the elements of system 100 may be provided within an integrated housing, with such elements being interconnected and able to transmit data therebetween using any suitable connection arrangement 115 generally known in the art, including, for example, an internal bus (e.g., I2C bus), wiring, and printed circuit boards.
[0027] The system 100 includes at least one processor 110 configured to execute instructions for implementing the embodiments described herein, including signal / data coding and processing. The processor 110 may be a general-purpose processor or microprocessor, digital signal processor (DSP), one or more microprocessors in association with a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), a state machine, and the like. The processor 110 may include at least one central processing unit (CPU), embedded memory, input and output interfaces, and other circuitries.
[0028] The system 100 includes at least one memory 120, for example, a volatile memory device and / or a non-volatile memory device. The system 100 includes a storage device 140, that may be or include non-volatile memory and / or dynamic volatile memory, including EEPROM, ROM, PROM, RAM, DRAM, SRAM, DDR, flash, magnetic disk drives, solid state drives (SSD) and / or optical disk drives. The storage device 140 may be or include, for example, an internal storage device, an attached storage device, and / or a network accessible storage device. Although shown separately, the memory 120 and the storage device 140 may be collocated, integrated together, or otherwise combined.
[0030] The system 100 includes an encoder / decoder module 130 configured to process video data and to provide encoded video data or decoded video data. The encoder / decoder module 130 may include one or more processors and / or memory (not shown). Although FIG. 1 depicts the encoder / decoder module 130 as a separate element of system 100, it will be understood that the processor 110 and the encoder / decoder module 130 may be collocated and / or integrated together as a combination of hardware and / or software, e.g., in an electronic package or chip. The encoder / decoder module 130 may be or include one or more modules that may be included in one or more separate devices that perform encoding and / or decoding functions.
[0031] Instructions for execution by the processor 110 and / or the encoder / decoder module 130 may be stored in the storage device 140 and subsequently loaded into memory 120 for execution by the processor 110. In some embodiments, one or more of processor 110, memory 120, storage device 140, and encoder / decoder module 130 may store one or more items when performing the processes disclosed herein. Such items may include input video, decoded video or portions thereof, bitstreams, matrices, variables, operational logic, and intermediate and / or final results from processing of equations, formulas, or operations.
[0032] In some embodiments, the memory of the processor 110 and / or the encoder / decoder module 130 is used to store instructions and / or provide working memory for video encoding and decoding functions. In some embodiments, memory external to the processor 110 and / or the encoder / decoder module 130 (e.g., the memory 120 and / or the storage device 140) is used for one or more of these functions and / or, for example, to store the operating system of a television.
[0033] The system 100 may obtain or receive information via one or more input devices, interfaces, and / or ports as indicated in input block 105. Examples of the input devices include a radio frequency (RF) device for transmitting and / or receiving RF signals over various media, for example, RF signals received over the air from a broadcaster; component video (COMP) inputs; a Universal Serial Bus (USB) input; and / or a High-Definition Multimedia Interface (HDMI) input. Other examples include composite video input (not shown). In some embodiments, the input devices are associated with respective input processing elements, e.g., those generally known in the art. For example, the RF device may be associated with elements suitable for selecting a desired frequency (e.g., selecting or band-limiting a signal) or performing error correction on the signal. The USB and / or HDMI inputs may include respective interface processors and transceivers (or transmitters and receivers) for coupling the system 100 to other devices via USB and / or HDMI ports or connections. Various forms of input processing may be implemented, for example, by and / or within a separate input processing device or the processor 110.
[0035] The system 100 includes a communication interface 150 that enables wired and / or wireless communication with other devices, e.g., via a communication channel 190. The communication interface 150 may include one or more transceivers, modems, network cards and the like. The communication channel 190 may be or include wired and / or wireless mediums.
[0036] In some embodiments, data may be streamed to the system 100 via wired and / or wireless networks. Examples of such wireless networks include cellular, Bluetooth or Wi-Fi (e.g., IEEE 802.11) networks. The wired and / or wireless networks may include one or more base stations (e.g., cellular base stations, access points, etc.), and / or user equipment (e.g., cellular user equipment, stations, etc.), and / or other network elements that communicate with the system 100 via the communication interface 150 and communication channel 190, whereby the system 100 may obtain data streamed from streaming applications (e.g., OTT services) via various networks, including the Internet. In some embodiments, data is streamed to the system 100 via the input block 105 (e.g., using a set-top box that delivers data via the HDMI connection or the RF connection). In some embodiments, data is received by the system 100 in a non-streaming manner.
[0037] The system 100 may provide one or more output signals to one or more output devices. The output devices may include a display device 165 (e.g., touchscreen display, monitor, etc.), an audio device 175 (e.g., speakers), and other peripheral devices 185, including, for example, a stand-alone DVR, a disk player, a stereo system, a lighting system, and other devices that provide a function based on the output of the system 100. The display device 165 can be for a television, tablet, laptop, mobile phone, head-mounted display, or other device. In some embodiments, control signals are communicated between the system 100 and the display device 165, the audio device 175, and / or the peripheral devices 185, enabling device-to-device control with or without user intervention. The output devices may couple to and / or communicate with the system 100 via dedicated connections via respective display, audio, and peripheral interfaces 160, 170, 180. Alternatively, the output devices may couple to and / or communicate with the system 100 via the communication channel 190 and the communication interface 150.
[0038] The display device 165 and the audio device 175 may be collocated, integrated, or otherwise combined with the other components of system 100 in a single unit (e.g., a television). Alternatively, the display device 165 and the audio device 175 may be separate from one or more of the other components of the system 100. In embodiments in which the display device 165 and the audio device 175 are external components, the output signals may be provided via dedicated outputs and / or connections, including, for example, HDMI ports, USB ports, or COMP outputs.
[0039] FIG. 2 is a block diagram illustrating an example video encoder 200 that may be employed by the system 100 (e.g., via the encoder / decoder module 130) described with respect to FIG. 1. The video encoder 200 may be an encoder that employs video compression technologies, standards, specifications, or protocols, including Advanced Video Coding (AVC, H.264 / MPEG-4), High Efficiency Video Coding (HEVC, H.265), Versatile Video Coding (VVC, H.266), Essential Video Coding (EVC, MPEG-5), AOMedia Video 1 (AV1), VP9, or the Enhanced Compression Model (ECM), and variations or improvements thereof. Those skilled in the art will understand that the various embodiments described herein are not limited to a specific standard and can be applied to other standards and recommendations, as well as extensions thereof.
[0040] Some embodiments disclosed herein are described with reference to a coding unit (CU) or block of a video frame (or a video image or picture) to which coding tools may be applied by the video encoder 200 and / or by the video decoder 300 (described below with reference to FIG. 3). Generally, embodiments described herein may be applied to a video region formed by a video partition of any shape or size. The video region may be a video slice, a coding tree unit (CTU), or a CU (to which inter prediction or intra prediction can be applied), or a partition thereof, each of which can include samples of a luma component, Y, and chroma components, U and V (also denoted herein by C).
[0041] Referring generally to FIG. 2 and the video encoder 200, video data (e.g., one or more video frames) is encoded generally as described below. Prior to encoding, video data may be pre-processed by a pre-encoding processor (201). The pre-processing may include, for example, applying a color model transform to the input color components of the input video data (e.g., conversion from RGB 4:4:4 to YUV 4:2:0) or mapping the color components of the input video data to obtain a signal distribution that is more resilient to compression (for instance, applying a histogram equalizer and / or a denoising filter to one or more of the video data’s color components). The pre-processing may include associating metadata (for example, a supplemental enhancement information (SEI) message) with the video data that can be attached to a coded video bitstream. After pre-processing, if any, an image (frame) to be encoded is partitioned into CUs (blocks) by an image partitioner 202.
[0043] In general, a CU includes a luma block and associated chroma blocks. As such, functions of the video encoder 200 described herein as applied to a CU refer generally to the luma block and the respective chroma blocks. The CUs may be encoded using an intra prediction mode performed by an intra predictor 260. In intra prediction mode, the content of a CU in a frame is predicted based on content from one or more other CUs of the same frame (or region), using reconstructed blocks of other CUs output from an adder 255. The CUs may also or alternatively be encoded using an inter prediction mode, in which motion estimation and motion compensation are performed by a motion estimator 275 and a motion compensator 270, respectively. In inter prediction mode, the content of a CU in a frame is predicted based on content from one or more reconstructed areas of reference frames, available from a reference picture buffer 280.
[0044] The video encoder 200 selects or otherwise determines at 205 which prediction mode (intra prediction mode and / or inter prediction mode) to use for encoding a CU. The selected prediction mode may be enhanced (e.g., filtered) by a prediction enhancer 285. Based on the selected mode, a prediction for the CU is generated. A residual block is determined based on the prediction (i.e., prediction block, predicted CU) and the input CU. In some embodiments, such determination is made by a subtractor 210.
[0045] The residual block or a partition thereof (e.g., a transform block) is transformed into transform coefficients by a transformer 220. The transform coefficients are quantized by a quantizer 230. An entropy encoder 245 performs entropy encoding of the quantized transform coefficients and coding parameters (e.g., syntax elements including motion vectors and other control data) to form a bitstream of coded video data.
[0046] In addition to coding the original video blocks as described herein, the video encoder 200 reconstructs the coded blocks to provide references for future predictions. Thus, quantized transform coefficients (from the quantizer 230) are de-quantized by an inverse quantizer 240, and inverse transformed by an inverse transformer 250, to reconstruct (decode) the residual blocks. The reconstructed residual blocks and prediction blocks are combined (e.g., by the adder 255) to form reconstructed blocks. Thus, the video encoder 200 performs decoding operations through which the encoded images (frames) are reconstructed.
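The residual, quantization, and reconstruction steps described above can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional block of samples, a uniform scalar quantizer with step size `step`, and an identity transform in place of the transformer 220 and inverse transformer 250; real codecs use integer transforms and more elaborate quantization. The function name `encode_block` is hypothetical and chosen for illustration only.

```python
def encode_block(block, prediction, step):
    """Toy model of the encoder path for one block.

    Assumptions (not from the source): 1-D sample lists, identity
    transform, uniform scalar quantizer with step size `step`.
    Returns the quantized levels and the encoder-side reconstruction.
    """
    # Subtractor (210): residual = input block - prediction block
    residual = [x - p for x, p in zip(block, prediction)]
    # Quantizer (230): map each residual to an integer level
    levels = [round(r / step) for r in residual]
    # Inverse quantizer (240): recover an approximate residual
    dequantized = [level * step for level in levels]
    # Adder (255): prediction + reconstructed residual
    reconstructed = [p + d for p, d in zip(prediction, dequantized)]
    return levels, reconstructed


levels, recon = encode_block([100, 105, 99, 97], [96, 96, 96, 96], step=4)
```

Because the encoder forms its references from `recon` rather than from the original samples, it predicts from the same data the decoder will have, which is the reason the reconstruction path exists inside the encoder.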
[0047] In-loop filters 265 may be applied to the reconstructed image (formed by the reconstructed blocks). The filtered reconstructed image(s) are stored in the reference picture buffer 280 and used by the motion estimator 275 and motion compensator 270, as explained above. The in-loop filters 265 can be applied to the reconstructed samples of an image to reduce distortions introduced by the encoding process. For example, a deblocking filter (DBF), bilateral filter (BIF), sample adaptive offset (SAO), and / or adaptive loop filter (ALF) can be applied to reduce encoding artifacts.
[0048] FIG. 3 is a block diagram illustrating an example video decoder 300 that may be employed by the system 100 (e.g., via the encoder / decoder module 130) described with respect to FIG. 1. Generally, operational features of the video decoder 300 are reciprocal to operational features of the video encoder 200. In the video decoder 300, a coded video bitstream (e.g., generated by the video encoder 200 or another video encoding device or process) is entropy-decoded by an entropy decoder 330 to obtain transform coefficients, motion vectors, and other coding parameters. Based on the coding parameters, an image partitioner 335 divides the picture accordingly. The quantized transform coefficients are de-quantized by an inverse quantizer 340 and inverse transformed by an inverse transformer 350 to decode (reconstruct) respective residual blocks. Depending on the selected prediction mode, a predicted block can be obtained at 370 from an intra predictor 360 (i.e., intra prediction) or from a motion compensator 375 (i.e., inter prediction) and may be enhanced (e.g., filtered) by a prediction enhancer 390, generating a prediction block. The reconstructed residual blocks are combined with prediction blocks (e.g., by an adder 355), resulting in reconstructed blocks.
[0049] In-loop filters 365 (e.g., DBF, BIF, SAO, and / or ALF) can be applied to the reconstructed image (formed by the reconstructed blocks), to output reconstructed (decoded) video. The filtered reconstructed image is also stored in a reference picture buffer 380 for reference by the motion compensator 375.
[0050] A post-decoding processor (395) can process the reconstructed video data. For example, post-decoding processing can include an inverse color model transform (e.g., conversion from YUV 4:2:0 to RGB 4:4:4) or an inverse mapping to reverse the mapping process performed by the pre-encoding processor described with respect to FIG. 2. The post-decoding processor can use metadata derived by the pre-encoding processor and / or signaled in the video bitstream. The post-decoding processor (395) can also perform post-processing filtering, for example, using a neural-network based post-filter.
[0051] Neural network post filters
[0052] The neural-network post-filter characteristics (NNPFC) SEI (Supplemental Enhancement Information) message specifies a neural network that may be used as a post-processing filter. In particular, the NNPFC SEI message allows an encoder to define a neural network to be used in a post-processing operation after the decoding operation. This message also specifies the input and output of the neural network and describes its complexity.
[0053] The NNPFC SEI message indicates the intended purpose of the neural network. Up to now, 8 modes have been defined:
[0054] • Enhancing visual quality
[0055] • Chroma upsampling
[0056] • Changing spatial resolution, e.g., from high-definition decoded video to ultra-high definition
[0057] • Upsampling picture rate, e.g., from 30 Hz to 60 Hz
[0058] • Upsampling bit depth to increase the dynamic range of pixel values
[0059] • Colorization to convert monochrome video to full colors.
[0060] • Temporal extrapolation
[0061] • Spatial extrapolation
[0062] These modes are also described in the following table (cf. Table 1) extracted from an article entitled “ISO / IEC 23002-7:2024 / CD Amd 1 Information technology - MPEG video technologies - Part 7: Versatile supplemental enhancement information messages for coded video bitstreams - Amendment 1: Additional SEI messages.”
[0063] Table 1
[0067] By implementing neural-network filters as a post-processing step, their utilization can be introduced to some services without negatively impacting older systems that lack the ability to execute neural networks.
[0068] In the present disclosure, we propose to add a new purpose metadata in the NNPFC SEI message that indicates whether an adaptation of energy consumption has been performed at the encoder side or not. The energy consumption adaptation concerns film grain removal, spatial resolution reduction and brightness reduction. At the decoder side, two use cases can be defined:
[0069] (i) use the incoming decoded video for lowering the energy consumption (the NNPFC SEI message indicates which processes have been used).
[0070] (ii) recover all or some visual dimensions of the decoded content, such as the grain, the brightness and the initial spatial resolution, for improving the Quality of Experience (QoE).
[0071] This new NNPFC SEI message provides the end-user an opportunity to recover a premium but energy-hungry Quality of Experience. In the context of end-to-end energy consumption reduction, this new purpose enables the end-user to recover all or some visual dimensions of the decoded energy-reduced content in order to improve the final Quality of Experience. Generally, a visual dimension is related to a specific visual property of the image, such as the color, the brightness, the resolution, the noise, and the grain, and an energy-reduced content refers to a content that requires less energy to be encoded, decoded, transmitted, rendered and / or stored.
[0072] The recovery could be the restoration of the initial resolution, the restoration of the film grain and the recovery of the original brightness. These three factors significantly influence the energy consumption of encoding, decoding and rendering, respectively. The content provider indicates in the SEI message which visual dimensions have been modified at the encoder side.
[0073] With this new NNPFC SEI message, a single multi-task post-processing neural network-based filter (NNPF) can then be used to recover at once all or some of the aforementioned visual dimensions.
[0074] Brightness
[0075] Some approaches have been implemented to reduce the energy consumption of content while displayed. As displays consume the largest proportion of the energy, providing energy-aware content, i.e., that will consume less when displayed, is of significant importance, at least for OLED displays.
[0077] A commonly owned EP application No. 24305320.4 (our attorney docket number 2024PF00055) presents a type of such approach. It only concerns one visual dimension, namely the brightness.
[0078] One method to adapt the brightness of the content to reduce the energy consumption while displaying the content is described, for example, in commonly-owned EP application Nos. 22306908.9 (our attorney docket number 2022PF00508) and 23305565.6 (our attorney docket number 2023PF00367), which propose to compute a pixel-wise attenuation map that contains for each pixel the amount of reduction applied on one of the pixel components (specifically the luminance). The attenuation map is then transmitted to the decoder and applied to the decoded image, at the display side (i.e., before rendering the image on the display). These techniques match the goal of producing some energy-reduced content. However, the encoding / transmission / decoding of the attenuation maps makes this system complex to deploy. Furthermore, in these approaches it is proposed to reduce the energy consumption by modifying one visual dimension, i.e., the brightness.
[0079] Encoding and decoding
[0080] Reducing the energy consumption of the encoder and decoder is also of great importance. One straightforward method to reach this objective is to manipulate the complexity and the resolution of the content. Downscaling the content before encoding and upscaling it after decoding while preserving the same quality of experience is one straightforward solution to reduce the computational burden. Additionally, removing and modeling artistic noise, such as film grain, before encoding and synthesizing it after decoding, not only reduces encoding and decoding times, but also significantly reduces the bitrate, while still preserving the artistic intent at the user side.
[0081] Most existing methods consider only one visual dimension, and require a dedicated method for removing the grain, adapting the brightness or reducing the spatial resolution. However, 3R-INN (Removing grain, Reducing the spatial resolution, Reducing the brightness) (see a commonly owned EP application No. 23306984.8, our attorney docket number 2023PF01009, hereinafter “3R-INN”) allows several visual dimensions of the input content to be modified at once, thus providing some energy reduction at different steps of the video chain.
[0082] As described above, in the context of NNPFC SEI message, several individual purposes are proposed, each of them corresponding to one specific task, e.g., spatial extrapolation, colourization, etc. In addition, one NNPF is usually trained for one specific purpose. We aim to overcome these limitations by proposing a new NNPF purpose and its corresponding NNPF filter able to tackle several purposes at once. In addition, none of the existing purposes of NNPFC is related to energy adaptation operations.
[0083] In the present disclosure, we propose to add a new purpose metadata in the NNPFC SEI message that indicates the end-to-end adaptation of energy consumption that is performed. The following use cases can be considered.
[0084] At the encoder side, the input content is modified prior to encoding, for example, by the streaming providers. The modifications aim to reduce the energy consumption of the encoding, the transmission, the decoding and the rendering. Modifications of several visual dimensions, for example, film grain removal, spatial resolution downsampling and brightness reduction, can be performed at the same time. At the decoder side, an NNPF filter can be applied to recover those visual dimensions in order to recover the original visual quality or a quality close to the original one. The purpose of this NNPF is then to recover the Quality of Experience, when the content has been preprocessed for reducing the energy consumption.
Syntax description
[0085] We propose to extend the purpose of the NNPF with a new mode dedicated to end-to-end energy adaptation. The proposed modification and the associated semantics are detailed below.
[0086] Table 2
[0090] Note that the bitMask value is given as an example and another bitmask value could be used.
[0091] Then, the value of nnpfc_purpose shall be in the range of 0 to 511, inclusive, in bitstreams conforming to this edition of this document. Values of 512 to 65 535, inclusive, for nnpfc_purpose are reserved for future use by ITU-T | ISO / IEC and shall not be present in bitstreams conforming to this edition of this document.
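The range constraint above can be checked mechanically. The following is a minimal sketch, assuming nnpfc_purpose has already been parsed into an unsigned integer of at most 16 bits; the function name `classify_nnpfc_purpose` and the returned labels are illustrative and are not defined by the SEI message syntax.

```python
NNPFC_PURPOSE_MAX = 511             # highest conforming value (0x1FF)
NNPFC_PURPOSE_RESERVED_MAX = 65535  # highest reserved value (0xFFFF)


def classify_nnpfc_purpose(value):
    """Classify a parsed nnpfc_purpose value (hypothetical helper).

    Returns "conforming" for 0..511 and "reserved" for 512..65535.
    Any other value cannot be an nnpfc_purpose and raises ValueError.
    """
    if not 0 <= value <= NNPFC_PURPOSE_RESERVED_MAX:
        raise ValueError(f"nnpfc_purpose out of range: {value}")
    if value <= NNPFC_PURPOSE_MAX:
        return "conforming"
    return "reserved"
```

With the example bitmask 0x100 for the energy adaptation purpose, the largest conforming value, 511, corresponds to all nine purpose bits being set.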
[0092] The SEI message syntax is modified as follows (underlined). These modifications allow the NNPF filter to be set up correctly.
[0093] Table 3
[0111] The variables ChromaUpsamplingFlag, ResolutionResamplingFlag, PictureRateUpsamplingFlag, BitDepthUpsamplingFlag, ColourizationFlag, TemporalExtrapolationFlag, SpatialExtrapolationFlag, and QoEFlag, specifying whether nnpfc_purpose indicates the purpose of the NNPF to include chroma upsampling, resolution resampling, picture rate upsampling, bit depth upsampling, colourization, temporal extrapolation, spatial extrapolation, and energy adaptation, respectively, are derived as follows:
[0112] ChromaUpsamplingFlag = ( ( nnpfc_purpose & 0x02 ) > 0 ) ? 1 : 0
ResolutionResamplingFlag = ( ( nnpfc_purpose & 0x04 ) > 0 ) ? 1 : 0
PictureRateUpsamplingFlag = ( ( nnpfc_purpose & 0x08 ) > 0 ) ? 1 : 0
BitDepthUpsamplingFlag = ( ( nnpfc_purpose & 0x10 ) > 0 ) ? 1 : 0
ColourizationFlag = ( ( nnpfc_purpose & 0x20 ) > 0 ) ? 1 : 0
TemporalExtrapolationFlag = ( ( nnpfc_purpose & 0x40 ) > 0 ) ? 1 : 0
SpatialExtrapolationFlag = ( ( nnpfc_purpose & 0x80 ) > 0 ) ? 1 : 0
QoEFlag = ( ( nnpfc_purpose & 0x100 ) > 0 ) ? 1 : 0 (1)
[0113] nnpfc_energy_reduced_content_flags, when QoEFlag is equal to 1, is a bitfield indicating which modifications were applied to the content at the encoder to reduce its energy consumption in the video chain. The bitfield is defined over 3 bits.
[0114] • If ( ( nnpfc_energy_reduced_content_flags & 0x01 ) > 0 ), the brightness of the content has been modified to reduce the energy consumption during rendering.
[0115] • If ( ( nnpfc_energy_reduced_content_flags & 0x02 ) > 0 ), the film grain of the content has been removed to reduce the energy consumption during the encoding process and the transmission.
[0116] • If ( ( nnpfc_energy_reduced_content_flags & 0x04 ) > 0 ), the spatial resolution has been modified to reduce the energy consumption at the encoder and decoder sides and during the transmission.
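As an illustration, the flag derivations above can be sketched in Python. This is a minimal sketch and not part of the syntax: the function names `derive_purpose_flags` and `derive_energy_flags`, and the dictionary keys used for the three energy bits, are assumptions chosen for readability. Only the bit masks are taken from the description.

```python
# Bit masks taken from equation (1); keys mirror the derived variable names.
PURPOSE_MASKS = {
    "ChromaUpsamplingFlag": 0x02,
    "ResolutionResamplingFlag": 0x04,
    "PictureRateUpsamplingFlag": 0x08,
    "BitDepthUpsamplingFlag": 0x10,
    "ColourizationFlag": 0x20,
    "TemporalExtrapolationFlag": 0x40,
    "SpatialExtrapolationFlag": 0x80,
    "QoEFlag": 0x100,
}

# Masks for the 3-bit nnpfc_energy_reduced_content_flags field.
# The key names are hypothetical labels, not syntax elements.
ENERGY_MASKS = {
    "brightness_modified": 0x01,
    "film_grain_removed": 0x02,
    "spatial_resolution_modified": 0x04,
}


def derive_purpose_flags(nnpfc_purpose: int) -> dict[str, int]:
    """Return each derived flag as 0 or 1, following equation (1)."""
    return {
        name: 1 if (nnpfc_purpose & mask) > 0 else 0
        for name, mask in PURPOSE_MASKS.items()
    }


def derive_energy_flags(nnpfc_energy_reduced_content_flags: int) -> dict[str, bool]:
    """Test each of the three bits; the mask is applied before the comparison."""
    return {
        name: (nnpfc_energy_reduced_content_flags & mask) > 0
        for name, mask in ENERGY_MASKS.items()
    }
```

For instance, `derive_energy_flags(0x05)` reports that brightness and spatial resolution were modified while the film grain was left untouched.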
[0117] nnpfc_width and nnpfc_height, when ( nnpfc_energy_reduced_content_flags & 0x04 ) is not equal to 0, specify the width and height of the original content, respectively.
[0118] nnpfc_brightness_modification_val, when ( nnpfc_energy_reduced_content_flags & 0x01 ) is not equal to 0, specifies the amount of brightness decrease that has been performed at the encoder side. This value is coded over 5 bits (0 to 31) and should be expressed as a percentage as follows:
[0119] percentage_of_modification = ( 100 / 2^5 ) * nnpfc_brightness_modification_val
Examples:
[0120] • nnpfc_brightness_modification_val = 1: the percentage_of_modification value is 3.125%
• nnpfc_brightness_modification_val = 10: the percentage_of_modification value is 31.25%
• nnpfc_brightness_modification_val = 20: the percentage_of_modification value is 62.5%
In another embodiment, the nnpfc_brightness_modification_val semantic could be given by the following one (coded over 2 bits):
[0121]
[0122] In another embodiment, nnpfc_brightness_modification_val is coded directly over 7 bits, with a value between 0 and 99. The brightness reduction is therefore nnpfc_brightness_modification_val + 1 (that is, 1 to 100).
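The 5-bit and 7-bit semantics of the brightness modification value can be illustrated with the short Python sketch below. The function names are hypothetical, and reading the 7-bit result as a percentage is an assumption based on the surrounding description. The 2-bit embodiment is not covered.

```python
def brightness_percentage_5bit(nnpfc_brightness_modification_val: int) -> float:
    """5-bit embodiment: percentage_of_modification = 100 / 2**5 * value."""
    if not 0 <= nnpfc_brightness_modification_val <= 31:
        raise ValueError("value must fit in 5 bits (0 to 31)")
    return 100 / 2**5 * nnpfc_brightness_modification_val


def brightness_reduction_7bit(nnpfc_brightness_modification_val: int) -> int:
    """7-bit embodiment: a coded value of 0..99 maps to a reduction of value + 1.

    Interpreting the result as a percentage is an assumption.
    """
    if not 0 <= nnpfc_brightness_modification_val <= 99:
        raise ValueError("value must be between 0 and 99")
    return nnpfc_brightness_modification_val + 1
```

With the 5-bit coding, the step between two consecutive code values is 3.125%, so a coded value of 10 corresponds to 31.25%.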
[0123] In some cases, the NNPF requires more information, such as on which components it should perform the process, the type of targeted display, and the expected quality.
[0124] Display model
[0125] nnpfc_display_model is a bit field mask which indicates the display models for which the brightness has been adapted.
[0126] Table 4
[0127]
[0128] For example, nnpfc_display_type_field = 11₂ (binary) means that the NNPF can be used for both “Transmissive pixel” and “Emissive pixel” display models.
[0129] Expected quality
[0130] If the client has a quality policy, for instance that the quality cannot be lower than a given threshold (a target quality level), the expected quality can help the client decide whether or not to apply the post filter. Several syntax elements can be defined as follows.
[0131] nnpfc_expected_overall_quality specifies the quality of the displayed video after applying the neural network post filter. Examples of metrics are PSNR (Peak Signal-to-Noise Ratio), V-MAF (Video Multi-Method Assessment Fusion) or SSIM (Structural Similarity Index Measure) values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0132] nnpfc_expected_grain_quality specifies the quality of the displayed video after applying the neural network post filter to recover the film grain only. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0133] nnpfc_expected_brightness_quality specifies the quality of the displayed video after applying the neural network post filter to recover the brightness only. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0134] nnpfc_expected_spatialresolution_quality specifies the quality of the displayed video after applying the neural network post filter to recover the spatial resolution only. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0135] nnpfc_expected_grain_brightness_quality specifies the quality of the displayed video after applying the neural network post filter to recover the grain and the brightness. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0137] nnpfc_expected_grain_spatialresolution_quality specifies the quality of the displayed video after applying the neural network post filter to recover the grain and the spatial resolution. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0138] nnpfc_expected_brightness_spatialresolution_quality specifies the quality of the displayed video after applying the neural network post filter to recover the brightness and the spatial resolution. Examples of metrics are PSNR, V-MAF or SSIM values for the modified picture after applying the neural network post filter as specified in Table 5. This parameter can be expressed as an absolute value or as a percentage value of reduction or increase of quality according to a selected quality metric in comparison with its nominal value.
[0139] nnpfc_video_quality_metric indicates the quality metric to be considered when the NNPF is applied.
[0140] Table 5
[0141]
[0142] In one embodiment, the syntax is given by:
[0153] FIG. 4A illustrates a method of using the quality policy when the postprocessing filter is applied to all visual dimensions (nnpfc_energy_reduced_content_flags == 0x07), according to an embodiment. As shown in FIG. 4A, the client decodes (410) the bitstream and receives (420) the NNPF SEI message, if any. If the QoEFlag is true (430), it indicates that the decoded video can be postprocessed to enhance its visual quality. The client can check (440) the expected visual quality after the use of the postfilter when recovering all visual dimensions. If the expected visual quality is above a quality threshold defined by the client (450), the postfilter process is triggered (460) to recover all visual dimensions before rendering (470). If not, the client does not allow the postprocessing operation.
[0155] FIG. 4B illustrates a method of using the quality policy when the postfilter is limited to recovering the spatial resolution (nnpfc_energy_reduced_content_flags == 0x04), according to an embodiment. The client can check (441) the expected visual quality after the use of the postfilter when recovering only the spatial resolution. If the expected visual quality is above a quality threshold defined by the client (451), the postfilter process is triggered (461) to recover the spatial resolution. If not, the client does not allow the postprocessing operation.
[0156] FIG. 4C illustrates a method of using the quality policy when the postfilter is limited to recovering the brightness (nnpfc_energy_reduced_content_flags == 0x01), according to an embodiment. The client can check (442) the expected visual quality after the use of the postfilter when recovering only the brightness. If the expected visual quality is above a quality threshold defined by the client (452), the postfilter process is triggered (462) to recover the brightness. If not, the client does not allow the postprocessing operation.
[0157] In the above, three examples are provided. However, other variants and combinations of the three visual dimensions used in these examples (brightness, spatial resolution, film grain) are possible.
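The quality policy of FIGs. 4A-4C, extended to the other combinations of visual dimensions, could be sketched as follows. This is an illustrative sketch under several assumptions: the SEI payload is represented as a plain mapping from syntax element names to values, the expected quality is an absolute value where higher is better, nnpfc_expected_overall_quality is taken as the value for recovering all three dimensions, and the function name `postfilter_allowed` is hypothetical.

```python
from typing import Mapping

BRIGHTNESS, FILM_GRAIN, SPATIAL_RESOLUTION = 0x01, 0x02, 0x04

# Which expected-quality syntax element applies to which set of recovered
# dimensions. The keys use the bit layout of nnpfc_energy_reduced_content_flags.
EXPECTED_QUALITY_ELEMENT = {
    BRIGHTNESS: "nnpfc_expected_brightness_quality",
    FILM_GRAIN: "nnpfc_expected_grain_quality",
    SPATIAL_RESOLUTION: "nnpfc_expected_spatialresolution_quality",
    FILM_GRAIN | BRIGHTNESS: "nnpfc_expected_grain_brightness_quality",
    FILM_GRAIN | SPATIAL_RESOLUTION: "nnpfc_expected_grain_spatialresolution_quality",
    BRIGHTNESS | SPATIAL_RESOLUTION: "nnpfc_expected_brightness_spatialresolution_quality",
    BRIGHTNESS | FILM_GRAIN | SPATIAL_RESOLUTION: "nnpfc_expected_overall_quality",
}


def postfilter_allowed(
    qoe_flag: int,
    dimensions_mask: int,
    sei: Mapping[str, float],
    quality_threshold: float,
) -> bool:
    """Return True when the client policy allows the post filter to run."""
    if not qoe_flag:
        return False  # No energy adaptation signalled (step 430).
    element = EXPECTED_QUALITY_ELEMENT.get(dimensions_mask)
    if element is None or element not in sei:
        return False  # No expected quality available for this combination.
    # Steps 450/451/452: trigger only when the expected quality is above the threshold.
    return sei[element] > quality_threshold
```

A client that also accepts a percentage-based expected quality would need an additional conversion step before the comparison, which is left out here.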
[0158] At the encoder side, prior to the encoding, the energy-reduced picture can be processed to retrieve the original visual dimensions. For all cases, i.e., restoring the brightness, restoring the spatial resolution, restoring the film grain, restoring the brightness and the spatial resolution, restoring the brightness and the film grain, and restoring the spatial resolution and the film grain, the visual quality is evaluated according to the quality metric specified in nnpfc_video_quality_metric. The resulting quality values are used to populate syntax elements such as nnpfc_expected_spatialresolution_quality and nnpfc_expected_brightness_quality.
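As an illustration of this encoder-side evaluation, the sketch below uses PSNR as the quality metric and fills one expected-quality value per combination of restored dimensions. It is a simplified sketch: pictures are modelled as flat sequences of samples, `restore` is a hypothetical callable standing in for the post filter, and the function name `populate_expected_quality` is an assumption.

```python
import math
from typing import Callable, Mapping, Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("pictures must be non-empty and have the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak**2 / mse)


def populate_expected_quality(
    original: Sequence[float],
    reduced: Sequence[float],
    restore: Callable[[Sequence[float], int], Sequence[float]],
    element_by_mask: Mapping[int, str],
) -> dict[str, float]:
    """Evaluate each restoration case against the original picture.

    `restore(picture, mask)` is a hypothetical stand-in for the post filter:
    it returns the picture with the dimensions selected by `mask` restored,
    at the size of the original picture.
    """
    return {
        element: psnr(original, restore(reduced, mask))
        for mask, element in element_by_mask.items()
    }
```

Another metric listed in Table 5 could be substituted for `psnr` without changing the structure of the loop.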
[0159] Different inputs and outputs of the NNPF
[0160] Depending on the type of the specified filter, the neural network post filter can process the YUV components or the RGB components. In this case, the flag nnpfc_separate_colour_description_present_flag makes it possible to specify the color transformation. In a particular embodiment, the input of the post filter is YUV and the output is also YUV.
[0161] Several strategies can be defined:
[0162] • The post filter only modifies the Y component, and U and V components are not changed. Note that if the Y component has been rescaled, U and V components have to be rescaled accordingly.
[0163] • The post filter computes the new Y component from the incoming Y component. However, the RGB values of the input image are modified by multiplying the RGB components by the ratio of the luminance. In this context, two color transformations are made: YUV to RGB, and the modified RGB to YUV.
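The second strategy can be illustrated for a single pixel with the sketch below. It only shows the scaling step: the YUV to RGB and RGB to YUV conversions are omitted because their matrices depend on the signalled colour description. Taking the ratio as new luminance divided by incoming luminance, clipping to a peak value, and leaving pixels with zero incoming luminance unchanged are assumptions, as is the function name.

```python
def scale_rgb_by_luma_ratio(
    rgb: tuple[float, float, float],
    y_in: float,
    y_out: float,
    peak: float = 255.0,
) -> tuple[float, float, float]:
    """Scale the R, G and B components by the luminance ratio y_out / y_in.

    y_in is the incoming Y value and y_out the Y value computed by the post
    filter. Results are clipped to `peak` (an assumed sample range).
    """
    if y_in <= 0:
        return rgb  # Assumed guard: avoid dividing by zero on black pixels.
    ratio = y_out / y_in
    r, g, b = (min(component * ratio, peak) for component in rgb)
    return (r, g, b)
```

Scaling the three components by the same factor changes the luminance of the pixel while keeping the proportions between R, G and B.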
[0164] FIG. 5 illustrates a method of using the proposed NNPF filter, according to an embodiment. The proposed SEI NNPFC message is defined at the encoder side, according to the processes that have been applied to the content. Such prior information, called metadata, is inserted in the proposed SEI message. The encoded energy-reduced picture and the SEI message are then inserted in the bitstream.
[0165] At step 510, the SEI NNPFC message is decoded. At step 515, the decoder checks whether the QoEFlag is activated. If it is not activated, the content is sent to the display (590) or to some other processing without modification. If the QoEFlag is activated, the brightness ratio and the new resolution values are recovered when the corresponding bits in nnpfc_energy_reduced_content_flags are set, and the NNPF corresponding to the QoEFlag is applied to the content with the appropriate parameters, depending on the client decision, before the content is sent to the display (590) or to some other processing.
[0166] In particular, the decoder checks (520) whether ( nnpfc_energy_reduced_content_flags & 0x04 ) != 0. If yes, then the spatial resolution is obtained and the recover spatial resolution flag is set to true (525); otherwise, the recover spatial resolution flag is set to false (530). The decoder also checks (535) whether ( nnpfc_energy_reduced_content_flags & 0x01 ) != 0. If yes, then the brightness modification parameter is obtained and the recover brightness flag is set to true (540); otherwise, the recover brightness flag is set to false (545). The decoder further checks (550) whether ( nnpfc_energy_reduced_content_flags & 0x02 ) != 0. If yes, then the recover film grain flag is set to true (555); otherwise, the recover film grain flag is set to false (560). Based on the above parameters, the NNPF is configured (570). Then the NNPF is applied (580) based on the client decision (575), for example, as illustrated in FIGs. 4A-4C. The filtered content is then sent to the display (590).
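Steps 515 to 570 of FIG. 5 could be sketched as follows. This is a minimal sketch, not a decoder implementation: the parsed SEI message is assumed to be available as a dictionary keyed by syntax element name, a bit is treated as enabled when the masked value is non-zero, and the `NnpfConfig` class and `configure_nnpf` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NnpfConfig:
    """Hypothetical container for the parameters gathered before step 570."""
    recover_spatial_resolution: bool = False
    recover_brightness: bool = False
    recover_film_grain: bool = False
    width: Optional[int] = None
    height: Optional[int] = None
    brightness_modification_val: Optional[int] = None


def configure_nnpf(sei: dict) -> Optional[NnpfConfig]:
    """Return the NNPF configuration, or None when QoEFlag is not set."""
    # Step 515: QoEFlag is bit 0x100 of nnpfc_purpose.
    if (sei.get("nnpfc_purpose", 0) & 0x100) == 0:
        return None  # Content goes to the display without modification.

    flags = sei.get("nnpfc_energy_reduced_content_flags", 0)
    config = NnpfConfig()

    # Steps 520-530: spatial resolution.
    if flags & 0x04:
        config.recover_spatial_resolution = True
        config.width = sei["nnpfc_width"]
        config.height = sei["nnpfc_height"]

    # Steps 535-545: brightness.
    if flags & 0x01:
        config.recover_brightness = True
        config.brightness_modification_val = sei["nnpfc_brightness_modification_val"]

    # Steps 550-560: film grain (no extra parameter is read here).
    if flags & 0x02:
        config.recover_film_grain = True

    return config
```

The client decision (575) is kept outside this function, so the same configuration can be built once and then either applied or discarded according to the quality policy.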
[0167] In a particular implementation of the NNPF, the neural post filter can be the network 3R-INN. 3R-INN is an invertible neural network that has been trained for three different tasks: reducing the spatial resolution, removing the film grain and reducing the brightness. In a forward pass, the 3R-INN can process all the aforementioned visual dimensions and output an energy-reduced content. In the backward pass, the 3R-INN can recover one, two or three of the aforementioned visual dimensions. For that purpose, the 3R-INN requires two inputs: the energy-reduced content that will be re-processed and a latent vector indicating which visual dimensions the network should recover. In other implementations, filters other than 3R-INN can be used.
[0168] In the above, three visual dimensions (brightness, film grain, spatial resolution) are considered in the context of energy adaptation. However, the present methods can be applied when different visual dimensions or more visual dimensions are considered.
[0169] One or more embodiments provide a computer program comprising instructions which when executed by one or more processors cause such processors to perform the encoding and / or decoding methods according to any of the embodiments described above. One or more embodiments also provide a computer readable storage medium having stored thereon instructions for encoding or decoding video data according to the methods described above.
[0170] One or more embodiments provide a computer readable storage medium having stored thereon video data generated according to the methods described above. One or more embodiments also provide a method and apparatus for transmitting or receiving video data generated according to the methods described above.
[0171] The embodiments described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., as a method), the implementation of such features may also be implemented in other forms. An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. Corresponding methods may be implemented in, for example, a processor.
[0172] Various methods and aspects described herein can be used to modify one or more modules. For example, the pre-processor and post-processor described with respect to FIGs. 2 and 3 may be implemented as one or more modules and modified according to the various embodiments of the present disclosure.
[0174] The various embodiments described herein provide at least the following features, devices or aspects, alone or in any combination, across various claim categories and types:
i. Encoding, into coded video data, syntax elements that can enable the decoder to decode the coded video data, according to any of the embodiments described herein.
[0175] ii. A bitstream that includes one or more of the described syntax elements, or variations thereof, whether transmitted, stored, or otherwise made available.
[0176] iii. Creating, transmitting, receiving, and / or decoding of the bitstream.
[0177] iv. An electronic device (e.g., TV, set-top box, mobile phone, tablet, etc.) that tunes a channel to receive a bitstream or that receives such bitstream over the air. The electronic device decodes the syntax elements from the bitstream, and, optionally, displays (e.g., via a monitor or other type of display) a resulting image.
[0178] Various numeric values are used in the present application. Such specific values are for example purposes and the embodiments described are not limited to these specific values.
[0179] Various methods are described herein, and such methods comprise one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for the proper operation of the method, the order and / or use of specific steps and / or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an order to the operations unless specifically required.
[0180] The present disclosure may refer to “determining” various pieces of information. Determining information may include one or more of, for example, estimating, calculating, predicting, or retrieving (e.g., from memory) the information.
[0181] The present disclosure may refer to “accessing” various pieces of information. Accessing information may include one or more of, for example, receiving, retrieving (e.g., from memory), storing, moving, copying, calculating, determining, predicting, or estimating the information. Similarly, the present disclosure may refer to “receiving” various pieces of information. Receiving information may include one or more of, for example, accessing or retrieving (e.g., from memory) the information.
[0182] “Decoding,” as used herein, encompasses all or part of the processes performed, for example, on an encoded sequence to produce an output suitable for display. In some embodiments, such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, etc. Whether the phrase “decoding process” is intended to refer to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific description and will be well understood by those skilled in the art.
[0184] “Encoding,” as used herein, encompasses all or part of the processes performed, for example, on input video data in order to produce an encoded bitstream. Additionally, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” and “coded” may be used interchangeably, the terms “image,” “picture,” “sub-picture,” “slice,” and “frame” may be used interchangeably, and the terms “pixel” and “sample” may be used interchangeably.
[0185] The present disclosure refers to information, for example, syntax elements, that can be transmitted or stored. Such information can be packaged or arranged in a variety of manners, including for example manners common in video standards such as putting the information into a sequence parameter set (SPS), a picture parameter set (PPS), a network abstraction layer (NAL) unit, a header (for example, a NAL unit header, or a slice header), or an SEI message. Other manners are also available, including, for example, manners that are common for system level or application-level standards such as signaling the information into one or more of the following:
[0186] i. session description protocol (SDP), for example as described in RFCs and / or used in conjunction with real-time transport protocol (RTP) transmission.
[0187] ii. hypertext transfer protocol (HTTP) live streaming (HLS) manifest transmitted over HTTP.
[0188] iii. dynamic adaptive streaming over HTTP (DASH) media presentation description (MPD) descriptors, for example as used in DASH and transmitted over HTTP.
iv. RTP header extensions, for example as used during RTP streaming.
[0189] v. International Organization for Standardization (ISO) base media file format, for example, as used in Omnidirectional MediA Format (OMAF).
[0190] As used herein, “signal” and “signaling” refer to, among other things, indicating information to a decoder. For example, in some embodiments the encoder signals a quantization matrix for de-quantization, whereby the same parameter is used for both encoding and decoding. In some embodiments, the signaling may be explicit, such that information (e.g., a particular parameter) is transmitted to the decoder enabling the decoder to use the same particular parameter. In some embodiments, the signaling may be implicit, in that the information (e.g., a particular parameter) is indicated based on other information at or transmitted to the decoder or derived or selected by the decoder based on information available at the decoder. By not transmitting the information (e.g., the particular parameter), a bit savings is thus realized in some embodiments. In some embodiments, one or more syntax elements or flags are used to signal information to a decoder. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.
[0192] In some embodiments, signals may be produced that are formatted to carry information that may be stored or transmitted. Such information may include, for example, instructions for performing a method, or data produced by one of the described implementations (e.g., a bitstream of a described embodiment). Such a signal may be formatted, for example, as an electromagnetic wave or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links and may be stored on a processor-readable medium.
[0193] It is to be understood that use of any of the following “/”, “and / or”, and “at least one of” is intended to encompass all possible selections of listed items, taken either individually or in any combination thereof.
[0196] While specific embodiments have been described in the foregoing description in connection with the accompanying drawings, it should be understood that embodiments described herein are examples only and should not be taken as limiting the scope of the present disclosure or the following claims. Although features and elements are described herein in particular combinations, those of ordinary skill in the art will appreciate that such features or elements may be used alone or in any combination with the other features and elements. It is understood, therefore, that the overall teachings of the present disclosure are not limited to the particular embodiments, implementations, and examples disclosed herein, but are intended to cover variations, modifications, and alternatives as defined by the appended claims and any and all equivalents thereof.
Claims
What is claimed is:
1. A method of video decoding, comprising: receiving at least a syntax element associated with energy adaptation; determining that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted based on the at least a syntax element; and adapting a neural network based filter to adjust the one or more visual dimensions.
2. An apparatus for video decoding, comprising one or more processors and at least one memory coupled to the one or more processors, wherein the one or more processors are configured to: receive at least a syntax element associated with energy adaptation; determine that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted based on the at least a syntax element; and adapt a neural network based filter to adjust the one or more visual dimensions.
3. The method of claim 1, further comprising, or the apparatus of claim 2, wherein the one or more processors are further configured to perform: decoding parameters for adjusting the one or more visual dimensions.
4. The method of claim 3, or the apparatus of claim 3, wherein the parameters include at least one of a spatial resolution of an original content corresponding to the video, a brightness modification parameter, and a display model.
5. The method of claim 1, 3 or 4, further comprising, or the apparatus of any one of claims 2-4, wherein the one or more processors are further configured to perform: determining that visual dimensions of the plurality of visual dimensions other than the one or more visual dimensions are not to be adjusted based on the at least a syntax element.
6. The method of any one of claims 1 and 3-5, or the apparatus of any one of claims 2-5, wherein the one or more visual dimensions correspond to one or more of spatial resolution, brightness and film grain.
7. The method of any one of claims 1 and 3-6, or the apparatus of any one of claims 2-6, wherein the neural network based filter corresponds to a 3R-INN.
8. The method of any one of claims 1 and 3-7, or the apparatus of any one of claims 2-7, wherein a single syntax element indicates whether each of the plurality of visual dimensions is to be adjusted or not.
9. The method of any one of claims 1 and 3-8, further comprising, or the apparatus of any one of claims 2-8, wherein the one or more processors are further configured to perform: obtaining a second syntax element indicative of an expected quality level when the one or more visual dimensions are to be adjusted; obtaining a target quality level associated with the one or more visual dimensions; and determining that the neural network based filter is to adjust the one or more visual dimensions responsive to that the expected quality level exceeds the target quality level.
10. The method of claim 9, or the apparatus of claim 9, wherein the target quality level is set by a client.
11. A method of video encoding, comprising: determining that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted for energy adaptation; and encoding at least a syntax element to indicate that the one or more visual dimensions are to be adjusted by a neural network based filter.
12. An apparatus for video encoding, comprising one or more processors and at least one memory coupled to the one or more processors, wherein the one or more processors are configured to: determine that one or more visual dimensions of a plurality of visual dimensions of a video are to be adjusted for energy adaptation; and encode at least a syntax element to indicate that the one or more visual dimensions are to be adjusted by a neural network based filter.
13. The method of claim 11, further comprising, or the apparatus of claim 12, wherein the one or more processors are further configured to perform: encoding parameters for adjusting the one or more visual dimensions.
14. The method of claim 13, or the apparatus of claim 13, wherein the parameters include at least one of a spatial resolution of an original content corresponding to the video, a brightness modification parameter, and a display model.
15. The method of any one of claims 11, 13 and 14, or the apparatus of any one of claims -14, wherein a single syntax element indicates whether each of the plurality of visual dimensions is to be adjusted or not.