System and method for multi-reference frame generation for versatile video coding
The LMRFG network addresses inefficiencies in traditional video coding by using multi-scale feature extraction and coordinated attention motion estimation to generate enhanced reference frames, improving video coding efficiency and adaptability for VVC inter coding.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
- Filing Date
- 2024-12-24
- Publication Date
- 2026-07-02
AI Technical Summary
Traditional block-based video coding methods struggle to model large-scale nonlinear motion accurately, which makes video compression less efficient. Existing neural reference frame synthesis techniques are not optimized for the Versatile Video Coding (VVC) standard, particularly in Low-Delay B (LDB) configurations, and they rely on pre-trained optical flow networks that hinder end-to-end optimization.
A lightweight multi-scale reference frame generation (LMRFG) network is introduced, utilizing a multi-scale feature extractor to capture high-resolution and sub-resolution features, combined with a dual-branch coordinated attention motion estimator to generate optical flow maps, and a flow warping and fusion block to create enhanced reference pictures, tailored for VVC inter coding.
The LMRFG network enhances video coding efficiency by adaptively handling diverse motion scenarios, reducing bitrate requirements and complexity while maintaining quality, outperforming existing methods in both Random Access (RA) and LDB configurations.
Description
SYSTEM AND METHOD FOR MULTI-REFERENCE FRAME GENERATION FOR VERSATILE VIDEO CODING
BACKGROUND
[0001] Embodiments of the present disclosure relate to video coding.
[0002] Digital video has become mainstream and is being used in a wide range of applications including digital television, video telephony, and teleconferencing. These digital video applications are feasible because of the advances in computing and communication technologies as well as efficient video coding techniques. Various video coding techniques may be used to compress video data, such that coding on the video data can be performed using one or more video coding standards. Exemplary video coding standards may include, but are not limited to, versatile video coding (H.266 / VVC), high-efficiency video coding (H.265 / HEVC), advanced video coding (H.264 / AVC), and moving picture experts group (MPEG) coding, to name a few.
SUMMARY
[0003] According to one aspect of the present disclosure, a method of video decoding is provided. The method may include obtaining, by a processor, a first reference picture from a decoded picture buffer. The method may include generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The method may include generating, by the processor, an optical flow map based on the first feature map. The method may include generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The method may include decoding, by the processor, a current picture region based on the second reference picture.
[0004] According to another aspect of the present disclosure, a video decoder is provided. The video decoder may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to decode a current picture region based on the second reference picture.
[0005] According to a further aspect of the present disclosure, an apparatus for video decoding is provided. The apparatus for video decoding may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to decode a current picture region based on the second reference picture.
[0006] According to still another aspect of the present disclosure, a non-transitory computer-readable medium storing instructions for a video decoder is provided. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to obtain a first reference picture from a decoded picture buffer. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate an optical flow map based on the first feature map. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to decode a current picture region based on the second reference picture.
[0007] According to one aspect of the present disclosure, a method of video encoding is provided. The method may include obtaining, by a processor, a first reference picture from a decoded picture buffer. The method may include generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The method may include generating, by the processor, an optical flow map based on the first feature map. The method may include generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The method may include encoding, by the processor, a current picture region based on the second reference picture.
[0008] According to another aspect of the present disclosure, a video encoder is provided. The video encoder may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to encode a current picture region based on the second reference picture.
[0009] According to a further aspect of the present disclosure, an apparatus for video encoding is provided. The apparatus for video encoding may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to encode a current picture region based on the second reference picture.
[0010] According to still another aspect of the present disclosure, a non-transitory computer-readable medium storing instructions for a video encoder is provided. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to obtain a first reference picture from a decoded picture buffer. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate an optical flow map based on the first feature map. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to encode a current picture region based on the second reference picture.
[0011] According to yet a further aspect of the present disclosure, a non-transitory computer-readable medium storing a bitstream is provided. The bitstream may be generated based on one or more of the operations described herein.
[0012] These illustrative embodiments are mentioned not to limit or define the present disclosure, but to provide examples to aid understanding thereof. Additional embodiments are described in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0014] FIG. 1 illustrates a block diagram of an exemplary video codec system, according to some embodiments of the present disclosure.
[0015] FIG. 2A illustrates a block diagram of an exemplary encoding apparatus, according to some embodiments of the present disclosure.
[0016] FIG. 2B illustrates a block diagram of an exemplary decoding apparatus, according to some embodiments of the present disclosure.
[0017] FIG. 2C illustrates a detailed block diagram of an exemplary video coder that includes a lightweight multi-reference frame generation (LMRFG) network, according to some embodiments of the present disclosure.
[0018] FIG. 3 illustrates a detailed block diagram of the LMRFG network of FIG. 2C, according to some embodiments of the present disclosure.
[0019] FIG. 4 illustrates a detailed block diagram of a multi-scale feature extractor included in the LMRFG network of FIG. 3, according to some embodiments of the present disclosure.
[0020] FIG. 5A illustrates a detailed block diagram of a dual-branch coordinated attention motion estimator (DBCA-ME) included in the LMRFG network of FIG. 3, according to some embodiments of the present disclosure.
[0021] FIG. 5B illustrates a detailed block diagram of a dual-branch coordinated attention flow estimator (DBCA-FE) block included in the DBCA-ME of FIG. 5A, according to some embodiments of the present disclosure.
[0022] FIG. 5C illustrates a detailed block diagram of a coordinated attention block (CAB) included in the DBCA-FE block of FIG. 5B, according to some embodiments of the present disclosure.
[0023] FIG. 5D illustrates a diagram of feature map visualization of the DBCA-ME of FIG. 5A, according to some embodiments of the present disclosure.
[0024] FIG. 6A illustrates a detailed block diagram of a flow warping and fusion block included in the LMRFG network of FIG. 3, according to some embodiments of the present disclosure.
[0025] FIG. 6B illustrates a detailed block diagram of a fusion block included in the flow warping and fusion block of FIG. 6A, according to some embodiments of the present disclosure.
[0026] FIG. 7 illustrates a diagram of an exemplary depthwise over-parameterized convolution (DO-Conv) that may be used in the LMRFG network of FIG. 3, according to some embodiments of the present disclosure.
[0027] FIG. 8A illustrates a first exemplary progressive learning strategy 800 for an LMRFG network based on quantization parameter (QP) distance, according to some embodiments of the present disclosure.
[0028] FIG. 8B illustrates a second exemplary progressive learning strategy 825 for an LMRFG network based on QP distance, according to some embodiments of the present disclosure.
[0029] FIG. 9A illustrates exemplary average Bjøntegaard delta (BD)-rate (%) results on a set of test sequences in terms of peak signal-to-noise ratio (PSNR), according to some embodiments of the present disclosure.
[0030] FIG. 9B illustrates an exemplary comparison of BD-rate (%) reduction between an LMRFG network and JVET proposals under a random access (RA) configuration, according to some embodiments of the present disclosure.
[0031] FIG. 9C illustrates an exemplary comparison of BD-rate (%) reduction between an LMRFG network and JVET proposals under a Low Delay B (LDB) configuration, according to some embodiments of the present disclosure.
[0032] FIG. 9D illustrates an exemplary complexity comparison of existing methods and the present LMRFG network, according to some embodiments of the present disclosure.
[0033] FIG. 10A is a first table illustrating average rate distortion (RD) results on all test sequences under RA configuration in terms of PSNR for a first dataset, according to some embodiments of the present disclosure.
[0034] FIG. 10B is a second table illustrating second average RD results on all test sequences under RA configuration in terms of PSNR for a second dataset, according to some embodiments of the present disclosure.
[0035] FIG. 10C is a third table illustrating third average RD results on all test sequences under RA configuration in terms of PSNR for a third dataset, according to some embodiments of the present disclosure.
[0036] FIG. 10D is a fourth table illustrating fourth average RD results on all test sequences under RA configuration in terms of PSNR for a fourth dataset, according to some embodiments of the present disclosure.
[0037] FIG. 10E is a fifth table illustrating fifth average RD results on all test sequences under LDB configuration in terms of PSNR for a fifth dataset, according to some embodiments of the present disclosure.
[0038] FIG. 10F is a sixth table illustrating sixth average RD results on all test sequences under LDB configuration in terms of PSNR for a sixth dataset, according to some embodiments of the present disclosure.
[0039] FIG. 10G is a seventh table illustrating seventh average RD results on all test sequences under LDB configuration in terms of PSNR for a seventh dataset, according to some embodiments of the present disclosure.
[0040] FIG. 10H is an eighth table illustrating eighth average RD results on all test sequences under LDB configuration in terms of PSNR for an eighth dataset, according to some embodiments of the present disclosure.
[0041] FIG. 11A illustrates a first contribution visualization of reference pictures, according to some embodiments of the present disclosure.
[0042] FIG. 11B illustrates a second contribution visualization of reference pictures, according to some embodiments of the present disclosure.
[0043] FIG. 12 illustrates a flow chart of an exemplary method of video decoding, according to some embodiments of the present disclosure.
[0044] FIG. 13 illustrates a flow chart of an exemplary method of video encoding, according to some embodiments of the present disclosure.
[0045] Embodiments of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0046] Although some configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.
[0047] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0048] In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0049] Various aspects of video coding systems will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various modules, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
[0050] The techniques described herein may be used for various video coding applications. As described herein, video coding includes both encoding and decoding a video. Encoding and decoding of a video can be performed in units of blocks. For example, an encoding / decoding process such as transform, quantization, prediction, in-loop filtering, reconstruction, or the like may be performed on a coding block, a transform block, or a prediction block. As described herein, a block to be encoded / decoded will be referred to as a “current block.” For example, the current block may represent a coding block, a transform block, or a prediction block according to a current encoding / decoding process. In addition, it is understood that the term “unit” used in the present disclosure indicates a basic unit for performing a specific encoding / decoding process, and the term “block” indicates a sample array of a predetermined size. Unless otherwise stated, the terms “block,” “unit,” and “component” may be used interchangeably.
[0051] With the rapid advancement of internet technology and the widespread adoption of mobile devices, video traffic has become an indispensable part of our daily lives. This surge in demand for video content presents significant challenges for transmission and storage, prompting continuous development in video compression standards. Recent studies indicate that the latest Versatile Video Coding (VVC / H.266) standard can achieve approximately 50% bitrate savings compared to its predecessor, High Efficiency Video Coding (HEVC / H.265), while maintaining comparable quality. Inter-frame prediction, a fundamental component of video coding frameworks, plays a crucial role in reducing temporal redundancy. By effectively leveraging motion estimation, it enhances both video storage efficiency and transmission capabilities. Despite significant improvements in coding efficiency, the traditional block-based matching methods used for inter-frame prediction remain algorithmically complex. These methods struggle to accurately model large-scale nonlinear motion, which can lead to inefficiency in video compression. Moreover, as traditional compression frameworks mature, the limitations inherent in handcrafted module optimization become increasingly evident. The challenge of further enhancing performance under these constraints calls for innovative approaches that go beyond conventional methodologies. To address these limitations, researchers are exploring learning-based techniques such as deep learning, which may provide more adaptive and robust inter-frame prediction capable of handling diverse motion scenarios.
[0052] Neural network-based video coding (NNVC) has received significant attention from both academia and industry in recent years. In 2020, the Joint Video Experts Team (JVET) established a dedicated group, AHG11, to explore promising NNVC technologies, including module-based video coding tools and deep learning-based video compression (DLVC). This discussion centers on deep reference frame synthesis (DRF) for Versatile Video Coding (VVC) inter coding, which falls under the category of module-based video coding. Several DRF approaches have been introduced recently to improve inter-frame prediction performance in traditional video coding standards. For example, some proposals include a neural reference synthesis technique that generates high-fidelity reference blocks for motion estimation (ME) and motion compensation (MC). However, this method was developed for HEVC / H.265 rather than the more recent VVC / H.266 and did not consider the Low-Delay B (LDB) configuration mode. Further contributions include an advanced inter prediction enhancement network called the deep reference frame (DRF). This approach has been discussed and validated by JVET. Nonetheless, the DRF initially depends on a pre-trained optical flow network to extract motion information, which hampers end-to-end optimization. Additionally, the network architectures designed for the VVC Random Access (RA) and LDB configurations are not aligned, with the LDB model proving to be more complex than its RA counterpart. Lastly, DRF employs a conventional training method that uses uncompressed pictures as labels, which limits the model’s ability to achieve optimal performance.
[0053] To overcome these and other challenges, the present disclosure proposes a lightweight multi-scale reference frame generation (LMRFG) network to enhance VVC inter coding. To deeply explore the high-resolution (HR) features and sub-resolution (SubR) features of input pictures and adapt the network model to the changing receptive fields, the LMRFG network of the present disclosure includes a multi-scale feature extractor (MFE). The HR features capture the overall structure and shape of the image and are used for subsequent warping of the composite image, while the SubR features capture the details and textures of the image for subsequent optical flow estimation. The SubR features may be used by a dual-branch coordinated attention motion estimator (DBCA-ME) to generate optical flow maps (e.g., forward and backward flows). A flow warping and fusion block may use the HR features generated by the MFE, along with the optical flow maps generated by the DBCA-ME and the original pictures, to generate an enhanced reference picture. Additional details of the exemplary LMRFG network are provided below in connection with FIGs. 1-13.
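The data flow described above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions, not the disclosed network: every function name is hypothetical, the feature extractor is replaced by 2x2 average pooling, the motion estimator is a stub that returns zero flow, the warp uses nearest-neighbour sampling, and the fusion is a plain average. The use of two input pictures (a previous and a next reference) is likewise an assumption made for illustration.

```python
from typing import List, Tuple

Picture = List[List[float]]               # one sample plane, stored as rows
Flow = List[List[Tuple[float, float]]]    # per-sample (dx, dy) displacement


def extract_features(pic: Picture) -> Tuple[Picture, Picture]:
    """Stand-in for the MFE: returns (HR features, SubR features).

    Assumption: HR is the picture itself and SubR is a 2x2 average-pooled
    copy. The disclosed extractor is a learned network.
    """
    h, w = len(pic), len(pic[0])
    sub = [[(pic[y][x] + pic[y][x + 1] + pic[y + 1][x] + pic[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)]
           for y in range(0, h - 1, 2)]
    return pic, sub


def estimate_flow(sub_prev: Picture, sub_next: Picture,
                  height: int, width: int) -> Tuple[Flow, Flow]:
    """Stand-in for the DBCA-ME: returns zero forward and backward flow."""
    forward = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    backward = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    return forward, backward


def backward_warp(feat: Picture, flow: Flow) -> Picture:
    """Sample feat at (x + dx, y + dy), rounded and clamped to the border."""
    h, w = len(feat), len(feat[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(feat[sy][sx])
        out.append(row)
    return out


def fuse(a: Picture, b: Picture) -> Picture:
    """Stand-in for the fusion block: element-wise average of two planes."""
    return [[(p + q) / 2.0 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def generate_reference(prev_pic: Picture, next_pic: Picture) -> Picture:
    """MFE -> motion estimation on SubR -> warp HR -> fuse."""
    hr_prev, sub_prev = extract_features(prev_pic)
    hr_next, sub_next = extract_features(next_pic)
    fwd, bwd = estimate_flow(sub_prev, sub_next, len(prev_pic), len(prev_pic[0]))
    return fuse(backward_warp(hr_prev, fwd), backward_warp(hr_next, bwd))
```

The point of the sketch is the wiring: the sub-resolution features feed only the motion estimator, while the high-resolution features are what gets warped and fused.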
[0054] FIG. 1 is a block diagram of a video codec system, according to some embodiments of the present disclosure. The video codec system, according to an embodiment, may include an encoding apparatus 10 and a decoding apparatus 20. The encoding apparatus 10 may deliver encoded video and / or picture information or data to the decoding apparatus 20 in the form of a file or streaming via a digital storage medium or network.
[0055] The encoding apparatus 10, according to an embodiment, may include a video source generator 11, an encoding unit 12, and a transmitter 13. The decoding apparatus 20, according to an embodiment, may include a receiver 21, a decoding unit 22, and a renderer 23. The encoding unit 12 may be called a video / picture encoding unit, and the decoding unit 22 may be called a video / picture decoding unit. The transmitter 13 may be included in the encoding unit 12. The receiver 21 may be included in the decoding unit 22. The renderer 23 may include a display, and the display may be configured as a separate device or an external component.
[0056] The video source generator 11 may acquire a video / picture through a process of capturing, synthesizing, or generating the video / picture. The video source generator 11 may include a video / picture capture device and / or a video / picture generating device. The video / picture capture device may include, for example, one or more cameras, video / picture archives including previously captured video / pictures, and the like. The video / picture-generating device may include, for example, computers, tablets, and smartphones, and may (electronically) generate video / pictures. For example, a virtual video / picture may be generated through a computer or the like. In this case, the video / picture capturing process may be replaced by a process of generating related data.
[0057] The encoding unit 12 may encode an input video / picture. The encoding unit 12 may perform a series of procedures such as prediction, transform, and quantization for compression and coding efficiency. The encoding unit 12 may output encoded data (encoded video / picture information) in the form of a bitstream.
[0058] The transmitter 13 may transmit the encoded video / picture information or data output in the form of a bitstream to the receiver 21 of the decoding apparatus 20 through a digital storage medium or a network in the form of a file or streaming. The digital storage medium may include various storage media such as universal serial bus (USB), secure digital (SD), compact disc (CD), digital video disc (DVD), Blu-ray, hard disk drive (HDD), solid-state drive (SSD), and the like. The transmitter 13 may include an element for generating a media file through a predetermined file format and may include an element for transmission through a broadcast / communication network. The receiver 21 may extract / receive the bitstream from the storage medium or network and transmit the bitstream to the decoding unit 22.
[0059] The decoding unit 22 may decode the video / picture by performing a series of procedures such as dequantization, inverse transform, and prediction corresponding to the operation of the encoding unit 12.
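As a rough illustration of how quantization at the encoder and dequantization at the decoder mirror each other, the following sketch applies a uniform scalar quantizer to a few coefficients. The function names and the fixed step size are assumptions for illustration; an actual codec derives the step from a quantization parameter and uses integer arithmetic that this sketch does not reproduce.

```python
from typing import List


def quantize(coeffs: List[float], step: float) -> List[int]:
    """Map each coefficient to the nearest integer level (half away from zero)."""
    return [int(c / step + (0.5 if c >= 0 else -0.5)) for c in coeffs]


def dequantize(levels: List[int], step: float) -> List[float]:
    """Scale integer levels back to approximate coefficient values."""
    return [level * step for level in levels]


if __name__ == "__main__":
    levels = quantize([37, -12, 5, 0], step=8)
    print(levels)                      # integer levels carried in the bitstream
    print(dequantize(levels, step=8))  # decoder-side approximation
```

The round trip is lossy: the decoder recovers multiples of the step size, not the original coefficients.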
[0060] The renderer 23 may render the decoded video / picture. The rendered video / picture may be displayed through the display.
[0061] FIG. 2A is a schematic block diagram of an encoding apparatus, in accordance with some aspects of the present disclosure. Referring to FIG. 2A, the encoding apparatus 200 includes a picture partitioner 210, a predictor 220, a residual processor 230, an entropy encoder 240, an adder 251, a filter 261, and a memory 271. The predictor 220 may include an inter predictor 221 and an intra predictor 222. The residual processor 230 may include a transformer 232, a quantizer 233, a dequantizer 234, and an inverse transformer 235. The residual processor 230 may further include a subtractor 231. The adder 251 may be called a reconstructor or a reconstructed block generator. The picture partitioner 210, the predictor 220, the residual processor 230, the entropy encoder 240, the adder 251, and the filter 261 may be configured by at least one hardware component (e.g., an encoder chipset or processor), according to an embodiment. In addition, the memory 271 may include a decoded picture buffer (DPB) or may be configured by a digital storage medium. The hardware component may further include the memory 271 as an internal / external component.
[0062] The picture partitioner 210 may partition an input picture (or a frame) input to the encoding apparatus 200 into one or more processing units. For example, the processing unit may be called a coding unit (CU). In this case, the coding unit may be recursively partitioned according to a quad-tree binary-tree ternary-tree (QTBTTT) structure from a coding tree unit (CTU) or a largest coding unit (LCU). For example, one coding unit may be partitioned into a plurality of coding units of a deeper depth based on a quad tree structure, a binary tree structure, and / or a ternary tree structure. In this case, for example, the quad tree structure may be applied first, and the binary tree structure and / or ternary tree structure may be applied later. Alternatively, the binary tree structure may be applied first. The coding procedure according to this disclosure may be performed based on the final coding unit that is no longer partitioned. In this case, the largest coding unit may be used as the final coding unit based on coding efficiency according to picture characteristics, or if necessary, the coding unit may be recursively partitioned into coding units of deeper depth, and a coding unit having an optimal size may be used as the final coding unit. Here, the coding procedure may include a procedure of prediction, transform, and reconstruction, which will be described later. As another example, the processing unit may further include a prediction unit (PU) or a transform unit (TU). In this case, the prediction unit and the transform unit may be split or partitioned from the aforementioned final coding unit. The prediction unit may be a unit of sample prediction, and the transform unit may be a unit for deriving a transform coefficient and / or a unit for deriving a residual signal from the transform coefficient.
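The recursive partitioning described above can be illustrated with a minimal sketch that covers only the quad-tree part; binary and ternary splits are omitted. The function name, the `should_split` callback, and the minimum size are assumptions for illustration. A real encoder would make the split decision from rate-distortion cost rather than from a caller-supplied predicate.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int, int]  # (x, y, width, height) in samples


def quadtree_partition(block: Block,
                       should_split: Callable[[Block], bool],
                       min_size: int = 8) -> List[Block]:
    """Recursively split a block into four quadrants until no split is requested.

    Returns the leaf blocks, which play the role of final coding units.
    """
    x, y, w, h = block
    if w <= min_size or h <= min_size or not should_split(block):
        return [block]
    hw, hh = w // 2, h // 2
    leaves: List[Block] = []
    for cx, cy in ((x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)):
        leaves.extend(quadtree_partition((cx, cy, hw, hh), should_split, min_size))
    return leaves


if __name__ == "__main__":
    # Hypothetical rule: keep splitting the top-left block while it is wider than 32.
    rule = lambda b: b[0] == 0 and b[1] == 0 and b[2] > 32
    for leaf in quadtree_partition((0, 0, 128, 128), rule):
        print(leaf)
```

With this rule, a 128x128 CTU yields three 64x64 leaves and four 32x32 leaves, and the leaves tile the CTU without overlap.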
[0063] The term unit may be used interchangeably with terms such as block or area in some cases. In a general case, an M×N block may represent a set of samples or transform coefficients composed of M columns and N rows. A sample may generally represent a pixel or a value of a pixel, and may represent only a pixel / pixel value of a luma component or only a pixel / pixel value of a chroma component. The term sample may be used to correspond to a pixel or a pel of one picture.
[0064] In the encoding apparatus 200, a prediction signal (predicted block, prediction sample array) output from the inter predictor 221 or the intra predictor 222 is subtracted from an input picture signal (original block, original sample array) to generate a residual signal (residual block, residual sample array) , and the generated residual signal is transmitted to the transformer 232. In this case, as shown, a unit for subtracting a prediction signal (predicted block, prediction sample array) from the input picture signal (original block, original sample array) in the encoding apparatus 200 may be called a subtractor 231. The predictor may perform prediction on a block to be processed (hereinafter, referred to as a current block) and generate a predicted block including prediction samples for the current block. The predictor may determine whether intra prediction or inter prediction is applied on a current block or CU basis. As described later in the description of each prediction mode, the predictor may generate various information related to prediction, such as prediction mode information, and transmit the generated information to the entropy encoder 240. The information on the prediction may be encoded in the entropy encoder 240 and output in the form of a bitstream.
[0065] The intra predictor 222 may predict the current block by referring to the samples in the current picture. The referred samples may be located in the neighborhood of the current block or may be located apart according to the prediction mode. In the intra prediction, prediction modes may include a plurality of non-directional modes and a plurality of directional modes. The non-directional mode may include, for example, a DC mode and a planar mode. The directional mode may include, for example, 33 directional prediction modes or 65 directional prediction modes according to the degree of detail of the prediction direction. However, this is merely an example, and more or less directional prediction modes may be used depending on the setting. The intra predictor 222 may determine the prediction mode applied to the current block by using a prediction mode applied to a neighboring block.
[0066] The inter predictor 221 may derive a predicted block for the current block based on a reference block (reference sample array) specified by a motion vector on a reference picture. Here, in order to reduce the amount of motion information transmitted in the inter prediction mode, the motion information may be predicted in units of blocks, subblocks, or samples based on the correlation of motion information between the neighboring block and the current block. The motion information may include a motion vector and a reference picture index. The motion information may further include inter prediction direction (L0 prediction, L1 prediction, Bi prediction, etc. ) information. In the case of inter prediction, the neighboring block may include a spatial neighboring block present in the current picture and a temporal neighboring block present in the reference picture. The reference picture including the reference block and the reference picture including the temporal neighboring block may be the same or different. The temporal neighboring block may be called a collocated reference block, a co-located CU (colCU) , and the like, and the reference picture including the temporal neighboring block may be called a collocated picture (colPic) . For example, the inter predictor 221 may configure a motion information candidate list based on neighboring blocks and generate information indicating which candidate is used to derive a motion vector and / or a reference picture index of the current block. Inter prediction may be performed based on various prediction modes. For example, in the case of a skip mode and a merge mode, the inter predictor 221 may use motion information of the neighboring block as motion information of the current block. In the skip mode, unlike the merge mode, the residual signal may not be transmitted. 
In the case of the motion vector prediction (MVP) mode, the motion vector of the neighboring block may be used as a motion vector predictor, and the motion vector of the current block may be indicated by signaling a motion vector difference.
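By way of non-limiting illustration, the MVP mode described above may be sketched as follows: the encoder selects a predictor from a candidate list and signals the candidate index together with the motion vector difference, and the decoder adds the difference back to the selected predictor. The function names and the selection criterion (smallest sum of absolute differences) are assumptions made for this sketch only.

```python
from typing import List, Tuple

MV = Tuple[int, int]  # (horizontal, vertical) motion vector components

def encode_mv(mv: MV, candidates: List[MV]) -> Tuple[int, MV]:
    """Pick the candidate closest to mv and return (candidate index, motion vector difference)."""
    def cost(i: int) -> int:
        return abs(mv[0] - candidates[i][0]) + abs(mv[1] - candidates[i][1])
    idx = min(range(len(candidates)), key=cost)  # hypothetical selection rule
    mvp = candidates[idx]
    return idx, (mv[0] - mvp[0], mv[1] - mvp[1])

def decode_mv(idx: int, mvd: MV, candidates: List[MV]) -> MV:
    """Rebuild the motion vector from the signaled index and difference."""
    mvp = candidates[idx]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

# Candidate list built from neighboring blocks (values are made up for the example).
candidates = [(4, -2), (10, 3)]
idx, mvd = encode_mv((9, 4), candidates)
decoded = decode_mv(idx, mvd, candidates)
```

Only the index and the (typically small) difference need to be transmitted, which is how the amount of motion information is reduced.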
[0067] The predictor 220 may generate a prediction signal based on various prediction methods described below. For example, the predictor may not only apply intra prediction or inter prediction to predict one block but also simultaneously apply both intra prediction and inter prediction. This may be called combined inter and intra prediction (CIIP) . In addition, the predictor may be based on an intra block copy (IBC) prediction mode or a palette mode for prediction of a block. The IBC prediction mode or palette mode may be used for content picture / video coding of a game or the like, for example, screen content coding (SCC) . The IBC basically performs prediction in the current picture but may be performed similarly to inter prediction in that a reference block is derived in the current picture. That is, the IBC may use at least one of the inter prediction techniques described in the present disclosure. The palette mode may be considered an example of intra coding or intra prediction. When the palette mode is applied, a sample value within a picture may be signaled based on information on the palette table and the palette index.
[0068] The prediction signal generated by the predictor (including the inter predictor 221 and / or the intra predictor 222) may be used to generate a reconstructed signal or to generate a residual signal. The transformer 232 may generate transform coefficients by applying a transform technique to the residual signal. For example, the transform technique may include at least one of a discrete cosine transform (DCT) , a discrete sine transform (DST) , a karhunen-loève transform (KLT) , a graph-based transform (GBT) , or a conditionally non-linear transform (CNT) . Here, the GBT means transform obtained from a graph when relationship information between pixels is represented by the graph. The CNT refers to the transform generated based on a prediction signal generated using all previously reconstructed pixels. In addition, the transform process may be applied to square pixel blocks having the same size or may be applied to blocks having a variable size rather than a square.
[0069] The quantizer 233 may quantize the transform coefficients and transmit them to the entropy encoder 240, and the entropy encoder 240 may encode the quantized signal (information on the quantized transform coefficients) and output a bitstream. The information on the quantized transform coefficients may be referred to as residual information. The quantizer 233 may rearrange block type quantized transform coefficients into a one-dimensional vector form based on a coefficient scanning order and generate information on the quantized transform coefficients based on the quantized transform coefficients in the one-dimensional vector form. The entropy encoder 240 may perform various encoding methods such as, for example, exponential Golomb, context-adaptive variable length coding (CAVLC), context-adaptive binary arithmetic coding (CABAC), and the like. The entropy encoder 240 may encode information necessary for video / picture reconstruction other than quantized transform coefficients (e.g., values of syntax elements, etc.) together or separately. Encoded information (e.g., encoded video / picture information) may be transmitted or stored in units of network abstraction layer (NAL) units in the form of a bitstream. The video / picture information may further include information on various parameter sets, such as an adaptation parameter set (APS), a picture parameter set (PPS), a sequence parameter set (SPS), or a video parameter set (VPS). In addition, the video / picture information may further include general constraint information. In the present disclosure, information and / or syntax elements transmitted / signaled from the encoding apparatus to the decoding apparatus may be included in video / picture information. The video / picture information may be encoded through the above-described encoding procedure and included in the bitstream. 
The bitstream may be transmitted over a network or may be stored in a digital storage medium. The network may include a broadcasting network and / or a communication network, and the digital storage medium may include various storage media such as USB, SD, CD, DVD, Blu-ray, HDD, SSD, and the like. A transmitter (not shown) transmitting a signal output from the entropy encoder 240 and / or a storage unit (not shown) storing the signal may be included as internal / external element of the encoding apparatus 200, and alternatively, the transmitter may be included in the entropy encoder 240.
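By way of non-limiting illustration, the rearrangement of a block of quantized transform coefficients into a one-dimensional vector may be sketched as follows. The anti-diagonal scan used here is only one possible coefficient scanning order and is an assumption of this sketch; the function names are hypothetical.

```python
from typing import List, Tuple

def diagonal_scan_order(n: int) -> List[Tuple[int, int]]:
    """Return (row, column) positions of an n x n block, visited one anti-diagonal at a time."""
    order = []
    for s in range(2 * n - 1):      # s = row + column is constant along an anti-diagonal
        for y in range(n):
            x = s - y
            if 0 <= x < n:
                order.append((y, x))
    return order

def scan_block(block: List[List[int]]) -> List[int]:
    """Flatten a square block of quantized coefficients into a one-dimensional vector."""
    return [block[y][x] for y, x in diagonal_scan_order(len(block))]

# Quantized coefficients tend to cluster near the top-left (low-frequency) corner.
block = [[7, 2, 0],
         [3, 1, 0],
         [0, 0, 0]]
flat = scan_block(block)
```

With such an order the non-zero values are grouped at the start of the vector and the trailing zeros form a single run, which suits the entropy coding stage.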
[0070] The quantized transform coefficients output from the quantizer 233 may be used to generate a prediction signal. For example, the residual signal (residual block or residual samples) may be reconstructed by applying dequantization and inverse transform to the quantized transform coefficients through the dequantizer 234 and the inverse transformer 235. The adder 251 adds the reconstructed residual signal to the prediction signal output from the inter predictor 221 or the intra predictor 222 to generate a reconstructed signal (reconstructed picture, reconstructed block, reconstructed sample array) . If there is no residual for the block to be processed, such as in a case where the skip mode is applied, the predicted block may be used as the reconstructed block. The adder 251 may be called a reconstructor or a reconstructed block generator. The generated reconstructed signal may be used for intra prediction of the next block to be processed in the current picture and may be used for inter prediction of the next picture through filtering as described below.
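By way of non-limiting illustration, the reconstruction path described above may be sketched as follows. For brevity the sketch omits the transform and inverse transform and quantizes the residual directly with a uniform step size; both simplifications, as well as the function names and sample values, are assumptions made for this example only.

```python
import math
from typing import List

def quantize(residual: List[int], step: int) -> List[int]:
    """Uniform scalar quantization with rounding to the nearest level."""
    return [math.floor(r / step + 0.5) for r in residual]

def dequantize(levels: List[int], step: int) -> List[int]:
    """Scale quantized levels back to reconstructed residual values."""
    return [lv * step for lv in levels]

def reconstruct(prediction: List[int], levels: List[int], step: int) -> List[int]:
    """Add the reconstructed residual to the prediction (the role of the adder)."""
    if not any(levels):
        # No residual, e.g. skip mode: the predicted block is the reconstructed block.
        return list(prediction)
    return [p + r for p, r in zip(prediction, dequantize(levels, step))]

original = [52, 55, 61, 66]
prediction = [50, 50, 60, 60]
residual = [o - p for o, p in zip(original, prediction)]  # the role of the subtractor
levels = quantize(residual, step=4)
recon = reconstruct(prediction, levels, step=4)
```

Because the encoder reconstructs from the same quantized levels that the decoder receives, both sides hold identical reference samples even though the reconstruction differs from the original.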
[0071] Meanwhile, luma mapping with chroma scaling (LMCS) may be applied during picture encoding and / or reconstruction.
[0072] The filter 261 may improve subjective / objective picture quality by applying filtering to the reconstructed signal. For example, the filter 261 may generate a modified reconstructed picture by applying various filtering methods to the reconstructed picture and store the modified reconstructed picture in the memory 271, specifically, a DPB of the memory 271. The various filtering methods may include, for example, deblocking filtering, a sample adaptive offset, an adaptive loop filter, a bilateral filter, and the like. The filter 261 may generate various information related to the filtering and transmit the generated information to the entropy encoder 240, as described later in the description of each filtering method. The information related to the filtering may be encoded by the entropy encoder 240 and output in the form of a bitstream.
[0073] The modified reconstructed picture transmitted to the memory 271 may be used as the reference picture in the inter predictor 221. When the inter prediction is applied through the encoding apparatus, prediction mismatch between the encoding apparatus 200 and the decoding apparatus 250 may be avoided, and encoding efficiency may be improved.
[0074] The DPB of the memory 271 may store the modified reconstructed picture for use as a reference picture in the inter predictor 221. The memory 271 may store the motion information of the block from which the motion information in the current picture is derived (or encoded) and / or the motion information of the blocks in the picture that have already been reconstructed. The stored motion information may be transmitted to the inter predictor 221 and used as the motion information of the spatial neighboring block or the motion information of the temporal neighboring block. The memory 271 may store reconstructed samples of reconstructed blocks in the current picture and may transfer the reconstructed samples to the intra predictor 222.
[0075] FIG. 2B is a schematic block diagram of a decoding apparatus, in accordance with some embodiments of the present disclosure. Referring to FIG. 2B, the decoding apparatus 250 may include an entropy decoder 270, a residual processor 252, a predictor 258, an adder 264, a filter 266, and a memory 268. The predictor 258 may include an inter predictor 260 and an intra predictor 262. The residual processor 252 may include a dequantizer 254 and an inverse transformer 256. The entropy decoder 270, the residual processor 252, the predictor 258, the adder 264, and the filter 266 may be configured by a hardware component (e.g., a decoder chipset or a processor) according to an embodiment. In addition, the memory 268 may include a decoded picture buffer (DPB) or may be configured by a digital storage medium.
[0076] When a bitstream including video / picture information is input, the decoding apparatus 250 may reconstruct a picture corresponding to a process in which the video / picture information is processed in the encoding apparatus of FIG. 2A. For example, the decoding apparatus 250 may derive units / blocks based on block partition-related information obtained from the bitstream. The decoding apparatus 250 may perform decoding using a processing unit applied in the encoding apparatus. Thus, the processing unit of decoding may be a coding unit, for example, and the coding unit may be partitioned according to a quad tree structure, binary tree structure and / or ternary tree structure from the coding tree unit or the largest coding unit. One or more transform units may be derived from the coding unit. The reconstructed picture signal decoded and output through the decoding apparatus 250 may be reproduced through a reproducing apparatus.
[0077] The decoding apparatus 250 may receive a signal output from the encoding apparatus of FIG. 2A in the form of a bitstream, and the received signal may be decoded through the entropy decoder 270. For example, the entropy decoder 270 may parse the bitstream to derive information (e.g., video / picture information) necessary for picture reconstruction. The video / picture information may further include information on various parameter sets, such as an adaptation parameter set (APS), a picture parameter set (PPS), a sequence parameter set (SPS), or a video parameter set (VPS). In addition, the video / picture information may further include general constraint information. The decoding apparatus may further decode the picture based on the information on the parameter set and / or the general constraint information. Signaled / received information and / or syntax elements described later in the present disclosure may be decoded by the decoding procedure and obtained from the bitstream. For example, the entropy decoder 270 may decode the information in the bitstream based on a coding method such as exponential Golomb coding, CAVLC, or CABAC, and output syntax elements required for picture reconstruction and quantized values of transform coefficients for the residual. More specifically, the CABAC entropy decoding method may receive a bin corresponding to each syntax element in the bitstream, determine a context model using information on the syntax element to be decoded, decoding information of the block to be decoded, or information of a symbol / bin decoded in a previous stage, perform arithmetic decoding on the bin by predicting a probability of occurrence of the bin according to the determined context model, and generate a symbol corresponding to the value of each syntax element. 
In this case, the CABAC entropy decoding method may update the context model by using the information of the decoded symbol / bin for a context model of the next symbol / bin after determining the context model. The information related to the prediction among the information decoded by the entropy decoder 270 may be provided to the predictor (the inter predictor 260 and the intra predictor 262), and the residual value on which the entropy decoding was performed in the entropy decoder 270, that is, the quantized transform coefficients and related parameter information, may be input to the residual processor 252. The residual processor 252 may derive the residual signal (the residual block, the residual samples, or the residual sample array). In addition, information on filtering among information decoded by the entropy decoder 270 may be provided to the filter 266. Meanwhile, a receiver (not shown) for receiving a signal output from the encoding apparatus may be further configured as an internal / external element of the decoding apparatus 250, or the receiver may be a component of the entropy decoder 270. Meanwhile, the decoding apparatus in the present disclosure may be referred to as a video / picture decoding apparatus, and the decoding apparatus may be classified into an information decoder (video / picture information decoder) and a sample decoder (video / picture sample decoder). The information decoder may include the entropy decoder 270, and the sample decoder may include at least one of the dequantizer 254, the inverse transformer 256, the adder 264, the filter 266, the memory 268, the inter predictor 260, and the intra predictor 262.
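By way of non-limiting illustration, one of the coding methods named above, exponential Golomb coding, may be sketched as follows for unsigned values of order 0. The bitstream is represented as a string of "0" and "1" characters purely for readability, and the function names are hypothetical; the sketch does not reproduce the syntax of any particular standard.

```python
from typing import Tuple

def ue_encode(value: int) -> str:
    """Order-0 unsigned exponential Golomb code: leading zeros, then binary of value + 1."""
    code = bin(value + 1)[2:]              # binary digits of value + 1, starting with "1"
    return "0" * (len(code) - 1) + code    # prefix length equals the number of digits minus one

def ue_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one code word starting at pos; return (value, position after the code word)."""
    zeros = 0
    while bits[pos + zeros] == "0":        # count the zero prefix
        zeros += 1
    end = pos + 2 * zeros + 1              # prefix + leading "1" + zeros suffix bits
    value = int(bits[pos + zeros:end], 2) - 1
    return value, end

# Three syntax element values written back to back and parsed in order.
stream = ue_encode(0) + ue_encode(1) + ue_encode(3)
first, p = ue_decode(stream, 0)
second, p = ue_decode(stream, p)
third, p = ue_decode(stream, p)
```

Small values receive short code words, so the decoder can parse consecutive syntax elements without any explicit length field.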
[0078] The dequantizer 254 may dequantize the quantized transform coefficients and output the transform coefficients. The dequantizer 254 may rearrange the quantized transform coefficients into a two-dimensional block form. In this case, the rearrangement may be performed based on the coefficient scanning order performed in the encoding apparatus. The dequantizer 254 may perform dequantization on the quantized transform coefficients by using a quantization parameter (e.g., quantization step size information) and obtain transform coefficients.
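By way of non-limiting illustration, the rearrangement and dequantization performed at the decoder side may be sketched as follows. The scan order is passed in explicitly and must match the one used by the encoder; the use of a single step size for the whole block and the function name `dequantize_block` are assumptions of this sketch (in practice the step size would be derived from the quantization parameter).

```python
from typing import List, Tuple

def dequantize_block(levels: List[int],
                     scan_order: List[Tuple[int, int]],
                     size: int,
                     step: int) -> List[List[int]]:
    """Place each level at its (row, column) position and scale it by the step size."""
    block = [[0] * size for _ in range(size)]
    for level, (y, x) in zip(levels, scan_order):
        block[y][x] = level * step
    return block

# Same scanning order as assumed on the encoder side (illustrative 2x2 case).
scan_order = [(0, 0), (0, 1), (1, 0), (1, 1)]
coeffs = dequantize_block([3, -1, 2, 0], scan_order, size=2, step=8)
```

The resulting two-dimensional block of transform coefficients is what the inverse transformer would then convert back into a residual signal.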
[0079] The inverse transformer 256 inversely transforms the transform coefficients to obtain a residual signal (residual block, residual sample array) .
[0080] The predictor may perform prediction on the current block and generate a predicted block including prediction samples for the current block. The predictor may determine whether intra prediction or inter prediction is applied to the current block based on the information on the prediction output from the entropy decoder 270 and may determine a specific intra / inter prediction mode.
[0081] The predictor 258 may generate a prediction signal based on various prediction methods described below. For example, the predictor may not only apply intra prediction or inter prediction to predict one block but also simultaneously apply intra prediction and inter prediction. This may be called combined inter and intra prediction (CIIP) . In addition, the predictor may be based on an intra block copy (IBC) prediction mode or a palette mode for prediction of a block. The IBC prediction mode or palette mode may be used for content picture / video coding of a game or the like, for example, screen content coding (SCC) . The IBC basically performs prediction in the current picture but may be performed similarly to inter prediction in that a reference block is derived in the current picture. That is, the IBC may use at least one of the inter prediction techniques described in this document. The palette mode may be considered an example of intra coding or intra prediction. When the palette mode is applied, a sample value within a picture may be signaled based on information on the palette table and the palette index.
[0082] The intra predictor 262 may predict the current block by referring to the samples in the current picture. The referred samples may be located in the neighborhood of the current block or may be located apart according to the prediction mode. In the intra prediction, prediction modes may include a plurality of nondirectional modes and a plurality of directional modes. The intra predictor 262 may determine the prediction mode applied to the current block by using a prediction mode applied to a neighboring block.
[0083] The intra predictor 262 may predict the current block by referring to the samples in the current picture. The referenced samples may be located in the neighborhood of the current block or may be located apart according to the prediction mode. In intra prediction, prediction modes may include a plurality of nondirectional modes and a plurality of directional modes. The intra predictor 262 may determine the prediction mode applied to the current block by using the prediction mode applied to the neighboring block.
[0084] The inter predictor 260 may derive a predicted block for the current block based on a reference block (reference sample array) specified by a motion vector on a reference picture. In this case, in order to reduce the amount of motion information transmitted in the inter prediction mode, motion information may be predicted in units of blocks, subblocks, or samples based on the correlation of motion information between the neighboring block and the current block. The motion information may include a motion vector and a reference picture index. The motion information may further include inter prediction direction (L0 prediction, L1 prediction, Bi prediction, etc. ) information. In the case of inter prediction, the neighboring block may include a spatial neighboring block present in the current picture and a temporal neighboring block present in the reference picture. For example, the inter predictor 260 may configure a motion information candidate list based on neighboring blocks and derive a motion vector of the current block and / or a reference picture index based on the received candidate selection information. Inter prediction may be performed based on various prediction modes, and the information on the prediction may include information indicating a mode of inter prediction for the current block.
[0085] The adder 264 may generate a reconstructed signal (reconstructed picture, reconstructed block, reconstructed sample array) by adding the obtained residual signal to the prediction signal (predicted block, predicted sample array) output from the predictor (including the inter predictor 260 and / or the intra predictor 262) . If there is no residual for the block to be processed, such as when the skip mode is applied, the predicted block may be used as the reconstructed block.
[0086] The adder 264 may be called reconstructor or a reconstructed block generator. The generated reconstructed signal may be used for intra prediction of the next block to be processed in the current picture, may be output through filtering as described below, or may be used for inter prediction of the next picture.
[0087] Meanwhile, luma mapping with chroma scaling (LMCS) may be applied in the picture decoding process.
[0088] The filter 266 may improve subjective / objective picture quality by applying filtering to the reconstructed signal. For example, the filter 266 may generate a modified reconstructed picture by applying various filtering methods to the reconstructed picture and store the modified reconstructed picture in the memory 268, specifically, a DPB of the memory 268. The various filtering methods may include, for example, deblocking filtering, a sample adaptive offset, an adaptive loop filter, a bilateral filter, and the like.
[0089] The (modified) reconstructed picture stored in the DPB of the memory 268 may be used as a reference picture in the inter predictor 260. The memory 268 may store the motion information of the block from which the motion information in the current picture is derived (or decoded) and / or the motion information of the blocks in the picture that have already been reconstructed. The stored motion information may be transmitted to the inter predictor 260 so as to be utilized as the motion information of the spatial neighboring block or the motion information of the temporal neighboring block. The memory 268 may store reconstructed samples of reconstructed blocks in the current picture and transfer the reconstructed samples to the intra predictor 262.
[0090] In the present disclosure, the embodiments described in the filter 261, the inter predictor 221, and the intra predictor 222 of the encoding apparatus 200 may be applied equally or correspondingly to the filter 266, the inter predictor 260, and the intra predictor 262 of the decoding apparatus 250, respectively.
[0091] In-loop filtering techniques for traditional video codecs are one of the representative components of NN-based coding models or tools. Since VVC is a block-based hybrid coding framework, it inevitably produces undesirable compression artifacts, especially at a high compression rate. To address this issue, VVC employs a series of advanced filtering techniques to eliminate or reduce compression artifacts.
[0092] FIG. 2C illustrates a detailed block diagram of an exemplary video coder 255 that includes an LMRFG network 282, according to some embodiments of the present disclosure.
[0093] As shown in FIG. 2C, video coder 255 may include a first summation operator 291, a transform and quantization block 281, an inverse quantization block 272, an inverse transform block 274, a second summation operator 276, an in-loop filter 278, a decoded picture buffer (DPB) 280, the LMRFG network 282, a reference picture list 284, a motion estimation block 286, a motion compensation block 288, an intra prediction block 290, and an entropy coding block 292.
[0094] In the inference phase, LMRFG network 282 is first integrated into the VTM software to ensure online use. Then, LMRFG network 282 takes two previously decoded pictures (RA: and LDB: and ) from DPB 280 and the corresponding QP map as inputs, and learns to generate an interpolated or extrapolated picture that is inserted into reference picture list 284 (shown in FIG. 2C) as a candidate picture for inter frame prediction.
[0095] VVC adopts a hierarchical encoding structure, where video pictures are divided into different temporal layers. For RA configuration, VVC encoding is used for pictures with time layers of {0, 1, 2, 3, 7} . For pictures with time layers of {4, 5, 6} , LMRFG network 282 may be used because the image quality generated by LMRFG network 282 may not be satisfactory in lower layers with longer picture order count (POC) distances. Taking a model with RA configuration as an example, the mathematical formula can be expressed according to expression (1) :
[0096] For the LDB configuration, VVC encoding may be used for pictures with time layers of {0, 1}. For pictures with a time layer greater than 2, LMRFG may be used to generate reference pictures. Therefore, once VVC has encoded the first two video pictures in display order, LMRFG is activated to generate subsequent reference pictures. Taking a model with LDB configuration as an example, the mathematical formula can be expressed according to expression (2).
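By way of non-limiting illustration, the selection of pictures for which LMRFG generates a reference picture may be sketched as follows. The function name `uses_lmrfg` is hypothetical. The RA rule follows the time layers listed above; for LDB the text names layers {0, 1} for VVC encoding and layers greater than 2 for LMRFG and does not state how layer 2 is treated, so this sketch follows the wording literally and treats layer 2 as not using LMRFG, which is an assumption.

```python
RA_LMRFG_LAYERS = {4, 5, 6}        # RA time layers for which LMRFG may be used
RA_VVC_LAYERS = {0, 1, 2, 3, 7}    # RA time layers encoded with VVC only

def uses_lmrfg(config: str, temporal_layer: int) -> bool:
    """Return True if an LMRFG-generated reference picture is used for this time layer."""
    if config == "RA":
        return temporal_layer in RA_LMRFG_LAYERS
    if config == "LDB":
        # Literal reading of "greater than 2"; the handling of layer 2 is an assumption.
        return temporal_layer > 2
    raise ValueError(f"unknown configuration: {config}")

ra_decisions = {layer: uses_lmrfg("RA", layer) for layer in range(8)}
```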
[0097] Additional details of LMRFG network 282 are provided below in connection with FIGs. 3-13.
[0098] FIG. 3 illustrates a detailed block diagram of LMRFG network 282 of FIG. 2C, according to some embodiments of the present disclosure. As shown, the proposed architecture of LMRFG network 282 can be divided into three main parts: multi-scale feature extractor (MFE) 302, dual-branch coordinated attention motion estimator (DBCA-ME) 304, and flow warp and fusion block (FWFB) 306.
[0099] Referring to FIG. 3, MFE block 302 may extract multi-scale features from two reference pictures 301a, 301b to generate high resolution (HR) and sub resolution (SubR) features. HR features may be used for subsequent picture warping and fusion, while SubR features may be used to predict optical flows at different resolutions. Subsequently, DBCA-ME block 304 performs dual-branch coordinated attention motion estimation using a strategy of hierarchical learning from low to high resolution to learn pixel motion position and direction information, and uses an optical flow amplifier during upsampling to generate directional flow. Finally, using two cascaded U-nets as the basic backbone, FWFB 306 combines the optical flow map and the original picture features through motion compensation to generate a high-quality reference picture 303.
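By way of non-limiting illustration, warping a picture with an optical flow map may be sketched as follows using backward warping with bilinear interpolation. This is a generic warping technique chosen for the example and is not asserted to be the exact operation of FWFB 306; the function names, the flow convention (each output position stores the displacement to its source position), and the border clamping are assumptions.

```python
from typing import List, Tuple

Image = List[List[float]]
Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy)

def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a fractional position, clamping coordinates to the picture border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def backward_warp(img: Image, flow: Flow) -> Image:
    """For every output position, fetch the sample the flow vector points to in img."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(w)] for y in range(h)]

img = [[0.0, 10.0, 20.0],
       [30.0, 40.0, 50.0]]
half_pixel_flow = [[(0.5, 0.0)] * 3 for _ in range(2)]  # uniform half-sample shift
warped = backward_warp(img, half_pixel_flow)
```

Sub-sample displacements are handled by the interpolation, which is why a dense flow map can describe motion that a block-based motion vector cannot.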
[0100] FIG. 4 illustrates a detailed block diagram of MFE block 302 included in LMRFG network 282 of FIG. 3, according to some embodiments of the present disclosure. As shown, MFE block 302 may include, e.g., a plurality of DO-Conv blocks 402, a plurality of parametric rectified linear unit (PReLU) blocks 404, a first concatenator 406a, and a second concatenator 406b.
[0101] Referring to FIG. 4, MFE block 302 may first perform multi-scale feature extraction on two reference pictures 301a, 301b downsampled three times. Then, multi-scale feature extraction is performed on a series of pictures with different scales to generate HR (high-resolution) features 403 and SubR (sub-resolution) features 405, respectively.
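By way of non-limiting illustration, downsampling a reference picture three times to obtain a series of pictures at different scales may be sketched as follows. The 2x2 average pooling used here is a stand-in, since the downsampling filter is not specified in this passage; the function names are hypothetical, and the learned feature extraction (DO-Conv and PReLU blocks) is not modeled.

```python
from typing import List

Image = List[List[float]]

def downsample2x(img: Image) -> Image:
    """Halve both dimensions by averaging each 2x2 neighborhood (assumed filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4
             for x in range(w)] for y in range(h)]

def build_pyramid(img: Image, levels: int = 3) -> List[Image]:
    """Return [full size, 1/2 size, 1/4 size, 1/8 size] for levels = 3."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

picture = [[float(x + y) for x in range(8)] for y in range(8)]  # toy 8x8 picture
pyramid = build_pyramid(picture)
sizes = [(len(p), len(p[0])) for p in pyramid]
```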
[0102] HR features 403 capture the overall structure and shape of the image and are used for subsequent warping of the synthesized image and may be generated based on a concatenation of the HR features from the same scales of different image pyramids. For instance, first concatenator 406a may concatenate HR features for 1 / 2 size in the first and second picture pyramids, and second concatenator 406b may concatenate HR features for the 1 / 4 size in the first, second, and third picture pyramids. On the other hand, SubR features 405 may capture image details and textures for optical flow estimation of subsequent motion and may be generated as a summation of SubR features 401a from the first image pyramid, SubR features 401b from the second image pyramid, and third SubR features 401c from the third image pyramid.
[0103] SubR features 405 may be used as input to DBCA-ME block 304, while HR features 403 may be used as input to FWFB 306.
[0104] FIG. 5A illustrates a detailed block diagram of DBCA-ME block 304 included in LMRFG network 282 of FIG. 3, according to some embodiments of the present disclosure. As shown, DBCA-ME block 304 may include, e.g., a plurality of DBCA-FE blocks 502, a plurality of first optical flow proportional amplifiers 504a, a plurality of second optical flow proportional amplifiers 504b, a plurality of first warping blocks 506a, and a plurality of second warping blocks 506b. SubR features 501a for a first input reference picture (e.g., 301a) and SubR features 501b for a second input reference picture (e.g., 301b) are input to a DBCA-FE block 502.
[0105] Referring to FIG. 5A, due to the difficulty of directly estimating optical flow with high resolution, a strategy based on hierarchical learning from low to high resolution is performed by DBCA-ME block 304. In addition, optical flow proportional amplifiers are used during upsampling to amplify the optical flow, which largely relies on the intuition that high-resolution motion should be greater than low resolution motion, thereby greatly increasing the number of pixels available for motion estimation. Plurality of first optical flow proportional amplifiers 504a may amplify the forward optical flows, while plurality of second optical flow proportional amplifiers 504b may amplify backward optical flows. First plurality of warping blocks 506a may warp the forward optical flows, while second plurality of warping blocks 506b may warp the backward optical flows.
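By way of non-limiting illustration, amplifying an optical flow map during upsampling may be sketched as follows. The sketch assumes nearest-neighbor upsampling and an amplification factor equal to the upsampling ratio, reflecting the intuition that a displacement of one sample at low resolution corresponds to a displacement of two samples at twice the resolution; the function name and both choices are assumptions and are not taken from the present disclosure.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy)

def upsample_and_amplify(flow: Flow, scale: int = 2) -> Flow:
    """Repeat each flow vector scale x scale times and multiply its magnitude by scale."""
    out: Flow = []
    for row in flow:
        up_row = []
        for dx, dy in row:
            up_row.extend([(dx * scale, dy * scale)] * scale)  # widen the row
        out.extend([list(up_row) for _ in range(scale)])       # repeat the row vertically
    return out

low_res_flow = [[(1.0, -0.5), (0.0, 2.0)]]   # 1 x 2 flow map
high_res_flow = upsample_and_amplify(low_res_flow)
```

The amplified map can then serve as the starting estimate that the next, higher-resolution stage refines.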
[0106] FIG. 5B illustrates a detailed block diagram of DBCA-FE block 502 included in the DBCA-ME block 304 of FIG. 5A, according to some embodiments of the present disclosure. As shown, DBCA-FE block 502 may include, e.g., a plurality of DO-Conv blocks 510, a plurality of PReLU blocks 512, a plurality of residual blocks (RDB) 514, a plurality of CAB 516, and a concatenator 518.
[0107] Referring to FIG. 5B, the branch that includes the plurality of RDB 514 learns the positional coordinate information of the target itself, while the branch that includes the plurality of CAB 516 uses an X and Y-axis optical flow fusion mechanism to parse the two-dimensional motion direction information of each point.
[0108] FIG. 5C illustrates a detailed block diagram of CAB 516 included in the DBCA-FE block 502 of FIG. 5B, according to some embodiments of the present disclosure. As shown, CAB 516 may include, e.g., an RDB 520, an X-axis average pooling block 522a, a Y-axis average pooling block 522b, a concatenator / convolutional block 524, a batch normalization / non-linear block 526, a pair of convolutional blocks 528, a pair of sigmoid blocks 530, and a re-weight block 532.
[0109] Referring to FIG. 5C, to enable CAB 516 to capture remote spatial interactions with precise positional information, global pooling may be decomposed into a pair of one-dimensional feature encoding operations performed by X-axis average pooling block 522a and Y-axis average pooling block 522b according to expression (3) :
[0110] According to expression (4) , given input X, pooling kernels of size (H, 1) or (1, W) may be used to encode each channel along the horizontal and vertical coordinates, respectively.
[0111] The above two transformations aggregate features along two spatial directions to obtain a pair of directional perception feature maps. After the transformation in the information embedding, concatenator / convolutional block 524 concatenates the above transformations and then uses a convolutional transformation function to transform the result according to the Concat([ph, pw]) term in expression (5). Batch normalization / nonlinear block 526 may use up to or more than sixteen pictures to generate the δ term in expression (5):
$$C = \delta\big(\mathrm{Concat}([p_h, p_w])\big) \qquad (5)$$
[0112] The integrated transformation is split into Ch and Cw terms by convolutional blocks 528 to extract attention information from two directions according to expression (6). The Ch term may be input into one sigmoid block 530 and the Cw term may be input into another sigmoid block 530. Then, a transformation may be performed according to expressions (7) and (8) such that one sigmoid block 530 outputs Th and the other sigmoid block 530 outputs Tw:
$$[C_h, C_w] = \mathrm{Split}(C) \qquad (6)$$
$$T_h = \sigma\big(F_h(C_h)\big) \qquad (7)$$
$$T_w = \sigma\big(F_w(C_w)\big) \qquad (8)$$
[0113] Finally, re-weight block 532 may output Y, which may be expressed according to expression (9), where X(m, n) are the original inputs:
$$Y(m, n) = X(m, n) \times T_h(m) \times T_w(n) \qquad (9)$$
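The data flow of expressions (5) through (9) can be traced with a single-channel sketch. Several simplifications are assumptions made only for illustration: the learned convolutions are replaced by scalar weights (`w_shared`, `w_h`, `w_w`), batch normalization is omitted, and ReLU stands in for the non-linearity δ. The function name `coordinate_attention` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coordinate_attention(x, w_shared=1.0, w_h=1.0, w_w=1.0):
    """Re-weight an H x W map `x` with row-wise and column-wise gates."""
    H, W = len(x), len(x[0])
    # Directional average pooling: one value per row, one per column.
    p_h = [sum(row) / W for row in x]
    p_w = [sum(x[m][n] for m in range(H)) / H for n in range(W)]
    # Expression (5): concatenate, shared transform, non-linearity
    # (scalar weight and ReLU are stand-ins; batch norm is omitted).
    c = [max(0.0, w_shared * v) for v in p_h + p_w]
    # Expression (6): split back into the two directions.
    c_h, c_w = c[:H], c[H:]
    # Expressions (7) and (8): per-direction transform followed by sigmoid.
    t_h = [sigmoid(w_h * v) for v in c_h]
    t_w = [sigmoid(w_w * v) for v in c_w]
    # Expression (9): Y(m, n) = X(m, n) * T_h(m) * T_w(n).
    return [[x[m][n] * t_h[m] * t_w[n] for n in range(W)] for m in range(H)]


x = [[1.0, 2.0], [3.0, 4.0]]
y_neutral = coordinate_attention(x, w_h=0.0, w_w=0.0)  # both gates equal 0.5
y_default = coordinate_attention(x)
```

With zero direction weights every gate is sigmoid(0) = 0.5, so each output is the input scaled by 0.25; with non-zero weights, rows and columns with larger pooled responses receive gates closer to 1.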
[0114] FIG. 5D illustrates a diagram of feature map visualization 550 of the DBCA-ME block 304 of FIG. 5A, according to some embodiments of the present disclosure.
[0115] FIG. 6A illustrates a detailed block diagram of FWFB 306 included in LMRFG network 282 of FIG. 3, according to some embodiments of the present disclosure. As shown, FWFB 306 may include, e.g., a first warping block 602a, a second warping block 602b, a concatenator 604, and a fusion block 606.
[0116] Referring to FIG. 6A, HR features 403, forward optical flow 505a, and backward optical flow 505b are distorted by first warping block 602a to generate feature information 601a (e.g., forward warped / backward warped features, i.e., distorted high-dimensional features). Then, first reference pictures 301a, 301b, forward optical flow 505a, and backward optical flow 505b are warped to generate a first pair of warped pictures 601b (e.g., distorted pictures). Concatenator 604 may concatenate HR features 403, forward optical flow 505a, backward optical flow 505b, forward warped / backward warped features 601a, and first pair of warped pictures 601b to generate a first set of features 603a for a first picture in the pair of warped pictures 601b and a second set of features 603b for a second picture in the pair of warped pictures 601b. Next, fusion block 606 may fuse the first set of features 603a for the first picture and the second set of features 603b for the second picture to generate second reference picture 303. Additional details of fusion block 606 are provided below in connection with FIG. 6B.
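Warping a picture (or a feature channel) with an optical flow map is commonly implemented as backward warping with bilinear sampling. The sketch below assumes that scheme together with border clamping; the text does not state which sampling method the warping blocks use, and the function name `warp` is hypothetical.

```python
import math


def warp(picture, flow):
    """Backward-warp `picture` (H x W floats) with `flow` (H x W of (dx, dy)).

    Each output pixel (x, y) is sampled from the input at (x + dx, y + dy).
    Bilinear interpolation and border clamping are assumptions.
    """
    H, W = len(picture), len(picture[0])

    def sample(xf, yf):
        # Clamp the sampling position to the picture, then interpolate.
        xf = min(max(xf, 0.0), W - 1.0)
        yf = min(max(yf, 0.0), H - 1.0)
        x0, y0 = int(math.floor(xf)), int(math.floor(yf))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
        ax, ay = xf - x0, yf - y0
        top = picture[y0][x0] * (1 - ax) + picture[y0][x1] * ax
        bottom = picture[y1][x0] * (1 - ax) + picture[y1][x1] * ax
        return top * (1 - ay) + bottom * ay

    return [[sample(x + flow[y][x][0], y + flow[y][x][1]) for x in range(W)]
            for y in range(H)]


picture = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(2)]
shift_flow = [[(1.0, 0.0)] * 3 for _ in range(2)]  # sample one pixel to the right
half_flow = [[(0.5, 0.0)] * 3 for _ in range(2)]   # sub-pixel displacement

identity = warp(picture, zero_flow)
shifted = warp(picture, shift_flow)
halved = warp(picture, half_flow)
```

The same routine can be applied to each reference picture with its own flow (forward or backward) to obtain a pair of warped pictures, or channel by channel to a feature map.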
[0117] FIG. 6B illustrates a detailed block diagram of fusion block 606 included in FWFB 306 of FIG. 6A, according to some embodiments of the present disclosure. As shown, fusion block 606 may include, e.g., a first U-network (U-Net) 610a, a second U-Net 610b, and a concatenator 612.
[0118] Referring to FIG. 6B, first U-Net 610a may fuse features, pictures, and optical flow to generate two pictures similar to the first reference pictures 301a, 301b, concatenator 612 may combine the two pictures, and finally second U-Net 610b may fuse these two pictures to generate the second reference picture 303.
[0119] FIG. 7 illustrates a diagram of an exemplary DO-Conv 700 that may be used in LMRFG network 282 of FIG. 3, according to some embodiments of the present disclosure.
[0120] Referring to FIG. 7, DO-Conv decomposes a standard convolution operation into two simpler operations: depthwise convolution and pointwise convolution. Depthwise convolution first convolves each input channel separately, rather than convolving all input channels together. This means that if C channels are input and K × K convolution kernels are used, depthwise convolution requires only C kernels of size K × K. Then, the output of the depthwise convolution is combined point by point using a 1 × 1 convolution kernel, which integrates information from different channels. Compared to standard convolution, DO-Conv significantly reduces the number of parameters. The traditional convolution parameter size is C × K × K × F (where F is the number of output channels), while the DO-Conv parameter size is C × K × K + C × F. Because C × K × K + C × F is much smaller than C × K × K × F for typical layer sizes, LMRFG achieves excellent results in the complexity and parameter comparisons.
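The two parameter counts given above can be checked with a few lines of arithmetic. The layer size used here (64 input channels, 3 × 3 kernels, 64 output channels) is an illustrative assumption and is not taken from the disclosure; bias terms are ignored, as in the formulas above.

```python
def standard_conv_params(c, k, f):
    """Weights of a standard convolution: C * K * K * F."""
    return c * k * k * f


def decomposed_conv_params(c, k, f):
    """Weights of the two-stage decomposition: C * K * K + C * F."""
    return c * k * k + c * f


C, K, F = 64, 3, 64  # hypothetical layer size
standard = standard_conv_params(C, K, F)      # 36864
decomposed = decomposed_conv_params(C, K, F)  # 4672
ratio = standard / decomposed                 # roughly 7.9x fewer weights
```

The saving grows with the kernel size and the number of output channels, since the ratio equals K × K × F / (K × K + F).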
[0121] FIG. 8A illustrates a first exemplary progressive learning strategy 800 for an LMRFG network based on QP distance, according to some embodiments of the present disclosure. FIG. 8B illustrates a second exemplary progressive learning strategy 825 for an LMRFG network based on QP distance, according to some embodiments of the present disclosure. A traditional training strategy that takes uncompressed data as the label is depicted in FIG. 8A, while FIG. 8B depicts a QP-distance-based training strategy that takes compressed data at a lower QP as the label.
[0122] Referring to FIGs. 8A and 8B, it should be mentioned that compression artifacts are unique to the field of image / video coding. These artifacts, which include ringing, blurring, and blocking effects, are caused by compression, and their impact needs to be carefully considered. In general, to train compression models, the uncompressed raw data are used as the label (i.e., ground truth). However, this disclosure introduces an effective QP-distance training strategy to optimize the LMRFG network. As shown in FIGs. 8A and / or 8B, compressed data at a higher QP (data with larger compression artifacts) may be used as the input and compressed data at a lower QP (data with smaller compression artifacts) as the label, instead of using uncompressed raw data as labels. The advantages of the QP-distance strategy include, e.g., keeping data consistency between input and label and modeling compression artifacts between input and label.
[0123] With respect to keeping data consistency between input and label, since the input and label are both compressed data with compression artifacts and the distortion of the label is smaller than that of the input, the proposed QP-distance based training strategy keeps the training data consistent. Thus, the proposed QP-distance based training strategy reduces the QP gap between input and label during training while keeping the balance of the QP gap between input and label, thus making the trained model effective.
[0124] With respect to modeling of compression artifacts between input and label, for video coding, compression artifacts are usually introduced by the compression algorithms (such as best-block matching and quantization) inside the encoders. The compression artifacts are regarded as a kind of noise. The QP-distance based training strategy ensures that the input and label carry the same kind of noise, thus forcing the proposed network to capture an accurate and robust relationship between input and label during training.
[0125] For the training stage, the QP distance between input and label is set to 10. That is, if the input for the proposed network is compressed data with QPs of {42, 37, 32, 27, 22}, the corresponding label is compressed data with QPs of {32, 27, 22, 17, 12}. This training scheme can reduce the learning difficulty of the model by guiding the statistical distribution of data closer to the higher-quality video pictures with smaller compression artifacts. The present disclosure adopts the L1 function as the only loss measure according to expression (10), where L1 denotes the loss function, i denotes each pixel point, LMRFG denotes the prediction function, y_i^{rec1} denotes a reference picture, y_i^{rec2} denotes another reference picture, and y_i^{curr} denotes the label value of the current picture. For the RA configuration, y_i^{rec1} denotes the first picture before the current picture and y_i^{rec2} denotes the first picture after the current picture. For the LDB configuration, y_i^{rec1} denotes the second picture before the current picture, y_i^{rec2} denotes the first picture before the current picture, and y_i^{curr} denotes the real value of the current picture.
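The input/label pairing, the reference picture selection for each configuration, and an L1 loss can be sketched as follows. The helper names (`label_qp`, `reference_offsets`, `l1_loss`) are hypothetical, and averaging the absolute error over pixels is an assumption, since the exact form of expression (10) is not reproduced here.

```python
QP_DISTANCE = 10  # distance between input QP and label QP during training


def label_qp(input_qp, distance=QP_DISTANCE):
    """QP of the label picture paired with an input compressed at `input_qp`."""
    return input_qp - distance


def reference_offsets(config):
    """Picture offsets (relative to the current picture) of the two inputs."""
    if config == "RA":
        return (-1, +1)   # first picture before, first picture after
    if config == "LDB":
        return (-2, -1)   # second picture before, first picture before
    raise ValueError("unknown configuration: " + config)


def l1_loss(predicted, label):
    """Mean absolute error over all pixels (mean reduction is an assumption)."""
    total, count = 0.0, 0
    for pred_row, label_row in zip(predicted, label):
        for p, t in zip(pred_row, label_row):
            total += abs(p - t)
            count += 1
    return total / count


pairs = [(qp, label_qp(qp)) for qp in (42, 37, 32, 27, 22)]
loss = l1_loss([[1.0, 2.0]], [[0.0, 4.0]])
```

In a training loop, `predicted` would be the network output for the two reference pictures compressed at the input QP, and `label` the current picture compressed at the paired lower QP.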
[0126] FIG. 9A illustrates exemplary average Bjøntegaard delta (BD)-rate (%) results 900 on a set of test sequences in terms of peak signal-to-noise ratio (PSNR), according to some embodiments of the present disclosure.
[0127] Referring to FIG. 9A, BD-rate results of the proposed LMRFG network in terms of PSNR, when compared to the anchor under random access (RA) and low delay B (LDB) configurations, are shown. Each sequence is compressed with five QP values: 22, 27, 32, 37, and 42. Then, an average BD-rate is calculated in the table. From the table shown in FIG. 9A, it can be observed that the proposed LMRFG network achieves a significant improvement in coding efficiency over the VTM11.0-NNVC10.0 anchor on every sequence. For instance, there are average bitrate savings of {RA: 4.31%, 6.54%, 7.05%} and {LDB: 3.81%, 8.90%, 8.77%} on the {Y, U, V} components in terms of PSNR.
[0128] FIG. 9B illustrates an exemplary comparison of BD-rate (%) reduction 905 between a LMRFG network and JVET proposals under an RA configuration, according to some embodiments of the present disclosure.
[0129] Referring to FIG. 9B, for RA configuration, JVET proposals are compared with the proposed LMRFG network. JVET-AD0160 to VTM11.0-NNVC10.0 were embedded for testing, and the table in FIG. 9B shows the comparison results of BD-rates in PSNR between existing JVET methods and the proposed LMRFG network on all test datasets. The proposed LMRFG network may have the most advanced performance among existing deep reference frame generation schemes. As shown in FIG. 9B, in terms of BD-rate measured by PSNR, LMRFG performs better than JVET-AD0160 in the {Y, U, V} components by {1.94%, 4.01%, 3.86%} and better than JVET-AJ0099 in the {Y, U, V} components by {0.83%, 2.16%, 2.76%}.
[0130] FIG. 9C illustrates an exemplary comparison of BD-rate (%) reduction 910 between a LMRFG network and JVET proposals under an LDB configuration, according to some embodiments of the present disclosure.
[0131] Referring to FIG. 9C, for LDB configuration, the proposed LMRFG network is compared with JVET proposals. JVET-AD0160 to VTM11.0-NNVC10.0 were integrated for testing, and the table of FIG. 9C shows the comparison results between existing JVET methods and the proposed LMRFG network. It can be found that LMRFG consistently outperforms JVET proposals on all test videos. For instance, LMRFG under the LDB configuration reaches average coding gains of {3.81%, 8.90%, 8.77%} for the {Y, U, V} components. The proposed LMRFG network under LDB configuration exceeds JVET-AD0160 by about {2.77%, 5.07%, 5.46%} and exceeds JVET-AJ0099 by about {0.21%, 4.96%, 8.46%} in terms of average BD-rate measured by PSNR. Comparing the tables in FIGs. 9B and 9C, it can be concluded that the proposed LMRFG network has more significant advantages in the U and V channels under LDB configuration.
[0132] FIG. 9D illustrates an exemplary complexity comparison 915 of existing methods and the present LMRFG network, according to some embodiments of the present disclosure.
[0133] Referring to FIG. 9D, the complexity comparison between JVET proposals and the LMRFG network is shown. For instance, for RA and LDB configurations, the proposed LMRFG network has a multiply-accumulate (MAC) count of 546.65 kMAC / pixel and 546.65 kMAC / pixel, respectively. For RA and LDB configurations, the MAC count for JVET-AD0160 is 986 kMAC / pixel and 1192 kMAC / pixel, respectively, while for JVET-AJ0099 it is 727 kMAC / pixel and 727 kMAC / pixel, respectively. For RA and LDB configurations, the various complexities of LMRFG are lower than those of JVET-AD0160, while LMRFG significantly reduces the kMAC count while maintaining the same number of parameters as JVET-AJ0099. It should be noted that LMRFG is a unified network in both RA and LDB configurations in VVC. That is, the network architecture for RA and LDB configurations is the same, only with different inputs. In addition, compared to JVET-AD0160 and JVET-AJ0099, LMRFG achieves higher encoding gain with lower complexity.
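The relative MAC savings implied by the figures quoted above can be worked out directly. The snippet uses only the kMAC / pixel values stated in this paragraph; the dictionary layout and the function name `mac_reduction_percent` are illustrative assumptions.

```python
# kMAC / pixel values as quoted in the paragraph above.
KMAC_PER_PIXEL = {
    "LMRFG": {"RA": 546.65, "LDB": 546.65},
    "JVET-AD0160": {"RA": 986.0, "LDB": 1192.0},
    "JVET-AJ0099": {"RA": 727.0, "LDB": 727.0},
}


def mac_reduction_percent(baseline, config, proposed="LMRFG"):
    """Percentage by which `proposed` lowers the MAC count of `baseline`."""
    base = KMAC_PER_PIXEL[baseline][config]
    return 100.0 * (1.0 - KMAC_PER_PIXEL[proposed][config] / base)


vs_ad0160_ra = mac_reduction_percent("JVET-AD0160", "RA")    # about 44.6 %
vs_ad0160_ldb = mac_reduction_percent("JVET-AD0160", "LDB")  # about 54.1 %
vs_aj0099_ra = mac_reduction_percent("JVET-AJ0099", "RA")    # about 24.8 %
```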
[0134] FIG. 10A is a first table illustrating average RD results 1000 on all test sequences under RA configuration in terms of PSNR for a first dataset, according to some embodiments of the present disclosure. FIG. 10B is a second table illustrating second average RD results 1005 on all test sequences under RA configuration in terms of PSNR for a second dataset, according to some embodiments of the present disclosure. FIG. 10C is a third table illustrating third average RD results 1010 on all test sequences under RA configuration in terms of PSNR for a third dataset, according to some embodiments of the present disclosure. FIG. 10D is a fourth table illustrating fourth average RD results 1015 on all test sequences under RA configuration in terms of PSNR for a fourth dataset, according to some embodiments of the present disclosure. FIG. 10E is a fifth table illustrating fifth average RD results 1020 on all test sequences under LDB configuration in terms of PSNR for a fifth dataset, according to some embodiments of the present disclosure. FIG. 10F is a sixth table illustrating sixth average RD results 1025 on all test sequences under LDB configuration in terms of PSNR for a sixth dataset, according to some embodiments of the present disclosure. FIG. 10G is a seventh table illustrating seventh average RD results 1030 on all test sequences under LDB configuration in terms of PSNR for a seventh dataset, according to some embodiments of the present disclosure. FIG. 10H is an eighth table illustrating eighth average RD results 1035 on all test sequences under LDB configuration in terms of PSNR for an eighth dataset, according to some embodiments of the present disclosure. FIGs. 10A-10H will be described together.
[0135] In FIGs. 10A-10D, the RD curves of the proposed LMRFG network for all test categories on the Y channel under RA configuration are shown. In FIGs. 10E-10H, the RD curves of the proposed LMRFG network for all test categories on the Y channel under LDB configuration are shown. It can be observed that the RD curves of the proposed method are higher than the baseline VTM11.0-NNVC10.0 in both RA and LDB configurations. Under the same PSNR quality, the proposed LMRFG network greatly saves encoding bits.
[0136] FIG. 11A illustrates a first contribution visualization 1100 of reference pictures, according to some embodiments of the present disclosure. FIG. 11B illustrates a second contribution visualization 1105 of reference pictures, according to some embodiments of the present disclosure. Referring to FIGs. 11A and 11B, the contributions of forward optical flow and backward optical flow to different generated reference pictures are shown.
[0137] FIG. 12 illustrates a flow chart of an exemplary method 1200 of video decoding, according to some embodiments of the present disclosure. Method 1200 may be performed by an apparatus such as decoding apparatus 20, 250, decoding unit 22, LMRFG network 282, MFE block 302, DBCA-ME block 304, FWFB 306, etc. Method 1200 may include operations 1202-1212 as described below. It is understood that some of the operations may be optional, and some of the operations may be performed simultaneously or in an order different from that shown in FIG. 12.
[0138] At 1202, the apparatus may obtain a first reference picture from a decoded picture buffer. Referring to FIG. 2C, a first reference picture may be obtained from decoded picture buffer 280.
[0139] [Rectified under Rule 91, 06.02.2025] At 1204, the apparatus may generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. In some implementations, the multi-scale feature extraction may include at least one DO-Conv. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by downsampling the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture.
In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by performing a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map. For example, referring to FIG. 4, MFE block 302 may first perform multi-scale feature extraction on two reference pictures 301a, 301b downsampled three times. Then, multi-scale feature extraction is performed on a series of pictures with different scales to generate HR features 403 and SubR features 405, respectively. HR features 403 capture the overall structure and shape of the image and are used for subsequent warping of the synthesized image and may be generated based on a concatenation of the HR features from the same scales of different image pyramids. For instance, first concatenator 406a may concatenate HR features for 1 / 2 size in the first and second picture pyramids, and second concatenator 406b may concatenate HR features for the 1 / 4 size in the first, second, and third picture pyramids. On the other hand, SubR features 405 may capture image details and textures for optical flow estimation of subsequent motion and may be generated as a summation of SubR features 401a from the first image pyramid, SubR features 401b from the second image pyramid, and third SubR features 401c from the third image pyramid. SubR features 405 may be used as input to DBCA-ME block 304, while HR features 403 may be used as input to FWFB 306.
[0140] At 1206, the apparatus may generate an optical flow map based on the first feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, the apparatus may generate the optical flow map based on the first feature map by aggregating features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, the apparatus may generate the optical flow map based on the first feature map by extracting attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating the optical flow map by warping the forward flow information and the backward flow information. For example, referring to FIG. 5A, due to the difficulty of directly estimating optical flow with high resolution, a strategy based on hierarchical learning from low to high resolution is performed by DBCA-ME block 304.
In addition, optical flow proportional amplifiers are used during upsampling to amplify the optical flow, which largely relies on the intuition that high-resolution motion should be greater than low-resolution motion, thereby greatly increasing the number of pixels available for motion estimation. Plurality of first optical flow proportional amplifiers 504a may amplify the forward optical flows, while plurality of second optical flow proportional amplifiers 504b may amplify backward optical flows. First plurality of warping blocks 506a may warp the forward optical flows, while second plurality of warping blocks 506b may warp the backward optical flows. Referring to FIG. 5B, the branch that includes the plurality of RDB 514 learns the positional coordinate information of the target itself, while the branch that includes the plurality of CAB 516 uses an X and Y-axis optical flow fusion mechanism to parse the two-dimensional motion direction information of each point. Referring to FIG. 5C, to enable CAB 516 to capture remote spatial interactions with precise positional information, global pooling may be decomposed into a pair of one-dimensional feature encoding operations performed by X-axis average pooling block 522a and Y-axis average pooling block 522b according to expression (3) (shown above). According to expression (4) (shown above), given input X, pooling kernels of size (H, 1) or (1, W) may be used to encode each channel along the horizontal and vertical coordinates, respectively. The above two transformations aggregate features along two spatial directions to obtain a pair of directional perception feature maps. After the transformation in the information embedding, concatenator / convolutional block 524 concatenates the above transformations and then uses a convolutional transformation function to transform the result according to the Concat([ph, pw]) term in expression (5) (shown above).
Batch normalization / nonlinear block 526 may use up to or more than sixteen pictures to generate the δ term in expression (5) (shown above). The integrated transformation is split into Ch and Cw terms by convolutional blocks 528 to extract attention information from two directions according to expression (6) (shown above). The Ch term may be input into one sigmoid block 530 and the Cw term may be input into another sigmoid block 530. Then, a transformation may be performed according to expressions (7) and (8) (shown above) such that one sigmoid block 530 outputs Th and the other sigmoid block 530 outputs Tw. Finally, re-weight block 532 may output Y, which may be expressed according to expression (9) (shown above).
[0141] At 1208, the apparatus may generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. In some implementations, the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating, by the processor, a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating, by the processor, second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating, by the processor, the second reference picture by fusing the first set of warped pictures and the second warped feature information. For example, referring to FIG. 
6A, HR features 403, forward optical flow 505a, and backward optical flow 505b are distorted by first warping block 602a to generate feature information 601a (e.g., forward warped / backward warped features, i.e., distorted high-dimensional features). Then, first reference pictures 301a, 301b, forward optical flow 505a, and backward optical flow 505b are warped to generate a first pair of warped pictures 601b (e.g., distorted pictures). Concatenator 604 may concatenate HR features 403, forward optical flow 505a, backward optical flow 505b, forward warped / backward warped features 601a, and first pair of warped pictures 601b to generate a first set of features 603a for a first picture in the pair of warped pictures 601b and a second set of features 603b for a second picture in the pair of warped pictures 601b. Next, fusion block 606 may fuse the first set of features 603a for the first picture and the second set of features 603b for the second picture. Referring to FIG. 6B, first U-Net 610a may fuse features, pictures, and optical flow to generate two pictures similar to the first reference pictures 301a, 301b, concatenator 612 may combine the two pictures, and finally second U-Net 610b may fuse these two pictures to generate the second reference picture 303.
[0142] At 1210, the apparatus may add the second reference picture into a reference picture list. For example, referring to FIG. 2C, LMRFG network 282 may add the second reference picture to reference picture list 284.
[0143] At 1212, the apparatus may decode a current picture region based on the second reference picture. Referring to FIG. 2C, video coder 255 may decode a current picture region based on the second reference picture added to reference picture list 284.
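The order of operations 1202 through 1212 can be summarized as a short driver routine. This is only an outline under stated assumptions: the four callables stand in for the MFE block, the DBCA-ME block, the FWFB, and the video coder, all names are hypothetical, and taking the most recent picture in the buffer as the first reference picture is an illustrative choice.

```python
def run_method_1200(decoded_picture_buffer, reference_picture_list,
                    extract_features, estimate_flow, warp_and_fuse,
                    decode_region):
    """Outline of operations 1202-1212 with placeholder callables."""
    # 1202: obtain a first reference picture (most recent one, by assumption).
    first_ref = decoded_picture_buffer[-1]
    # 1204: multi-scale feature extraction yields two feature maps.
    first_map, second_map = extract_features(first_ref)
    # 1206: the optical flow map is estimated from the first feature map.
    flow_map = estimate_flow(first_map)
    # 1208: flow warping and fusion produce the second reference picture.
    second_ref = warp_and_fuse(first_ref, second_map, flow_map)
    # 1210: add the generated picture to the reference picture list.
    reference_picture_list.append(second_ref)
    # 1212: decode the current picture region using the new reference.
    return decode_region(second_ref)


# Stub callables that only record the data flow.
dpb = ["rec0", "rec1"]
ref_list = []
result = run_method_1200(
    dpb, ref_list,
    extract_features=lambda ref: ("sub:" + ref, "hr:" + ref),
    estimate_flow=lambda feature_map: "flow(" + feature_map + ")",
    warp_and_fuse=lambda ref, hr, flow: "gen(" + ref + ")",
    decode_region=lambda ref: "decoded with " + ref,
)
```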
[0144] FIG. 13 illustrates a flow chart of an exemplary method 1300 of video encoding, according to some embodiments of the present disclosure. Method 1300 may be performed by an apparatus such as encoding apparatus 10, 200, encoding unit 12, LMRFG network 282, MFE block 302, DBCA-ME block 304, FWFB 306, etc. Method 1300 may include operations 1302-1312 as described below. It is understood that some of the operations may be optional, and some of the operations may be performed simultaneously or in an order different from that shown in FIG. 13.
[0145] At 1302, the apparatus may obtain a first reference picture from a decoded picture buffer. Referring to FIG. 2C, a first reference picture may be obtained from decoded picture buffer 280.
[0146] [Rectified under Rule 91, 06.02.2025] At 1304, the apparatus may generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. In some implementations, the multi-scale feature extraction may include at least one DO-Conv. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by downsampling the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features. In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by generating first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture.
In some implementations, the apparatus may generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture by performing a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map. For example, referring to FIG. 4, MFE block 302 may first perform multi-scale feature extraction on two reference pictures 301a, 301b downsampled three times. Then, multi-scale feature extraction is performed on a series of pictures with different scales to generate HR features 403 and SubR features 405, respectively. HR features 403 capture the overall structure and shape of the image and are used for subsequent warping of the synthesized image and may be generated based on a concatenation of the HR features from the same scales of different image pyramids. For instance, first concatenator 406a may concatenate HR features for 1 / 2 size in the first and second picture pyramids, and second concatenator 406b may concatenate HR features for the 1 / 4 size in the first, second, and third picture pyramids. On the other hand, SubR features 405 may capture image details and textures for optical flow estimation of subsequent motion and may be generated as a summation of SubR features 401a from the first image pyramid, SubR features 401b from the second image pyramid, and third SubR features 401c from the third image pyramid. SubR features 405 may be used as input to DBCA-ME block 304, while HR features 403 may be used as input to FWFB 306.
[0147] At 1306, the apparatus may generate an optical flow map based on the first feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, the apparatus may generate the optical flow map based on the first feature map by aggregating features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, the apparatus may generate the optical flow map based on the first feature map by extracting attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, the apparatus may generate the optical flow map based on the first feature map by generating the optical flow map by warping the forward flow information and the backward flow information. For example, referring to FIG. 5A, because directly estimating optical flow at high resolution is difficult, DBCA-ME block 304 performs a strategy based on hierarchical learning from low to high resolution. 
In addition, optical flow proportional amplifiers are used during upsampling to amplify the optical flow, which largely relies on the intuition that high-resolution motion should be greater than low-resolution motion, thereby greatly increasing the number of pixels available for motion estimation. A plurality of first optical flow proportional amplifiers 504a may amplify the forward optical flows, while a plurality of second optical flow proportional amplifiers 504b may amplify the backward optical flows. A first plurality of warping blocks 506a may warp the forward optical flows, while a second plurality of warping blocks 506b may warp the backward optical flows. Referring to FIG. 5B, the branch that includes the plurality of RDB 514 learns the positional coordinate information of the target itself, while the branch that includes the plurality of CAB 516 uses an X and Y-axis optical flow fusion mechanism to parse the two-dimensional motion direction information of each point. Referring to FIG. 5C, to enable CAB 516 to capture long-range spatial interactions with precise positional information, global pooling may be decomposed into a pair of one-dimensional feature encoding operations performed by X-axis average pooling block 522a and Y-axis average pooling block 522b according to expression (3) (shown above). According to expression (4) (shown above), given input X, pooling kernels of size (H, 1) or (1, W) may be used to encode each channel along the horizontal and vertical coordinates, respectively. The above two transformations aggregate features along two spatial directions to obtain a pair of directional perception feature maps. After the transformation in the information embedding, concatenator / convolutional block 524 concatenates the outputs of the above transformations and then applies a convolutional transformation function to the result according to the Concat([ph, pw]) term in expression (5) (shown above). 
Batch normalization / nonlinear block 526 may use up to or more than sixteen pictures to generate the δ term in expression (5) (shown above). The integrated transformation is split into Ch and Cw terms by convolutional blocks 528 to extract attention information from the two directions according to expression (6) (shown above). The Ch term may be input into one sigmoid block 530 and the Cw term may be input into another sigmoid block 530. Then, a transformation may be performed according to expressions (7) and (8) (shown above) such that one sigmoid block 530 outputs Th and the other sigmoid block 530 outputs Tw. Finally, re-weight block 532 may output Y, which may be expressed according to expression (9) (shown above).
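By way of a non-limiting illustration, the sequence of directional pooling, concatenation, convolutional transformation, splitting, sigmoid activation, and re-weighting described above may be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function name coordinate_attention and the weight matrices w_mix, w_h, and w_w are hypothetical, the 1x1 convolutions are modeled as channel-mixing matrix products with random weights, a ReLU stands in for the nonlinearity, and batch normalization is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w_mix, w_h, w_w):
    """x: (C, H, W); w_mix: (M, C); w_h, w_w: (C, M). All weights are hypothetical."""
    c, h, w = x.shape
    p_h = x.mean(axis=2)                       # (C, H): one value per row of each channel
    p_w = x.mean(axis=1)                       # (C, W): one value per column of each channel
    cat = np.concatenate([p_h, p_w], axis=1)   # (C, H + W): concatenated directional maps
    mixed = np.maximum(w_mix @ cat, 0.0)       # 1x1 convolution + ReLU (batch norm omitted)
    c_h, c_w = mixed[:, :h], mixed[:, h:]      # split back into the two directions
    t_h = sigmoid(w_h @ c_h)                   # (C, H) attention weights along height
    t_w = sigmoid(w_w @ c_w)                   # (C, W) attention weights along width
    return x * t_h[:, :, None] * t_w[:, None, :]  # re-weight the input

rng = np.random.default_rng(0)
C, H, W, M = 4, 6, 8, 2
x = rng.random((C, H, W))
y = coordinate_attention(
    x,
    rng.standard_normal((M, C)),
    rng.standard_normal((C, M)),
    rng.standard_normal((C, M)),
)
```

Because each attention weight lies between 0 and 1, the output in this sketch has the same shape as the input and never exceeds it in magnitude; every position is scaled by the product of its row weight and its column weight.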
[0148] At 1308, the apparatus may generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. In some implementations, the apparatus may generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map by generating the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, the apparatus may generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map by generating the second reference picture by fusing the set of warped pictures and the second warped feature information. For example, referring to FIG. 6A, HR features 403, forward optical flow 505a, and backward optical flow 505b are warped by first warping block 602a to generate feature information 601a (e.g., forward warped / backward warped features, i.e., distorted high-dimensional features). Then, first reference pictures 301a, 301b, forward optical flow 505a, and backward optical flow 505b are warped to generate a first pair of warped pictures 601b (e.g., distorted pictures). Concatenator 604 may concatenate HR features 403, forward optical flow 505a, backward optical flow 505b, forward warped / backward warped features 601a, and first pair of warped pictures 601b to generate a first set of features 603a for a first picture in the pair of warped pictures 601b and a second set of features 603b for a second picture in the pair of warped pictures 601b. Next, fusion block 606 may fuse the first set of features 603a for the first picture and the second set of features 603b for the second picture. Referring to FIG. 6B, first U-Net 610a may fuse features, pictures, and optical flow to generate two pictures similar to the first reference pictures 301a, 301b, concatenator 612 may combine the two pictures, and finally second U-Net 610b may fuse these two pictures to generate the second reference picture 303.
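By way of a non-limiting illustration, warping a reference picture with an optical flow map may be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function names warp and fuse are hypothetical, the warping uses bilinear sampling with border clamping as an assumed sampling rule, and the plain average in fuse is only a placeholder for the learned U-Net fusion described above.

```python
import numpy as np

def warp(picture, flow):
    """Sample picture at (x + flow_x, y + flow_y) using bilinear interpolation.

    picture: (H, W); flow: (2, H, W) holding x and y displacements in pixels.
    Sampling positions outside the picture are clamped to the border.
    """
    h, w = picture.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax, ay = sx - x0, sy - y0  # fractional offsets inside the pixel cell
    top = (1 - ax) * picture[y0, x0] + ax * picture[y0, x1]
    bottom = (1 - ax) * picture[y1, x0] + ax * picture[y1, x1]
    return (1 - ay) * top + ay * bottom

def fuse(warped_a, warped_b):
    """Placeholder fusion: plain average (the disclosure uses learned U-Nets)."""
    return 0.5 * (warped_a + warped_b)

ref_a = np.arange(16, dtype=float).reshape(4, 4)
ref_b = ref_a.copy()
zero_flow = np.zeros((2, 4, 4))
shift_flow = np.zeros((2, 4, 4))
shift_flow[0] = 1.0  # every pixel samples from one column to its right

warped_a = warp(ref_a, zero_flow)
warped_b = warp(ref_b, shift_flow)
fused = fuse(warped_a, warped_b)
```

The same warp function could be applied channel by channel to a feature map, which corresponds to the feature warping step in the sketch's simplified view.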
[0149] At 1310, the apparatus may add the second reference picture into a reference picture list. For example, referring to FIG. 2C, LMRFG network 282 may add the second reference picture to reference picture list 284.
[0150] At 1312, the apparatus may decode a current picture region based on the second reference picture. For example, referring to FIG. 2C, video coder 255 may decode a current picture region based on the second reference picture added to reference picture list 284.
[0151] In conclusion, the present disclosure proposes an LMRFG network for VVC inter coding that is integrated into the VVC reference software. The LMRFG network mainly includes three major modules, namely MFE, DBCA-ME, and FWFB. In the MFE module, multi-scale features are extracted from two pictures to generate HR and SubR features. The HR features are used for subsequent picture warping and fusion, while the SubR features are used to predict optical flow at different scales. Subsequently, the DBCA-ME module performs dual-branch coordinated attention motion estimation. The residual block module branch learns the position coordinate information of the target itself, while the coordinate attention block module branch uses the X and Y axis optical flow fusion mechanism to parse the two-dimensional motion direction information of each point. Directional flow is generated based on a hierarchical learning strategy from low resolution to high resolution, using an optical flow amplifier for upsampling. In addition, a training strategy based on QP distance is introduced in the present disclosure to effectively optimize the proposed LMRFG network. The experimental results demonstrate the effectiveness and superiority of the proposed method over existing state-of-the-art methods such as JVET-AD0160 and JVET-AJ0099. Compared with the VTM11.0-NNVC10.0 anchor under RA and LDB configurations, the proposed LMRFG network achieves average BD rate reductions in PSNR metrics of {RA: 4.31%, 6.54%, 7.05%} and {LDB: 3.81%, 8.90%, 8.77%}. LMRFG outperforms the state-of-the-art reference frame generation method JVET-AD0160 by {1.94% (Y), 4.01% (U), 3.86% (V)} and {2.77% (Y), 5.07% (U), 5.46% (V)} in RA and LDB configurations, respectively. LMRFG outperforms the state-of-the-art reference frame generation method JVET-AJ0099 by {0.83% (Y), 2.16% (U), 2.75% (V)} and {0.21% (Y), 4.96% (U), 8.46% (V)} in RA and LDB configurations, respectively.
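By way of a non-limiting illustration, the amplification of optical flow during upsampling summarized above may be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function name upsample_flow is hypothetical, nearest-neighbor upsampling is an assumed interpolation, and setting the amplification factor equal to the upsampling ratio is an assumption based on the observation that a displacement measured in pixels grows in proportion to the resolution.

```python
import numpy as np

def upsample_flow(flow, scale=2):
    """Upsample a (2, H, W) flow field to (2, H*scale, W*scale) and amplify it.

    The displacement values are multiplied by `scale` (an assumption of this
    sketch) so that they remain expressed in pixels of the finer resolution.
    """
    up = flow.repeat(scale, axis=1).repeat(scale, axis=2)  # nearest-neighbor upsampling
    return up * scale                                      # proportional amplification

coarse_flow = np.ones((2, 4, 4))       # one-pixel motion at the coarse scale
fine_flow = upsample_flow(coarse_flow)  # two-pixel motion at the fine scale
```

In a coarse-to-fine scheme of this kind, the amplified flow would typically serve as the starting estimate that the next, finer level refines.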
[0152] The techniques of the present disclosure can be further extended as follows. For example, the present disclosure provides a unified network architecture for reference picture generation in both RA and LDB configurations. In the RA configuration, an intermediate picture is generated using two temporally symmetrical pictures. For the LDB configuration, a third picture is synthesized based on the preceding two pictures. This resembles tasks in video frame interpolation (VFI) and video frame extrapolation (VFE) . Consequently, the present techniques can be extended to VFI and VFE, facilitating the generation of virtual views.
[0153] The combination of the present techniques with machine vision may be beneficial in image / video coding, e.g., image coding for machine (ICM) or video coding for machine (VCM) . By combining the present techniques with other machine vision tasks such as object detection or segmentation, encoding performance can be remarkably enhanced, especially when dealing with complex scenes or moving objects. In the present disclosure, optical flow estimation can capture motion information between video pictures, which is crucial for efficient motion compensation and prediction for ICM or VCM.
[0154] The QP-distance based training strategy described herein can be adapted to various domains within neural network-based video coding (NNVC) . Potential applications include in-loop filter, post-processing filter, reference picture resampling (RPR) , and intra-frame coding, thereby broadening its impact across different aspects of video compression.
[0155] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a processor, such as a processor in encoding unit 12 or decoding unit 22 in FIG. 1. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, HDD, such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, include CD, laser disc, optical disc, digital video disc (DVD) , and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0156] According to one aspect of the present disclosure, a method of video decoding is provided. The method may include obtaining, by a processor, a first reference picture from a decoded picture buffer. The method may include generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The method may include generating, by the processor, an optical flow map based on the first feature map. The method may include generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The method may include decoding, by the processor, a current picture region based on the second reference picture.
[0157] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0158] [Rectified under Rule 91, 06.02.2025]In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include downsampling, by the processor, the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0159] [Rectified under Rule 91, 06.02.2025]In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include performing, by the processor, a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0160] In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0161] In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include aggregating, by the processor, features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include extracting, by the processor, attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, the optical flow map by warping the forward flow information and the backward flow information.
[0162] In some implementations, the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0163] In some implementations, the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, the second reference picture by fusing the set of warped pictures and the second warped feature information.
[0164] In some implementations, the method may include adding, by the processor, the second reference picture into a reference picture list.
[0165] According to another aspect of the present disclosure, a video decoder is provided. The video decoder may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to decode a current picture region based on the second reference picture.
[0166] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0167] [Rectified under Rule 91, 06.02.2025]In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0168] [Rectified under Rule 91, 06.02.2025]In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0169] In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0170] In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to extract attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the optical flow map by warping the forward flow information and the backward flow information.
[0171] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0172] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
[0173] In some implementations, the memory storing instructions, which when executed by the processor, may further cause the processor to add the second reference picture into a reference picture list.
[0174] According to a further aspect of the present disclosure, an apparatus for video decoding is provided. The apparatus for video decoding may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to decode a current picture region based on the second reference picture.
[0175] According to still another aspect of the present disclosure, a non-transitory computer-readable medium storing instructions for a video decoder is provided. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to obtain a first reference picture from a decoded picture buffer. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate an optical flow map based on the first feature map. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to decode a current picture region based on the second reference picture.
[0176] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0177] [Rectified under Rule 91, 06.02.2025]In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0178] [Rectified under Rule 91, 06.02.2025]In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0179] In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0180] In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to extract attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate the optical flow map by warping the forward flow information and the backward flow information.
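The attention steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: average pooling is used for the pair of pooling kernels, each "transformation" is modelled as a 1x1 convolution (a matrix product over channels) followed by ReLU, the attention information is applied as sigmoid gates, and the weight matrices `t_h`, `t_w`, `w_mix`, and `w_flow` are hypothetical parameters that a trained network would supply. The sketch stops at the forward and backward flow information and does not model the final step that combines them into the optical flow map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinated_attention_flow(feat, t_h, t_w, w_mix, w_flow):
    """feat: (C, H, W); t_h, t_w: (M, C); w_mix: (C, M); w_flow: (4, C)."""
    c, h, w = feat.shape
    # Pair of pooling kernels: encode each channel along the two coordinates.
    enc_h = feat.mean(axis=2)                  # (C, H), pooled over width
    enc_w = feat.mean(axis=1)                  # (C, W), pooled over height
    # Pair of transformations: directional perception feature maps.
    dir_h = np.maximum(t_h @ enc_h, 0.0)       # (M, H)
    dir_w = np.maximum(t_w @ enc_w, 0.0)       # (M, W)
    # Concatenate along the spatial axis, then extract attention for both directions.
    joint = np.concatenate([dir_h, dir_w], axis=1)   # (M, H + W)
    gates = sigmoid(w_mix @ joint)                   # (C, H + W)
    attended = feat * gates[:, :h, None] * gates[:, None, h:]
    # Convolutional transformation (1x1) to four channels: two forward, two backward.
    flow = np.einsum("oc,chw->ohw", w_flow, attended)
    return flow[:2], flow[2:]
```

Because the gates are computed from row-wise and column-wise summaries, each output position is modulated by information gathered along its entire row and column, which is the property the directional pooling is intended to provide.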
[0181] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0182] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, may cause the processor of the video decoder to generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
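As an illustration of the warping and fusing operations named above, the following NumPy sketch warps a single-channel picture with a dense flow field using bilinear interpolation and then blends a set of warped pictures. The flow convention (`flow[0]` is the horizontal displacement and `flow[1]` the vertical displacement, in pixels), the border clamping, and the softmax blend driven by `logits` are assumptions made for this sketch; the disclosure does not specify them in this passage, and a learned fusion block would replace `fuse`.

```python
import numpy as np

def backward_warp(pic, flow):
    """Sample pic (H, W) at (x + flow[0], y + flow[1]) with bilinear interpolation."""
    h, w = pic.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[0], 0, w - 1)    # clamp sample positions to the picture
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0                # fractional offsets in [0, 1)
    top = pic[y0, x0] * (1 - fx) + pic[y0, x1] * fx
    bottom = pic[y1, x0] * (1 - fx) + pic[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def fuse(warped_pictures, logits):
    """Blend K warped pictures (K, H, W) with per-pixel softmax weights."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)
    return (weights * warped_pictures).sum(axis=0)
```

With a zero flow field the warp returns the input unchanged, and with equal logits the fusion reduces to a plain average of the warped pictures.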
[0183] In some implementations, the instructions, which when executed by the processor of the video decoder, may further cause the processor of the video decoder to add the second reference picture into a reference picture list.
[0184] According to one aspect of the present disclosure, a method of video encoding is provided. The method may include obtaining, by a processor, a first reference picture from a decoded picture buffer. The method may include generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The method may include generating, by the processor, an optical flow map based on the first feature map. The method may include generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The method may include encoding, by the processor, a current picture region based on the second reference picture.
[0185] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0186] [Rectified under Rule 91, 06.02.2025] In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include downsampling, by the processor, the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0187] [Rectified under Rule 91, 06.02.2025] In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include generating, by the processor, first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture may include performing, by the processor, a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0188] In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0189] In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include aggregating, by the processor, features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include extracting, by the processor, attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, the generating, by the processor, the optical flow map based on the first feature map may include generating, by the processor, the optical flow map by warping the forward flow information and the backward flow information.
[0190] In some implementations, the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0191] In some implementations, the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, the generating, by the processor, the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map may include generating, by the processor, the second reference picture by fusing the set of warped pictures and the second warped feature information.
[0192] In some implementations, the method may include adding, by the processor, the second reference picture into a reference picture list.
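The step of adding the second reference picture into a reference picture list can be pictured with the short Python sketch below. The function name `extend_reference_list`, the choice of the most recent buffer entry as the first reference picture, and the `generate` callable (a placeholder for the whole generation pipeline) are hypothetical and serve only to show that the generated picture is appended alongside the conventional reference pictures rather than replacing them.

```python
import numpy as np

def extend_reference_list(dpb, ref_list, generate):
    """Return a new reference picture list with one generated picture appended.

    dpb:      list of decoded pictures (the decoded picture buffer)
    ref_list: existing reference picture list
    generate: callable mapping a first reference picture to a second one
    """
    first_ref = dpb[-1]                 # assumed: use the most recent decoded picture
    second_ref = generate(first_ref)    # placeholder for the generation network
    return ref_list + [second_ref]      # existing entries are left untouched
```

For example, calling `extend_reference_list(dpb, ref_list, lambda p: p.copy())` returns a list one entry longer than `ref_list`, which an encoder could then search during inter prediction of the current picture region.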
[0193] According to another aspect of the present disclosure, a video encoder is provided. The video encoder may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to encode a current picture region based on the second reference picture.
[0194] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0195] [Rectified under Rule 91, 06.02.2025] In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0196] [Rectified under Rule 91, 06.02.2025] In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, may cause the processor to perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0197] In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0198] In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to extract attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the optical flow map by warping the forward flow information and the backward flow information.
[0199] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0200] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, may cause the processor to generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
[0201] In some implementations, the memory storing instructions, which when executed by the processor, may further cause the processor to add the second reference picture into a reference picture list.
[0202] According to a further aspect of the present disclosure, an apparatus for video encoding is provided. The apparatus for video encoding may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first reference picture from a decoded picture buffer. The memory storing instructions, which when executed by the processor, may cause the processor to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The memory storing instructions, which when executed by the processor, may cause the processor to generate an optical flow map based on the first feature map. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The memory storing instructions, which when executed by the processor, may cause the processor to encode a current picture region based on the second reference picture.
[0203] According to still another aspect of the present disclosure, a non-transitory computer-readable medium storing instructions for a video encoder is provided. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to obtain a first reference picture from a decoded picture buffer. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate an optical flow map based on the first feature map. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map. The instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to encode a current picture region based on the second reference picture.
[0204] In some implementations, the multi-scale feature extraction may include at least one DO-Conv.
[0205] [Rectified under Rule 91, 06.02.2025] In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
[0206] [Rectified under Rule 91, 06.02.2025] In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture. In some implementations, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
[0207] In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate the optical flow map based on a dual-branch coordinated attention motion estimation of the first feature map.
[0208] In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to extract attention information from the two spatial directions based on the concatenated directional perception feature map. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate forward flow information and backward flow information based on a convolutional transformation of the attention information. In some implementations, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate the optical flow map by warping the forward flow information and the backward flow information.
[0209] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
[0210] In some implementations, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate first warped feature information based on a first warping of the first feature map and the optical flow map. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures. In some implementations, to generate the second reference picture based on the flow warping and fusion of the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, may cause the processor of the video encoder to generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
[0211] In some implementations, the instructions, which when executed by the processor of the video encoder, may further cause the processor of the video encoder to add the second reference picture into a reference picture list.
[0212] According to yet a further aspect of the present disclosure, a non-transitory computer-readable medium storing a bitstream is provided. The bitstream may be generated based on one or more of the operations described herein.
[0213] The foregoing description of the embodiments will so reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
[0214] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0215] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0216] Various functional blocks, modules, and steps are disclosed above. The arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be reordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
[0217] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims
1. A method of video decoding, comprising: obtaining, by a processor, a first reference picture from a decoded picture buffer; generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generating, by the processor, an optical flow map based on the first feature map; generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and decoding, by the processor, a current picture region based on the second reference picture.
2. The method of claim 1, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
3. [Rectified under Rule 91, 06.02.2025] The method of claim 1, wherein the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture comprises: downsampling, by the processor, the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generating, by the processor, first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generating, by the processor, the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
4. [Rectified under Rule 91, 06.02.2025] The method of claim 3, wherein the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture comprises: generating, by the processor, first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and performing, by the processor, a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
5. The method of claim 1, wherein the generating, by the processor, the optical flow map based on the first feature map comprises: generating, by the processor, the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
6. The method of claim 1, wherein the generating, by the processor, the optical flow map based on the first feature map comprises: generating, by the processor, encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregating, by the processor, features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generating, by the processor, a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extracting, by the processor, attention information from the two spatial directions based on the concatenated directional perception feature map; generating, by the processor, forward flow information and backward flow information based on a convolutional transformation of the attention information; and generating, by the processor, the optical flow map by warping the forward flow information and the backward flow information.
7. The method of claim 1, wherein the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map comprises: generating, by the processor, the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
8. The method of claim 1, wherein the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map comprises: generating, by the processor, first warped feature information based on a first warping of the first feature map and the optical flow map; generating, by the processor, a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generating, by the processor, second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generating, by the processor, the second reference picture by fusing the set of warped pictures and the second warped feature information.
9. The method of claim 1, further comprising: adding, by the processor, the second reference picture into a reference picture list.
10. A video decoder, comprising: a processor; and memory storing instructions, which when executed by the processor, cause the processor to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and decode a current picture region based on the second reference picture.
11. The video decoder of claim 10, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
12. [Rectified under Rule 91, 06.02.2025] The video decoder of claim 10, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, cause the processor to: downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
13. [Rectified under Rule 91, 06.02.2025] The video decoder of claim 12, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, cause the processor to: generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
14. The video decoder of claim 10, wherein, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, cause the processor to: generate the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
15. The video decoder of claim 10, wherein, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, cause the processor to: generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extract attention information from the two spatial directions based on the concatenated directional perception feature map; generate forward flow information and backward flow information based on a convolutional transformation of the attention information; and generate the optical flow map by warping the forward flow information and the backward flow information.
16. The video decoder of claim 11, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, cause the processor to: generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
17. The video decoder of claim 10, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, cause the processor to: generate first warped feature information based on a first warping of the first feature map and the optical flow map; generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
18. The video decoder of claim 10, wherein the memory storing instructions, which when executed by the processor, further cause the processor to: add the second reference picture into a reference picture list.
19. An apparatus for video decoding, comprising: a processor; and memory storing instructions, which when executed by the processor, cause the processor to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and decode a current picture region based on the second reference picture.
20. A non-transitory computer-readable medium storing instructions, which when executed by a processor of a video decoder, cause the processor of the video decoder to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and decode a current picture region based on the second reference picture.
21. The non-transitory computer-readable medium of claim 20, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
22. [Rectified under Rule 91, 06.02.2025] The non-transitory computer-readable medium of claim 20, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
23. [Rectified under Rule 91, 06.02.2025] The non-transitory computer-readable medium of claim 22, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
24. The non-transitory computer-readable medium of claim 20, wherein, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: generate the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
25. The non-transitory computer-readable medium of claim 20, wherein, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extract attention information from the two spatial directions based on the concatenated directional perception feature map; generate forward flow information and backward flow information based on a convolutional transformation of the attention information; and generate the optical flow map by warping the forward flow information and the backward flow information.
26. The non-transitory computer-readable medium of claim 20, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
27. The non-transitory computer-readable medium of claim 20, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video decoder, cause the processor of the video decoder to: generate first warped feature information based on a first warping of the first feature map and the optical flow map; generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
28. The non-transitory computer-readable medium of claim 20, wherein the instructions, which when executed by the processor of the video decoder, further cause the processor of the video decoder to: add the second reference picture into a reference picture list.
29. A method of video encoding, comprising: obtaining, by a processor, a first reference picture from a decoded picture buffer; generating, by the processor, a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generating, by the processor, an optical flow map based on the first feature map; generating, by the processor, a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and encoding, by the processor, a current picture region based on the second reference picture.
30. The method of claim 29, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
31. [Rectified under Rule 91, 06.02.2025] The method of claim 29, wherein the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture comprises: downsampling, by the processor, the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generating, by the processor, first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generating, by the processor, the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
32. The method of claim 31, wherein the generating, by the processor, the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture comprises: generating, by the processor, first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and performing, by the processor, a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the second feature map.
33. The method of claim 29, wherein the generating, by the processor, the optical flow map based on the first feature map comprises: generating, by the processor, the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
34. The method of claim 29, wherein the generating, by the processor, the optical flow map based on the first feature map comprises: generating, by the processor, encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregating, by the processor, features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generating, by the processor, a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extracting, by the processor, attention information from the two spatial directions based on the concatenated directional perception feature map; generating, by the processor, forward flow information and backward flow information based on a convolutional transformation of the attention information; and generating, by the processor, the optical flow map by warping the forward flow information and the backward flow information.
35. The method of claim 29, wherein the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map comprises: generating, by the processor, the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
36. The method of claim 29, wherein the generating, by the processor, the second reference picture based on the first reference picture, the second feature map, and the optical flow map comprises: generating, by the processor, first warped feature information based on a first warping of the first feature map and the optical flow map; generating, by the processor, a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generating, by the processor, second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generating, by the processor, the second reference picture by fusing the set of warped pictures and the second warped feature information.
37. The method of claim 29, further comprising: adding, by the processor, the second reference picture into a reference picture list.
38. A video encoder, comprising: a processor; and memory storing instructions, which when executed by the processor, cause the processor to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and encode a current picture region based on the second reference picture.
39. The video encoder of claim 38, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
40. [Rectified under Rule 91, 06.02.2025] The video encoder of claim 38, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, cause the processor to: downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
41. [Rectified under Rule 91, 06.02.2025] The video encoder of claim 40, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the memory storing instructions, which when executed by the processor, cause the processor to: generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
42. The video encoder of claim 38, wherein, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, cause the processor to: generate the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
43. The video encoder of claim 38, wherein, to generate the optical flow map based on the first feature map, the memory storing instructions, which when executed by the processor, cause the processor to: generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extract attention information from the two spatial directions based on the concatenated directional perception feature map; generate forward flow information and backward flow information based on a convolutional transformation of the attention information; and generate the optical flow map by warping the forward flow information and the backward flow information.
44. The video encoder of claim 38, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, cause the processor to: generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
45. The video encoder of claim 38, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the memory storing instructions, which when executed by the processor, cause the processor to: generate first warped feature information based on a first warping of the first feature map and the optical flow map; generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
46. The video encoder of claim 38, wherein the memory storing instructions, which when executed by the processor, further cause the processor to: add the second reference picture into a reference picture list.
47. An apparatus for video encoding, comprising: a processor; and memory storing instructions, which when executed by the processor, cause the processor to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and encode a current picture region based on the second reference picture.
48. A non-transitory computer-readable medium storing instructions, which when executed by a processor of a video encoder, cause the processor of the video encoder to: obtain a first reference picture from a decoded picture buffer; generate a first feature map and a second feature map based on a multi-scale feature extraction of the first reference picture; generate an optical flow map based on the first feature map; generate a second reference picture based on the first reference picture, the second feature map, and the optical flow map; and encode a current picture region based on the second reference picture.
49. The non-transitory computer-readable medium of claim 48, wherein the multi-scale feature extraction includes at least one depthwise over-parameter convolution (DO-Conv).
50. [Rectified under Rule 91, 06.02.2025] The non-transitory computer-readable medium of claim 48, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: downsample the first reference picture to obtain a first downsampled reference picture and a second downsampled reference picture; generate first high-resolution features based on the first reference picture, second high-resolution features based on the first downsampled reference picture, and third high-resolution features based on the second downsampled reference picture; and generate the second feature map based on a concatenation of the first high-resolution features, the second high-resolution features, and the third high-resolution features.
51. [Rectified under Rule 91, 06.02.2025] The non-transitory computer-readable medium of claim 50, wherein, to generate the first feature map and the second feature map based on the multi-scale feature extraction of the first reference picture, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: generate first sub-resolution features based on the first reference picture, second sub-resolution features based on the first downsampled reference picture, and third sub-resolution features based on the second downsampled reference picture; and perform a summation of the first sub-resolution features, the second sub-resolution features, and the third sub-resolution features to generate the first feature map.
52. The non-transitory computer-readable medium of claim 48, wherein, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: generate the optical flow map based on a dual-branch coordinated attention motion estimation (DBCA-ME) of the first feature map.
53. The non-transitory computer-readable medium of claim 48, wherein, to generate the optical flow map based on the first feature map, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: generate encoded channels of the first feature map along horizontal and vertical coordinates using a pair of pooling kernels; aggregate features of the encoded channels along two spatial directions to generate a pair of directional perception feature maps based on a pair of transformations; generate a concatenated directional perception feature map based on a concatenation of the pair of directional perception feature maps; extract attention information from the two spatial directions based on the concatenated directional perception feature map; generate forward flow information and backward flow information based on a convolutional transformation of the attention information; and generate the optical flow map by warping the forward flow information and the backward flow information.
54. The video encoder of claim 48, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: generate the second reference picture based on a flow warping and fusion of the first reference picture, the second feature map, and the optical flow map.
55. The non-transitory computer-readable medium of claim 48, wherein, to generate the second reference picture based on the first reference picture, the second feature map, and the optical flow map, the instructions, which when executed by the processor of the video encoder, cause the processor of the video encoder to: generate first warped feature information based on a first warping of the first feature map and the optical flow map; generate a set of warped pictures based on a second warping of the optical flow map and the first reference picture; generate second warped feature information based on a concatenation of the first feature map, the optical flow map, the first warped feature information, and the set of warped pictures; and generate the second reference picture by fusing the set of warped pictures and the second warped feature information.
56. The non-transitory computer-readable medium of claim 48, wherein the instructions, which when executed by the processor of the video encoder, further cause the processor of the video encoder to: add the second reference picture into a reference picture list.
57. A non-transitory computer-readable medium storing a bitstream, the bitstream being generated based on one or more of claims 29-37.
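For illustration only, the data flow recited in claims 1, 3, 4, 6, and 8 may be pictured with the following minimal NumPy sketch. It is not the disclosed network: every learned layer (DO-Conv, the DBCA-ME transformations, the fusion block) is replaced by a fixed placeholder, and the function names `downsample2x`, `upsample_nearest`, `multi_scale_features`, `directional_attention`, and `warp` are hypothetical. The sketch assumes a single-channel picture with even dimensions, average pooling as the downsampler, and nearest-neighbour sampling for warping.

```python
import numpy as np

def downsample2x(pic):
    """Halve an (H, W) picture by 2x2 average pooling (H and W must be even)."""
    h, w = pic.shape
    return pic.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_nearest(pic, factor):
    """Repeat every sample `factor` times along both axes."""
    return np.repeat(np.repeat(pic, factor, axis=0), factor, axis=1)

def multi_scale_features(ref):
    """Placeholder multi-scale extraction: three scales of the reference picture.

    Returns (first_feature_map, second_feature_map): the first is a summation
    over scales, the second a channel-wise concatenation over scales. Learned
    per-scale convolutions are deliberately omitted (assumption).
    """
    d1 = downsample2x(ref)          # first downsampled reference picture
    d2 = downsample2x(d1)           # second downsampled reference picture
    scales = [ref, upsample_nearest(d1, 2), upsample_nearest(d2, 4)]
    second = np.stack(scales, axis=0)   # concatenation -> (3, H, W)
    first = sum(scales)                 # summation     -> (H, W)
    return first, second

def directional_attention(feat):
    """Pool a (C, H, W) map along each spatial direction and gate with a sigmoid.

    The 1x1 transformations of a real coordinate-attention block are left out
    (assumption); only the pooling, concatenation, and split steps are shown.
    """
    pooled_h = feat.mean(axis=2)                            # (C, H)
    pooled_w = feat.mean(axis=1)                            # (C, W)
    joint = np.concatenate([pooled_h, pooled_w], axis=1)    # (C, H + W)
    gates = 1.0 / (1.0 + np.exp(-joint))
    gate_h = gates[:, :feat.shape[1]]
    gate_w = gates[:, feat.shape[1]:]
    return gate_h[:, :, None] * gate_w[:, None, :]          # (C, H, W)

def warp(pic, flow):
    """Backward-warp an (H, W) picture with a (2, H, W) flow (x first, then y).

    Nearest-neighbour sampling; source coordinates are clipped to the border.
    """
    h, w = pic.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return pic[src_y, src_x]
```

In this sketch a zero flow leaves the picture unchanged, and a constant flow of one sample in x shifts the picture left by one column with the last column repeated at the border. A practical implementation would typically use bilinear sampling and learned fusion weights instead.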