Video coding loop filter with residual scaling based on neural network

A neural network-based filter model generates and scales residuals during video encoding and decoding, reducing the distortion introduced by compression and improving video quality and compression efficiency.

CN115037948B · Active · Publication Date: 2026-06-19 · FACE CUTE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
FACE CUTE CO LTD
Filing Date
2022-03-04
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing video encoding and decoding technologies suffer from distortion during compression, and existing loop filtering techniques are ineffective in reducing distortion.

Method used

A neural network-based filter model is used for loop filtering. Filtered samples are generated by producing residuals, scaling residuals, and adding unfiltered samples. Different inference block sizes and scaling functions are combined to improve performance.

Benefits of technology

By applying neural network filter models, distortion in the video encoding and decoding process is significantly reduced, improving video quality and compression efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

A method implemented by a video codec device is disclosed. The method includes applying the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual, applying a scaling function to the residual to generate a scaled residual, adding another unfiltered sample to the scaled residual to generate filtered samples, and converting between a video media file and a bitstream based on the generated filtered samples. Corresponding video codec devices and non-transitory computer-readable media are also disclosed.

Description

[0001] Cross-reference to related applications

[0002] This patent application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/156,726, filed March 4, 2021, entitled “A Neural Network-Based Video Codec Loop Filter with Residual Scaling,” the entire disclosure of which is incorporated herein by reference as part of the disclosure of this application.

Technical Field

[0003] This disclosure generally relates to video encoding and decoding, and in particular, to loop filters in image / video encoding and decoding.

Background Technology

[0004] Digital video accounts for the largest share of bandwidth usage on the Internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video increases, the bandwidth demand for digital video is expected to continue to grow.

Summary of the Invention

[0005] This disclosure provides one or more neural network (NN) filter models trained as part of a loop filtering technique or filtering method used in a post-processing stage to reduce distortion generated during compression. Furthermore, samples with different characteristics are processed by different NN filter models. This disclosure also details how to scale the output of the NN filters to achieve better performance, how to set the inference block size, and how to combine the outputs of multiple NN filter models.

[0006] The first aspect relates to a method implemented by an encoding / decoding device. The method includes: applying the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; applying a scaling function to the residual to generate a scaled residual; adding another unfiltered sample to the scaled residual to generate filtered samples; and converting between a video media file and a bitstream based on the generated filtered samples.

[0007] Alternatively, in any of the foregoing aspects, another implementation of this aspect provides reconstructing the unfiltered samples before generating the residuals.

[0008] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides generating the filtered samples according to Y = X + F(R), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

[0009] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides generating the filtered samples according to Y = X + F(R, X), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

[0010] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides generating the filtered samples according to Y = X + F(RX), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

[0011] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides generating the filtered samples according to Y = Clip(X + F(R)), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, Clip represents the clipping function based on the bit depth of the unfiltered samples, and Y represents the filtered samples.

[0012] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides that the scaling function is based on a linear model according to F(R) = α × R + β, where R represents the residual determined based on the output of the NN filter, F represents the scaling function, and α and β represent a pair of coefficient candidates (α, β).
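The filtering rules above, Y = Clip(X + F(R)) with the linear scaling F(R) = α × R + β, can be illustrated with a minimal Python sketch. The function names, the rounding to the nearest integer, and the example values of α, β, and the bit depth are assumptions made for illustration; the disclosure does not prescribe them.

```python
def scale_residual(residual, alpha, beta):
    """Linear scaling model F(R) = alpha * R + beta."""
    return alpha * residual + beta


def clip_to_bit_depth(value, bit_depth):
    """Clip a sample value to the range [0, 2^bit_depth - 1]."""
    return max(0, min((1 << bit_depth) - 1, value))


def filter_samples(unfiltered, residual, alpha, beta, bit_depth):
    """Compute Y = Clip(X + F(R)) sample by sample.

    Rounding to the nearest integer before clipping is an assumption
    of this sketch, not something the text specifies.
    """
    return [
        clip_to_bit_depth(round(x + scale_residual(r, alpha, beta)), bit_depth)
        for x, r in zip(unfiltered, residual)
    ]


# Hypothetical 10-bit samples X and NN-derived residuals R.
x = [100, 1020, 3, 512]
r = [8.0, 10.0, -8.0, 4.0]
y = filter_samples(x, r, alpha=0.5, beta=0.0, bit_depth=10)
print(y)
```

With α = 0.5 the residual is halved before being added back, and the second and third samples show the effect of clipping at the top and bottom of the 10-bit range.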

[0013] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides determining the inference block size to be used when the NN filter is applied to the unfiltered samples.

[0014] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides selecting the inference block size from a plurality of inference block size candidates, wherein each of the plurality of inference block size candidates is based on at least one of quantization parameters, stripe type, image type, segmentation tree, and color component.

[0015] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides parsing the bitstream to obtain an indicator, wherein the indicator indicates the inference block size to be used when the NN filter is applied to the unfiltered samples.

[0016] Optionally, in any of the foregoing aspects, another embodiment of that aspect provides that the inference block size has a first value for a first bit rate and a second value for a second bit rate, wherein the first value is higher than the second value, and wherein the first bit rate is lower than the second bit rate.

[0017] Optionally, in any of the foregoing aspects, another embodiment of that aspect provides that the inference block size has a first value for a first precision and a second value for a second precision, wherein the first value is higher than the second value, and wherein the first precision is higher than the second precision.

[0018] Alternatively, in any of the foregoing aspects, another embodiment of that aspect provides the NN filter as one of a plurality of NN filters whose output is applied to the unfiltered samples to generate the residual.

[0019] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides that when the outputs of the plurality of NN filters are applied to the unfiltered samples, some of the plurality of NN filters use different inference block sizes.

[0020] Alternatively, in any of the foregoing aspects, another embodiment of this aspect provides individually weighting the outputs of the plurality of NN filters and applying them to the unfiltered samples as a weighted sum.

[0021] Alternatively, in any of the foregoing aspects, another implementation of this aspect provides signaling, in the bitstream, the model and the weight corresponding to each of the plurality of NN filters.

[0022] Alternatively, in any of the foregoing aspects, another implementation of this aspect provides weights corresponding to each of the plurality of NN filters based on one or more of quantization parameters, stripe type, image type, color components, color format, and temporal layer.

[0023] Alternatively, in any of the foregoing aspects, another embodiment of that aspect provides weights corresponding to each of the plurality of NN filters based on one or more of the NN filter model, inference block size, or spatial location of the unfiltered samples.
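The weighted combination of several NN filter outputs described in the preceding embodiments can be sketched as follows. This is a minimal illustration: the function name `combine_nn_outputs` is hypothetical, each filter output is assumed to be a flat list of per-sample values of equal length, and the example weights (which happen to sum to 1) are not taken from the disclosure.

```python
def combine_nn_outputs(outputs, weights):
    """Return the per-sample weighted sum of several NN filter outputs.

    outputs: list of equally long lists, one per NN filter (assumed layout).
    weights: one weight per NN filter.
    """
    if len(outputs) != len(weights):
        raise ValueError("one weight is required per NN filter output")
    length = len(outputs[0])
    if any(len(out) != length for out in outputs):
        raise ValueError("all NN filter outputs must cover the same samples")
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(length)
    ]


# Two hypothetical filter outputs for the same two samples.
combined = combine_nn_outputs([[4.0, -2.0], [8.0, 2.0]], [0.75, 0.25])
print(combined)
```

In a codec, the weights could be derived from the parameters listed above (quantization parameters, color component, temporal layer, and so on) or parsed from the bitstream; the sketch simply takes them as inputs.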

[0024] The second aspect relates to an apparatus for encoding and decoding video data, including a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to: apply the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; apply a scaling function to the residual to generate a scaled residual; add another unfiltered sample to the scaled residual to generate a filtered sample; and convert between a video media file and a bitstream based on the generated filtered sample.

[0025] The third aspect relates to a non-transitory computer-readable medium comprising a computer program product used by an encoding / decoding device, the computer program product including computer-executable instructions stored on the non-transitory computer-readable medium, which, when executed by one or more processors, cause the encoding / decoding device to: apply the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; apply a scaling function to the residual to generate a scaled residual; add another unfiltered sample to the scaled residual to generate a filtered sample; and convert between a video media file and a bitstream based on the generated filtered sample.

[0026] For clarity, any of the above embodiments can be combined with any one or more of the other embodiments to obtain new embodiments within the scope of this disclosure.

[0027] The above and other aspects and features of this disclosure will become clearer from the following detailed description in conjunction with the accompanying drawings, specification, and claims.

Attached Figure Description

[0028] For a more complete understanding of this disclosure, reference is made to the following brief description in conjunction with the accompanying drawings and detailed description, wherein the same or similar reference numerals denote the same or similar parts.

[0029] Figure 1 is an example of raster scan strip segmentation of an image.

[0030] Figure 2 is an example of rectangular strip segmentation of an image.

[0031] Figure 3 is an example of dividing an image into slices, tiles, and rectangular strips.

[0032] Figure 4A is an example of a coding tree block (CTB) that crosses the bottom picture boundary.

[0033] Figure 4B is an example of a CTB that crosses the right picture boundary.

[0034] Figure 4C is an example of a CTB that crosses the bottom-right picture boundary.

[0035] Figure 5 is an example of an encoder block diagram.

[0036] Figure 6 is a diagram of the samples within an 8×8 block of samples.

[0037] Figure 7 is an example of the pixels involved in the filter on / off decision and the strong / weak filter selection.

[0038] Figure 8 shows four one-dimensional (1-D) directional patterns for edge offset (EO) sample classification.

[0039] Figure 9 shows an example of the filter shapes for a geometry transformation-based adaptive loop filter (GALF).

[0040] Figure 10 shows an example of relative coordinates for the 5×5 diamond filter support.

[0041] Figure 11 shows another example of relative coordinates for the 5×5 diamond filter support.

[0042] Figure 12A is an example architecture of the proposed CNN filter.

[0043] Figure 12B is an example of the construction of a ResBlock.

[0044] Figure 13A is an example of the process of generating filtered samples using residual scaling and neural network filtering.

[0045] Figure 13B is another example of the process of generating filtered samples using residual scaling and neural network filtering.

[0046] Figure 13C is another example of the process of generating filtered samples using residual scaling and neural network filtering.

[0047] Figure 13D is another example of the process of generating filtered samples using residual scaling and neural network filtering.

[0048] Figure 14 is a block diagram illustrating an example video processing system.

[0049] Figure 15 is a block diagram of a video processing device.

[0050] Figure 16 is a block diagram illustrating an example video encoding / decoding system.

[0051] Figure 17 is a block diagram illustrating an example of a video encoder.

[0052] Figure 18 is a block diagram illustrating an example of a video decoder.

[0053] Figure 19 shows a video data encoding / decoding method according to an embodiment of the present disclosure.

Detailed Implementation

[0054] First, it should be understood that although illustrative embodiments of one or more examples are provided below, the systems and / or methods of this disclosure can be implemented using any number of techniques, whether currently known or in existence. This disclosure should not be limited to the illustrative embodiments, drawings, and techniques described below, including the exemplary designs and implementations described herein, but may be modified within the scope of the appended claims and all their equivalents.

[0055] The use of the H.266 terminology in some descriptions is for ease of understanding only and not to limit the scope of the technology disclosed herein. Therefore, the technology described herein is also applicable to other video codec protocols and designs.

[0056] Video codec standards have primarily been developed through the work of the well-known International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) and the International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC). ITU-T developed H.261 and H.263, while ISO / IEC developed MPEG-1 and MPEG-4. The two organizations jointly developed the H.262 / MPEG-2 video standard, the H.264 / MPEG-4 Advanced Video Coding (AVC) standard, and the H.265 / High Efficiency Video Coding (HEVC) standard.

[0057] Since H.262, video codec standards have been based on a hybrid video codec architecture, employing temporal prediction plus transform coding. To explore future video codec technologies beyond HEVC, VCEG and MPEG jointly established the Joint Video Exploration Team (JVET) in 2015. Since then, JVET has adopted many new methods and incorporated them into reference software called the Joint Exploration Model (JEM).

[0058] In April 2018, a Joint Video Expert Team (JVET) was established between VCEG (Q6 / 16) and ISO / IEC JTC1 SC29 / WG11 (MPEG) to work on a Versatile Video Coding (VVC) standard that aims to reduce the bit rate by 50% compared to HEVC. The first version of VVC was finalized in July 2020.

[0059] Color spaces and chroma subsampling are discussed. A color space, also known as a color model (or color system), is an abstract mathematical model that describes a range of colors as tuples of numbers, typically 3 or 4 values or color components (e.g., RGB). Essentially, a color space is an elaboration of a coordinate system and its sub-space.

[0060] For video compression, the most commonly used color spaces are YCbCr and RGB. Y′CbCr, or Y Pb/Cb Pr/Cr, also written as YC_BC_R or Y′C_BC_R, is a family of color spaces used as part of the color image pipeline in video and digital photography systems. Y′ is the luma component, and CB and CR are the blue-difference and red-difference chroma components. Y′ (with the prime) is distinguished from Y, which is luminance: in Y′, light intensity is nonlinearly encoded based on gamma-corrected RGB primaries.

[0061] Chromaticity subsampling is a technique that utilizes the fact that the human visual system is less sensitive to color differences than to brightness, and encodes images by applying a lower resolution to chromaticity information than to brightness information.

[0062] In 4:4:4 chroma subsampling, each of the three Y'CbCr components has the same sampling rate, therefore there is no chroma subsampling. This scheme is sometimes used in high-end film scanners and film post-production.

[0063] For 4:2:2 chroma subsampling, the two chroma components are sampled at half the luminance sampling rate: the horizontal chroma resolution is halved. This reduces the bandwidth of the uncompressed video signal by one-third and results in almost no visual difference.

[0064] For 4:2:0 chroma subsampling, the horizontal sampling is doubled compared to 4:1:1, but the vertical resolution is halved because the Cb and Cr channels are sampled only on each alternating line in this scheme. Therefore, the data rate is the same. Cb and Cr are subsampled by a factor of 2 in both the horizontal and vertical directions, respectively. There are three variants of the 4:2:0 scheme with different horizontal and vertical positioning.

[0065] In MPEG-2, Cb and Cr are co-sited horizontally and are sited between pixels in the vertical direction (sited interstitially). In Joint Photographic Experts Group (JPEG) / JPEG File Interchange Format (JFIF), H.261, and MPEG-1, Cb and Cr are sited interstitially, halfway between alternate luminance samples. In 4:2:0 DV, Cb and Cr are co-sited in the horizontal direction; in the vertical direction, they are co-sited on alternating lines.
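The sampling ratios of the 4:4:4, 4:2:2, and 4:2:0 schemes described above can be checked with a short Python sketch. The table `SUBSAMPLING` and the function names are illustrative; the horizontal and vertical factors follow directly from the descriptions of each scheme, and chroma siting is not modeled.

```python
# (horizontal factor, vertical factor) by which each chroma plane is subsampled.
SUBSAMPLING = {
    "4:4:4": (1, 1),  # no chroma subsampling
    "4:2:2": (2, 1),  # horizontal chroma resolution halved
    "4:2:0": (2, 2),  # halved horizontally and vertically
}


def chroma_plane_size(luma_width, luma_height, chroma_format):
    """Return (width, height) of one chroma plane for the given format."""
    sub_w, sub_h = SUBSAMPLING[chroma_format]
    return luma_width // sub_w, luma_height // sub_h


def samples_per_picture(luma_width, luma_height, chroma_format):
    """Total number of samples: one luminance plane plus two chroma planes."""
    c_w, c_h = chroma_plane_size(luma_width, luma_height, chroma_format)
    return luma_width * luma_height + 2 * c_w * c_h


full = samples_per_picture(1920, 1080, "4:4:4")
for fmt in ("4:4:4", "4:2:2", "4:2:0"):
    total = samples_per_picture(1920, 1080, fmt)
    print(fmt, chroma_plane_size(1920, 1080, fmt), total / full)
```

For a 1920×1080 picture, 4:2:2 carries two-thirds of the samples of 4:4:4, which matches the one-third bandwidth reduction stated above, and 4:2:0 carries half.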

[0066] The definition of a video unit is provided. An image is divided into one or more slice rows and one or more slice columns. A slice is a series of coding tree units (CTUs) covering a rectangular area of the image. A slice is divided into one or more tiles, each tile consisting of several CTU rows within the slice. A slice that is not divided into multiple tiles is also called a tile. However, a tile that is a proper subset of a slice is not called a slice. A strip contains several slices of an image, or contains several tiles of a slice.

[0067] Two strip modes are supported: raster scan strip mode and rectangular strip mode. In raster scan strip mode, the strip contains a series of slices in a raster scan of the image. In rectangular strip mode, the strip contains multiple tiles of the image that together form a rectangular area of the image. The tiles within a rectangular strip are arranged according to the raster scan order of the strip.

[0068] Figure 1 is an example of raster scan strip segmentation of image 100, where the image is divided into 12 slices 102 and three raster scan strips 104. As shown, each slice 102 and strip 104 contains a number of CTUs 106.

[0069] Figure 2 is an example of rectangular strip segmentation of image 200 according to the VVC specification, where the image is divided into 24 slices 202 (6 slice columns 203 and 4 slice rows 205) and 9 rectangular strips 204. As shown in the figure, each slice 202 and strip 204 contains several CTUs 206.

[0070] Figure 3 is an example of dividing an image 300 into slices, tiles, and rectangular strips according to the VVC specification. The image is divided into 4 slices 302 (two slice columns 303 and two slice rows 305), 11 tiles 304 (the top left slice contains one tile, the top right slice contains five tiles, the bottom left slice contains two tiles, and the bottom right slice contains three tiles), and 4 rectangular strips 306.

[0071] The CTU size and coding tree block (CTB) size are discussed. In VVC, the CTU size, which is signaled in the sequence parameter set (SPS) by the syntax element log2_ctu_size_minus2, can be as small as 4×4.

[0072] The syntax for the raw byte sequence payload (RBSP) is as follows.

[0073]

[0074]

[0075]

[0076] log2_ctu_size_minus2 plus 2 specifies the luminance coding tree block size of each CTU.

[0077] log2_min_luma_coding_block_size_minus2 plus 2 specifies the minimum luma encoding / decoding block size.

[0078] The variables CtbLog2SizeY, CtbSizeY, MinCbLog2SizeY, MinCbSizeY, MinTbLog2SizeY, MaxTbLog2SizeY, MinTbSizeY, MaxTbSizeY, PicWidthInCtbsY, PicHeightInCtbsY, PicSizeInCtbsY, PicWidthInMinCbsY, PicHeightInMinCbsY, PicSizeInMinCbsY, PicSizeInSamplesY, PicWidthInSamplesC, and PicHeightInSamplesC are derived as follows.

[0079] CtbLog2SizeY=log2_ctu_size_minus2+2 (7-9)

[0080] CtbSizeY=1<<CtbLog2SizeY (7-10)

[0081] MinCbLog2SizeY=log2_min_luma_coding_block_size_minus2+2

[0082] (7-11)

[0083] MinCbSizeY=1<<MinCbLog2SizeY (7-12)

[0084] MinTbLog2SizeY=2 (7-13)

[0085] MaxTbLog2SizeY=6 (7-14)

[0086] MinTbSizeY=1<<MinTbLog2SizeY (7-15)

[0087] MaxTbSizeY=1<<MaxTbLog2SizeY (7-16)

[0088] PicWidthInCtbsY=Ceil(pic_width_in_luma_samples÷CtbSizeY)

[0089] (7-17)

[0090] PicHeightInCtbsY=Ceil(pic_height_in_luma_samples÷CtbSizeY)

[0091] (7-18)

[0092] PicSizeInCtbsY=PicWidthInCtbsY*PicHeightInCtbsY (7-19)

[0093] PicWidthInMinCbsY=pic_width_in_luma_samples / MinCbSizeY

[0094] (7-20)

[0095] PicHeightInMinCbsY=pic_height_in_luma_samples / MinCbSizeY

[0096] (7-21)

[0097] PicSizeInMinCbsY = PicWidthInMinCbsY * PicHeightInMinCbsY

[0098] (7-22)

[0099] PicSizeInSamplesY = pic_width_in_luma_samples * pic_height_in_luma_samples (7-23)

[0100] PicWidthInSamplesC = pic_width_in_luma_samples / SubWidthC (7-24)

[0101] PicHeightInSamplesC = pic_height_in_luma_samples / SubHeightC

[0102] (7-25)
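The derivations (7-9) through (7-25) can be followed with a small Python sketch. The function name and the dictionary return type are illustrative. SubWidthC and SubHeightC are not defined in this passage, so the defaults of 2 (the 4:2:0 case) are an assumption, as are the example syntax element values.

```python
def derive_ctb_variables(log2_ctu_size_minus2,
                         log2_min_luma_coding_block_size_minus2,
                         pic_width_in_luma_samples,
                         pic_height_in_luma_samples,
                         sub_width_c=2, sub_height_c=2):
    """Derive the picture-level size variables from the SPS syntax elements."""
    ctb_log2_size_y = log2_ctu_size_minus2 + 2                        # (7-9)
    ctb_size_y = 1 << ctb_log2_size_y                                 # (7-10)
    min_cb_log2_size_y = log2_min_luma_coding_block_size_minus2 + 2   # (7-11)
    min_cb_size_y = 1 << min_cb_log2_size_y                           # (7-12)
    min_tb_log2_size_y, max_tb_log2_size_y = 2, 6                     # (7-13), (7-14)
    # Ceil(a / b) for positive integers, computed without floating point.
    pic_width_in_ctbs_y = -(-pic_width_in_luma_samples // ctb_size_y)    # (7-17)
    pic_height_in_ctbs_y = -(-pic_height_in_luma_samples // ctb_size_y)  # (7-18)
    pic_width_in_min_cbs_y = pic_width_in_luma_samples // min_cb_size_y    # (7-20)
    pic_height_in_min_cbs_y = pic_height_in_luma_samples // min_cb_size_y  # (7-21)
    return {
        "CtbLog2SizeY": ctb_log2_size_y,
        "CtbSizeY": ctb_size_y,
        "MinCbLog2SizeY": min_cb_log2_size_y,
        "MinCbSizeY": min_cb_size_y,
        "MinTbSizeY": 1 << min_tb_log2_size_y,                        # (7-15)
        "MaxTbSizeY": 1 << max_tb_log2_size_y,                        # (7-16)
        "PicWidthInCtbsY": pic_width_in_ctbs_y,
        "PicHeightInCtbsY": pic_height_in_ctbs_y,
        "PicSizeInCtbsY": pic_width_in_ctbs_y * pic_height_in_ctbs_y,         # (7-19)
        "PicWidthInMinCbsY": pic_width_in_min_cbs_y,
        "PicHeightInMinCbsY": pic_height_in_min_cbs_y,
        "PicSizeInMinCbsY": pic_width_in_min_cbs_y * pic_height_in_min_cbs_y,  # (7-22)
        "PicSizeInSamplesY": pic_width_in_luma_samples * pic_height_in_luma_samples,  # (7-23)
        "PicWidthInSamplesC": pic_width_in_luma_samples // sub_width_c,        # (7-24)
        "PicHeightInSamplesC": pic_height_in_luma_samples // sub_height_c,     # (7-25)
    }


# Hypothetical 1920x1080 picture with 128x128 CTUs and 4x4 minimum coding blocks.
v = derive_ctb_variables(5, 0, 1920, 1080)
print(v["CtbSizeY"], v["PicWidthInCtbsY"], v["PicHeightInCtbsY"], v["PicSizeInCtbsY"])
```

Because 1080 is not a multiple of 128, the picture height rounds up to 9 CTB rows, so the last CTB row extends beyond the bottom picture boundary.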

[0103] Figure 4A is an example of a CTB that crosses the bottom picture boundary. Figure 4B is an example of a CTB that crosses the right picture boundary. Figure 4C is an example of a CTB that crosses the bottom-right picture boundary. In Figures 4A-4C, K = M and L < N; K < M and L = N; and K < M and L < N, respectively.

[0104] The CTUs in picture 400 are discussed with reference to Figures 4A-4C. Assume that the CTB / largest coding unit (LCU) size is represented by M × N (usually M equals N, as defined in HEVC / VVC). For a CTB located at the boundary of a picture (or of a slice, strip, or other type of unit; the picture boundary is taken as an example), K × L samples are within the picture boundary, where K < M or L < N. For the CTBs depicted in Figures 4A-4C, the CTB size is still equal to M × N; however, the bottom boundary / right boundary of the CTB is outside picture 400.
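The K × L region of a boundary CTB that lies inside the picture can be computed as in the following minimal sketch. The function name, the top-left coordinate convention, and the 1900×1080 picture size are assumptions chosen so that all three cases of Figures 4A-4C occur.

```python
def samples_inside_picture(ctb_x, ctb_y, ctb_width, ctb_height,
                           pic_width, pic_height):
    """Return (K, L): the part of an M x N CTB that lies inside the picture.

    (ctb_x, ctb_y) is the top-left sample position of the CTB, which is
    assumed to start inside the picture.
    """
    k = min(ctb_width, pic_width - ctb_x)
    l = min(ctb_height, pic_height - ctb_y)
    return k, l


M = N = 128
print(samples_inside_picture(0, 1024, M, N, 1900, 1080))     # bottom boundary
print(samples_inside_picture(1792, 0, M, N, 1900, 1080))     # right boundary
print(samples_inside_picture(1792, 1024, M, N, 1900, 1080))  # bottom-right corner
```

The three calls correspond to K = M with L < N, K < M with L = N, and K < M with L < N.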

[0105] The coding flow of a typical video codec is discussed. Figure 5 is an example of a VVC encoder block diagram containing three loop filter blocks: a deblocking filter (DF), a sample adaptive offset (SAO), and an adaptive loop filter (ALF). Unlike the DF, which uses predefined filters, the SAO and ALF use the original samples of the current image to reduce the mean square error between the original and reconstructed samples by adding an offset and by applying a finite impulse response (FIR) filter, respectively, with coded side information signaling the offsets and filter coefficients. The ALF is located at the last processing stage of each image and can be regarded as a tool that tries to catch and fix artifacts created by the previous stages.

[0106] Figure 5 is a schematic diagram of encoder 500. Encoder 500 is suitable for implementing VVC techniques. Encoder 500 includes three loop filters: deblocking filter (DF) 502, sample adaptive offset (SAO) 504, and ALF 506. Unlike DF 502, which uses predefined filters, SAO 504 and ALF 506 use the original samples of the current image to reduce the mean square error between the original and reconstructed samples by adding an offset and by applying an FIR filter, respectively, with coded side information signaling the offsets and filter coefficients. ALF 506 is located at the last processing stage of each image and can be regarded as a tool that tries to catch and fix artifacts created by the previous stages.

[0107] The encoder 500 also includes an intra-frame prediction component 508 and a motion estimation / compensation (ME / MC) component 510 configured to receive input video. The intra-frame prediction component 508 is configured to perform intra-frame prediction, while the ME / MC component 510 is configured to perform inter-frame prediction using a reference picture obtained from a reference picture buffer 512. Residual blocks from inter-frame or intra-frame prediction are provided to a transform component 514 and a quantization component 516 to generate quantized residual transform coefficients, which are provided to an entropy codec component 518. The entropy codec component 518 entropy codes the prediction results and the quantized transform coefficients and transmits them to a video decoder (not shown). The quantized components output from the quantization component 516 are provided to an inverse quantization component 520, an inverse transform component 522, and a reconstruction (REC) component 524. The REC component 524 can output images to the DF 502, the SAO 504, and the ALF 506 for filtering before these images are stored in the reference picture buffer 512.

[0108] The input to the DF 502 is the reconstructed samples before the loop filters. The vertical edges in an image are filtered first. The horizontal edges in the image are then filtered, using the samples modified by the vertical edge filtering as input. The vertical and horizontal edges in the CTBs of each CTU are processed separately on a coding unit basis. The vertical edges of the coding blocks in a coding unit are filtered starting with the edge on the left-hand side of the coding blocks and proceeding through the edges towards the right-hand side of the coding blocks in their geometrical order. The horizontal edges of the coding blocks in a coding unit are filtered starting with the edge on the top of the coding blocks and proceeding through the edges towards the bottom of the coding blocks in their geometrical order.

[0109] Figure 6 is a diagram 600 showing samples 602 within an 8×8 block of samples 604. As shown, diagram 600 includes horizontal block boundaries 606 and vertical block boundaries 608 on the 8×8 grid. Furthermore, diagram 600 depicts non-overlapping blocks 610 of 8×8 samples, which can be deblocked in parallel.

[0110] Boundary determination is discussed. Filtering is applied to 8×8 block boundaries. In addition, a boundary must be a transform block boundary or a coding sub-block boundary (e.g., due to the use of affine motion prediction or alternative temporal motion vector prediction (ATMVP)). The filter is disabled for boundaries that do not fall into these categories.

[0111] Boundary strength calculation is discussed. A transform block boundary or coding sub-block boundary that is located on the 8×8 grid can be filtered, and the setting of bS[xD_i][yD_j] for the edge (where [xD_i][yD_j] denotes the coordinates) is defined in Table 1 and Table 2, respectively.

[0112] Table 1: Boundary Strength (with SPS IBC disabled)

[0113]

[0114] Table 2: Boundary Strength (with SPS IBC enabled)

[0115]

[0116]

[0117] The deblocking decision for the luminance component is discussed.

[0118] Figure 7 is an example 700 of the pixels involved in the filter on / off decision and the strong / weak filter selection. A wider and stronger luminance filter is used only when Condition 1, Condition 2, and Condition 3 are all TRUE. Condition 1 is the "large block condition". This condition detects whether the samples on the P side and the Q side belong to large blocks, as represented by the variables bSidePisLargeBlk and bSideQisLargeBlk, respectively. bSidePisLargeBlk and bSideQisLargeBlk are defined as follows.

[0119] bSidePisLargeBlk = ((edge type is vertical and p0 belongs to CU

[0120] with width >= 32) || (edge type is horizontal and p0 belongs to CU

[0121] with height >= 32)) ? TRUE : FALSE

[0122] bSideQisLargeBlk = ((edge type is vertical and q0 belongs to CU

[0123] with width >= 32) || (edge type is horizontal and q0 belongs to CU

[0124] with height >= 32)) ? TRUE : FALSE

[0125] Based on bSidePisLargeBlk and bSideQisLargeBlk, Condition 1 is defined as follows.

[0126] Condition 1 = (bSidePisLargeBlk || bSideQisLargeBlk) ? TRUE : FALSE

[0127] Next, if Condition 1 is true, Condition 2 will be further checked. First, the following variables are derived.

[0128] First, dp0, dp3, dq0, and dq3 are derived as in HEVC.

[0129] If the p side is greater than or equal to 32:

[0130] dp0 = (dp0 + Abs(p_{5,0} − 2 * p_{4,0} + p_{3,0}) + 1) >> 1

[0131] dp3 = (dp3 + Abs(p_{5,3} − 2 * p_{4,3} + p_{3,3}) + 1) >> 1

[0132] If the q side is greater than or equal to 32:

[0133] dq0 = (dq0 + Abs(q_{5,0} − 2 * q_{4,0} + q_{3,0}) + 1) >> 1

[0134] dq3 = (dq3 + Abs(q_{5,3} − 2 * q_{4,3} + q_{3,3}) + 1) >> 1

[0135] Condition 2 = (d < β) ? TRUE : FALSE

[0136] where d = dp0 + dq0 + dp3 + dq3.
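Condition 1 and Condition 2 can be traced with the following minimal Python sketch. The function names and the sample layout (a list `[s0, ..., s5]` ordered outward from the edge for one line on one side) are hypothetical. The HEVC starting value Abs(s2 − 2 * s1 + s0) is taken from the HEVC deblocking design rather than from this passage, and the sketch assumes that "the p side is greater than or equal to 32" coincides with the corresponding large-block flag.

```python
def line_activity(s, side_is_large):
    """Second-derivative activity of one line on one side of the edge.

    s holds the samples s0..s5 of that line, s0 being closest to the edge.
    """
    d = abs(s[2] - 2 * s[1] + s[0])  # HEVC-style derivation
    if side_is_large:
        # Large-block extension: average in the activity of samples 3..5.
        d = (d + abs(s[5] - 2 * s[4] + s[3]) + 1) >> 1
    return d


def condition1(p_is_large, q_is_large):
    """Large block condition: at least one side belongs to a large block."""
    return p_is_large or q_is_large


def condition2(p_line0, p_line3, q_line0, q_line3, p_is_large, q_is_large, beta):
    """Condition 2 = (d < beta), with d = dp0 + dq0 + dp3 + dq3."""
    d = (line_activity(p_line0, p_is_large) + line_activity(q_line0, q_is_large)
         + line_activity(p_line3, p_is_large) + line_activity(q_line3, q_is_large))
    return d < beta


ramp = [10, 12, 18, 20, 20, 26]  # hypothetical sample values for one line
print(line_activity(ramp, False), line_activity(ramp, True))
print(condition2(ramp, ramp, ramp, ramp, True, True, beta=21))
```

Only lines 0 and 3 of the four-line edge segment contribute to d, which keeps the decision cheap while still sampling both ends of the segment.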

[0137] If Condition 1 and Condition 2 are valid, then further check if any blocks use sub-blocks.

[0138]

[0139] Finally, if both Condition 1 and Condition 2 are valid, the proposed deblocking method will check Condition 3 (large block strong filtering condition), which is defined as follows.

[0140] In Condition 3 (StrongFilterCondition), the following variables are derived.

[0141] dpq is derived as in HEVC.

[0142] sp3 = Abs(p3 − p0) is derived as in HEVC.

[0143] If the p side is greater than or equal to 32:

[0144]

[0145]

[0146] sq3 = Abs(q0 − q3) is derived as in HEVC.

[0147] If the q side is greater than or equal to 32:

[0148]

[0149] As in HEVC, StrongFilterCondition = (dpq is less than (β >> 2), sp3 + sq3 is less than (3 * β >> 5), and Abs(p0 − q0) is less than (5 * tC + 1) >> 1) ? TRUE : FALSE.

[0150] A stronger luminance deblocking filter (designed for larger blocks) is discussed.

[0151] A bilinear filter is used when samples on either side of a boundary belong to a large block. A sample is defined as belonging to a large block when the width is greater than or equal to 32 for a vertical edge, or when the height is greater than or equal to 32 for a horizontal edge.

[0152] The bilinear filter is listed below.

[0153] The block boundary samples p_i for i = 0 to Sp − 1 and q_j for j = 0 to Sq − 1 are then replaced by linear interpolation as follows. (As in HEVC deblocking, p_i and q_i are the i-th sample within a row for filtering a vertical edge, or the i-th sample within a column for filtering a horizontal edge.)

[0154] p_i′ = (f_i * Middle_{s,t} + (64 − f_i) * P_s + 32) >> 6, clipped to p_i ± tcPD_i

[0155] q_j′ = (g_j * Middle_{s,t} + (64 − g_j) * Q_s + 32) >> 6, clipped to q_j ± tcPD_j

[0156] where the tcPD_i and tcPD_j terms are position-dependent clipping values described below, and g_j, f_i, Middle_{s,t}, P_s, and Q_s are given below.

[0157] Deblocking control of chroma is discussed.

[0158] The strong chroma filters are used on both sides of the block boundary. The chroma filter is selected when both sides of the chroma edge are greater than or equal to 8 (in chroma sample positions) and the following decision with three conditions is satisfied. The first condition is the decision on boundary strength and large block: the proposed filter can be applied when the block width or height that orthogonally crosses the block edge is equal to or greater than 8 in the chroma sample domain. The second and third conditions are essentially the same as the HEVC luminance deblocking decisions, namely the on / off decision and the strong filter decision, respectively.

[0159] In the first decision, the boundary strength (bS) is modified for chroma filtering, and the conditions are checked sequentially. If a condition is met, the remaining conditions with lower priority are skipped.

[0160] When a large block boundary is detected, chroma deblocking is performed when bS equals 2 or bS equals 1.

[0161] The second and third conditions are essentially the same as the HEVC luminance strong filter decision, as follows.

[0162] In the second condition, d is derived as in HEVC luminance deblocking. The second condition is TRUE when d is less than β.

[0163] In the third condition, StrongFilterCondition is derived as follows.

[0164] sp3 = Abs(p3 - p0) is derived as in HEVC.

[0165] sq3 = Abs(q0 - q3) is derived as in HEVC.

[0166] As in the HEVC design, StrongFilterCondition = (dpq is less than (β >> 2), sp3 + sq3 is less than (β >> 3), and Abs(p0 - q0) is less than (5 * tC + 1) >> 1).
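
The three sub-conditions of the chroma StrongFilterCondition can be sketched as follows. This is only an illustration: the function name and argument layout are hypothetical, dpq is taken as an input because its derivation is not given in this passage, and p and q are assumed to be indexable as p[0]..p[3] and q[0]..q[3] with index 0 nearest the edge.

```python
def chroma_strong_filter_condition(p, q, dpq, beta, tc):
    """Evaluate the chroma strong filter decision described in the text.

    p, q : sample values on each side of the edge, index 0 nearest the edge
    dpq  : activity measure, assumed to be derived elsewhere (as in HEVC)
    beta, tc : deblocking thresholds
    """
    sp3 = abs(p[3] - p[0])
    sq3 = abs(q[0] - q[3])
    return (
        dpq < (beta >> 2)
        and sp3 + sq3 < (beta >> 3)
        and abs(p[0] - q[0]) < ((5 * tc + 1) >> 1)
    )


# A smooth edge passes, a large step across the edge does not.
smooth = chroma_strong_filter_condition([100] * 4, [101] * 4, dpq=0, beta=64, tc=4)
step = chroma_strong_filter_condition([100] * 4, [120] * 4, dpq=0, beta=64, tc=4)
```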

[0167] Strong deblocking filters for chroma are discussed. The following strong deblocking filters for chroma are defined.

[0168] p2′=(3*p3+2*p2+p1+p0+q0+4)>>3

[0169] p1′=(2*p3+p2+2*p1+p0+q0+q1+4)>>3

[0170] p0′=(p3+p2+p1+2*p0+q0+q1+q2+4)>>3
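
As a minimal sketch, the three equations above can be written directly as integer arithmetic. Only the P side is shown because only the P-side equations appear in the text; the function name is an assumption made for illustration.

```python
def strong_chroma_filter_p(p3, p2, p1, p0, q0, q1, q2):
    """Return (p2', p1', p0') for the P side of a chroma edge.

    Each equation is a weighted sum whose weights add up to 8,
    followed by rounding (+4) and a right shift by 3.
    """
    p2f = (3 * p3 + 2 * p2 + p1 + p0 + q0 + 4) >> 3
    p1f = (2 * p3 + p2 + 2 * p1 + p0 + q0 + q1 + 4) >> 3
    p0f = (p3 + p2 + p1 + 2 * p0 + q0 + q1 + q2 + 4) >> 3
    return p2f, p1f, p0f


# A flat signal is left unchanged; a step of 8 across the edge is smoothed
# into a ramp on the P side.
flat = strong_chroma_filter_p(100, 100, 100, 100, 100, 100, 100)
ramp = strong_chroma_filter_p(100, 100, 100, 100, 108, 108, 108)
```

Because the weights in each equation sum to 8, a constant region passes through unchanged, which is a quick sanity check for any implementation.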

[0171] The proposed chroma filter performs deblocking on a 4×4 chroma sample grid.

[0172] Position-dependent clipping (tcPD) is discussed. tcPD is applied to the output samples of the strong and long filters that modify 7, 5, and 3 samples at the boundary. Assuming a quantization error distribution, it is proposed to increase the clipping value for samples that are expected to have higher quantization noise and therefore a larger deviation of the reconstructed sample value from the true sample value.

[0173] For each P or Q boundary filtered with an asymmetric filter, depending on the result of the decision process in the boundary strength calculation, a position-dependent threshold table is selected from two tables (i.e., Tc7 and Tc3 listed below) that are provided to the decoder as side information.

[0174] Tc7={6,5,4,3,2,1,1}; Tc3={6,4,2};

[0175] tcPD=(Sp==3)? Tc3:Tc7;

[0176] tcQD=(Sq==3)? Tc3:Tc7;

[0177] For P or Q boundaries filtered with a short symmetric filter, a position-dependent threshold of lower magnitude is applied.

[0178] Tc3 = {3, 2, 1};

[0179] After the thresholds are defined, the filtered sample values p′_i and q′_j are clipped according to the tcP and tcQ clipping values.

[0180] p″_i = Clip3(p′_i + tcP_i, p′_i − tcP_i, p′_i);

[0181] q″_j = Clip3(q′_j + tcQ_j, q′_j − tcQ_j, q′_j);

[0182] Here, p′_i and q′_j are the filtered sample values, p″_i and q″_j are the output sample values after clipping, and tcP_i and tcQ_j are the clipping thresholds derived from the VVC tc parameter and from tcPD and tcQD. The Clip3 function is the clipping function specified in VVC.

[0183] Sub-block deblocking adjustment is discussed.

[0184] To enable parallel-friendly deblocking that uses both long filters and sub-block deblocking, the long filters are restricted to modifying at most 5 samples on a side that uses sub-block deblocking (AFFINE, ATMVP, or decoder-side motion vector refinement (DMVR)), as shown in the luminance control for long filters. In addition, sub-block deblocking is adjusted so that sub-block boundaries on the 8×8 grid that are close to a coding unit (CU) boundary or an implicit TU boundary are restricted to modifying at most two samples on each side.

[0185] The following applies to sub-block boundaries that are not aligned with the CU boundary.

[0186]

[0187] Here, an edge equal to 0 corresponds to a CU boundary, an edge equal to 2 or equal to orthogonalLength − 2 corresponds to a sub-block boundary 8 samples from a CU boundary, and so on. Implicit TU is true if implicit splitting of a TU is used.

[0188] Sample adaptive offset (SAO) is discussed. The input to SAO is the reconstructed samples after deblocking (DB). The concept of SAO is to reduce the mean sample distortion of a region by first classifying the region samples into multiple categories with a selected classifier, obtaining an offset for each category, and then adding the offset to each sample of that category. The classifier index and the offsets of the region are coded in the bitstream. In HEVC and VVC, the region (the unit for SAO parameter signaling) is defined as a CTU.

[0189] HEVC adopts two SAO types that meet the low-complexity requirement: edge offset (EO) and band offset (BO), which are discussed in more detail below. An index of the SAO type is coded (in the range [0, 2]). For EO, sample classification is based on a comparison between the current sample and its neighboring samples according to a one-dimensional directional pattern: horizontal, vertical, 135° diagonal, or 45° diagonal.

[0190] Figure 8 shows four one-dimensional (1-D) directional patterns 800 for EO sample classification: horizontal (EO class = 0), vertical (EO class = 1), 135° diagonal (EO class = 2), and 45° diagonal (EO class = 3).

[0191] For a given EO class, each sample within the CTB is classified into one of five categories. The current sample value (labeled "c") is compared with the values of its two neighboring samples along the selected 1-D pattern. The classification rules for each sample are summarized in Table 3. Categories 1 and 4 are associated with a local valley and a local peak along the selected 1-D pattern, respectively. Categories 2 and 3 are associated with concave and convex corners along the selected 1-D pattern, respectively. If the current sample does not belong to EO categories 1-4, it is classified as category 0, and SAO is not applied.

[0192] Table 3: Sample classification rules for edge offset

[0193]
Category    Condition
1           c < a && c < b
2           (c < a && c == b) || (c == a && c < b)
3           (c > a && c == b) || (c == a && c > b)
4           c > a && c > b
0           None of the above
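
The classification rules of Table 3 can be sketched as a small function. Here c is the current sample and a and b are its two neighbors along the selected 1-D pattern; the function names and the offsets table are illustrative assumptions, and the "none of the above" case returns category 0, following paragraph [0191].

```python
def eo_category(c, a, b):
    """Classify sample c against neighbors a and b along a 1-D EO pattern."""
    if c < a and c < b:
        return 1  # local valley
    if (c < a and c == b) or (c == a and c < b):
        return 2  # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3  # convex corner
    if c > a and c > b:
        return 4  # local peak
    return 0      # none of the above: SAO is not applied


def apply_eo_row(row, offsets):
    """Apply edge offsets along a horizontal pattern to the interior of a row.

    offsets maps categories 1..4 to an offset value (hypothetical layout);
    the first and last samples are left untouched in this sketch.
    """
    out = list(row)
    for x in range(1, len(row) - 1):
        cat = eo_category(row[x], row[x - 1], row[x + 1])
        if cat != 0:
            out[x] = row[x] + offsets[cat]
    return out


smoothed = apply_eo_row([10, 8, 10, 12, 10], {1: 1, 2: 1, 3: -1, 4: -1})
```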

[0194] The geometry transformation-based adaptive loop filter (GALF) in the Joint Exploration Model (JEM) is discussed. The input to GALF is the reconstructed samples after DB and SAO. The sample classification and filtering processes are based on the reconstructed samples after DB and SAO.

[0195] In JEM, a geometric transformation-based adaptive loop filter (GALF) with block-based filter adaptation is applied. For the luminance component, one of 25 filters is selected for each 2×2 block based on the direction and activity of the local gradient.

[0196] The filter shape is discussed. Figure 9 shows examples of GALF filter shapes 900, including a 5×5 diamond on the left, a 7×7 diamond in the middle, and a 9×9 diamond on the right. In JEM, up to three diamond filter shapes (as shown in Figure 9) can be selected for the luminance component. An index is signaled at the picture level to indicate the filter shape used for the luminance component. Each square represents a sample, and Ci (i being 0–6 (left), 0–12 (middle), 0–20 (right)) denotes the coefficient applied to that sample. For the chrominance components in a picture, the 5×5 diamond shape is always used.

[0197] Block classification is discussed. Each 2×2 block is classified into one of 25 classes. The classification index C is derived from its directionality D and a quantized value of activity Â, as follows.

[0198]

[0199] To calculate D and Â, the gradients in the horizontal, vertical, and two diagonal directions are first calculated using a one-dimensional Laplacian.

[0200]

[0201]

[0202]

[0203]

[0204] The indices i and j refer to the coordinates of the top-left sample point in the 2×2 block, and R(i,j) represents the reconstructed sample point at coordinate (i,j).

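[0205] The maximum and minimum values of the gradients in the horizontal and vertical directions are then set as:

[0206] g^max_{h,v} = max(g_h, g_v), g^min_{h,v} = min(g_h, g_v)

[0207] and the maximum and minimum values of the gradients in the two diagonal directions are set as:

[0208] g^max_{d1,d2} = max(g_d1, g_d2), g^min_{d1,d2} = min(g_d1, g_d2)

[0209] To derive the value of the directionality D, these values are compared with each other and with two thresholds t1 and t2:

[0210] Step 1: If both g^max_{h,v} ≤ t1 · g^min_{h,v} and g^max_{d1,d2} ≤ t1 · g^min_{d1,d2} are true, D is set to 0.

[0211] Step 2: If g^max_{h,v} / g^min_{h,v} > g^max_{d1,d2} / g^min_{d1,d2}, continue from Step 3; otherwise, continue from Step 4.

[0212] Step 3: If g^max_{h,v} > t2 · g^min_{h,v}, D is set to 2; otherwise D is set to 1.

[0213] Step 4: If g^max_{d1,d2} > t2 · g^min_{d1,d2}, D is set to 4; otherwise D is set to 3.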

[0214] The activity value A is calculated as follows:

[0215]

[0216] A is further quantized to the range of 0 to 4, inclusive, and the quantized value is denoted as Â.

[0217] For the two chrominance components in a picture, no classification method is applied; that is, a single set of ALF coefficients is applied to each chrominance component.

[0218] The geometric transformation of the filter coefficients is discussed.

[0219] Figure 10 shows relative coordinates 1000 for the 5×5 diamond filter support: diagonal, vertical flip, and rotation, respectively (from left to right).

[0220] Before filtering each 2×2 block, geometric transformations such as rotation or diagonal and vertical flipping are applied to the filter coefficients f(k, l), which are associated with the coordinates (k, l), depending on the gradient values calculated for that block. This is equivalent to applying these transformations to the samples in the filter support region. The idea is to make the different blocks to which ALF is applied more similar by aligning their directionality.

[0221] Three geometric transformations are introduced: diagonal transformation, vertical flip, and rotation.

[0222] Diagonal: f_D(k, l) = f(l, k),

[0223] Vertical flip: f_V(k, l) = f(k, K − l − 1), (9)

[0224] Rotation: f_R(k, l) = f(K − l − 1, k).

[0225] Here, K is the size of the filter, and 0 ≤ k, l ≤ K − 1 are the coefficient coordinates, such that position (0, 0) is at the top-left corner and position (K − 1, K − 1) is at the bottom-right corner. The transformations are applied to the filter coefficients f(k, l) depending on the gradient values calculated for that block. The relationship between the transformations and the four gradients in the four directions is summarized in Table 4.

[0226] Table 4: Mapping of gradients and transformations computed for a block

[0227]
Gradient values                  Transformation
g_d2 < g_d1 and g_h < g_v        No change
g_d2 < g_d1 and g_v < g_h        Diagonal
g_d1 < g_d2 and g_h < g_v        Vertical flip
g_d1 < g_d2 and g_v < g_h        Rotation
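
The three transformations of equation (9) and the selection rule of Table 4 can be sketched as follows. Several simplifications are assumptions of this sketch rather than part of the source: the coefficients are held in a full K×K nested list indexed as f[k][l] (the diamond shape is ignored), the function names are hypothetical, and ties between gradients, which Table 4 does not cover, fall through to the second branch of each comparison.

```python
def diagonal(f):
    """f_D(k, l) = f(l, k)"""
    K = len(f)
    return [[f[l][k] for l in range(K)] for k in range(K)]


def vertical_flip(f):
    """f_V(k, l) = f(k, K - l - 1)"""
    K = len(f)
    return [[f[k][K - l - 1] for l in range(K)] for k in range(K)]


def rotation(f):
    """f_R(k, l) = f(K - l - 1, k)"""
    K = len(f)
    return [[f[K - l - 1][k] for l in range(K)] for k in range(K)]


def select_transform(g_h, g_v, g_d1, g_d2):
    """Pick the transformation for a block from its four gradients (Table 4)."""
    if g_d2 < g_d1:
        return None if g_h < g_v else diagonal
    return vertical_flip if g_h < g_v else rotation


coeffs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transform = select_transform(g_h=3, g_v=1, g_d1=2, g_d2=5)  # -> rotation
rotated = transform(coeffs)
```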

[0228] Filter parameter signaling is discussed. In JEM, GALF filter parameters are signaled for the first CTU, i.e., after the slice header and before the SAO parameters of the first CTU. Up to 25 sets of luminance filter coefficients can be signaled. To reduce bit overhead, filter coefficients of different classes can be merged. In addition, the GALF coefficients of reference pictures are stored and can be reused as the GALF coefficients of the current picture. The current picture can choose to use the GALF coefficients stored for a reference picture and bypass GALF coefficient signaling. In this case, only an index to one of the reference pictures is signaled, and the current picture inherits the stored GALF coefficients of the indicated reference picture.

[0229] To support GALF temporal prediction, a candidate list of GALF filter sets is maintained. The candidate list is empty at the start of decoding a new sequence. After a picture is decoded, the corresponding filter set can be added to the candidate list. Once the size of the candidate list reaches the maximum allowed value (6 in the current JEM), a new filter set overwrites the oldest set in decoding order; that is, a first-in-first-out (FIFO) rule is applied to update the candidate list. To avoid duplication, a set can be added to the list only when the corresponding picture does not use GALF temporal prediction. To support temporal scalability, there are multiple candidate lists of filter sets, and each candidate list is associated with a temporal layer. More specifically, each array assigned by temporal layer index (TempIdx) may contain filter sets of previously decoded pictures with equal or lower TempIdx. For example, the k-th array is associated with TempIdx equal to k, and it contains only filter sets from pictures with TempIdx less than or equal to k. After a picture is coded, the filter sets associated with that picture are used to update the arrays associated with equal or higher TempIdx.
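
The candidate-list bookkeeping described above can be sketched with one bounded FIFO per temporal layer. The class and method names are hypothetical, and a filter set is represented by an opaque value; the sketch only illustrates the update rules stated in the text.

```python
from collections import deque

MAX_CANDIDATES = 6  # maximum list size stated for the current JEM


class GalfCandidateLists:
    """One FIFO candidate list of filter sets per temporal layer."""

    def __init__(self, num_temporal_layers):
        # deque(maxlen=...) drops the oldest entry when a new one is appended
        self.lists = [deque(maxlen=MAX_CANDIDATES)
                      for _ in range(num_temporal_layers)]

    def start_new_sequence(self):
        """Candidate lists are empty when decoding of a new sequence starts."""
        for candidates in self.lists:
            candidates.clear()

    def add_after_decoding(self, filter_set, temp_idx, used_temporal_prediction):
        """Update the lists after a picture with the given TempIdx is decoded."""
        if used_temporal_prediction:
            return  # avoid duplicates: the set is already in a list
        # Array k only holds sets from pictures with TempIdx <= k,
        # so the new set goes into the arrays with equal or higher TempIdx.
        for k in range(temp_idx, len(self.lists)):
            self.lists[k].append(filter_set)


lists = GalfCandidateLists(num_temporal_layers=3)
lists.add_after_decoding("A", temp_idx=1, used_temporal_prediction=False)
lists.add_after_decoding("B", temp_idx=0, used_temporal_prediction=True)
for n in range(7):
    lists.add_after_decoding(f"S{n}", temp_idx=0, used_temporal_prediction=False)
```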

[0230] Temporal prediction of GALF coefficients is used for inter-coded frames to minimize signaling overhead. For intra frames, temporal prediction is unavailable, and a set of 16 fixed filters is assigned to each class. To indicate the use of a fixed filter, a flag for each class is signaled and, if required, the index of the chosen fixed filter. Even when a fixed filter is selected for a given class, the coefficients of the adaptive filter f(k, l) can still be sent for that class; in this case, the filter coefficients applied to the reconstructed picture are the sum of the two sets of coefficients.

[0231] The filtering process of the luminance component can be controlled at the CU level. A flag is signaled to indicate whether GALF is applied to the luminance component of a CU. For the chrominance components, whether GALF is applied is indicated only at the picture level.

[0232] The filtering process is discussed. At the decoder side, when GALF is enabled for a block, each sample R(i, j) within the block is filtered, resulting in the sample value R′(i, j) as shown below, where L denotes the filter length, f_{m,n} represents a filter coefficient, and f(k, l) denotes the decoded filter coefficients.

[0233]

[0234] Figure 11 shows an example of relative coordinates for the 5×5 diamond filter support, assuming that the coordinates (i, j) of the current sample are (0, 0). Samples at different coordinates filled with the same color are multiplied by the same filter coefficient.

[0235] The adaptive loop filter (GALF) based on geometric transformation in VVC is discussed. In VVC test model 4.0 (VTM4.0), the following filtering process of the adaptive loop filter is performed:

[0236] O(x, y) = ∑_{(i,j)} w(i, j) · I(x + i, y + j), (11)

[0237] Here, the samples I(x + i, y + j) are the input samples, O(x, y) is the filtered output sample (i.e., the filter result), and w(i, j) denotes the filter coefficients. In practice, VTM4.0 implements this using integer arithmetic with fixed-point precision.

[0238]

[0239] Where L represents the filter length, and w(i,j) are the filter coefficients with fixed-point precision.

[0240] Compared to JEM, the current design of GALF in VVC has the following main changes:

[0241] 1) Remove adaptive filter shapes. Only 7×7 filter shapes are allowed for the luminance component, and only 5×5 filter shapes are allowed for the chrominance component.

[0242] 2) Signaling of ALF parameters is moved from the slice / picture level to the CTU level.

[0243] 3) The classification index is calculated at the 4×4 level instead of 2×2. In addition, as proposed in JVET-L0147, a subsampled Laplacian calculation method is used for ALF classification. More specifically, there is no need to calculate the horizontal, vertical, 45° diagonal, and 135° diagonal gradients for each sample within a block. Instead, 1:2 subsampling is used.

[0244] Nonlinear ALF in the current VVC is discussed with respect to the reformulated filtering.

[0245] Without affecting coding efficiency, equation (11) can be reformulated as follows:

[0246] O(x, y) = I(x, y) + ∑_{(i,j)≠(0,0)} w(i, j) · (I(x + i, y + j) − I(x, y)), (13)

[0247] Here, w(i, j) are the same filter coefficients as in equation (11), except for w(0, 0), which is equal to 1 in equation (13) while it is equal to 1 − ∑_{(i,j)≠(0,0)} w(i, j) in equation (11).

[0248] Using the filter formula of (13), VVC introduces nonlinearity to make ALF more effective: a simple clipping function reduces the influence of a neighboring sample value I(x + i, y + j) when it differs too much from the current sample value I(x, y) being filtered.

[0249] More specifically, the ALF filter is modified as follows:

[0250] O′(x, y) = I(x, y) + ∑_{(i,j)≠(0,0)} w(i, j) · K(I(x + i, y + j) − I(x, y), k(i, j)), (14)

[0251] Here, K(d, b) = min(b, max(−b, d)) is the clipping function, and k(i, j) are the clipping parameters, which depend on the (i, j) filter coefficient. The encoder performs an optimization to find the best k(i, j).
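
A minimal sketch of equations (13) and (14) for a single interior sample is given below. It uses floating-point weights for readability, whereas the text notes that VTM uses fixed-point integer arithmetic; the dictionary layout for w and k, the image indexing I[y][x], and the function names are assumptions of this sketch, and picture-boundary handling is omitted.

```python
def clip_k(d, b):
    """K(d, b) = min(b, max(-b, d))"""
    return min(b, max(-b, d))


def alf_linear(I, x, y, w):
    """Equation (13): weighted sum of differences to the current sample."""
    center = I[y][x]
    acc = sum(wij * (I[y + j][x + i] - center)
              for (i, j), wij in w.items() if (i, j) != (0, 0))
    return center + acc


def alf_nonlinear(I, x, y, w, k):
    """Equation (14): each difference is clipped to [-k(i,j), k(i,j)] first."""
    center = I[y][x]
    acc = sum(wij * clip_k(I[y + j][x + i] - center, k[(i, j)])
              for (i, j), wij in w.items() if (i, j) != (0, 0))
    return center + acc


image = [[10, 12, 10],
         [8, 20, 100],
         [10, 14, 10]]
weights = {(-1, 0): 0.25, (1, 0): 0.25, (0, -1): 0.25, (0, 1): 0.25}
tight = {pos: 16 for pos in weights}    # clips the outlier neighbor (100)
loose = {pos: 1024 for pos in weights}  # large bound: clipping has no effect

linear = alf_linear(image, 1, 1, weights)
clipped = alf_nonlinear(image, 1, 1, weights, tight)
unclipped = alf_nonlinear(image, 1, 1, weights, loose)
```

With the tight bound, the outlier neighbor contributes at most 16 to the sum, so the output stays close to the current sample; with the loose bound, (14) reduces to (13).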

[0252] In the JVET-N0242 implementation, the clipping parameters k(i, j) are specified for each ALF filter, and one clipping value is signaled per filter coefficient. This means that up to 12 clipping values can be signaled in the bitstream per luminance filter and up to 6 clipping values per chrominance filter.

[0253] To limit the signaling cost and the encoder complexity, only four fixed values are used, and they are the same for INTER and INTRA slices.

[0254] Because the variance of the local differences is typically higher for luminance than for chrominance, two different sets are applied for the luminance and chrominance filters. The maximum sample value (here 1024 for a 10-bit depth) is also included in each set so that clipping can be disabled when it is not needed.

[0255] Table 5 provides the sets of clipping values used in the JVET-N0242 tests. The four values were selected by splitting roughly equally, in the logarithmic domain, the full range of the sample values (coded on 10 bits) for luminance and the range from 4 to 1024 for chrominance.

[0256] More precisely, the luminance table of clipping values is obtained using the following formula:

[0257]

[0258] Similarly, the chrominance table of clipping values is obtained according to the following formula:

[0259]

[0260] Table 5: Authorized Clipping Values

[0261]

[0262] The selected clipping values are coded in the "alf_data" syntax element using a Golomb encoding scheme corresponding to the index of the clipping value in Table 5 above. This encoding scheme is the same as the encoding scheme for the filter index.

[0263] A video encoding / decoding loop filter based on a convolutional neural network is discussed.

[0264] In deep learning, convolutional neural networks (CNNs, or ConvNets) are a type of deep neural network most commonly used for analyzing visual images. They have been very successful in image and video recognition / processing, recommender systems, image classification, medical image analysis, and natural language processing.

[0265] CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually refer to fully connected networks, in which each neuron in one layer is connected to all neurons in the next layer. This full connectedness makes such networks prone to overfitting the data. Typical regularization methods add some form of magnitude measurement of the weights to the loss function. CNNs take a different approach to regularization: they exploit the hierarchical pattern in the data and assemble more complex patterns from smaller and simpler ones. Therefore, on the scale of connectedness and complexity, CNNs are at the lower extreme.

[0266] Compared with other image classification / processing algorithms, CNNs use relatively little preprocessing. This means that the network learns the filters that were designed by hand in traditional algorithms. This independence from prior knowledge and human effort in feature design is a major advantage.

[0267] Deep learning-based image / video compression generally has two meanings: end-to-end compression based purely on neural networks, and traditional frameworks enhanced by neural networks. End-to-end compression based purely on neural networks is discussed in: Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” In: 2016 Picture Coding Symposium (PCS), pp. 1-5, Institute of Electrical and Electronics Engineers (IEEE); and Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395 (2017). Traditional frameworks enhanced by neural networks are discussed in: Jiahao Li, Bin Li, Jizheng Xu, Ruiqin Xiong, and Wen Gao, “Fully Connected Network-Based Intra Prediction for Image Coding,” IEEE Transactions on Image Processing 27, 7 (2018), 3236–3247; Yuanying Dai, Dong Liu, and Feng Wu, “A convolutional neural network approach for post-processing in HEVC intra coding,” MMM. Springer, 28–39; Rui Song, Dong Liu, Houqiang Li, and Feng Wu, “Neural network-based arithmetic coding of intra prediction modes in HEVC,” VCIP. IEEE, 1–4; and J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, W. Samek, H. Schwarz, D. Marpe, and T. Wiegand, “Neural network based intra prediction for video coding,” Applications of Digital Image Processing XLI, Vol. 10752. International Society for Optics and Photonics, 1075213.

[0268] The first type typically adopts an autoencoder-like structure, implemented by convolutional neural networks or recurrent neural networks. While relying purely on neural networks for image / video compression avoids any manual optimization or hand-crafted design, the compression efficiency may not be satisfactory. Therefore, works of the second type use neural networks as an auxiliary means and enhance traditional compression frameworks by replacing or strengthening some modules. In this way, they can inherit the advantages of highly optimized traditional frameworks. For example, a fully connected network for intra prediction in HEVC is proposed in Jiahao Li, Bin Li, Jizheng Xu, Ruiqin Xiong, and Wen Gao, "Fully Connected Network-Based Intra Prediction for Image Coding," IEEE Transactions on Image Processing 27, 7 (2018), pp. 3236-3247.

[0269] Besides intra prediction, deep learning has also been used to enhance other modules. For example, the loop filter of HEVC is replaced with a convolutional neural network, with promising results, in Yuanying Dai, Dong Liu, and Feng Wu, “A convolutional neural network approach for post-processing in HEVC intra coding,” MMM. Springer, 28–39. The work in Rui Song, Dong Liu, Houqiang Li, and Feng Wu, “Neural network-based arithmetic coding of intra prediction modes in HEVC,” VCIP. IEEE, 1–4, applies neural networks to improve the arithmetic coding engine.

[0270] Loop filtering based on convolutional neural networks is discussed. In lossy image / video compression, the reconstructed frame is an approximation of the original frame, and because the quantization process is irreversible, this leads to distortion in the reconstructed frame. To mitigate this distortion, a convolutional neural network can be trained to learn the mapping from the distorted frame to the original frame. In practice, this training must be performed before utilizing CNN-based loop filtering.

[0271] Training is discussed. The purpose of training is to find the optimal values of the parameters, including weights and biases.

[0272] First, an encoder / decoder (such as HM, JEM, VTM, etc.) is used to compress the training dataset to generate distorted reconstructed frames. Then, the reconstructed frames are fed into a CNN, and the cost is calculated using the CNN's output and the ground truth frames (original frames). Common cost functions include the Sum of Absolute Difference (SAD) and Mean Square Error (MSE). Next, the gradient of the cost with respect to each parameter is derived using backpropagation. The gradients are used to update the parameter values. This process is repeated until the convergence criterion is met. After training is complete, the derived optimal parameters are saved for use in the inference phase.
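
The loop described above (compute a cost against the original frames, derive gradients, update the parameters, repeat until convergence) can be illustrated with a deliberately tiny stand-in. This sketch is not a CNN: a two-parameter model y = w·x + b replaces the network, analytically derived gradients replace backpropagation, and all names, the learning rate, and the stopping tolerance are assumptions chosen for illustration.

```python
def sad(pred, target):
    """Sum of absolute differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_toy_filter(recon, orig, lr=0.5, tol=1e-12, max_iters=10000):
    """Fit y = w * x + b to map reconstructed samples to original samples.

    recon, orig : lists of sample values normalized to [0, 1]
    Returns the learned (w, b), which would be saved for the inference phase.
    """
    w, b = 1.0, 0.0
    n = len(recon)
    prev_cost = float("inf")
    for _ in range(max_iters):
        pred = [w * x + b for x in recon]
        cost = mse(pred, orig)
        if abs(prev_cost - cost) < tol:  # convergence criterion
            break
        prev_cost = cost
        # Gradients of the MSE cost with respect to each parameter
        grad_w = sum(2 * (p - t) * x for p, t, x in zip(pred, orig, recon)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(pred, orig)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


recon = [0.0, 0.25, 0.5, 0.75, 1.0]
orig = [0.9 * x + 0.05 for x in recon]  # synthetic ground truth
w, b = train_toy_filter(recon, orig)
```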

[0273] The convolution process is discussed. During convolution, the filter moves across the image from left to right and from top to bottom, with a one-pixel column change on horizontal movements and a one-pixel row change on vertical movements. The amount of movement between applications of the filter to the input image is called the stride, and it is almost always symmetrical in the height and width dimensions. The default stride in two dimensions is (1, 1) for the height and width movement.
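
The effect of the stride can be seen in a minimal pure-Python sketch. The function name is hypothetical, no padding is applied, and, as is common in CNN frameworks, the kernel is applied without flipping (strictly a cross-correlation).

```python
def conv2d(image, kernel, stride=(1, 1)):
    """Slide kernel over image left to right, top to bottom.

    stride = (rows per vertical move, columns per horizontal move)
    """
    kh, kw = len(kernel), len(kernel[0])
    sh, sw = stride
    out_h = (len(image) - kh) // sh + 1
    out_w = (len(image[0]) - kw) // sw + 1
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[oy * sh + ky][ox * sw + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
box = [[1, 1],
       [1, 1]]
dense = conv2d(image, box)                  # default stride (1, 1): 3x3 output
strided = conv2d(image, box, stride=(2, 2))  # stride (2, 2): 2x2 output
```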

[0274] Figure 12A shows an example architecture 1200 of the proposed CNN filter. Figure 12B shows an example construction 1250 of a residual block (ResBlock). In most deep convolutional neural networks, residual blocks are used as the basic module and stacked several times to build the final network. In one example, as shown in Figure 12B, the residual block is obtained by combining a convolutional layer, a ReLU / PReLU activation function, and a convolutional layer.

[0275] Inference is discussed. During the inference phase, the distorted reconstructed frames are fed into the CNN and processed by the CNN model, whose parameters have been determined during the training phase. The input samples to the CNN can be reconstructed samples before or after DB, or before or after SAO, or before or after ALF.

[0276] Current CNN-based loop filtering has the following problems. First, the output of the CNN-based loop filter is used directly, whereas for some content, scaling the output with a linear model can provide better filtering strength. Second, the inference block size is fixed across different video content and different compression settings, whereas for sequences with lower resolution or higher bitrate, a finer granularity may be beneficial. Third, how to combine the outputs of several models has not been fully explored.

[0277] This disclosure describes techniques for addressing one or more of the aforementioned problems. For example, this disclosure provides one or more neural network (NN) filter models trained as part of a loop filtering technique or a filtering technique used in a post-processing stage to reduce distortion generated during compression. Furthermore, samples with different characteristics are processed by different NN filter models. This disclosure also details how to scale the output of the NN filters to achieve better performance, how to set the inference block size, and how to combine the outputs of multiple NN filter models.

[0278] Video encoding and decoding are lossy processes. Convolutional neural networks (CNNs) can be trained to recover details lost during compression. In other words, artificial intelligence (AI) processes can create CNN filters based on training data.

[0279] Different CNN filters perform best in different situations. The encoder and decoder can use multiple CNN filters that have been trained in advance (also known as pre-trained). This disclosure describes methods and techniques that allow the encoder to signal to the decoder which CNN filter to use for each video unit. A video unit can be a sequence of images, pictures, strips, slices, tiles, sub-pictures, coding tree units (CTUs), CTU rows, coding units (CUs), etc. For example, different CNN filters can be used for different layers, different components (e.g., luminance, chrominance, Cb, Cr, etc.), different specific video units, etc. Flags and / or indices can be signaled to indicate which CNN filter should be used for each video unit. CNN filters can be signaled based on whether adjacent video units use that filter. Inheritance of CNN filters between parent and child nodes is also provided when a tree is used to partition video units.

[0280] The embodiments listed below should be considered as examples to explain general concepts. These embodiments should not be interpreted narrowly. Furthermore, these embodiments can be combined in any way.

[0281] In this disclosure, the NN filter can be any kind of NN filter, such as a convolutional neural network (CNN) filter. In the following discussion, the NN filter may also be referred to as a CNN filter.

[0282] In the following discussion, a video unit can be a sequence, picture, strip, slice, tile, sub-picture, CTU / CTB, CTU / CTB row, one or more CUs / coding blocks (CBs), one or more CTUs / CTBs, one or more Virtual Pipeline Data Units (VPDUs), or a sub-region within a picture / strip / slice / tile. A parent video unit represents a unit larger than the video unit. Typically, a parent unit contains several video units; for example, when the video unit is a CTU, the parent unit can be a strip, a CTU row, multiple CTUs, etc. In some embodiments, a video unit can be a sample / pixel.

[0283] Figures 13A-13D show examples of processes for generating filtered samples using residual scaling and neural network filtering. In Figure 13A, the residual is the output of the NN filter. In process 1300 shown in Figure 13A, some unfiltered samples are input to an NN filter and a summing device. In one embodiment, unfiltered samples are samples (e.g., pixels) that have not undergone any filtering. The output of the NN filter is a residual (or is used to generate a residual). A residual scaling function (e.g., linear / nonlinear) is applied to the residual. The summing device combines the scaled residual with some unfiltered samples (which bypass the NN filter) and outputs filtered samples.

[0284] In Figure 13B, the residual is the difference between the output of the NN filter and the unfiltered samples. In process 1320 shown in Figure 13B, some unfiltered samples are input to an NN filter, a difference device, and a summing device. The output of the NN filter is input to the difference device. The difference between the output of the NN filter and the unfiltered samples is the residual. A residual scaling function (e.g., linear / nonlinear) is applied to the residual. The summing device combines the scaled residual with some unfiltered samples (which bypass the NN filter) and outputs the filtered samples.

[0285] In process 1340 shown in Figure 13C, some unfiltered samples are input to an NN filter and a summing device. The output of the NN filter is the residual (or is used to generate the residual). When the switch is in the position shown in Figure 13C, a residual scaling function (e.g., linear / nonlinear) is applied to the residual. One of the summing devices then combines the scaled residual with some unfiltered samples and outputs filtered samples. Alternatively, the switch can be positioned to supply the residual directly to another summing device, which combines the residual with some unfiltered samples to generate filtered samples.

[0286] In process 1360 shown in Figure 13D, some unfiltered samples are input to an NN filter, a difference device, and a summing device. The output of the NN filter is input to the difference device. The difference between the output of the NN filter and the unfiltered samples is the residual. When the switch is in the position shown in Figure 13D, a residual scaling function (e.g., linear / nonlinear) is applied to the residual. One of the summing devices then combines the scaled residual with some unfiltered samples and outputs filtered samples. Alternatively, the switch can be positioned to supply the residual directly to another summing device, which combines the residual with some unfiltered samples to generate filtered samples.
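
The data flow of Figures 13A-13D can be summarized in one small function. This is a sketch under stated assumptions: nn_filter is a hypothetical callable standing in for the NN filter, the switch of Figures 13C and 13D is modeled by passing scale=None, and the lambda functions below are toy placeholders rather than real filter models.

```python
def generate_filtered_samples(unfiltered, nn_filter, scale=None,
                              residual_is_difference=False):
    """Produce filtered samples from unfiltered samples.

    nn_filter : callable mapping a list of samples to a list of outputs
    scale     : residual scaling function, or None when the switch bypasses it
    residual_is_difference : False -> residual is the NN output (Figs. 13A/13C)
                             True  -> residual is NN output minus unfiltered
                                      samples (Figs. 13B/13D)
    """
    nn_out = nn_filter(unfiltered)
    if residual_is_difference:
        residual = [r - x for r, x in zip(nn_out, unfiltered)]
    else:
        residual = list(nn_out)
    if scale is not None:
        residual = [scale(r) for r in residual]
    # Summing device: unfiltered samples bypass the NN filter and are added back
    return [x + r for x, r in zip(unfiltered, residual)]


samples = [10, 20]
half = lambda r: 0.5 * r                      # toy linear scaling
predicts_residual = lambda xs: [2 for _ in xs]
predicts_samples = lambda xs: [x + 4 for x in xs]

fig_13a = generate_filtered_samples(samples, predicts_residual, scale=half)
fig_13b = generate_filtered_samples(samples, predicts_samples, scale=half,
                                    residual_is_difference=True)
fig_13c_bypass = generate_filtered_samples(samples, predicts_residual)
```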

[0287] A discussion of model selection is provided.

[0288] Example 1

[0289] 1. The residual values ​​determined by the output of the NN filter can be corrected by a function (e.g., scaled) and then added to the corresponding unfiltered samples in the video unit to generate the final filtered samples.

[0290] a. In one example, the unfiltered samples are the reconstructions before NN filtering.

[0291] b. In one example, such as Figure 13A As shown, the residual is the output of the NN filter. Let the unfiltered sample be X, and the output of the NN filter be R. Then the filtering process is defined as: Y = X + F(R), where Y is the filtered sample, and F represents a function (e.g., residual scaling operation), which will be explained in detail later.

[0292] c. In one example, the input to the function may include at least the output of the NN filter and the unfiltered samples. Let the unfiltered samples be denoted as X, and the output of the NN filter as R. Then the filtering process is defined as: Y = X + F(R, X), where Y represents the filtered samples, and F represents a function (e.g., residual scaling operation), which will be explained in detail later.

[0293] d. In one example, as shown in Figure 13B, the residual is the difference between the output of the NN filter and the unfiltered sample. Let the unfiltered sample be X, and the output of the NN filter be R. Then the filtering process is defined as: Y = X + F(R - X), where Y is the filtered sample, and F represents a function (e.g., residual scaling operation), which will be explained in detail later.

[0294] e. In one example, Y = Clip(X + F(R)), where X and Y are the unfiltered sample point and the filtered sample point respectively. F represents a function (e.g., a residual scaling operation). Clip is a clipping operation. For example, Clip(w) = max(0, min((1 << B) - 1, w)), where B is the bit depth of the filtered sample point. The bit depth (also known as the color depth) describes the amount of information stored in each data pixel. In one embodiment, the bit depth is the number of bits used to indicate the color of a single pixel in a bitmap image or video frame buffer, or the number of bits for each color component of a single pixel.

[0295] f. In one example, the residual scaling is characterized by a linear model that includes two coefficients, i.e., F(Residual) = α × Residual + β, where the pair (α, β) is called a coefficient candidate, which can be predefined or derived on the fly.

[0296] i. In one example, (α, β) can be set to (1, 0), as shown in Figures 13C-13D, which indicates that the residual scaling is turned off / disabled.

[0297] ii. Alternatively, F(Residual) = α × Residual + β can be replaced by F(Residual) = (Residual >> α) + β.

[0298] 1) Alternatively, F(Residual) = α × Residual + β can be replaced by F(Residual) = (Residual << α) + β.

[0299] 2) In one example, (α, β) can be set to (0, 0), as shown in Figures 13C-13D, which indicates that the residual scaling is turned off / disabled.

[0300] iii. In one example, (α, β) can be fixed for the sample points within a video unit.

[0301] 1) Alternatively, for different sample points within a video unit, (α, β) can be different.

[0302] iv. In another example, F(Residual) = ((α × Residual + offset) >> s) + β, where s is a predefined scaling factor and offset is an integer, e.g., 1 << (s - 1).

[0303] v. In another example, F(Residual) = ((α×Residual+β+offset)>>s), where s is a predefined scaling factor and offset is an integer, such as 1<<(s-1).

[0304] vi. In one example, the output of a function can be truncated to the valid range.

[0305] vii. Alternatively, there are N predefined coefficient candidates {α_0, β_0}, {α_1, β_1}, …, {α_{N-1}, β_{N-1}}.

[0306] viii. In one example, for a video unit, coefficient candidates can be determined based on decoding information, such as QP information / prediction mode / reconstruction sample information / color components / color format / temporal layer / slice or picture type. In one embodiment, picture type refers to instantaneous decoder refresh (IDR) picture, broken link access (BLA) picture, clean random access (CRA) picture, random access decodable leading (RADL) picture, random access skipped leading (RASL) picture, etc. In one embodiment, a temporal layer is a layer in scalable video coding (e.g., layer 0, layer 1, layer 2, etc.).

[0307] ix. In one example, for a video unit, an indicator (e.g., an index) of one or more coefficients is signaled to indicate the selection from the candidates.

[0308] 1) Alternatively, the indicator can be conditionally signaled, for example, based on whether the NN filter is applied to the video unit.

[0309] x. In one example, the luminance and chrominance components in a video unit can use different sets of coefficient candidates.

[0310] xi. In one example, the same number of coefficient candidates are allowed for the luminance and chrominance components, but the coefficient candidates for the luminance and chrominance components are different.

[0311] xii. In one example, the chrominance components (e.g., Cb and Cr or U and V) in a video unit can share the same coefficient candidates.

[0312] xiii. In one example, the coefficient candidates may be different for different video units (e.g., sequence / picture / slice / tile / brick / subpicture / CTU / CTU row / CU).

[0313] xiv. In one example, coefficient candidates may depend on the tree splitting structure (e.g., dual-tree or single-tree), the slice type, or the quantization parameter (QP). In one embodiment, the QP determines the step size used to associate the transformed coefficients with a finite set of steps. The QP can be in the range of, for example, 0 to 51. In one embodiment, the slice type is, for example, an I-slice (a slice with only intra-frame predictions), a P-slice (a slice with inter-frame predictions from one I or P slice), or a B-slice (a slice with inter-frame predictions from two I or P slices).

[0314] xv. In one example, multiple coefficient candidates can be used and / or signaled for a video unit.

[0315] xvi. In one example, samples in a video unit can be grouped into N groups, and each group can use its own coefficient candidates. In another example, different color components (including luminance and chrominance) in a video unit can share the same one or more signaled coefficient indices.

[0316] 1) Alternatively, a coefficient index is signaled for each color component in the video unit.

[0317] 2) Alternatively, a first coefficient index is signaled for the first color component (e.g., luminance), and a second coefficient index is signaled for the second and third color components (e.g., Cb and Cr, or U and V).

[0318] 3) Alternatively, an indicator (e.g., a flag) may be signaled to indicate whether all color components will share the same coefficient index.

[0319] a. In one example, when the flag is true, one coefficient index is signaled for the video unit. Otherwise, coefficient indices are signaled according to the bullets above.

[0320] 4) Alternatively, an indicator (e.g., a flag) may be signaled to indicate whether two components (e.g., the second and third color components, or Cb and Cr, or U and V) will share the same coefficient index.

[0321] a. In one example, when the flag is true, one coefficient index is signaled for the two components. Otherwise, a separate coefficient index is signaled for each of the two components.

[0322] 5) Alternatively, an indicator (e.g., a flag) may be signaled to indicate whether residual scaling will be applied to the current video unit.

[0323] a. In one example, if the flag is false, residual scaling will not be applied to the current video unit, meaning that no coefficient index will be signaled. Otherwise, the coefficient index is signaled according to the bullets above.

[0324] 6) Alternatively, an indicator (e.g., a flag) may be signaled to indicate whether residual scaling is applied to two components in the current video unit (e.g., the second and third color components, or Cb and Cr, or U and V).

[0325] a. In one example, if the flag is false, residual scaling will not be applied to the two components, meaning that no coefficient index will be signaled for either component. Otherwise, the coefficient indices are signaled according to the bullets above.

[0326] xvii. In arithmetic coding, coefficient indices can be coded using one or more contexts.

[0327] 1) In one example, the coefficient index can be binarized into a bin string, and at least one bin can be coded using one or more contexts.

[0328] 2) Alternatively, the coefficient index can be binarized into a bin string, and at least one bin can be coded in bypass mode.

[0329] 3) The context can be derived from the coding information of the current unit and / or adjacent units.

[0330] xviii. The coefficient index can be binarized using a fixed-length code, a unary code, a truncated unary code, an exponential Golomb code (e.g., a K-th order EG code, where K = 0), a truncated exponential Golomb code, or a truncated binary code.

[0331] xix. In one example, coefficient indices can be coded predictively.

[0332] 1) For example, a previously coded coefficient index can be used as a prediction of the current coefficient index.

[0333] 2) A flag can be signaled to indicate whether the current coefficient index is equal to the previously coded coefficient index.

[0334] xx. In one example, the coefficient index in the current video unit can be inherited from previously coded / neighboring video units.

[0335] 1) In one example, the number of previously coded / adjacent video unit candidates is denoted as C. An inheritance index (ranging from 0 to C-1) is then signaled for the current video unit to indicate the candidate to inherit from.

[0336] xxi. In one example, the residual scaling on / off control of the current video unit can be inherited from previously coded / adjacent video units.

[0337] xxii. In one example, a first indicator can be signaled in a parent unit to indicate how coefficient indices will be signaled for each video unit contained in the parent unit, or how residual scaling will be applied for each video unit contained in the parent unit.

[0338] 1) In one example, the first indicator can be used to indicate whether all samples within the parent unit share the same on / off control.

[0339] 2) Alternatively, a second indicator of the video unit within the parent unit may be conditionally signaled based on the first indicator to indicate the use of residual scaling.

[0340] 3) In one example, the first indicator can be used to indicate which coefficient indices are used for all samples within the parent unit.

[0341] 4) In one example, the first indicator can be used to indicate whether further signaling is needed for the coefficient index of the video unit within the parent unit.

[0342] 5) In one example, the indicator can have K+2 options, where K is the number of coefficient candidates.

[0343] a) In one example, when the indicator is 0, residual scaling is disabled for all video units contained in the parent unit.

[0344] b) In one example, when the indicator is i (1≤i≤K), the i-th coefficient will be used for all video units contained in the parent unit. Clearly, for the K options mentioned above, no coefficient index needs to be signaled for any video units contained in the parent unit.

[0345] c) In one example, when the indicator is K+1, a coefficient index will be signaled for each video unit contained in the parent unit.
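The residual scaling of Example 1 can be sketched on the decoder side as follows. This is a minimal sketch under stated assumptions: it uses the fixed-point form F(Residual) = ((α × Residual + offset) >> s) + β with offset = 1 << (s - 1), the clipping operation Clip(w), and the parent-unit indicator with K+2 options. The candidate list `CANDIDATES`, the shift `s = 6`, and the function names are hypothetical and are not taken from the disclosure.

```python
def clip(w, bit_depth):
    """Clip(w) = max(0, min((1 << B) - 1, w))."""
    return max(0, min((1 << bit_depth) - 1, w))


def scale_residual(r, alpha, beta, s):
    """F(r) = ((alpha * r + offset) >> s) + beta, offset = 1 << (s - 1); assumes s >= 1."""
    offset = 1 << (s - 1)
    return ((alpha * r + offset) >> s) + beta


def select_coefficient(parent_indicator, candidates, unit_index=None):
    """Resolve the (alpha, beta) pair for one video unit from a K+2-valued indicator."""
    k = len(candidates)
    if parent_indicator == 0:            # scaling disabled for every unit in the parent
        return None
    if 1 <= parent_indicator <= k:       # i-th candidate shared by every unit
        return candidates[parent_indicator - 1]
    if parent_indicator == k + 1:        # index signaled per video unit
        if unit_index is None:
            raise ValueError("a per-unit coefficient index is required")
        return candidates[unit_index]
    raise ValueError("indicator out of range")


def filter_sample(x, r, coeff, s=6, bit_depth=8):
    """Y = Clip(X + F(R)); with scaling disabled the residual is added unchanged."""
    if coeff is None:
        return clip(x + r, bit_depth)
    alpha, beta = coeff
    return clip(x + scale_residual(r, alpha, beta, s), bit_depth)


# Hypothetical candidates: scale factors 1.0, 0.75, 0.5 when s = 6.
CANDIDATES = [(64, 0), (48, 0), (32, 0)]

coeff = select_coefficient(2, CANDIDATES)
print(filter_sample(100, 10, coeff))                              # 108
print(filter_sample(100, -10, coeff))                             # 93
print(filter_sample(250, 10, select_coefficient(0, CANDIDATES)))  # 255
```

Keeping α as an integer together with a shift s avoids floating-point arithmetic, which is one plausible reason for the shift-based variants listed above.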

[0346] Example 2

[0347] 2. In a second embodiment, for a video unit (e.g., slice / picture / tile / subpicture), the inference block size (referring to the granularity when applying NN filters) can be derived implicitly or explicitly. Inference applies knowledge from the trained neural network model and uses that knowledge to infer the results. In one embodiment, the inference block size is a block size determined based on knowledge from the CNN.

[0348] a. In one example, there are multiple inference block size candidates for a video unit (e.g., slice / picture / tile / subpicture).

[0349] i. In one example, the candidate inference block sizes can be predefined or associated with some information (e.g., QP / slice or picture type / partitioning tree / color component). A partitioning tree is a hierarchical data structure formed by recursively dividing video units into smaller video units (e.g., dividing a block into sub-blocks). A partitioning tree can include a series of branches and leaves.

[0350] b. In one example, at least one indicator (e.g., an index) is signaled for the video unit to indicate which candidate will be used.

[0351] c. In one example, the inference block size can be derived on the fly.

[0352] i. In one example, it may depend on the quantization parameter (QP) and / or resolution of the video sequence.

[0353] ii. In one example, the inference block size is set larger at low bit rates (i.e., high QP) and vice versa.

[0354] iii. In one example, the inference block size is set larger for high-resolution sequences and vice versa.

[0355] d. In one example, different networks can use different inference block sizes.

[0356] i. Networks can share certain parts, such as some connection layers.

[0357] e. In one example, a network can use different inference block sizes.
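An on-the-fly derivation of the inference block size, as in items c.i to c.iii, might look like the following sketch. The candidate sizes, the QP threshold of 37, and the 1920×1080 resolution threshold are purely illustrative assumptions; the text only states that the size is larger at high QP and at high resolution. The helper `inference_blocks` shows how a picture could then be covered by blocks of that size.

```python
def derive_inference_block_size(qp, width, height, candidates=(64, 128, 256)):
    """Pick a larger block at high QP (low bit rate) and at high resolution."""
    idx = 0
    if qp >= 37:                          # hypothetical QP threshold
        idx += 1
    if width * height >= 1920 * 1080:     # hypothetical resolution threshold
        idx += 1
    return candidates[min(idx, len(candidates) - 1)]


def inference_blocks(width, height, size):
    """Yield (x, y, w, h) blocks covering the picture; edge blocks are cropped."""
    for y in range(0, height, size):
        for x in range(0, width, size):
            yield x, y, min(size, width - x), min(size, height - y)


size = derive_inference_block_size(qp=42, width=416, height=240)
print(size)                                         # 128
print(len(list(inference_blocks(416, 240, size))))  # 8
```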

[0358] Example 3

[0359] 3. In the third embodiment, for the sample points to be filtered, multiple NN filter models can be applied instead of a single NN filter model to derive the filtered sample points.

[0360] a. In one example, the outputs of multiple NN filter models are weighted and summed (e.g., linearly or nonlinearly) to derive the final filtered output of the sample points.

[0361] i. Alternatively, a new NN filter model can be derived using the parameters associated with multiple NN filter models, and then the filtered samples can be derived using the new NN filter model.

[0362] b. In one example, the one or more weights and / or the multiple NN filter models to be applied can be signaled in the bitstream.

[0363] i. Alternatively, the weights can be derived on the fly without signaling.

[0364] c. In one example, the weights of different NN filter models are equal.

[0365] d. In one example, the weights depend on the QP and / or slice or picture type / color component / color format / temporal layer.

[0366] e. In one example, the weights depend on the NN filter model.

[0367] f. In one example, the weights depend on the inference block size.

[0368] g. In one example, different spatial locations have different weights. In one embodiment, a spatial location is the position of an element relative to another element (e.g., the position of a pixel within an image).

[0369] i. In one example, there are two NN filter models trained based on boundary strength and other information, respectively. For the model trained based on boundary strength, the weights of boundary samples are set higher (e.g., 1), and the weights of interior samples are set lower (e.g., 0). For the other model, the weights of boundary samples are set lower (e.g., 0), and the weights of interior samples are set higher (e.g., 1).
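The spatially varying weighting of item g.i can be sketched as below. The definition of a "boundary sample" used here (first or last row or column of each `block` × `block` region) and the names `boundary_weights` and `combine` are assumptions made for illustration; the outputs of the two NN filter models are given as plain 2-D lists.

```python
def boundary_weights(height, width, block):
    """Weight 1 on block-boundary samples, 0 on interior samples (assumed definition)."""
    return [[1 if (x % block in (0, block - 1) or y % block in (0, block - 1)) else 0
             for x in range(width)]
            for y in range(height)]


def combine(out_boundary_model, out_other_model, weights):
    """Per-sample weighted sum; the second model receives weight (1 - w)."""
    return [[w * a + (1 - w) * b for a, b, w in zip(row_a, row_b, row_w)]
            for row_a, row_b, row_w in zip(out_boundary_model, out_other_model, weights)]


a = [[10] * 4 for _ in range(4)]   # output of the model trained with boundary strength
b = [[20] * 4 for _ in range(4)]   # output of the other model
print(combine(a, b, boundary_weights(4, 4, 4))[1])  # [10, 20, 20, 10]
```

Passing a weight map filled with 0.5 instead gives the equal-weight case of item c.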

[0370] Figure 14 is a block diagram illustrating an example video processing system 1400, which can implement various techniques provided in this disclosure. Various implementations may include some or all of the components of the video processing system 1400. The video processing system 1400 may include an input 1402 for receiving video content. The video content may be received in a raw or uncompressed format (e.g., 8 or 10-bit multi-component pixel values), or it may be received in a compressed or encoded format. Input 1402 may represent a network interface, a peripheral bus interface, or a storage interface. Examples of network interfaces include wired interfaces such as Ethernet, Passive Optical Network (PON), etc., and wireless interfaces such as Wi-Fi or cellular interfaces.

[0371] Video processing system 1400 may include codec component 1404, which may implement the various coding or encoding methods described in this document. Codec component 1404 may reduce the average bit rate of the video from input 1402 to the output of codec component 1404 to generate a coded representation of the video. Therefore, the coding techniques are sometimes referred to as video compression or video transcoding techniques. As shown by component 1406, the output of codec component 1404 may be stored or transmitted via a connected communication. The stored or communicated bitstream (or coded) representation of the video received at input 1402 may be used by component 1408 to generate pixel values or displayable video to be sent to display interface 1410. The process of generating user-viewable video from the bitstream representation is sometimes referred to as video decompression. Furthermore, although some video processing operations are referred to as “coding” operations or tools, it should be understood that the coding tools or operations are used at the encoder, and the corresponding decoding tools or operations that reverse the results of the coding will be performed by the decoder.

[0372] Examples of peripheral bus interfaces or display interfaces may include Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or DisplayPort. Examples of storage interfaces include SATA (Serial Advanced Technology Attachment), PCI, IDE, etc. The technologies described in this document can be embodied in a variety of electronic devices, such as mobile phones, laptops, smartphones, or other devices capable of performing digital data processing and / or video display.

[0373] Figure 15 is a block diagram of a video processing device 1500. Device 1500 can be used to implement one or more methods described in this disclosure. Device 1500 can be embodied in a smartphone, tablet, computer, Internet of Things (IoT) receiver, etc. Device 1500 may include one or more processors 1502, one or more memories 1504, and video processing hardware 1506 (also referred to as video processing circuitry). Processor 1502 can be configured to implement one or more methods described in this document. Memory 1504 can be used to store data and code for implementing the methods and techniques described herein. Video processing hardware 1506 can be used to implement some of the techniques described in this document in hardware circuitry. In some embodiments, hardware 1506 may be partially or wholly located within processor 1502 (e.g., a graphics processor).

[0374] Figure 16 is a block diagram illustrating an example video codec system 1600 that can utilize the techniques disclosed herein. As shown in Figure 16, the video codec system 1600 may include a source device 1610 and a target device 1620. The source device 1610 generates encoded video data and may be referred to as a video encoding device. The target device 1620 can decode the encoded video data generated by the source device 1610 and may be referred to as a video decoding device.

[0375] The source device 1610 may include a video source 1612, a video encoder 1614, and an input / output (I / O) interface 1616.

[0376] Video source 1612 may include sources such as video capture devices, interfaces for receiving video data from video content providers, and / or computer graphics systems for generating video data, or combinations of these sources. Video data may include one or more pictures. Video encoder 1614 encodes the video data from video source 1612 to generate a bitstream. The bitstream may include a sequence of bits that forms a coded representation of the video data. The bitstream may include coded pictures and associated data. A coded picture is a coded representation of a picture. Associated data may include sequence parameter sets, picture parameter sets, and other syntax structures. I / O interface 1616 may include a modulator / demodulator (modem) and / or a transmitter. Encoded video data may be transmitted directly to target device 1620 via network 1630 through I / O interface 1616. Encoded video data may also be stored on storage medium / server 1640 for access by target device 1620.

[0377] The target device 1620 may include an I / O interface 1626, a video decoder 1624, and a display device 1622.

[0378] I / O interface 1626 may include a receiver and / or a modem. I / O interface 1626 may acquire encoded video data from source device 1610 or storage medium / server 1640. Video decoder 1624 may decode the encoded video data. Display device 1622 may display the decoded video data to a user. Display device 1622 may be integrated with target device 1620, or may be external to target device 1620, which is then configured to interface with the external display device.

[0379] The video encoder 1614 and the video decoder 1624 can operate according to video compression standards, such as the High Efficiency Video Coding (HEVC) standard, the Versatile Video Coding (VVC) standard, and other current and / or further standards.

[0380] Figure 17 is a block diagram illustrating an example of a video encoder 1700. The video encoder 1700 can be the video encoder 1614 in the video codec system 1600 shown in Figure 16.

[0381] The video encoder 1700 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 17, the video encoder 1700 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video encoder 1700. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0382] The functional components of the video encoder 1700 may include a segmentation unit 1701, a prediction unit 1702 (which may include a mode selection unit 1703, a motion estimation unit 1704, a motion compensation unit 1705, and an intra-frame prediction unit 1706), a residual generation unit 1707, a transform unit 1708, a quantization unit 1709, an inverse quantization unit 1710, an inverse transform unit 1711, a reconstruction unit 1712, a buffer 1713, and an entropy coding unit 1714.

[0383] In other examples, the video encoder 1700 may include more, fewer, or different functional components. In one example, the prediction unit 1702 may include an intra-block copy (IBC) unit. The IBC unit can perform prediction in IBC mode, where at least one reference picture is the picture containing the current video block.

[0384] Furthermore, some components, such as the motion estimation unit 1704 and the motion compensation unit 1705, can be highly integrated, but are shown separately in the example of Figure 17 for purposes of explanation.

[0385] The segmentation unit 1701 can segment a picture into one or more video blocks. The video encoder 1614 and video decoder 1624 of Figure 16 can support various video block sizes.

[0386] The mode selection unit 1703 can select one of the coding modes (intra-frame or inter-frame), e.g., based on error results, and provide the resulting intra-frame or inter-frame coded block to the residual generation unit 1707 to generate residual block data and to the reconstruction unit 1712 to reconstruct the encoded block for use as a reference picture. In some examples, the mode selection unit 1703 can select a combined intra-frame and inter-frame prediction (CIIP) mode, where prediction is based on an inter-frame prediction signal and an intra-frame prediction signal. In the case of inter-frame prediction, the mode selection unit 1703 can also select a motion vector resolution (e.g., sub-pixel or integer pixel precision) for the block.

[0387] To perform inter-frame prediction on the current video block, motion estimation unit 1704 can generate motion information for the current video block by comparing one or more reference frames from buffer 1713 with the current video block. Motion compensation unit 1705 can determine the predicted video block for the current video block based on motion information and decoded samples from images other than those associated with the current video block from buffer 1713.

[0388] The motion estimation unit 1704 and motion compensation unit 1705 can perform different operations on the current video block, depending on whether the current video block is in an I-slice, P-slice, or B-slice. I-slices (or I-frames) are the least compressible but do not require other video frames for decoding. P-slices (or P-frames) can be decompressed using data from previous frames and are more compressible than I-frames. B-slices (or B-frames) can use both previous and subsequent frames as data references to achieve the highest data compression.

[0389] In some examples, motion estimation unit 1704 may perform unidirectional prediction for the current video block, and may search reference images in list 0 or list 1 to find a reference video block for the current video block. Motion estimation unit 1704 may then generate a reference index indicating the reference image containing the reference video block in list 0 or list 1, and a motion vector indicating the spatial displacement between the current video block and the reference video block. Motion estimation unit 1704 may output the reference index, prediction direction indicator, and motion vector as motion information for the current video block. Motion compensation unit 1705 may generate a predicted video block for the current block based on the reference video block indicated by the motion information of the current video block.

[0390] In other examples, motion estimation unit 1704 can perform bidirectional prediction for the current video block. Motion estimation unit 1704 can search for a reference video block for the current video block in the reference images in list 0, and can also search for another reference video block for the current video block in the reference images in list 1. Motion estimation unit 1704 can then generate reference indexes indicating the reference images in list 0 and list 1 that contain the reference video blocks, and motion vectors indicating the spatial displacements between the reference video blocks and the current video block. Motion estimation unit 1704 can output the reference indexes and motion vectors of the current video block as motion information for the current video block. Motion compensation unit 1705 can generate a predicted video block for the current video block based on the reference video blocks indicated by the motion information of the current video block.
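The search for a reference video block and the resulting motion vector can be illustrated with a small full-search sketch. The sum of absolute differences (SAD) cost, the exhaustive search window, and the rounded average used for bidirectional prediction are common choices assumed here for illustration; the text does not specify them, and all function names are hypothetical.

```python
def get_block(frame, x, y, size):
    """Extract a size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def motion_search(cur_block, ref_frame, x0, y0, size, search_range):
    """Full search around (x0, y0); returns ((dx, dy), cost) of the best match."""
    h, w = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x <= w - size and 0 <= y <= h - size:
                cost = sad(cur_block, get_block(ref_frame, x, y, size))
                if best is None or cost < best[1]:
                    best = ((dx, dy), cost)
    return best


def bi_predict(block0, block1):
    """Assumed bi-prediction: rounded average of the list 0 and list 1 blocks."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)] for r0, r1 in zip(block0, block1)]


ref = [[10 * y + x for x in range(6)] for y in range(6)]   # toy reference picture
cur = get_block(ref, 3, 2, 2)                               # content shifted by (1, 0)
print(motion_search(cur, ref, x0=2, y0=2, size=2, search_range=1))  # ((1, 0), 0)
print(bi_predict([[10, 20]], [[13, 20]]))                           # [[12, 20]]
```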

[0391] In some examples, the motion estimation unit 1704 can output a complete set of motion information for the decoder to use in the decoding process.

[0392] In some examples, the motion estimation unit 1704 may not output the complete set of motion information for the current video block. Instead, the motion estimation unit 1704 may signal the motion information of the current video block with reference to the motion information of another video block. For example, the motion estimation unit 1704 may determine that the motion information of the current video block is sufficiently similar to the motion information of a neighboring video block.

[0393] In one example, the motion estimation unit 1704 may indicate, in the syntax structure associated with the current video block, a value that indicates to the video decoder 1624 that the current video block has the same motion information as another video block.

[0394] In another example, motion estimation unit 1704 can identify another video block and motion vector difference (MVD) within the syntax structure associated with the current video block. The motion vector difference indicates the difference between the motion vector of the current video block and the motion vector of the indicated video block. Video decoder 1624 can use the motion vector of the indicated video block and the motion vector difference to determine the motion vector of the current video block.
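The motion vector difference relationship described above amounts to a component-wise subtraction at the encoder and addition at the decoder. A minimal sketch, with the `MotionVector` type and function names introduced only for illustration:

```python
from typing import NamedTuple


class MotionVector(NamedTuple):
    x: int
    y: int


def encode_mvd(mv: MotionVector, indicated_mv: MotionVector) -> MotionVector:
    """Encoder side: MVD = MV of current block - MV of the indicated block."""
    return MotionVector(mv.x - indicated_mv.x, mv.y - indicated_mv.y)


def decode_mv(indicated_mv: MotionVector, mvd: MotionVector) -> MotionVector:
    """Decoder side: MV of current block = MV of the indicated block + MVD."""
    return MotionVector(indicated_mv.x + mvd.x, indicated_mv.y + mvd.y)


mvd = encode_mvd(MotionVector(5, -3), MotionVector(4, -1))
print(mvd)                                   # MotionVector(x=1, y=-2)
print(decode_mv(MotionVector(4, -1), mvd))   # MotionVector(x=5, y=-3)
```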

[0395] As discussed above, the video encoder 1700 can predictively signal motion vectors. Two examples of predictive signaling techniques that can be implemented by the video encoder 1700 include advanced motion vector prediction (AMVP) and merge mode signaling.

[0396] Intra-prediction unit 1706 can perform intra-prediction on the current video block. When intra-prediction unit 1706 performs intra-prediction on the current video block, it can generate prediction data for the current video block based on decoded samples from other video blocks in the same frame. The prediction data for the current video block may include the predicted video block and various syntax elements.

[0397] The residual generation unit 1707 can generate residual data for the current video block by subtracting (e.g., indicated by a minus sign) a predicted video block from the current video block. The residual data for the current video block may include residual video blocks that correspond to different sample components of the samples in the current video block.

[0398] In other examples, such as in skip mode, residual data may not exist for the current video block, and residual generation unit 1707 may not perform subtraction operations.

[0399] The transform processing unit 1708 can generate one or more transform coefficient video blocks for the current video block by applying one or more transforms to the residual video blocks associated with the current video block.

[0400] After the transform processing unit 1708 generates a transform coefficient video block associated with the current video block, the quantization unit 1709 can quantize the transform coefficient video block associated with the current video block based on one or more quantization parameter (QP) values associated with the current video block.

[0401] The inverse quantization unit 1710 and the inverse transform unit 1711 can apply inverse quantization and inverse transform to the transform coefficient video block, respectively, to reconstruct the residual video block from the transform coefficient video block. The reconstruction unit 1712 can add the reconstructed residual video block to the corresponding samples of one or more predicted video blocks generated by the prediction unit 1702 to generate a reconstructed video block associated with the current block and store it in the buffer 1713.
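The chain from residual generation through quantization, inverse quantization, and reconstruction can be sketched for a single row of samples. Two simplifying assumptions are made loudly here: the transform and inverse transform are omitted (treated as identity), and quantization is a uniform scalar quantizer with a hypothetical step size `step` rather than a QP-derived one. The example shows why the reconstructed block differs from the original, which is the distortion that loop filtering aims to reduce.

```python
def generate_residual(cur, pred):
    """Residual = current samples - predicted samples."""
    return [c - p for c, p in zip(cur, pred)]


def quantize(values, step):
    """Uniform scalar quantization with rounding to nearest, symmetric around zero."""
    return [((abs(v) + step // 2) // step) * (1 if v >= 0 else -1) for v in values]


def dequantize(levels, step):
    """Inverse quantization: scale the levels back by the step size."""
    return [level * step for level in levels]


def reconstruct(pred, residual):
    """Reconstruction = prediction + reconstructed residual."""
    return [p + r for p, r in zip(pred, residual)]


cur = [52, 60, 47, 200]
pred = [50, 50, 50, 190]
levels = quantize(generate_residual(cur, pred), step=4)
recon = reconstruct(pred, dequantize(levels, step=4))
print(levels)  # [1, 3, -1, 3]
print(recon)   # [54, 62, 46, 202]
```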

[0402] After the video block is reconstructed by the reconstruction unit 1712, a loop filtering operation can be performed to reduce blocking artifacts in the video block.

[0403] The entropy encoding unit 1714 can receive data from other functional components of the video encoder 1700. When the entropy encoding unit 1714 receives data, it can perform one or more entropy encoding operations to generate entropy-coded data and output a bit stream including the entropy-coded data.

[0404] Figure 18 is a block diagram illustrating an example of a video decoder 1800. The video decoder 1800 can be the video decoder 1624 in the video codec system 1600 shown in Figure 16.

[0405] The video decoder 1800 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 18, the video decoder 1800 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video decoder 1800. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0406] In the example of Figure 18, the video decoder 1800 includes an entropy decoding unit 1801, a motion compensation unit 1802, an intra-frame prediction unit 1803, an inverse quantization unit 1804, an inverse transform unit 1805, a reconstruction unit 1806, and a buffer 1807. In some examples, the video decoder 1800 can perform a decoding pass that is generally reciprocal to the encoding pass described with respect to the video encoder 1614 (e.g., Figure 16).

[0407] The entropy decoding unit 1801 can retrieve the encoded bitstream. The encoded bitstream may include entropy-encoded video data (e.g., encoded video data blocks). The entropy decoding unit 1801 can decode the entropy-encoded video data, and from the entropy-decoded video data, the motion compensation unit 1802 can determine motion information including motion vectors, motion vector precision, reference image list indexes, and other motion information. For example, the motion compensation unit 1802 can determine this information by performing AMVP and merge mode.

[0408] The motion compensation unit 1802 can generate motion-compensated blocks, possibly performing interpolation based on interpolation filters. Identifiers of the interpolation filters to be used with sub-pixel precision can be included in the syntax elements.

[0409] The motion compensation unit 1802 can use the interpolation filters used by the video encoder 1614 during encoding of the video block to calculate interpolated values for sub-integer pixels of a reference block. The motion compensation unit 1802 can determine the interpolation filters used by the video encoder 1614 from the received syntax information and use those filters to generate the prediction block.
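
As an illustration of sub-integer interpolation, the following minimal sketch computes the half-sample positions of a one-dimensional row of reference samples. The 2-tap (bilinear) filter, the round-half-up rule, and the function name `interpolate_half_pel` are assumptions chosen for brevity and are not taken from this disclosure; practical codecs typically use longer filters, and the actual filter would be the one identified by the received syntax information.

```python
def interpolate_half_pel(row):
    """Return the half-sample values between neighbouring integer samples.

    Assumption: a 2-tap filter with taps (1, 1) / 2 and round-half-up.
    """
    return [(a + b + 1) >> 1 for a, b in zip(row, row[1:])]


reference_row = [10, 20, 40, 40]
half_pel_row = interpolate_half_pel(reference_row)  # one value per sample gap
```

A row of N integer samples yields N - 1 half-sample values, one for each gap between neighbours.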

[0410] The motion compensation unit 1802 can use some of the syntax information to determine the sizes of the blocks used to encode the frames and/or slices of the encoded video sequence, partition information describing how each macroblock of a picture of the encoded video sequence is partitioned, modes indicating how each partition is encoded, one or more reference frames (and reference frame lists) for each inter-coded block, and other information for decoding the encoded video sequence.

[0411] The intra-prediction unit 1803 can use, for example, an intra-prediction mode received in the bitstream to form prediction blocks from spatially adjacent blocks. The inverse quantization unit 1804 performs inverse quantization, i.e., dequantization, on the quantized video block coefficients provided in the bitstream and decoded by the entropy decoding unit 1801. The inverse transform unit 1805 applies an inverse transform.
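
A simplified view of dequantization can be sketched as follows. The step-size relation used here (the step doubles every 6 QP values and equals 1 at QP 4) is the approximate relation known from H.264/HEVC-style codecs and is an assumption, since this disclosure does not specify one. The function names are hypothetical, and real decoders use integer arithmetic and may apply scaling lists.

```python
def quantization_step(qp):
    """Approximate step size: doubles every 6 QP values, equals 1.0 at QP 4."""
    return 2.0 ** ((qp - 4) / 6.0)


def dequantize(levels, qp):
    """Scale quantized coefficient levels back to coefficient values."""
    step = quantization_step(qp)
    return [level * step for level in levels]


# QP 22 gives a step of 8.0 under the assumed relation.
coefficients = dequantize([3, -1, 0, 2], qp=22)
```

The dequantized coefficients would then be passed to the inverse transform to recover the residual block.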

[0412] The reconstruction unit 1806 can add the residual block to the corresponding prediction block generated by the motion compensation unit 1802 or the intra-prediction unit 1803 to form a decoded block. If necessary, a deblocking filter can also be applied to the decoded block to remove blocking artifacts. The decoded video block is then stored in the buffer 1807, which provides reference blocks for subsequent motion compensation/intra-prediction and also produces decoded video for presentation on a display device.
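
The reconstruction step can be sketched as a sample-wise addition of the residual block to the prediction block. Clipping the sum to the range allowed by the bit depth is an assumption added here for illustration, and the function name `reconstruct_block` is hypothetical.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the residual to the prediction sample by sample, then clip."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


prediction = [[100, 250], [3, 128]]
residual = [[5, 10], [-7, 0]]
decoded = reconstruct_block(prediction, residual)
```

In this toy block, two sums fall outside the 8-bit range and are clipped to 255 and 0 respectively.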

[0413] Figure 19 illustrates a method 1900 for encoding/decoding video data according to an embodiment of this disclosure. Method 1900 can be executed by an encoding/decoding device (e.g., an encoder) having a processor and memory. Method 1900 can be implemented to scale the output of NN filters for better performance, set the inference block size, and combine the outputs of multiple NN filter models.

[0414] In block 1902, the encoding / decoding device applies the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual. In one embodiment, unfiltered samples are samples (or pixels) that have not undergone any filtering process. For example, unfiltered samples have not passed through any NN filter. As another example, unfiltered samples have not passed through an NN filter, an adaptive loop filter (ALF), a deblocking filter (DF), a sample adaptive offset (SAO) filter, or a combination thereof.

[0415] In block 1904, the codec device applies a scaling function to the residual to generate a scaled residual. In one embodiment, the scaled residual is a residual that has already been processed by a scaling function or scaling operation (e.g., as part of scalable video coding (SVC)). SVC standardizes the encoding of a high-quality video bitstream that also contains one or more subset bitstreams (e.g., in the form of layered coding). A subset bitstream is derived by dropping packets from the larger video bitstream to reduce the bandwidth required for the subset bitstream. The subset bitstream may represent a lower spatial resolution (smaller screen), a lower temporal resolution (lower frame rate), or a lower-quality video signal.

[0416] In block 1906, the codec device adds another unfiltered sample to the scaled residual to generate a filtered sample. In one embodiment, the unfiltered sample added to the scaled residual is different from the unfiltered sample passed through the NN filter in block 1902. In block 1908, the codec device performs a conversion between the video media file and the bitstream based on the generated filtered sample.

[0417] When implemented in an encoder, the conversion includes receiving a media file (e.g., a video unit) and encoding filtered samples into a bitstream. When implemented in a decoder, the conversion includes receiving a bitstream comprising filtered samples and decoding the bitstream to obtain filtered samples.
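
The flow of blocks 1902 to 1906 can be sketched as below. This is a minimal illustration under stated assumptions, not the disclosed implementation: `toy_nn_residual` is a hypothetical stand-in for a trained NN filter (a 3-tap smoothing whose difference from the input is treated as the residual), the scaling function is assumed to be linear, the samples added in block 1906 are assumed to be the co-located unfiltered samples, and the result is rounded and clipped to an assumed 8-bit range.

```python
import math


def toy_nn_residual(unfiltered):
    """Hypothetical stand-in for an NN filter: returns (smoothed - input)."""
    padded = [unfiltered[0]] + list(unfiltered) + [unfiltered[-1]]
    smoothed = [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1] + 2) >> 2
        for i in range(1, len(padded) - 1)
    ]
    return [s - x for s, x in zip(smoothed, unfiltered)]


def filter_samples(unfiltered, alpha=0.5, beta=0.0, bit_depth=8):
    # Block 1902: obtain a residual from the (toy) NN filter.
    residual = toy_nn_residual(unfiltered)
    # Block 1904: scale the residual; a linear function is assumed here.
    scaled = [alpha * r + beta for r in residual]
    # Block 1906: add unfiltered samples, round half up, clip to the bit depth.
    max_value = (1 << bit_depth) - 1
    return [
        min(max(math.floor(x + s + 0.5), 0), max_value)
        for x, s in zip(unfiltered, scaled)
    ]


unfiltered_row = [100, 104, 120, 104, 100]
filtered_row = filter_samples(unfiltered_row)
```

With `alpha=0.0` and `beta=0.0` the sketch returns the unfiltered samples unchanged, and with `alpha=1.0` it applies the full correction, so the scaling factor controls how strongly the NN output is trusted.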

[0418] In one embodiment, method 1900 may utilize or combine one or more features or processes of other methods of this disclosure.

[0419] The following is a list of preferred embodiments.

[0420] The following scheme illustrates example embodiments of the techniques discussed in this disclosure (e.g., Example 1).

[0421] 1. A video processing method, comprising: for a conversion between a video comprising video units and a bitstream representation of the video, generating final filtered samples of the video units, wherein the final filtered samples of the video units correspond to the result of adding a corrected residual value to the unfiltered sample values of the video units; wherein the corrected residual value corresponds to the output of applying a function to the residual values of the video units; wherein the residual value is based on the output of a neural network (NN) filter applied to the unfiltered samples of the video units. The various options listed below are further described with reference to Figures 13A-13D. Here, the final filtered samples can be used for further processing, such as storage or display, and/or as reference video for subsequent video encoding and decoding.

[0422] 2. The method as described in claim 1, wherein the unfiltered sample corresponds to the reconstructed video sample of the video unit.

[0423] 3. The method of claim 1, wherein the generation is represented as Y = X + F(R), where X represents the unfiltered sample value, R represents the output of the NN filter, F represents a function, and Y represents the final filtered sample value.

[0424] 4. The method of claim 1, wherein the generation is represented as Y = X + F(R, X), where X represents the unfiltered sample value, R represents the output of the NN filter, F represents a function, and Y represents the final filtered sample value.

[0425] 5. The method of claim 1, wherein the generation is represented as Y = X + F(RX), where X represents the unfiltered sample value, R represents the output of the NN filter, F represents a function, and Y represents the final filtered sample value.

[0426] 6. The method of claim 1, wherein the generation is represented as Y = Clip(X + F), where X represents the unfiltered sample value, F represents the output of the applied function, and Y represents the final filtered sample value.

[0427] 7. The method of claim 6, wherein F(Residual) = α × Residual + β, where (α, β) are numbers that are predefined or derived on the fly.
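
As a worked illustration of the linear function and the clipping in the preceding solutions, consider the following hypothetical numbers, which are not taken from this disclosure: a 10-bit unfiltered sample X = 1020, an NN filter output R = 8, and coefficients (α, β) = (0.75, 1). The clipping range of 0 to 2^10 − 1 is an assumption based on the bit depth.

```latex
\begin{aligned}
F(R) &= \alpha R + \beta = 0.75 \times 8 + 1 = 7,\\
X + F(R) &= 1020 + 7 = 1027,\\
Y &= \mathrm{Clip}\bigl(X + F(R)\bigr)
   = \min\bigl(\max(1027,\, 0),\; 2^{10} - 1\bigr) = 1023.
\end{aligned}
```

Without the clipping step the filtered sample would exceed the largest value representable at this bit depth.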

[0428] The following scheme illustrates an example embodiment of the technology discussed in the previous section (e.g., Example 2).

[0429] 8. A video processing method, comprising: for a conversion between a video comprising video units and a bitstream of the video, determining, according to rules, the inference block size for applying a neural network filter to unfiltered samples of the video units, and performing the conversion based on the determination.

[0430] 9. The method of claim 8, wherein the rule specifies that the inference block size is indicated in the bitstream.

[0431] 10. The method of claim 8, wherein the rule specifies the inference block size based on the block's encoding / decoding information.

[0432] 11. The method of claim 10, wherein the inference block size depends on the quantization parameter, the slice type, the picture type, the partition tree type, or the color component of the video unit.

[0433] 12. The method of claim 8, further comprising determining one or more additional inference block sizes for one or more additional neural network filters associated with the video unit.
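
One possible way to derive an inference block size from coding information and to tile a picture accordingly is sketched below. The QP threshold, the block sizes, the halving for chroma, and both function names are hypothetical values and names chosen for illustration; the disclosure only states that the size may depend on such information or be indicated in the bitstream.

```python
def choose_inference_block_size(qp, is_luma=True):
    """Pick an inference block size from coding information.

    Assumption: the threshold (32) and the sizes (128, 64) are hypothetical.
    """
    size = 128 if qp >= 32 else 64
    return size if is_luma else size // 2


def split_into_inference_blocks(width, height, block_size):
    """Yield (x, y, w, h) tiles covering the picture; edge tiles may be smaller."""
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            yield x, y, min(block_size, width - x), min(block_size, height - y)


block_size = choose_inference_block_size(qp=37)
blocks = list(split_into_inference_blocks(416, 240, block_size))
```

Each tile would then be passed to the NN filter separately, so the block size trades off memory use per inference against the amount of spatial context available to the filter.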

[0434] The following scheme illustrates an example embodiment of the technology discussed in the previous section (e.g., Example 3).

[0435] 13. A video processing method comprising: performing a conversion between a video comprising video units and a bitstream of the video according to a rule, wherein the rule specifies that reconstructed samples of the video units are determined from filtering using multiple neural network filter models.

[0436] 14. The method of claim 13, wherein the rule specifies the use of a weighted sum of the outputs of multiple filters to determine the reconstructed samples.

[0437] 15. The method of claim 14, wherein the weights used for the weighted sum are indicated in the bitstream.

[0438] 16. The method of claim 14, wherein the weighted sum applies equal weights to multiple neural network models.

[0439] 17. The method of claim 13, wherein the weights used for the weighted sum are functions of the quantization parameter of the video unit, the slice type, the picture type, the color component, the color format, or the temporal layer.

[0440] 18. The method of claim 13, wherein the weights used for the weighted sum depend on the model type of the multiple neural network models.

[0441] 19. The method of claim 13, wherein the weights used for the weighted sum depend on the inference block size.

[0442] 20. The method of claim 13, wherein the weights used for the weighted sum are different in different spatial locations.

[0443] 21. The method of any of the preceding claims, wherein the video unit is a coding block, a video slice, a video picture, a video tile, or a video subpicture.

[0444] 22. The method of any one of claims 1-21, wherein the conversion includes generating video from a bitstream or generating a bitstream from a video.

[0445] 23. A method for storing a bit stream on a computer-readable medium, comprising generating the bit stream according to any one or more of the methods of claims 1-22 and storing the bit stream on a computer-readable medium.

[0446] 24. A computer-readable medium having a bitstream of video stored thereon, wherein when processed by a processor of a video decoder, the bitstream causes the video decoder to generate video, wherein the bitstream is generated by the method according to one or more of claims 1-22.

[0447] 25. A video decoding device, comprising a processor configured to implement one or more of the methods of claims 1-22.

[0448] 26. A video encoding apparatus comprising a processor configured to implement one or more of the methods of claims 1-22.

[0449] 27. A computer program product having computer code stored thereon, which, when executed by a processor, causes the processor to perform the method as described in any one of claims 1-22.

[0450] 28. A computer-readable medium having a bitstream thereon conforming to a bitstream format generated according to any one of claims 1-22.

[0451] 29. A method, a device, or a bit stream generated according to the disclosed method or system described in this document.
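
The weighted combination of multiple NN filter outputs described in the solutions above can be sketched as follows. The function name `combine_nn_outputs` and the sample values are hypothetical; the weights could equally be parsed from the bitstream or derived from coding information, which this sketch does not model.

```python
def combine_nn_outputs(outputs, weights=None):
    """Return the per-sample weighted sum of several NN filter outputs.

    With weights=None, equal weights are applied to all models.
    """
    if weights is None:
        weights = [1.0 / len(outputs)] * len(outputs)
    if len(weights) != len(outputs):
        raise ValueError("one weight per model output is required")
    return [
        sum(w * sample for w, sample in zip(weights, samples))
        for samples in zip(*outputs)
    ]


model_a = [100.0, 104.0, 96.0]  # hypothetical output of a first NN model
model_b = [104.0, 100.0, 100.0]  # hypothetical output of a second NN model
equal = combine_nn_outputs([model_a, model_b])
signalled = combine_nn_outputs([model_a, model_b], weights=[0.75, 0.25])
```

Spatially varying weights could be handled in the same way by supplying a separate weight list for each sample position.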

[0452] The following documents are incorporated by reference in their entirety:

[0453] [1] Johannes Ballé, Valero Laparra, and Eero P Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” PCS IEEE (2016), 1–5.

[0454] [2] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395 (2017).

[0455] [3] Jiahao Li, Bin Li, Jizheng Xu, Ruiqin Xiong, and Wen Gao, “Fully Connected Network-Based Intra Prediction for Image Coding,” IEEE Transactions on Image Processing 27, 7 (2018), 3236–3247.

[0456] [4] Yuanying Dai, Dong Liu, and Feng Wu, “A convolutional neural network approach for post-processing in HEVC intra coding,” MMM. Springer, 28–39.

[0457] [5] Rui Song, Dong Liu, Houqiang Li, and Feng Wu, “Neural network-based arithmetic coding of intra prediction modes in HEVC,” VCIP IEEE (2017), 1–4.

[0458] [6] J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, W. Samek, H. Schwarz, D. Marpe, and T. Wiegand, “Neural network based intra prediction for video coding,” Applications of Digital Image Processing XLI, Vol. 10752. International Society for Optics and Photonics, 1075213 (2018).

[0459] The disclosed and other solutions, examples, embodiments, modules, and functional operations described in this document can be implemented in digital electronic circuitry or in computer software, firmware, or hardware, including the structures disclosed in this document and their structural equivalents, or in combinations of one or more of them. The disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing device. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing device" encompasses all apparatus, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the device may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, such as a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver device.

[0460] Computer programs (also known as programs, software, software applications, scripts, or code) can be written in any programming language (including compiled or interpreted languages) and can be deployed in any form, including as standalone programs or as modules, components, subroutines, or other units suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or portions of code). Computer programs can be deployed to execute on a single computer or on multiple computers located at one site or distributed across multiple sites and interconnected via a communication network.

[0461] The processes and logic flows described in this document can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, and the device can also be implemented as special-purpose logic circuitry, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

[0462] For example, processors suitable for executing computer programs include general-purpose and special-purpose microprocessors, as well as any one or more processors in any kind of digital computer. Typically, the processor receives instructions and data from read-only memory or random access memory, or both. The basic components of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to one or more mass storage devices, or both. However, a computer does not necessarily have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and memory may be supplemented by or incorporated into special-purpose logic circuitry.

[0463] Although this patent document contains numerous details, these details should not be construed as limiting any invention or the scope of the claims, but rather as a description of features that may be specific to particular embodiments of a particular invention. Certain features described in this patent document in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented separately in multiple embodiments or in any suitable sub-combination. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed in this way, in some cases one or more features from the claimed combination may be removed from the combination, and the claimed combination may involve sub-combinations or variations of sub-combinations.

[0464] Similarly, although operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring such operations to be performed in the specific order shown or in a sequential order, or to perform all shown operations to achieve the desired effect. Furthermore, the separation of various system components in the embodiments described in this patent document should not be construed as requiring such separation in all embodiments.

[0465] Only a few implementations and examples are described; other implementations, enhancements, and variations can be made based on what is described and shown in this patent document.

Claims

1. A method implemented by a video encoding/decoding device, comprising: applying the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; applying a scaling function to the residual to generate a scaled residual; adding another unfiltered sample to the scaled residual to generate a filtered sample; and performing a conversion between a video media file and a bitstream based on the generated filtered sample.

2. The method of claim 1, further comprising reconstructing the unfiltered samples before the residual is generated.

3. The method of claim 1, wherein the filtered samples are generated according to Y = X + F(R), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

4. The method of claim 1, wherein the filtered samples are generated according to Y = X + F(R, X), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

5. The method of claim 1, wherein the filtered samples are generated according to Y = X + F(RX), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, and Y represents the filtered samples.

6. The method of claim 1, wherein the filtered samples are generated according to Y = Clip(X + F(R)), where X represents the unfiltered samples, R represents the residual determined based on the output of the NN filter, F represents the scaling function, Clip represents a clipping function based on the bit depth of the unfiltered samples, and Y represents the filtered samples.

7. The method of claim 1, wherein the scaling function is based on a linear model according to F(R) = α × R + β, where R represents the residual determined based on the output of the NN filter, F represents the scaling function, and α and β represent a pair of candidate coefficients (α, β).

8. The method of claim 1, further comprising determining the inference block size to be used when the NN filter is applied to the unfiltered samples.

9. The method of claim 8, further comprising selecting the inference block size from a plurality of inference block size candidates, wherein each of the plurality of inference block size candidates is based on at least one of a quantization parameter, a slice type, a picture type, a partition tree, and a color component.

10. The method of claim 1, further comprising parsing the bitstream to obtain an indicator, wherein the indicator indicates the inference block size to be used when the NN filter is applied to the unfiltered samples.

11. The method of claim 8, wherein the inference block size has a first value for a first bit rate and a second value for a second bit rate, wherein the first value is higher than the second value, and wherein the first bit rate is lower than the second bit rate.

12. The method of claim 8, wherein the inference block size has a first value for a first precision and a second value for a second precision, wherein the first value is higher than the second value, and wherein the first precision is higher than the second precision.

13. The method of claim 1, wherein the NN filter is one of a plurality of NN filters whose outputs are applied to the unfiltered samples to generate the residual.

14. The method of claim 13, wherein, when the outputs of the plurality of NN filters are applied to the unfiltered samples, some of the NN filters use different inference block sizes.

15. The method of claim 13, wherein the outputs of the plurality of NN filters are individually weighted and applied to the unfiltered samples as a weighted sum.

16. The method of claim 13, wherein the model and the weights corresponding to each of the plurality of NN filters are signaled in the bitstream.

17. The method of claim 13, wherein the weights corresponding to each of the plurality of NN filters are based on one or more of the following: a quantization parameter, a slice type, a picture type, a color component, a color format, and a temporal layer.

18. The method of claim 13, wherein the weights corresponding to each of the plurality of NN filters are based on one or more of the NN filter model, the inference block size, or the spatial location of the unfiltered sample.

19. An apparatus for encoding and decoding video data, comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to: apply the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; apply a scaling function to the residual to generate a scaled residual; add another unfiltered sample to the scaled residual to generate a filtered sample; and perform a conversion between a video media file and a bitstream based on the generated filtered sample.

20. A non-transitory computer-readable medium comprising a computer program product used by a codec device, the computer program product including computer-executable instructions stored on the non-transitory computer-readable medium, which, when executed by one or more processors, cause the codec device to: apply the output of a neural network (NN) filter to unfiltered samples of a video unit to generate a residual; apply a scaling function to the residual to generate a scaled residual; add another unfiltered sample to the scaled residual to generate a filtered sample; and perform a conversion between a video media file and a bitstream based on the generated filtered sample.