Optimization techniques for loop filtering using neural networks
Optimized neural network loop filtering techniques, including a parallel 1x3/3x1 backbone block and channel residual network, enhance coding efficiency and reduce complexity in image and video coding, addressing the limitations of existing methods.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- DOLBY LABORATORIES LICENSING CORP
- Filing Date
- 2025-12-22
- Publication Date
- 2026-07-02
Abstract
Description
OPTIMIZATION TECHNIQUES FOR LOOP FILTERING USING NEURAL NETWORKS
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority from Indian Provisional Patent Applications No. 202511083904, filed on September 3, 2025, 202511058190, filed on June 17, 2025, 202511046871, filed on May 15, 2025, and 202411102168, filed on December 23, 2024, each of which is hereby incorporated by reference in its entirety.
TECHNOLOGY
[0002] The present document relates generally to images. More particularly, embodiments of the present invention relate to optimization techniques for filtering images using neural networks.
BACKGROUND
[0003] In 2020, the MPEG group in the International Standardization Organization (ISO), jointly with the International Telecommunications Union (ITU), released the first version of the Versatile Video Coding standard (VVC), also known as H.266 (Ref. [4]). More recently, the same joint group (JVET) and experts in still-image compression (JPEG) have started working on the development of the next generation of coding standards that will provide improved coding performance over existing image and video coding technologies. As part of this investigation, coding techniques based on artificial intelligence and deep learning are also examined. As used herein, the term “deep learning” refers to neural networks having at least three layers, and preferably more than three layers.
[0004] As appreciated by the inventors here, improved techniques for the coding of images and video based on neural networks are described herein.
[0005] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] An embodiment of the present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
[0007] FIG. 1A depicts a network architecture for a unified low-complexity loop-filter design according to prior art;
[0008] FIG. 1B depicts an example backbone block design with parallel separable convolutional (CONV) blocks according to an embodiment of this invention;
[0009] FIG. 1C and FIG. 1D depict example backbone block designs with a channel residual network, according to embodiments of this invention;
[0010] FIG. 2A, FIG. 2B, and FIG. 2C depict example backbone block designs with multi-path, multi-kernel networks, according to embodiments of this invention;
[0011] FIG. 2D depicts an example of a 5x5 filtering kernel;
[0012] FIG. 2E depicts an example of a parallel 1x5 / 5x1 / 3x3 backbone block design according to an embodiment of this invention;
[0013] FIG. 2F depicts an example of a diamond-shaped filter corresponding to the design of the filter in FIG. 2E;
[0014] FIG. 3 depicts an example headblock design for a neural network loop filter with an input ALF-features neural network, according to an embodiment of this invention;
[0015] FIG. 4 depicts an example fusion and transition block with an unequal split luma and chroma output, according to an embodiment of this invention;
[0016] FIG. 5A depicts a network architecture for a unified low-complexity loop-filter design with a cross-component link according to prior art;
[0017] FIG. 5B depicts details of the neural network modules in the architecture depicted in FIG. 5A;
[0018] FIG. 5C depicts a high-level architecture for a loop-filter design with split chroma (U and V) branches, according to an embodiment of this invention;
[0019] FIG. 6A and FIG. 6B depict example network architectures for a unified low-complexity loop-filter design with a cross-component link according to embodiments of this invention;
[0020] FIG. 6C depicts an example network architecture of FIG. 6A or 6B wherein the chroma UV branch is split into separate chroma U and chroma V branches according to an embodiment of this invention;
[0021] FIG. 7 depicts an architecture combining a neural network loop filter (NNLF) with other conventional in-loop filters according to prior art;
[0022] FIGs 8A-8C depict alternative architectures combining an NNLF with other conventional in-loop filters according to prior art;
[0023] FIGs 9A-9C depict example architectures combining an NNLF with other conventional in-loop filters according to embodiments of this invention;
[0024] FIG. 10A depicts a chroma loop filter according to prior art;
[0025] FIG. 10B depicts an alternative architecture combining an NNLF with a chroma loop filter according to an embodiment of this invention; and
[0026] FIGs 11A-11C depict examples of input bit-packing for the headblock of an NNLF according to embodiments of this invention.
DESCRIPTION OF EXAMPLE EMBODIMENTS
[0027] Example embodiments for loop filtering, and signal processing in general, using neural networks in image and video coding are described herein. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments of the present invention. It will be apparent, however, that the various embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating embodiments of the present invention.
SUMMARY
[0028] Example embodiments described herein relate to optimization methods to reduce computational complexity or improve coding efficiency in image and video coding when using neural networks. Example embodiments include: a backbone block design using parallel separable convolutional blocks, a channel residual network to exploit the correlation between channels of feature maps, a multi-path backbone design, a headblock design with input for adaptive loop filter (ALF) features, an unequal luma-chroma split architecture, and neural filters with cross-component-link architectures. High-level syntax and example embodiments to optimize the combination of a neural network loop filter (NNLF) with other conventional in-loop filters are also presented.
EXAMPLE CODING MODEL USING DEEP LEARNING
[0029] Deep learning-based image and video compression approaches are increasingly popular and are an area of active research. Current research in neural-network (NN) based coding can be divided into two general frameworks: a “hybrid” neural network-based framework, which simply replaces one or more existing coding or decoding modules with their corresponding neural-network-based implementation, where each NN module is trained and optimized on its own, and an “end-to-end” neural network, where training and optimizing is done on the whole network. The proposed neural network-based components / blocks are applicable to either architecture. The term YUV420 denotes a luma-chroma color space, where chroma is sub-sampled by two in both the horizontal and vertical dimensions, such as YCbCr 4:2:0, and the like. While examples refer to a neural network-based loop filter (NNLF), the techniques are applicable to a variety of other NN-based filters which remove noise artifacts or improve image quality, such as NN post filtering, super-resolution filtering, and the like. Embodiments presented herein offer improved coding efficiency at a reduced computational cost.
[0030] NNLF technologies have been actively studied in the Joint Video Experts Team (JVET) as part of exploratory experiment 1 (EE1). For example, Ref. [5], which describes a low complexity loop filter, has been adopted in JVET’s NN code base version NNVC-5.0 as the low operation point (LOP) anchor. Meanwhile, a unified LOP filter has been proposed in the July 2023 JVET meeting, which also incorporated a split luma-chroma design, first proposed in Ref. [5]. Based on a proposed model of the unified LOP architecture, example embodiments introduce the following new features: 1) a parallel 1x3 / 3x1 backbone block design to reduce the number of convolution layers, 2) a channel residual network design to exploit the correlation between channels of feature maps, 3) a multi-path design, with various kernel sizes, for the backbone block, 4) adaptive loop filter (ALF) classification-based input features for the NNLF, 5) an unequal luma-chroma split architecture, and 6) a cross-component-link architecture.
A parallel 1x3 / 3x1 backbone block design
[0031] Ref. [1] and Ref. [5] proposed a neural network-based loop filter method that incorporated CP decomposition. In CP decomposition (CP comes from a CANDECOMP / PARAFAC model), a 4D convolution kernel tensor is decomposed into a sequence of four convolutional layers with small kernels. The first convolution layer is a pointwise convolution, the second and third layers are spatial convolutions in the X and Y directions, and the fourth convolution is again a pointwise convolution in the channel dimension.
[0032] In Ref. [5], the key idea is to first decompose the 3x3 regular convolution into a 1x1 point-wise convolution, a 3x1 depth-wise separable convolution, a 1x3 depth-wise separable convolution, and a 1x1 point-wise convolution using CP decomposition. Next, the channels are split into a luma path and a chroma path, where the luma path usually has a larger number of channels than the chroma, as the luma component needs more parameters and weights to process. Finally, adjacent 1x1 convolutions after CP decomposition are fused (Ref. [5]).
[0033] FIG. 1A (Ref. [6]) depicts a network architecture for a unified, low-complexity, loop-filter design using a variety of complexity reduction techniques. The design comprises a headblock (150), fusion and transition blocks (140), and two separate paths for processing luma and chroma using a series of backbone blocks (105) followed by a series of additional convolution blocks which form the luma and chroma tailblocks. In the headblock (150) d = [d1, d2, d3, d4, d5, d6] = [12, 8, 4, 2, 2, 24], and in the backbone (105), without limitation, example block parameters may be as in Table 1.
Table 1. Example backbone neural-network parameters
Parameter   Luma         Chroma
N           NY = 14      Nc = 4
C           CY = 16      Cuv = 16
C1          C1Y = 64     C1uv = 32
C21         C21Y = 16    C21uv = 16
[0034] As discussed earlier, compared to traditional video coding architectures, video coding architectures based on neural networks, such as the LOP NNLF depicted in FIG. 1A, provide increased coding efficiency, but at a much higher computational cost. Using the LOP filter in FIG. 1A as an example, and without limitation, in the next sections a variety of techniques to improve coding efficiency or reduce coding complexity will be proposed. These techniques are applicable to all neural-network-based architectures with components identical or similar to those used by the LOP filter.
[0035] In Ref. [6], residual block fusion (RBF) was proposed to adjust the locations of the skip connections and fuse the adjacent 1x1 convolutions. A typical backbone-block design using RBF is depicted in FIG. 1A (105). The backbone block starts with an activation function block (PReLU), and it is followed by a 1x1 convolution, a 1x3 separable convolution, a 3x1 separable convolution, and a 1x1 convolution. The terms h and w denote the height and width of the input patch (e.g., h = w = 36), and the notation CONV MxN, Cn x Cm indicates a convolutional layer with an M x N kernel, Cn input channels, and Cm output channels.
[0036] This design has been adopted in the latest versions of the JVET NNVC LOP anchor, including the designs in LOP.2 (Ref. [2]) and LOP.3 (Ref. [3]). It reduced the number of convolutional layers in one backbone block from 5 to 4, while still maintaining good performance.
[0037] This design uses 1x3 and 3x1 separable convolutional layers sequentially to capture the horizontal and vertical information. Though the complexity is reduced compared to one single 3x3 regular convolutional layer, this design does increase the number of layers and hence limits the ability to reduce latency and decoding time.
[0038] In an embodiment, to further reduce the number of layers and latency, a parallel 1x3 / 3x1 backbone block design is proposed. As depicted in FIG. 1B, in 105B, the first two layers are a PReLU and a 1x1 convolution; then, the output C channel maps are input to the 1x3 and 3x1 paths independently and in parallel. The 1x3 path has the 1x3 depth-wise separable convolutional layer with C input channels and C21 output channels, while the 3x1 path has the 3x1 depth-wise separable convolutional layer with C input channels and C21 output channels. The outputs of the two paths are added together by element-wise addition and then sent to the last 1x1 convolutional layer. This ensures that the horizontal and vertical information of the feature maps will be captured by the 1x3 and 3x1 separable convolutional layers and fused together by element-wise addition. Compared to the backbone design (105) in FIG. 1A, assuming all channel numbers are the same, this parallel 1x3 / 3x1 architecture will reduce the number of convolutional layers in the backbone from 4 to 3, while maintaining the exact same complexity in terms of kMac / pixel. When replacing the backbone block in NNVC LOP.3 (Ref. [3]) with this design, for the same kMac / pixel load, experimental results show 0.08% luma gain and over 2% chroma gain. The term kMac / pixel stands for “thousand multiply-accumulation operations per pixel.”
[0039] In general, and without limitation, the kernel size in this parallel design could be extended to more than just 1x3 and 3x1, to adapt to different feature scales. For example, in an embodiment, the parallel kernel sizes could be 1 x M and M x 1 (e.g., M ∈ [3, 9]). This parallel design is also applicable in attention layers, as shown in FIG. 5B, or in any architecture which uses depth-wise separable M x M convolutional layers or sequential M x 1 and 1 x M convolutional layers.
A channel residual network design
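The layer-count and complexity comparison above can be sketched with a small calculation. This is only an illustration under stated assumptions: the spatial layers are modeled as depth-wise convolutions with the same channel count in both designs, the two 1x1 layers are taken to be identical in both designs, and the function names and the channel count of 64 are hypothetical choices, not values prescribed by the embodiment.

```python
def depthwise_macs(kh, kw, channels):
    """MACs per pixel of a depth-wise convolution (one kh x kw kernel per channel)."""
    return kh * kw * channels


def spatial_cost(channels, parallel):
    """Return (MACs per pixel, number of sequential layers) for the spatial part.

    The same two kernels (1x3 and 3x1) are used in both designs; only their
    arrangement differs. The element-wise addition of the parallel design is
    not counted as a multiply-accumulate.
    """
    kernels = [(1, 3), (3, 1)]
    macs = sum(depthwise_macs(kh, kw, channels) for kh, kw in kernels)
    depth = 1 if parallel else len(kernels)
    return macs, depth


ch = 64  # illustrative channel count (assumption)
seq_macs, seq_depth = spatial_cost(ch, parallel=False)
par_macs, par_depth = spatial_cost(ch, parallel=True)

# Add the two 1x1 convolutions that surround the spatial part in both designs.
print("sequential:", seq_macs / 1000, "kMac/pixel,", seq_depth + 2, "conv layers")
print("parallel:  ", par_macs / 1000, "kMac/pixel,", par_depth + 2, "conv layers")
```

Under these assumptions the multiply-accumulate count of the spatial part is unchanged, while the number of convolutional layers that must run one after another drops from 4 to 3.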
[0040] As used herein, the term “feature maps” refers to the input or output tensors of convolutional layers in a neural network, which are in the form of a 3-dimensional tensor [C, h, w], where C is the number of channels, h is the height, and w is the width. In FIG. 1C, one may consider a feature maps split operation (162, 164) as the opposite of a “concatenate” operation. In a neural network with multiple layers, output channels of feature maps from a layer are also the input channels to the subsequent layer.
[0041] For the intermediate channels of feature maps in a neural network, sometimes there are similarities between different channels. In an embodiment, FIG. 1C depicts an example of a channel residual network that exploits the correlations between those channels. Compared to FIG. 1B, the proposed backbone block features: the split of channels of feature maps into two paths, each path processing half of the output channels from the first 1x1 CONV block, two additional separable CONV blocks, and a channel concatenate layer. Note that there is no limitation on how channels of feature maps are split (e.g., sequentially, odd vs. even, and the like). This is because during training the neural network will automatically adjust its weights as needed according to the desired optimization criterion.
[0042] As shown in FIG. 1C, for the C output feature maps from the top 1x1 CONV network, the first half of the feature channels (164) is subtracted from the second half (162) to form a residual (172), and the residual is fed to 1x3 and 3x1 separable convolutional blocks (166). The first half (164) is added back to the output of the 166 networks. At the same time, the first half (164) is fed directly to its own set of 1x3 and 3x1 separable convolutions (168). The outputs of the two paths are concatenated and form the output feature maps (170) with C channels. The C channels of feature maps (170) are input to the last 1x1 convolutional layer for further processing.
[0043] For a group of feature maps, in some cases, there are similarities between channels. In such a case, the residuals (172) generated by subtracting half of the channels from the other half should yield relatively small values. Therefore, compared to directly processing the original features (with relatively large values), it will be easier for the residual path of 1x3 and 3x1 separable convolutions (166) to be better trained.
[0044] This channel residual network design can be generalized to form alternative neural network architectures. For example, such an alternative architecture is depicted in FIG. 1D. Given feature maps x[1 : C] with C channels, the C/2 channels x[1 : C/2] are subtracted from the channels x[C/2 + 1 : C], and the output residual (x[C/2 + 1 : C] - x[1 : C/2]) (172) is input to neural network 2 (NN 2) to generate output y2[1 : C/2] (174), with C/2 channels. Feature maps x[1 : C] are fed to neural network 1 (NN 1) to generate output y1[1 : C/2] (176). The output of NN 2 (174) is added to the original x[1 : C/2] to form the updated output y2[1 : C/2] = y2[1 : C/2] + x[1 : C/2] (178). The outputs of the two paths (176 and 178) are concatenated together to form the final output y[1 : C] (180) with C channels to be further processed by the subsequent neural network layers. Neural network 1 and 2 could be of any architecture and any number of layers as long as the input and output channel numbers are both C.
[0045] In FIG. 1D, the core architecture of the channel residual network design (190), that is, all components from the segmentation of feature maps to the channel concatenator, may be applicable as a stand-alone neural-network component to other NN designs and applications, beyond those discussed here. For example, in an embodiment, the PReLU and top 1x1 CONV network before the feature map segmentation could be replaced by a different neural network, say “Front NN”, and the last 1x1 CONV network, after channel concatenate, could also be replaced by a separate NN, say “End NN”, thus forming a new NN block, trainable for a desired application, comprising:
• a Front NN;
• the channel residual NN (190); and
• the End NN.
A multi-path with various kernel sizes design for backbone block
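The data flow of the generalized channel residual design can be sketched as follows. This is a minimal illustration, not the patented implementation: feature maps are represented as plain lists of channels (each channel a flat list of samples), `nn1` and `nn2` are hypothetical placeholder callables standing in for neural network 1 and neural network 2, and the sequential first-half / second-half split is just one of the permitted ways to split channels.

```python
def channel_residual(x, nn1, nn2):
    """Channel residual block on a list of C channels (C even).

    nn1 maps the C input channels to C/2 channels (path 176).
    nn2 maps the C/2 residual channels to C/2 channels (path 174).
    """
    c = len(x)
    assert c % 2 == 0, "C must be even"
    first, second = x[:c // 2], x[c // 2:]

    # Residual (172): second half minus first half, channel by channel.
    residual = [[b - a for a, b in zip(ch_a, ch_b)]
                for ch_a, ch_b in zip(first, second)]

    # NN 2 processes the residual; the first half is then added back (178).
    y2 = nn2(residual)
    y2 = [[u + a for u, a in zip(ch_u, ch_a)]
          for ch_u, ch_a in zip(y2, first)]

    # NN 1 produces the other C/2 output channels (176).
    y1 = nn1(x)

    # Channel concatenation (180).
    return y1 + y2


# Placeholder networks (assumptions): NN 1 keeps the first half, NN 2 is identity.
def keep_first_half(x):
    return x[:len(x) // 2]


def identity(x):
    return x


x = [[1, 2], [3, 4], [10, 20], [30, 40]]   # C = 4 channels, 2 samples each
y = channel_residual(x, nn1=keep_first_half, nn2=identity)
print(y == x)   # with these placeholders the block reproduces its input
```

With trained networks in place of the placeholders, NN 2 only has to model the typically small residual between the two channel halves, which is the motivation given above.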
[0046] In another embodiment, to further utilize the benefit of various kernel sizes of convolutional networks, the separable convolutions with kernel sizes of 1x3 and 3x1 in FIG. 1A are replaced by multiple paths, operating in parallel, each with a different kernel size. Suppose there are L convolutional blocks in parallel and the j-th path, where j = 1, 2, ..., L, contains 1 x M_j and M_j x 1 convolutional blocks, where M_j = 3, 5, 7, 9, and the like. Each parallel path will perform convolutions on the input feature maps x independently and will output feature map y_j. After the convolution operation from all L paths, their feature maps y_j are added together through element-wise addition to generate the final output y, which is to be processed in the subsequent layers. An example of a two-path architecture is shown in FIG. 2A, while an example of a three-path architecture is shown in FIG. 2B.
[0047] In general, a larger kernel size has a larger receptive field to cover more area of the input, which usually translates to improved performance. But a larger kernel size will also increase the complexity of the neural network. This design enables the flexibility to balance the performance and complexity by using various kernel sizes in one backbone block.
[0048] This design naturally combines multiple kernel sizes into a single backbone block. Furthermore, since the input image / video signals or intermediate features in the neural network may have different scales, the different kernel sizes in different convolution paths may better adapt to the various scales and efficiently capture the feature information. As used herein, the term “scale” refers to the level of details per pixel in the image / video. For example, with the same resolution, an image representing the map of a city contains more details than an image representing a country, and the image of the city map is considered to have a larger scale. As a result, images with large scale may need larger kernel sizes to cover the same level of details.
[0049] The number of paths and the kernel sizes can be easily adjusted to meet different performance and complexity requirements. To reduce the complexity and minimize the computation increase introduced by multiple paths, in another embodiment, the input feature maps x with C channels may be divided into L groups. Each group j has B_j channels, where C = B_1 + B_2 + ... + B_L. Each path j will only use the corresponding group's feature map channels x_j as input and will output feature map y_j. Then, for all L paths, their feature maps y_j are concatenated to get the final output y. An example of a two-path (L = 2) architecture is shown in FIG. 2C.
Diamond-shape filter design
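The multi-path summation described above can be illustrated on a single channel. The sketch below is written under several assumptions that are not stated in the text: within each path the 1 x M_j and M_j x 1 kernels are applied one after the other, zero padding keeps the output size equal to the input size, and the kernels are applied as correlations (the usual CNN convention). All function names are hypothetical.

```python
def conv_rows(img, k):
    """Apply a 1 x M kernel along the width, with zero padding."""
    r = len(k) // 2
    h, w = len(img), len(img[0])
    return [[sum(k[t + r] * img[y][x + t]
                 for t in range(-r, r + 1) if 0 <= x + t < w)
             for x in range(w)] for y in range(h)]


def conv_cols(img, k):
    """Apply an M x 1 kernel along the height, with zero padding."""
    r = len(k) // 2
    h, w = len(img), len(img[0])
    return [[sum(k[t + r] * img[y + t][x]
                 for t in range(-r, r + 1) if 0 <= y + t < h)
             for x in range(w)] for y in range(h)]


def multi_path(img, paths):
    """Sum the outputs y_j of L parallel paths; paths is a list of (row_k, col_k)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for row_k, col_k in paths:
        y_j = conv_cols(conv_rows(img, row_k), col_k)
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, y_j)]
    return out


img = [[1, 2, 3], [4, 5, 6]]
# Two paths (L = 2) with M_1 = 3 and M_2 = 5; delta kernels make each path an identity.
delta3, delta5 = [0, 1, 0], [0, 0, 1, 0, 0]
y = multi_path(img, [(delta3, delta3), (delta5, delta5)])
print(y)   # each path returns img, so the sum is 2 * img
```

The delta kernels are used only so the result is easy to check by hand; trained kernels of different sizes would contribute different receptive fields to the same sum.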
[0050] As discussed earlier, in FIG. 1B, a parallel 1x3 and 3x1 separable convolutional layer design is proposed to reduce the latency introduced by the original sequential 1x3 and 3x1 convolutions. Though the horizontal and vertical information is captured by the two kernels, the information from the diagonal directions is not captured explicitly inside one 1x3 / 3x1 convolutional layer block. One way to explicitly capture information from these directions is to merge the 1x3 and 3x1 convolutional layers into one 3x3 separable convolutional layer. To capture more neighboring pixels, a larger kernel size could be selected, e.g., 5x5; however, the complexity of an M x M square-shaped kernel is proportional to M^2 (e.g., M ∈ [3, 9]). Meanwhile, as depicted in FIG. 2D, the diagonal distance (e.g., √2·d) between the center pixel (kernel center, C) and the pixel at the vertex (V) of the square is larger than the vertical distance (e.g., d) between the center pixel (C) and edge pixels (e.g., pixel (E)). Since the image or video is assumed to be isotropic, the filter design should have the highest performance-complexity efficiency if the distances reached in all directions are the same. The example 5x5 square filter is therefore not optimal in this respect. In a conventional video codec's filter design, a diamond-shaped filter is often used to improve efficiency in this respect; however, conventional diamond-shaped filters are often manually designed with only one layer. It is believed that all mainstream deep learning libraries only support training of rectangular convolutional kernels, thus preventing the use of a diamond-shaped filter in deep neural networks.
[0051] In an embodiment, to exploit the efficiency of a diamond-shaped filter in deep neural networks, a parallel 1x5 / 5x1 / 3x3 backbone block design is proposed. As depicted in FIG. 2E, the first two layers are a PReLU and a 1x1 convolution; then, the output C channel maps are input to the 1x5, 5x1, and 3x3 paths independently and in parallel. The 1x5 path has the 1x5 depth-wise separable convolutional layer with C input channels and C21 output channels, while the 5x1 path has the 5x1 depth-wise separable convolutional layer with C input channels and C21 output channels. The 3x3 path has the 3x3 depth-wise separable convolutional layer with C input channels and C21 output channels. The outputs of the three paths are added together by element-wise addition to generate an output of C21 channels, which is sent to the last 1x1 convolutional layer to generate an output of Ci channels.
[0052] At inference, as shown in FIG. 2F, the three parallel kernels could be merged into one diamond-shaped filter to reduce complexity. Because the three are depth-wise separable convolutions and are added together by element-wise addition, the weights of the new diamond-shaped filter can be expressed as:
K_{i,j} = H_{i,j} + V_{i,j} + S_{i,j}, \quad \text{for } (i,j) = (0,-1), (0,0), (0,1), (-1,0), (1,0),
K_{i,j} = H_{i,j}, \quad \text{for } (i,j) = (0,-2), (0,2),
K_{i,j} = V_{i,j}, \quad \text{for } (i,j) = (-2,0), (2,0),
K_{i,j} = S_{i,j}, \quad \text{for } (i,j) = (-1,-1), (1,-1), (-1,1), (1,1),
where H is the horizontal 1x5 separable convolution, V is the vertical 5x1 separable convolution, S is the 3x3 separable convolution, and K is the fused diamond-shaped convolution. The (i, j) pair indicates the location of the kernel element, where (0, 0) is the center location of all kernels (see FIG. 2F), and a kernel contributes zero at any location outside its own support. Before fusion, the three kernels operating at a certain pixel location will have 1x5 + 5x1 + 3x3 = 19 multiply operations. After fusion to the diamond-shaped filter, the number of multiply operations is reduced to 13.
[0053] In general, and without limitation, the kernel size in this design could be extended to more than just 1x5, 5x1, and 3x3, to adapt to different diamond-shaped filters. For example, in an embodiment, the kernel sizes could be 1 x M, N x 1, and L x L, where L < M and L < N (e.g., M, N, L ∈ [3, 9]). This design is also applicable in attention layers, as shown in FIG. 5B, or in any architecture which uses depth-wise separable convolutional layers.
ALF classification-based input features for NNLF
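The fusion of the three kernels into one diamond-shaped kernel can be checked numerically for a single channel. The sketch below is an illustration only: kernels are stored as dictionaries from (i, j) offsets to weights, with i taken as the row offset and j as the column offset (an assumption about the index convention), and the weights and image are random integers so the comparison is exact. The helper names are hypothetical.

```python
import random


def taps_h(H):
    """1x5 horizontal kernel as {(0, j): weight}."""
    return {(0, j): H[j + 2] for j in range(-2, 3)}


def taps_v(V):
    """5x1 vertical kernel as {(i, 0): weight}."""
    return {(i, 0): V[i + 2] for i in range(-2, 3)}


def taps_s(S):
    """3x3 kernel as {(i, j): weight}."""
    return {(i, j): S[i + 1][j + 1] for i in range(-1, 2) for j in range(-1, 2)}


def fuse(*kernels):
    """Add kernels position by position; absent positions count as zero."""
    K = {}
    for taps in kernels:
        for pos, wgt in taps.items():
            K[pos] = K.get(pos, 0) + wgt
    return K


def apply_taps(img, taps, y, x):
    """Filter response at pixel (y, x); one multiply per stored tap."""
    return sum(wgt * img[y + i][x + j] for (i, j), wgt in taps.items())


rng = random.Random(0)
H = [rng.randint(-4, 4) for _ in range(5)]
V = [rng.randint(-4, 4) for _ in range(5)]
S = [[rng.randint(-4, 4) for _ in range(3)] for _ in range(3)]
img = [[rng.randint(0, 255) for _ in range(7)] for _ in range(7)]

h, v, s = taps_h(H), taps_v(V), taps_s(S)
K = fuse(h, v, s)

separate = apply_taps(img, h, 3, 3) + apply_taps(img, v, 3, 3) + apply_taps(img, s, 3, 3)
fused = apply_taps(img, K, 3, 3)
print(len(h) + len(v) + len(s), len(K), separate == fused)   # 19 13 True
```

The tap counts reproduce the 19 and 13 multiplies quoted above, and the fused kernel gives the same response as the sum of the three separate kernels.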
[0054] As shown in FIG. 1A, the headblock (150) has six inputs: Rec, Pred, BS, QPbase, QPslice, and IPB. In FIG. 3, in an embodiment, the headblock is modified to also include k planes of input features used for adaptive loop filtering (ALF) classification. The features used for ALF classification include variance, gradient, activity, and band classification (Ref. [7]). These features are typically computed based on the sample-adaptive offset (SAO) samples used for the ALF classification. However, herein, these features are computed based on the reconstructed samples before any de-blocking, similarly to the values feeding the luma channels of the NNLF. Variance, activity, and band classification have one plane each, whereas gradient has four planes. Benefits of using ALF classification features as input to the NNLF are as follows: a) using ALF-like features harmonizes the design of ALF and NNLF, as both use similar classified features; b) reusing ALF classification features helps the NNLF subsume any advantages the ALF filter might have over the NNLF due to fine-grained classification, which can help the NNLF achieve higher coding performance, especially in the absence of ALF or when the NNLF and ALF filters are placed in parallel; c) although a neural network may be capable of learning ALF-like classification features on its own, it may need a deeper network for equivalent classification, as the classification is done over a large window size, such as w = 12x12 for gradient activity classification and w = 10x10 for variance classification, which needs at least 5 layers of a 3x3 convolution block to match the receptive field size of the ALF classification window. Explicit reuse of ALF features directly as input to the NNLF can relieve the NNLF of the burden of implicit classification inside the network, which can help in reducing NNLF complexity. In FIG. 3, for the new ALF-related 3x3 CONV block, typical values for di include 8, 12, and 16.
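The depth argument in item c) can be illustrated with the standard receptive-field relation for stacked stride-1 convolutions, where each additional k x k layer widens the receptive field by k - 1 samples. The sketch below assumes stride 1 and no dilation, which the text does not state; the function names are hypothetical.

```python
def receptive_field(num_layers, kernel=3):
    """Side length of the receptive field of stacked stride-1 kernel x kernel convolutions."""
    return 1 + num_layers * (kernel - 1)


def layers_needed(window, kernel=3):
    """Smallest number of stacked layers whose receptive field covers window x window."""
    n = 0
    while receptive_field(n, kernel) < window:
        n += 1
    return n


for window in (10, 12):
    n = layers_needed(window)
    print(window, n, receptive_field(n))
```

Under these assumptions a 10x10 window is first covered by 5 layers (receptive field 11x11) and a 12x12 window by 6 layers (13x13), which is consistent with the lower bound of at least 5 layers stated above.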
[0055] Two alternative values for the number of ALF feature channels (k) are considered: k = 6 and k = 4. When k = 6, four planes are used for gradient (g_v, g_h, g_d1, g_d2), one plane for variance, and one plane for band classification. In this case, activity is not used since it is derived from the gradient planes (A = g_v + g_h). When k = 4, the four gradient planes are combined into two planes: one plane for directionality (D) and one plane for activity (A = g_v + g_h). The two planes for variance and band classification remain the same. The horizontal, vertical, and two diagonal gradients are calculated using the 1-D Laplacian, as follows:
g_v = \sum_{k=i}^{i+w-1} \sum_{l=j}^{j+w-1} V_{k,l}, \quad V_{k,l} = |2R(k,l) - R(k,l-1) - R(k,l+1)|,
g_h = \sum_{k=i}^{i+w-1} \sum_{l=j}^{j+w-1} H_{k,l}, \quad H_{k,l} = |2R(k,l) - R(k-1,l) - R(k+1,l)|,
g_{d1} = \sum_{k=i}^{i+w-1} \sum_{l=j}^{j+w-1} D1_{k,l}, \quad D1_{k,l} = |2R(k,l) - R(k-1,l-1) - R(k+1,l+1)|,
g_{d2} = \sum_{k=i}^{i+w-1} \sum_{l=j}^{j+w-1} D2_{k,l}, \quad D2_{k,l} = |2R(k,l) - R(k-1,l+1) - R(k+1,l-1)|,
where w is the ALF gradient window size (such as w = 4 for a 4x4 window or w = 12 for a 12x12 window), indices i and j refer to the coordinates of the upper left sample within the window, and R(i,j) indicates a reconstructed sample at coordinate (i,j). As previously mentioned, the gradients can be provided to the NNLF either as four separate planes or combined into two planes for directionality and activity. Directionality D is derived by comparing the ratios of the horizontal, vertical, and diagonal gradients (e.g., with S = g_h + g_v + g_d1 + g_d2) with a set of thresholds which support more edge strengths:
r_{h,v} = \frac{\max(g_h, g_v)}{\min(g_h, g_v)}, \quad r_{d1,d2} = \frac{\max(g_{d1}, g_{d2})}{\min(g_{d1}, g_{d2})}.
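The four 1-D Laplacian gradients can be sketched directly from their definitions. This is a minimal illustration under stated assumptions: the sample array is indexed as `R[k][l]` in exactly the order used by R(k, l), the window is taken to start at its upper left sample (i, j) and span w samples in each direction, and the caller is assumed to leave a one-sample border around the window. The function name is hypothetical.

```python
def alf_gradients(R, i, j, w):
    """Return (g_v, g_h, g_d1, g_d2) summed over the w x w window starting at (i, j)."""
    g_v = g_h = g_d1 = g_d2 = 0
    for k in range(i, i + w):
        for l in range(j, j + w):
            c = 2 * R[k][l]
            g_v += abs(c - R[k][l - 1] - R[k][l + 1])
            g_h += abs(c - R[k - 1][l] - R[k + 1][l])
            g_d1 += abs(c - R[k - 1][l - 1] - R[k + 1][l + 1])
            g_d2 += abs(c - R[k - 1][l + 1] - R[k + 1][l - 1])
    return g_v, g_h, g_d1, g_d2


# Test pattern: samples vary quadratically along the second index only.
R = [[l * l for l in range(6)] for k in range(6)]
g_v, g_h, g_d1, g_d2 = alf_gradients(R, 1, 1, 4)
activity = g_v + g_h   # A = g_v + g_h
print(g_v, g_h, g_d1, g_d2, activity)
```

For this pattern every Laplacian that crosses the second index equals 2 and the one along the first index equals 0, so over 16 samples the gradients are 32, 0, 32, and 32. The ratios r_{h,v} and r_{d1,d2} are left out of the sketch because the text does not say how a zero minimum is handled.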
[0056] In an embodiment, the horizontal / vertical edge strength E_HV and the diagonal edge strength E_D are calculated first. Consider example thresholds Th = [1.25, 1.5, 2, 3, 4.5, 8]. Edge strengths are computed as follows:
E_{HV} = \begin{cases} 0, & \text{if } r_{h,v} < Th[0] \\ \text{max integer so that } r_{h,v} > Th[E_{HV} - 1], & \text{otherwise} \end{cases}
E_{D} = \begin{cases} 0, & \text{if } r_{d1,d2} < Th[0] \\ \text{max integer so that } r_{d1,d2} > Th[E_{D} - 1], & \text{otherwise} \end{cases}
When r_{h,v} > r_{d1,d2}, i.e., horizontal / vertical (H / V) edges are dominant, then D is derived by using Table 2; otherwise, when diagonal (D) edges are dominant, then D is derived by using Table 3.
Table 2. Mapping of H / V-dominant edge-strength values to D values
      0    1    2    3    4    5    6
0     0
1     1    2
2     3    4    5
3     6    7    8    9
4    10   11   12   13   14
5    15   16   17   18   19   20
6    21   22   23   24   25   26   27
Table 3. Mapping of D-dominant edge-strength values to D values
      0    1    2    3    4    5    6
0    28    0    0    0    0    0    0
1    29   30    0    0    0    0    0
2    31   32   33    0    0    0    0
3    34   35   36   37    0    0    0
4    38   39   40   41   42    0    0
5    43   44   45   46   47   48    0
6    49   50   51   52   53   54   55
[0057] Activity A is computed as the sum of vertical and horizontal gradients in an N x N window (e.g., N = 4) that covers the target 2x2 area.
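The edge-strength and table lookup steps can be sketched as below. Two points are assumptions rather than statements from the text: the rows of each table are taken to be the dominant edge strength (E_HV in Table 2, E_D in Table 3) and the columns the other one, which is the only assignment under which the populated cells are reachable, and the populated cells are generated with the triangular-number formula row * (row + 1) / 2 + column, which reproduces both tables. The function names are hypothetical.

```python
TH = [1.25, 1.5, 2, 3, 4.5, 8]   # example thresholds from the text


def edge_strength(ratio, th=TH):
    """Largest E such that ratio > th[E - 1]; 0 if no threshold is exceeded."""
    e = 0
    for t in th:          # thresholds are in increasing order
        if ratio > t:
            e += 1
    return e


def directionality(r_hv, r_d, th=TH):
    """Map the two gradient ratios to the directionality index D (0..55)."""
    e_hv = edge_strength(r_hv, th)
    e_d = edge_strength(r_d, th)
    if r_hv > r_d:
        # H / V dominant: Table 2 (assumed row = E_HV, column = E_D)
        return e_hv * (e_hv + 1) // 2 + e_d
    # Diagonal dominant: Table 3 (assumed row = E_D, column = E_HV)
    return 28 + e_d * (e_d + 1) // 2 + e_hv


print(directionality(2.5, 1.3))   # E_HV = 3, E_D = 1 -> Table 2 entry 7
print(directionality(1.3, 9.0))   # E_D = 6, E_HV = 1 -> Table 3 entry 50
```

Because both ratios are compared against the same increasing thresholds, the dominant ratio always has an edge strength at least as large as the other one, so only the lower-triangular cells of each table are ever addressed.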
[0058] The variance plane used as input is computed using again x(i, j) values before any deblocking for the user-specified window size N x N (for example, N = 10), using, as an example, the following equation:
\mathrm{Var}_N = \left( 13 \cdot \left( \ldots - \Big( \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \Big)^{2} \right) \right) \gg 14
[0059] The band classifier is computed as the sum of luma samples in a 2x2 luma block (25 classes), where
class_index = (sum * 25) >> (sample bit depth + 2).
Unequal luma-chroma split architecture
[0060] As shown in FIG. 1 A, the output of fusion and transition blocks (140) is split equally to Cyand Cuvchannels, which serve as input to the tw o separate processing paths for luma and chroma. In the prior art, the output of the fusion and transition block had Cy+ Cuvchannels which were split into Cyand Cuvchannels, where Cy= Cuv. In an example embodiment, it may be beneficial to allocate more channels for luma than for chroma. For example, as shown in FIG. 4, if the output of the fusion and transition blocks (440) comprises Cy + Ca'vchannels, these channels may be split into Cyand C„vchannels, where CyCu'v. As before, these Cyand Cv'vsplit channels serve as inputs to the first 1x1 convolutional layers in the subsequent luma and chroma paths; however, with the proposed unequal split, the luma performance is anticipated to improve, as it is expected that a few kernels from the fusion and transition blocks will prioritize luma processing. As an example, and without limitation, Table 4 lists example values for Cy+ Clv. Cyand C„v.Table 4. Example luma and chroma split valuesr ^y'+ ^ cuv rCyuufv64 48 1664 56 864 40 2432 24 8
[0061] Given this unequal split output, Cy, Cuv, C1y, and C1uv in the subsequent luma and chroma processing paths can optionally be adjusted as well. Alternatively, the other layers can retain the same configuration as before.
Cross-component link architecture
[0062] In Ref. [8], the authors first introduced a cross-component merging technique between the luma and chroma backbone paths, where luma information closer to the output layers is fused into the chroma branch. However, in such an approach, the chroma backbones have to wait for a luma output, which makes the luma and chroma processing sequential and increases the latency of the final NNLF output. Later, the cross-component technique proposed in Ref. [9] solved this latency issue by adding the luma and chroma features after the backbone paths and before the “tail” paths. However, this method has the disadvantage that there are not enough CNN layers left to process the cross-component information. The Ref. [9] technique, as shown in FIG. 5A (with NN modules as depicted in FIG. 5B), was adopted into the low-operation point (LOP) designs LOP5 (Ref. [10]) and LOP6 (Ref. [11]) adopted by JVET in the NNVC-13.0 software.
[0063] As depicted in FIG. 5A, the architecture is similar to the one depicted in FIG. 1A, but with certain changes, including the introduction of attention blocks.
[0064] For example, in the headblock, d = [d1, d2, d3, d4, d5, d6] = [16, 8, 4, 2, 2, 64], and in a backbone block (BBBlock()), without limitation, example block parameters (e.g., see FIG. 5B) may be as in Table 5.
Table 5. Example backbone neural-network parameters for LOP5 and LOP6
Parameter    Luma          Chroma
C            CY = 32       Cuv = 32
C1           C1Y = 176     C1uv = 96
C21          C21Y = 32     C21uv = 32
Ca           32
[0065] The TwinBlock() consists of two backbone blocks and an attention layer. The TripleBlock() comprises three backbone blocks and an attention layer.
[0066] Pixel Shuffle rearranges elements in a tensor of shape (*, C x r^2, H, W) to a tensor of shape (*, C, H x r, W x r), where r is an upscale factor. Note also that an inverse DCT (IDCT) block is now part of the two backend paths. Furthermore, there is a cross-component link and an addition from the luma backend to the chroma backend.
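By way of illustration only, the Pixel Shuffle rearrangement may be sketched in plain Python for a single sample, without a batch dimension. The function name pixel_shuffle is hypothetical, and the channel ordering (input channel c x r^2 + i x r + j maps to output row offset i and column offset j) is assumed to follow the convention commonly used by deep-learning frameworks.

```python
def pixel_shuffle(x, r):
    """Rearrange a nested list of shape [C*r*r][H][W] into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r cell
            for j in range(r):      # column offset inside each r x r cell
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for xx in range(w):
                        out[c][y * r + i][xx * r + j] = src[y][xx]
    return out


# Four 1x1 channels become one 2x2 channel when r = 2.
shuffled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)
```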
[0067] FIG. 6A and FIG. 6B depict examples of two alternative LOP architectures with a cross-component link according to embodiments of this invention. The proposed architectural changes are depicted using the shaded blocks. Unlike the prior art, these techniques avoid latency issues and have sufficient backbone blocks in the chroma branch for processing. In FIG. 6A, the luma information is taken from the luma branch before one triple block and a twin block, and is fused into the chroma branch before three backbone blocks. The 1x1 convolution layer (605) is used to process the cross-component luma information before fusing it with the chroma. The combined use of the 1x1 convolutional layer (605) and the concatenation (CONCAT) of luma and chroma channels differs from the prior art. In FIG. 6B, the luma information is taken just before the triple block and is fused into the chroma branch similarly to FIG. 6A.
[0068] Note that in the chroma path, in the first 1x1 convolutional block (610), the original Cuv x Cuv1 block is replaced by a Cuv x C'uv1 block. In an embodiment, without limitation, C'uv1 = 80. The two subsequent BBBlocks in the chroma path (BBBlock uv(l)) also use C'uv1 instead of the original Cuv1. Note that the output of the 1x1 convolutional layer (605) has C''uv1 channels. In an embodiment, without limitation, C''uv1 = 16. Since Cuv1 is the number of output channels of the CONCAT layer, Cuv1 = C'uv1 + C''uv1. In an embodiment, without limitation, Cuv1 = 96 when C'uv1 = 80 and C''uv1 = 16.
[0069] Similarly to these two examples, in other embodiments, luma information could be accessed at other intermediate points of the luma path and combined with same or other intermediate points of the chroma branch.
[0070] A person skilled in the art will appreciate that the proposed cross-link architecture is applicable to any neural network architecture with separate, but parallel, luma and chroma processing paths, especially when the luma branch has more CNN layers than the chroma branch. Furthermore, compared to using “addition” in the cross-link path (see FIG. 5A), the use of “concatenation” offers at least the following advantages:
• The information from the luma and chroma paths is inherently different. Adding luma and chroma features directly without concatenation may lead to unwanted interference between the two different feature sets.
• By concatenating instead of adding, the luma and chroma information from their respective paths is preserved. In the chroma path, after concatenation, the subsequent CNN layers can then determine how to fuse features from each path and the appropriate weights for each.
Loop-filter design with split chroma U and V branches
[0071] In FIG. 5A, FIG. 6A, and FIG. 6B, a common design paradigm is that the input data is first fused by a headblock and then processed by the transition block. After that, the channel maps are divided into two groups and passed into a luma branch and a chroma branch for separate processing. The benefit of this design is that, while luma and chroma have similarities, they are different after reconstruction, and luma usually requires more layers to process. This split luma-chroma design provides the flexibility to assign more layers and parameters to luma, and hence achieves a better trade-off between performance and complexity. Meanwhile, the U and V chroma components, while different after reconstruction, are strongly correlated. In an embodiment, a similar philosophy could be used to further split the chroma (UV) branch into a U branch and a V branch to reduce complexity while maintaining good filtering performance. In an example embodiment, as shown in FIG. 5C, the neural network starts with a headblock that processes the input and fuses it into channel maps. The output of the headblock is fed into a transition block, whose output is fed into a luma branch and a chroma branch. The feature maps that are input to the chroma branch are first processed by one or more neural network layers and then further split into two groups, which are processed by separate U and V branches. The outputs of the luma branch, the U branch, and the V branch are used to reconstruct the luma (Y) and chroma (U and V) components, respectively.
[0072] In an embodiment, an example architecture of the split U and V design is depicted in FIG. 6C. Given the same top architecture as the one depicted in FIG. 6A or FIG. 6B, FIG. 6C depicts the split of the Chroma UV processing block 620 into two separate paths. More specifically, in an embodiment, the Cuv channels of feature maps after the fourth BBlock uv are split into two groups, each having Cuv / 2 channels of feature maps. Each group of Cuv / 2 channels is input to a U or a V branch for further filtering. The U and V branches are very similar to the corresponding layers in the Chroma branch in FIG. 6B, while the channel numbers in each branch are set to half of those in FIG. 6B.
Loop-filter design with bit-packed input data
[0073] As shown in FIG. 1A, the headblock (150) has six separate inputs consisting of: three planes of luma and chroma reconstructed samples (Rec), three planes of luma and chroma predicted samples (Pred), three planes of luma and chroma boundary strength information (BS), the GOP level QP plane (QPbase), the slice QP plane (QPslice), and packed intra, inter uni-prediction, and inter bi-prediction information (IPB) in one plane. In a typical fixed-point implementation, each pixel of the input data is represented as 16-bit data; however, the effective value ranges of inputs BS, QPbase, QPslice, and IPB are much smaller than the allocated bit width. For example, BS consists of three channels (luma Y, chroma U, and chroma V) each ranging from 0 to 2, requiring 6 bits in total. The IPB mode information has one channel with a value in the range of 0 - 7, requiring 3 bits. QPSlice and QPbase each require 7 bits for their ranges of -24 to 63. The current approach allocates a separate 16-bit input codeword (head) for each of them, resulting in inefficient memory use and overhead.
[0074] Given the range of values for the various neural network inputs, they can all be efficiently bit-packed using fewer 16-bit codewords. For example, in an embodiment, both QPbase and QPslice may be bit-packed into a single 16-bit codeword. For instance, as depicted in FIG. 11B, QPbase could occupy the 7 least significant bits (LSBs), while QPslice could be placed in bit indices 7 to 13. In a similar fashion, the three BS channels can be bit-packed together with IPB. For instance, as depicted in FIG. 11C, the 3-bit IPB could be stored in the 3 LSBs, and the BS channels, each requiring 2 bits, would use the next 6 bits. The order of bit packing can be changed and need not be constrained; for example, QPbase can be stored in the most significant bits (MSBs) and QPslice in the LSBs. However, the bit-packing layout needs to be consistent between training and inference.
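By way of illustration only, the two example codeword layouts may be sketched as follows. All function names are hypothetical. Two details are assumptions not stated in the text: the QP values are shifted by 24 so that the range -24 to 63 maps to the unsigned range 0 to 87, and the BS channels are ordered Y, U, V from the lower bits upward.

```python
QP_OFFSET = 24  # assumption: maps the QP range -24..63 onto 0..87 (7 bits)


def pack_qp(qp_base, qp_slice):
    """Pack QPbase into bits 0-6 and QPslice into bits 7-13 of one word."""
    for qp in (qp_base, qp_slice):
        if not -24 <= qp <= 63:
            raise ValueError("QP out of the range -24..63")
    return (qp_base + QP_OFFSET) | ((qp_slice + QP_OFFSET) << 7)


def unpack_qp(word):
    """Return (qp_base, qp_slice) from a packed QP codeword."""
    return (word & 0x7F) - QP_OFFSET, ((word >> 7) & 0x7F) - QP_OFFSET


def pack_bs_ipb(ipb, bs_y, bs_u, bs_v):
    """Pack the 3-bit IPB into bits 0-2 and three 2-bit BS values above it.

    Assumption: BS channel order is Y (bits 3-4), U (bits 5-6), V (bits 7-8).
    """
    if not 0 <= ipb <= 7:
        raise ValueError("IPB out of the range 0..7")
    for bs in (bs_y, bs_u, bs_v):
        if not 0 <= bs <= 2:
            raise ValueError("BS out of the range 0..2")
    return ipb | (bs_y << 3) | (bs_u << 5) | (bs_v << 7)


def unpack_bs_ipb(word):
    """Return (ipb, bs_y, bs_u, bs_v) from a packed BS/IPB codeword."""
    return word & 0x7, (word >> 3) & 0x3, (word >> 5) & 0x3, (word >> 7) & 0x3
```

Both codewords fit within 16 bits (14 bits for the QP pair and 9 bits for BS with IPB), so the four separate input planes collapse into two.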
[0075] FIG. 11A depicts an example of a modified headblock (150A) that accounts for input bit-packing as described earlier. In an embodiment, without limitation, there are two bit-packed codewords: a) QPslice and QPbase (200) and b) BS channels (Y, Cb, and Cr) with IPB (205).
[0076] Additionally, an encoder can choose to enable CU-level QP adaptation by enabling the pps_cu_qp_delta_enabled_flag, in which case the CUs in a slice are coded using different QPs. In such instances, the QPslice input is replaced with CU-level QPs during the bit-packing process.
NNLF and in-loop filtering pipeline optimisations
[0077] FIG. 7 illustrates an existing pipeline of a video architecture with a neural-network loop filter (NNLF) in operation with other VVC-based (conventional) in-loop filters, which include a deblocking filter, a sample adaptive offset (SAO) filter, and an adaptive loop filter (ALF) (Ref. [12]). In this pipeline, first, reconstructed samples (702) are processed by the deblocking filter and the NNLF. Next, the filtered samples from the NNLF (RNN) and the deblocking filter (RDB) are blended in blender (705) as:
$$R_{OUT1} = (R_{NN} - R_{DB}) \times w + R_{DB} = w \times R_{NN} + (1 - w) \times R_{DB}$$
where w denotes a blending factor in [0, 1]. The blended output (ROUT1) is later input to the SAO filter, followed by the ALF filtering operation. When an NNLF is applied to reconstructed pictures, the scaling factor (w) is derived by the encoder to minimise the distortion of the blended output (ROUT1) with respect to the original input samples. This scaling factor may be signaled for each color component in the slice header. The NNLF placement in this pipeline is considered optimal for best coding performance, but it results in high latency and low throughput, as the sequentially placed conventional loop filter tools, such as the SAO and ALF filters, need to wait for the output of the NNLF, which is many orders of magnitude more complex than the conventional VVC in-loop filter tools. Even when the NNLF is accelerated using a Graphics Processing Unit (GPU), a Neural Processing Unit (NPU), or dedicated AI-accelerated hardware, the data transfers between any specialized hardware and the conventional loop filter blocks can result in high memory bandwidth and latency. Even if the NNLF is processed on a GPU / NPU, it is anticipated that its execution will require more time than the combined execution time of the other loop filters. Thus, it is desirable to reduce the latency and improve the throughput of such an architecture by enabling more parallel processing of the NNLF with the other conventional in-loop filters.
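By way of illustration only, the blending of the NNLF output with the deblocking-filter output may be sketched as follows. The function name blend and the use of flat Python lists of sample values are assumptions made for this sketch; rounding and fixed-point arithmetic are omitted.

```python
def blend(r_nn, r_db, w):
    """Blend NNLF samples with deblocked samples: (R_NN - R_DB) * w + R_DB."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("blending factor w must lie in [0, 1]")
    return [(nn - db) * w + db for nn, db in zip(r_nn, r_db)]


r_nn = [100, 50]
r_db = [80, 70]
r_out1 = blend(r_nn, r_db, 0.25)

# The two forms of the equation agree sample by sample.
alt = [0.25 * nn + (1 - 0.25) * db for nn, db in zip(r_nn, r_db)]
assert r_out1 == alt
```

Setting w to 0 returns the deblocked samples unchanged, and setting w to 1 returns the NNLF samples.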
[0078] As depicted in FIG. 8A, the VVC in-loop filters have been further enhanced in the Enhanced Compression Software model (Ref. [13]) by adding additional loop filter blocks, such as a Bilateral filter (BIF), a Cross-Component SAO (CC-SAO) filter, and an improved ALF filter (805), which consists of multiple ALF fixed filters followed by an ALF online filter. BIF and SAO operate in parallel and use ROUT1 (the output from the deblocking filter) as input. As per the equation below, each filter creates an offset per sample; these offsets are added to the input sample and the result is then clipped to generate the output (ROUT2):
$$R_{OUT2} = \mathrm{clip}(R_{OUT1} + \delta_{SAO} + \delta_{CCSAO} + \delta_{BIF})$$
where δBIF is the offset from the bilateral filter, δSAO is the offset from SAO, and δCCSAO is the offset from CC-SAO. ROUT2 is then input to the improved ECM ALF filter (805A), which, as depicted in FIG. 8A, consists of ALF fixed filters followed by an ALF online filter. The ALF fixed filters use pre-trained ALF clips and coefficients, which are implicitly applied on the decoder side based on the fixed filter classification. On the other hand, the weights and clip values for the ALF online filters are estimated on the encoder side and explicitly coded in the bitstream through ALF Adaptive Parameter Sets (APS), as in VVC.
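By way of illustration only, the combination of the per-sample offsets from the parallel filters may be sketched as follows. The function name apply_offsets is hypothetical, and clipping to the range 0 to 2^bit_depth - 1 is an assumption; the text only states that the sum is clipped.

```python
def apply_offsets(r_out1, d_sao, d_ccsao, d_bif, bit_depth=10):
    """Add the SAO, CC-SAO, and BIF offsets to each sample and clip.

    Assumption: samples are clipped to [0, 2**bit_depth - 1].
    """
    max_val = (1 << bit_depth) - 1
    return [
        min(max(s + a + b + c, 0), max_val)
        for s, a, b, c in zip(r_out1, d_sao, d_ccsao, d_bif)
    ]


r_out2 = apply_offsets([1020, 3, 500], [5, -2, 1], [2, -3, 0], [1, -1, -1])
```

Because every offset is computed from the same input samples, the three filters can run in parallel and only the final addition and clipping need to be serialized.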
[0079] FIG. 8B depicts the NNLF and ECM-based conventional loop filter pipeline proposed in Ref. [14], where the NNLF operates in parallel with the deblocking filter. The outputs of the deblocking filter and the NNLF are blended using the same equation described earlier to generate ROUT1. The blended output is then used as input for other loop filter tools, such as SAO / BIF and the improved ALF filters (805B). As described earlier, this pipeline incurs high latency and reduced throughput due to the higher complexity of the NNLF.
[0080] FIG. 8C illustrates another NNLF pipeline, proposed in Ref. [15]. In this pipeline, the NNLF operates in parallel with the deblocking filter, BIF, SAO, ALF Fixed Filters, and the Residual and Gaussian filters. The Residual Filter operates on residual samples. A switch, S (810), controls whether the NNLF output or the Gaussian-filtered output is fed into the ALF Online Filters. The blending of samples from the ALF Fixed Filters, the NNLF / Gaussian filter, and the Residual Filter is performed by the ALF Online Filter. Hence, this pipeline allows for more parallelization compared to the pipeline illustrated in FIG. 8B. This pipeline is expected to have less latency compared to the previous pipelines illustrated in FIG. 8A. Although the latency is reduced with improved parallelism, the ALF online filter in this pipeline is still applied sequentially after the NNLF and does not fully hide the NNLF latency. The pipeline still incurs an increased data-sharing load between the conventional loop filter blocks and the NNLF blocks on specialized hardware (e.g., GPU / NPU), and it also suffers from a 0.8% reduction in compression efficiency compared to the architecture in FIG. 8B.
[0081] In an embodiment, FIG. 9A illustrates an example pipeline of a video architecture wherein the Neural Network Loop Filter (NNLF) operates in parallel with all the other VVC in-loop filters, such as the deblocking filter, SAO, and ALF. Similarly, FIG. 9B illustrates a pipeline of a video architecture wherein the NNLF operates in parallel with all the other conventional ECM in-loop filters, such as the deblocking filter, BIF, SAO, and the improved ALF (905) consisting of both fixed filters and the ALF online filter. In this embodiment, the blending (910) of the NNLF (RNN) and ALF (RALF) output samples is carried out in a manner similar to the blending of the deblocking filter and NNLF samples illustrated in FIG. 7 to generate ROUT. In an embodiment:
$$R_{OUT} = (R_{NN} - R_{ALF}) \times w + R_{ALF} = w \times R_{NN} + (1 - w) \times R_{ALF}$$
where the blending factor (w) is derived and signaled for each color component in the slice header as described earlier for ROUT1. This embodiment allows for maximum parallelization of the NNLF with the other in-loop filters, unlike the previous pipelines illustrated in FIG. 8B and FIG. 8C. As the NNLF runs entirely in parallel with the other loop filters, the observed latency is the lowest compared to the previous pipelines illustrated in FIG. 7 for VVC-based loop filters and in FIG. 8B and FIG. 8C for ECM-based loop filters. It was also observed that this pipeline achieves higher compression efficiency than the pipeline in FIG. 8C and offers a better trade-off between compression efficiency and the latency and data bandwidth for NN-based loop filters operated in conjunction with other conventional in-loop filters.
[0082] In one alternative embodiment, one can position the NNLF after the deblocking filter. That is, in FIG. 9B the input of the NNLF filter will be moved from the input of the deblocking filter (902) to the output of the deblocking filter (904).
[0083] In another embodiment, one can position the NNLF after the SAO / CC-SAO filters. That is, in FIG. 9B the input of the NNLF filter will be moved from the input of the deblocking filter (902) to the output of the SAO / CC-SAO filters (906).
[0084] In another embodiment, FIG. 9C illustrates an example pipeline of a video architecture wherein the NNLF operates in parallel with the deblocking filter, BIF, SAO, ALF Fixed Filters, and the Residual and Gaussian filters. Unlike the pipeline in FIG. 8C, the NNLF samples are blended with ROUT2 before being fed to the switch S. ROUT2 (920) is generated by blending the SAO and BIF outputs as described earlier. The switch S then selects either ROUT (930), i.e., the blended NNLF and ROUT2 samples, or the Gaussian-filtered samples to pass to the ALF Online Filters. The NNLF output and the ROUT2 samples are blended using the following equation:
$$R_{OUT} = (R_{NN} - R_{OUT2}) \times w + R_{OUT2} = w \times R_{NN} + (1 - w) \times R_{OUT2}$$
where the blending factor (w) is derived and signaled for each color component in the slice header as described earlier for ROUT1.
[0085] The ALF Online filter in VVC (Ref. [16]) classifies input samples into 25 distinct classes based on such properties as directionality, activity, and variance, derived from the blended output of SAO and BIF samples (ROUT2 samples (920)). For a luma component, each 4 x 4 block is categorized implicitly into one out of 25 classes. The classification index C is derived based on its directionality D and a quantized value of activity A, as
$$C = 5D + A,$$
where D and A are computed as in Ref. [16].
[0086] In an embodiment, the ALF Online filter reclassifier (925) classifies the ROUT samples (930) into 25 distinct classes using the VVC classification process (Ref. [16]). The resulting classes are used to determine the clipping indices and filter coefficients, which the ALF Online filter then applies to its inputs designated for filtering, for example, by applying the 7x7 diamond-shaped centre-symmetric filter used in VVC (Ref. [16]). According to experiments, this pipeline may suffer a 0.4% reduction in compression efficiency compared to the architecture in FIG. 8B, but it provides better compression efficiency than FIG. 8C and FIG. 9B. This pipeline offers parallelization similar to that of FIG. 8C, proposed in Ref. [15].
[0087] In another embodiment, as will also be discussed later for the corresponding chroma loop filter, the switch S may be removed, thus allowing the ALF online filters to receive separate inputs from the Gaussian fixed filter and ROUT (930).
[0088] FIG. 10A depicts the ECM-based loop filter pipeline for chroma samples. Unlike FIG. 8A, which depicts the ECM-based loop filter pipeline for luma samples, here the ALF Online Filter does not use pre-deblock, Gaussian fixed filter, or residual fixed filter samples as input. This indicates that the ALF Online Filter functions with different inputs for luma and chroma samples. In the NNLF pipeline described in FIG. 8C, which is proposed in Ref. [15], a switch, S (810), controls whether the NNLF output or the Gaussian-filtered output is fed into the ALF Online Filters. Since the Gaussian fixed filter is not used for the chroma ALF Online Filter, the method from Ref. [15] limits the ALF Online Filter's use of NNLF samples to luma only.
[0089] In an embodiment, FIG. 10B illustrates an example pipeline for chroma loop filtering where the ALF Online Filter uses NNLF output samples as an additional input. This separate input to the ALF Online filters will provide additional BD-Rate gain from the NNLF chroma output samples. This idea can be used for the ALF Online luma filter as well, by adding NNLF output samples as an additional input to the ALF Online luma filters rather than toggling between the Gaussian fixed filter and NNLF outputs using the switch S (810). Thus, in an embodiment, the ALF Online Filter may receive both Gaussian Fixed Filter and NNLF samples as input.
[0090] In another embodiment, the previously described NNLF and in-loop filtering pipeline optimisations remain applicable even if any conventional in-loop filters are modified or removed, provided that the NNLF placement within the pipeline does not depend on those filters. For instance, if the residual fixed filters or BIF are removed from any of the previously discussed pipelines, the proposed architectures are still applicable.
ALF classification-based adaptive blending factors (wi)
[0091] In an embodiment, the blending of NNLF-filtered samples with the output of other loop filter tools is made adaptive by replacing a fixed scaling (blending) parameter (e.g., w) with adaptive blending at the block level, using adaptive blending factors (wi) based on derived block classification indices (i), as described earlier. For example, given a k x k block with classification index (i) and using weight wi:
$$R_{OUT} = (R_{NN} - R_{LF}) \times w_i + R_{LF} = w_i \times R_{NN} + (1 - w_i) \times R_{LF},$$
where RLF denotes the corresponding loop-filter output for the k x k block generated from traditional loop filtering (e.g., RALF or ROUT2).
[0092] For such adaptive blending, ALF-like block classification can be used to determine the adaptive blending factor (wi) that minimizes distortion of the blended output (ROUT) at a 2x2 or 4x4 window size, rather than using a fixed scaling factor (w) at the slice level. The scaling factors (wi) may be calculated jointly for all the k x k blocks of the frame that belong to the same class and signalled in the bitstream. The ALF classification block size kx k is typically 2x2 or 4x4. Hence in the worst-case scenario, there may be up to 25 scaling factors (wi) corresponding to the 25 ALF classification filters. The encoder may merge scaling factors across different classes if they are comparable or similar. This adaptive blending mechanism can be generalized to the fusion of NNLF output with output of any other conventional in-loop filter tools depicted in previous embodiments.
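By way of illustration only, the derivation and use of per-class blending factors may be sketched as follows. The function names are hypothetical. Two simplifications are assumptions made for this sketch: a class index is supplied per sample rather than per k x k block, and the distortion measure is the sum of squared errors, for which the minimizing weight has a closed form that is then clipped to [0, 1].

```python
from collections import defaultdict


def derive_class_weights(orig, r_nn, r_lf, class_idx):
    """Derive one least-squares blending weight per class.

    For each class, w = sum(d * (orig - lf)) / sum(d * d) with d = nn - lf,
    clipped to [0, 1]. Classes where NNLF and loop-filter outputs are
    identical get weight 0.0, since blending has no effect there.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for o, nn, lf, c in zip(orig, r_nn, r_lf, class_idx):
        d = nn - lf
        num[c] += d * (o - lf)
        den[c] += d * d
    return {
        c: min(max(num[c] / den[c], 0.0), 1.0) if den[c] > 0 else 0.0
        for c in den
    }


def blend_by_class(r_nn, r_lf, class_idx, weights):
    """Apply (R_NN - R_LF) * w_i + R_LF using each sample's class weight."""
    return [
        (nn - lf) * weights[c] + lf
        for nn, lf, c in zip(r_nn, r_lf, class_idx)
    ]


orig = [10, 10, 20, 20]
r_nn = [12, 12, 20, 20]
r_lf = [8, 8, 16, 16]
classes = [0, 0, 1, 1]
weights = derive_class_weights(orig, r_nn, r_lf, classes)
r_out = blend_by_class(r_nn, r_lf, classes, weights)
```

In this toy input, the two classes need different weights to reach the original samples, which a single slice-level factor could not provide.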
[0093] Table 6 depicts an example syntax for signalling ALF classification-based NNLF adaptive scaling factors. The nnlf_params() parameters can be coded either in the slice header or in the adaptation parameter sets of NN_APS described later.
Table 6. Example of signaling of ALF classification-based adaptive blending factors
nnlf_params( ) {                                                  Descriptor
  nnlf_filter_flag                                                u(1)
  if( nnlf_filter_flag ) {
    nnlf_num_blend_params_minus1                                  ue(v)
    if( nnlf_num_blend_params_minus1 > 0 ) {
      for( filtIdx = 0; filtIdx < NumAlfFilters; filtIdx++ )
        nnlf_blend_idx[ filtIdx ]                                 ue(v)
      for( i = 0; i <= nnlf_num_blend_params_minus1; i++ )
        nnlf_blend_weight[ i ]                                    ue(v)
    } else {
      nnlf_fixed_blend_weight                                     ue(v)
    }
  }
}
nnlf_filter_flag equal to 1 specifies that an NNLF is applied to the current slice / NN APS.
nnlf_num_blend_params_minus1 specifies the number of NNLF blending weights coded for the current slice / NN APS. When nnlf_num_blend_params_minus1 is 0, a fixed blending weight is applied as specified by nnlf_fixed_blend_weight.
As an example, and without limitation, the variable NumAlfFilters, specifying the number of different adaptive loop filters (online filter classes), is set equal to 25.
nnlf_blend_idx[ filtIdx ] specifies the blending index associated with the ALF filter class indicated by filtIdx, ranging from 0 to NumAlfFilters - 1. The value of nnlf_blend_idx[ filtIdx ] shall be in the range of 0 to nnlf_num_blend_params_minus1, inclusive.
nnlf_blend_weight[ i ] specifies the i-th NNLF blending weight for the current slice / NN APS.
nnlf_fixed_blend_weight specifies the fixed blending weight applied for the slice or picture when nnlf_num_blend_params_minus1 is set to 0.
Quantization-adaptive NNLF output blending
[0094] The output of the Neural Network Loop Filter (NNLF) is usually blended with the output from the conventional video codec to further improve the filtering performance. For example, in FIG. 7, the output of the NNLF is blended with the deblocking filter’s output. In FIG. 9B, the output of the NNLF (RNN) is blended with the output of the ALF (RALF) by:
$$R_{OUT} = (R_{NN} - R_{ALF}) \times w + R_{ALF} = w \times R_{NN} + (1 - w) \times R_{ALF}$$
where the blending factor (w) is usually a predefined value. The above equation is a general form of blending. For example, to blend with the output of the deblocking filter, simply replace RALF with RDBF.
[0095] In a conventional video codec, the quantization parameter (QP) is an important setting when compressing a video sequence. For example, a larger QP value means lower quality with more coding artifacts. Since the NNLF, as a deep neural network, has better generalization and representation ability, it usually outperforms conventional filtering tools more significantly at lower coding quality with larger QP values. Therefore, for lower-quality coded videos at larger QP, it is reasonable to use a larger weight w for blending, i.e., to rely more on NNLF filtering, as it outperforms other filters by a wider margin at higher QP values. The current design, which uses a constant w value, may hinder the NNLF’s ability. Therefore, it is proposed to use an adaptive w value that is a function of QP, which can be expressed as:
w = f(QP)
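By way of illustration only, one possible f(QP) is a three-level step function, sketched below. The function name blend_weight is hypothetical; the thresholds a and b and the three weight levels are example values only, and other mappings (e.g., more levels or a smooth function) would serve equally well.

```python
def blend_weight(qp, a=37, b=27):
    """Map a QP value to a blending weight with a three-level step function.

    Larger QP (lower coding quality) yields a larger weight for the NNLF.
    The thresholds a and b and the returned levels are example values.
    """
    if qp > a:
        return 0.75
    if qp < b:
        return 0.25
    return 0.5


weights = [blend_weight(qp) for qp in (22, 27, 32, 37, 42)]
```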
[0096] In an embodiment, the f(QP) function may be determined using a set of thresholds to decide the output value of w. For example:
$$w = \begin{cases} 0.75, & \text{if } QP > a \\ 0.5, & \text{if } b \le QP \le a \\ 0.25, & \text{if } QP < b \end{cases}$$
Without limitation, for example, a = 37 and b = 27.
High Level Syntax for NNLF
[0097] Given the proposed NNLF architectures, this section presents corresponding high-level syntax (HLS) for a variety of scenarios that combine NNLF architectures with a traditional coding architecture. Table 7 depicts an example of HLS when only a single NNLF architecture is adopted. New syntax elements over existing syntax are shown in an italic font.
Table 7. Example of NNLF signaling in the sequence parameter set RBSP syntax
seq_parameter_set_rbsp( ) {                 Descriptor
  sps_seq_parameter_set_id                  u(4)
  sps_video_parameter_set_id                u(4)
  ...
  sps_nnlf_enabled_flag                     u(1)
  ...
}
sps_nnlf_enabled_flag equal to 1 specifies that NNLF is enabled for the CLVS. sps_nnlf_enabled_flag equal to 0 specifies that NNLF is disabled for the CLVS.
NNLF general constraints information syntax
[0098] An NNLF can greatly alleviate compression artifacts and improve coding efficiency, but as a deep neural network, it also comes with high computational complexity that calls for a relatively powerful GPU or equivalent hardware. Therefore, in a real-world deployment of a video codec containing an NNLF, the flexibility to enable / disable the NNLF is important for adapting to various device and service requirements. For applications and devices that do not have enough computing resources to encode / decode videos with the NNLF’s complexity, it is crucial to disable the use of the NNLF to ensure smooth functioning of the codec. In an embodiment, a general constraints information (GCI) flag is proposed to enable / disable the use of NNLF at the profile level.
Table 8. Example of NNLF signaling using general constraints information syntax
general_constraints_info( ) {               Descriptor
  gci_present_flag                          u(1)
  if( gci_present_flag ) {
    ...
    gci_no_nnlf_constraint_flag             u(1)
    ...
  }
}
gci_no_nnlf_constraint_flag equal to 1 specifies that sps_nnlf_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_nnlf_constraint_flag equal to 0 does not impose such a constraint.
[0099] Alternatively, NNLF processing may be controlled in the picture or slice header as shown in Tables 9 and 10.
Table 9. Example of NNLF signaling in the picture header structure syntax
picture_header_structure( ) {               Descriptor
  ph_gdr_or_irap_pic_flag                   u(1)
                                            u(1)
  if( sps_nnlf_enabled_flag )
    ph_nnlf_enabled_flag                    u(1)
}
ph_nnlf_enabled_flag equal to 1 specifies that NNLF is enabled for the current picture. ph_nnlf_enabled_flag equal to 0 specifies that NNLF is disabled for the current picture. When not present, the value of ph_nnlf_enabled_flag is inferred to be equal to 0.
Table 10. Example of NNLF signaling in the slice header syntax
slice_header( ) {                                                          Descriptor
  sh_picture_header_in_slice_header_flag                                   u(1)
  if( ph_nnlf_enabled_flag && !sh_picture_header_in_slice_header_flag )
    sh_nnlf_used_flag                                                      u(1)
  if( sh_nnlf_used_flag )
    sh_nnlf_params_flag                                                    u(1)
  if( sh_nnlf_params_flag )
    nnlf_params( )
}
sh_nnlf_used_flag equal to 1 specifies that NNLF is used for the current slice. sh_nnlf_used_flag equal to 0 specifies that NNLF is not used for the current slice. When sh_nnlf_used_flag is not present, it is inferred to be equal to sh_picture_header_in_slice_header_flag ? ph_nnlf_enabled_flag : 0.
sh_nnlf_params_flag equal to 1 specifies that NNLF scaling parameters are coded for the current slice. When sh_nnlf_params_flag is equal to 0, the NNLF scaling parameters are selected based on the NNLF parameters coded in the NNLF APS referenced by ph_nnlf_aps_id for the current picture.
[0100] In another embodiment, multiple NNLF models may be supported in the codec. For example, reference pictures generally have better quality than non-reference pictures, while pictures with a lower TemporalId generally have better quality than pictures with a higher TemporalId. Thus, different NNLF models can be assigned to pictures with different characteristics. For example, a higher-complexity NNLF could be assigned to pictures with a lower TemporalId to get the best quality, while the higher TemporalId layers could use a low-complexity NNLF for complexity reduction. Tables 11-14 show examples of HLS for signaling multiple models in the adaptation parameter set (APS).
Table 11. Example of NNLF model signaling in APS signaling syntax
adaptation_parameter_set_rbsp( ) {                  Descriptor
  aps_params_type                                   u(3)
  aps_adaptation_parameter_set_id                   u(5)
  aps_chroma_present_flag                           u(1)
  if( aps_params_type = = ALF_APS )
    alf_data( )
  else if( aps_params_type = = LMCS_APS )
    lmcs_data( )
  else if( aps_params_type = = SCALING_APS )
    scaling_list_data( )
  else if( aps_params_type = = NNLF_APS )
    nnlf_data( )
  aps_extension_flag                                u(1)
  if( aps_extension_flag )
    while( more_rbsp_data( ) )
      aps_extension_data_flag                       u(1)
  rbsp_trailing_bits( )
}
[0101] aps_params_type specifies the type of APS parameters carried in the APS as specified in Table 12. The value of aps_params_type shall be in the range of 0 to 3, inclusive, in bitstreams conforming to this version of this Specification. Other values of aps_params_type are reserved for future use by ITU-T | ISO/IEC. Decoders conforming to this version of this Specification shall ignore APS NAL units with reserved values of aps_params_type.
Table 12. Example APS parameters type codes and types of APS parameters supporting NNLF
aps_params_type    Name of aps_params_type    Type of APS parameters
0                  ALF_APS                    ALF parameters
1                  LMCS_APS                   LMCS parameters
2                  SCALING_APS                Scaling list parameters
3                  NNLF_APS                   NNLF parameters
4-7                reserved
[0102] All APS NAL units with a particular value of aps_params_type, regardless of the nuh_layer_id values and whether they are prefix or suffix APS NAL units, share the same value space for aps_adaptation_parameter_set_id. APS NAL units with different values of aps_params_type use separate value spaces for aps_adaptation_parameter_set_id.
[0103] aps_adaptation_parameter_set_id provides an identifier for the APS for reference by other syntax elements.
When aps_params_type is equal to ALF_APS or SCALING_APS, the value of aps_adaptation_parameter_set_id shall be in the range of 0 to 7, inclusive.
When aps_params_type is equal to LMCS_APS, the value of aps_adaptation_parameter_set_id shall be in the range of 0 to 3, inclusive.
When aps_params_type is equal to NNLF_APS, the value of aps_adaptation_parameter_set_id shall be in the range of 0 to 1, inclusive.
Note: More generally, in other embodiments, the aps_adaptation_parameter_set_id may be set in the range of 0 to N, inclusive (e.g., N = 7).
[0104] If NNLF usage is signaled in the APS, then the model being used is signaled by specifying an NNLF aps_id in the picture header. An example is shown in Table 13.

Table 13. Example of NNLF model signaling in the picture header structure syntax

picture_header_structure( ) {                           Descriptor
    ph_gdr_or_irap_pic_flag                             u(1)
    if( sps_nnlf_enabled_flag ) {
        ph_nnlf_enabled_flag                            u(1)
        if( ph_nnlf_enabled_flag )
            ph_nnlf_aps_id                              ue(v)
    }
}

ph_nnlf_aps_id specifies the aps_adaptation_parameter_set_id of the NNLF APS that the slices in the current picture refer to.
When ph_nnlf_aps_id is present, the following applies:
- The TemporalId of the APS NAL unit having aps_params_type equal to NNLF_APS and aps_adaptation_parameter_set_id equal to ph_nnlf_aps_id shall be less than or equal to the TemporalId of the picture associated with the PH.
- When sps_chroma_format_idc is equal to 0, the value of aps_chroma_present_flag of the APS NAL unit having aps_params_type equal to NNLF_APS and aps_adaptation_parameter_set_id equal to ph_nnlf_aps_id shall be equal to 0.
[0105] The NNLF APS can contain data related to the NNLF model description, such as model complexity. This gives the decoder important information about the NNLF model, such as its complexity, so that the decoder can prepare model and data loading and other operations in advance to ensure correct and instant decoding with NNLF. An example of defining the model complexity information is given in Table 14.

Table 14. Example syntax for NNLF model description information

nnlf_data( ) {                                          Descriptor
    nnlf_num_parameters_minus1                          ue(v)
    nnlf_num_kmac_per_sample_minus1                     ue(v)
    nnlf_params( )
}

nnlf_num_parameters_minus1 plus 1 indicates the maximum number of neural network parameters for the NNLF in units of a power of 1024.
nnlf_num_kmac_per_sample_minus1 plus 1 indicates the maximum number of multiply-accumulate operations per sample of the NNLF in units of 1000.
[0106] In another embodiment, controls for NNLF and ALF can be mutually exclusive. For example, ALF can be used only when no NPU / GPU is available and NNLF can be used only when an NPU / GPU is available. In another example, NNLF can be used for lower temporal IDs and ALF can be used for higher temporal IDs. The mutually exclusive control can be signaled using HLS, such as in the SPS, PPS, PH, SH, and the like.

Table 15. Example syntax for mutually exclusive control for NNLF and ALF in slice header

slice_header( ) {                                       Descriptor
    sh_picture_header_in_slice_header_flag              u(1)
    ...
    if( sps_nnlf_enabled_flag && !sh_alf_used_flag )
        sh_nnlf_used_flag                               u(1)
}

sh_nnlf_used_flag equal to 1 specifies that NNLF is enabled for the Y, Cb, or Cr colour component of the current slice. sh_nnlf_used_flag equal to 0 specifies that NNLF is disabled for the Y, Cb, or Cr colour component of the current slice. When not present, the value of sh_nnlf_used_flag is inferred to be equal to ph_nnlf_enabled_flag.
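The conditional presence of sh_nnlf_used_flag in Table 15 can be sketched in Python as follows. This is a minimal illustration only: the function name and the read_bit callable (standing in for a u(1) read from a bitstream reader) are hypothetical.

```python
from typing import Callable


def parse_sh_nnlf_used_flag(
    read_bit: Callable[[], int],
    sps_nnlf_enabled_flag: int,
    sh_alf_used_flag: int,
    ph_nnlf_enabled_flag: int,
) -> int:
    """Return sh_nnlf_used_flag following the condition in Table 15.

    read_bit is a hypothetical callable returning the next u(1) value.
    """
    if sps_nnlf_enabled_flag and not sh_alf_used_flag:
        # The flag is present in the slice header only when NNLF is enabled
        # at sequence level and ALF is not used in the current slice.
        return read_bit()
    # Not present: inferred to be equal to ph_nnlf_enabled_flag, as stated
    # in the semantics above.
    return ph_nnlf_enabled_flag


# Usage: a one-bit "bitstream" carrying sh_nnlf_used_flag = 1.
bits = iter([1])
flag = parse_sh_nnlf_used_flag(lambda: next(bits), 1, 0, 0)
```

With this structure, a slice that uses ALF never spends a bit on sh_nnlf_used_flag.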
[0107] As depicted in FIG. 1A, NNLF utilizes the base QP of the Group of Pictures (GOP), referred to as QPbase (150), as one of the input parameters in the NNLF headblock. This GOP-level QP is currently inferred based on the pps_init_qp_minus26 parameter signaled in the Picture Parameter Set (PPS). However, an encoder can choose to use a non-zero initial QP offset in the PPS to reduce the slice-level QP delta in the bitstream. This is illustrated in Table 16 below, for a random-access hierarchical GOP configuration of 16 pictures with 5 temporal layers for a base QP configuration of 37. The POC number, slice type, temporal IDs, and slice-level QP are shown in the first three columns. If the encoder chooses to use a PPS initial QP of 37 (which is the same as the base QP), then the slice-level QP delta coded in the stream (see column 4) is much higher than the slice-level QP delta shown in the last column, where the encoder chooses to use a PPS initial QP of 43 (base QP + 6).

Table 16. Example slice QP delta coding for a GOP of 16 pictures with different PPS initial QP
(Configured Base QP = 37; Base QP + 6 = 43)

                                        Slice QP delta for            Slice QP delta for
       Slice Type                       PPS Init QP = 37              PPS Init QP = 43
POC    (Temporal ID)    Slice QP        (pps_init_qp_minus26 = 11)    (pps_init_qp_minus26 = 17)
0      IDR (0)          34              -3                            -9
16     B (1)            40              3                             -3
8      B (2)            41              4                             -2
4      B (3)            43              6                             0
2      B (4)            45              8                             2
1      B (5)            46              9                             3
3      B (5)            46              9                             3
6      B (4)            45              8                             2
5      B (5)            46              9                             3
7      B (5)            46              9                             3
12     B (3)            43              6                             0
10     B (4)            45              8                             2
9      B (5)            46              9                             3
11     B (5)            46              9                             3
14     B (4)            45              8                             2
13     B (5)            46              9                             3
15     B (5)            46              9                             3
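The arithmetic behind Table 16 can be checked with a short Python sketch. It assumes that the slice QP delta is the slice QP minus the PPS initial QP (26 + pps_init_qp_minus26), which is consistent with every row of the table. The dictionary and function names are hypothetical and introduced only for illustration.

```python
# Slice QP per temporal ID, copied from Table 16 (configured base QP = 37).
SLICE_QP_BY_TID = {0: 34, 1: 40, 2: 41, 3: 43, 4: 45, 5: 46}

# Number of pictures per temporal ID among the 17 rows of Table 16.
PICTURES_BY_TID = {0: 1, 1: 1, 2: 1, 3: 2, 4: 4, 5: 8}


def pps_init_qp(pps_init_qp_minus26: int) -> int:
    """Initial QP signaled by the PPS."""
    return 26 + pps_init_qp_minus26


def slice_qp_delta(slice_qp: int, pps_init_qp_minus26: int) -> int:
    """Delta coded at slice level relative to the PPS initial QP (assumed)."""
    return slice_qp - pps_init_qp(pps_init_qp_minus26)


def total_abs_delta(pps_init_qp_minus26: int) -> int:
    """Sum of |slice QP delta| over all rows of Table 16."""
    return sum(
        PICTURES_BY_TID[tid] * abs(slice_qp_delta(qp, pps_init_qp_minus26))
        for tid, qp in SLICE_QP_BY_TID.items()
    )


if __name__ == "__main__":
    for tid, qp in SLICE_QP_BY_TID.items():
        # Columns: temporal ID, slice QP, delta for init QP 37, delta for init QP 43.
        print(tid, qp, slice_qp_delta(qp, 11), slice_qp_delta(qp, 17))
```

In this sketch the deltas for pps_init_qp_minus26 = 17 are smaller in total magnitude, but pps_init_qp(17) evaluates to 43, not the configured base QP of 37. A filter that reads its base QP from the PPS initial QP alone would therefore be given the wrong value.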
[0108] As illustrated in Table 16, the purpose of using a higher initial QP offset of 6 over the base QP in the PPS is to minimize the magnitude of the slice-level QP delta coded in the bitstream. In such a case, the QPbase for the GOP cannot be directly inferred from the PPS initial QP alone, and the NNLF filter operations would incorrectly interpret QPbase if they derived it directly from the PPS initial QP value. To fix this anomaly, in an embodiment, the initial QP offset used by the encoder is signaled in the bitstream through an additional parameter in the PPS, say, pps_nnlf_base_qp_delta. An NNLF can now derive the correct QPbase by subtracting the pps_nnlf_base_qp_delta signaled in the PPS. The syntax for signaling the pps_nnlf_base_qp_delta is specified in Table 17.

Table 17. Example syntax for base QP delta for NNLF in the picture parameter set (PPS)

pic_parameter_set( ) {                                  Descriptor
    ...
    pps_init_qp_minus26                                 se(v)
    if( sps_nnlf_enabled_flag )
        pps_nnlf_base_qp_delta                          se(v)
}

pps_nnlf_base_qp_delta indicates the delta value an NNLF needs to subtract from the PPS initial QP for deriving the base QP as per the equation below:

nnlf_qp_base = pps_init_qp_minus26 + 26 - pps_nnlf_base_qp_delta.

References
Each one of the references listed herein is incorporated by reference in its entirety. The term JVET refers to the Joint Video Experts Team of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29.
[1] J. N. Shingala et al., “AHG11: Complexity reduction on neural-network loop filter,” JVET-AA0080-v2, teleconference, July 2022.
[2] D. Rusanovskyy et al., “AhG11/EE1: Status of the joint EE1-0 (LOP.2) training,” JVET-AF0043, Hannover, DE, 13-20 October 2023.
[3] D. Liu et al., “EE1-1.2: Joint LOP model with inputs transformed,” JVET-AH0080, Rennes, FR, 17-24 April 2024.
[4] Versatile Video Coding, Rec. ITU-T H.266, August 2020, ITU.
[5] J. N. Shingala et al., “Loop Filtering using neural networks,” PCT Patent Application PCT/US2023/026238 (D22063WO01), filed on 26 June 2023.
[6] T. Shao et al., “Optimization techniques for loop filtering using neural networks,” PCT Patent Application PCT/US2024/049106 (D23131WO01), filed on 27 Sept. 2024.
[7] M. Coban et al., “Algorithm description of Enhanced Compression Model 10 (ECM 10),” JVET-AE2025, Geneva, Switzerland, July 2023.
[8] Y. Li et al., “AHG11: Cross-component enhanced LOP filter,” JVET-AJ0068, Kemer, November 2024.
[9] Y. Li et al., “EE1-1.4: Cross-component enhanced LOP filter,” JVET-AK0195, Geneva, January 2025.
[10] E. Alshina et al., “Exploration Experiments on Neural Network-based Video Coding (EE1),” JVET-AK2023, Geneva, Switzerland, January 2025.
[11] E. Alshina et al., “EE1: Exploration experiments on neural network-based video coding,” JVET-AL2023, Teleconference, March 2025.
[12] E. Alshina et al., “Algorithm Description for Neural Network-based Video Coding (NNVC-6.0),” JVET-AE2019, Geneva, CH, 11-19 July 2023.
[13] M. Coban et al., “Algorithm description of Enhanced Compression Model 13 (ECM 13),” JVET-AH2025, Rennes, FR, 17-24 April 2024.
[14] T. Poirier et al., “EE2-related: NNLF interface in ECM,” JVET-AL0228, Teleconference, 26 March - 4 April 2025.
[15] D. Rusanovskyy et al., “EE2-4.8: Integration of NN-based ILF in ALF,” JVET-AK0183, Geneva, CH, 14-22 January 2025.
[16] A. Browne, Y. Ye, and S. H. Kim, “Algorithm description for Versatile Video Coding and Test Model 22,” JVET-AH2002, Rennes, FR, 14-24 April 2024.

EXAMPLE COMPUTER SYSTEM IMPLEMENTATION
[0109] Embodiments of the present invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA), or another configurable or programmable logic device (PLD), a discrete time or digital signal processor (DSP), an application specific IC (ASIC), and / or apparatus that includes one or more of such systems, devices or components. The computer and / or IC may perform, control, or execute instructions relating to loop filtering using neural networks for image and video coding, such as those described herein. The computer and / or IC may compute any of a variety of parameters or values that relate to loop filtering using neural networks for image and video coding described herein. The image and video embodiments may be implemented in hardware, software, firmware and various combinations thereof.
[0110] Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention. For example, one or more processors in a display, an encoder, a set top box, a transcoder, or the like may implement methods related to loop filtering using neural networks for image and video coding as described above by executing software instructions in a program memory accessible to the processors. Embodiments of the invention may also be provided in the form of a program product. The program product may comprise any non-transitory and tangible medium which carries a set of computer-readable signals comprising instructions which, when executed by a data processor, cause the data processor to execute a method of the invention. Program products according to the invention may be in any of a wide variety of non-transitory and tangible forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, or the like. The computer-readable signals on the program product may optionally be compressed or encrypted.
[0111] Where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to a "means") should be interpreted as including as equivalents of that component any component which performs the function of the described component (e.g., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated example embodiments of the invention.

EQUIVALENTS, EXTENSIONS, ALTERNATIVES AND MISCELLANEOUS
[0112] Example embodiments that relate to loop filtering using neural networks for image and video coding are thus described. In the foregoing specification, embodiments of the present invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
[0113] Various aspects of the present disclosure may be appreciated from the following Enumerated Example Embodiments (EEEs):
[0114] EEE 1. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first separable 1xM CONV block with input C channels and output a first set of C21 channels; a second separable Mx1 CONV block with input the C channels and output a second set of C21 channels; and an adder block to add the first set of C21 channels and the second set of C21 channels to generate a third set of C21 channels.
[0115] EEE 2. The apparatus of EEE 1, wherein C = C21 = 16.
[0116] EEE 3. The apparatus of EEE 1 or 2, wherein M is 3, 5, or 7.
[0117] EEE 4. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first neural network (NN 1) with input a first set of C / 2 channels of feature maps generated from an input set of C channels and output a first set of output C / 2 channels (176); a subtractor block to subtract the first set of C / 2 channels of feature maps from a second set of C / 2 channels of feature maps generated from the input set of C channels to generate a residual set of channels of feature maps (172); a second neural network (NN 2) with input the residual set of channels of feature maps (172) and output a second set of output C / 2 channels (174); an adder block to add the first set of C / 2 channels of feature maps and the second set of output C / 2 channels (174) to generate a third set of output C / 2 channels (178); and a concatenator to concatenate the first set of output C / 2 channels (176) and the third set of output C / 2 channels (178), to generate an output set of C channels (180).
[0118] EEE 5. The apparatus of EEE 4, wherein C = 32.
[0119] EEE 6. The apparatus of EEE 4 or 5, wherein each one of the first neural network (NN 1) or the second neural network (NN 2) comprises: a first separable 1 x M, (C / 2) x R convolutional network generating a first set of R channels; followed by a second separable M x 1, R x (C / 2) convolutional network.
[0120] EEE 7. The apparatus of EEE 6, wherein M is 3, 5, or 7.
[0121] EEE 8. The apparatus of EEE 6 or 7, wherein R = C / 2.
[0122] EEE 9. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first set of L separable 1 x Mj CONV blocks operating in parallel, wherein j = 1, 2, ..., L, wherein the j-th 1 x Mj CONV block has input first C channels and outputs Rj channels; a second set of L separable Mj x 1 CONV blocks operating in parallel, wherein the j-th Mj x 1 CONV block has input the Rj channels and outputs Cj channels; and an adder network to add all the output Cj channels together to generate second C channels.
[0123] EEE 10. The apparatus of EEE 9, wherein Mj can take values 3, 5, 7, or 9.
[0124] EEE 11. The apparatus of EEE 10, wherein L = 2, M1 = 5, and M2 = 3.
[0125] EEE 12. The apparatus of EEE 10, wherein L = 3, M1 = 5, M2 = 7, and M3 = 3.
[0126] EEE 13. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first set of L separable 1 x Mj CONV blocks operating in parallel, wherein j = 1, 2, ..., L, wherein the j-th 1 x Mj CONV block has input Cj channels from first C channels and outputs Rj channels, wherein C1 + C2 + ... + CL = C; a second set of L separable Mj x 1 CONV blocks operating in parallel, wherein the j-th Mj x 1 CONV block has input the Rj channels and outputs Gj channels, where G1 + G2 + ... + GL = C; and a channel concatenator to concatenate the output Gj channels together and generate second C channels.
[0127] EEE 14. The apparatus of EEE 13, wherein L = 2, M1 = 5, and M2 = 3.
[0128] EEE 15. The apparatus of any one of EEEs 1-14, wherein the processing block further comprises a neural-network attention block.
[0129] EEE 16. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first separable CONV 1xM layer block, a second separable CONV Nx1 layer block, and a third separable CONV LxL layer block, all operating in parallel and each with C input channels and C21 output channels; followed by an adder adding the C21 output channels of the three separable CONV layer blocks by element-wise addition to generate second C21 output channels.
[0130] EEE 17. The apparatus of EEE 16, wherein the adder is followed by a CONV 1x1 layer block with input the second C21 output channels and Ci outputs.
[0131] EEE 18. The apparatus of EEE 16 or 17, wherein the three separable CONV layer blocks are preceded by a PReLU which is followed by a CONV 1x1 block with Ci input channels and C output channels, wherein the C output channels feed the three separable CONV layer blocks.
[0132] EEE 19. The apparatus of any one of EEEs 16 to 18, where M = N = 5 and L = 3.
[0133] EEE 20. The apparatus of any one of EEEs 16 to 19, wherein, during inference, kernels of each of the three separable CONV layer blocks are merged together into a single diamond-shaped kernel.
[0134] EEE 21. An apparatus for processing data using a neural network, the apparatus comprising: a front neural network (NN) with output a first set of C channels; a first neural network (NN 1) with input a first set of C / 2 channels of feature maps from the first set of C channels and output a first set of C / 2 channels; a subtractor block to subtract the first set of C / 2 channels of feature maps from a second set of C / 2 channels of feature maps from the first set of C channels to generate a residual set of channels of feature maps (172); a second neural network (NN 2) with input the residual set of channels of feature maps and output a second set of C / 2 channels (174); an adder block to add the first set of C / 2 channels of feature maps and the second set of C / 2 channels to generate a third set of C / 2 channels (178); a channel concatenator to concatenate the first set of C / 2 channels and the third set of C / 2 channels, to generate second C channels; and an end neural network with input the second C channels and output a second set of Ci channels.
[0135] EEE 22. The apparatus of EEE 21, wherein the front NN comprises an activation function followed by a first convolutional network, and the end NN comprises a second convolutional network.
[0136] EEE 23. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network, wherein the headblock comprises an adaptive loop filter (ALF) classification block with k input planes, followed by a 3x3 convolutional network, wherein the k input planes are computed based on reconstructed samples generated by a video decoder for a coded input bitstream before the reconstructed samples are input to any deblocking filters to generate a decoded video sequence.
[0137] EEE 24. The apparatus of EEE 23, wherein for k = 6, the k input planes comprise: four planes for gradient classification, one plane for variance classification, and one plane for band classification.
[0138] EEE 25. The apparatus of EEE 23, wherein for k = 4, the k input planes comprise: one plane for directionality, one plane for activity, one plane for variance classification, and one plane for band classification.
[0139] EEE 26. The apparatus of EEE 24, wherein activity is computed as a sum of a horizontal gradient and a vertical gradient, and directionality is computed based on the four planes for gradient classification.
[0140] EEE 27. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock (150) comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network to generate a headblock output; a fusion and transition block (440) following the headblock output and generating CY + Cuv outputs to be subsequently processed by a luma (Y) processing path of neural networks with CY inputs and a chroma (UV) path of neural networks with Cuv inputs, wherein the number of CY inputs is different than the number of Cuv inputs.
[0141] EEE 28. The apparatus of EEE 27, wherein the number of CY inputs is larger than the number of Cuv inputs.
[0142] EEE 29. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network, and generating a headblock output; a fusion and transition block (440) following the headblock output and generating CY + Cuv outputs to be subsequently processed by a luma (Y) processing path of neural networks with CY inputs and a chroma (UV) processing path of neural networks with Cuv inputs, wherein the chroma processing path of neural networks comprises: a first chroma 1x1 convolutional (CONV) Cuv x C’uvi block, with input Cuv channels and output C’uvi channels; followed by a first chroma backbone block (BBBblock uv) with C’uvi input and output channels; followed by a second chroma backbone block (BBBblock uv) with C’uvi input and output channels; followed by a concatenation block concatenating the C’uvi output channels from the second chroma backbone block and C”uvi output channels from a link 1 x 1 CONV CYI x C”uvi block (605), wherein Cuvi = C’uvi + C”uvi, wherein the CYI input channels to the link 1 x 1 CONV block are generated in the luma processing path of neural networks by: a first luma 1x1 convolutional (CONV) CY x CYI block, with input CY channels and output CYI channels; followed by two luma backbone blocks, each with CYI input and output channels; followed by two luma twinblocks or three luma twinblocks, each with CYI input and output channels, wherein each twinblock comprises two backbone blocks and an attention layer.
[0143] EEE 30. The apparatus of EEE 29, wherein, following the concatenation block, the chroma UV processing path is further split into a chroma U path and a chroma V path, wherein the chroma U and V paths are operating in parallel.
[0144] EEE 31. The apparatus of EEE 30, wherein in each NN block of the chroma U or V paths, the number of input and output channels is half of the channels in the corresponding block in the chroma UV path.
[0145] EEE 32. An apparatus for processing image or video content using neural-network components, the apparatus comprising: a first stage of neural network processing wherein luma and chroma information is processed together to generate a combination of luma and chroma output channels; followed by a luma neural-network branch comprising a first plurality of neural network blocks to process luma output channels from the first stage of neural networks, and, in parallel, a chroma neural-network branch comprising a second plurality of neural network blocks to process chroma output channels from the first stage of neural networks; and a cross-component link from a source location of the luma neural-network branch to a destination location in the chroma neural-network branch, wherein the cross-component link comprises: a 1x1 link convolutional (CONV) CYI x C”uvi block, with input CYI channels from the luma neural-network branch at the source location and output C”uvi channels; a concatenation block with: a first input of C’uvi channels from the chroma neural-network branch at the destination location; a second input of the output C”uvi channels from the 1x1 link CONV block; and a concatenated output of Cuvi = C’uvi + C”uvi channels to be input to the chroma neural-network branch following the destination location.
[0146] EEE 33. The apparatus of EEE 32, wherein the source location and the destination location are selected so that overall processing latency in the chroma neural-network branch after the destination location is approximately the same or does not exceed the overall processing latency in the luma neural-network branch after the source location.
[0147] EEE 34. An apparatus for in-loop filtering of image or video content using neural-network components, the apparatus comprising: a neural networks loop filter (NNLF) receiving an input for in-loop filtering and generating an NNLF output; a deblocking filter receiving the input for in-loop filtering and generating a deblocking-filter output; one or more auxiliary in-loop filters receiving as input the deblocking-filter output or the input for in-loop filtering and generating an auxiliary-filters output; and a blender to blend the auxiliary-filters output with the NNLF output and generate an in-loop filtering output.
[0148] EEE 35. The apparatus of EEE 34, wherein the one or more auxiliary in-loop filters comprise a sample adaptive offset (SAO) filter followed by an adaptive loop filter (ALF) filter.
[0149] EEE 36. The apparatus of EEE 34, wherein the one or more auxiliary in-loop filters comprise a Bilateral filter (BIF), a Cross-Component SAO (CC SAO) filter, and an ALF filter, wherein the BIF and CC SAO filters receive as input the deblocking-filter output and generate a BIF / SAO output; and the ALF filter receives as input at least the BIF / SAO output and generates the auxiliary-filters output.
[0150] EEE 37. The apparatus of EEE 34 or EEE 36, wherein the input to the NNLF comprises the deblocking-filter output instead of the input for in-loop filtering.
[0151] EEE 38. The apparatus of EEE 36, wherein the input to the NNLF comprises the BIF / SAO output instead of the input for in-loop filtering.
[0152] EEE 39. The apparatus of any one of EEEs 34-38, wherein generating the in-loop filter output (R_OUT) using the blender comprises computing
R_OUT = (R_NN − R_ALF) × w + R_ALF = w × R_NN + (1 − w) × R_ALF,
wherein R_NN denotes the NNLF output, R_ALF denotes the auxiliary-filters output, and w denotes a blending factor in [0, 1].
[0153] EEE 40. An apparatus for in-loop filtering of image or video content using neural-network components, the apparatus comprising: a neural networks loop filter (NNLF) receiving an input for in-loop filtering and generating an NNLF output; a deblocking filter receiving the input for in-loop filtering and generating a deblocking-filter output; one or more auxiliary in-loop filters receiving as input the deblocking-filter output and generating an auxiliary-filters output (920); a blender to combine the NNLF output with the auxiliary-filters output (920) to generate a blended NNLF output (930); and a set of ALF online filters to generate a final output based at least on the auxiliary-filters output and the blended NNLF output.
[0154] EEE 41. The apparatus of EEE 40, wherein generating the blended NNLF output comprises computing:
R_OUT = (R_NN − R_OUT2) × w + R_OUT2 = w × R_NN + (1 − w) × R_OUT2,
wherein R_NN denotes the NNLF output, R_OUT2 denotes the auxiliary-filters output, and w denotes a blending factor in [0, 1].
[0155] EEE 42. The apparatus of EEE 40 or EEE 41, further comprising a reclassifier (925), wherein the reclassifier receives as input the blended NNLF output and classifies its input samples into two or more distinct classification classes to be used as further input to the set of ALF online filters to determine clipping indices and / or filter coefficients.
[0156] EEE 43. The apparatus of EEE 34 or EEE 40, further comprising determining blending factors w_i according to a classification class (i) of the NNLF output at a k x k block level.
[0157] EEE 44. The apparatus of EEE 43, wherein generating the blended NNLF output for a k x k block classified with the classification class i comprises computing:
R_OUT = (R_NN − R_AF) × w_i + R_AF = w_i × R_NN + (1 − w_i) × R_AF,
wherein R_NN denotes the NNLF output and R_AF denotes the auxiliary-filters output.
[0158] EEE 45. The apparatus of EEE 39 or EEE 41, wherein the blending factor w is computed as a function of a quantization parameter (QP).
[0159] EEE 46. The apparatus of EEE 45, wherein the blending factor is computed as:
w = 0.75, if QP > a;
w = 0.5, if b < QP < a;
w = 0.25, if QP < b.
[0160] EEE 47. A method to transmit a coded bitstream from an encoder to a decoder, wherein the coded bitstream comprises high-level syntax, wherein generating the coded bitstream comprises: receiving a sequence of pictures to generate coded pictures; generating high-level syntax to assist a decoder decoding the coded pictures; and combining the coded pictures with the high-level syntax to generate the coded bitstream, wherein the high-level syntax comprises a flag indicating whether a neural net loop filter (NNLF) was used or not for generating the coded pictures, wherein the flag is part of one or more of a sequence parameter syntax set, a constraints information syntax set, a picture header syntax set, a slice header syntax set, or an adaptation parameter syntax set.
[0161] EEE 48. The method of EEE 47, wherein if the flag is part of the adaptation parameter syntax set, then the high-level syntax further comprises an NNLF ID indicating an NNLF model being used for generating the coded pictures.
[0162] EEE 49. The method of EEE 47, wherein high-level syntax further comprises information related to computational complexity of the NNLF.
[0163] EEE 50. The method of EEE 47, wherein the high-level syntax further comprises one or more parameters to indicate that NNLF and adaptive loop filtering (ALF) are used mutually exclusively.
[0164] EEE 51. The method of EEE 47, wherein the high-level syntax further comprises a syntax parameter (pps_nnlf_base_qp_delta) indicating a delta value the NNLF needs to subtract from an initial picture parameter set (PPS) QP value (pps_init_qp_minus26) to derive an NNLF base QP value (nnlf_qp_base), wherein
nnlf_qp_base = pps_init_qp_minus26 + 26 - pps_nnlf_base_qp_delta.
[0165] EEE 52. An apparatus for video loop filtering using neural-network components, the apparatus comprising: one or more video inputs; a headblock comprising a series of video-decoding related blocks and generating a headblock output based on the one or more video inputs; a transition block receiving the headblock output and generating using neural network blocks CY + Cuv outputs; a luma (Y) processing path of neural networks with CY inputs out of the CY + Cuv outputs of the transition block generating CY outputs of reconstructed luma channels; a chroma (UV) processing path of neural networks with Cuv inputs out of the CY + Cuv outputs of the transition block generating reconstructed chroma channels, wherein the chroma processing path of neural networks further comprises: a first set of neural network blocks processing the Cuv inputs and generating Cuv outputs; a chroma U path of neural network blocks receiving as input Cu outputs out of the Cuv outputs to generate Cu reconstructed output chroma channels; and a chroma V path of neural network blocks receiving as input Cv outputs out of the Cuv outputs to generate Cv reconstructed output chroma channels, wherein Cu+Cv = Cuv.
[0166] EEE 53. The apparatus of EEE 52, wherein Cu = Cv = Cuv / 2.
[0167] EEE 54. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock (150A) comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network to generate a headblock output, wherein inputs for two or more of the video-decoding related inputs are bit-packed to fit within a single bit-packed input codeword.
[0168] EEE 55. The apparatus of EEE 54, wherein inputs that are bit-packed comprise a QPslice input and a QPbase input.
[0169] EEE 56. The apparatus of EEE 55, wherein bit-packing comprises assigning QPbase to bits 0-6 of a bit-packed input codeword and assigning QPslice to bits 7-13 of the bit-packed input codeword.
[0170] EEE 57. The apparatus of EEE 54, wherein inputs that are bit-packed comprise BS-chroma V (2 bits), BS-chroma U (2 bits), BS-luma Y (2 bits), and IPB (3 bits).
[0171] EEE 58. The apparatus of EEE 57, wherein bit-packing comprises assigning the BS-chroma and BS-luma inputs to bits 3-8 of a bit-packed input codeword and assigning IPB to bits 0-2 of the bit-packed input codeword.
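The bit layouts recited in EEE 55 to EEE 58 can be illustrated with a minimal Python sketch. The positions of QPbase, QPslice, and IPB follow the text above. The order of the three 2-bit BS fields inside bits 3-8 is not stated, so the order used here (BS-luma Y lowest, then BS-chroma U, then BS-chroma V) is an assumption, and all function names are hypothetical.

```python
def pack_qp(qp_base, qp_slice):
    """Pack QPbase into bits 0-6 and QPslice into bits 7-13 of one codeword."""
    if not (0 <= qp_base < 128 and 0 <= qp_slice < 128):
        raise ValueError("each QP value must fit in 7 bits")
    return (qp_slice << 7) | qp_base


def unpack_qp(word):
    """Return (qp_base, qp_slice) from a codeword built by pack_qp."""
    return word & 0x7F, (word >> 7) & 0x7F


def pack_bs_ipb(ipb, bs_y, bs_u, bs_v):
    """Pack IPB into bits 0-2 and the three BS fields into bits 3-8.

    Assumed field order (not given in the text): BS-luma Y in bits 3-4,
    BS-chroma U in bits 5-6, BS-chroma V in bits 7-8.
    """
    if not 0 <= ipb < 8:
        raise ValueError("IPB must fit in 3 bits")
    for bs in (bs_y, bs_u, bs_v):
        if not 0 <= bs < 4:
            raise ValueError("each BS value must fit in 2 bits")
    return (bs_v << 7) | (bs_u << 5) | (bs_y << 3) | ipb


def unpack_bs_ipb(word):
    """Return (ipb, bs_y, bs_u, bs_v) from a codeword built by pack_bs_ipb."""
    return word & 0x7, (word >> 3) & 0x3, (word >> 5) & 0x3, (word >> 7) & 0x3


qp_word = pack_qp(qp_base=32, qp_slice=37)
bs_word = pack_bs_ipb(ipb=5, bs_y=1, bs_u=2, bs_v=3)
```

Under this layout the QP codeword occupies 14 bits and the BS/IPB codeword occupies 9 bits, so either fits in a single 16-bit input sample.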
Claims
What is claimed is:
1. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first separable 1xM CONV block with input C channels and output a first set of C21 channels; a second separable Mx1 CONV block with input the C channels and output a second set of C21 channels; and an adder block to add the first set of C21 channels and the second set of C21 channels to generate a third set of C21 channels.
2. The apparatus of claim 1, wherein C = C21 = 16.
3. The apparatus of claim 1 or 2, wherein M is 3, 5, or 7.
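The parallel structure of claims 1 to 3 can be sketched in plain Python as follows. This is an illustration only: each "separable" CONV block is modeled as an ordinary convolution with a one-dimensional kernel, zero padding, and no bias or activation, all of which are assumptions, and the function names are hypothetical.

```python
def conv_1d_kernel(x, weights, horizontal):
    """Zero-padded 'same' convolution with a 1-D kernel (cross-correlation form).

    x: nested lists shaped [C][H][W]; weights: [C_out][C][M] with M odd.
    horizontal=True applies a 1xM kernel, horizontal=False an Mx1 kernel.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    taps = len(weights[0][0])
    pad = taps // 2
    out = [[[0.0] * width for _ in range(height)] for _ in weights]
    for o, w_out in enumerate(weights):
        for c in range(c_in):
            for k in range(taps):
                dy, dx = (0, k - pad) if horizontal else (k - pad, 0)
                for i in range(height):
                    for j in range(width):
                        ii, jj = i + dy, j + dx
                        if 0 <= ii < height and 0 <= jj < width:
                            out[o][i][j] += w_out[c][k] * x[c][ii][jj]
    return out


def parallel_block(x, w_row, w_col):
    """Run the 1xM and Mx1 branches on the same input and add the results."""
    branch_a = conv_1d_kernel(x, w_row, horizontal=True)
    branch_b = conv_1d_kernel(x, w_col, horizontal=False)
    return [
        [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(branch_a, branch_b)
    ]


# Toy input: C = 1 channel, 3x3 samples, M = 3, all-ones kernels.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
ones = [[[1, 1, 1]]]
y = parallel_block(x, ones, ones)
```

Both branches read the same C input channels, so their outputs have the same shape and can be added element by element, which is what the adder block in claim 1 requires.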
4. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first neural network (NN 1) with input a first set of C / 2 channels of feature maps generated from an input set of C channels and output a first set of output C / 2 channels (176); a subtractor block to subtract the first set of C / 2 channels of feature maps from a second set of C / 2 channels of feature maps generated from the input set of C channels to generate a residual set of channels of feature maps (172); a second neural network (NN 2) with input the residual set of channels of feature maps (172) and output a second set of output C / 2 channels (174); an adder block to add the first set of C / 2 channels of feature maps and the second set of output C / 2 channels (174) to generate a third set of output C / 2 channels (178); and a concatenator to concatenate the first set of output C / 2 channels (176) and the third set of output C / 2 channels (178) to generate an output set of C channels (180).
5. The apparatus of claim 4, wherein C = 32.
6. The apparatus of claim 4 or 5, wherein each one of the first neural network (NN 1) or the second neural network (NN 2) comprises: a first separable 1 x M, (C / 2) x R convolutional network generating a first set of R channels; followed by a second separable M x 1, R x (C / 2) convolutional network.
7. The apparatus of claim 6, wherein M is 3, 5, or 7.
8. The apparatus of claim 6 or 7, wherein R = C / 2.
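The data flow of the channel residual block in claims 4 to 8 can be traced with a small Python sketch. It assumes the two sets of C / 2 channels are obtained by splitting the C input channels in half, which the claim does not state explicitly. NN 1 and NN 2 are passed in as arbitrary callables; the stand-ins used below are hypothetical placeholders rather than the separable convolutional networks of claim 6.

```python
def channel_residual_block(x, nn1, nn2):
    """Trace the data flow of claim 4 on a list of C channels.

    x: list of C channels, each a flat list of samples.
    nn1, nn2: callables mapping C/2 channels to C/2 channels.
    The reference numerals in the variable names follow the claim.
    """
    half = len(x) // 2
    first, second = x[:half], x[half:]      # assumed split of the C inputs
    out_176 = nn1(first)                    # NN 1 output
    residual_172 = [                        # second half minus first half
        [s - f for f, s in zip(f_ch, s_ch)] for f_ch, s_ch in zip(first, second)
    ]
    out_174 = nn2(residual_172)             # NN 2 output
    out_178 = [                             # first half plus NN 2 output
        [f + g for f, g in zip(f_ch, g_ch)] for f_ch, g_ch in zip(first, out_174)
    ]
    return out_176 + out_178                # channel concatenation (180)


def double(channels):
    """Hypothetical stand-in for NN 1: scales every sample by 2."""
    return [[2 * v for v in ch] for ch in channels]


def identity(channels):
    """Hypothetical stand-in for NN 2: returns its input unchanged."""
    return [list(ch) for ch in channels]


x = [[1, 2], [3, 4], [10, 20], [30, 40]]    # C = 4 channels, 2 samples each
y = channel_residual_block(x, double, identity)
```

With an identity NN 2, the third set of channels (178) reduces to the second half of the input, which is a convenient sanity check of the subtract-then-add wiring.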
9. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first set of L separable 1 x Mj CONV blocks operating in parallel, wherein j = 1, 2, ..., L, wherein the j-th 1 x Mj CONV has input first C channels and outputs Rj channels; a second set of L separable Mj x 1 CONV blocks operating in parallel, wherein the j-th Mj x 1 CONV block has input the Rj channels and outputs Cj channels; and an adder network to add all the output Cj channels together to generate second C channels.
10. The apparatus of claim 9, wherein Mj can take values 3, 5, 7, or 9.
11. The apparatus of claim 10, wherein L = 2, M1 = 5, and M2 = 3.
12. The apparatus of claim 10, wherein L = 3, M1 = 5, M2 = 7, and M3 = 3.
13. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first set of L separable 1 x Mj CONV blocks operating in parallel, wherein j = 1, 2, ..., L, wherein the j-th 1 x Mj CONV has input Cj channels from first C channels and outputs Rj channels, wherein C1 + C2 + ... + CL = C; a second set of L separable Mj x 1 CONV blocks operating in parallel, wherein the j-th Mj x 1 CONV block has input the Rj channels and outputs Gj channels, where G1 + G2 + ... + GL = C; and a channel concatenator to concatenate the output Gj channels together and generate second C channels.
14. The apparatus of claim 13, wherein L = 2, M1 = 5, and M2 = 3.
15. The apparatus of any one of claims 1-14, wherein the processing block further comprises a neural-network attention block.
16. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a series of one or more processing blocks, wherein a processing block comprises: a first separable CONV 1xM layer block, a second separable CONV Nx1 layer block, and a third separable CONV LxL layer block, all operating in parallel and each with C input channels and C21 output channels; followed by an adder adding the C21 output channels of the three separable CONV layer blocks by element-wise addition to generate second C21 output channels.
17. The apparatus of claim 16, wherein the adder is followed by a CONV 1x1 layer block with input the second C21 output channels and Ci outputs.
18. The apparatus of claim 16 or 17, wherein the three separable CONV layer blocks are preceded by a PReLU which is followed by a CONV 1x1 block with Ci input channels and C output channels, wherein the C output channels feed the three separable CONV layer blocks.
19. The apparatus of any one of claims 16 to 18, where M = N = 5 and L = 3.
20. The apparatus of any one of claims 16 to 19, wherein, during inference, kernels of each of the three separable CONV layer blocks are merged together into a single diamond-shaped kernel.
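The kernel merging of claim 20 relies on the linearity of convolution: three parallel branches that are added element-wise give the same result as one convolution whose kernel is the sum of the three kernels, each zero-padded and centered. The Python sketch below checks this for a single input/output channel pair with M = N = 5 and L = 3 (claim 19), with no bias and no activation between the branches and the adder. Those simplifications, the function names, and the test values are assumptions made for illustration.

```python
def conv2d_same(img, kernel):
    """Zero-padded 'same' 2-D convolution (cross-correlation form) on nested lists."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[a][b] * img[ii][jj]
            out[i][j] = acc
    return out


def merge_kernels(k_row, k_col, k_square):
    """Embed a 1xM, an Nx1, and an LxL kernel (M, N, L odd) in one centered kernel."""
    m, n, l = len(k_row), len(k_col), len(k_square)
    size = max(m, n, l)
    c = size // 2
    merged = [[0.0] * size for _ in range(size)]
    for b, v in enumerate(k_row):           # horizontal taps on the center row
        merged[c][c - m // 2 + b] += v
    for a, v in enumerate(k_col):           # vertical taps on the center column
        merged[c - n // 2 + a][c] += v
    for a in range(l):                      # square taps around the center
        for b in range(l):
            merged[c - l // 2 + a][c - l // 2 + b] += k_square[a][b]
    return merged


k_row = [1, 2, 3, 4, 5]                     # 1x5 branch
k_col = [2, 1, 3, 1, 2]                     # 5x1 branch
k_sq = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]    # 3x3 branch
img = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

# Training-time form: three parallel branches followed by an adder.
branches = [
    conv2d_same(img, [k_row]),
    conv2d_same(img, [[v] for v in k_col]),
    conv2d_same(img, k_sq),
]
summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*branches)]

# Inference-time form: a single convolution with the merged kernel.
merged = merge_kernels(k_row, k_col, k_sq)
single = conv2d_same(img, merged)
```

For these sizes the merged 5x5 kernel has 13 non-zero taps arranged as a diamond (1, 3, 5, 3, 1 taps per row), compared with 25 taps for a full 5x5 kernel. In a multi-channel block the same merge would apply independently to each input/output channel pair.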
21. An apparatus for processing data using a neural network, the apparatus comprising: a front neural network (NN) with output a first set of C channels; a first neural network (NN 1) with input a first set of C / 2 channels of feature maps from the first set of C channels and output a first set of C / 2 channels; a subtractor block to subtract the first set of C / 2 channels of feature maps from a second set of C / 2 channels of feature maps from the first set of C channels to generate a residual set of channels of feature maps (172); a second neural network (NN 2) with input the residual set of channels of feature maps and output a second set of C / 2 channels (174); an adder block to add the first set of C / 2 channels of feature maps and the second set of C / 2 channels to generate a third set of C / 2 channels (178); a channel concatenator to concatenate the first set of C / 2 channels and the third set of C / 2 channels, to generate second C channels; and an end neural network with input the second C channels and output a second set of Ci channels.
22. The apparatus of claim 21, wherein the front NN comprises an activation function followed by a first convolutional network, and the end NN comprises a second convolutional network.
23. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network, wherein the headblock comprises an adaptive loop filter (ALF) classification block with k input planes, followed by a 3x3 convolutional network, wherein the k input planes are computed based on reconstructed samples generated by a video decoder for a coded input bitstream before the reconstructed samples are input to any deblocking filters to generate a decoded video sequence.
24. The apparatus of claim 23, wherein for k = 6, the k input planes comprise: four planes for gradient classification, one plane for variance classification, and one plane for band classification.
25. The apparatus of claim 23, wherein for k = 4, the k input planes comprise: one plane for directionality, one plane for activity, one plane for variance classification, and one plane for band classification.
26. The apparatus of claim 24, wherein activity is computed as a sum of a horizontal gradient and a vertical gradient, and directionality is computed based on the four planes for gradient classification.
27. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock (150) comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network to generate a headblock output; a fusion and transition block (440) following the headblock output and generating CY + Cuv outputs to be subsequently processed by a luma (Y) processing path of neural networks with CY inputs and a chroma (UV) path of neural networks with Cuv inputs, wherein the number of CY inputs is different than the number of Cuv inputs.
28. The apparatus of claim 27, wherein the number of CY inputs is larger than the number of Cuv inputs.
29. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network, and generating a headblock output; a fusion and transition block (440) following the headblock output and generating CY + Cuv outputs to be subsequently processed by a luma (Y) processing path of neural networks with CY inputs and a chroma (UV) processing path of neural networks with Cuv inputs, wherein the chroma processing path of neural networks comprises: a first chroma 1x1 convolutional (CONV) Cuv x C’uvi block, with input Cuv channels and output C’uvi channels; followed by a first chroma backbone block (BBBblock uv) with C’uvi input and output channels; followed by a second chroma backbone block (BBBblock uv) with C’uvi input and output channels; followed by a concatenation block concatenating the C’uvi output channels from the second chroma backbone block and C”uvi output channels from a link 1 x 1 CONV CYI x C”uvi block (605), wherein Cuvi = C’uvi + C”uvi, wherein the CYI input channels to the link 1 x 1 CONV block are generated in the luma processing path of neural networks by: a first luma 1x1 convolutional (CONV) CY x CYI block, with input CY channels and output CYI channels; followed by two luma backbone blocks, each with CYI input and output channels; followed by two luma twinblocks or three luma twinblocks, each with CYI input and output channels, wherein each twinblock comprises two backbone blocks and an attention layer.
30. The apparatus of claim 29, wherein, following the concatenation block, the chroma UV processing path is further split into a chroma U path and a chroma V path, wherein the chroma U and V paths are operating in parallel.
31. The apparatus of claim 30, wherein in each NN block of the chroma U or V paths, the number of input and output channels is half of the channels in the corresponding block in the chroma UV path.
32. An apparatus for processing image or video content using neural-network components, the apparatus comprising: a first stage of neural network processing wherein luma and chroma information is processed together to generate a combination of luma and chroma output channels; followed by a luma neural-network branch comprising a first plurality of neural network blocks to process luma output channels from the first stage of neural networks, and, in parallel, a chroma neural-network branch comprising a second plurality of neural network blocks to process chroma output channels from the first stage of neural networks; and a cross-component link from a source location of the luma neural-network branch to a destination location in the chroma neural-network branch, wherein the cross-component link comprises: a 1x1 link convolutional (CONV) CYI x C”uvi block, with input CYI channels from the luma neural-network branch at the source location and output C”uvi channels; a concatenation block with: a first input of C’uvi channels from the chroma neural-network branch at the destination location; a second input of the output C”uvi channels from the 1x1 link CONV block; and a concatenated output of Cuvi = C’uvi + C”uvi channels to be input to the chroma neural-network branch following the destination location.
33. The apparatus of claim 32, wherein the source location and the destination location are selected so that overall processing latency in the chroma neural-network branch after the destination location is approximately the same or does not exceed the overall processing latency in the luma neural-network branch after the source location.
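The cross-component link of claims 32 and 33 amounts to a per-sample linear projection of luma features (the 1x1 link CONV) followed by channel concatenation with the chroma features. A minimal Python sketch follows. It assumes the luma and chroma feature maps have the same spatial size at the source and destination locations, flattens each channel into a list of samples, and omits any bias. The function names and toy weights are hypothetical.

```python
def conv1x1(x, weight):
    """1x1 convolution: a per-sample linear map across channels.

    x: [C_in][N] channels of flattened samples; weight: [C_out][C_in].
    """
    n = len(x[0])
    return [
        [sum(w * ch[k] for w, ch in zip(row, x)) for k in range(n)]
        for row in weight
    ]


def cross_component_link(luma_feats, chroma_feats, link_weight):
    """Project luma features with the link CONV and append them to the chroma features."""
    linked = conv1x1(luma_feats, link_weight)   # second input of the concatenation
    return chroma_feats + linked                # first input (chroma) comes first


luma = [[1, 2], [3, 4]]       # two luma feature channels, 2 samples each
chroma = [[9, 9]]             # one chroma feature channel at the destination
link_w = [[1, 1]]             # link CONV producing one output channel
out = cross_component_link(luma, chroma, link_w)
```

The output channel count is the sum of the chroma channel count and the link CONV output channel count, matching the relation Cuvi = C’uvi + C”uvi recited in claim 32.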
34. An apparatus for in-loop filtering of image or video content using neural-network components, the apparatus comprising: a neural networks loop filter (NNLF) receiving an input for in-loop filtering and generating an NNLF output; a deblocking filter receiving the input for in-loop filtering and generating a deblocking-filter output; one or more auxiliary in-loop filters receiving as input the deblocking-filter output or the input for in-loop filtering and generating an auxiliary-filters output; and a blender to blend the auxiliary-filters output with the NNLF output and generate an in-loop filtering output.
35. The apparatus of claim 34, wherein the one or more auxiliary in-loop filters comprise a sample adaptive offset (SAO) filter followed by an adaptive loop filter (ALF) filter.
36. The apparatus of claim 34, wherein the one or more auxiliary in-loop filters comprise a Bilateral filter (BIF), a Cross-Component SAO (CC SAO) filter, and an ALF filter, wherein the BIF and CC SAO filters receive as input the deblocking-filter output and generate a BIF / SAO output; and the ALF filter receives as input at least the BIF / SAO output and generates the auxiliary-filters output.
37. The apparatus of claim 34 or claim 36, wherein the input to the NNLF comprises the deblocking-filter output instead of the input for in-loop filtering.
38. The apparatus of claim 36, wherein the input to the NNLF comprises the BIF / SAO output instead of the input for in-loop filtering.
39. The apparatus of any one of claims 34-38, wherein generating the in-loop filter output (R_OUT) using the blender comprises computing $R_{OUT} = (R_{NN} - R_{ALF}) \times w + R_{ALF} = w \times R_{NN} + (1 - w) \times R_{ALF}$, wherein $R_{NN}$ denotes the NNLF output, $R_{ALF}$ denotes the auxiliary-filters output, and w denotes a blending factor in [0, 1].
40. An apparatus for in-loop filtering of image or video content using neural-network components, the apparatus comprising: a neural networks loop filter (NNLF) receiving an input for in-loop filtering and generating an NNLF output; a deblocking filter receiving the input for in-loop filtering and generating a deblocking-filter output; one or more auxiliary in-loop filters receiving as input the deblocking-filter output and generating an auxiliary-filters output (920); a blender to combine the NNLF output with the auxiliary-filters output (920) to generate a blended NNLF output (930); and a set of ALF online filters to generate a final output based at least on the auxiliary-filters output and the blended NNLF output.
41. The apparatus of claim 40, wherein generating the blended NNLF output comprises computing: $R_{OUT} = (R_{NN} - R_{OUT2}) \times w + R_{OUT2} = w \times R_{NN} + (1 - w) \times R_{OUT2}$, wherein $R_{NN}$ denotes the NNLF output, $R_{OUT2}$ denotes the auxiliary-filter output, and w denotes a blending factor in [0, 1].
42. The apparatus of claim 40 or claim 41, further comprising a reclassifier (925), wherein the reclassifier receives as input the blended NNLF output and classifies its input samples into two or more distinct classification classes to be used as further input to the set of ALF online filters to determine clipping indices and / or filter coefficients.
43. The apparatus of claim 34 or claim 40, further comprising determining blending factors $w_i$ according to a classification class (i) of the NNLF output at a k x k block level.
44. The apparatus of claim 43, wherein generating the blended NNLF output for a k x k block classified with the classification class i comprises computing: $R_{OUT} = (R_{NN} - R_{AF}) \times w_i + R_{AF} = w_i \times R_{NN} + (1 - w_i) \times R_{AF}$, wherein $R_{NN}$ denotes the NNLF output and $R_{AF}$ denotes the auxiliary-filter output.
45. The apparatus of claim 39 or claim 41, wherein the blending factor w is computed as a function of a quantization parameter (QP).
46. The apparatus of claim 45, wherein the blending factor is computed as: $w = \begin{cases} 0.75, & \text{if } QP > a; \\ 0.5, & \text{if } b < QP < a; \\ 0.25, & \text{if } QP < b. \end{cases}$
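The blending of claims 39, 45, and 46 can be illustrated with a short Python sketch. The thresholds a and b are not given numeric values in the claims, so the values below are hypothetical. The claim text does not say which case applies when QP equals a or b; this sketch assigns those values to the middle case, which is an assumption.

```python
def blending_factor(qp, a, b):
    """Piecewise blending factor w as a function of QP (claim 46).

    QP equal to a or b is assigned to the middle case here; the claim
    text leaves the boundary cases unspecified.
    """
    if qp > a:
        return 0.75
    if qp < b:
        return 0.25
    return 0.5


def blend(r_nn, r_alf, w):
    """Sample-wise blend: w * R_NN + (1 - w) * R_ALF (claim 39)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("blending factor must lie in [0, 1]")
    return [w * n + (1 - w) * f for n, f in zip(r_nn, r_alf)]


A, B = 37, 27                       # hypothetical QP thresholds
r_nn = [100, 200]                   # toy NNLF output samples
r_alf = [60, 120]                   # toy auxiliary-filters output samples
w = blending_factor(qp=40, a=A, b=B)
r_out = blend(r_nn, r_alf, w)
```

Here w = 0.75, so each output sample lies three quarters of the way from the auxiliary-filters output toward the NNLF output. Setting w = 0 bypasses the NNLF and w = 1 bypasses the auxiliary filters.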
47. A method to transmit a coded bitstream from an encoder to a decoder, wherein the coded bitstream comprises high-level syntax, wherein generating the coded bitstream comprises: receiving a sequence of pictures to generate coded pictures; generating high-level syntax to assist a decoder decoding the coded pictures; and combining the coded pictures with the high-level syntax to generate the coded bitstream, wherein the high-level syntax comprises a flag indicating whether a neural net loop filter (NNLF) was used or not for generating the coded pictures, wherein the flag is part of one or more of a sequence parameter syntax set, a constraints information syntax set, a picture header syntax set, a slice header syntax set, or an adaptation parameter syntax set.
48. The method of claim 47, wherein if the flag is part of the adaptation parameter syntax set, then the high-level syntax further comprises an NNLF ID indicating an NNLF model being used for generating the coded pictures.
49. The method of claim 47, wherein the high-level syntax further comprises information related to computational complexity of the NNLF.
50. The method of claim 47, wherein the high-level syntax further comprises one or more parameters to indicate that NNLF and adaptive loop filtering (ALF) are used mutually exclusively.
51. The method of claim 47, wherein the high-level syntax further comprises a syntax parameter (pps_nnlf_base_qp_delta) indicating a delta value the NNLF needs to subtract from an initial picture parameter set (PPS) QP value (pps_init_qp_minus26) to derive an NNLF base QP value (nnlf_qp_base), wherein nnlf_qp_base = pps_init_qp_minus26 + 26 - pps_nnlf_base_qp_delta.
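The derivation in claim 51 is a single arithmetic step, shown here as a small Python sketch with toy values. The function name mirrors the derived variable; the input values are hypothetical, and no range checks from any particular codec specification are applied.

```python
def nnlf_qp_base(pps_init_qp_minus26, pps_nnlf_base_qp_delta):
    """Derive the NNLF base QP from the PPS initial QP and the signaled delta."""
    return pps_init_qp_minus26 + 26 - pps_nnlf_base_qp_delta


# Toy values: a PPS initial QP of 32 (signaled as 6) and a delta of 5.
base_qp = nnlf_qp_base(pps_init_qp_minus26=6, pps_nnlf_base_qp_delta=5)
```

With these values the PPS initial QP is 6 + 26 = 32 and the NNLF base QP is 32 - 5 = 27.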
52. An apparatus for video loop filtering using neural-network components, the apparatus comprising: one or more video inputs; a headblock comprising a series of video-decoding related blocks and generating a headblock output based on the one or more video inputs; a transition block receiving the headblock output and generating using neural network blocks CY + Cuv outputs; a luma (Y) processing path of neural networks with CY inputs out of the CY + Cuv outputs of the transition block generating CY outputs of reconstructed luma channels; a chroma (UV) processing path of neural networks with Cuv inputs out of the CY + Cuv outputs of the transition block generating reconstructed chroma channels, wherein the chroma processing path of neural networks further comprises: a first set of neural network blocks processing the Cuv inputs and generating Cuv outputs; a chroma U path of neural network blocks receiving as input Cu outputs out of the Cuv outputs to generate Cu reconstructed output chroma channels; and a chroma V path of neural network blocks receiving as input Cv outputs out of the Cuv outputs to generate Cv reconstructed output chroma channels, wherein Cu + Cv = Cuv.
53. The apparatus of claim 52, wherein Cu = Cv = Cuv / 2.
54. An apparatus for video loop filtering using neural-network components, the apparatus comprising: a headblock (150A) comprising a series of video-decoding related inputs, each of the video-decoding related inputs followed by a convolutional network to generate a headblock output, wherein inputs for two or more of the video-decoding related inputs are bit-packed to fit within a single bit-packed input codeword.
55. The apparatus of claim 54, wherein inputs that are bit-packed comprise a QPslice input and a QPbase input.
56. The apparatus of claim 55, wherein bit-packing comprises assigning QPbase to bits 0-6 of a bit-packed input codeword and assigning QPslice to bits 7-13 of the bit-packed input codeword.
57. The apparatus of claim 54, wherein inputs that are bit-packed comprise BS-chroma V (2 bits), BS-chroma U (2 bits), BS-luma Y (2 bits), and IPB (3 bits).
58. The apparatus of claim 57, wherein bit-packing comprises assigning the BS-chroma and BS-luma inputs to bits 3-8 of a bit-packed input codeword and assigning IPB to bits 0-2 of the bit-packed input codeword.