A neural network in-loop filtering method and apparatus for video coding

By introducing an end-to-end autoencoder loop filter module into video coding, and using the original image and side information to generate a loop-filtered bitstream, the problem of limited coding performance improvement in existing technologies is solved, and more efficient coding performance is achieved.

CN115914654B · Active · Publication Date: 2026-06-30 · XIDIAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIDIAN UNIV
Filing Date
2022-10-25
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing neural network loop filters only enhance the reconstructed image in video coding, failing to fully utilize the transmitted side information, resulting in limited improvement in coding performance.

Method used

An end-to-end autoencoder loop filter module is adopted, which uses the original image and side information as input, generates a loop filter bitstream through the autoencoder, and uses neural networks for processing at the encoding and decoding ends to improve encoding performance.

Benefits of technology

Reducing the bitrate while maintaining the same quality, or improving image quality while maintaining the same bitrate, is particularly useful for supplementing key information missing from reconstructed images at ultra-low bitrates.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a neural network loop filtering method and apparatus for video coding. The method includes: at the encoding end of a video codec, acquiring the original image of the current frame, the reconstructed image output by the previous module, and other side information; inputting the original image of the current frame, the reconstructed image output by the previous module, and other side information into an autoencoder-based loop filter module at the encoding end to obtain a first reconstructed image and a loop-filtered bitstream of the current frame; at the decoding end of the video codec, acquiring the loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information; inputting the loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information into an autoencoder-based loop filter module at the decoding end to obtain a second reconstructed image. This scheme can further improve coding performance, that is, reduce the bitrate at the same quality, or improve the quality at the same bitrate.

Description

Technical Field

[0001] This invention belongs to the field of video coding technology, and specifically relates to a neural network loop filtering method and apparatus for video coding.

Background Technology

[0002] Video encoding and decoding is a fundamental technology that is widely used across industries, and loop filtering plays a crucial role in it. The loop filter is typically located at the end of the encoding and decoding process for the current image. It removes blocking artifacts to improve visual quality, enhances the quality of the reconstructed image by utilizing internal information, and at the same time benefits the encoding of subsequent images. Depending on the technology used, current loop filtering methods fall into two categories. The first is traditional hand-designed loop filters, such as the Deblocking Filter (DBK), Sample Adaptive Offset (SAO), Adaptive Loop Filter (ALF), and the Luma Mapping with Chroma Scaling (LMCS) filter in the next-generation standard VVC. The second is neural network filters, which introduce neural networks, such as various forms of CNNs or other network structures, into loop filtering. Neural network filters can typically be combined with the hand-designed loop filters above or replace one or more of them. Related research has demonstrated that neural network loop filters can bring significant performance improvements.

[0003] Current neural network loop filters often improve the algorithm by modifying the input and network structure to enhance performance. For example, the input may include quantization parameters QP, prediction signals Pred, and other encoded side information, in addition to the reconstructed image of the current frame. The network structure may include typical residual network modules, attention mechanisms, multi-scale mechanisms, etc.

[0004] These methods can improve network performance, but the improvement is limited: the enhancement depends on the quality of the reconstructed image and cannot restore information that has been lost from the reconstructed image.

Summary of the Invention

[0005] The purpose of the embodiments in this specification is to provide a neural network loop filtering method and apparatus for video encoding.

[0006] To solve the above-mentioned technical problems, the embodiments of this application are implemented in the following ways:

[0007] In a first aspect, this application provides a neural network loop filtering method for video coding, the method comprising:

[0008] At the encoding end of the video codec, the original image of the current frame, the reconstructed image output by the previous module, and other side information are obtained;

[0009] The original image of the current frame, the reconstructed image output by the previous module, and other side information are input into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame.

[0010] At the decoding end of the video codec, the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information are obtained;

[0011] The loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image.

[0012] In one embodiment, the autoencoder-based loop filter module at the encoding end includes an encoder and a first decoder;

[0013] The original image of the current frame, the reconstructed image output from the previous module, and other side information are input into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop-filtered bitstream of the current frame, including:

[0014] The original image of the current frame, the reconstructed image output by the previous module, and other side information are input into the encoder to obtain the loop-filtered bitstream of the current frame;

[0015] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input to the first decoder to obtain the first reconstructed image.

[0016] In one embodiment, the original image of the current frame, the reconstructed image output by the previous module, and other side information are input to the encoder to obtain the loop-filtered bitstream of the current frame, including:

[0017] The original image of the current frame, the reconstructed image output by the previous module, and other side information are normalized to obtain normalized features;

[0018] The normalized features are input into the encoder to extract the feature vector.

[0019] The feature vector is quantized and entropy encoded to convert it into a loop-filtered bitstream of the current frame.

[0020] In one embodiment, the autoencoder-based loop filter module at the decoding end includes a second decoder;

[0021] The loop-filtered bitstream of the current frame, the reconstructed image output from the previous module, and other side information are input into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image, including:

[0022] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input into the second decoder to obtain the second reconstructed image.

[0023] In one embodiment, the encoder and / or decoder are composed of a neural network;

[0024] The decoder includes a first decoder and a second decoder.

[0025] In one embodiment, the neural network is any one of a convolutional neural network, a fully connected network, a recurrent neural network, or a reversible neural network.

[0026] In one embodiment, the parameters of the neural network are obtained by joint training of the encoder and decoder.

[0027] In one embodiment, obtaining the parameters of the neural network by joint training of the encoder and decoder includes:

[0028] The original image of the current training frame is pre-coded to obtain the pre-coded bitstream and pre-coded bitrate, other training side information, and the reconstructed image output by the previous module;

[0029] The original image of the current training frame, the other training side information, and the reconstructed image output by the previous module are input into the preset encoder to obtain the training bitrate and training bitstream;

[0030] The training bitstream, the other training side information, and the reconstructed image output by the previous module are input into a preset decoder to obtain the training reconstructed image;

[0031] The loss function is determined based on the original image of the current training frame, the training reconstructed image, the training bitrate, and the pre-coded bitrate;

[0032] When the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network.

[0033] In one embodiment, the other side information includes at least one or more of the following: block partitioning information, prediction mode, motion vector, reconstructed images of other frames, filter control parameters, and quantization parameters.

[0034] In a second aspect, this application provides a neural network loop filtering device for video coding, the device comprising:

[0035] The first acquisition module is used to acquire the original image of the current frame, the reconstructed image output by the previous module, and other side information at the encoding end of the video codec.

[0036] The encoding module is used to input the original image of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame.

[0037] The second acquisition module is used to acquire the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information at the decoding end of the video codec.

[0038] The decoding module is used to input the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image.

[0039] As can be seen from the technical solutions provided in the embodiments of this specification, this solution uses the original image as one of the inputs, and the output side information is generated automatically by the autoencoder-based loop filter module rather than designed manually. This method can further improve coding performance, that is, reduce the bit rate at the same quality, or improve the quality at the same bit rate.

Description of the Drawings

[0040] To more clearly illustrate the technical solutions in the embodiments or prior art of this specification, the drawings used in the description of the embodiments or prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is a schematic diagram of the existing hybrid coding framework;

[0042] Figure 2 is a schematic diagram of an existing VVC loop filter module;

[0043] Figure 3 is a schematic diagram of an existing loop filtering technique;

[0044] Figure 4 is a schematic diagram of the structure of the CNNLF in Figure 3;

[0045] Figure 5 is a flowchart of the neural network loop filtering method for video coding provided in this application;

[0046] Figure 6 is a schematic diagram of the loop filter module based on the autoencoder provided in this application, wherein Figure 6(a) is a schematic diagram of the encoder end and Figure 6(b) is a schematic diagram of the decoder end;

[0047] Figure 7 is a training block diagram for the neural network parameters provided in this application;

[0048] Figure 8 is a schematic diagram of the combination of the AELF with other filter modules provided in this application, wherein A to E represent the allowed insertion positions of the AELF;

[0049] Figure 9 is a schematic diagram of the AELF provided in this application replacing some or all of the traditional filters;

[0050] Figure 10 is a schematic diagram of a specific embodiment provided in this application;

[0051] Figure 11 is a schematic diagram of the structure of the attention module in Figure 10;

[0052] Figure 12 is a schematic diagram of the neural network loop filtering device for video coding provided in this application.

Detailed Implementation

[0053] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.

[0054] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0055] Various modifications and variations can be made to the specific embodiments described in this application without departing from the scope or spirit of this application, as will be apparent to those skilled in the art. Other embodiments derived from this application will be obvious to those skilled in the art. This application specification and embodiments are merely exemplary.

[0056] The terms “include,” “including,” “have,” “contain,” etc., used in this article are all open-ended terms, meaning that they include but are not limited to.

[0057] Unless otherwise specified, "parts" in this application refers to parts by weight.

[0058] Video encoding and decoding converts video into a binary bitstream, aiming to minimize the size of the output bitstream under a given distortion. Common video coding standards include H.264 / AVC, H.265 / HEVC, and H.266 / VVC, jointly developed by ITU-T and ISO; VP9, AV1, and AV2, released by Google; and AVS1, AVS2, and AVS3, developed in China. Most of these standards adopt the hybrid coding framework of H.264 / AVC shown in Figure 1, differing only slightly in the specific details.

[0059] Understandably, this application focuses on improving the loop filtering module in video encoding and decoding. The other modules in Figure 1 are necessary steps in this application but are not subject to special restrictions; that is, improvements to other modules can be integrated with this application.

[0060] The key steps in Figure 1 are introduced first. The input to video compression encoding is the video to be compressed. Before encoding, the image order is adjusted according to a certain encoding structure (such as the low-delay configuration, the random access configuration, or the all-intra coding mode). The low-delay configuration refers to an IPPPP encoding structure, where the first frame of the video is encoded as an I-frame, which allows only intra-frame coding modes and can be decoded using only the bitstream containing the current frame's information; subsequent frames are encoded as P-frames, which may use preceding frames as reference frames for motion estimation and compensation. The random access (RA) configuration is generally set to a hierarchical encoding structure and encodes in units of GOPs (Groups of Pictures). The starting frame of each GOP is encoded as an I-frame, and the frames in the middle of the GOP are encoded as B-frames (which may use both preceding and following frames as references).
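The frame-type assignment described above can be sketched as follows. This is a simplified illustration only: the function name, the configuration strings, and the default GOP size are hypothetical, and the hierarchical coding order used by real random access configurations is not modeled.

```python
def frame_types(num_frames: int, config: str, gop_size: int = 8) -> list[str]:
    """Assign a coarse frame type to each frame in display order.

    `config` and `gop_size` are illustrative parameters, not codec settings.
    """
    if num_frames <= 0:
        return []
    if config == "all_intra":
        # Every frame is intra coded.
        return ["I"] * num_frames
    if config == "low_delay":
        # IPPPP: only the first frame is intra coded.
        return ["I"] + ["P"] * (num_frames - 1)
    if config == "random_access":
        # Each GOP starts with an I-frame; the remaining frames are B-frames.
        return ["I" if i % gop_size == 0 else "B" for i in range(num_frames)]
    raise ValueError(f"unknown configuration: {config}")
```

For example, `frame_types(5, "random_access", gop_size=4)` marks frames 0 and 4 as I-frames and the frames between them as B-frames.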

[0061] The image to be encoded is generally divided into image blocks (defined as CTUs (coding tree units) in HEVC and VVC). These blocks may be further subdivided, and the optimal subdivision is selected by comparing the rate-distortion costs of different subdivisions. For each subdivision, the encoder uses the reconstructed image of the previously encoded image block and the temporally encoded reconstructed frames as references to perform intra-frame prediction or motion estimation and motion compensation to obtain the prediction block for the current subdivision. The difference between the original image block and the prediction image block yields a residual block, which undergoes a forward transform to obtain transform coefficients, followed by quantization to remove visual redundancy. To improve coding efficiency by utilizing the correlation between consecutive encoded blocks, the decoding process is reproduced at the encoder. That is, the quantized coefficients undergo inverse quantization and inverse transform to obtain the reconstruction residual, which is then added to the prediction image block to obtain the reconstructed image block. For intra-frame prediction mode, the reconstructed image block is directly used as the prediction reference; for inter-frame prediction mode, the reconstructed image after loop filtering is used as the reference. After all encoded blocks are reconstructed, they pass through a loop filtering module to remove block artifacts, reduce ringing artifacts, and improve the quality of the reconstructed image. The filtered image is used as the final output.
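The residual, transform, quantization, and reconstruction steps above can be illustrated with a minimal sketch for a single square block. It assumes an orthonormal DCT-II as the forward transform and a single uniform quantization step `qstep`; real codecs use integer transforms and more elaborate quantization, and the function names are hypothetical.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def code_block(orig: np.ndarray, pred: np.ndarray, qstep: float):
    """Return the quantized levels and the reconstructed block."""
    t = dct_matrix(orig.shape[0])
    residual = orig - pred                      # original minus prediction
    coeffs = t @ residual @ t.T                 # forward transform
    levels = np.round(coeffs / qstep)           # quantization (lossy step)
    recon_residual = t.T @ (levels * qstep) @ t # inverse quantization + inverse transform
    return levels, pred + recon_residual        # reconstructed block
```

Because the encoder repeats the decoder's inverse steps, both sides hold the same reconstructed block, which is what allows it to serve as a prediction reference.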

[0062] Common loop filters in existing technologies include DBK, SAO, ALF, and LMCS. Figure 2 shows the loop filter module of the latest-generation international standard VVC.

[0063] Figure 3 shows an existing loop filtering technique that proposes a neural network-based loop filtering module, CNNLF (Convolutional Neural Network Loop Filter). This technique uses CNNLF to replace DBK and SAO for I-frames (key frames) in video coding. For B-frames, a decision is made between DBK+SAO and CNNLF to select the better filter. The structure of CNNLF is shown in Figure 4. In addition to the basic reconstructed image input rec_yuv, this technique utilizes multiple pieces of side information, including the predicted image pred_yuv, the block partition par_yuv, the base quantization parameter BaseQP, and the slice-level quantization parameter Slice QP. In terms of network structure, it uses a relatively simple residual convolutional neural network. In fact, most current neural network-based loop filters differ primarily in their inputs and network structure.

[0064] Existing neural network-based loop filtering techniques differ in the following respects. 1) Different neural network inputs: besides the basic reconstructed image, the input may include the predicted image generated during encoding, block partitioning information, coding side information (such as QP and SAO filtering parameters), and reconstructed images from previous and subsequent frames. 2) Different neural network structures: Convolutional Neural Networks (CNNs) are a common basic structure, with possible differences in the number of layers, the number of channels, whether downsampling is performed, and whether an attention mechanism is included; other structures are also used for loop filtering. 3) Different numbers of models: some works use different models for the luminance component and the two chrominance components, while others use the same model to process all three components; in addition, some works use different neural networks for different quantization parameters (QP). 4) Different positional relationships with traditional filters: some works use neural network filters to replace some or all of the traditional filters, as in Figure 3 above, where CNNLF replaces the traditional DBK and SAO filters in I-frames. Some works perform a mode decision between neural network filters and traditional filters and select the better one, as in Figure 3 above, where the choice for B-frames is between CNNLF and DBK+SAO. Other works place neural network filters at different positions relative to the traditional filters.

[0065] These works all offer significant performance improvements over traditional filters, but the neural network is only used to enhance the reconstructed image and does not consider the gain effect of transmitting additional side information on the loop filter. Although deciding between traditional filters and neural network filters is better, it requires transmitting one bit of side information to indicate the type of filter used.
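The mode decision mentioned above can be sketched as a rate-distortion comparison. This is a minimal illustration, assuming mean squared error as the distortion and a cost of the form J = D + λ·R in which R is the signalling flag; the function name and the candidate labels are hypothetical and do not come from any particular codec.

```python
import numpy as np

def choose_filter(orig: np.ndarray, candidates: dict, lam: float, flag_bits: float = 1.0):
    """Pick the filtered image with the smallest cost J = D + lam * R.

    `candidates` maps a filter label to its filtered output. Every candidate
    pays the same signalling cost, so the flag only matters when comparing
    against a scheme that needs no signalling at all.
    """
    costs = {
        name: float(np.mean((orig - img) ** 2)) + lam * flag_bits
        for name, img in candidates.items()
    }
    best = min(costs, key=costs.get)
    return best, costs
```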

[0066] To address the aforementioned shortcomings, this application provides a neural network loop filtering method for video coding. This method employs an end-to-end loop filter that transmits side information, using the original image as one of the inputs, and the output side information is automatically generated by the neural network rather than manually designed. This method can further improve coding performance, i.e., reducing the bitrate while maintaining the same quality, or improving quality while maintaining the same bitrate.

[0067] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

[0068] Referring to Figure 5, a flowchart is shown of a neural network loop filtering method for video coding applicable to embodiments of this application.

[0069] As shown in Figure 5, a neural network loop filtering method for video coding may include:

[0070] S510: At the encoding end of the video codec, obtain the original image of the current frame, the reconstructed image output by the previous module, and other side information.

[0071] S520: Input the original image of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame.

[0072] S530: At the decoding end of the video codec, obtain the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information.

[0073] S540: Input the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image.

[0074] Specifically, both the encoding and decoding ends of the video codec include an autoencoder-based loop filter module (AELF). The structure at the encoding end is shown in Figure 6(a), where the AELF includes an encoder and a first decoder. The structure at the decoding end is shown in Figure 6(b), where the AELF includes a second decoder. This module can be combined with or replace parts of traditional loop filter modules such as DBK, SAO, LMCS, and ALF. Optionally, the encoder and / or decoder can be constructed from a neural network; understandably, the decoder can include both a first and a second decoder.

[0075] In one embodiment, S520 inputs the original image of the current frame, the reconstructed image output by the previous module, and other side information to the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop-filtered bitstream of the current frame, including:

[0076] The original image of the current frame, the reconstructed image output by the previous module, and other side information are input into the encoder to obtain the loop-filtered bitstream of the current frame;

[0077] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input to the first decoder to obtain the first reconstructed image.

[0078] The input to the autoencoder-based loop filter module at the encoding end consists of three parts: the original image of the current frame, denoted I; the reconstructed image I' output by the previous module; and other side information S. The reconstructed image output by the previous module is the main input to the AELF (autoencoder-based loop filter). The original image of the current frame plays the role of the image to be encoded: it is used to generate the bitstream that supplements information which cannot be recovered from the reconstructed image output by the previous module alone. Other side information may include block partitioning information, prediction modes, motion vectors, reconstructed images from other frames, SAO filter control parameters, quantization parameters QP, etc., from previous modules. Generally, this information can further improve the loop filter performance. This part of the input can also be omitted, which saves computation, memory, and cache size, but at the cost of reduced coding performance.

[0079] In one embodiment, the original image of the current frame, the reconstructed image output by the previous module, and other side information are input to the encoder to obtain the loop-filtered bitstream of the current frame, including:

[0080] The original image of the current frame, the reconstructed image output by the previous module, and other side information are normalized to obtain normalized features;

[0081] The normalized features are input into the encoder to extract the feature vector.

[0082] The feature vector is quantized and entropy encoded to convert it into a loop-filtered bitstream of the current frame.

[0083] Specifically, at the encoding end, the original image I of the current frame, the reconstructed image I' output by the previous module, and other side information S are normalized and then input into the neural network to extract feature vectors. These feature vectors are then quantized and entropy encoded to be converted into the loop filter bitstream (which can be simply referred to as the bitstream) of the current frame.
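The normalize, extract, quantize, and entropy-encode path can be sketched as below. Everything here is a stand-in under stated assumptions: the "encoder" is a fixed channel-mixing matrix rather than a trained neural network, QP is the only side information and is normalized by an assumed maximum of 63, and `zlib` is used in place of a real entropy coder. All function names and parameters are hypothetical.

```python
import zlib
import numpy as np

def normalize(orig, recon, qp, max_val=255.0, max_qp=63.0):
    """Scale both images to [0, 1] and broadcast QP to a constant plane."""
    qp_plane = np.full_like(orig, qp / max_qp, dtype=np.float64)
    return np.stack([orig / max_val, recon / max_val, qp_plane])  # (3, H, W)

def toy_encoder(x, weights):
    """Stand-in for the neural encoder: mixes input channels per pixel."""
    return np.tensordot(weights, x, axes=([1], [0]))  # (C_out, H, W)

def encode(orig, recon, qp, weights, scale=32.0):
    """Return the quantized symbols and a compressed byte string."""
    features = toy_encoder(normalize(orig, recon, qp), weights)
    # Uniform quantization to 8-bit integer symbols.
    symbols = np.clip(np.round(features * scale), -128, 127).astype(np.int8)
    # zlib stands in for the entropy coder (an assumption, not the method's coder).
    bitstream = zlib.compress(symbols.tobytes())
    return symbols, bitstream
```

With `weights = np.array([[1.0, -1.0, 0.0]])` the single feature channel is the normalized difference between the original and the reconstructed image, which is the kind of missing detail the bitstream is meant to carry.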

[0084] In one embodiment, S540 inputs the loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information to the autoencoder-based loop filter module at the decoding end to obtain a second reconstructed image, including:

[0085] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input into the second decoder to obtain the second reconstructed image.

[0086] Specifically, at the decoding end, the reconstructed image output by the previous module, the bitstream generated by the encoding end, and other side information are used as inputs to the decoding end, which is composed of a neural network, to synthesize the final reconstructed image.
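The relationship between the encoding end (encoder plus first decoder) and the decoding end (second decoder) can be illustrated with a toy model. This is not the neural network of this application: as an assumption for illustration, the "bitstream" simply carries a uniformly quantized difference between I and I', and both decoders share the same function, so the first and second reconstructed images coincide.

```python
import numpy as np

def aelf_encode(orig: np.ndarray, recon: np.ndarray, step: float) -> np.ndarray:
    """Toy encoder: quantize the detail that the reconstructed image is missing."""
    return np.round((orig - recon) / step).astype(np.int32)

def aelf_decode(symbols: np.ndarray, recon: np.ndarray, step: float) -> np.ndarray:
    """Toy decoder, shared by the first decoder and the second decoder."""
    return recon + symbols * step

rng = np.random.default_rng(0)
orig = rng.uniform(0, 255, size=(8, 8))          # original image I
recon = orig + rng.normal(0, 5, size=(8, 8))     # reconstructed image I'
symbols = aelf_encode(orig, recon, step=2.0)     # encoding end
first = aelf_decode(symbols, recon, step=2.0)    # first decoder (encoding end)
second = aelf_decode(symbols, recon, step=2.0)   # second decoder (decoding end)
```

Because the decoder only sees the symbols, I', and the side information, it can add back detail that was derived from I without ever having access to I itself.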

[0087] The neural network in this context is not subject to any special constraints; it can be a convolutional neural network, a fully connected network, a recurrent neural network, or a reversible neural network. Understandably, the weights of the neural network are pre-trained parameters.

[0088] In one embodiment, the parameters of the neural network are obtained by joint training of the encoder and decoder. It is understood that the decoder can be either a first decoder or a second decoder.

[0089] Specifically, obtaining the parameters of the neural network by joint training of the encoder and decoder includes:

[0090] The original image of the current training frame is pre-coded to obtain the pre-coded bitstream and pre-coded bitrate, other training side information, and the reconstructed image output by the previous module;

[0091] The original image of the current training frame, the other training side information, and the reconstructed image output by the previous module are input into the preset encoder to obtain the training bitrate and training bitstream;

[0092] The training bitstream, the other training side information, and the reconstructed image output by the previous module are input into a preset decoder to obtain the training reconstructed image;

[0093] The loss function is determined based on the original image of the current training frame, the training reconstructed image, the training bitrate, and the pre-coded bitrate;

[0094] When the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network.

[0095] In order to obtain pre-trained parameters, the encoder and decoder need to be trained together.

[0096] For ease of reading, the word "training" is omitted from the initial, intermediate, and output data used in neural network training. For example, training the original image of the current frame is abbreviated as the original image of the current frame, and training the reconstructed image is abbreviated as the reconstructed image.

[0097] Figure 7 provides a training block diagram for training the neural network parameters. Understandably, the encoder parameters trained according to Figure 7 can be applied to the encoder in Figure 6(a), and the decoder parameters trained according to Figure 7 can be applied to the first decoder in Figure 6(a) and the second decoder in Figure 6(b). The training process shown in the training block diagram of Figure 7 is as follows. The original image I of the current frame is first combined with other information (such as reconstructed images of frames before and after the current frame, encoding control parameters, etc.) and pre-coded (for example with a traditional encoding method such as H.264 / AVC, H.265 / HEVC, or H.266 / VVC) to obtain the pre-coded bitstream and the pre-coded bitrate, denoted R_t, as well as other side information S and the reconstructed image I′ output by the previous module. The basic training data unit is thus composed of <I, I′, S, R_t>. The training dataset is generated by encoding different images from different videos. During training, the original image I of the current frame, other side information S, and the reconstructed image I′ output by the previous module are used as inputs to the preset encoder to obtain the bitstream and the estimated bitrate R_n. The encoder includes quantization, and the output bitstream may contain both side information and data information, as in the Hyperprior model. This bitstream, the reconstructed image I′ output by the previous module, and other side information S are then used together as input to the preset decoder to obtain the final reconstructed image. The loss function L comprehensively considers the pre-coded bitrate R_t, the bitrate R_n consumed by the loop filtering module, and the distortion between the final reconstructed image and the original image I of the current frame, as follows:

[0098] L = D + λ·(R_t + R_n)

[0099] where D measures the distortion between the final reconstructed image and the original image I of the current frame, and can be an objective quality metric such as mean squared error (MSE) or a metric oriented toward subjective quality such as multi-scale structural similarity (MS-SSIM); R_t + R_n is the overall bitrate, which combines the pre-coded bitrate R_t and the bitrate R_n consumed by the loop filter module; and λ controls the trade-off between distortion and bitrate.
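The loss described above can be illustrated with a short numerical sketch. It assumes that the distortion D is the mean squared error and that images are given as flat lists of pixel values; the function names `mse` and `rd_loss` and all numeric values are hypothetical and are not taken from this application.

```python
from typing import Sequence


def mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def rd_loss(original, reconstructed, r_t, r_n, lam):
    """Rate-distortion loss L = D + lambda * (R_t + R_n), with D chosen as MSE here."""
    return mse(original, reconstructed) + lam * (r_t + r_n)


# Illustrative values only (assumptions, not data from this application).
original = [10.0, 20.0, 30.0, 40.0]        # original image I
reconstructed = [11.0, 19.0, 30.0, 42.0]   # final reconstructed image
loss = rd_loss(original, reconstructed, r_t=0.50, r_n=0.25, lam=2.0)
print(loss)  # D = 1.5 and lambda * (R_t + R_n) = 1.5
```

A larger λ penalizes the bits spent by the loop filter module more heavily, which pushes the trained network toward a lower bitrate at the cost of higher distortion.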

[0100] If the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network; otherwise, the iteration is repeated.
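The stopping rule in the preceding paragraph can be sketched as a generic loop. This is only an illustration under stated assumptions: `train` and `toy_step` are hypothetical names, and a one-parameter quadratic stands in for the neural network and its loss, which this application does not specify at this level of detail.

```python
def train(step_fn, params, loss_threshold, max_iterations):
    """Repeat step_fn until the loss meets the threshold or the iteration budget is used up."""
    loss = float("inf")
    iterations = 0
    while iterations < max_iterations:
        params, loss = step_fn(params)
        iterations += 1
        if loss <= loss_threshold:
            break  # preset condition on the loss value is met
    return params, loss, iterations


def toy_step(w, lr=0.25):
    """One gradient-descent step on the toy loss (w - 3)^2 (a stand-in for the real loss)."""
    grad = 2.0 * (w - 3.0)
    w = w - lr * grad
    return w, (w - 3.0) ** 2


w, final_loss, n = train(toy_step, params=0.0, loss_threshold=1e-6, max_iterations=100)
print(w, final_loss, n)
```

Either condition ends the loop: the loss falling to the threshold, or the iteration count reaching the preset maximum. The parameters returned at that point are the ones kept.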

[0101] It is understood that the autoencoder-based loop filter module (AELF) proposed in this application can be combined with traditional filter modules, or can replace some or all of the traditional filter modules. Figure 8 and Figure 9 show the overall connection relationship of the loop filter module. Figure 8 combines the AELF proposed in this application with other filter modules; in the figure, A to E represent the allowed insertion positions of the AELF. Figure 9 replaces some or all of the traditional filters in Figure 2 with the AELF proposed in this application. The schemes differ in the location of the AELF and in how it is combined with the traditional filters, so the AELF plays a different role in each scheme, which affects the complexity and performance gain of the neural network.

[0102] Figure 10 shows a specific embodiment of this application. The encoder and hyper encoder in Figure 10 correspond to the encoder in Figure 7; the quantization parameter QP and the pre-coded bitrate R_t correspond to the other side information in Figure 7; the decoder and hyper decoder correspond to the decoder in Figure 7; and the two parts of the bitstream in Figure 10 together constitute the bitstream in Figure 7. The encoder in Figure 10 generates the main bitstream. The hyper encoder and hyper decoder jointly estimate the mean μ and variance σ of the distribution (assumed to be Gaussian) of the quantized features produced by the attention module. These parameters determine the probability of each symbol during arithmetic coding and decoding. The latent variables generated in between therefore need to be quantized and encoded into the bitstream so that the decoder can obtain both parameters. The Gaussian distribution is used both to estimate the bitrate of the quantized features and to estimate the symbol probabilities in arithmetic coding. In Figure 10, Conv represents a convolutional layer, and in A×B×C / S, A, B, C, and S represent the number of convolutional channels, kernel width, kernel height, and stride, respectively. ↑ and ↓ represent upsampling and downsampling, respectively. ReLU is a non-linear activation layer, Q represents quantization (uniform quantization can be used), and AE and AD represent the arithmetic encoder and arithmetic decoder, respectively. ABS represents the absolute value. GDN and IGDN are non-linear activation layers commonly used in deep-learning-based end-to-end image coding, namely Generalized Divisive Normalization and Inverse Generalized Divisive Normalization. Concat represents a concatenation layer. The attention module is shown in Figure 11.
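The way a Gaussian model yields both a symbol probability and a bitrate estimate can be sketched as follows. The sketch assumes uniform quantization to integers, so that the probability of a symbol q is the Gaussian mass over the interval [q − 0.5, q + 0.5], and it treats σ as a standard deviation; both are assumptions made for illustration. The function names and numeric values are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2); sigma is a standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def symbol_probability(q, mu, sigma):
    """Probability mass of the integer symbol q, integrated over [q - 0.5, q + 0.5]."""
    p = gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)
    return max(p, 1e-9)  # floor avoids log(0) for very unlikely symbols


def estimated_bits(symbols, mus, sigmas):
    """Estimated code length in bits: the sum of -log2 p(q) over all symbols."""
    return sum(-math.log2(symbol_probability(q, m, s))
               for q, m, s in zip(symbols, mus, sigmas))


# Illustrative quantized features with per-symbol (mu, sigma) values (assumed).
symbols = [0, 1, -2]
mus = [0.0, 0.8, 0.0]
sigmas = [1.0, 0.5, 2.0]
print(estimated_bits(symbols, mus, sigmas))
```

The same probabilities would drive the arithmetic encoder and decoder, which is why both sides need identical μ and σ values.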

[0103] Current neural network loop filters define the function of the neural network as enhancing the encoded image without transmitting the bitstream. This application builds upon this by introducing the original image into the loop filter process and additionally encoding the feature information / side information learned by the neural network.

[0104] Furthermore, most current neural networks are only used to enhance reconstructed images, without considering the gain effect of transmitting additional side information on the loop filter. Although choosing between traditional filters and neural network filters requires transmitting one bit of side information to indicate the filter type used, this application proposes a novel end-to-end loop filter that transmits side information. This filter uses the original image as one of the inputs, and the output side information is automatically generated by the neural network rather than manually designed. This invention introduces minimizing the rate-distortion cost D + λ·R as the optimization objective for the neural network parameters, where D and R represent distortion and bit rate, respectively. This method can further improve coding performance, i.e., reducing the bit rate at the same quality, or improving quality at the same bit rate; especially in ultra-low bit rate scenarios, this method can supplement the key information missing from the reconstructed image.

[0105] Refer to Figure 12, which shows a schematic diagram of a neural network loop filter device for video coding according to one embodiment of this application.

[0106] As shown in Figure 12, the neural network loop filter device 1200 for video coding may include:

[0107] The first acquisition module 1210 is used to acquire the original image of the current frame, the reconstructed image output by the previous module, and other side information at the encoding end of the video codec.

[0108] The encoding module 1220 is used to input the original image of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame.

[0109] The second acquisition module 1230 is used to acquire the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information at the decoding end of the video codec.

[0110] The decoding module 1240 is used to input the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and other side information into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image.

[0111] Optionally, the autoencoder-based loop filter module at the encoding end includes an encoder and a first decoder; the encoding module 1220 is also used for:

[0112] The original image of the current frame, the reconstructed image output by the previous module, and other side information are input into the encoder to obtain the loop-filtered bitstream of the current frame;

[0113] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input to the first decoder to obtain the first reconstructed image.

[0114] Optionally, the encoding module 1220 is also used for:

[0115] The original image of the current frame, the reconstructed image output by the previous module, and other side information are normalized to obtain normalized features;

[0116] The normalized features are input into the encoder to extract the feature vector.

[0117] The feature vector is quantized and entropy encoded to convert it into a loop-filtered bitstream of the current frame.
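The normalization and quantization steps above can be sketched as follows. The sketch assumes 8-bit input samples scaled to [0, 1] and uniform quantization with a fixed step. The neural encoder and the entropy coder are omitted, and the function names are hypothetical.

```python
import math


def normalize(pixels, bit_depth=8):
    """Scale integer samples to [0, 1] (one possible normalization, assumed here)."""
    max_value = (1 << bit_depth) - 1
    return [p / max_value for p in pixels]


def quantize(features, step=1.0):
    """Uniform quantization: map each feature to the nearest integer multiple of step."""
    return [math.floor(f / step + 0.5) for f in features]


def dequantize(symbols, step=1.0):
    """Map integer symbols back to feature values, as a decoder would."""
    return [s * step for s in symbols]


normalized = normalize([0, 128, 255])
features = [0.26, -1.4, 2.5]           # stand-in for encoder output (assumed values)
symbols = quantize(features, step=0.5)
restored = dequantize(symbols, step=0.5)
print(normalized, symbols, restored)
```

The integer symbols are what an entropy coder would then convert into the loop-filtered bitstream; the quantization error per feature is at most half the step size.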

[0118] Optionally, the autoencoder-based loop filter module at the decoding end includes a second decoder; the decoding module 1240 is also used for:

[0119] The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input into the second decoder to obtain the second reconstructed image.
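The data flow at the two ends can be illustrated with a toy stand-in. In this sketch a quantized residual plays the role of the loop-filtered bitstream and a simple addition plays the role of the decoder; these are hypothetical substitutes for the neural networks, and the other side information is left out for brevity. The point is only that the first decoder and the second decoder, sharing the same logic and inputs, produce the same reconstruction, while only the encoding end sees the original image.

```python
import math

STEP = 4.0  # hypothetical quantization step for the toy residual code


def toy_encoder(original, recon_in):
    """Stand-in for the neural encoder: quantized residual symbols act as the bitstream."""
    return [math.floor((o - r) / STEP + 0.5) for o, r in zip(original, recon_in)]


def toy_decoder(bitstream, recon_in):
    """Stand-in for the first/second decoder: it never sees the original image."""
    return [r + s * STEP for s, r in zip(bitstream, recon_in)]


def encoding_end(original, recon_in):
    """Encoding end: the encoder produces the bitstream, the first decoder reconstructs."""
    bitstream = toy_encoder(original, recon_in)
    first_recon = toy_decoder(bitstream, recon_in)
    return first_recon, bitstream


def decoding_end(bitstream, recon_in):
    """Decoding end: only the second decoder runs."""
    return toy_decoder(bitstream, recon_in)


original = [100.0, 120.0, 90.0, 60.0]   # original image of the current frame (assumed)
recon_in = [95.0, 121.0, 99.0, 60.0]    # reconstructed image from the previous module
first_recon, bitstream = encoding_end(original, recon_in)
second_recon = decoding_end(bitstream, recon_in)
print(bitstream, first_recon, second_recon)
```

Keeping the two reconstructions identical matters in a codec because the filtered frame may serve as a reference for later frames, so the encoding end and decoding end must stay in sync.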

[0120] Optionally, the encoder and / or decoder may be composed of a neural network;

[0121] The decoder includes a first decoder and a second decoder.

[0122] Optionally, the neural network can be any one of the following: convolutional neural network, fully connected network, recurrent neural network, or invertible neural network.

[0123] Optionally, the parameters of the neural network are obtained by joint training of the encoder and decoder.

[0124] Optionally, the device may also include:

[0125] The parameter training module is used for:

[0126] The original image of the current frame is pre-coded to obtain the pre-coded bitstream and pre-coded bitrate, other training side information, and the reconstructed image output by the previous training module;

[0127] The original image of the current training frame, other training side information, and the reconstructed image output by the previous module are input into the preset encoder to obtain the training bitrate and training bitstream.

[0128] The training bitstream, other training side information, and the reconstructed image output by the previous training module are input into a preset decoder to obtain the training reconstructed image;

[0129] The loss function is determined based on the original image of the current training frame, the training reconstructed image, the training bitrate, and the pre-coded bitrate.

[0130] When the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network.

[0131] Optionally, the other side information may include one or more of the following: block partitioning information, prediction mode, motion vector, reconstructed images from other frames, filter control parameters, and quantization parameters.

[0132] This embodiment provides a neural network loop filtering device for video encoding, which can perform the above-described method. Its implementation principle and technical effect are similar, and will not be repeated here.

[0133] It should be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0134] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments can be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiments are basically similar to the method embodiments, so their description is relatively brief; for relevant parts, refer to the descriptions in the method embodiments.

Claims

1. A neural network loop filtering method for video coding, characterized in that, The method includes: At the encoding end of the video codec, the original image of the current frame, the reconstructed image output by the previous module, and other side information are obtained; The original image of the current frame, the reconstructed image output by the previous module, and the other side information are input to the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame. The autoencoder-based loop filter module at the encoding end includes an encoder and a first decoder. At the decoding end of the video codec, the loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are obtained; The loop filter bitstream of the current frame, the reconstructed image output by the previous module, and the other side information are input to the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image. The autoencoder-based loop filter module at the decoding end includes a second decoder. The encoder and / or decoder are composed of a neural network, and the decoder includes the first decoder and the second decoder; The parameters of the neural network are obtained by joint training of the encoder and the decoder, specifically including: The original image of the current training frame is pre-coded to obtain the pre-coded bitstream and pre-coded bitrate, the other training side information, and the reconstructed image output by the previous training module; The original image of the current training frame, the other training side information, and the reconstructed image output by the previous training module are input into a preset encoder to obtain the training bitrate and training bitstream. 
The training bitstream, the other training side information, and the reconstructed image output by the previous training module are input into a preset decoder to obtain the training reconstructed image; The loss function is determined based on the original image of the current training frame, the reconstructed training image, the training bitrate, and the preorder coding bitrate. When the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network.

2. The method according to claim 1, characterized in that, The step of inputting the original image of the current frame, the reconstructed image output by the previous module, and the other side information into the autoencoder-based loop filter module at the encoding end to obtain the first reconstructed image and the loop-filtered bitstream of the current frame includes: The original image of the current frame, the reconstructed image output by the previous module, and the other side information are input to the encoder to obtain the loop-filtered bitstream of the current frame; The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and other side information are input to the first decoder to obtain the first reconstructed image.

3. The method according to claim 2, characterized in that, The original image of the current frame, the reconstructed image output by the previous module, and the other side information are input to the encoder to obtain the loop-filtered bitstream of the current frame, including: The original image of the current frame, the reconstructed image output by the previous module, and the other side information are normalized to obtain normalized features; The normalized features are input into the encoder to extract the feature vector. The feature vector is quantized and entropy encoded to convert it into a loop-filtered bitstream of the current frame.

4. The method according to claim 2, characterized in that, The step of inputting the loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and the other side information into the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image includes: The loop-filtered bitstream of the current frame, the reconstructed image output by the previous module, and the other side information are input into the second decoder to obtain the second reconstructed image.

5. The method according to claim 1, characterized in that, The neural network can be any one of the following: convolutional neural network, fully connected network, recurrent neural network, or invertible neural network.

6. The method according to claim 1, characterized in that, The other side information includes at least one or more of the following: block partitioning information, prediction mode, motion vector, reconstructed images from other frames, filter control parameters, and quantization parameters.

7. A neural network loop filtering device for video coding, characterized in that, The device includes: The first acquisition module is used to acquire the original image of the current frame, the reconstructed image output by the previous module, and other side information at the encoding end of the video codec. The encoding module is used to input the original image of the current frame, the reconstructed image output by the previous module, and the other side information into the autoencoder-based loop filter module of the encoding end to obtain the first reconstructed image and the loop filter bitstream of the current frame. The autoencoder-based loop filter module of the encoding end includes an encoder and a first decoder. The second acquisition module is used to acquire, at the decoding end of the video codec, the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and the other side information; The decoding module is used to input the loop filter bitstream of the current frame, the reconstructed image output by the previous module, and the other side information to the autoencoder-based loop filter module at the decoding end to obtain the second reconstructed image. The autoencoder-based loop filter module at the decoding end includes a second decoder. 
The encoder and / or decoder are composed of a neural network, and the decoder includes the first decoder and the second decoder; The parameters of the neural network are obtained by joint training of the encoder and the decoder, specifically including: The original image of the current training frame is pre-coded to obtain the pre-coded bitstream and pre-coded bitrate, the other training side information, and the reconstructed image output by the previous training module; The original image of the current training frame, the other training side information, and the reconstructed image output by the previous training module are input into a preset encoder to obtain the training bitrate and training bitstream. The training bitstream, the other training side information, and the reconstructed image output by the previous training module are input into a preset decoder to obtain the training reconstructed image; The loss function is determined based on the original image of the current training frame, the reconstructed training image, the training bitrate, and the preorder coding bitrate. When the loss function value or the number of iterations meets the preset conditions, the corresponding parameters are used as the parameters of the neural network.