A method and system for inter-frame prediction enhancement based on convolutional neural networks

By using a convolutional neural network-based method to enhance the quality of reference blocks in video coding, the problem of insufficient reference block quality in inter-frame prediction is solved, resulting in more efficient video coding, reduced video transmission bitrate, and improved coding efficiency.

CN116208774B · Active · Publication Date: 2026-06-30 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2023-03-03
Publication Date: 2026-06-30


Abstract

This invention relates to an inter-frame prediction enhancement method and system based on convolutional neural networks. A neural network training set is created from the BVI-DVC video dataset. At the decoding end, three datasets are created, consisting respectively of 16×16, 32×32, and 64×64 luminance prediction blocks paired with their corresponding original current luminance blocks. These three datasets are used to train three neural networks with the same structure. The three networks perform quality enhancement on rectangular luminance reference blocks with minimum side lengths of 16, 32, and 64, respectively. Prediction blocks are then derived from the enhanced reference blocks. For each block, the effectiveness of the network enhancement determines whether the method is used, and a flag indicating whether the network is used is added to the bitstream. Experiments show that the method reduces the BD-rate of the luminance component by 1.10% compared with the standard scheme.

Description

Technical Field

[0001] This invention relates to an inter-frame prediction enhancement method and system that uses deep learning to enhance CU blocks in inter-frame prediction for video coding, and belongs to the field of image processing technology.

Background Technology

[0002] Video is a crucial form of information presentation. Compared to text, audio, and images, video has garnered more attention due to its realism, efficiency, and intuitiveness, becoming an indispensable part of people's learning, work, and daily life. Looking back at the history of video development, from the initial monotonous black-and-white videos to the later color videos with better visual effects, and now to high dynamic range (HDR), multi-view video (MVD), wide color gamut (WCG), and high frame rate (HFR) videos, people's demands for the sensory experience of video have gradually increased. On the one hand, with the rapid development of the internet industry, many video-related applications have emerged, allowing everyone to become a video creator, leading to a year-on-year increase in the proportion of video data in internet traffic. On the other hand, due to advancements in people's lifestyles and production methods, the application areas of video in human society are becoming increasingly widespread, including video-on-demand (VoD), live streaming, and ultra-low latency real-time communication. Therefore, in order to better meet the needs of video transmission and storage under limited bandwidth and storage space, further improving video encoding efficiency is particularly crucial.

[0003] Video coding refers to the technology of encoding video into a binary bitstream for storage and transmission. Depending on whether the original video can be recovered without loss from the bitstream, it is divided into lossless coding and lossy coding. Because lossless coding achieves low compression efficiency and cannot meet practical needs, most video coding technologies focus on lossy coding. To standardize the video encoding and decoding process and improve interoperability, international organizations began establishing international video coding standards in the 1980s. Currently, the international video coding standards organizations include the Video Coding Experts Group (VCEG) under the ITU Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) under the International Organization for Standardization / International Electrotechnical Commission (ISO / IEC). The latest international video coding standard is Versatile Video Coding (H.266 / VVC), jointly released by VCEG and MPEG in 2020. Compared with the previous-generation standard, High Efficiency Video Coding (H.265 / HEVC), H.266 / VVC reduces the bitrate by about 50% at the same subjective video quality.

[0004] The VVC / H.266 encoding and decoding process is shown in Figure 1. The original video signal is first divided into many image blocks, and each block is processed individually, starting with the prediction module. The prediction module mainly consists of intra-frame prediction and inter-frame prediction. Intra-frame prediction infers the content of the current block from the spatial correlation within the same frame. Inter-frame prediction predicts the content of the current block from the spatial and temporal correlation between consecutive frames. The difference between the predicted image (prediction block) and the original image (original block) is calculated to obtain the residual image (residual block). The residual block then undergoes transform, quantization, and entropy coding to further compress the video data, and the video bitstream file is output. The decoder reconstructs the video signal from the bitstream produced by the encoder. Most common video compression methods are lossy, sacrificing image quality to achieve a high compression ratio, so the video signal reconstructed by the decoder is degraded relative to the original video.

[0005] Inter-frame prediction primarily aims to eliminate temporal redundancy in video signals, since adjacent video frames often contain a significant amount of similar content. For example, as shown in Figure 2, when encoding the current block of the current frame, the encoder can find the image block (reference block) most similar to the current block in a reconstructed reference frame. Since the two image blocks are extremely similar, only the difference between the two blocks and the motion vector (MV) between them need to be transmitted. Compared with directly transmitting the current block, this greatly reduces the information that needs to be transmitted. The process of finding the reference block is called motion estimation (ME). However, due to the inherent spatial discretization of digital video, the block translation may not be exactly aligned with the pixel grid. In Figure 3, the dotted block represents an integer pixel position. The current block locates the reference block using the integer-pixel MV (Imv), but this is still not precise enough. The reference block is therefore interpolated first, and the image at a fractional pixel position is then selected to further improve prediction accuracy. The fractional-pixel image is called the prediction block, and the fractional-precision displacement is called Fmv. The process of selecting the optimal prediction block based on the reference block is called motion compensation (MC). To derive the fractional-pixel image from the integer-pixel image, VVC filters the integer-pixel image with discrete cosine transform-based interpolation filters (DCTIF) using different coefficients. The prediction block obtained after filtering may be closer to the original block than the reference block.
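As a rough illustration of the fractional-pixel interpolation described above, the following sketch applies the 8-tap half-sample luma filter used in HEVC and VVC, with coefficients (-1, 4, -11, 40, 40, -11, 4, -1) / 64, along one row of integer pixels. The function name, the border handling (repeating the edge pixel), and the 8-bit clipping are assumptions made for illustration; they are not taken from this patent or from the VTM source.

```python
# 8-tap half-sample luma interpolation filter (coefficients sum to 64).
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def interpolate_half_pel_row(row, max_value=255):
    """Return the half-sample values lying between row[i] and row[i + 1].

    Borders are handled by repeating the edge pixel (an assumption of this
    sketch); results are rounded and clipped to [0, max_value].
    """
    n = len(row)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, tap in enumerate(HALF_PEL_TAPS):
            # The taps cover integer positions i-3 .. i+4 around the half sample.
            j = min(max(i - 3 + k, 0), n - 1)
            acc += tap * row[j]
        value = (acc + 32) >> 6  # divide by 64 with rounding
        out.append(min(max(value, 0), max_value))
    return out


flat = interpolate_half_pel_row([100] * 8)
ramp = interpolate_half_pel_row([0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
```

On a constant row the filter returns the same constant, and on a linear ramp the interior half-sample values fall midway between their neighbours, which is the behaviour expected of an interpolation filter.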

[0006] As can be seen from the above explanation, the quality of the reference block directly affects the coding quality of the current block. The closer the image of the reference block is to the current original block, the smaller the residual block information that needs to be transmitted, and the lower the video bitrate.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention proposes an inter-frame prediction enhancement method based on convolutional neural networks.

[0008] This invention relates to a method for enhancing the image quality of the reference block in video encoding so that the reference block is closer to the original current block, and for adding a flag bit so that the encoder can select the optimal encoding method, thereby reducing the bitrate of the transmitted video.

[0009] Terminology Explanation:

[0010] 1. Current block: The encoder's term for the block it is currently encoding.

[0011] 2. Original Current Block: The image block located at the current block position in the original video signal.

[0012] 3. Reference block: The name of the image block that the encoder finds in the reconstructed frames that is most similar to the current block, based on the algorithm.

[0013] 4. Predicted block: A new image block obtained after motion compensation of the reference block.

[0014] 5. Residual block: The difference between the current block and the predicted block.

[0015] 6. Motion estimation: The process by which the encoder searches for the most similar reference block using the Tzsearch algorithm when encoding the current block.

[0016] 7. Motion Compensation: The process by which the encoder, after obtaining the reference block, performs interpolation filtering on the reference block to find the prediction block with the smallest residual to the current block.

[0017] 8. Discrete Cosine Transform Interpolation Filtering (DCTIF): This method filters a reference image using different coefficients through filter taps. This allows for image displacement in different directions and at different fractional pixel distances.

[0018] 9. Affine Motion Estimation: Conventional motion compensation models objects undergoing translational motion in videos. However, in the real world, various other motions exist besides translation, such as scaling, rotation, and other irregular motions. VTM5 provides a block-based affine transform motion compensation prediction method. As shown in Figure 4, the affine motion field of a block is described by the motion vectors of two control points (4 parameters) or three control points (6 parameters). First, the block is divided into 4×4 luma sub-blocks. For each luma sub-block, the motion vector of its center pixel is calculated from the control-point motion vectors using the following formulas, and then rounded to 1 / 16 precision.

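[0019] For a 4-parameter affine motion model, the motion vector (mv_x, mv_y) of the sub-block with center pixel (x, y) is calculated as follows:

[0020] mv_x = ((mv_{1x} - mv_{0x}) / W) · x - ((mv_{1y} - mv_{0y}) / W) · y + mv_{0x}
mv_y = ((mv_{1y} - mv_{0y}) / W) · x + ((mv_{1x} - mv_{0x}) / W) · y + mv_{0y}

[0021] For a 6-parameter affine motion model, the motion vector (mv_x, mv_y) of the sub-block with center pixel (x, y) is calculated as follows:

[0022] mv_x = ((mv_{1x} - mv_{0x}) / W) · x + ((mv_{2x} - mv_{0x}) / H) · y + mv_{0x}
mv_y = ((mv_{1y} - mv_{0y}) / W) · x + ((mv_{2y} - mv_{0y}) / H) · y + mv_{0y}

[0023] Here, (mv_{0x}, mv_{0y}), (mv_{1x}, mv_{1y}), and (mv_{2x}, mv_{2y}) are the motion vectors of the control points at the top-left, top-right, and bottom-left corners, respectively, and W and H are the width and height of the current block.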

[0024] After calculating the motion vector for each sub-block, motion compensation interpolation filtering is performed based on the motion vector to obtain the predicted value for each sub-block.

[0025] Motion compensation can effectively enhance the accuracy of inter-frame prediction and its ability to resist aliasing and compressed noise.

[0026] 10. Convolutional Neural Networks (CNNs): Convolutional Neural Networks (CNNs) are a type of feedforward neural network that includes convolutional computations and has a deep structure. They are one of the representative algorithms of deep learning. CNNs have representation learning capabilities and can classify input information in a translation-invariant manner according to their hierarchical structure; therefore, they are also known as "translation-invariant artificial neural networks".

[0027] Research on convolutional neural networks began in the 1980s and 1990s, with time-delay networks and LeNet-5 being the earliest convolutional neural networks. In the 21st century, with the introduction of deep learning theory and the improvement of numerical computing equipment, convolutional neural networks have developed rapidly and have been applied to fields such as computer vision and natural language processing.

[0028] The difference between a convolutional neural network (CNN) and a regular neural network lies in the fact that a CNN includes a feature extractor consisting of convolutional layers and subsampling layers (pooling layers). In the convolutional layers of a CNN, a neuron is connected only to a subset of its neighboring neurons. A convolutional layer in a CNN typically contains several feature maps, each composed of neurons arranged in a rectangular pattern. Neurons within the same feature map share weights, which are called the convolutional kernel. The convolutional kernel is usually initialized as a random fractional matrix, and during network training, it learns appropriate weights. The direct benefit of sharing weights (the convolutional kernel) is reducing the connections between network layers, while also lowering the risk of overfitting.

[0029] 11. Neural Network Attention Mechanism: The Convolutional Block Attention Module (CBAM) is an attention mechanism for deep learning-based image classification models, consisting of pooling-convolutional components and weighting components. The pooling-convolutional components extract spatial features, including max-pooling layers and convolutional layers, which can extract key information in the image, such as edges and contours. The weighting component identifies important features in the image; it weights the output using the features extracted by the pooling-convolutional components.

[0030] 12. RDcost (rate-distortion cost), used in rate-distortion optimization: an existing technique for deciding which image block performs better. It calculates the number of bits required to generate the luminance component of the prediction block under each of the two encoding methods and selects the method with the smaller value.
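The sub-block motion vector derivation described under term 9 can be sketched as follows, using the 4-parameter and 6-parameter affine models in their commonly published form. The function names are hypothetical, the sub-block centre is assumed to lie at an offset of half the sub-block size, and Python's built-in round() stands in for the codec's integer rounding, so this is an illustration rather than the VTM implementation.

```python
def affine_subblock_mv(x, y, w, h, mv0, mv1, mv2=None):
    """Motion vector at sub-block centre (x, y) of a w-by-h block.

    mv0, mv1, mv2 are (mv_x, mv_y) control-point vectors at the top-left,
    top-right and bottom-left corners. With mv2=None the 4-parameter model
    is used, otherwise the 6-parameter model.
    """
    a = (mv1[0] - mv0[0]) / w
    b = (mv1[1] - mv0[1]) / w
    if mv2 is None:
        mvx = a * x - b * y + mv0[0]
        mvy = b * x + a * y + mv0[1]
    else:
        c = (mv2[0] - mv0[0]) / h
        d = (mv2[1] - mv0[1]) / h
        mvx = a * x + c * y + mv0[0]
        mvy = b * x + d * y + mv0[1]
    # Round to 1/16-sample precision, as stated in the text.
    return (round(mvx * 16) / 16, round(mvy * 16) / 16)


def subblock_centres(w, h, size=4):
    """Centre coordinates of the size-by-size sub-blocks (assumed offset size/2)."""
    return [(x + size / 2, y + size / 2)
            for y in range(0, h, size)
            for x in range(0, w, size)]


translation = affine_subblock_mv(2, 2, 16, 16, (1.5, -2.0), (1.5, -2.0))
rotation = affine_subblock_mv(2, 6, 16, 16, (0, 0), (0, 16))
scaling = affine_subblock_mv(2, 6, 16, 8, (0, 0), (16, 0), (0, 8))
```

When all control points share the same vector, every sub-block receives that vector, so the affine model reduces to ordinary translational motion compensation.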

[0031] The technical solution of this invention is as follows:

[0032] An inter-frame prediction enhancement method based on convolutional neural networks is used to enhance the quality of the reference block luminance component in inter-frame prediction, and a flag is added to determine whether to use the network for enhancement. The method includes the following steps:

[0033] Find the block most similar to the current block as a reference block through motion estimation;

[0034] The luminance component of the reference block used for inter-frame prediction in the VTM code is extracted and input into the trained neural network model, which enhances its quality. Interpolation filtering is then applied to the enhanced luminance component to obtain the luminance component of the prediction block. This is compared with the luminance component of the prediction block generated by the VVC standard scheme, and rate-distortion optimization is used to select the prediction block with the better mode. The selected mode is signaled with a flag: flag=0 means that no neural network enhancement is used and the default method is employed; flag=1 means that the convolutional neural network-based inter-frame prediction enhancement method is used. Further preferably, finding the block most similar to the current block as the reference block through motion estimation includes: performing a Tzsearch search on the reconstructed video frames; the video block most similar to the current block, i.e., the one with the smallest RDcost, is the reference block.
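The per-block decision described above can be sketched as follows. The text only states that rate-distortion optimization selects the better mode; the Lagrangian cost J = D + λ·R with a sum-of-squared-differences distortion, and all function and parameter names, are assumptions of this sketch.

```python
def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized 2-D blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def choose_prediction(original, pred_default, pred_enhanced,
                      bits_default, bits_enhanced, lam):
    """Return (flag, prediction); flag=1 selects the CNN-enhanced path.

    The cost model J = D + lam * R is an assumed stand-in for the
    encoder's RDcost computation.
    """
    cost_default = ssd(original, pred_default) + lam * bits_default
    cost_enhanced = ssd(original, pred_enhanced) + lam * bits_enhanced
    if cost_enhanced < cost_default:
        return 1, pred_enhanced
    return 0, pred_default


original = [[10, 10], [10, 10]]
default_pred = [[8, 8], [8, 8]]      # distortion 16
enhanced_pred = [[10, 9], [10, 10]]  # distortion 1
flag_low_lambda, _ = choose_prediction(original, default_pred, enhanced_pred, 20, 21, 1)
flag_high_lambda, _ = choose_prediction(original, default_pred, enhanced_pred, 20, 21, 100)
```

In this toy case the enhanced prediction wins when distortion dominates the cost, while a large λ makes the one extra bit outweigh the distortion gain and the default path is kept.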

[0035] According to a preferred embodiment of the present invention, the neural network model includes 10 attention residual convolutional modules (ARCBs), convolutional layers at the beginning and end, and long connections.

[0036] The neural network model takes the luminance component of the reference block before motion compensation as input and passes it through a 3×3 convolutional layer and a ReLU (Rectified Linear Unit), as shown in Equation (I):

[0037] F1 = max(0, W0*x) (I)

[0038] In Equation (I), W0 represents the convolution kernel of the convolutional layer, x represents the input of the neural network model, "*" represents the convolution operation, and F1 represents the output of the convolutional layer.

[0039] Further preferably, each attention residual convolutional module includes two convolutional layers, a ReLU, a CBAM module, and a short connection, as shown in equations (II), (III), (IV), and (V):

[0040] F_{i,1} = max(0, W_{i,1} * F_i) (II)

[0041] F_{i,2} = W_{i,2} * F_{i,1} (III)

[0042] F_{i,3} = CBAM(F_{i,2}) (IV)

[0043] F_{i+1} = F_i + F_{i,3} (V)

[0044] In equations (II), (III), (IV), and (V), 1 ≤ i ≤ 10; F_i is the input to the i-th attention residual convolutional module; W_{i,1} is the convolution kernel of the first of the two convolutional layers in the i-th attention residual convolutional module; W_{i,2} is the convolution kernel of the second of the two convolutional layers; max() represents ReLU; F_{i,1} is the output of the first convolutional layer and ReLU in the i-th attention residual convolutional module; F_{i,2} is the output of the second convolutional layer in the i-th attention residual convolutional module, and is also the input of CBAM(); CBAM() represents the operation of the CBAM module; F_{i,3} is the output of the CBAM module; and F_{i+1} is the output of the i-th attention residual convolutional module.
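The data flow of equations (II) to (V) can be sketched as follows. To keep the example short, feature maps are flattened to plain lists of numbers, and the two convolutions and the CBAM module are passed in as callables; these stand-ins and all names are assumptions of the sketch, not the trained layers of the patent.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def arcb(f_i, conv1, conv2, cbam):
    """One attention residual convolutional module, equations (II)-(V).

    conv1, conv2 and cbam are hypothetical callables standing in for the
    two 3x3 convolutions and the CBAM module.
    """
    f_i1 = relu(conv1(f_i))                      # (II)
    f_i2 = conv2(f_i1)                           # (III)
    f_i3 = cbam(f_i2)                            # (IV)
    return [a + b for a, b in zip(f_i, f_i3)]    # (V): short connection


# Toy stand-ins chosen only so the arithmetic is easy to follow.
identity = lambda t: list(t)
double = lambda t: [2.0 * v for v in t]
halve = lambda t: [0.5 * v for v in t]

block_output = arcb([1.0, -2.0, 3.0], identity, double, halve)
```

The short connection adds the module input back to the attention-weighted features, so each module only has to learn a correction to its input.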

[0045] More preferably, the CBAM module is used to calculate the weights located in the channel dimension and the weights in the spatial pixels, and multiplies the calculated weights with the original input to achieve attention allocation for different data.

[0046] The CBAM module includes a channel attention module and a spatial attention module;

[0047] If a W×H image is input to the neural network model at a time, the input data dimension in the channel attention module is 1×64×W×H. First, average pooling and max pooling are performed over the spatial dimensions to compress the data dimension to 1×64×1×1. Then, the channel information is processed through two convolutional layers. Next, the sigmoid function restricts the data range to between 0 and 1. Finally, the calculated weights are multiplied by the input data, so that different channels receive different attention, as shown in equations (VI), (VII), and (VIII).

[0048]

[0049]

[0050] F_{i,c} = F_{i,2} * sigmoid(F_{i,c_avg} + F_{i,c_max}) (VIII)

[0051] In equations (VI), (VII), and (VIII), the two convolution kernels are the first and second convolutional kernels of the CBAM module in the i-th RCB module; maxpool() is max pooling and avgpool() is average pooling; F_{i,c_avg} and F_{i,c_max} are the data compressed by average pooling and max pooling, respectively, in the channel attention mechanism; sigmoid() represents the sigmoid activation function; and F_{i,c} is the output of the channel attention module in the i-th RCB module;
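A minimal sketch of the channel attention step is given below. It follows the usual CBAM layout: the two pooled vectors pass through the same pair of 1×1 convolutions (written here as matrices), and a ReLU is placed between them. The sharing of the kernels, the ReLU, the hidden width, and all names are assumptions, since the text does not spell out these details.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(weights, vec):
    """Apply a 1x1 convolution on C x 1 x 1 data, i.e. a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(feat, w1, w2):
    """feat: C channels, each an H x W list of rows.

    w1 (hidden x C) and w2 (C x hidden) are hypothetical 1x1 kernels
    shared by the average-pooled and max-pooled paths.
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]

    def two_layers(vec):
        hidden = [max(0.0, h) for h in matvec(w1, vec)]  # ReLU: assumption
        return matvec(w2, hidden)

    f_avg, f_max = two_layers(avg), two_layers(mx)
    weights = [sigmoid(a + m) for a, m in zip(f_avg, f_max)]  # as in (VIII)
    return [[[weights[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feat)]


feat = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [1.0, 2.0]]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
half_scaled = channel_attention(feat, eye, zeros)
```

With an all-zero second kernel every channel weight is sigmoid(0) = 0.5, so the output is simply the input scaled by one half, which makes the multiplication step easy to check by hand.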

[0052] In the spatial attention module, the data dimension is first compressed to 1×1×W×H by average pooling and max pooling on the channel dimension; then, the data compressed by max pooling and average pooling is concatenated on the channel to obtain 1×2×W×H data, which is then compressed to 1×1×W×H by a convolutional layer with a kernel size of 7×7; then, the data range is restricted to 0 to 1 by the sigmoid function; finally, the calculated spatial weights are multiplied by the input data to make the attention different for different spatial pixels, as shown in equations (IX), (X), and (XI):

[0053] F_{i,s_max} = maxpool(F_{i,c}) (IX)

[0054] F_{i,s_avg} = avgpool(F_{i,c}) (X)

[0055]

[0056] In equations (IX), (X), and (XI), F_{i,s_avg} and F_{i,s_max} are the data compressed by average pooling and max pooling, respectively, over the channel dimension in the spatial attention module; the convolution kernel in (XI) is the first convolutional kernel of the spatial attention module (the third convolutional kernel of the CBAM module); and cat() denotes channel concatenation;
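The spatial attention step can be sketched in the same style. The pooled maps are concatenated in the order (max, average), zero padding keeps the output at H×W, and the kernel is a 2×7×7 array; the concatenation order, the padding, and all names are assumptions of this sketch.

```python
import math


def spatial_attention(feat, kernel):
    """feat: C x H x W nested lists; kernel: 2 x K x K weights (K = 7 in the text).

    Kernel channel 0 is applied to the max-pooled map and channel 1 to the
    average-pooled map (assumed order).
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    max_map = [[max(feat[c][y][x] for c in range(channels)) for x in range(width)]
               for y in range(height)]
    avg_map = [[sum(feat[c][y][x] for c in range(channels)) / channels
                for x in range(width)] for y in range(height)]
    stacked = [max_map, avg_map]  # cat(): 2 x H x W
    size = len(kernel[0])
    pad = size // 2
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for m in range(2):
                for dy in range(size):
                    for dx in range(size):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:  # zero padding
                            acc += kernel[m][dy][dx] * stacked[m][yy][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid
        weights.append(row)
    # Every channel is multiplied by the same H x W weight map.
    return [[[feat[c][y][x] * weights[y][x] for x in range(width)]
             for y in range(height)] for c in range(channels)]


feat = [[[2.0] * 3 for _ in range(3)], [[4.0] * 3 for _ in range(3)]]
zero_kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
half_scaled = spatial_attention(feat, zero_kernel)
```

Unlike channel attention, the weight here varies per pixel and is shared across channels, which is what lets the module emphasise particular regions of the block.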

[0057] Finally, a 3×3 convolutional kernel with 1 channel is used to compress all feature maps into a residual image; the residual image is added to the input image to obtain the final output; as shown in equations (XII) and (XIII):

[0058] F12 = W12 * F11 (XII)

[0059] y = F12 + x (XIII)

[0060] In equations (XII) and (XIII), y is the output, F11 is the output of the 10th RCB module, W12 is the final convolution kernel, and F12 is the output of the final convolution kernel.
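Putting equations (I), (XII), and (XIII) together, the overall forward pass can be sketched as below. Feature maps are again flattened to lists, and the head convolution, the residual modules, and the tail convolution are hypothetical callables, so the example only shows the order of operations and the long connection.

```python
def enhance(x, head_conv, blocks, tail_conv):
    """Forward pass: head conv + ReLU, stacked modules, tail conv, long connection.

    head_conv, blocks and tail_conv are hypothetical callables; the patent
    uses ten attention residual convolutional modules as `blocks`.
    """
    f = [max(0.0, v) for v in head_conv(x)]       # (I): F1 = max(0, W0 * x)
    for block in blocks:
        f = block(f)                              # F_{i+1} from F_i
    residual = tail_conv(f)                       # (XII): F12 = W12 * F11
    return [a + r for a, r in zip(x, residual)]   # (XIII): y = F12 + x


# Toy stand-ins: each block adds 1, the tail scales by 0.1.
identity = lambda t: list(t)
add_one = lambda t: [v + 1.0 for v in t]
scale = lambda t: [0.1 * v for v in t]

y = enhance([1.0, -1.0], identity, [add_one] * 10, scale)
```

Because of the long connection, the network predicts a residual image that is added to the input block rather than regenerating the block from scratch.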

[0061] According to a preferred embodiment of the present invention, the training process of the neural network model is as follows:

[0062] Encode the training set videos;

[0063] Create the dataset needed to train the neural network model;

[0064] Divide the reference blocks into large, medium, and small blocks based on their size;

[0065] Create three different datasets for the large, medium, and small blocks to train three neural networks;

[0066] Train the neural network model: the luminance prediction block after sub-pixel precision displacement in the dataset is used as the input of the neural network model, and the corresponding uncompressed original current luminance block is used as the ground truth.

[0067] An inter-frame prediction enhancement system based on a convolutional neural network includes an encoder and a decoder. A trained neural network model is embedded into both the encoder and decoder. At the encoder, the current block finds a reference block based on motion estimation, enhances the luminance component of the reference block using the neural network model, replaces the original reference block with the output, and performs interpolation filtering on the enhanced reference block to obtain a prediction block. This prediction block is then compared with the luminance component of the prediction block generated in the VVC standard scheme, and the prediction block with the better mode is selected. A flag indicating the best mode is added to the bitstream. At the decoder, the current block finds a reference block in the reconstructed frame based on the decoded MV, and determines whether to use the neural network model to enhance the luminance component of the reference block using the flag in the bitstream.
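On the decoder side, the behaviour described above amounts to a single branch on the parsed flag, followed by the usual interpolation. The sketch below uses hypothetical callables for the trained network and the interpolation filter and is meant only to show the order of the two steps.

```python
def decode_luma_prediction(reference_block, flag, enhance, interpolate):
    """Build the luminance prediction block at the decoder.

    flag is the value parsed from the bitstream; enhance and interpolate are
    hypothetical callables for the trained network and the interpolation filter.
    """
    if flag == 1:
        # Enhancement is applied to the reference block before interpolation.
        reference_block = enhance(reference_block)
    return interpolate(reference_block)


# Toy stand-ins so the two paths give visibly different results.
add_one = lambda block: [v + 1 for v in block]
double = lambda block: [2 * v for v in block]

with_network = decode_luma_prediction([1, 2], 1, add_one, double)
without_network = decode_luma_prediction([1, 2], 0, add_one, double)
```

Since the decoder repeats exactly the choice signaled by the encoder, both sides derive the same prediction block without any additional side information beyond the flag.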

[0068] A computer device includes a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of the convolutional neural network-based inter-frame prediction enhancement method.

[0069] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the convolutional neural network-based inter-frame prediction enhancement method.

[0070] The beneficial effects of this invention are as follows:

[0071] On the VTM_11_NNVC_2.0 platform, after applying the quality enhancement network, the BD-rate of the luminance component is reduced by 1.10% compared with the standard method.

Description of the Attached Figures

[0072] Figure 1 is a schematic diagram of the VVC / H.266 encoding and decoding process;

[0073] Figure 2 is a schematic diagram of the reference frame and the current block in inter-frame prediction;

[0074] Figure 3 is a schematic diagram of motion compensation;

[0075] Figure 4 is a schematic diagram of affine motion compensation;

[0076] Figure 5 is a flowchart comparing the original standard method with the inter-frame prediction enhancement method based on convolutional neural networks of this invention;

[0077] Figure 6(a) is a schematic diagram of the working method of the encoder after embedding the neural network;

[0078] Figure 6(b) is a schematic diagram of the working method of the decoder after embedding the neural network;

[0079] Figure 7 is a diagram illustrating the creation of the dataset;

[0080] Figure 8 is a schematic diagram of the neural network architecture;

[0081] Figure 9(a) is a schematic diagram of conventional motion compensation;

[0082] Figure 9(b) is a schematic diagram of affine motion compensation;

[0083] Figure 10 is a diagram illustrating the mode selection for frame 8 of BQsquare.

Detailed Implementation

[0084] The present invention will be further defined below with reference to the accompanying drawings and embodiments, but is not limited thereto.

[0085] Example 1

[0086] An inter-frame prediction enhancement method based on convolutional neural networks, as shown in Figure 5, is used to enhance the quality of the luminance component of the reference block in inter-frame prediction and includes the following steps:

[0087] In the inter-frame prediction process of the VVC standard, motion estimation is used to find the block most similar to the current block as the reference block; this is existing technology in the VVC standard scheme. It includes performing a Tzsearch search on the reconstructed video frames; the video block most similar to the current block, i.e., the one with the smallest RDcost, is the reference block. Figure 9(a) is a schematic diagram of conventional motion compensation; Figure 9(b) is a schematic diagram of affine motion compensation.

[0088] The reference block luminance component in the VTM code, i.e., the reference block in inter-frame prediction, is extracted and input into the trained neural network model. The luminance component of the reference block is then enhanced in quality, and interpolation filtering is applied to the enhanced luminance component to obtain the luminance component of the prediction block. The obtained luminance component of the prediction block is compared with the luminance component of the prediction block generated in the VVC standard scheme. Rate-distortion optimization is used to select the prediction block with a better mode. The mode of the luminance component of the prediction block is then marked. When flag=0, no neural network model is used for enhancement, and the default method is used. When flag=1, a convolutional neural network-based inter-frame prediction enhancement method is used.

[0089] Example 2

[0090] This example differs from the inter-frame prediction enhancement method based on convolutional neural networks described in Example 1 as follows:

[0091] As shown in Figure 8, the neural network model includes 10 attention residual convolutional modules (ARCBs), convolutional layers at the beginning and end, and long connections.

[0092] The neural network model takes the luminance component of the reference block before motion compensation as input and passes it through a 3×3 convolutional layer and a ReLU (Rectified Linear Unit), as shown in Equation (I):

[0093] F1 = max(0, W0*x) (I)

[0094] In Equation (I), W0 represents the convolution kernel of the convolutional layer, x represents the input of the neural network model, "*" represents the convolution operation, and F1 represents the output of the convolutional layer.

[0095] The attention residual convolutional module consists of two convolutional layers, ReLU, CBAM modules, and short connections; the purpose of using short connections is to allow for deeper network design. As shown in equations (II), (III), (IV), and (V):

[0096] F_{i,1} = max(0, W_{i,1} * F_i) (II)

[0097] F_{i,2} = W_{i,2} * F_{i,1} (III)

[0098] F_{i,3} = CBAM(F_{i,2}) (IV)

[0099] F_{i+1} = F_i + F_{i,3} (V)

[0100] In equations (II), (III), (IV), and (V), 1 ≤ i ≤ 10; F_i is the input to the i-th attention residual convolutional module; W_{i,1} is the convolution kernel of the first of the two convolutional layers in the i-th attention residual convolutional module; W_{i,2} is the convolution kernel of the second of the two convolutional layers; max() represents ReLU; F_{i,1} is the output of the first convolutional layer and ReLU in the i-th attention residual convolutional module; F_{i,2} is the output of the second convolutional layer in the i-th attention residual convolutional module, and is also the input of CBAM(); CBAM() represents the operation of the CBAM module; F_{i,3} is the output of the CBAM module; and F_{i+1} is the output of the i-th attention residual convolutional module.

[0101] The CBAM module is used to calculate the weights located in the channel dimension and the weights in the spatial pixels. The calculated weights are multiplied by the original input to achieve attention allocation for different data.

[0102] The CBAM module includes a channel attention module and a spatial attention module;

[0103] If a W×H image is input to the neural network model at a time, the input data dimension in the channel attention module is 1×64×W×H. First, average pooling and max pooling are performed over the spatial dimensions to compress the data dimension to 1×64×1×1. Then, the channel information is processed through two convolutional layers. Next, the sigmoid function restricts the data range to between 0 and 1. Finally, the calculated weights are multiplied by the input data, so that different channels receive different attention, as shown in equations (VI), (VII), and (VIII).

[0104]

[0105]

[0106] F_{i,c} = F_{i,2} * sigmoid(F_{i,c_avg} + F_{i,c_max}) (VIII)

[0107] In equations (VI), (VII), and (VIII), the two convolution kernels are the first and second convolutional kernels of the CBAM module in the i-th RCB module; maxpool() is max pooling and avgpool() is average pooling; F_{i,c_avg} and F_{i,c_max} are the data compressed by average pooling and max pooling, respectively, in the channel attention mechanism; sigmoid() represents the sigmoid activation function; and F_{i,c} is the output of the channel attention module in the i-th RCB module;

[0108] In the spatial attention module, the data dimension is first compressed to 1×1×W×H by average pooling and max pooling on the channel dimension; then, the data compressed by max pooling and average pooling is concatenated on the channel to obtain 1×2×W×H data, which is then compressed to 1×1×W×H by a convolutional layer with a kernel size of 7×7; then, the data range is restricted to between 0 and 1 by the sigmoid function; finally, similar to the channel attention module, the calculated spatial weights are multiplied by the input data to make the attention different for different spatial pixels; as shown in equations (IX), (X), and (XI):

[0109] F_{i,s_max} = maxpool(F_{i,c}) (IX)

[0110] F_{i,s_avg} = avgpool(F_{i,c}) (X)

[0111]

[0112] In equations (IX), (X), and (XI), F_{i,s_avg} and F_{i,s_max} are the data compressed by average pooling and max pooling, respectively, over the channel dimension in the spatial attention module; the convolution kernel in (XI) is the first convolutional kernel of the spatial attention module (the third convolutional kernel of the CBAM module); and cat() denotes channel concatenation;

[0113] Finally, a 3×3 convolutional kernel with 1 channel is used to compress all feature maps into a residual image; the residual image is added to the input image to obtain the final output; as shown in equations (XII) and (XIII):

[0114] F12 = W12 * F11 (XII)

[0115] y = F12 + x (XIII)

[0116] In equations (XII) and (XIII), y is the output, F11 is the output of the 10th RCB module, W12 is the final convolution kernel, and F12 is the output of the final convolution kernel.

[0117] The convolutional kernels of the entire network have 64 channels, except for the last layer which has 1 channel.
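For a sense of scale, the number of convolution weights implied by this description can be estimated with the short calculation below. The channel widths and kernel sizes (3×3 convolutions, 64 channels, a 7×7 spatial attention kernel, 10 modules) come from the text; the channel attention reduction ratio of 16 and the omission of bias terms are assumptions, since the text does not state them.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of weights in a k x k convolution from c_in to c_out channels."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def network_params(num_blocks=10, channels=64, reduction=16, bias=False):
    """Rough weight count of the enhancement network.

    `reduction` (hidden width of the channel attention convolutions) is an
    assumed value, not one stated in the patent.
    """
    head = conv_params(1, channels, 3, bias)            # luminance in, 64 out
    tail = conv_params(channels, 1, 3, bias)            # 64 in, residual out
    body = 2 * conv_params(channels, channels, 3, bias)
    hidden = channels // reduction
    cbam = (conv_params(channels, hidden, 1, bias)      # channel attention, 1st kernel
            + conv_params(hidden, channels, 1, bias)    # channel attention, 2nd kernel
            + conv_params(2, 1, 7, bias))               # spatial attention, 7x7 kernel
    return head + num_blocks * (body + cbam) + tail


total = network_params()
```

Under these assumptions the sketch gives 744,532 weights, almost all of them in the twenty 64-channel 3×3 convolutions; the attention modules contribute well under 1% of the total.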

[0118] The results of the inter-frame prediction enhancement method based on convolutional neural networks of this invention on the CTC Class D test sequences are shown in Table 1. Figure 10 is a diagram illustrating the mode selection for frame 8 of BQsquare.

[0119] Table 1

[0120]

[0121] On the VTM_11_NNVC_2.0 platform, the quality enhancement network reduces the luminance-component BD-rate by 1.10% compared to the standard method.

[0122] Example 3

[0123] According to the inter-frame prediction enhancement method based on convolutional neural networks described in Example 2, the training process of the neural network model is as follows:

[0124] Encoding the training set videos: 650 videos are randomly selected from the BVI-DVC dataset (a dataset containing YUV video files covering a large number of scenes). Using the VTM-9.3 encoder with the encoding mode set to LDP (low-delay P) and the quantization parameter (QP) set to 22, the first 32 frames of each video are encoded to generate a bitstream file.

[0125] Creating the dataset needed to train the neural network model;

[0126] As shown in Figure 7, the inter-frame motion compensation process at the decoder is similar to that at the encoder. The bitstream file contains motion vector information for the current block with integer and fractional precision, so the integer-pixel position of the reference block can be derived from the integer-pixel position of the current block. A sub-pixel shift is then applied to the integer-pixel reference block using a sub-pixel interpolation filter to obtain the prediction block. Because reference blocks are rectangular and range in size from 4×4 to 64×64, this invention divides them into small, medium, and large blocks by size. Small blocks are rectangular blocks with a minimum side length of 16: 16×16, 16×32, 32×16, 16×64, and 64×16. Medium blocks are rectangular blocks with a minimum side length of 32: 32×32, 32×64, and 64×32. Large blocks are rectangular blocks with a minimum side length of 64, that is, 64×64. Rectangular blocks with a minimum side length of 4 or 8 contain little texture information and are very numerous, so enhancing them would significantly increase complexity; this invention therefore does not apply neural network quality enhancement to these smaller blocks. Three different datasets are created for the small, medium, and large blocks to train three neural networks. Because the video is partitioned into a very large number of rectangular blocks, only the luminance components of square prediction blocks of sizes 16×16, 32×32, and 64×64 are selected to create the three datasets. The neural networks trained on these three datasets are used to enhance the quality of small, medium, and large rectangular blocks, respectively.
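The block grouping described above can be summarized by the following minimal sketch, which maps a block's dimensions to the network that would enhance it. The function name `select_network` and the string labels are hypothetical names chosen for illustration.

```python
# Minimum side length -> block category, following the grouping in the text.
_CATEGORY_BY_MIN_SIDE = {16: "small", 32: "medium", 64: "large"}

def select_network(width, height):
    """Return "small", "medium", or "large" for a width x height luminance
    reference block, or None when the block is not enhanced (minimum side
    length of 4 or 8)."""
    return _CATEGORY_BY_MIN_SIDE.get(min(width, height))

# Example: a 16x64 block is handled by the network trained on 16x16 blocks.
category = select_network(16, 64)
```

Only the shorter side decides the category, which is why a single network trained on square 16×16 blocks can serve 16×32 and 16×64 blocks as well.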

[0127] Training the neural network model: the luminance prediction blocks in the dataset, obtained after the sub-pixel precision shift, are used as the input to the neural network model, and the original (lossless) luminance of the current block is used as the ground truth. The MSE loss function is used as the loss function, with an initial learning rate of 0.0001. Every 60 epochs the learning rate is reduced to 0.1 times its previous value, and each network is trained for 180 epochs. Prediction blocks of sizes 16×16, 32×32, and 64×64 correspond to three different neural networks, one per block size.
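The training hyperparameters above can be made concrete with a short sketch of the learning-rate schedule and the MSE loss. It assumes that epochs are counted from 0 and that the decay is a step decay applied to the current rate; the names `learning_rate`, `mse_loss`, and the constants are illustrative and do not come from the patent's training code.

```python
INITIAL_LR = 1e-4     # initial learning rate stated in the text
DECAY_EVERY = 60      # epochs between learning-rate reductions
DECAY_FACTOR = 0.1    # multiplier applied at each reduction
TOTAL_EPOCHS = 180    # epochs per network

def learning_rate(epoch):
    """Step schedule (assumed): 1e-4 for epochs 0-59, 1e-5 for 60-119, 1e-6 for 120-179."""
    return INITIAL_LR * DECAY_FACTOR ** (epoch // DECAY_EVERY)

def mse_loss(pred, target):
    """Mean squared error between two H x W blocks given as nested lists."""
    count = len(pred) * len(pred[0])
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / count

schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
```

Under these assumptions each network sees three learning-rate stages of 60 epochs each over its 180 training epochs.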

[0128] Example 4

[0129] An inter-frame prediction enhancement system based on a convolutional neural network includes an encoder and a decoder. A trained neural network model is embedded into both the encoder and the decoder. Figure 6(a) shows the flowchart of the encoder after embedding the neural network; Figure 6(b) shows the flowchart of the decoder after embedding the neural network. At the encoder, a reference block for the current block is found by motion estimation; the luminance component of the reference block is enhanced using the neural network model; the output replaces the original reference block; interpolation filtering is applied to the enhanced reference block to obtain a prediction block; and this prediction block is compared with the luminance component of the prediction block generated by the VVC standard scheme. The prediction block with the better mode is selected, and a flag indicating the selected mode is added to the bitstream. At the decoder, the reference block of the current block is located in the reconstructed frame using the decoded MV, and the flag in the bitstream determines whether the neural network model is used to enhance the reference block.
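The encoder-side mode decision and the decoder-side use of the flag can be sketched as follows. This is a simplified illustration, not the VTM integration: `enhance` and `interpolate` are hypothetical callables standing in for the neural network model and the interpolation filter, the sum of squared errors stands in for the rate-distortion cost used in the actual scheme, and ties are resolved in favor of the standard mode by assumption.

```python
def sse(block_a, block_b):
    """Sum of squared errors between two H x W blocks (stand-in for RD cost)."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def encoder_choose(current, reference, enhance, interpolate):
    """Return (flag, prediction): flag=1 if the enhanced path predicts better."""
    standard_pred = interpolate(reference)
    enhanced_pred = interpolate(enhance(reference))
    flag = 1 if sse(current, enhanced_pred) < sse(current, standard_pred) else 0
    return flag, (enhanced_pred if flag == 1 else standard_pred)

def decoder_predict(reference, flag, enhance, interpolate):
    """Rebuild the same prediction at the decoder from the signalled flag."""
    source = enhance(reference) if flag == 1 else reference
    return interpolate(source)

# Toy usage: an "enhancement" that adds 1 to every sample, no interpolation.
def add_one(block):
    return [[v + 1 for v in row] for row in block]

def identity(block):
    return block

current = [[11, 21]]
reference = [[10, 20]]
flag, prediction = encoder_choose(current, reference, add_one, identity)
decoded = decoder_predict(reference, flag, add_one, identity)
```

Because the decoder repeats the same enhancement and interpolation steps under control of the flag, encoder and decoder arrive at the same prediction block without transmitting the enhanced samples.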

[0130] Example 5

[0131] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement steps of an inter-frame prediction enhancement method based on a convolutional neural network.

[0132] Example 6

[0133] A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of an inter-frame prediction enhancement method based on a convolutional neural network.

Claims

1. An inter-frame prediction enhancement method based on convolutional neural networks, used for quality enhancement of the luminance component of a reference block in inter-frame prediction, characterized in that the steps include the following: finding the block most similar to the current block as a reference block through motion estimation; extracting the luminance component of the reference block in the VTM code, i.e., the reference block in inter-frame prediction, and inputting it into the trained neural network model; enhancing the quality of the luminance component of the reference block, then performing interpolation filtering to obtain the luminance component of the prediction block; comparing this prediction block luminance component with the luminance component generated in the VVC standard scheme, and using rate-distortion optimization to select the prediction block with the better mode; and labeling the mode of the prediction block luminance component: flag=0, no neural network model is used for enhancement; flag=1, the convolutional neural network-based inter-frame prediction enhancement method is used. The neural network model includes 10 attention residual convolutional modules (ARCBs), convolutional layers at the beginning and end, and long connections. The neural network model takes the luminance component of the reference block before motion compensation as input and passes it through a convolution layer with a 3×3 convolution kernel and a ReLU, as shown in formula (I); in formula (I), the terms denote the convolution kernel of the convolution layer and the input of the neural network model, "*" represents the convolution operation, and F1 represents the output of the convolution layer. The attention residual convolution module includes two convolution layers, a ReLU, a CBAM module, and a short connection, as shown in formulas (II), (III), (IV), and (V); in formulas (II), (III), (IV), and (V), 1 ≤ i ≤ 10, and the terms denote: the input to the i-th attention residual convolutional module; the convolution kernel of the first of the two convolutional layers of the i-th attention residual convolutional module; the convolution kernel of the second of the two convolutional layers of the i-th attention residual convolutional module; max(), which represents ReLU; the output of the first convolutional layer and ReLU in the i-th attention residual convolutional module; the output of the second convolutional layer in the i-th attention residual convolutional module, which is also the input of CBAM(); CBAM(), which represents the operations of the CBAM module; the output of the CBAM module; and the output of the i-th attention residual convolutional module.

2. The inter-frame prediction enhancement method based on convolutional neural networks according to claim 1, characterized in that the reference block is found by motion estimation, which includes performing a TZSearch on the reconstructed video frames; the video block most similar to the current block, i.e., the one with the smallest RD cost, is the reference block.

3. The inter-frame prediction enhancement method based on convolutional neural networks according to claim 1, characterized in that the CBAM module is used to calculate weights along the channel dimension and weights over the spatial pixels; the calculated weights are multiplied by the original input to allocate attention across different data. The CBAM module includes a channel attention module and a spatial attention module. If a W×H image is input to the entire neural network model at a time, the input data dimension in the channel attention module is 1×64×W×H. First, average pooling and max pooling are performed over the image dimensions to compress the data dimension to 1×64×1×1. Then, the channel information is processed through two convolutional layers. Finally, the sigmoid function is used to restrict the data range to between 0 and 1, and the calculated weights are multiplied by the input data, so that attention differs across channels, as shown in formulas (VI), (VII), and (VIII); in formulas (VI), (VII), and (VIII), the two kernel symbols denote the first and second convolutional kernels of the CBAM module in the i-th RCB module; maxpool() is max pooling and avgpool() is average pooling; $F_{i,c\_avg}$ and $F_{i,c\_max}$ are the data compressed by average pooling and max pooling in the channel attention mechanism; sigmoid() represents the calculation process of the sigmoid activation function; and $F_{i,c}$ is the output of the channel attention module in the i-th RCB module. In the spatial attention module, the data dimension is first compressed to 1×1×W×H by average pooling and max pooling on the channel dimension; then, the data compressed by max pooling and average pooling is concatenated on the channel to obtain 1×2×W×H data, which is then compressed to 1×1×W×H by a convolutional layer with a kernel size of 7×7; then, the data range is restricted to between 0 and 1 by the sigmoid function; finally, the calculated spatial weights are multiplied by the input data so that attention differs across spatial pixels, as shown in formulas (IX), (X), and (XI); in formulas (IX), (X), and (XI), $F_{i,s\_avg}$ and $F_{i,s\_max}$ are the data compressed by average pooling and max pooling along the channel dimension; the kernel in formula (XI) is the first convolutional kernel of the spatial attention module (the third convolutional kernel of the CBAM module); and cat() is the process of channel concatenation. Finally, a 3×3 convolutional kernel with 1 channel is used to compress all feature maps into a residual image; the residual image is added to the input image to obtain the final output, as shown in formulas (XII) and (XIII); in formulas (XII) and (XIII), y is the output, $F_{11}$ is the output of the 10th RCB module, $W_{12}$ is the final convolution kernel, and $F_{12}$ is the output of the final convolution kernel.

4. The inter-frame prediction enhancement method based on convolutional neural networks according to claim 1, characterized in that the training process of the neural network model is as follows: encoding the training set videos; creating the dataset needed to train the neural network model, wherein the reference blocks are divided into large, medium, and small blocks based on their size, and three different datasets are created for the large, medium, and small blocks to train three neural networks; and training the neural network model, wherein the luminance prediction blocks after sub-pixel precision displacement in the dataset are used as the input of the neural network model, and the original (lossless) luminance of the current block is used as the ground truth to train the neural network model.

5. An inter-frame prediction enhancement system based on a convolutional neural network, used to implement the inter-frame prediction enhancement method based on convolutional neural networks as described in any one of claims 1-4, characterized in that the system includes an encoder and a decoder, and the trained neural network model is embedded into the encoder and the decoder. At the encoder, a reference block for the current block is found by motion estimation; the luminance component of the reference block is enhanced using the neural network model; the output replaces the original reference block; interpolation filtering is applied to the enhanced reference block to obtain a prediction block; and this prediction block is compared with the luminance component of the prediction block generated in the VVC standard scheme. The prediction block with the better mode is selected, and the flag of the best mode is added to the bitstream. At the decoder, the reference block of the current block is located in the reconstructed frame using the decoded MV, and the flag in the bitstream determines whether the neural network model is used to enhance the quality of the reference block.

6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the inter-frame prediction enhancement method based on convolutional neural networks according to any one of claims 1-4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the inter-frame prediction enhancement method based on convolutional neural networks according to any one of claims 1-4.