A neural network-based method for enhancing the quality of compressed video.
The method addresses the compression noise introduced by video coding standards. By combining a spatiotemporal information pre-extraction network with a spatiotemporal information fusion network, the quality of compressed video is significantly improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF TECH
- Filing Date
- 2023-03-16
- Publication Date
- 2026-06-30
Abstract
Description
Technical Field
[0001] This invention relates to the field of video compression coding, and specifically to a method for enhancing the quality of compressed video in HEVC compressed video post-processing technology.
Background Technology
[0002] In recent years, video services have developed rapidly, with an ever-increasing demand for high resolution and high definition. Emerging video applications, such as 8K video, panoramic video, and virtual reality (VR) video, have brought significant challenges to video encoding and transmission. During video encoding, block-based video coding standards, such as H.264, HEVC, and VVC, produce compression noise because their compression is lossy, resulting in artifacts in the output video and reducing the quality of the viewing experience. Specifically, after compression by a video coding standard, the decoded video frames are of lower quality than the original frames. This is because the encoding process quantizes the transform coefficients to reduce the data volume, and the decoding stage applies inverse quantization. The information lost in quantization cannot be recovered, which creates a gap between the quality of the video processed by the video coding standard and the quality of the original video. In the early days, extensive research was conducted on removing compression artifacts or enhancing the quality of individual compressed images to improve video quality. These traditional methods generally enhance video frame quality by optimizing the transform coefficients of a specific compression standard, so they are difficult to extend to other compression schemes.
[0003] With the development of deep neural networks, deep learning-based methods for enhancing the quality of compressed videos have emerged. These methods achieve significant results by learning the nonlinear mapping between the original and compressed videos and regressing artifact-free video frames from a large amount of training data. Specifically, by designing a network model, the correlation between temporal and spatial information of video frames is utilized to compensate for information in the compressed video frames, thereby reducing or eliminating artifacts in the compressed video.
Summary of the Invention
[0004] The purpose of this invention is to address the problem of video quality degradation after compression by video compression standards by proposing a neural network-based method for enhancing the quality of compressed video.
[0005] To achieve the above objectives, the present invention adopts the following technical solution:
[0006] A neural network-based method for enhancing the quality of compressed video includes the following:
[0007] Step 1: Construct a compressed video dataset for training and testing
[0008] Multiple uncompressed videos with different resolutions and content were collected from the database for training. Eighteen uncompressed videos released by the Joint Collaborative Team on Video Coding (JCT-VC) were selected for video quality evaluation. All the above videos were compressed using the H.265 / HEVC reference software HM16.5, with compression performed at four different quantization parameters (QPs): 37, 32, 27, and 22.
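To give a rough feel for how far apart the four QPs are, the sketch below uses the commonly cited rule of thumb for H.264/HEVC that the quantization step size doubles every 6 QP units, Qstep ≈ 2^((QP − 4) / 6). This relation is not stated in the patent text and is included only as an illustrative assumption; the function name `approx_qstep` is hypothetical.

```python
# Assumed rule of thumb (not from the patent text): Qstep ~ 2 ** ((QP - 4) / 6).

QPS = (22, 27, 32, 37)  # the four QPs used to build the dataset


def approx_qstep(qp: int) -> float:
    """Approximate quantization step size for a given QP."""
    return 2.0 ** ((qp - 4) / 6.0)


for qp in QPS:
    # Larger QP -> coarser quantization -> stronger compression artifacts.
    print(f"QP={qp}: approximate Qstep = {approx_qstep(qp):.2f}")
```

Under this approximation, QP 37 quantizes several times more coarsely than QP 22, which is consistent with training a separate model per QP as described later in the text.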
[0009] Step 2: Construct a spatiotemporal information pre-extraction network module
[0010] This network module is constructed based on the "Encoder-Decoder" concept. Its key feature is that the network specifically includes a pre-extraction part, an encoding part, and a decoding part. The pre-extraction part is configured as a 3D convolutional layer with a 7×5×5 kernel. The network receives 7 consecutive video frames as input each time. The first dimension of the 3D convolutional kernel is set to 7, allowing it to extract information from consecutive video frames in the temporal dimension. The second and third dimensions of the 3D convolutional kernel are set to 5, allowing it to extract information from video frames in the spatial dimension. The convolutional kernel of the pre-extraction layer has a large receptive field, enabling it to capture a wide range of temporal and spatial information from the input video frames. The encoding part incorporates three downsampling operations. The spatiotemporal feature map extracted by the pre-extraction part is downsampled spatially. After each downsampling operation, the height and width of the feature map are reduced by half. The downsampling operation is used to extract low-level features from the input frames and increase the receptive field of the convolutional kernel. The downsampling operation is implemented as follows: Each downsampling module consists of two 3D convolutional layers with 3×3×3 kernels. The stride of the first convolution operation is set to (1,2,2), reducing the spatial size of the feature map while maintaining its temporal size. The stride of the second convolution operation is (1,1,1), used to extract the feature information obtained after downsampling in the first convolution operation. The decoding part incorporates three upsampling operations and two skip connections. Upsampling restores the abstract features to the size of the input video frame. 
Skip connections merge channels, connecting features with the same spatial resolution from the lower layers to the deeper layers, allowing local information from shallow features to reach the output. The specific implementation of upsampling and skip connections is as follows: set 3 transposed convolutional layers with 3×4×4 kernels, and set the stride of the convolution operation to (1,2,2). After the first and second transposed convolutions, the feature maps obtained after the third and second downsamplings are merged by skip connections. After each transposed convolutional layer, a 3D convolutional layer with 3×3×3 kernels is set to extract information from the merged feature maps.
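The halving and doubling of the feature map size can be checked with the standard output-size formulas for convolution and transposed convolution. The following is a minimal sketch: the kernel sizes and strides come from the text, while the padding values (1 for the 3×3×3 and 3×4×4 layers) are assumptions chosen so that the sizes work out, since the text does not state them. The helper names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def tconv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel


def encoder_sizes(size: int, stages: int = 3) -> list:
    """Spatial size after each stride-2, kernel-3 downsampling (assumed pad 1)."""
    sizes = [size]
    for _ in range(stages):
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, pad=1))
    return sizes


def decoder_sizes(size: int, stages: int = 3) -> list:
    """Spatial size after each stride-2, kernel-4 upsampling (assumed pad 1)."""
    sizes = [size]
    for _ in range(stages):
        sizes.append(tconv_out(sizes[-1], kernel=4, stride=2, pad=1))
    return sizes


print(encoder_sizes(128))  # [128, 64, 32, 16]
print(decoder_sizes(16))   # [16, 32, 64, 128]
```

With stride 1 and a temporal kernel of 3 (assumed padding 1), the same formulas leave the temporal length of 7 unchanged, which matches the statement that downsampling reduces only the spatial size.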
[0011] Step 3: Construct a spatiotemporal information fusion network module
[0012] The spatiotemporal information fusion network module fully mines the spatiotemporal information extracted by the spatiotemporal information pre-extraction network. Its key feature is that it includes five spatiotemporal decomposition and fusion modules. The key to these modules is: segmenting the feature maps of the input continuous video frames in the temporal domain, then extracting information from each frame's feature map individually in the spatial domain, and finally reassembling the segmented feature maps in the temporal domain. The spatiotemporal decomposition and fusion modules are configured sequentially as follows: one 3D convolutional layer with a 1×1×1 kernel, one normalization layer, three 2D convolutional layers with a 3×3 kernel, one channel attention module, one 3D convolutional layer with a 3×1×1 kernel, and one 3D convolutional layer with a 1×1×1 kernel. The channel attention module is configured with one adaptive pooling layer and one 2D convolutional layer with a 1×1 kernel. For each channel, the average of all its elements is adaptively calculated as the feature importance score for that channel. The score features are extracted through the 1×1 2D convolutional layer, and the obtained score features are multiplied with the elements of the input feature map to obtain a weighted feature map. The spatiotemporal decomposition and fusion module is implemented as follows: after the feature map passes through the first convolutional layer, the feature map is divided into 7 parts in the time dimension. Each feature map first passes through a normalization layer to stabilize the training process, then through three 3×3 2D convolutional layers, followed by a channel attention module to enhance information for each video frame individually in the spatial domain. Finally, the segmented seven feature maps are stitched together in the temporal dimension, and then fused in the spatial domain through a 3×1×1 3D convolutional layer. 
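The temporal split, the per-frame channel attention, and the restacking described above can be sketched in plain Python on nested lists. This is an illustration under assumptions, not the patented implementation: the normalization layer and the three 3×3 convolutions are omitted, the feature map layout `clip[c][t][h][w]` is assumed, the 1×1 convolution weights are placeholders supplied by the caller, and no extra squashing function is applied to the scores because the text does not mention one. All function names are hypothetical.

```python
def split_time(clip):
    """clip[c][t][h][w] -> list of T frames, each frame[c][h][w]."""
    n_t = len(clip[0])
    return [[channel[t] for channel in clip] for t in range(n_t)]


def stack_time(frames):
    """Inverse of split_time: frames[t][c][h][w] -> clip[c][t][h][w]."""
    n_c = len(frames[0])
    return [[frame[c] for frame in frames] for c in range(n_c)]


def channel_attention(frame, weight, bias):
    """Weight each channel of frame[c][h][w] by a score derived from its mean."""
    # Adaptive average pooling to 1x1: one mean value per channel.
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in frame]
    # A 1x1 convolution on a Cx1x1 tensor is a linear map across channels.
    scores = [sum(w * m for w, m in zip(w_row, means)) + b
              for w_row, b in zip(weight, bias)]
    # Multiply every element of a channel by that channel's score.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(frame, scores)]


# Toy example: 2 channels, 7 frames, 2x2 pixels; channel c is filled with c + 1.
clip = [[[[float(c + 1)] * 2 for _ in range(2)] for _ in range(7)] for c in range(2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder 1x1 conv weights
frames = split_time(clip)            # 7 per-frame feature maps
# The same parameters are applied to every frame (parameter sharing).
processed = [channel_attention(f, identity, [0.0, 0.0]) for f in frames]
merged = stack_time(processed)       # back to [c][t][h][w]
print(len(frames), len(merged), len(merged[0]))
```

Because the same `weight` and `bias` are reused for all seven frames, the per-frame stage adds no parameters as the number of frames grows.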
Finally, the enhanced information is output through a 1×1×1 3D convolutional layer. Each spatiotemporal decomposition and fusion module is connected end to end to accelerate the network convergence process.
Attached Figure Description
[0013] Figure 1 is a schematic diagram of the overall process of the neural network-based method for enhancing the quality of compressed video in this invention.
[0014] Figure 2 is a schematic diagram of the spatiotemporal information pre-extraction network module in this invention.
[0015] Figure 3 is a schematic diagram of the spatiotemporal information fusion network module in this invention.
Detailed Implementation
[0016] This invention primarily aims to enhance the quality of compressed videos. The specific methods employed in this invention will be described in detail below with reference to the accompanying drawings.
[0017] Specifically, the overall process of the neural network-based method for enhancing the quality of compressed video is shown in Figure 1. The process includes the following steps: S1: Construct a compressed video dataset for network training. S2: Construct a spatiotemporal information pre-extraction network. S3: Construct a spatiotemporal information fusion network. S4: Train the spatiotemporal information pre-extraction network and the spatiotemporal information fusion network end-to-end and test them.
[0018] (1) For S1: Construct a compressed video dataset for network training.
[0019] The dataset was collected from two databases, Xiph (Xiph.org) and VQEG, containing a total of 126 uncompressed videos with different resolutions and content for validation. All videos were compressed using the H.265 / HEVC reference software HM16.5 at four different quantization parameters (QPs): 37, 32, 27, and 22. The luminance channel (Y component) of the compressed video frames was converted to lmdb format for reading. The video frames were randomly flipped and rotated to augment the data. The compressed and corresponding uncompressed video frames were then cut into 128×128 sub-frames to form video frame pairs for training.
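Forming aligned training pairs can be sketched as follows. The example takes one compressed Y frame and its uncompressed counterpart as nested lists, cuts the same 128×128 window from both, and applies the same flip and rotation to both so that the pair stays aligned. The flip probability, the restriction to rotations by multiples of 90 degrees, and all function names are assumptions made for illustration; reading from lmdb is not shown.

```python
import random


def crop(frame, top, left, size):
    """Cut a size x size window out of frame[h][w]."""
    return [row[left:left + size] for row in frame[top:top + size]]


def hflip(frame):
    """Mirror the frame left to right."""
    return [row[::-1] for row in frame]


def rot90(frame):
    """Rotate the frame 90 degrees clockwise."""
    return [list(col) for col in zip(*frame[::-1])]


def random_pair(compressed, raw, size=128, rng=random):
    """Return an aligned (compressed, raw) patch pair with shared augmentation."""
    height, width = len(raw), len(raw[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    a = crop(compressed, top, left, size)
    b = crop(raw, top, left, size)
    if rng.random() < 0.5:            # assumed flip probability
        a, b = hflip(a), hflip(b)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        a, b = rot90(a), rot90(b)
    return a, b


# Toy usage with a synthetic 144x176 frame standing in for a Y plane.
frame = [[(y * 176 + x) % 256 for x in range(176)] for y in range(144)]
patch_c, patch_r = random_pair(frame, frame, rng=random.Random(0))
print(len(patch_c), len(patch_c[0]))
```

Applying identical transforms to both frames matters because the loss compares the enhanced patch with the uncompressed patch pixel by pixel.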
[0020] (2) For S2: Construct a spatiotemporal information pre-extraction network.
[0021] The spatiotemporal information pre-extraction network is shown in Figure 2. Seven consecutive 128×128 pixel video frames are input into a first 3D convolutional layer with a 7×5×5 kernel. The input channel size is 1, representing the Y component of the input video frame. After the first convolution, the output channels are expanded to 32 to learn richer features. The features then pass through three consecutive downsampling modules, each consisting of two 3D convolutional layers with 3×3×3 kernels. The stride of the first convolution operation is (1,2,2) for downsampling. After the three downsampling operations, the feature map sizes become 64×64, 32×32, and 16×16 respectively. The stride of the second convolution operation is (1,1,1), to extract the information obtained after the downsampling in the first convolution operation. Throughout the downsampling process, the number of input and output channels remains constant at 32. After three consecutive downsampling operations, upsampling begins. The three upsampling operations are performed by transposed convolution, with a 3×4×4 kernel and a stride of (1,2,2). After the first upsampling operation, the feature map size becomes 32×32. Then, the feature map of this layer is concatenated with the feature map after the third downsampling along the channel dimension, making the number of output channels (the input channels of the next layer) 64. The second upsampling operation first passes through a 3D convolutional layer with a 3×3×3 kernel and a stride of (1,1,1). The input channels are set to 64, and the output channels are set to 32. This convolutional layer extracts features from the concatenated feature map of the third downsampling and the first upsampling. Then, a transposed convolution is used to increase the feature map size to 64×64. The features from this layer are then concatenated with the feature map from the second downsampling along the channel dimension, making the number of output channels (the input channels of the next layer) 64. 
The third upsampling passes through a 3D convolutional layer with a 3×3×3 kernel and a transposed convolution to restore the video frame size to 128×128. Finally, a 3D convolutional layer with a 3×3×3 kernel outputs the intermediate features. Except for the last 3D convolutional layer, each convolutional layer is followed by a non-linear activation layer (LeakyReLU layer) to add non-linear features to the network.
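The layer sequence above can be traced as pure shape arithmetic on (channels, frames, height, width) tuples. This is a sketch under assumptions: the padding values are not given in the text and are chosen here so that the stated sizes are reproduced, and the 32 output channels of the final layer are inferred from the fusion network's 32-channel input. Note also that the text names the third and second downsampling as the skip sources, whereas the sketch pairs each upsampled map with the encoder output of the same spatial size (32×32, then 64×64) so that concatenation is shape-consistent; that pairing is an interpretation. All names are hypothetical.

```python
def conv(shape, out_ch, kernel, stride, pad):
    """Shape after a 3D convolution; shape is (C, T, H, W)."""
    dims = shape[1:]
    out = [(d + 2 * p - k) // s + 1 for d, k, s, p in zip(dims, kernel, stride, pad)]
    return (out_ch, *out)


def tconv(shape, out_ch, kernel, stride, pad):
    """Shape after a 3D transposed convolution."""
    dims = shape[1:]
    out = [(d - 1) * s - 2 * p + k for d, k, s, p in zip(dims, kernel, stride, pad)]
    return (out_ch, *out)


def concat(a, b):
    """Channel-wise concatenation; T, H, W must match."""
    assert a[1:] == b[1:], "skip connection needs matching T, H, W"
    return (a[0] + b[0], *a[1:])


K3, ONE, DOWN, P1 = (3, 3, 3), (1, 1, 1), (1, 2, 2), (1, 1, 1)


def trace(height=128, width=128):
    x = (1, 7, height, width)                    # 7 Y-channel frames
    x = conv(x, 32, (7, 5, 5), ONE, (3, 2, 2))   # pre-extraction layer
    skips = []
    for _ in range(3):                           # three downsampling modules
        x = conv(x, 32, K3, DOWN, P1)            # halves H and W
        x = conv(x, 32, K3, ONE, P1)
        skips.append(x)
    x = tconv(x, 32, (3, 4, 4), DOWN, P1)        # 16 -> 32
    x = concat(x, skips[1])                      # 64 channels at 32x32
    x = conv(x, 32, K3, ONE, P1)
    x = tconv(x, 32, (3, 4, 4), DOWN, P1)        # 32 -> 64
    x = concat(x, skips[0])                      # 64 channels at 64x64
    x = conv(x, 32, K3, ONE, P1)
    x = tconv(x, 32, (3, 4, 4), DOWN, P1)        # 64 -> 128
    return conv(x, 32, K3, ONE, P1)              # intermediate features


print(trace())  # (32, 7, 128, 128)
```

The trace confirms that, with these assumed paddings, the intermediate features keep all 7 frames at the input resolution, which is what the fusion network's 7-way temporal split requires.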
[0022] (3) For S3: Construct a spatiotemporal information fusion network.
[0023] The spatiotemporal information fusion network consists of 5 spatiotemporal decomposition and fusion modules, shown in Figure 3, connected in series. The role of the spatiotemporal information fusion network is to fuse the spatiotemporal information extracted by the spatiotemporal information pre-extraction network. The intermediate features output by the spatiotemporal information pre-extraction network are fed into the spatiotemporal decomposition and fusion module, where the number of input and output channels of the convolutional layers is always maintained at 32. The intermediate features first pass through a 3D convolutional layer with a 1×1×1 kernel, and then the feature map output by this layer is divided into 7 parts in the temporal dimension to obtain 7 frames of feature maps. Each frame of feature map is sequentially fed into a layer normalization layer, three 2D convolutional layers with a 3×3 kernel, and a channel attention module. The channel attention module consists of an adaptive pooling layer and a 2D convolutional layer with a 1×1 kernel. The parameters of the layer normalization layer, the 2D convolutional layers, and the channel attention module in the spatiotemporal decomposition and fusion module are shared, so the 7 segmented feature maps undergo the same feature processing in these parameter-shared layers. Finally, the seven segmented feature maps are stitched together temporally, then spatially fused using a 3D convolutional layer with a 3×1×1 kernel, followed by a 1×1×1 3D convolutional layer to output information. Each spatiotemporal decomposition and fusion module is connected end-to-end. The intermediate features undergo spatiotemporal information fusion through the five spatiotemporal decomposition and fusion modules and then, as shown in Figure 1, the output-channel features are merged by a 3D convolutional layer with a 3×1×1 kernel. 
This layer has 32 input channels and 1 output channel. By removing the output channel dimension, 7 feature tensors are obtained. These 7 feature tensors are concatenated along a new channel dimension, so that the number of channels becomes 7. Then, a 2D convolutional layer with a 1×1 kernel is used to merge the temporal channels. This layer has 7 input channels and 1 output channel. Finally, the output of this layer is added to the intermediate frame of the original 7 consecutive video frames to obtain the enhanced feature map. Except for the last 3D convolutional layer and the channel attention module, each convolutional layer is followed by a non-linear activation layer (LeakyReLU layer) to add non-linearity to the network.
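The output stage described above (channel merge, temporal merge, residual addition) can be sketched in plain Python. The sketch makes several assumptions that the text does not state: zero padding in time for the 3×1×1 convolution, no bias terms by default, caller-supplied placeholder weights, and reading "intermediate frame" as the middle (fourth) of the seven input frames. The function names and the `feat[c][t][h][w]` layout are hypothetical.

```python
def merge_channels(feat, w, bias=0.0):
    """3x1x1 convolution, C inputs -> 1 output; feat[c][t][h][w], w[c][3].

    Returns out[t][h][w], i.e. the size-1 channel axis is already dropped.
    """
    n_c, n_t = len(feat), len(feat[0])
    n_h, n_w = len(feat[0][0]), len(feat[0][0][0])
    out = [[[bias] * n_w for _ in range(n_h)] for _ in range(n_t)]
    for c in range(n_c):
        for t in range(n_t):
            for k in range(3):
                src = t + k - 1               # temporal neighbor; zero padded
                if 0 <= src < n_t:
                    for y in range(n_h):
                        for x in range(n_w):
                            out[t][y][x] += w[c][k] * feat[c][src][y][x]
    return out


def merge_time(frames, w, bias=0.0):
    """1x1 2D convolution treating the T frames as input channels -> 1 output."""
    n_h, n_w = len(frames[0]), len(frames[0][0])
    return [[bias + sum(w[t] * frames[t][y][x] for t in range(len(frames)))
             for x in range(n_w)] for y in range(n_h)]


def enhance(feat, clip, w_channel, w_time):
    """Add the predicted residual to the middle frame of clip[t][h][w]."""
    residual = merge_time(merge_channels(feat, w_channel), w_time)
    middle = clip[len(clip) // 2]
    return [[m + r for m, r in zip(m_row, r_row)]
            for m_row, r_row in zip(middle, residual)]


# Toy usage: 2 feature channels, 7 frames, 2x2 pixels.
feat = [[[[1.0] * 2 for _ in range(2)] for _ in range(7)] for _ in range(2)]
clip = [[[float(t)] * 2 for _ in range(2)] for t in range(7)]
w_channel = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # placeholder weights
w_time = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]     # placeholder weights
print(enhance(feat, clip, w_channel, w_time))    # [[5.0, 5.0], [5.0, 5.0]]
```

Because the network output is added to the compressed middle frame, the layers only need to predict the correction rather than the full frame.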
[0024] (4) For S4: Train the spatiotemporal information pre-extraction network and the spatiotemporal information fusion network end-to-end.
[0025] During training, 128×128 video frames were randomly cropped from both the original and compressed videos as training samples. Data augmentation was performed by rotating or flipping the samples, and the Adam optimizer was used to train all models. A separate model was trained and tested at each of the four QP values (22, 27, 32, 37). When QP was 37, the learning rate was set to 0.0005 and remained constant throughout training. When QP was 22, 27, or 32, the learning rate was set to 0.0003 and remained constant throughout training. The batch size for network training was set to 16, with a total of 300,000 iterations. The total loss function was set to the sum of the squared errors between the enhanced target frame and the corresponding uncompressed original video frame. After training, the models were tested on 18 test videos. The test results are shown in Table 1. Class A to Class E represent video frame resolutions of 2560×1600, 1920×1080, 832×480, 416×240, and 1280×720, respectively. Based on the experimental data analysis in Table 1, the video frames enhanced by the method of this invention show improvements in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
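The loss and the ΔPSNR metric can be written out as a short sketch. The hyperparameters in `TRAIN_CONFIG` are taken from the paragraph above; the dictionary layout, the function names, and the peak value of 255 (8-bit video) are assumptions. PSNR here uses its usual definition, 10·log10(peak² / MSE), and ΔPSNR is the PSNR of the enhanced frame minus the PSNR of the compressed frame, both measured against the uncompressed frame.

```python
import math

# Hyperparameters as stated in the text; the dictionary layout is hypothetical.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "batch_size": 16,
    "iterations": 300_000,
    "patch_size": 128,
    "learning_rate": {37: 5e-4, 32: 3e-4, 27: 3e-4, 22: 3e-4},  # per QP, constant
}


def sse(pred, target):
    """Sum of squared errors between two frames given as [h][w] lists."""
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row))


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB (assumes 8-bit samples by default)."""
    mse = sse(pred, target) / (len(target) * len(target[0]))
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def delta_psnr(enhanced, compressed, raw, peak=255.0):
    """PSNR gain of the enhanced frame over the compressed frame."""
    return psnr(enhanced, raw, peak) - psnr(compressed, raw, peak)


# Toy usage: halving the error per pixel quarters the MSE.
raw = [[100, 100], [100, 100]]
compressed = [[104, 104], [104, 104]]
enhanced = [[102, 102], [102, 102]]
print(sse(compressed, raw))                              # 64
print(round(delta_psnr(enhanced, compressed, raw), 2))   # 6.02
```

A positive ΔPSNR means the enhanced frame is closer to the uncompressed frame than the compressed frame was.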
[0026] Table 1: ΔPSNR / ΔSSIM on test videos at 4 different QPs
[0027]
[0028]
[0029] The above specific embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Those skilled in the art should understand that the above embodiments do not limit the present invention in any way, and all similar technical solutions obtained by equivalent substitution or equivalent transformation are within the protection scope of the present invention.
Claims
1. A method for enhancing the quality of compressed video based on neural networks, characterized in that it includes the following steps:
First: the compressed video dataset for training and testing;
1.1 Collect multiple uncompressed videos with different resolutions and contents from the database for training; several uncompressed videos released by the video coding collaboration team are selected for video quality assessment.
1.2 Compress all the above videos using the H.265 / HEVC reference software HM16.5. The compression is performed at four different quantization parameters (QPs): 37, 32, 27, and 22.
1.3 The luminance channels (Y components) of the uncompressed and compressed video frames are converted into lmdb format for use as the training dataset. The video frames are randomly flipped and rotated to augment the data. The compressed video frames and their corresponding uncompressed video frames are cut into 128×128 sub-video frames to form video frame pairs for training. The test videos are not converted; the Y components of the uncompressed and compressed video frames of 18 test videos are directly extracted for testing.
Second: Spatiotemporal information pre-extraction network;
2.1 The network is constructed based on the "Encoder-Decoder" concept, which includes a pre-extraction part, an encoding part and a decoding part. 3D convolutional layers are used to encode and decode the feature map in the spatiotemporal dimension, and at the same time, the low-level features and deep features of the feature map are extracted in the spatiotemporal dimension.
2.2 The pre-extraction part is set as a 3D convolutional layer with a kernel of 7×5×5. The network is input with 7 consecutive video frames each time. The first dimension of the 3D convolutional kernel is set to 7 so that it can extract the information of the consecutive video frames in the time dimension. The second and third dimensions of the 3D convolutional kernel are set to 5 so that it can extract the information of the video frames in the spatial dimension.
2.3 The encoding part incorporates three downsampling operations. The spatiotemporal feature maps extracted from the pre-extraction part are spatially downsampled. After each downsampling operation, the height and width of the feature map are reduced by half. The downsampling operation is used to extract low-level features from the input frame and increase the receptive field of the convolutional kernels. Specifically, each downsampling module consists of two 3D convolutional layers with 3×3×3 kernels. The stride of the first convolution operation is set to (1,2,2), reducing the spatial size of the feature map while maintaining its temporal size. The stride of the second convolution operation is (1,1,1), used to extract the feature information obtained after downsampling in the first convolution operation.
2.4 The decoding section incorporates three upsampling operations and two skip connections; the upsampling operations restore the abstract features to the size of the input video frame. By merging channels through skip connections, features with the same spatial resolution are connected from the bottom layer to the deep layer, allowing local information of shallow features to reach the output. The specific implementation of upsampling operation and skip connections is as follows: set 3 transposed convolutional layers with 3×4×4 kernels, and set the stride of the convolution operation to (1,2,2). After the first and second transposed convolutions, the feature maps obtained after the third and second downsampling are merged through skip connections. After each transposed convolutional layer, a 3D convolutional layer with 3×3×3 kernel is set to extract information from the merged feature maps.
Third: Spatiotemporal information fusion network;
3.1 The spatiotemporal information fusion network module mines the spatiotemporal information extracted from the spatiotemporal information pre-extraction network, including 5 spatiotemporal decomposition and fusion modules;
3.2 The key to the spatiotemporal decomposition and fusion module is to segment the feature maps of the input continuous video frames in the time domain, then use 2D convolutional layers to extract information from each frame feature map separately in the spatial domain, and finally reassemble the segmented feature maps in the time domain.
3.3 The spatiotemporal decomposition and fusion module is set up in the following order: 1 3D convolutional layer with a kernel of 1×1×1, 1 normalization layer, 3 2D convolutional layers with a kernel of 3×3, 1 channel attention module, 1 3D convolutional layer with a kernel of 3×1×1, and 1 3D convolutional layer with a kernel of 1×1×1.
3.4 The channel attention module is set up in sequence as one adaptive pooling layer and one 2D convolutional layer with a 1×1 kernel. For each channel, the average value of all its elements is adaptively calculated as the feature importance score of that channel. The score features are extracted through the 1×1 2D convolutional layer. The obtained score features are multiplied with the elements of the input feature map to obtain the weighted feature map.
3.5 The specific implementation of the spatiotemporal decomposition and fusion module is as follows: After the feature map passes through the first convolutional layer, the feature map is divided into 7 parts in the time dimension; each feature map first passes through a normalization layer to stabilize the training process, then through 3 2D convolutional layers with 3×3 kernels, and a channel attention module to enhance the information of each video frame separately in the spatial domain. Finally, the 7 segmented feature maps are stitched together in the time dimension, and the spatiotemporal information is fused in the spatial domain through a 3D convolutional layer with 3×1×1 kernel. Finally, the enhanced information is output through a 3D convolutional layer with 1×1×1 kernel. Each spatiotemporal decomposition and fusion module is connected end to end.